2023-10-08 14:50:09

by Tianyi Liu

Subject: [PATCH v2 0/5] perf: KVM: Enable callchains for guests

Hi there,

This series of patches enables callchains for guests (used by perf kvm),
which holds the top spot on the perf wiki TODO list [1]. This allows users
to perform callchain and performance analysis of the guest OS from the
outside, using PMU events.

The event processing flow is as follows (shown as backtrace):
#0 kvm_arch_vcpu_get_frame_pointer / kvm_arch_vcpu_read_virt (per arch)
#1 kvm_guest_get_frame_pointer / kvm_guest_read_virt
<callback function pointers in `struct perf_guest_info_callbacks`>
#2 perf_guest_get_frame_pointer / perf_guest_read_virt
#3 perf_callchain_guest
#4 get_perf_callchain
#5 perf_callchain

Between #0 and #1 is the interface between KVM and the arch-specific
impl, while between #1 and #2 is the interface between Perf and KVM.
The 1st patch implements #0. The 2nd patch extends the interfaces between
#1 and #2, while the 3rd patch implements #1. The 4th patch implements #3
and modifies #4 and #5. The last patch is for the userspace utils.

Since arm64 does not yet provide some foundational infrastructure (an
interface for reading from a guest virtual address), the arm64
implementation is stubbed for now because it's a bit complex, and will
be implemented later.

Tested with both 32-bit and 64-bit guest operating systems / unikernels,
verifying that `perf script` correctly shows their callchains.
FlameGraphs can also be generated with this series of patches and [2].

Any feedback will be greatly appreciated.

[1] https://perf.wiki.kernel.org/index.php/Todo
[2] https://github.com/brendangregg/FlameGraph

v1:
https://lore.kernel.org/kvm/SYYP282MB108686A73C0F896D90D246569DE5A@SYYP282MB1086.AUSP282.PROD.OUTLOOK.COM/

Changes since v1:
- v1 only includes partial KVM modifications, while v2 is a complete
implementation. Also updated based on Sean's feedback.

Tianyi Liu (5):
KVM: Add arch specific interfaces for sampling guest callchains
perf kvm: Introduce guest interfaces for sampling callchains
KVM: implement new perf interfaces
perf kvm: Support sampling guest callchains
perf tools: Support PERF_CONTEXT_GUEST_* flags

arch/arm64/kvm/arm.c | 17 +++++++++
arch/x86/events/core.c | 56 +++++++++++++++++++++++------
arch/x86/kvm/x86.c | 18 ++++++++++
include/linux/kvm_host.h | 4 +++
include/linux/perf_event.h | 18 +++++++++-
kernel/bpf/stackmap.c | 8 ++---
kernel/events/callchain.c | 27 +++++++++++++-
kernel/events/core.c | 17 ++++++++-
tools/perf/builtin-timechart.c | 6 ++++
tools/perf/util/data-convert-json.c | 6 ++++
tools/perf/util/machine.c | 6 ++++
virt/kvm/kvm_main.c | 25 +++++++++++++
12 files changed, 191 insertions(+), 17 deletions(-)


base-commit: 8a749fd1a8720d4619c91c8b6e7528c0a355c0aa
--
2.42.0


2023-10-08 14:53:26

by Tianyi Liu

Subject: [PATCH v2 1/5] KVM: Add arch specific interfaces for sampling guest callchains

This patch adds three architecture-specific interfaces and their x86
implementations, used by `perf kvm`:

- kvm_arch_vcpu_get_frame_pointer: Return the frame pointer of the vcpu;
for x86 it's RBP, and for arm64 it's x29.

- kvm_arch_vcpu_read_virt: Read data from a virtual address
of the given guest vm.

- kvm_arch_vcpu_is_64bit: Return whether the vcpu is working in 64-bit
mode. It's used for determining the size of a stack frame.

Since arm64 does not yet provide some foundational infrastructure,
the arm64 implementation is stubbed for now because it's a bit complex.

Signed-off-by: Tianyi Liu <[email protected]>
---
arch/arm64/kvm/arm.c | 17 +++++++++++++++++
arch/x86/kvm/x86.c | 18 ++++++++++++++++++
include/linux/kvm_host.h | 4 ++++
3 files changed, 39 insertions(+)

diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index 4866b3f7b..b57b88c58 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -571,6 +571,23 @@ unsigned long kvm_arch_vcpu_get_ip(struct kvm_vcpu *vcpu)
{
return *vcpu_pc(vcpu);
}
+
+unsigned long kvm_arch_vcpu_get_frame_pointer(struct kvm_vcpu *vcpu)
+{
+ /* TODO: implement */
+ return NULL;
+}
+
+bool kvm_arch_vcpu_read_virt(struct kvm_vcpu *vcpu, void *addr, void *dest, unsigned int length)
+{
+ /* TODO: implement */
+ return false;
+}
+
+bool kvm_arch_vcpu_is_64bit(struct kvm_vcpu *vcpu)
+{
+ return !vcpu_mode_is_32bit(vcpu);
+}
#endif

static int kvm_vcpu_initialized(struct kvm_vcpu *vcpu)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 9f18b06bb..17dea02b7 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -12904,6 +12904,24 @@ unsigned long kvm_arch_vcpu_get_ip(struct kvm_vcpu *vcpu)
return kvm_rip_read(vcpu);
}

+unsigned long kvm_arch_vcpu_get_frame_pointer(struct kvm_vcpu *vcpu)
+{
+ return kvm_register_read_raw(vcpu, VCPU_REGS_RBP);
+}
+
+bool kvm_arch_vcpu_read_virt(struct kvm_vcpu *vcpu, void *addr, void *dest, unsigned int length)
+{
+ struct x86_exception e;
+
+ /* Return true on success */
+ return kvm_read_guest_virt(vcpu, addr, dest, length, &e) == X86EMUL_CONTINUE;
+}
+
+bool kvm_arch_vcpu_is_64bit(struct kvm_vcpu *vcpu)
+{
+ return is_64_bit_mode(vcpu);
+}
+
int kvm_arch_vcpu_should_kick(struct kvm_vcpu *vcpu)
{
return kvm_vcpu_exiting_guest_mode(vcpu) == IN_GUEST_MODE;
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index fb6c6109f..f92f1a9c8 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1595,6 +1595,10 @@ static inline bool kvm_arch_intc_initialized(struct kvm *kvm)

#ifdef CONFIG_GUEST_PERF_EVENTS
unsigned long kvm_arch_vcpu_get_ip(struct kvm_vcpu *vcpu);
+unsigned long kvm_arch_vcpu_get_frame_pointer(struct kvm_vcpu *vcpu);
+bool kvm_arch_vcpu_read_virt(struct kvm_vcpu *vcpu, void *addr, void *dest,
+ unsigned int length);
+bool kvm_arch_vcpu_is_64bit(struct kvm_vcpu *vcpu);

void kvm_register_perf_callbacks(unsigned int (*pt_intr_handler)(void));
void kvm_unregister_perf_callbacks(void);
--
2.42.0

2023-10-08 14:54:47

by Tianyi Liu

Subject: [PATCH v2 2/5] perf kvm: Introduce guest interfaces for sampling callchains

This patch introduces two callback interfaces used between perf and KVM:

- get_frame_pointer: Return the frame pointer of the running vm.

- read_virt: Read data from a virtual address of the running vm,
used for reading the stack frames from the guest.

This also introduces a new flag, `PERF_GUEST_64BIT`, returned by the
`.state` callback, which indicates whether the vm is running in
64-bit mode.

Signed-off-by: Tianyi Liu <[email protected]>
---
include/linux/perf_event.h | 15 +++++++++++++++
kernel/events/core.c | 10 ++++++++++
2 files changed, 25 insertions(+)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index e85cd1c0e..d0f937a62 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -28,10 +28,13 @@

#define PERF_GUEST_ACTIVE 0x01
#define PERF_GUEST_USER 0x02
+#define PERF_GUEST_64BIT 0x04

struct perf_guest_info_callbacks {
unsigned int (*state)(void);
unsigned long (*get_ip)(void);
+ unsigned long (*get_frame_pointer)(void);
+ bool (*read_virt)(void *addr, void *dest, unsigned int len);
unsigned int (*handle_intel_pt_intr)(void);
};

@@ -1495,6 +1498,8 @@ extern struct perf_guest_info_callbacks __rcu *perf_guest_cbs;

DECLARE_STATIC_CALL(__perf_guest_state, *perf_guest_cbs->state);
DECLARE_STATIC_CALL(__perf_guest_get_ip, *perf_guest_cbs->get_ip);
+DECLARE_STATIC_CALL(__perf_guest_get_frame_pointer, *perf_guest_cbs->get_frame_pointer);
+DECLARE_STATIC_CALL(__perf_guest_read_virt, *perf_guest_cbs->read_virt);
DECLARE_STATIC_CALL(__perf_guest_handle_intel_pt_intr, *perf_guest_cbs->handle_intel_pt_intr);

static inline unsigned int perf_guest_state(void)
@@ -1505,6 +1510,14 @@ static inline unsigned long perf_guest_get_ip(void)
{
return static_call(__perf_guest_get_ip)();
}
+static inline unsigned long perf_guest_get_frame_pointer(void)
+{
+ return static_call(__perf_guest_get_frame_pointer)();
+}
+static inline bool perf_guest_read_virt(void *addr, void *dest, unsigned int length)
+{
+ return static_call(__perf_guest_read_virt)(addr, dest, length);
+}
static inline unsigned int perf_guest_handle_intel_pt_intr(void)
{
return static_call(__perf_guest_handle_intel_pt_intr)();
@@ -1514,6 +1527,8 @@ extern void perf_unregister_guest_info_callbacks(struct perf_guest_info_callback
#else
static inline unsigned int perf_guest_state(void) { return 0; }
static inline unsigned long perf_guest_get_ip(void) { return 0; }
+static inline unsigned long perf_guest_get_frame_pointer(void) { return 0; }
+static inline bool perf_guest_read_virt(void*, void*, unsigned int) { return 0; }
static inline unsigned int perf_guest_handle_intel_pt_intr(void) { return 0; }
#endif /* CONFIG_GUEST_PERF_EVENTS */

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 4c72a41f1..eaba00ec2 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -6759,6 +6759,8 @@ struct perf_guest_info_callbacks __rcu *perf_guest_cbs;

DEFINE_STATIC_CALL_RET0(__perf_guest_state, *perf_guest_cbs->state);
DEFINE_STATIC_CALL_RET0(__perf_guest_get_ip, *perf_guest_cbs->get_ip);
+DEFINE_STATIC_CALL_RET0(__perf_guest_get_frame_pointer, *perf_guest_cbs->get_frame_pointer);
+DEFINE_STATIC_CALL_RET0(__perf_guest_read_virt, *perf_guest_cbs->read_virt);
DEFINE_STATIC_CALL_RET0(__perf_guest_handle_intel_pt_intr, *perf_guest_cbs->handle_intel_pt_intr);

void perf_register_guest_info_callbacks(struct perf_guest_info_callbacks *cbs)
@@ -6770,6 +6772,12 @@ void perf_register_guest_info_callbacks(struct perf_guest_info_callbacks *cbs)
static_call_update(__perf_guest_state, cbs->state);
static_call_update(__perf_guest_get_ip, cbs->get_ip);

+ if (cbs->get_frame_pointer)
+ static_call_update(__perf_guest_get_frame_pointer, cbs->get_frame_pointer);
+
+ if (cbs->read_virt)
+ static_call_update(__perf_guest_read_virt, cbs->read_virt);
+
/* Implementing ->handle_intel_pt_intr is optional. */
if (cbs->handle_intel_pt_intr)
static_call_update(__perf_guest_handle_intel_pt_intr,
@@ -6785,6 +6793,8 @@ void perf_unregister_guest_info_callbacks(struct perf_guest_info_callbacks *cbs)
rcu_assign_pointer(perf_guest_cbs, NULL);
static_call_update(__perf_guest_state, (void *)&__static_call_return0);
static_call_update(__perf_guest_get_ip, (void *)&__static_call_return0);
+ static_call_update(__perf_guest_get_frame_pointer, (void *)&__static_call_return0);
+ static_call_update(__perf_guest_read_virt, (void *)&__static_call_return0);
static_call_update(__perf_guest_handle_intel_pt_intr,
(void *)&__static_call_return0);
synchronize_rcu();
--
2.42.0

2023-10-08 14:56:59

by Tianyi Liu

Subject: [PATCH v2 3/5] KVM: implement new perf interfaces

This patch provides KVM implementations for the following two new perf
interfaces, simply redirecting them to the arch-specific implementations:

- get_frame_pointer: Return the frame pointer of the running vm.
- read_virt: Read data from a virtual address of the running vm.

Signed-off-by: Tianyi Liu <[email protected]>
---
virt/kvm/kvm_main.c | 25 +++++++++++++++++++++++++
1 file changed, 25 insertions(+)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 486800a70..6fd6ee6c0 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -6018,6 +6018,9 @@ static unsigned int kvm_guest_state(void)
if (!kvm_arch_vcpu_in_kernel(vcpu))
state |= PERF_GUEST_USER;

+ if (kvm_arch_vcpu_is_64bit(vcpu))
+ state |= PERF_GUEST_64BIT;
+
return state;
}

@@ -6032,9 +6035,31 @@ static unsigned long kvm_guest_get_ip(void)
return kvm_arch_vcpu_get_ip(vcpu);
}

+static unsigned long kvm_guest_get_frame_pointer(void)
+{
+ struct kvm_vcpu *vcpu = kvm_get_running_vcpu();
+
+ if (WARN_ON_ONCE(!kvm_arch_pmi_in_guest(vcpu)))
+ return 0;
+
+ return kvm_arch_vcpu_get_frame_pointer(vcpu);
+}
+
+static bool kvm_guest_read_virt(void *addr, void *dest, unsigned int length)
+{
+ struct kvm_vcpu *vcpu = kvm_get_running_vcpu();
+
+ if (WARN_ON_ONCE(!kvm_arch_pmi_in_guest(vcpu)))
+ return false;
+
+ return kvm_arch_vcpu_read_virt(vcpu, addr, dest, length);
+}
+
static struct perf_guest_info_callbacks kvm_guest_cbs = {
.state = kvm_guest_state,
.get_ip = kvm_guest_get_ip,
+ .get_frame_pointer = kvm_guest_get_frame_pointer,
+ .read_virt = kvm_guest_read_virt,
.handle_intel_pt_intr = NULL,
};

--
2.42.0

2023-10-08 14:57:46

by Tianyi Liu

Subject: [PATCH v2 4/5] perf kvm: Support sampling guest callchains

This patch provides support for sampling guests' callchains.

The signature of `get_perf_callchain` has been modified to explicitly
specify whether it needs to sample the host or guest callchain.
Based on the context, it will distribute the sampling request to one of
`perf_callchain_user`, `perf_callchain_kernel`, or `perf_callchain_guest`.

The reason for separately implementing `perf_callchain_user` and
`perf_callchain_kernel` is that the kernel may utilize special unwinders
such as `ORC`. However, for the guest, we only support stackframe-based
unwinding, so the implementation is generic and only needs to be
separately implemented for 32-bit and 64-bit.
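
For instance, with `add_mark` set, a sample taken in guest kernel mode
produces a callchain that begins with a context marker, roughly laid out
as follows (illustrative):

	PERF_CONTEXT_GUEST_KERNEL
	<guest instruction pointer>
	<return address #1>
	<return address #2>
	...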

Signed-off-by: Tianyi Liu <[email protected]>
---
arch/x86/events/core.c | 56 +++++++++++++++++++++++++++++++-------
include/linux/perf_event.h | 3 +-
kernel/bpf/stackmap.c | 8 +++---
kernel/events/callchain.c | 27 +++++++++++++++++-
kernel/events/core.c | 7 ++++-
5 files changed, 84 insertions(+), 17 deletions(-)

diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index 185f902e5..ea4c86175 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -2758,11 +2758,6 @@ perf_callchain_kernel(struct perf_callchain_entry_ctx *entry, struct pt_regs *re
struct unwind_state state;
unsigned long addr;

- if (perf_guest_state()) {
- /* TODO: We don't support guest os callchain now */
- return;
- }
-
if (perf_callchain_store(entry, regs->ip))
return;

@@ -2778,6 +2773,52 @@ perf_callchain_kernel(struct perf_callchain_entry_ctx *entry, struct pt_regs *re
}
}

+static inline void
+perf_callchain_guest32(struct perf_callchain_entry_ctx *entry)
+{
+ struct stack_frame_ia32 frame;
+ const struct stack_frame_ia32 *fp;
+
+ fp = (void *)perf_guest_get_frame_pointer();
+ while (fp && entry->nr < entry->max_stack) {
+ if (!perf_guest_read_virt(&fp->next_frame, &frame.next_frame,
+ sizeof(frame.next_frame)))
+ break;
+ if (!perf_guest_read_virt(&fp->return_address, &frame.return_address,
+ sizeof(frame.return_address)))
+ break;
+ perf_callchain_store(entry, frame.return_address);
+ fp = (void *)frame.next_frame;
+ }
+}
+
+void
+perf_callchain_guest(struct perf_callchain_entry_ctx *entry)
+{
+ struct stack_frame frame;
+ const struct stack_frame *fp;
+ unsigned int guest_state;
+
+ guest_state = perf_guest_state();
+ perf_callchain_store(entry, perf_guest_get_ip());
+
+ if (guest_state & PERF_GUEST_64BIT) {
+ fp = (void *)perf_guest_get_frame_pointer();
+ while (fp && entry->nr < entry->max_stack) {
+ if (!perf_guest_read_virt(&fp->next_frame, &frame.next_frame,
+ sizeof(frame.next_frame)))
+ break;
+ if (!perf_guest_read_virt(&fp->return_address, &frame.return_address,
+ sizeof(frame.return_address)))
+ break;
+ perf_callchain_store(entry, frame.return_address);
+ fp = (void *)frame.next_frame;
+ }
+ } else {
+ perf_callchain_guest32(entry);
+ }
+}
+
static inline int
valid_user_frame(const void __user *fp, unsigned long size)
{
@@ -2861,11 +2902,6 @@ perf_callchain_user(struct perf_callchain_entry_ctx *entry, struct pt_regs *regs
struct stack_frame frame;
const struct stack_frame __user *fp;

- if (perf_guest_state()) {
- /* TODO: We don't support guest os callchain now */
- return;
- }
-
/*
* We don't know what to do with VM86 stacks.. ignore them for now.
*/
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index d0f937a62..a2baf4856 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -1545,9 +1545,10 @@ DECLARE_PER_CPU(struct perf_callchain_entry, perf_callchain_entry);

extern void perf_callchain_user(struct perf_callchain_entry_ctx *entry, struct pt_regs *regs);
extern void perf_callchain_kernel(struct perf_callchain_entry_ctx *entry, struct pt_regs *regs);
+extern void perf_callchain_guest(struct perf_callchain_entry_ctx *entry);
extern struct perf_callchain_entry *
get_perf_callchain(struct pt_regs *regs, u32 init_nr, bool kernel, bool user,
- u32 max_stack, bool crosstask, bool add_mark);
+ bool host, bool guest, u32 max_stack, bool crosstask, bool add_mark);
extern int get_callchain_buffers(int max_stack);
extern void put_callchain_buffers(void);
extern struct perf_callchain_entry *get_callchain_entry(int *rctx);
diff --git a/kernel/bpf/stackmap.c b/kernel/bpf/stackmap.c
index 458bb80b1..2e88d4639 100644
--- a/kernel/bpf/stackmap.c
+++ b/kernel/bpf/stackmap.c
@@ -294,8 +294,8 @@ BPF_CALL_3(bpf_get_stackid, struct pt_regs *, regs, struct bpf_map *, map,
if (max_depth > sysctl_perf_event_max_stack)
max_depth = sysctl_perf_event_max_stack;

- trace = get_perf_callchain(regs, 0, kernel, user, max_depth,
- false, false);
+ trace = get_perf_callchain(regs, 0, kernel, user, true, false,
+ max_depth, false, false);

if (unlikely(!trace))
/* couldn't fetch the stack trace */
@@ -420,8 +420,8 @@ static long __bpf_get_stack(struct pt_regs *regs, struct task_struct *task,
else if (kernel && task)
trace = get_callchain_entry_for_task(task, max_depth);
else
- trace = get_perf_callchain(regs, 0, kernel, user, max_depth,
- false, false);
+ trace = get_perf_callchain(regs, 0, kernel, user, true, false,
+ max_depth, false, false);
if (unlikely(!trace))
goto err_fault;

diff --git a/kernel/events/callchain.c b/kernel/events/callchain.c
index 1273be843..7e80729e9 100644
--- a/kernel/events/callchain.c
+++ b/kernel/events/callchain.c
@@ -45,6 +45,10 @@ __weak void perf_callchain_user(struct perf_callchain_entry_ctx *entry,
{
}

+__weak void perf_callchain_guest(struct perf_callchain_entry_ctx *entry)
+{
+}
+
static void release_callchain_buffers_rcu(struct rcu_head *head)
{
struct callchain_cpus_entries *entries;
@@ -178,11 +182,12 @@ put_callchain_entry(int rctx)

struct perf_callchain_entry *
get_perf_callchain(struct pt_regs *regs, u32 init_nr, bool kernel, bool user,
- u32 max_stack, bool crosstask, bool add_mark)
+ bool host, bool guest, u32 max_stack, bool crosstask, bool add_mark)
{
struct perf_callchain_entry *entry;
struct perf_callchain_entry_ctx ctx;
int rctx;
+ unsigned int guest_state;

entry = get_callchain_entry(&rctx);
if (!entry)
@@ -194,6 +199,26 @@ get_perf_callchain(struct pt_regs *regs, u32 init_nr, bool kernel, bool user,
ctx.contexts = 0;
ctx.contexts_maxed = false;

+ guest_state = perf_guest_state();
+ if (guest_state) {
+ if (!guest)
+ goto exit_put;
+ if (user && (guest_state & PERF_GUEST_USER)) {
+ if (add_mark)
+ perf_callchain_store_context(&ctx, PERF_CONTEXT_GUEST_USER);
+ perf_callchain_guest(&ctx);
+ }
+ if (kernel && !(guest_state & PERF_GUEST_USER)) {
+ if (add_mark)
+ perf_callchain_store_context(&ctx, PERF_CONTEXT_GUEST_KERNEL);
+ perf_callchain_guest(&ctx);
+ }
+ goto exit_put;
+ }
+
+ if (unlikely(!host))
+ goto exit_put;
+
if (kernel && !user_mode(regs)) {
if (add_mark)
perf_callchain_store_context(&ctx, PERF_CONTEXT_KERNEL);
diff --git a/kernel/events/core.c b/kernel/events/core.c
index eaba00ec2..b3401f403 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -7559,6 +7559,8 @@ perf_callchain(struct perf_event *event, struct pt_regs *regs)
{
bool kernel = !event->attr.exclude_callchain_kernel;
bool user = !event->attr.exclude_callchain_user;
+ bool host = !event->attr.exclude_host;
+ bool guest = !event->attr.exclude_guest;
/* Disallow cross-task user callchains. */
bool crosstask = event->ctx->task && event->ctx->task != current;
const u32 max_stack = event->attr.sample_max_stack;
@@ -7567,7 +7569,10 @@ perf_callchain(struct perf_event *event, struct pt_regs *regs)
if (!kernel && !user)
return &__empty_callchain;

- callchain = get_perf_callchain(regs, 0, kernel, user,
+ if (!host && !guest)
+ return &__empty_callchain;
+
+ callchain = get_perf_callchain(regs, 0, kernel, user, host, guest,
max_stack, crosstask, true);
return callchain ?: &__empty_callchain;
}
--
2.42.0

2023-10-08 14:58:30

by Tianyi Liu

Subject: [PATCH v2 5/5] perf tools: Support PERF_CONTEXT_GUEST_* flags

The `perf` util currently has an incomplete implementation for the
following event flags, such that events with these flags are dropped or
identified as unknown types:

`PERF_CONTEXT_GUEST_KERNEL`
`PERF_CONTEXT_GUEST_USER`

This patch makes `perf script`, `perf timechart` and `perf data`
correctly identify these flags.

Signed-off-by: Tianyi Liu <[email protected]>
---
tools/perf/builtin-timechart.c | 6 ++++++
tools/perf/util/data-convert-json.c | 6 ++++++
tools/perf/util/machine.c | 6 ++++++
3 files changed, 18 insertions(+)

diff --git a/tools/perf/builtin-timechart.c b/tools/perf/builtin-timechart.c
index 19d4542ea..6a368b6a3 100644
--- a/tools/perf/builtin-timechart.c
+++ b/tools/perf/builtin-timechart.c
@@ -536,6 +536,12 @@ static const char *cat_backtrace(union perf_event *event,
case PERF_CONTEXT_USER:
cpumode = PERF_RECORD_MISC_USER;
break;
+ case PERF_CONTEXT_GUEST_KERNEL:
+ cpumode = PERF_RECORD_MISC_GUEST_KERNEL;
+ break;
+ case PERF_CONTEXT_GUEST_USER:
+ cpumode = PERF_RECORD_MISC_GUEST_USER;
+ break;
default:
pr_debug("invalid callchain context: "
"%"PRId64"\n", (s64) ip);
diff --git a/tools/perf/util/data-convert-json.c b/tools/perf/util/data-convert-json.c
index 5bb3c2ba9..62686f78d 100644
--- a/tools/perf/util/data-convert-json.c
+++ b/tools/perf/util/data-convert-json.c
@@ -205,6 +205,12 @@ static int process_sample_event(struct perf_tool *tool,
case PERF_CONTEXT_USER:
cpumode = PERF_RECORD_MISC_USER;
break;
+ case PERF_CONTEXT_GUEST_KERNEL:
+ cpumode = PERF_RECORD_MISC_GUEST_KERNEL;
+ break;
+ case PERF_CONTEXT_GUEST_USER:
+ cpumode = PERF_RECORD_MISC_GUEST_USER;
+ break;
default:
pr_debug("invalid callchain context: %"
PRId64 "\n", (s64) ip);
diff --git a/tools/perf/util/machine.c b/tools/perf/util/machine.c
index 88f31b3a6..d88458c3b 100644
--- a/tools/perf/util/machine.c
+++ b/tools/perf/util/machine.c
@@ -2346,6 +2346,12 @@ static int add_callchain_ip(struct thread *thread,
case PERF_CONTEXT_USER:
*cpumode = PERF_RECORD_MISC_USER;
break;
+ case PERF_CONTEXT_GUEST_KERNEL:
+ *cpumode = PERF_RECORD_MISC_GUEST_KERNEL;
+ break;
+ case PERF_CONTEXT_GUEST_USER:
+ *cpumode = PERF_RECORD_MISC_GUEST_USER;
+ break;
default:
pr_debug("invalid callchain context: "
"%"PRId64"\n", (s64) ip);
--
2.42.0

2023-10-08 20:00:19

by kernel test robot

Subject: Re: [PATCH v2 4/5] perf kvm: Support sampling guest callchains

Hi Tianyi,

kernel test robot noticed the following build warnings:

[auto build test WARNING on 8a749fd1a8720d4619c91c8b6e7528c0a355c0aa]

url: https://github.com/intel-lab-lkp/linux/commits/Tianyi-Liu/KVM-Add-arch-specific-interfaces-for-sampling-guest-callchains/20231008-230042
base: 8a749fd1a8720d4619c91c8b6e7528c0a355c0aa
patch link: https://lore.kernel.org/r/SY4P282MB108433024762F1F292D47C2A9DCFA%40SY4P282MB1084.AUSP282.PROD.OUTLOOK.COM
patch subject: [PATCH v2 4/5] perf kvm: Support sampling guest callchains
config: i386-tinyconfig (https://download.01.org/0day-ci/archive/20231009/[email protected]/config)
compiler: gcc-12 (Debian 12.2.0-14) 12.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20231009/[email protected]/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <[email protected]>
| Closes: https://lore.kernel.org/oe-kbuild-all/[email protected]/

All warnings (new ones prefixed by >>):

arch/x86/events/core.c: In function 'perf_callchain_guest32':
>> arch/x86/events/core.c:2784:43: warning: passing argument 1 of 'perf_guest_read_virt' discards 'const' qualifier from pointer target type [-Wdiscarded-qualifiers]
2784 | if (!perf_guest_read_virt(&fp->next_frame, &frame.next_frame,
| ^~~~~~~~~~~~~~~
In file included from arch/x86/events/core.c:15:
include/linux/perf_event.h:1531:41: note: expected 'void *' but argument is of type 'const u32 *' {aka 'const unsigned int *'}
1531 | static inline bool perf_guest_read_virt(void*, void*, unsigned int) { return 0; }
| ^~~~~
arch/x86/events/core.c:2787:43: warning: passing argument 1 of 'perf_guest_read_virt' discards 'const' qualifier from pointer target type [-Wdiscarded-qualifiers]
2787 | if (!perf_guest_read_virt(&fp->return_address, &frame.return_address,
| ^~~~~~~~~~~~~~~~~~~
include/linux/perf_event.h:1531:41: note: expected 'void *' but argument is of type 'const u32 *' {aka 'const unsigned int *'}
1531 | static inline bool perf_guest_read_virt(void*, void*, unsigned int) { return 0; }
| ^~~~~
arch/x86/events/core.c: In function 'perf_callchain_guest':
arch/x86/events/core.c:2808:51: warning: passing argument 1 of 'perf_guest_read_virt' discards 'const' qualifier from pointer target type [-Wdiscarded-qualifiers]
2808 | if (!perf_guest_read_virt(&fp->next_frame, &frame.next_frame,
| ^~~~~~~~~~~~~~~
include/linux/perf_event.h:1531:41: note: expected 'void *' but argument is of type 'struct stack_frame * const*'
1531 | static inline bool perf_guest_read_virt(void*, void*, unsigned int) { return 0; }
| ^~~~~
arch/x86/events/core.c:2811:51: warning: passing argument 1 of 'perf_guest_read_virt' discards 'const' qualifier from pointer target type [-Wdiscarded-qualifiers]
2811 | if (!perf_guest_read_virt(&fp->return_address, &frame.return_address,
| ^~~~~~~~~~~~~~~~~~~
include/linux/perf_event.h:1531:41: note: expected 'void *' but argument is of type 'const long unsigned int *'
1531 | static inline bool perf_guest_read_virt(void*, void*, unsigned int) { return 0; }
| ^~~~~


vim +2784 arch/x86/events/core.c

2775
2776 static inline void
2777 perf_callchain_guest32(struct perf_callchain_entry_ctx *entry)
2778 {
2779 struct stack_frame_ia32 frame;
2780 const struct stack_frame_ia32 *fp;
2781
2782 fp = (void *)perf_guest_get_frame_pointer();
2783 while (fp && entry->nr < entry->max_stack) {
> 2784 if (!perf_guest_read_virt(&fp->next_frame, &frame.next_frame,
2785 sizeof(frame.next_frame)))
2786 break;
2787 if (!perf_guest_read_virt(&fp->return_address, &frame.return_address,
2788 sizeof(frame.return_address)))
2789 break;
2790 perf_callchain_store(entry, frame.return_address);
2791 fp = (void *)frame.next_frame;
2792 }
2793 }
2794

--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

2023-10-08 21:13:14

by kernel test robot

Subject: Re: [PATCH v2 1/5] KVM: Add arch specific interfaces for sampling guest callchains

Hi Tianyi,

kernel test robot noticed the following build warnings:

[auto build test WARNING on 8a749fd1a8720d4619c91c8b6e7528c0a355c0aa]

url: https://github.com/intel-lab-lkp/linux/commits/Tianyi-Liu/KVM-Add-arch-specific-interfaces-for-sampling-guest-callchains/20231008-230042
base: 8a749fd1a8720d4619c91c8b6e7528c0a355c0aa
patch link: https://lore.kernel.org/r/SY4P282MB10840154D4F09917D6528BC69DCFA%40SY4P282MB1084.AUSP282.PROD.OUTLOOK.COM
patch subject: [PATCH v2 1/5] KVM: Add arch specific interfaces for sampling guest callchains
config: arm64-randconfig-003-20231009 (https://download.01.org/0day-ci/archive/20231009/[email protected]/config)
compiler: aarch64-linux-gcc (GCC) 13.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20231009/[email protected]/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <[email protected]>
| Closes: https://lore.kernel.org/oe-kbuild-all/[email protected]/

All warnings (new ones prefixed by >>):

In file included from include/uapi/linux/posix_types.h:5,
from include/uapi/linux/types.h:14,
from include/linux/types.h:6,
from include/linux/kasan-checks.h:5,
from include/asm-generic/rwonce.h:26,
from arch/arm64/include/asm/rwonce.h:71,
from include/linux/compiler.h:246,
from include/asm-generic/bug.h:5,
from arch/arm64/include/asm/bug.h:26,
from include/linux/bug.h:5,
from arch/arm64/kvm/arm.c:7:
arch/arm64/kvm/arm.c: In function 'kvm_arch_vcpu_get_frame_pointer':
>> include/linux/stddef.h:8:14: warning: returning 'void *' from a function with return type 'long unsigned int' makes integer from pointer without a cast [-Wint-conversion]
8 | #define NULL ((void *)0)
| ^
arch/arm64/kvm/arm.c:578:16: note: in expansion of macro 'NULL'
578 | return NULL;
| ^~~~


vim +8 include/linux/stddef.h

^1da177e4c3f41 Linus Torvalds 2005-04-16 6
^1da177e4c3f41 Linus Torvalds 2005-04-16 7 #undef NULL
^1da177e4c3f41 Linus Torvalds 2005-04-16 @8 #define NULL ((void *)0)
6e218287432472 Richard Knutsson 2006-09-30 9

--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

2023-10-08 21:33:49

by kernel test robot

Subject: Re: [PATCH v2 1/5] KVM: Add arch specific interfaces for sampling guest callchains

Hi Tianyi,

kernel test robot noticed the following build warnings:

[auto build test WARNING on 8a749fd1a8720d4619c91c8b6e7528c0a355c0aa]

url: https://github.com/intel-lab-lkp/linux/commits/Tianyi-Liu/KVM-Add-arch-specific-interfaces-for-sampling-guest-callchains/20231008-230042
base: 8a749fd1a8720d4619c91c8b6e7528c0a355c0aa
patch link: https://lore.kernel.org/r/SY4P282MB10840154D4F09917D6528BC69DCFA%40SY4P282MB1084.AUSP282.PROD.OUTLOOK.COM
patch subject: [PATCH v2 1/5] KVM: Add arch specific interfaces for sampling guest callchains
config: x86_64-defconfig (https://download.01.org/0day-ci/archive/20231009/[email protected]/config)
compiler: gcc-11 (Debian 11.3.0-12) 11.3.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20231009/[email protected]/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <[email protected]>
| Closes: https://lore.kernel.org/oe-kbuild-all/[email protected]/

All warnings (new ones prefixed by >>):

arch/x86/kvm/x86.c: In function 'kvm_arch_vcpu_read_virt':
>> arch/x86/kvm/x86.c:12917:42: warning: passing argument 2 of 'kvm_read_guest_virt' makes integer from pointer without a cast [-Wint-conversion]
12917 | return kvm_read_guest_virt(vcpu, addr, dest, length, &e) == X86EMUL_CONTINUE;
| ^~~~
| |
| void *
arch/x86/kvm/x86.c:7388:38: note: expected 'gva_t' {aka 'long unsigned int'} but argument is of type 'void *'
7388 | gva_t addr, void *val, unsigned int bytes,
| ~~~~~~^~~~


vim +/kvm_read_guest_virt +12917 arch/x86/kvm/x86.c

12911
12912 bool kvm_arch_vcpu_read_virt(struct kvm_vcpu *vcpu, void *addr, void *dest, unsigned int length)
12913 {
12914 struct x86_exception e;
12915
12916 /* Return true on success */
12917 return kvm_read_guest_virt(vcpu, addr, dest, length, &e) == X86EMUL_CONTINUE;
12918 }
12919

--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

2023-10-08 22:25:56

by kernel test robot

Subject: Re: [PATCH v2 1/5] KVM: Add arch specific interfaces for sampling guest callchains

Hi Tianyi,

kernel test robot noticed the following build errors:

[auto build test ERROR on 8a749fd1a8720d4619c91c8b6e7528c0a355c0aa]

url: https://github.com/intel-lab-lkp/linux/commits/Tianyi-Liu/KVM-Add-arch-specific-interfaces-for-sampling-guest-callchains/20231008-230042
base: 8a749fd1a8720d4619c91c8b6e7528c0a355c0aa
patch link: https://lore.kernel.org/r/SY4P282MB10840154D4F09917D6528BC69DCFA%40SY4P282MB1084.AUSP282.PROD.OUTLOOK.COM
patch subject: [PATCH v2 1/5] KVM: Add arch specific interfaces for sampling guest callchains
config: x86_64-rhel-8.3-rust (https://download.01.org/0day-ci/archive/20231009/[email protected]/config)
compiler: clang version 16.0.4 (https://github.com/llvm/llvm-project.git ae42196bc493ffe877a7e3dff8be32035dea4d07)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20231009/[email protected]/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <[email protected]>
| Closes: https://lore.kernel.org/oe-kbuild-all/[email protected]/

All errors (new ones prefixed by >>):

>> arch/x86/kvm/x86.c:12917:35: error: incompatible pointer to integer conversion passing 'void *' to parameter of type 'gva_t' (aka 'unsigned long') [-Wint-conversion]
return kvm_read_guest_virt(vcpu, addr, dest, length, &e) == X86EMUL_CONTINUE;
^~~~
arch/x86/kvm/x86.c:7403:19: note: passing argument to parameter here
EXPORT_SYMBOL_GPL(kvm_read_guest_virt);
^
1 error generated.


vim +12917 arch/x86/kvm/x86.c

12911
12912 bool kvm_arch_vcpu_read_virt(struct kvm_vcpu *vcpu, void *addr, void *dest, unsigned int length)
12913 {
12914 struct x86_exception e;
12915
12916 /* Return true on success */
12917 return kvm_read_guest_virt(vcpu, addr, dest, length, &e) == X86EMUL_CONTINUE;
12918 }
12919

--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

2023-10-09 03:21:16

by Tianyi Liu

Subject: Re: [PATCH v2 1/5] KVM: Add arch specific interfaces for sampling guest callchains

> All warnings (new ones prefixed by >>):
>
> arch/x86/kvm/x86.c: In function 'kvm_arch_vcpu_read_virt':
> >> arch/x86/kvm/x86.c:12917:42: warning: passing argument 2 of 'kvm_read_guest_virt' makes integer from pointer without a cast [-Wint-conversion]
> 12917 | return kvm_read_guest_virt(vcpu, addr, dest, length, &e) == X86EMUL_CONTINUE;
> | ^~~~
> | |
> | void *
> arch/x86/kvm/x86.c:7388:38: note: expected 'gva_t' {aka 'long unsigned int'} but argument is of type 'void *'
> 7388 | gva_t addr, void *val, unsigned int bytes,
> | ~~~~~~^~~~
>

Terribly sorry for the build warnings, which are caused by a missing
type cast. This will be fixed in the next version.
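
The fix should just be a cast to `gva_t` at the call site, e.g.
(untested sketch):

bool kvm_arch_vcpu_read_virt(struct kvm_vcpu *vcpu, void *addr, void *dest,
			     unsigned int length)
{
	struct x86_exception e;

	/* kvm_read_guest_virt() takes a gva_t (unsigned long), not a
	 * pointer, hence the explicit cast. Return true on success.
	 */
	return kvm_read_guest_virt(vcpu, (gva_t)addr, dest, length, &e) ==
	       X86EMUL_CONTINUE;
}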

2023-10-10 16:13:35

by Maxim Levitsky

Subject: Re: [PATCH v2 4/5] perf kvm: Support sampling guest callchains

On Sun, 2023-10-08 at 22:57 +0800, Tianyi Liu wrote:
> This patch provides support for sampling guests' callchains.
>
> The signature of `get_perf_callchain` has been modified to explicitly
> specify whether it needs to sample the host or guest callchain.
> Based on the context, it will distribute the sampling request to one of
> `perf_callchain_user`, `perf_callchain_kernel`, or `perf_callchain_guest`.
>
> The reason for separately implementing `perf_callchain_user` and
> `perf_callchain_kernel` is that the kernel may utilize special unwinders
> such as `ORC`. However, for the guest, we only support stackframe-based
> unwinding, so the implementation is generic and only needs to be
> separately implemented for 32-bit and 64-bit.
>
> Signed-off-by: Tianyi Liu <[email protected]>
> ---
> arch/x86/events/core.c | 56 +++++++++++++++++++++++++++++++-------
> include/linux/perf_event.h | 3 +-
> kernel/bpf/stackmap.c | 8 +++---
> kernel/events/callchain.c | 27 +++++++++++++++++-
> kernel/events/core.c | 7 ++++-
> 5 files changed, 84 insertions(+), 17 deletions(-)
>
> diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
> index 185f902e5..ea4c86175 100644
> --- a/arch/x86/events/core.c
> +++ b/arch/x86/events/core.c
> @@ -2758,11 +2758,6 @@ perf_callchain_kernel(struct perf_callchain_entry_ctx *entry, struct pt_regs *re
> struct unwind_state state;
> unsigned long addr;
>
> - if (perf_guest_state()) {
> - /* TODO: We don't support guest os callchain now */
> - return;
> - }
> -
> if (perf_callchain_store(entry, regs->ip))
> return;
>
> @@ -2778,6 +2773,52 @@ perf_callchain_kernel(struct perf_callchain_entry_ctx *entry, struct pt_regs *re
> }
> }
>
> +static inline void
> +perf_callchain_guest32(struct perf_callchain_entry_ctx *entry)
> +{
> + struct stack_frame_ia32 frame;
> + const struct stack_frame_ia32 *fp;
> +
> + fp = (void *)perf_guest_get_frame_pointer();
> + while (fp && entry->nr < entry->max_stack) {
> + if (!perf_guest_read_virt(&fp->next_frame, &frame.next_frame,
This should be fp->next_frame.
> + sizeof(frame.next_frame)))
> + break;
> + if (!perf_guest_read_virt(&fp->return_address, &frame.return_address,
Same here.
> + sizeof(frame.return_address)))
> + break;
> + perf_callchain_store(entry, frame.return_address);
> + fp = (void *)frame.next_frame;
> + }
> +}
> +
> +void
> +perf_callchain_guest(struct perf_callchain_entry_ctx *entry)
> +{
> + struct stack_frame frame;
> + const struct stack_frame *fp;
> + unsigned int guest_state;
> +
> + guest_state = perf_guest_state();
> + perf_callchain_store(entry, perf_guest_get_ip());
> +
> + if (guest_state & PERF_GUEST_64BIT) {
> + fp = (void *)perf_guest_get_frame_pointer();
> + while (fp && entry->nr < entry->max_stack) {
> + if (!perf_guest_read_virt(&fp->next_frame, &frame.next_frame,
Same here.
> + sizeof(frame.next_frame)))
> + break;
> + if (!perf_guest_read_virt(&fp->return_address, &frame.return_address,
And here.

> + sizeof(frame.return_address)))
> + break;
> + perf_callchain_store(entry, frame.return_address);
> + fp = (void *)frame.next_frame;
> + }
> + } else {
> + perf_callchain_guest32(entry);
> + }
> +}

For symmetry, maybe it makes sense to have perf_callchain_guest32 and perf_callchain_guest64
and then make perf_callchain_guest call each? No strong opinion on this of course.
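
Something like this (untested, just splitting the existing 64-bit loop
out into a helper):

static inline void
perf_callchain_guest64(struct perf_callchain_entry_ctx *entry)
{
	struct stack_frame frame;
	const struct stack_frame *fp;

	/* Same walk as the 32-bit variant, with the 64-bit frame layout. */
	fp = (void *)perf_guest_get_frame_pointer();
	while (fp && entry->nr < entry->max_stack) {
		if (!perf_guest_read_virt(&fp->next_frame, &frame.next_frame,
					  sizeof(frame.next_frame)))
			break;
		if (!perf_guest_read_virt(&fp->return_address, &frame.return_address,
					  sizeof(frame.return_address)))
			break;
		perf_callchain_store(entry, frame.return_address);
		fp = (void *)frame.next_frame;
	}
}

void
perf_callchain_guest(struct perf_callchain_entry_ctx *entry)
{
	perf_callchain_store(entry, perf_guest_get_ip());

	if (perf_guest_state() & PERF_GUEST_64BIT)
		perf_callchain_guest64(entry);
	else
		perf_callchain_guest32(entry);
}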


> +
> static inline int
> valid_user_frame(const void __user *fp, unsigned long size)
> {
> @@ -2861,11 +2902,6 @@ perf_callchain_user(struct perf_callchain_entry_ctx *entry, struct pt_regs *regs
> struct stack_frame frame;
> const struct stack_frame __user *fp;
>
> - if (perf_guest_state()) {
> - /* TODO: We don't support guest os callchain now */
> - return;
> - }
> -
> /*
> * We don't know what to do with VM86 stacks.. ignore them for now.
> */
> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
> index d0f937a62..a2baf4856 100644
> --- a/include/linux/perf_event.h
> +++ b/include/linux/perf_event.h
> @@ -1545,9 +1545,10 @@ DECLARE_PER_CPU(struct perf_callchain_entry, perf_callchain_entry);
>
> extern void perf_callchain_user(struct perf_callchain_entry_ctx *entry, struct pt_regs *regs);
> extern void perf_callchain_kernel(struct perf_callchain_entry_ctx *entry, struct pt_regs *regs);
> +extern void perf_callchain_guest(struct perf_callchain_entry_ctx *entry);
> extern struct perf_callchain_entry *
> get_perf_callchain(struct pt_regs *regs, u32 init_nr, bool kernel, bool user,
> - u32 max_stack, bool crosstask, bool add_mark);
> + bool host, bool guest, u32 max_stack, bool crosstask, bool add_mark);
> extern int get_callchain_buffers(int max_stack);
> extern void put_callchain_buffers(void);
> extern struct perf_callchain_entry *get_callchain_entry(int *rctx);
> diff --git a/kernel/bpf/stackmap.c b/kernel/bpf/stackmap.c
> index 458bb80b1..2e88d4639 100644
> --- a/kernel/bpf/stackmap.c
> +++ b/kernel/bpf/stackmap.c
> @@ -294,8 +294,8 @@ BPF_CALL_3(bpf_get_stackid, struct pt_regs *, regs, struct bpf_map *, map,
> if (max_depth > sysctl_perf_event_max_stack)
> max_depth = sysctl_perf_event_max_stack;
>
> - trace = get_perf_callchain(regs, 0, kernel, user, max_depth,
> - false, false);
> + trace = get_perf_callchain(regs, 0, kernel, user, true, false,
> + max_depth, false, false);
>
> if (unlikely(!trace))
> /* couldn't fetch the stack trace */
> @@ -420,8 +420,8 @@ static long __bpf_get_stack(struct pt_regs *regs, struct task_struct *task,
> else if (kernel && task)
> trace = get_callchain_entry_for_task(task, max_depth);
> else
> - trace = get_perf_callchain(regs, 0, kernel, user, max_depth,
> - false, false);
> + trace = get_perf_callchain(regs, 0, kernel, user, true, false,
> + max_depth, false, false);
> if (unlikely(!trace))
> goto err_fault;
>
> diff --git a/kernel/events/callchain.c b/kernel/events/callchain.c
> index 1273be843..7e80729e9 100644
> --- a/kernel/events/callchain.c
> +++ b/kernel/events/callchain.c
> @@ -45,6 +45,10 @@ __weak void perf_callchain_user(struct perf_callchain_entry_ctx *entry,
> {
> }
>
> +__weak void perf_callchain_guest(struct perf_callchain_entry_ctx *entry)
> +{
> +}
> +
> static void release_callchain_buffers_rcu(struct rcu_head *head)
> {
> struct callchain_cpus_entries *entries;
> @@ -178,11 +182,12 @@ put_callchain_entry(int rctx)
>
> struct perf_callchain_entry *
> get_perf_callchain(struct pt_regs *regs, u32 init_nr, bool kernel, bool user,
> - u32 max_stack, bool crosstask, bool add_mark)
> + bool host, bool guest, u32 max_stack, bool crosstask, bool add_mark)
> {
> struct perf_callchain_entry *entry;
> struct perf_callchain_entry_ctx ctx;
> int rctx;
> + unsigned int guest_state;
>
> entry = get_callchain_entry(&rctx);
> if (!entry)
> @@ -194,6 +199,26 @@ get_perf_callchain(struct pt_regs *regs, u32 init_nr, bool kernel, bool user,
> ctx.contexts = 0;
> ctx.contexts_maxed = false;
>
> + guest_state = perf_guest_state();
> + if (guest_state) {
> + if (!guest)
> + goto exit_put;
> + if (user && (guest_state & PERF_GUEST_USER)) {
> + if (add_mark)
> + perf_callchain_store_context(&ctx, PERF_CONTEXT_GUEST_USER);
> + perf_callchain_guest(&ctx);
> + }
> + if (kernel && !(guest_state & PERF_GUEST_USER)) {
> + if (add_mark)
> + perf_callchain_store_context(&ctx, PERF_CONTEXT_GUEST_KERNEL);
> + perf_callchain_guest(&ctx);
> + }
> + goto exit_put;
> + }
> +
> + if (unlikely(!host))
> + goto exit_put;
> +
> if (kernel && !user_mode(regs)) {
> if (add_mark)
> perf_callchain_store_context(&ctx, PERF_CONTEXT_KERNEL);
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index eaba00ec2..b3401f403 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -7559,6 +7559,8 @@ perf_callchain(struct perf_event *event, struct pt_regs *regs)
> {
> bool kernel = !event->attr.exclude_callchain_kernel;
> bool user = !event->attr.exclude_callchain_user;
> + bool host = !event->attr.exclude_host;
> + bool guest = !event->attr.exclude_guest;
> /* Disallow cross-task user callchains. */
> bool crosstask = event->ctx->task && event->ctx->task != current;
> const u32 max_stack = event->attr.sample_max_stack;
> @@ -7567,7 +7569,10 @@ perf_callchain(struct perf_event *event, struct pt_regs *regs)
> if (!kernel && !user)
> return &__empty_callchain;
>
> - callchain = get_perf_callchain(regs, 0, kernel, user,
> + if (!host && !guest)
> + return &__empty_callchain;
> +
> + callchain = get_perf_callchain(regs, 0, kernel, user, host, guest,
> max_stack, crosstask, true);
> return callchain ?: &__empty_callchain;
> }


Best regards,
Maxim Levitsky


2023-10-11 14:46:47

by Tianyi Liu

Subject: Re: [PATCH v2 4/5] perf kvm: Support sampling guest callchains

Hi Maxim,

At 2023-10-10 16:12 +0000, Maxim Levitsky wrote:
> > +static inline void
> > +perf_callchain_guest32(struct perf_callchain_entry_ctx *entry)
> > +{
> > + struct stack_frame_ia32 frame;
> > + const struct stack_frame_ia32 *fp;
> > +
> > + fp = (void *)perf_guest_get_frame_pointer();
> > + while (fp && entry->nr < entry->max_stack) {
> > + if (!perf_guest_read_virt(&fp->next_frame, &frame.next_frame,
> This should be fp->next_frame.
> > + sizeof(frame.next_frame)))
> > + break;
> > + if (!perf_guest_read_virt(&fp->return_address, &frame.return_address,
> Same here.
> > + sizeof(frame.return_address)))
> > + break;
> > + perf_callchain_store(entry, frame.return_address);
> > + fp = (void *)frame.next_frame;
> > + }
> > +}
> > +

The address space where `fp` resides is guest memory, not the directly
accessible kernel address space. `&fp->next_frame` and
`&fp->return_address` simply compute address offsets in a more
readable manner, much like `fp + 0` and `fp + 4`.

The original implementations of `perf_callchain_user` and
`perf_callchain_user32` also use this approach [1].
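
In other words, `&fp->member` never dereferences `fp`; it only computes
`fp + offsetof(...)`. A quick userspace illustration (hypothetical, just
for clarity):

#include <stdio.h>
#include <stdint.h>

/* Same layout as the kernel's struct stack_frame_ia32 */
struct stack_frame_ia32 {
	uint32_t next_frame;
	uint32_t return_address;
};

int main(void)
{
	/* Pretend this is a guest virtual address; it is never dereferenced. */
	const struct stack_frame_ia32 *fp = (const void *)0x1000;

	printf("%p\n", (void *)&fp->next_frame);     /* prints 0x1000 + 0 */
	printf("%p\n", (void *)&fp->return_address); /* prints 0x1000 + 4 */
	return 0;
}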

>
> For symmetry, maybe it makes sense to have perf_callchain_guest32 and perf_callchain_guest64
> and then make perf_callchain_guest call each? No strong opinion on this of course.
>

The `perf_callchain_guest` and `perf_callchain_guest32` here are simply
designed to mimic `perf_callchain_user` and `perf_callchain_user32` [2].
I'm also open to making the logic fully separate, if this doesn't seem
elegant enough.

[1] https://github.com/torvalds/linux/blob/master/arch/x86/events/core.c#L2890
[2] https://github.com/torvalds/linux/blob/master/arch/x86/events/core.c#L2820


Best regards,
Tianyi Liu

2023-10-11 16:46:27

by Marc Zyngier

Subject: Re: [PATCH v2 0/5] perf: KVM: Enable callchains for guests

On Sun, 08 Oct 2023 15:48:17 +0100,
Tianyi Liu <[email protected]> wrote:
>
> Hi there,
>
> This series of patches enables callchains for guests (used by perf kvm),
> which holds the top spot on the perf wiki TODO list [1]. This allows users
> to perform callchain and performance analysis of the guest OS from the
> outside, using PMU events.
>
> The event processing flow is as follows (shown as backtrace):
> #0 kvm_arch_vcpu_get_frame_pointer / kvm_arch_vcpu_read_virt (per arch)
> #1 kvm_guest_get_frame_pointer / kvm_guest_read_virt
> <callback function pointers in `struct perf_guest_info_callbacks`>
> #2 perf_guest_get_frame_pointer / perf_guest_read_virt
> #3 perf_callchain_guest
> #4 get_perf_callchain
> #5 perf_callchain
>
> Between #0 and #1 is the interface between KVM and the arch-specific
> impl, while between #1 and #2 is the interface between Perf and KVM.
> The 1st patch implements #0. The 2nd patch extends the interfaces between
> #1 and #2, while the 3rd patch implements #1. The 4th patch implements #3
> and modifies #4 and #5. The last patch is for the userspace utils.
>
> Since arm64 does not yet provide some foundational infrastructure (an
> interface for reading from a guest virtual address), the arm64
> implementation is stubbed for now because it's a bit complex, and will
> be implemented later.

I hope you realise that such an "interface" would be, by definition,
fragile and very likely to break in a subtle way. The only existing
case where we walk the guest's page tables is for NV, and even that is
extremely fragile.

Given that, I really wonder why this needs to happen in the kernel.
Userspace has all the required information to interrupt a vcpu and
walk its current context, without any additional kernel support. What
are the bits here that cannot be implemented anywhere else?

>
> Tested with both 32-bit and 64-bit guest operating systems / unikernels,
> verifying that `perf script` correctly shows their callchains.
> FlameGraphs can also be generated with this series of patches and [2].
>
> Any feedback will be greatly appreciated.
>
> [1] https://perf.wiki.kernel.org/index.php/Todo
> [2] https://github.com/brendangregg/FlameGraph
>
> v1:
> https://lore.kernel.org/kvm/SYYP282MB108686A73C0F896D90D246569DE5A@SYYP282MB1086.AUSP282.PROD.OUTLOOK.COM/
>
> Changes since v1:
> - v1 only includes partial KVM modifications, while v2 is a complete
> implementation. Also updated based on Sean's feedback.
>
> Tianyi Liu (5):
> KVM: Add arch specific interfaces for sampling guest callchains
> perf kvm: Introduce guest interfaces for sampling callchains
> KVM: implement new perf interfaces
> perf kvm: Support sampling guest callchains
> perf tools: Support PERF_CONTEXT_GUEST_* flags
>
> arch/arm64/kvm/arm.c | 17 +++++++++

Given that there is more to KVM than just arm64 and x86, I suggest
that you move the lack of support for this feature into the main KVM
code.

Thanks,

M.

--
Without deviation from the norm, progress is not possible.

2023-10-12 06:37:21

by Tianyi Liu

Subject: Re: [PATCH v2 0/5] perf: KVM: Enable callchains for guests

Hi Marc,

On Sun, 11 Oct 2023 16:45:17 +0000, Marc Zyngier wrote:
> > The event processing flow is as follows (shown as backtrace):
> > #0 kvm_arch_vcpu_get_frame_pointer / kvm_arch_vcpu_read_virt (per arch)
> > #1 kvm_guest_get_frame_pointer / kvm_guest_read_virt
> > <callback function pointers in `struct perf_guest_info_callbacks`>
> > #2 perf_guest_get_frame_pointer / perf_guest_read_virt
> > #3 perf_callchain_guest
> > #4 get_perf_callchain
> > #5 perf_callchain
> >
> > Between #0 and #1 is the interface between KVM and the arch-specific
> > impl, while between #1 and #2 is the interface between Perf and KVM.
> > The 1st patch implements #0. The 2nd patch extends interfaces between #1
> > and #2, while the 3rd patch implements #1. The 4th patch implements #3
> > and modifies #4 #5. The last patch is for userspace utils.
> >
> > Since arm64 hasn't provided some foundational infrastructure (interface
> > for reading from a virtual address of guest), the arm64 implementation
> > is stubbed for now because it's a bit complex, and will be implemented
> > later.
>
> I hope you realise that such an "interface" would be, by definition,
> fragile and very likely to break in a subtle way. The only existing
> case where we walk the guest's page tables is for NV, and even that is
> extremely fragile.
>

For walking the guest's page tables, yes, there are only very few
use cases. Most of them are in nested virtualization and Xen.

> Given that, I really wonder why this needs to happen in the kernel.
> Userspace has all the required information to interrupt a vcpu and
> walk its current context, without any additional kernel support. What
> are the bits here that cannot be implemented anywhere else?
>

Thanks for pointing this out; I agree with you.
Whether it's walking the guest's contexts or performing an unwind,
user space can indeed accomplish these tasks.
The only reasons I see for implementing them in the kernel are performance
and access to a broader range of PMU events.

Consider if I were to implement these functionalities in userspace:
I could have `perf kvm` periodically access the guest through the KVM API
to retrieve the necessary information. However, interrupting a VCPU
through the KVM API from user space might introduce higher latency
(not tested specifically), and the overhead of syscalls could also
limit the sampling frequency.
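
For example, a minimal userspace sampling step might look like the
following sketch (assuming the vcpu thread has already been kicked out
of KVM_RUN by a signal; x86 only, untested):

#include <stdio.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/* Read the guest IP and frame pointer of a stopped vcpu. */
static int sample_vcpu(int vcpu_fd)
{
	struct kvm_regs regs;

	if (ioctl(vcpu_fd, KVM_GET_REGS, &regs) < 0)
		return -1;

	/* Unwinding would then read guest stack memory through the
	 * userspace mappings backing the memslots.
	 */
	printf("guest ip=0x%llx fp=0x%llx\n", regs.rip, regs.rbp);
	return 0;
}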

Additionally, it seems that user space can only interrupt the VCPU
at a fixed frequency, without harnessing the richness of the PMU's
performance events. If we incorporate the logic into the kernel,
`perf kvm` can bind to various PMU events and sample faster, from
within the PMU interrupts.

So, it appears to be a tradeoff -- whether it's necessary to introduce
more complexity in the kernel to gain access to a broader range and more
precise performance data with less overhead. In my current use case,
I just require simple periodic sampling, which is sufficient for me,
so I'm open to both approaches.

> > Tianyi Liu (5):
> > KVM: Add arch specific interfaces for sampling guest callchains
> > perf kvm: Introduce guest interfaces for sampling callchains
> > KVM: implement new perf interfaces
> > perf kvm: Support sampling guest callchains
> > perf tools: Support PERF_CONTEXT_GUEST_* flags
> >
> > arch/arm64/kvm/arm.c | 17 +++++++++
>
> Given that there is more to KVM than just arm64 and x86, I suggest
> that you move the lack of support for this feature into the main KVM
> code.

Currently, sampling for KVM guests is only available for the guest's
instruction pointer, and even that support is limited: it is available
on only two architectures (x86 and arm64). This functionality relies on
a kernel configuration option called `CONFIG_GUEST_PERF_EVENTS`,
which will only be enabled on x86 and arm64.
Within the main KVM code, these interfaces are enclosed within
`#ifdef CONFIG_GUEST_PERF_EVENTS`. Do you think this is enough?

Best regards,
Tianyi Liu

2023-10-12 20:43:03

by kernel test robot

Subject: Re: [PATCH v2 4/5] perf kvm: Support sampling guest callchains

Hi Tianyi,

kernel test robot noticed the following build warnings:

[auto build test WARNING on 8a749fd1a8720d4619c91c8b6e7528c0a355c0aa]

url: https://github.com/intel-lab-lkp/linux/commits/Tianyi-Liu/KVM-Add-arch-specific-interfaces-for-sampling-guest-callchains/20231008-230042
base: 8a749fd1a8720d4619c91c8b6e7528c0a355c0aa
patch link: https://lore.kernel.org/r/SY4P282MB108433024762F1F292D47C2A9DCFA%40SY4P282MB1084.AUSP282.PROD.OUTLOOK.COM
patch subject: [PATCH v2 4/5] perf kvm: Support sampling guest callchains
config: i386-randconfig-061-20231012 (https://download.01.org/0day-ci/archive/20231013/[email protected]/config)
compiler: gcc-12 (Debian 12.2.0-14) 12.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20231013/[email protected]/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <[email protected]>
| Closes: https://lore.kernel.org/oe-kbuild-all/[email protected]/

sparse warnings: (new ones prefixed by >>)
>> arch/x86/events/core.c:2808:52: sparse: sparse: incorrect type in argument 1 (different modifiers) @@ expected void *addr @@ got struct stack_frame *const * @@
arch/x86/events/core.c:2808:52: sparse: expected void *addr
arch/x86/events/core.c:2808:52: sparse: got struct stack_frame *const *
>> arch/x86/events/core.c:2811:52: sparse: sparse: incorrect type in argument 1 (different modifiers) @@ expected void *addr @@ got unsigned long const * @@
arch/x86/events/core.c:2811:52: sparse: expected void *addr
arch/x86/events/core.c:2811:52: sparse: got unsigned long const *
>> arch/x86/events/core.c:2784:44: sparse: sparse: incorrect type in argument 1 (different modifiers) @@ expected void *addr @@ got unsigned int const * @@
arch/x86/events/core.c:2784:44: sparse: expected void *addr
arch/x86/events/core.c:2784:44: sparse: got unsigned int const *
arch/x86/events/core.c:2787:44: sparse: sparse: incorrect type in argument 1 (different modifiers) @@ expected void *addr @@ got unsigned int const * @@
arch/x86/events/core.c:2787:44: sparse: expected void *addr
arch/x86/events/core.c:2787:44: sparse: got unsigned int const *

vim +2808 arch/x86/events/core.c

2775
2776 static inline void
2777 perf_callchain_guest32(struct perf_callchain_entry_ctx *entry)
2778 {
2779 struct stack_frame_ia32 frame;
2780 const struct stack_frame_ia32 *fp;
2781
2782 fp = (void *)perf_guest_get_frame_pointer();
2783 while (fp && entry->nr < entry->max_stack) {
> 2784 if (!perf_guest_read_virt(&fp->next_frame, &frame.next_frame,
2785 sizeof(frame.next_frame)))
2786 break;
2787 if (!perf_guest_read_virt(&fp->return_address, &frame.return_address,
2788 sizeof(frame.return_address)))
2789 break;
2790 perf_callchain_store(entry, frame.return_address);
2791 fp = (void *)frame.next_frame;
2792 }
2793 }
2794
2795 void
2796 perf_callchain_guest(struct perf_callchain_entry_ctx *entry)
2797 {
2798 struct stack_frame frame;
2799 const struct stack_frame *fp;
2800 unsigned int guest_state;
2801
2802 guest_state = perf_guest_state();
2803 perf_callchain_store(entry, perf_guest_get_ip());
2804
2805 if (guest_state & PERF_GUEST_64BIT) {
2806 fp = (void *)perf_guest_get_frame_pointer();
2807 while (fp && entry->nr < entry->max_stack) {
> 2808 if (!perf_guest_read_virt(&fp->next_frame, &frame.next_frame,
2809 sizeof(frame.next_frame)))
2810 break;
> 2811 if (!perf_guest_read_virt(&fp->return_address, &frame.return_address,
2812 sizeof(frame.return_address)))
2813 break;
2814 perf_callchain_store(entry, frame.return_address);
2815 fp = (void *)frame.next_frame;
2816 }
2817 } else {
2818 perf_callchain_guest32(entry);
2819 }
2820 }
2821

--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

2023-10-13 14:02:02

by Mark Rutland

Subject: Re: [PATCH v2 0/5] perf: KVM: Enable callchains for guests

On Thu, Oct 12, 2023 at 02:35:42PM +0800, Tianyi Liu wrote:
> Hi Marc,
>
> On Sun, 11 Oct 2023 16:45:17 +0000, Marc Zyngier wrote:
> > > The event processing flow is as follows (shown as backtrace):
> > > #0 kvm_arch_vcpu_get_frame_pointer / kvm_arch_vcpu_read_virt (per arch)
> > > #1 kvm_guest_get_frame_pointer / kvm_guest_read_virt
> > > <callback function pointers in `struct perf_guest_info_callbacks`>
> > > #2 perf_guest_get_frame_pointer / perf_guest_read_virt
> > > #3 perf_callchain_guest
> > > #4 get_perf_callchain
> > > #5 perf_callchain
> > >
> > > Between #0 and #1 is the interface between KVM and the arch-specific
> > > impl, while between #1 and #2 is the interface between Perf and KVM.
> > > The 1st patch implements #0. The 2nd patch extends the interfaces between
> > > #1 and #2, while the 3rd patch implements #1. The 4th patch implements #3
> > > and modifies #4 and #5. The last patch is for the userspace utils.
> > >
> > > Since arm64 does not yet provide some foundational infrastructure (an
> > > interface for reading from a guest virtual address), the arm64
> > > implementation is stubbed for now because it's a bit complex, and will
> > > be implemented later.
> >
> > I hope you realise that such an "interface" would be, by definition,
> > fragile and very likely to break in a subtle way. The only existing
> > case where we walk the guest's page tables is for NV, and even that is
> > extremely fragile.
>
> For walking the guest's page tables, yes, there are only very few
> use cases; most of them are in nested virtualization and Xen.

The key point isn't the lack of use cases; the key point is that *this is
fragile*.

Consider that walking guest page tables is only safe because:

(a) The walks happen in the guest-physical / intermediate-physical address
space of the guest, and so are not themselves subject to translation via
the guest's page tables.

(b) Special traps were added to the architecture (e.g. for TLB invalidation)
which allow the host to avoid race conditions when the guest modifies page
tables.

For unwind we'd have to walk structures in the guest's virtual address space,
which can change under our feet at any time the guest is running, and handling
that requires much more care.
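
As an illustration of the window involved, here is a hypothetical
sketch (not code from this series) of a software walk of a guest's
4-level page tables; kvm_read_guest() is the existing KVM helper, and
large pages and permission bits are ignored for brevity:

	static bool walk_guest_va(struct kvm *kvm, u64 cr3, u64 gva, u64 *gpa)
	{
		u64 entry, table = cr3 & ~0xfffULL;
		int level;

		for (level = 4; level >= 1; level--) {
			u64 idx = (gva >> (12 + 9 * (level - 1))) & 0x1ff;

			/* Each iteration is a separate read of guest memory;
			 * the running guest can rewrite the entry between
			 * any two of these reads. */
			if (kvm_read_guest(kvm, table + idx * 8, &entry, 8))
				return false;
			if (!(entry & 1))	/* not present: give up */
				return false;
			table = entry & 0x000ffffffffff000ULL;
		}
		*gpa = table | (gva & 0xfffULL);
		return true;
	}

Nothing in that loop prevents the guest from freeing or repurposing the
tables mid-walk; the host can only detect the inconsistency after the
fact, if at all.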

I think this needs a stronger justification, and an explanation of how you
handle such races.

Mark.

> > Given that, I really wonder why this needs to happen in the kernel.
> > Userspace has all the required information to interrupt a vcpu and
> > walk its current context, without any additional kernel support. What
> > are the bits here that cannot be implemented anywhere else?
>
> Thanks for pointing this out; I agree with your opinion.
> Whether it's walking the guest's contexts or performing an unwind,
> user space can indeed accomplish these tasks.
> The only reasons I see for implementing them in the kernel are performance
> and access to a broader range of PMU events.
>
> Consider if I were to implement these functionalities in userspace:
> I could have `perf kvm` periodically access the guest through the KVM API
> to retrieve the necessary information. However, interrupting a VCPU
> through the KVM API from user space might introduce higher latency
> (not tested specifically), and the overhead of syscalls could also
> limit the sampling frequency.
>
> Additionally, it seems that user space can only interrupt the VCPU
> at a fixed frequency, without harnessing the richness of the PMU's
> performance events. If we incorporate the logic into the kernel,
> `perf kvm` can bind to various PMU events and sample with lower
> overhead from within PMU interrupts.
>
> So, it appears to be a tradeoff -- whether it's necessary to introduce
> more complexity in the kernel to gain access to a broader range of
> events and more precise performance data with less overhead. In my
> current use case, I just require simple periodic sampling, which is
> sufficient for me, so I'm open to both approaches.
>
> > > Tianyi Liu (5):
> > > KVM: Add arch specific interfaces for sampling guest callchains
> > > perf kvm: Introduce guest interfaces for sampling callchains
> > > KVM: implement new perf interfaces
> > > perf kvm: Support sampling guest callchains
> > > perf tools: Support PERF_CONTEXT_GUEST_* flags
> > >
> > > arch/arm64/kvm/arm.c | 17 +++++++++
> >
> > Given that there is more to KVM than just arm64 and x86, I suggest
> > that you move the lack of support for this feature into the main KVM
> > code.
>
> Currently, sampling for KVM guests covers only the guest's
> instruction pointer, and even that support is limited: it is available
> on only two architectures (x86 and arm64). This functionality relies on
> the kernel configuration option `CONFIG_GUEST_PERF_EVENTS`,
> which is only enabled on x86 and arm64.
> Within the main KVM code, these interfaces are already enclosed within
> `#ifdef CONFIG_GUEST_PERF_EVENTS`. Do you think that is enough?
>
> Best regards,
> Tianyi Liu

2023-10-20 09:24:50

by Tianyi Liu

[permalink] [raw]
Subject: Re: [PATCH v2 0/5] perf: KVM: Enable callchains for guests

Hi Mark,

On Fri, 13 Oct 2023 15:01:22 +0100, Mark Rutland wrote:
> > > > The event processing flow is as follows (shown as backtrace):
> > > > #0 kvm_arch_vcpu_get_frame_pointer / kvm_arch_vcpu_read_virt (per arch)
> > > > #1 kvm_guest_get_frame_pointer / kvm_guest_read_virt
> > > > <callback function pointers in `struct perf_guest_info_callbacks`>
> > > > #2 perf_guest_get_frame_pointer / perf_guest_read_virt
> > > > #3 perf_callchain_guest
> > > > #4 get_perf_callchain
> > > > #5 perf_callchain
> > > >
> > > > Between #0 and #1 is the interface between KVM and the arch-specific
> > > > impl, while between #1 and #2 is the interface between Perf and KVM.
> > > > The 1st patch implements #0. The 2nd patch extends interfaces between #1
> > > > and #2, while the 3rd patch implements #1. The 4th patch implements #3
> > > > and modifies #4 #5. The last patch is for userspace utils.
> > > >
> > > > Since arm64 hasn't provided some foundational infrastructure (interface
> > > > for reading from a virtual address of guest), the arm64 implementation
> > > > is stubbed for now because it's a bit complex, and will be implemented
> > > > later.
> > >
> > > I hope you realise that such an "interface" would be, by definition,
> > > fragile and very likely to break in a subtle way. The only existing
> > > case where we walk the guest's page tables is for NV, and even that is
> > > extremely fragile.
> >
> > For walking the guest's page tables, yes, there are only very few
> > use cases; most of them are in nested virtualization and Xen.
>
> The key point isn't the lack of use cases; the key point is that *this is
> fragile*.
>
> Consider that walking guest page tables is only safe because:
>
> (a) The walks happen in the guest-physical / intermediate-physical address
> space of the guest, and so are not themselves subject to translation via
> the guest's page tables.
>
> (b) Special traps were added to the architecture (e.g. for TLB invalidation)
> which allow the host to avoid race conditions when the guest modifies page
> tables.
>
> For unwind we'd have to walk structures in the guest's virtual address space,
> which can change under our feet at any time the guest is running, and handling
> that requires much more care.
>
> I think this needs a stronger justification, and an explanation of how you
> handle such races.

Yes, guests can modify their page tables at any time, so the page table
we walk may be inconsistent: we may be unable to complete the traversal,
or we may read stale or incorrect data.

However, these are not critical issues, because incorrect stack
unwinding results are common anyway. We assume here that the guest
OS/program keeps stack frames (i.e. is compiled with
`-fno-omit-frame-pointer`), but many programs do not satisfy that
assumption, which often leads to invalid results. This is almost
unavoidable, especially when the guest OS is running third-party
programs. The unwind results we record may therefore be incorrect; if
the unwind cannot continue, we record only the results gathered so far.
Addresses that cannot be resolved to symbols are later marked as
`[unknown]` by `perf kvm`, and this is very common.

Our unwind strategy is conservative: we ensure safety and do our best
on a read-only basis. If the guest page table is broken, or the address
to be read is not mapped in the guest page table, we do not inject a
page fault but simply stop the unwind. The page table walk is done
entirely in software and is read-only, so it has no additional impact
on the guest. Some results may still be incorrect, but for profiling it
is sufficient that most of the records are correct.
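
As a minimal sketch of this stop-on-failure policy (the exact
signatures here are assumptions: kvm_get_running_vcpu() is the existing
helper, kvm_arch_vcpu_read_virt() is the per-arch interface from the
first patch):

	bool kvm_guest_read_virt(const void *addr, void *dest, unsigned int len)
	{
		struct kvm_vcpu *vcpu = kvm_get_running_vcpu();

		if (!vcpu)
			return false;

		/* Read-only software walk plus copy; on any translation
		 * failure this returns false and the caller just stops
		 * unwinding -- no fault is injected into the guest. */
		return kvm_arch_vcpu_read_virt(vcpu, addr, dest, len);
	}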

Do you think these address your concerns?

Thanks,
Tianyi Liu