2023-12-10 08:09:19

by Tianyi Liu

Subject: [PATCH v3 0/5] perf: KVM: Enable callchains for guests

This series of patches enables callchains for guests (used by `perf kvm`),
which holds the top spot on the perf wiki TODO list [1]. It allows users
to perform callchain and performance analysis of the guest OS from the
outside, using PMU events. This is also useful for guests such as
unikernels that lack a performance event subsystem of their own.

The event processing flow is as follows (shown as a backtrace):
@0 kvm_arch_vcpu_get_unwind_info / kvm_arch_vcpu_read_virt (per arch impl)
@1 kvm_guest_get_unwind_info / kvm_guest_read_virt
<callback function pointers in `struct perf_guest_info_callbacks`>
@2 perf_guest_get_unwind_info / perf_guest_read_virt
@3 perf_callchain_guest
@4 get_perf_callchain
@5 perf_callchain

Between @0 and @1 lies the interface between KVM and the arch-specific
implementation, while between @1 and @2 lies the interface between perf
and KVM. The 1st patch implements @0. The 2nd patch extends the interfaces
between @1 and @2, while the 3rd patch implements @1. The 4th patch
implements @3 and modifies @4 and @5. The last patch is for the userspace
tools.

Since arm64 does not yet provide some foundational infrastructure (an
interface for reading from a guest virtual address), the arm64
implementation is stubbed out for now because it is a bit complex; it will
be implemented later.

For safety, guests are treated as strictly read-only in this feature,
and we never inject page faults into the guests, ensuring that profiling
does not interfere with them. In extremely rare cases, if the guest is
concurrently modifying its page tables, there is a possibility of reading
incorrect data. Additionally, if certain programs running in the guest OS
are built without frame pointers, some erroneous data may be recorded.
Such erroneous data eventually appears as `[unknown]` entries in the
report. For profiling purposes, it is sufficient that most of the records
are correct.

Regarding the necessity of implementing this in the kernel:
Indeed, we could implement this in userspace, accessing the guest VM
through the KVM APIs to interrupt the guest and perform the unwinding.
However, that approach introduces higher latency, and the overhead of
syscalls limits the sampling frequency. Moreover, userspace can only
interrupt the vCPU at a fixed frequency, without fully leveraging the
richness of the PMU's performance events. If we incorporate the logic
into the kernel instead, `perf kvm` can bind to various PMU events and
unwind directly from the PMU interrupt with far less overhead.

Tested with both Linux and unikernel guests: the `perf script` command
correctly shows the callchains, and FlameGraphs can be generated with this
series of patches plus [2].
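
As an example, one possible workflow looks like this (illustrative sketch;
exact options and output file names may differ between perf versions):

  # Record PMU samples with callchains for both host and guest:
  perf kvm --host --guest record -a -g -- sleep 10
  # Inspect the sampled callchains:
  perf script -i perf.data.kvm
  # Or fold them into a FlameGraph with the scripts from [2]:
  perf script -i perf.data.kvm | ./stackcollapse-perf.pl | ./flamegraph.pl > guest.svg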

[1] https://perf.wiki.kernel.org/index.php/Todo
[2] https://github.com/brendangregg/FlameGraph

v1:
https://lore.kernel.org/kvm/SYYP282MB108686A73C0F896D90D246569DE5A@SYYP282MB1086.AUSP282.PROD.OUTLOOK.COM/

Changes since v1:
Posted the complete implementation; also updated some code based on
Sean's feedback.

v2:
https://lore.kernel.org/kvm/SY4P282MB1084ECBCC1B176153B9E2A009DCFA@SY4P282MB1084.AUSP282.PROD.OUTLOOK.COM/

Changes since v2:
Refactored the interface, packaging the info required for unwinding into
a struct; resolved some type mismatches; provided more explanations based
on the feedback from v2; performed more tests.

Tianyi Liu (5):
KVM: Add arch specific interfaces for sampling guest callchains
perf kvm: Introduce guest interfaces for sampling callchains
KVM: implement new perf callback interfaces
perf kvm: Support sampling guest callchains
perf tools: Support PERF_CONTEXT_GUEST_* flags

MAINTAINERS | 1 +
arch/arm64/kvm/arm.c | 12 ++++++
arch/x86/events/core.c | 63 ++++++++++++++++++++++++-----
arch/x86/kvm/x86.c | 24 +++++++++++
include/linux/kvm_host.h | 5 +++
include/linux/perf_event.h | 20 ++++++++-
include/linux/perf_kvm.h | 18 +++++++++
kernel/bpf/stackmap.c | 8 ++--
kernel/events/callchain.c | 27 ++++++++++++-
kernel/events/core.c | 17 +++++++-
tools/perf/builtin-timechart.c | 6 +++
tools/perf/util/data-convert-json.c | 6 +++
tools/perf/util/machine.c | 6 +++
virt/kvm/kvm_main.c | 22 ++++++++++
14 files changed, 218 insertions(+), 17 deletions(-)
create mode 100644 include/linux/perf_kvm.h


base-commit: 33cc938e65a98f1d29d0a18403dbbee050dcad9a
--
2.34.1


2023-12-10 08:13:17

by Tianyi Liu

Subject: [PATCH v3 1/5] KVM: Add arch specific interfaces for sampling guest callchains

This patch adds two architecture-specific interfaces used by `perf kvm`:

- kvm_arch_vcpu_get_unwind_info: Return the data required for unwinding
  in a single call, including the instruction pointer, the frame pointer,
  whether the guest vCPU is running in 32- or 64-bit mode, and, where
  applicable, the base addresses of the segments.

- kvm_arch_vcpu_read_virt: Read data from a virtual address of the
  guest VM.

`perf_kvm.h` has been added to the `include/linux/` directory to store
the interface structures between the perf events subsystem and the KVM
subsystem.

Since arm64 does not yet provide some foundational infrastructure, the
arm64 implementation is stubbed out for now because it is a bit complex.

The above interfaces require architecture support for
`CONFIG_GUEST_PERF_EVENTS`, which currently only x86 and arm64 implement.
Other architectures will need to implement these interfaces when they
enable `CONFIG_GUEST_PERF_EVENTS`.
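
For reference, the Kconfig symbol itself is roughly the following (excerpt
from init/Kconfig; the per-arch KVM Kconfigs select it):

  config GUEST_PERF_EVENTS
          bool
          depends on HAVE_PERF_EVENTS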

In terms of safety, guests are treated as strictly read-only in this
feature, and we never inject page faults into the guests, ensuring that
profiling does not interfere with them. In extremely rare cases, if the
guest is concurrently modifying its page tables, there is a possibility of
reading incorrect data. Additionally, if certain programs running in the
guest OS are built without frame pointers, some erroneous data may be
recorded. Such erroneous data eventually appears as `[unknown]` entries in
the report. For profiling purposes, it is sufficient that most of the
records are correct.

Signed-off-by: Tianyi Liu <[email protected]>
---
MAINTAINERS | 1 +
arch/arm64/kvm/arm.c | 12 ++++++++++++
arch/x86/kvm/x86.c | 24 ++++++++++++++++++++++++
include/linux/kvm_host.h | 5 +++++
include/linux/perf_kvm.h | 18 ++++++++++++++++++
5 files changed, 60 insertions(+)
create mode 100644 include/linux/perf_kvm.h

diff --git a/MAINTAINERS b/MAINTAINERS
index 788be9ab5b73..5ee36b4a9701 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -16976,6 +16976,7 @@ F: arch/*/kernel/*/perf_event*.c
F: arch/*/kernel/perf_callchain.c
F: arch/*/kernel/perf_event*.c
F: include/linux/perf_event.h
+F: include/linux/perf_kvm.h
F: include/uapi/linux/perf_event.h
F: kernel/events/*
F: tools/lib/perf/
diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index e5f75f1f1085..5ae74b5c263a 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -574,6 +574,18 @@ unsigned long kvm_arch_vcpu_get_ip(struct kvm_vcpu *vcpu)
{
return *vcpu_pc(vcpu);
}
+
+bool kvm_arch_vcpu_get_unwind_info(struct kvm_vcpu *vcpu, struct perf_kvm_guest_unwind_info *info)
+{
+ /* TODO: implement */
+ return false;
+}
+
+bool kvm_arch_vcpu_read_virt(struct kvm_vcpu *vcpu, gva_t addr, void *dest, unsigned int length)
+{
+ /* TODO: implement */
+ return false;
+}
#endif

static int kvm_vcpu_initialized(struct kvm_vcpu *vcpu)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 2c924075f6f1..9341cd80f665 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -13039,6 +13039,30 @@ unsigned long kvm_arch_vcpu_get_ip(struct kvm_vcpu *vcpu)
return kvm_rip_read(vcpu);
}

+bool kvm_arch_vcpu_get_unwind_info(struct kvm_vcpu *vcpu, struct perf_kvm_guest_unwind_info *info)
+{
+ info->ip_pointer = kvm_rip_read(vcpu);
+ info->frame_pointer = kvm_register_read_raw(vcpu, VCPU_REGS_RBP);
+
+ info->is_guest_64bit = is_64_bit_mode(vcpu);
+ if (info->is_guest_64bit) {
+ info->segment_cs_base = 0;
+ info->segment_ss_base = 0;
+ } else {
+ info->segment_cs_base = get_segment_base(vcpu, VCPU_SREG_CS);
+ info->segment_ss_base = get_segment_base(vcpu, VCPU_SREG_SS);
+ }
+ return true;
+}
+
+bool kvm_arch_vcpu_read_virt(struct kvm_vcpu *vcpu, gva_t addr, void *dest, unsigned int length)
+{
+ struct x86_exception e;
+
+ /* Return true on success */
+ return kvm_read_guest_virt(vcpu, addr, dest, length, &e) == X86EMUL_CONTINUE;
+}
+
int kvm_arch_vcpu_should_kick(struct kvm_vcpu *vcpu)
{
return kvm_vcpu_exiting_guest_mode(vcpu) == IN_GUEST_MODE;
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 4944136efaa2..6f5ff4209b0c 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -41,6 +41,7 @@
#include <linux/kvm_para.h>

#include <linux/kvm_types.h>
+#include <linux/perf_kvm.h>

#include <asm/kvm_host.h>
#include <linux/kvm_dirty_ring.h>
@@ -1595,6 +1596,10 @@ static inline bool kvm_arch_intc_initialized(struct kvm *kvm)

#ifdef CONFIG_GUEST_PERF_EVENTS
unsigned long kvm_arch_vcpu_get_ip(struct kvm_vcpu *vcpu);
+bool kvm_arch_vcpu_get_unwind_info(struct kvm_vcpu *vcpu,
+ struct perf_kvm_guest_unwind_info *info);
+bool kvm_arch_vcpu_read_virt(struct kvm_vcpu *vcpu, gva_t addr, void *dest,
+ unsigned int length);

void kvm_register_perf_callbacks(unsigned int (*pt_intr_handler)(void));
void kvm_unregister_perf_callbacks(void);
diff --git a/include/linux/perf_kvm.h b/include/linux/perf_kvm.h
new file mode 100644
index 000000000000..e77eeebddabb
--- /dev/null
+++ b/include/linux/perf_kvm.h
@@ -0,0 +1,18 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_PERF_KVM_H
+#define _LINUX_PERF_KVM_H
+
+/*
+ * Structures as interface between Perf Event and KVM subsystem.
+ * Add more members for new architectures if necessary.
+ */
+
+struct perf_kvm_guest_unwind_info {
+ unsigned long ip_pointer;
+ unsigned long frame_pointer;
+ bool is_guest_64bit;
+ unsigned long segment_cs_base;
+ unsigned long segment_ss_base;
+};
+
+#endif /* _LINUX_PERF_KVM_H */
--
2.34.1

2023-12-10 08:16:23

by Tianyi Liu

Subject: [PATCH v3 3/5] KVM: implement new perf callback interfaces

This patch provides the KVM implementations of the following two new perf
callback interfaces, simply redirecting them to the arch-specific
implementations:

- get_unwind_info
- read_virt

Signed-off-by: Tianyi Liu <[email protected]>
---
virt/kvm/kvm_main.c | 22 ++++++++++++++++++++++
1 file changed, 22 insertions(+)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 486800a7024b..c47b6c0f8e94 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -6032,9 +6032,31 @@ static unsigned long kvm_guest_get_ip(void)
return kvm_arch_vcpu_get_ip(vcpu);
}

+static bool kvm_guest_get_unwind_info(struct perf_kvm_guest_unwind_info *info)
+{
+ struct kvm_vcpu *vcpu = kvm_get_running_vcpu();
+
+ if (WARN_ON_ONCE(!kvm_arch_pmi_in_guest(vcpu)))
+ return false;
+
+ return kvm_arch_vcpu_get_unwind_info(vcpu, info);
+}
+
+static bool kvm_guest_read_virt(unsigned long addr, void *dest, unsigned int length)
+{
+ struct kvm_vcpu *vcpu = kvm_get_running_vcpu();
+
+ if (WARN_ON_ONCE(!kvm_arch_pmi_in_guest(vcpu)))
+ return false;
+
+ return kvm_arch_vcpu_read_virt(vcpu, addr, dest, length);
+}
+
static struct perf_guest_info_callbacks kvm_guest_cbs = {
.state = kvm_guest_state,
.get_ip = kvm_guest_get_ip,
+ .get_unwind_info = kvm_guest_get_unwind_info,
+ .read_virt = kvm_guest_read_virt,
.handle_intel_pt_intr = NULL,
};

--
2.34.1

2023-12-10 08:16:25

by Tianyi Liu

Subject: [PATCH v3 4/5] perf kvm: Support sampling guest callchains

This patch provides support for sampling guests' callchains.

The signature of `get_perf_callchain` has been modified to explicitly
specify whether the host or the guest callchain needs to be sampled. Based
on the context, `get_perf_callchain` dispatches each sampling request to
one of `perf_callchain_user`, `perf_callchain_kernel`, or
`perf_callchain_guest`.

The reason for implementing `perf_callchain_user` and
`perf_callchain_kernel` separately is that the kernel may use special
unwinders such as `ORC`. For the guest, however, only frame-pointer-based
unwinding is supported, so the implementation is generic and only needs
separate variants for 32-bit and 64-bit guests.
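
For reference, the frame layout walked here is the standard x86 one; a
sketch matching the kernel's definitions in
arch/x86/include/asm/stacktrace.h:

  /* 64-bit guest: each frame holds the caller's saved RBP and the return RIP. */
  struct stack_frame {
          struct stack_frame      *next_frame;     /* saved RBP of the caller */
          unsigned long           return_address;  /* saved RIP */
  };

  /* 32-bit guest: same shape with 32-bit slots, addressed via the SS/CS bases. */
  struct stack_frame_ia32 {
          u32 next_frame;
          u32 return_address;
  };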

Signed-off-by: Tianyi Liu <[email protected]>
---
arch/x86/events/core.c | 63 ++++++++++++++++++++++++++++++++------
include/linux/perf_event.h | 3 +-
kernel/bpf/stackmap.c | 8 ++---
kernel/events/callchain.c | 27 +++++++++++++++-
kernel/events/core.c | 7 ++++-
5 files changed, 91 insertions(+), 17 deletions(-)

diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index 40ad1425ffa2..4ff412225217 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -2758,11 +2758,6 @@ perf_callchain_kernel(struct perf_callchain_entry_ctx *entry, struct pt_regs *regs)
struct unwind_state state;
unsigned long addr;

- if (perf_guest_state()) {
- /* TODO: We don't support guest os callchain now */
- return;
- }
-
if (perf_callchain_store(entry, regs->ip))
return;

@@ -2778,6 +2773,59 @@ perf_callchain_kernel(struct perf_callchain_entry_ctx *entry, struct pt_regs *regs)
}
}

+static inline void
+perf_callchain_guest32(struct perf_callchain_entry_ctx *entry,
+ const struct perf_kvm_guest_unwind_info *unwind_info)
+{
+ unsigned long ss_base, cs_base;
+ struct stack_frame_ia32 frame;
+ const struct stack_frame_ia32 *fp;
+
+ cs_base = unwind_info->segment_cs_base;
+ ss_base = unwind_info->segment_ss_base;
+
+ fp = (void *)(ss_base + unwind_info->frame_pointer);
+ while (fp && entry->nr < entry->max_stack) {
+ if (!perf_guest_read_virt((unsigned long)&fp->next_frame,
+ &frame.next_frame, sizeof(frame.next_frame)))
+ break;
+ if (!perf_guest_read_virt((unsigned long)&fp->return_address,
+ &frame.return_address, sizeof(frame.return_address)))
+ break;
+ perf_callchain_store(entry, cs_base + frame.return_address);
+ fp = (void *)(ss_base + frame.next_frame);
+ }
+}
+
+void
+perf_callchain_guest(struct perf_callchain_entry_ctx *entry)
+{
+ struct stack_frame frame;
+ const struct stack_frame *fp;
+ struct perf_kvm_guest_unwind_info unwind_info;
+
+ if (!perf_guest_get_unwind_info(&unwind_info))
+ return;
+
+ perf_callchain_store(entry, unwind_info.ip_pointer);
+
+ if (unwind_info.is_guest_64bit) {
+ fp = (void *)unwind_info.frame_pointer;
+ while (fp && entry->nr < entry->max_stack) {
+ if (!perf_guest_read_virt((unsigned long)&fp->next_frame,
+ &frame.next_frame, sizeof(frame.next_frame)))
+ break;
+ if (!perf_guest_read_virt((unsigned long)&fp->return_address,
+ &frame.return_address, sizeof(frame.return_address)))
+ break;
+ perf_callchain_store(entry, frame.return_address);
+ fp = (void *)frame.next_frame;
+ }
+ } else {
+ perf_callchain_guest32(entry, &unwind_info);
+ }
+}
+
static inline int
valid_user_frame(const void __user *fp, unsigned long size)
{
@@ -2861,11 +2909,6 @@ perf_callchain_user(struct perf_callchain_entry_ctx *entry, struct pt_regs *regs)
struct stack_frame frame;
const struct stack_frame __user *fp;

- if (perf_guest_state()) {
- /* TODO: We don't support guest os callchain now */
- return;
- }
-
/*
* We don't know what to do with VM86 stacks.. ignore them for now.
*/
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index dacc1623dcaa..483578672868 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -1552,9 +1552,10 @@ DECLARE_PER_CPU(struct perf_callchain_entry, perf_callchain_entry);

extern void perf_callchain_user(struct perf_callchain_entry_ctx *entry, struct pt_regs *regs);
extern void perf_callchain_kernel(struct perf_callchain_entry_ctx *entry, struct pt_regs *regs);
+extern void perf_callchain_guest(struct perf_callchain_entry_ctx *entry);
extern struct perf_callchain_entry *
get_perf_callchain(struct pt_regs *regs, u32 init_nr, bool kernel, bool user,
- u32 max_stack, bool crosstask, bool add_mark);
+ bool host, bool guest, u32 max_stack, bool crosstask, bool add_mark);
extern int get_callchain_buffers(int max_stack);
extern void put_callchain_buffers(void);
extern struct perf_callchain_entry *get_callchain_entry(int *rctx);
diff --git a/kernel/bpf/stackmap.c b/kernel/bpf/stackmap.c
index d6b277482085..5ca41ca08d8a 100644
--- a/kernel/bpf/stackmap.c
+++ b/kernel/bpf/stackmap.c
@@ -294,8 +294,8 @@ BPF_CALL_3(bpf_get_stackid, struct pt_regs *, regs, struct bpf_map *, map,
if (max_depth > sysctl_perf_event_max_stack)
max_depth = sysctl_perf_event_max_stack;

- trace = get_perf_callchain(regs, 0, kernel, user, max_depth,
- false, false);
+ trace = get_perf_callchain(regs, 0, kernel, user, true, false,
+ max_depth, false, false);

if (unlikely(!trace))
/* couldn't fetch the stack trace */
@@ -420,8 +420,8 @@ static long __bpf_get_stack(struct pt_regs *regs, struct task_struct *task,
else if (kernel && task)
trace = get_callchain_entry_for_task(task, max_depth);
else
- trace = get_perf_callchain(regs, 0, kernel, user, max_depth,
- false, false);
+ trace = get_perf_callchain(regs, 0, kernel, user, true, false,
+ max_depth, false, false);
if (unlikely(!trace))
goto err_fault;

diff --git a/kernel/events/callchain.c b/kernel/events/callchain.c
index 1273be84392c..7e80729e95d0 100644
--- a/kernel/events/callchain.c
+++ b/kernel/events/callchain.c
@@ -45,6 +45,10 @@ __weak void perf_callchain_user(struct perf_callchain_entry_ctx *entry,
{
}

+__weak void perf_callchain_guest(struct perf_callchain_entry_ctx *entry)
+{
+}
+
static void release_callchain_buffers_rcu(struct rcu_head *head)
{
struct callchain_cpus_entries *entries;
@@ -178,11 +182,12 @@ put_callchain_entry(int rctx)

struct perf_callchain_entry *
get_perf_callchain(struct pt_regs *regs, u32 init_nr, bool kernel, bool user,
- u32 max_stack, bool crosstask, bool add_mark)
+ bool host, bool guest, u32 max_stack, bool crosstask, bool add_mark)
{
struct perf_callchain_entry *entry;
struct perf_callchain_entry_ctx ctx;
int rctx;
+ unsigned int guest_state;

entry = get_callchain_entry(&rctx);
if (!entry)
@@ -194,6 +199,26 @@ get_perf_callchain(struct pt_regs *regs, u32 init_nr, bool kernel, bool user,
ctx.contexts = 0;
ctx.contexts_maxed = false;

+ guest_state = perf_guest_state();
+ if (guest_state) {
+ if (!guest)
+ goto exit_put;
+ if (user && (guest_state & PERF_GUEST_USER)) {
+ if (add_mark)
+ perf_callchain_store_context(&ctx, PERF_CONTEXT_GUEST_USER);
+ perf_callchain_guest(&ctx);
+ }
+ if (kernel && !(guest_state & PERF_GUEST_USER)) {
+ if (add_mark)
+ perf_callchain_store_context(&ctx, PERF_CONTEXT_GUEST_KERNEL);
+ perf_callchain_guest(&ctx);
+ }
+ goto exit_put;
+ }
+
+ if (unlikely(!host))
+ goto exit_put;
+
if (kernel && !user_mode(regs)) {
if (add_mark)
perf_callchain_store_context(&ctx, PERF_CONTEXT_KERNEL);
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 4c5e35006217..3dea3fe840e6 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -7607,6 +7607,8 @@ perf_callchain(struct perf_event *event, struct pt_regs *regs)
{
bool kernel = !event->attr.exclude_callchain_kernel;
bool user = !event->attr.exclude_callchain_user;
+ bool host = !event->attr.exclude_host;
+ bool guest = !event->attr.exclude_guest;
/* Disallow cross-task user callchains. */
bool crosstask = event->ctx->task && event->ctx->task != current;
const u32 max_stack = event->attr.sample_max_stack;
@@ -7615,7 +7617,10 @@ perf_callchain(struct perf_event *event, struct pt_regs *regs)
if (!kernel && !user)
return &__empty_callchain;

- callchain = get_perf_callchain(regs, 0, kernel, user,
+ if (!host && !guest)
+ return &__empty_callchain;
+
+ callchain = get_perf_callchain(regs, 0, kernel, user, host, guest,
max_stack, crosstask, true);
return callchain ?: &__empty_callchain;
}
--
2.34.1

2023-12-10 08:16:37

by Tianyi Liu

Subject: [PATCH v3 2/5] perf kvm: Introduce guest interfaces for sampling callchains

This patch introduces two callback interfaces used between perf and KVM:

- get_unwind_info: Return the data required for unwinding in a single
  call, including the instruction pointer, the frame pointer, whether the
  guest vCPU is running in 32- or 64-bit mode, and, where applicable, the
  base addresses of the segments.

- read_virt: Read data from a virtual address of the guest VM, used for
  reading the stack frames from the guest.

Signed-off-by: Tianyi Liu <[email protected]>
---
include/linux/perf_event.h | 17 +++++++++++++++++
kernel/events/core.c | 10 ++++++++++
2 files changed, 27 insertions(+)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 5547ba68e6e4..dacc1623dcaa 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -17,6 +17,8 @@
#include <uapi/linux/perf_event.h>
#include <uapi/linux/bpf_perf_event.h>

+#include <linux/perf_kvm.h>
+
/*
* Kernel-internal data types and definitions:
*/
@@ -32,6 +34,9 @@
struct perf_guest_info_callbacks {
unsigned int (*state)(void);
unsigned long (*get_ip)(void);
+ bool (*get_unwind_info)(struct perf_kvm_guest_unwind_info *info);
+ bool (*read_virt)(unsigned long addr, void *dest,
+ unsigned int len);
unsigned int (*handle_intel_pt_intr)(void);
};

@@ -1500,6 +1505,8 @@ extern struct perf_guest_info_callbacks __rcu *perf_guest_cbs;

DECLARE_STATIC_CALL(__perf_guest_state, *perf_guest_cbs->state);
DECLARE_STATIC_CALL(__perf_guest_get_ip, *perf_guest_cbs->get_ip);
+DECLARE_STATIC_CALL(__perf_guest_get_unwind_info, *perf_guest_cbs->get_unwind_info);
+DECLARE_STATIC_CALL(__perf_guest_read_virt, *perf_guest_cbs->read_virt);
DECLARE_STATIC_CALL(__perf_guest_handle_intel_pt_intr, *perf_guest_cbs->handle_intel_pt_intr);

static inline unsigned int perf_guest_state(void)
@@ -1510,6 +1517,14 @@ static inline unsigned long perf_guest_get_ip(void)
{
return static_call(__perf_guest_get_ip)();
}
+static inline bool perf_guest_get_unwind_info(struct perf_kvm_guest_unwind_info *info)
+{
+ return static_call(__perf_guest_get_unwind_info)(info);
+}
+static inline bool perf_guest_read_virt(unsigned long addr, void *dest, unsigned int length)
+{
+ return static_call(__perf_guest_read_virt)(addr, dest, length);
+}
static inline unsigned int perf_guest_handle_intel_pt_intr(void)
{
return static_call(__perf_guest_handle_intel_pt_intr)();
@@ -1519,6 +1534,8 @@ extern void perf_unregister_guest_info_callbacks(struct perf_guest_info_callbacks *cbs);
#else
static inline unsigned int perf_guest_state(void) { return 0; }
static inline unsigned long perf_guest_get_ip(void) { return 0; }
+static inline bool perf_guest_get_unwind_info(struct perf_kvm_guest_unwind_info *info) { return false; }
+static inline bool perf_guest_read_virt(unsigned long addr, void *dest, unsigned int len) { return false; }
static inline unsigned int perf_guest_handle_intel_pt_intr(void) { return 0; }
#endif /* CONFIG_GUEST_PERF_EVENTS */

diff --git a/kernel/events/core.c b/kernel/events/core.c
index b704d83a28b2..4c5e35006217 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -6807,6 +6807,8 @@ struct perf_guest_info_callbacks __rcu *perf_guest_cbs;

DEFINE_STATIC_CALL_RET0(__perf_guest_state, *perf_guest_cbs->state);
DEFINE_STATIC_CALL_RET0(__perf_guest_get_ip, *perf_guest_cbs->get_ip);
+DEFINE_STATIC_CALL_RET0(__perf_guest_get_unwind_info, *perf_guest_cbs->get_unwind_info);
+DEFINE_STATIC_CALL_RET0(__perf_guest_read_virt, *perf_guest_cbs->read_virt);
DEFINE_STATIC_CALL_RET0(__perf_guest_handle_intel_pt_intr, *perf_guest_cbs->handle_intel_pt_intr);

void perf_register_guest_info_callbacks(struct perf_guest_info_callbacks *cbs)
@@ -6818,6 +6820,12 @@ void perf_register_guest_info_callbacks(struct perf_guest_info_callbacks *cbs)
static_call_update(__perf_guest_state, cbs->state);
static_call_update(__perf_guest_get_ip, cbs->get_ip);

+ if (cbs->get_unwind_info)
+ static_call_update(__perf_guest_get_unwind_info, cbs->get_unwind_info);
+
+ if (cbs->read_virt)
+ static_call_update(__perf_guest_read_virt, cbs->read_virt);
+
/* Implementing ->handle_intel_pt_intr is optional. */
if (cbs->handle_intel_pt_intr)
static_call_update(__perf_guest_handle_intel_pt_intr,
@@ -6833,6 +6841,8 @@ void perf_unregister_guest_info_callbacks(struct perf_guest_info_callbacks *cbs)
rcu_assign_pointer(perf_guest_cbs, NULL);
static_call_update(__perf_guest_state, (void *)&__static_call_return0);
static_call_update(__perf_guest_get_ip, (void *)&__static_call_return0);
+ static_call_update(__perf_guest_get_unwind_info, (void *)&__static_call_return0);
+ static_call_update(__perf_guest_read_virt, (void *)&__static_call_return0);
static_call_update(__perf_guest_handle_intel_pt_intr,
(void *)&__static_call_return0);
synchronize_rcu();
--
2.34.1

2023-12-10 08:16:59

by Tianyi Liu

Subject: [PATCH v3 5/5] perf tools: Support PERF_CONTEXT_GUEST_* flags

The `perf` utility currently has an incomplete implementation for the
following event flags; events carrying these flags are dropped or
identified as unknown types:

`PERF_CONTEXT_GUEST_KERNEL`
`PERF_CONTEXT_GUEST_USER`

This patch makes `perf script`, `perf timechart` and `perf data` correctly
identify these flags.
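
For reference, these markers are synthetic entries embedded in the
callchain ip stream; the relevant constants come from
include/uapi/linux/perf_event.h (excerpt):

  enum perf_callchain_context {
          PERF_CONTEXT_HV                 = (__u64)-32,
          PERF_CONTEXT_KERNEL             = (__u64)-128,
          PERF_CONTEXT_USER               = (__u64)-512,

          PERF_CONTEXT_GUEST              = (__u64)-2048,
          PERF_CONTEXT_GUEST_KERNEL       = (__u64)-2176,
          PERF_CONTEXT_GUEST_USER         = (__u64)-2560,

          PERF_CONTEXT_MAX                = (__u64)-4095,
  };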

Signed-off-by: Tianyi Liu <[email protected]>
---
tools/perf/builtin-timechart.c | 6 ++++++
tools/perf/util/data-convert-json.c | 6 ++++++
tools/perf/util/machine.c | 6 ++++++
3 files changed, 18 insertions(+)

diff --git a/tools/perf/builtin-timechart.c b/tools/perf/builtin-timechart.c
index 19d4542ea18a..6a368b6a323e 100644
--- a/tools/perf/builtin-timechart.c
+++ b/tools/perf/builtin-timechart.c
@@ -536,6 +536,12 @@ static const char *cat_backtrace(union perf_event *event,
case PERF_CONTEXT_USER:
cpumode = PERF_RECORD_MISC_USER;
break;
+ case PERF_CONTEXT_GUEST_KERNEL:
+ cpumode = PERF_RECORD_MISC_GUEST_KERNEL;
+ break;
+ case PERF_CONTEXT_GUEST_USER:
+ cpumode = PERF_RECORD_MISC_GUEST_USER;
+ break;
default:
pr_debug("invalid callchain context: "
"%"PRId64"\n", (s64) ip);
diff --git a/tools/perf/util/data-convert-json.c b/tools/perf/util/data-convert-json.c
index 5bb3c2ba95ca..62686f78d973 100644
--- a/tools/perf/util/data-convert-json.c
+++ b/tools/perf/util/data-convert-json.c
@@ -205,6 +205,12 @@ static int process_sample_event(struct perf_tool *tool,
case PERF_CONTEXT_USER:
cpumode = PERF_RECORD_MISC_USER;
break;
+ case PERF_CONTEXT_GUEST_KERNEL:
+ cpumode = PERF_RECORD_MISC_GUEST_KERNEL;
+ break;
+ case PERF_CONTEXT_GUEST_USER:
+ cpumode = PERF_RECORD_MISC_GUEST_USER;
+ break;
default:
pr_debug("invalid callchain context: %"
PRId64 "\n", (s64) ip);
diff --git a/tools/perf/util/machine.c b/tools/perf/util/machine.c
index 90c750150b19..28eac11d0f61 100644
--- a/tools/perf/util/machine.c
+++ b/tools/perf/util/machine.c
@@ -2343,6 +2343,12 @@ static int add_callchain_ip(struct thread *thread,
case PERF_CONTEXT_USER:
*cpumode = PERF_RECORD_MISC_USER;
break;
+ case PERF_CONTEXT_GUEST_KERNEL:
+ *cpumode = PERF_RECORD_MISC_GUEST_KERNEL;
+ break;
+ case PERF_CONTEXT_GUEST_USER:
+ *cpumode = PERF_RECORD_MISC_GUEST_USER;
+ break;
default:
pr_debug("invalid callchain context: "
"%"PRId64"\n", (s64) ip);
--
2.34.1

2023-12-10 12:16:49

by Marc Zyngier

Subject: Re: [PATCH v3 1/5] KVM: Add arch specific interfaces for sampling guest callchains

On Sun, 10 Dec 2023 08:12:18 +0000,
Tianyi Liu <[email protected]> wrote:
>
> This patch adds two architecture-specific interfaces used by `perf kvm`:
>
> - kvm_arch_vcpu_get_unwind_info: Return the data required for unwinding
>   in a single call, including the instruction pointer, the frame pointer,
>   whether the guest vCPU is running in 32- or 64-bit mode, and, where
>   applicable, the base addresses of the segments.
>
> - kvm_arch_vcpu_read_virt: Read data from a virtual address of the
>   guest VM.
>
> `perf_kvm.h` has been added to the `include/linux/` directory to store
> the interface structures between the perf events subsystem and the KVM
> subsystem.
>
> Since arm64 does not yet provide some foundational infrastructure, the
> arm64 implementation is stubbed out for now because it is a bit complex.

It's not complex. It is *unsafe*. Do you see the difference?

>
> The above interfaces require architecture support for
> `CONFIG_GUEST_PERF_EVENTS`, which currently only x86 and arm64 implement.
> Other architectures will need to implement these interfaces when they
> enable `CONFIG_GUEST_PERF_EVENTS`.
>
> In terms of safety, guests are treated as strictly read-only in this
> feature, and we never inject page faults into the guests, ensuring that
> profiling does not interfere with them. In extremely rare cases, if the
> guest is concurrently modifying its page tables, there is a possibility of
> reading incorrect data. Additionally, if certain programs running in the
> guest OS are built without frame pointers, some erroneous data may be
> recorded. Such erroneous data eventually appears as `[unknown]` entries in
> the report. For profiling purposes, it is sufficient that most of the
> records are correct.
>
> Signed-off-by: Tianyi Liu <[email protected]>
> ---
> MAINTAINERS | 1 +
> arch/arm64/kvm/arm.c | 12 ++++++++++++
> arch/x86/kvm/x86.c | 24 ++++++++++++++++++++++++
> include/linux/kvm_host.h | 5 +++++
> include/linux/perf_kvm.h | 18 ++++++++++++++++++
> 5 files changed, 60 insertions(+)
> create mode 100644 include/linux/perf_kvm.h
>
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 788be9ab5b73..5ee36b4a9701 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -16976,6 +16976,7 @@ F: arch/*/kernel/*/perf_event*.c
> F: arch/*/kernel/perf_callchain.c
> F: arch/*/kernel/perf_event*.c
> F: include/linux/perf_event.h
> +F: include/linux/perf_kvm.h
> F: include/uapi/linux/perf_event.h
> F: kernel/events/*
> F: tools/lib/perf/
> diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
> index e5f75f1f1085..5ae74b5c263a 100644
> --- a/arch/arm64/kvm/arm.c
> +++ b/arch/arm64/kvm/arm.c
> @@ -574,6 +574,18 @@ unsigned long kvm_arch_vcpu_get_ip(struct kvm_vcpu *vcpu)
> {
> return *vcpu_pc(vcpu);
> }
> +
> +bool kvm_arch_vcpu_get_unwind_info(struct kvm_vcpu *vcpu, struct perf_kvm_guest_unwind_info *info)
> +{
> + /* TODO: implement */
> + return false;
> +}
> +
> +bool kvm_arch_vcpu_read_virt(struct kvm_vcpu *vcpu, gva_t addr, void *dest, unsigned int length)
> +{
> + /* TODO: implement */
> + return false;
> +}

I don't do it very often, but the only thing I can say about this is
*NAK*.

You have decided to ignore the previous review comments, which is your
prerogative. However, I absolutely refuse to add half baked and
*dangerous* stuff to the arm64's version of KVM.

If you can convince the x86 folks that they absolutely want this, fine
by me. But this needs to be a buy-in interface, not something that is
required for each and every architecture to have stubs, wrongly
suggesting that extra work is needed.

For arm64, the way to go is to have this in userspace. Which is both
easy to implement and safe. And until we have such a userspace
implementation as a baseline, I will not consider a kernel
version.

M.

--
Without deviation from the norm, progress is not possible.

2023-12-11 22:57:15

by Sean Christopherson

Subject: Re: [PATCH v3 1/5] KVM: Add arch specific interfaces for sampling guest callchains

On Sun, Dec 10, 2023, Marc Zyngier wrote:
> On Sun, 10 Dec 2023 08:12:18 +0000, Tianyi Liu <[email protected]> wrote:
> > +bool kvm_arch_vcpu_read_virt(struct kvm_vcpu *vcpu, gva_t addr, void *dest, unsigned int length)
> > +{
> > + /* TODO: implement */
> > + return false;
> > +}
>
> I don't do it very often, but the only thing I can say about this is
> *NAK*.
>
> You have decided to ignore the previous review comments, which is your
> prerogative. However, I absolutely refuse to add half baked and
> *dangerous* stuff to the arm64's version of KVM.
>
> If you can convince the x86 folks that they absolutely want this, fine
> by me. But this needs to be a buy-in interface, not something that is
> required for each and every architecture to have stubs, wrongly
> suggesting that extra work is needed.
>
> For arm64, the way to go is to have this in userspace. Which is both
> easy to implement and safe. And until we have such a userspace
> implementation as a baseline, I will not consider a kernel
> version.

I too want more justification of why this needs to be handled in the kernel[*].
The usefulness of this is dubious for many modern setups/workloads, and outright
undesirable for some, e.g. many (most?) cloud providers want to make it all but
impossible to access customer data.

[*] https://lore.kernel.org/all/[email protected]

2023-12-12 15:39:50

by Sean Christopherson

Subject: Re: [PATCH v3 4/5] perf kvm: Support sampling guest callchains

On Sun, Dec 10, 2023, Tianyi Liu wrote:
> This patch provides support for sampling guests' callchains.
>
> The signature of `get_perf_callchain` has been modified to explicitly
> specify whether the host or the guest callchain needs to be sampled. Based
> on the context, `get_perf_callchain` dispatches each sampling request to
> one of `perf_callchain_user`, `perf_callchain_kernel`, or
> `perf_callchain_guest`.
>
> The reason for implementing `perf_callchain_user` and
> `perf_callchain_kernel` separately is that the kernel may use special
> unwinders such as `ORC`. For the guest, however, only frame-pointer-based
> unwinding is supported, so the implementation is generic and only needs
> separate variants for 32-bit and 64-bit guests.
>
> Signed-off-by: Tianyi Liu <[email protected]>
> ---
> arch/x86/events/core.c | 63 ++++++++++++++++++++++++++++++++------
> include/linux/perf_event.h | 3 +-
> kernel/bpf/stackmap.c | 8 ++---
> kernel/events/callchain.c | 27 +++++++++++++++-
> kernel/events/core.c | 7 ++++-
> 5 files changed, 91 insertions(+), 17 deletions(-)
>
> diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
> index 40ad1425ffa2..4ff412225217 100644
> --- a/arch/x86/events/core.c
> +++ b/arch/x86/events/core.c
> @@ -2758,11 +2758,6 @@ perf_callchain_kernel(struct perf_callchain_entry_ctx *entry, struct pt_regs *regs)
> struct unwind_state state;
> unsigned long addr;
>
> - if (perf_guest_state()) {
> - /* TODO: We don't support guest os callchain now */
> - return;
> - }
> -
> if (perf_callchain_store(entry, regs->ip))
> return;
>
> @@ -2778,6 +2773,59 @@ perf_callchain_kernel(struct perf_callchain_entry_ctx *entry, struct pt_regs *regs)
> }
> }
>
> +static inline void
> +perf_callchain_guest32(struct perf_callchain_entry_ctx *entry,
> + const struct perf_kvm_guest_unwind_info *unwind_info)
> +{
> + unsigned long ss_base, cs_base;
> + struct stack_frame_ia32 frame;
> + const struct stack_frame_ia32 *fp;
> +
> + cs_base = unwind_info->segment_cs_base;
> + ss_base = unwind_info->segment_ss_base;
> +
> + fp = (void *)(ss_base + unwind_info->frame_pointer);
> + while (fp && entry->nr < entry->max_stack) {
> + if (!perf_guest_read_virt((unsigned long)&fp->next_frame,

This is extremely confusing and potentially dangerous. ss_base and
unwind_info->frame_pointer are *guest* SS:RBP, i.e. this is referencing a guest
virtual address. It works, but it _looks_ like the code is fully dereferencing
a guest virtual address in the host kernel. And I can only imagine what type of
speculative accesses this generates.

*If* we want to support guest callchains, I think it would make more sense to
have a single hook for KVM/virtualization to fill perf_callchain_entry_ctx. Then
there's no need for "struct perf_kvm_guest_unwind_info", perf doesn't need a hook
to read guest memory, and KVM can decide/control what to do with respect to
mitigating speculation issues.
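
Something along these lines, as a rough sketch (hypothetical interface for
illustration only, not part of this series):

  /*
   * Hypothetical alternative: a single callback where KVM walks the guest
   * stack itself, so perf never dereferences guest addresses directly.
   */
  struct perf_guest_info_callbacks {
          unsigned int    (*state)(void);
          unsigned long   (*get_ip)(void);
          /* Fill @entry with guest ips; KVM applies any speculation mitigations. */
          void            (*get_callchain)(struct perf_callchain_entry_ctx *entry);
          unsigned int    (*handle_intel_pt_intr)(void);
  };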

> + &frame.next_frame, sizeof(frame.next_frame)))
> + break;
> + if (!perf_guest_read_virt((unsigned long)&fp->return_address,
> + &frame.return_address, sizeof(frame.return_address)))
> + break;
> + perf_callchain_store(entry, cs_base + frame.return_address);
> + fp = (void *)(ss_base + frame.next_frame);
> + }
> +}