2010-04-14 09:05:35

by Yanmin Zhang

Subject: [PATCH V3] perf & kvm: Enhance perf to collect KVM guest os statistics from host side

Here is the new V3 patch, against tip/master of April 13th,
in case anyone wants to try it.

ChangeLog V3:
1) Add the --guestmount=/dir/to/all/guestos parameter. The admin mounts guest os
root directories under /dir/to/all/guestos with sshfs, one subdirectory per QEMU
process id. For example, I started 2 guest oses; one's pid is 8888 and the other's is 9999.
#mkdir ~/guestmount; cd ~/guestmount
#sshfs -o allow_other,direct_io -p 5551 localhost:/ 8888/
#sshfs -o allow_other,direct_io -p 5552 localhost:/ 9999/
#perf kvm --host --guest --guestmount=~/guestmount top

The old --guestkallsyms and --guestmodules parameters are still supported as the
default guest os symbol-parsing method.

2) Add guest os buildid support.
3) Add sub command 'perf kvm buildid-list'.
4) Delete sub command 'perf kvm stat', because the current implementation
doesn't pass the guest/host selection to the kernel, and the kernel always
collects both host and guest statistics, so regular 'perf stat' is sufficient.
5) Fix a couple of perf bugs.
6) We still have no support for 'any' as a command parameter, since current KVM
identifies a specific guest os instance only by process id. Users can use
parameter -p to collect statistics for a specific guest os instance.

ChangeLog V2:
1) Based on Avi's suggestion, I moved the callback functions
into the generic code area, so the kernel part of the patch is
cleaner.
2) Add 'perf kvm stat'.


From: Zhang, Yanmin <[email protected]>

Based on the discussion in the KVM community, I worked out this patch to let
perf collect guest os statistics from the host side. It was implemented with
kind help from Ingo, Peter and some other guys. Yang Sheng pointed out a
critical bug and, along with others, provided good suggestions. I really
appreciate their kind help.

The patch adds a new sub command, kvm, to perf:

perf kvm top
perf kvm record
perf kvm report
perf kvm diff
perf kvm buildid-list

The new perf can profile the guest os kernel but not guest os user space;
it can, however, summarize guest os user space utilization per guest os.

Below are some examples.
1) perf kvm top
[root@lkp-ne01 norm]# perf kvm --host --guest --guestkallsyms=/home/ymzhang/guest/kallsyms
--guestmodules=/home/ymzhang/guest/modules top

--------------------------------------------------------------------------------------------------------------------------
PerfTop: 16010 irqs/sec kernel:59.1% us: 1.5% guest kernel:31.9% guest us: 7.5% exact: 0.0% [1000Hz cycles], (all, 16 CPUs)
--------------------------------------------------------------------------------------------------------------------------

samples pcnt function DSO
_______ _____ _________________________ _______________________

38770.00 20.4% __ticket_spin_lock [guest.kernel.kallsyms]
22560.00 11.9% ftrace_likely_update [kernel.kallsyms]
9208.00 4.8% __lock_acquire [kernel.kallsyms]
5473.00 2.9% trace_hardirqs_off_caller [kernel.kallsyms]
5222.00 2.7% copy_user_generic_string [guest.kernel.kallsyms]
4450.00 2.3% validate_chain [kernel.kallsyms]
4262.00 2.2% trace_hardirqs_on_caller [kernel.kallsyms]
4239.00 2.2% do_raw_spin_lock [kernel.kallsyms]
3548.00 1.9% do_raw_spin_unlock [kernel.kallsyms]
2487.00 1.3% lock_release [kernel.kallsyms]
2165.00 1.1% __local_bh_disable [kernel.kallsyms]
1905.00 1.0% check_chain_key [kernel.kallsyms]
1737.00 0.9% lock_acquire [kernel.kallsyms]
1604.00 0.8% tcp_recvmsg [kernel.kallsyms]
1524.00 0.8% mark_lock [kernel.kallsyms]
1464.00 0.8% schedule [kernel.kallsyms]
1423.00 0.7% __d_lookup [guest.kernel.kallsyms]

If you want to show just host data, omit the --guest parameter.
The headline includes guest os kernel and user space percentages.

2) perf kvm record
[root@lkp-ne01 norm]# perf kvm --host --guest --guestkallsyms=/home/ymzhang/guest/kallsyms
--guestmodules=/home/ymzhang/guest/modules record -f -a sleep 60
[ perf record: Woken up 15 times to write data ]
[ perf record: Captured and wrote 29.385 MB perf.data.kvm (~1283837 samples) ]

3) perf kvm report
3.1) [root@lkp-ne01 norm]# perf kvm --host --guest --guestkallsyms=/home/ymzhang/guest/kallsyms
--guestmodules=/home/ymzhang/guest/modules report --sort pid --showcpuutilization>norm.host.guest.report.pid
# Samples: 424719292247
#
# Overhead sys us guest sys guest us Command: Pid
# ........ .....................
#
50.57% 1.02% 0.00% 39.97% 9.58% qemu-system-x86: 3587
49.32% 1.35% 0.01% 35.20% 12.76% qemu-system-x86: 3347
0.07% 0.07% 0.00% 0.00% 0.00% perf: 5217


Some performance engineers want perf to show sys/us/guest_sys/guest_us per KVM guest
instance, which is actually just a multi-threaded process. The --showcpuutilization sub
parameter above does exactly that.

3.2) [root@lkp-ne01 norm]# perf kvm --host --guest --guestkallsyms=/home/ymzhang/guest/kallsyms
--guestmodules=/home/ymzhang/guest/modules report >norm.host.guest.report
# Samples: 2466991384118
#
# Overhead Command Shared Object Symbol
# ........ ............... ........................................................................ ......
#
29.11% qemu-system-x86 [guest.kernel.kallsyms] [g] __ticket_spin_lock
5.88% tbench_srv [kernel.kallsyms] [k] ftrace_likely_update
5.76% tbench [kernel.kallsyms] [k] ftrace_likely_update
3.88% qemu-system-x86 34c3255482 [u] 0x000034c3255482
1.83% tbench [kernel.kallsyms] [k] __lock_acquire
1.81% tbench_srv [kernel.kallsyms] [k] __lock_acquire
1.38% tbench_srv [kernel.kallsyms] [k] trace_hardirqs_off_caller
1.37% tbench [kernel.kallsyms] [k] trace_hardirqs_off_caller
1.13% qemu-system-x86 [guest.kernel.kallsyms] [g] copy_user_generic_string
1.04% tbench_srv [kernel.kallsyms] [k] validate_chain
1.00% tbench [kernel.kallsyms] [k] trace_hardirqs_on_caller
1.00% tbench_srv [kernel.kallsyms] [k] trace_hardirqs_on_caller
0.95% tbench [kernel.kallsyms] [k] do_raw_spin_lock


[u] means the sample is in guest os user space; [g] means guest os kernel. The other info is self-explanatory.
If a module such as [ext4] is shown, it is a guest kernel module, because native host kernel
modules are resolved from paths like /lib/modules/XXX.

4) A --guestmount example. I started 2 guest oses and ran dbench in the 1st and tbench in the 2nd.
[root@lkp-ne01 norm]#perf kvm --host --guest --guestmount=/home/ymzhang/guestmount/ top
---------------------------------------------------------------------------------------------------------------------------------------
PerfTop: 15972 irqs/sec kernel: 8.3% us: 0.5% guest kernel:73.9% guest us:17.3% exact: 0.0% [1000Hz cycles], (all, 16 CPUs)
---------------------------------------------------------------------------------------------------------------------------------------

samples pcnt function DSO
_______ _____ _________________________ __________________________________________________

32960.00 17.4% __ticket_spin_lock [guest.kernel.kallsyms]
5464.00 2.9% copy_user_generic_string [guest.kernel.kallsyms]
4069.00 2.1% copy_user_generic_string [guest.kernel.kallsyms]
3238.00 1.7% ftrace_likely_update /lib/modules/2.6.34-rc4-tip-yangkvm+/build/vmlinux
2997.00 1.6% __lock_acquire /lib/modules/2.6.34-rc4-tip-yangkvm+/build/vmlinux
2797.00 1.5% tcp_sendmsg [guest.kernel.kallsyms]
2703.00 1.4% schedule [guest.kernel.kallsyms]
2384.00 1.3% __switch_to [guest.kernel.kallsyms]
2125.00 1.1% tcp_ack [guest.kernel.kallsyms]
2045.00 1.1% tcp_recvmsg [guest.kernel.kallsyms]
1862.00 1.0% tcp_transmit_skb [guest.kernel.kallsyms]
1734.00 0.9% __ticket_spin_lock [guest.kernel.kallsyms]
1388.00 0.7% lock_release /lib/modules/2.6.34-rc4-tip-yangkvm+/build/vmlinux
1367.00 0.7% update_curr [guest.kernel.kallsyms]
1339.00 0.7% fget_light [guest.kernel.kallsyms]
1332.00 0.7% put_page [guest.kernel.kallsyms]
1324.00 0.7% ip_queue_xmit [guest.kernel.kallsyms]
1296.00 0.7% __d_lookup [guest.kernel.kallsyms]
1296.00 0.7% tcp_rcv_established [guest.kernel.kallsyms]
1230.00 0.6% tcp_v4_rcv [guest.kernel.kallsyms]
1092.00 0.6% dev_queue_xmit [guest.kernel.kallsyms]
1073.00 0.6% kmem_cache_alloc [guest.kernel.kallsyms]
1066.00 0.6% ip_rcv [guest.kernel.kallsyms]
1049.00 0.6% __inet_lookup_established [guest.kernel.kallsyms]
1048.00 0.6% tcp_write_xmit [guest.kernel.kallsyms]


Below is the patch against the tip/master tree of April 13th.

Signed-off-by: Zhang Yanmin <[email protected]>

---

diff -Nraup linux-2.6_tip0413/arch/x86/include/asm/perf_event.h linux-2.6_tip0413_perfkvm/arch/x86/include/asm/perf_event.h
--- linux-2.6_tip0413/arch/x86/include/asm/perf_event.h 2010-04-14 11:11:03.992966568 +0800
+++ linux-2.6_tip0413_perfkvm/arch/x86/include/asm/perf_event.h 2010-04-14 11:13:17.261881591 +0800
@@ -135,17 +135,10 @@ extern void perf_events_lapic_init(void)
*/
#define PERF_EFLAGS_EXACT (1UL << 3)

-#define perf_misc_flags(regs) \
-({ int misc = 0; \
- if (user_mode(regs)) \
- misc |= PERF_RECORD_MISC_USER; \
- else \
- misc |= PERF_RECORD_MISC_KERNEL; \
- if (regs->flags & PERF_EFLAGS_EXACT) \
- misc |= PERF_RECORD_MISC_EXACT; \
- misc; })
-
-#define perf_instruction_pointer(regs) ((regs)->ip)
+struct pt_regs;
+extern unsigned long perf_instruction_pointer(struct pt_regs *regs);
+extern unsigned long perf_misc_flags(struct pt_regs *regs);
+#define perf_misc_flags(regs) perf_misc_flags(regs)

#else
static inline void init_hw_perf_events(void) { }
diff -Nraup linux-2.6_tip0413/arch/x86/kernel/cpu/perf_event.c linux-2.6_tip0413_perfkvm/arch/x86/kernel/cpu/perf_event.c
--- linux-2.6_tip0413/arch/x86/kernel/cpu/perf_event.c 2010-04-14 11:11:04.825028810 +0800
+++ linux-2.6_tip0413_perfkvm/arch/x86/kernel/cpu/perf_event.c 2010-04-14 17:02:12.198063684 +0800
@@ -1720,6 +1720,11 @@ struct perf_callchain_entry *perf_callch
{
struct perf_callchain_entry *entry;

+ if (perf_guest_cbs && perf_guest_cbs->is_in_guest()) {
+ /* TODO: We don't support guest os callchain now */
+ return NULL;
+ }
+
if (in_nmi())
entry = &__get_cpu_var(pmc_nmi_entry);
else
@@ -1743,3 +1748,30 @@ void perf_arch_fetch_caller_regs(struct
regs->cs = __KERNEL_CS;
local_save_flags(regs->flags);
}
+
+unsigned long perf_instruction_pointer(struct pt_regs *regs)
+{
+ unsigned long ip;
+ if (perf_guest_cbs && perf_guest_cbs->is_in_guest())
+ ip = perf_guest_cbs->get_guest_ip();
+ else
+ ip = instruction_pointer(regs);
+ return ip;
+}
+
+unsigned long perf_misc_flags(struct pt_regs *regs)
+{
+ int misc = 0;
+ if (perf_guest_cbs && perf_guest_cbs->is_in_guest()) {
+ misc |= perf_guest_cbs->is_user_mode() ?
+ PERF_RECORD_MISC_GUEST_USER :
+ PERF_RECORD_MISC_GUEST_KERNEL;
+ } else
+ misc |= user_mode(regs) ? PERF_RECORD_MISC_USER :
+ PERF_RECORD_MISC_KERNEL;
+ if (regs->flags & PERF_EFLAGS_EXACT)
+ misc |= PERF_RECORD_MISC_EXACT;
+
+ return misc;
+}
+
diff -Nraup linux-2.6_tip0413/arch/x86/kvm/x86.c linux-2.6_tip0413_perfkvm/arch/x86/kvm/x86.c
--- linux-2.6_tip0413/arch/x86/kvm/x86.c 2010-04-14 11:11:04.341042024 +0800
+++ linux-2.6_tip0413_perfkvm/arch/x86/kvm/x86.c 2010-04-14 11:32:45.841278890 +0800
@@ -3765,6 +3765,35 @@ static void kvm_timer_init(void)
}
}

+static DEFINE_PER_CPU(struct kvm_vcpu *, current_vcpu);
+
+static int kvm_is_in_guest(void)
+{
+ return percpu_read(current_vcpu) != NULL;
+}
+
+static int kvm_is_user_mode(void)
+{
+ int user_mode = 3;
+ if (percpu_read(current_vcpu))
+ user_mode = kvm_x86_ops->get_cpl(percpu_read(current_vcpu));
+ return user_mode != 0;
+}
+
+static unsigned long kvm_get_guest_ip(void)
+{
+ unsigned long ip = 0;
+ if (percpu_read(current_vcpu))
+ ip = kvm_rip_read(percpu_read(current_vcpu));
+ return ip;
+}
+
+static struct perf_guest_info_callbacks kvm_guest_cbs = {
+ .is_in_guest = kvm_is_in_guest,
+ .is_user_mode = kvm_is_user_mode,
+ .get_guest_ip = kvm_get_guest_ip,
+};
+
int kvm_arch_init(void *opaque)
{
int r;
@@ -3801,6 +3830,8 @@ int kvm_arch_init(void *opaque)

kvm_timer_init();

+ perf_register_guest_info_callbacks(&kvm_guest_cbs);
+
return 0;

out:
@@ -3809,6 +3840,8 @@ out:

void kvm_arch_exit(void)
{
+ perf_unregister_guest_info_callbacks(&kvm_guest_cbs);
+
if (!boot_cpu_has(X86_FEATURE_CONSTANT_TSC))
cpufreq_unregister_notifier(&kvmclock_cpufreq_notifier_block,
CPUFREQ_TRANSITION_NOTIFIER);
@@ -4339,7 +4372,10 @@ static int vcpu_enter_guest(struct kvm_v
}

trace_kvm_entry(vcpu->vcpu_id);
+
+ percpu_write(current_vcpu, vcpu);
kvm_x86_ops->run(vcpu);
+ percpu_write(current_vcpu, NULL);

/*
* If the guest has used debug registers, at least dr7
diff -Nraup linux-2.6_tip0413/include/linux/perf_event.h linux-2.6_tip0413_perfkvm/include/linux/perf_event.h
--- linux-2.6_tip0413/include/linux/perf_event.h 2010-04-14 11:11:16.922212684 +0800
+++ linux-2.6_tip0413_perfkvm/include/linux/perf_event.h 2010-04-14 11:34:33.478072738 +0800
@@ -288,11 +288,13 @@ struct perf_event_mmap_page {
__u64 data_tail; /* user-space written tail */
};

-#define PERF_RECORD_MISC_CPUMODE_MASK (3 << 0)
+#define PERF_RECORD_MISC_CPUMODE_MASK (7 << 0)
#define PERF_RECORD_MISC_CPUMODE_UNKNOWN (0 << 0)
#define PERF_RECORD_MISC_KERNEL (1 << 0)
#define PERF_RECORD_MISC_USER (2 << 0)
#define PERF_RECORD_MISC_HYPERVISOR (3 << 0)
+#define PERF_RECORD_MISC_GUEST_KERNEL (4 << 0)
+#define PERF_RECORD_MISC_GUEST_USER (5 << 0)

#define PERF_RECORD_MISC_EXACT (1 << 14)
/*
@@ -446,6 +448,12 @@ enum perf_callchain_context {
# include <asm/perf_event.h>
#endif

+struct perf_guest_info_callbacks {
+ int (*is_in_guest) (void);
+ int (*is_user_mode) (void);
+ unsigned long (*get_guest_ip) (void);
+};
+
#ifdef CONFIG_HAVE_HW_BREAKPOINT
#include <asm/hw_breakpoint.h>
#endif
@@ -920,6 +928,12 @@ static inline void perf_event_mmap(struc
__perf_event_mmap(vma);
}

+extern struct perf_guest_info_callbacks *perf_guest_cbs;
+extern int perf_register_guest_info_callbacks(
+ struct perf_guest_info_callbacks *);
+extern int perf_unregister_guest_info_callbacks(
+ struct perf_guest_info_callbacks *);
+
extern void perf_event_comm(struct task_struct *tsk);
extern void perf_event_fork(struct task_struct *tsk);

@@ -989,6 +1003,11 @@ perf_sw_event(u32 event_id, u64 nr, int
static inline void
perf_bp_event(struct perf_event *event, void *data) { }

+static inline int perf_register_guest_info_callbacks
+(struct perf_guest_info_callbacks *callbacks) { return 0; }
+static inline int perf_unregister_guest_info_callbacks
+(struct perf_guest_info_callbacks *callbacks) { return 0; }
+
static inline void perf_event_mmap(struct vm_area_struct *vma) { }
static inline void perf_event_comm(struct task_struct *tsk) { }
static inline void perf_event_fork(struct task_struct *tsk) { }
diff -Nraup linux-2.6_tip0413/kernel/perf_event.c linux-2.6_tip0413_perfkvm/kernel/perf_event.c
--- linux-2.6_tip0413/kernel/perf_event.c 2010-04-14 11:12:04.090770764 +0800
+++ linux-2.6_tip0413_perfkvm/kernel/perf_event.c 2010-04-14 11:13:17.265859229 +0800
@@ -2797,6 +2797,27 @@ void perf_arch_fetch_caller_regs(struct


/*
+ * We assume there is only KVM supporting the callbacks.
+ * Later on, we might change it to a list if there is
+ * another virtualization implementation supporting the callbacks.
+ */
+struct perf_guest_info_callbacks *perf_guest_cbs;
+
+int perf_register_guest_info_callbacks(struct perf_guest_info_callbacks *cbs)
+{
+ perf_guest_cbs = cbs;
+ return 0;
+}
+EXPORT_SYMBOL_GPL(perf_register_guest_info_callbacks);
+
+int perf_unregister_guest_info_callbacks(struct perf_guest_info_callbacks *cbs)
+{
+ perf_guest_cbs = NULL;
+ return 0;
+}
+EXPORT_SYMBOL_GPL(perf_unregister_guest_info_callbacks);
+
+/*
* Output
*/
static bool perf_output_space(struct perf_mmap_data *data, unsigned long tail,
@@ -3748,7 +3769,7 @@ void __perf_event_mmap(struct vm_area_st
.event_id = {
.header = {
.type = PERF_RECORD_MMAP,
- .misc = 0,
+ .misc = PERF_RECORD_MISC_USER,
/* .size */
},
/* .pid */
diff -Nraup linux-2.6_tip0413/tools/perf/builtin-annotate.c linux-2.6_tip0413_perfkvm/tools/perf/builtin-annotate.c
--- linux-2.6_tip0413/tools/perf/builtin-annotate.c 2010-04-14 11:11:58.474229259 +0800
+++ linux-2.6_tip0413_perfkvm/tools/perf/builtin-annotate.c 2010-04-14 11:13:17.269859901 +0800
@@ -571,7 +571,7 @@ static int __cmd_annotate(void)
perf_session__fprintf(session, stdout);

if (verbose > 2)
- dsos__fprintf(stdout);
+ dsos__fprintf(&session->kerninfo_root, stdout);

perf_session__collapse_resort(&session->hists);
perf_session__output_resort(&session->hists, session->event_total[0]);
diff -Nraup linux-2.6_tip0413/tools/perf/builtin-buildid-list.c linux-2.6_tip0413_perfkvm/tools/perf/builtin-buildid-list.c
--- linux-2.6_tip0413/tools/perf/builtin-buildid-list.c 2010-04-14 11:11:58.462227060 +0800
+++ linux-2.6_tip0413_perfkvm/tools/perf/builtin-buildid-list.c 2010-04-14 11:13:17.269859901 +0800
@@ -46,7 +46,7 @@ static int __cmd_buildid_list(void)
if (with_hits)
perf_session__process_events(session, &build_id__mark_dso_hit_ops);

- dsos__fprintf_buildid(stdout, with_hits);
+ dsos__fprintf_buildid(&session->kerninfo_root, stdout, with_hits);

perf_session__delete(session);
return err;
diff -Nraup linux-2.6_tip0413/tools/perf/builtin-diff.c linux-2.6_tip0413_perfkvm/tools/perf/builtin-diff.c
--- linux-2.6_tip0413/tools/perf/builtin-diff.c 2010-04-14 11:11:58.426247688 +0800
+++ linux-2.6_tip0413_perfkvm/tools/perf/builtin-diff.c 2010-04-14 11:35:43.245364332 +0800
@@ -33,7 +33,7 @@ static int perf_session__add_hist_entry(
return -ENOMEM;

if (hit)
- he->count += count;
+ __perf_session__add_count(he, al, count);

return 0;
}
@@ -225,6 +225,10 @@ int cmd_diff(int argc, const char **argv
input_new = argv[1];
} else
input_new = argv[0];
+ } else if (symbol_conf.default_guest_vmlinux_name ||
+ symbol_conf.default_guest_kallsyms) {
+ input_old = "perf.data.host";
+ input_new = "perf.data.guest";
}

symbol_conf.exclude_other = false;
diff -Nraup linux-2.6_tip0413/tools/perf/builtin.h linux-2.6_tip0413_perfkvm/tools/perf/builtin.h
--- linux-2.6_tip0413/tools/perf/builtin.h 2010-04-14 11:11:58.234222967 +0800
+++ linux-2.6_tip0413_perfkvm/tools/perf/builtin.h 2010-04-14 11:13:17.313858518 +0800
@@ -32,5 +32,6 @@ extern int cmd_version(int argc, const c
extern int cmd_probe(int argc, const char **argv, const char *prefix);
extern int cmd_kmem(int argc, const char **argv, const char *prefix);
extern int cmd_lock(int argc, const char **argv, const char *prefix);
+extern int cmd_kvm(int argc, const char **argv, const char *prefix);

#endif
diff -Nraup linux-2.6_tip0413/tools/perf/builtin-kmem.c linux-2.6_tip0413_perfkvm/tools/perf/builtin-kmem.c
--- linux-2.6_tip0413/tools/perf/builtin-kmem.c 2010-04-14 11:11:58.806260439 +0800
+++ linux-2.6_tip0413_perfkvm/tools/perf/builtin-kmem.c 2010-04-14 11:39:10.199395473 +0800
@@ -351,6 +351,7 @@ static void __print_result(struct rb_roo
int n_lines, int is_caller)
{
struct rb_node *next;
+ struct kernel_info *kerninfo;

printf("%.102s\n", graph_dotted_line);
printf(" %-34s |", is_caller ? "Callsite": "Alloc Ptr");
@@ -359,6 +360,11 @@ static void __print_result(struct rb_roo

next = rb_first(root);

+ kerninfo = kerninfo__findhost(&session->kerninfo_root);
+ if (!kerninfo) {
+ pr_err("__print_result: couldn't find kernel information\n");
+ return;
+ }
while (next && n_lines--) {
struct alloc_stat *data = rb_entry(next, struct alloc_stat,
node);
@@ -370,7 +376,7 @@ static void __print_result(struct rb_roo
if (is_caller) {
addr = data->call_site;
if (!raw_ip)
- sym = map_groups__find_function(&session->kmaps,
+ sym = map_groups__find_function(&kerninfo->kmaps,
addr, &map, NULL);
} else
addr = data->ptr;
diff -Nraup linux-2.6_tip0413/tools/perf/builtin-kvm.c linux-2.6_tip0413_perfkvm/tools/perf/builtin-kvm.c
--- linux-2.6_tip0413/tools/perf/builtin-kvm.c 1970-01-01 08:00:00.000000000 +0800
+++ linux-2.6_tip0413_perfkvm/tools/perf/builtin-kvm.c 2010-04-14 11:40:06.551652083 +0800
@@ -0,0 +1,145 @@
+#include "builtin.h"
+#include "perf.h"
+
+#include "util/util.h"
+#include "util/cache.h"
+#include "util/symbol.h"
+#include "util/thread.h"
+#include "util/header.h"
+#include "util/session.h"
+
+#include "util/parse-options.h"
+#include "util/trace-event.h"
+
+#include "util/debug.h"
+
+#include <sys/prctl.h>
+
+#include <semaphore.h>
+#include <pthread.h>
+#include <math.h>
+
+static char *file_name = NULL;
+static char name_buffer[256];
+
+int perf_host = 1;
+int perf_guest = 0;
+
+static const char * const kvm_usage[] = {
+ "perf kvm [<options>] {top|record|report|diff}",
+ NULL
+};
+
+static const struct option kvm_options[] = {
+ OPT_STRING('i', "input", &file_name, "file",
+ "Input file name"),
+ OPT_STRING('o', "output", &file_name, "file",
+ "Output file name"),
+ OPT_BOOLEAN(0, "guest", &perf_guest,
+ "Collect guest os data"),
+ OPT_BOOLEAN(0, "host", &perf_host,
+ "Collect host os data"),
+ OPT_STRING(0, "guestmount", &symbol_conf.guestmount, "directory",
+ "guest mount directory under which every guest os instance has a subdir"),
+ OPT_STRING(0, "guestvmlinux", &symbol_conf.default_guest_vmlinux_name, "file",
+ "file saving guest os vmlinux"),
+ OPT_STRING(0, "guestkallsyms", &symbol_conf.default_guest_kallsyms, "file",
+ "file saving guest os /proc/kallsyms"),
+ OPT_STRING(0, "guestmodules", &symbol_conf.default_guest_modules, "file",
+ "file saving guest os /proc/modules"),
+ OPT_END()
+};
+
+static int __cmd_record(int argc, const char **argv)
+{
+ int rec_argc, i = 0, j;
+ const char **rec_argv;
+
+ rec_argc = argc + 2;
+ rec_argv = calloc(rec_argc + 1, sizeof(char *));
+ rec_argv[i++] = strdup("record");
+ rec_argv[i++] = strdup("-o");
+ rec_argv[i++] = strdup(file_name);
+ for (j = 1; j < argc; j++, i++)
+ rec_argv[i] = argv[j];
+
+ BUG_ON(i != rec_argc);
+
+ return cmd_record(i, rec_argv, NULL);
+}
+
+static int __cmd_report(int argc, const char **argv)
+{
+ int rec_argc, i = 0, j;
+ const char **rec_argv;
+
+ rec_argc = argc + 2;
+ rec_argv = calloc(rec_argc + 1, sizeof(char *));
+ rec_argv[i++] = strdup("report");
+ rec_argv[i++] = strdup("-i");
+ rec_argv[i++] = strdup(file_name);
+ for (j = 1; j < argc; j++, i++)
+ rec_argv[i] = argv[j];
+
+ BUG_ON(i != rec_argc);
+
+ return cmd_report(i, rec_argv, NULL);
+}
+
+static int __cmd_buildid_list(int argc, const char **argv)
+{
+ int rec_argc, i = 0, j;
+ const char **rec_argv;
+
+ rec_argc = argc + 2;
+ rec_argv = calloc(rec_argc + 1, sizeof(char *));
+ rec_argv[i++] = strdup("buildid-list");
+ rec_argv[i++] = strdup("-i");
+ rec_argv[i++] = strdup(file_name);
+ for (j = 1; j < argc; j++, i++)
+ rec_argv[i] = argv[j];
+
+ BUG_ON(i != rec_argc);
+
+ return cmd_buildid_list(i, rec_argv, NULL);
+}
+
+int cmd_kvm(int argc, const char **argv, const char *prefix __used)
+{
+ perf_host = perf_guest = 0;
+
+ argc = parse_options(argc, argv, kvm_options, kvm_usage,
+ PARSE_OPT_STOP_AT_NON_OPTION);
+ if (!argc)
+ usage_with_options(kvm_usage, kvm_options);
+
+ if (!perf_host)
+ perf_guest = 1;
+
+ if (!file_name) {
+ if (perf_host && !perf_guest)
+ sprintf(name_buffer, "perf.data.host");
+ else if (!perf_host && perf_guest)
+ sprintf(name_buffer, "perf.data.guest");
+ else
+ sprintf(name_buffer, "perf.data.kvm");
+ file_name = name_buffer;
+ }
+
+ if (!strncmp(argv[0], "rec", 3)) {
+ return __cmd_record(argc, argv);
+ } else if (!strncmp(argv[0], "rep", 3)) {
+ return __cmd_report(argc, argv);
+ } else if (!strncmp(argv[0], "diff", 4)) {
+ return cmd_diff(argc, argv, NULL);
+ } else if (!strncmp(argv[0], "top", 3)) {
+ return cmd_top(argc, argv, NULL);
+ } else if (!strncmp(argv[0], "buildid-list", 12)) {
+ return __cmd_buildid_list(argc, argv);
+ } else {
+ usage_with_options(kvm_usage, kvm_options);
+ }
+
+ return 0;
+}
+
diff -Nraup linux-2.6_tip0413/tools/perf/builtin-record.c linux-2.6_tip0413_perfkvm/tools/perf/builtin-record.c
--- linux-2.6_tip0413/tools/perf/builtin-record.c 2010-04-14 11:11:58.806260439 +0800
+++ linux-2.6_tip0413_perfkvm/tools/perf/builtin-record.c 2010-04-14 14:11:09.625252460 +0800
@@ -426,6 +426,52 @@ static void atexit_header(void)
perf_header__write(&session->header, output, true);
}

+static void event__synthesize_guest_os(struct kernel_info *kerninfo,
+ void *data __attribute__((unused)))
+{
+ int err;
+ char *guest_kallsyms;
+ char path[PATH_MAX];
+
+ if (is_host_kernel(kerninfo))
+ return;
+
+ /*
+ *As for guest kernel when processing subcommand record&report,
+ *we arrange module mmap prior to guest kernel mmap and trigger
+ *a preload dso because default guest module symbols are loaded
+ *from guest kallsyms instead of /lib/modules/XXX/XXX. This
+ *method is used to avoid symbol missing when the first addr is
+ *in module instead of in guest kernel.
+ */
+ err = event__synthesize_modules(process_synthesized_event,
+ session,
+ kerninfo);
+ if (err < 0)
+ pr_err("Couldn't record guest kernel [%d]'s reference"
+ " relocation symbol.\n", kerninfo->pid);
+
+ if (is_default_guest(kerninfo))
+ guest_kallsyms = (char *) symbol_conf.default_guest_kallsyms;
+ else {
+ sprintf(path, "%s/proc/kallsyms", kerninfo->root_dir);
+ guest_kallsyms = path;
+ }
+
+ /*
+ * We use _stext for guest kernel because guest kernel's /proc/kallsyms
+ * have no _text sometimes.
+ */
+ err = event__synthesize_kernel_mmap(process_synthesized_event,
+ session, kerninfo, "_text");
+ if (err < 0)
+ err = event__synthesize_kernel_mmap(process_synthesized_event,
+ session, kerninfo, "_stext");
+ if (err < 0)
+ pr_err("Couldn't record guest kernel [%d]'s reference"
+ " relocation symbol.\n", kerninfo->pid);
+}
+
static int __cmd_record(int argc, const char **argv)
{
int i, counter;
@@ -437,6 +483,7 @@ static int __cmd_record(int argc, const
int child_ready_pipe[2], go_pipe[2];
const bool forks = argc > 0;
char buf;
+ struct kernel_info *kerninfo;

page_size = sysconf(_SC_PAGE_SIZE);

@@ -572,21 +619,31 @@ static int __cmd_record(int argc, const

post_processing_offset = lseek(output, 0, SEEK_CUR);

+ kerninfo = kerninfo__findhost(&session->kerninfo_root);
+ if (!kerninfo) {
+ pr_err("Couldn't find native kernel information.\n");
+ return -1;
+ }
+
err = event__synthesize_kernel_mmap(process_synthesized_event,
- session, "_text");
+ session, kerninfo, "_text");
if (err < 0)
err = event__synthesize_kernel_mmap(process_synthesized_event,
- session, "_stext");
+ session, kerninfo, "_stext");
if (err < 0) {
pr_err("Couldn't record kernel reference relocation symbol.\n");
return err;
}

- err = event__synthesize_modules(process_synthesized_event, session);
+ err = event__synthesize_modules(process_synthesized_event,
+ session, kerninfo);
if (err < 0) {
pr_err("Couldn't record kernel reference relocation symbol.\n");
return err;
}
+ if (perf_guest)
+ kerninfo__process_allkernels(&session->kerninfo_root,
+ event__synthesize_guest_os, session);

if (!system_wide && profile_cpu == -1)
event__synthesize_thread(target_tid, process_synthesized_event,
diff -Nraup linux-2.6_tip0413/tools/perf/builtin-report.c linux-2.6_tip0413_perfkvm/tools/perf/builtin-report.c
--- linux-2.6_tip0413/tools/perf/builtin-report.c 2010-04-14 11:11:58.462227060 +0800
+++ linux-2.6_tip0413_perfkvm/tools/perf/builtin-report.c 2010-04-14 11:13:17.313858518 +0800
@@ -108,7 +108,7 @@ static int perf_session__add_hist_entry(
return -ENOMEM;

if (hit)
- he->count += data->period;
+ __perf_session__add_count(he, al, data->period);

if (symbol_conf.use_callchain) {
if (!hit)
@@ -300,7 +300,7 @@ static int __cmd_report(void)
perf_session__fprintf(session, stdout);

if (verbose > 2)
- dsos__fprintf(stdout);
+ dsos__fprintf(&session->kerninfo_root, stdout);

next = rb_first(&session->stats_by_id);
while (next) {
@@ -437,6 +437,8 @@ static const struct option options[] = {
"sort by key(s): pid, comm, dso, symbol, parent"),
OPT_BOOLEAN('P', "full-paths", &symbol_conf.full_paths,
"Don't shorten the pathnames taking into account the cwd"),
+ OPT_BOOLEAN(0, "showcpuutilization", &symbol_conf.show_cpu_utilization,
+ "Show sample percentage for different cpu modes"),
OPT_STRING('p', "parent", &parent_pattern, "regex",
"regex filter to identify parent, see: '--sort parent'"),
OPT_BOOLEAN('x', "exclude-other", &symbol_conf.exclude_other,
diff -Nraup linux-2.6_tip0413/tools/perf/builtin-top.c linux-2.6_tip0413_perfkvm/tools/perf/builtin-top.c
--- linux-2.6_tip0413/tools/perf/builtin-top.c 2010-04-14 11:11:58.458238567 +0800
+++ linux-2.6_tip0413_perfkvm/tools/perf/builtin-top.c 2010-04-14 14:28:14.576215651 +0800
@@ -420,8 +420,9 @@ static double sym_weight(const struct sy
}

static long samples;
-static long userspace_samples;
+static long kernel_samples, us_samples;
static long exact_samples;
+static long guest_us_samples, guest_kernel_samples;
static const char CONSOLE_CLEAR[] = "";

static void __list_insert_active_sym(struct sym_entry *syme)
@@ -461,7 +462,10 @@ static void print_sym_table(void)
int printed = 0, j;
int counter, snap = !display_weighted ? sym_counter : 0;
float samples_per_sec = samples/delay_secs;
- float ksamples_per_sec = (samples-userspace_samples)/delay_secs;
+ float ksamples_per_sec = kernel_samples/delay_secs;
+ float us_samples_per_sec = (us_samples)/delay_secs;
+ float guest_kernel_samples_per_sec = (guest_kernel_samples)/delay_secs;
+ float guest_us_samples_per_sec = (guest_us_samples)/delay_secs;
float esamples_percent = (100.0*exact_samples)/samples;
float sum_ksamples = 0.0;
struct sym_entry *syme, *n;
@@ -470,7 +474,8 @@ static void print_sym_table(void)
int sym_width = 0, dso_width = 0, dso_short_width = 0;
const int win_width = winsize.ws_col - 1;

- samples = userspace_samples = exact_samples = 0;
+ samples = us_samples = kernel_samples = exact_samples = 0;
+ guest_kernel_samples = guest_us_samples = 0;

/* Sort the active symbols */
pthread_mutex_lock(&active_symbols_lock);
@@ -501,10 +506,21 @@ static void print_sym_table(void)
puts(CONSOLE_CLEAR);

printf("%-*.*s\n", win_width, win_width, graph_dotted_line);
- printf( " PerfTop:%8.0f irqs/sec kernel:%4.1f%% exact: %4.1f%% [",
- samples_per_sec,
- 100.0 - (100.0*((samples_per_sec-ksamples_per_sec)/samples_per_sec)),
- esamples_percent);
+ if (!perf_guest) {
+ printf( " PerfTop:%8.0f irqs/sec kernel:%4.1f%% exact: %4.1f%% [",
+ samples_per_sec,
+ 100.0 - (100.0*((samples_per_sec-ksamples_per_sec)/samples_per_sec)),
+ esamples_percent);
+ } else {
+ printf( " PerfTop:%8.0f irqs/sec kernel:%4.1f%% us:%4.1f%%"
+ " guest kernel:%4.1f%% guest us:%4.1f%% exact: %4.1f%% [",
+ samples_per_sec,
+ 100.0 - (100.0*((samples_per_sec-ksamples_per_sec)/samples_per_sec)),
+ 100.0 - (100.0*((samples_per_sec-us_samples_per_sec)/samples_per_sec)),
+ 100.0 - (100.0*((samples_per_sec-guest_kernel_samples_per_sec)/samples_per_sec)),
+ 100.0 - (100.0*((samples_per_sec-guest_us_samples_per_sec)/samples_per_sec)),
+ esamples_percent);
+ }

if (nr_counters == 1 || !display_weighted) {
printf("%Ld", (u64)attrs[0].sample_period);
@@ -597,7 +613,6 @@ static void print_sym_table(void)

syme = rb_entry(nd, struct sym_entry, rb_node);
sym = sym_entry__symbol(syme);
-
if (++printed > print_entries || (int)syme->snap_count < count_filter)
continue;

@@ -761,7 +776,7 @@ static int key_mapped(int c)
return 0;
}

-static void handle_keypress(int c)
+static void handle_keypress(struct perf_session *session, int c)
{
if (!key_mapped(c)) {
struct pollfd stdin_poll = { .fd = 0, .events = POLLIN };
@@ -830,7 +845,7 @@ static void handle_keypress(int c)
case 'Q':
printf("exiting.\n");
if (dump_symtab)
- dsos__fprintf(stderr);
+ dsos__fprintf(&session->kerninfo_root, stderr);
exit(0);
case 's':
prompt_symbol(&sym_filter_entry, "Enter details symbol");
@@ -866,6 +881,7 @@ static void *display_thread(void *arg __
struct pollfd stdin_poll = { .fd = 0, .events = POLLIN };
struct termios tc, save;
int delay_msecs, c;
+ struct perf_session *session = (struct perf_session *) arg;

tcgetattr(0, &save);
tc = save;
@@ -886,7 +902,7 @@ repeat:
c = getc(stdin);
tcsetattr(0, TCSAFLUSH, &save);

- handle_keypress(c);
+ handle_keypress(session, c);
goto repeat;

return NULL;
@@ -957,24 +973,46 @@ static void event__process_sample(const
u64 ip = self->ip.ip;
struct sym_entry *syme;
struct addr_location al;
+ struct kernel_info *kerninfo;
u8 origin = self->header.misc & PERF_RECORD_MISC_CPUMODE_MASK;

++samples;

switch (origin) {
case PERF_RECORD_MISC_USER:
- ++userspace_samples;
+ ++us_samples;
if (hide_user_symbols)
return;
+ kerninfo = kerninfo__findhost(&session->kerninfo_root);
break;
case PERF_RECORD_MISC_KERNEL:
+ ++kernel_samples;
if (hide_kernel_symbols)
return;
+ kerninfo = kerninfo__findhost(&session->kerninfo_root);
break;
+ case PERF_RECORD_MISC_GUEST_KERNEL:
+ ++guest_kernel_samples;
+ kerninfo = kerninfo__find(&session->kerninfo_root,
+ self->ip.pid);
+ break;
+ case PERF_RECORD_MISC_GUEST_USER:
+ ++guest_us_samples;
+ /*
+ * TODO: we don't process guest user space samples from
+ * the host side, except for simple counting.
+ */
+ return;
default:
return;
}

+ if (!kerninfo && perf_guest) {
+ pr_err("Can't find guest [%d]'s kernel information\n",
+ self->ip.pid);
+ return;
+ }
+
if (self->header.misc & PERF_RECORD_MISC_EXACT)
exact_samples++;

@@ -994,7 +1032,7 @@ static void event__process_sample(const
* --hide-kernel-symbols, even if the user specifies an
* invalid --vmlinux ;-)
*/
- if (al.map == session->vmlinux_maps[MAP__FUNCTION] &&
+ if (al.map == kerninfo->vmlinux_maps[MAP__FUNCTION] &&
RB_EMPTY_ROOT(&al.map->dso->symbols[MAP__FUNCTION])) {
pr_err("The %s file can't be used\n",
symbol_conf.vmlinux_name);
@@ -1261,7 +1299,7 @@ static int __cmd_top(void)

perf_session__mmap_read(session);

- if (pthread_create(&thread, NULL, display_thread, NULL)) {
+ if (pthread_create(&thread, NULL, display_thread, session)) {
printf("Could not create display thread.\n");
exit(-1);
}
diff -Nraup linux-2.6_tip0413/tools/perf/Makefile linux-2.6_tip0413_perfkvm/tools/perf/Makefile
--- linux-2.6_tip0413/tools/perf/Makefile 2010-04-14 11:11:58.802281816 +0800
+++ linux-2.6_tip0413_perfkvm/tools/perf/Makefile 2010-04-14 11:13:17.313858518 +0800
@@ -472,6 +472,7 @@ BUILTIN_OBJS += $(OUTPUT)builtin-trace.o
BUILTIN_OBJS += $(OUTPUT)builtin-probe.o
BUILTIN_OBJS += $(OUTPUT)builtin-kmem.o
BUILTIN_OBJS += $(OUTPUT)builtin-lock.o
+BUILTIN_OBJS += $(OUTPUT)builtin-kvm.o

PERFLIBS = $(LIB_FILE)

diff -Nraup linux-2.6_tip0413/tools/perf/perf.c linux-2.6_tip0413_perfkvm/tools/perf/perf.c
--- linux-2.6_tip0413/tools/perf/perf.c 2010-04-14 11:11:58.478250552 +0800
+++ linux-2.6_tip0413_perfkvm/tools/perf/perf.c 2010-04-14 11:13:17.313858518 +0800
@@ -307,6 +307,7 @@ static void handle_internal_command(int
{ "probe", cmd_probe, 0 },
{ "kmem", cmd_kmem, 0 },
{ "lock", cmd_lock, 0 },
+ { "kvm", cmd_kvm, 0 },
};
unsigned int i;
static const char ext[] = STRIP_EXTENSION;
diff -Nraup linux-2.6_tip0413/tools/perf/perf.h linux-2.6_tip0413_perfkvm/tools/perf/perf.h
--- linux-2.6_tip0413/tools/perf/perf.h 2010-04-14 11:11:58.810277694 +0800
+++ linux-2.6_tip0413_perfkvm/tools/perf/perf.h 2010-04-14 11:13:17.313858518 +0800
@@ -131,4 +131,6 @@ struct ip_callchain {
u64 ips[0];
};

+extern int perf_host, perf_guest;
+
#endif
diff -Nraup linux-2.6_tip0413/tools/perf/util/build-id.c linux-2.6_tip0413_perfkvm/tools/perf/util/build-id.c
--- linux-2.6_tip0413/tools/perf/util/build-id.c 2010-04-14 11:11:58.654213263 +0800
+++ linux-2.6_tip0413_perfkvm/tools/perf/util/build-id.c 2010-04-14 11:13:17.317861518 +0800
@@ -24,7 +24,7 @@ static int build_id__mark_dso_hit(event_
}

thread__find_addr_map(thread, session, cpumode, MAP__FUNCTION,
- event->ip.ip, &al);
+ event->ip.pid, event->ip.ip, &al);

if (al.map != NULL)
al.map->dso->hit = 1;
diff -Nraup linux-2.6_tip0413/tools/perf/util/event.c linux-2.6_tip0413_perfkvm/tools/perf/util/event.c
--- linux-2.6_tip0413/tools/perf/util/event.c 2010-04-14 11:11:58.662259868 +0800
+++ linux-2.6_tip0413_perfkvm/tools/perf/util/event.c 2010-04-14 15:33:50.903104472 +0800
@@ -112,7 +112,11 @@ static int event__synthesize_mmap_events
event_t ev = {
.header = {
.type = PERF_RECORD_MMAP,
- .misc = 0, /* Just like the kernel, see kernel/perf_event.c __perf_event_mmap */
+ /*
+ * Just like the kernel, see kernel/perf_event.c
+ * __perf_event_mmap
+ */
+ .misc = PERF_RECORD_MISC_USER,
},
};
int n;
@@ -167,11 +171,23 @@ static int event__synthesize_mmap_events
}

int event__synthesize_modules(event__handler_t process,
- struct perf_session *session)
+ struct perf_session *session,
+ struct kernel_info *kerninfo)
{
struct rb_node *nd;
+ struct map_groups *kmaps = &kerninfo->kmaps;
+ u16 misc;

- for (nd = rb_first(&session->kmaps.maps[MAP__FUNCTION]);
+ /*
+ * kernel uses PERF_RECORD_MISC_USER for user space maps,
+ * see kernel/perf_event.c __perf_event_mmap
+ */
+ if (is_host_kernel(kerninfo))
+ misc = PERF_RECORD_MISC_KERNEL;
+ else
+ misc = PERF_RECORD_MISC_GUEST_KERNEL;
+
+ for (nd = rb_first(&kmaps->maps[MAP__FUNCTION]);
nd; nd = rb_next(nd)) {
event_t ev;
size_t size;
@@ -182,12 +198,13 @@ int event__synthesize_modules(event__han

size = ALIGN(pos->dso->long_name_len + 1, sizeof(u64));
memset(&ev, 0, sizeof(ev));
- ev.mmap.header.misc = 1; /* kernel uses 0 for user space maps, see kernel/perf_event.c __perf_event_mmap */
+ ev.mmap.header.misc = misc;
ev.mmap.header.type = PERF_RECORD_MMAP;
ev.mmap.header.size = (sizeof(ev.mmap) -
(sizeof(ev.mmap.filename) - size));
ev.mmap.start = pos->start;
ev.mmap.len = pos->end - pos->start;
+ ev.mmap.pid = kerninfo->pid;

memcpy(ev.mmap.filename, pos->dso->long_name,
pos->dso->long_name_len + 1);
@@ -250,13 +267,17 @@ static int find_symbol_cb(void *arg, con

int event__synthesize_kernel_mmap(event__handler_t process,
struct perf_session *session,
+ struct kernel_info *kerninfo,
const char *symbol_name)
{
size_t size;
+ const char *filename, *mmap_name;
+ char path[PATH_MAX];
+ struct map *map;
+
event_t ev = {
.header = {
.type = PERF_RECORD_MMAP,
- .misc = 1, /* kernel uses 0 for user space maps, see kernel/perf_event.c __perf_event_mmap */
},
};
/*
@@ -266,16 +287,38 @@ int event__synthesize_kernel_mmap(event_
*/
struct process_symbol_args args = { .name = symbol_name, };

- if (kallsyms__parse("/proc/kallsyms", &args, find_symbol_cb) <= 0)
+ if (is_host_kernel(kerninfo)) {
+ /*
+ * kernel uses PERF_RECORD_MISC_USER for user space maps,
+ * see kernel/perf_event.c __perf_event_mmap
+ */
+ ev.header.misc = PERF_RECORD_MISC_KERNEL;
+ mmap_name = "kernel.kallsyms";
+ filename = "/proc/kallsyms";
+ } else {
+ ev.header.misc = PERF_RECORD_MISC_GUEST_KERNEL;
+ mmap_name = "guest.kernel.kallsyms";
+ if (is_default_guest(kerninfo))
+ filename = symbol_conf.default_guest_kallsyms;
+ else {
+ sprintf(path, "%s/proc/kallsyms", kerninfo->root_dir);
+ filename = path;
+ }
+ }
+
+ if (kallsyms__parse(filename, &args, find_symbol_cb) <= 0)
return -ENOENT;

+ map = kerninfo->vmlinux_maps[MAP__FUNCTION];
size = snprintf(ev.mmap.filename, sizeof(ev.mmap.filename),
- "[kernel.kallsyms.%s]", symbol_name) + 1;
+ "[%s.%s]", mmap_name, symbol_name) + 1;
size = ALIGN(size, sizeof(u64));
- ev.mmap.header.size = (sizeof(ev.mmap) - (sizeof(ev.mmap.filename) - size));
+ ev.mmap.header.size = (sizeof(ev.mmap) -
+ (sizeof(ev.mmap.filename) - size));
ev.mmap.pgoff = args.start;
- ev.mmap.start = session->vmlinux_maps[MAP__FUNCTION]->start;
- ev.mmap.len = session->vmlinux_maps[MAP__FUNCTION]->end - ev.mmap.start ;
+ ev.mmap.start = map->start;
+ ev.mmap.len = map->end - ev.mmap.start;
+ ev.mmap.pid = kerninfo->pid;

return process(&ev, session);
}
@@ -329,82 +372,134 @@ int event__process_lost(event_t *self, s
return 0;
}

-int event__process_mmap(event_t *self, struct perf_session *session)
+static void event_set_kernel_mmap_len(struct map **maps, event_t *self)
{
- struct thread *thread;
- struct map *map;
-
- dump_printf(" %d/%d: [%#Lx(%#Lx) @ %#Lx]: %s\n",
- self->mmap.pid, self->mmap.tid, self->mmap.start,
- self->mmap.len, self->mmap.pgoff, self->mmap.filename);
+ maps[MAP__FUNCTION]->start = self->mmap.start;
+ maps[MAP__FUNCTION]->end = self->mmap.start + self->mmap.len;
+ /*
+ * Be a bit paranoid here, some perf.data file came with
+ * a zero sized synthesized MMAP event for the kernel.
+ */
+ if (maps[MAP__FUNCTION]->end == 0)
+ maps[MAP__FUNCTION]->end = ~0UL;
+}

- if (self->mmap.pid == 0) {
- static const char kmmap_prefix[] = "[kernel.kallsyms.";
+static int event__process_kernel_mmap(event_t *self,
+ struct perf_session *session)
+{
+ struct map *map;
+ const char *kmmap_prefix, *short_name;
+ struct kernel_info *kerninfo;
+ enum dso_kernel_type kernel_type;
+
+ kerninfo = kerninfo__findnew(&session->kerninfo_root, self->mmap.pid);
+ if (!kerninfo) {
+ pr_err("Can't find id %d's kerninfo\n", self->mmap.pid);
+ goto out_problem;
+ }

- if (self->mmap.filename[0] == '/') {
- char short_module_name[1024];
- char *name = strrchr(self->mmap.filename, '/'), *dot;
-
- if (name == NULL)
- goto out_problem;
-
- ++name; /* skip / */
- dot = strrchr(name, '.');
- if (dot == NULL)
- goto out_problem;
-
- snprintf(short_module_name, sizeof(short_module_name),
- "[%.*s]", (int)(dot - name), name);
- strxfrchar(short_module_name, '-', '_');
-
- map = perf_session__new_module_map(session,
- self->mmap.start,
- self->mmap.filename);
- if (map == NULL)
- goto out_problem;
-
- name = strdup(short_module_name);
- if (name == NULL)
- goto out_problem;
-
- map->dso->short_name = name;
- map->end = map->start + self->mmap.len;
- } else if (memcmp(self->mmap.filename, kmmap_prefix,
+ if (is_host_kernel(kerninfo)) {
+ kmmap_prefix = "[kernel.kallsyms.";
+ short_name = "[kernel.kallsyms]";
+ kernel_type = DSO_TYPE_KERNEL;
+ } else {
+ kmmap_prefix = "[guest.kernel.kallsyms.";
+ short_name = "[guest.kernel.kallsyms]";
+ kernel_type = DSO_TYPE_GUEST_KERNEL;
+ }
+
+ if (self->mmap.filename[0] == '/') {
+ char short_module_name[1024];
+ char *name = strrchr(self->mmap.filename, '/'), *dot;
+
+ if (name == NULL)
+ goto out_problem;
+
+ ++name; /* skip / */
+ dot = strrchr(name, '.');
+ if (dot == NULL)
+ goto out_problem;
+
+ snprintf(short_module_name, sizeof(short_module_name),
+ "[%.*s]", (int)(dot - name), name);
+ strxfrchar(short_module_name, '-', '_');
+
+ map = map_groups__new_module(&kerninfo->kmaps,
+ self->mmap.start,
+ self->mmap.filename,
+ kerninfo);
+ if (map == NULL)
+ goto out_problem;
+
+ name = strdup(short_module_name);
+ if (name == NULL)
+ goto out_problem;
+
+ map->dso->short_name = name;
+ map->end = map->start + self->mmap.len;
+ } else if (memcmp(self->mmap.filename, kmmap_prefix,
sizeof(kmmap_prefix) - 1) == 0) {
- const char *symbol_name = (self->mmap.filename +
- sizeof(kmmap_prefix) - 1);
+ const char *symbol_name = (self->mmap.filename +
+ sizeof(kmmap_prefix) - 1);
+ /*
+ * Should be there already, from the build-id table in
+ * the header.
+ */
+ struct dso *kernel = __dsos__findnew(&kerninfo->dsos__kernel,
+ short_name);
+ if (kernel == NULL)
+ goto out_problem;
+
+ kernel->kernel = kernel_type;
+ if (__map_groups__create_kernel_maps(&kerninfo->kmaps,
+ kerninfo->vmlinux_maps, kernel) < 0)
+ goto out_problem;
+
+ event_set_kernel_mmap_len(kerninfo->vmlinux_maps, self);
+ perf_session__set_kallsyms_ref_reloc_sym(kerninfo->vmlinux_maps,
+ symbol_name,
+ self->mmap.pgoff);
+ if (is_default_guest(kerninfo)) {
/*
- * Should be there already, from the build-id table in
- * the header.
+ * Preload the DSOs of the guest kernel and its modules.
*/
- struct dso *kernel = __dsos__findnew(&dsos__kernel,
- "[kernel.kallsyms]");
- if (kernel == NULL)
- goto out_problem;
-
- kernel->kernel = 1;
- if (__perf_session__create_kernel_maps(session, kernel) < 0)
- goto out_problem;
+ dso__load(kernel,
+ kerninfo->vmlinux_maps[MAP__FUNCTION],
+ NULL);
+ }
+ }
+ return 0;
+out_problem:
+ return -1;
+}

- session->vmlinux_maps[MAP__FUNCTION]->start = self->mmap.start;
- session->vmlinux_maps[MAP__FUNCTION]->end = self->mmap.start + self->mmap.len;
- /*
- * Be a bit paranoid here, some perf.data file came with
- * a zero sized synthesized MMAP event for the kernel.
- */
- if (session->vmlinux_maps[MAP__FUNCTION]->end == 0)
- session->vmlinux_maps[MAP__FUNCTION]->end = ~0UL;
+int event__process_mmap(event_t *self, struct perf_session *session)
+{
+ struct kernel_info *kerninfo;
+ struct thread *thread;
+ struct map *map;
+ u8 cpumode = self->header.misc & PERF_RECORD_MISC_CPUMODE_MASK;
+ int ret = 0;

- perf_session__set_kallsyms_ref_reloc_sym(session, symbol_name,
- self->mmap.pgoff);
- }
+ dump_printf(" %d/%d: [%#Lx(%#Lx) @ %#Lx]: %s\n",
+ self->mmap.pid, self->mmap.tid, self->mmap.start,
+ self->mmap.len, self->mmap.pgoff, self->mmap.filename);
+
+ if (cpumode == PERF_RECORD_MISC_GUEST_KERNEL ||
+ cpumode == PERF_RECORD_MISC_KERNEL) {
+ ret = event__process_kernel_mmap(self, session);
+ if (ret < 0)
+ goto out_problem;
return 0;
}

thread = perf_session__findnew(session, self->mmap.pid);
- map = map__new(self->mmap.start, self->mmap.len, self->mmap.pgoff,
- self->mmap.pid, self->mmap.filename, MAP__FUNCTION,
- session->cwd, session->cwdlen);
+ kerninfo = kerninfo__findhost(&session->kerninfo_root);
+ map = map__new(&kerninfo->dsos__user, self->mmap.start,
+ self->mmap.len, self->mmap.pgoff,
+ self->mmap.pid, self->mmap.filename,
+ MAP__FUNCTION, session->cwd, session->cwdlen);

if (thread == NULL || map == NULL)
goto out_problem;
@@ -444,22 +539,52 @@ int event__process_task(event_t *self, s

void thread__find_addr_map(struct thread *self,
struct perf_session *session, u8 cpumode,
- enum map_type type, u64 addr,
+ enum map_type type, pid_t pid, u64 addr,
struct addr_location *al)
{
struct map_groups *mg = &self->mg;
+ struct kernel_info *kerninfo = NULL;

al->thread = self;
al->addr = addr;
+ al->cpumode = cpumode;
+ al->filtered = false;

- if (cpumode == PERF_RECORD_MISC_KERNEL) {
+ if (cpumode == PERF_RECORD_MISC_KERNEL && perf_host) {
al->level = 'k';
- mg = &session->kmaps;
- } else if (cpumode == PERF_RECORD_MISC_USER)
+ kerninfo = kerninfo__findhost(&session->kerninfo_root);
+ mg = &kerninfo->kmaps;
+ } else if (cpumode == PERF_RECORD_MISC_USER && perf_host) {
al->level = '.';
- else {
- al->level = 'H';
+ kerninfo = kerninfo__findhost(&session->kerninfo_root);
+ } else if (cpumode == PERF_RECORD_MISC_GUEST_KERNEL && perf_guest) {
+ al->level = 'g';
+ kerninfo = kerninfo__find(&session->kerninfo_root, pid);
+ if (!kerninfo) {
+ al->map = NULL;
+ return;
+ }
+ mg = &kerninfo->kmaps;
+ } else {
+ /*
+ * 'u' means guest os user space.
+ * TODO: We don't support guest user space. Might support it later.
+ */
+ if (cpumode == PERF_RECORD_MISC_GUEST_USER && perf_guest)
+ al->level = 'u';
+ else
+ al->level = 'H';
al->map = NULL;
+
+ if ((cpumode == PERF_RECORD_MISC_GUEST_USER ||
+ cpumode == PERF_RECORD_MISC_GUEST_KERNEL) &&
+ !perf_guest)
+ al->filtered = true;
+ if ((cpumode == PERF_RECORD_MISC_USER ||
+ cpumode == PERF_RECORD_MISC_KERNEL) &&
+ !perf_host)
+ al->filtered = true;
+
return;
}
try_again:
@@ -474,8 +599,11 @@ try_again:
* "[vdso]" dso, but for now lets use the old trick of looking
* in the whole kernel symbol list.
*/
- if ((long long)al->addr < 0 && mg != &session->kmaps) {
- mg = &session->kmaps;
+ if ((long long)al->addr < 0 &&
+ cpumode == PERF_RECORD_MISC_KERNEL &&
+ kerninfo &&
+ mg != &kerninfo->kmaps) {
+ mg = &kerninfo->kmaps;
goto try_again;
}
} else
@@ -484,11 +612,11 @@ try_again:

void thread__find_addr_location(struct thread *self,
struct perf_session *session, u8 cpumode,
- enum map_type type, u64 addr,
+ enum map_type type, pid_t pid, u64 addr,
struct addr_location *al,
symbol_filter_t filter)
{
- thread__find_addr_map(self, session, cpumode, type, addr, al);
+ thread__find_addr_map(self, session, cpumode, type, pid, addr, al);
if (al->map != NULL)
al->sym = map__find_symbol(al->map, al->addr, filter);
else
@@ -524,7 +652,7 @@ int event__preprocess_sample(const event
dump_printf(" ... thread: %s:%d\n", thread->comm, thread->pid);

thread__find_addr_map(thread, session, cpumode, MAP__FUNCTION,
- self->ip.ip, al);
+ self->ip.pid, self->ip.ip, al);
dump_printf(" ...... dso: %s\n",
al->map ? al->map->dso->long_name :
al->level == 'H' ? "[hypervisor]" : "<not found>");
@@ -554,7 +682,6 @@ int event__preprocess_sample(const event
!strlist__has_entry(symbol_conf.sym_list, al->sym->name))
goto out_filtered;

- al->filtered = false;
return 0;

out_filtered:
diff -Nraup linux-2.6_tip0413/tools/perf/util/event.h linux-2.6_tip0413_perfkvm/tools/perf/util/event.h
--- linux-2.6_tip0413/tools/perf/util/event.h 2010-04-14 11:11:58.638239002 +0800
+++ linux-2.6_tip0413_perfkvm/tools/perf/util/event.h 2010-04-14 14:12:02.533688079 +0800
@@ -79,6 +79,7 @@ struct sample_data {

struct build_id_event {
struct perf_event_header header;
+ pid_t pid;
u8 build_id[ALIGN(BUILD_ID_SIZE, sizeof(u64))];
char filename[];
};
@@ -119,10 +120,13 @@ int event__synthesize_thread(pid_t pid,
void event__synthesize_threads(event__handler_t process,
struct perf_session *session);
int event__synthesize_kernel_mmap(event__handler_t process,
- struct perf_session *session,
- const char *symbol_name);
+ struct perf_session *session,
+ struct kernel_info *kerninfo,
+ const char *symbol_name);
+
int event__synthesize_modules(event__handler_t process,
- struct perf_session *session);
+ struct perf_session *session,
+ struct kernel_info *kerninfo);

int event__process_comm(event_t *self, struct perf_session *session);
int event__process_lost(event_t *self, struct perf_session *session);
diff -Nraup linux-2.6_tip0413/tools/perf/util/header.c linux-2.6_tip0413_perfkvm/tools/perf/util/header.c
--- linux-2.6_tip0413/tools/perf/util/header.c 2010-04-14 11:11:58.594236160 +0800
+++ linux-2.6_tip0413_perfkvm/tools/perf/util/header.c 2010-04-14 11:13:17.317861518 +0800
@@ -197,7 +197,8 @@ static int write_padded(int fd, const vo
continue; \
else

-static int __dsos__write_buildid_table(struct list_head *head, u16 misc, int fd)
+static int __dsos__write_buildid_table(struct list_head *head, pid_t pid,
+ u16 misc, int fd)
{
struct dso *pos;

@@ -212,6 +213,7 @@ static int __dsos__write_buildid_table(s
len = ALIGN(len, NAME_ALIGN);
memset(&b, 0, sizeof(b));
memcpy(&b.build_id, pos->build_id, sizeof(pos->build_id));
+ b.pid = pid;
b.header.misc = misc;
b.header.size = sizeof(b) + len;
err = do_write(fd, &b, sizeof(b));
@@ -226,13 +228,33 @@ static int __dsos__write_buildid_table(s
return 0;
}

-static int dsos__write_buildid_table(int fd)
+static int dsos__write_buildid_table(struct perf_header *header, int fd)
{
- int err = __dsos__write_buildid_table(&dsos__kernel,
- PERF_RECORD_MISC_KERNEL, fd);
- if (err == 0)
- err = __dsos__write_buildid_table(&dsos__user,
- PERF_RECORD_MISC_USER, fd);
+ struct perf_session *session = container_of(header,
+ struct perf_session, header);
+ struct rb_node *nd;
+ int err = 0;
+ u16 kmisc, umisc;
+
+ for (nd = rb_first(&session->kerninfo_root); nd; nd = rb_next(nd)) {
+ struct kernel_info *pos = rb_entry(nd, struct kernel_info,
+ rb_node);
+ if (is_host_kernel(pos)) {
+ kmisc = PERF_RECORD_MISC_KERNEL;
+ umisc = PERF_RECORD_MISC_USER;
+ } else {
+ kmisc = PERF_RECORD_MISC_GUEST_KERNEL;
+ umisc = PERF_RECORD_MISC_GUEST_USER;
+ }
+
+ err = __dsos__write_buildid_table(&pos->dsos__kernel, pos->pid,
+ kmisc, fd);
+ if (err == 0)
+ err = __dsos__write_buildid_table(&pos->dsos__user,
+ pos->pid, umisc, fd);
+ if (err)
+ break;
+ }
return err;
}

@@ -349,9 +371,12 @@ static int __dsos__cache_build_ids(struc
return err;
}

-static int dsos__cache_build_ids(void)
+static int dsos__cache_build_ids(struct perf_header *self)
{
- int err_kernel, err_user;
+ struct perf_session *session = container_of(self,
+ struct perf_session, header);
+ struct rb_node *nd;
+ int ret = 0;
char debugdir[PATH_MAX];

snprintf(debugdir, sizeof(debugdir), "%s/%s", getenv("HOME"),
@@ -360,9 +385,30 @@ static int dsos__cache_build_ids(void)
if (mkdir(debugdir, 0755) != 0 && errno != EEXIST)
return -1;

- err_kernel = __dsos__cache_build_ids(&dsos__kernel, debugdir);
- err_user = __dsos__cache_build_ids(&dsos__user, debugdir);
- return err_kernel || err_user ? -1 : 0;
+ for (nd = rb_first(&session->kerninfo_root); nd; nd = rb_next(nd)) {
+ struct kernel_info *pos = rb_entry(nd, struct kernel_info,
+ rb_node);
+ ret |= __dsos__cache_build_ids(&pos->dsos__kernel, debugdir);
+ ret |= __dsos__cache_build_ids(&pos->dsos__user, debugdir);
+ }
+ return ret ? -1 : 0;
+}
+
+static bool dsos__read_build_ids(struct perf_header *self, bool with_hits)
+{
+ bool ret = false;
+ struct perf_session *session = container_of(self,
+ struct perf_session, header);
+ struct rb_node *nd;
+
+ for (nd = rb_first(&session->kerninfo_root); nd; nd = rb_next(nd)) {
+ struct kernel_info *pos = rb_entry(nd, struct kernel_info,
+ rb_node);
+ ret |= __dsos__read_build_ids(&pos->dsos__kernel, with_hits);
+ ret |= __dsos__read_build_ids(&pos->dsos__user, with_hits);
+ }
+
+ return ret;
}

static int perf_header__adds_write(struct perf_header *self, int fd)
@@ -373,7 +419,7 @@ static int perf_header__adds_write(struc
u64 sec_start;
int idx = 0, err;

- if (dsos__read_build_ids(true))
+ if (dsos__read_build_ids(self, true))
perf_header__set_feat(self, HEADER_BUILD_ID);

nr_sections = bitmap_weight(self->adds_features, HEADER_FEAT_BITS);
@@ -408,14 +454,14 @@ static int perf_header__adds_write(struc

/* Write build-ids */
buildid_sec->offset = lseek(fd, 0, SEEK_CUR);
- err = dsos__write_buildid_table(fd);
+ err = dsos__write_buildid_table(self, fd);
if (err < 0) {
pr_debug("failed to write buildid table\n");
goto out_free;
}
buildid_sec->size = lseek(fd, 0, SEEK_CUR) -
buildid_sec->offset;
- dsos__cache_build_ids();
+ dsos__cache_build_ids(self);
}

lseek(fd, sec_start, SEEK_SET);
@@ -636,6 +682,72 @@ int perf_file_header__read(struct perf_f
return 0;
}

+static int perf_header__read_build_ids(struct perf_header *self,
+ int input, u64 offset, u64 size)
+{
+ struct perf_session *session = container_of(self,
+ struct perf_session, header);
+ struct build_id_event bev;
+ char filename[PATH_MAX];
+ u64 limit = offset + size;
+ int err = -1;
+ struct list_head *head;
+ struct kernel_info *kerninfo;
+ u16 misc;
+
+ while (offset < limit) {
+ struct dso *dso;
+ ssize_t len;
+ enum dso_kernel_type dso_type;
+
+ if (read(input, &bev, sizeof(bev)) != sizeof(bev))
+ goto out;
+
+ if (self->needs_swap)
+ perf_event_header__bswap(&bev.header);
+
+ kerninfo = kerninfo__findnew(&session->kerninfo_root, bev.pid);
+ if (!kerninfo)
+ goto out;
+
+ len = bev.header.size - sizeof(bev);
+ if (read(input, filename, len) != len)
+ goto out;
+
+ misc = bev.header.misc & PERF_RECORD_MISC_CPUMODE_MASK;
+
+ switch (misc) {
+ case PERF_RECORD_MISC_KERNEL:
+ dso_type = DSO_TYPE_KERNEL;
+ head = &kerninfo->dsos__kernel;
+ break;
+ case PERF_RECORD_MISC_GUEST_KERNEL:
+ dso_type = DSO_TYPE_GUEST_KERNEL;
+ head = &kerninfo->dsos__kernel;
+ break;
+ case PERF_RECORD_MISC_USER:
+ case PERF_RECORD_MISC_GUEST_USER:
+ dso_type = DSO_TYPE_USER;
+ head = &kerninfo->dsos__user;
+ break;
+ default:
+ goto out;
+ }
+
+ dso = __dsos__findnew(head, filename);
+ if (dso != NULL) {
+ dso__set_build_id(dso, &bev.build_id);
+ if (filename[0] == '[')
+ dso->kernel = dso_type;
+ }
+
+ offset += bev.header.size;
+ }
+ err = 0;
+out:
+ return err;
+}
+
static int perf_file_section__process(struct perf_file_section *self,
struct perf_header *ph,
int feat, int fd)
diff -Nraup linux-2.6_tip0413/tools/perf/util/hist.c linux-2.6_tip0413_perfkvm/tools/perf/util/hist.c
--- linux-2.6_tip0413/tools/perf/util/hist.c 2010-04-14 11:11:58.766255670 +0800
+++ linux-2.6_tip0413_perfkvm/tools/perf/util/hist.c 2010-04-14 16:02:22.299845756 +0800
@@ -8,6 +8,30 @@ struct callchain_param callchain_param =
.min_percent = 0.5
};

+void __perf_session__add_count(struct hist_entry *he,
+ struct addr_location *al,
+ u64 count)
+{
+ he->count += count;
+
+ switch (al->cpumode) {
+ case PERF_RECORD_MISC_KERNEL:
+ he->count_sys += count;
+ break;
+ case PERF_RECORD_MISC_USER:
+ he->count_us += count;
+ break;
+ case PERF_RECORD_MISC_GUEST_KERNEL:
+ he->count_guest_sys += count;
+ break;
+ case PERF_RECORD_MISC_GUEST_USER:
+ he->count_guest_us += count;
+ break;
+ default:
+ break;
+ }
+}
+
/*
* histogram, sorted on item, collects counts
*/
@@ -464,7 +488,7 @@ int hist_entry__snprintf(struct hist_ent
u64 session_total)
{
struct sort_entry *se;
- u64 count, total;
+ u64 count, total, count_sys, count_us, count_guest_sys, count_guest_us;
const char *sep = symbol_conf.field_sep;
int ret;

@@ -474,9 +498,17 @@ int hist_entry__snprintf(struct hist_ent
if (pair_session) {
count = self->pair ? self->pair->count : 0;
total = pair_session->events_stats.total;
+ count_sys = self->pair ? self->pair->count_sys : 0;
+ count_us = self->pair ? self->pair->count_us : 0;
+ count_guest_sys = self->pair ? self->pair->count_guest_sys : 0;
+ count_guest_us = self->pair ? self->pair->count_guest_us : 0;
} else {
count = self->count;
total = session_total;
+ count_sys = self->count_sys;
+ count_us = self->count_us;
+ count_guest_sys = self->count_guest_sys;
+ count_guest_us = self->count_guest_us;
}

if (total) {
@@ -487,6 +519,22 @@ int hist_entry__snprintf(struct hist_ent
else
ret = snprintf(s, size, sep ? "%.2f" : " %6.2f%%",
(count * 100.0) / total);
+ if (symbol_conf.show_cpu_utilization) {
+ ret += percent_color_snprintf(s + ret, size - ret,
+ sep ? "%.2f" : " %6.2f%%",
+ (count_sys * 100.0) / total);
+ ret += percent_color_snprintf(s + ret, size - ret,
+ sep ? "%.2f" : " %6.2f%%",
+ (count_us * 100.0) / total);
+ if (perf_guest) {
+ ret += percent_color_snprintf(s + ret, size - ret,
+ sep ? "%.2f" : " %6.2f%%",
+ (count_guest_sys * 100.0) / total);
+ ret += percent_color_snprintf(s + ret, size - ret,
+ sep ? "%.2f" : " %6.2f%%",
+ (count_guest_us * 100.0) / total);
+ }
+ }
} else
ret = snprintf(s, size, sep ? "%lld" : "%12lld ", count);

@@ -597,6 +645,24 @@ size_t perf_session__fprintf_hists(struc
fputs(" Samples ", fp);
}

+ if (symbol_conf.show_cpu_utilization) {
+ if (sep) {
+ ret += fprintf(fp, "%csys", *sep);
+ ret += fprintf(fp, "%cus", *sep);
+ if (perf_guest) {
+ ret += fprintf(fp, "%cguest sys", *sep);
+ ret += fprintf(fp, "%cguest us", *sep);
+ }
+ } else {
+ ret += fprintf(fp, " sys ");
+ ret += fprintf(fp, " us ");
+ if (perf_guest) {
+ ret += fprintf(fp, " guest sys ");
+ ret += fprintf(fp, " guest us ");
+ }
+ }
+ }
+
if (pair) {
if (sep)
ret += fprintf(fp, "%cDelta", *sep);
diff -Nraup linux-2.6_tip0413/tools/perf/util/hist.h linux-2.6_tip0413_perfkvm/tools/perf/util/hist.h
--- linux-2.6_tip0413/tools/perf/util/hist.h 2010-04-14 11:11:58.674215806 +0800
+++ linux-2.6_tip0413_perfkvm/tools/perf/util/hist.h 2010-04-14 11:13:17.317861518 +0800
@@ -12,6 +12,9 @@ struct addr_location;
struct symbol;
struct rb_root;

+void __perf_session__add_count(struct hist_entry *he,
+ struct addr_location *al,
+ u64 count);
struct hist_entry *__perf_session__add_hist_entry(struct rb_root *hists,
struct addr_location *al,
struct symbol *parent,
diff -Nraup linux-2.6_tip0413/tools/perf/util/map.c linux-2.6_tip0413_perfkvm/tools/perf/util/map.c
--- linux-2.6_tip0413/tools/perf/util/map.c 2010-04-14 11:11:58.642241284 +0800
+++ linux-2.6_tip0413_perfkvm/tools/perf/util/map.c 2010-04-14 16:08:55.377366557 +0800
@@ -4,6 +4,7 @@
#include <stdlib.h>
#include <string.h>
#include <stdio.h>
+#include <unistd.h>
#include "map.h"

const char *map_type__name[MAP__NR_TYPES] = {
@@ -37,9 +38,11 @@ void map__init(struct map *self, enum ma
self->map_ip = map__map_ip;
self->unmap_ip = map__unmap_ip;
RB_CLEAR_NODE(&self->rb_node);
+ self->groups = NULL;
}

-struct map *map__new(u64 start, u64 len, u64 pgoff, u32 pid, char *filename,
+struct map *map__new(struct list_head *dsos__list, u64 start, u64 len,
+ u64 pgoff, u32 pid, char *filename,
enum map_type type, char *cwd, int cwdlen)
{
struct map *self = malloc(sizeof(*self));
@@ -66,7 +69,7 @@ struct map *map__new(u64 start, u64 len,
filename = newfilename;
}

- dso = dsos__findnew(filename);
+ dso = __dsos__findnew(dsos__list, filename);
if (dso == NULL)
goto out_delete;

@@ -242,6 +245,7 @@ void map_groups__init(struct map_groups
self->maps[i] = RB_ROOT;
INIT_LIST_HEAD(&self->removed_maps[i]);
}
+ self->this_kerninfo = NULL;
}

void map_groups__flush(struct map_groups *self)
@@ -508,3 +512,123 @@ struct map *maps__find(struct rb_root *m

return NULL;
}
+
+struct kernel_info *add_new_kernel_info(struct rb_root *kerninfo_root,
+ pid_t pid, const char *root_dir)
+{
+ struct rb_node **p = &kerninfo_root->rb_node;
+ struct rb_node *parent = NULL;
+ struct kernel_info *kerninfo, *pos;
+
+ kerninfo = malloc(sizeof(struct kernel_info));
+ if (!kerninfo)
+ return NULL;
+
+ kerninfo->pid = pid;
+ map_groups__init(&kerninfo->kmaps);
+ kerninfo->root_dir = strdup(root_dir);
+ RB_CLEAR_NODE(&kerninfo->rb_node);
+ INIT_LIST_HEAD(&kerninfo->dsos__user);
+ INIT_LIST_HEAD(&kerninfo->dsos__kernel);
+ kerninfo->kmaps.this_kerninfo = kerninfo;
+
+ while (*p != NULL) {
+ parent = *p;
+ pos = rb_entry(parent, struct kernel_info, rb_node);
+ if (pid < pos->pid)
+ p = &(*p)->rb_left;
+ else
+ p = &(*p)->rb_right;
+ }
+
+ rb_link_node(&kerninfo->rb_node, parent, p);
+ rb_insert_color(&kerninfo->rb_node, kerninfo_root);
+
+ return kerninfo;
+}
+
+struct kernel_info *kerninfo__find(struct rb_root *kerninfo_root, pid_t pid)
+{
+ struct rb_node **p = &kerninfo_root->rb_node;
+ struct rb_node *parent = NULL;
+ struct kernel_info *kerninfo;
+ struct kernel_info *default_kerninfo = NULL;
+
+ while (*p != NULL) {
+ parent = *p;
+ kerninfo = rb_entry(parent, struct kernel_info, rb_node);
+ if (pid < kerninfo->pid)
+ p = &(*p)->rb_left;
+ else if (pid > kerninfo->pid)
+ p = &(*p)->rb_right;
+ else
+ return kerninfo;
+ if (!kerninfo->pid)
+ default_kerninfo = kerninfo;
+ }
+
+ return default_kerninfo;
+}
+
+struct kernel_info *kerninfo__findhost(struct rb_root *kerninfo_root)
+{
+ struct rb_node **p = &kerninfo_root->rb_node;
+ struct rb_node *parent = NULL;
+ struct kernel_info *kerninfo;
+ pid_t pid = HOST_KERNEL_ID;
+
+ while (*p != NULL) {
+ parent = *p;
+ kerninfo = rb_entry(parent, struct kernel_info, rb_node);
+ if (pid < kerninfo->pid)
+ p = &(*p)->rb_left;
+ else if (pid > kerninfo->pid)
+ p = &(*p)->rb_right;
+ else
+ return kerninfo;
+ }
+
+ return NULL;
+}
+
+struct kernel_info *kerninfo__findnew(struct rb_root *kerninfo_root, pid_t pid)
+{
+ char path[PATH_MAX];
+ const char *root_dir;
+ int ret;
+ struct kernel_info *kerninfo = kerninfo__find(kerninfo_root, pid);
+
+ if (!kerninfo || kerninfo->pid != pid) {
+ if (pid == HOST_KERNEL_ID || pid == DEFAULT_GUEST_KERNEL_ID)
+ root_dir = "";
+ else {
+ if (!symbol_conf.guestmount)
+ goto out;
+ snprintf(path, sizeof(path), "%s/%d", symbol_conf.guestmount, pid);
+ ret = access(path, R_OK);
+ if (ret) {
+ pr_err("Can't access file %s\n", path);
+ goto out;
+ }
+ root_dir = path;
+ }
+ kerninfo = add_new_kernel_info(kerninfo_root, pid, root_dir);
+ }
+
+out:
+ return kerninfo;
+}
+
+void kerninfo__process_allkernels(struct rb_root *kerninfo_root,
+ process_kernel_info process,
+ void *data)
+{
+ struct rb_node *nd;
+
+ for (nd = rb_first(kerninfo_root); nd; nd = rb_next(nd)) {
+ struct kernel_info *pos = rb_entry(nd, struct kernel_info,
+ rb_node);
+ process(pos, data);
+ }
+}
+
diff -Nraup linux-2.6_tip0413/tools/perf/util/map.h linux-2.6_tip0413_perfkvm/tools/perf/util/map.h
--- linux-2.6_tip0413/tools/perf/util/map.h 2010-04-14 11:11:58.686216105 +0800
+++ linux-2.6_tip0413_perfkvm/tools/perf/util/map.h 2010-04-14 16:12:24.683245583 +0800
@@ -19,6 +19,7 @@ extern const char *map_type__name[MAP__N
struct dso;
struct ref_reloc_sym;
struct map_groups;
+struct kernel_info;

struct map {
union {
@@ -36,6 +37,7 @@ struct map {
u64 (*unmap_ip)(struct map *, u64);

struct dso *dso;
+ struct map_groups *groups;
};

struct kmap {
@@ -43,6 +45,26 @@ struct kmap {
struct map_groups *kmaps;
};

+struct map_groups {
+ struct rb_root maps[MAP__NR_TYPES];
+ struct list_head removed_maps[MAP__NR_TYPES];
+ struct kernel_info *this_kerninfo;
+};
+
+/* Native host kernel uses -1 as pid index in kernel_info */
+#define HOST_KERNEL_ID (-1)
+#define DEFAULT_GUEST_KERNEL_ID (0)
+
+struct kernel_info {
+ struct rb_node rb_node;
+ pid_t pid;
+ char *root_dir;
+ struct list_head dsos__user;
+ struct list_head dsos__kernel;
+ struct map_groups kmaps;
+ struct map *vmlinux_maps[MAP__NR_TYPES];
+};
+
static inline struct kmap *map__kmap(struct map *self)
{
return (struct kmap *)(self + 1);
@@ -74,7 +96,8 @@ typedef int (*symbol_filter_t)(struct ma

void map__init(struct map *self, enum map_type type,
u64 start, u64 end, u64 pgoff, struct dso *dso);
-struct map *map__new(u64 start, u64 len, u64 pgoff, u32 pid, char *filename,
+struct map *map__new(struct list_head *dsos__list, u64 start, u64 len,
+ u64 pgoff, u32 pid, char *filename,
enum map_type type, char *cwd, int cwdlen);
void map__delete(struct map *self);
struct map *map__clone(struct map *self);
@@ -91,11 +114,6 @@ void map__fixup_end(struct map *self);

void map__reloc_vmlinux(struct map *self);

-struct map_groups {
- struct rb_root maps[MAP__NR_TYPES];
- struct list_head removed_maps[MAP__NR_TYPES];
-};
-
size_t __map_groups__fprintf_maps(struct map_groups *self,
enum map_type type, int verbose, FILE *fp);
void maps__insert(struct rb_root *maps, struct map *map);
@@ -106,9 +124,39 @@ int map_groups__clone(struct map_groups
size_t map_groups__fprintf(struct map_groups *self, int verbose, FILE *fp);
size_t map_groups__fprintf_maps(struct map_groups *self, int verbose, FILE *fp);

+struct kernel_info *add_new_kernel_info(struct rb_root *kerninfo_root,
+ pid_t pid, const char *root_dir);
+struct kernel_info *kerninfo__find(struct rb_root *kerninfo_root, pid_t pid);
+struct kernel_info *kerninfo__findnew(struct rb_root *kerninfo_root, pid_t pid);
+struct kernel_info *kerninfo__findhost(struct rb_root *kerninfo_root);
+
+/*
+ * Default guest kernel is defined by parameter --guestkallsyms
+ * and --guestmodules
+ */
+static inline int is_default_guest(struct kernel_info *kerninfo)
+{
+ if (!kerninfo)
+ return 0;
+ return kerninfo->pid == DEFAULT_GUEST_KERNEL_ID;
+}
+
+static inline int is_host_kernel(struct kernel_info *kerninfo)
+{
+ if (!kerninfo)
+ return 0;
+ return kerninfo->pid == HOST_KERNEL_ID;
+}
+
+typedef void (*process_kernel_info)(struct kernel_info *kerninfo, void *data);
+void kerninfo__process_allkernels(struct rb_root *kerninfo_root,
+ process_kernel_info process,
+ void *data);
+
static inline void map_groups__insert(struct map_groups *self, struct map *map)
{
- maps__insert(&self->maps[map->type], map);
+ maps__insert(&self->maps[map->type], map);
+ map->groups = self;
}

static inline struct map *map_groups__find(struct map_groups *self,
@@ -148,13 +196,11 @@ int map_groups__fixup_overlappings(struc

struct map *map_groups__find_by_name(struct map_groups *self,
enum map_type type, const char *name);
-int __map_groups__create_kernel_maps(struct map_groups *self,
- struct map *vmlinux_maps[MAP__NR_TYPES],
- struct dso *kernel);
-int map_groups__create_kernel_maps(struct map_groups *self,
- struct map *vmlinux_maps[MAP__NR_TYPES]);
-struct map *map_groups__new_module(struct map_groups *self, u64 start,
- const char *filename);
+struct map *map_groups__new_module(struct map_groups *self,
+ u64 start,
+ const char *filename,
+ struct kernel_info *kerninfo);
+
void map_groups__flush(struct map_groups *self);

#endif /* __PERF_MAP_H */
diff -Nraup linux-2.6_tip0413/tools/perf/util/probe-event.c linux-2.6_tip0413_perfkvm/tools/perf/util/probe-event.c
--- linux-2.6_tip0413/tools/perf/util/probe-event.c 2010-04-14 11:11:58.614279111 +0800
+++ linux-2.6_tip0413_perfkvm/tools/perf/util/probe-event.c 2010-04-14 11:13:17.321860837 +0800
@@ -78,6 +78,8 @@ static struct map *kmaps[MAP__NR_TYPES];
/* Initialize symbol maps and path of vmlinux */
static void init_vmlinux(void)
{
+ struct dso *kernel;
+
symbol_conf.sort_by_name = true;
if (symbol_conf.vmlinux_name == NULL)
symbol_conf.try_vmlinux_path = true;
@@ -86,8 +88,12 @@ static void init_vmlinux(void)
if (symbol__init() < 0)
die("Failed to init symbol map.");

+ kernel = dso__new_kernel(symbol_conf.vmlinux_name);
+ if (kernel == NULL)
+ die("Failed to create kernel dso.");
+
map_groups__init(&kmap_groups);
- if (map_groups__create_kernel_maps(&kmap_groups, kmaps) < 0)
+ if (__map_groups__create_kernel_maps(&kmap_groups, kmaps, kernel) < 0)
die("Failed to create kernel maps.");
}

diff -Nraup linux-2.6_tip0413/tools/perf/util/session.c linux-2.6_tip0413_perfkvm/tools/perf/util/session.c
--- linux-2.6_tip0413/tools/perf/util/session.c 2010-04-14 11:11:58.794254600 +0800
+++ linux-2.6_tip0413_perfkvm/tools/perf/util/session.c 2010-04-14 16:15:56.564948860 +0800
@@ -52,6 +52,17 @@ out_close:
return -1;
}

+int perf_session__create_kernel_maps(struct perf_session *self)
+{
+ int ret;
+ struct rb_root *root = &self->kerninfo_root;
+
+ ret = map_groups__create_kernel_maps(root, HOST_KERNEL_ID);
+ if (ret >= 0)
+ ret = map_groups__create_guest_kernel_maps(root);
+ return ret;
+}
+
struct perf_session *perf_session__new(const char *filename, int mode, bool force)
{
size_t len = filename ? strlen(filename) + 1 : 0;
@@ -71,7 +82,7 @@ struct perf_session *perf_session__new(c
self->cwd = NULL;
self->cwdlen = 0;
self->unknown_events = 0;
- map_groups__init(&self->kmaps);
+ self->kerninfo_root = RB_ROOT;

if (mode == O_RDONLY) {
if (perf_session__open(self, force) < 0)
@@ -142,8 +153,9 @@ struct map_symbol *perf_session__resolve
continue;
}

+ al.filtered = false;
thread__find_addr_location(thread, self, cpumode,
- MAP__FUNCTION, ip, &al, NULL);
+ MAP__FUNCTION, thread->pid, ip, &al, NULL);
if (al.sym != NULL) {
if (sort__has_parent && !*parent &&
symbol__match_parent_regex(al.sym))
@@ -324,46 +336,6 @@ void perf_event_header__bswap(struct per
self->size = bswap_16(self->size);
}

-int perf_header__read_build_ids(struct perf_header *self,
- int input, u64 offset, u64 size)
-{
- struct build_id_event bev;
- char filename[PATH_MAX];
- u64 limit = offset + size;
- int err = -1;
-
- while (offset < limit) {
- struct dso *dso;
- ssize_t len;
- struct list_head *head = &dsos__user;
-
- if (read(input, &bev, sizeof(bev)) != sizeof(bev))
- goto out;
-
- if (self->needs_swap)
- perf_event_header__bswap(&bev.header);
-
- len = bev.header.size - sizeof(bev);
- if (read(input, filename, len) != len)
- goto out;
-
- if (bev.header.misc & PERF_RECORD_MISC_KERNEL)
- head = &dsos__kernel;
-
- dso = __dsos__findnew(head, filename);
- if (dso != NULL) {
- dso__set_build_id(dso, &bev.build_id);
- if (head == &dsos__kernel && filename[0] == '[')
- dso->kernel = 1;
- }
-
- offset += bev.header.size;
- }
- err = 0;
-out:
- return err;
-}
-
static struct thread *perf_session__register_idle_thread(struct perf_session *self)
{
struct thread *thread = perf_session__findnew(self, 0);
@@ -516,26 +488,33 @@ bool perf_session__has_traces(struct per
return true;
}

-int perf_session__set_kallsyms_ref_reloc_sym(struct perf_session *self,
+int perf_session__set_kallsyms_ref_reloc_sym(struct map **maps,
const char *symbol_name,
u64 addr)
{
char *bracket;
enum map_type i;
+ struct ref_reloc_sym *ref;
+
+ ref = zalloc(sizeof(struct ref_reloc_sym));
+ if (ref == NULL)
+ return -ENOMEM;

- self->ref_reloc_sym.name = strdup(symbol_name);
- if (self->ref_reloc_sym.name == NULL)
+ ref->name = strdup(symbol_name);
+ if (ref->name == NULL) {
+ free(ref);
return -ENOMEM;
+ }

- bracket = strchr(self->ref_reloc_sym.name, ']');
+ bracket = strchr(ref->name, ']');
if (bracket)
*bracket = '\0';

- self->ref_reloc_sym.addr = addr;
+ ref->addr = addr;

for (i = 0; i < MAP__NR_TYPES; ++i) {
- struct kmap *kmap = map__kmap(self->vmlinux_maps[i]);
- kmap->ref_reloc_sym = &self->ref_reloc_sym;
+ struct kmap *kmap = map__kmap(maps[i]);
+ kmap->ref_reloc_sym = ref;
}

return 0;
diff -Nraup linux-2.6_tip0413/tools/perf/util/session.h linux-2.6_tip0413_perfkvm/tools/perf/util/session.h
--- linux-2.6_tip0413/tools/perf/util/session.h 2010-04-14 11:11:58.606252925 +0800
+++ linux-2.6_tip0413_perfkvm/tools/perf/util/session.h 2010-04-14 11:13:17.321860837 +0800
@@ -15,17 +15,15 @@ struct perf_session {
struct perf_header header;
unsigned long size;
unsigned long mmap_window;
- struct map_groups kmaps;
struct rb_root threads;
struct thread *last_match;
- struct map *vmlinux_maps[MAP__NR_TYPES];
+ struct rb_root kerninfo_root;
struct events_stats events_stats;
struct rb_root stats_by_id;
unsigned long event_total[PERF_RECORD_MAX];
unsigned long unknown_events;
struct rb_root hists;
u64 sample_type;
- struct ref_reloc_sym ref_reloc_sym;
int fd;
int cwdlen;
char *cwd;
@@ -64,33 +62,13 @@ struct map_symbol *perf_session__resolve

bool perf_session__has_traces(struct perf_session *self, const char *msg);

-int perf_header__read_build_ids(struct perf_header *self, int input,
- u64 offset, u64 file_size);
-
-int perf_session__set_kallsyms_ref_reloc_sym(struct perf_session *self,
+int perf_session__set_kallsyms_ref_reloc_sym(struct map **maps,
const char *symbol_name,
u64 addr);

void mem_bswap_64(void *src, int byte_size);

-static inline int __perf_session__create_kernel_maps(struct perf_session *self,
- struct dso *kernel)
-{
- return __map_groups__create_kernel_maps(&self->kmaps,
- self->vmlinux_maps, kernel);
-}
-
-static inline int perf_session__create_kernel_maps(struct perf_session *self)
-{
- return map_groups__create_kernel_maps(&self->kmaps, self->vmlinux_maps);
-}
-
-static inline struct map *
- perf_session__new_module_map(struct perf_session *self,
- u64 start, const char *filename)
-{
- return map_groups__new_module(&self->kmaps, start, filename);
-}
+int perf_session__create_kernel_maps(struct perf_session *self);

#ifdef NO_NEWT_SUPPORT
static inline int perf_session__browse_hists(struct rb_root *hists __used,
diff -Nraup linux-2.6_tip0413/tools/perf/util/sort.h linux-2.6_tip0413_perfkvm/tools/perf/util/sort.h
--- linux-2.6_tip0413/tools/perf/util/sort.h 2010-04-14 11:11:58.610258472 +0800
+++ linux-2.6_tip0413_perfkvm/tools/perf/util/sort.h 2010-04-14 11:13:17.321860837 +0800
@@ -44,6 +44,11 @@ extern enum sort_type sort__first_dimens
struct hist_entry {
struct rb_node rb_node;
u64 count;
+ u64 count_sys;
+ u64 count_us;
+ u64 count_guest_sys;
+ u64 count_guest_us;
+
/*
* XXX WARNING!
* thread _has_ to come after ms, see
diff -Nraup linux-2.6_tip0413/tools/perf/util/symbol.c linux-2.6_tip0413_perfkvm/tools/perf/util/symbol.c
--- linux-2.6_tip0413/tools/perf/util/symbol.c 2010-04-14 11:11:58.614279111 +0800
+++ linux-2.6_tip0413_perfkvm/tools/perf/util/symbol.c 2010-04-14 16:51:51.803796961 +0800
@@ -28,6 +28,8 @@ static void dsos__add(struct list_head *
static struct map *map__new2(u64 start, struct dso *dso, enum map_type type);
static int dso__load_kernel_sym(struct dso *self, struct map *map,
symbol_filter_t filter);
+static int dso__load_guest_kernel_sym(struct dso *self, struct map *map,
+ symbol_filter_t filter);
static int vmlinux_path__nr_entries;
static char **vmlinux_path;

@@ -186,6 +188,7 @@ struct dso *dso__new(const char *name)
self->loaded = 0;
self->sorted_by_name = 0;
self->has_build_id = 0;
+ self->kernel = DSO_TYPE_USER;
}

return self;
@@ -402,12 +405,9 @@ int kallsyms__parse(const char *filename
char *symbol_name;

line_len = getline(&line, &n, file);
- if (line_len < 0)
+ if (line_len < 0 || !line)
break;

- if (!line)
- goto out_failure;
-
line[--line_len] = '\0'; /* \n */

len = hex2u64(line, &start);
@@ -459,6 +459,7 @@ static int map__process_kallsym_symbol(v
* map__split_kallsyms, when we have split the maps per module
*/
symbols__insert(root, sym);
+
return 0;
}

@@ -489,6 +490,7 @@ static int dso__split_kallsyms(struct ds
struct rb_root *root = &self->symbols[map->type];
struct rb_node *next = rb_first(root);
int kernel_range = 0;
+ const char *root_dir;

while (next) {
char *module;
@@ -504,15 +506,32 @@ static int dso__split_kallsyms(struct ds
*module++ = '\0';

if (strcmp(curr_map->dso->short_name, module)) {
+ if (curr_map != map &&
+ self->kernel == DSO_TYPE_GUEST_KERNEL &&
+ is_default_guest(kmaps->this_kerninfo)) {
+ /*
+ * We assume all symbols of a module are contiguous in
+ * kallsyms, so curr_map points to a module and all its
+ * symbols are in its kmap. Mark it as loaded.
+ */
+ dso__set_loaded(curr_map->dso, curr_map->type);
+ }
+
curr_map = map_groups__find_by_name(kmaps, map->type, module);
if (curr_map == NULL) {
- pr_debug("/proc/{kallsyms,modules} "
+ if (kmaps->this_kerninfo)
+ root_dir = kmaps->this_kerninfo->root_dir;
+ else
+ root_dir = "";
+ pr_debug("%s/proc/{kallsyms,modules} "
"inconsistency while looking "
- "for \"%s\" module!\n", module);
+ "for \"%s\" module!\n",
+ root_dir, module);
return -1;
}

- if (curr_map->dso->loaded)
+ if (curr_map->dso->loaded &&
+ !is_default_guest(kmaps->this_kerninfo))
goto discard_symbol;
}
/*
@@ -525,13 +544,21 @@ static int dso__split_kallsyms(struct ds
char dso_name[PATH_MAX];
struct dso *dso;

- snprintf(dso_name, sizeof(dso_name), "[kernel].%d",
- kernel_range++);
+ if (self->kernel == DSO_TYPE_GUEST_KERNEL)
+ snprintf(dso_name, sizeof(dso_name),
+ "[guest.kernel].%d",
+ kernel_range++);
+ else
+ snprintf(dso_name, sizeof(dso_name),
+ "[kernel].%d",
+ kernel_range++);

dso = dso__new(dso_name);
if (dso == NULL)
return -1;

+ dso->kernel = self->kernel;
+
curr_map = map__new2(pos->start, dso, map->type);
if (curr_map == NULL) {
dso__delete(dso);
@@ -555,6 +582,12 @@ discard_symbol: rb_erase(&pos->rb_node,
}
}

+ if (curr_map != map &&
+ self->kernel == DSO_TYPE_GUEST_KERNEL &&
+ is_default_guest(kmaps->this_kerninfo)) {
+ dso__set_loaded(curr_map->dso, curr_map->type);
+ }
+
return count;
}

@@ -565,7 +598,10 @@ int dso__load_kallsyms(struct dso *self,
return -1;

symbols__fixup_end(&self->symbols[map->type]);
- self->origin = DSO__ORIG_KERNEL;
+ if (self->kernel == DSO_TYPE_GUEST_KERNEL)
+ self->origin = DSO__ORIG_GUEST_KERNEL;
+ else
+ self->origin = DSO__ORIG_KERNEL;

return dso__split_kallsyms(self, map, filter);
}
@@ -952,7 +988,7 @@ static int dso__load_sym(struct dso *sel
nr_syms = shdr.sh_size / shdr.sh_entsize;

memset(&sym, 0, sizeof(sym));
- if (!self->kernel) {
+ if (self->kernel == DSO_TYPE_USER) {
self->adjust_symbols = (ehdr.e_type == ET_EXEC ||
elf_section_by_name(elf, &ehdr, &shdr,
".gnu.prelink_undo",
@@ -984,7 +1020,7 @@ static int dso__load_sym(struct dso *sel

section_name = elf_sec__name(&shdr, secstrs);

- if (self->kernel || kmodule) {
+ if (self->kernel != DSO_TYPE_USER || kmodule) {
char dso_name[PATH_MAX];

if (strcmp(section_name,
@@ -1011,6 +1047,7 @@ static int dso__load_sym(struct dso *sel
curr_dso = dso__new(dso_name);
if (curr_dso == NULL)
goto out_elf_end;
+ curr_dso->kernel = self->kernel;
curr_map = map__new2(start, curr_dso,
map->type);
if (curr_map == NULL) {
@@ -1021,7 +1058,7 @@ static int dso__load_sym(struct dso *sel
curr_map->unmap_ip = identity__map_ip;
curr_dso->origin = self->origin;
map_groups__insert(kmap->kmaps, curr_map);
- dsos__add(&dsos__kernel, curr_dso);
+ dsos__add(&self->node, curr_dso);
dso__set_loaded(curr_dso, map->type);
} else
curr_dso = curr_map->dso;
@@ -1083,7 +1120,7 @@ static bool dso__build_id_equal(const st
return memcmp(self->build_id, build_id, sizeof(self->build_id)) == 0;
}

-static bool __dsos__read_build_ids(struct list_head *head, bool with_hits)
+bool __dsos__read_build_ids(struct list_head *head, bool with_hits)
{
bool have_build_id = false;
struct dso *pos;
@@ -1101,13 +1138,6 @@ static bool __dsos__read_build_ids(struc
return have_build_id;
}

-bool dsos__read_build_ids(bool with_hits)
-{
- bool kbuildids = __dsos__read_build_ids(&dsos__kernel, with_hits),
- ubuildids = __dsos__read_build_ids(&dsos__user, with_hits);
- return kbuildids || ubuildids;
-}
-
/*
* Align offset to 4 bytes as needed for note name and descriptor data.
*/
@@ -1242,6 +1272,8 @@ char dso__symtab_origin(const struct dso
[DSO__ORIG_BUILDID] = 'b',
[DSO__ORIG_DSO] = 'd',
[DSO__ORIG_KMODULE] = 'K',
+ [DSO__ORIG_GUEST_KERNEL] = 'g',
+ [DSO__ORIG_GUEST_KMODULE] = 'G',
};

if (self == NULL || self->origin == DSO__ORIG_NOT_FOUND)
@@ -1257,11 +1289,20 @@ int dso__load(struct dso *self, struct m
char build_id_hex[BUILD_ID_SIZE * 2 + 1];
int ret = -1;
int fd;
+ struct kernel_info *kerninfo;
+ const char *root_dir;

dso__set_loaded(self, map->type);

- if (self->kernel)
+ if (self->kernel == DSO_TYPE_KERNEL)
return dso__load_kernel_sym(self, map, filter);
+ else if (self->kernel == DSO_TYPE_GUEST_KERNEL)
+ return dso__load_guest_kernel_sym(self, map, filter);
+
+ if (map->groups && map->groups->this_kerninfo)
+ kerninfo = map->groups->this_kerninfo;
+ else
+ kerninfo = NULL;

name = malloc(size);
if (!name)
@@ -1315,6 +1356,13 @@ more:
case DSO__ORIG_DSO:
snprintf(name, size, "%s", self->long_name);
break;
+ case DSO__ORIG_GUEST_KMODULE:
+ if (map->groups && map->groups->this_kerninfo)
+ root_dir = map->groups->this_kerninfo->root_dir;
+ else
+ root_dir = "";
+ snprintf(name, size, "%s%s", root_dir, self->long_name);
+ break;

default:
goto out;
@@ -1368,7 +1416,8 @@ struct map *map_groups__find_by_name(str
return NULL;
}

-static int dso__kernel_module_get_build_id(struct dso *self)
+static int dso__kernel_module_get_build_id(struct dso *self,
+ const char *root_dir)
{
char filename[PATH_MAX];
/*
@@ -1378,8 +1427,8 @@ static int dso__kernel_module_get_build_
const char *name = self->short_name + 1;

snprintf(filename, sizeof(filename),
- "/sys/module/%.*s/notes/.note.gnu.build-id",
- (int)strlen(name - 1), name);
+ "%s/sys/module/%.*s/notes/.note.gnu.build-id",
+ root_dir, (int)strlen(name) - 1, name);

if (sysfs__read_build_id(filename, self->build_id,
sizeof(self->build_id)) == 0)
@@ -1388,7 +1437,8 @@ static int dso__kernel_module_get_build_
return 0;
}

-static int map_groups__set_modules_path_dir(struct map_groups *self, char *dir_name)
+static int map_groups__set_modules_path_dir(struct map_groups *self,
+ const char *dir_name)
{
struct dirent *dent;
DIR *dir = opendir(dir_name);
@@ -1400,8 +1450,14 @@ static int map_groups__set_modules_path_

while ((dent = readdir(dir)) != NULL) {
char path[PATH_MAX];
+ struct stat st;
+
+ /* sshfs might return a bad dent->d_type, so we have to stat */
+ sprintf(path, "%s/%s", dir_name, dent->d_name);
+ if (stat(path, &st))
+ continue;

- if (dent->d_type == DT_DIR) {
+ if (S_ISDIR(st.st_mode)) {
if (!strcmp(dent->d_name, ".") ||
!strcmp(dent->d_name, ".."))
continue;
@@ -1433,7 +1489,7 @@ static int map_groups__set_modules_path_
if (long_name == NULL)
goto failure;
dso__set_long_name(map->dso, long_name);
- dso__kernel_module_get_build_id(map->dso);
+ dso__kernel_module_get_build_id(map->dso, "");
}
}

@@ -1443,16 +1499,46 @@ failure:
return -1;
}

-static int map_groups__set_modules_path(struct map_groups *self)
+static char *get_kernel_version(const char *root_dir)
{
- struct utsname uts;
+ char version[PATH_MAX];
+ FILE *file;
+ char *name, *tmp;
+ const char *prefix = "Linux version ";
+
+ sprintf(version, "%s/proc/version", root_dir);
+ file = fopen(version, "r");
+ if (!file)
+ return NULL;
+
+ version[0] = '\0';
+ tmp = fgets(version, sizeof(version), file);
+ fclose(file);
+
+ name = strstr(version, prefix);
+ if (!name)
+ return NULL;
+ name += strlen(prefix);
+ tmp = strchr(name, ' ');
+ if (tmp)
+ *tmp = '\0';
+
+ return strdup(name);
+}
+
+static int map_groups__set_modules_path(struct map_groups *self,
+ const char *root_dir)
+{
+ char *version;
char modules_path[PATH_MAX];

- if (uname(&uts) < 0)
+ version = get_kernel_version(root_dir);
+ if (!version)
return -1;

- snprintf(modules_path, sizeof(modules_path), "/lib/modules/%s/kernel",
- uts.release);
+ snprintf(modules_path, sizeof(modules_path), "%s/lib/modules/%s/kernel",
+ root_dir, version);
+ free(version);

return map_groups__set_modules_path_dir(self, modules_path);
}
@@ -1477,11 +1563,13 @@ static struct map *map__new2(u64 start,
}

struct map *map_groups__new_module(struct map_groups *self, u64 start,
- const char *filename)
+ const char *filename,
+ struct kernel_info *kerninfo)
{
struct map *map;
- struct dso *dso = __dsos__findnew(&dsos__kernel, filename);
+ struct dso *dso;

+ dso = __dsos__findnew(&kerninfo->dsos__kernel, filename);
if (dso == NULL)
return NULL;

@@ -1489,21 +1577,37 @@ struct map *map_groups__new_module(struc
if (map == NULL)
return NULL;

- dso->origin = DSO__ORIG_KMODULE;
+ if (is_host_kernel(kerninfo))
+ dso->origin = DSO__ORIG_KMODULE;
+ else
+ dso->origin = DSO__ORIG_GUEST_KMODULE;
map_groups__insert(self, map);
return map;
}

-static int map_groups__create_modules(struct map_groups *self)
+static int map_groups__create_modules(struct kernel_info *kerninfo)
{
char *line = NULL;
size_t n;
- FILE *file = fopen("/proc/modules", "r");
+ FILE *file;
struct map *map;
+ const char *root_dir;
+ const char *modules;
+ char path[PATH_MAX];
+
+ if (is_default_guest(kerninfo))
+ modules = symbol_conf.default_guest_modules;
+ else {
+ sprintf(path, "%s/proc/modules", kerninfo->root_dir);
+ modules = path;
+ }

+ file = fopen(modules, "r");
if (file == NULL)
return -1;

+ root_dir = kerninfo->root_dir;
+
while (!feof(file)) {
char name[PATH_MAX];
u64 start;
@@ -1532,16 +1636,17 @@ static int map_groups__create_modules(st
*sep = '\0';

snprintf(name, sizeof(name), "[%s]", line);
- map = map_groups__new_module(self, start, name);
+ map = map_groups__new_module(&kerninfo->kmaps,
+ start, name, kerninfo);
if (map == NULL)
goto out_delete_line;
- dso__kernel_module_get_build_id(map->dso);
+ dso__kernel_module_get_build_id(map->dso, root_dir);
}

free(line);
fclose(file);

- return map_groups__set_modules_path(self);
+ return map_groups__set_modules_path(&kerninfo->kmaps, root_dir);

out_delete_line:
free(line);
@@ -1708,8 +1813,54 @@ out_fixup:
return err;
}

-LIST_HEAD(dsos__user);
-LIST_HEAD(dsos__kernel);
+static int dso__load_guest_kernel_sym(struct dso *self, struct map *map,
+ symbol_filter_t filter)
+{
+ int err;
+ const char *kallsyms_filename = NULL;
+ struct kernel_info *kerninfo;
+ char path[PATH_MAX];
+
+ if (!map->groups) {
+ pr_debug("Guest kernel map has no groups pointer\n");
+ return -1;
+ }
+ kerninfo = map->groups->this_kerninfo;
+
+ if (is_default_guest(kerninfo)) {
+ /*
+ * If the user specified a vmlinux filename, use it and only
+ * it, reporting errors to the user if it cannot be used;
+ * otherwise use the guest kallsyms file given on the command line.
+ */
+ if (symbol_conf.default_guest_vmlinux_name != NULL) {
+ err = dso__load_vmlinux(self, map,
+ symbol_conf.default_guest_vmlinux_name, filter);
+ goto out_try_fixup;
+ }
+
+ kallsyms_filename = symbol_conf.default_guest_kallsyms;
+ if (!kallsyms_filename)
+ return -1;
+ } else {
+ sprintf(path, "%s/proc/kallsyms", kerninfo->root_dir);
+ kallsyms_filename = path;
+ }
+
+ err = dso__load_kallsyms(self, kallsyms_filename, map, filter);
+ if (err > 0)
+ pr_debug("Using %s for symbols\n", kallsyms_filename);
+
+out_try_fixup:
+ if (err > 0) {
+ if (kallsyms_filename != NULL)
+ dso__set_long_name(self, strdup("[guest.kernel.kallsyms]"));
+ map__fixup_start(map);
+ map__fixup_end(map);
+ }
+
+ return err;
+}

static void dsos__add(struct list_head *head, struct dso *dso)
{
@@ -1752,10 +1903,16 @@ static void __dsos__fprintf(struct list_
}
}

-void dsos__fprintf(FILE *fp)
+void dsos__fprintf(struct rb_root *kerninfo_root, FILE *fp)
{
- __dsos__fprintf(&dsos__kernel, fp);
- __dsos__fprintf(&dsos__user, fp);
+ struct rb_node *nd;
+
+ for (nd = rb_first(kerninfo_root); nd; nd = rb_next(nd)) {
+ struct kernel_info *pos = rb_entry(nd, struct kernel_info,
+ rb_node);
+ __dsos__fprintf(&pos->dsos__kernel, fp);
+ __dsos__fprintf(&pos->dsos__user, fp);
+ }
}

static size_t __dsos__fprintf_buildid(struct list_head *head, FILE *fp,
@@ -1773,10 +1930,21 @@ static size_t __dsos__fprintf_buildid(st
return ret;
}

-size_t dsos__fprintf_buildid(FILE *fp, bool with_hits)
+size_t dsos__fprintf_buildid(struct rb_root *kerninfo_root,
+ FILE *fp, bool with_hits)
{
- return (__dsos__fprintf_buildid(&dsos__kernel, fp, with_hits) +
- __dsos__fprintf_buildid(&dsos__user, fp, with_hits));
+ struct rb_node *nd;
+ size_t ret = 0;
+
+ for (nd = rb_first(kerninfo_root); nd; nd = rb_next(nd)) {
+ struct kernel_info *pos = rb_entry(nd, struct kernel_info,
+ rb_node);
+ ret += __dsos__fprintf_buildid(&pos->dsos__kernel,
+ fp, with_hits);
+ ret += __dsos__fprintf_buildid(&pos->dsos__user,
+ fp, with_hits);
+ }
+ return ret;
}

struct dso *dso__new_kernel(const char *name)
@@ -1785,28 +1953,55 @@ struct dso *dso__new_kernel(const char *

if (self != NULL) {
dso__set_short_name(self, "[kernel]");
- self->kernel = 1;
+ self->kernel = DSO_TYPE_KERNEL;
+ }
+
+ return self;
+}
+
+struct dso *dso__new_guest_kernel(const char *name)
+{
+ struct dso *self = dso__new(name ?: "[guest.kernel.kallsyms]");
+
+ if (self != NULL) {
+ dso__set_short_name(self, "[guest.kernel]");
+ self->kernel = DSO_TYPE_GUEST_KERNEL;
}

return self;
}

-void dso__read_running_kernel_build_id(struct dso *self)
+void dso__read_running_kernel_build_id(struct dso *self,
+ struct kernel_info *kerninfo)
{
- if (sysfs__read_build_id("/sys/kernel/notes", self->build_id,
+ char path[PATH_MAX];
+
+ if (is_default_guest(kerninfo))
+ return;
+ sprintf(path, "%s/sys/kernel/notes", kerninfo->root_dir);
+ if (sysfs__read_build_id(path, self->build_id,
sizeof(self->build_id)) == 0)
self->has_build_id = true;
}

-static struct dso *dsos__create_kernel(const char *vmlinux)
+static struct dso *dsos__create_kernel(struct kernel_info *kerninfo)
{
- struct dso *kernel = dso__new_kernel(vmlinux);
+ const char *vmlinux_name = NULL;
+ struct dso *kernel;

- if (kernel != NULL) {
- dso__read_running_kernel_build_id(kernel);
- dsos__add(&dsos__kernel, kernel);
+ if (is_host_kernel(kerninfo)) {
+ vmlinux_name = symbol_conf.vmlinux_name;
+ kernel = dso__new_kernel(vmlinux_name);
+ } else {
+ if (is_default_guest(kerninfo))
+ vmlinux_name = symbol_conf.default_guest_vmlinux_name;
+ kernel = dso__new_guest_kernel(vmlinux_name);
}

+ if (kernel != NULL) {
+ dso__read_running_kernel_build_id(kernel, kerninfo);
+ dsos__add(&kerninfo->dsos__kernel, kernel);
+ }
return kernel;
}

@@ -1950,23 +2145,29 @@ out_free_comm_list:
return -1;
}

-int map_groups__create_kernel_maps(struct map_groups *self,
- struct map *vmlinux_maps[MAP__NR_TYPES])
+int map_groups__create_kernel_maps(struct rb_root *kerninfo_root, pid_t pid)
{
- struct dso *kernel = dsos__create_kernel(symbol_conf.vmlinux_name);
+ struct kernel_info *kerninfo;
+ struct dso *kernel;

+ kerninfo = kerninfo__findnew(kerninfo_root, pid);
+ if (kerninfo == NULL)
+ return -1;
+ kernel = dsos__create_kernel(kerninfo);
if (kernel == NULL)
return -1;

- if (__map_groups__create_kernel_maps(self, vmlinux_maps, kernel) < 0)
+ if (__map_groups__create_kernel_maps(&kerninfo->kmaps,
+ kerninfo->vmlinux_maps, kernel) < 0)
return -1;

- if (symbol_conf.use_modules && map_groups__create_modules(self) < 0)
+ if (symbol_conf.use_modules &&
+ map_groups__create_modules(kerninfo) < 0)
pr_debug("Problems creating module maps, continuing anyway...\n");
/*
* Now that we have all the maps created, just set the ->end of them:
*/
- map_groups__fixup_end(self);
+ map_groups__fixup_end(&kerninfo->kmaps);
return 0;
}

@@ -2012,3 +2213,47 @@ char *strxfrchar(char *s, char from, cha

return s;
}
+
+int map_groups__create_guest_kernel_maps(struct rb_root *kerninfo_root)
+{
+ int ret = 0;
+ struct dirent **namelist = NULL;
+ int i, items = 0;
+ char path[PATH_MAX];
+ pid_t pid;
+
+ if (symbol_conf.default_guest_vmlinux_name ||
+ symbol_conf.default_guest_modules ||
+ symbol_conf.default_guest_kallsyms) {
+ map_groups__create_kernel_maps(kerninfo_root,
+ DEFAULT_GUEST_KERNEL_ID);
+ }
+
+ if (symbol_conf.guestmount) {
+ items = scandir(symbol_conf.guestmount, &namelist, NULL, NULL);
+ if (items <= 0)
+ return -ENOENT;
+ for (i = 0; i < items; i++) {
+ if (!isdigit(namelist[i]->d_name[0])) {
+ /* guest dirs are named by pid; skip ".", ".." and non-pid entries */
+ continue;
+ }
+ pid = atoi(namelist[i]->d_name);
+ sprintf(path, "%s/%s/proc/kallsyms",
+ symbol_conf.guestmount,
+ namelist[i]->d_name);
+ ret = access(path, R_OK);
+ if (ret) {
+ pr_debug("Can't access file %s\n", path);
+ goto failure;
+ }
+ map_groups__create_kernel_maps(kerninfo_root,
+ pid);
+ }
+failure:
+ free(namelist);
+ }
+
+ return ret;
+}
+
diff -Nraup linux-2.6_tip0413/tools/perf/util/symbol.h linux-2.6_tip0413_perfkvm/tools/perf/util/symbol.h
--- linux-2.6_tip0413/tools/perf/util/symbol.h 2010-04-14 11:11:58.766255670 +0800
+++ linux-2.6_tip0413_perfkvm/tools/perf/util/symbol.h 2010-04-14 11:13:17.321860837 +0800
@@ -69,10 +69,15 @@ struct symbol_conf {
show_nr_samples,
use_callchain,
exclude_other,
- full_paths;
+ full_paths,
+ show_cpu_utilization;
const char *vmlinux_name,
*field_sep;
- char *dso_list_str,
+ const char *default_guest_vmlinux_name,
+ *default_guest_kallsyms,
+ *default_guest_modules;
+ const char *guestmount;
+ char *dso_list_str,
*comm_list_str,
*sym_list_str,
*col_width_list_str;
@@ -106,6 +111,13 @@ struct addr_location {
u64 addr;
char level;
bool filtered;
+ unsigned int cpumode;
+};
+
+enum dso_kernel_type {
+ DSO_TYPE_USER = 0,
+ DSO_TYPE_KERNEL,
+ DSO_TYPE_GUEST_KERNEL
};

struct dso {
@@ -115,7 +127,7 @@ struct dso {
u8 adjust_symbols:1;
u8 slen_calculated:1;
u8 has_build_id:1;
- u8 kernel:1;
+ enum dso_kernel_type kernel;
u8 hit:1;
u8 annotate_warned:1;
unsigned char origin;
@@ -131,6 +143,7 @@ struct dso {

struct dso *dso__new(const char *name);
struct dso *dso__new_kernel(const char *name);
+struct dso *dso__new_guest_kernel(const char *name);
void dso__delete(struct dso *self);

bool dso__loaded(const struct dso *self, enum map_type type);
@@ -143,34 +156,30 @@ static inline void dso__set_loaded(struc

void dso__sort_by_name(struct dso *self, enum map_type type);

-extern struct list_head dsos__user, dsos__kernel;
-
struct dso *__dsos__findnew(struct list_head *head, const char *name);

-static inline struct dso *dsos__findnew(const char *name)
-{
- return __dsos__findnew(&dsos__user, name);
-}
-
int dso__load(struct dso *self, struct map *map, symbol_filter_t filter);
int dso__load_vmlinux_path(struct dso *self, struct map *map,
symbol_filter_t filter);
int dso__load_kallsyms(struct dso *self, const char *filename, struct map *map,
symbol_filter_t filter);
-void dsos__fprintf(FILE *fp);
-size_t dsos__fprintf_buildid(FILE *fp, bool with_hits);
+void dsos__fprintf(struct rb_root *kerninfo_root, FILE *fp);
+size_t dsos__fprintf_buildid(struct rb_root *kerninfo_root,
+ FILE *fp, bool with_hits);

size_t dso__fprintf_buildid(struct dso *self, FILE *fp);
size_t dso__fprintf(struct dso *self, enum map_type type, FILE *fp);

enum dso_origin {
DSO__ORIG_KERNEL = 0,
+ DSO__ORIG_GUEST_KERNEL,
DSO__ORIG_JAVA_JIT,
DSO__ORIG_BUILD_ID_CACHE,
DSO__ORIG_FEDORA,
DSO__ORIG_UBUNTU,
DSO__ORIG_BUILDID,
DSO__ORIG_DSO,
+ DSO__ORIG_GUEST_KMODULE,
DSO__ORIG_KMODULE,
DSO__ORIG_NOT_FOUND,
};
@@ -178,19 +187,26 @@ enum dso_origin {
char dso__symtab_origin(const struct dso *self);
void dso__set_long_name(struct dso *self, char *name);
void dso__set_build_id(struct dso *self, void *build_id);
-void dso__read_running_kernel_build_id(struct dso *self);
+void dso__read_running_kernel_build_id(struct dso *self,
+ struct kernel_info *kerninfo);
struct symbol *dso__find_symbol(struct dso *self, enum map_type type, u64 addr);
struct symbol *dso__find_symbol_by_name(struct dso *self, enum map_type type,
const char *name);

int filename__read_build_id(const char *filename, void *bf, size_t size);
int sysfs__read_build_id(const char *filename, void *bf, size_t size);
-bool dsos__read_build_ids(bool with_hits);
+bool __dsos__read_build_ids(struct list_head *head, bool with_hits);
int build_id__sprintf(const u8 *self, int len, char *bf);
int kallsyms__parse(const char *filename, void *arg,
int (*process_symbol)(void *arg, const char *name,
char type, u64 start));

+int __map_groups__create_kernel_maps(struct map_groups *self,
+ struct map *vmlinux_maps[MAP__NR_TYPES],
+ struct dso *kernel);
+int map_groups__create_kernel_maps(struct rb_root *kerninfo_root, pid_t pid);
+int map_groups__create_guest_kernel_maps(struct rb_root *kerninfo_root);
+
int symbol__init(void);
bool symbol_type__is_a(char symbol_type, enum map_type map_type);

diff -Nraup linux-2.6_tip0413/tools/perf/util/thread.h linux-2.6_tip0413_perfkvm/tools/perf/util/thread.h
--- linux-2.6_tip0413/tools/perf/util/thread.h 2010-04-14 11:11:58.594236160 +0800
+++ linux-2.6_tip0413_perfkvm/tools/perf/util/thread.h 2010-04-14 11:13:17.321860837 +0800
@@ -33,12 +33,12 @@ static inline struct map *thread__find_m

void thread__find_addr_map(struct thread *self,
struct perf_session *session, u8 cpumode,
- enum map_type type, u64 addr,
+ enum map_type type, pid_t pid, u64 addr,
struct addr_location *al);

void thread__find_addr_location(struct thread *self,
struct perf_session *session, u8 cpumode,
- enum map_type type, u64 addr,
+ enum map_type type, pid_t pid, u64 addr,
struct addr_location *al,
symbol_filter_t filter);
#endif /* __PERF_THREAD_H */


2010-04-14 09:20:37

by Avi Kivity

Subject: Re: [PATCH V3] perf & kvm: Enhance perf to collect KVM guest os statistics from host side

On 04/14/2010 12:05 PM, Zhang, Yanmin wrote:
> Here is the new patch of V3 against tip/master of April 13th
> if anyone wants to try it.
>
>

Thanks for persisting despite the flames.

Can you please separate arch/x86/kvm part of the patch? That will make
for easier reviewing, and will need to go through separate trees.

Sheng, did you make any progress with the NMI injection issue?

> +
> diff -Nraup linux-2.6_tip0413/arch/x86/kvm/x86.c linux-2.6_tip0413_perfkvm/arch/x86/kvm/x86.c
> --- linux-2.6_tip0413/arch/x86/kvm/x86.c 2010-04-14 11:11:04.341042024 +0800
> +++ linux-2.6_tip0413_perfkvm/arch/x86/kvm/x86.c 2010-04-14 11:32:45.841278890 +0800
> @@ -3765,6 +3765,35 @@ static void kvm_timer_init(void)
> }
> }
>
> +static DEFINE_PER_CPU(struct kvm_vcpu *, current_vcpu);
> +
> +static int kvm_is_in_guest(void)
> +{
> + return percpu_read(current_vcpu) != NULL;
>

An even more accurate way to determine this is to check whether the
interrupt frame points back at the 'int $2' instruction. However, we
plan to switch to a self-IPI method to inject the NMI, and I'm not sure
whether APIC NMIs are accepted on an instruction boundary or whether
there's some latency involved.

> +static unsigned long kvm_get_guest_ip(void)
> +{
> + unsigned long ip = 0;
> + if (percpu_read(current_vcpu))
> + ip = kvm_rip_read(percpu_read(current_vcpu));
> + return ip;
> +}
>

This may be racy. kvm_rip_read() accesses a cache in memory; if we're
in the process of updating the cache, then we may read a stale value.
See below.

>
> trace_kvm_entry(vcpu->vcpu_id);
> +
> + percpu_write(current_vcpu, vcpu);
> kvm_x86_ops->run(vcpu);
> + percpu_write(current_vcpu, NULL);
>

If you move this around the 'int $2' instructions you will close the
race, as a stray NMI won't catch us updating the rip cache. But that
depends on whether self-IPI is accepted on the next instruction or not.


--
error compiling committee.c: too many arguments to function

2010-04-14 09:43:47

by Sheng Yang

Subject: Re: [PATCH V3] perf & kvm: Enhance perf to collect KVM guest os statistics from host side

On Wednesday 14 April 2010 17:20:15 Avi Kivity wrote:
> On 04/14/2010 12:05 PM, Zhang, Yanmin wrote:
> > Here is the new patch of V3 against tip/master of April 13th
> > if anyone wants to try it.
>
> Thanks for persisting despite the flames.
>
> Can you please separate arch/x86/kvm part of the patch? That will make
> for easier reviewing, and will need to go through separate trees.
>
> Sheng, did you make any progress with the NMI injection issue?

Yes, though some other work has interrupted me lately...

The very first version had an issue because SELF_IPI mode can't be used to
send an NMI, according to the SDM. That's also the reason why x2APIC doesn't
provide a way to do this.

But later I found another issue: failing to inspect inside the guest. I think
it's because the NMI is an asynchronous event; though it should be triggered
very quickly, with the current code you can't guarantee that the handler runs
before the state (current_vcpu) is cleared.

Maybe just extending the "guest state" region would be fine, if the latency is
stable enough (though I think it may be platform dependent). I am working on
this now.

--
regards
Yang, Sheng

>
> > +
> > diff -Nraup linux-2.6_tip0413/arch/x86/kvm/x86.c
> > linux-2.6_tip0413_perfkvm/arch/x86/kvm/x86.c ---
> > linux-2.6_tip0413/arch/x86/kvm/x86.c 2010-04-14 11:11:04.341042024 +0800
> > +++ linux-2.6_tip0413_perfkvm/arch/x86/kvm/x86.c 2010-04-14
> > 11:32:45.841278890 +0800 @@ -3765,6 +3765,35 @@ static void
> > kvm_timer_init(void)
> > }
> > }
> >
> > +static DEFINE_PER_CPU(struct kvm_vcpu *, current_vcpu);
> > +
> > +static int kvm_is_in_guest(void)
> > +{
> > + return percpu_read(current_vcpu) != NULL;
>
> An even more accurate way to determine this is to check whether the
> interrupt frame points back at the 'int $2' instruction. However we
> plan to switch to a self-IPI method to inject the NMI, and I'm not sure
> wether APIC NMIs are accepted on an instruction boundary or whether
> there's some latency involved.
>
> > +static unsigned long kvm_get_guest_ip(void)
> > +{
> > + unsigned long ip = 0;
> > + if (percpu_read(current_vcpu))
> > + ip = kvm_rip_read(percpu_read(current_vcpu));
> > + return ip;
> > +}
>
> This may be racy. kvm_rip_read() accesses a cache in memory; if we're
> in the process of updating the cache, then we may read a stale value.
> See below.
>
> > trace_kvm_entry(vcpu->vcpu_id);
> > +
> > + percpu_write(current_vcpu, vcpu);
> > kvm_x86_ops->run(vcpu);
> > + percpu_write(current_vcpu, NULL);
>
> If you move this around the 'int $2' instructions you will close the
> race, as a stray NMI won't catch us updating the rip cache. But that
> depends on whether self-IPI is accepted on the next instruction or not.
>

2010-04-14 09:58:10

by Avi Kivity

Subject: Re: [PATCH V3] perf & kvm: Enhance perf to collect KVM guest os statistics from host side

On 04/14/2010 12:43 PM, Sheng Yang wrote:
> On Wednesday 14 April 2010 17:20:15 Avi Kivity wrote:
>
>> On 04/14/2010 12:05 PM, Zhang, Yanmin wrote:
>>
>>> Here is the new patch of V3 against tip/master of April 13th
>>> if anyone wants to try it.
>>>
>> Thanks for persisting despite the flames.
>>
>> Can you please separate arch/x86/kvm part of the patch? That will make
>> for easier reviewing, and will need to go through separate trees.
>>
>> Sheng, did you make any progress with the NMI injection issue?
>>
> Yes, though some other works interrupt me lately...
>
> The very first version has issue due to SELF_IPI mode can't be used to send
> NMI according to SDM. That's the reason why x2apic don't have way to do this.
>

Yes, I see that now. Looks like others have the same questions...

> But later I found another issue of fail to inspect inside the guest. I think
> it's due to NMI is asynchronous event, though it should be triggered very
> quickly, you can't guarantee that the handler would be triggered before the
> state(current_vcpu) is cleared with current code.
>
> Maybe just extended the "guest state" region would be fine, if the latency is
> stable enough(though I think it maybe platform depended). I am working on this
> now.
>

I wouldn't like to depend on model specific behaviour.

One option is to read all the information synchronously and store it in
a per-cpu area with atomic instructions, then queue the NMI. Another
option is to have another callback which tells us that the NMI is done,
and have a busy loop wait until the NMI is delivered.
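The first option — read the state synchronously into a per-cpu record before queueing the NMI, so the handler never touches a possibly mid-update rip cache — might look like the following sketch. The names (`capture_and_queue_nmi`, `perf_nmi_handler`, `guest_sample`) are hypothetical, and this single-threaded model makes the "atomic" aspect implicit.

```c
#include <assert.h>

/* Hypothetical per-cpu snapshot consumed by the NMI handler. */
struct guest_sample {
	int valid;
	unsigned long rip;
};

static struct guest_sample pending;
static unsigned long last_recorded_rip;

/* Capture the guest rip while it is known to be stable, then queue the
 * self-NMI; the handler reads only the snapshot, never the rip cache. */
static void capture_and_queue_nmi(unsigned long guest_rip)
{
	pending.rip = guest_rip;
	pending.valid = 1;
	/* queue_self_nmi() would go here in the real code */
}

static void perf_nmi_handler(void)
{
	if (pending.valid) {
		last_recorded_rip = pending.rip;  /* record the sample */
		pending.valid = 0;
	}
}
```

The point of the design is that the asynchronous handler only ever sees a consistent snapshot, regardless of when the NMI actually lands.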

--
error compiling committee.c: too many arguments to function

2010-04-14 10:14:35

by Sheng Yang

Subject: Re: [PATCH V3] perf & kvm: Enhance perf to collect KVM guest os statistics from host side

On Wednesday 14 April 2010 17:57:50 Avi Kivity wrote:
> On 04/14/2010 12:43 PM, Sheng Yang wrote:
> > On Wednesday 14 April 2010 17:20:15 Avi Kivity wrote:
> >> On 04/14/2010 12:05 PM, Zhang, Yanmin wrote:
> >>> Here is the new patch of V3 against tip/master of April 13th
> >>> if anyone wants to try it.
> >>
> >> Thanks for persisting despite the flames.
> >>
> >> Can you please separate arch/x86/kvm part of the patch? That will make
> >> for easier reviewing, and will need to go through separate trees.
> >>
> >> Sheng, did you make any progress with the NMI injection issue?
> >
> > Yes, though some other works interrupt me lately...
> >
> > The very first version has issue due to SELF_IPI mode can't be used to
> > send NMI according to SDM. That's the reason why x2apic don't have way to
> > do this.
>
> Yes, I see that now. Looks like others have the same questions...
>
> > But later I found another issue of fail to inspect inside the guest. I
> > think it's due to NMI is asynchronous event, though it should be
> > triggered very quickly, you can't guarantee that the handler would be
> > triggered before the state(current_vcpu) is cleared with current code.
> >
> > Maybe just extended the "guest state" region would be fine, if the
> > latency is stable enough(though I think it maybe platform depended). I am
> > working on this now.
>
> I wouldn't like to depend on model specific behaviour.
>
> One option is to read all the information synchronously and store it in
> a per-cpu area with atomic instructions, then queue the NMI. Another
> option is to have another callback which tells us that the NMI is done,
> and have a busy loop wait until the NMI is delivered.
>
A callback seems too heavy; it may affect performance badly. Maybe a short
queue would help, though that one is more complex.

But I am still curious how much it would help if we extend the region. I
should get a result soon...

--
regards
Yang, Sheng

2010-04-14 10:20:23

by Avi Kivity

Subject: Re: [PATCH V3] perf & kvm: Enhance perf to collect KVM guest os statistics from host side

On 04/14/2010 01:14 PM, Sheng Yang wrote:
>
>> I wouldn't like to depend on model specific behaviour.
>>
>> One option is to read all the information synchronously and store it in
>> a per-cpu area with atomic instructions, then queue the NMI. Another
>> option is to have another callback which tells us that the NMI is done,
>> and have a busy loop wait until the NMI is delivered.
>>
>>
> Callback seems too heavy, may affect the performance badly. Maybe a short
> queue would help, though this one is more complex.
>

The patch we're replying to adds callbacks (to read rip, etc.), so it's
no big deal. For the queue solution, a queue of size one would probably
be sufficient even if not guaranteed by the spec. I don't see how the
cpu can do another guest entry without delivering the NMI.

> But I am still curious if we extend the region, how much it would help. Would
> get a result soon...
>

Yes, interesting to see what the latency is. If it's reasonably short
(and I expect it will be so), we can do the busy wait solution.

If we have an NMI counter somewhere, we can simply wait until it changes.

--
error compiling committee.c: too many arguments to function

2010-04-14 10:27:36

by Sheng Yang

Subject: Re: [PATCH V3] perf & kvm: Enhance perf to collect KVM guest os statistics from host side

On Wednesday 14 April 2010 18:19:49 Avi Kivity wrote:
> On 04/14/2010 01:14 PM, Sheng Yang wrote:
> >> I wouldn't like to depend on model specific behaviour.
> >>
> >> One option is to read all the information synchronously and store it in
> >> a per-cpu area with atomic instructions, then queue the NMI. Another
> >> option is to have another callback which tells us that the NMI is done,
> >> and have a busy loop wait until the NMI is delivered.
> >
> > Callback seems too heavy, may affect the performance badly. Maybe a short
> > queue would help, though this one is more complex.
>
> The patch we're replying to adds callbacks (to read rip, etc.), so it's
> no big deal. For the queue solution, a queue of size one would probably
> be sufficient even if not guaranteed by the spec. I don't see how the
> cpu can do another guest entry without delivering the NMI.
>
> > But I am still curious if we extend the region, how much it would help.
> > Would get a result soon...
>
> Yes, interesting to see what the latency is. If it's reasonably short
> (and I expect it will be so), we can do the busy wait solution.
>
> If we have an NMI counter somewhere, we can simply wait until it changes.

Good idea. Of course we have one (at least on x86): the per-cpu
irq_stat.__nmi_count. :)

--
regards
Yang, Sheng

2010-04-14 10:33:58

by Avi Kivity

Subject: Re: [PATCH V3] perf & kvm: Enhance perf to collect KVM guest os statistics from host side

On 04/14/2010 01:27 PM, Sheng Yang wrote:
>
>> Yes, interesting to see what the latency is. If it's reasonably short
>> (and I expect it will be so), we can do the busy wait solution.
>>
>> If we have an NMI counter somewhere, we can simply wait until it changes.
>>
>
> Good idea. Of course we have one (at least on x86). There is
> irq_stat.__nmi_count per cpu. :)
>

Okay, but kvm doesn't want to know about it. How about a new arch
function, invoke_nmi_sync(), that will trigger the NMI and wait for it?
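A minimal sketch of such an invoke_nmi_sync(): snapshot the counter, trigger the NMI, and spin until the counter moves. This is a user-space model — the "injection" here is a direct call, whereas the real NMI is delivered asynchronously, which is exactly why the wait loop is needed in the kernel.

```c
#include <assert.h>

static volatile unsigned long nmi_count;  /* stands in for irq_stat.__nmi_count */

static void nmi_handler(void)
{
	nmi_count++;
}

static void trigger_nmi(void)
{
	/* real delivery is asynchronous; modeled as a direct call here */
	nmi_handler();
}

/* Trigger an NMI and wait until the handler has visibly run, using the
 * counter as the "NMI is done" signal. */
static void invoke_nmi_sync(void)
{
	unsigned long seen = nmi_count;

	trigger_nmi();
	while (nmi_count == seen)
		;  /* cpu_relax() in the kernel */
}
```

The arch function hides the counter from kvm: the caller only knows that when invoke_nmi_sync() returns, the NMI handler has completed.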

--
error compiling committee.c: too many arguments to function

2010-04-14 10:36:21

by Sheng Yang

Subject: Re: [PATCH V3] perf & kvm: Enhance perf to collect KVM guest os statistics from host side

On Wednesday 14 April 2010 18:33:37 Avi Kivity wrote:
> On 04/14/2010 01:27 PM, Sheng Yang wrote:
> >> Yes, interesting to see what the latency is. If it's reasonably short
> >> (and I expect it will be so), we can do the busy wait solution.
> >>
> >> If we have an NMI counter somewhere, we can simply wait until it
> >> changes.
> >
> > Good idea. Of course we have one(at least on x86). There is
> > irq_stat.irq__nmi_count for per cpu. :)
>
> Okay, but kvm doesn't want to know about it. How about a new arch
> function, invoke_nmi_sync(), that will trigger the NMI and wait for it?
>
Sounds reasonable. I will try it.

--
regards
Yang, Sheng

2010-04-14 10:43:49

by Ingo Molnar

Subject: Re: [PATCH V3] perf & kvm: Enhance perf to collect KVM guest os statistics from host side


* Avi Kivity <[email protected]> wrote:

> On 04/14/2010 12:05 PM, Zhang, Yanmin wrote:
> >Here is the new patch of V3 against tip/master of April 13th
> >if anyone wants to try it.
> >
>
> Thanks for persisting despite the flames.
>
> Can you please separate arch/x86/kvm part of the patch? That will make for
> easier reviewing, and will need to go through separate trees.

Once it gets into a state that it can be applied could you please create a
separate, -git based branch for it, so that i can pull it for testing and
integration with the tools/perf/ bits?

Assuming there are no serious conflicts with pending KVM work.

(or i can do that too)

Thanks,

Ingo

2010-04-14 11:17:39

by Avi Kivity

Subject: Re: [PATCH V3] perf & kvm: Enhance perf to collect KVM guest os statistics from host side

On 04/14/2010 01:43 PM, Ingo Molnar wrote:
>>
>> Thanks for persisting despite the flames.
>>
>> Can you please separate arch/x86/kvm part of the patch? That will make for
>> easier reviewing, and will need to go through separate trees.
>>
> Once it gets into a state that it can be applied could you please create a
> separate, -git based branch for it, so that i can pull it for testing and
> integration with the tools/perf/ bits?
>
>

Sure.

> Assuming there are no serious conflicts with pending KVM work.
>

There will be a conflict with the NMI fix (which has to go in first,
we'll want to backport it), I'll put it on the same branch.

--
error compiling committee.c: too many arguments to function

2010-04-15 08:06:09

by Avi Kivity

Subject: Re: [PATCH V3] perf & kvm: Enhance perf to collect KVM guest os statistics from host side

On 04/15/2010 04:04 AM, Zhang, Yanmin wrote:
>
>> An even more accurate way to determine this is to check whether the
>> interrupt frame points back at the 'int $2' instruction. However we
>> plan to switch to a self-IPI method to inject the NMI, and I'm not sure
>> wether APIC NMIs are accepted on an instruction boundary or whether
>> there's some latency involved.
>>
> Yes. But the frame pointer checking seems a little complicated.
>

An even bigger disadvantage is that it won't work with Sheng's patch:
self-NMIs are not synchronous.

>>> trace_kvm_entry(vcpu->vcpu_id);
>>> +
>>> + percpu_write(current_vcpu, vcpu);
>>> kvm_x86_ops->run(vcpu);
>>> + percpu_write(current_vcpu, NULL);
>>>
>>>
>> If you move this around the 'int $2' instructions you will close the
>> race, as a stray NMI won't catch us updating the rip cache. But that
>> depends on whether self-IPI is accepted on the next instruction or not.
>>
> Right. The kernel part depends on the self-IPI implementation.
> I will move the above percpu_write(current_vcpu, vcpu) (or a new wrapper
> function) to just around 'int $2'.
>
>

Or create a new function to inject the interrupt in x86.c. That will
reduce duplication between svm.c and vmx.c.

> Sheng will find a solution for the self-IPI delivery. Let's treat my patch
> and the self-IPI as two separate issues, since we don't know when the
> self-IPI delivery issue will be resolved.
>

Sure.

--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

2010-04-15 09:04:08

by Joerg Roedel

Subject: Re: [PATCH V3] perf & kvm: Enhance perf to collect KVM guest os statistics from host side

On Thu, Apr 15, 2010 at 04:57:38PM +0800, Zhang, Yanmin wrote:

> I checked svm.c and it seems svm.c doesn't trigger a NMI to host if the NMI
> happens in guest os. In addition, svm_complete_interrupts is called after
> interrupt is enabled.

Yes. The NMI is held pending by the hardware until the STGI instruction
is executed.
And for nested svm the svm_complete_interrupts function needs to be
executed after the nested exit handling. Therefore it is done late on
svm.

Joerg

2010-04-15 09:09:52

by Avi Kivity

Subject: Re: [PATCH V3] perf & kvm: Enhance perf to collect KVM guest os statistics from host side

On 04/15/2010 12:04 PM, Joerg Roedel wrote:
> On Thu, Apr 15, 2010 at 04:57:38PM +0800, Zhang, Yanmin wrote:
>
>
>> I checked svm.c and it seems svm.c doesn't trigger a NMI to host if the NMI
>> happens in guest os. In addition, svm_complete_interrupts is called after
>> interrupt is enabled.
>>
> Yes. The NMI is held pending by the hardware until the STGI instruction
> is executed.
> And for nested svm the svm_complete_interrupts function needs to be
> executed after the nested exit handling. Therefore it is done late on
> svm.
>

So, we'd need something like the following:

	if (exit == NMI)
		__get_cpu_var(nmi_vcpu) = vcpu;

	stgi();

	if (exit == NMI) {
		while (!nmi_handled())
			cpu_relax();
		__get_cpu_var(nmi_vcpu) = NULL;
	}

and no code sharing between vmx and svm.
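As a sanity check of that flow, here is a user-space model. The names (`handle_exit`, `perf_nmi_handler`) and the synchronous behavior of stgi() are assumptions of the model: on real SVM hardware the pending NMI fires shortly after the STGI instruction, which is what the busy-wait accounts for.

```c
#include <assert.h>
#include <stddef.h>

struct vcpu { int id; };

static struct vcpu *nmi_vcpu;     /* per-cpu in the kernel */
static int nmi_pending;
static int nmis_handled;
static int handler_saw_guest;

static void perf_nmi_handler(void)
{
	/* attribution decision: was this NMI caused by a guest? */
	handler_saw_guest = (nmi_vcpu != NULL);
	nmis_handled++;
}

/* STGI re-enables NMIs; an NMI held pending by hardware fires now. */
static void stgi(void)
{
	if (nmi_pending) {
		nmi_pending = 0;
		perf_nmi_handler();
	}
}

static int nmi_handled(void)
{
	return nmis_handled > 0;
}

static void handle_exit(struct vcpu *vcpu, int exit_was_nmi)
{
	if (exit_was_nmi) {
		nmi_pending = 1;   /* hardware holds the NMI until stgi */
		nmi_vcpu = vcpu;
	}
	stgi();
	if (exit_was_nmi) {
		while (!nmi_handled())
			;          /* cpu_relax() in the kernel */
		nmi_vcpu = NULL;
	}
}
```

The guard stays set across stgi() and is only cleared once the handler has demonstrably run, so the sample cannot miss the vcpu pointer.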

--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

2010-04-15 09:44:10

by Joerg Roedel

Subject: Re: [PATCH V3] perf & kvm: Enhance perf to collect KVM guest os statistics from host side

On Thu, Apr 15, 2010 at 12:09:28PM +0300, Avi Kivity wrote:
> On 04/15/2010 12:04 PM, Joerg Roedel wrote:
>> On Thu, Apr 15, 2010 at 04:57:38PM +0800, Zhang, Yanmin wrote:
>>
>>
>>> I checked svm.c and it seems svm.c doesn't trigger a NMI to host if the NMI
>>> happens in guest os. In addition, svm_complete_interrupts is called after
>>> interrupt is enabled.
>>>
>> Yes. The NMI is held pending by the hardware until the STGI instruction
>> is executed.
>> And for nested svm the svm_complete_interrupts function needs to be
>> executed after the nested exit handling. Therefore it is done late on
>> svm.
>>
>
> So, we'd need something like the following:
>
> if (exit == NMI)
> __get_cpu_var(nmi_vcpu) = vcpu;
>
> stgi();
>
> if (exit == NMI) {
> while (!nmi_handled())
> cpu_relax();
> __get_cpu_var(nmi_vcpu) = NULL;
> }

Hmm, looks a bit complicated to me. The NMI should happen shortly after
the stgi instruction. Interrupts are still disabled so we stay on this
cpu. Can't we just set and erase the cpu_var at vcpu_load/vcpu_put time?

Joerg

2010-04-15 09:48:34

by Avi Kivity

Subject: Re: [PATCH V3] perf & kvm: Enhance perf to collect KVM guest os statistics from host side

On 04/15/2010 12:44 PM, Joerg Roedel wrote:
>
>> So, we'd need something like the following:
>>
>> if (exit == NMI)
>> __get_cpu_var(nmi_vcpu) = vcpu;
>>
>> stgi();
>>
>> if (exit == NMI) {
>> while (!nmi_handled())
>> cpu_relax();
>> __get_cpu_var(nmi_vcpu) = NULL;
>> }
>>
> Hmm, looks a bit complicated to me. The NMI should happen shortly after
> the stgi instruction. Interrupts are still disabled so we stay on this
> cpu. Can't we just set and erase the cpu_var at vcpu_load/vcpu_put time?
>
>

That means an NMI that happens outside guest code (for example, in the
mmu, or during the exit itself) would be counted as if in guest code.

--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

2010-04-15 10:40:54

by Joerg Roedel

Subject: Re: [PATCH V3] perf & kvm: Enhance perf to collect KVM guest os statistics from host side

On Thu, Apr 15, 2010 at 12:48:09PM +0300, Avi Kivity wrote:
> On 04/15/2010 12:44 PM, Joerg Roedel wrote:
>>
>>> So, we'd need something like the following:
>>>
>>> if (exit == NMI)
>>> __get_cpu_var(nmi_vcpu) = vcpu;
>>>
>>> stgi();
>>>
>>> if (exit == NMI) {
>>> while (!nmi_handled())
>>> cpu_relax();
>>> __get_cpu_var(nmi_vcpu) = NULL;
>>> }
>>>
>> Hmm, looks a bit complicated to me. The NMI should happen shortly after
>> the stgi instruction. Interrupts are still disabled so we stay on this
>> cpu. Can't we just set and erase the cpu_var at vcpu_load/vcpu_put time?
>>
>>
>
> That means an NMI that happens outside guest code (for example, in the
> mmu, or during the exit itself) would be counted as if in guest code.

Hmm, true. The same is true for an NMI that happens between VMSAVE and
STGI but that window is smaller. Anyway, I think we don't need the
busy-wait loop. The NMI should be executed at a well defined point and
we set the cpu_var back to NULL after that point.

Joerg

2010-04-15 10:44:41

by Avi Kivity

Subject: Re: [PATCH V3] perf & kvm: Enhance perf to collect KVM guest os statistics from host side

On 04/15/2010 01:40 PM, Joerg Roedel wrote:
>
>> That means an NMI that happens outside guest code (for example, in the
>> mmu, or during the exit itself) would be counted as if in guest code.
>>
> Hmm, true. The same is true for an NMI that happens between VMSAVE and
> STGI but that window is smaller. Anyway, I think we don't need the
> busy-wait loop. The NMI should be executed at a well defined point and
> we set the cpu_var back to NULL after that point.
>

The point is not well defined. Considering there are already at least
two svm implementations, I don't want to rely on implementation details.

We could tune the position of the loop so that zero iterations are
executed on the implementations we know about.

--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

2010-04-15 14:08:22

by Sheng Yang

Subject: Re: [PATCH V3] perf & kvm: Enhance perf to collect KVM guest os statistics from host side

On Thursday 15 April 2010 18:44:15 Avi Kivity wrote:
> On 04/15/2010 01:40 PM, Joerg Roedel wrote:
> >> That means an NMI that happens outside guest code (for example, in the
> >> mmu, or during the exit itself) would be counted as if in guest code.
> >
> > Hmm, true. The same is true for an NMI that happens between VMSAVE and
> > STGI but that window is smaller. Anyway, I think we don't need the
> > busy-wait loop. The NMI should be executed at a well defined point and
> > we set the cpu_var back to NULL after that point.
>
> The point is not well defined. Considering there are already at least
> two implementations svm, I don't want to rely on implementation details.

After more investigating, I realized that I had interpreted the SDM wrongly.
Sorry.

There is *no* risk with the original method of calling "int $2".

According to the SDM 24.1:

> The following bullets detail when architectural state is and is not updated
> in response to VM exits:
> [...]
> - An NMI causes subsequent NMIs to be blocked, but only after the VM exit
> completes.

So the truth is: after an NMI directly causes a VM exit, subsequent NMIs are
blocked until the next "iret" is encountered. So executing "int $2" in
vmx_complete_interrupts() is safe, with no risk of causing a nested NMI. And
it would unblock subsequent NMIs as well, due to the "iret" it executes.

So it is unnecessary to make a change to avoid a "potential nested NMI".

Sorry for the mistake and the confusion it caused.

--
regards
Yang, Sheng

>
> We could tune the position of the loop so that zero iterations are
> executed on the implementations we know about.
>

2010-04-17 18:13:13

by Avi Kivity

Subject: Re: [PATCH V3] perf & kvm: Enhance perf to collect KVM guest os statistics from host side

On 04/15/2010 05:08 PM, Sheng Yang wrote:
> On Thursday 15 April 2010 18:44:15 Avi Kivity wrote:
>
>> On 04/15/2010 01:40 PM, Joerg Roedel wrote:
>>
>>>> That means an NMI that happens outside guest code (for example, in the
>>>> mmu, or during the exit itself) would be counted as if in guest code.
>>>>
>>> Hmm, true. The same is true for an NMI that happens between VMSAVE and
>>> STGI but that window is smaller. Anyway, I think we don't need the
>>> busy-wait loop. The NMI should be executed at a well defined point and
>>> we set the cpu_var back to NULL after that point.
>>>
>> The point is not well defined. Considering there are already at least
>> two implementations svm, I don't want to rely on implementation details.
>>
> After more investigating, I realized that I had interpreted the SDM wrong.
> Sorry.
>
> There is *no* risk with the original method of calling "int $2".
>
> According to the SDM 24.1:
>
> > The following bullets detail when architectural state is and is not updated
> > in response to VM exits:
> > [...]
> > - An NMI causes subsequent NMIs to be blocked, but only after the VM exit
> > completes.
>
> So the truth is, after NMI directly caused VMExit, the following NMIs would be
> blocked, until encountered next "iret". So execute "int $2" is safe in
> vmx_complete_interrupts(), no risk in causing nested NMI. And it would unblock
> the following NMIs as well due to "iret" it executed.
>
> So there is unnecessary to make change to avoid "potential nested NMI".
>

Let's look at the surrounding text...

> The following bullets detail when architectural state is and is not updated
> in response to VM exits:
> • If an event causes a VM exit directly, it does not update architectural
>   state as it would have had it not caused the VM exit:
>   — A debug exception does not update DR6, DR7.GD, or IA32_DEBUGCTL.LBR.
>     (Information about the nature of the debug exception is saved in the
>     exit-qualification field.)
>   — A page fault does not update CR2. (The linear address causing the page
>     fault is saved in the exit-qualification field.)
>   — An NMI causes subsequent NMIs to be blocked, but only after the VM exit
>     completes.
>   — An external interrupt does not acknowledge the interrupt controller and
>     the interrupt remains pending, unless the "acknowledge interrupt on
>     exit" VM-exit control is 1. In such a case, the interrupt controller is
>     acknowledged and the interrupt is no longer pending.


Everywhere it says state is _not_ updated, so I think what is meant is
that NMIs are blocked, but only _until_ the VM exit completes.

I think you were right the first time around. Can you check with your
architecture team?

--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.

2010-04-19 08:25:42

by Avi Kivity

Subject: Re: [PATCH V3] perf & kvm: Enhance perf to collect KVM guest os statistics from host side

On 04/17/2010 09:12 PM, Avi Kivity wrote:
>
> I think you were right the first time around.
>

Re-reading again (esp. the part about treatment of indirect NMI
vmexits), I think this was wrong, and that the code is correct. I am
now thoroughly confused.


--
error compiling committee.c: too many arguments to function

2010-04-20 03:33:12

by Sheng Yang

Subject: Re: [PATCH V3] perf & kvm: Enhance perf to collect KVM guest os statistics from host side

On Monday 19 April 2010 16:25:17 Avi Kivity wrote:
> On 04/17/2010 09:12 PM, Avi Kivity wrote:
> > I think you were right the first time around.
>
> Re-reading again (esp. the part about treatment of indirect NMI
> vmexits), I think this was wrong, and that the code is correct. I am
> now thoroughly confused.
>
My fault...

To my understanding now, "If an event causes a VM exit directly, it does not
update architectural state as it would have if it had not caused the VM exit"
means: in the NMI case, an NMI would invoke the NMI handler and change the
"architectural state" to NMI-blocked. In VMX non-root mode, the behavior of
invoking the NMI handler changes (determined by some VMCS fields), but the
effect on the "architectural state" does not. So the NMI-blocked state
remains the same.

--
regards
Yang, Sheng

2010-04-20 09:39:05

by Avi Kivity

Subject: Re: [PATCH V3] perf & kvm: Enhance perf to collect KVM guest os statistics from host side

On 04/20/2010 06:32 AM, Sheng Yang wrote:
> On Monday 19 April 2010 16:25:17 Avi Kivity wrote:
>
>> On 04/17/2010 09:12 PM, Avi Kivity wrote:
>>
>>> I think you were right the first time around.
>>>
>> Re-reading again (esp. the part about treatment of indirect NMI
>> vmexits), I think this was wrong, and that the code is correct. I am
>> now thoroughly confused.
>>
>>
> My fault...
>

Not at all, it's really confusingly worded.

> To my understanding now, "If an event causes a VM exit directly, it does not
> update architectural state as it would have if it had it not caused the VM
> exit:", means: in NMI case, NMI would involve the NMI handler, and change the
> "architectural state" to NMI block. In VMX non-root mode, the behavior of
> calling NMI handler changed(determine by some VMCS fields), but not the
> affection to the "architectural state". So the NMI block state would remain
> the same.
>

Agree. It's confusing because the internal "nmi pending" flag is not
set, while the "nmi blocking" flag is set.

(on svm both are set, but the NMI is not taken until the vmexit
completes and the host unmasks NMIs).

--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.

2010-04-27 19:03:17

by Uwaysi Bin Kareem

Subject: [PATCH] Psychovisually-optimized HZ setting (2.6.33.3)

This is based on the research I did with optimizing my machine for
graphics.
I also wrote the following article:
http://www.paradoxuncreated.com/articles/Millennium/Millennium.html
It is a bit outdated now, but I will update it with current information.
The value might iterate.

Peace Be With You,
Uwaysi Bin Kareem.


--- Kconfig.hzorig 2010-04-27 13:33:10.302162524 +0200
+++ Kconfig.hz 2010-04-27 20:39:54.736959816 +0200
@@ -45,6 +45,18 @@
1000 Hz is the preferred choice for desktop systems and other
systems requiring fast interactive responses to events.

+ config HZ_3956
+ bool "3956 HZ"
+ help
+ 3956 Hz is nearly the highest timer interrupt rate supported in the
kernel.
+ Graphics workstations, and OpenGL applications may benefit from this,
+ since it gives the lowest framerate-jitter. The exact value 3956 is
+ psychovisually-optimized, meaning that it aims for a level of jitter,
+ percieved to be natural, and therefore non-nosiy. It is tuned for a
+ profile of "where the human senses register the most information".
+
+
+
endchoice

config HZ
@@ -53,6 +65,7 @@
default 250 if HZ_250
default 300 if HZ_300
default 1000 if HZ_1000
+ default 3956 if HZ_3956

config SCHED_HRTICK
def_bool HIGH_RES_TIMERS && (!SMP || USE_GENERIC_SMP_HELPERS)

2010-04-27 19:51:24

by Randy Dunlap

Subject: Re: [PATCH] Psychovisually-optimized HZ setting (2.6.33.3)

On Tue, 27 Apr 2010 21:03:11 +0200 Uwaysi Bin Kareem wrote:

> This is based on the research I did with optimizing my machine for
> graphics.
> I also wrote the following article:
> http://www.paradoxuncreated.com/articles/Millennium/Millennium.html
> It is a bit outdated now, but I will update it with current information.
> The value might iterate.

Hi,

What CPU architectures or platforms did you test this on?
Were any other kernel changes needed?


> Peace Be With You,
> Uwaysi Bin Kareem.
>
>
> --- Kconfig.hzorig 2010-04-27 13:33:10.302162524 +0200
> +++ Kconfig.hz 2010-04-27 20:39:54.736959816 +0200
> @@ -45,6 +45,18 @@
> 1000 Hz is the preferred choice for desktop systems and other
> systems requiring fast interactive responses to events.
>
> + config HZ_3956
> + bool "3956 HZ"
> + help
> + 3956 Hz is nearly the highest timer interrupt rate supported in the kernel.
> + Graphics workstations, and OpenGL applications may benefit from this,

drop first comma.

> + since it gives the lowest framerate-jitter. The exact value 3956 is
> + psychovisually-optimized, meaning that it aims for a level of jitter,
> + percieved to be natural, and therefore non-nosiy. It is tuned for a

perceived non-noisy.

> + profile of "where the human senses register the most information".
> +
> +
> +
> endchoice
>
> config HZ
> @@ -53,6 +65,7 @@
> default 250 if HZ_250
> default 300 if HZ_300
> default 1000 if HZ_1000
> + default 3956 if HZ_3956
>
> config SCHED_HRTICK
> def_bool HIGH_RES_TIMERS && (!SMP || USE_GENERIC_SMP_HELPERS)
>
> --


---
~Randy
*** Remember to use Documentation/SubmitChecklist when testing your code ***

2010-04-27 21:50:15

by Valdis Klētnieks

[permalink] [raw]
Subject: Re: [PATCH] Psychovisually-optimized HZ setting (2.6.33.3)

On Tue, 27 Apr 2010 21:03:11 +0200, Uwaysi Bin Kareem said:

> http://www.paradoxuncreated.com/articles/Millennium/Millennium.html

> + config HZ_3956
> + bool "3956 HZ"
> + help
> + 3956 Hz is nearly the highest timer interrupt rate supported in the kernel.
> + Graphics workstations, and OpenGL applications may benefit from this,
> + since it gives the lowest framerate-jitter. The exact value 3956 is
> + psychovisually-optimized, meaning that it aims for a level of jitter,

Even after reading your link, it's unclear why 3956 and not 4000. All your link
said was "A granularity below 0.5 milliseconds, seems to suit the human
senses." - anything over 2000 meets that requirement. Also, if your screen
refresh is sitting at 72hz (a bit under 14ms per refresh), any jitter under
that won't really matter much - it doesn't matter if your next frame is
ready 5ms early or 5.5ms early, you *still* have to wait for the next vertical
blanking interval or suffer tearing.

There's also the case of programs where HZ=300 would *make* the time budget,
but the added 3,656 timer interrupts per second and associated overhead would
cause a missed screen refresh.

I think you need more technical justification of why 3956 is better than 1000.


Attachments:
(No filename) (227.00 B)

2010-04-15 01:04:57

by Yanmin Zhang

[permalink] [raw]
Subject: Re: [PATCH V3] perf & kvm: Enhance perf to collect KVM guest os statistics from host side

On Wed, 2010-04-14 at 12:20 +0300, Avi Kivity wrote:
> On 04/14/2010 12:05 PM, Zhang, Yanmin wrote:
> > Here is the new patch of V3 against tip/master of April 13th
> > if anyone wants to try it.
> >
> >
>
> Thanks for persisting despite the flames.
>
> Can you please separate arch/x86/kvm part of the patch? That will make
> for easier reviewing, and will need to go through separate trees.
I should definitely do so, and will in the next version, which also fixes
some issues pointed out by Ingo.

>
> Sheng, did you make any progress with the NMI injection issue?
>
> > +
> > diff -Nraup linux-2.6_tip0413/arch/x86/kvm/x86.c linux-2.6_tip0413_perfkvm/arch/x86/kvm/x86.c
> > --- linux-2.6_tip0413/arch/x86/kvm/x86.c 2010-04-14 11:11:04.341042024 +0800
> > +++ linux-2.6_tip0413_perfkvm/arch/x86/kvm/x86.c 2010-04-14 11:32:45.841278890 +0800
> > @@ -3765,6 +3765,35 @@ static void kvm_timer_init(void)
> > }
> > }
> >
> > +static DEFINE_PER_CPU(struct kvm_vcpu *, current_vcpu);
> > +
> > +static int kvm_is_in_guest(void)
> > +{
> > + return percpu_read(current_vcpu) != NULL;
> >
>
> An even more accurate way to determine this is to check whether the
> interrupt frame points back at the 'int $2' instruction. However we
> plan to switch to a self-IPI method to inject the NMI, and I'm not sure
> whether APIC NMIs are accepted on an instruction boundary or whether
> there's some latency involved.
Yes. But the frame pointer checking seems a little complicated.

>
> > +static unsigned long kvm_get_guest_ip(void)
> > +{
> > + unsigned long ip = 0;
> > + if (percpu_read(current_vcpu))
> > + ip = kvm_rip_read(percpu_read(current_vcpu));
> > + return ip;
> > +}
> >
>
> This may be racy. kvm_rip_read() accesses a cache in memory; if we're
> in the process of updating the cache, then we may read a stale value.
> See below.
Right. The race window seems too big.

>
> >
> > trace_kvm_entry(vcpu->vcpu_id);
> > +
> > + percpu_write(current_vcpu, vcpu);
> > kvm_x86_ops->run(vcpu);
> > + percpu_write(current_vcpu, NULL);
> >
>
> If you move this around the 'int $2' instructions you will close the
> race, as a stray NMI won't catch us updating the rip cache. But that
> depends on whether self-IPI is accepted on the next instruction or not.
Right. The kernel part depends on the self-IPI implementation.
I will move the percpu_write(current_vcpu, vcpu) above (or a new wrapper
function) to just around 'int $2'.

Sheng will look for a solution to the self-IPI delivery. Let's treat my patch
and self-IPI as two separate issues, since we don't know when the self-IPI
delivery question will be resolved.

Thanks,
Yanmin

2010-04-15 08:57:54

by Yanmin Zhang

[permalink] [raw]
Subject: Re: [PATCH V3] perf & kvm: Enhance perf to collect KVM guest os statistics from host side

On Thu, 2010-04-15 at 11:05 +0300, Avi Kivity wrote:
> On 04/15/2010 04:04 AM, Zhang, Yanmin wrote:
> >
> >> An even more accurate way to determine this is to check whether the
> >> interrupt frame points back at the 'int $2' instruction. However we
> >> plan to switch to a self-IPI method to inject the NMI, and I'm not sure
> >> whether APIC NMIs are accepted on an instruction boundary or whether
> >> there's some latency involved.
> >>
> > Yes. But the frame pointer checking seems a little complicated.
> >
>
> An even bigger disadvantage is that it won't work with Sheng's patch,
> self-NMIs are not synchronous.
>
> >>> trace_kvm_entry(vcpu->vcpu_id);
> >>> +
> >>> + percpu_write(current_vcpu, vcpu);
> >>> kvm_x86_ops->run(vcpu);
> >>> + percpu_write(current_vcpu, NULL);
> >>>
> >>>
> >> If you move this around the 'int $2' instructions you will close the
> >> race, as a stray NMI won't catch us updating the rip cache. But that
> >> depends on whether self-IPI is accepted on the next instruction or not.
> >>
> > Right. The kernel part has dependency on the self-IPI implementation.
> > I will move above percpu_write(current_vcpu, vcpu) (or a new wrapper function)
> > just around 'int $2'.
> >
> >
>
> Or create a new function to inject the interrupt in x86.c. That will
> reduce duplication between svm.c and vmx.c.
I checked svm.c, and it seems svm.c doesn't raise an NMI to the host if the
NMI happens in the guest. In addition, svm_complete_interrupts is called after
interrupts are enabled.

2010-06-01 06:22:45

by Valdis Klētnieks

[permalink] [raw]
Subject: Re: [PATCH] Psychovisually-optimized HZ setting (2.6.33.3)

On Sat, 29 May 2010 13:54:00 +0200, [email protected] said:

> Valdis Kletnieks: Why 3956, and why better than 1000:
>
> As stated, the exact value 3956 fits a profile of "where the human senses register the most information".

More details on that profile. Where *exactly* did the number 3956 come from?
How did you distinguish between 3956 and 4000 or 4096? What numbers do you
have that show an actual *measurable* improvement over 1000?

In other words, convince us that people can actually see the difference
between 1000 and 3956.


Attachments:
(No filename) (227.00 B)

2010-06-01 10:47:18

by Uwaysi Bin Kareem

[permalink] [raw]
Subject: Re: [PATCH] Psychovisually-optimized HZ setting (2.6.33.3)


On Tue, 01 Jun 2010 08:21:46 +0200, <[email protected]> wrote:

> On Sat, 29 May 2010 13:54:00 +0200,
> [email protected] said:
>
>> Valdis Kletnieks: Why 3956, and why better than 1000:
>>
>> As stated, the exact value 3956 fits a profile of "where the human
>> senses register the most information".
>
> More details on that profile. Where *exactly* did the number 3956 come
> from?
> How did you distinguish between 3956 and 4000 or 4096? What numbers do
> you
> have that show an actual *measurable* improvement over 1000?
>
> In other words, convince us that people can actually see the difference
> between 1000 and 3956.

I do not really have any numbers Valdis, other than simple glxgears benchmarks.
However, I have a lot of experience with jitter, and I am looking for sporadic jitter, jitter related to application startup, and jitter that is more or less constant.

Of course, I do not need any numbers either. If you think 1000 is better than 50, then there is a difference between 1000 and 4000 as well.
However, it gets into the area of research one would call psychovisuals. Small changes affecting the immersion, or experience, of opengl.

Put it simply, one might state "If you feel that your computer is a bit stoopid, try increasing the value, and maybe you will be more satisfied." This is because the computer is now more like the human senses.

The advanced version:
For those who possess religious knowledge, or believe in religious phenomena, let's just say that if you don't comply with certain religious knowledge, you will be tuning psychovisuals for a spirit, and not a human, and the experience will be suboptimal.

Just like the worshipper of one spirit, say an atheist, includes his preferences in the tuning, so does, for instance, the hash-smoker, and that is reflected in his tunings. What one would optimally like is a spirit-free tuning: no personal preference, but tuned for the universal in us all.

And for those who would like to understand some of the methodology behind this, again http://www.paradoxuncreated.com . Try the meditation technique, which purifies the mind from spirits.

Any answers related to this post, criticising or wasting my time, will be ignored.

God guides and deludes whomever he wills.
Peace Be with You.
Uwaysi.

2010-06-01 13:26:13

by Valdis Klētnieks

[permalink] [raw]
Subject: Re: [PATCH] Psychovisually-optimized HZ setting (2.6.33.3)

On Tue, 01 Jun 2010 12:47:15 +0200, [email protected] said:

> I do not really have any numbers Valdis, other than simple glxgears benchmarks.

I suspect that glxgears isn't telling you what you think it's telling you.
For starters, the distinction between a glxgears wank-o-meter reading of
4,000 FPS and 8,000 FPS doesn't actually *matter* when your screen is only
actually able to do 60 or 72 or 120FPS. What it *really* tells you is that
the card that can do 8,000FPS can probably handle a more complicated scene
before the FPS drops below the refresh rate and you miss a frame, which
*will* be noticeable.

Repeat after me: Graphics cards are locked to the refresh rate, and you can't
see jitter or low frame rates unless it causes tearing, missed frames, or
other screen artifacts. And to maximize your chances of not missing a screen
update, you want a *lower* HZ value so you don't waste precious time handling
timer interrupts.

> However I have a lot of experience with jitter, and I am looking for sporadic
> jitter, jitter related to application-startup, jitter that is more or less
> constant.

"Constant jitter" - talking like that will get you mocked mercilessly by
some people.

> Of course I do not need any numbers either. If you think 1000 is better than
> 50, then there is a difference between 1000 and 4000 as well.

OK, so why not go straight to 8,000 or 10,000 instead? Did you try values in
that range?

Hate to tell you this, but around here, you *do* need numbers to justify
making changes. It used to be that HZ=100 was the only choice - 250 and
1000 were added because somebody showed that those options made noticeable
differences in the latency/overhead tradeoff (interestingly enough, HZ=1000
mattered more to audio processing than video, because most video cards are
locked to a relatively low refresh rate while audio cards will produce
a noticeable transient if you miss a timeout by even 1ms). HZ=300 was added
specifically to play nice with 60-hz video processing.

But to swallow the added overhead of setting HZ=4000, you'll have to show
some remarkable benefits (especially when you're pulling out a magic number
like 3956 rather than 4000).

> Put it simply one might state "If you feel that your computer is a bit
> stoopid, try increasing the value, and maybe you will be more satisfied." This
> because the computer now, is more like the human senses.

And maybe you won't be, unless you're the type of person who buys the
special $1,000 HDMI cables and $600 wooden volume controls. Unfortunately,
we aren't building kernels for that type of person.

> And for those who would like to understand some of the methodology behind this,
> again http://www.paradoxuncreated.com . Try the meditation technique, which purifies
> the mind from spirits.

Unfortunately, that's unlikely to get your changes into the kernel.

> Any answers related to this post, criticising or wasting my time, will be ignored.

Nor is this likely to help...


Attachments:
(No filename) (227.00 B)

2010-06-01 14:12:22

by Uwaysi Bin Kareem

[permalink] [raw]
Subject: Re: [PATCH] Psychovisually-optimized HZ setting (2.6.33.3)


I guess I could pick this post apart, just for the fun of it. I really hate satanical stupidity. Most of you here can probably follow a level where my arguments make sense, and some of you can even follow the religious argument. However, a few may post completely retarded stuff like this. Now I'm going to pick this apart, so that you can imagine this being done in similar future incidents, which I will otherwise just ignore. My life is better without this kind of conversation ;)

On Tue, 01 Jun 2010 15:25:46 +0200, <[email protected]> wrote:

> On Tue, 01 Jun 2010 12:47:15 +0200,
> [email protected] said:
>
>> I do not really have any numbers Valdis, other than simple glxgears
>> benchmarks.
>
> I suspect that glxgears isn't telling you what you think it's telling
> you.
> For starters, the distinction between a glxgears wank-o-meter reading of
> 4,000 FPS and 8,000 FPS doesn't actually *matter* when your screen is
> only
It matters to jitter measurements, where numbers in the 80000 region give me the information I need. With more complex scenes, which would render only, say, 100fps, less information would be given. We are not talking about how complex a scene we can render. We are talking about jitter, which is present regardless of scene complexity.
> actually able to do 60 or 72 or 120FPS. What it *really* tells you is
> that
> the card that can do 8,000FPS can probably handle a more complicated
> scene
> before the FPS drops below the refresh rate and you miss a frame, which
> *will* be noticeable.
>
> Repeat after me: Graphics cards are locked to the refresh rate, and you
> can't
I think very few things should be repeated after you. First the 8000fps, and now you claim vsynced behaviour. This is a contradiction.
> see jitter or low frame rates unless it causes tearing, missed frames, or
> other screen artifacts. And to maximize your chances of not missing a
> screen
> update, you want a *lower* HZ value so you don't waste precious time
No. You run 50hz, I run 3956hz. If you have this amount of garbage in your head, maybe it's in your vision as well, and little can help you.
> handling
> timer interrupts.
>
>> However I have a lot of experience with jitter, and I am looking for
>> sporadic
>> jitter, jitter related to application-startup, jitter that is more or
>> less
>> constant.
>
> "Constant jitter" - talking like that will get you mocked mercilessly by
> some people.

Mindless people, who set a value of 4000hz instead of 3956 because they don't believe, or lack the skill, to tune a value. Instead, the value 4000 would reflect a guess, not tuned for the human senses, and we are back to the old stoopid computer again, tuned by people like Valdis, who would rather sit and run 50hz updates in masochistic hope of saving a cpu cycle, when real-life tests show that even values of 10000hz make little difference to performance in opengl. You do that, Valdis. Live with the cheapest clothes, the most watered-down drinks, and the stalest and cheapest bread. And we, who appreciate higher intelligence and use the resources available to us, will enjoy ourselves a little, away from backward people like you.

>> Of course I do not need any numbers either. If you think 1000 is better
>> than
>> 50, then there is a difference between 1000 and 4000 as well.
>
> OK, so why not go straight to 8,00 or 10,000 instead? Did you try values
> in that
> range?
>
> Hate to tell you this, but around here, you *do* need numbers to justify
> making changes. It used to be that HZ=100 was the only choice - 250 and
In a menu. Lol, some of us actually looked at the source and changed that value anyway. It's really quite simple, and if you don't possess even that basic level of skill, what are you doing on LKML? Lol, it's never been 100hz or no choice. You seem to lack even the most basic insight into what opensource is.
> 1000 were added because somebody showed that those options made
> noticeable
> differences in the latency/overhead tradeoff (interestingly enough,
> HZ=1000
> mattered more to audio processing than video, because most video cards
> are
> locked to a relatively low refresh rate while audio cards will produce
> a noticable transient if you miss a timeout by even 1ms). HZ=300 was
> added
> specifically to play nice with 60-hz video processing.
Play nice? Jitter between frames is still a problem, even though 300hz is divisible by 60.
>
> But to swallow the added overhead of setting HZ=4000, you'll have to show
> some remarkable benefits (especially when you're pulling out a magic
> number
> like 3956 rather than 4000).

You completely failed to comprehend the simple statement in my previous post: "There is no performance penalty for opengl applications". How is it possible?

>> Put it simply one might state "If you feel that your computer is a bit
>> stoopid, try increasing the value, and maybe you will be more
>> satisfied." This
>> because the computer is now more like the human senses.
>
> And maybe you won't be, unless you're the type of person who buys the
> special $1,000 HDMI cables and $600 wooden volume controls.
> Unfortunately,
> we aren't building kernels for those type of people.

We? This is opensource. I do whatever I will with the kernel. You take your cheap pirate rip-off copy of a volume control, and I WILL have the nice volume control. Resources exist to be consumed.

>> And for those who would like to understand some of the methodology behind
>> this,
>> again http://www.paradoxuncreated.com . Try the meditation technique, which
>> purifies
>> the mind from spirits.
>
> Unfortunately, that's unlikely to get your changes into the kernel.

If people like you are influential in what goes into the kernel, the linux community truly suffers. Be ready for the cheapest and the lowest, just because theoretically some cpu cycles get lost. Why don't you get yourself a Z80 and live in Sinclair land? It truly sounds like you want to keep being there.

>> Any answers related to this post, criticising or wasting my time, will
>> be ignored.
>
> Nor is this likely to help...

I don't think you have any idea about what helps and what doesn't. Go to a $1 mental hospital, with ZX Sinclair computers, and run it at a 0.0001hz screen update. That will surely save cpu cycles.

On the other hand it would bring our economy to a halt, and development of linux or any other OS would be for a stagnant hardware market. What should we do with the new hardware? Pile it up, and brag, and never use it, because that saves cpu cycles?
Like a pile of trash?
Might as well go the direct route to the trash heap then.

Peace Be With You.
Uwaysi. (in a more-than-rare flamewar these days)

PS: To all the sane people here, do it. You will enjoy it. Do not be Mr. 50hz sadomasochist.