2023-08-09 02:10:49

by Huang Shijie

Subject: [PATCH] perf/core: fix the bug in the event multiplexing

1.) Background.
1.1) On arm64, run a virtual guest with QEMU, bind the guest
to core 33, and run program "a" in the guest.
The code of "a" is shown below:
----------------------------------------------------------
#include <stdio.h>

int main()
{
	unsigned long i = 0;

	for (;;) {
		i++;
	}

	printf("i:%ld\n", i);
	return 0;
}
----------------------------------------------------------

1.2) Use the following perf command in host:
#perf stat -e cycles:G,cycles:H -C 33 -I 1000 sleep 1
# time counts unit events
1.000817400 3,299,471,572 cycles:G
1.000817400 3,240,586 cycles:H

This result is correct; my CPU's frequency is 3.3 GHz.

1.3) Use the following perf command in host:
#perf stat -e cycles:G,cycles:H -C 33 -d -d -I 1000 sleep 1
time counts unit events
1.000831480 153,634,097 cycles:G (70.03%)
1.000831480 3,147,940,599 cycles:H (70.03%)
1.000831480 1,143,598,527 L1-dcache-loads (70.03%)
1.000831480 9,986 L1-dcache-load-misses # 0.00% of all L1-dcache accesses (70.03%)
1.000831480 <not supported> LLC-loads
1.000831480 <not supported> LLC-load-misses
1.000831480 580,887,696 L1-icache-loads (70.03%)
1.000831480 77,855 L1-icache-load-misses # 0.01% of all L1-icache accesses (70.03%)
1.000831480 6,112,224,612 dTLB-loads (70.03%)
1.000831480 16,222 dTLB-load-misses # 0.00% of all dTLB cache accesses (69.94%)
1.000831480 590,015,996 iTLB-loads (59.95%)
1.000831480 505 iTLB-load-misses # 0.00% of all iTLB cache accesses (59.95%)

This result is wrong. The "cycles:G" count should be nearly 3.3 billion.

2.) Root cause.
There are only 7 counters on my arm64 platform:
(one cycle counter) + (6 normal counters)

In 1.3 above, we need 10 event counters.
Since we only have 7 counters, the perf core triggers
event multiplexing from the hrtimer:
merge_sched_in() --> perf_mux_hrtimer_restart() -->
perf_rotate_context().

perf_rotate_context() does not restore some PMU registers
the way context_switch() does. In context_switch():
kvm_sched_in() --> kvm_vcpu_pmu_restore_guest()
kvm_sched_out() --> kvm_vcpu_pmu_restore_host()

So we get the wrong result.

3.) About this patch.
3.1) Add arch_perf_rotate_pmu_set().
3.2) Add is_guest() to check whether the hrtimer
fired while a guest was running.
3.3) In arm64's arch_perf_rotate_pmu_set(), restore
the PMU registers according to that context.

4.) Test result of this patch:
#perf stat -e cycles:G,cycles:H -C 33 -d -d -I 1000 sleep 1
time counts unit events
1.000817360 3,297,898,244 cycles:G (70.03%)
1.000817360 2,719,941 cycles:H (70.03%)
1.000817360 883,764 L1-dcache-loads (70.03%)
1.000817360 17,517 L1-dcache-load-misses # 1.98% of all L1-dcache accesses (70.03%)
1.000817360 <not supported> LLC-loads
1.000817360 <not supported> LLC-load-misses
1.000817360 1,033,816 L1-icache-loads (70.03%)
1.000817360 103,839 L1-icache-load-misses # 10.04% of all L1-icache accesses (70.03%)
1.000817360 982,401 dTLB-loads (70.03%)
1.000817360 28,272 dTLB-load-misses # 2.88% of all dTLB cache accesses (69.94%)
1.000817360 972,072 iTLB-loads (59.95%)
1.000817360 772 iTLB-load-misses # 0.08% of all iTLB cache accesses (59.95%)

The result is correct. The "cycles:G" count is nearly 3.3 billion now.

Signed-off-by: Huang Shijie <[email protected]>
---
arch/arm64/kvm/pmu.c | 8 ++++++++
include/linux/kvm_host.h | 1 +
kernel/events/core.c | 5 +++++
virt/kvm/kvm_main.c | 9 +++++++++
4 files changed, 23 insertions(+)

diff --git a/arch/arm64/kvm/pmu.c b/arch/arm64/kvm/pmu.c
index 121f1a14c829..a6815c3f0c4e 100644
--- a/arch/arm64/kvm/pmu.c
+++ b/arch/arm64/kvm/pmu.c
@@ -210,6 +210,14 @@ void kvm_vcpu_pmu_restore_host(struct kvm_vcpu *vcpu)
kvm_vcpu_pmu_disable_el0(events_guest);
}

+void arch_perf_rotate_pmu_set(void)
+{
+ if (is_guest())
+ kvm_vcpu_pmu_restore_guest(NULL);
+ else
+ kvm_vcpu_pmu_restore_host(NULL);
+}
+
/*
* With VHE, keep track of the PMUSERENR_EL0 value for the host EL0 on the pCPU
* where PMUSERENR_EL0 for the guest is loaded, since PMUSERENR_EL0 is switched
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 9d3ac7720da9..e350cbc8190f 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -931,6 +931,7 @@ void kvm_destroy_vcpus(struct kvm *kvm);

void vcpu_load(struct kvm_vcpu *vcpu);
void vcpu_put(struct kvm_vcpu *vcpu);
+bool is_guest(void);

#ifdef __KVM_HAVE_IOAPIC
void kvm_arch_post_irq_ack_notifier_list_update(struct kvm *kvm);
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 6fd9272eec6e..fe78f9d17eba 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -4229,6 +4229,10 @@ ctx_event_to_rotate(struct perf_event_pmu_context *pmu_ctx)
return event;
}

+void __weak arch_perf_rotate_pmu_set(void)
+{
+}
+
static bool perf_rotate_context(struct perf_cpu_pmu_context *cpc)
{
struct perf_cpu_context *cpuctx = this_cpu_ptr(&perf_cpu_context);
@@ -4282,6 +4286,7 @@ static bool perf_rotate_context(struct perf_cpu_pmu_context *cpc)
if (task_event || (task_epc && cpu_event))
__pmu_ctx_sched_in(task_epc->ctx, pmu);

+ arch_perf_rotate_pmu_set();
perf_pmu_enable(pmu);
perf_ctx_unlock(cpuctx, cpuctx->task_ctx);

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index dfbaafbe3a00..a77d336552be 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -218,6 +218,15 @@ void vcpu_load(struct kvm_vcpu *vcpu)
}
EXPORT_SYMBOL_GPL(vcpu_load);

+/* Are we currently running a guest on this CPU? */
+bool is_guest(void)
+{
+ struct kvm_vcpu *vcpu;
+
+ vcpu = __this_cpu_read(kvm_running_vcpu);
+ return !!vcpu;
+}
+
void vcpu_put(struct kvm_vcpu *vcpu)
{
preempt_disable();
--
2.39.2



2023-08-09 04:20:04

by kernel test robot

Subject: Re: [PATCH] perf/core: fix the bug in the event multiplexing

Hi Huang,

kernel test robot noticed the following build warnings:

[auto build test WARNING on kvm/queue]
[also build test WARNING on kvmarm/next tip/perf/core linus/master v6.5-rc5 next-20230808]
[cannot apply to acme/perf/core kvm/linux-next]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url: https://github.com/intel-lab-lkp/linux/commits/Huang-Shijie/perf-core-fix-the-bug-in-the-event-multiplexing/20230809-094637
base: https://git.kernel.org/pub/scm/virt/kvm/kvm.git queue
patch link: https://lore.kernel.org/r/20230809013953.7692-1-shijie%40os.amperecomputing.com
patch subject: [PATCH] perf/core: fix the bug in the event multiplexing
config: loongarch-allyesconfig (https://download.01.org/0day-ci/archive/20230809/[email protected]/config)
compiler: loongarch64-linux-gcc (GCC) 12.3.0
reproduce: (https://download.01.org/0day-ci/archive/20230809/[email protected]/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <[email protected]>
| Closes: https://lore.kernel.org/oe-kbuild-all/[email protected]/

All warnings (new ones prefixed by >>):

>> kernel/events/core.c:4232:13: warning: no previous prototype for 'arch_perf_rotate_pmu_set' [-Wmissing-prototypes]
4232 | void __weak arch_perf_rotate_pmu_set(void)
| ^~~~~~~~~~~~~~~~~~~~~~~~


vim +/arch_perf_rotate_pmu_set +4232 kernel/events/core.c

4231
> 4232 void __weak arch_perf_rotate_pmu_set(void)
4233 {
4234 }
4235

--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

2023-08-09 08:57:34

by Oliver Upton

Subject: Re: [PATCH] perf/core: fix the bug in the event multiplexing

Hi Huang,

On Wed, Aug 09, 2023 at 09:39:53AM +0800, Huang Shijie wrote:
> 2.) Root cause.
> There is only 7 counters in my arm64 platform:
> (one cycle counter) + (6 normal counters)
>
> In 1.3 above, we will use 10 event counters.
> Since we only have 7 counters, the perf core will trigger
> event multiplexing in hrtimer:
> merge_sched_in() -->perf_mux_hrtimer_restart() -->
> perf_rotate_context().
>
> In the perf_rotate_context(), it does not restore some PMU registers
> as context_switch() does. In context_switch():
> kvm_sched_in() --> kvm_vcpu_pmu_restore_guest()
> kvm_sched_out() --> kvm_vcpu_pmu_restore_host()
>
> So we got wrong result.

This is a rather vague description of the problem. AFAICT, the
issue here is on VHE systems we wind up getting the EL0 count
enable/disable bits backwards when entering the guest, which is
corroborated by the data you have below.

> +void arch_perf_rotate_pmu_set(void)
> +{
> + if (is_guest())
> + kvm_vcpu_pmu_restore_guest(NULL);
> + else
> + kvm_vcpu_pmu_restore_host(NULL);
> +}
> +

This sort of hook is rather nasty, and I'd strongly prefer a solution
that's confined to KVM. I don't think the !is_guest() branch is
necessary at all. Regardless of how the pmu context is changed, we need
to go through vcpu_put() before getting back out to userspace.

We can check for a running vCPU (ick) from kvm_set_pmu_events() and either
do the EL0 bit flip there or make a request on the vCPU to call
kvm_vcpu_pmu_restore_guest() immediately before reentering the guest.
I'm slightly leaning towards the latter, unless anyone has a better idea
here.

--
Thanks,
Oliver

2023-08-09 09:51:11

by Shijie Huang

Subject: Re: [PATCH] perf/core: fix the bug in the event multiplexing

Hi Marc,

在 2023/8/9 16:48, Marc Zyngier 写道:
> On Wed, 09 Aug 2023 02:39:53 +0100,
> Huang Shijie <[email protected]> wrote:
>
> For a start, please provide a sensible subject line for your patch.
> "fix the bug" is not exactly descriptive, and I'd argue that if there
> was only one bug left, I'd have taken an early retirement by now.
okay, I will change the subject in the future.
>> 1.) Background.
>> 1.1) In arm64, run a virtual guest with Qemu, and bind the guest
> Is that with QEMU in system emulation mode? Or QEMU as a VMM for KVM?
Run the Qemu as a VMM for KVM.
>
>> to core 33 and run program "a" in guest.
> Is core 33 significant? Is the program itself significant?

It is not significant; the core was just chosen at random.

The program is only used to test perf; it is not significant either.

>> The code of "a" shows below:
>> ----------------------------------------------------------
>> #include <stdio.h>
>>
>> int main()
>> {
>> unsigned long i = 0;
>>
>> for (;;) {
>> i++;
>> }
>>
>> printf("i:%ld\n", i);
>> return 0;
>> }
>> ----------------------------------------------------------
>>
>> 1.2) Use the following perf command in host:
>> #perf stat -e cycles:G,cycles:H -C 33 -I 1000 sleep 1
>> # time counts unit events
>> 1.000817400 3,299,471,572 cycles:G
>> 1.000817400 3,240,586 cycles:H
>>
>> This result is correct, my cpu's frequency is 3.3G.
>>
>> 1.3) Use the following perf command in host:
>> #perf stat -e cycles:G,cycles:H -C 33 -d -d -I 1000 sleep 1
>> time counts unit events
>> 1.000831480 153,634,097 cycles:G (70.03%)
>> 1.000831480 3,147,940,599 cycles:H (70.03%)
>> 1.000831480 1,143,598,527 L1-dcache-loads (70.03%)
>> 1.000831480 9,986 L1-dcache-load-misses # 0.00% of all L1-dcache accesses (70.03%)
>> 1.000831480 <not supported> LLC-loads
>> 1.000831480 <not supported> LLC-load-misses
>> 1.000831480 580,887,696 L1-icache-loads (70.03%)
>> 1.000831480 77,855 L1-icache-load-misses # 0.01% of all L1-icache accesses (70.03%)
>> 1.000831480 6,112,224,612 dTLB-loads (70.03%)
>> 1.000831480 16,222 dTLB-load-misses # 0.00% of all dTLB cache accesses (69.94%)
>> 1.000831480 590,015,996 iTLB-loads (59.95%)
>> 1.000831480 505 iTLB-load-misses # 0.00% of all iTLB cache accesses (59.95%)
>>
>> This result is wrong. The "cycle:G" should be nearly 3.3G.
>>
>> 2.) Root cause.
>> There is only 7 counters in my arm64 platform:
>> (one cycle counter) + (6 normal counters)
>>
>> In 1.3 above, we will use 10 event counters.
>> Since we only have 7 counters, the perf core will trigger
>> event multiplexing in hrtimer:
>> merge_sched_in() -->perf_mux_hrtimer_restart() -->
>> perf_rotate_context().
>>
>> In the perf_rotate_context(), it does not restore some PMU registers
>> as context_switch() does. In context_switch():
>> kvm_sched_in() --> kvm_vcpu_pmu_restore_guest()
>> kvm_sched_out() --> kvm_vcpu_pmu_restore_host()
>>
>> So we got wrong result.
>>
>> 3.) About this patch.
>> 3.1) Add arch_perf_rotate_pmu_set()
>> 3.2) Add is_guest().
>> Check the context for hrtimer.
>> 3.3) In arm64's arch_perf_rotate_pmu_set(),
>> set the PMU registers by the context.
>>
>> 4.) Test result of this patch:
>> #perf stat -e cycles:G,cycles:H -C 33 -d -d -I 1000 sleep 1
>> time counts unit events
>> 1.000817360 3,297,898,244 cycles:G (70.03%)
>> 1.000817360 2,719,941 cycles:H (70.03%)
>> 1.000817360 883,764 L1-dcache-loads (70.03%)
>> 1.000817360 17,517 L1-dcache-load-misses # 1.98% of all L1-dcache accesses (70.03%)
>> 1.000817360 <not supported> LLC-loads
>> 1.000817360 <not supported> LLC-load-misses
>> 1.000817360 1,033,816 L1-icache-loads (70.03%)
>> 1.000817360 103,839 L1-icache-load-misses # 10.04% of all L1-icache accesses (70.03%)
>> 1.000817360 982,401 dTLB-loads (70.03%)
>> 1.000817360 28,272 dTLB-load-misses # 2.88% of all dTLB cache accesses (69.94%)
>> 1.000817360 972,072 iTLB-loads (59.95%)
>> 1.000817360 772 iTLB-load-misses # 0.08% of all iTLB cache accesses (59.95%)
>>
>> The result is correct. The "cycle:G" is nearly 3.3G now.
>>
>> Signed-off-by: Huang Shijie <[email protected]>
>> ---
>> arch/arm64/kvm/pmu.c | 8 ++++++++
>> include/linux/kvm_host.h | 1 +
>> kernel/events/core.c | 5 +++++
>> virt/kvm/kvm_main.c | 9 +++++++++
>> 4 files changed, 23 insertions(+)
>>
>> diff --git a/arch/arm64/kvm/pmu.c b/arch/arm64/kvm/pmu.c
>> index 121f1a14c829..a6815c3f0c4e 100644
>> --- a/arch/arm64/kvm/pmu.c
>> +++ b/arch/arm64/kvm/pmu.c
>> @@ -210,6 +210,14 @@ void kvm_vcpu_pmu_restore_host(struct kvm_vcpu *vcpu)
>> kvm_vcpu_pmu_disable_el0(events_guest);
>> }
>>
>> +void arch_perf_rotate_pmu_set(void)
>> +{
>> + if (is_guest())
>> + kvm_vcpu_pmu_restore_guest(NULL);
>> + else
>> + kvm_vcpu_pmu_restore_host(NULL);
>> +}
> So we're now randomly poking at the counters even when no guest is
> running, based on whatever is stashed in internal KVM data structures?
> I'm sure this is going to work really well.
>
> Hint: even if these functions don't directly look at the vcpu pointer,
> passing NULL is a really bad idea. It is a sure sign that you don't
> have the context on which to perform the action you're trying to do.
>
> This really shouldn't do *anything* when the rotation process is not
> preempting a guest.
Thanks for the explanation, I will try Oliver's second method to resolve
this issue.
>> +
>> /*
>> * With VHE, keep track of the PMUSERENR_EL0 value for the host EL0 on the pCPU
>> * where PMUSERENR_EL0 for the guest is loaded, since PMUSERENR_EL0 is switched
>> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
>> index 9d3ac7720da9..e350cbc8190f 100644
>> --- a/include/linux/kvm_host.h
>> +++ b/include/linux/kvm_host.h
>> @@ -931,6 +931,7 @@ void kvm_destroy_vcpus(struct kvm *kvm);
>>
>> void vcpu_load(struct kvm_vcpu *vcpu);
>> void vcpu_put(struct kvm_vcpu *vcpu);
>> +bool is_guest(void);
> Why do we need this (not to mention the poor choice of name)? We
> already have kvm_get_running_vcpu(), which does everything you need
> (and gives you the actual context).
yes.
>>
>> #ifdef __KVM_HAVE_IOAPIC
>> void kvm_arch_post_irq_ack_notifier_list_update(struct kvm *kvm);
>> diff --git a/kernel/events/core.c b/kernel/events/core.c
>> index 6fd9272eec6e..fe78f9d17eba 100644
>> --- a/kernel/events/core.c
>> +++ b/kernel/events/core.c
>> @@ -4229,6 +4229,10 @@ ctx_event_to_rotate(struct perf_event_pmu_context *pmu_ctx)
>> return event;
>> }
>>
>> +void __weak arch_perf_rotate_pmu_set(void)
>> +{
>> +}
>> +
>> static bool perf_rotate_context(struct perf_cpu_pmu_context *cpc)
>> {
>> struct perf_cpu_context *cpuctx = this_cpu_ptr(&perf_cpu_context);
>> @@ -4282,6 +4286,7 @@ static bool perf_rotate_context(struct perf_cpu_pmu_context *cpc)
>> if (task_event || (task_epc && cpu_event))
>> __pmu_ctx_sched_in(task_epc->ctx, pmu);
>>
>> + arch_perf_rotate_pmu_set();
> KVM already supports hooking into the perf core using the
> perf_guest_info_callbacks structure. Why should we need a separate
> mechanism?
okay, I will check it too.
>
>> perf_pmu_enable(pmu);
>> perf_ctx_unlock(cpuctx, cpuctx->task_ctx);
>>
>> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
>> index dfbaafbe3a00..a77d336552be 100644
>> --- a/virt/kvm/kvm_main.c
>> +++ b/virt/kvm/kvm_main.c
>> @@ -218,6 +218,15 @@ void vcpu_load(struct kvm_vcpu *vcpu)
>> }
>> EXPORT_SYMBOL_GPL(vcpu_load);
>>
>> +/* Do we in the guest? */
>> +bool is_guest(void)
>> +{
>> + struct kvm_vcpu *vcpu;
>> +
>> + vcpu = __this_cpu_read(kvm_running_vcpu);
>> + return !!vcpu;
>> +}
>> +
>> void vcpu_put(struct kvm_vcpu *vcpu)
>> {
>> preempt_disable();
> It looks like you've identified an actual issue. However, I'm highly
> sceptical of the implementation. This really needs some more work.


Thanks again.

Huang Shijie


2023-08-09 09:51:53

by Shijie Huang

Subject: Re: [PATCH] perf/core: fix the bug in the event multiplexing

Hi Oliver,

在 2023/8/9 16:25, Oliver Upton 写道:
> Hi Huang,
>
> On Wed, Aug 09, 2023 at 09:39:53AM +0800, Huang Shijie wrote:
>> 2.) Root cause.
>> There is only 7 counters in my arm64 platform:
>> (one cycle counter) + (6 normal counters)
>>
>> In 1.3 above, we will use 10 event counters.
>> Since we only have 7 counters, the perf core will trigger
>> event multiplexing in hrtimer:
>> merge_sched_in() -->perf_mux_hrtimer_restart() -->
>> perf_rotate_context().
>>
>> In the perf_rotate_context(), it does not restore some PMU registers
>> as context_switch() does. In context_switch():
>> kvm_sched_in() --> kvm_vcpu_pmu_restore_guest()
>> kvm_sched_out() --> kvm_vcpu_pmu_restore_host()
>>
>> So we got wrong result.
> This is a rather vague description of the problem. AFAICT, the
> issue here is on VHE systems we wind up getting the EL0 count
> enable/disable bits backwards when entering the guest, which is
> corroborated by the data you have below.
>
>> +void arch_perf_rotate_pmu_set(void)
>> +{
>> + if (is_guest())
>> + kvm_vcpu_pmu_restore_guest(NULL);
>> + else
>> + kvm_vcpu_pmu_restore_host(NULL);
>> +}
>> +
> This sort of hook is rather nasty, and I'd strongly prefer a solution
> that's confined to KVM. I don't think the !is_guest() branch is
> necessary at all. Regardless of how the pmu context is changed, we need
> to go through vcpu_put() before getting back out to userspace.
>
> We can check for a running vCPU (ick) from kvm_set_pmu_events() and either
> do the EL0 bit flip there or make a request on the vCPU to call
> kvm_vcpu_pmu_restore_guest() immediately before reentering the guest.
> I'm slightly leaning towards the latter, unless anyone has a better idea
> here.

Thanks a lot, I will look into the latter approach and try to fix
this again.


Thanks

Huang Shijie


2023-08-09 09:51:53

by Shijie Huang

Subject: Re: [PATCH] perf/core: fix the bug in the event multiplexing

Hi Mark,

在 2023/8/9 17:22, Mark Rutland 写道:
> On Wed, Aug 09, 2023 at 08:25:07AM +0000, Oliver Upton wrote:
>> Hi Huang,
>>
>> On Wed, Aug 09, 2023 at 09:39:53AM +0800, Huang Shijie wrote:
>>> 2.) Root cause.
>>> There is only 7 counters in my arm64 platform:
>>> (one cycle counter) + (6 normal counters)
>>>
>>> In 1.3 above, we will use 10 event counters.
>>> Since we only have 7 counters, the perf core will trigger
>>> event multiplexing in hrtimer:
>>> merge_sched_in() -->perf_mux_hrtimer_restart() -->
>>> perf_rotate_context().
>>>
>>> In the perf_rotate_context(), it does not restore some PMU registers
>>> as context_switch() does. In context_switch():
>>> kvm_sched_in() --> kvm_vcpu_pmu_restore_guest()
>>> kvm_sched_out() --> kvm_vcpu_pmu_restore_host()
>>>
>>> So we got wrong result.
>> This is a rather vague description of the problem. AFAICT, the
>> issue here is on VHE systems we wind up getting the EL0 count
>> enable/disable bits backwards when entering the guest, which is
>> corroborated by the data you have below.
> Yep; IIUC the issue here is that when we take an IRQ from a guest and reprogram
> the PMU in the IRQ handler, the IRQ handler will program the PMU with
> appropriate host/guest/user/etc filters for a *host* context, and then we'll
> return back into the guest without reconfiguring the event filtering for a
> *guest* context.
Yes.
>
> That can happen for perf_rotate_context(), or when we install an event into a
> running context, as that'll happen via an IPI.
>
>>> +void arch_perf_rotate_pmu_set(void)
>>> +{
>>> + if (is_guest())
>>> + kvm_vcpu_pmu_restore_guest(NULL);
>>> + else
>>> + kvm_vcpu_pmu_restore_host(NULL);
>>> +}
>>> +
>> This sort of hook is rather nasty, and I'd strongly prefer a solution
>> that's confined to KVM. I don't think the !is_guest() branch is
>> necessary at all. Regardless of how the pmu context is changed, we need
>> to go through vcpu_put() before getting back out to userspace.
>>
>> We can check for a running vCPU (ick) from kvm_set_pmu_events() and either
>> do the EL0 bit flip there or make a request on the vCPU to call
>> kvm_vcpu_pmu_restore_guest() immediately before reentering the guest.
>> I'm slightly leaning towards the latter, unless anyone has a better idea
>> here.
> The latter sounds reasonable to me.

okay. I prefer the latter one now. :)


Thanks

Huang Shijie

>
> I suspect we need to take special care here to make sure we leave *all* events
> in a good state when re-entering the guest or if we get to kvm_sched_out()
> after *removing* an event via an IPI -- it'd be easy to mess either case up and
> leave some events in a bad state.
>
> Thanks,
> Mark.

2023-08-09 10:22:22

by Mark Rutland

Subject: Re: [PATCH] perf/core: fix the bug in the event multiplexing

On Wed, Aug 09, 2023 at 08:25:07AM +0000, Oliver Upton wrote:
> Hi Huang,
>
> On Wed, Aug 09, 2023 at 09:39:53AM +0800, Huang Shijie wrote:
> > 2.) Root cause.
> > There is only 7 counters in my arm64 platform:
> > (one cycle counter) + (6 normal counters)
> >
> > In 1.3 above, we will use 10 event counters.
> > Since we only have 7 counters, the perf core will trigger
> > event multiplexing in hrtimer:
> > merge_sched_in() -->perf_mux_hrtimer_restart() -->
> > perf_rotate_context().
> >
> > In the perf_rotate_context(), it does not restore some PMU registers
> > as context_switch() does. In context_switch():
> > kvm_sched_in() --> kvm_vcpu_pmu_restore_guest()
> > kvm_sched_out() --> kvm_vcpu_pmu_restore_host()
> >
> > So we got wrong result.
>
> This is a rather vague description of the problem. AFAICT, the
> issue here is on VHE systems we wind up getting the EL0 count
> enable/disable bits backwards when entering the guest, which is
> corroborated by the data you have below.

Yep; IIUC the issue here is that when we take an IRQ from a guest and reprogram
the PMU in the IRQ handler, the IRQ handler will program the PMU with
appropriate host/guest/user/etc filters for a *host* context, and then we'll
return back into the guest without reconfiguring the event filtering for a
*guest* context.

That can happen for perf_rotate_context(), or when we install an event into a
running context, as that'll happen via an IPI.

> > +void arch_perf_rotate_pmu_set(void)
> > +{
> > + if (is_guest())
> > + kvm_vcpu_pmu_restore_guest(NULL);
> > + else
> > + kvm_vcpu_pmu_restore_host(NULL);
> > +}
> > +
>
> This sort of hook is rather nasty, and I'd strongly prefer a solution
> that's confined to KVM. I don't think the !is_guest() branch is
> necessary at all. Regardless of how the pmu context is changed, we need
> to go through vcpu_put() before getting back out to userspace.
>
> We can check for a running vCPU (ick) from kvm_set_pmu_events() and either
> do the EL0 bit flip there or make a request on the vCPU to call
> kvm_vcpu_pmu_restore_guest() immediately before reentering the guest.
> I'm slightly leaning towards the latter, unless anyone has a better idea
> here.

The latter sounds reasonable to me.

I suspect we need to take special care here to make sure we leave *all* events
in a good state when re-entering the guest or if we get to kvm_sched_out()
after *removing* an event via an IPI -- it'd be easy to mess either case up and
leave some events in a bad state.

Thanks,
Mark.

2023-08-09 10:22:40

by Marc Zyngier

Subject: Re: [PATCH] perf/core: fix the bug in the event multiplexing

On Wed, 09 Aug 2023 02:39:53 +0100,
Huang Shijie <[email protected]> wrote:

For a start, please provide a sensible subject line for your patch.
"fix the bug" is not exactly descriptive, and I'd argue that if there
was only one bug left, I'd have taken an early retirement by now.

>
> 1.) Background.
> 1.1) In arm64, run a virtual guest with Qemu, and bind the guest

Is that with QEMU in system emulation mode? Or QEMU as a VMM for KVM?

> to core 33 and run program "a" in guest.

Is core 33 significant? Is the program itself significant?

> The code of "a" shows below:
> ----------------------------------------------------------
> #include <stdio.h>
>
> int main()
> {
> unsigned long i = 0;
>
> for (;;) {
> i++;
> }
>
> printf("i:%ld\n", i);
> return 0;
> }
> ----------------------------------------------------------
>
> 1.2) Use the following perf command in host:
> #perf stat -e cycles:G,cycles:H -C 33 -I 1000 sleep 1
> # time counts unit events
> 1.000817400 3,299,471,572 cycles:G
> 1.000817400 3,240,586 cycles:H
>
> This result is correct, my cpu's frequency is 3.3G.
>
> 1.3) Use the following perf command in host:
> #perf stat -e cycles:G,cycles:H -C 33 -d -d -I 1000 sleep 1
> time counts unit events
> 1.000831480 153,634,097 cycles:G (70.03%)
> 1.000831480 3,147,940,599 cycles:H (70.03%)
> 1.000831480 1,143,598,527 L1-dcache-loads (70.03%)
> 1.000831480 9,986 L1-dcache-load-misses # 0.00% of all L1-dcache accesses (70.03%)
> 1.000831480 <not supported> LLC-loads
> 1.000831480 <not supported> LLC-load-misses
> 1.000831480 580,887,696 L1-icache-loads (70.03%)
> 1.000831480 77,855 L1-icache-load-misses # 0.01% of all L1-icache accesses (70.03%)
> 1.000831480 6,112,224,612 dTLB-loads (70.03%)
> 1.000831480 16,222 dTLB-load-misses # 0.00% of all dTLB cache accesses (69.94%)
> 1.000831480 590,015,996 iTLB-loads (59.95%)
> 1.000831480 505 iTLB-load-misses # 0.00% of all iTLB cache accesses (59.95%)
>
> This result is wrong. The "cycle:G" should be nearly 3.3G.
>
> 2.) Root cause.
> There is only 7 counters in my arm64 platform:
> (one cycle counter) + (6 normal counters)
>
> In 1.3 above, we will use 10 event counters.
> Since we only have 7 counters, the perf core will trigger
> event multiplexing in hrtimer:
> merge_sched_in() -->perf_mux_hrtimer_restart() -->
> perf_rotate_context().
>
> In the perf_rotate_context(), it does not restore some PMU registers
> as context_switch() does. In context_switch():
> kvm_sched_in() --> kvm_vcpu_pmu_restore_guest()
> kvm_sched_out() --> kvm_vcpu_pmu_restore_host()
>
> So we got wrong result.
>
> 3.) About this patch.
> 3.1) Add arch_perf_rotate_pmu_set()
> 3.2) Add is_guest().
> Check the context for hrtimer.
> 3.3) In arm64's arch_perf_rotate_pmu_set(),
> set the PMU registers by the context.
>
> 4.) Test result of this patch:
> #perf stat -e cycles:G,cycles:H -C 33 -d -d -I 1000 sleep 1
> time counts unit events
> 1.000817360 3,297,898,244 cycles:G (70.03%)
> 1.000817360 2,719,941 cycles:H (70.03%)
> 1.000817360 883,764 L1-dcache-loads (70.03%)
> 1.000817360 17,517 L1-dcache-load-misses # 1.98% of all L1-dcache accesses (70.03%)
> 1.000817360 <not supported> LLC-loads
> 1.000817360 <not supported> LLC-load-misses
> 1.000817360 1,033,816 L1-icache-loads (70.03%)
> 1.000817360 103,839 L1-icache-load-misses # 10.04% of all L1-icache accesses (70.03%)
> 1.000817360 982,401 dTLB-loads (70.03%)
> 1.000817360 28,272 dTLB-load-misses # 2.88% of all dTLB cache accesses (69.94%)
> 1.000817360 972,072 iTLB-loads (59.95%)
> 1.000817360 772 iTLB-load-misses # 0.08% of all iTLB cache accesses (59.95%)
>
> The result is correct. The "cycle:G" is nearly 3.3G now.
>
> Signed-off-by: Huang Shijie <[email protected]>
> ---
> arch/arm64/kvm/pmu.c | 8 ++++++++
> include/linux/kvm_host.h | 1 +
> kernel/events/core.c | 5 +++++
> virt/kvm/kvm_main.c | 9 +++++++++
> 4 files changed, 23 insertions(+)
>
> diff --git a/arch/arm64/kvm/pmu.c b/arch/arm64/kvm/pmu.c
> index 121f1a14c829..a6815c3f0c4e 100644
> --- a/arch/arm64/kvm/pmu.c
> +++ b/arch/arm64/kvm/pmu.c
> @@ -210,6 +210,14 @@ void kvm_vcpu_pmu_restore_host(struct kvm_vcpu *vcpu)
> kvm_vcpu_pmu_disable_el0(events_guest);
> }
>
> +void arch_perf_rotate_pmu_set(void)
> +{
> + if (is_guest())
> + kvm_vcpu_pmu_restore_guest(NULL);
> + else
> + kvm_vcpu_pmu_restore_host(NULL);
> +}

So we're now randomly poking at the counters even when no guest is
running, based on whatever is stashed in internal KVM data structures?
I'm sure this is going to work really well.

Hint: even if these functions don't directly look at the vcpu pointer,
passing NULL is a really bad idea. It is a sure sign that you don't
have the context on which to perform the action you're trying to do.

This really shouldn't do *anything* when the rotation process is not
preempting a guest.

> +
> /*
> * With VHE, keep track of the PMUSERENR_EL0 value for the host EL0 on the pCPU
> * where PMUSERENR_EL0 for the guest is loaded, since PMUSERENR_EL0 is switched
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 9d3ac7720da9..e350cbc8190f 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -931,6 +931,7 @@ void kvm_destroy_vcpus(struct kvm *kvm);
>
> void vcpu_load(struct kvm_vcpu *vcpu);
> void vcpu_put(struct kvm_vcpu *vcpu);
> +bool is_guest(void);

Why do we need this (not to mention the poor choice of name)? We
already have kvm_get_running_vcpu(), which does everything you need
(and gives you the actual context).

>
> #ifdef __KVM_HAVE_IOAPIC
> void kvm_arch_post_irq_ack_notifier_list_update(struct kvm *kvm);
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index 6fd9272eec6e..fe78f9d17eba 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -4229,6 +4229,10 @@ ctx_event_to_rotate(struct perf_event_pmu_context *pmu_ctx)
> return event;
> }
>
> +void __weak arch_perf_rotate_pmu_set(void)
> +{
> +}
> +
> static bool perf_rotate_context(struct perf_cpu_pmu_context *cpc)
> {
> struct perf_cpu_context *cpuctx = this_cpu_ptr(&perf_cpu_context);
> @@ -4282,6 +4286,7 @@ static bool perf_rotate_context(struct perf_cpu_pmu_context *cpc)
> if (task_event || (task_epc && cpu_event))
> __pmu_ctx_sched_in(task_epc->ctx, pmu);
>
> + arch_perf_rotate_pmu_set();

KVM already supports hooking into the perf core using the
perf_guest_info_callbacks structure. Why should we need a separate
mechanism?

> perf_pmu_enable(pmu);
> perf_ctx_unlock(cpuctx, cpuctx->task_ctx);
>
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index dfbaafbe3a00..a77d336552be 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -218,6 +218,15 @@ void vcpu_load(struct kvm_vcpu *vcpu)
> }
> EXPORT_SYMBOL_GPL(vcpu_load);
>
> +/* Do we in the guest? */
> +bool is_guest(void)
> +{
> + struct kvm_vcpu *vcpu;
> +
> + vcpu = __this_cpu_read(kvm_running_vcpu);
> + return !!vcpu;
> +}
> +
> void vcpu_put(struct kvm_vcpu *vcpu)
> {
> preempt_disable();

It looks like you've identified an actual issue. However, I'm highly
sceptical of the implementation. This really needs some more work.

Another question is how the same thing is handled on x86? Maybe they
don't suffer from this problem thanks to specific architectural
features, but it'd be good to find out, as this may guide the
implementation in a different way.

Thanks,

M.

--
Without deviation from the norm, progress is not possible.

2023-08-09 10:30:17

by Oliver Upton

Subject: Re: [PATCH] perf/core: fix the bug in the event multiplexing

On Wed, Aug 09, 2023 at 09:48:29AM +0100, Marc Zyngier wrote:

[...]

> Another question is how the same thing is handled on x86? Maybe they
> don't suffer from this problem thanks to specific architectural
> features, but it'd be good to find out, as this may guide the
> implementation in a different way.

I'm pretty sure the bug here is arm64-specific. x86 (at least on Intel)
fetches the guest PMU context from the perf driver w/ irqs disabled
immediately before entering the guest (see atomic_switch_perf_msrs()).

--
Thanks,
Oliver