LinuxLists.cc - [BUG] Guest OSes die simultaneously (bisected)

2024-01-03 22:24:12

Subject: [BUG] Guest OSes die simultaneously (bisected)

Hello!

Since some time between v5.19 and v6.4, long-running rcutorture tests
would (rarely but intolerably often) have all guests on a given host die
simultaneously with something like an instruction fault or a segmentation
violation.

Each bisection step required 20 hosts running 10 hours each, and
this eventually fingered commit c59a1f106f5c ("KVM: x86/pmu: Add
IA32_PEBS_ENABLE MSR emulation for extended PEBS"). Although this commit
is certainly messing with things that could possibly cause all manner
of mischief, I don't immediately see a smoking gun. Except that the
commit prior to this one is rock solid.

Just to make things a bit more exciting, bisection in mainline proved
to be problematic due to bugs of various kinds that hid this one. I was
therefore forced to bisect among the commits backported to the internal
v5.19-based kernel, which fingered the backported version of the patch
called out above.

Please note that this is not (yet) an emergency. I will just continue
to run rcutorture on v5.19-based hypervisors in the meantime.

Any suggestions for debugging or fixing?

Thanx, Paul

2024-01-03 23:27:41

by Paul E. McKenney

[permalink] [raw]

Subject: Re: [BUG] Guest OSes die simultaneously (bisected)

On Wed, Jan 03, 2024 at 02:22:23PM -0800, Paul E. McKenney wrote:
> Hello!
>
> Since some time between v5.19 and v6.4, long-running rcutorture tests
> would (rarely but intolerably often) have all guests on a given host die
> simultaneously with something like an instruction fault or a segmentation
> violation.
>
> Each bisection step required 20 hosts running 10 hours each, and
> this eventually fingered commit c59a1f106f5c ("KVM: x86/pmu: Add
> IA32_PEBS_ENABLE MSR emulation for extended PEBS"). Although this commit
> is certainly messing with things that could possibly cause all manner
> of mischief, I don't immediately see a smoking gun. Except that the
> commit prior to this one is rock solid.
>
> Just to make things a bit more exciting, bisection in mainline proved
> to be problematic due to bugs of various kinds that hid this one. I was
> therefore forced to bisect among the commits backported to the internal
> v5.19-based kernel, which fingered the backported version of the patch
> called out above.

Ah, and so why do I believe that this is a problem in mainline rather
than just (say) a backporting mistake?

Because this issue was first located in v6.4, which already has this
commit included.

Thanx, Paul

> Please note that this is not (yet) an emergency. I will just continue
> to run rcutorture on v5.19-based hypervisors in the meantime.
>
> Any suggestions for debugging or fixing?
>
> Thanx, Paul

2024-01-04 00:25:13

by Sean Christopherson

[permalink] [raw]

Subject: Re: [BUG] Guest OSes die simultaneously (bisected)

On Wed, Jan 03, 2024, Paul E. McKenney wrote:
> On Wed, Jan 03, 2024 at 02:22:23PM -0800, Paul E. McKenney wrote:
> > Hello!
> >
> > Since some time between v5.19 and v6.4, long-running rcutorture tests
> > would (rarely but intolerably often) have all guests on a given host die
> > simultaneously with something like an instruction fault or a segmentation
> > violation.
> >
> > Each bisection step required 20 hosts running 10 hours each, and
> > this eventually fingered commit c59a1f106f5c ("KVM: x86/pmu: Add
> > IA32_PEBS_ENABLE MSR emulation for extended PEBS"). Although this commit
> > is certainly messing with things that could possibly cause all manner
> > of mischief, I don't immediately see a smoking gun. Except that the
> > commit prior to this one is rock solid.
> > Just to make things a bit more exciting, bisection in mainline proved
> > to be problematic due to bugs of various kinds that hid this one. I was
> > therefore forced to bisect among the commits backported to the internal
> > v5.19-based kernel, which fingered the backported version of the patch
> > called out above.
>
> Ah, and so why do I believe that this is a problem in mainline rather
> than just (say) a backporting mistake?
>
> Because this issue was first located in v6.4, which already has this
> commit included.
>
> Thanx, Paul
>
> > Please note that this is not (yet) an emergency. I will just continue
> > to run rcutorture on v5.19-based hypervisors in the meantime.
> >
> > Any suggestions for debugging or fixing?

This looks suspect:

+ u64 pebs_mask = cpuc->pebs_enabled & x86_pmu.pebs_capable;
+ int global_ctrl, pebs_enable;

- arr[0].msr = MSR_CORE_PERF_GLOBAL_CTRL;
- arr[0].host = intel_ctrl & ~cpuc->intel_ctrl_guest_mask;
- arr[0].guest = intel_ctrl & ~cpuc->intel_ctrl_host_mask;
- arr[0].guest &= ~(cpuc->pebs_enabled & x86_pmu.pebs_capable);
- *nr = 1;
+ *nr = 0;
+ global_ctrl = (*nr)++;
+ arr[global_ctrl] = (struct perf_guest_switch_msr){
+ .msr = MSR_CORE_PERF_GLOBAL_CTRL,
+ .host = intel_ctrl & ~cpuc->intel_ctrl_guest_mask,
+ .guest = intel_ctrl & (~cpuc->intel_ctrl_host_mask | ~pebs_mask),
+ };

IIUC (always a big if with this code), the intent is that the guest's version of
PERF_GLOBAL_CTRL gets bits that are (a) not exclusive to the host and (b) not
being used for PEBS. (b) is necessary because PEBS generates records in memory
using virtual addresses, i.e. the CPU will write to memory using a virtual address
that is valid for the host but not the guest. And so PMU counters that are
configured to generate PEBS records need to be disabled while running the guest.

Before that commit, the logic was:

guest[PERF_GLOBAL_CTRL] = ctrl & ~host;
guest[PERF_GLOBAL_CTRL] &= ~pebs;

But after, it's now:

guest[PERF_GLOBAL_CTRL] = ctrl & (~host | ~pebs);

I.e. the kernel is enabled counters in the guest that are not host-only OR not
PEBS. E.g. if only counter 0 is in use, it's using PEBS, but it's not exclusive
to the host, then the new code will yield (truncated to a single byte for sanity)

1 = 1 & (0xf | 0xe)

and thus keep counter 0 enabled, whereas the old code would yield

1 = 1 & 0xf
0 = 1 & 0xe

A bit of a shot in the dark and completed untested, but I think this is the correct
fix?

diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index a08f794a0e79..92d5a3464cb2 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -4056,7 +4056,7 @@ static struct perf_guest_switch_msr *intel_guest_get_msrs(int *nr, void *data)
arr[global_ctrl] = (struct perf_guest_switch_msr){
.msr = MSR_CORE_PERF_GLOBAL_CTRL,
.host = intel_ctrl & ~cpuc->intel_ctrl_guest_mask,
- .guest = intel_ctrl & (~cpuc->intel_ctrl_host_mask | ~pebs_mask),
+ .guest = intel_ctrl & ~(cpuc->intel_ctrl_host_mask | pebs_mask),
};

if (!x86_pmu.pebs)

2024-01-04 01:01:40

by Paul E. McKenney

[permalink] [raw]

Subject: Re: [BUG] Guest OSes die simultaneously (bisected)

On Wed, Jan 03, 2024 at 04:24:06PM -0800, Sean Christopherson wrote:
> On Wed, Jan 03, 2024, Paul E. McKenney wrote:
> > On Wed, Jan 03, 2024 at 02:22:23PM -0800, Paul E. McKenney wrote:
> > > Hello!
> > >
> > > Since some time between v5.19 and v6.4, long-running rcutorture tests
> > > would (rarely but intolerably often) have all guests on a given host die
> > > simultaneously with something like an instruction fault or a segmentation
> > > violation.
> > >
> > > Each bisection step required 20 hosts running 10 hours each, and
> > > this eventually fingered commit c59a1f106f5c ("KVM: x86/pmu: Add
> > > IA32_PEBS_ENABLE MSR emulation for extended PEBS"). Although this commit
> > > is certainly messing with things that could possibly cause all manner
> > > of mischief, I don't immediately see a smoking gun. Except that the
> > > commit prior to this one is rock solid.
> > > Just to make things a bit more exciting, bisection in mainline proved
> > > to be problematic due to bugs of various kinds that hid this one. I was
> > > therefore forced to bisect among the commits backported to the internal
> > > v5.19-based kernel, which fingered the backported version of the patch
> > > called out above.
> >
> > Ah, and so why do I believe that this is a problem in mainline rather
> > than just (say) a backporting mistake?
> >
> > Because this issue was first located in v6.4, which already has this
> > commit included.
> >
> > Thanx, Paul
> >
> > > Please note that this is not (yet) an emergency. I will just continue
> > > to run rcutorture on v5.19-based hypervisors in the meantime.
> > >
> > > Any suggestions for debugging or fixing?
>
> This looks suspect:
>
> + u64 pebs_mask = cpuc->pebs_enabled & x86_pmu.pebs_capable;
> + int global_ctrl, pebs_enable;
>
> - arr[0].msr = MSR_CORE_PERF_GLOBAL_CTRL;
> - arr[0].host = intel_ctrl & ~cpuc->intel_ctrl_guest_mask;
> - arr[0].guest = intel_ctrl & ~cpuc->intel_ctrl_host_mask;
> - arr[0].guest &= ~(cpuc->pebs_enabled & x86_pmu.pebs_capable);
> - *nr = 1;
> + *nr = 0;
> + global_ctrl = (*nr)++;
> + arr[global_ctrl] = (struct perf_guest_switch_msr){
> + .msr = MSR_CORE_PERF_GLOBAL_CTRL,
> + .host = intel_ctrl & ~cpuc->intel_ctrl_guest_mask,
> + .guest = intel_ctrl & (~cpuc->intel_ctrl_host_mask | ~pebs_mask),
> + };
>
>
> IIUC (always a big if with this code), the intent is that the guest's version of
> PERF_GLOBAL_CTRL gets bits that are (a) not exclusive to the host and (b) not
> being used for PEBS. (b) is necessary because PEBS generates records in memory
> using virtual addresses, i.e. the CPU will write to memory using a virtual address
> that is valid for the host but not the guest. And so PMU counters that are
> configured to generate PEBS records need to be disabled while running the guest.
>
> Before that commit, the logic was:
>
> guest[PERF_GLOBAL_CTRL] = ctrl & ~host;
> guest[PERF_GLOBAL_CTRL] &= ~pebs;
>
> But after, it's now:
>
> guest[PERF_GLOBAL_CTRL] = ctrl & (~host | ~pebs);
>
> I.e. the kernel is enabled counters in the guest that are not host-only OR not
> PEBS. E.g. if only counter 0 is in use, it's using PEBS, but it's not exclusive
> to the host, then the new code will yield (truncated to a single byte for sanity)
>
> 1 = 1 & (0xf | 0xe)
>
> and thus keep counter 0 enabled, whereas the old code would yield
>
> 1 = 1 & 0xf
> 0 = 1 & 0xe
>
> A bit of a shot in the dark and completed untested, but I think this is the correct
> fix?

I am firing off some tests, and either way, thank you very much!!!

Thanx, Paul

> diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
> index a08f794a0e79..92d5a3464cb2 100644
> --- a/arch/x86/events/intel/core.c
> +++ b/arch/x86/events/intel/core.c
> @@ -4056,7 +4056,7 @@ static struct perf_guest_switch_msr *intel_guest_get_msrs(int *nr, void *data)
> arr[global_ctrl] = (struct perf_guest_switch_msr){
> .msr = MSR_CORE_PERF_GLOBAL_CTRL,
> .host = intel_ctrl & ~cpuc->intel_ctrl_guest_mask,
> - .guest = intel_ctrl & (~cpuc->intel_ctrl_host_mask | ~pebs_mask),
> + .guest = intel_ctrl & ~(cpuc->intel_ctrl_host_mask | pebs_mask),
> };
>
> if (!x86_pmu.pebs)
>

2024-01-04 10:01:18

by Breno Leitao

[permalink] [raw]

Subject: Re: [BUG] Guest OSes die simultaneously (bisected)

On Wed, Jan 03, 2024 at 02:22:23PM -0800, Paul E. McKenney wrote:
> I was
> therefore forced to bisect among the commits backported to the internal
> v5.19-based kernel, which fingered the backported version of the patch
> called out above.

Just to add some context to these backport, this commit (c59a1f106f5c)
was backported to the internal v5.19-based kernel in order to easily
backport these two fixes.

a16eb25b09c02a54c ("KVM: x86: Mask LVTPC when handling a PMI")
73554b29bd70546c1 ("KVM: x86/pmu: Synthesize at most one PMI per VM-exit")

They are required to solve the softlockup problem reported here:

https://lore.kernel.org/all/[email protected]/

2024-01-04 14:51:04

by Paul E. McKenney

[permalink] [raw]

Subject: Re: [BUG] Guest OSes die simultaneously (bisected)

On Wed, Jan 03, 2024 at 05:00:35PM -0800, Paul E. McKenney wrote:
> On Wed, Jan 03, 2024 at 04:24:06PM -0800, Sean Christopherson wrote:
> > On Wed, Jan 03, 2024, Paul E. McKenney wrote:
> > > On Wed, Jan 03, 2024 at 02:22:23PM -0800, Paul E. McKenney wrote:
> > > > Hello!
> > > >
> > > > Since some time between v5.19 and v6.4, long-running rcutorture tests
> > > > would (rarely but intolerably often) have all guests on a given host die
> > > > simultaneously with something like an instruction fault or a segmentation
> > > > violation.
> > > >
> > > > Each bisection step required 20 hosts running 10 hours each, and
> > > > this eventually fingered commit c59a1f106f5c ("KVM: x86/pmu: Add
> > > > IA32_PEBS_ENABLE MSR emulation for extended PEBS"). Although this commit
> > > > is certainly messing with things that could possibly cause all manner
> > > > of mischief, I don't immediately see a smoking gun. Except that the
> > > > commit prior to this one is rock solid.
> > > > Just to make things a bit more exciting, bisection in mainline proved
> > > > to be problematic due to bugs of various kinds that hid this one. I was
> > > > therefore forced to bisect among the commits backported to the internal
> > > > v5.19-based kernel, which fingered the backported version of the patch
> > > > called out above.
> > >
> > > Ah, and so why do I believe that this is a problem in mainline rather
> > > than just (say) a backporting mistake?
> > >
> > > Because this issue was first located in v6.4, which already has this
> > > commit included.
> > >
> > > Thanx, Paul
> > >
> > > > Please note that this is not (yet) an emergency. I will just continue
> > > > to run rcutorture on v5.19-based hypervisors in the meantime.
> > > >
> > > > Any suggestions for debugging or fixing?
> >
> > This looks suspect:
> >
> > + u64 pebs_mask = cpuc->pebs_enabled & x86_pmu.pebs_capable;
> > + int global_ctrl, pebs_enable;
> >
> > - arr[0].msr = MSR_CORE_PERF_GLOBAL_CTRL;
> > - arr[0].host = intel_ctrl & ~cpuc->intel_ctrl_guest_mask;
> > - arr[0].guest = intel_ctrl & ~cpuc->intel_ctrl_host_mask;
> > - arr[0].guest &= ~(cpuc->pebs_enabled & x86_pmu.pebs_capable);
> > - *nr = 1;
> > + *nr = 0;
> > + global_ctrl = (*nr)++;
> > + arr[global_ctrl] = (struct perf_guest_switch_msr){
> > + .msr = MSR_CORE_PERF_GLOBAL_CTRL,
> > + .host = intel_ctrl & ~cpuc->intel_ctrl_guest_mask,
> > + .guest = intel_ctrl & (~cpuc->intel_ctrl_host_mask | ~pebs_mask),
> > + };
> >
> >
> > IIUC (always a big if with this code), the intent is that the guest's version of
> > PERF_GLOBAL_CTRL gets bits that are (a) not exclusive to the host and (b) not
> > being used for PEBS. (b) is necessary because PEBS generates records in memory
> > using virtual addresses, i.e. the CPU will write to memory using a virtual address
> > that is valid for the host but not the guest. And so PMU counters that are
> > configured to generate PEBS records need to be disabled while running the guest.
> >
> > Before that commit, the logic was:
> >
> > guest[PERF_GLOBAL_CTRL] = ctrl & ~host;
> > guest[PERF_GLOBAL_CTRL] &= ~pebs;
> >
> > But after, it's now:
> >
> > guest[PERF_GLOBAL_CTRL] = ctrl & (~host | ~pebs);
> >
> > I.e. the kernel is enabled counters in the guest that are not host-only OR not
> > PEBS. E.g. if only counter 0 is in use, it's using PEBS, but it's not exclusive
> > to the host, then the new code will yield (truncated to a single byte for sanity)
> >
> > 1 = 1 & (0xf | 0xe)
> >
> > and thus keep counter 0 enabled, whereas the old code would yield
> >
> > 1 = 1 & 0xf
> > 0 = 1 & 0xe
> >
> > A bit of a shot in the dark and completed untested, but I think this is the correct
> > fix?
>
> I am firing off some tests, and either way, thank you very much!!!

Woo-hoo!!! ;-)

Tested-by: Paul E. McKenney <[email protected]>

Will you be sending a proper patch, or would you prefer that I do so?
In the latter case, I would need your Signed-off-by.

And again, thank you very much!!!

Thanx, Paul

> > diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
> > index a08f794a0e79..92d5a3464cb2 100644
> > --- a/arch/x86/events/intel/core.c
> > +++ b/arch/x86/events/intel/core.c
> > @@ -4056,7 +4056,7 @@ static struct perf_guest_switch_msr *intel_guest_get_msrs(int *nr, void *data)
> > arr[global_ctrl] = (struct perf_guest_switch_msr){
> > .msr = MSR_CORE_PERF_GLOBAL_CTRL,
> > .host = intel_ctrl & ~cpuc->intel_ctrl_guest_mask,
> > - .guest = intel_ctrl & (~cpuc->intel_ctrl_host_mask | ~pebs_mask),
> > + .guest = intel_ctrl & ~(cpuc->intel_ctrl_host_mask | pebs_mask),
> > };
> >
> > if (!x86_pmu.pebs)
> >

2024-01-04 15:12:23

by Paolo Bonzini

[permalink] [raw]

Subject: Re: [BUG] Guest OSes die simultaneously (bisected)

On Thu, Jan 4, 2024 at 3:58 PM Paul E. McKenney <[email protected]> wrote:
>
> On Wed, Jan 03, 2024 at 05:00:35PM -0800, Paul E. McKenney wrote:
> > On Wed, Jan 03, 2024 at 04:24:06PM -0800, Sean Christopherson wrote:
> > > On Wed, Jan 03, 2024, Paul E. McKenney wrote:
> > > > On Wed, Jan 03, 2024 at 02:22:23PM -0800, Paul E. McKenney wrote:
> > > > > Hello!
> > > > >
> > > > > Since some time between v5.19 and v6.4, long-running rcutorture tests
> > > > > would (rarely but intolerably often) have all guests on a given host die
> > > > > simultaneously with something like an instruction fault or a segmentation
> > > > > violation.
> > > > >
> > > > > Each bisection step required 20 hosts running 10 hours each, and
> > > > > this eventually fingered commit c59a1f106f5c ("KVM: x86/pmu: Add
> > > > > IA32_PEBS_ENABLE MSR emulation for extended PEBS"). Although this commit
> > > > > is certainly messing with things that could possibly cause all manner
> > > > > of mischief, I don't immediately see a smoking gun. Except that the
> > > > > commit prior to this one is rock solid.
> > > > > Just to make things a bit more exciting, bisection in mainline proved
> > > > > to be problematic due to bugs of various kinds that hid this one. I was
> > > > > therefore forced to bisect among the commits backported to the internal
> > > > > v5.19-based kernel, which fingered the backported version of the patch
> > > > > called out above.
> > > >
> > > > Ah, and so why do I believe that this is a problem in mainline rather
> > > > than just (say) a backporting mistake?
> > > >
> > > > Because this issue was first located in v6.4, which already has this
> > > > commit included.
> > > >
> > > > Thanx, Paul
> > > >
> > > > > Please note that this is not (yet) an emergency. I will just continue
> > > > > to run rcutorture on v5.19-based hypervisors in the meantime.
> > > > >
> > > > > Any suggestions for debugging or fixing?
> > >
> > > This looks suspect:
> > >
> > > + u64 pebs_mask = cpuc->pebs_enabled & x86_pmu.pebs_capable;
> > > + int global_ctrl, pebs_enable;
> > >
> > > - arr[0].msr = MSR_CORE_PERF_GLOBAL_CTRL;
> > > - arr[0].host = intel_ctrl & ~cpuc->intel_ctrl_guest_mask;
> > > - arr[0].guest = intel_ctrl & ~cpuc->intel_ctrl_host_mask;
> > > - arr[0].guest &= ~(cpuc->pebs_enabled & x86_pmu.pebs_capable);
> > > - *nr = 1;
> > > + *nr = 0;
> > > + global_ctrl = (*nr)++;
> > > + arr[global_ctrl] = (struct perf_guest_switch_msr){
> > > + .msr = MSR_CORE_PERF_GLOBAL_CTRL,
> > > + .host = intel_ctrl & ~cpuc->intel_ctrl_guest_mask,
> > > + .guest = intel_ctrl & (~cpuc->intel_ctrl_host_mask | ~pebs_mask),
> > > + };
> > >
> > >
> > > IIUC (always a big if with this code), the intent is that the guest's version of
> > > PERF_GLOBAL_CTRL gets bits that are (a) not exclusive to the host and (b) not
> > > being used for PEBS. (b) is necessary because PEBS generates records in memory
> > > using virtual addresses, i.e. the CPU will write to memory using a virtual address
> > > that is valid for the host but not the guest. And so PMU counters that are
> > > configured to generate PEBS records need to be disabled while running the guest.
> > >
> > > Before that commit, the logic was:
> > >
> > > guest[PERF_GLOBAL_CTRL] = ctrl & ~host;
> > > guest[PERF_GLOBAL_CTRL] &= ~pebs;
> > >
> > > But after, it's now:
> > >
> > > guest[PERF_GLOBAL_CTRL] = ctrl & (~host | ~pebs);
> > >
> > > I.e. the kernel is enabled counters in the guest that are not host-only OR not
> > > PEBS. E.g. if only counter 0 is in use, it's using PEBS, but it's not exclusive
> > > to the host, then the new code will yield (truncated to a single byte for sanity)
> > >
> > > 1 = 1 & (0xf | 0xe)
> > >
> > > and thus keep counter 0 enabled, whereas the old code would yield
> > >
> > > 1 = 1 & 0xf
> > > 0 = 1 & 0xe
> > >
> > > A bit of a shot in the dark and completed untested, but I think this is the correct
> > > fix?
> >
> > I am firing off some tests, and either way, thank you very much!!!
>
> Woo-hoo!!! ;-)
>
> Tested-by: Paul E. McKenney <[email protected]>
>
> Will you be sending a proper patch, or would you prefer that I do so?
> In the latter case, I would need your Signed-off-by.

I will fast track this one to Linus.

Paolo

> And again, thank you very much!!!
>
> Thanx, Paul
>
> > > diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
> > > index a08f794a0e79..92d5a3464cb2 100644
> > > --- a/arch/x86/events/intel/core.c
> > > +++ b/arch/x86/events/intel/core.c
> > > @@ -4056,7 +4056,7 @@ static struct perf_guest_switch_msr *intel_guest_get_msrs(int *nr, void *data)
> > > arr[global_ctrl] = (struct perf_guest_switch_msr){
> > > .msr = MSR_CORE_PERF_GLOBAL_CTRL,
> > > .host = intel_ctrl & ~cpuc->intel_ctrl_guest_mask,
> > > - .guest = intel_ctrl & (~cpuc->intel_ctrl_host_mask | ~pebs_mask),
> > > + .guest = intel_ctrl & ~(cpuc->intel_ctrl_host_mask | pebs_mask),
> > > };
> > >
> > > if (!x86_pmu.pebs)
> > >
>

2024-01-04 16:07:10

by Paul E. McKenney

[permalink] [raw]

Subject: Re: [BUG] Guest OSes die simultaneously (bisected)

On Thu, Jan 04, 2024 at 03:59:52PM +0100, Paolo Bonzini wrote:
> On Thu, Jan 4, 2024 at 3:58 PM Paul E. McKenney <[email protected]> wrote:
> >
> > On Wed, Jan 03, 2024 at 05:00:35PM -0800, Paul E. McKenney wrote:
> > > On Wed, Jan 03, 2024 at 04:24:06PM -0800, Sean Christopherson wrote:
> > > > On Wed, Jan 03, 2024, Paul E. McKenney wrote:
> > > > > On Wed, Jan 03, 2024 at 02:22:23PM -0800, Paul E. McKenney wrote:
> > > > > > Hello!
> > > > > >
> > > > > > Since some time between v5.19 and v6.4, long-running rcutorture tests
> > > > > > would (rarely but intolerably often) have all guests on a given host die
> > > > > > simultaneously with something like an instruction fault or a segmentation
> > > > > > violation.
> > > > > >
> > > > > > Each bisection step required 20 hosts running 10 hours each, and
> > > > > > this eventually fingered commit c59a1f106f5c ("KVM: x86/pmu: Add
> > > > > > IA32_PEBS_ENABLE MSR emulation for extended PEBS"). Although this commit
> > > > > > is certainly messing with things that could possibly cause all manner
> > > > > > of mischief, I don't immediately see a smoking gun. Except that the
> > > > > > commit prior to this one is rock solid.
> > > > > > Just to make things a bit more exciting, bisection in mainline proved
> > > > > > to be problematic due to bugs of various kinds that hid this one. I was
> > > > > > therefore forced to bisect among the commits backported to the internal
> > > > > > v5.19-based kernel, which fingered the backported version of the patch
> > > > > > called out above.
> > > > >
> > > > > Ah, and so why do I believe that this is a problem in mainline rather
> > > > > than just (say) a backporting mistake?
> > > > >
> > > > > Because this issue was first located in v6.4, which already has this
> > > > > commit included.
> > > > >
> > > > > Thanx, Paul
> > > > >
> > > > > > Please note that this is not (yet) an emergency. I will just continue
> > > > > > to run rcutorture on v5.19-based hypervisors in the meantime.
> > > > > >
> > > > > > Any suggestions for debugging or fixing?
> > > >
> > > > This looks suspect:
> > > >
> > > > + u64 pebs_mask = cpuc->pebs_enabled & x86_pmu.pebs_capable;
> > > > + int global_ctrl, pebs_enable;
> > > >
> > > > - arr[0].msr = MSR_CORE_PERF_GLOBAL_CTRL;
> > > > - arr[0].host = intel_ctrl & ~cpuc->intel_ctrl_guest_mask;
> > > > - arr[0].guest = intel_ctrl & ~cpuc->intel_ctrl_host_mask;
> > > > - arr[0].guest &= ~(cpuc->pebs_enabled & x86_pmu.pebs_capable);
> > > > - *nr = 1;
> > > > + *nr = 0;
> > > > + global_ctrl = (*nr)++;
> > > > + arr[global_ctrl] = (struct perf_guest_switch_msr){
> > > > + .msr = MSR_CORE_PERF_GLOBAL_CTRL,
> > > > + .host = intel_ctrl & ~cpuc->intel_ctrl_guest_mask,
> > > > + .guest = intel_ctrl & (~cpuc->intel_ctrl_host_mask | ~pebs_mask),
> > > > + };
> > > >
> > > >
> > > > IIUC (always a big if with this code), the intent is that the guest's version of
> > > > PERF_GLOBAL_CTRL gets bits that are (a) not exclusive to the host and (b) not
> > > > being used for PEBS. (b) is necessary because PEBS generates records in memory
> > > > using virtual addresses, i.e. the CPU will write to memory using a virtual address
> > > > that is valid for the host but not the guest. And so PMU counters that are
> > > > configured to generate PEBS records need to be disabled while running the guest.
> > > >
> > > > Before that commit, the logic was:
> > > >
> > > > guest[PERF_GLOBAL_CTRL] = ctrl & ~host;
> > > > guest[PERF_GLOBAL_CTRL] &= ~pebs;
> > > >
> > > > But after, it's now:
> > > >
> > > > guest[PERF_GLOBAL_CTRL] = ctrl & (~host | ~pebs);
> > > >
> > > > I.e. the kernel is enabled counters in the guest that are not host-only OR not
> > > > PEBS. E.g. if only counter 0 is in use, it's using PEBS, but it's not exclusive
> > > > to the host, then the new code will yield (truncated to a single byte for sanity)
> > > >
> > > > 1 = 1 & (0xf | 0xe)
> > > >
> > > > and thus keep counter 0 enabled, whereas the old code would yield
> > > >
> > > > 1 = 1 & 0xf
> > > > 0 = 1 & 0xe
> > > >
> > > > A bit of a shot in the dark and completed untested, but I think this is the correct
> > > > fix?
> > >
> > > I am firing off some tests, and either way, thank you very much!!!
> >
> > Woo-hoo!!! ;-)
> >
> > Tested-by: Paul E. McKenney <[email protected]>
> >
> > Will you be sending a proper patch, or would you prefer that I do so?
> > In the latter case, I would need your Signed-off-by.
>
> I will fast track this one to Linus.

Thank you, Paolo!

One additional request, if I may be so bold...

Although I am happy to have been able to locate the commit (and even
happier that Sean spotted the problem and that you quickly pushed the
fix to mainline!), chasing this consumed a lot of time and systems over
an embarrassingly large number of months. As in I first spotted this
bug in late July. Despite a number of increasingly complex attempts,
bisection became feasible only after the buggy commit was backported to
our internal v5.19 code base. :-(

My (completely random) guess is that there is some rare combination
of events that causes this code to fail. If so, is it feasible to
construct a test that makes this rare combination of events less rare,
so that similar future bugs are caught more quickly?

And please understand that I am not casting shade on those who wrote,
reviewed, and committed that buggy commit. As in I freely confess that
I had to stare at Sean's fix for a few minutes before I figured out what
was going on. Instead, the point I am trying to make is that carefully
constructed tests can serve as tireless and accurate code reviewers.
This won't ever replace actual code review, but my experience indicates
that it will help find more bugs more quickly and more easily.

Thanx, Paul

> Paolo
>
> > And again, thank you very much!!!
> >
> > Thanx, Paul
> >
> > > > diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
> > > > index a08f794a0e79..92d5a3464cb2 100644
> > > > --- a/arch/x86/events/intel/core.c
> > > > +++ b/arch/x86/events/intel/core.c
> > > > @@ -4056,7 +4056,7 @@ static struct perf_guest_switch_msr *intel_guest_get_msrs(int *nr, void *data)
> > > > arr[global_ctrl] = (struct perf_guest_switch_msr){
> > > > .msr = MSR_CORE_PERF_GLOBAL_CTRL,
> > > > .host = intel_ctrl & ~cpuc->intel_ctrl_guest_mask,
> > > > - .guest = intel_ctrl & (~cpuc->intel_ctrl_host_mask | ~pebs_mask),
> > > > + .guest = intel_ctrl & ~(cpuc->intel_ctrl_host_mask | pebs_mask),
> > > > };
> > > >
> > > > if (!x86_pmu.pebs)
> > > >
> >
>

2024-01-04 16:33:14

by Paolo Bonzini

[permalink] [raw]

Subject: Re: [BUG] Guest OSes die simultaneously (bisected)

On 1/4/24 17:06, Paul E. McKenney wrote:
> Although I am happy to have been able to locate the commit (and even
> happier that Sean spotted the problem and that you quickly pushed the
> fix to mainline!), chasing this consumed a lot of time and systems over
> an embarrassingly large number of months. As in I first spotted this
> bug in late July. Despite a number of increasingly complex attempts,
> bisection became feasible only after the buggy commit was backported to
> our internal v5.19 code base. ????

Yes, this strikes two sore points.

One is that I have also experienced being able to bisect only with a
somewhat more linear history (namely the CentOS Stream 9 aka c9s
frankenkernel [1]) and not with upstream. Even if the c9s kernel is not
a fully linear set of commits, there's some benefit from merge commits
that consist of slightly more curated set of patches, where each merge
commit includes both new features and bugfixes. Unfortunately, whether
you'll be able to do this with the c9s kernel depends a lot on the
subsystems involved and on the bug. Both are factors that may or may
not be known in advance.

The other, of course, is testing. The KVM selftests infrastructure is
meant for this kind of white box problem, but the space of tests that
can be written is so large, that there's always too few tests. It
shines when you have a clear bisection but an unclear fix (in the past I
have had cases where spending two days to write a test led me to writing
a fix in thirty minutes), but boosting the reproducibility is always a
good thing.

> And please understand that I am not casting shade on those who wrote,
> reviewed, and committed that buggy commit. As in I freely confess that
> I had to stare at Sean's fix for a few minutes before I figured out what
> was going on.

Oh don't worry about that---rather, I am going to cast a shade on those
that did not review the commit, namely me. I am somewhat obsessed with
Boolean logic and *probably* I would have caught it, or would have asked
to split the use of designated initializers to a separate patch. Any of
the two could, at least potentially, have saved you quite some time.

> Instead, the point I am trying to make is that carefully
> constructed tests can serve as tireless and accurate code reviewers.
> This won't ever replace actual code review, but my experience indicates
> that it will help find more bugs more quickly and more easily.

TBH this (conflict between virtual addresses on the host and the guest
leading to corruption of the guest) is probably not the kind of
adversarial test that one would have written or suggested right off the
bat. But it should be written now indeed.

Paolo

[1]
https://www.theregister.com/2023/06/30/enterprise_distro_feature_devconf/

2024-01-04 17:08:14

by Andi Kleen

[permalink] [raw]

Subject: Re: [BUG] Guest OSes die simultaneously (bisected)

> My (completely random) guess is that there is some rare combination
> of events that causes this code to fail. If so, is it feasible to
> construct a test that makes this rare combination of events less rare,
> so that similar future bugs are caught more quickly?

Yes, I tested something similar before. What you need is create lots of
PMIs with perf (running perf top should be enough) and a workload that creates
lots of exits in a guest (e.g. running fio on a virtio device). This
will stress test this particular path.

-Andi

2024-01-04 17:26:28

by Sean Christopherson

[permalink] [raw]

Subject: Re: [BUG] Guest OSes die simultaneously (bisected)

On Thu, Jan 04, 2024, Paolo Bonzini wrote:
> On 1/4/24 17:06, Paul E. McKenney wrote:
> > Instead, the point I am trying to make is that carefully
> > constructed tests can serve as tireless and accurate code reviewers.
> > This won't ever replace actual code review, but my experience indicates
> > that it will help find more bugs more quickly and more easily.
>
> TBH this (conflict between virtual addresses on the host and the guest
> leading to corruption of the guest) is probably not the kind of adversarial
> test that one would have written or suggested right off the bat.

I disagree. The flaws with PEBS using a virtual address is blatantly obvious to
anyone that has spent any time dealing with the cross-section of PMU and VMX.
Intel even explicitly added "isolation" functionality to ensure PEBS can't overrun
VM-Enter and generate host records in the guest. Not to mention that Intel
specifically addressed the virtual addressing issue in the design of Processor
Trace (PT, a.k.a. RTIT).

In other words, we *knew* exactly what would break *and* there had been breakage
in the past. Chalk it up to messed up priorities, poor test infrastructure, or
anything along those lines. But we shouldn't pretend that this was some obscure
edge case that didn't warrant a dedicated test from the get-go.

2024-01-04 19:24:23

by Paul E. McKenney

[permalink] [raw]

Subject: Re: [BUG] Guest OSes die simultaneously (bisected)

On Thu, Jan 04, 2024 at 05:32:34PM +0100, Paolo Bonzini wrote:
> On 1/4/24 17:06, Paul E. McKenney wrote:
> > Although I am happy to have been able to locate the commit (and even
> > happier that Sean spotted the problem and that you quickly pushed the
> > fix to mainline!), chasing this consumed a lot of time and systems over
> > an embarrassingly large number of months. As in I first spotted this
> > bug in late July. Despite a number of increasingly complex attempts,
> > bisection became feasible only after the buggy commit was backported to
> > our internal v5.19 code base. ????
>
> Yes, this strikes two sore points.
>
> One is that I have also experienced being able to bisect only with a
> somewhat more linear history (namely the CentOS Stream 9 aka c9s
> frankenkernel [1]) and not with upstream. Even if the c9s kernel is not a
> fully linear set of commits, there's some benefit from merge commits that
> consist of slightly more curated set of patches, where each merge commit
> includes both new features and bugfixes. Unfortunately, whether you'll be
> able to do this with the c9s kernel depends a lot on the subsystems involved
> and on the bug. Both are factors that may or may not be known in advance.

I guess I am glad that it is not just me. ;-)

> The other, of course, is testing. The KVM selftests infrastructure is meant
> for this kind of white box problem, but the space of tests that can be
> written is so large, that there's always too few tests. It shines when you
> have a clear bisection but an unclear fix (in the past I have had cases
> where spending two days to write a test led me to writing a fix in thirty
> minutes), but boosting the reproducibility is always a good thing.

Agreed, validation never will be perfect, and so improving the test
suite based on production experience is a good thing, as is creating
test cases based on the behavior of important production workloads for
those who run them.

> > And please understand that I am not casting shade on those who wrote,
> > reviewed, and committed that buggy commit. As in I freely confess that
> > I had to stare at Sean's fix for a few minutes before I figured out what
> > was going on.
>
> Oh don't worry about that---rather, I am going to cast a shade on those that
> did not review the commit, namely me. I am somewhat obsessed with Boolean
> logic and *probably* I would have caught it, or would have asked to split
> the use of designated initializers to a separate patch. Any of the two
> could, at least potentially, have saved you quite some time.

We have all done similar things. I certainly have!

> > Instead, the point I am trying to make is that carefully
> > constructed tests can serve as tireless and accurate code reviewers.
> > This won't ever replace actual code review, but my experience indicates
> > that it will help find more bugs more quickly and more easily.
>
> TBH this (conflict between virtual addresses on the host and the guest
> leading to corruption of the guest) is probably not the kind of adversarial
> test that one would have written or suggested right off the bat. But it
> should be written now indeed.

Very good, looking forward to seeing it!

Thanx, Paul

> Paolo
>
> [1]
> https://www.theregister.com/2023/06/30/enterprise_distro_feature_devconf/
>