2012-06-21 16:06:16

by Jan Beulich

Subject: Re: [vMCE design RFC] Xen vMCE design

>>> On 20.06.12 at 18:13, "Liu, Jinsong" <[email protected]> wrote:
> Recently we design xen vMCE as attached.
> Please kindly help me to review it, any comments/suggestions are
> appreciated.

The concept looks quite okay, provided no OS has a problem with
the limitations imposed (most notably the restriction to a single
reporting bank, particularly in the context of e.g. Linux partly
ignoring the first bank under some conditions iirc).

As to not needing any migration specific adjustments - what if
a migration is in progress when an event needs to be delivered?

Jan


2012-06-22 10:40:10

by Liu, Jinsong

Subject: RE: [vMCE design RFC] Xen vMCE design

Jan Beulich wrote:
>>>> On 20.06.12 at 18:13, "Liu, Jinsong" <[email protected]> wrote:
>> Recently we design xen vMCE as attached.
>> Please kindly help me to review it, any comments/suggestions are
>> appreciated.
>
> The concept looks quite okay, provided no OS has a problem with
> the limitations imposed (most notably the restriction to a single
> reporting bank, particularly in the context of e.g. Linux partly
> ignoring the first bank under some conditions iirc).

The 'bank0 skipping' quirk applies only to older CPU models, so I think we have two options:
1) Still use 1 bank and simply ignore the issue. Even if the guest runs on a platform with the bank0 quirk, when the hypervisor injects a vMCE# the guest skips bank0, its MCE logic concludes it detected a spurious MCE, and it kills itself. Since the bank0 quirk only affects old CPUs, this is acceptable.
2) Use 32 banks.

In fact, a third option is to use 1 bank but have the hypervisor kill the guest when it detects the bank0 quirk. That would have the same effect as option 1, so I prefer letting the guest kill itself.

>
> As to not needing any migration specific adjustments - what if
> a migration is in progress when an event needs to be delivered?
>
> Jan

If a migration is in progress while an event is delivered, we abort the migration.

Thanks,
Jinsong

2012-06-22 11:39:33

by Jan Beulich

Subject: RE: [vMCE design RFC] Xen vMCE design

>>> On 22.06.12 at 12:40, "Liu, Jinsong" <[email protected]> wrote:
> Jan Beulich wrote:
>>>>> On 20.06.12 at 18:13, "Liu, Jinsong" <[email protected]> wrote:
>>> Recently we design xen vMCE as attached.
>>> Please kindly help me to review it, any comments/suggestions are
>>> appreciated.
>>
>> The concept looks quite okay, provided no OS has a problem with
>> the limitations imposed (most notably the restriction to a single
>> reporting bank, particularly in the context of e.g. Linux partly
>> ignoring the first bank under some conditions iirc).
>
> The 'bank0 skipping' quirk applies only to older CPU models, so I think we
> have two options:
> 1) Still use 1 bank and simply ignore the issue. Even if the guest runs on
> a platform with the bank0 quirk, when the hypervisor injects a vMCE# the
> guest skips bank0, its MCE logic concludes it detected a spurious MCE, and
> it kills itself. Since the bank0 quirk only affects old CPUs, this is
> acceptable.
> 2) Use 32 banks.
>
> In fact, a third option is to use 1 bank but have the hypervisor kill the
> guest when it detects the bank0 quirk. That would have the same effect as
> option 1, so I prefer letting the guest kill itself.

Out of these, I'd actually favor using 32 banks. Using 2 banks
instead of 1 might be another option.

>> As to not needing any migration specific adjustments - what if
>> a migration is in progress when an event needs to be delivered?
>
> If a migration is in progress while an event is delivered, we abort the
> migration.

Is there a way the hypervisor can tell the tools to abort a
migration? Or are you meaning to say such functionality would
need to be added?

One other concern that occurred to me after long having sent
the original response: Your proposal aims at a fixed,
unmodifiable vMCE interface. How is that going to be forward
compatible? I.e. consider you had made that proposal before
the SRAO/SRAR changes went in - would the same interface (with
the same set of capability bits set/clear) still be suitable?

I think that we minimally need to retain the MCG_CAP register
as being of potentially variable content (and hence needing
saving/restoring on migration). To support this in a forward
compatible manner, we may have to have a way to tell the
hypervisor e.g. via command line option which extra MSRs
have to be treated read-as-zero/writes-ignored upon guest
accesses.

Jan

2012-06-22 13:46:34

by Liu, Jinsong

Subject: RE: [vMCE design RFC] Xen vMCE design

Jan Beulich wrote:
>>>> On 22.06.12 at 12:40, "Liu, Jinsong" <[email protected]> wrote:
>> Jan Beulich wrote:
>>>>>> On 20.06.12 at 18:13, "Liu, Jinsong" <[email protected]>
>>>>>> wrote:
>>>> Recently we design xen vMCE as attached.
>>>> Please kindly help me to review it, any comments/suggestions are
>>>> appreciated.
>>>
>>> The concept looks quite okay, provided no OS has a problem with
>>> the limitations imposed (most notably the restriction to a single
>>> reporting bank, particularly in the context of e.g. Linux partly
>>> ignoring the first bank under some conditions iirc).
>>
>> The 'bank0 skipping' quirk applies only to older CPU models, so I think
>> we have two options:
>> 1) Still use 1 bank and simply ignore the issue. Even if the guest runs
>> on a platform with the bank0 quirk, when the hypervisor injects a vMCE#
>> the guest skips bank0, its MCE logic concludes it detected a spurious
>> MCE, and it kills itself. Since the bank0 quirk only affects old CPUs,
>> this is acceptable.
>> 2) Use 32 banks.
>>
>> In fact, a third option is to use 1 bank but have the hypervisor kill
>> the guest when it detects the bank0 quirk. That would have the same
>> effect as option 1, so I prefer letting the guest kill itself.
>
> Out of these, I'd actually favor using 32 banks. Using 2 banks
> instead of 1 might be another option.

Yes, 2 and 32 banks are both OK; let's use 32.
(Or, if 2 turns out to be better for some other reason, let me confirm inside Intel first.)

>
>>> As to not needing any migration specific adjustments - what if
>>> a migration is in progress when an event needs to be delivered?
>>
>> If a migration is in progress while an event is delivered, we abort the
>> migration.
>
> Is there a way the hypervisor can tell the tools to abort a
> migration? Or are you meaning to say such functionality would
> need to be added?

I didn't check the migration code, but I think it could be done by:
1) setting a flag indicating the case 'migration is in progress while an event is delivered';
2) at the last handshake stage of migration (i.e., from A to B), checking the flag and, if set, aborting the migration.
The guest will then continue running at A, and quit at B after a timeout.
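As a rough illustration, the two steps above could look like the sketch below: a per-domain flag is set when a vMCE is injected mid-migration, and the final handshake checks it. All structure and function names here are illustrative, not actual Xen code.

```c
#include <stdbool.h>

/* Illustrative per-domain state; not real Xen internals. */
struct domain_vmce {
    bool migrating;        /* set by the toolstack when migration starts */
    bool vmce_in_flight;   /* set when a vMCE is injected mid-migration */
};

/* Step 1: called on vMCE injection. */
static void vmce_note_injection(struct domain_vmce *d)
{
    if (d->migrating)
        d->vmce_in_flight = true;
}

/* Step 2: called by the tools at the final handshake;
 * nonzero means abort the migration and keep the guest on host A. */
static int vmce_migration_must_abort(const struct domain_vmce *d)
{
    return d->vmce_in_flight;
}
```

The destination side (B) would then discard its partial state after a timeout, as described above.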

>
> One other concern that occurred to me after long having sent
> the original response: Your proposal aims at a fixed,
> unmodifiable vMCE interface. How is that going to be forward
> compatible? I.e. consider you had made that proposal before
> the SRAO/SRAR changes went in - would the same interface (with
> the same set of capability bits set/clear) still be suitable?

Yes, since it's a pure s/w emulated interface. In the case where SRAO or SRAR is not supported by the h/w platform, it's still OK, since the hypervisor then simply never needs to deliver SRAO or SRAR to the guest. The emulated vMCE interface just tells the guest that it runs on a virtual platform with those well-defined capabilities.

>
> I think that we minimally need to retain the MCG_CAP register
> as being of potentially variable content (and hence needing
> saving/restoring on migration). To support this in a forward
> compatible manner, we may have to have a way to tell the
> hypervisor e.g. via command line option which extra MSRs
> have to be treated read-as-zero/writes-ignored upon guest
> accesses.
>
> Jan

Seems unnecessary, reason as above.

Thanks,
Jinsong

2012-06-22 15:06:21

by Jan Beulich

Subject: RE: [vMCE design RFC] Xen vMCE design

>>> On 22.06.12 at 15:46, "Liu, Jinsong" <[email protected]> wrote:
> Jan Beulich wrote:
>> One other concern that occurred to me after long having sent
>> the original response: Your proposal aims at a fixed,
>> unmodifiable vMCE interface. How is that going to be forward
>> compatible? I.e. consider you had made that proposal before
>> the SRAO/SRAR changes went in - would the same interface (with
>> the same set of capability bits set/clear) still be suitable?
>
> Yes, since it's a pure s/w emulated interface. In the case where SRAO or
> SRAR is not supported by the h/w platform, it's still OK, since the
> hypervisor then simply never needs to deliver SRAO or SRAR to the guest.
> The emulated vMCE interface just tells the guest that it runs on a
> virtual platform with those well-defined capabilities.

I probably didn't express well enough what I want you to check:
Consider you had done the current re-design work without
SRAO/SRAR in mind (e.g. a couple of years ago). Would the
end result have been the same? Namely, would the bits you
nominate to be set/clear in MCG_CAP be the same?

>> I think that we minimally need to retain the MCG_CAP register
>> as being of potentially variable content (and hence needing
>> saving/restoring on migration). To support this in a forward
>> compatible manner, we may have to have a way to tell the
>> hypervisor e.g. via command line option which extra MSRs
>> have to be treated read-as-zero/writes-ignored upon guest
>> accesses.
>
> Seems unnecessary, reason as above.

So going forward you see no possibility of additions to the
interface that might warrant allowing more bits to be set in
MCG_CAP than you define to be set here? That really looks
unrealistic to me.

Jan

2012-06-22 15:21:57

by Tony Luck

Subject: RE: [vMCE design RFC] Xen vMCE design

>>> I think that we minimally need to retain the MCG_CAP register
>>> as being of potentially variable content (and hence needing
>>> saving/restoring on migration). To support this in a forward
>>> compatible manner, we may have to have a way to tell the
>>> hypervisor e.g. via command line option which extra MSRs
>>> have to be treated read-as-zero/writes-ignored upon guest
>>> accesses.
>>
>> Seems unnecessary, reason as above.
>
> So going forward you see no possibility of additions to the
> interface that might warrant allowing more bits to be set in
> MCG_CAP than you define to be set here? That really looks
> unrealistic to me.

More bits can be added to MCG_CAP - but this becomes a hard
problem for migration because most OS guests only look at
MCG_CAP at boot time (linux certainly does) ... so if you migrate
and change MCG_CAP to match the new host, the guest will have
no idea that things changed.

The MCG_SER_P bit is easy to deal with:
1) You already know about it
2) You can safely set it when the guest is running on a host that
does not support software error recovery ... the guest will just
think that SRAO and SRAR events are possible, but they will never
actually occur. Setting it makes sure that the guest will be
ready should you migrate to a system that really does support it.
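The "set MCG_SER_P everywhere" approach can be sketched as follows when composing the virtual MCG_CAP value. The bit positions are architectural (bank count in bits 7:0, MCG_SER_P at bit 24); the helper name is illustrative, not real Xen code.

```c
#include <stdint.h>

/* Architectural MCG_CAP fields (Intel SDM). */
#define MCG_CAP_COUNT_MASK  0xffull      /* bits 7:0 - number of banks */
#define MCG_SER_P           (1ull << 24) /* software error recovery supported */

/* Compose the virtual MCG_CAP: advertise a fixed bank count and always
 * set MCG_SER_P, so a guest booted on a non-SER host is still prepared
 * for SRAO/SRAR events after migrating to a host that generates them. */
static uint64_t vmce_mcg_cap(unsigned int nr_banks)
{
    return ((uint64_t)nr_banks & MCG_CAP_COUNT_MASK) | MCG_SER_P;
}
```

Setting the bit on non-SER hosts is safe precisely because, as noted above, the events it advertises simply never occur there.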

Future bits would have to be dealt with on a case by case basis.
The same general model would apply ... set the bit everywhere ...
but you might need some Xen changes to virtualize any new resources
and capabilities that were described by the new bit. Side question:
when a guest is migrated from one machine to another, does the version
of Xen running on source and target have to be the same? This tactic
will not work if you migrate a guest running on Xen version 99 that
does have the code to virtualize some new capability to a machine
running Xen version 98 that doesn't.

-Tony

2012-06-22 15:29:50

by Tony Luck

Subject: RE: [vMCE design RFC] Xen vMCE design

> 1) Still use 1 bank and simply ignore the issue. Even if the guest runs
> on a platform with the bank0 quirk, when the hypervisor injects a vMCE#
> the guest skips bank0, its MCE logic concludes it detected a spurious
> MCE, and it kills itself. Since the bank0 quirk only affects old CPUs,
> this is acceptable.
> 2) Use 32 banks.
>
> In fact, a third option is to use 1 bank but have the hypervisor kill
> the guest when it detects the bank0 quirk. That would have the same
> effect as option 1, so I prefer letting the guest kill itself.

Don't you control what CPUID is shown to the guest? Under what circumstances
would you tell the guest that it is running on an AMD-K7 or an Intel family
6 with model < 0x1A? Surely for migration reasons you need to present the
same virtualized family/model all the time ... so just don't use ones that
cause problems.

If that isn't an option - then say there are 2 banks and have Xen ignore bank
0 (make MC0_STATUS always appear to contain 0) and put all the errors into
bank1. If you tell the guest there are 32 banks it will read all of them.
Which means a lot of pointless exits to the hypervisor.
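The cost Tony describes comes from the guest's #MC handler scanning every advertised bank, roughly like the Linux-style loop sketched below: each MCi_STATUS read is an RDMSR, and under virtualization each one can trap to the hypervisor. The MSR numbers and valid bit are architectural; the function and the fake MSR backing store are purely illustrative.

```c
#include <stdint.h>

#define MSR_IA32_MC0_STATUS 0x401u       /* bank i status = 0x401 + 4*i */
#define MCI_STATUS_VAL      (1ull << 63) /* error in this bank is valid */

typedef uint64_t (*rdmsr_fn)(uint32_t msr);

/* Returns the number of banks holding a valid error.  With 32 banks
 * advertised this issues 32 RDMSRs per #MC (each a potential VM exit);
 * with 2 banks, only 2. */
static unsigned int mce_scan_banks(rdmsr_fn rdmsr, unsigned int nr_banks)
{
    unsigned int i, found = 0;
    for (i = 0; i < nr_banks; i++)
        if (rdmsr(MSR_IA32_MC0_STATUS + 4u * i) & MCI_STATUS_VAL)
            found++;
    return found;
}

/* Toy backing store standing in for hardware MSRs (test scaffolding). */
static uint64_t fake_msr[0x500];
static uint64_t fake_rdmsr(uint32_t msr) { return fake_msr[msr]; }
```

This is why advertising only the banks actually used keeps the exit count down.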

-Tony

2012-06-22 15:48:05

by Jan Beulich

Subject: RE: [vMCE design RFC] Xen vMCE design

>>> On 22.06.12 at 17:21, "Luck, Tony" <[email protected]> wrote:
>>>> I think that we minimally need to retain the MCG_CAP register
>>>> as being of potentially variable content (and hence needing
>>>> saving/restoring on migration). To support this in a forward
>>>> compatible manner, we may have to have a way to tell the
>>>> hypervisor e.g. via command line option which extra MSRs
>>>> have to be treated read-as-zero/writes-ignored upon guest
>>>> accesses.
>>>
>>> Seems unnecessary, reason as above.
>>
>> So going forward you see no possibility of additions to the
>> interface that might warrant allowing more bits to be set in
>> MCG_CAP than you define to be set here? That really looks
>> unrealistic to me.
>
> More bits can be added to MCG_CAP - but this becomes a hard
> problem for migration because most OS guests only look at
> MCG_CAP at boot time (linux certainly does) ... so if you migrate
> and change MCG_CAP to match the new host, the guest will have
> no idea that things changed.

And it doesn't have to - respective events will just not occur
anymore. All we need to make sure is that if it accesses an MSR
that's no longer backed by anything in hardware, we don't
inject a #GP (but instead return e.g. all zeros to reads, and
drop all writes).
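The read-as-zero/writes-ignored fallback described above could be sketched as follows for the MCA bank MSR range. The MSR numbering follows the architectural layout (4 MSRs per bank starting at 0x400); the function names, constants, and the two-bank choice are illustrative assumptions, not actual Xen code.

```c
#include <stdint.h>
#include <stdbool.h>

#define MSR_IA32_MC0_CTL   0x400u  /* first MCA bank MSR */
#define VMCE_NUM_BANKS     2u      /* banks actually emulated (assumed) */
#define ARCH_MAX_BANKS     32u     /* architectural maximum */

/* Reads of bank MSRs beyond the emulated set return zero, so a guest
 * probing a bank that "disappeared" never sees #GP.  Returns true if
 * the access was handled here. */
static bool vmce_rdmsr(uint32_t msr, uint64_t *val)
{
    if (msr >= MSR_IA32_MC0_CTL &&
        msr < MSR_IA32_MC0_CTL + 4u * ARCH_MAX_BANKS) {
        uint32_t bank = (msr - MSR_IA32_MC0_CTL) / 4u;
        if (bank >= VMCE_NUM_BANKS) {
            *val = 0;              /* read-as-zero */
            return true;
        }
    }
    return false;                  /* emulated banks handled elsewhere */
}

/* Writes to bank MSRs beyond the emulated set are silently dropped. */
static bool vmce_wrmsr(uint32_t msr, uint64_t val)
{
    (void)val;
    if (msr >= MSR_IA32_MC0_CTL &&
        msr < MSR_IA32_MC0_CTL + 4u * ARCH_MAX_BANKS) {
        uint32_t bank = (msr - MSR_IA32_MC0_CTL) / 4u;
        if (bank >= VMCE_NUM_BANKS)
            return true;           /* write ignored */
    }
    return false;
}
```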

> MCG_SER_P bit is easy to deal with:
> 1) You already know about it
> 2) You can safely set it when the guest is running on a host that
> does not support software error recover ... the guest will just
> think that SRAO and SRAR events are possible, but they will never
> actually occur. Setting it does make sure that the guest will be
> ready should you migrate to a system that really does support it.

And so is likely going to be the case for at least some of the
additions to come.

> Future bits would have to be dealt with on a case by case basis.
> The same general model would apply ... set the bit everywhere ...
> but you might need some Xen changes to virtualize any new resources
> and capabilities that were described by the new bit. Side question:
> when a guest is migrated from one machine to another, does the version
> of Xen running on source and target have to be the same? This tactic
> will not work if you migrate a guest running on Xen version 99 that
> does have the code to virtualize some new capability to a machine
> running Xen version 98 that doesn't.

No, that's not a requirement, and hence we need to create a
model that will at least always allow upward (destination host
running higher version Xen than source) migration, and that
preserves as much functionality as possible when moved
between hosts with different MCA capabilities.

Jan

2012-06-22 15:54:19

by Jan Beulich

Subject: RE: [vMCE design RFC] Xen vMCE design

>>> On 22.06.12 at 17:29, "Luck, Tony" <[email protected]> wrote:
>> 1) Still use 1 bank and simply ignore the issue. Even if the guest runs
>> on a platform with the bank0 quirk, when the hypervisor injects a vMCE#
>> the guest skips bank0, its MCE logic concludes it detected a spurious
>> MCE, and it kills itself. Since the bank0 quirk only affects old CPUs,
>> this is acceptable.
>> 2) Use 32 banks.
>>
>> In fact, a third option is to use 1 bank but have the hypervisor kill
>> the guest when it detects the bank0 quirk. That would have the same
>> effect as option 1, so I prefer letting the guest kill itself.
>
> Don't you control what CPUID is shown to the guest? Under what circumstances
> would you tell the guest that it is running on an AMD-K7 or an Intel family
> 6 with model < 0x1A? Surely for migration reasons you need to present the
> same virtualized family/model all the time ... so just don't use ones that
> cause problems.

I don't think family/model/stepping are frequently faked, it's
normally just the various feature flags that get normalized to
the smallest common set.

> If that isn't an option - then say there are 2 banks and have Xen ignore bank
> 0 (make MC0_STATUS always appear to contain 0) and put all the errors into
> bank1. If you tell the guest there are 32 banks it will read all of them.
> Which means a lot of pointless exits to the hypervisor.

Indeed, emulating too many banks can have its own downsides.
Yet I don't think we're really concerned about performance when
handling machine checks. But having more than one usable bank
must have advantages, else hardware wouldn't implement things
that way.

Jan

2012-06-22 16:21:21

by Tony Luck

Subject: RE: [vMCE design RFC] Xen vMCE design

> Yet I don't think we're really concerned about performance when handling machine
> checks. But having more than one usable bank must have advantages, else hardware
> wouldn't implement things that way.

Primary reason for multiple banks is h/w design ... the silicon implementing the bank
is generally included in the component that generates the error. E.g. there may be
multiple memory controllers on a die, each with its own bank. H/W designers hate
running long "wires" across the die as it messes up their layout options.

There may be some secondary side benefit that independent errors might be
reported to different banks, and so avoid some overwrite problems. But I don't
think that Xen has a big worry with overwrite, does it? In general the errors that
you will show to the guest are ones that you expect the guest to handle immediately
(e.g. SRAO and SRAR signaled with a machine check). You do not log any corrected
errors to the guest (they can't do anything useful with them). You certainly don't
log any errors that are not signaled. So you should never have any errors hanging
around in banks for long periods that would get overwritten.

-Tony

2012-06-25 07:14:19

by Jan Beulich

Subject: RE: [vMCE design RFC] Xen vMCE design

>>> On 22.06.12 at 18:21, "Luck, Tony" <[email protected]> wrote:
> There may be some secondary side benefit that independent errors might be
> reported to different banks, and so avoid some overwrite problems. But I
> don't think that Xen has a big worry with overwrite, does it? In general
> the errors that you will show to the guest are ones that you expect the
> guest to handle immediately (e.g. SRAO and SRAR signaled with a machine
> check). You do not log any corrected errors to the guest (they can't do
> anything useful with them). You certainly don't log any errors that are
> not signaled. So you should never have any errors hanging around in banks
> for long periods that would get overwritten.

The problem is determining what "long" means here. The target vCPU
may not get scheduled for extended periods of time (raising its
priority would have other undesirable implications), so the risk of
overwrite exists from that perspective. However, assuming that
reportable errors get associated with a particular vCPU, and that such
a vCPU won't be able to execute any further guest code prior to the
delivery of the exception, the only real risk here would be if the
vMCE handler itself raised another event. That, I agree with Jinsong,
can well be treated as fatal (killing the guest, provided the event
gets properly logged so the admin isn't left in the dark regarding the
unexpected death of the VM), mostly matching what a single-bank
hardware implementation would result in.

Jan

2012-06-25 08:30:22

by Liu, Jinsong

Subject: RE: [vMCE design RFC] Xen vMCE design

Jan Beulich wrote:
>>>> On 22.06.12 at 15:46, "Liu, Jinsong" <[email protected]> wrote:
>> Jan Beulich wrote:
>>> One other concern that occurred to me after long having sent
>>> the original response: Your proposal aims at a fixed,
>>> unmodifiable vMCE interface. How is that going to be forward
>>> compatible? I.e. consider you had made that proposal before
>>> the SRAO/SRAR changes went in - would the same interface (with
>>> the same set of capability bits set/clear) still be suitable?
>>
>> Yes, since it's a pure s/w emulated interface. In the case where SRAO
>> or SRAR is not supported by the h/w platform, it's still OK, since the
>> hypervisor then simply never needs to deliver SRAO or SRAR to the
>> guest. The emulated vMCE interface just tells the guest that it runs
>> on a virtual platform with those well-defined capabilities.
>
> I probably didn't express well enough what I want you to check:
> Consider you had done the current re-design work without
> SRAO/SRAR in mind (e.g. a couple of years ago). Would the
> end result have been the same? Namely, would the bits you
> nominate to be set/clear in MCG_CAP be the same?
>
>>> I think that we minimally need to retain the MCG_CAP register
>>> as being of potentially variable content (and hence needing
>>> saving/restoring on migration). To support this in a forward
>>> compatible manner, we may have to have a way to tell the
>>> hypervisor e.g. via command line option which extra MSRs
>>> have to be treated read-as-zero/writes-ignored upon guest
>>> accesses.
>>
>> Seems unnecessary, reason as above.
>
> So going forward you see no possibility of additions to the
> interface that might warrant allowing more bits to be set in
> MCG_CAP than you define to be set here? That really looks
> unrealistic to me.
>
> Jan

Sorry for misunderstanding your meaning in my last email; please ignore it.
I agree that MCG_CAP should be saved/restored on migration, considering that some CAP bits may be added in the future.

Thanks,
Jinsong

2012-06-26 11:05:28

by George Dunlap

Subject: Re: [Xen-devel] [vMCE design RFC] Xen vMCE design

On Fri, Jun 22, 2012 at 4:21 PM, Luck, Tony <[email protected]> wrote:
>>>> I think that we minimally need to retain the MCG_CAP register
>>>> as being of potentially variable content (and hence needing
>>>> saving/restoring on migration). To support this in a forward
>>>> compatible manner, we may have to have a way to tell the
>>>> hypervisor e.g. via command line option which extra MSRs
>>>> have to be treated read-as-zero/writes-ignored upon guest
>>>> accesses.
>>>
>>> Seems unnecessary, reason as above.
>>
>> So going forward you see no possibility of additions to the
>> interface that might warrant allowing more bits to be set in
>> MCG_CAP than you define to be set here? That really looks
>> unrealistic to me.
>
> More bits can be added to MCG_CAP - but this becomes a hard
> problem for migration because most OS guests only look at
> MCG_CAP at boot time (linux certainly does) ... so if you migrate
> and change MCG_CAP to match the new host, the guest will have
> no idea that things changed.

Typically if you have a heterogeneous pool (i.e., different processors
with different CPUID bits) you have to calculate the minimum subset of
features available on all processors and expose only those to the
guest. It wouldn't be too hard to extend that to vMCE if we had to.
Alternately, as Jan says, we could just fake things out when we need
to.
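The "minimum subset" computation George describes amounts to intersecting the capability bitmasks of every host in the pool, as in the sketch below. This applies to pure feature bits; numeric fields (such as the MCG_CAP bank count in bits 7:0) would instead need a min() across hosts. The function name is illustrative, not real toolstack code.

```c
#include <stdint.h>
#include <stddef.h>

/* Intersect per-host feature bitmasks: a bit is exposed to guests
 * only if every host in the pool sets it. */
static uint64_t pool_min_caps(const uint64_t *host_caps, size_t n)
{
    uint64_t caps = ~0ull;
    for (size_t i = 0; i < n; i++)
        caps &= host_caps[i];
    return n ? caps : 0;  /* empty pool exposes nothing */
}
```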

-George