2012-06-27 03:52:05

by Liu, Jinsong

Subject: [xen vMCE RFC V0.2] xen vMCE design

Hi,

These are the updated Xen vMCE design foils, revised according to recent comments from the community.

These foils focus on the vMCE part of Xen MCA, so, as Keir said, they are somewhat dense.
Later, Will will present a document that elaborates further, covering Intel MCA, the surrounding features, and the Xen implementation.

Thanks,
Jinsong


Attachments:
xen vMCE design (v0 2).pdf (250.61 kB)

2012-06-27 13:14:12

by Jan Beulich

Subject: Re: [xen vMCE RFC V0.2] xen vMCE design

>>> On 27.06.12 at 05:51, "Liu, Jinsong" <[email protected]> wrote:
> This is updated xen vMCE design foils, according to comments from community
> recently.
>
> This foils focus on vMCE part of Xen MCA, so as Keir said, it's some dense.
> Later Will will present a document to elaborate more, including Intel MCA
> and surrounding features and Xen implementation.

For MCi_CTL2 you probably meant to say "allow setting CMCI_EN
and error count threshold"?
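As an illustration of what that wording implies, here is a minimal sketch of a write filter for a virtual MCi_CTL2 that lets the guest set only CMCI_EN and the error count threshold. The bit positions (bit 30 and bits 14:0) follow the Intel SDM layout of IA32_MCi_CTL2; the helper name `vmce_filter_mci_ctl2` is hypothetical and is not taken from the Xen source or the foils.

```c
#include <stdint.h>

/* IA32_MCi_CTL2 fields as laid out in the Intel SDM. */
#define MCI_CTL2_CMCI_EN        (1ULL << 30)  /* bit 30: CMCI enable */
#define MCI_CTL2_THRESHOLD_MASK 0x7fffULL     /* bits 14:0: corrected error count threshold */

/*
 * Hypothetical helper (an assumption, not Xen code): keep only the bits
 * the guest is allowed to set and silently drop everything else.
 */
static uint64_t vmce_filter_mci_ctl2(uint64_t guest_val)
{
    return guest_val & (MCI_CTL2_CMCI_EN | MCI_CTL2_THRESHOLD_MASK);
}
```

Whether disallowed bits should be dropped silently, as here, or should raise #GP is a design choice this sketch does not settle.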

The 2-bank approach also needs to be brought in line with the
current host derived value in MCG_CAP, especially if the code to
implement this new model doesn't make it into 4.2 (which would
generally save a larger value).
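To make the concern concrete, the following sketch extracts the bank count from a saved MCG_CAP value and checks it against a fixed 2-bank model. The Count field in bits 7:0 is per the Intel SDM; the constant `VMCE_NEW_MODEL_BANKS` and both function names are assumptions made for illustration, and the sketch deliberately does not decide what a restore path should do when the saved count is larger.

```c
#include <stdbool.h>
#include <stdint.h>

#define MCG_CAP_COUNT_MASK   0xffULL  /* bits 7:0: number of reporting banks */
#define VMCE_NEW_MODEL_BANKS 2u       /* assumption: bank count of the proposed model */

/* Hypothetical helper: bank count encoded in an MCG_CAP value. */
static unsigned int mcg_cap_bank_count(uint64_t mcg_cap)
{
    return (unsigned int)(mcg_cap & MCG_CAP_COUNT_MASK);
}

/*
 * Hypothetical check: true if a save record written with a host-derived
 * MCG_CAP advertises more banks than the 2-bank model would provide.
 */
static bool saved_cap_exceeds_new_model(uint64_t saved_mcg_cap)
{
    return mcg_cap_bank_count(saved_mcg_cap) > VMCE_NEW_MODEL_BANKS;
}
```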

Jan

2012-06-28 08:54:51

by Liu, Jinsong

Subject: RE: [xen vMCE RFC V0.2] xen vMCE design

Jan Beulich wrote:
>>>> On 27.06.12 at 05:51, "Liu, Jinsong" <[email protected]> wrote:
>> This is updated xen vMCE design foils, according to comments from
>> community recently.
>>
>> This foils focus on vMCE part of Xen MCA, so as Keir said, it's some
>> dense. Later Will will present a document to elaborate more,
>> including Intel MCA and surrounding features and Xen implementation.
>
> For MCi_CTL2 you probably meant to say "allow setting CMCI_EN
> and error count threshold"?

Yes, exactly! The two wordings have subtly but importantly different meanings.

>
> The 2-bank approach also needs to be brought in line with the
> current host derived value in MCG_CAP, especially if the code to
> implement this new model doesn't make it into 4.2 (which would
> generally save a larger value).
>
> Jan

Let me restate your concern in my own words to avoid misunderstanding:
Your concern stems from the earlier patch (c/s 24887, as attached), which addressed the vMCE migration issue caused by differing bank numbers. I guess the patch is not in Xen 4.1.x but will be in the upcoming Xen 4.2 release (right? And when will Xen 4.2 be released?)
As I understand it, you want us to make sure our new vMCE model is compatible with the old vMCE. For example, if our patch lands in the Xen 4.3 release, you want to make sure that migrating a guest from Xen 4.2 to 4.3 does not break. Is this your concern?

Thanks,
Jinsong


Attachments:
(No filename) (43.36 kB)

2012-06-28 09:08:20

by Jan Beulich

Subject: RE: [xen vMCE RFC V0.2] xen vMCE design

>>> On 28.06.12 at 10:54, "Liu, Jinsong" <[email protected]> wrote:
> Jan Beulich wrote:
>> The 2-bank approach also needs to be brought in line with the
>> current host derived value in MCG_CAP, especially if the code to
>> implement this new model doesn't make it into 4.2 (which would
>> generally save a larger value).
>
> Let me repeat in my word to avoid misunderstanding about your concern:
> Your concern rooted from the history patch (c/s 24887, as attached) which
> used to solve vMCE migration issue caused from bank number. I guess the patch
> was not in xen4.1.x but would be in xen 4.2 release recently (right? and when
> will xen 4.2 release?)

4.2 is in feature freeze right now, preparing for the release.

> Per my understanding, you want us to make sure our new vMCE model compatible
> w/ olde vMCE. For example if our patch in xen 4.3 release, you want to make
> sure a guest migrate from xen 4.2 to 4.3 would not broken. Is this your
> concern?

Yes. If we can't get the save/restore records adjusted in time
for 4.2, compatibility with 4.2 would be a requirement. If we
manage to get the necessary adjustments done in time, and if
they're not too intrusive (i.e. would be acceptable at this late
stage of the development cycle), then the concern could be
dropped from an upstream perspective due to the lack of
support in 4.1 (and hence no backward compatibility issues).
BUT this isn't as simple from a product usage point of view: As
the save/restore code currently in -unstable was coded up to
address a problem observed by SLE11 SP2 users, we already
backported those patches. So compatibility will be a requirement
in any case.

Jan

2012-06-28 09:40:14

by Liu, Jinsong

Subject: RE: [xen vMCE RFC V0.2] xen vMCE design

Jan Beulich wrote:
>>>> On 28.06.12 at 10:54, "Liu, Jinsong" <[email protected]> wrote:
>> Jan Beulich wrote:
>>> The 2-bank approach also needs to be brought in line with the
>>> current host derived value in MCG_CAP, especially if the code to
>>> implement this new model doesn't make it into 4.2 (which would
>>> generally save a larger value).
>>
>> Let me repeat in my word to avoid misunderstanding about your
>> concern:
>> Your concern rooted from the history patch (c/s 24887, as attached)
>> which used to solve vMCE migration issue caused from bank number. I
>> guess the patch was not in xen4.1.x but would be in xen 4.2 release
>> recently (right? and when will xen 4.2 release?)
>
> 4.2 is in feature freeze right now, preparing for the release.
>
>> Per my understanding, you want us to make sure our new vMCE model
>> compatible w/ olde vMCE. For example if our patch in xen 4.3
>> release, you want to make sure a guest migrate from xen 4.2 to 4.3
>> would not broken. Is this your concern?
>
> Yes. If we can't get the save/restore records adjusted in time
> for 4.2, compatibility with 4.2 would be a requirement. If we
> manage to get the necessary adjustments done in time, and if
> they're not too intrusive (i.e. would be acceptable at this late
> stage of the development cycle), then the concern could be
> dropped from an upstream perspective due to the lack of
> support in 4.1 (and hence no backward compatibility issues).
> BUT this isn't as simple from a product usage point of view: As
> the save/restore code currently in -unstable was coded up to
> address a problem observed by SLE11 SP2 users, we already
> backported those patches. So compatibility will be a requirement
> in any case.
>
> Jan

A basic point of the new vMCE is that it gets rid of the old vMCE and sets up a new model from scratch. From a coding point of view, backward compatibility would be dirty and troublesome.

The point is that the old vMCE interface is host-based while the new vMCE is purely software-defined, so the trouble comes from the heterogeneity of the two interfaces (if compatibility has to be kept). The basic assumption of live migration from A to B is that A and B are basically on the same page, so the guest can be migrated by setting the smallest common feature/capability set (via cpuid, command line, etc.). However, the old and new vMCE are quite different, and the difference cannot be controlled effectively that way. For example, the old vMCE has MCG_CTL but the new vMCE doesn't, and the new vMCE has CMCI support (and MCi_CTL2) but the old vMCE doesn't. I even doubt that keeping compatibility with the old vMCE is feasible. A comparable example: it is hard to migrate between an Intel CPU and an AMD CPU.

So I would like to push the new vMCE as quickly as possible. What is the deadline for vMCE development that Xen 4.2 could accept? I wonder if we could get the major components of vMCE done before the Xen 4.2 deadline, and leave the surrounding features and corner cases for later?

Thanks,
Jinsong

2012-06-28 09:54:59

by Jan Beulich

Subject: RE: [xen vMCE RFC V0.2] xen vMCE design

>>> On 28.06.12 at 11:40, "Liu, Jinsong" <[email protected]> wrote:
> So I would like to push new vMCE as quickly as possible. What's the timeline
> of vMCE developing that xen 4.2 could accept?

Weeks ago, really. See http://lists.xen.org/archives/html/xen-devel/2012-06/msg01619.html
and follow-ups - we'd really only consider getting the save/restore
interface into forward compatible shape as acceptable.

> I wonder if we could make major
> components of vMCE done before xen 4.2 timeline, and leave the surrounding
> features and the corner cases done later?

Unfortunately it's likely going to be even less. However, if split
that way, chances are things could go into e.g. 4.2.1.

Jan

2012-06-28 09:58:37

by Ian Campbell

Subject: Re: [Xen-devel] [xen vMCE RFC V0.2] xen vMCE design

On Thu, 2012-06-28 at 10:55 +0100, Jan Beulich wrote:
> >>> On 28.06.12 at 11:40, "Liu, Jinsong" <[email protected]> wrote:
> > So I would like to push new vMCE as quickly as possible. What's the timeline
> > of vMCE developing that xen 4.2 could accept?
>
> Weeks ago, really. See http://lists.xen.org/archives/html/xen-devel/2012-06/msg01619.html
> and follow-ups - we'd really only consider getting the save/restore
> interface into forward compatible shape as acceptable.

Yes, it really is far too late to be considering entirely new features at this
stage.

Ian.

>
> > I wonder if we could make major
> > components of vMCE done before xen 4.2 timeline, and leave the surrounding
> > features and the corner cases done later?
>
> Unfortunately it's likely going to be even less. However, if split
> that way, chances are things could go into e.g. 4.2.1.
>
> Jan

2012-06-28 13:38:41

by Liu, Jinsong

Subject: RE: [xen vMCE RFC V0.2] xen vMCE design

Jan Beulich wrote:
>>>> On 28.06.12 at 11:40, "Liu, Jinsong" <[email protected]> wrote:
>> So I would like to push new vMCE as quickly as possible. What's the
>> timeline of vMCE developing that xen 4.2 could accept?
>
> Weeks ago, really. See
> http://lists.xen.org/archives/html/xen-devel/2012-06/msg01619.html
> and follow-ups - we'd really only consider getting the save/restore
> interface into forward compatible shape as acceptable.
>
>> I wonder if we could make major
>> components of vMCE done before xen 4.2 timeline, and leave the
>> surrounding features and the corner cases done later?
>
> Unfortunately it's likely going to be even less. However, if split
> that way, chances are things could go into e.g. 4.2.1.
>
> Jan

So let's look at the current vMCE status first:
1). functionally it behaves abnormally for guests (but this is tolerated by some guests, like Linux/Solaris);
2). before xen 4.1 it blocks migration when migrating from a platform with more banks to one with fewer banks;

We may try some intermediate steps, with minimal adjustments for the Xen 4.2 release (to avoid further compatibility issues in Xen 4.2.1, 4.3, ...):
1). we don't handle vMCE functional bugs, and only make sure migration works OK;
2). update the vMCE interface to an intermediate, clean state:
* remove MCG_CTL (otherwise we would have to add this useless MSR to the new vMCE);
* stick all 1's to MCi_CTL (to avoid semantic differences);
* for MCG_CAP, clear MCG_CTL_P and limit to 2 banks (otherwise dirty code would have to be added to the new vMCE);
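The MCG_CAP and MCi_CTL adjustments listed above could be sketched roughly as follows. The Count field (bits 7:0) and MCG_CTL_P (bit 8) are per the Intel SDM; the function names are hypothetical, and reading "limit to 2 banks" as "the smaller of the host count and 2" is an assumption of this sketch.

```c
#include <stdint.h>

#define MCG_CAP_COUNT_MASK 0xffULL      /* bits 7:0: number of reporting banks */
#define MCG_CAP_CTL_P      (1ULL << 8)  /* bit 8: MCG_CTL register present */
#define VMCE_MAX_BANKS     2ULL         /* bank limit proposed in the list above */

/*
 * Hypothetical helper: derive the interim guest MCG_CAP from the host
 * value by clearing MCG_CTL_P and capping the bank count at 2.
 */
static uint64_t vmce_interim_mcg_cap(uint64_t host_cap)
{
    uint64_t count = host_cap & MCG_CAP_COUNT_MASK;

    if (count > VMCE_MAX_BANKS)
        count = VMCE_MAX_BANKS;

    /* Drop the old count and MCG_CTL_P, keep other bits, insert the capped count. */
    return (host_cap & ~(MCG_CAP_COUNT_MASK | MCG_CAP_CTL_P)) | count;
}

/* Hypothetical helper: guest reads of MCi_CTL always return all 1's. */
static uint64_t vmce_interim_mci_ctl_read(void)
{
    return ~0ULL;
}
```

With MCG_CTL_P clear, the guest has no architectural reason to access MCG_CTL, which is what would allow that MSR to be dropped from the interface.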

Thoughts?

Thanks,
Jinsong

2012-06-28 14:00:14

by Jan Beulich

Subject: RE: [xen vMCE RFC V0.2] xen vMCE design

>>> On 28.06.12 at 15:38, "Liu, Jinsong" <[email protected]> wrote:
> Jan Beulich wrote:
>>>>> On 28.06.12 at 11:40, "Liu, Jinsong" <[email protected]> wrote:
>>> So I would like to push new vMCE as quickly as possible. What's the
>>> timeline of vMCE developing that xen 4.2 could accept?
>>
>> Weeks ago, really. See
>> http://lists.xen.org/archives/html/xen-devel/2012-06/msg01619.html
>> and follow-ups - we'd really only consider getting the save/restore
>> interface into forward compatible shape as acceptable.
>>
>>> I wonder if we could make major
>>> components of vMCE done before xen 4.2 timeline, and leave the
>>> surrounding features and the corner cases done later?
>>
>> Unfortunately it's likely going to be even less. However, if split
>> that way, chances are things could go into e.g. 4.2.1.
>>
>> Jan
>
> So let's look at current vMCE status first:
> 1). functionally it work abnormally for guest (but tolerated by some guest
> like linux/solaris);
> 2). before xen 4.1 it blocks migration when migrate from big bank to small
> bank platform;

Before 4.2 you mean (in 4.1 we only have this as a backport in SLE11).

> We may try some middle steps, minimal adjusting for xen 4.2 release (to
> avoid futher compatible issue at xen 4.2.1, 4.3, ...):
> 1). we don't handle vMCE function bugs, only make sure migration works OK;

That's the minimal goal.

> 2). update vMCE interface to a middle clean status:
> * remove MCG_CTL (otherwise we have to add this useless MSR at new
> vMCE);
> * stick all 1's to MCi_CTL (avoid semantic difference);
> * for MCG_CAP, clear MCG_CTL_P, limit to 2 banks (otherwise dirty code
> have to be added at new vMCE);

Whether that's acceptable would need to be seen when code
is ready.

Jan

2012-06-28 17:03:01

by Liu, Jinsong

Subject: RE: [xen vMCE RFC V0.2] xen vMCE design

Jan Beulich wrote:
>>>> On 28.06.12 at 15:38, "Liu, Jinsong" <[email protected]> wrote:
>> Jan Beulich wrote:
>>>>>> On 28.06.12 at 11:40, "Liu, Jinsong" <[email protected]>
>>>>>> wrote:
>>>> So I would like to push new vMCE as quickly as possible. What's the
>>>> timeline of vMCE developing that xen 4.2 could accept?
>>>
>>> Weeks ago, really. See
>>> http://lists.xen.org/archives/html/xen-devel/2012-06/msg01619.html
>>> and follow-ups - we'd really only consider getting the save/restore
>>> interface into forward compatible shape as acceptable.
>>>
>>>> I wonder if we could make major
>>>> components of vMCE done before xen 4.2 timeline, and leave the
>>>> surrounding features and the corner cases done later?
>>>
>>> Unfortunately it's likely going to be even less. However, if split
>>> that way, chances are things could go into e.g. 4.2.1.
>>>
>>> Jan
>>
>> So let's look at current vMCE status first:
>> 1). functionally it work abnormally for guest (but tolerated by some
>> guest like linux/solaris); 2). before xen 4.1 it blocks migration
>> when migrate from big bank to small bank platform;
>
> Before 4.2 you mean (in 4.1 we only have this as a backport in SLE11).

Yes.

>
>> We may try some middle steps, minimal adjusting for xen 4.2 release
>> (to avoid futher compatible issue at xen 4.2.1, 4.3, ...):
>> 1). we don't handle vMCE function bugs, only make sure migration
>> works OK;
>
> That's the minimal goal.

You mean fixing the current vMCE functional bugs in Xen 4.2? That would involve a lot of work, so it is too late for Xen 4.2. In fact, the bugs are currently tolerated by guests, so fixing them is important but not urgent.

What we need to do urgently is to adjust the current vMCE interface a little so that
1). it does not block Xen 4.2 live migration
2). it does not cause compatibility issues for the new vMCE in the future
These 2 points are our minimal targets for Xen 4.2.

Thanks,
Jinsong

>
>> 2). update vMCE interface to a middle clean status:
>> * remove MCG_CTL (otherwise we have to add this useless MSR at
>> new vMCE);
>> * stick all 1's to MCi_CTL (avoid semantic difference);
>> * for MCG_CAP, clear MCG_CTL_P, limit to 2 banks (otherwise
>> dirty code have to be added at new vMCE);
>
> Whether that's acceptable would need to be seen when code
> is ready.
>
> Jan

2012-06-29 09:59:10

by Christoph Egger

Subject: Re: [Xen-devel] [xen vMCE RFC V0.2] xen vMCE design


Feedback from the AMD side:

slide 2:
- PV guests are supposed to install a MCE trap handler
which reads the MSR values from struct mcinfo_bank.
Hence it is unclear where the #GP should come from.
Same for HVM guests which have a PV MCE "driver"
(those are very rare in reality).

slide 3:
- unclear what "Weird per-domain MSRs" means
- unclear what "Unnatural MCE injection semantics" means

slide 4:
- typo: interace -> interface :-)
- enable UCR-related capabilities, but only on Intel machines
- Filter non-SRAO/SRAR banks:
Rename it to "Let guest see northbridge bank only to the guest"

slide 7:
- ignore/disable CMCI and CTL2 on AMD

slide 8:
- Filter non-SRAO/SRAR banks:
Rename it to "Let guest see northbridge bank only to the guest"
- Question: Should we allow the guest to inject errors? Does it make
sense?
- always disable MCi_CTL2 on AMD

slide 9:
- Model-specific issue: Also affects AMD, as some models have
an L3 cache and some do not.
E.g. it does not make sense to report L3 cache errors to guests


--
---to satisfy European Law for business letters:
Advanced Micro Devices GmbH
Einsteinring 24, 85689 Dornach b. Muenchen
Geschaeftsfuehrer: Alberto Bozzo, Andrew Bowd
Sitz: Dornach, Gemeinde Aschheim, Landkreis Muenchen
Registergericht Muenchen, HRB Nr. 43632