On 2017/11/15 0:00, James Morse wrote:
>> + * error has not been propagated
>> + */
>> + run->exit_reason = KVM_EXIT_EXCEPTION;
>> + run->ex.exception = ESR_ELx_EC_SERROR;
>> + run->ex.error_code = KVM_SEI_SEV_RECOVERABLE;
>> + return 0;
> We should not pass RAS notifications to user space. The kernel either handles
> them, or it panics(). User space shouldn't even know if the kernel supports RAS
> until it gets an MCEERR signal.
>
> You're making your firmware-first notification an EL3->EL0 signal, bypassing the OS.
>
> If we get a RAS SError and there are no CPER records or values in the ERR nodes,
> we should panic as it looks like the CPU/firmware is broken. (spurious RAS errors)
Hi james,
sorry to disturb you!
do you think whether we need to set the guest ESR by user space? if need, I need to
notify user space that there is a SError happen and need to set ESR for guest in some place of
KVM. so here I return a error code to user space. you mean we should not pass RAS notifications
to user space, so could you give some suggestion how to notify user space to set guest ESR.
Thanks a lot in advance.
>
>
Hi gengdongjiu,
On 06/12/17 10:26, gengdongjiu wrote:
> On 2017/11/15 0:00, James Morse wrote:
>>> + * error has not been propagated
>>> + */
>>> + run->exit_reason = KVM_EXIT_EXCEPTION;
>>> + run->ex.exception = ESR_ELx_EC_SERROR;
>>> + run->ex.error_code = KVM_SEI_SEV_RECOVERABLE;
>>> + return 0;
>> We should not pass RAS notifications to user space. The kernel either handles
>> them, or it panics(). User space shouldn't even know if the kernel supports RAS
>> until it gets an MCEERR signal.
>>
>> You're making your firmware-first notification an EL3->EL0 signal, bypassing the OS.
>>
>> If we get a RAS SError and there are no CPER records or values in the ERR nodes,
>> we should panic as it looks like the CPU/firmware is broken. (spurious RAS errors)
> do you think whether we need to set the guest ESR by user space? if need, I need to
> notify user space that there is a SError happen and need to set ESR for guest in some place of
> KVM.
I think you are still coming from a world where user-space gets raw RAS
notifications via KVM. This should not happen because the notification method is
private to firmware and the kernel. KVM is just in the way when a guest is running.
Notifications reaching KVM should be plumbed into the APEI-firmware-first-code
or eventually, a kernel-first mechanism if APEI doesn't 'claim' them.
The kernel RAS code may signal user-space with the symptoms of the error, and
user-space may decided to generate a new RAS notification for the guest.
This should function in exactly the same way, regardless of which notification
method is in use between the kernel and firmware. (its the only way to make this
future-proof).
Which notification user-space chooses to use entirely depends on what (if
anything) it advertised to the guest in the HEST. User-space has to be in
control of triggering any SError, not just overriding the ESR when KVM has
decided it wants to kill the guest.
> so here I return a error code to user space. you mean we should not pass RAS notifications
> to user space, so could you give some suggestion how to notify user space to set guest ESR.
KVM shouldn't give the guest an SError when it takes a RAS notification, it
should pass the notification to the kernel RAS code. It only needs to 'fall
through' to some default cause if both APEI and kernel-first deny-all-knowledge
of this notification.
The end-to-end flow is then (assuming no-VHE):
(1)An error occurs, taking the CPU to EL3.
EL3: triage the error, generate CPER, notify the OS
EL2: KVM takes the notification, exits the guest, returns to host EL1.
EL1: KVM handle_exit() calls APEI to handle the error.
This is the end of KVMs involvement in RAS - its just plumbing.
(2)APEI processes the CPER records and signals affected processes.
If KVM's user-space is affected, KVM will spot the pending signal when it goes
to re-enter the guest, and exit to user-space instead.
Qemu takes the SIGBUS_MCEERR_A{O,R}.
(3) Qemu decides it wants to hand the guest a RAS error, it populates the CPER
records (in memory only Qemu knows about), then drives the KVM API to make the
appropriate notification appear.
(1) only happens if the guest was running when the error arrived. GHES has ~4
flavours of IRQ which may be used to describe corruption in guest memory. Steps
(2) and (3) are exactly the same in this case.
Qemu may decide to trigger RAS errors all by itself, (probably for testing and
debugging), in which case (1) and (2) don't happen, but (3), is exactly the same.
This way platform-firmware/host-kernel can use kernel-first or firmware-first
with any of the notifications, independently from Qemu/guest-kernel making a
different kernel-first or firmware-first with different notifications.
Passing information out of KVM breaks this, forcing Qemu to know about the
mechanism platform-firmware is using.
We need to tackle (1) and (3) separately. For (3) we need some API that lets
Qemu _trigger_ an SError in the guest, with a specified ESR. But, we don't have
a way of migrating pending SError yet... which is where I got stuck last time I
was looking at this.
James
Hi James,
On 2017/12/7 3:04, James Morse wrote:
> Hi gengdongjiu,
>
> On 06/12/17 10:26, gengdongjiu wrote:
>> On 2017/11/15 0:00, James Morse wrote:
>>>> + * error has not been propagated
>>>> + */
>>>> + run->exit_reason = KVM_EXIT_EXCEPTION;
>>>> + run->ex.exception = ESR_ELx_EC_SERROR;
>>>> + run->ex.error_code = KVM_SEI_SEV_RECOVERABLE;
>>>> + return 0;
>>> We should not pass RAS notifications to user space. The kernel either handles
>>> them, or it panics(). User space shouldn't even know if the kernel supports RAS
>>> until it gets an MCEERR signal.
>>>
>>> You're making your firmware-first notification an EL3->EL0 signal, bypassing the OS.
>>>
>>> If we get a RAS SError and there are no CPER records or values in the ERR nodes,
>>> we should panic as it looks like the CPU/firmware is broken. (spurious RAS errors)
>
>> do you think whether we need to set the guest ESR by user space? if need, I need to
>> notify user space that there is a SError happen and need to set ESR for guest in some place of
>> KVM.
>
> I think you are still coming from a world where user-space gets raw RAS
> notifications via KVM. This should not happen because the notification method is
> private to firmware and the kernel. KVM is just in the way when a guest is running.
>
> Notifications reaching KVM should be plumbed into the APEI-firmware-first-code
> or eventually, a kernel-first mechanism if APEI doesn't 'claim' them.
>
> The kernel RAS code may signal user-space with the symptoms of the error, and
> user-space may decided to generate a new RAS notification for the guest.
>
> This should function in exactly the same way, regardless of which notification
> method is in use between the kernel and firmware. (its the only way to make this
> future-proof).
>
> Which notification user-space chooses to use entirely depends on what (if
> anything) it advertised to the guest in the HEST. User-space has to be in
> control of triggering any SError, not just overriding the ESR when KVM has
> decided it wants to kill the guest.
thanks, I will explain more.
>
>
>> so here I return a error code to user space. you mean we should not pass RAS notifications
>> to user space, so could you give some suggestion how to notify user space to set guest ESR.
>
> KVM shouldn't give the guest an SError when it takes a RAS notification, it
> should pass the notification to the kernel RAS code. It only needs to 'fall
> through' to some default cause if both APEI and kernel-first deny-all-knowledge
> of this notification.
>
>
> The end-to-end flow is then (assuming no-VHE):
> (1)An error occurs, taking the CPU to EL3.
> EL3: triage the error, generate CPER, notify the OS
> EL2: KVM takes the notification, exits the guest, returns to host EL1.
> EL1: KVM handle_exit() calls APEI to handle the error.
> This is the end of KVMs involvement in RAS - its just plumbing.
>
> (2)APEI processes the CPER records and signals affected processes.
> If KVM's user-space is affected, KVM will spot the pending signal when it goes
> to re-enter the guest, and exit to user-space instead.
> Qemu takes the SIGBUS_MCEERR_A{O,R}.
>
> (3) Qemu decides it wants to hand the guest a RAS error, it populates the CPER
> records (in memory only Qemu knows about), then drives the KVM API to make the
> appropriate notification appear.
>
>
> (1) only happens if the guest was running when the error arrived. GHES has ~4
> flavours of IRQ which may be used to describe corruption in guest memory. Steps
> (2) and (3) are exactly the same in this case.
>
> Qemu may decide to trigger RAS errors all by itself, (probably for testing and
> debugging), in which case (1) and (2) don't happen, but (3), is exactly the same.
>
>
> This way platform-firmware/host-kernel can use kernel-first or firmware-first
> with any of the notifications, independently from Qemu/guest-kernel making a
> different kernel-first or firmware-first with different notifications.
>
> Passing information out of KVM breaks this, forcing Qemu to know about the
> mechanism platform-firmware is using.
>
>
> We need to tackle (1) and (3) separately. For (3) we need some API that lets
> Qemu _trigger_ an SError in the guest, with a specified ESR. But, we don't have
> a way of migrating pending SError yet... which is where I got stuck last time I
> was looking at this.
I understand you most idea.
But In the Qemu one signal type can only correspond to one behavior, can not correspond to two behaviors,
otherwise Qemu will do not know how to do.
For the Qemu, if it receives the SIGBUS_MCEERR_AR signal, it will populate the CPER
records and inject a SEA to guest through KVM IOCTL "KVM_SET_ONE_REG"; if receives the SIGBUS_MCEERR_AO
signal, it will record the CPER and trigger a IRQ to notify guest, as shown below:
SIGBUS_MCEERR_AR trigger Synchronous External Abort.
SIGBUS_MCEERR_AO trigger GPIO IRQ.
For the SIGBUS_MCEERR_AO and SIGBUS_MCEERR_AR, we have already specify trigger method, which all
not involve _trigger_ an SError.
so there is no chance for Qemu to trigger the SError when gets the SIGBUS_MCEERR_A{O,R}.
>
>
>
> James
>
> .
>
Hi James,
On 2017/12/7 14:37, gengdongjiu wrote:
>> We need to tackle (1) and (3) separately. For (3) we need some API that lets
>> Qemu _trigger_ an SError in the guest, with a specified ESR. But, we don't have
>> a way of migrating pending SError yet... which is where I got stuck last time I
>> was looking at this.
> I understand you most idea.
>
> But In the Qemu one signal type can only correspond to one behavior, can not correspond to two behaviors,
> otherwise Qemu will do not know how to do.
>
> For the Qemu, if it receives the SIGBUS_MCEERR_AR signal, it will populate the CPER
> records and inject a SEA to guest through KVM IOCTL "KVM_SET_ONE_REG"; if receives the SIGBUS_MCEERR_AO
> signal, it will record the CPER and trigger a IRQ to notify guest, as shown below:
>
> SIGBUS_MCEERR_AR trigger Synchronous External Abort.
> SIGBUS_MCEERR_AO trigger GPIO IRQ.
>
> For the SIGBUS_MCEERR_AO and SIGBUS_MCEERR_AR, we have already specify trigger method, which all
>
> not involve _trigger_ an SError.
>
> so there is no chance for Qemu to trigger the SError when gets the SIGBUS_MCEERR_A{O,R}.
As I explained above:
If Qemu received SIGBUS_MCEERR_AR, it will record CPER and trigger Synchronous External Abort;
If Qemu received SIGBUS_MCEERR_AO, it will record CPER and trigger GPIO IRQ;
So Qemu does not know when to _trigger_ an SError.
so here I "return a error" to Qemu if ghes_notify_sei() return failure in [1], if you opposed KVM "return error",
do you have a better idea about it? thanks
About the way of migrating pending SError, I think it is a separate case, because Qemu still does not know
how and when to trigger the SError.
[1]:
static int kvm_handle_guest_sei(struct kvm_vcpu *vcpu, struct kvm_run *run)
{
.......................
+ case ESR_ELx_AET_UER: /* The error has not been propagated */
+ /*
+ * Userspace only handle the guest SError Interrupt(SEI) if the
+ * error has not been propagated
+ */
+ run->exit_reason = KVM_EXIT_EXCEPTION;
+ run->ex.exception = ESR_ELx_EC_SERROR;
+ run->ex.error_code = KVM_SEI_SEV_RECOVERABLE;
+ return 0;
.......................
}
>
Hi gengdongjiu,
On 07/12/17 06:37, gengdongjiu wrote:
> I understand you most idea.
>
> But In the Qemu one signal type can only correspond to one behavior, can not correspond to two behaviors,
> otherwise Qemu will do not know how to do.
>
> For the Qemu, if it receives the SIGBUS_MCEERR_AR signal, it will populate the CPER
> records and inject a SEA to guest through KVM IOCTL "KVM_SET_ONE_REG"; if receives the SIGBUS_MCEERR_AO
> signal, it will record the CPER and trigger a IRQ to notify guest, as shown below:
>
> SIGBUS_MCEERR_AR trigger Synchronous External Abort.
> SIGBUS_MCEERR_AO trigger GPIO IRQ.
>
> For the SIGBUS_MCEERR_AO and SIGBUS_MCEERR_AR, we have already specify trigger method, which all
>
> not involve _trigger_ an SError.
It's a policy choice. How does your virtual CPU notify RAS errors to its virtual
software? You could use SError for SIGBUS_MCEERR_AR, it depends on what type of
CPU you are trying to emulate.
I'd suggest using NOTIFY_SEA for SIGBUS_MCEERR_AR as it avoids problems where
the guest doesn't take the SError immediately, instead tries to re-execute the
code KVM has unmapped from stage2 because its corrupt. (You could detect this
happening in Qemu and try something else)
Synchronous/asynchronous external abort matters to the CPU, but once the error
has been notified to software the reasons for this distinction disappear. Once
the error has been handled, all trace of this distinction is gone.
CPER records only describe component failures. You are trying to re-create some
state that disappeared with one of the firmware-first abstractions. Trying to
re-create this information isn't worth the effort as the distinction doesn't
matter to linux, only to the CPU.
> so there is no chance for Qemu to trigger the SError when gets the SIGBUS_MCEERR_A{O,R}.
You mean there is no reason for Qemu to trigger an SError when it gets a signal
from the kernel.
The reasons the CPU might have to generate an SError don't apply to linux and
KVM user space. User-space will never get a signal for an uncontained error, we
will always panic(). We can't give user-space a signal for imprecise exceptions,
as it can't return from the signal. The classes of error that are left are
covered by polled/irq and NOTIFY_SEA.
Qemu can decide to generate RAS SErrors for SIGBUS_MCEERR_AR if it really wants
to, (but I don't think you should, the kernel may have unmapped the page at PC
from stage2 due to corruption).
I think the problem here is you're applying the CPU->software behaviour and
choices to software->software. By the time user-space gets the error, the
behaviour is different.
Thanks,
James
Hi James,
On 2017/12/16 2:52, James Morse wrote:
>> signal, it will record the CPER and trigger a IRQ to notify guest, as shown below:
>>
>> SIGBUS_MCEERR_AR trigger Synchronous External Abort.
>> SIGBUS_MCEERR_AO trigger GPIO IRQ.
>>
>> For the SIGBUS_MCEERR_AO and SIGBUS_MCEERR_AR, we have already specify trigger method, which all
>>
>> not involve _trigger_ an SError.
> It's a policy choice. How does your virtual CPU notify RAS errors to its virtual
> software? You could use SError for SIGBUS_MCEERR_AR, it depends on what type of
> CPU you are trying to emulate.
>
> I'd suggest using NOTIFY_SEA for SIGBUS_MCEERR_AR as it avoids problems where
> the guest doesn't take the SError immediately, instead tries to re-execute the
I agree it is better to use NOTIFY_SEA for SIGBUS_MCEERR_AR in this case.
> code KVM has unmapped from stage2 because its corrupt. (You could detect this
> happening in Qemu and try something else)For something else, using NOTIFY_SEI for SIGBUS_MCEERR_AR? At current implementation,
It seems only have this case that "KVM has unmapped from stage2", do you thing we still have something else?
>
>
> Synchronous/asynchronous external abort matters to the CPU, but once the error
> has been notified to software the reasons for this distinction disappear. Once
> the error has been handled, all trace of this distinction is gone.
>
> CPER records only describe component failures. You are trying to re-create some
> state that disappeared with one of the firmware-first abstractions. Trying to
> re-create this information isn't worth the effort as the distinction doesn't
> matter to linux, only to the CPU.
>
>
>> so there is no chance for Qemu to trigger the SError when gets the SIGBUS_MCEERR_A{O,R}.
> You mean there is no reason for Qemu to trigger an SError when it gets a signal
> from the kernel.
>
> The reasons the CPU might have to generate an SError don't apply to linux and
> KVM user space. User-space will never get a signal for an uncontained error, we
> will always panic(). We can't give user-space a signal for imprecise exceptions,
> as it can't return from the signal. The classes of error that are left are
> covered by polled/irq and NOTIFY_SEA.
>
> Qemu can decide to generate RAS SErrors for SIGBUS_MCEERR_AR if it really wants
> to, (but I don't think you should, the kernel may have unmapped the page at PC
> from stage2 due to corruption).
yes, you also said you do not want to generate RAS SErrors for SIGBUS_MCEERR_AR,
so Qemu does not know in which condition to generate RAS SErrors.
>
> I think the problem here is you're applying the CPU->software behaviour and
> choices to software->software. By the time user-space gets the error, the
> behaviour is different.
In the KVM, as a policy choice to reserve this API to specify guest ESR and drive to trigger SError is OK,
At least for Qemu it does not know in which condition to trigger it.
Hi gengdongjiu,
On 15/12/17 03:30, gengdongjiu wrote:
> On 2017/12/7 14:37, gengdongjiu wrote:
>>> We need to tackle (1) and (3) separately. For (3) we need some API that lets
>>> Qemu _trigger_ an SError in the guest, with a specified ESR. But, we don't have
>>> a way of migrating pending SError yet... which is where I got stuck last time I
>>> was looking at this.
>> I understand you most idea.
>>
>> But In the Qemu one signal type can only correspond to one behavior, can not correspond to two behaviors,
>> otherwise Qemu will do not know how to do.
>>
>> For the Qemu, if it receives the SIGBUS_MCEERR_AR signal, it will populate the CPER
>> records and inject a SEA to guest through KVM IOCTL "KVM_SET_ONE_REG"; if receives the SIGBUS_MCEERR_AO
>> signal, it will record the CPER and trigger a IRQ to notify guest, as shown below:
>>
>> SIGBUS_MCEERR_AR trigger Synchronous External Abort.
>> SIGBUS_MCEERR_AO trigger GPIO IRQ.
>>
>> For the SIGBUS_MCEERR_AO and SIGBUS_MCEERR_AR, we have already specify trigger method, which all
>>
>> not involve _trigger_ an SError.
>>
>> so there is no chance for Qemu to trigger the SError when gets the SIGBUS_MCEERR_A{O,R}.
>
> As I explained above:
>
> If Qemu received SIGBUS_MCEERR_AR, it will record CPER and trigger Synchronous External Abort;
> If Qemu received SIGBUS_MCEERR_AO, it will record CPER and trigger GPIO IRQ;
> So Qemu does not know when to _trigger_ an SError.
There is no answer to this. How the CPU decides is specific to the CPU design.
How Qemu decides is going to be specific to the machine it emulates.
My understanding is there is some overlap for which RAS errors are reported as
synchronous external abort, and which use SError. (Obviously the imprecise ones
are all SError). Which one the CPU uses depends on how the CPU is designed.
When you take an SIGBUS_MCEERR_AR from KVM, its because KVM can't complete a
stage2 fault because the page is marked with PG_poisoned. These started out as a
synchronous exception, but you could still report these with SError.
We don't have a way to signal user-space about imprecise exceptions, this isn't
a KVM specific problem.
> so here I "return a error" to Qemu if ghes_notify_sei() return failure in [1], if you opposed KVM "return error",
> do you have a better idea about it? thanks
If ghes_notify_sei() fails to claim the error, we should drop through to
kernel-first-handling. We don't have that yet, just the stub that ignores errors
where we can make progress.
If neither firmware-first nor kernel-first claim a RAS error, we're in trouble.
I'd like to panic() as we got a RAS notification but no description of the
error. We can't do this until we have kernel-first support, hence that stub.
> About the way of migrating pending SError, I think it is a separate case, because Qemu still does not know
> how and when to trigger the SError.
I agree, but I think we should fix this first before we add another user of this
unmigratable hypervisor state.
(I recall someone saying migration is needed for any new KVM/cpu features, but I
can't find the thread)
> [1]:
> static int kvm_handle_guest_sei(struct kvm_vcpu *vcpu, struct kvm_run *run)
> {
> .......................
> + case ESR_ELx_AET_UER: /* The error has not been propagated */
> + /*
> + * Userspace only handle the guest SError Interrupt(SEI) if the
> + * error has not been propagated
> + */
> + run->exit_reason = KVM_EXIT_EXCEPTION;
> + run->ex.exception = ESR_ELx_EC_SERROR;
I'm against telling user space RAS errors ever happened, only the final
user-visible error when the kernel can't fix it.
This is inventing something new for RAS errors not claimed by firmware-first.
If we have kernel-first too, this will never happen. (unless your system is
losing the error description).
Your system has firmware-first, why isn't it claiming the notification?
If its not finding CPER records written by firmware, check firmware and the UEFI
memory map agree on the attributes to be used when read/writing that area.
> + run->ex.error_code = KVM_SEI_SEV_RECOVERABLE;
> + return 0;
Thanks,
James
On Fri, Jan 12, 2018 at 06:05:23PM +0000, James Morse wrote:
> On 15/12/17 03:30, gengdongjiu wrote:
> > On 2017/12/7 14:37, gengdongjiu wrote:
[...]
>
> (I recall someone saying migration is needed for any new KVM/cpu features, but I
> can't find the thread)
>
I don't know of any hard set-in-stone rule for this, but I have
certainly argued that since migration is a popular technique in data
centers and often a key motivation behind using virtual machines as it
provides both load-balancing and high availability, we should think
about migration support for all features and state. Further, experience
has shown that retroactively trying to support migration can result in
really complex interfaces for saving/restoring state (see the ITS
ordering requirements in
Documentation/virtual/kvm/devices/arm-vgic-its.txt as an example) so
thinking about this problem when introducing functionality is a good
idea.
Of course, if there are really good arguments for having some state that
simply cannot be migrated, then that's fine, and we should just make
sure that userspace (e.g. QEMU) and higher level components in the
stack (libvirt, openstack, etc.) can detect this state being used, and
ideally enable/disable it, so that it can predict that a particular VM
cannot be migrated off a particular host, or between a particular set of
two hosts. As an example, migration is typically prohibited when using
VFIO direct device assignment, but userspace etc. are already aware of
this.
As a final note, if we add support for some architectural feature, which
may be present on some particular hardware and/or implementation, if the
KVM support for said feature is automatically enabled (and not
selectively from userspace), I would push back quite strongly on
something that doesn't support migration, because it would effectively
prevent migration of VMs on ARM.
Thanks,
-Christoffer
Hi Christoffer
On 2018/1/15 16:33, Christoffer Dall wrote:
> On Fri, Jan 12, 2018 at 06:05:23PM +0000, James Morse wrote:
>> On 15/12/17 03:30, gengdongjiu wrote:
>>> On 2017/12/7 14:37, gengdongjiu wrote:
>
> [...]
>
>>
>> (I recall someone saying migration is needed for any new KVM/cpu features, but I
>> can't find the thread)
>>
>
> I don't know of any hard set-in-stone rule for this, but I have
> certainly argued that since migration is a popular technique in data
> centers and often a key motivation behind using virtual machines as it
> provides both load-balancing and high availability, we should think
> about migration support for all features and state. Further, experience
> has shown that retroactively trying to support migration can result in
> really complex interfaces for saving/restoring state (see the ITS
> ordering requirements in
> Documentation/virtual/kvm/devices/arm-vgic-its.txt as an example) so
> thinking about this problem when introducing functionality is a good
> idea.
>
> Of course, if there are really good arguments for having some state that
> simply cannot be migrated, then that's fine, and we should just make
> sure that userspace (e.g. QEMU) and higher level components in the
> stack (libvirt, openstack, etc.) can detect this state being used, and
> ideally enable/disable it, so that it can predict that a particular VM
> cannot be migrated off a particular host, or between a particular set of
> two hosts. As an example, migration is typically prohibited when using
> VFIO direct device assignment, but userspace etc. are already aware of
> this.
>
> As a final note, if we add support for some architectural feature, which
> may be present on some particular hardware and/or implementation, if the
> KVM support for said feature is automatically enabled (and not
> selectively from userspace), I would push back quite strongly on
> something that doesn't support migration, because it would effectively
> prevent migration of VMs on ARM.
Thanks very much for this mail and reply, I will check it, please give me some time due to
recently busy with other things.
>
> Thanks,
> -Christoffer
>
> .
>
Hi James,
Sorry for my late response due to out of office recently.
2018-01-13 2:05 GMT+08:00 James Morse <[email protected]>:
> Hi gengdongjiu,
>
> On 15/12/17 03:30, gengdongjiu wrote:
>> On 2017/12/7 14:37, gengdongjiu wrote:
>>>> We need to tackle (1) and (3) separately. For (3) we need some API that lets
>>>> Qemu _trigger_ an SError in the guest, with a specified ESR. But, we don't have
>>>> a way of migrating pending SError yet... which is where I got stuck last time I
>>>> was looking at this.
>>> I understand you most idea.
>>>
>>> But In the Qemu one signal type can only correspond to one behavior, can not correspond to two behaviors,
>>> otherwise Qemu will do not know how to do.
>>>
>>> For the Qemu, if it receives the SIGBUS_MCEERR_AR signal, it will populate the CPER
>>> records and inject a SEA to guest through KVM IOCTL "KVM_SET_ONE_REG"; if receives the SIGBUS_MCEERR_AO
>>> signal, it will record the CPER and trigger a IRQ to notify guest, as shown below:
>>>
>>> SIGBUS_MCEERR_AR trigger Synchronous External Abort.
>>> SIGBUS_MCEERR_AO trigger GPIO IRQ.
>>>
>>> For the SIGBUS_MCEERR_AO and SIGBUS_MCEERR_AR, we have already specify trigger method, which all
>>>
>>> not involve _trigger_ an SError.
>>>
>>> so there is no chance for Qemu to trigger the SError when gets the SIGBUS_MCEERR_A{O,R}.
>>
>> As I explained above:
>>
>> If Qemu received SIGBUS_MCEERR_AR, it will record CPER and trigger Synchronous External Abort;
>> If Qemu received SIGBUS_MCEERR_AO, it will record CPER and trigger GPIO IRQ;
>
>> So Qemu does not know when to _trigger_ an SError.
>
> There is no answer to this. How the CPU decides is specific to the CPU design.
> How Qemu decides is going to be specific to the machine it emulates.
>
> My understanding is there is some overlap for which RAS errors are reported as
> synchronous external abort, and which use SError. (Obviously the imprecise ones
> are all SError). Which one the CPU uses depends on how the CPU is designed.
>
> When you take an SIGBUS_MCEERR_AR from KVM, its because KVM can't complete a
> stage2 fault because the page is marked with PG_poisoned. These started out as a
> synchronous exception, but you could still report these with SError.
yes, I agree, it is policy choice.
>
> We don't have a way to signal user-space about imprecise exceptions, this isn't
> a KVM specific problem.
>
>
>> so here I "return a error" to Qemu if ghes_notify_sei() return failure in [1], if you opposed KVM "return error",
>> do you have a better idea about it? thanks
>
> If ghes_notify_sei() fails to claim the error, we should drop through to
> kernel-first-handling. We don't have that yet, just the stub that ignores errors
> where we can make progress.
>
> If neither firmware-first nor kernel-first claim a RAS error, we're in trouble.
> I'd like to panic() as we got a RAS notification but no description of the
> error. We can't do this until we have kernel-first support, hence that stub.
>
>
>> About the way of migrating pending SError, I think it is a separate case, because Qemu still does not know
>> how and when to trigger the SError.
>
> I agree, but I think we should fix this first before we add another user of this
> unmigratable hypervisor state.
>
> (I recall someone saying migration is needed for any new KVM/cpu features, but I
> can't find the thread)
>
>
>> [1]:
>> static int kvm_handle_guest_sei(struct kvm_vcpu *vcpu, struct kvm_run *run)
>> {
>> .......................
>> + case ESR_ELx_AET_UER: /* The error has not been propagated */
>> + /*
>> + * Userspace only handle the guest SError Interrupt(SEI) if the
>> + * error has not been propagated
>> + */
>> + run->exit_reason = KVM_EXIT_EXCEPTION;
>> + run->ex.exception = ESR_ELx_EC_SERROR;
>
> I'm against telling user space RAS errors ever happened, only the final
> user-visible error when the kernel can't fix it.
thanks for the explanation.
For the ESR_ELx_AET_UER, this exception is precise, closing the VM may
be better[1].
But if you think panic is better until we support kernel-first, it is
also OK to me.
+static int kvm_handle_guest_sei(struct kvm_vcpu *vcpu, struct kvm_run *run)
+{
+ unsigned int esr = kvm_vcpu_get_hsr(vcpu);
+ bool impdef_syndrome = esr & ESR_ELx_ISV; /* aka IDS */
+ unsigned int aet = esr & ESR_ELx_AET;
+
+ /*
+ * This is not RAS SError
+ */
+ if (!cpus_have_const_cap(ARM64_HAS_RAS_EXTN)) {
+ kvm_inject_vabt(vcpu);
+ return 1;
+ }
+
+ /* For RAS the host kernel may handle this abort. */
+ if (!handle_guest_sei())
+ return 1;
+
+ /*
+ * In below two conditions, it will directly inject the
+ * virtual SError:
+ * 1. The Syndrome is IMPLEMENTATION DEFINED
+ * 2. It is Uncategorized SEI
+ */
+ if (impdef_syndrome ||
+ ((esr & ESR_ELx_FSC) != ESR_ELx_FSC_SERROR)) {
+ kvm_inject_vabt(vcpu);
+ return 1;
+ }
+
+ switch (aet) {
+ case ESR_ELx_AET_CE: /* corrected error */
+ case ESR_ELx_AET_UEO: /* restartable error, not yet consumed */
+ return 1; /* continue processing the guest exit */
+ case ESR_ELx_AET_UER: /* recoverable error */
+ /*
+ * the exception is precise, not been silently propagated
+ * and not been consumed by the CPU, temporarily shut down
+ * the VM to isolated the error, hope not touch it again.
+ */
+ run->exit_reason = KVM_EXIT_EXCEPTION;
+ return 0;
+ default:
+ /*
+ * Until now, the CPU supports RAS, SError interrupt is fatal
+ * and host does not successfully handle it.
+ */
+ panic("This Asynchronous SError interrupt is dangerous, panic");
+ }
+
+ return 0;
+}
+
>
> This is inventing something new for RAS errors not claimed by firmware-first.
> If we have kernel-first too, this will never happen. (unless your system is
> losing the error description).
In fact, if we have kernel-first, I think we still need to judge the
error type by ESR, right?
If the handle_guest_sei() , may be the system does not support firmware-first,
so we judge the ESR value,
>
>
> Your system has firmware-first, why isn't it claiming the notification?
> If its not finding CPER records written by firmware, check firmware and the UEFI
> memory map agree on the attributes to be used when read/writing that area.
>
>
>> + run->ex.error_code = KVM_SEI_SEV_RECOVERABLE;
>> + return 0;
>
>
> Thanks,
>
> James
>
> _______________________________________________
> kvmarm mailing list
> [email protected]
> https://lists.cs.columbia.edu/mailman/listinfo/kvmarm
2018-01-15 16:33 GMT+08:00 Christoffer Dall <[email protected]>:
> On Fri, Jan 12, 2018 at 06:05:23PM +0000, James Morse wrote:
>> On 15/12/17 03:30, gengdongjiu wrote:
>> > On 2017/12/7 14:37, gengdongjiu wrote:
>
> [...]
>
>>
>> (I recall someone saying migration is needed for any new KVM/cpu features, but I
>> can't find the thread)
>>
>
> I don't know of any hard set-in-stone rule for this, but I have
> certainly argued that since migration is a popular technique in data
> centers and often a key motivation behind using virtual machines as it
> provides both load-balancing and high availability, we should think
> about migration support for all features and state. Further, experience
> has shown that retroactively trying to support migration can result in
> really complex interfaces for saving/restoring state (see the ITS
> ordering requirements in
> Documentation/virtual/kvm/devices/arm-vgic-its.txt as an example) so
> thinking about this problem when introducing functionality is a good
> idea.
yes, agree it.
>
> Of course, if there are really good arguments for having some state that
> simply cannot be migrated, then that's fine, and we should just make
> sure that userspace (e.g. QEMU) and higher level components in the
> stack (libvirt, openstack, etc.) can detect this state being used, and
> ideally enable/disable it, so that it can predict that a particular VM
> cannot be migrated off a particular host, or between a particular set of
> two hosts. As an example, migration is typically prohibited when using
> VFIO direct device assignment, but userspace etc. are already aware of
> this.
Ok, I think this problem is similar to migrating a VM that uses an irqchip in
userspace and has set the IRQ or FIQ lines using KVM_IRQ_LINE.
>
> As a final note, if we add support for some architectural feature, which
> may be present on some particular hardware and/or implementation, if the
> KVM support for said feature is automatically enabled (and not
> selectively from userspace), I would push back quite strongly on
> something that doesn't support migration, because it would effectively
> prevent migration of VMs on ARM.
Ok, got it.
>
> Thanks,
> -Christoffer
> _______________________________________________
> kvmarm mailing list
> [email protected]
> https://lists.cs.columbia.edu/mailman/listinfo/kvmarm
Hi gengdongjiu,
On 21/01/18 02:45, gengdongjiu wrote:
> For the ESR_ELx_AET_UER, this exception is precise, closing the VM may
> be better[1].
> But if you think panic is better until we support kernel-first, it is
> also OK to me.
I'm not convinced SError while a guest was running means only guest memory could
be affected. Mechanisms like KSM means the error could affect multiple guests.
Both firmware-fist and kernel-first will give us the address, with which we can
know which processes are affected, isolated the memory and signal affected
processes.
Until we have one of these panic() is the only way we have to contain an error,
but its an interim fix.
Not panic()ing the host for an error that should be contained to the guest is a
fudge, we don't actually know its safe (KSM, page-table etc). I want to improve
on this with {firmware, kernel}-first support (or both!), I don't want to expose
that this is happening to user-space, as once we have one of {firmware,
kernel}-first, it shouldn't happen.
>> This is inventing something new for RAS errors not claimed by firmware-first.
>> If we have kernel-first too, this will never happen. (unless your system is
>> losing the error description).
> In fact, if we have kernel-first, I think we still need to judge the
> error type by ESR, right?
The kernel-first mechanism should consider the ESR/FAR, yes, but once the error
has been claimed and handled, KVM shouldn't care about any of these values.
(maybe we'll sanity check for uncontained errors, just in case the error escaped
to the RAS code...)
My point here was exposing 'unhandled' (ignored) RAS errors to user-space
creates an ABI: someone will complain once we start handling the error, and they
no longer get a notification via this 'unhandled' interface. Code written to use
this interface becomes useless/untested.
> If the handle_guest_sei() , may be the system does not support firmware-first,
> so we judge the ESR value,
...and panic()/ignore as appropriate.
I agree not all systems will support firmware-first, (big-endian is the obvious
example), but if we get kernel-first support this ESR guessing can disappear,
I'm against exposing it to user-space in the meantime.
Thanks,
James
Hi gengdongjiu,
On 16/12/17 03:44, gengdongjiu wrote:
> On 2017/12/16 2:52, James Morse wrote:
>>> signal, it will record the CPER and trigger a IRQ to notify guest, as shown below:
>>>
>>> SIGBUS_MCEERR_AR trigger Synchronous External Abort.
>>> SIGBUS_MCEERR_AO trigger GPIO IRQ.
>>>
>>> For the SIGBUS_MCEERR_AO and SIGBUS_MCEERR_AR, we have already specify trigger method, which all
>>>
>>> not involve _trigger_ an SError.
>> It's a policy choice. How does your virtual CPU notify RAS errors to its virtual
>> software? You could use SError for SIGBUS_MCEERR_AR, it depends on what type of
>> CPU you are trying to emulate.
>>
>> I'd suggest using NOTIFY_SEA for SIGBUS_MCEERR_AR as it avoids problems where
>> the guest doesn't take the SError immediately, instead tries to re-execute the
> I agree it is better to use NOTIFY_SEA for SIGBUS_MCEERR_AR in this case.
>> code KVM has unmapped from stage2 because its corrupt. (You could detect this
>> happening in Qemu and try something else)
> For something else, using NOTIFY_SEI for SIGBUS_MCEERR_AR?
Sorry that was unclear. If you use NOTIFY_SEI, the guest may have PSTATE.A set,
in which case the the CPU will patiently wait for it to unmask, (or consume it
with an ESB-instruction), before delivering the notification. The guest may not
have unmasked SError because its hammering the same page taking the same fault
again and again. Pending the asynchronous notification and re-running the vcpu
doesn't guarantee progress will be made.
In this case user-space can spot its pended an asynchronous notification (for
the same address!) more than once in the last few seconds, and try something
else, like firing a guest:reboot watchdog on another CPU.
> At current implementation,
> It seems only have this case that "KVM has unmapped from stage2", do you thing we
> still have something else?
I'm wary that this only works for errors where we know the guest PC accessed the
faulting location.
The arch code will send this signal too if user-space touches the PG_poisoned
page. (I recall you checked Qemu spots this case and acts differently).
Migration is the obvious example for Qemu read/writing guest memory.
On x86 the MachineCheck code sends these signals too, so our kernel-first
implementation may do the same. As a response to a RAS error notified by
synchronous-external-abort, this is fine. But we need to remember '_AR' implies
the error is related to the code the signal interrupted, which wouldn't be true
for an error notified by SError.
>> Synchronous/asynchronous external abort matters to the CPU, but once the error
>> has been notified to software the reasons for this distinction disappear. Once
>> the error has been handled, all trace of this distinction is gone.
>>
>> CPER records only describe component failures. You are trying to re-create some
>> state that disappeared with one of the firmware-first abstractions. Trying to
>> re-create this information isn't worth the effort as the distinction doesn't
>> matter to linux, only to the CPU.
>>
>>
>>> so there is no chance for Qemu to trigger the SError when gets the SIGBUS_MCEERR_A{O,R}.
>> You mean there is no reason for Qemu to trigger an SError when it gets a signal
>> from the kernel.
>>
>> The reasons the CPU might have to generate an SError don't apply to linux and
>> KVM user space. User-space will never get a signal for an uncontained error, we
>> will always panic(). We can't give user-space a signal for imprecise exceptions,
>> as it can't return from the signal. The classes of error that are left are
>> covered by polled/irq and NOTIFY_SEA.
>>
>> Qemu can decide to generate RAS SErrors for SIGBUS_MCEERR_AR if it really wants
>> to, (but I don't think you should, the kernel may have unmapped the page at PC
>> from stage2 due to corruption).
> yes, you also said you do not want to generate RAS SErrors for SIGBUS_MCEERR_AR,
> so Qemu does not know in which condition to generate RAS SErrors.
There are two things going on here, firstly the guest may have masked PSTATE.A,
and be hammering an unmapped page. (this this 'sorry that was unclear' case
above). This would happen if the exception-entry code or stack became corrupt
when an exception was taken.
The second is what does existing non-RAS-aware software do? For SError it
panic()s, whereas for synchronous external abort there are some cases that can
be handled. (e.g. on linux: synchronous external abort from user-space).
>> I think the problem here is you're applying the CPU->software behaviour and
>> choices to software->software. By the time user-space gets the error, the
>> behaviour is different.
> In the KVM, as a policy choice to reserve this API to specify guest ESR and
> drive to trigger SError is OK,
> At least for Qemu it does not know in which condition to trigger it.
I think you're saying "lets keep it KVM for now, Qemu doesn't have a better idea
of what to do."
Thanks,
James