2017-12-14 12:23:58

by Gonglei (Arei)

[permalink] [raw]
Subject: [PATCH] KVM: x86: ioapic: Clear IRR for rtc bit when rtc EOI gotten

We hit a bug in our test while run PCMark 10 in a windows 7 VM,
The VM got stuck and the wallclock was hang after several minutes running
PCMark 10 in it.
It is quite easily to reproduce the bug with the upstream KVM and Qemu.

We found that KVM can not inject any RTC irq to VM after it was hang, it fails to
Deliver irq in ioapic_set_irq() because RTC irq is still pending in ioapic->irr.

static int ioapic_set_irq(struct kvm_ioapic *ioapic, unsigned int irq,
int irq_level, bool line_status)
{
...
if (!irq_level) {
ioapic->irr &= ~mask;
ret = 1;
goto out;
}
...
if ((edge && old_irr == ioapic->irr) ||
(!edge && entry.fields.remote_irr)) {
ret = 0;
goto out;
}

According to RTC spec, after RTC injects a High level irq, OS will read CMOS's
register C to to clear the irq flag, and pull down the irq electric pin.

For Qemu, we will emulate the reading operation in cmos_ioport_read(),
but Guest OS will fire a write operation before to tell which register will be read
after this write, where we use s->cmos_index to record the following register to read.

But in our test, we found that there is a possible situation that Vcpu fails to read
RTC_REG_C to clear irq, This could happens while two VCpus are writing/reading
registers at the same time, for example, vcpu 0 is trying to read RTC_REG_C,
so it write RTC_REG_C first, where the s->cmos_index will be RTC_REG_C,
but before it tries to read register C, another vcpu1 is going to read RTC_YEAR,
it changes s->cmos_index to RTC_YEAR by a writing action.
The next operation of vcpu0 will be lead to read RTC_YEAR, In this case, we will miss
calling qemu_irq_lower(s->irq) to clear the irq. After this, kvm will never inject RTC irq,
and Windows VM will hang.

Let's clear IRR of rtc when corresponding EOI is gotten to avoid the issue.

Suggested-by: Paolo Bonzini <[email protected]>
Signed-off-by: Gonglei <[email protected]>
---
Thanks to Paolo provides a good solution. :)

arch/x86/kvm/ioapic.c | 1 +
1 file changed, 1 insertion(+)

diff --git a/arch/x86/kvm/ioapic.c b/arch/x86/kvm/ioapic.c
index 4e822ad..5022d63 100644
--- a/arch/x86/kvm/ioapic.c
+++ b/arch/x86/kvm/ioapic.c
@@ -160,6 +160,7 @@ static void rtc_irq_eoi(struct kvm_ioapic *ioapic, struct kvm_vcpu *vcpu)
{
if (test_and_clear_bit(vcpu->vcpu_id,
ioapic->rtc_status.dest_map.map)) {
+ ioapic->irr &= ~(1 << RTC_GSI);
--ioapic->rtc_status.pending_eoi;
rtc_status_pending_eoi_check_valid(ioapic);
}
--
1.8.3.1



2017-12-14 13:02:00

by Liran Alon

[permalink] [raw]
Subject: Re: [PATCH] KVM: x86: ioapic: Clear IRR for rtc bit when rtc EOI gotten



On 14/12/17 14:23, Gonglei wrote:
> We hit a bug in our test while run PCMark 10 in a windows 7 VM,
> The VM got stuck and the wallclock was hang after several minutes running
> PCMark 10 in it.
> It is quite easily to reproduce the bug with the upstream KVM and Qemu.
>
> We found that KVM can not inject any RTC irq to VM after it was hang, it fails to
> Deliver irq in ioapic_set_irq() because RTC irq is still pending in ioapic->irr.
>
> static int ioapic_set_irq(struct kvm_ioapic *ioapic, unsigned int irq,
> int irq_level, bool line_status)
> {
> ...
> if (!irq_level) {
> ioapic->irr &= ~mask;
> ret = 1;
> goto out;
> }
> ...
> if ((edge && old_irr == ioapic->irr) ||
> (!edge && entry.fields.remote_irr)) {
> ret = 0;
> goto out;
> }
>
> According to RTC spec, after RTC injects a High level irq, OS will read CMOS's
> register C to to clear the irq flag, and pull down the irq electric pin.
>
> For Qemu, we will emulate the reading operation in cmos_ioport_read(),
> but Guest OS will fire a write operation before to tell which register will be read
> after this write, where we use s->cmos_index to record the following register to read.
>
> But in our test, we found that there is a possible situation that Vcpu fails to read
> RTC_REG_C to clear irq, This could happens while two VCpus are writing/reading
> registers at the same time, for example, vcpu 0 is trying to read RTC_REG_C,
> so it write RTC_REG_C first, where the s->cmos_index will be RTC_REG_C,
> but before it tries to read register C, another vcpu1 is going to read RTC_YEAR,
> it changes s->cmos_index to RTC_YEAR by a writing action.
> The next operation of vcpu0 will be lead to read RTC_YEAR, In this case, we will miss
> calling qemu_irq_lower(s->irq) to clear the irq. After this, kvm will never inject RTC irq,
> and Windows VM will hang.

If I understood correctly, this looks to me like a race-condition bug in
the Windows guest kernel. In real-hardware this race-condition will also
cause the RTC_YEAR to be read instead of RTC_REG_C.
Guest kernel should make sure that 2 CPUs does not attempt to read a
CMOS register in parallel as they can override each other's cmos_index.

See for example how Linux kernel makes sure to avoid such kind of issues
in rtc_cmos_read() (arch/x86/kernel/rtc.c) by grabbing a cmos_lock.

>
> Let's clear IRR of rtc when corresponding EOI is gotten to avoid the issue.

Can you elaborate a bit more why it makes sense to put such workaround
in KVM code instead of declaring this as guest kernel bug?

Regards,
-Liran

>
> Suggested-by: Paolo Bonzini <[email protected]>
> Signed-off-by: Gonglei <[email protected]>
> ---
> Thanks to Paolo provides a good solution. :)
>
> arch/x86/kvm/ioapic.c | 1 +
> 1 file changed, 1 insertion(+)
>
> diff --git a/arch/x86/kvm/ioapic.c b/arch/x86/kvm/ioapic.c
> index 4e822ad..5022d63 100644
> --- a/arch/x86/kvm/ioapic.c
> +++ b/arch/x86/kvm/ioapic.c
> @@ -160,6 +160,7 @@ static void rtc_irq_eoi(struct kvm_ioapic *ioapic, struct kvm_vcpu *vcpu)
> {
> if (test_and_clear_bit(vcpu->vcpu_id,
> ioapic->rtc_status.dest_map.map)) {
> + ioapic->irr &= ~(1 << RTC_GSI);
> --ioapic->rtc_status.pending_eoi;
> rtc_status_pending_eoi_check_valid(ioapic);
> }
>

2017-12-14 13:16:27

by Gonglei (Arei)

[permalink] [raw]
Subject: RE: [PATCH] KVM: x86: ioapic: Clear IRR for rtc bit when rtc EOI gotten


> -----Original Message-----
> From: Liran Alon [mailto:[email protected]]
> Sent: Thursday, December 14, 2017 9:02 PM
> To: Gonglei (Arei); [email protected]; [email protected]
> Cc: [email protected]; [email protected]; Huangweidong (C)
> Subject: Re: [PATCH] KVM: x86: ioapic: Clear IRR for rtc bit when rtc EOI gotten
>
>
>
> On 14/12/17 14:23, Gonglei wrote:
> > We hit a bug in our test while run PCMark 10 in a windows 7 VM,
> > The VM got stuck and the wallclock was hang after several minutes running
> > PCMark 10 in it.
> > It is quite easily to reproduce the bug with the upstream KVM and Qemu.
> >
> > We found that KVM can not inject any RTC irq to VM after it was hang, it fails
> to
> > Deliver irq in ioapic_set_irq() because RTC irq is still pending in ioapic->irr.
> >
> > static int ioapic_set_irq(struct kvm_ioapic *ioapic, unsigned int irq,
> > int irq_level, bool line_status)
> > {
> > ...
> > if (!irq_level) {
> > ioapic->irr &= ~mask;
> > ret = 1;
> > goto out;
> > }
> > ...
> > if ((edge && old_irr == ioapic->irr) ||
> > (!edge && entry.fields.remote_irr)) {
> > ret = 0;
> > goto out;
> > }
> >
> > According to RTC spec, after RTC injects a High level irq, OS will read CMOS's
> > register C to to clear the irq flag, and pull down the irq electric pin.
> >
> > For Qemu, we will emulate the reading operation in cmos_ioport_read(),
> > but Guest OS will fire a write operation before to tell which register will be
> read
> > after this write, where we use s->cmos_index to record the following register
> to read.
> >
> > But in our test, we found that there is a possible situation that Vcpu fails to
> read
> > RTC_REG_C to clear irq, This could happens while two VCpus are
> writing/reading
> > registers at the same time, for example, vcpu 0 is trying to read RTC_REG_C,
> > so it write RTC_REG_C first, where the s->cmos_index will be RTC_REG_C,
> > but before it tries to read register C, another vcpu1 is going to read
> RTC_YEAR,
> > it changes s->cmos_index to RTC_YEAR by a writing action.
> > The next operation of vcpu0 will be lead to read RTC_YEAR, In this case, we
> will miss
> > calling qemu_irq_lower(s->irq) to clear the irq. After this, kvm will never inject
> RTC irq,
> > and Windows VM will hang.
>
> If I understood correctly, this looks to me like a race-condition bug in
> the Windows guest kernel. In real-hardware this race-condition will also
> cause the RTC_YEAR to be read instead of RTC_REG_C.
> Guest kernel should make sure that 2 CPUs does not attempt to read a
> CMOS register in parallel as they can override each other's cmos_index.
>
> See for example how Linux kernel makes sure to avoid such kind of issues
> in rtc_cmos_read() (arch/x86/kernel/rtc.c) by grabbing a cmos_lock.
>
Yes, I knew that.

> >
> > Let's clear IRR of rtc when corresponding EOI is gotten to avoid the issue.
>
> Can you elaborate a bit more why it makes sense to put such workaround
> in KVM code instead of declaring this as guest kernel bug?
>
I agree it's a Windows bug. The big problem is there is not problem on Xen platform.

Thanks,
-Gonglei

> Regards,
> -Liran
>
> >
> > Suggested-by: Paolo Bonzini <[email protected]>
> > Signed-off-by: Gonglei <[email protected]>
> > ---
> > Thanks to Paolo provides a good solution. :)
> >
> > arch/x86/kvm/ioapic.c | 1 +
> > 1 file changed, 1 insertion(+)
> >
> > diff --git a/arch/x86/kvm/ioapic.c b/arch/x86/kvm/ioapic.c
> > index 4e822ad..5022d63 100644
> > --- a/arch/x86/kvm/ioapic.c
> > +++ b/arch/x86/kvm/ioapic.c
> > @@ -160,6 +160,7 @@ static void rtc_irq_eoi(struct kvm_ioapic *ioapic,
> struct kvm_vcpu *vcpu)
> > {
> > if (test_and_clear_bit(vcpu->vcpu_id,
> > ioapic->rtc_status.dest_map.map)) {
> > + ioapic->irr &= ~(1 << RTC_GSI);
> > --ioapic->rtc_status.pending_eoi;
> > rtc_status_pending_eoi_check_valid(ioapic);
> > }
> >

2017-12-14 13:18:38

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH] KVM: x86: ioapic: Clear IRR for rtc bit when rtc EOI gotten

On 14/12/2017 14:01, Liran Alon wrote:
>> But in our test, we found that there is a possible situation that Vcpu
>> fails to read
>> RTC_REG_C to clear irq, This could happens while two VCpus are
>> writing/reading
>> registers at the same time, for example, vcpu 0 is trying to read
>> RTC_REG_C,
>> so it write RTC_REG_C first, where the s->cmos_index will be RTC_REG_C,
>> but before it tries to read register C, another vcpu1 is going to read
>> RTC_YEAR,
>> it changes s->cmos_index to RTC_YEAR by a writing action.
>> The next operation of vcpu0 will be lead to read RTC_YEAR, In this
>> case, we will miss
>> calling qemu_irq_lower(s->irq) to clear the irq. After this, kvm will
>> never inject RTC irq,
>> and Windows VM will hang.
>
> If I understood correctly, this looks to me like a race-condition bug in
> the Windows guest kernel. In real-hardware this race-condition will also
> cause the RTC_YEAR to be read instead of RTC_REG_C.
> Guest kernel should make sure that 2 CPUs does not attempt to read a
> CMOS register in parallel as they can override each other's cmos_index.
>
> See for example how Linux kernel makes sure to avoid such kind of issues
> in rtc_cmos_read() (arch/x86/kernel/rtc.c) by grabbing a cmos_lock.

Lei and I looked at it further, and the root cause is not the missed EOI
in QEMU. Rather it's a bug in ioapic.c's tracking of RTC interrupts.

Paolo

2017-12-14 14:26:58

by Quan Xu

[permalink] [raw]
Subject: Re: [PATCH] KVM: x86: ioapic: Clear IRR for rtc bit when rtc EOI gotten



On 2017/12/14 21:15, Gonglei (Arei) wrote:
>> -----Original Message-----
>> From: Liran Alon [mailto:[email protected]]
>> Sent: Thursday, December 14, 2017 9:02 PM
>> To: Gonglei (Arei); [email protected]; [email protected]
>> Cc: [email protected]; [email protected]; Huangweidong (C)
>> Subject: Re: [PATCH] KVM: x86: ioapic: Clear IRR for rtc bit when rtc EOI gotten
>>
>>
>>
>> On 14/12/17 14:23, Gonglei wrote:
>>> We hit a bug in our test while run PCMark 10 in a windows 7 VM,
>>> The VM got stuck and the wallclock was hang after several minutes running
>>> PCMark 10 in it.
>>> It is quite easily to reproduce the bug with the upstream KVM and Qemu.
>>>
>>> We found that KVM can not inject any RTC irq to VM after it was hang, it fails
>> to
>>> Deliver irq in ioapic_set_irq() because RTC irq is still pending in ioapic->irr.
>>>
>>> static int ioapic_set_irq(struct kvm_ioapic *ioapic, unsigned int irq,
>>> int irq_level, bool line_status)
>>> {
>>> ...
>>> if (!irq_level) {
>>> ioapic->irr &= ~mask;
>>> ret = 1;
>>> goto out;
>>> }
>>> ...
>>> if ((edge && old_irr == ioapic->irr) ||
>>> (!edge && entry.fields.remote_irr)) {
>>> ret = 0;
>>> goto out;
>>> }
>>>
>>> According to RTC spec, after RTC injects a High level irq, OS will read CMOS's
>>> register C to to clear the irq flag, and pull down the irq electric pin.
>>>
>>> For Qemu, we will emulate the reading operation in cmos_ioport_read(),
>>> but Guest OS will fire a write operation before to tell which register will be
>> read
>>> after this write, where we use s->cmos_index to record the following register
>> to read.
>>> But in our test, we found that there is a possible situation that Vcpu fails to
>> read
>>> RTC_REG_C to clear irq, This could happens while two VCpus are
>> writing/reading
>>> registers at the same time, for example, vcpu 0 is trying to read RTC_REG_C,
>>> so it write RTC_REG_C first, where the s->cmos_index will be RTC_REG_C,
>>> but before it tries to read register C, another vcpu1 is going to read
>> RTC_YEAR,
>>> it changes s->cmos_index to RTC_YEAR by a writing action.
>>> The next operation of vcpu0 will be lead to read RTC_YEAR, In this case, we
>> will miss
>>> calling qemu_irq_lower(s->irq) to clear the irq. After this, kvm will never inject
>> RTC irq,
>>> and Windows VM will hang.
>> If I understood correctly, this looks to me like a race-condition bug in
>> the Windows guest kernel. In real-hardware this race-condition will also
>> cause the RTC_YEAR to be read instead of RTC_REG_C.
>> Guest kernel should make sure that 2 CPUs does not attempt to read a
>> CMOS register in parallel as they can override each other's cmos_index.
>>
>> See for example how Linux kernel makes sure to avoid such kind of issues
>> in rtc_cmos_read() (arch/x86/kernel/rtc.c) by grabbing a cmos_lock.
>>
> Yes, I knew that.
>
>>> Let's clear IRR of rtc when corresponding EOI is gotten to avoid the issue.
>> Can you elaborate a bit more why it makes sense to put such workaround
>> in KVM code instead of declaring this as guest kernel bug?
>>
> I agree it's a Windows bug. The big problem is there is not problem on Xen platform.

 have you analyzed why it is not a problem on Xen?

Quan


> Thanks,
> -Gonglei
>
>> Regards,
>> -Liran
>>
>>> Suggested-by: Paolo Bonzini <[email protected]>
>>> Signed-off-by: Gonglei <[email protected]>
>>> ---
>>> Thanks to Paolo provides a good solution. :)
>>>
>>> arch/x86/kvm/ioapic.c | 1 +
>>> 1 file changed, 1 insertion(+)
>>>
>>> diff --git a/arch/x86/kvm/ioapic.c b/arch/x86/kvm/ioapic.c
>>> index 4e822ad..5022d63 100644
>>> --- a/arch/x86/kvm/ioapic.c
>>> +++ b/arch/x86/kvm/ioapic.c
>>> @@ -160,6 +160,7 @@ static void rtc_irq_eoi(struct kvm_ioapic *ioapic,
>> struct kvm_vcpu *vcpu)
>>> {
>>> if (test_and_clear_bit(vcpu->vcpu_id,
>>> ioapic->rtc_status.dest_map.map)) {
>>> + ioapic->irr &= ~(1 << RTC_GSI);
>>> --ioapic->rtc_status.pending_eoi;
>>> rtc_status_pending_eoi_check_valid(ioapic);
>>> }
>>>

2017-12-25 07:29:12

by Wanpeng Li

[permalink] [raw]
Subject: Re: [PATCH] KVM: x86: ioapic: Clear IRR for rtc bit when rtc EOI gotten

2017-12-14 20:23 GMT+08:00 Gonglei <[email protected]>:
> We hit a bug in our test while run PCMark 10 in a windows 7 VM,
> The VM got stuck and the wallclock was hang after several minutes running
> PCMark 10 in it.
> It is quite easily to reproduce the bug with the upstream KVM and Qemu.
>
> We found that KVM can not inject any RTC irq to VM after it was hang, it fails to
> Deliver irq in ioapic_set_irq() because RTC irq is still pending in ioapic->irr.
>
> static int ioapic_set_irq(struct kvm_ioapic *ioapic, unsigned int irq,
> int irq_level, bool line_status)
> {
> ...
> if (!irq_level) {
> ioapic->irr &= ~mask;
> ret = 1;
> goto out;
> }
> ...
> if ((edge && old_irr == ioapic->irr) ||
> (!edge && entry.fields.remote_irr)) {
> ret = 0;
> goto out;
> }
>
> According to RTC spec, after RTC injects a High level irq, OS will read CMOS's

I think it is falling edge and active low.

Regards,
Wanpeng Li

> register C to to clear the irq flag, and pull down the irq electric pin.
>
> For Qemu, we will emulate the reading operation in cmos_ioport_read(),
> but Guest OS will fire a write operation before to tell which register will be read
> after this write, where we use s->cmos_index to record the following register to read.
>
> But in our test, we found that there is a possible situation that Vcpu fails to read
> RTC_REG_C to clear irq, This could happens while two VCpus are writing/reading
> registers at the same time, for example, vcpu 0 is trying to read RTC_REG_C,
> so it write RTC_REG_C first, where the s->cmos_index will be RTC_REG_C,
> but before it tries to read register C, another vcpu1 is going to read RTC_YEAR,
> it changes s->cmos_index to RTC_YEAR by a writing action.
> The next operation of vcpu0 will be lead to read RTC_YEAR, In this case, we will miss
> calling qemu_irq_lower(s->irq) to clear the irq. After this, kvm will never inject RTC irq,
> and Windows VM will hang.
>
> Let's clear IRR of rtc when corresponding EOI is gotten to avoid the issue.
>
> Suggested-by: Paolo Bonzini <[email protected]>
> Signed-off-by: Gonglei <[email protected]>
> ---
> Thanks to Paolo provides a good solution. :)
>
> arch/x86/kvm/ioapic.c | 1 +
> 1 file changed, 1 insertion(+)
>
> diff --git a/arch/x86/kvm/ioapic.c b/arch/x86/kvm/ioapic.c
> index 4e822ad..5022d63 100644
> --- a/arch/x86/kvm/ioapic.c
> +++ b/arch/x86/kvm/ioapic.c
> @@ -160,6 +160,7 @@ static void rtc_irq_eoi(struct kvm_ioapic *ioapic, struct kvm_vcpu *vcpu)
> {
> if (test_and_clear_bit(vcpu->vcpu_id,
> ioapic->rtc_status.dest_map.map)) {
> + ioapic->irr &= ~(1 << RTC_GSI);
> --ioapic->rtc_status.pending_eoi;
> rtc_status_pending_eoi_check_valid(ioapic);
> }
> --
> 1.8.3.1
>
>