2022-01-21 20:14:23

by Sean Christopherson

[permalink] [raw]
Subject: [PATCH] KVM: VMX: Set vmcs.PENDING_DBG.BS on #DB in STI/MOVSS blocking shadow

Set vmcs.GUEST_PENDING_DBG_EXCEPTIONS.BS, a.k.a. the pending single-step
breakpoint flag, when re-injecting a #DB with RFLAGS.TF=1, and STI or
MOVSS blocking is active. Setting the flag is necessary to make VM-Entry
consistency checks happy, as VMX has an invariant that if RFLAGS.TF is
set and STI/MOVSS blocking is true, then the previous instruction must
have been STI or MOV/POP, and therefore a single-step #DB must be pending
since the RFLAGS.TF cannot have been set by the previous instruction,
i.e. the one instruction delay after setting RFLAGS.TF must have already
expired.

Normally, the CPU sets vmcs.GUEST_PENDING_DBG_EXCEPTIONS.BS appropriately
when recording guest state as part of a VM-Exit, but #DB VM-Exits
intentionally do not treat the #DB as "guest state" as interception of
the #DB effectively makes the #DB host-owned, thus KVM needs to manually
set PENDING_DBG.BS when forwarding/re-injecting the #DB to the guest.

Note, although this bug can be triggered by guest userspace, doing so
requires IOPL=3, and guest userspace running with IOPL=3 has full access
to all I/O ports (from the guest's perspective) and can crash/reboot the
guest any number of ways. IOPL=3 is required because STI blocking kicks
in if and only if RFLAGS.IF is toggled 0=>1, and if CPL>IOPL, STI either
takes a #GP or modifies RFLAGS.VIF, not RFLAGS.IF.

MOVSS blocking can be initiated by userspace, but can be coincident with
a #DB if and only if DR7.GD=1 (General Detect enabled) and a MOV DR is
executed in the MOVSS shadow. MOV DR #GPs at CPL>0, thus MOVSS blocking
is problematic only for CPL0 (and only if the guest is crazy enough to
access a DR in a MOVSS shadow). All other sources of #DBs are either
suppressed by MOVSS blocking (single-step, code fetch, data, and I/O),
are mutually exclusive with MOVSS blocking (T-bit task switch), or are
already handled by KVM (ICEBP, a.k.a. INT1).

This bug was originally found by running tests[1] created for XSA-308[2].
Note that Xen's userspace test emits ICEBP in the MOVSS shadow, which is
presumably why the Xen bug was deemed to be an exploitable DOS from guest
userspace. KVM already handles ICEBP by skipping the ICEBP instruction
and thus clears MOVSS blocking as a side effect of its "emulation".

[1] http://xenbits.xenproject.org/docs/xtf/xsa-308_2main_8c_source.html
[2] https://xenbits.xen.org/xsa/advisory-308.html

Reported-by: David Woodhouse <[email protected]>
Reported-by: Alexander Graf <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/vmx/vmx.c | 25 +++++++++++++++++++++++++
1 file changed, 25 insertions(+)

diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index a02a28ce7cc3..3f7b09a24d1e 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -4901,8 +4901,33 @@ static int handle_exception_nmi(struct kvm_vcpu *vcpu)
dr6 = vmx_get_exit_qual(vcpu);
if (!(vcpu->guest_debug &
(KVM_GUESTDBG_SINGLESTEP | KVM_GUESTDBG_USE_HW_BP))) {
+ /*
+ * If the #DB was due to ICEBP, a.k.a. INT1, skip the
+ * instruction. ICEBP generates a trap-like #DB, but
+ * despite its interception control being tied to #DB,
+ * is an instruction intercept, i.e. the VM-Exit occurs
+ * on the ICEBP itself. Note, skipping ICEBP also
+ * clears STI and MOVSS blocking.
+ *
+ * For all other #DBs, set vmcs.PENDING_DBG_EXCEPTIONS.BS
+ * if single-step is enabled in RFLAGS and STI or MOVSS
+ * blocking is active, as the CPU doesn't set the bit
+ * on VM-Exit due to #DB interception. VM-Entry has a
+ * consistency check that a single-step #DB is pending
+ * in this scenario as the previous instruction cannot
+ * have toggled RFLAGS.TF 0=>1 (because STI and POP/MOV
+ * don't modify RFLAGS), therefore the one instruction
+ * delay when activating single-step breakpoints must
+ * have already expired. Note, the CPU sets/clears BS
+ * as appropriate for all other VM-Exits types.
+ */
if (is_icebp(intr_info))
WARN_ON(!skip_emulated_instruction(vcpu));
+ else if ((vmx_get_rflags(vcpu) & X86_EFLAGS_TF) &&
+ (vmcs_read32(GUEST_INTERRUPTIBILITY_INFO) &
+ (GUEST_INTR_STATE_STI | GUEST_INTR_STATE_MOV_SS)))
+ vmcs_writel(GUEST_PENDING_DBG_EXCEPTIONS,
+ vmcs_readl(GUEST_PENDING_DBG_EXCEPTIONS) | DR6_BS);

kvm_queue_exception_p(vcpu, DB_VECTOR, dr6);
return 1;

base-commit: edb9e50dbe18394d0fc9d0494f5b6046fc912d33
--
2.34.1.703.g22d0c6ccf7-goog


2022-01-21 21:04:26

by Andrew Cooper

[permalink] [raw]
Subject: Re: [PATCH] KVM: VMX: Set vmcs.PENDING_DBG.BS on #DB in STI/MOVSS blocking shadow

On 20/01/2022 00:06, Sean Christopherson wrote:
> Set vmcs.GUEST_PENDING_DBG_EXCEPTIONS.BS, a.k.a. the pending single-step
> breakpoint flag, when re-injecting a #DB with RFLAGS.TF=1, and STI or
> MOVSS blocking is active. Setting the flag is necessary to make VM-Entry
> consistency checks happy, as VMX has an invariant that if RFLAGS.TF is
> set and STI/MOVSS blocking is true, then the previous instruction must
> have been STI or MOV/POP, and therefore a single-step #DB must be pending
> since the RFLAGS.TF cannot have been set by the previous instruction,
> i.e. the one instruction delay after setting RFLAGS.TF must have already
> expired.
>
> Normally, the CPU sets vmcs.GUEST_PENDING_DBG_EXCEPTIONS.BS appropriately
> when recording guest state as part of a VM-Exit, but #DB VM-Exits
> intentionally do not treat the #DB as "guest state" as interception of
> the #DB effectively makes the #DB host-owned, thus KVM needs to manually
> set PENDING_DBG.BS when forwarding/re-injecting the #DB to the guest.

The problem is that none of this is documented.

Amongst other things, the vmentry consistency check misses the case when
#DB really is pending in ENTRY_INTR_INFO.


It is very clear that to use VT-x/SVM correctly, required reading
includes the core microcode and RTL, which of course all of us have
access to...

> Note, although this bug can be triggered by guest userspace, doing so
> requires IOPL=3, and guest userspace running with IOPL=3 has full access
> to all I/O ports (from the guest's perspective) and can crash/reboot the
> guest any number of ways. IOPL=3 is required because STI blocking kicks
> in if and only if RFLAGS.IF is toggled 0=>1, and if CPL>IOPL, STI either
> takes a #GP or modifies RFLAGS.VIF, not RFLAGS.IF.
>
> MOVSS blocking can be initiated by userspace, but can be coincident with
> a #DB if and only if DR7.GD=1 (General Detect enabled) and a MOV DR is
> executed in the MOVSS shadow. MOV DR #GPs at CPL>0, thus MOVSS blocking
> is problematic only for CPL0 (and only if the guest is crazy enough to
> access a DR in a MOVSS shadow). All other sources of #DBs are either
> suppressed by MOVSS blocking (single-step, code fetch, data, and I/O),

It is more complicated than this and undocumented.  Single step is
discard in a shadow, while data breakpoints are deferred.

I've just run an experiment with code breakpoints, as they're faults
like General Detect:

bool do_unhandled_exception(struct cpu_regs *regs)
{
    static int limit;

    if ( limit++ > 10 )
        return false;

    if ( regs->entry_vector == X86_EXC_DB )
    {
        unsigned int pending_dbg = read_dr6() ^ X86_DR6_DEFAULT;
        unsigned int dr7 = read_dr7(), spurious = 0;

        for ( int i = 0; i < 4; ++i )
            if ( pending_dbg & (1 << i) && ((dr7 >> (2 * i)) & 3) == 0 )
                spurious |= (1 << i);

        printk("#DB at %04x:%p, pending %08x, spurious %x\n",
               regs->cs, _p(regs->ip), pending_dbg ^ spurious, spurious);
        write_dr6(X86_DR6_DEFAULT);

        return true;
    }

    return false;
}

void test_main(void)
{
    extern char l0[] asm ("0f"), l1[] asm ("1f");
    extern char l2[] asm ("2f"), l3[] asm ("3f");
    unsigned int tmp;

    write_cr4(read_cr4() | X86_CR4_DE);

    write_dr0(_u(l0));
    write_dr1(_u(l1));
    write_dr2(_u(l2));
    write_dr3(_u(l3));

    write_dr7(/* DR7_SYM(0, G, X) | */
              /* DR7_SYM(1, G, X) | */
              DR7_SYM(2, G, X) |
              /* DR7_SYM(3, G, X) | */
              X86_DR7_GE);

    asm volatile("mov %%ss, %[tmp];"
                 "pushf;"
                 "pushf;"
                 "orl $"STR(X86_EFLAGS_TF)", (%%"_ASM_SP");"
                 "popf;"
                 "nop;"
                 "0: nop;"
                 "1: mov %[tmp], %%ss;"
                 "2: nop;"
                 "3: popf;"
                 : [tmp] "=r" (tmp));

    /* If the VM is still alive, it didn't suffer a vmentry failure. */
    xtf_success("Success: Not vulnerable to XSA-308\n");
}

$ objdump -d tests/xsa-308/test-hvm64-xsa-308 | grep -A25 '<test_main>:'
001048a0 <test_main>:
  1048a0:    0f 20 e0                 mov    %cr4,%rax
  1048a3:    48 83 c8 08              or     $0x8,%rax
  1048a7:    0f 22 e0                 mov    %rax,%cr4
  1048aa:    b8 df 48 10 00           mov    $0x1048df,%eax
  1048af:    0f 23 c0                 mov    %rax,%db0
  1048b2:    b8 e0 48 10 00           mov    $0x1048e0,%eax
  1048b7:    0f 23 c8                 mov    %rax,%db1
  1048ba:    b8 e2 48 10 00           mov    $0x1048e2,%eax
  1048bf:    0f 23 d0                 mov    %rax,%db2
  1048c2:    b8 e3 48 10 00           mov    $0x1048e3,%eax
  1048c7:    0f 23 d8                 mov    %rax,%db3
  1048ca:    b8 20 02 00 00           mov    $0x220,%eax
  1048cf:    0f 23 f8                 mov    %rax,%db7
  1048d2:    8c d0                    mov    %ss,%eax
  1048d4:    9c                       pushf 
  1048d5:    9c                       pushf 
  1048d6:    81 0c 24 00 01 00 00     orl    $0x100,(%rsp)
  1048dd:    9d                       popf  
  1048de:    90                       nop
  1048df:    90                       nop
  1048e0:    8e d0                    mov    %eax,%ss
  1048e2:    90                       nop
  1048e3:    9d                       popf  
  1048e4:    bf 00 3e 11 00           mov    $0x113e00,%edi
  1048e9:    31 c0                    xor    %eax,%eax

gives

--- Xen Test Framework ---
Environment: HVM 64bit (Long mode 4 levels)
XSA-308 PoC
#DB at 0008:00000000001048df, pending 00004000, spurious 1
#DB at 0008:00000000001048e0, pending 00004000, spurious 2
#DB at 0008:00000000001048e3, pending 00004000, spurious 8
#DB at 0008:00000000001048e4, pending 00004000, spurious 0
Success: Not vulnerable to XSA-308

which suggests that the active code breakpoint in the MovSS shadow is
discarded too, because of no #DB on the 0x1048e2 boundary.

This test is obscured by another bug/misfeature/something where the
B{0..3} get lost on vmexit if BT is also set.

> are mutually exclusive with MOVSS blocking (T-bit task switch),

Howso?  MovSS prevents external interrupts from triggering task
switches, but instruction sources still trigger in a shadow.

> or are
> already handled by KVM (ICEBP, a.k.a. INT1).

Other sources of #DB include RTM debugging, with errata causing a
fault-style #DB pointing at the XBEGIN instruction, rather than
vectoring to the abort handler, and splitlock which is new since I last
thought about this problem.

> This bug was originally found by running tests[1] created for XSA-308[2].
> Note that Xen's userspace test emits ICEBP in the MOVSS shadow, which is
> presumably why the Xen bug was deemed to be an exploitable DOS from guest
> userspace.

As I recall, the original report to the security team was something
along the lines of "Steam has just updated game, and now when I start
it, the VM explodes".

> KVM already handles ICEBP by skipping the ICEBP instruction
> and thus clears MOVSS blocking as a side effect of its "emulation".
>
> [1] http://xenbits.xenproject.org/docs/xtf/xsa-308_2main_8c_source.html

This URL is at the whim of doxygen and not necessarily stable.

https://xenbits.xen.org/gitweb/?p=xtf.git;a=blob;f=tests/xsa-308/main.c
ought to have better longevity, as well as including test description.

~Andrew

2022-01-21 22:24:59

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH] KVM: VMX: Set vmcs.PENDING_DBG.BS on #DB in STI/MOVSS blocking shadow

On Thu, Jan 20, 2022, Andrew Cooper wrote:
> On 20/01/2022 00:06, Sean Christopherson wrote:
> > MOVSS blocking can be initiated by userspace, but can be coincident with
> > a #DB if and only if DR7.GD=1 (General Detect enabled) and a MOV DR is
> > executed in the MOVSS shadow. MOV DR #GPs at CPL>0, thus MOVSS blocking
> > is problematic only for CPL0 (and only if the guest is crazy enough to
> > access a DR in a MOVSS shadow). All other sources of #DBs are either
> > suppressed by MOVSS blocking (single-step, code fetch, data, and I/O),
>
> It is more complicated than this and undocumented.? Single step is
> discard in a shadow, while data breakpoints are deferred.

But for the purposes of making the consitency check happy, whether they are
deferred or dropped should be irrelevant, no?

> > are mutually exclusive with MOVSS blocking (T-bit task switch),
>
> Howso?? MovSS prevents external interrupts from triggering task
> switches, but instruction sources still trigger in a shadow.

T-bit #DBs are traps, and arrive after the task switch has completed. The switch
can be initiated in the shadow, but the #DB will be delivered after the instruction
retires and so after MOVSS blocking goes away. Or am I missing something?

The processor generates a debug exception after a task switch if the T flag of the
new task's TSS is set. This exception is generated after program control has passed
to the new task, and prior to the execution of the first instruction of that task.

> > or are
> > already handled by KVM (ICEBP, a.k.a. INT1).
>
> Other sources of #DB include RTM debugging, with errata causing a
> fault-style #DB pointing at the XBEGIN instruction, rather than
> vectoring to the abort handler

Ugh, I forgot about RTM. That mess should also be mutually exclusive. CLI/STI and
modifying segment registers cause abort, and XBEGIN in the shadow wouldn't activate
the region until XBEGIN retires and the shadow goes away.

> and splitlock which is new since I last thought about this problem.

Eww. Split Lock is trap-like, which begs the question of what happens if the
MOV/POP SS splits a cache line when loading the source data. I'm guess it's
suppressed, a la data breakpoints, but that'd be a fun one to test.

> > This bug was originally found by running tests[1] created for XSA-308[2].
> > Note that Xen's userspace test emits ICEBP in the MOVSS shadow, which is
> > presumably why the Xen bug was deemed to be an exploitable DOS from guest
> > userspace.
>
> As I recall, the original report to the security team was something
> along the lines of "Steam has just updated game, and now when I start
> it, the VM explodes".

Lovely. I wonder if the game added some form of anti-cheat? I don't suppose you
have disassembly from the report? I'm super curious what on earth a game would
do to trigger this.

> > KVM already handles ICEBP by skipping the ICEBP instruction
> > and thus clears MOVSS blocking as a side effect of its "emulation".
> >
> > [1] http://xenbits.xenproject.org/docs/xtf/xsa-308_2main_8c_source.html
>
> This URL is at the whim of doxygen and not necessarily stable.
>
> https://xenbits.xen.org/gitweb/?p=xtf.git;a=blob;f=tests/xsa-308/main.c
> ought to have better longevity, as well as including test description.

Thanks!

2022-01-21 22:27:41

by Andrew Cooper

[permalink] [raw]
Subject: Re: [PATCH] KVM: VMX: Set vmcs.PENDING_DBG.BS on #DB in STI/MOVSS blocking shadow

On 20/01/2022 16:36, Sean Christopherson wrote:
> On Thu, Jan 20, 2022, Andrew Cooper wrote:
>> On 20/01/2022 00:06, Sean Christopherson wrote:
>>> MOVSS blocking can be initiated by userspace, but can be coincident with
>>> a #DB if and only if DR7.GD=1 (General Detect enabled) and a MOV DR is
>>> executed in the MOVSS shadow. MOV DR #GPs at CPL>0, thus MOVSS blocking
>>> is problematic only for CPL0 (and only if the guest is crazy enough to
>>> access a DR in a MOVSS shadow). All other sources of #DBs are either
>>> suppressed by MOVSS blocking (single-step, code fetch, data, and I/O),
>> It is more complicated than this and undocumented.  Single step is
>> discard in a shadow, while data breakpoints are deferred.
> But for the purposes of making the consitency check happy, whether they are
> deferred or dropped should be irrelevant, no?

From that point of view, yes.  The consistency check is specific to TS. 
I suppose I was mostly questioning the wording of the explanation.

>>> are mutually exclusive with MOVSS blocking (T-bit task switch),
>> Howso?  MovSS prevents external interrupts from triggering task
>> switches, but instruction sources still trigger in a shadow.
> T-bit #DBs are traps, and arrive after the task switch has completed. The switch
> can be initiated in the shadow, but the #DB will be delivered after the instruction
> retires and so after MOVSS blocking goes away. Or am I missing something?

Well - this is where the pipeline RTL is needed, in lieu of anything
better.  Trap-style #DBs are part of the current instruction, and
specifically ahead (in the instruction cycle) of the subsequent intchk.

There are implementations where NMI/INTR/etc won't be delivered at the
head of an exception generated in a shadow, which would suggest that
these implementations have the falling edge of the shadow after intchk
on the instruction boundary.  (Probably certainly what happens is that
intchk is responsible for clearing the shadow, but this is entirely
guesswork on my behalf.)

>> and splitlock which is new since I last thought about this problem.
> Eww. Split Lock is trap-like, which begs the question of what happens if the
> MOV/POP SS splits a cache line when loading the source data. I'm guess it's
> suppressed, a la data breakpoints, but that'd be a fun one to test.

They're both reads of their memory operand, so aren't eligible to be
locked accesses.

However, a devious kernel can misalign the GDT/LDT such that setting the
descriptor access bit does trigger a splitlock.  I suppose "kernel
doesn't misalign structures", or "kernel doesn't write a descriptor with
the access bit clear" are both valid mitigations.

>>> This bug was originally found by running tests[1] created for XSA-308[2].
>>> Note that Xen's userspace test emits ICEBP in the MOVSS shadow, which is
>>> presumably why the Xen bug was deemed to be an exploitable DOS from guest
>>> userspace.
>> As I recall, the original report to the security team was something
>> along the lines of "Steam has just updated game, and now when I start
>> it, the VM explodes".
> Lovely. I wonder if the game added some form of anti-cheat? I don't suppose you
> have disassembly from the report? I'm super curious what on earth a game would
> do to trigger this.

Anti-cheat was my guess too, but no disassembly happened.

I was already aware of the STI issue, and had posted
https://lore.kernel.org/xen-devel/[email protected]/
more than a year previously.  The security report showed ICEBP pending
in the INTR_INFO field, and extending the STI test case in light of this
was all of 30s of work to get a working repro.

~Andrew

2022-01-21 22:29:21

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH] KVM: VMX: Set vmcs.PENDING_DBG.BS on #DB in STI/MOVSS blocking shadow

On Thu, Jan 20, 2022, Andrew Cooper wrote:
> On 20/01/2022 16:36, Sean Christopherson wrote:
> > On Thu, Jan 20, 2022, Andrew Cooper wrote:
> >> On 20/01/2022 00:06, Sean Christopherson wrote:
> >>> MOVSS blocking can be initiated by userspace, but can be coincident with
> >>> a #DB if and only if DR7.GD=1 (General Detect enabled) and a MOV DR is
> >>> executed in the MOVSS shadow. MOV DR #GPs at CPL>0, thus MOVSS blocking
> >>> is problematic only for CPL0 (and only if the guest is crazy enough to
> >>> access a DR in a MOVSS shadow). All other sources of #DBs are either
> >>> suppressed by MOVSS blocking (single-step, code fetch, data, and I/O),
> >> It is more complicated than this and undocumented.? Single step is
> >> discard in a shadow, while data breakpoints are deferred.
> > But for the purposes of making the consitency check happy, whether they are
> > deferred or dropped should be irrelevant, no?
>
> From that point of view, yes.? The consistency check is specific to TS.?
> I suppose I was mostly questioning the wording of the explanation.
>
> >>> are mutually exclusive with MOVSS blocking (T-bit task switch),
> >> Howso?? MovSS prevents external interrupts from triggering task
> >> switches, but instruction sources still trigger in a shadow.
> > T-bit #DBs are traps, and arrive after the task switch has completed. The switch
> > can be initiated in the shadow, but the #DB will be delivered after the instruction
> > retires and so after MOVSS blocking goes away. Or am I missing something?
>
> Well - this is where the pipeline RTL is needed, in lieu of anything
> better.? Trap-style #DBs are part of the current instruction, and
> specifically ahead (in the instruction cycle) of the subsequent intchk.

And T-bit traps in particular have crazy high priority...

> There are implementations where NMI/INTR/etc won't be delivered at the
> head of an exception generated in a shadow, which would suggest that
> these implementations have the falling edge of the shadow after intchk
> on the instruction boundary.? (Probably certainly what happens is that
> intchk is responsible for clearing the shadow, but this is entirely
> guesswork on my behalf.)

Well, thankfully hardware's behavior should be moot for VM-Entry since task switches
unconditionally VM-Exit, and KVM has a big fat TODO for handling the T-bit.

> >> and splitlock which is new since I last thought about this problem.
> > Eww. Split Lock is trap-like, which begs the question of what happens if the
> > MOV/POP SS splits a cache line when loading the source data. I'm guess it's
> > suppressed, a la data breakpoints, but that'd be a fun one to test.
>
> They're both reads of their memory operand, so aren't eligible to be
> locked accesses.

Hah, right, the "lock" part of "split lock" is just a minor detail...

> However, a devious kernel can misalign the GDT/LDT such that setting the
> descriptor access bit does trigger a splitlock.? I suppose "kernel
> doesn't misalign structures", or "kernel doesn't write a descriptor with
> the access bit clear" are both valid mitigations.
>
> >>> This bug was originally found by running tests[1] created for XSA-308[2].
> >>> Note that Xen's userspace test emits ICEBP in the MOVSS shadow, which is
> >>> presumably why the Xen bug was deemed to be an exploitable DOS from guest
> >>> userspace.
> >> As I recall, the original report to the security team was something
> >> along the lines of "Steam has just updated game, and now when I start
> >> it, the VM explodes".
> > Lovely. I wonder if the game added some form of anti-cheat? I don't suppose you
> > have disassembly from the report? I'm super curious what on earth a game would
> > do to trigger this.
>
> Anti-cheat was my guess too, but no disassembly happened.
>
> I was already aware of the STI issue, and had posted
> https://lore.kernel.org/xen-devel/[email protected]/
> more than a year previously.? The security report showed ICEBP pending
> in the INTR_INFO field, and extending the STI test case in light of this
> was all of 30s of work to get a working repro.
>
> ~Andrew

2022-01-24 19:22:46

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH] KVM: VMX: Set vmcs.PENDING_DBG.BS on #DB in STI/MOVSS blocking shadow

On 1/20/22 01:06, Sean Christopherson wrote:
> Set vmcs.GUEST_PENDING_DBG_EXCEPTIONS.BS, a.k.a. the pending single-step
> breakpoint flag, when re-injecting a #DB with RFLAGS.TF=1, and STI or
> MOVSS blocking is active. Setting the flag is necessary to make VM-Entry
> consistency checks happy, as VMX has an invariant that if RFLAGS.TF is
> set and STI/MOVSS blocking is true, then the previous instruction must
> have been STI or MOV/POP, and therefore a single-step #DB must be pending
> since the RFLAGS.TF cannot have been set by the previous instruction,
> i.e. the one instruction delay after setting RFLAGS.TF must have already
> expired.
>
> Normally, the CPU sets vmcs.GUEST_PENDING_DBG_EXCEPTIONS.BS appropriately
> when recording guest state as part of a VM-Exit, but #DB VM-Exits
> intentionally do not treat the #DB as "guest state" as interception of
> the #DB effectively makes the #DB host-owned, thus KVM needs to manually
> set PENDING_DBG.BS when forwarding/re-injecting the #DB to the guest.
>
> Note, although this bug can be triggered by guest userspace, doing so
> requires IOPL=3, and guest userspace running with IOPL=3 has full access
> to all I/O ports (from the guest's perspective) and can crash/reboot the
> guest any number of ways. IOPL=3 is required because STI blocking kicks
> in if and only if RFLAGS.IF is toggled 0=>1, and if CPL>IOPL, STI either
> takes a #GP or modifies RFLAGS.VIF, not RFLAGS.IF.
>
> MOVSS blocking can be initiated by userspace, but can be coincident with
> a #DB if and only if DR7.GD=1 (General Detect enabled) and a MOV DR is
> executed in the MOVSS shadow. MOV DR #GPs at CPL>0, thus MOVSS blocking
> is problematic only for CPL0 (and only if the guest is crazy enough to
> access a DR in a MOVSS shadow). All other sources of #DBs are either
> suppressed by MOVSS blocking (single-step, code fetch, data, and I/O),
> are mutually exclusive with MOVSS blocking (T-bit task switch), or are
> already handled by KVM (ICEBP, a.k.a. INT1).
>
> This bug was originally found by running tests[1] created for XSA-308[2].
> Note that Xen's userspace test emits ICEBP in the MOVSS shadow, which is
> presumably why the Xen bug was deemed to be an exploitable DOS from guest
> userspace. KVM already handles ICEBP by skipping the ICEBP instruction
> and thus clears MOVSS blocking as a side effect of its "emulation".
>
> [1] http://xenbits.xenproject.org/docs/xtf/xsa-308_2main_8c_source.html
> [2] https://xenbits.xen.org/xsa/advisory-308.html
>
> Reported-by: David Woodhouse <[email protected]>
> Reported-by: Alexander Graf <[email protected]>
> Signed-off-by: Sean Christopherson <[email protected]>
> ---
> arch/x86/kvm/vmx/vmx.c | 25 +++++++++++++++++++++++++
> 1 file changed, 25 insertions(+)
>
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index a02a28ce7cc3..3f7b09a24d1e 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -4901,8 +4901,33 @@ static int handle_exception_nmi(struct kvm_vcpu *vcpu)
> dr6 = vmx_get_exit_qual(vcpu);
> if (!(vcpu->guest_debug &
> (KVM_GUESTDBG_SINGLESTEP | KVM_GUESTDBG_USE_HW_BP))) {
> + /*
> + * If the #DB was due to ICEBP, a.k.a. INT1, skip the
> + * instruction. ICEBP generates a trap-like #DB, but
> + * despite its interception control being tied to #DB,
> + * is an instruction intercept, i.e. the VM-Exit occurs
> + * on the ICEBP itself. Note, skipping ICEBP also
> + * clears STI and MOVSS blocking.
> + *
> + * For all other #DBs, set vmcs.PENDING_DBG_EXCEPTIONS.BS
> + * if single-step is enabled in RFLAGS and STI or MOVSS
> + * blocking is active, as the CPU doesn't set the bit
> + * on VM-Exit due to #DB interception. VM-Entry has a
> + * consistency check that a single-step #DB is pending
> + * in this scenario as the previous instruction cannot
> + * have toggled RFLAGS.TF 0=>1 (because STI and POP/MOV
> + * don't modify RFLAGS), therefore the one instruction
> + * delay when activating single-step breakpoints must
> + * have already expired. Note, the CPU sets/clears BS
> + * as appropriate for all other VM-Exits types.
> + */
> if (is_icebp(intr_info))
> WARN_ON(!skip_emulated_instruction(vcpu));
> + else if ((vmx_get_rflags(vcpu) & X86_EFLAGS_TF) &&
> + (vmcs_read32(GUEST_INTERRUPTIBILITY_INFO) &
> + (GUEST_INTR_STATE_STI | GUEST_INTR_STATE_MOV_SS)))
> + vmcs_writel(GUEST_PENDING_DBG_EXCEPTIONS,
> + vmcs_readl(GUEST_PENDING_DBG_EXCEPTIONS) | DR6_BS);
>
> kvm_queue_exception_p(vcpu, DB_VECTOR, dr6);
> return 1;
>
> base-commit: edb9e50dbe18394d0fc9d0494f5b6046fc912d33

Queued, thanks.

Paolo