2019-07-04 17:08:12

by Thomas Gleixner

Subject: [patch V2 04/25] x86/apic: Make apic_pending_intr_clear() more robust

In the course of developing shorthand-based IPI support, issues were
observed with the function which tries to clear eventually pending ISR
bits in the local APIC.

1) 0-day testing triggered the WARN_ON() in apic_pending_intr_clear().

This warning is emitted when the function fails to clear pending ISR
bits or observes pending IRR bits which are not delivered to the CPU
after the stale ISR bit(s) are ACK'ed.

Unfortunately the function only emits a WARN_ON() and fails to dump
the IRR/ISR content. That's useless for debugging.

Feng added spot-on debug printk's which revealed that the stale IRR
bit belonged to the APIC timer interrupt vector, but adding ad hoc
debug code does not help with sporadic failures in the field.

Rework the loop so the full IRR/ISR contents are saved and dumped on
failure.

2) The loop termination logic is interesting at best.

If the machine has no TSC or cpu_khz is not known yet it tries 1
million times to ack stale IRR/ISR bits. What?

With TSC it uses the TSC to calculate the loop termination. It takes a
timestamp at entry and terminates the loop when:

(rdtsc() - start_timestamp) >= (cpu_khz << 10)

That's roughly one second.
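
(For reference, a back-of-the-envelope check of that claim based on the
existing code, assuming the TSC ticks at roughly cpu_khz * 1000 cycles
per second:)

	max_loops = (long long)cpu_khz << 10;	/* cpu_khz * 1024 TSC cycles */
						/* 1024 / 1000 ~= 1.02 sec   */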

Both methods are problematic. The APIC has 256 vectors, which means
that in theory max. 256 IRR/ISR bits can be set. In practice this is
impossible as the first 32 vectors are reserved and not affected and
the chance that more than a few bits are set is close to zero.

With the pure loop based approach the 1 million retries are complete
overkill.

With TSC this can terminate too early in a guest which is running on a
heavily loaded host even with only a couple of IRR/ISR bits set. The
reason is that after acknowledging the highest priority ISR bit,
pending IRRs must get serviced first before the next round of
acknowledge can take place as the APIC (real and virtualized) does not
honour EOI without a preceding interrupt on the CPU. And every APIC
read/write takes a VMEXIT if the APIC is virtualized. While trying to
reproduce the issue 0-day reported it was observed that the guest was
scheduled out long enough under heavy load that it terminated after 8
iterations.

Make the loop terminate after 512 iterations. That's plenty enough
in any case and does not take endless time to complete.

Signed-off-by: Thomas Gleixner <[email protected]>
---
arch/x86/kernel/apic/apic.c | 111 +++++++++++++++++++++++++-------------------
1 file changed, 65 insertions(+), 46 deletions(-)

--- a/arch/x86/kernel/apic/apic.c
+++ b/arch/x86/kernel/apic/apic.c
@@ -1424,54 +1424,72 @@ static void lapic_setup_esr(void)
oldvalue, value);
}

+#define APIC_IR_REGS APIC_ISR_NR
+#define APIC_IR_BITS (APIC_IR_REGS * 32)
+#define APIC_IR_MAPSIZE (APIC_IR_BITS / BITS_PER_LONG)
+
+union apic_ir {
+ unsigned long map[APIC_IR_MAPSIZE];
+ u32 regs[APIC_IR_REGS];
+};
+
+static bool apic_check_and_ack(union apic_ir *irr, union apic_ir *isr)
+{
+ int i, bit;
+
+ /* Read the IRRs */
+ for (i = 0; i < APIC_IR_REGS; i++)
+ irr->regs[i] = apic_read(APIC_IRR + i * 0x10);
+
+ /* Read the ISRs */
+ for (i = 0; i < APIC_IR_REGS; i++)
+ isr->regs[i] = apic_read(APIC_ISR + i * 0x10);
+
+ /*
+ * If the ISR map is not empty, ACK the APIC and run another round
+ * to verify whether a pending IRR has been unblocked and turned
+ * into an ISR.
+ */
+ if (!bitmap_empty(isr->map, APIC_IR_BITS)) {
+ /*
+ * There can be multiple ISR bits set when a high priority
+ * interrupt preempted a lower priority one. Issue an ACK
+ * per set bit.
+ */
+ for_each_set_bit(bit, isr->map, APIC_IR_BITS)
+ ack_APIC_irq();
+ return true;
+ }
+
+ return !bitmap_empty(irr->map, APIC_IR_BITS);
+}
+
+/*
+ * After a crash, we no longer service the interrupts and a pending
+ * interrupt from previous kernel might still have ISR bit set.
+ *
+ * Most probably by now the CPU has serviced that pending interrupt and it
+ * might not have done the ack_APIC_irq() because it thought the interrupt
+ * came from i8259 as ExtInt. LAPIC did not get EOI so it does not clear
+ * the ISR bit and the CPU thinks it has already serviced the interrupt. Hence
+ * a vector might get locked. It was noticed for timer irq (vector
+ * 0x31). Issue an extra EOI to clear ISR.
+ *
+ * If there are pending IRR bits they turn into ISR bits after a higher
+ * priority ISR bit has been acked.
+ */
static void apic_pending_intr_clear(void)
{
- long long max_loops = cpu_khz ? cpu_khz : 1000000;
- unsigned long long tsc = 0, ntsc;
- unsigned int queued;
- unsigned long value;
- int i, j, acked = 0;
-
- if (boot_cpu_has(X86_FEATURE_TSC))
- tsc = rdtsc();
- /*
- * After a crash, we no longer service the interrupts and a pending
- * interrupt from previous kernel might still have ISR bit set.
- *
- * Most probably by now CPU has serviced that pending interrupt and
- * it might not have done the ack_APIC_irq() because it thought,
- * interrupt came from i8259 as ExtInt. LAPIC did not get EOI so it
- * does not clear the ISR bit and cpu thinks it has already serivced
- * the interrupt. Hence a vector might get locked. It was noticed
- * for timer irq (vector 0x31). Issue an extra EOI to clear ISR.
- */
- do {
- queued = 0;
- for (i = APIC_ISR_NR - 1; i >= 0; i--)
- queued |= apic_read(APIC_IRR + i*0x10);
-
- for (i = APIC_ISR_NR - 1; i >= 0; i--) {
- value = apic_read(APIC_ISR + i*0x10);
- for_each_set_bit(j, &value, 32) {
- ack_APIC_irq();
- acked++;
- }
- }
- if (acked > 256) {
- pr_err("LAPIC pending interrupts after %d EOI\n", acked);
- break;
- }
- if (queued) {
- if (boot_cpu_has(X86_FEATURE_TSC) && cpu_khz) {
- ntsc = rdtsc();
- max_loops = (long long)cpu_khz << 10;
- max_loops -= ntsc - tsc;
- } else {
- max_loops--;
- }
- }
- } while (queued && max_loops > 0);
- WARN_ON(max_loops <= 0);
+ union apic_ir irr, isr;
+ unsigned int i;
+
+ /* 512 loops are way oversized and give the APIC a chance to obey. */
+ for (i = 0; i < 512; i++) {
+ if (!apic_check_and_ack(&irr, &isr))
+ return;
+ }
+ /* Dump the IRR/ISR content if that failed */
+ pr_warn("APIC: Stale IRR: %256pb ISR: %256pb\n", irr.map, isr.map);
}

/**
@@ -1544,6 +1562,7 @@ static void setup_local_APIC(void)
value &= ~APIC_TPRI_MASK;
apic_write(APIC_TASKPRI, value);

+ /* Clear eventually stale ISR/IRR bits */
apic_pending_intr_clear();

/*



2019-07-05 17:07:39

by Andrew Cooper

Subject: Re: [patch V2 04/25] x86/apic: Make apic_pending_intr_clear() more robust

On 04/07/2019 16:51, Thomas Gleixner wrote:
> 2) The loop termination logic is interesting at best.
>
> If the machine has no TSC or cpu_khz is not known yet it tries 1
> million times to ack stale IRR/ISR bits. What?
>
> With TSC it uses the TSC to calculate the loop termination. It takes a
> timestamp at entry and terminates the loop when:
>
> (rdtsc() - start_timestamp) >= (cpu_khz << 10)
>
> That's roughly one second.
>
> Both methods are problematic. The APIC has 256 vectors, which means
> that in theory max. 256 IRR/ISR bits can be set. In practice this is
> impossible as the first 32 vectors are reserved and not affected and
> the chance that more than a few bits are set is close to zero.

[Disclaimer.  I talked to Thomas in private first, and he asked me to
post this publicly as the CVE is almost a decade old already.]

I'm afraid that this isn't quite true.

In terms of IDT vectors, the first 32 are reserved for exceptions, but
only the first 16 are reserved in the LAPIC.  Vectors 16-31 are fair
game for incoming IPIs (SDM Vol3, 10.5.2 Valid Interrupt Vectors).

In practice, this makes Linux vulnerable to CVE-2011-1898 / XSA-3, which
I'm disappointed to see wasn't shared with other software vendors at the
time.

Because TPR is 0, an incoming IPI can trigger #AC, #CP, #VC or #SX
without an error code on the stack, which results in a corrupt pt_regs
in the exception handler, and a stack underflow on the way back out,
most likely with a fault on IRET.

These can be addressed by setting TPR to 0x10, which will inhibit
delivery of any errant IPIs in this range, but some extra sanity logic
may not go amiss.  An error code on a 64bit stack can be spotted with
`testb $8, %spl` due to %rsp being aligned before pushing the exception
frame.
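
(For illustration, a minimal sketch of what the TPR part looks like in
setup_local_APIC(); the priority-class reasoning is from the SDM, and
Andy posts essentially this hunk later in the thread:)

	/*
	 * The LAPIC only delivers a fixed-mode interrupt whose priority
	 * class (vector >> 4) is above TPR[7:4].  With TPR = 0x10
	 * (class 1), vectors 0x10-0x1f stay pending in IRR and are
	 * never delivered, while vectors >= 0x20 are unaffected.
	 */
	value = apic_read(APIC_TASKPRI);
	value &= ~APIC_TPRI_MASK;
	value |= 0x10;			/* inhibit vectors 16-31 */
	apic_write(APIC_TASKPRI, value);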

Another interesting problem is an IPI whose vector is 0x80.  A cunning
attacker can use this to simulate system calls from unsuspecting
positions in userspace, or for interrupting kernel context.  At the very
least the int0x80 path does an unconditional swapgs, so will try to run
with the user gs, and I expect things will explode quickly from there.

One option here is to look at ISR and complain if it is found to be set.

Another option, which I've only just remembered, is that AMD hardware
has the Interrupt Enable Register in its extended APIC space, which may
or may not be good enough to prohibit delivery of 0x80.  There isn't
enough information in the APM to be clear, but the name suggests it is
worth experimenting with.

~Andrew

2019-07-05 19:21:36

by Nadav Amit

Subject: Re: [patch V2 04/25] x86/apic: Make apic_pending_intr_clear() more robust

> On Jul 5, 2019, at 8:47 AM, Andrew Cooper <[email protected]> wrote:
>
> On 04/07/2019 16:51, Thomas Gleixner wrote:
>> 2) The loop termination logic is interesting at best.
>>
>> If the machine has no TSC or cpu_khz is not known yet it tries 1
>> million times to ack stale IRR/ISR bits. What?
>>
>> With TSC it uses the TSC to calculate the loop termination. It takes a
>> timestamp at entry and terminates the loop when:
>>
>> (rdtsc() - start_timestamp) >= (cpu_khz << 10)
>>
>> That's roughly one second.
>>
>> Both methods are problematic. The APIC has 256 vectors, which means
>> that in theory max. 256 IRR/ISR bits can be set. In practice this is
>> impossible as the first 32 vectors are reserved and not affected and
>> the chance that more than a few bits are set is close to zero.
>
> [Disclaimer. I talked to Thomas in private first, and he asked me to
> post this publicly as the CVE is almost a decade old already.]
>
> I'm afraid that this isn't quite true.
>
> In terms of IDT vectors, the first 32 are reserved for exceptions, but
> only the first 16 are reserved in the LAPIC. Vectors 16-31 are fair
> game for incoming IPIs (SDM Vol3, 10.5.2 Valid Interrupt Vectors).
>
> In practice, this makes Linux vulnerable to CVE-2011-1898 / XSA-3, which
> I'm disappointed to see wasn't shared with other software vendors at the
> time.

IIRC (and from skimming the CVE again) the basic problem in Xen was that
MSIs can be used when devices are assigned to generate IRQs with arbitrary
vectors. The mitigation was to require interrupt remapping to be enabled in
the IOMMU when IOMMU is used for DMA remapping (i.e., device assignment).

Are you concerned about this case, additional concrete ones, or is it about
security hardening? (or am I missing something?)

2019-07-05 19:49:07

by Andy Lutomirski

Subject: Re: [patch V2 04/25] x86/apic: Make apic_pending_intr_clear() more robust

On Fri, Jul 5, 2019 at 8:47 AM Andrew Cooper <[email protected]> wrote:
>
> On 04/07/2019 16:51, Thomas Gleixner wrote:
> > 2) The loop termination logic is interesting at best.
> >
> > If the machine has no TSC or cpu_khz is not known yet it tries 1
> > million times to ack stale IRR/ISR bits. What?
> >
> > With TSC it uses the TSC to calculate the loop termination. It takes a
> > timestamp at entry and terminates the loop when:
> >
> > (rdtsc() - start_timestamp) >= (cpu_khz << 10)
> >
> > That's roughly one second.
> >
> > Both methods are problematic. The APIC has 256 vectors, which means
> > that in theory max. 256 IRR/ISR bits can be set. In practice this is
> > impossible as the first 32 vectors are reserved and not affected and
> > the chance that more than a few bits are set is close to zero.
>
> [Disclaimer. I talked to Thomas in private first, and he asked me to
> post this publicly as the CVE is almost a decade old already.]
>
> I'm afraid that this isn't quite true.
>
> In terms of IDT vectors, the first 32 are reserved for exceptions, but
> only the first 16 are reserved in the LAPIC. Vectors 16-31 are fair
> game for incoming IPIs (SDM Vol3, 10.5.2 Valid Interrupt Vectors).
>
> In practice, this makes Linux vulnerable to CVE-2011-1898 / XSA-3, which
> I'm disappointed to see wasn't shared with other software vendors at the
> time.
>
> Because TPR is 0, an incoming IPI can trigger #AC, #CP, #VC or #SX
> without an error code on the stack, which results in a corrupt pt_regs
> in the exception handler, and a stack underflow on the way back out,
> most likely with a fault on IRET.
>
> These can be addressed by setting TPR to 0x10, which will inhibit
> delivery of any errant IPIs in this range, but some extra sanity logic
> may not go amiss. An error code on a 64bit stack can be spotted with
> `testb $8, %spl` due to %rsp being aligned before pushing the exception
> frame.

Several years ago, I remember having a discussion with someone (Jan
Beulich, maybe?) about how to efficiently make the entry code figure
out the error code situation automatically. I suspect it was on IRC
and I can't find the logs. I'm thinking that maybe we should just
make Linux's idtentry code do something like this.

If nothing else, we could make idtentry do:

testl $8, %esp /* shorter than testb IIRC */
jz 1f /* or jnz -- too lazy to figure it out */
pushq $-1
1:

instead of the current hardcoded push. The cost of a mispredicted
branch here will be smallish compared to the absurdly large cost of
the entry itself. But I thought I had something more clever than
this. This sequence works, but it still feels like it should be
possible to do better:

.macro PUSH_ERROR_IF_NEEDED
/*
* Before the IRET frame is pushed, RSP is aligned to a 16-byte
* boundary. After SS .. RIP and the error code are pushed, RSP is
* once again aligned. Pushing -1 will put -1 in the error code slot
* (regs->orig_ax) if there was no error code.
*/

pushq $-1 /* orig_ax = -1, maybe */
/* now RSP points to orig_ax (aligned) or di (misaligned) */
pushq $0
/* now RSP points to di (misaligned) or si (aligned) */
orq $8, %rsp
/* now RSP points to di */
addq $8, %rsp
/* now RSP points to orig_ax, and we're in good shape */
.endm

Is there a better sequence for this?

A silly downside here is that the ORC annotations need to know the
offset to the IRET frame. Josh, how ugly would it be to teach the
unwinder that UNWIND_HINT_IRET_REGS actually means "hey, maybe I'm 8
bytes off -- please realign RSP when doing your calculation"?

FWIW, the entry code is rather silly here in that it actually only
uses the orig_ax slot as a temporary dumping ground for the error code
and then it replaces it with -1 later on. I don't remember whether
anything still cares about the -1. Once upon a time, there was some
code that assumed that -1 meant "not in a syscall" and anything else
meant "in a syscall", but, if so, I suspect we should just kill that
code regardless.


>
> Another interesting problem is an IPI whose vector is 0x80. A cunning
> attacker can use this to simulate system calls from unsuspecting
> positions in userspace, or for interrupting kernel context. At the very
> least the int0x80 path does an unconditional swapgs, so will try to run
> with the user gs, and I expect things will explode quickly from there.

At least SMAP helps here on non-FSGSBASE systems. With FSGSBASE, I
suppose we could harden this by adding a special check to int $0x80 to
validate GSBASE.

>
> One option here is to look at ISR and complain if it is found to be set.

Barring some real hackery, we're toast long before we get far enough to do that.

2019-07-05 20:26:52

by Thomas Gleixner

Subject: Re: [patch V2 04/25] x86/apic: Make apic_pending_intr_clear() more robust

Andrew,

On Fri, 5 Jul 2019, Andrew Cooper wrote:

> On 04/07/2019 16:51, Thomas Gleixner wrote:
> > 2) The loop termination logic is interesting at best.
> >
> > If the machine has no TSC or cpu_khz is not known yet it tries 1
> > million times to ack stale IRR/ISR bits. What?
> >
> > With TSC it uses the TSC to calculate the loop termination. It takes a
> > timestamp at entry and terminates the loop when:
> >
> > (rdtsc() - start_timestamp) >= (cpu_khz << 10)
> >
> > That's roughly one second.
> >
> > Both methods are problematic. The APIC has 256 vectors, which means
> > that in theory max. 256 IRR/ISR bits can be set. In practice this is
> > impossible as the first 32 vectors are reserved and not affected and
> > the chance that more than a few bits are set is close to zero.
>
> [Disclaimer.  I talked to Thomas in private first, and he asked me to
> post this publicly as the CVE is almost a decade old already.]

thanks for bringing this up!

> I'm afraid that this isn't quite true.
>
> In terms of IDT vectors, the first 32 are reserved for exceptions, but
> only the first 16 are reserved in the LAPIC.  Vectors 16-31 are fair
> game for incoming IPIs (SDM Vol3, 10.5.2 Valid Interrupt Vectors).

Indeed.

> In practice, this makes Linux vulnerable to CVE-2011-1898 / XSA-3, which
> I'm disappointed to see wasn't shared with other software vendors at the
> time.

No comment.

> Because TPR is 0, an incoming IPI can trigger #AC, #CP, #VC or #SX
> without an error code on the stack, which results in a corrupt pt_regs
> in the exception handler, and a stack underflow on the way back out,
> most likely with a fault on IRET.
>
> These can be addressed by setting TPR to 0x10, which will inhibit

Right, that's easy and obvious.

> delivery of any errant IPIs in this range, but some extra sanity logic
> may not go amiss.  An error code on a 64bit stack can be spotted with
> `testb $8, %spl` due to %rsp being aligned before pushing the exception
> frame.

The question is what we do with that information :)

> Another interesting problem is an IPI whose vector is 0x80.  A cunning
> attacker can use this to simulate system calls from unsuspecting
> positions in userspace, or for interrupting kernel context.  At the very
> least the int0x80 path does an unconditional swapgs, so will try to run
> with the user gs, and I expect things will explode quickly from there.

Cute.

> One option here is to look at ISR and complain if it is found to be set.

That's sloooow, but could at least provide an option to do so.

> Another option, which I've only just remembered, is that AMD hardware
> has the Interrupt Enable Register in its extended APIC space, which may
> or may not be good enough to prohibit delivery of 0x80.  There isn't
> enough information in the APM to be clear, but the name suggests it is
> worth experimenting with.

I doubt it. Clearing a bit in the IER takes the interrupt out of the
priority decision logic. That's an SVM feature so that interrupts directed
directly to guests cannot block other interrupts if they are not
serviced. It's grossly misnamed and won't help with the int80 issue.

The more interesting question is whether this is all relevant. If I
understood the issue correctly then this is mitigated by proper interrupt
remapping.

Is there any serious usage of virtualization w/o interrupt remapping left
or have the machines which are not capable been retired already?

Thanks,

tglx

2019-07-05 20:37:22

by Thomas Gleixner

Subject: Re: [patch V2 04/25] x86/apic: Make apic_pending_intr_clear() more robust

On Fri, 5 Jul 2019, Andy Lutomirski wrote:
> On Fri, Jul 5, 2019 at 8:47 AM Andrew Cooper <[email protected]> wrote:
> > Because TPR is 0, an incoming IPI can trigger #AC, #CP, #VC or #SX
> > without an error code on the stack, which results in a corrupt pt_regs
> > in the exception handler, and a stack underflow on the way back out,
> > most likely with a fault on IRET.
> >
> > These can be addressed by setting TPR to 0x10, which will inhibit
> > delivery of any errant IPIs in this range, but some extra sanity logic
> > may not go amiss. An error code on a 64bit stack can be spotted with
> > `testb $8, %spl` due to %rsp being aligned before pushing the exception
> > frame.
>
> Several years ago, I remember having a discussion with someone (Jan
> Beulich, maybe?) about how to efficiently make the entry code figure
> out the error code situation automatically. I suspect it was on IRC
> and I can't find the logs. I'm thinking that maybe we should just
> make Linux's idtentry code do something like this.
>
> If nothing else, we could make idtentry do:
>
> testl $8, %esp /* shorter than testb IIRC */
> jz 1f /* or jnz -- too lazy to figure it out */
> pushq $-1
> 1:

Errm, no. We should not silently paper over it. If we detect that this came
in with a wrong stack frame, i.e. not from a CPU originated exception, then
we truly should yell loud. Also in that case you want to check the APIC:ISR
and issue an EOI to clear it.

> > Another interesting problem is an IPI whose vector is 0x80. A cunning
> > attacker can use this to simulate system calls from unsuspecting
> > positions in userspace, or for interrupting kernel context. At the very
> > least the int0x80 path does an unconditional swapgs, so will try to run
> > with the user gs, and I expect things will explode quickly from there.
>
> At least SMAP helps here on non-FSGSBASE systems. With FSGSBASE, I

How does it help? It still crashes the kernel.

> suppose we could harden this by adding a special check to int $0x80 to
> validate GSBASE.

> > One option here is to look at ISR and complain if it is found to be set.
>
> Barring some real hackery, we're toast long before we get far enough to
> do that.

No. We can map the APIC into the user space visible page tables for PTI
without compromising the PTI isolation and it can be read very early on
before SWAPGS. All you need is a register to clobber, not more. If the ISR
is set, then go into an error path, yell loudly, issue EOI and return.
The only issue I can see is: It's slow :)
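
(A purely hypothetical sketch of such a check, written as C for
readability; apic_user_va is an assumed PTI-visible mapping of the xAPIC
page, and vector 0x80 lives in ISR register APIC_ISR + 0x10 * (0x80 / 32),
bit 0x80 % 32:)

	/* Hypothetical early check, before SWAPGS has run */
	u32 isr = *(volatile u32 *)(apic_user_va + APIC_ISR + 0x10 * (0x80 >> 5));

	if (isr & BIT(0x80 & 31)) {
		/*
		 * Vector 0x80 is in service, i.e. this arrived as an APIC
		 * interrupt rather than via int $0x80: yell loudly, issue
		 * the EOI and return without entering the syscall path.
		 */
	}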

Thanks,

tglx


2019-07-05 20:38:28

by Andy Lutomirski

Subject: Re: [patch V2 04/25] x86/apic: Make apic_pending_intr_clear() more robust

On Fri, Jul 5, 2019 at 1:25 PM Thomas Gleixner <[email protected]> wrote:
>
> Andrew,
>
> >
> > These can be addressed by setting TPR to 0x10, which will inhibit
>
> Right, that's easy and obvious.
>

This boots:

diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c
index 177aa8ef2afa..5257c40bde6c 100644
--- a/arch/x86/kernel/apic/apic.c
+++ b/arch/x86/kernel/apic/apic.c
@@ -1531,11 +1531,14 @@ static void setup_local_APIC(void)
#endif

/*
- * Set Task Priority to 'accept all'. We never change this
- * later on.
+ * Set Task Priority to 'accept all except vectors 0-31'. An APIC
+ * vector in the 16-31 range can be delivered otherwise, but we'll
+ * think it's an exception and terrible things will happen.
+ * We never change this later on.
*/
value = apic_read(APIC_TASKPRI);
value &= ~APIC_TPRI_MASK;
+ value |= 0x10;
apic_write(APIC_TASKPRI, value);

apic_pending_intr_clear();

2019-07-05 20:40:56

by Andy Lutomirski

Subject: Re: [patch V2 04/25] x86/apic: Make apic_pending_intr_clear() more robust

On Fri, Jul 5, 2019 at 1:36 PM Thomas Gleixner <[email protected]> wrote:
>
> On Fri, 5 Jul 2019, Andy Lutomirski wrote:
> > On Fri, Jul 5, 2019 at 8:47 AM Andrew Cooper <[email protected]> wrote:
> > > Because TPR is 0, an incoming IPI can trigger #AC, #CP, #VC or #SX
> > > without an error code on the stack, which results in a corrupt pt_regs
> > > in the exception handler, and a stack underflow on the way back out,
> > > most likely with a fault on IRET.
> > >
> > > These can be addressed by setting TPR to 0x10, which will inhibit
> > > delivery of any errant IPIs in this range, but some extra sanity logic
> > > may not go amiss. An error code on a 64bit stack can be spotted with
> > > `testb $8, %spl` due to %rsp being aligned before pushing the exception
> > > frame.
> >
> > Several years ago, I remember having a discussion with someone (Jan
> > Beulich, maybe?) about how to efficiently make the entry code figure
> > out the error code situation automatically. I suspect it was on IRC
> > and I can't find the logs. I'm thinking that maybe we should just
> > make Linux's idtentry code do something like this.
> >
> > If nothing else, we could make idtentry do:
> >
> > testl $8, %esp /* shorter than testb IIRC */
> > jz 1f /* or jnz -- too lazy to figure it out */
> > pushq $-1
> > 1:
>
> Errm, no. We should not silently paper over it. If we detect that this came
> in with a wrong stack frame, i.e. not from a CPU originated exception, then
> we truly should yell loud. Also in that case you want to check the APIC:ISR
> and issue an EOI to clear it.

It gives us the option to replace idtentry with something
table-driven. I don't think I love it, but it's not an awful idea.



>
> > > Another interesting problem is an IPI whose vector is 0x80. A cunning
> > > attacker can use this to simulate system calls from unsuspecting
> > > positions in userspace, or for interrupting kernel context. At the very
> > > least the int0x80 path does an unconditional swapgs, so will try to run
> > > with the user gs, and I expect things will explode quickly from there.
> >
> > At least SMAP helps here on non-FSGSBASE systems. With FSGSBASE, I
>
> How does it help? It still crashes the kernel.
>
> > suppose we could harden this by adding a special check to int $0x80 to
> > validate GSBASE.
>
> > > One option here is to look at ISR and complain if it is found to be set.
> >
> > Barring some real hackery, we're toast long before we get far enough to
> > do that.
>
> No. We can map the APIC into the user space visible page tables for PTI
> without compromising the PTI isolation and it can be read very early on
> before SWAPGS. All you need is a register to clobber, not more. If the ISR
> is set, then go into an error path, yell loudly, issue EOI and return.
> The only issue I can see is: It's slow :)
>
>

I think this will be really extremely slow. If we can restrict this
to x2apic machines, then maybe it's not so awful.

FWIW, if we just patch up the GS thing, then we are still vulnerable:
the bad guy can arrange for a privileged process to have register
state corresponding to a dangerous syscall and then send an int $0x80
via the APIC.

2019-07-05 20:49:20

by Andrew Cooper

Subject: Re: [patch V2 04/25] x86/apic: Make apic_pending_intr_clear() more robust

On 05/07/2019 20:19, Nadav Amit wrote:
>> On Jul 5, 2019, at 8:47 AM, Andrew Cooper <[email protected]> wrote:
>>
>> On 04/07/2019 16:51, Thomas Gleixner wrote:
>>> 2) The loop termination logic is interesting at best.
>>>
>>> If the machine has no TSC or cpu_khz is not known yet it tries 1
>>> million times to ack stale IRR/ISR bits. What?
>>>
>>> With TSC it uses the TSC to calculate the loop termination. It takes a
>>> timestamp at entry and terminates the loop when:
>>>
>>> (rdtsc() - start_timestamp) >= (cpu_khz << 10)
>>>
>>> That's roughly one second.
>>>
>>> Both methods are problematic. The APIC has 256 vectors, which means
>>> that in theory max. 256 IRR/ISR bits can be set. In practice this is
>>> impossible as the first 32 vectors are reserved and not affected and
>>> the chance that more than a few bits are set is close to zero.
>> [Disclaimer. I talked to Thomas in private first, and he asked me to
>> post this publicly as the CVE is almost a decade old already.]
>>
>> I'm afraid that this isn't quite true.
>>
>> In terms of IDT vectors, the first 32 are reserved for exceptions, but
>> only the first 16 are reserved in the LAPIC. Vectors 16-31 are fair
>> game for incoming IPIs (SDM Vol3, 10.5.2 Valid Interrupt Vectors).
>>
>> In practice, this makes Linux vulnerable to CVE-2011-1898 / XSA-3, which
>> I'm disappointed to see wasn't shared with other software vendors at the
>> time.
> IIRC (and from skimming the CVE again) the basic problem in Xen was that
> MSIs can be used when devices are assigned to generate IRQs with arbitrary
> vectors. The mitigation was to require interrupt remapping to be enabled in
> the IOMMU when IOMMU is used for DMA remapping (i.e., device assignment).
>
> Are you concerned about this case, additional concrete ones, or is it about
> security hardening? (or am I missing something?)

The phrase "impossible as the first 32 vectors are reserved" stuck out,
because it's not true.  That generally means that any logic derived from
it is also false. :)

In practice, I was thinking more about robustness against buggy
conditions.  Setting TPR to 1 at start of day is very easy.  Some of the
other protections, less so.

When it comes to virtualisation, security is an illusion when a guest
kernel has a real piece of hardware in its hands.  Anyone who is under
the misapprehension otherwise should try talking to a IOMMU hardware
engineer and see the reaction on their face.  IOMMUs were designed to
isolate devices when all controlling software was of the same privilege
level.  They don't magically make the system safe against a hostile
guest device driver, which in the most basic case, can still mount a DoS
attempt with deliberately bad DMA.

~Andrew

2019-07-05 20:50:49

by Paolo Bonzini

Subject: Re: [patch V2 04/25] x86/apic: Make apic_pending_intr_clear() more robust

On 05/07/19 22:25, Thomas Gleixner wrote:
> In practice, this makes Linux vulnerable to CVE-2011-1898 / XSA-3, which
> I'm disappointed to see wasn't shared with other software vendors at the
> time.

Oh, that brings back memories. At the time I was working on Xen, so I
remember that CVE. IIRC there was some mitigation but the fix was
basically to print a very scary error message if you used VT-d without
interrupt remapping. Maybe force the user to add something on the Xen
command line too?

> The more interesting question is whether this is all relevant. If I
> understood the issue correctly then this is mitigated by proper interrupt
> remapping.

Yes, and for Linux we're good I think. VFIO by default refuses to use
the IOMMU if interrupt remapping is absent or disabled, and KVM's own
(pre-VFIO) IOMMU support was removed a couple years ago. I guess the
secure boot lockdown patches should outlaw VFIO's
allow_unsafe_interrupts option, but that's it.

> Is there any serious usage of virtualization w/o interrupt remapping left
> or have the machines which are not capable been retired already?

I think they were already starting to disappear in 2011, as I don't
remember much worry about customers that were using systems without it.

Paolo

2019-07-05 21:00:43

by Andrew Cooper

Subject: Re: [patch V2 04/25] x86/apic: Make apic_pending_intr_clear() more robust

On 05/07/2019 20:06, Andy Lutomirski wrote:
> On Fri, Jul 5, 2019 at 8:47 AM Andrew Cooper <[email protected]> wrote:
>> On 04/07/2019 16:51, Thomas Gleixner wrote:
>>> 2) The loop termination logic is interesting at best.
>>>
>>> If the machine has no TSC or cpu_khz is not known yet it tries 1
>>> million times to ack stale IRR/ISR bits. What?
>>>
>>> With TSC it uses the TSC to calculate the loop termination. It takes a
>>> timestamp at entry and terminates the loop when:
>>>
>>> (rdtsc() - start_timestamp) >= (cpu_khz << 10)
>>>
>>> That's roughly one second.
>>>
>>> Both methods are problematic. The APIC has 256 vectors, which means
>>> that in theory max. 256 IRR/ISR bits can be set. In practice this is
>>> impossible as the first 32 vectors are reserved and not affected and
>>> the chance that more than a few bits are set is close to zero.
>> [Disclaimer. I talked to Thomas in private first, and he asked me to
>> post this publicly as the CVE is almost a decade old already.]
>>
>> I'm afraid that this isn't quite true.
>>
>> In terms of IDT vectors, the first 32 are reserved for exceptions, but
>> only the first 16 are reserved in the LAPIC. Vectors 16-31 are fair
>> game for incoming IPIs (SDM Vol3, 10.5.2 Valid Interrupt Vectors).
>>
>> In practice, this makes Linux vulnerable to CVE-2011-1898 / XSA-3, which
>> I'm disappointed to see wasn't shared with other software vendors at the
>> time.
>>
>> Because TPR is 0, an incoming IPI can trigger #AC, #CP, #VC or #SX
>> without an error code on the stack, which results in a corrupt pt_regs
>> in the exception handler, and a stack underflow on the way back out,
>> most likely with a fault on IRET.
>>
>> These can be addressed by setting TPR to 0x10, which will inhibit
>> delivery of any errant IPIs in this range, but some extra sanity logic
>> may not go amiss. An error code on a 64bit stack can be spotted with
>> `testb $8, %spl` due to %rsp being aligned before pushing the exception
>> frame.
> Several years ago, I remember having a discussion with someone (Jan
> Beulich, maybe?) about how to efficiently make the entry code figure
> out the error code situation automatically. I suspect it was on IRC
> and I can't find the logs.

It was on IRC, but I don't remember exactly when, either.

> I'm thinking that maybe we should just
> make Linux's idtentry code do something like this.
>
> If nothing else, we could make idtentry do:
>
> testl $8, %esp /* shorter than testb IIRC */

Sadly not.  test (unlike cmp and the basic mutative opcodes) doesn't
have a sign-extendable imm8 encoding.  The two options are:

f7 c4 08 00 00 00        test   $0x8,%esp
40 f6 c4 08              test   $0x8,%spl

> jz 1f /* or jnz -- too lazy to figure it out */
> pushq $-1
> 1:

It is jz, and Xen does use this sequence for reserved/unimplemented
vectors, but we expect those codepaths never to be executed.

>
> instead of the current hardcoded push. The cost of a mispredicted
> branch here will be smallish compared to the absurdly large cost of
> the entry itself. But I thought I had something more clever than
> this. This sequence works, but it still feels like it should be
> possible to do better:
>
> .macro PUSH_ERROR_IF_NEEDED
> /*
> * Before the IRET frame is pushed, RSP is aligned to a 16-byte
> * boundary. After SS .. RIP and the error code are pushed, RSP is
> * once again aligned. Pushing -1 will put -1 in the error code slot
> * (regs->orig_ax) if there was no error code.
> */
>
> pushq $-1 /* orig_ax = -1, maybe */
> /* now RSP points to orig_ax (aligned) or di (misaligned) */
> pushq $0
> /* now RSP points to di (misaligned) or si (aligned) */
> orq $8, %rsp
> /* now RSP points to di */
> addq $8, %rsp
> /* now RSP points to orig_ax, and we're in good shape */
> .endm
>
> Is there a better sequence for this?

The only aspect I can think of is whether mixing the push/pops with
explicit updates to %rsp is better or worse than a very well
predicted branch, given that various frontends have special tracking to
reduce instruction dependencies on %rsp.  I'll have to defer to the CPU
microarchitects as to which of the two options is the lesser evil.

That said, both Intel and AMD's Optimisation guides have stack alignment
suggestions which mix push/sub/and on function prolog, so I expect this
is as optimised as it can reasonably be in the pipelines.

>> Another interesting problem is an IPI whose vector is 0x80. A cunning
>> attacker can use this to simulate system calls from unsuspecting
>> positions in userspace, or for interrupting kernel context. At the very
>> least the int0x80 path does an unconditional swapgs, so will try to run
>> with the user gs, and I expect things will explode quickly from there.
> At least SMAP helps here on non-FSGSBASE systems. With FSGSBASE, I
> suppose we could harden this by adding a special check to int $0x80 to
> validate GSBASE.
>
>> One option here is to look at ISR and complain if it is found to be set.
> Barring some real hackery, we're toast long before we get far enough to do that.

Even if the path moves to be like a regular idtentry?  How much more
expensive is that in reality?  Ultimately, it is that which needs to be
weighed against any extra wanted robustness.

~Andrew

2019-07-05 21:21:30

by Andrew Cooper

Subject: Re: [patch V2 04/25] x86/apic: Make apic_pending_intr_clear() more robust

On 05/07/2019 21:49, Paolo Bonzini wrote:
> On 05/07/19 22:25, Thomas Gleixner wrote:
>> In practice, this makes Linux vulnerable to CVE-2011-1898 / XSA-3, which
>> I'm disappointed to see wasn't shared with other software vendors at the
>> time.
> Oh, that brings back memories. At the time I was working on Xen, so I
> remember that CVE. IIRC there was some mitigation but the fix was
> basically to print a very scary error message if you used VT-d without
> interrupt remapping. Maybe force the user to add something on the Xen
> command line too?

It was before my time.  I have no public comment on how the other
aspects of it were handled.

>> Is there any serious usage of virtualization w/o interrupt remapping left
>> or have the machines which are not capable been retired already?
> I think they were already starting to disappear in 2011, as I don't
> remember much worry about customers that were using systems without it.

ISTR Nehalem/Westmere era systems were the first to support interrupt
remapping, but were totally crippled with errata to the point of needing
to turn a prerequisite feature (Queued Invalidation) off.  I believe
later systems have it working to a first approximation.

As to the original question, whether people should be using such systems
is a different question to whether they actually are.

~Andrew

2019-07-07 08:40:56

by Thomas Gleixner

Subject: Re: [patch V2 04/25] x86/apic: Make apic_pending_intr_clear() more robust

On Fri, 5 Jul 2019, Andy Lutomirski wrote:
> On Fri, Jul 5, 2019 at 1:36 PM Thomas Gleixner <[email protected]> wrote:
> > No. We can map the APIC into the user space visible page tables for PTI
> > without compromising the PTI isolation and it can be read very early on
> > before SWAPGS. All you need is a register to clobber, not more. If the ISR
> > is set, then go into an error path, yell loudly, issue EOI and return.
> > The only issue I can see is: It's slow :)
> >
> I think this will be really extremely slow. If we can restrict this
> to x2apic machines, then maybe it's not so awful.

x2apic machines have working iommu/interrupt remapping.

> FWIW, if we just patch up the GS thing, then we are still vulnerable:
> the bad guy can arrange for a privileged process to have register
> state corresponding to a dangerous syscall and then send an int $0x80
> via the APIC.

Right, that's why you want to read the APIC:ISR to check where that thing
came from.

Thanks,

tglx


2019-07-07 08:42:23

by Thomas Gleixner

Subject: Re: [patch V2 04/25] x86/apic: Make apic_pending_intr_clear() more robust

On Fri, 5 Jul 2019, Paolo Bonzini wrote:
> On 05/07/19 22:25, Thomas Gleixner wrote:
> > The more interesting question is whether this is all relevant. If I
> > understood the issue correctly then this is mitigated by proper interrupt
> > remapping.
>
> Yes, and for Linux we're good I think. VFIO by default refuses to use
> the IOMMU if interrupt remapping is absent or disabled, and KVM's own

Confused. If it does not use IOMMU, what does it do? Hand in the device as
is and let the guest fiddle with it unconstrained or does it actually
refuse to pass through?

> (pre-VFIO) IOMMU support was removed a couple years ago. I guess the
> secure boot lockdown patches should outlaw VFIO's
> allow_unsafe_interrupts option, but that's it.

I'm not worried too much about command line options. The important thing is
the default behaviour. If an admin decides to do something stupid, so be
it.

Thanks,

tglx

2019-07-09 14:44:25

by Thomas Gleixner

Subject: Re: [patch V2 04/25] x86/apic: Make apic_pending_intr_clear() more robust

On Sun, 7 Jul 2019, Thomas Gleixner wrote:

> On Fri, 5 Jul 2019, Paolo Bonzini wrote:
> > On 05/07/19 22:25, Thomas Gleixner wrote:
> > > The more interesting question is whether this is all relevant. If I
> > > understood the issue correctly then this is mitigated by proper interrupt
> > > remapping.
> >
> > Yes, and for Linux we're good I think. VFIO by default refuses to use
> > the IOMMU if interrupt remapping is absent or disabled, and KVM's own
>
> Confused. If it does not use IOMMU, what does it do? Hand in the device as
> is and let the guest fiddle with it unconstrained or does it actually
> refuse to pass through?

Read through it and it refuses to attach unless the allow_unsafe_interrupts
option is set, but again we can't protect against wilful ignorance.

So the default prevents abuse on systems without IOMMU and irq remapping;
there is not much to worry about AFAICT.

Setting TPR to 1 and fixing the comments/changelogs still make sense
independently.

Thanks,

tglx