On Mon, Oct 19, 2020 at 05:11:19PM +0200, Joerg Roedel wrote:
> From: Joerg Roedel <[email protected]>
>
> Check whether the hypervisor reported the correct C-bit when running as
> an SEV guest. Using a wrong C-bit position could be used to leak
> sensitive data from the guest to the hypervisor.
>
> The check function is in arch/x86/kernel/sev_verify_cbit.S so that it
> can be re-used in the running kernel image.
>
> Signed-off-by: Joerg Roedel <[email protected]>
> ---
> +
> + /* Store value to memory and keep it in %r10 */
> + movq %r10, sev_check_data(%rip)
> +
Does there need to be a cache flush/invalidation between this and the
read below to avoid just reading back from cache, or will the hardware
take care of that?
> + /* Backup current %cr3 value to restore it later */
> + movq %cr3, %r11
> +
> + /* Switch to new %cr3 - This might unmap the stack */
> + movq %rdi, %cr3
Does there need to be a TLB flush after this? When executed from the
main kernel's head code, CR4.PGE is enabled, and if the original page
mapping had the global bit set (the decompressor stub sets that in the
identity mapping), won't the read below still use the original encrypted
mapping if we don't explicitly flush it?
> +
> + /*
> + * Compare value in %r10 with memory location - If C-Bit is incorrect
> + * this would read the encrypted data and make the check fail.
> + */
> + cmpq %r10, sev_check_data(%rip)
> +
> + /* Restore old %cr3 */
> + movq %r11, %cr3
> +
> + /* Check CMPQ result */
> + je 3f
> +
> + /*
> + * The check failed - Prevent any forward progress to prevent ROP
> + * attacks, invalidate the stack and go into a hlt loop.
> + */
> + xorq %rsp, %rsp
> + subq $0x1000, %rsp
> +2: hlt
> + jmp 2b
> +3:
> +#endif
> + ret
> +SYM_FUNC_END(sev_verify_cbit)
> +
> --
> 2.28.0
>
On Mon, Oct 19, 2020 at 01:00:08PM -0400, Arvind Sankar wrote:
> On Mon, Oct 19, 2020 at 05:11:19PM +0200, Joerg Roedel wrote:
> > From: Joerg Roedel <[email protected]>
> >
> > Check whether the hypervisor reported the correct C-bit when running as
> > an SEV guest. Using a wrong C-bit position could be used to leak
> > sensitive data from the guest to the hypervisor.
> >
> > The check function is in arch/x86/kernel/sev_verify_cbit.S so that it
> > can be re-used in the running kernel image.
> >
> > Signed-off-by: Joerg Roedel <[email protected]>
> > ---
>
> > +
> > + /* Store value to memory and keep it in %r10 */
> > + movq %r10, sev_check_data(%rip)
> > +
>
> Does there need to be a cache flush/invalidation between this and the
> read below to avoid just reading back from cache, or will the hardware
> take care of that?
Also, isn't it possible that the initial page tables we're running on
have already been messed with and have the C-bit in the wrong location,
so that this write happens decrypted?
>
> > + /* Backup current %cr3 value to restore it later */
> > + movq %cr3, %r11
> > +
> > + /* Switch to new %cr3 - This might unmap the stack */
> > + movq %rdi, %cr3
>
> Does there need to be a TLB flush after this? When executed from the
> main kernel's head code, CR4.PGE is enabled, and if the original page
> mapping had the global bit set (the decompressor stub sets that in the
> identity mapping), won't the read below still use the original encrypted
> mapping if we don't explicitly flush it?
>
> > +
> > + /*
> > + * Compare value in %r10 with memory location - If C-Bit is incorrect
> > + * this would read the encrypted data and make the check fail.
> > + */
> > + cmpq %r10, sev_check_data(%rip)
> > +
> > + /* Restore old %cr3 */
> > + movq %r11, %cr3
> > +
> > + /* Check CMPQ result */
> > + je 3f
> > +
> > + /*
> > + * The check failed - Prevent any forward progress to prevent ROP
> > + * attacks, invalidate the stack and go into a hlt loop.
> > + */
> > + xorq %rsp, %rsp
> > + subq $0x1000, %rsp
> > +2: hlt
> > + jmp 2b
> > +3:
> > +#endif
> > + ret
> > +SYM_FUNC_END(sev_verify_cbit)
> > +
> > --
> > 2.28.0
> >
Hi Arvind,
On Mon, Oct 19, 2020 at 01:00:08PM -0400, Arvind Sankar wrote:
> On Mon, Oct 19, 2020 at 05:11:19PM +0200, Joerg Roedel wrote:
> > +
> > + /* Store value to memory and keep it in %r10 */
> > + movq %r10, sev_check_data(%rip)
> > +
>
> Does there need to be a cache flush/invalidation between this and the
> read below to avoid just reading back from cache, or will the hardware
> take care of that?
No, a cache flush is not needed. When the C bit position is correct, the
data is mapped encrypted by both the old and the new page-table, so the
read hits the same cache line the write just filled. When the C bit
position is wrong, the access goes to a different physical address
entirely, since the C bit is treated as part of the physical address.
> > + /* Backup current %cr3 value to restore it later */
> > + movq %cr3, %r11
> > +
> > + /* Switch to new %cr3 - This might unmap the stack */
> > + movq %rdi, %cr3
>
> Does there need to be a TLB flush after this? When executed from the
> main kernel's head code, CR4.PGE is enabled, and if the original page
> mapping had the global bit set (the decompressor stub sets that in the
> identity mapping), won't the read below still use the original encrypted
> mapping if we don't explicitly flush it?
The check only really matters for the boot CPU, not for the secondary
CPUs. IIRC at this point in boot CR4.PGE is still off.
Regards,
Joerg
On Mon, Oct 19, 2020 at 01:54:47PM -0400, Arvind Sankar wrote:
> Also, isn't it possible that the initial page tables we're running on
> have already been messed with and have the C-bit in the wrong location,
> so that this write happens decrypted?
The code assumes that the page-table it is running on has the correct C
bit position set and that the code which set it up verified that it is
correct. For the kernel itself this is true, at least, but when booting
via UEFI the check also needs to happen in the firmware.
Note that the possibilities are limited when the hypervisor reports the
wrong C bit position, because code fetches always assume encryption, even
when the C bit is cleared in the page-table. So a wrong C bit position
in the decompression stub would write the kernel image to memory
unencrypted, and executing it would not be possible.
Regards,
Joerg
On Mon, Oct 19, 2020 at 10:39:35PM +0200, Joerg Roedel wrote:
> On Mon, Oct 19, 2020 at 01:54:47PM -0400, Arvind Sankar wrote:
> > Also, isn't it possible that the initial page tables we're running on
> > have already been messed with and have the C-bit in the wrong location,
> > so that this write happens decrypted?
>
> The code assumes that the page-table it is running on has the correct C
> bit position set and that the code which set it up verified that it is
> correct. For the kernel itself this is true, at least, but when booting
> via UEFI the check also needs to happen in the firmware.
>
> Note that the possibilities are limited when the hypervisor reports the
> wrong C bit position, because code fetches always assume encryption, even
> when the C bit is cleared in the page-table. So a wrong C bit position
> in the decompression stub would write the kernel image to memory
> unencrypted, and executing it would not be possible.
Is it possible to take advantage of this to make the check independent
of the original page tables? i.e. switch to the new pagetables, then
write into .data or .bss the opcodes for a function that does
movabs $imm64, %rax
jmp *%rdi // avoid using stack for the return
filling in the imm64 with the RDRAND value, and then try to execute it.
If the C-bit value is wrong, this will probably crash, and at any rate
shouldn't return with the correct value in %rax.
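A rough sketch of what I have in mind -- sev_check_code here is a
hypothetical buffer (a dozen bytes or so) reserved in .bss, and this
assumes the mapping is executable at that point and that %rdi is free
after the %cr3 switch:

        /* Assemble "movabs $imm64, %rax; jmp *%rdi" in the buffer */
        movb    $0x48, sev_check_code+0(%rip)   /* REX.W prefix */
        movb    $0xb8, sev_check_code+1(%rip)   /* opcode: mov $imm64, %rax */
        movq    %r10, sev_check_code+2(%rip)    /* imm64 = the RDRAND value */
        movw    $0xe7ff, sev_check_code+10(%rip)        /* ff e7 = jmp *%rdi */

        leaq    4f(%rip), %rdi          /* return target, no stack involved */
        jmp     sev_check_code          /* execute the probe */
4:
        cmpq    %r10, %rax              /* mismatch if the C bit was wrong */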
>
> Regards,
>
> Joerg
>
On Mon, Oct 19, 2020 at 10:33:45PM +0200, Joerg Roedel wrote:
> Hi Arvind,
>
> On Mon, Oct 19, 2020 at 01:00:08PM -0400, Arvind Sankar wrote:
> > On Mon, Oct 19, 2020 at 05:11:19PM +0200, Joerg Roedel wrote:
> > > +
> > > + /* Store value to memory and keep it in %r10 */
> > > + movq %r10, sev_check_data(%rip)
> > > +
> >
> > Does there need to be a cache flush/invalidation between this and the
> > read below to avoid just reading back from cache, or will the hardware
> > take care of that?
>
> No, a cache flush is not needed. When the C bit position is correct, the
> data is mapped encrypted by both the old and the new page-table, so the
> read hits the same cache line the write just filled. When the C bit
> position is wrong, the access goes to a different physical address
> entirely, since the C bit is treated as part of the physical address.
Ok.
>
> > > + /* Backup current %cr3 value to restore it later */
> > > + movq %cr3, %r11
> > > +
> > > + /* Switch to new %cr3 - This might unmap the stack */
> > > + movq %rdi, %cr3
> >
> > Does there need to be a TLB flush after this? When executed from the
> > main kernel's head code, CR4.PGE is enabled, and if the original page
> > mapping had the global bit set (the decompressor stub sets that in the
> > identity mapping), won't the read below still use the original encrypted
> > mapping if we don't explicitly flush it?
>
> The check only really matters for the boot CPU, not for the secondary
> CPUs. IIRC at this point in boot CR4.PGE is still off.
>
> Regards,
>
> Joerg
>
The boot cpu also enables CR4.PGE -- that code is shared between boot
and secondary cpus. The boot cpu jumps to the first "1" label below,
just before the call to sev_verify_cbit you're adding.
        /* Form the CR3 value being sure to include the CR3 modifier */
        addq    $(init_top_pgt - __START_KERNEL_map), %rax
1:

        /* Enable PAE mode, PGE and LA57 */
        movl    $(X86_CR4_PAE | X86_CR4_PGE), %ecx
#ifdef CONFIG_X86_5LEVEL
        testl   $1, __pgtable_l5_enabled(%rip)
        jz      1f
        orl     $X86_CR4_LA57, %ecx
1:
#endif
        movq    %rcx, %cr4

        /* Setup early boot stage 4-/5-level pagetables. */
        addq    phys_base(%rip), %rax
        movq    %rax, %cr3
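So after the movq %rdi, %cr3 in sev_verify_cbit, I think something along
these lines would be needed to flush global entries as well -- just a
sketch, assuming X86_CR4_PGE is usable in that context and that a scratch
register (%rax here) is actually free at that point:

        /* Toggle CR4.PGE to flush the TLB, including global entries */
        movq    %cr4, %rax
        andq    $~X86_CR4_PGE, %rax
        movq    %rax, %cr4              /* clearing PGE flushes all TLB entries */
        orq     $X86_CR4_PGE, %rax
        movq    %rax, %cr4              /* re-enable global pages */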
On Mon, Oct 19, 2020 at 05:31:06PM -0400, Arvind Sankar wrote:
> Is it possible to take advantage of this to make the check independent
> of the original page tables? i.e. switch to the new pagetables, then
> write into .data or .bss the opcodes for a function that does
> movabs $imm64, %rax
> jmp *%rdi // avoid using stack for the return
> filling in the imm64 with the RDRAND value, and then try to execute it.
> If the C-bit value is wrong, this will probably crash, and at any rate
> shouldn't return with the correct value in %rax.
That could work, but is not reliable. When the C bit is wrong the CPU
would essentially execute random data, which could also be a valid
instruction stream. A crash is not guaranteed.
Regards,
Joerg
On Tue, Oct 20, 2020 at 10:59:57AM +0200, Joerg Roedel wrote:
> On Mon, Oct 19, 2020 at 05:31:06PM -0400, Arvind Sankar wrote:
> > Is it possible to take advantage of this to make the check independent
> > of the original page tables? i.e. switch to the new pagetables, then
> > write into .data or .bss the opcodes for a function that does
> > movabs $imm64, %rax
> > jmp *%rdi // avoid using stack for the return
> > filling in the imm64 with the RDRAND value, and then try to execute it.
> > If the C-bit value is wrong, this will probably crash, and at any rate
> > shouldn't return with the correct value in %rax.
>
> That could work, but is not reliable. When the C bit is wrong the CPU
> would essentially execute random data, which could also be a valid
> instruction stream. A crash is not guaranteed.
>
That doesn't feel like a big loss: if a malicious hypervisor wanted to
induce completely random code execution, it could do that anyway by just
messing with the guest-to-host translation, no?
We would need to avoid calling this in the secondary cpu startup, I guess.
I was hoping to be able to clean up the identity mapping in
__startup_64(), which currently maps the entire kernel using wraparound
entries, to just map the head page of the kernel, since AFAICT nothing
else is actually used from the identity mapping after switching to the
new page tables. But we'd need to keep it to support this check.