2019-12-20 19:30:28

by Andersen, John

Subject: [RESEND RFC 0/2] Paravirtualized Control Register pinning

Paravirtualized Control Register pinning is a strengthened version of
existing protections on the Write Protect, Supervisor Mode Execution /
Access Protection, and User-Mode Instruction Prevention bits. The
existing protections prevent native_write_cr*() functions from writing
values which disable those bits. This patchset prevents any guest
writes to control registers from disabling pinned bits, not just writes
from native_write_cr*(). This stops attackers within the guest from
using ROP to disable protection bits.

https://web.archive.org/web/20171029060939/http://www.blackbunny.io/linux-kernel-x86-64-bypass-smep-kaslr-kptr_restric/

The protection is implemented by adding MSRs to KVM which contain the
bits that are allowed to be pinned, and the bits which are pinned. The
guest or userspace can enable bit pinning by reading MSRs to check
which bits are allowed to be pinned, and then writing MSRs to set which
bits they want pinned.
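
A rough guest-side sketch of that handshake (the MSR names and indices
below are placeholders, not necessarily what the patches define):

#include <linux/types.h>
#include <asm/msr.h>

/* Hypothetical MSR indices, stand-ins for whatever the patches define. */
#define MSR_KVM_CR4_PIN_ALLOWED	0x4b564d08
#define MSR_KVM_CR4_PINNED	0x4b564d09

static void example_pin_cr4_bits(u64 wanted)
{
	u64 allowed;

	/* Ask the hypervisor which bits may be pinned... */
	rdmsrl(MSR_KVM_CR4_PIN_ALLOWED, allowed);

	/* ...then request pinning for the allowed subset. Including a
	 * disallowed bit would make the write fail instead. */
	wrmsrl(MSR_KVM_CR4_PINNED, wanted & allowed);
}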

Other hypervisors, such as Hyper-V, have implemented similar protections
for control registers and MSRs, which security researchers have found
effective.

https://www.abatchy.com/2018/01/kernel-exploitation-4

We add a CR pin feature bit to the KVM CPUID, read-only MSRs which
guests use to identify which bits they may request be pinned, and
CR pinned MSRs which contain the pinned bits. Guests can request that
KVM pin bits within control register 0 or 4 via the CR pinned MSRs.
Writes to the MSRs fail if they include bits that aren't allowed to be
pinned. Host userspace may clear or modify pinned bits at any time.
Once pinned bits are set, the guest may pin more allowed bits, but may
never clear pinned bits.
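
One plausible reading of those write semantics, as a self-contained
sketch (the names are illustrative only):

#include <linux/types.h>

/* Validate a guest write to a hypothetical CR-pinned MSR.
 * Returns nonzero if the write should fail. */
static int example_pinned_msr_write(u64 *pinned, u64 allowed, u64 new_val)
{
	if (new_val & ~allowed)
		return 1;	/* tried to pin a bit that isn't allowed */
	if (*pinned & ~new_val)
		return 1;	/* tried to clear an already-pinned bit */
	*pinned = new_val;	/* the pinned set may only grow */
	return 0;
}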

In the event that the guest vcpu attempts to disable any of the pinned
bits, the vcpu that issued the write is sent a general protection
fault, and the register is left unchanged.
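
The check itself reduces to a mask comparison. As a sketch (where the
pinned mask lives, and under what name, is up to the patches; this is
only illustrative):

/* Returns nonzero if a CR write flips any pinned bit, in which case the
 * caller injects #GP and leaves the register unchanged. */
static int example_cr_write_violates_pinning(unsigned long old_val,
					     unsigned long new_val,
					     unsigned long pinned_mask)
{
	return ((old_val ^ new_val) & pinned_mask) != 0;
}

kvm_set_cr0() and kvm_set_cr4() would make this comparison before
committing the new value.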

Pinning is not active when running in SMM. Entering SMM disables pinned
bits; writes to control registers within SMM would therefore trigger
general protection faults if pinning was enforced.

The guest may never read pinned bits. If an attacker were to read the
CR pinned MSRs, they might decide to perform another attack which would
not cause a general protection fault.

Should userspace expose the CR pinning CPUID feature bit, it must zero CR
pinned MSRs on reboot. If it does not, it runs the risk of having the
guest enable pinning and subsequently cause general protection faults on
next boot due to early boot code setting control registers to values
which do not contain the pinned bits.
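
A sketch of that userspace reset using the existing KVM_SET_MSRS ioctl
(the two MSR indices are hypothetical placeholders for the ones the
patches define):

#include <linux/kvm.h>
#include <string.h>
#include <sys/ioctl.h>

#define MSR_KVM_CR0_PINNED	0x4b564d0a	/* hypothetical index */
#define MSR_KVM_CR4_PINNED	0x4b564d0b	/* hypothetical index */

static int example_zero_pinned_msrs(int vcpu_fd)
{
	struct {
		struct kvm_msrs hdr;
		struct kvm_msr_entry entries[2];
	} msrs;

	memset(&msrs, 0, sizeof(msrs));
	msrs.hdr.nmsrs = 2;
	msrs.entries[0].index = MSR_KVM_CR0_PINNED;	/* data stays 0 */
	msrs.entries[1].index = MSR_KVM_CR4_PINNED;

	return ioctl(vcpu_fd, KVM_SET_MSRS, &msrs);
}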

When running with KVM guest support and paravirtualized CR pinning
enabled, paravirtualized and existing pinning are set up at the same
point on the boot CPU. Non-boot CPUs set up pinning upon identification.

Guests using the kexec system call currently do not support
paravirtualized control register pinning. This is due to early boot
code writing known good values to control registers; these values do
not contain the protected bits. This is due to CPU feature
identification being done at a later time, when the kernel properly
checks if it can enable protections.

Most distributions enable kexec. However, kexec could be made boot-time
disableable. In this case, if a user has disabled kexec at boot time,
the guest will request that paravirtualized control register pinning
be enabled. This would expand the userbase to users of major
distributions.

Paravirtualized CR pinning will likely be incompatible with kexec for
the foreseeable future. Early boot code could possibly be changed to
not clear protected bits. However, a kernel that requests CR bits be
pinned can't know if the kernel it's kexecing has been updated to not
clear protected bits. This would result in the kernel being kexec'd
almost immediately receiving a general protection fault.

Security conscious kernel configurations disable kexec already, per KSPP
guidelines. Projects such as Kata Containers, AWS Lambda, ChromeOS
Termina, and others using KVM to virtualize Linux will benefit from
this protection.

The usage of SMM in SeaBIOS was explored as a way to communicate to KVM
that a reboot has occurred and it should zero the pinned bits. When
using QEMU and SeaBIOS, SMM initialization occurs on reboot. However,
prior to SMM initialization, BIOS writes zero values to CR0, causing a
general protection fault to be sent to the guest before SMM can signal
that the machine has booted.

Pinning of sensitive CR bits has already been implemented to protect
against exploits directly calling native_write_cr*(). The current
protection cannot stop ROP attacks which jump directly to a MOV CR
instruction. Guests running with paravirtualized CR pinning are now
protected against the use of ROP to disable CR bits. The same bits that
are being pinned natively may be pinned via the CR pinned MSRs. These
bits are WP in CR0, and SMEP, SMAP, and UMIP in CR4.

Future patches could protect bits in MSRs in a similar fashion. The NXE
bit of the EFER MSR is a prime candidate.

John Andersen (2):
KVM: X86: Add CR pin MSRs
X86: Use KVM CR pin MSRs

 Documentation/virt/kvm/msr.txt       | 38 ++++++++++++++++++++++++
 arch/x86/Kconfig                     |  9 ++++++
 arch/x86/include/asm/kvm_host.h      |  2 ++
 arch/x86/include/asm/kvm_para.h      | 10 +++++++
 arch/x86/include/uapi/asm/kvm_para.h |  5 ++++
 arch/x86/kernel/cpu/common.c         |  5 ++++
 arch/x86/kernel/kvm.c                | 17 +++++++++++
 arch/x86/kvm/cpuid.c                 |  3 +-
 arch/x86/kvm/x86.c                   | 45 ++++++++++++++++++++++++++++
 9 files changed, 133 insertions(+), 1 deletion(-)

--
2.21.0


2019-12-21 14:01:53

by Paolo Bonzini

Subject: Re: [RESEND RFC 0/2] Paravirtualized Control Register pinning

On 20/12/19 20:26, John Andersen wrote:
> Paravirtualized CR pinning will likely be incompatible with kexec for
> the foreseeable future. Early boot code could possibly be changed to
> not clear protected bits. However, a kernel that requests CR bits be
> pinned can't know if the kernel it's kexecing has been updated to not
> clear protected bits. This would result in the kernel being kexec'd
> almost immediately receiving a general protection fault.
>
> Security conscious kernel configurations disable kexec already, per KSPP
> guidelines. Projects such as Kata Containers, AWS Lambda, ChromeOS
> Termina, and others using KVM to virtualize Linux will benefit from
> this protection.
>
> The usage of SMM in SeaBIOS was explored as a way to communicate to KVM
> that a reboot has occurred and it should zero the pinned bits. When
> using QEMU and SeaBIOS, SMM initialization occurs on reboot. However,
> prior to SMM initialization, BIOS writes zero values to CR0, causing a
> general protection fault to be sent to the guest before SMM can signal
> that the machine has booted.

SMM is optional; I think it makes sense to leave it to userspace to
reset pinning (including for the case of triple faults), while INIT
which is handled within KVM would keep it active.

> Pinning of sensitive CR bits has already been implemented to protect
> against exploits directly calling native_write_cr*(). The current
> protection cannot stop ROP attacks which jump directly to a MOV CR
> instruction. Guests running with paravirtualized CR pinning are now
> protected against the use of ROP to disable CR bits. The same bits that
> are being pinned natively may be pinned via the CR pinned MSRs. These
> bits are WP in CR0, and SMEP, SMAP, and UMIP in CR4.
>
> Future patches could protect bits in MSRs in a similar fashion. The NXE
> bit of the EFER MSR is a prime candidate.

Please include patches for either kvm-unit-tests or
tools/testing/selftests/kvm that test the functionality.

Paolo

2019-12-23 14:33:26

by Liran Alon

Subject: Re: [RESEND RFC 0/2] Paravirtualized Control Register pinning



> On 20 Dec 2019, at 21:26, John Andersen <[email protected]> wrote:
>
> Paravirtualized Control Register pinning is a strengthened version of
> existing protections on the Write Protect, Supervisor Mode Execution /
> Access Protection, and User-Mode Instruction Prevention bits. The
> existing protections prevent native_write_cr*() functions from writing
> values which disable those bits. This patchset prevents any guest
> writes to control registers from disabling pinned bits, not just writes
> from native_write_cr*(). This stops attackers within the guest from
> using ROP to disable protection bits.
>
> https://web.archive.org/web/20171029060939/http://www.blackbunny.io/linux-kernel-x86-64-bypass-smep-kaslr-kptr_restric/
>
> The protection is implemented by adding MSRs to KVM which contain the
> bits that are allowed to be pinned, and the bits which are pinned. The
> guest or userspace can enable bit pinning by reading MSRs to check
> which bits are allowed to be pinned, and then writing MSRs to set which
> bits they want pinned.
>
> Other hypervisors, such as Hyper-V, have implemented similar protections
> for control registers and MSRs, which security researchers have found
> effective.
>
> https://www.abatchy.com/2018/01/kernel-exploitation-4
>

I think it’s important to mention how Hyper-V implements this protection, as it is done with a very different architecture.

Hyper-V implements a set of PV APIs named VSM (Virtual Secure Mode) aimed at allowing a guest (partition) to separate itself into multiple security domains called VTLs (Virtual Trust Levels).
The VSM API exposes an interface for higher VTLs to control the execution of lower VTLs. In theory, VSM supports up to 16 VTLs, but Windows VBS (Virtualization Based Security), currently
the only technology which utilises VSM, uses only 2 VTLs: VTL0 for most of OS execution (Normal-Mode) and VTL1 for secure OS execution (Secure-Mode).

A higher VTL controls execution of a lower VTL by the following VSM mechanisms:
1) Memory Access Protections: Allows a higher VTL to restrict memory access to physical pages, either making them inaccessible or limited to certain permissions.
2) Secure Intercepts: Allows a higher VTL to request that the hypervisor intercept certain events in lower VTLs for handling by the higher VTL. This includes access to system registers (e.g. CRs & MSRs).

VBS uses the above-mentioned mechanisms as follows:
a) Credential Guard: Prevents pass-the-hash attacks. Done by having a VTL1 trustlet encrypt credentials with an encryption key stored in VTL1-only accessible memory.
b) HVCI (Hypervisor-based Code-Integrity): Prevents execution of unsigned code. Done by marking all EPT entries NX until their signatures are verified by a VTL1 service. Once verified, EPT entries are marked RO+X.
(HVCI also supports efficiently enforcing code-signing only on Ring0 code by utilising the Intel MBEC or AMD GMET CPU features, which allow setting the NX-bit on EPT entries based on guest CPL.)
c) KDP (Kernel Data Protection): Marks certain pages after initialisation as read-only in the VTL0 EPT.
d) kCFG (Kernel Control-Flow Guard): VTL1 protects the bitmap specifying valid indirect branch targets by marking it read-only in the VTL0 EPT.
e) HyperGuard: VTL1 uses the “Secure Intercepts” mechanism to prevent VTL0 from modifying important system registers, including CR0 & CR4 as done by this patch.
HyperGuard also implements a mechanism named NPIEP (Non-Privileged Instruction Execution Prevention) that prevents VTL0 Ring3 from executing SIDT/SGDT/SLDT/STR to leak Ring0 addresses.

To sum up: in Hyper-V, the hypervisor exposes a relatively thin API that allows a guest to partition itself into multiple security domains (enforced by virtualization).
Using this framework, it’s possible to implement multiple OS-level protection mechanisms. Only one of them is pinning certain registers to specific values, as done by this patch.

Therefore, as I also tried to say at the recent KVM Forum, I think KVM should consider exposing a VSM-like API to guests to allow various guest OSes,
including Linux, to implement VBS-like features. To decide how this API should look, we need to have a broader discussion with Linux
security maintainers and KVM maintainers on which security features we would like to implement using such an API and what their architecture should be.
Then, we can implement this API in KVM and start to gradually introduce more security features in Linux which utilise this API.

Once Linux has security features implemented with this new KVM API, we could also consider implementing them on top of other similar hypervisor APIs
such as Hyper-V VSM, to achieve, for example, Linux being more secure when running on Microsoft Azure compute instances.

Therefore, I see this patch as a short-term solution to quickly gain real security value on a very specific issue.
But if we are serious about improving Linux security using virtualization, we should have this broader discussion.

-Liran

2019-12-23 14:51:03

by Liran Alon

Subject: Re: [RESEND RFC 0/2] Paravirtualized Control Register pinning



> On 20 Dec 2019, at 21:26, John Andersen <[email protected]> wrote:
>
> Pinning is not active when running in SMM. Entering SMM disables pinned
> bits; writes to control registers within SMM would therefore trigger
> general protection faults if pinning was enforced.

For compatibility reasons, it’s reasonable that pinning won’t be active when running in SMM.
However, I do think we should not allow vSMM code to change pinned values when returning back from SMM.
This would prevent a vulnerable vSMI handler from modifying the vSMM state-area in order to change CR4 once running outside of vSMM.
I believe in this case it’s legitimate to just forcibly restore the original CR0/CR4 pinned values, ignoring vSMM changes.

>
> The guest may never read pinned bits. If an attacker were to read the
> CR pinned MSRs, they might decide to perform another attack which would
> not cause a general protection fault.

I disagree with this statement.
An attacker knows what system it is attacking and can deduce from that which bits are pinned…
Therefore, protecting against the guest reading these is not important at all.

>
> Should userspace expose the CR pinning CPUID feature bit, it must zero CR
> pinned MSRs on reboot. If it does not, it runs the risk of having the
> guest enable pinning and subsequently cause general protection faults on
> next boot due to early boot code setting control registers to values
> which do not contain the pinned bits.

Why reset CR pinned MSRs by userspace instead of KVM INIT handling?

>
> When running with KVM guest support and paravirtualized CR pinning
> enabled, paravirtualized and existing pinning are setup at the same
> point on the boot CPU. Non-boot CPUs setup pinning upon identification.
>
> Guests using the kexec system call currently do not support
> paravirtualized control register pinning. This is due to early boot
> code writing known good values to control registers; these values do
> not contain the protected bits. This is due to CPU feature
> identification being done at a later time, when the kernel properly
> checks if it can enable protections.
>
> Most distributions enable kexec. However, kexec could be made boot time
> disableable. In this case if a user has disabled kexec at boot time
> the guest will request that paravirtualized control register pinning
> be enabled. This would expand the userbase to users of major
> distributions.
>
> Paravirtualized CR pinning will likely be incompatible with kexec for
> the foreseeable future. Early boot code could possibly be changed to
> not clear protected bits. However, a kernel that requests CR bits be
> pinned can't know if the kernel it's kexecing has been updated to not
> clear protected bits. This would result in the kernel being kexec'd
> almost immediately receiving a general protection fault.

Instead of disabling kexec entirely, I think it makes more sense to invent
some generic mechanism by which the new kernel can describe to the old kernel
a set of flags that specifies which feature hand-overs it supports, one of them
being pinned CRs.

For example, isn’t this also relevant for IOMMU DMA protection?
i.e. Doesn’t the old kernel need to know if it should disable or enable the IOMMU DMAR
before kexec'ing to the new kernel? Similar to the EDK2 IOMMU DMA protection hand-over?

-Liran

2019-12-23 17:10:40

by Paolo Bonzini

Subject: Re: [RESEND RFC 0/2] Paravirtualized Control Register pinning

On 23/12/19 15:48, Liran Alon wrote:
>> Should userspace expose the CR pinning CPUID feature bit, it must zero CR
>> pinned MSRs on reboot. If it does not, it runs the risk of having the
>> guest enable pinning and subsequently cause general protection faults on
>> next boot due to early boot code setting control registers to values
>> which do not contain the pinned bits.
>
> Why reset CR pinned MSRs by userspace instead of KVM INIT handling?

Most MSRs are not reset by INIT, are they?

Paolo

2019-12-23 17:28:01

by Andersen, John

Subject: Re: [RESEND RFC 0/2] Paravirtualized Control Register pinning

On Mon, 2019-12-23 at 18:09 +0100, Paolo Bonzini wrote:
> On 23/12/19 15:48, Liran Alon wrote:
> > > Should userspace expose the CR pinning CPUID feature bit, it must
> > > zero CR
> > > pinned MSRs on reboot. If it does not, it runs the risk of having
> > > the
> > > guest enable pinning and subsequently cause general protection
> > > faults on
> > > next boot due to early boot code setting control registers to
> > > values
> > > which do not contain the pinned bits.
> >
> > Why reset CR pinned MSRs by userspace instead of KVM INIT handling?
>
> Most MSRs are not reset by INIT, are they?
>

As far as I can tell, KVM doesn't know if the guest has rebooted.
Userspace uses the set-sregs and set-MSRs ioctls to reset state.
kvm_vcpu_reset() is called on non-boot CPUs. kvm_vcpu_init() isn't called
on reboot.

2019-12-23 17:29:18

by Andersen, John

Subject: Re: [RESEND RFC 0/2] Paravirtualized Control Register pinning

On Sat, 2019-12-21 at 14:59 +0100, Paolo Bonzini wrote:
> On 20/12/19 20:26, John Andersen wrote:
> > Paravirtualized CR pinning will likely be incompatible with kexec
> > for
> > the foreseeable future. Early boot code could possibly be changed
> > to
> > not clear protected bits. However, a kernel that requests CR bits
> > be
> > pinned can't know if the kernel it's kexecing has been updated to
> > not
> > clear protected bits. This would result in the kernel being kexec'd
> > almost immediately receiving a general protection fault.
> >
> > Security conscious kernel configurations disable kexec already, per
> > KSPP
> > guidelines. Projects such as Kata Containers, AWS Lambda, ChromeOS
> > Termina, and others using KVM to virtualize Linux will benefit from
> > this protection.
> >
> > The usage of SMM in SeaBIOS was explored as a way to communicate to
> > KVM
> > that a reboot has occurred and it should zero the pinned bits. When
> > using QEMU and SeaBIOS, SMM initialization occurs on reboot.
> > However,
> > prior to SMM initialization, BIOS writes zero values to CR0,
> > causing a
> > general protection fault to be sent to the guest before SMM can
> > signal
> > that the machine has booted.
>
> SMM is optional; I think it makes sense to leave it to userspace to
> reset pinning (including for the case of triple faults), while INIT
> which is handled within KVM would keep it active.
>
> > Pinning of sensitive CR bits has already been implemented to
> > protect
> > against exploits directly calling native_write_cr*(). The current
> > protection cannot stop ROP attacks which jump directly to a MOV CR
> > instruction. Guests running with paravirtualized CR pinning are now
> > protected against the use of ROP to disable CR bits. The same bits
> > that
> > are being pinned natively may be pinned via the CR pinned MSRs.
> > These
> > bits are WP in CR0, and SMEP, SMAP, and UMIP in CR4.
> >
> > Future patches could protect bits in MSRs in a similar fashion. The
> > NXE
> > bit of the EFER MSR is a prime candidate.
>
> Please include patches for either kvm-unit-tests or
> tools/testing/selftests/kvm that test the functionality.
>

Will do

2019-12-23 17:31:36

by Liran Alon

Subject: Re: [RESEND RFC 0/2] Paravirtualized Control Register pinning



> On 23 Dec 2019, at 19:09, Paolo Bonzini <[email protected]> wrote:
>
> On 23/12/19 15:48, Liran Alon wrote:
>>> Should userspace expose the CR pinning CPUID feature bit, it must zero CR
>>> pinned MSRs on reboot. If it does not, it runs the risk of having the
>>> guest enable pinning and subsequently cause general protection faults on
>>> next boot due to early boot code setting control registers to values
>>> which do not contain the pinned bits.
>>
>> Why reset CR pinned MSRs by userspace instead of KVM INIT handling?
>
> Most MSRs are not reset by INIT, are they?
>
> Paolo
>

MSR_KVM_SYSTEM_TIME saved in vcpu->arch.time is reset at kvmclock_reset(), which is called by kvm_vcpu_reset() (the KVM INIT handler).
In addition, vmx_vcpu_reset(), called from kvm_vcpu_reset(), also resets multiple MSRs such as MSR_IA32_SPEC_CTRL (vmx->spec_ctrl) and MSR_IA32_UMWAIT_CONTROL (msr_ia32_umwait_control).

Having said that, I see indeed that most MSRs are being set by QEMU in kvm_put_msrs() when level >= KVM_PUT_RESET_STATE. This is triggered by qemu_system_reset() -> cpu_synchronize_all_post_reset() -> cpu_synchronize_post_reset() -> kvm_cpu_synchronize_post_reset().

So given current design, OK I agree with you that CR pinned MSRs should be zeroed by userspace VMM.

It does seem kinda weird to me, though, that part of the CPU state is initialised in the KVM INIT handler and part of it in the userspace VMM.
It could lead to inconsistent (i.e. diverging from spec) CPU behaviour.

-Liran

2019-12-23 17:47:02

by Paolo Bonzini

Subject: Re: [RESEND RFC 0/2] Paravirtualized Control Register pinning

On 23/12/19 18:28, Liran Alon wrote:
>>> Why reset CR pinned MSRs by userspace instead of KVM INIT
>>> handling?
>> Most MSRs are not reset by INIT, are they?
>>
>> Paolo
>>
> MSR_KVM_SYSTEM_TIME saved in vcpu->arch.time is reset at
> kvmclock_reset() which is called by kvm_vcpu_reset() (KVM INIT
> handler). In addition, vmx_vcpu_reset(), called from
> kvm_vcpu_reset(), also resets multiple MSRs such as:
> MSR_IA32_SPEC_CTRL (vmx->spec_ctrl) and MSR_IA32_UMWAIT_CONTROL
> (msr_ia32_umwait_control).

These probably can be removed, since they are zero at startup and at
least SPEC_CTRL is documented[1] to be unaffected by INIT. However, I
couldn't find information about UMWAIT_CONTROL.

> Having said that, I see indeed that most MSRs are being set by
> QEMU in kvm_put_msrs() when level >= KVM_PUT_RESET_STATE. This is
> triggered by qemu_system_reset() -> cpu_synchronize_all_post_reset ->
> cpu_synchronize_post_reset() -> kvm_cpu_synchronize_post_reset().
>
> So given current design, OK I agree with you that CR pinned MSRs
> should be zeroed by userspace VMM.
>
> It does seem kinda weird to me, though, that part of the CPU state is
> initialised in the KVM INIT handler and part of it in the userspace VMM. It
> could lead to inconsistent (i.e. diverging from spec) CPU behaviour.

The reason for that is that even on real hardware INIT does not touch
most MSRs:

9.1 Initialization overview

Asserting the INIT# pin on the processor invokes a similar response to
a hardware reset. The major difference is that during an INIT, the
internal caches, MSRs, MTRRs, and x87 FPU state are left unchanged
(although, the TLBs and BTB are invalidated as with a hardware reset).
An INIT provides a method for switching from protected to real-address
mode while maintaining the contents of the internal caches.

Paolo

[1]
https://software.intel.com/security-software-guidance/api-app/sites/default/files/336996-Speculative-Execution-Side-Channel-Mitigations.pdf

2019-12-23 22:52:56

by Liran Alon

Subject: Re: [RESEND RFC 0/2] Paravirtualized Control Register pinning



> On 23 Dec 2019, at 19:46, Paolo Bonzini <[email protected]> wrote:
>
> On 23/12/19 18:28, Liran Alon wrote:
>>>> Why reset CR pinned MSRs by userspace instead of KVM INIT
>>>> handling?
>>> Most MSRs are not reset by INIT, are they?
>>>
>>> Paolo
>>>
>> MSR_KVM_SYSTEM_TIME saved in vcpu->arch.time is reset at
>> kvmclock_reset() which is called by kvm_vcpu_reset() (KVM INIT
>> handler). In addition, vmx_vcpu_reset(), called from
>> kvm_vcpu_reset(), also resets multiple MSRs such as:
>> MSR_IA32_SPEC_CTRL (vmx->spec_ctrl) and MSR_IA32_UMWAIT_CONTROL
>> (msr_ia32_umwait_control).
>
> These probably can be removed, since they are zero at startup and at
> least SPEC_CTRL is documented[1] to be unaffected by INIT. However, I
> couldn't find information about UMWAIT_CONTROL.
>
>> Having said that, I see indeed that most MSRs are being set by
>> QEMU in kvm_put_msrs() when level >= KVM_PUT_RESET_STATE. This is
>> triggered by qemu_system_reset() -> cpu_synchronize_all_post_reset ->
>> cpu_synchronize_post_reset() -> kvm_cpu_synchronize_post_reset().
>>
>> So given current design, OK I agree with you that CR pinned MSRs
>> should be zeroed by userspace VMM.
>>
>> It does seem kinda weird to me, though, that part of the CPU state is
>> initialised in the KVM INIT handler and part of it in the userspace VMM. It
>> could lead to inconsistent (i.e. diverging from spec) CPU behaviour.
>
> The reason for that is that even on real hardware INIT does not touch
> most MSRs:
>
> 9.1 Initialization overview
>
> Asserting the INIT# pin on the processor invokes a similar response to
> a hardware reset. The major difference is that during an INIT, the
> internal caches, MSRs, MTRRs, and x87 FPU state are left unchanged
> (although, the TLBs and BTB are invalidated as with a hardware reset).
> An INIT provides a method for switching from protected to real-address
> mode while maintaining the contents of the internal caches.
>
> Paolo
>
> [1]
> https://software.intel.com/security-software-guidance/api-app/sites/default/files/336996-Speculative-Execution-Side-Channel-Mitigations.pdf
>

Oh right. That’s true.
There is a difference between power-up or asserting the RESET# pin, and asserting the INIT# pin or receiving an INIT IPI.
The former resets all of the processor state, while the latter resets only part of it, as Intel SDM section 9.1 indeed describes.

So basically, because KVM is only aware of the INIT IPI but not of the power-up and RESET# pin assertion events,
the job of emulating the latter events is currently implemented in the userspace VMM.

Having said that, one could argue that because KVM implements the vCPU, it should just expose a relevant ioctl
for the userspace VMM to trigger a “full vCPU state reset” on power-up or RESET# pin assertion events.
i.e. Making the userspace VMM only the one which emulates the peripheral hardware that triggers the event,
but not the one implementing the vCPU emulation that responds to this event. I think this would have resulted in a cleaner design.

But we are diverging from the original patch-series discussion, so if we want to discuss this further, let’s open a separate email thread on that. :)

-Liran

2019-12-24 19:45:25

by Andersen, John

Subject: Re: [RESEND RFC 0/2] Paravirtualized Control Register pinning

On Mon, 2019-12-23 at 16:48 +0200, Liran Alon wrote:
> > On 20 Dec 2019, at 21:26, John Andersen <[email protected]>
> > wrote:
> >
> > Pinning is not active when running in SMM. Entering SMM disables
> > pinned
> > bits; writes to control registers within SMM would therefore
> > trigger
> > general protection faults if pinning was enforced.
>
> For compatibility reasons, it’s reasonable that pinning won’t be
> active when running in SMM.
> However, I do think we should not allow vSMM code to change pinned
> values when returning back from SMM.
> This would prevent a vulnerable vSMI handler from modifying vSMM
> state-area to modify CR4 when running outside of vSMM.
> I believe in this case it’s legit to just forcibly restore original
> CR0/CR4 pinned values. Ignoring vSMM changes.
>

In em_rsm could we just OR with the value of the PINNED MSRs right
before the final return?

> > The guest may never read pinned bits. If an attacker were to read
> > the
> > CR pinned MSRs, they might decide to perform another attack which
> > would
> > not cause a general protection fault.
>
> I disagree with this statement.
> An attacker knows what is the system it is attacking and can deduce
> by that which bits it pinned…
> Therefore, protecting from guest reading these is not important at
> all.
>

Sure, I'll make it readable.

> > Should userspace expose the CR pinning CPUID feature bit, it must
> > zero CR
> > pinned MSRs on reboot. If it does not, it runs the risk of having
> > the
> > guest enable pinning and subsequently cause general protection
> > faults on
> > next boot due to early boot code setting control registers to
> > values
> > which do not contain the pinned bits.
>
> Why reset CR pinned MSRs by userspace instead of KVM INIT handling?
>
> > When running with KVM guest support and paravirtualized CR pinning
> > enabled, paravirtualized and existing pinning are setup at the same
> > point on the boot CPU. Non-boot CPUs setup pinning upon
> > identification.
> >
> > Guests using the kexec system call currently do not support
> > paravirtualized control register pinning. This is due to early boot
> > code writing known good values to control registers; these values
> > do
> > not contain the protected bits. This is due to CPU feature
> > identification being done at a later time, when the kernel properly
> > checks if it can enable protections.
> >
> > Most distributions enable kexec. However, kexec could be made boot
> > time
> > disableable. In this case if a user has disabled kexec at boot time
> > the guest will request that paravirtualized control register
> > pinning
> > be enabled. This would expand the userbase to users of major
> > distributions.
> >
> > Paravirtualized CR pinning will likely be incompatible with kexec
> > for
> > the foreseeable future. Early boot code could possibly be changed
> > to
> > not clear protected bits. However, a kernel that requests CR bits
> > be
> > pinned can't know if the kernel it's kexecing has been updated to
> > not
> > clear protected bits. This would result in the kernel being kexec'd
> > almost immediately receiving a general protection fault.
>
> Instead of disabling kexec entirely, I think it makes more sense to
> invent
> some generic mechanism in which new kernel can describe to old kernel
> a set of flags that specifies which features hand-over it supports.
> One of them
> being pinned CRs.
>
> For example, isn’t this also relevant for IOMMU DMA protection?
> i.e. Doesn’t old kernel need to know if it should disable or enable
> IOMMU DMAR
> before kexec to new kernel? Similar to EDK2 IOMMU DMA protection
> hand-over?

Great idea.

Making kexec work will require changes to these files and maybe more:

arch/x86/boot/compressed/head_64.S
arch/x86/kernel/head_64.S
arch/x86/kernel/relocate_kernel_64.S

Which my previous attempts showed different results when running
virtualized vs. unvirtualized. Specifically, different behavior with SMAP
and UMIP bits.

This would be a longer process though, as validating that everything
still works in both the VM and on physical hosts will be required. As
it stands this patchset could pick up a fairly large userbase via the
virtualized container projects. Should we pursue kexec in this patchset
or a later one?

Thanks,
John

2019-12-24 20:41:53

by Liran Alon

Subject: Re: [RESEND RFC 0/2] Paravirtualized Control Register pinning



> On 24 Dec 2019, at 21:44, Andersen, John S <[email protected]> wrote:
>
> On Mon, 2019-12-23 at 16:48 +0200, Liran Alon wrote:
>>> On 20 Dec 2019, at 21:26, John Andersen <[email protected]>
>>> wrote:
>>>
>>> Pinning is not active when running in SMM. Entering SMM disables
>>> pinned
>>> bits; writes to control registers within SMM would therefore
>>> trigger
>>> general protection faults if pinning was enforced.
>>
>> For compatibility reasons, it’s reasonable that pinning won’t be
>> active when running in SMM.
>> However, I do think we should not allow vSMM code to change pinned
>> values when returning back from SMM.
>> This would prevent a vulnerable vSMI handler from modifying vSMM
>> state-area to modify CR4 when running outside of vSMM.
>> I believe in this case it’s legit to just forcibly restore original
>> CR0/CR4 pinned values. Ignoring vSMM changes.
>>
>
> In em_rsm could we just OR with the value of the PINNED MSRs right
> before the final return?

Not exactly.

If I understand correctly, the proposed mechanism should also allow pinning specific
system register (e.g. CR0/CR4) bits to either being cleared or being set, not necessarily being set,
as kvm_set_cr0() & kvm_set_cr4() were modified to only check whether a pinned bit changed value.

Therefore, you should modify enter_smm() to save the current pinned bits' values
and then silently restore their values in em_rsm().
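
As a rough sketch, with illustrative names only (the real patches may
track this differently):

struct example_pin_state {
	unsigned long cr0, cr4;
	unsigned long cr0_pinned_mask, cr4_pinned_mask;
	unsigned long smm_cr0_pinned, smm_cr4_pinned;
};

/* enter_smm(): remember the pinned bits' current values. */
static void example_smm_save(struct example_pin_state *s)
{
	s->smm_cr0_pinned = s->cr0 & s->cr0_pinned_mask;
	s->smm_cr4_pinned = s->cr4 & s->cr4_pinned_mask;
}

/* em_rsm(): overwrite whatever the SMM handler left in the saved state
 * with the recorded pinned values, both 0s and 1s, which is why a
 * plain OR is not enough. */
static void example_rsm_restore(struct example_pin_state *s)
{
	s->cr0 = (s->cr0 & ~s->cr0_pinned_mask) | s->smm_cr0_pinned;
	s->cr4 = (s->cr4 & ~s->cr4_pinned_mask) | s->smm_cr4_pinned;
}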

>
>>> The guest may never read pinned bits. If an attacker were to read
>>> the
>>> CR pinned MSRs, they might decide to perform another attack which
>>> would
>>> not cause a general protection fault.
>>
>> I disagree with this statement.
>> An attacker knows what is the system it is attacking and can deduce
>> by that which bits it pinned…
>> Therefore, protecting from guest reading these is not important at
>> all.
>>
>
> Sure, I'll make it readable.
>
>>> Should userspace expose the CR pinning CPUID feature bit, it must
>>> zero CR
>>> pinned MSRs on reboot. If it does not, it runs the risk of having
>>> the
>>> guest enable pinning and subsequently cause general protection
>>> faults on
>>> next boot due to early boot code setting control registers to
>>> values
>>> which do not contain the pinned bits.
>>
>> Why reset CR pinned MSRs by userspace instead of KVM INIT handling?
>>
>>> When running with KVM guest support and paravirtualized CR pinning
>>> enabled, paravirtualized and existing pinning are setup at the same
>>> point on the boot CPU. Non-boot CPUs setup pinning upon
>>> identification.
>>>
>>> Guests using the kexec system call currently do not support
>>> paravirtualized control register pinning. This is due to early boot
>>> code writing known good values to control registers; these values
>>> do
>>> not contain the protected bits. This is due to CPU feature
>>> identification being done at a later time, when the kernel properly
>>> checks if it can enable protections.
>>>
>>> Most distributions enable kexec. However, kexec could be made boot
>>> time
>>> disableable. In this case if a user has disabled kexec at boot time
>>> the guest will request that paravirtualized control register
>>> pinning
>>> be enabled. This would expand the userbase to users of major
>>> distributions.
>>>
>>> Paravirtualized CR pinning will likely be incompatible with kexec
>>> for
>>> the foreseeable future. Early boot code could possibly be changed
>>> to
>>> not clear protected bits. However, a kernel that requests CR bits
>>> be
>>> pinned can't know if the kernel it's kexecing has been updated to
>>> not
>>> clear protected bits. This would result in the kernel being kexec'd
>>> almost immediately receiving a general protection fault.
>>
>> Instead of disabling kexec entirely, I think it makes more sense to
>> invent
>> some generic mechanism in which new kernel can describe to old kernel
>> a set of flags that specifies which features hand-over it supports.
>> One of them
>> being pinned CRs.
>>
>> For example, isn’t this also relevant for IOMMU DMA protection?
>> i.e. Doesn’t old kernel need to know if it should disable or enable
>> IOMMU DMAR
>> before kexec to new kernel? Similar to EDK2 IOMMU DMA protection
>> hand-over?
>
> Great idea.
>
> Making kexec work will require changes to these files and maybe more:
>
> arch/x86/boot/compressed/head_64.S
> arch/x86/kernel/head_64.S
> arch/x86/kernel/relocate_kernel_64.S
>
> Which my previous attempts showed different results when running
> virtualized vs. unvirtualized. Specifically, different behavior with SMAP
> and UMIP bits.

I didn’t understand why there should be different results when running virtualized or not.

What I suggested is to just add metadata in a vmlinux ELF section that will be designated to
describe vmlinux hand-over capabilities.

Then, kexec functionality can be modified to read this section before loading & jumping
to the new kernel vmlinux.

In the context of this patch-set, this section will specify a flag indicating whether the new vmlinux supports
CR0/CR4 pinning hand-over. If it doesn't, and the current kernel has already pinned these values,
kexec should just return failure.

Just for example, in the context of IOMMU DMA protection hand-over, kexec will
use another flag to know if the new vmlinux supports IOMMU DMA protection hand-over,
and if not, make sure to disable the IOMMU before jumping to the new vmlinux.
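
As a rough sketch of the consumer side (the section name and flag bit
are made up for illustration):

#include <elf.h>
#include <stdint.h>
#include <string.h>

#define HANDOVER_CR_PINNING	(1u << 0)	/* hypothetical flag */

/* Return the hand-over capability flags advertised by a vmlinux image
 * mapped at 'image', or 0 if the (hypothetical) ".handover" section is
 * absent. */
static uint32_t example_handover_flags(const uint8_t *image)
{
	const Elf64_Ehdr *eh = (const Elf64_Ehdr *)image;
	const Elf64_Shdr *sh = (const Elf64_Shdr *)(image + eh->e_shoff);
	const char *strs = (const char *)(image + sh[eh->e_shstrndx].sh_offset);
	uint32_t flags;
	int i;

	for (i = 0; i < eh->e_shnum; i++) {
		if (strcmp(strs + sh[i].sh_name, ".handover"))
			continue;
		memcpy(&flags, image + sh[i].sh_offset, sizeof(flags));
		return flags;
	}
	return 0;
}

The old kernel would then fail the kexec when it has pinned CRs and
(example_handover_flags(image) & HANDOVER_CR_PINNING) == 0.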

>
> This would be a longer process though. As validating that everything
> still works in both the VM and on physical hosts will be required. As
> it stands this patchset could pick up a fairly large userbase via the
> virtualized container projects. Should we pursue kexec in this patchset
> or a later one?

In my opinion, this should be handled as part of this patch-series.
Otherwise, you basically introduce a kernel change that breaks existing functionality
in an unpredictable manner for the user.

i.e. It’s OK if the kernel wouldn’t have allowed use of the system register pinning functionality
unless the user had also disabled kexec via boot-time configuration. But it’s not OK for
kernel behaviour to change such that kexec suddenly crashes if the kernel was upgraded
to now use system register pinning.

My personal opinion though is that kexec should first be enhanced to read hand-over metadata
as I described, and only then apply this patch-series, which also modifies kexec to define and use
one of the hand-over bits as I mentioned above. But I would like to hear additional opinions on this.

In addition, I would like to understand the Linux security maintainers' point of view on the comment
I mentioned in another reply regarding how this mechanism is implemented in Hyper-V.

-Liran

>
> Thanks,
> John

2019-12-24 21:18:08

by Andersen, John

Subject: Re: [RESEND RFC 0/2] Paravirtualized Control Register pinning

On Tue, 2019-12-24 at 22:35 +0200, Liran Alon wrote:
> > On 24 Dec 2019, at 21:44, Andersen, John S <
> > [email protected]> wrote:
> > In em_rsm could we just OR with the value of the PINNED MSRs right
> > before the final return?
>
> Not exactly.
>
> If I understand correctly, the proposed mechanism should also allow
> pinning specific
> system register (e.g. CR0/CR4) bits to either being cleared or being
> set, not necessarily being set,
> as kvm_set_cr0() & kvm_set_cr4() were modified to only check whether
> a pinned bit changed value.
>
> Therefore, you should modify enter_smm() to save the current pinned
> bits' values
> and then silently restore their values in em_rsm().
>

That's true. Sounds good.

> >
> > Making kexec work will require changes to these files and maybe
> > more:
> >
> > arch/x86/boot/compressed/head_64.S
> > arch/x86/kernel/head_64.S
> > arch/x86/kernel/relocate_kernel_64.S
> >
> > Which my previous attempts showed different results when running
> > virtualized vs. unvirtualized. Specifically, different behavior with
> > SMAP
> > and UMIP bits.
>
> I didn’t understand why there should be different results when
> running virtualized or not.
>

I think it's either a bug in KVM's enforcement of SMAP and UMIP, or a
bug in the hardware I'm on. If I do the same as below but with SMAP
and UMIP, the physical host doesn't boot, but the virtualized host
boots fine and can kexec without issue.

diff --git a/arch/x86/boot/compressed/head_64.S b/arch/x86/boot/compressed/head_64.S
index 537f7d2bfb17a..dc50634fa674e 100644
--- a/arch/x86/boot/compressed/head_64.S
+++ b/arch/x86/boot/compressed/head_64.S
@@ -126,7 +126,7 @@ ENTRY(startup_32)
 
 	/* Enable PAE mode */
 	movl	%cr4, %eax
-	orl	$X86_CR4_PAE, %eax
+	orl	$(X86_CR4_PAE | X86_CR4_SMEP), %eax
 	movl	%eax, %cr4
 
 	/*
@@ -614,10 +614,10 @@ ENTRY(trampoline_32bit_src)
 	popl	%ecx
 
 	/* Enable PAE and LA57 (if required) paging modes */
-	movl	$X86_CR4_PAE, %eax
+	movl	$(X86_CR4_PAE | X86_CR4_SMEP), %eax
 	cmpl	$0, %edx
 	jz	1f
-	orl	$X86_CR4_LA57, %eax
+	orl	$(X86_CR4_LA57 | X86_CR4_SMEP), %eax
 1:
 	movl	%eax, %cr4
 
diff --git a/arch/x86/kernel/head_64.S b/arch/x86/kernel/head_64.S
index f3d3e9646a99b..d57ce48a40b5f 100644
--- a/arch/x86/kernel/head_64.S
+++ b/arch/x86/kernel/head_64.S
@@ -122,7 +122,7 @@ ENTRY(secondary_startup_64)
 1:
 
 	/* Enable PAE mode, PGE and LA57 */
-	movl	$(X86_CR4_PAE | X86_CR4_PGE), %ecx
+	movl	$(X86_CR4_PAE | X86_CR4_PGE | X86_CR4_SMEP), %ecx
 #ifdef CONFIG_X86_5LEVEL
 	testl	$1, __pgtable_l5_enabled(%rip)
 	jz	1f
diff --git a/arch/x86/kernel/relocate_kernel_64.S b/arch/x86/kernel/relocate_kernel_64.S
index b0e558b473f9b..38acd530622cc 100644
--- a/arch/x86/kernel/relocate_kernel_64.S
+++ b/arch/x86/kernel/relocate_kernel_64.S
@@ -129,7 +129,7 @@ identity_mapped:
 	 * - physical address extension enabled
 	 * - 5-level paging, if it was enabled before
 	 */
-	movl	$X86_CR4_PAE, %eax
+	movl	$(X86_CR4_PAE | X86_CR4_SMEP), %eax
 	testq	$X86_CR4_LA57, %r13
 	jz	1f
 	orl	$X86_CR4_LA57, %eax


I realize this isn't the right way to do this. But it was just my first
attempt at it. We should probably just change it to be something along
the lines of, if the bit was already set, keep it set. I probably just
need to go further down that road.
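
In C, for clarity, that "keep it set" approach would look something like
this (early boot does it in assembly, so this is only the shape of the
logic, not a patch):

#include <asm/processor-flags.h>
#include <asm/special_insns.h>

static void example_enable_pae_preserving_pins(void)
{
	/* Read-modify-write rather than loading an absolute value, so any
	 * bit that was already set (and possibly pinned) stays set. */
	unsigned long cr4 = native_read_cr4();

	cr4 |= X86_CR4_PAE;	/* bits early boot must enable */
	native_write_cr4(cr4);
}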


> What I suggested is to just add metadata in a vmlinux ELF section
> that will be designated to
> describe vmlinux hand-over capabilities.
>
> Then, kexec functionality can be modified to read this section before
> loading & jumping
> to the new kernel vmlinux.
>
> In the context of this patch-set, this section will specify a flag
> indicating whether the new vmlinux supports
> CR0/CR4 pinning hand-over. If it doesn't, and the current kernel has
> already pinned these values,
> kexec should just return failure.
>
> Just for example, in the context of IOMMU DMA protection hand-over,
> kexec will
> use another flag to know if the new vmlinux supports IOMMU DMA
> protection hand-over,
> and if not, make sure to disable the IOMMU before jumping to the new vmlinux.
>

Ah okay I'm sorry I misunderstood.

> > This would be a longer process though. As validating that
> > everything
> > still works in both the VM and on physical hosts will be required.
> > As
> > it stands this patchset could pick up a fairly large userbase via
> > the
> > virtualized container projects. Should we pursue kexec in this
> > patchset
> > or a later one?
>
> In my opinion, this should be handled as part of this patch-series.
> Otherwise, you basically introduce a kernel change that breaks
> existing functionality
> in an unpredictable manner for the user.
>
> i.e. It’s OK if the kernel wouldn’t have allowed use of the system
> register pinning functionality
> unless the user had also disabled kexec via boot-time configuration.
> But it’s not OK for
> kernel behaviour to change such that kexec suddenly crashes if the
> kernel was upgraded
> to now use system register pinning.
>
> My personal opinion though is that kexec should first be enhanced to
> read hand-over metadata
> as I described. And only then apply this patch-series which also
> modifies kexec to define and use
> one of the hand-over bits as I mentioned above. But I would like to
> hear additional opinions on this.
>

I will check this out. Thanks.

> In addition, I would like to understand Linux security maintainers
> point of view on the comment
> I mentioned in another reply regarding how this mechanism is
> implemented in Hyper-V.
>

I agree. I did see Hyper-V's interface definitions and didn't understand
it to be that simple. It seemed like there was a significant amount of
pre-existing hypercall infrastructure that it was a part of. My
understanding was that we'd have to replicate that, and then only
implement the subset that we care about. It seemed like overkill to get
this enabled. The goal here has been to get this CR protection in as a
simple interface. It would of course be nice to have all those
protections as well though.

Thanks,
John

2019-12-24 22:59:00

by Liran Alon

Subject: Re: [RESEND RFC 0/2] Paravirtualized Control Register pinning

+kernel-hardening mailing list.

> On 23 Dec 2019, at 16:30, Liran Alon <[email protected]> wrote:
>
>
>
>> On 20 Dec 2019, at 21:26, John Andersen <[email protected]> wrote:
>>
>> Paravirtualized Control Register pinning is a strengthened version of
>> existing protections on the Write Protect, Supervisor Mode Execution /
>> Access Protection, and User-Mode Instruction Prevention bits. The
>> existing protections prevent native_write_cr*() functions from writing
>> values which disable those bits. This patchset prevents any guest
>> writes to control registers from disabling pinned bits, not just writes
>> from native_write_cr*(). This stops attackers within the guest from
>> using ROP to disable protection bits.
>>
>> https://web.archive.org/web/20171029060939/http://www.blackbunny.io/linux-kernel-x86-64-bypass-smep-kaslr-kptr_restric/
>>
>> The protection is implemented by adding MSRs to KVM which contain the
>> bits that are allowed to be pinned, and the bits which are pinned. The
>> guest or userspace can enable bit pinning by reading MSRs to check
>> which bits are allowed to be pinned, and then writing MSRs to set which
>> bits they want pinned.
>>
>> Other hypervisors, such as Hyper-V, have implemented similar protections
>> for control registers and MSRs, which security researchers have found
>> effective.
>>
>> https://www.abatchy.com/2018/01/kernel-exploitation-4
>>
>
> I think it’s important to mention how Hyper-V implements this protection, as it is done with a very different architecture.
>
> Hyper-V implements a set of PV APIs named VSM (Virtual Secure Mode) aimed at allowing a guest (partition) to separate itself into multiple security domains called VTLs (Virtual Trust Levels).
> The VSM API exposes an interface for higher VTLs to control the execution of lower VTLs. In theory, VSM supports up to 16 VTLs, but Windows VBS (Virtualization Based Security), currently
> the only technology which utilises VSM, uses only 2 VTLs: VTL0 for most of OS execution (Normal-Mode) and VTL1 for secure OS execution (Secure-Mode).
>
> A higher VTL controls execution of a lower VTL by the following VSM mechanisms:
> 1) Memory Access Protections: Allows a higher VTL to restrict memory access to physical pages, either making them inaccessible or limited to certain permissions.
> 2) Secure Intercepts: Allows a higher VTL to request that the hypervisor intercept certain events in lower VTLs for handling by the higher VTL. This includes access to system registers (e.g. CRs & MSRs).
>
> VBS uses the above-mentioned mechanisms as follows:
> a) Credential Guard: Prevents pass-the-hash attacks. Done by having a VTL1 trustlet encrypt credentials with an encryption key stored in VTL1-only accessible memory.
> b) HVCI (Hypervisor-based Code-Integrity): Prevents execution of unsigned code. Done by marking all EPT entries NX until their signatures are verified by a VTL1 service. Once verified, EPT entries are marked RO+X.
> (HVCI also supports efficiently enforcing code-signing only on Ring0 code by utilising the Intel MBEC or AMD GMET CPU features, which allow setting the NX-bit on EPT entries based on guest CPL.)
> c) KDP (Kernel Data Protection): Marks certain pages after initialisation as read-only in the VTL0 EPT.
> d) kCFG (Kernel Control-Flow Guard): VTL1 protects the bitmap specifying valid indirect branch targets by marking it read-only in the VTL0 EPT.
> e) HyperGuard: VTL1 uses the “Secure Intercepts” mechanism to prevent VTL0 from modifying important system registers, including CR0 & CR4 as done by this patch.
> HyperGuard also implements a mechanism named NPIEP (Non-Privileged Instruction Execution Prevention) that prevents VTL0 Ring3 from executing SIDT/SGDT/SLDT/STR to leak Ring0 addresses.
>
> To sum up: in Hyper-V, the hypervisor exposes a relatively thin API that allows a guest to partition itself into multiple security domains (enforced by virtualization).
> Using this framework, it’s possible to implement multiple OS-level protection mechanisms. Only one of them is pinning certain registers to specific values, as done by this patch.
>
> Therefore, as I also tried to say at the recent KVM Forum, I think KVM should consider exposing a VSM-like API to guests to allow various guest OSes,
> including Linux, to implement VBS-like features. To decide how this API should look, we need to have a broader discussion with Linux
> security maintainers and KVM maintainers on which security features we would like to implement using such an API and what their architecture should be.
> Then, we can implement this API in KVM and start to gradually introduce more security features in Linux which utilise this API.
>
> Once Linux has security features implemented with this new KVM API, we could also consider implementing them on top of other similar hypervisor APIs
> such as Hyper-V VSM, to achieve, for example, Linux being more secure when running on Microsoft Azure compute instances.
>
> Therefore, I see this patch as a short-term solution to quickly gain real security value on a very specific issue.
> But if we are serious about improving Linux security using virtualization, we should have this broader discussion.
>
> -Liran
>

2019-12-25 02:06:01

by Andy Lutomirski

Subject: Re: [RESEND RFC 0/2] Paravirtualized Control Register pinning

On Mon, Dec 23, 2019 at 6:31 AM Liran Alon <[email protected]> wrote:
>
>
>
> > On 20 Dec 2019, at 21:26, John Andersen <[email protected]> wrote:
> >
> > Paravirtualized Control Register pinning is a strengthened version of
> > existing protections on the Write Protect, Supervisor Mode Execution /
> > Access Protection, and User-Mode Instruction Prevention bits. The
> > existing protections prevent native_write_cr*() functions from writing
> > values which disable those bits. This patchset prevents any guest
> > writes to control registers from disabling pinned bits, not just writes
> > from native_write_cr*(). This stops attackers within the guest from
> > using ROP to disable protection bits.
> >
> > https://web.archive.org/web/20171029060939/http://www.blackbunny.io/linux-kernel-x86-64-bypass-smep-kaslr-kptr_restric/
> >
> > The protection is implemented by adding MSRs to KVM which contain the
> > bits that are allowed to be pinned, and the bits which are pinned. The
> > guest or userspace can enable bit pinning by reading MSRs to check
> > which bits are allowed to be pinned, and then writing MSRs to set which
> > bits they want pinned.
> >
> > Other hypervisors, such as Hyper-V, have implemented similar protections
> > for control registers and MSRs, which security researchers have found
> > effective.
> >
> > https://www.abatchy.com/2018/01/kernel-exploitation-4
> >
>
> I think it’s important to mention how Hyper-V implements this protection, as it is done with a very different architecture.
>
> Hyper-V implements a set of PV APIs named VSM (Virtual Secure Mode) aimed at allowing a guest (partition) to separate itself into multiple security domains called VTLs (Virtual Trust Levels).
> The VSM API exposes an interface for higher VTLs to control the execution of lower VTLs. In theory, VSM supports up to 16 VTLs, but Windows VBS (Virtualization Based Security), currently
> the only technology which utilises VSM, uses only 2 VTLs: VTL0 for most of OS execution (Normal-Mode) and VTL1 for secure OS execution (Secure-Mode).
>
> A higher VTL controls execution of a lower VTL by the following VSM mechanisms:
> 1) Memory Access Protections: Allows a higher VTL to restrict memory access to physical pages, either making them inaccessible or limited to certain permissions.
> 2) Secure Intercepts: Allows a higher VTL to request that the hypervisor intercept certain events in lower VTLs for handling by the higher VTL. This includes access to system registers (e.g. CRs & MSRs).
>
> VBS uses the above-mentioned mechanisms as follows:
> a) Credential Guard: Prevents pass-the-hash attacks. Done by having a VTL1 trustlet encrypt credentials with an encryption key stored in VTL1-only accessible memory.
> b) HVCI (Hypervisor-based Code-Integrity): Prevents execution of unsigned code. Done by marking all EPT entries NX until their signatures are verified by a VTL1 service. Once verified, EPT entries are marked RO+X.
> (HVCI also supports efficiently enforcing code-signing only on Ring0 code by utilising the Intel MBEC or AMD GMET CPU features, which allow setting the NX-bit on EPT entries based on guest CPL.)
> c) KDP (Kernel Data Protection): Marks certain pages after initialisation as read-only in the VTL0 EPT.
> d) kCFG (Kernel Control-Flow Guard): VTL1 protects the bitmap specifying valid indirect branch targets by marking it read-only in the VTL0 EPT.
> e) HyperGuard: VTL1 uses the “Secure Intercepts” mechanism to prevent VTL0 from modifying important system registers, including CR0 & CR4 as done by this patch.
> HyperGuard also implements a mechanism named NPIEP (Non-Privileged Instruction Execution Prevention) that prevents VTL0 Ring3 from executing SIDT/SGDT/SLDT/STR to leak Ring0 addresses.
>
> To sum up: in Hyper-V, the hypervisor exposes a relatively thin API that allows a guest to partition itself into multiple security domains (enforced by virtualization).
> Using this framework, it’s possible to implement multiple OS-level protection mechanisms. Only one of them is pinning certain registers to specific values, as done by this patch.
>
> Therefore, as I also tried to say at the recent KVM Forum, I think KVM should consider exposing a VSM-like API to guests to allow various guest OSes,
> including Linux, to implement VBS-like features. To decide how this API should look, we need to have a broader discussion with Linux
> security maintainers and KVM maintainers on which security features we would like to implement using such an API and what their architecture should be.
> Then, we can implement this API in KVM and start to gradually introduce more security features in Linux which utilise this API.

How about having KVM implement the VSM API directly?

2019-12-25 13:07:35

by Liran Alon

Subject: Re: [RESEND RFC 0/2] Paravirtualized Control Register pinning



> On 25 Dec 2019, at 4:04, Andy Lutomirski <[email protected]> wrote:
>
> On Mon, Dec 23, 2019 at 6:31 AM Liran Alon <[email protected]> wrote:
>>
>>
>>
>>> On 20 Dec 2019, at 21:26, John Andersen <[email protected]> wrote:
>>>
>>> Paravirtualized Control Register pinning is a strengthened version of
>>> existing protections on the Write Protect, Supervisor Mode Execution /
>>> Access Protection, and User-Mode Instruction Prevention bits. The
>>> existing protections prevent native_write_cr*() functions from writing
>>> values which disable those bits. This patchset prevents any guest
>>> writes to control registers from disabling pinned bits, not just writes
>>> from native_write_cr*(). This stops attackers within the guest from
>>> using ROP to disable protection bits.
>>>
>>> https://web.archive.org/web/20171029060939/http://www.blackbunny.io/linux-kernel-x86-64-bypass-smep-kaslr-kptr_restric/
>>>
>>> The protection is implemented by adding MSRs to KVM which contain the
>>> bits that are allowed to be pinned, and the bits which are pinned. The
>>> guest or userspace can enable bit pinning by reading MSRs to check
>>> which bits are allowed to be pinned, and then writing MSRs to set which
>>> bits they want pinned.
>>>
>>> Other hypervisors, such as Hyper-V, have implemented similar protections
>>> for control registers and MSRs, which security researchers have found
>>> effective.
>>>
>>> https://www.abatchy.com/2018/01/kernel-exploitation-4
>>>
>>
>> I think it’s important to mention how Hyper-V implements this protection, as it is done with a very different architecture.
>>
>> Hyper-V implements a set of PV APIs named VSM (Virtual Secure Mode) aimed at allowing a guest (partition) to separate itself into multiple security domains called VTLs (Virtual Trust Levels).
>> The VSM API exposes an interface for higher VTLs to control the execution of lower VTLs. In theory, VSM supports up to 16 VTLs, but Windows VBS (Virtualization Based Security), currently
>> the only technology which utilises VSM, uses only 2 VTLs: VTL0 for most of OS execution (Normal-Mode) and VTL1 for secure OS execution (Secure-Mode).
>>
>> A higher VTL controls execution of a lower VTL by the following VSM mechanisms:
>> 1) Memory Access Protections: Allows a higher VTL to restrict memory access to physical pages, either making them inaccessible or limited to certain permissions.
>> 2) Secure Intercepts: Allows a higher VTL to request that the hypervisor intercept certain events in lower VTLs for handling by the higher VTL. This includes access to system registers (e.g. CRs & MSRs).
>>
>> VBS uses the above-mentioned mechanisms as follows:
>> a) Credential Guard: Prevents pass-the-hash attacks. Done by having a VTL1 trustlet encrypt credentials with an encryption key stored in VTL1-only accessible memory.
>> b) HVCI (Hypervisor-based Code-Integrity): Prevents execution of unsigned code. Done by marking all EPT entries NX until their signatures are verified by a VTL1 service. Once verified, EPT entries are marked RO+X.
>> (HVCI also supports efficiently enforcing code-signing only on Ring0 code by utilising the Intel MBEC or AMD GMET CPU features, which allow setting the NX-bit on EPT entries based on guest CPL.)
>> c) KDP (Kernel Data Protection): Marks certain pages after initialisation as read-only in the VTL0 EPT.
>> d) kCFG (Kernel Control-Flow Guard): VTL1 protects the bitmap specifying valid indirect branch targets by marking it read-only in the VTL0 EPT.
>> e) HyperGuard: VTL1 uses the “Secure Intercepts” mechanism to prevent VTL0 from modifying important system registers, including CR0 & CR4 as done by this patch.
>> HyperGuard also implements a mechanism named NPIEP (Non-Privileged Instruction Execution Prevention) that prevents VTL0 Ring3 from executing SIDT/SGDT/SLDT/STR to leak Ring0 addresses.
>>
>> To sum up: in Hyper-V, the hypervisor exposes a relatively thin API that allows a guest to partition itself into multiple security domains (enforced by virtualization).
>> Using this framework, it’s possible to implement multiple OS-level protection mechanisms. Only one of them is pinning certain registers to specific values, as done by this patch.
>>
>> Therefore, as I also tried to say at the recent KVM Forum, I think KVM should consider exposing a VSM-like API to guests to allow various guest OSes,
>> including Linux, to implement VBS-like features. To decide how this API should look, we need to have a broader discussion with Linux
>> security maintainers and KVM maintainers on which security features we would like to implement using such an API and what their architecture should be.
>> Then, we can implement this API in KVM and start to gradually introduce more security features in Linux which utilise this API.
>
> How about having KVM implement the VSM API directly?

The Hyper-V VSM API is tightly coupled to the rest of the Hyper-V PV interface. Therefore, KVM could only implement the VSM API as-is
as part of its Hyper-V PV interface emulation implementation. Because we don’t wish to expose the Hyper-V PV interface by default
to all KVM guests, KVM should have its own variant providing similar capabilities.

In addition, in my opinion there are some bad design choices in the VSM API which I haven’t mentioned in my previous message, and which a KVM
VSM-like API would maybe want to do differently to avoid those mistakes. For example, the VSM API by design assumes that a given VTL
is superior to and controls every aspect of all VTLs lower than it. In Windows VBS, this has caused the securekernel (Sk) running in VTL1 to
become part of the TCB and therefore significantly enlarge it. Contrast this with, for example, QubesOS, where the OS is split into security
domains that each have well-defined capabilities, but none has the full capabilities that VTL1 has in VBS. Therefore, it preserves only the
hypervisor in the TCB, as it should.

Having said that, I am already working on a patch-series to enhance the KVM Hyper-V PV interface implementation to also include VSM.
As I mentioned at the recent KVM Forum, I wish to do so so that a modern Windows OS with VBS support running on top of KVM does
not need to run Hyper-V inside the KVM guest (i.e. leading to a nested-virtualization workload). When Windows detects it’s already
running as a guest on top of Hyper-V with VSM support, it uses the underlying hypervisor's VSM API to implement VBS, without loading
Hyper-V inside the guest. This improves the performance & semantics of Windows VBS guests on top of KVM.

Note though that my work of implementing VSM in the KVM Hyper-V PV interface implementation isn’t related to the discussion here,
which is: how should Linux be modified to take advantage of a VSM-like API to implement security mitigation features, such as those
I described above, which Windows VBS implements on top of such an API. Deciding on the design of those features will also guide
what the KVM PV VSM-like API we implement should look like.

-Liran