2017-08-24 15:56:20

by Paolo Bonzini

[permalink] [raw]
Subject: [PATCH] KVM: MMU: speedup update_permission_bitmask

update_permission_bitmask currently does a 128-iteration loop to,
essentially, compute a constant array. Computing the 8 bits in parallel
reduces it to 16 iterations, and is enough to speed it up substantially
because many boolean operations in the inner loop become constants or
simplify noticeably.

Because update_permission_bitmask is actually the top item in the profile
for nested vmexits, this speeds up an L2->L1 vmexit by about ten thousand
clock cycles, or up to 30%:

before after
cpuid 35173 25954
vmcall 35122 27079
inl_from_pmtimer 52635 42675
inl_from_qemu 53604 44599
inl_from_kernel 38498 30798
outl_to_kernel 34508 28816
wr_tsc_adjust_msr 34185 26818
rd_tsc_adjust_msr 37409 27049
mmio-no-eventfd:pci-mem 50563 45276
mmio-wildcard-eventfd:pci-mem 34495 30823
mmio-datamatch-eventfd:pci-mem 35612 31071
portio-no-eventfd:pci-io 44925 40661
portio-wildcard-eventfd:pci-io 29708 27269
portio-datamatch-eventfd:pci-io 31135 27164

(I wrote a small C program to compare the tables for all values of CR0.WP,
CR4.SMAP and CR4.SMEP, and they match).

Signed-off-by: Paolo Bonzini <[email protected]>
---
arch/x86/kvm/mmu.c | 121 +++++++++++++++++++++++++++++++----------------------
1 file changed, 70 insertions(+), 51 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index f47cccace1a1..2a8a6e3e2a31 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -4204,66 +4204,85 @@ static inline bool boot_cpu_is_amd(void)
boot_cpu_data.x86_phys_bits, execonly);
}

+#define BYTE_MASK(access) \
+ ((1 & (access) ? 2 : 0) | \
+ (2 & (access) ? 4 : 0) | \
+ (3 & (access) ? 8 : 0) | \
+ (4 & (access) ? 16 : 0) | \
+ (5 & (access) ? 32 : 0) | \
+ (6 & (access) ? 64 : 0) | \
+ (7 & (access) ? 128 : 0))
+
+
static void update_permission_bitmask(struct kvm_vcpu *vcpu,
struct kvm_mmu *mmu, bool ept)
{
- unsigned bit, byte, pfec;
- u8 map;
- bool fault, x, w, u, wf, uf, ff, smapf, cr4_smap, cr4_smep, smap = 0;
+ unsigned byte;
+
+ const u8 x = BYTE_MASK(ACC_EXEC_MASK);
+ const u8 w = BYTE_MASK(ACC_WRITE_MASK);
+ const u8 u = BYTE_MASK(ACC_USER_MASK);
+
+ bool cr4_smep = kvm_read_cr4_bits(vcpu, X86_CR4_SMEP) != 0;
+ bool cr4_smap = kvm_read_cr4_bits(vcpu, X86_CR4_SMAP) != 0;
+ bool cr0_wp = is_write_protection(vcpu);

- cr4_smep = kvm_read_cr4_bits(vcpu, X86_CR4_SMEP);
- cr4_smap = kvm_read_cr4_bits(vcpu, X86_CR4_SMAP);
for (byte = 0; byte < ARRAY_SIZE(mmu->permissions); ++byte) {
- pfec = byte << 1;
- map = 0;
- wf = pfec & PFERR_WRITE_MASK;
- uf = pfec & PFERR_USER_MASK;
- ff = pfec & PFERR_FETCH_MASK;
+ unsigned pfec = byte << 1;
+
/*
- * PFERR_RSVD_MASK bit is set in PFEC if the access is not
- * subject to SMAP restrictions, and cleared otherwise. The
- * bit is only meaningful if the SMAP bit is set in CR4.
+ * Each "*f" variable has a 1 bit for each UWX value
+ * that causes a fault with the given PFEC.
*/
- smapf = !(pfec & PFERR_RSVD_MASK);
- for (bit = 0; bit < 8; ++bit) {
- x = bit & ACC_EXEC_MASK;
- w = bit & ACC_WRITE_MASK;
- u = bit & ACC_USER_MASK;
-
- if (!ept) {
- /* Not really needed: !nx will cause pte.nx to fault */
- x |= !mmu->nx;
- /* Allow supervisor writes if !cr0.wp */
- w |= !is_write_protection(vcpu) && !uf;
- /* Disallow supervisor fetches of user code if cr4.smep */
- x &= !(cr4_smep && u && !uf);
-
- /*
- * SMAP:kernel-mode data accesses from user-mode
- * mappings should fault. A fault is considered
- * as a SMAP violation if all of the following
- * conditions are ture:
- * - X86_CR4_SMAP is set in CR4
- * - A user page is accessed
- * - Page fault in kernel mode
- * - if CPL = 3 or X86_EFLAGS_AC is clear
- *
- * Here, we cover the first three conditions.
- * The fourth is computed dynamically in
- * permission_fault() and is in smapf.
- *
- * Also, SMAP does not affect instruction
- * fetches, add the !ff check here to make it
- * clearer.
- */
- smap = cr4_smap && u && !uf && !ff;
- }

- fault = (ff && !x) || (uf && !u) || (wf && !w) ||
- (smapf && smap);
- map |= fault << bit;
+ /* Faults from writes to non-writable pages */
+ u8 wf = (pfec & PFERR_WRITE_MASK) ? ~w : 0;
+ /* Faults from user mode accesses to supervisor pages */
+ u8 uf = (pfec & PFERR_USER_MASK) ? ~u : 0;
+ /* Faults from fetches of non-executable pages*/
+ u8 ff = (pfec & PFERR_FETCH_MASK) ? ~x : 0;
+ /* Faults from kernel mode fetches of user pages */
+ u8 smepf = 0;
+ /* Faults from kernel mode accesses of user pages */
+ u8 smapf = 0;
+
+ if (!ept) {
+ /* Faults from kernel mode accesses to user pages */
+ u8 kf = (pfec & PFERR_USER_MASK) ? 0 : u;
+
+ /* Not really needed: !nx will cause pte.nx to fault */
+ if (!mmu->nx)
+ ff = 0;
+
+ /* Allow supervisor writes if !cr0.wp */
+ if (!cr0_wp)
+ wf = (pfec & PFERR_USER_MASK) ? wf : 0;
+
+ /* Disallow supervisor fetches of user code if cr4.smep */
+ if (cr4_smep)
+ smepf = (pfec & PFERR_FETCH_MASK) ? kf : 0;
+
+ /*
+ * SMAP:kernel-mode data accesses from user-mode
+ * mappings should fault. A fault is considered
+ * as a SMAP violation if all of the following
+ * conditions are ture:
+ * - X86_CR4_SMAP is set in CR4
+ * - A user page is accessed
+ * - The access is not a fetch
+ * - Page fault in kernel mode
+ * - if CPL = 3 or X86_EFLAGS_AC is clear
+ *
+ * Here, we cover the first three conditions.
+ * The fourth is computed dynamically in permission_fault();
+ * PFERR_RSVD_MASK bit will be set in PFEC if the access is
+ * *not* subject to SMAP restrictions.
+ */
+ if (cr4_smap)
+ smapf = (pfec & (PFERR_RSVD_MASK|PFERR_FETCH_MASK)) ? 0 : kf;
}
- mmu->permissions[byte] = map;
+
+ mmu->permissions[byte] = ff | uf | wf | smepf | smapf;
}
}

--
1.8.3.1


2017-08-28 19:42:52

by Jim Mattson

[permalink] [raw]
Subject: Re: [PATCH] KVM: MMU: speedup update_permission_bitmask

Looks okay to me, but I'm hoping Peter will chime in.

Reviewed-by: Jim Mattson <[email protected]>

On Thu, Aug 24, 2017 at 8:56 AM, Paolo Bonzini <[email protected]> wrote:
> update_permission_bitmask currently does a 128-iteration loop to,
> essentially, compute a constant array. Computing the 8 bits in parallel
> reduces it to 16 iterations, and is enough to speed it up substantially
> because many boolean operations in the inner loop become constants or
> simplify noticeably.
>
> Because update_permission_bitmask is actually the top item in the profile
> for nested vmexits, this speeds up an L2->L1 vmexit by about ten thousand
> clock cycles, or up to 30%:
>
> before after
> cpuid 35173 25954
> vmcall 35122 27079
> inl_from_pmtimer 52635 42675
> inl_from_qemu 53604 44599
> inl_from_kernel 38498 30798
> outl_to_kernel 34508 28816
> wr_tsc_adjust_msr 34185 26818
> rd_tsc_adjust_msr 37409 27049
> mmio-no-eventfd:pci-mem 50563 45276
> mmio-wildcard-eventfd:pci-mem 34495 30823
> mmio-datamatch-eventfd:pci-mem 35612 31071
> portio-no-eventfd:pci-io 44925 40661
> portio-wildcard-eventfd:pci-io 29708 27269
> portio-datamatch-eventfd:pci-io 31135 27164
>
> (I wrote a small C program to compare the tables for all values of CR0.WP,
> CR4.SMAP and CR4.SMEP, and they match).
>
> Signed-off-by: Paolo Bonzini <[email protected]>
> ---
> arch/x86/kvm/mmu.c | 121 +++++++++++++++++++++++++++++++----------------------
> 1 file changed, 70 insertions(+), 51 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> index f47cccace1a1..2a8a6e3e2a31 100644
> --- a/arch/x86/kvm/mmu.c
> +++ b/arch/x86/kvm/mmu.c
> @@ -4204,66 +4204,85 @@ static inline bool boot_cpu_is_amd(void)
> boot_cpu_data.x86_phys_bits, execonly);
> }
>
> +#define BYTE_MASK(access) \
> + ((1 & (access) ? 2 : 0) | \
> + (2 & (access) ? 4 : 0) | \
> + (3 & (access) ? 8 : 0) | \
> + (4 & (access) ? 16 : 0) | \
> + (5 & (access) ? 32 : 0) | \
> + (6 & (access) ? 64 : 0) | \
> + (7 & (access) ? 128 : 0))
> +
> +
> static void update_permission_bitmask(struct kvm_vcpu *vcpu,
> struct kvm_mmu *mmu, bool ept)
> {
> - unsigned bit, byte, pfec;
> - u8 map;
> - bool fault, x, w, u, wf, uf, ff, smapf, cr4_smap, cr4_smep, smap = 0;
> + unsigned byte;
> +
> + const u8 x = BYTE_MASK(ACC_EXEC_MASK);
> + const u8 w = BYTE_MASK(ACC_WRITE_MASK);
> + const u8 u = BYTE_MASK(ACC_USER_MASK);
> +
> + bool cr4_smep = kvm_read_cr4_bits(vcpu, X86_CR4_SMEP) != 0;
> + bool cr4_smap = kvm_read_cr4_bits(vcpu, X86_CR4_SMAP) != 0;
> + bool cr0_wp = is_write_protection(vcpu);
>
> - cr4_smep = kvm_read_cr4_bits(vcpu, X86_CR4_SMEP);
> - cr4_smap = kvm_read_cr4_bits(vcpu, X86_CR4_SMAP);
> for (byte = 0; byte < ARRAY_SIZE(mmu->permissions); ++byte) {
> - pfec = byte << 1;
> - map = 0;
> - wf = pfec & PFERR_WRITE_MASK;
> - uf = pfec & PFERR_USER_MASK;
> - ff = pfec & PFERR_FETCH_MASK;
> + unsigned pfec = byte << 1;
> +
> /*
> - * PFERR_RSVD_MASK bit is set in PFEC if the access is not
> - * subject to SMAP restrictions, and cleared otherwise. The
> - * bit is only meaningful if the SMAP bit is set in CR4.
> + * Each "*f" variable has a 1 bit for each UWX value
> + * that causes a fault with the given PFEC.
> */
> - smapf = !(pfec & PFERR_RSVD_MASK);
> - for (bit = 0; bit < 8; ++bit) {
> - x = bit & ACC_EXEC_MASK;
> - w = bit & ACC_WRITE_MASK;
> - u = bit & ACC_USER_MASK;
> -
> - if (!ept) {
> - /* Not really needed: !nx will cause pte.nx to fault */
> - x |= !mmu->nx;
> - /* Allow supervisor writes if !cr0.wp */
> - w |= !is_write_protection(vcpu) && !uf;
> - /* Disallow supervisor fetches of user code if cr4.smep */
> - x &= !(cr4_smep && u && !uf);
> -
> - /*
> - * SMAP:kernel-mode data accesses from user-mode
> - * mappings should fault. A fault is considered
> - * as a SMAP violation if all of the following
> - * conditions are ture:
> - * - X86_CR4_SMAP is set in CR4
> - * - A user page is accessed
> - * - Page fault in kernel mode
> - * - if CPL = 3 or X86_EFLAGS_AC is clear
> - *
> - * Here, we cover the first three conditions.
> - * The fourth is computed dynamically in
> - * permission_fault() and is in smapf.
> - *
> - * Also, SMAP does not affect instruction
> - * fetches, add the !ff check here to make it
> - * clearer.
> - */
> - smap = cr4_smap && u && !uf && !ff;
> - }
>
> - fault = (ff && !x) || (uf && !u) || (wf && !w) ||
> - (smapf && smap);
> - map |= fault << bit;
> + /* Faults from writes to non-writable pages */
> + u8 wf = (pfec & PFERR_WRITE_MASK) ? ~w : 0;
> + /* Faults from user mode accesses to supervisor pages */
> + u8 uf = (pfec & PFERR_USER_MASK) ? ~u : 0;
> + /* Faults from fetches of non-executable pages*/
> + u8 ff = (pfec & PFERR_FETCH_MASK) ? ~x : 0;
> + /* Faults from kernel mode fetches of user pages */
> + u8 smepf = 0;
> + /* Faults from kernel mode accesses of user pages */
> + u8 smapf = 0;
> +
> + if (!ept) {
> + /* Faults from kernel mode accesses to user pages */
> + u8 kf = (pfec & PFERR_USER_MASK) ? 0 : u;
> +
> + /* Not really needed: !nx will cause pte.nx to fault */
> + if (!mmu->nx)
> + ff = 0;
> +
> + /* Allow supervisor writes if !cr0.wp */
> + if (!cr0_wp)
> + wf = (pfec & PFERR_USER_MASK) ? wf : 0;
> +
> + /* Disallow supervisor fetches of user code if cr4.smep */
> + if (cr4_smep)
> + smepf = (pfec & PFERR_FETCH_MASK) ? kf : 0;
> +
> + /*
> + * SMAP:kernel-mode data accesses from user-mode
> + * mappings should fault. A fault is considered
> + * as a SMAP violation if all of the following
> + * conditions are ture:
> + * - X86_CR4_SMAP is set in CR4
> + * - A user page is accessed
> + * - The access is not a fetch
> + * - Page fault in kernel mode
> + * - if CPL = 3 or X86_EFLAGS_AC is clear
> + *
> + * Here, we cover the first three conditions.
> + * The fourth is computed dynamically in permission_fault();
> + * PFERR_RSVD_MASK bit will be set in PFEC if the access is
> + * *not* subject to SMAP restrictions.
> + */
> + if (cr4_smap)
> + smapf = (pfec & (PFERR_RSVD_MASK|PFERR_FETCH_MASK)) ? 0 : kf;
> }
> - mmu->permissions[byte] = map;
> +
> + mmu->permissions[byte] = ff | uf | wf | smepf | smapf;
> }
> }
>
> --
> 1.8.3.1
>

2017-08-29 16:46:48

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH] KVM: MMU: speedup update_permission_bitmask

On 24.08.2017 17:56, Paolo Bonzini wrote:
> update_permission_bitmask currently does a 128-iteration loop to,
> essentially, compute a constant array. Computing the 8 bits in parallel
> reduces it to 16 iterations, and is enough to speed it up substantially
> because many boolean operations in the inner loop become constants or
> simplify noticeably.
>
> Because update_permission_bitmask is actually the top item in the profile
> for nested vmexits, this speeds up an L2->L1 vmexit by about ten thousand
> clock cycles, or up to 30%:
>
> before after
> cpuid 35173 25954
> vmcall 35122 27079
> inl_from_pmtimer 52635 42675
> inl_from_qemu 53604 44599
> inl_from_kernel 38498 30798
> outl_to_kernel 34508 28816
> wr_tsc_adjust_msr 34185 26818
> rd_tsc_adjust_msr 37409 27049
> mmio-no-eventfd:pci-mem 50563 45276
> mmio-wildcard-eventfd:pci-mem 34495 30823
> mmio-datamatch-eventfd:pci-mem 35612 31071
> portio-no-eventfd:pci-io 44925 40661
> portio-wildcard-eventfd:pci-io 29708 27269
> portio-datamatch-eventfd:pci-io 31135 27164
>
> (I wrote a small C program to compare the tables for all values of CR0.WP,
> CR4.SMAP and CR4.SMEP, and they match).
>
> Signed-off-by: Paolo Bonzini <[email protected]>
> ---
> arch/x86/kvm/mmu.c | 121 +++++++++++++++++++++++++++++++----------------------
> 1 file changed, 70 insertions(+), 51 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> index f47cccace1a1..2a8a6e3e2a31 100644
> --- a/arch/x86/kvm/mmu.c
> +++ b/arch/x86/kvm/mmu.c
> @@ -4204,66 +4204,85 @@ static inline bool boot_cpu_is_amd(void)
> boot_cpu_data.x86_phys_bits, execonly);
> }
>
> +#define BYTE_MASK(access) \
> + ((1 & (access) ? 2 : 0) | \
> + (2 & (access) ? 4 : 0) | \
> + (3 & (access) ? 8 : 0) | \
> + (4 & (access) ? 16 : 0) | \
> + (5 & (access) ? 32 : 0) | \
> + (6 & (access) ? 64 : 0) | \
> + (7 & (access) ? 128 : 0))
> +

Hmm, I wonder if

#define BYTE_MASK(access) \
((1 & (access) ? (1 << 1) : 0) | \
(2 & (access) ? (1 << 2) : 0) | \
...

would be easier to get

> +
> static void update_permission_bitmask(struct kvm_vcpu *vcpu,
> struct kvm_mmu *mmu, bool ept)
> {
> - unsigned bit, byte, pfec;
> - u8 map;
> - bool fault, x, w, u, wf, uf, ff, smapf, cr4_smap, cr4_smep, smap = 0;
> + unsigned byte;
> +
> + const u8 x = BYTE_MASK(ACC_EXEC_MASK);
> + const u8 w = BYTE_MASK(ACC_WRITE_MASK);
> + const u8 u = BYTE_MASK(ACC_USER_MASK);
> +
> + bool cr4_smep = kvm_read_cr4_bits(vcpu, X86_CR4_SMEP) != 0;
> + bool cr4_smap = kvm_read_cr4_bits(vcpu, X86_CR4_SMAP) != 0;
> + bool cr0_wp = is_write_protection(vcpu);

all three can be turned const. (and I'd drop the empty lines in between ..)

>
> - cr4_smep = kvm_read_cr4_bits(vcpu, X86_CR4_SMEP);
> - cr4_smap = kvm_read_cr4_bits(vcpu, X86_CR4_SMAP);
> for (byte = 0; byte < ARRAY_SIZE(mmu->permissions); ++byte) {
> - pfec = byte << 1;
> - map = 0;
> - wf = pfec & PFERR_WRITE_MASK;
> - uf = pfec & PFERR_USER_MASK;
> - ff = pfec & PFERR_FETCH_MASK;
> + unsigned pfec = byte << 1;

This one can be turned const, too. And I think you could get rid of
"byte" and directly work on "pfec" ... but not sure if that is better.
Most probably not.

> +
> /*
> - * PFERR_RSVD_MASK bit is set in PFEC if the access is not
> - * subject to SMAP restrictions, and cleared otherwise. The
> - * bit is only meaningful if the SMAP bit is set in CR4.
> + * Each "*f" variable has a 1 bit for each UWX value
> + * that causes a fault with the given PFEC.
> */
> - smapf = !(pfec & PFERR_RSVD_MASK);
> - for (bit = 0; bit < 8; ++bit) {
> - x = bit & ACC_EXEC_MASK;
> - w = bit & ACC_WRITE_MASK;
> - u = bit & ACC_USER_MASK;
> -
> - if (!ept) {
> - /* Not really needed: !nx will cause pte.nx to fault */
> - x |= !mmu->nx;
> - /* Allow supervisor writes if !cr0.wp */
> - w |= !is_write_protection(vcpu) && !uf;
> - /* Disallow supervisor fetches of user code if cr4.smep */
> - x &= !(cr4_smep && u && !uf);
> -
> - /*
> - * SMAP:kernel-mode data accesses from user-mode
> - * mappings should fault. A fault is considered
> - * as a SMAP violation if all of the following
> - * conditions are ture:
> - * - X86_CR4_SMAP is set in CR4
> - * - A user page is accessed
> - * - Page fault in kernel mode
> - * - if CPL = 3 or X86_EFLAGS_AC is clear
> - *
> - * Here, we cover the first three conditions.
> - * The fourth is computed dynamically in
> - * permission_fault() and is in smapf.
> - *
> - * Also, SMAP does not affect instruction
> - * fetches, add the !ff check here to make it
> - * clearer.
> - */
> - smap = cr4_smap && u && !uf && !ff;
> - }
>
> - fault = (ff && !x) || (uf && !u) || (wf && !w) ||
> - (smapf && smap);
> - map |= fault << bit;
> + /* Faults from writes to non-writable pages */
> + u8 wf = (pfec & PFERR_WRITE_MASK) ? ~w : 0;
> + /* Faults from user mode accesses to supervisor pages */
> + u8 uf = (pfec & PFERR_USER_MASK) ? ~u : 0;
> + /* Faults from fetches of non-executable pages*/
> + u8 ff = (pfec & PFERR_FETCH_MASK) ? ~x : 0;
> + /* Faults from kernel mode fetches of user pages */
> + u8 smepf = 0;
> + /* Faults from kernel mode accesses of user pages */
> + u8 smapf = 0;
> +
> + if (!ept) {
> + /* Faults from kernel mode accesses to user pages */
> + u8 kf = (pfec & PFERR_USER_MASK) ? 0 : u;
> +
> + /* Not really needed: !nx will cause pte.nx to fault */
> + if (!mmu->nx)
> + ff = 0;
> +
> + /* Allow supervisor writes if !cr0.wp */
> + if (!cr0_wp)
> + wf = (pfec & PFERR_USER_MASK) ? wf : 0;
> +
> + /* Disallow supervisor fetches of user code if cr4.smep */
> + if (cr4_smep)
> + smepf = (pfec & PFERR_FETCH_MASK) ? kf : 0;
> +
> + /*
> + * SMAP:kernel-mode data accesses from user-mode
> + * mappings should fault. A fault is considered
> + * as a SMAP violation if all of the following
> + * conditions are ture:
> + * - X86_CR4_SMAP is set in CR4
> + * - A user page is accessed
> + * - The access is not a fetch
> + * - Page fault in kernel mode
> + * - if CPL = 3 or X86_EFLAGS_AC is clear
> + *
> + * Here, we cover the first three conditions.
> + * The fourth is computed dynamically in permission_fault();
> + * PFERR_RSVD_MASK bit will be set in PFEC if the access is
> + * *not* subject to SMAP restrictions.
> + */
> + if (cr4_smap)
> + smapf = (pfec & (PFERR_RSVD_MASK|PFERR_FETCH_MASK)) ? 0 : kf;
> }
> - mmu->permissions[byte] = map;
> +
> + mmu->permissions[byte] = ff | uf | wf | smepf | smapf;
> }
> }
>
>

My brain hurts, but I am pretty confident that this is correct. :)

--

Thanks,

David

2017-09-12 08:20:41

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH] KVM: MMU: speedup update_permission_bitmask

On 29/08/2017 18:46, David Hildenbrand wrote:
>> +#define BYTE_MASK(access) \
>> + ((1 & (access) ? 2 : 0) | \
>> + (2 & (access) ? 4 : 0) | \
>> + (3 & (access) ? 8 : 0) | \
>> + (4 & (access) ? 16 : 0) | \
>> + (5 & (access) ? 32 : 0) | \
>> + (6 & (access) ? 64 : 0) | \
>> + (7 & (access) ? 128 : 0))
>> +
> Hmm, I wonder if
>
> #define BYTE_MASK(access) \
> ((1 & (access) ? (1 << 1) : 0) | \
> (2 & (access) ? (1 << 2) : 0) | \
> ...
>
> would be easier to get
>

Yeah, you have a point. Another way to write it is:

(1 & (access) ? 0xAA : 0) | \
(2 & (access) ? 0xCC : 0) | \
(4 & (access) ? 0xF0 : 0)

but I think yours is the best.

>>
>> +
>> + const u8 x = BYTE_MASK(ACC_EXEC_MASK);
>> + const u8 w = BYTE_MASK(ACC_WRITE_MASK);
>> + const u8 u = BYTE_MASK(ACC_USER_MASK);
>> +
>> + bool cr4_smep = kvm_read_cr4_bits(vcpu, X86_CR4_SMEP) != 0;
>> + bool cr4_smap = kvm_read_cr4_bits(vcpu, X86_CR4_SMAP) != 0;
>> + bool cr0_wp = is_write_protection(vcpu);
>
> all three can be turned const. (and I'd drop the empty lines in between ..)

I am using const to identify a compile-time constant here, so more like
"static const" (but I was not sure if C optimizes away static const, so
I just used "const"). This explains also the empty line!

Paolo

2017-09-12 16:48:09

by Peter Feiner

[permalink] [raw]
Subject: Re: [PATCH] KVM: MMU: speedup update_permission_bitmask

On Mon, Aug 28, 2017 at 12:42 PM, Jim Mattson <[email protected]> wrote:
>
> Looks okay to me, but I'm hoping Peter will chime in.

Sorry, this slipped by. Busy couple of weeks!

>
>
> Reviewed-by: Jim Mattson <[email protected]>
>
> On Thu, Aug 24, 2017 at 8:56 AM, Paolo Bonzini <[email protected]> wrote:
> > update_permission_bitmask currently does a 128-iteration loop to,
> > essentially, compute a constant array. Computing the 8 bits in parallel
> > reduces it to 16 iterations, and is enough to speed it up substantially
> > because many boolean operations in the inner loop become constants or
> > simplify noticeably.
> >
> > Because update_permission_bitmask is actually the top item in the profile
> > for nested vmexits, this speeds up an L2->L1 vmexit by about ten thousand
> > clock cycles, or up to 30%:

This is a great improvement! Why not take it a step further and
compute the whole table once at module init time and be done with it?
There are only 5 extra input bits (nx, ept, smep, smap, wp), so the
whole table would only take up (1 << 5) * 16 = 512 bytes. Moreover, if
you had 32 VMs on the host, you'd actually save memory!

> >
> > before after
> > cpuid 35173 25954
> > vmcall 35122 27079
> > inl_from_pmtimer 52635 42675
> > inl_from_qemu 53604 44599
> > inl_from_kernel 38498 30798
> > outl_to_kernel 34508 28816
> > wr_tsc_adjust_msr 34185 26818
> > rd_tsc_adjust_msr 37409 27049
> > mmio-no-eventfd:pci-mem 50563 45276
> > mmio-wildcard-eventfd:pci-mem 34495 30823
> > mmio-datamatch-eventfd:pci-mem 35612 31071
> > portio-no-eventfd:pci-io 44925 40661
> > portio-wildcard-eventfd:pci-io 29708 27269
> > portio-datamatch-eventfd:pci-io 31135 27164
> >
> > (I wrote a small C program to compare the tables for all values of CR0.WP,
> > CR4.SMAP and CR4.SMEP, and they match).
> >
> > Signed-off-by: Paolo Bonzini <[email protected]>
> > ---
> > arch/x86/kvm/mmu.c | 121 +++++++++++++++++++++++++++++++----------------------
> > 1 file changed, 70 insertions(+), 51 deletions(-)
> >
> > diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> > index f47cccace1a1..2a8a6e3e2a31 100644
> > --- a/arch/x86/kvm/mmu.c
> > +++ b/arch/x86/kvm/mmu.c
> > @@ -4204,66 +4204,85 @@ static inline bool boot_cpu_is_amd(void)
> > boot_cpu_data.x86_phys_bits, execonly);
> > }
> >
> > +#define BYTE_MASK(access) \
> > + ((1 & (access) ? 2 : 0) | \
> > + (2 & (access) ? 4 : 0) | \
> > + (3 & (access) ? 8 : 0) | \
> > + (4 & (access) ? 16 : 0) | \
> > + (5 & (access) ? 32 : 0) | \
> > + (6 & (access) ? 64 : 0) | \
> > + (7 & (access) ? 128 : 0))
> > +
> > +
> > static void update_permission_bitmask(struct kvm_vcpu *vcpu,
> > struct kvm_mmu *mmu, bool ept)
> > {
> > - unsigned bit, byte, pfec;
> > - u8 map;
> > - bool fault, x, w, u, wf, uf, ff, smapf, cr4_smap, cr4_smep, smap = 0;
> > + unsigned byte;
> > +
> > + const u8 x = BYTE_MASK(ACC_EXEC_MASK);
> > + const u8 w = BYTE_MASK(ACC_WRITE_MASK);
> > + const u8 u = BYTE_MASK(ACC_USER_MASK);
> > +
> > + bool cr4_smep = kvm_read_cr4_bits(vcpu, X86_CR4_SMEP) != 0;
> > + bool cr4_smap = kvm_read_cr4_bits(vcpu, X86_CR4_SMAP) != 0;
> > + bool cr0_wp = is_write_protection(vcpu);
> >
> > - cr4_smep = kvm_read_cr4_bits(vcpu, X86_CR4_SMEP);
> > - cr4_smap = kvm_read_cr4_bits(vcpu, X86_CR4_SMAP);
> > for (byte = 0; byte < ARRAY_SIZE(mmu->permissions); ++byte) {
> > - pfec = byte << 1;
> > - map = 0;
> > - wf = pfec & PFERR_WRITE_MASK;
> > - uf = pfec & PFERR_USER_MASK;
> > - ff = pfec & PFERR_FETCH_MASK;
> > + unsigned pfec = byte << 1;
> > +
> > /*
> > - * PFERR_RSVD_MASK bit is set in PFEC if the access is not
> > - * subject to SMAP restrictions, and cleared otherwise. The
> > - * bit is only meaningful if the SMAP bit is set in CR4.
> > + * Each "*f" variable has a 1 bit for each UWX value
> > + * that causes a fault with the given PFEC.
> > */
> > - smapf = !(pfec & PFERR_RSVD_MASK);
> > - for (bit = 0; bit < 8; ++bit) {
> > - x = bit & ACC_EXEC_MASK;
> > - w = bit & ACC_WRITE_MASK;
> > - u = bit & ACC_USER_MASK;
> > -
> > - if (!ept) {
> > - /* Not really needed: !nx will cause pte.nx to fault */
> > - x |= !mmu->nx;
> > - /* Allow supervisor writes if !cr0.wp */
> > - w |= !is_write_protection(vcpu) && !uf;
> > - /* Disallow supervisor fetches of user code if cr4.smep */
> > - x &= !(cr4_smep && u && !uf);
> > -
> > - /*
> > - * SMAP:kernel-mode data accesses from user-mode
> > - * mappings should fault. A fault is considered
> > - * as a SMAP violation if all of the following
> > - * conditions are ture:
> > - * - X86_CR4_SMAP is set in CR4
> > - * - A user page is accessed
> > - * - Page fault in kernel mode
> > - * - if CPL = 3 or X86_EFLAGS_AC is clear
> > - *
> > - * Here, we cover the first three conditions.
> > - * The fourth is computed dynamically in
> > - * permission_fault() and is in smapf.
> > - *
> > - * Also, SMAP does not affect instruction
> > - * fetches, add the !ff check here to make it
> > - * clearer.
> > - */
> > - smap = cr4_smap && u && !uf && !ff;
> > - }
> >
> > - fault = (ff && !x) || (uf && !u) || (wf && !w) ||
> > - (smapf && smap);
> > - map |= fault << bit;
> > + /* Faults from writes to non-writable pages */
> > + u8 wf = (pfec & PFERR_WRITE_MASK) ? ~w : 0;
> > + /* Faults from user mode accesses to supervisor pages */
> > + u8 uf = (pfec & PFERR_USER_MASK) ? ~u : 0;

Aside: In the EPT case, I wondered why it's necessary to construct any
of the mmu->permissions where PFERR_USER_MASK is set. EPT doesn't have
a 'user' bit. Digging through the code, I see that vmx.c sets
PFERR_USER_MASK for read violations, which normal x86 page tables
don't have. I'm not sure to what effect, but it's used for something
:-)

> > + /* Faults from fetches of non-executable pages*/
> > + u8 ff = (pfec & PFERR_FETCH_MASK) ? ~x : 0;
> > + /* Faults from kernel mode fetches of user pages */
> > + u8 smepf = 0;
> > + /* Faults from kernel mode accesses of user pages */
> > + u8 smapf = 0;
> > +
> > + if (!ept) {
> > + /* Faults from kernel mode accesses to user pages */
> > + u8 kf = (pfec & PFERR_USER_MASK) ? 0 : u;
> > +
> > + /* Not really needed: !nx will cause pte.nx to fault */
> > + if (!mmu->nx)
> > + ff = 0;
> > +
> > + /* Allow supervisor writes if !cr0.wp */
> > + if (!cr0_wp)
> > + wf = (pfec & PFERR_USER_MASK) ? wf : 0;
> > +
> > + /* Disallow supervisor fetches of user code if cr4.smep */
> > + if (cr4_smep)
> > + smepf = (pfec & PFERR_FETCH_MASK) ? kf : 0;
> > +
> > + /*
> > + * SMAP:kernel-mode data accesses from user-mode
> > + * mappings should fault. A fault is considered
> > + * as a SMAP violation if all of the following
> > + * conditions are ture:
> > + * - X86_CR4_SMAP is set in CR4
> > + * - A user page is accessed
> > + * - The access is not a fetch
> > + * - Page fault in kernel mode
> > + * - if CPL = 3 or X86_EFLAGS_AC is clear
> > + *
> > + * Here, we cover the first three conditions.
> > + * The fourth is computed dynamically in permission_fault();
> > + * PFERR_RSVD_MASK bit will be set in PFEC if the access is
> > + * *not* subject to SMAP restrictions.
> > + */
> > + if (cr4_smap)
> > + smapf = (pfec & (PFERR_RSVD_MASK|PFERR_FETCH_MASK)) ? 0 : kf;
> > }
> > - mmu->permissions[byte] = map;
> > +
> > + mmu->permissions[byte] = ff | uf | wf | smepf | smapf;
> > }
> > }
> >
> > --
> > 1.8.3.1
> >

2017-09-12 19:56:08

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH] KVM: MMU: speedup update_permission_bitmask

On 12/09/2017 18:48, Peter Feiner wrote:
>>>
>>> Because update_permission_bitmask is actually the top item in the profile
>>> for nested vmexits, this speeds up an L2->L1 vmexit by about ten thousand
>>> clock cycles, or up to 30%:
>
> This is a great improvement! Why not take it a step further and
> compute the whole table once at module init time and be done with it?
> There are only 5 extra input bits (nx, ept, smep, smap, wp),

4 actually, nx could be ignored (because unlike WP, the bit is reserved
when nx is disabled). It is only handled for clarity.

> so the
> whole table would only take up (1 << 5) * 16 = 512 bytes. Moreover, if
> you had 32 VMs on the host, you'd actually save memory!

Indeed; my thought was to write a script or something to generate the
tables at compile time, but doing it at module init time would be clever
and easier.

That said, the generated code for the function, right now, is pretty
good. If it saved 1000 clock cycles per nested vmexit it would be very
convincing, but if it were 50 or even 100 a bit less so.

Paolo

2017-09-12 20:06:45

by Peter Feiner

[permalink] [raw]
Subject: Re: [PATCH] KVM: MMU: speedup update_permission_bitmask

On Tue, Sep 12, 2017 at 12:55 PM, Paolo Bonzini <[email protected]> wrote:
> On 12/09/2017 18:48, Peter Feiner wrote:
>>>>
>>>> Because update_permission_bitmask is actually the top item in the profile
>>>> for nested vmexits, this speeds up an L2->L1 vmexit by about ten thousand
>>>> clock cycles, or up to 30%:
>>
>> This is a great improvement! Why not take it a step further and
>> compute the whole table once at module init time and be done with it?
>> There are only 5 extra input bits (nx, ept, smep, smap, wp),
>
> 4 actually, nx could be ignored (because unlike WP, the bit is reserved
> when nx is disabled). It is only handled for clarity.
>
>> so the
>> whole table would only take up (1 << 5) * 16 = 512 bytes. Moreover, if
>> you had 32 VMs on the host, you'd actually save memory!
>
> Indeed; my thought was to write a script or something to generate the
> tables at compile time, but doing it at module init time would be clever
> and easier.
>
> That said, the generated code for the function, right now, is pretty
> good. If it saved 1000 clock cycles per nested vmexit it would be very
> convincing, but if it were 50 or even 100 a bit less so.

ACK. I'm good with either approach :-) Please consider this one

Reviewed-By: Peter Feiner <[email protected]>