When installing a new pte entry or updating an old valid entry in stage 2
translation, we use get_page()/put_page() to track the page_count of the
page-table pages. PATCH 1/3 fixes incorrect use of get_page()/put_page() in
stage 2, which could leave page-table pages unable to be freed when unmapping
a range.
When dirty logging of a guest with hugepages is finished, we should merge
tables back into a block entry if adjustment of the huge mapping is found
necessary. Besides installing the block entry, we should also free the
non-huge page-table pages and invalidate all the TLB entries for the non-huge
mappings of the block. PATCH 2/3 adds sufficient TLB invalidation when merging
tables into a block entry.
The rewrite of the page-table code and fault handling added two different
handlers for "just relaxing permissions" and "map by stage 2 page-table walk",
which is a good improvement. However, in user_mem_abort(), the conditions
under which we choose between these two fault handlers are not strictly
distinguished. This can cause guest errors such as an infinite loop (resulting
in a soft lockup) because the wrong fault handler is called. PATCH 3/3
introduces a way to strictly distinguish these conditions.
Changes from v1:
* In PATCH 1/3, introduce a more concise fix.
* In PATCH 2/3, use full S2 TLB invalidation when merging tables into
a block entry.
Yanan Wang (3):
KVM: arm64: Fix possible memory leak in kvm stage2
KVM: arm64: Fix handling of merging tables into a block entry
KVM: arm64: Add usage of stage 2 fault lookup level in
user_mem_abort()
arch/arm64/include/asm/esr.h | 1 +
arch/arm64/include/asm/kvm_emulate.h | 5 +++++
arch/arm64/kvm/hyp/pgtable.c | 11 ++++++++++-
arch/arm64/kvm/mmu.c | 11 +++++++++--
4 files changed, 25 insertions(+), 3 deletions(-)
--
2.19.1
If we get an FSC_PERM fault, we currently just use (logging_active && writable)
to decide whether to call kvm_pgtable_stage2_map(). There are two more cases
we should consider.
(1) After logging_active is configured back from true to false: when we
get an FSC_PERM fault with write_fault and adjustment of the hugepage mapping
is needed, we should merge tables back into a block entry. This case is
currently missed, and we still call kvm_pgtable_stage2_relax_perms(), which
leads to an endless loop and a guest panic due to soft lockup.
(2) We use (FSC_PERM && logging_active && writable) to decide whether to
collapse a block entry into a table by calling kvm_pgtable_stage2_map().
But sometimes we may only need to relax permissions when writing to a page
rather than a block.
In this case, using kvm_pgtable_stage2_relax_perms() is fine.
The ISS field bits[1:0] in the ESR_EL2 register indicate the stage 2 lookup
level at which a D-abort or I-abort occurred. By comparing the granule size of
the fault lookup level with vma_pagesize, we can strictly distinguish between
the conditions for calling kvm_pgtable_stage2_relax_perms() and
kvm_pgtable_stage2_map(), so both cases above are handled correctly.
Suggested-by: Keqian Zhu <[email protected]>
Signed-off-by: Yanan Wang <[email protected]>
---
arch/arm64/include/asm/esr.h | 1 +
arch/arm64/include/asm/kvm_emulate.h | 5 +++++
arch/arm64/kvm/mmu.c | 11 +++++++++--
3 files changed, 15 insertions(+), 2 deletions(-)
diff --git a/arch/arm64/include/asm/esr.h b/arch/arm64/include/asm/esr.h
index 22c81f1edda2..85a3e49f92f4 100644
--- a/arch/arm64/include/asm/esr.h
+++ b/arch/arm64/include/asm/esr.h
@@ -104,6 +104,7 @@
/* Shared ISS fault status code(IFSC/DFSC) for Data/Instruction aborts */
#define ESR_ELx_FSC (0x3F)
#define ESR_ELx_FSC_TYPE (0x3C)
+#define ESR_ELx_FSC_LEVEL (0x03)
#define ESR_ELx_FSC_EXTABT (0x10)
#define ESR_ELx_FSC_SERROR (0x11)
#define ESR_ELx_FSC_ACCESS (0x08)
diff --git a/arch/arm64/include/asm/kvm_emulate.h b/arch/arm64/include/asm/kvm_emulate.h
index 5ef2669ccd6c..00bc6f1234ba 100644
--- a/arch/arm64/include/asm/kvm_emulate.h
+++ b/arch/arm64/include/asm/kvm_emulate.h
@@ -350,6 +350,11 @@ static __always_inline u8 kvm_vcpu_trap_get_fault_type(const struct kvm_vcpu *vc
return kvm_vcpu_get_esr(vcpu) & ESR_ELx_FSC_TYPE;
}
+static __always_inline u8 kvm_vcpu_trap_get_fault_level(const struct kvm_vcpu *vcpu)
+{
+ return kvm_vcpu_get_esr(vcpu) & ESR_ELx_FSC_LEVEL;
+}
+
static __always_inline bool kvm_vcpu_abt_issea(const struct kvm_vcpu *vcpu)
{
switch (kvm_vcpu_trap_get_fault(vcpu)) {
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 1a01da9fdc99..75814a02d189 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -754,10 +754,12 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
gfn_t gfn;
kvm_pfn_t pfn;
bool logging_active = memslot_is_logging(memslot);
- unsigned long vma_pagesize;
+ unsigned long fault_level = kvm_vcpu_trap_get_fault_level(vcpu);
+ unsigned long vma_pagesize, fault_granule;
enum kvm_pgtable_prot prot = KVM_PGTABLE_PROT_R;
struct kvm_pgtable *pgt;
+ fault_granule = 1UL << ARM64_HW_PGTABLE_LEVEL_SHIFT(fault_level);
write_fault = kvm_is_write_fault(vcpu);
exec_fault = kvm_vcpu_trap_is_exec_fault(vcpu);
VM_BUG_ON(write_fault && exec_fault);
@@ -896,7 +898,12 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
else if (cpus_have_const_cap(ARM64_HAS_CACHE_DIC))
prot |= KVM_PGTABLE_PROT_X;
- if (fault_status == FSC_PERM && !(logging_active && writable)) {
+	/*
+	 * If we get an FSC_PERM fault, we only need to relax permissions
+	 * when vma_pagesize equals fault_granule. Otherwise, call
+	 * kvm_pgtable_stage2_map() to change the block size.
+	 */
+ if (fault_status == FSC_PERM && vma_pagesize == fault_granule) {
ret = kvm_pgtable_stage2_relax_perms(pgt, fault_ipa, prot);
} else {
ret = kvm_pgtable_stage2_map(pgt, fault_ipa, vma_pagesize,
--
2.19.1
On Wed, Dec 02, 2020 at 04:10:31AM +0800, Yanan Wang wrote:
> When installing a new pte entry or updating an old valid entry in stage 2
> translation, we use get_page()/put_page() to record page_count of the page-table
> pages. PATCH 1/3 aims to fix incorrect use of get_page()/put_page() in stage 2,
> which might make page-table pages unable to be freed when unmapping a range.
>
> [...]
For the series:
Acked-by: Will Deacon <[email protected]>
Thanks for reporting these, helping me to understand the issues and then
spinning a v2 so promptly.
Will
Hi Will, Marc,
On 2020/12/2 4:59, Will Deacon wrote:
> On Wed, Dec 02, 2020 at 04:10:31AM +0800, Yanan Wang wrote:
>> When installing a new pte entry or updating an old valid entry in stage 2
>> translation, we use get_page()/put_page() to record page_count of the page-table
>> pages. PATCH 1/3 aims to fix incorrect use of get_page()/put_page() in stage 2,
>> which might make page-table pages unable to be freed when unmapping a range.
>>
>> [...]
> For the series:
>
> Acked-by: Will Deacon <[email protected]>
>
> Thanks for reporting these, helping me to understand the issues and then
> spinning a v2 so promptly.
>
> Will
> .
Thanks for the help and suggestions.
BTW: there are two more things below that I would like to discuss.
1. Recently, I have been focusing on the ARMv8.4-TTRem feature, which allows
changing the block size of a stage 2 mapping.
I plan to implement this feature for stage 2 translation when splitting a
block into tables or merging tables into a block.
This feature supports changing the block size without performing
*break-before-make*, which might improve performance.
What do you think about this?
2. The issues we discussed before were found in practice when the guest state
changes from dirty logging enabled to dirty logging canceled.
I could add a test for this case to selftests/ or kvm-unit-tests/, if that
would be useful.
Yanan
Hi Yanan,
[...]
> BTW: there are two more things below that I want to talk about.
>
> 1. Recently, I have been focusing on the ARMv8.4-TTRem feature which
> is aimed at changing block size in stage 2 mapping.
>
> I have a plan to implement this feature for stage 2 translation when
> splitting a block into tables or merging tables into a block.
>
> This feature supports changing block size without performing
> *break-before-make*, which might have some improvement on performance.
>
> What do you think about this?
It would be interesting if you can demonstrate some significant
performance improvements compared to the same workload with BBM.
I'm not completely convinced this would change much, given that
it is only when moving from a table to a block mapping that you
can elide BBM when the support level is 1 or 2. As far as I can
tell, this only happens in the "stop logging" case.
Is that something that happens often enough to justify the added
complexity? Having to handle TLB Conflict Abort is annoying, for
example.
> 2. Given that the issues we discussed before were found in practice
> when guest state changes from dirty logging to dirty logging canceled.
>
> I could add a test file testing on this case to selftests/ or kvm unit
> tests/, if it's necessary.
That would be awesome, and I'd be very grateful if you did. It is the
second time we break this exact case, and having a reliable way to
verify it would definitely help.
Thanks,
M.
--
Jazz is not dead. It just smells funny...
On Wed, 2 Dec 2020 04:10:31 +0800, Yanan Wang wrote:
> When installing a new pte entry or updating an old valid entry in stage 2
> translation, we use get_page()/put_page() to record page_count of the page-table
> pages. PATCH 1/3 aims to fix incorrect use of get_page()/put_page() in stage 2,
> which might make page-table pages unable to be freed when unmapping a range.
>
> [...]
Applied to kvm-arm64/fixes-5.10, thanks!
[1/3] KVM: arm64: Fix memory leak on stage2 update of a valid PTE
commit: 5c646b7e1d8bcb12317426287c516dfa4c5171c2
[2/3] KVM: arm64: Fix handling of merging tables into a block entry
commit: 3a0b870e3448302ca2ba703bea1b79b61c3f33c6
[3/3] KVM: arm64: Add usage of stage 2 fault lookup level in user_mem_abort()
commit: 7d894834a305568a0168c55d4729216f5f8cb4e6
Cheers,
M.
--
Without deviation from the norm, progress is not possible.
On 2020/12/2 20:23, Marc Zyngier wrote:
> Hi Yanan,
>
> [...]
>
>> BTW: there are two more things below that I want to talk about.
>>
>> 1. Recently, I have been focusing on the ARMv8.4-TTRem feature which
>> is aimed at changing block size in stage 2 mapping.
>>
>> I have a plan to implement this feature for stage 2 translation when
>> splitting a block into tables or merging tables into a block.
>>
>> This feature supports changing block size without performing
>> *break-before-make*, which might have some improvement on performance.
>>
>> What do you think about this?
>
> It would be interesting if you can demonstrate some significant
> performance improvements compared to the same workload with BBM.
>
> I'm not completely convinced this would change much, given that
> it is only when moving from a table to a block mapping that you
> can elide BBM when the support level is 1 or 2. As far as I can
> tell, this only happens in the "stop logging" case.
>
> Is that something that happens often enough to justify the added
> complexity? Having to handle TLB Conflict Abort is annoying, for
> example.
I will give more consideration to whether this is necessary, and I may run
some performance tests later.
Thanks,
Yanan
>
>> 2. Given that the issues we discussed before were found in practice
>> when guest state changes from dirty logging to dirty logging canceled.
>>
>> I could add a test file testing on this case to selftests/ or kvm unit
>> tests/, if it's necessary.
>
> That would be awesome, and I'd be very grateful if you did. It is the
> second time we break this exact case, and having a reliable way to
> verify it would definitely help.
>
> Thanks,
>
> M.