2020-11-30 12:23:57

by Yanan Wang

[permalink] [raw]
Subject: [RFC PATCH 0/3] Fix several bugs in KVM stage 2 translation

Several problems with KVM stage 2 translation were found when testing against
the mainline code. The following is a description of the problems and the
corresponding patches.

When installing a new pte entry or updating an old valid entry in stage 2
translation, we use get_page()/put_page() to track the page_count of the
page-table pages. PATCH 1/3 fixes incorrect use of get_page()/put_page() in
stage 2, which can leave page-table pages unable to be freed when unmapping a range.
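
As a rough sketch of the convention PATCH 1/3 relies on (the helpers below
are purely illustrative, not the actual patch): each valid entry written
into a page-table page takes one reference on that page, and clearing the
entry drops it again, so an unbalanced get_page() leaves the count elevated
and the page can never be freed on unmap.

	/* Illustrative sketch only, not the real stage 2 code. */
	static void stage2_install_pte(kvm_pte_t *ptep, kvm_pte_t new)
	{
		kvm_pte_t old = *ptep;

		WRITE_ONCE(*ptep, new);
		if (!kvm_pte_valid(old))
			get_page(virt_to_page(ptep));	/* one ref per valid entry */
	}

	static void stage2_clear_pte(kvm_pte_t *ptep)
	{
		if (!kvm_pte_valid(*ptep))
			return;

		kvm_set_invalid_pte(ptep);
		put_page(virt_to_page(ptep));		/* balances the get_page() above */
	}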

When dirty logging of a guest with hugepages is finished, we should merge tables
back into a block entry if readjustment of the huge mapping is found necessary.
In addition to installing the block entry, the lower-level mappings for the
block should also be unmapped to avoid conflicting TLB entries.
PATCH 2/3 adds the unmap operation when merging tables into a block entry.

The rewrite of the page-table code and fault handling adds two different handlers
for "just relaxing permissions" and "map by stage 2 page-table walk", which is a
great improvement. Yet, in user_mem_abort(), the conditions under which we choose
between these two fault handlers are not strictly distinguished. This can cause
guest errors such as an infinite loop (which eventually results in a soft lockup)
because the wrong fault handler is called. PATCH 3/3 introduces a way to strictly
distinguish the two conditions.
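
A rough sketch of the idea in PATCH 3/3 (the exact condition and names in
the patch may differ): use the stage 2 fault lookup level reported in
ESR_EL2 to tell whether a permission fault hit a mapping that already
exists at the expected granularity, and only then take the
relax-permissions path.

	/* Illustrative sketch of the selection logic in user_mem_abort(). */
	fault_granule = 1UL << ARM64_HW_PGTABLE_LEVEL_SHIFT(fault_level);

	if (fault_status == FSC_PERM && vma_pagesize == fault_granule)
		/* A mapping of this size already exists: just relax permissions. */
		ret = kvm_pgtable_stage2_relax_perms(pgt, fault_ipa, prot);
	else
		/* Otherwise go through a full stage 2 page-table walk. */
		ret = kvm_pgtable_stage2_map(pgt, fault_ipa, vma_pagesize,
					     __pfn_to_phys(pfn), prot, memcache);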

Yanan Wang (3):
KVM: arm64: Fix possible memory leak in kvm stage2
KVM: arm64: Fix handling of merging tables into a block entry
KVM: arm64: Add usage of stage 2 fault lookup level in
user_mem_abort()

arch/arm64/include/asm/esr.h | 1 +
arch/arm64/include/asm/kvm_emulate.h | 5 +++++
arch/arm64/kvm/hyp/pgtable.c | 22 +++++++++++++++++-----
arch/arm64/kvm/mmu.c | 11 +++++++++--
4 files changed, 32 insertions(+), 7 deletions(-)

--
2.19.1


2020-11-30 12:25:05

by Yanan Wang

[permalink] [raw]
Subject: [RFC PATCH 2/3] KVM: arm64: Fix handling of merging tables into a block entry

In the dirty logging case (logging_active == true), we need to collapse a block
entry into a table if necessary. After dirty logging is canceled, when merging
tables back into a block entry, we should not only free the non-huge page-table
pages but also unmap the non-huge mappings for the block. Without the unmap,
inconsistent TLB entries for the pages in the block will be created.

We could also use the unmap_stage2_range API to unmap the non-huge mappings,
but this could potentially free the upper-level page-table page, which will
still be useful later.

Signed-off-by: Yanan Wang <[email protected]>
---
arch/arm64/kvm/hyp/pgtable.c | 15 +++++++++++++--
1 file changed, 13 insertions(+), 2 deletions(-)

diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
index 696b6aa83faf..fec8dc9f2baa 100644
--- a/arch/arm64/kvm/hyp/pgtable.c
+++ b/arch/arm64/kvm/hyp/pgtable.c
@@ -500,6 +500,9 @@ static int stage2_map_walk_table_pre(u64 addr, u64 end, u32 level,
return 0;
}

+static void stage2_flush_dcache(void *addr, u64 size);
+static bool stage2_pte_cacheable(kvm_pte_t pte);
+
static int stage2_map_walk_leaf(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
struct stage2_map_data *data)
{
@@ -507,9 +510,17 @@ static int stage2_map_walk_leaf(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
struct page *page = virt_to_page(ptep);

if (data->anchor) {
- if (kvm_pte_valid(pte))
+ if (kvm_pte_valid(pte)) {
+ kvm_set_invalid_pte(ptep);
+ kvm_call_hyp(__kvm_tlb_flush_vmid_ipa, data->mmu,
+ addr, level);
put_page(page);

+ if (stage2_pte_cacheable(pte))
+ stage2_flush_dcache(kvm_pte_follow(pte),
+ kvm_granule_size(level));
+ }
+
return 0;
}

@@ -574,7 +585,7 @@ static int stage2_map_walk_table_post(u64 addr, u64 end, u32 level,
* The behaviour of the LEAF callback then depends on whether or not the
* anchor has been set. If not, then we're not using a block mapping higher
* up the table and we perform the mapping at the existing leaves instead.
- * If, on the other hand, the anchor _is_ set, then we drop references to
+ * If, on the other hand, the anchor _is_ set, then we unmap the mapping of
* all valid leaves so that the pages beneath the anchor can be freed.
*
* Finally, the TABLE_POST callback does nothing if the anchor has not
--
2.19.1

2020-11-30 13:39:28

by Will Deacon

[permalink] [raw]
Subject: Re: [RFC PATCH 2/3] KVM: arm64: Fix handling of merging tables into a block entry

On Mon, Nov 30, 2020 at 08:18:46PM +0800, Yanan Wang wrote:
> In dirty logging case(logging_active == True), we need to collapse a block
> entry into a table if necessary. After dirty logging is canceled, when merging
> tables back into a block entry, we should not only free the non-huge page
> tables but also unmap the non-huge mapping for the block. Without the unmap,
> inconsistent TLB entries for the pages in the the block will be created.
>
> We could also use unmap_stage2_range API to unmap the non-huge mapping,
> but this could potentially free the upper level page-table page which
> will be useful later.
>
> Signed-off-by: Yanan Wang <[email protected]>
> ---
> arch/arm64/kvm/hyp/pgtable.c | 15 +++++++++++++--
> 1 file changed, 13 insertions(+), 2 deletions(-)
>
> diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
> index 696b6aa83faf..fec8dc9f2baa 100644
> --- a/arch/arm64/kvm/hyp/pgtable.c
> +++ b/arch/arm64/kvm/hyp/pgtable.c
> @@ -500,6 +500,9 @@ static int stage2_map_walk_table_pre(u64 addr, u64 end, u32 level,
> return 0;
> }
>
> +static void stage2_flush_dcache(void *addr, u64 size);
> +static bool stage2_pte_cacheable(kvm_pte_t pte);
> +
> static int stage2_map_walk_leaf(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
> struct stage2_map_data *data)
> {
> @@ -507,9 +510,17 @@ static int stage2_map_walk_leaf(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
> struct page *page = virt_to_page(ptep);
>
> if (data->anchor) {
> - if (kvm_pte_valid(pte))
> + if (kvm_pte_valid(pte)) {
> + kvm_set_invalid_pte(ptep);
> + kvm_call_hyp(__kvm_tlb_flush_vmid_ipa, data->mmu,
> + addr, level);
> put_page(page);

This doesn't make sense to me: the page-table pages we're walking when the
anchor is set are not accessible to the hardware walker because we unhooked
the entire sub-table in stage2_map_walk_table_pre(), which has the necessary
TLB invalidation.

Are you seeing a problem in practice here?

> + if (stage2_pte_cacheable(pte))
> + stage2_flush_dcache(kvm_pte_follow(pte),
> + kvm_granule_size(level));

I don't understand the need for the flush either, as we're just coalescing
existing entries into a larger block mapping.

Will

2020-11-30 15:28:15

by Yanan Wang

[permalink] [raw]
Subject: Re: [RFC PATCH 2/3] KVM: arm64: Fix handling of merging tables into a block entry


On 2020/11/30 21:34, Will Deacon wrote:
> On Mon, Nov 30, 2020 at 08:18:46PM +0800, Yanan Wang wrote:
>> In dirty logging case(logging_active == True), we need to collapse a block
>> entry into a table if necessary. After dirty logging is canceled, when merging
>> tables back into a block entry, we should not only free the non-huge page
>> tables but also unmap the non-huge mapping for the block. Without the unmap,
>> inconsistent TLB entries for the pages in the the block will be created.
>>
>> We could also use unmap_stage2_range API to unmap the non-huge mapping,
>> but this could potentially free the upper level page-table page which
>> will be useful later.
>>
>> Signed-off-by: Yanan Wang <[email protected]>
>> ---
>> arch/arm64/kvm/hyp/pgtable.c | 15 +++++++++++++--
>> 1 file changed, 13 insertions(+), 2 deletions(-)
>>
>> diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
>> index 696b6aa83faf..fec8dc9f2baa 100644
>> --- a/arch/arm64/kvm/hyp/pgtable.c
>> +++ b/arch/arm64/kvm/hyp/pgtable.c
>> @@ -500,6 +500,9 @@ static int stage2_map_walk_table_pre(u64 addr, u64 end, u32 level,
>> return 0;
>> }
>>
>> +static void stage2_flush_dcache(void *addr, u64 size);
>> +static bool stage2_pte_cacheable(kvm_pte_t pte);
>> +
>> static int stage2_map_walk_leaf(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
>> struct stage2_map_data *data)
>> {
>> @@ -507,9 +510,17 @@ static int stage2_map_walk_leaf(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
>> struct page *page = virt_to_page(ptep);
>>
>> if (data->anchor) {
>> - if (kvm_pte_valid(pte))
>> + if (kvm_pte_valid(pte)) {
>> + kvm_set_invalid_pte(ptep);
>> + kvm_call_hyp(__kvm_tlb_flush_vmid_ipa, data->mmu,
>> + addr, level);
>> put_page(page);
> This doesn't make sense to me: the page-table pages we're walking when the
> anchor is set are not accessible to the hardware walker because we unhooked
> the entire sub-table in stage2_map_walk_table_pre(), which has the necessary
> TLB invalidation.
>
> Are you seeing a problem in practice here?

Yes, I did indeed find a problem in practice.

When the migration was cancelled, a TLB conflict abort was found in the guest.

This problem was fixed before the rework of the page table code; you can
have a look at the following two links:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=3c3736cd32bf5197aed1410ae826d2d254a5b277

https://lists.cs.columbia.edu/pipermail/kvmarm/2019-March/035031.html

>> + if (stage2_pte_cacheable(pte))
>> + stage2_flush_dcache(kvm_pte_follow(pte),
>> + kvm_granule_size(level));
> I don't understand the need for the flush either, as we're just coalescing
> existing entries into a larger block mapping.

In my opinion, after unmapping it is necessary to ensure cache coherency,
because theoretically it is unknown whether the memory attributes of the
future mapping will change (e.g. cacheable -> non-cacheable).

> Will
> .

2020-11-30 16:06:35

by Will Deacon

[permalink] [raw]
Subject: Re: [RFC PATCH 2/3] KVM: arm64: Fix handling of merging tables into a block entry

Hi,

Cheers for the quick reply. See below for more questions...

On Mon, Nov 30, 2020 at 11:24:19PM +0800, wangyanan (Y) wrote:
> On 2020/11/30 21:34, Will Deacon wrote:
> > On Mon, Nov 30, 2020 at 08:18:46PM +0800, Yanan Wang wrote:
> > > diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
> > > index 696b6aa83faf..fec8dc9f2baa 100644
> > > --- a/arch/arm64/kvm/hyp/pgtable.c
> > > +++ b/arch/arm64/kvm/hyp/pgtable.c
> > > @@ -500,6 +500,9 @@ static int stage2_map_walk_table_pre(u64 addr, u64 end, u32 level,
> > > return 0;
> > > }
> > > +static void stage2_flush_dcache(void *addr, u64 size);
> > > +static bool stage2_pte_cacheable(kvm_pte_t pte);
> > > +
> > > static int stage2_map_walk_leaf(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
> > > struct stage2_map_data *data)
> > > {
> > > @@ -507,9 +510,17 @@ static int stage2_map_walk_leaf(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
> > > struct page *page = virt_to_page(ptep);
> > > if (data->anchor) {
> > > - if (kvm_pte_valid(pte))
> > > + if (kvm_pte_valid(pte)) {
> > > + kvm_set_invalid_pte(ptep);
> > > + kvm_call_hyp(__kvm_tlb_flush_vmid_ipa, data->mmu,
> > > + addr, level);
> > > put_page(page);
> > This doesn't make sense to me: the page-table pages we're walking when the
> > anchor is set are not accessible to the hardware walker because we unhooked
> > the entire sub-table in stage2_map_walk_table_pre(), which has the necessary
> > TLB invalidation.
> >
> > Are you seeing a problem in practice here?
>
> Yes, I indeed find a problem in practice.
>
> When the migration was cancelled, a TLB conflic abort was found in guest.
>
> This problem is fixed before rework of the page table code, you can have a
> look in the following two links:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=3c3736cd32bf5197aed1410ae826d2d254a5b277
>
> https://lists.cs.columbia.edu/pipermail/kvmarm/2019-March/035031.html

Ok, let's go through this, because I still don't see the bug. Please correct
me if you spot any mistakes:

1. We have a block mapping for X => Y
2. Dirty logging is enabled, so the block mapping is write-protected and
ends up being split into page mappings
3. Dirty logging is disabled due to a failed migration.

--- At this point, I think we agree that the state of the MMU is alright ---

4. We take a stage-2 fault and want to reinstall the block mapping:

a. kvm_pgtable_stage2_map() is invoked to install the block mapping
b. stage2_map_walk_table_pre() finds a table where we would like to
install the block:

i. The anchor is set to point at this entry
ii. The entry is made invalid
iii. We invalidate the TLB for the input address. This is
TLBI IPAS2SE1IS without level hint and then TLBI VMALLE1IS.

*** At this point, the page-table pointed to by the old table entry
is not reachable to the hardware walker ***

c. stage2_map_walk_leaf() is called for each leaf entry in the
now-unreachable subtree, dropping page-references for each valid
entry it finds.
d. stage2_map_walk_table_post() is eventually called for the entry
which we cleared back in b.ii, so we install the new block mapping.

You are proposing to add additional TLB invalidation to (c), but I don't
think that is necessary, thanks to the invalidation already performed in
b.iii. What am I missing here?
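
For reference, the callback dispatch in the map walker looks roughly like
this (paraphrased from pgtable.c), which is what drives the pre/leaf/post
sequence in steps (b)-(d) above:

	static int stage2_map_walker(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
				     enum kvm_pgtable_walk_flags flag, void * const arg)
	{
		struct stage2_map_data *data = arg;

		switch (flag) {
		case KVM_PGTABLE_WALK_TABLE_PRE:	/* step b: set anchor, invalidate entry */
			return stage2_map_walk_table_pre(addr, end, level, ptep, data);
		case KVM_PGTABLE_WALK_LEAF:		/* step c: handle each leaf below the anchor */
			return stage2_map_walk_leaf(addr, end, level, ptep, data);
		case KVM_PGTABLE_WALK_TABLE_POST:	/* step d: install the block mapping */
			return stage2_map_walk_table_post(addr, end, level, ptep, data);
		}

		return -EINVAL;
	}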

> > > + if (stage2_pte_cacheable(pte))
> > > + stage2_flush_dcache(kvm_pte_follow(pte),
> > > + kvm_granule_size(level));
> > I don't understand the need for the flush either, as we're just coalescing
> > existing entries into a larger block mapping.
>
> In my opinion, after unmapping, it is necessary to ensure the cache
> coherency, because it is unknown whether the future mapping memory attribute
> is changed or not (cacheable -> non_cacheable) theoretically.

But in this case we're just changing the structure of the page-tables,
not the pages which are mapped, are we?

Will

2020-12-01 05:07:26

by Yanan Wang

[permalink] [raw]
Subject: Re: [RFC PATCH 2/3] KVM: arm64: Fix handling of merging tables into a block entry

Hi Will,

On 2020/12/1 0:01, Will Deacon wrote:
> Hi,
>
> Cheers for the quick reply. See below for more questions...
>
> On Mon, Nov 30, 2020 at 11:24:19PM +0800, wangyanan (Y) wrote:
>> On 2020/11/30 21:34, Will Deacon wrote:
>>> On Mon, Nov 30, 2020 at 08:18:46PM +0800, Yanan Wang wrote:
>>>> diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
>>>> index 696b6aa83faf..fec8dc9f2baa 100644
>>>> --- a/arch/arm64/kvm/hyp/pgtable.c
>>>> +++ b/arch/arm64/kvm/hyp/pgtable.c
>>>> @@ -500,6 +500,9 @@ static int stage2_map_walk_table_pre(u64 addr, u64 end, u32 level,
>>>> return 0;
>>>> }
>>>> +static void stage2_flush_dcache(void *addr, u64 size);
>>>> +static bool stage2_pte_cacheable(kvm_pte_t pte);
>>>> +
>>>> static int stage2_map_walk_leaf(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
>>>> struct stage2_map_data *data)
>>>> {
>>>> @@ -507,9 +510,17 @@ static int stage2_map_walk_leaf(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
>>>> struct page *page = virt_to_page(ptep);
>>>> if (data->anchor) {
>>>> - if (kvm_pte_valid(pte))
>>>> + if (kvm_pte_valid(pte)) {
>>>> + kvm_set_invalid_pte(ptep);
>>>> + kvm_call_hyp(__kvm_tlb_flush_vmid_ipa, data->mmu,
>>>> + addr, level);
>>>> put_page(page);
>>> This doesn't make sense to me: the page-table pages we're walking when the
>>> anchor is set are not accessible to the hardware walker because we unhooked
>>> the entire sub-table in stage2_map_walk_table_pre(), which has the necessary
>>> TLB invalidation.
>>>
>>> Are you seeing a problem in practice here?
>> Yes, I indeed find a problem in practice.
>>
>> When the migration was cancelled, a TLB conflic abort  was found in guest.
>>
>> This problem is fixed before rework of the page table code, you can have a
>> look in the following two links:
>>
>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=3c3736cd32bf5197aed1410ae826d2d254a5b277
>>
>> https://lists.cs.columbia.edu/pipermail/kvmarm/2019-March/035031.html
> Ok, let's go through this, because I still don't see the bug. Please correct
> me if you spot any mistakes:
>
> 1. We have a block mapping for X => Y
> 2. Dirty logging is enabled, so the block mapping is write-protected and
> ends up being split into page mappings
> 3. Dirty logging is disabled due to a failed migration.
>
> --- At this point, I think we agree that the state of the MMU is alright ---
>
> 4. We take a stage-2 fault and want to reinstall the block mapping:
>
> a. kvm_pgtable_stage2_map() is invoked to install the block mapping
> b. stage2_map_walk_table_pre() finds a table where we would like to
> install the block:
>
> i. The anchor is set to point at this entry
> ii. The entry is made invalid
> iii. We invalidate the TLB for the input address. This is
> TLBI IPAS2SE1IS without level hint and then TLBI VMALLE1IS.
>
> *** At this point, the page-table pointed to by the old table entry
> is not reachable to the hardware walker ***
>
> c. stage2_map_walk_leaf() is called for each leaf entry in the
> now-unreachable subtree, dropping page-references for each valid
> entry it finds.
> d. stage2_map_walk_table_post() is eventually called for the entry
> which we cleared back in b.ii, so we install the new block mapping.
>
> You are proposing to add additional TLB invalidation to (c), but I don't
> think that is necessary, thanks to the invalidation already performed in
> b.iii. What am I missing here?

The point is that the TLBI at b.iii is not enough. There are many page
mappings that we need to merge into a block mapping.

We invalidate the TLB for the input address without a level hint at b.iii,
but this operation only flushes the TLB entry for one page mapping; there
are still TLB entries for the other page mappings in the cache, and the
MMU hardware walker can still hit them next time.

Maybe we can imagine a concrete example here. Suppose we need to merge
page mappings into a 1G block mapping, and the fault_ipa passed to
user_mem_abort() is 0x225043000. After alignment to 1G, the input address
will be 0x200000000, so the TLBI operation at b.iii will invalidate the
TLB entry for address 0x200000000. But what about addresses 0x200001000,
0x200002000, ...?

After the 1G block mapping has been installed, when the fault_ipa is
0x200007000, the MMU can still hit the stale TLB entry for the page
mapping and can also access memory through the new block entry.

So adding a TLBI operation in stage2_map_walk_leaf() aims to invalidate
the TLB entries for all the page mappings that will be merged. In this
way, after the block mapping is installed, the MMU can only access
memory through the new block entry.
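
To put numbers on the example above (illustrative arithmetic only, not code
from the patch): with 4K pages, a 1G block covers 262144 page mappings, and
the single TLBI at b.iii only covers the one block-aligned IPA.

	#define SZ_1G		0x40000000UL
	#define PAGE_SIZE_4K	0x1000UL

	u64 fault_ipa   = 0x225043000UL;
	u64 block_start = fault_ipa & ~(SZ_1G - 1);	/* 0x200000000: the only IPA invalidated at b.iii */
	u64 nr_pages    = SZ_1G / PAGE_SIZE_4K;		/* 262144 page mappings whose TLB entries may survive */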

>>>> + if (stage2_pte_cacheable(pte))
>>>> + stage2_flush_dcache(kvm_pte_follow(pte),
>>>> + kvm_granule_size(level));
>>> I don't understand the need for the flush either, as we're just coalescing
>>> existing entries into a larger block mapping.
>> In my opinion, after unmapping, it is necessary to ensure the cache
>> coherency, because it is unknown whether the future mapping memory attribute
>> is changed or not (cacheable -> non_cacheable) theoretically.
> But in this case we're just changing the structure of the page-tables,
> not the pages which are mapped, are we?
>
> Will
> .

Yes, there is currently no case here where the cache attributes would be
changed.

Maybe this part, the dcache flush, can be dropped.


Yanan

2020-12-01 14:09:08

by Marc Zyngier

[permalink] [raw]
Subject: Re: [RFC PATCH 2/3] KVM: arm64: Fix handling of merging tables into a block entry

On 2020-12-01 13:46, Will Deacon wrote:
> On Tue, Dec 01, 2020 at 10:30:41AM +0800, wangyanan (Y) wrote:
>> On 2020/12/1 0:01, Will Deacon wrote:
>> > On Mon, Nov 30, 2020 at 11:24:19PM +0800, wangyanan (Y) wrote:
>> > > On 2020/11/30 21:34, Will Deacon wrote:
>> > > > On Mon, Nov 30, 2020 at 08:18:46PM +0800, Yanan Wang wrote:
>> > > > > diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
>> > > > > index 696b6aa83faf..fec8dc9f2baa 100644
>> > > > > --- a/arch/arm64/kvm/hyp/pgtable.c
>> > > > > +++ b/arch/arm64/kvm/hyp/pgtable.c
>> > > > > @@ -500,6 +500,9 @@ static int stage2_map_walk_table_pre(u64 addr, u64 end, u32 level,
>> > > > > return 0;
>> > > > > }
>> > > > > +static void stage2_flush_dcache(void *addr, u64 size);
>> > > > > +static bool stage2_pte_cacheable(kvm_pte_t pte);
>> > > > > +
>> > > > > static int stage2_map_walk_leaf(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
>> > > > > struct stage2_map_data *data)
>> > > > > {
>> > > > > @@ -507,9 +510,17 @@ static int stage2_map_walk_leaf(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
>> > > > > struct page *page = virt_to_page(ptep);
>> > > > > if (data->anchor) {
>> > > > > - if (kvm_pte_valid(pte))
>> > > > > + if (kvm_pte_valid(pte)) {
>> > > > > + kvm_set_invalid_pte(ptep);
>> > > > > + kvm_call_hyp(__kvm_tlb_flush_vmid_ipa, data->mmu,
>> > > > > + addr, level);
>> > > > > put_page(page);
>> > > > This doesn't make sense to me: the page-table pages we're walking when the
>> > > > anchor is set are not accessible to the hardware walker because we unhooked
>> > > > the entire sub-table in stage2_map_walk_table_pre(), which has the necessary
>> > > > TLB invalidation.
>> > > >
>> > > > Are you seeing a problem in practice here?
>> > > Yes, I indeed find a problem in practice.
>> > >
>> > > When the migration was cancelled, a TLB conflic abort  was found in guest.
>> > >
>> > > This problem is fixed before rework of the page table code, you can have a
>> > > look in the following two links:
>> > >
>> > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=3c3736cd32bf5197aed1410ae826d2d254a5b277
>> > >
>> > > https://lists.cs.columbia.edu/pipermail/kvmarm/2019-March/035031.html
>> > Ok, let's go through this, because I still don't see the bug. Please correct
>> > me if you spot any mistakes:
>> >
>> > 1. We have a block mapping for X => Y
>> > 2. Dirty logging is enabled, so the block mapping is write-protected and
>> > ends up being split into page mappings
>> > 3. Dirty logging is disabled due to a failed migration.
>> >
>> > --- At this point, I think we agree that the state of the MMU is alright ---
>> >
>> > 4. We take a stage-2 fault and want to reinstall the block mapping:
>> >
>> > a. kvm_pgtable_stage2_map() is invoked to install the block mapping
>> > b. stage2_map_walk_table_pre() finds a table where we would like to
>> > install the block:
>> >
>> > i. The anchor is set to point at this entry
>> > ii. The entry is made invalid
>> > iii. We invalidate the TLB for the input address. This is
>> > TLBI IPAS2SE1IS without level hint and then TLBI VMALLE1IS.
>> >
>> > *** At this point, the page-table pointed to by the old table entry
>> > is not reachable to the hardware walker ***
>> >
>> > c. stage2_map_walk_leaf() is called for each leaf entry in the
>> > now-unreachable subtree, dropping page-references for each valid
>> > entry it finds.
>> > d. stage2_map_walk_table_post() is eventually called for the entry
>> > which we cleared back in b.ii, so we install the new block mapping.
>> >
>> > You are proposing to add additional TLB invalidation to (c), but I don't
>> > think that is necessary, thanks to the invalidation already performed in
>> > b.iii. What am I missing here?
>>
>> The point is at b.iii where the TLBI is not enough. There are many
>> page
>> mappings that we need to merge into a block mapping.
>>
>> We invalidate the TLB for the input address without level hint at
>> b.iii, but
>> this operation just flush TLB for one page mapping, there
>>
>> are still some TLB entries for the other page mappings in the cache,
>> the MMU
>> hardware walker can still hit these entries next time.
>
> Ah, yes, I see. Thanks. I hadn't considered the case where there are
> table
> entries beneath the anchor. So how about the diff below?
>
> Will
>
> --->8
>
> diff --git a/arch/arm64/kvm/hyp/pgtable.c
> b/arch/arm64/kvm/hyp/pgtable.c
> index 0271b4a3b9fe..12526d8c7ae4 100644
> --- a/arch/arm64/kvm/hyp/pgtable.c
> +++ b/arch/arm64/kvm/hyp/pgtable.c
> @@ -493,7 +493,7 @@ static int stage2_map_walk_table_pre(u64 addr, u64
> end, u32 level,
> return 0;
>
> kvm_set_invalid_pte(ptep);
> - kvm_call_hyp(__kvm_tlb_flush_vmid_ipa, data->mmu, addr, 0);
> + /* TLB invalidation is deferred until the _post handler */
> data->anchor = ptep;
> return 0;
> }
> @@ -547,11 +547,21 @@ static int stage2_map_walk_table_post(u64 addr,
> u64 end, u32 level,
> struct stage2_map_data *data)
> {
> int ret = 0;
> + kvm_pte_t pte = *ptep;
>
> if (!data->anchor)
> return 0;
>
> - free_page((unsigned long)kvm_pte_follow(*ptep));
> + kvm_set_invalid_pte(ptep);
> +
> + /*
> + * Invalidate the whole stage-2, as we may have numerous leaf
> + * entries below us which would otherwise need invalidating
> + * individually.
> + */
> + kvm_call_hyp(__kvm_tlb_flush_vmid, data->mmu);

That's a big hammer, and we so far have been pretty careful not to
over-invalidate. Is the block-replacing-table *without* an unmap
in between the only case where this triggers?

Thanks,

M.
--
Jazz is not dead. It just smells funny...

2020-12-01 14:16:35

by Yanan Wang

[permalink] [raw]
Subject: Re: [RFC PATCH 2/3] KVM: arm64: Fix handling of merging tables into a block entry


On 2020/12/1 21:46, Will Deacon wrote:
> On Tue, Dec 01, 2020 at 10:30:41AM +0800, wangyanan (Y) wrote:
>> On 2020/12/1 0:01, Will Deacon wrote:
>>> On Mon, Nov 30, 2020 at 11:24:19PM +0800, wangyanan (Y) wrote:
>>>> On 2020/11/30 21:34, Will Deacon wrote:
>>>>> On Mon, Nov 30, 2020 at 08:18:46PM +0800, Yanan Wang wrote:
>>>>>> diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
>>>>>> index 696b6aa83faf..fec8dc9f2baa 100644
>>>>>> --- a/arch/arm64/kvm/hyp/pgtable.c
>>>>>> +++ b/arch/arm64/kvm/hyp/pgtable.c
>>>>>> @@ -500,6 +500,9 @@ static int stage2_map_walk_table_pre(u64 addr, u64 end, u32 level,
>>>>>> return 0;
>>>>>> }
>>>>>> +static void stage2_flush_dcache(void *addr, u64 size);
>>>>>> +static bool stage2_pte_cacheable(kvm_pte_t pte);
>>>>>> +
>>>>>> static int stage2_map_walk_leaf(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
>>>>>> struct stage2_map_data *data)
>>>>>> {
>>>>>> @@ -507,9 +510,17 @@ static int stage2_map_walk_leaf(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
>>>>>> struct page *page = virt_to_page(ptep);
>>>>>> if (data->anchor) {
>>>>>> - if (kvm_pte_valid(pte))
>>>>>> + if (kvm_pte_valid(pte)) {
>>>>>> + kvm_set_invalid_pte(ptep);
>>>>>> + kvm_call_hyp(__kvm_tlb_flush_vmid_ipa, data->mmu,
>>>>>> + addr, level);
>>>>>> put_page(page);
>>>>> This doesn't make sense to me: the page-table pages we're walking when the
>>>>> anchor is set are not accessible to the hardware walker because we unhooked
>>>>> the entire sub-table in stage2_map_walk_table_pre(), which has the necessary
>>>>> TLB invalidation.
>>>>>
>>>>> Are you seeing a problem in practice here?
>>>> Yes, I indeed find a problem in practice.
>>>>
>>>> When the migration was cancelled, a TLB conflic abort  was found in guest.
>>>>
>>>> This problem is fixed before rework of the page table code, you can have a
>>>> look in the following two links:
>>>>
>>>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=3c3736cd32bf5197aed1410ae826d2d254a5b277
>>>>
>>>> https://lists.cs.columbia.edu/pipermail/kvmarm/2019-March/035031.html
>>> Ok, let's go through this, because I still don't see the bug. Please correct
>>> me if you spot any mistakes:
>>>
>>> 1. We have a block mapping for X => Y
>>> 2. Dirty logging is enabled, so the block mapping is write-protected and
>>> ends up being split into page mappings
>>> 3. Dirty logging is disabled due to a failed migration.
>>>
>>> --- At this point, I think we agree that the state of the MMU is alright ---
>>>
>>> 4. We take a stage-2 fault and want to reinstall the block mapping:
>>>
>>> a. kvm_pgtable_stage2_map() is invoked to install the block mapping
>>> b. stage2_map_walk_table_pre() finds a table where we would like to
>>> install the block:
>>>
>>> i. The anchor is set to point at this entry
>>> ii. The entry is made invalid
>>> iii. We invalidate the TLB for the input address. This is
>>> TLBI IPAS2SE1IS without level hint and then TLBI VMALLE1IS.
>>>
>>> *** At this point, the page-table pointed to by the old table entry
>>> is not reachable to the hardware walker ***
>>>
>>> c. stage2_map_walk_leaf() is called for each leaf entry in the
>>> now-unreachable subtree, dropping page-references for each valid
>>> entry it finds.
>>> d. stage2_map_walk_table_post() is eventually called for the entry
>>> which we cleared back in b.ii, so we install the new block mapping.
>>>
>>> You are proposing to add additional TLB invalidation to (c), but I don't
>>> think that is necessary, thanks to the invalidation already performed in
>>> b.iii. What am I missing here?
>> The point is at b.iii where the TLBI is not enough. There are many page
>> mappings that we need to merge into a block mapping.
>>
>> We invalidate the TLB for the input address without level hint at b.iii, but
>> this operation just flush TLB for one page mapping, there
>>
>> are still some TLB entries for the other page mappings in the cache, the MMU
>> hardware walker can still hit these entries next time.
> Ah, yes, I see. Thanks. I hadn't considered the case where there are table
> entries beneath the anchor. So how about the diff below?
>
> Will
>
> --->8

Hi, I think it's inappropriate to put the TLBI for all the leaf entries
in stage2_map_walk_table_post(), because *ptep must be an upper-level
table entry when we enter stage2_map_walk_table_post().

We should issue the TLBI for every leaf entry (not the table entry) at
the last lookup level, just as I am proposing with the additional TLBI
in stage2_map_walk_leaf().

Thanks.


Yanan

>
> diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
> index 0271b4a3b9fe..12526d8c7ae4 100644
> --- a/arch/arm64/kvm/hyp/pgtable.c
> +++ b/arch/arm64/kvm/hyp/pgtable.c
> @@ -493,7 +493,7 @@ static int stage2_map_walk_table_pre(u64 addr, u64 end, u32 level,
> return 0;
>
> kvm_set_invalid_pte(ptep);
> - kvm_call_hyp(__kvm_tlb_flush_vmid_ipa, data->mmu, addr, 0);
> + /* TLB invalidation is deferred until the _post handler */
> data->anchor = ptep;
> return 0;
> }
> @@ -547,11 +547,21 @@ static int stage2_map_walk_table_post(u64 addr, u64 end, u32 level,
> struct stage2_map_data *data)
> {
> int ret = 0;
> + kvm_pte_t pte = *ptep;
>
> if (!data->anchor)
> return 0;
>
> - free_page((unsigned long)kvm_pte_follow(*ptep));
> + kvm_set_invalid_pte(ptep);
> +
> + /*
> + * Invalidate the whole stage-2, as we may have numerous leaf
> + * entries below us which would otherwise need invalidating
> + * individually.
> + */
> + kvm_call_hyp(__kvm_tlb_flush_vmid, data->mmu);
> +
> + free_page((unsigned long)kvm_pte_follow(pte));
> put_page(virt_to_page(ptep));
>
> if (data->anchor == ptep) {
> .

2020-12-01 14:36:07

by Marc Zyngier

[permalink] [raw]
Subject: Re: [RFC PATCH 2/3] KVM: arm64: Fix handling of merging tables into a block entry

On 2020-12-01 14:23, Will Deacon wrote:
> On Tue, Dec 01, 2020 at 02:05:03PM +0000, Marc Zyngier wrote:
>> On 2020-12-01 13:46, Will Deacon wrote:
>> > diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
>> > index 0271b4a3b9fe..12526d8c7ae4 100644
>> > --- a/arch/arm64/kvm/hyp/pgtable.c
>> > +++ b/arch/arm64/kvm/hyp/pgtable.c
>> > @@ -493,7 +493,7 @@ static int stage2_map_walk_table_pre(u64 addr, u64
>> > end, u32 level,
>> > return 0;
>> >
>> > kvm_set_invalid_pte(ptep);
>> > - kvm_call_hyp(__kvm_tlb_flush_vmid_ipa, data->mmu, addr, 0);
>> > + /* TLB invalidation is deferred until the _post handler */
>> > data->anchor = ptep;
>> > return 0;
>> > }
>> > @@ -547,11 +547,21 @@ static int stage2_map_walk_table_post(u64 addr,
>> > u64 end, u32 level,
>> > struct stage2_map_data *data)
>> > {
>> > int ret = 0;
>> > + kvm_pte_t pte = *ptep;
>> >
>> > if (!data->anchor)
>> > return 0;
>> >
>> > - free_page((unsigned long)kvm_pte_follow(*ptep));
>> > + kvm_set_invalid_pte(ptep);
>> > +
>> > + /*
>> > + * Invalidate the whole stage-2, as we may have numerous leaf
>> > + * entries below us which would otherwise need invalidating
>> > + * individually.
>> > + */
>> > + kvm_call_hyp(__kvm_tlb_flush_vmid, data->mmu);
>>
>> That's a big hammer, and we so far have been pretty careful not to
>> over-invalidate. Is the block-replacing-table *without* an unmap
>> in between the only case where this triggers?
>
> Yes, this only happens in that case. The alternative would be to issue
> invalidations for every single entry we unmap, which I can implement if
> you prefer, but it felt worse to me given that by-IPA invalidation
> isn't really great either).

Right. If that's the only case where this happens, I'm not too bothered.
What I want to make sure is that the normal table->block upgrade path
(which goes via a MMU notifier that unmaps the table) stays relatively
quick and doesn't suffer from the above.

It looks like Yanan still has some concerns though, and I'd like to
understand what they are.

Thanks,

M.
--
Jazz is not dead. It just smells funny...

2020-12-01 14:38:51

by Marc Zyngier

[permalink] [raw]
Subject: Re: [RFC PATCH 2/3] KVM: arm64: Fix handling of merging tables into a block entry

Hi Yanan,

On 2020-12-01 14:11, wangyanan (Y) wrote:
> On 2020/12/1 21:46, Will Deacon wrote:
>> On Tue, Dec 01, 2020 at 10:30:41AM +0800, wangyanan (Y) wrote:

[...]

>>> The point is at b.iii where the TLBI is not enough. There are many
>>> page
>>> mappings that we need to merge into a block mapping.
>>>
>>> We invalidate the TLB for the input address without level hint at
>>> b.iii, but
>>> this operation just flush TLB for one page mapping, there
>>>
>>> are still some TLB entries for the other page mappings in the cache,
>>> the MMU
>>> hardware walker can still hit these entries next time.
>> Ah, yes, I see. Thanks. I hadn't considered the case where there are
>> table
>> entries beneath the anchor. So how about the diff below?
>>
>> Will
>>
>> --->8
>
> Hi, I think it's inappropriate to put the TLBI of all the leaf entries
> in function stage2_map_walk_table_post(),
>
> because the *ptep must be an upper table entry when we enter
> stage2_map_walk_table_post().
>
> We should make the TLBI for every leaf entry not table entry in the
> last lookup level,  just as I am proposing
>
> to add the additional TLBI in function stage2_map_walk_leaf().

Could you make your concerns explicit? As far as I can tell, this should
address the bug you found, at least from a correctness perspective.

Are you worried about the impact of the full S2 invalidation? Or do you
see another correctness issue?

Thanks,

M.
--
Jazz is not dead. It just smells funny...

2020-12-01 17:25:02

by Yanan Wang

[permalink] [raw]
Subject: Re: [RFC PATCH 2/3] KVM: arm64: Fix handling of merging tables into a block entry

On 2020/12/1 22:35, Marc Zyngier wrote:

> Hi Yanan,
>
> On 2020-12-01 14:11, wangyanan (Y) wrote:
>> On 2020/12/1 21:46, Will Deacon wrote:
>>> On Tue, Dec 01, 2020 at 10:30:41AM +0800, wangyanan (Y) wrote:
>
> [...]
>
>>>> The point is at b.iii where the TLBI is not enough. There are many
>>>> page
>>>> mappings that we need to merge into a block mapping.
>>>>
>>>> We invalidate the TLB for the input address without level hint at
>>>> b.iii, but
>>>> this operation just flush TLB for one page mapping, there
>>>>
>>>> are still some TLB entries for the other page mappings in the
>>>> cache, the MMU
>>>> hardware walker can still hit these entries next time.
>>> Ah, yes, I see. Thanks. I hadn't considered the case where there are
>>> table
>>> entries beneath the anchor. So how about the diff below?
>>>
>>> Will
>>>
>>> --->8
>>
>> Hi, I think it's inappropriate to put the TLBI of all the leaf entries
>> in function stage2_map_walk_table_post(),
>>
>> because the *ptep must be an upper table entry when we enter
>> stage2_map_walk_table_post().
>>
>> We should make the TLBI for every leaf entry not table entry in the
>> last lookup level,  just as I am proposing
>>
>> to add the additional TLBI in function stage2_map_walk_leaf().
>
> Could you make your concerns explicit? As far as I can tell, this should
> address the bug you found, at least from a correctness perspective.
>
> Are you worried about the impact of the full S2 invalidation? Or do you
> see another correctness issue?


Hi Will, Marc,


After rechecking the diff, the full S2 invalidation in
stage2_map_walk_table_post() should be well enough to solve this problem.

But I was wondering if we could instead do the full S2 invalidation in
stage2_map_walk_table_pre(), where __kvm_tlb_flush_vmid() would be called
only once.

If we add the full TLBI in stage2_map_walk_table_post(),
__kvm_tlb_flush_vmid() might be called many times in the loop and lots of
(unnecessary) CPU instructions would be wasted.

What I'm suggesting is something like the diff below; please let me know
what you think.

If this is OK, I can update the diff in v2 and send it with your SOB (is
that appropriate?) after some tests.


diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
index b232bdd142a6..f11fb2996080 100644
--- a/arch/arm64/kvm/hyp/pgtable.c
+++ b/arch/arm64/kvm/hyp/pgtable.c
@@ -496,7 +496,7 @@ static int stage2_map_walk_table_pre(u64 addr, u64
end, u32 level,
                return 0;

        kvm_set_invalid_pte(ptep);
-       kvm_call_hyp(__kvm_tlb_flush_vmid_ipa, data->mmu, addr, 0);
+       kvm_call_hyp(__kvm_tlb_flush_vmid, data->mmu);
        data->anchor = ptep;
        return 0;
 }


Thanks,

Yanan

2020-12-01 22:35:50

by Will Deacon

[permalink] [raw]
Subject: Re: [RFC PATCH 2/3] KVM: arm64: Fix handling of merging tables into a block entry

On Tue, Dec 01, 2020 at 10:30:41AM +0800, wangyanan (Y) wrote:
> On 2020/12/1 0:01, Will Deacon wrote:
> > On Mon, Nov 30, 2020 at 11:24:19PM +0800, wangyanan (Y) wrote:
> > > On 2020/11/30 21:34, Will Deacon wrote:
> > > > On Mon, Nov 30, 2020 at 08:18:46PM +0800, Yanan Wang wrote:
> > > > > diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
> > > > > index 696b6aa83faf..fec8dc9f2baa 100644
> > > > > --- a/arch/arm64/kvm/hyp/pgtable.c
> > > > > +++ b/arch/arm64/kvm/hyp/pgtable.c
> > > > > @@ -500,6 +500,9 @@ static int stage2_map_walk_table_pre(u64 addr, u64 end, u32 level,
> > > > > return 0;
> > > > > }
> > > > > +static void stage2_flush_dcache(void *addr, u64 size);
> > > > > +static bool stage2_pte_cacheable(kvm_pte_t pte);
> > > > > +
> > > > > static int stage2_map_walk_leaf(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
> > > > > struct stage2_map_data *data)
> > > > > {
> > > > > @@ -507,9 +510,17 @@ static int stage2_map_walk_leaf(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
> > > > > struct page *page = virt_to_page(ptep);
> > > > > if (data->anchor) {
> > > > > - if (kvm_pte_valid(pte))
> > > > > + if (kvm_pte_valid(pte)) {
> > > > > + kvm_set_invalid_pte(ptep);
> > > > > + kvm_call_hyp(__kvm_tlb_flush_vmid_ipa, data->mmu,
> > > > > + addr, level);
> > > > > put_page(page);
> > > > This doesn't make sense to me: the page-table pages we're walking when the
> > > > anchor is set are not accessible to the hardware walker because we unhooked
> > > > the entire sub-table in stage2_map_walk_table_pre(), which has the necessary
> > > > TLB invalidation.
> > > >
> > > > Are you seeing a problem in practice here?
> > > Yes, I indeed find a problem in practice.
> > >
> > > When the migration was cancelled, a TLB conflic abort was found in guest.
> > >
> > > This problem is fixed before rework of the page table code, you can have a
> > > look in the following two links:
> > >
> > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=3c3736cd32bf5197aed1410ae826d2d254a5b277
> > >
> > > https://lists.cs.columbia.edu/pipermail/kvmarm/2019-March/035031.html
> > Ok, let's go through this, because I still don't see the bug. Please correct
> > me if you spot any mistakes:
> >
> > 1. We have a block mapping for X => Y
> > 2. Dirty logging is enabled, so the block mapping is write-protected and
> > ends up being split into page mappings
> > 3. Dirty logging is disabled due to a failed migration.
> >
> > --- At this point, I think we agree that the state of the MMU is alright ---
> >
> > 4. We take a stage-2 fault and want to reinstall the block mapping:
> >
> > a. kvm_pgtable_stage2_map() is invoked to install the block mapping
> > b. stage2_map_walk_table_pre() finds a table where we would like to
> > install the block:
> >
> > i. The anchor is set to point at this entry
> > ii. The entry is made invalid
> > iii. We invalidate the TLB for the input address. This is
> > TLBI IPAS2SE1IS without level hint and then TLBI VMALLE1IS.
> >
> > *** At this point, the page-table pointed to by the old table entry
> > is not reachable to the hardware walker ***
> >
> > c. stage2_map_walk_leaf() is called for each leaf entry in the
> > now-unreachable subtree, dropping page-references for each valid
> > entry it finds.
> > d. stage2_map_walk_table_post() is eventually called for the entry
> > which we cleared back in b.ii, so we install the new block mapping.
> >
> > You are proposing to add additional TLB invalidation to (c), but I don't
> > think that is necessary, thanks to the invalidation already performed in
> > b.iii. What am I missing here?
>
> The point is at b.iii where the TLBI is not enough. There are many page
> mappings that we need to merge into a block mapping.
>
> We invalidate the TLB for the input address without level hint at b.iii, but
> this operation just flush TLB for one page mapping, there
>
> are still some TLB entries for the other page mappings in the cache, the MMU
> hardware walker can still hit these entries next time.

Ah, yes, I see. Thanks. I hadn't considered the case where there are table
entries beneath the anchor. So how about the diff below?

Will

--->8

diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
index 0271b4a3b9fe..12526d8c7ae4 100644
--- a/arch/arm64/kvm/hyp/pgtable.c
+++ b/arch/arm64/kvm/hyp/pgtable.c
@@ -493,7 +493,7 @@ static int stage2_map_walk_table_pre(u64 addr, u64 end, u32 level,
return 0;

kvm_set_invalid_pte(ptep);
- kvm_call_hyp(__kvm_tlb_flush_vmid_ipa, data->mmu, addr, 0);
+ /* TLB invalidation is deferred until the _post handler */
data->anchor = ptep;
return 0;
}
@@ -547,11 +547,21 @@ static int stage2_map_walk_table_post(u64 addr, u64 end, u32 level,
struct stage2_map_data *data)
{
int ret = 0;
+ kvm_pte_t pte = *ptep;

if (!data->anchor)
return 0;

- free_page((unsigned long)kvm_pte_follow(*ptep));
+ kvm_set_invalid_pte(ptep);
+
+ /*
+ * Invalidate the whole stage-2, as we may have numerous leaf
+ * entries below us which would otherwise need invalidating
+ * individually.
+ */
+ kvm_call_hyp(__kvm_tlb_flush_vmid, data->mmu);
+
+ free_page((unsigned long)kvm_pte_follow(pte));
put_page(virt_to_page(ptep));

if (data->anchor == ptep) {

2020-12-01 22:37:52

by Will Deacon

[permalink] [raw]
Subject: Re: [RFC PATCH 2/3] KVM: arm64: Fix handling of merging tables into a block entry

On Tue, Dec 01, 2020 at 02:05:03PM +0000, Marc Zyngier wrote:
> On 2020-12-01 13:46, Will Deacon wrote:
> > diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
> > index 0271b4a3b9fe..12526d8c7ae4 100644
> > --- a/arch/arm64/kvm/hyp/pgtable.c
> > +++ b/arch/arm64/kvm/hyp/pgtable.c
> > @@ -493,7 +493,7 @@ static int stage2_map_walk_table_pre(u64 addr, u64
> > end, u32 level,
> > return 0;
> >
> > kvm_set_invalid_pte(ptep);
> > - kvm_call_hyp(__kvm_tlb_flush_vmid_ipa, data->mmu, addr, 0);
> > + /* TLB invalidation is deferred until the _post handler */
> > data->anchor = ptep;
> > return 0;
> > }
> > @@ -547,11 +547,21 @@ static int stage2_map_walk_table_post(u64 addr,
> > u64 end, u32 level,
> > struct stage2_map_data *data)
> > {
> > int ret = 0;
> > + kvm_pte_t pte = *ptep;
> >
> > if (!data->anchor)
> > return 0;
> >
> > - free_page((unsigned long)kvm_pte_follow(*ptep));
> > + kvm_set_invalid_pte(ptep);
> > +
> > + /*
> > + * Invalidate the whole stage-2, as we may have numerous leaf
> > + * entries below us which would otherwise need invalidating
> > + * individually.
> > + */
> > + kvm_call_hyp(__kvm_tlb_flush_vmid, data->mmu);
>
> That's a big hammer, and we so far have been pretty careful not to
> over-invalidate. Is the block-replacing-table *without* an unmap
> in between the only case where this triggers?

Yes, this only happens in that case. The alternative would be to issue
invalidations for every single entry we unmap, which I can implement if
you prefer, but it felt worse to me given that by-IPA invalidation
isn't really great either.

Will

2020-12-01 22:39:00

by Will Deacon

[permalink] [raw]
Subject: Re: [RFC PATCH 2/3] KVM: arm64: Fix handling of merging tables into a block entry

On Wed, Dec 02, 2020 at 01:20:33AM +0800, wangyanan (Y) wrote:
> On 2020/12/1 22:35, Marc Zyngier wrote:
>
> > Hi Yanan,
> >
> > On 2020-12-01 14:11, wangyanan (Y) wrote:
> > > On 2020/12/1 21:46, Will Deacon wrote:
> > > > On Tue, Dec 01, 2020 at 10:30:41AM +0800, wangyanan (Y) wrote:
> >
> > [...]
> >
> > > > > The point is at b.iii where the TLBI is not enough. There
> > > > > are many page
> > > > > mappings that we need to merge into a block mapping.
> > > > >
> > > > > We invalidate the TLB for the input address without level
> > > > > hint at b.iii, but
> > > > > this operation just flush TLB for one page mapping, there
> > > > >
> > > > > are still some TLB entries for the other page mappings in
> > > > > the cache, the MMU
> > > > > hardware walker can still hit these entries next time.
> > > > Ah, yes, I see. Thanks. I hadn't considered the case where there
> > > > are table
> > > > entries beneath the anchor. So how about the diff below?
> > > >
> > > > Will
> > > >
> > > > --->8
> > >
> > > Hi, I think it's inappropriate to put the TLBI of all the leaf entries
> > > in function stage2_map_walk_table_post(),
> > >
> > > because the *ptep must be an upper table entry when we enter
> > > stage2_map_walk_table_post().
> > >
> > > We should make the TLBI for every leaf entry not table entry in the
> > > last lookup level, just as I am proposing
> > >
> > > to add the additional TLBI in function stage2_map_walk_leaf().
> >
> > Could you make your concerns explicit? As far as I can tell, this should
> > address the bug you found, at least from a correctness perspective.
> >
> > Are you worried about the impact of the full S2 invalidation? Or do you
> > see another correctness issue?
>
>
> Hi Will, Marc,
>
>
> After recheck of the diff, the full S2 invalidation in
> stage2_map_walk_table_post() should be well enough to solve this problem.
>
> But I was wondering if we can add the full S2 invalidation in
> stage2_map_walk_table_pre(), where __kvm_tlb_flush_vmid() will be called for
> only one time.
>
> If we add the full TLBI in stage2_map_walk_table_post(),
> __kvm_tlb_flush_vmid() might be called for many times in the loop and lots
> of (unnecessary) CPU instructions will be wasted.
>
> What I'm saying is something like below, please let me know what do you
> think.
>
> If this is OK, I can update the diff in v2 and send it with your SOB (is it
> appropriate?) after some tests.
>
>
> diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
> index b232bdd142a6..f11fb2996080 100644
> --- a/arch/arm64/kvm/hyp/pgtable.c
> +++ b/arch/arm64/kvm/hyp/pgtable.c
> @@ -496,7 +496,7 @@ static int stage2_map_walk_table_pre(u64 addr, u64 end,
> u32 level,
>                 return 0;
>
>         kvm_set_invalid_pte(ptep);
> -       kvm_call_hyp(__kvm_tlb_flush_vmid_ipa, data->mmu, addr, 0);
> +       kvm_call_hyp(__kvm_tlb_flush_vmid, data->mmu);
>         data->anchor = ptep;
>         return 0;

Yes, I think that's much better, but please add a comment! (you can
probably more-or-less copy the one I had in the post handler)

Will