Subject: Re: [RFC PATCH v3 1/2] KVM: arm64: Move CMOs from user_mem_abort to the fault handlers
From: Alexandru Elisei
To: "wangyanan (Y)", Marc Zyngier, Will Deacon, Catalin Marinas,
 kvmarm@lists.cs.columbia.edu, linux-arm-kernel@lists.infradead.org,
 kvm@vger.kernel.org, linux-kernel@vger.kernel.org
Cc: James Morse, Julien Thierry, Suzuki K Poulose, Gavin Shan,
 Quentin Perret, wanghaibin.wang@huawei.com, zhukeqian1@huawei.com,
 yuzenghui@huawei.com
Date: Thu, 8 Apr 2021 16:59:12 +0100
Message-ID: <94911842-d55d-bd6f-74ea-a947c09584c2@arm.com>
References: <20210326031654.3716-1-wangyanan55@huawei.com>
 <20210326031654.3716-2-wangyanan55@huawei.com>
X-Mailing-List: linux-kernel@vger.kernel.org

Hi Yanan,

On 4/8/21 10:23 AM, wangyanan (Y) wrote:
> Hi Alex,
>
> On 2021/4/7 23:31, Alexandru Elisei wrote:
>> Hi Yanan,
>>
>> On 3/26/21 3:16 AM, Yanan Wang wrote:
>>> We currently uniformly perform CMOs of the D-cache and I-cache in
>>> user_mem_abort before calling the fault handlers. If we get concurrent
>>> guest faults (e.g. translation faults, permission faults) or some really
>>> unnecessary guest faults caused by BBM, CMOs for the first vcpu are
>> I can't figure out what BBM means.
> Just as Will has explained, it's the Break-Before-Make rule. When we need
> to replace an old table entry with a new one, we should first invalidate
> the old table entry (Break) before installing the new entry (Make).

Got it, thank you and Will for the explanation.
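
For anyone else following the thread, the sequence looks roughly like the
sketch below. This is illustrative only: the helper names are borrowed
from the existing stage 2 page table code, and the surrounding context is
made up, not a quote of the kernel sources:

    /*
     * Break-Before-Make, sketched: never replace one valid stage 2
     * entry directly with another; tear the old one down first.
     */
    kvm_set_invalid_pte(ptep);              /* Break: invalidate the old entry */
    kvm_call_hyp(__kvm_tlb_flush_vmid_ipa,
                 mmu, addr, level);         /* flush stale TLB entries for the IPA */
    smp_store_release(ptep, new);           /* Make: install the new entry */
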
>
> And I think this patch mainly introduces benefits in two specific scenarios:
> 1) During VM startup, it will improve the efficiency of handling page
> faults incurred by vCPUs when initially populating the stage 2 page tables.
> 2) After live migration, the heavy workload will be resumed on the
> destination VMs, but all the stage 2 page tables need to be rebuilt.
>>> necessary while the others later are not.
>>>
>>> By moving CMOs to the fault handlers, we can easily identify conditions
>>> where they are really needed and avoid the unnecessary ones. As performing
>>> CMOs is a time-consuming process, especially when flushing a block range,
>>> this solution reduces the load on KVM and improves the efficiency of the
>>> page table code.
>>>
>>> So let's move both the clean of the D-cache and the invalidation of the
>>> I-cache to the map path, and move only the invalidation of the I-cache to
>>> the permission path. Since the original APIs for CMOs in mmu.c are only
>>> called in user_mem_abort, we now also move them to pgtable.c.
>>>
>>> Signed-off-by: Yanan Wang
>>> ---
>>>   arch/arm64/include/asm/kvm_mmu.h | 31 ---------------
>>>   arch/arm64/kvm/hyp/pgtable.c     | 68 +++++++++++++++++++++++++-------
>>>   arch/arm64/kvm/mmu.c             | 23 ++---------
>>>   3 files changed, 57 insertions(+), 65 deletions(-)
>>>
>>> diff --git a/arch/arm64/include/asm/kvm_mmu.h b/arch/arm64/include/asm/kvm_mmu.h
>>> index 90873851f677..c31f88306d4e 100644
>>> --- a/arch/arm64/include/asm/kvm_mmu.h
>>> +++ b/arch/arm64/include/asm/kvm_mmu.h
>>> @@ -177,37 +177,6 @@ static inline bool vcpu_has_cache_enabled(struct kvm_vcpu *vcpu)
>>>       return (vcpu_read_sys_reg(vcpu, SCTLR_EL1) & 0b101) == 0b101;
>>>   }
>>>
>>> -static inline void __clean_dcache_guest_page(kvm_pfn_t pfn, unsigned long size)
>>> -{
>>> -    void *va = page_address(pfn_to_page(pfn));
>>> -
>>> -    /*
>>> -     * With FWB, we ensure that the guest always accesses memory using
>>> -     * cacheable attributes, and we don't have to clean to PoC when
>>> -     * faulting in pages. Furthermore, FWB implies IDC, so cleaning to
>>> -     * PoU is not required either in this case.
>>> -     */
>>> -    if (cpus_have_const_cap(ARM64_HAS_STAGE2_FWB))
>>> -        return;
>>> -
>>> -    kvm_flush_dcache_to_poc(va, size);
>>> -}
>>> -
>>> -static inline void __invalidate_icache_guest_page(kvm_pfn_t pfn,
>>> -                          unsigned long size)
>>> -{
>>> -    if (icache_is_aliasing()) {
>>> -        /* any kind of VIPT cache */
>>> -        __flush_icache_all();
>>> -    } else if (is_kernel_in_hyp_mode() || !icache_is_vpipt()) {
>>> -        /* PIPT or VPIPT at EL2 (see comment in __kvm_tlb_flush_vmid_ipa) */
>>> -        void *va = page_address(pfn_to_page(pfn));
>>> -
>>> -        invalidate_icache_range((unsigned long)va,
>>> -                    (unsigned long)va + size);
>>> -    }
>>> -}
>>> -
>>>   void kvm_set_way_flush(struct kvm_vcpu *vcpu);
>>>   void kvm_toggle_cache(struct kvm_vcpu *vcpu, bool was_enabled);
>>>
>>> diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
>>> index 4d177ce1d536..829a34eea526 100644
>>> --- a/arch/arm64/kvm/hyp/pgtable.c
>>> +++ b/arch/arm64/kvm/hyp/pgtable.c
>>> @@ -464,6 +464,43 @@ static int stage2_map_set_prot_attr(enum kvm_pgtable_prot prot,
>>>       return 0;
>>>   }
>>>
>>> +static bool stage2_pte_cacheable(kvm_pte_t pte)
>>> +{
>>> +    u64 memattr = pte & KVM_PTE_LEAF_ATTR_LO_S2_MEMATTR;
>>> +    return memattr == PAGE_S2_MEMATTR(NORMAL);
>>> +}
>>> +
>>> +static bool stage2_pte_executable(kvm_pte_t pte)
>>> +{
>>> +    return !(pte & KVM_PTE_LEAF_ATTR_HI_S2_XN);
>>> +}
>>> +
>>> +static void stage2_flush_dcache(void *addr, u64 size)
>>> +{
>>> +    /*
>>> +     * With FWB, we ensure that the guest always accesses memory using
>>> +     * cacheable attributes, and we don't have to clean to PoC when
>>> +     * faulting in pages. Furthermore, FWB implies IDC, so cleaning to
>>> +     * PoU is not required either in this case.
>>> +     */
>>> +    if (cpus_have_const_cap(ARM64_HAS_STAGE2_FWB))
>>> +        return;
>>> +
>>> +    __flush_dcache_area(addr, size);
>>> +}
>>> +
>>> +static void stage2_invalidate_icache(void *addr, u64 size)
>>> +{
>>> +    if (icache_is_aliasing()) {
>>> +        /* Flush any kind of VIPT icache */
>>> +        __flush_icache_all();
>>> +    } else if (is_kernel_in_hyp_mode() || !icache_is_vpipt()) {
>>> +        /* PIPT or VPIPT at EL2 */
>>> +        invalidate_icache_range((unsigned long)addr,
>>> +                    (unsigned long)addr + size);
>>> +    }
>>> +}
>>> +
>>>   static int stage2_map_walker_try_leaf(u64 addr, u64 end, u32 level,
>>>                         kvm_pte_t *ptep,
>>>                         struct stage2_map_data *data)
>>> @@ -495,6 +532,13 @@ static int stage2_map_walker_try_leaf(u64 addr, u64 end, u32 level,
>>>           put_page(page);
>>>       }
>>>
>>> +    /* Perform CMOs before installation of the new PTE */
>>> +    if (!kvm_pte_valid(old) || stage2_pte_cacheable(old))
>> I'm not sure why the stage2_pte_cacheable(old) condition is needed.
>>
>> kvm_handle_guest_abort() handles three types of stage 2 data or instruction
>> aborts: translation faults (fault_status == FSC_FAULT), access faults
>> (fault_status == FSC_ACCESS) and permission faults (fault_status == FSC_PERM).
>>
>> Access faults are handled in handle_access_fault(), which means
>> user_mem_abort() handles translation and permission faults.
> Yes, and we are certain that it's a translation fault here in
> stage2_map_walker_try_leaf.
>> The original code did the dcache clean + inval when not a permission
>> fault, which means the CMO was done only on a translation fault.
>> Translation faults mean that the IPA was not mapped, so the old entry
>> will always be invalid. Even if we're coalescing multiple last level
>> leaf entries into a block mapping, the table entry which is replaced is
>> invalid because it's marked as such in stage2_map_walk_table_pre().
>>
>> Is there something I'm missing?
> I originally thought that we could possibly have a translation fault on
> a valid stage 2 table descriptor due to some special cases, and that's
> the reason the stage2_pte_cacheable(old) condition exists, but I can't
> imagine any scenario like this.
>
> I think your above explanation is right, maybe I should just drop that
> condition.
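
For clarity, with that condition dropped the check would reduce to the
following. This is just the resulting shape of the code, not part of the
patch as posted:

    /* Translation fault: the old entry is guaranteed to be invalid */
    if (!kvm_pte_valid(old))
        stage2_flush_dcache(__va(phys), granule);

    if (stage2_pte_executable(new))
        stage2_invalidate_icache(__va(phys), granule);
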
>>
>>> +        stage2_flush_dcache(__va(phys), granule);
>>> +
>>> +    if (stage2_pte_executable(new))
>>> +        stage2_invalidate_icache(__va(phys), granule);
>> This, together with the stage2_attr_walker() changes below, looks identical
>> to the current code in user_mem_abort(). The executable permission is set
>> on an exec fault (an instruction abort that is not on a stage 2 translation
>> table walk), and as a result of the fault we either need to map a new page
>> here, or relax permissions in kvm_pgtable_stage2_relax_perms() ->
>> stage2_attr_walker() below.
> I agree.
> Do you mean this part of the change is right?

Yes, I was trying to explain that the behaviour with regard to icache
invalidation from this patch is identical to the current behaviour of
user_mem_abort() (without this patch).

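To put the same thing in code, an exec fault reaches the CMOs through one
of two paths, roughly like the sketch below. This is a condensed,
illustrative rendering of the dispatch at the end of user_mem_abort(),
simplified for this discussion (the real condition has extra cases):

    if (fault_status == FSC_PERM) {
        /* Existing mapping: relax permissions; stage2_attr_walker()
         * invalidates the icache when the X permission is added. */
        ret = kvm_pgtable_stage2_relax_perms(pgt, fault_ipa, prot);
    } else {
        /* New mapping: stage2_map_walker_try_leaf() cleans the dcache
         * and invalidates the icache before installing the PTE. */
        ret = kvm_pgtable_stage2_map(pgt, fault_ipa, vma_pagesize,
                                     __pfn_to_phys(pfn), prot, memcache);
    }
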
Thanks,

Alex

>
> Thanks,
> Yanan
>> Thanks,
>>
>> Alex
>>
>>> +
>>>       smp_store_release(ptep, new);
>>>       get_page(page);
>>>       data->phys += granule;
>>> @@ -651,20 +695,6 @@ int kvm_pgtable_stage2_map(struct kvm_pgtable *pgt, u64 addr, u64 size,
>>>       return ret;
>>>   }
>>>
>>> -static void stage2_flush_dcache(void *addr, u64 size)
>>> -{
>>> -    if (cpus_have_const_cap(ARM64_HAS_STAGE2_FWB))
>>> -        return;
>>> -
>>> -    __flush_dcache_area(addr, size);
>>> -}
>>> -
>>> -static bool stage2_pte_cacheable(kvm_pte_t pte)
>>> -{
>>> -    u64 memattr = pte & KVM_PTE_LEAF_ATTR_LO_S2_MEMATTR;
>>> -    return memattr == PAGE_S2_MEMATTR(NORMAL);
>>> -}
>>> -
>>>   static int stage2_unmap_walker(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
>>>                      enum kvm_pgtable_walk_flags flag,
>>>                      void * const arg)
>>> @@ -743,8 +773,16 @@ static int stage2_attr_walker(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
>>>        * but worst-case the access flag update gets lost and will be
>>>        * set on the next access instead.
>>>        */
>>> -    if (data->pte != pte)
>>> +    if (data->pte != pte) {
>>> +        /*
>>> +         * Invalidate the instruction cache before updating
>>> +         * if we are going to add the executable permission.
>>> +         */
>>> +        if (!stage2_pte_executable(*ptep) && stage2_pte_executable(pte))
>>> +            stage2_invalidate_icache(kvm_pte_follow(pte),
>>> +                         kvm_granule_size(level));
>>>           WRITE_ONCE(*ptep, pte);
>>> +    }
>>>
>>>       return 0;
>>>   }
>>> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
>>> index 77cb2d28f2a4..1eec9f63bc6f 100644
>>> --- a/arch/arm64/kvm/mmu.c
>>> +++ b/arch/arm64/kvm/mmu.c
>>> @@ -609,16 +609,6 @@ void kvm_arch_mmu_enable_log_dirty_pt_masked(struct kvm *kvm,
>>>       kvm_mmu_write_protect_pt_masked(kvm, slot, gfn_offset, mask);
>>>   }
>>>
>>> -static void clean_dcache_guest_page(kvm_pfn_t pfn, unsigned long size)
>>> -{
>>> -    __clean_dcache_guest_page(pfn, size);
>>> -}
>>> -
>>> -static void invalidate_icache_guest_page(kvm_pfn_t pfn, unsigned long size)
>>> -{
>>> -    __invalidate_icache_guest_page(pfn, size);
>>> -}
>>> -
>>>   static void kvm_send_hwpoison_signal(unsigned long address, short lsb)
>>>   {
>>>       send_sig_mceerr(BUS_MCEERR_AR, (void __user *)address, lsb, current);
>>> @@ -882,13 +872,8 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>>>       if (writable)
>>>           prot |= KVM_PGTABLE_PROT_W;
>>>
>>> -    if (fault_status != FSC_PERM && !device)
>>> -        clean_dcache_guest_page(pfn, vma_pagesize);
>>> -
>>> -    if (exec_fault) {
>>> +    if (exec_fault)
>>>           prot |= KVM_PGTABLE_PROT_X;
>>> -        invalidate_icache_guest_page(pfn, vma_pagesize);
>>> -    }
>>>
>>>       if (device)
>>>           prot |= KVM_PGTABLE_PROT_DEVICE;
>>> @@ -1144,10 +1129,10 @@ int kvm_set_spte_hva(struct kvm *kvm, unsigned long hva, pte_t pte)
>>>       trace_kvm_set_spte_hva(hva);
>>>
>>>       /*
>>> -     * We've moved a page around, probably through CoW, so let's treat it
>>> -     * just like a translation fault and clean the cache to the PoC.
>>> +     * We've moved a page around, probably through CoW, so let's treat
>>> +     * it just like a translation fault and the map handler will clean
>>> +     * the cache to the PoC.
>>>        */
>>> -    clean_dcache_guest_page(pfn, PAGE_SIZE);
>>>       handle_hva_to_gpa(kvm, hva, end, &kvm_set_spte_handler, &pfn);
>>>       return 0;
>>>   }