Date: Wed, 20 Mar 2019 08:15:00 +0000
Message-ID: <86d0mmynaz.wl-marc.zyngier@arm.com>
From: Marc Zyngier
To: Suzuki K Poulose
Cc: Christoffer Dall
Subject: Re: [PATCH] kvm: arm: Fix handling of stage2 huge mappings
In-Reply-To: <1553004668-23296-1-git-send-email-suzuki.poulose@arm.com>
References: <25971fd5-3774-3389-a82a-04707480c1e0@huawei.com>
 <1553004668-23296-1-git-send-email-suzuki.poulose@arm.com>
User-Agent: Wanderlust/2.15.9 (Almost Unreal) SEMI-EPG/1.14.7 (Harue)
 FLIM/1.14.9 (=?UTF-8?B?R29qxY0=?=) APEL/10.8 EasyPG/1.0.0 Emacs/26
 (aarch64-unknown-linux-gnu) MULE/6.0 (HANACHIRUSATO)
Organization: ARM Ltd
MIME-Version: 1.0 (generated by SEMI-EPG 1.14.7 - "Harue")
Content-Type: text/plain;
 charset=US-ASCII
X-Mailing-List: linux-kernel@vger.kernel.org

Hi Suzuki,

On Tue, 19 Mar 2019 14:11:08 +0000,
Suzuki K Poulose wrote:
> 
> We rely on the mmu_notifier call backs to handle the split/merge
> of huge pages and thus we are guaranteed that, while creating a
> block mapping, either the entire block is unmapped at stage2 or it
> is missing permission.
> 
> However, we miss a case where the block mapping is split for dirty
> logging case and then could later be made block mapping, if we cancel the
> dirty logging. This not only creates inconsistent TLB entries for
> the pages in the block, but also leaks the table pages for
> PMD level.
> 
> Handle this corner case for the huge mappings at stage2 by
> unmapping the non-huge mapping for the block. This could potentially
> release the upper level table. So we need to restart the table walk
> once we unmap the range.
> 
> Fixes: ad361f093c1e31d ("KVM: ARM: Support hugetlbfs backed huge pages")
> Reported-by: Zheng Xiang
> Cc: Zheng Xiang
> Cc: Zhengui Yu
> Cc: Marc Zyngier
> Cc: Christoffer Dall
> Signed-off-by: Suzuki K Poulose
> ---
>  virt/kvm/arm/mmu.c | 63 ++++++++++++++++++++++++++++++++++++++----------------
>  1 file changed, 45 insertions(+), 18 deletions(-)
> 
> diff --git a/virt/kvm/arm/mmu.c b/virt/kvm/arm/mmu.c
> index fce0983..6ad6f19d 100644
> --- a/virt/kvm/arm/mmu.c
> +++ b/virt/kvm/arm/mmu.c
> @@ -1060,25 +1060,43 @@ static int stage2_set_pmd_huge(struct kvm *kvm, struct kvm_mmu_memory_cache
>  {
>  	pmd_t *pmd, old_pmd;
>  
> +retry:
>  	pmd = stage2_get_pmd(kvm, cache, addr);
>  	VM_BUG_ON(!pmd);
>  
>  	old_pmd = *pmd;
> +	/*
> +	 * Multiple vcpus faulting on the same PMD entry, can
> +	 * lead to them sequentially updating the PMD with the
> +	 * same value. Following the break-before-make
> +	 * (pmd_clear() followed by tlb_flush()) process can
> +	 * hinder forward progress due to refaults generated
> +	 * on missing translations.
> +	 *
> +	 * Skip updating the page table if the entry is
> +	 * unchanged.
> +	 */
> +	if (pmd_val(old_pmd) == pmd_val(*new_pmd))
> +		return 0;
> +
>  	if (pmd_present(old_pmd)) {
>  		/*
> -		 * Multiple vcpus faulting on the same PMD entry, can
> -		 * lead to them sequentially updating the PMD with the
> -		 * same value. Following the break-before-make
> -		 * (pmd_clear() followed by tlb_flush()) process can
> -		 * hinder forward progress due to refaults generated
> -		 * on missing translations.
> +		 * If we already have PTE level mapping for this block,
> +		 * we must unmap it to avoid inconsistent TLB state and
> +		 * leaking the table page. We could end up in this situation
> +		 * if the memory slot was marked for dirty logging and was
> +		 * reverted, leaving PTE level mappings for the pages accessed
> +		 * during the period. So, unmap the PTE level mapping for this
> +		 * block and retry, as we could have released the upper level
> +		 * table in the process.
>  		 *
> -		 * Skip updating the page table if the entry is
> -		 * unchanged.
> +		 * Normal THP split/merge follows mmu_notifier callbacks and do
> +		 * get handled accordingly.
>  		 */
> -		if (pmd_val(old_pmd) == pmd_val(*new_pmd))
> -			return 0;
> -
> +		if (!pmd_thp_or_huge(old_pmd)) {
> +			unmap_stage2_range(kvm, addr & S2_PMD_MASK, S2_PMD_SIZE);
> +			goto retry;

This looks slightly dodgy. Doing this retry results in another call to
stage2_get_pmd(), which may or may not result in allocating a PUD. I
think this is safe as if we managed to get here, it means the whole
hierarchy was already present and nothing was allocated in the first
round.

Somehow, I would feel more comfortable with just not even trying.
Unmap, don't fix the fault, let the vcpu come again for additional
punishment. But this is probably more invasive, as none of the
stage2_set_p*() return values is ever evaluated. Oh well.

> +		}
>  		/*
>  		 * Mapping in huge pages should only happen through a
>  		 * fault.  If a page is merged into a transparent huge
> @@ -1090,8 +1108,7 @@ static int stage2_set_pmd_huge(struct kvm *kvm, struct kvm_mmu_memory_cache
>  		 * should become splitting first, unmapped, merged,
>  		 * and mapped back in on-demand.
>  		 */
> -		VM_BUG_ON(pmd_pfn(old_pmd) != pmd_pfn(*new_pmd));
> -
> +		WARN_ON_ONCE(pmd_pfn(old_pmd) != pmd_pfn(*new_pmd));
>  		pmd_clear(pmd);
>  		kvm_tlb_flush_vmid_ipa(kvm, addr);
>  	} else {
> @@ -1107,6 +1124,7 @@ static int stage2_set_pud_huge(struct kvm *kvm, struct kvm_mmu_memory_cache *cac
>  {
>  	pud_t *pudp, old_pud;
>  
> +retry:
>  	pudp = stage2_get_pud(kvm, cache, addr);
>  	VM_BUG_ON(!pudp);
>  
> @@ -1114,16 +1132,25 @@ static int stage2_set_pud_huge(struct kvm *kvm, struct kvm_mmu_memory_cache *cac
>  
>  	/*
>  	 * A large number of vcpus faulting on the same stage 2 entry,
> -	 * can lead to a refault due to the
> -	 * stage2_pud_clear()/tlb_flush(). Skip updating the page
> -	 * tables if there is no change.
> +	 * can lead to a refault due to the stage2_pud_clear()/tlb_flush().
> +	 * Skip updating the page tables if there is no change.
>  	 */
>  	if (pud_val(old_pud) == pud_val(*new_pudp))
>  		return 0;
>  
>  	if (stage2_pud_present(kvm, old_pud)) {
> -		stage2_pud_clear(kvm, pudp);
> -		kvm_tlb_flush_vmid_ipa(kvm, addr);
> +		/*
> +		 * If we already have table level mapping for this block, unmap
> +		 * the range for this block and retry.
> +		 */
> +		if (!stage2_pud_huge(kvm, old_pud)) {
> +			unmap_stage2_range(kvm, addr & S2_PUD_MASK, S2_PUD_SIZE);

This broke 32bit.
I've added the following hunk to fix it:

diff --git a/arch/arm/include/asm/stage2_pgtable.h b/arch/arm/include/asm/stage2_pgtable.h
index de2089501b8b..b8f21088a744 100644
--- a/arch/arm/include/asm/stage2_pgtable.h
+++ b/arch/arm/include/asm/stage2_pgtable.h
@@ -68,6 +68,9 @@ stage2_pmd_addr_end(struct kvm *kvm, phys_addr_t addr, phys_addr_t end)
 #define stage2_pmd_table_empty(kvm, pmdp)	kvm_page_empty(pmdp)
 #define stage2_pud_table_empty(kvm, pudp)	false
 
+#define S2_PUD_MASK				PGDIR_MASK
+#define S2_PUD_SIZE				PGDIR_SIZE
+
 static inline bool kvm_stage2_has_pud(struct kvm *kvm)
 {
 	return false;

> +			goto retry;
> +		} else {
> +			WARN_ON_ONCE(pud_pfn(old_pud) != pud_pfn(*new_pudp));
> +			stage2_pud_clear(kvm, pudp);
> +			kvm_tlb_flush_vmid_ipa(kvm, addr);
> +		}

The 'else' line could go, which would make the code similar to the PMD path.

>  	} else {
>  		get_page(virt_to_page(pudp));
>  	}
> -- 
> 2.7.4
> 

If you're OK with the above nits, I'll squash them into the patch.

Thanks,

	M.

-- 
Jazz is not dead, it just smell funny.
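
Footnote on the unmap-and-retry shape discussed in this thread: the control
flow can be tried outside the kernel with a toy userspace model. This is only
a sketch under stated assumptions. Every name below (toy_pmd, toy_stats,
toy_split, toy_unmap, toy_set_block) is hypothetical and exists nowhere in
the kernel, a single struct stands in for one stage2 PMD entry, and the
counters merely simulate table allocation and TLB invalidation.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Toy model only: none of these names exist in the kernel. */
enum toy_kind { TOY_NONE, TOY_TABLE, TOY_BLOCK };

struct toy_pmd {
	enum toy_kind kind;
	uint64_t pfn;		/* meaningful when kind == TOY_BLOCK */
	uint64_t *pte_table;	/* meaningful when kind == TOY_TABLE */
};

struct toy_stats {
	int live_tables;	/* PTE tables currently allocated */
	int tlb_flushes;	/* simulated TLB invalidations */
};

/*
 * Populate an empty entry with a PTE table, standing in for the state
 * left behind once dirty logging has forced PTE-level mappings.
 */
static int toy_split(struct toy_pmd *pmd, struct toy_stats *st)
{
	assert(pmd->kind == TOY_NONE);
	pmd->pte_table = calloc(512, sizeof(*pmd->pte_table));
	if (!pmd->pte_table)
		return -1;
	pmd->kind = TOY_TABLE;
	st->live_tables++;
	return 0;
}

/* Hypothetical stand-in for unmapping the range: free the table, flush. */
static void toy_unmap(struct toy_pmd *pmd, struct toy_stats *st)
{
	free(pmd->pte_table);
	pmd->pte_table = NULL;
	pmd->kind = TOY_NONE;
	st->live_tables--;
	st->tlb_flushes++;
}

/* Install a block mapping using the unmap-then-retry control flow. */
static int toy_set_block(struct toy_pmd *pmd, uint64_t pfn,
			 struct toy_stats *st)
{
retry:
	/* Identical entry already present: skip break-before-make. */
	if (pmd->kind == TOY_BLOCK && pmd->pfn == pfn)
		return 0;

	if (pmd->kind == TOY_TABLE) {
		/* Stale PTE-level mapping: drop it, then redo the lookup. */
		toy_unmap(pmd, st);
		goto retry;
	}

	if (pmd->kind == TOY_BLOCK) {
		/* Break-before-make: clear the old block and flush first. */
		pmd->kind = TOY_NONE;
		st->tlb_flushes++;
	}

	pmd->kind = TOY_BLOCK;
	pmd->pfn = pfn;
	return 0;
}
```

In this model, calling toy_split() and then toy_set_block() on the same entry
leaves no table allocated, and a second toy_set_block() with the same pfn
returns without a further flush. The alternative raised in the review (unmap
and return, letting the vcpu fault again) would amount to replacing the
"goto retry" with an early return, which only works if the callers check the
return value.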