Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751378AbdGQNDQ (ORCPT ); Mon, 17 Jul 2017 09:03:16 -0400 Received: from usa-sjc-mx-foss1.foss.arm.com ([217.140.101.70]:38942 "EHLO foss.arm.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751280AbdGQNDO (ORCPT ); Mon, 17 Jul 2017 09:03:14 -0400 Subject: Re: [PATCH v2] KVM: arm/arm64: Handle hva aging while destroying the vm To: Christoffer Dall References: <1499235631-141725-1-git-send-email-agraf@suse.de> <20170705085700.GA16881@e107814-lin.cambridge.arm.com> <20170706074513.GC18106@cbox> <18e7012c-a095-ecfa-470c-cf81177698a1@arm.com> <20170706094205.GE18106@cbox> <5cb34cc0-27c1-c011-a8d4-c991e47141c3@arm.com> <20170716195658.GA31432@cbox> Cc: Alexander Graf , kvmarm@lists.cs.columbia.edu, kvm@vger.kernel.org, linux-kernel@vger.kernel.org, Andrea Arcangeli , stable@vger.kernel.org From: Suzuki K Poulose Message-ID: Date: Mon, 17 Jul 2017 14:03:11 +0100 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Thunderbird/45.8.0 MIME-Version: 1.0 In-Reply-To: <20170716195658.GA31432@cbox> Content-Type: text/plain; charset=us-ascii; format=flowed Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 7986 Lines: 194 On 16/07/17 20:56, Christoffer Dall wrote: > On Fri, Jul 14, 2017 at 05:40:48PM +0100, Suzuki K Poulose wrote: >> On 06/07/17 10:42, Christoffer Dall wrote: >>> On Thu, Jul 06, 2017 at 10:34:58AM +0100, Suzuki K Poulose wrote: >>>> On 06/07/17 08:45, Christoffer Dall wrote: >>>>> On Thu, Jul 06, 2017 at 09:07:49AM +0200, Alexander Graf wrote: >>>>>> >>>>>> >>>>>> On 05.07.17 10:57, Suzuki K Poulose wrote: >>>>>>> Hi Alex, >>>>>>> >>>>>>> On Wed, Jul 05, 2017 at 08:20:31AM +0200, Alexander Graf wrote: >>>>>>>> The kvm_age_hva callback may be called all the way concurrently while >>>>>>>> kvm_mmu_notifier_release() is running. >>>>>>>> >>>>>>>> The release function sets kvm->arch.pgd = NULL which the aging function >>>>>>>> however implicitly relies on in stage2_get_pud(). That means they can >>>>>>>> race and the aging function may dereference a NULL pgd pointer. >>>>>>>> >>>>>>>> This patch adds a check for that case, so that we leave the aging >>>>>>>> function silently. >>>>>>>> >>>>>>>> Cc: stable@vger.kernel.org >>>>>>>> Fixes: 293f29363 ("kvm-arm: Unmap shadow pagetables properly") >>>>>>>> Signed-off-by: Alexander Graf >>>>>>>> >>>>>>>> --- >>>>>>>> >>>>>>>> v1 -> v2: >>>>>>>> >>>>>>>> - Fix commit message >>>>>>>> - Add Fixes and stable tags >>>>>>>> --- >>>>>>>> virt/kvm/arm/mmu.c | 4 ++++ >>>>>>>> 1 file changed, 4 insertions(+) >>>>>>>> >>>>>>>> diff --git a/virt/kvm/arm/mmu.c b/virt/kvm/arm/mmu.c >>>>>>>> index f2d5b6c..227931f 100644 >>>>>>>> --- a/virt/kvm/arm/mmu.c >>>>>>>> +++ b/virt/kvm/arm/mmu.c >>>>>>>> @@ -861,6 +861,10 @@ static pud_t *stage2_get_pud(struct kvm *kvm, struct kvm_mmu_memory_cache *cache >>>>>>>> pgd_t *pgd; >>>>>>>> pud_t *pud; >>>>>>>> + /* Do we clash with kvm_free_stage2_pgd()? */ >>>>>>>> + if (!kvm->arch.pgd) >>>>>>>> + return NULL; >>>>>>>> + >>>>>>> >>>>>>> I think this check should be moved up in the chain. We call kvm_age_hva(), with >>>>>>> the kvm->mmu_lock held and we don't release it till we reach here. So, ideally, >>>>>>> if we find the PGD is null when we reach kvm_age_hva(), we could simply return >>>>>>> there, like we do for other call backs from the KVM mmu_notifier. >>>>>> >>>>>> That probably works too - I'm not sure which version is more >>>>>> consistent as well as more maintainable in the long run. I'll leave >>>>>> the call here to Christoffer. >>>>>> >>>>> >>>>> Let's look at the callers to stage2_get_pmd, which is the only caller of >>>>> stage2_get_pud, where the problem was observed: >>>>> >>>>> user_mem_abort >>>>> -> stage2_set_pmd_huge >>>>> -> stage2_get_pmd >>>>> >>>>> user_mem_abort >>>>> -> stage2_set_pte >>>>> -> stage2_get_pmd >>>>> >>>>> handle_access_fault >>>>> -> stage2_get_pmd >>>>> >>>>> For the above three functions, pgd cannot ever be NULL, because this is >>>>> running in the context of a VCPU thread, which means the reference on >>>>> the VM fd must not reach zero, so no need to call that here. >>>> >>>> I think there is some problem here. See below for more information. >>>> >>>>> >>>>> kvm_set_spte_handler >>>>> -> stage2_set_pte >>>>> -> stage2_get_pmd >>>>> >>>>> This is called from kvm_set_spte_hva, which is one of the MMU notifiers, >>>>> so it can race similarly kvm_age_hva and kvm_test_age_hva, but it >>>>> already checks for !kvm->arch.pgd. >>>>> >>>>> kvm_phys_addr_ioremap >>>>> -> stage2_set_pte >>>>> -> stage2_get_pmd >>>>> >>>>> This is called from two places: (1) The VGIC code (as part of >>>>> vgic_v2_map_resources) and can only be called in the context of running >>>>> a VCPU, so the pgd cannot be null by virtue of the same argument as for >>>>> user_mem_abort. (2) kvm_arch_prepare_memory_region calls >>>>> kvm_phys_addr_ioremap, which is a VM ioctl so similarly, I cannot see >>>>> how the VM can be in the middle of being freed while handling ioctls on >>>>> the fd. Therefore, following the same argument, this should be safe as >>>>> well. >>>>> >>>>> kvm_age_hva_handler and kvm_test_age_hva_handler >>>>> -> stage2_get_pmd >>>>> >>>>> Handled by the patch proposed by Suzuki. >>>>> >>>>> What does all that tell us? First, it should give us some comfort that we >>>>> don't have more races of this kind. Second, it teels us that there are >>>>> a number of different and not-obvious call paths to stage2_pet_pud, >>>>> which could be an argument to simply check the pgd whenever it's called, >>>>> despite the minor runtime overhead. On the other hand, the check itself >>>>> is only valid knowing that we synchronize against kvm_free_stage2_pgd >>>>> using the kvm->mmu_lock() and understanding that this only happens when >>>>> mmu notifiers call into the KVM MMU code outside the context of the VM. >>>>> >>>>> The last consideration is the winning argument for me to put the check >>>>> in kvm_age_hva_handler and kvm_test_age_hva_handler, but I think it's >>>>> important that we document why it's only these three high-level callers >>>>> (incl. kvm_set_spte_handler) that need the check, either in the code or >>>>> in the commit message. >>>> >>>> The only way we end up freeing the stage-2 PGD is via the mmu_notifier_release(), >>>> which could be triggered via two different paths. >>>> >>>> 1) kvm_destroy_vm(), where all the VM resources has been released and the >>>> refcount on the KVM instances are dropped, via kvm_put_kvm(). >>>> >>>> kvm_put_kvm() >>>> kvm_destroy_vm() >>>> mmu_notifier_unregsiter >>>> mmu_notifier_ops->release() >>>> kvm_arch_flush_shadow_all >>>> kvm_free_stage2_pgd -> free the page table with the mmu_lock held >>>> occasionally releasing it to avoid contention. >>>> or >>>> >>>> 2) do_signal -> get_signal -> do_group_exit - > >>>> do_exit >>>> exit_mm >>>> mmput => __mmput >>>> exit_mmap >>>> mmu_notifier_release >>>> mmu_notifier_ops->release >>>> kvm_arch_flush_shadow_all >>>> kvm_free_stage2_pgd >>>> >>>> In the first case, all references to the VM are dropped and hence none of the >>>> VCPU could still be executing. >>>> >>>> However, in the second case it may not be. So we have a potential problem with >>>> the VCPU trying to run even when the pages were unmapped. I think the root cause >>>> of all these issues boils down to the assumption that KVM holds a reference to >>>> MM (which is not necessarily the user space mapping. i.e mmgrab vs mmget). >>>> I am not sure if the VCPU should hold a reference to the mmaps to make sure >>>> it is safe to run. That way, the mmap stays until the VCPU eventually exits >>>> and we are safe all the way around. >>> >>> Hmmm, my assumption is that if a VCPU is running, it means there is a >>> VCPU thread that shares the struct mm which is running, so I don't >>> understand how mmput would be able to call exit_mmap in the scenario >>> above? >>> >>> So the distinction here is that I don't assume that the VCPU fd holds a >>> reference to the mm, but I assume that the (running) VCPU thread does. >>> Is this incorrect? >> >> Sorry, I lost this thread in between. > > No worries. > >> >> Hmm. You're right. The VCPU should have a refcount on mmap and it shouldn't >> do anything with the mmu if it has dropped it. I was confused based on an >> old bug report,[ See the description of commit 293f293637b55d "kvm-arm: Unmap shadow pagetables >> properly"], which was fixed. >> > This keeps haunting me :(. I happened to look at another report from Alex back in April, where a VCPU ends up in an mmu_notifier call back after the PGD was free'd ! So it does look like you could end up in freeing the stage2 page table and yet another VCPU could be running. See : https://lkml.org/lkml/2017/4/21/820 I am trying to see if I can reproduce the hypothesis. Suzuki > OK, that makes sense to me. I hope Andrea also agrees with this, so as > long as VCPU thread is running, exit_mm() will not be called. > > Thanks, > -Christoffer >