Subject: Re: [PATCH v1 08/16] kvm: arm/arm64: Clean up stage2 pgd life time
To: Christoffer Dall
Cc: linux-arm-kernel@lists.infradead.org, kvm@vger.kernel.org,
 kvmarm@lists.cs.columbia.edu, marc.zyngier@arm.com,
 linux-kernel@vger.kernel.org, kristina.martsenko@arm.com,
 peter.maydell@linaro.org, pbonzini@redhat.com, rkrcmar@redhat.com,
 will.deacon@arm.com, ard.biesheuvel@linaro.org, mark.rutland@arm.com,
 catalin.marinas@arm.com, Christoffer Dall
References: <20180109190414.4017-1-suzuki.poulose@arm.com>
 <20180109190414.4017-9-suzuki.poulose@arm.com>
 <20180208110059.GK29286@cbox>
From: Suzuki K Poulose
Message-ID: <4b31c124-ccb2-b794-c2b2-350975b8a1cc@arm.com>
Date: Thu, 8 Feb 2018 17:19:22 +0000
In-Reply-To: <20180208110059.GK29286@cbox>

On 08/02/18 11:00, Christoffer Dall wrote:
> On Tue, Jan 09, 2018 at 07:04:03PM +0000, Suzuki K Poulose wrote:
>> On arm/arm64 we pre-allocate the entry level page tables when
>> a VM is created and is free'd when either all the mm users are
>> gone or the KVM is about to get destroyed. i.e, kvm_free_stage2_pgd
>> is triggered via kvm_arch_flush_shadow_all() which can be invoked
>> from two different paths :
>>
>>  1) do_exit()-> .-> mmu_notifier->release()-> ..-> kvm_arch_flush_shadow_all()
>>     OR
>>  2) kvm_destroy_vm()-> mmu_notifier_unregister-> kvm_arch_flush_shadow_all()
>>
>> This has created lot of race conditions in the past as some of
>> the VCPUs could be active when we free the stage2 via path (1).
>
> How?? mmu_notifier->release() is called via __mput->exit_mmap(), which
> is only called if mm_users == 0, which means there are no more threads
> left than the one currently doing exit().

IIRC, if the process is sent a fatal signal, that could cause all the
threads to exit, leaving the "last" thread to do the clean up. The files
could still be open, implying that the KVM fds are still active, without
a stage2, even though we are not going to run anything.

(The race was fixed by moving the stage2 teardown to
mmu_notifier->release()).

>
>>
>> On a closer look, all we need to do with kvm_arch_flush_shadow_all() is,
>> to ensure that the stage2 mappings are cleared. This doesn't mean we
>> have to free up the stage2 entry level page tables yet, which could
>> be delayed until the kvm is destroyed. This would avoid issues
>> of use-after-free,
>
> do we have any of those left?

None that we know of.

>
>> This will be later used for delaying
>> the allocation of the stage2 entry level page tables until we really
>> need to do something with it.
>
> Fine, but you don't actually explain why this change of flow is
> necessary for what you're trying to do later?

This patch is not mandatory for the series. But, since we are delaying
the "allocation" of the stage2 tables anyway later, I thought it would
be good to clean up the "free" path as well (see the rough sketch of the
resulting split further below).

>>  int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
>> diff --git a/virt/kvm/arm/mmu.c b/virt/kvm/arm/mmu.c
>> index 78253fe00fc4..c94c61ac38b9 100644
>> --- a/virt/kvm/arm/mmu.c
>> +++ b/virt/kvm/arm/mmu.c
>> @@ -298,11 +298,10 @@ static void unmap_stage2_range(struct kvm *kvm, phys_addr_t start, u64 size)
>>      pgd = kvm->arch.pgd + stage2_pgd_index(addr);
>>      do {
>>          /*
>> -         * Make sure the page table is still active, as another thread
>> -         * could have possibly freed the page table, while we released
>> -         * the lock.
>> +         * The page table shouldn't be free'd as we still hold a reference
>> +         * to the KVM.
>
> To avoid confusion about references to the kernel module KVM and a
> specific KVM VM instance, please s/KVM/VM/.

ok.
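
(Purely as an illustration of the flush/free split discussed above, not a
hunk from this patch: the mmu_notifier path only unmaps, while the pgd
itself is freed when the VM itself goes away. The rewiring of
kvm_arch_flush_shadow_all() shown below is assumed, since that part of the
diff is not quoted in this mail.)

/*
 * Teardown paths after the split (sketch):
 *
 *  1) do_exit() -> ... -> mmu_notifier->release()
 *        -> kvm_arch_flush_shadow_all()   : unmap only, the pgd stays
 *  2) kvm_destroy_vm() -> VM destruction
 *        -> kvm_free_stage2_pgd()         : unmap and free the pgd
 */
void kvm_arch_flush_shadow_all(struct kvm *kvm)
{
    kvm_flush_stage2_all(kvm);
}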
>
>>          */
>> -        if (!READ_ONCE(kvm->arch.pgd))
>> +        if (WARN_ON(!READ_ONCE(kvm->arch.pgd)))
>
> This reads a lot like a defensive implementation now, and I think for
> this patch to make sense, we shouldn't try to handle a buggy super-racy
> implementation gracefully, but rather have VM_BUG_ON() (if we care about
> performance of the check) or simply BUG_ON().
>
> The rationale being that if we've gotten this flow incorrect and freed
> the pgd at the wrong time, we don't want to leave a ticking bomb to blow
> up somewhere else randomly (which it will!), but instead crash and burn.

Ok.

>
>>              break;
>>          next = stage2_pgd_addr_end(addr, end);
>>          if (!stage2_pgd_none(*pgd))
>> @@ -837,30 +836,33 @@ void stage2_unmap_vm(struct kvm *kvm)
>>      up_read(&current->mm->mmap_sem);
>>      srcu_read_unlock(&kvm->srcu, idx);
>>  }
>> -
>> -/**
>> - * kvm_free_stage2_pgd - free all stage-2 tables
>> - * @kvm:    The KVM struct pointer for the VM.
>> - *
>> - * Walks the level-1 page table pointed to by kvm->arch.pgd and frees all
>> - * underlying level-2 and level-3 tables before freeing the actual level-1 table
>> - * and setting the struct pointer to NULL.
>> +/*
>> + * kvm_flush_stage2_all: Unmap the entire stage2 mappings including
>> + * device and regular RAM backing memory.
>>   */
>> -void kvm_free_stage2_pgd(struct kvm *kvm)
>> +static void kvm_flush_stage2_all(struct kvm *kvm)
>>  {
>> -    void *pgd = NULL;
>> -
>>      spin_lock(&kvm->mmu_lock);
>> -    if (kvm->arch.pgd) {
>> +    if (kvm->arch.pgd)
>>          unmap_stage2_range(kvm, 0, KVM_PHYS_SIZE);
>> -        pgd = READ_ONCE(kvm->arch.pgd);
>> -        kvm->arch.pgd = NULL;
>> -    }
>>      spin_unlock(&kvm->mmu_lock);
>> +}
>>
>> -    /* Free the HW pgd, one page at a time */
>> -    if (pgd)
>> -        free_pages_exact(pgd, S2_PGD_SIZE);
>> +/**
>> + * kvm_free_stage2_pgd - Free the entry level page tables in stage-2.
>
> nit: you should put the parameter description here and leave a blank
> line before the lengthy discussion.
>
>> + * This is called when all reference to the KVM has gone away and we
>> + * really don't need any protection in resetting the PGD. This also
>
> I don't think I understand the last part of this sentence.

i.e, we don't have to use the spin_lock to reset the PGD.

> This function is pretty self-explanatory really, and I think we can
> either drop the documentation all together or simply say that this
> function clears all stage 2 page table entries to release the memory of
> the lower-level page tables themselves and then frees the pgd in the
> end. The VM is known to go away and no more VCPUs exist at this point.
>

Ok.

>> + * means that nobody should be touching stage2 at this point, as we
>> + * have unmapped the entire stage2 already and all dynamic entities,
>> + * (VCPUs and devices) are no longer active.
>> + *
>> + * @kvm:    The KVM struct pointer for the VM.
>> + */
>> +void kvm_free_stage2_pgd(struct kvm *kvm)
>> +{
>> +    kvm_flush_stage2_all(kvm);
>> +    free_pages_exact(kvm->arch.pgd, S2_PGD_SIZE);
>> +    kvm->arch.pgd = NULL;
>> +}
>>
>>  static pud_t *stage2_get_pud(struct kvm *kvm, struct kvm_mmu_memory_cache *cache,
>> @@ -1189,12 +1191,12 @@ static void stage2_wp_range(struct kvm *kvm, phys_addr_t addr, phys_addr_t end)
>>           * large. Otherwise, we may see kernel panics with
>>           * CONFIG_DETECT_HUNG_TASK, CONFIG_LOCKUP_DETECTOR,
>>           * CONFIG_LOCKDEP. Additionally, holding the lock too long
>> -         * will also starve other vCPUs. We have to also make sure
>> -         * that the page tables are not freed while we released
>> -         * the lock.
>> +         * will also starve other vCPUs.
>> +         * The page tables shouldn't be free'd while we released the
>
> s/shouldn't/can't/
>

ok

>> +         * lock, since we hold a reference to the KVM.
>
> s/KVM/VM/
>

ok.

Thanks
Suzuki
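
(For reference, a sketch of how the commented hunks could read once the
feedback agreed above is folded in. VM_BUG_ON() is picked arbitrarily from
the two options suggested, lines not quoted in this mail are reproduced
from memory and may differ slightly, and the comment wording is only
illustrative.)

In unmap_stage2_range():

    pgd = kvm->arch.pgd + stage2_pgd_index(addr);
    do {
        /*
         * The page table can't be freed while we released the
         * lock, since we hold a reference to the VM.
         */
        VM_BUG_ON(!READ_ONCE(kvm->arch.pgd));
        next = stage2_pgd_addr_end(addr, end);
        if (!stage2_pgd_none(*pgd))
            unmap_stage2_puds(kvm, pgd, addr, next);
        /* ... rest of the loop body unchanged ... */
    } while (pgd++, addr = next, addr != end);

And kvm_free_stage2_pgd(), with the shorter kerneldoc suggested:

/**
 * kvm_free_stage2_pgd - free the stage-2 entry level page table
 * @kvm:    The KVM struct pointer for the VM.
 *
 * Clears all stage-2 entries to release the memory of the lower-level
 * page tables, then frees the pgd. The VM is going away and no VCPUs
 * exist at this point.
 */
void kvm_free_stage2_pgd(struct kvm *kvm)
{
    kvm_flush_stage2_all(kvm);
    free_pages_exact(kvm->arch.pgd, S2_PGD_SIZE);
    kvm->arch.pgd = NULL;
}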