Received: by 10.223.176.5 with SMTP id f5csp1758298wra; Thu, 8 Feb 2018 03:02:15 -0800 (PST) X-Google-Smtp-Source: AH8x2273o1WRVwIwm/Yi7dkqVCgO1rsVOIR2EgIaynyp1MlBQOa/TC86h/a3QuXozGSdF8yGb/SO X-Received: by 2002:a17:902:34e:: with SMTP id 72-v6mr307647pld.126.1518087735338; Thu, 08 Feb 2018 03:02:15 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1518087735; cv=none; d=google.com; s=arc-20160816; b=AskAH1wGdovSIFuOkqK82pjb+gwiI+ZQWgKQV6AXsB90/7Zwd/0T7oF25hCwmma8zw Zt60b/ePe9wh8eOohItlehlkIBA1aV9vEvu55zXtkCnsF008ehg/NJC5YvC5jl+BjUOL QIqGsuzBSn80FIELuW8yl4KFhDsFMWc/8LT4DFZ9oEWvXO4xtfhlNKkTiGwOpyFxYLHM KMxY3H/Mt++TCiG/59zB+DA+9QJbTruyHtMjtTsPVBMs+CMGOb3QWze963MghgS2KbSF RNRjaYl9HmLgRrGmXselSFeR1dT8Oz7daV/zFWJURrzPqpVVskN7tE//xF7qbwryf0ch XdRw== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:sender:user-agent:in-reply-to :content-disposition:mime-version:references:message-id:subject:cc :to:from:date:dkim-signature:arc-authentication-results; bh=JDYs/kq+bAseoIHHU1semlaBFxwOR2pV+seyk5tYu4k=; b=kf8xTLx1TQWOpRdwwzGQikhczNHL3N20/4tC2bLRHs26xyGL9URHslBz+Xlw/UkIvk 0n5DTTAWaps0zREUL+pVUBEKFMYGgN1fqhwGyMQOOkMZftGRm7hnoVpmHVjOUeficg6e EdXbaGAIDuaLa8yJTwNbn8WkukEVOsDyVhyRcN74i3oes3Rl2LZ9l7v09RZGa/D+7bhl 18cH4y7ar2dzbrwAWPwiSvmYTffUjhqqkf+aqpcFGA44pU0MUV1uxL7Sq3S0zzNZJL2/ KdLDJe95HTTXea8C4/554W4HNaWTR+9MHpckAAaM9RFnBTELJB8PW9lqILhkX+Kwj8Lm CvSA== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@linaro.org header.s=google header.b=WJmZNTnM; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=linaro.org Return-Path: Received: from vger.kernel.org (vger.kernel.org. [209.132.180.67]) by mx.google.com with ESMTP id y184si463221pgd.291.2018.02.08.03.02.01; Thu, 08 Feb 2018 03:02:15 -0800 (PST) Received-SPF: pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) client-ip=209.132.180.67; Authentication-Results: mx.google.com; dkim=pass header.i=@linaro.org header.s=google header.b=WJmZNTnM; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=linaro.org Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752387AbeBHLBG (ORCPT + 99 others); Thu, 8 Feb 2018 06:01:06 -0500 Received: from mail-wm0-f66.google.com ([74.125.82.66]:38919 "EHLO mail-wm0-f66.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752335AbeBHLBD (ORCPT ); Thu, 8 Feb 2018 06:01:03 -0500 Received: by mail-wm0-f66.google.com with SMTP id b21so9102006wme.4 for ; Thu, 08 Feb 2018 03:01:02 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linaro.org; s=google; h=date:from:to:cc:subject:message-id:references:mime-version :content-disposition:in-reply-to:user-agent; bh=JDYs/kq+bAseoIHHU1semlaBFxwOR2pV+seyk5tYu4k=; b=WJmZNTnM/XWZxZqyCOBjmaEY9JptWVdHwG1wjh5Gr+cfqwqWCq4VKb9C/rQGJ5KJ53 QJvhaLkAc3oHMhtl8GSLwU29Qo05Rmk3Exy3ddP5uV7Kw7JkRyj6MdydQ3MEIIP5Smtz quZ3/ZCG4CXdPwMirnQJBRiQdfNx4MqSPgpd4= X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:date:from:to:cc:subject:message-id:references :mime-version:content-disposition:in-reply-to:user-agent; bh=JDYs/kq+bAseoIHHU1semlaBFxwOR2pV+seyk5tYu4k=; b=uFXH0UviI/3VC+qTJbCC9cJMPab8IqIT6XH/Fty61/aiIQ9+NywjOFFOlFFPW3F3uE i7QRG6JyrDYWCBahY6FaiGoNfNx7pyBi4dn2vRkvI593T3hReip/SXh+Gi1Q+37SC2v2 vuTFcDplUfe+3YoQOwbQch2hebVJyecVfSBvIjWgNuH6ntn2MD38xQFU7QX+4JMDYqoy VnNurfzMSxTnOgr7l/SBEiT+UN4oIjiksGAz0qZW8Wo4S4+Yq+DH+MOi+sj2GfNnpuml UdI0OnLGRf6lMbvRUdZEQO3SVgLmR9Xw6Vg2DsLFxdl+1hXTdvuw1p55x3ThJUpA5N/I UD1Q== X-Gm-Message-State: APf1xPBGImAmaU8xyxOZZ8YVafywarqSemVVM+9/RH6um8OfC9TSqlma FIQpDr+5Q74WGXqzyeCFIKwfCQ== X-Received: by 10.80.150.193 with SMTP id z1mr953124eda.175.1518087661523; Thu, 08 Feb 2018 03:01:01 -0800 (PST) Received: from localhost (x50d2404e.cust.hiper.dk. [80.210.64.78]) by smtp.gmail.com with ESMTPSA id y3sm2608803edb.1.2018.02.08.03.01.00 (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Thu, 08 Feb 2018 03:01:00 -0800 (PST) Date: Thu, 8 Feb 2018 12:00:59 +0100 From: Christoffer Dall To: Suzuki K Poulose Cc: linux-arm-kernel@lists.infradead.org, kvm@vger.kernel.org, kvmarm@lists.cs.columbia.edu, marc.zyngier@arm.com, linux-kernel@vger.kernel.org, kristina.martsenko@arm.com, peter.maydell@linaro.org, pbonzini@redhat.com, rkrcmar@redhat.com, will.deacon@arm.com, ard.biesheuvel@linaro.org, mark.rutland@arm.com, catalin.marinas@arm.com, Christoffer Dall Subject: Re: [PATCH v1 08/16] kvm: arm/arm64: Clean up stage2 pgd life time Message-ID: <20180208110059.GK29286@cbox> References: <20180109190414.4017-1-suzuki.poulose@arm.com> <20180109190414.4017-9-suzuki.poulose@arm.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20180109190414.4017-9-suzuki.poulose@arm.com> User-Agent: Mutt/1.5.24 (2015-08-30) Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue, Jan 09, 2018 at 07:04:03PM +0000, Suzuki K Poulose wrote: > On arm/arm64 we pre-allocate the entry level page tables when > a VM is created and is free'd when either all the mm users are > gone or the KVM is about to get destroyed. i.e, kvm_free_stage2_pgd > is triggered via kvm_arch_flush_shadow_all() which can be invoked > from two different paths : > > 1) do_exit()-> .-> mmu_notifier->release()-> ..-> kvm_arch_flush_shadow_all() > OR > 2) kvm_destroy_vm()-> mmu_notifier_unregister-> kvm_arch_flush_shadow_all() > > This has created lot of race conditions in the past as some of > the VCPUs could be active when we free the stage2 via path (1). How?? mmu_notifier->release() is called via __mput->exit_mmap(), which is only called if mm_users == 0, which means there are no more threads left than the one currently doing exit(). > > On a closer look, all we need to do with kvm_arch_flush_shadow_all() is, > to ensure that the stage2 mappings are cleared. This doesn't mean we > have to free up the stage2 entry level page tables yet, which could > be delayed until the kvm is destroyed. This would avoid issues > of use-after-free, do we have any of those left? > This will be later used for delaying > the allocation of the stage2 entry level page tables until we really > need to do something with it. Fine, but you don't actually explain why this change of flow is necessary for what you're trying to do later? > > Cc: Marc Zyngier > Cc: Christoffer Dall > Signed-off-by: Suzuki K Poulose > --- > virt/kvm/arm/arm.c | 1 + > virt/kvm/arm/mmu.c | 56 ++++++++++++++++++++++++++++-------------------------- > 2 files changed, 30 insertions(+), 27 deletions(-) > > diff --git a/virt/kvm/arm/arm.c b/virt/kvm/arm/arm.c > index c8d49879307f..19b720ddedce 100644 > --- a/virt/kvm/arm/arm.c > +++ b/virt/kvm/arm/arm.c > @@ -189,6 +189,7 @@ void kvm_arch_destroy_vm(struct kvm *kvm) > } > } > atomic_set(&kvm->online_vcpus, 0); > + kvm_free_stage2_pgd(kvm); > } > > int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext) > diff --git a/virt/kvm/arm/mmu.c b/virt/kvm/arm/mmu.c > index 78253fe00fc4..c94c61ac38b9 100644 > --- a/virt/kvm/arm/mmu.c > +++ b/virt/kvm/arm/mmu.c > @@ -298,11 +298,10 @@ static void unmap_stage2_range(struct kvm *kvm, phys_addr_t start, u64 size) > pgd = kvm->arch.pgd + stage2_pgd_index(addr); > do { > /* > - * Make sure the page table is still active, as another thread > - * could have possibly freed the page table, while we released > - * the lock. > + * The page table shouldn't be free'd as we still hold a reference > + * to the KVM. To avoid confusion about references to the kernel module KVM and a specific KVM VM instance, please s/KVM/VM/. > */ > - if (!READ_ONCE(kvm->arch.pgd)) > + if (WARN_ON(!READ_ONCE(kvm->arch.pgd))) This reads a lot like a defensive implementation now, and I think for this patch to make sense, we shouldn't try to handle a buggy super-racy implementation gracefully, but rather have VM_BUG_ON() (if we care about performance of the check) or simply BUG_ON(). The rationale being that if we've gotten this flow incorrect and freed the pgd at the wrong time, we don't want to leave a ticking bomb to blow up somewhere else randomly (which it will!), but instead crash and burn. > break; > next = stage2_pgd_addr_end(addr, end); > if (!stage2_pgd_none(*pgd)) > @@ -837,30 +836,33 @@ void stage2_unmap_vm(struct kvm *kvm) > up_read(¤t->mm->mmap_sem); > srcu_read_unlock(&kvm->srcu, idx); > } > - > -/** > - * kvm_free_stage2_pgd - free all stage-2 tables > - * @kvm: The KVM struct pointer for the VM. > - * > - * Walks the level-1 page table pointed to by kvm->arch.pgd and frees all > - * underlying level-2 and level-3 tables before freeing the actual level-1 table > - * and setting the struct pointer to NULL. > +/* > + * kvm_flush_stage2_all: Unmap the entire stage2 mappings including > + * device and regular RAM backing memory. > */ > -void kvm_free_stage2_pgd(struct kvm *kvm) > +static void kvm_flush_stage2_all(struct kvm *kvm) > { > - void *pgd = NULL; > - > spin_lock(&kvm->mmu_lock); > - if (kvm->arch.pgd) { > + if (kvm->arch.pgd) > unmap_stage2_range(kvm, 0, KVM_PHYS_SIZE); > - pgd = READ_ONCE(kvm->arch.pgd); > - kvm->arch.pgd = NULL; > - } > spin_unlock(&kvm->mmu_lock); > +} > > - /* Free the HW pgd, one page at a time */ > - if (pgd) > - free_pages_exact(pgd, S2_PGD_SIZE); > +/** > + * kvm_free_stage2_pgd - Free the entry level page tables in stage-2. nit: you should put the parameter description here and leave a blank line before the lengthy discussion. > + * This is called when all reference to the KVM has gone away and we > + * really don't need any protection in resetting the PGD. This also I don't think I understand the last part of this sentence. This function is pretty self-explanatory really, and I think we can either drop the documentation all together or simply say that this function clears all stage 2 page table entries to release the memory of the lower-level page tables themselves and then frees the pgd in the end. The VM is known to go away and no more VCPUs exist at this point. > + * means that nobody should be touching stage2 at this point, as we > + * have unmapped the entire stage2 already and all dynamic entities, > + * (VCPUs and devices) are no longer active. > + * > + * @kvm: The KVM struct pointer for the VM. > + */ > +void kvm_free_stage2_pgd(struct kvm *kvm) > +{ > + kvm_flush_stage2_all(kvm); > + free_pages_exact(kvm->arch.pgd, S2_PGD_SIZE); > + kvm->arch.pgd = NULL; > } > > static pud_t *stage2_get_pud(struct kvm *kvm, struct kvm_mmu_memory_cache *cache, > @@ -1189,12 +1191,12 @@ static void stage2_wp_range(struct kvm *kvm, phys_addr_t addr, phys_addr_t end) > * large. Otherwise, we may see kernel panics with > * CONFIG_DETECT_HUNG_TASK, CONFIG_LOCKUP_DETECTOR, > * CONFIG_LOCKDEP. Additionally, holding the lock too long > - * will also starve other vCPUs. We have to also make sure > - * that the page tables are not freed while we released > - * the lock. > + * will also starve other vCPUs. > + * The page tables shouldn't be free'd while we released the s/shouldn't/can't/ > + * lock, since we hold a reference to the KVM. s/KVM/VM/ > */ > cond_resched_lock(&kvm->mmu_lock); > - if (!READ_ONCE(kvm->arch.pgd)) > + if (WARN_ON(!READ_ONCE(kvm->arch.pgd))) > break; > next = stage2_pgd_addr_end(addr, end); > if (stage2_pgd_present(*pgd)) > @@ -1950,7 +1952,7 @@ void kvm_arch_memslots_updated(struct kvm *kvm, struct kvm_memslots *slots) > > void kvm_arch_flush_shadow_all(struct kvm *kvm) > { > - kvm_free_stage2_pgd(kvm); > + kvm_flush_stage2_all(kvm); > } > > void kvm_arch_flush_shadow_memslot(struct kvm *kvm, > -- > 2.13.6 > Thanks, -Christoffer