2017-04-24 10:12:21

by Suzuki K Poulose

Subject: [PATCH 0/2] kvm: Fixes for race conditions


The two patches here fix race conditions in the KVM hypervisor code
dealing with the shadow MMU.

The first one applies to the core KVM code, where mmu_notifier->ops.release()
could be called twice, with one instance possibly accessing a freed KVM
instance. Reported here:

http://lkml.kernel.org/r/CAAeHK+x8udHKq9xa1zkTO6ax5E8Dk32HYWfaT05FMchL2cr48g@mail.gmail.com

The second patch is specific to the arm/arm64 stage2 PGD, where the PGD pointer
is modified outside the mmu_lock, leading to crashes.
Reported here:
http://lkml.kernel.org/r/[email protected]


Suzuki K Poulose (2):
kvm: Fix mmu_notifier release race
kvm: arm/arm64: Fix race in resetting stage2 PGD

arch/arm/kvm/mmu.c | 14 +++++++-----
include/linux/kvm_host.h | 1 +
virt/kvm/kvm_main.c | 59 ++++++++++++++++++++++++++++++++++++++++++------
3 files changed, 61 insertions(+), 13 deletions(-)

--
2.7.4


2017-04-24 10:12:43

by Suzuki K Poulose

Subject: [PATCH 1/2] kvm: Fix mmu_notifier release race

KVM uses mmu_notifier (wherever available) to keep track
of changes to the mm of the guest. The guest shadow page
tables are released when the VM exits, via mmu_notifier->ops.release().
There is a rare chance that mmu_notifier->release could be
called more than once via two different paths, which could end
up in a use-after-free of the kvm instance (such as [0]).

e.g:

thread A                                  thread B
--------                                  --------

get_signal->                              kvm_destroy_vm()->
 do_exit->                                 mmu_notifier_unregister->
  exit_mm->                                 kvm_arch_flush_shadow_all()->
   exit_mmap->                               spin_lock(&kvm->mmu_lock)
    mmu_notifier_release->                    ....
     kvm_arch_flush_shadow_all()->            ....
      ...                                     ....
      spin_lock(&kvm->mmu_lock)               ....
                                              spin_unlock(&kvm->mmu_lock)
                                              kvm_arch_free_kvm()
      *** use after free of kvm ***

This patch attempts to solve the problem by holding a reference to the kvm
instance on behalf of the mmu_notifier, which is dropped only from
notifier->ops.release(). This ensures that the kvm struct stays around until
we reach kvm_mmu_notifier_release(), and that kvm_destroy_vm() is called only
from/after it. So we can unregister the notifier with the no_release option
and thus avoid the race above. However, we need to make sure that the kvm
instance is freed only after the mmu_notifier core has finished processing the
release(), because of the following possible path of execution:

mmu_notifier_release -> kvm_mmu_notifier_release -> kvm_put_kvm ->
kvm_destroy_vm -> kvm_arch_free_kvm

[0] http://lkml.kernel.org/r/CAAeHK+x8udHKq9xa1zkTO6ax5E8Dk32HYWfaT05FMchL2cr48g@mail.gmail.com

Fixes: commit 85db06e514422 ("KVM: mmu_notifiers release method")
Reported-by: [email protected]
Cc: Mark Rutland <[email protected]>
Cc: Paolo Bonzini <[email protected]>
Cc: Radim Krčmář <[email protected]>
Cc: Marc Zyngier <[email protected]>
Cc: Christoffer Dall <[email protected]>
Cc: [email protected]
Tested-by: Mark Rutland <[email protected]>
Signed-off-by: Suzuki K Poulose <[email protected]>
---
include/linux/kvm_host.h | 1 +
virt/kvm/kvm_main.c | 59 ++++++++++++++++++++++++++++++++++++++++++------
2 files changed, 53 insertions(+), 7 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index d025074..561e968 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -424,6 +424,7 @@ struct kvm {
struct mmu_notifier mmu_notifier;
unsigned long mmu_notifier_seq;
long mmu_notifier_count;
+ struct rcu_head mmu_notifier_rcu;
#endif
long tlbs_dirty;
struct list_head devices;
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 88257b3..2c3fdd4 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -471,6 +471,7 @@ static void kvm_mmu_notifier_release(struct mmu_notifier *mn,
idx = srcu_read_lock(&kvm->srcu);
kvm_arch_flush_shadow_all(kvm);
srcu_read_unlock(&kvm->srcu, idx);
+ kvm_put_kvm(kvm);
}

static const struct mmu_notifier_ops kvm_mmu_notifier_ops = {
@@ -486,8 +487,46 @@ static const struct mmu_notifier_ops kvm_mmu_notifier_ops = {

static int kvm_init_mmu_notifier(struct kvm *kvm)
{
+ int rc;
kvm->mmu_notifier.ops = &kvm_mmu_notifier_ops;
- return mmu_notifier_register(&kvm->mmu_notifier, current->mm);
+ rc = mmu_notifier_register(&kvm->mmu_notifier, current->mm);
+ /*
+ * We hold a reference to KVM here to make sure that the KVM
+ * doesn't get free'd before ops->release() completes.
+ */
+ if (!rc)
+ kvm_get_kvm(kvm);
+ return rc;
+}
+
+static void kvm_free_vm_rcu(struct rcu_head *rcu)
+{
+ struct kvm *kvm = container_of(rcu, struct kvm, mmu_notifier_rcu);
+ kvm_arch_free_vm(kvm);
+}
+
+static void kvm_flush_shadow_mmu(struct kvm *kvm)
+{
+ /*
+ * We hold a reference to kvm instance for mmu_notifier and is
+ * only released when ops->release() is called via exit_mmap path.
+ * So, when we reach here ops->release() has been called already, which
+ * flushes the shadow page tables. Hence there is no need to call the
+ * release() again when we unregister the notifier. However, we need
+ * to delay freeing up the kvm until the release() completes, since
+ * we could reach here via :
+ * kvm_mmu_notifier_release() -> kvm_put_kvm() -> kvm_destroy_vm()
+ */
+ mmu_notifier_unregister_no_release(&kvm->mmu_notifier, kvm->mm);
+}
+
+static void kvm_free_vm(struct kvm *kvm)
+{
+ /*
+ * Wait until the mmu_notifier has finished the release().
+ * See comments above in kvm_flush_shadow_mmu.
+ */
+ mmu_notifier_call_srcu(&kvm->mmu_notifier_rcu, kvm_free_vm_rcu);
}

#else /* !(CONFIG_MMU_NOTIFIER && KVM_ARCH_WANT_MMU_NOTIFIER) */
@@ -497,6 +536,16 @@ static int kvm_init_mmu_notifier(struct kvm *kvm)
return 0;
}

+static void kvm_flush_shadow_mmu(struct kvm *kvm)
+{
+ kvm_arch_flush_shadow_all(kvm);
+}
+
+static void kvm_free_vm(struct kvm *kvm)
+{
+ kvm_arch_free_vm(kvm);
+}
+
#endif /* CONFIG_MMU_NOTIFIER && KVM_ARCH_WANT_MMU_NOTIFIER */

static struct kvm_memslots *kvm_alloc_memslots(void)
@@ -733,18 +782,14 @@ static void kvm_destroy_vm(struct kvm *kvm)
kvm->buses[i] = NULL;
}
kvm_coalesced_mmio_free(kvm);
-#if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER)
- mmu_notifier_unregister(&kvm->mmu_notifier, kvm->mm);
-#else
- kvm_arch_flush_shadow_all(kvm);
-#endif
+ kvm_flush_shadow_mmu(kvm);
kvm_arch_destroy_vm(kvm);
kvm_destroy_devices(kvm);
for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++)
kvm_free_memslots(kvm, kvm->memslots[i]);
cleanup_srcu_struct(&kvm->irq_srcu);
cleanup_srcu_struct(&kvm->srcu);
- kvm_arch_free_vm(kvm);
+ kvm_free_vm(kvm);
preempt_notifier_dec();
hardware_disable_all();
mmdrop(mm);
--
2.7.4

2017-04-24 10:12:31

by Suzuki K Poulose

Subject: [PATCH 2/2] kvm: arm/arm64: Fix race in resetting stage2 PGD

In kvm_free_stage2_pgd() we check the stage2 PGD before taking the lock,
and proceed only if it is valid. We then unmap the page tables and release
the lock, but reset the PGD only after dropping the lock. This could cause
a race where another thread waiting on the lock sees that the PGD is still
valid and proceeds to perform a stage2 operation on it.

This patch moves the stage2 PGD manipulation under the lock.

Reported-by: Alexander Graf <[email protected]>
Cc: Christoffer Dall <[email protected]>
Cc: Marc Zyngier <[email protected]>
Cc: Paolo Bonzini <[email protected]>
Signed-off-by: Suzuki K Poulose <[email protected]>
---
arch/arm/kvm/mmu.c | 14 ++++++++------
1 file changed, 8 insertions(+), 6 deletions(-)

diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c
index 582a972..9c4026d 100644
--- a/arch/arm/kvm/mmu.c
+++ b/arch/arm/kvm/mmu.c
@@ -835,16 +835,18 @@ void stage2_unmap_vm(struct kvm *kvm)
*/
void kvm_free_stage2_pgd(struct kvm *kvm)
{
- if (kvm->arch.pgd == NULL)
- return;
+ void *pgd = NULL;

spin_lock(&kvm->mmu_lock);
- unmap_stage2_range(kvm, 0, KVM_PHYS_SIZE);
+ if (kvm->arch.pgd) {
+ unmap_stage2_range(kvm, 0, KVM_PHYS_SIZE);
+ pgd = kvm->arch.pgd;
+ kvm->arch.pgd = NULL;
+ }
spin_unlock(&kvm->mmu_lock);
-
/* Free the HW pgd, one page at a time */
- free_pages_exact(kvm->arch.pgd, S2_PGD_SIZE);
- kvm->arch.pgd = NULL;
+ if (pgd)
+ free_pages_exact(pgd, S2_PGD_SIZE);
}

static pud_t *stage2_get_pud(struct kvm *kvm, struct kvm_mmu_memory_cache *cache,
--
2.7.4

2017-04-24 12:27:12

by Christoffer Dall

Subject: Re: [PATCH 2/2] kvm: arm/arm64: Fix race in resetting stage2 PGD

On Mon, Apr 24, 2017 at 11:10:24AM +0100, Suzuki K Poulose wrote:
> In kvm_free_stage2_pgd() we check the stage2 PGD before holding
> the lock and proceed to take the lock if it is valid. And we unmap
> the page tables, followed by releasing the lock. We reset the PGD
> only after dropping this lock, which could cause a race condition
> where another thread waiting on the lock could potentially see that
> the PGD is still valid and proceed to perform a stage2 operation.
>
> This patch moves the stage2 PGD manipulation under the lock.
>
> Reported-by: Alexander Graf <[email protected]>
> Cc: Christoffer Dall <[email protected]>
> Cc: Marc Zyngier <[email protected]>
> Cc: Paolo Bonzini <[email protected]>
> Signed-off-by: Suzuki K Poulose <[email protected]>

Reviewed-by: Christoffer Dall <[email protected]>

> ---
> arch/arm/kvm/mmu.c | 14 ++++++++------
> 1 file changed, 8 insertions(+), 6 deletions(-)
>
> diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c
> index 582a972..9c4026d 100644
> --- a/arch/arm/kvm/mmu.c
> +++ b/arch/arm/kvm/mmu.c
> @@ -835,16 +835,18 @@ void stage2_unmap_vm(struct kvm *kvm)
> */
> void kvm_free_stage2_pgd(struct kvm *kvm)
> {
> - if (kvm->arch.pgd == NULL)
> - return;
> + void *pgd = NULL;
>
> spin_lock(&kvm->mmu_lock);
> - unmap_stage2_range(kvm, 0, KVM_PHYS_SIZE);
> + if (kvm->arch.pgd) {
> + unmap_stage2_range(kvm, 0, KVM_PHYS_SIZE);
> + pgd = kvm->arch.pgd;
> + kvm->arch.pgd = NULL;
> + }
> spin_unlock(&kvm->mmu_lock);
> -
> /* Free the HW pgd, one page at a time */
> - free_pages_exact(kvm->arch.pgd, S2_PGD_SIZE);
> - kvm->arch.pgd = NULL;
> + if (pgd)
> + free_pages_exact(pgd, S2_PGD_SIZE);
> }
>
> static pud_t *stage2_get_pud(struct kvm *kvm, struct kvm_mmu_memory_cache *cache,
> --
> 2.7.4
>

2017-04-25 15:37:17

by Christoffer Dall

Subject: Re: [PATCH 1/2] kvm: Fix mmu_notifier release race

On Mon, Apr 24, 2017 at 11:10:23AM +0100, Suzuki K Poulose wrote:
> The KVM uses mmu_notifier (wherever available) to keep track
> of the changes to the mm of the guest. The guest shadow page
> tables are released when the VM exits via mmu_notifier->ops.release().
> There is a rare chance that the mmu_notifier->release could be
> called more than once via two different paths, which could end
> up in use-after-free of kvm instance (such as [0]).
>
> e.g:
>
> thread A thread B
> ------- --------------
>
> get_signal-> kvm_destroy_vm()->
> do_exit-> mmu_notifier_unregister->
> exit_mm-> kvm_arch_flush_shadow_all()->
> exit_mmap-> spin_lock(&kvm->mmu_lock)
> mmu_notifier_release-> ....
> kvm_arch_flush_shadow_all()-> .....
> ... spin_lock(&kvm->mmu_lock) .....
> spin_unlock(&kvm->mmu_lock)
> kvm_arch_free_kvm()
> *** use after free of kvm ***
>
> This patch attempts to solve the problem by holding a reference to the KVM
> for the mmu_notifier, which is dropped only from notifier->ops.release().
> This will ensure that the KVM struct is available till we reach the
> kvm_mmu_notifier_release, and the kvm_destroy_vm is called only from/after
> it. So, we can unregister the notifier with no_release option and hence
> avoiding the race above. However, we need to make sure that the KVM is
> freed only after the mmu_notifier has finished processing the notifier due to
> the following possible path of execution :
>
> mmu_notifier_release -> kvm_mmu_notifier_release -> kvm_put_kvm ->
> kvm_destroy_vm -> kvm_arch_free_kvm
>
> [0] http://lkml.kernel.org/r/CAAeHK+x8udHKq9xa1zkTO6ax5E8Dk32HYWfaT05FMchL2cr48g@mail.gmail.com
>
> Fixes: commit 85db06e514422 ("KVM: mmu_notifiers release method")
> Reported-by: [email protected]
> Cc: Mark Rutland <[email protected]>
> Cc: Paolo Bonzini <[email protected]>
> Cc: Radim Krčmář <[email protected]>
> Cc: Marc Zyngier <[email protected]>
> Cc: Christoffer Dall <[email protected]>
> Cc: [email protected]
> Cc: Marc Zyngier <[email protected]>
> Tested-by: Mark Rutland <[email protected]>
> Signed-off-by: Suzuki K Poulose <[email protected]>

This looks good to me, but we should have some KVM generic experts look
at it as well.

Reviewed-by: Christoffer Dall <[email protected]>

> ---
> include/linux/kvm_host.h | 1 +
> virt/kvm/kvm_main.c | 59 ++++++++++++++++++++++++++++++++++++++++++------
> 2 files changed, 53 insertions(+), 7 deletions(-)
>
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index d025074..561e968 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -424,6 +424,7 @@ struct kvm {
> struct mmu_notifier mmu_notifier;
> unsigned long mmu_notifier_seq;
> long mmu_notifier_count;
> + struct rcu_head mmu_notifier_rcu;
> #endif
> long tlbs_dirty;
> struct list_head devices;
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 88257b3..2c3fdd4 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -471,6 +471,7 @@ static void kvm_mmu_notifier_release(struct mmu_notifier *mn,
> idx = srcu_read_lock(&kvm->srcu);
> kvm_arch_flush_shadow_all(kvm);
> srcu_read_unlock(&kvm->srcu, idx);
> + kvm_put_kvm(kvm);
> }
>
> static const struct mmu_notifier_ops kvm_mmu_notifier_ops = {
> @@ -486,8 +487,46 @@ static const struct mmu_notifier_ops kvm_mmu_notifier_ops = {
>
> static int kvm_init_mmu_notifier(struct kvm *kvm)
> {
> + int rc;
> kvm->mmu_notifier.ops = &kvm_mmu_notifier_ops;
> - return mmu_notifier_register(&kvm->mmu_notifier, current->mm);
> + rc = mmu_notifier_register(&kvm->mmu_notifier, current->mm);
> + /*
> + * We hold a reference to KVM here to make sure that the KVM
> + * doesn't get free'd before ops->release() completes.
> + */
> + if (!rc)
> + kvm_get_kvm(kvm);
> + return rc;
> +}
> +
> +static void kvm_free_vm_rcu(struct rcu_head *rcu)
> +{
> + struct kvm *kvm = container_of(rcu, struct kvm, mmu_notifier_rcu);
> + kvm_arch_free_vm(kvm);
> +}
> +
> +static void kvm_flush_shadow_mmu(struct kvm *kvm)
> +{
> + /*
> + * We hold a reference to kvm instance for mmu_notifier and is
> + * only released when ops->release() is called via exit_mmap path.
> + * So, when we reach here ops->release() has been called already, which
> + * flushes the shadow page tables. Hence there is no need to call the
> + * release() again when we unregister the notifier. However, we need
> + * to delay freeing up the kvm until the release() completes, since
> + * we could reach here via :
> + * kvm_mmu_notifier_release() -> kvm_put_kvm() -> kvm_destroy_vm()
> + */
> + mmu_notifier_unregister_no_release(&kvm->mmu_notifier, kvm->mm);
> +}
> +
> +static void kvm_free_vm(struct kvm *kvm)
> +{
> + /*
> + * Wait until the mmu_notifier has finished the release().
> + * See comments above in kvm_flush_shadow_mmu.
> + */
> + mmu_notifier_call_srcu(&kvm->mmu_notifier_rcu, kvm_free_vm_rcu);
> }
>
> #else /* !(CONFIG_MMU_NOTIFIER && KVM_ARCH_WANT_MMU_NOTIFIER) */
> @@ -497,6 +536,16 @@ static int kvm_init_mmu_notifier(struct kvm *kvm)
> return 0;
> }
>
> +static void kvm_flush_shadow_mmu(struct kvm *kvm)
> +{
> + kvm_arch_flush_shadow_all(kvm);
> +}
> +
> +static void kvm_free_vm(struct kvm *kvm)
> +{
> + kvm_arch_free_vm(kvm);
> +}
> +
> #endif /* CONFIG_MMU_NOTIFIER && KVM_ARCH_WANT_MMU_NOTIFIER */
>
> static struct kvm_memslots *kvm_alloc_memslots(void)
> @@ -733,18 +782,14 @@ static void kvm_destroy_vm(struct kvm *kvm)
> kvm->buses[i] = NULL;
> }
> kvm_coalesced_mmio_free(kvm);
> -#if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER)
> - mmu_notifier_unregister(&kvm->mmu_notifier, kvm->mm);
> -#else
> - kvm_arch_flush_shadow_all(kvm);
> -#endif
> + kvm_flush_shadow_mmu(kvm);
> kvm_arch_destroy_vm(kvm);
> kvm_destroy_devices(kvm);
> for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++)
> kvm_free_memslots(kvm, kvm->memslots[i]);
> cleanup_srcu_struct(&kvm->irq_srcu);
> cleanup_srcu_struct(&kvm->srcu);
> - kvm_arch_free_vm(kvm);
> + kvm_free_vm(kvm);
> preempt_notifier_dec();
> hardware_disable_all();
> mmdrop(mm);
> --
> 2.7.4
>

2017-04-25 18:49:31

by Radim Krčmář

Subject: Re: [PATCH 1/2] kvm: Fix mmu_notifier release race

2017-04-24 11:10+0100, Suzuki K Poulose:
> The KVM uses mmu_notifier (wherever available) to keep track
> of the changes to the mm of the guest. The guest shadow page
> tables are released when the VM exits via mmu_notifier->ops.release().
> There is a rare chance that the mmu_notifier->release could be
> called more than once via two different paths, which could end
> up in use-after-free of kvm instance (such as [0]).
>
> e.g:
>
> thread A thread B
> ------- --------------
>
> get_signal-> kvm_destroy_vm()->
> do_exit-> mmu_notifier_unregister->
> exit_mm-> kvm_arch_flush_shadow_all()->
> exit_mmap-> spin_lock(&kvm->mmu_lock)
> mmu_notifier_release-> ....
> kvm_arch_flush_shadow_all()-> .....
> ... spin_lock(&kvm->mmu_lock) .....
> spin_unlock(&kvm->mmu_lock)
> kvm_arch_free_kvm()
> *** use after free of kvm ***

I don't understand this race ...
a piece of code in mmu_notifier_unregister() says:

        /*
         * Wait for any running method to finish, of course including
         * ->release if it was run by mmu_notifier_release instead of us.
         */
        synchronize_srcu(&srcu);

and code before that removes the notifier from the list, so it cannot be
called after we pass this point. mmu_notifier_release() does roughly
the same and explains it as:

        /*
         * synchronize_srcu here prevents mmu_notifier_release from returning to
         * exit_mmap (which would proceed with freeing all pages in the mm)
         * until the ->release method returns, if it was invoked by
         * mmu_notifier_unregister.
         *
         * The mmu_notifier_mm can't go away from under us because one mm_count
         * is held by exit_mmap.
         */
        synchronize_srcu(&srcu);

The call of mmu_notifier->release is protected by srcu in both cases and
while it seems possible that mmu_notifier->release would be called
twice, I don't see a combination that could result in use-after-free
from mmu_notifier_release after mmu_notifier_unregister() has returned.

Doesn't [2/2] solve the exact same issue (that the release method cannot
be called twice in parallel)?

Thanks.

2017-04-26 16:04:01

by Suzuki K Poulose

Subject: Re: [PATCH 1/2] kvm: Fix mmu_notifier release race

On 25/04/17 19:49, Radim Krčmář wrote:
> 2017-04-24 11:10+0100, Suzuki K Poulose:
>> The KVM uses mmu_notifier (wherever available) to keep track
>> of the changes to the mm of the guest. The guest shadow page
>> tables are released when the VM exits via mmu_notifier->ops.release().
>> There is a rare chance that the mmu_notifier->release could be
>> called more than once via two different paths, which could end
>> up in use-after-free of kvm instance (such as [0]).
>>
>> e.g:
>>
>> thread A thread B
>> ------- --------------
>>
>> get_signal-> kvm_destroy_vm()->
>> do_exit-> mmu_notifier_unregister->
>> exit_mm-> kvm_arch_flush_shadow_all()->
>> exit_mmap-> spin_lock(&kvm->mmu_lock)
>> mmu_notifier_release-> ....
>> kvm_arch_flush_shadow_all()-> .....
>> ... spin_lock(&kvm->mmu_lock) .....
>> spin_unlock(&kvm->mmu_lock)
>> kvm_arch_free_kvm()
>> *** use after free of kvm ***
>
> I don't understand this race ...
> a piece of code in mmu_notifier_unregister() says:
>
> /*
> * Wait for any running method to finish, of course including
> * ->release if it was run by mmu_notifier_release instead of us.
> */
> synchronize_srcu(&srcu);
>
> and code before that removes the notifier from the list, so it cannot be
> called after we pass this point. mmu_notifier_release() does roughly
> the same and explains it as:
>
> /*
> * synchronize_srcu here prevents mmu_notifier_release from returning to
> * exit_mmap (which would proceed with freeing all pages in the mm)
> * until the ->release method returns, if it was invoked by
> * mmu_notifier_unregister.
> *
> * The mmu_notifier_mm can't go away from under us because one mm_count
> * is held by exit_mmap.
> */
> synchronize_srcu(&srcu);
>
> The call of mmu_notifier->release is protected by srcu in both cases and
> while it seems possible that mmu_notifier->release would be called
> twice, I don't see a combination that could result in use-after-free
> from mmu_notifier_release after mmu_notifier_unregister() has returned.

Thanks for bringing it up. I am also wondering why this is triggered! (But it
does get triggered for sure!)

The only difference I can spot between the _unregister and _release paths is the
way we use srcu_read_lock across the deletion of the entry from the list.

In mmu_notifier_unregister() we do:

        id = srcu_read_lock(&srcu);
        /*
         * exit_mmap will block in mmu_notifier_release to guarantee
         * that ->release is called before freeing the pages.
         */
        if (mn->ops->release)
                mn->ops->release(mn, mm);
        srcu_read_unlock(&srcu, id);

## Releases the srcu lock here and then goes on to grab the spin_lock.

        spin_lock(&mm->mmu_notifier_mm->lock);
        /*
         * Can not use list_del_rcu() since __mmu_notifier_release
         * can delete it before we hold the lock.
         */
        hlist_del_init_rcu(&mn->hlist);
        spin_unlock(&mm->mmu_notifier_mm->lock);

While in mmu_notifier_release() we hold it until the node(s) are deleted from
the list:

        /*
         * SRCU here will block mmu_notifier_unregister until
         * ->release returns.
         */
        id = srcu_read_lock(&srcu);
        hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist)
                /*
                 * If ->release runs before mmu_notifier_unregister it must be
                 * handled, as it's the only way for the driver to flush all
                 * existing sptes and stop the driver from establishing any more
                 * sptes before all the pages in the mm are freed.
                 */
                if (mn->ops->release)
                        mn->ops->release(mn, mm);

        spin_lock(&mm->mmu_notifier_mm->lock);
        while (unlikely(!hlist_empty(&mm->mmu_notifier_mm->list))) {
                mn = hlist_entry(mm->mmu_notifier_mm->list.first,
                                 struct mmu_notifier,
                                 hlist);
                /*
                 * We arrived before mmu_notifier_unregister so
                 * mmu_notifier_unregister will do nothing other than to wait
                 * for ->release to finish and for mmu_notifier_unregister to
                 * return.
                 */
                hlist_del_init_rcu(&mn->hlist);
        }
        spin_unlock(&mm->mmu_notifier_mm->lock);
        srcu_read_unlock(&srcu, id);

## The lock is released only after the deletion of the node.

Both are followed by a synchronize_srcu(). Now, I am wondering if the unregister
path could potentially miss an SRCU read lock held in the _release() path and go
on to finish the synchronize_srcu() before the item is deleted. Maybe we should
do the read_unlock after the deletion of the node in _unregister (like we do in
_release())?
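
i.e., something like the below (completely untested; only a rough sketch to
illustrate the reordering I have in mind, based on the current
mmu_notifier_unregister()):

void mmu_notifier_unregister(struct mmu_notifier *mn, struct mm_struct *mm)
{
        BUG_ON(atomic_read(&mm->mm_count) <= 0);

        if (!hlist_unhashed(&mn->hlist)) {
                int id;

                id = srcu_read_lock(&srcu);
                /*
                 * exit_mmap will block in mmu_notifier_release to guarantee
                 * that ->release is called before freeing the pages.
                 */
                if (mn->ops->release)
                        mn->ops->release(mn, mm);

                spin_lock(&mm->mmu_notifier_mm->lock);
                /*
                 * Can not use list_del_rcu() since __mmu_notifier_release
                 * can delete it before we hold the lock.
                 */
                hlist_del_init_rcu(&mn->hlist);
                spin_unlock(&mm->mmu_notifier_mm->lock);

                /*
                 * Drop the SRCU read lock only after the entry is off the
                 * list, mirroring mmu_notifier_release(), so that a
                 * concurrent synchronize_srcu() cannot return while the
                 * entry is still visible.
                 */
                srcu_read_unlock(&srcu, id);
        }

        /*
         * Wait for any running method to finish, of course including
         * ->release if it was run by mmu_notifier_release instead of us.
         */
        synchronize_srcu(&srcu);

        BUG_ON(atomic_read(&mm->mm_count) <= 0);

        mmdrop(mm);
}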

>
> Doesn't [2/2] solve the exact same issue (that the release method cannot
> be called twice in parallel)?

Not really. This could be a race between a release() and one of the other
notifier callbacks. E.g., in [0] we were hitting a use-after-free in
kvm_unmap_hva(), where the unregister could have succeeded and released the
KVM instance.


[0] http://lkml.kernel.org/r/[email protected]

In effect, this could all be down to the same underlying cause: the
synchronize_srcu() in unregister missing another reader.

Suzuki

>
> Thanks.
>

2017-04-26 16:18:08

by Paul E. McKenney

Subject: Re: [PATCH 1/2] kvm: Fix mmu_notifier release race

On Wed, Apr 26, 2017 at 05:03:44PM +0100, Suzuki K Poulose wrote:
> On 25/04/17 19:49, Radim Krčmář wrote:
> >2017-04-24 11:10+0100, Suzuki K Poulose:
> >>The KVM uses mmu_notifier (wherever available) to keep track
> >>of the changes to the mm of the guest. The guest shadow page
> >>tables are released when the VM exits via mmu_notifier->ops.release().
> >>There is a rare chance that the mmu_notifier->release could be
> >>called more than once via two different paths, which could end
> >>up in use-after-free of kvm instance (such as [0]).
> >>
> >>e.g:
> >>
> >>thread A thread B
> >>------- --------------
> >>
> >> get_signal-> kvm_destroy_vm()->
> >> do_exit-> mmu_notifier_unregister->
> >> exit_mm-> kvm_arch_flush_shadow_all()->
> >> exit_mmap-> spin_lock(&kvm->mmu_lock)
> >> mmu_notifier_release-> ....
> >> kvm_arch_flush_shadow_all()-> .....
> >> ... spin_lock(&kvm->mmu_lock) .....
> >> spin_unlock(&kvm->mmu_lock)
> >> kvm_arch_free_kvm()
> >> *** use after free of kvm ***
> >
> >I don't understand this race ...
> >a piece of code in mmu_notifier_unregister() says:
> >
> > /*
> > * Wait for any running method to finish, of course including
> > * ->release if it was run by mmu_notifier_release instead of us.
> > */
> > synchronize_srcu(&srcu);
> >
> >and code before that removes the notifier from the list, so it cannot be
> >called after we pass this point. mmu_notifier_release() does roughly
> >the same and explains it as:
> >
> > /*
> > * synchronize_srcu here prevents mmu_notifier_release from returning to
> > * exit_mmap (which would proceed with freeing all pages in the mm)
> > * until the ->release method returns, if it was invoked by
> > * mmu_notifier_unregister.
> > *
> > * The mmu_notifier_mm can't go away from under us because one mm_count
> > * is held by exit_mmap.
> > */
> > synchronize_srcu(&srcu);
> >
> >The call of mmu_notifier->release is protected by srcu in both cases and
> >while it seems possible that mmu_notifier->release would be called
> >twice, I don't see a combination that could result in use-after-free
> >from mmu_notifier_release after mmu_notifier_unregister() has returned.
>
> Thanks for bringing it up. Even I am wondering why this is triggered ! (But it
> does get triggered for sure !!)
>
> The only difference I can spot with _unregister & _release paths are the way
> we use src_read_lock across the deletion of the entry from the list.
>
> In mmu_notifier_unregister() we do :
>
> id = srcu_read_lock(&srcu);
> /*
> * exit_mmap will block in mmu_notifier_release to guarantee
> * that ->release is called before freeing the pages.
> */
> if (mn->ops->release)
> mn->ops->release(mn, mm);
> srcu_read_unlock(&srcu, id);
>
> ## Releases the srcu lock here and then goes on to grab the spin_lock.
>
> spin_lock(&mm->mmu_notifier_mm->lock);
> /*
> * Can not use list_del_rcu() since __mmu_notifier_release
> * can delete it before we hold the lock.
> */
> hlist_del_init_rcu(&mn->hlist);
> spin_unlock(&mm->mmu_notifier_mm->lock);
>
> While in mmu_notifier_release() we hold it until the node(s) are deleted from the
> list :
> /*
> * SRCU here will block mmu_notifier_unregister until
> * ->release returns.
> */
> id = srcu_read_lock(&srcu);
> hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist)
> /*
> * If ->release runs before mmu_notifier_unregister it must be
> * handled, as it's the only way for the driver to flush all
> * existing sptes and stop the driver from establishing any more
> * sptes before all the pages in the mm are freed.
> */
> if (mn->ops->release)
> mn->ops->release(mn, mm);
>
> spin_lock(&mm->mmu_notifier_mm->lock);
> while (unlikely(!hlist_empty(&mm->mmu_notifier_mm->list))) {
> mn = hlist_entry(mm->mmu_notifier_mm->list.first,
> struct mmu_notifier,
> hlist);
> /*
> * We arrived before mmu_notifier_unregister so
> * mmu_notifier_unregister will do nothing other than to wait
> * for ->release to finish and for mmu_notifier_unregister to
> * return.
> */
> hlist_del_init_rcu(&mn->hlist);
> }
> spin_unlock(&mm->mmu_notifier_mm->lock);
> srcu_read_unlock(&srcu, id);
>
> ## The lock is release only after the deletion of the node.
>
> Both are followed by a synchronize_srcu(). Now, I am wondering if the unregister path
> could potentially miss SRCU read lock held in _release() path and go onto finish the
> synchronize_srcu before the item is deleted ? May be we should do the read_unlock
> after the deletion of the node in _unregister (like we do in the _release()) ?
>
> >
> >Doesn't [2/2] solve the exact same issue (that the release method cannot
> >be called twice in parallel)?
>
> Not really. This could be a race between a release() and one of the other notifier
> callbacks. e.g, In [0], we were hitting a use-after-free in kvm_unmap_hva() where,
> the unregister could have succeeded and released the KVM.
>
>
> [0] http://lkml.kernel.org/r/[email protected]
>
> In effect this all could be due to the same reason, the synchronize in unregister
> missing another reader.

If this is at all reproducible, I suggest use of ftrace or event tracing
to work out exactly what is happening.

Thanx, Paul

2017-04-28 17:20:19

by Suzuki K Poulose

Subject: Re: [PATCH 1/2] kvm: Fix mmu_notifier release race

On 26/04/17 17:03, Suzuki K Poulose wrote:
> On 25/04/17 19:49, Radim Krčmář wrote:
>> 2017-04-24 11:10+0100, Suzuki K Poulose:
>>> The KVM uses mmu_notifier (wherever available) to keep track
>>> of the changes to the mm of the guest. The guest shadow page
>>> tables are released when the VM exits via mmu_notifier->ops.release().
>>> There is a rare chance that the mmu_notifier->release could be
>>> called more than once via two different paths, which could end
>>> up in use-after-free of kvm instance (such as [0]).
>>>
>>> e.g:
>>>
>>> thread A thread B
>>> ------- --------------
>>>
>>> get_signal-> kvm_destroy_vm()->
>>> do_exit-> mmu_notifier_unregister->
>>> exit_mm-> kvm_arch_flush_shadow_all()->
>>> exit_mmap-> spin_lock(&kvm->mmu_lock)
>>> mmu_notifier_release-> ....
>>> kvm_arch_flush_shadow_all()-> .....
>>> ... spin_lock(&kvm->mmu_lock) .....
>>> spin_unlock(&kvm->mmu_lock)
>>> kvm_arch_free_kvm()
>>> *** use after free of kvm ***
>>
>> I don't understand this race ...
>> a piece of code in mmu_notifier_unregister() says:
>>
>> /*
>> * Wait for any running method to finish, of course including
>> * ->release if it was run by mmu_notifier_release instead of us.
>> */
>> synchronize_srcu(&srcu);
>>
>> and code before that removes the notifier from the list, so it cannot be
>> called after we pass this point. mmu_notifier_release() does roughly
>> the same and explains it as:
>>
>> /*
>> * synchronize_srcu here prevents mmu_notifier_release from returning to
>> * exit_mmap (which would proceed with freeing all pages in the mm)
>> * until the ->release method returns, if it was invoked by
>> * mmu_notifier_unregister.
>> *
>> * The mmu_notifier_mm can't go away from under us because one mm_count
>> * is held by exit_mmap.
>> */
>> synchronize_srcu(&srcu);
>>
>> The call of mmu_notifier->release is protected by srcu in both cases and
>> while it seems possible that mmu_notifier->release would be called
>> twice, I don't see a combination that could result in use-after-free
>> from mmu_notifier_release after mmu_notifier_unregister() has returned.
>
> Thanks for bringing it up. Even I am wondering why this is triggered ! (But it
> does get triggered for sure !!)
>
> The only difference I can spot with _unregister & _release paths are the way
> we use src_read_lock across the deletion of the entry from the list.
>
> In mmu_notifier_unregister() we do :
>
> id = srcu_read_lock(&srcu);
> /*
> * exit_mmap will block in mmu_notifier_release to guarantee
> * that ->release is called before freeing the pages.
> */
> if (mn->ops->release)
> mn->ops->release(mn, mm);
> srcu_read_unlock(&srcu, id);
>
> ## Releases the srcu lock here and then goes on to grab the spin_lock.
>
> spin_lock(&mm->mmu_notifier_mm->lock);
> /*
> * Can not use list_del_rcu() since __mmu_notifier_release
> * can delete it before we hold the lock.
> */
> hlist_del_init_rcu(&mn->hlist);
> spin_unlock(&mm->mmu_notifier_mm->lock);
>
> While in mmu_notifier_release() we hold it until the node(s) are deleted from the
> list :
> /*
> * SRCU here will block mmu_notifier_unregister until
> * ->release returns.
> */
> id = srcu_read_lock(&srcu);
> hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist)
> /*
> * If ->release runs before mmu_notifier_unregister it must be
> * handled, as it's the only way for the driver to flush all
> * existing sptes and stop the driver from establishing any more
> * sptes before all the pages in the mm are freed.
> */
> if (mn->ops->release)
> mn->ops->release(mn, mm);
>
> spin_lock(&mm->mmu_notifier_mm->lock);
> while (unlikely(!hlist_empty(&mm->mmu_notifier_mm->list))) {
> mn = hlist_entry(mm->mmu_notifier_mm->list.first,
> struct mmu_notifier,
> hlist);
> /*
> * We arrived before mmu_notifier_unregister so
> * mmu_notifier_unregister will do nothing other than to wait
> * for ->release to finish and for mmu_notifier_unregister to
> * return.
> */
> hlist_del_init_rcu(&mn->hlist);
> }
> spin_unlock(&mm->mmu_notifier_mm->lock);
> srcu_read_unlock(&srcu, id);
>
> ## The lock is release only after the deletion of the node.
>
> Both are followed by a synchronize_srcu(). Now, I am wondering if the unregister path
> could potentially miss SRCU read lock held in _release() path and go onto finish the
> synchronize_srcu before the item is deleted ? May be we should do the read_unlock
> after the deletion of the node in _unregister (like we do in the _release()) ?

I haven't been able to reproduce the mmu_notifier race condition that leads to
the KVM use-after-free reported at [1]. I will leave the test running (with
tracepoints/ftrace) over the weekend.
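
Something along these lines is what I plan to use for the tracing - purely
local debug instrumentation (the helper name below is made up for the test and
dropped into virt/kvm/kvm_main.c; trace_printk() output ends up in
/sys/kernel/debug/tracing/trace):

/*
 * Local debug aid only, not for merging: tag the two racing teardown
 * paths so their per-VM interleaving can be read back from the ftrace
 * ring buffer.
 */
static void kvm_trace_teardown(struct kvm *kvm, const char *what)
{
        trace_printk("%s: kvm=%p users=%d\n",
                     what, kvm, atomic_read(&kvm->users_count));
}

/*
 * Call sites added by hand for the test:
 *  - kvm_trace_teardown(kvm, "notifier-release") at the start of
 *    kvm_mmu_notifier_release()
 *  - kvm_trace_teardown(kvm, "destroy-vm") at the start of
 *    kvm_destroy_vm()
 */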

>
>>
>> Doesn't [2/2] solve the exact same issue (that the release method cannot
>> be called twice in parallel)?
>
> Not really. This could be a race between a release() and one of the other notifier
> callbacks. e.g, In [0], we were hitting a use-after-free in kvm_unmap_hva() where,
> the unregister could have succeeded and released the KVM.

But I can reproduce this problem [0], and we need the [2/2] for arm/arm64.

[0] http://lkml.kernel.org/r/[email protected]
[1] http://lkml.kernel.org/r/CAAeHK+x8udHKq9xa1zkTO6ax5E8Dk32HYWfaT05FMchL2cr48g@mail.gmail.com


Thanks
Suzuki