This series contains potential fixes for problems reported by [0] & [1].
[0] http://lists.infradead.org/pipermail/linux-arm-kernel/2017-March/492944.html
[1] http://lists.infradead.org/pipermail/linux-arm-kernel/2017-March/492943.html
Marc Zyngier (2):
kvm: arm/arm64: Take mmap_sem in stage2_unmap_vm
kvm: arm/arm64: Take mmap_sem in kvm_arch_prepare_memory_region
Suzuki K Poulose (1):
kvm: arm/arm64: Fix locking for kvm_free_stage2_pgd
arch/arm/kvm/mmu.c | 16 +++++++++++++---
1 file changed, 13 insertions(+), 3 deletions(-)
--
2.7.4
From: Marc Zyngier <[email protected]>
We don't hold the mmap_sem while searching for the VMAs when
we try to unmap each memslot for a VM. Fix this properly to
avoid unexpected results.
Fixes: commit 957db105c997 ("arm/arm64: KVM: Introduce stage2_unmap_vm")
Cc: [email protected] # v3.19+
Cc: Christoffer Dall <[email protected]>
Signed-off-by: Marc Zyngier <[email protected]>
Signed-off-by: Suzuki K Poulose <[email protected]>
---
arch/arm/kvm/mmu.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c
index 962616f..f2e2e0c 100644
--- a/arch/arm/kvm/mmu.c
+++ b/arch/arm/kvm/mmu.c
@@ -803,6 +803,7 @@ void stage2_unmap_vm(struct kvm *kvm)
int idx;
idx = srcu_read_lock(&kvm->srcu);
+ down_read(&current->mm->mmap_sem);
spin_lock(&kvm->mmu_lock);
slots = kvm_memslots(kvm);
@@ -810,6 +811,7 @@ void stage2_unmap_vm(struct kvm *kvm)
stage2_unmap_memslot(kvm, memslot);
spin_unlock(&kvm->mmu_lock);
+ up_read(&current->mm->mmap_sem);
srcu_read_unlock(&kvm->srcu, idx);
}
--
2.7.4
In kvm_free_stage2_pgd() we don't hold the kvm->mmu_lock while calling
unmap_stage2_range() on the entire memory range for the guest. This could
cause problems with other callers (e.g., munmap on a memslot) trying to
unmap a range.
Fixes: commit d5d8184d35c9 ("KVM: ARM: Memory virtualization setup")
Cc: [email protected] # v3.10+
Cc: Marc Zyngier <[email protected]>
Cc: Christoffer Dall <[email protected]>
Signed-off-by: Suzuki K Poulose <[email protected]>
---
arch/arm/kvm/mmu.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c
index 13b9c1f..b361f71 100644
--- a/arch/arm/kvm/mmu.c
+++ b/arch/arm/kvm/mmu.c
@@ -831,7 +831,10 @@ void kvm_free_stage2_pgd(struct kvm *kvm)
if (kvm->arch.pgd == NULL)
return;
+ spin_lock(&kvm->mmu_lock);
unmap_stage2_range(kvm, 0, KVM_PHYS_SIZE);
+ spin_unlock(&kvm->mmu_lock);
+
/* Free the HW pgd, one page at a time */
free_pages_exact(kvm->arch.pgd, S2_PGD_SIZE);
kvm->arch.pgd = NULL;
--
2.7.4
From: Marc Zyngier <[email protected]>
We don't hold the mmap_sem while searching for VMAs (via find_vma) in
kvm_arch_prepare_memory_region, which can lead to unexpected failures.
Fixes: commit 8eef91239e57 ("arm/arm64: KVM: map MMIO regions at creation time")
Cc: Ard Biesheuvel <[email protected]>
Cc: Christoffer Dall <[email protected]>
Cc: Eric Auger <[email protected]>
Cc: [email protected] # v3.18+
Signed-off-by: Marc Zyngier <[email protected]>
[ Handle dirty page logging failure case ]
Signed-off-by: Suzuki K Poulose <[email protected]>
---
arch/arm/kvm/mmu.c | 11 ++++++++---
1 file changed, 8 insertions(+), 3 deletions(-)
diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c
index f2e2e0c..13b9c1f 100644
--- a/arch/arm/kvm/mmu.c
+++ b/arch/arm/kvm/mmu.c
@@ -1803,6 +1803,7 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm,
(KVM_PHYS_SIZE >> PAGE_SHIFT))
return -EFAULT;
+ down_read(&current->mm->mmap_sem);
/*
* A memory region could potentially cover multiple VMAs, and any holes
* between them, so iterate over all of them to find out if we can map
@@ -1846,8 +1847,10 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm,
pa += vm_start - vma->vm_start;
/* IO region dirty page logging not allowed */
- if (memslot->flags & KVM_MEM_LOG_DIRTY_PAGES)
- return -EINVAL;
+ if (memslot->flags & KVM_MEM_LOG_DIRTY_PAGES) {
+ ret = -EINVAL;
+ goto out;
+ }
ret = kvm_phys_addr_ioremap(kvm, gpa, pa,
vm_end - vm_start,
@@ -1859,7 +1862,7 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm,
} while (hva < reg_end);
if (change == KVM_MR_FLAGS_ONLY)
- return ret;
+ goto out;
spin_lock(&kvm->mmu_lock);
if (ret)
@@ -1867,6 +1870,8 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm,
else
stage2_flush_memslot(kvm, memslot);
spin_unlock(&kvm->mmu_lock);
+out:
+ up_read(&current->mm->mmap_sem);
return ret;
}
--
2.7.4
On Tue, Mar 14, 2017 at 02:52:32PM +0000, Suzuki K Poulose wrote:
> From: Marc Zyngier <[email protected]>
>
> We don't hold the mmap_sem while searching for the VMAs when
> we try to unmap each memslot for a VM. Fix this properly to
> avoid unexpected results.
>
> Fixes: commit 957db105c997 ("arm/arm64: KVM: Introduce stage2_unmap_vm")
> Cc: [email protected] # v3.19+
> Cc: Christoffer Dall <[email protected]>
> Signed-off-by: Marc Zyngier <[email protected]>
> Signed-off-by: Suzuki K Poulose <[email protected]>
> ---
> arch/arm/kvm/mmu.c | 2 ++
> 1 file changed, 2 insertions(+)
>
> diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c
> index 962616f..f2e2e0c 100644
> --- a/arch/arm/kvm/mmu.c
> +++ b/arch/arm/kvm/mmu.c
> @@ -803,6 +803,7 @@ void stage2_unmap_vm(struct kvm *kvm)
> int idx;
>
> idx = srcu_read_lock(&kvm->srcu);
> + down_read(&current->mm->mmap_sem);
> spin_lock(&kvm->mmu_lock);
>
> slots = kvm_memslots(kvm);
> @@ -810,6 +811,7 @@ void stage2_unmap_vm(struct kvm *kvm)
> stage2_unmap_memslot(kvm, memslot);
>
> spin_unlock(&kvm->mmu_lock);
> + up_read(&current->mm->mmap_sem);
> srcu_read_unlock(&kvm->srcu, idx);
> }
>
> --
> 2.7.4
>
Are we sure that holding mmu_lock is valid while holding the mmap_sem?
Thanks,
-Christoffer
On Tue, Mar 14, 2017 at 02:52:34PM +0000, Suzuki K Poulose wrote:
> In kvm_free_stage2_pgd() we don't hold the kvm->mmu_lock while calling
> unmap_stage2_range() on the entire memory range for the guest. This could
> cause problems with other callers (e.g., munmap on a memslot) trying to
> unmap a range.
>
> Fixes: commit d5d8184d35c9 ("KVM: ARM: Memory virtualization setup")
> Cc: [email protected] # v3.10+
> Cc: Marc Zyngier <[email protected]>
> Cc: Christoffer Dall <[email protected]>
> Signed-off-by: Suzuki K Poulose <[email protected]>
> ---
> arch/arm/kvm/mmu.c | 3 +++
> 1 file changed, 3 insertions(+)
>
> diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c
> index 13b9c1f..b361f71 100644
> --- a/arch/arm/kvm/mmu.c
> +++ b/arch/arm/kvm/mmu.c
> @@ -831,7 +831,10 @@ void kvm_free_stage2_pgd(struct kvm *kvm)
> if (kvm->arch.pgd == NULL)
> return;
>
> + spin_lock(&kvm->mmu_lock);
> unmap_stage2_range(kvm, 0, KVM_PHYS_SIZE);
> + spin_unlock(&kvm->mmu_lock);
> +
This ends up holding the spin lock for potentially quite a while, where
we can do things like __flush_dcache_area(), which I think can fault.
Is that valid?
Thanks,
-Christoffer
> /* Free the HW pgd, one page at a time */
> free_pages_exact(kvm->arch.pgd, S2_PGD_SIZE);
> kvm->arch.pgd = NULL;
> --
> 2.7.4
>
On 15/03/17 09:17, Christoffer Dall wrote:
> On Tue, Mar 14, 2017 at 02:52:32PM +0000, Suzuki K Poulose wrote:
>> From: Marc Zyngier <[email protected]>
>>
>> We don't hold the mmap_sem while searching for the VMAs when
>> we try to unmap each memslot for a VM. Fix this properly to
>> avoid unexpected results.
>>
>> Fixes: commit 957db105c997 ("arm/arm64: KVM: Introduce stage2_unmap_vm")
>> Cc: [email protected] # v3.19+
>> Cc: Christoffer Dall <[email protected]>
>> Signed-off-by: Marc Zyngier <[email protected]>
>> Signed-off-by: Suzuki K Poulose <[email protected]>
>> ---
>> arch/arm/kvm/mmu.c | 2 ++
>> 1 file changed, 2 insertions(+)
>>
>> diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c
>> index 962616f..f2e2e0c 100644
>> --- a/arch/arm/kvm/mmu.c
>> +++ b/arch/arm/kvm/mmu.c
>> @@ -803,6 +803,7 @@ void stage2_unmap_vm(struct kvm *kvm)
>> int idx;
>>
>> idx = srcu_read_lock(&kvm->srcu);
>> + down_read(&current->mm->mmap_sem);
>> spin_lock(&kvm->mmu_lock);
>>
>> slots = kvm_memslots(kvm);
>> @@ -810,6 +811,7 @@ void stage2_unmap_vm(struct kvm *kvm)
>> stage2_unmap_memslot(kvm, memslot);
>>
>> spin_unlock(&kvm->mmu_lock);
>> + up_read(&current->mm->mmap_sem);
>> srcu_read_unlock(&kvm->srcu, idx);
>> }
>>
>> --
>> 2.7.4
>>
>
> Are we sure that holding mmu_lock is valid while holding the mmap_sem?
Maybe I'm just confused by the many levels of locking. Here's my rationale:
- kvm->srcu protects the memslot list
- mmap_sem protects the kernel VMA list
- mmu_lock protects the stage2 page tables (at least here)
I don't immediately see any issue with holding the mmap_sem mutex here
(unless there is a path that would retrigger a down operation on the
mmap_sem?).
Or am I missing something obvious?
Thanks,
M.
--
Jazz is not dead. It just smells funny...
On 15/03/17 09:21, Christoffer Dall wrote:
> On Tue, Mar 14, 2017 at 02:52:34PM +0000, Suzuki K Poulose wrote:
>> In kvm_free_stage2_pgd() we don't hold the kvm->mmu_lock while calling
>> unmap_stage2_range() on the entire memory range for the guest. This could
>> cause problems with other callers (e.g., munmap on a memslot) trying to
>> unmap a range.
>>
>> Fixes: commit d5d8184d35c9 ("KVM: ARM: Memory virtualization setup")
>> Cc: [email protected] # v3.10+
>> Cc: Marc Zyngier <[email protected]>
>> Cc: Christoffer Dall <[email protected]>
>> Signed-off-by: Suzuki K Poulose <[email protected]>
>> ---
>> arch/arm/kvm/mmu.c | 3 +++
>> 1 file changed, 3 insertions(+)
>>
>> diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c
>> index 13b9c1f..b361f71 100644
>> --- a/arch/arm/kvm/mmu.c
>> +++ b/arch/arm/kvm/mmu.c
>> @@ -831,7 +831,10 @@ void kvm_free_stage2_pgd(struct kvm *kvm)
>> if (kvm->arch.pgd == NULL)
>> return;
>>
>> + spin_lock(&kvm->mmu_lock);
>> unmap_stage2_range(kvm, 0, KVM_PHYS_SIZE);
>> + spin_unlock(&kvm->mmu_lock);
>> +
>
> This ends up holding the spin lock for potentially quite a while, where
> we can do things like __flush_dcache_area(), which I think can fault.
I believe we're always using the linear mapping (or kmap on 32bit) in
order not to fault.
Thanks,
M.
--
Jazz is not dead. It just smells funny...
On Wed, Mar 15, 2017 at 09:34:53AM +0000, Marc Zyngier wrote:
> On 15/03/17 09:17, Christoffer Dall wrote:
> > On Tue, Mar 14, 2017 at 02:52:32PM +0000, Suzuki K Poulose wrote:
> >> From: Marc Zyngier <[email protected]>
> >>
> >> We don't hold the mmap_sem while searching for the VMAs when
> >> we try to unmap each memslot for a VM. Fix this properly to
> >> avoid unexpected results.
> >>
> >> Fixes: commit 957db105c997 ("arm/arm64: KVM: Introduce stage2_unmap_vm")
> >> Cc: [email protected] # v3.19+
> >> Cc: Christoffer Dall <[email protected]>
> >> Signed-off-by: Marc Zyngier <[email protected]>
> >> Signed-off-by: Suzuki K Poulose <[email protected]>
> >> ---
> >> arch/arm/kvm/mmu.c | 2 ++
> >> 1 file changed, 2 insertions(+)
> >>
> >> diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c
> >> index 962616f..f2e2e0c 100644
> >> --- a/arch/arm/kvm/mmu.c
> >> +++ b/arch/arm/kvm/mmu.c
> >> @@ -803,6 +803,7 @@ void stage2_unmap_vm(struct kvm *kvm)
> >> int idx;
> >>
> >> idx = srcu_read_lock(&kvm->srcu);
> >> + down_read(&current->mm->mmap_sem);
> >> spin_lock(&kvm->mmu_lock);
> >>
> >> slots = kvm_memslots(kvm);
> >> @@ -810,6 +811,7 @@ void stage2_unmap_vm(struct kvm *kvm)
> >> stage2_unmap_memslot(kvm, memslot);
> >>
> >> spin_unlock(&kvm->mmu_lock);
> >> + up_read(&current->mm->mmap_sem);
> >> srcu_read_unlock(&kvm->srcu, idx);
> >> }
> >>
> >> --
> >> 2.7.4
> >>
> >
> > Are we sure that holding mmu_lock is valid while holding the mmap_sem?
>
> Maybe I'm just confused by the many levels of locking. Here's my rationale:
>
> - kvm->srcu protects the memslot list
> - mmap_sem protects the kernel VMA list
> - mmu_lock protects the stage2 page tables (at least here)
>
> I don't immediately see any issue with holding the mmap_sem mutex here
> (unless there is a path that would retrigger a down operation on the
> mmap_sem?).
>
> Or am I missing something obvious?
I was worried that someone else could hold the mmu_lock and take the
mmap_sem, but that wouldn't be allowed of course, because the semaphore
can sleep, so I agree, you should be good.
I just needed this conversation to feel good about this patch ;)
Reviewed-by: Christoffer Dall <[email protected]>
On Tue, Mar 14, 2017 at 02:52:33PM +0000, Suzuki K Poulose wrote:
> From: Marc Zyngier <[email protected]>
>
> We don't hold the mmap_sem while searching for VMAs (via find_vma) in
> kvm_arch_prepare_memory_region, which can lead to unexpected failures.
>
> Fixes: commit 8eef91239e57 ("arm/arm64: KVM: map MMIO regions at creation time")
> Cc: Ard Biesheuvel <[email protected]>
> Cc: Christoffer Dall <[email protected]>
> Cc: Eric Auger <[email protected]>
> Cc: [email protected] # v3.18+
> Signed-off-by: Marc Zyngier <[email protected]>
> [ Handle dirty page logging failure case ]
> Signed-off-by: Suzuki K Poulose <[email protected]>
Reviewed-by: Christoffer Dall <[email protected]>
> ---
> arch/arm/kvm/mmu.c | 11 ++++++++---
> 1 file changed, 8 insertions(+), 3 deletions(-)
>
> diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c
> index f2e2e0c..13b9c1f 100644
> --- a/arch/arm/kvm/mmu.c
> +++ b/arch/arm/kvm/mmu.c
> @@ -1803,6 +1803,7 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm,
> (KVM_PHYS_SIZE >> PAGE_SHIFT))
> return -EFAULT;
>
> + down_read(&current->mm->mmap_sem);
> /*
> * A memory region could potentially cover multiple VMAs, and any holes
> * between them, so iterate over all of them to find out if we can map
> @@ -1846,8 +1847,10 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm,
> pa += vm_start - vma->vm_start;
>
> /* IO region dirty page logging not allowed */
> - if (memslot->flags & KVM_MEM_LOG_DIRTY_PAGES)
> - return -EINVAL;
> + if (memslot->flags & KVM_MEM_LOG_DIRTY_PAGES) {
> + ret = -EINVAL;
> + goto out;
> + }
>
> ret = kvm_phys_addr_ioremap(kvm, gpa, pa,
> vm_end - vm_start,
> @@ -1859,7 +1862,7 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm,
> } while (hva < reg_end);
>
> if (change == KVM_MR_FLAGS_ONLY)
> - return ret;
> + goto out;
>
> spin_lock(&kvm->mmu_lock);
> if (ret)
> @@ -1867,6 +1870,8 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm,
> else
> stage2_flush_memslot(kvm, memslot);
> spin_unlock(&kvm->mmu_lock);
> +out:
> + up_read(&current->mm->mmap_sem);
> return ret;
> }
>
> --
> 2.7.4
>
On 15/03/17 10:56, Christoffer Dall wrote:
> On Wed, Mar 15, 2017 at 09:39:26AM +0000, Marc Zyngier wrote:
>> On 15/03/17 09:21, Christoffer Dall wrote:
>>> On Tue, Mar 14, 2017 at 02:52:34PM +0000, Suzuki K Poulose wrote:
>>>> In kvm_free_stage2_pgd() we don't hold the kvm->mmu_lock while calling
>>>> unmap_stage2_range() on the entire memory range for the guest. This could
>>>> cause problems with other callers (e.g., munmap on a memslot) trying to
>>>> unmap a range.
>>>>
>>>> Fixes: commit d5d8184d35c9 ("KVM: ARM: Memory virtualization setup")
>>>> Cc: [email protected] # v3.10+
>>>> Cc: Marc Zyngier <[email protected]>
>>>> Cc: Christoffer Dall <[email protected]>
>>>> Signed-off-by: Suzuki K Poulose <[email protected]>
>>>> ---
>>>> arch/arm/kvm/mmu.c | 3 +++
>>>> 1 file changed, 3 insertions(+)
>>>>
>>>> diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c
>>>> index 13b9c1f..b361f71 100644
>>>> --- a/arch/arm/kvm/mmu.c
>>>> +++ b/arch/arm/kvm/mmu.c
>>>> @@ -831,7 +831,10 @@ void kvm_free_stage2_pgd(struct kvm *kvm)
>>>> if (kvm->arch.pgd == NULL)
>>>> return;
>>>>
>>>> + spin_lock(&kvm->mmu_lock);
>>>> unmap_stage2_range(kvm, 0, KVM_PHYS_SIZE);
>>>> + spin_unlock(&kvm->mmu_lock);
>>>> +
>>>
>>> This ends up holding the spin lock for potentially quite a while, where
>>> we can do things like __flush_dcache_area(), which I think can fault.
>>
>> I believe we're always using the linear mapping (or kmap on 32bit) in
>> order not to fault.
>>
>
> ok, then there's just the concern that we may be holding a spinlock for
> a very long time. I seem to recall Mario once added something where he
> unlocked and gave a chance to schedule something else for each PUD or
> something like that, because he ran into the issue during migration. Am
> I confusing this with something else?
That definitely rings a bell: stage2_wp_range() uses that kind of trick
to give the system a chance to breathe. Maybe we could use a similar
trick in our S2 unmapping code? How about this (completely untested) patch:
diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c
index 962616fd4ddd..1786c24212d4 100644
--- a/arch/arm/kvm/mmu.c
+++ b/arch/arm/kvm/mmu.c
@@ -292,8 +292,13 @@ static void unmap_stage2_range(struct kvm *kvm, phys_addr_t start, u64 size)
phys_addr_t addr = start, end = start + size;
phys_addr_t next;
+ BUG_ON(!spin_is_locked(&kvm->mmu_lock));
+
pgd = kvm->arch.pgd + stage2_pgd_index(addr);
do {
+ if (need_resched() || spin_needbreak(&kvm->mmu_lock))
+ cond_resched_lock(&kvm->mmu_lock);
+
next = stage2_pgd_addr_end(addr, end);
if (!stage2_pgd_none(*pgd))
unmap_stage2_puds(kvm, pgd, addr, next);
The additional BUG_ON() is just for my own peace of mind - we seem to
have missed a couple of these lately, and the "breathing" code makes
it imperative that this lock is being taken prior to entering the
function.
Thoughts?
M.
--
Jazz is not dead. It just smells funny...
On 15/03/2017 10:17, Christoffer Dall wrote:
> On Tue, Mar 14, 2017 at 02:52:32PM +0000, Suzuki K Poulose wrote:
>> From: Marc Zyngier <[email protected]>
>>
>> We don't hold the mmap_sem while searching for the VMAs when
>> we try to unmap each memslot for a VM. Fix this properly to
>> avoid unexpected results.
>>
>> Fixes: commit 957db105c997 ("arm/arm64: KVM: Introduce stage2_unmap_vm")
>> Cc: [email protected] # v3.19+
>> Cc: Christoffer Dall <[email protected]>
>> Signed-off-by: Marc Zyngier <[email protected]>
>> Signed-off-by: Suzuki K Poulose <[email protected]>
>> ---
>> arch/arm/kvm/mmu.c | 2 ++
>> 1 file changed, 2 insertions(+)
>>
>> diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c
>> index 962616f..f2e2e0c 100644
>> --- a/arch/arm/kvm/mmu.c
>> +++ b/arch/arm/kvm/mmu.c
>> @@ -803,6 +803,7 @@ void stage2_unmap_vm(struct kvm *kvm)
>> int idx;
>>
>> idx = srcu_read_lock(&kvm->srcu);
>> + down_read(&current->mm->mmap_sem);
>> spin_lock(&kvm->mmu_lock);
>>
>> slots = kvm_memslots(kvm);
>> @@ -810,6 +811,7 @@ void stage2_unmap_vm(struct kvm *kvm)
>> stage2_unmap_memslot(kvm, memslot);
>>
>> spin_unlock(&kvm->mmu_lock);
>> + up_read(&current->mm->mmap_sem);
>> srcu_read_unlock(&kvm->srcu, idx);
>> }
>>
>> --
>> 2.7.4
>>
>
> Are we sure that holding mmu_lock is valid while holding the mmap_sem?
Sure, spinlock-inside-semaphore and spinlock-inside-mutex are always okay.
Paolo
On Wed, Mar 15, 2017 at 01:28:07PM +0000, Marc Zyngier wrote:
> On 15/03/17 10:56, Christoffer Dall wrote:
> > On Wed, Mar 15, 2017 at 09:39:26AM +0000, Marc Zyngier wrote:
> >> On 15/03/17 09:21, Christoffer Dall wrote:
> >>> On Tue, Mar 14, 2017 at 02:52:34PM +0000, Suzuki K Poulose wrote:
> >>>> In kvm_free_stage2_pgd() we don't hold the kvm->mmu_lock while calling
> >>>> unmap_stage2_range() on the entire memory range for the guest. This could
> >>>> cause problems with other callers (e.g., munmap on a memslot) trying to
> >>>> unmap a range.
> >>>>
> >>>> Fixes: commit d5d8184d35c9 ("KVM: ARM: Memory virtualization setup")
> >>>> Cc: [email protected] # v3.10+
> >>>> Cc: Marc Zyngier <[email protected]>
> >>>> Cc: Christoffer Dall <[email protected]>
> >>>> Signed-off-by: Suzuki K Poulose <[email protected]>
> >>>> ---
> >>>> arch/arm/kvm/mmu.c | 3 +++
> >>>> 1 file changed, 3 insertions(+)
> >>>>
> >>>> diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c
> >>>> index 13b9c1f..b361f71 100644
> >>>> --- a/arch/arm/kvm/mmu.c
> >>>> +++ b/arch/arm/kvm/mmu.c
> >>>> @@ -831,7 +831,10 @@ void kvm_free_stage2_pgd(struct kvm *kvm)
> >>>> if (kvm->arch.pgd == NULL)
> >>>> return;
> >>>>
> >>>> + spin_lock(&kvm->mmu_lock);
> >>>> unmap_stage2_range(kvm, 0, KVM_PHYS_SIZE);
> >>>> + spin_unlock(&kvm->mmu_lock);
> >>>> +
> >>>
> >>> This ends up holding the spin lock for potentially quite a while, where
> >>> we can do things like __flush_dcache_area(), which I think can fault.
> >>
> >> I believe we're always using the linear mapping (or kmap on 32bit) in
> >> order not to fault.
> >>
> >
> > ok, then there's just the concern that we may be holding a spinlock for
> > a very long time. I seem to recall Mario once added something where he
> > unlocked and gave a chance to schedule something else for each PUD or
> > something like that, because he ran into the issue during migration. Am
> > I confusing this with something else?
>
> That definitely rings a bell: stage2_wp_range() uses that kind of trick
> to give the system a chance to breathe. Maybe we could use a similar
> trick in our S2 unmapping code? How about this (completely untested) patch:
>
> diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c
> index 962616fd4ddd..1786c24212d4 100644
> --- a/arch/arm/kvm/mmu.c
> +++ b/arch/arm/kvm/mmu.c
> @@ -292,8 +292,13 @@ static void unmap_stage2_range(struct kvm *kvm, phys_addr_t start, u64 size)
> phys_addr_t addr = start, end = start + size;
> phys_addr_t next;
>
> + BUG_ON(!spin_is_locked(&kvm->mmu_lock));
> +
> pgd = kvm->arch.pgd + stage2_pgd_index(addr);
> do {
> + if (need_resched() || spin_needbreak(&kvm->mmu_lock))
> + cond_resched_lock(&kvm->mmu_lock);
> +
> next = stage2_pgd_addr_end(addr, end);
> if (!stage2_pgd_none(*pgd))
> unmap_stage2_puds(kvm, pgd, addr, next);
>
> The additional BUG_ON() is just for my own peace of mind - we seem to
> have missed a couple of these lately, and the "breathing" code makes
> it imperative that this lock is being taken prior to entering the
> function.
>
Looks good to me!
-Christoffer
On 15/03/17 13:35, Christoffer Dall wrote:
> On Wed, Mar 15, 2017 at 01:28:07PM +0000, Marc Zyngier wrote:
>> On 15/03/17 10:56, Christoffer Dall wrote:
>>> On Wed, Mar 15, 2017 at 09:39:26AM +0000, Marc Zyngier wrote:
>>>> On 15/03/17 09:21, Christoffer Dall wrote:
>>>>> On Tue, Mar 14, 2017 at 02:52:34PM +0000, Suzuki K Poulose wrote:
>>>>>> In kvm_free_stage2_pgd() we don't hold the kvm->mmu_lock while calling
>>>>>> unmap_stage2_range() on the entire memory range for the guest. This could
>>>>>> cause problems with other callers (e.g., munmap on a memslot) trying to
>>>>>> unmap a range.
>>>>>>
>>>>>> Fixes: commit d5d8184d35c9 ("KVM: ARM: Memory virtualization setup")
>>>>>> Cc: [email protected] # v3.10+
>>>>>> Cc: Marc Zyngier <[email protected]>
>>>>>> Cc: Christoffer Dall <[email protected]>
>>>>>> Signed-off-by: Suzuki K Poulose <[email protected]>
>>>>>> ---
>>>>>> arch/arm/kvm/mmu.c | 3 +++
>>>>>> 1 file changed, 3 insertions(+)
>>>>>>
>>>>>> diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c
>>>>>> index 13b9c1f..b361f71 100644
>>>>>> --- a/arch/arm/kvm/mmu.c
>>>>>> +++ b/arch/arm/kvm/mmu.c
>>>>>> @@ -831,7 +831,10 @@ void kvm_free_stage2_pgd(struct kvm *kvm)
>>>>>> if (kvm->arch.pgd == NULL)
>>>>>> return;
>>>>>>
>>>>>> + spin_lock(&kvm->mmu_lock);
>>>>>> unmap_stage2_range(kvm, 0, KVM_PHYS_SIZE);
>>>>>> + spin_unlock(&kvm->mmu_lock);
>>>>>> +
>>>>>
>>>>> This ends up holding the spin lock for potentially quite a while, where
>>>>> we can do things like __flush_dcache_area(), which I think can fault.
>>>>
>>>> I believe we're always using the linear mapping (or kmap on 32bit) in
>>>> order not to fault.
>>>>
>>>
>>> ok, then there's just the concern that we may be holding a spinlock for
>>> a very long time. I seem to recall Mario once added something where he
>>> unlocked and gave a chance to schedule something else for each PUD or
>>> something like that, because he ran into the issue during migration. Am
>>> I confusing this with something else?
>>
>> That definitely rings a bell: stage2_wp_range() uses that kind of trick
>> to give the system a chance to breathe. Maybe we could use a similar
>> trick in our S2 unmapping code? How about this (completely untested) patch:
>>
>> diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c
>> index 962616fd4ddd..1786c24212d4 100644
>> --- a/arch/arm/kvm/mmu.c
>> +++ b/arch/arm/kvm/mmu.c
>> @@ -292,8 +292,13 @@ static void unmap_stage2_range(struct kvm *kvm, phys_addr_t start, u64 size)
>> phys_addr_t addr = start, end = start + size;
>> phys_addr_t next;
>>
>> + BUG_ON(!spin_is_locked(&kvm->mmu_lock));
>> +
>> pgd = kvm->arch.pgd + stage2_pgd_index(addr);
>> do {
>> + if (need_resched() || spin_needbreak(&kvm->mmu_lock))
>> + cond_resched_lock(&kvm->mmu_lock);
>> +
>> next = stage2_pgd_addr_end(addr, end);
>> if (!stage2_pgd_none(*pgd))
>> unmap_stage2_puds(kvm, pgd, addr, next);
>>
>> The additional BUG_ON() is just for my own peace of mind - we seem to
>> have missed a couple of these lately, and the "breathing" code makes
>> it imperative that this lock is being taken prior to entering the
>> function.
>>
>
> Looks good to me!
OK. I'll stash that on top of Suzuki's series, and start running some
actual tests... ;-)
Thanks,
M.
--
Jazz is not dead. It just smells funny...
On Wed, Mar 15, 2017 at 09:39:26AM +0000, Marc Zyngier wrote:
> On 15/03/17 09:21, Christoffer Dall wrote:
> > On Tue, Mar 14, 2017 at 02:52:34PM +0000, Suzuki K Poulose wrote:
> >> In kvm_free_stage2_pgd() we don't hold the kvm->mmu_lock while calling
> >> unmap_stage2_range() on the entire memory range for the guest. This could
> >> cause problems with other callers (e.g., munmap on a memslot) trying to
> >> unmap a range.
> >>
> >> Fixes: commit d5d8184d35c9 ("KVM: ARM: Memory virtualization setup")
> >> Cc: [email protected] # v3.10+
> >> Cc: Marc Zyngier <[email protected]>
> >> Cc: Christoffer Dall <[email protected]>
> >> Signed-off-by: Suzuki K Poulose <[email protected]>
> >> ---
> >> arch/arm/kvm/mmu.c | 3 +++
> >> 1 file changed, 3 insertions(+)
> >>
> >> diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c
> >> index 13b9c1f..b361f71 100644
> >> --- a/arch/arm/kvm/mmu.c
> >> +++ b/arch/arm/kvm/mmu.c
> >> @@ -831,7 +831,10 @@ void kvm_free_stage2_pgd(struct kvm *kvm)
> >> if (kvm->arch.pgd == NULL)
> >> return;
> >>
> >> + spin_lock(&kvm->mmu_lock);
> >> unmap_stage2_range(kvm, 0, KVM_PHYS_SIZE);
> >> + spin_unlock(&kvm->mmu_lock);
> >> +
> >
> > This ends up holding the spin lock for potentially quite a while, where
> > we can do things like __flush_dcache_area(), which I think can fault.
>
> I believe we're always using the linear mapping (or kmap on 32bit) in
> order not to fault.
>
ok, then there's just the concern that we may be holding a spinlock for
a very long time. I seem to recall Mario once added something where he
unlocked and gave a chance to schedule something else for each PUD or
something like that, because he ran into the issue during migration. Am
I confusing this with something else?
Thanks,
-Christoffer
Hi Marc,
On 15/03/17 13:43, Marc Zyngier wrote:
> On 15/03/17 13:35, Christoffer Dall wrote:
>> On Wed, Mar 15, 2017 at 01:28:07PM +0000, Marc Zyngier wrote:
>>> On 15/03/17 10:56, Christoffer Dall wrote:
>>>> On Wed, Mar 15, 2017 at 09:39:26AM +0000, Marc Zyngier wrote:
>>>>> On 15/03/17 09:21, Christoffer Dall wrote:
>>>>>> On Tue, Mar 14, 2017 at 02:52:34PM +0000, Suzuki K Poulose wrote:
>>>>>>> In kvm_free_stage2_pgd() we don't hold the kvm->mmu_lock while calling
>>>>>>> unmap_stage2_range() on the entire memory range for the guest. This could
>>>>>>> cause problems with other callers (e.g., munmap on a memslot) trying to
>>>>>>> unmap a range.
>>>>>>>
>>>>>>> Fixes: commit d5d8184d35c9 ("KVM: ARM: Memory virtualization setup")
>>>>>>> Cc: [email protected] # v3.10+
>>>>>>> Cc: Marc Zyngier <[email protected]>
>>>>>>> Cc: Christoffer Dall <[email protected]>
>>>>>>> Signed-off-by: Suzuki K Poulose <[email protected]>
>>>>>>> ---
>>>>>>> arch/arm/kvm/mmu.c | 3 +++
>>>>>>> 1 file changed, 3 insertions(+)
>>>>>>>
>>>>>>> diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c
>>>>>>> index 13b9c1f..b361f71 100644
>>>>>>> --- a/arch/arm/kvm/mmu.c
>>>>>>> +++ b/arch/arm/kvm/mmu.c
>>>>>>> @@ -831,7 +831,10 @@ void kvm_free_stage2_pgd(struct kvm *kvm)
>>>>>>> if (kvm->arch.pgd == NULL)
>>>>>>> return;
>>>>>>>
>>>>>>> + spin_lock(&kvm->mmu_lock);
>>>>>>> unmap_stage2_range(kvm, 0, KVM_PHYS_SIZE);
>>>>>>> + spin_unlock(&kvm->mmu_lock);
>>>>>>> +
>>>>>>
>>>>>> This ends up holding the spin lock for potentially quite a while, where
>>>>>> we can do things like __flush_dcache_area(), which I think can fault.
>>>>>
>>>>> I believe we're always using the linear mapping (or kmap on 32bit) in
>>>>> order not to fault.
>>>>>
>>>>
>>>> ok, then there's just the concern that we may be holding a spinlock for
>>>> a very long time. I seem to recall Mario once added something where he
>>>> unlocked and gave a chance to schedule something else for each PUD or
>>>> something like that, because he ran into the issue during migration. Am
>>>> I confusing this with something else?
>>>
>>> That definitely rings a bell: stage2_wp_range() uses that kind of trick
>>> to give the system a chance to breathe. Maybe we could use a similar
>>> trick in our S2 unmapping code? How about this (completely untested) patch:
>>>
>>> diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c
>>> index 962616fd4ddd..1786c24212d4 100644
>>> --- a/arch/arm/kvm/mmu.c
>>> +++ b/arch/arm/kvm/mmu.c
>>> @@ -292,8 +292,13 @@ static void unmap_stage2_range(struct kvm *kvm, phys_addr_t start, u64 size)
>>> phys_addr_t addr = start, end = start + size;
>>> phys_addr_t next;
>>>
>>> + BUG_ON(!spin_is_locked(&kvm->mmu_lock));
Nit: assert_spin_locked() is somewhat more pleasant (and currently looks
to expand to the exact same code).
Robin.
>>> +
>>> pgd = kvm->arch.pgd + stage2_pgd_index(addr);
>>> do {
>>> + if (need_resched() || spin_needbreak(&kvm->mmu_lock))
>>> + cond_resched_lock(&kvm->mmu_lock);
>>> +
>>> next = stage2_pgd_addr_end(addr, end);
>>> if (!stage2_pgd_none(*pgd))
>>> unmap_stage2_puds(kvm, pgd, addr, next);
>>>
>>> The additional BUG_ON() is just for my own peace of mind - we seem to
>>> have missed a couple of these lately, and the "breathing" code makes
>>> it imperative that this lock is being taken prior to entering the
>>> function.
>>>
>>
>> Looks good to me!
>
> OK. I'll stash that on top of Suzuki's series, and start running some
> actual tests... ;-)
>
> Thanks,
>
> M.
>
On 15/03/17 13:50, Robin Murphy wrote:
> Hi Marc,
>
> On 15/03/17 13:43, Marc Zyngier wrote:
>> On 15/03/17 13:35, Christoffer Dall wrote:
>>> On Wed, Mar 15, 2017 at 01:28:07PM +0000, Marc Zyngier wrote:
>>>> On 15/03/17 10:56, Christoffer Dall wrote:
>>>>> On Wed, Mar 15, 2017 at 09:39:26AM +0000, Marc Zyngier wrote:
>>>>>> On 15/03/17 09:21, Christoffer Dall wrote:
>>>>>>> On Tue, Mar 14, 2017 at 02:52:34PM +0000, Suzuki K Poulose wrote:
>>>>>>>> In kvm_free_stage2_pgd() we don't hold the kvm->mmu_lock while calling
>>>>>>>> unmap_stage2_range() on the entire memory range for the guest. This could
>>>>>>>> cause problems with other callers (e.g., munmap on a memslot) trying to
>>>>>>>> unmap a range.
>>>>>>>>
>>>>>>>> Fixes: commit d5d8184d35c9 ("KVM: ARM: Memory virtualization setup")
>>>>>>>> Cc: [email protected] # v3.10+
>>>>>>>> Cc: Marc Zyngier <[email protected]>
>>>>>>>> Cc: Christoffer Dall <[email protected]>
>>>>>>>> Signed-off-by: Suzuki K Poulose <[email protected]>
>>>>>>>> ---
>>>>>>>> arch/arm/kvm/mmu.c | 3 +++
>>>>>>>> 1 file changed, 3 insertions(+)
>>>>>>>>
>>>>>>>> diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c
>>>>>>>> index 13b9c1f..b361f71 100644
>>>>>>>> --- a/arch/arm/kvm/mmu.c
>>>>>>>> +++ b/arch/arm/kvm/mmu.c
>>>>>>>> @@ -831,7 +831,10 @@ void kvm_free_stage2_pgd(struct kvm *kvm)
>>>>>>>> if (kvm->arch.pgd == NULL)
>>>>>>>> return;
>>>>>>>>
>>>>>>>> + spin_lock(&kvm->mmu_lock);
>>>>>>>> unmap_stage2_range(kvm, 0, KVM_PHYS_SIZE);
>>>>>>>> + spin_unlock(&kvm->mmu_lock);
>>>>>>>> +
>>>>>>>
>>>>>>> This ends up holding the spin lock for potentially quite a while, where
>>>>>>> we can do things like __flush_dcache_area(), which I think can fault.
>>>>>>
>>>>>> I believe we're always using the linear mapping (or kmap on 32bit) in
>>>>>> order not to fault.
>>>>>>
>>>>>
>>>>> ok, then there's just the concern that we may be holding a spinlock for
>>>>> a very long time. I seem to recall Mario once added something where he
>>>>> unlocked and gave a chance to schedule something else for each PUD or
>>>>> something like that, because he ran into the issue during migration. Am
>>>>> I confusing this with something else?
>>>>
>>>> That definitely rings a bell: stage2_wp_range() uses that kind of trick
>>>> to give the system a chance to breathe. Maybe we could use a similar
>>>> trick in our S2 unmapping code? How about this (completely untested) patch:
>>>>
>>>> diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c
>>>> index 962616fd4ddd..1786c24212d4 100644
>>>> --- a/arch/arm/kvm/mmu.c
>>>> +++ b/arch/arm/kvm/mmu.c
>>>> @@ -292,8 +292,13 @@ static void unmap_stage2_range(struct kvm *kvm, phys_addr_t start, u64 size)
>>>> phys_addr_t addr = start, end = start + size;
>>>> phys_addr_t next;
>>>>
>>>> + BUG_ON(!spin_is_locked(&kvm->mmu_lock));
>
> Nit: assert_spin_locked() is somewhat more pleasant (and currently looks
> to expand to the exact same code).
Fancy!
Thanks,
M.
--
Jazz is not dead. It just smells funny...
On 15/03/17 13:28, Marc Zyngier wrote:
> On 15/03/17 10:56, Christoffer Dall wrote:
>> On Wed, Mar 15, 2017 at 09:39:26AM +0000, Marc Zyngier wrote:
>>> On 15/03/17 09:21, Christoffer Dall wrote:
>>>> On Tue, Mar 14, 2017 at 02:52:34PM +0000, Suzuki K Poulose wrote:
>>>>> In kvm_free_stage2_pgd() we don't hold the kvm->mmu_lock while calling
>>>>> unmap_stage2_range() on the entire memory range for the guest. This could
>>>>> cause problems with other callers (e.g, munmap on a memslot) trying to
>>>>> unmap a range.
>>>>>
...
>> ok, then there's just the concern that we may be holding a spinlock for
>> a very long time. I seem to recall Mario once added something where he
>> unlocked and gave a chance to schedule something else for each PUD or
>> something like that, because he ran into the issue during migration. Am
>> I confusing this with something else?
>
> That definitely rings a bell: stage2_wp_range() uses that kind of trick
> to give the system a chance to breathe. Maybe we could use a similar
> trick in our S2 unmapping code? How about this (completely untested) patch:
>
> diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c
> index 962616fd4ddd..1786c24212d4 100644
> --- a/arch/arm/kvm/mmu.c
> +++ b/arch/arm/kvm/mmu.c
> @@ -292,8 +292,13 @@ static void unmap_stage2_range(struct kvm *kvm, phys_addr_t start, u64 size)
> phys_addr_t addr = start, end = start + size;
> phys_addr_t next;
>
> + BUG_ON(!spin_is_locked(&kvm->mmu_lock));
> +
> pgd = kvm->arch.pgd + stage2_pgd_index(addr);
> do {
> + if (need_resched() || spin_needbreak(&kvm->mmu_lock))
> + cond_resched_lock(&kvm->mmu_lock);
nit: I think we could call cond_resched_lock() unconditionally here,
given that __cond_resched_lock() already does all of the above checks:
kernel/sched/core.c:
int __cond_resched_lock(spinlock_t *lock)
{
int resched = should_resched(PREEMPT_LOCK_OFFSET);
...
if (spin_needbreak(lock) || resched) {
Suzuki
On 15/03/17 14:33, Suzuki K Poulose wrote:
> On 15/03/17 13:28, Marc Zyngier wrote:
>> On 15/03/17 10:56, Christoffer Dall wrote:
>>> On Wed, Mar 15, 2017 at 09:39:26AM +0000, Marc Zyngier wrote:
>>>> On 15/03/17 09:21, Christoffer Dall wrote:
>>>>> On Tue, Mar 14, 2017 at 02:52:34PM +0000, Suzuki K Poulose wrote:
>>>>>> In kvm_free_stage2_pgd() we don't hold the kvm->mmu_lock while calling
>>>>>> unmap_stage2_range() on the entire memory range for the guest. This could
>>>>>> cause problems with other callers (e.g, munmap on a memslot) trying to
>>>>>> unmap a range.
>>>>>>
>
> ...
>
>>> ok, then there's just the concern that we may be holding a spinlock for
>>> a very long time. I seem to recall Mario once added something where he
>>> unlocked and gave a chance to schedule something else for each PUD or
>>> something like that, because he ran into the issue during migration. Am
>>> I confusing this with something else?
>>
>> That definitely rings a bell: stage2_wp_range() uses that kind of trick
>> to give the system a chance to breathe. Maybe we could use a similar
>> trick in our S2 unmapping code? How about this (completely untested) patch:
>>
>> diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c
>> index 962616fd4ddd..1786c24212d4 100644
>> --- a/arch/arm/kvm/mmu.c
>> +++ b/arch/arm/kvm/mmu.c
>> @@ -292,8 +292,13 @@ static void unmap_stage2_range(struct kvm *kvm, phys_addr_t start, u64 size)
>> phys_addr_t addr = start, end = start + size;
>> phys_addr_t next;
>>
>> + BUG_ON(!spin_is_locked(&kvm->mmu_lock));
>> +
>> pgd = kvm->arch.pgd + stage2_pgd_index(addr);
>> do {
>> + if (need_resched() || spin_needbreak(&kvm->mmu_lock))
>> + cond_resched_lock(&kvm->mmu_lock);
>
> nit: I think we could call cond_resched_lock() unconditionally here,
> given that __cond_resched_lock() already does all of the above checks:
>
> kernel/sched/core.c:
>
> int __cond_resched_lock(spinlock_t *lock)
> {
> int resched = should_resched(PREEMPT_LOCK_OFFSET);
>
> ...
>
> if (spin_needbreak(lock) || resched) {
Right. And should_resched() also contains a test for need_resched().
This means we can also simplify stage2_wp_range(). Awesome!
Thanks,
M.
--
Jazz is not dead. It just smells funny...