2021-09-09 16:25:44

by David Hildenbrand

Subject: [PATCH resend RFC 0/9] s390: fixes, cleanups and optimizations for page table walkers

Resend because I missed ccing people on the actual patches ...

RFC because the patches are essentially untested and I did not actually
try to trigger any of the things these patches are supposed to fix. It
merely matches my current understanding (and what other code does :) ). I
did compile-test as far as possible.

After learning more about the wonderful world of page tables and their
interaction with the mmap_sem and VMAs, I spotted some issues in our
page table walkers that allow user space to trigger nasty behavior when
playing dirty tricks with munmap() or mmap() of hugetlb. While some issues
should be hard to trigger, others are fairly easy because we provide
convenient interfaces (e.g., KVM_S390_GET_SKEYS and KVM_S390_SET_SKEYS).

Future work:
- Don't use get_locked_pte() when it's not required to actually allocate
page tables -- similar to how storage keys are now handled. Examples are
get_pgste() and __gmap_zap.
- Don't use get_locked_pte() and instead let page fault logic allocate page
tables when we actually do need page tables -- also, similar to how
storage keys are now handled. Examples are set_pgste_bits() and
pgste_perform_essa().
- Maybe switch to mm/pagewalk.c to avoid custom page table walkers. For
__gmap_zap() that's very easy.

Cc: Christian Borntraeger <[email protected]>
Cc: Janosch Frank <[email protected]>
Cc: Cornelia Huck <[email protected]>
Cc: Claudio Imbrenda <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Vasily Gorbik <[email protected]>
Cc: Niklas Schnelle <[email protected]>
Cc: Gerald Schaefer <[email protected]>
Cc: Ulrich Weigand <[email protected]>

David Hildenbrand (9):
s390/gmap: validate VMA in __gmap_zap()
s390/gmap: don't unconditionally call pte_unmap_unlock() in
__gmap_zap()
s390/mm: validate VMA in PGSTE manipulation functions
s390/mm: fix VMA and page table handling code in storage key handling
functions
s390/uv: fully validate the VMA before calling follow_page()
s390/pci_mmio: fully validate the VMA before calling follow_pte()
s390/mm: no need for pte_alloc_map_lock() if we know the pmd is
present
s390/mm: optimize set_guest_storage_key()
s390/mm: optimize reset_guest_reference_bit()

arch/s390/kernel/uv.c | 2 +-
arch/s390/mm/gmap.c | 11 +++-
arch/s390/mm/pgtable.c | 109 +++++++++++++++++++++++++++------------
arch/s390/pci/pci_mmio.c | 4 +-
4 files changed, 89 insertions(+), 37 deletions(-)


base-commit: 7d2a07b769330c34b4deabeed939325c77a7ec2f
--
2.31.1


2021-09-09 16:26:04

by David Hildenbrand

Subject: [PATCH resend RFC 4/9] s390/mm: fix VMA and page table handling code in storage key handling functions

There are multiple things broken about our storage key handling
functions:

1. We should not walk/touch page tables outside of VMA boundaries when
holding only the mmap sem in read mode. Evil user space can modify the
VMA layout just before this function runs and e.g., trigger races with
page table removal code since commit dd2283f2605e ("mm: mmap: zap pages
with read mmap_sem in munmap"). gfn_to_hva() will only translate using
KVM memory regions, but won't validate the VMA.

2. We should not allocate page tables outside of VMA boundaries: if
evil user space decides to map hugetlbfs to these ranges, bad things
will happen because we suddenly have PTE or PMD page tables where we
shouldn't have them.

3. We don't handle large PUDs that might suddenly appear inside our page
table hierarchy.

Don't manually allocate page tables, properly validate that we have a VMA,
and bail out on pud_large().

All callers of page table handling functions, except
get_guest_storage_key(), call fixup_user_fault() in case they
receive an -EFAULT and retry; this will allocate the necessary page tables
if required.

To keep get_guest_storage_key() working as expected and not requiring
kvm_s390_get_skeys() to call fixup_user_fault(), distinguish between
"there is simply no page table or huge page yet and the key is assumed
to be 0" and "this is a fault to be reported".

Although commit 637ff9efe5ea ("s390/mm: Add huge pmd storage key handling")
introduced most of the affected code, it was actually already broken
before when using get_locked_pte() without any VMA checks.

Note: Ever since commit 637ff9efe5ea ("s390/mm: Add huge pmd storage key
handling") we can no longer set a guest storage key (for example from
QEMU during VM live migration) without actually resolving a fault.
Although we would have created most page tables, we would choke on the
!pmd_present(), requiring a call to fixup_user_fault(). I would
have thought that this is problematic in combination with postcopy live
migration ... but nobody noticed and this patch doesn't change the
situation. So maybe it's just fine.

Fixes: 9fcf93b5de06 ("KVM: S390: Create helper function get_guest_storage_key")
Fixes: 24d5dd0208ed ("s390/kvm: Provide function for setting the guest storage key")
Fixes: a7e19ab55ffd ("KVM: s390: handle missing storage-key facility")
Signed-off-by: David Hildenbrand <[email protected]>
---
arch/s390/mm/pgtable.c | 57 +++++++++++++++++++++++++++++-------------
1 file changed, 39 insertions(+), 18 deletions(-)

diff --git a/arch/s390/mm/pgtable.c b/arch/s390/mm/pgtable.c
index 54969e0f3a94..5fb409ff7842 100644
--- a/arch/s390/mm/pgtable.c
+++ b/arch/s390/mm/pgtable.c
@@ -429,22 +429,36 @@ static inline pmd_t pmdp_flush_lazy(struct mm_struct *mm,
}

#ifdef CONFIG_PGSTE
-static pmd_t *pmd_alloc_map(struct mm_struct *mm, unsigned long addr)
+static int pmd_lookup(struct mm_struct *mm, unsigned long addr, pmd_t **pmdp)
{
+ struct vm_area_struct *vma;
pgd_t *pgd;
p4d_t *p4d;
pud_t *pud;
- pmd_t *pmd;
+
+ /* We need a valid VMA, otherwise this is clearly a fault. */
+ vma = vma_lookup(mm, addr);
+ if (!vma)
+ return -EFAULT;

pgd = pgd_offset(mm, addr);
- p4d = p4d_alloc(mm, pgd, addr);
- if (!p4d)
- return NULL;
- pud = pud_alloc(mm, p4d, addr);
- if (!pud)
- return NULL;
- pmd = pmd_alloc(mm, pud, addr);
- return pmd;
+ if (!pgd_present(*pgd))
+ return -ENOENT;
+
+ p4d = p4d_offset(pgd, addr);
+ if (!p4d_present(*p4d))
+ return -ENOENT;
+
+ pud = pud_offset(p4d, addr);
+ if (!pud_present(*pud))
+ return -ENOENT;
+
+ /* Large PUDs are not supported yet. */
+ if (pud_large(*pud))
+ return -EFAULT;
+
+ *pmdp = pmd_offset(pud, addr);
+ return 0;
}
#endif

@@ -778,8 +792,7 @@ int set_guest_storage_key(struct mm_struct *mm, unsigned long addr,
pmd_t *pmdp;
pte_t *ptep;

- pmdp = pmd_alloc_map(mm, addr);
- if (unlikely(!pmdp))
+ if (pmd_lookup(mm, addr, &pmdp))
return -EFAULT;

ptl = pmd_lock(mm, pmdp);
@@ -881,8 +894,7 @@ int reset_guest_reference_bit(struct mm_struct *mm, unsigned long addr)
pte_t *ptep;
int cc = 0;

- pmdp = pmd_alloc_map(mm, addr);
- if (unlikely(!pmdp))
+ if (pmd_lookup(mm, addr, &pmdp))
return -EFAULT;

ptl = pmd_lock(mm, pmdp);
@@ -935,15 +947,24 @@ int get_guest_storage_key(struct mm_struct *mm, unsigned long addr,
pmd_t *pmdp;
pte_t *ptep;

- pmdp = pmd_alloc_map(mm, addr);
- if (unlikely(!pmdp))
+ /*
+ * If we don't have a PTE table and if there is no huge page mapped,
+ * the storage key is 0.
+ */
+ *key = 0;
+
+ switch (pmd_lookup(mm, addr, &pmdp)) {
+ case -ENOENT:
+ return 0;
+ case 0:
+ break;
+ default:
return -EFAULT;
+ }

ptl = pmd_lock(mm, pmdp);
if (!pmd_present(*pmdp)) {
- /* Not yet mapped memory has a zero key */
spin_unlock(ptl);
- *key = 0;
return 0;
}

--
2.31.1
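
The three outcomes of pmd_lookup() and how get_guest_storage_key() treats
them can be sketched in user space as follows. This is only a model under
stated assumptions, not the kernel code: struct model_walk and the model_*()
functions are hypothetical names, the page table walk is reduced to a few
booleans, and locking is left out entirely.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical description of what a walk would find for one address. */
struct model_walk {
	bool has_vma;		/* vma_lookup() found a VMA */
	bool upper_present;	/* pgd, p4d and pud entries are all present */
	bool pud_large;		/* the pud maps a large page */
	bool pmd_present;	/* the pmd entry is present */
	unsigned char key;	/* storage key stored for the page */
};

/* Models pmd_lookup(): 0 on success, -ENOENT or -EFAULT otherwise. */
static int model_pmd_lookup(const struct model_walk *w)
{
	if (!w->has_vma)
		return -EFAULT;		/* no VMA: clearly a fault */
	if (!w->upper_present)
		return -ENOENT;		/* nothing populated yet */
	if (w->pud_large)
		return -EFAULT;		/* large PUDs are not supported */
	return 0;
}

/* Models the control flow of get_guest_storage_key() after the patch. */
static int model_get_key(const struct model_walk *w, unsigned char *key)
{
	*key = 0;	/* no PTE table and no huge page: the key is 0 */

	switch (model_pmd_lookup(w)) {
	case -ENOENT:
		return 0;
	case 0:
		break;
	default:
		return -EFAULT;
	}

	if (!w->pmd_present)
		return 0;	/* not yet mapped: key stays 0 */

	*key = w->key;
	return 0;
}
```

The point of the model is that "nothing there yet" (-ENOENT) is reported to
the caller as success with key 0, while a missing VMA or a large PUD remains
a fault.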

2021-09-09 16:26:12

by David Hildenbrand

Subject: [PATCH resend RFC 6/9] s390/pci_mmio: fully validate the VMA before calling follow_pte()

We should not walk/touch page tables outside of VMA boundaries when
holding only the mmap sem in read mode. Evil user space can modify the
VMA layout just before this function runs and e.g., trigger races with
page table removal code since commit dd2283f2605e ("mm: mmap: zap pages
with read mmap_sem in munmap").

find_vma() does not check if the address is >= the VMA start address;
use vma_lookup() instead.

Fixes: dd2283f2605e ("mm: mmap: zap pages with read mmap_sem in munmap")
Signed-off-by: David Hildenbrand <[email protected]>
---
arch/s390/pci/pci_mmio.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/s390/pci/pci_mmio.c b/arch/s390/pci/pci_mmio.c
index ae683aa623ac..c5b35ea129cf 100644
--- a/arch/s390/pci/pci_mmio.c
+++ b/arch/s390/pci/pci_mmio.c
@@ -159,7 +159,7 @@ SYSCALL_DEFINE3(s390_pci_mmio_write, unsigned long, mmio_addr,

mmap_read_lock(current->mm);
ret = -EINVAL;
- vma = find_vma(current->mm, mmio_addr);
+ vma = vma_lookup(current->mm, mmio_addr);
if (!vma)
goto out_unlock_mmap;
if (!(vma->vm_flags & (VM_IO | VM_PFNMAP)))
@@ -298,7 +298,7 @@ SYSCALL_DEFINE3(s390_pci_mmio_read, unsigned long, mmio_addr,

mmap_read_lock(current->mm);
ret = -EINVAL;
- vma = find_vma(current->mm, mmio_addr);
+ vma = vma_lookup(current->mm, mmio_addr);
if (!vma)
goto out_unlock_mmap;
if (!(vma->vm_flags & (VM_IO | VM_PFNMAP)))
--
2.31.1
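
To illustrate the find_vma() vs. vma_lookup() difference that this patch
relies on, here is a minimal user-space sketch in C. It is not kernel code:
struct model_vma, model_vmas, model_find_vma() and model_vma_lookup() are
hypothetical names, and the linear scan only models the lookup semantics
described in the commit message, not how the kernel searches its VMAs.

```c
#include <stddef.h>

/* Hypothetical user-space stand-in for struct vm_area_struct. */
struct model_vma {
	unsigned long vm_start;	/* first address inside the mapping */
	unsigned long vm_end;	/* first address after the mapping */
};

/* Two mappings with a hole in between: [0x1000, 0x2000), [0x5000, 0x6000). */
static const struct model_vma model_vmas[] = {
	{ 0x1000, 0x2000 },
	{ 0x5000, 0x6000 },
};

/*
 * Models find_vma(): return the first VMA that ends above addr. The result
 * may start above addr, i.e. addr may sit in a hole in front of the VMA.
 */
static const struct model_vma *model_find_vma(unsigned long addr)
{
	size_t i;

	for (i = 0; i < sizeof(model_vmas) / sizeof(model_vmas[0]); i++)
		if (addr < model_vmas[i].vm_end)
			return &model_vmas[i];
	return NULL;
}

/* Models vma_lookup(): additionally require that addr is inside the VMA. */
static const struct model_vma *model_vma_lookup(unsigned long addr)
{
	const struct model_vma *vma = model_find_vma(addr);

	if (vma && addr < vma->vm_start)
		return NULL;
	return vma;
}
```

With these two mappings, model_find_vma(0x3000) returns the VMA starting at
0x5000 although 0x3000 lies in the hole, whereas model_vma_lookup(0x3000)
returns NULL.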

2021-09-09 16:26:22

by David Hildenbrand

Subject: [PATCH resend RFC 3/9] s390/mm: validate VMA in PGSTE manipulation functions

We should not walk/touch page tables outside of VMA boundaries when
holding only the mmap sem in read mode. Evil user space can modify the
VMA layout just before this function runs and e.g., trigger races with
page table removal code since commit dd2283f2605e ("mm: mmap: zap pages
with read mmap_sem in munmap"). gfn_to_hva() will only translate using
KVM memory regions, but won't validate the VMA.

Further, we should not allocate page tables outside of VMA boundaries: if
evil user space decides to map hugetlbfs to these ranges, bad things will
happen because we suddenly have PTE or PMD page tables where we
shouldn't have them.

Similarly, we have to check if we suddenly find a hugetlbfs VMA, before
calling get_locked_pte().

Fixes: 2d42f9477320 ("s390/kvm: Add PGSTE manipulation functions")
Signed-off-by: David Hildenbrand <[email protected]>
---
arch/s390/mm/pgtable.c | 13 +++++++++++++
1 file changed, 13 insertions(+)

diff --git a/arch/s390/mm/pgtable.c b/arch/s390/mm/pgtable.c
index eec3a9d7176e..54969e0f3a94 100644
--- a/arch/s390/mm/pgtable.c
+++ b/arch/s390/mm/pgtable.c
@@ -988,6 +988,7 @@ EXPORT_SYMBOL(get_guest_storage_key);
int pgste_perform_essa(struct mm_struct *mm, unsigned long hva, int orc,
unsigned long *oldpte, unsigned long *oldpgste)
{
+ struct vm_area_struct *vma;
unsigned long pgstev;
spinlock_t *ptl;
pgste_t pgste;
@@ -997,6 +998,10 @@ int pgste_perform_essa(struct mm_struct *mm, unsigned long hva, int orc,
WARN_ON_ONCE(orc > ESSA_MAX);
if (unlikely(orc > ESSA_MAX))
return -EINVAL;
+
+ vma = vma_lookup(mm, hva);
+ if (!vma || is_vm_hugetlb_page(vma))
+ return -EFAULT;
ptep = get_locked_pte(mm, hva, &ptl);
if (unlikely(!ptep))
return -EFAULT;
@@ -1089,10 +1094,14 @@ EXPORT_SYMBOL(pgste_perform_essa);
int set_pgste_bits(struct mm_struct *mm, unsigned long hva,
unsigned long bits, unsigned long value)
{
+ struct vm_area_struct *vma;
spinlock_t *ptl;
pgste_t new;
pte_t *ptep;

+ vma = vma_lookup(mm, hva);
+ if (!vma || is_vm_hugetlb_page(vma))
+ return -EFAULT;
ptep = get_locked_pte(mm, hva, &ptl);
if (unlikely(!ptep))
return -EFAULT;
@@ -1117,9 +1126,13 @@ EXPORT_SYMBOL(set_pgste_bits);
*/
int get_pgste(struct mm_struct *mm, unsigned long hva, unsigned long *pgstep)
{
+ struct vm_area_struct *vma;
spinlock_t *ptl;
pte_t *ptep;

+ vma = vma_lookup(mm, hva);
+ if (!vma || is_vm_hugetlb_page(vma))
+ return -EFAULT;
ptep = get_locked_pte(mm, hva, &ptl);
if (unlikely(!ptep))
return -EFAULT;
--
2.31.1
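
The guard added in front of get_locked_pte() in this patch can be modelled
in user space roughly like this. Everything here is an assumption made for
illustration: MODEL_VM_HUGETLB is a made-up flag value (not the kernel's
VM_HUGETLB), and struct model_vma, model_vmas and the model_*() helpers are
hypothetical stand-ins for the VMA, vma_lookup() and is_vm_hugetlb_page().

```c
#include <errno.h>
#include <stddef.h>

#define MODEL_VM_HUGETLB 0x1UL	/* made-up flag bit for this sketch */

/* Hypothetical user-space stand-in for struct vm_area_struct. */
struct model_vma {
	unsigned long vm_start;
	unsigned long vm_end;
	unsigned long vm_flags;
};

static const struct model_vma model_vmas[] = {
	{ 0x100000, 0x200000, 0 },			/* ordinary mapping */
	{ 0x400000, 0x600000, MODEL_VM_HUGETLB },	/* hugetlbfs mapping */
};

/* Models vma_lookup(): only return a VMA that really contains addr. */
static const struct model_vma *model_vma_lookup(unsigned long addr)
{
	size_t i;

	for (i = 0; i < sizeof(model_vmas) / sizeof(model_vmas[0]); i++)
		if (addr >= model_vmas[i].vm_start && addr < model_vmas[i].vm_end)
			return &model_vmas[i];
	return NULL;
}

/*
 * Models the check placed before get_locked_pte(): refuse to touch (or
 * allocate) PTE tables without a VMA or inside a hugetlbfs VMA.
 */
static int model_may_walk_ptes(unsigned long hva)
{
	const struct model_vma *vma = model_vma_lookup(hva);

	if (!vma || (vma->vm_flags & MODEL_VM_HUGETLB))
		return -EFAULT;
	return 0;
}
```

Only the address inside the ordinary mapping passes; the hole and the
hugetlbfs range are both rejected before any page table would be touched.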

2021-09-09 16:27:29

by David Hildenbrand

Subject: [PATCH resend RFC 1/9] s390/gmap: validate VMA in __gmap_zap()

We should not walk/touch page tables outside of VMA boundaries when
holding only the mmap sem in read mode. Evil user space can modify the
VMA layout just before this function runs and e.g., trigger races with
page table removal code since commit dd2283f2605e ("mm: mmap: zap pages
with read mmap_sem in munmap"). The mere presence in our guest_to_host
radix tree does not imply that there is a VMA.

Further, we should not allocate page tables (via get_locked_pte()) outside
of VMA boundaries: if evil user space decides to map hugetlbfs to these
ranges, bad things will happen because we suddenly have PTE or PMD page
tables where we shouldn't have them.

Similarly, we have to check if we suddenly find a hugetlbfs VMA, before
calling get_locked_pte().

Note that gmap_discard() is different:
zap_page_range()->unmap_single_vma() makes sure to stay within VMA
boundaries.

Fixes: b31288fa83b2 ("s390/kvm: support collaborative memory management")
Signed-off-by: David Hildenbrand <[email protected]>
---
arch/s390/mm/gmap.c | 6 ++++++
1 file changed, 6 insertions(+)

diff --git a/arch/s390/mm/gmap.c b/arch/s390/mm/gmap.c
index 9bb2c7512cd5..b6b56cd4ca64 100644
--- a/arch/s390/mm/gmap.c
+++ b/arch/s390/mm/gmap.c
@@ -673,6 +673,7 @@ EXPORT_SYMBOL_GPL(gmap_fault);
*/
void __gmap_zap(struct gmap *gmap, unsigned long gaddr)
{
+ struct vm_area_struct *vma;
unsigned long vmaddr;
spinlock_t *ptl;
pte_t *ptep;
@@ -682,6 +683,11 @@ void __gmap_zap(struct gmap *gmap, unsigned long gaddr)
gaddr >> PMD_SHIFT);
if (vmaddr) {
vmaddr |= gaddr & ~PMD_MASK;
+
+ vma = vma_lookup(gmap->mm, vmaddr);
+ if (!vma || is_vm_hugetlb_page(vma))
+ return;
+
/* Get pointer to the page table entry */
ptep = get_locked_pte(gmap->mm, vmaddr, &ptl);
if (likely(ptep))
--
2.31.1

2021-09-09 16:27:41

by David Hildenbrand

Subject: [PATCH resend RFC 8/9] s390/mm: optimize set_guest_storage_key()

We already optimize get_guest_storage_key() to assume that the storage key
is 0 if we have neither a PTE table nor a huge page mapped.

Similarly, optimize set_guest_storage_key() to simply do nothing in that
case if the key to set is 0, because the key already is 0.

Signed-off-by: David Hildenbrand <[email protected]>
---
arch/s390/mm/pgtable.c | 14 ++++++++++++--
1 file changed, 12 insertions(+), 2 deletions(-)

diff --git a/arch/s390/mm/pgtable.c b/arch/s390/mm/pgtable.c
index 4e77b8ebdcc5..534939a3eca5 100644
--- a/arch/s390/mm/pgtable.c
+++ b/arch/s390/mm/pgtable.c
@@ -792,13 +792,23 @@ int set_guest_storage_key(struct mm_struct *mm, unsigned long addr,
pmd_t *pmdp;
pte_t *ptep;

- if (pmd_lookup(mm, addr, &pmdp))
+ /*
+ * If we don't have a PTE table and if there is no huge page mapped,
+ * we can ignore attempts to set the key to 0, because it already is 0.
+ */
+ switch (pmd_lookup(mm, addr, &pmdp)) {
+ case -ENOENT:
+ return key ? -EFAULT : 0;
+ case 0:
+ break;
+ default:
return -EFAULT;
+ }

ptl = pmd_lock(mm, pmdp);
if (!pmd_present(*pmdp)) {
spin_unlock(ptl);
- return -EFAULT;
+ return key ? -EFAULT : 0;
}

if (pmd_large(*pmdp)) {
--
2.31.1
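
A small user-space model of this optimization, combined with the "call
fixup_user_fault() on -EFAULT and retry" pattern that patch 4/9 describes
for the callers. All names (struct model_page and the model_*() functions)
are hypothetical; model_fixup_fault() merely flips a flag and is not meant
to reflect what fixup_user_fault() really does.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical one-page model. */
struct model_page {
	bool populated;		/* a PTE table or huge page is present */
	unsigned char key;	/* stored key, 0 while not populated */
	unsigned int faults;	/* how often model_fixup_fault() ran */
};

/* Models set_guest_storage_key() after the patch. */
static int model_set_key(struct model_page *p, unsigned char key)
{
	if (!p->populated)
		return key ? -EFAULT : 0;	/* key 0 is already in place */
	p->key = key;
	return 0;
}

/* Stand-in for fixup_user_fault(): populate the page. */
static void model_fixup_fault(struct model_page *p)
{
	p->populated = true;
	p->faults++;
}

/* Models a caller that resolves the fault and retries once on -EFAULT. */
static int model_set_key_retry(struct model_page *p, unsigned char key)
{
	int rc = model_set_key(p, key);

	if (rc == -EFAULT) {
		model_fixup_fault(p);
		rc = model_set_key(p, key);
	}
	return rc;
}
```

In this model, setting key 0 on an unpopulated page succeeds without
triggering a fault, while a nonzero key still goes through the fault path
once before it is stored.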

2021-09-09 16:27:51

by David Hildenbrand

Subject: [PATCH resend RFC 5/9] s390/uv: fully validate the VMA before calling follow_page()

We should not walk/touch page tables outside of VMA boundaries when
holding only the mmap sem in read mode. Evil user space can modify the
VMA layout just before this function runs and e.g., trigger races with
page table removal code since commit dd2283f2605e ("mm: mmap: zap pages
with read mmap_sem in munmap").

find_vma() does not check if the address is >= the VMA start address;
use vma_lookup() instead.

Fixes: 214d9bbcd3a6 ("s390/mm: provide memory management functions for protected KVM guests")
Signed-off-by: David Hildenbrand <[email protected]>
---
arch/s390/kernel/uv.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/s390/kernel/uv.c b/arch/s390/kernel/uv.c
index aeb0a15bcbb7..193205fb2777 100644
--- a/arch/s390/kernel/uv.c
+++ b/arch/s390/kernel/uv.c
@@ -227,7 +227,7 @@ int gmap_make_secure(struct gmap *gmap, unsigned long gaddr, void *uvcb)
uaddr = __gmap_translate(gmap, gaddr);
if (IS_ERR_VALUE(uaddr))
goto out;
- vma = find_vma(gmap->mm, uaddr);
+ vma = vma_lookup(gmap->mm, uaddr);
if (!vma)
goto out;
/*
--
2.31.1

2021-09-09 16:28:49

by David Hildenbrand

Subject: [PATCH resend RFC 7/9] s390/mm: no need for pte_alloc_map_lock() if we know the pmd is present

pte_offset_map_lock() is sufficient.

Signed-off-by: David Hildenbrand <[email protected]>
---
arch/s390/mm/pgtable.c | 15 +++------------
1 file changed, 3 insertions(+), 12 deletions(-)

diff --git a/arch/s390/mm/pgtable.c b/arch/s390/mm/pgtable.c
index 5fb409ff7842..4e77b8ebdcc5 100644
--- a/arch/s390/mm/pgtable.c
+++ b/arch/s390/mm/pgtable.c
@@ -814,10 +814,7 @@ int set_guest_storage_key(struct mm_struct *mm, unsigned long addr,
}
spin_unlock(ptl);

- ptep = pte_alloc_map_lock(mm, pmdp, addr, &ptl);
- if (unlikely(!ptep))
- return -EFAULT;
-
+ ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
new = old = pgste_get_lock(ptep);
pgste_val(new) &= ~(PGSTE_GR_BIT | PGSTE_GC_BIT |
PGSTE_ACC_BITS | PGSTE_FP_BIT);
@@ -912,10 +909,7 @@ int reset_guest_reference_bit(struct mm_struct *mm, unsigned long addr)
}
spin_unlock(ptl);

- ptep = pte_alloc_map_lock(mm, pmdp, addr, &ptl);
- if (unlikely(!ptep))
- return -EFAULT;
-
+ ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
new = old = pgste_get_lock(ptep);
/* Reset guest reference bit only */
pgste_val(new) &= ~PGSTE_GR_BIT;
@@ -977,10 +971,7 @@ int get_guest_storage_key(struct mm_struct *mm, unsigned long addr,
}
spin_unlock(ptl);

- ptep = pte_alloc_map_lock(mm, pmdp, addr, &ptl);
- if (unlikely(!ptep))
- return -EFAULT;
-
+ ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
pgste = pgste_get_lock(ptep);
*key = (pgste_val(pgste) & (PGSTE_ACC_BITS | PGSTE_FP_BIT)) >> 56;
paddr = pte_val(*ptep) & PAGE_MASK;
--
2.31.1
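
The difference between the two helpers, as used in this patch, can be
sketched in user space as "look up or allocate" versus "look up only". This
is a rough model under stated assumptions: struct model_pmd,
MODEL_PTRS_PER_PTE and the model_*() functions are hypothetical, the PTE
table is a tiny heap array, and the locking half of both helpers is omitted.

```c
#include <stddef.h>
#include <stdlib.h>

#define MODEL_PTRS_PER_PTE 4	/* tiny table, for illustration only */

/* Hypothetical pmd entry: points to a PTE table once it is present. */
struct model_pmd {
	unsigned long *pte_table;	/* NULL while not present */
};

/*
 * Models pte_alloc_map_lock(): allocate the PTE table if it is missing.
 * The allocation can fail, so callers have to check for NULL.
 */
static unsigned long *model_pte_alloc_map(struct model_pmd *pmd,
					  unsigned int idx)
{
	if (!pmd->pte_table) {
		pmd->pte_table = calloc(MODEL_PTRS_PER_PTE,
					sizeof(*pmd->pte_table));
		if (!pmd->pte_table)
			return NULL;
	}
	return &pmd->pte_table[idx % MODEL_PTRS_PER_PTE];
}

/*
 * Models pte_offset_map_lock() as used in the patch: the caller has
 * already established that the pmd is present, so nothing is allocated
 * and the result is used without a NULL check.
 */
static unsigned long *model_pte_offset_map(struct model_pmd *pmd,
					   unsigned int idx)
{
	return &pmd->pte_table[idx % MODEL_PTRS_PER_PTE];
}
```

Once the table exists, both functions return the same slot; the lookup-only
variant just skips the allocation and its error path.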

2021-09-09 16:29:22

by David Hildenbrand

Subject: [PATCH resend RFC 9/9] s390/mm: optimize reset_guest_reference_bit()

We already optimize get_guest_storage_key() to assume that the storage key
is 0 if we have neither a PTE table nor a huge page mapped.

Similarly, optimize reset_guest_reference_bit() to simply do nothing if
there is no PTE table and no huge page mapped.

Signed-off-by: David Hildenbrand <[email protected]>
---
arch/s390/mm/pgtable.c | 14 ++++++++++++--
1 file changed, 12 insertions(+), 2 deletions(-)

diff --git a/arch/s390/mm/pgtable.c b/arch/s390/mm/pgtable.c
index 534939a3eca5..50ab2fed3397 100644
--- a/arch/s390/mm/pgtable.c
+++ b/arch/s390/mm/pgtable.c
@@ -901,13 +901,23 @@ int reset_guest_reference_bit(struct mm_struct *mm, unsigned long addr)
pte_t *ptep;
int cc = 0;

- if (pmd_lookup(mm, addr, &pmdp))
+ /*
+ * If we don't have a PTE table and if there is no huge page mapped,
+ * the storage key is 0 and there is nothing for us to do.
+ */
+ switch (pmd_lookup(mm, addr, &pmdp)) {
+ case -ENOENT:
+ return 0;
+ case 0:
+ break;
+ default:
return -EFAULT;
+ }

ptl = pmd_lock(mm, pmdp);
if (!pmd_present(*pmdp)) {
spin_unlock(ptl);
- return -EFAULT;
+ return 0;
}

if (pmd_large(*pmdp)) {
--
2.31.1

2021-09-14 16:52:34

by Claudio Imbrenda

Subject: Re: [PATCH resend RFC 0/9] s390: fixes, cleanups and optimizations for page table walkers

On Thu, 9 Sep 2021 18:22:39 +0200
David Hildenbrand <[email protected]> wrote:

> Resend because I missed ccing people on the actual patches ...
>
> RFC because the patches are essentially untested and I did not actually
> try to trigger any of the things these patches are supposed to fix. It

this is an interesting series, and the code makes sense, but I would
really like to see some regression tests, and maybe even some
selftests to trigger (at least some of) the issues.

the follow-up question is: how did we manage to go on so long without
noticing these issues? :D

> merely matches my current understanding (and what other code does :) ). I
> did compile-test as far as possible.
>
> After learning more about the wonderful world of page tables and their
> interaction with the mmap_sem and VMAs, I spotted some issues in our
> page table walkers that allow user space to trigger nasty behavior when
> playing dirty tricks with munmap() or mmap() of hugetlb. While some issues
> should be hard to trigger, others are fairly easy because we provide
> convenient interfaces (e.g., KVM_S390_GET_SKEYS and KVM_S390_SET_SKEYS).
>
> Future work:
> - Don't use get_locked_pte() when it's not required to actually allocate
> page tables -- similar to how storage keys are now handled. Examples are
> get_pgste() and __gmap_zap.
> - Don't use get_locked_pte() and instead let page fault logic allocate page
> tables when we actually do need page tables -- also, similar to how
> storage keys are now handled. Examples are set_pgste_bits() and
> pgste_perform_essa().
> - Maybe switch to mm/pagewalk.c to avoid custom page table walkers. For
> __gmap_zap() that's very easy.
>
> Cc: Christian Borntraeger <[email protected]>
> Cc: Janosch Frank <[email protected]>
> Cc: Cornelia Huck <[email protected]>
> Cc: Claudio Imbrenda <[email protected]>
> Cc: Heiko Carstens <[email protected]>
> Cc: Vasily Gorbik <[email protected]>
> Cc: Niklas Schnelle <[email protected]>
> Cc: Gerald Schaefer <[email protected]>
> Cc: Ulrich Weigand <[email protected]>
>
> David Hildenbrand (9):
> s390/gmap: validate VMA in __gmap_zap()
> s390/gmap: don't unconditionally call pte_unmap_unlock() in
> __gmap_zap()
> s390/mm: validate VMA in PGSTE manipulation functions
> s390/mm: fix VMA and page table handling code in storage key handling
> functions
> s390/uv: fully validate the VMA before calling follow_page()
> s390/pci_mmio: fully validate the VMA before calling follow_pte()
> s390/mm: no need for pte_alloc_map_lock() if we know the pmd is
> present
> s390/mm: optimize set_guest_storage_key()
> s390/mm: optimize reset_guest_reference_bit()
>
> arch/s390/kernel/uv.c | 2 +-
> arch/s390/mm/gmap.c | 11 +++-
> arch/s390/mm/pgtable.c | 109 +++++++++++++++++++++++++++------------
> arch/s390/pci/pci_mmio.c | 4 +-
> 4 files changed, 89 insertions(+), 37 deletions(-)
>
>
> base-commit: 7d2a07b769330c34b4deabeed939325c77a7ec2f

2021-09-14 16:55:01

by Claudio Imbrenda

Subject: Re: [PATCH resend RFC 1/9] s390/gmap: validate VMA in __gmap_zap()

On Thu, 9 Sep 2021 18:22:40 +0200
David Hildenbrand <[email protected]> wrote:

> We should not walk/touch page tables outside of VMA boundaries when
> holding only the mmap sem in read mode. Evil user space can modify the
> VMA layout just before this function runs and e.g., trigger races with
> page table removal code since commit dd2283f2605e ("mm: mmap: zap pages
> with read mmap_sem in munmap"). The mere presence in our guest_to_host
> radix tree does not imply that there is a VMA.
>
> Further, we should not allocate page tables (via get_locked_pte()) outside
> of VMA boundaries: if evil user space decides to map hugetlbfs to these
> ranges, bad things will happen because we suddenly have PTE or PMD page
> tables where we shouldn't have them.
>
> Similarly, we have to check if we suddenly find a hugetlbfs VMA, before
> calling get_locked_pte().
>
> Note that gmap_discard() is different:
> zap_page_range()->unmap_single_vma() makes sure to stay within VMA
> boundaries.
>
> Fixes: b31288fa83b2 ("s390/kvm: support collaborative memory management")
> Signed-off-by: David Hildenbrand <[email protected]>

Reviewed-by: Claudio Imbrenda <[email protected]>

> ---
> arch/s390/mm/gmap.c | 6 ++++++
> 1 file changed, 6 insertions(+)
>
> diff --git a/arch/s390/mm/gmap.c b/arch/s390/mm/gmap.c
> index 9bb2c7512cd5..b6b56cd4ca64 100644
> --- a/arch/s390/mm/gmap.c
> +++ b/arch/s390/mm/gmap.c
> @@ -673,6 +673,7 @@ EXPORT_SYMBOL_GPL(gmap_fault);
> */
> void __gmap_zap(struct gmap *gmap, unsigned long gaddr)
> {
> + struct vm_area_struct *vma;
> unsigned long vmaddr;
> spinlock_t *ptl;
> pte_t *ptep;
> @@ -682,6 +683,11 @@ void __gmap_zap(struct gmap *gmap, unsigned long gaddr)
> gaddr >> PMD_SHIFT);
> if (vmaddr) {
> vmaddr |= gaddr & ~PMD_MASK;
> +
> + vma = vma_lookup(gmap->mm, vmaddr);
> + if (!vma || is_vm_hugetlb_page(vma))
> + return;
> +
> /* Get pointer to the page table entry */
> ptep = get_locked_pte(gmap->mm, vmaddr, &ptl);
> if (likely(ptep))

2021-09-14 16:56:29

by Claudio Imbrenda

Subject: Re: [PATCH resend RFC 5/9] s390/uv: fully validate the VMA before calling follow_page()

On Thu, 9 Sep 2021 18:22:44 +0200
David Hildenbrand <[email protected]> wrote:

> We should not walk/touch page tables outside of VMA boundaries when
> holding only the mmap sem in read mode. Evil user space can modify the
> VMA layout just before this function runs and e.g., trigger races with
> page table removal code since commit dd2283f2605e ("mm: mmap: zap pages
> with read mmap_sem in munmap").
>
> find_vma() does not check if the address is >= the VMA start address;
> use vma_lookup() instead.
>
> Fixes: 214d9bbcd3a6 ("s390/mm: provide memory management functions for protected KVM guests")
> Signed-off-by: David Hildenbrand <[email protected]>

Reviewed-by: Claudio Imbrenda <[email protected]>

> ---
> arch/s390/kernel/uv.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/arch/s390/kernel/uv.c b/arch/s390/kernel/uv.c
> index aeb0a15bcbb7..193205fb2777 100644
> --- a/arch/s390/kernel/uv.c
> +++ b/arch/s390/kernel/uv.c
> @@ -227,7 +227,7 @@ int gmap_make_secure(struct gmap *gmap, unsigned long gaddr, void *uvcb)
> uaddr = __gmap_translate(gmap, gaddr);
> if (IS_ERR_VALUE(uaddr))
> goto out;
> - vma = find_vma(gmap->mm, uaddr);
> + vma = vma_lookup(gmap->mm, uaddr);
> if (!vma)
> goto out;
> /*

2021-09-14 16:56:31

by Claudio Imbrenda

Subject: Re: [PATCH resend RFC 3/9] s390/mm: validate VMA in PGSTE manipulation functions

On Thu, 9 Sep 2021 18:22:42 +0200
David Hildenbrand <[email protected]> wrote:

> We should not walk/touch page tables outside of VMA boundaries when
> holding only the mmap sem in read mode. Evil user space can modify the
> VMA layout just before this function runs and e.g., trigger races with
> page table removal code since commit dd2283f2605e ("mm: mmap: zap pages
> with read mmap_sem in munmap"). gfn_to_hva() will only translate using
> KVM memory regions, but won't validate the VMA.
>
> Further, we should not allocate page tables outside of VMA boundaries: if
> evil user space decides to map hugetlbfs to these ranges, bad things will
> happen because we suddenly have PTE or PMD page tables where we
> shouldn't have them.
>
> Similarly, we have to check if we suddenly find a hugetlbfs VMA, before
> calling get_locked_pte().
>
> Fixes: 2d42f9477320 ("s390/kvm: Add PGSTE manipulation functions")
> Signed-off-by: David Hildenbrand <[email protected]>

Reviewed-by: Claudio Imbrenda <[email protected]>

> ---
> arch/s390/mm/pgtable.c | 13 +++++++++++++
> 1 file changed, 13 insertions(+)
>
> diff --git a/arch/s390/mm/pgtable.c b/arch/s390/mm/pgtable.c
> index eec3a9d7176e..54969e0f3a94 100644
> --- a/arch/s390/mm/pgtable.c
> +++ b/arch/s390/mm/pgtable.c
> @@ -988,6 +988,7 @@ EXPORT_SYMBOL(get_guest_storage_key);
> int pgste_perform_essa(struct mm_struct *mm, unsigned long hva, int orc,
> unsigned long *oldpte, unsigned long *oldpgste)
> {
> + struct vm_area_struct *vma;
> unsigned long pgstev;
> spinlock_t *ptl;
> pgste_t pgste;
> @@ -997,6 +998,10 @@ int pgste_perform_essa(struct mm_struct *mm, unsigned long hva, int orc,
> WARN_ON_ONCE(orc > ESSA_MAX);
> if (unlikely(orc > ESSA_MAX))
> return -EINVAL;
> +
> + vma = vma_lookup(mm, hva);
> + if (!vma || is_vm_hugetlb_page(vma))
> + return -EFAULT;
> ptep = get_locked_pte(mm, hva, &ptl);
> if (unlikely(!ptep))
> return -EFAULT;
> @@ -1089,10 +1094,14 @@ EXPORT_SYMBOL(pgste_perform_essa);
> int set_pgste_bits(struct mm_struct *mm, unsigned long hva,
> unsigned long bits, unsigned long value)
> {
> + struct vm_area_struct *vma;
> spinlock_t *ptl;
> pgste_t new;
> pte_t *ptep;
>
> + vma = vma_lookup(mm, hva);
> + if (!vma || is_vm_hugetlb_page(vma))
> + return -EFAULT;
> ptep = get_locked_pte(mm, hva, &ptl);
> if (unlikely(!ptep))
> return -EFAULT;
> @@ -1117,9 +1126,13 @@ EXPORT_SYMBOL(set_pgste_bits);
> */
> int get_pgste(struct mm_struct *mm, unsigned long hva, unsigned long *pgstep)
> {
> + struct vm_area_struct *vma;
> spinlock_t *ptl;
> pte_t *ptep;
>
> + vma = vma_lookup(mm, hva);
> + if (!vma || is_vm_hugetlb_page(vma))
> + return -EFAULT;
> ptep = get_locked_pte(mm, hva, &ptl);
> if (unlikely(!ptep))
> return -EFAULT;

2021-09-14 16:56:37

by Claudio Imbrenda

Subject: Re: [PATCH resend RFC 6/9] s390/pci_mmio: fully validate the VMA before calling follow_pte()

On Thu, 9 Sep 2021 18:22:45 +0200
David Hildenbrand <[email protected]> wrote:

> We should not walk/touch page tables outside of VMA boundaries when
> holding only the mmap sem in read mode. Evil user space can modify the
> VMA layout just before this function runs and e.g., trigger races with
> page table removal code since commit dd2283f2605e ("mm: mmap: zap pages
> with read mmap_sem in munmap").
>
> find_vma() does not check if the address is >= the VMA start address;
> use vma_lookup() instead.
>
> Fixes: dd2283f2605e ("mm: mmap: zap pages with read mmap_sem in munmap")
> Signed-off-by: David Hildenbrand <[email protected]>

Reviewed-by: Claudio Imbrenda <[email protected]>

> ---
> arch/s390/pci/pci_mmio.c | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/arch/s390/pci/pci_mmio.c b/arch/s390/pci/pci_mmio.c
> index ae683aa623ac..c5b35ea129cf 100644
> --- a/arch/s390/pci/pci_mmio.c
> +++ b/arch/s390/pci/pci_mmio.c
> @@ -159,7 +159,7 @@ SYSCALL_DEFINE3(s390_pci_mmio_write, unsigned long, mmio_addr,
> mmap_read_lock(current->mm);
> ret = -EINVAL;
> - vma = find_vma(current->mm, mmio_addr);
> + vma = vma_lookup(current->mm, mmio_addr);
> if (!vma)
> goto out_unlock_mmap;
> if (!(vma->vm_flags & (VM_IO | VM_PFNMAP)))
> @@ -298,7 +298,7 @@ SYSCALL_DEFINE3(s390_pci_mmio_read, unsigned long, mmio_addr,
> mmap_read_lock(current->mm);
> ret = -EINVAL;
> - vma = find_vma(current->mm, mmio_addr);
> + vma = vma_lookup(current->mm, mmio_addr);
> if (!vma)
> goto out_unlock_mmap;
> if (!(vma->vm_flags & (VM_IO | VM_PFNMAP)))

2021-09-14 16:58:25

by Claudio Imbrenda

Subject: Re: [PATCH resend RFC 7/9] s390/mm: no need for pte_alloc_map_lock() if we know the pmd is present

On Thu, 9 Sep 2021 18:22:46 +0200
David Hildenbrand <[email protected]> wrote:

> pte_offset_map_lock() is sufficient.

Can you explain the difference and why it is enough?

>
> Signed-off-by: David Hildenbrand <[email protected]>
> ---
> arch/s390/mm/pgtable.c | 15 +++------------
> 1 file changed, 3 insertions(+), 12 deletions(-)
>
> diff --git a/arch/s390/mm/pgtable.c b/arch/s390/mm/pgtable.c
> index 5fb409ff7842..4e77b8ebdcc5 100644
> --- a/arch/s390/mm/pgtable.c
> +++ b/arch/s390/mm/pgtable.c
> @@ -814,10 +814,7 @@ int set_guest_storage_key(struct mm_struct *mm, unsigned long addr,
> }
> spin_unlock(ptl);
>
> - ptep = pte_alloc_map_lock(mm, pmdp, addr, &ptl);
> - if (unlikely(!ptep))
> - return -EFAULT;
> -
> + ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
> new = old = pgste_get_lock(ptep);
> pgste_val(new) &= ~(PGSTE_GR_BIT | PGSTE_GC_BIT |
> PGSTE_ACC_BITS | PGSTE_FP_BIT);
> @@ -912,10 +909,7 @@ int reset_guest_reference_bit(struct mm_struct *mm, unsigned long addr)
> }
> spin_unlock(ptl);
>
> - ptep = pte_alloc_map_lock(mm, pmdp, addr, &ptl);
> - if (unlikely(!ptep))
> - return -EFAULT;
> -
> + ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
> new = old = pgste_get_lock(ptep);
> /* Reset guest reference bit only */
> pgste_val(new) &= ~PGSTE_GR_BIT;
> @@ -977,10 +971,7 @@ int get_guest_storage_key(struct mm_struct *mm, unsigned long addr,
> }
> spin_unlock(ptl);
>
> - ptep = pte_alloc_map_lock(mm, pmdp, addr, &ptl);
> - if (unlikely(!ptep))
> - return -EFAULT;
> -
> + ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
> pgste = pgste_get_lock(ptep);
> *key = (pgste_val(pgste) & (PGSTE_ACC_BITS | PGSTE_FP_BIT)) >> 56;
> paddr = pte_val(*ptep) & PAGE_MASK;

2021-09-14 17:28:13

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH resend RFC 7/9] s390/mm: no need for pte_alloc_map_lock() if we know the pmd is present

On 14.09.21 18:54, Claudio Imbrenda wrote:
> On Thu, 9 Sep 2021 18:22:46 +0200
> David Hildenbrand <[email protected]> wrote:
>
>> pte_offset_map_lock() is sufficient.
>
> Can you explain the difference and why it is enough?

I didn't repeat the $subject:

"No need for pte_alloc_map_lock() if we know the pmd is present;
pte_offset_map_lock() is sufficient, because there isn't anything to allocate."
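
To illustrate the point, here is a minimal user-space sketch, not kernel code. It assumes a toy pmd entry that either points to a PTE table or is NULL; all toy_* names are hypothetical, and the locking that the real helpers perform is left out. The "alloc" variant only differs from the plain "offset" variant when the table is missing, so once the pmd is known to be present both return the same entry and nothing is allocated.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define TOY_PTRS_PER_PTE 4

/* Hypothetical toy pmd entry: NULL means "no PTE table present". */
struct toy_pmd {
	long *pte_table;
};

/* Counts how often the "alloc" variant really had to allocate. */
static int toy_allocations;

/* Models the "offset" flavour: only indexes into an existing table. */
static long *toy_pte_offset(struct toy_pmd *pmd, size_t idx)
{
	return &pmd->pte_table[idx % TOY_PTRS_PER_PTE];
}

/*
 * Models the "alloc" flavour: allocate the table if it is missing,
 * then index into it. Returns NULL if the allocation fails.
 */
static long *toy_pte_alloc(struct toy_pmd *pmd, size_t idx)
{
	if (!pmd->pte_table) {
		pmd->pte_table = calloc(TOY_PTRS_PER_PTE, sizeof(long));
		if (!pmd->pte_table)
			return NULL;
		toy_allocations++;
	}
	return toy_pte_offset(pmd, idx);
}
```

In this model, the NULL check after toy_pte_alloc() can only trigger on the path that allocates, which is why the patch can drop the corresponding -EFAULT handling.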

>
>>
>> Signed-off-by: David Hildenbrand <[email protected]>
>> ---
>> arch/s390/mm/pgtable.c | 15 +++------------
>> 1 file changed, 3 insertions(+), 12 deletions(-)
>>
>> diff --git a/arch/s390/mm/pgtable.c b/arch/s390/mm/pgtable.c
>> index 5fb409ff7842..4e77b8ebdcc5 100644
>> --- a/arch/s390/mm/pgtable.c
>> +++ b/arch/s390/mm/pgtable.c
>> @@ -814,10 +814,7 @@ int set_guest_storage_key(struct mm_struct *mm, unsigned long addr,
>> }
>> spin_unlock(ptl);
>>
>> - ptep = pte_alloc_map_lock(mm, pmdp, addr, &ptl);
>> - if (unlikely(!ptep))
>> - return -EFAULT;
>> -
>> + ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
>> new = old = pgste_get_lock(ptep);
>> pgste_val(new) &= ~(PGSTE_GR_BIT | PGSTE_GC_BIT |
>> PGSTE_ACC_BITS | PGSTE_FP_BIT);
>> @@ -912,10 +909,7 @@ int reset_guest_reference_bit(struct mm_struct *mm, unsigned long addr)
>> }
>> spin_unlock(ptl);
>>
>> - ptep = pte_alloc_map_lock(mm, pmdp, addr, &ptl);
>> - if (unlikely(!ptep))
>> - return -EFAULT;
>> -
>> + ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
>> new = old = pgste_get_lock(ptep);
>> /* Reset guest reference bit only */
>> pgste_val(new) &= ~PGSTE_GR_BIT;
>> @@ -977,10 +971,7 @@ int get_guest_storage_key(struct mm_struct *mm, unsigned long addr,
>> }
>> spin_unlock(ptl);
>>
>> - ptep = pte_alloc_map_lock(mm, pmdp, addr, &ptl);
>> - if (unlikely(!ptep))
>> - return -EFAULT;
>> -
>> + ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
>> pgste = pgste_get_lock(ptep);
>> *key = (pgste_val(pgste) & (PGSTE_ACC_BITS | PGSTE_FP_BIT)) >> 56;
>> paddr = pte_val(*ptep) & PAGE_MASK;
>


--
Thanks,

David / dhildenb

2021-09-14 18:08:41

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH resend RFC 0/9] s390: fixes, cleanups and optimizations for page table walkers

On 14.09.21 18:50, Claudio Imbrenda wrote:
> On Thu, 9 Sep 2021 18:22:39 +0200
> David Hildenbrand <[email protected]> wrote:
>
>> Resend because I missed ccing people on the actual patches ...
>>
>> RFC because the patches are essentially untested and I did not actually
>> try to trigger any of the things these patches are supposed to fix. It
>
> this is an interesting series, and the code makes sense, but I would
> really like to see some regression tests, and maybe even some
> selftests to trigger (at least some of) the issues.

Yep, it most certainly needs regression testing before picking any of
this. selftests would be great, but I won't find time for it in the
foreseeable future.

>
> the follow-up question is: how did we manage to go on so long without
> noticing these issues? :D

Excellent question - I guess we simply weren't aware of the dos and
don'ts when dealing with process page tables :)

>
>> merely matches my current understanding (and what other code does :) ). I
>> did compile-test as far as possible.
>>
>> After learning more about the wonderful world of page tables and their
>> interaction with the mmap_sem and VMAs, I spotted some issues in our
>> page table walkers that allow user space to trigger nasty behavior when
>> playing dirty tricks with munmap() or mmap() of hugetlb. While some issues
>> should be hard to trigger, others are fairly easy because we provide
>> convenient interfaces (e.g., KVM_S390_GET_SKEYS and KVM_S390_SET_SKEYS).
>>
>> Future work:
>> - Don't use get_locked_pte() when it's not required to actually allocate
>> page tables -- similar to how storage keys are now handled. Examples are
>> get_pgste() and __gmap_zap.
>> - Don't use get_locked_pte() and instead let page fault logic allocate page
>> tables when we actually do need page tables -- also, similar to how
>> storage keys are now handled. Examples are set_pgste_bits() and
>> pgste_perform_essa().
>> - Maybe switch to mm/pagewalk.c to avoid custom page table walkers. For
>> __gmap_zap() that's very easy.
>>
>> Cc: Christian Borntraeger <[email protected]>
>> Cc: Janosch Frank <[email protected]>
>> Cc: Cornelia Huck <[email protected]>
>> Cc: Claudio Imbrenda <[email protected]>
>> Cc: Heiko Carstens <[email protected]>
>> Cc: Vasily Gorbik <[email protected]>
>> Cc: Niklas Schnelle <[email protected]>
>> Cc: Gerald Schaefer <[email protected]>
>> Cc: Ulrich Weigand <[email protected]>
>>
>> David Hildenbrand (9):
>> s390/gmap: validate VMA in __gmap_zap()
>> s390/gmap: don't unconditionally call pte_unmap_unlock() in
>> __gmap_zap()
>> s390/mm: validate VMA in PGSTE manipulation functions
>> s390/mm: fix VMA and page table handling code in storage key handling
>> functions
>> s390/uv: fully validate the VMA before calling follow_page()
>> s390/pci_mmio: fully validate the VMA before calling follow_pte()
>> s390/mm: no need for pte_alloc_map_lock() if we know the pmd is
>> present
>> s390/mm: optimize set_guest_storage_key()
>> s390/mm: optimize reset_guest_reference_bit()
>>
>> arch/s390/kernel/uv.c | 2 +-
>> arch/s390/mm/gmap.c | 11 +++-
>> arch/s390/mm/pgtable.c | 109 +++++++++++++++++++++++++++------------
>> arch/s390/pci/pci_mmio.c | 4 +-
>> 4 files changed, 89 insertions(+), 37 deletions(-)
>>
>>
>> base-commit: 7d2a07b769330c34b4deabeed939325c77a7ec2f
>


--
Thanks,

David / dhildenb

2021-09-14 22:43:10

by Liam R. Howlett

[permalink] [raw]
Subject: Re: [PATCH resend RFC 5/9] s390/uv: fully validate the VMA before calling follow_page()

Reviewed-by: Liam R. Howlett <[email protected]>

* David Hildenbrand <[email protected]> [210909 12:23]:
> We should not walk/touch page tables outside of VMA boundaries when
> holding only the mmap sem in read mode. Evil user space can modify the
> VMA layout just before this function runs and e.g., trigger races with
> page table removal code since commit dd2283f2605e ("mm: mmap: zap pages
> with read mmap_sem in munmap").
>
> find_vma() does not check if the address is >= the VMA start address;
> use vma_lookup() instead.
>
> Fixes: 214d9bbcd3a6 ("s390/mm: provide memory management functions for protected KVM guests")
> Signed-off-by: David Hildenbrand <[email protected]>
> ---
> arch/s390/kernel/uv.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/arch/s390/kernel/uv.c b/arch/s390/kernel/uv.c
> index aeb0a15bcbb7..193205fb2777 100644
> --- a/arch/s390/kernel/uv.c
> +++ b/arch/s390/kernel/uv.c
> @@ -227,7 +227,7 @@ int gmap_make_secure(struct gmap *gmap, unsigned long gaddr, void *uvcb)
> uaddr = __gmap_translate(gmap, gaddr);
> if (IS_ERR_VALUE(uaddr))
> goto out;
> - vma = find_vma(gmap->mm, uaddr);
> + vma = vma_lookup(gmap->mm, uaddr);
> if (!vma)
> goto out;
> /*
> --
> 2.31.1
>
>
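
The difference between the two lookups can be sketched in user space as follows. This is only a model under stated assumptions: VMAs are a sorted array of half-open [vm_start, vm_end) ranges, and the toy_* names are hypothetical stand-ins rather than the kernel functions. The find variant returns the first range that ends above the address, even if the address lies in a gap below it; the lookup variant adds the start check described in the commit message.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical toy VMA covering the half-open range [vm_start, vm_end). */
struct toy_vma {
	unsigned long vm_start;
	unsigned long vm_end;
};

/*
 * Models find_vma(): return the first VMA with addr < vm_end.
 * The array must be sorted by address. The result may start above
 * addr, i.e. addr can sit in a hole in front of the returned VMA.
 */
static const struct toy_vma *toy_find_vma(const struct toy_vma *vmas,
					   size_t n, unsigned long addr)
{
	for (size_t i = 0; i < n; i++)
		if (addr < vmas[i].vm_end)
			return &vmas[i];
	return NULL;
}

/* Models vma_lookup(): additionally require vm_start <= addr. */
static const struct toy_vma *toy_vma_lookup(const struct toy_vma *vmas,
					     size_t n, unsigned long addr)
{
	const struct toy_vma *vma = toy_find_vma(vmas, n, addr);

	if (vma && addr < vma->vm_start)
		return NULL;
	return vma;
}
```

With ranges [0x1000, 0x2000) and [0x5000, 0x6000), an address of 0x3000 falls into the gap: the find variant still hands back the second range, while the lookup variant returns NULL, so a caller that only checks for NULL no longer proceeds with an address outside any VMA.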

2021-09-14 22:45:25

by Liam R. Howlett

[permalink] [raw]
Subject: Re: [PATCH resend RFC 6/9] s390/pci_mmio: fully validate the VMA before calling follow_pte()

Reviewed-by: Liam R. Howlett <[email protected]>

* David Hildenbrand <[email protected]> [210909 12:24]:
> We should not walk/touch page tables outside of VMA boundaries when
> holding only the mmap sem in read mode. Evil user space can modify the
> VMA layout just before this function runs and e.g., trigger races with
> page table removal code since commit dd2283f2605e ("mm: mmap: zap pages
> with read mmap_sem in munmap").
>
> find_vma() does not check if the address is >= the VMA start address;
> use vma_lookup() instead.
>
> Fixes: dd2283f2605e ("mm: mmap: zap pages with read mmap_sem in munmap")
> Signed-off-by: David Hildenbrand <[email protected]>
> ---
> arch/s390/pci/pci_mmio.c | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/arch/s390/pci/pci_mmio.c b/arch/s390/pci/pci_mmio.c
> index ae683aa623ac..c5b35ea129cf 100644
> --- a/arch/s390/pci/pci_mmio.c
> +++ b/arch/s390/pci/pci_mmio.c
> @@ -159,7 +159,7 @@ SYSCALL_DEFINE3(s390_pci_mmio_write, unsigned long, mmio_addr,
>
> mmap_read_lock(current->mm);
> ret = -EINVAL;
> - vma = find_vma(current->mm, mmio_addr);
> + vma = vma_lookup(current->mm, mmio_addr);
> if (!vma)
> goto out_unlock_mmap;
> if (!(vma->vm_flags & (VM_IO | VM_PFNMAP)))
> @@ -298,7 +298,7 @@ SYSCALL_DEFINE3(s390_pci_mmio_read, unsigned long, mmio_addr,
>
> mmap_read_lock(current->mm);
> ret = -EINVAL;
> - vma = find_vma(current->mm, mmio_addr);
> + vma = vma_lookup(current->mm, mmio_addr);
> if (!vma)
> goto out_unlock_mmap;
> if (!(vma->vm_flags & (VM_IO | VM_PFNMAP)))
> --
> 2.31.1
>
>

2021-09-27 16:44:39

by Claudio Imbrenda

[permalink] [raw]
Subject: Re: [PATCH resend RFC 4/9] s390/mm: fix VMA and page table handling code in storage key handling functions

On Thu, 9 Sep 2021 18:22:43 +0200
David Hildenbrand <[email protected]> wrote:

> There are multiple things broken about our storage key handling
> functions:
>
> 1. We should not walk/touch page tables outside of VMA boundaries when
> holding only the mmap sem in read mode. Evil user space can modify the
> VMA layout just before this function runs and e.g., trigger races with
> page table removal code since commit dd2283f2605e ("mm: mmap: zap pages
> with read mmap_sem in munmap"). gfn_to_hva() will only translate using
> KVM memory regions, but won't validate the VMA.
>
> 2. We should not allocate page tables outside of VMA boundaries: if
> evil user space decides to map hugetlbfs to these ranges, bad things
> will happen because we suddenly have PTE or PMD page tables where we
> shouldn't have them.
>
> 3. We don't handle large PUDs that might have suddenly appeared inside our
> page table hierarchy.
>
> Don't manually allocate page tables, properly validate that we have a VMA,
> and bail out on pud_large().
>
> All callers of page table handling functions, except
> get_guest_storage_key(), call fixup_user_fault() in case they
> receive an -EFAULT and retry; this will allocate the necessary page tables
> if required.
>
> To keep get_guest_storage_key() working as expected and not requiring
> kvm_s390_get_skeys() to call fixup_user_fault(), distinguish between
> "there is simply no page table or huge page yet and the key is assumed
> to be 0" and "this is a fault to be reported".
>
> Although commit 637ff9efe5ea ("s390/mm: Add huge pmd storage key handling")
> introduced most of the affected code, it was actually already broken
> before when using get_locked_pte() without any VMA checks.
>
> Note: Ever since commit 637ff9efe5ea ("s390/mm: Add huge pmd storage key
> handling") we can no longer set a guest storage key (for example from
> QEMU during VM live migration) without actually resolving a fault.
> Although we would have created most page tables, we would choke on the
> !pmd_present(), requiring a call to fixup_user_fault(). I would
> have thought that this is problematic in combination with postcopy live
> migration ... but nobody noticed and this patch doesn't change the
> situation. So maybe it's just fine.
>
> Fixes: 9fcf93b5de06 ("KVM: S390: Create helper function get_guest_storage_key")
> Fixes: 24d5dd0208ed ("s390/kvm: Provide function for setting the guest storage key")
> Fixes: a7e19ab55ffd ("KVM: s390: handle missing storage-key facility")
> Signed-off-by: David Hildenbrand <[email protected]>

Reviewed-by: Claudio Imbrenda <[email protected]>

> ---
> arch/s390/mm/pgtable.c | 57 +++++++++++++++++++++++++++++-------------
> 1 file changed, 39 insertions(+), 18 deletions(-)
>
> diff --git a/arch/s390/mm/pgtable.c b/arch/s390/mm/pgtable.c
> index 54969e0f3a94..5fb409ff7842 100644
> --- a/arch/s390/mm/pgtable.c
> +++ b/arch/s390/mm/pgtable.c
> @@ -429,22 +429,36 @@ static inline pmd_t pmdp_flush_lazy(struct mm_struct *mm,
> }
>
> #ifdef CONFIG_PGSTE
> -static pmd_t *pmd_alloc_map(struct mm_struct *mm, unsigned long addr)
> +static int pmd_lookup(struct mm_struct *mm, unsigned long addr, pmd_t **pmdp)
> {
> + struct vm_area_struct *vma;
> pgd_t *pgd;
> p4d_t *p4d;
> pud_t *pud;
> - pmd_t *pmd;
> +
> + /* We need a valid VMA, otherwise this is clearly a fault. */
> + vma = vma_lookup(mm, addr);
> + if (!vma)
> + return -EFAULT;
>
> pgd = pgd_offset(mm, addr);
> - p4d = p4d_alloc(mm, pgd, addr);
> - if (!p4d)
> - return NULL;
> - pud = pud_alloc(mm, p4d, addr);
> - if (!pud)
> - return NULL;
> - pmd = pmd_alloc(mm, pud, addr);
> - return pmd;
> + if (!pgd_present(*pgd))
> + return -ENOENT;
> +
> + p4d = p4d_offset(pgd, addr);
> + if (!p4d_present(*p4d))
> + return -ENOENT;
> +
> + pud = pud_offset(p4d, addr);
> + if (!pud_present(*pud))
> + return -ENOENT;
> +
> + /* Large PUDs are not supported yet. */
> + if (pud_large(*pud))
> + return -EFAULT;
> +
> + *pmdp = pmd_offset(pud, addr);
> + return 0;
> }
> #endif
>
> @@ -778,8 +792,7 @@ int set_guest_storage_key(struct mm_struct *mm, unsigned long addr,
> pmd_t *pmdp;
> pte_t *ptep;
>
> - pmdp = pmd_alloc_map(mm, addr);
> - if (unlikely(!pmdp))
> + if (pmd_lookup(mm, addr, &pmdp))
> return -EFAULT;
>
> ptl = pmd_lock(mm, pmdp);
> @@ -881,8 +894,7 @@ int reset_guest_reference_bit(struct mm_struct *mm, unsigned long addr)
> pte_t *ptep;
> int cc = 0;
>
> - pmdp = pmd_alloc_map(mm, addr);
> - if (unlikely(!pmdp))
> + if (pmd_lookup(mm, addr, &pmdp))
> return -EFAULT;
>
> ptl = pmd_lock(mm, pmdp);
> @@ -935,15 +947,24 @@ int get_guest_storage_key(struct mm_struct *mm, unsigned long addr,
> pmd_t *pmdp;
> pte_t *ptep;
>
> - pmdp = pmd_alloc_map(mm, addr);
> - if (unlikely(!pmdp))
> + /*
> + * If we don't have a PTE table and if there is no huge page mapped,
> + * the storage key is 0.
> + */
> + *key = 0;
> +
> + switch (pmd_lookup(mm, addr, &pmdp)) {
> + case -ENOENT:
> + return 0;
> + case 0:
> + break;
> + default:
> return -EFAULT;
> + }
>
> ptl = pmd_lock(mm, pmdp);
> if (!pmd_present(*pmdp)) {
> - /* Not yet mapped memory has a zero key */
> spin_unlock(ptl);
> - *key = 0;
> return 0;
> }
>
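
The return-code convention introduced by pmd_lookup() can be modelled in user space roughly like this. It is a sketch under assumptions, not the kernel implementation: the page table walk is collapsed into a hypothetical enum describing what the walk would find, and the toy_* names do not exist in the kernel. The point is only the three-way split: 0 means a pmd was found, -ENOENT means "nothing mapped yet", and -EFAULT means a real fault.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical summary of what the walk would encounter for an address. */
enum toy_state {
	TOY_NO_VMA,	/* address not covered by any VMA */
	TOY_NO_TABLE,	/* VMA exists, but an upper-level table is missing */
	TOY_LARGE_PUD,	/* large PUD, not supported by the walker */
	TOY_PMD_FOUND,	/* walk reached a pmd entry */
};

/* Models the return codes of pmd_lookup() from the patch. */
static int toy_pmd_lookup(enum toy_state s)
{
	switch (s) {
	case TOY_NO_VMA:
	case TOY_LARGE_PUD:
		return -EFAULT;
	case TOY_NO_TABLE:
		return -ENOENT;
	default:
		return 0;
	}
}

/*
 * Models the start of get_guest_storage_key(): a missing table is not
 * an error, the key is simply reported as 0. "stored" stands in for
 * the key that the real code would read from the page table entry.
 */
static int toy_get_key(enum toy_state s, unsigned char stored,
		       unsigned char *key)
{
	*key = 0;

	switch (toy_pmd_lookup(s)) {
	case -ENOENT:
		return 0;
	case 0:
		break;
	default:
		return -EFAULT;
	}

	*key = stored;
	return 0;
}
```

The other two callers in the patch treat every non-zero result as -EFAULT and rely on their own callers to invoke fixup_user_fault() and retry; only the getter distinguishes -ENOENT, so that kvm_s390_get_skeys() does not have to resolve faults.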

2021-09-27 17:04:57

by Claudio Imbrenda

[permalink] [raw]
Subject: Re: [PATCH resend RFC 9/9] s390/mm: optimize reset_guest_reference_bit()

On Thu, 9 Sep 2021 18:22:48 +0200
David Hildenbrand <[email protected]> wrote:

> We already optimize get_guest_storage_key() to assume that if we don't have
> a PTE table and don't have a huge page mapped that the storage key is 0.
>
> Similarly, optimize reset_guest_reference_bit() to simply do nothing if
> there is no PTE table and no huge page mapped.
>
> Signed-off-by: David Hildenbrand <[email protected]>

Reviewed-by: Claudio Imbrenda <[email protected]>

> ---
> arch/s390/mm/pgtable.c | 14 ++++++++++++--
> 1 file changed, 12 insertions(+), 2 deletions(-)
>
> diff --git a/arch/s390/mm/pgtable.c b/arch/s390/mm/pgtable.c
> index 534939a3eca5..50ab2fed3397 100644
> --- a/arch/s390/mm/pgtable.c
> +++ b/arch/s390/mm/pgtable.c
> @@ -901,13 +901,23 @@ int reset_guest_reference_bit(struct mm_struct *mm, unsigned long addr)
> pte_t *ptep;
> int cc = 0;
>
> - if (pmd_lookup(mm, addr, &pmdp))
> + /*
> + * If we don't have a PTE table and if there is no huge page mapped,
> + * the storage key is 0 and there is nothing for us to do.
> + */
> + switch (pmd_lookup(mm, addr, &pmdp)) {
> + case -ENOENT:
> + return 0;
> + case 0:
> + break;
> + default:
> return -EFAULT;
> + }
>
> ptl = pmd_lock(mm, pmdp);
> if (!pmd_present(*pmdp)) {
> spin_unlock(ptl);
> - return -EFAULT;
> + return 0;
> }
>
> if (pmd_large(*pmdp)) {

2021-09-27 17:10:19

by Claudio Imbrenda

[permalink] [raw]
Subject: Re: [PATCH resend RFC 8/9] s390/mm: optimize set_guest_storage_key()

On Thu, 9 Sep 2021 18:22:47 +0200
David Hildenbrand <[email protected]> wrote:

> We already optimize get_guest_storage_key() to assume that if we don't have
> a PTE table and don't have a huge page mapped that the storage key is 0.
>
> Similarly, optimize set_guest_storage_key() to simply do nothing in case
> the key to set is 0.
>
> Signed-off-by: David Hildenbrand <[email protected]>

Reviewed-by: Claudio Imbrenda <[email protected]>

> ---
> arch/s390/mm/pgtable.c | 14 ++++++++++++--
> 1 file changed, 12 insertions(+), 2 deletions(-)
>
> diff --git a/arch/s390/mm/pgtable.c b/arch/s390/mm/pgtable.c
> index 4e77b8ebdcc5..534939a3eca5 100644
> --- a/arch/s390/mm/pgtable.c
> +++ b/arch/s390/mm/pgtable.c
> @@ -792,13 +792,23 @@ int set_guest_storage_key(struct mm_struct *mm, unsigned long addr,
> pmd_t *pmdp;
> pte_t *ptep;
>
> - if (pmd_lookup(mm, addr, &pmdp))
> + /*
> + * If we don't have a PTE table and if there is no huge page mapped,
> + * we can ignore attempts to set the key to 0, because it already is 0.
> + */
> + switch (pmd_lookup(mm, addr, &pmdp)) {
> + case -ENOENT:
> + return key ? -EFAULT : 0;
> + case 0:
> + break;
> + default:
> return -EFAULT;
> + }
>
> ptl = pmd_lock(mm, pmdp);
> if (!pmd_present(*pmdp)) {
> spin_unlock(ptl);
> - return -EFAULT;
> + return key ? -EFAULT : 0;
> }
>
> if (pmd_large(*pmdp)) {

2021-09-28 11:01:22

by Heiko Carstens

[permalink] [raw]
Subject: Re: [PATCH resend RFC 0/9] s390: fixes, cleanups and optimizations for page table walkers

On Thu, Sep 09, 2021 at 06:22:39PM +0200, David Hildenbrand wrote:
> Resend because I missed ccing people on the actual patches ...
>
> RFC because the patches are essentially untested and I did not actually
> try to trigger any of the things these patches are supposed to fix. It
> merely matches my current understanding (and what other code does :) ). I
> did compile-test as far as possible.
>
> After learning more about the wonderful world of page tables and their
> interaction with the mmap_sem and VMAs, I spotted some issues in our
> page table walkers that allow user space to trigger nasty behavior when
> playing dirty tricks with munmap() or mmap() of hugetlb. While some issues
> should be hard to trigger, others are fairly easy because we provide
> convenient interfaces (e.g., KVM_S390_GET_SKEYS and KVM_S390_SET_SKEYS).
>
> Future work:
> - Don't use get_locked_pte() when it's not required to actually allocate
> page tables -- similar to how storage keys are now handled. Examples are
> get_pgste() and __gmap_zap.
> - Don't use get_locked_pte() and instead let page fault logic allocate page
> tables when we actually do need page tables -- also, similar to how
> storage keys are now handled. Examples are set_pgste_bits() and
> pgste_perform_essa().
> - Maybe switch to mm/pagewalk.c to avoid custom page table walkers. For
> __gmap_zap() that's very easy.
>
> Cc: Christian Borntraeger <[email protected]>
> Cc: Janosch Frank <[email protected]>
> Cc: Cornelia Huck <[email protected]>
> Cc: Claudio Imbrenda <[email protected]>
> Cc: Heiko Carstens <[email protected]>
> Cc: Vasily Gorbik <[email protected]>
> Cc: Niklas Schnelle <[email protected]>
> Cc: Gerald Schaefer <[email protected]>
> Cc: Ulrich Weigand <[email protected]>

For the whole series:
Acked-by: Heiko Carstens <[email protected]>

Christian, given that this is mostly about KVM I'd assume this should
go via the KVM tree. Patch 6 (pci_mmio) is already upstream.

2021-09-28 11:07:55

by Christian Borntraeger

[permalink] [raw]
Subject: Re: [PATCH resend RFC 0/9] s390: fixes, cleanups and optimizations for page table walkers


Am 28.09.21 um 12:59 schrieb Heiko Carstens:
> On Thu, Sep 09, 2021 at 06:22:39PM +0200, David Hildenbrand wrote:
>> Resend because I missed ccing people on the actual patches ...
>>
>> RFC because the patches are essentially untested and I did not actually
>> try to trigger any of the things these patches are supposed to fix. It
>> merely matches my current understanding (and what other code does :) ). I
>> did compile-test as far as possible.
>>
>> After learning more about the wonderful world of page tables and their
>> interaction with the mmap_sem and VMAs, I spotted some issues in our
>> page table walkers that allow user space to trigger nasty behavior when
>> playing dirty tricks with munmap() or mmap() of hugetlb. While some issues
>> should be hard to trigger, others are fairly easy because we provide
>> convenient interfaces (e.g., KVM_S390_GET_SKEYS and KVM_S390_SET_SKEYS).
>>
>> Future work:
>> - Don't use get_locked_pte() when it's not required to actually allocate
>> page tables -- similar to how storage keys are now handled. Examples are
>> get_pgste() and __gmap_zap.
>> - Don't use get_locked_pte() and instead let page fault logic allocate page
>> tables when we actually do need page tables -- also, similar to how
>> storage keys are now handled. Examples are set_pgste_bits() and
>> pgste_perform_essa().
>> - Maybe switch to mm/pagewalk.c to avoid custom page table walkers. For
>> __gmap_zap() that's very easy.
>>
>> Cc: Christian Borntraeger <[email protected]>
>> Cc: Janosch Frank <[email protected]>
>> Cc: Cornelia Huck <[email protected]>
>> Cc: Claudio Imbrenda <[email protected]>
>> Cc: Heiko Carstens <[email protected]>
>> Cc: Vasily Gorbik <[email protected]>
>> Cc: Niklas Schnelle <[email protected]>
>> Cc: Gerald Schaefer <[email protected]>
>> Cc: Ulrich Weigand <[email protected]>
>
> For the whole series:
> Acked-by: Heiko Carstens <[email protected]>
>
> Christian, given that this is mostly about KVM I'd assume this should
> go via the KVM tree. Patch 6 (pci_mmio) is already upstream.

Right, I think I will queue this even without testing for now.
Claudio, is patch 7 ok for you with the explanation from David?

2021-09-28 14:41:10

by Claudio Imbrenda

[permalink] [raw]
Subject: Re: [PATCH resend RFC 0/9] s390: fixes, cleanups and optimizations for page table walkers

On Tue, 28 Sep 2021 13:06:26 +0200
Christian Borntraeger <[email protected]> wrote:

> Am 28.09.21 um 12:59 schrieb Heiko Carstens:
> > On Thu, Sep 09, 2021 at 06:22:39PM +0200, David Hildenbrand wrote:
> >> Resend because I missed ccing people on the actual patches ...
> >>
> >> RFC because the patches are essentially untested and I did not actually
> >> try to trigger any of the things these patches are supposed to fix. It
> >> merely matches my current understanding (and what other code does :) ). I
> >> did compile-test as far as possible.
> >>
> >> After learning more about the wonderful world of page tables and their
> >> interaction with the mmap_sem and VMAs, I spotted some issues in our
> >> page table walkers that allow user space to trigger nasty behavior when
> >> playing dirty tricks with munmap() or mmap() of hugetlb. While some issues
> >> should be hard to trigger, others are fairly easy because we provide
> >> convenient interfaces (e.g., KVM_S390_GET_SKEYS and KVM_S390_SET_SKEYS).
> >>
> >> Future work:
> >> - Don't use get_locked_pte() when it's not required to actually allocate
> >> page tables -- similar to how storage keys are now handled. Examples are
> >> get_pgste() and __gmap_zap.
> >> - Don't use get_locked_pte() and instead let page fault logic allocate page
> >> tables when we actually do need page tables -- also, similar to how
> >> storage keys are now handled. Examples are set_pgste_bits() and
> >> pgste_perform_essa().
> >> - Maybe switch to mm/pagewalk.c to avoid custom page table walkers. For
> >> __gmap_zap() that's very easy.
> >>
> >> Cc: Christian Borntraeger <[email protected]>
> >> Cc: Janosch Frank <[email protected]>
> >> Cc: Cornelia Huck <[email protected]>
> >> Cc: Claudio Imbrenda <[email protected]>
> >> Cc: Heiko Carstens <[email protected]>
> >> Cc: Vasily Gorbik <[email protected]>
> >> Cc: Niklas Schnelle <[email protected]>
> >> Cc: Gerald Schaefer <[email protected]>
> >> Cc: Ulrich Weigand <[email protected]>
> >
> > For the whole series:
> > Acked-by: Heiko Carstens <[email protected]>
> >
> > Christian, given that this is mostly about KVM I'd assume this should
> > go via the KVM tree. Patch 6 (pci_mmio) is already upstream.
>
> Right, I think I will queue this even without testing for now.
> Claudio, is patch 7 ok for you with the explanation from David?

yes

2021-09-28 16:08:27

by Christian Borntraeger

[permalink] [raw]
Subject: Re: [PATCH resend RFC 0/9] s390: fixes, cleanups and optimizations for page table walkers

Am 09.09.21 um 18:22 schrieb David Hildenbrand:
> Resend because I missed ccing people on the actual patches ...
>
> RFC because the patches are essentially untested and I did not actually
> try to trigger any of the things these patches are supposed to fix. It
> merely matches my current understanding (and what other code does :) ). I
> did compile-test as far as possible.
>
> After learning more about the wonderful world of page tables and their
> interaction with the mmap_sem and VMAs, I spotted some issues in our
> page table walkers that allow user space to trigger nasty behavior when
> playing dirty tricks with munmap() or mmap() of hugetlb. While some issues
> should be hard to trigger, others are fairly easy because we provide
> convenient interfaces (e.g., KVM_S390_GET_SKEYS and KVM_S390_SET_SKEYS).
>
> Future work:
> - Don't use get_locked_pte() when it's not required to actually allocate
> page tables -- similar to how storage keys are now handled. Examples are
> get_pgste() and __gmap_zap.
> - Don't use get_locked_pte() and instead let page fault logic allocate page
> tables when we actually do need page tables -- also, similar to how
> storage keys are now handled. Examples are set_pgste_bits() and
> pgste_perform_essa().
> - Maybe switch to mm/pagewalk.c to avoid custom page table walkers. For
> __gmap_zap() that's very easy.
>
> Cc: Christian Borntraeger <[email protected]>
> Cc: Janosch Frank <[email protected]>
> Cc: Cornelia Huck <[email protected]>
> Cc: Claudio Imbrenda <[email protected]>
> Cc: Heiko Carstens <[email protected]>
> Cc: Vasily Gorbik <[email protected]>
> Cc: Niklas Schnelle <[email protected]>
> Cc: Gerald Schaefer <[email protected]>
> Cc: Ulrich Weigand <[email protected]>
>
> David Hildenbrand (9):
> s390/gmap: validate VMA in __gmap_zap()
> s390/gmap: don't unconditionally call pte_unmap_unlock() in
> __gmap_zap()
> s390/mm: validate VMA in PGSTE manipulation functions
> s390/mm: fix VMA and page table handling code in storage key handling
> functions
> s390/uv: fully validate the VMA before calling follow_page()
> s390/pci_mmio: fully validate the VMA before calling follow_pte()
> s390/mm: no need for pte_alloc_map_lock() if we know the pmd is
> present
> s390/mm: optimize set_guest_storage_key()
> s390/mm: optimize reset_guest_reference_bit()
>
> arch/s390/kernel/uv.c | 2 +-
> arch/s390/mm/gmap.c | 11 +++-
> arch/s390/mm/pgtable.c | 109 +++++++++++++++++++++++++++------------
> arch/s390/pci/pci_mmio.c | 4 +-
> 4 files changed, 89 insertions(+), 37 deletions(-)
>

Thanks, applied. I will run some tests on those commits, but it's already
pushed out to my next tree to give it some coverage.