2017-08-29 23:54:59

by Jerome Glisse

Subject: [PATCH 00/13] mmu_notifier kill invalidate_page callback

(Sorry for the cross-posting to so many lists and the big Cc)

Please help with testing!

The invalidate_page callback suffered from two pitfalls. First, it used to
run after the page table lock was released, so a new page might have been
set up for the virtual address before the call to invalidate_page().

This was fixed, in a roundabout way, by commit c7ab0d2fdc840266b39db94538f74207ec2afbf6,
which moved the callback under the page table lock. That in turn broke
several existing users of the mmu_notifier API that assumed they could
sleep inside this callback.

The second pitfall was that invalidate_page was the only callback that did
not take an address range to invalidate; it was given a single address and
a page. Many of the callback implementers assumed the page could never be
THP and thus failed to invalidate the appropriate range for THP pages.

By killing this callback we unify the mmu_notifier callback API to always
take a virtual address range as input.

There are now two clear APIs (I am not counting the young/aging callbacks,
which are seldom used); a minimal registration sketch follows the list:
- the invalidate_range_start()/end() callbacks (which allow you to sleep)
- invalidate_range(), where you cannot sleep, but which happens right after
  the page table update, under the page table lock
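
To make that concrete, here is a minimal, untested registration sketch for
the first option. Everything prefixed my_ below is invented for illustration
and is not part of this series:

#include <linux/mmu_notifier.h>

/* Hypothetical driver state mirroring a process address space. */
struct my_mirror {
        struct mmu_notifier mn;
        /* ... device page table state ... */
};

static void my_range_start(struct mmu_notifier *mn, struct mm_struct *mm,
                           unsigned long start, unsigned long end)
{
        /* Allowed to sleep: stop the device from using [start, end). */
}

static void my_range_end(struct mmu_notifier *mn, struct mm_struct *mm,
                         unsigned long start, unsigned long end)
{
        /* The CPU page table update is done; mirroring may resume. */
}

static const struct mmu_notifier_ops my_mn_ops = {
        .invalidate_range_start = my_range_start,
        .invalidate_range_end   = my_range_end,
};

static int my_mirror_register(struct my_mirror *mirror, struct mm_struct *mm)
{
        mirror->mn.ops = &my_mn_ops;
        return mmu_notifier_register(&mirror->mn, mm);
}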


Note that a lot of existing users look broken with respect to range_start()/
range_end(). Many users only implement the range_start() callback, but nothing
prevents them from undoing what range_start() invalidated after it returns
but before any CPU page table update takes place.

The code pattern used in kvm or umem odp is an example of how to properly
avoid such a race. In a nutshell, use some kind of sequence number and an
active range invalidation counter to block anything that might undo what
the range_start() callback did.
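
As an illustration only, filling in the my_range_start()/my_range_end() stubs
from the sketch above and assuming my_mirror also has a spinlock, a sequence
counter and an active-invalidation counter (see virt/kvm/kvm_main.c or
drivers/infiniband/core/umem_odp.c for the real implementations):

static void my_range_start(struct mmu_notifier *mn, struct mm_struct *mm,
                           unsigned long start, unsigned long end)
{
        struct my_mirror *mirror = container_of(mn, struct my_mirror, mn);

        spin_lock(&mirror->lock);
        mirror->active_invalidations++; /* blocks new device mappings */
        mirror->seq++;                  /* invalidates in-flight faults */
        spin_unlock(&mirror->lock);

        my_unmap_device_range(mirror, start, end);      /* invented helper */
}

static void my_range_end(struct mmu_notifier *mn, struct mm_struct *mm,
                         unsigned long start, unsigned long end)
{
        struct my_mirror *mirror = container_of(mn, struct my_mirror, mn);

        spin_lock(&mirror->lock);
        mirror->active_invalidations--;
        spin_unlock(&mirror->lock);
}

/* Device fault path: only map the page if no invalidation raced with us. */
static int my_device_fault(struct my_mirror *mirror, unsigned long addr)
{
        unsigned long seq = READ_ONCE(mirror->seq);
        struct page *page;
        int ret = -EAGAIN;

        page = my_get_user_page(mirror, addr);  /* invented wrapper, may sleep */
        if (!page)
                return -EFAULT;

        spin_lock(&mirror->lock);
        if (!mirror->active_invalidations && seq == mirror->seq)
                ret = my_map_page_to_device(mirror, addr, page); /* invented */
        spin_unlock(&mirror->lock);

        if (ret == -EAGAIN)
                put_page(page);         /* caller retries the whole fault */
        return ret;
}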

If you do not care about keeping fully in sync with the CPU page table (ie
you can live with the CPU page table pointing to a new, different page for
a given virtual address) then you can take a reference on the pages inside
the range_start() callback and drop it in range_end(), or whenever your
driver is done with those pages.
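
A rough, untested sketch of that simpler scheme, assuming the hypothetical
my_mirror keeps a mutex-protected list of the pages it mirrors (struct
my_page and all my_ names are again invented):

struct my_page {
        struct list_head list;
        unsigned long addr;
        struct page *page;
        bool stale;
};

static void my_range_start(struct mmu_notifier *mn, struct mm_struct *mm,
                           unsigned long start, unsigned long end)
{
        struct my_mirror *mirror = container_of(mn, struct my_mirror, mn);
        struct my_page *p;

        mutex_lock(&mirror->mutex);
        list_for_each_entry(p, &mirror->pages, list) {
                if (p->addr < start || p->addr >= end)
                        continue;
                /* Pin the page so it stays valid for the device. */
                get_page(p->page);
                p->stale = true;
        }
        mutex_unlock(&mirror->mutex);
}

static void my_range_end(struct mmu_notifier *mn, struct mm_struct *mm,
                         unsigned long start, unsigned long end)
{
        struct my_mirror *mirror = container_of(mn, struct my_mirror, mn);
        struct my_page *p;

        mutex_lock(&mirror->mutex);
        list_for_each_entry(p, &mirror->pages, list) {
                if (!p->stale)
                        continue;
                p->stale = false;
                /* Or keep the reference until the device is done with it. */
                put_page(p->page);
        }
        mutex_unlock(&mirror->mutex);
}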

The last alternative is to use invalidate_range(), if you can do the
invalidation without sleeping, as the invalidate_range() callback happens
under the CPU page table spinlock, right after the page table is updated.
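
For example, an IOMMU-like user that can flush its device TLB with a
spinlock held would only need something along these lines (sketch only,
my_ names invented):

static void my_invalidate_range(struct mmu_notifier *mn, struct mm_struct *mm,
                                unsigned long start, unsigned long end)
{
        struct my_mirror *mirror = container_of(mn, struct my_mirror, mn);

        /* Runs under the CPU page table lock: must not sleep. */
        spin_lock(&mirror->tlb_lock);
        my_flush_device_tlb(mirror, start, end);        /* invented helper */
        spin_unlock(&mirror->tlb_lock);
}

static const struct mmu_notifier_ops my_flush_only_ops = {
        .invalidate_range = my_invalidate_range,
};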


Note this is barely tested. I intend to do more testing over the next few
days but I do not have access to all the hardware that makes use of the
mmu_notifier API.


The first 2 patches convert existing calls of mmu_notifier_invalidate_page()
to mmu_notifier_invalidate_range() and bracket those calls with calls to
mmu_notifier_invalidate_range_start()/end().

The next 10 patches remove the existing invalidate_page() callbacks as they
can no longer be called.

Finally, the last patch removes it completely so it can RIP.

Jérôme Glisse (13):
dax: update to new mmu_notifier semantic
mm/rmap: update to new mmu_notifier semantic
powerpc/powernv: update to new mmu_notifier semantic
drm/amdgpu: update to new mmu_notifier semantic
IB/umem: update to new mmu_notifier semantic
IB/hfi1: update to new mmu_notifier semantic
iommu/amd: update to new mmu_notifier semantic
iommu/intel: update to new mmu_notifier semantic
misc/mic/scif: update to new mmu_notifier semantic
sgi-gru: update to new mmu_notifier semantic
xen/gntdev: update to new mmu_notifier semantic
KVM: update to new mmu_notifier semantic
mm/mmu_notifier: kill invalidate_page

Cc: Kirill A. Shutemov <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Joerg Roedel <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Sudeep Dutt <[email protected]>
Cc: Ashutosh Dixit <[email protected]>
Cc: Dimitri Sivanich <[email protected]>
Cc: Jack Steiner <[email protected]>
Cc: Paolo Bonzini <[email protected]>
Cc: Radim Krčmář <[email protected]>

Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]


arch/powerpc/platforms/powernv/npu-dma.c | 10 --------
drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c | 31 ----------------------
drivers/infiniband/core/umem_odp.c | 19 --------------
drivers/infiniband/hw/hfi1/mmu_rb.c | 9 -------
drivers/iommu/amd_iommu_v2.c | 8 ------
drivers/iommu/intel-svm.c | 9 -------
drivers/misc/mic/scif/scif_dma.c | 11 --------
drivers/misc/sgi-gru/grutlbpurge.c | 12 ---------
drivers/xen/gntdev.c | 8 ------
fs/dax.c | 19 ++++++++------
include/linux/mm.h | 1 +
include/linux/mmu_notifier.h | 25 ------------------
mm/memory.c | 26 +++++++++++++++----
mm/mmu_notifier.c | 14 ----------
mm/rmap.c | 44 +++++++++++++++++++++++++++++---
virt/kvm/kvm_main.c | 42 ------------------------------
16 files changed, 74 insertions(+), 214 deletions(-)

--
2.13.5


2017-08-29 23:55:05

by Jerome Glisse

Subject: [PATCH 01/13] dax: update to new mmu_notifier semantic

Replace all calls to mmu_notifier_invalidate_page() by
mmu_notifier_invalidate_range() and make sure they are bracketed by calls to
mmu_notifier_invalidate_range_start()/end().

Note that because we cannot presume the pmd value or pte value we have to
assume the worst and unconditionally report an invalidation as happening.

Signed-off-by: Jérôme Glisse <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Ross Zwisler <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Bernhard Held <[email protected]>
Cc: Adam Borowski <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Radim Krčmář <[email protected]>
Cc: Wanpeng Li <[email protected]>
Cc: Paolo Bonzini <[email protected]>
Cc: Takashi Iwai <[email protected]>
Cc: Nadav Amit <[email protected]>
Cc: Mike Galbraith <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: axie <[email protected]>
Cc: Andrew Morton <[email protected]>
---
fs/dax.c | 19 +++++++++++--------
include/linux/mm.h | 1 +
mm/memory.c | 26 +++++++++++++++++++++-----
3 files changed, 33 insertions(+), 13 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 865d42c63e23..ab925dc6647a 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -646,11 +646,10 @@ static void dax_mapping_entry_mkclean(struct address_space *mapping,
pte_t pte, *ptep = NULL;
pmd_t *pmdp = NULL;
spinlock_t *ptl;
- bool changed;

i_mmap_lock_read(mapping);
vma_interval_tree_foreach(vma, &mapping->i_mmap, index, index) {
- unsigned long address;
+ unsigned long address, start, end;

cond_resched();

@@ -658,8 +657,13 @@ static void dax_mapping_entry_mkclean(struct address_space *mapping,
continue;

address = pgoff_address(index, vma);
- changed = false;
- if (follow_pte_pmd(vma->vm_mm, address, &ptep, &pmdp, &ptl))
+
+ /*
+ * Note because we provide start/end to follow_pte_pmd it will
+ * call mmu_notifier_invalidate_range_start() on our behalf
+ * before taking any lock.
+ */
+ if (follow_pte_pmd(vma->vm_mm, address, &start, &end, &ptep, &pmdp, &ptl))
continue;

if (pmdp) {
@@ -676,7 +680,7 @@ static void dax_mapping_entry_mkclean(struct address_space *mapping,
pmd = pmd_wrprotect(pmd);
pmd = pmd_mkclean(pmd);
set_pmd_at(vma->vm_mm, address, pmdp, pmd);
- changed = true;
+ mmu_notifier_invalidate_range(vma->vm_mm, start, end);
unlock_pmd:
spin_unlock(ptl);
#endif
@@ -691,13 +695,12 @@ static void dax_mapping_entry_mkclean(struct address_space *mapping,
pte = pte_wrprotect(pte);
pte = pte_mkclean(pte);
set_pte_at(vma->vm_mm, address, ptep, pte);
- changed = true;
+ mmu_notifier_invalidate_range(vma->vm_mm, start, end);
unlock_pte:
pte_unmap_unlock(ptep, ptl);
}

- if (changed)
- mmu_notifier_invalidate_page(vma->vm_mm, address);
+ mmu_notifier_invalidate_range_end(vma->vm_mm, start, end);
}
i_mmap_unlock_read(mapping);
}
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 46b9ac5e8569..c1f6c95f3496 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1260,6 +1260,7 @@ int copy_page_range(struct mm_struct *dst, struct mm_struct *src,
void unmap_mapping_range(struct address_space *mapping,
loff_t const holebegin, loff_t const holelen, int even_cows);
int follow_pte_pmd(struct mm_struct *mm, unsigned long address,
+ unsigned long *start, unsigned long *end,
pte_t **ptepp, pmd_t **pmdpp, spinlock_t **ptlp);
int follow_pfn(struct vm_area_struct *vma, unsigned long address,
unsigned long *pfn);
diff --git a/mm/memory.c b/mm/memory.c
index fe2fba27ded2..56e48e4593cb 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4008,7 +4008,8 @@ int __pmd_alloc(struct mm_struct *mm, pud_t *pud, unsigned long address)
#endif /* __PAGETABLE_PMD_FOLDED */

static int __follow_pte_pmd(struct mm_struct *mm, unsigned long address,
- pte_t **ptepp, pmd_t **pmdpp, spinlock_t **ptlp)
+ unsigned long *start, unsigned long *end,
+ pte_t **ptepp, pmd_t **pmdpp, spinlock_t **ptlp)
{
pgd_t *pgd;
p4d_t *p4d;
@@ -4035,17 +4036,29 @@ static int __follow_pte_pmd(struct mm_struct *mm, unsigned long address,
if (!pmdpp)
goto out;

+ if (start && end) {
+ *start = address & PMD_MASK;
+ *end = *start + PMD_SIZE;
+ mmu_notifier_invalidate_range_start(mm, *start, *end);
+ }
*ptlp = pmd_lock(mm, pmd);
if (pmd_huge(*pmd)) {
*pmdpp = pmd;
return 0;
}
spin_unlock(*ptlp);
+ if (start && end)
+ mmu_notifier_invalidate_range_end(mm, *start, *end);
}

if (pmd_none(*pmd) || unlikely(pmd_bad(*pmd)))
goto out;

+ if (start && end) {
+ *start = address & PAGE_MASK;
+ *end = *start + PAGE_SIZE;
+ mmu_notifier_invalidate_range_start(mm, *start, *end);
+ }
ptep = pte_offset_map_lock(mm, pmd, address, ptlp);
if (!pte_present(*ptep))
goto unlock;
@@ -4053,6 +4066,8 @@ static int __follow_pte_pmd(struct mm_struct *mm, unsigned long address,
return 0;
unlock:
pte_unmap_unlock(ptep, *ptlp);
+ if (start && end)
+ mmu_notifier_invalidate_range_end(mm, *start, *end);
out:
return -EINVAL;
}
@@ -4064,20 +4079,21 @@ static inline int follow_pte(struct mm_struct *mm, unsigned long address,

/* (void) is needed to make gcc happy */
(void) __cond_lock(*ptlp,
- !(res = __follow_pte_pmd(mm, address, ptepp, NULL,
- ptlp)));
+ !(res = __follow_pte_pmd(mm, address, NULL, NULL,
+ ptepp, NULL, ptlp)));
return res;
}

int follow_pte_pmd(struct mm_struct *mm, unsigned long address,
+ unsigned long *start, unsigned long *end,
pte_t **ptepp, pmd_t **pmdpp, spinlock_t **ptlp)
{
int res;

/* (void) is needed to make gcc happy */
(void) __cond_lock(*ptlp,
- !(res = __follow_pte_pmd(mm, address, ptepp, pmdpp,
- ptlp)));
+ !(res = __follow_pte_pmd(mm, address, start, end,
+ ptepp, pmdpp, ptlp)));
return res;
}
EXPORT_SYMBOL(follow_pte_pmd);
--
2.13.5

2017-08-29 23:55:19

by Jerome Glisse

Subject: [PATCH 09/13] misc/mic/scif: update to new mmu_notifier semantic

Calls to mmu_notifier_invalidate_page() are replaced by calls to
mmu_notifier_invalidate_range() and thus are bracketed by calls to
mmu_notifier_invalidate_range_start()/end().

Remove the now-useless invalidate_page() callback.

Signed-off-by: Jérôme Glisse <[email protected]>
Cc: Sudeep Dutt <[email protected]>
Cc: Ashutosh Dixit <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
---
drivers/misc/mic/scif/scif_dma.c | 11 -----------
1 file changed, 11 deletions(-)

diff --git a/drivers/misc/mic/scif/scif_dma.c b/drivers/misc/mic/scif/scif_dma.c
index 64d5760d069a..63d6246d6dff 100644
--- a/drivers/misc/mic/scif/scif_dma.c
+++ b/drivers/misc/mic/scif/scif_dma.c
@@ -200,16 +200,6 @@ static void scif_mmu_notifier_release(struct mmu_notifier *mn,
schedule_work(&scif_info.misc_work);
}

-static void scif_mmu_notifier_invalidate_page(struct mmu_notifier *mn,
- struct mm_struct *mm,
- unsigned long address)
-{
- struct scif_mmu_notif *mmn;
-
- mmn = container_of(mn, struct scif_mmu_notif, ep_mmu_notifier);
- scif_rma_destroy_tcw(mmn, address, PAGE_SIZE);
-}
-
static void scif_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
struct mm_struct *mm,
unsigned long start,
@@ -235,7 +225,6 @@ static void scif_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
static const struct mmu_notifier_ops scif_mmu_notifier_ops = {
.release = scif_mmu_notifier_release,
.clear_flush_young = NULL,
- .invalidate_page = scif_mmu_notifier_invalidate_page,
.invalidate_range_start = scif_mmu_notifier_invalidate_range_start,
.invalidate_range_end = scif_mmu_notifier_invalidate_range_end};

--
2.13.5

2017-08-29 23:55:25

by Jerome Glisse

Subject: [PATCH 13/13] mm/mmu_notifier: kill invalidate_page

The invalidate_page callback suffered from two pitfalls. First, it used to
run after the page table lock was released, so a new page might have been
set up for the virtual address before the call to invalidate_page() happened.

This was fixed, in a roundabout way, by commit c7ab0d2fdc840266b39db94538f74207ec2afbf6,
which moved the callback under the page table lock, but this also broke
several existing users of the mmu_notifier API that assumed they could
sleep inside this callback.

The second pitfall was that invalidate_page was the only callback that did
not take an address range to invalidate; it was given a single address and
a page. Many of the callback implementers assumed the page could never be
THP and thus failed to invalidate the appropriate range for THP.

By killing this callback we unify the mmu_notifier callback API to always
take a virtual address range as input.

Finally, this also simplifies the end user's life as there are now two clear
choices:
- the invalidate_range_start()/end() callbacks (which allow you to sleep)
- invalidate_range(), where you cannot sleep, but which happens right after
  the page table update, under the page table lock

Signed-off-by: Jérôme Glisse <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Bernhard Held <[email protected]>
Cc: Adam Borowski <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Radim Krčmář <[email protected]>
Cc: Wanpeng Li <[email protected]>
Cc: Paolo Bonzini <[email protected]>
Cc: Takashi Iwai <[email protected]>
Cc: Nadav Amit <[email protected]>
Cc: Mike Galbraith <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: axie <[email protected]>
Cc: Andrew Morton <[email protected]>
---
include/linux/mmu_notifier.h | 25 -------------------------
mm/mmu_notifier.c | 14 --------------
2 files changed, 39 deletions(-)

diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index c91b3bcd158f..7b2e31b1745a 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -95,17 +95,6 @@ struct mmu_notifier_ops {
pte_t pte);

/*
- * Before this is invoked any secondary MMU is still ok to
- * read/write to the page previously pointed to by the Linux
- * pte because the page hasn't been freed yet and it won't be
- * freed until this returns. If required set_page_dirty has to
- * be called internally to this method.
- */
- void (*invalidate_page)(struct mmu_notifier *mn,
- struct mm_struct *mm,
- unsigned long address);
-
- /*
* invalidate_range_start() and invalidate_range_end() must be
* paired and are called only when the mmap_sem and/or the
* locks protecting the reverse maps are held. If the subsystem
@@ -220,8 +209,6 @@ extern int __mmu_notifier_test_young(struct mm_struct *mm,
unsigned long address);
extern void __mmu_notifier_change_pte(struct mm_struct *mm,
unsigned long address, pte_t pte);
-extern void __mmu_notifier_invalidate_page(struct mm_struct *mm,
- unsigned long address);
extern void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
unsigned long start, unsigned long end);
extern void __mmu_notifier_invalidate_range_end(struct mm_struct *mm,
@@ -268,13 +255,6 @@ static inline void mmu_notifier_change_pte(struct mm_struct *mm,
__mmu_notifier_change_pte(mm, address, pte);
}

-static inline void mmu_notifier_invalidate_page(struct mm_struct *mm,
- unsigned long address)
-{
- if (mm_has_notifiers(mm))
- __mmu_notifier_invalidate_page(mm, address);
-}
-
static inline void mmu_notifier_invalidate_range_start(struct mm_struct *mm,
unsigned long start, unsigned long end)
{
@@ -442,11 +422,6 @@ static inline void mmu_notifier_change_pte(struct mm_struct *mm,
{
}

-static inline void mmu_notifier_invalidate_page(struct mm_struct *mm,
- unsigned long address)
-{
-}
-
static inline void mmu_notifier_invalidate_range_start(struct mm_struct *mm,
unsigned long start, unsigned long end)
{
diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
index 54ca54562928..314285284e6e 100644
--- a/mm/mmu_notifier.c
+++ b/mm/mmu_notifier.c
@@ -174,20 +174,6 @@ void __mmu_notifier_change_pte(struct mm_struct *mm, unsigned long address,
srcu_read_unlock(&srcu, id);
}

-void __mmu_notifier_invalidate_page(struct mm_struct *mm,
- unsigned long address)
-{
- struct mmu_notifier *mn;
- int id;
-
- id = srcu_read_lock(&srcu);
- hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist) {
- if (mn->ops->invalidate_page)
- mn->ops->invalidate_page(mn, mm, address);
- }
- srcu_read_unlock(&srcu, id);
-}
-
void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
unsigned long start, unsigned long end)
{
--
2.13.5

2017-08-29 23:55:43

by Jerome Glisse

Subject: [PATCH 12/13] KVM: update to new mmu_notifier semantic

Calls to mmu_notifier_invalidate_page() are replaced by calls to
mmu_notifier_invalidate_range() and thus are bracketed by calls to
mmu_notifier_invalidate_range_start()/end().

Remove the now-useless invalidate_page() callback.

Signed-off-by: Jérôme Glisse <[email protected]>
Cc: Paolo Bonzini <[email protected]>
Cc: Radim Krčmář <[email protected]>
Cc: [email protected]
Cc: Kirill A. Shutemov <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
---
virt/kvm/kvm_main.c | 42 ------------------------------------------
1 file changed, 42 deletions(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 15252d723b54..4d81f6ded88e 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -322,47 +322,6 @@ static inline struct kvm *mmu_notifier_to_kvm(struct mmu_notifier *mn)
return container_of(mn, struct kvm, mmu_notifier);
}

-static void kvm_mmu_notifier_invalidate_page(struct mmu_notifier *mn,
- struct mm_struct *mm,
- unsigned long address)
-{
- struct kvm *kvm = mmu_notifier_to_kvm(mn);
- int need_tlb_flush, idx;
-
- /*
- * When ->invalidate_page runs, the linux pte has been zapped
- * already but the page is still allocated until
- * ->invalidate_page returns. So if we increase the sequence
- * here the kvm page fault will notice if the spte can't be
- * established because the page is going to be freed. If
- * instead the kvm page fault establishes the spte before
- * ->invalidate_page runs, kvm_unmap_hva will release it
- * before returning.
- *
- * The sequence increase only need to be seen at spin_unlock
- * time, and not at spin_lock time.
- *
- * Increasing the sequence after the spin_unlock would be
- * unsafe because the kvm page fault could then establish the
- * pte after kvm_unmap_hva returned, without noticing the page
- * is going to be freed.
- */
- idx = srcu_read_lock(&kvm->srcu);
- spin_lock(&kvm->mmu_lock);
-
- kvm->mmu_notifier_seq++;
- need_tlb_flush = kvm_unmap_hva(kvm, address) | kvm->tlbs_dirty;
- /* we've to flush the tlb before the pages can be freed */
- if (need_tlb_flush)
- kvm_flush_remote_tlbs(kvm);
-
- spin_unlock(&kvm->mmu_lock);
-
- kvm_arch_mmu_notifier_invalidate_page(kvm, address);
-
- srcu_read_unlock(&kvm->srcu, idx);
-}
-
static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
struct mm_struct *mm,
unsigned long address,
@@ -510,7 +469,6 @@ static void kvm_mmu_notifier_release(struct mmu_notifier *mn,
}

static const struct mmu_notifier_ops kvm_mmu_notifier_ops = {
- .invalidate_page = kvm_mmu_notifier_invalidate_page,
.invalidate_range_start = kvm_mmu_notifier_invalidate_range_start,
.invalidate_range_end = kvm_mmu_notifier_invalidate_range_end,
.clear_flush_young = kvm_mmu_notifier_clear_flush_young,
--
2.13.5

2017-08-29 23:55:18

by Jerome Glisse

Subject: [PATCH 10/13] sgi-gru: update to new mmu_notifier semantic

Calls to mmu_notifier_invalidate_page() are replaced by calls to
mmu_notifier_invalidate_range() and thus are bracketed by calls to
mmu_notifier_invalidate_range_start()/end().

Remove the now-useless invalidate_page() callback.

Signed-off-by: Jérôme Glisse <[email protected]>
Cc: Dimitri Sivanich <[email protected]>
Cc: Jack Steiner <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
---
drivers/misc/sgi-gru/grutlbpurge.c | 12 ------------
1 file changed, 12 deletions(-)

diff --git a/drivers/misc/sgi-gru/grutlbpurge.c b/drivers/misc/sgi-gru/grutlbpurge.c
index e936d43895d2..9918eda0e05f 100644
--- a/drivers/misc/sgi-gru/grutlbpurge.c
+++ b/drivers/misc/sgi-gru/grutlbpurge.c
@@ -247,17 +247,6 @@ static void gru_invalidate_range_end(struct mmu_notifier *mn,
gru_dbg(grudev, "gms %p, start 0x%lx, end 0x%lx\n", gms, start, end);
}

-static void gru_invalidate_page(struct mmu_notifier *mn, struct mm_struct *mm,
- unsigned long address)
-{
- struct gru_mm_struct *gms = container_of(mn, struct gru_mm_struct,
- ms_notifier);
-
- STAT(mmu_invalidate_page);
- gru_flush_tlb_range(gms, address, PAGE_SIZE);
- gru_dbg(grudev, "gms %p, address 0x%lx\n", gms, address);
-}
-
static void gru_release(struct mmu_notifier *mn, struct mm_struct *mm)
{
struct gru_mm_struct *gms = container_of(mn, struct gru_mm_struct,
@@ -269,7 +258,6 @@ static void gru_release(struct mmu_notifier *mn, struct mm_struct *mm)


static const struct mmu_notifier_ops gru_mmuops = {
- .invalidate_page = gru_invalidate_page,
.invalidate_range_start = gru_invalidate_range_start,
.invalidate_range_end = gru_invalidate_range_end,
.release = gru_release,
--
2.13.5

2017-08-29 23:56:12

by Jerome Glisse

Subject: [PATCH 11/13] xen/gntdev: update to new mmu_notifier semantic

Calls to mmu_notifier_invalidate_page() are replaced by calls to
mmu_notifier_invalidate_range() and thus are bracketed by calls to
mmu_notifier_invalidate_range_start()/end().

Remove the now-useless invalidate_page() callback.

Signed-off-by: Jérôme Glisse <[email protected]>
Cc: Konrad Rzeszutek Wilk <[email protected]>
Cc: Roger Pau Monné <[email protected]>
Cc: Boris Ostrovsky <[email protected]>
Cc: [email protected]
Cc: Kirill A. Shutemov <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
---
drivers/xen/gntdev.c | 8 --------
1 file changed, 8 deletions(-)

diff --git a/drivers/xen/gntdev.c b/drivers/xen/gntdev.c
index f3bf8f4e2d6c..82360594fa8e 100644
--- a/drivers/xen/gntdev.c
+++ b/drivers/xen/gntdev.c
@@ -484,13 +484,6 @@ static void mn_invl_range_start(struct mmu_notifier *mn,
mutex_unlock(&priv->lock);
}

-static void mn_invl_page(struct mmu_notifier *mn,
- struct mm_struct *mm,
- unsigned long address)
-{
- mn_invl_range_start(mn, mm, address, address + PAGE_SIZE);
-}
-
static void mn_release(struct mmu_notifier *mn,
struct mm_struct *mm)
{
@@ -522,7 +515,6 @@ static void mn_release(struct mmu_notifier *mn,

static const struct mmu_notifier_ops gntdev_mmu_ops = {
.release = mn_release,
- .invalidate_page = mn_invl_page,
.invalidate_range_start = mn_invl_range_start,
};

--
2.13.5

2017-08-29 23:55:15

by Jerome Glisse

Subject: [PATCH 08/13] iommu/intel: update to new mmu_notifier semantic

Calls to mmu_notifier_invalidate_page() are replaced by calls to
mmu_notifier_invalidate_range() and thus are bracketed by calls to
mmu_notifier_invalidate_range_start()/end().

Remove the now-useless invalidate_page() callback.

Signed-off-by: Jérôme Glisse <[email protected]>
Cc: David Woodhouse <[email protected]>
Cc: [email protected]
Cc: Joerg Roedel <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
---
drivers/iommu/intel-svm.c | 9 ---------
1 file changed, 9 deletions(-)

diff --git a/drivers/iommu/intel-svm.c b/drivers/iommu/intel-svm.c
index f167c0d84ebf..f620dccec8ee 100644
--- a/drivers/iommu/intel-svm.c
+++ b/drivers/iommu/intel-svm.c
@@ -223,14 +223,6 @@ static void intel_change_pte(struct mmu_notifier *mn, struct mm_struct *mm,
intel_flush_svm_range(svm, address, 1, 1, 0);
}

-static void intel_invalidate_page(struct mmu_notifier *mn, struct mm_struct *mm,
- unsigned long address)
-{
- struct intel_svm *svm = container_of(mn, struct intel_svm, notifier);
-
- intel_flush_svm_range(svm, address, 1, 1, 0);
-}
-
/* Pages have been freed at this point */
static void intel_invalidate_range(struct mmu_notifier *mn,
struct mm_struct *mm,
@@ -285,7 +277,6 @@ static void intel_mm_release(struct mmu_notifier *mn, struct mm_struct *mm)
static const struct mmu_notifier_ops intel_mmuops = {
.release = intel_mm_release,
.change_pte = intel_change_pte,
- .invalidate_page = intel_invalidate_page,
.invalidate_range = intel_invalidate_range,
};

--
2.13.5

2017-08-29 23:55:13

by Jerome Glisse

Subject: [PATCH 03/13] powerpc/powernv: update to new mmu_notifier semantic

Calls to mmu_notifier_invalidate_page() are replaced by calls to
mmu_notifier_invalidate_range() and thus are bracketed by calls to
mmu_notifier_invalidate_range_start()/end().

Remove the now-useless invalidate_page() callback.

Signed-off-by: Jérôme Glisse <[email protected]>
Cc: [email protected]
Cc: Alistair Popple <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
---
arch/powerpc/platforms/powernv/npu-dma.c | 10 ----------
1 file changed, 10 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/npu-dma.c b/arch/powerpc/platforms/powernv/npu-dma.c
index b5d960d6db3d..4c7b8591f737 100644
--- a/arch/powerpc/platforms/powernv/npu-dma.c
+++ b/arch/powerpc/platforms/powernv/npu-dma.c
@@ -614,15 +614,6 @@ static void pnv_npu2_mn_change_pte(struct mmu_notifier *mn,
mmio_invalidate(npu_context, 1, address, true);
}

-static void pnv_npu2_mn_invalidate_page(struct mmu_notifier *mn,
- struct mm_struct *mm,
- unsigned long address)
-{
- struct npu_context *npu_context = mn_to_npu_context(mn);
-
- mmio_invalidate(npu_context, 1, address, true);
-}
-
static void pnv_npu2_mn_invalidate_range(struct mmu_notifier *mn,
struct mm_struct *mm,
unsigned long start, unsigned long end)
@@ -640,7 +631,6 @@ static void pnv_npu2_mn_invalidate_range(struct mmu_notifier *mn,
static const struct mmu_notifier_ops nv_nmmu_notifier_ops = {
.release = pnv_npu2_mn_release,
.change_pte = pnv_npu2_mn_change_pte,
- .invalidate_page = pnv_npu2_mn_invalidate_page,
.invalidate_range = pnv_npu2_mn_invalidate_range,
};

--
2.13.5

2017-08-29 23:56:42

by Jerome Glisse

Subject: [PATCH 07/13] iommu/amd: update to new mmu_notifier semantic

Calls to mmu_notifier_invalidate_page() are replaced by calls to
mmu_notifier_invalidate_range() and thus are bracketed by calls to
mmu_notifier_invalidate_range_start()/end().

Remove the now-useless invalidate_page() callback.

Signed-off-by: Jérôme Glisse <[email protected]>
Cc: Suravee Suthikulpanit <[email protected]>
Cc: [email protected]
Cc: Joerg Roedel <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
---
drivers/iommu/amd_iommu_v2.c | 8 --------
1 file changed, 8 deletions(-)

diff --git a/drivers/iommu/amd_iommu_v2.c b/drivers/iommu/amd_iommu_v2.c
index 6629c472eafd..dccf5b76eff2 100644
--- a/drivers/iommu/amd_iommu_v2.c
+++ b/drivers/iommu/amd_iommu_v2.c
@@ -391,13 +391,6 @@ static int mn_clear_flush_young(struct mmu_notifier *mn,
return 0;
}

-static void mn_invalidate_page(struct mmu_notifier *mn,
- struct mm_struct *mm,
- unsigned long address)
-{
- __mn_flush_page(mn, address);
-}
-
static void mn_invalidate_range(struct mmu_notifier *mn,
struct mm_struct *mm,
unsigned long start, unsigned long end)
@@ -436,7 +429,6 @@ static void mn_release(struct mmu_notifier *mn, struct mm_struct *mm)
static const struct mmu_notifier_ops iommu_mn = {
.release = mn_release,
.clear_flush_young = mn_clear_flush_young,
- .invalidate_page = mn_invalidate_page,
.invalidate_range = mn_invalidate_range,
};

--
2.13.5

2017-08-29 23:55:11

by Jerome Glisse

Subject: [PATCH 04/13] drm/amdgpu: update to new mmu_notifier semantic

Calls to mmu_notifier_invalidate_page() are replaced by calls to
mmu_notifier_invalidate_range() and thus are bracketed by calls to
mmu_notifier_invalidate_range_start()/end().

Remove the now-useless invalidate_page() callback.

Signed-off-by: Jérôme Glisse <[email protected]>
Cc: [email protected]
Cc: Felix Kuehling <[email protected]>
Cc: Christian König <[email protected]>
Cc: Alex Deucher <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
---
drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c | 31 -------------------------------
1 file changed, 31 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
index 6558a3ed57a7..e1cde6b80027 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
@@ -147,36 +147,6 @@ static void amdgpu_mn_invalidate_node(struct amdgpu_mn_node *node,
}

/**
- * amdgpu_mn_invalidate_page - callback to notify about mm change
- *
- * @mn: our notifier
- * @mn: the mm this callback is about
- * @address: address of invalidate page
- *
- * Invalidation of a single page. Blocks for all BOs mapping it
- * and unmap them by move them into system domain again.
- */
-static void amdgpu_mn_invalidate_page(struct mmu_notifier *mn,
- struct mm_struct *mm,
- unsigned long address)
-{
- struct amdgpu_mn *rmn = container_of(mn, struct amdgpu_mn, mn);
- struct interval_tree_node *it;
-
- mutex_lock(&rmn->lock);
-
- it = interval_tree_iter_first(&rmn->objects, address, address);
- if (it) {
- struct amdgpu_mn_node *node;
-
- node = container_of(it, struct amdgpu_mn_node, it);
- amdgpu_mn_invalidate_node(node, address, address);
- }
-
- mutex_unlock(&rmn->lock);
-}
-
-/**
* amdgpu_mn_invalidate_range_start - callback to notify about mm change
*
* @mn: our notifier
@@ -215,7 +185,6 @@ static void amdgpu_mn_invalidate_range_start(struct mmu_notifier *mn,

static const struct mmu_notifier_ops amdgpu_mn_ops = {
.release = amdgpu_mn_release,
- .invalidate_page = amdgpu_mn_invalidate_page,
.invalidate_range_start = amdgpu_mn_invalidate_range_start,
};

--
2.13.5

2017-08-29 23:56:59

by Jerome Glisse

Subject: [PATCH 06/13] IB/hfi1: update to new mmu_notifier semantic

Calls to mmu_notifier_invalidate_page() are replaced by calls to
mmu_notifier_invalidate_range() and thus are bracketed by calls to
mmu_notifier_invalidate_range_start()/end().

Remove the now-useless invalidate_page() callback.

Signed-off-by: Jérôme Glisse <[email protected]>
Cc: [email protected]
Cc: Dean Luick <[email protected]>
Cc: Ira Weiny <[email protected]>
Cc: Doug Ledford <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
---
drivers/infiniband/hw/hfi1/mmu_rb.c | 9 ---------
1 file changed, 9 deletions(-)

diff --git a/drivers/infiniband/hw/hfi1/mmu_rb.c b/drivers/infiniband/hw/hfi1/mmu_rb.c
index ccbf52c8ff6f..e4b56a0dd6d0 100644
--- a/drivers/infiniband/hw/hfi1/mmu_rb.c
+++ b/drivers/infiniband/hw/hfi1/mmu_rb.c
@@ -67,8 +67,6 @@ struct mmu_rb_handler {

static unsigned long mmu_node_start(struct mmu_rb_node *);
static unsigned long mmu_node_last(struct mmu_rb_node *);
-static inline void mmu_notifier_page(struct mmu_notifier *, struct mm_struct *,
- unsigned long);
static inline void mmu_notifier_range_start(struct mmu_notifier *,
struct mm_struct *,
unsigned long, unsigned long);
@@ -82,7 +80,6 @@ static void do_remove(struct mmu_rb_handler *handler,
static void handle_remove(struct work_struct *work);

static const struct mmu_notifier_ops mn_opts = {
- .invalidate_page = mmu_notifier_page,
.invalidate_range_start = mmu_notifier_range_start,
};

@@ -285,12 +282,6 @@ void hfi1_mmu_rb_remove(struct mmu_rb_handler *handler,
handler->ops->remove(handler->ops_arg, node);
}

-static inline void mmu_notifier_page(struct mmu_notifier *mn,
- struct mm_struct *mm, unsigned long addr)
-{
- mmu_notifier_mem_invalidate(mn, mm, addr, addr + PAGE_SIZE);
-}
-
static inline void mmu_notifier_range_start(struct mmu_notifier *mn,
struct mm_struct *mm,
unsigned long start,
--
2.13.5

2017-08-29 23:57:25

by Jerome Glisse

Subject: [PATCH 05/13] IB/umem: update to new mmu_notifier semantic

Calls to mmu_notifier_invalidate_page() are replaced by calls to
mmu_notifier_invalidate_range() and thus are bracketed by calls to
mmu_notifier_invalidate_range_start()/end().

Remove the now-useless invalidate_page() callback.

Signed-off-by: Jérôme Glisse <[email protected]>
Cc: Leon Romanovsky <[email protected]>
Cc: [email protected]
Cc: Artemy Kovalyov <[email protected]>
Cc: Doug Ledford <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
---
drivers/infiniband/core/umem_odp.c | 19 -------------------
1 file changed, 19 deletions(-)

diff --git a/drivers/infiniband/core/umem_odp.c b/drivers/infiniband/core/umem_odp.c
index 8c4ec564e495..55e8f5ed8b3c 100644
--- a/drivers/infiniband/core/umem_odp.c
+++ b/drivers/infiniband/core/umem_odp.c
@@ -166,24 +166,6 @@ static int invalidate_page_trampoline(struct ib_umem *item, u64 start,
return 0;
}

-static void ib_umem_notifier_invalidate_page(struct mmu_notifier *mn,
- struct mm_struct *mm,
- unsigned long address)
-{
- struct ib_ucontext *context = container_of(mn, struct ib_ucontext, mn);
-
- if (!context->invalidate_range)
- return;
-
- ib_ucontext_notifier_start_account(context);
- down_read(&context->umem_rwsem);
- rbt_ib_umem_for_each_in_range(&context->umem_tree, address,
- address + PAGE_SIZE,
- invalidate_page_trampoline, NULL);
- up_read(&context->umem_rwsem);
- ib_ucontext_notifier_end_account(context);
-}
-
static int invalidate_range_start_trampoline(struct ib_umem *item, u64 start,
u64 end, void *cookie)
{
@@ -237,7 +219,6 @@ static void ib_umem_notifier_invalidate_range_end(struct mmu_notifier *mn,

static const struct mmu_notifier_ops ib_umem_notifiers = {
.release = ib_umem_notifier_release,
- .invalidate_page = ib_umem_notifier_invalidate_page,
.invalidate_range_start = ib_umem_notifier_invalidate_range_start,
.invalidate_range_end = ib_umem_notifier_invalidate_range_end,
};
--
2.13.5

2017-08-29 23:57:46

by Jerome Glisse

Subject: [PATCH 02/13] mm/rmap: update to new mmu_notifier semantic

Replace all calls to mmu_notifier_invalidate_page() by
mmu_notifier_invalidate_range() and make sure they are bracketed by calls to
mmu_notifier_invalidate_range_start()/end().

Note that because we cannot presume the pmd value or pte value we have to
assume the worst and unconditionally report an invalidation as happening.

Signed-off-by: Jérôme Glisse <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Ross Zwisler <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Bernhard Held <[email protected]>
Cc: Adam Borowski <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Radim Krčmář <[email protected]>
Cc: Wanpeng Li <[email protected]>
Cc: Paolo Bonzini <[email protected]>
Cc: Takashi Iwai <[email protected]>
Cc: Nadav Amit <[email protected]>
Cc: Mike Galbraith <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: axie <[email protected]>
Cc: Andrew Morton <[email protected]>
---
mm/rmap.c | 44 +++++++++++++++++++++++++++++++++++++++++---
1 file changed, 41 insertions(+), 3 deletions(-)

diff --git a/mm/rmap.c b/mm/rmap.c
index c8993c63eb25..da97ed525088 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -887,11 +887,21 @@ static bool page_mkclean_one(struct page *page, struct vm_area_struct *vma,
.address = address,
.flags = PVMW_SYNC,
};
+ unsigned long start = address, end;
int *cleaned = arg;

+ /*
+ * We have to assume the worse case ie pmd for invalidation. Note that
+ * the page can not be free from this function.
+ */
+ end = min(vma->vm_end, (start & PMD_MASK) + PMD_SIZE);
+ mmu_notifier_invalidate_range_start(vma->vm_mm, start, end);
+
while (page_vma_mapped_walk(&pvmw)) {
+ unsigned long cstart, cend;
int ret = 0;
- address = pvmw.address;
+
+ cstart = address = pvmw.address;
if (pvmw.pte) {
pte_t entry;
pte_t *pte = pvmw.pte;
@@ -904,6 +914,7 @@ static bool page_mkclean_one(struct page *page, struct vm_area_struct *vma,
entry = pte_wrprotect(entry);
entry = pte_mkclean(entry);
set_pte_at(vma->vm_mm, address, pte, entry);
+ cend = cstart + PAGE_SIZE;
ret = 1;
} else {
#ifdef CONFIG_TRANSPARENT_HUGE_PAGECACHE
@@ -918,6 +929,8 @@ static bool page_mkclean_one(struct page *page, struct vm_area_struct *vma,
entry = pmd_wrprotect(entry);
entry = pmd_mkclean(entry);
set_pmd_at(vma->vm_mm, address, pmd, entry);
+ cstart &= PMD_MASK;
+ cend = cstart + PMD_SIZE;
ret = 1;
#else
/* unexpected pmd-mapped page? */
@@ -926,11 +939,13 @@ static bool page_mkclean_one(struct page *page, struct vm_area_struct *vma,
}

if (ret) {
- mmu_notifier_invalidate_page(vma->vm_mm, address);
+ mmu_notifier_invalidate_range(vma->vm_mm, cstart, cend);
(*cleaned)++;
}
}

+ mmu_notifier_invalidate_range_end(vma->vm_mm, start, end);
+
return true;
}

@@ -1324,6 +1339,7 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
pte_t pteval;
struct page *subpage;
bool ret = true;
+ unsigned long start = address, end;
enum ttu_flags flags = (enum ttu_flags)arg;

/* munlock has nothing to gain from examining un-locked vmas */
@@ -1335,6 +1351,14 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
flags & TTU_MIGRATION, page);
}

+ /*
+ * We have to assume the worse case ie pmd for invalidation. Note that
+ * the page can not be free in this function as call of try_to_unmap()
+ * must hold a reference on the page.
+ */
+ end = min(vma->vm_end, (start & PMD_MASK) + PMD_SIZE);
+ mmu_notifier_invalidate_range_start(vma->vm_mm, start, end);
+
while (page_vma_mapped_walk(&pvmw)) {
/*
* If the page is mlock()d, we cannot swap it out.
@@ -1408,6 +1432,8 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
set_huge_swap_pte_at(mm, address,
pvmw.pte, pteval,
vma_mmu_pagesize(vma));
+ mmu_notifier_invalidate_range(mm, address,
+ address + vma_mmu_pagesize(vma));
} else {
dec_mm_counter(mm, mm_counter(page));
set_pte_at(mm, address, pvmw.pte, pteval);
@@ -1435,6 +1461,8 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
if (pte_soft_dirty(pteval))
swp_pte = pte_swp_mksoft_dirty(swp_pte);
set_pte_at(mm, address, pvmw.pte, swp_pte);
+ mmu_notifier_invalidate_range(mm, address,
+ address + PAGE_SIZE);
} else if (PageAnon(page)) {
swp_entry_t entry = { .val = page_private(subpage) };
pte_t swp_pte;
@@ -1445,6 +1473,9 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
if (unlikely(PageSwapBacked(page) != PageSwapCache(page))) {
WARN_ON_ONCE(1);
ret = false;
+ /* We have to invalidate as we cleared the pte */
+ mmu_notifier_invalidate_range(mm, address,
+ address + PAGE_SIZE);
page_vma_mapped_walk_done(&pvmw);
break;
}
@@ -1453,6 +1484,9 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
if (!PageSwapBacked(page)) {
if (!PageDirty(page)) {
dec_mm_counter(mm, MM_ANONPAGES);
+ /* Invalidate as we cleared the pte */
+ mmu_notifier_invalidate_range(mm,
+ address, address + PAGE_SIZE);
goto discard;
}

@@ -1485,13 +1519,17 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
if (pte_soft_dirty(pteval))
swp_pte = pte_swp_mksoft_dirty(swp_pte);
set_pte_at(mm, address, pvmw.pte, swp_pte);
+ mmu_notifier_invalidate_range(mm, address,
+ address + PAGE_SIZE);
} else
dec_mm_counter(mm, mm_counter_file(page));
discard:
page_remove_rmap(subpage, PageHuge(page));
put_page(page);
- mmu_notifier_invalidate_page(mm, address);
}
+
+ mmu_notifier_invalidate_range_end(vma->vm_mm, start, end);
+
return ret;
}

--
2.13.5

2017-08-30 00:11:29

by Linus Torvalds

Subject: Re: [PATCH 00/13] mmu_notifier kill invalidate_page callback

On Tue, Aug 29, 2017 at 4:54 PM, Jérôme Glisse <[email protected]> wrote:
>
> Note this is barely tested. I intend to do more testing of next few days
> but i do not have access to all hardware that make use of the mmu_notifier
> API.

Thanks for doing this.

> First 2 patches convert existing call of mmu_notifier_invalidate_page()
> to mmu_notifier_invalidate_range() and bracket those call with call to
> mmu_notifier_invalidate_range_start()/end().

Ok, those two patches are a bit more complex than I was hoping for,
but not *too* bad.

And the final end result certainly looks nice:

> 16 files changed, 74 insertions(+), 214 deletions(-)

Yeah, removing all those invalidate_page() notifiers certainly makes
for a nice patch.

And I actually think you missed some more lines that can now be
removed: kvm_arch_mmu_notifier_invalidate_page() should no longer be
needed either, so you can remove all of those too (most of them are
empty inline functions, but x86 has one that actually does something).

So there's an added 30 or so dead lines that should be removed in the
kvm patch, I think.

But from a _very_ quick read-through this looks fine. But it obviously
needs testing.

People - *especially* the people who saw issues under KVM - can you
try out Jérôme's patch-series? I added some people to the cc, the full
series is on lkml. Jérôme - do you have a git branch for people to
test that they could easily pull and try out?

Linus

2017-08-30 00:56:24

by Jerome Glisse

Subject: Re: [PATCH 00/13] mmu_notifier kill invalidate_page callback

On Tue, Aug 29, 2017 at 05:11:24PM -0700, Linus Torvalds wrote:
> On Tue, Aug 29, 2017 at 4:54 PM, J?r?me Glisse <[email protected]> wrote:
> >
> > Note this is barely tested. I intend to do more testing of next few days
> > but i do not have access to all hardware that make use of the mmu_notifier
> > API.
>
> Thanks for doing this.
>
> > First 2 patches convert existing call of mmu_notifier_invalidate_page()
> > to mmu_notifier_invalidate_range() and bracket those call with call to
> > mmu_notifier_invalidate_range_start()/end().
>
> Ok, those two patches are a bit more complex than I was hoping for,
> but not *too* bad.
>
> And the final end result certainly looks nice:
>
> > 16 files changed, 74 insertions(+), 214 deletions(-)
>
> Yeah, removing all those invalidate_page() notifiers certainly makes
> for a nice patch.
>
> And I actually think you missed some more lines that can now be
> removed: kvm_arch_mmu_notifier_invalidate_page() should no longer be
> needed either, so you can remove all of those too (most of them are
> empty inline functions, but x86 has one that actually does something.
>
> So there's an added 30 or so dead lines that should be removed in the
> kvm patch, I think.

Yes, I missed that. I will wait for people to test and for the results of my
own tests before reposting if need be; otherwise I will post it as a separate
patch.

>
> But from a _very_ quick read-through this looks fine. But it obviously
> needs testing.
>
> People - *especially* the people who saw issues under KVM - can you
> try out Jérôme's patch-series? I added some people to the cc, the full
> series is on lkml. Jérôme - do you have a git branch for people to
> test that they could easily pull and try out?

https://cgit.freedesktop.org/~glisse/linux mmu-notifier branch
git://people.freedesktop.org/~glisse/linux

(Sorry if that tree is a bit big; it has a lot of dead things. I need
to push a clean and slim one.)

Jérôme

2017-08-30 02:46:13

by Nadav Amit

Subject: Re: [PATCH 02/13] mm/rmap: update to new mmu_notifier semantic

Jérôme Glisse <[email protected]> wrote:

> Replacing all mmu_notifier_invalidate_page() by mmu_notifier_invalidat_range()
> and making sure it is bracketed by call to mmu_notifier_invalidate_range_start/
> end.
>
> Note that because we can not presume the pmd value or pte value we have to
> assume the worse and unconditionaly report an invalidation as happening.
>
> Signed-off-by: Jérôme Glisse <[email protected]>
> Cc: Dan Williams <[email protected]>
> Cc: Ross Zwisler <[email protected]>
> Cc: Linus Torvalds <[email protected]>
> Cc: Bernhard Held <[email protected]>
> Cc: Adam Borowski <[email protected]>
> Cc: Andrea Arcangeli <[email protected]>
> Cc: Radim Krčmář <[email protected]>
> Cc: Wanpeng Li <[email protected]>
> Cc: Paolo Bonzini <[email protected]>
> Cc: Takashi Iwai <[email protected]>
> Cc: Nadav Amit <[email protected]>
> Cc: Mike Galbraith <[email protected]>
> Cc: Kirill A. Shutemov <[email protected]>
> Cc: axie <[email protected]>
> Cc: Andrew Morton <[email protected]>
> ---
> mm/rmap.c | 44 +++++++++++++++++++++++++++++++++++++++++---
> 1 file changed, 41 insertions(+), 3 deletions(-)
>
> diff --git a/mm/rmap.c b/mm/rmap.c
> index c8993c63eb25..da97ed525088 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -887,11 +887,21 @@ static bool page_mkclean_one(struct page *page, struct vm_area_struct *vma,
> .address = address,
> .flags = PVMW_SYNC,
> };
> + unsigned long start = address, end;
> int *cleaned = arg;
>
> + /*
> + * We have to assume the worse case ie pmd for invalidation. Note that
> + * the page can not be free from this function.
> + */
> + end = min(vma->vm_end, (start & PMD_MASK) + PMD_SIZE);
> + mmu_notifier_invalidate_range_start(vma->vm_mm, start, end);
> +
> while (page_vma_mapped_walk(&pvmw)) {
> + unsigned long cstart, cend;
> int ret = 0;
> - address = pvmw.address;
> +
> + cstart = address = pvmw.address;
> if (pvmw.pte) {
> pte_t entry;
> pte_t *pte = pvmw.pte;
> @@ -904,6 +914,7 @@ static bool page_mkclean_one(struct page *page, struct vm_area_struct *vma,
> entry = pte_wrprotect(entry);
> entry = pte_mkclean(entry);
> set_pte_at(vma->vm_mm, address, pte, entry);
> + cend = cstart + PAGE_SIZE;
> ret = 1;
> } else {
> #ifdef CONFIG_TRANSPARENT_HUGE_PAGECACHE
> @@ -918,6 +929,8 @@ static bool page_mkclean_one(struct page *page, struct vm_area_struct *vma,
> entry = pmd_wrprotect(entry);
> entry = pmd_mkclean(entry);
> set_pmd_at(vma->vm_mm, address, pmd, entry);
> + cstart &= PMD_MASK;
> + cend = cstart + PMD_SIZE;
> ret = 1;
> #else
> /* unexpected pmd-mapped page? */
> @@ -926,11 +939,13 @@ static bool page_mkclean_one(struct page *page, struct vm_area_struct *vma,
> }
>
> if (ret) {
> - mmu_notifier_invalidate_page(vma->vm_mm, address);
> + mmu_notifier_invalidate_range(vma->vm_mm, cstart, cend);
> (*cleaned)++;
> }
> }
>
> + mmu_notifier_invalidate_range_end(vma->vm_mm, start, end);
> +
> return true;
> }
>
> @@ -1324,6 +1339,7 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
> pte_t pteval;
> struct page *subpage;
> bool ret = true;
> + unsigned long start = address, end;
> enum ttu_flags flags = (enum ttu_flags)arg;
>
> /* munlock has nothing to gain from examining un-locked vmas */
> @@ -1335,6 +1351,14 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
> flags & TTU_MIGRATION, page);
> }
>
> + /*
> + * We have to assume the worse case ie pmd for invalidation. Note that
> + * the page can not be free in this function as call of try_to_unmap()
> + * must hold a reference on the page.
> + */
> + end = min(vma->vm_end, (start & PMD_MASK) + PMD_SIZE);
> + mmu_notifier_invalidate_range_start(vma->vm_mm, start, end);
> +
> while (page_vma_mapped_walk(&pvmw)) {
> /*
> * If the page is mlock()d, we cannot swap it out.
> @@ -1408,6 +1432,8 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
> set_huge_swap_pte_at(mm, address,
> pvmw.pte, pteval,
> vma_mmu_pagesize(vma));
> + mmu_notifier_invalidate_range(mm, address,
> + address + vma_mmu_pagesize(vma));

I don’t think that the notifier should be called after the PTE is set, but
after the PTE is cleared, PTE permissions are demoted (e.g., RW->RO) or
access/dirty bits are cleared. [There is an exception: if the PFN in the PTE
is changed without clearing the PTE before, but it does not apply here, and
there is a different notifier for this case.]

Therefore, IIUC, try_to_unmap_one() should only call
mmu_notifier_invalidate_range() after ptep_get_and_clear() and
ptep_clear_flush() are called. All the other calls to
mmu_notifier_invalidate_range() in this function can be removed.
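
In other words, something along these lines (illustrative sketch only, not
the actual patch; the helper name example_unmap_one_page() is made up):

/*
 * Invalidate secondary MMUs only after the CPU pte has really been
 * cleared, inside a start/end bracket.
 */
static void example_unmap_one_page(struct vm_area_struct *vma,
                                   unsigned long address, pte_t *ptep)
{
        struct mm_struct *mm = vma->vm_mm;
        pte_t pteval;

        mmu_notifier_invalidate_range_start(mm, address, address + PAGE_SIZE);

        pteval = ptep_clear_flush(vma, address, ptep);
        /* The CPU pte is gone, now flush the secondary TLBs. */
        mmu_notifier_invalidate_range(mm, address, address + PAGE_SIZE);

        /* ... install a swap/migration entry based on pteval, if needed ... */
        (void)pteval;

        mmu_notifier_invalidate_range_end(mm, address, address + PAGE_SIZE);
}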

Regards,
Nadav

2017-08-30 02:59:19

by Jerome Glisse

Subject: Re: [PATCH 02/13] mm/rmap: update to new mmu_notifier semantic

On Tue, Aug 29, 2017 at 07:46:07PM -0700, Nadav Amit wrote:
> Jérôme Glisse <[email protected]> wrote:
>
> > Replacing all mmu_notifier_invalidate_page() by mmu_notifier_invalidat_range()
> > and making sure it is bracketed by call to mmu_notifier_invalidate_range_start/
> > end.
> >
> > Note that because we can not presume the pmd value or pte value we have to
> > assume the worse and unconditionaly report an invalidation as happening.
> >
> > Signed-off-by: Jérôme Glisse <[email protected]>
> > Cc: Dan Williams <[email protected]>
> > Cc: Ross Zwisler <[email protected]>
> > Cc: Linus Torvalds <[email protected]>
> > Cc: Bernhard Held <[email protected]>
> > Cc: Adam Borowski <[email protected]>
> > Cc: Andrea Arcangeli <[email protected]>
> > Cc: Radim Krčmář <[email protected]>
> > Cc: Wanpeng Li <[email protected]>
> > Cc: Paolo Bonzini <[email protected]>
> > Cc: Takashi Iwai <[email protected]>
> > Cc: Nadav Amit <[email protected]>
> > Cc: Mike Galbraith <[email protected]>
> > Cc: Kirill A. Shutemov <[email protected]>
> > Cc: axie <[email protected]>
> > Cc: Andrew Morton <[email protected]>
> > ---
> > mm/rmap.c | 44 +++++++++++++++++++++++++++++++++++++++++---
> > 1 file changed, 41 insertions(+), 3 deletions(-)
> >
> > diff --git a/mm/rmap.c b/mm/rmap.c
> > index c8993c63eb25..da97ed525088 100644
> > --- a/mm/rmap.c
> > +++ b/mm/rmap.c
> > @@ -887,11 +887,21 @@ static bool page_mkclean_one(struct page *page, struct vm_area_struct *vma,
> > .address = address,
> > .flags = PVMW_SYNC,
> > };
> > + unsigned long start = address, end;
> > int *cleaned = arg;
> >
> > + /*
> > + * We have to assume the worse case ie pmd for invalidation. Note that
> > + * the page can not be free from this function.
> > + */
> > + end = min(vma->vm_end, (start & PMD_MASK) + PMD_SIZE);
> > + mmu_notifier_invalidate_range_start(vma->vm_mm, start, end);
> > +
> > while (page_vma_mapped_walk(&pvmw)) {
> > + unsigned long cstart, cend;
> > int ret = 0;
> > - address = pvmw.address;
> > +
> > + cstart = address = pvmw.address;
> > if (pvmw.pte) {
> > pte_t entry;
> > pte_t *pte = pvmw.pte;
> > @@ -904,6 +914,7 @@ static bool page_mkclean_one(struct page *page, struct vm_area_struct *vma,
> > entry = pte_wrprotect(entry);
> > entry = pte_mkclean(entry);
> > set_pte_at(vma->vm_mm, address, pte, entry);
> > + cend = cstart + PAGE_SIZE;
> > ret = 1;
> > } else {
> > #ifdef CONFIG_TRANSPARENT_HUGE_PAGECACHE
> > @@ -918,6 +929,8 @@ static bool page_mkclean_one(struct page *page, struct vm_area_struct *vma,
> > entry = pmd_wrprotect(entry);
> > entry = pmd_mkclean(entry);
> > set_pmd_at(vma->vm_mm, address, pmd, entry);
> > + cstart &= PMD_MASK;
> > + cend = cstart + PMD_SIZE;
> > ret = 1;
> > #else
> > /* unexpected pmd-mapped page? */
> > @@ -926,11 +939,13 @@ static bool page_mkclean_one(struct page *page, struct vm_area_struct *vma,
> > }
> >
> > if (ret) {
> > - mmu_notifier_invalidate_page(vma->vm_mm, address);
> > + mmu_notifier_invalidate_range(vma->vm_mm, cstart, cend);
> > (*cleaned)++;
> > }
> > }
> >
> > + mmu_notifier_invalidate_range_end(vma->vm_mm, start, end);
> > +
> > return true;
> > }
> >
> > @@ -1324,6 +1339,7 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
> > pte_t pteval;
> > struct page *subpage;
> > bool ret = true;
> > + unsigned long start = address, end;
> > enum ttu_flags flags = (enum ttu_flags)arg;
> >
> > /* munlock has nothing to gain from examining un-locked vmas */
> > @@ -1335,6 +1351,14 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
> > flags & TTU_MIGRATION, page);
> > }
> >
> > + /*
> > + * We have to assume the worse case ie pmd for invalidation. Note that
> > + * the page can not be free in this function as call of try_to_unmap()
> > + * must hold a reference on the page.
> > + */
> > + end = min(vma->vm_end, (start & PMD_MASK) + PMD_SIZE);
> > + mmu_notifier_invalidate_range_start(vma->vm_mm, start, end);
> > +
> > while (page_vma_mapped_walk(&pvmw)) {
> > /*
> > * If the page is mlock()d, we cannot swap it out.
> > @@ -1408,6 +1432,8 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
> > set_huge_swap_pte_at(mm, address,
> > pvmw.pte, pteval,
> > vma_mmu_pagesize(vma));
> > + mmu_notifier_invalidate_range(mm, address,
> > + address + vma_mmu_pagesize(vma));
>
> I don’t think that the notifier should be called after the PTE is set, but
> after the PTE is cleared, PTE permissions are demoted (e.g., RW->RO) or
> access/dirty bits are cleared. [There is an exception: if the PFN in the PTE
> is changed without clearing the PTE before, but it does not apply here, and
> there is a different notifier for this case.]
>
> Therefore, IIUC, try_to_umap_one() should only call
> mmu_notifier_invalidate_range() after ptep_get_and_clear() and
> ptep_clear_flush() are called. All the other calls to
> mmu_notifier_invalidate_range() in this function can be removed.

Yes, it would simplify the patch. I was trying to optimize for the case
where we restore the pte to its original value after ptep_clear_flush()
or ptep_get_and_clear(), as in that case there is no need to invalidate
any secondary page table, but that is an overkill optimization that just
adds too much complexity.

Jérôme

2017-08-30 03:16:49

by Nadav Amit

Subject: Re: [PATCH 02/13] mm/rmap: update to new mmu_notifier semantic

Jerome Glisse <[email protected]> wrote:

> On Tue, Aug 29, 2017 at 07:46:07PM -0700, Nadav Amit wrote:
>> Jérôme Glisse <[email protected]> wrote:
>>
>>> Replacing all mmu_notifier_invalidate_page() by mmu_notifier_invalidat_range()
>>> and making sure it is bracketed by call to mmu_notifier_invalidate_range_start/
>>> end.
>>>
>>> Note that because we can not presume the pmd value or pte value we have to
>>> assume the worse and unconditionaly report an invalidation as happening.
>>>
>>> Signed-off-by: Jérôme Glisse <[email protected]>
>>> Cc: Dan Williams <[email protected]>
>>> Cc: Ross Zwisler <[email protected]>
>>> Cc: Linus Torvalds <[email protected]>
>>> Cc: Bernhard Held <[email protected]>
>>> Cc: Adam Borowski <[email protected]>
>>> Cc: Andrea Arcangeli <[email protected]>
>>> Cc: Radim Krčmář <[email protected]>
>>> Cc: Wanpeng Li <[email protected]>
>>> Cc: Paolo Bonzini <[email protected]>
>>> Cc: Takashi Iwai <[email protected]>
>>> Cc: Nadav Amit <[email protected]>
>>> Cc: Mike Galbraith <[email protected]>
>>> Cc: Kirill A. Shutemov <[email protected]>
>>> Cc: axie <[email protected]>
>>> Cc: Andrew Morton <[email protected]>
>>> ---
>>> mm/rmap.c | 44 +++++++++++++++++++++++++++++++++++++++++---
>>> 1 file changed, 41 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/mm/rmap.c b/mm/rmap.c
>>> index c8993c63eb25..da97ed525088 100644
>>> --- a/mm/rmap.c
>>> +++ b/mm/rmap.c
>>> @@ -887,11 +887,21 @@ static bool page_mkclean_one(struct page *page, struct vm_area_struct *vma,
>>> .address = address,
>>> .flags = PVMW_SYNC,
>>> };
>>> + unsigned long start = address, end;
>>> int *cleaned = arg;
>>>
>>> + /*
>>> + * We have to assume the worse case ie pmd for invalidation. Note that
>>> + * the page can not be free from this function.
>>> + */
>>> + end = min(vma->vm_end, (start & PMD_MASK) + PMD_SIZE);
>>> + mmu_notifier_invalidate_range_start(vma->vm_mm, start, end);
>>> +
>>> while (page_vma_mapped_walk(&pvmw)) {
>>> + unsigned long cstart, cend;
>>> int ret = 0;
>>> - address = pvmw.address;
>>> +
>>> + cstart = address = pvmw.address;
>>> if (pvmw.pte) {
>>> pte_t entry;
>>> pte_t *pte = pvmw.pte;
>>> @@ -904,6 +914,7 @@ static bool page_mkclean_one(struct page *page, struct vm_area_struct *vma,
>>> entry = pte_wrprotect(entry);
>>> entry = pte_mkclean(entry);
>>> set_pte_at(vma->vm_mm, address, pte, entry);
>>> + cend = cstart + PAGE_SIZE;
>>> ret = 1;
>>> } else {
>>> #ifdef CONFIG_TRANSPARENT_HUGE_PAGECACHE
>>> @@ -918,6 +929,8 @@ static bool page_mkclean_one(struct page *page, struct vm_area_struct *vma,
>>> entry = pmd_wrprotect(entry);
>>> entry = pmd_mkclean(entry);
>>> set_pmd_at(vma->vm_mm, address, pmd, entry);
>>> + cstart &= PMD_MASK;
>>> + cend = cstart + PMD_SIZE;
>>> ret = 1;
>>> #else
>>> /* unexpected pmd-mapped page? */
>>> @@ -926,11 +939,13 @@ static bool page_mkclean_one(struct page *page, struct vm_area_struct *vma,
>>> }
>>>
>>> if (ret) {
>>> - mmu_notifier_invalidate_page(vma->vm_mm, address);
>>> + mmu_notifier_invalidate_range(vma->vm_mm, cstart, cend);
>>> (*cleaned)++;
>>> }
>>> }
>>>
>>> + mmu_notifier_invalidate_range_end(vma->vm_mm, start, end);
>>> +
>>> return true;
>>> }
>>>
>>> @@ -1324,6 +1339,7 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>>> pte_t pteval;
>>> struct page *subpage;
>>> bool ret = true;
>>> + unsigned long start = address, end;
>>> enum ttu_flags flags = (enum ttu_flags)arg;
>>>
>>> /* munlock has nothing to gain from examining un-locked vmas */
>>> @@ -1335,6 +1351,14 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>>> flags & TTU_MIGRATION, page);
>>> }
>>>
>>> + /*
>>> + * We have to assume the worse case ie pmd for invalidation. Note that
>>> + * the page can not be free in this function as call of try_to_unmap()
>>> + * must hold a reference on the page.
>>> + */
>>> + end = min(vma->vm_end, (start & PMD_MASK) + PMD_SIZE);
>>> + mmu_notifier_invalidate_range_start(vma->vm_mm, start, end);
>>> +
>>> while (page_vma_mapped_walk(&pvmw)) {
>>> /*
>>> * If the page is mlock()d, we cannot swap it out.
>>> @@ -1408,6 +1432,8 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>>> set_huge_swap_pte_at(mm, address,
>>> pvmw.pte, pteval,
>>> vma_mmu_pagesize(vma));
>>> + mmu_notifier_invalidate_range(mm, address,
>>> + address + vma_mmu_pagesize(vma));
>>
>> I don’t think that the notifier should be called after the PTE is set, but
>> after the PTE is cleared, PTE permissions are demoted (e.g., RW->RO) or
>> access/dirty bits are cleared. [There is an exception: if the PFN in the PTE
>> is changed without clearing the PTE before, but it does not apply here, and
>> there is a different notifier for this case.]
>>
>> Therefore, IIUC, try_to_umap_one() should only call
>> mmu_notifier_invalidate_range() after ptep_get_and_clear() and
>> ptep_clear_flush() are called. All the other calls to
>> mmu_notifier_invalidate_range() in this function can be removed.
>
> Yes it would simplify the patch, i was trying to optimize for the case
> where we restore the pte to its original value after ptep_clear_flush()
> or ptep_get_and_clear() as in this case there is no need to invalidate
> any secondary page table but that's an overkill optimization that just
> add too much complexity.

Interesting. Actually, prior to your changes, it seems that the break
statements would skip mmu_notifier_invalidate_page() when the PTE is not
changed. So why not just change mmu_notifier_invalidate_page() into
mmu_notifier_invalidate_range()?

Regards,
Nadav

2017-08-30 03:18:45

by Nadav Amit

[permalink] [raw]
Subject: Re: [PATCH 02/13] mm/rmap: update to new mmu_notifier semantic

Nadav Amit <[email protected]> wrote:

> Jerome Glisse <[email protected]> wrote:
>
>> On Tue, Aug 29, 2017 at 07:46:07PM -0700, Nadav Amit wrote:
>>> Jérôme Glisse <[email protected]> wrote:
>>>
>>>> Replacing all mmu_notifier_invalidate_page() by mmu_notifier_invalidat_range()
>>>> and making sure it is bracketed by call to mmu_notifier_invalidate_range_start/
>>>> end.
>>>>
>>>> Note that because we can not presume the pmd value or pte value we have to
>>>> assume the worse and unconditionaly report an invalidation as happening.
>>>>
>>>> Signed-off-by: Jérôme Glisse <[email protected]>
>>>> Cc: Dan Williams <[email protected]>
>>>> Cc: Ross Zwisler <[email protected]>
>>>> Cc: Linus Torvalds <[email protected]>
>>>> Cc: Bernhard Held <[email protected]>
>>>> Cc: Adam Borowski <[email protected]>
>>>> Cc: Andrea Arcangeli <[email protected]>
>>>> Cc: Radim Krčmář <[email protected]>
>>>> Cc: Wanpeng Li <[email protected]>
>>>> Cc: Paolo Bonzini <[email protected]>
>>>> Cc: Takashi Iwai <[email protected]>
>>>> Cc: Nadav Amit <[email protected]>
>>>> Cc: Mike Galbraith <[email protected]>
>>>> Cc: Kirill A. Shutemov <[email protected]>
>>>> Cc: axie <[email protected]>
>>>> Cc: Andrew Morton <[email protected]>
>>>> ---
>>>> mm/rmap.c | 44 +++++++++++++++++++++++++++++++++++++++++---
>>>> 1 file changed, 41 insertions(+), 3 deletions(-)
>>>>
>>>> diff --git a/mm/rmap.c b/mm/rmap.c
>>>> index c8993c63eb25..da97ed525088 100644
>>>> --- a/mm/rmap.c
>>>> +++ b/mm/rmap.c
>>>> @@ -887,11 +887,21 @@ static bool page_mkclean_one(struct page *page, struct vm_area_struct *vma,
>>>> .address = address,
>>>> .flags = PVMW_SYNC,
>>>> };
>>>> + unsigned long start = address, end;
>>>> int *cleaned = arg;
>>>>
>>>> + /*
>>>> + * We have to assume the worse case ie pmd for invalidation. Note that
>>>> + * the page can not be free from this function.
>>>> + */
>>>> + end = min(vma->vm_end, (start & PMD_MASK) + PMD_SIZE);
>>>> + mmu_notifier_invalidate_range_start(vma->vm_mm, start, end);
>>>> +
>>>> while (page_vma_mapped_walk(&pvmw)) {
>>>> + unsigned long cstart, cend;
>>>> int ret = 0;
>>>> - address = pvmw.address;
>>>> +
>>>> + cstart = address = pvmw.address;
>>>> if (pvmw.pte) {
>>>> pte_t entry;
>>>> pte_t *pte = pvmw.pte;
>>>> @@ -904,6 +914,7 @@ static bool page_mkclean_one(struct page *page, struct vm_area_struct *vma,
>>>> entry = pte_wrprotect(entry);
>>>> entry = pte_mkclean(entry);
>>>> set_pte_at(vma->vm_mm, address, pte, entry);
>>>> + cend = cstart + PAGE_SIZE;
>>>> ret = 1;
>>>> } else {
>>>> #ifdef CONFIG_TRANSPARENT_HUGE_PAGECACHE
>>>> @@ -918,6 +929,8 @@ static bool page_mkclean_one(struct page *page, struct vm_area_struct *vma,
>>>> entry = pmd_wrprotect(entry);
>>>> entry = pmd_mkclean(entry);
>>>> set_pmd_at(vma->vm_mm, address, pmd, entry);
>>>> + cstart &= PMD_MASK;
>>>> + cend = cstart + PMD_SIZE;
>>>> ret = 1;
>>>> #else
>>>> /* unexpected pmd-mapped page? */
>>>> @@ -926,11 +939,13 @@ static bool page_mkclean_one(struct page *page, struct vm_area_struct *vma,
>>>> }
>>>>
>>>> if (ret) {
>>>> - mmu_notifier_invalidate_page(vma->vm_mm, address);
>>>> + mmu_notifier_invalidate_range(vma->vm_mm, cstart, cend);
>>>> (*cleaned)++;
>>>> }
>>>> }
>>>>
>>>> + mmu_notifier_invalidate_range_end(vma->vm_mm, start, end);
>>>> +
>>>> return true;
>>>> }
>>>>
>>>> @@ -1324,6 +1339,7 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>>>> pte_t pteval;
>>>> struct page *subpage;
>>>> bool ret = true;
>>>> + unsigned long start = address, end;
>>>> enum ttu_flags flags = (enum ttu_flags)arg;
>>>>
>>>> /* munlock has nothing to gain from examining un-locked vmas */
>>>> @@ -1335,6 +1351,14 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>>>> flags & TTU_MIGRATION, page);
>>>> }
>>>>
>>>> + /*
>>>> + * We have to assume the worse case ie pmd for invalidation. Note that
>>>> + * the page can not be free in this function as call of try_to_unmap()
>>>> + * must hold a reference on the page.
>>>> + */
>>>> + end = min(vma->vm_end, (start & PMD_MASK) + PMD_SIZE);
>>>> + mmu_notifier_invalidate_range_start(vma->vm_mm, start, end);
>>>> +
>>>> while (page_vma_mapped_walk(&pvmw)) {
>>>> /*
>>>> * If the page is mlock()d, we cannot swap it out.
>>>> @@ -1408,6 +1432,8 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>>>> set_huge_swap_pte_at(mm, address,
>>>> pvmw.pte, pteval,
>>>> vma_mmu_pagesize(vma));
>>>> + mmu_notifier_invalidate_range(mm, address,
>>>> + address + vma_mmu_pagesize(vma));
>>>
>>> I don’t think that the notifier should be called after the PTE is set, but
>>> after the PTE is cleared, PTE permissions are demoted (e.g., RW->RO) or
>>> access/dirty bits are cleared. [There is an exception: if the PFN in the PTE
>>> is changed without clearing the PTE before, but it does not apply here, and
>>> there is a different notifier for this case.]
>>>
>>> Therefore, IIUC, try_to_umap_one() should only call
>>> mmu_notifier_invalidate_range() after ptep_get_and_clear() and
>>> ptep_clear_flush() are called. All the other calls to
>>> mmu_notifier_invalidate_range() in this function can be removed.
>>
>> Yes it would simplify the patch, i was trying to optimize for the case
>> where we restore the pte to its original value after ptep_clear_flush()
>> or ptep_get_and_clear() as in this case there is no need to invalidate
>> any secondary page table but that's an overkill optimization that just
>> add too much complexity.
>
> Interesting. Actually, prior to your changes, it seems that the break
> statements would skip mmu_notifier_invalidate_page() when the PTE is not
> changed. So why not just change mmu_notifier_invalidate_page() into
> mmu_notifier_invalidate_range() ?

Sorry - I noticed it was actually wrong before (as you noted) at least
in one case:
(unlikely(PageSwapBacked(page) != PageSwapCache(page)))

Regards,
Nadav

2017-08-30 06:13:55

by Leon Romanovsky

[permalink] [raw]
Subject: Re: [PATCH 05/13] IB/umem: update to new mmu_notifier semantic

On Tue, Aug 29, 2017 at 07:54:39PM -0400, Jérôme Glisse wrote:
> Call to mmu_notifier_invalidate_page() are replaced by call to
> mmu_notifier_invalidate_range() and thus call are bracketed by
> call to mmu_notifier_invalidate_range_start()/end()
>
> Remove now useless invalidate_page callback.
>
> Signed-off-by: Jérôme Glisse <[email protected]>
> Cc: Leon Romanovsky <[email protected]>
> Cc: [email protected]
> Cc: Artemy Kovalyov <[email protected]>
> Cc: Doug Ledford <[email protected]>
> Cc: Kirill A. Shutemov <[email protected]>
> Cc: Andrew Morton <[email protected]>
> Cc: Linus Torvalds <[email protected]>
> Cc: Andrea Arcangeli <[email protected]>
> ---
> drivers/infiniband/core/umem_odp.c | 19 -------------------
> 1 file changed, 19 deletions(-)
>

Hi Jerome,

I took this series for testing on Mellanox ConnectX-4/5 cards, which
are the devices beneath this UMEM ODP code.

As a reference, I took the latest Doug's for-next + Linus's master
(36fde05f3fb5) + the whole series.

Thanks

2017-08-30 06:18:57

by Christian König

[permalink] [raw]
Subject: Re: [PATCH 04/13] drm/amdgpu: update to new mmu_notifier semantic

On 30.08.2017 at 01:54, Jérôme Glisse wrote:
> Call to mmu_notifier_invalidate_page() are replaced by call to
> mmu_notifier_invalidate_range() and thus call are bracketed by
> call to mmu_notifier_invalidate_range_start()/end()
>
> Remove now useless invalidate_page callback.
>
> Signed-off-by: Jérôme Glisse <[email protected]>

Reviewed-by: Christian König <[email protected]>

The general approach is Acked-by: Christian König
<[email protected]>.

It's something very welcome since I was one of the people (together with
the Intel guys) who failed to recognize what this callback really does.

Regards,
Christian.

> Cc: [email protected]
> Cc: Felix Kuehling <[email protected]>
> Cc: Christian König <[email protected]>
> Cc: Alex Deucher <[email protected]>
> Cc: Kirill A. Shutemov <[email protected]>
> Cc: Andrew Morton <[email protected]>
> Cc: Linus Torvalds <[email protected]>
> Cc: Andrea Arcangeli <[email protected]>
> ---
> drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c | 31 -------------------------------
> 1 file changed, 31 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
> index 6558a3ed57a7..e1cde6b80027 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
> @@ -147,36 +147,6 @@ static void amdgpu_mn_invalidate_node(struct amdgpu_mn_node *node,
> }
>
> /**
> - * amdgpu_mn_invalidate_page - callback to notify about mm change
> - *
> - * @mn: our notifier
> - * @mn: the mm this callback is about
> - * @address: address of invalidate page
> - *
> - * Invalidation of a single page. Blocks for all BOs mapping it
> - * and unmap them by move them into system domain again.
> - */
> -static void amdgpu_mn_invalidate_page(struct mmu_notifier *mn,
> - struct mm_struct *mm,
> - unsigned long address)
> -{
> - struct amdgpu_mn *rmn = container_of(mn, struct amdgpu_mn, mn);
> - struct interval_tree_node *it;
> -
> - mutex_lock(&rmn->lock);
> -
> - it = interval_tree_iter_first(&rmn->objects, address, address);
> - if (it) {
> - struct amdgpu_mn_node *node;
> -
> - node = container_of(it, struct amdgpu_mn_node, it);
> - amdgpu_mn_invalidate_node(node, address, address);
> - }
> -
> - mutex_unlock(&rmn->lock);
> -}
> -
> -/**
> * amdgpu_mn_invalidate_range_start - callback to notify about mm change
> *
> * @mn: our notifier
> @@ -215,7 +185,6 @@ static void amdgpu_mn_invalidate_range_start(struct mmu_notifier *mn,
>
> static const struct mmu_notifier_ops amdgpu_mn_ops = {
> .release = amdgpu_mn_release,
> - .invalidate_page = amdgpu_mn_invalidate_page,
> .invalidate_range_start = amdgpu_mn_invalidate_range_start,
> };
>


2017-08-30 08:41:06

by Mike Galbraith

[permalink] [raw]
Subject: Re: [PATCH 00/13] mmu_notifier kill invalidate_page callback

On Tue, 2017-08-29 at 20:56 -0400, Jerome Glisse wrote:
> On Tue, Aug 29, 2017 at 05:11:24PM -0700, Linus Torvalds wrote:
>
> > People - *especially* the people who saw issues under KVM - can you
> > try out Jérôme's patch-series? I added some people to the cc, the full
> > series is on lkml. Jérôme - do you have a git branch for people to
> > test that they could easily pull and try out?
>
> https://cgit.freedesktop.org/~glisse/linux mmu-notifier branch
> git://people.freedesktop.org/~glisse/linux

Looks good here.

I reproduced fairly quickly with RT host and 1 RT guest by just having
the guest do a parallel kbuild over NFS (the guest had to be restored
afterward, was corrupted).  I'm currently flogging 2 guests as well as
the host, whimper free.  I'll let the lot broil for a while longer, but
at this point, smoke/flame appearance seems comfortingly unlikely.

-Mike


2017-08-30 14:58:19

by Adam Borowski

[permalink] [raw]
Subject: Re: [PATCH 00/13] mmu_notifier kill invalidate_page callback

On Tue, Aug 29, 2017 at 08:56:15PM -0400, Jerome Glisse wrote:
> I will wait for people to test and for result of my own test before
> reposting if need be, otherwise i will post as separate patch.
>
> > But from a _very_ quick read-through this looks fine. But it obviously
> > needs testing.
> >
> > People - *especially* the people who saw issues under KVM - can you
> > try out Jérôme's patch-series? I added some people to the cc, the full
> > series is on lkml. Jérôme - do you have a git branch for people to
> > test that they could easily pull and try out?
>
> https://cgit.freedesktop.org/~glisse/linux mmu-notifier branch
> git://people.freedesktop.org/~glisse/linux

Tested your branch as of 10f07641, on a long list of guest VMs.
No earth-shattering kaboom.


Meow!
--
⢀⣴⠾⠻⢶⣦⠀
⣾⠁⢰⠒⠀⣿⡁ Vat kind uf sufficiently advanced technology iz dis!?
⢿⡄⠘⠷⠚⠋⠀ -- Genghis Ht'rok'din
⠈⠳⣄⠀⠀⠀⠀

2017-08-30 16:52:54

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [PATCH 02/13] mm/rmap: update to new mmu_notifier semantic

Hello Jerome,

On Tue, Aug 29, 2017 at 07:54:36PM -0400, Jerome Glisse wrote:
> Replacing all mmu_notifier_invalidate_page() by mmu_notifier_invalidat_range()
> and making sure it is bracketed by call to mmu_notifier_invalidate_range_start/
> end.
>
> Note that because we can not presume the pmd value or pte value we have to
> assume the worse and unconditionaly report an invalidation as happening.

I pointed out in an earlier email that ->invalidate_range can only be
implemented (as a mutually exclusive alternative to
->invalidate_range_start/end) by secondary MMUs that share the very
same pagetables with the core Linux VM of the primary MMU, and those
invalidate_range callbacks are already called by
__mmu_notifier_invalidate_range_end. The other bit is done by the MMU
gather (because mmu_notifier_invalidate_range_start is a noop for
drivers that implement ->invalidate_range).

Sharing the same pagetables is what allows ->invalidate_range to work:
when the Linux MM changes the primary MMU pagetables it automatically
updates the secondary MMU at the same time (because of the pagetable
sharing between primary and secondary MMUs). So all that is left to do
is an invalidate_range to flush the secondary MMU TLBs.

There's no need of action in mmu_notifier_invalidate_range_start for
those pagetable sharing drivers because there's no risk of a secondary
MMU shadow pagetable layer to be re-created in between
mmu_notifier_invalidate_range_start and the actual pagetable
invalidate because again the pagetables are shared.

void __mmu_notifier_invalidate_range_end(struct mm_struct *mm,
					 unsigned long start, unsigned long end)
{
	struct mmu_notifier *mn;
	int id;

	id = srcu_read_lock(&srcu);
	hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist) {
		/*
		 * Call invalidate_range here too to avoid the need for the
		 * subsystem of having to register an invalidate_range_end
		 * call-back when there is invalidate_range already. Usually a
		 * subsystem registers either invalidate_range_start()/end() or
		 * invalidate_range(), so this will be no additional overhead
		 * (besides the pointer check).
		 */
		if (mn->ops->invalidate_range)
			mn->ops->invalidate_range(mn, mm, start, end);
			^^^^^^^^^^^^^^^^^^^^^^^^^
		if (mn->ops->invalidate_range_end)
			mn->ops->invalidate_range_end(mn, mm, start, end);
	}
	srcu_read_unlock(&srcu, id);
}

So this conversion from invalidate_page to invalidate_range looks
superflous and the final mmu_notifier_invalidate_range_end should be
enough.

AFAIK only amd_iommu_v2 and intel-svm (svm as in shared virtual memory)
use it.

My suggestion is to remove from below all
mmu_notifier_invalidate_range calls and keep only the
mmu_notifier_invalidate_range_end and test both amd_iommu_v2 and
intel-svm with it under heavy swapping.

The only critical constraint to keep for invalidate_range to stay safe
with a single call of mmu_notifier_invalidate_range_end after put_page
is that the put_page cannot be the last put_page. That only applies to
the case where the page isn't freed through MMU gather (MMU gather
calls mmu_notifier_invalidate_range of its own before freeing the
page, whereas MMU gather does nothing for drivers using
invalidate_range_start/end because invalidate_range_start acted as a
barrier to avoid establishing mappings on the secondary MMUs for
those).

Not strictly required but I think it would be safer and more efficient
to replace the put_page with something like:

static inline void put_page_not_freeing(struct page *page)
{
	page = compound_head(page);

	if (put_page_testzero(page))
		VM_WARN_ON_PAGE(1, page);
}
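
For illustration, the discard path in try_to_unmap_one() would then become
something like this (a sketch of the suggestion, not part of the posted patch):

discard:
	page_remove_rmap(subpage, PageHuge(page));
	/*
	 * The caller of try_to_unmap() holds an extra reference, so this
	 * can never be the last put and the page cannot be freed before
	 * mmu_notifier_invalidate_range_end() runs.
	 */
	put_page_not_freeing(page);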

Thanks!
Andrea

> @@ -926,11 +939,13 @@ static bool page_mkclean_one(struct page *page, struct vm_area_struct *vma,
> }
>
> if (ret) {
> - mmu_notifier_invalidate_page(vma->vm_mm, address);
> + mmu_notifier_invalidate_range(vma->vm_mm, cstart, cend);
> (*cleaned)++;
> }
> }
>
> + mmu_notifier_invalidate_range_end(vma->vm_mm, start, end);
> +
> return true;
> }
>
> @@ -1324,6 +1339,7 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
> pte_t pteval;
> struct page *subpage;
> bool ret = true;
> + unsigned long start = address, end;
> enum ttu_flags flags = (enum ttu_flags)arg;
>
> /* munlock has nothing to gain from examining un-locked vmas */
> @@ -1335,6 +1351,14 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
> flags & TTU_MIGRATION, page);
> }
>
> + /*
> + * We have to assume the worse case ie pmd for invalidation. Note that
> + * the page can not be free in this function as call of try_to_unmap()
> + * must hold a reference on the page.
> + */
> + end = min(vma->vm_end, (start & PMD_MASK) + PMD_SIZE);
> + mmu_notifier_invalidate_range_start(vma->vm_mm, start, end);
> +
> while (page_vma_mapped_walk(&pvmw)) {
> /*
> * If the page is mlock()d, we cannot swap it out.
> @@ -1408,6 +1432,8 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
> set_huge_swap_pte_at(mm, address,
> pvmw.pte, pteval,
> vma_mmu_pagesize(vma));
> + mmu_notifier_invalidate_range(mm, address,
> + address + vma_mmu_pagesize(vma));
> } else {
> dec_mm_counter(mm, mm_counter(page));
> set_pte_at(mm, address, pvmw.pte, pteval);
> @@ -1435,6 +1461,8 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
> if (pte_soft_dirty(pteval))
> swp_pte = pte_swp_mksoft_dirty(swp_pte);
> set_pte_at(mm, address, pvmw.pte, swp_pte);
> + mmu_notifier_invalidate_range(mm, address,
> + address + PAGE_SIZE);
> } else if (PageAnon(page)) {
> swp_entry_t entry = { .val = page_private(subpage) };
> pte_t swp_pte;
> @@ -1445,6 +1473,9 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
> if (unlikely(PageSwapBacked(page) != PageSwapCache(page))) {
> WARN_ON_ONCE(1);
> ret = false;
> + /* We have to invalidate as we cleared the pte */
> + mmu_notifier_invalidate_range(mm, address,
> + address + PAGE_SIZE);
> page_vma_mapped_walk_done(&pvmw);
> break;
> }
> @@ -1453,6 +1484,9 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
> if (!PageSwapBacked(page)) {
> if (!PageDirty(page)) {
> dec_mm_counter(mm, MM_ANONPAGES);
> + /* Invalidate as we cleared the pte */
> + mmu_notifier_invalidate_range(mm,
> + address, address + PAGE_SIZE);
> goto discard;
> }
>
> @@ -1485,13 +1519,17 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
> if (pte_soft_dirty(pteval))
> swp_pte = pte_swp_mksoft_dirty(swp_pte);
> set_pte_at(mm, address, pvmw.pte, swp_pte);
> + mmu_notifier_invalidate_range(mm, address,
> + address + PAGE_SIZE);
> } else
> dec_mm_counter(mm, mm_counter_file(page));
> discard:
> page_remove_rmap(subpage, PageHuge(page));
> put_page(page);
> - mmu_notifier_invalidate_page(mm, address);
> }
> +
> + mmu_notifier_invalidate_range_end(vma->vm_mm, start, end);
> +
> return ret;
> }
>
> --
> 2.13.5
>

2017-08-30 17:27:51

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [PATCH 02/13] mm/rmap: update to new mmu_notifier semantic

On Tue, Aug 29, 2017 at 07:46:07PM -0700, Nadav Amit wrote:
> Therefore, IIUC, try_to_umap_one() should only call
> mmu_notifier_invalidate_range() after ptep_get_and_clear() and

That would trigger an unnecessary double call to
->invalidate_range(): both from mmu_notifier_invalidate_range() after
ptep_get_and_clear() and later at the end in
mmu_notifier_invalidate_range_end().

The only advantage of adding a mmu_notifier_invalidate_range() after
ptep_get_and_clear() is to flush the secondary MMU TLBs (keep in mind
the pagetables have to be shared with the primary MMU in order to use
the ->invalidate_range()) inside the PT lock.

So whether mmu_notifier_invalidate_range() after ptep_get_and_clear() is
needed or not again boils down to whether the old code calling
->invalidate_page outside the PT lock was always broken before or
not. I don't see why exactly it was broken, we even hold the page lock
there so I don't see a migration race possible either. Of course the
constraint to be safe is that the put_page in try_to_unmap_one cannot
be the last one, and that had to be enforced by the caller taking an
additional reference on it.

One can wonder if the primary MMU TLB flush in ptep_clear_flush
(should_defer_flush returning false) could be put after releasing the
PT lock too (because that's not different from deferring the secondary
MMU notifier TLB flush in ->invalidate_range down to
mmu_notifier_invalidate_range_end) even if TTU_BATCH_FLUSH isn't set,
which may be safe too for the same reasons.

When should_defer_flush returns true we already defer the primary MMU
TLB flush to much later to even mmu_notifier_invalidate_range_end, not
just after the PT lock release so at least when should_defer_flush is
true, it looks obviously safe to defer the secondary MMU TLB flush to
mmu_notifier_invalidate_range_end for the drivers implementing
->invalidate_range.

If I'm wrong and all TLB flushes must happen inside the PT lock, then
we should at least reconsider the implicit call to ->invalidate_range
method from mmu_notifier_invalidate_range_end or we would call it
twice unnecessarily which doesn't look optimal. Either ways this
doesn't look optimal. We would need to introduce a
mmu_notifier_invalidate_range_end_full that calls also
->invalidate_range in such case so we skip the ->invalidate_range call
in mmu_notifier_invalidate_range_end if we put an explicit
mmu_notifier_invalidate_range() after ptep_get_and_clear inside the PT
lock like you suggested above.

Thanks,
Andrea

2017-08-30 17:51:44

by Jerome Glisse

[permalink] [raw]
Subject: Re: [PATCH 02/13] mm/rmap: update to new mmu_notifier semantic

On Wed, Aug 30, 2017 at 06:52:50PM +0200, Andrea Arcangeli wrote:
> Hello Jerome,
>
> On Tue, Aug 29, 2017 at 07:54:36PM -0400, Jerome Glisse wrote:
> > Replacing all mmu_notifier_invalidate_page() by mmu_notifier_invalidat_range()
> > and making sure it is bracketed by call to mmu_notifier_invalidate_range_start/
> > end.
> >
> > Note that because we can not presume the pmd value or pte value we have to
> > assume the worse and unconditionaly report an invalidation as happening.
>
> I pointed out in earlier email ->invalidate_range can only be
> implemented (as mutually exclusive alternative to
> ->invalidate_range_start/end) by secondary MMUs that shares the very
> same pagetables with the core linux VM of the primary MMU, and those
> invalidate_range are already called by
> __mmu_notifier_invalidate_range_end. The other bit is done by the MMU
> gather (because mmu_notifier_invalidate_range_start is a noop for
> drivers that implement s->invalidate_range).
>
> The difference between sharing the same pagetables or not allows for
> ->invalidate_range to work because when the Linux MM changes the
> primary MMU pagetables it also automatically invalidated updates
> secondary MMU at the same time (because of the pagetable sharing
> between primary and secondary MMUs). So then all that is left to do is
> an invalidate_range to flush the secondary MMU TLBs.
>
> There's no need of action in mmu_notifier_invalidate_range_start for
> those pagetable sharing drivers because there's no risk of a secondary
> MMU shadow pagetable layer to be re-created in between
> mmu_notifier_invalidate_range_start and the actual pagetable
> invalidate because again the pagetables are shared.

Yes, but we still need to call invalidate_range() while under the page
table spinlock, as the hardware sharing the CPU page table has its own
TLB (not even talking about the TLB of each individual device that
uses PASID/ATS).

If we do not call it under the spinlock then there is a chance that something
else gets mapped between the time we drop the CPU page table lock and the
time we call mmu_notifier_invalidate_range_end(), which itself calls
invalidate_range().

Actually I would remove the call to invalidate_range() from range_end()
and audit all places where we call range_end() to assert that there is
a call to invalidate_range() under the page table spinlock that precedes
it.
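
To make the proposed ordering explicit, a call site would roughly look like
this (sketch only, assuming range_end() no longer calls ->invalidate_range()
implicitly):

	mmu_notifier_invalidate_range_start(mm, start, end);

	spin_lock(ptl);
	pteval = ptep_get_and_clear(mm, address, ptep);
	flush_tlb_page(vma, address);			/* primary MMU TLB   */
	mmu_notifier_invalidate_range(mm, address,	/* secondary MMU TLB */
				      address + PAGE_SIZE);
	spin_unlock(ptl);

	mmu_notifier_invalidate_range_end(mm, start, end);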

>
> void __mmu_notifier_invalidate_range_end(struct mm_struct *mm,
> unsigned long start, unsigned long end)
> {
> struct mmu_notifier *mn;
> int id;
>
> id = srcu_read_lock(&srcu);
> hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist) {
> /*
> * Call invalidate_range here too to avoid the need for the
> * subsystem of having to register an invalidate_range_end
> * call-back when there is invalidate_range already. Usually a
> * subsystem registers either invalidate_range_start()/end() or
> * invalidate_range(), so this will be no additional overhead
> * (besides the pointer check).
> */
> if (mn->ops->invalidate_range)
> mn->ops->invalidate_range(mn, mm, start, end);
> ^^^^^^^^^^^^^^^^^^^^^^^^^
> if (mn->ops->invalidate_range_end)
> mn->ops->invalidate_range_end(mn, mm, start, end);
> }
> srcu_read_unlock(&srcu, id);
> }
>
> So this conversion from invalidate_page to invalidate_range looks
> superflous and the final mmu_notifier_invalidate_range_end should be
> enough.

See above.

> AFIK only amd_iommu_v2 and intel-svm (svm as in shared virtual memory)
> uses it.

powerpc has something similar too but I don't know its status.

> My suggestion is to remove from below all
> mmu_notifier_invalidate_range calls and keep only the
> mmu_notifier_invalidate_range_end and test both amd_iommu_v2 and
> intel-svm with it under heavy swapping.
>
> The only critical constraint to keep for invalidate_range to stay safe
> with a single call of mmu_notifier_invalidate_range_end after put_page
> is that the put_page cannot be the last put_page. That only applies to
> the case where the page isn't freed through MMU gather (MMU gather
> calls mmu_notifier_invalidate_range of its own before freeing the
> page, as opposed mmu gather does nothing for drivers using
> invalidate_range_start/end because invalidate_range_start acted as
> barrier to avoid establishing mappings on the secondary MMUs for
> those).
>
> Not strictly required but I think it would be safer and more efficient
> to replace the put_page with something like:
>
> static inline void put_page_not_freeing(struct page *page)
> {
> page = compound_head(page);
>
> if (put_page_testzero(page))
> VM_WARN_ON_PAGE(1, page);
> }

Yes, adding such a check makes sense.

Thanks for looking into this too.

Jérôme

2017-08-30 18:00:38

by Nadav Amit

[permalink] [raw]
Subject: Re: [PATCH 02/13] mm/rmap: update to new mmu_notifier semantic

Andrea Arcangeli <[email protected]> wrote:

> On Tue, Aug 29, 2017 at 07:46:07PM -0700, Nadav Amit wrote:
>> Therefore, IIUC, try_to_umap_one() should only call
>> mmu_notifier_invalidate_range() after ptep_get_and_clear() and
>
> That would trigger an unnecessarily double call to
> ->invalidate_range() both from mmu_notifier_invalidate_range() after
> ptep_get_and_clear() and later at the end in
> mmu_notifier_invalidate_range_end().
>
> The only advantage of adding a mmu_notifier_invalidate_range() after
> ptep_get_and_clear() is to flush the secondary MMU TLBs (keep in mind
> the pagetables have to be shared with the primary MMU in order to use
> the ->invalidate_range()) inside the PT lock.
>
> So if mmu_notifier_invalidate_range() after ptep_get_and_clear() is
> needed or not, again boils down to the issue if the old code calling
> ->invalidate_page outside the PT lock was always broken before or
> not. I don't see why exactly it was broken, we even hold the page lock
> there so I don't see a migration race possible either. Of course the
> constraint to be safe is that the put_page in try_to_unmap_one cannot
> be the last one, and that had to be enforced by the caller taking an
> additional reference on it.
>
> One can wonder if the primary MMU TLB flush in ptep_clear_flush
> (should_defer_flush returning false) could be put after releasing the
> PT lock too (because that's not different from deferring the secondary
> MMU notifier TLB flush in ->invalidate_range down to
> mmu_notifier_invalidate_range_end) even if TTU_BATCH_FLUSH isn't set,
> which may be safe too for the same reasons.
>
> When should_defer_flush returns true we already defer the primary MMU
> TLB flush to much later to even mmu_notifier_invalidate_range_end, not
> just after the PT lock release so at least when should_defer_flush is
> true, it looks obviously safe to defer the secondary MMU TLB flush to
> mmu_notifier_invalidate_range_end for the drivers implementing
> ->invalidate_range.

Agreed. It just seems like an enhancement and not directly a bug fix.

>
> If I'm wrong and all TLB flushes must happen inside the PT lock, then
> we should at least reconsider the implicit call to ->invalidate_range
> method from mmu_notifier_invalidate_range_end or we would call it
> twice unnecessarily which doesn't look optimal. Either ways this
> doesn't look optimal. We would need to introduce a
> mmu_notifier_invalidate_range_end_full that calls also
> ->invalidate_range in such case so we skip the ->invalidate_range call
> in mmu_notifier_invalidate_range_end if we put an explicit
> mmu_notifier_invalidate_range() after ptep_get_and_clear inside the PT
> lock like you suggested above.

It is not trivial to flush TLBs (primary or secondary) without holding the
page-table lock, and as we recently encountered, this has resulted in several
bugs [1]. The main problem is that even if you perform the TLB flush
immediately after the PT-lock is released, you cause a situation in which
other threads may make decisions based on the updated PTE value, without
being aware that a TLB flush is needed.

For example, we recently encountered a Linux bug when two threads run
MADV_DONTNEED concurrently on the same address range [2]. One of the threads
may update a PTE, setting it as non-present, and then defer the TLB
flush (for batching). This, however, would cause the second thread,
which also changes the PTEs, to assume that the PTE is already
non-present and that a TLB flush is not necessary. As a result the second
core may still hold stale PTEs in its TLB following MADV_DONTNEED.

There is a swarm of such problems, and some are not too trivial. Deferring
TLB flushes needs to be done in a very careful manner.


[1] https://marc.info/?t=149973438800002&r=1
[2] http://www.spinics.net/lists/linux-mm/msg131525.html

2017-08-30 18:20:29

by Jerome Glisse

[permalink] [raw]
Subject: Re: [PATCH 02/13] mm/rmap: update to new mmu_notifier semantic

On Wed, Aug 30, 2017 at 07:27:47PM +0200, Andrea Arcangeli wrote:
> On Tue, Aug 29, 2017 at 07:46:07PM -0700, Nadav Amit wrote:
> > Therefore, IIUC, try_to_umap_one() should only call
> > mmu_notifier_invalidate_range() after ptep_get_and_clear() and
>
> That would trigger an unnecessarily double call to
> ->invalidate_range() both from mmu_notifier_invalidate_range() after
> ptep_get_and_clear() and later at the end in
> mmu_notifier_invalidate_range_end().
>
> The only advantage of adding a mmu_notifier_invalidate_range() after
> ptep_get_and_clear() is to flush the secondary MMU TLBs (keep in mind
> the pagetables have to be shared with the primary MMU in order to use
> the ->invalidate_range()) inside the PT lock.
>
> So if mmu_notifier_invalidate_range() after ptep_get_and_clear() is
> needed or not, again boils down to the issue if the old code calling
> ->invalidate_page outside the PT lock was always broken before or
> not. I don't see why exactly it was broken, we even hold the page lock
> there so I don't see a migration race possible either. Of course the
> constraint to be safe is that the put_page in try_to_unmap_one cannot
> be the last one, and that had to be enforced by the caller taking an
> additional reference on it.
>
> One can wonder if the primary MMU TLB flush in ptep_clear_flush
> (should_defer_flush returning false) could be put after releasing the
> PT lock too (because that's not different from deferring the secondary
> MMU notifier TLB flush in ->invalidate_range down to
> mmu_notifier_invalidate_range_end) even if TTU_BATCH_FLUSH isn't set,
> which may be safe too for the same reasons.
>
> When should_defer_flush returns true we already defer the primary MMU
> TLB flush to much later to even mmu_notifier_invalidate_range_end, not
> just after the PT lock release so at least when should_defer_flush is
> true, it looks obviously safe to defer the secondary MMU TLB flush to
> mmu_notifier_invalidate_range_end for the drivers implementing
> ->invalidate_range.
>
> If I'm wrong and all TLB flushes must happen inside the PT lock, then
> we should at least reconsider the implicit call to ->invalidate_range
> method from mmu_notifier_invalidate_range_end or we would call it
> twice unnecessarily which doesn't look optimal. Either ways this
> doesn't look optimal. We would need to introduce a
> mmu_notifier_invalidate_range_end_full that calls also
> ->invalidate_range in such case so we skip the ->invalidate_range call
> in mmu_notifier_invalidate_range_end if we put an explicit
> mmu_notifier_invalidate_range() after ptep_get_and_clear inside the PT
> lock like you suggested above.

So I went over the calls to try_to_unmap() (try_to_munlock() is fine as it
does not clear the CPU page table entry). I believe there are 2 cases
where you can get a new pte entry after we release the spinlock and before
we call mmu_notifier_invalidate_range_end()

First case is:
	if (unlikely(PageSwapBacked(page) != PageSwapCache(page))) {
		...
		break;
	}

The pte is cleared, there was an error condition and this should never
happen, but a racing thread might install a new pte in the meantime.
Maybe we should restore the pte value here. Anyway when this happens
bad things are going on.

The second case is a non-dirty anonymous page and MADV_FREE. But here
the application is telling us that no one should care about that
virtual address any more. So I am not sure how much we should care.


If we ignore these 2 cases then the CPU pte can never be replaced by
something else between the time we release the spinlock and the time
we call mmu_notifier_invalidate_range_end(), so not invalidating the
device TLBs is OK here. Do we want this kind of optimization?

Jérôme

2017-08-30 18:40:13

by Nadav Amit

[permalink] [raw]
Subject: Re: [PATCH 02/13] mm/rmap: update to new mmu_notifier semantic

Jerome Glisse <[email protected]> wrote:

> On Wed, Aug 30, 2017 at 07:27:47PM +0200, Andrea Arcangeli wrote:
>> On Tue, Aug 29, 2017 at 07:46:07PM -0700, Nadav Amit wrote:
>>> Therefore, IIUC, try_to_umap_one() should only call
>>> mmu_notifier_invalidate_range() after ptep_get_and_clear() and
>>
>> That would trigger an unnecessarily double call to
>> ->invalidate_range() both from mmu_notifier_invalidate_range() after
>> ptep_get_and_clear() and later at the end in
>> mmu_notifier_invalidate_range_end().
>>
>> The only advantage of adding a mmu_notifier_invalidate_range() after
>> ptep_get_and_clear() is to flush the secondary MMU TLBs (keep in mind
>> the pagetables have to be shared with the primary MMU in order to use
>> the ->invalidate_range()) inside the PT lock.
>>
>> So if mmu_notifier_invalidate_range() after ptep_get_and_clear() is
>> needed or not, again boils down to the issue if the old code calling
>> ->invalidate_page outside the PT lock was always broken before or
>> not. I don't see why exactly it was broken, we even hold the page lock
>> there so I don't see a migration race possible either. Of course the
>> constraint to be safe is that the put_page in try_to_unmap_one cannot
>> be the last one, and that had to be enforced by the caller taking an
>> additional reference on it.
>>
>> One can wonder if the primary MMU TLB flush in ptep_clear_flush
>> (should_defer_flush returning false) could be put after releasing the
>> PT lock too (because that's not different from deferring the secondary
>> MMU notifier TLB flush in ->invalidate_range down to
>> mmu_notifier_invalidate_range_end) even if TTU_BATCH_FLUSH isn't set,
>> which may be safe too for the same reasons.
>>
>> When should_defer_flush returns true we already defer the primary MMU
>> TLB flush to much later to even mmu_notifier_invalidate_range_end, not
>> just after the PT lock release so at least when should_defer_flush is
>> true, it looks obviously safe to defer the secondary MMU TLB flush to
>> mmu_notifier_invalidate_range_end for the drivers implementing
>> ->invalidate_range.
>>
>> If I'm wrong and all TLB flushes must happen inside the PT lock, then
>> we should at least reconsider the implicit call to ->invalidate_range
>> method from mmu_notifier_invalidate_range_end or we would call it
>> twice unnecessarily which doesn't look optimal. Either ways this
>> doesn't look optimal. We would need to introduce a
>> mmu_notifier_invalidate_range_end_full that calls also
>> ->invalidate_range in such case so we skip the ->invalidate_range call
>> in mmu_notifier_invalidate_range_end if we put an explicit
>> mmu_notifier_invalidate_range() after ptep_get_and_clear inside the PT
>> lock like you suggested above.
>
> So i went over call to try_to_unmap() (try_to_munlock() is fine as it
> does not clear the CPU page table entry). I believe they are 2 cases
> where you can get a new pte entry after we release spinlock and before
> we call mmu_notifier_invalidate_range_end()
>
> First case is :
> if (unlikely(PageSwapBacked(page) != PageSwapCache(page))) {
> ...
> break;
> }
>
> The pte is clear, there was an error condition and this should never
> happen but a racing thread might install a new pte in the meantime.
> Maybe we should restore the pte value here. Anyway when this happens
> bad things are going on.
>
> The second case is non dirty anonymous page and MADV_FREE. But here
> the application is telling us that no one should care about that
> virtual address any more. So i am not sure how much we should care.
>
>
> If we ignore this 2 cases than the CPU pte can never be replace by
> something else between the time we release the spinlock and the time
> we call mmu_notifier_invalidate_range_end() so not invalidating the
> devices tlb is ok here. Do we want this kind of optimization ?

The mmu_notifier users would have to be aware that invalidations may be
deferred. If they perform their ``invalidations’’ unconditionally, it may be
ok. If the notifier users avoid invalidations based on the PTE in the
secondary page-table, it can be a problem.

On another note, you may want to consider combining the secondary page-table
mechanisms with the existing TLB-flush mechanisms. Right now, it is
partially done: tlb_flush_mmu_tlbonly(), for example, calls
mmu_notifier_invalidate_range(). However, tlb_gather_mmu() does not call
mmu_notifier_invalidate_range_start().
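
A hypothetical way to combine them (a sketch only; these wrappers do not
exist in the kernel) would be to fold the notifier bracketing into the
mmu_gather API itself:

	static void tlb_gather_mmu_notify(struct mmu_gather *tlb,
					  struct mm_struct *mm,
					  unsigned long start,
					  unsigned long end)
	{
		tlb_gather_mmu(tlb, mm, start, end);
		mmu_notifier_invalidate_range_start(mm, start, end);
	}

	static void tlb_finish_mmu_notify(struct mmu_gather *tlb,
					  unsigned long start,
					  unsigned long end)
	{
		mmu_notifier_invalidate_range_end(tlb->mm, start, end);
		tlb_finish_mmu(tlb, start, end);
	}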

This can also prevent all kind of inconsistencies, and potential bugs. For
instance, clear_refs_write() calls mmu_notifier_invalidate_range_start/end()
but in between there is no call for mmu_notifier_invalidate_range().

Regards,
Nadav


2017-08-30 20:45:55

by Jerome Glisse

[permalink] [raw]
Subject: Re: [PATCH 02/13] mm/rmap: update to new mmu_notifier semantic

On Wed, Aug 30, 2017 at 11:40:08AM -0700, Nadav Amit wrote:
> Jerome Glisse <[email protected]> wrote:
>
> > On Wed, Aug 30, 2017 at 07:27:47PM +0200, Andrea Arcangeli wrote:
> >> On Tue, Aug 29, 2017 at 07:46:07PM -0700, Nadav Amit wrote:
> >>> Therefore, IIUC, try_to_umap_one() should only call
> >>> mmu_notifier_invalidate_range() after ptep_get_and_clear() and
> >>
> >> That would trigger an unnecessarily double call to
> >> ->invalidate_range() both from mmu_notifier_invalidate_range() after
> >> ptep_get_and_clear() and later at the end in
> >> mmu_notifier_invalidate_range_end().
> >>
> >> The only advantage of adding a mmu_notifier_invalidate_range() after
> >> ptep_get_and_clear() is to flush the secondary MMU TLBs (keep in mind
> >> the pagetables have to be shared with the primary MMU in order to use
> >> the ->invalidate_range()) inside the PT lock.
> >>
> >> So if mmu_notifier_invalidate_range() after ptep_get_and_clear() is
> >> needed or not, again boils down to the issue if the old code calling
> >> ->invalidate_page outside the PT lock was always broken before or
> >> not. I don't see why exactly it was broken, we even hold the page lock
> >> there so I don't see a migration race possible either. Of course the
> >> constraint to be safe is that the put_page in try_to_unmap_one cannot
> >> be the last one, and that had to be enforced by the caller taking an
> >> additional reference on it.
> >>
> >> One can wonder if the primary MMU TLB flush in ptep_clear_flush
> >> (should_defer_flush returning false) could be put after releasing the
> >> PT lock too (because that's not different from deferring the secondary
> >> MMU notifier TLB flush in ->invalidate_range down to
> >> mmu_notifier_invalidate_range_end) even if TTU_BATCH_FLUSH isn't set,
> >> which may be safe too for the same reasons.
> >>
> >> When should_defer_flush returns true we already defer the primary MMU
> >> TLB flush to much later to even mmu_notifier_invalidate_range_end, not
> >> just after the PT lock release so at least when should_defer_flush is
> >> true, it looks obviously safe to defer the secondary MMU TLB flush to
> >> mmu_notifier_invalidate_range_end for the drivers implementing
> >> ->invalidate_range.
> >>
> >> If I'm wrong and all TLB flushes must happen inside the PT lock, then
> >> we should at least reconsider the implicit call to ->invalidate_range
> >> method from mmu_notifier_invalidate_range_end or we would call it
> >> twice unnecessarily which doesn't look optimal. Either ways this
> >> doesn't look optimal. We would need to introduce a
> >> mmu_notifier_invalidate_range_end_full that calls also
> >> ->invalidate_range in such case so we skip the ->invalidate_range call
> >> in mmu_notifier_invalidate_range_end if we put an explicit
> >> mmu_notifier_invalidate_range() after ptep_get_and_clear inside the PT
> >> lock like you suggested above.
> >
> > So i went over call to try_to_unmap() (try_to_munlock() is fine as it
> > does not clear the CPU page table entry). I believe they are 2 cases
> > where you can get a new pte entry after we release spinlock and before
> > we call mmu_notifier_invalidate_range_end()
> >
> > First case is :
> > if (unlikely(PageSwapBacked(page) != PageSwapCache(page))) {
> > ...
> > break;
> > }
> >
> > The pte is clear, there was an error condition and this should never
> > happen but a racing thread might install a new pte in the meantime.
> > Maybe we should restore the pte value here. Anyway when this happens
> > bad things are going on.
> >
> > The second case is non dirty anonymous page and MADV_FREE. But here
> > the application is telling us that no one should care about that
> > virtual address any more. So i am not sure how much we should care.
> >
> >
> > If we ignore this 2 cases than the CPU pte can never be replace by
> > something else between the time we release the spinlock and the time
> > we call mmu_notifier_invalidate_range_end() so not invalidating the
> > devices tlb is ok here. Do we want this kind of optimization ?
>
> The mmu_notifier users would have to be aware that invalidations may be
> deferred. If they perform their ``invalidations’’ unconditionally, it may be
> ok. If the notifier users avoid invalidations based on the PTE in the
> secondary page-table, it can be a problem.

So I looked at both the AMD and Intel IOMMUs. AMD always flushes and the
current pte value does not matter AFAICT (I doubt the hardware rewalks the
page table just to decide not to flush; that would be a terribly dumb thing
for a hardware engineer to do).

Intel has a deferred flush mechanism; basically if no device is actively using
the page table then there is no flush (see deferred invalidation in [1]). But
I am unsure about the hardware, ie does it also mean that when a PASID is not
actively used then the IOMMU TLB is also invalid for that PASID. Also I am a
bit unsure about the ATS/PASID specification in respect to devices always
reporting when they are done with a given PASID (ie does the spec say that the
device TLB must be invalidated when a device stops using a PASID).

[1] https://www.intel.com/content/www/us/en/embedded/technology/virtualization/vt-directed-io-spec.html

So I think we can err on the side of caution here and call invalidate_range()
under the page table lock. If the IOMMU folks see a performance issue with a
real workload due to the double invalidation that takes place then we can
work on that.

KVM and Xen are not impacted by this as they only care about start/end with
this patchset.

Andrea, is that in line with your assessment?


> On another note, you may want to consider combining the secondary page-table
> mechanisms with the existing TLB-flush mechanisms. Right now, it is
> partially done: tlb_flush_mmu_tlbonly(), for example, calls
> mmu_notifier_invalidate_range(). However, tlb_gather_mmu() does not call
> mmu_notifier_invalidate_range_start().

tlb_gather_mmu() is always called in places that are themselves bracketed by
mmu_notifier_invalidate_range_start()/end().
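
For instance zap_page_range() has roughly this shape (a simplified sketch,
not a verbatim copy of the kernel code):

	void zap_page_range(struct vm_area_struct *vma, unsigned long start,
			    unsigned long size)
	{
		struct mm_struct *mm = vma->vm_mm;
		struct mmu_gather tlb;
		unsigned long end = start + size;

		lru_add_drain();
		tlb_gather_mmu(&tlb, mm, start, end);
		update_hiwater_rss(mm);
		mmu_notifier_invalidate_range_start(mm, start, end);
		for ( ; vma && vma->vm_start < end; vma = vma->vm_next)
			unmap_single_vma(&tlb, vma, start, end, NULL);
		mmu_notifier_invalidate_range_end(mm, start, end);
		tlb_finish_mmu(&tlb, start, end);
	}

so the actual PTE clearing done through the gather happens inside the
start/end bracket.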

>
> This can also prevent all kind of inconsistencies, and potential bugs. For
> instance, clear_refs_write() calls mmu_notifier_invalidate_range_start/end()
> but in between there is no call for mmu_notifier_invalidate_range().

It is for soft-dirty; we should probably invalidate_range() in that case. I
need to check how dirtiness is handled in ATS/PASID, ie does the device update
the dirty bit of the CPU page table on write, or maybe the device doesn't
update the dirty flag at all.

Jérôme

2017-08-30 20:55:42

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [PATCH 02/13] mm/rmap: update to new mmu_notifier semantic

On Wed, Aug 30, 2017 at 11:40:08AM -0700, Nadav Amit wrote:
> The mmu_notifier users would have to be aware that invalidations may be
> deferred. If they perform their ``invalidations’’ unconditionally, it may be
> ok. If the notifier users avoid invalidations based on the PTE in the
> secondary page-table, it can be a problem.

invalidate_page was always deferred post PT lock release.

This ->invalidate_range post PT lock release is not a new thing,
we're still back to square one to find out if the invalidate_page callout
after PT lock release has always been broken here or not.

> On another note, you may want to consider combining the secondary page-table
> mechanisms with the existing TLB-flush mechanisms. Right now, it is
> partially done: tlb_flush_mmu_tlbonly(), for example, calls
> mmu_notifier_invalidate_range(). However, tlb_gather_mmu() does not call
> mmu_notifier_invalidate_range_start().

If you implement ->invalidate_range_start you don't care about tlb
gather at all and you must not implement ->invalidate_range.

> This can also prevent all kind of inconsistencies, and potential bugs. For
> instance, clear_refs_write() calls mmu_notifier_invalidate_range_start/end()
> but in between there is no call for mmu_notifier_invalidate_range().

It's done in mmu_notifier_invalidate_range_end which is again fully
equivalent except run after PT lock release.

2017-08-30 21:25:19

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [PATCH 02/13] mm/rmap: update to new mmu_notifier semantic

On Wed, Aug 30, 2017 at 11:00:32AM -0700, Nadav Amit wrote:
> It is not trivial to flush TLBs (primary or secondary) without holding the
> page-table lock, and as we recently encountered this resulted in several
> bugs [1]. The main problem is that even if you perform the TLB flush
> immediately after the PT-lock is released, you cause a situation in which
> other threads may make decisions based on the updated PTE value, without
> being aware that a TLB flush is needed.
>
> For example, we recently encountered a Linux bug when two threads run
> MADV_DONTNEED concurrently on the same address range [2]. One of the threads
> may update a PTE, setting it as non-present, and then deferring the TLB
> flush (for batching). As a result, however, it would cause the second
> thread, which also changes the PTEs to assume that the PTE is already
> non-present and TLB flush is not necessary. As a result the second core may
> still hold stale PTEs in its TLB following MADV_DONTNEED.

The source of those complex races, which require taking into account
nested tlb gather to solve, originates from skipping primary MMU
tlb flushes depending on the value of the pagetables (as an
optimization).

For mmu_notifier_invalidate_range_end we always ignore the value of
the pagetables and mmu_notifier_invalidate_range_end always runs
unconditionally invalidating the secondary MMU for the whole range
under consideration. There are no optimizations that attempt to skip
mmu_notifier_invalidate_range_end depending on the pagetable value and
there's no TLB gather for secondary MMUs either. That is to keep it
simple of course.

As long as those mmu_notifier_invalidate_range_end calls stay unconditional,
I don't see how the races you linked can possibly be relevant in
evaluating if ->invalidate_range (again only for iommuv2 and
intel-svm) has to run inside the PT lock or not.

> There is a swarm of such problems, and some are not too trivial. Deferring
> TLB flushes needs to be done in a very careful manner.

I agree it's not trivial, but I don't think any complexity comes from
above.

The only complexity is about what happens if the page is copied to some other
page and replaced, because the copy is the only case where coherency
could be retained by the primary MMU. What if the primary MMU starts
working on the new page in between PT lock release and
mmu_notifier_invalidate_range_end, while the secondary MMU is stuck on
the old page? That is the only problem we deal with here, the copy to
other page and replace. Any other case that doesn't involve the copy
seems non coherent by definition, and you couldn't measure it.

I can't think of a scenario that requires the explicit
mmu_notifier_invalidate_range call before releasing the PT lock, at
least for try_to_unmap_one.

Could you think of a scenario where calling ->invalidate_range inside
mmu_notifier_invalidate_range_end post PT lock breaks iommuv2 or
intel-svm? Those two are the only ones requiring
->invalidate_range calls, all other mmu notifier users are safe
without running mmu_notifier_invalidate_range_end under PT lock thanks
to mmu_notifier_invalidate_range_start blocking the secondary MMU.

Could you post a tangible scenario that invalidates my theory that
those mmu_notifier_invalidate_range calls inside PT lock would be
superfluous?

Some of the scenarios under consideration:

1) migration entry -> migration_entry_wait -> page lock, plus
migrate_pages taking the lock so it can't race with try_to_unmap
from other places
2) swap entry -> lookup_swap_cache -> page lock (page not really replaced)
3) CoW -> do_wp_page -> page lock on old page
4) KSM -> replace_page -> page lock on old page
5) if the pte is zapped as in MADV_DONTNEED, no coherency possible so
it's not measurable that we let the guest run a bit longer on the
old page past PT lock release

If you could post a multi CPU trace that shows how iommuv2 or
intel-svm are broken if ->invalidate_range is run inside
mmu_notifier_invalidate_range_end post PT lock in try_to_unmap_one it
would help.

Of course if we run mmu_notifier_invalidate_range inside the PT lock and
we remove ->invalidate_range from mmu_notifier_invalidate_range_end,
all will be obviously safe, so we could just do it to avoid thinking
about the above, but the resulting code will be uglier and less
optimal (even when disarmed there will be dummy branches I wouldn't
love) and I currently can't see why it wouldn't be safe.

Replacing a page completely without any relation to the old page
content allows no coherency anyway, so even if it breaks you cannot
measure it because it's undefined.

If the page has a relation with the old contents and coherency has to
be provided for both primary MMU and secondary MMUs, this relation
between the old and new page during the replacement is enforced by some
other means besides the PT lock: migration entry on the locked old page
with migration_entry_wait, the page lock in migrate_pages, etc..

Thanks,
Andrea

2017-08-30 21:53:41

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 02/13] mm/rmap: update to new mmu_notifier semantic

On Wed, Aug 30, 2017 at 9:52 AM, Andrea Arcangeli <[email protected]> wrote:
>
> I pointed out in earlier email ->invalidate_range can only be
> implemented (as mutually exclusive alternative to
> ->invalidate_range_start/end) by secondary MMUs that shares the very
> same pagetables with the core linux VM of the primary MMU, and those
> invalidate_range are already called by
> __mmu_notifier_invalidate_range_end.

I have to admit that I didn't notice that fact - that we are already
in the situation that invalidate_range is called by the range_end()
notifier.

I agree that that should simplify all the code, and means that we
don't have to worry about the few cases that already implemented only
the "invalidate_page()" and "invalidate_range()" cases.

So I think that simplifies Jérôme's patch further - once you have put
the range_start/end() cases around the inner loop, you can just drop
the invalidate_page() things entirely.

> So this conversion from invalidate_page to invalidate_range looks
> superflous and the final mmu_notifier_invalidate_range_end should be
> enough.

Yes. I missed the fact that we already called range() from range_end().

That said, the double call shouldn't hurt correctness, and it's
"closer" to old behavior for those people who only did the range/page
ones, so I wonder if we can keep Jérôme's patch in its current state
for 4.13.

Because I still want to release 4.13 this weekend, despite this
upheaval. Otherwise I'll have timing problems during the next merge
window.

Andrea, do you otherwise agree with the whole series as is?

Linus

2017-08-30 22:17:43

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [PATCH 02/13] mm/rmap: update to new mmu_notifier semantic

On Wed, Aug 30, 2017 at 04:45:49PM -0400, Jerome Glisse wrote:
> So i look at both AMD and Intel IOMMU. AMD always flush and current pte value
> do not matter AFAICT (i doubt that hardware rewalk the page table just to
> decide not to flush that would be terribly dumb for hardware engineer to do
> so).
>
> Intel have a deferred flush mecanism, basicly if no device is actively using
> the page table then there is no flush (see deferred invalidation in [1]). But
> i am unsure about the hardware ie does it also means that when a PASID is not
> actively use then the IOMMU TLB is also invalid for that PASID. Also i am bit
> unsure about ATS/PASID specification in respect to having device always report
> when they are done with a given PASID (ie does the spec say that device tlb
> must be invalidated when device stop using a pasid).
>
> https://www.intel.com/content/www/us/en/embedded/technology/virtualization/vt-directed-io-spec.html
>
> So i think we can side with caution here and call invalidate_range() under the
> page table lock. If IOMMU folks see performance issue with real workload due
> to the double invalidation that take place then we can work on that.
>
> KVM or XEN are not impacted by this as they only care about start/end with this
> patchset.
>
> Andrea is that inline with your assessment ?

That is obviously safe. The mmu_notifier_invalidate_range()
calls under PT lock could always be removed later.

However I'm afraid I have to figure out whether
mmu_notifier_invalidate_range() must run inside the PT lock
regardless, because that's just the very same problem as
->invalidate_page running outside the PT lock in the majority of
production kernels out there:

pte_unmap_unlock(pte, ptl);
if (ret != SWAP_FAIL && ret != SWAP_MLOCK && !(flags & TTU_MUNLOCK))
mmu_notifier_invalidate_page(mm, address);

So even if we take the easy route for mmu_notifier_invalidate_range,
I'm still forced to think about this issue.

Currently I tend to believe the old code is safe and in turn I'm more
inclined to skip the explicit mmu_notifier_invalidate_range() inside
the PT lock for amd_iommu_v2 and intel-svm, and add it later if it's
truly proven unsafe.

However, as long as the reason for this is just simplicity and
robustness, I don't mind either way. The double call is not nice though,
and not doing the double call will mess up the code more.

So I thought it was attractive to explain why it is safe not to run
mmu_notifier_invalidate_range() inside the PT lock, which would then
allow for the most straightforward and more optimal upstream
implementation (in addition to not having to fix the old code).

> It is for softdirty, we should probably invalidate_range() in that case i
> need to check how dirtyness is handled in ATS/PASID ie does device update
> the dirty bit of the CPU page table on write. Or maybe device don't update
> the dirty flag.

They both call handle_mm_fault and establish a readonly secondary MMU
mapping and then call handle_mm_fault again before there's a DMA to
memory, to establish a writable mapping and set the dirty bit here at
the first write fault post read fault:

bool write = vmf->flags & FAULT_FLAG_WRITE;

vmf->ptl = pmd_lock(vmf->vma->vm_mm, vmf->pmd);
if (unlikely(!pmd_same(*vmf->pmd, orig_pmd)))
goto unlock;

entry = pmd_mkyoung(orig_pmd);
if (write)
entry = pmd_mkdirty(entry);
[..]
if (vmf->flags & FAULT_FLAG_WRITE) {
if (!pte_write(entry))
return do_wp_page(vmf);
entry = pte_mkdirty(entry);
}

I doubt the PT lock is relevant for how the dirty bit gets set by
those two faulting-capable iommus.

All that matters is that after clear_refs_write there's a
mmu_notifier_invalidate_range_end, so ->invalidate_range is called and
at the next write they call handle_mm_fault(FAULT_FLAG_WRITE) once
again. Where exactly the invalidate runs (i.e. post PT lock) doesn't
look like a big issue to me; clear_refs definitely wouldn't change the
pte/hugepmd to point to a different page, so any coherency issue with
primary and secondary MMU temporarily being out of sync doesn't exist
there.

Kirill is the last committer to the handle_mm_fault line in both
drivers so it'd be great if he can double check to confirm that's the
way the dirty bit is handled and in turn causes no concern at
->invalidate_range time.

Thanks,
Andrea

2017-08-30 23:01:29

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [PATCH 02/13] mm/rmap: update to new mmu_notifier semantic

On Wed, Aug 30, 2017 at 02:53:38PM -0700, Linus Torvalds wrote:
> On Wed, Aug 30, 2017 at 9:52 AM, Andrea Arcangeli <[email protected]> wrote:
> >
> > I pointed out in earlier email ->invalidate_range can only be
> > implemented (as mutually exclusive alternative to
> > ->invalidate_range_start/end) by secondary MMUs that shares the very
> > same pagetables with the core linux VM of the primary MMU, and those
> > invalidate_range are already called by
> > __mmu_notifier_invalidate_range_end.
>
> I have to admit that I didn't notice that fact - that we are already
> in the situation that
> invalidate_range is called by by the rand_end() nofifier.
>
> I agree that that should simplify all the code, and means that we
> don't have to worry about the few cases that already implemented only
> the "invalidate_page()" and "invalidate_range()" cases.
>
> So I think that simplifies Jérôme's patch further - once you have put
> the range_start/end() cases around the inner loop, you can just drop
> the invalidate_page() things entirely.
>
> > So this conversion from invalidate_page to invalidate_range looks
> > superflous and the final mmu_notifier_invalidate_range_end should be
> > enough.
>
> Yes. I missed the fact that we already called range() from range_end().
>
> That said, the double call shouldn't hurt correctness, and it's
> "closer" to old behavior for those people who only did the range/page
> ones, so I wonder if we can keep Jérôme's patch in its current state
> for 4.13.

Yes, the double call doesn't hurt correctness. Keeping it in its current
state is safer if anything, so I've no objection to it, other than that
I'd like to optimize it further if possible, but that can be done later.

We're already running the double call in various fast paths too in
fact, and rmap walk isn't the fastest path that would be doing such
double call, so it's not a major concern.

Also not a bug, but one further (but more obviously safe) enhancement
I would like is to restrict those rmap invalidation ranges to
PAGE_SIZE << compound_order(page) instead of PMD_SIZE/PMD_MASK.

+ /*
+ * We have to assume the worse case ie pmd for invalidation. Note that
+ * the page can not be free in this function as call of try_to_unmap()
+ * must hold a reference on the page.
+ */
+ end = min(vma->vm_end, (start & PMD_MASK) + PMD_SIZE);
+ mmu_notifier_invalidate_range_start(vma->vm_mm, start, end);

We don't need to invalidate 2MB of secondary MMU mappings surrounding
a 4KB page, just to swapout a 4k page. split_huge_page can't run while
holding the rmap locks, so compound_order(page) is safe to use there.

It can also be optimized incrementally later.

> Because I still want to release 4.13 this weekend, despite this
> upheaval. Otherwise I'll have timing problems during the next merge
> window.
>
> Andrea, do you otherwise agree with the whole series as is?

I only wish we had more time to test Jerome's patchset, but I sure
agree in principle and I don't see regressions in it.

The callouts to ->invalidate_page seem to have diminished over time
(for the various reasons we know), so given we don't use it for the
fast paths, using it only in the rmap walk slow paths probably wasn't
providing much performance benefit.

Thanks,
Andrea

2017-08-30 23:26:00

by Nadav Amit

[permalink] [raw]
Subject: Re: [PATCH 02/13] mm/rmap: update to new mmu_notifier semantic

[cc’ing IOMMU people, which for some reason are not cc’d]

Andrea Arcangeli <[email protected]> wrote:

> On Wed, Aug 30, 2017 at 11:00:32AM -0700, Nadav Amit wrote:
>> It is not trivial to flush TLBs (primary or secondary) without holding the
>> page-table lock, and as we recently encountered this resulted in several
>> bugs [1]. The main problem is that even if you perform the TLB flush
>> immediately after the PT-lock is released, you cause a situation in which
>> other threads may make decisions based on the updated PTE value, without
>> being aware that a TLB flush is needed.
>>
>> For example, we recently encountered a Linux bug when two threads run
>> MADV_DONTNEED concurrently on the same address range [2]. One of the threads
>> may update a PTE, setting it as non-present, and then deferring the TLB
>> flush (for batching). As a result, however, it would cause the second
>> thread, which also changes the PTEs to assume that the PTE is already
>> non-present and TLB flush is not necessary. As a result the second core may
>> still hold stale PTEs in its TLB following MADV_DONTNEED.
>
> The source of those complex races that requires taking into account
> nested tlb gather to solve it, originates from skipping primary MMU
> tlb flushes depending on the value of the pagetables (as an
> optimization).
>
> For mmu_notifier_invalidate_range_end we always ignore the value of
> the pagetables and mmu_notifier_invalidate_range_end always runs
> unconditionally invalidating the secondary MMU for the whole range
> under consideration. There are no optimizations that attempts to skip
> mmu_notifier_invalidate_range_end depending on the pagetable value and
> there's no TLB gather for secondary MMUs either. That is to keep it
> simple of course.
>
> As long as those mmu_notifier_invalidate_range_end stay unconditional,
> I don't see how those races you linked, can be possibly relevant in
> evaluating if ->invalidate_range (again only for iommuv2 and
> intel-svm) has to run inside the PT lock or not.

Thanks for the clarifications. It now makes much more sense.

>
>> There is a swarm of such problems, and some are not too trivial. Deferring
>> TLB flushes needs to be done in a very careful manner.
>
> I agree it's not trivial, but I don't think any complexity comes from
> above.
>
> The only complexity is about, what if the page is copied to some other
> page and replaced, because the copy is the only case where coherency
> could be retained by the primary MMU. What if the primary MMU starts
> working on the new page in between PT lock release and
> mmu_notifier_invalidate_range_end, while the secondary MMU is stuck on
> the old page? That is the only problem we deal with here, the copy to
> other page and replace. Any other case that doesn't involve the copy
> seems non coherent by definition, and you couldn't measure it.
>
> I can't think of a scenario that requires the explicit
> mmu_notifier_invalidate_range call before releasing the PT lock, at
> least for try_to_unmap_one.
>
> Could you think of a scenario where calling ->invalidate_range inside
> mmu_notifier_invalidate_range_end post PT lock breaks iommuv2 or
> intel-svm? Those two are the only ones requiring
> ->invalidate_range calls, all other mmu notifier users are safe
> without running mmu_notifier_invalidate_range_end under PT lock thanks
> to mmu_notifier_invalidate_range_start blocking the secondary MMU.
>
> Could you post a tangible scenario that invalidates my theory that
> those mmu_notifier_invalidate_range calls inside PT lock would be
> superfluous?
>
> Some of the scenarios under consideration:
>
> 1) migration entry -> migration_entry_wait -> page lock, plus
> migrate_pages taking the lock so it can't race with try_to_unmap
> from other places
> 2) swap entry -> lookup_swap_cache -> page lock (page not really replaced)
> 3) CoW -> do_wp_page -> page lock on old page
> 4) KSM -> replace_page -> page lock on old page
> 5) if the pte is zapped as in MADV_DONTNEED, no coherency possible so
> it's not measurable that we let the guest run a bit longer on the
> old page past PT lock release

For both CoW and KSM, the correctness is maintained by calling
ptep_clear_flush_notify(). If you defer the secondary MMU invalidation
(i.e., replacing ptep_clear_flush_notify() with ptep_clear_flush() ), you
will cause memory corruption, and page-lock would not be enough.

BTW: I see some calls to ptep_clear_flush_notify() which are followed
immediately by set_pte_at_notify(). I do not understand why this makes
sense, as both notifications end up doing the same thing - invalidating the
secondary MMU. The set_pte_at_notify() in these cases can be changed to
set_pte(). No?

> If you could post a multi CPU trace that shows how iommuv2 or
> intel-svm are broken if ->invalidate_range is run inside
> mmu_notifier_invalidate_range_end post PT lock in try_to_unmap_one it
> would help.
>
> Of course if we run mmu_notifier_invalidate_range inside PT lock and
> we remove ->invalidate_range from mmu_notifier_invalidate_range_stop
> all will be obviously safe, so we could just do it to avoid thinking
> about the above, but the resulting code will be uglier and less
> optimal (even when disarmed there will be dummy branches I wouldn't
> love) and I currently can't see why it wouldn't be safe.
>
> Replacing a page completely without any relation to the old page
> content allows no coherency anyway, so even if it breaks you cannot
> measure it because it's undefined.
>
> If the page has a relation with the old contents and coherency has to
> be provided for both primary MMU and secondary MMUs, this relation
> between old and new page during the replacement, is enforced by some
> other mean besides the PT lock, migration entry on locked old page
> with migration_entry_wait and page lock in migrate_pages etc..

I think that basically you are correct, and assuming that you always
notify/invalidate unconditionally any PTE range you read/write, you are
safe. Yet, I want to have another look at the code. Anyhow, just deferring
all the TLB flushes, including those of set_pte_at_notify(), is likely to
result in errors.

Regards,
Nadav

2017-08-31 00:47:26

by Jerome Glisse

[permalink] [raw]
Subject: Re: [PATCH 02/13] mm/rmap: update to new mmu_notifier semantic

On Wed, Aug 30, 2017 at 04:25:54PM -0700, Nadav Amit wrote:
> [cc’ing IOMMU people, which for some reason are not cc’d]
>
> Andrea Arcangeli <[email protected]> wrote:
>
> > On Wed, Aug 30, 2017 at 11:00:32AM -0700, Nadav Amit wrote:
> >> It is not trivial to flush TLBs (primary or secondary) without holding the
> >> page-table lock, and as we recently encountered this resulted in several
> >> bugs [1]. The main problem is that even if you perform the TLB flush
> >> immediately after the PT-lock is released, you cause a situation in which
> >> other threads may make decisions based on the updated PTE value, without
> >> being aware that a TLB flush is needed.
> >>
> >> For example, we recently encountered a Linux bug when two threads run
> >> MADV_DONTNEED concurrently on the same address range [2]. One of the threads
> >> may update a PTE, setting it as non-present, and then deferring the TLB
> >> flush (for batching). As a result, however, it would cause the second
> >> thread, which also changes the PTEs to assume that the PTE is already
> >> non-present and TLB flush is not necessary. As a result the second core may
> >> still hold stale PTEs in its TLB following MADV_DONTNEED.
> >
> > The source of those complex races that requires taking into account
> > nested tlb gather to solve it, originates from skipping primary MMU
> > tlb flushes depending on the value of the pagetables (as an
> > optimization).
> >
> > For mmu_notifier_invalidate_range_end we always ignore the value of
> > the pagetables and mmu_notifier_invalidate_range_end always runs
> > unconditionally invalidating the secondary MMU for the whole range
> > under consideration. There are no optimizations that attempts to skip
> > mmu_notifier_invalidate_range_end depending on the pagetable value and
> > there's no TLB gather for secondary MMUs either. That is to keep it
> > simple of course.
> >
> > As long as those mmu_notifier_invalidate_range_end stay unconditional,
> > I don't see how those races you linked, can be possibly relevant in
> > evaluating if ->invalidate_range (again only for iommuv2 and
> > intel-svm) has to run inside the PT lock or not.
>
> Thanks for the clarifications. It now makes much more sense.
>
> >
> >> There is a swarm of such problems, and some are not too trivial. Deferring
> >> TLB flushes needs to be done in a very careful manner.
> >
> > I agree it's not trivial, but I don't think any complexity comes from
> > above.
> >
> > The only complexity is about, what if the page is copied to some other
> > page and replaced, because the copy is the only case where coherency
> > could be retained by the primary MMU. What if the primary MMU starts
> > working on the new page in between PT lock release and
> > mmu_notifier_invalidate_range_end, while the secondary MMU is stuck on
> > the old page? That is the only problem we deal with here, the copy to
> > other page and replace. Any other case that doesn't involve the copy
> > seems non coherent by definition, and you couldn't measure it.
> >
> > I can't think of a scenario that requires the explicit
> > mmu_notifier_invalidate_range call before releasing the PT lock, at
> > least for try_to_unmap_one.
> >
> > Could you think of a scenario where calling ->invalidate_range inside
> > mmu_notifier_invalidate_range_end post PT lock breaks iommuv2 or
> > intel-svm? Those two are the only ones requiring
> > ->invalidate_range calls, all other mmu notifier users are safe
> > without running mmu_notifier_invalidate_range_end under PT lock thanks
> > to mmu_notifier_invalidate_range_start blocking the secondary MMU.
> >
> > Could you post a tangible scenario that invalidates my theory that
> > those mmu_notifier_invalidate_range calls inside PT lock would be
> > superfluous?
> >
> > Some of the scenarios under consideration:
> >
> > 1) migration entry -> migration_entry_wait -> page lock, plus
> > migrate_pages taking the lock so it can't race with try_to_unmap
> > from other places
> > 2) swap entry -> lookup_swap_cache -> page lock (page not really replaced)
> > 3) CoW -> do_wp_page -> page lock on old page
> > 4) KSM -> replace_page -> page lock on old page
> > 5) if the pte is zapped as in MADV_DONTNEED, no coherency possible so
> > it's not measurable that we let the guest run a bit longer on the
> > old page past PT lock release
>
> For both CoW and KSM, the correctness is maintained by calling
> ptep_clear_flush_notify(). If you defer the secondary MMU invalidation
> (i.e., replacing ptep_clear_flush_notify() with ptep_clear_flush() ), you
> will cause memory corruption, and page-lock would not be enough.

Just to add to that, the IOMMU has its own CPU page table walker and it
can walk the page table at any time (not just the page table of the
current CPU; the IOMMU has an array that matches a PASID with a page
table, and a device requests translation of a given virtual address
against a PASID).

So this means the following can happen with ptep_clear_flush() only:

CPU | IOMMU
| - walk page table populate tlb at addr A
- clear pte at addr A |
- set new pte |

Device is using old page and CPU new page :(

But with ptep_clear_flush_notify()

CPU | IOMMU
| - walk page table populate tlb at addr A
- clear pte at addr A |
- notify -> invalidate_range | > flush IOMMU/device tlb
- set new pte |

So now either the IOMMU sees the empty pte and triggers a page fault (this
is the case if there is a racing IOMMU ATS right after the IOMMU/device tlb
flush but before setting the new pte) or it sees the new pte. Either way
both IOMMU and CPU have a coherent view of what a virtual address points to.
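
A minimal sketch of that safe ordering (a hypothetical helper with
illustrative names, not code from the kernel tree):

/*
 * Sketch only: the ordering needed when the pte will point to a
 * different page afterwards and a hardware walker (IOMMU/ATS) shares
 * the page table. Assumes the caller holds the PT lock.
 */
static void sketch_replace_pte(struct vm_area_struct *vma,
			       unsigned long addr, pte_t *ptep, pte_t newpte)
{
	/* clear pte, flush the primary TLB, then ->invalidate_range()
	 * flushes the IOMMU/device TLB, all before the new pte exists */
	ptep_clear_flush_notify(vma, addr, ptep);

	/* only now install the pte pointing at the new page: a racing
	 * device walk either faults on the empty pte or sees the new
	 * page, never the stale old one */
	set_pte_at_notify(vma->vm_mm, addr, ptep, newpte);
}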

>
> BTW: I see some calls to ptep_clear_flush_notify() which are followed
> immediately after by set_pte_at_notify(). I do not understand why it makes
> sense, as both notifications end up doing the same thing - invalidating the
> secondary MMU. The set_pte_at_notify() in these cases can be changed to
> set_pte(). No?

Andrea explained to me the historical reason set_pte_at_notify() calls the
change_pte callback: it was intended so that KVM could update the
secondary page table directly without having to fault. It is now a
pointless optimization, as the call to range_start() that happens in all
the places before any set_pte_at_notify() invalidates the secondary page
table and thus will lead to a page fault for the vm. I have talked with
Andrea about ways to bring back this optimization.
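
Roughly (paraphrased, not a verbatim copy of the kernel macro),
set_pte_at_notify() is a thin wrapper that gives secondary MMUs a chance
to update their mapping in place via ->change_pte before the CPU pte is
written:

#define sketch_set_pte_at_notify(mm, addr, ptep, pte)	\
({							\
	mmu_notifier_change_pte(mm, addr, pte);		\
	set_pte_at(mm, addr, ptep, pte);		\
})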

> > If you could post a multi CPU trace that shows how iommuv2 or
> > intel-svm are broken if ->invalidate_range is run inside
> > mmu_notifier_invalidate_range_end post PT lock in try_to_unmap_one it
> > would help.
> >
> > Of course if we run mmu_notifier_invalidate_range inside PT lock and
> > we remove ->invalidate_range from mmu_notifier_invalidate_range_stop
> > all will be obviously safe, so we could just do it to avoid thinking
> > about the above, but the resulting code will be uglier and less
> > optimal (even when disarmed there will be dummy branches I wouldn't
> > love) and I currently can't see why it wouldn't be safe.
> >
> > Replacing a page completely without any relation to the old page
> > content allows no coherency anyway, so even if it breaks you cannot
> > measure it because it's undefined.
> >
> > If the page has a relation with the old contents and coherency has to
> > be provided for both primary MMU and secondary MMUs, this relation
> > between old and new page during the replacement, is enforced by some
> > other mean besides the PT lock, migration entry on locked old page
> > with migration_entry_wait and page lock in migrate_pages etc..
>
> I think that basically you are correct, and assuming that you always
> notify/invalidate unconditionally any PTE range you read/write, you are
> safe. Yet, I want to have another look at the code. Anyhow, just deferring
> all the TLB flushes, including those of set_pte_at_notify(), is likely to
> result in errors.

Yes we need the following sequence for IOMMU:
- clear pte
- invalidate IOMMU/device TLB
- set new pte

Otherwise the IOMMU page table walker can populate the IOMMU/device tlb
with a stale entry.

Note that this is not necessary in all cases. For try_to_unmap, for
instance, it is fine to move the IOMMU tlb shootdown after changing the
CPU page table, as we are not pointing the pte to a different page. Either
we clear the pte or we set a swap entry, and as long as the page that used
to be pointed to by the pte is not freed before the IOMMU tlb flush, we
are fine.

In fact i think the only case where we need the above sequence (clear,
flush secondary tlb, set new pte) is for COW. I think in all other cases
we can get rid of invalidate_range() from inside the page table lock
and rely on invalidate_range_end() being called unconditionally.
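
For contrast, a rough sketch of the try_to_unmap-style case above, where
the secondary TLB flush can safely come from the enclosing range_end()
because the pte never ends up pointing at a different page (illustrative
code with made-up local names, not the actual try_to_unmap_one() body):

	mmu_notifier_invalidate_range_start(mm, start, end);
	/* ... per-pte work under the PT lock ... */
	pteval = ptep_clear_flush(vma, address, ptep);	/* primary TLB only */
	set_pte_at(mm, address, ptep, swp_entry_to_pte(entry));
	/* ... PT lock dropped ... */
	mmu_notifier_invalidate_range_end(mm, start, end); /* ->invalidate_range() */
	/* the reference held by the caller keeps the old page alive until
	 * after range_end(), so the device can never see freed memory */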

I might ponder on all this for more mmu_notifier cleanups as i have
some optimizations that i have lined up for it but this is next cycle
material. For 4.13 i believe the current patchset is the safest way
to go.

Jérôme

2017-08-31 17:12:46

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [PATCH 02/13] mm/rmap: update to new mmu_notifier semantic

On Wed, Aug 30, 2017 at 08:47:19PM -0400, Jerome Glisse wrote:
> On Wed, Aug 30, 2017 at 04:25:54PM -0700, Nadav Amit wrote:
> > For both CoW and KSM, the correctness is maintained by calling
> > ptep_clear_flush_notify(). If you defer the secondary MMU invalidation
> > (i.e., replacing ptep_clear_flush_notify() with ptep_clear_flush() ), you
> > will cause memory corruption, and page-lock would not be enough.
>
> Just to add up, the IOMMU have its own CPU page table walker and it can
> walk the page table at any time (not the page table current to current
> CPU, IOMMU have an array that match a PASID with a page table and device
> request translation for a given virtual address against a PASID).
>
> So this means the following can happen with ptep_clear_flush() only:
>
> CPU | IOMMU
> | - walk page table populate tlb at addr A
> - clear pte at addr A |
> - set new pte |
mmu_notifier_invalidate_range_end | -flush IOMMU/device tlb

>
> Device is using old page and CPU new page :(

That condition won't be persistent.

>
> But with ptep_clear_flush_notify()
>
> CPU | IOMMU
> | - walk page table populate tlb at addr A
> - clear pte at addr A |
> - notify -> invalidate_range | > flush IOMMU/device tlb
> - set new pte |
>
> So now either the IOMMU see the empty pte and trigger a page fault (this is
> if there is a racing IOMMU ATS right after the IOMMU/device tlb flush but
> before setting the new pte) or it see the new pte. Either way both IOMMU
> and CPU have a coherent view of what a virtual address points to.

Sure, the _notify version is obviously safe.

> Andrea explained to me the historical reasons set_pte_at_notify call the
> change_pte callback and it was intended so that KVM could update the
> secondary page table directly without having to fault. It is now a pointless
> optimization as the call to range_start() happening in all the places before
> any set_pte_at_notify() invalidate the secondary page table and thus will
> lead to page fault for the vm. I have talk with Andrea on way to bring back
> this optimization.

Yes, we've known that for a long time; the optimization got basically
disabled when range_start/end expanded. It'd be nice to optimize
change_pte again but this is for later.

> Yes we need the following sequence for IOMMU:
> - clear pte
> - invalidate IOMMU/device TLB
> - set new pte
>
> Otherwise the IOMMU page table walker can populate IOMMU/device tlb with
> stall entry.
>
> Note that this is not necessary for all the case. For try_to_unmap it
> is fine for instance to move the IOMMU tlb shoot down after changing the
> CPU page table as we are not pointing the pte to a different page. Either
> we clear the pte or we set a swap entry and as long as the page that use
> to be pointed by the pte is not free before the IOMMU tlb flush then we
> are fine.
>
> In fact i think the only case where we need the above sequence (clear,
> flush secondary tlb, set new pte) is for COW. I think all other cases
> we can get rid of invalidate_range() from inside the page table lock
> and rely on invalidate_range_end() to call unconditionaly.

Even with CoW, it's not a big issue if the IOMMU keeps reading from the
old page for a little while longer (in between PT lock release and
mmu_notifier_invalidate_range_end).

How can you tell you read the old data a bit longer because the IOMMU
noticed the new page only when mmu_notifier_invalidate_range_end ran,
and not because the CPU was faster at writing than the IOMMU was at
loading the new pagetable?

I figure the only detectable effect would be that the CPU could see pageA
changing before pageB, while the iommu-v2 could see pageB changing before
pageA. If that's a concern, it is the only good reason I can think of
right now for requiring ->invalidate_page inside the CoW PT lock to
enforce the same ordering. However, if the modifications happen
simultaneously and it's a correct runtime, the order must not matter,
but it would still be detectable, which may not be desirable.

Currently the _notify is absolutely needed to run ->invalidate_range
before put_page runs in the CoW code below. Because that put_page is
done in a not very scalable place, it would be better moved down
regardless of ->invalidate_range, to reduce the size of the PT lock
protected critical section.

- if (new_page)
- put_page(new_page);

pte_unmap_unlock(vmf->pte, vmf->ptl);
mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+ if (new_page)
+ put_page(new_page);

Of course the iommu will not immediately start reading from the new
page, but even if it triggers a write protect fault and calls
handle_mm_fault, it will find a writeable pte already there and it'll
get an ->invalidate_range as soon as mmu_notifier_invalidate_range_end
runs, so it is sure to notice the new page.

Now write_protect_page in KSM...

What bad can happen if the iommu keeps writing to the write-protected
page for a little while longer? As long as nothing writes to the page
anymore by the time write_protect_page() returns, pages_identical will
work. How do you know the IOMMU was just a bit faster and wrote a few
more bytes, rather than mmu_notifier_invalidate_range_end running a
bit later after dropping the PT lock?

Now replace_page in KSM...

ptep_clear_flush_notify(vma, addr, ptep);
set_pte_at_notify(mm, addr, ptep, newpte);

page_remove_rmap(page, false);
if (!page_mapped(page))
try_to_free_swap(page);
- put_page(page);

pte_unmap_unlock(ptep, ptl);
err = 0;
out_mn:
mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
out:
+ put_page(page); /* TODO: broken of course, fix error cases */
return err;
}

If we free the old page after mmu_notifier_invalidate_range_end
(fixing up the error case, the above change ignores the error paths),
the content of the old and new page are identical for replace_page.

Even if the new page immediately takes a COW, how do you know whether the
IOMMU was just a bit slower and read the old page content pre-COW? They're
guaranteed identical and both readonly already at that point.

All the above considered, if this is getting way too complex, it may
be preferable to keep things obviously safe and always run
mmu_notifier_invalidate_range inside the PT lock and be done with it,
and remove the ->invalidate_range from
mmu_notifier_invalidate_range_end too to avoid the double invalidates
for the secondary MMUs with hardware pagetable walkers and shared
pagetables with the primary MMU.

In principle the primary reason for doing _notify or an explicit
mmu_notifier_invalidate_range() is to keep things simpler and to avoid
having to care about where exactly pages get freed (i.e. before or after
mmu_notifier_invalidate_range_end).

For example zap_page_range tlb gather freeing strictly requires an
explicit mmu_notifier_invalidate_range before the page is actually
freed (because the tlb gather will free the pages well before
mmu_notifier_invalidate_range_end can run).
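
A simplified sketch of that constraint (not the actual zap code): the
explicit secondary flush has to happen before the gathered pages can be
freed:

	mmu_notifier_invalidate_range_start(mm, start, end);
	/* ... for each pte in the range, under the PT lock: */
	pteval = ptep_get_and_clear(mm, addr, ptep);
	tlb_remove_tlb_entry(tlb, ptep, addr);		/* primary flush deferred */
	/* ... */
	mmu_notifier_invalidate_range(mm, start, end);	/* must run before ... */
	tlb_finish_mmu(tlb, start, end);		/* ... pages get freed here */
	mmu_notifier_invalidate_range_end(mm, start, end);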

The concern that an ->invalidate_range is always needed before the PT lock
is released, if the primary TLB was flushed inside the PT lock, is a more
recent one, and it looks to me like it's not always needed but perhaps
only in some cases.

An example where the ->invalidate_range inside
mmu_notifier_invalidate_range_end pays off, is madvise_free_pte_range.
That doesn't flush the TLB before setting the pagetable clean. So the
primary MMU can still write through the dirty primary TLB without
setting the dirty/accessed bit after madvise_free_pte_range returns.

ptent = ptep_get_and_clear_full(mm, addr, pte,
tlb->fullmm);

ptent = pte_mkold(ptent);
ptent = pte_mkclean(ptent);
set_pte_at(mm, addr, pte, ptent);
tlb_remove_tlb_entry(tlb, pte, addr);

Not even the primary TLB is flushed here. All concurrent writes of the
primary MMU can still go lost while MADV_FREE runs. All that is
guaranteed is that after madvise MADV_FREE syscall returns to
userland, the new writes will stick, so then it's also enough to call
->invalidate_range inside the single
mmu_notifier_invalidate_range_end:

madvise_free_page_range(&tlb, vma, start, end);
mmu_notifier_invalidate_range_end(mm, start, end);
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

This is why we got both _notify and mmu_notifier_invalidate_range_end. If
we remove ->invalidate_range from mmu_notifier_invalidate_range_end,
we'll have to add a mmu_notifier_invalidate_range in places like above
(just before or just after mmu_notifier_invalidate_range_end above).

So with ptep_clear_flush_notify that avoids any issue with page
freeing in places like CoW, and the explicit
mmu_notifier_invalidate_range in the tlb gather, the rest got covered
automatically by mmu_notifier_invalidate_range_end. And again, this
only started to be needed when we added support for hardware pagetable
walkers that cannot stop the pagetable walking (unless they break the
sharing of the pagetable with the primary MMU, which of course is not
desirable and would cause unnecessary overhead).

The current ->invalidate_range handling however results in double
calls here and there when armed, but it reduces the number of explicit
hooks required in the common code and it keeps the mmu_notifier code
less intrusive and more optimal when disarmed (but less optimal when
armed). So the current state is a reasonable tradeoff, but there's
room for optimization.

> I might ponder on all this for more cleanup for mmu_notifier as i have
> some optimization that i have line up for it but this is next cycle
> material. For 4.13 i believe the current patchset is the safest way
> to go.

Yes, lots of material above for pondering, but the current status is
fine.

Did you look into reducing the range flushed in try_to_unmap_one to
PAGE_SIZE << compound_order(page)? That looked like a straightforward
optimization that won't add any complexity, and it would avoid
worsening the granularity of the invalidates during try_to_unmap
compared to the previous code.

Thanks,
Andrea

2017-08-31 18:26:01

by Jerome Glisse

[permalink] [raw]
Subject: Re: [PATCH 02/13] mm/rmap: update to new mmu_notifier semantic

On Thu, Aug 31, 2017 at 01:01:25AM +0200, Andrea Arcangeli wrote:
> On Wed, Aug 30, 2017 at 02:53:38PM -0700, Linus Torvalds wrote:
> > On Wed, Aug 30, 2017 at 9:52 AM, Andrea Arcangeli <[email protected]> wrote:
> > >
> > > I pointed out in earlier email ->invalidate_range can only be
> > > implemented (as mutually exclusive alternative to
> > > ->invalidate_range_start/end) by secondary MMUs that shares the very
> > > same pagetables with the core linux VM of the primary MMU, and those
> > > invalidate_range are already called by
> > > __mmu_notifier_invalidate_range_end.
> >
> > I have to admit that I didn't notice that fact - that we are already
> > in the situation that
> > invalidate_range is called by by the rand_end() nofifier.
> >
> > I agree that that should simplify all the code, and means that we
> > don't have to worry about the few cases that already implemented only
> > the "invalidate_page()" and "invalidate_range()" cases.
> >
> > So I think that simplifies Jérôme's patch further - once you have put
> > the range_start/end() cases around the inner loop, you can just drop
> > the invalidate_page() things entirely.
> >
> > > So this conversion from invalidate_page to invalidate_range looks
> > > superflous and the final mmu_notifier_invalidate_range_end should be
> > > enough.
> >
> > Yes. I missed the fact that we already called range() from range_end().
> >
> > That said, the double call shouldn't hurt correctness, and it's
> > "closer" to old behavior for those people who only did the range/page
> > ones, so I wonder if we can keep Jérôme's patch in its current state
> > for 4.13.
>
> Yes, the double call doesn't hurt correctness. Keeping it in current
> state is safer if something, so I've no objection to it other than I'd
> like to optimize it further if possible, but it can be done later.
>
> We're already running the double call in various fast paths too in
> fact, and rmap walk isn't the fastest path that would be doing such
> double call, so it's not a major concern.
>
> Also not a bug, but one further (but more obviously safe) enhancement
> I would like is to restrict those rmap invalidation ranges to
> PAGE_SIZE << compound_order(page) instead of PMD_SIZE/PMD_MASK.
>
> + /*
> + * We have to assume the worse case ie pmd for invalidation. Note that
> + * the page can not be free in this function as call of try_to_unmap()
> + * must hold a reference on the page.
> + */
> + end = min(vma->vm_end, (start & PMD_MASK) + PMD_SIZE);
> + mmu_notifier_invalidate_range_start(vma->vm_mm, start, end);
>
> We don't need to invalidate 2MB of secondary MMU mappings surrounding
> a 4KB page, just to swapout a 4k page. split_huge_page can't run while
> holding the rmap locks, so compound_order(page) is safe to use there.
>
> It can also be optimized incrementally later.

This optimization is safe i believe. Linus i can respin with that and
with further kvm dead code removal.

Jérôme

2017-08-31 19:16:03

by Nadav Amit

[permalink] [raw]
Subject: Re: [PATCH 02/13] mm/rmap: update to new mmu_notifier semantic

Andrea Arcangeli <[email protected]> wrote:

> On Wed, Aug 30, 2017 at 08:47:19PM -0400, Jerome Glisse wrote:
>> On Wed, Aug 30, 2017 at 04:25:54PM -0700, Nadav Amit wrote:
>>> For both CoW and KSM, the correctness is maintained by calling
>>> ptep_clear_flush_notify(). If you defer the secondary MMU invalidation
>>> (i.e., replacing ptep_clear_flush_notify() with ptep_clear_flush() ), you
>>> will cause memory corruption, and page-lock would not be enough.
>>
>> Just to add up, the IOMMU have its own CPU page table walker and it can
>> walk the page table at any time (not the page table current to current
>> CPU, IOMMU have an array that match a PASID with a page table and device
>> request translation for a given virtual address against a PASID).
>>
>> So this means the following can happen with ptep_clear_flush() only:
>>
>> CPU | IOMMU
>> | - walk page table populate tlb at addr A
>> - clear pte at addr A |
>> - set new pte |
> mmu_notifier_invalidate_range_end | -flush IOMMU/device tlb
>
>> Device is using old page and CPU new page :(
>
> That condition won't be persistent.
>
>> But with ptep_clear_flush_notify()
>>
>> CPU | IOMMU
>> | - walk page table populate tlb at addr A
>> - clear pte at addr A |
>> - notify -> invalidate_range | > flush IOMMU/device tlb
>> - set new pte |
>>
>> So now either the IOMMU see the empty pte and trigger a page fault (this is
>> if there is a racing IOMMU ATS right after the IOMMU/device tlb flush but
>> before setting the new pte) or it see the new pte. Either way both IOMMU
>> and CPU have a coherent view of what a virtual address points to.
>
> Sure, the _notify version is obviously safe.
>
>> Andrea explained to me the historical reasons set_pte_at_notify call the
>> change_pte callback and it was intended so that KVM could update the
>> secondary page table directly without having to fault. It is now a pointless
>> optimization as the call to range_start() happening in all the places before
>> any set_pte_at_notify() invalidate the secondary page table and thus will
>> lead to page fault for the vm. I have talk with Andrea on way to bring back
>> this optimization.
>
> Yes, we known for a long time, the optimization got basically disabled
> when range_start/end expanded. It'd be nice to optimize change_pte
> again but this is for later.
>
>> Yes we need the following sequence for IOMMU:
>> - clear pte
>> - invalidate IOMMU/device TLB
>> - set new pte
>>
>> Otherwise the IOMMU page table walker can populate IOMMU/device tlb with
>> stall entry.
>>
>> Note that this is not necessary for all the case. For try_to_unmap it
>> is fine for instance to move the IOMMU tlb shoot down after changing the
>> CPU page table as we are not pointing the pte to a different page. Either
>> we clear the pte or we set a swap entry and as long as the page that use
>> to be pointed by the pte is not free before the IOMMU tlb flush then we
>> are fine.
>>
>> In fact i think the only case where we need the above sequence (clear,
>> flush secondary tlb, set new pte) is for COW. I think all other cases
>> we can get rid of invalidate_range() from inside the page table lock
>> and rely on invalidate_range_end() to call unconditionaly.
>
> Even with CoW, it's not big issue if the IOMMU keeps reading from the
> old page for a little while longer (in between PT lock release and
> mmu_notifier_invalidate_range_end).
>
> How can you tell you read the old data a bit longer, because it
> noticed the new page only when mmu_notifier_invalidate_range_end run,
> and not because the CPU was faster at writing than the IOMMU was fast
> at loading the new pagetable?
>
> I figure it would be detectable only that the CPU could see pageA
> changing before pageB. The iommu-v2 could see pageB changing before
> pageA. If that's a concern that is the only good reason I can think of
> right now, for requiring ->invalidate_page inside the CoW PT lock to
> enforce the same ordering. However if the modifications happens
> simultaneously and it's a correct runtime, the order must not matter,
> but still it would be detectable which may not be desirable.

I don’t know what is the memory model that SVM provides, but what you
describe here potentially violates it. I don’t think user software should
be expected to deal with it.

>
> Currently the _notify is absolutely needed to run ->invalidate_range
> before put_page is run in the CoW code below, because of the put_page
> that is done in a not scalable place, it would be better moved down
> regardless of ->invalidate_range to reduce the size of the PT lock
> protected critical section.
>
> - if (new_page)
> - put_page(new_page);
>
> pte_unmap_unlock(vmf->pte, vmf->ptl);
> mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
> + if (new_page)
> + put_page(new_page);
>
> Of course the iommu will not immediately start reading from the new
> page, but even if it triggers a write protect fault and calls
> handle_mm_fault, it will find a writeable pte already there and it'll
> get an ->invalidate_range as soon as mmu_notifier_invalidate_range_end
> runs so it can sure notice the new page.
>
> Now write_protect_page in KSM...
>
> What bad can happen if iommu keeps writing to the write protected
> page, for a little while longer? As long as nothing writes to the page
> anymore by the time write_protect_page() returns, pages_identical will
> work. How do you know the IOMMU was just a bit faster and wrote a few
> more bytes and it wasn't mmu_notifier_invalidate_range_end that run a
> bit later after dropping the PT lock?

I guess it should be ok in this case.

>
> Now replace_page in KSM...
>
> ptep_clear_flush_notify(vma, addr, ptep);
> set_pte_at_notify(mm, addr, ptep, newpte);
>
> page_remove_rmap(page, false);
> if (!page_mapped(page))
> try_to_free_swap(page);
> - put_page(page);
>
> pte_unmap_unlock(ptep, ptl);
> err = 0;
> out_mn:
> mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
> out:
> + put_page(page); /* TODO: broken of course, fix error cases */
> return err;
> }
>
> If we free the old page after mmu_notifier_invalidate_range_end
> (fixing up the error case, the above change ignores the error paths),
> the content of the old and new page are identical for replace_page.
>
> Even if new page takes immediately a COW, how do you know the IOMMU
> was just a bit slower and read the old page content pre-COW? They're
> guaranteed identical and both readonly already at that point.

A potential problem might be that it requires that there be no
promotion of PTE access rights (e.g., RO->RW), since in these cases there
would be no invalidation. I did not find such a code-path, but I might have
missed one, or a new one could be introduced in the future. For example,
someone may optimize KSM not to COW the last reference of a KSM’d page upon
page-fault.

Therefore, such optimizations may require some sort of checker, or at the
very least clear documentation of when an invalidation is required,
especially as it would require invalidation on access-rights promotion,
which is currently not needed.

> All the above considered, if this is getting way too complex, it may
> be preferable to keep things obviously safe and always run
> mmu_notifier_invalidate_range inside the PT lock and be done with it,
> and remove the ->invalidate_range from
> mmu_notifier_invalidate_range_end too to avoid the double invalidates
> for the secondary MMUs with hardware pagetable walkers and shared
> pagetables with the primary MMU.
>
> In principle the primary reason for doing _notify or explicit
> mmu_notifier_invalidate_range() is to keep things simpler and to avoid
> having to care where pages exactly gets freed (i.e. before or after
> mmu_notifier_invalidate_range_end).
>
> For example zap_page_range tlb gather freeing strictly requires an
> explicit mmu_notifier_invalidate_range before the page is actually
> freed (because the tlb gather will free the pages well before
> mmu_notifier_invalidate_range_end can run).
>
> The concern that an ->invalidate_range is always needed before PT lock
> is released if the primary TLB was flushed inside PT lock, is a more
> recent concern and it looks like to me it's not always needed but
> perhaps only in some case.
>
> An example where the ->invalidate_range inside
> mmu_notifier_invalidate_range_end pays off, is madvise_free_pte_range.
> That doesn't flush the TLB before setting the pagetable clean. So the
> primary MMU can still write through the dirty primary TLB without
> setting the dirty/accessed bit after madvise_free_pte_range returns.
>
> ptent = ptep_get_and_clear_full(mm, addr, pte,
> tlb->fullmm);
>
> ptent = pte_mkold(ptent);
> ptent = pte_mkclean(ptent);
> set_pte_at(mm, addr, pte, ptent);
> tlb_remove_tlb_entry(tlb, pte, addr);
>
> Not even the primary TLB is flushed here. All concurrent writes of the
> primary MMU can still go lost while MADV_FREE runs. All that is
> guaranteed is that after madvise MADV_FREE syscall returns to
> userland, the new writes will stick, so then it's also enough to call
> ->invalidate_range inside the single
> mmu_notifier_invalidate_range_end:
>
> madvise_free_page_range(&tlb, vma, start, end);
> mmu_notifier_invalidate_range_end(mm, start, end);
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>
> This why we got both _notify and mmu_notifier_invalidate_range_end. If
> we remove ->invalidate_range from mmu_notifier_invalidate_range_end,
> we'll have to add a mmu_notifier_invalidate_range in places like above
> (just before or just after mmu_notifier_invalidate_range_end above).
>
> So with ptep_clear_flush_notify that avoids any issue with page
> freeing in places like CoW, and the explicit
> mmu_notifier_invalidate_range in the tlb gather, the rest got covered
> automatically by mmu_notifier_invalidate_range_end. And again this is
> only started to be needed when we added support for hardware pagetable
> walkers that cannot stop the pagetable walking (unless they break the
> sharing of the pagetable with the primary MMU which of course is not
> desirable and it would cause unnecessary overhead).
>
> The current ->invalidate_range handling however results in double
> calls here and there when armed, but it reduces the number of explicit
> hooks required in the common code and it keeps the mmu_notifier code
> less intrusive and more optimal when disarmed (but less optimal when
> armed). So the current state is a reasonable tradeoff, but there's
> room for optimization.

Everything you say makes sense, but at the end of the day you end up with
two different schemes for flushing the primary and secondary TLBs. For
example, the primary TLB flush mechanism is much more selective in flushing
than the secondary one. The discrepancy between the two surely does not
make things simpler.

Indeed, there might be different costs and trade-offs in the primary and
secondary TLBs (e.g., TLB flush/shootdown time, TLB miss latency, whether an
address-space is always loaded), but I have some doubts whether this
decision was data driven.

Regards,
Nadav


2017-08-31 19:40:16

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 02/13] mm/rmap: update to new mmu_notifier semantic

On Thu, Aug 31, 2017 at 11:25 AM, Jerome Glisse <[email protected]> wrote:
>
> This optimization is safe i believe. Linus i can respin with that and
> with further kvm dead code removal.

Yes, please.

And if you've been tracking the reviewed-by and tested-by's that would
be really good too. I lost much of yesterday to a sudden kidney stone
emergency, so I might not have tracked every ack.

Linus

2017-09-01 14:47:32

by Jeff Cook

[permalink] [raw]
Subject: Re: [PATCH 00/13] mmu_notifier kill invalidate_page callback

On Wed, Aug 30, 2017, at 10:57 AM, Adam Borowski wrote:
> On Tue, Aug 29, 2017 at 08:56:15PM -0400, Jerome Glisse wrote:
> > I will wait for people to test and for result of my own test before
> > reposting if need be, otherwise i will post as separate patch.
> >
> > > But from a _very_ quick read-through this looks fine. But it obviously
> > > needs testing.
> > >
> > > People - *especially* the people who saw issues under KVM - can you
> > > try out Jérôme's patch-series? I aded some people to the cc, the full
> > > series is on lkml. Jérôme - do you have a git branch for people to
> > > test that they could easily pull and try out?
> >
> > https://cgit.freedesktop.org/~glisse/linux mmu-notifier branch
> > git://people.freedesktop.org/~glisse/linux
>
> Tested your branch as of 10f07641, on a long list of guest VMs.
> No earth-shattering kaboom.

I've been using the mmu_notifier branch @ a3d944233bcf8c for the last 36
hours or so, also without incident.

Unlike most other reporters, I experienced a similar splat on 4.12:

Aug 03 15:02:47 kvm_master kernel: ------------[ cut here ]------------
Aug 03 15:02:47 kvm_master kernel: WARNING: CPU: 13 PID: 1653 at
arch/x86/kvm/mmu.c:682 mmu_spte_clear_track_bits+0xfb/0x100 [kvm]
Aug 03 15:02:47 kvm_master kernel: Modules linked in: vhost_net vhost
tap xt_conntrack xt_CHECKSUM iptable_mangle ipt_REJECT nf_reject_ipv4
xt_tcpudp tun ebtable_filter ebtables ip6table_filter ip6_tables
iptable_filter msr nls_iso8859_1 nls_cp437 intel_rapl ipt_
MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_conntrack_ipv4
nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack sb_edac
x86_pkg_temp_thermal intel_powerclamp coretemp crct10dif_pclmul
crc32_pclmul ghash_clmulni_intel input_leds pcbc aesni_intel led_class
aes_x86_6
4 mxm_wmi crypto_simd glue_helper uvcvideo cryptd videobuf2_vmalloc
videobuf2_memops igb videobuf2_v4l2 videobuf2_core snd_usb_audio
videodev media joydev ptp evdev mousedev intel_cstate pps_core mac_hid
intel_rapl_perf snd_hda_intel snd_virtuoso snd_usbmidi_lib snd_hda_codec
snd_oxygen_lib snd_hda_core
Aug 03 15:02:47 kvm_master kernel: snd_mpu401_uart snd_rawmidi
snd_hwdep snd_seq_device snd_pcm snd_timer snd soundcore i2c_algo_bit
pcspkr i2c_i801 lpc_ich ioatdma shpchp dca wmi acpi_power_meter tpm_tis
tpm_tis_core tpm button bridge stp llc sch_fq_codel virtio_pci
virtio_blk virtio_balloon virtio_net virtio_ring virtio kvm_intel kvm sg
ip_tables x_tables hid_logitech_hidpp hid_logitech_dj hid_generic
hid_microsoft usbhid hid sr_mod cdrom sd_mod xhci_pci ahci libahci
xhci_hcd libata usbcore scsi_mod usb_common zfs(PO) zunicode(PO)
zavl(PO) icp(PO) zcommon(PO) znvpair(PO) spl(O) drm_kms_helper
syscopyarea sysfillrect sysimgblt fb_sys_fops drm vfio_pci irqbypass
vfio_virqfd vfio_iommu_type1 vfio vfat fat ext4 crc16 jbd2 fscrypto
mbcache dm_thin_pool dm_cache dm_persistent_data dm_bio_prison dm_bufio
dm_raid raid456 libcrc32c
Aug 03 15:02:47 kvm_master kernel: crc32c_generic crc32c_intel
async_raid6_recov async_memcpy async_pq async_xor xor async_tx raid6_pq
dm_mod dax raid1 md_mod
Aug 03 15:02:47 kvm_master kernel: CPU: 13 PID: 1653 Comm: kworker/13:2
Tainted: P B D W O 4.12.3-1-ARCH #1
Aug 03 15:02:47 kvm_master kernel: Hardware name: Supermicro
SYS-7038A-I/X10DAI, BIOS 2.0a 11/09/2016
Aug 03 15:02:47 kvm_master kernel: Workqueue: events mmput_async_fn
Aug 03 15:02:47 kvm_master kernel: task: ffff9fa89751b900 task.stack:
ffffc179880d8000
Aug 03 15:02:47 kvm_master kernel: RIP:
0010:mmu_spte_clear_track_bits+0xfb/0x100 [kvm]
Aug 03 15:02:47 kvm_master kernel: RSP: 0018:ffffc179880dbc20 EFLAGS:
00010246
Aug 03 15:02:47 kvm_master kernel: RAX: 0000000000000000 RBX:
00000009c07cce77 RCX: dead0000000000ff
Aug 03 15:02:47 kvm_master kernel: RDX: 0000000000000000 RSI:
ffff9fa82d6d6f08 RDI: fffff6e76701f300
Aug 03 15:02:47 kvm_master kernel: RBP: ffffc179880dbc38 R08:
0000000000100000 R09: 000000000000000d
Aug 03 15:02:47 kvm_master kernel: R10: ffff9fa0a56b0008 R11:
ffff9fa0a56b0000 R12: 00000000009c07cc
Aug 03 15:02:47 kvm_master kernel: R13: ffff9fa88b990000 R14:
ffff9f9e19dbb1b8 R15: 0000000000000000
Aug 03 15:02:47 kvm_master kernel: FS: 0000000000000000(0000)
GS:ffff9fac5f340000(0000) knlGS:0000000000000000
Aug 03 15:02:47 kvm_master kernel: CS: 0010 DS: 0000 ES: 0000 CR0:
0000000080050033
Aug 03 15:02:47 kvm_master kernel: CR2: ffffd1b542d71000 CR3:
0000000570a09000 CR4: 00000000003426e0
Aug 03 15:02:47 kvm_master kernel: DR0: 0000000000000000 DR1:
0000000000000000 DR2: 0000000000000000
Aug 03 15:02:47 kvm_master kernel: DR3: 0000000000000000 DR6:
00000000fffe0ff0 DR7: 0000000000000400
Aug 03 15:02:47 kvm_master kernel: Call Trace:
Aug 03 15:02:47 kvm_master kernel: drop_spte+0x1a/0xb0 [kvm]
Aug 03 15:02:47 kvm_master kernel: mmu_page_zap_pte+0x9c/0xe0 [kvm]
Aug 03 15:02:47 kvm_master kernel: kvm_mmu_prepare_zap_page+0x65/0x310
[kvm]
Aug 03 15:02:47 kvm_master kernel:
kvm_mmu_invalidate_zap_all_pages+0x10d/0x160 [kvm]
Aug 03 15:02:47 kvm_master kernel: kvm_arch_flush_shadow_all+0xe/0x10
[kvm]
Aug 03 15:02:47 kvm_master kernel: kvm_mmu_notifier_release+0x2c/0x40
[kvm]
Aug 03 15:02:47 kvm_master kernel: __mmu_notifier_release+0x44/0xc0
Aug 03 15:02:47 kvm_master kernel: exit_mmap+0x142/0x150
Aug 03 15:02:47 kvm_master kernel: ? kfree+0x175/0x190
Aug 03 15:02:47 kvm_master kernel: ? kfree+0x175/0x190
Aug 03 15:02:47 kvm_master kernel: ? exit_aio+0xc6/0x100
Aug 03 15:02:47 kvm_master kernel: mmput_async_fn+0x4c/0x130
Aug 03 15:02:47 kvm_master kernel: process_one_work+0x1de/0x430
Aug 03 15:02:47 kvm_master kernel: worker_thread+0x47/0x3f0
Aug 03 15:02:47 kvm_master kernel: kthread+0x125/0x140
Aug 03 15:02:47 kvm_master kernel: ? process_one_work+0x430/0x430
Aug 03 15:02:47 kvm_master kernel: ? kthread_create_on_node+0x70/0x70
Aug 03 15:02:47 kvm_master kernel: ret_from_fork+0x25/0x30
Aug 03 15:02:47 kvm_master kernel: Code: ec 75 04 00 48 b8 00 00 00 00
00 00 00 40 48 21 da 48 39 c2 0f 95 c0 eb b2 48 d1 eb 83 e3 01 eb c0 4c
89 e7 e8 f7 3d fe ff eb a4 <0f> ff eb 8a 90 0f 1f 44 00 00 55 48 89 e5
53 89 d3 e8 ff 4a fe
Aug 03 15:02:47 kvm_master kernel: ---[ end trace 8710f4d700a7d36e ]---

This would typically take 36-48 hours to surface, so we're good so far,
but not completely out of the woods yet. I'm optimistic that since this
patchset changes the mmu_notifier behavior to something safer in
general, this issue will also be resolved by it.

Jeff

>
>
> Meow!
> --
> ⢀⣴⠾⠻⢶⣦⠀
> ⣾⠁⢰⠒⠀⣿⡁ Vat kind uf sufficiently advanced technology iz dis!?
> ⢿⡄⠘⠷⠚⠋⠀ -- Genghis Ht'rok'din
> ⠈⠳⣄⠀⠀⠀⠀

2017-09-06 14:08:11

by Arumugam, Kamenee

[permalink] [raw]
Subject: RE: [PATCH 06/13] IB/hfi1: update to new mmu_notifier semantic

Tested this patch with hfi1 driver and no issue found.

Tested-by: Kamenee Arumugam <[email protected]>

-----Original Message-----
From: [email protected] [mailto:[email protected]] On Behalf Of Jérôme Glisse
Sent: Tuesday, August 29, 2017 7:55 PM
To: [email protected]; [email protected]
Cc: Jérôme Glisse <[email protected]>; [email protected]; Luick, Dean <[email protected]>; Weiny, Ira <[email protected]>; Doug Ledford <[email protected]>; Kirill A . Shutemov <[email protected]>; Andrew Morton <[email protected]>; Linus Torvalds <[email protected]>; Andrea Arcangeli <[email protected]>
Subject: [PATCH 06/13] IB/hfi1: update to new mmu_notifier semantic

Calls to mmu_notifier_invalidate_page() are replaced by calls to
mmu_notifier_invalidate_range() and thus are bracketed by calls to
mmu_notifier_invalidate_range_start()/end().

Remove the now-useless invalidate_page callback.

Signed-off-by: Jérôme Glisse <[email protected]>
Cc: [email protected]
Cc: Dean Luick <[email protected]>
Cc: Ira Weiny <[email protected]>
Cc: Doug Ledford <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
---
drivers/infiniband/hw/hfi1/mmu_rb.c | 9 ---------
1 file changed, 9 deletions(-)

diff --git a/drivers/infiniband/hw/hfi1/mmu_rb.c b/drivers/infiniband/hw/hfi1/mmu_rb.c
index ccbf52c8ff6f..e4b56a0dd6d0 100644
--- a/drivers/infiniband/hw/hfi1/mmu_rb.c
+++ b/drivers/infiniband/hw/hfi1/mmu_rb.c
@@ -67,8 +67,6 @@ struct mmu_rb_handler {
 
 static unsigned long mmu_node_start(struct mmu_rb_node *);
 static unsigned long mmu_node_last(struct mmu_rb_node *);
-static inline void mmu_notifier_page(struct mmu_notifier *, struct mm_struct *,
-				     unsigned long);
 static inline void mmu_notifier_range_start(struct mmu_notifier *,
 					    struct mm_struct *,
 					    unsigned long, unsigned long);
@@ -82,7 +80,6 @@ static void do_remove(struct mmu_rb_handler *handler,
 static void handle_remove(struct work_struct *work);
 
 static const struct mmu_notifier_ops mn_opts = {
-	.invalidate_page = mmu_notifier_page,
 	.invalidate_range_start = mmu_notifier_range_start,
 };
 
@@ -285,12 +282,6 @@ void hfi1_mmu_rb_remove(struct mmu_rb_handler *handler,
 	handler->ops->remove(handler->ops_arg, node);
 }
 
-static inline void mmu_notifier_page(struct mmu_notifier *mn,
-				     struct mm_struct *mm, unsigned long addr)
-{
-	mmu_notifier_mem_invalidate(mn, mm, addr, addr + PAGE_SIZE);
-}
-
 static inline void mmu_notifier_range_start(struct mmu_notifier *mn,
 					    struct mm_struct *mm,
 					    unsigned long start,
--
2.13.5
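
For readers following the conversion, here is a minimal, hypothetical
caller-side sketch of the bracketing pattern this series introduces, written
against the 4.13-era mmu_notifier API. It is not taken from any patch in the
series; example_unmap_one() and clear_one_pte() are made-up placeholders for
the core-mm code that actually modifies the page table.

/*
 * Hypothetical sketch only: example_unmap_one() and clear_one_pte() do not
 * exist in the kernel; they stand in for whatever code clears a PTE under
 * the page-table lock.
 */
#include <linux/mm.h>
#include <linux/mmu_notifier.h>

static void clear_one_pte(struct mm_struct *mm, unsigned long address);

static void example_unmap_one(struct mm_struct *mm, unsigned long address)
{
	unsigned long start = address;
	unsigned long end = address + PAGE_SIZE;

	/* Sleepable: tell listeners the range is about to be invalidated. */
	mmu_notifier_invalidate_range_start(mm, start, end);

	clear_one_pte(mm, address);	/* the actual page-table update */

	/*
	 * Non-sleepable notification right after the page-table update, for
	 * listeners (e.g. IOMMUs) that mirror the CPU page table
	 * synchronously.
	 */
	mmu_notifier_invalidate_range(mm, start, end);

	/* Sleepable again: invalidation of this range is finished. */
	mmu_notifier_invalidate_range_end(mm, start, end);
}

The driver side is visible in the hfi1 diff above: once every core-mm path
notifies a start/end range this way, .invalidate_page has no remaining
callers, and a driver that only tracks ranges can keep just
.invalidate_range_start (or .invalidate_range, if it can invalidate without
sleeping).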

2017-11-30 09:34:30

by Fabian Grünbichler

[permalink] [raw]
Subject: BSOD with [PATCH 00/13] mmu_notifier kill invalidate_page callback

On Tue, Aug 29, 2017 at 07:54:34PM -0400, Jérôme Glisse wrote:
> (Sorry for so many list cross-posting and big cc)

Ditto (trimmed it a bit already, feel free to limit replies as you see
fit).

>
> Please help testing !
>

Kernels 4.13 and 4.14 (which both contain this patch series in its final
form) are affected by a bug that triggers a BSOD
(CRITICAL_STRUCTURE_CORRUPTION) in Windows 10/2016 VMs running in Qemu under
certain conditions on certain hardware/microcode versions (see below for
details).

Testing this proved to be quite cumbersome, as only some systems are
affected and it took a while to find a semi-reliable test setup. Some
users reported that microcode updates made the problem disappear on some
affected systems[1].

Bisecting the 4.13 release cycle first pointed to

aac2fea94f7a3df8ad1eeb477eb2643f81fd5393 rmap: do not call mmu_notifier_invalidate_page() under ptl

as the likely culprit (although it was not possible to bisect exactly down
to this commit).

It was reverted in 785373b4c38719f4af6775845df6be1dfaea120f, after which
the symptoms disappeared until this series was merged, which contains

369ea8242c0fb5239b4ddf0dc568f694bd244de4 mm/rmap: update to new mmu_notifier semantic v2

We haven't bisected the individual commits of the series yet, but the
commit immediately preceding its merge exhibits no problems, while
everything after does. It is not known whether the bug is actually in
the series itself, or whether increasing the likelihood of triggering it
is just a side-effect. There is a similar report[2] concerning an
upgrade from 4.12.12 to 4.12.13, which does not contain this series in
any form AFAICT but might be worth another look as well.

Our test setup consists of the following:
CPU: Intel(R) Xeon(R) CPU D-1528 @ 1.90GHz (single socket)
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat
pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb
rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology
nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx
est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic
movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm
3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 intel_ppin intel_pt
tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2
smep bmi2 erms invpcid rtm cqm rdt_a rdseed adx smap xsaveopt cqm_llc
cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm arat pln pts
microcode: 0x700000e
Mainboard: Supermicro X10SDV-6C+-TLN4F
RAM: 64G DDR4 [3]
Swap: 8G on an LV
Qemu: 2.9.1 with some patches on top ([4])
OS: Debian Stretch based (PVE 5.1)
KSM is enabled, but turning it off just increases the number of test
iterations needed to trigger the bug.
Kernel config: [5]

VMs:
A: Windows 2016 with virtio-blk disks, 6G RAM
B: Windows 10 with (virtual) IDE disks, 6G RAM
C: Debian Stretch, ~55G RAM

fio config:
[global]
thread=2
runtime=1800

[write1]
ioengine=windowsaio
sync=0
direct=0
bs=4k
size=30G
rw=randwrite
iodepth=16

[read1]
ioengine=windowsaio
sync=0
direct=0
bs=4k
size=30G
rw=randread
iodepth=16

Test run:

- Start all three VMs
- run 'stress-ng --vm-bytes 1G --vm 52 -t 6000' in VM C
- wait until swap is (almost) full and KSM starts to merge pages
- start fio in VM A and B
- stop stress-ng in VM C, power off VM C
- run 'swapoff -av' on host
- wait until swap content has been swapped in again (this takes a while)
- observe BSOD in at least one of A / B around 30% of the time

While this test case is pretty artificial, the BSOD issue does affect
users in the wild running regular workloads (where it can take from
multiple hours up to several days to trigger).

We have reverted this patch series in our 4.13-based kernel for now,
with positive feedback from users and from our own testing. If more detailed
traces or data from a test run on an affected system are needed, we will
of course provide them.

Any further input / pointers are highly appreciated!

1: https://forum.proxmox.com/threads/blue-screen-with-5-1.37664/
2: http://www.spinics.net/lists/kvm/msg159179.html
https://bugs.launchpad.net/qemu/+bug/1728256
https://bugzilla.kernel.org/show_bug.cgi?id=197951
3: http://www.samsung.com/semiconductor/products/dram/server-dram/ddr4-registered-dimm/M393A2K40BB1?ia=2503
4: https://git.proxmox.com/?p=pve-qemu.git;a=tree;f=debian/patches;h=2c516be8e69a033d14809b17e8a661b3808257f7;hb=8d4a2d3f5569817221c19a91f763964c40e00292
5: https://gist.github.com/Fabian-Gruenbichler/5c3af22ac7e6faae46840bdcebd7df14

