Hi,
here is a patch-set to extend the mmu_notifiers in the Linux
kernel to allow managing CPU external TLBs. Those TLBs may
be implemented in IOMMUs or any other external device, e.g.
ATS/PRI capable PCI devices.
The problem with managing these TLBs is the semantics of
the invalidate_range_start/end call-backs currently
available. The subsystem using mmu_notifiers has
to guarantee that no new TLB entries are established between
invalidate_range_start/end. Furthermore, the
invalidate_range_start() function is called when all pages
are still mapped and invalidate_range_end() when the pages
are already unmapped and freed.
So both call-backs can't be used to safely flush any non-CPU
TLB because _start() is called too early and _end() too
late.
In the AMD IOMMUv2 driver this is currently implemented by
assigning an empty page-table to the external device between
_start() and _end(). But as tests have shown this doesn't
work: external devices don't re-fault indefinitely but enter
a failure state after some time.
Another problem with this solution is that it causes a storm
of IO page-fault interrupts that have to be handled while the
empty page-table is assigned.
To solve this situation I wrote a patch-set to introduce a
new notifier call-back: mmu_notifier_invalidate_range(). This
notifier lifts the strict requirement that no new
references are taken in the range between _start() and
_end(). When the subsystem can't guarantee that no new
references are taken, it has to provide the
invalidate_range() call-back to clear any new references
taken in that range.
It is called between invalidate_range_start() and _end()
every time the VMM has to wipe out any references to a
range of pages. These are usually the places where the CPU
TLBs are flushed too, and where it is important that this
happens before invalidate_range_end() is called.
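To illustrate, a driver managing such an external TLB would
implement something like the following (only a sketch; the my_dev_*
names and the device flush function are made up, the mmu_notifier
parts follow this patch-set):

	/* Illustration only: my_dev_* is a made-up example driver */
	struct my_dev_pasid_state {
		struct mmu_notifier	mn;
		struct my_dev		*dev;
	};

	static void my_dev_invalidate_range(struct mmu_notifier *mn,
					    struct mm_struct *mm,
					    unsigned long start,
					    unsigned long end)
	{
		struct my_dev_pasid_state *state;

		state = container_of(mn, struct my_dev_pasid_state, mn);

		/*
		 * Pages in [start, end) are unmapped but not yet freed,
		 * so this is the point to flush the device TLB.
		 */
		my_dev_flush_tlb(state->dev, start, end);
	}

	static const struct mmu_notifier_ops my_dev_mmu_notifier_ops = {
		.invalidate_range = my_dev_invalidate_range,
	};

The notifier is registered with mmu_notifier_register() against the
mm_struct of the process as usual.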
Any comments and review appreciated!
Thanks,
Joerg
Joerg Roedel (3):
mmu_notifier: Add mmu_notifier_invalidate_range()
mmu_notifier: Call mmu_notifier_invalidate_range() from VMM
mmu_notifier: Add the call-back for mmu_notifier_invalidate_range()
include/linux/mmu_notifier.h | 66 ++++++++++++++++++++++++++++++++++++++++----
kernel/events/uprobes.c | 2 +-
mm/fremap.c | 2 +-
mm/huge_memory.c | 9 +++---
mm/hugetlb.c | 7 ++++-
mm/ksm.c | 4 +--
mm/memory.c | 3 +-
mm/migrate.c | 3 +-
mm/mmu_notifier.c | 15 ++++++++++
mm/rmap.c | 2 +-
10 files changed, 95 insertions(+), 18 deletions(-)
--
1.9.1
From: Joerg Roedel <[email protected]>
Add calls to the new mmu_notifier_invalidate_range()
function to all places in the VMM that need it.
Signed-off-by: Joerg Roedel <[email protected]>
---
include/linux/mmu_notifier.h | 28 ++++++++++++++++++++++++++++
kernel/events/uprobes.c | 2 +-
mm/fremap.c | 2 +-
mm/huge_memory.c | 9 +++++----
mm/hugetlb.c | 7 ++++++-
mm/ksm.c | 4 ++--
mm/memory.c | 3 ++-
mm/migrate.c | 3 ++-
mm/rmap.c | 2 +-
9 files changed, 48 insertions(+), 12 deletions(-)
diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index f333668..6959dc8 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -273,6 +273,32 @@ static inline void mmu_notifier_mm_destroy(struct mm_struct *mm)
__young; \
})
+#define ptep_clear_flush_notify(__vma, __address, __ptep) \
+({ \
+ unsigned long ___addr = __address & PAGE_MASK; \
+ struct mm_struct *___mm = (__vma)->vm_mm; \
+ pte_t ___pte; \
+ \
+ ___pte = ptep_clear_flush(__vma, __address, __ptep); \
+ mmu_notifier_invalidate_range(___mm, ___addr, \
+ ___addr + PAGE_SIZE); \
+ \
+ ___pte; \
+})
+
+#define pmdp_clear_flush_notify(__vma, __haddr, __pmd) \
+({ \
+ unsigned long ___haddr = __haddr & HPAGE_PMD_MASK; \
+ struct mm_struct *___mm = (__vma)->vm_mm; \
+ pmd_t ___pmd; \
+ \
+ ___pmd = pmdp_clear_flush(__vma, __haddr, __pmd); \
+ mmu_notifier_invalidate_range(___mm, ___haddr, \
+ ___haddr + HPAGE_PMD_SIZE); \
+ \
+ ___pmd; \
+})
+
/*
* set_pte_at_notify() sets the pte _after_ running the notifier.
* This is safe to start by updating the secondary MMUs, because the primary MMU
@@ -346,6 +372,8 @@ static inline void mmu_notifier_mm_destroy(struct mm_struct *mm)
#define ptep_clear_flush_young_notify ptep_clear_flush_young
#define pmdp_clear_flush_young_notify pmdp_clear_flush_young
+#define ptep_clear_flush_notify ptep_clear_flush
+#define pmdp_clear_flush_notify pmdp_clear_flush
#define set_pte_at_notify set_pte_at
#endif /* CONFIG_MMU_NOTIFIER */
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 6f3254e..642262d 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -186,7 +186,7 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
}
flush_cache_page(vma, addr, pte_pfn(*ptep));
- ptep_clear_flush(vma, addr, ptep);
+ ptep_clear_flush_notify(vma, addr, ptep);
set_pte_at_notify(mm, addr, ptep, mk_pte(kpage, vma->vm_page_prot));
page_remove_rmap(page);
diff --git a/mm/fremap.c b/mm/fremap.c
index 72b8fa3..9129013 100644
--- a/mm/fremap.c
+++ b/mm/fremap.c
@@ -37,7 +37,7 @@ static void zap_pte(struct mm_struct *mm, struct vm_area_struct *vma,
if (pte_present(pte)) {
flush_cache_page(vma, addr, pte_pfn(pte));
- pte = ptep_clear_flush(vma, addr, ptep);
+ pte = ptep_clear_flush_notify(vma, addr, ptep);
page = vm_normal_page(vma, addr, pte);
if (page) {
if (pte_dirty(pte))
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 33514d8..b322c97 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1031,7 +1031,7 @@ static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
goto out_free_pages;
VM_BUG_ON_PAGE(!PageHead(page), page);
- pmdp_clear_flush(vma, haddr, pmd);
+ pmdp_clear_flush_notify(vma, haddr, pmd);
/* leave pmd empty until pte is filled */
pgtable = pgtable_trans_huge_withdraw(mm, pmd);
@@ -1168,7 +1168,7 @@ alloc:
pmd_t entry;
entry = mk_huge_pmd(new_page, vma->vm_page_prot);
entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
- pmdp_clear_flush(vma, haddr, pmd);
+ pmdp_clear_flush_notify(vma, haddr, pmd);
page_add_new_anon_rmap(new_page, vma, haddr);
set_pmd_at(mm, haddr, pmd, entry);
update_mmu_cache_pmd(vma, address, pmd);
@@ -1499,7 +1499,7 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
pmd_t entry;
ret = 1;
if (!prot_numa) {
- entry = pmdp_get_and_clear(mm, addr, pmd);
+ entry = pmdp_get_and_clear_notify(mm, addr, pmd);
if (pmd_numa(entry))
entry = pmd_mknonnuma(entry);
entry = pmd_modify(entry, newprot);
@@ -1631,6 +1631,7 @@ static int __split_huge_page_splitting(struct page *page,
* serialize against split_huge_page*.
*/
pmdp_splitting_flush(vma, address, pmd);
+
ret = 1;
spin_unlock(ptl);
}
@@ -2793,7 +2794,7 @@ static void __split_huge_zero_page_pmd(struct vm_area_struct *vma,
pmd_t _pmd;
int i;
- pmdp_clear_flush(vma, haddr, pmd);
+ pmdp_clear_flush_notify(vma, haddr, pmd);
/* leave pmd empty until pte is filled */
pgtable = pgtable_trans_huge_withdraw(mm, pmd);
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 2024bbd..da37ad1 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2602,8 +2602,11 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
}
set_huge_pte_at(dst, addr, dst_pte, entry);
} else {
- if (cow)
+ if (cow) {
huge_ptep_set_wrprotect(src, addr, src_pte);
+ mmu_notifier_invalidate_range(src, mmun_start,
+ mmun_end);
+ }
ptepage = pte_page(entry);
get_page(ptepage);
page_dup_rmap(ptepage);
@@ -2910,6 +2913,7 @@ retry_avoidcopy:
/* Break COW */
huge_ptep_clear_flush(vma, address, ptep);
+ mmu_notifier_invalidate_range(mm, mmun_start, mmun_end);
set_huge_pte_at(mm, address, ptep,
make_huge_pte(vma, new_page, 1));
page_remove_rmap(old_page);
@@ -3384,6 +3388,7 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
* and that page table be reused and filled with junk.
*/
flush_tlb_range(vma, start, end);
+ mmu_notifier_invalidate_range(mm, start, end);
mutex_unlock(&vma->vm_file->f_mapping->i_mmap_mutex);
mmu_notifier_invalidate_range_end(mm, start, end);
diff --git a/mm/ksm.c b/mm/ksm.c
index 346ddc9..a73df3b 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -892,7 +892,7 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page,
* this assure us that no O_DIRECT can happen after the check
* or in the middle of the check.
*/
- entry = ptep_clear_flush(vma, addr, ptep);
+ entry = ptep_clear_flush_notify(vma, addr, ptep);
/*
* Check that no O_DIRECT or similar I/O is in progress on the
* page
@@ -960,7 +960,7 @@ static int replace_page(struct vm_area_struct *vma, struct page *page,
page_add_anon_rmap(kpage, vma, addr);
flush_cache_page(vma, addr, pte_pfn(*ptep));
- ptep_clear_flush(vma, addr, ptep);
+ ptep_clear_flush_notify(vma, addr, ptep);
set_pte_at_notify(mm, addr, ptep, mk_pte(kpage, vma->vm_page_prot));
page_remove_rmap(page);
diff --git a/mm/memory.c b/mm/memory.c
index d67fd9f..ceff829 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -236,6 +236,7 @@ static void tlb_flush_mmu_tlbonly(struct mmu_gather *tlb)
{
tlb->need_flush = 0;
tlb_flush(tlb);
+ mmu_notifier_invalidate_range(tlb->mm, tlb->start, tlb->end);
#ifdef CONFIG_HAVE_RCU_TABLE_FREE
tlb_table_flush(tlb);
#endif
@@ -2232,7 +2233,7 @@ gotten:
* seen in the presence of one thread doing SMC and another
* thread doing COW.
*/
- ptep_clear_flush(vma, address, page_table);
+ ptep_clear_flush_notify(vma, address, page_table);
page_add_new_anon_rmap(new_page, vma, address);
/*
* We call the notify macro here because, when using secondary
diff --git a/mm/migrate.c b/mm/migrate.c
index 9e0beaa..812c0d6 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1874,7 +1874,7 @@ fail_putback:
*/
flush_cache_range(vma, mmun_start, mmun_end);
page_add_anon_rmap(new_page, vma, mmun_start);
- pmdp_clear_flush(vma, mmun_start, pmd);
+ pmdp_clear_flush_notify(vma, mmun_start, pmd);
set_pmd_at(mm, mmun_start, pmd, entry);
flush_tlb_range(vma, mmun_start, mmun_end);
update_mmu_cache_pmd(vma, address, &entry);
@@ -1882,6 +1882,7 @@ fail_putback:
if (page_count(page) != 2) {
set_pmd_at(mm, mmun_start, pmd, orig_entry);
flush_tlb_range(vma, mmun_start, mmun_end);
+ mmu_notifier_invalidate_range(mm, mmun_start, mmun_end);
update_mmu_cache_pmd(vma, address, &entry);
page_remove_rmap(new_page);
goto fail_putback;
diff --git a/mm/rmap.c b/mm/rmap.c
index b7e94eb..c2a703b 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1384,7 +1384,7 @@ static int try_to_unmap_cluster(unsigned long cursor, unsigned int *mapcount,
/* Nuke the page table entry. */
flush_cache_page(vma, address, pte_pfn(*pte));
- pteval = ptep_clear_flush(vma, address, pte);
+ pteval = ptep_clear_flush_notify(vma, address, pte);
/* If nonlinear, store the file page offset in the pte. */
if (page->index != linear_page_index(vma, address)) {
--
1.9.1
From: Joerg Roedel <[email protected]>
This notifier closes an important gap with the current
invalidate_range_start()/end() notifiers. The _start() part
is called when all pages are still mapped while the _end()
notifier is called when all pages are potentially unmapped
and already freed.
This does not allow managing external (non-CPU) hardware
TLBs with MMU-notifiers because there is no way to prevent
the hardware from establishing new TLB entries between the
calls of these two functions. But this is a requirement for
the subsystem that implements these existing notifiers.
To allow managing external TLBs the MMU-notifiers need to
catch the moment when pages are unmapped but not yet freed.
This new notifier catches that moment and notifies the
interested subsystem when pages that were unmapped are about
to be freed. The new notifier will only be called between
invalidate_range_start()/end().
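For illustration, the expected ordering at an unmap site is roughly
the following (not an actual call site; the function and its
arguments are made up just to show where the new call fits in):

	/* Illustration only, not a real call site */
	static void example_unmap_one(struct vm_area_struct *vma,
				      unsigned long addr, pte_t *ptep,
				      struct page *page)
	{
		struct mm_struct *mm = vma->vm_mm;

		/* All pages in the range are still mapped */
		mmu_notifier_invalidate_range_start(mm, addr, addr + PAGE_SIZE);

		/* Clear the PTE and flush the CPU TLB ... */
		ptep_clear_flush(vma, addr, ptep);

		/* ... the page is unmapped but not yet freed */
		mmu_notifier_invalidate_range(mm, addr, addr + PAGE_SIZE);

		page_remove_rmap(page);
		put_page(page);		/* the page may be freed here */

		/* All pages are unmapped and potentially freed */
		mmu_notifier_invalidate_range_end(mm, addr, addr + PAGE_SIZE);
	}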
Signed-off-by: Joerg Roedel <[email protected]>
---
include/linux/mmu_notifier.h | 10 ++++++++++
1 file changed, 10 insertions(+)
diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index deca874..f333668 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -228,6 +228,11 @@ static inline void mmu_notifier_invalidate_range_start(struct mm_struct *mm,
__mmu_notifier_invalidate_range_start(mm, start, end);
}
+static inline void mmu_notifier_invalidate_range(struct mm_struct *mm,
+ unsigned long start, unsigned long end)
+{
+}
+
static inline void mmu_notifier_invalidate_range_end(struct mm_struct *mm,
unsigned long start, unsigned long end)
{
@@ -321,6 +326,11 @@ static inline void mmu_notifier_invalidate_range_start(struct mm_struct *mm,
{
}
+static inline void mmu_notifier_invalidate_range(struct mm_struct *mm,
+ unsigned long start, unsigned long end)
+{
+}
+
static inline void mmu_notifier_invalidate_range_end(struct mm_struct *mm,
unsigned long start, unsigned long end)
{
--
1.9.1
From: Joerg Roedel <[email protected]>
Now that the mmu_notifier_invalidate_range() calls are in
place, add the call-back to allow subsystems to register
against it.
Signed-off-by: Joerg Roedel <[email protected]>
---
include/linux/mmu_notifier.h | 28 ++++++++++++++++++++++------
mm/mmu_notifier.c | 15 +++++++++++++++
2 files changed, 37 insertions(+), 6 deletions(-)
diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index 6959dc8..50dc679 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -95,11 +95,11 @@ struct mmu_notifier_ops {
/*
* invalidate_range_start() and invalidate_range_end() must be
* paired and are called only when the mmap_sem and/or the
- * locks protecting the reverse maps are held. The subsystem
- * must guarantee that no additional references are taken to
- * the pages in the range established between the call to
- * invalidate_range_start() and the matching call to
- * invalidate_range_end().
+ * locks protecting the reverse maps are held. If the subsystem
+ * can't guarantee that no additional references are taken to
+ * the pages in the range, it has to implement the
+ * invalidate_range() notifier to remove any references taken
+ * after invalidate_range_start().
*
* Invalidation of multiple concurrent ranges may be
* optionally permitted by the driver. Either way the
@@ -110,9 +110,19 @@ struct mmu_notifier_ops {
* invalidate_range_start() is called when all pages in the
* range are still mapped and have at least a refcount of one.
*
+ * invalidate_range() is called between invalidate_range_start()
+ * and invalidate_range_end() when the memory management code
+ * removed mappings to pages in the range and is about to free
+ * them. This captures the point when pages are unmapped but
+ * not yet freed.
+ * Note that invalidate_range() might be called only on a
+ * sub-range of the range passed to the corresponding
+ * invalidate_range_start() call.
+ *
* invalidate_range_end() is called when all pages in the
* range have been unmapped and the pages have been freed by
- * the VM.
+ * the VM. It might be called under the ptl spin-lock, so this
+ * notifier is not allowed to sleep.
*
* The VM will remove the page table entries and potentially
* the page between invalidate_range_start() and
@@ -138,6 +148,8 @@ struct mmu_notifier_ops {
void (*invalidate_range_start)(struct mmu_notifier *mn,
struct mm_struct *mm,
unsigned long start, unsigned long end);
+ void (*invalidate_range)(struct mmu_notifier *mn, struct mm_struct *mm,
+ unsigned long start, unsigned long end);
void (*invalidate_range_end)(struct mmu_notifier *mn,
struct mm_struct *mm,
unsigned long start, unsigned long end);
@@ -182,6 +194,8 @@ extern void __mmu_notifier_invalidate_page(struct mm_struct *mm,
unsigned long address);
extern void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
unsigned long start, unsigned long end);
+extern void __mmu_notifier_invalidate_range(struct mm_struct *mm,
+ unsigned long start, unsigned long end);
extern void __mmu_notifier_invalidate_range_end(struct mm_struct *mm,
unsigned long start, unsigned long end);
@@ -231,6 +245,8 @@ static inline void mmu_notifier_invalidate_range_start(struct mm_struct *mm,
static inline void mmu_notifier_invalidate_range(struct mm_struct *mm,
unsigned long start, unsigned long end)
{
+ if (mm_has_notifiers(mm))
+ __mmu_notifier_invalidate_range(mm, start, end);
}
static inline void mmu_notifier_invalidate_range_end(struct mm_struct *mm,
diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
index 41cefdf..d1bdea0 100644
--- a/mm/mmu_notifier.c
+++ b/mm/mmu_notifier.c
@@ -165,6 +165,21 @@ void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
}
EXPORT_SYMBOL_GPL(__mmu_notifier_invalidate_range_start);
+void __mmu_notifier_invalidate_range(struct mm_struct *mm,
+ unsigned long start, unsigned long end)
+{
+ struct mmu_notifier *mn;
+ int id;
+
+ id = srcu_read_lock(&srcu);
+ hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist) {
+ if (mn->ops->invalidate_range)
+ mn->ops->invalidate_range(mn, mm, start, end);
+ }
+ srcu_read_unlock(&srcu, id);
+}
+EXPORT_SYMBOL_GPL(__mmu_notifier_invalidate_range);
+
void __mmu_notifier_invalidate_range_end(struct mm_struct *mm,
unsigned long start, unsigned long end)
{
--
1.9.1
On Thu, Jul 24, 2014 at 04:35:38PM +0200, Joerg Roedel wrote:
> To solve this situation I wrote a patch-set to introduce a
> new notifier call-back: mmu_notifier_invalidate_range(). This
> notifier lifts the strict requirement that no new
> references are taken in the range between _start() and
> _end(). When the subsystem can't guarantee that no new
> references are taken, it has to provide the
> invalidate_range() call-back to clear any new references
> taken in that range.
>
> It is called between invalidate_range_start() and _end()
> every time the VMM has to wipe out any references to a
> range of pages. These are usually the places where the CPU
> TLBs are flushed too, and where it is important that this
> happens before invalidate_range_end() is called.
>
> Any comments and review appreciated!
Reviewed-by: Andrea Arcangeli <[email protected]>
On Thu, 24 Jul 2014 16:35:38 +0200 Joerg Roedel <[email protected]> wrote:
> here is a patch-set to extend the mmu_notifiers in the Linux
> kernel to allow managing CPU external TLBs. Those TLBs may
> be implemented in IOMMUs or any other external device, e.g.
> ATS/PRI capable PCI devices.
>
> The problem with managing these TLBs is the semantics of
> the invalidate_range_start/end call-backs currently
> available. The subsystem using mmu_notifiers has
> to guarantee that no new TLB entries are established between
> invalidate_range_start/end. Furthermore, the
> invalidate_range_start() function is called when all pages
> are still mapped and invalidate_range_end() when the pages
> are already unmapped and freed.
>
> So both call-backs can't be used to safely flush any non-CPU
> TLB because _start() is called too early and _end() too
> late.
>
> In the AMD IOMMUv2 driver this is currently implemented by
> assigning an empty page-table to the external device between
> _start() and _end(). But as tests have shown this doesn't
> work: external devices don't re-fault indefinitely but enter
> a failure state after some time.
>
> Another problem with this solution is that it causes a storm
> of IO page-fault interrupts that have to be handled while the
> empty page-table is assigned.
>
> To solve this situation I wrote a patch-set to introduce a
> new notifier call-back: mmu_notifier_invalidate_range(). This
> notifier lifts the strict requirement that no new
> references are taken in the range between _start() and
> _end(). When the subsystem can't guarantee that no new
> references are taken, it has to provide the
> invalidate_range() call-back to clear any new references
> taken in that range.
>
> It is called between invalidate_range_start() and _end()
> every time the VMM has to wipe out any references to a
> range of pages. These are usually the places where the CPU
> TLBs are flushed too, and where it is important that this
> happens before invalidate_range_end() is called.
>
> Any comments and review appreciated!
It looks pretty simple and harmless.
I assume the AMD IOMMUv2 driver actually uses this and it's all
tested and good? What is the status of that driver?
Yes. AMD has tested this with the iommuv2 driver and verified it works correctly. There is a corresponding change in the iommuv2 driver to use the new API.
> On Jul 24, 2014, at 6:33 PM, "Andrew Morton" <[email protected]> wrote:
>
>> On Thu, 24 Jul 2014 16:35:38 +0200 Joerg Roedel <[email protected]> wrote:
>>
>> here is a patch-set to extend the mmu_notifiers in the Linux
>> kernel to allow managing CPU external TLBs. Those TLBs may
>> be implemented in IOMMUs or any other external device, e.g.
>> ATS/PRI capable PCI devices.
>>
>> The problem with managing these TLBs is the semantics of
>> the invalidate_range_start/end call-backs currently
>> available. The subsystem using mmu_notifiers has
>> to guarantee that no new TLB entries are established between
>> invalidate_range_start/end. Furthermore, the
>> invalidate_range_start() function is called when all pages
>> are still mapped and invalidate_range_end() when the pages
>> are already unmapped and freed.
>>
>> So both call-backs can't be used to safely flush any non-CPU
>> TLB because _start() is called too early and _end() too
>> late.
>>
>> In the AMD IOMMUv2 driver this is currently implemented by
>> assigning an empty page-table to the external device between
>> _start() and _end(). But as tests have shown this doesn't
>> work: external devices don't re-fault indefinitely but enter
>> a failure state after some time.
>>
>> Another problem with this solution is that it causes a storm
>> of IO page-fault interrupts that have to be handled while the
>> empty page-table is assigned.
>>
>> To solve this situation I wrote a patch-set to introduce a
>> new notifier call-back: mmu_notifier_invalidate_range(). This
>> notifier lifts the strict requirement that no new
>> references are taken in the range between _start() and
>> _end(). When the subsystem can't guarantee that no new
>> references are taken, it has to provide the
>> invalidate_range() call-back to clear any new references
>> taken in that range.
>>
>> It is called between invalidate_range_start() and _end()
>> every time the VMM has to wipe out any references to a
>> range of pages. These are usually the places where the CPU
>> TLBs are flushed too, and where it is important that this
>> happens before invalidate_range_end() is called.
>>
>> Any comments and review appreciated!
>
> It looks pretty simple and harmless.
>
> I assume the AMD IOMMUv2 driver actually uses this and it's all
> tested and good? What is the status of that driver?
>
Hi Andrew,
On Thu, Jul 24, 2014 at 04:33:03PM -0700, Andrew Morton wrote:
> On Thu, 24 Jul 2014 16:35:38 +0200 Joerg Roedel <[email protected]> wrote:
> >
> > Any comments and review appreciated!
>
> It looks pretty simple and harmless.
>
> I assume the AMD IOMMUv2 driver actually uses this and it's all
> tested and good? What is the status of that driver?
Yes, the AMD IOMMUv2 driver will use this once it is upstream, the
changes to be made there are pretty simple.
The driver itself will get its first user soon with the AMD KFD driver
currently under review.
Joerg
On Thu, 24 Jul 2014 16:35:39 +0200
Joerg Roedel <[email protected]> wrote:
> From: Joerg Roedel <[email protected]>
>
> This notifier closes an important gap with the current
> invalidate_range_start()/end() notifiers. The _start() part
> is called when all pages are still mapped while the _end()
> notifier is called when all pages are potentially unmapped
> and already freed.
>
> This does not allow managing external (non-CPU) hardware
> TLBs with MMU-notifiers because there is no way to prevent
> the hardware from establishing new TLB entries between the
> calls of these two functions. But this is a requirement for
> the subsystem that implements these existing notifiers.
>
> To allow managing external TLBs the MMU-notifiers need to
> catch the moment when pages are unmapped but not yet freed.
> This new notifier catches that moment and notifies the
> interested subsystem when pages that were unmapped are about
> to be freed. The new notifier will only be called between
> invalidate_range_start()/end().
So if we were actually sharing page tables, we should be able to make
start/end no-ops and just use this new callback, assuming we didn't
need to do any other serialization or debug stuff, right?
Seems like a good addition, and saves us a bunch of trouble...
Thanks,
--
Jesse Barnes, Intel Open Source Technology Center
On Fri, Jul 25, 2014 at 01:16:39PM -0700, Jesse Barnes wrote:
> On Thu, 24 Jul 2014 16:35:39 +0200
> Joerg Roedel <[email protected]> wrote:
>
> > From: Joerg Roedel <[email protected]>
> >
> > This notifier closes an important gap with the current
> > invalidate_range_start()/end() notifiers. The _start() part
> > is called when all pages are still mapped while the _end()
> > notifier is called when all pages are potentially unmapped
> > and already freed.
> >
> > This does not allow managing external (non-CPU) hardware
> > TLBs with MMU-notifiers because there is no way to prevent
> > the hardware from establishing new TLB entries between the
> > calls of these two functions. But this is a requirement for
> > the subsystem that implements these existing notifiers.
> >
> > To allow managing external TLBs the MMU-notifiers need to
> > catch the moment when pages are unmapped but not yet freed.
> > This new notifier catches that moment and notifies the
> > interested subsystem when pages that were unmapped are about
> > to be freed. The new notifier will only be called between
> > invalidate_range_start()/end().
>
> So if we were actually sharing page tables, we should be able to make
> start/end no-ops and just use this new callback, assuming we didn't
> need to do any other serialization or debug stuff, right?
>
> Seems like a good addition, and saves us a bunch of trouble...
Pondering on that, I think there is a missing call to mmu_notifier_invalidate_range
inside move_huge_pmd, which is called by move_page_tables.
But otherwise yes, you should not need to register the range_start/end() callbacks. It
should be enough to only register the invalidate_range callback.
Note that on my side I will remain a user of range_start/end(), but other listeners
like KVM or Xen or SGI might want to revisit their code with this new callback.
Cheers,
Jérôme
>
> Thanks,
> --
> Jesse Barnes, Intel Open Source Technology Center
>
On Fri, Jul 25, 2014 at 01:16:39PM -0700, Jesse Barnes wrote:
> > To allow managing external TLBs the MMU-notifiers need to
> > catch the moment when pages are unmapped but not yet freed.
> > This new notifier catches that moment and notifies the
> > interested subsystem when pages that were unmapped are about
> > to be freed. The new notifier will only be called between
> > invalidate_range_start()/end().
>
> So if we were actually sharing page tables, we should be able to make
> start/end no-ops and just use this new callback, assuming we didn't
> need to do any other serialization or debug stuff, right?
Well, not completely. What you need with this patch-set is an
invalidate_range and an invalidate_range_end call-back. There are call sites
of the start/end functions where the TLB flush happens after the _end
notifier (or at least can wait until _end is called). I did not add
invalidate_range calls to these places (yet). But you can easily discard
invalidate_range_start, any flush done in there is useless with shared
page-tables.
I thought about removing the need for invalidate_range_end too when
writing the patches, and possible solutions are
1) Add mmu_notifier_invalidate_range() to all places where
start/end is called too. This might add some unnecessary
overhead.
2) Call the invalidate_range() call-back from the
mmu_notifier_invalidate_range_end too.
3) Just let the user register the same function for
invalidate_range and invalidate_range_end
I thought that option 1) adds overhead that is not needed (but it might
not be too bad, the overhead is an additional iteration over the
mmu_notifier list when there are no call-backs registered).
Option 2) might also add overhead if a user registers different functions
for invalidate_range() and invalidate_range_end(). In the end I came to
the conclusion that option 3) is the best one from an overhead POV.
But probably targeting better usability with one of the other options is
a better choice? I am open for thoughts and suggestions on that.
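Just to illustrate, option 3) on the user side would amount to
something like this (a sketch only, the flush function is made up):

	static void my_flush_external_tlb(struct mmu_notifier *mn,
					  struct mm_struct *mm,
					  unsigned long start,
					  unsigned long end)
	{
		/* flush the external TLB for [start, end) */
	}

	static const struct mmu_notifier_ops my_ops = {
		.invalidate_range	= my_flush_external_tlb,
		.invalidate_range_end	= my_flush_external_tlb,
	};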
Joerg
On Fri, 25 Jul 2014 23:38:06 +0200
Joerg Roedel <[email protected]> wrote:
> On Fri, Jul 25, 2014 at 01:16:39PM -0700, Jesse Barnes wrote:
> > > To allow managing external TLBs the MMU-notifiers need to
> > > catch the moment when pages are unmapped but not yet freed.
> > > This new notifier catches that moment and notifies the
> > > interested subsystem when pages that were unmapped are about
> > > to be freed. The new notifier will only be called between
> > > invalidate_range_start()/end().
> >
> > So if we were actually sharing page tables, we should be able to make
> > start/end no-ops and just use this new callback, assuming we didn't
> > need to do any other serialization or debug stuff, right?
>
> Well, not completely. What you need with this patch-set is an
> invalidate_range and an invalidate_range_end call-back. There are call sites
> of the start/end functions where the TLB flush happens after the _end
> notifier (or at least can wait until _end is called). I did not add
> invalidate_range calls to these places (yet). But you can easily discard
> invalidate_range_start, any flush done in there is useless with shared
> page-tables.
>
> I thought about removing the need for invalidate_range_end too when
> writing the patches, and possible solutions are
>
> 1) Add mmu_notifier_invalidate_range() to all places where
> start/end is called too. This might add some unnecessary
> overhead.
>
> 2) Call the invalidate_range() call-back from the
> mmu_notifier_invalidate_range_end too.
>
> 3) Just let the user register the same function for
> invalidate_range and invalidate_range_end
>
> I thought that option 1) adds overhead that is not needed (but it might
> not be too bad, the overhead is an additional iteration over the
> mmu_notifier list when there are no call-backs registered).
>
> Option 2) might also add overhead if a user registers different functions
> for invalidate_range() and invalidate_range_end(). In the end I came to
> the conclusion that option 3) is the best one from an overhead POV.
>
> But probably targeting better usability with one of the other options is
> a better choice? I am open for thoughts and suggestions on that.
Making the _end callback just do another TLB flush is fine too, but it
would be nice to have the consistency of (1). I can live with either
though, as long as the callbacks are well documented.
Thanks,
--
Jesse Barnes, Intel Open Source Technology Center
On Fri, Jul 25, 2014 at 11:38:06PM +0200, Joerg Roedel wrote:
> On Fri, Jul 25, 2014 at 01:16:39PM -0700, Jesse Barnes wrote:
> > > To allow managing external TLBs the MMU-notifiers need to
> > > catch the moment when pages are unmapped but not yet freed.
> > > This new notifier catches that moment and notifies the
> > > interested subsystem when pages that were unmapped are about
> > > to be freed. The new notifier will only be called between
> > > invalidate_range_start()/end().
> >
> > So if we were actually sharing page tables, we should be able to make
> > start/end no-ops and just use this new callback, assuming we didn't
> > need to do any other serialization or debug stuff, right?
>
> Well, not completely. What you need with this patch-set is an
> invalidate_range and an invalidate_range_end call-back. There are call sites
> of the start/end functions where the TLB flush happens after the _end
> notifier (or at least can wait until _end is called). I did not add
> invalidate_range calls to these places (yet). But you can easily discard
> invalidate_range_start, any flush done in there is useless with shared
> page-tables.
>
> I thought about removing the need for invalidate_range_end too when
> writing the patches, and possible solutions are
>
> 1) Add mmu_notifier_invalidate_range() to all places where
> start/end is called too. This might add some unnecessary
> overhead.
>
> 2) Call the invalidate_range() call-back from the
> mmu_notifier_invalidate_range_end too.
>
> 3) Just let the user register the same function for
> invalidate_range and invalidate_range_end
>
> I thought that option 1) adds overhead that is not needed (but it might
> not be too bad, the overhead is an additional iteration over the
> mmu_notifier list when there are no call-backs registered).
>
> Option 2) might also add overhead if a user registers different functions
> for invalidate_range() and invalidate_range_end(). In the end I came to
> the conclusion that option 3) is the best one from an overhead POV.
>
> But probably targeting better usability with one of the other options is
> a better choice? I am open for thoughts and suggestions on that.
>
I should add that for HMM it is crucial to exactly match calls to start and
end, i.e. HMM needs to know when it can start again to do CPU page table
lookups and expect valid content.
Cheers,
Jérôme
On Fri, Jul 25, 2014 at 02:42:13PM -0700, Jesse Barnes wrote:
> On Fri, 25 Jul 2014 23:38:06 +0200
> Joerg Roedel <[email protected]> wrote:
> > I thought about removing the need for invalidate_range_end too when
> > writing the patches, and possible solutions are
> >
> > 1) Add mmu_notifier_invalidate_range() to all places where
> > start/end is called too. This might add some unnecessary
> > overhead.
> >
> > 2) Call the invalidate_range() call-back from the
> > mmu_notifier_invalidate_range_end too.
> >
> > 3) Just let the user register the same function for
> > invalidate_range and invalidate_range_end
> >
> > I thought that option 1) adds overhead that is not needed (but it might
> > not be too bad, the overhead is an additional iteration over the
> > mmu_notifier list when there are no call-backs registered).
> >
> > Option 2) might also add overhead if a user registers different functions
> > for invalidate_range() and invalidate_range_end(). In the end I came to
> > the conclusion that option 3) is the best one from an overhead POV.
> >
> > But probably targeting better usability with one of the other options is
> > a better choice? I am open for thoughts and suggestions on that.
>
> Making the _end callback just do another TLB flush is fine too, but it
> would be nice to have the consistency of (1). I can live with either
> though, as long as the callbacks are well documented.
You are right, having this consistency would be good. The more I think
about it, the more it makes sense to go with option 2). Option 1) would
mean that invalidate_range is explicitly called right before
invalidate_range_end at some places. Doing this implicitly like in
option 2) is cleaner and less error-prone. And the list of mmu_notifiers
only needs to be traversed once in invalidate_range_end(), so the additional
overhead is minimal. I'll update patch 3 for this, unless there are
other opinions.
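Roughly, __mmu_notifier_invalidate_range_end() would then look like
this (just a sketch of what I have in mind, not the final patch):

	void __mmu_notifier_invalidate_range_end(struct mm_struct *mm,
						 unsigned long start, unsigned long end)
	{
		struct mmu_notifier *mn;
		int id;

		id = srcu_read_lock(&srcu);
		hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist) {
			/*
			 * Call invalidate_range() here so that users do not
			 * need to register an invalidate_range_end() call-back
			 * doing the same flush again.
			 */
			if (mn->ops->invalidate_range)
				mn->ops->invalidate_range(mn, mm, start, end);
			if (mn->ops->invalidate_range_end)
				mn->ops->invalidate_range_end(mn, mm, start, end);
		}
		srcu_read_unlock(&srcu, id);
	}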
Joerg