2019-04-16 13:50:50

by Laurent Dufour

[permalink] [raw]
Subject: [PATCH v12 00/31] Speculative page faults

This is a port to kernel 5.1 of the work done by Peter Zijlstra to handle
page faults without holding the mm semaphore [1].

The idea is to try to handle user space page faults without holding the
mmap_sem. This should allow better concurrency for massively threaded
processes since the page fault handler will not wait for other threads'
memory layout changes to complete, assuming that those changes are done in
another part of the process's memory space. This type of page fault is named
speculative page fault. If the speculative page fault fails because a
concurrent modification has been detected or because the underlying PMD or
PTE tables are not yet allocated, the speculative handling is aborted and a
regular page fault is tried instead.

The speculative page fault (SPF) handler has to look up the VMA matching the
fault address without holding the mmap_sem. This is done by protecting the
MM RB tree with RCU and by using a reference counter on each VMA. When
fetching a VMA under RCU protection, the VMA's reference counter is
incremented to ensure that the VMA will not be freed behind our back during
the SPF processing. Once that processing is done, the VMA's reference counter
is decremented. To ensure that a VMA is still present when walking the RB
tree locklessly, the VMA's reference counter is incremented when that VMA is
linked into the RB tree. When the VMA is unlinked from the RB tree, its
reference counter is decremented at the end of the RCU grace period, ensuring
it remains available during that time. This means that freeing the VMA could
be delayed, which could in turn delay the file closing for file mappings.
Since the SPF handler is not able to manage file mappings, the file is closed
synchronously and not during the RCU cleanup. This is safe since the page
fault handler aborts if a file pointer is associated with the VMA.
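
In rough terms, the reference counting boils down to something like the
following (a simplified sketch, not the exact code; the real helpers live in
mm/internal.h later in the series):

	static inline void get_vma(struct vm_area_struct *vma)
	{
		/* taken when linking the VMA in the RB tree or fetching it under RCU */
		atomic_inc(&vma->vm_ref_count);
	}

	static inline void put_vma(struct vm_area_struct *vma)
	{
		/* the last reference frees the VMA */
		if (atomic_dec_and_test(&vma->vm_ref_count))
			__free_vma(vma);
	}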

Using RCU fixes the overhead seen by Haiyan Song using the will-it-scale
benchmark [2].

The VMA's attributes checked during the speculative page fault processing
have to be protected against parallel changes. This is done by using a
per-VMA sequence lock. This sequence lock allows the speculative page fault
handler to quickly check for parallel changes in progress and to abort the
speculative page fault in that case.

Once the VMA has been found, the speculative page fault handler checks the
VMA's attributes to verify whether the page fault can be handled this
way. Thus, the VMA is protected through a sequence lock which allows fast
detection of concurrent VMA changes. If such a change is detected, the
speculative page fault is aborted and a *classic* page fault is tried
instead. VMA sequence locking is added wherever VMA attributes that are
checked during the page fault are modified, as sketched below.
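
In simplified form, the write and read sides look like this (a sketch only;
the real helpers are the vm_write_begin()/vm_write_end() wrappers introduced
later in the series):

	/* writer side: any path modifying a VMA attribute checked by SPF */
	vm_write_begin(vma);		/* write_seqcount_begin(&vma->vm_sequence) */
	vma->vm_flags = new_flags;
	vm_write_end(vma);		/* write_seqcount_end(&vma->vm_sequence) */

	/* speculative reader side */
	seq = raw_read_seqcount(&vma->vm_sequence);
	if (seq & 1)
		goto fallback;		/* a write is in progress */
	/* ... read the VMA attributes and walk the page tables ... */
	if (read_seqcount_retry(&vma->vm_sequence, seq))
		goto fallback;		/* the VMA changed behind our back */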

When the PTE is fetched, the VMA is checked again to see if it has been
changed, so once the page table is locked the VMA is known to be valid. Any
other change touching this PTE will need to take the page table lock, so no
parallel change is possible at this time.

The locking of the PTE is done with interrupts disabled; this allows checking
the PMD to ensure that there is no collapsing operation in progress. Since
khugepaged first sets the PMD to pmd_none and then waits for the other CPUs
to have caught the IPI, if the PMD is valid at the time the PTE is locked, we
have the guarantee that the collapsing operation will have to wait on the PTE
lock to move forward. This allows the SPF handler to map the PTE safely. If
the PMD value is different from the one recorded at the beginning of the SPF
operation, the classic page fault handler is called to handle the fault while
holding the mmap_sem. As the PTE lock is taken with interrupts disabled, it
is acquired using spin_trylock() to avoid deadlocking when handling a page
fault while a TLB invalidate is requested by another CPU holding the PTE
lock.
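
The resulting locking pattern looks roughly like this (a simplified sketch of
the pte_spinlock()/pte_map_lock() helpers added later in the series, not the
exact code):

	local_irq_disable();
	if (vma_has_changed(vmf))
		goto abort;			/* VMA changed, use the classic path */
	if (!pmd_same(READ_ONCE(*vmf->pmd), vmf->orig_pmd))
		goto abort;			/* a collapse may be in progress */
	ptl = pte_lockptr(vmf->vma->vm_mm, vmf->pmd);
	if (unlikely(!spin_trylock(ptl))) {
		local_irq_enable();		/* let a pending TLB flush IPI run */
		goto retry;
	}
	/* PTE is locked: the VMA and PMD are known to be stable here */
	local_irq_enable();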

In pseudo code, this could be seen as:
speculative_page_fault()
{
	vma = find_vma_rcu()
	check vma sequence count
	check vma's support
	disable interrupt
		check pgd,p4d,...,pte
		save pmd and pte in vmf
		save vma sequence counter in vmf
	enable interrupt
	check vma sequence count
	handle_pte_fault(vma)
		..
		page = alloc_page()
		pte_map_lock()
			disable interrupt
				abort if sequence counter has changed
				abort if pmd or pte has changed
			pte map and lock
			enable interrupt
		if abort
			free page
			abort
		...
	put_vma(vma)
}

arch_fault_handler()
{
	if (speculative_page_fault(&vma))
		goto done
again:
	lock(mmap_sem)
	vma = find_vma();
	handle_pte_fault(vma);
	if retry
		unlock(mmap_sem)
		goto again;
done:
	handle fault error
}

Support for THP is not done because when checking the PMD, we could be
confused by an in-progress collapsing operation done by khugepaged. The
issue is that pmd_none() could be true either if the PMD is not yet
populated or if the underlying PTEs are in the process of being collapsed.
So we cannot safely allocate a PMD if pmd_none() is true.
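
To illustrate the ambiguity (illustrative sketch only):

	pmd_t pmdval = READ_ONCE(*pmd);
	if (pmd_none(pmdval)) {
		/*
		 * Either the PMD was never populated, or khugepaged cleared
		 * it while collapsing the underlying PTE page into a huge
		 * page. Without the mmap_sem we cannot tell the two cases
		 * apart, so allocating a page table here would not be safe.
		 */
	}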

This series adds a new software performance event named 'speculative-faults'
or 'spf'. It counts the number of successful page fault events handled
speculatively. When recording 'faults,spf' events, the 'faults' event counts
the total number of page fault events while 'spf' only counts the part of
the faults processed speculatively.

There are some trace events introduced by this series. They allow
identifying why the page faults were not processed speculatively. This
doesn't take into account the faults generated by a single-threaded process,
which are directly processed while holding the mmap_sem. These trace events
are grouped in a system named 'pagefault'; they are:

- pagefault:spf_vma_changed : the VMA has been changed behind our back
- pagefault:spf_vma_noanon : the vma->anon_vma field was not yet set
- pagefault:spf_vma_notsup : the VMA's type is not supported
- pagefault:spf_vma_access : the VMA's access rights are not respected
- pagefault:spf_pmd_changed : the upper PMD pointer has changed behind our
back

To record all the related events, the easiest way is to run perf with the
following arguments:
$ perf stat -e 'faults,spf,pagefault:*' <command>

There is also a dedicated vmstat counter showing the number of successful
page faults handled speculatively. It can be seen this way:
$ grep speculative_pgfault /proc/vmstat

It is possible to deactivate the speculative page fault handler by writing 0
to /proc/sys/vm/speculative_page_fault.
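
For instance, to turn it off and back on (it is on by default):
$ echo 0 > /proc/sys/vm/speculative_page_fault
$ echo 1 > /proc/sys/vm/speculative_page_fault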

This series builds on top of v5.1-rc4-mmotm-2019-04-09-17-51 and is
functional on x86 and PowerPC. I cross-built it for arm64 but I was not able
to test it.

This series is also available on github [4].

---------------------
Real Workload results

Tests using a "popular in memory multithreaded database product" on a
128-core SMT8 Power system are in progress and I will come back with
performance measurements as soon as possible. With the previous series we
saw up to 30% improvement in the number of transactions processed per
second, and we hope this will be the case with this series too.

------------------
Benchmarks results

Base kernel is v5.1-rc4-mmotm-2019-04-09-17-51
SPF is BASE + this series

Kernbench:
----------
Here are the results on a 48 CPU x86 system using kernbench to build a 5.0
kernel (the kernel is built 5 times):

Average Half load -j 24 Run (std deviation):
                     BASE                   SPF
Elapsed Time         56.52   (1.39185)      56.256  (1.15106)       0.47%
User Time            980.018 (2.94734)      984.958 (1.98518)      -0.50%
System Time          130.744 (1.19148)      133.616 (0.873573)     -2.20%
Percent CPU          1965.6  (49.682)       1988.4  (40.035)       -1.16%
Context Switches     29926.6 (272.789)      30472.4 (109.569)      -1.82%
Sleeps               124793  (415.87)       125003  (591.008)      -0.17%

Average Optimal load -j 48 Run (std deviation):
                     BASE                   SPF
Elapsed Time         46.354  (0.917949)     45.968  (1.42786)       0.83%
User Time            1193.42 (224.96)       1196.78 (223.28)       -0.28%
System Time          143.306 (13.2726)      146.177 (13.2659)      -2.00%
Percent CPU          2668.6  (743.157)      2699.9  (753.767)      -1.17%
Context Switches     62268.3 (34097.1)      62721.7 (33999.1)      -0.73%
Sleeps               132556  (8222.99)      132607  (8077.6)       -0.04%

During a run on the SPF kernel, perf events were captured:
Performance counter stats for '../kernbench -M':
525,873,132 faults
242 spf
0 pagefault:spf_vma_changed
0 pagefault:spf_vma_noanon
441 pagefault:spf_vma_notsup
0 pagefault:spf_vma_access
0 pagefault:spf_pmd_changed

Very few speculative page faults were recorded as most of the processes
involved are single-threaded (it seems that on this architecture some threads
were created during the kernel build process).

Here are the kernbench results on a 1024 CPU Power8 VM:

                     5.1.0-rc4-mm1+          5.1.0-rc4-mm1-spf-rcu+

Average Half load -j 512 Run (std deviation):
Elapsed Time         52.52   (0.906697)     52.778  (0.510069)     -0.49%
User Time            3855.43 (76.378)       3890.44 (73.0466)      -0.91%
System Time          1977.24 (182.316)      1974.56 (166.097)       0.14%
Percent CPU          11111.6 (540.461)      11115.2 (458.907)      -0.03%
Context Switches     83245.6 (3061.44)      83651.8 (1202.31)      -0.49%
Sleeps               613459  (23091.8)      628378  (27485.2)      -2.43%

Average Optimal load -j 1024 Run (std deviation):
Elapsed Time         52.964  (0.572346)     53.132  (0.825694)     -0.32%
User Time            4058.22 (222.034)      4070.2  (201.646)      -0.30%
System Time          2672.81 (759.207)      2712.13 (797.292)      -1.47%
Percent CPU          12756.7 (1786.35)      12806.5 (1858.89)      -0.39%
Context Switches     88818.5 (6772)         87890.6 (5567.72)       1.04%
Sleeps               618658  (20842.2)      636297  (25044)        -2.85%

During a run on the SPF kernel, perf events were captured:
Performance counter stats for '../kernbench -M':
149 375 832 faults
1 spf
0 pagefault:spf_vma_changed
0 pagefault:spf_vma_noanon
561 pagefault:spf_vma_notsup
0 pagefault:spf_vma_access
0 pagefault:spf_pmd_changed

Most of the processes involved are single-threaded, so SPF is not activated,
but there is no impact on the performance.

Ebizzy:
-------
The test counts the number of records per second it can manage; the higher
the better. I ran it like this: 'ebizzy -mTt <nrcpus>'. To get consistent
results I repeated the test 100 times and measured the average.

                   BASE          SPF           delta
24 CPUs x86        5492.69       9383.07       70.83%
1024 CPUs P8 VM    8476.74       17144.38      102%

Here are the performance counters read during a run on a 48 CPU x86 node:
Performance counter stats for './ebizzy -mTt 48':
11,846,569 faults
10,886,706 spf
957,702 pagefault:spf_vma_changed
0 pagefault:spf_vma_noanon
815 pagefault:spf_vma_notsup
0 pagefault:spf_vma_access
0 pagefault:spf_pmd_changed

And here are the ones captured during a run on a 1024 CPU Power VM:
Performance counter stats for './ebizzy -mTt 1024':
1 359 789 faults
1 284 910 spf
72 085 pagefault:spf_vma_changed
0 pagefault:spf_vma_noanon
2 669 pagefault:spf_vma_notsup
0 pagefault:spf_vma_access
0 pagefault:spf_pmd_changed

In ebizzy's case most of the page faults were handled in a speculative way,
leading to the ebizzy performance boost.

------------------
Changes since v11 [3]
- Check vm_ops.fault instead of vm_ops since now all the VMAs have a vm_ops.
- Abort the speculative page fault when doing swap readahead because the
VMA's boundaries are not protected at this time. This way the first swap-in
performs the readahead, and the next fault should be handled in a
speculative way as the page is present among the pages read from swap.
- Handle a race between copy_pte_range() and wp_page_copy() called by
the speculative page fault handler.
- Ported to Kernel v5.0
- Moved the VM_FAULT_PTNOTSAME define to mm_types.h
- Use RCU to protect the MM RB tree instead of a rwlock.
- Add a toggle interface: /proc/sys/vm/speculative_page_fault

[1] https://lore.kernel.org/linux-mm/[email protected]/
[2] https://lore.kernel.org/linux-mm/9FE19350E8A7EE45B64D8D63D368C8966B847F54@SHSMSX101.ccr.corp.intel.com/
[3] https://lore.kernel.org/linux-mm/[email protected]/
[4] https://github.com/ldu4/linux/tree/spf-v12

Laurent Dufour (25):
mm: introduce CONFIG_SPECULATIVE_PAGE_FAULT
x86/mm: define ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT
powerpc/mm: set ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT
mm: introduce pte_spinlock for FAULT_FLAG_SPECULATIVE
mm: make pte_unmap_same compatible with SPF
mm: introduce INIT_VMA()
mm: protect VMA modifications using VMA sequence count
mm: protect mremap() against SPF handler
mm: protect SPF handler against anon_vma changes
mm: cache some VMA fields in the vm_fault structure
mm/migrate: Pass vm_fault pointer to migrate_misplaced_page()
mm: introduce __lru_cache_add_active_or_unevictable
mm: introduce __vm_normal_page()
mm: introduce __page_add_new_anon_rmap()
mm: protect against PTE changes done by dup_mmap()
mm: protect the RB tree with a sequence lock
mm: introduce vma reference counter
mm: Introduce find_vma_rcu()
mm: don't do swap readahead during speculative page fault
mm: adding speculative page fault failure trace events
perf: add a speculative page fault sw event
perf tools: add support for the SPF perf event
mm: add speculative page fault vmstats
powerpc/mm: add speculative page fault
mm: Add a speculative page fault switch in sysctl

Mahendran Ganesh (2):
arm64/mm: define ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT
arm64/mm: add speculative page fault

Peter Zijlstra (4):
mm: prepare for FAULT_FLAG_SPECULATIVE
mm: VMA sequence count
mm: provide speculative fault infrastructure
x86/mm: add speculative pagefault handling

arch/arm64/Kconfig | 1 +
arch/arm64/mm/fault.c | 12 +
arch/powerpc/Kconfig | 1 +
arch/powerpc/mm/fault.c | 16 +
arch/x86/Kconfig | 1 +
arch/x86/mm/fault.c | 14 +
fs/exec.c | 1 +
fs/proc/task_mmu.c | 5 +-
fs/userfaultfd.c | 17 +-
include/linux/hugetlb_inline.h | 2 +-
include/linux/migrate.h | 4 +-
include/linux/mm.h | 138 +++++-
include/linux/mm_types.h | 16 +-
include/linux/pagemap.h | 4 +-
include/linux/rmap.h | 12 +-
include/linux/swap.h | 10 +-
include/linux/vm_event_item.h | 3 +
include/trace/events/pagefault.h | 80 ++++
include/uapi/linux/perf_event.h | 1 +
kernel/fork.c | 35 +-
kernel/sysctl.c | 9 +
mm/Kconfig | 22 +
mm/huge_memory.c | 6 +-
mm/hugetlb.c | 2 +
mm/init-mm.c | 3 +
mm/internal.h | 45 ++
mm/khugepaged.c | 5 +
mm/madvise.c | 6 +-
mm/memory.c | 631 ++++++++++++++++++++++----
mm/mempolicy.c | 51 ++-
mm/migrate.c | 6 +-
mm/mlock.c | 13 +-
mm/mmap.c | 249 ++++++++--
mm/mprotect.c | 4 +-
mm/mremap.c | 13 +
mm/nommu.c | 1 +
mm/rmap.c | 5 +-
mm/swap.c | 6 +-
mm/swap_state.c | 10 +-
mm/vmstat.c | 5 +-
tools/include/uapi/linux/perf_event.h | 1 +
tools/perf/util/evsel.c | 1 +
tools/perf/util/parse-events.c | 4 +
tools/perf/util/parse-events.l | 1 +
tools/perf/util/python.c | 1 +
45 files changed, 1277 insertions(+), 196 deletions(-)
create mode 100644 include/trace/events/pagefault.h

--
2.21.0


2019-04-16 13:47:41

by Laurent Dufour

[permalink] [raw]
Subject: [PATCH v12 04/31] arm64/mm: define ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT

From: Mahendran Ganesh <[email protected]>

Set ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT for arm64. This
enables the speculative page fault handler.

Signed-off-by: Ganesh Mahendran <[email protected]>
---
arch/arm64/Kconfig | 1 +
1 file changed, 1 insertion(+)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 870ef86a64ed..8e86934d598b 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -174,6 +174,7 @@ config ARM64
select SWIOTLB
select SYSCTL_EXCEPTION_TRACE
select THREAD_INFO_IN_TASK
+ select ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT
help
ARM 64-bit (AArch64) Linux support.

--
2.21.0

2019-04-16 13:47:49

by Laurent Dufour

[permalink] [raw]
Subject: [PATCH v12 17/31] mm: introduce __page_add_new_anon_rmap()

When dealing with the speculative page fault handler, we may race with a VMA
being split or merged. In this case the vma->vm_start and vma->vm_end
fields may not match the address at which the page fault is occurring.

This can only happen when the VMA is split, but in that case the
anon_vma pointer of the new VMA will be the same as the original one,
because in __split_vma() the new->anon_vma is set to src->anon_vma when
*new = *vma.

So even if the VMA boundaries are not correct, the anon_vma pointer is
still valid.

If the VMA has been merged, then the VMA into which it has been merged
must have the same anon_vma pointer, otherwise the merge could not have
been done.

So in all cases we know that the anon_vma is valid: before starting the
speculative page fault we have checked that the anon_vma pointer is valid
for this VMA. Since there is an anon_vma, at some point a page has been
backed by it, and before the VMA is torn down the page table lock would
have to be grabbed to clear the PTE; the anon_vma field is checked once the
PTE is locked.

This patch introduces a new __page_add_new_anon_rmap() service which
doesn't check the VMA boundaries, and creates a new inline wrapper which
does the check.

When called from a page fault handler which is not a speculative one, there
is a guarantee that vm_start and vm_end match the faulting address, so this
check is useless. In the context of the speculative page fault handler,
this check may be wrong, but the anon_vma is still valid as explained
above.

Signed-off-by: Laurent Dufour <[email protected]>
---
include/linux/rmap.h | 12 ++++++++++--
mm/memory.c | 8 ++++----
mm/rmap.c | 5 ++---
3 files changed, 16 insertions(+), 9 deletions(-)

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 988d176472df..a5d282573093 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -174,8 +174,16 @@ void page_add_anon_rmap(struct page *, struct vm_area_struct *,
unsigned long, bool);
void do_page_add_anon_rmap(struct page *, struct vm_area_struct *,
unsigned long, int);
-void page_add_new_anon_rmap(struct page *, struct vm_area_struct *,
- unsigned long, bool);
+void __page_add_new_anon_rmap(struct page *, struct vm_area_struct *,
+ unsigned long, bool);
+static inline void page_add_new_anon_rmap(struct page *page,
+ struct vm_area_struct *vma,
+ unsigned long address, bool compound)
+{
+ VM_BUG_ON_VMA(address < vma->vm_start || address >= vma->vm_end, vma);
+ __page_add_new_anon_rmap(page, vma, address, compound);
+}
+
void page_add_file_rmap(struct page *, bool);
void page_remove_rmap(struct page *, bool);

diff --git a/mm/memory.c b/mm/memory.c
index be93f2c8ebe0..46f877b6abea 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2347,7 +2347,7 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
* thread doing COW.
*/
ptep_clear_flush_notify(vma, vmf->address, vmf->pte);
- page_add_new_anon_rmap(new_page, vma, vmf->address, false);
+ __page_add_new_anon_rmap(new_page, vma, vmf->address, false);
mem_cgroup_commit_charge(new_page, memcg, false, false);
__lru_cache_add_active_or_unevictable(new_page, vmf->vma_flags);
/*
@@ -2897,7 +2897,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)

/* ksm created a completely new copy */
if (unlikely(page != swapcache && swapcache)) {
- page_add_new_anon_rmap(page, vma, vmf->address, false);
+ __page_add_new_anon_rmap(page, vma, vmf->address, false);
mem_cgroup_commit_charge(page, memcg, false, false);
__lru_cache_add_active_or_unevictable(page, vmf->vma_flags);
} else {
@@ -3049,7 +3049,7 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
}

inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES);
- page_add_new_anon_rmap(page, vma, vmf->address, false);
+ __page_add_new_anon_rmap(page, vma, vmf->address, false);
mem_cgroup_commit_charge(page, memcg, false, false);
__lru_cache_add_active_or_unevictable(page, vmf->vma_flags);
setpte:
@@ -3328,7 +3328,7 @@ vm_fault_t alloc_set_pte(struct vm_fault *vmf, struct mem_cgroup *memcg,
/* copy-on-write page */
if (write && !(vmf->vma_flags & VM_SHARED)) {
inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES);
- page_add_new_anon_rmap(page, vma, vmf->address, false);
+ __page_add_new_anon_rmap(page, vma, vmf->address, false);
mem_cgroup_commit_charge(page, memcg, false, false);
__lru_cache_add_active_or_unevictable(page, vmf->vma_flags);
} else {
diff --git a/mm/rmap.c b/mm/rmap.c
index e5dfe2ae6b0d..2148e8ce6e34 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1140,7 +1140,7 @@ void do_page_add_anon_rmap(struct page *page,
}

/**
- * page_add_new_anon_rmap - add pte mapping to a new anonymous page
+ * __page_add_new_anon_rmap - add pte mapping to a new anonymous page
* @page: the page to add the mapping to
* @vma: the vm area in which the mapping is added
* @address: the user virtual address mapped
@@ -1150,12 +1150,11 @@ void do_page_add_anon_rmap(struct page *page,
* This means the inc-and-test can be bypassed.
* Page does not have to be locked.
*/
-void page_add_new_anon_rmap(struct page *page,
+void __page_add_new_anon_rmap(struct page *page,
struct vm_area_struct *vma, unsigned long address, bool compound)
{
int nr = compound ? hpage_nr_pages(page) : 1;

- VM_BUG_ON_VMA(address < vma->vm_start || address >= vma->vm_end, vma);
__SetPageSwapBacked(page);
if (compound) {
VM_BUG_ON_PAGE(!PageTransHuge(page), page);
--
2.21.0

2019-04-16 13:47:52

by Laurent Dufour

[permalink] [raw]
Subject: [PATCH v12 21/31] mm: Introduce find_vma_rcu()

This allows searching for a VMA structure without holding the mmap_sem.

The search is repeated while the mm seqlock is changing, until we find a
valid VMA.

While under RCU protection, a reference is taken on the VMA, so the caller
must call put_vma() once it no longer needs the VMA structure.

At the time a VMA is inserted in the MM RB tree, in vma_rb_insert(), a
reference is taken on the VMA by calling get_vma().

When removing a VMA from the MM RB tree, the VMA is not released immediately
but at the end of the RCU grace period through vm_rcu_put(). This ensures
that the VMA remains allocated until the end of the RCU grace period.

Since the vm_file pointer, if valid, is released in put_vma(), there is no
guarantee that the file pointer will be valid on the returned VMA.
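
The expected calling pattern is roughly the following (an illustrative
sketch only, not code from this patch):

	vma = find_vma_rcu(mm, address);
	if (!vma)
		/* no matching VMA, fall back to the classic fault path */
		return VM_FAULT_RETRY;
	/* ... speculative processing: the VMA cannot be freed behind us ... */
	put_vma(vma);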

Signed-off-by: Laurent Dufour <[email protected]>
---
include/linux/mm_types.h | 1 +
mm/internal.h | 5 ++-
mm/mmap.c | 76 ++++++++++++++++++++++++++++++++++++++--
3 files changed, 78 insertions(+), 4 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 6a6159e11a3f..9af6694cb95d 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -287,6 +287,7 @@ struct vm_area_struct {

#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
atomic_t vm_ref_count;
+ struct rcu_head vm_rcu;
#endif
struct rb_node vm_rb;

diff --git a/mm/internal.h b/mm/internal.h
index 302382bed406..1e368e4afe3c 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -55,7 +55,10 @@ static inline void put_vma(struct vm_area_struct *vma)
__free_vma(vma);
}

-#else
+extern struct vm_area_struct *find_vma_rcu(struct mm_struct *mm,
+ unsigned long addr);
+
+#else /* CONFIG_SPECULATIVE_PAGE_FAULT */

static inline void get_vma(struct vm_area_struct *vma)
{
diff --git a/mm/mmap.c b/mm/mmap.c
index c106440dcae7..34bf261dc2c8 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -179,6 +179,18 @@ static inline void mm_write_sequnlock(struct mm_struct *mm)
{
write_sequnlock(&mm->mm_seq);
}
+
+static void __vm_rcu_put(struct rcu_head *head)
+{
+ struct vm_area_struct *vma = container_of(head, struct vm_area_struct,
+ vm_rcu);
+ put_vma(vma);
+}
+static void vm_rcu_put(struct vm_area_struct *vma)
+{
+ VM_BUG_ON_VMA(!RB_EMPTY_NODE(&vma->vm_rb), vma);
+ call_rcu(&vma->vm_rcu, __vm_rcu_put);
+}
#else
static inline void mm_write_seqlock(struct mm_struct *mm)
{
@@ -190,6 +202,8 @@ static inline void mm_write_sequnlock(struct mm_struct *mm)

void __free_vma(struct vm_area_struct *vma)
{
+ if (IS_ENABLED(CONFIG_SPECULATIVE_PAGE_FAULT))
+ VM_BUG_ON_VMA(!RB_EMPTY_NODE(&vma->vm_rb), vma);
mpol_put(vma_policy(vma));
vm_area_free(vma);
}
@@ -197,11 +211,24 @@ void __free_vma(struct vm_area_struct *vma)
/*
* Close a vm structure and free it, returning the next.
*/
-static struct vm_area_struct *remove_vma(struct vm_area_struct *vma)
+static struct vm_area_struct *__remove_vma(struct vm_area_struct *vma)
{
struct vm_area_struct *next = vma->vm_next;

might_sleep();
+ if (IS_ENABLED(CONFIG_SPECULATIVE_PAGE_FAULT) &&
+ !RB_EMPTY_NODE(&vma->vm_rb)) {
+ /*
+ * If the VMA is still linked in the RB tree, we must release
+ * that reference by calling put_vma().
+ * This should only happen when called from exit_mmap().
+ * We forcibly clear the node to satisfy the check in
+ * __free_vma(). This is safe since the RB tree is not walked
+ * anymore.
+ */
+ RB_CLEAR_NODE(&vma->vm_rb);
+ put_vma(vma);
+ }
if (vma->vm_ops && vma->vm_ops->close)
vma->vm_ops->close(vma);
if (vma->vm_file)
@@ -211,6 +238,13 @@ static struct vm_area_struct *remove_vma(struct vm_area_struct *vma)
return next;
}

+static struct vm_area_struct *remove_vma(struct vm_area_struct *vma)
+{
+ if (IS_ENABLED(CONFIG_SPECULATIVE_PAGE_FAULT))
+ VM_BUG_ON_VMA(!RB_EMPTY_NODE(&vma->vm_rb), vma);
+ return __remove_vma(vma);
+}
+
static int do_brk_flags(unsigned long addr, unsigned long request, unsigned long flags,
struct list_head *uf);
SYSCALL_DEFINE1(brk, unsigned long, brk)
@@ -475,7 +509,7 @@ static inline void vma_rb_insert(struct vm_area_struct *vma,

/* All rb_subtree_gap values must be consistent prior to insertion */
validate_mm_rb(root, NULL);
-
+ get_vma(vma);
rb_insert_augmented(&vma->vm_rb, root, &vma_gap_callbacks);
}

@@ -491,6 +525,14 @@ static void __vma_rb_erase(struct vm_area_struct *vma, struct mm_struct *mm)
mm_write_seqlock(mm);
rb_erase_augmented(&vma->vm_rb, root, &vma_gap_callbacks);
mm_write_sequnlock(mm); /* wmb */
+#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
+ /*
+ * Ensure the removal is complete before clearing the node.
+ * Matched by vma_has_changed()/handle_speculative_fault().
+ */
+ RB_CLEAR_NODE(&vma->vm_rb);
+ vm_rcu_put(vma);
+#endif
}

static __always_inline void vma_rb_erase_ignore(struct vm_area_struct *vma,
@@ -2331,6 +2373,34 @@ struct vm_area_struct *find_vma(struct mm_struct *mm, unsigned long addr)

EXPORT_SYMBOL(find_vma);

+#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
+/*
+ * Like find_vma() but under the protection of RCU and the mm sequence counter.
+ * The vma returned has to be released by the caller through the call to
+ * put_vma()
+ */
+struct vm_area_struct *find_vma_rcu(struct mm_struct *mm, unsigned long addr)
+{
+ struct vm_area_struct *vma = NULL;
+ unsigned int seq;
+
+ do {
+ if (vma)
+ put_vma(vma);
+
+ seq = read_seqbegin(&mm->mm_seq);
+
+ rcu_read_lock();
+ vma = find_vma(mm, addr);
+ if (vma)
+ get_vma(vma);
+ rcu_read_unlock();
+ } while (read_seqretry(&mm->mm_seq, seq));
+
+ return vma;
+}
+#endif
+
/*
* Same as find_vma, but also return a pointer to the previous VMA in *pprev.
*/
@@ -3231,7 +3301,7 @@ void exit_mmap(struct mm_struct *mm)
while (vma) {
if (vma->vm_flags & VM_ACCOUNT)
nr_accounted += vma_pages(vma);
- vma = remove_vma(vma);
+ vma = __remove_vma(vma);
}
vm_unacct_memory(nr_accounted);
}
--
2.21.0

2019-04-16 13:47:57

by Laurent Dufour

[permalink] [raw]
Subject: [PATCH v12 01/31] mm: introduce CONFIG_SPECULATIVE_PAGE_FAULT

This configuration variable will be used to build the code needed to
handle speculative page faults.

By default it is turned off, and activated depending on architecture
support, ARCH_HAS_PTE_SPECIAL, SMP and MMU.

The architecture support is needed since the speculative page fault handler
is called from the architecture's page fault handling code, and some code
has to be added there to call the speculative handler.

The dependency on ARCH_HAS_PTE_SPECIAL is required because vm_normal_page()
does processing that is not compatible with the speculative handling in the
case ARCH_HAS_PTE_SPECIAL is not set.

Suggested-by: Thomas Gleixner <[email protected]>
Suggested-by: David Rientjes <[email protected]>
Signed-off-by: Laurent Dufour <[email protected]>
---
mm/Kconfig | 22 ++++++++++++++++++++++
1 file changed, 22 insertions(+)

diff --git a/mm/Kconfig b/mm/Kconfig
index 0eada3f818fa..ff278ac9978a 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -761,4 +761,26 @@ config GUP_BENCHMARK
config ARCH_HAS_PTE_SPECIAL
bool

+config ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT
+ def_bool n
+
+config SPECULATIVE_PAGE_FAULT
+ bool "Speculative page faults"
+ default y
+ depends on ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT
+ depends on ARCH_HAS_PTE_SPECIAL && MMU && SMP
+ help
+ Try to handle user space page faults without holding the mmap_sem.
+
+ This should allow better concurrency for massively threaded processes
+ since the page fault handler will not wait for other thread's memory
+ layout change to be done, assuming that this change is done in
+ another part of the process's memory space. This type of page fault
+ is named speculative page fault.
+
+ If the speculative page fault fails because a concurrent modification
+ is detected or because underlying PMD or PTE tables are not yet
+ allocated, the speculative page fault fails and a classic page fault
+ is then tried.
+
endmenu
--
2.21.0

2019-04-16 13:48:03

by Laurent Dufour

[permalink] [raw]
Subject: [PATCH v12 09/31] mm: VMA sequence count

From: Peter Zijlstra <[email protected]>

Wrap the VMA modifications (vma_adjust/unmap_page_range) with sequence
counts such that we can easily test if a VMA is changed.

The calls to vm_write_begin/end() in unmap_page_range() are
used to detect when a VMA is being unmapped and thus that new page faults
should not be satisfied for this VMA. If the seqcount hasn't changed when
the page tables are locked, this means we are safe to satisfy the page
fault.

The flip side is that we cannot distinguish between a vma_adjust() and
the unmap_page_range() -- where with the former we could have
re-checked the vma bounds against the address.

The VMA's sequence counter is also used to detect changes to various VMA
fields used during the page fault handling, such as:
- vm_start, vm_end
- vm_pgoff
- vm_flags, vm_page_prot
- anon_vma
- vm_policy

Signed-off-by: Peter Zijlstra (Intel) <[email protected]>

[Port to 4.12 kernel]
[Build depends on CONFIG_SPECULATIVE_PAGE_FAULT]
[Introduce vm_write_* inline function depending on
CONFIG_SPECULATIVE_PAGE_FAULT]
[Fix lock dependency between mapping->i_mmap_rwsem and vma->vm_sequence by
using vm_raw_write* functions]
[Fix a lock dependency warning in mmap_region() when entering the error
path]
[move sequence initialisation INIT_VMA()]
[Review the patch description about unmap_page_range()]
Signed-off-by: Laurent Dufour <[email protected]>
---
include/linux/mm.h | 44 ++++++++++++++++++++++++++++++++++++++++
include/linux/mm_types.h | 3 +++
mm/memory.c | 2 ++
mm/mmap.c | 30 +++++++++++++++++++++++++++
4 files changed, 79 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 2ceb1d2869a6..906b9e06f18e 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1410,6 +1410,9 @@ struct zap_details {
static inline void INIT_VMA(struct vm_area_struct *vma)
{
INIT_LIST_HEAD(&vma->anon_vma_chain);
+#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
+ seqcount_init(&vma->vm_sequence);
+#endif
}

struct page *_vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
@@ -1534,6 +1537,47 @@ static inline void unmap_shared_mapping_range(struct address_space *mapping,
unmap_mapping_range(mapping, holebegin, holelen, 0);
}

+#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
+static inline void vm_write_begin(struct vm_area_struct *vma)
+{
+ write_seqcount_begin(&vma->vm_sequence);
+}
+static inline void vm_write_begin_nested(struct vm_area_struct *vma,
+ int subclass)
+{
+ write_seqcount_begin_nested(&vma->vm_sequence, subclass);
+}
+static inline void vm_write_end(struct vm_area_struct *vma)
+{
+ write_seqcount_end(&vma->vm_sequence);
+}
+static inline void vm_raw_write_begin(struct vm_area_struct *vma)
+{
+ raw_write_seqcount_begin(&vma->vm_sequence);
+}
+static inline void vm_raw_write_end(struct vm_area_struct *vma)
+{
+ raw_write_seqcount_end(&vma->vm_sequence);
+}
+#else
+static inline void vm_write_begin(struct vm_area_struct *vma)
+{
+}
+static inline void vm_write_begin_nested(struct vm_area_struct *vma,
+ int subclass)
+{
+}
+static inline void vm_write_end(struct vm_area_struct *vma)
+{
+}
+static inline void vm_raw_write_begin(struct vm_area_struct *vma)
+{
+}
+static inline void vm_raw_write_end(struct vm_area_struct *vma)
+{
+}
+#endif /* CONFIG_SPECULATIVE_PAGE_FAULT */
+
extern int access_process_vm(struct task_struct *tsk, unsigned long addr,
void *buf, int len, unsigned int gup_flags);
extern int access_remote_vm(struct mm_struct *mm, unsigned long addr,
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index fd7d38ee2e33..e78f72eb2576 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -337,6 +337,9 @@ struct vm_area_struct {
struct mempolicy *vm_policy; /* NUMA policy for the VMA */
#endif
struct vm_userfaultfd_ctx vm_userfaultfd_ctx;
+#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
+ seqcount_t vm_sequence;
+#endif
} __randomize_layout;

struct core_thread {
diff --git a/mm/memory.c b/mm/memory.c
index d5bebca47d98..423fa8ea0569 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1256,6 +1256,7 @@ void unmap_page_range(struct mmu_gather *tlb,
unsigned long next;

BUG_ON(addr >= end);
+ vm_write_begin(vma);
tlb_start_vma(tlb, vma);
pgd = pgd_offset(vma->vm_mm, addr);
do {
@@ -1265,6 +1266,7 @@ void unmap_page_range(struct mmu_gather *tlb,
next = zap_p4d_range(tlb, vma, pgd, addr, next, details);
} while (pgd++, addr = next, addr != end);
tlb_end_vma(tlb, vma);
+ vm_write_end(vma);
}


diff --git a/mm/mmap.c b/mm/mmap.c
index 5ad3a3228d76..a4e4d52a5148 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -726,6 +726,30 @@ int __vma_adjust(struct vm_area_struct *vma, unsigned long start,
long adjust_next = 0;
int remove_next = 0;

+ /*
+ * Why using vm_raw_write*() functions here to avoid lockdep's warning ?
+ *
+ * Lockdep is complaining about a theoretical lock dependency, involving
+ * 3 locks:
+ * mapping->i_mmap_rwsem --> vma->vm_sequence --> fs_reclaim
+ *
+ * Here are the major path leading to this dependency :
+ * 1. __vma_adjust() mmap_sem -> vm_sequence -> i_mmap_rwsem
+ * 2. move_vmap() mmap_sem -> vm_sequence -> fs_reclaim
+ * 3. __alloc_pages_nodemask() fs_reclaim -> i_mmap_rwsem
+ * 4. unmap_mapping_range() i_mmap_rwsem -> vm_sequence
+ *
+ * So there is no way to solve this easily, especially because in
+ * unmap_mapping_range() the i_mmap_rwsem is grabbed while the impacted
+ * VMAs are not yet known.
+ * However, the way the vm_seq is used is guaranteeing that we will
+ * never block on it since we just check for its value and never wait
+ * for it to move, see vma_has_changed() and handle_speculative_fault().
+ */
+ vm_raw_write_begin(vma);
+ if (next)
+ vm_raw_write_begin(next);
+
if (next && !insert) {
struct vm_area_struct *exporter = NULL, *importer = NULL;

@@ -950,6 +974,8 @@ int __vma_adjust(struct vm_area_struct *vma, unsigned long start,
* "vma->vm_next" gap must be updated.
*/
next = vma->vm_next;
+ if (next)
+ vm_raw_write_begin(next);
} else {
/*
* For the scope of the comment "next" and
@@ -996,6 +1022,10 @@ int __vma_adjust(struct vm_area_struct *vma, unsigned long start,
if (insert && file)
uprobe_mmap(insert);

+ if (next && next != vma)
+ vm_raw_write_end(next);
+ vm_raw_write_end(vma);
+
validate_mm(mm);

return 0;
--
2.21.0

2019-04-16 13:48:10

by Laurent Dufour

[permalink] [raw]
Subject: [PATCH v12 19/31] mm: protect the RB tree with a sequence lock

Introduce a per-mm_struct seqlock, the mm_seq field, to protect the changes
made to the MM RB tree. This allows walking the RB tree without grabbing
the mmap_sem; once the walk is done, the sequence counter is checked to
verify that it was stable during the walk.

The mm seqlock is held while inserting and removing entries in the MM RB
tree. Later in this series, it will be checked when looking for a VMA
without holding the mmap_sem.

This is based on the initial work from Peter Zijlstra:
https://lore.kernel.org/linux-mm/[email protected]/

Signed-off-by: Laurent Dufour <[email protected]>
---
include/linux/mm_types.h | 3 +++
kernel/fork.c | 3 +++
mm/init-mm.c | 3 +++
mm/mmap.c | 48 +++++++++++++++++++++++++++++++---------
4 files changed, 46 insertions(+), 11 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index e78f72eb2576..24b3f8ce9e42 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -358,6 +358,9 @@ struct mm_struct {
struct {
struct vm_area_struct *mmap; /* list of VMAs */
struct rb_root mm_rb;
+#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
+ seqlock_t mm_seq;
+#endif
u64 vmacache_seqnum; /* per-thread vmacache */
#ifdef CONFIG_MMU
unsigned long (*get_unmapped_area) (struct file *filp,
diff --git a/kernel/fork.c b/kernel/fork.c
index 2992d2c95256..3a1739197ebc 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1008,6 +1008,9 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
mm->mmap = NULL;
mm->mm_rb = RB_ROOT;
mm->vmacache_seqnum = 0;
+#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
+ seqlock_init(&mm->mm_seq);
+#endif
atomic_set(&mm->mm_users, 1);
atomic_set(&mm->mm_count, 1);
init_rwsem(&mm->mmap_sem);
diff --git a/mm/init-mm.c b/mm/init-mm.c
index a787a319211e..69346b883a4e 100644
--- a/mm/init-mm.c
+++ b/mm/init-mm.c
@@ -27,6 +27,9 @@
*/
struct mm_struct init_mm = {
.mm_rb = RB_ROOT,
+#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
+ .mm_seq = __SEQLOCK_UNLOCKED(init_mm.mm_seq),
+#endif
.pgd = swapper_pg_dir,
.mm_users = ATOMIC_INIT(2),
.mm_count = ATOMIC_INIT(1),
diff --git a/mm/mmap.c b/mm/mmap.c
index 13460b38b0fb..f7f6027a7dff 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -170,6 +170,24 @@ void unlink_file_vma(struct vm_area_struct *vma)
}
}

+#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
+static inline void mm_write_seqlock(struct mm_struct *mm)
+{
+ write_seqlock(&mm->mm_seq);
+}
+static inline void mm_write_sequnlock(struct mm_struct *mm)
+{
+ write_sequnlock(&mm->mm_seq);
+}
+#else
+static inline void mm_write_seqlock(struct mm_struct *mm)
+{
+}
+static inline void mm_write_sequnlock(struct mm_struct *mm)
+{
+}
+#endif /* CONFIG_SPECULATIVE_PAGE_FAULT */
+
/*
* Close a vm structure and free it, returning the next.
*/
@@ -445,26 +463,32 @@ static void vma_gap_update(struct vm_area_struct *vma)
}

static inline void vma_rb_insert(struct vm_area_struct *vma,
- struct rb_root *root)
+ struct mm_struct *mm)
{
+ struct rb_root *root = &mm->mm_rb;
+
/* All rb_subtree_gap values must be consistent prior to insertion */
validate_mm_rb(root, NULL);

rb_insert_augmented(&vma->vm_rb, root, &vma_gap_callbacks);
}

-static void __vma_rb_erase(struct vm_area_struct *vma, struct rb_root *root)
+static void __vma_rb_erase(struct vm_area_struct *vma, struct mm_struct *mm)
{
+ struct rb_root *root = &mm->mm_rb;
+
/*
* Note rb_erase_augmented is a fairly large inline function,
* so make sure we instantiate it only once with our desired
* augmented rbtree callbacks.
*/
+ mm_write_seqlock(mm);
rb_erase_augmented(&vma->vm_rb, root, &vma_gap_callbacks);
+ mm_write_sequnlock(mm); /* wmb */
}

static __always_inline void vma_rb_erase_ignore(struct vm_area_struct *vma,
- struct rb_root *root,
+ struct mm_struct *mm,
struct vm_area_struct *ignore)
{
/*
@@ -472,21 +496,21 @@ static __always_inline void vma_rb_erase_ignore(struct vm_area_struct *vma,
* with the possible exception of the "next" vma being erased if
* next->vm_start was reduced.
*/
- validate_mm_rb(root, ignore);
+ validate_mm_rb(&mm->mm_rb, ignore);

- __vma_rb_erase(vma, root);
+ __vma_rb_erase(vma, mm);
}

static __always_inline void vma_rb_erase(struct vm_area_struct *vma,
- struct rb_root *root)
+ struct mm_struct *mm)
{
/*
* All rb_subtree_gap values must be consistent prior to erase,
* with the possible exception of the vma being erased.
*/
- validate_mm_rb(root, vma);
+ validate_mm_rb(&mm->mm_rb, vma);

- __vma_rb_erase(vma, root);
+ __vma_rb_erase(vma, mm);
}

/*
@@ -601,10 +625,12 @@ void __vma_link_rb(struct mm_struct *mm, struct vm_area_struct *vma,
* immediately update the gap to the correct value. Finally we
* rebalance the rbtree after all augmented values have been set.
*/
+ mm_write_seqlock(mm);
rb_link_node(&vma->vm_rb, rb_parent, rb_link);
vma->rb_subtree_gap = 0;
vma_gap_update(vma);
- vma_rb_insert(vma, &mm->mm_rb);
+ vma_rb_insert(vma, mm);
+ mm_write_sequnlock(mm);
}

static void __vma_link_file(struct vm_area_struct *vma)
@@ -680,7 +706,7 @@ static __always_inline void __vma_unlink_common(struct mm_struct *mm,
{
struct vm_area_struct *next;

- vma_rb_erase_ignore(vma, &mm->mm_rb, ignore);
+ vma_rb_erase_ignore(vma, mm, ignore);
next = vma->vm_next;
if (has_prev)
prev->vm_next = next;
@@ -2674,7 +2700,7 @@ detach_vmas_to_be_unmapped(struct mm_struct *mm, struct vm_area_struct *vma,
insertion_point = (prev ? &prev->vm_next : &mm->mmap);
vma->vm_prev = NULL;
do {
- vma_rb_erase(vma, &mm->mm_rb);
+ vma_rb_erase(vma, mm);
mm->map_count--;
tail_vma = vma;
vma = vma->vm_next;
--
2.21.0

2019-04-16 13:48:14

by Laurent Dufour

[permalink] [raw]
Subject: [PATCH v12 24/31] mm: adding speculative page fault failure trace events

This patch adds a set of new trace events to collect the speculative page
fault failure events.

Signed-off-by: Laurent Dufour <[email protected]>
---
include/trace/events/pagefault.h | 80 ++++++++++++++++++++++++++++++++
mm/memory.c | 57 ++++++++++++++++++-----
2 files changed, 125 insertions(+), 12 deletions(-)
create mode 100644 include/trace/events/pagefault.h

diff --git a/include/trace/events/pagefault.h b/include/trace/events/pagefault.h
new file mode 100644
index 000000000000..d9438f3e6bad
--- /dev/null
+++ b/include/trace/events/pagefault.h
@@ -0,0 +1,80 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM pagefault
+
+#if !defined(_TRACE_PAGEFAULT_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_PAGEFAULT_H
+
+#include <linux/tracepoint.h>
+#include <linux/mm.h>
+
+DECLARE_EVENT_CLASS(spf,
+
+ TP_PROTO(unsigned long caller,
+ struct vm_area_struct *vma, unsigned long address),
+
+ TP_ARGS(caller, vma, address),
+
+ TP_STRUCT__entry(
+ __field(unsigned long, caller)
+ __field(unsigned long, vm_start)
+ __field(unsigned long, vm_end)
+ __field(unsigned long, address)
+ ),
+
+ TP_fast_assign(
+ __entry->caller = caller;
+ __entry->vm_start = vma->vm_start;
+ __entry->vm_end = vma->vm_end;
+ __entry->address = address;
+ ),
+
+ TP_printk("ip:%lx vma:%lx-%lx address:%lx",
+ __entry->caller, __entry->vm_start, __entry->vm_end,
+ __entry->address)
+);
+
+DEFINE_EVENT(spf, spf_vma_changed,
+
+ TP_PROTO(unsigned long caller,
+ struct vm_area_struct *vma, unsigned long address),
+
+ TP_ARGS(caller, vma, address)
+);
+
+DEFINE_EVENT(spf, spf_vma_noanon,
+
+ TP_PROTO(unsigned long caller,
+ struct vm_area_struct *vma, unsigned long address),
+
+ TP_ARGS(caller, vma, address)
+);
+
+DEFINE_EVENT(spf, spf_vma_notsup,
+
+ TP_PROTO(unsigned long caller,
+ struct vm_area_struct *vma, unsigned long address),
+
+ TP_ARGS(caller, vma, address)
+);
+
+DEFINE_EVENT(spf, spf_vma_access,
+
+ TP_PROTO(unsigned long caller,
+ struct vm_area_struct *vma, unsigned long address),
+
+ TP_ARGS(caller, vma, address)
+);
+
+DEFINE_EVENT(spf, spf_pmd_changed,
+
+ TP_PROTO(unsigned long caller,
+ struct vm_area_struct *vma, unsigned long address),
+
+ TP_ARGS(caller, vma, address)
+);
+
+#endif /* _TRACE_PAGEFAULT_H */
+
+/* This part must be outside protection */
+#include <trace/define_trace.h>
diff --git a/mm/memory.c b/mm/memory.c
index 1991da97e2db..509851ad7c95 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -81,6 +81,9 @@

#include "internal.h"

+#define CREATE_TRACE_POINTS
+#include <trace/events/pagefault.h>
+
#if defined(LAST_CPUPID_NOT_IN_PAGE_FLAGS) && !defined(CONFIG_COMPILE_TEST)
#warning Unfortunate NUMA and NUMA Balancing config, growing page-frame for last_cpupid.
#endif
@@ -2100,8 +2103,10 @@ static bool pte_spinlock(struct vm_fault *vmf)

again:
local_irq_disable();
- if (vma_has_changed(vmf))
+ if (vma_has_changed(vmf)) {
+ trace_spf_vma_changed(_RET_IP_, vmf->vma, vmf->address);
goto out;
+ }

#ifdef CONFIG_TRANSPARENT_HUGEPAGE
/*
@@ -2109,8 +2114,10 @@ static bool pte_spinlock(struct vm_fault *vmf)
* is not a huge collapse operation in progress in our back.
*/
pmdval = READ_ONCE(*vmf->pmd);
- if (!pmd_same(pmdval, vmf->orig_pmd))
+ if (!pmd_same(pmdval, vmf->orig_pmd)) {
+ trace_spf_pmd_changed(_RET_IP_, vmf->vma, vmf->address);
goto out;
+ }
#endif

vmf->ptl = pte_lockptr(vmf->vma->vm_mm, vmf->pmd);
@@ -2121,6 +2128,7 @@ static bool pte_spinlock(struct vm_fault *vmf)

if (vma_has_changed(vmf)) {
spin_unlock(vmf->ptl);
+ trace_spf_vma_changed(_RET_IP_, vmf->vma, vmf->address);
goto out;
}

@@ -2154,8 +2162,10 @@ static bool pte_map_lock(struct vm_fault *vmf)
*/
again:
local_irq_disable();
- if (vma_has_changed(vmf))
+ if (vma_has_changed(vmf)) {
+ trace_spf_vma_changed(_RET_IP_, vmf->vma, vmf->address);
goto out;
+ }

#ifdef CONFIG_TRANSPARENT_HUGEPAGE
/*
@@ -2163,8 +2173,10 @@ static bool pte_map_lock(struct vm_fault *vmf)
* is not a huge collapse operation in progress in our back.
*/
pmdval = READ_ONCE(*vmf->pmd);
- if (!pmd_same(pmdval, vmf->orig_pmd))
+ if (!pmd_same(pmdval, vmf->orig_pmd)) {
+ trace_spf_pmd_changed(_RET_IP_, vmf->vma, vmf->address);
goto out;
+ }
#endif

/*
@@ -2184,6 +2196,7 @@ static bool pte_map_lock(struct vm_fault *vmf)

if (vma_has_changed(vmf)) {
pte_unmap_unlock(pte, ptl);
+ trace_spf_vma_changed(_RET_IP_, vmf->vma, vmf->address);
goto out;
}

@@ -4187,47 +4200,60 @@ vm_fault_t __handle_speculative_fault(struct mm_struct *mm,

/* rmb <-> seqlock,vma_rb_erase() */
seq = raw_read_seqcount(&vma->vm_sequence);
- if (seq & 1)
+ if (seq & 1) {
+ trace_spf_vma_changed(_RET_IP_, vma, address);
goto out_put;
+ }

/*
* Can't call vm_ops service has we don't know what they would do
* with the VMA.
* This include huge page from hugetlbfs.
*/
- if (vma->vm_ops && vma->vm_ops->fault)
+ if (vma->vm_ops && vma->vm_ops->fault) {
+ trace_spf_vma_notsup(_RET_IP_, vma, address);
goto out_put;
+ }

/*
* __anon_vma_prepare() requires the mmap_sem to be held
* because vm_next and vm_prev must be safe. This can't be guaranteed
* in the speculative path.
*/
- if (unlikely(!vma->anon_vma))
+ if (unlikely(!vma->anon_vma)) {
+ trace_spf_vma_notsup(_RET_IP_, vma, address);
goto out_put;
+ }

vmf.vma_flags = READ_ONCE(vma->vm_flags);
vmf.vma_page_prot = READ_ONCE(vma->vm_page_prot);

/* Can't call userland page fault handler in the speculative path */
- if (unlikely(vmf.vma_flags & VM_UFFD_MISSING))
+ if (unlikely(vmf.vma_flags & VM_UFFD_MISSING)) {
+ trace_spf_vma_notsup(_RET_IP_, vma, address);
goto out_put;
+ }

- if (vmf.vma_flags & VM_GROWSDOWN || vmf.vma_flags & VM_GROWSUP)
+ if (vmf.vma_flags & VM_GROWSDOWN || vmf.vma_flags & VM_GROWSUP) {
/*
* This could be detected by the check address against VMA's
* boundaries but we want to trace it as not supported instead
* of changed.
*/
+ trace_spf_vma_notsup(_RET_IP_, vma, address);
goto out_put;
+ }

if (address < READ_ONCE(vma->vm_start)
- || READ_ONCE(vma->vm_end) <= address)
+ || READ_ONCE(vma->vm_end) <= address) {
+ trace_spf_vma_changed(_RET_IP_, vma, address);
goto out_put;
+ }

if (!arch_vma_access_permitted(vma, flags & FAULT_FLAG_WRITE,
flags & FAULT_FLAG_INSTRUCTION,
flags & FAULT_FLAG_REMOTE)) {
+ trace_spf_vma_access(_RET_IP_, vma, address);
ret = VM_FAULT_SIGSEGV;
goto out_put;
}
@@ -4235,10 +4261,12 @@ vm_fault_t __handle_speculative_fault(struct mm_struct *mm,
/* This is one is required to check that the VMA has write access set */
if (flags & FAULT_FLAG_WRITE) {
if (unlikely(!(vmf.vma_flags & VM_WRITE))) {
+ trace_spf_vma_access(_RET_IP_, vma, address);
ret = VM_FAULT_SIGSEGV;
goto out_put;
}
} else if (unlikely(!(vmf.vma_flags & (VM_READ|VM_EXEC|VM_WRITE)))) {
+ trace_spf_vma_access(_RET_IP_, vma, address);
ret = VM_FAULT_SIGSEGV;
goto out_put;
}
@@ -4252,8 +4280,10 @@ vm_fault_t __handle_speculative_fault(struct mm_struct *mm,
pol = __get_vma_policy(vma, address);
if (!pol)
pol = get_task_policy(current);
- if (pol && pol->mode == MPOL_INTERLEAVE)
+ if (pol && pol->mode == MPOL_INTERLEAVE) {
+ trace_spf_vma_notsup(_RET_IP_, vma, address);
goto out_put;
+ }
#endif

/*
@@ -4326,8 +4356,10 @@ vm_fault_t __handle_speculative_fault(struct mm_struct *mm,
* We need to re-validate the VMA after checking the bounds, otherwise
* we might have a false positive on the bounds.
*/
- if (read_seqcount_retry(&vma->vm_sequence, seq))
+ if (read_seqcount_retry(&vma->vm_sequence, seq)) {
+ trace_spf_vma_changed(_RET_IP_, vma, address);
goto out_put;
+ }

mem_cgroup_enter_user_fault();
ret = handle_pte_fault(&vmf);
@@ -4346,6 +4378,7 @@ vm_fault_t __handle_speculative_fault(struct mm_struct *mm,
return ret;

out_walk:
+ trace_spf_vma_notsup(_RET_IP_, vma, address);
local_irq_enable();
out_put:
put_vma(vma);
--
2.21.0

2019-04-16 13:48:23

by Laurent Dufour

[permalink] [raw]
Subject: [PATCH v12 15/31] mm: introduce __lru_cache_add_active_or_unevictable

The speculative page fault handler, which runs without holding the
mmap_sem, calls lru_cache_add_active_or_unevictable(), but the vm_flags are
not guaranteed to remain constant.
Introduce __lru_cache_add_active_or_unevictable(), which takes the vma
flags value as a parameter instead of the vma pointer.

Acked-by: David Rientjes <[email protected]>
Signed-off-by: Laurent Dufour <[email protected]>
---
include/linux/swap.h | 10 ++++++++--
mm/memory.c | 8 ++++----
mm/swap.c | 6 +++---
3 files changed, 15 insertions(+), 9 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 4bfb5c4ac108..d33b94eb3c69 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -343,8 +343,14 @@ extern void deactivate_file_page(struct page *page);
extern void mark_page_lazyfree(struct page *page);
extern void swap_setup(void);

-extern void lru_cache_add_active_or_unevictable(struct page *page,
- struct vm_area_struct *vma);
+extern void __lru_cache_add_active_or_unevictable(struct page *page,
+ unsigned long vma_flags);
+
+static inline void lru_cache_add_active_or_unevictable(struct page *page,
+ struct vm_area_struct *vma)
+{
+ return __lru_cache_add_active_or_unevictable(page, vma->vm_flags);
+}

/* linux/mm/vmscan.c */
extern unsigned long zone_reclaimable_pages(struct zone *zone);
diff --git a/mm/memory.c b/mm/memory.c
index 56802850e72c..85ec5ce5c0a8 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2347,7 +2347,7 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
ptep_clear_flush_notify(vma, vmf->address, vmf->pte);
page_add_new_anon_rmap(new_page, vma, vmf->address, false);
mem_cgroup_commit_charge(new_page, memcg, false, false);
- lru_cache_add_active_or_unevictable(new_page, vma);
+ __lru_cache_add_active_or_unevictable(new_page, vmf->vma_flags);
/*
* We call the notify macro here because, when using secondary
* mmu page tables (such as kvm shadow page tables), we want the
@@ -2896,7 +2896,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
if (unlikely(page != swapcache && swapcache)) {
page_add_new_anon_rmap(page, vma, vmf->address, false);
mem_cgroup_commit_charge(page, memcg, false, false);
- lru_cache_add_active_or_unevictable(page, vma);
+ __lru_cache_add_active_or_unevictable(page, vmf->vma_flags);
} else {
do_page_add_anon_rmap(page, vma, vmf->address, exclusive);
mem_cgroup_commit_charge(page, memcg, true, false);
@@ -3048,7 +3048,7 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES);
page_add_new_anon_rmap(page, vma, vmf->address, false);
mem_cgroup_commit_charge(page, memcg, false, false);
- lru_cache_add_active_or_unevictable(page, vma);
+ __lru_cache_add_active_or_unevictable(page, vmf->vma_flags);
setpte:
set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry);

@@ -3327,7 +3327,7 @@ vm_fault_t alloc_set_pte(struct vm_fault *vmf, struct mem_cgroup *memcg,
inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES);
page_add_new_anon_rmap(page, vma, vmf->address, false);
mem_cgroup_commit_charge(page, memcg, false, false);
- lru_cache_add_active_or_unevictable(page, vma);
+ __lru_cache_add_active_or_unevictable(page, vmf->vma_flags);
} else {
inc_mm_counter_fast(vma->vm_mm, mm_counter_file(page));
page_add_file_rmap(page, false);
diff --git a/mm/swap.c b/mm/swap.c
index 3a75722e68a9..a55f0505b563 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -450,12 +450,12 @@ void lru_cache_add(struct page *page)
* directly back onto it's zone's unevictable list, it does NOT use a
* per cpu pagevec.
*/
-void lru_cache_add_active_or_unevictable(struct page *page,
- struct vm_area_struct *vma)
+void __lru_cache_add_active_or_unevictable(struct page *page,
+ unsigned long vma_flags)
{
VM_BUG_ON_PAGE(PageLRU(page), page);

- if (likely((vma->vm_flags & (VM_LOCKED | VM_SPECIAL)) != VM_LOCKED))
+ if (likely((vma_flags & (VM_LOCKED | VM_SPECIAL)) != VM_LOCKED))
SetPageActive(page);
else if (!TestSetPageMlocked(page)) {
/*
--
2.21.0

2019-04-16 13:48:30

by Laurent Dufour

[permalink] [raw]
Subject: [PATCH v12 28/31] x86/mm: add speculative pagefault handling

From: Peter Zijlstra <[email protected]>

Try a speculative fault before acquiring the mmap_sem; if it returns with
VM_FAULT_RETRY, continue with the mmap_sem acquisition and do the
traditional fault.

Signed-off-by: Peter Zijlstra (Intel) <[email protected]>

[Clearing of FAULT_FLAG_ALLOW_RETRY is now done in
handle_speculative_fault()]
[Retry with usual fault path in the case VM_ERROR is returned by
handle_speculative_fault(). This allows signal to be delivered]
[Don't build SPF call if !CONFIG_SPECULATIVE_PAGE_FAULT]
[Handle memory protection key fault]
Signed-off-by: Laurent Dufour <[email protected]>
---
arch/x86/mm/fault.c | 14 ++++++++++++++
1 file changed, 14 insertions(+)

diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 667f1da36208..4390d207a7a1 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -1401,6 +1401,18 @@ void do_user_addr_fault(struct pt_regs *regs,
}
#endif

+ /*
+ * Do not try to do a speculative page fault if the fault was due to
+ * protection keys since it can't be resolved.
+ */
+ if (!(hw_error_code & X86_PF_PK)) {
+ fault = handle_speculative_fault(mm, address, flags);
+ if (fault != VM_FAULT_RETRY) {
+ perf_sw_event(PERF_COUNT_SW_SPF, 1, regs, address);
+ goto done;
+ }
+ }
+
/*
* Kernel-mode access to the user address space should only occur
* on well-defined single instructions listed in the exception
@@ -1499,6 +1511,8 @@ void do_user_addr_fault(struct pt_regs *regs,
}

up_read(&mm->mmap_sem);
+
+done:
if (unlikely(fault & VM_FAULT_ERROR)) {
mm_fault_error(regs, hw_error_code, address, fault);
return;
--
2.21.0

2019-04-16 13:48:35

by Laurent Dufour

[permalink] [raw]
Subject: [PATCH v12 31/31] mm: Add a speculative page fault switch in sysctl

This allows turning the use of the speculative page fault handler on and off.

By default it's turned on.

Signed-off-by: Laurent Dufour <[email protected]>
---
include/linux/mm.h | 3 +++
kernel/sysctl.c | 9 +++++++++
mm/memory.c | 3 +++
3 files changed, 15 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index ec609cbad25a..f5bf13a2197a 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1531,6 +1531,7 @@ extern vm_fault_t handle_mm_fault(struct vm_area_struct *vma,
unsigned long address, unsigned int flags);

#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
+extern int sysctl_speculative_page_fault;
extern vm_fault_t __handle_speculative_fault(struct mm_struct *mm,
unsigned long address,
unsigned int flags);
@@ -1538,6 +1539,8 @@ static inline vm_fault_t handle_speculative_fault(struct mm_struct *mm,
unsigned long address,
unsigned int flags)
{
+ if (unlikely(!sysctl_speculative_page_fault))
+ return VM_FAULT_RETRY;
/*
* Try speculative page fault for multithreaded user space task only.
*/
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 9df14b07a488..3a712e52c14a 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1295,6 +1295,15 @@ static struct ctl_table vm_table[] = {
.extra1 = &zero,
.extra2 = &two,
},
+#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
+ {
+ .procname = "speculative_page_fault",
+ .data = &sysctl_speculative_page_fault,
+ .maxlen = sizeof(sysctl_speculative_page_fault),
+ .mode = 0644,
+ .proc_handler = proc_dointvec,
+ },
+#endif
{
.procname = "panic_on_oom",
.data = &sysctl_panic_on_oom,
diff --git a/mm/memory.c b/mm/memory.c
index c65e8011d285..a12a60891350 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -83,6 +83,9 @@

#define CREATE_TRACE_POINTS
#include <trace/events/pagefault.h>
+#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
+int sysctl_speculative_page_fault = 1;
+#endif

#if defined(LAST_CPUPID_NOT_IN_PAGE_FLAGS) && !defined(CONFIG_COMPILE_TEST)
#warning Unfortunate NUMA and NUMA Balancing config, growing page-frame for last_cpupid.
--
2.21.0

2019-04-16 13:48:38

by Laurent Dufour

[permalink] [raw]
Subject: [PATCH v12 30/31] arm64/mm: add speculative page fault

From: Mahendran Ganesh <[email protected]>

This patch enables the speculative page fault on the arm64
architecture.

I completed the SPF porting in 4.9. From the test results,
we can see app launching time improved by about 10% on average.
For the apps which have more than 50 threads, 15% or even more
improvement can be achieved.

Signed-off-by: Ganesh Mahendran <[email protected]>

[handle_speculative_fault() is no more returning the vma pointer]
Signed-off-by: Laurent Dufour <[email protected]>
---
arch/arm64/mm/fault.c | 12 ++++++++++++
1 file changed, 12 insertions(+)

diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
index 4f343e603925..b5e2a93f9c21 100644
--- a/arch/arm64/mm/fault.c
+++ b/arch/arm64/mm/fault.c
@@ -485,6 +485,16 @@ static int __kprobes do_page_fault(unsigned long addr, unsigned int esr,

perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, addr);

+ /*
+ * let's try a speculative page fault without grabbing the
+ * mmap_sem.
+ */
+ fault = handle_speculative_fault(mm, addr, mm_flags);
+ if (fault != VM_FAULT_RETRY) {
+ perf_sw_event(PERF_COUNT_SW_SPF, 1, regs, addr);
+ goto done;
+ }
+
/*
* As per x86, we may deadlock here. However, since the kernel only
* validly references user space from well defined areas of the code,
@@ -535,6 +545,8 @@ static int __kprobes do_page_fault(unsigned long addr, unsigned int esr,
}
up_read(&mm->mmap_sem);

+done:
+
/*
* Handle the "normal" (no error) case first.
*/
--
2.21.0

2019-04-16 13:48:39

by Laurent Dufour

[permalink] [raw]
Subject: [PATCH v12 05/31] mm: prepare for FAULT_FLAG_SPECULATIVE

From: Peter Zijlstra <[email protected]>

When speculating faults (without holding mmap_sem) we need to validate
that the vma against which we loaded pages is still valid when we're
ready to install the new PTE.

Therefore, replace the pte_offset_map_lock() calls that (re)take the
PTL with pte_map_lock(), which can fail if we find that the VMA has
changed since we started the fault.
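
As a condensed sketch of the calling convention used in the hunks below
(the speculative variant that can actually fail is only introduced later in
the series), callers now look like:

	if (!pte_map_lock(vmf))
		return VM_FAULT_RETRY;	/* VMA changed, let the classic path redo it */

	/* vmf->pte is mapped and vmf->ptl is held, as with pte_offset_map_lock() */
	if (!pte_same(*vmf->pte, vmf->orig_pte))
		goto unlock;		/* somebody else faulted it in */
	/* ... install the new PTE ... */
	pte_unmap_unlock(vmf->pte, vmf->ptl);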

Signed-off-by: Peter Zijlstra (Intel) <[email protected]>

[Port to 4.12 kernel]
[Remove the comment about the fault_env structure which has been
implemented as the vm_fault structure in the kernel]
[move pte_map_lock()'s definition upper in the file]
[move the define of FAULT_FLAG_SPECULATIVE later in the series]
[review error path in do_swap_page(), do_anonymous_page() and
wp_page_copy()]
Signed-off-by: Laurent Dufour <[email protected]>
---
mm/memory.c | 87 +++++++++++++++++++++++++++++++++++------------------
1 file changed, 58 insertions(+), 29 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index c6ddadd9d2b7..fc3698d13cb5 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2073,6 +2073,13 @@ int apply_to_page_range(struct mm_struct *mm, unsigned long addr,
}
EXPORT_SYMBOL_GPL(apply_to_page_range);

+static inline bool pte_map_lock(struct vm_fault *vmf)
+{
+ vmf->pte = pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd,
+ vmf->address, &vmf->ptl);
+ return true;
+}
+
/*
* handle_pte_fault chooses page fault handler according to an entry which was
* read non-atomically. Before making any commitment, on those architectures
@@ -2261,25 +2268,26 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
int page_copied = 0;
struct mem_cgroup *memcg;
struct mmu_notifier_range range;
+ int ret = VM_FAULT_OOM;

if (unlikely(anon_vma_prepare(vma)))
- goto oom;
+ goto out;

if (is_zero_pfn(pte_pfn(vmf->orig_pte))) {
new_page = alloc_zeroed_user_highpage_movable(vma,
vmf->address);
if (!new_page)
- goto oom;
+ goto out;
} else {
new_page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma,
vmf->address);
if (!new_page)
- goto oom;
+ goto out;
cow_user_page(new_page, old_page, vmf->address, vma);
}

if (mem_cgroup_try_charge_delay(new_page, mm, GFP_KERNEL, &memcg, false))
- goto oom_free_new;
+ goto out_free_new;

__SetPageUptodate(new_page);

@@ -2291,7 +2299,10 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
/*
* Re-check the pte - we dropped the lock
*/
- vmf->pte = pte_offset_map_lock(mm, vmf->pmd, vmf->address, &vmf->ptl);
+ if (!pte_map_lock(vmf)) {
+ ret = VM_FAULT_RETRY;
+ goto out_uncharge;
+ }
if (likely(pte_same(*vmf->pte, vmf->orig_pte))) {
if (old_page) {
if (!PageAnon(old_page)) {
@@ -2378,12 +2389,14 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
put_page(old_page);
}
return page_copied ? VM_FAULT_WRITE : 0;
-oom_free_new:
+out_uncharge:
+ mem_cgroup_cancel_charge(new_page, memcg, false);
+out_free_new:
put_page(new_page);
-oom:
+out:
if (old_page)
put_page(old_page);
- return VM_FAULT_OOM;
+ return ret;
}

/**
@@ -2405,8 +2418,8 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
vm_fault_t finish_mkwrite_fault(struct vm_fault *vmf)
{
WARN_ON_ONCE(!(vmf->vma->vm_flags & VM_SHARED));
- vmf->pte = pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd, vmf->address,
- &vmf->ptl);
+ if (!pte_map_lock(vmf))
+ return VM_FAULT_RETRY;
/*
* We might have raced with another page fault while we released the
* pte_offset_map_lock.
@@ -2527,8 +2540,11 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf)
get_page(vmf->page);
pte_unmap_unlock(vmf->pte, vmf->ptl);
lock_page(vmf->page);
- vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
- vmf->address, &vmf->ptl);
+ if (!pte_map_lock(vmf)) {
+ unlock_page(vmf->page);
+ put_page(vmf->page);
+ return VM_FAULT_RETRY;
+ }
if (!pte_same(*vmf->pte, vmf->orig_pte)) {
unlock_page(vmf->page);
pte_unmap_unlock(vmf->pte, vmf->ptl);
@@ -2744,11 +2760,15 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)

if (!page) {
/*
- * Back out if somebody else faulted in this pte
- * while we released the pte lock.
+ * Back out if the VMA has changed in our back during
+ * a speculative page fault or if somebody else
+ * faulted in this pte while we released the pte lock.
*/
- vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
- vmf->address, &vmf->ptl);
+ if (!pte_map_lock(vmf)) {
+ delayacct_clear_flag(DELAYACCT_PF_SWAPIN);
+ ret = VM_FAULT_RETRY;
+ goto out;
+ }
if (likely(pte_same(*vmf->pte, vmf->orig_pte)))
ret = VM_FAULT_OOM;
delayacct_clear_flag(DELAYACCT_PF_SWAPIN);
@@ -2801,10 +2821,13 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
}

/*
- * Back out if somebody else already faulted in this pte.
+ * Back out if the VMA has changed in our back during a speculative
+ * page fault or if somebody else already faulted in this pte.
*/
- vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
- &vmf->ptl);
+ if (!pte_map_lock(vmf)) {
+ ret = VM_FAULT_RETRY;
+ goto out_cancel_cgroup;
+ }
if (unlikely(!pte_same(*vmf->pte, vmf->orig_pte)))
goto out_nomap;

@@ -2882,8 +2905,9 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
out:
return ret;
out_nomap:
- mem_cgroup_cancel_charge(page, memcg, false);
pte_unmap_unlock(vmf->pte, vmf->ptl);
+out_cancel_cgroup:
+ mem_cgroup_cancel_charge(page, memcg, false);
out_page:
unlock_page(page);
out_release:
@@ -2934,8 +2958,8 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
!mm_forbids_zeropage(vma->vm_mm)) {
entry = pte_mkspecial(pfn_pte(my_zero_pfn(vmf->address),
vma->vm_page_prot));
- vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
- vmf->address, &vmf->ptl);
+ if (!pte_map_lock(vmf))
+ return VM_FAULT_RETRY;
if (!pte_none(*vmf->pte))
goto unlock;
ret = check_stable_address_space(vma->vm_mm);
@@ -2971,14 +2995,16 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
if (vma->vm_flags & VM_WRITE)
entry = pte_mkwrite(pte_mkdirty(entry));

- vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
- &vmf->ptl);
- if (!pte_none(*vmf->pte))
+ if (!pte_map_lock(vmf)) {
+ ret = VM_FAULT_RETRY;
goto release;
+ }
+ if (!pte_none(*vmf->pte))
+ goto unlock_and_release;

ret = check_stable_address_space(vma->vm_mm);
if (ret)
- goto release;
+ goto unlock_and_release;

/* Deliver the page fault to userland, check inside PT lock */
if (userfaultfd_missing(vma)) {
@@ -3000,10 +3026,12 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
unlock:
pte_unmap_unlock(vmf->pte, vmf->ptl);
return ret;
+unlock_and_release:
+ pte_unmap_unlock(vmf->pte, vmf->ptl);
release:
mem_cgroup_cancel_charge(page, memcg, false);
put_page(page);
- goto unlock;
+ return ret;
oom_free_page:
put_page(page);
oom:
@@ -3118,8 +3146,9 @@ static vm_fault_t pte_alloc_one_map(struct vm_fault *vmf)
* pte_none() under vmf->ptl protection when we return to
* alloc_set_pte().
*/
- vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
- &vmf->ptl);
+ if (!pte_map_lock(vmf))
+ return VM_FAULT_RETRY;
+
return 0;
}

--
2.21.0

2019-04-16 13:48:46

by Laurent Dufour

[permalink] [raw]
Subject: [PATCH v12 03/31] powerpc/mm: set ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT

Set ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT for BOOK3S_64. This enables
the Speculative Page Fault handler.

Support is currently only provided for BOOK3S_64 because:
- it requires CONFIG_PPC_STD_MMU because of the checks done in
set_access_flags_filter()
- it requires BOOK3S because we can't support book3e_hugetlb_preload()
called by update_mmu_cache()

Cc: Michael Ellerman <[email protected]>
Signed-off-by: Laurent Dufour <[email protected]>
---
arch/powerpc/Kconfig | 1 +
1 file changed, 1 insertion(+)

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 2d0be82c3061..a29887ea5383 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -238,6 +238,7 @@ config PPC
select PCI_SYSCALL if PCI
select RTC_LIB
select SPARSE_IRQ
+ select ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT if PPC_BOOK3S_64
select SYSCTL_EXCEPTION_TRACE
select THREAD_INFO_IN_TASK
select VIRT_TO_BUS if !PPC64
--
2.21.0

2019-04-16 13:48:48

by Laurent Dufour

[permalink] [raw]
Subject: [PATCH v12 11/31] mm: protect mremap() against SPF handler

If a thread is remapping an area while another one is faulting on the
destination area, the SPF handler may fetch the vma from the RB tree before
the pte has been moved by the other thread. This means that the moved ptes
will overwrite those created by the page fault handler, leading to leaked
pages.

CPU 1 CPU2
enter mremap()
unmap the dest area
copy_vma() Enter speculative page fault handler
>> at this time the dest area is present in the RB tree
fetch the vma matching dest area
create a pte as the VMA matched
Exit the SPF handler
<data written in the new page>
move_ptes()
> it is assumed that the dest area is empty,
> the move ptes overwrite the page mapped by the CPU2.

To prevent that, when the VMA matching the dest area is extended or created
by copy_vma(), it should be marked as not available to the SPF handler.
The usual way to do so is to rely on vm_write_begin()/end().
This is already done in __vma_adjust(), called by copy_vma() (through
vma_merge()). But __vma_adjust() calls vm_write_end() before returning,
which creates a window for another thread.
This patch adds a new parameter to vma_merge() which is passed down to
__vma_adjust().
The assumption is that copy_vma() returns a vma which should be released
by its caller through vm_raw_write_end() once the ptes have been moved.
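
Condensed from the mm/mmap.c and mm/mremap.c hunks below, the resulting
protocol on the success path is:

	new_vma = copy_vma(&vma, new_addr, new_len, new_pgoff, &need_rmap_locks);
	/* new_vma comes back with vm_raw_write_begin() already applied */

	if (vma != new_vma)
		vm_raw_write_begin(vma);	/* protect the source VMA too */

	moved_len = move_page_tables(vma, old_addr, new_vma, new_addr,
				     old_len, need_rmap_locks);
	/* ... error handling elided, see the mm/mremap.c hunk ... */

	if (vma != new_vma)
		vm_raw_write_end(vma);
	vm_raw_write_end(new_vma);	/* ptes are in place, SPF may run again */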

Signed-off-by: Laurent Dufour <[email protected]>
---
include/linux/mm.h | 24 ++++++++++++++++-----
mm/mmap.c | 53 +++++++++++++++++++++++++++++++++++-----------
mm/mremap.c | 13 ++++++++++++
3 files changed, 73 insertions(+), 17 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 906b9e06f18e..5d45b7d8718d 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2343,18 +2343,32 @@ void anon_vma_interval_tree_verify(struct anon_vma_chain *node);

/* mmap.c */
extern int __vm_enough_memory(struct mm_struct *mm, long pages, int cap_sys_admin);
+
extern int __vma_adjust(struct vm_area_struct *vma, unsigned long start,
unsigned long end, pgoff_t pgoff, struct vm_area_struct *insert,
- struct vm_area_struct *expand);
+ struct vm_area_struct *expand, bool keep_locked);
+
static inline int vma_adjust(struct vm_area_struct *vma, unsigned long start,
unsigned long end, pgoff_t pgoff, struct vm_area_struct *insert)
{
- return __vma_adjust(vma, start, end, pgoff, insert, NULL);
+ return __vma_adjust(vma, start, end, pgoff, insert, NULL, false);
}
-extern struct vm_area_struct *vma_merge(struct mm_struct *,
+
+extern struct vm_area_struct *__vma_merge(struct mm_struct *mm,
+ struct vm_area_struct *prev, unsigned long addr, unsigned long end,
+ unsigned long vm_flags, struct anon_vma *anon, struct file *file,
+ pgoff_t pgoff, struct mempolicy *mpol,
+ struct vm_userfaultfd_ctx uff, bool keep_locked);
+
+static inline struct vm_area_struct *vma_merge(struct mm_struct *mm,
struct vm_area_struct *prev, unsigned long addr, unsigned long end,
- unsigned long vm_flags, struct anon_vma *, struct file *, pgoff_t,
- struct mempolicy *, struct vm_userfaultfd_ctx);
+ unsigned long vm_flags, struct anon_vma *anon, struct file *file,
+ pgoff_t off, struct mempolicy *pol, struct vm_userfaultfd_ctx uff)
+{
+ return __vma_merge(mm, prev, addr, end, vm_flags, anon, file, off,
+ pol, uff, false);
+}
+
extern struct anon_vma *find_mergeable_anon_vma(struct vm_area_struct *);
extern int __split_vma(struct mm_struct *, struct vm_area_struct *,
unsigned long addr, int new_below);
diff --git a/mm/mmap.c b/mm/mmap.c
index b77ec0149249..13460b38b0fb 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -714,7 +714,7 @@ static inline void __vma_unlink_prev(struct mm_struct *mm,
*/
int __vma_adjust(struct vm_area_struct *vma, unsigned long start,
unsigned long end, pgoff_t pgoff, struct vm_area_struct *insert,
- struct vm_area_struct *expand)
+ struct vm_area_struct *expand, bool keep_locked)
{
struct mm_struct *mm = vma->vm_mm;
struct vm_area_struct *next = vma->vm_next, *orig_vma = vma;
@@ -830,8 +830,12 @@ int __vma_adjust(struct vm_area_struct *vma, unsigned long start,

importer->anon_vma = exporter->anon_vma;
error = anon_vma_clone(importer, exporter);
- if (error)
+ if (error) {
+ if (next && next != vma)
+ vm_raw_write_end(next);
+ vm_raw_write_end(vma);
return error;
+ }
}
}
again:
@@ -1025,7 +1029,8 @@ int __vma_adjust(struct vm_area_struct *vma, unsigned long start,

if (next && next != vma)
vm_raw_write_end(next);
- vm_raw_write_end(vma);
+ if (!keep_locked)
+ vm_raw_write_end(vma);

validate_mm(mm);

@@ -1161,12 +1166,13 @@ can_vma_merge_after(struct vm_area_struct *vma, unsigned long vm_flags,
* parameter) may establish ptes with the wrong permissions of NNNN
* instead of the right permissions of XXXX.
*/
-struct vm_area_struct *vma_merge(struct mm_struct *mm,
+struct vm_area_struct *__vma_merge(struct mm_struct *mm,
struct vm_area_struct *prev, unsigned long addr,
unsigned long end, unsigned long vm_flags,
struct anon_vma *anon_vma, struct file *file,
pgoff_t pgoff, struct mempolicy *policy,
- struct vm_userfaultfd_ctx vm_userfaultfd_ctx)
+ struct vm_userfaultfd_ctx vm_userfaultfd_ctx,
+ bool keep_locked)
{
pgoff_t pglen = (end - addr) >> PAGE_SHIFT;
struct vm_area_struct *area, *next;
@@ -1214,10 +1220,11 @@ struct vm_area_struct *vma_merge(struct mm_struct *mm,
/* cases 1, 6 */
err = __vma_adjust(prev, prev->vm_start,
next->vm_end, prev->vm_pgoff, NULL,
- prev);
+ prev, keep_locked);
} else /* cases 2, 5, 7 */
err = __vma_adjust(prev, prev->vm_start,
- end, prev->vm_pgoff, NULL, prev);
+ end, prev->vm_pgoff, NULL, prev,
+ keep_locked);
if (err)
return NULL;
khugepaged_enter_vma_merge(prev, vm_flags);
@@ -1234,10 +1241,12 @@ struct vm_area_struct *vma_merge(struct mm_struct *mm,
vm_userfaultfd_ctx)) {
if (prev && addr < prev->vm_end) /* case 4 */
err = __vma_adjust(prev, prev->vm_start,
- addr, prev->vm_pgoff, NULL, next);
+ addr, prev->vm_pgoff, NULL, next,
+ keep_locked);
else { /* cases 3, 8 */
err = __vma_adjust(area, addr, next->vm_end,
- next->vm_pgoff - pglen, NULL, next);
+ next->vm_pgoff - pglen, NULL, next,
+ keep_locked);
/*
* In case 3 area is already equal to next and
* this is a noop, but in case 8 "area" has
@@ -3259,9 +3268,20 @@ struct vm_area_struct *copy_vma(struct vm_area_struct **vmap,

if (find_vma_links(mm, addr, addr + len, &prev, &rb_link, &rb_parent))
return NULL; /* should never get here */
- new_vma = vma_merge(mm, prev, addr, addr + len, vma->vm_flags,
- vma->anon_vma, vma->vm_file, pgoff, vma_policy(vma),
- vma->vm_userfaultfd_ctx);
+
+ /* There are 3 cases to manage here in
+ * AAAA AAAA AAAA AAAA
+ * PPPP.... PPPP......NNNN PPPP....NNNN PP........NN
+ * PPPPPPPP(A) PPPP..NNNNNNNN(B) PPPPPPPPPPPP(1) NULL
+ * PPPPPPPPNNNN(2)
+ * PPPPNNNNNNNN(3)
+ *
+ * new_vma == prev in case A,1,2
+ * new_vma == next in case B,3
+ */
+ new_vma = __vma_merge(mm, prev, addr, addr + len, vma->vm_flags,
+ vma->anon_vma, vma->vm_file, pgoff,
+ vma_policy(vma), vma->vm_userfaultfd_ctx, true);
if (new_vma) {
/*
* Source vma may have been merged into new_vma
@@ -3299,6 +3319,15 @@ struct vm_area_struct *copy_vma(struct vm_area_struct **vmap,
get_file(new_vma->vm_file);
if (new_vma->vm_ops && new_vma->vm_ops->open)
new_vma->vm_ops->open(new_vma);
+ /*
+ * As the VMA is linked right now, it may be hit by the
+ * speculative page fault handler. But we don't want it to
+ * start mapping pages in this area until the caller has
+ * potentially moved the ptes from the moved VMA. To prevent
+ * that we protect it right now, and let the caller unprotect
+ * it once the move is done.
+ */
+ vm_raw_write_begin(new_vma);
vma_link(mm, new_vma, prev, rb_link, rb_parent);
*need_rmap_locks = false;
}
diff --git a/mm/mremap.c b/mm/mremap.c
index fc241d23cd97..ae5c3379586e 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -357,6 +357,14 @@ static unsigned long move_vma(struct vm_area_struct *vma,
if (!new_vma)
return -ENOMEM;

+ /* new_vma is returned protected by copy_vma, to prevent speculative
+ * page faults from being handled in the destination area before we move
+ * the ptes. Now, we must also protect the source VMA since we don't want
+ * pages to be mapped behind our back while we are copying the PTEs.
+ */
+ if (vma != new_vma)
+ vm_raw_write_begin(vma);
+
moved_len = move_page_tables(vma, old_addr, new_vma, new_addr, old_len,
need_rmap_locks);
if (moved_len < old_len) {
@@ -373,6 +381,8 @@ static unsigned long move_vma(struct vm_area_struct *vma,
*/
move_page_tables(new_vma, new_addr, vma, old_addr, moved_len,
true);
+ if (vma != new_vma)
+ vm_raw_write_end(vma);
vma = new_vma;
old_len = new_len;
old_addr = new_addr;
@@ -381,7 +391,10 @@ static unsigned long move_vma(struct vm_area_struct *vma,
mremap_userfaultfd_prep(new_vma, uf);
arch_remap(mm, old_addr, old_addr + old_len,
new_addr, new_addr + new_len);
+ if (vma != new_vma)
+ vm_raw_write_end(vma);
}
+ vm_raw_write_end(new_vma);

/* Conceal VM_ACCOUNT so old reservation is not undone */
if (vm_flags & VM_ACCOUNT) {
--
2.21.0

2019-04-16 13:48:50

by Laurent Dufour

[permalink] [raw]
Subject: [PATCH v12 18/31] mm: protect against PTE changes done by dup_mmap()

Vinayak Menon and Ganesh Mahendran reported that the following scenario may
lead to a thread being blocked due to data corruption:

CPU 1 CPU 2 CPU 3
Process 1, Process 1, Process 1,
Thread A Thread B Thread C

while (1) { while (1) { while(1) {
pthread_mutex_lock(l) pthread_mutex_lock(l) fork
pthread_mutex_unlock(l) pthread_mutex_unlock(l) }
} }

In detail, this happens because:

CPU 1 CPU 2 CPU 3
fork()
copy_pte_range()
set PTE rdonly
got to next VMA...
. PTE is seen rdonly PTE still writable
. thread is writing to page
. -> page fault
. copy the page Thread writes to page
. . -> no page fault
. update the PTE
. flush TLB for that PTE
flush TLB PTE are now rdonly

So the write done by CPU 3 interferes with the page copy operation done by
CPU 2, leading to data corruption.

To avoid this we mark all the VMAs involved in the COW mechanism as changing
by calling vm_write_begin(). This ensures that the speculative page fault
handler will not try to handle a fault on these pages.
The marker remains set until the TLB is flushed, ensuring that all the CPUs
will then see the PTEs as not writable.
Once the TLB is flushed, the marker is removed by calling vm_write_end().

The variable last is used to keep track of the latest marked VMA, in order
to handle the error path where only part of the VMAs may have been marked.

Since multiple VMAs from the same mm may have their sequence count increased
during this process, the use of vm_raw_write_begin/end() is required to
avoid false lockdep warning messages.
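
Condensed from the kernel/fork.c hunk below, the resulting dup_mmap() flow is:

	for (mpnt = oldmm->mmap; mpnt; mpnt = mpnt->vm_next) {
		/* ... duplicate mpnt into tmp ... */
		if (!(tmp->vm_flags & VM_WIPEONFORK)) {
			if (IS_ENABLED(CONFIG_SPECULATIVE_PAGE_FAULT)) {
				last = mpnt;
				vm_raw_write_begin(mpnt);	/* SPF aborts on mpnt */
			}
			retval = copy_page_range(mm, oldmm, mpnt);
		}
	}

	flush_tlb_mm(oldmm);	/* every CPU now sees the write-protected PTEs */

	/* walk back from the last marked VMA and lift the protection */
	for (; last; last = last->vm_prev) {
		if (last->vm_flags & VM_DONTCOPY)
			continue;
		if (!(last->vm_flags & VM_WIPEONFORK))
			vm_raw_write_end(last);
	}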

Reported-by: Ganesh Mahendran <[email protected]>
Reported-by: Vinayak Menon <[email protected]>
Signed-off-by: Laurent Dufour <[email protected]>
---
kernel/fork.c | 30 ++++++++++++++++++++++++++++--
1 file changed, 28 insertions(+), 2 deletions(-)

diff --git a/kernel/fork.c b/kernel/fork.c
index f8dae021c2e5..2992d2c95256 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -462,7 +462,7 @@ EXPORT_SYMBOL(free_task);
static __latent_entropy int dup_mmap(struct mm_struct *mm,
struct mm_struct *oldmm)
{
- struct vm_area_struct *mpnt, *tmp, *prev, **pprev;
+ struct vm_area_struct *mpnt, *tmp, *prev, **pprev, *last = NULL;
struct rb_node **rb_link, *rb_parent;
int retval;
unsigned long charge;
@@ -581,8 +581,18 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
rb_parent = &tmp->vm_rb;

mm->map_count++;
- if (!(tmp->vm_flags & VM_WIPEONFORK))
+ if (!(tmp->vm_flags & VM_WIPEONFORK)) {
+ if (IS_ENABLED(CONFIG_SPECULATIVE_PAGE_FAULT)) {
+ /*
+ * Mark this VMA as changing to prevent the
+ * speculative page fault handler from processing
+ * it until the TLB is flushed below.
+ */
+ last = mpnt;
+ vm_raw_write_begin(mpnt);
+ }
retval = copy_page_range(mm, oldmm, mpnt);
+ }

if (tmp->vm_ops && tmp->vm_ops->open)
tmp->vm_ops->open(tmp);
@@ -595,6 +605,22 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
out:
up_write(&mm->mmap_sem);
flush_tlb_mm(oldmm);
+
+ if (IS_ENABLED(CONFIG_SPECULATIVE_PAGE_FAULT)) {
+ /*
+ * Since the TLB has been flushed, we can safely unmark the
+ * copied VMAs and allow the speculative page fault handler to
+ * process them again.
+ * Walk back the VMA list from the last marked VMA.
+ */
+ for (; last; last = last->vm_prev) {
+ if (last->vm_flags & VM_DONTCOPY)
+ continue;
+ if (!(last->vm_flags & VM_WIPEONFORK))
+ vm_raw_write_end(last);
+ }
+ }
+
up_write(&oldmm->mmap_sem);
dup_userfaultfd_complete(&uf);
fail_uprobe_end:
--
2.21.0

2019-04-16 13:48:58

by Laurent Dufour

[permalink] [raw]
Subject: [PATCH v12 25/31] perf: add a speculative page fault sw event

Add a new software event to count successful speculative page faults.
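
For illustration only (not part of this patch), a minimal user space sketch
counting these events for a given task, assuming the usual perf_event_open()
conventions:

	#include <linux/perf_event.h>
	#include <string.h>
	#include <sys/syscall.h>
	#include <sys/types.h>
	#include <unistd.h>

	static long open_spf_counter(pid_t pid)
	{
		struct perf_event_attr attr;

		memset(&attr, 0, sizeof(attr));
		attr.size = sizeof(attr);
		attr.type = PERF_TYPE_SOFTWARE;
		attr.config = PERF_COUNT_SW_SPF;	/* the id added below */

		/* count for 'pid' on any CPU, no group leader, no flags */
		return syscall(__NR_perf_event_open, &attr, pid, -1, -1, 0);
	}

A read() of the returned fd gives the 64-bit number of successful
speculative page faults.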

Acked-by: David Rientjes <[email protected]>
Signed-off-by: Laurent Dufour <[email protected]>
---
include/uapi/linux/perf_event.h | 1 +
1 file changed, 1 insertion(+)

diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index 7198ddd0c6b1..3b4356c55caa 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -112,6 +112,7 @@ enum perf_sw_ids {
PERF_COUNT_SW_EMULATION_FAULTS = 8,
PERF_COUNT_SW_DUMMY = 9,
PERF_COUNT_SW_BPF_OUTPUT = 10,
+ PERF_COUNT_SW_SPF = 11,

PERF_COUNT_SW_MAX, /* non-ABI */
};
--
2.21.0

2019-04-16 13:49:06

by Laurent Dufour

[permalink] [raw]
Subject: [PATCH v12 22/31] mm: provide speculative fault infrastructure

From: Peter Zijlstra <[email protected]>

Provide infrastructure to do a speculative fault (not holding
mmap_sem).

The not holding of mmap_sem means we can race against VMA
change/removal and page-table destruction. We use the SRCU VMA freeing
to keep the VMA around. We use the VMA seqcount to detect change
(including unmapping / page-table deletion) and we use gup_fast() style
page-table walking to deal with page-table races.

Once we've obtained the page and are ready to update the PTE, we
validate if the state we started the fault with is still valid, if
not, we'll fail the fault with VM_FAULT_RETRY, otherwise we update the
PTE and we're done.
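
Condensed from the mm/memory.c hunk below, the resulting flow in
__handle_speculative_fault() is roughly:

	vma = find_vma_rcu(mm, address);	/* takes a reference, no mmap_sem */
	seq = raw_read_seqcount(&vma->vm_sequence);
	if (seq & 1)				/* a writer owns the VMA right now */
		goto out_put;

	/*
	 * VMA attribute checks, then a gup_fast() style walk of the page
	 * tables using READ_ONCE() with IRQs disabled.
	 */

	if (read_seqcount_retry(&vma->vm_sequence, seq))
		goto out_put;		/* VMA changed, the classic fault will retry */

	ret = handle_pte_fault(&vmf);
	put_vma(vma);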

Signed-off-by: Peter Zijlstra (Intel) <[email protected]>

[Manage the newly introduced pte_spinlock() for speculative page
fault to fail if the VMA is touched in our back]
[Rename vma_is_dead() to vma_has_changed() and declare it here]
[Fetch p4d and pud]
[Set vmd.sequence in __handle_mm_fault()]
[Abort speculative path when handle_userfault() has to be called]
[Add additional VMA's flags checks in handle_speculative_fault()]
[Clear FAULT_FLAG_ALLOW_RETRY in handle_speculative_fault()]
[Don't set vmf->pte and vmf->ptl if pte_map_lock() failed]
[Remove warning comment about waiting for !seq&1 since we don't want
to wait]
[Remove warning about no huge page support, mention it explicitly]
[Don't call do_fault() in the speculative path as __do_fault() calls
vma->vm_ops->fault() which may want to release mmap_sem]
[Only vm_fault pointer argument for vma_has_changed()]
[Fix check against huge page, calling pmd_trans_huge()]
[Use READ_ONCE() when reading VMA's fields in the speculative path]
[Explicitly check for __HAVE_ARCH_PTE_SPECIAL as we can't support for
processing done in vm_normal_page()]
[Check that vma->anon_vma is already set when starting the speculative
path]
[Check for memory policy as we can't support MPOL_INTERLEAVE case due to
the processing done in mpol_misplaced()]
[Don't support VMA growing up or down]
[Move check on vm_sequence just before calling handle_pte_fault()]
[Don't build SPF services if !CONFIG_SPECULATIVE_PAGE_FAULT]
[Add mem cgroup oom check]
[Use READ_ONCE to access p*d entries]
[Replace deprecated ACCESS_ONCE() by READ_ONCE() in vma_has_changed()]
[Don't fetch pte again in handle_pte_fault() when running the speculative
path]
[Check PMD against concurrent collapsing operation]
[Try spin lock the pte during the speculative path to avoid deadlock with
other CPU's invalidating the TLB and requiring this CPU to catch the
inter processor's interrupt]
[Move define of FAULT_FLAG_SPECULATIVE here]
[Introduce __handle_speculative_fault() and add a check against
mm->mm_users in handle_speculative_fault() defined in mm.h]
[Abort if vm_ops->fault is set instead of checking only vm_ops]
[Use find_vma_rcu() and call put_vma() when we are done with the VMA]
Signed-off-by: Laurent Dufour <[email protected]>
---
include/linux/hugetlb_inline.h | 2 +-
include/linux/mm.h | 30 +++
include/linux/pagemap.h | 4 +-
mm/internal.h | 15 ++
mm/memory.c | 344 ++++++++++++++++++++++++++++++++-
5 files changed, 389 insertions(+), 6 deletions(-)

diff --git a/include/linux/hugetlb_inline.h b/include/linux/hugetlb_inline.h
index 0660a03d37d9..9e25283d6fc9 100644
--- a/include/linux/hugetlb_inline.h
+++ b/include/linux/hugetlb_inline.h
@@ -8,7 +8,7 @@

static inline bool is_vm_hugetlb_page(struct vm_area_struct *vma)
{
- return !!(vma->vm_flags & VM_HUGETLB);
+ return !!(READ_ONCE(vma->vm_flags) & VM_HUGETLB);
}

#else
diff --git a/include/linux/mm.h b/include/linux/mm.h
index f761a9c65c74..ec609cbad25a 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -381,6 +381,7 @@ extern pgprot_t protection_map[16];
#define FAULT_FLAG_USER 0x40 /* The fault originated in userspace */
#define FAULT_FLAG_REMOTE 0x80 /* faulting for non current tsk/mm */
#define FAULT_FLAG_INSTRUCTION 0x100 /* The fault was during an instruction fetch */
+#define FAULT_FLAG_SPECULATIVE 0x200 /* Speculative fault, not holding mmap_sem */

#define FAULT_FLAG_TRACE \
{ FAULT_FLAG_WRITE, "WRITE" }, \
@@ -409,6 +410,10 @@ struct vm_fault {
gfp_t gfp_mask; /* gfp mask to be used for allocations */
pgoff_t pgoff; /* Logical page offset based on vma */
unsigned long address; /* Faulting virtual address */
+#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
+ unsigned int sequence;
+ pmd_t orig_pmd; /* value of PMD at the time of fault */
+#endif
pmd_t *pmd; /* Pointer to pmd entry matching
* the 'address' */
pud_t *pud; /* Pointer to pud entry matching
@@ -1524,6 +1529,31 @@ int invalidate_inode_page(struct page *page);
#ifdef CONFIG_MMU
extern vm_fault_t handle_mm_fault(struct vm_area_struct *vma,
unsigned long address, unsigned int flags);
+
+#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
+extern vm_fault_t __handle_speculative_fault(struct mm_struct *mm,
+ unsigned long address,
+ unsigned int flags);
+static inline vm_fault_t handle_speculative_fault(struct mm_struct *mm,
+ unsigned long address,
+ unsigned int flags)
+{
+ /*
+ * Try speculative page fault for multithreaded user space task only.
+ */
+ if (!(flags & FAULT_FLAG_USER) || atomic_read(&mm->mm_users) == 1)
+ return VM_FAULT_RETRY;
+ return __handle_speculative_fault(mm, address, flags);
+}
+#else
+static inline vm_fault_t handle_speculative_fault(struct mm_struct *mm,
+ unsigned long address,
+ unsigned int flags)
+{
+ return VM_FAULT_RETRY;
+}
+#endif /* CONFIG_SPECULATIVE_PAGE_FAULT */
+
extern int fixup_user_fault(struct task_struct *tsk, struct mm_struct *mm,
unsigned long address, unsigned int fault_flags,
bool *unlocked);
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 2e8438a1216a..2fcfaa910007 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -457,8 +457,8 @@ static inline pgoff_t linear_page_index(struct vm_area_struct *vma,
pgoff_t pgoff;
if (unlikely(is_vm_hugetlb_page(vma)))
return linear_hugepage_index(vma, address);
- pgoff = (address - vma->vm_start) >> PAGE_SHIFT;
- pgoff += vma->vm_pgoff;
+ pgoff = (address - READ_ONCE(vma->vm_start)) >> PAGE_SHIFT;
+ pgoff += READ_ONCE(vma->vm_pgoff);
return pgoff;
}

diff --git a/mm/internal.h b/mm/internal.h
index 1e368e4afe3c..ed91b199cb8c 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -58,6 +58,21 @@ static inline void put_vma(struct vm_area_struct *vma)
extern struct vm_area_struct *find_vma_rcu(struct mm_struct *mm,
unsigned long addr);

+
+static inline bool vma_has_changed(struct vm_fault *vmf)
+{
+ int ret = RB_EMPTY_NODE(&vmf->vma->vm_rb);
+ unsigned int seq = READ_ONCE(vmf->vma->vm_sequence.sequence);
+
+ /*
+ * Matches both the wmb in write_seqlock_{begin,end}() and
+ * the wmb in vma_rb_erase().
+ */
+ smp_rmb();
+
+ return ret || seq != vmf->sequence;
+}
+
#else /* CONFIG_SPECULATIVE_PAGE_FAULT */

static inline void get_vma(struct vm_area_struct *vma)
diff --git a/mm/memory.c b/mm/memory.c
index 46f877b6abea..6e6bf61c0e5c 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -522,7 +522,8 @@ static void print_bad_pte(struct vm_area_struct *vma, unsigned long addr,
if (page)
dump_page(page, "bad pte");
pr_alert("addr:%p vm_flags:%08lx anon_vma:%p mapping:%p index:%lx\n",
- (void *)addr, vma->vm_flags, vma->anon_vma, mapping, index);
+ (void *)addr, READ_ONCE(vma->vm_flags), vma->anon_vma,
+ mapping, index);
pr_alert("file:%pD fault:%pf mmap:%pf readpage:%pf\n",
vma->vm_file,
vma->vm_ops ? vma->vm_ops->fault : NULL,
@@ -2082,6 +2083,118 @@ int apply_to_page_range(struct mm_struct *mm, unsigned long addr,
}
EXPORT_SYMBOL_GPL(apply_to_page_range);

+#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
+static bool pte_spinlock(struct vm_fault *vmf)
+{
+ bool ret = false;
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+ pmd_t pmdval;
+#endif
+
+ /* Check if vma is still valid */
+ if (!(vmf->flags & FAULT_FLAG_SPECULATIVE)) {
+ vmf->ptl = pte_lockptr(vmf->vma->vm_mm, vmf->pmd);
+ spin_lock(vmf->ptl);
+ return true;
+ }
+
+again:
+ local_irq_disable();
+ if (vma_has_changed(vmf))
+ goto out;
+
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+ /*
+ * We check if the pmd value is still the same to ensure that there
+ * is not a huge collapse operation in progress in our back.
+ */
+ pmdval = READ_ONCE(*vmf->pmd);
+ if (!pmd_same(pmdval, vmf->orig_pmd))
+ goto out;
+#endif
+
+ vmf->ptl = pte_lockptr(vmf->vma->vm_mm, vmf->pmd);
+ if (unlikely(!spin_trylock(vmf->ptl))) {
+ local_irq_enable();
+ goto again;
+ }
+
+ if (vma_has_changed(vmf)) {
+ spin_unlock(vmf->ptl);
+ goto out;
+ }
+
+ ret = true;
+out:
+ local_irq_enable();
+ return ret;
+}
+
+static bool pte_map_lock(struct vm_fault *vmf)
+{
+ bool ret = false;
+ pte_t *pte;
+ spinlock_t *ptl;
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+ pmd_t pmdval;
+#endif
+
+ if (!(vmf->flags & FAULT_FLAG_SPECULATIVE)) {
+ vmf->pte = pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd,
+ vmf->address, &vmf->ptl);
+ return true;
+ }
+
+ /*
+ * The first vma_has_changed() guarantees the page-tables are still
+ * valid, having IRQs disabled ensures they stay around, hence the
+ * second vma_has_changed() to make sure they are still valid once
+ * we've got the lock. After that a concurrent zap_pte_range() will
+ * block on the PTL and thus we're safe.
+ */
+again:
+ local_irq_disable();
+ if (vma_has_changed(vmf))
+ goto out;
+
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+ /*
+ * We check if the pmd value is still the same to ensure that there
+ * is not a huge collapse operation in progress in our back.
+ */
+ pmdval = READ_ONCE(*vmf->pmd);
+ if (!pmd_same(pmdval, vmf->orig_pmd))
+ goto out;
+#endif
+
+ /*
+ * Same as pte_offset_map_lock() except that we call
+ * spin_trylock() in place of spin_lock() to avoid race with
+ * unmap path which may have the lock and wait for this CPU
+ * to invalidate TLB but this CPU has irq disabled.
+ * Since we are in a speculative path, accept that it could fail
+ */
+ ptl = pte_lockptr(vmf->vma->vm_mm, vmf->pmd);
+ pte = pte_offset_map(vmf->pmd, vmf->address);
+ if (unlikely(!spin_trylock(ptl))) {
+ pte_unmap(pte);
+ local_irq_enable();
+ goto again;
+ }
+
+ if (vma_has_changed(vmf)) {
+ pte_unmap_unlock(pte, ptl);
+ goto out;
+ }
+
+ vmf->pte = pte;
+ vmf->ptl = ptl;
+ ret = true;
+out:
+ local_irq_enable();
+ return ret;
+}
+#else
static inline bool pte_spinlock(struct vm_fault *vmf)
{
vmf->ptl = pte_lockptr(vmf->vma->vm_mm, vmf->pmd);
@@ -2095,6 +2208,7 @@ static inline bool pte_map_lock(struct vm_fault *vmf)
vmf->address, &vmf->ptl);
return true;
}
+#endif /* CONFIG_SPECULATIVE_PAGE_FAULT */

/*
* handle_pte_fault chooses page fault handler according to an entry which was
@@ -2999,6 +3113,14 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
ret = check_stable_address_space(vma->vm_mm);
if (ret)
goto unlock;
+ /*
+ * Don't call the userfaultfd during the speculative path.
+ * We already checked for the VMA to not be managed through
+ * userfaultfd, but it may have been set behind our back after we
+ * locked the pte. In such a case we can ignore it this time.
+ */
+ if (vmf->flags & FAULT_FLAG_SPECULATIVE)
+ goto setpte;
/* Deliver the page fault to userland, check inside PT lock */
if (userfaultfd_missing(vma)) {
pte_unmap_unlock(vmf->pte, vmf->ptl);
@@ -3041,7 +3163,8 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
goto unlock_and_release;

/* Deliver the page fault to userland, check inside PT lock */
- if (userfaultfd_missing(vma)) {
+ if (!(vmf->flags & FAULT_FLAG_SPECULATIVE) &&
+ userfaultfd_missing(vma)) {
pte_unmap_unlock(vmf->pte, vmf->ptl);
mem_cgroup_cancel_charge(page, memcg, false);
put_page(page);
@@ -3836,6 +3959,15 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
pte_t entry;

if (unlikely(pmd_none(*vmf->pmd))) {
+ /*
+ * In the case of the speculative page fault handler we abort
+ * the speculative path immediately as the pmd is probably
+ * about to be converted into a huge one. We will try
+ * again holding the mmap_sem (which implies that the collapse
+ * operation is done).
+ */
+ if (vmf->flags & FAULT_FLAG_SPECULATIVE)
+ return VM_FAULT_RETRY;
/*
* Leave __pte_alloc() until later: because vm_ops->fault may
* want to allocate huge page, and if we expose page table
@@ -3843,7 +3975,7 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
* concurrent faults and from rmap lookups.
*/
vmf->pte = NULL;
- } else {
+ } else if (!(vmf->flags & FAULT_FLAG_SPECULATIVE)) {
/* See comment in pte_alloc_one_map() */
if (pmd_devmap_trans_unstable(vmf->pmd))
return 0;
@@ -3852,6 +3984,9 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
* pmd from under us anymore at this point because we hold the
* mmap_sem read mode and khugepaged takes it in write mode.
* So now it's safe to run pte_offset_map().
+ * This is not applicable to the speculative page fault handler
+ * but in that case, the pte is fetched earlier in
+ * handle_speculative_fault().
*/
vmf->pte = pte_offset_map(vmf->pmd, vmf->address);
vmf->orig_pte = *vmf->pte;
@@ -3874,6 +4009,8 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
if (!vmf->pte) {
if (vma_is_anonymous(vmf->vma))
return do_anonymous_page(vmf);
+ else if (vmf->flags & FAULT_FLAG_SPECULATIVE)
+ return VM_FAULT_RETRY;
else
return do_fault(vmf);
}
@@ -3971,6 +4108,9 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
vmf.pmd = pmd_alloc(mm, vmf.pud, address);
if (!vmf.pmd)
return VM_FAULT_OOM;
+#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
+ vmf.sequence = raw_read_seqcount(&vma->vm_sequence);
+#endif
if (pmd_none(*vmf.pmd) && __transparent_hugepage_enabled(vma)) {
ret = create_huge_pmd(&vmf);
if (!(ret & VM_FAULT_FALLBACK))
@@ -4004,6 +4144,204 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
return handle_pte_fault(&vmf);
}

+#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
+/*
+ * Tries to handle the page fault in a speculative way, without grabbing the
+ * mmap_sem.
+ */
+vm_fault_t __handle_speculative_fault(struct mm_struct *mm,
+ unsigned long address,
+ unsigned int flags)
+{
+ struct vm_fault vmf = {
+ .address = address,
+ };
+ pgd_t *pgd, pgdval;
+ p4d_t *p4d, p4dval;
+ pud_t pudval;
+ int seq;
+ vm_fault_t ret = VM_FAULT_RETRY;
+ struct vm_area_struct *vma;
+#ifdef CONFIG_NUMA
+ struct mempolicy *pol;
+#endif
+
+ /* Clear flags that may lead to release the mmap_sem to retry */
+ flags &= ~(FAULT_FLAG_ALLOW_RETRY|FAULT_FLAG_KILLABLE);
+ flags |= FAULT_FLAG_SPECULATIVE;
+
+ vma = find_vma_rcu(mm, address);
+ if (!vma)
+ return ret;
+
+ /* rmb <-> seqlock,vma_rb_erase() */
+ seq = raw_read_seqcount(&vma->vm_sequence);
+ if (seq & 1)
+ goto out_put;
+
+ /*
+ * Can't call vm_ops services as we don't know what they would do
+ * with the VMA.
+ * This includes huge pages from hugetlbfs.
+ */
+ if (vma->vm_ops && vma->vm_ops->fault)
+ goto out_put;
+
+ /*
+ * __anon_vma_prepare() requires the mmap_sem to be held
+ * because vm_next and vm_prev must be safe. This can't be guaranteed
+ * in the speculative path.
+ */
+ if (unlikely(!vma->anon_vma))
+ goto out_put;
+
+ vmf.vma_flags = READ_ONCE(vma->vm_flags);
+ vmf.vma_page_prot = READ_ONCE(vma->vm_page_prot);
+
+ /* Can't call userland page fault handler in the speculative path */
+ if (unlikely(vmf.vma_flags & VM_UFFD_MISSING))
+ goto out_put;
+
+ if (vmf.vma_flags & VM_GROWSDOWN || vmf.vma_flags & VM_GROWSUP)
+ /*
+ * This could be detected by checking the address against the VMA's
+ * boundaries, but we want to trace it as not supported instead
+ * of changed.
+ */
+ goto out_put;
+
+ if (address < READ_ONCE(vma->vm_start)
+ || READ_ONCE(vma->vm_end) <= address)
+ goto out_put;
+
+ if (!arch_vma_access_permitted(vma, flags & FAULT_FLAG_WRITE,
+ flags & FAULT_FLAG_INSTRUCTION,
+ flags & FAULT_FLAG_REMOTE)) {
+ ret = VM_FAULT_SIGSEGV;
+ goto out_put;
+ }
+
+ /* This one is required to check that the VMA has write access set */
+ if (flags & FAULT_FLAG_WRITE) {
+ if (unlikely(!(vmf.vma_flags & VM_WRITE))) {
+ ret = VM_FAULT_SIGSEGV;
+ goto out_put;
+ }
+ } else if (unlikely(!(vmf.vma_flags & (VM_READ|VM_EXEC|VM_WRITE)))) {
+ ret = VM_FAULT_SIGSEGV;
+ goto out_put;
+ }
+
+#ifdef CONFIG_NUMA
+ /*
+ * MPOL_INTERLEAVE implies additional checks in
+ * mpol_misplaced() which are not compatible with the
+ * speculative page fault processing.
+ */
+ pol = __get_vma_policy(vma, address);
+ if (!pol)
+ pol = get_task_policy(current);
+ if (pol && pol->mode == MPOL_INTERLEAVE)
+ goto out_put;
+#endif
+
+ /*
+ * Do a speculative lookup of the PTE entry.
+ */
+ local_irq_disable();
+ pgd = pgd_offset(mm, address);
+ pgdval = READ_ONCE(*pgd);
+ if (pgd_none(pgdval) || unlikely(pgd_bad(pgdval)))
+ goto out_walk;
+
+ p4d = p4d_offset(pgd, address);
+ p4dval = READ_ONCE(*p4d);
+ if (p4d_none(p4dval) || unlikely(p4d_bad(p4dval)))
+ goto out_walk;
+
+ vmf.pud = pud_offset(p4d, address);
+ pudval = READ_ONCE(*vmf.pud);
+ if (pud_none(pudval) || unlikely(pud_bad(pudval)))
+ goto out_walk;
+
+ /* Huge pages at PUD level are not supported. */
+ if (unlikely(pud_trans_huge(pudval)))
+ goto out_walk;
+
+ vmf.pmd = pmd_offset(vmf.pud, address);
+ vmf.orig_pmd = READ_ONCE(*vmf.pmd);
+ /*
+ * pmd_none could mean that a hugepage collapse is in progress
+ * behind our back as collapse_huge_page() marks it before
+ * invalidating the pte (which is done once the IPI has been caught
+ * by all CPUs and we have interrupts disabled).
+ * For this reason we cannot handle THP in a speculative way since we
+ * can't safely identify an in progress collapse operation done in our
+ * back on that PMD.
+ * Regarding the order of the following checks, see comment in
+ * pmd_devmap_trans_unstable()
+ */
+ if (unlikely(pmd_devmap(vmf.orig_pmd) ||
+ pmd_none(vmf.orig_pmd) || pmd_trans_huge(vmf.orig_pmd) ||
+ is_swap_pmd(vmf.orig_pmd)))
+ goto out_walk;
+
+ /*
+ * The above does not allocate/instantiate page-tables because doing so
+ * would lead to the possibility of instantiating page-tables after
+ * free_pgtables() -- and consequently leaking them.
+ *
+ * The result is that we take at least one !speculative fault per PMD
+ * in order to instantiate it.
+ */
+
+ vmf.pte = pte_offset_map(vmf.pmd, address);
+ vmf.orig_pte = READ_ONCE(*vmf.pte);
+ barrier(); /* See comment in handle_pte_fault() */
+ if (pte_none(vmf.orig_pte)) {
+ pte_unmap(vmf.pte);
+ vmf.pte = NULL;
+ }
+
+ vmf.vma = vma;
+ vmf.pgoff = linear_page_index(vma, address);
+ vmf.gfp_mask = __get_fault_gfp_mask(vma);
+ vmf.sequence = seq;
+ vmf.flags = flags;
+
+ local_irq_enable();
+
+ /*
+ * We need to re-validate the VMA after checking the bounds, otherwise
+ * we might have a false positive on the bounds.
+ */
+ if (read_seqcount_retry(&vma->vm_sequence, seq))
+ goto out_put;
+
+ mem_cgroup_enter_user_fault();
+ ret = handle_pte_fault(&vmf);
+ mem_cgroup_exit_user_fault();
+
+ put_vma(vma);
+
+ /*
+ * The task may have entered a memcg OOM situation but
+ * if the allocation error was handled gracefully (no
+ * VM_FAULT_OOM), there is no need to kill anything.
+ * Just clean up the OOM state peacefully.
+ */
+ if (task_in_memcg_oom(current) && !(ret & VM_FAULT_OOM))
+ mem_cgroup_oom_synchronize(false);
+ return ret;
+
+out_walk:
+ local_irq_enable();
+out_put:
+ put_vma(vma);
+ return ret;
+}
+#endif /* CONFIG_SPECULATIVE_PAGE_FAULT */
+
/*
* By the time we get here, we already hold the mm semaphore
*
--
2.21.0

2019-04-16 13:49:11

by Laurent Dufour

[permalink] [raw]
Subject: [PATCH v12 10/31] mm: protect VMA modifications using VMA sequence count

The VMA sequence count has been introduced to allow fast detection of
VMA modification when running a page fault handler without holding
the mmap_sem.

This patch provides protection against VMA modifications done in:
- madvise()
- mpol_rebind_policy()
- vma_replace_policy()
- change_prot_numa()
- mlock(), munlock()
- mprotect()
- mmap_region()
- collapse_huge_page()
- userfaultfd registering services

In addition, VMA fields which will be read during the speculative fault
path need to be written using WRITE_ONCE() to prevent the writes from being
split and intermediate values from being seen by other CPUs.
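
The per-writer pattern, shown here condensed from the mm/mprotect.c hunk
below, is always the same:

	vm_write_begin(vma);			/* vm_sequence becomes odd, SPF aborts */
	WRITE_ONCE(vma->vm_flags, newflags);	/* single store, no torn value */
	dirty_accountable = vma_wants_writenotify(vma, vma->vm_page_prot);
	vma_set_page_prot(vma);
	change_protection(vma, start, end, vma->vm_page_prot,
			  dirty_accountable, 0);
	vm_write_end(vma);			/* vm_sequence is even again */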

Signed-off-by: Laurent Dufour <[email protected]>
---
fs/proc/task_mmu.c | 5 ++++-
fs/userfaultfd.c | 17 ++++++++++++----
mm/khugepaged.c | 3 +++
mm/madvise.c | 6 +++++-
mm/mempolicy.c | 51 ++++++++++++++++++++++++++++++----------------
mm/mlock.c | 13 +++++++-----
mm/mmap.c | 28 ++++++++++++++++---------
mm/mprotect.c | 4 +++-
mm/swap_state.c | 10 ++++++---
9 files changed, 95 insertions(+), 42 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 01d4eb0e6bd1..0864c050b2de 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -1162,8 +1162,11 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
goto out_mm;
}
for (vma = mm->mmap; vma; vma = vma->vm_next) {
- vma->vm_flags &= ~VM_SOFTDIRTY;
+ vm_write_begin(vma);
+ WRITE_ONCE(vma->vm_flags,
+ vma->vm_flags & ~VM_SOFTDIRTY);
vma_set_page_prot(vma);
+ vm_write_end(vma);
}
downgrade_write(&mm->mmap_sem);
break;
diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 3b30301c90ec..2e0f98cadd81 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -667,8 +667,11 @@ int dup_userfaultfd(struct vm_area_struct *vma, struct list_head *fcs)

octx = vma->vm_userfaultfd_ctx.ctx;
if (!octx || !(octx->features & UFFD_FEATURE_EVENT_FORK)) {
+ vm_write_begin(vma);
vma->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX;
- vma->vm_flags &= ~(VM_UFFD_WP | VM_UFFD_MISSING);
+ WRITE_ONCE(vma->vm_flags,
+ vma->vm_flags & ~(VM_UFFD_WP | VM_UFFD_MISSING));
+ vm_write_end(vma);
return 0;
}

@@ -908,8 +911,10 @@ static int userfaultfd_release(struct inode *inode, struct file *file)
vma = prev;
else
prev = vma;
- vma->vm_flags = new_flags;
+ vm_write_begin(vma);
+ WRITE_ONCE(vma->vm_flags, new_flags);
vma->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX;
+ vm_write_end(vma);
}
skip_mm:
up_write(&mm->mmap_sem);
@@ -1474,8 +1479,10 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx,
* the next vma was merged into the current one and
* the current one has not been updated yet.
*/
- vma->vm_flags = new_flags;
+ vm_write_begin(vma);
+ WRITE_ONCE(vma->vm_flags, new_flags);
vma->vm_userfaultfd_ctx.ctx = ctx;
+ vm_write_end(vma);

skip:
prev = vma;
@@ -1636,8 +1643,10 @@ static int userfaultfd_unregister(struct userfaultfd_ctx *ctx,
* the next vma was merged into the current one and
* the current one has not been updated yet.
*/
- vma->vm_flags = new_flags;
+ vm_write_begin(vma);
+ WRITE_ONCE(vma->vm_flags, new_flags);
vma->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX;
+ vm_write_end(vma);

skip:
prev = vma;
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index a335f7c1fac4..6a0cbca3885e 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1011,6 +1011,7 @@ static void collapse_huge_page(struct mm_struct *mm,
if (mm_find_pmd(mm, address) != pmd)
goto out;

+ vm_write_begin(vma);
anon_vma_lock_write(vma->anon_vma);

pte = pte_offset_map(pmd, address);
@@ -1046,6 +1047,7 @@ static void collapse_huge_page(struct mm_struct *mm,
pmd_populate(mm, pmd, pmd_pgtable(_pmd));
spin_unlock(pmd_ptl);
anon_vma_unlock_write(vma->anon_vma);
+ vm_write_end(vma);
result = SCAN_FAIL;
goto out;
}
@@ -1081,6 +1083,7 @@ static void collapse_huge_page(struct mm_struct *mm,
set_pmd_at(mm, address, pmd, _pmd);
update_mmu_cache_pmd(vma, address, pmd);
spin_unlock(pmd_ptl);
+ vm_write_end(vma);

*hpage = NULL;

diff --git a/mm/madvise.c b/mm/madvise.c
index a692d2a893b5..6cf07dc546fc 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -184,7 +184,9 @@ static long madvise_behavior(struct vm_area_struct *vma,
/*
* vm_flags is protected by the mmap_sem held in write mode.
*/
- vma->vm_flags = new_flags;
+ vm_write_begin(vma);
+ WRITE_ONCE(vma->vm_flags, new_flags);
+ vm_write_end(vma);
out:
return error;
}
@@ -450,9 +452,11 @@ static void madvise_free_page_range(struct mmu_gather *tlb,
.private = tlb,
};

+ vm_write_begin(vma);
tlb_start_vma(tlb, vma);
walk_page_range(addr, end, &free_walk);
tlb_end_vma(tlb, vma);
+ vm_write_end(vma);
}

static int madvise_free_single_vma(struct vm_area_struct *vma,
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 2219e747df49..94c103c5034a 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -380,8 +380,11 @@ void mpol_rebind_mm(struct mm_struct *mm, nodemask_t *new)
struct vm_area_struct *vma;

down_write(&mm->mmap_sem);
- for (vma = mm->mmap; vma; vma = vma->vm_next)
+ for (vma = mm->mmap; vma; vma = vma->vm_next) {
+ vm_write_begin(vma);
mpol_rebind_policy(vma->vm_policy, new);
+ vm_write_end(vma);
+ }
up_write(&mm->mmap_sem);
}

@@ -575,9 +578,11 @@ unsigned long change_prot_numa(struct vm_area_struct *vma,
{
int nr_updated;

+ vm_write_begin(vma);
nr_updated = change_protection(vma, addr, end, PAGE_NONE, 0, 1);
if (nr_updated)
count_vm_numa_events(NUMA_PTE_UPDATES, nr_updated);
+ vm_write_end(vma);

return nr_updated;
}
@@ -683,6 +688,7 @@ static int vma_replace_policy(struct vm_area_struct *vma,
if (IS_ERR(new))
return PTR_ERR(new);

+ vm_write_begin(vma);
if (vma->vm_ops && vma->vm_ops->set_policy) {
err = vma->vm_ops->set_policy(vma, new);
if (err)
@@ -690,11 +696,17 @@ static int vma_replace_policy(struct vm_area_struct *vma,
}

old = vma->vm_policy;
- vma->vm_policy = new; /* protected by mmap_sem */
+ /*
+ * The speculative page fault handler accesses this field without
+ * holding the mmap_sem.
+ */
+ WRITE_ONCE(vma->vm_policy, new);
+ vm_write_end(vma);
mpol_put(old);

return 0;
err_out:
+ vm_write_end(vma);
mpol_put(new);
return err;
}
@@ -1654,23 +1666,28 @@ COMPAT_SYSCALL_DEFINE4(migrate_pages, compat_pid_t, pid,
struct mempolicy *__get_vma_policy(struct vm_area_struct *vma,
unsigned long addr)
{
- struct mempolicy *pol = NULL;
+ struct mempolicy *pol;

- if (vma) {
- if (vma->vm_ops && vma->vm_ops->get_policy) {
- pol = vma->vm_ops->get_policy(vma, addr);
- } else if (vma->vm_policy) {
- pol = vma->vm_policy;
+ if (!vma)
+ return NULL;

- /*
- * shmem_alloc_page() passes MPOL_F_SHARED policy with
- * a pseudo vma whose vma->vm_ops=NULL. Take a reference
- * count on these policies which will be dropped by
- * mpol_cond_put() later
- */
- if (mpol_needs_cond_ref(pol))
- mpol_get(pol);
- }
+ if (vma->vm_ops && vma->vm_ops->get_policy)
+ return vma->vm_ops->get_policy(vma, addr);
+
+ /*
+ * This could be called without holding the mmap_sem in the
+ * speculative page fault handler's path.
+ */
+ pol = READ_ONCE(vma->vm_policy);
+ if (pol) {
+ /*
+ * shmem_alloc_page() passes MPOL_F_SHARED policy with
+ * a pseudo vma whose vma->vm_ops=NULL. Take a reference
+ * count on these policies which will be dropped by
+ * mpol_cond_put() later
+ */
+ if (mpol_needs_cond_ref(pol))
+ mpol_get(pol);
}

return pol;
diff --git a/mm/mlock.c b/mm/mlock.c
index 080f3b36415b..f390903d9bbb 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -445,7 +445,9 @@ static unsigned long __munlock_pagevec_fill(struct pagevec *pvec,
void munlock_vma_pages_range(struct vm_area_struct *vma,
unsigned long start, unsigned long end)
{
- vma->vm_flags &= VM_LOCKED_CLEAR_MASK;
+ vm_write_begin(vma);
+ WRITE_ONCE(vma->vm_flags, vma->vm_flags & VM_LOCKED_CLEAR_MASK);
+ vm_write_end(vma);

while (start < end) {
struct page *page;
@@ -569,10 +571,11 @@ static int mlock_fixup(struct vm_area_struct *vma, struct vm_area_struct **prev,
* It's okay if try_to_unmap_one unmaps a page just after we
* set VM_LOCKED, populate_vma_page_range will bring it back.
*/
-
- if (lock)
- vma->vm_flags = newflags;
- else
+ if (lock) {
+ vm_write_begin(vma);
+ WRITE_ONCE(vma->vm_flags, newflags);
+ vm_write_end(vma);
+ } else
munlock_vma_pages_range(vma, start, end);

out:
diff --git a/mm/mmap.c b/mm/mmap.c
index a4e4d52a5148..b77ec0149249 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -877,17 +877,18 @@ int __vma_adjust(struct vm_area_struct *vma, unsigned long start,
}

if (start != vma->vm_start) {
- vma->vm_start = start;
+ WRITE_ONCE(vma->vm_start, start);
start_changed = true;
}
if (end != vma->vm_end) {
- vma->vm_end = end;
+ WRITE_ONCE(vma->vm_end, end);
end_changed = true;
}
- vma->vm_pgoff = pgoff;
+ WRITE_ONCE(vma->vm_pgoff, pgoff);
if (adjust_next) {
- next->vm_start += adjust_next << PAGE_SHIFT;
- next->vm_pgoff += adjust_next;
+ WRITE_ONCE(next->vm_start,
+ next->vm_start + (adjust_next << PAGE_SHIFT));
+ WRITE_ONCE(next->vm_pgoff, next->vm_pgoff + adjust_next);
}

if (root) {
@@ -1850,12 +1851,14 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
out:
perf_event_mmap(vma);

+ vm_write_begin(vma);
vm_stat_account(mm, vm_flags, len >> PAGE_SHIFT);
if (vm_flags & VM_LOCKED) {
if ((vm_flags & VM_SPECIAL) || vma_is_dax(vma) ||
is_vm_hugetlb_page(vma) ||
vma == get_gate_vma(current->mm))
- vma->vm_flags &= VM_LOCKED_CLEAR_MASK;
+ WRITE_ONCE(vma->vm_flags,
+ vma->vm_flags & VM_LOCKED_CLEAR_MASK);
else
mm->locked_vm += (len >> PAGE_SHIFT);
}
@@ -1870,9 +1873,10 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
* then new mapped in-place (which must be aimed as
* a completely new data area).
*/
- vma->vm_flags |= VM_SOFTDIRTY;
+ WRITE_ONCE(vma->vm_flags, vma->vm_flags | VM_SOFTDIRTY);

vma_set_page_prot(vma);
+ vm_write_end(vma);

return addr;

@@ -2430,7 +2434,9 @@ int expand_upwards(struct vm_area_struct *vma, unsigned long address)
mm->locked_vm += grow;
vm_stat_account(mm, vma->vm_flags, grow);
anon_vma_interval_tree_pre_update_vma(vma);
- vma->vm_end = address;
+ vm_write_begin(vma);
+ WRITE_ONCE(vma->vm_end, address);
+ vm_write_end(vma);
anon_vma_interval_tree_post_update_vma(vma);
if (vma->vm_next)
vma_gap_update(vma->vm_next);
@@ -2510,8 +2516,10 @@ int expand_downwards(struct vm_area_struct *vma,
mm->locked_vm += grow;
vm_stat_account(mm, vma->vm_flags, grow);
anon_vma_interval_tree_pre_update_vma(vma);
- vma->vm_start = address;
- vma->vm_pgoff -= grow;
+ vm_write_begin(vma);
+ WRITE_ONCE(vma->vm_start, address);
+ WRITE_ONCE(vma->vm_pgoff, vma->vm_pgoff - grow);
+ vm_write_end(vma);
anon_vma_interval_tree_post_update_vma(vma);
vma_gap_update(vma);
spin_unlock(&mm->page_table_lock);
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 65242f1e4457..78fce873ca3a 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -427,12 +427,14 @@ mprotect_fixup(struct vm_area_struct *vma, struct vm_area_struct **pprev,
* vm_flags and vm_page_prot are protected by the mmap_sem
* held in write mode.
*/
- vma->vm_flags = newflags;
+ vm_write_begin(vma);
+ WRITE_ONCE(vma->vm_flags, newflags);
dirty_accountable = vma_wants_writenotify(vma, vma->vm_page_prot);
vma_set_page_prot(vma);

change_protection(vma, start, end, vma->vm_page_prot,
dirty_accountable, 0);
+ vm_write_end(vma);

/*
* Private VM_LOCKED VMA becoming writable: trigger COW to avoid major
diff --git a/mm/swap_state.c b/mm/swap_state.c
index eb714165afd2..c45f9122b457 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -523,7 +523,11 @@ static unsigned long swapin_nr_pages(unsigned long offset)
* This has been extended to use the NUMA policies from the mm triggering
* the readahead.
*
- * Caller must hold read mmap_sem if vmf->vma is not NULL.
+ * Caller must hold down_read on the vma->vm_mm if vmf->vma is not NULL.
+ * This is needed to ensure the VMA will not be freed in our back. In the case
+ * of the speculative page fault handler, this cannot happen, even if we don't
+ * hold the mmap_sem. Callees are assumed to take care of reading VMA's fields
+ * using READ_ONCE() to read consistent values.
*/
struct page *swap_cluster_readahead(swp_entry_t entry, gfp_t gfp_mask,
struct vm_fault *vmf)
@@ -624,9 +628,9 @@ static inline void swap_ra_clamp_pfn(struct vm_area_struct *vma,
unsigned long *start,
unsigned long *end)
{
- *start = max3(lpfn, PFN_DOWN(vma->vm_start),
+ *start = max3(lpfn, PFN_DOWN(READ_ONCE(vma->vm_start)),
PFN_DOWN(faddr & PMD_MASK));
- *end = min3(rpfn, PFN_DOWN(vma->vm_end),
+ *end = min3(rpfn, PFN_DOWN(READ_ONCE(vma->vm_end)),
PFN_DOWN((faddr & PMD_MASK) + PMD_SIZE));
}

--
2.21.0

2019-04-16 13:49:14

by Laurent Dufour

[permalink] [raw]
Subject: [PATCH v12 16/31] mm: introduce __vm_normal_page()

When dealing with the speculative fault path we should use the VMA's
cached field values stored in the vm_fault structure.

Currently vm_normal_page() is using the pointer to the VMA to fetch the
vm_flags value. This patch provides a new __vm_normal_page() which receives
the vm_flags value as a parameter.

Note: The speculative path is only turned on for architectures providing
support for the special PTE flag, so only the first block of
__vm_normal_page() is used during the speculative path.
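
Condensed from the do_wp_page() hunk below, the two call flavours are:

	/* regular path: flags read through the VMA pointer, as before */
	page = vm_normal_page(vma, vmf->address, vmf->orig_pte);

	/* speculative path: flags snapshotted into the vm_fault at entry */
	page = __vm_normal_page(vma, vmf->address, vmf->orig_pte, false,
				vmf->vma_flags);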

Signed-off-by: Laurent Dufour <[email protected]>
---
include/linux/mm.h | 18 +++++++++++++++---
mm/memory.c | 21 ++++++++++++---------
2 files changed, 27 insertions(+), 12 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index f465bb2b049e..f14b2c9ddfd4 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1421,9 +1421,21 @@ static inline void INIT_VMA(struct vm_area_struct *vma)
#endif
}

-struct page *_vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
- pte_t pte, bool with_public_device);
-#define vm_normal_page(vma, addr, pte) _vm_normal_page(vma, addr, pte, false)
+struct page *__vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
+ pte_t pte, bool with_public_device,
+ unsigned long vma_flags);
+static inline struct page *_vm_normal_page(struct vm_area_struct *vma,
+ unsigned long addr, pte_t pte,
+ bool with_public_device)
+{
+ return __vm_normal_page(vma, addr, pte, with_public_device,
+ vma->vm_flags);
+}
+static inline struct page *vm_normal_page(struct vm_area_struct *vma,
+ unsigned long addr, pte_t pte)
+{
+ return _vm_normal_page(vma, addr, pte, false);
+}

struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr,
pmd_t pmd);
diff --git a/mm/memory.c b/mm/memory.c
index 85ec5ce5c0a8..be93f2c8ebe0 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -533,7 +533,8 @@ static void print_bad_pte(struct vm_area_struct *vma, unsigned long addr,
}

/*
- * vm_normal_page -- This function gets the "struct page" associated with a pte.
+ * __vm_normal_page -- This function gets the "struct page" associated with
+ * a pte.
*
* "Special" mappings do not wish to be associated with a "struct page" (either
* it doesn't exist, or it exists but they don't want to touch it). In this
@@ -574,8 +575,9 @@ static void print_bad_pte(struct vm_area_struct *vma, unsigned long addr,
* PFNMAP mappings in order to support COWable mappings.
*
*/
-struct page *_vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
- pte_t pte, bool with_public_device)
+struct page *__vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
+ pte_t pte, bool with_public_device,
+ unsigned long vma_flags)
{
unsigned long pfn = pte_pfn(pte);

@@ -584,7 +586,7 @@ struct page *_vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
goto check_pfn;
if (vma->vm_ops && vma->vm_ops->find_special_page)
return vma->vm_ops->find_special_page(vma, addr);
- if (vma->vm_flags & (VM_PFNMAP | VM_MIXEDMAP))
+ if (vma_flags & (VM_PFNMAP | VM_MIXEDMAP))
return NULL;
if (is_zero_pfn(pfn))
return NULL;
@@ -620,8 +622,8 @@ struct page *_vm_normal_page(struct vm_area_struct *vma, unsigned long addr,

/* !CONFIG_ARCH_HAS_PTE_SPECIAL case follows: */

- if (unlikely(vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP))) {
- if (vma->vm_flags & VM_MIXEDMAP) {
+ if (unlikely(vma_flags & (VM_PFNMAP|VM_MIXEDMAP))) {
+ if (vma_flags & VM_MIXEDMAP) {
if (!pfn_valid(pfn))
return NULL;
goto out;
@@ -630,7 +632,7 @@ struct page *_vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
off = (addr - vma->vm_start) >> PAGE_SHIFT;
if (pfn == vma->vm_pgoff + off)
return NULL;
- if (!is_cow_mapping(vma->vm_flags))
+ if (!is_cow_mapping(vma_flags))
return NULL;
}
}
@@ -2532,7 +2534,8 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf)
{
struct vm_area_struct *vma = vmf->vma;

- vmf->page = vm_normal_page(vma, vmf->address, vmf->orig_pte);
+ vmf->page = __vm_normal_page(vma, vmf->address, vmf->orig_pte, false,
+ vmf->vma_flags);
if (!vmf->page) {
/*
* VM_MIXEDMAP !pfn_valid() case, or VM_SOFTDIRTY clear on a
@@ -3706,7 +3709,7 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
ptep_modify_prot_commit(vma, vmf->address, vmf->pte, old_pte, pte);
update_mmu_cache(vma, vmf->address, vmf->pte);

- page = vm_normal_page(vma, vmf->address, pte);
+ page = __vm_normal_page(vma, vmf->address, pte, false, vmf->vma_flags);
if (!page) {
pte_unmap_unlock(vmf->pte, vmf->ptl);
return 0;
--
2.21.0

2019-04-16 13:49:18

by Laurent Dufour

[permalink] [raw]
Subject: [PATCH v12 08/31] mm: introduce INIT_VMA()

Some VMA struct fields need to be initialized once the VMA structure is
allocated.
Currently this only concerns the anon_vma_chain field, but others will be
added to support the speculative page fault.

Instead of spreading the initialization calls all over the code, let's
introduce a dedicated inline function.

Signed-off-by: Laurent Dufour <[email protected]>
---
fs/exec.c | 1 +
include/linux/mm.h | 5 +++++
kernel/fork.c | 2 +-
mm/mmap.c | 3 +++
mm/nommu.c | 1 +
5 files changed, 11 insertions(+), 1 deletion(-)

diff --git a/fs/exec.c b/fs/exec.c
index 2e0033348d8e..9762e060295c 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -266,6 +266,7 @@ static int __bprm_mm_init(struct linux_binprm *bprm)
vma->vm_start = vma->vm_end - PAGE_SIZE;
vma->vm_flags = VM_SOFTDIRTY | VM_STACK_FLAGS | VM_STACK_INCOMPLETE_SETUP;
vma->vm_page_prot = vm_get_page_prot(vma->vm_flags);
+ INIT_VMA(vma);

err = insert_vm_struct(mm, vma);
if (err)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 4ba2f53f9d60..2ceb1d2869a6 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1407,6 +1407,11 @@ struct zap_details {
pgoff_t last_index; /* Highest page->index to unmap */
};

+static inline void INIT_VMA(struct vm_area_struct *vma)
+{
+ INIT_LIST_HEAD(&vma->anon_vma_chain);
+}
+
struct page *_vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
pte_t pte, bool with_public_device);
#define vm_normal_page(vma, addr, pte) _vm_normal_page(vma, addr, pte, false)
diff --git a/kernel/fork.c b/kernel/fork.c
index 915be4918a2b..f8dae021c2e5 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -341,7 +341,7 @@ struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)

if (new) {
*new = *orig;
- INIT_LIST_HEAD(&new->anon_vma_chain);
+ INIT_VMA(new);
}
return new;
}
diff --git a/mm/mmap.c b/mm/mmap.c
index bd7b9f293b39..5ad3a3228d76 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1765,6 +1765,7 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
vma->vm_flags = vm_flags;
vma->vm_page_prot = vm_get_page_prot(vm_flags);
vma->vm_pgoff = pgoff;
+ INIT_VMA(vma);

if (file) {
if (vm_flags & VM_DENYWRITE) {
@@ -3037,6 +3038,7 @@ static int do_brk_flags(unsigned long addr, unsigned long len, unsigned long fla
}

vma_set_anonymous(vma);
+ INIT_VMA(vma);
vma->vm_start = addr;
vma->vm_end = addr + len;
vma->vm_pgoff = pgoff;
@@ -3395,6 +3397,7 @@ static struct vm_area_struct *__install_special_mapping(
if (unlikely(vma == NULL))
return ERR_PTR(-ENOMEM);

+ INIT_VMA(vma);
vma->vm_start = addr;
vma->vm_end = addr + len;

diff --git a/mm/nommu.c b/mm/nommu.c
index 749276beb109..acf7ca72ca90 100644
--- a/mm/nommu.c
+++ b/mm/nommu.c
@@ -1210,6 +1210,7 @@ unsigned long do_mmap(struct file *file,
region->vm_flags = vm_flags;
region->vm_pgoff = pgoff;

+ INIT_VMA(vma);
vma->vm_flags = vm_flags;
vma->vm_pgoff = pgoff;

--
2.21.0

2019-04-16 13:49:24

by Laurent Dufour

[permalink] [raw]
Subject: [PATCH v12 07/31] mm: make pte_unmap_same compatible with SPF

pte_unmap_same() assumes that the page tables are still around because the
mmap_sem is held.
This is no longer the case when running a speculative page fault, and an
additional check must be made to ensure that the final page tables are still
there.

This is now done by calling pte_spinlock() to check the VMA's consistency
while locking the page tables.

This requires passing a vm_fault structure to pte_unmap_same(), which
contains all the needed parameters.

As pte_spinlock() may fail in the case of a speculative page fault, if the
VMA has been touched behind our back, pte_unmap_same() now returns one of
three cases:
1. the PTEs are the same (0)
2. the PTEs are different (VM_FAULT_PTNOTSAME)
3. a VMA change has been detected (VM_FAULT_RETRY)

Case 2 is handled by introducing a new VM_FAULT flag named
VM_FAULT_PTNOTSAME, which is then trapped in cow_user_page().
If VM_FAULT_RETRY is returned, it is passed up to the callers so that the
page fault can be retried while holding the mmap_sem.

Acked-by: David Rientjes <[email protected]>
Signed-off-by: Laurent Dufour <[email protected]>
---
include/linux/mm_types.h | 6 +++++-
mm/memory.c | 37 +++++++++++++++++++++++++++----------
2 files changed, 32 insertions(+), 11 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 8ec38b11b361..fd7d38ee2e33 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -652,6 +652,8 @@ typedef __bitwise unsigned int vm_fault_t;
* @VM_FAULT_NEEDDSYNC: ->fault did not modify page tables and needs
* fsync() to complete (for synchronous page faults
* in DAX)
+ * @VM_FAULT_PTNOTSAME Page table entries have changed during a
+ * speculative page fault handling.
* @VM_FAULT_HINDEX_MASK: mask HINDEX value
*
*/
@@ -669,6 +671,7 @@ enum vm_fault_reason {
VM_FAULT_FALLBACK = (__force vm_fault_t)0x000800,
VM_FAULT_DONE_COW = (__force vm_fault_t)0x001000,
VM_FAULT_NEEDDSYNC = (__force vm_fault_t)0x002000,
+ VM_FAULT_PTNOTSAME = (__force vm_fault_t)0x004000,
VM_FAULT_HINDEX_MASK = (__force vm_fault_t)0x0f0000,
};

@@ -693,7 +696,8 @@ enum vm_fault_reason {
{ VM_FAULT_RETRY, "RETRY" }, \
{ VM_FAULT_FALLBACK, "FALLBACK" }, \
{ VM_FAULT_DONE_COW, "DONE_COW" }, \
- { VM_FAULT_NEEDDSYNC, "NEEDDSYNC" }
+ { VM_FAULT_NEEDDSYNC, "NEEDDSYNC" }, \
+ { VM_FAULT_PTNOTSAME, "PTNOTSAME" }

struct vm_special_mapping {
const char *name; /* The name, e.g. "[vdso]". */
diff --git a/mm/memory.c b/mm/memory.c
index 221ccdf34991..d5bebca47d98 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2094,21 +2094,29 @@ static inline bool pte_map_lock(struct vm_fault *vmf)
* parts, do_swap_page must check under lock before unmapping the pte and
* proceeding (but do_wp_page is only called after already making such a check;
* and do_anonymous_page can safely check later on).
+ *
+ * pte_unmap_same() returns:
+ * 0 if the PTE are the same
+ * VM_FAULT_PTNOTSAME if the PTE are different
+ * VM_FAULT_RETRY if the VMA has changed in our back during
+ * a speculative page fault handling.
*/
-static inline int pte_unmap_same(struct mm_struct *mm, pmd_t *pmd,
- pte_t *page_table, pte_t orig_pte)
+static inline vm_fault_t pte_unmap_same(struct vm_fault *vmf)
{
- int same = 1;
+ int ret = 0;
+
#if defined(CONFIG_SMP) || defined(CONFIG_PREEMPT)
if (sizeof(pte_t) > sizeof(unsigned long)) {
- spinlock_t *ptl = pte_lockptr(mm, pmd);
- spin_lock(ptl);
- same = pte_same(*page_table, orig_pte);
- spin_unlock(ptl);
+ if (pte_spinlock(vmf)) {
+ if (!pte_same(*vmf->pte, vmf->orig_pte))
+ ret = VM_FAULT_PTNOTSAME;
+ spin_unlock(vmf->ptl);
+ } else
+ ret = VM_FAULT_RETRY;
}
#endif
- pte_unmap(page_table);
- return same;
+ pte_unmap(vmf->pte);
+ return ret;
}

static inline void cow_user_page(struct page *dst, struct page *src, unsigned long va, struct vm_area_struct *vma)
@@ -2714,8 +2722,17 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
int exclusive = 0;
vm_fault_t ret = 0;

- if (!pte_unmap_same(vma->vm_mm, vmf->pmd, vmf->pte, vmf->orig_pte))
+ ret = pte_unmap_same(vmf);
+ if (ret) {
+ /*
+ * If pte != orig_pte, this means another thread did the
+ * swap operation in our back.
+ * So nothing else to do.
+ */
+ if (ret == VM_FAULT_PTNOTSAME)
+ ret = 0;
goto out;
+ }

entry = pte_to_swp_entry(vmf->orig_pte);
if (unlikely(non_swap_entry(entry))) {
--
2.21.0

2019-04-16 13:49:37

by Laurent Dufour

[permalink] [raw]
Subject: [PATCH v12 06/31] mm: introduce pte_spinlock for FAULT_FLAG_SPECULATIVE

When handling a page fault without holding the mmap_sem, the fetch of the
PTE lock pointer and the locking will have to be done while ensuring that
the VMA is not changed behind our back.

So move the fetch and locking operations into a dedicated function.
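
For reference, here is a rough sketch (not part of this patch) of what the
speculative variant of pte_spinlock() could look like once
FAULT_FLAG_SPECULATIVE and the VMA sequence count checks are wired in by
later patches; the vma_has_changed() helper named below is hypothetical:

static inline bool pte_spinlock(struct vm_fault *vmf)
{
	/* Classic path: the mmap_sem is held, just take the PTL. */
	if (!(vmf->flags & FAULT_FLAG_SPECULATIVE)) {
		vmf->ptl = pte_lockptr(vmf->vma->vm_mm, vmf->pmd);
		spin_lock(vmf->ptl);
		return true;
	}

	/*
	 * Speculative path: take the PTL, then re-check the VMA sequence
	 * count; if the VMA changed behind our back, the caller returns
	 * VM_FAULT_RETRY and a classic page fault is tried instead.
	 */
	vmf->ptl = pte_lockptr(vmf->vma->vm_mm, vmf->pmd);
	spin_lock(vmf->ptl);
	if (vma_has_changed(vmf)) {	/* hypothetical sequence-count check */
		spin_unlock(vmf->ptl);
		return false;
	}
	return true;
}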

Signed-off-by: Laurent Dufour <[email protected]>
---
mm/memory.c | 15 +++++++++++----
1 file changed, 11 insertions(+), 4 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index fc3698d13cb5..221ccdf34991 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2073,6 +2073,13 @@ int apply_to_page_range(struct mm_struct *mm, unsigned long addr,
}
EXPORT_SYMBOL_GPL(apply_to_page_range);

+static inline bool pte_spinlock(struct vm_fault *vmf)
+{
+ vmf->ptl = pte_lockptr(vmf->vma->vm_mm, vmf->pmd);
+ spin_lock(vmf->ptl);
+ return true;
+}
+
static inline bool pte_map_lock(struct vm_fault *vmf)
{
vmf->pte = pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd,
@@ -3656,8 +3663,8 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
* validation through pte_unmap_same(). It's of NUMA type but
* the pfn may be screwed if the read is non atomic.
*/
- vmf->ptl = pte_lockptr(vma->vm_mm, vmf->pmd);
- spin_lock(vmf->ptl);
+ if (!pte_spinlock(vmf))
+ return VM_FAULT_RETRY;
if (unlikely(!pte_same(*vmf->pte, vmf->orig_pte))) {
pte_unmap_unlock(vmf->pte, vmf->ptl);
goto out;
@@ -3850,8 +3857,8 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
if (pte_protnone(vmf->orig_pte) && vma_is_accessible(vmf->vma))
return do_numa_page(vmf);

- vmf->ptl = pte_lockptr(vmf->vma->vm_mm, vmf->pmd);
- spin_lock(vmf->ptl);
+ if (!pte_spinlock(vmf))
+ return VM_FAULT_RETRY;
entry = vmf->orig_pte;
if (unlikely(!pte_same(*vmf->pte, entry)))
goto unlock;
--
2.21.0

2019-04-16 13:49:40

by Laurent Dufour

[permalink] [raw]
Subject: [PATCH v12 29/31] powerpc/mm: add speculative page fault

This patch enables the speculative page fault on the PowerPC
architecture.

This will try a speculative page fault without holding the mmap_sem; if it
returns VM_FAULT_RETRY, the mmap_sem is acquired and the traditional page
fault processing is done.

The speculative path is only tried for multithreaded processes as there is
no risk of contention on the mmap_sem otherwise.

Signed-off-by: Laurent Dufour <[email protected]>
---
arch/powerpc/mm/fault.c | 16 ++++++++++++++++
1 file changed, 16 insertions(+)

diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
index ec74305fa330..5d48016073cb 100644
--- a/arch/powerpc/mm/fault.c
+++ b/arch/powerpc/mm/fault.c
@@ -491,6 +491,21 @@ static int __do_page_fault(struct pt_regs *regs, unsigned long address,
if (is_exec)
flags |= FAULT_FLAG_INSTRUCTION;

+ /*
+ * Try speculative page fault before grabbing the mmap_sem.
+ * The Page fault is done if VM_FAULT_RETRY is not returned.
+ * But if the memory protection keys are active, we don't know if the
+ * fault is due to key mismatch or due to a classic protection check.
+ * To differentiate that, we will need the VMA we no more have, so
+ * let's retry with the mmap_sem held.
+ */
+ fault = handle_speculative_fault(mm, address, flags);
+ if (fault != VM_FAULT_RETRY && (IS_ENABLED(CONFIG_PPC_MEM_KEYS) &&
+ fault != VM_FAULT_SIGSEGV)) {
+ perf_sw_event(PERF_COUNT_SW_SPF, 1, regs, address);
+ goto done;
+ }
+
/* When running in the kernel we expect faults to occur only to
* addresses in user space. All other faults represent errors in the
* kernel and should generate an OOPS. Unfortunately, in the case of an
@@ -600,6 +615,7 @@ static int __do_page_fault(struct pt_regs *regs, unsigned long address,

up_read(&current->mm->mmap_sem);

+done:
if (unlikely(fault & VM_FAULT_ERROR))
return mm_fault_error(regs, address, fault);

--
2.21.0

2019-04-16 13:49:48

by Laurent Dufour

[permalink] [raw]
Subject: [PATCH v12 26/31] perf tools: add support for the SPF perf event

Add support for the new speculative faults event.
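
As an illustration only (not part of this patch), here is a minimal sketch
of counting the new software event from user space through
perf_event_open(), assuming kernel and uapi headers patched with
PERF_COUNT_SW_SPF; with the perf tool changes below, the same counter is
also reachable as the "speculative-faults" / "spf" event, e.g. via perf stat.

#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <string.h>
#include <stdio.h>

int main(void)
{
	struct perf_event_attr attr;
	long long count = 0;
	int fd;

	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);
	attr.type = PERF_TYPE_SOFTWARE;
	attr.config = PERF_COUNT_SW_SPF;	/* added by this series */

	/* Count speculative page faults for this process, on any CPU. */
	fd = syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
	if (fd < 0) {
		perror("perf_event_open");
		return 1;
	}

	/* ... run the workload to be measured ... */

	if (read(fd, &count, sizeof(count)) != sizeof(count))
		count = 0;
	printf("speculative-faults: %lld\n", count);
	close(fd);
	return 0;
}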

Acked-by: David Rientjes <[email protected]>
Signed-off-by: Laurent Dufour <[email protected]>
---
tools/include/uapi/linux/perf_event.h | 1 +
tools/perf/util/evsel.c | 1 +
tools/perf/util/parse-events.c | 4 ++++
tools/perf/util/parse-events.l | 1 +
tools/perf/util/python.c | 1 +
5 files changed, 8 insertions(+)

diff --git a/tools/include/uapi/linux/perf_event.h b/tools/include/uapi/linux/perf_event.h
index 7198ddd0c6b1..3b4356c55caa 100644
--- a/tools/include/uapi/linux/perf_event.h
+++ b/tools/include/uapi/linux/perf_event.h
@@ -112,6 +112,7 @@ enum perf_sw_ids {
PERF_COUNT_SW_EMULATION_FAULTS = 8,
PERF_COUNT_SW_DUMMY = 9,
PERF_COUNT_SW_BPF_OUTPUT = 10,
+ PERF_COUNT_SW_SPF = 11,

PERF_COUNT_SW_MAX, /* non-ABI */
};
diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
index 66d066f18b5b..1f3bea4379b2 100644
--- a/tools/perf/util/evsel.c
+++ b/tools/perf/util/evsel.c
@@ -435,6 +435,7 @@ const char *perf_evsel__sw_names[PERF_COUNT_SW_MAX] = {
"alignment-faults",
"emulation-faults",
"dummy",
+ "speculative-faults",
};

static const char *__perf_evsel__sw_name(u64 config)
diff --git a/tools/perf/util/parse-events.c b/tools/perf/util/parse-events.c
index 5ef4939408f2..effa8929cc90 100644
--- a/tools/perf/util/parse-events.c
+++ b/tools/perf/util/parse-events.c
@@ -140,6 +140,10 @@ struct event_symbol event_symbols_sw[PERF_COUNT_SW_MAX] = {
.symbol = "bpf-output",
.alias = "",
},
+ [PERF_COUNT_SW_SPF] = {
+ .symbol = "speculative-faults",
+ .alias = "spf",
+ },
};

#define __PERF_EVENT_FIELD(config, name) \
diff --git a/tools/perf/util/parse-events.l b/tools/perf/util/parse-events.l
index 7805c71aaae2..d28a6edd0a95 100644
--- a/tools/perf/util/parse-events.l
+++ b/tools/perf/util/parse-events.l
@@ -324,6 +324,7 @@ emulation-faults { return sym(yyscanner, PERF_TYPE_SOFTWARE, PERF_COUNT_SW_EM
dummy { return sym(yyscanner, PERF_TYPE_SOFTWARE, PERF_COUNT_SW_DUMMY); }
duration_time { return sym(yyscanner, PERF_TYPE_SOFTWARE, PERF_COUNT_SW_DUMMY); }
bpf-output { return sym(yyscanner, PERF_TYPE_SOFTWARE, PERF_COUNT_SW_BPF_OUTPUT); }
+speculative-faults|spf { return sym(yyscanner, PERF_TYPE_SOFTWARE, PERF_COUNT_SW_SPF); }

/*
* We have to handle the kernel PMU event cycles-ct/cycles-t/mem-loads/mem-stores separately.
diff --git a/tools/perf/util/python.c b/tools/perf/util/python.c
index dda0ac978b1e..c617a4751549 100644
--- a/tools/perf/util/python.c
+++ b/tools/perf/util/python.c
@@ -1200,6 +1200,7 @@ static struct {
PERF_CONST(COUNT_SW_ALIGNMENT_FAULTS),
PERF_CONST(COUNT_SW_EMULATION_FAULTS),
PERF_CONST(COUNT_SW_DUMMY),
+ PERF_CONST(COUNT_SW_SPF),

PERF_CONST(SAMPLE_IP),
PERF_CONST(SAMPLE_TID),
--
2.21.0

2019-04-16 13:49:48

by Laurent Dufour

[permalink] [raw]
Subject: [PATCH v12 13/31] mm: cache some VMA fields in the vm_fault structure

When handling a speculative page fault, the vma->vm_flags and
vma->vm_page_prot fields are read once the page table lock is released, so
there is no longer any guarantee that these fields have not changed behind
our back. They will be saved in the vm_fault structure before the VMA is
checked for changes.

In detail, when we deal with a speculative page fault, the mmap_sem is not
taken, so parallel VMA changes can occur. When a VMA change is done which
will impact the page fault processing, we assume that the VMA sequence
counter will be changed. In the page fault processing, at the time the PTE
is locked, we check the VMA sequence counter to detect changes done behind
our back. If no change is detected we can continue further. But this
doesn't prevent the VMA from being changed behind our back while the PTE is
locked. So the VMA fields which are used while the PTE is locked must be
saved to ensure that we are using *static* values. This is important since
the PTE changes will be made with regard to these VMA fields and they need
to be consistent. This concerns the vma->vm_flags and vma->vm_page_prot
VMA fields.

This patch also sets these fields in hugetlb_no_page() and
__collapse_huge_page_swapin() even if they are not needed by the callee.

Signed-off-by: Laurent Dufour <[email protected]>
---
include/linux/mm.h | 10 +++++++--
mm/huge_memory.c | 6 +++---
mm/hugetlb.c | 2 ++
mm/khugepaged.c | 2 ++
mm/memory.c | 53 ++++++++++++++++++++++++----------------------
mm/migrate.c | 2 +-
6 files changed, 44 insertions(+), 31 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 5d45b7d8718d..f465bb2b049e 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -439,6 +439,12 @@ struct vm_fault {
* page table to avoid allocation from
* atomic context.
*/
+ /*
+ * These entries are required when handling speculative page fault.
+ * This way the page handling is done using consistent field values.
+ */
+ unsigned long vma_flags;
+ pgprot_t vma_page_prot;
};

/* page entry size for vm->huge_fault() */
@@ -781,9 +787,9 @@ void free_compound_page(struct page *page);
* pte_mkwrite. But get_user_pages can cause write faults for mappings
* that do not have writing enabled, when used by access_process_vm.
*/
-static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
+static inline pte_t maybe_mkwrite(pte_t pte, unsigned long vma_flags)
{
- if (likely(vma->vm_flags & VM_WRITE))
+ if (likely(vma_flags & VM_WRITE))
pte = pte_mkwrite(pte);
return pte;
}
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 823688414d27..865886a689ee 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1244,8 +1244,8 @@ static vm_fault_t do_huge_pmd_wp_page_fallback(struct vm_fault *vmf,

for (i = 0; i < HPAGE_PMD_NR; i++, haddr += PAGE_SIZE) {
pte_t entry;
- entry = mk_pte(pages[i], vma->vm_page_prot);
- entry = maybe_mkwrite(pte_mkdirty(entry), vma);
+ entry = mk_pte(pages[i], vmf->vma_page_prot);
+ entry = maybe_mkwrite(pte_mkdirty(entry), vmf->vma_flags);
memcg = (void *)page_private(pages[i]);
set_page_private(pages[i], 0);
page_add_new_anon_rmap(pages[i], vmf->vma, haddr, false);
@@ -2228,7 +2228,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
entry = pte_swp_mksoft_dirty(entry);
} else {
entry = mk_pte(page + i, READ_ONCE(vma->vm_page_prot));
- entry = maybe_mkwrite(entry, vma);
+ entry = maybe_mkwrite(entry, vma->vm_flags);
if (!write)
entry = pte_wrprotect(entry);
if (!young)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 109f5de82910..13246da4bc50 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3812,6 +3812,8 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
.vma = vma,
.address = haddr,
.flags = flags,
+ .vma_flags = vma->vm_flags,
+ .vma_page_prot = vma->vm_page_prot,
/*
* Hard to debug if it ends up being
* used by a callee that assumes
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 6a0cbca3885e..42469037240a 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -888,6 +888,8 @@ static bool __collapse_huge_page_swapin(struct mm_struct *mm,
.flags = FAULT_FLAG_ALLOW_RETRY,
.pmd = pmd,
.pgoff = linear_page_index(vma, address),
+ .vma_flags = vma->vm_flags,
+ .vma_page_prot = vma->vm_page_prot,
};

/* we only decide to swapin, if there is enough young ptes */
diff --git a/mm/memory.c b/mm/memory.c
index 2cf7b6185daa..d0de58464479 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1560,7 +1560,8 @@ static vm_fault_t insert_pfn(struct vm_area_struct *vma, unsigned long addr,
goto out_unlock;
}
entry = pte_mkyoung(*pte);
- entry = maybe_mkwrite(pte_mkdirty(entry), vma);
+ entry = maybe_mkwrite(pte_mkdirty(entry),
+ vma->vm_flags);
if (ptep_set_access_flags(vma, addr, pte, entry, 1))
update_mmu_cache(vma, addr, pte);
}
@@ -1575,7 +1576,7 @@ static vm_fault_t insert_pfn(struct vm_area_struct *vma, unsigned long addr,

if (mkwrite) {
entry = pte_mkyoung(entry);
- entry = maybe_mkwrite(pte_mkdirty(entry), vma);
+ entry = maybe_mkwrite(pte_mkdirty(entry), vma->vm_flags);
}

set_pte_at(mm, addr, pte, entry);
@@ -2257,7 +2258,7 @@ static inline void wp_page_reuse(struct vm_fault *vmf)

flush_cache_page(vma, vmf->address, pte_pfn(vmf->orig_pte));
entry = pte_mkyoung(vmf->orig_pte);
- entry = maybe_mkwrite(pte_mkdirty(entry), vma);
+ entry = maybe_mkwrite(pte_mkdirty(entry), vmf->vma_flags);
if (ptep_set_access_flags(vma, vmf->address, vmf->pte, entry, 1))
update_mmu_cache(vma, vmf->address, vmf->pte);
pte_unmap_unlock(vmf->pte, vmf->ptl);
@@ -2335,8 +2336,8 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
inc_mm_counter_fast(mm, MM_ANONPAGES);
}
flush_cache_page(vma, vmf->address, pte_pfn(vmf->orig_pte));
- entry = mk_pte(new_page, vma->vm_page_prot);
- entry = maybe_mkwrite(pte_mkdirty(entry), vma);
+ entry = mk_pte(new_page, vmf->vma_page_prot);
+ entry = maybe_mkwrite(pte_mkdirty(entry), vmf->vma_flags);
/*
* Clear the pte entry and flush it first, before updating the
* pte with the new entry. This will avoid a race condition
@@ -2401,7 +2402,7 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
* Don't let another task, with possibly unlocked vma,
* keep the mlocked page.
*/
- if (page_copied && (vma->vm_flags & VM_LOCKED)) {
+ if (page_copied && (vmf->vma_flags & VM_LOCKED)) {
lock_page(old_page); /* LRU manipulation */
if (PageMlocked(old_page))
munlock_vma_page(old_page);
@@ -2438,7 +2439,7 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
*/
vm_fault_t finish_mkwrite_fault(struct vm_fault *vmf)
{
- WARN_ON_ONCE(!(vmf->vma->vm_flags & VM_SHARED));
+ WARN_ON_ONCE(!(vmf->vma_flags & VM_SHARED));
if (!pte_map_lock(vmf))
return VM_FAULT_RETRY;
/*
@@ -2540,7 +2541,7 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf)
* We should not cow pages in a shared writeable mapping.
* Just mark the pages writable and/or call ops->pfn_mkwrite.
*/
- if ((vma->vm_flags & (VM_WRITE|VM_SHARED)) ==
+ if ((vmf->vma_flags & (VM_WRITE|VM_SHARED)) ==
(VM_WRITE|VM_SHARED))
return wp_pfn_shared(vmf);

@@ -2599,7 +2600,7 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf)
return VM_FAULT_WRITE;
}
unlock_page(vmf->page);
- } else if (unlikely((vma->vm_flags & (VM_WRITE|VM_SHARED)) ==
+ } else if (unlikely((vmf->vma_flags & (VM_WRITE|VM_SHARED)) ==
(VM_WRITE|VM_SHARED))) {
return wp_page_shared(vmf);
}
@@ -2878,9 +2879,9 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)

inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES);
dec_mm_counter_fast(vma->vm_mm, MM_SWAPENTS);
- pte = mk_pte(page, vma->vm_page_prot);
+ pte = mk_pte(page, vmf->vma_page_prot);
if ((vmf->flags & FAULT_FLAG_WRITE) && reuse_swap_page(page, NULL)) {
- pte = maybe_mkwrite(pte_mkdirty(pte), vma);
+ pte = maybe_mkwrite(pte_mkdirty(pte), vmf->vma_flags);
vmf->flags &= ~FAULT_FLAG_WRITE;
ret |= VM_FAULT_WRITE;
exclusive = RMAP_EXCLUSIVE;
@@ -2905,7 +2906,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)

swap_free(entry);
if (mem_cgroup_swap_full(page) ||
- (vma->vm_flags & VM_LOCKED) || PageMlocked(page))
+ (vmf->vma_flags & VM_LOCKED) || PageMlocked(page))
try_to_free_swap(page);
unlock_page(page);
if (page != swapcache && swapcache) {
@@ -2963,7 +2964,7 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
pte_t entry;

/* File mapping without ->vm_ops ? */
- if (vma->vm_flags & VM_SHARED)
+ if (vmf->vma_flags & VM_SHARED)
return VM_FAULT_SIGBUS;

/*
@@ -2987,7 +2988,7 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
if (!(vmf->flags & FAULT_FLAG_WRITE) &&
!mm_forbids_zeropage(vma->vm_mm)) {
entry = pte_mkspecial(pfn_pte(my_zero_pfn(vmf->address),
- vma->vm_page_prot));
+ vmf->vma_page_prot));
if (!pte_map_lock(vmf))
return VM_FAULT_RETRY;
if (!pte_none(*vmf->pte))
@@ -3021,8 +3022,8 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
*/
__SetPageUptodate(page);

- entry = mk_pte(page, vma->vm_page_prot);
- if (vma->vm_flags & VM_WRITE)
+ entry = mk_pte(page, vmf->vma_page_prot);
+ if (vmf->vma_flags & VM_WRITE)
entry = pte_mkwrite(pte_mkdirty(entry));

if (!pte_map_lock(vmf)) {
@@ -3242,7 +3243,7 @@ static vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page)
for (i = 0; i < HPAGE_PMD_NR; i++)
flush_icache_page(vma, page + i);

- entry = mk_huge_pmd(page, vma->vm_page_prot);
+ entry = mk_huge_pmd(page, vmf->vma_page_prot);
if (write)
entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);

@@ -3318,11 +3319,11 @@ vm_fault_t alloc_set_pte(struct vm_fault *vmf, struct mem_cgroup *memcg,
return VM_FAULT_NOPAGE;

flush_icache_page(vma, page);
- entry = mk_pte(page, vma->vm_page_prot);
+ entry = mk_pte(page, vmf->vma_page_prot);
if (write)
- entry = maybe_mkwrite(pte_mkdirty(entry), vma);
+ entry = maybe_mkwrite(pte_mkdirty(entry), vmf->vma_flags);
/* copy-on-write page */
- if (write && !(vma->vm_flags & VM_SHARED)) {
+ if (write && !(vmf->vma_flags & VM_SHARED)) {
inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES);
page_add_new_anon_rmap(page, vma, vmf->address, false);
mem_cgroup_commit_charge(page, memcg, false, false);
@@ -3362,7 +3363,7 @@ vm_fault_t finish_fault(struct vm_fault *vmf)

/* Did we COW the page? */
if ((vmf->flags & FAULT_FLAG_WRITE) &&
- !(vmf->vma->vm_flags & VM_SHARED))
+ !(vmf->vma_flags & VM_SHARED))
page = vmf->cow_page;
else
page = vmf->page;
@@ -3641,7 +3642,7 @@ static vm_fault_t do_fault(struct vm_fault *vmf)
}
} else if (!(vmf->flags & FAULT_FLAG_WRITE))
ret = do_read_fault(vmf);
- else if (!(vma->vm_flags & VM_SHARED))
+ else if (!(vmf->vma_flags & VM_SHARED))
ret = do_cow_fault(vmf);
else
ret = do_shared_fault(vmf);
@@ -3698,7 +3699,7 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
* accessible ptes, some can allow access by kernel mode.
*/
old_pte = ptep_modify_prot_start(vma, vmf->address, vmf->pte);
- pte = pte_modify(old_pte, vma->vm_page_prot);
+ pte = pte_modify(old_pte, vmf->vma_page_prot);
pte = pte_mkyoung(pte);
if (was_writable)
pte = pte_mkwrite(pte);
@@ -3732,7 +3733,7 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
* Flag if the page is shared between multiple address spaces. This
* is later used when determining whether to group tasks together
*/
- if (page_mapcount(page) > 1 && (vma->vm_flags & VM_SHARED))
+ if (page_mapcount(page) > 1 && (vmf->vma_flags & VM_SHARED))
flags |= TNF_SHARED;

last_cpupid = page_cpupid_last(page);
@@ -3777,7 +3778,7 @@ static inline vm_fault_t wp_huge_pmd(struct vm_fault *vmf, pmd_t orig_pmd)
return vmf->vma->vm_ops->huge_fault(vmf, PE_SIZE_PMD);

/* COW handled on pte level: split pmd */
- VM_BUG_ON_VMA(vmf->vma->vm_flags & VM_SHARED, vmf->vma);
+ VM_BUG_ON_VMA(vmf->vma_flags & VM_SHARED, vmf->vma);
__split_huge_pmd(vmf->vma, vmf->pmd, vmf->address, false, NULL);

return VM_FAULT_FALLBACK;
@@ -3924,6 +3925,8 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
.flags = flags,
.pgoff = linear_page_index(vma, address),
.gfp_mask = __get_fault_gfp_mask(vma),
+ .vma_flags = vma->vm_flags,
+ .vma_page_prot = vma->vm_page_prot,
};
unsigned int dirty = flags & FAULT_FLAG_WRITE;
struct mm_struct *mm = vma->vm_mm;
diff --git a/mm/migrate.c b/mm/migrate.c
index f2ecc2855a12..a9138093a8e2 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -240,7 +240,7 @@ static bool remove_migration_pte(struct page *page, struct vm_area_struct *vma,
*/
entry = pte_to_swp_entry(*pvmw.pte);
if (is_write_migration_entry(entry))
- pte = maybe_mkwrite(pte, vma);
+ pte = maybe_mkwrite(pte, vma->vm_flags);

if (unlikely(is_zone_device_page(new))) {
if (is_device_private_page(new)) {
--
2.21.0

2019-04-16 13:49:55

by Laurent Dufour

[permalink] [raw]
Subject: [PATCH v12 20/31] mm: introduce vma reference counter

The final goal is to be able to use a VMA structure without holding the
mmap_sem and to be sure that the structure will not be freed behind our
back.

The lockless use of the VMA will be done through RCU protection and thus a
dedicated freeing service is required to manage it asynchronously.

As reported in a thread from 2010 [1], this may impact file handling when a
file is still referenced while the mapping is no longer there. As the final
goal is to handle anonymous VMAs in a speculative way, and not file-backed
mappings, we can close and free the file pointer synchronously, as soon as
we are guaranteed not to use it without holding the mmap_sem. As a sanity
measure, and with minimal effort, the vm_file pointer is unset once the
file pointer is put.

[1] https://lore.kernel.org/linux-mm/[email protected]/
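
For clarity, a rough sketch (not part of this patch) of how the reference
counter is expected to pair with the RCU-protected lookup on the speculative
path. The find_vma_rcu() helper named below is a hypothetical stand-in for
the lockless lookup introduced in a later patch:

/*
 * Minimal sketch, assuming an RCU-safe RB tree walk (added by later
 * patches): take a reference while under RCU so the VMA cannot be freed
 * behind our back, and drop it once the speculative processing is done.
 */
static struct vm_area_struct *spf_find_vma(struct mm_struct *mm,
					    unsigned long address)
{
	struct vm_area_struct *vma;

	rcu_read_lock();
	vma = find_vma_rcu(mm, address);	/* hypothetical lockless lookup */
	if (vma)
		get_vma(vma);
	rcu_read_unlock();

	return vma;	/* the caller ends its work with put_vma(vma) */
}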

Signed-off-by: Laurent Dufour <[email protected]>
---
include/linux/mm.h | 4 ++++
include/linux/mm_types.h | 3 +++
mm/internal.h | 27 +++++++++++++++++++++++++++
mm/mmap.c | 13 +++++++++----
4 files changed, 43 insertions(+), 4 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index f14b2c9ddfd4..f761a9c65c74 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -529,6 +529,9 @@ static inline void vma_init(struct vm_area_struct *vma, struct mm_struct *mm)
vma->vm_mm = mm;
vma->vm_ops = &dummy_vm_ops;
INIT_LIST_HEAD(&vma->anon_vma_chain);
+#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
+ atomic_set(&vma->vm_ref_count, 1);
+#endif
}

static inline void vma_set_anonymous(struct vm_area_struct *vma)
@@ -1418,6 +1421,7 @@ static inline void INIT_VMA(struct vm_area_struct *vma)
INIT_LIST_HEAD(&vma->anon_vma_chain);
#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
seqcount_init(&vma->vm_sequence);
+ atomic_set(&vma->vm_ref_count, 1);
#endif
}

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 24b3f8ce9e42..6a6159e11a3f 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -285,6 +285,9 @@ struct vm_area_struct {
/* linked list of VM areas per task, sorted by address */
struct vm_area_struct *vm_next, *vm_prev;

+#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
+ atomic_t vm_ref_count;
+#endif
struct rb_node vm_rb;

/*
diff --git a/mm/internal.h b/mm/internal.h
index 9eeaf2b95166..302382bed406 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -40,6 +40,33 @@ void page_writeback_init(void);

vm_fault_t do_swap_page(struct vm_fault *vmf);

+
+extern void __free_vma(struct vm_area_struct *vma);
+
+#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
+static inline void get_vma(struct vm_area_struct *vma)
+{
+ atomic_inc(&vma->vm_ref_count);
+}
+
+static inline void put_vma(struct vm_area_struct *vma)
+{
+ if (atomic_dec_and_test(&vma->vm_ref_count))
+ __free_vma(vma);
+}
+
+#else
+
+static inline void get_vma(struct vm_area_struct *vma)
+{
+}
+
+static inline void put_vma(struct vm_area_struct *vma)
+{
+ __free_vma(vma);
+}
+#endif /* CONFIG_SPECULATIVE_PAGE_FAULT */
+
void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *start_vma,
unsigned long floor, unsigned long ceiling);

diff --git a/mm/mmap.c b/mm/mmap.c
index f7f6027a7dff..c106440dcae7 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -188,6 +188,12 @@ static inline void mm_write_sequnlock(struct mm_struct *mm)
}
#endif /* CONFIG_SPECULATIVE_PAGE_FAULT */

+void __free_vma(struct vm_area_struct *vma)
+{
+ mpol_put(vma_policy(vma));
+ vm_area_free(vma);
+}
+
/*
* Close a vm structure and free it, returning the next.
*/
@@ -200,8 +206,8 @@ static struct vm_area_struct *remove_vma(struct vm_area_struct *vma)
vma->vm_ops->close(vma);
if (vma->vm_file)
fput(vma->vm_file);
- mpol_put(vma_policy(vma));
- vm_area_free(vma);
+ vma->vm_file = NULL;
+ put_vma(vma);
return next;
}

@@ -990,8 +996,7 @@ int __vma_adjust(struct vm_area_struct *vma, unsigned long start,
if (next->anon_vma)
anon_vma_merge(vma, next);
mm->map_count--;
- mpol_put(vma_policy(next));
- vm_area_free(next);
+ put_vma(next);
/*
* In mprotect's case 6 (see comments on vma_merge),
* we must remove another next too. It would clutter
--
2.21.0

2019-04-16 13:50:02

by Laurent Dufour

[permalink] [raw]
Subject: [PATCH v12 27/31] mm: add speculative page fault vmstats

Add a speculative_pgfault vmstat counter to count successful speculative
page fault handling.

Also fix a minor comment typo (CONFIG_VM_EVENTS_COUNTERS) in mm/vmstat.c.
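
As a usage note (not part of this patch): once CONFIG_SPECULATIVE_PAGE_FAULT
is enabled, the counter appears as a speculative_pgfault line in
/proc/vmstat. A minimal sketch of reading it from user space:

#include <stdio.h>
#include <string.h>

/* Print the speculative_pgfault counter exposed through /proc/vmstat. */
int main(void)
{
	char name[64];
	unsigned long long value;
	FILE *f = fopen("/proc/vmstat", "r");

	if (!f)
		return 1;
	while (fscanf(f, "%63s %llu", name, &value) == 2) {
		if (!strcmp(name, "speculative_pgfault")) {
			printf("speculative_pgfault %llu\n", value);
			break;
		}
	}
	fclose(f);
	return 0;
}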

Signed-off-by: Laurent Dufour <[email protected]>
---
include/linux/vm_event_item.h | 3 +++
mm/memory.c | 3 +++
mm/vmstat.c | 5 ++++-
3 files changed, 10 insertions(+), 1 deletion(-)

diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index 47a3441cf4c4..137666e91074 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -109,6 +109,9 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
#ifdef CONFIG_SWAP
SWAP_RA,
SWAP_RA_HIT,
+#endif
+#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
+ SPECULATIVE_PGFAULT,
#endif
NR_VM_EVENT_ITEMS
};
diff --git a/mm/memory.c b/mm/memory.c
index 509851ad7c95..c65e8011d285 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4367,6 +4367,9 @@ vm_fault_t __handle_speculative_fault(struct mm_struct *mm,

put_vma(vma);

+ if (ret != VM_FAULT_RETRY)
+ count_vm_event(SPECULATIVE_PGFAULT);
+
/*
* The task may have entered a memcg OOM situation but
* if the allocation error was handled gracefully (no
diff --git a/mm/vmstat.c b/mm/vmstat.c
index a7d493366a65..93f54b31e150 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1288,7 +1288,10 @@ const char * const vmstat_text[] = {
"swap_ra",
"swap_ra_hit",
#endif
-#endif /* CONFIG_VM_EVENTS_COUNTERS */
+#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
+ "speculative_pgfault",
+#endif
+#endif /* CONFIG_VM_EVENT_COUNTERS */
};
#endif /* CONFIG_PROC_FS || CONFIG_SYSFS || CONFIG_NUMA */

--
2.21.0

2019-04-16 13:50:18

by Laurent Dufour

[permalink] [raw]
Subject: [PATCH v12 14/31] mm/migrate: Pass vm_fault pointer to migrate_misplaced_page()

migrate_misplaced_page() is only called during page fault handling, so it's
better to pass a pointer to the struct vm_fault instead of the VMA.

This way, the saved vma->vm_flags can be used during the speculative page
fault path.

Acked-by: David Rientjes <[email protected]>
Signed-off-by: Laurent Dufour <[email protected]>
---
include/linux/migrate.h | 4 ++--
mm/memory.c | 2 +-
mm/migrate.c | 4 ++--
3 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index e13d9bf2f9a5..0197e40325f8 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -125,14 +125,14 @@ static inline void __ClearPageMovable(struct page *page)
#ifdef CONFIG_NUMA_BALANCING
extern bool pmd_trans_migrating(pmd_t pmd);
extern int migrate_misplaced_page(struct page *page,
- struct vm_area_struct *vma, int node);
+ struct vm_fault *vmf, int node);
#else
static inline bool pmd_trans_migrating(pmd_t pmd)
{
return false;
}
static inline int migrate_misplaced_page(struct page *page,
- struct vm_area_struct *vma, int node)
+ struct vm_fault *vmf, int node)
{
return -EAGAIN; /* can't migrate now */
}
diff --git a/mm/memory.c b/mm/memory.c
index d0de58464479..56802850e72c 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3747,7 +3747,7 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
}

/* Migrate to the requested node */
- migrated = migrate_misplaced_page(page, vma, target_nid);
+ migrated = migrate_misplaced_page(page, vmf, target_nid);
if (migrated) {
page_nid = target_nid;
flags |= TNF_MIGRATED;
diff --git a/mm/migrate.c b/mm/migrate.c
index a9138093a8e2..633bd9abac54 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1938,7 +1938,7 @@ bool pmd_trans_migrating(pmd_t pmd)
* node. Caller is expected to have an elevated reference count on
* the page that will be dropped by this function before returning.
*/
-int migrate_misplaced_page(struct page *page, struct vm_area_struct *vma,
+int migrate_misplaced_page(struct page *page, struct vm_fault *vmf,
int node)
{
pg_data_t *pgdat = NODE_DATA(node);
@@ -1951,7 +1951,7 @@ int migrate_misplaced_page(struct page *page, struct vm_area_struct *vma,
* with execute permissions as they are probably shared libraries.
*/
if (page_mapcount(page) != 1 && page_is_file_cache(page) &&
- (vma->vm_flags & VM_EXEC))
+ (vmf->vma_flags & VM_EXEC))
goto out;

/*
--
2.21.0

2019-04-16 13:50:32

by Laurent Dufour

[permalink] [raw]
Subject: [PATCH v12 23/31] mm: don't do swap readahead during speculative page fault

Vinayak Menon faced a panic because one thread was faulting in a page from
swap while another one was mprotecting a part of the VMA, leading to a VMA
split.
This raised a panic in swap_vma_readahead() because the VMA's boundaries no
longer matched the faulting address.

To avoid this, if the page is not found in the swap cache, the speculative
page fault is aborted so that a regular page fault can be retried.

Reported-by: Vinayak Menon <[email protected]>
Signed-off-by: Laurent Dufour <[email protected]>
---
mm/memory.c | 11 +++++++++++
1 file changed, 11 insertions(+)

diff --git a/mm/memory.c b/mm/memory.c
index 6e6bf61c0e5c..1991da97e2db 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2900,6 +2900,17 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
lru_cache_add_anon(page);
swap_readpage(page, true);
}
+ } else if (vmf->flags & FAULT_FLAG_SPECULATIVE) {
+ /*
+ * Don't try readahead during a speculative page fault
+ * as the VMA's boundaries may change in our back.
+ * If the page is not in the swap cache and synchronous
+ * read is disabled, fall back to the regular page
+ * fault mechanism.
+ */
+ delayacct_clear_flag(DELAYACCT_PF_SWAPIN);
+ ret = VM_FAULT_RETRY;
+ goto out;
} else {
page = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE,
vmf);
--
2.21.0

2019-04-16 13:50:35

by Laurent Dufour

[permalink] [raw]
Subject: [PATCH v12 12/31] mm: protect SPF handler against anon_vma changes

The speculative page fault handler must be protected against anon_vma
changes. This is because page_add_new_anon_rmap() is called during the
speculative path.

In addition, don't try a speculative page fault if the VMA doesn't have an
anon_vma structure allocated, because its allocation should be protected by
the mmap_sem.

In __vma_adjust(), when importer->anon_vma is set, there is no need to
protect against speculative page faults since the speculative page fault is
aborted if vma->anon_vma is not set.

When calling page_add_new_anon_rmap(), vma->anon_vma is necessarily valid
since we checked for it when locking the PTE, and the anon_vma is only
removed once the PTE is unlocked. So we are safe even if the speculative
page fault handler is running concurrently with do_unmap(): the PTE is
locked in unmap_region() - through unmap_vmas() - and the anon_vma is
unlinked later, while the VMA sequence counter is updated in
unmap_page_range() before the PTE is locked, and again in free_pgtables(),
so the change will be detected when the PTE is locked.

Signed-off-by: Laurent Dufour <[email protected]>
---
mm/memory.c | 4 ++++
1 file changed, 4 insertions(+)

diff --git a/mm/memory.c b/mm/memory.c
index 423fa8ea0569..2cf7b6185daa 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -377,7 +377,9 @@ void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *vma,
* Hide vma from rmap and truncate_pagecache before freeing
* pgtables
*/
+ vm_write_begin(vma);
unlink_anon_vmas(vma);
+ vm_write_end(vma);
unlink_file_vma(vma);

if (is_vm_hugetlb_page(vma)) {
@@ -391,7 +393,9 @@ void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *vma,
&& !is_vm_hugetlb_page(next)) {
vma = next;
next = vma->vm_next;
+ vm_write_begin(vma);
unlink_anon_vmas(vma);
+ vm_write_end(vma);
unlink_file_vma(vma);
}
free_pgd_range(tlb, addr, vma->vm_end,
--
2.21.0

2019-04-16 13:50:55

by Laurent Dufour

[permalink] [raw]
Subject: [PATCH v12 02/31] x86/mm: define ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT

Set ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT which turns on the
Speculative Page Fault handler when building for 64bit.

Cc: Thomas Gleixner <[email protected]>
Signed-off-by: Laurent Dufour <[email protected]>
---
arch/x86/Kconfig | 1 +
1 file changed, 1 insertion(+)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 0f2ab09da060..8bd575184d0b 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -30,6 +30,7 @@ config X86_64
select SWIOTLB
select X86_DEV_DMA_OPS
select ARCH_HAS_SYSCALL_WRAPPER
+ select ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT

#
# Arch settings
--
2.21.0

2019-04-16 14:30:03

by Mark Rutland

[permalink] [raw]
Subject: Re: [PATCH v12 04/31] arm64/mm: define ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT

On Tue, Apr 16, 2019 at 03:44:55PM +0200, Laurent Dufour wrote:
> From: Mahendran Ganesh <[email protected]>
>
> Set ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT for arm64. This
> enables Speculative Page Fault handler.
>
> Signed-off-by: Ganesh Mahendran <[email protected]>

This is missing your S-o-B.

The first patch noted that the ARCH_SUPPORTS_* option was there because
the arch code had to make an explicit call to try to handle the fault
speculatively, but that isn't added until patch 30.

Why is this separate from that code?

Thanks,
Mark.

> ---
> arch/arm64/Kconfig | 1 +
> 1 file changed, 1 insertion(+)
>
> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> index 870ef86a64ed..8e86934d598b 100644
> --- a/arch/arm64/Kconfig
> +++ b/arch/arm64/Kconfig
> @@ -174,6 +174,7 @@ config ARM64
> select SWIOTLB
> select SYSCTL_EXCEPTION_TRACE
> select THREAD_INFO_IN_TASK
> + select ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT
> help
> ARM 64-bit (AArch64) Linux support.
>
> --
> 2.21.0
>

2019-04-16 14:34:32

by Laurent Dufour

[permalink] [raw]
Subject: Re: [PATCH v12 04/31] arm64/mm: define ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT

Le 16/04/2019 à 16:27, Mark Rutland a écrit :
> On Tue, Apr 16, 2019 at 03:44:55PM +0200, Laurent Dufour wrote:
>> From: Mahendran Ganesh <[email protected]>
>>
>> Set ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT for arm64. This
>> enables Speculative Page Fault handler.
>>
>> Signed-off-by: Ganesh Mahendran <[email protected]>
>
> This is missing your S-o-B.

You're right, I missed that...

>
> The first patch noted that the ARCH_SUPPORTS_* option was there because
> the arch code had to make an explicit call to try to handle the fault
> speculatively, but that isn't addeed until patch 30.
>
> Why is this separate from that code?

Andrew recommended this a long time ago for bisection purposes. This
allows building the code with CONFIG_SPECULATIVE_PAGE_FAULT before the
code that triggers the SPF handler is added to the per-architecture code.

Thanks,
Laurent.

> Thanks,
> Mark.
>
>> ---
>> arch/arm64/Kconfig | 1 +
>> 1 file changed, 1 insertion(+)
>>
>> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
>> index 870ef86a64ed..8e86934d598b 100644
>> --- a/arch/arm64/Kconfig
>> +++ b/arch/arm64/Kconfig
>> @@ -174,6 +174,7 @@ config ARM64
>> select SWIOTLB
>> select SYSCTL_EXCEPTION_TRACE
>> select THREAD_INFO_IN_TASK
>> + select ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT
>> help
>> ARM 64-bit (AArch64) Linux support.
>>
>> --
>> 2.21.0
>>
>

2019-04-16 14:44:27

by Mark Rutland

[permalink] [raw]
Subject: Re: [PATCH v12 04/31] arm64/mm: define ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT

On Tue, Apr 16, 2019 at 04:31:27PM +0200, Laurent Dufour wrote:
> Le 16/04/2019 à 16:27, Mark Rutland a écrit :
> > On Tue, Apr 16, 2019 at 03:44:55PM +0200, Laurent Dufour wrote:
> > > From: Mahendran Ganesh <[email protected]>
> > >
> > > Set ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT for arm64. This
> > > enables Speculative Page Fault handler.
> > >
> > > Signed-off-by: Ganesh Mahendran <[email protected]>
> >
> > This is missing your S-o-B.
>
> You're right, I missed that...
>
> > The first patch noted that the ARCH_SUPPORTS_* option was there because
> > the arch code had to make an explicit call to try to handle the fault
> > speculatively, but that isn't addeed until patch 30.
> >
> > Why is this separate from that code?
>
> Andrew was recommended this a long time ago for bisection purpose. This
> allows to build the code with CONFIG_SPECULATIVE_PAGE_FAULT before the code
> that trigger the spf handler is added to the per architecture's code.

Ok. I think it would be worth noting that in the commit message, to
avoid anyone else asking the same question. :)

Thanks,
Mark.

2019-04-18 21:49:57

by Jerome Glisse

[permalink] [raw]
Subject: Re: [PATCH v12 01/31] mm: introduce CONFIG_SPECULATIVE_PAGE_FAULT

On Tue, Apr 16, 2019 at 03:44:52PM +0200, Laurent Dufour wrote:
> This configuration variable will be used to build the code needed to
> handle speculative page fault.
>
> By default it is turned off, and activated depending on architecture
> support, ARCH_HAS_PTE_SPECIAL, SMP and MMU.
>
> The architecture support is needed since the speculative page fault handler
> is called from the architecture's page faulting code, and some code has to
> be added there to handle the speculative handler.
>
> The dependency on ARCH_HAS_PTE_SPECIAL is required because vm_normal_page()
> does processing that is not compatible with the speculative handling in the
> case ARCH_HAS_PTE_SPECIAL is not set.
>
> Suggested-by: Thomas Gleixner <[email protected]>
> Suggested-by: David Rientjes <[email protected]>
> Signed-off-by: Laurent Dufour <[email protected]>

Reviewed-by: Jérôme Glisse <[email protected]>

Small question below

> ---
> mm/Kconfig | 22 ++++++++++++++++++++++
> 1 file changed, 22 insertions(+)
>
> diff --git a/mm/Kconfig b/mm/Kconfig
> index 0eada3f818fa..ff278ac9978a 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -761,4 +761,26 @@ config GUP_BENCHMARK
> config ARCH_HAS_PTE_SPECIAL
> bool
>
> +config ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT
> + def_bool n
> +
> +config SPECULATIVE_PAGE_FAULT
> + bool "Speculative page faults"
> + default y
> + depends on ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT
> + depends on ARCH_HAS_PTE_SPECIAL && MMU && SMP
> + help
> + Try to handle user space page faults without holding the mmap_sem.
> +
> + This should allow better concurrency for massively threaded processes

Is there any case where it does not provide better concurrency? The
"should" makes me wonder :)

> + since the page fault handler will not wait for other thread's memory
> + layout change to be done, assuming that this change is done in
> + another part of the process's memory space. This type of page fault
> + is named speculative page fault.
> +
> + If the speculative page fault fails because a concurrent modification
> + is detected or because underlying PMD or PTE tables are not yet
> + allocated, the speculative page fault fails and a classic page fault
> + is then tried.
> +
> endmenu
> --
> 2.21.0
>

2019-04-18 21:50:39

by Jerome Glisse

[permalink] [raw]
Subject: Re: [PATCH v12 02/31] x86/mm: define ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT

On Tue, Apr 16, 2019 at 03:44:53PM +0200, Laurent Dufour wrote:
> Set ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT which turns on the
> Speculative Page Fault handler when building for 64bit.
>
> Cc: Thomas Gleixner <[email protected]>
> Signed-off-by: Laurent Dufour <[email protected]>

I think this patch should be moved to the end of the series so that the
feature is not enabled mid-way, without all the pieces ready, if someone
bisects. But I have not reviewed everything yet so maybe it is fine.

Reviewed-by: Jérôme Glisse <[email protected]>

> ---
> arch/x86/Kconfig | 1 +
> 1 file changed, 1 insertion(+)
>
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index 0f2ab09da060..8bd575184d0b 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -30,6 +30,7 @@ config X86_64
> select SWIOTLB
> select X86_DEV_DMA_OPS
> select ARCH_HAS_SYSCALL_WRAPPER
> + select ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT
>
> #
> # Arch settings
> --
> 2.21.0
>

2019-04-18 21:51:14

by Jerome Glisse

[permalink] [raw]
Subject: Re: [PATCH v12 03/31] powerpc/mm: set ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT

On Tue, Apr 16, 2019 at 03:44:54PM +0200, Laurent Dufour wrote:
> Set ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT for BOOK3S_64. This enables
> the Speculative Page Fault handler.
>
> Support is only provide for BOOK3S_64 currently because:
> - require CONFIG_PPC_STD_MMU because checks done in
> set_access_flags_filter()
> - require BOOK3S because we can't support for book3e_hugetlb_preload()
> called by update_mmu_cache()
>
> Cc: Michael Ellerman <[email protected]>
> Signed-off-by: Laurent Dufour <[email protected]>

Same comment as for x86.

Reviewed-by: Jérôme Glisse <[email protected]>

> ---
> arch/powerpc/Kconfig | 1 +
> 1 file changed, 1 insertion(+)
>
> diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
> index 2d0be82c3061..a29887ea5383 100644
> --- a/arch/powerpc/Kconfig
> +++ b/arch/powerpc/Kconfig
> @@ -238,6 +238,7 @@ config PPC
> select PCI_SYSCALL if PCI
> select RTC_LIB
> select SPARSE_IRQ
> + select ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT if PPC_BOOK3S_64
> select SYSCTL_EXCEPTION_TRACE
> select THREAD_INFO_IN_TASK
> select VIRT_TO_BUS if !PPC64
> --
> 2.21.0
>

2019-04-18 21:52:31

by Jerome Glisse

[permalink] [raw]
Subject: Re: [PATCH v12 04/31] arm64/mm: define ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT

On Tue, Apr 16, 2019 at 03:41:56PM +0100, Mark Rutland wrote:
> On Tue, Apr 16, 2019 at 04:31:27PM +0200, Laurent Dufour wrote:
> > Le 16/04/2019 à 16:27, Mark Rutland a écrit :
> > > On Tue, Apr 16, 2019 at 03:44:55PM +0200, Laurent Dufour wrote:
> > > > From: Mahendran Ganesh <[email protected]>
> > > >
> > > > Set ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT for arm64. This
> > > > enables Speculative Page Fault handler.
> > > >
> > > > Signed-off-by: Ganesh Mahendran <[email protected]>
> > >
> > > This is missing your S-o-B.
> >
> > You're right, I missed that...
> >
> > > The first patch noted that the ARCH_SUPPORTS_* option was there because
> > > the arch code had to make an explicit call to try to handle the fault
> > > speculatively, but that isn't addeed until patch 30.
> > >
> > > Why is this separate from that code?
> >
> > Andrew was recommended this a long time ago for bisection purpose. This
> > allows to build the code with CONFIG_SPECULATIVE_PAGE_FAULT before the code
> > that trigger the spf handler is added to the per architecture's code.
>
> Ok. I think it would be worth noting that in the commit message, to
> avoid anyone else asking the same question. :)

Should have read this thread before looking at x86 and ppc :)

In any case the patch is:

Reviewed-by: Jérôme Glisse <[email protected]>

2019-04-18 22:05:38

by Jerome Glisse

[permalink] [raw]
Subject: Re: [PATCH v12 05/31] mm: prepare for FAULT_FLAG_SPECULATIVE

On Tue, Apr 16, 2019 at 03:44:56PM +0200, Laurent Dufour wrote:
> From: Peter Zijlstra <[email protected]>
>
> When speculating faults (without holding mmap_sem) we need to validate
> that the vma against which we loaded pages is still valid when we're
> ready to install the new PTE.
>
> Therefore, replace the pte_offset_map_lock() calls that (re)take the
> PTL with pte_map_lock() which can fail in case we find the VMA changed
> since we started the fault.
>
> Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
>
> [Port to 4.12 kernel]
> [Remove the comment about the fault_env structure which has been
> implemented as the vm_fault structure in the kernel]
> [move pte_map_lock()'s definition upper in the file]
> [move the define of FAULT_FLAG_SPECULATIVE later in the series]
> [review error path in do_swap_page(), do_anonymous_page() and
> wp_page_copy()]
> Signed-off-by: Laurent Dufour <[email protected]>

Reviewed-by: Jérôme Glisse <[email protected]>

> ---
> mm/memory.c | 87 +++++++++++++++++++++++++++++++++++------------------
> 1 file changed, 58 insertions(+), 29 deletions(-)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index c6ddadd9d2b7..fc3698d13cb5 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -2073,6 +2073,13 @@ int apply_to_page_range(struct mm_struct *mm, unsigned long addr,
> }
> EXPORT_SYMBOL_GPL(apply_to_page_range);
>
> +static inline bool pte_map_lock(struct vm_fault *vmf)

I am not a fan of the name; maybe pte_offset_map_lock_if_valid()? But
that's just a matter of taste. So feel free to ignore this comment.


> +{
> + vmf->pte = pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd,
> + vmf->address, &vmf->ptl);
> + return true;
> +}
> +
> /*
> * handle_pte_fault chooses page fault handler according to an entry which was
> * read non-atomically. Before making any commitment, on those architectures

2019-04-18 22:07:46

by Jerome Glisse

[permalink] [raw]
Subject: Re: [PATCH v12 06/31] mm: introduce pte_spinlock for FAULT_FLAG_SPECULATIVE

On Tue, Apr 16, 2019 at 03:44:57PM +0200, Laurent Dufour wrote:
> When handling page fault without holding the mmap_sem the fetch of the
> pte lock pointer and the locking will have to be done while ensuring
> that the VMA is not touched in our back.
>
> So move the fetch and locking operations in a dedicated function.
>
> Signed-off-by: Laurent Dufour <[email protected]>

Reviewed-by: Jérôme Glisse <[email protected]>


> ---
> mm/memory.c | 15 +++++++++++----
> 1 file changed, 11 insertions(+), 4 deletions(-)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index fc3698d13cb5..221ccdf34991 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -2073,6 +2073,13 @@ int apply_to_page_range(struct mm_struct *mm, unsigned long addr,
> }
> EXPORT_SYMBOL_GPL(apply_to_page_range);
>
> +static inline bool pte_spinlock(struct vm_fault *vmf)
> +{
> + vmf->ptl = pte_lockptr(vmf->vma->vm_mm, vmf->pmd);
> + spin_lock(vmf->ptl);
> + return true;
> +}
> +
> static inline bool pte_map_lock(struct vm_fault *vmf)
> {
> vmf->pte = pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd,
> @@ -3656,8 +3663,8 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
> * validation through pte_unmap_same(). It's of NUMA type but
> * the pfn may be screwed if the read is non atomic.
> */
> - vmf->ptl = pte_lockptr(vma->vm_mm, vmf->pmd);
> - spin_lock(vmf->ptl);
> + if (!pte_spinlock(vmf))
> + return VM_FAULT_RETRY;
> if (unlikely(!pte_same(*vmf->pte, vmf->orig_pte))) {
> pte_unmap_unlock(vmf->pte, vmf->ptl);
> goto out;
> @@ -3850,8 +3857,8 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
> if (pte_protnone(vmf->orig_pte) && vma_is_accessible(vmf->vma))
> return do_numa_page(vmf);
>
> - vmf->ptl = pte_lockptr(vmf->vma->vm_mm, vmf->pmd);
> - spin_lock(vmf->ptl);
> + if (!pte_spinlock(vmf))
> + return VM_FAULT_RETRY;
> entry = vmf->orig_pte;
> if (unlikely(!pte_same(*vmf->pte, entry)))
> goto unlock;
> --
> 2.21.0
>

2019-04-18 22:12:37

by Jerome Glisse

[permalink] [raw]
Subject: Re: [PATCH v12 07/31] mm: make pte_unmap_same compatible with SPF

On Tue, Apr 16, 2019 at 03:44:58PM +0200, Laurent Dufour wrote:
> pte_unmap_same() is making the assumption that the page table are still
> around because the mmap_sem is held.
> This is no more the case when running a speculative page fault and
> additional check must be made to ensure that the final page table are still
> there.
>
> This is now done by calling pte_spinlock() to check for the VMA's
> consistency while locking for the page tables.
>
> This is requiring passing a vm_fault structure to pte_unmap_same() which is
> containing all the needed parameters.
>
> As pte_spinlock() may fail in the case of a speculative page fault, if the
> VMA has been touched in our back, pte_unmap_same() should now return 3
> cases :
> 1. pte are the same (0)
> 2. pte are different (VM_FAULT_PTNOTSAME)
> 3. a VMA's changes has been detected (VM_FAULT_RETRY)
>
> The case 2 is handled by the introduction of a new VM_FAULT flag named
> VM_FAULT_PTNOTSAME which is then trapped in cow_user_page().
> If VM_FAULT_RETRY is returned, it is passed up to the callers to retry the
> page fault while holding the mmap_sem.
>
> Acked-by: David Rientjes <[email protected]>
> Signed-off-by: Laurent Dufour <[email protected]>

Reviewed-by: Jérôme Glisse <[email protected]>


> ---
> include/linux/mm_types.h | 6 +++++-
> mm/memory.c | 37 +++++++++++++++++++++++++++----------
> 2 files changed, 32 insertions(+), 11 deletions(-)
>
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 8ec38b11b361..fd7d38ee2e33 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -652,6 +652,8 @@ typedef __bitwise unsigned int vm_fault_t;
> * @VM_FAULT_NEEDDSYNC: ->fault did not modify page tables and needs
> * fsync() to complete (for synchronous page faults
> * in DAX)
> + * @VM_FAULT_PTNOTSAME Page table entries have changed during a
> + * speculative page fault handling.
> * @VM_FAULT_HINDEX_MASK: mask HINDEX value
> *
> */
> @@ -669,6 +671,7 @@ enum vm_fault_reason {
> VM_FAULT_FALLBACK = (__force vm_fault_t)0x000800,
> VM_FAULT_DONE_COW = (__force vm_fault_t)0x001000,
> VM_FAULT_NEEDDSYNC = (__force vm_fault_t)0x002000,
> + VM_FAULT_PTNOTSAME = (__force vm_fault_t)0x004000,
> VM_FAULT_HINDEX_MASK = (__force vm_fault_t)0x0f0000,
> };
>
> @@ -693,7 +696,8 @@ enum vm_fault_reason {
> { VM_FAULT_RETRY, "RETRY" }, \
> { VM_FAULT_FALLBACK, "FALLBACK" }, \
> { VM_FAULT_DONE_COW, "DONE_COW" }, \
> - { VM_FAULT_NEEDDSYNC, "NEEDDSYNC" }
> + { VM_FAULT_NEEDDSYNC, "NEEDDSYNC" }, \
> + { VM_FAULT_PTNOTSAME, "PTNOTSAME" }
>
> struct vm_special_mapping {
> const char *name; /* The name, e.g. "[vdso]". */
> diff --git a/mm/memory.c b/mm/memory.c
> index 221ccdf34991..d5bebca47d98 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -2094,21 +2094,29 @@ static inline bool pte_map_lock(struct vm_fault *vmf)
> * parts, do_swap_page must check under lock before unmapping the pte and
> * proceeding (but do_wp_page is only called after already making such a check;
> * and do_anonymous_page can safely check later on).
> + *
> + * pte_unmap_same() returns:
> + * 0 if the PTE are the same
> + * VM_FAULT_PTNOTSAME if the PTE are different
> + * VM_FAULT_RETRY if the VMA has changed in our back during
> + * a speculative page fault handling.
> */
> -static inline int pte_unmap_same(struct mm_struct *mm, pmd_t *pmd,
> - pte_t *page_table, pte_t orig_pte)
> +static inline vm_fault_t pte_unmap_same(struct vm_fault *vmf)
> {
> - int same = 1;
> + int ret = 0;
> +
> #if defined(CONFIG_SMP) || defined(CONFIG_PREEMPT)
> if (sizeof(pte_t) > sizeof(unsigned long)) {
> - spinlock_t *ptl = pte_lockptr(mm, pmd);
> - spin_lock(ptl);
> - same = pte_same(*page_table, orig_pte);
> - spin_unlock(ptl);
> + if (pte_spinlock(vmf)) {
> + if (!pte_same(*vmf->pte, vmf->orig_pte))
> + ret = VM_FAULT_PTNOTSAME;
> + spin_unlock(vmf->ptl);
> + } else
> + ret = VM_FAULT_RETRY;
> }
> #endif
> - pte_unmap(page_table);
> - return same;
> + pte_unmap(vmf->pte);
> + return ret;
> }
>
> static inline void cow_user_page(struct page *dst, struct page *src, unsigned long va, struct vm_area_struct *vma)
> @@ -2714,8 +2722,17 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> int exclusive = 0;
> vm_fault_t ret = 0;
>
> - if (!pte_unmap_same(vma->vm_mm, vmf->pmd, vmf->pte, vmf->orig_pte))
> + ret = pte_unmap_same(vmf);
> + if (ret) {
> + /*
> + * If pte != orig_pte, this means another thread did the
> + * swap operation in our back.
> + * So nothing else to do.
> + */
> + if (ret == VM_FAULT_PTNOTSAME)
> + ret = 0;
> goto out;
> + }
>
> entry = pte_to_swp_entry(vmf->orig_pte);
> if (unlikely(non_swap_entry(entry))) {
> --
> 2.21.0
>

2019-04-18 22:23:30

by Jerome Glisse

[permalink] [raw]
Subject: Re: [PATCH v12 08/31] mm: introduce INIT_VMA()

On Tue, Apr 16, 2019 at 03:44:59PM +0200, Laurent Dufour wrote:
> Some VMA struct fields need to be initialized once the VMA structure is
> allocated.
> Currently this only concerns the anon_vma_chain field but some others will
> be added to support the speculative page fault.
>
> Instead of spreading the initialization calls all over the code, let's
> introduce a dedicated inline function.
>
> Signed-off-by: Laurent Dufour <[email protected]>
> ---
> fs/exec.c | 1 +
> include/linux/mm.h | 5 +++++
> kernel/fork.c | 2 +-
> mm/mmap.c | 3 +++
> mm/nommu.c | 1 +
> 5 files changed, 11 insertions(+), 1 deletion(-)
>
> diff --git a/fs/exec.c b/fs/exec.c
> index 2e0033348d8e..9762e060295c 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -266,6 +266,7 @@ static int __bprm_mm_init(struct linux_binprm *bprm)
> vma->vm_start = vma->vm_end - PAGE_SIZE;
> vma->vm_flags = VM_SOFTDIRTY | VM_STACK_FLAGS | VM_STACK_INCOMPLETE_SETUP;
> vma->vm_page_prot = vm_get_page_prot(vma->vm_flags);
> + INIT_VMA(vma);
>
> err = insert_vm_struct(mm, vma);
> if (err)
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 4ba2f53f9d60..2ceb1d2869a6 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1407,6 +1407,11 @@ struct zap_details {
> pgoff_t last_index; /* Highest page->index to unmap */
> };
>
> +static inline void INIT_VMA(struct vm_area_struct *vma)

Can we leave capital names to macros ? Also i prefer vma_init_struct() (the
one thing i like in C++ is namespaces and thus i like namespace_action() for
function names).

Also why not do a coccinelle patch for this:

@@
struct vm_area_struct *vma;
@@
-INIT_LIST_HEAD(&vma->anon_vma_chain);
+vma_init_struct(vma);


Untested ...
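
For reference, such a semantic patch would typically be applied with
something like the following (assuming it is saved as vma_init.cocci, a
hypothetical name):

	spatch --sp-file vma_init.cocci --dir . --in-place

The call sites that currently have no INIT_LIST_HEAD() at all would still
need to be converted by hand.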

> +{
> + INIT_LIST_HEAD(&vma->anon_vma_chain);
> +}
> +
> struct page *_vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
> pte_t pte, bool with_public_device);
> #define vm_normal_page(vma, addr, pte) _vm_normal_page(vma, addr, pte, false)
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 915be4918a2b..f8dae021c2e5 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -341,7 +341,7 @@ struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
>
> if (new) {
> *new = *orig;
> - INIT_LIST_HEAD(&new->anon_vma_chain);
> + INIT_VMA(new);
> }
> return new;
> }
> diff --git a/mm/mmap.c b/mm/mmap.c
> index bd7b9f293b39..5ad3a3228d76 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -1765,6 +1765,7 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
> vma->vm_flags = vm_flags;
> vma->vm_page_prot = vm_get_page_prot(vm_flags);
> vma->vm_pgoff = pgoff;
> + INIT_VMA(vma);
>
> if (file) {
> if (vm_flags & VM_DENYWRITE) {
> @@ -3037,6 +3038,7 @@ static int do_brk_flags(unsigned long addr, unsigned long len, unsigned long fla
> }
>
> vma_set_anonymous(vma);
> + INIT_VMA(vma);
> vma->vm_start = addr;
> vma->vm_end = addr + len;
> vma->vm_pgoff = pgoff;
> @@ -3395,6 +3397,7 @@ static struct vm_area_struct *__install_special_mapping(
> if (unlikely(vma == NULL))
> return ERR_PTR(-ENOMEM);
>
> + INIT_VMA(vma);
> vma->vm_start = addr;
> vma->vm_end = addr + len;
>
> diff --git a/mm/nommu.c b/mm/nommu.c
> index 749276beb109..acf7ca72ca90 100644
> --- a/mm/nommu.c
> +++ b/mm/nommu.c
> @@ -1210,6 +1210,7 @@ unsigned long do_mmap(struct file *file,
> region->vm_flags = vm_flags;
> region->vm_pgoff = pgoff;
>
> + INIT_VMA(vma);
> vma->vm_flags = vm_flags;
> vma->vm_pgoff = pgoff;
>
> --
> 2.21.0
>

2019-04-18 22:54:08

by Jerome Glisse

[permalink] [raw]
Subject: Re: [PATCH v12 09/31] mm: VMA sequence count

On Tue, Apr 16, 2019 at 03:45:00PM +0200, Laurent Dufour wrote:
> From: Peter Zijlstra <[email protected]>
>
> Wrap the VMA modifications (vma_adjust/unmap_page_range) with sequence
> counts such that we can easily test if a VMA is changed.
>
> The calls to vm_write_begin/end() in unmap_page_range() are
> used to detect when a VMA is being unmapped and thus that new page faults
> should not be satisfied for this VMA. If the seqcount hasn't changed when
> the page tables are locked, this means we are safe to satisfy the page
> fault.
>
> The flip side is that we cannot distinguish between a vma_adjust() and
> the unmap_page_range() -- where with the former we could have
> re-checked the vma bounds against the address.
>
> The VMA's sequence counter is also used to detect changes to various VMA
> fields used during the page fault handling, such as:
> - vm_start, vm_end
> - vm_pgoff
> - vm_flags, vm_page_prot
> - vm_policy

^ All above are under mmap write lock ?

> - anon_vma

^ This is either under mmap write lock or under page table lock

So my question is do we need the complexity of seqcount_t for this ?

It seems that using a regular int as counter and also relying on vm_flags
when the vma is unmapped should do the trick.

vma_delete(struct vm_area_struct *vma)
{
...
/*
* Make sure the vma is mark as invalid ie neither read nor write
* so that speculative fault back off. A racing speculative fault
* will either see the flags as 0 or the new seqcount.
*/
vma->vm_flags = 0;
smp_wmb();
vma->seqcount++;
...
}

Then:
speculative_fault_begin(struct vm_area_struct *vma,
struct spec_vmf *spvmf)
{
...
spvmf->seqcount = vma->seqcount;
smp_rmb();
spvmf->vm_flags = vma->vm_flags;
if (!spvmf->vm_flags) {
// Back off the vma is dying ...
...
}
}

bool speculative_fault_commit(struct vm_area_struct *vma,
struct spec_vmf *spvmf)
{
...
seqcount = vma->seqcount;
smp_rmb();
vm_flags = vma->vm_flags;

if (spvmf->vm_flags != vm_flags || seqcount != spvmf->seqcount) {
// Something did change for the vma
return false;
}
return true;
}

This would also avoid the lockdep issue described below. But maybe what
i propose is stupid and i will see it after further reviewing things.


Cheers,
Jérôme


>
> Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
>
> [Port to 4.12 kernel]
> [Build depends on CONFIG_SPECULATIVE_PAGE_FAULT]
> [Introduce vm_write_* inline function depending on
> CONFIG_SPECULATIVE_PAGE_FAULT]
> [Fix lock dependency between mapping->i_mmap_rwsem and vma->vm_sequence by
> using vm_raw_write* functions]
> [Fix a lock dependency warning in mmap_region() when entering the error
> path]
> [move sequence initialisation INIT_VMA()]
> [Review the patch description about unmap_page_range()]
> Signed-off-by: Laurent Dufour <[email protected]>
> ---
> include/linux/mm.h | 44 ++++++++++++++++++++++++++++++++++++++++
> include/linux/mm_types.h | 3 +++
> mm/memory.c | 2 ++
> mm/mmap.c | 30 +++++++++++++++++++++++++++
> 4 files changed, 79 insertions(+)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 2ceb1d2869a6..906b9e06f18e 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1410,6 +1410,9 @@ struct zap_details {
> static inline void INIT_VMA(struct vm_area_struct *vma)
> {
> INIT_LIST_HEAD(&vma->anon_vma_chain);
> +#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
> + seqcount_init(&vma->vm_sequence);
> +#endif
> }
>
> struct page *_vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
> @@ -1534,6 +1537,47 @@ static inline void unmap_shared_mapping_range(struct address_space *mapping,
> unmap_mapping_range(mapping, holebegin, holelen, 0);
> }
>
> +#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
> +static inline void vm_write_begin(struct vm_area_struct *vma)
> +{
> + write_seqcount_begin(&vma->vm_sequence);
> +}
> +static inline void vm_write_begin_nested(struct vm_area_struct *vma,
> + int subclass)
> +{
> + write_seqcount_begin_nested(&vma->vm_sequence, subclass);
> +}
> +static inline void vm_write_end(struct vm_area_struct *vma)
> +{
> + write_seqcount_end(&vma->vm_sequence);
> +}
> +static inline void vm_raw_write_begin(struct vm_area_struct *vma)
> +{
> + raw_write_seqcount_begin(&vma->vm_sequence);
> +}
> +static inline void vm_raw_write_end(struct vm_area_struct *vma)
> +{
> + raw_write_seqcount_end(&vma->vm_sequence);
> +}
> +#else
> +static inline void vm_write_begin(struct vm_area_struct *vma)
> +{
> +}
> +static inline void vm_write_begin_nested(struct vm_area_struct *vma,
> + int subclass)
> +{
> +}
> +static inline void vm_write_end(struct vm_area_struct *vma)
> +{
> +}
> +static inline void vm_raw_write_begin(struct vm_area_struct *vma)
> +{
> +}
> +static inline void vm_raw_write_end(struct vm_area_struct *vma)
> +{
> +}
> +#endif /* CONFIG_SPECULATIVE_PAGE_FAULT */
> +
> extern int access_process_vm(struct task_struct *tsk, unsigned long addr,
> void *buf, int len, unsigned int gup_flags);
> extern int access_remote_vm(struct mm_struct *mm, unsigned long addr,
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index fd7d38ee2e33..e78f72eb2576 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -337,6 +337,9 @@ struct vm_area_struct {
> struct mempolicy *vm_policy; /* NUMA policy for the VMA */
> #endif
> struct vm_userfaultfd_ctx vm_userfaultfd_ctx;
> +#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
> + seqcount_t vm_sequence;
> +#endif
> } __randomize_layout;
>
> struct core_thread {
> diff --git a/mm/memory.c b/mm/memory.c
> index d5bebca47d98..423fa8ea0569 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1256,6 +1256,7 @@ void unmap_page_range(struct mmu_gather *tlb,
> unsigned long next;
>
> BUG_ON(addr >= end);
> + vm_write_begin(vma);
> tlb_start_vma(tlb, vma);
> pgd = pgd_offset(vma->vm_mm, addr);
> do {
> @@ -1265,6 +1266,7 @@ void unmap_page_range(struct mmu_gather *tlb,
> next = zap_p4d_range(tlb, vma, pgd, addr, next, details);
> } while (pgd++, addr = next, addr != end);
> tlb_end_vma(tlb, vma);
> + vm_write_end(vma);
> }
>
>
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 5ad3a3228d76..a4e4d52a5148 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -726,6 +726,30 @@ int __vma_adjust(struct vm_area_struct *vma, unsigned long start,
> long adjust_next = 0;
> int remove_next = 0;
>
> + /*
> + * Why using vm_raw_write*() functions here to avoid lockdep's warning ?
> + *
> + * Locked is complaining about a theoretical lock dependency, involving
> + * 3 locks:
> + * mapping->i_mmap_rwsem --> vma->vm_sequence --> fs_reclaim
> + *
> + * Here are the major path leading to this dependency :
> + * 1. __vma_adjust() mmap_sem -> vm_sequence -> i_mmap_rwsem
> + * 2. move_vmap() mmap_sem -> vm_sequence -> fs_reclaim
> + * 3. __alloc_pages_nodemask() fs_reclaim -> i_mmap_rwsem
> + * 4. unmap_mapping_range() i_mmap_rwsem -> vm_sequence
> + *
> + * So there is no way to solve this easily, especially because in
> + * unmap_mapping_range() the i_mmap_rwsem is grab while the impacted
> + * VMAs are not yet known.
> + * However, the way the vm_seq is used is guarantying that we will
> + * never block on it since we just check for its value and never wait
> + * for it to move, see vma_has_changed() and handle_speculative_fault().
> + */
> + vm_raw_write_begin(vma);
> + if (next)
> + vm_raw_write_begin(next);
> +
> if (next && !insert) {
> struct vm_area_struct *exporter = NULL, *importer = NULL;
>
> @@ -950,6 +974,8 @@ int __vma_adjust(struct vm_area_struct *vma, unsigned long start,
> * "vma->vm_next" gap must be updated.
> */
> next = vma->vm_next;
> + if (next)
> + vm_raw_write_begin(next);
> } else {
> /*
> * For the scope of the comment "next" and
> @@ -996,6 +1022,10 @@ int __vma_adjust(struct vm_area_struct *vma, unsigned long start,
> if (insert && file)
> uprobe_mmap(insert);
>
> + if (next && next != vma)
> + vm_raw_write_end(next);
> + vm_raw_write_end(vma);
> +
> validate_mm(mm);
>
> return 0;
> --
> 2.21.0
>

2019-04-19 18:21:51

by Laurent Dufour

[permalink] [raw]
Subject: Re: [PATCH v12 09/31] mm: VMA sequence count

Hi Jerome,

Thanks a lot for reviewing this series.

Le 19/04/2019 à 00:48, Jerome Glisse a écrit :
> On Tue, Apr 16, 2019 at 03:45:00PM +0200, Laurent Dufour wrote:
>> From: Peter Zijlstra <[email protected]>
>>
>> Wrap the VMA modifications (vma_adjust/unmap_page_range) with sequence
>> counts such that we can easily test if a VMA is changed.
>>
>> The calls to vm_write_begin/end() in unmap_page_range() are
>> used to detect when a VMA is being unmap and thus that new page fault
>> should not be satisfied for this VMA. If the seqcount hasn't changed when
>> the page table are locked, this means we are safe to satisfy the page
>> fault.
>>
>> The flip side is that we cannot distinguish between a vma_adjust() and
>> the unmap_page_range() -- where with the former we could have
>> re-checked the vma bounds against the address.
>>
>> The VMA's sequence counter is also used to detect change to various VMA's
>> fields used during the page fault handling, such as:
>> - vm_start, vm_end
>> - vm_pgoff
>> - vm_flags, vm_page_prot
>> - vm_policy
>
> ^ All above are under mmap write lock ?

Yes, changes are still made under the protection of the mmap_sem.

>
>> - anon_vma
>
> ^ This is either under mmap write lock or under page table lock
>
> So my question is do we need the complexity of seqcount_t for this ?

The sequence counter is used to detect write operations done while a
reader (the SPF handler) is running.

The implementation is quite simple (here without the lockdep checks):

static inline void raw_write_seqcount_begin(seqcount_t *s)
{
s->sequence++;
smp_wmb();
}

I can't see why this is too complex here, would you elaborate on this ?
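
And the matching read side the SPF handler relies on is the usual seqcount
pattern, roughly (a sketch, the series wraps this in its own helpers):

	unsigned int seq;

	seq = read_seqcount_begin(&vma->vm_sequence);
	/* ... snapshot the VMA fields needed to handle the fault ... */
	if (read_seqcount_retry(&vma->vm_sequence, seq)) {
		/* a writer ran concurrently, fall back to the
		 * classic page fault path */
	}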

>
> It seems that using regular int as counter and also relying on vm_flags
> when vma is unmap should do the trick.

vm_flags is not enough I guess as some operations are not impacting the
vm_flags at all (resizing for instance).
Am I missing something ?

>
> vma_delete(struct vm_area_struct *vma)
> {
> ...
> /*
> * Make sure the vma is mark as invalid ie neither read nor write
> * so that speculative fault back off. A racing speculative fault
> * will either see the flags as 0 or the new seqcount.
> */
> vma->vm_flags = 0;
> smp_wmb();
> vma->seqcount++;
> ...
> }

Well I don't think we can safely clear the vm_flags this way when the
VMA is unmapped, I think it is used later when cleaning is done.

Later in this series, the VMA deletion is managed when the VMA is
unlinked from the RB Tree. That is checked using the vm_rb field's
value, and managed using RCU.

> Then:
> speculative_fault_begin(struct vm_area_struct *vma,
> struct spec_vmf *spvmf)
> {
> ...
> spvmf->seqcount = vma->seqcount;
> smp_rmb();
> spvmf->vm_flags = vma->vm_flags;
> if (!spvmf->vm_flags) {
> // Back off the vma is dying ...
> ...
> }
> }
>
> bool speculative_fault_commit(struct vm_area_struct *vma,
> struct spec_vmf *spvmf)
> {
> ...
> seqcount = vma->seqcount;
> smp_rmb();
> vm_flags = vma->vm_flags;
>
> if (spvmf->vm_flags != vm_flags || seqcount != spvmf->seqcount) {
> // Something did change for the vma
> return false;
> }
> return true;
> }
>
> This would also avoid the lockdep issue described below. But maybe what
> i propose is stupid and i will see it after further reviewing thing.

That's true that the lockdep is quite annoying here. But it is still
interesting to keep it in the loop to avoid 2 subsequent
write_seqcount_begin() calls being made in the same context (which would
lead to an even sequence counter value while a write operation is in
progress). So I think it is still a good thing to have lockdep
available here.
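
In other words, lockdep helps catch a nesting like this (sketch), where the
counter becomes even again while a write is still in flight, so a reader
would wrongly see a stable VMA:

	write_seqcount_begin(&vma->vm_sequence);	/* sequence is now odd */
	...
	write_seqcount_begin(&vma->vm_sequence);	/* nested: even again! */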



>
> Cheers,
> Jérôme
>
>
>>
>> Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
>>
>> [Port to 4.12 kernel]
>> [Build depends on CONFIG_SPECULATIVE_PAGE_FAULT]
>> [Introduce vm_write_* inline function depending on
>> CONFIG_SPECULATIVE_PAGE_FAULT]
>> [Fix lock dependency between mapping->i_mmap_rwsem and vma->vm_sequence by
>> using vm_raw_write* functions]
>> [Fix a lock dependency warning in mmap_region() when entering the error
>> path]
>> [move sequence initialisation INIT_VMA()]
>> [Review the patch description about unmap_page_range()]
>> Signed-off-by: Laurent Dufour <[email protected]>
>> ---
>> include/linux/mm.h | 44 ++++++++++++++++++++++++++++++++++++++++
>> include/linux/mm_types.h | 3 +++
>> mm/memory.c | 2 ++
>> mm/mmap.c | 30 +++++++++++++++++++++++++++
>> 4 files changed, 79 insertions(+)
>>
>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>> index 2ceb1d2869a6..906b9e06f18e 100644
>> --- a/include/linux/mm.h
>> +++ b/include/linux/mm.h
>> @@ -1410,6 +1410,9 @@ struct zap_details {
>> static inline void INIT_VMA(struct vm_area_struct *vma)
>> {
>> INIT_LIST_HEAD(&vma->anon_vma_chain);
>> +#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
>> + seqcount_init(&vma->vm_sequence);
>> +#endif
>> }
>>
>> struct page *_vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
>> @@ -1534,6 +1537,47 @@ static inline void unmap_shared_mapping_range(struct address_space *mapping,
>> unmap_mapping_range(mapping, holebegin, holelen, 0);
>> }
>>
>> +#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
>> +static inline void vm_write_begin(struct vm_area_struct *vma)
>> +{
>> + write_seqcount_begin(&vma->vm_sequence);
>> +}
>> +static inline void vm_write_begin_nested(struct vm_area_struct *vma,
>> + int subclass)
>> +{
>> + write_seqcount_begin_nested(&vma->vm_sequence, subclass);
>> +}
>> +static inline void vm_write_end(struct vm_area_struct *vma)
>> +{
>> + write_seqcount_end(&vma->vm_sequence);
>> +}
>> +static inline void vm_raw_write_begin(struct vm_area_struct *vma)
>> +{
>> + raw_write_seqcount_begin(&vma->vm_sequence);
>> +}
>> +static inline void vm_raw_write_end(struct vm_area_struct *vma)
>> +{
>> + raw_write_seqcount_end(&vma->vm_sequence);
>> +}
>> +#else
>> +static inline void vm_write_begin(struct vm_area_struct *vma)
>> +{
>> +}
>> +static inline void vm_write_begin_nested(struct vm_area_struct *vma,
>> + int subclass)
>> +{
>> +}
>> +static inline void vm_write_end(struct vm_area_struct *vma)
>> +{
>> +}
>> +static inline void vm_raw_write_begin(struct vm_area_struct *vma)
>> +{
>> +}
>> +static inline void vm_raw_write_end(struct vm_area_struct *vma)
>> +{
>> +}
>> +#endif /* CONFIG_SPECULATIVE_PAGE_FAULT */
>> +
>> extern int access_process_vm(struct task_struct *tsk, unsigned long addr,
>> void *buf, int len, unsigned int gup_flags);
>> extern int access_remote_vm(struct mm_struct *mm, unsigned long addr,
>> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
>> index fd7d38ee2e33..e78f72eb2576 100644
>> --- a/include/linux/mm_types.h
>> +++ b/include/linux/mm_types.h
>> @@ -337,6 +337,9 @@ struct vm_area_struct {
>> struct mempolicy *vm_policy; /* NUMA policy for the VMA */
>> #endif
>> struct vm_userfaultfd_ctx vm_userfaultfd_ctx;
>> +#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
>> + seqcount_t vm_sequence;
>> +#endif
>> } __randomize_layout;
>>
>> struct core_thread {
>> diff --git a/mm/memory.c b/mm/memory.c
>> index d5bebca47d98..423fa8ea0569 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -1256,6 +1256,7 @@ void unmap_page_range(struct mmu_gather *tlb,
>> unsigned long next;
>>
>> BUG_ON(addr >= end);
>> + vm_write_begin(vma);
>> tlb_start_vma(tlb, vma);
>> pgd = pgd_offset(vma->vm_mm, addr);
>> do {
>> @@ -1265,6 +1266,7 @@ void unmap_page_range(struct mmu_gather *tlb,
>> next = zap_p4d_range(tlb, vma, pgd, addr, next, details);
>> } while (pgd++, addr = next, addr != end);
>> tlb_end_vma(tlb, vma);
>> + vm_write_end(vma);
>> }
>>
>>
>> diff --git a/mm/mmap.c b/mm/mmap.c
>> index 5ad3a3228d76..a4e4d52a5148 100644
>> --- a/mm/mmap.c
>> +++ b/mm/mmap.c
>> @@ -726,6 +726,30 @@ int __vma_adjust(struct vm_area_struct *vma, unsigned long start,
>> long adjust_next = 0;
>> int remove_next = 0;
>>
>> + /*
>> + * Why using vm_raw_write*() functions here to avoid lockdep's warning ?
>> + *
>> + * Locked is complaining about a theoretical lock dependency, involving
>> + * 3 locks:
>> + * mapping->i_mmap_rwsem --> vma->vm_sequence --> fs_reclaim
>> + *
>> + * Here are the major path leading to this dependency :
>> + * 1. __vma_adjust() mmap_sem -> vm_sequence -> i_mmap_rwsem
>> + * 2. move_vmap() mmap_sem -> vm_sequence -> fs_reclaim
>> + * 3. __alloc_pages_nodemask() fs_reclaim -> i_mmap_rwsem
>> + * 4. unmap_mapping_range() i_mmap_rwsem -> vm_sequence
>> + *
>> + * So there is no way to solve this easily, especially because in
>> + * unmap_mapping_range() the i_mmap_rwsem is grab while the impacted
>> + * VMAs are not yet known.
>> + * However, the way the vm_seq is used is guarantying that we will
>> + * never block on it since we just check for its value and never wait
>> + * for it to move, see vma_has_changed() and handle_speculative_fault().
>> + */
>> + vm_raw_write_begin(vma);
>> + if (next)
>> + vm_raw_write_begin(next);
>> +
>> if (next && !insert) {
>> struct vm_area_struct *exporter = NULL, *importer = NULL;
>>
>> @@ -950,6 +974,8 @@ int __vma_adjust(struct vm_area_struct *vma, unsigned long start,
>> * "vma->vm_next" gap must be updated.
>> */
>> next = vma->vm_next;
>> + if (next)
>> + vm_raw_write_begin(next);
>> } else {
>> /*
>> * For the scope of the comment "next" and
>> @@ -996,6 +1022,10 @@ int __vma_adjust(struct vm_area_struct *vma, unsigned long start,
>> if (insert && file)
>> uprobe_mmap(insert);
>>
>> + if (next && next != vma)
>> + vm_raw_write_end(next);
>> + vm_raw_write_end(vma);
>> +
>> validate_mm(mm);
>>
>> return 0;
>> --
>> 2.21.0
>>
>


2019-04-22 15:54:17

by Jerome Glisse

[permalink] [raw]
Subject: Re: [PATCH v12 09/31] mm: VMA sequence count

On Fri, Apr 19, 2019 at 05:45:57PM +0200, Laurent Dufour wrote:
> Hi Jerome,
>
> Thanks a lot for reviewing this series.
>
> Le 19/04/2019 à 00:48, Jerome Glisse a écrit :
> > On Tue, Apr 16, 2019 at 03:45:00PM +0200, Laurent Dufour wrote:
> > > From: Peter Zijlstra <[email protected]>
> > >
> > > Wrap the VMA modifications (vma_adjust/unmap_page_range) with sequence
> > > counts such that we can easily test if a VMA is changed.
> > >
> > > The calls to vm_write_begin/end() in unmap_page_range() are
> > > used to detect when a VMA is being unmap and thus that new page fault
> > > should not be satisfied for this VMA. If the seqcount hasn't changed when
> > > the page table are locked, this means we are safe to satisfy the page
> > > fault.
> > >
> > > The flip side is that we cannot distinguish between a vma_adjust() and
> > > the unmap_page_range() -- where with the former we could have
> > > re-checked the vma bounds against the address.
> > >
> > > The VMA's sequence counter is also used to detect change to various VMA's
> > > fields used during the page fault handling, such as:
> > > - vm_start, vm_end
> > > - vm_pgoff
> > > - vm_flags, vm_page_prot
> > > - vm_policy
> >
> > ^ All above are under mmap write lock ?
>
> Yes, changes are still made under the protection of the mmap_sem.
>
> >
> > > - anon_vma
> >
> > ^ This is either under mmap write lock or under page table lock
> >
> > So my question is do we need the complexity of seqcount_t for this ?
>
> The sequence counter is used to detect write operation done while readers
> (SPF handler) is running.
>
> The implementation is quite simple (here without the lockdep checks):
>
> static inline void raw_write_seqcount_begin(seqcount_t *s)
> {
> s->sequence++;
> smp_wmb();
> }
>
> I can't see why this is too complex here, would you elaborate on this ?
>
> >
> > It seems that using regular int as counter and also relying on vm_flags
> > when vma is unmap should do the trick.
>
> vm_flags is not enough I guess an some operation are not impacting the
> vm_flags at all (resizing for instance).
> Am I missing something ?
>
> >
> > vma_delete(struct vm_area_struct *vma)
> > {
> > ...
> > /*
> > * Make sure the vma is mark as invalid ie neither read nor write
> > * so that speculative fault back off. A racing speculative fault
> > * will either see the flags as 0 or the new seqcount.
> > */
> > vma->vm_flags = 0;
> > smp_wmb();
> > vma->seqcount++;
> > ...
> > }
>
> Well I don't think we can safely clear the vm_flags this way when the VMA is
> unmap, I think it is used later when cleaning is doen.
>
> Later in this series, the VMA deletion is managed when the VMA is unlinked
> from the RB Tree. That is checked using the vm_rb field's value, and managed
> using RCU.
>
> > Then:
> > speculative_fault_begin(struct vm_area_struct *vma,
> > struct spec_vmf *spvmf)
> > {
> > ...
> > spvmf->seqcount = vma->seqcount;
> > smp_rmb();
> > spvmf->vm_flags = vma->vm_flags;
> > if (!spvmf->vm_flags) {
> > // Back off the vma is dying ...
> > ...
> > }
> > }
> >
> > bool speculative_fault_commit(struct vm_area_struct *vma,
> > struct spec_vmf *spvmf)
> > {
> > ...
> > seqcount = vma->seqcount;
> > smp_rmb();
> > vm_flags = vma->vm_flags;
> >
> > if (spvmf->vm_flags != vm_flags || seqcount != spvmf->seqcount) {
> > // Something did change for the vma
> > return false;
> > }
> > return true;
> > }
> >
> > This would also avoid the lockdep issue described below. But maybe what
> > i propose is stupid and i will see it after further reviewing thing.
>
> That's true that the lockdep is quite annoying here. But it is still
> interesting to keep in the loop to avoid 2 subsequent write_seqcount_begin()
> call being made in the same context (which would lead to an even sequence
> counter value while write operation is in progress). So I think this is
> still a good thing to have lockdep available here.

Ok so i had to read everything and i should have read everything before
asking all of the above. It does look good in fact, what worried me in
this patch is all the lockdep avoidance as it is usually a red flag.

But after thinking long and hard i do not see how to easily solve that
one as unmap_page_range() is in so many different paths... So what is done
in this patch is the most sane thing. Sorry for the noise.

So for this patch:

Reviewed-by: Jérôme Glisse <[email protected]>

2019-04-22 20:11:10

by Jerome Glisse

[permalink] [raw]
Subject: Re: [PATCH v12 14/31] mm/migrate: Pass vm_fault pointer to migrate_misplaced_page()

On Tue, Apr 16, 2019 at 03:45:05PM +0200, Laurent Dufour wrote:
> migrate_misplaced_page() is only called during the page fault handling so
> it's better to pass a pointer to the struct vm_fault instead of the vma.
>
> This way, during the speculative page fault path, the saved vma->vm_flags
> can be used.
>
> Acked-by: David Rientjes <[email protected]>
> Signed-off-by: Laurent Dufour <[email protected]>

Reviewed-by: Jérôme Glisse <[email protected]>

> ---
> include/linux/migrate.h | 4 ++--
> mm/memory.c | 2 +-
> mm/migrate.c | 4 ++--
> 3 files changed, 5 insertions(+), 5 deletions(-)
>
> diff --git a/include/linux/migrate.h b/include/linux/migrate.h
> index e13d9bf2f9a5..0197e40325f8 100644
> --- a/include/linux/migrate.h
> +++ b/include/linux/migrate.h
> @@ -125,14 +125,14 @@ static inline void __ClearPageMovable(struct page *page)
> #ifdef CONFIG_NUMA_BALANCING
> extern bool pmd_trans_migrating(pmd_t pmd);
> extern int migrate_misplaced_page(struct page *page,
> - struct vm_area_struct *vma, int node);
> + struct vm_fault *vmf, int node);
> #else
> static inline bool pmd_trans_migrating(pmd_t pmd)
> {
> return false;
> }
> static inline int migrate_misplaced_page(struct page *page,
> - struct vm_area_struct *vma, int node)
> + struct vm_fault *vmf, int node)
> {
> return -EAGAIN; /* can't migrate now */
> }
> diff --git a/mm/memory.c b/mm/memory.c
> index d0de58464479..56802850e72c 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3747,7 +3747,7 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
> }
>
> /* Migrate to the requested node */
> - migrated = migrate_misplaced_page(page, vma, target_nid);
> + migrated = migrate_misplaced_page(page, vmf, target_nid);
> if (migrated) {
> page_nid = target_nid;
> flags |= TNF_MIGRATED;
> diff --git a/mm/migrate.c b/mm/migrate.c
> index a9138093a8e2..633bd9abac54 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -1938,7 +1938,7 @@ bool pmd_trans_migrating(pmd_t pmd)
> * node. Caller is expected to have an elevated reference count on
> * the page that will be dropped by this function before returning.
> */
> -int migrate_misplaced_page(struct page *page, struct vm_area_struct *vma,
> +int migrate_misplaced_page(struct page *page, struct vm_fault *vmf,
> int node)
> {
> pg_data_t *pgdat = NODE_DATA(node);
> @@ -1951,7 +1951,7 @@ int migrate_misplaced_page(struct page *page, struct vm_area_struct *vma,
> * with execute permissions as they are probably shared libraries.
> */
> if (page_mapcount(page) != 1 && page_is_file_cache(page) &&
> - (vma->vm_flags & VM_EXEC))
> + (vmf->vma_flags & VM_EXEC))
> goto out;
>
> /*
> --
> 2.21.0
>

2019-04-22 20:18:16

by Jerome Glisse

[permalink] [raw]
Subject: Re: [PATCH v12 16/31] mm: introduce __vm_normal_page()

On Tue, Apr 16, 2019 at 03:45:07PM +0200, Laurent Dufour wrote:
> When dealing with the speculative fault path we should use the VMA's
> cached field values stored in the vm_fault structure.
>
> Currently vm_normal_page() is using the pointer to the VMA to fetch the
> vm_flags value. This patch provides a new __vm_normal_page() which
> receives the vm_flags value as a parameter.
>
> Note: The speculative path is only turned on for architectures providing
> support for the special PTE flag. So only the first block of
> vm_normal_page() is used during the speculative path.
>
> Signed-off-by: Laurent Dufour <[email protected]>

Reviewed-by: Jérôme Glisse <[email protected]>

> ---
> include/linux/mm.h | 18 +++++++++++++++---
> mm/memory.c | 21 ++++++++++++---------
> 2 files changed, 27 insertions(+), 12 deletions(-)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index f465bb2b049e..f14b2c9ddfd4 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1421,9 +1421,21 @@ static inline void INIT_VMA(struct vm_area_struct *vma)
> #endif
> }
>
> -struct page *_vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
> - pte_t pte, bool with_public_device);
> -#define vm_normal_page(vma, addr, pte) _vm_normal_page(vma, addr, pte, false)
> +struct page *__vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
> + pte_t pte, bool with_public_device,
> + unsigned long vma_flags);
> +static inline struct page *_vm_normal_page(struct vm_area_struct *vma,
> + unsigned long addr, pte_t pte,
> + bool with_public_device)
> +{
> + return __vm_normal_page(vma, addr, pte, with_public_device,
> + vma->vm_flags);
> +}
> +static inline struct page *vm_normal_page(struct vm_area_struct *vma,
> + unsigned long addr, pte_t pte)
> +{
> + return _vm_normal_page(vma, addr, pte, false);
> +}
>
> struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr,
> pmd_t pmd);
> diff --git a/mm/memory.c b/mm/memory.c
> index 85ec5ce5c0a8..be93f2c8ebe0 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -533,7 +533,8 @@ static void print_bad_pte(struct vm_area_struct *vma, unsigned long addr,
> }
>
> /*
> - * vm_normal_page -- This function gets the "struct page" associated with a pte.
> + * __vm_normal_page -- This function gets the "struct page" associated with
> + * a pte.
> *
> * "Special" mappings do not wish to be associated with a "struct page" (either
> * it doesn't exist, or it exists but they don't want to touch it). In this
> @@ -574,8 +575,9 @@ static void print_bad_pte(struct vm_area_struct *vma, unsigned long addr,
> * PFNMAP mappings in order to support COWable mappings.
> *
> */
> -struct page *_vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
> - pte_t pte, bool with_public_device)
> +struct page *__vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
> + pte_t pte, bool with_public_device,
> + unsigned long vma_flags)
> {
> unsigned long pfn = pte_pfn(pte);
>
> @@ -584,7 +586,7 @@ struct page *_vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
> goto check_pfn;
> if (vma->vm_ops && vma->vm_ops->find_special_page)
> return vma->vm_ops->find_special_page(vma, addr);
> - if (vma->vm_flags & (VM_PFNMAP | VM_MIXEDMAP))
> + if (vma_flags & (VM_PFNMAP | VM_MIXEDMAP))
> return NULL;
> if (is_zero_pfn(pfn))
> return NULL;
> @@ -620,8 +622,8 @@ struct page *_vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
>
> /* !CONFIG_ARCH_HAS_PTE_SPECIAL case follows: */
>
> - if (unlikely(vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP))) {
> - if (vma->vm_flags & VM_MIXEDMAP) {
> + if (unlikely(vma_flags & (VM_PFNMAP|VM_MIXEDMAP))) {
> + if (vma_flags & VM_MIXEDMAP) {
> if (!pfn_valid(pfn))
> return NULL;
> goto out;
> @@ -630,7 +632,7 @@ struct page *_vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
> off = (addr - vma->vm_start) >> PAGE_SHIFT;
> if (pfn == vma->vm_pgoff + off)
> return NULL;
> - if (!is_cow_mapping(vma->vm_flags))
> + if (!is_cow_mapping(vma_flags))
> return NULL;
> }
> }
> @@ -2532,7 +2534,8 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf)
> {
> struct vm_area_struct *vma = vmf->vma;
>
> - vmf->page = vm_normal_page(vma, vmf->address, vmf->orig_pte);
> + vmf->page = __vm_normal_page(vma, vmf->address, vmf->orig_pte, false,
> + vmf->vma_flags);
> if (!vmf->page) {
> /*
> * VM_MIXEDMAP !pfn_valid() case, or VM_SOFTDIRTY clear on a
> @@ -3706,7 +3709,7 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
> ptep_modify_prot_commit(vma, vmf->address, vmf->pte, old_pte, pte);
> update_mmu_cache(vma, vmf->address, vmf->pte);
>
> - page = vm_normal_page(vma, vmf->address, pte);
> + page = __vm_normal_page(vma, vmf->address, pte, false, vmf->vma_flags);
> if (!page) {
> pte_unmap_unlock(vmf->pte, vmf->ptl);
> return 0;
> --
> 2.21.0
>

2019-04-22 23:06:17

by Jerome Glisse

[permalink] [raw]
Subject: Re: [PATCH v12 11/31] mm: protect mremap() against SPF hanlder

On Tue, Apr 16, 2019 at 03:45:02PM +0200, Laurent Dufour wrote:
> If a thread is remapping an area while another one is faulting on the
> destination area, the SPF handler may fetch the vma from the RB tree before
> the pte has been moved by the other thread. This means that the moved ptes
> will overwrite those created by the page fault handler, leading to pages
> being leaked.
>
> CPU 1                                  CPU 2
> enter mremap()
> unmap the dest area
> copy_vma()                             Enter speculative page fault handler
>    >> at this time the dest area is present in the RB tree
>                                        fetch the vma matching dest area
>                                        create a pte as the VMA matched
>                                        Exit the SPF handler
>                                        <data written in the new page>
> move_ptes()
>   > it is assumed that the dest area is empty,
>   > the move ptes overwrite the page mapped by the CPU2.
>
> To prevent that, when the VMA matching the dest area is extended or created
> by copy_vma(), it should be marked as not available to the SPF handler.
> The usual way to do so is to rely on vm_write_begin()/end().
> This is already done in __vma_adjust() called by copy_vma() (through
> vma_merge()). But __vma_adjust() is calling vm_write_end() before returning,
> which creates a window for another thread.
> This patch adds a new parameter to vma_merge() which is passed down to
> vma_adjust().
> The assumption is that copy_vma() is returning a vma which should be
> released by calling vm_raw_write_end() by the caller once the ptes have
> been moved.
>
> Signed-off-by: Laurent Dufour <[email protected]>

Reviewed-by: Jérôme Glisse <[email protected]>

Small comment about a comment below but it can be fixed as a fixup
patch, nothing earth shattering.
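
To restate the contract this patch establishes, the expected call pattern is
roughly the following sketch (see move_vma() in the diff below):

	new_vma = copy_vma(&vma, new_addr, new_len, new_pgoff, &need_rmap_locks);
	/* new_vma is returned with vm_raw_write_begin(new_vma) held, so the
	 * SPF handler cannot yet service faults on the destination area. */
	moved_len = move_page_tables(vma, old_addr, new_vma, new_addr, old_len,
				     need_rmap_locks);
	...
	vm_raw_write_end(new_vma);	/* only now may SPF use new_vma */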

> ---
> include/linux/mm.h | 24 ++++++++++++++++-----
> mm/mmap.c | 53 +++++++++++++++++++++++++++++++++++-----------
> mm/mremap.c | 13 ++++++++++++
> 3 files changed, 73 insertions(+), 17 deletions(-)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 906b9e06f18e..5d45b7d8718d 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -2343,18 +2343,32 @@ void anon_vma_interval_tree_verify(struct anon_vma_chain *node);
>
> /* mmap.c */
> extern int __vm_enough_memory(struct mm_struct *mm, long pages, int cap_sys_admin);
> +
> extern int __vma_adjust(struct vm_area_struct *vma, unsigned long start,
> unsigned long end, pgoff_t pgoff, struct vm_area_struct *insert,
> - struct vm_area_struct *expand);
> + struct vm_area_struct *expand, bool keep_locked);
> +
> static inline int vma_adjust(struct vm_area_struct *vma, unsigned long start,
> unsigned long end, pgoff_t pgoff, struct vm_area_struct *insert)
> {
> - return __vma_adjust(vma, start, end, pgoff, insert, NULL);
> + return __vma_adjust(vma, start, end, pgoff, insert, NULL, false);
> }
> -extern struct vm_area_struct *vma_merge(struct mm_struct *,
> +
> +extern struct vm_area_struct *__vma_merge(struct mm_struct *mm,
> + struct vm_area_struct *prev, unsigned long addr, unsigned long end,
> + unsigned long vm_flags, struct anon_vma *anon, struct file *file,
> + pgoff_t pgoff, struct mempolicy *mpol,
> + struct vm_userfaultfd_ctx uff, bool keep_locked);
> +
> +static inline struct vm_area_struct *vma_merge(struct mm_struct *mm,
> struct vm_area_struct *prev, unsigned long addr, unsigned long end,
> - unsigned long vm_flags, struct anon_vma *, struct file *, pgoff_t,
> - struct mempolicy *, struct vm_userfaultfd_ctx);
> + unsigned long vm_flags, struct anon_vma *anon, struct file *file,
> + pgoff_t off, struct mempolicy *pol, struct vm_userfaultfd_ctx uff)
> +{
> + return __vma_merge(mm, prev, addr, end, vm_flags, anon, file, off,
> + pol, uff, false);
> +}
> +
> extern struct anon_vma *find_mergeable_anon_vma(struct vm_area_struct *);
> extern int __split_vma(struct mm_struct *, struct vm_area_struct *,
> unsigned long addr, int new_below);
> diff --git a/mm/mmap.c b/mm/mmap.c
> index b77ec0149249..13460b38b0fb 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -714,7 +714,7 @@ static inline void __vma_unlink_prev(struct mm_struct *mm,
> */
> int __vma_adjust(struct vm_area_struct *vma, unsigned long start,
> unsigned long end, pgoff_t pgoff, struct vm_area_struct *insert,
> - struct vm_area_struct *expand)
> + struct vm_area_struct *expand, bool keep_locked)
> {
> struct mm_struct *mm = vma->vm_mm;
> struct vm_area_struct *next = vma->vm_next, *orig_vma = vma;
> @@ -830,8 +830,12 @@ int __vma_adjust(struct vm_area_struct *vma, unsigned long start,
>
> importer->anon_vma = exporter->anon_vma;
> error = anon_vma_clone(importer, exporter);
> - if (error)
> + if (error) {
> + if (next && next != vma)
> + vm_raw_write_end(next);
> + vm_raw_write_end(vma);
> return error;
> + }
> }
> }
> again:
> @@ -1025,7 +1029,8 @@ int __vma_adjust(struct vm_area_struct *vma, unsigned long start,
>
> if (next && next != vma)
> vm_raw_write_end(next);
> - vm_raw_write_end(vma);
> + if (!keep_locked)
> + vm_raw_write_end(vma);
>
> validate_mm(mm);
>
> @@ -1161,12 +1166,13 @@ can_vma_merge_after(struct vm_area_struct *vma, unsigned long vm_flags,
> * parameter) may establish ptes with the wrong permissions of NNNN
> * instead of the right permissions of XXXX.
> */
> -struct vm_area_struct *vma_merge(struct mm_struct *mm,
> +struct vm_area_struct *__vma_merge(struct mm_struct *mm,
> struct vm_area_struct *prev, unsigned long addr,
> unsigned long end, unsigned long vm_flags,
> struct anon_vma *anon_vma, struct file *file,
> pgoff_t pgoff, struct mempolicy *policy,
> - struct vm_userfaultfd_ctx vm_userfaultfd_ctx)
> + struct vm_userfaultfd_ctx vm_userfaultfd_ctx,
> + bool keep_locked)
> {
> pgoff_t pglen = (end - addr) >> PAGE_SHIFT;
> struct vm_area_struct *area, *next;
> @@ -1214,10 +1220,11 @@ struct vm_area_struct *vma_merge(struct mm_struct *mm,
> /* cases 1, 6 */
> err = __vma_adjust(prev, prev->vm_start,
> next->vm_end, prev->vm_pgoff, NULL,
> - prev);
> + prev, keep_locked);
> } else /* cases 2, 5, 7 */
> err = __vma_adjust(prev, prev->vm_start,
> - end, prev->vm_pgoff, NULL, prev);
> + end, prev->vm_pgoff, NULL, prev,
> + keep_locked);
> if (err)
> return NULL;
> khugepaged_enter_vma_merge(prev, vm_flags);
> @@ -1234,10 +1241,12 @@ struct vm_area_struct *vma_merge(struct mm_struct *mm,
> vm_userfaultfd_ctx)) {
> if (prev && addr < prev->vm_end) /* case 4 */
> err = __vma_adjust(prev, prev->vm_start,
> - addr, prev->vm_pgoff, NULL, next);
> + addr, prev->vm_pgoff, NULL, next,
> + keep_locked);
> else { /* cases 3, 8 */
> err = __vma_adjust(area, addr, next->vm_end,
> - next->vm_pgoff - pglen, NULL, next);
> + next->vm_pgoff - pglen, NULL, next,
> + keep_locked);
> /*
> * In case 3 area is already equal to next and
> * this is a noop, but in case 8 "area" has
> @@ -3259,9 +3268,20 @@ struct vm_area_struct *copy_vma(struct vm_area_struct **vmap,
>
> if (find_vma_links(mm, addr, addr + len, &prev, &rb_link, &rb_parent))
> return NULL; /* should never get here */
> - new_vma = vma_merge(mm, prev, addr, addr + len, vma->vm_flags,
> - vma->anon_vma, vma->vm_file, pgoff, vma_policy(vma),
> - vma->vm_userfaultfd_ctx);
> +
> + /* There is 3 cases to manage here in
> + * AAAA AAAA AAAA AAAA
> + * PPPP.... PPPP......NNNN PPPP....NNNN PP........NN
> + * PPPPPPPP(A) PPPP..NNNNNNNN(B) PPPPPPPPPPPP(1) NULL
> + * PPPPPPPPNNNN(2)
> + * PPPPNNNNNNNN(3)
> + *
> + * new_vma == prev in case A,1,2
> + * new_vma == next in case B,3
> + */
> + new_vma = __vma_merge(mm, prev, addr, addr + len, vma->vm_flags,
> + vma->anon_vma, vma->vm_file, pgoff,
> + vma_policy(vma), vma->vm_userfaultfd_ctx, true);
> if (new_vma) {
> /*
> * Source vma may have been merged into new_vma
> @@ -3299,6 +3319,15 @@ struct vm_area_struct *copy_vma(struct vm_area_struct **vmap,
> get_file(new_vma->vm_file);
> if (new_vma->vm_ops && new_vma->vm_ops->open)
> new_vma->vm_ops->open(new_vma);
> + /*
> + * As the VMA is linked right now, it may be hit by the
> + * speculative page fault handler. But we don't want it to
> + * to start mapping page in this area until the caller has
> + * potentially move the pte from the moved VMA. To prevent
> + * that we protect it right now, and let the caller unprotect
> + * it once the move is done.
> + */

It would be better to say:
/*
* Block speculative page fault on the new VMA before "linking" it, as
* once it is linked it may be hit by a speculative page fault.
* But we don't want it to start mapping pages in this area until the
* caller has potentially moved the ptes from the moved VMA. To prevent
* that we protect it before linking and let the caller unprotect it
* once the move is done.
*/


> + vm_raw_write_begin(new_vma);
> vma_link(mm, new_vma, prev, rb_link, rb_parent);
> *need_rmap_locks = false;
> }
> diff --git a/mm/mremap.c b/mm/mremap.c
> index fc241d23cd97..ae5c3379586e 100644
> --- a/mm/mremap.c
> +++ b/mm/mremap.c
> @@ -357,6 +357,14 @@ static unsigned long move_vma(struct vm_area_struct *vma,
> if (!new_vma)
> return -ENOMEM;
>
> + /* new_vma is returned protected by copy_vma, to prevent speculative
> + * page fault to be done in the destination area before we move the pte.
> + * Now, we must also protect the source VMA since we don't want pages
> + * to be mapped in our back while we are copying the PTEs.
> + */
> + if (vma != new_vma)
> + vm_raw_write_begin(vma);
> +
> moved_len = move_page_tables(vma, old_addr, new_vma, new_addr, old_len,
> need_rmap_locks);
> if (moved_len < old_len) {
> @@ -373,6 +381,8 @@ static unsigned long move_vma(struct vm_area_struct *vma,
> */
> move_page_tables(new_vma, new_addr, vma, old_addr, moved_len,
> true);
> + if (vma != new_vma)
> + vm_raw_write_end(vma);
> vma = new_vma;
> old_len = new_len;
> old_addr = new_addr;
> @@ -381,7 +391,10 @@ static unsigned long move_vma(struct vm_area_struct *vma,
> mremap_userfaultfd_prep(new_vma, uf);
> arch_remap(mm, old_addr, old_addr + old_len,
> new_addr, new_addr + new_len);
> + if (vma != new_vma)
> + vm_raw_write_end(vma);
> }
> + vm_raw_write_end(new_vma);
>
> /* Conceal VM_ACCOUNT so old reservation is not undone */
> if (vm_flags & VM_ACCOUNT) {
> --
> 2.21.0
>

2019-04-22 23:11:36

by Jerome Glisse

[permalink] [raw]
Subject: Re: [PATCH v12 12/31] mm: protect SPF handler against anon_vma changes

On Tue, Apr 16, 2019 at 03:45:03PM +0200, Laurent Dufour wrote:
> The speculative page fault handler must be protected against anon_vma
> changes. This is because page_add_new_anon_rmap() is called during the
> speculative path.
>
> In addition, don't try a speculative page fault if the VMA doesn't have an
> anon_vma structure allocated because its allocation should be
> protected by the mmap_sem.
>
> In __vma_adjust() when importer->anon_vma is set, there is no need to
> protect against speculative page faults since speculative page fault
> is aborted if the vma->anon_vma is not set.
>
> When calling page_add_new_anon_rmap() vma->anon_vma is necessarily
> valid since we checked for it when locking the pte, and the anon_vma is
> only removed once the pte is unlocked. So even if the speculative page
> fault handler is running concurrently with do_unmap(), the pte is
> locked in unmap_region() - through unmap_vmas() - and the anon_vma is
> unlinked later; because we check the vma sequence counter, which is
> updated in unmap_page_range() before locking the pte and then in
> free_pgtables(), the change will be detected when locking the pte.
>
> Signed-off-by: Laurent Dufour <[email protected]>

Reviewed-by: Jérôme Glisse <[email protected]>

> ---
> mm/memory.c | 4 ++++
> 1 file changed, 4 insertions(+)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 423fa8ea0569..2cf7b6185daa 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -377,7 +377,9 @@ void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *vma,
> * Hide vma from rmap and truncate_pagecache before freeing
> * pgtables
> */
> + vm_write_begin(vma);
> unlink_anon_vmas(vma);
> + vm_write_end(vma);
> unlink_file_vma(vma);
>
> if (is_vm_hugetlb_page(vma)) {
> @@ -391,7 +393,9 @@ void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *vma,
> && !is_vm_hugetlb_page(next)) {
> vma = next;
> next = vma->vm_next;
> + vm_write_begin(vma);
> unlink_anon_vmas(vma);
> + vm_write_end(vma);
> unlink_file_vma(vma);
> }
> free_pgd_range(tlb, addr, vma->vm_end,
> --
> 2.21.0
>

2019-04-23 00:22:24

by Jerome Glisse

[permalink] [raw]
Subject: Re: [PATCH v12 13/31] mm: cache some VMA fields in the vm_fault structure

On Tue, Apr 16, 2019 at 03:45:04PM +0200, Laurent Dufour wrote:
> When handling a speculative page fault, the vma->vm_flags and
> vma->vm_page_prot fields are read once the page table lock is released, so
> there is no longer any guarantee that these fields will not change behind
> our back.
> They will be saved in the vm_fault structure before the VMA is checked for
> changes.
>
> In detail, when we deal with a speculative page fault, the mmap_sem is
> not taken, so parallel VMA changes can occur. When a VMA change is
> done which will impact the page fault processing, we assume that the VMA
> sequence counter will be changed. In the page fault processing, at the
> time the PTE is locked, we check the VMA sequence counter to detect
> changes done behind our back. If no change is detected we can continue
> further. But this doesn't prevent the VMA from being changed behind our
> back while the PTE is locked. So VMA fields which are used while the PTE
> is locked must be saved to ensure that we are using *static* values.
> This is important since the PTE changes will be made with regard to these
> VMA fields and they need to be consistent. This concerns the vma->vm_flags
> and vma->vm_page_prot VMA fields.
>
> This patch also sets the fields in hugetlb_no_page() and
> __collapse_huge_page_swapin() even if it is not needed by the callee.
>
> Signed-off-by: Laurent Dufour <[email protected]>

I am unsure about something, see below, so you might need to update
that one, but it would not change the structure of the patch, thus:

Reviewed-by: Jérôme Glisse <[email protected]>

> ---
> include/linux/mm.h | 10 +++++++--
> mm/huge_memory.c | 6 +++---
> mm/hugetlb.c | 2 ++
> mm/khugepaged.c | 2 ++
> mm/memory.c | 53 ++++++++++++++++++++++++----------------------
> mm/migrate.c | 2 +-
> 6 files changed, 44 insertions(+), 31 deletions(-)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 5d45b7d8718d..f465bb2b049e 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -439,6 +439,12 @@ struct vm_fault {
> * page table to avoid allocation from
> * atomic context.
> */
> + /*
> + * These entries are required when handling speculative page fault.
> + * This way the page handling is done using consistent field values.
> + */
> + unsigned long vma_flags;
> + pgprot_t vma_page_prot;
> };
>
> /* page entry size for vm->huge_fault() */
> @@ -781,9 +787,9 @@ void free_compound_page(struct page *page);
> * pte_mkwrite. But get_user_pages can cause write faults for mappings
> * that do not have writing enabled, when used by access_process_vm.
> */
> -static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
> +static inline pte_t maybe_mkwrite(pte_t pte, unsigned long vma_flags)
> {
> - if (likely(vma->vm_flags & VM_WRITE))
> + if (likely(vma_flags & VM_WRITE))
> pte = pte_mkwrite(pte);
> return pte;
> }
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 823688414d27..865886a689ee 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1244,8 +1244,8 @@ static vm_fault_t do_huge_pmd_wp_page_fallback(struct vm_fault *vmf,
>
> for (i = 0; i < HPAGE_PMD_NR; i++, haddr += PAGE_SIZE) {
> pte_t entry;
> - entry = mk_pte(pages[i], vma->vm_page_prot);
> - entry = maybe_mkwrite(pte_mkdirty(entry), vma);
> + entry = mk_pte(pages[i], vmf->vma_page_prot);
> + entry = maybe_mkwrite(pte_mkdirty(entry), vmf->vma_flags);
> memcg = (void *)page_private(pages[i]);
> set_page_private(pages[i], 0);
> page_add_new_anon_rmap(pages[i], vmf->vma, haddr, false);
> @@ -2228,7 +2228,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
> entry = pte_swp_mksoft_dirty(entry);
> } else {
> entry = mk_pte(page + i, READ_ONCE(vma->vm_page_prot));
> - entry = maybe_mkwrite(entry, vma);
> + entry = maybe_mkwrite(entry, vma->vm_flags);
> if (!write)
> entry = pte_wrprotect(entry);
> if (!young)
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 109f5de82910..13246da4bc50 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -3812,6 +3812,8 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
> .vma = vma,
> .address = haddr,
> .flags = flags,
> + .vma_flags = vma->vm_flags,
> + .vma_page_prot = vma->vm_page_prot,

Shouldn't you use READ_ONCE ? I doubt the compiler will do something creative
with struct initialization, but as you are using WRITE_ONCE to update those
fields, maybe pairing the read with READ_ONCE where the mmap_sem is not always
taken might make sense.
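
i.e. something like (sketch):

		.vma_flags = READ_ONCE(vma->vm_flags),
		.vma_page_prot = READ_ONCE(vma->vm_page_prot),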

> /*
> * Hard to debug if it ends up being
> * used by a callee that assumes
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 6a0cbca3885e..42469037240a 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -888,6 +888,8 @@ static bool __collapse_huge_page_swapin(struct mm_struct *mm,
> .flags = FAULT_FLAG_ALLOW_RETRY,
> .pmd = pmd,
> .pgoff = linear_page_index(vma, address),
> + .vma_flags = vma->vm_flags,
> + .vma_page_prot = vma->vm_page_prot,

Same as above.

[...]

> return VM_FAULT_FALLBACK;
> @@ -3924,6 +3925,8 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
> .flags = flags,
> .pgoff = linear_page_index(vma, address),
> .gfp_mask = __get_fault_gfp_mask(vma),
> + .vma_flags = vma->vm_flags,
> + .vma_page_prot = vma->vm_page_prot,

Same as above

> };
> unsigned int dirty = flags & FAULT_FLAG_WRITE;
> struct mm_struct *mm = vma->vm_mm;
> diff --git a/mm/migrate.c b/mm/migrate.c
> index f2ecc2855a12..a9138093a8e2 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -240,7 +240,7 @@ static bool remove_migration_pte(struct page *page, struct vm_area_struct *vma,
> */
> entry = pte_to_swp_entry(*pvmw.pte);
> if (is_write_migration_entry(entry))
> - pte = maybe_mkwrite(pte, vma);
> + pte = maybe_mkwrite(pte, vma->vm_flags);
>
> if (unlikely(is_zone_device_page(new))) {
> if (is_device_private_page(new)) {
> --
> 2.21.0
>

2019-04-23 00:38:21

by Jerome Glisse

[permalink] [raw]
Subject: Re: [PATCH v12 15/31] mm: introduce __lru_cache_add_active_or_unevictable

On Tue, Apr 16, 2019 at 03:45:06PM +0200, Laurent Dufour wrote:
> The speculative page fault handler, which is run without holding the
> mmap_sem, is calling lru_cache_add_active_or_unevictable() but the vm_flags
> value is not guaranteed to remain constant.
> Introduce __lru_cache_add_active_or_unevictable() which takes the vma flags
> value as a parameter instead of the vma pointer.
>
> Acked-by: David Rientjes <[email protected]>
> Signed-off-by: Laurent Dufour <[email protected]>

Reviewed-by: Jérôme Glisse <[email protected]>

> ---
> include/linux/swap.h | 10 ++++++++--
> mm/memory.c | 8 ++++----
> mm/swap.c | 6 +++---
> 3 files changed, 15 insertions(+), 9 deletions(-)
>
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 4bfb5c4ac108..d33b94eb3c69 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -343,8 +343,14 @@ extern void deactivate_file_page(struct page *page);
> extern void mark_page_lazyfree(struct page *page);
> extern void swap_setup(void);
>
> -extern void lru_cache_add_active_or_unevictable(struct page *page,
> - struct vm_area_struct *vma);
> +extern void __lru_cache_add_active_or_unevictable(struct page *page,
> + unsigned long vma_flags);
> +
> +static inline void lru_cache_add_active_or_unevictable(struct page *page,
> + struct vm_area_struct *vma)
> +{
> + return __lru_cache_add_active_or_unevictable(page, vma->vm_flags);
> +}
>
> /* linux/mm/vmscan.c */
> extern unsigned long zone_reclaimable_pages(struct zone *zone);
> diff --git a/mm/memory.c b/mm/memory.c
> index 56802850e72c..85ec5ce5c0a8 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -2347,7 +2347,7 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
> ptep_clear_flush_notify(vma, vmf->address, vmf->pte);
> page_add_new_anon_rmap(new_page, vma, vmf->address, false);
> mem_cgroup_commit_charge(new_page, memcg, false, false);
> - lru_cache_add_active_or_unevictable(new_page, vma);
> + __lru_cache_add_active_or_unevictable(new_page, vmf->vma_flags);
> /*
> * We call the notify macro here because, when using secondary
> * mmu page tables (such as kvm shadow page tables), we want the
> @@ -2896,7 +2896,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> if (unlikely(page != swapcache && swapcache)) {
> page_add_new_anon_rmap(page, vma, vmf->address, false);
> mem_cgroup_commit_charge(page, memcg, false, false);
> - lru_cache_add_active_or_unevictable(page, vma);
> + __lru_cache_add_active_or_unevictable(page, vmf->vma_flags);
> } else {
> do_page_add_anon_rmap(page, vma, vmf->address, exclusive);
> mem_cgroup_commit_charge(page, memcg, true, false);
> @@ -3048,7 +3048,7 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
> inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES);
> page_add_new_anon_rmap(page, vma, vmf->address, false);
> mem_cgroup_commit_charge(page, memcg, false, false);
> - lru_cache_add_active_or_unevictable(page, vma);
> + __lru_cache_add_active_or_unevictable(page, vmf->vma_flags);
> setpte:
> set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry);
>
> @@ -3327,7 +3327,7 @@ vm_fault_t alloc_set_pte(struct vm_fault *vmf, struct mem_cgroup *memcg,
> inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES);
> page_add_new_anon_rmap(page, vma, vmf->address, false);
> mem_cgroup_commit_charge(page, memcg, false, false);
> - lru_cache_add_active_or_unevictable(page, vma);
> + __lru_cache_add_active_or_unevictable(page, vmf->vma_flags);
> } else {
> inc_mm_counter_fast(vma->vm_mm, mm_counter_file(page));
> page_add_file_rmap(page, false);
> diff --git a/mm/swap.c b/mm/swap.c
> index 3a75722e68a9..a55f0505b563 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -450,12 +450,12 @@ void lru_cache_add(struct page *page)
> * directly back onto it's zone's unevictable list, it does NOT use a
> * per cpu pagevec.
> */
> -void lru_cache_add_active_or_unevictable(struct page *page,
> - struct vm_area_struct *vma)
> +void __lru_cache_add_active_or_unevictable(struct page *page,
> + unsigned long vma_flags)
> {
> VM_BUG_ON_PAGE(PageLRU(page), page);
>
> - if (likely((vma->vm_flags & (VM_LOCKED | VM_SPECIAL)) != VM_LOCKED))
> + if (likely((vma_flags & (VM_LOCKED | VM_SPECIAL)) != VM_LOCKED))
> SetPageActive(page);
> else if (!TestSetPageMlocked(page)) {
> /*
> --
> 2.21.0
>

2019-04-23 00:39:50

by Jerome Glisse

[permalink] [raw]
Subject: Re: [PATCH v12 17/31] mm: introduce __page_add_new_anon_rmap()

On Tue, Apr 16, 2019 at 03:45:08PM +0200, Laurent Dufour wrote:
> When dealing with speculative page fault handler, we may race with VMA
> being split or merged. In this case the vma->vm_start and vm->vm_end
> fields may not match the address the page fault is occurring.
>
> This can only happens when the VMA is split but in that case, the
> anon_vma pointer of the new VMA will be the same as the original one,
> because in __split_vma the new->anon_vma is set to src->anon_vma when
> *new = *vma.
>
> So even if the VMA boundaries are not correct, the anon_vma pointer is
> still valid.
>
> If the VMA has been merged, then the VMA in which it has been merged
> must have the same anon_vma pointer otherwise the merge can't be done.
>
> So in all the case we know that the anon_vma is valid, since we have
> checked before starting the speculative page fault that the anon_vma
> pointer is valid for this VMA and since there is an anon_vma this
> means that at one time a page has been backed and that before the VMA
> is cleaned, the page table lock would have to be grab to clean the
> PTE, and the anon_vma field is checked once the PTE is locked.
>
> This patch introduce a new __page_add_new_anon_rmap() service which
> doesn't check for the VMA boundaries, and create a new inline one
> which do the check.
>
> When called from a page fault handler, if this is not a speculative one,
> there is a guarantee that vm_start and vm_end match the faulting address,
> so this check is useless. In the context of the speculative page fault
> handler, this check may be wrong but anon_vma is still valid as explained
> above.
>
> Signed-off-by: Laurent Dufour <[email protected]>

Reviewed-by: Jérôme Glisse <[email protected]>

> ---
> include/linux/rmap.h | 12 ++++++++++--
> mm/memory.c | 8 ++++----
> mm/rmap.c | 5 ++---
> 3 files changed, 16 insertions(+), 9 deletions(-)
>
> diff --git a/include/linux/rmap.h b/include/linux/rmap.h
> index 988d176472df..a5d282573093 100644
> --- a/include/linux/rmap.h
> +++ b/include/linux/rmap.h
> @@ -174,8 +174,16 @@ void page_add_anon_rmap(struct page *, struct vm_area_struct *,
> unsigned long, bool);
> void do_page_add_anon_rmap(struct page *, struct vm_area_struct *,
> unsigned long, int);
> -void page_add_new_anon_rmap(struct page *, struct vm_area_struct *,
> - unsigned long, bool);
> +void __page_add_new_anon_rmap(struct page *, struct vm_area_struct *,
> + unsigned long, bool);
> +static inline void page_add_new_anon_rmap(struct page *page,
> + struct vm_area_struct *vma,
> + unsigned long address, bool compound)
> +{
> + VM_BUG_ON_VMA(address < vma->vm_start || address >= vma->vm_end, vma);
> + __page_add_new_anon_rmap(page, vma, address, compound);
> +}
> +
> void page_add_file_rmap(struct page *, bool);
> void page_remove_rmap(struct page *, bool);
>
> diff --git a/mm/memory.c b/mm/memory.c
> index be93f2c8ebe0..46f877b6abea 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -2347,7 +2347,7 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
> * thread doing COW.
> */
> ptep_clear_flush_notify(vma, vmf->address, vmf->pte);
> - page_add_new_anon_rmap(new_page, vma, vmf->address, false);
> + __page_add_new_anon_rmap(new_page, vma, vmf->address, false);
> mem_cgroup_commit_charge(new_page, memcg, false, false);
> __lru_cache_add_active_or_unevictable(new_page, vmf->vma_flags);
> /*
> @@ -2897,7 +2897,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>
> /* ksm created a completely new copy */
> if (unlikely(page != swapcache && swapcache)) {
> - page_add_new_anon_rmap(page, vma, vmf->address, false);
> + __page_add_new_anon_rmap(page, vma, vmf->address, false);
> mem_cgroup_commit_charge(page, memcg, false, false);
> __lru_cache_add_active_or_unevictable(page, vmf->vma_flags);
> } else {
> @@ -3049,7 +3049,7 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
> }
>
> inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES);
> - page_add_new_anon_rmap(page, vma, vmf->address, false);
> + __page_add_new_anon_rmap(page, vma, vmf->address, false);
> mem_cgroup_commit_charge(page, memcg, false, false);
> __lru_cache_add_active_or_unevictable(page, vmf->vma_flags);
> setpte:
> @@ -3328,7 +3328,7 @@ vm_fault_t alloc_set_pte(struct vm_fault *vmf, struct mem_cgroup *memcg,
> /* copy-on-write page */
> if (write && !(vmf->vma_flags & VM_SHARED)) {
> inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES);
> - page_add_new_anon_rmap(page, vma, vmf->address, false);
> + __page_add_new_anon_rmap(page, vma, vmf->address, false);
> mem_cgroup_commit_charge(page, memcg, false, false);
> __lru_cache_add_active_or_unevictable(page, vmf->vma_flags);
> } else {
> diff --git a/mm/rmap.c b/mm/rmap.c
> index e5dfe2ae6b0d..2148e8ce6e34 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1140,7 +1140,7 @@ void do_page_add_anon_rmap(struct page *page,
> }
>
> /**
> - * page_add_new_anon_rmap - add pte mapping to a new anonymous page
> + * __page_add_new_anon_rmap - add pte mapping to a new anonymous page
> * @page: the page to add the mapping to
> * @vma: the vm area in which the mapping is added
> * @address: the user virtual address mapped
> @@ -1150,12 +1150,11 @@ void do_page_add_anon_rmap(struct page *page,
> * This means the inc-and-test can be bypassed.
> * Page does not have to be locked.
> */
> -void page_add_new_anon_rmap(struct page *page,
> +void __page_add_new_anon_rmap(struct page *page,
> struct vm_area_struct *vma, unsigned long address, bool compound)
> {
> int nr = compound ? hpage_nr_pages(page) : 1;
>
> - VM_BUG_ON_VMA(address < vma->vm_start || address >= vma->vm_end, vma);
> __SetPageSwapBacked(page);
> if (compound) {
> VM_BUG_ON_PAGE(!PageTransHuge(page), page);
> --
> 2.21.0
>

2019-04-23 00:39:50

by Jerome Glisse

[permalink] [raw]
Subject: Re: [PATCH v12 18/31] mm: protect against PTE changes done by dup_mmap()

On Tue, Apr 16, 2019 at 03:45:09PM +0200, Laurent Dufour wrote:
> Vinayak Menon and Ganesh Mahendran reported that the following scenario may
> lead to thread being blocked due to data corruption:
>
> CPU 1 CPU 2 CPU 3
> Process 1, Process 1, Process 1,
> Thread A Thread B Thread C
>
> while (1) { while (1) { while(1) {
> pthread_mutex_lock(l) pthread_mutex_lock(l) fork
> pthread_mutex_unlock(l) pthread_mutex_unlock(l) }
> } }
>
> In the details this happens because :
>
> CPU 1 CPU 2 CPU 3
> fork()
> copy_pte_range()
> set PTE rdonly
> got to next VMA...
> . PTE is seen rdonly PTE still writable
> . thread is writing to page
> . -> page fault
> . copy the page Thread writes to page
> . . -> no page fault
> . update the PTE
> . flush TLB for that PTE
> flush TLB PTE are now rdonly

Should the fork be on CPU 3 to be consistent with the diagram above (just to
make it easier to read and to go from one diagram to the other, as a thread
can move from one CPU to another)?

>
> So the write done by the CPU 3 is interfering with the page copy operation
> done by CPU 2, leading to the data corruption.
>
> To avoid this we mark all the VMA involved in the COW mechanism as changing
> by calling vm_write_begin(). This ensures that the speculative page fault
> handler will not try to handle a fault on these pages.
> The marker is set until the TLB is flushed, ensuring that all the CPUs will
> now see the PTE as not writable.
> Once the TLB is flush, the marker is removed by calling vm_write_end().
>
> The variable last is used to keep tracked of the latest VMA marked to
> handle the error path where part of the VMA may have been marked.
>
> Since multiple VMA from the same mm may have the sequence count increased
> during this process, the use of the vm_raw_write_begin/end() is required to
> avoid lockdep false warning messages.
>
> Reported-by: Ganesh Mahendran <[email protected]>
> Reported-by: Vinayak Menon <[email protected]>
> Signed-off-by: Laurent Dufour <[email protected]>

A minor comment (see below)

Reviewed-by: Jérôme Glisse <[email protected]>

> ---
> kernel/fork.c | 30 ++++++++++++++++++++++++++++--
> 1 file changed, 28 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/fork.c b/kernel/fork.c
> index f8dae021c2e5..2992d2c95256 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -462,7 +462,7 @@ EXPORT_SYMBOL(free_task);
> static __latent_entropy int dup_mmap(struct mm_struct *mm,
> struct mm_struct *oldmm)
> {
> - struct vm_area_struct *mpnt, *tmp, *prev, **pprev;
> + struct vm_area_struct *mpnt, *tmp, *prev, **pprev, *last = NULL;
> struct rb_node **rb_link, *rb_parent;
> int retval;
> unsigned long charge;
> @@ -581,8 +581,18 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
> rb_parent = &tmp->vm_rb;
>
> mm->map_count++;
> - if (!(tmp->vm_flags & VM_WIPEONFORK))
> + if (!(tmp->vm_flags & VM_WIPEONFORK)) {
> + if (IS_ENABLED(CONFIG_SPECULATIVE_PAGE_FAULT)) {
> + /*
> + * Mark this VMA as changing to prevent the
> + * speculative page fault hanlder to process
> + * it until the TLB are flushed below.
> + */
> + last = mpnt;
> + vm_raw_write_begin(mpnt);
> + }
> retval = copy_page_range(mm, oldmm, mpnt);
> + }
>
> if (tmp->vm_ops && tmp->vm_ops->open)
> tmp->vm_ops->open(tmp);
> @@ -595,6 +605,22 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
> out:
> up_write(&mm->mmap_sem);
> flush_tlb_mm(oldmm);
> +
> + if (IS_ENABLED(CONFIG_SPECULATIVE_PAGE_FAULT)) {

You do not need to check for CONFIG_SPECULATIVE_PAGE_FAULT, as last will
always be NULL if it is not enabled, but maybe the compiler would miss the
optimization opportunity if you only kept the for() loop below.
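
In other words, something like the sketch below (untested; it simply drops the
IS_ENABLED() guard and relies on 'last' staying NULL when the feature is
compiled out):

	/*
	 * 'last' can only be non-NULL when CONFIG_SPECULATIVE_PAGE_FAULT is
	 * enabled, so the loop body is never reached otherwise; the explicit
	 * IS_ENABLED() check only helps the compiler drop the loop entirely.
	 */
	for (; last; last = last->vm_prev) {
		if (last->vm_flags & VM_DONTCOPY)
			continue;
		if (!(last->vm_flags & VM_WIPEONFORK))
			vm_raw_write_end(last);
	}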

> + /*
> + * Since the TLB has been flush, we can safely unmark the
> + * copied VMAs and allows the speculative page fault handler to
> + * process them again.
> + * Walk back the VMA list from the last marked VMA.
> + */
> + for (; last; last = last->vm_prev) {
> + if (last->vm_flags & VM_DONTCOPY)
> + continue;
> + if (!(last->vm_flags & VM_WIPEONFORK))
> + vm_raw_write_end(last);
> + }
> + }
> +
> up_write(&oldmm->mmap_sem);
> dup_userfaultfd_complete(&uf);
> fail_uprobe_end:
> --
> 2.21.0
>

2019-04-23 00:41:07

by Jerome Glisse

[permalink] [raw]
Subject: Re: [PATCH v12 19/31] mm: protect the RB tree with a sequence lock

On Tue, Apr 16, 2019 at 03:45:10PM +0200, Laurent Dufour wrote:
> Introducing a per mm_struct seqlock, mm_seq field, to protect the changes
> made in the MM RB tree. This allows to walk the RB tree without grabbing
> the mmap_sem, and on the walk is done to double check that sequence counter
> was stable during the walk.
>
> The mm seqlock is held while inserting and removing entries into the MM RB
> tree. Later in this series, it will be check when looking for a VMA
> without holding the mmap_sem.
>
> This is based on the initial work from Peter Zijlstra:
> https://lore.kernel.org/linux-mm/[email protected]/
>
> Signed-off-by: Laurent Dufour <[email protected]>

Reviewed-by: Jérôme Glisse <[email protected]>

> ---
> include/linux/mm_types.h | 3 +++
> kernel/fork.c | 3 +++
> mm/init-mm.c | 3 +++
> mm/mmap.c | 48 +++++++++++++++++++++++++++++++---------
> 4 files changed, 46 insertions(+), 11 deletions(-)
>
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index e78f72eb2576..24b3f8ce9e42 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -358,6 +358,9 @@ struct mm_struct {
> struct {
> struct vm_area_struct *mmap; /* list of VMAs */
> struct rb_root mm_rb;
> +#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
> + seqlock_t mm_seq;
> +#endif
> u64 vmacache_seqnum; /* per-thread vmacache */
> #ifdef CONFIG_MMU
> unsigned long (*get_unmapped_area) (struct file *filp,
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 2992d2c95256..3a1739197ebc 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -1008,6 +1008,9 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
> mm->mmap = NULL;
> mm->mm_rb = RB_ROOT;
> mm->vmacache_seqnum = 0;
> +#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
> + seqlock_init(&mm->mm_seq);
> +#endif
> atomic_set(&mm->mm_users, 1);
> atomic_set(&mm->mm_count, 1);
> init_rwsem(&mm->mmap_sem);
> diff --git a/mm/init-mm.c b/mm/init-mm.c
> index a787a319211e..69346b883a4e 100644
> --- a/mm/init-mm.c
> +++ b/mm/init-mm.c
> @@ -27,6 +27,9 @@
> */
> struct mm_struct init_mm = {
> .mm_rb = RB_ROOT,
> +#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
> + .mm_seq = __SEQLOCK_UNLOCKED(init_mm.mm_seq),
> +#endif
> .pgd = swapper_pg_dir,
> .mm_users = ATOMIC_INIT(2),
> .mm_count = ATOMIC_INIT(1),
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 13460b38b0fb..f7f6027a7dff 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -170,6 +170,24 @@ void unlink_file_vma(struct vm_area_struct *vma)
> }
> }
>
> +#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
> +static inline void mm_write_seqlock(struct mm_struct *mm)
> +{
> + write_seqlock(&mm->mm_seq);
> +}
> +static inline void mm_write_sequnlock(struct mm_struct *mm)
> +{
> + write_sequnlock(&mm->mm_seq);
> +}
> +#else
> +static inline void mm_write_seqlock(struct mm_struct *mm)
> +{
> +}
> +static inline void mm_write_sequnlock(struct mm_struct *mm)
> +{
> +}
> +#endif /* CONFIG_SPECULATIVE_PAGE_FAULT */
> +
> /*
> * Close a vm structure and free it, returning the next.
> */
> @@ -445,26 +463,32 @@ static void vma_gap_update(struct vm_area_struct *vma)
> }
>
> static inline void vma_rb_insert(struct vm_area_struct *vma,
> - struct rb_root *root)
> + struct mm_struct *mm)
> {
> + struct rb_root *root = &mm->mm_rb;
> +
> /* All rb_subtree_gap values must be consistent prior to insertion */
> validate_mm_rb(root, NULL);
>
> rb_insert_augmented(&vma->vm_rb, root, &vma_gap_callbacks);
> }
>
> -static void __vma_rb_erase(struct vm_area_struct *vma, struct rb_root *root)
> +static void __vma_rb_erase(struct vm_area_struct *vma, struct mm_struct *mm)
> {
> + struct rb_root *root = &mm->mm_rb;
> +
> /*
> * Note rb_erase_augmented is a fairly large inline function,
> * so make sure we instantiate it only once with our desired
> * augmented rbtree callbacks.
> */
> + mm_write_seqlock(mm);
> rb_erase_augmented(&vma->vm_rb, root, &vma_gap_callbacks);
> + mm_write_sequnlock(mm); /* wmb */
> }
>
> static __always_inline void vma_rb_erase_ignore(struct vm_area_struct *vma,
> - struct rb_root *root,
> + struct mm_struct *mm,
> struct vm_area_struct *ignore)
> {
> /*
> @@ -472,21 +496,21 @@ static __always_inline void vma_rb_erase_ignore(struct vm_area_struct *vma,
> * with the possible exception of the "next" vma being erased if
> * next->vm_start was reduced.
> */
> - validate_mm_rb(root, ignore);
> + validate_mm_rb(&mm->mm_rb, ignore);
>
> - __vma_rb_erase(vma, root);
> + __vma_rb_erase(vma, mm);
> }
>
> static __always_inline void vma_rb_erase(struct vm_area_struct *vma,
> - struct rb_root *root)
> + struct mm_struct *mm)
> {
> /*
> * All rb_subtree_gap values must be consistent prior to erase,
> * with the possible exception of the vma being erased.
> */
> - validate_mm_rb(root, vma);
> + validate_mm_rb(&mm->mm_rb, vma);
>
> - __vma_rb_erase(vma, root);
> + __vma_rb_erase(vma, mm);
> }
>
> /*
> @@ -601,10 +625,12 @@ void __vma_link_rb(struct mm_struct *mm, struct vm_area_struct *vma,
> * immediately update the gap to the correct value. Finally we
> * rebalance the rbtree after all augmented values have been set.
> */
> + mm_write_seqlock(mm);
> rb_link_node(&vma->vm_rb, rb_parent, rb_link);
> vma->rb_subtree_gap = 0;
> vma_gap_update(vma);
> - vma_rb_insert(vma, &mm->mm_rb);
> + vma_rb_insert(vma, mm);
> + mm_write_sequnlock(mm);
> }
>
> static void __vma_link_file(struct vm_area_struct *vma)
> @@ -680,7 +706,7 @@ static __always_inline void __vma_unlink_common(struct mm_struct *mm,
> {
> struct vm_area_struct *next;
>
> - vma_rb_erase_ignore(vma, &mm->mm_rb, ignore);
> + vma_rb_erase_ignore(vma, mm, ignore);
> next = vma->vm_next;
> if (has_prev)
> prev->vm_next = next;
> @@ -2674,7 +2700,7 @@ detach_vmas_to_be_unmapped(struct mm_struct *mm, struct vm_area_struct *vma,
> insertion_point = (prev ? &prev->vm_next : &mm->mmap);
> vma->vm_prev = NULL;
> do {
> - vma_rb_erase(vma, &mm->mm_rb);
> + vma_rb_erase(vma, mm);
> mm->map_count--;
> tail_vma = vma;
> vma = vma->vm_next;
> --
> 2.21.0
>

2019-04-23 00:42:32

by Jerome Glisse

[permalink] [raw]
Subject: Re: [PATCH v12 20/31] mm: introduce vma reference counter

On Tue, Apr 16, 2019 at 03:45:11PM +0200, Laurent Dufour wrote:
> The final goal is to be able to use a VMA structure without holding the
> mmap_sem and to be sure that the structure will not be freed in our back.
>
> The lockless use of the VMA will be done through RCU protection and thus a
> dedicated freeing service is required to manage it asynchronously.
>
> As reported in a 2010's thread [1], this may impact file handling when a
> file is still referenced while the mapping is no more there. As the final
> goal is to handle anonymous VMA in a speculative way and not file backed
> mapping, we could close and free the file pointer in a synchronous way, as
> soon as we are guaranteed to not use it without holding the mmap_sem. For
> sanity reason, in a minimal effort, the vm_file file pointer is unset once
> the file pointer is put.
>
> [1] https://lore.kernel.org/linux-mm/[email protected]/
>
> Signed-off-by: Laurent Dufour <[email protected]>

Using kref would have been better from my POV, even with RCU freeing, but
anyway:
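
For illustration only, a kref-based variant could look like the sketch below
(this is not what the patch does; kref_get()/kref_put() come from
<linux/kref.h> and the vm_ref field name is made up):

	/* in struct vm_area_struct, replacing atomic_t vm_ref_count */
	struct kref vm_ref;

	/* kref_init(&vma->vm_ref) would replace the atomic_set() in INIT_VMA() */

	static void vma_release(struct kref *kref)
	{
		__free_vma(container_of(kref, struct vm_area_struct, vm_ref));
	}

	static inline void get_vma(struct vm_area_struct *vma)
	{
		kref_get(&vma->vm_ref);
	}

	static inline void put_vma(struct vm_area_struct *vma)
	{
		kref_put(&vma->vm_ref, vma_release);
	}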

Reviewed-by: Jérôme Glisse <[email protected]>

> ---
> include/linux/mm.h | 4 ++++
> include/linux/mm_types.h | 3 +++
> mm/internal.h | 27 +++++++++++++++++++++++++++
> mm/mmap.c | 13 +++++++++----
> 4 files changed, 43 insertions(+), 4 deletions(-)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index f14b2c9ddfd4..f761a9c65c74 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -529,6 +529,9 @@ static inline void vma_init(struct vm_area_struct *vma, struct mm_struct *mm)
> vma->vm_mm = mm;
> vma->vm_ops = &dummy_vm_ops;
> INIT_LIST_HEAD(&vma->anon_vma_chain);
> +#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
> + atomic_set(&vma->vm_ref_count, 1);
> +#endif
> }
>
> static inline void vma_set_anonymous(struct vm_area_struct *vma)
> @@ -1418,6 +1421,7 @@ static inline void INIT_VMA(struct vm_area_struct *vma)
> INIT_LIST_HEAD(&vma->anon_vma_chain);
> #ifdef CONFIG_SPECULATIVE_PAGE_FAULT
> seqcount_init(&vma->vm_sequence);
> + atomic_set(&vma->vm_ref_count, 1);
> #endif
> }
>
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 24b3f8ce9e42..6a6159e11a3f 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -285,6 +285,9 @@ struct vm_area_struct {
> /* linked list of VM areas per task, sorted by address */
> struct vm_area_struct *vm_next, *vm_prev;
>
> +#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
> + atomic_t vm_ref_count;
> +#endif
> struct rb_node vm_rb;
>
> /*
> diff --git a/mm/internal.h b/mm/internal.h
> index 9eeaf2b95166..302382bed406 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -40,6 +40,33 @@ void page_writeback_init(void);
>
> vm_fault_t do_swap_page(struct vm_fault *vmf);
>
> +
> +extern void __free_vma(struct vm_area_struct *vma);
> +
> +#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
> +static inline void get_vma(struct vm_area_struct *vma)
> +{
> + atomic_inc(&vma->vm_ref_count);
> +}
> +
> +static inline void put_vma(struct vm_area_struct *vma)
> +{
> + if (atomic_dec_and_test(&vma->vm_ref_count))
> + __free_vma(vma);
> +}
> +
> +#else
> +
> +static inline void get_vma(struct vm_area_struct *vma)
> +{
> +}
> +
> +static inline void put_vma(struct vm_area_struct *vma)
> +{
> + __free_vma(vma);
> +}
> +#endif /* CONFIG_SPECULATIVE_PAGE_FAULT */
> +
> void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *start_vma,
> unsigned long floor, unsigned long ceiling);
>
> diff --git a/mm/mmap.c b/mm/mmap.c
> index f7f6027a7dff..c106440dcae7 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -188,6 +188,12 @@ static inline void mm_write_sequnlock(struct mm_struct *mm)
> }
> #endif /* CONFIG_SPECULATIVE_PAGE_FAULT */
>
> +void __free_vma(struct vm_area_struct *vma)
> +{
> + mpol_put(vma_policy(vma));
> + vm_area_free(vma);
> +}
> +
> /*
> * Close a vm structure and free it, returning the next.
> */
> @@ -200,8 +206,8 @@ static struct vm_area_struct *remove_vma(struct vm_area_struct *vma)
> vma->vm_ops->close(vma);
> if (vma->vm_file)
> fput(vma->vm_file);
> - mpol_put(vma_policy(vma));
> - vm_area_free(vma);
> + vma->vm_file = NULL;
> + put_vma(vma);
> return next;
> }
>
> @@ -990,8 +996,7 @@ int __vma_adjust(struct vm_area_struct *vma, unsigned long start,
> if (next->anon_vma)
> anon_vma_merge(vma, next);
> mm->map_count--;
> - mpol_put(vma_policy(next));
> - vm_area_free(next);
> + put_vma(next);
> /*
> * In mprotect's case 6 (see comments on vma_merge),
> * we must remove another next too. It would clutter
> --
> 2.21.0
>

2019-04-23 00:53:31

by Jerome Glisse

[permalink] [raw]
Subject: Re: [PATCH v12 21/31] mm: Introduce find_vma_rcu()

On Tue, Apr 16, 2019 at 03:45:12PM +0200, Laurent Dufour wrote:
> This allows to search for a VMA structure without holding the mmap_sem.
>
> The search is repeated while the mm seqlock is changing and until we found
> a valid VMA.
>
> While under the RCU protection, a reference is taken on the VMA, so the
> caller must call put_vma() once it not more need the VMA structure.
>
> At the time a VMA is inserted in the MM RB tree, in vma_rb_insert(), a
> reference is taken to the VMA by calling get_vma().
>
> When removing a VMA from the MM RB tree, the VMA is not release immediately
> but at the end of the RCU grace period through vm_rcu_put(). This ensures
> that the VMA remains allocated until the end the RCU grace period.
>
> Since the vm_file pointer, if valid, is released in put_vma(), there is no
> guarantee that the file pointer will be valid on the returned VMA.
>
> Signed-off-by: Laurent Dufour <[email protected]>

Minor comments about a comment (I love recursion :)), see below.

Reviewed-by: Jérôme Glisse <[email protected]>

> ---
> include/linux/mm_types.h | 1 +
> mm/internal.h | 5 ++-
> mm/mmap.c | 76 ++++++++++++++++++++++++++++++++++++++--
> 3 files changed, 78 insertions(+), 4 deletions(-)
>
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 6a6159e11a3f..9af6694cb95d 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -287,6 +287,7 @@ struct vm_area_struct {
>
> #ifdef CONFIG_SPECULATIVE_PAGE_FAULT
> atomic_t vm_ref_count;
> + struct rcu_head vm_rcu;
> #endif
> struct rb_node vm_rb;
>
> diff --git a/mm/internal.h b/mm/internal.h
> index 302382bed406..1e368e4afe3c 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -55,7 +55,10 @@ static inline void put_vma(struct vm_area_struct *vma)
> __free_vma(vma);
> }
>
> -#else
> +extern struct vm_area_struct *find_vma_rcu(struct mm_struct *mm,
> + unsigned long addr);
> +
> +#else /* CONFIG_SPECULATIVE_PAGE_FAULT */
>
> static inline void get_vma(struct vm_area_struct *vma)
> {
> diff --git a/mm/mmap.c b/mm/mmap.c
> index c106440dcae7..34bf261dc2c8 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -179,6 +179,18 @@ static inline void mm_write_sequnlock(struct mm_struct *mm)
> {
> write_sequnlock(&mm->mm_seq);
> }
> +
> +static void __vm_rcu_put(struct rcu_head *head)
> +{
> + struct vm_area_struct *vma = container_of(head, struct vm_area_struct,
> + vm_rcu);
> + put_vma(vma);
> +}
> +static void vm_rcu_put(struct vm_area_struct *vma)
> +{
> + VM_BUG_ON_VMA(!RB_EMPTY_NODE(&vma->vm_rb), vma);
> + call_rcu(&vma->vm_rcu, __vm_rcu_put);
> +}
> #else
> static inline void mm_write_seqlock(struct mm_struct *mm)
> {
> @@ -190,6 +202,8 @@ static inline void mm_write_sequnlock(struct mm_struct *mm)
>
> void __free_vma(struct vm_area_struct *vma)
> {
> + if (IS_ENABLED(CONFIG_SPECULATIVE_PAGE_FAULT))
> + VM_BUG_ON_VMA(!RB_EMPTY_NODE(&vma->vm_rb), vma);
> mpol_put(vma_policy(vma));
> vm_area_free(vma);
> }
> @@ -197,11 +211,24 @@ void __free_vma(struct vm_area_struct *vma)
> /*
> * Close a vm structure and free it, returning the next.
> */
> -static struct vm_area_struct *remove_vma(struct vm_area_struct *vma)
> +static struct vm_area_struct *__remove_vma(struct vm_area_struct *vma)
> {
> struct vm_area_struct *next = vma->vm_next;
>
> might_sleep();
> + if (IS_ENABLED(CONFIG_SPECULATIVE_PAGE_FAULT) &&
> + !RB_EMPTY_NODE(&vma->vm_rb)) {
> + /*
> + * If the VMA is still linked in the RB tree, we must release
> + * that reference by calling put_vma().
> + * This should only happen when called from exit_mmap().
> + * We forcely clear the node to satisfy the chec in
^
Typo: chec -> check

> + * __free_vma(). This is safe since the RB tree is not walked
> + * anymore.
> + */
> + RB_CLEAR_NODE(&vma->vm_rb);
> + put_vma(vma);
> + }
> if (vma->vm_ops && vma->vm_ops->close)
> vma->vm_ops->close(vma);
> if (vma->vm_file)
> @@ -211,6 +238,13 @@ static struct vm_area_struct *remove_vma(struct vm_area_struct *vma)
> return next;
> }
>
> +static struct vm_area_struct *remove_vma(struct vm_area_struct *vma)
> +{
> + if (IS_ENABLED(CONFIG_SPECULATIVE_PAGE_FAULT))
> + VM_BUG_ON_VMA(!RB_EMPTY_NODE(&vma->vm_rb), vma);

Adding a comment here explaining the VM_BUG_ON would help people understand
what is wrong if it ever triggers. For instance:

/*
 * remove_vma() should be called only once a vma has been removed from the
 * rbtree, at which point vma->vm_rb is an empty node. The exception is when
 * vmas are destroyed through exit_mmap(), in which case we do not bother
 * updating the rbtree (see the comment in __remove_vma()).
 */

> + return __remove_vma(vma);
> +}
> +
> static int do_brk_flags(unsigned long addr, unsigned long request, unsigned long flags,
> struct list_head *uf);
> SYSCALL_DEFINE1(brk, unsigned long, brk)
> @@ -475,7 +509,7 @@ static inline void vma_rb_insert(struct vm_area_struct *vma,
>
> /* All rb_subtree_gap values must be consistent prior to insertion */
> validate_mm_rb(root, NULL);
> -
> + get_vma(vma);
> rb_insert_augmented(&vma->vm_rb, root, &vma_gap_callbacks);
> }
>
> @@ -491,6 +525,14 @@ static void __vma_rb_erase(struct vm_area_struct *vma, struct mm_struct *mm)
> mm_write_seqlock(mm);
> rb_erase_augmented(&vma->vm_rb, root, &vma_gap_callbacks);
> mm_write_sequnlock(mm); /* wmb */
> +#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
> + /*
> + * Ensure the removal is complete before clearing the node.
> + * Matched by vma_has_changed()/handle_speculative_fault().
> + */
> + RB_CLEAR_NODE(&vma->vm_rb);
> + vm_rcu_put(vma);
> +#endif
> }
>
> static __always_inline void vma_rb_erase_ignore(struct vm_area_struct *vma,
> @@ -2331,6 +2373,34 @@ struct vm_area_struct *find_vma(struct mm_struct *mm, unsigned long addr)
>
> EXPORT_SYMBOL(find_vma);
>
> +#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
> +/*
> + * Like find_vma() but under the protection of RCU and the mm sequence counter.
> + * The vma returned has to be relaesed by the caller through the call to
> + * put_vma()
> + */
> +struct vm_area_struct *find_vma_rcu(struct mm_struct *mm, unsigned long addr)
> +{
> + struct vm_area_struct *vma = NULL;
> + unsigned int seq;
> +
> + do {
> + if (vma)
> + put_vma(vma);
> +
> + seq = read_seqbegin(&mm->mm_seq);
> +
> + rcu_read_lock();
> + vma = find_vma(mm, addr);
> + if (vma)
> + get_vma(vma);
> + rcu_read_unlock();
> + } while (read_seqretry(&mm->mm_seq, seq));
> +
> + return vma;
> +}
> +#endif
> +
> /*
> * Same as find_vma, but also return a pointer to the previous VMA in *pprev.
> */
> @@ -3231,7 +3301,7 @@ void exit_mmap(struct mm_struct *mm)
> while (vma) {
> if (vma->vm_flags & VM_ACCOUNT)
> nr_accounted += vma_pages(vma);
> - vma = remove_vma(vma);
> + vma = __remove_vma(vma);
> }
> vm_unacct_memory(nr_accounted);
> }
> --
> 2.21.0
>

2019-04-23 02:48:45

by Jerome Glisse

[permalink] [raw]
Subject: Re: [PATCH v12 10/31] mm: protect VMA modifications using VMA sequence count

On Tue, Apr 16, 2019 at 03:45:01PM +0200, Laurent Dufour wrote:
> The VMA sequence count has been introduced to allow fast detection of
> VMA modification when running a page fault handler without holding
> the mmap_sem.
>
> This patch provides protection against the VMA modification done in :
> - madvise()
> - mpol_rebind_policy()
> - vma_replace_policy()
> - change_prot_numa()
> - mlock(), munlock()
> - mprotect()
> - mmap_region()
> - collapse_huge_page()
> - userfaultd registering services
>
> In addition, VMA fields which will be read during the speculative fault
> path needs to be written using WRITE_ONCE to prevent write to be split
> and intermediate values to be pushed to other CPUs.
>
> Signed-off-by: Laurent Dufour <[email protected]>

Reviewed-by: Jérôme Glisse <[email protected]>

> ---
> fs/proc/task_mmu.c | 5 ++++-
> fs/userfaultfd.c | 17 ++++++++++++----
> mm/khugepaged.c | 3 +++
> mm/madvise.c | 6 +++++-
> mm/mempolicy.c | 51 ++++++++++++++++++++++++++++++----------------
> mm/mlock.c | 13 +++++++-----
> mm/mmap.c | 28 ++++++++++++++++---------
> mm/mprotect.c | 4 +++-
> mm/swap_state.c | 10 ++++++---
> 9 files changed, 95 insertions(+), 42 deletions(-)
>
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index 01d4eb0e6bd1..0864c050b2de 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -1162,8 +1162,11 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
> goto out_mm;
> }
> for (vma = mm->mmap; vma; vma = vma->vm_next) {
> - vma->vm_flags &= ~VM_SOFTDIRTY;
> + vm_write_begin(vma);
> + WRITE_ONCE(vma->vm_flags,
> + vma->vm_flags & ~VM_SOFTDIRTY);
> vma_set_page_prot(vma);
> + vm_write_end(vma);
> }
> downgrade_write(&mm->mmap_sem);
> break;
> diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
> index 3b30301c90ec..2e0f98cadd81 100644
> --- a/fs/userfaultfd.c
> +++ b/fs/userfaultfd.c
> @@ -667,8 +667,11 @@ int dup_userfaultfd(struct vm_area_struct *vma, struct list_head *fcs)
>
> octx = vma->vm_userfaultfd_ctx.ctx;
> if (!octx || !(octx->features & UFFD_FEATURE_EVENT_FORK)) {
> + vm_write_begin(vma);
> vma->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX;
> - vma->vm_flags &= ~(VM_UFFD_WP | VM_UFFD_MISSING);
> + WRITE_ONCE(vma->vm_flags,
> + vma->vm_flags & ~(VM_UFFD_WP | VM_UFFD_MISSING));
> + vm_write_end(vma);
> return 0;
> }
>
> @@ -908,8 +911,10 @@ static int userfaultfd_release(struct inode *inode, struct file *file)
> vma = prev;
> else
> prev = vma;
> - vma->vm_flags = new_flags;
> + vm_write_begin(vma);
> + WRITE_ONCE(vma->vm_flags, new_flags);
> vma->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX;
> + vm_write_end(vma);
> }
> skip_mm:
> up_write(&mm->mmap_sem);
> @@ -1474,8 +1479,10 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx,
> * the next vma was merged into the current one and
> * the current one has not been updated yet.
> */
> - vma->vm_flags = new_flags;
> + vm_write_begin(vma);
> + WRITE_ONCE(vma->vm_flags, new_flags);
> vma->vm_userfaultfd_ctx.ctx = ctx;
> + vm_write_end(vma);
>
> skip:
> prev = vma;
> @@ -1636,8 +1643,10 @@ static int userfaultfd_unregister(struct userfaultfd_ctx *ctx,
> * the next vma was merged into the current one and
> * the current one has not been updated yet.
> */
> - vma->vm_flags = new_flags;
> + vm_write_begin(vma);
> + WRITE_ONCE(vma->vm_flags, new_flags);
> vma->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX;
> + vm_write_end(vma);
>
> skip:
> prev = vma;
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index a335f7c1fac4..6a0cbca3885e 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -1011,6 +1011,7 @@ static void collapse_huge_page(struct mm_struct *mm,
> if (mm_find_pmd(mm, address) != pmd)
> goto out;
>
> + vm_write_begin(vma);
> anon_vma_lock_write(vma->anon_vma);
>
> pte = pte_offset_map(pmd, address);
> @@ -1046,6 +1047,7 @@ static void collapse_huge_page(struct mm_struct *mm,
> pmd_populate(mm, pmd, pmd_pgtable(_pmd));
> spin_unlock(pmd_ptl);
> anon_vma_unlock_write(vma->anon_vma);
> + vm_write_end(vma);
> result = SCAN_FAIL;
> goto out;
> }
> @@ -1081,6 +1083,7 @@ static void collapse_huge_page(struct mm_struct *mm,
> set_pmd_at(mm, address, pmd, _pmd);
> update_mmu_cache_pmd(vma, address, pmd);
> spin_unlock(pmd_ptl);
> + vm_write_end(vma);
>
> *hpage = NULL;
>
> diff --git a/mm/madvise.c b/mm/madvise.c
> index a692d2a893b5..6cf07dc546fc 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -184,7 +184,9 @@ static long madvise_behavior(struct vm_area_struct *vma,
> /*
> * vm_flags is protected by the mmap_sem held in write mode.
> */
> - vma->vm_flags = new_flags;
> + vm_write_begin(vma);
> + WRITE_ONCE(vma->vm_flags, new_flags);
> + vm_write_end(vma);
> out:
> return error;
> }
> @@ -450,9 +452,11 @@ static void madvise_free_page_range(struct mmu_gather *tlb,
> .private = tlb,
> };
>
> + vm_write_begin(vma);
> tlb_start_vma(tlb, vma);
> walk_page_range(addr, end, &free_walk);
> tlb_end_vma(tlb, vma);
> + vm_write_end(vma);
> }
>
> static int madvise_free_single_vma(struct vm_area_struct *vma,
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index 2219e747df49..94c103c5034a 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -380,8 +380,11 @@ void mpol_rebind_mm(struct mm_struct *mm, nodemask_t *new)
> struct vm_area_struct *vma;
>
> down_write(&mm->mmap_sem);
> - for (vma = mm->mmap; vma; vma = vma->vm_next)
> + for (vma = mm->mmap; vma; vma = vma->vm_next) {
> + vm_write_begin(vma);
> mpol_rebind_policy(vma->vm_policy, new);
> + vm_write_end(vma);
> + }
> up_write(&mm->mmap_sem);
> }
>
> @@ -575,9 +578,11 @@ unsigned long change_prot_numa(struct vm_area_struct *vma,
> {
> int nr_updated;
>
> + vm_write_begin(vma);
> nr_updated = change_protection(vma, addr, end, PAGE_NONE, 0, 1);
> if (nr_updated)
> count_vm_numa_events(NUMA_PTE_UPDATES, nr_updated);
> + vm_write_end(vma);
>
> return nr_updated;
> }
> @@ -683,6 +688,7 @@ static int vma_replace_policy(struct vm_area_struct *vma,
> if (IS_ERR(new))
> return PTR_ERR(new);
>
> + vm_write_begin(vma);
> if (vma->vm_ops && vma->vm_ops->set_policy) {
> err = vma->vm_ops->set_policy(vma, new);
> if (err)
> @@ -690,11 +696,17 @@ static int vma_replace_policy(struct vm_area_struct *vma,
> }
>
> old = vma->vm_policy;
> - vma->vm_policy = new; /* protected by mmap_sem */
> + /*
> + * The speculative page fault handler accesses this field without
> + * hodling the mmap_sem.
> + */
> + WRITE_ONCE(vma->vm_policy, new);
> + vm_write_end(vma);
> mpol_put(old);
>
> return 0;
> err_out:
> + vm_write_end(vma);
> mpol_put(new);
> return err;
> }
> @@ -1654,23 +1666,28 @@ COMPAT_SYSCALL_DEFINE4(migrate_pages, compat_pid_t, pid,
> struct mempolicy *__get_vma_policy(struct vm_area_struct *vma,
> unsigned long addr)
> {
> - struct mempolicy *pol = NULL;
> + struct mempolicy *pol;
>
> - if (vma) {
> - if (vma->vm_ops && vma->vm_ops->get_policy) {
> - pol = vma->vm_ops->get_policy(vma, addr);
> - } else if (vma->vm_policy) {
> - pol = vma->vm_policy;
> + if (!vma)
> + return NULL;
>
> - /*
> - * shmem_alloc_page() passes MPOL_F_SHARED policy with
> - * a pseudo vma whose vma->vm_ops=NULL. Take a reference
> - * count on these policies which will be dropped by
> - * mpol_cond_put() later
> - */
> - if (mpol_needs_cond_ref(pol))
> - mpol_get(pol);
> - }
> + if (vma->vm_ops && vma->vm_ops->get_policy)
> + return vma->vm_ops->get_policy(vma, addr);
> +
> + /*
> + * This could be called without holding the mmap_sem in the
> + * speculative page fault handler's path.
> + */
> + pol = READ_ONCE(vma->vm_policy);
> + if (pol) {
> + /*
> + * shmem_alloc_page() passes MPOL_F_SHARED policy with
> + * a pseudo vma whose vma->vm_ops=NULL. Take a reference
> + * count on these policies which will be dropped by
> + * mpol_cond_put() later
> + */
> + if (mpol_needs_cond_ref(pol))
> + mpol_get(pol);
> }
>
> return pol;
> diff --git a/mm/mlock.c b/mm/mlock.c
> index 080f3b36415b..f390903d9bbb 100644
> --- a/mm/mlock.c
> +++ b/mm/mlock.c
> @@ -445,7 +445,9 @@ static unsigned long __munlock_pagevec_fill(struct pagevec *pvec,
> void munlock_vma_pages_range(struct vm_area_struct *vma,
> unsigned long start, unsigned long end)
> {
> - vma->vm_flags &= VM_LOCKED_CLEAR_MASK;
> + vm_write_begin(vma);
> + WRITE_ONCE(vma->vm_flags, vma->vm_flags & VM_LOCKED_CLEAR_MASK);
> + vm_write_end(vma);
>
> while (start < end) {
> struct page *page;
> @@ -569,10 +571,11 @@ static int mlock_fixup(struct vm_area_struct *vma, struct vm_area_struct **prev,
> * It's okay if try_to_unmap_one unmaps a page just after we
> * set VM_LOCKED, populate_vma_page_range will bring it back.
> */
> -
> - if (lock)
> - vma->vm_flags = newflags;
> - else
> + if (lock) {
> + vm_write_begin(vma);
> + WRITE_ONCE(vma->vm_flags, newflags);
> + vm_write_end(vma);
> + } else
> munlock_vma_pages_range(vma, start, end);
>
> out:
> diff --git a/mm/mmap.c b/mm/mmap.c
> index a4e4d52a5148..b77ec0149249 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -877,17 +877,18 @@ int __vma_adjust(struct vm_area_struct *vma, unsigned long start,
> }
>
> if (start != vma->vm_start) {
> - vma->vm_start = start;
> + WRITE_ONCE(vma->vm_start, start);
> start_changed = true;
> }
> if (end != vma->vm_end) {
> - vma->vm_end = end;
> + WRITE_ONCE(vma->vm_end, end);
> end_changed = true;
> }
> - vma->vm_pgoff = pgoff;
> + WRITE_ONCE(vma->vm_pgoff, pgoff);
> if (adjust_next) {
> - next->vm_start += adjust_next << PAGE_SHIFT;
> - next->vm_pgoff += adjust_next;
> + WRITE_ONCE(next->vm_start,
> + next->vm_start + (adjust_next << PAGE_SHIFT));
> + WRITE_ONCE(next->vm_pgoff, next->vm_pgoff + adjust_next);
> }
>
> if (root) {
> @@ -1850,12 +1851,14 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
> out:
> perf_event_mmap(vma);
>
> + vm_write_begin(vma);
> vm_stat_account(mm, vm_flags, len >> PAGE_SHIFT);
> if (vm_flags & VM_LOCKED) {
> if ((vm_flags & VM_SPECIAL) || vma_is_dax(vma) ||
> is_vm_hugetlb_page(vma) ||
> vma == get_gate_vma(current->mm))
> - vma->vm_flags &= VM_LOCKED_CLEAR_MASK;
> + WRITE_ONCE(vma->vm_flags,
> + vma->vm_flags &= VM_LOCKED_CLEAR_MASK);
> else
> mm->locked_vm += (len >> PAGE_SHIFT);
> }
> @@ -1870,9 +1873,10 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
> * then new mapped in-place (which must be aimed as
> * a completely new data area).
> */
> - vma->vm_flags |= VM_SOFTDIRTY;
> + WRITE_ONCE(vma->vm_flags, vma->vm_flags | VM_SOFTDIRTY);
>
> vma_set_page_prot(vma);
> + vm_write_end(vma);
>
> return addr;
>
> @@ -2430,7 +2434,9 @@ int expand_upwards(struct vm_area_struct *vma, unsigned long address)
> mm->locked_vm += grow;
> vm_stat_account(mm, vma->vm_flags, grow);
> anon_vma_interval_tree_pre_update_vma(vma);
> - vma->vm_end = address;
> + vm_write_begin(vma);
> + WRITE_ONCE(vma->vm_end, address);
> + vm_write_end(vma);
> anon_vma_interval_tree_post_update_vma(vma);
> if (vma->vm_next)
> vma_gap_update(vma->vm_next);
> @@ -2510,8 +2516,10 @@ int expand_downwards(struct vm_area_struct *vma,
> mm->locked_vm += grow;
> vm_stat_account(mm, vma->vm_flags, grow);
> anon_vma_interval_tree_pre_update_vma(vma);
> - vma->vm_start = address;
> - vma->vm_pgoff -= grow;
> + vm_write_begin(vma);
> + WRITE_ONCE(vma->vm_start, address);
> + WRITE_ONCE(vma->vm_pgoff, vma->vm_pgoff - grow);
> + vm_write_end(vma);
> anon_vma_interval_tree_post_update_vma(vma);
> vma_gap_update(vma);
> spin_unlock(&mm->page_table_lock);
> diff --git a/mm/mprotect.c b/mm/mprotect.c
> index 65242f1e4457..78fce873ca3a 100644
> --- a/mm/mprotect.c
> +++ b/mm/mprotect.c
> @@ -427,12 +427,14 @@ mprotect_fixup(struct vm_area_struct *vma, struct vm_area_struct **pprev,
> * vm_flags and vm_page_prot are protected by the mmap_sem
> * held in write mode.
> */
> - vma->vm_flags = newflags;
> + vm_write_begin(vma);
> + WRITE_ONCE(vma->vm_flags, newflags);
> dirty_accountable = vma_wants_writenotify(vma, vma->vm_page_prot);
> vma_set_page_prot(vma);
>
> change_protection(vma, start, end, vma->vm_page_prot,
> dirty_accountable, 0);
> + vm_write_end(vma);
>
> /*
> * Private VM_LOCKED VMA becoming writable: trigger COW to avoid major
> diff --git a/mm/swap_state.c b/mm/swap_state.c
> index eb714165afd2..c45f9122b457 100644
> --- a/mm/swap_state.c
> +++ b/mm/swap_state.c
> @@ -523,7 +523,11 @@ static unsigned long swapin_nr_pages(unsigned long offset)
> * This has been extended to use the NUMA policies from the mm triggering
> * the readahead.
> *
> - * Caller must hold read mmap_sem if vmf->vma is not NULL.
> + * Caller must hold down_read on the vma->vm_mm if vmf->vma is not NULL.
> + * This is needed to ensure the VMA will not be freed in our back. In the case
> + * of the speculative page fault handler, this cannot happen, even if we don't
> + * hold the mmap_sem. Callees are assumed to take care of reading VMA's fields
> + * using READ_ONCE() to read consistent values.
> */
> struct page *swap_cluster_readahead(swp_entry_t entry, gfp_t gfp_mask,
> struct vm_fault *vmf)
> @@ -624,9 +628,9 @@ static inline void swap_ra_clamp_pfn(struct vm_area_struct *vma,
> unsigned long *start,
> unsigned long *end)
> {
> - *start = max3(lpfn, PFN_DOWN(vma->vm_start),
> + *start = max3(lpfn, PFN_DOWN(READ_ONCE(vma->vm_start)),
> PFN_DOWN(faddr & PMD_MASK));
> - *end = min3(rpfn, PFN_DOWN(vma->vm_end),
> + *end = min3(rpfn, PFN_DOWN(READ_ONCE(vma->vm_end)),
> PFN_DOWN((faddr & PMD_MASK) + PMD_SIZE));
> }
>
> --
> 2.21.0
>

2019-04-23 03:25:08

by Jerome Glisse

[permalink] [raw]
Subject: Re: [PATCH v12 22/31] mm: provide speculative fault infrastructure

On Tue, Apr 16, 2019 at 03:45:13PM +0200, Laurent Dufour wrote:
> From: Peter Zijlstra <[email protected]>
>
> Provide infrastructure to do a speculative fault (not holding
> mmap_sem).
>
> The not holding of mmap_sem means we can race against VMA
> change/removal and page-table destruction. We use the SRCU VMA freeing
> to keep the VMA around. We use the VMA seqcount to detect change
> (including umapping / page-table deletion) and we use gup_fast() style
> page-table walking to deal with page-table races.
>
> Once we've obtained the page and are ready to update the PTE, we
> validate if the state we started the fault with is still valid, if
> not, we'll fail the fault with VM_FAULT_RETRY, otherwise we update the
> PTE and we're done.
>
> Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
>
> [Manage the newly introduced pte_spinlock() for speculative page
> fault to fail if the VMA is touched in our back]
> [Rename vma_is_dead() to vma_has_changed() and declare it here]
> [Fetch p4d and pud]
> [Set vmd.sequence in __handle_mm_fault()]
> [Abort speculative path when handle_userfault() has to be called]
> [Add additional VMA's flags checks in handle_speculative_fault()]
> [Clear FAULT_FLAG_ALLOW_RETRY in handle_speculative_fault()]
> [Don't set vmf->pte and vmf->ptl if pte_map_lock() failed]
> [Remove warning comment about waiting for !seq&1 since we don't want
> to wait]
> [Remove warning about no huge page support, mention it explictly]
> [Don't call do_fault() in the speculative path as __do_fault() calls
> vma->vm_ops->fault() which may want to release mmap_sem]
> [Only vm_fault pointer argument for vma_has_changed()]
> [Fix check against huge page, calling pmd_trans_huge()]
> [Use READ_ONCE() when reading VMA's fields in the speculative path]
> [Explicitly check for __HAVE_ARCH_PTE_SPECIAL as we can't support for
> processing done in vm_normal_page()]
> [Check that vma->anon_vma is already set when starting the speculative
> path]
> [Check for memory policy as we can't support MPOL_INTERLEAVE case due to
> the processing done in mpol_misplaced()]
> [Don't support VMA growing up or down]
> [Move check on vm_sequence just before calling handle_pte_fault()]
> [Don't build SPF services if !CONFIG_SPECULATIVE_PAGE_FAULT]
> [Add mem cgroup oom check]
> [Use READ_ONCE to access p*d entries]
> [Replace deprecated ACCESS_ONCE() by READ_ONCE() in vma_has_changed()]
> [Don't fetch pte again in handle_pte_fault() when running the speculative
> path]
> [Check PMD against concurrent collapsing operation]
> [Try spin lock the pte during the speculative path to avoid deadlock with
> other CPU's invalidating the TLB and requiring this CPU to catch the
> inter processor's interrupt]
> [Move define of FAULT_FLAG_SPECULATIVE here]
> [Introduce __handle_speculative_fault() and add a check against
> mm->mm_users in handle_speculative_fault() defined in mm.h]
> [Abort if vm_ops->fault is set instead of checking only vm_ops]
> [Use find_vma_rcu() and call put_vma() when we are done with the VMA]
> Signed-off-by: Laurent Dufour <[email protected]>


A few comments and questions on this one, see below.


> ---
> include/linux/hugetlb_inline.h | 2 +-
> include/linux/mm.h | 30 +++
> include/linux/pagemap.h | 4 +-
> mm/internal.h | 15 ++
> mm/memory.c | 344 ++++++++++++++++++++++++++++++++-
> 5 files changed, 389 insertions(+), 6 deletions(-)
>
> diff --git a/include/linux/hugetlb_inline.h b/include/linux/hugetlb_inline.h
> index 0660a03d37d9..9e25283d6fc9 100644
> --- a/include/linux/hugetlb_inline.h
> +++ b/include/linux/hugetlb_inline.h
> @@ -8,7 +8,7 @@
>
> static inline bool is_vm_hugetlb_page(struct vm_area_struct *vma)
> {
> - return !!(vma->vm_flags & VM_HUGETLB);
> + return !!(READ_ONCE(vma->vm_flags) & VM_HUGETLB);
> }
>
> #else
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index f761a9c65c74..ec609cbad25a 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -381,6 +381,7 @@ extern pgprot_t protection_map[16];
> #define FAULT_FLAG_USER 0x40 /* The fault originated in userspace */
> #define FAULT_FLAG_REMOTE 0x80 /* faulting for non current tsk/mm */
> #define FAULT_FLAG_INSTRUCTION 0x100 /* The fault was during an instruction fetch */
> +#define FAULT_FLAG_SPECULATIVE 0x200 /* Speculative fault, not holding mmap_sem */
>
> #define FAULT_FLAG_TRACE \
> { FAULT_FLAG_WRITE, "WRITE" }, \
> @@ -409,6 +410,10 @@ struct vm_fault {
> gfp_t gfp_mask; /* gfp mask to be used for allocations */
> pgoff_t pgoff; /* Logical page offset based on vma */
> unsigned long address; /* Faulting virtual address */
> +#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
> + unsigned int sequence;
> + pmd_t orig_pmd; /* value of PMD at the time of fault */
> +#endif
> pmd_t *pmd; /* Pointer to pmd entry matching
> * the 'address' */
> pud_t *pud; /* Pointer to pud entry matching
> @@ -1524,6 +1529,31 @@ int invalidate_inode_page(struct page *page);
> #ifdef CONFIG_MMU
> extern vm_fault_t handle_mm_fault(struct vm_area_struct *vma,
> unsigned long address, unsigned int flags);
> +
> +#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
> +extern vm_fault_t __handle_speculative_fault(struct mm_struct *mm,
> + unsigned long address,
> + unsigned int flags);
> +static inline vm_fault_t handle_speculative_fault(struct mm_struct *mm,
> + unsigned long address,
> + unsigned int flags)
> +{
> + /*
> + * Try speculative page fault for multithreaded user space task only.
> + */
> + if (!(flags & FAULT_FLAG_USER) || atomic_read(&mm->mm_users) == 1)
> + return VM_FAULT_RETRY;
> + return __handle_speculative_fault(mm, address, flags);
> +}
> +#else
> +static inline vm_fault_t handle_speculative_fault(struct mm_struct *mm,
> + unsigned long address,
> + unsigned int flags)
> +{
> + return VM_FAULT_RETRY;
> +}
> +#endif /* CONFIG_SPECULATIVE_PAGE_FAULT */
> +
> extern int fixup_user_fault(struct task_struct *tsk, struct mm_struct *mm,
> unsigned long address, unsigned int fault_flags,
> bool *unlocked);
> diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
> index 2e8438a1216a..2fcfaa910007 100644
> --- a/include/linux/pagemap.h
> +++ b/include/linux/pagemap.h
> @@ -457,8 +457,8 @@ static inline pgoff_t linear_page_index(struct vm_area_struct *vma,
> pgoff_t pgoff;
> if (unlikely(is_vm_hugetlb_page(vma)))
> return linear_hugepage_index(vma, address);
> - pgoff = (address - vma->vm_start) >> PAGE_SHIFT;
> - pgoff += vma->vm_pgoff;
> + pgoff = (address - READ_ONCE(vma->vm_start)) >> PAGE_SHIFT;
> + pgoff += READ_ONCE(vma->vm_pgoff);
> return pgoff;
> }
>
> diff --git a/mm/internal.h b/mm/internal.h
> index 1e368e4afe3c..ed91b199cb8c 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -58,6 +58,21 @@ static inline void put_vma(struct vm_area_struct *vma)
> extern struct vm_area_struct *find_vma_rcu(struct mm_struct *mm,
> unsigned long addr);
>
> +
> +static inline bool vma_has_changed(struct vm_fault *vmf)
> +{
> + int ret = RB_EMPTY_NODE(&vmf->vma->vm_rb);
> + unsigned int seq = READ_ONCE(vmf->vma->vm_sequence.sequence);
> +
> + /*
> + * Matches both the wmb in write_seqlock_{begin,end}() and
> + * the wmb in vma_rb_erase().
> + */
> + smp_rmb();
> +
> + return ret || seq != vmf->sequence;
> +}
> +
> #else /* CONFIG_SPECULATIVE_PAGE_FAULT */
>
> static inline void get_vma(struct vm_area_struct *vma)
> diff --git a/mm/memory.c b/mm/memory.c
> index 46f877b6abea..6e6bf61c0e5c 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -522,7 +522,8 @@ static void print_bad_pte(struct vm_area_struct *vma, unsigned long addr,
> if (page)
> dump_page(page, "bad pte");
> pr_alert("addr:%p vm_flags:%08lx anon_vma:%p mapping:%p index:%lx\n",
> - (void *)addr, vma->vm_flags, vma->anon_vma, mapping, index);
> + (void *)addr, READ_ONCE(vma->vm_flags), vma->anon_vma,
> + mapping, index);
> pr_alert("file:%pD fault:%pf mmap:%pf readpage:%pf\n",
> vma->vm_file,
> vma->vm_ops ? vma->vm_ops->fault : NULL,
> @@ -2082,6 +2083,118 @@ int apply_to_page_range(struct mm_struct *mm, unsigned long addr,
> }
> EXPORT_SYMBOL_GPL(apply_to_page_range);
>
> +#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
> +static bool pte_spinlock(struct vm_fault *vmf)
> +{
> + bool ret = false;
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> + pmd_t pmdval;
> +#endif
> +
> + /* Check if vma is still valid */
> + if (!(vmf->flags & FAULT_FLAG_SPECULATIVE)) {
> + vmf->ptl = pte_lockptr(vmf->vma->vm_mm, vmf->pmd);
> + spin_lock(vmf->ptl);
> + return true;
> + }
> +
> +again:
> + local_irq_disable();
> + if (vma_has_changed(vmf))
> + goto out;
> +
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> + /*
> + * We check if the pmd value is still the same to ensure that there
> + * is not a huge collapse operation in progress in our back.
> + */
> + pmdval = READ_ONCE(*vmf->pmd);
> + if (!pmd_same(pmdval, vmf->orig_pmd))
> + goto out;
> +#endif
> +
> + vmf->ptl = pte_lockptr(vmf->vma->vm_mm, vmf->pmd);
> + if (unlikely(!spin_trylock(vmf->ptl))) {
> + local_irq_enable();
> + goto again;
> + }

Do we want to constantly retry taking the spinlock? Shouldn't it be limited?
If we fail a few times, it is probably better to give up on that speculative
page fault.

So maybe putting everything within a for (i = 0; i < MAX_TRY; ++i) loop would
be cleaner.
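
Something along these lines, for instance (an untested sketch of the
speculative branch of pte_spinlock(); SPF_PTL_MAX_TRY is a made-up constant
and the THP pmd_same() check is elided to keep it short):

	int i;

	for (i = 0; i < SPF_PTL_MAX_TRY; i++) {
		local_irq_disable();
		if (vma_has_changed(vmf)) {
			local_irq_enable();
			return false;
		}

		/* the pmd_same() check against vmf->orig_pmd goes here */

		vmf->ptl = pte_lockptr(vmf->vma->vm_mm, vmf->pmd);
		if (unlikely(!spin_trylock(vmf->ptl))) {
			/* contended: re-enable IRQs, retry a bounded number of times */
			local_irq_enable();
			continue;
		}

		if (vma_has_changed(vmf)) {
			spin_unlock(vmf->ptl);
			local_irq_enable();
			return false;
		}

		/* success: return with the PTL held, as pte_spinlock() does */
		local_irq_enable();
		return true;
	}
	return false;	/* give up, let the caller fall back to a classic fault */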


> +
> + if (vma_has_changed(vmf)) {
> + spin_unlock(vmf->ptl);
> + goto out;
> + }
> +
> + ret = true;
> +out:
> + local_irq_enable();
> + return ret;
> +}
> +
> +static bool pte_map_lock(struct vm_fault *vmf)
> +{
> + bool ret = false;
> + pte_t *pte;
> + spinlock_t *ptl;
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> + pmd_t pmdval;
> +#endif
> +
> + if (!(vmf->flags & FAULT_FLAG_SPECULATIVE)) {
> + vmf->pte = pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd,
> + vmf->address, &vmf->ptl);
> + return true;
> + }
> +
> + /*
> + * The first vma_has_changed() guarantees the page-tables are still
> + * valid, having IRQs disabled ensures they stay around, hence the
> + * second vma_has_changed() to make sure they are still valid once
> + * we've got the lock. After that a concurrent zap_pte_range() will
> + * block on the PTL and thus we're safe.
> + */
> +again:
> + local_irq_disable();
> + if (vma_has_changed(vmf))
> + goto out;
> +
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> + /*
> + * We check if the pmd value is still the same to ensure that there
> + * is not a huge collapse operation in progress in our back.
> + */
> + pmdval = READ_ONCE(*vmf->pmd);
> + if (!pmd_same(pmdval, vmf->orig_pmd))
> + goto out;
> +#endif
> +
> + /*
> + * Same as pte_offset_map_lock() except that we call
> + * spin_trylock() in place of spin_lock() to avoid race with
> + * unmap path which may have the lock and wait for this CPU
> + * to invalidate TLB but this CPU has irq disabled.
> + * Since we are in a speculative patch, accept it could fail
> + */
> + ptl = pte_lockptr(vmf->vma->vm_mm, vmf->pmd);
> + pte = pte_offset_map(vmf->pmd, vmf->address);
> + if (unlikely(!spin_trylock(ptl))) {
> + pte_unmap(pte);
> + local_irq_enable();
> + goto again;
> + }

Same comment as above: shouldn't this be limited to a maximum number of retries?

> +
> + if (vma_has_changed(vmf)) {
> + pte_unmap_unlock(pte, ptl);
> + goto out;
> + }
> +
> + vmf->pte = pte;
> + vmf->ptl = ptl;
> + ret = true;
> +out:
> + local_irq_enable();
> + return ret;
> +}
> +#else
> static inline bool pte_spinlock(struct vm_fault *vmf)
> {
> vmf->ptl = pte_lockptr(vmf->vma->vm_mm, vmf->pmd);
> @@ -2095,6 +2208,7 @@ static inline bool pte_map_lock(struct vm_fault *vmf)
> vmf->address, &vmf->ptl);
> return true;
> }
> +#endif /* CONFIG_SPECULATIVE_PAGE_FAULT */
>
> /*
> * handle_pte_fault chooses page fault handler according to an entry which was
> @@ -2999,6 +3113,14 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
> ret = check_stable_address_space(vma->vm_mm);
> if (ret)
> goto unlock;
> + /*
> + * Don't call the userfaultfd during the speculative path.
> + * We already checked for the VMA to not be managed through
> + * userfaultfd, but it may be set in our back once we have lock
> + * the pte. In such a case we can ignore it this time.
> + */
> + if (vmf->flags & FAULT_FLAG_SPECULATIVE)
> + goto setpte;

A bit confused by the comment above: if userfaultfd is set behind our back,
then shouldn't the speculative fault abort? So wouldn't the following be
correct:

if (userfaultfd_missing(vma)) {
	pte_unmap_unlock(vmf->pte, vmf->ptl);
	if (vmf->flags & FAULT_FLAG_SPECULATIVE)
		return VM_FAULT_RETRY;
	...


> /* Deliver the page fault to userland, check inside PT lock */
> if (userfaultfd_missing(vma)) {
> pte_unmap_unlock(vmf->pte, vmf->ptl);
> @@ -3041,7 +3163,8 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
> goto unlock_and_release;
>
> /* Deliver the page fault to userland, check inside PT lock */
> - if (userfaultfd_missing(vma)) {
> + if (!(vmf->flags & FAULT_FLAG_SPECULATIVE) &&
> + userfaultfd_missing(vma)) {

Same comment as above, but this also seems more wrong than the one above. What
I propose above would look more correct in both cases, i.e. we still want to
check for userfaultfd, but if we are in a speculative fault then we just want
to abort the speculative fault.
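
i.e. at this second location something like (sketch; the final
handle_userfault() call mirrors the existing non-speculative path):

	/* Deliver the page fault to userland, check inside PT lock */
	if (userfaultfd_missing(vma)) {
		pte_unmap_unlock(vmf->pte, vmf->ptl);
		mem_cgroup_cancel_charge(page, memcg, false);
		put_page(page);
		if (vmf->flags & FAULT_FLAG_SPECULATIVE)
			return VM_FAULT_RETRY;
		return handle_userfault(vmf, VM_UFFD_MISSING);
	}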


> pte_unmap_unlock(vmf->pte, vmf->ptl);
> mem_cgroup_cancel_charge(page, memcg, false);
> put_page(page);
> @@ -3836,6 +3959,15 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
> pte_t entry;
>
> if (unlikely(pmd_none(*vmf->pmd))) {
> + /*
> + * In the case of the speculative page fault handler we abort
> + * the speculative path immediately as the pmd is probably
> + * in the way to be converted in a huge one. We will try
> + * again holding the mmap_sem (which implies that the collapse
> + * operation is done).
> + */
> + if (vmf->flags & FAULT_FLAG_SPECULATIVE)
> + return VM_FAULT_RETRY;
> /*
> * Leave __pte_alloc() until later: because vm_ops->fault may
> * want to allocate huge page, and if we expose page table
> @@ -3843,7 +3975,7 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
> * concurrent faults and from rmap lookups.
> */
> vmf->pte = NULL;
> - } else {
> + } else if (!(vmf->flags & FAULT_FLAG_SPECULATIVE)) {
> /* See comment in pte_alloc_one_map() */
> if (pmd_devmap_trans_unstable(vmf->pmd))
> return 0;
> @@ -3852,6 +3984,9 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
> * pmd from under us anymore at this point because we hold the
> * mmap_sem read mode and khugepaged takes it in write mode.
> * So now it's safe to run pte_offset_map().
> + * This is not applicable to the speculative page fault handler
> + * but in that case, the pte is fetched earlier in
> + * handle_speculative_fault().
> */
> vmf->pte = pte_offset_map(vmf->pmd, vmf->address);
> vmf->orig_pte = *vmf->pte;
> @@ -3874,6 +4009,8 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
> if (!vmf->pte) {
> if (vma_is_anonymous(vmf->vma))
> return do_anonymous_page(vmf);
> + else if (vmf->flags & FAULT_FLAG_SPECULATIVE)
> + return VM_FAULT_RETRY;

Maybe a small comment about the speculative page fault not applying to
file-backed VMAs, e.g. as sketched below.
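Something like the following, maybe:

		if (vma_is_anonymous(vmf->vma))
			return do_anonymous_page(vmf);
		else if (vmf->flags & FAULT_FLAG_SPECULATIVE)
			/*
			 * The speculative path only handles anonymous
			 * VMAs; fall back to the regular fault path for
			 * file-backed mappings.
			 */
			return VM_FAULT_RETRY;
		else
			return do_fault(vmf);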

> else
> return do_fault(vmf);
> }
> @@ -3971,6 +4108,9 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
> vmf.pmd = pmd_alloc(mm, vmf.pud, address);
> if (!vmf.pmd)
> return VM_FAULT_OOM;
> +#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
> + vmf.sequence = raw_read_seqcount(&vma->vm_sequence);
> +#endif
> if (pmd_none(*vmf.pmd) && __transparent_hugepage_enabled(vma)) {
> ret = create_huge_pmd(&vmf);
> if (!(ret & VM_FAULT_FALLBACK))
> @@ -4004,6 +4144,204 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
> return handle_pte_fault(&vmf);
> }
>
> +#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
> +/*
> + * Tries to handle the page fault in a speculative way, without grabbing the
> + * mmap_sem.
> + */
> +vm_fault_t __handle_speculative_fault(struct mm_struct *mm,
> + unsigned long address,
> + unsigned int flags)
> +{
> + struct vm_fault vmf = {
> + .address = address,
> + };
> + pgd_t *pgd, pgdval;
> + p4d_t *p4d, p4dval;
> + pud_t pudval;
> + int seq;
> + vm_fault_t ret = VM_FAULT_RETRY;
> + struct vm_area_struct *vma;
> +#ifdef CONFIG_NUMA
> + struct mempolicy *pol;
> +#endif
> +
> + /* Clear flags that may lead to release the mmap_sem to retry */
> + flags &= ~(FAULT_FLAG_ALLOW_RETRY|FAULT_FLAG_KILLABLE);
> + flags |= FAULT_FLAG_SPECULATIVE;
> +
> + vma = find_vma_rcu(mm, address);
> + if (!vma)
> + return ret;
> +
> + /* rmb <-> seqlock,vma_rb_erase() */
> + seq = raw_read_seqcount(&vma->vm_sequence);
> + if (seq & 1)
> + goto out_put;

A comment explaining that an odd sequence number means we are racing with a
vm_write_begin()/vm_write_end() would be welcome above (see the sketch below).
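Something along these lines, maybe (wording is only a suggestion):

	/*
	 * rmb <-> seqlock, vma_rb_erase()
	 * An odd sequence count means a vm_write_begin() is in progress,
	 * i.e. the VMA is currently being modified; abort the speculative
	 * path and let the classic page fault handle it.
	 */
	seq = raw_read_seqcount(&vma->vm_sequence);
	if (seq & 1)
		goto out_put;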

> +
> + /*
> + * Can't call vm_ops services as we don't know what they would do
> + * with the VMA.
> + * This includes huge pages from hugetlbfs.
> + */
> + if (vma->vm_ops && vma->vm_ops->fault)
> + goto out_put;
> +
> + /*
> + * __anon_vma_prepare() requires the mmap_sem to be held
> + * because vm_next and vm_prev must be safe. This can't be guaranteed
> + * in the speculative path.
> + */
> + if (unlikely(!vma->anon_vma))
> + goto out_put;

Maybe also remind people that once vma->anon_vma is set its value will not
change, and thus we do not need to protect against such a change (unlike
vm_flags or the other VMA fields checked above and below). For instance,
see the sketch below.
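For instance (wording only a suggestion):

	/*
	 * __anon_vma_prepare() requires the mmap_sem to be held
	 * because vm_next and vm_prev must be safe. This can't be
	 * guaranteed in the speculative path.
	 * Note that once vma->anon_vma is set it is never changed,
	 * so unlike vm_flags or the VMA boundaries it does not need
	 * to be re-checked against the sequence count.
	 */
	if (unlikely(!vma->anon_vma))
		goto out_put;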

> +
> + vmf.vma_flags = READ_ONCE(vma->vm_flags);
> + vmf.vma_page_prot = READ_ONCE(vma->vm_page_prot);
> +
> + /* Can't call userland page fault handler in the speculative path */
> + if (unlikely(vmf.vma_flags & VM_UFFD_MISSING))
> + goto out_put;
> +
> + if (vmf.vma_flags & VM_GROWSDOWN || vmf.vma_flags & VM_GROWSUP)
> + /*
> + * This could be detected by the check address against VMA's
> + * boundaries but we want to trace it as not supported instead
> + * of changed.
> + */
> + goto out_put;
> +
> + if (address < READ_ONCE(vma->vm_start)
> + || READ_ONCE(vma->vm_end) <= address)
> + goto out_put;
> +
> + if (!arch_vma_access_permitted(vma, flags & FAULT_FLAG_WRITE,
> + flags & FAULT_FLAG_INSTRUCTION,
> + flags & FAULT_FLAG_REMOTE)) {
> + ret = VM_FAULT_SIGSEGV;
> + goto out_put;
> + }
> +
> + /* This one is required to check that the VMA has write access set */
> + if (flags & FAULT_FLAG_WRITE) {
> + if (unlikely(!(vmf.vma_flags & VM_WRITE))) {
> + ret = VM_FAULT_SIGSEGV;
> + goto out_put;
> + }
> + } else if (unlikely(!(vmf.vma_flags & (VM_READ|VM_EXEC|VM_WRITE)))) {
> + ret = VM_FAULT_SIGSEGV;
> + goto out_put;
> + }
> +
> +#ifdef CONFIG_NUMA
> + /*
> + * MPOL_INTERLEAVE implies additional checks in
> + * mpol_misplaced() which are not compatible with the
> + * speculative page fault processing.
> + */
> + pol = __get_vma_policy(vma, address);
> + if (!pol)
> + pol = get_task_policy(current);
> + if (pol && pol->mode == MPOL_INTERLEAVE)
> + goto out_put;
> +#endif
> +
> + /*
> + * Do a speculative lookup of the PTE entry.
> + */
> + local_irq_disable();
> + pgd = pgd_offset(mm, address);
> + pgdval = READ_ONCE(*pgd);
> + if (pgd_none(pgdval) || unlikely(pgd_bad(pgdval)))
> + goto out_walk;
> +
> + p4d = p4d_offset(pgd, address);
> + p4dval = READ_ONCE(*p4d);
> + if (p4d_none(p4dval) || unlikely(p4d_bad(p4dval)))
> + goto out_walk;
> +
> + vmf.pud = pud_offset(p4d, address);
> + pudval = READ_ONCE(*vmf.pud);
> + if (pud_none(pudval) || unlikely(pud_bad(pudval)))
> + goto out_walk;
> +
> + /* Huge pages at PUD level are not supported. */
> + if (unlikely(pud_trans_huge(pudval)))
> + goto out_walk;
> +
> + vmf.pmd = pmd_offset(vmf.pud, address);
> + vmf.orig_pmd = READ_ONCE(*vmf.pmd);
> + /*
> + * pmd_none could mean that a hugepage collapse is in progress
> + * in our back as collapse_huge_page() mark it before
> + * invalidating the pte (which is done once the IPI is catched
> + * by all CPU and we have interrupt disabled).
> + * For this reason we cannot handle THP in a speculative way since we
> + * can't safely identify an in progress collapse operation done in our
> + * back on that PMD.
> + * Regarding the order of the following checks, see comment in
> + * pmd_devmap_trans_unstable()
> + */
> + if (unlikely(pmd_devmap(vmf.orig_pmd) ||
> + pmd_none(vmf.orig_pmd) || pmd_trans_huge(vmf.orig_pmd) ||
> + is_swap_pmd(vmf.orig_pmd)))
> + goto out_walk;
> +
> + /*
> + * The above does not allocate/instantiate page-tables because doing so
> + * would lead to the possibility of instantiating page-tables after
> + * free_pgtables() -- and consequently leaking them.
> + *
> + * The result is that we take at least one !speculative fault per PMD
> + * in order to instantiate it.
> + */
> +
> + vmf.pte = pte_offset_map(vmf.pmd, address);
> + vmf.orig_pte = READ_ONCE(*vmf.pte);
> + barrier(); /* See comment in handle_pte_fault() */
> + if (pte_none(vmf.orig_pte)) {
> + pte_unmap(vmf.pte);
> + vmf.pte = NULL;
> + }
> +
> + vmf.vma = vma;
> + vmf.pgoff = linear_page_index(vma, address);
> + vmf.gfp_mask = __get_fault_gfp_mask(vma);
> + vmf.sequence = seq;
> + vmf.flags = flags;
> +
> + local_irq_enable();
> +
> + /*
> + * We need to re-validate the VMA after checking the bounds, otherwise
> + * we might have a false positive on the bounds.
> + */
> + if (read_seqcount_retry(&vma->vm_sequence, seq))
> + goto out_put;
> +
> + mem_cgroup_enter_user_fault();
> + ret = handle_pte_fault(&vmf);
> + mem_cgroup_exit_user_fault();
> +
> + put_vma(vma);
> +
> + /*
> + * The task may have entered a memcg OOM situation but
> + * if the allocation error was handled gracefully (no
> + * VM_FAULT_OOM), there is no need to kill anything.
> + * Just clean up the OOM state peacefully.
> + */
> + if (task_in_memcg_oom(current) && !(ret & VM_FAULT_OOM))
> + mem_cgroup_oom_synchronize(false);
> + return ret;
> +
> +out_walk:
> + local_irq_enable();
> +out_put:
> + put_vma(vma);
> + return ret;
> +}
> +#endif /* CONFIG_SPECULATIVE_PAGE_FAULT */
> +
> /*
> * By the time we get here, we already hold the mm semaphore
> *
> --
> 2.21.0
>

2019-04-23 03:25:08

by Jerome Glisse

[permalink] [raw]
Subject: Re: [PATCH v12 23/31] mm: don't do swap readahead during speculative page fault

On Tue, Apr 16, 2019 at 03:45:14PM +0200, Laurent Dufour wrote:
> Vinayak Menon faced a panic because one thread was page faulting a page in
> swap, while another one was mprotecting a part of the VMA leading to a VMA
> split.
> This raise a panic in swap_vma_readahead() because the VMA's boundaries
> were not more matching the faulting address.
>
> To avoid this, if the page is not found in the swap, the speculative page
> fault is aborted to retry a regular page fault.
>
> Reported-by: Vinayak Menon <[email protected]>
> Signed-off-by: Laurent Dufour <[email protected]>

Reviewed-by: Jérôme Glisse <[email protected]>

Note that you should also skip non-swap entries in do_swap_page() when doing
a speculative page fault; at the very least you need to handle the
is_device_private_entry() case.

But this should either be part of patch 22 or another patch to fix the swap
case.

> ---
> mm/memory.c | 11 +++++++++++
> 1 file changed, 11 insertions(+)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 6e6bf61c0e5c..1991da97e2db 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -2900,6 +2900,17 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> lru_cache_add_anon(page);
> swap_readpage(page, true);
> }
> + } else if (vmf->flags & FAULT_FLAG_SPECULATIVE) {
> + /*
> + * Don't try readahead during a speculative page fault
> + * as the VMA's boundaries may change in our back.
> + * If the page is not in the swap cache and synchronous
> + * read is disabled, fall back to the regular page
> + * fault mechanism.
> + */
> + delayacct_clear_flag(DELAYACCT_PF_SWAPIN);
> + ret = VM_FAULT_RETRY;
> + goto out;
> } else {
> page = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE,
> vmf);
> --
> 2.21.0
>

2019-04-23 03:25:08

by Michel Lespinasse

[permalink] [raw]
Subject: Re: [PATCH v12 00/31] Speculative page faults

Hi Laurent,

Thanks a lot for copying me on this patchset. It took me a few days to
go through it - I had not been following the previous iterations of
this series so I had to catch up. I will be sending comments for
individual commits, but before that I would like to discuss the series
as a whole.

I think these changes are a big step in the right direction. My main
reservation about them is that they are additive - adding some complexity
for speculative page faults - and I wonder if it'd be possible, over the
long term, to replace the existing complexity we have in mmap_sem retry
mechanisms instead of adding to it. This is not something that should
block your progress, but I think it would be good, as we introduce spf,
to evaluate whether we could eventually get all the way to removing the
mmap_sem retry mechanism, or if we will actually have to keep both.


The proposed spf mechanism only handles anon vmas. Is there a
fundamental reason why it couldn't handle mapped files too ?
My understanding is that the mechanism of verifying the vma after
taking back the ptl at the end of the fault would work there too ?
The file has to stay referenced during the fault, but holding the vma's
refcount could be made to cover that ? the vm_file refcount would have
to be released in __free_vma() instead of remove_vma; I'm not quite sure
if that has more implications than I realize ?

The proposed spf mechanism only works at the pte level after the page
tables have already been created. The non-spf page fault path takes the
mm->page_table_lock to protect against concurrent page table allocation
by multiple page faults; I think unmapping/freeing page tables could
be done under mm->page_table_lock too so that spf could implement
allocating new page tables by verifying the vma after taking the
mm->page_table_lock ?

The proposed spf mechanism depends on ARCH_HAS_PTE_SPECIAL.
I am not sure what the issue is there - is this due to the vma->vm_start
and vma->vm_pgoff reads in *__vm_normal_page() ?


My last potential concern is about performance. The numbers you have
look great, but I worry about potential regressions in PF performance
for threaded processes that don't currently encounter contention
(i.e. there may be just one thread actually doing all the work while
the others are blocked). I think one good proxy for measuring that
would be to measure a single threaded workload - kernbench would be
fine - without the special-case optimization in patch 22 where
handle_speculative_fault() immediately aborts in the single-threaded case.

Reviewed-by: Michel Lespinasse <[email protected]>
This is for the series as a whole; I expect to do another review pass on
individual commits in the series when we have agreement on the toplevel
stuff (I noticed a few things like out-of-date commit messages but that's
really minor stuff).


I want to add a note about mmap_sem. In the past there has been
discussions about replacing it with an interval lock, but these never
went anywhere because, mostly, of the fact that such mechanisms were
too expensive to use in the page fault path. I think adding the spf
mechanism would invite us to revisit this issue - interval locks may
be a great way to avoid blocking between unrelated mmap_sem writers
(for example, do not delay stack creation for new threads while a
large mmap or munmap may be going on), and probably also to handle
mmap_sem readers that can't easily use the spf mechanism (for example,
gup callers which make use of the returned vmas). But again that is a
separate topic to explore which doesn't have to get resolved before
spf goes in.

2019-04-23 09:30:37

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v12 21/31] mm: Introduce find_vma_rcu()

On Tue, Apr 16, 2019 at 03:45:12PM +0200, Laurent Dufour wrote:
> This allows to search for a VMA structure without holding the mmap_sem.
>
> The search is repeated while the mm seqlock is changing and until we found
> a valid VMA.
>
> While under the RCU protection, a reference is taken on the VMA, so the
> caller must call put_vma() once it not more need the VMA structure.
>
> At the time a VMA is inserted in the MM RB tree, in vma_rb_insert(), a
> reference is taken to the VMA by calling get_vma().
>
> When removing a VMA from the MM RB tree, the VMA is not release immediately
> but at the end of the RCU grace period through vm_rcu_put(). This ensures
> that the VMA remains allocated until the end the RCU grace period.
>
> Since the vm_file pointer, if valid, is released in put_vma(), there is no
> guarantee that the file pointer will be valid on the returned VMA.

What I'm missing here, and in the previous patch introducing the
refcount (also see refcount_t), is _why_ we need the refcount thing at
all.

My original plan was to use SRCU, which at the time was not complete
enough so I abused/hacked preemptible RCU, but that is no longer the
case, SRCU has all the required bits and pieces.

Also; the initial motivation was prefaulting large VMAs and the
contention on mmap was killing things; but similarly, the contention on
the refcount (I did try that) killed things just the same.

So I'm really sad to see the refcount return; and without any apparent
justification.

2019-04-23 09:40:28

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v12 00/31] Speculative page faults

On Mon, Apr 22, 2019 at 02:29:16PM -0700, Michel Lespinasse wrote:
> The proposed spf mechanism only handles anon vmas. Is there a
> fundamental reason why it couldn't handle mapped files too ?
> My understanding is that the mechanism of verifying the vma after
> taking back the ptl at the end of the fault would work there too ?
> The file has to stay referenced during the fault, but holding the vma's
> refcount could be made to cover that ? the vm_file refcount would have
> to be released in __free_vma() instead of remove_vma; I'm not quite sure
> if that has more implications than I realize ?

IIRC (and I really don't remember all that much) the trickiest bit was
vs unmount. Since files can stay open past the 'expected' duration,
umount could be delayed.

But yes, I think I had a version that did all that just 'fine'. Like
mentioned, I didn't keep the refcount because it sucked just as hard as
the mmap_sem contention, but the SRCU callback did the fput() just fine
(esp. now that we have delayed_fput).

2019-04-23 10:48:48

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH v12 00/31] Speculative page faults

On Mon 22-04-19 14:29:16, Michel Lespinasse wrote:
[...]
> I want to add a note about mmap_sem. In the past there has been
> discussions about replacing it with an interval lock, but these never
> went anywhere because, mostly, of the fact that such mechanisms were
> too expensive to use in the page fault path. I think adding the spf
> mechanism would invite us to revisit this issue - interval locks may
> be a great way to avoid blocking between unrelated mmap_sem writers
> (for example, do not delay stack creation for new threads while a
> large mmap or munmap may be going on), and probably also to handle
> mmap_sem readers that can't easily use the spf mechanism (for example,
> gup callers which make use of the returned vmas). But again that is a
> separate topic to explore which doesn't have to get resolved before
> spf goes in.

Well, I believe we should _really_ re-evaluate the range locking sooner
rather than later. Why? Because it looks like the most straightforward
approach to the mmap_sem contention for most usecases I have heard of
(mostly a mm{unm}ap, mremap standing in the way of page faults).
On a plus side it also makes us think about the current mmap (ab)users
which should lead to an overall code improvements and maintainability.

SPF sounds like a good idea but it is a really big and intrusive surgery
to the #PF path. And more importantly without any real world usecase
numbers which would justify this. That being said I am not opposed to
this change I just think it is a large hammer while we haven't seen
attempts to tackle problems in a simpler way.

--
Michal Hocko
SUSE Labs

2019-04-23 11:36:42

by Anshuman Khandual

[permalink] [raw]
Subject: Re: [PATCH v12 00/31] Speculative page faults

On 04/16/2019 07:14 PM, Laurent Dufour wrote:
> In pseudo code, this could be seen as:
> speculative_page_fault()
> {
> 	vma = find_vma_rcu()
> 	check vma sequence count
> 	check vma's support
> 	disable interrupt
> 		check pgd,p4d,...,pte
> 		save pmd and pte in vmf
> 		save vma sequence counter in vmf
> 	enable interrupt
> 	check vma sequence count
> 	handle_pte_fault(vma)
> 		..
> 		page = alloc_page()
> 		pte_map_lock()
> 			disable interrupt
> 				abort if sequence counter has changed
> 				abort if pmd or pte has changed
> 				pte map and lock
> 			enable interrupt
> 		if abort
> 			free page
> 			abort

Wouldn't it be better if the 'page' allocated here could be passed on to
handle_pte_fault() below, so that in the fallback path it does not have to
enter the buddy allocator again? Of course it would require changes to
handle_pte_fault() to accommodate a pre-allocated non-NULL struct page to
operate on, or to free it back into the buddy if the fallback path fails for
some other reason. This would probably reduce the overhead of the SPF path
for cases where it has to fall back on handle_pte_fault() after
pte_map_lock() in speculative_page_fault().

> 		...
> 	put_vma(vma)
> }
>
> arch_fault_handler()
> {
> 	if (speculative_page_fault(&vma))
> 		goto done
> again:
> 	lock(mmap_sem)
> 	vma = find_vma();
> 	handle_pte_fault(vma);
> 	if retry
> 		unlock(mmap_sem)
> 		goto again;
> done:
> 	handle fault error
> }

- Anshuman

2019-04-23 12:43:31

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH v12 00/31] Speculative page faults

On Tue, Apr 23, 2019 at 12:47:07PM +0200, Michal Hocko wrote:
> On Mon 22-04-19 14:29:16, Michel Lespinasse wrote:
> [...]
> > I want to add a note about mmap_sem. In the past there has been
> > discussions about replacing it with an interval lock, but these never
> > went anywhere because, mostly, of the fact that such mechanisms were
> > too expensive to use in the page fault path. I think adding the spf
> > mechanism would invite us to revisit this issue - interval locks may
> > be a great way to avoid blocking between unrelated mmap_sem writers
> > (for example, do not delay stack creation for new threads while a
> > large mmap or munmap may be going on), and probably also to handle
> > mmap_sem readers that can't easily use the spf mechanism (for example,
> > gup callers which make use of the returned vmas). But again that is a
> > separate topic to explore which doesn't have to get resolved before
> > spf goes in.
>
> Well, I believe we should _really_ re-evaluate the range locking sooner
> rather than later. Why? Because it looks like the most straightforward
> approach to the mmap_sem contention for most usecases I have heard of
> (mostly a mm{unm}ap, mremap standing in the way of page faults).
> On a plus side it also makes us think about the current mmap (ab)users
> which should lead to an overall code improvements and maintainability.

Dave Chinner recently did evaluate the range lock for solving a problem
in XFS and didn't like what he saw:

https://lore.kernel.org/linux-fsdevel/[email protected]/T/#md981b32c12a2557a2dd0f79ad41d6c8df1f6f27c

I think scaling the lock needs to be tied to the actual data structure
and not have a second tree on-the-side to fake-scale the locking. Anyway,
we're going to have a session on this at LSFMM, right?

> SPF sounds like a good idea but it is a really big and intrusive surgery
> to the #PF path. And more importantly without any real world usecase
> numbers which would justify this. That being said I am not opposed to
> this change I just think it is a large hammer while we haven't seen
> attempts to tackle problems in a simpler way.

I don't think the "no real world usecase numbers" is fair. Laurent quoted:

> Ebizzy:
> -------
> The test is counting the number of records per second it can manage, the
> higher is the best. I run it like this 'ebizzy -mTt <nrcpus>'. To get
> consistent result I repeated the test 100 times and measure the average
> result. The number is the record processes per second, the higher is the best.
>
> BASE SPF delta
> 24 CPUs x86 5492.69 9383.07 70.83%
> 1024 CPUS P8 VM 8476.74 17144.38 102%

and cited 30% improvement for you-know-what product from an earlier
version of the patch.

2019-04-23 12:51:22

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v12 00/31] Speculative page faults

On Tue, Apr 23, 2019 at 05:41:48AM -0700, Matthew Wilcox wrote:
> On Tue, Apr 23, 2019 at 12:47:07PM +0200, Michal Hocko wrote:
> > Well, I believe we should _really_ re-evaluate the range locking sooner
> > rather than later. Why? Because it looks like the most straightforward
> > approach to the mmap_sem contention for most usecases I have heard of
> > (mostly a mm{unm}ap, mremap standing in the way of page faults).
> > On a plus side it also makes us think about the current mmap (ab)users
> > which should lead to an overall code improvements and maintainability.
>
> Dave Chinner recently did evaluate the range lock for solving a problem
> in XFS and didn't like what he saw:
>
> https://lore.kernel.org/linux-fsdevel/[email protected]/T/#md981b32c12a2557a2dd0f79ad41d6c8df1f6f27c
>
> I think scaling the lock needs to be tied to the actual data structure
> and not have a second tree on-the-side to fake-scale the locking.

Right, which is how I ended up using the split PT locks. They already
provide fine(r) grained locking.

2019-04-23 13:43:51

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH v12 00/31] Speculative page faults

On Tue 23-04-19 05:41:48, Matthew Wilcox wrote:
> On Tue, Apr 23, 2019 at 12:47:07PM +0200, Michal Hocko wrote:
> > On Mon 22-04-19 14:29:16, Michel Lespinasse wrote:
> > [...]
> > > I want to add a note about mmap_sem. In the past there has been
> > > discussions about replacing it with an interval lock, but these never
> > > went anywhere because, mostly, of the fact that such mechanisms were
> > > too expensive to use in the page fault path. I think adding the spf
> > > mechanism would invite us to revisit this issue - interval locks may
> > > be a great way to avoid blocking between unrelated mmap_sem writers
> > > (for example, do not delay stack creation for new threads while a
> > > large mmap or munmap may be going on), and probably also to handle
> > > mmap_sem readers that can't easily use the spf mechanism (for example,
> > > gup callers which make use of the returned vmas). But again that is a
> > > separate topic to explore which doesn't have to get resolved before
> > > spf goes in.
> >
> > Well, I believe we should _really_ re-evaluate the range locking sooner
> > rather than later. Why? Because it looks like the most straightforward
> > approach to the mmap_sem contention for most usecases I have heard of
> > (mostly a mm{unm}ap, mremap standing in the way of page faults).
> > On a plus side it also makes us think about the current mmap (ab)users
> > which should lead to an overall code improvements and maintainability.
>
> Dave Chinner recently did evaluate the range lock for solving a problem
> in XFS and didn't like what he saw:
>
> https://lore.kernel.org/linux-fsdevel/[email protected]/T/#md981b32c12a2557a2dd0f79ad41d6c8df1f6f27c

Thank you, will have a look.

> I think scaling the lock needs to be tied to the actual data structure
> and not have a second tree on-the-side to fake-scale the locking. Anyway,
> we're going to have a session on this at LSFMM, right?

I thought we had something for the mmap_sem scaling but I do not see
this in a list of proposed topics. But we can certainly add it there.

> > SPF sounds like a good idea but it is a really big and intrusive surgery
> > to the #PF path. And more importantly without any real world usecase
> > numbers which would justify this. That being said I am not opposed to
> > this change I just think it is a large hammer while we haven't seen
> > attempts to tackle problems in a simpler way.
>
> I don't think the "no real world usecase numbers" is fair. Laurent quoted:
>
> > Ebizzy:
> > -------
> > The test is counting the number of records per second it can manage, the
> > higher is the best. I run it like this 'ebizzy -mTt <nrcpus>'. To get
> > consistent result I repeated the test 100 times and measure the average
> > result. The number is the record processes per second, the higher is the best.
> >
> > BASE SPF delta
> > 24 CPUs x86 5492.69 9383.07 70.83%
> > 1024 CPUS P8 VM 8476.74 17144.38 102%
>
> and cited 30% improvement for you-know-what product from an earlier
> version of the patch.

Well, we are talking about
45 files changed, 1277 insertions(+), 196 deletions(-)

which is _major_ surgery in my book. Having real-life workload numbers
is nothing unfair to ask for IMHO.

And let me remind you that I am not really opposing SPF in general. I
would just like to see a simpler approach before we make such a large
change. If the range locking is not really a scalable approach then all
right, but from what I've seen it should help with most of the
bottlenecks I have encountered.
--
Michal Hocko
SUSE Labs

2019-04-23 15:23:38

by Laurent Dufour

[permalink] [raw]
Subject: Re: [PATCH v12 01/31] mm: introduce CONFIG_SPECULATIVE_PAGE_FAULT

On 18/04/2019 at 23:47, Jerome Glisse wrote:
> On Tue, Apr 16, 2019 at 03:44:52PM +0200, Laurent Dufour wrote:
>> This configuration variable will be used to build the code needed to
>> handle speculative page fault.
>>
>> By default it is turned off, and activated depending on architecture
>> support, ARCH_HAS_PTE_SPECIAL, SMP and MMU.
>>
>> The architecture support is needed since the speculative page fault handler
>> is called from the architecture's page faulting code, and some code has to
>> be added there to handle the speculative handler.
>>
>> The dependency on ARCH_HAS_PTE_SPECIAL is required because vm_normal_page()
>> does processing that is not compatible with the speculative handling in the
>> case ARCH_HAS_PTE_SPECIAL is not set.
>>
>> Suggested-by: Thomas Gleixner <[email protected]>
>> Suggested-by: David Rientjes <[email protected]>
>> Signed-off-by: Laurent Dufour <[email protected]>
>
> Reviewed-by: Jérôme Glisse <[email protected]>

Thanks Jérôme.

> Small question below
>
>> ---
>> mm/Kconfig | 22 ++++++++++++++++++++++
>> 1 file changed, 22 insertions(+)
>>
>> diff --git a/mm/Kconfig b/mm/Kconfig
>> index 0eada3f818fa..ff278ac9978a 100644
>> --- a/mm/Kconfig
>> +++ b/mm/Kconfig
>> @@ -761,4 +761,26 @@ config GUP_BENCHMARK
>> config ARCH_HAS_PTE_SPECIAL
>> bool
>>
>> +config ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT
>> + def_bool n
>> +
>> +config SPECULATIVE_PAGE_FAULT
>> + bool "Speculative page faults"
>> + default y
>> + depends on ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT
>> + depends on ARCH_HAS_PTE_SPECIAL && MMU && SMP
>> + help
>> + Try to handle user space page faults without holding the mmap_sem.
>> +
>> + This should allow better concurrency for massively threaded processes
>
> Is there any case where it does not provide better concurrency ? The
> should make me wonder :)

Depending on the VMA's type, it may not provide better concurrency.
Indeed, only anonymous mappings are currently handled. Perhaps this should
be mentioned here, shouldn't it?

>> + since the page fault handler will not wait for other thread's memory
>> + layout change to be done, assuming that this change is done in
>> + another part of the process's memory space. This type of page fault
>> + is named speculative page fault.
>> +
>> + If the speculative page fault fails because a concurrent modification
>> + is detected or because underlying PMD or PTE tables are not yet
>> + allocated, the speculative page fault fails and a classic page fault
>> + is then tried.
>> +
>> endmenu
>> --
>> 2.21.0
>>
>

2019-04-23 15:38:21

by Laurent Dufour

[permalink] [raw]
Subject: Re: [PATCH v12 04/31] arm64/mm: define ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT

On 18/04/2019 at 23:51, Jerome Glisse wrote:
> On Tue, Apr 16, 2019 at 03:41:56PM +0100, Mark Rutland wrote:
>> On Tue, Apr 16, 2019 at 04:31:27PM +0200, Laurent Dufour wrote:
>>> On 16/04/2019 at 16:27, Mark Rutland wrote:
>>>> On Tue, Apr 16, 2019 at 03:44:55PM +0200, Laurent Dufour wrote:
>>>>> From: Mahendran Ganesh <[email protected]>
>>>>>
>>>>> Set ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT for arm64. This
>>>>> enables Speculative Page Fault handler.
>>>>>
>>>>> Signed-off-by: Ganesh Mahendran <[email protected]>
>>>>
>>>> This is missing your S-o-B.
>>>
>>> You're right, I missed that...
>>>
>>>> The first patch noted that the ARCH_SUPPORTS_* option was there because
>>>> the arch code had to make an explicit call to try to handle the fault
>>>> speculatively, but that isn't addeed until patch 30.
>>>>
>>>> Why is this separate from that code?
>>>
>>> Andrew was recommended this a long time ago for bisection purpose. This
>>> allows to build the code with CONFIG_SPECULATIVE_PAGE_FAULT before the code
>>> that trigger the spf handler is added to the per architecture's code.
>>
>> Ok. I think it would be worth noting that in the commit message, to
>> avoid anyone else asking the same question. :)
>
> Should have read this thread before looking at x86 and ppc :)
>
> In any case the patch is:
>
> Reviewed-by: Jérôme Glisse <[email protected]>

Thanks Mark and Jérôme for reviewing this.

Regarding the change in the commit message, I'm wondering if it would
be better to place it in the series' cover letter.

But I'm fine with putting it in each architecture's commit.


2019-04-23 15:46:02

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH v12 07/31] mm: make pte_unmap_same compatible with SPF

On Tue, Apr 16, 2019 at 03:44:58PM +0200, Laurent Dufour wrote:
> +static inline vm_fault_t pte_unmap_same(struct vm_fault *vmf)
> {
> - int same = 1;
> + int ret = 0;

Surely 'ret' should be of type vm_fault_t?

> + ret = VM_FAULT_RETRY;

... this should have thrown a sparse warning?

2019-04-23 15:47:52

by Laurent Dufour

[permalink] [raw]
Subject: Re: [PATCH v12 05/31] mm: prepare for FAULT_FLAG_SPECULATIVE

On 19/04/2019 at 00:04, Jerome Glisse wrote:
> On Tue, Apr 16, 2019 at 03:44:56PM +0200, Laurent Dufour wrote:
>> From: Peter Zijlstra <[email protected]>
>>
>> When speculating faults (without holding mmap_sem) we need to validate
>> that the vma against which we loaded pages is still valid when we're
>> ready to install the new PTE.
>>
>> Therefore, replace the pte_offset_map_lock() calls that (re)take the
>> PTL with pte_map_lock() which can fail in case we find the VMA changed
>> since we started the fault.
>>
>> Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
>>
>> [Port to 4.12 kernel]
>> [Remove the comment about the fault_env structure which has been
>> implemented as the vm_fault structure in the kernel]
>> [move pte_map_lock()'s definition upper in the file]
>> [move the define of FAULT_FLAG_SPECULATIVE later in the series]
>> [review error path in do_swap_page(), do_anonymous_page() and
>> wp_page_copy()]
>> Signed-off-by: Laurent Dufour <[email protected]>
>
> Reviewed-by: Jérôme Glisse <[email protected]>
>
>> ---
>> mm/memory.c | 87 +++++++++++++++++++++++++++++++++++------------------
>> 1 file changed, 58 insertions(+), 29 deletions(-)
>>
>> diff --git a/mm/memory.c b/mm/memory.c
>> index c6ddadd9d2b7..fc3698d13cb5 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -2073,6 +2073,13 @@ int apply_to_page_range(struct mm_struct *mm, unsigned long addr,
>> }
>> EXPORT_SYMBOL_GPL(apply_to_page_range);
>>
>> +static inline bool pte_map_lock(struct vm_fault *vmf)
>
> I am not a fan of the name; maybe pte_offset_map_lock_if_valid()? But
> that's just a matter of taste, so feel free to ignore this comment.

I agree with you that adding _if_valid or something equivalent to
highlight the conditional behaviour of this function would be a good idea.

I'll think further about the name, but yours looks good ;)


>> +{
>> + vmf->pte = pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd,
>> + vmf->address, &vmf->ptl);
>> + return true;
>> +}
>> +
>> /*
>> * handle_pte_fault chooses page fault handler according to an entry which was
>> * read non-atomically. Before making any commitment, on those architectures
>

2019-04-23 15:49:16

by Laurent Dufour

[permalink] [raw]
Subject: Re: [PATCH v12 07/31] mm: make pte_unmap_same compatible with SPF

On 23/04/2019 at 17:43, Matthew Wilcox wrote:
> On Tue, Apr 16, 2019 at 03:44:58PM +0200, Laurent Dufour wrote:
>> +static inline vm_fault_t pte_unmap_same(struct vm_fault *vmf)
>> {
>> - int same = 1;
>> + int ret = 0;
>
> Surely 'ret' should be of type vm_fault_t?

Nice catch !

>
>> + ret = VM_FAULT_RETRY;
>
> ... this should have thrown a sparse warning?

It should have, but I can't remember having seen it, weird...

2019-04-23 15:54:13

by Laurent Dufour

[permalink] [raw]
Subject: Re: [PATCH v12 11/31] mm: protect mremap() against SPF hanlder

On 22/04/2019 at 21:51, Jerome Glisse wrote:
> On Tue, Apr 16, 2019 at 03:45:02PM +0200, Laurent Dufour wrote:
>> If a thread is remapping an area while another one is faulting on the
>> destination area, the SPF handler may fetch the vma from the RB tree before
>> the pte has been moved by the other thread. This means that the moved ptes
>> will overwrite those create by the page fault handler leading to page
>> leaked.
>>
>> CPU 1 CPU2
>> enter mremap()
>> unmap the dest area
>> copy_vma() Enter speculative page fault handler
>> >> at this time the dest area is present in the RB tree
>> fetch the vma matching dest area
>> create a pte as the VMA matched
>> Exit the SPF handler
>> <data written in the new page>
>> move_ptes()
>> > it is assumed that the dest area is empty,
>> > the move ptes overwrite the page mapped by the CPU2.
>>
>> To prevent that, when the VMA matching the dest area is extended or created
>> by copy_vma(), it should be marked as non available to the SPF handler.
>> The usual way to so is to rely on vm_write_begin()/end().
>> This is already in __vma_adjust() called by copy_vma() (through
>> vma_merge()). But __vma_adjust() is calling vm_write_end() before returning
>> which create a window for another thread.
>> This patch adds a new parameter to vma_merge() which is passed down to
>> vma_adjust().
>> The assumption is that copy_vma() is returning a vma which should be
>> released by calling vm_raw_write_end() by the callee once the ptes have
>> been moved.
>>
>> Signed-off-by: Laurent Dufour <[email protected]>
>
> Reviewed-by: Jérôme Glisse <[email protected]>
>
> Small comment about a comment below but can be fix as a fixup
> patch nothing earth shattering.
>
>> ---
>> include/linux/mm.h | 24 ++++++++++++++++-----
>> mm/mmap.c | 53 +++++++++++++++++++++++++++++++++++-----------
>> mm/mremap.c | 13 ++++++++++++
>> 3 files changed, 73 insertions(+), 17 deletions(-)
>>
>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>> index 906b9e06f18e..5d45b7d8718d 100644
>> --- a/include/linux/mm.h
>> +++ b/include/linux/mm.h
>> @@ -2343,18 +2343,32 @@ void anon_vma_interval_tree_verify(struct anon_vma_chain *node);
>>
>> /* mmap.c */
>> extern int __vm_enough_memory(struct mm_struct *mm, long pages, int cap_sys_admin);
>> +
>> extern int __vma_adjust(struct vm_area_struct *vma, unsigned long start,
>> unsigned long end, pgoff_t pgoff, struct vm_area_struct *insert,
>> - struct vm_area_struct *expand);
>> + struct vm_area_struct *expand, bool keep_locked);
>> +
>> static inline int vma_adjust(struct vm_area_struct *vma, unsigned long start,
>> unsigned long end, pgoff_t pgoff, struct vm_area_struct *insert)
>> {
>> - return __vma_adjust(vma, start, end, pgoff, insert, NULL);
>> + return __vma_adjust(vma, start, end, pgoff, insert, NULL, false);
>> }
>> -extern struct vm_area_struct *vma_merge(struct mm_struct *,
>> +
>> +extern struct vm_area_struct *__vma_merge(struct mm_struct *mm,
>> + struct vm_area_struct *prev, unsigned long addr, unsigned long end,
>> + unsigned long vm_flags, struct anon_vma *anon, struct file *file,
>> + pgoff_t pgoff, struct mempolicy *mpol,
>> + struct vm_userfaultfd_ctx uff, bool keep_locked);
>> +
>> +static inline struct vm_area_struct *vma_merge(struct mm_struct *mm,
>> struct vm_area_struct *prev, unsigned long addr, unsigned long end,
>> - unsigned long vm_flags, struct anon_vma *, struct file *, pgoff_t,
>> - struct mempolicy *, struct vm_userfaultfd_ctx);
>> + unsigned long vm_flags, struct anon_vma *anon, struct file *file,
>> + pgoff_t off, struct mempolicy *pol, struct vm_userfaultfd_ctx uff)
>> +{
>> + return __vma_merge(mm, prev, addr, end, vm_flags, anon, file, off,
>> + pol, uff, false);
>> +}
>> +
>> extern struct anon_vma *find_mergeable_anon_vma(struct vm_area_struct *);
>> extern int __split_vma(struct mm_struct *, struct vm_area_struct *,
>> unsigned long addr, int new_below);
>> diff --git a/mm/mmap.c b/mm/mmap.c
>> index b77ec0149249..13460b38b0fb 100644
>> --- a/mm/mmap.c
>> +++ b/mm/mmap.c
>> @@ -714,7 +714,7 @@ static inline void __vma_unlink_prev(struct mm_struct *mm,
>> */
>> int __vma_adjust(struct vm_area_struct *vma, unsigned long start,
>> unsigned long end, pgoff_t pgoff, struct vm_area_struct *insert,
>> - struct vm_area_struct *expand)
>> + struct vm_area_struct *expand, bool keep_locked)
>> {
>> struct mm_struct *mm = vma->vm_mm;
>> struct vm_area_struct *next = vma->vm_next, *orig_vma = vma;
>> @@ -830,8 +830,12 @@ int __vma_adjust(struct vm_area_struct *vma, unsigned long start,
>>
>> importer->anon_vma = exporter->anon_vma;
>> error = anon_vma_clone(importer, exporter);
>> - if (error)
>> + if (error) {
>> + if (next && next != vma)
>> + vm_raw_write_end(next);
>> + vm_raw_write_end(vma);
>> return error;
>> + }
>> }
>> }
>> again:
>> @@ -1025,7 +1029,8 @@ int __vma_adjust(struct vm_area_struct *vma, unsigned long start,
>>
>> if (next && next != vma)
>> vm_raw_write_end(next);
>> - vm_raw_write_end(vma);
>> + if (!keep_locked)
>> + vm_raw_write_end(vma);
>>
>> validate_mm(mm);
>>
>> @@ -1161,12 +1166,13 @@ can_vma_merge_after(struct vm_area_struct *vma, unsigned long vm_flags,
>> * parameter) may establish ptes with the wrong permissions of NNNN
>> * instead of the right permissions of XXXX.
>> */
>> -struct vm_area_struct *vma_merge(struct mm_struct *mm,
>> +struct vm_area_struct *__vma_merge(struct mm_struct *mm,
>> struct vm_area_struct *prev, unsigned long addr,
>> unsigned long end, unsigned long vm_flags,
>> struct anon_vma *anon_vma, struct file *file,
>> pgoff_t pgoff, struct mempolicy *policy,
>> - struct vm_userfaultfd_ctx vm_userfaultfd_ctx)
>> + struct vm_userfaultfd_ctx vm_userfaultfd_ctx,
>> + bool keep_locked)
>> {
>> pgoff_t pglen = (end - addr) >> PAGE_SHIFT;
>> struct vm_area_struct *area, *next;
>> @@ -1214,10 +1220,11 @@ struct vm_area_struct *vma_merge(struct mm_struct *mm,
>> /* cases 1, 6 */
>> err = __vma_adjust(prev, prev->vm_start,
>> next->vm_end, prev->vm_pgoff, NULL,
>> - prev);
>> + prev, keep_locked);
>> } else /* cases 2, 5, 7 */
>> err = __vma_adjust(prev, prev->vm_start,
>> - end, prev->vm_pgoff, NULL, prev);
>> + end, prev->vm_pgoff, NULL, prev,
>> + keep_locked);
>> if (err)
>> return NULL;
>> khugepaged_enter_vma_merge(prev, vm_flags);
>> @@ -1234,10 +1241,12 @@ struct vm_area_struct *vma_merge(struct mm_struct *mm,
>> vm_userfaultfd_ctx)) {
>> if (prev && addr < prev->vm_end) /* case 4 */
>> err = __vma_adjust(prev, prev->vm_start,
>> - addr, prev->vm_pgoff, NULL, next);
>> + addr, prev->vm_pgoff, NULL, next,
>> + keep_locked);
>> else { /* cases 3, 8 */
>> err = __vma_adjust(area, addr, next->vm_end,
>> - next->vm_pgoff - pglen, NULL, next);
>> + next->vm_pgoff - pglen, NULL, next,
>> + keep_locked);
>> /*
>> * In case 3 area is already equal to next and
>> * this is a noop, but in case 8 "area" has
>> @@ -3259,9 +3268,20 @@ struct vm_area_struct *copy_vma(struct vm_area_struct **vmap,
>>
>> if (find_vma_links(mm, addr, addr + len, &prev, &rb_link, &rb_parent))
>> return NULL; /* should never get here */
>> - new_vma = vma_merge(mm, prev, addr, addr + len, vma->vm_flags,
>> - vma->anon_vma, vma->vm_file, pgoff, vma_policy(vma),
>> - vma->vm_userfaultfd_ctx);
>> +
>> + /* There is 3 cases to manage here in
>> + * AAAA AAAA AAAA AAAA
>> + * PPPP.... PPPP......NNNN PPPP....NNNN PP........NN
>> + * PPPPPPPP(A) PPPP..NNNNNNNN(B) PPPPPPPPPPPP(1) NULL
>> + * PPPPPPPPNNNN(2)
>> + * PPPPNNNNNNNN(3)
>> + *
>> + * new_vma == prev in case A,1,2
>> + * new_vma == next in case B,3
>> + */
>> + new_vma = __vma_merge(mm, prev, addr, addr + len, vma->vm_flags,
>> + vma->anon_vma, vma->vm_file, pgoff,
>> + vma_policy(vma), vma->vm_userfaultfd_ctx, true);
>> if (new_vma) {
>> /*
>> * Source vma may have been merged into new_vma
>> @@ -3299,6 +3319,15 @@ struct vm_area_struct *copy_vma(struct vm_area_struct **vmap,
>> get_file(new_vma->vm_file);
>> if (new_vma->vm_ops && new_vma->vm_ops->open)
>> new_vma->vm_ops->open(new_vma);
>> + /*
>> + * As the VMA is linked right now, it may be hit by the
>> + * speculative page fault handler. But we don't want it to
>> + * to start mapping page in this area until the caller has
>> + * potentially move the pte from the moved VMA. To prevent
>> + * that we protect it right now, and let the caller unprotect
>> + * it once the move is done.
>> + */
>
> It would be better to say:
> /*
> * Block speculative page fault on the new VMA before "linking" it, as
> * once it is linked it may be hit by a speculative page fault.
> * But we don't want it to start mapping page in this area until the
> * caller has potentially move the pte from the moved VMA. To prevent
> * that we protect it before linking and let the caller unprotect it
> * once the move is done.
> */
>

I'm fine with your proposal.

Thanks for reviewing this.


>> + vm_raw_write_begin(new_vma);
>> vma_link(mm, new_vma, prev, rb_link, rb_parent);
>> *need_rmap_locks = false;
>> }
>> diff --git a/mm/mremap.c b/mm/mremap.c
>> index fc241d23cd97..ae5c3379586e 100644
>> --- a/mm/mremap.c
>> +++ b/mm/mremap.c
>> @@ -357,6 +357,14 @@ static unsigned long move_vma(struct vm_area_struct *vma,
>> if (!new_vma)
>> return -ENOMEM;
>>
>> + /* new_vma is returned protected by copy_vma, to prevent speculative
>> + * page fault to be done in the destination area before we move the pte.
>> + * Now, we must also protect the source VMA since we don't want pages
>> + * to be mapped in our back while we are copying the PTEs.
>> + */
>> + if (vma != new_vma)
>> + vm_raw_write_begin(vma);
>> +
>> moved_len = move_page_tables(vma, old_addr, new_vma, new_addr, old_len,
>> need_rmap_locks);
>> if (moved_len < old_len) {
>> @@ -373,6 +381,8 @@ static unsigned long move_vma(struct vm_area_struct *vma,
>> */
>> move_page_tables(new_vma, new_addr, vma, old_addr, moved_len,
>> true);
>> + if (vma != new_vma)
>> + vm_raw_write_end(vma);
>> vma = new_vma;
>> old_len = new_len;
>> old_addr = new_addr;
>> @@ -381,7 +391,10 @@ static unsigned long move_vma(struct vm_area_struct *vma,
>> mremap_userfaultfd_prep(new_vma, uf);
>> arch_remap(mm, old_addr, old_addr + old_len,
>> new_addr, new_addr + new_len);
>> + if (vma != new_vma)
>> + vm_raw_write_end(vma);
>> }
>> + vm_raw_write_end(new_vma);
>>
>> /* Conceal VM_ACCOUNT so old reservation is not undone */
>> if (vm_flags & VM_ACCOUNT) {
>> --
>> 2.21.0
>>
>

2019-04-23 16:21:24

by Mark Rutland

[permalink] [raw]
Subject: Re: [PATCH v12 04/31] arm64/mm: define ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT

On Tue, Apr 23, 2019 at 05:36:31PM +0200, Laurent Dufour wrote:
> On 18/04/2019 at 23:51, Jerome Glisse wrote:
> > On Tue, Apr 16, 2019 at 03:41:56PM +0100, Mark Rutland wrote:
> > > On Tue, Apr 16, 2019 at 04:31:27PM +0200, Laurent Dufour wrote:
> > > > On 16/04/2019 at 16:27, Mark Rutland wrote:
> > > > > On Tue, Apr 16, 2019 at 03:44:55PM +0200, Laurent Dufour wrote:
> > > > > > From: Mahendran Ganesh <[email protected]>
> > > > > >
> > > > > > Set ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT for arm64. This
> > > > > > enables Speculative Page Fault handler.
> > > > > >
> > > > > > Signed-off-by: Ganesh Mahendran <[email protected]>
> > > > >
> > > > > This is missing your S-o-B.
> > > >
> > > > You're right, I missed that...
> > > >
> > > > > The first patch noted that the ARCH_SUPPORTS_* option was there because
> > > > > the arch code had to make an explicit call to try to handle the fault
> > > > > speculatively, but that isn't addeed until patch 30.
> > > > >
> > > > > Why is this separate from that code?
> > > >
> > > > Andrew was recommended this a long time ago for bisection purpose. This
> > > > allows to build the code with CONFIG_SPECULATIVE_PAGE_FAULT before the code
> > > > that trigger the spf handler is added to the per architecture's code.
> > >
> > > Ok. I think it would be worth noting that in the commit message, to
> > > avoid anyone else asking the same question. :)
> >
> > Should have read this thread before looking at x86 and ppc :)
> >
> > In any case the patch is:
> >
> > Reviewed-by: Jérôme Glisse <[email protected]>
>
> Thanks Mark and Jérôme for reviewing this.
>
> Regarding the change in the commit message, I'm wondering if it would be
> better to place it in the series' cover letter.
>
> But I'm fine with putting it in each architecture's commit.

I think noting it in both the cover letter and specific patches is best.

Having something in the commit message means that the intent will be
clear when the patch is viewed in isolation (e.g. as they will be once
merged).

All that's necessary is something like:

Note that this patch only enables building the common speculative page
fault code such that this can be bisected, and has no functional
impact. The architecture-specific code to make use of this and enable
the feature will be added in a subsequent patch.

Thanks,
Mark.

2019-04-23 18:15:08

by Davidlohr Bueso

[permalink] [raw]
Subject: Re: [PATCH v12 21/31] mm: Introduce find_vma_rcu()

On Tue, 23 Apr 2019, Peter Zijlstra wrote:

>Also; the initial motivation was prefaulting large VMAs and the
>contention on mmap was killing things; but similarly, the contention on
>the refcount (I did try that) killed things just the same.

Right, this is just like what can happen with per-vma locking.

Thanks,
Davidlohr

2019-04-24 07:39:39

by Laurent Dufour

[permalink] [raw]
Subject: Re: [PATCH v12 00/31] Speculative page faults

On 23/04/2019 at 11:38, Peter Zijlstra wrote:
> On Mon, Apr 22, 2019 at 02:29:16PM -0700, Michel Lespinasse wrote:
>> The proposed spf mechanism only handles anon vmas. Is there a
>> fundamental reason why it couldn't handle mapped files too ?
>> My understanding is that the mechanism of verifying the vma after
>> taking back the ptl at the end of the fault would work there too ?
>> The file has to stay referenced during the fault, but holding the vma's
>> refcount could be made to cover that ? the vm_file refcount would have
>> to be released in __free_vma() instead of remove_vma; I'm not quite sure
>> if that has more implications than I realize ?
>
> IIRC (and I really don't remember all that much) the trickiest bit was
> vs unmount. Since files can stay open past the 'expected' duration,
> umount could be delayed.
>
> But yes, I think I had a version that did all that just 'fine'. Like
> mentioned, I didn't keep the refcount because it sucked just as hard as
> the mmap_sem contention, but the SRCU callback did the fput() just fine
> (esp. now that we have delayed_fput).

I had to use a refcount for the VMA because I'm using RCU in place of
SRCU and only protecting the RB tree using RCU.

Regarding the file pointer, I decided to release it synchronously to
avoid the latency of RCU during the file closing. As you mentioned, this
could delay the umount, but not only that, as Linus Torvalds demonstrated
in the past [1]. Anyway, since the file support is not there yet, there
is no need for that currently.

Regarding the file mapping support, the concern is to ensure that
vm_ops->fault() will not try to release the mmap_sem. This is true for
most file systems using the generic fault operations, but there is
currently no clever way to identify that except by checking the
vm_ops->fault pointer. Adding a flag to the vm_operations_struct
structure is another option (a rough sketch is below).
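A very rough sketch of the flag idea, just to illustrate (the spf_safe
field is purely hypothetical, nothing like it exists today):

	/*
	 * Hypothetical: a new field in vm_operations_struct indicating
	 * that ->fault() never releases the mmap_sem, so it would be
	 * safe to call from the speculative path.
	 */
	if (vma->vm_ops && vma->vm_ops->fault && !vma->vm_ops->spf_safe)
		goto out_put;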

That's doable as long as the underlying fault() function does not deal
with the mmap_sem. I made an attempt in the past, but was thinking that
the anonymous case should first be accepted before moving forward this way.

[1]
https://lore.kernel.org/linux-mm/[email protected]/

2019-04-24 10:15:33

by Laurent Dufour

[permalink] [raw]
Subject: Re: [PATCH v12 21/31] mm: Introduce find_vma_rcu()

On 23/04/2019 at 11:27, Peter Zijlstra wrote:
> On Tue, Apr 16, 2019 at 03:45:12PM +0200, Laurent Dufour wrote:
>> This allows to search for a VMA structure without holding the mmap_sem.
>>
>> The search is repeated while the mm seqlock is changing and until we found
>> a valid VMA.
>>
>> While under the RCU protection, a reference is taken on the VMA, so the
>> caller must call put_vma() once it not more need the VMA structure.
>>
>> At the time a VMA is inserted in the MM RB tree, in vma_rb_insert(), a
>> reference is taken to the VMA by calling get_vma().
>>
>> When removing a VMA from the MM RB tree, the VMA is not release immediately
>> but at the end of the RCU grace period through vm_rcu_put(). This ensures
>> that the VMA remains allocated until the end the RCU grace period.
>>
>> Since the vm_file pointer, if valid, is released in put_vma(), there is no
>> guarantee that the file pointer will be valid on the returned VMA.
>
> What I'm missing here, and in the previous patch introducing the
> refcount (also see refcount_t), is _why_ we need the refcount thing at
> all.

The need for the VMA's refcount is to ensure that the VMA will remain
valid until the end of the SPF handler. This is a consequence of using
RCU instead of SRCU to protect the RB tree.

I was not aware of the refcount_t type; it would indeed be better here to
avoid wrapping.
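For what it's worth, a minimal sketch of the get/put helpers using
refcount_t (assuming the VMA reference counter field is named
vm_ref_count; adjust to the actual field name):

#include <linux/refcount.h>

static inline void get_vma(struct vm_area_struct *vma)
{
	refcount_inc(&vma->vm_ref_count);
}

static inline void put_vma(struct vm_area_struct *vma)
{
	if (refcount_dec_and_test(&vma->vm_ref_count))
		__free_vma(vma);
}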

> My original plan was to use SRCU, which at the time was not complete
> enough so I abused/hacked preemptible RCU, but that is no longer the
> case, SRCU has all the required bits and pieces.

When I tested using SRCU, it involved a lot of scheduling to run the SRCU
callback mechanism. In some workloads the impact on performance was
significant [1].

I don't see this overhead when using RCU.

>
> Also; the initial motivation was prefaulting large VMAs and the
> contention on mmap was killing things; but similarly, the contention on
> the refcount (I did try that) killed things just the same.

Doing prefaulting should be doable, I'll try to think further about that.

Regarding the refcount, unless I missed something, this is an atomic
counter, so there should not be contention on it, only cache line
exclusivity; not ideal, I agree, but I can't see what else to use here.

> So I'm really sad to see the refcount return; and without any apparent
> justification.

I'm not opposed to using another mechanism here, but SRCU didn't show good
performance with some workloads, and I can't see how to use RCU without a
reference counter here. So please advise.

Thanks,
Laurent.

[1]
https://lore.kernel.org/linux-mm/[email protected]/

2019-04-24 13:38:44

by Laurent Dufour

[permalink] [raw]
Subject: Re: [PATCH v12 18/31] mm: protect against PTE changes done by dup_mmap()

On 22/04/2019 at 22:32, Jerome Glisse wrote:
> On Tue, Apr 16, 2019 at 03:45:09PM +0200, Laurent Dufour wrote:
>> Vinayak Menon and Ganesh Mahendran reported that the following scenario may
>> lead to thread being blocked due to data corruption:
>>
>> CPU 1 CPU 2 CPU 3
>> Process 1, Process 1, Process 1,
>> Thread A Thread B Thread C
>>
>> while (1) { while (1) { while(1) {
>> pthread_mutex_lock(l) pthread_mutex_lock(l) fork
>> pthread_mutex_unlock(l) pthread_mutex_unlock(l) }
>> } }
>>
>> In the details this happens because :
>>
>> CPU 1 CPU 2 CPU 3
>> fork()
>> copy_pte_range()
>> set PTE rdonly
>> got to next VMA...
>> . PTE is seen rdonly PTE still writable
>> . thread is writing to page
>> . -> page fault
>> . copy the page Thread writes to page
>> . . -> no page fault
>> . update the PTE
>> . flush TLB for that PTE
>> flush TLB PTE are now rdonly
>
> Should the fork be on CPU 3 to be consistent with the diagram at the top
> (just to make it easier to read and go from one to the other, as threads
> can move from one CPU to another)?

Sure, this is quite confusing this way ;)

>>
>> So the write done by the CPU 3 is interfering with the page copy operation
>> done by CPU 2, leading to the data corruption.
>>
>> To avoid this we mark all the VMA involved in the COW mechanism as changing
>> by calling vm_write_begin(). This ensures that the speculative page fault
>> handler will not try to handle a fault on these pages.
>> The marker is set until the TLB is flushed, ensuring that all the CPUs will
>> now see the PTE as not writable.
>> Once the TLB is flush, the marker is removed by calling vm_write_end().
>>
>> The variable last is used to keep tracked of the latest VMA marked to
>> handle the error path where part of the VMA may have been marked.
>>
>> Since multiple VMA from the same mm may have the sequence count increased
>> during this process, the use of the vm_raw_write_begin/end() is required to
>> avoid lockdep false warning messages.
>>
>> Reported-by: Ganesh Mahendran <[email protected]>
>> Reported-by: Vinayak Menon <[email protected]>
>> Signed-off-by: Laurent Dufour <[email protected]>
>
> A minor comment (see below)
>
> Reviewed-by: Jérome Glisse <[email protected]>

Thanks for the review Jérôme.

>> ---
>> kernel/fork.c | 30 ++++++++++++++++++++++++++++--
>> 1 file changed, 28 insertions(+), 2 deletions(-)
>>
>> diff --git a/kernel/fork.c b/kernel/fork.c
>> index f8dae021c2e5..2992d2c95256 100644
>> --- a/kernel/fork.c
>> +++ b/kernel/fork.c
>> @@ -462,7 +462,7 @@ EXPORT_SYMBOL(free_task);
>> static __latent_entropy int dup_mmap(struct mm_struct *mm,
>> struct mm_struct *oldmm)
>> {
>> - struct vm_area_struct *mpnt, *tmp, *prev, **pprev;
>> + struct vm_area_struct *mpnt, *tmp, *prev, **pprev, *last = NULL;
>> struct rb_node **rb_link, *rb_parent;
>> int retval;
>> unsigned long charge;
>> @@ -581,8 +581,18 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
>> rb_parent = &tmp->vm_rb;
>>
>> mm->map_count++;
>> - if (!(tmp->vm_flags & VM_WIPEONFORK))
>> + if (!(tmp->vm_flags & VM_WIPEONFORK)) {
>> + if (IS_ENABLED(CONFIG_SPECULATIVE_PAGE_FAULT)) {
>> + /*
>> + * Mark this VMA as changing to prevent the
>> + * speculative page fault hanlder to process
>> + * it until the TLB are flushed below.
>> + */
>> + last = mpnt;
>> + vm_raw_write_begin(mpnt);
>> + }
>> retval = copy_page_range(mm, oldmm, mpnt);
>> + }
>>
>> if (tmp->vm_ops && tmp->vm_ops->open)
>> tmp->vm_ops->open(tmp);
>> @@ -595,6 +605,22 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
>> out:
>> up_write(&mm->mmap_sem);
>> flush_tlb_mm(oldmm);
>> +
>> + if (IS_ENABLED(CONFIG_SPECULATIVE_PAGE_FAULT)) {
>
> You do not need to check for CONFIG_SPECULATIVE_PAGE_FAULT as last
> will always be NULL if it is not enabled but maybe the compiler will
> miss the optimization opportunity if you only have the for() loop
> below.

I didn't check the generated code; perhaps the compiler will optimize
that correctly.
That being said, I think the if block is better for code readability,
highlighting that this block is only needed in the SPF case.

>> + /*
>> + * Since the TLB has been flush, we can safely unmark the
>> + * copied VMAs and allows the speculative page fault handler to
>> + * process them again.
>> + * Walk back the VMA list from the last marked VMA.
>> + */
>> + for (; last; last = last->vm_prev) {
>> + if (last->vm_flags & VM_DONTCOPY)
>> + continue;
>> + if (!(last->vm_flags & VM_WIPEONFORK))
>> + vm_raw_write_end(last);
>> + }
>> + }
>> +
>> up_write(&oldmm->mmap_sem);
>> dup_userfaultfd_complete(&uf);
>> fail_uprobe_end:
>> --
>> 2.21.0
>>
>

2019-04-24 13:38:59

by Laurent Dufour

[permalink] [raw]
Subject: Re: [PATCH v12 04/31] arm64/mm: define ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT

On 23/04/2019 at 18:19, Mark Rutland wrote:
> On Tue, Apr 23, 2019 at 05:36:31PM +0200, Laurent Dufour wrote:
>> On 18/04/2019 at 23:51, Jerome Glisse wrote:
>>> On Tue, Apr 16, 2019 at 03:41:56PM +0100, Mark Rutland wrote:
>>>> On Tue, Apr 16, 2019 at 04:31:27PM +0200, Laurent Dufour wrote:
>>>>>> On 16/04/2019 at 16:27, Mark Rutland wrote:
>>>>>> On Tue, Apr 16, 2019 at 03:44:55PM +0200, Laurent Dufour wrote:
>>>>>>> From: Mahendran Ganesh <[email protected]>
>>>>>>>
>>>>>>> Set ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT for arm64. This
>>>>>>> enables Speculative Page Fault handler.
>>>>>>>
>>>>>>> Signed-off-by: Ganesh Mahendran <[email protected]>
>>>>>>
>>>>>> This is missing your S-o-B.
>>>>>
>>>>> You're right, I missed that...
>>>>>
>>>>>> The first patch noted that the ARCH_SUPPORTS_* option was there because
>>>>>> the arch code had to make an explicit call to try to handle the fault
>>>>>> speculatively, but that isn't added until patch 30.
>>>>>>
>>>>>> Why is this separate from that code?
>>>>>
>>>>> Andrew recommended this a long time ago for bisection purposes. This
>>>>> allows building the code with CONFIG_SPECULATIVE_PAGE_FAULT before the code
>>>>> that triggers the SPF handler is added to each architecture's code.
>>>>
>>>> Ok. I think it would be worth noting that in the commit message, to
>>>> avoid anyone else asking the same question. :)
>>>
>>> Should have read this thread before looking at x86 and ppc :)
>>>
>>> In any case the patch is:
>>>
>>> Reviewed-by: Jérôme Glisse <[email protected]>
>>
>> Thanks Mark and Jérôme for reviewing this.
>>
>> Regarding the change in the commit message, I'm wondering whether it would
>> be better to place it in the series' cover letter.
>>
>> But I'm fine with putting it in each architecture's commit.
>
> I think noting it in both the cover letter and specific patches is best.
>
> Having something in the commit message means that the intent will be
> clear when the patch is viewed in isolation (e.g. as they will be once
> merged).
>
> All that's necessary is something like:
>
> Note that this patch only enables building the common speculative page
> fault code such that this can be bisected, and has no functional
> impact. The architecture-specific code to make use of this and enable
> the feature will be added in a subsequent patch.

Thanks Mark, will do it this way.


> Thanks,
> Mark.

2019-04-24 14:29:24

by Laurent Dufour

[permalink] [raw]
Subject: Re: [PATCH v12 20/31] mm: introduce vma reference counter

On 22/04/2019 at 22:36, Jerome Glisse wrote:
> On Tue, Apr 16, 2019 at 03:45:11PM +0200, Laurent Dufour wrote:
>> The final goal is to be able to use a VMA structure without holding the
>> mmap_sem and to be sure that the structure will not be freed in our back.
>>
>> The lockless use of the VMA will be done through RCU protection and thus a
>> dedicated freeing service is required to manage it asynchronously.
>>
>> As reported in a 2010's thread [1], this may impact file handling when a
>> file is still referenced while the mapping is no more there. As the final
>> goal is to handle anonymous VMA in a speculative way and not file backed
>> mapping, we could close and free the file pointer in a synchronous way, as
>> soon as we are guaranteed to not use it without holding the mmap_sem. For
>> sanity reason, in a minimal effort, the vm_file file pointer is unset once
>> the file pointer is put.
>>
>> [1] https://lore.kernel.org/linux-mm/[email protected]/
>>
>> Signed-off-by: Laurent Dufour <[email protected]>
>
> Using kref would have been better from my POV even with RCU freeing
> but anyway:
>
> Reviewed-by: Jérôme Glisse <[email protected]>

Thanks Jérôme,

I think kref is a good option here, I'll give it a try.
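
As a strawman, here is a minimal sketch of what a kref variant of the
mm/internal.h helpers above could look like, assuming a 'vm_refcnt' kref
field replaces the vm_ref_count atomic (the field and helper names are
placeholders, not part of this series):

#include <linux/kref.h>

/*
 * Assumes struct vm_area_struct gains 'struct kref vm_refcnt' under
 * CONFIG_SPECULATIVE_PAGE_FAULT, initialized with kref_init() in
 * vma_init() and INIT_VMA().
 */
static void vma_kref_release(struct kref *kref)
{
	struct vm_area_struct *vma =
		container_of(kref, struct vm_area_struct, vm_refcnt);

	__free_vma(vma);
}

static inline void get_vma(struct vm_area_struct *vma)
{
	kref_get(&vma->vm_refcnt);
}

static inline void put_vma(struct vm_area_struct *vma)
{
	/* Calls vma_kref_release() once the last reference is dropped. */
	kref_put(&vma->vm_refcnt, vma_kref_release);
}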


>> ---
>> include/linux/mm.h | 4 ++++
>> include/linux/mm_types.h | 3 +++
>> mm/internal.h | 27 +++++++++++++++++++++++++++
>> mm/mmap.c | 13 +++++++++----
>> 4 files changed, 43 insertions(+), 4 deletions(-)
>>
>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>> index f14b2c9ddfd4..f761a9c65c74 100644
>> --- a/include/linux/mm.h
>> +++ b/include/linux/mm.h
>> @@ -529,6 +529,9 @@ static inline void vma_init(struct vm_area_struct *vma, struct mm_struct *mm)
>> vma->vm_mm = mm;
>> vma->vm_ops = &dummy_vm_ops;
>> INIT_LIST_HEAD(&vma->anon_vma_chain);
>> +#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
>> + atomic_set(&vma->vm_ref_count, 1);
>> +#endif
>> }
>>
>> static inline void vma_set_anonymous(struct vm_area_struct *vma)
>> @@ -1418,6 +1421,7 @@ static inline void INIT_VMA(struct vm_area_struct *vma)
>> INIT_LIST_HEAD(&vma->anon_vma_chain);
>> #ifdef CONFIG_SPECULATIVE_PAGE_FAULT
>> seqcount_init(&vma->vm_sequence);
>> + atomic_set(&vma->vm_ref_count, 1);
>> #endif
>> }
>>
>> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
>> index 24b3f8ce9e42..6a6159e11a3f 100644
>> --- a/include/linux/mm_types.h
>> +++ b/include/linux/mm_types.h
>> @@ -285,6 +285,9 @@ struct vm_area_struct {
>> /* linked list of VM areas per task, sorted by address */
>> struct vm_area_struct *vm_next, *vm_prev;
>>
>> +#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
>> + atomic_t vm_ref_count;
>> +#endif
>> struct rb_node vm_rb;
>>
>> /*
>> diff --git a/mm/internal.h b/mm/internal.h
>> index 9eeaf2b95166..302382bed406 100644
>> --- a/mm/internal.h
>> +++ b/mm/internal.h
>> @@ -40,6 +40,33 @@ void page_writeback_init(void);
>>
>> vm_fault_t do_swap_page(struct vm_fault *vmf);
>>
>> +
>> +extern void __free_vma(struct vm_area_struct *vma);
>> +
>> +#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
>> +static inline void get_vma(struct vm_area_struct *vma)
>> +{
>> + atomic_inc(&vma->vm_ref_count);
>> +}
>> +
>> +static inline void put_vma(struct vm_area_struct *vma)
>> +{
>> + if (atomic_dec_and_test(&vma->vm_ref_count))
>> + __free_vma(vma);
>> +}
>> +
>> +#else
>> +
>> +static inline void get_vma(struct vm_area_struct *vma)
>> +{
>> +}
>> +
>> +static inline void put_vma(struct vm_area_struct *vma)
>> +{
>> + __free_vma(vma);
>> +}
>> +#endif /* CONFIG_SPECULATIVE_PAGE_FAULT */
>> +
>> void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *start_vma,
>> unsigned long floor, unsigned long ceiling);
>>
>> diff --git a/mm/mmap.c b/mm/mmap.c
>> index f7f6027a7dff..c106440dcae7 100644
>> --- a/mm/mmap.c
>> +++ b/mm/mmap.c
>> @@ -188,6 +188,12 @@ static inline void mm_write_sequnlock(struct mm_struct *mm)
>> }
>> #endif /* CONFIG_SPECULATIVE_PAGE_FAULT */
>>
>> +void __free_vma(struct vm_area_struct *vma)
>> +{
>> + mpol_put(vma_policy(vma));
>> + vm_area_free(vma);
>> +}
>> +
>> /*
>> * Close a vm structure and free it, returning the next.
>> */
>> @@ -200,8 +206,8 @@ static struct vm_area_struct *remove_vma(struct vm_area_struct *vma)
>> vma->vm_ops->close(vma);
>> if (vma->vm_file)
>> fput(vma->vm_file);
>> - mpol_put(vma_policy(vma));
>> - vm_area_free(vma);
>> + vma->vm_file = NULL;
>> + put_vma(vma);
>> return next;
>> }
>>
>> @@ -990,8 +996,7 @@ int __vma_adjust(struct vm_area_struct *vma, unsigned long start,
>> if (next->anon_vma)
>> anon_vma_merge(vma, next);
>> mm->map_count--;
>> - mpol_put(vma_policy(next));
>> - vm_area_free(next);
>> + put_vma(next);
>> /*
>> * In mprotect's case 6 (see comments on vma_merge),
>> * we must remove another next too. It would clutter
>> --
>> 2.21.0
>>
>

2019-04-24 14:41:29

by Laurent Dufour

[permalink] [raw]
Subject: Re: [PATCH v12 21/31] mm: Introduce find_vma_rcu()

On 22/04/2019 at 22:57, Jerome Glisse wrote:
> On Tue, Apr 16, 2019 at 03:45:12PM +0200, Laurent Dufour wrote:
>> This allows to search for a VMA structure without holding the mmap_sem.
>>
>> The search is repeated while the mm seqlock is changing and until we found
>> a valid VMA.
>>
>> While under the RCU protection, a reference is taken on the VMA, so the
>> caller must call put_vma() once it not more need the VMA structure.
>>
>> At the time a VMA is inserted in the MM RB tree, in vma_rb_insert(), a
>> reference is taken to the VMA by calling get_vma().
>>
>> When removing a VMA from the MM RB tree, the VMA is not release immediately
>> but at the end of the RCU grace period through vm_rcu_put(). This ensures
>> that the VMA remains allocated until the end the RCU grace period.
>>
>> Since the vm_file pointer, if valid, is released in put_vma(), there is no
>> guarantee that the file pointer will be valid on the returned VMA.
>>
>> Signed-off-by: Laurent Dufour <[email protected]>
>
> Minor comments about comment (i love recursion :)) see below.
>
> Reviewed-by: Jérôme Glisse <[email protected]>

Thanks Jérôme, see my comments to your comments on my comments below ;)

>> ---
>> include/linux/mm_types.h | 1 +
>> mm/internal.h | 5 ++-
>> mm/mmap.c | 76 ++++++++++++++++++++++++++++++++++++++--
>> 3 files changed, 78 insertions(+), 4 deletions(-)
>>
>> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
>> index 6a6159e11a3f..9af6694cb95d 100644
>> --- a/include/linux/mm_types.h
>> +++ b/include/linux/mm_types.h
>> @@ -287,6 +287,7 @@ struct vm_area_struct {
>>
>> #ifdef CONFIG_SPECULATIVE_PAGE_FAULT
>> atomic_t vm_ref_count;
>> + struct rcu_head vm_rcu;
>> #endif
>> struct rb_node vm_rb;
>>
>> diff --git a/mm/internal.h b/mm/internal.h
>> index 302382bed406..1e368e4afe3c 100644
>> --- a/mm/internal.h
>> +++ b/mm/internal.h
>> @@ -55,7 +55,10 @@ static inline void put_vma(struct vm_area_struct *vma)
>> __free_vma(vma);
>> }
>>
>> -#else
>> +extern struct vm_area_struct *find_vma_rcu(struct mm_struct *mm,
>> + unsigned long addr);
>> +
>> +#else /* CONFIG_SPECULATIVE_PAGE_FAULT */
>>
>> static inline void get_vma(struct vm_area_struct *vma)
>> {
>> diff --git a/mm/mmap.c b/mm/mmap.c
>> index c106440dcae7..34bf261dc2c8 100644
>> --- a/mm/mmap.c
>> +++ b/mm/mmap.c
>> @@ -179,6 +179,18 @@ static inline void mm_write_sequnlock(struct mm_struct *mm)
>> {
>> write_sequnlock(&mm->mm_seq);
>> }
>> +
>> +static void __vm_rcu_put(struct rcu_head *head)
>> +{
>> + struct vm_area_struct *vma = container_of(head, struct vm_area_struct,
>> + vm_rcu);
>> + put_vma(vma);
>> +}
>> +static void vm_rcu_put(struct vm_area_struct *vma)
>> +{
>> + VM_BUG_ON_VMA(!RB_EMPTY_NODE(&vma->vm_rb), vma);
>> + call_rcu(&vma->vm_rcu, __vm_rcu_put);
>> +}
>> #else
>> static inline void mm_write_seqlock(struct mm_struct *mm)
>> {
>> @@ -190,6 +202,8 @@ static inline void mm_write_sequnlock(struct mm_struct *mm)
>>
>> void __free_vma(struct vm_area_struct *vma)
>> {
>> + if (IS_ENABLED(CONFIG_SPECULATIVE_PAGE_FAULT))
>> + VM_BUG_ON_VMA(!RB_EMPTY_NODE(&vma->vm_rb), vma);
>> mpol_put(vma_policy(vma));
>> vm_area_free(vma);
>> }
>> @@ -197,11 +211,24 @@ void __free_vma(struct vm_area_struct *vma)
>> /*
>> * Close a vm structure and free it, returning the next.
>> */
>> -static struct vm_area_struct *remove_vma(struct vm_area_struct *vma)
>> +static struct vm_area_struct *__remove_vma(struct vm_area_struct *vma)
>> {
>> struct vm_area_struct *next = vma->vm_next;
>>
>> might_sleep();
>> + if (IS_ENABLED(CONFIG_SPECULATIVE_PAGE_FAULT) &&
>> + !RB_EMPTY_NODE(&vma->vm_rb)) {
>> + /*
>> + * If the VMA is still linked in the RB tree, we must release
>> + * that reference by calling put_vma().
>> + * This should only happen when called from exit_mmap().
>> + * We forcely clear the node to satisfy the chec in
> ^
> Typo: chec -> check

Yep

>
>> + * __free_vma(). This is safe since the RB tree is not walked
>> + * anymore.
>> + */
>> + RB_CLEAR_NODE(&vma->vm_rb);
>> + put_vma(vma);
>> + }
>> if (vma->vm_ops && vma->vm_ops->close)
>> vma->vm_ops->close(vma);
>> if (vma->vm_file)
>> @@ -211,6 +238,13 @@ static struct vm_area_struct *remove_vma(struct vm_area_struct *vma)
>> return next;
>> }
>>
>> +static struct vm_area_struct *remove_vma(struct vm_area_struct *vma)
>> +{
>> + if (IS_ENABLED(CONFIG_SPECULATIVE_PAGE_FAULT))
>> + VM_BUG_ON_VMA(!RB_EMPTY_NODE(&vma->vm_rb), vma);
>
> Adding a comment here explaining the BUG_ON so people can understand
> what is wrong if that happens. For instance:
>
> /*
> * remove_vma() should be called only once a vma has been removed from the
> * rbtree, at which point vma->vm_rb is an empty node. The exception is when
> * vmas are destroyed through exit_mmap(), in which case we do not bother
> * updating the rbtree (see comment in __remove_vma()).
> */

I agree !


>> + return __remove_vma(vma);
>> +}
>> +
>> static int do_brk_flags(unsigned long addr, unsigned long request, unsigned long flags,
>> struct list_head *uf);
>> SYSCALL_DEFINE1(brk, unsigned long, brk)
>> @@ -475,7 +509,7 @@ static inline void vma_rb_insert(struct vm_area_struct *vma,
>>
>> /* All rb_subtree_gap values must be consistent prior to insertion */
>> validate_mm_rb(root, NULL);
>> -
>> + get_vma(vma);
>> rb_insert_augmented(&vma->vm_rb, root, &vma_gap_callbacks);
>> }
>>
>> @@ -491,6 +525,14 @@ static void __vma_rb_erase(struct vm_area_struct *vma, struct mm_struct *mm)
>> mm_write_seqlock(mm);
>> rb_erase_augmented(&vma->vm_rb, root, &vma_gap_callbacks);
>> mm_write_sequnlock(mm); /* wmb */
>> +#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
>> + /*
>> + * Ensure the removal is complete before clearing the node.
>> + * Matched by vma_has_changed()/handle_speculative_fault().
>> + */
>> + RB_CLEAR_NODE(&vma->vm_rb);
>> + vm_rcu_put(vma);
>> +#endif
>> }
>>
>> static __always_inline void vma_rb_erase_ignore(struct vm_area_struct *vma,
>> @@ -2331,6 +2373,34 @@ struct vm_area_struct *find_vma(struct mm_struct *mm, unsigned long addr)
>>
>> EXPORT_SYMBOL(find_vma);
>>
>> +#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
>> +/*
>> + * Like find_vma() but under the protection of RCU and the mm sequence counter.
>> + * The vma returned has to be relaesed by the caller through the call to
>> + * put_vma()
>> + */
>> +struct vm_area_struct *find_vma_rcu(struct mm_struct *mm, unsigned long addr)
>> +{
>> + struct vm_area_struct *vma = NULL;
>> + unsigned int seq;
>> +
>> + do {
>> + if (vma)
>> + put_vma(vma);
>> +
>> + seq = read_seqbegin(&mm->mm_seq);
>> +
>> + rcu_read_lock();
>> + vma = find_vma(mm, addr);
>> + if (vma)
>> + get_vma(vma);
>> + rcu_read_unlock();
>> + } while (read_seqretry(&mm->mm_seq, seq));
>> +
>> + return vma;
>> +}
>> +#endif
>> +
>> /*
>> * Same as find_vma, but also return a pointer to the previous VMA in *pprev.
>> */
>> @@ -3231,7 +3301,7 @@ void exit_mmap(struct mm_struct *mm)
>> while (vma) {
>> if (vma->vm_flags & VM_ACCOUNT)
>> nr_accounted += vma_pages(vma);
>> - vma = remove_vma(vma);
>> + vma = __remove_vma(vma);
>> }
>> vm_unacct_memory(nr_accounted);
>> }
>> --
>> 2.21.0
>>
>
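
To make the ownership rule from the commit message explicit, here is a
minimal caller-side sketch; only find_vma_rcu() and put_vma() come from
the patch, the surrounding fault-path context is elided:

	struct vm_area_struct *vma;

	/*
	 * Lockless lookup: find_vma_rcu() returns the VMA with an extra
	 * reference held, or NULL if no VMA covers the address.
	 */
	vma = find_vma_rcu(mm, address);
	if (!vma)
		return VM_FAULT_RETRY;

	/* ... speculative checks against vma->vm_sequence go here ... */

	/* Drop the reference taken by find_vma_rcu(). */
	put_vma(vma);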

2019-04-24 14:58:35

by Laurent Dufour

[permalink] [raw]
Subject: Re: [PATCH v12 22/31] mm: provide speculative fault infrastructure

On 22/04/2019 at 23:26, Jerome Glisse wrote:
> On Tue, Apr 16, 2019 at 03:45:13PM +0200, Laurent Dufour wrote:
>> From: Peter Zijlstra <[email protected]>
>>
>> Provide infrastructure to do a speculative fault (not holding
>> mmap_sem).
>>
>> The not holding of mmap_sem means we can race against VMA
>> change/removal and page-table destruction. We use the SRCU VMA freeing
>> to keep the VMA around. We use the VMA seqcount to detect change
>> (including umapping / page-table deletion) and we use gup_fast() style
>> page-table walking to deal with page-table races.
>>
>> Once we've obtained the page and are ready to update the PTE, we
>> validate if the state we started the fault with is still valid, if
>> not, we'll fail the fault with VM_FAULT_RETRY, otherwise we update the
>> PTE and we're done.
>>
>> Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
>>
>> [Manage the newly introduced pte_spinlock() for speculative page
>> fault to fail if the VMA is touched in our back]
>> [Rename vma_is_dead() to vma_has_changed() and declare it here]
>> [Fetch p4d and pud]
>> [Set vmd.sequence in __handle_mm_fault()]
>> [Abort speculative path when handle_userfault() has to be called]
>> [Add additional VMA's flags checks in handle_speculative_fault()]
>> [Clear FAULT_FLAG_ALLOW_RETRY in handle_speculative_fault()]
>> [Don't set vmf->pte and vmf->ptl if pte_map_lock() failed]
>> [Remove warning comment about waiting for !seq&1 since we don't want
>> to wait]
>> [Remove warning about no huge page support, mention it explictly]
>> [Don't call do_fault() in the speculative path as __do_fault() calls
>> vma->vm_ops->fault() which may want to release mmap_sem]
>> [Only vm_fault pointer argument for vma_has_changed()]
>> [Fix check against huge page, calling pmd_trans_huge()]
>> [Use READ_ONCE() when reading VMA's fields in the speculative path]
>> [Explicitly check for __HAVE_ARCH_PTE_SPECIAL as we can't support for
>> processing done in vm_normal_page()]
>> [Check that vma->anon_vma is already set when starting the speculative
>> path]
>> [Check for memory policy as we can't support MPOL_INTERLEAVE case due to
>> the processing done in mpol_misplaced()]
>> [Don't support VMA growing up or down]
>> [Move check on vm_sequence just before calling handle_pte_fault()]
>> [Don't build SPF services if !CONFIG_SPECULATIVE_PAGE_FAULT]
>> [Add mem cgroup oom check]
>> [Use READ_ONCE to access p*d entries]
>> [Replace deprecated ACCESS_ONCE() by READ_ONCE() in vma_has_changed()]
>> [Don't fetch pte again in handle_pte_fault() when running the speculative
>> path]
>> [Check PMD against concurrent collapsing operation]
>> [Try spin lock the pte during the speculative path to avoid deadlock with
>> other CPU's invalidating the TLB and requiring this CPU to catch the
>> inter processor's interrupt]
>> [Move define of FAULT_FLAG_SPECULATIVE here]
>> [Introduce __handle_speculative_fault() and add a check against
>> mm->mm_users in handle_speculative_fault() defined in mm.h]
>> [Abort if vm_ops->fault is set instead of checking only vm_ops]
>> [Use find_vma_rcu() and call put_vma() when we are done with the VMA]
>> Signed-off-by: Laurent Dufour <[email protected]>
>
>
> Few comments and questions for this one see below.
>
>
>> ---
>> include/linux/hugetlb_inline.h | 2 +-
>> include/linux/mm.h | 30 +++
>> include/linux/pagemap.h | 4 +-
>> mm/internal.h | 15 ++
>> mm/memory.c | 344 ++++++++++++++++++++++++++++++++-
>> 5 files changed, 389 insertions(+), 6 deletions(-)
>>
>> diff --git a/include/linux/hugetlb_inline.h b/include/linux/hugetlb_inline.h
>> index 0660a03d37d9..9e25283d6fc9 100644
>> --- a/include/linux/hugetlb_inline.h
>> +++ b/include/linux/hugetlb_inline.h
>> @@ -8,7 +8,7 @@
>>
>> static inline bool is_vm_hugetlb_page(struct vm_area_struct *vma)
>> {
>> - return !!(vma->vm_flags & VM_HUGETLB);
>> + return !!(READ_ONCE(vma->vm_flags) & VM_HUGETLB);
>> }
>>
>> #else
>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>> index f761a9c65c74..ec609cbad25a 100644
>> --- a/include/linux/mm.h
>> +++ b/include/linux/mm.h
>> @@ -381,6 +381,7 @@ extern pgprot_t protection_map[16];
>> #define FAULT_FLAG_USER 0x40 /* The fault originated in userspace */
>> #define FAULT_FLAG_REMOTE 0x80 /* faulting for non current tsk/mm */
>> #define FAULT_FLAG_INSTRUCTION 0x100 /* The fault was during an instruction fetch */
>> +#define FAULT_FLAG_SPECULATIVE 0x200 /* Speculative fault, not holding mmap_sem */
>>
>> #define FAULT_FLAG_TRACE \
>> { FAULT_FLAG_WRITE, "WRITE" }, \
>> @@ -409,6 +410,10 @@ struct vm_fault {
>> gfp_t gfp_mask; /* gfp mask to be used for allocations */
>> pgoff_t pgoff; /* Logical page offset based on vma */
>> unsigned long address; /* Faulting virtual address */
>> +#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
>> + unsigned int sequence;
>> + pmd_t orig_pmd; /* value of PMD at the time of fault */
>> +#endif
>> pmd_t *pmd; /* Pointer to pmd entry matching
>> * the 'address' */
>> pud_t *pud; /* Pointer to pud entry matching
>> @@ -1524,6 +1529,31 @@ int invalidate_inode_page(struct page *page);
>> #ifdef CONFIG_MMU
>> extern vm_fault_t handle_mm_fault(struct vm_area_struct *vma,
>> unsigned long address, unsigned int flags);
>> +
>> +#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
>> +extern vm_fault_t __handle_speculative_fault(struct mm_struct *mm,
>> + unsigned long address,
>> + unsigned int flags);
>> +static inline vm_fault_t handle_speculative_fault(struct mm_struct *mm,
>> + unsigned long address,
>> + unsigned int flags)
>> +{
>> + /*
>> + * Try speculative page fault for multithreaded user space task only.
>> + */
>> + if (!(flags & FAULT_FLAG_USER) || atomic_read(&mm->mm_users) == 1)
>> + return VM_FAULT_RETRY;
>> + return __handle_speculative_fault(mm, address, flags);
>> +}
>> +#else
>> +static inline vm_fault_t handle_speculative_fault(struct mm_struct *mm,
>> + unsigned long address,
>> + unsigned int flags)
>> +{
>> + return VM_FAULT_RETRY;
>> +}
>> +#endif /* CONFIG_SPECULATIVE_PAGE_FAULT */
>> +
>> extern int fixup_user_fault(struct task_struct *tsk, struct mm_struct *mm,
>> unsigned long address, unsigned int fault_flags,
>> bool *unlocked);
>> diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
>> index 2e8438a1216a..2fcfaa910007 100644
>> --- a/include/linux/pagemap.h
>> +++ b/include/linux/pagemap.h
>> @@ -457,8 +457,8 @@ static inline pgoff_t linear_page_index(struct vm_area_struct *vma,
>> pgoff_t pgoff;
>> if (unlikely(is_vm_hugetlb_page(vma)))
>> return linear_hugepage_index(vma, address);
>> - pgoff = (address - vma->vm_start) >> PAGE_SHIFT;
>> - pgoff += vma->vm_pgoff;
>> + pgoff = (address - READ_ONCE(vma->vm_start)) >> PAGE_SHIFT;
>> + pgoff += READ_ONCE(vma->vm_pgoff);
>> return pgoff;
>> }
>>
>> diff --git a/mm/internal.h b/mm/internal.h
>> index 1e368e4afe3c..ed91b199cb8c 100644
>> --- a/mm/internal.h
>> +++ b/mm/internal.h
>> @@ -58,6 +58,21 @@ static inline void put_vma(struct vm_area_struct *vma)
>> extern struct vm_area_struct *find_vma_rcu(struct mm_struct *mm,
>> unsigned long addr);
>>
>> +
>> +static inline bool vma_has_changed(struct vm_fault *vmf)
>> +{
>> + int ret = RB_EMPTY_NODE(&vmf->vma->vm_rb);
>> + unsigned int seq = READ_ONCE(vmf->vma->vm_sequence.sequence);
>> +
>> + /*
>> + * Matches both the wmb in write_seqlock_{begin,end}() and
>> + * the wmb in vma_rb_erase().
>> + */
>> + smp_rmb();
>> +
>> + return ret || seq != vmf->sequence;
>> +}
>> +
>> #else /* CONFIG_SPECULATIVE_PAGE_FAULT */
>>
>> static inline void get_vma(struct vm_area_struct *vma)
>> diff --git a/mm/memory.c b/mm/memory.c
>> index 46f877b6abea..6e6bf61c0e5c 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -522,7 +522,8 @@ static void print_bad_pte(struct vm_area_struct *vma, unsigned long addr,
>> if (page)
>> dump_page(page, "bad pte");
>> pr_alert("addr:%p vm_flags:%08lx anon_vma:%p mapping:%p index:%lx\n",
>> - (void *)addr, vma->vm_flags, vma->anon_vma, mapping, index);
>> + (void *)addr, READ_ONCE(vma->vm_flags), vma->anon_vma,
>> + mapping, index);
>> pr_alert("file:%pD fault:%pf mmap:%pf readpage:%pf\n",
>> vma->vm_file,
>> vma->vm_ops ? vma->vm_ops->fault : NULL,
>> @@ -2082,6 +2083,118 @@ int apply_to_page_range(struct mm_struct *mm, unsigned long addr,
>> }
>> EXPORT_SYMBOL_GPL(apply_to_page_range);
>>
>> +#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
>> +static bool pte_spinlock(struct vm_fault *vmf)
>> +{
>> + bool ret = false;
>> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
>> + pmd_t pmdval;
>> +#endif
>> +
>> + /* Check if vma is still valid */
>> + if (!(vmf->flags & FAULT_FLAG_SPECULATIVE)) {
>> + vmf->ptl = pte_lockptr(vmf->vma->vm_mm, vmf->pmd);
>> + spin_lock(vmf->ptl);
>> + return true;
>> + }
>> +
>> +again:
>> + local_irq_disable();
>> + if (vma_has_changed(vmf))
>> + goto out;
>> +
>> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
>> + /*
>> + * We check if the pmd value is still the same to ensure that there
>> + * is not a huge collapse operation in progress in our back.
>> + */
>> + pmdval = READ_ONCE(*vmf->pmd);
>> + if (!pmd_same(pmdval, vmf->orig_pmd))
>> + goto out;
>> +#endif
>> +
>> + vmf->ptl = pte_lockptr(vmf->vma->vm_mm, vmf->pmd);
>> + if (unlikely(!spin_trylock(vmf->ptl))) {
>> + local_irq_enable();
>> + goto again;
>> + }
>
> Do we want to constantly retry taking the spinlock ? Shouldn't it
> be limited ? If we fail few times it is probably better to give
> up on that speculative page fault.
>
> So maybe putting everything within a for(i; i < MAX_TRY; ++i) loop
> would be cleaner.

I did try that in the past when I added this loop, but I never reached
the limit I set. By the way, what should the MAX_TRY value be? ;)

The loop was introduced to fix a race between CPUs; this is explained in
the patch description, but a comment is clearly missing here:

/*
* A spin_trylock() of the ptl is done to avoid a deadlock with another
* CPU invalidating the TLB and requiring this CPU to catch the IPI.
* As interrupts are disabled during this operation, we need to re-enable
* them and retry locking the PTL.
*/

I don't think that retrying the page fault would help, since the regular
page fault handler will also spin here if there is massive contention
on the PTL.

>
>
>> +
>> + if (vma_has_changed(vmf)) {
>> + spin_unlock(vmf->ptl);
>> + goto out;
>> + }
>> +
>> + ret = true;
>> +out:
>> + local_irq_enable();
>> + return ret;
>> +}
>> +
>> +static bool pte_map_lock(struct vm_fault *vmf)
>> +{
>> + bool ret = false;
>> + pte_t *pte;
>> + spinlock_t *ptl;
>> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
>> + pmd_t pmdval;
>> +#endif
>> +
>> + if (!(vmf->flags & FAULT_FLAG_SPECULATIVE)) {
>> + vmf->pte = pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd,
>> + vmf->address, &vmf->ptl);
>> + return true;
>> + }
>> +
>> + /*
>> + * The first vma_has_changed() guarantees the page-tables are still
>> + * valid, having IRQs disabled ensures they stay around, hence the
>> + * second vma_has_changed() to make sure they are still valid once
>> + * we've got the lock. After that a concurrent zap_pte_range() will
>> + * block on the PTL and thus we're safe.
>> + */
>> +again:
>> + local_irq_disable();
>> + if (vma_has_changed(vmf))
>> + goto out;
>> +
>> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
>> + /*
>> + * We check if the pmd value is still the same to ensure that there
>> + * is not a huge collapse operation in progress in our back.
>> + */
>> + pmdval = READ_ONCE(*vmf->pmd);
>> + if (!pmd_same(pmdval, vmf->orig_pmd))
>> + goto out;
>> +#endif
>> +
>> + /*
>> + * Same as pte_offset_map_lock() except that we call
>> + * spin_trylock() in place of spin_lock() to avoid race with
>> + * unmap path which may have the lock and wait for this CPU
>> + * to invalidate TLB but this CPU has irq disabled.
>> + * Since we are in a speculative patch, accept it could fail
>> + */
>> + ptl = pte_lockptr(vmf->vma->vm_mm, vmf->pmd);
>> + pte = pte_offset_map(vmf->pmd, vmf->address);
>> + if (unlikely(!spin_trylock(ptl))) {
>> + pte_unmap(pte);
>> + local_irq_enable();
>> + goto again;
>> + }
>
> Same comment as above shouldn't be limited to a maximum number of retry ?

Same answer ;)

>
>> +
>> + if (vma_has_changed(vmf)) {
>> + pte_unmap_unlock(pte, ptl);
>> + goto out;
>> + }
>> +
>> + vmf->pte = pte;
>> + vmf->ptl = ptl;
>> + ret = true;
>> +out:
>> + local_irq_enable();
>> + return ret;
>> +}
>> +#else
>> static inline bool pte_spinlock(struct vm_fault *vmf)
>> {
>> vmf->ptl = pte_lockptr(vmf->vma->vm_mm, vmf->pmd);
>> @@ -2095,6 +2208,7 @@ static inline bool pte_map_lock(struct vm_fault *vmf)
>> vmf->address, &vmf->ptl);
>> return true;
>> }
>> +#endif /* CONFIG_SPECULATIVE_PAGE_FAULT */
>>
>> /*
>> * handle_pte_fault chooses page fault handler according to an entry which was
>> @@ -2999,6 +3113,14 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
>> ret = check_stable_address_space(vma->vm_mm);
>> if (ret)
>> goto unlock;
>> + /*
>> + * Don't call the userfaultfd during the speculative path.
>> + * We already checked for the VMA to not be managed through
>> + * userfaultfd, but it may be set in our back once we have lock
>> + * the pte. In such a case we can ignore it this time.
>> + */
>> + if (vmf->flags & FAULT_FLAG_SPECULATIVE)
>> + goto setpte;
>
> Bit confuse by the comment above, if userfaultfd is set in the back
> then shouldn't the speculative fault abort ? So wouldn't the following
> be correct:
>
> if (userfaultfd_missing(vma)) {
> pte_unmap_unlock(vmf->pte, vmf->ptl);
> if (vmf->flags & FAULT_FLAG_SPECULATIVE)
> return VM_FAULT_RETRY;
> ...

Well, here we are racing with the user space action setting up the
userfaultfd; we may go through this page fault seeing the userfaultfd or
not. But I can't imagine that the user process will rely on that to
happen. If there is such a race, it would be up to the user space process
to ensure that no page faults are triggered while it is setting up the
userfaultfd.
Since a check on the userfaultfd is done at the beginning of the SPF
handler, I made the choice to ignore it later and not trigger the
userfault this time.

Obviously we may abort the SPF handling, but what is the benefit?

>
>> /* Deliver the page fault to userland, check inside PT lock */
>> if (userfaultfd_missing(vma)) {
>> pte_unmap_unlock(vmf->pte, vmf->ptl);
>> @@ -3041,7 +3163,8 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
>> goto unlock_and_release;
>>
>> /* Deliver the page fault to userland, check inside PT lock */
>> - if (userfaultfd_missing(vma)) {
>> + if (!(vmf->flags & FAULT_FLAG_SPECULATIVE) &&
>> + userfaultfd_missing(vma)) {
>
> Same comment as above, but this also seems more wrong than above. What
> I propose above would look more correct in both cases, i.e. we still want
> to check for userfaultfd, but if we are in a speculative fault then we
> just want to abort the speculative fault.

Why is it more wrong here? Indeed, this is consistent with the previous
action: ignore the userfault event if it has been set while the SPF
handler is in progress. IMHO it is up to user space to serialize the
userfaultfd setting against the ongoing page fault in that case.
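
For reference, a minimal sketch of the back-off alternative discussed
here, modelled on the second do_anonymous_page() hunk quoted above; this
only illustrates Jérôme's suggestion, it is not what the patch currently
does:

	/* Deliver the page fault to userland, check inside PT lock */
	if (userfaultfd_missing(vma)) {
		pte_unmap_unlock(vmf->pte, vmf->ptl);
		mem_cgroup_cancel_charge(page, memcg, false);
		put_page(page);
		if (vmf->flags & FAULT_FLAG_SPECULATIVE)
			/* Back off and let the regular fault handle it. */
			return VM_FAULT_RETRY;
		return handle_userfault(vmf, VM_UFFD_MISSING);
	}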

>
>> pte_unmap_unlock(vmf->pte, vmf->ptl);
>> mem_cgroup_cancel_charge(page, memcg, false);
>> put_page(page);
>> @@ -3836,6 +3959,15 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
>> pte_t entry;
>>
>> if (unlikely(pmd_none(*vmf->pmd))) {
>> + /*
>> + * In the case of the speculative page fault handler we abort
>> + * the speculative path immediately as the pmd is probably
>> + * in the way to be converted in a huge one. We will try
>> + * again holding the mmap_sem (which implies that the collapse
>> + * operation is done).
>> + */
>> + if (vmf->flags & FAULT_FLAG_SPECULATIVE)
>> + return VM_FAULT_RETRY;
>> /*
>> * Leave __pte_alloc() until later: because vm_ops->fault may
>> * want to allocate huge page, and if we expose page table
>> @@ -3843,7 +3975,7 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
>> * concurrent faults and from rmap lookups.
>> */
>> vmf->pte = NULL;
>> - } else {
>> + } else if (!(vmf->flags & FAULT_FLAG_SPECULATIVE)) {
>> /* See comment in pte_alloc_one_map() */
>> if (pmd_devmap_trans_unstable(vmf->pmd))
>> return 0;
>> @@ -3852,6 +3984,9 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
>> * pmd from under us anymore at this point because we hold the
>> * mmap_sem read mode and khugepaged takes it in write mode.
>> * So now it's safe to run pte_offset_map().
>> + * This is not applicable to the speculative page fault handler
>> + * but in that case, the pte is fetched earlier in
>> + * handle_speculative_fault().
>> */
>> vmf->pte = pte_offset_map(vmf->pmd, vmf->address);
>> vmf->orig_pte = *vmf->pte;
>> @@ -3874,6 +4009,8 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
>> if (!vmf->pte) {
>> if (vma_is_anonymous(vmf->vma))
>> return do_anonymous_page(vmf);
>> + else if (vmf->flags & FAULT_FLAG_SPECULATIVE)
>> + return VM_FAULT_RETRY;
>
> Maybe a small comment about speculative page fault not applying to
> file back vma.

Sure.

>
>> else
>> return do_fault(vmf);
>> }
>> @@ -3971,6 +4108,9 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
>> vmf.pmd = pmd_alloc(mm, vmf.pud, address);
>> if (!vmf.pmd)
>> return VM_FAULT_OOM;
>> +#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
>> + vmf.sequence = raw_read_seqcount(&vma->vm_sequence);
>> +#endif
>> if (pmd_none(*vmf.pmd) && __transparent_hugepage_enabled(vma)) {
>> ret = create_huge_pmd(&vmf);
>> if (!(ret & VM_FAULT_FALLBACK))
>> @@ -4004,6 +4144,204 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
>> return handle_pte_fault(&vmf);
>> }
>>
>> +#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
>> +/*
>> + * Tries to handle the page fault in a speculative way, without grabbing the
>> + * mmap_sem.
>> + */
>> +vm_fault_t __handle_speculative_fault(struct mm_struct *mm,
>> + unsigned long address,
>> + unsigned int flags)
>> +{
>> + struct vm_fault vmf = {
>> + .address = address,
>> + };
>> + pgd_t *pgd, pgdval;
>> + p4d_t *p4d, p4dval;
>> + pud_t pudval;
>> + int seq;
>> + vm_fault_t ret = VM_FAULT_RETRY;
>> + struct vm_area_struct *vma;
>> +#ifdef CONFIG_NUMA
>> + struct mempolicy *pol;
>> +#endif
>> +
>> + /* Clear flags that may lead to release the mmap_sem to retry */
>> + flags &= ~(FAULT_FLAG_ALLOW_RETRY|FAULT_FLAG_KILLABLE);
>> + flags |= FAULT_FLAG_SPECULATIVE;
>> +
>> + vma = find_vma_rcu(mm, address);
>> + if (!vma)
>> + return ret;
>> +
>> + /* rmb <-> seqlock,vma_rb_erase() */
>> + seq = raw_read_seqcount(&vma->vm_sequence);
>> + if (seq & 1)
>> + goto out_put;
>
> A comment explaining that odd sequence number means that we are racing
> with a write_begin and write_end would be welcome above.

Yes that would be welcome.

>> +
>> + /*
>> + * Can't call vm_ops service has we don't know what they would do
>> + * with the VMA.
>> + * This include huge page from hugetlbfs.
>> + */
>> + if (vma->vm_ops && vma->vm_ops->fault)
>> + goto out_put;
>> +
>> + /*
>> + * __anon_vma_prepare() requires the mmap_sem to be held
>> + * because vm_next and vm_prev must be safe. This can't be guaranteed
>> + * in the speculative path.
>> + */
>> + if (unlikely(!vma->anon_vma))
>> + goto out_put;
>
> Maybe also remind people that once the vma->anon_vma is set then its
> value will not change and thus we do not need to protect against such
> thing (unlike vm_flags or other vma field below and above).

Will do, thanks.


>> +
>> + vmf.vma_flags = READ_ONCE(vma->vm_flags);
>> + vmf.vma_page_prot = READ_ONCE(vma->vm_page_prot);
>> +
>> + /* Can't call userland page fault handler in the speculative path */
>> + if (unlikely(vmf.vma_flags & VM_UFFD_MISSING))
>> + goto out_put;
>> +
>> + if (vmf.vma_flags & VM_GROWSDOWN || vmf.vma_flags & VM_GROWSUP)
>> + /*
>> + * This could be detected by the check address against VMA's
>> + * boundaries but we want to trace it as not supported instead
>> + * of changed.
>> + */
>> + goto out_put;
>> +
>> + if (address < READ_ONCE(vma->vm_start)
>> + || READ_ONCE(vma->vm_end) <= address)
>> + goto out_put;
>> +
>> + if (!arch_vma_access_permitted(vma, flags & FAULT_FLAG_WRITE,
>> + flags & FAULT_FLAG_INSTRUCTION,
>> + flags & FAULT_FLAG_REMOTE)) {
>> + ret = VM_FAULT_SIGSEGV;
>> + goto out_put;
>> + }
>> +
>> + /* This is one is required to check that the VMA has write access set */
>> + if (flags & FAULT_FLAG_WRITE) {
>> + if (unlikely(!(vmf.vma_flags & VM_WRITE))) {
>> + ret = VM_FAULT_SIGSEGV;
>> + goto out_put;
>> + }
>> + } else if (unlikely(!(vmf.vma_flags & (VM_READ|VM_EXEC|VM_WRITE)))) {
>> + ret = VM_FAULT_SIGSEGV;
>> + goto out_put;
>> + }
>> +
>> +#ifdef CONFIG_NUMA
>> + /*
>> + * MPOL_INTERLEAVE implies additional checks in
>> + * mpol_misplaced() which are not compatible with the
>> + *speculative page fault processing.
>> + */
>> + pol = __get_vma_policy(vma, address);
>> + if (!pol)
>> + pol = get_task_policy(current);
>> + if (pol && pol->mode == MPOL_INTERLEAVE)
>> + goto out_put;
>> +#endif
>> +
>> + /*
>> + * Do a speculative lookup of the PTE entry.
>> + */
>> + local_irq_disable();
>> + pgd = pgd_offset(mm, address);
>> + pgdval = READ_ONCE(*pgd);
>> + if (pgd_none(pgdval) || unlikely(pgd_bad(pgdval)))
>> + goto out_walk;
>> +
>> + p4d = p4d_offset(pgd, address);
>> + p4dval = READ_ONCE(*p4d);
>> + if (p4d_none(p4dval) || unlikely(p4d_bad(p4dval)))
>> + goto out_walk;
>> +
>> + vmf.pud = pud_offset(p4d, address);
>> + pudval = READ_ONCE(*vmf.pud);
>> + if (pud_none(pudval) || unlikely(pud_bad(pudval)))
>> + goto out_walk;
>> +
>> + /* Huge pages at PUD level are not supported. */
>> + if (unlikely(pud_trans_huge(pudval)))
>> + goto out_walk;
>> +
>> + vmf.pmd = pmd_offset(vmf.pud, address);
>> + vmf.orig_pmd = READ_ONCE(*vmf.pmd);
>> + /*
>> + * pmd_none could mean that a hugepage collapse is in progress
>> + * in our back as collapse_huge_page() mark it before
>> + * invalidating the pte (which is done once the IPI is catched
>> + * by all CPU and we have interrupt disabled).
>> + * For this reason we cannot handle THP in a speculative way since we
>> + * can't safely identify an in progress collapse operation done in our
>> + * back on that PMD.
>> + * Regarding the order of the following checks, see comment in
>> + * pmd_devmap_trans_unstable()
>> + */
>> + if (unlikely(pmd_devmap(vmf.orig_pmd) ||
>> + pmd_none(vmf.orig_pmd) || pmd_trans_huge(vmf.orig_pmd) ||
>> + is_swap_pmd(vmf.orig_pmd)))
>> + goto out_walk;
>> +
>> + /*
>> + * The above does not allocate/instantiate page-tables because doing so
>> + * would lead to the possibility of instantiating page-tables after
>> + * free_pgtables() -- and consequently leaking them.
>> + *
>> + * The result is that we take at least one !speculative fault per PMD
>> + * in order to instantiate it.
>> + */
>> +
>> + vmf.pte = pte_offset_map(vmf.pmd, address);
>> + vmf.orig_pte = READ_ONCE(*vmf.pte);
>> + barrier(); /* See comment in handle_pte_fault() */
>> + if (pte_none(vmf.orig_pte)) {
>> + pte_unmap(vmf.pte);
>> + vmf.pte = NULL;
>> + }
>> +
>> + vmf.vma = vma;
>> + vmf.pgoff = linear_page_index(vma, address);
>> + vmf.gfp_mask = __get_fault_gfp_mask(vma);
>> + vmf.sequence = seq;
>> + vmf.flags = flags;
>> +
>> + local_irq_enable();
>> +
>> + /*
>> + * We need to re-validate the VMA after checking the bounds, otherwise
>> + * we might have a false positive on the bounds.
>> + */
>> + if (read_seqcount_retry(&vma->vm_sequence, seq))
>> + goto out_put;
>> +
>> + mem_cgroup_enter_user_fault();
>> + ret = handle_pte_fault(&vmf);
>> + mem_cgroup_exit_user_fault();
>> +
>> + put_vma(vma);
>> +
>> + /*
>> + * The task may have entered a memcg OOM situation but
>> + * if the allocation error was handled gracefully (no
>> + * VM_FAULT_OOM), there is no need to kill anything.
>> + * Just clean up the OOM state peacefully.
>> + */
>> + if (task_in_memcg_oom(current) && !(ret & VM_FAULT_OOM))
>> + mem_cgroup_oom_synchronize(false);
>> + return ret;
>> +
>> +out_walk:
>> + local_irq_enable();
>> +out_put:
>> + put_vma(vma);
>> + return ret;
>> +}
>> +#endif /* CONFIG_SPECULATIVE_PAGE_FAULT */
>> +
>> /*
>> * By the time we get here, we already hold the mm semaphore
>> *
>> --
>> 2.21.0
>>
>

2019-04-24 15:00:06

by Laurent Dufour

[permalink] [raw]
Subject: Re: [PATCH v12 23/31] mm: don't do swap readahead during speculative page fault

On 22/04/2019 at 23:36, Jerome Glisse wrote:
> On Tue, Apr 16, 2019 at 03:45:14PM +0200, Laurent Dufour wrote:
>> Vinayak Menon faced a panic because one thread was page faulting a page in
>> swap, while another one was mprotecting a part of the VMA leading to a VMA
>> split.
>> This raise a panic in swap_vma_readahead() because the VMA's boundaries
>> were not more matching the faulting address.
>>
>> To avoid this, if the page is not found in the swap, the speculative page
>> fault is aborted to retry a regular page fault.
>>
>> Reported-by: Vinayak Menon <[email protected]>
>> Signed-off-by: Laurent Dufour <[email protected]>
>
> Reviewed-by: Jérôme Glisse <[email protected]>
>
> Note that you should also skip non-swap entries in do_swap_page() when doing
> a speculative page fault; at the very least you need to handle the
> is_device_private_entry() case.
>
> But this should either be part of patch 22 or another patch to fix the swap
> case.

Thanks Jérôme,

Yes, I missed that. I guess the best option would be to abort on a
non-swap entry. I'll add that in patch 22.
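
A minimal sketch of that additional check, reusing only names already
present in do_swap_page(); the exact placement and shape in the next
version may differ:

	entry = pte_to_swp_entry(vmf->orig_pte);
	if ((vmf->flags & FAULT_FLAG_SPECULATIVE) && non_swap_entry(entry)) {
		/*
		 * Special swap entries (migration, device private,
		 * HWPoison) are handled by code the speculative path
		 * cannot safely call; fall back to the regular fault.
		 */
		ret = VM_FAULT_RETRY;
		goto out;
	}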

>> ---
>> mm/memory.c | 11 +++++++++++
>> 1 file changed, 11 insertions(+)
>>
>> diff --git a/mm/memory.c b/mm/memory.c
>> index 6e6bf61c0e5c..1991da97e2db 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -2900,6 +2900,17 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>> lru_cache_add_anon(page);
>> swap_readpage(page, true);
>> }
>> + } else if (vmf->flags & FAULT_FLAG_SPECULATIVE) {
>> + /*
>> + * Don't try readahead during a speculative page fault
>> + * as the VMA's boundaries may change in our back.
>> + * If the page is not in the swap cache and synchronous
>> + * read is disabled, fall back to the regular page
>> + * fault mechanism.
>> + */
>> + delayacct_clear_flag(DELAYACCT_PF_SWAPIN);
>> + ret = VM_FAULT_RETRY;
>> + goto out;
>> } else {
>> page = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE,
>> vmf);
>> --
>> 2.21.0
>>
>

2019-04-24 15:16:48

by Jerome Glisse

[permalink] [raw]
Subject: Re: [PATCH v12 22/31] mm: provide speculative fault infrastructure

On Wed, Apr 24, 2019 at 04:56:14PM +0200, Laurent Dufour wrote:
> On 22/04/2019 at 23:26, Jerome Glisse wrote:
> > On Tue, Apr 16, 2019 at 03:45:13PM +0200, Laurent Dufour wrote:
> > > From: Peter Zijlstra <[email protected]>
> > >
> > > Provide infrastructure to do a speculative fault (not holding
> > > mmap_sem).
> > >
> > > The not holding of mmap_sem means we can race against VMA
> > > change/removal and page-table destruction. We use the SRCU VMA freeing
> > > to keep the VMA around. We use the VMA seqcount to detect change
> > > (including umapping / page-table deletion) and we use gup_fast() style
> > > page-table walking to deal with page-table races.
> > >
> > > Once we've obtained the page and are ready to update the PTE, we
> > > validate if the state we started the fault with is still valid, if
> > > not, we'll fail the fault with VM_FAULT_RETRY, otherwise we update the
> > > PTE and we're done.
> > >
> > > Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
> > >
> > > [Manage the newly introduced pte_spinlock() for speculative page
> > > fault to fail if the VMA is touched in our back]
> > > [Rename vma_is_dead() to vma_has_changed() and declare it here]
> > > [Fetch p4d and pud]
> > > [Set vmd.sequence in __handle_mm_fault()]
> > > [Abort speculative path when handle_userfault() has to be called]
> > > [Add additional VMA's flags checks in handle_speculative_fault()]
> > > [Clear FAULT_FLAG_ALLOW_RETRY in handle_speculative_fault()]
> > > [Don't set vmf->pte and vmf->ptl if pte_map_lock() failed]
> > > [Remove warning comment about waiting for !seq&1 since we don't want
> > > to wait]
> > > [Remove warning about no huge page support, mention it explictly]
> > > [Don't call do_fault() in the speculative path as __do_fault() calls
> > > vma->vm_ops->fault() which may want to release mmap_sem]
> > > [Only vm_fault pointer argument for vma_has_changed()]
> > > [Fix check against huge page, calling pmd_trans_huge()]
> > > [Use READ_ONCE() when reading VMA's fields in the speculative path]
> > > [Explicitly check for __HAVE_ARCH_PTE_SPECIAL as we can't support for
> > > processing done in vm_normal_page()]
> > > [Check that vma->anon_vma is already set when starting the speculative
> > > path]
> > > [Check for memory policy as we can't support MPOL_INTERLEAVE case due to
> > > the processing done in mpol_misplaced()]
> > > [Don't support VMA growing up or down]
> > > [Move check on vm_sequence just before calling handle_pte_fault()]
> > > [Don't build SPF services if !CONFIG_SPECULATIVE_PAGE_FAULT]
> > > [Add mem cgroup oom check]
> > > [Use READ_ONCE to access p*d entries]
> > > [Replace deprecated ACCESS_ONCE() by READ_ONCE() in vma_has_changed()]
> > > [Don't fetch pte again in handle_pte_fault() when running the speculative
> > > path]
> > > [Check PMD against concurrent collapsing operation]
> > > [Try spin lock the pte during the speculative path to avoid deadlock with
> > > other CPU's invalidating the TLB and requiring this CPU to catch the
> > > inter processor's interrupt]
> > > [Move define of FAULT_FLAG_SPECULATIVE here]
> > > [Introduce __handle_speculative_fault() and add a check against
> > > mm->mm_users in handle_speculative_fault() defined in mm.h]
> > > [Abort if vm_ops->fault is set instead of checking only vm_ops]
> > > [Use find_vma_rcu() and call put_vma() when we are done with the VMA]
> > > Signed-off-by: Laurent Dufour <[email protected]>
> >
> >
> > Few comments and questions for this one see below.
> >
> >
> > > ---
> > > include/linux/hugetlb_inline.h | 2 +-
> > > include/linux/mm.h | 30 +++
> > > include/linux/pagemap.h | 4 +-
> > > mm/internal.h | 15 ++
> > > mm/memory.c | 344 ++++++++++++++++++++++++++++++++-
> > > 5 files changed, 389 insertions(+), 6 deletions(-)
> > >
> > > diff --git a/include/linux/hugetlb_inline.h b/include/linux/hugetlb_inline.h
> > > index 0660a03d37d9..9e25283d6fc9 100644
> > > --- a/include/linux/hugetlb_inline.h
> > > +++ b/include/linux/hugetlb_inline.h
> > > @@ -8,7 +8,7 @@
> > > static inline bool is_vm_hugetlb_page(struct vm_area_struct *vma)
> > > {
> > > - return !!(vma->vm_flags & VM_HUGETLB);
> > > + return !!(READ_ONCE(vma->vm_flags) & VM_HUGETLB);
> > > }
> > > #else
> > > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > > index f761a9c65c74..ec609cbad25a 100644
> > > --- a/include/linux/mm.h
> > > +++ b/include/linux/mm.h
> > > @@ -381,6 +381,7 @@ extern pgprot_t protection_map[16];
> > > #define FAULT_FLAG_USER 0x40 /* The fault originated in userspace */
> > > #define FAULT_FLAG_REMOTE 0x80 /* faulting for non current tsk/mm */
> > > #define FAULT_FLAG_INSTRUCTION 0x100 /* The fault was during an instruction fetch */
> > > +#define FAULT_FLAG_SPECULATIVE 0x200 /* Speculative fault, not holding mmap_sem */
> > > #define FAULT_FLAG_TRACE \
> > > { FAULT_FLAG_WRITE, "WRITE" }, \
> > > @@ -409,6 +410,10 @@ struct vm_fault {
> > > gfp_t gfp_mask; /* gfp mask to be used for allocations */
> > > pgoff_t pgoff; /* Logical page offset based on vma */
> > > unsigned long address; /* Faulting virtual address */
> > > +#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
> > > + unsigned int sequence;
> > > + pmd_t orig_pmd; /* value of PMD at the time of fault */
> > > +#endif
> > > pmd_t *pmd; /* Pointer to pmd entry matching
> > > * the 'address' */
> > > pud_t *pud; /* Pointer to pud entry matching
> > > @@ -1524,6 +1529,31 @@ int invalidate_inode_page(struct page *page);
> > > #ifdef CONFIG_MMU
> > > extern vm_fault_t handle_mm_fault(struct vm_area_struct *vma,
> > > unsigned long address, unsigned int flags);
> > > +
> > > +#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
> > > +extern vm_fault_t __handle_speculative_fault(struct mm_struct *mm,
> > > + unsigned long address,
> > > + unsigned int flags);
> > > +static inline vm_fault_t handle_speculative_fault(struct mm_struct *mm,
> > > + unsigned long address,
> > > + unsigned int flags)
> > > +{
> > > + /*
> > > + * Try speculative page fault for multithreaded user space task only.
> > > + */
> > > + if (!(flags & FAULT_FLAG_USER) || atomic_read(&mm->mm_users) == 1)
> > > + return VM_FAULT_RETRY;
> > > + return __handle_speculative_fault(mm, address, flags);
> > > +}
> > > +#else
> > > +static inline vm_fault_t handle_speculative_fault(struct mm_struct *mm,
> > > + unsigned long address,
> > > + unsigned int flags)
> > > +{
> > > + return VM_FAULT_RETRY;
> > > +}
> > > +#endif /* CONFIG_SPECULATIVE_PAGE_FAULT */
> > > +
> > > extern int fixup_user_fault(struct task_struct *tsk, struct mm_struct *mm,
> > > unsigned long address, unsigned int fault_flags,
> > > bool *unlocked);
> > > diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
> > > index 2e8438a1216a..2fcfaa910007 100644
> > > --- a/include/linux/pagemap.h
> > > +++ b/include/linux/pagemap.h
> > > @@ -457,8 +457,8 @@ static inline pgoff_t linear_page_index(struct vm_area_struct *vma,
> > > pgoff_t pgoff;
> > > if (unlikely(is_vm_hugetlb_page(vma)))
> > > return linear_hugepage_index(vma, address);
> > > - pgoff = (address - vma->vm_start) >> PAGE_SHIFT;
> > > - pgoff += vma->vm_pgoff;
> > > + pgoff = (address - READ_ONCE(vma->vm_start)) >> PAGE_SHIFT;
> > > + pgoff += READ_ONCE(vma->vm_pgoff);
> > > return pgoff;
> > > }
> > > diff --git a/mm/internal.h b/mm/internal.h
> > > index 1e368e4afe3c..ed91b199cb8c 100644
> > > --- a/mm/internal.h
> > > +++ b/mm/internal.h
> > > @@ -58,6 +58,21 @@ static inline void put_vma(struct vm_area_struct *vma)
> > > extern struct vm_area_struct *find_vma_rcu(struct mm_struct *mm,
> > > unsigned long addr);
> > > +
> > > +static inline bool vma_has_changed(struct vm_fault *vmf)
> > > +{
> > > + int ret = RB_EMPTY_NODE(&vmf->vma->vm_rb);
> > > + unsigned int seq = READ_ONCE(vmf->vma->vm_sequence.sequence);
> > > +
> > > + /*
> > > + * Matches both the wmb in write_seqlock_{begin,end}() and
> > > + * the wmb in vma_rb_erase().
> > > + */
> > > + smp_rmb();
> > > +
> > > + return ret || seq != vmf->sequence;
> > > +}
> > > +
> > > #else /* CONFIG_SPECULATIVE_PAGE_FAULT */
> > > static inline void get_vma(struct vm_area_struct *vma)
> > > diff --git a/mm/memory.c b/mm/memory.c
> > > index 46f877b6abea..6e6bf61c0e5c 100644
> > > --- a/mm/memory.c
> > > +++ b/mm/memory.c
> > > @@ -522,7 +522,8 @@ static void print_bad_pte(struct vm_area_struct *vma, unsigned long addr,
> > > if (page)
> > > dump_page(page, "bad pte");
> > > pr_alert("addr:%p vm_flags:%08lx anon_vma:%p mapping:%p index:%lx\n",
> > > - (void *)addr, vma->vm_flags, vma->anon_vma, mapping, index);
> > > + (void *)addr, READ_ONCE(vma->vm_flags), vma->anon_vma,
> > > + mapping, index);
> > > pr_alert("file:%pD fault:%pf mmap:%pf readpage:%pf\n",
> > > vma->vm_file,
> > > vma->vm_ops ? vma->vm_ops->fault : NULL,
> > > @@ -2082,6 +2083,118 @@ int apply_to_page_range(struct mm_struct *mm, unsigned long addr,
> > > }
> > > EXPORT_SYMBOL_GPL(apply_to_page_range);
> > > +#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
> > > +static bool pte_spinlock(struct vm_fault *vmf)
> > > +{
> > > + bool ret = false;
> > > +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> > > + pmd_t pmdval;
> > > +#endif
> > > +
> > > + /* Check if vma is still valid */
> > > + if (!(vmf->flags & FAULT_FLAG_SPECULATIVE)) {
> > > + vmf->ptl = pte_lockptr(vmf->vma->vm_mm, vmf->pmd);
> > > + spin_lock(vmf->ptl);
> > > + return true;
> > > + }
> > > +
> > > +again:
> > > + local_irq_disable();
> > > + if (vma_has_changed(vmf))
> > > + goto out;
> > > +
> > > +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> > > + /*
> > > + * We check if the pmd value is still the same to ensure that there
> > > + * is not a huge collapse operation in progress in our back.
> > > + */
> > > + pmdval = READ_ONCE(*vmf->pmd);
> > > + if (!pmd_same(pmdval, vmf->orig_pmd))
> > > + goto out;
> > > +#endif
> > > +
> > > + vmf->ptl = pte_lockptr(vmf->vma->vm_mm, vmf->pmd);
> > > + if (unlikely(!spin_trylock(vmf->ptl))) {
> > > + local_irq_enable();
> > > + goto again;
> > > + }
> >
> > Do we want to constantly retry taking the spinlock ? Shouldn't it
> > be limited ? If we fail few times it is probably better to give
> > up on that speculative page fault.
> >
> > So maybe putting everything within a for(i; i < MAX_TRY; ++i) loop
> > would be cleaner.
>
> I did try that in the past when I added this loop, but I never reached the
> limit I set. By the way, what should the MAX_TRY value be? ;)

A power of 2 :) Like 4, something small.

>
> The loop was introduced to fix a race between CPUs; this is explained in the
> patch description, but a comment is clearly missing here:
>
> /*
> * A spin_trylock() of the ptl is done to avoid a deadlock with another
> * CPU invalidating the TLB and requiring this CPU to catch the IPI.
> * As interrupts are disabled during this operation, we need to re-enable
> * them and retry locking the PTL.
> */
>
> I don't think that retrying the page fault would help, since the regular
> page fault handler will also spin here if there is massive contention on
> the PTL.

My main fear is that the loop will hammer a CPU if another CPU is holding
the same spinlock. In most places the page table lock should be held only
for a short period of time, so it should never last long. So while I
cannot think of any reason it would loop forever, I fear I might just
lack imagination here.
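
To make the suggestion concrete, a bounded variant of the speculative
branch could look like the sketch below; SPF_PTL_MAX_TRY is an invented
name and the THP pmd_same() check is elided for brevity:

#define SPF_PTL_MAX_TRY	4	/* small power of 2, as suggested above */

static bool pte_spinlock_speculative(struct vm_fault *vmf)
{
	int try;

	for (try = 0; try < SPF_PTL_MAX_TRY; try++) {
		local_irq_disable();
		if (vma_has_changed(vmf)) {
			local_irq_enable();
			return false;
		}

		vmf->ptl = pte_lockptr(vmf->vma->vm_mm, vmf->pmd);
		if (spin_trylock(vmf->ptl)) {
			bool ret = !vma_has_changed(vmf);

			if (!ret)
				spin_unlock(vmf->ptl);
			local_irq_enable();
			return ret;
		}

		/*
		 * The PTL is contended: re-enable IRQs so a TLB-flush IPI
		 * sent by the lock holder can be serviced, then retry.
		 */
		local_irq_enable();
	}

	/* Too much contention: give up and take the regular fault path. */
	return false;
}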

>
> >
> >
> > > +
> > > + if (vma_has_changed(vmf)) {
> > > + spin_unlock(vmf->ptl);
> > > + goto out;
> > > + }
> > > +
> > > + ret = true;
> > > +out:
> > > + local_irq_enable();
> > > + return ret;
> > > +}
> > > +
> > > +static bool pte_map_lock(struct vm_fault *vmf)
> > > +{
> > > + bool ret = false;
> > > + pte_t *pte;
> > > + spinlock_t *ptl;
> > > +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> > > + pmd_t pmdval;
> > > +#endif
> > > +
> > > + if (!(vmf->flags & FAULT_FLAG_SPECULATIVE)) {
> > > + vmf->pte = pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd,
> > > + vmf->address, &vmf->ptl);
> > > + return true;
> > > + }
> > > +
> > > + /*
> > > + * The first vma_has_changed() guarantees the page-tables are still
> > > + * valid, having IRQs disabled ensures they stay around, hence the
> > > + * second vma_has_changed() to make sure they are still valid once
> > > + * we've got the lock. After that a concurrent zap_pte_range() will
> > > + * block on the PTL and thus we're safe.
> > > + */
> > > +again:
> > > + local_irq_disable();
> > > + if (vma_has_changed(vmf))
> > > + goto out;
> > > +
> > > +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> > > + /*
> > > + * We check if the pmd value is still the same to ensure that there
> > > + * is not a huge collapse operation in progress in our back.
> > > + */
> > > + pmdval = READ_ONCE(*vmf->pmd);
> > > + if (!pmd_same(pmdval, vmf->orig_pmd))
> > > + goto out;
> > > +#endif
> > > +
> > > + /*
> > > + * Same as pte_offset_map_lock() except that we call
> > > + * spin_trylock() in place of spin_lock() to avoid race with
> > > + * unmap path which may have the lock and wait for this CPU
> > > + * to invalidate TLB but this CPU has irq disabled.
> > > + * Since we are in a speculative patch, accept it could fail
> > > + */
> > > + ptl = pte_lockptr(vmf->vma->vm_mm, vmf->pmd);
> > > + pte = pte_offset_map(vmf->pmd, vmf->address);
> > > + if (unlikely(!spin_trylock(ptl))) {
> > > + pte_unmap(pte);
> > > + local_irq_enable();
> > > + goto again;
> > > + }
> >
> > Same comment as above shouldn't be limited to a maximum number of retry ?
>
> Same answer ;)
>
> >
> > > +
> > > + if (vma_has_changed(vmf)) {
> > > + pte_unmap_unlock(pte, ptl);
> > > + goto out;
> > > + }
> > > +
> > > + vmf->pte = pte;
> > > + vmf->ptl = ptl;
> > > + ret = true;
> > > +out:
> > > + local_irq_enable();
> > > + return ret;
> > > +}
> > > +#else
> > > static inline bool pte_spinlock(struct vm_fault *vmf)
> > > {
> > > vmf->ptl = pte_lockptr(vmf->vma->vm_mm, vmf->pmd);
> > > @@ -2095,6 +2208,7 @@ static inline bool pte_map_lock(struct vm_fault *vmf)
> > > vmf->address, &vmf->ptl);
> > > return true;
> > > }
> > > +#endif /* CONFIG_SPECULATIVE_PAGE_FAULT */
> > > /*
> > > * handle_pte_fault chooses page fault handler according to an entry which was
> > > @@ -2999,6 +3113,14 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
> > > ret = check_stable_address_space(vma->vm_mm);
> > > if (ret)
> > > goto unlock;
> > > + /*
> > > + * Don't call the userfaultfd during the speculative path.
> > > + * We already checked for the VMA to not be managed through
> > > + * userfaultfd, but it may be set in our back once we have lock
> > > + * the pte. In such a case we can ignore it this time.
> > > + */
> > > + if (vmf->flags & FAULT_FLAG_SPECULATIVE)
> > > + goto setpte;
> >
> > Bit confused by the comment above: if userfaultfd is set behind our back,
> > then shouldn't the speculative fault abort? So wouldn't the following
> > be correct:
> >
> > if (userfaultfd_missing(vma)) {
> > pte_unmap_unlock(vmf->pte, vmf->ptl);
> > if (vmf->flags & FAULT_FLAG_SPECULATIVE)
> > return VM_FAULT_RETRY;
> > ...
>
> Well, here we are racing with the user space action setting up the userfaultfd;
> we may go through this page fault seeing the userfaultfd or not. But I
> can't imagine that the user process will rely on that to happen. If there is
> such a race, it would be up to the user space process to ensure that no page
> faults are triggered while it is setting up the userfaultfd.
> Since a check on the userfaultfd is done at the beginning of the SPF
> handler, I made the choice to ignore it later and not trigger the
> userfault this time.
>
> Obviously we may abort the SPF handling but what is the benefit ?

Yeah, probably no benefit one way or the other. Backing off when the vma
changes in any way seems more consistent to me, but I am fine either
way.

>
> >
> > > /* Deliver the page fault to userland, check inside PT lock */
> > > if (userfaultfd_missing(vma)) {
> > > pte_unmap_unlock(vmf->pte, vmf->ptl);
> > > @@ -3041,7 +3163,8 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
> > > goto unlock_and_release;
> > > /* Deliver the page fault to userland, check inside PT lock */
> > > - if (userfaultfd_missing(vma)) {
> > > + if (!(vmf->flags & FAULT_FLAG_SPECULATIVE) &&
> > > + userfaultfd_missing(vma)) {
> >
> > Same comment as above, but this also seems more wrong than above. What
> > I propose above would look more correct in both cases, i.e. we still want
> > to check for userfaultfd, but if we are in a speculative fault then we
> > just want to abort the speculative fault.
>
> Why is it more wrong here? Indeed, this is consistent with the previous action:
> ignore the userfault event if it has been set while the SPF handler is in
> progress. IMHO it is up to user space to serialize the userfaultfd
> setting against the ongoing page fault in that case.

Adding a comment saying that SPF would have backed off if userfaultfd
was armed at the beginning of SPF, and that we want to ignore a race with
userfaultfd enablement, would help here.
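
As a sketch only, such a comment around the hunk in do_anonymous_page() could
look like the following; the wording is illustrative, not taken from the series:

	/*
	 * Don't deliver the fault to userland on the speculative path.
	 * SPF already backed off if userfaultfd was armed when the
	 * fault started; if it got armed behind our back after that
	 * check, we deliberately ignore it for this fault, since user
	 * space has to serialize userfaultfd registration against
	 * ongoing page faults anyway.
	 */
	if (vmf->flags & FAULT_FLAG_SPECULATIVE)
		goto setpte;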

2019-04-24 18:05:04

by Laurent Dufour

[permalink] [raw]
Subject: Re: [PATCH v12 00/31] Speculative page faults

Le 22/04/2019 à 23:29, Michel Lespinasse a écrit :
> Hi Laurent,
>
> Thanks a lot for copying me on this patchset. It took me a few days to
> go through it - I had not been following the previous iterations of
> this series so I had to catch up. I will be sending comments for
> individual commits, but before that I would like to discuss the series
> as a whole.

Hi Michel,

Thanks for reviewing this series.

> I think these changes are a big step in the right direction. My main
> reservation about them is that they are additive - adding some complexity
> for speculative page faults - and I wonder if it'd be possible, over the
> long term, to replace the existing complexity we have in mmap_sem retry
> mechanisms instead of adding to it. This is not something that should
> block your progress, but I think it would be good, as we introduce spf,
> to evaluate whether we could eventually get all the way to removing the
> mmap_sem retry mechanism, or if we will actually have to keep both.

Until we get rid of the mmap_sem which seems to be a very long story, I
can't see how we could get rid of the retry mechanism.

> The proposed spf mechanism only handles anon vmas. Is there a
> fundamental reason why it couldn't handle mapped files too ?
> My understanding is that the mechanism of verifying the vma after
> taking back the ptl at the end of the fault would work there too ?
> The file has to stay referenced during the fault, but holding the vma's
> refcount could be made to cover that ? the vm_file refcount would have
> to be released in __free_vma() instead of remove_vma; I'm not quite sure
> if that has more implications than I realize ?

The only concern is the flow of operations done in the vm_ops->fault()
processing. Most file systems rely on the generic filemap_fault(),
which should be safe to use. But we need a clever way to identify fault
handlers which are compatible with the SPF handler. This could be done
using a tag/flag in the vm_ops structure or in the vma's flags.

This would be the next step.
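
A rough sketch of what such a check could look like; the spf_safe field is a
pure assumption used for illustration (nothing like it exists in the posted
series), and the choice between a vm_ops flag and a vma flag is still open:

static bool vma_spf_compatible(struct vm_area_struct *vma)
{
	/* anonymous mappings are already handled by the series */
	if (vma_is_anonymous(vma))
		return true;
	if (!vma->vm_ops || !vma->vm_ops->fault)
		return true;
	/*
	 * Hypothetical flag, set only by fault handlers known not to
	 * release the mmap_sem (e.g. those built on filemap_fault()).
	 */
	return vma->vm_ops->spf_safe;
}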


> The proposed spf mechanism only works at the pte level after the page
> tables have already been created. The non-spf page fault path takes the
> mm->page_table_lock to protect against concurrent page table allocation
> by multiple page faults; I think unmapping/freeing page tables could
> be done under mm->page_table_lock too so that spf could implement
> allocating new page tables by verifying the vma after taking the
> mm->page_table_lock ?

I have to admit that I didn't dig further here.
Do you have a patch? ;)
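
Just to make the idea concrete, a very rough sketch of allocating a missing PMD
from the speculative path could look like the following; spf_pmd_alloc() and the
exact locking/ordering details are assumptions for illustration only, not code
from the series:

static pmd_t *spf_pmd_alloc(struct vm_fault *vmf, pud_t *pud)
{
	struct mm_struct *mm = vmf->vma->vm_mm;
	pmd_t *new = pmd_alloc_one(mm, vmf->address);

	if (!new)
		return NULL;

	smp_wmb();	/* same ordering concern as __pmd_alloc() */

	spin_lock(&mm->page_table_lock);
	/*
	 * Page table freeing would have to take this lock too, so
	 * re-checking the VMA under it makes the allocation safe.
	 */
	if (vma_has_changed(vmf)) {
		spin_unlock(&mm->page_table_lock);
		pmd_free(mm, new);
		return NULL;	/* fall back to the classic fault */
	}
	if (!pud_present(*pud)) {
		mm_inc_nr_pmds(mm);
		pud_populate(mm, pud, new);
	} else {
		pmd_free(mm, new);	/* someone else populated it */
	}
	spin_unlock(&mm->page_table_lock);

	return pmd_offset(pud, vmf->address);
}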

>
> The proposed spf mechanism depends on ARCH_HAS_PTE_SPECIAL.
> I am not sure what is the issue there - is this due to the vma->vm_start
> and vma->vm_pgoff reads in *__vm_normal_page() ?

Yes, that's the reason: there is no way to guarantee the value of these fields in
the SPF path.
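
For context, this is roughly the code in question: an abridged excerpt of the
!ARCH_HAS_PTE_SPECIAL branch of vm_normal_page() (quoted from memory, so treat
it as a sketch rather than the exact upstream code), where vm_start and
vm_pgoff are read with no protection against a concurrent mremap():

	if (unlikely(vma->vm_flags & (VM_PFNMAP | VM_MIXEDMAP))) {
		if (vma->vm_flags & VM_MIXEDMAP) {
			if (!pfn_valid(pfn))
				return NULL;
		} else {
			unsigned long off;

			/* both fields may change under mremap() during SPF */
			off = (addr - vma->vm_start) >> PAGE_SHIFT;
			if (pfn == vma->vm_pgoff + off)
				return NULL;
			if (!is_cow_mapping(vma->vm_flags))
				return NULL;
		}
	}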

>
> My last potential concern is about performance. The numbers you have
> look great, but I worry about potential regressions in PF performance
> for threaded processes that don't currently encounter contention
> (i.e. there may be just one thread actually doing all the work while
> the others are blocked). I think one good proxy for measuring that
> would be to measure a single threaded workload - kernbench would be
> fine - without the special-case optimization in patch 22 where
> handle_speculative_fault() immediately aborts in the single-threaded case.

I'll have to give it a try.

> Reviewed-by: Michel Lespinasse <[email protected]>
> This is for the series as a whole; I expect to do another review pass on
> individual commits in the series when we have agreement on the toplevel
> stuff (I noticed a few things like out-of-date commit messages but that's
> really minor stuff).

Thanks a lot for reviewing this long series.

>
> I want to add a note about mmap_sem. In the past there has been
> discussions about replacing it with an interval lock, but these never
> went anywhere because, mostly, of the fact that such mechanisms were
> too expensive to use in the page fault path. I think adding the spf
> mechanism would invite us to revisit this issue - interval locks may
> be a great way to avoid blocking between unrelated mmap_sem writers
> (for example, do not delay stack creation for new threads while a
> large mmap or munmap may be going on), and probably also to handle
> mmap_sem readers that can't easily use the spf mechanism (for example,
> gup callers which make use of the returned vmas). But again that is a
> separate topic to explore which doesn't have to get resolved before
> spf goes in.
>

2019-04-27 01:54:57

by Michel Lespinasse

[permalink] [raw]
Subject: Re: [PATCH v12 00/31] Speculative page faults

On Wed, Apr 24, 2019 at 09:33:44AM +0200, Laurent Dufour wrote:
> Le 23/04/2019 à 11:38, Peter Zijlstra a écrit :
> > On Mon, Apr 22, 2019 at 02:29:16PM -0700, Michel Lespinasse wrote:
> > > The proposed spf mechanism only handles anon vmas. Is there a
> > > fundamental reason why it couldn't handle mapped files too ?
> > > My understanding is that the mechanism of verifying the vma after
> > > taking back the ptl at the end of the fault would work there too ?
> > > The file has to stay referenced during the fault, but holding the vma's
> > > refcount could be made to cover that ? the vm_file refcount would have
> > > to be released in __free_vma() instead of remove_vma; I'm not quite sure
> > > if that has more implications than I realize ?
> >
> > IIRC (and I really don't remember all that much) the trickiest bit was
> > vs unmount. Since files can stay open past the 'expected' duration,
> > umount could be delayed.
> >
> > But yes, I think I had a version that did all that just 'fine'. Like
> > mentioned, I didn't keep the refcount because it sucked just as hard as
> > the mmap_sem contention, but the SRCU callback did the fput() just fine
> > (esp. now that we have delayed_fput).
>
> I had to use a refcount for the VMA because I'm using RCU in place of SRCU
> and only protecting the RB tree using RCU.
>
> Regarding the file pointer, I decided to release it synchronously to avoid
> the latency of RCU during the file closing. As you mentioned, this could
> delay the umount, and not only that, as Linus Torvalds demonstrated in the past
> [1]. Anyway, since the file support is not here yet, there is no need for
> that currently.
>
> [1] https://lore.kernel.org/linux-mm/[email protected]/

Just to make sure I understand this correctly. If a program tries to
munmap a region while page faults are occurring (which means that the
program has a race condition in the first place), before spf the
mmap_sem would delay the munmap until the page fault completes. With
spf the munmap will happen immediately, while the vm_ops->fault()
is running, with spf holding a ref to the file. vm_ops->fault is
expected to execute a read from the file to the page cache, and the
page cache page will never be mapped into the process because after
taking the ptl, spf will notice the vma changed. So, the side effects
that may be observed after munmap completes would be:

- side effects from reading a file into the page cache - I'm not sure
what they are, the main one I can think of is that userspace may observe
the file's atime changing ?

- side effects from holding a reference to the file - which userspace
may observe by trying to unmount().

Is that the extent of the side effects, or are there more that I have
not thought of ?

> Regarding the file mapping support, the concern is to ensure that
> vm_ops->fault() will not try to release the mmap_sem. This is true for most
> file systems, which use the generic implementation, but there is currently
> no clever way to identify that except by checking the vm_ops->fault pointer.
> Adding a flag to the vm_operations_struct structure is another option.
>
> That's doable as long as the underlying fault() function is not dealing with
> the mmap_sem. I made an attempt in the past, but I was thinking that the
> anonymous case should be accepted first before moving forward this way.

Yes, that makes sense. Updating all of the fault handlers would be a
lot of work - but there doesn't seem to be anything fundamental that
wouldn't work there (except for the side effects of reordering spf
against munmap, as discussed above, which doesn't look easy to fully hide).

--
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.

2019-04-27 06:02:43

by Michel Lespinasse

[permalink] [raw]
Subject: Re: [PATCH v12 00/31] Speculative page faults

On Wed, Apr 24, 2019 at 08:01:20PM +0200, Laurent Dufour wrote:
> Le 22/04/2019 à 23:29, Michel Lespinasse a écrit :
> > Hi Laurent,
> >
> > Thanks a lot for copying me on this patchset. It took me a few days to
> > go through it - I had not been following the previous iterations of
> > this series so I had to catch up. I will be sending comments for
> > individual commits, but before that I would like to discuss the series
> > as a whole.
>
> Hi Michel,
>
> Thanks for reviewing this series.
>
> > I think these changes are a big step in the right direction. My main
> > reservation about them is that they are additive - adding some complexity
> > for speculative page faults - and I wonder if it'd be possible, over the
> > long term, to replace the existing complexity we have in mmap_sem retry
> > mechanisms instead of adding to it. This is not something that should
> > block your progress, but I think it would be good, as we introduce spf,
> > to evaluate whether we could eventually get all the way to removing the
> > mmap_sem retry mechanism, or if we will actually have to keep both.
>
> Until we get rid of the mmap_sem which seems to be a very long story, I
> can't see how we could get rid of the retry mechanism.

Short answer: I'd like spf to be extended to handle file vmas,
populating page tables, and the vm_normal_page thing, so that we
wouldn't have to fall back to the path that grabs (and possibly
has to drop) the read side mmap_sem.

Even doing the above, there are still cases spf can't solve - for
example, gup, or the occasional spf abort, or even the case of a large
mmap/munmap delaying a smaller one. I think replacing mmap_sem with a
reader/writer interval lock would be a very nice generic solution to
this problem, allowing false conflicts to proceed in parallel, while
synchronizing true conflicts which is exactly what we want. But I
don't think such a lock can be implemented efficiently enough to be
put on the page fault fast-path, so I think spf could be the solution
there - it would allow us to skip taking that interval lock on most
page faults. The other places where we use mmap_sem are not as critical
for performance (they normally operate on a larger region at a time)
so I think we could afford the interval lock in those places.

--
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.

2019-06-06 06:45:00

by Haiyan Song

[permalink] [raw]
Subject: Re: [PATCH v12 00/31] Speculative page faults

Hi Laurent,

Regression tests for the v12 patch series have been run on an Intel 2-socket Skylake
platform; some regressions were found by LKP-tools (Linux Kernel Performance). Only the
cases that had been run and had shown regressions on the v11 patch series were tested.

The patch series was taken from https://github.com/ldu4/linux/tree/spf-v12.
Kernel commit:
base: a297558ad4479e0c9c5c14f3f69fe43113f72d1c (v5.1-rc4-mmotm-2019-04-09-17-51)
head: 02c5a1f984a8061d075cfd74986ac8aa01d81064 (spf-v12)

Benchmark: will-it-scale
Download link: https://github.com/antonblanchard/will-it-scale/tree/master
Metrics: will-it-scale.per_thread_ops=threads/nr_cpu
test box: lkp-skl-2sp8(nr_cpu=72,memory=192G)
THP: enable / disable
nr_task: 100%

The following are the benchmark results; every case was tested 4 times.

a). Enable THP
base %stddev change head %stddev
will-it-scale.page_fault3.per_thread_ops 63216 ±3% -16.9% 52537 ±4%
will-it-scale.page_fault2.per_thread_ops 36862 -9.8% 33256

b). Disable THP
base %stddev change head %stddev
will-it-scale.page_fault3.per_thread_ops 65111 -18.6% 53023 ±2%
will-it-scale.page_fault2.per_thread_ops 38164 -12.0% 33565

Best regards,
Haiyan Song

On Tue, Apr 16, 2019 at 03:44:51PM +0200, Laurent Dufour wrote:
> This is a port on kernel 5.1 of the work done by Peter Zijlstra to handle
> page fault without holding the mm semaphore [1].
>
> The idea is to try to handle user space page faults without holding the
> mmap_sem. This should allow better concurrency for massively threaded
> process since the page fault handler will not wait for other threads memory
> layout change to be done, assuming that this change is done in another part
> of the process's memory space. This type of page fault is named speculative
> page fault. If the speculative page fault fails because a concurrency has
> been detected or because underlying PMD or PTE tables are not yet
> allocating, it is failing its processing and a regular page fault is then
> tried.
>
> The speculative page fault (SPF) has to look for the VMA matching the fault
> address without holding the mmap_sem, this is done by protecting the MM RB
> tree with RCU and by using a reference counter on each VMA. When fetching a
> VMA under the RCU protection, the VMA's reference counter is incremented to
> ensure that the VMA will not freed in our back during the SPF
> processing. Once that processing is done the VMA's reference counter is
> decremented. To ensure that a VMA is still present when walking the RB tree
> locklessly, the VMA's reference counter is incremented when that VMA is
> linked in the RB tree. When the VMA is unlinked from the RB tree, its
> reference counter will be decremented at the end of the RCU grace period,
> ensuring it will be available during this time. This means that the VMA
> freeing could be delayed and could delay the file closing for file
> mapping. Since the SPF handler is not able to manage file mapping, file is
> closed synchronously and not during the RCU cleaning. This is safe since
> the page fault handler is aborting if a file pointer is associated to the
> VMA.
>
> Using RCU fixes the overhead seen by Haiyan Song using the will-it-scale
> benchmark [2].
>
> The VMA's attributes checked during the speculative page fault processing
> have to be protected against parallel changes. This is done by using a per
> VMA sequence lock. This sequence lock allows the speculative page fault
> handler to fast check for parallel changes in progress and to abort the
> speculative page fault in that case.
>
> Once the VMA has been found, the speculative page fault handler would check
> for the VMA's attributes to verify that the page fault has to be handled
> correctly or not. Thus, the VMA is protected through a sequence lock which
> allows fast detection of concurrent VMA changes. If such a change is
> detected, the speculative page fault is aborted and a *classic* page fault
> is tried. VMA sequence lockings are added when VMA attributes which are
> checked during the page fault are modified.
>
> When the PTE is fetched, the VMA is checked to see if it has been changed,
> so once the page table is locked, the VMA is valid, so any other changes
> leading to touching this PTE will need to lock the page table, so no
> parallel change is possible at this time.
>
> The locking of the PTE is done with interrupts disabled, this allows
> checking for the PMD to ensure that there is not an ongoing collapsing
> operation. Since khugepaged is firstly set the PMD to pmd_none and then is
> waiting for the other CPU to have caught the IPI interrupt, if the pmd is
> valid at the time the PTE is locked, we have the guarantee that the
> collapsing operation will have to wait on the PTE lock to move
> forward. This allows the SPF handler to map the PTE safely. If the PMD
> value is different from the one recorded at the beginning of the SPF
> operation, the classic page fault handler will be called to handle the
> operation while holding the mmap_sem. As the PTE lock is done with the
> interrupts disabled, the lock is done using spin_trylock() to avoid dead
> lock when handling a page fault while a TLB invalidate is requested by
> another CPU holding the PTE.
>
> In pseudo code, this could be seen as:
> speculative_page_fault()
> {
> vma = find_vma_rcu()
> check vma sequence count
> check vma's support
> disable interrupt
> check pgd,p4d,...,pte
> save pmd and pte in vmf
> save vma sequence counter in vmf
> enable interrupt
> check vma sequence count
> handle_pte_fault(vma)
> ..
> page = alloc_page()
> pte_map_lock()
> disable interrupt
> abort if sequence counter has changed
> abort if pmd or pte has changed
> pte map and lock
> enable interrupt
> if abort
> free page
> abort
> ...
> put_vma(vma)
> }
>
> arch_fault_handler()
> {
> if (speculative_page_fault(&vma))
> goto done
> again:
> lock(mmap_sem)
> vma = find_vma();
> handle_pte_fault(vma);
> if retry
> unlock(mmap_sem)
> goto again;
> done:
> handle fault error
> }
>
> Support for THP is not done because when checking for the PMD, we can be
> confused by an in progress collapsing operation done by khugepaged. The
> issue is that pmd_none() could be true either if the PMD is not already
> populated or if the underlying PTE are in the way to be collapsed. So we
> cannot safely allocate a PMD if pmd_none() is true.
>
> This series add a new software performance event named 'speculative-faults'
> or 'spf'. It counts the number of successful page fault event handled
> speculatively. When recording 'faults,spf' events, the faults one is
> counting the total number of page fault events while 'spf' is only counting
> the part of the faults processed speculatively.
>
> There are some trace events introduced by this series. They allow
> identifying why the page faults were not processed speculatively. This
> doesn't take in account the faults generated by a monothreaded process
> which directly processed while holding the mmap_sem. This trace events are
> grouped in a system named 'pagefault', they are:
>
> - pagefault:spf_vma_changed : if the VMA has been changed in our back
> - pagefault:spf_vma_noanon : the vma->anon_vma field was not yet set.
> - pagefault:spf_vma_notsup : the VMA's type is not supported
> - pagefault:spf_vma_access : the VMA's access right are not respected
> - pagefault:spf_pmd_changed : the upper PMD pointer has changed in our
> back.
>
> To record all the related events, the easier is to run perf with the
> following arguments :
> $ perf stat -e 'faults,spf,pagefault:*' <command>
>
> There is also a dedicated vmstat counter showing the number of successful
> page fault handled speculatively. I can be seen this way:
> $ grep speculative_pgfault /proc/vmstat
>
> It is possible to deactivate the speculative page fault handler by echoing
> 0 in /proc/sys/vm/speculative_page_fault.
>
> This series builds on top of v5.1-rc4-mmotm-2019-04-09-17-51 and is
> functional on x86, PowerPC. I cross built it on arm64 but I was not able to
> test it.
>
> This series is also available on github [4].
>
> ---------------------
> Real Workload results
>
> Test using a "popular in memory multithreaded database product" on 128cores
> SMT8 Power system are in progress and I will come back with performance
> mesurement as soon as possible. With the previous series we seen up to 30%
> improvements in the number of transaction processed per second, and we hope
> this will be the case with this series too.
>
> ------------------
> Benchmarks results
>
> Base kernel is v5.1-rc4-mmotm-2019-04-09-17-51
> SPF is BASE + this series
>
> Kernbench:
> ----------
> Here are the results on a 48 CPUs X86 system using kernbench on a 5.0
> kernel (kernel is build 5 times):
>
> Average Half load -j 24
> Run (std deviation)
> BASE SPF
> Elapsed Time 56.52 (1.39185) 56.256 (1.15106) 0.47%
> User Time 980.018 (2.94734) 984.958 (1.98518) -0.50%
> System Time 130.744 (1.19148) 133.616 (0.873573) -2.20%
> Percent CPU 1965.6 (49.682) 1988.4 (40.035) -1.16%
> Context Switches 29926.6 (272.789) 30472.4 (109.569) -1.82%
> Sleeps 124793 (415.87) 125003 (591.008) -0.17%
>
> Average Optimal load -j 48
> Run (std deviation)
> BASE SPF
> Elapsed Time 46.354 (0.917949) 45.968 (1.42786) 0.83%
> User Time 1193.42 (224.96) 1196.78 (223.28) -0.28%
> System Time 143.306 (13.2726) 146.177 (13.2659) -2.00%
> Percent CPU 2668.6 (743.157) 2699.9 (753.767) -1.17%
> Context Switches 62268.3 (34097.1) 62721.7 (33999.1) -0.73%
> Sleeps 132556 (8222.99) 132607 (8077.6) -0.04%
>
> During a run on the SPF, perf events were captured:
> Performance counter stats for '../kernbench -M':
> 525,873,132 faults
> 242 spf
> 0 pagefault:spf_vma_changed
> 0 pagefault:spf_vma_noanon
> 441 pagefault:spf_vma_notsup
> 0 pagefault:spf_vma_access
> 0 pagefault:spf_pmd_changed
>
> Very few speculative page faults were recorded as most of the processes
> involved are monothreaded (sounds that on this architecture some threads
> were created during the kernel build processing).
>
> Here are the kerbench results on a 1024 CPUs Power8 VM:
>
> 5.1.0-rc4-mm1+ 5.1.0-rc4-mm1-spf-rcu+
> Average Half load -j 512 Run (std deviation):
> Elapsed Time 52.52 (0.906697) 52.778 (0.510069) -0.49%
> User Time 3855.43 (76.378) 3890.44 (73.0466) -0.91%
> System Time 1977.24 (182.316) 1974.56 (166.097) 0.14%
> Percent CPU 11111.6 (540.461) 11115.2 (458.907) -0.03%
> Context Switches 83245.6 (3061.44) 83651.8 (1202.31) -0.49%
> Sleeps 613459 (23091.8) 628378 (27485.2) -2.43%
>
> Average Optimal load -j 1024 Run (std deviation):
> Elapsed Time 52.964 (0.572346) 53.132 (0.825694) -0.32%
> User Time 4058.22 (222.034) 4070.2 (201.646) -0.30%
> System Time 2672.81 (759.207) 2712.13 (797.292) -1.47%
> Percent CPU 12756.7 (1786.35) 12806.5 (1858.89) -0.39%
> Context Switches 88818.5 (6772) 87890.6 (5567.72) 1.04%
> Sleeps 618658 (20842.2) 636297 (25044) -2.85%
>
> During a run on the SPF, perf events were captured:
> Performance counter stats for '../kernbench -M':
> 149 375 832 faults
> 1 spf
> 0 pagefault:spf_vma_changed
> 0 pagefault:spf_vma_noanon
> 561 pagefault:spf_vma_notsup
> 0 pagefault:spf_vma_access
> 0 pagefault:spf_pmd_changed
>
> Most of the processes involved are monothreaded so SPF is not activated but
> there is no impact on the performance.
>
> Ebizzy:
> -------
> The test is counting the number of records per second it can manage, the
> higher is the best. I run it like this 'ebizzy -mTt <nrcpus>'. To get
> consistent result I repeated the test 100 times and measure the average
> result. The number is the record processes per second, the higher is the best.
>
> BASE SPF delta
> 24 CPUs x86 5492.69 9383.07 70.83%
> 1024 CPUS P8 VM 8476.74 17144.38 102%
>
> Here are the performance counter read during a run on a 48 CPUs x86 node:
> Performance counter stats for './ebizzy -mTt 48':
> 11,846,569 faults
> 10,886,706 spf
> 957,702 pagefault:spf_vma_changed
> 0 pagefault:spf_vma_noanon
> 815 pagefault:spf_vma_notsup
> 0 pagefault:spf_vma_access
> 0 pagefault:spf_pmd_changed
>
> And the ones captured during a run on a 1024 CPUs Power VM:
> Performance counter stats for './ebizzy -mTt 1024':
> 1 359 789 faults
> 1 284 910 spf
> 72 085 pagefault:spf_vma_changed
> 0 pagefault:spf_vma_noanon
> 2 669 pagefault:spf_vma_notsup
> 0 pagefault:spf_vma_access
> 0 pagefault:spf_pmd_changed
>
> In ebizzy's case most of the page fault were handled in a speculative way,
> leading the ebizzy performance boost.
>
> ------------------
> Changes since v11 [3]
> - Check vm_ops.fault instead of vm_ops since now all the VMA as a vm_ops.
> - Abort speculative page fault when doing swap readhead because VMA's
> boundaries are not protected at this time. Doing this the first swap in
> is doing a readhead, the next fault should be handled in a speculative
> way as the page is present in the swap read page.
> - Handle a race between copy_pte_range() and the wp_page_copy called by
> the speculative page fault handler.
> - Ported to Kernel v5.0
> - Moved VM_FAULT_PTNOTSAME define in mm_types.h
> - Use RCU to protect the MM RB tree instead of a rwlock.
> - Add a toggle interface: /proc/sys/vm/speculative_page_fault
>
> [1] https://lore.kernel.org/linux-mm/[email protected]/
> [2] https://lore.kernel.org/linux-mm/9FE19350E8A7EE45B64D8D63D368C8966B847F54@SHSMSX101.ccr.corp.intel.com/
> [3] https://lore.kernel.org/linux-mm/[email protected]/
> [4] https://github.com/ldu4/linux/tree/spf-v12
>
> Laurent Dufour (25):
> mm: introduce CONFIG_SPECULATIVE_PAGE_FAULT
> x86/mm: define ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT
> powerpc/mm: set ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT
> mm: introduce pte_spinlock for FAULT_FLAG_SPECULATIVE
> mm: make pte_unmap_same compatible with SPF
> mm: introduce INIT_VMA()
> mm: protect VMA modifications using VMA sequence count
> mm: protect mremap() against SPF hanlder
> mm: protect SPF handler against anon_vma changes
> mm: cache some VMA fields in the vm_fault structure
> mm/migrate: Pass vm_fault pointer to migrate_misplaced_page()
> mm: introduce __lru_cache_add_active_or_unevictable
> mm: introduce __vm_normal_page()
> mm: introduce __page_add_new_anon_rmap()
> mm: protect against PTE changes done by dup_mmap()
> mm: protect the RB tree with a sequence lock
> mm: introduce vma reference counter
> mm: Introduce find_vma_rcu()
> mm: don't do swap readahead during speculative page fault
> mm: adding speculative page fault failure trace events
> perf: add a speculative page fault sw event
> perf tools: add support for the SPF perf event
> mm: add speculative page fault vmstats
> powerpc/mm: add speculative page fault
> mm: Add a speculative page fault switch in sysctl
>
> Mahendran Ganesh (2):
> arm64/mm: define ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT
> arm64/mm: add speculative page fault
>
> Peter Zijlstra (4):
> mm: prepare for FAULT_FLAG_SPECULATIVE
> mm: VMA sequence count
> mm: provide speculative fault infrastructure
> x86/mm: add speculative pagefault handling
>
> arch/arm64/Kconfig | 1 +
> arch/arm64/mm/fault.c | 12 +
> arch/powerpc/Kconfig | 1 +
> arch/powerpc/mm/fault.c | 16 +
> arch/x86/Kconfig | 1 +
> arch/x86/mm/fault.c | 14 +
> fs/exec.c | 1 +
> fs/proc/task_mmu.c | 5 +-
> fs/userfaultfd.c | 17 +-
> include/linux/hugetlb_inline.h | 2 +-
> include/linux/migrate.h | 4 +-
> include/linux/mm.h | 138 +++++-
> include/linux/mm_types.h | 16 +-
> include/linux/pagemap.h | 4 +-
> include/linux/rmap.h | 12 +-
> include/linux/swap.h | 10 +-
> include/linux/vm_event_item.h | 3 +
> include/trace/events/pagefault.h | 80 ++++
> include/uapi/linux/perf_event.h | 1 +
> kernel/fork.c | 35 +-
> kernel/sysctl.c | 9 +
> mm/Kconfig | 22 +
> mm/huge_memory.c | 6 +-
> mm/hugetlb.c | 2 +
> mm/init-mm.c | 3 +
> mm/internal.h | 45 ++
> mm/khugepaged.c | 5 +
> mm/madvise.c | 6 +-
> mm/memory.c | 631 ++++++++++++++++++++++----
> mm/mempolicy.c | 51 ++-
> mm/migrate.c | 6 +-
> mm/mlock.c | 13 +-
> mm/mmap.c | 249 ++++++++--
> mm/mprotect.c | 4 +-
> mm/mremap.c | 13 +
> mm/nommu.c | 1 +
> mm/rmap.c | 5 +-
> mm/swap.c | 6 +-
> mm/swap_state.c | 10 +-
> mm/vmstat.c | 5 +-
> tools/include/uapi/linux/perf_event.h | 1 +
> tools/perf/util/evsel.c | 1 +
> tools/perf/util/parse-events.c | 4 +
> tools/perf/util/parse-events.l | 1 +
> tools/perf/util/python.c | 1 +
> 45 files changed, 1277 insertions(+), 196 deletions(-)
> create mode 100644 include/trace/events/pagefault.h
>
> --
> 2.21.0
>

2019-06-14 08:39:30

by Laurent Dufour

[permalink] [raw]
Subject: Re: [PATCH v12 00/31] Speculative page faults

Le 06/06/2019 à 08:51, Haiyan Song a écrit :
> Hi Laurent,
>
> Regression test for v12 patch serials have been run on Intel 2s skylake platform,
> some regressions were found by LKP-tools (linux kernel performance). Only tested the
> cases that have been run and found regressions on v11 patch serials.
>
> Get the patch serials from https://github.com/ldu4/linux/tree/spf-v12.
> Kernel commit:
> base: a297558ad4479e0c9c5c14f3f69fe43113f72d1c (v5.1-rc4-mmotm-2019-04-09-17-51)
> head: 02c5a1f984a8061d075cfd74986ac8aa01d81064 (spf-v12)
>
> Benchmark: will-it-scale
> Download link: https://github.com/antonblanchard/will-it-scale/tree/master
> Metrics: will-it-scale.per_thread_ops=threads/nr_cpu
> test box: lkp-skl-2sp8(nr_cpu=72,memory=192G)
> THP: enable / disable
> nr_task: 100%
>
> The following is benchmark results, tested 4 times for every case.
>
> a). Enable THP
> base %stddev change head %stddev
> will-it-scale.page_fault3.per_thread_ops 63216 ±3% -16.9% 52537 ±4%
> will-it-scale.page_fault2.per_thread_ops 36862 -9.8% 33256
>
> b). Disable THP
> base %stddev change head %stddev
> will-it-scale.page_fault3.per_thread_ops 65111 -18.6% 53023 ±2%
> will-it-scale.page_fault2.per_thread_ops 38164 -12.0% 33565

Hi Haiyan,

Thanks for running this tests on your systems.

I did the same tests on my systems (x86 and PowerPC) and I didn't get the same numbers.
My x86 system has fewer CPUs but a larger amount of memory; I don't think this matters
much since my numbers are far from yours.

x86_64 48CPUs 755G
5.1.0-rc4-mm1 5.1.0-rc4-mm1-spf
page_fault2_threads SPF OFF SPF ON
THP always 2200902.3 [5%] 2152618.8 -2% [4%] 2136316 -3% [7%]
THP never 2185616.5 [6%] 2099274.2 -4% [3%] 2123275.1 -3% [7%]

5.1.0-rc4-mm1 5.1.0-rc4-mm1-spf
page_fault3_threads SPF OFF SPF ON
THP always 2700078.7 [5%] 2789437.1 +3% [4%] 2944806.8 +12% [3%]
THP never 2625756.7 [4%] 2944806.8 +12% [8%] 2876525.5 +10% [4%]

PowerPC P8 80CPUs 31G
5.1.0-rc4-mm1 5.1.0-rc4-mm1-spf
page_fault2_threads SPF OFF SPF ON
THP always 171732 [0%] 170762.8 -1% [0%] 170450.9 -1% [0%]
THP never 171808.4 [0%] 170600.3 -1% [0%] 170231.6 -1% [0%]

5.1.0-rc4-mm1 5.1.0-rc4-mm1-spf
page_fault3_threads SPF OFF SPF ON
THP always 2499.6 [13%] 2624.5 +5% [11%] 2734.5 +9% [3%]
THP never 2732.5 [2%] 2791.1 +2% [1%] 2695 -3% [4%]

Numbers in brackets are the standard deviation percentage.

I ran each test 10 times and then computed the average and deviation.

Please find attached the script I ran to get these numbers.
It would be nice if you could give it a try on your victim node and share the results.

Thanks,
Laurent.

> Best regards,
> Haiyan Song
>
> On Tue, Apr 16, 2019 at 03:44:51PM +0200, Laurent Dufour wrote:
>> This is a port on kernel 5.1 of the work done by Peter Zijlstra to handle
>> page fault without holding the mm semaphore [1].
>>
>> The idea is to try to handle user space page faults without holding the
>> mmap_sem. This should allow better concurrency for massively threaded
>> process since the page fault handler will not wait for other threads memory
>> layout change to be done, assuming that this change is done in another part
>> of the process's memory space. This type of page fault is named speculative
>> page fault. If the speculative page fault fails because a concurrency has
>> been detected or because underlying PMD or PTE tables are not yet
>> allocating, it is failing its processing and a regular page fault is then
>> tried.
>>
>> The speculative page fault (SPF) has to look for the VMA matching the fault
>> address without holding the mmap_sem, this is done by protecting the MM RB
>> tree with RCU and by using a reference counter on each VMA. When fetching a
>> VMA under the RCU protection, the VMA's reference counter is incremented to
>> ensure that the VMA will not freed in our back during the SPF
>> processing. Once that processing is done the VMA's reference counter is
>> decremented. To ensure that a VMA is still present when walking the RB tree
>> locklessly, the VMA's reference counter is incremented when that VMA is
>> linked in the RB tree. When the VMA is unlinked from the RB tree, its
>> reference counter will be decremented at the end of the RCU grace period,
>> ensuring it will be available during this time. This means that the VMA
>> freeing could be delayed and could delay the file closing for file
>> mapping. Since the SPF handler is not able to manage file mapping, file is
>> closed synchronously and not during the RCU cleaning. This is safe since
>> the page fault handler is aborting if a file pointer is associated to the
>> VMA.
>>
>> Using RCU fixes the overhead seen by Haiyan Song using the will-it-scale
>> benchmark [2].
>>
>> The VMA's attributes checked during the speculative page fault processing
>> have to be protected against parallel changes. This is done by using a per
>> VMA sequence lock. This sequence lock allows the speculative page fault
>> handler to fast check for parallel changes in progress and to abort the
>> speculative page fault in that case.
>>
>> Once the VMA has been found, the speculative page fault handler would check
>> for the VMA's attributes to verify that the page fault has to be handled
>> correctly or not. Thus, the VMA is protected through a sequence lock which
>> allows fast detection of concurrent VMA changes. If such a change is
>> detected, the speculative page fault is aborted and a *classic* page fault
>> is tried. VMA sequence lockings are added when VMA attributes which are
>> checked during the page fault are modified.
>>
>> When the PTE is fetched, the VMA is checked to see if it has been changed,
>> so once the page table is locked, the VMA is valid, so any other changes
>> leading to touching this PTE will need to lock the page table, so no
>> parallel change is possible at this time.
>>
>> The locking of the PTE is done with interrupts disabled, this allows
>> checking for the PMD to ensure that there is not an ongoing collapsing
>> operation. Since khugepaged is firstly set the PMD to pmd_none and then is
>> waiting for the other CPU to have caught the IPI interrupt, if the pmd is
>> valid at the time the PTE is locked, we have the guarantee that the
>> collapsing operation will have to wait on the PTE lock to move
>> forward. This allows the SPF handler to map the PTE safely. If the PMD
>> value is different from the one recorded at the beginning of the SPF
>> operation, the classic page fault handler will be called to handle the
>> operation while holding the mmap_sem. As the PTE lock is done with the
>> interrupts disabled, the lock is done using spin_trylock() to avoid dead
>> lock when handling a page fault while a TLB invalidate is requested by
>> another CPU holding the PTE.
>>
>> In pseudo code, this could be seen as:
>> speculative_page_fault()
>> {
>> vma = find_vma_rcu()
>> check vma sequence count
>> check vma's support
>> disable interrupt
>> check pgd,p4d,...,pte
>> save pmd and pte in vmf
>> save vma sequence counter in vmf
>> enable interrupt
>> check vma sequence count
>> handle_pte_fault(vma)
>> ..
>> page = alloc_page()
>> pte_map_lock()
>> disable interrupt
>> abort if sequence counter has changed
>> abort if pmd or pte has changed
>> pte map and lock
>> enable interrupt
>> if abort
>> free page
>> abort
>> ...
>> put_vma(vma)
>> }
>>
>> arch_fault_handler()
>> {
>> if (speculative_page_fault(&vma))
>> goto done
>> again:
>> lock(mmap_sem)
>> vma = find_vma();
>> handle_pte_fault(vma);
>> if retry
>> unlock(mmap_sem)
>> goto again;
>> done:
>> handle fault error
>> }
>>
>> Support for THP is not done because when checking for the PMD, we can be
>> confused by an in progress collapsing operation done by khugepaged. The
>> issue is that pmd_none() could be true either if the PMD is not already
>> populated or if the underlying PTE are in the way to be collapsed. So we
>> cannot safely allocate a PMD if pmd_none() is true.
>>
>> This series add a new software performance event named 'speculative-faults'
>> or 'spf'. It counts the number of successful page fault event handled
>> speculatively. When recording 'faults,spf' events, the faults one is
>> counting the total number of page fault events while 'spf' is only counting
>> the part of the faults processed speculatively.
>>
>> There are some trace events introduced by this series. They allow
>> identifying why the page faults were not processed speculatively. This
>> doesn't take in account the faults generated by a monothreaded process
>> which directly processed while holding the mmap_sem. This trace events are
>> grouped in a system named 'pagefault', they are:
>>
>> - pagefault:spf_vma_changed : if the VMA has been changed in our back
>> - pagefault:spf_vma_noanon : the vma->anon_vma field was not yet set.
>> - pagefault:spf_vma_notsup : the VMA's type is not supported
>> - pagefault:spf_vma_access : the VMA's access right are not respected
>> - pagefault:spf_pmd_changed : the upper PMD pointer has changed in our
>> back.
>>
>> To record all the related events, the easier is to run perf with the
>> following arguments :
>> $ perf stat -e 'faults,spf,pagefault:*' <command>
>>
>> There is also a dedicated vmstat counter showing the number of successful
>> page fault handled speculatively. I can be seen this way:
>> $ grep speculative_pgfault /proc/vmstat
>>
>> It is possible to deactivate the speculative page fault handler by echoing
>> 0 in /proc/sys/vm/speculative_page_fault.
>>
>> This series builds on top of v5.1-rc4-mmotm-2019-04-09-17-51 and is
>> functional on x86, PowerPC. I cross built it on arm64 but I was not able to
>> test it.
>>
>> This series is also available on github [4].
>>
>> ---------------------
>> Real Workload results
>>
>> Test using a "popular in memory multithreaded database product" on 128cores
>> SMT8 Power system are in progress and I will come back with performance
>> mesurement as soon as possible. With the previous series we seen up to 30%
>> improvements in the number of transaction processed per second, and we hope
>> this will be the case with this series too.
>>
>> ------------------
>> Benchmarks results
>>
>> Base kernel is v5.1-rc4-mmotm-2019-04-09-17-51
>> SPF is BASE + this series
>>
>> Kernbench:
>> ----------
>> Here are the results on a 48 CPUs X86 system using kernbench on a 5.0
>> kernel (kernel is build 5 times):
>>
>> Average Half load -j 24
>> Run (std deviation)
>> BASE SPF
>> Elapsed Time 56.52 (1.39185) 56.256 (1.15106) 0.47%
>> User Time 980.018 (2.94734) 984.958 (1.98518) -0.50%
>> System Time 130.744 (1.19148) 133.616 (0.873573) -2.20%
>> Percent CPU 1965.6 (49.682) 1988.4 (40.035) -1.16%
>> Context Switches 29926.6 (272.789) 30472.4 (109.569) -1.82%
>> Sleeps 124793 (415.87) 125003 (591.008) -0.17%
>>
>> Average Optimal load -j 48
>> Run (std deviation)
>> BASE SPF
>> Elapsed Time 46.354 (0.917949) 45.968 (1.42786) 0.83%
>> User Time 1193.42 (224.96) 1196.78 (223.28) -0.28%
>> System Time 143.306 (13.2726) 146.177 (13.2659) -2.00%
>> Percent CPU 2668.6 (743.157) 2699.9 (753.767) -1.17%
>> Context Switches 62268.3 (34097.1) 62721.7 (33999.1) -0.73%
>> Sleeps 132556 (8222.99) 132607 (8077.6) -0.04%
>>
>> During a run on the SPF, perf events were captured:
>> Performance counter stats for '../kernbench -M':
>> 525,873,132 faults
>> 242 spf
>> 0 pagefault:spf_vma_changed
>> 0 pagefault:spf_vma_noanon
>> 441 pagefault:spf_vma_notsup
>> 0 pagefault:spf_vma_access
>> 0 pagefault:spf_pmd_changed
>>
>> Very few speculative page faults were recorded as most of the processes
>> involved are monothreaded (sounds that on this architecture some threads
>> were created during the kernel build processing).
>>
>> Here are the kerbench results on a 1024 CPUs Power8 VM:
>>
>> 5.1.0-rc4-mm1+ 5.1.0-rc4-mm1-spf-rcu+
>> Average Half load -j 512 Run (std deviation):
>> Elapsed Time 52.52 (0.906697) 52.778 (0.510069) -0.49%
>> User Time 3855.43 (76.378) 3890.44 (73.0466) -0.91%
>> System Time 1977.24 (182.316) 1974.56 (166.097) 0.14%
>> Percent CPU 11111.6 (540.461) 11115.2 (458.907) -0.03%
>> Context Switches 83245.6 (3061.44) 83651.8 (1202.31) -0.49%
>> Sleeps 613459 (23091.8) 628378 (27485.2) -2.43%
>>
>> Average Optimal load -j 1024 Run (std deviation):
>> Elapsed Time 52.964 (0.572346) 53.132 (0.825694) -0.32%
>> User Time 4058.22 (222.034) 4070.2 (201.646) -0.30%
>> System Time 2672.81 (759.207) 2712.13 (797.292) -1.47%
>> Percent CPU 12756.7 (1786.35) 12806.5 (1858.89) -0.39%
>> Context Switches 88818.5 (6772) 87890.6 (5567.72) 1.04%
>> Sleeps 618658 (20842.2) 636297 (25044) -2.85%
>>
>> During a run on the SPF, perf events were captured:
>> Performance counter stats for '../kernbench -M':
>> 149 375 832 faults
>> 1 spf
>> 0 pagefault:spf_vma_changed
>> 0 pagefault:spf_vma_noanon
>> 561 pagefault:spf_vma_notsup
>> 0 pagefault:spf_vma_access
>> 0 pagefault:spf_pmd_changed
>>
>> Most of the processes involved are monothreaded so SPF is not activated but
>> there is no impact on the performance.
>>
>> Ebizzy:
>> -------
>> The test is counting the number of records per second it can manage, the
>> higher is the best. I run it like this 'ebizzy -mTt <nrcpus>'. To get
>> consistent result I repeated the test 100 times and measure the average
>> result. The number is the record processes per second, the higher is the best.
>>
>> BASE SPF delta
>> 24 CPUs x86 5492.69 9383.07 70.83%
>> 1024 CPUS P8 VM 8476.74 17144.38 102%
>>
>> Here are the performance counter read during a run on a 48 CPUs x86 node:
>> Performance counter stats for './ebizzy -mTt 48':
>> 11,846,569 faults
>> 10,886,706 spf
>> 957,702 pagefault:spf_vma_changed
>> 0 pagefault:spf_vma_noanon
>> 815 pagefault:spf_vma_notsup
>> 0 pagefault:spf_vma_access
>> 0 pagefault:spf_pmd_changed
>>
>> And the ones captured during a run on a 1024 CPUs Power VM:
>> Performance counter stats for './ebizzy -mTt 1024':
>> 1 359 789 faults
>> 1 284 910 spf
>> 72 085 pagefault:spf_vma_changed
>> 0 pagefault:spf_vma_noanon
>> 2 669 pagefault:spf_vma_notsup
>> 0 pagefault:spf_vma_access
>> 0 pagefault:spf_pmd_changed
>>
>> In ebizzy's case most of the page fault were handled in a speculative way,
>> leading the ebizzy performance boost.
>>
>> ------------------
>> Changes since v11 [3]
>> - Check vm_ops.fault instead of vm_ops since now all the VMA as a vm_ops.
>> - Abort speculative page fault when doing swap readhead because VMA's
>> boundaries are not protected at this time. Doing this the first swap in
>> is doing a readhead, the next fault should be handled in a speculative
>> way as the page is present in the swap read page.
>> - Handle a race between copy_pte_range() and the wp_page_copy called by
>> the speculative page fault handler.
>> - Ported to Kernel v5.0
>> - Moved VM_FAULT_PTNOTSAME define in mm_types.h
>> - Use RCU to protect the MM RB tree instead of a rwlock.
>> - Add a toggle interface: /proc/sys/vm/speculative_page_fault
>>
>> [1] https://lore.kernel.org/linux-mm/[email protected]/
>> [2] https://lore.kernel.org/linux-mm/9FE19350E8A7EE45B64D8D63D368C8966B847F54@SHSMSX101.ccr.corp.intel.com/
>> [3] https://lore.kernel.org/linux-mm/[email protected]/
>> [4] https://github.com/ldu4/linux/tree/spf-v12
>>
>> Laurent Dufour (25):
>> mm: introduce CONFIG_SPECULATIVE_PAGE_FAULT
>> x86/mm: define ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT
>> powerpc/mm: set ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT
>> mm: introduce pte_spinlock for FAULT_FLAG_SPECULATIVE
>> mm: make pte_unmap_same compatible with SPF
>> mm: introduce INIT_VMA()
>> mm: protect VMA modifications using VMA sequence count
>> mm: protect mremap() against SPF hanlder
>> mm: protect SPF handler against anon_vma changes
>> mm: cache some VMA fields in the vm_fault structure
>> mm/migrate: Pass vm_fault pointer to migrate_misplaced_page()
>> mm: introduce __lru_cache_add_active_or_unevictable
>> mm: introduce __vm_normal_page()
>> mm: introduce __page_add_new_anon_rmap()
>> mm: protect against PTE changes done by dup_mmap()
>> mm: protect the RB tree with a sequence lock
>> mm: introduce vma reference counter
>> mm: Introduce find_vma_rcu()
>> mm: don't do swap readahead during speculative page fault
>> mm: adding speculative page fault failure trace events
>> perf: add a speculative page fault sw event
>> perf tools: add support for the SPF perf event
>> mm: add speculative page fault vmstats
>> powerpc/mm: add speculative page fault
>> mm: Add a speculative page fault switch in sysctl
>>
>> Mahendran Ganesh (2):
>> arm64/mm: define ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT
>> arm64/mm: add speculative page fault
>>
>> Peter Zijlstra (4):
>> mm: prepare for FAULT_FLAG_SPECULATIVE
>> mm: VMA sequence count
>> mm: provide speculative fault infrastructure
>> x86/mm: add speculative pagefault handling
>>
>> arch/arm64/Kconfig | 1 +
>> arch/arm64/mm/fault.c | 12 +
>> arch/powerpc/Kconfig | 1 +
>> arch/powerpc/mm/fault.c | 16 +
>> arch/x86/Kconfig | 1 +
>> arch/x86/mm/fault.c | 14 +
>> fs/exec.c | 1 +
>> fs/proc/task_mmu.c | 5 +-
>> fs/userfaultfd.c | 17 +-
>> include/linux/hugetlb_inline.h | 2 +-
>> include/linux/migrate.h | 4 +-
>> include/linux/mm.h | 138 +++++-
>> include/linux/mm_types.h | 16 +-
>> include/linux/pagemap.h | 4 +-
>> include/linux/rmap.h | 12 +-
>> include/linux/swap.h | 10 +-
>> include/linux/vm_event_item.h | 3 +
>> include/trace/events/pagefault.h | 80 ++++
>> include/uapi/linux/perf_event.h | 1 +
>> kernel/fork.c | 35 +-
>> kernel/sysctl.c | 9 +
>> mm/Kconfig | 22 +
>> mm/huge_memory.c | 6 +-
>> mm/hugetlb.c | 2 +
>> mm/init-mm.c | 3 +
>> mm/internal.h | 45 ++
>> mm/khugepaged.c | 5 +
>> mm/madvise.c | 6 +-
>> mm/memory.c | 631 ++++++++++++++++++++++----
>> mm/mempolicy.c | 51 ++-
>> mm/migrate.c | 6 +-
>> mm/mlock.c | 13 +-
>> mm/mmap.c | 249 ++++++++--
>> mm/mprotect.c | 4 +-
>> mm/mremap.c | 13 +
>> mm/nommu.c | 1 +
>> mm/rmap.c | 5 +-
>> mm/swap.c | 6 +-
>> mm/swap_state.c | 10 +-
>> mm/vmstat.c | 5 +-
>> tools/include/uapi/linux/perf_event.h | 1 +
>> tools/perf/util/evsel.c | 1 +
>> tools/perf/util/parse-events.c | 4 +
>> tools/perf/util/parse-events.l | 1 +
>> tools/perf/util/python.c | 1 +
>> 45 files changed, 1277 insertions(+), 196 deletions(-)
>> create mode 100644 include/trace/events/pagefault.h
>>
>> --
>> 2.21.0
>>


Attachments:
runit.sh (1.00 kB)

2019-06-14 08:46:10

by Laurent Dufour

[permalink] [raw]
Subject: Re: [PATCH v12 00/31] Speculative page faults

Le 14/06/2019 à 10:37, Laurent Dufour a écrit :
> Please find attached the script I run to get these numbers.
> This would be nice if you could give it a try on your victim node and share the result.

It sounds like the Intel mail filtering system doesn't like the attached shell script.
Please find it there: https://gist.github.com/ldu4/a5cc1a93f293108ea387d43d5d5e7f44

Thanks,
Laurent.

2019-06-20 08:12:44

by Haiyan Song

[permalink] [raw]
Subject: Re: [PATCH v12 00/31] Speculative page faults

Hi Laurent,

I downloaded your script and ran it on an Intel 2-socket Skylake platform with the
spf-v12 patch series.

Here attached the output results of this script.

The following comparison results are statistics computed from the script outputs.

a). Enable THP
SPF_0 change SPF_1
will-it-scale.page_fault2.per_thread_ops 2664190.8 -11.7% 2353637.6
will-it-scale.page_fault3.per_thread_ops 4480027.2 -14.7% 3819331.9


b). Disable THP
SPF_0 change SPF_1
will-it-scale.page_fault2.per_thread_ops 2653260.7 -10% 2385165.8
will-it-scale.page_fault3.per_thread_ops 4436330.1 -12.4% 3886734.2


Thanks,
Haiyan Song


On Fri, Jun 14, 2019 at 10:44:47AM +0200, Laurent Dufour wrote:
> Le 14/06/2019 à 10:37, Laurent Dufour a écrit :
> > Please find attached the script I ran to get these numbers.
> > It would be nice if you could give it a try on your victim node and share the results.
>
> Sounds that the Intel mail fitering system doesn't like the attached shell script.
> Please find it there: https://gist.github.com/ldu4/a5cc1a93f293108ea387d43d5d5e7f44
>
> Thanks,
> Laurent.
>


Attachments:
(No filename) (1.26 kB)
page_fault2_threads.5.1.0-rc4-mm1-00300-g02c5a1f.out (761.00 B)
page_fault3_threads.5.1.0-rc4-mm1-00300-g02c5a1f.out (761.00 B)

2020-07-06 09:25:39

by Chinwen Chang

[permalink] [raw]
Subject: Re: [PATCH v12 00/31] Speculative page faults

On Thu, 2019-06-20 at 16:19 +0800, Haiyan Song wrote:
> Hi Laurent,
>
> I downloaded your script and run it on Intel 2s skylake platform with spf-v12 patch
> serials.
>
> Here attached the output results of this script.
>
> The following comparison result is statistics from the script outputs.
>
> a). Enable THP
> SPF_0 change SPF_1
> will-it-scale.page_fault2.per_thread_ops 2664190.8 -11.7% 2353637.6
> will-it-scale.page_fault3.per_thread_ops 4480027.2 -14.7% 3819331.9
>
>
> b). Disable THP
> SPF_0 change SPF_1
> will-it-scale.page_fault2.per_thread_ops 2653260.7 -10% 2385165.8
> will-it-scale.page_fault3.per_thread_ops 4436330.1 -12.4% 3886734.2
>
>
> Thanks,
> Haiyan Song
>
>
> On Fri, Jun 14, 2019 at 10:44:47AM +0200, Laurent Dufour wrote:
> > Le 14/06/2019 à 10:37, Laurent Dufour a écrit :
> > > Please find attached the script I run to get these numbers.
> > > This would be nice if you could give it a try on your victim node and share the result.
> >
> > Sounds that the Intel mail fitering system doesn't like the attached shell script.
> > Please find it there: https://gist.github.com/ldu4/a5cc1a93f293108ea387d43d5d5e7f44
> >
> > Thanks,
> > Laurent.
> >

Hi Laurent,

We merged SPF v11 and some patches from v12 into our platforms. After
several experiments, we observed that SPF brings obvious improvements in the
launch time of applications, especially for the high-TLP ones:

# launch time of applications (s):

package version w/ SPF w/o SPF improve(%)
------------------------------------------------------------------
Baidu maps 10.13.3 0.887 0.98 9.49
Taobao 8.4.0.35 1.227 1.293 5.10
Meituan 9.12.401 1.107 1.543 28.26
WeChat 7.0.3 2.353 2.68 12.20
Honor of Kings 1.43.1.6 6.63 6.713 1.24


By the way, we have verified our platforms with those patches and
achieved the goal of mass production.

Thanks.
Chinwen Chang

2020-07-06 12:31:55

by Laurent Dufour

[permalink] [raw]
Subject: Re: [PATCH v12 00/31] Speculative page faults

Le 06/07/2020 à 11:25, Chinwen Chang a écrit :
> On Thu, 2019-06-20 at 16:19 +0800, Haiyan Song wrote:
>> Hi Laurent,
>>
>> I downloaded your script and run it on Intel 2s skylake platform with spf-v12 patch
>> serials.
>>
>> Here attached the output results of this script.
>>
>> The following comparison result is statistics from the script outputs.
>>
>> a). Enable THP
>> SPF_0 change SPF_1
>> will-it-scale.page_fault2.per_thread_ops 2664190.8 -11.7% 2353637.6
>> will-it-scale.page_fault3.per_thread_ops 4480027.2 -14.7% 3819331.9
>>
>>
>> b). Disable THP
>> SPF_0 change SPF_1
>> will-it-scale.page_fault2.per_thread_ops 2653260.7 -10% 2385165.8
>> will-it-scale.page_fault3.per_thread_ops 4436330.1 -12.4% 3886734.2
>>
>>
>> Thanks,
>> Haiyan Song
>>
>>
>> On Fri, Jun 14, 2019 at 10:44:47AM +0200, Laurent Dufour wrote:
>>> Le 14/06/2019 à 10:37, Laurent Dufour a écrit :
>>>> Please find attached the script I run to get these numbers.
>>>> This would be nice if you could give it a try on your victim node and share the result.
>>>
>>> Sounds that the Intel mail fitering system doesn't like the attached shell script.
>>> Please find it there: https://gist.github.com/ldu4/a5cc1a93f293108ea387d43d5d5e7f44
>>>
>>> Thanks,
>>> Laurent.
>>>
>
> Hi Laurent,
>
> We merged SPF v11 and some patches from v12 into our platforms. After
> several experiments, we observed SPF has obvious improvements on the
> launch time of applications, especially for those high-TLP ones,
>
> # launch time of applications(s):
>
> package version w/ SPF w/o SPF improve(%)
> ------------------------------------------------------------------
> Baidu maps 10.13.3 0.887 0.98 9.49
> Taobao 8.4.0.35 1.227 1.293 5.10
> Meituan 9.12.401 1.107 1.543 28.26
> WeChat 7.0.3 2.353 2.68 12.20
> Honor of Kings 1.43.1.6 6.63 6.713 1.24

That's great news, thanks for reporting this!

>
> By the way, we have verified our platforms with those patches and
> achieved the goal of mass production.

Another good news!
For my information, what is your targeted hardware?

Cheers,
Laurent.

2020-07-07 05:33:22

by Chinwen Chang

[permalink] [raw]
Subject: Re: [PATCH v12 00/31] Speculative page faults

On Mon, 2020-07-06 at 14:27 +0200, Laurent Dufour wrote:
> Le 06/07/2020 à 11:25, Chinwen Chang a écrit :
> > On Thu, 2019-06-20 at 16:19 +0800, Haiyan Song wrote:
> >> Hi Laurent,
> >>
> >> I downloaded your script and run it on Intel 2s skylake platform with spf-v12 patch
> >> serials.
> >>
> >> Here attached the output results of this script.
> >>
> >> The following comparison result is statistics from the script outputs.
> >>
> >> a). Enable THP
> >> SPF_0 change SPF_1
> >> will-it-scale.page_fault2.per_thread_ops 2664190.8 -11.7% 2353637.6
> >> will-it-scale.page_fault3.per_thread_ops 4480027.2 -14.7% 3819331.9
> >>
> >>
> >> b). Disable THP
> >> SPF_0 change SPF_1
> >> will-it-scale.page_fault2.per_thread_ops 2653260.7 -10% 2385165.8
> >> will-it-scale.page_fault3.per_thread_ops 4436330.1 -12.4% 3886734.2
> >>
> >>
> >> Thanks,
> >> Haiyan Song
> >>
> >>
> >> On Fri, Jun 14, 2019 at 10:44:47AM +0200, Laurent Dufour wrote:
> >>> Le 14/06/2019 à 10:37, Laurent Dufour a écrit :
> >>>> Please find attached the script I run to get these numbers.
> >>>> This would be nice if you could give it a try on your victim node and share the result.
> >>>
> >>> Sounds that the Intel mail fitering system doesn't like the attached shell script.
> >>> Please find it there: https://urldefense.com/v3/__https://gist.github.com/ldu4/a5cc1a93f293108ea387d43d5d5e7f44__;!!CTRNKA9wMg0ARbw!0lux2FMCbIFxFEl824CdSuSQqT0IVWsvyUqfDVJNEVb9gTWyRltm7cpPZg70N_XhXmMZ$
> >>>
> >>> Thanks,
> >>> Laurent.
> >>>
> >
> > Hi Laurent,
> >
> > We merged SPF v11 and some patches from v12 into our platforms. After
> > several experiments, we observed SPF has obvious improvements on the
> > launch time of applications, especially for those high-TLP ones,
> >
> > # launch time of applications(s):
> >
> > package version w/ SPF w/o SPF improve(%)
> > ------------------------------------------------------------------
> > Baidu maps 10.13.3 0.887 0.98 9.49
> > Taobao 8.4.0.35 1.227 1.293 5.10
> > Meituan 9.12.401 1.107 1.543 28.26
> > WeChat 7.0.3 2.353 2.68 12.20
> > Honor of Kings 1.43.1.6 6.63 6.713 1.24
>
> That's great news, thanks for reporting this!
>
> >
> > By the way, we have verified our platforms with those patches and
> > achieved the goal of mass production.
>
> Another good news!
> For my information, what is your targeted hardware?
>
> Cheers,
> Laurent.

Hi Laurent,

Our targeted hardware belongs to the ARM64 multi-core series.

Thanks.
Chinwen

2020-12-14 10:49:44

by Laurent Dufour

[permalink] [raw]
Subject: Re: [PATCH v12 00/31] Speculative page faults

On 14/12/2020 at 03:03, Joel Fernandes wrote:
> On Tue, Jul 07, 2020 at 01:31:37PM +0800, Chinwen Chang wrote:
> [..]
>>>> Hi Laurent,
>>>>
>>>> We merged SPF v11 and some patches from v12 into our platforms. After
>>>> several experiments, we observed that SPF brings obvious improvements in
>>>> the launch time of applications, especially for high-TLP ones:
>>>>
>>>> # launch time of applications (s):
>>>>
>>>> package           version     w/ SPF   w/o SPF   improve(%)
>>>> ------------------------------------------------------------------
>>>> Baidu maps        10.13.3     0.887    0.98      9.49
>>>> Taobao            8.4.0.35    1.227    1.293     5.10
>>>> Meituan           9.12.401    1.107    1.543     28.26
>>>> WeChat            7.0.3       2.353    2.68      12.20
>>>> Honor of Kings    1.43.1.6    6.63     6.713     1.24
>>>
>>> That's great news, thanks for reporting this!
>>>
>>>>
>>>> By the way, we have verified our platforms with those patches and
>>>> achieved the goal of mass production.
>>>
>>> More good news!
>>> For my information, what is your targeted hardware?
>>>
>>> Cheers,
>>> Laurent.
>>
>> Hi Laurent,
>>
>> Our targeted hardware belongs to the ARM64 multi-core series.
>
> Hello!
>
> I was trying to develop an intuition about why SPF gives an improvement for
> you on systems with a small number of CPUs. This is just a high-level theory, but:
>
> 1. Assume the improvement is because of elimination of "blocking" on
> mmap_sem.
> Could it be that the mmap_sem is acquired in write-mode unnecessarily in some
> places, thus causing blocking on mmap_sem in other paths? If so, is it
> feasible to convert such usages to acquiring them in read-mode?

That's correct, and the goal of this series is to try not to hold the mmap_sem
in read mode during page fault processing.
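
As a rough sketch of that flow (the function names follow the series, but the
signatures and error handling below are simplified and approximate, not the
actual arch fault code):

/*
 * Simplified sketch only: flags handling, retries and the error paths of
 * the real arch fault handlers are omitted.
 */
static vm_fault_t do_user_page_fault(struct mm_struct *mm,
				     unsigned long address,
				     unsigned int flags)
{
	struct vm_area_struct *vma;
	vm_fault_t ret;

	/* First attempt: no mmap_sem taken at all. */
	ret = handle_speculative_fault(mm, address, flags);
	if (ret != VM_FAULT_RETRY)
		return ret;

	/* Concurrency detected, or an unsupported case: classic path. */
	down_read(&mm->mmap_sem);
	vma = find_vma(mm, address);
	ret = vma ? handle_mm_fault(vma, address, flags)
		  : VM_FAULT_SIGSEGV;
	up_read(&mm->mmap_sem);

	return ret;
}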

Converting mmap_sem holders from write to read mode is not so easy, and that
work has already been done in some places. If you think there are areas where
this could still be done, you're welcome to send patches fixing that.
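
For illustration, a conversion of that kind usually looks like the following
minimal, hypothetical example (vma_flags_at() is a made-up helper using the
5.1-era API, not a function from the series):

/*
 * Hypothetical helper: the mapping layout is only inspected, never
 * modified, so holding the mmap_sem in read mode is sufficient and
 * concurrent readers (including page faults) are not blocked.
 */
static unsigned long vma_flags_at(struct mm_struct *mm, unsigned long addr)
{
	struct vm_area_struct *vma;
	unsigned long flags = 0;

	down_read(&mm->mmap_sem);	/* was: down_write(&mm->mmap_sem) */
	vma = find_vma(mm, addr);
	if (vma && vma->vm_start <= addr)
		flags = vma->vm_flags;
	up_read(&mm->mmap_sem);		/* was: up_write(&mm->mmap_sem) */

	return flags;
}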

> 2. Assume the improvement is because of lesser read-side contention on
> mmap_sem.
> On small CPU systems, I would not expect reducing cache-line bouncing to give
> such a dramatic improvement in performance as you are seeing.

I don't think the reduction in cache-line bouncing is the main source of the
performance improvement; I would rather expect it to play a minor part here.
I guess this is mainly because a lot of page faults occur during loading time,
and SPF is thus reducing the contention on the mmap_sem.
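
As a very rough picture of the kind of workload meant here, a minimal
userspace sketch (not the will-it-scale benchmark nor an application loader;
build with cc -O2 -pthread):

/*
 * Many threads touching private, unrelated mappings at once, as happens
 * while a large application and its libraries are being faulted in at
 * launch.  The page size is assumed to be 4 KiB for simplicity.
 */
#include <pthread.h>
#include <stddef.h>
#include <sys/mman.h>

#define NTHREADS 8
#define MAP_SIZE (64UL << 20)		/* 64 MiB per thread */
#define PAGE_SZ  4096UL

static void *fault_in(void *arg)
{
	char *p = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	(void)arg;
	if (p == MAP_FAILED)
		return NULL;

	/* Every write below triggers an anonymous page fault. */
	for (size_t off = 0; off < MAP_SIZE; off += PAGE_SZ)
		p[off] = 1;

	munmap(p, MAP_SIZE);
	return NULL;
}

int main(void)
{
	pthread_t threads[NTHREADS];

	for (int i = 0; i < NTHREADS; i++)
		pthread_create(&threads[i], NULL, fault_in, NULL);
	for (int i = 0; i < NTHREADS; i++)
		pthread_join(threads[i], NULL);

	return 0;
}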

> Thanks for any insight on this!
>
> - Joel
>

2020-12-14 13:45:12

by Joel Fernandes

[permalink] [raw]
Subject: Re: [PATCH v12 00/31] Speculative page faults

On Tue, Jul 07, 2020 at 01:31:37PM +0800, Chinwen Chang wrote:
[..]
> > > Hi Laurent,
> > >
> > > We merged SPF v11 and some patches from v12 into our platforms. After
> > > several experiments, we observed that SPF brings obvious improvements in
> > > the launch time of applications, especially for high-TLP ones:
> > >
> > > # launch time of applications (s):
> > >
> > > package           version     w/ SPF   w/o SPF   improve(%)
> > > ------------------------------------------------------------------
> > > Baidu maps        10.13.3     0.887    0.98      9.49
> > > Taobao            8.4.0.35    1.227    1.293     5.10
> > > Meituan           9.12.401    1.107    1.543     28.26
> > > WeChat            7.0.3       2.353    2.68      12.20
> > > Honor of Kings    1.43.1.6    6.63     6.713     1.24
> >
> > That's great news, thanks for reporting this!
> >
> > >
> > > By the way, we have verified our platforms with those patches and
> > > achieved the goal of mass production.
> >
> > More good news!
> > For my information, what is your targeted hardware?
> >
> > Cheers,
> > Laurent.
>
> Hi Laurent,
>
> Our targeted hardware belongs to the ARM64 multi-core series.

Hello!

I was trying to develop an intuition about why SPF gives an improvement for
you on systems with a small number of CPUs. This is just a high-level theory, but:

1. Assume the improvement is because of elimination of "blocking" on
mmap_sem.
Could it be that the mmap_sem is acquired in write-mode unnecessarily in some
places, thus causing blocking on mmap_sem in other paths? If so, is it
feasible to convert such usages to acquiring them in read-mode?

2. Assume the improvement is because of lesser read-side contention on
mmap_sem.
On small CPU systems, I would not expect reducing cache-line bouncing to give
such a dramatic improvement in performance as you are seeing.

Thanks for any insight on this!

- Joel

2020-12-14 18:53:04

by Joel Fernandes

[permalink] [raw]
Subject: Re: [PATCH v12 00/31] Speculative page faults

On Mon, Dec 14, 2020 at 10:36:29AM +0100, Laurent Dufour wrote:
> On 14/12/2020 at 03:03, Joel Fernandes wrote:
> > On Tue, Jul 07, 2020 at 01:31:37PM +0800, Chinwen Chang wrote:
> > [..]
> > > > > Hi Laurent,
> > > > >
> > > > > We merged SPF v11 and some patches from v12 into our platforms. After
> > > > > several experiments, we observed that SPF brings obvious improvements in
> > > > > the launch time of applications, especially for high-TLP ones:
> > > > >
> > > > > # launch time of applications (s):
> > > > >
> > > > > package           version     w/ SPF   w/o SPF   improve(%)
> > > > > ------------------------------------------------------------------
> > > > > Baidu maps        10.13.3     0.887    0.98      9.49
> > > > > Taobao            8.4.0.35    1.227    1.293     5.10
> > > > > Meituan           9.12.401    1.107    1.543     28.26
> > > > > WeChat            7.0.3       2.353    2.68      12.20
> > > > > Honor of Kings    1.43.1.6    6.63     6.713     1.24
> > > >
> > > > That's great news, thanks for reporting this!
> > > >
> > > > >
> > > > > By the way, we have verified our platforms with those patches and
> > > > > achieved the goal of mass production.
> > > >
> > > > More good news!
> > > > For my information, what is your targeted hardware?
> > > >
> > > > Cheers,
> > > > Laurent.
> > >
> > > Hi Laurent,
> > >
> > > Our targeted hardware belongs to the ARM64 multi-core series.
> >
> > Hello!
> >
> > I was trying to develop an intuition about why SPF gives an improvement for
> > you on systems with a small number of CPUs. This is just a high-level theory, but:
> >
> > 1. Assume the improvement is because of elimination of "blocking" on
> > mmap_sem.
> > Could it be that the mmap_sem is acquired in write-mode unnecessarily in some
> > places, thus causing blocking on mmap_sem in other paths? If so, is it
> > feasible to convert such usages to acquiring them in read-mode?
>
> That's correct, and the goal of this series is to try not to hold the
> mmap_sem in read mode during page fault processing.
>
> Converting mmap_sem holders from write to read mode is not so easy, and that
> work has already been done in some places. If you think there are areas where
> this could still be done, you're welcome to send patches fixing that.
>
> > 2. Assume the improvement is because of lesser read-side contention on
> > mmap_sem.
> > On small CPU systems, I would not expect reducing cache-line bouncing to give
> > such a dramatic improvement in performance as you are seeing.
>
> I don't think the reduction in cache-line bouncing is the main source of the
> performance improvement; I would rather expect it to play a minor part here.
> I guess this is mainly because a lot of page faults occur during loading time,
> and SPF is thus reducing the contention on the mmap_sem.

Thanks for the reply. I think I also wrongly assumed that acquiring the mmap
rwsem in write mode in a syscall makes SPF moot. Peter explained to me on IRC
that there's still a perf improvement in write mode if an unrelated VMA is
modified while another VMA is faulting. CMIIW - not an mm expert by any
stretch.

Thanks!

- Joel
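
To make the point about unrelated VMAs concrete: in the series each VMA
carries its own sequence count, so a concurrent change bumps only the count of
the VMA actually being modified, and a speculative fault on a different VMA
still validates successfully. A condensed, illustrative sketch (the
vm_sequence field name follows the series, but the helper is made up and the
snippet is not the exact code):

/*
 * Illustrative only: the series protects each VMA with a per-VMA seqcount
 * (vm_sequence).  A writer wraps its modification of one VMA in
 * write_seqcount_begin()/write_seqcount_end(); the speculative fault path
 * snapshots the count up front and re-checks it before committing the PTE.
 */
static bool spf_vma_changed(struct vm_area_struct *vma, unsigned int seq)
{
	/*
	 * True if *this* VMA was modified since the snapshot was taken;
	 * changes to other VMAs never touch vma->vm_sequence, so faults
	 * on them are not forced back to the classic path.
	 */
	return read_seqcount_retry(&vma->vm_sequence, seq);
}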