2023-04-14 14:27:37

by Chih-En Lin

[permalink] [raw]
Subject: [PATCH v5 00/17] Introduce Copy-On-Write to Page Table

NOTE
====
This patch series is primarily aimed at optimizing the memory usage of
page tables in processes with large address spaces, which can
potentially improve fork system call latency under certain conditions.
However, improving fork latency is planned as future work and is not
addressed by this series.

---

v4 -> v5
- Split the present and non-present parts of zap_pte_range.
- Remove the incorrect assertion of mmap lock writability in handle_cow_pte_fault().
- In the break COW PTE fault handler, update the pmd entry with the new
PTE table only after the duplication is finished. This avoids the
situation where someone could allocate a new PTE table because the pmd
entry was cleared before the COW-ed PTE was duplicated.
- Give breaking COW PTE a second chance if the first allocation fails;
if the second attempt still fails, let the OOM killer kill the failing
process.
- Extract the zap part of COW-ed PTE from the break COW PTE fault commit.
- In the zap part, clear the pmd entry assigned to the COW-ed PTE table
in the zap path instead of in the free-page-table path. Before this
change, it was possible to access the COW-ed PTE after it had been zapped.
- In the zap part, flush the TLB and free the batched memory before
handling the COW-ed PTE. While zapping a COW-ed PTE, defer flushing the
TLB and freeing the batched memory until after the pmd entry has been
cleared.
- Add the COW-ed PTE table sanity check to page table check.

v4: https://lore.kernel.org/linux-mm/[email protected]/

v3 -> v4
- Add Kconfig, CONFIG_COW_PTE, since some of the architectures, e.g.,
s390 and powerpc32, don't support the PMD entry and PTE table
operations.
- Fix mismatched type of break_cow_pte_range() in
migrate_vma_collect_pmd().
- Don’t break COW PTE in folio_referenced_one().
- Fix the wrong VMA range checking in break_cow_pte_range().
- Only break COW when we modify the soft-dirty bit in
clear_refs_pte_range().
- Handle do_swap_page() with COW PTE in mm/memory.c and mm/khugepaged.c.
- Change the tlb flush from flush_tlb_mm_range() (x86 specific) to
tlb_flush_pmd_range().
- Handle VM_DONTCOPY with COW PTE fork.
- Fix the wrong address and invalid vma in recover_pte_range().
- Fix the infinite page fault loop in GUP routine.
In mm/gup.c:follow_pfn_pte(), instead of calling the break COW PTE
handler, we return -EMLINK to let GUP handle the page fault
(by calling faultin_page() in __get_user_pages()).
- Return not_found(pvmw) if breaking COW PTE fails in
page_vma_mapped_walk().
- Since COW PTE produces the same results as the normal fork in the COW
selftest, it most likely passes the selftest.

# [RUN] vmsplice() + unmap in child ... with hugetlb (2048 kB)
not ok 33 No leak from parent into child
# [RUN] vmsplice() + unmap in child with mprotect() optimization ... with hugetlb (2048 kB)
not ok 44 No leak from parent into child
# [RUN] vmsplice() before fork(), unmap in parent after fork() ... with hugetlb (2048 kB)
not ok 55 No leak from child into parent
# [RUN] vmsplice() + unmap in parent after fork() ... with hugetlb (2048 kB)
not ok 66 No leak from child into parent

Bail out! 4 out of 147 tests failed
# Totals: pass:143 fail:4 xfail:0 xpass:0 skip:0 error:0
See the link below for more information about the anon COW hugetlb tests:
https://patchwork.kernel.org/project/linux-mm/patch/[email protected]/


v3: https://lore.kernel.org/linux-mm/[email protected]/T/

RFC v2 -> v3
- Change the sysctl with PID to prctl(PR_SET_COW_PTE).
- Account all the COW PTE mapped pages in fork() instead of deferring it
to the page fault (break COW PTE).
- If there is an unshareable mapped page (e.g., pinned or private
device), recover all the entries that have already been handled by COW
PTE fork, then copy them to the new table.
- Remove COW_PTE_OWNER_EXCLUSIVE flag and handle the only case of GUP,
follow_pfn_pte().
- Remove the PTE ownership since we don't need it.
- Use pte lock to protect the break COW PTE and free COW-ed PTE.
- Do TLB flushing in break COW PTE handler.
- Handle THP, KSM, madvise, mprotect, uffd and migrate device.
- Handle the replacement page of uprobe.
- Handle the clear_refs_write() of fs/proc.
- All of the benchmark numbers dropped because of the accounting and the
pte lock. The v3 benchmarks are worse than RFC v2; most cases are
similar to the normal fork, but one use case (TriforceAFL) is still
better than the normal fork version.

RFC v2: https://lore.kernel.org/linux-mm/[email protected]/T/

RFC v1 -> RFC v2
- Change the clone flag method to sysctl with PID.
- Change the MMF_COW_PGTABLE flag to two flags, MMF_COW_PTE and
MMF_COW_PTE_READY, for the sysctl.
- Change the owner pointer to use the folio padding.
- Handle all the VMAs that cover the PTE table when doing the break COW PTE.
- Remove the self-defined refcount to use the _refcount for the page
table page.
- Add the exclusive flag to let the page table be owned by only one task
in some situations.
- Invalidate the address range with the MMU notifier and start the
write_seqcount when breaking COW PTE.
- Handle the swap cache and swapoff.

RFC v1: https://lore.kernel.org/all/[email protected]/

---

Currently, copy-on-write is only applied to the mapped memory; the child
process still needs to copy the entire page table from the parent
process during fork. When the parent has a large page table allocated,
copying it can cost the parent significant time and memory. For example,
the memory usage of a process after forking with 1 GB of mapped memory
is as follows:

DEFAULT FORK
parent child
VmRSS: 1049688 kB 1048688 kB
VmPTE: 2096 kB 2096 kB
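
For reference, a minimal sketch of how such a measurement could be
reproduced. This is illustrative only, not the exact test program behind
the numbers above: it maps 1 GB of anonymous memory, populates it, forks,
and prints VmRSS/VmPTE from /proc/self/status.

#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

static void show_mem(const char *who)
{
        char line[128];
        FILE *f = fopen("/proc/self/status", "r");

        while (fgets(line, sizeof(line), f))
                if (!strncmp(line, "VmRSS", 5) || !strncmp(line, "VmPTE", 5))
                        printf("%s %s", who, line);
        fclose(f);
}

int main(void)
{
        size_t size = 1UL << 30;        /* 1 GB of anonymous memory */
        char *buf = mmap(NULL, size, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        memset(buf, 1, size);           /* populate every page */

        if (fork() == 0) {
                show_mem("child: ");
                _exit(0);
        }
        wait(NULL);
        show_mem("parent:");
        return 0;
}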

This patch introduces copy-on-write (COW) for the PTE-level page tables.
COW PTE conditionally improves performance in situations where the
user needs copies of the program to run in isolated environments.
Feedback-based fuzzers (e.g., AFL) and serverless/microservice frameworks
are two major examples. For instance, COW PTE achieves a 1.03x throughput
increase when running TriforceAFL.

After applying COW to PTE, the memory usage after forking is as follows:

COW PTE
parent child
VmRSS: 1049968 kB 2576 kB
VmPTE: 2096 kB 44 kB

The results show that this patch significantly decreases memory usage.
The latency numbers are discussed below.

Real-world application benchmarks
=================================

We ran fuzzing and VM cloning benchmarks. The experiments were done
with either the normal fork or the fork with COW PTE.

With AFL (LLVM mode) and SQLite, COW PTE (52.15 execs/sec) is a
little bit worse than the normal fork version (53.50 execs/sec).

fork
execs_per_sec unix_time time
count 28.000000 2.800000e+01 28.000000
mean 53.496786 1.671270e+09 96.107143
std 3.625060 7.194717e+01 71.947172
min 35.350000 1.671270e+09 0.000000
25% 53.967500 1.671270e+09 33.750000
50% 54.235000 1.671270e+09 92.000000
75% 54.525000 1.671270e+09 149.250000
max 55.100000 1.671270e+09 275.000000

COW PTE
execs_per_sec unix_time time
count 34.000000 3.400000e+01 34.000000
mean 52.150000 1.671268e+09 103.323529
std 3.218271 7.507682e+01 75.076817
min 34.250000 1.671268e+09 0.000000
25% 52.500000 1.671268e+09 42.250000
50% 52.750000 1.671268e+09 94.500000
75% 52.952500 1.671268e+09 150.750000
max 53.680000 1.671268e+09 285.000000


With TriforceAFL, which does kernel fuzzing with QEMU, COW PTE
(105.54 execs/sec) achieves a 1.03x throughput increase over the
normal fork version (102.30 execs/sec).

fork
execs_per_sec unix_time time
count 38.000000 3.800000e+01 38.000000
mean 102.299737 1.671269e+09 156.289474
std 20.139268 8.717113e+01 87.171130
min 6.600000 1.671269e+09 0.000000
25% 95.657500 1.671269e+09 82.250000
50% 109.950000 1.671269e+09 176.500000
75% 113.972500 1.671269e+09 223.750000
max 118.790000 1.671269e+09 281.000000

COW PTE
execs_per_sec unix_time time
count 42.000000 4.200000e+01 42.000000
mean 105.540714 1.671269e+09 163.476190
std 19.443517 8.858845e+01 88.588453
min 6.200000 1.671269e+09 0.000000
25% 96.585000 1.671269e+09 123.500000
50% 113.925000 1.671269e+09 180.500000
75% 116.940000 1.671269e+09 233.500000
max 121.090000 1.671269e+09 286.000000

Microbenchmark - syscall latency
================================

We ran microbenchmarks to measure the latency of a fork syscall with
mapped memory sizes ranging from 0 to 512 MB. The results show that
the latency of a normal fork reaches about 10 ms, and the latency of a
fork with COW PTE is also around 10 ms.
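
A rough sketch of this kind of fork-latency measurement is shown below.
The 512 MB size comes from the range above; the timing method and program
structure are assumptions, not the exact harness used for the reported
numbers.

#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

int main(void)
{
        size_t size = 512UL << 20;              /* 512 MB mapped memory */
        char *buf = mmap(NULL, size, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        struct timespec t0, t1;
        pid_t pid;

        memset(buf, 1, size);                   /* populate the mapping */

        clock_gettime(CLOCK_MONOTONIC, &t0);
        pid = fork();                           /* the latency being measured */
        clock_gettime(CLOCK_MONOTONIC, &t1);

        if (pid == 0)
                _exit(0);
        wait(NULL);

        printf("fork latency: %.3f ms\n",
               (t1.tv_sec - t0.tv_sec) * 1e3 +
               (t1.tv_nsec - t0.tv_nsec) / 1e6);
        return 0;
}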

Microbenchmark - page fault latency
====================================

We conducted some microbenchmarks to measure page fault latency with
different patterns of accesses to a 512 MB memory buffer after forking.

In the first experiment, the program accesses the entire 512 MB of memory
by writing to all the pages consecutively. The experiment is done with the
normal fork and the fork with COW PTE, and calculates the average latency
of a single access. The COW PTE page fault latency (0.000795 ms) is close
to the normal fork fault latency (0.000770 ms). Here are the raw numbers:

Page fault - Access to the entire 512 MB memory

fork mean: 0.000770 ms
fork median: 0.000769 ms
fork std: 0.000010 ms

COW PTE mean: 0.000795 ms
COW PTE median: 0.000795 ms
COW PTE std: 0.000009 ms

The second experiment simulates real-world applications with sparse
accesses. The program randomly accesses the memory by writing to a
random page 1 million times and calculates the average access time;
we then run both versions 100 times to get the averages. The results
show that COW PTE (0.000029 ms) is similar to the normal fork
(0.000026 ms).

Page fault - Random access

fork mean: 0.000026 ms
fork median: 0.000025 ms
fork std: 0.000002 ms

COW PTE mean: 0.000029 ms
COW PTE median: 0.000026 ms
COW PTE std: 0.000004 ms
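
For reference, a sketch of the two access patterns measured above, run in
the child after fork. The page size, the use of rand(), and the overall
program structure are assumptions; the reported numbers were not produced
by this exact program.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

#define SIZE    (512UL << 20)           /* 512 MB buffer */
#define NPAGES  (SIZE / 4096)

static double now_ms(void)
{
        struct timespec ts;

        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec * 1e3 + ts.tv_nsec / 1e6;
}

int main(void)
{
        char *buf = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        double start;
        size_t i;

        memset(buf, 1, SIZE);           /* populate before forking */
        if (fork() != 0) {
                wait(NULL);             /* parent just waits for the child */
                return 0;
        }

        /* Pattern 1: write every page consecutively, per-access average. */
        start = now_ms();
        for (i = 0; i < NPAGES; i++)
                buf[i * 4096] = 2;
        printf("sequential: %f ms/access\n", (now_ms() - start) / NPAGES);

        /* Pattern 2: write one random page at a time, 1 million times. */
        start = now_ms();
        for (i = 0; i < 1000000; i++)
                buf[(rand() % NPAGES) * 4096] = 3;
        printf("random: %f ms/access\n", (now_ms() - start) / 1000000);
        return 0;
}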

All the tests were run with QEMU and the kernel was built with
the x86_64 default config (v3 patch set).

Summary
=======

In summary, COW PTE reduces the memory footprint of processes and
conditionally improves fork syscall latency.

This patch is based on the paper "On-demand-fork: a microsecond fork
for memory-intensive and latency-sensitive applications" [1] from
Purdue University.

Any comments and suggestions are welcome.

Thanks,
Chih-En Lin

---

[1] https://dl.acm.org/doi/10.1145/3447786.3456258

This patch is based on v6.3-rc6.

---

Chih-En Lin (17):
mm: Split out the present cases from zap_pte_range()
mm: Allow user to control COW PTE via prctl
mm: Add Copy-On-Write PTE to fork()
mm: Add break COW PTE fault and helper functions
mm: Handle COW-ed PTE during zapping
mm/rmap: Break COW PTE in rmap walking
mm/khugepaged: Break COW PTE before scanning pte
mm/ksm: Break COW PTE before modify shared PTE
mm/madvise: Handle COW-ed PTE with madvise()
mm/gup: Trigger break COW PTE before calling follow_pfn_pte()
mm/mprotect: Break COW PTE before changing protection
mm/userfaultfd: Support COW PTE
mm/migrate_device: Support COW PTE
fs/proc: Support COW PTE with clear_refs_write
events/uprobes: Break COW PTE before replacing page
mm: fork: Enable COW PTE to fork system call
mm: Check the unexpected modification of COW-ed PTE

arch/x86/include/asm/pgtable.h | 1 +
fs/proc/task_mmu.c | 5 +
include/linux/mm.h | 37 ++
include/linux/page_table_check.h | 62 ++
include/linux/pgtable.h | 6 +
include/linux/rmap.h | 2 +
include/linux/sched/coredump.h | 13 +-
include/trace/events/huge_memory.h | 1 +
include/uapi/linux/prctl.h | 6 +
kernel/events/uprobes.c | 2 +-
kernel/fork.c | 7 +
kernel/sys.c | 11 +
mm/Kconfig | 9 +
mm/gup.c | 8 +-
mm/khugepaged.c | 35 +-
mm/ksm.c | 4 +-
mm/madvise.c | 13 +
mm/memory.c | 926 ++++++++++++++++++++++++++---
mm/migrate.c | 3 +-
mm/migrate_device.c | 2 +
mm/mmap.c | 4 +
mm/mprotect.c | 9 +
mm/mremap.c | 2 +
mm/page_table_check.c | 58 ++
mm/page_vma_mapped.c | 4 +
mm/rmap.c | 9 +-
mm/swapfile.c | 2 +
mm/userfaultfd.c | 6 +
mm/vmscan.c | 3 +-
29 files changed, 1149 insertions(+), 101 deletions(-)

--
2.34.1


2023-04-14 14:27:43

by Chih-En Lin

[permalink] [raw]
Subject: [PATCH v5 07/17] mm/khugepaged: Break COW PTE before scanning pte

We should not allow THP to collapse a COW-ed PTE table. So, break COW
PTE before collapse_pte_mapped_thp() collapses it into a THP. Also,
break COW PTE before hpage_collapse_scan_pmd() scans the PTEs.

Signed-off-by: Chih-En Lin <[email protected]>
---
include/trace/events/huge_memory.h | 1 +
mm/khugepaged.c | 35 +++++++++++++++++++++++++++++-
2 files changed, 35 insertions(+), 1 deletion(-)

diff --git a/include/trace/events/huge_memory.h b/include/trace/events/huge_memory.h
index 3e6fb05852f9..5f2c39f61521 100644
--- a/include/trace/events/huge_memory.h
+++ b/include/trace/events/huge_memory.h
@@ -13,6 +13,7 @@
EM( SCAN_PMD_NULL, "pmd_null") \
EM( SCAN_PMD_NONE, "pmd_none") \
EM( SCAN_PMD_MAPPED, "page_pmd_mapped") \
+ EM( SCAN_COW_PTE, "cowed_pte") \
EM( SCAN_EXCEED_NONE_PTE, "exceed_none_pte") \
EM( SCAN_EXCEED_SWAP_PTE, "exceed_swap_pte") \
EM( SCAN_EXCEED_SHARED_PTE, "exceed_shared_pte") \
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 92e6f56a932d..3020fcb53691 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -31,6 +31,7 @@ enum scan_result {
SCAN_PMD_NULL,
SCAN_PMD_NONE,
SCAN_PMD_MAPPED,
+ SCAN_COW_PTE,
SCAN_EXCEED_NONE_PTE,
SCAN_EXCEED_SWAP_PTE,
SCAN_EXCEED_SHARED_PTE,
@@ -886,7 +887,7 @@ static int find_pmd_or_thp_or_none(struct mm_struct *mm,
return SCAN_PMD_MAPPED;
if (pmd_devmap(pmde))
return SCAN_PMD_NULL;
- if (pmd_bad(pmde))
+ if (pmd_write(pmde) && pmd_bad(pmde))
return SCAN_PMD_NULL;
return SCAN_SUCCEED;
}
@@ -937,6 +938,8 @@ static int __collapse_huge_page_swapin(struct mm_struct *mm,
pte_unmap(vmf.pte);
continue;
}
+ if (break_cow_pte(vma, pmd, address))
+ return SCAN_COW_PTE;
ret = do_swap_page(&vmf);

/*
@@ -1049,6 +1052,9 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
if (result != SCAN_SUCCEED)
goto out_up_write;

+ /* We should have already handled the COW-ed PTE. */
+ VM_WARN_ON(test_bit(MMF_COW_PTE, &mm->flags) && !pmd_write(*pmd));
+
anon_vma_lock_write(vma->anon_vma);

mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, address,
@@ -1159,6 +1165,13 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,

memset(cc->node_load, 0, sizeof(cc->node_load));
nodes_clear(cc->alloc_nmask);
+
+ /* Break COW PTE before we collapse the pages. */
+ if (break_cow_pte(vma, pmd, address)) {
+ result = SCAN_COW_PTE;
+ goto out;
+ }
+
pte = pte_offset_map_lock(mm, pmd, address, &ptl);
for (_address = address, _pte = pte; _pte < pte + HPAGE_PMD_NR;
_pte++, _address += PAGE_SIZE) {
@@ -1217,6 +1230,10 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
goto out_unmap;
}

+ /*
+ * If we only triggered break COW PTE, the page is usually
+ * still COW-mapped and therefore still shared.
+ */
if (page_mapcount(page) > 1) {
++shared;
if (cc->is_khugepaged &&
@@ -1512,6 +1529,11 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
goto drop_hpage;
}

+ /* We shouldn't let COW-ed PTE collapse. */
+ if (break_cow_pte(vma, pmd, haddr))
+ goto drop_hpage;
+ VM_WARN_ON(test_bit(MMF_COW_PTE, &mm->flags) && !pmd_write(*pmd));
+
/*
* We need to lock the mapping so that from here on, only GUP-fast and
* hardware page walks can access the parts of the page tables that
@@ -1717,6 +1739,11 @@ static int retract_page_tables(struct address_space *mapping, pgoff_t pgoff,
result = SCAN_PTE_UFFD_WP;
goto unlock_next;
}
+ if (test_bit(MMF_COW_PTE, &mm->flags) &&
+ !pmd_write(*pmd)) {
+ result = SCAN_COW_PTE;
+ goto unlock_next;
+ }
collapse_and_free_pmd(mm, vma, addr, pmd);
if (!cc->is_khugepaged && is_target)
result = set_huge_pmd(vma, addr, pmd, hpage);
@@ -2154,6 +2181,11 @@ static int hpage_collapse_scan_file(struct mm_struct *mm, unsigned long addr,
swap = 0;
memset(cc->node_load, 0, sizeof(cc->node_load));
nodes_clear(cc->alloc_nmask);
+ if (break_cow_pte(find_vma(mm, addr), NULL, addr)) {
+ result = SCAN_COW_PTE;
+ goto out;
+ }
+
rcu_read_lock();
xas_for_each(&xas, page, start + HPAGE_PMD_NR - 1) {
if (xas_retry(&xas, page))
@@ -2224,6 +2256,7 @@ static int hpage_collapse_scan_file(struct mm_struct *mm, unsigned long addr,
}
rcu_read_unlock();

+out:
if (result == SCAN_SUCCEED) {
if (cc->is_khugepaged &&
present < HPAGE_PMD_NR - khugepaged_max_ptes_none) {
--
2.34.1

2023-04-14 14:27:47

by Chih-En Lin

[permalink] [raw]
Subject: [PATCH v5 08/17] mm/ksm: Break COW PTE before modify shared PTE

Break COW PTE before merging a page that resides in a COW-ed PTE table.

Signed-off-by: Chih-En Lin <[email protected]>
---
mm/ksm.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/mm/ksm.c b/mm/ksm.c
index 2b8d30068cbb..963ef4d0085d 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -1052,7 +1052,7 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page,
pte_t *orig_pte)
{
struct mm_struct *mm = vma->vm_mm;
- DEFINE_PAGE_VMA_WALK(pvmw, page, vma, 0, 0);
+ DEFINE_PAGE_VMA_WALK(pvmw, page, vma, 0, PVMW_BREAK_COW_PTE);
int swapped;
int err = -EFAULT;
struct mmu_notifier_range range;
@@ -1169,6 +1169,8 @@ static int replace_page(struct vm_area_struct *vma, struct page *page,
barrier();
if (!pmd_present(pmde) || pmd_trans_huge(pmde))
goto out;
+ if (break_cow_pte(vma, pmd, addr))
+ goto out;

mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, addr,
addr + PAGE_SIZE);
--
2.34.1

2023-04-14 14:27:50

by Chih-En Lin

[permalink] [raw]
Subject: [PATCH v5 05/17] mm: Handle COW-ed PTE during zapping

To support zapping of a COW-ed PTE table, we need to zap the entire
PTE table each time instead of partially zapping pages. Therefore, if
the zap range covers the entire PTE table, we can handle the
de-accounting, remove the rmap, etc. However, we shouldn't modify the
entries while someone else still holds references to the COW-ed PTE
table. Otherwise, if the zapping process is the only one referencing
this COW-ed PTE table, we just reuse it and do the normal zapping.

Signed-off-by: Chih-En Lin <[email protected]>
---
mm/memory.c | 92 ++++++++++++++++++++++++++++++++++++++++++++++++++---
1 file changed, 87 insertions(+), 5 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index f8a87a0fc382..7908e20f802a 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -192,6 +192,12 @@ static inline void free_pmd_range(struct mmu_gather *tlb, pud_t *pud,
pmd = pmd_offset(pud, addr);
do {
next = pmd_addr_end(addr, end);
+#ifdef CONFIG_COW_PTE
+ if (test_bit(MMF_COW_PTE, &tlb->mm->flags)) {
+ if (!pmd_none(*pmd) && !pmd_write(*pmd))
+ VM_WARN_ON(cow_pte_count(pmd) != 1);
+ }
+#endif
if (pmd_none_or_clear_bad(pmd))
continue;
free_pte_range(tlb, pmd, addr);
@@ -1656,6 +1662,7 @@ zap_install_uffd_wp_if_needed(struct vm_area_struct *vma,

#define ZAP_PTE_INIT 0x0000
#define ZAP_PTE_FORCE_FLUSH 0x0001
+#define ZAP_PTE_IS_SHARED 0x0002

struct zap_pte_details {
pte_t **pte;
@@ -1681,9 +1688,13 @@ zap_present_pte(struct mmu_gather *tlb, struct vm_area_struct *vma,
if (unlikely(!should_zap_page(details, page)))
return 0;

- ptent = ptep_get_and_clear_full(mm, addr, pte, tlb->fullmm);
+ if (pte_details->flags & ZAP_PTE_IS_SHARED)
+ ptent = ptep_get(pte);
+ else
+ ptent = ptep_get_and_clear_full(mm, addr, pte, tlb->fullmm);
tlb_remove_tlb_entry(tlb, pte, addr);
- zap_install_uffd_wp_if_needed(vma, addr, pte, details, ptent);
+ if (!(pte_details->flags & ZAP_PTE_IS_SHARED))
+ zap_install_uffd_wp_if_needed(vma, addr, pte, details, ptent);
if (unlikely(!page))
return 0;

@@ -1767,8 +1778,10 @@ zap_nopresent_pte(struct mmu_gather *tlb, struct vm_area_struct *vma,
/* We should have covered all the swap entry types */
WARN_ON_ONCE(1);
}
- pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
- zap_install_uffd_wp_if_needed(vma, addr, pte, details, ptent);
+ if (!(pte_details->flags & ZAP_PTE_IS_SHARED)) {
+ pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
+ zap_install_uffd_wp_if_needed(vma, addr, pte, details, ptent);
+ }
}

static unsigned long zap_pte_range(struct mmu_gather *tlb,
@@ -1785,6 +1798,36 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
.flags = ZAP_PTE_INIT,
.pte = &pte,
};
+#ifdef CONFIG_COW_PTE
+ unsigned long orig_addr = addr;
+
+ if (test_bit(MMF_COW_PTE, &mm->flags) && !pmd_write(*pmd)) {
+ if (!range_in_vma(vma, addr & PMD_MASK,
+ (addr + PMD_SIZE) & PMD_MASK)) {
+ /*
+ * We cannot promise this COW-ed PTE will also be zapped
+ * along with the rest of the VMAs. So, break COW PTE here.
+ */
+ break_cow_pte(vma, pmd, addr);
+ } else {
+ /*
+ * We free the batched memory before we handle
+ * COW-ed PTE.
+ */
+ tlb_flush_mmu(tlb);
+ end = (addr + PMD_SIZE) & PMD_MASK;
+ addr = addr & PMD_MASK;
+ start_pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
+ if (cow_pte_count(pmd) == 1) {
+ /* Reuse COW-ed PTE */
+ pmd_t new = pmd_mkwrite(*pmd);
+ set_pmd_at(tlb->mm, addr, pmd, new);
+ } else
+ pte_details.flags |= ZAP_PTE_IS_SHARED;
+ pte_unmap_unlock(start_pte, ptl);
+ }
+ }
+#endif

tlb_change_page_size(tlb, PAGE_SIZE);
again:
@@ -1828,7 +1871,16 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
*/
if (pte_details.flags & ZAP_PTE_FORCE_FLUSH) {
pte_details.flags &= ~ZAP_PTE_FORCE_FLUSH;
- tlb_flush_mmu(tlb);
+ /*
+ * With COW-ed PTE, we defer freeing the batched memory until
+ * after we have actually cleared the COW-ed PTE's pmd entry.
+ * Otherwise, if we are the only ones still referencing the
+ * COW-ed PTE table after we have freed the batched memory, the page
+ * table check will report a bug with anon_map_count != 0 in
+ * page_table_check_zero().
+ */
+ if (!(pte_details.flags & ZAP_PTE_IS_SHARED))
+ tlb_flush_mmu(tlb);
}

if (addr != end) {
@@ -1836,6 +1888,36 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
goto again;
}

+#ifdef CONFIG_COW_PTE
+ if (pte_details.flags & ZAP_PTE_IS_SHARED) {
+ start_pte = pte_offset_map_lock(mm, pmd, orig_addr, &ptl);
+ if (!pmd_put_pte(pmd)) {
+ pmd_t new = pmd_mkwrite(*pmd);
+ set_pmd_at(tlb->mm, addr, pmd, new);
+ /*
+ * We are the only ones still referencing this table.
+ * Clear the page table check before we free the
+ * batched memory.
+ */
+ page_table_check_pte_clear_range(mm, orig_addr, *pmd);
+ pte_unmap_unlock(start_pte, ptl);
+ /* free the batched memory and flush the TLB. */
+ tlb_flush_mmu(tlb);
+ free_pte_range(tlb, pmd, addr);
+ } else {
+ pmd_clear(pmd);
+ pte_unmap_unlock(start_pte, ptl);
+ mm_dec_nr_ptes(tlb->mm);
+ /*
+ * Someone is still referencing the table,
+ * so we just flush the TLB here.
+ */
+ flush_tlb_range(vma, addr & PMD_MASK,
+ (addr + PMD_SIZE) & PMD_MASK);
+ }
+ }
+#endif
+
return addr;
}

--
2.34.1

2023-04-14 14:28:02

by Chih-En Lin

[permalink] [raw]
Subject: [PATCH v5 10/17] mm/gup: Trigger break COW PTE before calling follow_pfn_pte()

In most cases, GUP does not modify the page table; follow_pfn_pte() is
the exception. To deal with COW PTE, trigger the break COW PTE fault
before calling follow_pfn_pte().

Signed-off-by: Chih-En Lin <[email protected]>
---
mm/gup.c | 8 +++++++-
1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/mm/gup.c b/mm/gup.c
index eab18ba045db..325424c02ca6 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -544,7 +544,8 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
if (WARN_ON_ONCE((flags & (FOLL_PIN | FOLL_GET)) ==
(FOLL_PIN | FOLL_GET)))
return ERR_PTR(-EINVAL);
- if (unlikely(pmd_bad(*pmd)))
+ /* COW-ed PTE has write protection which can trigger pmd_bad(). */
+ if (unlikely(pmd_write(*pmd) && pmd_bad(*pmd)))
return no_page_table(vma, flags);

ptep = pte_offset_map_lock(mm, pmd, address, &ptl);
@@ -587,6 +588,11 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
if (is_zero_pfn(pte_pfn(pte))) {
page = pte_page(pte);
} else {
+ if (test_bit(MMF_COW_PTE, &mm->flags) &&
+ !pmd_write(*pmd)) {
+ page = ERR_PTR(-EMLINK);
+ goto out;
+ }
ret = follow_pfn_pte(vma, address, ptep, flags);
page = ERR_PTR(ret);
goto out;
--
2.34.1

2023-04-14 14:28:20

by Chih-En Lin

[permalink] [raw]
Subject: [PATCH v5 06/17] mm/rmap: Break COW PTE in rmap walking

Some features (unmap, migrate, device exclusive, mkclean, etc.) might
modify the pte entry via rmap. Add a new page vma mapped walk flag,
PVMW_BREAK_COW_PTE, to tell the rmap walk to break COW PTE.

Signed-off-by: Chih-En Lin <[email protected]>
---
include/linux/rmap.h | 2 ++
mm/migrate.c | 3 ++-
mm/page_vma_mapped.c | 4 ++++
mm/rmap.c | 9 +++++----
mm/vmscan.c | 3 ++-
5 files changed, 15 insertions(+), 6 deletions(-)

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index b87d01660412..57e9b72dc63a 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -377,6 +377,8 @@ int make_device_exclusive_range(struct mm_struct *mm, unsigned long start,
#define PVMW_SYNC (1 << 0)
/* Look for migration entries rather than present PTEs */
#define PVMW_MIGRATION (1 << 1)
+/* Break COW-ed PTE during walking */
+#define PVMW_BREAK_COW_PTE (1 << 2)

struct page_vma_mapped_walk {
unsigned long pfn;
diff --git a/mm/migrate.c b/mm/migrate.c
index db3f154446af..38933993af14 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -184,7 +184,8 @@ void putback_movable_pages(struct list_head *l)
static bool remove_migration_pte(struct folio *folio,
struct vm_area_struct *vma, unsigned long addr, void *old)
{
- DEFINE_FOLIO_VMA_WALK(pvmw, old, vma, addr, PVMW_SYNC | PVMW_MIGRATION);
+ DEFINE_FOLIO_VMA_WALK(pvmw, old, vma, addr,
+ PVMW_SYNC | PVMW_MIGRATION | PVMW_BREAK_COW_PTE);

while (page_vma_mapped_walk(&pvmw)) {
rmap_t rmap_flags = RMAP_NONE;
diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
index 4e448cfbc6ef..1750b3460828 100644
--- a/mm/page_vma_mapped.c
+++ b/mm/page_vma_mapped.c
@@ -254,6 +254,10 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
step_forward(pvmw, PMD_SIZE);
continue;
}
+ if (pvmw->flags & PVMW_BREAK_COW_PTE) {
+ if (break_cow_pte(vma, pvmw->pmd, pvmw->address))
+ return not_found(pvmw);
+ }
if (!map_pte(pvmw))
goto next_pte;
this_pte:
diff --git a/mm/rmap.c b/mm/rmap.c
index 8632e02661ac..5582da6d72fc 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1006,7 +1006,8 @@ static int page_vma_mkclean_one(struct page_vma_mapped_walk *pvmw)
static bool page_mkclean_one(struct folio *folio, struct vm_area_struct *vma,
unsigned long address, void *arg)
{
- DEFINE_FOLIO_VMA_WALK(pvmw, folio, vma, address, PVMW_SYNC);
+ DEFINE_FOLIO_VMA_WALK(pvmw, folio, vma, address,
+ PVMW_SYNC | PVMW_BREAK_COW_PTE);
int *cleaned = arg;

*cleaned += page_vma_mkclean_one(&pvmw);
@@ -1450,7 +1451,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
unsigned long address, void *arg)
{
struct mm_struct *mm = vma->vm_mm;
- DEFINE_FOLIO_VMA_WALK(pvmw, folio, vma, address, 0);
+ DEFINE_FOLIO_VMA_WALK(pvmw, folio, vma, address, PVMW_BREAK_COW_PTE);
pte_t pteval;
struct page *subpage;
bool anon_exclusive, ret = true;
@@ -1810,7 +1811,7 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
unsigned long address, void *arg)
{
struct mm_struct *mm = vma->vm_mm;
- DEFINE_FOLIO_VMA_WALK(pvmw, folio, vma, address, 0);
+ DEFINE_FOLIO_VMA_WALK(pvmw, folio, vma, address, PVMW_BREAK_COW_PTE);
pte_t pteval;
struct page *subpage;
bool anon_exclusive, ret = true;
@@ -2177,7 +2178,7 @@ static bool page_make_device_exclusive_one(struct folio *folio,
struct vm_area_struct *vma, unsigned long address, void *priv)
{
struct mm_struct *mm = vma->vm_mm;
- DEFINE_FOLIO_VMA_WALK(pvmw, folio, vma, address, 0);
+ DEFINE_FOLIO_VMA_WALK(pvmw, folio, vma, address, PVMW_BREAK_COW_PTE);
struct make_exclusive_args *args = priv;
pte_t pteval;
struct page *subpage;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 9c1c5e8b24b8..4abbd036f927 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1892,7 +1892,8 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,

/*
* The folio is mapped into the page tables of one or more
- * processes. Try to unmap it here.
+ * processes. Try to unmap it here. Also, since this will write
+ * to the page tables, break COW PTE if they are COW-ed.
*/
if (folio_mapped(folio)) {
enum ttu_flags flags = TTU_BATCH_FLUSH;
--
2.34.1

2023-04-14 14:28:24

by Chih-En Lin

[permalink] [raw]
Subject: [PATCH v5 11/17] mm/mprotect: Break COW PTE before changing protection

If the PTE table is COW-ed, break it before changing the protection.

Signed-off-by: Chih-En Lin <[email protected]>
---
mm/mprotect.c | 9 +++++++++
1 file changed, 9 insertions(+)

diff --git a/mm/mprotect.c b/mm/mprotect.c
index 13e84d8c0797..a33f23a73fa5 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -103,6 +103,9 @@ static long change_pte_range(struct mmu_gather *tlb,
if (pmd_trans_unstable(pmd))
return 0;

+ if (break_cow_pte(vma, pmd, addr))
+ return 0;
+
/*
* The pmd points to a regular pte so the pmd can't change
* from under us even if the mmap_lock is only hold for
@@ -312,6 +315,12 @@ static inline int pmd_none_or_clear_bad_unless_trans_huge(pmd_t *pmd)
return 1;
if (pmd_trans_huge(pmdval))
return 0;
+ /*
+ * If the entry points to a COW-ed PTE table, its write-protection
+ * bit will cause pmd_bad().
+ */
+ if (!pmd_write(pmdval))
+ return 0;
if (unlikely(pmd_bad(pmdval))) {
pmd_clear_bad(pmd);
return 1;
--
2.34.1

2023-04-14 14:28:31

by Chih-En Lin

[permalink] [raw]
Subject: [PATCH v5 13/17] mm/migrate_device: Support COW PTE

Break COW PTE before collecting the pages in a COW-ed PTE table.

Signed-off-by: Chih-En Lin <[email protected]>
---
mm/migrate_device.c | 2 ++
1 file changed, 2 insertions(+)

diff --git a/mm/migrate_device.c b/mm/migrate_device.c
index d30c9de60b0d..340a8c39ee3b 100644
--- a/mm/migrate_device.c
+++ b/mm/migrate_device.c
@@ -106,6 +106,8 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
}
}

+ if (break_cow_pte_range(vma, start, end))
+ return migrate_vma_collect_skip(start, end, walk);
if (unlikely(pmd_bad(*pmdp)))
return migrate_vma_collect_skip(start, end, walk);

--
2.34.1

2023-04-14 14:28:33

by Chih-En Lin

[permalink] [raw]
Subject: [PATCH v5 14/17] fs/proc: Support COW PTE with clear_refs_write

Before clearing the entry in COW-ed PTE, break COW PTE first.

Signed-off-by: Chih-En Lin <[email protected]>
---
fs/proc/task_mmu.c | 5 +++++
1 file changed, 5 insertions(+)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 6a96e1713fd5..c76b74029dfd 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -1195,6 +1195,11 @@ static int clear_refs_pte_range(pmd_t *pmd, unsigned long addr,
if (pmd_trans_unstable(pmd))
return 0;

+ /* Only break COW when we modify the soft-dirty bit. */
+ if (cp->type == CLEAR_REFS_SOFT_DIRTY &&
+ break_cow_pte(vma, pmd, addr))
+ return 0;
+
pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
for (; addr != end; pte++, addr += PAGE_SIZE) {
ptent = *pte;
--
2.34.1

2023-04-14 14:28:55

by Chih-En Lin

[permalink] [raw]
Subject: [PATCH v5 12/17] mm/userfaultfd: Support COW PTE

If uffd fills the zeropage or installs a pte into a COW-ed PTE table,
break COW PTE first.

Signed-off-by: Chih-En Lin <[email protected]>
---
mm/userfaultfd.c | 6 ++++++
1 file changed, 6 insertions(+)

diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 53c3d916ff66..f5e0a97d6a3d 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -70,6 +70,9 @@ int mfill_atomic_install_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd,
struct inode *inode;
pgoff_t offset, max_off;

+ if (break_cow_pte(dst_vma, dst_pmd, dst_addr))
+ return -ENOMEM;
+
_dst_pte = mk_pte(page, dst_vma->vm_page_prot);
_dst_pte = pte_mkdirty(_dst_pte);
if (page_in_cache && !vm_shared)
@@ -215,6 +218,9 @@ static int mfill_zeropage_pte(struct mm_struct *dst_mm,
pgoff_t offset, max_off;
struct inode *inode;

+ if (break_cow_pte(dst_vma, dst_pmd, dst_addr))
+ return -ENOMEM;
+
_dst_pte = pte_mkspecial(pfn_pte(my_zero_pfn(dst_addr),
dst_vma->vm_page_prot));
dst_pte = pte_offset_map_lock(dst_mm, dst_pmd, dst_addr, &ptl);
--
2.34.1

2023-04-14 14:28:59

by Chih-En Lin

[permalink] [raw]
Subject: [PATCH v5 09/17] mm/madvise: Handle COW-ed PTE with madvise()

Break COW PTE if madvise() modifies the pte entries of a COW-ed PTE
table. The following is the list of flags which need to break COW PTE.
Others, such as MADV_HUGEPAGE and MADV_MERGEABLE, are handled
separately.

- MADV_DONTNEED: It calls zap_page_range(), which is already handled.
- MADV_FREE: It uses walk_page_range() with madvise_free_pte_range() to
free the pages by itself, so add break_cow_pte().
- MADV_REMOVE: Same as MADV_FREE, it removes the pages by itself, so add
break_cow_pte_range().
- MADV_COLD: Similar to MADV_FREE, break COW PTE before pageout.
- MADV_POPULATE: Let GUP deal with it.

Signed-off-by: Chih-En Lin <[email protected]>
---
mm/madvise.c | 13 +++++++++++++
1 file changed, 13 insertions(+)

diff --git a/mm/madvise.c b/mm/madvise.c
index 340125d08c03..71176edb751e 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -425,6 +425,9 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
if (pmd_trans_unstable(pmd))
return 0;
#endif
+ if (break_cow_pte(vma, pmd, addr))
+ return 0;
+
tlb_change_page_size(tlb, PAGE_SIZE);
orig_pte = pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
flush_tlb_batched_pending(mm);
@@ -625,6 +628,10 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
if (pmd_trans_unstable(pmd))
return 0;

+ /* We should only allocate PTE. */
+ if (break_cow_pte(vma, pmd, addr))
+ goto next;
+
tlb_change_page_size(tlb, PAGE_SIZE);
orig_pte = pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
flush_tlb_batched_pending(mm);
@@ -984,6 +991,12 @@ static long madvise_remove(struct vm_area_struct *vma,
if ((vma->vm_flags & (VM_SHARED|VM_WRITE)) != (VM_SHARED|VM_WRITE))
return -EACCES;

+ error = break_cow_pte_range(vma, start, end);
+ if (error < 0)
+ return error;
+ else if (error > 0)
+ return -ENOMEM;
+
offset = (loff_t)(start - vma->vm_start)
+ ((loff_t)vma->vm_pgoff << PAGE_SHIFT);

--
2.34.1

2023-04-14 14:29:22

by Chih-En Lin

[permalink] [raw]
Subject: [PATCH v5 15/17] events/uprobes: Break COW PTE before replacing page

Break COW PTE if we want to replace a page that resides in a COW-ed
PTE table.

Signed-off-by: Chih-En Lin <[email protected]>
---
kernel/events/uprobes.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 59887c69d54c..db6bfaab928d 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -156,7 +156,7 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
struct folio *old_folio = page_folio(old_page);
struct folio *new_folio;
struct mm_struct *mm = vma->vm_mm;
- DEFINE_FOLIO_VMA_WALK(pvmw, old_folio, vma, addr, 0);
+ DEFINE_FOLIO_VMA_WALK(pvmw, old_folio, vma, addr, PVMW_BREAK_COW_PTE);
int err;
struct mmu_notifier_range range;

--
2.34.1

2023-04-14 14:30:15

by Chih-En Lin

[permalink] [raw]
Subject: [PATCH v5 16/17] mm: fork: Enable COW PTE to fork system call

This patch enables the Copy-On-Write (COW) mechanism for the PTE tables
in the fork system call. To let a process do a COW PTE fork, use
prctl(PR_SET_COW_PTE); it sets the MMF_COW_PTE_READY flag on the
process so that COW PTE is enabled on the next fork.

The MMF_COW_PTE flag is used to distinguish the normal page table from
the COW one. Moreover, it is difficult to tell whether all the page
tables are out of the COW state, so the MMF_COW_PTE flag is not cleared
once set.
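
For illustration, user space is expected to opt in roughly as in the
sketch below. PR_SET_COW_PTE comes from this series' addition to
include/uapi/linux/prctl.h; passing zeros for the unused prctl()
arguments is an assumption, not something mandated by this patch.

#include <stdio.h>
#include <sys/prctl.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
        /* Sets MMF_COW_PTE_READY; the next fork() will use COW PTE. */
        if (prctl(PR_SET_COW_PTE, 0, 0, 0, 0))
                perror("prctl(PR_SET_COW_PTE)");

        if (fork() == 0) {
                /* Child: PTE tables start out shared COW with the parent. */
                _exit(0);
        }
        wait(NULL);
        return 0;
}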

Signed-off-by: Chih-En Lin <[email protected]>
---
kernel/fork.c | 7 +++++++
1 file changed, 7 insertions(+)

diff --git a/kernel/fork.c b/kernel/fork.c
index 0c92f224c68c..8452d5c4eb5e 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2679,6 +2679,13 @@ pid_t kernel_clone(struct kernel_clone_args *args)
trace = 0;
}

+#ifdef CONFIG_COW_PTE
+ if (current->mm && test_bit(MMF_COW_PTE_READY, &current->mm->flags)) {
+ clear_bit(MMF_COW_PTE_READY, &current->mm->flags);
+ set_bit(MMF_COW_PTE, &current->mm->flags);
+ }
+#endif
+
p = copy_process(NULL, trace, NUMA_NO_NODE, args);
add_latent_entropy();

--
2.34.1

2023-04-14 14:30:39

by Chih-En Lin

[permalink] [raw]
Subject: [PATCH v5 17/17] mm: Check the unexpected modification of COW-ed PTE

In most cases, we don't expect any write access to a COW-ed PTE table.
To catch unexpected modifications, add a new check to the page table
check.

However, there are still some valid reasons to modify a COW-ed PTE
table. Therefore, add enable/disable functions for the check.

Signed-off-by: Chih-En Lin <[email protected]>
---
arch/x86/include/asm/pgtable.h | 1 +
include/linux/page_table_check.h | 62 ++++++++++++++++++++++++++++++++
mm/memory.c | 4 +++
mm/page_table_check.c | 58 ++++++++++++++++++++++++++++++
4 files changed, 125 insertions(+)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 7425f32e5293..6b323c672e36 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -1022,6 +1022,7 @@ static inline pud_t native_local_pudp_get_and_clear(pud_t *pudp)
static inline void set_pte_at(struct mm_struct *mm, unsigned long addr,
pte_t *ptep, pte_t pte)
{
+ cowed_pte_table_check_modify(mm, addr, ptep, pte);
page_table_check_pte_set(mm, addr, ptep, pte);
set_pte(ptep, pte);
}
diff --git a/include/linux/page_table_check.h b/include/linux/page_table_check.h
index 01e16c7696ec..4a54dc454281 100644
--- a/include/linux/page_table_check.h
+++ b/include/linux/page_table_check.h
@@ -113,6 +113,54 @@ static inline void page_table_check_pte_clear_range(struct mm_struct *mm,
__page_table_check_pte_clear_range(mm, addr, pmd);
}

+#ifdef CONFIG_COW_PTE
+void __check_cowed_pte_table_enable(pte_t *ptep);
+void __check_cowed_pte_table_disable(pte_t *ptep);
+void __cowed_pte_table_check_modify(struct mm_struct *mm, unsigned long addr,
+ pte_t *ptep, pte_t pte);
+
+static inline void check_cowed_pte_table_enable(pte_t *ptep)
+{
+ if (static_branch_likely(&page_table_check_disabled))
+ return;
+
+ __check_cowed_pte_table_enable(ptep);
+}
+
+static inline void check_cowed_pte_table_disable(pte_t *ptep)
+{
+ if (static_branch_likely(&page_table_check_disabled))
+ return;
+
+ __check_cowed_pte_table_disable(ptep);
+}
+
+static inline void cowed_pte_table_check_modify(struct mm_struct *mm,
+ unsigned long addr,
+ pte_t *ptep, pte_t pte)
+{
+ if (static_branch_likely(&page_table_check_disabled))
+ return;
+
+ __cowed_pte_table_check_modify(mm, addr, ptep, pte);
+}
+#else
+static inline void check_cowed_pte_table_enable(pte_t *ptep)
+{
+}
+
+static inline void check_cowed_pte_table_disable(pte_t *ptep)
+{
+}
+
+static inline void cowed_pte_table_check_modify(struct mm_struct *mm,
+ unsigned long addr,
+ pte_t *ptep, pte_t pte)
+{
+}
+#endif /* CONFIG_COW_PTE */
+
+
#else

static inline void page_table_check_alloc(struct page *page, unsigned int order)
@@ -162,5 +210,19 @@ static inline void page_table_check_pte_clear_range(struct mm_struct *mm,
{
}

+static inline void check_cowed_pte_table_enable(pte_t *ptep)
+{
+}
+
+static inline void check_cowed_pte_table_disable(pte_t *ptep)
+{
+}
+
+static inline void cowed_pte_table_check_modify(struct mm_struct *mm,
+ unsigned long addr,
+ pte_t *ptep, pte_t pte)
+{
+}
+
#endif /* CONFIG_PAGE_TABLE_CHECK */
#endif /* __LINUX_PAGE_TABLE_CHECK_H */
diff --git a/mm/memory.c b/mm/memory.c
index 7908e20f802a..e62487413038 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1202,10 +1202,12 @@ copy_cow_pte_range(struct vm_area_struct *dst_vma,
* Although, parent's PTE is COW-ed, we should
* still need to handle all the swap stuffs.
*/
+ check_cowed_pte_table_disable(src_pte);
ret = copy_nonpresent_pte(dst_mm, src_mm,
src_pte, src_pte,
curr, curr,
addr, rss);
+ check_cowed_pte_table_enable(src_pte);
if (ret == -EIO) {
entry = pte_to_swp_entry(*src_pte);
break;
@@ -1223,8 +1225,10 @@ copy_cow_pte_range(struct vm_area_struct *dst_vma,
* copy_present_pte() will determine the mapped page
* should be COW mapping or not.
*/
+ check_cowed_pte_table_disable(src_pte);
ret = copy_present_pte(curr, curr, src_pte, src_pte,
addr, rss, NULL);
+ check_cowed_pte_table_enable(src_pte);
/*
* If we need a pre-allocated page for this pte,
* drop the lock, recover all the entries, fall
diff --git a/mm/page_table_check.c b/mm/page_table_check.c
index 25d8610c0042..5175c7476508 100644
--- a/mm/page_table_check.c
+++ b/mm/page_table_check.c
@@ -14,6 +14,9 @@
struct page_table_check {
atomic_t anon_map_count;
atomic_t file_map_count;
+#ifdef CONFIG_COW_PTE
+ atomic_t check_cowed_pte;
+#endif
};

static bool __page_table_check_enabled __initdata =
@@ -248,3 +251,58 @@ void __page_table_check_pte_clear_range(struct mm_struct *mm,
pte_unmap(ptep - PTRS_PER_PTE);
}
}
+
+#ifdef CONFIG_COW_PTE
+void __check_cowed_pte_table_enable(pte_t *ptep)
+{
+ struct page *page = pte_page(*ptep);
+ struct page_ext *page_ext = page_ext_get(page);
+ struct page_table_check *ptc = get_page_table_check(page_ext);
+
+ atomic_set(&ptc->check_cowed_pte, 1);
+ page_ext_put(page_ext);
+}
+
+void __check_cowed_pte_table_disable(pte_t *ptep)
+{
+ struct page *page = pte_page(*ptep);
+ struct page_ext *page_ext = page_ext_get(page);
+ struct page_table_check *ptc = get_page_table_check(page_ext);
+
+ atomic_set(&ptc->check_cowed_pte, 0);
+ page_ext_put(page_ext);
+}
+
+static int check_cowed_pte_table(pte_t *ptep)
+{
+ struct page *page = pte_page(*ptep);
+ struct page_ext *page_ext = page_ext_get(page);
+ struct page_table_check *ptc = get_page_table_check(page_ext);
+ int check = 0;
+
+ check = atomic_read(&ptc->check_cowed_pte);
+ page_ext_put(page_ext);
+
+ return check;
+}
+
+void __cowed_pte_table_check_modify(struct mm_struct *mm, unsigned long addr,
+ pte_t *ptep, pte_t pte)
+{
+ pgd_t *pgd;
+ p4d_t *p4d;
+ pud_t *pud;
+ pmd_t *pmd;
+
+ if (!test_bit(MMF_COW_PTE, &mm->flags) || !check_cowed_pte_table(ptep))
+ return;
+
+ pgd = pgd_offset(mm, addr);
+ p4d = p4d_offset(pgd, addr);
+ pud = pud_offset(p4d, addr);
+ pmd = pmd_offset(pud, addr);
+
+ if (!pmd_none(*pmd) && !pmd_write(*pmd) && cow_pte_count(pmd) > 1)
+ BUG_ON(!pte_same(*ptep, pte));
+}
+#endif
--
2.34.1