When per-vma locks were introduced in [1], several types of page faults
would still fall back to mmap_lock to keep the patchset simple. Among them
are swap and userfault pages. The main reason for skipping those cases was
that mmap_lock could be dropped while handling these faults, which required
additional logic to be implemented.
Implement a mechanism that allows the per-vma lock to be dropped for these
cases. When that happens, handle_mm_fault() returns the new
VM_FAULT_VMA_UNLOCKED vm_fault_reason bit along with VM_FAULT_RETRY to
indicate that the VMA lock was dropped. Naturally, once the VMA lock is
dropped, that VMA must be assumed unstable and can no longer be used.
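For illustration, an arch page fault handler calling handle_mm_fault() under
the per-vma lock ends up with roughly the following pattern (a sketch
distilled from the arch hunks in patch 4/6, error paths omitted):

	fault = handle_mm_fault(vma, addr, flags | FAULT_FLAG_VMA_LOCK, regs);
	if (!(fault & VM_FAULT_VMA_UNLOCKED))
		vma_end_read(vma);	/* VMA lock still held, drop it here */
	/* else: handle_mm_fault() already dropped it, vma must not be touched */
	if (!(fault & VM_FAULT_RETRY)) {
		count_vm_vma_lock_event(VMA_LOCK_SUCCESS);
		goto done;
	}
	/* VM_FAULT_RETRY: fall back to the mmap_lock path */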
Changes since v1 posted at [2]
- New patch 1/6 to remove do_poll parameter from read_swap_cache_async(),
per Huang Ying
- New patch 3/6 to separate VM_FAULT_COMPLETED addition,
per Alistair Popple
- Added comment for VM_FAULT_VMA_UNLOCKED in 4/6, per Alistair Popple
- New patch 6/6 to handle userfaults under VMA lock
Note: I tried implementing Matthew's suggestion in [3] to add vmf_end_read,
but that gets quite messy since it would require changing code for every
architecture whenever we change the handle_mm_fault() interface.
Note: patch 4/6 will cause a trivial merge conflict in arch/arm64/mm/fault.c
when applied over the mm-unstable branch due to a patch from the ARM64
tree [4] which is missing in mm-unstable.
[1] https://lore.kernel.org/all/[email protected]/
[2] https://lore.kernel.org/all/[email protected]/
[3] https://lore.kernel.org/all/[email protected]/
[4] https://lore.kernel.org/all/[email protected]/
Suren Baghdasaryan (6):
swap: remove remnants of polling from read_swap_cache_async
mm: handle swap page faults under VMA lock if page is uncontended
mm: add missing VM_FAULT_RESULT_TRACE name for VM_FAULT_COMPLETED
mm: drop VMA lock before waiting for migration
mm: implement folio wait under VMA lock
mm: handle userfaults under VMA lock
arch/arm64/mm/fault.c | 3 ++-
arch/powerpc/mm/fault.c | 3 ++-
arch/s390/mm/fault.c | 3 ++-
arch/x86/mm/fault.c | 3 ++-
fs/userfaultfd.c | 42 ++++++++++++++++++-----------------
include/linux/mm_types.h | 7 +++++-
include/linux/pagemap.h | 14 ++++++++----
mm/filemap.c | 37 +++++++++++++++++++------------
mm/madvise.c | 4 ++--
mm/memory.c | 48 ++++++++++++++++++++++------------------
mm/swap.h | 1 -
mm/swap_state.c | 12 +++++-----
12 files changed, 103 insertions(+), 74 deletions(-)
--
2.41.0.162.gfafddb0af9-goog
VM_FAULT_RESULT_TRACE should contain an element for every vm_fault_reason
to be used as flag_array inside trace_print_flags_seq(). The element
for VM_FAULT_COMPLETED is missing, add it.
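For context, the array is consumed via __print_flags(), which hands it to
trace_print_flags_seq() at print time; without this entry a VM_FAULT_COMPLETED
result is printed as a raw hex value instead of "COMPLETED". Illustrative
usage only (this trace event is hypothetical, not part of the patch):

	TP_printk("fault=%s",
		  __print_flags(__entry->fault, "|", VM_FAULT_RESULT_TRACE))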
Signed-off-by: Suren Baghdasaryan <[email protected]>
---
include/linux/mm_types.h | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 306a3d1a0fa6..79765e3dd8f3 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -1070,7 +1070,8 @@ enum vm_fault_reason {
{ VM_FAULT_RETRY, "RETRY" }, \
{ VM_FAULT_FALLBACK, "FALLBACK" }, \
{ VM_FAULT_DONE_COW, "DONE_COW" }, \
- { VM_FAULT_NEEDDSYNC, "NEEDDSYNC" }
+ { VM_FAULT_NEEDDSYNC, "NEEDDSYNC" }, \
+ { VM_FAULT_COMPLETED, "COMPLETED" }
struct vm_special_mapping {
const char *name; /* The name, e.g. "[vdso]". */
--
2.41.0.162.gfafddb0af9-goog
Commit [1] introduced IO polling support during swapin to reduce
swap read latency for block devices that can be polled. However, later
commit [2] removed polling support. Therefore it seems safe to remove the
do_poll parameter from read_swap_cache_async() and to always call
swap_readpage() with synchronous=false, waiting for IO completion in
folio_lock_or_retry().
[1] commit 23955622ff8d ("swap: add block io poll in swapin path")
[2] commit 9650b453a3d4 ("block: ignore RWF_HIPRI hint for sync dio")
Suggested-by: Huang Ying <[email protected]>
Signed-off-by: Suren Baghdasaryan <[email protected]>
---
mm/madvise.c | 4 ++--
mm/swap.h | 1 -
mm/swap_state.c | 12 +++++-------
3 files changed, 7 insertions(+), 10 deletions(-)
diff --git a/mm/madvise.c b/mm/madvise.c
index b5ffbaf616f5..b1e8adf1234e 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -215,7 +215,7 @@ static int swapin_walk_pmd_entry(pmd_t *pmd, unsigned long start,
continue;
page = read_swap_cache_async(entry, GFP_HIGHUSER_MOVABLE,
- vma, index, false, &splug);
+ vma, index, &splug);
if (page)
put_page(page);
}
@@ -252,7 +252,7 @@ static void force_shm_swapin_readahead(struct vm_area_struct *vma,
rcu_read_unlock();
page = read_swap_cache_async(swap, GFP_HIGHUSER_MOVABLE,
- NULL, 0, false, &splug);
+ NULL, 0, &splug);
if (page)
put_page(page);
diff --git a/mm/swap.h b/mm/swap.h
index 7c033d793f15..8a3c7a0ace4f 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -46,7 +46,6 @@ struct folio *filemap_get_incore_folio(struct address_space *mapping,
struct page *read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
struct vm_area_struct *vma,
unsigned long addr,
- bool do_poll,
struct swap_iocb **plug);
struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
struct vm_area_struct *vma,
diff --git a/mm/swap_state.c b/mm/swap_state.c
index b76a65ac28b3..a3839de71f3f 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -517,15 +517,14 @@ struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
*/
struct page *read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
struct vm_area_struct *vma,
- unsigned long addr, bool do_poll,
- struct swap_iocb **plug)
+ unsigned long addr, struct swap_iocb **plug)
{
bool page_was_allocated;
struct page *retpage = __read_swap_cache_async(entry, gfp_mask,
vma, addr, &page_was_allocated);
if (page_was_allocated)
- swap_readpage(retpage, do_poll, plug);
+ swap_readpage(retpage, false, plug);
return retpage;
}
@@ -620,7 +619,7 @@ struct page *swap_cluster_readahead(swp_entry_t entry, gfp_t gfp_mask,
struct swap_info_struct *si = swp_swap_info(entry);
struct blk_plug plug;
struct swap_iocb *splug = NULL;
- bool do_poll = true, page_allocated;
+ bool page_allocated;
struct vm_area_struct *vma = vmf->vma;
unsigned long addr = vmf->address;
@@ -628,7 +627,6 @@ struct page *swap_cluster_readahead(swp_entry_t entry, gfp_t gfp_mask,
if (!mask)
goto skip;
- do_poll = false;
/* Read a page_cluster sized and aligned cluster around offset. */
start_offset = offset & ~mask;
end_offset = offset | mask;
@@ -660,7 +658,7 @@ struct page *swap_cluster_readahead(swp_entry_t entry, gfp_t gfp_mask,
lru_add_drain(); /* Push any new pages onto the LRU now */
skip:
/* The page was likely read above, so no need for plugging here */
- return read_swap_cache_async(entry, gfp_mask, vma, addr, do_poll, NULL);
+ return read_swap_cache_async(entry, gfp_mask, vma, addr, NULL);
}
int init_swap_address_space(unsigned int type, unsigned long nr_pages)
@@ -825,7 +823,7 @@ static struct page *swap_vma_readahead(swp_entry_t fentry, gfp_t gfp_mask,
skip:
/* The page was likely read above, so no need for plugging here */
return read_swap_cache_async(fentry, gfp_mask, vma, vmf->address,
- ra_info.win == 1, NULL);
+ NULL);
}
/**
--
2.41.0.162.gfafddb0af9-goog
When a page fault is handled under VMA lock protection, all swap page
faults are retried with mmap_lock because the folio_lock_or_retry()
implementation has to drop and reacquire mmap_lock if the folio could
not be immediately locked.
Instead of retrying all swap page faults, retry only when folio
locking fails.
Note that the only time do_swap_page() calls synchronous swap_readpage()
is when SWP_SYNCHRONOUS_IO is set, which is only set for
QUEUE_FLAG_SYNCHRONOUS devices: brd, zram and nvdimms (both btt and
pmem). Therefore we don't sleep in this path, and there's no need to
drop the mmap or per-vma lock.
Drivers implementing ops->migrate_to_ram might still rely on mmap_lock;
therefore fall back to mmap_lock in this case.
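For reference, the only synchronous read in do_swap_page() looks roughly
like this (a paraphrased sketch of the existing code, not a verbatim quote
of the tree this applies to):

	if (data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
	    __swap_count(entry) == 1) {
		/* skip the swapcache; the device completes the read
		 * without sleeping */
		swap_readpage(page, true, NULL);
	}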
Signed-off-by: Suren Baghdasaryan <[email protected]>
---
mm/filemap.c | 6 ++++++
mm/memory.c | 14 +++++++++-----
2 files changed, 15 insertions(+), 5 deletions(-)
diff --git a/mm/filemap.c b/mm/filemap.c
index b4c9bd368b7e..7cb0a3776a07 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1706,6 +1706,8 @@ static int __folio_lock_async(struct folio *folio, struct wait_page_queue *wait)
* mmap_lock has been released (mmap_read_unlock(), unless flags had both
* FAULT_FLAG_ALLOW_RETRY and FAULT_FLAG_RETRY_NOWAIT set, in
* which case mmap_lock is still held.
+ * If flags had FAULT_FLAG_VMA_LOCK set, meaning the operation is performed
+ * with VMA lock only, the VMA lock is still held.
*
* If neither ALLOW_RETRY nor KILLABLE are set, will always return true
* with the folio locked and the mmap_lock unperturbed.
@@ -1713,6 +1715,10 @@ static int __folio_lock_async(struct folio *folio, struct wait_page_queue *wait)
bool __folio_lock_or_retry(struct folio *folio, struct mm_struct *mm,
unsigned int flags)
{
+ /* Can't do this if not holding mmap_lock */
+ if (flags & FAULT_FLAG_VMA_LOCK)
+ return false;
+
if (fault_flag_allow_retry_first(flags)) {
/*
* CAUTION! In this case, mmap_lock is not released
diff --git a/mm/memory.c b/mm/memory.c
index f69fbc251198..41f45819a923 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3711,11 +3711,6 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
if (!pte_unmap_same(vmf))
goto out;
- if (vmf->flags & FAULT_FLAG_VMA_LOCK) {
- ret = VM_FAULT_RETRY;
- goto out;
- }
-
entry = pte_to_swp_entry(vmf->orig_pte);
if (unlikely(non_swap_entry(entry))) {
if (is_migration_entry(entry)) {
@@ -3725,6 +3720,15 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
vmf->page = pfn_swap_entry_to_page(entry);
ret = remove_device_exclusive_entry(vmf);
} else if (is_device_private_entry(entry)) {
+ if (vmf->flags & FAULT_FLAG_VMA_LOCK) {
+ /*
+ * migrate_to_ram is not yet ready to operate
+ * under VMA lock.
+ */
+ ret |= VM_FAULT_RETRY;
+ goto out;
+ }
+
vmf->page = pfn_swap_entry_to_page(entry);
vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
vmf->address, &vmf->ptl);
--
2.41.0.162.gfafddb0af9-goog
Follow the same pattern as mmap_lock when waiting for a folio: drop the
VMA lock before the wait and retry once the folio is available.
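The heart of the change, condensed from the __folio_lock_or_retry() hunk
below (a sketch of the wait path only, not the full function):

	if (flags & FAULT_FLAG_VMA_LOCK)
		vma_end_read(vma);
	else
		mmap_read_unlock(vma->vm_mm);
	folio_wait_locked(folio);	/* or the killable variant */
	*lock_dropped = true;		/* callers map this to VM_FAULT_VMA_UNLOCKED */
	return false;			/* caller retries the fault */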
Signed-off-by: Suren Baghdasaryan <[email protected]>
---
include/linux/pagemap.h | 14 ++++++++++----
mm/filemap.c | 43 ++++++++++++++++++++++-------------------
mm/memory.c | 13 ++++++++-----
3 files changed, 41 insertions(+), 29 deletions(-)
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index a56308a9d1a4..6c9493314c21 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -896,8 +896,8 @@ static inline bool wake_page_match(struct wait_page_queue *wait_page,
void __folio_lock(struct folio *folio);
int __folio_lock_killable(struct folio *folio);
-bool __folio_lock_or_retry(struct folio *folio, struct mm_struct *mm,
- unsigned int flags);
+bool __folio_lock_or_retry(struct folio *folio, struct vm_area_struct *vma,
+ unsigned int flags, bool *lock_dropped);
void unlock_page(struct page *page);
void folio_unlock(struct folio *folio);
@@ -1002,10 +1002,16 @@ static inline int folio_lock_killable(struct folio *folio)
* __folio_lock_or_retry().
*/
static inline bool folio_lock_or_retry(struct folio *folio,
- struct mm_struct *mm, unsigned int flags)
+ struct vm_area_struct *vma, unsigned int flags,
+ bool *lock_dropped)
{
might_sleep();
- return folio_trylock(folio) || __folio_lock_or_retry(folio, mm, flags);
+ if (folio_trylock(folio)) {
+ *lock_dropped = false;
+ return true;
+ }
+
+ return __folio_lock_or_retry(folio, vma, flags, lock_dropped);
}
/*
diff --git a/mm/filemap.c b/mm/filemap.c
index 7cb0a3776a07..838955635fbc 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1701,37 +1701,35 @@ static int __folio_lock_async(struct folio *folio, struct wait_page_queue *wait)
/*
* Return values:
- * true - folio is locked; mmap_lock is still held.
+ * true - folio is locked.
* false - folio is not locked.
- * mmap_lock has been released (mmap_read_unlock(), unless flags had both
- * FAULT_FLAG_ALLOW_RETRY and FAULT_FLAG_RETRY_NOWAIT set, in
- * which case mmap_lock is still held.
- * If flags had FAULT_FLAG_VMA_LOCK set, meaning the operation is performed
- * with VMA lock only, the VMA lock is still held.
+ *
+ * lock_dropped indicates whether mmap_lock/VMA lock got dropped.
+ * mmap_lock/VMA lock is dropped when function fails to lock the folio,
+ * unless flags had both FAULT_FLAG_ALLOW_RETRY and FAULT_FLAG_RETRY_NOWAIT
+ * set, in which case mmap_lock/VMA lock is still held.
*
* If neither ALLOW_RETRY nor KILLABLE are set, will always return true
- * with the folio locked and the mmap_lock unperturbed.
+ * with the folio locked and the mmap_lock/VMA lock unperturbed.
*/
-bool __folio_lock_or_retry(struct folio *folio, struct mm_struct *mm,
- unsigned int flags)
+bool __folio_lock_or_retry(struct folio *folio, struct vm_area_struct *vma,
+ unsigned int flags, bool *lock_dropped)
{
- /* Can't do this if not holding mmap_lock */
- if (flags & FAULT_FLAG_VMA_LOCK)
- return false;
-
if (fault_flag_allow_retry_first(flags)) {
- /*
- * CAUTION! In this case, mmap_lock is not released
- * even though return 0.
- */
- if (flags & FAULT_FLAG_RETRY_NOWAIT)
+ if (flags & FAULT_FLAG_RETRY_NOWAIT) {
+ *lock_dropped = false;
return false;
+ }
- mmap_read_unlock(mm);
+ if (flags & FAULT_FLAG_VMA_LOCK)
+ vma_end_read(vma);
+ else
+ mmap_read_unlock(vma->vm_mm);
if (flags & FAULT_FLAG_KILLABLE)
folio_wait_locked_killable(folio);
else
folio_wait_locked(folio);
+ *lock_dropped = true;
return false;
}
if (flags & FAULT_FLAG_KILLABLE) {
@@ -1739,13 +1737,18 @@ bool __folio_lock_or_retry(struct folio *folio, struct mm_struct *mm,
ret = __folio_lock_killable(folio);
if (ret) {
- mmap_read_unlock(mm);
+ if (flags & FAULT_FLAG_VMA_LOCK)
+ vma_end_read(vma);
+ else
+ mmap_read_unlock(vma->vm_mm);
+ *lock_dropped = true;
return false;
}
} else {
__folio_lock(folio);
}
+ *lock_dropped = false;
return true;
}
diff --git a/mm/memory.c b/mm/memory.c
index c234f8085f1e..acb09a3aad53 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3568,6 +3568,7 @@ static vm_fault_t remove_device_exclusive_entry(struct vm_fault *vmf)
struct folio *folio = page_folio(vmf->page);
struct vm_area_struct *vma = vmf->vma;
struct mmu_notifier_range range;
+ bool lock_dropped;
/*
* We need a reference to lock the folio because we don't hold
@@ -3580,8 +3581,10 @@ static vm_fault_t remove_device_exclusive_entry(struct vm_fault *vmf)
if (!folio_try_get(folio))
return 0;
- if (!folio_lock_or_retry(folio, vma->vm_mm, vmf->flags)) {
+ if (!folio_lock_or_retry(folio, vma, vmf->flags, &lock_dropped)) {
folio_put(folio);
+ if (lock_dropped && vmf->flags & FAULT_FLAG_VMA_LOCK)
+ return VM_FAULT_VMA_UNLOCKED | VM_FAULT_RETRY;
return VM_FAULT_RETRY;
}
mmu_notifier_range_init_owner(&range, MMU_NOTIFY_EXCLUSIVE, 0,
@@ -3704,7 +3707,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
bool exclusive = false;
swp_entry_t entry;
pte_t pte;
- int locked;
+ bool lock_dropped;
vm_fault_t ret = 0;
void *shadow = NULL;
@@ -3837,9 +3840,9 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
goto out_release;
}
- locked = folio_lock_or_retry(folio, vma->vm_mm, vmf->flags);
-
- if (!locked) {
+ if (!folio_lock_or_retry(folio, vma, vmf->flags, &lock_dropped)) {
+ if (lock_dropped && vmf->flags & FAULT_FLAG_VMA_LOCK)
+ ret |= VM_FAULT_VMA_UNLOCKED;
ret |= VM_FAULT_RETRY;
goto out_release;
}
--
2.41.0.162.gfafddb0af9-goog
Enable handle_userfault() to operate under the VMA lock by releasing the
VMA lock instead of mmap_lock and retrying with VM_FAULT_VMA_UNLOCKED set.
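In short, where handle_userfault() used to unconditionally call
mmap_read_unlock(mm) before waiting, it now does the following (condensed
from the fs/userfaultfd.c hunk below):

	if (vmf->flags & FAULT_FLAG_VMA_LOCK) {
		vma_end_read(vma);
		/* WARNING: VMA can't be used after this */
		ret |= VM_FAULT_VMA_UNLOCKED;
	} else
		mmap_read_unlock(mm);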
Signed-off-by: Suren Baghdasaryan <[email protected]>
---
fs/userfaultfd.c | 42 ++++++++++++++++++++++--------------------
mm/memory.c | 9 ---------
2 files changed, 22 insertions(+), 29 deletions(-)
diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 0fd96d6e39ce..23c3a4ce45d9 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -277,17 +277,17 @@ static inline struct uffd_msg userfault_msg(unsigned long address,
* hugepmd ranges.
*/
static inline bool userfaultfd_huge_must_wait(struct userfaultfd_ctx *ctx,
- struct vm_area_struct *vma,
- unsigned long address,
- unsigned long flags,
- unsigned long reason)
+ struct vm_fault *vmf,
+ unsigned long reason)
{
+ struct vm_area_struct *vma = vmf->vma;
pte_t *ptep, pte;
bool ret = true;
- mmap_assert_locked(ctx->mm);
+ if (!(vmf->flags & FAULT_FLAG_VMA_LOCK))
+ mmap_assert_locked(ctx->mm);
- ptep = hugetlb_walk(vma, address, vma_mmu_pagesize(vma));
+ ptep = hugetlb_walk(vma, vmf->address, vma_mmu_pagesize(vma));
if (!ptep)
goto out;
@@ -308,10 +308,8 @@ static inline bool userfaultfd_huge_must_wait(struct userfaultfd_ctx *ctx,
}
#else
static inline bool userfaultfd_huge_must_wait(struct userfaultfd_ctx *ctx,
- struct vm_area_struct *vma,
- unsigned long address,
- unsigned long flags,
- unsigned long reason)
+ struct vm_fault *vmf,
+ unsigned long reason)
{
return false; /* should never get here */
}
@@ -325,11 +323,11 @@ static inline bool userfaultfd_huge_must_wait(struct userfaultfd_ctx *ctx,
* threads.
*/
static inline bool userfaultfd_must_wait(struct userfaultfd_ctx *ctx,
- unsigned long address,
- unsigned long flags,
+ struct vm_fault *vmf,
unsigned long reason)
{
struct mm_struct *mm = ctx->mm;
+ unsigned long address = vmf->address;
pgd_t *pgd;
p4d_t *p4d;
pud_t *pud;
@@ -337,7 +335,8 @@ static inline bool userfaultfd_must_wait(struct userfaultfd_ctx *ctx,
pte_t *pte;
bool ret = true;
- mmap_assert_locked(mm);
+ if (!(vmf->flags & FAULT_FLAG_VMA_LOCK))
+ mmap_assert_locked(mm);
pgd = pgd_offset(mm, address);
if (!pgd_present(*pgd))
@@ -445,7 +444,8 @@ vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason)
* Coredumping runs without mmap_lock so we can only check that
* the mmap_lock is held, if PF_DUMPCORE was not set.
*/
- mmap_assert_locked(mm);
+ if (!(vmf->flags & FAULT_FLAG_VMA_LOCK))
+ mmap_assert_locked(mm);
ctx = vma->vm_userfaultfd_ctx.ctx;
if (!ctx)
@@ -561,15 +561,17 @@ vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason)
spin_unlock_irq(&ctx->fault_pending_wqh.lock);
if (!is_vm_hugetlb_page(vma))
- must_wait = userfaultfd_must_wait(ctx, vmf->address, vmf->flags,
- reason);
+ must_wait = userfaultfd_must_wait(ctx, vmf, reason);
else
- must_wait = userfaultfd_huge_must_wait(ctx, vma,
- vmf->address,
- vmf->flags, reason);
+ must_wait = userfaultfd_huge_must_wait(ctx, vmf, reason);
if (is_vm_hugetlb_page(vma))
hugetlb_vma_unlock_read(vma);
- mmap_read_unlock(mm);
+ if (vmf->flags & FAULT_FLAG_VMA_LOCK) {
+ vma_end_read(vma);
+ /* WARNING: VMA can't be used after this */
+ ret |= VM_FAULT_VMA_UNLOCKED;
+ } else
+ mmap_read_unlock(mm);
if (likely(must_wait && !READ_ONCE(ctx->released))) {
wake_up_poll(&ctx->fd_wqh, EPOLLIN);
diff --git a/mm/memory.c b/mm/memory.c
index acb09a3aad53..b2ea015dcb87 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -5306,15 +5306,6 @@ struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
if (!vma_start_read(vma))
goto inval;
- /*
- * Due to the possibility of userfault handler dropping mmap_lock, avoid
- * it for now and fall back to page fault handling under mmap_lock.
- */
- if (userfaultfd_armed(vma)) {
- vma_end_read(vma);
- goto inval;
- }
-
/* Check since vm_start/vm_end might change before we lock the VMA */
if (unlikely(address < vma->vm_start || address >= vma->vm_end)) {
vma_end_read(vma);
--
2.41.0.162.gfafddb0af9-goog
migration_entry_wait() does not need the VMA lock, therefore it can be
dropped before waiting. Introduce VM_FAULT_VMA_UNLOCKED to indicate that the
VMA lock was dropped while in handle_mm_fault().
Note that once the VMA lock is dropped, the VMA reference can't be used, as
there are no guarantees it was not freed.
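In short, the migration branch of do_swap_page() becomes (condensed from
the mm/memory.c hunk below):

	struct mm_struct *mm = vma->vm_mm;	/* saved in case the lock is dropped */

	if (vmf->flags & FAULT_FLAG_VMA_LOCK) {
		/* No need to hold the VMA lock just to wait */
		vma_end_read(vma);
		/* WARNING: VMA can't be used after this */
		ret |= VM_FAULT_VMA_UNLOCKED;
	}
	migration_entry_wait(mm, vmf->pmd, vmf->address);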
Signed-off-by: Suren Baghdasaryan <[email protected]>
---
arch/arm64/mm/fault.c | 3 ++-
arch/powerpc/mm/fault.c | 3 ++-
arch/s390/mm/fault.c | 3 ++-
arch/x86/mm/fault.c | 3 ++-
include/linux/mm_types.h | 6 +++++-
mm/memory.c | 12 ++++++++++--
6 files changed, 23 insertions(+), 7 deletions(-)
diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
index 6045a5117ac1..8f59badbffb5 100644
--- a/arch/arm64/mm/fault.c
+++ b/arch/arm64/mm/fault.c
@@ -601,7 +601,8 @@ static int __kprobes do_page_fault(unsigned long far, unsigned long esr,
goto lock_mmap;
}
fault = handle_mm_fault(vma, addr, mm_flags | FAULT_FLAG_VMA_LOCK, regs);
- vma_end_read(vma);
+ if (!(fault & VM_FAULT_VMA_UNLOCKED))
+ vma_end_read(vma);
if (!(fault & VM_FAULT_RETRY)) {
count_vm_vma_lock_event(VMA_LOCK_SUCCESS);
diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
index 531177a4ee08..b27730f07141 100644
--- a/arch/powerpc/mm/fault.c
+++ b/arch/powerpc/mm/fault.c
@@ -494,7 +494,8 @@ static int ___do_page_fault(struct pt_regs *regs, unsigned long address,
}
fault = handle_mm_fault(vma, address, flags | FAULT_FLAG_VMA_LOCK, regs);
- vma_end_read(vma);
+ if (!(fault & VM_FAULT_VMA_UNLOCKED))
+ vma_end_read(vma);
if (!(fault & VM_FAULT_RETRY)) {
count_vm_vma_lock_event(VMA_LOCK_SUCCESS);
diff --git a/arch/s390/mm/fault.c b/arch/s390/mm/fault.c
index b65144c392b0..cc923dbb0821 100644
--- a/arch/s390/mm/fault.c
+++ b/arch/s390/mm/fault.c
@@ -418,7 +418,8 @@ static inline vm_fault_t do_exception(struct pt_regs *regs, int access)
goto lock_mmap;
}
fault = handle_mm_fault(vma, address, flags | FAULT_FLAG_VMA_LOCK, regs);
- vma_end_read(vma);
+ if (!(fault & VM_FAULT_VMA_UNLOCKED))
+ vma_end_read(vma);
if (!(fault & VM_FAULT_RETRY)) {
count_vm_vma_lock_event(VMA_LOCK_SUCCESS);
goto out;
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index e4399983c50c..ef62ab2fd211 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -1347,7 +1347,8 @@ void do_user_addr_fault(struct pt_regs *regs,
goto lock_mmap;
}
fault = handle_mm_fault(vma, address, flags | FAULT_FLAG_VMA_LOCK, regs);
- vma_end_read(vma);
+ if (!(fault & VM_FAULT_VMA_UNLOCKED))
+ vma_end_read(vma);
if (!(fault & VM_FAULT_RETRY)) {
count_vm_vma_lock_event(VMA_LOCK_SUCCESS);
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 79765e3dd8f3..bd6b95c82f7a 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -1030,6 +1030,8 @@ typedef __bitwise unsigned int vm_fault_t;
* fsync() to complete (for synchronous page faults
* in DAX)
* @VM_FAULT_COMPLETED: ->fault completed, meanwhile mmap lock released
+ * @VM_FAULT_VMA_UNLOCKED: VMA lock was released, vmf->vma should no longer
+ * be accessed
* @VM_FAULT_HINDEX_MASK: mask HINDEX value
*
*/
@@ -1047,6 +1049,7 @@ enum vm_fault_reason {
VM_FAULT_DONE_COW = (__force vm_fault_t)0x001000,
VM_FAULT_NEEDDSYNC = (__force vm_fault_t)0x002000,
VM_FAULT_COMPLETED = (__force vm_fault_t)0x004000,
+ VM_FAULT_VMA_UNLOCKED = (__force vm_fault_t)0x008000,
VM_FAULT_HINDEX_MASK = (__force vm_fault_t)0x0f0000,
};
@@ -1071,7 +1074,8 @@ enum vm_fault_reason {
{ VM_FAULT_FALLBACK, "FALLBACK" }, \
{ VM_FAULT_DONE_COW, "DONE_COW" }, \
{ VM_FAULT_NEEDDSYNC, "NEEDDSYNC" }, \
- { VM_FAULT_COMPLETED, "COMPLETED" }
+ { VM_FAULT_COMPLETED, "COMPLETED" }, \
+ { VM_FAULT_VMA_UNLOCKED, "VMA_UNLOCKED" }
struct vm_special_mapping {
const char *name; /* The name, e.g. "[vdso]". */
diff --git a/mm/memory.c b/mm/memory.c
index 41f45819a923..c234f8085f1e 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3714,8 +3714,16 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
entry = pte_to_swp_entry(vmf->orig_pte);
if (unlikely(non_swap_entry(entry))) {
if (is_migration_entry(entry)) {
- migration_entry_wait(vma->vm_mm, vmf->pmd,
- vmf->address);
+ /* Save mm in case VMA lock is dropped */
+ struct mm_struct *mm = vma->vm_mm;
+
+ if (vmf->flags & FAULT_FLAG_VMA_LOCK) {
+ /* No need to hold VMA lock for migration */
+ vma_end_read(vma);
+ /* WARNING: VMA can't be used after this */
+ ret |= VM_FAULT_VMA_UNLOCKED;
+ }
+ migration_entry_wait(mm, vmf->pmd, vmf->address);
} else if (is_device_exclusive_entry(entry)) {
vmf->page = pfn_swap_entry_to_page(entry);
ret = remove_device_exclusive_entry(vmf);
--
2.41.0.162.gfafddb0af9-goog
+ Ming Lei for confirmation.
Suren Baghdasaryan <[email protected]> writes:
> Commit [1] introduced IO polling support during swapin to reduce
> swap read latency for block devices that can be polled. However later
> commit [2] removed polling support. Therefore it seems safe to remove
> do_poll parameter in read_swap_cache_async and always call swap_readpage
> with synchronous=false waiting for IO completion in folio_lock_or_retry.
>
> [1] commit 23955622ff8d ("swap: add block io poll in swapin path")
> [2] commit 9650b453a3d4 ("block: ignore RWF_HIPRI hint for sync dio")
>
> Suggested-by: Huang Ying <[email protected]>
> Signed-off-by: Suren Baghdasaryan <[email protected]>
Looks good to me! Thanks!
Reviewed-by: "Huang, Ying" <[email protected]>
> ---
> mm/madvise.c | 4 ++--
> mm/swap.h | 1 -
> mm/swap_state.c | 12 +++++-------
> 3 files changed, 7 insertions(+), 10 deletions(-)
>
> diff --git a/mm/madvise.c b/mm/madvise.c
> index b5ffbaf616f5..b1e8adf1234e 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -215,7 +215,7 @@ static int swapin_walk_pmd_entry(pmd_t *pmd, unsigned long start,
> continue;
>
> page = read_swap_cache_async(entry, GFP_HIGHUSER_MOVABLE,
> - vma, index, false, &splug);
> + vma, index, &splug);
> if (page)
> put_page(page);
> }
> @@ -252,7 +252,7 @@ static void force_shm_swapin_readahead(struct vm_area_struct *vma,
> rcu_read_unlock();
>
> page = read_swap_cache_async(swap, GFP_HIGHUSER_MOVABLE,
> - NULL, 0, false, &splug);
> + NULL, 0, &splug);
> if (page)
> put_page(page);
>
> diff --git a/mm/swap.h b/mm/swap.h
> index 7c033d793f15..8a3c7a0ace4f 100644
> --- a/mm/swap.h
> +++ b/mm/swap.h
> @@ -46,7 +46,6 @@ struct folio *filemap_get_incore_folio(struct address_space *mapping,
> struct page *read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
> struct vm_area_struct *vma,
> unsigned long addr,
> - bool do_poll,
> struct swap_iocb **plug);
> struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
> struct vm_area_struct *vma,
> diff --git a/mm/swap_state.c b/mm/swap_state.c
> index b76a65ac28b3..a3839de71f3f 100644
> --- a/mm/swap_state.c
> +++ b/mm/swap_state.c
> @@ -517,15 +517,14 @@ struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
> */
> struct page *read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
> struct vm_area_struct *vma,
> - unsigned long addr, bool do_poll,
> - struct swap_iocb **plug)
> + unsigned long addr, struct swap_iocb **plug)
> {
> bool page_was_allocated;
> struct page *retpage = __read_swap_cache_async(entry, gfp_mask,
> vma, addr, &page_was_allocated);
>
> if (page_was_allocated)
> - swap_readpage(retpage, do_poll, plug);
> + swap_readpage(retpage, false, plug);
>
> return retpage;
> }
> @@ -620,7 +619,7 @@ struct page *swap_cluster_readahead(swp_entry_t entry, gfp_t gfp_mask,
> struct swap_info_struct *si = swp_swap_info(entry);
> struct blk_plug plug;
> struct swap_iocb *splug = NULL;
> - bool do_poll = true, page_allocated;
> + bool page_allocated;
> struct vm_area_struct *vma = vmf->vma;
> unsigned long addr = vmf->address;
>
> @@ -628,7 +627,6 @@ struct page *swap_cluster_readahead(swp_entry_t entry, gfp_t gfp_mask,
> if (!mask)
> goto skip;
>
> - do_poll = false;
> /* Read a page_cluster sized and aligned cluster around offset. */
> start_offset = offset & ~mask;
> end_offset = offset | mask;
> @@ -660,7 +658,7 @@ struct page *swap_cluster_readahead(swp_entry_t entry, gfp_t gfp_mask,
> lru_add_drain(); /* Push any new pages onto the LRU now */
> skip:
> /* The page was likely read above, so no need for plugging here */
> - return read_swap_cache_async(entry, gfp_mask, vma, addr, do_poll, NULL);
> + return read_swap_cache_async(entry, gfp_mask, vma, addr, NULL);
> }
>
> int init_swap_address_space(unsigned int type, unsigned long nr_pages)
> @@ -825,7 +823,7 @@ static struct page *swap_vma_readahead(swp_entry_t fentry, gfp_t gfp_mask,
> skip:
> /* The page was likely read above, so no need for plugging here */
> return read_swap_cache_async(fentry, gfp_mask, vma, vmf->address,
> - ra_info.win == 1, NULL);
> + NULL);
> }
>
> /**
On Fri, Jun 9, 2023 at 9:58 AM Huang, Ying <[email protected]> wrote:
>
> + Ming Lei for confirmation.
Good catch, it isn't necessary to pass the polling parameter now.
Thanks,
On Thu, Jun 08, 2023 at 05:51:57PM -0700, Suren Baghdasaryan wrote:
> static inline bool folio_lock_or_retry(struct folio *folio,
> - struct mm_struct *mm, unsigned int flags)
> + struct vm_area_struct *vma, unsigned int flags,
> + bool *lock_dropped)
I hate these double-return-value functions.
How about this for an API:
vm_fault_t folio_lock_fault(struct folio *folio, struct vm_fault *vmf)
{
might_sleep();
if (folio_trylock(folio))
return 0;
return __folio_lock_fault(folio, vmf);
}
Then the users look like ...
> @@ -3580,8 +3581,10 @@ static vm_fault_t remove_device_exclusive_entry(struct vm_fault *vmf)
> if (!folio_try_get(folio))
> return 0;
>
> - if (!folio_lock_or_retry(folio, vma->vm_mm, vmf->flags)) {
> + if (!folio_lock_or_retry(folio, vma, vmf->flags, &lock_dropped)) {
> folio_put(folio);
> + if (lock_dropped && vmf->flags & FAULT_FLAG_VMA_LOCK)
> + return VM_FAULT_VMA_UNLOCKED | VM_FAULT_RETRY;
> return VM_FAULT_RETRY;
> }
ret = folio_lock_fault(folio, vmf);
if (ret)
return ret;
> @@ -3837,9 +3840,9 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> goto out_release;
> }
>
> - locked = folio_lock_or_retry(folio, vma->vm_mm, vmf->flags);
> -
> - if (!locked) {
> + if (!folio_lock_or_retry(folio, vma, vmf->flags, &lock_dropped)) {
> + if (lock_dropped && vmf->flags & FAULT_FLAG_VMA_LOCK)
> + ret |= VM_FAULT_VMA_UNLOCKED;
> ret |= VM_FAULT_RETRY;
> goto out_release;
> }
ret |= folio_lock_fault(folio, vmf);
if (ret & VM_FAULT_RETRY)
goto out_release;
ie instead of trying to reconstruct what __folio_lock_fault() did from
its outputs, we just let folio_lock_fault() tell us what it did.
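For concreteness, __folio_lock_fault() under this suggestion could look
roughly like the following. This is an illustrative sketch derived from the
__folio_lock_or_retry() changes in this patch, not code from the thread:

	vm_fault_t __folio_lock_fault(struct folio *folio, struct vm_fault *vmf)
	{
		struct vm_area_struct *vma = vmf->vma;
		unsigned int flags = vmf->flags;
		vm_fault_t vma_bit = (flags & FAULT_FLAG_VMA_LOCK) ?
				      VM_FAULT_VMA_UNLOCKED : 0;

		if (fault_flag_allow_retry_first(flags)) {
			if (flags & FAULT_FLAG_RETRY_NOWAIT)
				return VM_FAULT_RETRY;	/* no lock dropped */

			if (flags & FAULT_FLAG_VMA_LOCK)
				vma_end_read(vma);
			else
				mmap_read_unlock(vma->vm_mm);
			if (flags & FAULT_FLAG_KILLABLE)
				folio_wait_locked_killable(folio);
			else
				folio_wait_locked(folio);
			return VM_FAULT_RETRY | vma_bit;
		}
		if (flags & FAULT_FLAG_KILLABLE) {
			if (__folio_lock_killable(folio)) {
				if (flags & FAULT_FLAG_VMA_LOCK)
					vma_end_read(vma);
				else
					mmap_read_unlock(vma->vm_mm);
				return VM_FAULT_RETRY | vma_bit;
			}
		} else {
			__folio_lock(folio);
		}
		return 0;	/* folio locked, lock untouched */
	}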
On Fri, Jun 9, 2023 at 11:49 AM Suren Baghdasaryan <[email protected]> wrote:
>
> On Fri, Jun 9, 2023 at 8:03 AM Matthew Wilcox <[email protected]> wrote:
> >
> > On Thu, Jun 08, 2023 at 05:51:57PM -0700, Suren Baghdasaryan wrote:
> > > static inline bool folio_lock_or_retry(struct folio *folio,
> > > - struct mm_struct *mm, unsigned int flags)
> > > + struct vm_area_struct *vma, unsigned int flags,
> > > + bool *lock_dropped)
> >
> > I hate these double-return-value functions.
> >
> > How about this for an API:
> >
> > vm_fault_t folio_lock_fault(struct folio *folio, struct vm_fault *vmf)
> > {
> > might_sleep();
> > if (folio_trylock(folio))
> > return 0;
> > return __folio_lock_fault(folio, vmf);
> > }
> >
> > Then the users look like ...
> >
> > > @@ -3580,8 +3581,10 @@ static vm_fault_t remove_device_exclusive_entry(struct vm_fault *vmf)
> > > if (!folio_try_get(folio))
> > > return 0;
> > >
> > > - if (!folio_lock_or_retry(folio, vma->vm_mm, vmf->flags)) {
> > > + if (!folio_lock_or_retry(folio, vma, vmf->flags, &lock_dropped)) {
> > > folio_put(folio);
> > > + if (lock_dropped && vmf->flags & FAULT_FLAG_VMA_LOCK)
> > > + return VM_FAULT_VMA_UNLOCKED | VM_FAULT_RETRY;
> > > return VM_FAULT_RETRY;
> > > }
> >
> > ret = folio_lock_fault(folio, vmf);
> > if (ret)
> > return ret;
> >
> > > @@ -3837,9 +3840,9 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> > > goto out_release;
> > > }
> > >
> > > - locked = folio_lock_or_retry(folio, vma->vm_mm, vmf->flags);
> > > -
> > > - if (!locked) {
> > > + if (!folio_lock_or_retry(folio, vma, vmf->flags, &lock_dropped)) {
> > > + if (lock_dropped && vmf->flags & FAULT_FLAG_VMA_LOCK)
> > > + ret |= VM_FAULT_VMA_UNLOCKED;
> > > ret |= VM_FAULT_RETRY;
> > > goto out_release;
> > > }
> >
> > ret |= folio_lock_fault(folio, vmf);
> > if (ret & VM_FAULT_RETRY)
> > goto out_release;
> >
> > ie instead of trying to reconstruct what __folio_lock_fault() did from
> > its outputs, we just let folio_lock_fault() tell us what it did.
>
> Thanks for taking a look!
> Ok, I think what you are suggesting is to have a new set of
> folio_lock_fault()/__folio_lock_fault() functions which return
> vm_fault_t directly, __folio_lock_fault() will use
> __folio_lock_or_retry() internally and will adjust its return value
> based on __folio_lock_or_retry()'s return and the lock releasing rules
> described in the comments for __folio_lock_or_retry(). Is my
> understanding correct?
Oh, after rereading I think you are suggesting to replace
folio_lock_or_retry()/__folio_lock_or_retry() with
folio_lock_fault()/__folio_lock_fault(), not to add them. Is that
right?
On Fri, Jun 9, 2023 at 8:03 AM Matthew Wilcox <[email protected]> wrote:
>
> On Thu, Jun 08, 2023 at 05:51:57PM -0700, Suren Baghdasaryan wrote:
> > static inline bool folio_lock_or_retry(struct folio *folio,
> > - struct mm_struct *mm, unsigned int flags)
> > + struct vm_area_struct *vma, unsigned int flags,
> > + bool *lock_dropped)
>
> I hate these double-return-value functions.
>
> How about this for an API:
>
> vm_fault_t folio_lock_fault(struct folio *folio, struct vm_fault *vmf)
> {
> might_sleep();
> if (folio_trylock(folio))
> return 0;
> return __folio_lock_fault(folio, vmf);
> }
>
> Then the users look like ...
>
> > @@ -3580,8 +3581,10 @@ static vm_fault_t remove_device_exclusive_entry(struct vm_fault *vmf)
> > if (!folio_try_get(folio))
> > return 0;
> >
> > - if (!folio_lock_or_retry(folio, vma->vm_mm, vmf->flags)) {
> > + if (!folio_lock_or_retry(folio, vma, vmf->flags, &lock_dropped)) {
> > folio_put(folio);
> > + if (lock_dropped && vmf->flags & FAULT_FLAG_VMA_LOCK)
> > + return VM_FAULT_VMA_UNLOCKED | VM_FAULT_RETRY;
> > return VM_FAULT_RETRY;
> > }
>
> ret = folio_lock_fault(folio, vmf);
> if (ret)
> return ret;
>
> > @@ -3837,9 +3840,9 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> > goto out_release;
> > }
> >
> > - locked = folio_lock_or_retry(folio, vma->vm_mm, vmf->flags);
> > -
> > - if (!locked) {
> > + if (!folio_lock_or_retry(folio, vma, vmf->flags, &lock_dropped)) {
> > + if (lock_dropped && vmf->flags & FAULT_FLAG_VMA_LOCK)
> > + ret |= VM_FAULT_VMA_UNLOCKED;
> > ret |= VM_FAULT_RETRY;
> > goto out_release;
> > }
>
> ret |= folio_lock_fault(folio, vmf);
> if (ret & VM_FAULT_RETRY)
> goto out_release;
>
> ie instead of trying to reconstruct what __folio_lock_fault() did from
> its outputs, we just let folio_lock_fault() tell us what it did.
Thanks for taking a look!
Ok, I think what you are suggesting is to have a new set of
folio_lock_fault()/__folio_lock_fault() functions which return
vm_fault_t directly, __folio_lock_fault() will use
__folio_lock_or_retry() internally and will adjust its return value
based on __folio_lock_or_retry()'s return and the lock releasing rules
described in the comments for __folio_lock_or_retry(). Is my
understanding correct?
On Thu, Jun 8, 2023 at 8:14 PM Ming Lei <[email protected]> wrote:
>
> On Fri, Jun 9, 2023 at 9:58 AM Huang, Ying <[email protected]> wrote:
> >
> > + Ming Lei for confirmation.
>
> Good catch, it isn't necessary to pass the polling parameter now.
Thanks folks for reviewing and confirming!
>
> Thanks,
>
On Fri, Jun 09, 2023 at 11:55:29AM -0700, Suren Baghdasaryan wrote:
> Oh, after rereading I think you are suggesting to replace
> folio_lock_or_retry()/__folio_lock_or_retry() with
> folio_lock_fault()/__folio_lock_fault(), not to add them. Is that
> right?
Right. It only has two callers. And I'd do that before adding the
FAULT_VMA_LOCK handling to it.
On Thu, Jun 08, 2023 at 05:51:54PM -0700, Suren Baghdasaryan wrote:
> When page fault is handled under VMA lock protection, all swap page
> faults are retried with mmap_lock because folio_lock_or_retry
> implementation has to drop and reacquire mmap_lock if folio could
> not be immediately locked.
> Instead of retrying all swapped page faults, retry only when folio
> locking fails.
> Note that the only time do_swap_page calls synchronous swap_readpage
> is when SWP_SYNCHRONOUS_IO is set, which is only set for
> QUEUE_FLAG_SYNCHRONOUS devices: brd, zram and nvdimms (both btt and
> pmem). Therefore we don't sleep in this path, and there's no need to
> drop the mmap or per-vma lock.
> Drivers implementing ops->migrate_to_ram might still rely on mmap_lock,
> therefore fall back to mmap_lock in this case.
>
> Signed-off-by: Suren Baghdasaryan <[email protected]>
> ---
> mm/filemap.c | 6 ++++++
> mm/memory.c | 14 +++++++++-----
> 2 files changed, 15 insertions(+), 5 deletions(-)
>
> diff --git a/mm/filemap.c b/mm/filemap.c
> index b4c9bd368b7e..7cb0a3776a07 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -1706,6 +1706,8 @@ static int __folio_lock_async(struct folio *folio, struct wait_page_queue *wait)
> * mmap_lock has been released (mmap_read_unlock(), unless flags had both
> * FAULT_FLAG_ALLOW_RETRY and FAULT_FLAG_RETRY_NOWAIT set, in
> * which case mmap_lock is still held.
> + * If flags had FAULT_FLAG_VMA_LOCK set, meaning the operation is performed
> + * with VMA lock only, the VMA lock is still held.
> *
> * If neither ALLOW_RETRY nor KILLABLE are set, will always return true
> * with the folio locked and the mmap_lock unperturbed.
> @@ -1713,6 +1715,10 @@ static int __folio_lock_async(struct folio *folio, struct wait_page_queue *wait)
> bool __folio_lock_or_retry(struct folio *folio, struct mm_struct *mm,
> unsigned int flags)
> {
> + /* Can't do this if not holding mmap_lock */
> + if (flags & FAULT_FLAG_VMA_LOCK)
> + return false;
If here what we need is the page lock, can we just conditionally release
either mmap lock or vma lock depending on FAULT_FLAG_VMA_LOCK?
--
Peter Xu
On Thu, Jun 08, 2023 at 05:51:56PM -0700, Suren Baghdasaryan wrote:
> migration_entry_wait does not need VMA lock, therefore it can be dropped
> before waiting. Introduce VM_FAULT_VMA_UNLOCKED to indicate that VMA
> lock was dropped while in handle_mm_fault().
> Note that once VMA lock is dropped, the VMA reference can't be used as
> there are no guarantees it was not freed.
Then vma lock behaves differently from mmap read lock, am I right? Can we
still make them match on behaviors, or there's reason not to do so?
One reason is if they match they can reuse existing flags and there'll be
less confusing, e.g. this:
(fault->flags & FAULT_FLAG_VMA_LOCK) &&
(vm_fault_ret && (VM_FAULT_RETRY || VM_FAULT_COMPLETE))
can replace the new flag, iiuc.
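Spelled out as a sketch of the check being described (assuming bitwise tests
and the VM_FAULT_COMPLETED spelling):

	bool vma_lock_dropped = (vmf->flags & FAULT_FLAG_VMA_LOCK) &&
				(ret & (VM_FAULT_RETRY | VM_FAULT_COMPLETED));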
Thanks,
--
Peter Xu
On Thu, Jun 08, 2023 at 05:51:55PM -0700, Suren Baghdasaryan wrote:
> VM_FAULT_RESULT_TRACE should contain an element for every vm_fault_reason
> to be used as flag_array inside trace_print_flags_seq(). The element
> for VM_FAULT_COMPLETED is missing, add it.
>
> Signed-off-by: Suren Baghdasaryan <[email protected]>
Reviewed-by: Peter Xu <[email protected]>
--
Peter Xu
On Fri, Jun 09, 2023 at 04:25:42PM -0400, Peter Xu wrote:
> > bool __folio_lock_or_retry(struct folio *folio, struct mm_struct *mm,
> > unsigned int flags)
> > {
> > + /* Can't do this if not holding mmap_lock */
> > + if (flags & FAULT_FLAG_VMA_LOCK)
> > + return false;
>
> If here what we need is the page lock, can we just conditionally release
> either mmap lock or vma lock depending on FAULT_FLAG_VMA_LOCK?
See patch 5 ...
On Fri, Jun 09, 2023 at 09:35:49PM +0100, Matthew Wilcox wrote:
> On Fri, Jun 09, 2023 at 04:25:42PM -0400, Peter Xu wrote:
> > > bool __folio_lock_or_retry(struct folio *folio, struct mm_struct *mm,
> > > unsigned int flags)
> > > {
> > > + /* Can't do this if not holding mmap_lock */
> > > + if (flags & FAULT_FLAG_VMA_LOCK)
> > > + return false;
> >
> > If here what we need is the page lock, can we just conditionally release
> > either mmap lock or vma lock depending on FAULT_FLAG_VMA_LOCK?
>
> See patch 5 ...
Just reaching.. :)
Why not in one shot, then?
--
Peter Xu
On Thu, Jun 08, 2023 at 05:51:57PM -0700, Suren Baghdasaryan wrote:
> Follow the same pattern as mmap_lock when waiting for folio by dropping
> VMA lock before the wait and retrying once folio is available.
>
> Signed-off-by: Suren Baghdasaryan <[email protected]>
> ---
> include/linux/pagemap.h | 14 ++++++++++----
> mm/filemap.c | 43 ++++++++++++++++++++++-------------------
> mm/memory.c | 13 ++++++++-----
> 3 files changed, 41 insertions(+), 29 deletions(-)
>
> diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
> index a56308a9d1a4..6c9493314c21 100644
> --- a/include/linux/pagemap.h
> +++ b/include/linux/pagemap.h
> @@ -896,8 +896,8 @@ static inline bool wake_page_match(struct wait_page_queue *wait_page,
>
> void __folio_lock(struct folio *folio);
> int __folio_lock_killable(struct folio *folio);
> -bool __folio_lock_or_retry(struct folio *folio, struct mm_struct *mm,
> - unsigned int flags);
> +bool __folio_lock_or_retry(struct folio *folio, struct vm_area_struct *vma,
> + unsigned int flags, bool *lock_dropped);
> void unlock_page(struct page *page);
> void folio_unlock(struct folio *folio);
>
> @@ -1002,10 +1002,16 @@ static inline int folio_lock_killable(struct folio *folio)
> * __folio_lock_or_retry().
> */
> static inline bool folio_lock_or_retry(struct folio *folio,
> - struct mm_struct *mm, unsigned int flags)
> + struct vm_area_struct *vma, unsigned int flags,
> + bool *lock_dropped)
> {
> might_sleep();
> - return folio_trylock(folio) || __folio_lock_or_retry(folio, mm, flags);
> + if (folio_trylock(folio)) {
> + *lock_dropped = false;
> + return true;
> + }
> +
> + return __folio_lock_or_retry(folio, vma, flags, lock_dropped);
> }
>
> /*
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 7cb0a3776a07..838955635fbc 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -1701,37 +1701,35 @@ static int __folio_lock_async(struct folio *folio, struct wait_page_queue *wait)
>
> /*
> * Return values:
> - * true - folio is locked; mmap_lock is still held.
> + * true - folio is locked.
> * false - folio is not locked.
> - * mmap_lock has been released (mmap_read_unlock(), unless flags had both
> - * FAULT_FLAG_ALLOW_RETRY and FAULT_FLAG_RETRY_NOWAIT set, in
> - * which case mmap_lock is still held.
> - * If flags had FAULT_FLAG_VMA_LOCK set, meaning the operation is performed
> - * with VMA lock only, the VMA lock is still held.
> + *
> + * lock_dropped indicates whether mmap_lock/VMA lock got dropped.
> + * mmap_lock/VMA lock is dropped when function fails to lock the folio,
> + * unless flags had both FAULT_FLAG_ALLOW_RETRY and FAULT_FLAG_RETRY_NOWAIT
> + * set, in which case mmap_lock/VMA lock is still held.
This seems to be a separate change to have "lock_dropped", would it worth a
separate patch for it if needed?
I do agree it's confusing and it might be the reason of this change, but I
think it may or may not help much.. as long as VM_FAULT_RETRY semantics
kept unchanged iiuc (it doesn't always imply mmap lock released, only if
!NOWAIT, which can be confusing too).
Especially that doesn't seem like a must for the vma change. IIUC to
support vma lock here we can simply keep everything as before, but only
release proper lock based on the fault flag should work. But maybe I just
missed something, so that relies on the answer to previous patch...
> *
> * If neither ALLOW_RETRY nor KILLABLE are set, will always return true
> - * with the folio locked and the mmap_lock unperturbed.
> + * with the folio locked and the mmap_lock/VMA lock unperturbed.
> */
> -bool __folio_lock_or_retry(struct folio *folio, struct mm_struct *mm,
> - unsigned int flags)
> +bool __folio_lock_or_retry(struct folio *folio, struct vm_area_struct *vma,
> + unsigned int flags, bool *lock_dropped)
> {
> - /* Can't do this if not holding mmap_lock */
> - if (flags & FAULT_FLAG_VMA_LOCK)
> - return false;
> -
> if (fault_flag_allow_retry_first(flags)) {
> - /*
> - * CAUTION! In this case, mmap_lock is not released
> - * even though return 0.
> - */
> - if (flags & FAULT_FLAG_RETRY_NOWAIT)
> + if (flags & FAULT_FLAG_RETRY_NOWAIT) {
> + *lock_dropped = false;
> return false;
> + }
>
> - mmap_read_unlock(mm);
> + if (flags & FAULT_FLAG_VMA_LOCK)
> + vma_end_read(vma);
> + else
> + mmap_read_unlock(vma->vm_mm);
> if (flags & FAULT_FLAG_KILLABLE)
> folio_wait_locked_killable(folio);
> else
> folio_wait_locked(folio);
> + *lock_dropped = true;
> return false;
> }
> if (flags & FAULT_FLAG_KILLABLE) {
> @@ -1739,13 +1737,18 @@ bool __folio_lock_or_retry(struct folio *folio, struct mm_struct *mm,
>
> ret = __folio_lock_killable(folio);
> if (ret) {
> - mmap_read_unlock(mm);
> + if (flags & FAULT_FLAG_VMA_LOCK)
> + vma_end_read(vma);
> + else
> + mmap_read_unlock(vma->vm_mm);
> + *lock_dropped = true;
> return false;
> }
> } else {
> __folio_lock(folio);
> }
>
> + *lock_dropped = false;
> return true;
> }
>
> diff --git a/mm/memory.c b/mm/memory.c
> index c234f8085f1e..acb09a3aad53 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3568,6 +3568,7 @@ static vm_fault_t remove_device_exclusive_entry(struct vm_fault *vmf)
> struct folio *folio = page_folio(vmf->page);
> struct vm_area_struct *vma = vmf->vma;
> struct mmu_notifier_range range;
> + bool lock_dropped;
>
> /*
> * We need a reference to lock the folio because we don't hold
> @@ -3580,8 +3581,10 @@ static vm_fault_t remove_device_exclusive_entry(struct vm_fault *vmf)
> if (!folio_try_get(folio))
> return 0;
>
> - if (!folio_lock_or_retry(folio, vma->vm_mm, vmf->flags)) {
> + if (!folio_lock_or_retry(folio, vma, vmf->flags, &lock_dropped)) {
> folio_put(folio);
> + if (lock_dropped && vmf->flags & FAULT_FLAG_VMA_LOCK)
> + return VM_FAULT_VMA_UNLOCKED | VM_FAULT_RETRY;
> return VM_FAULT_RETRY;
> }
> mmu_notifier_range_init_owner(&range, MMU_NOTIFY_EXCLUSIVE, 0,
> @@ -3704,7 +3707,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> bool exclusive = false;
> swp_entry_t entry;
> pte_t pte;
> - int locked;
> + bool lock_dropped;
> vm_fault_t ret = 0;
> void *shadow = NULL;
>
> @@ -3837,9 +3840,9 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> goto out_release;
> }
>
> - locked = folio_lock_or_retry(folio, vma->vm_mm, vmf->flags);
> -
> - if (!locked) {
> + if (!folio_lock_or_retry(folio, vma, vmf->flags, &lock_dropped)) {
> + if (lock_dropped && vmf->flags & FAULT_FLAG_VMA_LOCK)
> + ret |= VM_FAULT_VMA_UNLOCKED;
> ret |= VM_FAULT_RETRY;
> goto out_release;
> }
> --
> 2.41.0.162.gfafddb0af9-goog
>
--
Peter Xu
On Fri, Jun 9, 2023 at 1:42 PM Peter Xu <[email protected]> wrote:
>
> On Thu, Jun 08, 2023 at 05:51:56PM -0700, Suren Baghdasaryan wrote:
> > migration_entry_wait does not need VMA lock, therefore it can be dropped
> > before waiting. Introduce VM_FAULT_VMA_UNLOCKED to indicate that VMA
> > lock was dropped while in handle_mm_fault().
> > Note that once VMA lock is dropped, the VMA reference can't be used as
> > there are no guarantees it was not freed.
>
> Then vma lock behaves differently from mmap read lock, am I right? Can we
> still make them match on behaviors, or there's reason not to do so?
I think we could match their behavior by also dropping mmap_lock here
when fault is handled under mmap_lock (!(fault->flags &
FAULT_FLAG_VMA_LOCK)).
I missed the fact that VM_FAULT_COMPLETED can be used to skip dropping
mmap_lock in do_page_fault(), so indeed, I might be able to use
VM_FAULT_COMPLETED to skip vma_end_read(vma) for per-vma locks as well
instead of introducing FAULT_FLAG_VMA_LOCK. I think that was your idea
of reusing existing flags?
>
> One reason is if they match they can reuse existing flags and there'll be
> less confusing, e.g. this:
>
> (fault->flags & FAULT_FLAG_VMA_LOCK) &&
> (vm_fault_ret && (VM_FAULT_RETRY || VM_FAULT_COMPLETE))
>
> can replace the new flag, iiuc.
>
> Thanks,
>
> --
> Peter Xu
>
> --
> To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
>
On Fri, Jun 9, 2023 at 1:45 PM Peter Xu <[email protected]> wrote:
>
> On Fri, Jun 09, 2023 at 09:35:49PM +0100, Matthew Wilcox wrote:
> > On Fri, Jun 09, 2023 at 04:25:42PM -0400, Peter Xu wrote:
> > > > bool __folio_lock_or_retry(struct folio *folio, struct mm_struct *mm,
> > > > unsigned int flags)
> > > > {
> > > > + /* Can't do this if not holding mmap_lock */
> > > > + if (flags & FAULT_FLAG_VMA_LOCK)
> > > > + return false;
> > >
> > > If here what we need is the page lock, can we just conditionally release
> > > either mmap lock or vma lock depending on FAULT_FLAG_VMA_LOCK?
> >
> > See patch 5 ...
>
> Just reaching.. :)
>
> Why not in one shot, then?
I like small incremental changes, but I can squash them if that helps
in having a complete picture.
>
> --
> Peter Xu
>
On Fri, Jun 9, 2023 at 1:54 PM Peter Xu <[email protected]> wrote:
>
> On Thu, Jun 08, 2023 at 05:51:57PM -0700, Suren Baghdasaryan wrote:
> > Follow the same pattern as mmap_lock when waiting for folio by dropping
> > VMA lock before the wait and retrying once folio is available.
> >
> > Signed-off-by: Suren Baghdasaryan <[email protected]>
> > ---
> > include/linux/pagemap.h | 14 ++++++++++----
> > mm/filemap.c | 43 ++++++++++++++++++++++-------------------
> > mm/memory.c | 13 ++++++++-----
> > 3 files changed, 41 insertions(+), 29 deletions(-)
> >
> > diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
> > index a56308a9d1a4..6c9493314c21 100644
> > --- a/include/linux/pagemap.h
> > +++ b/include/linux/pagemap.h
> > @@ -896,8 +896,8 @@ static inline bool wake_page_match(struct wait_page_queue *wait_page,
> >
> > void __folio_lock(struct folio *folio);
> > int __folio_lock_killable(struct folio *folio);
> > -bool __folio_lock_or_retry(struct folio *folio, struct mm_struct *mm,
> > - unsigned int flags);
> > +bool __folio_lock_or_retry(struct folio *folio, struct vm_area_struct *vma,
> > + unsigned int flags, bool *lock_dropped);
> > void unlock_page(struct page *page);
> > void folio_unlock(struct folio *folio);
> >
> > @@ -1002,10 +1002,16 @@ static inline int folio_lock_killable(struct folio *folio)
> > * __folio_lock_or_retry().
> > */
> > static inline bool folio_lock_or_retry(struct folio *folio,
> > - struct mm_struct *mm, unsigned int flags)
> > + struct vm_area_struct *vma, unsigned int flags,
> > + bool *lock_dropped)
> > {
> > might_sleep();
> > - return folio_trylock(folio) || __folio_lock_or_retry(folio, mm, flags);
> > + if (folio_trylock(folio)) {
> > + *lock_dropped = false;
> > + return true;
> > + }
> > +
> > + return __folio_lock_or_retry(folio, vma, flags, lock_dropped);
> > }
> >
> > /*
> > diff --git a/mm/filemap.c b/mm/filemap.c
> > index 7cb0a3776a07..838955635fbc 100644
> > --- a/mm/filemap.c
> > +++ b/mm/filemap.c
> > @@ -1701,37 +1701,35 @@ static int __folio_lock_async(struct folio *folio, struct wait_page_queue *wait)
> >
> > /*
> > * Return values:
> > - * true - folio is locked; mmap_lock is still held.
> > + * true - folio is locked.
> > * false - folio is not locked.
> > - * mmap_lock has been released (mmap_read_unlock(), unless flags had both
> > - * FAULT_FLAG_ALLOW_RETRY and FAULT_FLAG_RETRY_NOWAIT set, in
> > - * which case mmap_lock is still held.
> > - * If flags had FAULT_FLAG_VMA_LOCK set, meaning the operation is performed
> > - * with VMA lock only, the VMA lock is still held.
> > + *
> > + * lock_dropped indicates whether mmap_lock/VMA lock got dropped.
> > + * mmap_lock/VMA lock is dropped when function fails to lock the folio,
> > + * unless flags had both FAULT_FLAG_ALLOW_RETRY and FAULT_FLAG_RETRY_NOWAIT
> > + * set, in which case mmap_lock/VMA lock is still held.
>
> This seems to be a separate change to have "lock_dropped", would it worth a
> separate patch for it if needed?
Yes, Matthew asked for the same and also to change the function to
return the flags directly which should make it cleaner. IOW when this
function drops the lock it will include VM_FAULT_VMA_UNLOCKED flag in
its return value. And that will be done in a separate patch.
>
> I do agree it's confusing and it might be the reason of this change, but I
> think it may or may not help much.. as long as VM_FAULT_RETRY semantics
> kept unchanged iiuc (it doesn't always imply mmap lock released, only if
> !NOWAIT, which can be confusing too).
>
> Especially that doesn't seem like a must for the vma change. IIUC to
> support vma lock here we can simply keep everything as before, but only
> release proper lock based on the fault flag should work. But maybe I just
> missed something, so that relies on the answer to previous patch...
That was my intention here, IOW I'm making the following replacement:
- mmap_read_unlock(mm);
+ if (flags & FAULT_FLAG_VMA_LOCK)
+ vma_end_read(vma);
+ else
+ mmap_read_unlock(vma->vm_mm);
Did I miss something which makes the function work differently between
mmap_lock vs per-vma one?
>
> > *
> > * If neither ALLOW_RETRY nor KILLABLE are set, will always return true
> > - * with the folio locked and the mmap_lock unperturbed.
> > + * with the folio locked and the mmap_lock/VMA lock unperturbed.
> > */
> > -bool __folio_lock_or_retry(struct folio *folio, struct mm_struct *mm,
> > - unsigned int flags)
> > +bool __folio_lock_or_retry(struct folio *folio, struct vm_area_struct *vma,
> > + unsigned int flags, bool *lock_dropped)
> > {
> > - /* Can't do this if not holding mmap_lock */
> > - if (flags & FAULT_FLAG_VMA_LOCK)
> > - return false;
> > -
> > if (fault_flag_allow_retry_first(flags)) {
> > - /*
> > - * CAUTION! In this case, mmap_lock is not released
> > - * even though return 0.
> > - */
> > - if (flags & FAULT_FLAG_RETRY_NOWAIT)
> > + if (flags & FAULT_FLAG_RETRY_NOWAIT) {
> > + *lock_dropped = false;
> > return false;
> > + }
> >
> > - mmap_read_unlock(mm);
> > + if (flags & FAULT_FLAG_VMA_LOCK)
> > + vma_end_read(vma);
> > + else
> > + mmap_read_unlock(vma->vm_mm);
> > if (flags & FAULT_FLAG_KILLABLE)
> > folio_wait_locked_killable(folio);
> > else
> > folio_wait_locked(folio);
> > + *lock_dropped = true;
> > return false;
> > }
> > if (flags & FAULT_FLAG_KILLABLE) {
> > @@ -1739,13 +1737,18 @@ bool __folio_lock_or_retry(struct folio *folio, struct mm_struct *mm,
> >
> > ret = __folio_lock_killable(folio);
> > if (ret) {
> > - mmap_read_unlock(mm);
> > + if (flags & FAULT_FLAG_VMA_LOCK)
> > + vma_end_read(vma);
> > + else
> > + mmap_read_unlock(vma->vm_mm);
> > + *lock_dropped = true;
> > return false;
> > }
> > } else {
> > __folio_lock(folio);
> > }
> >
> > + *lock_dropped = false;
> > return true;
> > }
> >
> > diff --git a/mm/memory.c b/mm/memory.c
> > index c234f8085f1e..acb09a3aad53 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -3568,6 +3568,7 @@ static vm_fault_t remove_device_exclusive_entry(struct vm_fault *vmf)
> > struct folio *folio = page_folio(vmf->page);
> > struct vm_area_struct *vma = vmf->vma;
> > struct mmu_notifier_range range;
> > + bool lock_dropped;
> >
> > /*
> > * We need a reference to lock the folio because we don't hold
> > @@ -3580,8 +3581,10 @@ static vm_fault_t remove_device_exclusive_entry(struct vm_fault *vmf)
> > if (!folio_try_get(folio))
> > return 0;
> >
> > - if (!folio_lock_or_retry(folio, vma->vm_mm, vmf->flags)) {
> > + if (!folio_lock_or_retry(folio, vma, vmf->flags, &lock_dropped)) {
> > folio_put(folio);
> > + if (lock_dropped && vmf->flags & FAULT_FLAG_VMA_LOCK)
> > + return VM_FAULT_VMA_UNLOCKED | VM_FAULT_RETRY;
> > return VM_FAULT_RETRY;
> > }
> > mmu_notifier_range_init_owner(&range, MMU_NOTIFY_EXCLUSIVE, 0,
> > @@ -3704,7 +3707,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> > bool exclusive = false;
> > swp_entry_t entry;
> > pte_t pte;
> > - int locked;
> > + bool lock_dropped;
> > vm_fault_t ret = 0;
> > void *shadow = NULL;
> >
> > @@ -3837,9 +3840,9 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> > goto out_release;
> > }
> >
> > - locked = folio_lock_or_retry(folio, vma->vm_mm, vmf->flags);
> > -
> > - if (!locked) {
> > + if (!folio_lock_or_retry(folio, vma, vmf->flags, &lock_dropped)) {
> > + if (lock_dropped && vmf->flags & FAULT_FLAG_VMA_LOCK)
> > + ret |= VM_FAULT_VMA_UNLOCKED;
> > ret |= VM_FAULT_RETRY;
> > goto out_release;
> > }
> > --
> > 2.41.0.162.gfafddb0af9-goog
> >
>
> --
> Peter Xu
>
On Fri, Jun 9, 2023 at 12:37 PM Matthew Wilcox <[email protected]> wrote:
>
> On Fri, Jun 09, 2023 at 11:55:29AM -0700, Suren Baghdasaryan wrote:
> > Oh, after rereading I think you are suggesting to replace
> > folio_lock_or_retry()/__folio_lock_or_retry() with
> > folio_lock_fault()/__folio_lock_fault(), not to add them. Is that
> > right?
>
> Right. It only has two callers. And I'd do that before adding the
> FAULT_VMA_LOCK handling to it.
Got it. Will make the changes.
Thanks folks for the review!
On Fri, Jun 9, 2023 at 3:30 PM Suren Baghdasaryan <[email protected]> wrote:
>
> On Fri, Jun 9, 2023 at 1:42 PM Peter Xu <[email protected]> wrote:
> >
> > On Thu, Jun 08, 2023 at 05:51:56PM -0700, Suren Baghdasaryan wrote:
> > > migration_entry_wait does not need VMA lock, therefore it can be dropped
> > > before waiting. Introduce VM_FAULT_VMA_UNLOCKED to indicate that VMA
> > > lock was dropped while in handle_mm_fault().
> > > Note that once VMA lock is dropped, the VMA reference can't be used as
> > > there are no guarantees it was not freed.
> >
> > Then vma lock behaves differently from mmap read lock, am I right? Can we
> > still make them match on behaviors, or there's reason not to do so?
>
> I think we could match their behavior by also dropping mmap_lock here
> when fault is handled under mmap_lock (!(fault->flags &
> FAULT_FLAG_VMA_LOCK)).
> I missed the fact that VM_FAULT_COMPLETED can be used to skip dropping
> mmap_lock in do_page_fault(), so indeed, I might be able to use
> VM_FAULT_COMPLETED to skip vma_end_read(vma) for per-vma locks as well
> instead of introducing FAULT_FLAG_VMA_LOCK. I think that was your idea
> of reusing existing flags?
Sorry, I meant VM_FAULT_VMA_UNLOCKED, not FAULT_FLAG_VMA_LOCK, in the
above reply.
I took a closer look into using VM_FAULT_COMPLETED instead of
VM_FAULT_VMA_UNLOCKED, but when we fall back from the per-vma lock to
mmap_lock we need to retry with an indication that the per-vma lock
was dropped. Returning (VM_FAULT_RETRY | VM_FAULT_COMPLETED) to
indicate that state seems strange to me ("retry" and "complete" are
contradicting concepts to combine in a single result). Instead I could
keep VM_FAULT_COMPLETED for the case when mmap_lock is released, since
we never combine it with VM_FAULT_RETRY there, and return
(VM_FAULT_RETRY | VM_FAULT_VMA_UNLOCKED) when dropping the per-vma
lock and falling back to mmap_lock. That still requires the new
VM_FAULT_VMA_UNLOCKED flag, but I think it makes more sense logically.
WDYT?
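To illustrate, the per-vma path in an arch fault handler (arm64 here,
purely as an example) would then look roughly like the following
untested sketch, only meant to show the intended flag usage:

        fault = handle_mm_fault(vma, addr, mm_flags | FAULT_FLAG_VMA_LOCK, regs);
        /*
         * VM_FAULT_VMA_UNLOCKED means the VMA lock was already dropped
         * inside handle_mm_fault(), so the VMA must not be touched here.
         */
        if (!(fault & VM_FAULT_VMA_UNLOCKED))
                vma_end_read(vma);
        if (!(fault & VM_FAULT_RETRY))
                goto done;
        /* otherwise fall back and retry the fault under mmap_lock */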
>
> >
> > One reason is if they match they can reuse existing flags and there'll be
> > less confusing, e.g. this:
> >
> > (fault->flags & FAULT_FLAG_VMA_LOCK) &&
> > (vm_fault_ret & (VM_FAULT_RETRY | VM_FAULT_COMPLETED))
> >
> > can replace the new flag, iiuc.
> >
> > Thanks,
> >
> > --
> > Peter Xu
> >
Looks good:
Reviewed-by: Christoph Hellwig <[email protected]>
On Fri, Jun 09, 2023 at 06:29:43PM -0700, Suren Baghdasaryan wrote:
> On Fri, Jun 9, 2023 at 3:30 PM Suren Baghdasaryan <[email protected]> wrote:
> >
> > On Fri, Jun 9, 2023 at 1:42 PM Peter Xu <[email protected]> wrote:
> > >
> > > On Thu, Jun 08, 2023 at 05:51:56PM -0700, Suren Baghdasaryan wrote:
> > > > migration_entry_wait does not need VMA lock, therefore it can be dropped
> > > > before waiting. Introduce VM_FAULT_VMA_UNLOCKED to indicate that VMA
> > > > lock was dropped while in handle_mm_fault().
> > > > Note that once VMA lock is dropped, the VMA reference can't be used as
> > > > there are no guarantees it was not freed.
> > >
> > > Then vma lock behaves differently from mmap read lock, am I right? Can we
> > > still make them match on behaviors, or there's reason not to do so?
> >
> > I think we could match their behavior by also dropping mmap_lock here
> > when fault is handled under mmap_lock (!(fault->flags &
> > FAULT_FLAG_VMA_LOCK)).
> > I missed the fact that VM_FAULT_COMPLETED can be used to skip dropping
> > mmap_lock in do_page_fault(), so indeed, I might be able to use
> > VM_FAULT_COMPLETED to skip vma_end_read(vma) for per-vma locks as well
> > instead of introducing FAULT_FLAG_VMA_LOCK. I think that was your idea
> > of reusing existing flags?
> Sorry, I meant VM_FAULT_VMA_UNLOCKED, not FAULT_FLAG_VMA_LOCK in the
> above reply.
>
> I took a closer look into using VM_FAULT_COMPLETED instead of
> VM_FAULT_VMA_UNLOCKED but when we fall back from per-vma lock to
> mmap_lock we need to retry with an indication that the per-vma lock
> was dropped. Returning (VM_FAULT_RETRY | VM_FAULT_COMPLETE) to
> indicate such state seems strange to me ("retry" and "complete" seem
Not relevant to this migration patch, but for the whole idea I was thinking
whether it should just work if we simply:
fault = handle_mm_fault(vma, address, flags | FAULT_FLAG_VMA_LOCK, regs);
- vma_end_read(vma);
+ if (!(fault & (VM_FAULT_RETRY | VM_FAULT_COMPLETED)))
+ vma_end_read(vma);
?
GUP may need more caution on NOWAIT, but vma lock is only in fault paths so
IIUC it's fine?
--
Peter Xu
On Fri, Jun 09, 2023 at 03:30:10PM -0700, Suren Baghdasaryan wrote:
> On Fri, Jun 9, 2023 at 1:42 PM Peter Xu <[email protected]> wrote:
> >
> > On Thu, Jun 08, 2023 at 05:51:56PM -0700, Suren Baghdasaryan wrote:
> > > migration_entry_wait does not need VMA lock, therefore it can be dropped
> > > before waiting. Introduce VM_FAULT_VMA_UNLOCKED to indicate that VMA
> > > lock was dropped while in handle_mm_fault().
> > > Note that once VMA lock is dropped, the VMA reference can't be used as
> > > there are no guarantees it was not freed.
> >
> > Then vma lock behaves differently from mmap read lock, am I right? Can we
> > still make them match on behaviors, or there's reason not to do so?
>
> I think we could match their behavior by also dropping mmap_lock here
> when fault is handled under mmap_lock (!(fault->flags &
> FAULT_FLAG_VMA_LOCK)).
> I missed the fact that VM_FAULT_COMPLETED can be used to skip dropping
> mmap_lock in do_page_fault(), so indeed, I might be able to use
> VM_FAULT_COMPLETED to skip vma_end_read(vma) for per-vma locks as well
> instead of introducing FAULT_FLAG_VMA_LOCK. I think that was your idea
> of reusing existing flags?
Yes.
I'd suggest we move this patch out of the series, as it's not really part
of enabling swap + uffd. It can be a separate patch that hopefully changes
both the vma lock and mmap lock cases together, with proper reasoning.
Thanks,
--
Peter Xu
On Fri, Jun 09, 2023 at 03:34:34PM -0700, Suren Baghdasaryan wrote:
> On Fri, Jun 9, 2023 at 1:45 PM Peter Xu <[email protected]> wrote:
> >
> > On Fri, Jun 09, 2023 at 09:35:49PM +0100, Matthew Wilcox wrote:
> > > On Fri, Jun 09, 2023 at 04:25:42PM -0400, Peter Xu wrote:
> > > > > bool __folio_lock_or_retry(struct folio *folio, struct mm_struct *mm,
> > > > > unsigned int flags)
> > > > > {
> > > > > + /* Can't do this if not holding mmap_lock */
> > > > > + if (flags & FAULT_FLAG_VMA_LOCK)
> > > > > + return false;
> > > >
> > > > If here what we need is the page lock, can we just conditionally release
> > > > either mmap lock or vma lock depending on FAULT_FLAG_VMA_LOCK?
> > >
> > > See patch 5 ...
> >
> > Just reaching.. :)
> >
> > Why not in one shot, then?
>
> I like small incremental changes, but I can squash them if that helps
> in having a complete picture.
Yes, that'll be appreciated. IMHO repeatedly changing the semantics of
FAULT_FLAG_VMA_LOCK for the folio lock function within the same small
series is confusing.
--
Peter Xu
On Fri, Jun 09, 2023 at 03:48:04PM -0700, Suren Baghdasaryan wrote:
> That was my intention here, IOW I'm making the following replacement:
>
> - mmap_read_unlock(mm);
> + if (flags & FAULT_FLAG_VMA_LOCK)
> + vma_end_read(vma);
> + else
> + mmap_read_unlock(vma->vm_mm);
>
> Did I miss something which makes the function work differently between
> mmap_lock vs per-vma one?
Nothing wrong I can see. Thanks.
--
Peter Xu
On Mon, Jun 12, 2023 at 6:28 AM Peter Xu <[email protected]> wrote:
>
> On Fri, Jun 09, 2023 at 03:30:10PM -0700, Suren Baghdasaryan wrote:
> > On Fri, Jun 9, 2023 at 1:42 PM Peter Xu <[email protected]> wrote:
> > >
> > > On Thu, Jun 08, 2023 at 05:51:56PM -0700, Suren Baghdasaryan wrote:
> > > > migration_entry_wait does not need VMA lock, therefore it can be dropped
> > > > before waiting. Introduce VM_FAULT_VMA_UNLOCKED to indicate that VMA
> > > > lock was dropped while in handle_mm_fault().
> > > > Note that once VMA lock is dropped, the VMA reference can't be used as
> > > > there are no guarantees it was not freed.
> > >
> > > Then vma lock behaves differently from mmap read lock, am I right? Can we
> > > still make them match on behaviors, or there's reason not to do so?
> >
> > I think we could match their behavior by also dropping mmap_lock here
> > when fault is handled under mmap_lock (!(fault->flags &
> > FAULT_FLAG_VMA_LOCK)).
> > I missed the fact that VM_FAULT_COMPLETED can be used to skip dropping
> > mmap_lock in do_page_fault(), so indeed, I might be able to use
> > VM_FAULT_COMPLETED to skip vma_end_read(vma) for per-vma locks as well
> > instead of introducing FAULT_FLAG_VMA_LOCK. I think that was your idea
> > of reusing existing flags?
>
> Yes.
>
> I'd suggest we move this patch out of the series as it's not really part of
> it on enabling swap + uffd. It can be a separate patch and hopefully it'll
> always change both vma+mmap lock cases, and with proper reasonings.
Ok, I can move it out with mmap_lock support only and then add per-vma
lock support in my patchset (because this path is still part of
do_swap_page and my patchset enables swap support for per-vma locks).
>
> Thanks,
>
> --
> Peter Xu
>
On Mon, Jun 12, 2023 at 6:59 AM Peter Xu <[email protected]> wrote:
>
> On Fri, Jun 09, 2023 at 03:34:34PM -0700, Suren Baghdasaryan wrote:
> > On Fri, Jun 9, 2023 at 1:45 PM Peter Xu <[email protected]> wrote:
> > >
> > > On Fri, Jun 09, 2023 at 09:35:49PM +0100, Matthew Wilcox wrote:
> > > > On Fri, Jun 09, 2023 at 04:25:42PM -0400, Peter Xu wrote:
> > > > > > bool __folio_lock_or_retry(struct folio *folio, struct mm_struct *mm,
> > > > > > unsigned int flags)
> > > > > > {
> > > > > > + /* Can't do this if not holding mmap_lock */
> > > > > > + if (flags & FAULT_FLAG_VMA_LOCK)
> > > > > > + return false;
> > > > >
> > > > > If here what we need is the page lock, can we just conditionally release
> > > > > either mmap lock or vma lock depending on FAULT_FLAG_VMA_LOCK?
> > > >
> > > > See patch 5 ...
> > >
> > > Just reaching.. :)
> > >
> > > Why not in one shot, then?
> >
> > I like small incremental changes, but I can squash them if that helps
> > in having a complete picture.
>
> Yes that'll be appreciated. IMHO keeping changing semantics of
> FAULT_FLAG_VMA_LOCK for the folio lock function in the same small series is
> confusing.
Ack. Thanks for the feedback!
>
> --
> Peter Xu
>
On Mon, Jun 12, 2023 at 6:36 AM Peter Xu <[email protected]> wrote:
>
> On Fri, Jun 09, 2023 at 06:29:43PM -0700, Suren Baghdasaryan wrote:
> > On Fri, Jun 9, 2023 at 3:30 PM Suren Baghdasaryan <[email protected]> wrote:
> > >
> > > On Fri, Jun 9, 2023 at 1:42 PM Peter Xu <[email protected]> wrote:
> > > >
> > > > On Thu, Jun 08, 2023 at 05:51:56PM -0700, Suren Baghdasaryan wrote:
> > > > > migration_entry_wait does not need VMA lock, therefore it can be dropped
> > > > > before waiting. Introduce VM_FAULT_VMA_UNLOCKED to indicate that VMA
> > > > > lock was dropped while in handle_mm_fault().
> > > > > Note that once VMA lock is dropped, the VMA reference can't be used as
> > > > > there are no guarantees it was not freed.
> > > >
> > > > Then vma lock behaves differently from mmap read lock, am I right? Can we
> > > > still make them match on behaviors, or there's reason not to do so?
> > >
> > > I think we could match their behavior by also dropping mmap_lock here
> > > when fault is handled under mmap_lock (!(fault->flags &
> > > FAULT_FLAG_VMA_LOCK)).
> > > I missed the fact that VM_FAULT_COMPLETED can be used to skip dropping
> > > mmap_lock in do_page_fault(), so indeed, I might be able to use
> > > VM_FAULT_COMPLETED to skip vma_end_read(vma) for per-vma locks as well
> > > instead of introducing FAULT_FLAG_VMA_LOCK. I think that was your idea
> > > of reusing existing flags?
> > Sorry, I meant VM_FAULT_VMA_UNLOCKED, not FAULT_FLAG_VMA_LOCK in the
> > above reply.
> >
> > I took a closer look into using VM_FAULT_COMPLETED instead of
> > VM_FAULT_VMA_UNLOCKED but when we fall back from per-vma lock to
> > mmap_lock we need to retry with an indication that the per-vma lock
> > was dropped. Returning (VM_FAULT_RETRY | VM_FAULT_COMPLETE) to
> > indicate such state seems strange to me ("retry" and "complete" seem
>
> Not relevant to this migration patch, but for the whole idea I was thinking
> whether it should just work if we simply:
>
> fault = handle_mm_fault(vma, address, flags | FAULT_FLAG_VMA_LOCK, regs);
> - vma_end_read(vma);
> + if (!(fault & (VM_FAULT_RETRY | VM_FAULT_COMPLETED)))
> + vma_end_read(vma);
>
> ?
Today, when we can't handle a page fault under per-vma locks we return
VM_FAULT_RETRY, in which case the per-vma lock is dropped and the fault
is retried under mmap_lock. The condition you suggest above would not
drop the per-vma lock in the VM_FAULT_RETRY case and would break the
current fallback mechanism.
However, your suggestion gave me an idea. I could indicate that the
per-vma lock got dropped via the vmf structure (like Matthew suggested
before), and once handle_pte_fault(vmf) returns I could check whether
it returned VM_FAULT_RETRY while the per-vma lock is still held. If so,
I can call vma_end_read() before returning from __handle_mm_fault().
That way, any time handle_mm_fault() returns VM_FAULT_RETRY the per-vma
lock will already be released, so your condition in do_page_fault()
will work correctly. That would eliminate the need for a new
VM_FAULT_VMA_UNLOCKED flag. WDYT?
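Roughly, the end of __handle_mm_fault() would become something like
this (untested sketch; "vma_lock_dropped" is just a placeholder name
for the new vm_fault field):

        ret = handle_pte_fault(&vmf);
        /*
         * vmf.vma_lock_dropped (placeholder) gets set by any path that
         * releases the per-VMA lock while handling the fault. Ensure
         * that VM_FAULT_RETRY always implies the per-VMA lock is
         * already released, matching the mmap_lock retry convention.
         */
        if ((ret & VM_FAULT_RETRY) && (vmf.flags & FAULT_FLAG_VMA_LOCK) &&
            !vmf.vma_lock_dropped)
                vma_end_read(vma);
        return ret;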
>
> GUP may need more caution on NOWAIT, but vma lock is only in fault paths so
> IIUC it's fine?
>
> --
> Peter Xu
>
On Mon, Jun 12, 2023 at 11:34 AM Peter Xu <[email protected]> wrote:
>
> On Mon, Jun 12, 2023 at 09:07:38AM -0700, Suren Baghdasaryan wrote:
> > On Mon, Jun 12, 2023 at 6:36 AM Peter Xu <[email protected]> wrote:
> > >
> > > On Fri, Jun 09, 2023 at 06:29:43PM -0700, Suren Baghdasaryan wrote:
> > > > On Fri, Jun 9, 2023 at 3:30 PM Suren Baghdasaryan <[email protected]> wrote:
> > > > >
> > > > > On Fri, Jun 9, 2023 at 1:42 PM Peter Xu <[email protected]> wrote:
> > > > > >
> > > > > > On Thu, Jun 08, 2023 at 05:51:56PM -0700, Suren Baghdasaryan wrote:
> > > > > > > migration_entry_wait does not need VMA lock, therefore it can be dropped
> > > > > > > before waiting. Introduce VM_FAULT_VMA_UNLOCKED to indicate that VMA
> > > > > > > lock was dropped while in handle_mm_fault().
> > > > > > > Note that once VMA lock is dropped, the VMA reference can't be used as
> > > > > > > there are no guarantees it was not freed.
> > > > > >
> > > > > > Then vma lock behaves differently from mmap read lock, am I right? Can we
> > > > > > still make them match on behaviors, or there's reason not to do so?
> > > > >
> > > > > I think we could match their behavior by also dropping mmap_lock here
> > > > > when fault is handled under mmap_lock (!(fault->flags &
> > > > > FAULT_FLAG_VMA_LOCK)).
> > > > > I missed the fact that VM_FAULT_COMPLETED can be used to skip dropping
> > > > > mmap_lock in do_page_fault(), so indeed, I might be able to use
> > > > > VM_FAULT_COMPLETED to skip vma_end_read(vma) for per-vma locks as well
> > > > > instead of introducing FAULT_FLAG_VMA_LOCK. I think that was your idea
> > > > > of reusing existing flags?
> > > > Sorry, I meant VM_FAULT_VMA_UNLOCKED, not FAULT_FLAG_VMA_LOCK in the
> > > > above reply.
> > > >
> > > > I took a closer look into using VM_FAULT_COMPLETED instead of
> > > > VM_FAULT_VMA_UNLOCKED but when we fall back from per-vma lock to
> > > > mmap_lock we need to retry with an indication that the per-vma lock
> > > > was dropped. Returning (VM_FAULT_RETRY | VM_FAULT_COMPLETE) to
> > > > indicate such state seems strange to me ("retry" and "complete" seem
> > >
> > > Not relevant to this migration patch, but for the whole idea I was thinking
> > > whether it should just work if we simply:
> > >
> > > fault = handle_mm_fault(vma, address, flags | FAULT_FLAG_VMA_LOCK, regs);
> > > - vma_end_read(vma);
> > > + if (!(fault & (VM_FAULT_RETRY | VM_FAULT_COMPLETED)))
> > > + vma_end_read(vma);
> > >
> > > ?
> >
> > Today when we can't handle a page fault under per-vma locks we return
> > VM_FAULT_RETRY, in which case per-vma lock is dropped and the fault is
>
> Oh I see what I missed. I think it may not be a good idea to reuse
> VM_FAULT_RETRY just for that, because it makes VM_FAULT_RETRY even more
> complicated - now it adds one more case where the lock is not released,
> that's when PER_VMA even if !NOWAIT.
>
> > retried under mmap_lock. The condition you suggest above would not
> > drop per-vma lock for VM_FAULT_RETRY case and would break the current
> > fallback mechanism.
> > However your suggestion gave me an idea. I could indicate that per-vma
> > lock got dropped using vmf structure (like Matthew suggested before)
> > and once handle_pte_fault(vmf) returns I could check if it returned
> > VM_FAULT_RETRY but per-vma lock is still held.
> > If that happens I can
> > call vma_end_read() before returning from __handle_mm_fault(). That
> > way any time handle_mm_fault() returns VM_FAULT_RETRY per-vma lock
> > will be already released, so your condition in do_page_fault() will
> > work correctly. That would eliminate the need for a new
> > VM_FAULT_VMA_UNLOCKED flag. WDYT?
>
> Sounds good.
>
> So probably that's the major pain for now with the legacy fallback (I'll
> have commented if I noticed it with the initial vma lock support..). I
> assume that'll go away as long as we recover the VM_FAULT_RETRY semantics
> to before.
I think so. With that change, getting VM_FAULT_RETRY in do_page_fault()
will guarantee that the per-vma lock was dropped. Is that what you mean?
>
> --
> Peter Xu
>
On Mon, Jun 12, 2023 at 11:44:33AM -0700, Suren Baghdasaryan wrote:
> I think so. With that change getting VM_FAULT_RETRY in do_page_fault()
> will guarantee that per-vma lock was dropped. Is that what you mean?
Yes, with the newly added "return VM_FAULT_RETRY" in do_swap_page() for
the per-vma lock case removed. AFAICT all the remaining paths already
guarantee that, at least in the fault paths.
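For reference, that's the early bail-out currently near the top of
do_swap_page(), roughly:

        if (vmf->flags & FAULT_FLAG_VMA_LOCK) {
                ret = VM_FAULT_RETRY;
                goto out;
        }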
--
Peter Xu
On Mon, Jun 12, 2023 at 09:07:38AM -0700, Suren Baghdasaryan wrote:
> On Mon, Jun 12, 2023 at 6:36 AM Peter Xu <[email protected]> wrote:
> >
> > On Fri, Jun 09, 2023 at 06:29:43PM -0700, Suren Baghdasaryan wrote:
> > > On Fri, Jun 9, 2023 at 3:30 PM Suren Baghdasaryan <[email protected]> wrote:
> > > >
> > > > On Fri, Jun 9, 2023 at 1:42 PM Peter Xu <[email protected]> wrote:
> > > > >
> > > > > On Thu, Jun 08, 2023 at 05:51:56PM -0700, Suren Baghdasaryan wrote:
> > > > > > migration_entry_wait does not need VMA lock, therefore it can be dropped
> > > > > > before waiting. Introduce VM_FAULT_VMA_UNLOCKED to indicate that VMA
> > > > > > lock was dropped while in handle_mm_fault().
> > > > > > Note that once VMA lock is dropped, the VMA reference can't be used as
> > > > > > there are no guarantees it was not freed.
> > > > >
> > > > > Then vma lock behaves differently from mmap read lock, am I right? Can we
> > > > > still make them match on behaviors, or there's reason not to do so?
> > > >
> > > > I think we could match their behavior by also dropping mmap_lock here
> > > > when fault is handled under mmap_lock (!(fault->flags &
> > > > FAULT_FLAG_VMA_LOCK)).
> > > > I missed the fact that VM_FAULT_COMPLETED can be used to skip dropping
> > > > mmap_lock in do_page_fault(), so indeed, I might be able to use
> > > > VM_FAULT_COMPLETED to skip vma_end_read(vma) for per-vma locks as well
> > > > instead of introducing FAULT_FLAG_VMA_LOCK. I think that was your idea
> > > > of reusing existing flags?
> > > Sorry, I meant VM_FAULT_VMA_UNLOCKED, not FAULT_FLAG_VMA_LOCK in the
> > > above reply.
> > >
> > > I took a closer look into using VM_FAULT_COMPLETED instead of
> > > VM_FAULT_VMA_UNLOCKED but when we fall back from per-vma lock to
> > > mmap_lock we need to retry with an indication that the per-vma lock
> > > was dropped. Returning (VM_FAULT_RETRY | VM_FAULT_COMPLETE) to
> > > indicate such state seems strange to me ("retry" and "complete" seem
> >
> > Not relevant to this migration patch, but for the whole idea I was thinking
> > whether it should just work if we simply:
> >
> > fault = handle_mm_fault(vma, address, flags | FAULT_FLAG_VMA_LOCK, regs);
> > - vma_end_read(vma);
> > + if (!(fault & (VM_FAULT_RETRY | VM_FAULT_COMPLETED)))
> > + vma_end_read(vma);
> >
> > ?
>
> Today when we can't handle a page fault under per-vma locks we return
> VM_FAULT_RETRY, in which case per-vma lock is dropped and the fault is
Oh I see what I missed. I think it may not be a good idea to reuse
VM_FAULT_RETRY just for that, because it makes VM_FAULT_RETRY even more
complicated - now it adds one more case where the lock is not released,
that's when PER_VMA even if !NOWAIT.
> retried under mmap_lock. The condition you suggest above would not
> drop per-vma lock for VM_FAULT_RETRY case and would break the current
> fallback mechanism.
> However your suggestion gave me an idea. I could indicate that per-vma
> lock got dropped using vmf structure (like Matthew suggested before)
> and once handle_pte_fault(vmf) returns I could check if it returned
> VM_FAULT_RETRY but per-vma lock is still held.
> If that happens I can
> call vma_end_read() before returning from __handle_mm_fault(). That
> way any time handle_mm_fault() returns VM_FAULT_RETRY per-vma lock
> will be already released, so your condition in do_page_fault() will
> work correctly. That would eliminate the need for a new
> VM_FAULT_VMA_UNLOCKED flag. WDYT?
Sounds good.
So that's probably the major pain point for now with the legacy fallback
(I would have commented if I had noticed it with the initial vma lock
support). I assume that'll go away as long as we restore the previous
VM_FAULT_RETRY semantics.
--
Peter Xu