When a page fault is handled under VMA lock protection, all swap page
faults are retried with mmap_lock because the folio_lock_or_retry()
implementation has to drop and reacquire mmap_lock if the folio could
not be immediately locked.
Instead of retrying all swap page faults, retry only when folio
locking fails.
Signed-off-by: Suren Baghdasaryan <[email protected]>
---
Patch applies cleanly over linux-next and mm-unstable
mm/filemap.c | 6 ++++++
mm/memory.c | 5 -----
2 files changed, 6 insertions(+), 5 deletions(-)
diff --git a/mm/filemap.c b/mm/filemap.c
index 6f3a7e53fccf..67b937b0f436 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1706,6 +1706,8 @@ static int __folio_lock_async(struct folio *folio, struct wait_page_queue *wait)
* mmap_lock has been released (mmap_read_unlock(), unless flags had both
* FAULT_FLAG_ALLOW_RETRY and FAULT_FLAG_RETRY_NOWAIT set, in
* which case mmap_lock is still held.
+ * If flags had FAULT_FLAG_VMA_LOCK set, meaning the operation is performed
+ * with VMA lock only, the VMA lock is still held.
*
* If neither ALLOW_RETRY nor KILLABLE are set, will always return true
* with the folio locked and the mmap_lock unperturbed.
@@ -1713,6 +1715,10 @@ static int __folio_lock_async(struct folio *folio, struct wait_page_queue *wait)
bool __folio_lock_or_retry(struct folio *folio, struct mm_struct *mm,
unsigned int flags)
{
+ /* Can't do this if not holding mmap_lock */
+ if (flags & FAULT_FLAG_VMA_LOCK)
+ return false;
+
if (fault_flag_allow_retry_first(flags)) {
/*
* CAUTION! In this case, mmap_lock is not released
diff --git a/mm/memory.c b/mm/memory.c
index d88f370eacd1..3301a8d01820 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3715,11 +3715,6 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
if (!pte_unmap_same(vmf))
goto out;
- if (vmf->flags & FAULT_FLAG_VMA_LOCK) {
- ret = VM_FAULT_RETRY;
- goto out;
- }
-
entry = pte_to_swp_entry(vmf->orig_pte);
if (unlikely(non_swap_entry(entry))) {
if (is_migration_entry(entry)) {
--
2.40.0.634.g4ca3ef3211-goog
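For context, the per-VMA fault path that sets FAULT_FLAG_VMA_LOCK looks
roughly like this in the arch fault handlers (a simplified sketch modeled
on arch/x86/mm/fault.c of this era; access checks and error paths omitted):

	/* Try the fault under the VMA lock first. */
	vma = lock_vma_under_rcu(mm, address);
	if (!vma)
		goto lock_mmap;
	fault = handle_mm_fault(vma, address, flags | FAULT_FLAG_VMA_LOCK, regs);
	vma_end_read(vma);
	if (!(fault & VM_FAULT_RETRY))
		goto done;	/* handled entirely under the VMA lock */
lock_mmap:
	mmap_read_lock(mm);
	/* ... retry the fault under mmap_lock ... */

Before this patch, any swap fault taken under the VMA lock returned
VM_FAULT_RETRY and went through the mmap_lock fallback above; with it,
only faults that fail to immediately lock the folio do.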
On Fri, Apr 14, 2023 at 11:00 AM Suren Baghdasaryan <[email protected]> wrote:
>
> When a page fault is handled under VMA lock protection, all swap page
> faults are retried with mmap_lock because the folio_lock_or_retry()
> implementation has to drop and reacquire mmap_lock if the folio could
> not be immediately locked.
> Instead of retrying all swap page faults, retry only when folio
> locking fails.
I just realized that the title of the patch is misleading. It's about
handling page faults under the VMA lock. A better title would be something
like:
"mm: handle swap page faults under vma lock if page is uncontended"
>
> Signed-off-by: Suren Baghdasaryan <[email protected]>
> ---
> Patch applies cleanly over linux-next and mm-unstable
>
> mm/filemap.c | 6 ++++++
> mm/memory.c | 5 -----
> 2 files changed, 6 insertions(+), 5 deletions(-)
>
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 6f3a7e53fccf..67b937b0f436 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -1706,6 +1706,8 @@ static int __folio_lock_async(struct folio *folio, struct wait_page_queue *wait)
> * mmap_lock has been released (mmap_read_unlock(), unless flags had both
> * FAULT_FLAG_ALLOW_RETRY and FAULT_FLAG_RETRY_NOWAIT set, in
> * which case mmap_lock is still held.
> + * If flags had FAULT_FLAG_VMA_LOCK set, meaning the operation is performed
> + * with VMA lock only, the VMA lock is still held.
> *
> * If neither ALLOW_RETRY nor KILLABLE are set, will always return true
> * with the folio locked and the mmap_lock unperturbed.
> @@ -1713,6 +1715,10 @@ static int __folio_lock_async(struct folio *folio, struct wait_page_queue *wait)
> bool __folio_lock_or_retry(struct folio *folio, struct mm_struct *mm,
> unsigned int flags)
> {
> + /* Can't do this if not holding mmap_lock */
> + if (flags & FAULT_FLAG_VMA_LOCK)
> + return false;
> +
> if (fault_flag_allow_retry_first(flags)) {
> /*
> * CAUTION! In this case, mmap_lock is not released
> diff --git a/mm/memory.c b/mm/memory.c
> index d88f370eacd1..3301a8d01820 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3715,11 +3715,6 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> if (!pte_unmap_same(vmf))
> goto out;
>
> - if (vmf->flags & FAULT_FLAG_VMA_LOCK) {
> - ret = VM_FAULT_RETRY;
> - goto out;
> - }
> -
> entry = pte_to_swp_entry(vmf->orig_pte);
> if (unlikely(non_swap_entry(entry))) {
> if (is_migration_entry(entry)) {
> --
> 2.40.0.634.g4ca3ef3211-goog
>
On Fri, Apr 14, 2023 at 11:00:43AM -0700, Suren Baghdasaryan wrote:
> When a page fault is handled under VMA lock protection, all swap page
> faults are retried with mmap_lock because the folio_lock_or_retry()
> implementation has to drop and reacquire mmap_lock if the folio could
> not be immediately locked.
> Instead of retrying all swap page faults, retry only when folio
> locking fails.
Reviewed-by: Matthew Wilcox (Oracle) <[email protected]>
Let's just review what can now be handled under the VMA lock instead of
the mmap_lock, in case somebody knows better than me that it's not safe.
- We can call migration_entry_wait(). This will wait for PG_locked to
become clear (in migration_entry_wait_on_locked()). As previously
discussed offline, I think this is safe to do while holding the VMA
locked.
- We can call remove_device_exclusive_entry(). That calls
folio_lock_or_retry(), which will fail if it can't get the VMA lock.
- We can call pgmap->ops->migrate_to_ram(). Perhaps somebody familiar
with Nouveau and amdkfd could comment on how safe this is?
- I believe we can't call handle_pte_marker() because we exclude UFFD
VMAs earlier.
- We can call swap_readpage() if we allocate a new folio. I haven't
traced through all this code to tell if it's OK.
So ... I believe this is all OK, but we're definitely now willing to
wait for I/O from the swap device while holding the VMA lock when we
weren't before. And maybe we should make a bigger deal of it in the
changelog.
And maybe we shouldn't just be failing the folio_lock_or_retry(),
maybe we should be waiting for the folio lock with the VMA locked.
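For reference, the cases above follow the entry dispatch at the top of
do_swap_page(), which reads roughly as follows (condensed from mm/memory.c
of this era; several branches and details are elided):

	entry = pte_to_swp_entry(vmf->orig_pte);
	if (unlikely(non_swap_entry(entry))) {
		if (is_migration_entry(entry)) {
			migration_entry_wait(vma->vm_mm, vmf->pmd,
					     vmf->address);
		} else if (is_device_exclusive_entry(entry)) {
			vmf->page = pfn_swap_entry_to_page(entry);
			ret = remove_device_exclusive_entry(vmf);
		} else if (is_device_private_entry(entry)) {
			vmf->page = pfn_swap_entry_to_page(entry);
			/* ... */
			ret = vmf->page->pgmap->ops->migrate_to_ram(vmf);
		} else if (is_pte_marker_entry(entry)) {
			ret = handle_pte_marker(vmf);
		}
		/* ... hwpoison and swapin-error entries elided ... */
		goto out;
	}
	/* Otherwise: look up or allocate the folio, possibly calling
	 * swap_readpage(), then take it with folio_lock_or_retry(). */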
On Fri, Apr 14, 2023 at 11:43 AM Matthew Wilcox <[email protected]> wrote:
>
> On Fri, Apr 14, 2023 at 11:00:43AM -0700, Suren Baghdasaryan wrote:
> > When a page fault is handled under VMA lock protection, all swap page
> > faults are retried with mmap_lock because the folio_lock_or_retry()
> > implementation has to drop and reacquire mmap_lock if the folio could
> > not be immediately locked.
> > Instead of retrying all swap page faults, retry only when folio
> > locking fails.
>
> Reviewed-by: Matthew Wilcox (Oracle) <[email protected]>
Thank you for the reviews!
>
> Let's just review what can now be handled under the VMA lock instead of
> the mmap_lock, in case somebody knows better than me that it's not safe.
>
> - We can call migration_entry_wait(). This will wait for PG_locked to
> become clear (in migration_entry_wait_on_locked()). As previously
> discussed offline, I think this is safe to do while holding the VMA
> locked.
> - We can call remove_device_exclusive_entry(). That calls
> folio_lock_or_retry(), which will fail if it can't get the VMA lock.
> - We can call pgmap->ops->migrate_to_ram(). Perhaps somebody familiar
> with Nouveau and amdkfd could comment on how safe this is?
> - I believe we can't call handle_pte_marker() because we exclude UFFD
> VMAs earlier.
> - We can call swap_readpage() if we allocate a new folio. I haven't
> traced through all this code to tell if it's OK.
>
> So ... I believe this is all OK, but we're definitely now willing to
> wait for I/O from the swap device while holding the VMA lock when we
> weren't before. And maybe we should make a bigger deal of it in the
> changelog.
>
> And maybe we shouldn't just be failing the folio_lock_or_retry(),
> maybe we should be waiting for the folio lock with the VMA locked.
Wouldn't that cause us to hold the VMA lock for the duration of swap I/O
(something you said we want to avoid in the previous paragraph) and
effectively undo d065bd810b6d ("mm: retry page fault when blocking on
disk transfer") for VMA locks?
>
On Fri, Apr 14, 2023 at 12:48:54PM -0700, Suren Baghdasaryan wrote:
> > - We can call migration_entry_wait(). This will wait for PG_locked to
> > become clear (in migration_entry_wait_on_locked()). As previously
> > discussed offline, I think this is safe to do while holding the VMA
> > locked.
Just to be clear, this particular use of PG_locked is not during I/O,
it's during page migration. This is a few orders of magnitude
different.
> > - We can call swap_readpage() if we allocate a new folio. I haven't
> > traced through all this code to tell if it's OK.
... whereas this will wait for I/O. If we decide that's not OK, we'll
need to test for FAULT_FLAG_VMA_LOCK and bail out of this path.
> > So ... I believe this is all OK, but we're definitely now willing to
> > wait for I/O from the swap device while holding the VMA lock when we
> > weren't before. And maybe we should make a bigger deal of it in the
> > changelog.
> >
> > And maybe we shouldn't just be failing the folio_lock_or_retry(),
> > maybe we should be waiting for the folio lock with the VMA locked.
>
> Wouldn't that cause us to hold the VMA lock for the duration of swap I/O
> (something you said we want to avoid in the previous paragraph) and
> effectively undo d065bd810b6d ("mm: retry page fault when blocking on
> disk transfer") for VMA locks?
I'm not certain we want to avoid holding the VMA lock for the duration
of an I/O. Here's how I understand the rationale for avoiding holding
the mmap_lock while we perform I/O (before the existence of the VMA lock):
- If everybody is doing page faults, there is no specific problem;
we all hold the lock for read and multiple page faults can be handled
in parallel.
- As soon as one thread attempts to manipulate the tree (eg calls
mmap()), all new readers must wait (as the rwsem is fair), and the
writer must wait for all existing readers to finish. That's
potentially milliseconds for an I/O during which time all page faults
stop.
Now we have the per-VMA lock, faults which can be handled without taking
the mmap_lock can still be satisfied, as long as that VMA is not being
modified. It is rare for a real application to take a page fault on a
VMA which is being modified.
So modifications to the tree will generally not take VMA locks on VMAs
which are currently handling faults, and new faults will generally not
find a VMA which is write-locked.
When we find a locked folio (presumably for I/O, although folios are
locked for other reasons), if we fall back to taking the mmap_lock
for read, we increase contention on the mmap_lock and make the page
fault wait on any mmap() operation. If we simply sleep waiting for the
I/O, we make any mmap() operation _which touches this VMA_ wait for
the I/O to complete. But I think that's OK, because new page faults
can continue to be serviced ... as long as they don't need to take
the mmap_lock.
So ... I think what we _really_ want here is ...
+++ b/mm/filemap.c
@@ -1690,7 +1690,8 @@ static int __folio_lock_async(struct folio *folio, struct wait_page_queue *wait)
bool __folio_lock_or_retry(struct folio *folio, struct mm_struct *mm,
unsigned int flags)
{
- if (fault_flag_allow_retry_first(flags)) {
+ if (!(flags & FAULT_FLAG_VMA_LOCK) &&
+ fault_flag_allow_retry_first(flags)) {
/*
* CAUTION! In this case, mmap_lock is not released
* even though return 0.
@@ -1710,7 +1711,8 @@ bool __folio_lock_or_retry(struct folio *folio, struct mm_struct *mm,
ret = __folio_lock_killable(folio);
if (ret) {
- mmap_read_unlock(mm);
+ if (!(flags & FAULT_FLAG_VMA_LOCK))
+ mmap_read_unlock(mm);
return false;
}
} else {
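With that change applied, the whole function would read roughly as
follows (a sketch; the unchanged parts are paraphrased from mm/filemap.c):

	bool __folio_lock_or_retry(struct folio *folio, struct mm_struct *mm,
				   unsigned int flags)
	{
		if (!(flags & FAULT_FLAG_VMA_LOCK) &&
		    fault_flag_allow_retry_first(flags)) {
			/* mmap_lock is not released when NOWAIT is set */
			if (flags & FAULT_FLAG_RETRY_NOWAIT)
				return false;
			mmap_read_unlock(mm);
			if (flags & FAULT_FLAG_KILLABLE)
				folio_wait_locked_killable(folio);
			else
				folio_wait_locked(folio);
			return false;
		}
		if (flags & FAULT_FLAG_KILLABLE) {
			int ret = __folio_lock_killable(folio);

			if (ret) {
				/* only drop mmap_lock if we actually hold it */
				if (!(flags & FAULT_FLAG_VMA_LOCK))
					mmap_read_unlock(mm);
				return false;
			}
		} else {
			__folio_lock(folio);
		}
		return true;
	}

The effect is that under FAULT_FLAG_VMA_LOCK we sleep on the folio lock
while still holding the VMA lock, instead of failing immediately as the
posted patch does.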
On Fri, Apr 14, 2023 at 1:32 PM Matthew Wilcox <[email protected]> wrote:
>
> On Fri, Apr 14, 2023 at 12:48:54PM -0700, Suren Baghdasaryan wrote:
> > > - We can call migration_entry_wait(). This will wait for PG_locked to
> > > become clear (in migration_entry_wait_on_locked()). As previously
> > > discussed offline, I think this is safe to do while holding the VMA
> > > locked.
>
> Just to be clear, this particular use of PG_locked is not during I/O,
> it's during page migration. This is a few orders of magnitude
> different.
>
> > > - We can call swap_readpage() if we allocate a new folio. I haven't
> > > traced through all this code to tell if it's OK.
>
> ... whereas this will wait for I/O. If we decide that's not OK, we'll
> need to test for FAULT_FLAG_VMA_LOCK and bail out of this path.
>
> > > So ... I believe this is all OK, but we're definitely now willing to
> > > wait for I/O from the swap device while holding the VMA lock when we
> > > weren't before. And maybe we should make a bigger deal of it in the
> > > changelog.
> > >
> > > And maybe we shouldn't just be failing the folio_lock_or_retry(),
> > > maybe we should be waiting for the folio lock with the VMA locked.
> >
> > Wouldn't that cause us to hold the VMA lock for the duration of swap I/O
> > (something you said we want to avoid in the previous paragraph) and
> > effectively undo d065bd810b6d ("mm: retry page fault when blocking on
> > disk transfer") for VMA locks?
>
> I'm not certain we want to avoid holding the VMA lock for the duration
> of an I/O. Here's how I understand the rationale for avoiding holding
> the mmap_lock while we perform I/O (before the existence of the VMA lock):
>
> - If everybody is doing page faults, there is no specific problem;
> we all hold the lock for read and multiple page faults can be handled
> in parallel.
> - As soon as one thread attempts to manipulate the tree (eg calls
> mmap()), all new readers must wait (as the rwsem is fair), and the
> writer must wait for all existing readers to finish. That's
> potentially milliseconds for an I/O during which time all page faults
> stop.
>
> Now we have the per-VMA lock, faults which can be handled without taking
> the mmap_lock can still be satisfied, as long as that VMA is not being
> modified. It is rare for a real application to take a page fault on a
> VMA which is being modified.
>
> So modifications to the tree will generally not take VMA locks on VMAs
> which are currently handling faults, and new faults will generally not
> find a VMA which is write-locked.
>
> When we find a locked folio (presumably for I/O, although folios are
> locked for other reasons), if we fall back to taking the mmap_lock
> for read, we increase contention on the mmap_lock and make the page
> fault wait on any mmap() operation.
Do you mean we increase mmap_lock contention by holding the mmap_lock
from the start of the pagefault retry until we drop it in
__folio_lock_or_retry()?
> If we simply sleep waiting for the
> I/O, we make any mmap() operation _which touches this VMA_ wait for
> the I/O to complete. But I think that's OK, because new page faults
> can continue to be serviced ... as long as they don't need to take
> the mmap_lock.
Ok, so we will potentially block VMA writers for the duration of the I/O...
Stupid question: why was this a bigger problem for mmap_lock?
Potentially our address space can consist of only one anon VMA, so
locking that VMA vs mmap_lock should be the same from swap pagefault
POV. Maybe mmap_lock is taken for write in some other important cases
when VMA lock is not needed?
>
> So ... I think what we _really_ want here is ...
>
> +++ b/mm/filemap.c
> @@ -1690,7 +1690,8 @@ static int __folio_lock_async(struct folio *folio, struct wait_page_queue *wait)
> bool __folio_lock_or_retry(struct folio *folio, struct mm_struct *mm,
> unsigned int flags)
> {
> - if (fault_flag_allow_retry_first(flags)) {
> + if (!(flags & FAULT_FLAG_VMA_LOCK) &&
> + fault_flag_allow_retry_first(flags)) {
> /*
> * CAUTION! In this case, mmap_lock is not released
> * even though return 0.
> @@ -1710,7 +1711,8 @@ bool __folio_lock_or_retry(struct folio *folio, struct mm_struct *mm,
>
> ret = __folio_lock_killable(folio);
> if (ret) {
> - mmap_read_unlock(mm);
> + if (!(flags & FAULT_FLAG_VMA_LOCK))
> + mmap_read_unlock(mm);
> return false;
> }
> } else {
>
On Fri, Apr 14, 2023 at 02:51:59PM -0700, Suren Baghdasaryan wrote:
> On Fri, Apr 14, 2023 at 1:32 PM Matthew Wilcox <[email protected]> wrote:
> >
> > On Fri, Apr 14, 2023 at 12:48:54PM -0700, Suren Baghdasaryan wrote:
> > > > - We can call migration_entry_wait(). This will wait for PG_locked to
> > > > become clear (in migration_entry_wait_on_locked()). As previously
> > > > discussed offline, I think this is safe to do while holding the VMA
> > > > locked.
> >
> > Just to be clear, this particular use of PG_locked is not during I/O,
> > it's during page migration. This is a few orders of magnitude
> > different.
> >
> > > > - We can call swap_readpage() if we allocate a new folio. I haven't
> > > > traced through all this code to tell if it's OK.
> >
> > ... whereas this will wait for I/O. If we decide that's not OK, we'll
> > need to test for FAULT_FLAG_VMA_LOCK and bail out of this path.
> >
> > > > So ... I believe this is all OK, but we're definitely now willing to
> > > > wait for I/O from the swap device while holding the VMA lock when we
> > > > weren't before. And maybe we should make a bigger deal of it in the
> > > > changelog.
> > > >
> > > > And maybe we shouldn't just be failing the folio_lock_or_retry(),
> > > > maybe we should be waiting for the folio lock with the VMA locked.
> > >
> > > Wouldn't that cause us to hold the VMA lock for the duration of swap I/O
> > > (something you said we want to avoid in the previous paragraph) and
> > > effectively undo d065bd810b6d ("mm: retry page fault when blocking on
> > > disk transfer") for VMA locks?
> >
> > I'm not certain we want to avoid holding the VMA lock for the duration
> > of an I/O. Here's how I understand the rationale for avoiding holding
> > the mmap_lock while we perform I/O (before the existence of the VMA lock):
> >
> > - If everybody is doing page faults, there is no specific problem;
> > we all hold the lock for read and multiple page faults can be handled
> > in parallel.
> > - As soon as one thread attempts to manipulate the tree (eg calls
> > mmap()), all new readers must wait (as the rwsem is fair), and the
> > writer must wait for all existing readers to finish. That's
> > potentially milliseconds for an I/O during which time all page faults
> > stop.
> >
> > Now we have the per-VMA lock, faults which can be handled without taking
> > the mmap_lock can still be satisfied, as long as that VMA is not being
> > modified. It is rare for a real application to take a page fault on a
> > VMA which is being modified.
> >
> > So modifications to the tree will generally not take VMA locks on VMAs
> > which are currently handling faults, and new faults will generally not
> > find a VMA which is write-locked.
> >
> > When we find a locked folio (presumably for I/O, although folios are
> > locked for other reasons), if we fall back to taking the mmap_lock
> > for read, we increase contention on the mmap_lock and make the page
> > fault wait on any mmap() operation.
>
> Do you mean we increase mmap_lock contention by holding the mmap_lock
> from the start of the pagefault retry until we drop it in
> __folio_lock_or_retry()?
Yes, and if there is a writer (to any VMA, not just to the VMA we're
working on), this page fault has to wait for that writer. Rather than
just waiting for this I/O to complete.
> > If we simply sleep waiting for the
> > I/O, we make any mmap() operation _which touches this VMA_ wait for
> > the I/O to complete. But I think that's OK, because new page faults
> > can continue to be serviced ... as long as they don't need to take
> > the mmap_lock.
>
> Ok, so we will potentially block VMA writers for the duration of the I/O...
> Stupid question: why was this a bigger problem for mmap_lock?
> Potentially our address space can consist of only one anon VMA, so
> locking that VMA vs mmap_lock should be the same from swap pagefault
> POV. Maybe mmap_lock is taken for write in some other important cases
> when VMA lock is not needed?
I really doubt any process has only a single VMA. Usually, there's at
least stack, program text, program data, program rodata, heap, vdso,
libc text, libc data, libc rodata, etc.
$ cat /proc/self/maps |wc -l
23
That's 5 for the program, 5 for ld.so, 5 for libc, heap, stack, vvar,
vdso, and a few I can't identify.
Also, if our program only has one anon VMA, what is it doing that
it needs to modify the tree? Growing it, perhaps? Almost anything
else would cause it to have more than one VMA ;-)
But let's consider the case where we have a 1TB VMA which is servicing the
majority of faults and we try to mprotect one page in it. The mprotect()
thread will take the mmap_lock for write, then block on the VMA lock until
all the other threads taking page faults have finished (ie waiting for all
the swapin to happen). While it waits, new faults on that VMA will also
block (on the mmap_lock as the read_trylock on the VMA lock will fail).
So yes, this case is just as bad as the original problem.
We have two alternatives here. We can do what we do today -- fall back
to taking the mmap_lock for read before dropping it while we wait
for the folio lock. Or we can do something like this ...
+++ b/mm/filemap.c
@@ -1698,7 +1698,10 @@ bool __folio_lock_or_retry(struct folio *folio, struct mm_struct *mm,
if (flags & FAULT_FLAG_RETRY_NOWAIT)
return false;
- mmap_read_unlock(mm);
+ if (flags & FAULT_FLAG_VMA_LOCK)
+ vma_end_read(vma);
+ else
+ mmap_read_unlock(mm);
if (flags & FAULT_FLAG_KILLABLE)
folio_wait_locked_killable(folio);
else
... and then figure out a return value that lets the caller of
handle_mm_fault() know not to call vma_end_read().
Or we can decide that it's OK to reintroduce this worst-case scenario,
because we've now reduced the likelihood of it happening so far.
Matthew Wilcox <[email protected]> writes:
> On Fri, Apr 14, 2023 at 11:00:43AM -0700, Suren Baghdasaryan wrote:
>> When a page fault is handled under VMA lock protection, all swap page
>> faults are retried with mmap_lock because the folio_lock_or_retry()
>> implementation has to drop and reacquire mmap_lock if the folio could
>> not be immediately locked.
>> Instead of retrying all swap page faults, retry only when folio
>> locking fails.
>
> Reviewed-by: Matthew Wilcox (Oracle) <[email protected]>
>
> Let's just review what can now be handled under the VMA lock instead of
> the mmap_lock, in case somebody knows better than me that it's not safe.
>
> - We can call migration_entry_wait(). This will wait for PG_locked to
> become clear (in migration_entry_wait_on_locked()). As previously
> discussed offline, I think this is safe to do while holding the VMA
> locked.
Do we even need to be holding the VMA locked while in
migration_entry_wait()? My understanding is we're just waiting for
PG_locked to be cleared so we can return with a reasonable chance the
migration entry is gone. If for example it has been unmapped or
protections downgraded we will simply refault.
> - We can call remove_device_exclusive_entry(). That calls
> folio_lock_or_retry(), which will fail if it can't get the VMA lock.
Looks ok to me.
> - We can call pgmap->ops->migrate_to_ram(). Perhaps somebody familiar
> with Nouveau and amdkfd could comment on how safe this is?
Currently this won't work because drivers assume mmap_lock is held during
pgmap->ops->migrate_to_ram(). Primarily this is because
migrate_vma_setup()/migrate_vma_pages() is used to handle the fault and
that asserts mmap_lock is taken in walk_page_range() and also
migrate_vma_insert_page().
So I don't think we can call that case without mmap_lock.
At a glance it seems it should be relatively easy to move to using
lock_vma_under_rcu(). Drivers will need updating as well though because
migrate_vma_setup() is called outside of fault handling paths so drivers
will currently take mmap_lock rather than vma lock when looking up the
vma. See for example nouveau_svmm_bind().
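For reference, the assertion in question sits at the top of the page
walker, roughly (paraphrased from mm/pagewalk.c):

	int walk_page_range(struct mm_struct *mm, unsigned long start,
			    unsigned long end, const struct mm_walk_ops *ops,
			    void *private)
	{
		/* ... */
		mmap_assert_locked(mm);	/* trips if only the VMA lock is held */
		/* ... */
	}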
> - I believe we can't call handle_pte_marker() because we exclude UFFD
> VMAs earlier.
> - We can call swap_readpage() if we allocate a new folio. I haven't
> traced through all this code to tell if it's OK.
>
> So ... I believe this is all OK, but we're definitely now willing to
> wait for I/O from the swap device while holding the VMA lock when we
> weren't before. And maybe we should make a bigger deal of it in the
> changelog.
>
> And maybe we shouldn't just be failing the folio_lock_or_retry(),
> maybe we should be waiting for the folio lock with the VMA locked.
On Sun, Apr 16, 2023 at 6:06 PM Alistair Popple <[email protected]> wrote:
>
>
> Matthew Wilcox <[email protected]> writes:
>
> > On Fri, Apr 14, 2023 at 11:00:43AM -0700, Suren Baghdasaryan wrote:
> >> When a page fault is handled under VMA lock protection, all swap page
> >> faults are retried with mmap_lock because the folio_lock_or_retry()
> >> implementation has to drop and reacquire mmap_lock if the folio could
> >> not be immediately locked.
> >> Instead of retrying all swap page faults, retry only when folio
> >> locking fails.
> >
> > Reviewed-by: Matthew Wilcox (Oracle) <[email protected]>
> >
> > Let's just review what can now be handled under the VMA lock instead of
> > the mmap_lock, in case somebody knows better than me that it's not safe.
> >
> > - We can call migration_entry_wait(). This will wait for PG_locked to
> > become clear (in migration_entry_wait_on_locked()). As previously
> > discussed offline, I think this is safe to do while holding the VMA
> > locked.
>
> Do we even need to be holding the VMA locked while in
> migration_entry_wait()? My understanding is we're just waiting for
> PG_locked to be cleared so we can return with a reasonable chance the
> migration entry is gone. If for example it has been unmapped or
> protections downgraded we will simply refault.
If we drop the VMA lock before migration_entry_wait() then we would need
to call lock_vma_under_rcu() again after the wait. In which case it might be
simpler to retry the fault with some special return code to indicate
that the VMA lock is not held anymore and we want to retry without taking
mmap_lock. I think it's similar to the last option Matthew suggested
earlier. In which case we can reuse the same retry mechanism for both
cases, here and in __folio_lock_or_retry().
>
> > - We can call remove_device_exclusive_entry(). That calls
> > folio_lock_or_retry(), which will fail if it can't get the VMA lock.
>
> Looks ok to me.
>
> > - We can call pgmap->ops->migrate_to_ram(). Perhaps somebody familiar
> > with Nouveau and amdkfd could comment on how safe this is?
>
> Currently this won't work because drivers assume mmap_lock is held during
> pgmap->ops->migrate_to_ram(). Primarily this is because
> migrate_vma_setup()/migrate_vma_pages() is used to handle the fault and
> that asserts mmap_lock is taken in walk_page_range() and also
> migrate_vma_insert_page().
>
> So I don't think we can call that case without mmap_lock.
>
> At a glance it seems it should be relatively easy to move to using
> lock_vma_under_rcu(). Drivers will need updating as well though because
> migrate_vma_setup() is called outside of fault handling paths so drivers
> will currently take mmap_lock rather than vma lock when looking up the
> vma. See for example nouveau_svmm_bind().
Thanks for the pointers, Alistair! It does look like we need to be
more careful with the migrate_to_ram() path. For now I can fall back to
retrying with mmap_lock for this case, like we do with all cases
today. Afterwards this path can be made ready for working under VMA
lock and we can remove that retry. Does that sound good?
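A minimal sketch of that fallback in do_swap_page() -- illustrative only,
not the patch as posted:

	} else if (is_device_private_entry(entry)) {
		if (vmf->flags & FAULT_FLAG_VMA_LOCK) {
			/*
			 * migrate_to_ram() and its drivers still assume
			 * mmap_lock is held; retry under mmap_lock for now.
			 */
			ret = VM_FAULT_RETRY;
			goto out;
		}
		vmf->page = pfn_swap_entry_to_page(entry);
		/* ... */
		ret = vmf->page->pgmap->ops->migrate_to_ram(vmf);
	}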
>
> > - I believe we can't call handle_pte_marker() because we exclude UFFD
> > VMAs earlier.
> > - We can call swap_readpage() if we allocate a new folio. I haven't
> > traced through all this code to tell if it's OK.
> >
> > So ... I believe this is all OK, but we're definitely now willing to
> > wait for I/O from the swap device while holding the VMA lock when we
> > weren't before. And maybe we should make a bigger deal of it in the
> > changelog.
> >
> > And maybe we shouldn't just be failing the folio_lock_or_retry(),
> > maybe we should be waiting for the folio lock with the VMA locked.
>
Suren Baghdasaryan <[email protected]> writes:
> On Sun, Apr 16, 2023 at 6:06 PM Alistair Popple <[email protected]> wrote:
>>
>>
>> Matthew Wilcox <[email protected]> writes:
>>
>> > On Fri, Apr 14, 2023 at 11:00:43AM -0700, Suren Baghdasaryan wrote:
>> >> When a page fault is handled under VMA lock protection, all swap page
>> >> faults are retried with mmap_lock because the folio_lock_or_retry()
>> >> implementation has to drop and reacquire mmap_lock if the folio could
>> >> not be immediately locked.
>> >> Instead of retrying all swap page faults, retry only when folio
>> >> locking fails.
>> >
>> > Reviewed-by: Matthew Wilcox (Oracle) <[email protected]>
>> >
>> > Let's just review what can now be handled under the VMA lock instead of
>> > the mmap_lock, in case somebody knows better than me that it's not safe.
>> >
>> > - We can call migration_entry_wait(). This will wait for PG_locked to
>> > become clear (in migration_entry_wait_on_locked()). As previously
>> > discussed offline, I think this is safe to do while holding the VMA
>> > locked.
>>
>> Do we even need to be holding the VMA locked while in
>> migration_entry_wait()? My understanding is we're just waiting for
>> PG_locked to be cleared so we can return with a reasonable chance the
>> migration entry is gone. If for example it has been unmapped or
>> protections downgraded we will simply refault.
>
> If we drop the VMA lock before migration_entry_wait() then we would need
> to call lock_vma_under_rcu() again after the wait. In which case it might be
> simpler to retry the fault with some special return code to indicate
> that the VMA lock is not held anymore and we want to retry without taking
> mmap_lock. I think it's similar to the last option Matthew suggested
> earlier. In which case we can reuse the same retry mechanism for both
> cases, here and in __folio_lock_or_retry().
Good point. Agree there is no reason to re-take the VMA lock after the
wait, although in this case we shouldn't need to retry the fault
(ie. return VM_FAULT_RETRY). Just skip calling vma_end_read() on the way
out to userspace.
>>
>> > - We can call remove_device_exclusive_entry(). That calls
>> > folio_lock_or_retry(), which will fail if it can't get the VMA lock.
>>
>> Looks ok to me.
>>
>> > - We can call pgmap->ops->migrate_to_ram(). Perhaps somebody familiar
>> > with Nouveau and amdkfd could comment on how safe this is?
>>
>> Currently this won't work because drivers assume mmap_lock is held during
>> pgmap->ops->migrate_to_ram(). Primarily this is because
>> migrate_vma_setup()/migrate_vma_pages() is used to handle the fault and
>> that asserts mmap_lock is taken in walk_page_range() and also
>> migrate_vma_insert_page().
>>
>> So I don't think we can call that case without mmap_lock.
>>
>> At a glance it seems it should be relatively easy to move to using
>> lock_vma_under_rcu(). Drivers will need updating as well though because
>> migrate_vma_setup() is called outside of fault handling paths so drivers
>> will currently take mmap_lock rather than vma lock when looking up the
>> vma. See for example nouveau_svmm_bind().
>
> Thanks for the pointers, Alistair! It does look like we need to be
> more careful with the migrate_to_ram() path. For now I can fall back to
> retrying with mmap_lock for this case, like we do with all cases
> today. Afterwards this path can be made ready for working under VMA
> lock and we can remove that retry. Does that sound good?
Sounds good to me. Fixing that shouldn't be too difficult but will need
changes to at least Nouveau and amdkfd (and hmm-tests obviously). Happy
to look at doing that if/when this change makes it in. Thanks.
>>
>> > - I believe we can't call handle_pte_marker() because we exclude UFFD
>> > VMAs earlier.
>> > - We can call swap_readpage() if we allocate a new folio. I haven't
>> > traced through all this code to tell if it's OK.
>> >
>> > So ... I believe this is all OK, but we're definitely now willing to
>> > wait for I/O from the swap device while holding the VMA lock when we
>> > weren't before. And maybe we should make a bigger deal of it in the
>> > changelog.
>> >
>> > And maybe we shouldn't just be failing the folio_lock_or_retry(),
>> > maybe we should be waiting for the folio lock with the VMA locked.
>>
On Mon, Apr 17, 2023 at 4:33 PM Alistair Popple <[email protected]> wrote:
>
>
> Suren Baghdasaryan <[email protected]> writes:
>
> > On Sun, Apr 16, 2023 at 6:06 PM Alistair Popple <[email protected]> wrote:
> >>
> >>
> >> Matthew Wilcox <[email protected]> writes:
> >>
> >> > On Fri, Apr 14, 2023 at 11:00:43AM -0700, Suren Baghdasaryan wrote:
> >> >> When a page fault is handled under VMA lock protection, all swap page
> >> >> faults are retried with mmap_lock because the folio_lock_or_retry()
> >> >> implementation has to drop and reacquire mmap_lock if the folio could
> >> >> not be immediately locked.
> >> >> Instead of retrying all swap page faults, retry only when folio
> >> >> locking fails.
> >> >
> >> > Reviewed-by: Matthew Wilcox (Oracle) <[email protected]>
> >> >
> >> > Let's just review what can now be handled under the VMA lock instead of
> >> > the mmap_lock, in case somebody knows better than me that it's not safe.
> >> >
> >> > - We can call migration_entry_wait(). This will wait for PG_locked to
> >> > become clear (in migration_entry_wait_on_locked()). As previously
> >> > discussed offline, I think this is safe to do while holding the VMA
> >> > locked.
> >>
> >> Do we even need to be holding the VMA locked while in
> >> migration_entry_wait()? My understanding is we're just waiting for
> >> PG_locked to be cleared so we can return with a reasonable chance the
> >> migration entry is gone. If for example it has been unmapped or
> >> protections downgraded we will simply refault.
> >
> > If we drop the VMA lock before migration_entry_wait() then we would need
> > to call lock_vma_under_rcu() again after the wait. In which case it might be
> > simpler to retry the fault with some special return code to indicate
> > that the VMA lock is not held anymore and we want to retry without taking
> > mmap_lock. I think it's similar to the last option Matthew suggested
> > earlier. In which case we can reuse the same retry mechanism for both
> > cases, here and in __folio_lock_or_retry().
>
> Good point. Agree there is no reason to re-take the VMA lock after the
> wait, although in this case we shouldn't need to retry the fault
> (ie. return VM_FAULT_RETRY). Just skip calling vma_end_read() on the way
> out to userspace.
Actually, __collapse_huge_page_swapin(), which calls do_swap_page(), can
use the VMA reference again inside its loop unless we return
VM_FAULT_RETRY or VM_FAULT_ERROR. That is not safe since we dropped
the VMA lock and the stability of the VMA is not guaranteed at that point.
So, we do need to return VM_FAULT_RETRY, maybe with another bit
indicating that the retry does not need to fall back to mmap_lock. Something
like "return VM_FAULT_RETRY | VM_FAULT_USE_VMA_LOCK".
>
> >>
> >> > - We can call remove_device_exclusive_entry(). That calls
> >> > folio_lock_or_retry(), which will fail if it can't get the VMA lock.
> >>
> >> Looks ok to me.
> >>
> >> > - We can call pgmap->ops->migrate_to_ram(). Perhaps somebody familiar
> >> > with Nouveau and amdkfd could comment on how safe this is?
> >>
> >> Currently this won't work because drivers assume mmap_lock is held during
> >> pgmap->ops->migrate_to_ram(). Primarily this is because
> >> migrate_vma_setup()/migrate_vma_pages() is used to handle the fault and
> >> that asserts mmap_lock is taken in walk_page_range() and also
> >> migrate_vma_insert_page().
> >>
> >> So I don't think we can call that case without mmap_lock.
> >>
> >> At a glance it seems it should be relatively easy to move to using
> >> lock_vma_under_rcu(). Drivers will need updating as well though because
> >> migrate_vma_setup() is called outside of fault handling paths so drivers
> >> will currently take mmap_lock rather than vma lock when looking up the
> >> vma. See for example nouveau_svmm_bind().
> >
> > Thanks for the pointers, Alistair! It does look like we need to be
> > more careful with the migrate_to_ram() path. For now I can fall back to
> > retrying with mmap_lock for this case, like we do with all cases
> > today. Afterwards this path can be made ready for working under VMA
> > lock and we can remove that retry. Does that sound good?
>
> Sounds good to me. Fixing that shouldn't be too difficult but will need
> changes to at least Nouveau and amdkfd (and hmm-tests obviously). Happy
> to look at doing that if/when this change makes it in. Thanks.
>
> >>
> >> > - I believe we can't call handle_pte_marker() because we exclude UFFD
> >> > VMAs earlier.
> >> > - We can call swap_readpage() if we allocate a new folio. I haven't
> >> > traced through all this code to tell if it's OK.
> >> >
> >> > So ... I believe this is all OK, but we're definitely now willing to
> >> > wait for I/O from the swap device while holding the VMA lock when we
> >> > weren't before. And maybe we should make a bigger deal of it in the
> >> > changelog.
> >> >
> >> > And maybe we shouldn't just be failing the folio_lock_or_retry(),
> >> > maybe we should be waiting for the folio lock with the VMA locked.
> >>
>
On Mon, Apr 17, 2023 at 4:50 PM Suren Baghdasaryan <[email protected]> wrote:
>
> On Mon, Apr 17, 2023 at 4:33 PM Alistair Popple <[email protected]> wrote:
> >
> >
> > Suren Baghdasaryan <[email protected]> writes:
> >
> > > On Sun, Apr 16, 2023 at 6:06 PM Alistair Popple <[email protected]> wrote:
> > >>
> > >>
> > >> Matthew Wilcox <[email protected]> writes:
> > >>
> > >> > On Fri, Apr 14, 2023 at 11:00:43AM -0700, Suren Baghdasaryan wrote:
> > >> >> When a page fault is handled under VMA lock protection, all swap page
> > >> >> faults are retried with mmap_lock because the folio_lock_or_retry()
> > >> >> implementation has to drop and reacquire mmap_lock if the folio could
> > >> >> not be immediately locked.
> > >> >> Instead of retrying all swap page faults, retry only when folio
> > >> >> locking fails.
> > >> >
> > >> > Reviewed-by: Matthew Wilcox (Oracle) <[email protected]>
> > >> >
> > >> > Let's just review what can now be handled under the VMA lock instead of
> > >> > the mmap_lock, in case somebody knows better than me that it's not safe.
> > >> >
> > >> > - We can call migration_entry_wait(). This will wait for PG_locked to
> > >> > become clear (in migration_entry_wait_on_locked()). As previously
> > >> > discussed offline, I think this is safe to do while holding the VMA
> > >> > locked.
> > >>
> > >> Do we even need to be holding the VMA locked while in
> > >> migration_entry_wait()? My understanding is we're just waiting for
> > >> PG_locked to be cleared so we can return with a reasonable chance the
> > >> migration entry is gone. If for example it has been unmapped or
> > >> protections downgraded we will simply refault.
> > >
> > > If we drop the VMA lock before migration_entry_wait() then we would need
> > > to call lock_vma_under_rcu() again after the wait. In which case it might be
> > > simpler to retry the fault with some special return code to indicate
> > > that the VMA lock is not held anymore and we want to retry without taking
> > > mmap_lock. I think it's similar to the last option Matthew suggested
> > > earlier. In which case we can reuse the same retry mechanism for both
> > > cases, here and in __folio_lock_or_retry().
> >
> > Good point. Agree there is no reason to re-take the VMA lock after the
> > wait, although in this case we shouldn't need to retry the fault
> > (ie. return VM_FAULT_RETRY). Just skip calling vma_end_read() on the way
> > out to userspace.
>
> Actually, __collapse_huge_page_swapin(), which calls do_swap_page(), can
> use the VMA reference again inside its loop unless we return
> VM_FAULT_RETRY or VM_FAULT_ERROR. That is not safe since we dropped
> the VMA lock and the stability of the VMA is not guaranteed at that point.
> So, we do need to return VM_FAULT_RETRY, maybe with another bit
> indicating that the retry does not need to fall back to mmap_lock. Something
> like "return VM_FAULT_RETRY | VM_FAULT_USE_VMA_LOCK".
False alarm: __collapse_huge_page_swapin() is always called under
mmap_lock protection. I'll go over the code once more to make sure
nothing else would use the VMA after we drop the VMA lock in the page
fault path.
>
> >
> > >>
> > >> > - We can call remove_device_exclusive_entry(). That calls
> > >> > folio_lock_or_retry(), which will fail if it can't get the VMA lock.
> > >>
> > >> Looks ok to me.
> > >>
> > >> > - We can call pgmap->ops->migrate_to_ram(). Perhaps somebody familiar
> > >> > with Nouveau and amdkfd could comment on how safe this is?
> > >>
> > >> Currently this won't work because drivers assume mmap_lock is held during
> > >> pgmap->ops->migrate_to_ram(). Primarily this is because
> > >> migrate_vma_setup()/migrate_vma_pages() is used to handle the fault and
> > >> that asserts mmap_lock is taken in walk_page_range() and also
> > >> migrate_vma_insert_page().
> > >>
> > >> So I don't think we can call that case without mmap_lock.
> > >>
> > >> At a glance it seems it should be relatively easy to move to using
> > >> lock_vma_under_rcu(). Drivers will need updating as well though because
> > >> migrate_vma_setup() is called outside of fault handling paths so drivers
> > >> will currently take mmap_lock rather than vma lock when looking up the
> > >> vma. See for example nouveau_svmm_bind().
> > >
> > > Thanks for the pointers, Alistair! It does look like we need to be
> > > more careful with the migrate_to_ram() path. For now I can fall back to
> > > retrying with mmap_lock for this case, like we do with all cases
> > > today. Afterwards this path can be made ready for working under VMA
> > > lock and we can remove that retry. Does that sound good?
> >
> > Sounds good to me. Fixing that shouldn't be too difficult but will need
> > changes to at least Nouveau and amdkfd (and hmm-tests obviously). Happy
> > to look at doing that if/when this change makes it in. Thanks.
> >
> > >>
> > >> > - I believe we can't call handle_pte_marker() because we exclude UFFD
> > >> > VMAs earlier.
> > >> > - We can call swap_readpage() if we allocate a new folio. I haven't
> > >> > traced through all this code to tell if it's OK.
> > >> >
> > >> > So ... I believe this is all OK, but we're definitely now willing to
> > >> > wait for I/O from the swap device while holding the VMA lock when we
> > >> > weren't before. And maybe we should make a bigger deal of it in the
> > >> > changelog.
> > >> >
> > >> > And maybe we shouldn't just be failing the folio_lock_or_retry(),
> > >> > maybe we should be waiting for the folio lock with the VMA locked.
> > >>
> >
On Mon, Apr 17, 2023 at 6:07 PM Suren Baghdasaryan <[email protected]> wrote:
>
> On Mon, Apr 17, 2023 at 4:50 PM Suren Baghdasaryan <[email protected]> wrote:
> >
> > On Mon, Apr 17, 2023 at 4:33 PM Alistair Popple <[email protected]> wrote:
> > >
> > >
> > > Suren Baghdasaryan <[email protected]> writes:
> > >
> > > > On Sun, Apr 16, 2023 at 6:06 PM Alistair Popple <[email protected]> wrote:
> > > >>
> > > >>
> > > >> Matthew Wilcox <[email protected]> writes:
> > > >>
> > > >> > On Fri, Apr 14, 2023 at 11:00:43AM -0700, Suren Baghdasaryan wrote:
> > > >> >> When a page fault is handled under VMA lock protection, all swap page
> > > >> >> faults are retried with mmap_lock because the folio_lock_or_retry()
> > > >> >> implementation has to drop and reacquire mmap_lock if the folio could
> > > >> >> not be immediately locked.
> > > >> >> Instead of retrying all swap page faults, retry only when folio
> > > >> >> locking fails.
> > > >> >
> > > >> > Reviewed-by: Matthew Wilcox (Oracle) <[email protected]>
> > > >> >
> > > >> > Let's just review what can now be handled under the VMA lock instead of
> > > >> > the mmap_lock, in case somebody knows better than me that it's not safe.
> > > >> >
> > > >> > - We can call migration_entry_wait(). This will wait for PG_locked to
> > > >> > become clear (in migration_entry_wait_on_locked()). As previously
> > > >> > discussed offline, I think this is safe to do while holding the VMA
> > > >> > locked.
> > > >>
> > > >> Do we even need to be holding the VMA locked while in
> > > >> migration_entry_wait()? My understanding is we're just waiting for
> > > >> PG_locked to be cleared so we can return with a reasonable chance the
> > > >> migration entry is gone. If for example it has been unmapped or
> > > >> protections downgraded we will simply refault.
> > > >
> > > > If we drop the VMA lock before migration_entry_wait() then we would need
> > > > to call lock_vma_under_rcu() again after the wait. In which case it might be
> > > > simpler to retry the fault with some special return code to indicate
> > > > that the VMA lock is not held anymore and we want to retry without taking
> > > > mmap_lock. I think it's similar to the last option Matthew suggested
> > > > earlier. In which case we can reuse the same retry mechanism for both
> > > > cases, here and in __folio_lock_or_retry().
> > >
> > > Good point. Agree there is no reason to re-take the VMA lock after the
> > > wait, although in this case we shouldn't need to retry the fault
> > > (ie. return VM_FAULT_RETRY). Just skip calling vma_end_read() on the way
> > > out to userspace.
> >
> > Actually, __collapse_huge_page_swapin(), which calls do_swap_page(), can
> > use the VMA reference again inside its loop unless we return
> > VM_FAULT_RETRY or VM_FAULT_ERROR. That is not safe since we dropped
> > the VMA lock and the stability of the VMA is not guaranteed at that point.
> > So, we do need to return VM_FAULT_RETRY, maybe with another bit
> > indicating that the retry does not need to fall back to mmap_lock. Something
> > like "return VM_FAULT_RETRY | VM_FAULT_USE_VMA_LOCK".
>
> False alarm: __collapse_huge_page_swapin() is always called under
> mmap_lock protection. I'll go over the code once more to make sure
> nothing else would use the VMA after we drop the VMA lock in the page
> fault path.
I posted a new series at
https://lore.kernel.org/all/[email protected]/
It implements the suggestions discussed in this thread. Feedback is
appreciated! Thanks!
>
>
> >
> > >
> > > >>
> > > >> > - We can call remove_device_exclusive_entry(). That calls
> > > >> > folio_lock_or_retry(), which will fail if it can't get the VMA lock.
> > > >>
> > > >> Looks ok to me.
> > > >>
> > > >> > - We can call pgmap->ops->migrate_to_ram(). Perhaps somebody familiar
> > > >> > with Nouveau and amdkfd could comment on how safe this is?
> > > >>
> > > >> Currently this won't work because drivers assume mmap_lock is held during
> > > >> pgmap->ops->migrate_to_ram(). Primarily this is because
> > > >> migrate_vma_setup()/migrate_vma_pages() is used to handle the fault and
> > > >> that asserts mmap_lock is taken in walk_page_range() and also
> > > >> migrate_vma_insert_page().
> > > >>
> > > >> So I don't think we can call that case without mmap_lock.
> > > >>
> > > >> At a glance it seems it should be relatively easy to move to using
> > > >> lock_vma_under_rcu(). Drivers will need updating as well though because
> > > >> migrate_vma_setup() is called outside of fault handling paths so drivers
> > > >> will currently take mmap_lock rather than vma lock when looking up the
> > > >> vma. See for example nouveau_svmm_bind().
> > > >
> > > > Thanks for the pointers, Alistair! It does look like we need to be
> > > > more careful with the migrate_to_ram() path. For now I can fall back to
> > > > retrying with mmap_lock for this case, like we do with all cases
> > > > today. Afterwards this path can be made ready for working under VMA
> > > > lock and we can remove that retry. Does that sound good?
> > >
> > > Sounds good to me. Fixing that shouldn't be too difficult but will need
> > > changes to at least Nouveau and amdkfd (and hmm-tests obviously). Happy
> > > to look at doing that if/when this change makes it in. Thanks.
> > >
> > > >>
> > > >> > - I believe we can't call handle_pte_marker() because we exclude UFFD
> > > >> > VMAs earlier.
> > > >> > - We can call swap_readpage() if we allocate a new folio. I haven't
> > > >> > traced through all this code to tell if it's OK.
> > > >> >
> > > >> > So ... I believe this is all OK, but we're definitely now willing to
> > > >> > wait for I/O from the swap device while holding the VMA lock when we
> > > >> > weren't before. And maybe we should make a bigger deal of it in the
> > > >> > changelog.
> > > >> >
> > > >> > And maybe we shouldn't just be failing the folio_lock_or_retry(),
> > > >> > maybe we should be waiting for the folio lock with the VMA locked.
> > > >>
> > >