2020-07-22 17:48:26

by Andrea Righi

Subject: [PATCH] mm: swap: do not wait for lock_page() in unuse_pte_range()

Waiting for lock_page() with mm->mmap_sem held in unuse_pte_range() can
lead to stalls while running swapoff (e.g., not being able to ssh into
the system or to run simple commands like 'ps').

Replace lock_page() with trylock_page() and release mm->mmap_sem if we
fail to lock the page; unuse_mm() then re-takes the semaphore and
retries the scan from the first VMA, giving other tasks a chance to run
in the meantime and preventing the stall.

This issue can be easily reproduced by running swapoff on systems with
a large amount of RAM (>=100GB) and many pages swapped out to disk. A
specific use case is running swapoff immediately after resuming from
hibernation.

Under these conditions, without this patch applied, the system can
stall for as long as 15 minutes; with the patch applied it always
remains responsive.

Signed-off-by: Andrea Righi <[email protected]>
---
mm/swapfile.c | 12 +++++++++++-
1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/mm/swapfile.c b/mm/swapfile.c
index 987276c557d1..794935ecf82a 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1977,7 +1977,11 @@ static int unuse_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 			return -ENOMEM;
 		}
 
-		lock_page(page);
+		if (!trylock_page(page)) {
+			ret = -EAGAIN;
+			put_page(page);
+			goto out;
+		}
 		wait_on_page_writeback(page);
 		ret = unuse_pte(vma, pmd, addr, entry, page);
 		if (ret < 0) {
@@ -2100,11 +2104,17 @@ static int unuse_mm(struct mm_struct *mm, unsigned int type,
 	struct vm_area_struct *vma;
 	int ret = 0;
 
+retry:
 	mmap_read_lock(mm);
 	for (vma = mm->mmap; vma; vma = vma->vm_next) {
 		if (vma->anon_vma) {
 			ret = unuse_vma(vma, type, frontswap,
 					fs_pages_to_unuse);
+			if (ret == -EAGAIN) {
+				mmap_read_unlock(mm);
+				cond_resched();
+				goto retry;
+			}
 			if (ret)
 				break;
 		}
--
2.25.1


2020-07-22 18:05:35

by Matthew Wilcox

Subject: Re: [PATCH] mm: swap: do not wait for lock_page() in unuse_pte_range()

On Wed, Jul 22, 2020 at 07:44:36PM +0200, Andrea Righi wrote:
> Waiting for lock_page() with mm->mmap_sem held in unuse_pte_range() can
> lead to stalls while running swapoff (e.g., not being able to ssh into
> the system or to run simple commands like 'ps').
>
> Replace lock_page() with trylock_page() and release mm->mmap_sem if we
> fail to lock the page; unuse_mm() then re-takes the semaphore and
> retries the scan from the first VMA, giving other tasks a chance to run
> in the meantime and preventing the stall.

I think you've removed the warning at the expense of turning a stall
into a potential livelock: if the page stays locked, swapoff will just
keep retrying the scan from the first VMA without ever making progress.

> @@ -1977,7 +1977,11 @@ static int unuse_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
> 			return -ENOMEM;
> 		}
>
> -		lock_page(page);
> +		if (!trylock_page(page)) {
> +			ret = -EAGAIN;
> +			put_page(page);
> +			goto out;
> +		}

If you look at the patterns we have elsewhere in the MM for doing
this kind of thing (e.g., truncate_inode_pages_range()), we iterate over the
entire range, take care of the easy cases, then go back and deal with the
hard cases later.

So that would argue for skipping any page that we can't trylock, but
continue over at least the VMA, and quite possibly the entire MM until
we're convinced that we have unused all of the required pages.
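
As a rough, untested sketch (the 'nr_skipped' counter is made up, just
to show the shape):

	/*
	 * First pass: only unuse pages we can lock without sleeping,
	 * and count the ones we had to skip.
	 */
	if (!trylock_page(page)) {
		nr_skipped++;
		put_page(page);
		continue;	/* keep scanning the rest of the range */
	}

with the caller repeating the walk (paying for a blocking lock_page()
on the second pass if need be) for as long as nr_skipped is non-zero.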

Another thing we could do is drop the MM semaphore _here_, sleep on this
page until it's unlocked, then go around again.

	if (!trylock_page(page)) {
		mmap_read_unlock(mm);
		lock_page(page);
		unlock_page(page);
		put_page(page);
		ret = -EAGAIN;
		goto out;
	}

(I haven't checked the call paths; maybe you can't do this because
sometimes it's called with the mmap sem held for write)

Also, if we're trying to scale this better, there are some fun
workloads where readers block writers who block subsequent readers
and we shouldn't wait for I/O in swapin_readahead(). See patches like
6b4c9f4469819a0c1a38a0a4541337e0f9bf6c11 for more on this kind of thing.
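
The rough shape of that (untested, fault path elided; just to
illustrate the idea) is to give up mmap_sem before sleeping on the
read rather than holding it across the wait:

	if (!PageUptodate(page)) {
		/* don't hold mmap_sem while the I/O completes */
		mmap_read_unlock(mm);
		wait_on_page_locked(page);
		put_page(page);
		return -EAGAIN;	/* caller retakes the lock and retries */
	}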

2020-07-22 18:51:22

by Andrea Righi

Subject: Re: [PATCH] mm: swap: do not wait for lock_page() in unuse_pte_range()

On Wed, Jul 22, 2020 at 07:04:25PM +0100, Matthew Wilcox wrote:
> On Wed, Jul 22, 2020 at 07:44:36PM +0200, Andrea Righi wrote:
> > Waiting for lock_page() with mm->mmap_sem held in unuse_pte_range() can
> > lead to stalls while running swapoff (e.g., not being able to ssh into
> > the system or to run simple commands like 'ps').
> >
> > Replace lock_page() with trylock_page() and release mm->mmap_sem if we
> > fail to lock the page; unuse_mm() then re-takes the semaphore and
> > retries the scan from the first VMA, giving other tasks a chance to run
> > in the meantime and preventing the stall.
>
> I think you've removed the warning at the expense of turning a stall
> into a potential livelock: if the page stays locked, swapoff will just
> keep retrying the scan from the first VMA without ever making progress.
>
> > @@ -1977,7 +1977,11 @@ static int unuse_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
> > 			return -ENOMEM;
> > 		}
> >
> > -		lock_page(page);
> > +		if (!trylock_page(page)) {
> > +			ret = -EAGAIN;
> > +			put_page(page);
> > +			goto out;
> > +		}
>
> If you look at the patterns we have elsewhere in the MM for doing
> this kind of thing (e.g., truncate_inode_pages_range()), we iterate over the
> entire range, take care of the easy cases, then go back and deal with the
> hard cases later.
>
> So that would argue for skipping any page that we can't trylock, but
> continue over at least the VMA, and quite possibly the entire MM until
> we're convinced that we have unused all of the required pages.
>
> Another thing we could do is drop the MM semaphore _here_, sleep on this
> page until it's unlocked, then go around again.
>
> 	if (!trylock_page(page)) {
> 		mmap_read_unlock(mm);
> 		lock_page(page);
> 		unlock_page(page);
> 		put_page(page);
> 		ret = -EAGAIN;
> 		goto out;
> 	}
>
> (I haven't checked the call paths; maybe you can't do this because
> sometimes it's called with the mmap sem held for write)
>
> Also, if we're trying to scale this better, there are some fun
> workloads where readers block writers who block subsequent readers
> and we shouldn't wait for I/O in swapin_readahead(). See patches like
> 6b4c9f4469819a0c1a38a0a4541337e0f9bf6c11 for more on this kind of thing.

Thanks for the review, Matthew. I'll see if I can find a better solution
following your useful hints!

-Andrea