Here are four fixes to the new "rid quadratic" swapoff in 5.1-rc:
1/4 mm: swapoff: shmem_find_swap_entries() filter out other types
2/4 mm: swapoff: remove too limiting SWAP_UNUSE_MAX_TRIES
3/4 mm: swapoff: take notice of completion sooner
4/4 mm: swapoff: shmem_unuse() stop eviction without igrab()
include/linux/shmem_fs.h | 1
mm/shmem.c | 57 +++++++++++++++++--------------------
mm/swapfile.c | 32 +++++++++++---------
3 files changed, 45 insertions(+), 45 deletions(-)
Hugh
Swapfile "type" was passed all the way down to shmem_unuse_inode(), but
then forgotten from shmem_find_swap_entries(): with the result that
removing one swapfile would try to free up all the swap from shmem - no
problem when only one swapfile anyway, but counter-productive when more,
causing swapoff to be unnecessarily OOM-killed when it should succeed.
Fixes: b56a2d8af914 ("mm: rid swapoff of quadratic complexity")
Signed-off-by: Hugh Dickins <[email protected]>
---
mm/shmem.c | 18 +++++++++---------
1 file changed, 9 insertions(+), 9 deletions(-)
--- 5.1-rc4/mm/shmem.c 2019-03-17 16:18:15.701823872 -0700
+++ linux/mm/shmem.c 2019-04-07 19:12:23.603858531 -0700
@@ -1099,10 +1099,11 @@ extern struct swap_info_struct *swap_inf
static int shmem_find_swap_entries(struct address_space *mapping,
pgoff_t start, unsigned int nr_entries,
struct page **entries, pgoff_t *indices,
- bool frontswap)
+ unsigned int type, bool frontswap)
{
XA_STATE(xas, &mapping->i_pages, start);
struct page *page;
+ swp_entry_t entry;
unsigned int ret = 0;
if (!nr_entries)
@@ -1116,13 +1117,12 @@ static int shmem_find_swap_entries(struc
if (!xa_is_value(page))
continue;
- if (frontswap) {
- swp_entry_t entry = radix_to_swp_entry(page);
-
- if (!frontswap_test(swap_info[swp_type(entry)],
- swp_offset(entry)))
- continue;
- }
+ entry = radix_to_swp_entry(page);
+ if (swp_type(entry) != type)
+ continue;
+ if (frontswap &&
+ !frontswap_test(swap_info[type], swp_offset(entry)))
+ continue;
indices[ret] = xas.xa_index;
entries[ret] = page;
@@ -1194,7 +1194,7 @@ static int shmem_unuse_inode(struct inod
pvec.nr = shmem_find_swap_entries(mapping, start, nr_entries,
pvec.pages, indices,
- frontswap);
+ type, frontswap);
if (pvec.nr == 0) {
ret = 0;
break;
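
(Aside: for readers less familiar with swap entries, here is the shape of
the filter this patch restores, as a self-contained user-space sketch.
The bit layout and all the toy_* names are invented for the example - the
kernel's real swp_type()/swp_offset() encoding differs - but the check is
the same: skip any entry that belongs to a swapfile other than the one
being removed.)

        #include <stdint.h>
        #include <stdio.h>

        /* Toy (type, offset) packing -- not the kernel's real swp_entry layout. */
        #define TOY_TYPE_SHIFT 58
        typedef uint64_t toy_swp_entry_t;

        static unsigned int toy_swp_type(toy_swp_entry_t e)
        {
                return (unsigned int)(e >> TOY_TYPE_SHIFT);
        }

        static uint64_t toy_swp_offset(toy_swp_entry_t e)
        {
                return e & ((UINT64_C(1) << TOY_TYPE_SHIFT) - 1);
        }

        static toy_swp_entry_t toy_swp_entry(unsigned int type, uint64_t offset)
        {
                return ((toy_swp_entry_t)type << TOY_TYPE_SHIFT) | offset;
        }

        int main(void)
        {
                /* Entries found in one shmem mapping: two swapfiles interleaved. */
                toy_swp_entry_t found[] = {
                        toy_swp_entry(0, 10), toy_swp_entry(1, 20), toy_swp_entry(0, 30),
                };
                unsigned int type = 1;          /* the swapfile being swapped off */
                unsigned int i;

                for (i = 0; i < sizeof(found) / sizeof(found[0]); i++) {
                        if (toy_swp_type(found[i]) != type)
                                continue;       /* leave other swapfiles' swap alone */
                        printf("would unuse offset %llu of type %u\n",
                               (unsigned long long)toy_swp_offset(found[i]), type);
                }
                return 0;
        }
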
SWAP_UNUSE_MAX_TRIES 3 appeared to work well in earlier testing, but
further testing has proved it to be a source of unnecessary swapoff
EBUSY failures (which can then be followed by unmount EBUSY failures).
When mmget_not_zero() or shmem's igrab() fails, an mm is exiting or an
inode is being evicted, freeing up swap independently of try_to_unuse().
Those typically completed much sooner than the old quadratic swapoff,
but now it is more common that swapoff has to wait for them.
It would be possible to move those cases from init_mm.mmlist and
shmem_swaplist to separate "exiting" swaplists, and have try_to_unuse()
wait for those lists to be emptied; but we've not bothered with that in
the past, and don't want to risk missing some other forgotten case. So
just revert to cycling around until the swap is gone, without any retry
limit.
Fixes: b56a2d8af914 ("mm: rid swapoff of quadratic complexity")
Signed-off-by: Hugh Dickins <[email protected]>
---
mm/swapfile.c | 16 ++++++++--------
1 file changed, 8 insertions(+), 8 deletions(-)
--- 5.1-rc4/mm/swapfile.c 2019-03-17 16:18:15.713823942 -0700
+++ linux/mm/swapfile.c 2019-04-07 19:15:01.269054187 -0700
@@ -2023,7 +2023,6 @@ static unsigned int find_next_to_unuse(s
* If the boolean frontswap is true, only unuse pages_to_unuse pages;
* pages_to_unuse==0 means all pages; ignored if frontswap is false
*/
-#define SWAP_UNUSE_MAX_TRIES 3
int try_to_unuse(unsigned int type, bool frontswap,
unsigned long pages_to_unuse)
{
@@ -2035,7 +2034,6 @@ int try_to_unuse(unsigned int type, bool
struct page *page;
swp_entry_t entry;
unsigned int i;
- int retries = 0;
if (!si->inuse_pages)
return 0;
@@ -2117,14 +2115,16 @@ retry:
* If yes, we would need to do retry the unuse logic again.
* Under global memory pressure, swap entries can be reinserted back
* into process space after the mmlist loop above passes over them.
- * Its not worth continuosuly retrying to unuse the swap in this case.
- * So we try SWAP_UNUSE_MAX_TRIES times.
+ *
+ * Limit the number of retries? No: when shmem_unuse()'s igrab() fails,
+ * a shmem inode using swap is being evicted; and when mmget_not_zero()
+ * above fails, that mm is likely to be freeing swap from exit_mmap().
+ * Both proceed at their own independent pace: we could move them to
+ * separate lists, and wait for those lists to be emptied; but it's
+ * easier and more robust (though cpu-intensive) just to keep retrying.
*/
- if (++retries >= SWAP_UNUSE_MAX_TRIES)
- retval = -EBUSY;
- else if (si->inuse_pages)
+ if (si->inuse_pages)
goto retry;
-
out:
return (retval == FRONTSWAP_PAGES_UNUSED) ? 0 : retval;
}
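
(Aside: a toy user-space model, with pthreads, of why dropping the retry
limit is safe: the swap that try_to_unuse() cannot free itself is being
freed by an exiting task at its own pace, so cycling until inuse_pages
reaches zero always terminates. All names and numbers below are invented
for the sketch.)

        #include <pthread.h>
        #include <stdatomic.h>
        #include <stdio.h>
        #include <unistd.h>

        static atomic_int inuse_pages;          /* swap still in use, like si->inuse_pages */
        static atomic_int exiter_pages = 500;   /* portion an exiting task frees itself */
        static int swapoff_pages = 500;         /* portion the swapoff loop can reach */

        static void *exiting_task(void *arg)
        {
                (void)arg;
                /* Frees its own swap at its own pace, like exit_mmap() would. */
                while (atomic_fetch_sub(&exiter_pages, 1) > 0)
                        atomic_fetch_sub(&inuse_pages, 1);
                return NULL;
        }

        int main(void)
        {
                pthread_t t;

                atomic_store(&inuse_pages, swapoff_pages + atomic_load(&exiter_pages));
                pthread_create(&t, NULL, exiting_task, NULL);

                /* try_to_unuse()-like loop: no retry limit, no -EBUSY; just keep
                 * cycling until everything is gone. */
                while (atomic_load(&inuse_pages) > 0) {
                        if (swapoff_pages > 0) {
                                swapoff_pages--;        /* free what we can reach */
                                atomic_fetch_sub(&inuse_pages, 1);
                        } else {
                                usleep(1000);           /* the rest drains independently */
                        }
                }

                pthread_join(t, NULL);
                printf("inuse_pages drained to 0 without a retry limit\n");
                return 0;
        }
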
The old try_to_unuse() implementation was driven by find_next_to_unuse(),
which terminated as soon as all the swap had been freed. Add inuse_pages
checks now (alongside signal_pending()) to stop scanning mms and swap_map
once finished. The same ought to be done in shmem_unuse() too, but never
was before, and needs a different interface: so leave it as is for now.
Fixes: b56a2d8af914 ("mm: rid swapoff of quadratic complexity")
Signed-off-by: Hugh Dickins <[email protected]>
---
mm/swapfile.c | 19 +++++++++++--------
1 file changed, 11 insertions(+), 8 deletions(-)
--- 5.1-rc4/mm/swapfile.c 2019-04-07 19:15:01.269054187 -0700
+++ linux/mm/swapfile.c 2019-04-07 19:17:13.291957539 -0700
@@ -2051,11 +2051,9 @@ retry:
spin_lock(&mmlist_lock);
p = &init_mm.mmlist;
- while ((p = p->next) != &init_mm.mmlist) {
- if (signal_pending(current)) {
- retval = -EINTR;
- break;
- }
+ while (si->inuse_pages &&
+ !signal_pending(current) &&
+ (p = p->next) != &init_mm.mmlist) {
mm = list_entry(p, struct mm_struct, mmlist);
if (!mmget_not_zero(mm))
@@ -2082,7 +2080,9 @@ retry:
mmput(prev_mm);
i = 0;
- while ((i = find_next_to_unuse(si, i, frontswap)) != 0) {
+ while (si->inuse_pages &&
+ !signal_pending(current) &&
+ (i = find_next_to_unuse(si, i, frontswap)) != 0) {
entry = swp_entry(type, i);
page = find_get_page(swap_address_space(entry), i);
@@ -2123,8 +2123,11 @@ retry:
* separate lists, and wait for those lists to be emptied; but it's
* easier and more robust (though cpu-intensive) just to keep retrying.
*/
- if (si->inuse_pages)
- goto retry;
+ if (si->inuse_pages) {
+ if (!signal_pending(current))
+ goto retry;
+ retval = -EINTR;
+ }
out:
return (retval == FRONTSWAP_PAGES_UNUSED) ? 0 : retval;
}
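
(Aside: the reworked loop shape in miniature, as a user-space sketch with
stand-in names: check "is anything still in use?" and "has a signal
arrived?" in the loop condition itself, so the scan stops as soon as the
work is done rather than only after a full pass.)

        #include <signal.h>
        #include <stdio.h>

        static volatile sig_atomic_t sig_pending;

        static void on_sigint(int sig)
        {
                (void)sig;
                sig_pending = 1;
        }

        /* Stand-in for find_next_to_unuse(): walks slots 1..nslots, then returns 0. */
        static int next_slot(int i, int nslots)
        {
                return (i + 1) % (nslots + 1);
        }

        int main(void)
        {
                int inuse_pages = 5;    /* stand-in for si->inuse_pages */
                int i = 0;

                signal(SIGINT, on_sigint);

                /* Re-check completion and signals on every pass, so the loop
                 * ends as soon as the count drains or the user interrupts. */
                while (inuse_pages && !sig_pending && (i = next_slot(i, 8)) != 0) {
                        printf("unusing slot %d, %d still in use\n", i, inuse_pages);
                        inuse_pages--;
                }

                if (sig_pending)
                        printf("interrupted: would return -EINTR\n");
                else
                        printf("stopped with inuse_pages == %d\n", inuse_pages);
                return 0;
        }
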
The igrab() in shmem_unuse() looks good, but we forgot that it gives no
protection against concurrent unmounting: a point made by Konstantin
Khlebnikov eight years ago, and then fixed in 2.6.39 by 778dd893ae78
("tmpfs: fix race between umount and swapoff"). The current 5.1-rc
swapoff is liable to hit "VFS: Busy inodes after unmount of tmpfs.
Self-destruct in 5 seconds. Have a nice day..." followed by GPF.
Once again, give up on using igrab(); but don't go back to making such
heavy-handed use of shmem_swaplist_mutex as last time: that would spoil
the new design, and I expect could deadlock inside shmem_swapin_page().
Instead, have shmem_unuse() just raise a "stop_eviction" count in the
shmem-specific inode, and have shmem_evict_inode() wait for that to go
down to 0. Call it "stop_eviction" rather than "swapoff_busy" because
it can be put to other uses later (huge tmpfs patches expect to use it).
That simplifies shmem_unuse(), protecting it from both unlink and unmount;
and in practice lets it locate all the swap in its first try. But do not
rely on that: there's still a theoretical case, when shmem_writepage()
might have been preempted after its get_swap_page(), before making the
swap entry visible to swapoff.
Fixes: b56a2d8af914 ("mm: rid swapoff of quadratic complexity")
Signed-off-by: Hugh Dickins <[email protected]>
---
include/linux/shmem_fs.h | 1 +
mm/shmem.c | 39 ++++++++++++++++++---------------------
mm/swapfile.c | 11 +++++------
3 files changed, 24 insertions(+), 27 deletions(-)
--- 5.1-rc4/include/linux/shmem_fs.h 2019-03-17 16:18:15.181820820 -0700
+++ linux/include/linux/shmem_fs.h 2019-04-07 19:18:43.248639711 -0700
@@ -21,6 +21,7 @@ struct shmem_inode_info {
struct list_head swaplist; /* chain of maybes on swap */
struct shared_policy policy; /* NUMA memory alloc policy */
struct simple_xattrs xattrs; /* list of xattrs */
+ atomic_t stop_eviction; /* hold when working on inode */
struct inode vfs_inode;
};
--- 5.1-rc4/mm/shmem.c 2019-04-07 19:12:23.603858531 -0700
+++ linux/mm/shmem.c 2019-04-07 19:18:43.248639711 -0700
@@ -1081,9 +1081,15 @@ static void shmem_evict_inode(struct ino
}
spin_unlock(&sbinfo->shrinklist_lock);
}
- if (!list_empty(&info->swaplist)) {
+ while (!list_empty(&info->swaplist)) {
+ /* Wait while shmem_unuse() is scanning this inode... */
+ wait_var_event(&info->stop_eviction,
+ !atomic_read(&info->stop_eviction));
mutex_lock(&shmem_swaplist_mutex);
list_del_init(&info->swaplist);
+ /* ...but beware of the race if we peeked too early */
+ if (!atomic_read(&info->stop_eviction))
+ list_del_init(&info->swaplist);
mutex_unlock(&shmem_swaplist_mutex);
}
}
@@ -1227,36 +1233,27 @@ int shmem_unuse(unsigned int type, bool
unsigned long *fs_pages_to_unuse)
{
struct shmem_inode_info *info, *next;
- struct inode *inode;
- struct inode *prev_inode = NULL;
int error = 0;
if (list_empty(&shmem_swaplist))
return 0;
mutex_lock(&shmem_swaplist_mutex);
-
- /*
- * The extra refcount on the inode is necessary to safely dereference
- * p->next after re-acquiring the lock. New shmem inodes with swap
- * get added to the end of the list and we will scan them all.
- */
list_for_each_entry_safe(info, next, &shmem_swaplist, swaplist) {
if (!info->swapped) {
list_del_init(&info->swaplist);
continue;
}
-
- inode = igrab(&info->vfs_inode);
- if (!inode)
- continue;
-
+ /*
+ * Drop the swaplist mutex while searching the inode for swap;
+ * but before doing so, make sure shmem_evict_inode() will not
+ * remove placeholder inode from swaplist, nor let it be freed
+ * (igrab() would protect from unlink, but not from unmount).
+ */
+ atomic_inc(&info->stop_eviction);
mutex_unlock(&shmem_swaplist_mutex);
- if (prev_inode)
- iput(prev_inode);
- prev_inode = inode;
- error = shmem_unuse_inode(inode, type, frontswap,
+ error = shmem_unuse_inode(&info->vfs_inode, type, frontswap,
fs_pages_to_unuse);
cond_resched();
@@ -1264,14 +1261,13 @@ int shmem_unuse(unsigned int type, bool
next = list_next_entry(info, swaplist);
if (!info->swapped)
list_del_init(&info->swaplist);
+ if (atomic_dec_and_test(&info->stop_eviction))
+ wake_up_var(&info->stop_eviction);
if (error)
break;
}
mutex_unlock(&shmem_swaplist_mutex);
- if (prev_inode)
- iput(prev_inode);
-
return error;
}
@@ -2238,6 +2234,7 @@ static struct inode *shmem_get_inode(str
info = SHMEM_I(inode);
memset(info, 0, (char *)inode - (char *)info);
spin_lock_init(&info->lock);
+ atomic_set(&info->stop_eviction, 0);
info->seals = F_SEAL_SEAL;
info->flags = flags & VM_NORESERVE;
INIT_LIST_HEAD(&info->shrinklist);
--- 5.1-rc4/mm/swapfile.c 2019-04-07 19:17:13.291957539 -0700
+++ linux/mm/swapfile.c 2019-04-07 19:18:43.248639711 -0700
@@ -2116,12 +2116,11 @@ retry:
* Under global memory pressure, swap entries can be reinserted back
* into process space after the mmlist loop above passes over them.
*
- * Limit the number of retries? No: when shmem_unuse()'s igrab() fails,
- * a shmem inode using swap is being evicted; and when mmget_not_zero()
- * above fails, that mm is likely to be freeing swap from exit_mmap().
- * Both proceed at their own independent pace: we could move them to
- * separate lists, and wait for those lists to be emptied; but it's
- * easier and more robust (though cpu-intensive) just to keep retrying.
+ * Limit the number of retries? No: when mmget_not_zero() above fails,
+ * that mm is likely to be freeing swap from exit_mmap(), which proceeds
+ * at its own independent pace; and even shmem_writepage() could have
+ * been preempted after get_swap_page(), temporarily hiding that swap.
+ * It's easy and robust (though cpu-intensive) just to keep retrying.
*/
if (si->inuse_pages) {
if (!signal_pending(current))
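
(Aside: a user-space model of the stop_eviction idea, using pthreads and
C11 atomics to stand in for wait_var_event()/wake_up_var(). It is
deliberately simplified - in particular, the real patch also re-checks the
count under shmem_swaplist_mutex to catch the case where eviction peeked
too early - and every name below is invented for the sketch.)

        #include <pthread.h>
        #include <stdatomic.h>
        #include <stdio.h>
        #include <unistd.h>

        struct toy_inode {
                atomic_int stop_eviction;       /* hold while working on the inode */
                pthread_mutex_t lock;
                pthread_cond_t cond;
                int on_swaplist;
        };

        static struct toy_inode inode = {
                .lock = PTHREAD_MUTEX_INITIALIZER,
                .cond = PTHREAD_COND_INITIALIZER,
                .on_swaplist = 1,
        };

        static void *swapoff_thread(void *arg)
        {
                (void)arg;
                atomic_fetch_add(&inode.stop_eviction, 1);      /* like shmem_unuse() */
                usleep(2000);                   /* ... scanning the inode for swap ... */
                if (atomic_fetch_sub(&inode.stop_eviction, 1) == 1) {
                        pthread_mutex_lock(&inode.lock);        /* like wake_up_var() */
                        pthread_cond_broadcast(&inode.cond);
                        pthread_mutex_unlock(&inode.lock);
                }
                return NULL;
        }

        int main(void)
        {
                pthread_t t;

                pthread_create(&t, NULL, swapoff_thread, NULL);
                usleep(1000);

                /* Like shmem_evict_inode(): wait out any scanner before unlinking. */
                pthread_mutex_lock(&inode.lock);
                while (atomic_load(&inode.stop_eviction))       /* like wait_var_event() */
                        pthread_cond_wait(&inode.cond, &inode.lock);
                inode.on_swaplist = 0;          /* safe: no scanner is using this inode */
                pthread_mutex_unlock(&inode.lock);

                pthread_join(t, NULL);
                printf("evicted: on_swaplist=%d, stop_eviction=%d\n",
                       inode.on_swaplist, atomic_load(&inode.stop_eviction));
                return 0;
        }
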
On 08.04.2019 23:01, Hugh Dickins wrote:
> [...]
> @@ -1081,9 +1081,15 @@ static void shmem_evict_inode(struct ino
> }
> spin_unlock(&sbinfo->shrinklist_lock);
> }
> - if (!list_empty(&info->swaplist)) {
> + while (!list_empty(&info->swaplist)) {
> + /* Wait while shmem_unuse() is scanning this inode... */
> + wait_var_event(&info->stop_eviction,
> + !atomic_read(&info->stop_eviction));
> mutex_lock(&shmem_swaplist_mutex);
> list_del_init(&info->swaplist);
Obviously, the line above should be deleted.
> + /* ...but beware of the race if we peeked too early */
> + if (!atomic_read(&info->stop_eviction))
> + list_del_init(&info->swaplist);
> mutex_unlock(&shmem_swaplist_mutex);
> }
> }
> [...]
On Tue, 9 Apr 2019, Konstantin Khlebnikov wrote:
> On 08.04.2019 23:01, Hugh Dickins wrote:
> > - if (!list_empty(&info->swaplist)) {
> > + while (!list_empty(&info->swaplist)) {
> > + /* Wait while shmem_unuse() is scanning this inode... */
> > + wait_var_event(&info->stop_eviction,
> > + !atomic_read(&info->stop_eviction));
> > mutex_lock(&shmem_swaplist_mutex);
> > list_del_init(&info->swaplist);
>
> Obviously, the line above should be deleted.
Definitely. Worryingly stupid. I guess I left it behind while translating
from an earlier tree. Many thanks for catching that in time, Konstantin.
I've rechecked the rest of this patch, and the others, and didn't find
anything else as stupid.
Andrew, please add this fixup for folding in - thanks:
[PATCH] mm: swapoff: shmem_unuse() stop eviction without igrab() fix
Fix my stupidity, thankfully caught by Konstantin.
Signed-off-by: Hugh Dickins <[email protected]>
---
Fix to fold into mm-swapoff-shmem_unuse-stop-eviction-without-igrab.patch
mm/shmem.c | 1 -
1 file changed, 1 deletion(-)
--- patch4/mm/shmem.c 2019-04-07 19:18:43.248639711 -0700
+++ patch5/mm/shmem.c 2019-04-09 11:24:32.745337734 -0700
@@ -1086,7 +1086,6 @@ static void shmem_evict_inode(struct ino
wait_var_event(&info->stop_eviction,
!atomic_read(&info->stop_eviction));
mutex_lock(&shmem_swaplist_mutex);
- list_del_init(&info->swaplist);
/* ...but beware of the race if we peeked too early */
if (!atomic_read(&info->stop_eviction))
list_del_init(&info->swaplist);