2020-05-22 05:58:55

by Hugh Dickins

[permalink] [raw]
Subject: [PATCH mmotm] mm/swap: fix livelock in __read_swap_cache_async()

I've only seen this livelock on one machine (repeatably, but not to
order), and not fully analyzed it - two processes seen looping around
getting -EEXIST from swapcache_prepare(), I guess a third (at lower
priority? but wanting the same cpu as one of the loopers? preemption
or cond_resched() not enough to let it back in?) set SWAP_HAS_CACHE,
then went off into direct reclaim, scheduled away, and somehow could
not get back to add the page to swap cache and let them all complete.

Restore the page allocation in __read_swap_cache_async() to before
the swapcache_prepare() call: "mm: memcontrol: charge swapin pages
on instantiation" moved it outside the loop, which indeed looks much
nicer, but exposed this weakness. We used to allocate new_page once
and then keep it across all iterations of the loop: but I think that
just optimizes for a rare case, and complicates the flow, so go with
the new simpler structure, with allocate+free each time around (which
is more considerate use of the memory too).

Fix the comment on the looping case, which has long been inaccurate:
it's not a racing get_swap_page() that's the problem here.

Fix the add_to_swap_cache() and mem_cgroup_charge() error recovery:
not swap_free(), but put_swap_page() to undo SWAP_HAS_CACHE, as was
done before; but delete_from_swap_cache() already includes it.

And one more nit: I don't think it makes any difference in practice,
but remove the "& GFP_KERNEL" mask from the mem_cgroup_charge() call:
add_to_swap_cache() needs that, to convert gfp_mask from user and page
cache allocation (e.g. highmem) to radix node allocation (lowmem), but
we don't need or usually apply that mask when charging mem_cgroup.

Signed-off-by: Hugh Dickins <[email protected]>
---
Mostly fixing mm-memcontrol-charge-swapin-pages-on-instantiation.patch
but now I see that mm-memcontrol-delete-unused-lrucare-handling.patch
made a further change here (took an arg off the mem_cgroup_charge call):
as is, this patch is diffed to go on top of both of them, and better
that I get it out now for Johannes look at; but could be rediffed for
folding into blah-instantiation.patch later.

Earlier in the day I promised two patches to __read_swap_cache_async(),
but find now that I cannot quite justify the second patch: it makes a
slight adjustment in swapcache_prepare(), then removes the redundant
__swp_swapcount() && swap_slot_cache_enabled business from blah_async().

I'd still like to do that, but this patch here brings back the
alloc_page_vma() in between them, and I don't have any evidence to
reassure us that I'm not then pessimizing a readahead case by doing
unnecessary allocation and free. Leave it for some other time perhaps.

mm/swap_state.c | 52 +++++++++++++++++++++++++---------------------
1 file changed, 29 insertions(+), 23 deletions(-)

--- 5.7-rc6-mm1/mm/swap_state.c 2020-05-20 12:21:56.149694170 -0700
+++ linux/mm/swap_state.c 2020-05-21 20:17:50.188773901 -0700
@@ -392,56 +392,62 @@ struct page *__read_swap_cache_async(swp
return NULL;

/*
+ * Get a new page to read into from swap. Allocate it now,
+ * before marking swap_map SWAP_HAS_CACHE, when -EEXIST will
+ * cause any racers to loop around until we add it to cache.
+ */
+ page = alloc_page_vma(gfp_mask, vma, addr);
+ if (!page)
+ return NULL;
+
+ /*
* Swap entry may have been freed since our caller observed it.
*/
err = swapcache_prepare(entry);
if (!err)
break;

- if (err == -EEXIST) {
- /*
- * We might race against get_swap_page() and stumble
- * across a SWAP_HAS_CACHE swap_map entry whose page
- * has not been brought into the swapcache yet.
- */
- cond_resched();
- continue;
- }
+ put_page(page);
+ if (err != -EEXIST)
+ return NULL;

- return NULL;
+ /*
+ * We might race against __delete_from_swap_cache(), and
+ * stumble across a swap_map entry whose SWAP_HAS_CACHE
+ * has not yet been cleared. Or race against another
+ * __read_swap_cache_async(), which has set SWAP_HAS_CACHE
+ * in swap_map, but not yet added its page to swap cache.
+ */
+ cond_resched();
}

/*
- * The swap entry is ours to swap in. Prepare a new page.
+ * The swap entry is ours to swap in. Prepare the new page.
*/

- page = alloc_page_vma(gfp_mask, vma, addr);
- if (!page)
- goto fail_free;
-
__SetPageLocked(page);
__SetPageSwapBacked(page);

/* May fail (-ENOMEM) if XArray node allocation failed. */
- if (add_to_swap_cache(page, entry, gfp_mask & GFP_KERNEL))
+ if (add_to_swap_cache(page, entry, gfp_mask & GFP_KERNEL)) {
+ put_swap_page(page, entry);
goto fail_unlock;
+ }

- if (mem_cgroup_charge(page, NULL, gfp_mask & GFP_KERNEL))
- goto fail_delete;
+ if (mem_cgroup_charge(page, NULL, gfp_mask)) {
+ delete_from_swap_cache(page);
+ goto fail_unlock;
+ }

- /* Initiate read into locked page */
+ /* Caller will initiate read into locked page */
SetPageWorkingset(page);
lru_cache_add_anon(page);
*new_page_allocated = true;
return page;

-fail_delete:
- delete_from_swap_cache(page);
fail_unlock:
unlock_page(page);
put_page(page);
-fail_free:
- swap_free(entry);
return NULL;
}


2020-05-22 17:11:16

by Rafael Aquini

[permalink] [raw]
Subject: Re: [PATCH mmotm] mm/swap: fix livelock in __read_swap_cache_async()

On Thu, May 21, 2020 at 10:56:20PM -0700, Hugh Dickins wrote:
> I've only seen this livelock on one machine (repeatably, but not to
> order), and not fully analyzed it - two processes seen looping around
> getting -EEXIST from swapcache_prepare(), I guess a third (at lower
> priority? but wanting the same cpu as one of the loopers? preemption
> or cond_resched() not enough to let it back in?) set SWAP_HAS_CACHE,
> then went off into direct reclaim, scheduled away, and somehow could
> not get back to add the page to swap cache and let them all complete.
>
> Restore the page allocation in __read_swap_cache_async() to before
> the swapcache_prepare() call: "mm: memcontrol: charge swapin pages
> on instantiation" moved it outside the loop, which indeed looks much
> nicer, but exposed this weakness. We used to allocate new_page once
> and then keep it across all iterations of the loop: but I think that
> just optimizes for a rare case, and complicates the flow, so go with
> the new simpler structure, with allocate+free each time around (which
> is more considerate use of the memory too).
>
> Fix the comment on the looping case, which has long been inaccurate:
> it's not a racing get_swap_page() that's the problem here.
>
> Fix the add_to_swap_cache() and mem_cgroup_charge() error recovery:
> not swap_free(), but put_swap_page() to undo SWAP_HAS_CACHE, as was
> done before; but delete_from_swap_cache() already includes it.
>
> And one more nit: I don't think it makes any difference in practice,
> but remove the "& GFP_KERNEL" mask from the mem_cgroup_charge() call:
> add_to_swap_cache() needs that, to convert gfp_mask from user and page
> cache allocation (e.g. highmem) to radix node allocation (lowmem), but
> we don't need or usually apply that mask when charging mem_cgroup.
>
> Signed-off-by: Hugh Dickins <[email protected]>
> ---
> Mostly fixing mm-memcontrol-charge-swapin-pages-on-instantiation.patch
> but now I see that mm-memcontrol-delete-unused-lrucare-handling.patch
> made a further change here (took an arg off the mem_cgroup_charge call):
> as is, this patch is diffed to go on top of both of them, and better
> that I get it out now for Johannes look at; but could be rediffed for
> folding into blah-instantiation.patch later.
>
> Earlier in the day I promised two patches to __read_swap_cache_async(),
> but find now that I cannot quite justify the second patch: it makes a
> slight adjustment in swapcache_prepare(), then removes the redundant
> __swp_swapcount() && swap_slot_cache_enabled business from blah_async().
>
> I'd still like to do that, but this patch here brings back the
> alloc_page_vma() in between them, and I don't have any evidence to
> reassure us that I'm not then pessimizing a readahead case by doing
> unnecessary allocation and free. Leave it for some other time perhaps.
>
> mm/swap_state.c | 52 +++++++++++++++++++++++++---------------------
> 1 file changed, 29 insertions(+), 23 deletions(-)
>
> --- 5.7-rc6-mm1/mm/swap_state.c 2020-05-20 12:21:56.149694170 -0700
> +++ linux/mm/swap_state.c 2020-05-21 20:17:50.188773901 -0700
> @@ -392,56 +392,62 @@ struct page *__read_swap_cache_async(swp
> return NULL;
>
> /*
> + * Get a new page to read into from swap. Allocate it now,
> + * before marking swap_map SWAP_HAS_CACHE, when -EEXIST will
> + * cause any racers to loop around until we add it to cache.
> + */
> + page = alloc_page_vma(gfp_mask, vma, addr);
> + if (!page)
> + return NULL;
> +
> + /*
> * Swap entry may have been freed since our caller observed it.
> */
> err = swapcache_prepare(entry);
> if (!err)
> break;
>
> - if (err == -EEXIST) {
> - /*
> - * We might race against get_swap_page() and stumble
> - * across a SWAP_HAS_CACHE swap_map entry whose page
> - * has not been brought into the swapcache yet.
> - */
> - cond_resched();
> - continue;
> - }
> + put_page(page);
> + if (err != -EEXIST)
> + return NULL;
>
> - return NULL;
> + /*
> + * We might race against __delete_from_swap_cache(), and
> + * stumble across a swap_map entry whose SWAP_HAS_CACHE
> + * has not yet been cleared. Or race against another
> + * __read_swap_cache_async(), which has set SWAP_HAS_CACHE
> + * in swap_map, but not yet added its page to swap cache.
> + */
> + cond_resched();
> }
>
> /*
> - * The swap entry is ours to swap in. Prepare a new page.
> + * The swap entry is ours to swap in. Prepare the new page.
> */
>
> - page = alloc_page_vma(gfp_mask, vma, addr);
> - if (!page)
> - goto fail_free;
> -
> __SetPageLocked(page);
> __SetPageSwapBacked(page);
>
> /* May fail (-ENOMEM) if XArray node allocation failed. */
> - if (add_to_swap_cache(page, entry, gfp_mask & GFP_KERNEL))
> + if (add_to_swap_cache(page, entry, gfp_mask & GFP_KERNEL)) {
> + put_swap_page(page, entry);
> goto fail_unlock;
> + }
>
> - if (mem_cgroup_charge(page, NULL, gfp_mask & GFP_KERNEL))
> - goto fail_delete;
> + if (mem_cgroup_charge(page, NULL, gfp_mask)) {
> + delete_from_swap_cache(page);
> + goto fail_unlock;
> + }
>
> - /* Initiate read into locked page */
> + /* Caller will initiate read into locked page */
> SetPageWorkingset(page);
> lru_cache_add_anon(page);
> *new_page_allocated = true;
> return page;
>
> -fail_delete:
> - delete_from_swap_cache(page);
> fail_unlock:
> unlock_page(page);
> put_page(page);
> -fail_free:
> - swap_free(entry);
> return NULL;
> }
>
>
Acked-by: Rafael Aquini <[email protected]>

2020-05-23 00:26:42

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH mmotm] mm/swap: fix livelock in __read_swap_cache_async()

On Thu, 21 May 2020 22:56:20 -0700 (PDT) Hugh Dickins <[email protected]> wrote:

> I've only seen this livelock on one machine (repeatably, but not to
> order), and not fully analyzed it - two processes seen looping around
> getting -EEXIST from swapcache_prepare(), I guess a third (at lower
> priority? but wanting the same cpu as one of the loopers? preemption
> or cond_resched() not enough to let it back in?) set SWAP_HAS_CACHE,
> then went off into direct reclaim, scheduled away, and somehow could
> not get back to add the page to swap cache and let them all complete.
>
> Restore the page allocation in __read_swap_cache_async() to before
> the swapcache_prepare() call: "mm: memcontrol: charge swapin pages
> on instantiation" moved it outside the loop, which indeed looks much
> nicer, but exposed this weakness. We used to allocate new_page once
> and then keep it across all iterations of the loop: but I think that
> just optimizes for a rare case, and complicates the flow, so go with
> the new simpler structure, with allocate+free each time around (which
> is more considerate use of the memory too).
>
> Fix the comment on the looping case, which has long been inaccurate:
> it's not a racing get_swap_page() that's the problem here.
>
> Fix the add_to_swap_cache() and mem_cgroup_charge() error recovery:
> not swap_free(), but put_swap_page() to undo SWAP_HAS_CACHE, as was
> done before; but delete_from_swap_cache() already includes it.
>
> And one more nit: I don't think it makes any difference in practice,
> but remove the "& GFP_KERNEL" mask from the mem_cgroup_charge() call:
> add_to_swap_cache() needs that, to convert gfp_mask from user and page
> cache allocation (e.g. highmem) to radix node allocation (lowmem), but
> we don't need or usually apply that mask when charging mem_cgroup.
>
> Signed-off-by: Hugh Dickins <[email protected]>
> ---
> Mostly fixing mm-memcontrol-charge-swapin-pages-on-instantiation.patch
> but now I see that mm-memcontrol-delete-unused-lrucare-handling.patch
> made a further change here (took an arg off the mem_cgroup_charge call):
> as is, this patch is diffed to go on top of both of them, and better
> that I get it out now for Johannes look at; but could be rediffed for
> folding into blah-instantiation.patch later.

Thanks - I did the necessary jiggery-pokery to get this into the right
place.

2020-05-26 15:48:03

by Johannes Weiner

[permalink] [raw]
Subject: Re: [PATCH mmotm] mm/swap: fix livelock in __read_swap_cache_async()

On Thu, May 21, 2020 at 10:56:20PM -0700, Hugh Dickins wrote:
> I've only seen this livelock on one machine (repeatably, but not to
> order), and not fully analyzed it - two processes seen looping around
> getting -EEXIST from swapcache_prepare(), I guess a third (at lower
> priority? but wanting the same cpu as one of the loopers? preemption
> or cond_resched() not enough to let it back in?) set SWAP_HAS_CACHE,
> then went off into direct reclaim, scheduled away, and somehow could
> not get back to add the page to swap cache and let them all complete.
>
> Restore the page allocation in __read_swap_cache_async() to before
> the swapcache_prepare() call: "mm: memcontrol: charge swapin pages
> on instantiation" moved it outside the loop, which indeed looks much
> nicer, but exposed this weakness. We used to allocate new_page once
> and then keep it across all iterations of the loop: but I think that
> just optimizes for a rare case, and complicates the flow, so go with
> the new simpler structure, with allocate+free each time around (which
> is more considerate use of the memory too).
>
> Fix the comment on the looping case, which has long been inaccurate:
> it's not a racing get_swap_page() that's the problem here.
>
> Fix the add_to_swap_cache() and mem_cgroup_charge() error recovery:
> not swap_free(), but put_swap_page() to undo SWAP_HAS_CACHE, as was
> done before; but delete_from_swap_cache() already includes it.
>
> And one more nit: I don't think it makes any difference in practice,
> but remove the "& GFP_KERNEL" mask from the mem_cgroup_charge() call:
> add_to_swap_cache() needs that, to convert gfp_mask from user and page
> cache allocation (e.g. highmem) to radix node allocation (lowmem), but
> we don't need or usually apply that mask when charging mem_cgroup.
>
> Signed-off-by: Hugh Dickins <[email protected]>
> ---

Acked-by: Johannes Weiner <[email protected]>

> Mostly fixing mm-memcontrol-charge-swapin-pages-on-instantiation.patch
> but now I see that mm-memcontrol-delete-unused-lrucare-handling.patch
> made a further change here (took an arg off the mem_cgroup_charge call):
> as is, this patch is diffed to go on top of both of them, and better
> that I get it out now for Johannes look at; but could be rediffed for
> folding into blah-instantiation.patch later.

IMO it's worth having as a separate change. Joonsoo was concerned
about the ordering but I didn't see it. Having this sequence of
changes on record would be good for later reference.

2020-05-27 21:59:49

by Hugh Dickins

[permalink] [raw]
Subject: Re: [PATCH mmotm] mm/swap: fix livelock in __read_swap_cache_async()

On Tue, 26 May 2020, Johannes Weiner wrote:
> On Thu, May 21, 2020 at 10:56:20PM -0700, Hugh Dickins wrote:
> > I've only seen this livelock on one machine (repeatably, but not to
> > order), and not fully analyzed it - two processes seen looping around
> > getting -EEXIST from swapcache_prepare(), I guess a third (at lower
> > priority? but wanting the same cpu as one of the loopers? preemption
> > or cond_resched() not enough to let it back in?) set SWAP_HAS_CACHE,
> > then went off into direct reclaim, scheduled away, and somehow could
> > not get back to add the page to swap cache and let them all complete.
> >
> > Restore the page allocation in __read_swap_cache_async() to before
> > the swapcache_prepare() call: "mm: memcontrol: charge swapin pages
> > on instantiation" moved it outside the loop, which indeed looks much
> > nicer, but exposed this weakness. We used to allocate new_page once
> > and then keep it across all iterations of the loop: but I think that
> > just optimizes for a rare case, and complicates the flow, so go with
> > the new simpler structure, with allocate+free each time around (which
> > is more considerate use of the memory too).
> >
> > Fix the comment on the looping case, which has long been inaccurate:
> > it's not a racing get_swap_page() that's the problem here.
> >
> > Fix the add_to_swap_cache() and mem_cgroup_charge() error recovery:
> > not swap_free(), but put_swap_page() to undo SWAP_HAS_CACHE, as was
> > done before; but delete_from_swap_cache() already includes it.
> >
> > And one more nit: I don't think it makes any difference in practice,
> > but remove the "& GFP_KERNEL" mask from the mem_cgroup_charge() call:
> > add_to_swap_cache() needs that, to convert gfp_mask from user and page
> > cache allocation (e.g. highmem) to radix node allocation (lowmem), but
> > we don't need or usually apply that mask when charging mem_cgroup.
> >
> > Signed-off-by: Hugh Dickins <[email protected]>
> > ---
>
> Acked-by: Johannes Weiner <[email protected]>
>
> > Mostly fixing mm-memcontrol-charge-swapin-pages-on-instantiation.patch
> > but now I see that mm-memcontrol-delete-unused-lrucare-handling.patch
> > made a further change here (took an arg off the mem_cgroup_charge call):
> > as is, this patch is diffed to go on top of both of them, and better
> > that I get it out now for Johannes look at; but could be rediffed for
> > folding into blah-instantiation.patch later.
>
> IMO it's worth having as a separate change. Joonsoo was concerned
> about the ordering but I didn't see it. Having this sequence of
> changes on record would be good for later reference.

Yes, there would be some value in that: whichever way Andrew prefers.

Now that the Acks are safely in (thanks guys), I will concede that
that SWAP_HAS_CACHE occasional busywait loop is not ideal - but with
this patch, no worse than it was before.

Later on I hope to come back and do something better there: it's not
immediately clear why swapcache_prepare() is important, and it would
be nicer if add_to_swap_cache() were the thing to fail with -EEXIST
(because then there's a page in the cache that others can lock to
wait on if required); but there's memories I need to dredge up
before going that way, and it may turn out to be a delusion.

Hugh