2024-01-26 09:17:12

by Chengming Zhou

Subject: [PATCH 1/2] mm/zswap: don't return LRU_SKIP if we have dropped lru lock

From: Chengming Zhou <[email protected]>

LRU_SKIP can only be returned if we have never dropped the lru lock;
otherwise we need to return LRU_RETRY to restart from the head of the
lru list.

Actually, we may need to introduce another return value, LRU_STOP, to
truly terminate the ongoing shrinking scan when we encounter a warm page
already in the swap cache. The current list_lru implementation has no
way to break out of __list_lru_walk_one early.
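
For context, a simplified (paraphrased, not verbatim) view of how
__list_lru_walk_one reacts to the callback's return value; LRU_SKIP just
continues from the next pointer cached by list_for_each_safe(), which may
already point to a freed entry once the lru lock has been dropped:

	list_for_each_safe(item, n, &l->list) {
		ret = isolate(item, l, &nlru->lock, cb_arg);
		switch (ret) {
		...
		case LRU_SKIP:
			break;	/* continue from the cached next pointer */
		case LRU_RETRY:
			/* lock was dropped, traversal is invalid: restart */
			goto restart;
		}
	}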

Fixes: b5ba474f3f51 ("zswap: shrink zswap pool based on memory pressure")
Signed-off-by: Chengming Zhou <[email protected]>
---
mm/zswap.c | 4 +---
1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/mm/zswap.c b/mm/zswap.c
index 00e90b9b5417..81cb3790e0dd 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -901,10 +901,8 @@ static enum lru_status shrink_memcg_cb(struct list_head *item, struct list_lru_o
* into the warmer region. We should terminate shrinking (if we're in the dynamic
* shrinker context).
*/
- if (writeback_result == -EEXIST && encountered_page_in_swapcache) {
- ret = LRU_SKIP;
+ if (writeback_result == -EEXIST && encountered_page_in_swapcache)
*encountered_page_in_swapcache = true;
- }

goto put_unlock;
}
--
2.40.1



2024-01-26 09:19:09

by Chengming Zhou

Subject: [PATCH 2/2] mm/zswap: fix race between lru writeback and swapoff

From: Chengming Zhou <[email protected]>

LRU writeback has a race problem with swapoff, as spotted by Yosry [1]:

CPU1                           CPU2
shrink_memcg_cb                swap_off
  list_lru_isolate               zswap_invalidate
                                   zswap_swapoff
                                     kfree(tree)
  // UAF
  spin_lock(&tree->lock)

The problem is that the entry in the lru list can't protect the tree
from being swapped off and freed, and the entry can also be invalidated
and freed concurrently after we unlock the lru lock.

We can fix it by moving the swap cache allocation ahead of any tree
reference, then checking for the invalidate race under the tree lock;
only after that can we safely deref the entry. Note that we can't deref
the entry or the tree anymore after we unlock the folio, since we depend
on the locked folio to hold off swapoff.

So this patch moves all tree and entry usage into zswap_writeback_entry();
we only use the swpentry copied onto the stack to allocate the swap cache
folio and return with the folio locked, after which we can reference the
tree. Then we check for the invalidate race under the tree lock, and the
rest is much the same as zswap_load().

Since we can't deref the entry after zswap_writeback_entry(), we can't
use zswap_lru_putback() anymore; instead we rotate the entry in the LRU
list so concurrent reclaimers have little chance to see it. It will be
deleted from the LRU list if writeback succeeds.
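
Roughly, the new ordering (condensed from the diff below, with error
paths and counters elided) is:

  shrink_memcg_cb():
	list_move_tail(item, &l->list);	/* rotate before unlocking lru lock */
	swpentry = entry->swpentry;	/* only the stack copy crosses the unlock */
	spin_unlock(lock);
	writeback_result = zswap_writeback_entry(entry, swpentry);

  zswap_writeback_entry():
	/* allocate the swap cache folio first; the locked folio holds off swapoff */
	tree = swap_zswap_tree(swpentry);
	spin_lock(&tree->lock);
	if (zswap_rb_search(&tree->rbroot, swp_offset(swpentry)) != entry)
		goto fail;		/* invalidate race: drop the folio, bail out */
	zswap_entry_get(entry);		/* now safe to deref the entry */
	spin_unlock(&tree->lock);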

Another part that confused me is the update of memcg nr_zswap_protected
in zswap_lru_putback(). I'm not sure why it's needed there, since if we
raced with swapin, memcg nr_zswap_protected has already been updated in
zswap_folio_swapin(). So that part is not included for now.

[1] https://lore.kernel.org/all/CAJD7tkasHsRnT_75-TXsEe58V9_OW6m3g6CF7Kmsvz8CKRG_EA@mail.gmail.com/

Signed-off-by: Chengming Zhou <[email protected]>
---
mm/zswap.c | 93 ++++++++++++++++++------------------------------------
1 file changed, 31 insertions(+), 62 deletions(-)

diff --git a/mm/zswap.c b/mm/zswap.c
index 81cb3790e0dd..fa2bdb7ec1d8 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -277,7 +277,7 @@ static inline struct zswap_tree *swap_zswap_tree(swp_entry_t swp)
zpool_get_type((p)->zpools[0]))

static int zswap_writeback_entry(struct zswap_entry *entry,
- struct zswap_tree *tree);
+ swp_entry_t swpentry);
static int zswap_pool_get(struct zswap_pool *pool);
static void zswap_pool_put(struct zswap_pool *pool);

@@ -445,27 +445,6 @@ static void zswap_lru_del(struct list_lru *list_lru, struct zswap_entry *entry)
rcu_read_unlock();
}

-static void zswap_lru_putback(struct list_lru *list_lru,
- struct zswap_entry *entry)
-{
- int nid = entry_to_nid(entry);
- spinlock_t *lock = &list_lru->node[nid].lock;
- struct mem_cgroup *memcg;
- struct lruvec *lruvec;
-
- rcu_read_lock();
- memcg = mem_cgroup_from_entry(entry);
- spin_lock(lock);
- /* we cannot use list_lru_add here, because it increments node's lru count */
- list_lru_putback(list_lru, &entry->lru, nid, memcg);
- spin_unlock(lock);
-
- lruvec = mem_cgroup_lruvec(memcg, NODE_DATA(entry_to_nid(entry)));
- /* increment the protection area to account for the LRU rotation. */
- atomic_long_inc(&lruvec->zswap_lruvec_state.nr_zswap_protected);
- rcu_read_unlock();
-}
-
/*********************************
* rbtree functions
**********************************/
@@ -860,40 +839,34 @@ static enum lru_status shrink_memcg_cb(struct list_head *item, struct list_lru_o
{
struct zswap_entry *entry = container_of(item, struct zswap_entry, lru);
bool *encountered_page_in_swapcache = (bool *)arg;
- struct zswap_tree *tree;
- pgoff_t swpoffset;
+ swp_entry_t swpentry;
enum lru_status ret = LRU_REMOVED_RETRY;
int writeback_result;

+ /*
+ * First rotate to the tail of lru list before unlocking lru lock,
+ * so the concurrent reclaimers have little chance to see it.
+ * It will be deleted from the lru list if writeback success.
+ */
+ list_move_tail(item, &l->list);
+
/*
* Once the lru lock is dropped, the entry might get freed. The
- * swpoffset is copied to the stack, and entry isn't deref'd again
+ * swpentry is copied to the stack, and entry isn't deref'd again
* until the entry is verified to still be alive in the tree.
*/
- swpoffset = swp_offset(entry->swpentry);
- tree = swap_zswap_tree(entry->swpentry);
- list_lru_isolate(l, item);
+ swpentry = entry->swpentry;
+
/*
* It's safe to drop the lock here because we return either
* LRU_REMOVED_RETRY or LRU_RETRY.
*/
spin_unlock(lock);

- /* Check for invalidate() race */
- spin_lock(&tree->lock);
- if (entry != zswap_rb_search(&tree->rbroot, swpoffset))
- goto unlock;
-
- /* Hold a reference to prevent a free during writeback */
- zswap_entry_get(entry);
- spin_unlock(&tree->lock);
-
- writeback_result = zswap_writeback_entry(entry, tree);
+ writeback_result = zswap_writeback_entry(entry, swpentry);

- spin_lock(&tree->lock);
if (writeback_result) {
zswap_reject_reclaim_fail++;
- zswap_lru_putback(&entry->pool->list_lru, entry);
ret = LRU_RETRY;

/*
@@ -903,27 +876,10 @@ static enum lru_status shrink_memcg_cb(struct list_head *item, struct list_lru_o
*/
if (writeback_result == -EEXIST && encountered_page_in_swapcache)
*encountered_page_in_swapcache = true;
-
- goto put_unlock;
+ } else {
+ zswap_written_back_pages++;
}
- zswap_written_back_pages++;
-
- if (entry->objcg)
- count_objcg_event(entry->objcg, ZSWPWB);
-
- count_vm_event(ZSWPWB);
- /*
- * Writeback started successfully, the page now belongs to the
- * swapcache. Drop the entry from zswap - unless invalidate already
- * took it out while we had the tree->lock released for IO.
- */
- zswap_invalidate_entry(tree, entry);

-put_unlock:
- /* Drop local reference */
- zswap_entry_put(entry);
-unlock:
- spin_unlock(&tree->lock);
spin_lock(lock);
return ret;
}
@@ -1408,9 +1364,9 @@ static void __zswap_load(struct zswap_entry *entry, struct page *page)
* freed.
*/
static int zswap_writeback_entry(struct zswap_entry *entry,
- struct zswap_tree *tree)
+ swp_entry_t swpentry)
{
- swp_entry_t swpentry = entry->swpentry;
+ struct zswap_tree *tree;
struct folio *folio;
struct mempolicy *mpol;
bool folio_was_allocated;
@@ -1442,18 +1398,31 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
* backs (our zswap_entry reference doesn't prevent that), to
* avoid overwriting a new swap folio with old compressed data.
*/
+ tree = swap_zswap_tree(swpentry);
spin_lock(&tree->lock);
- if (zswap_rb_search(&tree->rbroot, swp_offset(entry->swpentry)) != entry) {
+ if (zswap_rb_search(&tree->rbroot, swp_offset(swpentry)) != entry) {
spin_unlock(&tree->lock);
delete_from_swap_cache(folio);
folio_unlock(folio);
folio_put(folio);
return -ENOMEM;
}
+
+ /* Safe to deref entry after the entry is verified above. */
+ zswap_entry_get(entry);
spin_unlock(&tree->lock);

__zswap_load(entry, &folio->page);

+ count_vm_event(ZSWPWB);
+ if (entry->objcg)
+ count_objcg_event(entry->objcg, ZSWPWB);
+
+ spin_lock(&tree->lock);
+ zswap_invalidate_entry(tree, entry);
+ zswap_entry_put(entry);
+ spin_unlock(&tree->lock);
+
/* folio is up to date */
folio_mark_uptodate(folio);

--
2.40.1


2024-01-26 14:28:52

by Johannes Weiner

Subject: Re: [PATCH 1/2] mm/zswap: don't return LRU_SKIP if we have dropped lru lock

On Fri, Jan 26, 2024 at 08:30:14AM +0000, [email protected] wrote:
> From: Chengming Zhou <[email protected]>
>
> LRU_SKIP can only be returned if we don't ever dropped lru lock, or
> we need to return LRU_RETRY to restart from the head of lru list.

Good catch. Can you mention the possible consequences in the log?

"Otherwise, the iteration might continue from a cursor position that
was freed while the locks were dropped."?

> Actually we may need to introduce another LRU_STOP to really terminate
> the ongoing shrinking scan process, when we encounter a warm page
> already in the swap cache. The current list_lru implementation
> doesn't have this function to early break from __list_lru_walk_one.
>
> Fixes: b5ba474f3f51 ("zswap: shrink zswap pool based on memory pressure")
> Signed-off-by: Chengming Zhou <[email protected]>

Acked-by: Johannes Weiner <[email protected]>

2024-01-26 15:31:46

by Johannes Weiner

Subject: Re: [PATCH 2/2] mm/zswap: fix race between lru writeback and swapoff

On Fri, Jan 26, 2024 at 08:30:15AM +0000, [email protected] wrote:
> From: Chengming Zhou <[email protected]>
>
> LRU writeback has race problem with swapoff, as spotted by Yosry[1]:
>
> CPU1                           CPU2
> shrink_memcg_cb                swap_off
>   list_lru_isolate               zswap_invalidate
>                                    zswap_swapoff
>                                      kfree(tree)
>   // UAF
>   spin_lock(&tree->lock)
>
> The problem is that the entry in lru list can't protect the tree from
> being swapoff and freed, and the entry also can be invalidated and freed
> concurrently after we unlock the lru lock.
>
> We can fix it by moving the swap cache allocation ahead before
> referencing the tree, then check invalidate race with tree lock,
> only after that we can safely deref the entry. Note we couldn't
> deref entry or tree anymore after we unlock the folio, since we
> depend on this to hold on swapoff.

This is a great simplification on top of being a bug fix.

> So this patch moves all tree and entry usage to zswap_writeback_entry(),
> we only use the copied swpentry on the stack to allocate swap cache
> and return with folio locked, after which we can reference the tree.
> Then check invalidate race with tree lock, the following things is
> much the same like zswap_load().
>
> Since we can't deref the entry after zswap_writeback_entry(), we
> can't use zswap_lru_putback() anymore, instead we rotate the entry
> in the LRU list so concurrent reclaimers have little chance to see it.
> Or it will be deleted from LRU list if writeback success.
>
> Another confusing part to me is the update of memcg nr_zswap_protected
> in zswap_lru_putback(). I'm not sure why it's needed here since
> if we raced with swapin, memcg nr_zswap_protected has already been
> updated in zswap_folio_swapin(). So not include this part for now.

Good observation.

Technically, it could also fail on -ENOMEM, but in practice these size
allocations don't fail, especially since the shrinker runs in
PF_MEMALLOC context. The shrink_worker might be affected, since it
doesn't run in PF_MEMALLOC context. But the common case is -EEXIST,
which indeed double counts.

To make it "correct", you'd have to grab an objcg reference with the
LRU lock, and also re-order the objcg put on entry freeing after the
LRU del. This is probably not worth doing. But it could use a comment.

I was going to ask if you could reorder objcg uncharging after LRU
deletion to make it more robust for future changes in that direction.
However, staring at this, I notice this is a second UAF bug:

if (entry->objcg) {
obj_cgroup_uncharge_zswap(entry->objcg, entry->length);
obj_cgroup_put(entry->objcg);
}
if (!entry->length)
atomic_dec(&zswap_same_filled_pages);
else {
zswap_lru_del(&entry->pool->list_lru, entry);

zswap_lru_del() uses entry->objcg to determine the list_lru memcg, but
the put may have killed it. I'll send a separate patch on top.
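
The reorder would look something like this (rough illustration only, not
the actual follow-up patch):

	if (!entry->length)
		atomic_dec(&zswap_same_filled_pages);
	else {
		zswap_lru_del(&entry->pool->list_lru, entry);
		...
	}
	/* only drop the objcg reference after the LRU del above */
	if (entry->objcg) {
		obj_cgroup_uncharge_zswap(entry->objcg, entry->length);
		obj_cgroup_put(entry->objcg);
	}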

> @@ -860,40 +839,34 @@ static enum lru_status shrink_memcg_cb(struct list_head *item, struct list_lru_o
> {
> struct zswap_entry *entry = container_of(item, struct zswap_entry, lru);
> bool *encountered_page_in_swapcache = (bool *)arg;
> - struct zswap_tree *tree;
> - pgoff_t swpoffset;
> + swp_entry_t swpentry;
> enum lru_status ret = LRU_REMOVED_RETRY;
> int writeback_result;
>
> + /*
> + * First rotate to the tail of lru list before unlocking lru lock,
> + * so the concurrent reclaimers have little chance to see it.
> + * It will be deleted from the lru list if writeback success.
> + */
> + list_move_tail(item, &l->list);

We don't hold a reference to the object, so there could also be an
invalidation waiting on the LRU lock, which will free the entry even
when writeback fails.

It would also be good to expand on the motivation, because it's not
clear WHY you'd want to hide it from other reclaimers.

Lastly, maybe mention the story around temporary failures? Most
shrinkers have a lock inversion pattern (object lock -> LRU lock for
linking versus LRU lock -> object trylock during reclaim) that can
fail and require the same object be tried again before advancing.

How about this?

/*
* Rotate the entry to the tail before unlocking the LRU,
* so that in case of an invalidation race concurrent
* reclaimers don't waste their time on it.
*
* If writeback succeeds, or failure is due to the entry
* being invalidated by the swap subsystem, the invalidation
* will unlink and free it.
*
* Temporary failures, where the same entry should be tried
* again immediately, almost never happen for this shrinker.
* We don't do any trylocking; -ENOMEM comes closest,
* but that's extremely rare and doesn't happen spuriously
* either. Don't bother distinguishing this case.
*
* But since they do exist in theory, the entry cannot just
* be unlinked, or we could leak it. Hence, rotate.
*/

Otherwise, looks great to me.

Acked-by: Johannes Weiner <[email protected]>

2024-01-26 18:02:06

by Nhat Pham

Subject: Re: [PATCH 1/2] mm/zswap: don't return LRU_SKIP if we have dropped lru lock

On Fri, Jan 26, 2024 at 12:31 AM <[email protected]> wrote:
>
> From: Chengming Zhou <[email protected]>
>
> LRU_SKIP can only be returned if we don't ever dropped lru lock, or
> we need to return LRU_RETRY to restart from the head of lru list.

Ooops. You're right! I just double checked and only LRU_REMOVED_RETRY
and LRU_RETRY indicate we might have dropped the lock. My bad.

>
> Actually we may need to introduce another LRU_STOP to really terminate
> the ongoing shrinking scan process, when we encounter a warm page

Yup. This is what I was trying (and failing) to do. To be honest, this
needs to be even stronger: short-circuit ALL concurrent/ongoing zswap
shrinker scan processes that are touching this memcg (as they will
also shrink into warmer regions going forward). But that's a bit more
engineering to do. LRU_STOP, which stops this scan process, would be a
good place to start.
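
A made-up sketch of what that could look like (LRU_STOP doesn't exist in
the list_lru API today, so the name and placement below are only for
illustration):

	enum lru_status {
		LRU_REMOVED,
		LRU_REMOVED_RETRY,
		LRU_ROTATE,
		LRU_SKIP,
		LRU_RETRY,
		LRU_STOP,	/* hypothetical: stop the walk entirely */
	};

and in __list_lru_walk_one():

	case LRU_STOP:
		assert_spin_locked(&nlru->lock);
		goto out;	/* terminate the walk early */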

> already in the swap cache. The current list_lru implementation
> doesn't have this function to early break from __list_lru_walk_one.
>
> Fixes: b5ba474f3f51 ("zswap: shrink zswap pool based on memory pressure")
> Signed-off-by: Chengming Zhou <[email protected]>

Reviewed-by: Nhat Pham <[email protected]>

> ---
> mm/zswap.c | 4 +---
> 1 file changed, 1 insertion(+), 3 deletions(-)
>
> diff --git a/mm/zswap.c b/mm/zswap.c
> index 00e90b9b5417..81cb3790e0dd 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -901,10 +901,8 @@ static enum lru_status shrink_memcg_cb(struct list_head *item, struct list_lru_o
> * into the warmer region. We should terminate shrinking (if we're in the dynamic
> * shrinker context).
> */
> - if (writeback_result == -EEXIST && encountered_page_in_swapcache) {
> - ret = LRU_SKIP;
> + if (writeback_result == -EEXIST && encountered_page_in_swapcache)
> *encountered_page_in_swapcache = true;
> - }
>
> goto put_unlock;
> }
> --
> 2.40.1
>

2024-01-26 19:33:32

by Nhat Pham

Subject: Re: [PATCH 2/2] mm/zswap: fix race between lru writeback and swapoff

On Fri, Jan 26, 2024 at 7:31 AM Johannes Weiner <[email protected]> wrote:
>
> On Fri, Jan 26, 2024 at 08:30:15AM +0000, [email protected] wrote:
> > From: Chengming Zhou <[email protected]>
> >
> > LRU writeback has race problem with swapoff, as spotted by Yosry[1]:
> >
> > CPU1                           CPU2
> > shrink_memcg_cb                swap_off
> >   list_lru_isolate               zswap_invalidate
> >                                    zswap_swapoff
> >                                      kfree(tree)
> >   // UAF
> >   spin_lock(&tree->lock)
> >
> > The problem is that the entry in lru list can't protect the tree from
> > being swapoff and freed, and the entry also can be invalidated and freed
> > concurrently after we unlock the lru lock.
> >
> > We can fix it by moving the swap cache allocation ahead before
> > referencing the tree, then check invalidate race with tree lock,
> > only after that we can safely deref the entry. Note we couldn't
> > deref entry or tree anymore after we unlock the folio, since we
> > depend on this to hold on swapoff.
>
> This is a great simplification on top of being a bug fix.
>
> > So this patch moves all tree and entry usage to zswap_writeback_entry(),
> > we only use the copied swpentry on the stack to allocate swap cache
> > and return with folio locked, after which we can reference the tree.
> > Then check invalidate race with tree lock, the following things is
> > much the same like zswap_load().
> >
> > Since we can't deref the entry after zswap_writeback_entry(), we
> > can't use zswap_lru_putback() anymore, instead we rotate the entry
> > in the LRU list so concurrent reclaimers have little chance to see it.
> > Or it will be deleted from LRU list if writeback success.
> >
> > Another confusing part to me is the update of memcg nr_zswap_protected
> > in zswap_lru_putback(). I'm not sure why it's needed here since
> > if we raced with swapin, memcg nr_zswap_protected has already been
> > updated in zswap_folio_swapin(). So not include this part for now.
>
> Good observation.
>
> Technically, it could also fail on -ENOMEM, but in practice these size
> allocations don't fail, especially since the shrinker runs in
> PF_MEMALLOC context. The shrink_worker might be affected, since it
> doesn't run in PF_MEMALLOC context. But the common case is -EEXIST,
> which indeed double counts.

Yup. At the time, I was thinking more along the lines of what
mechanisms should trigger a protection size increase. "swapin" and "LRU
list rotations" were two different mechanisms in my head. I was aware
that there could be double counting, but deemed it OK, as the cost of
over-shrinking (an increase in swapins) was fairly serious, and we have
a fairly aggressive decay strategy if we protect too much.

But yes, I doubt it mattered that much in hindsight :) And the cases
where it is double counted far outnumber the cases where it is not, so
I'm fine with removing it here.

>
> To make it "correct", you'd have to grab an objcg reference with the
> LRU lock, and also re-order the objcg put on entry freeing after the
> LRU del. This is probably not worth doing. But it could use a comment.
>
> I was going to ask if you could reorder objcg uncharging after LRU
> deletion to make it more robust for future changes in that direction.
> However, staring at this, I notice this is a second UAF bug:
>
> if (entry->objcg) {
> obj_cgroup_uncharge_zswap(entry->objcg, entry->length);
> obj_cgroup_put(entry->objcg);
> }
> if (!entry->length)
> atomic_dec(&zswap_same_filled_pages);
> else {
> zswap_lru_del(&entry->pool->list_lru, entry);
>
> zswap_lru_del() uses entry->objcg to determine the list_lru memcg, but
> the put may have killed it. I'll send a separate patch on top.
>
> > @@ -860,40 +839,34 @@ static enum lru_status shrink_memcg_cb(struct list_head *item, struct list_lru_o
> > {
> > struct zswap_entry *entry = container_of(item, struct zswap_entry, lru);
> > bool *encountered_page_in_swapcache = (bool *)arg;
> > - struct zswap_tree *tree;
> > - pgoff_t swpoffset;
> > + swp_entry_t swpentry;
> > enum lru_status ret = LRU_REMOVED_RETRY;
> > int writeback_result;
> >
> > + /*
> > + * First rotate to the tail of lru list before unlocking lru lock,
> > + * so the concurrent reclaimers have little chance to see it.
> > + * It will be deleted from the lru list if writeback success.
> > + */
> > + list_move_tail(item, &l->list);
>
> We don't hold a reference to the object, so there could also be an
> invalidation waiting on the LRU lock, which will free the entry even
> when writeback fails.
>
> It would also be good to expand on the motivation, because it's not
> clear WHY you'd want to hide it from other reclaimers.
>
> Lastly, maybe mention the story around temporary failures? Most
> shrinkers have a lock inversion pattern (object lock -> LRU lock for
> linking versus LRU lock -> object trylock during reclaim) that can
> fail and require the same object be tried again before advancing.
>
> How about this?
>
> /*
> * Rotate the entry to the tail before unlocking the LRU,
> * so that in case of an invalidation race concurrent
> * reclaimers don't waste their time on it.
> *
> * If writeback succeeds, or failure is due to the entry
> * being invalidated by the swap subsystem, the invalidation
> * will unlink and free it.
> *
> * Temporary failures, where the same entry should be tried
> * again immediately, almost never happen for this shrinker.
> * We don't do any trylocking; -ENOMEM comes closest,
> * but that's extremely rare and doesn't happen spuriously
> * either. Don't bother distinguishing this case.
> *
> * But since they do exist in theory, the entry cannot just
> * be unlinked, or we could leak it. Hence, rotate.
> */
>
> Otherwise, looks great to me.
>
> Acked-by: Johannes Weiner <[email protected]>

2024-01-26 19:50:29

by Nhat Pham

Subject: Re: [PATCH 2/2] mm/zswap: fix race between lru writeback and swapoff

On Fri, Jan 26, 2024 at 12:32 AM <[email protected]> wrote:
>
> From: Chengming Zhou <[email protected]>
>
> LRU writeback has race problem with swapoff, as spotted by Yosry[1]:
>
> CPU1                           CPU2
> shrink_memcg_cb                swap_off
>   list_lru_isolate               zswap_invalidate
>                                    zswap_swapoff
>                                      kfree(tree)
>   // UAF
>   spin_lock(&tree->lock)
>
> The problem is that the entry in lru list can't protect the tree from
> being swapoff and freed, and the entry also can be invalidated and freed
> concurrently after we unlock the lru lock.
>
> We can fix it by moving the swap cache allocation ahead before
> referencing the tree, then check invalidate race with tree lock,
> only after that we can safely deref the entry. Note we couldn't
> deref entry or tree anymore after we unlock the folio, since we
> depend on this to hold on swapoff.
>
> So this patch moves all tree and entry usage to zswap_writeback_entry(),
> we only use the copied swpentry on the stack to allocate swap cache
> and return with folio locked, after which we can reference the tree.
> Then check invalidate race with tree lock, the following things is
> much the same like zswap_load().
>
> Since we can't deref the entry after zswap_writeback_entry(), we
> can't use zswap_lru_putback() anymore, instead we rotate the entry

I added list_lru_putback to the list_lru API specifically for this use
case (zswap_lru_putback()). Now that we no longer need it, maybe we can
remove it as well (assuming no one else is using it?).

This can be done in a separate patch though.

> in the LRU list so concurrent reclaimers have little chance to see it.
> Or it will be deleted from LRU list if writeback success.
>
> Another confusing part to me is the update of memcg nr_zswap_protected
> in zswap_lru_putback(). I'm not sure why it's needed here since
> if we raced with swapin, memcg nr_zswap_protected has already been
> updated in zswap_folio_swapin(). So not include this part for now.
>
> [1] https://lore.kernel.org/all/CAJD7tkasHsRnT_75-TXsEe58V9_OW6m3g6CF7Kmsvz8CKRG_EA@mail.gmail.com/
>
> Signed-off-by: Chengming Zhou <[email protected]>

LGTM! This is quite elegant.
Acked-by: Nhat Pham <[email protected]>

> ---
> mm/zswap.c | 93 ++++++++++++++++++------------------------------------
> 1 file changed, 31 insertions(+), 62 deletions(-)
>
> diff --git a/mm/zswap.c b/mm/zswap.c
> index 81cb3790e0dd..fa2bdb7ec1d8 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -277,7 +277,7 @@ static inline struct zswap_tree *swap_zswap_tree(swp_entry_t swp)
> zpool_get_type((p)->zpools[0]))
>
> static int zswap_writeback_entry(struct zswap_entry *entry,
> - struct zswap_tree *tree);
> + swp_entry_t swpentry);
> static int zswap_pool_get(struct zswap_pool *pool);
> static void zswap_pool_put(struct zswap_pool *pool);
>
> @@ -445,27 +445,6 @@ static void zswap_lru_del(struct list_lru *list_lru, struct zswap_entry *entry)
> rcu_read_unlock();
> }
>
> -static void zswap_lru_putback(struct list_lru *list_lru,
> - struct zswap_entry *entry)
> -{
> - int nid = entry_to_nid(entry);
> - spinlock_t *lock = &list_lru->node[nid].lock;
> - struct mem_cgroup *memcg;
> - struct lruvec *lruvec;
> -
> - rcu_read_lock();
> - memcg = mem_cgroup_from_entry(entry);
> - spin_lock(lock);
> - /* we cannot use list_lru_add here, because it increments node's lru count */
> - list_lru_putback(list_lru, &entry->lru, nid, memcg);
> - spin_unlock(lock);
> -
> - lruvec = mem_cgroup_lruvec(memcg, NODE_DATA(entry_to_nid(entry)));
> - /* increment the protection area to account for the LRU rotation. */
> - atomic_long_inc(&lruvec->zswap_lruvec_state.nr_zswap_protected);
> - rcu_read_unlock();
> -}
> -
> /*********************************
> * rbtree functions
> **********************************/
> @@ -860,40 +839,34 @@ static enum lru_status shrink_memcg_cb(struct list_head *item, struct list_lru_o
> {
> struct zswap_entry *entry = container_of(item, struct zswap_entry, lru);
> bool *encountered_page_in_swapcache = (bool *)arg;
> - struct zswap_tree *tree;
> - pgoff_t swpoffset;
> + swp_entry_t swpentry;
> enum lru_status ret = LRU_REMOVED_RETRY;
> int writeback_result;
>
> + /*
> + * First rotate to the tail of lru list before unlocking lru lock,
> + * so the concurrent reclaimers have little chance to see it.
> + * It will be deleted from the lru list if writeback success.
> + */
> + list_move_tail(item, &l->list);
> +
> /*
> * Once the lru lock is dropped, the entry might get freed. The
> - * swpoffset is copied to the stack, and entry isn't deref'd again
> + * swpentry is copied to the stack, and entry isn't deref'd again
> * until the entry is verified to still be alive in the tree.
> */
> - swpoffset = swp_offset(entry->swpentry);
> - tree = swap_zswap_tree(entry->swpentry);
> - list_lru_isolate(l, item);

nit: IIUC, now that we're no longer removing the entry from the
list_lru, we also protect against concurrent shrinking actions via this
check inside zswap_writeback_entry(), right:

if (!folio_was_allocated) {
folio_put(folio);
return -EEXIST;
}

Maybe update the comment above it too?

> + swpentry = entry->swpentry;
> +
> /*
> * It's safe to drop the lock here because we return either
> * LRU_REMOVED_RETRY or LRU_RETRY.
> */
> spin_unlock(lock);
>
> - /* Check for invalidate() race */
> - spin_lock(&tree->lock);
> - if (entry != zswap_rb_search(&tree->rbroot, swpoffset))
> - goto unlock;
> -
> - /* Hold a reference to prevent a free during writeback */
> - zswap_entry_get(entry);
> - spin_unlock(&tree->lock);
> -
> - writeback_result = zswap_writeback_entry(entry, tree);
> + writeback_result = zswap_writeback_entry(entry, swpentry);
>
> - spin_lock(&tree->lock);
> if (writeback_result) {
> zswap_reject_reclaim_fail++;
> - zswap_lru_putback(&entry->pool->list_lru, entry);
> ret = LRU_RETRY;
>
> /*
> @@ -903,27 +876,10 @@ static enum lru_status shrink_memcg_cb(struct list_head *item, struct list_lru_o
> */
> if (writeback_result == -EEXIST && encountered_page_in_swapcache)
> *encountered_page_in_swapcache = true;
> -
> - goto put_unlock;
> + } else {
> + zswap_written_back_pages++;
> }
> - zswap_written_back_pages++;
> -
> - if (entry->objcg)
> - count_objcg_event(entry->objcg, ZSWPWB);
> -
> - count_vm_event(ZSWPWB);
> - /*
> - * Writeback started successfully, the page now belongs to the
> - * swapcache. Drop the entry from zswap - unless invalidate already
> - * took it out while we had the tree->lock released for IO.
> - */
> - zswap_invalidate_entry(tree, entry);
>
> -put_unlock:
> - /* Drop local reference */
> - zswap_entry_put(entry);
> -unlock:
> - spin_unlock(&tree->lock);
> spin_lock(lock);
> return ret;
> }
> @@ -1408,9 +1364,9 @@ static void __zswap_load(struct zswap_entry *entry, struct page *page)
> * freed.
> */
> static int zswap_writeback_entry(struct zswap_entry *entry,
> - struct zswap_tree *tree)
> + swp_entry_t swpentry)
> {
> - swp_entry_t swpentry = entry->swpentry;
> + struct zswap_tree *tree;
> struct folio *folio;
> struct mempolicy *mpol;
> bool folio_was_allocated;
> @@ -1442,18 +1398,31 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
> * backs (our zswap_entry reference doesn't prevent that), to
> * avoid overwriting a new swap folio with old compressed data.
> */
> + tree = swap_zswap_tree(swpentry);
> spin_lock(&tree->lock);
> - if (zswap_rb_search(&tree->rbroot, swp_offset(entry->swpentry)) != entry) {
> + if (zswap_rb_search(&tree->rbroot, swp_offset(swpentry)) != entry) {
> spin_unlock(&tree->lock);
> delete_from_swap_cache(folio);
> folio_unlock(folio);
> folio_put(folio);
> return -ENOMEM;
> }
> +
> + /* Safe to deref entry after the entry is verified above. */
> + zswap_entry_get(entry);
> spin_unlock(&tree->lock);
>
> __zswap_load(entry, &folio->page);
>
> + count_vm_event(ZSWPWB);
> + if (entry->objcg)
> + count_objcg_event(entry->objcg, ZSWPWB);
> +
> + spin_lock(&tree->lock);
> + zswap_invalidate_entry(tree, entry);
> + zswap_entry_put(entry);
> + spin_unlock(&tree->lock);
> +
> /* folio is up to date */
> folio_mark_uptodate(folio);
>
> --
> 2.40.1
>

2024-01-27 14:34:48

by Chengming Zhou

Subject: Re: [PATCH 1/2] mm/zswap: don't return LRU_SKIP if we have dropped lru lock

On 2024/1/26 22:28, Johannes Weiner wrote:
> On Fri, Jan 26, 2024 at 08:30:14AM +0000, [email protected] wrote:
>> From: Chengming Zhou <[email protected]>
>>
>> LRU_SKIP can only be returned if we don't ever dropped lru lock, or
>> we need to return LRU_RETRY to restart from the head of lru list.
>
> Good catch. Can you mention the possible consequences in the log?
>
> "Otherwise, the iteration might continue from a cursor position that
> was freed while the locks were dropped."?

Good, will do.

>
>> Actually we may need to introduce another LRU_STOP to really terminate
>> the ongoing shrinking scan process, when we encounter a warm page
>> already in the swap cache. The current list_lru implementation
>> doesn't have this function to early break from __list_lru_walk_one.
>>
>> Fixes: b5ba474f3f51 ("zswap: shrink zswap pool based on memory pressure")
>> Signed-off-by: Chengming Zhou <[email protected]>
>
> Acked-by: Johannes Weiner <[email protected]>

Thanks.

2024-01-27 14:37:51

by Chengming Zhou

Subject: Re: [PATCH 1/2] mm/zswap: don't return LRU_SKIP if we have dropped lru lock

On 2024/1/27 02:01, Nhat Pham wrote:
> On Fri, Jan 26, 2024 at 12:31 AM <[email protected]> wrote:
>>
>> From: Chengming Zhou <[email protected]>
>>
>> LRU_SKIP can only be returned if we don't ever dropped lru lock, or
>> we need to return LRU_RETRY to restart from the head of lru list.
>
> Ooops. You're right! I just double checked and only LRU_REMOVED_RETRY
> and LRU_RETRY indicate we might have dropped the lock. My bad.
>
>>
>> Actually we may need to introduce another LRU_STOP to really terminate
>> the ongoing shrinking scan process, when we encounter a warm page
>
> Yup. This is what I was trying (and failing) to do. To be honest, this
> needs to be even stronger: short-circuit ALL concurrent/ongoing zswap
> shrinker scan processes that are touching this memcg (as they will
> also shrink into warmer regions going forward). But that's a bit more
> engineering to do. LRU_STOP, which stops this scan process, would be a
> good place to start.

Good suggestion, will look into that more later.

>
>> already in the swap cache. The current list_lru implementation
>> doesn't have this function to early break from __list_lru_walk_one.
>>
>> Fixes: b5ba474f3f51 ("zswap: shrink zswap pool based on memory pressure")
>> Signed-off-by: Chengming Zhou <[email protected]>
>
> Reviewed-by: Nhat Pham <[email protected]>

Thanks.

>
>> ---
>> mm/zswap.c | 4 +---
>> 1 file changed, 1 insertion(+), 3 deletions(-)
>>
>> diff --git a/mm/zswap.c b/mm/zswap.c
>> index 00e90b9b5417..81cb3790e0dd 100644
>> --- a/mm/zswap.c
>> +++ b/mm/zswap.c
>> @@ -901,10 +901,8 @@ static enum lru_status shrink_memcg_cb(struct list_head *item, struct list_lru_o
>> * into the warmer region. We should terminate shrinking (if we're in the dynamic
>> * shrinker context).
>> */
>> - if (writeback_result == -EEXIST && encountered_page_in_swapcache) {
>> - ret = LRU_SKIP;
>> + if (writeback_result == -EEXIST && encountered_page_in_swapcache)
>> *encountered_page_in_swapcache = true;
>> - }
>>
>> goto put_unlock;
>> }
>> --
>> 2.40.1
>>

2024-01-27 14:53:56

by Chengming Zhou

Subject: Re: [PATCH 2/2] mm/zswap: fix race between lru writeback and swapoff

On 2024/1/26 23:31, Johannes Weiner wrote:
> On Fri, Jan 26, 2024 at 08:30:15AM +0000, [email protected] wrote:
>> From: Chengming Zhou <[email protected]>
>>
>> LRU writeback has race problem with swapoff, as spotted by Yosry[1]:
>>
>> CPU1                           CPU2
>> shrink_memcg_cb                swap_off
>>   list_lru_isolate               zswap_invalidate
>>                                    zswap_swapoff
>>                                      kfree(tree)
>>   // UAF
>>   spin_lock(&tree->lock)
>>
>> The problem is that the entry in lru list can't protect the tree from
>> being swapoff and freed, and the entry also can be invalidated and freed
>> concurrently after we unlock the lru lock.
>>
>> We can fix it by moving the swap cache allocation ahead before
>> referencing the tree, then check invalidate race with tree lock,
>> only after that we can safely deref the entry. Note we couldn't
>> deref entry or tree anymore after we unlock the folio, since we
>> depend on this to hold on swapoff.
>
> This is a great simplification on top of being a bug fix.
>
>> So this patch moves all tree and entry usage to zswap_writeback_entry(),
>> we only use the copied swpentry on the stack to allocate swap cache
>> and return with folio locked, after which we can reference the tree.
>> Then check invalidate race with tree lock, the following things is
>> much the same like zswap_load().
>>
>> Since we can't deref the entry after zswap_writeback_entry(), we
>> can't use zswap_lru_putback() anymore, instead we rotate the entry
>> in the LRU list so concurrent reclaimers have little chance to see it.
>> Or it will be deleted from LRU list if writeback success.
>>
>> Another confusing part to me is the update of memcg nr_zswap_protected
>> in zswap_lru_putback(). I'm not sure why it's needed here since
>> if we raced with swapin, memcg nr_zswap_protected has already been
>> updated in zswap_folio_swapin(). So not include this part for now.
>
> Good observation.
>
> Technically, it could also fail on -ENOMEM, but in practice these size
> allocations don't fail, especially since the shrinker runs in
> PF_MEMALLOC context. The shrink_worker might be affected, since it
> doesn't run in PF_MEMALLOC context. But the common case is -EEXIST,
> which indeed double counts.

Ah right, the rotation in the more unlikely case that the allocation
fails does need to update memcg nr_zswap_protected; only the -EEXIST
case has the double-counting problem.

>
> To make it "correct", you'd have to grab an objcg reference with the
> LRU lock, and also re-order the objcg put on entry freeing after the
> LRU del. This is probably not worth doing. But it could use a comment.

Agree, will add your comments below.

>
> I was going to ask if you could reorder objcg uncharging after LRU
> deletion to make it more robust for future changes in that direction.
> However, staring at this, I notice this is a second UAF bug:
>
> if (entry->objcg) {
> obj_cgroup_uncharge_zswap(entry->objcg, entry->length);
> obj_cgroup_put(entry->objcg);
> }
> if (!entry->length)
> atomic_dec(&zswap_same_filled_pages);
> else {
> zswap_lru_del(&entry->pool->list_lru, entry);
>
> zswap_lru_del() uses entry->objcg to determine the list_lru memcg, but
> the put may have killed it. I'll send a separate patch on top.

Good observation.

>
>> @@ -860,40 +839,34 @@ static enum lru_status shrink_memcg_cb(struct list_head *item, struct list_lru_o
>> {
>> struct zswap_entry *entry = container_of(item, struct zswap_entry, lru);
>> bool *encountered_page_in_swapcache = (bool *)arg;
>> - struct zswap_tree *tree;
>> - pgoff_t swpoffset;
>> + swp_entry_t swpentry;
>> enum lru_status ret = LRU_REMOVED_RETRY;
>> int writeback_result;
>>
>> + /*
>> + * First rotate to the tail of lru list before unlocking lru lock,
>> + * so the concurrent reclaimers have little chance to see it.
>> + * It will be deleted from the lru list if writeback success.
>> + */
>> + list_move_tail(item, &l->list);
>
> We don't hold a reference to the object, so there could also be an
> invalidation waiting on the LRU lock, which will free the entry even
> when writeback fails.
>
> It would also be good to expand on the motivation, because it's not
> clear WHY you'd want to hide it from other reclaimers.

Right, my comments are not clear enough.

>
> Lastly, maybe mention the story around temporary failures? Most
> shrinkers have a lock inversion pattern (object lock -> LRU lock for
> linking versus LRU lock -> object trylock during reclaim) that can
> fail and require the same object be tried again before advancing.

Your comments are great, will add in the next version.

Thanks.

>
> How about this?
>
> /*
> * Rotate the entry to the tail before unlocking the LRU,
> * so that in case of an invalidation race concurrent
> * reclaimers don't waste their time on it.
> *
> * If writeback succeeds, or failure is due to the entry
> * being invalidated by the swap subsystem, the invalidation
> * will unlink and free it.
> *
> * Temporary failures, where the same entry should be tried
> * again immediately, almost never happen for this shrinker.
> * We don't do any trylocking; -ENOMEM comes closest,
> * but that's extremely rare and doesn't happen spuriously
> * either. Don't bother distinguishing this case.
> *
> * But since they do exist in theory, the entry cannot just
> * be unlinked, or we could leak it. Hence, rotate.
> */
>
> Otherwise, looks great to me.
>
> Acked-by: Johannes Weiner <[email protected]>

2024-01-27 15:13:35

by Chengming Zhou

Subject: Re: [PATCH 2/2] mm/zswap: fix race between lru writeback and swapoff

On 2024/1/27 03:50, Nhat Pham wrote:
> On Fri, Jan 26, 2024 at 12:32 AM <[email protected]> wrote:
>>
>> From: Chengming Zhou <[email protected]>
>>
>> LRU writeback has race problem with swapoff, as spotted by Yosry[1]:
>>
>> CPU1                           CPU2
>> shrink_memcg_cb                swap_off
>>   list_lru_isolate               zswap_invalidate
>>                                    zswap_swapoff
>>                                      kfree(tree)
>>   // UAF
>>   spin_lock(&tree->lock)
>>
>> The problem is that the entry in lru list can't protect the tree from
>> being swapoff and freed, and the entry also can be invalidated and freed
>> concurrently after we unlock the lru lock.
>>
>> We can fix it by moving the swap cache allocation ahead before
>> referencing the tree, then check invalidate race with tree lock,
>> only after that we can safely deref the entry. Note we couldn't
>> deref entry or tree anymore after we unlock the folio, since we
>> depend on this to hold on swapoff.
>>
>> So this patch moves all tree and entry usage to zswap_writeback_entry(),
>> we only use the copied swpentry on the stack to allocate swap cache
>> and return with folio locked, after which we can reference the tree.
>> Then check invalidate race with tree lock, the following things is
>> much the same like zswap_load().
>>
>> Since we can't deref the entry after zswap_writeback_entry(), we
>> can't use zswap_lru_putback() anymore, instead we rotate the entry
>
> I added list_lru_putback to the list_lru API specifically for this use
> case (zswap_lru_putback()). Now that we no longer need it, maybe we
> can also remove this as well (assuming no-one else is using this?).
>
> This can be done in a separate patch though.

Right, I can append a patch to remove it, since there are no other users.

>
>> in the LRU list so concurrent reclaimers have little chance to see it.
>> Or it will be deleted from LRU list if writeback success.
>>
>> Another confusing part to me is the update of memcg nr_zswap_protected
>> in zswap_lru_putback(). I'm not sure why it's needed here since
>> if we raced with swapin, memcg nr_zswap_protected has already been
>> updated in zswap_folio_swapin(). So not include this part for now.
>>
>> [1] https://lore.kernel.org/all/CAJD7tkasHsRnT_75-TXsEe58V9_OW6m3g6CF7Kmsvz8CKRG_EA@mail.gmail.com/
>>
>> Signed-off-by: Chengming Zhou <[email protected]>
>
> LGTM! This is quite elegant.
> Acked-by: Nhat Pham <[email protected]>
>
>> ---
>> mm/zswap.c | 93 ++++++++++++++++++------------------------------------
>> 1 file changed, 31 insertions(+), 62 deletions(-)
>>
>> diff --git a/mm/zswap.c b/mm/zswap.c
>> index 81cb3790e0dd..fa2bdb7ec1d8 100644
>> --- a/mm/zswap.c
>> +++ b/mm/zswap.c
>> @@ -277,7 +277,7 @@ static inline struct zswap_tree *swap_zswap_tree(swp_entry_t swp)
>> zpool_get_type((p)->zpools[0]))
>>
>> static int zswap_writeback_entry(struct zswap_entry *entry,
>> - struct zswap_tree *tree);
>> + swp_entry_t swpentry);
>> static int zswap_pool_get(struct zswap_pool *pool);
>> static void zswap_pool_put(struct zswap_pool *pool);
>>
>> @@ -445,27 +445,6 @@ static void zswap_lru_del(struct list_lru *list_lru, struct zswap_entry *entry)
>> rcu_read_unlock();
>> }
>>
>> -static void zswap_lru_putback(struct list_lru *list_lru,
>> - struct zswap_entry *entry)
>> -{
>> - int nid = entry_to_nid(entry);
>> - spinlock_t *lock = &list_lru->node[nid].lock;
>> - struct mem_cgroup *memcg;
>> - struct lruvec *lruvec;
>> -
>> - rcu_read_lock();
>> - memcg = mem_cgroup_from_entry(entry);
>> - spin_lock(lock);
>> - /* we cannot use list_lru_add here, because it increments node's lru count */
>> - list_lru_putback(list_lru, &entry->lru, nid, memcg);
>> - spin_unlock(lock);
>> -
>> - lruvec = mem_cgroup_lruvec(memcg, NODE_DATA(entry_to_nid(entry)));
>> - /* increment the protection area to account for the LRU rotation. */
>> - atomic_long_inc(&lruvec->zswap_lruvec_state.nr_zswap_protected);
>> - rcu_read_unlock();
>> -}
>> -
>> /*********************************
>> * rbtree functions
>> **********************************/
>> @@ -860,40 +839,34 @@ static enum lru_status shrink_memcg_cb(struct list_head *item, struct list_lru_o
>> {
>> struct zswap_entry *entry = container_of(item, struct zswap_entry, lru);
>> bool *encountered_page_in_swapcache = (bool *)arg;
>> - struct zswap_tree *tree;
>> - pgoff_t swpoffset;
>> + swp_entry_t swpentry;
>> enum lru_status ret = LRU_REMOVED_RETRY;
>> int writeback_result;
>>
>> + /*
>> + * First rotate to the tail of lru list before unlocking lru lock,
>> + * so the concurrent reclaimers have little chance to see it.
>> + * It will be deleted from the lru list if writeback success.
>> + */
>> + list_move_tail(item, &l->list);
>> +
>> /*
>> * Once the lru lock is dropped, the entry might get freed. The
>> - * swpoffset is copied to the stack, and entry isn't deref'd again
>> + * swpentry is copied to the stack, and entry isn't deref'd again
>> * until the entry is verified to still be alive in the tree.
>> */
>> - swpoffset = swp_offset(entry->swpentry);
>> - tree = swap_zswap_tree(entry->swpentry);
>> - list_lru_isolate(l, item);
>
> nit: IIUC, now that we're no longer removing the entry from the
> list_lru, we protect against concurrent shrinking action via this
> check inside zswap_writeback_entry() too right:
>
> if (!folio_was_allocated) {
> folio_put(folio);
> return -EEXIST;
> }
>
> Maybe update the comment above it too?

* Found an existing folio, we raced with load/swapin. We generally
* writeback cold folios from zswap, and swapin means the folio just
* became hot. Skip this folio and let the caller find another one.

So now, finding an existing folio doesn't only mean load/swapin, it can
also mean a concurrent shrinking action. Yes, this comment needs to be
changed a little.
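
Maybe something along these lines (just a rough idea, not final wording):

	/*
	 * Found an existing folio: we raced with load/swapin or with
	 * another writeback of the same entry. In either case the
	 * folio is no longer ours to write back; skip it and let the
	 * caller find another entry.
	 */
	if (!folio_was_allocated) {
		folio_put(folio);
		return -EEXIST;
	}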

Thanks.