On Sun, Jan 28, 2024 at 01:28:50PM +0000, Chengming Zhou wrote:
> LRU writeback has race problem with swapoff, as spotted by Yosry [1]:
>
> CPU1 CPU2
> shrink_memcg_cb swap_off
> list_lru_isolate zswap_invalidate
> zswap_swapoff
> kfree(tree)
> // UAF
> spin_lock(&tree->lock)
>
> The problem is that the entry in lru list can't protect the tree from
> being swapoff and freed, and the entry also can be invalidated and freed
> concurrently after we unlock the lru lock.
>
> We can fix it by moving the swap cache allocation ahead before
> referencing the tree, then check invalidate race with tree lock,
> only after that we can safely deref the entry. Note we couldn't
> deref entry or tree anymore after we unlock the folio, since we
> depend on this to hold on swapoff.
>
> So this patch moves all tree and entry usage to zswap_writeback_entry(),
> we only use the copied swpentry on the stack to allocate swap cache
> and if returned with folio locked we can reference the tree safely.
> Then we can check invalidate race with tree lock, the following things
> is much the same like zswap_load().
>
> Since we can't deref the entry after zswap_writeback_entry(), we
> can't use zswap_lru_putback() anymore, instead we rotate the entry
> in the beginning. And it will be unlinked and freed when invalidated
> if writeback success.
You are also removing one redundant lookup from the zswap writeback path
to check for the invalidation race, and simplifying the code. Give
yourself full credit :)
>
> Another change is we don't update the memcg nr_zswap_protected in
> the -ENOMEM and -EEXIST cases anymore. -EEXIST case means we raced
> with swapin or concurrent shrinker action, since swapin already
> have memcg nr_zswap_protected updated, don't need double counts here.
> For concurrent shrinker, the folio will be writeback and freed anyway.
> -ENOMEM case is extremely rare and doesn't happen spuriously either,
> so don't bother distinguishing this case.
>
> [1] https://lore.kernel.org/all/CAJD7tkasHsRnT_75-TXsEe58V9_OW6m3g6CF7Kmsvz8CKRG_EA@mail.gmail.com/
>
> Acked-by: Johannes Weiner <[email protected]>
> Acked-by: Nhat Pham <[email protected]>
> Signed-off-by: Chengming Zhou <[email protected]>
> ---
> mm/zswap.c | 114 ++++++++++++++++++++++++++-----------------------------------
> 1 file changed, 49 insertions(+), 65 deletions(-)
>
> diff --git a/mm/zswap.c b/mm/zswap.c
> index 81cb3790e0dd..f5cb5a46e4d7 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -277,7 +277,7 @@ static inline struct zswap_tree *swap_zswap_tree(swp_entry_t swp)
> zpool_get_type((p)->zpools[0]))
>
> static int zswap_writeback_entry(struct zswap_entry *entry,
> - struct zswap_tree *tree);
> + swp_entry_t swpentry);
> static int zswap_pool_get(struct zswap_pool *pool);
> static void zswap_pool_put(struct zswap_pool *pool);
>
> @@ -445,27 +445,6 @@ static void zswap_lru_del(struct list_lru *list_lru, struct zswap_entry *entry)
> rcu_read_unlock();
> }
>
> -static void zswap_lru_putback(struct list_lru *list_lru,
> - struct zswap_entry *entry)
> -{
> - int nid = entry_to_nid(entry);
> - spinlock_t *lock = &list_lru->node[nid].lock;
> - struct mem_cgroup *memcg;
> - struct lruvec *lruvec;
> -
> - rcu_read_lock();
> - memcg = mem_cgroup_from_entry(entry);
> - spin_lock(lock);
> - /* we cannot use list_lru_add here, because it increments node's lru count */
> - list_lru_putback(list_lru, &entry->lru, nid, memcg);
> - spin_unlock(lock);
> -
> - lruvec = mem_cgroup_lruvec(memcg, NODE_DATA(entry_to_nid(entry)));
> - /* increment the protection area to account for the LRU rotation. */
> - atomic_long_inc(&lruvec->zswap_lruvec_state.nr_zswap_protected);
> - rcu_read_unlock();
> -}
> -
> /*********************************
> * rbtree functions
> **********************************/
> @@ -860,40 +839,47 @@ static enum lru_status shrink_memcg_cb(struct list_head *item, struct list_lru_o
> {
> struct zswap_entry *entry = container_of(item, struct zswap_entry, lru);
> bool *encountered_page_in_swapcache = (bool *)arg;
> - struct zswap_tree *tree;
> - pgoff_t swpoffset;
> + swp_entry_t swpentry;
> enum lru_status ret = LRU_REMOVED_RETRY;
> int writeback_result;
>
> + /*
> + * Rotate the entry to the tail before unlocking the LRU,
> + * so that in case of an invalidation race concurrent
> + * reclaimers don't waste their time on it.
> + *
> + * If writeback succeeds, or failure is due to the entry
> + * being invalidated by the swap subsystem, the invalidation
> + * will unlink and free it.
> + *
> + * Temporary failures, where the same entry should be tried
> + * again immediately, almost never happen for this shrinker.
> + * We don't do any trylocking; -ENOMEM comes closest,
> + * but that's extremely rare and doesn't happen spuriously
> + * either. Don't bother distinguishing this case.
> + *
> + * But since they do exist in theory, the entry cannot just
> + * be unlinked, or we could leak it. Hence, rotate.
The entry cannot be unlinked because we cannot get a ref on it without
holding the tree lock, and we cannot deref the tree before we acquire a
swap cache ref in zswap_writeback_entry() -- or if
zswap_writeback_entry() fails. This should be called out explicitly
somewhere. Perhaps we can describe this whole deref dance with added
docs to shrink_memcg_cb().
We could also use a comment in zswap_writeback_entry() (or above it) to
state that we only deref the tree *after* we get the swapcache ref.
On 2024/1/30 08:22, Yosry Ahmed wrote:
> On Sun, Jan 28, 2024 at 01:28:50PM +0000, Chengming Zhou wrote:
>> LRU writeback has race problem with swapoff, as spotted by Yosry [1]:
>>
>> CPU1 CPU2
>> shrink_memcg_cb swap_off
>> list_lru_isolate zswap_invalidate
>> zswap_swapoff
>> kfree(tree)
>> // UAF
>> spin_lock(&tree->lock)
>>
>> The problem is that the entry in lru list can't protect the tree from
>> being swapoff and freed, and the entry also can be invalidated and freed
>> concurrently after we unlock the lru lock.
>>
>> We can fix it by moving the swap cache allocation ahead before
>> referencing the tree, then check invalidate race with tree lock,
>> only after that we can safely deref the entry. Note we couldn't
>> deref entry or tree anymore after we unlock the folio, since we
>> depend on this to hold on swapoff.
>>
>> So this patch moves all tree and entry usage to zswap_writeback_entry(),
>> we only use the copied swpentry on the stack to allocate swap cache
>> and if returned with folio locked we can reference the tree safely.
>> Then we can check invalidate race with tree lock, the following things
>> is much the same like zswap_load().
>>
>> Since we can't deref the entry after zswap_writeback_entry(), we
>> can't use zswap_lru_putback() anymore, instead we rotate the entry
>> in the beginning. And it will be unlinked and freed when invalidated
>> if writeback success.
>
> You are also removing one redundant lookup from the zswap writeback path
> to check for the invalidation race, and simplifying the code. Give
> yourself full credit :)
Ah right, forgot to mention it, I will add this part in the commit message.
Thanks for your reminder!
>
>>
>> Another change is we don't update the memcg nr_zswap_protected in
>> the -ENOMEM and -EEXIST cases anymore. -EEXIST case means we raced
>> with swapin or concurrent shrinker action, since swapin already
>> have memcg nr_zswap_protected updated, don't need double counts here.
>> For concurrent shrinker, the folio will be writeback and freed anyway.
>> -ENOMEM case is extremely rare and doesn't happen spuriously either,
>> so don't bother distinguishing this case.
>>
>> [1] https://lore.kernel.org/all/CAJD7tkasHsRnT_75-TXsEe58V9_OW6m3g6CF7Kmsvz8CKRG_EA@mail.gmail.com/
>>
>> Acked-by: Johannes Weiner <[email protected]>
>> Acked-by: Nhat Pham <[email protected]>
>> Signed-off-by: Chengming Zhou <[email protected]>
>> ---
>> mm/zswap.c | 114 ++++++++++++++++++++++++++-----------------------------------
>> 1 file changed, 49 insertions(+), 65 deletions(-)
>>
>> diff --git a/mm/zswap.c b/mm/zswap.c
>> index 81cb3790e0dd..f5cb5a46e4d7 100644
>> --- a/mm/zswap.c
>> +++ b/mm/zswap.c
>> @@ -277,7 +277,7 @@ static inline struct zswap_tree *swap_zswap_tree(swp_entry_t swp)
>> zpool_get_type((p)->zpools[0]))
>>
>> static int zswap_writeback_entry(struct zswap_entry *entry,
>> - struct zswap_tree *tree);
>> + swp_entry_t swpentry);
>> static int zswap_pool_get(struct zswap_pool *pool);
>> static void zswap_pool_put(struct zswap_pool *pool);
>>
>> @@ -445,27 +445,6 @@ static void zswap_lru_del(struct list_lru *list_lru, struct zswap_entry *entry)
>> rcu_read_unlock();
>> }
>>
>> -static void zswap_lru_putback(struct list_lru *list_lru,
>> - struct zswap_entry *entry)
>> -{
>> - int nid = entry_to_nid(entry);
>> - spinlock_t *lock = &list_lru->node[nid].lock;
>> - struct mem_cgroup *memcg;
>> - struct lruvec *lruvec;
>> -
>> - rcu_read_lock();
>> - memcg = mem_cgroup_from_entry(entry);
>> - spin_lock(lock);
>> - /* we cannot use list_lru_add here, because it increments node's lru count */
>> - list_lru_putback(list_lru, &entry->lru, nid, memcg);
>> - spin_unlock(lock);
>> -
>> - lruvec = mem_cgroup_lruvec(memcg, NODE_DATA(entry_to_nid(entry)));
>> - /* increment the protection area to account for the LRU rotation. */
>> - atomic_long_inc(&lruvec->zswap_lruvec_state.nr_zswap_protected);
>> - rcu_read_unlock();
>> -}
>> -
>> /*********************************
>> * rbtree functions
>> **********************************/
>> @@ -860,40 +839,47 @@ static enum lru_status shrink_memcg_cb(struct list_head *item, struct list_lru_o
>> {
>> struct zswap_entry *entry = container_of(item, struct zswap_entry, lru);
>> bool *encountered_page_in_swapcache = (bool *)arg;
>> - struct zswap_tree *tree;
>> - pgoff_t swpoffset;
>> + swp_entry_t swpentry;
>> enum lru_status ret = LRU_REMOVED_RETRY;
>> int writeback_result;
>>
>> + /*
>> + * Rotate the entry to the tail before unlocking the LRU,
>> + * so that in case of an invalidation race concurrent
>> + * reclaimers don't waste their time on it.
>> + *
>> + * If writeback succeeds, or failure is due to the entry
>> + * being invalidated by the swap subsystem, the invalidation
>> + * will unlink and free it.
>> + *
>> + * Temporary failures, where the same entry should be tried
>> + * again immediately, almost never happen for this shrinker.
>> + * We don't do any trylocking; -ENOMEM comes closest,
>> + * but that's extremely rare and doesn't happen spuriously
>> + * either. Don't bother distinguishing this case.
>> + *
>> + * But since they do exist in theory, the entry cannot just
>> + * be unlinked, or we could leak it. Hence, rotate.
>
> The entry cannot be unlinked because we cannot get a ref on it without
> holding the tree lock, and we cannot deref the tree before we acquire a
> swap cache ref in zswap_writeback_entry() -- or if
> zswap_writeback_entry() fails. This should be called out explicitly
> somewhere. Perhaps we can describe this whole deref dance with added
> docs to shrink_memcg_cb().
Maybe we should add some comments before or after zswap_writeback_entry()?
Or do you have some suggestions? I'm not good at this. :)
>
> We could also use a comment in zswap_writeback_entry() (or above it) to
> state that we only deref the tree *after* we get the swapcache ref.
I just notice there are some comments in zswap_writeback_entry(), should
we add more comments here?
/*
* folio is locked, and the swapcache is now secured against
* concurrent swapping to and from the slot. Verify that the
* swap entry hasn't been invalidated and recycled behind our
* backs (our zswap_entry reference doesn't prevent that), to
* avoid overwriting a new swap folio with old compressed data.
*/
On Tue, Jan 30, 2024 at 10:30:20AM +0800, Chengming Zhou wrote:
> On 2024/1/30 08:22, Yosry Ahmed wrote:
> > On Sun, Jan 28, 2024 at 01:28:50PM +0000, Chengming Zhou wrote:
> >> @@ -860,40 +839,47 @@ static enum lru_status shrink_memcg_cb(struct list_head *item, struct list_lru_o
> >> {
> >> struct zswap_entry *entry = container_of(item, struct zswap_entry, lru);
> >> bool *encountered_page_in_swapcache = (bool *)arg;
> >> - struct zswap_tree *tree;
> >> - pgoff_t swpoffset;
> >> + swp_entry_t swpentry;
> >> enum lru_status ret = LRU_REMOVED_RETRY;
> >> int writeback_result;
> >>
> >> + /*
> >> + * Rotate the entry to the tail before unlocking the LRU,
> >> + * so that in case of an invalidation race concurrent
> >> + * reclaimers don't waste their time on it.
> >> + *
> >> + * If writeback succeeds, or failure is due to the entry
> >> + * being invalidated by the swap subsystem, the invalidation
> >> + * will unlink and free it.
> >> + *
> >> + * Temporary failures, where the same entry should be tried
> >> + * again immediately, almost never happen for this shrinker.
> >> + * We don't do any trylocking; -ENOMEM comes closest,
> >> + * but that's extremely rare and doesn't happen spuriously
> >> + * either. Don't bother distinguishing this case.
> >> + *
> >> + * But since they do exist in theory, the entry cannot just
> >> + * be unlinked, or we could leak it. Hence, rotate.
> >
> > The entry cannot be unlinked because we cannot get a ref on it without
> > holding the tree lock, and we cannot deref the tree before we acquire a
> > swap cache ref in zswap_writeback_entry() -- or if
> > zswap_writeback_entry() fails. This should be called out explicitly
> > somewhere. Perhaps we can describe this whole deref dance with added
> > docs to shrink_memcg_cb().
>
> Maybe we should add some comments before or after zswap_writeback_entry()?
> Or do you have some suggestions? I'm not good at this. :)
I agree with the suggestion of a central point to document this.
How about something like this:
/*
* As soon as we drop the LRU lock, the entry can be freed by
* a concurrent invalidation. This means the following:
*
* 1. We extract the swp_entry_t to the stack, allowing
* zswap_writeback_entry() to pin the swap entry and
* then validate the zwap entry against that swap entry's
* tree using pointer value comparison. Only when that
* is successful can the entry be dereferenced.
*
* 2. Usually, objects are taken off the LRU for reclaim. In
* this case this isn't possible, because if reclaim fails
* for whatever reason, we have no means of knowing if the
* entry is alive to put it back on the LRU.
*
* So rotate it before dropping the lock. If the entry is
* written back or invalidated, the free path will unlink
* it. For failures, rotation is the right thing as well.
*
* Temporary failures, where the same entry should be tried
* again immediately, almost never happen for this shrinker.
* We don't do any trylocking; -ENOMEM comes closest,
* but that's extremely rare and doesn't happen spuriously
* either. Don't bother distinguishing this case.
*/
> > We could also use a comment in zswap_writeback_entry() (or above it) to
> > state that we only deref the tree *after* we get the swapcache ref.
>
> I just notice there are some comments in zswap_writeback_entry(), should
> we add more comments here?
>
> /*
> * folio is locked, and the swapcache is now secured against
> * concurrent swapping to and from the slot. Verify that the
> * swap entry hasn't been invalidated and recycled behind our
> * backs (our zswap_entry reference doesn't prevent that), to
> * avoid overwriting a new swap folio with old compressed data.
> */
The bit in () is now stale, since we're not even holding a ref ;)
Otherwise, a brief note that entry can not be dereferenced until
validation would be helpful in zswap_writeback_entry(). The core of
the scheme I'd probably describe in shrink_memcg_cb(), though.
Can I ask a favor, though?
For non-critical updates to this patch, can you please make them
follow-up changes? I just sent out 20 cleanup patches on top of this
patch which would be super painful and error prone to rebase. I'd like
to avoid that if at all possible.
On 2024/1/30 11:17, Johannes Weiner wrote:
> On Tue, Jan 30, 2024 at 10:30:20AM +0800, Chengming Zhou wrote:
>> On 2024/1/30 08:22, Yosry Ahmed wrote:
>>> On Sun, Jan 28, 2024 at 01:28:50PM +0000, Chengming Zhou wrote:
>>>> @@ -860,40 +839,47 @@ static enum lru_status shrink_memcg_cb(struct list_head *item, struct list_lru_o
>>>> {
>>>> struct zswap_entry *entry = container_of(item, struct zswap_entry, lru);
>>>> bool *encountered_page_in_swapcache = (bool *)arg;
>>>> - struct zswap_tree *tree;
>>>> - pgoff_t swpoffset;
>>>> + swp_entry_t swpentry;
>>>> enum lru_status ret = LRU_REMOVED_RETRY;
>>>> int writeback_result;
>>>>
>>>> + /*
>>>> + * Rotate the entry to the tail before unlocking the LRU,
>>>> + * so that in case of an invalidation race concurrent
>>>> + * reclaimers don't waste their time on it.
>>>> + *
>>>> + * If writeback succeeds, or failure is due to the entry
>>>> + * being invalidated by the swap subsystem, the invalidation
>>>> + * will unlink and free it.
>>>> + *
>>>> + * Temporary failures, where the same entry should be tried
>>>> + * again immediately, almost never happen for this shrinker.
>>>> + * We don't do any trylocking; -ENOMEM comes closest,
>>>> + * but that's extremely rare and doesn't happen spuriously
>>>> + * either. Don't bother distinguishing this case.
>>>> + *
>>>> + * But since they do exist in theory, the entry cannot just
>>>> + * be unlinked, or we could leak it. Hence, rotate.
>>>
>>> The entry cannot be unlinked because we cannot get a ref on it without
>>> holding the tree lock, and we cannot deref the tree before we acquire a
>>> swap cache ref in zswap_writeback_entry() -- or if
>>> zswap_writeback_entry() fails. This should be called out explicitly
>>> somewhere. Perhaps we can describe this whole deref dance with added
>>> docs to shrink_memcg_cb().
>>
>> Maybe we should add some comments before or after zswap_writeback_entry()?
>> Or do you have some suggestions? I'm not good at this. :)
>
> I agree with the suggestion of a central point to document this.
>
> How about something like this:
>
> /*
> * As soon as we drop the LRU lock, the entry can be freed by
> * a concurrent invalidation. This means the following:
> *
> * 1. We extract the swp_entry_t to the stack, allowing
> * zswap_writeback_entry() to pin the swap entry and
> * then validate the zwap entry against that swap entry's
> * tree using pointer value comparison. Only when that
> * is successful can the entry be dereferenced.
> *
> * 2. Usually, objects are taken off the LRU for reclaim. In
> * this case this isn't possible, because if reclaim fails
> * for whatever reason, we have no means of knowing if the
> * entry is alive to put it back on the LRU.
> *
> * So rotate it before dropping the lock. If the entry is
> * written back or invalidated, the free path will unlink
> * it. For failures, rotation is the right thing as well.
> *
> * Temporary failures, where the same entry should be tried
> * again immediately, almost never happen for this shrinker.
> * We don't do any trylocking; -ENOMEM comes closest,
> * but that's extremely rare and doesn't happen spuriously
> * either. Don't bother distinguishing this case.
> */
>
Thanks! I think this document is great enough.
>>> We could also use a comment in zswap_writeback_entry() (or above it) to
>>> state that we only deref the tree *after* we get the swapcache ref.
>>
>> I just notice there are some comments in zswap_writeback_entry(), should
>> we add more comments here?
>>
>> /*
>> * folio is locked, and the swapcache is now secured against
>> * concurrent swapping to and from the slot. Verify that the
>> * swap entry hasn't been invalidated and recycled behind our
>> * backs (our zswap_entry reference doesn't prevent that), to
>> * avoid overwriting a new swap folio with old compressed data.
>> */
>
> The bit in () is now stale, since we're not even holding a ref ;)
Right.
>
> Otherwise, a brief note that entry can not be dereferenced until
> validation would be helpful in zswap_writeback_entry(). The core of
Ok.
> the scheme I'd probably describe in shrink_memcg_cb(), though.
>
> Can I ask a favor, though?
>
> For non-critical updates to this patch, can you please make them
> follow-up changes? I just sent out 20 cleanup patches on top of this
> patch which would be super painful and error prone to rebase. I'd like
> to avoid that if at all possible.
Ok, so these comments changes should be changed on top of your cleanup series
and sent as a follow-up patch.
Thanks.