2024-03-20 05:52:39

by Chris Li

Subject: [PATCH v7] zswap: replace RB tree with xarray

A very deep RB tree requires rebalancing at times. That
contributes to the zswap fault latencies. An xarray does not
need to perform tree rebalancing, so replacing the RB tree
with an xarray can yield a small performance gain.

One small difference is that an xarray insert might fail
with -ENOMEM, while an RB tree insert does not allocate
additional memory.
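
As a rough sketch of what that means for the store path (this
mirrors the error handling added in the diff below, slightly
simplified):

    old = xa_store(tree, offset, entry, GFP_KERNEL);
    if (xa_is_err(old)) {
            /* xa_store() can fail to allocate xarray nodes, typically -ENOMEM */
            int err = xa_err(old);

            WARN_ONCE(err != -ENOMEM, "unexpected xarray error: %d\n", err);
            zswap_reject_alloc_fail++;
            goto store_failed;
    }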

The zswap_entry size shrinks a bit due to removing the RB
node, which has two pointers and a color field. The xarray
stores the entry pointer in the xarray tree rather than in
the zswap_entry itself, so every entry costs one pointer in
the xarray tree. Overall, switching to the xarray should
save some memory if the swap entries are densely packed.

Notice that zswap_rb_search and zswap_rb_insert are always
followed by zswap_rb_erase. Use xa_erase and xa_store
directly instead, which saves one tree lookup as well.
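
For example, the load-side lookup-then-erase collapses
roughly like this (a simplified sketch of the zswap_load()
change in the diff below):

    /* before: two tree walks under the tree spinlock */
    spin_lock(&tree->lock);
    entry = zswap_rb_search(&tree->rbroot, offset);
    if (entry)
            zswap_rb_erase(&tree->rbroot, entry);
    spin_unlock(&tree->lock);

    /* after: a single walk, with locking handled inside the xarray */
    entry = xa_erase(tree, offset);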

Remove zswap_invalidate_entry, since there is no need to
call zswap_rb_erase any more; use zswap_entry_free instead.

The "struct zswap_tree" has been replaced by "struct xarray".
The tree spin lock has transferred to the xarray lock.

Run the kernel build testing 10 times for each version, averages:
(memory.max=2GB, zswap shrinker and writeback enabled,
one 50GB swapfile, 24 HT core, 32 jobs)

mm-unstable-a824831a082f xarray v7
user 3547.264 3541.509
sys 531.176 526.111
real 200.752 201.334

---
Reviewed-by: Nhat Pham <[email protected]>

Signed-off-by: Chris Li <[email protected]>
---
Changes in v7:
- Update comments as suggested by Johannes and Yosry.
- Simplify some error handling code, suggested by Johannes.
- Link to v6: https://lore.kernel.org/r/[email protected]

Changes in v6:
- Add WARN_ONCE() for xa_store failed other than -ENOMEM.
- Collect review tags.
- Link to v5: https://lore.kernel.org/r/[email protected]

Changes in v5:
- Remove zswap_xa_insert(), call xa_store and xa_erase directly.
- Remove zswap_reject_xarray_fail.
- Link to v4: https://lore.kernel.org/r/[email protected]

Changes in v4:
- Remove zswap_xa_search_and_erase, use xa_erase directly.
- Move charge of objcg after zswap_xa_insert.
- Avoid erasing the old entry on the insert failure error path.
- Remove the unneeded swap_zswap_tree change.
- Link to v3: https://lore.kernel.org/r/[email protected]

Changes in v3:
- Use xa_cmpxchg instead of zswap_xa_search_and_delete in zswap_writeback_entry.
- Use xa_store in zswap_xa_insert directly. Reduce the scope of spinlock.
- Fix xa_store error handling for same page fill case.
- Link to v2: https://lore.kernel.org/r/[email protected]

Changes in v2:
- Replace struct zswap_tree with struct xarray.
- Remove zswap_tree spinlock, use xarray lock instead.
- Fold zswap_rb_erase() into zswap_xa_search_and_delete() and zswap_xa_insert().
- Delete zswap_invalidate_entry(), use zswap_free_entry() instead.
- Link to v1: https://lore.kernel.org/r/[email protected]
---
mm/zswap.c | 177 ++++++++++++++++++-------------------------------------------
1 file changed, 52 insertions(+), 125 deletions(-)

diff --git a/mm/zswap.c b/mm/zswap.c
index b31c977f53e9..d8a14b27adcd 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -20,7 +20,6 @@
#include <linux/spinlock.h>
#include <linux/types.h>
#include <linux/atomic.h>
-#include <linux/rbtree.h>
#include <linux/swap.h>
#include <linux/crypto.h>
#include <linux/scatterlist.h>
@@ -196,7 +195,6 @@ static struct shrinker *zswap_shrinker;
* This structure contains the metadata for tracking a single compressed
* page within zswap.
*
- * rbnode - links the entry into red-black tree for the appropriate swap type
* swpentry - associated swap entry, the offset indexes into the red-black tree
* length - the length in bytes of the compressed page data. Needed during
* decompression. For a same value filled page length is 0, and both
@@ -208,7 +206,6 @@ static struct shrinker *zswap_shrinker;
* lru - handle to the pool's lru used to evict pages.
*/
struct zswap_entry {
- struct rb_node rbnode;
swp_entry_t swpentry;
unsigned int length;
struct zswap_pool *pool;
@@ -220,12 +217,7 @@ struct zswap_entry {
struct list_head lru;
};

-struct zswap_tree {
- struct rb_root rbroot;
- spinlock_t lock;
-};
-
-static struct zswap_tree *zswap_trees[MAX_SWAPFILES];
+static struct xarray *zswap_trees[MAX_SWAPFILES];
static unsigned int nr_zswap_trees[MAX_SWAPFILES];

/* RCU-protected iteration */
@@ -253,7 +245,7 @@ static bool zswap_has_pool;
* helpers and fwd declarations
**********************************/

-static inline struct zswap_tree *swap_zswap_tree(swp_entry_t swp)
+static inline struct xarray *swap_zswap_tree(swp_entry_t swp)
{
return &zswap_trees[swp_type(swp)][swp_offset(swp)
>> SWAP_ADDRESS_SPACE_SHIFT];
@@ -792,63 +784,6 @@ void zswap_memcg_offline_cleanup(struct mem_cgroup *memcg)
spin_unlock(&zswap_shrink_lock);
}

-/*********************************
-* rbtree functions
-**********************************/
-static struct zswap_entry *zswap_rb_search(struct rb_root *root, pgoff_t offset)
-{
- struct rb_node *node = root->rb_node;
- struct zswap_entry *entry;
- pgoff_t entry_offset;
-
- while (node) {
- entry = rb_entry(node, struct zswap_entry, rbnode);
- entry_offset = swp_offset(entry->swpentry);
- if (entry_offset > offset)
- node = node->rb_left;
- else if (entry_offset < offset)
- node = node->rb_right;
- else
- return entry;
- }
- return NULL;
-}
-
-/*
- * In the case that a entry with the same offset is found, a pointer to
- * the existing entry is stored in dupentry and the function returns -EEXIST
- */
-static int zswap_rb_insert(struct rb_root *root, struct zswap_entry *entry,
- struct zswap_entry **dupentry)
-{
- struct rb_node **link = &root->rb_node, *parent = NULL;
- struct zswap_entry *myentry;
- pgoff_t myentry_offset, entry_offset = swp_offset(entry->swpentry);
-
- while (*link) {
- parent = *link;
- myentry = rb_entry(parent, struct zswap_entry, rbnode);
- myentry_offset = swp_offset(myentry->swpentry);
- if (myentry_offset > entry_offset)
- link = &(*link)->rb_left;
- else if (myentry_offset < entry_offset)
- link = &(*link)->rb_right;
- else {
- *dupentry = myentry;
- return -EEXIST;
- }
- }
- rb_link_node(&entry->rbnode, parent, link);
- rb_insert_color(&entry->rbnode, root);
- return 0;
-}
-
-static void zswap_rb_erase(struct rb_root *root, struct zswap_entry *entry)
-{
- rb_erase(&entry->rbnode, root);
- RB_CLEAR_NODE(&entry->rbnode);
-}
-
/*********************************
* zswap entry functions
**********************************/
@@ -860,7 +795,6 @@ static struct zswap_entry *zswap_entry_cache_alloc(gfp_t gfp, int nid)
entry = kmem_cache_alloc_node(zswap_entry_cache, gfp, nid);
if (!entry)
return NULL;
- RB_CLEAR_NODE(&entry->rbnode);
return entry;
}

@@ -896,17 +830,6 @@ static void zswap_entry_free(struct zswap_entry *entry)
atomic_dec(&zswap_stored_pages);
}

-/*
- * The caller hold the tree lock and search the entry from the tree,
- * so it must be on the tree, remove it from the tree and free it.
- */
-static void zswap_invalidate_entry(struct zswap_tree *tree,
- struct zswap_entry *entry)
-{
- zswap_rb_erase(&tree->rbroot, entry);
- zswap_entry_free(entry);
-}
-
/*********************************
* compressed storage functions
**********************************/
@@ -1106,7 +1029,8 @@ static void zswap_decompress(struct zswap_entry *entry, struct page *page)
static int zswap_writeback_entry(struct zswap_entry *entry,
swp_entry_t swpentry)
{
- struct zswap_tree *tree;
+ struct xarray *tree;
+ pgoff_t offset = swp_offset(swpentry);
struct folio *folio;
struct mempolicy *mpol;
bool folio_was_allocated;
@@ -1143,19 +1067,13 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
* be dereferenced.
*/
tree = swap_zswap_tree(swpentry);
- spin_lock(&tree->lock);
- if (zswap_rb_search(&tree->rbroot, swp_offset(swpentry)) != entry) {
- spin_unlock(&tree->lock);
+ if (entry != xa_cmpxchg(tree, offset, entry, NULL, GFP_KERNEL)) {
delete_from_swap_cache(folio);
folio_unlock(folio);
folio_put(folio);
return -ENOMEM;
}

- /* Safe to deref entry after the entry is verified above. */
- zswap_rb_erase(&tree->rbroot, entry);
- spin_unlock(&tree->lock);
-
zswap_decompress(entry, &folio->page);

count_vm_event(ZSWPWB);
@@ -1467,8 +1385,8 @@ bool zswap_store(struct folio *folio)
{
swp_entry_t swp = folio->swap;
pgoff_t offset = swp_offset(swp);
- struct zswap_tree *tree = swap_zswap_tree(swp);
- struct zswap_entry *entry, *dupentry;
+ struct xarray *tree = swap_zswap_tree(swp);
+ struct zswap_entry *entry, *old;
struct obj_cgroup *objcg = NULL;
struct mem_cgroup *memcg = NULL;
unsigned long max_pages, cur_pages;
@@ -1556,28 +1474,43 @@ bool zswap_store(struct folio *folio)
insert_entry:
entry->swpentry = swp;
entry->objcg = objcg;
+
+ old = xa_store(tree, offset, entry, GFP_KERNEL);
+ if (xa_is_err(old)) {
+ int err = xa_err(old);
+ WARN_ONCE(err != -ENOMEM, "unexpected xarray error: %d\n", err);
+ zswap_reject_alloc_fail++;
+ goto store_failed;
+ }
+
+ /*
+ * We may have had an existing entry that became stale when
+ * the folio was redirtied and now the new version is being
+ * swapped out. Get rid of the old.
+ */
+ if (old)
+ zswap_entry_free(old);
+
if (objcg) {
obj_cgroup_charge_zswap(objcg, entry->length);
- /* Account before objcg ref is moved to tree */
count_objcg_event(objcg, ZSWPOUT);
}

- /* map */
- spin_lock(&tree->lock);
/*
- * The folio may have been dirtied again, invalidate the
- * possibly stale entry before inserting the new entry.
+ * We finish initializing the entry while it's already in xarray.
+ * This is safe because:
+ *
+ * 1. Concurrent stores and invalidations are excluded by folio lock.
+ *
+ * 2. Writeback is excluded by the entry not being on the LRU yet.
+ * The publishing order matters to prevent writeback from seeing
+ * an incoherent entry.
*/
- if (zswap_rb_insert(&tree->rbroot, entry, &dupentry) == -EEXIST) {
- zswap_invalidate_entry(tree, dupentry);
- WARN_ON(zswap_rb_insert(&tree->rbroot, entry, &dupentry));
- }
if (entry->length) {
INIT_LIST_HEAD(&entry->lru);
zswap_lru_add(&zswap_list_lru, entry);
atomic_inc(&zswap_nr_stored);
}
- spin_unlock(&tree->lock);

/* update stats */
atomic_inc(&zswap_stored_pages);
@@ -1585,8 +1518,14 @@ bool zswap_store(struct folio *folio)

return true;

+store_failed:
+ if (!entry->length)
+ atomic_dec(&zswap_same_filled_pages);
+ else {
+ zpool_free(zswap_find_zpool(entry), entry->handle);
put_pool:
- zswap_pool_put(entry->pool);
+ zswap_pool_put(entry->pool);
+ }
freepage:
zswap_entry_cache_free(entry);
reject:
@@ -1597,11 +1536,9 @@ bool zswap_store(struct folio *folio)
* possibly stale entry which was previously stored at this offset.
* Otherwise, writeback could overwrite the new data in the swapfile.
*/
- spin_lock(&tree->lock);
- entry = zswap_rb_search(&tree->rbroot, offset);
+ entry = xa_erase(tree, offset);
if (entry)
- zswap_invalidate_entry(tree, entry);
- spin_unlock(&tree->lock);
+ zswap_entry_free(entry);
return false;

shrink:
@@ -1614,20 +1551,15 @@ bool zswap_load(struct folio *folio)
swp_entry_t swp = folio->swap;
pgoff_t offset = swp_offset(swp);
struct page *page = &folio->page;
- struct zswap_tree *tree = swap_zswap_tree(swp);
+ struct xarray *tree = swap_zswap_tree(swp);
struct zswap_entry *entry;
u8 *dst;

VM_WARN_ON_ONCE(!folio_test_locked(folio));

- spin_lock(&tree->lock);
- entry = zswap_rb_search(&tree->rbroot, offset);
- if (!entry) {
- spin_unlock(&tree->lock);
+ entry = xa_erase(tree, offset);
+ if (!entry)
return false;
- }
- zswap_rb_erase(&tree->rbroot, entry);
- spin_unlock(&tree->lock);

if (entry->length)
zswap_decompress(entry, page);
@@ -1651,19 +1583,17 @@ bool zswap_load(struct folio *folio)
void zswap_invalidate(swp_entry_t swp)
{
pgoff_t offset = swp_offset(swp);
- struct zswap_tree *tree = swap_zswap_tree(swp);
+ struct xarray *tree = swap_zswap_tree(swp);
struct zswap_entry *entry;

- spin_lock(&tree->lock);
- entry = zswap_rb_search(&tree->rbroot, offset);
+ entry = xa_erase(tree, offset);
if (entry)
- zswap_invalidate_entry(tree, entry);
- spin_unlock(&tree->lock);
+ zswap_entry_free(entry);
}

int zswap_swapon(int type, unsigned long nr_pages)
{
- struct zswap_tree *trees, *tree;
+ struct xarray *trees, *tree;
unsigned int nr, i;

nr = DIV_ROUND_UP(nr_pages, SWAP_ADDRESS_SPACE_PAGES);
@@ -1673,11 +1603,8 @@ int zswap_swapon(int type, unsigned long nr_pages)
return -ENOMEM;
}

- for (i = 0; i < nr; i++) {
- tree = trees + i;
- tree->rbroot = RB_ROOT;
- spin_lock_init(&tree->lock);
- }
+ for (i = 0; i < nr; i++)
+ xa_init(trees + i);

nr_zswap_trees[type] = nr;
zswap_trees[type] = trees;
@@ -1686,7 +1613,7 @@ int zswap_swapon(int type, unsigned long nr_pages)

void zswap_swapoff(int type)
{
- struct zswap_tree *trees = zswap_trees[type];
+ struct xarray *trees = zswap_trees[type];
unsigned int i;

if (!trees)
@@ -1694,7 +1621,7 @@ void zswap_swapoff(int type)

/* try_to_unuse() invalidated all the entries already */
for (i = 0; i < nr_zswap_trees[type]; i++)
- WARN_ON_ONCE(!RB_EMPTY_ROOT(&trees[i].rbroot));
+ WARN_ON_ONCE(!xa_empty(trees + i));

kvfree(trees);
nr_zswap_trees[type] = 0;

---
base-commit: a824831a082f1d8f9b51a4c0598e633d38555fcf
change-id: 20240104-zswap-xarray-716260e541e3

Best regards,
--
Chris Li <[email protected]>



2024-03-20 06:13:41

by Yosry Ahmed

Subject: Re: [PATCH v7] zswap: replace RB tree with xarray

On Tue, Mar 19, 2024 at 10:52:26PM -0700, Chris Li wrote:
> Very deep RB tree requires rebalance at times. That
> contributes to the zswap fault latencies. Xarray does not
> need to perform tree rebalance. Replacing RB tree to xarray
> can have some small performance gain.
>
> One small difference is that xarray insert might fail with
> ENOMEM, while RB tree insert does not allocate additional
> memory.
>
> The zswap_entry size will reduce a bit due to removing the
> RB node, which has two pointers and a color field. Xarray
> store the pointer in the xarray tree rather than the
> zswap_entry. Every entry has one pointer from the xarray
> tree. Overall, switching to xarray should save some memory,
> if the swap entries are densely packed.
>
> Notice the zswap_rb_search and zswap_rb_insert always
> followed by zswap_rb_erase. Use xa_erase and xa_store
> directly. That saves one tree lookup as well.
>
> Remove zswap_invalidate_entry due to no need to call
> zswap_rb_erase any more. Use zswap_free_entry instead.
>
> The "struct zswap_tree" has been replaced by "struct xarray".
> The tree spin lock has transferred to the xarray lock.
>
> Run the kernel build testing 10 times for each version, averages:
> (memory.max=2GB, zswap shrinker and writeback enabled,
> one 50GB swapfile, 24 HT core, 32 jobs)
>
> mm-unstable-a824831a082f xarray v7
> user 3547.264 3541.509
> sys 531.176 526.111
> real 200.752 201.334
>
> ---

I believe there shouldn't be a separator before the R-b and S-b tags below.

> Reviewed-by: Nhat Pham <[email protected]>
>
> Signed-off-by: Chris Li <[email protected]>

I have some comments below, with them addressed:

Acked-by: Yosry Ahmed <[email protected]>

[..]
> @@ -1556,28 +1474,43 @@ bool zswap_store(struct folio *folio)
> insert_entry:
> entry->swpentry = swp;
> entry->objcg = objcg;
> +
> + old = xa_store(tree, offset, entry, GFP_KERNEL);
> + if (xa_is_err(old)) {
> + int err = xa_err(old);

There should be a blank line after the declaration.

> + WARN_ONCE(err != -ENOMEM, "unexpected xarray error: %d\n", err);
> + zswap_reject_alloc_fail++;
> + goto store_failed;
> + }
> +
> + /*
> + * We may have had an existing entry that became stale when
> + * the folio was redirtied and now the new version is being
> + * swapped out. Get rid of the old.
> + */

This comment is mis-indented.

checkpatch would have caught these btw.

> + if (old)
> + zswap_entry_free(old);
> +
> if (objcg) {
> obj_cgroup_charge_zswap(objcg, entry->length);
> - /* Account before objcg ref is moved to tree */
> count_objcg_event(objcg, ZSWPOUT);
> }
>
> - /* map */
> - spin_lock(&tree->lock);
> /*
> - * The folio may have been dirtied again, invalidate the
> - * possibly stale entry before inserting the new entry.
> + * We finish initializing the entry while it's already in xarray.
> + * This is safe because:
> + *
> + * 1. Concurrent stores and invalidations are excluded by folio lock.
> + *
> + * 2. Writeback is excluded by the entry not being on the LRU yet.
> + * The publishing order matters to prevent writeback from seeing
> + * an incoherent entry.

As I mentioned before, writeback is also protected by the folio lock.
Concurrent writeback will find the folio in the swapcache and abort. The
fact that the entry is not on the LRU yet is just additional protection,
so I don't think the publishing order actually matters here. Right?

2024-03-20 06:34:32

by Chris Li

Subject: Re: [PATCH v7] zswap: replace RB tree with xarray

On Tue, Mar 19, 2024 at 11:13 PM Yosry Ahmed <[email protected]> wrote:
>
> On Tue, Mar 19, 2024 at 10:52:26PM -0700, Chris Li wrote:
> > Very deep RB tree requires rebalance at times. That
> > contributes to the zswap fault latencies. Xarray does not
> > need to perform tree rebalance. Replacing RB tree to xarray
> > can have some small performance gain.
> >
> > One small difference is that xarray insert might fail with
> > ENOMEM, while RB tree insert does not allocate additional
> > memory.
> >
> > The zswap_entry size will reduce a bit due to removing the
> > RB node, which has two pointers and a color field. Xarray
> > store the pointer in the xarray tree rather than the
> > zswap_entry. Every entry has one pointer from the xarray
> > tree. Overall, switching to xarray should save some memory,
> > if the swap entries are densely packed.
> >
> > Notice the zswap_rb_search and zswap_rb_insert always
> > followed by zswap_rb_erase. Use xa_erase and xa_store
> > directly. That saves one tree lookup as well.
> >
> > Remove zswap_invalidate_entry due to no need to call
> > zswap_rb_erase any more. Use zswap_free_entry instead.
> >
> > The "struct zswap_tree" has been replaced by "struct xarray".
> > The tree spin lock has transferred to the xarray lock.
> >
> > Run the kernel build testing 10 times for each version, averages:
> > (memory.max=2GB, zswap shrinker and writeback enabled,
> > one 50GB swapfile, 24 HT core, 32 jobs)
> >
> > mm-unstable-a824831a082f xarray v7
> > user 3547.264 3541.509
> > sys 531.176 526.111
> > real 200.752 201.334
> >
> > ---
>
> I believe there shouldn't be a separator before Rb and Sb below.

Ack.

>
> > Reviewed-by: Nhat Pham <[email protected]>
> >
> > Signed-off-by: Chris Li <[email protected]>
>
> I have some comments below, with them addressed:
>
> Acked-by: Yosry Ahmed <[email protected]>
>
> [..]
> > @@ -1556,28 +1474,43 @@ bool zswap_store(struct folio *folio)
> > insert_entry:
> > entry->swpentry = swp;
> > entry->objcg = objcg;
> > +
> > + old = xa_store(tree, offset, entry, GFP_KERNEL);
> > + if (xa_is_err(old)) {
> > + int err = xa_err(old);
>
> There should be a blank line after the declaration.
>
> > + WARN_ONCE(err != -ENOMEM, "unexpected xarray error: %d\n", err);
> > + zswap_reject_alloc_fail++;
> > + goto store_failed;
> > + }
> > +
> > + /*
> > + * We may have had an existing entry that became stale when
> > + * the folio was redirtied and now the new version is being
> > + * swapped out. Get rid of the old.
> > + */
>
> This comment is mis-indented.

Ah, there are some spaces instead of a tab because the comment was
copied from an email. Will fix it.


>
> checkpatch would have caught these btw.
>
> > + if (old)
> > + zswap_entry_free(old);
> > +
> > if (objcg) {
> > obj_cgroup_charge_zswap(objcg, entry->length);
> > - /* Account before objcg ref is moved to tree */
> > count_objcg_event(objcg, ZSWPOUT);
> > }
> >
> > - /* map */
> > - spin_lock(&tree->lock);
> > /*
> > - * The folio may have been dirtied again, invalidate the
> > - * possibly stale entry before inserting the new entry.
> > + * We finish initializing the entry while it's already in xarray.
> > + * This is safe because:
> > + *
> > + * 1. Concurrent stores and invalidations are excluded by folio lock.
> > + *
> > + * 2. Writeback is excluded by the entry not being on the LRU yet.
> > + * The publishing order matters to prevent writeback from seeing
> > + * an incoherent entry.
>
> As I mentioned before, writeback is also protected by the folio lock.
> Concurrent writeback will find the folio in the swapcache and abort. The
> fact that the entry is not on the LRU yet is just additional protection,
> so I don't think the publishing order actually matters here. Right?

Right. This comment is explaining why this publishing order does not
matter. I think we are talking about the same thing here?

Chris

2024-03-20 07:24:45

by Yosry Ahmed

Subject: Re: [PATCH v7] zswap: replace RB tree with xarray

[..]
> > > - /* map */
> > > - spin_lock(&tree->lock);
> > > /*
> > > - * The folio may have been dirtied again, invalidate the
> > > - * possibly stale entry before inserting the new entry.
> > > + * We finish initializing the entry while it's already in xarray.
> > > + * This is safe because:
> > > + *
> > > + * 1. Concurrent stores and invalidations are excluded by folio lock.
> > > + *
> > > + * 2. Writeback is excluded by the entry not being on the LRU yet.
> > > + * The publishing order matters to prevent writeback from seeing
> > > + * an incoherent entry.
> >
> > As I mentioned before, writeback is also protected by the folio lock.
> > Concurrent writeback will find the folio in the swapcache and abort. The
> > fact that the entry is not on the LRU yet is just additional protection,
> > so I don't think the publishing order actually matters here. Right?
>
> Right. This comment is explaining why this publishing order does not
> matter. I think we are talking about the same thing here?

The comment literally says "the publishing order matters.." :)

I believe Johannes meant that we should only publish the entry to the
LRU once it is fully initialized, to prevent writeback from using a
partially initialized entry.

What I am saying is that, even if we add a partially initialized entry
to the zswap LRU, writeback will skip it anyway because the folio is
locked in the swapcache.

So basically I think the comment should say:

/*
* We finish initializing the entry while it's already in the
* xarray. This is safe because the folio is locked in the swap
* cache, which should protect against concurrent stores,
* invalidations, and writeback.
*/

Johannes, what do you think?

2024-03-20 10:08:20

by Johannes Weiner

Subject: Re: [PATCH v7] zswap: replace RB tree with xarray

On Wed, Mar 20, 2024 at 07:24:27AM +0000, Yosry Ahmed wrote:
> [..]
> > > > - /* map */
> > > > - spin_lock(&tree->lock);
> > > > /*
> > > > - * The folio may have been dirtied again, invalidate the
> > > > - * possibly stale entry before inserting the new entry.
> > > > + * We finish initializing the entry while it's already in xarray.
> > > > + * This is safe because:
> > > > + *
> > > > + * 1. Concurrent stores and invalidations are excluded by folio lock.
> > > > + *
> > > > + * 2. Writeback is excluded by the entry not being on the LRU yet.
> > > > + * The publishing order matters to prevent writeback from seeing
> > > > + * an incoherent entry.
> > >
> > > As I mentioned before, writeback is also protected by the folio lock.
> > > Concurrent writeback will find the folio in the swapcache and abort. The
> > > fact that the entry is not on the LRU yet is just additional protection,
> > > so I don't think the publishing order actually matters here. Right?
> >
> > Right. This comment is explaining why this publishing order does not
> > matter. I think we are talking about the same thing here?
>
> The comment literally says "the publishing order matters.." :)
>
> I believe Johannes meant that we should only publish the entry to the
> LRU once it is fully initialized, to prevent writeback from using a
> partially initialized entry.
>
> What I am saying is that, even if we add a partially initialized entry
> to the zswap LRU, writeback will skip it anyway because the folio is
> locked in the swapcache.
>
> So basically I think the comment should say:
>
> /*
> * We finish initializing the entry while it's already in the
> * xarray. This is safe because the folio is locked in the swap
> * cache, which should protect against concurrent stores,
> * invalidations, and writeback.
> */
>
> Johannes, what do you think?

I don't think that's quite right.

Writeback will bail on swapcache insert, yes, but it will access the
entry before attempting it. If LRU publishing happened before setting
entry->swpentry, for example, we'd have a problem, while your comment
suggests it would be safe to rearrange the code like this.

So LRU publishing order does matter.
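
To make that concrete (an illustrative sketch only, not the exact
code; the writeback-side lines are my paraphrase of the current
shrinker path):

    /* writeback side: ->swpentry is read before the swapcache round trip */
    swpentry = entry->swpentry;
    folio = __read_swap_cache_async(swpentry, ...); /* caller bails if already in the swapcache */

    /* so this store-side ordering would be broken: */
    zswap_lru_add(&zswap_list_lru, entry);  /* entry published to the LRU... */
    entry->swpentry = swp;                  /* ...before this field is initialized */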

2024-03-20 10:14:59

by Johannes Weiner

Subject: Re: [PATCH v7] zswap: replace RB tree with xarray

On Tue, Mar 19, 2024 at 10:52:26PM -0700, Chris Li wrote:
> Very deep RB tree requires rebalance at times. That
> contributes to the zswap fault latencies. Xarray does not
> need to perform tree rebalance. Replacing RB tree to xarray
> can have some small performance gain.
>
> One small difference is that xarray insert might fail with
> ENOMEM, while RB tree insert does not allocate additional
> memory.
>
> The zswap_entry size will reduce a bit due to removing the
> RB node, which has two pointers and a color field. Xarray
> store the pointer in the xarray tree rather than the
> zswap_entry. Every entry has one pointer from the xarray
> tree. Overall, switching to xarray should save some memory,
> if the swap entries are densely packed.
>
> Notice the zswap_rb_search and zswap_rb_insert always
> followed by zswap_rb_erase. Use xa_erase and xa_store
> directly. That saves one tree lookup as well.
>
> Remove zswap_invalidate_entry due to no need to call
> zswap_rb_erase any more. Use zswap_free_entry instead.
>
> The "struct zswap_tree" has been replaced by "struct xarray".
> The tree spin lock has transferred to the xarray lock.
>
> Run the kernel build testing 10 times for each version, averages:
> (memory.max=2GB, zswap shrinker and writeback enabled,
> one 50GB swapfile, 24 HT core, 32 jobs)
>
> mm-unstable-a824831a082f xarray v7
> user 3547.264 3541.509
> sys 531.176 526.111
> real 200.752 201.334
>
> ---
> Reviewed-by: Nhat Pham <[email protected]>
>
> Signed-off-by: Chris Li <[email protected]>

Excellent! With the checkpatch issues fixed,

Acked-by: Johannes Weiner <[email protected]>

2024-03-20 18:35:09

by Chris Li

Subject: Re: [PATCH v7] zswap: replace RB tree with xarray

On Wed, Mar 20, 2024 at 3:08 AM Johannes Weiner <[email protected]> wrote:
>
> On Wed, Mar 20, 2024 at 07:24:27AM +0000, Yosry Ahmed wrote:
> > [..]
> > > > > - /* map */
> > > > > - spin_lock(&tree->lock);
> > > > > /*
> > > > > - * The folio may have been dirtied again, invalidate the
> > > > > - * possibly stale entry before inserting the new entry.
> > > > > + * We finish initializing the entry while it's already in xarray.
> > > > > + * This is safe because:
> > > > > + *
> > > > > + * 1. Concurrent stores and invalidations are excluded by folio lock.
> > > > > + *
> > > > > + * 2. Writeback is excluded by the entry not being on the LRU yet.
> > > > > + * The publishing order matters to prevent writeback from seeing
> > > > > + * an incoherent entry.
> > > >
> > > > As I mentioned before, writeback is also protected by the folio lock.
> > > > Concurrent writeback will find the folio in the swapcache and abort. The
> > > > fact that the entry is not on the LRU yet is just additional protection,
> > > > so I don't think the publishing order actually matters here. Right?
> > >
> > > Right. This comment is explaining why this publishing order does not
> > > matter. I think we are talking about the same thing here?
> >
> > The comment literally says "the publishing order matters.." :)
> >
> > I believe Johannes meant that we should only publish the entry to the
> > LRU once it is fully initialized, to prevent writeback from using a
> > partially initialized entry.
> >
> > What I am saying is that, even if we add a partially initialized entry
> > to the zswap LRU, writeback will skip it anyway because the folio is
> > locked in the swapcache.
> >
> > So basically I think the comment should say:
> >
> > /*
> > * We finish initializing the entry while it's already in the
> > * xarray. This is safe because the folio is locked in the swap
> > * cache, which should protect against concurrent stores,
> > * invalidations, and writeback.
> > */
> >
> > Johannes, what do you think?
>
> I don't think that's quite right.
>
> Writeback will bail on swapcache insert, yes, but it will access the
> entry before attempting it. If LRU publishing happened before setting
> entry->swpentry e.g., we'd have a problem, while your comment suggets
> it would be safe to rearrange the code like this.
>
> So LRU publishing order does matter.

Yes, I agree with Johannes on this one. The publishing order does
matter; it is not always safe to reorder around it. I will keep the
v7 comment here.

Chris

2024-03-20 19:11:58

by Yosry Ahmed

[permalink] [raw]
Subject: Re: [PATCH v7] zswap: replace RB tree with xarray

On Wed, Mar 20, 2024 at 06:08:03AM -0400, Johannes Weiner wrote:
> On Wed, Mar 20, 2024 at 07:24:27AM +0000, Yosry Ahmed wrote:
> > [..]
> > > > > - /* map */
> > > > > - spin_lock(&tree->lock);
> > > > > /*
> > > > > - * The folio may have been dirtied again, invalidate the
> > > > > - * possibly stale entry before inserting the new entry.
> > > > > + * We finish initializing the entry while it's already in xarray.
> > > > > + * This is safe because:
> > > > > + *
> > > > > + * 1. Concurrent stores and invalidations are excluded by folio lock.
> > > > > + *
> > > > > + * 2. Writeback is excluded by the entry not being on the LRU yet.
> > > > > + * The publishing order matters to prevent writeback from seeing
> > > > > + * an incoherent entry.
> > > >
> > > > As I mentioned before, writeback is also protected by the folio lock.
> > > > Concurrent writeback will find the folio in the swapcache and abort. The
> > > > fact that the entry is not on the LRU yet is just additional protection,
> > > > so I don't think the publishing order actually matters here. Right?
> > >
> > > Right. This comment is explaining why this publishing order does not
> > > matter. I think we are talking about the same thing here?
> >
> > The comment literally says "the publishing order matters.." :)
> >
> > I believe Johannes meant that we should only publish the entry to the
> > LRU once it is fully initialized, to prevent writeback from using a
> > partially initialized entry.
> >
> > What I am saying is that, even if we add a partially initialized entry
> > to the zswap LRU, writeback will skip it anyway because the folio is
> > locked in the swapcache.
> >
> > So basically I think the comment should say:
> >
> > /*
> > * We finish initializing the entry while it's already in the
> > * xarray. This is safe because the folio is locked in the swap
> > * cache, which should protect against concurrent stores,
> > * invalidations, and writeback.
> > */
> >
> > Johannes, what do you think?
>
> I don't think that's quite right.
>
> Writeback will bail on swapcache insert, yes, but it will access the
> entry before attempting it. If LRU publishing happened before setting
> entry->swpentry e.g., we'd have a problem, while your comment suggets
> it would be safe to rearrange the code like this.
>
> So LRU publishing order does matter.

Ah yes, you are right. entry->swpentry should be set to make sure we
look up the correct entry in the swapcache and the tree.

Perhaps we should spell this out in the comment and make the
initialization ordering more explicit? Maybe something like:

diff --git a/mm/zswap.c b/mm/zswap.c
index d8a14b27adcd7..70924b437743a 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -1472,9 +1472,6 @@ bool zswap_store(struct folio *folio)
goto put_pool;

insert_entry:
- entry->swpentry = swp;
- entry->objcg = objcg;
-
old = xa_store(tree, offset, entry, GFP_KERNEL);
if (xa_is_err(old)) {
int err = xa_err(old);
@@ -1491,6 +1488,7 @@ bool zswap_store(struct folio *folio)
if (old)
zswap_entry_free(old);

+ entry->objcg = objcg;
if (objcg) {
obj_cgroup_charge_zswap(objcg, entry->length);
count_objcg_event(objcg, ZSWPOUT);
@@ -1498,15 +1496,16 @@ bool zswap_store(struct folio *folio)

/*
* We finish initializing the entry while it's already in xarray.
- * This is safe because:
- *
- * 1. Concurrent stores and invalidations are excluded by folio lock.
+ * This is safe because the folio is locked in the swapcache, which
+ * protects against concurrent stores and invalidations.
*
- * 2. Writeback is excluded by the entry not being on the LRU yet.
- * The publishing order matters to prevent writeback from seeing
- * an incoherent entry.
+ * Concurrent writeback is not possible until we add the entry to the
+ * LRU. We need to at least initialize entry->swpentry *before* adding
+ * the entry to the LRU to make sure writeback looks up the correct
+ * entry in the swapcache.
*/
if (entry->length) {
+ entry->swpentry = swp;
INIT_LIST_HEAD(&entry->lru);
zswap_lru_add(&zswap_list_lru, entry);
atomic_inc(&zswap_nr_stored);


This also got me wondering, do we need a write barrier between
initializing entry->swpentry and zswap_lru_add()?

I guess if we read the wrong swpentry in zswap_writeback_entry() we will
eventually fail the xa_cmpxchg() and drop it anyway, but it seems
bug-prone.
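
(For reference, this is the check from the patch that would eventually
catch a wrong swpentry read:)

    tree = swap_zswap_tree(swpentry);
    if (entry != xa_cmpxchg(tree, offset, entry, NULL, GFP_KERNEL)) {
            /* the entry is no longer at that offset, back off */
            delete_from_swap_cache(folio);
            folio_unlock(folio);
            folio_put(folio);
            return -ENOMEM;
    }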

2024-03-20 19:26:18

by Johannes Weiner

Subject: Re: [PATCH v7] zswap: replace RB tree with xarray

On Wed, Mar 20, 2024 at 07:11:38PM +0000, Yosry Ahmed wrote:
> On Wed, Mar 20, 2024 at 06:08:03AM -0400, Johannes Weiner wrote:
> > On Wed, Mar 20, 2024 at 07:24:27AM +0000, Yosry Ahmed wrote:
> > > [..]
> > > > > > - /* map */
> > > > > > - spin_lock(&tree->lock);
> > > > > > /*
> > > > > > - * The folio may have been dirtied again, invalidate the
> > > > > > - * possibly stale entry before inserting the new entry.
> > > > > > + * We finish initializing the entry while it's already in xarray.
> > > > > > + * This is safe because:
> > > > > > + *
> > > > > > + * 1. Concurrent stores and invalidations are excluded by folio lock.
> > > > > > + *
> > > > > > + * 2. Writeback is excluded by the entry not being on the LRU yet.
> > > > > > + * The publishing order matters to prevent writeback from seeing
> > > > > > + * an incoherent entry.
> > > > >
> > > > > As I mentioned before, writeback is also protected by the folio lock.
> > > > > Concurrent writeback will find the folio in the swapcache and abort. The
> > > > > fact that the entry is not on the LRU yet is just additional protection,
> > > > > so I don't think the publishing order actually matters here. Right?
> > > >
> > > > Right. This comment is explaining why this publishing order does not
> > > > matter. I think we are talking about the same thing here?
> > >
> > > The comment literally says "the publishing order matters.." :)
> > >
> > > I believe Johannes meant that we should only publish the entry to the
> > > LRU once it is fully initialized, to prevent writeback from using a
> > > partially initialized entry.
> > >
> > > What I am saying is that, even if we add a partially initialized entry
> > > to the zswap LRU, writeback will skip it anyway because the folio is
> > > locked in the swapcache.
> > >
> > > So basically I think the comment should say:
> > >
> > > /*
> > > * We finish initializing the entry while it's already in the
> > > * xarray. This is safe because the folio is locked in the swap
> > > * cache, which should protect against concurrent stores,
> > > * invalidations, and writeback.
> > > */
> > >
> > > Johannes, what do you think?
> >
> > I don't think that's quite right.
> >
> > Writeback will bail on swapcache insert, yes, but it will access the
> > entry before attempting it. If LRU publishing happened before setting
> > entry->swpentry e.g., we'd have a problem, while your comment suggets
> > it would be safe to rearrange the code like this.
> >
> > So LRU publishing order does matter.
>
> Ah yes, you are right. entry->swpentry should be set to make sure we
> lookup the correct entry in the swapcache and the tree.
>
> Perhaps we should spell this out in the comment and make the
> initialization ordering more explicit? Maybe something like:
>
> diff --git a/mm/zswap.c b/mm/zswap.c
> index d8a14b27adcd7..70924b437743a 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -1472,9 +1472,6 @@ bool zswap_store(struct folio *folio)
> goto put_pool;
>
> insert_entry:
> - entry->swpentry = swp;
> - entry->objcg = objcg;
> -
> old = xa_store(tree, offset, entry, GFP_KERNEL);
> if (xa_is_err(old)) {
> int err = xa_err(old);
> @@ -1491,6 +1488,7 @@ bool zswap_store(struct folio *folio)
> if (old)
> zswap_entry_free(old);
>
> + entry->objcg = objcg;
> if (objcg) {
> obj_cgroup_charge_zswap(objcg, entry->length);
> count_objcg_event(objcg, ZSWPOUT);
> @@ -1498,15 +1496,16 @@ bool zswap_store(struct folio *folio)
>
> /*
> * We finish initializing the entry while it's already in xarray.
> - * This is safe because:
> - *
> - * 1. Concurrent stores and invalidations are excluded by folio lock.
> + * This is safe because the folio is locked in the swapcache, which
> + * protects against concurrent stores and invalidations.
> *
> - * 2. Writeback is excluded by the entry not being on the LRU yet.
> - * The publishing order matters to prevent writeback from seeing
> - * an incoherent entry.
> + * Concurrent writeback is not possible until we add the entry to the
> + * LRU. We need to at least initialize entry->swpentry *before* adding
> + * the entry to the LRU to make sure writeback looks up the correct
> + * entry in the swapcache.
> */
> if (entry->length) {
> + entry->swpentry = swp;
> INIT_LIST_HEAD(&entry->lru);
> zswap_lru_add(&zswap_list_lru, entry);
> atomic_inc(&zswap_nr_stored);
>
>
> This also got me wondering, do we need a write barrier between
> initializing entry->swpentry and zswap_lru_add()?
>
> I guess if we read the wrong swpentry in zswap_writeback_entry() we will
> eventually fail the xa_cmpxchg() and drop it anyway, but it seems
> bug-prone.

I think it's more robust the way Chris has it now. Writeback only
derefs ->swpentry today, but who knows if somebody wants to make a
change that relies on a different member. Having submembers follow
different validity rules and timelines is error prone and makes the
code less hackable without buying all that much. The concept of
"publishing" an object like this is more common: if you can see it,
you can expect it to be coherent.

2024-03-20 19:34:46

by Yosry Ahmed

Subject: Re: [PATCH v7] zswap: replace RB tree with xarray

On Wed, Mar 20, 2024 at 03:25:58PM -0400, Johannes Weiner wrote:
> On Wed, Mar 20, 2024 at 07:11:38PM +0000, Yosry Ahmed wrote:
> > On Wed, Mar 20, 2024 at 06:08:03AM -0400, Johannes Weiner wrote:
> > > On Wed, Mar 20, 2024 at 07:24:27AM +0000, Yosry Ahmed wrote:
> > > > [..]
> > > > > > > - /* map */
> > > > > > > - spin_lock(&tree->lock);
> > > > > > > /*
> > > > > > > - * The folio may have been dirtied again, invalidate the
> > > > > > > - * possibly stale entry before inserting the new entry.
> > > > > > > + * We finish initializing the entry while it's already in xarray.
> > > > > > > + * This is safe because:
> > > > > > > + *
> > > > > > > + * 1. Concurrent stores and invalidations are excluded by folio lock.
> > > > > > > + *
> > > > > > > + * 2. Writeback is excluded by the entry not being on the LRU yet.
> > > > > > > + * The publishing order matters to prevent writeback from seeing
> > > > > > > + * an incoherent entry.
> > > > > >
> > > > > > As I mentioned before, writeback is also protected by the folio lock.
> > > > > > Concurrent writeback will find the folio in the swapcache and abort. The
> > > > > > fact that the entry is not on the LRU yet is just additional protection,
> > > > > > so I don't think the publishing order actually matters here. Right?
> > > > >
> > > > > Right. This comment is explaining why this publishing order does not
> > > > > matter. I think we are talking about the same thing here?
> > > >
> > > > The comment literally says "the publishing order matters.." :)
> > > >
> > > > I believe Johannes meant that we should only publish the entry to the
> > > > LRU once it is fully initialized, to prevent writeback from using a
> > > > partially initialized entry.
> > > >
> > > > What I am saying is that, even if we add a partially initialized entry
> > > > to the zswap LRU, writeback will skip it anyway because the folio is
> > > > locked in the swapcache.
> > > >
> > > > So basically I think the comment should say:
> > > >
> > > > /*
> > > > * We finish initializing the entry while it's already in the
> > > > * xarray. This is safe because the folio is locked in the swap
> > > > * cache, which should protect against concurrent stores,
> > > > * invalidations, and writeback.
> > > > */
> > > >
> > > > Johannes, what do you think?
> > >
> > > I don't think that's quite right.
> > >
> > > Writeback will bail on swapcache insert, yes, but it will access the
> > > entry before attempting it. If LRU publishing happened before setting
> > > entry->swpentry e.g., we'd have a problem, while your comment suggets
> > > it would be safe to rearrange the code like this.
> > >
> > > So LRU publishing order does matter.
> >
> > Ah yes, you are right. entry->swpentry should be set to make sure we
> > lookup the correct entry in the swapcache and the tree.
> >
> > Perhaps we should spell this out in the comment and make the
> > initialization ordering more explicit? Maybe something like:
> >
> > diff --git a/mm/zswap.c b/mm/zswap.c
> > index d8a14b27adcd7..70924b437743a 100644
> > --- a/mm/zswap.c
> > +++ b/mm/zswap.c
> > @@ -1472,9 +1472,6 @@ bool zswap_store(struct folio *folio)
> > goto put_pool;
> >
> > insert_entry:
> > - entry->swpentry = swp;
> > - entry->objcg = objcg;
> > -
> > old = xa_store(tree, offset, entry, GFP_KERNEL);
> > if (xa_is_err(old)) {
> > int err = xa_err(old);
> > @@ -1491,6 +1488,7 @@ bool zswap_store(struct folio *folio)
> > if (old)
> > zswap_entry_free(old);
> >
> > + entry->objcg = objcg;
> > if (objcg) {
> > obj_cgroup_charge_zswap(objcg, entry->length);
> > count_objcg_event(objcg, ZSWPOUT);
> > @@ -1498,15 +1496,16 @@ bool zswap_store(struct folio *folio)
> >
> > /*
> > * We finish initializing the entry while it's already in xarray.
> > - * This is safe because:
> > - *
> > - * 1. Concurrent stores and invalidations are excluded by folio lock.
> > + * This is safe because the folio is locked in the swapcache, which
> > + * protects against concurrent stores and invalidations.
> > *
> > - * 2. Writeback is excluded by the entry not being on the LRU yet.
> > - * The publishing order matters to prevent writeback from seeing
> > - * an incoherent entry.
> > + * Concurrent writeback is not possible until we add the entry to the
> > + * LRU. We need to at least initialize entry->swpentry *before* adding
> > + * the entry to the LRU to make sure writeback looks up the correct
> > + * entry in the swapcache.
> > */
> > if (entry->length) {
> > + entry->swpentry = swp;
> > INIT_LIST_HEAD(&entry->lru);
> > zswap_lru_add(&zswap_list_lru, entry);
> > atomic_inc(&zswap_nr_stored);
> >
> >
> > This also got me wondering, do we need a write barrier between
> > initializing entry->swpentry and zswap_lru_add()?
> >
> > I guess if we read the wrong swpentry in zswap_writeback_entry() we will
> > eventually fail the xa_cmpxchg() and drop it anyway, but it seems
> > bug-prone.
>
> I think it's more robust the way Chris has it now. Writeback only
> derefs ->swpentry today, but who knows if somebody wants to make a
> changes that relies on a different member. Having submembers follow
> different validity rules and timelines is error prone and makes the
> code less hackable without buying all that much. The concept of
> "publishing" an object like this is more common: if you can see it,
> you can expect it to be coherent.

Fair enough, but don't we still need a barrier there? Couldn't some
initializations still be reorder after zswap_lru_add()?

2024-03-20 19:42:09

by Chris Li

Subject: Re: [PATCH v7] zswap: replace RB tree with xarray

On Wed, Mar 20, 2024 at 12:34 PM Yosry Ahmed <[email protected]> wrote:
> > you can expect it to be coherent.
>
> Fair enough, but don't we still need a barrier there? Couldn't some
> initializations still be reorder after zswap_lru_add()?

I am under the impression that the lru list handling function already
handles the barrier. If an explicit barrier is needed, wouldn't all
other lru callers have barriers sprinkled all over the place?

Chris

2024-03-20 19:46:49

by Yosry Ahmed

Subject: Re: [PATCH v7] zswap: replace RB tree with xarray

On Wed, Mar 20, 2024 at 12:41:48PM -0700, Chris Li wrote:
> On Wed, Mar 20, 2024 at 12:34 PM Yosry Ahmed <[email protected]> wrote:
> > > you can expect it to be coherent.
> >
> > Fair enough, but don't we still need a barrier there? Couldn't some
> > initializations still be reorder after zswap_lru_add()?
>
> I am under the impression that the lru list handling function already
> handles the barrier. If an explicit barrier is needed, wouldn't all
> other lru callers have barriers sprinkling all over the place?

list_lru_add() holds a lock, which implies a compiler barrier. I was
wondering if we need smp_wmb() to prevent other CPUs from observing a
partially initialized entry during writeback.

I am not sure if the list lru makes such assumptions about the need for
barriers.

2024-03-20 20:03:37

by Johannes Weiner

Subject: Re: [PATCH v7] zswap: replace RB tree with xarray

On Wed, Mar 20, 2024 at 07:34:36PM +0000, Yosry Ahmed wrote:
> On Wed, Mar 20, 2024 at 03:25:58PM -0400, Johannes Weiner wrote:
> > On Wed, Mar 20, 2024 at 07:11:38PM +0000, Yosry Ahmed wrote:
> > > On Wed, Mar 20, 2024 at 06:08:03AM -0400, Johannes Weiner wrote:
> > > > On Wed, Mar 20, 2024 at 07:24:27AM +0000, Yosry Ahmed wrote:
> > > > > [..]
> > > > > > > > - /* map */
> > > > > > > > - spin_lock(&tree->lock);
> > > > > > > > /*
> > > > > > > > - * The folio may have been dirtied again, invalidate the
> > > > > > > > - * possibly stale entry before inserting the new entry.
> > > > > > > > + * We finish initializing the entry while it's already in xarray.
> > > > > > > > + * This is safe because:
> > > > > > > > + *
> > > > > > > > + * 1. Concurrent stores and invalidations are excluded by folio lock.
> > > > > > > > + *
> > > > > > > > + * 2. Writeback is excluded by the entry not being on the LRU yet.
> > > > > > > > + * The publishing order matters to prevent writeback from seeing
> > > > > > > > + * an incoherent entry.
> > > > > > >
> > > > > > > As I mentioned before, writeback is also protected by the folio lock.
> > > > > > > Concurrent writeback will find the folio in the swapcache and abort. The
> > > > > > > fact that the entry is not on the LRU yet is just additional protection,
> > > > > > > so I don't think the publishing order actually matters here. Right?
> > > > > >
> > > > > > Right. This comment is explaining why this publishing order does not
> > > > > > matter. I think we are talking about the same thing here?
> > > > >
> > > > > The comment literally says "the publishing order matters.." :)
> > > > >
> > > > > I believe Johannes meant that we should only publish the entry to the
> > > > > LRU once it is fully initialized, to prevent writeback from using a
> > > > > partially initialized entry.
> > > > >
> > > > > What I am saying is that, even if we add a partially initialized entry
> > > > > to the zswap LRU, writeback will skip it anyway because the folio is
> > > > > locked in the swapcache.
> > > > >
> > > > > So basically I think the comment should say:
> > > > >
> > > > > /*
> > > > > * We finish initializing the entry while it's already in the
> > > > > * xarray. This is safe because the folio is locked in the swap
> > > > > * cache, which should protect against concurrent stores,
> > > > > * invalidations, and writeback.
> > > > > */
> > > > >
> > > > > Johannes, what do you think?
> > > >
> > > > I don't think that's quite right.
> > > >
> > > > Writeback will bail on swapcache insert, yes, but it will access the
> > > > entry before attempting it. If LRU publishing happened before setting
> > > > entry->swpentry e.g., we'd have a problem, while your comment suggets
> > > > it would be safe to rearrange the code like this.
> > > >
> > > > So LRU publishing order does matter.
> > >
> > > Ah yes, you are right. entry->swpentry should be set to make sure we
> > > lookup the correct entry in the swapcache and the tree.
> > >
> > > Perhaps we should spell this out in the comment and make the
> > > initialization ordering more explicit? Maybe something like:
> > >
> > > diff --git a/mm/zswap.c b/mm/zswap.c
> > > index d8a14b27adcd7..70924b437743a 100644
> > > --- a/mm/zswap.c
> > > +++ b/mm/zswap.c
> > > @@ -1472,9 +1472,6 @@ bool zswap_store(struct folio *folio)
> > > goto put_pool;
> > >
> > > insert_entry:
> > > - entry->swpentry = swp;
> > > - entry->objcg = objcg;
> > > -
> > > old = xa_store(tree, offset, entry, GFP_KERNEL);
> > > if (xa_is_err(old)) {
> > > int err = xa_err(old);
> > > @@ -1491,6 +1488,7 @@ bool zswap_store(struct folio *folio)
> > > if (old)
> > > zswap_entry_free(old);
> > >
> > > + entry->objcg = objcg;
> > > if (objcg) {
> > > obj_cgroup_charge_zswap(objcg, entry->length);
> > > count_objcg_event(objcg, ZSWPOUT);
> > > @@ -1498,15 +1496,16 @@ bool zswap_store(struct folio *folio)
> > >
> > > /*
> > > * We finish initializing the entry while it's already in xarray.
> > > - * This is safe because:
> > > - *
> > > - * 1. Concurrent stores and invalidations are excluded by folio lock.
> > > + * This is safe because the folio is locked in the swapcache, which
> > > + * protects against concurrent stores and invalidations.
> > > *
> > > - * 2. Writeback is excluded by the entry not being on the LRU yet.
> > > - * The publishing order matters to prevent writeback from seeing
> > > - * an incoherent entry.
> > > + * Concurrent writeback is not possible until we add the entry to the
> > > + * LRU. We need to at least initialize entry->swpentry *before* adding
> > > + * the entry to the LRU to make sure writeback looks up the correct
> > > + * entry in the swapcache.
> > > */
> > > if (entry->length) {
> > > + entry->swpentry = swp;
> > > INIT_LIST_HEAD(&entry->lru);
> > > zswap_lru_add(&zswap_list_lru, entry);
> > > atomic_inc(&zswap_nr_stored);
> > >
> > >
> > > This also got me wondering, do we need a write barrier between
> > > initializing entry->swpentry and zswap_lru_add()?
> > >
> > > I guess if we read the wrong swpentry in zswap_writeback_entry() we will
> > > eventually fail the xa_cmpxchg() and drop it anyway, but it seems
> > > bug-prone.
> >
> > I think it's more robust the way Chris has it now. Writeback only
> > derefs ->swpentry today, but who knows if somebody wants to make a
> > changes that relies on a different member. Having submembers follow
> > different validity rules and timelines is error prone and makes the
> > code less hackable without buying all that much. The concept of
> > "publishing" an object like this is more common: if you can see it,
> > you can expect it to be coherent.
>
> Fair enough, but don't we still need a barrier there? Couldn't some
> initializations still be reorder after zswap_lru_add()?

Only if it were lockless. The LRU unlocking in zswap_store() implies
RELEASE, the LRU locking in writeback implies ACQUIRE. Those force the
desired ordering - nothing can bleed after RELEASE, nothing can bleed
before ACQUIRE.
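
Schematically (an illustrative sketch; "lru_lock" stands in for the
list_lru node lock, which is really taken inside the list_lru
helpers):

    /* zswap_store(), CPU A */
    entry->swpentry = swp;
    ...
    spin_lock(&lru_lock);
    list_add(&entry->lru, &list);
    spin_unlock(&lru_lock);         /* RELEASE */

    /* shrink_memcg_cb(), CPU B */
    spin_lock(&lru_lock);           /* ACQUIRE, pairs with the unlock above */
    entry = list_entry(...);
    swpentry = entry->swpentry;     /* guaranteed to see CPU A's store */
    spin_unlock(&lru_lock);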

2024-03-20 20:12:39

by Yosry Ahmed

Subject: Re: [PATCH v7] zswap: replace RB tree with xarray

On Wed, Mar 20, 2024 at 04:03:22PM -0400, Johannes Weiner wrote:
> On Wed, Mar 20, 2024 at 07:34:36PM +0000, Yosry Ahmed wrote:
> > On Wed, Mar 20, 2024 at 03:25:58PM -0400, Johannes Weiner wrote:
> > > On Wed, Mar 20, 2024 at 07:11:38PM +0000, Yosry Ahmed wrote:
> > > > On Wed, Mar 20, 2024 at 06:08:03AM -0400, Johannes Weiner wrote:
> > > > > On Wed, Mar 20, 2024 at 07:24:27AM +0000, Yosry Ahmed wrote:
> > > > > > [..]
> > > > > > > > > - /* map */
> > > > > > > > > - spin_lock(&tree->lock);
> > > > > > > > > /*
> > > > > > > > > - * The folio may have been dirtied again, invalidate the
> > > > > > > > > - * possibly stale entry before inserting the new entry.
> > > > > > > > > + * We finish initializing the entry while it's already in xarray.
> > > > > > > > > + * This is safe because:
> > > > > > > > > + *
> > > > > > > > > + * 1. Concurrent stores and invalidations are excluded by folio lock.
> > > > > > > > > + *
> > > > > > > > > + * 2. Writeback is excluded by the entry not being on the LRU yet.
> > > > > > > > > + * The publishing order matters to prevent writeback from seeing
> > > > > > > > > + * an incoherent entry.
> > > > > > > >
> > > > > > > > As I mentioned before, writeback is also protected by the folio lock.
> > > > > > > > Concurrent writeback will find the folio in the swapcache and abort. The
> > > > > > > > fact that the entry is not on the LRU yet is just additional protection,
> > > > > > > > so I don't think the publishing order actually matters here. Right?
> > > > > > >
> > > > > > > Right. This comment is explaining why this publishing order does not
> > > > > > > matter. I think we are talking about the same thing here?
> > > > > >
> > > > > > The comment literally says "the publishing order matters.." :)
> > > > > >
> > > > > > I believe Johannes meant that we should only publish the entry to the
> > > > > > LRU once it is fully initialized, to prevent writeback from using a
> > > > > > partially initialized entry.
> > > > > >
> > > > > > What I am saying is that, even if we add a partially initialized entry
> > > > > > to the zswap LRU, writeback will skip it anyway because the folio is
> > > > > > locked in the swapcache.
> > > > > >
> > > > > > So basically I think the comment should say:
> > > > > >
> > > > > > /*
> > > > > > * We finish initializing the entry while it's already in the
> > > > > > * xarray. This is safe because the folio is locked in the swap
> > > > > > * cache, which should protect against concurrent stores,
> > > > > > * invalidations, and writeback.
> > > > > > */
> > > > > >
> > > > > > Johannes, what do you think?
> > > > >
> > > > > I don't think that's quite right.
> > > > >
> > > > > Writeback will bail on swapcache insert, yes, but it will access the
> > > > > entry before attempting it. If LRU publishing happened before setting
> > > > > entry->swpentry e.g., we'd have a problem, while your comment suggets
> > > > > it would be safe to rearrange the code like this.
> > > > >
> > > > > So LRU publishing order does matter.
> > > >
> > > > Ah yes, you are right. entry->swpentry should be set to make sure we
> > > > lookup the correct entry in the swapcache and the tree.
> > > >
> > > > Perhaps we should spell this out in the comment and make the
> > > > initialization ordering more explicit? Maybe something like:
> > > >
> > > > diff --git a/mm/zswap.c b/mm/zswap.c
> > > > index d8a14b27adcd7..70924b437743a 100644
> > > > --- a/mm/zswap.c
> > > > +++ b/mm/zswap.c
> > > > @@ -1472,9 +1472,6 @@ bool zswap_store(struct folio *folio)
> > > > goto put_pool;
> > > >
> > > > insert_entry:
> > > > - entry->swpentry = swp;
> > > > - entry->objcg = objcg;
> > > > -
> > > > old = xa_store(tree, offset, entry, GFP_KERNEL);
> > > > if (xa_is_err(old)) {
> > > > int err = xa_err(old);
> > > > @@ -1491,6 +1488,7 @@ bool zswap_store(struct folio *folio)
> > > > if (old)
> > > > zswap_entry_free(old);
> > > >
> > > > + entry->objcg = objcg;
> > > > if (objcg) {
> > > > obj_cgroup_charge_zswap(objcg, entry->length);
> > > > count_objcg_event(objcg, ZSWPOUT);
> > > > @@ -1498,15 +1496,16 @@ bool zswap_store(struct folio *folio)
> > > >
> > > > /*
> > > > * We finish initializing the entry while it's already in xarray.
> > > > - * This is safe because:
> > > > - *
> > > > - * 1. Concurrent stores and invalidations are excluded by folio lock.
> > > > + * This is safe because the folio is locked in the swapcache, which
> > > > + * protects against concurrent stores and invalidations.
> > > > *
> > > > - * 2. Writeback is excluded by the entry not being on the LRU yet.
> > > > - * The publishing order matters to prevent writeback from seeing
> > > > - * an incoherent entry.
> > > > + * Concurrent writeback is not possible until we add the entry to the
> > > > + * LRU. We need to at least initialize entry->swpentry *before* adding
> > > > + * the entry to the LRU to make sure writeback looks up the correct
> > > > + * entry in the swapcache.
> > > > */
> > > > if (entry->length) {
> > > > + entry->swpentry = swp;
> > > > INIT_LIST_HEAD(&entry->lru);
> > > > zswap_lru_add(&zswap_list_lru, entry);
> > > > atomic_inc(&zswap_nr_stored);
> > > >
> > > >
> > > > This also got me wondering, do we need a write barrier between
> > > > initializing entry->swpentry and zswap_lru_add()?
> > > >
> > > > I guess if we read the wrong swpentry in zswap_writeback_entry() we will
> > > > eventually fail the xa_cmpxchg() and drop it anyway, but it seems
> > > > bug-prone.
> > >
> > > I think it's more robust the way Chris has it now. Writeback only
> > > derefs ->swpentry today, but who knows if somebody wants to make a
> > > changes that relies on a different member. Having submembers follow
> > > different validity rules and timelines is error prone and makes the
> > > code less hackable without buying all that much. The concept of
> > > "publishing" an object like this is more common: if you can see it,
> > > you can expect it to be coherent.
> >
> > Fair enough, but don't we still need a barrier there? Couldn't some
> > initializations still be reorder after zswap_lru_add()?
>
> Only if it were lockless. The LRU unlocking in zswap_store() implies
> RELEASE, the LRU locking in writeback implies ACQUIRE. Those force the
> desired ordering - nothing can bleed after RELEASE, nothing can bleed
> before ACQUIRE.

Ah yes, I found smp_store_release() deep in the spin_unlock() code.
Thanks for pointing this out, all is good in the world.