A very deep RB tree requires rebalancing at times, which contributes
to zswap fault latency. An xarray does not need to perform tree
rebalancing, so replacing the RB tree with an xarray gives a small
performance gain.
One small difference is that an xarray insert might fail with ENOMEM,
while an RB tree insert does not allocate additional memory.
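For example, a store through the xarray has to handle an allocation
failure that has no RB tree equivalent (illustrative sketch, not taken
from this patch):

	old = xa_store(tree, offset, entry, GFP_KERNEL);
	err = xa_err(old);	/* may be -ENOMEM; rb_link_node() cannot fail */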
The zswap_entry size shrinks a bit with the removal of the RB node,
which holds two pointers and a color field. The xarray stores the entry
pointer in the xarray tree rather than in the zswap_entry, so every
entry costs one pointer in the xarray tree. Overall, switching to the
xarray should save some memory if the swap entries are densely packed.
Notice that zswap_rb_search and zswap_rb_insert are always followed by
zswap_rb_erase. Fold the entry erase into zswap_xa_search_and_erase and
zswap_xa_insert, which saves one tree lookup as well.
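For example, the invalidate path becomes a single pass over the tree
(sketch of the pattern, matching the new helpers in the diff below):

	entry = zswap_xa_search_and_erase(tree, offset);	/* lookup + erase */
	if (entry)
		zswap_entry_free(entry);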
Remove zswap_invalidate_entry since there is no longer any need to
call zswap_rb_erase; use zswap_entry_free instead.
The "struct zswap_tree" has been replaced by "struct xarray".
The tree spin lock has transferred to the xarray lock.
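For example, the map step in zswap_store() now takes the xarray lock
directly (excerpted pattern; see the diff below):

	xa_lock(tree);
	err = zswap_xa_insert(tree, entry, &old);
	...
	xa_unlock(tree);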
Thanks to Chengming for providing the kernel build test numbers.

Ran the kernel build test 5 times for each version; averages below
(memory.max=2GB, zswap shrinker and writeback enabled, one 50GB
swapfile):
        mm-266f922c0b5e    zswap-xarray-test
real          63.43               63.12
user        1063.78             1062.59
sys          272.49              265.66
The sys time is about 2.5% faster.
Signed-off-by: Chris Li <[email protected]>
Tested-by: Chengming Zhou <[email protected]>
---
Changes in v2:
- Replace struct zswap_tree with struct xarray.
- Remove zswap_tree spinlock, use xarray lock instead.
- Fold zswap_rb_erase() into zswap_xa_search_and_erase() and zswap_xa_insert().
- Delete zswap_invalidate_entry(), use zswap_entry_free() instead.
- Link to v1: https://lore.kernel.org/r/[email protected]
---
mm/zswap.c | 173 +++++++++++++++++++++++--------------------------------------
1 file changed, 64 insertions(+), 109 deletions(-)
diff --git a/mm/zswap.c b/mm/zswap.c
index 011e068eb355..ac9ef14d88be 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -20,7 +20,6 @@
#include <linux/spinlock.h>
#include <linux/types.h>
#include <linux/atomic.h>
-#include <linux/rbtree.h>
#include <linux/swap.h>
#include <linux/crypto.h>
#include <linux/scatterlist.h>
@@ -71,6 +70,8 @@ static u64 zswap_reject_compress_poor;
static u64 zswap_reject_alloc_fail;
/* Store failed because the entry metadata could not be allocated (rare) */
static u64 zswap_reject_kmemcache_fail;
+/* Store failed because the xarray can't insert the entry */
+static u64 zswap_reject_xarray_fail;
/* Shrinker work queue */
static struct workqueue_struct *shrink_wq;
@@ -196,7 +197,6 @@ static struct {
* This structure contains the metadata for tracking a single compressed
* page within zswap.
*
- * rbnode - links the entry into red-black tree for the appropriate swap type
* swpentry - associated swap entry, the offset indexes into the red-black tree
* length - the length in bytes of the compressed page data. Needed during
* decompression. For a same value filled page length is 0, and both
@@ -208,7 +208,6 @@ static struct {
* lru - handle to the pool's lru used to evict pages.
*/
struct zswap_entry {
- struct rb_node rbnode;
swp_entry_t swpentry;
unsigned int length;
struct zswap_pool *pool;
@@ -220,12 +219,7 @@ struct zswap_entry {
struct list_head lru;
};
-struct zswap_tree {
- struct rb_root rbroot;
- spinlock_t lock;
-};
-
-static struct zswap_tree *zswap_trees[MAX_SWAPFILES];
+static struct xarray *zswap_trees[MAX_SWAPFILES];
static unsigned int nr_zswap_trees[MAX_SWAPFILES];
/* RCU-protected iteration */
@@ -253,10 +247,10 @@ static bool zswap_has_pool;
* helpers and fwd declarations
**********************************/
-static inline struct zswap_tree *swap_zswap_tree(swp_entry_t swp)
+static inline struct xarray *swap_zswap_tree(swp_entry_t swp)
{
- return &zswap_trees[swp_type(swp)][swp_offset(swp)
- >> SWAP_ADDRESS_SPACE_SHIFT];
+ return zswap_trees[swp_type(swp)] + (swp_offset(swp)
+ >> SWAP_ADDRESS_SPACE_SHIFT);
}
#define zswap_pool_debug(msg, p) \
@@ -805,60 +799,38 @@ void zswap_memcg_offline_cleanup(struct mem_cgroup *memcg)
}
/*********************************
-* rbtree functions
+* xarray functions
**********************************/
-static struct zswap_entry *zswap_rb_search(struct rb_root *root, pgoff_t offset)
+static struct zswap_entry *zswap_xa_search_and_erase(struct xarray *tree, pgoff_t offset)
{
- struct rb_node *node = root->rb_node;
- struct zswap_entry *entry;
- pgoff_t entry_offset;
-
- while (node) {
- entry = rb_entry(node, struct zswap_entry, rbnode);
- entry_offset = swp_offset(entry->swpentry);
- if (entry_offset > offset)
- node = node->rb_left;
- else if (entry_offset < offset)
- node = node->rb_right;
- else
- return entry;
- }
- return NULL;
+ return xa_erase(tree, offset);
}
/*
+ * Expects xa_lock to be held on entry.
* In the case that a entry with the same offset is found, a pointer to
- * the existing entry is stored in dupentry and the function returns -EEXIST
+ * the existing entry is stored in old and erased from the tree.
+ * The function returns an error if the insert fails.
*/
-static int zswap_rb_insert(struct rb_root *root, struct zswap_entry *entry,
- struct zswap_entry **dupentry)
+static int zswap_xa_insert(struct xarray *tree, struct zswap_entry *entry,
+ struct zswap_entry **old)
{
- struct rb_node **link = &root->rb_node, *parent = NULL;
- struct zswap_entry *myentry;
- pgoff_t myentry_offset, entry_offset = swp_offset(entry->swpentry);
-
- while (*link) {
- parent = *link;
- myentry = rb_entry(parent, struct zswap_entry, rbnode);
- myentry_offset = swp_offset(myentry->swpentry);
- if (myentry_offset > entry_offset)
- link = &(*link)->rb_left;
- else if (myentry_offset < entry_offset)
- link = &(*link)->rb_right;
- else {
- *dupentry = myentry;
- return -EEXIST;
- }
- }
- rb_link_node(&entry->rbnode, parent, link);
- rb_insert_color(&entry->rbnode, root);
- return 0;
-}
+ int err;
+ struct zswap_entry *e;
+ pgoff_t offset = swp_offset(entry->swpentry);
-static void zswap_rb_erase(struct rb_root *root, struct zswap_entry *entry)
-{
- rb_erase(&entry->rbnode, root);
- RB_CLEAR_NODE(&entry->rbnode);
+ e = __xa_store(tree, offset, entry, GFP_KERNEL);
+ err = xa_err(e);
+
+ if (err) {
+ e = __xa_erase(tree, offset);
+ if (err == -ENOMEM)
+ zswap_reject_alloc_fail++;
+ else
+ zswap_reject_xarray_fail++;
+ }
+ *old = e;
+ return err;
}
/*********************************
@@ -872,7 +844,6 @@ static struct zswap_entry *zswap_entry_cache_alloc(gfp_t gfp, int nid)
entry = kmem_cache_alloc_node(zswap_entry_cache, gfp, nid);
if (!entry)
return NULL;
- RB_CLEAR_NODE(&entry->rbnode);
return entry;
}
@@ -914,17 +885,6 @@ static void zswap_entry_free(struct zswap_entry *entry)
zswap_update_total_size();
}
-/*
- * The caller hold the tree lock and search the entry from the tree,
- * so it must be on the tree, remove it from the tree and free it.
- */
-static void zswap_invalidate_entry(struct zswap_tree *tree,
- struct zswap_entry *entry)
-{
- zswap_rb_erase(&tree->rbroot, entry);
- zswap_entry_free(entry);
-}
-
/*********************************
* compressed storage functions
**********************************/
@@ -1113,7 +1073,9 @@ static void zswap_decompress(struct zswap_entry *entry, struct page *page)
static int zswap_writeback_entry(struct zswap_entry *entry,
swp_entry_t swpentry)
{
- struct zswap_tree *tree;
+ struct xarray *tree;
+ pgoff_t offset = swp_offset(swpentry);
+ struct zswap_entry *e;
struct folio *folio;
struct mempolicy *mpol;
bool folio_was_allocated;
@@ -1150,19 +1112,14 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
* be dereferenced.
*/
tree = swap_zswap_tree(swpentry);
- spin_lock(&tree->lock);
- if (zswap_rb_search(&tree->rbroot, swp_offset(swpentry)) != entry) {
- spin_unlock(&tree->lock);
+ e = zswap_xa_search_and_erase(tree, offset);
+ if (e != entry) {
delete_from_swap_cache(folio);
folio_unlock(folio);
folio_put(folio);
return -ENOMEM;
}
- /* Safe to deref entry after the entry is verified above. */
- zswap_rb_erase(&tree->rbroot, entry);
- spin_unlock(&tree->lock);
-
zswap_decompress(entry, &folio->page);
count_vm_event(ZSWPWB);
@@ -1471,10 +1428,11 @@ bool zswap_store(struct folio *folio)
{
swp_entry_t swp = folio->swap;
pgoff_t offset = swp_offset(swp);
- struct zswap_tree *tree = swap_zswap_tree(swp);
- struct zswap_entry *entry, *dupentry;
+ struct xarray *tree = swap_zswap_tree(swp);
+ struct zswap_entry *entry, *old;
struct obj_cgroup *objcg = NULL;
struct mem_cgroup *memcg = NULL;
+ int err;
VM_WARN_ON_ONCE(!folio_test_locked(folio));
VM_WARN_ON_ONCE(!folio_test_swapcache(folio));
@@ -1562,21 +1520,25 @@ bool zswap_store(struct folio *folio)
}
/* map */
- spin_lock(&tree->lock);
+ xa_lock(tree);
/*
* The folio may have been dirtied again, invalidate the
* possibly stale entry before inserting the new entry.
*/
- if (zswap_rb_insert(&tree->rbroot, entry, &dupentry) == -EEXIST) {
- zswap_invalidate_entry(tree, dupentry);
- WARN_ON(zswap_rb_insert(&tree->rbroot, entry, &dupentry));
+ err = zswap_xa_insert(tree, entry, &old);
+ if (old)
+ zswap_entry_free(old);
+ if (err) {
+ xa_unlock(tree);
+ goto free_zpool;
}
+
if (entry->length) {
INIT_LIST_HEAD(&entry->lru);
zswap_lru_add(&zswap.list_lru, entry);
atomic_inc(&zswap.nr_stored);
}
- spin_unlock(&tree->lock);
+ xa_unlock(tree);
/* update stats */
atomic_inc(&zswap_stored_pages);
@@ -1585,6 +1547,9 @@ bool zswap_store(struct folio *folio)
return true;
+free_zpool:
+ if (entry->handle)
+ zpool_free(zswap_find_zpool(entry), entry->handle);
put_pool:
zswap_pool_put(entry->pool);
freepage:
@@ -1598,11 +1563,9 @@ bool zswap_store(struct folio *folio)
* possibly stale entry which was previously stored at this offset.
* Otherwise, writeback could overwrite the new data in the swapfile.
*/
- spin_lock(&tree->lock);
- entry = zswap_rb_search(&tree->rbroot, offset);
+ entry = zswap_xa_search_and_erase(tree, offset);
if (entry)
- zswap_invalidate_entry(tree, entry);
- spin_unlock(&tree->lock);
+ zswap_entry_free(entry);
return false;
shrink:
@@ -1615,20 +1578,15 @@ bool zswap_load(struct folio *folio)
swp_entry_t swp = folio->swap;
pgoff_t offset = swp_offset(swp);
struct page *page = &folio->page;
- struct zswap_tree *tree = swap_zswap_tree(swp);
+ struct xarray *tree = swap_zswap_tree(swp);
struct zswap_entry *entry;
u8 *dst;
VM_WARN_ON_ONCE(!folio_test_locked(folio));
- spin_lock(&tree->lock);
- entry = zswap_rb_search(&tree->rbroot, offset);
- if (!entry) {
- spin_unlock(&tree->lock);
+ entry = zswap_xa_search_and_erase(tree, offset);
+ if (!entry)
return false;
- }
- zswap_rb_erase(&tree->rbroot, entry);
- spin_unlock(&tree->lock);
if (entry->length)
zswap_decompress(entry, page);
@@ -1652,19 +1610,17 @@ bool zswap_load(struct folio *folio)
void zswap_invalidate(swp_entry_t swp)
{
pgoff_t offset = swp_offset(swp);
- struct zswap_tree *tree = swap_zswap_tree(swp);
+ struct xarray *tree = swap_zswap_tree(swp);
struct zswap_entry *entry;
- spin_lock(&tree->lock);
- entry = zswap_rb_search(&tree->rbroot, offset);
+ entry = zswap_xa_search_and_erase(tree, offset);
if (entry)
- zswap_invalidate_entry(tree, entry);
- spin_unlock(&tree->lock);
+ zswap_entry_free(entry);
}
int zswap_swapon(int type, unsigned long nr_pages)
{
- struct zswap_tree *trees, *tree;
+ struct xarray *trees, *tree;
unsigned int nr, i;
nr = DIV_ROUND_UP(nr_pages, SWAP_ADDRESS_SPACE_PAGES);
@@ -1674,11 +1630,8 @@ int zswap_swapon(int type, unsigned long nr_pages)
return -ENOMEM;
}
- for (i = 0; i < nr; i++) {
- tree = trees + i;
- tree->rbroot = RB_ROOT;
- spin_lock_init(&tree->lock);
- }
+ for (i = 0; i < nr; i++)
+ xa_init(trees + i);
nr_zswap_trees[type] = nr;
zswap_trees[type] = trees;
@@ -1687,7 +1640,7 @@ int zswap_swapon(int type, unsigned long nr_pages)
void zswap_swapoff(int type)
{
- struct zswap_tree *trees = zswap_trees[type];
+ struct xarray *trees = zswap_trees[type];
unsigned int i;
if (!trees)
@@ -1695,7 +1648,7 @@ void zswap_swapoff(int type)
/* try_to_unuse() invalidated all the entries already */
for (i = 0; i < nr_zswap_trees[type]; i++)
- WARN_ON_ONCE(!RB_EMPTY_ROOT(&trees[i].rbroot));
+ WARN_ON_ONCE(!xa_empty(trees + i));
kvfree(trees);
nr_zswap_trees[type] = 0;
@@ -1727,6 +1680,8 @@ static int zswap_debugfs_init(void)
zswap_debugfs_root, &zswap_reject_kmemcache_fail);
debugfs_create_u64("reject_compress_fail", 0444,
zswap_debugfs_root, &zswap_reject_compress_fail);
+ debugfs_create_u64("reject_xarray_fail", 0444,
+ zswap_debugfs_root, &zswap_reject_xarray_fail);
debugfs_create_u64("reject_compress_poor", 0444,
zswap_debugfs_root, &zswap_reject_compress_poor);
debugfs_create_u64("written_back_pages", 0444,
---
base-commit: 9a0181a3710eba1f5c6d19eadcca888be3d54e4f
change-id: 20240104-zswap-xarray-716260e541e3
Best regards,
--
Chris Li <[email protected]>
Hi Chris,
On 2024/2/29 16:46, Chris Li wrote:
[...]
> @@ -1150,19 +1112,14 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
> * be dereferenced.
> */
> tree = swap_zswap_tree(swpentry);
> - spin_lock(&tree->lock);
> - if (zswap_rb_search(&tree->rbroot, swp_offset(swpentry)) != entry) {
> - spin_unlock(&tree->lock);
> + e = zswap_xa_search_and_erase(tree, offset);
> + if (e != entry) {
IIUC, here we should use xa_cmpxchg() instead of erasing it unconditionally.
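Something like this untested sketch, reusing the locals already in scope:

	/* Only erase if the tree still holds this exact entry. */
	e = xa_cmpxchg(tree, offset, entry, NULL, GFP_KERNEL);
	if (e != entry) {
		...
	}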
> delete_from_swap_cache(folio);
> folio_unlock(folio);
> folio_put(folio);
> return -ENOMEM;
> }
[...]
> @@ -1562,21 +1520,25 @@ bool zswap_store(struct folio *folio)
> }
>
> /* map */
> - spin_lock(&tree->lock);
> + xa_lock(tree);
> /*
> * The folio may have been dirtied again, invalidate the
> * possibly stale entry before inserting the new entry.
> */
> - if (zswap_rb_insert(&tree->rbroot, entry, &dupentry) == -EEXIST) {
> - zswap_invalidate_entry(tree, dupentry);
> - WARN_ON(zswap_rb_insert(&tree->rbroot, entry, &dupentry));
> + err = zswap_xa_insert(tree, entry, &old);
> + if (old)
> + zswap_entry_free(old);
Maybe it's safer to check "old" only after !err, since the "old" variable is
not initialized to NULL, and zswap_xa_insert() may not overwrite "old" with
NULL on an error return?
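I.e. something like this (untested):

	err = zswap_xa_insert(tree, entry, &old);
	if (!err && old)
		zswap_entry_free(old);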
> + if (err) {
> + xa_unlock(tree);
> + goto free_zpool;
> }
> +
> if (entry->length) {
> INIT_LIST_HEAD(&entry->lru);
> zswap_lru_add(&zswap.list_lru, entry);
> atomic_inc(&zswap.nr_stored);
> }
It seems that we can move this part out of the xarray lock section; then it's
enough to just use xa_insert().
Thanks.
Hi Chengming,
Thanks for the review and feedback.
On Thu, Feb 29, 2024 at 1:44 AM Chengming Zhou
<[email protected]> wrote:
[...]
> > + e = __xa_store(tree, offset, entry, GFP_KERNEL);
> > + err = xa_err(e);
> > +
> > + if (err) {
> > + e = __xa_erase(tree, offset);
zswap_xa_insert will always erase the old entry, even when __xa_store fails.
> > + if (err == -ENOMEM)
> > + zswap_reject_alloc_fail++;
> > + else
> > + zswap_reject_xarray_fail++;
> > + }
> > + *old = e;
The "old" pointer is set regardless of the error.
> > + return err;
> > }
[...]
> > tree = swap_zswap_tree(swpentry);
> > - spin_lock(&tree->lock);
> > - if (zswap_rb_search(&tree->rbroot, swp_offset(swpentry)) != entry) {
> > - spin_unlock(&tree->lock);
> > + e = zswap_xa_search_and_erase(tree, offset);
> > + if (e != entry) {
>
> IIUC, here we should use xa_cmpxchg() instead of erasing it unconditionally.
Good catch, I agree with your suggestion. I will spin a V3 to correct that.
[...]
> > - if (zswap_rb_insert(&tree->rbroot, entry, &dupentry) == -EEXIST) {
> > - zswap_invalidate_entry(tree, dupentry);
> > - WARN_ON(zswap_rb_insert(&tree->rbroot, entry, &dupentry));
> > + err = zswap_xa_insert(tree, entry, &old);
> > + if (old)
> > + zswap_entry_free(old);
>
> Maybe it's safer to check "old" only after !err, since the "old" variable is
> not initialized to NULL, and zswap_xa_insert() may not overwrite "old" with
> NULL on an error return?
That is the intended behavior.

See zswap_xa_insert() above: it will always erase and return "old", even
when __xa_store() returns an error. That is because, by the time zswap
needs to store a new entry at this swap entry, the old data is already
outdated, so we should just remove it. If __xa_store() fails due to being
out of memory, that is the same as failing to allocate an entry: it is
fine to fail zswap_store(), and the folio will just stay in the swap
cache for the next time.

Do you see any ill effect that could be caused by deleting the old entry on
xa_insert error?
> > + if (err) {
> > + xa_unlock(tree);
> > + goto free_zpool;
> > }
> > +
> > if (entry->length) {
> > INIT_LIST_HEAD(&entry->lru);
> > zswap_lru_add(&zswap.list_lru, entry);
> > atomic_inc(&zswap.nr_stored);
> > }
>
> It seems that we can move this part out of the xarray lock section; then it's
> enough to just use xa_insert().
It is not enough protection. Consider this race:
CPU1                                  CPU2

xa_insert()
                                      entry = zswap_xa_search_and_erase()
                                      zswap_entry_free(entry)
if (entry->length)
        ...

CPU1 is using the entry after free.
Chris
On 2024/3/1 02:58, Chris Li wrote:
[...]
>>> + e = __xa_store(tree, offset, entry, GFP_KERNEL);
>>> + err = xa_err(e);
>>> +
>>> + if (err) {
>>> + e = __xa_erase(tree, offset);
>
> zswap_xa_insert will always erase the old entry, even when __xa_store fails.
>
>>> + if (err == -ENOMEM)
>>> + zswap_reject_alloc_fail++;
>>> + else
>>> + zswap_reject_xarray_fail++;
>>> + }
>>> + *old = e;
>
> The "old" pointer is set regardless of the error.
Ok, I get it. The "old" pointer is always set on return.
[...]
> Do you see any ill effect that could be caused by deleting the old entry on
> xa_insert error?
No, you're right, we should always delete/free the old zswap entry whether
the store succeeds or fails.
[...]
>> It seems that we can move this part out of the xarray lock section; then it's
>> enough to just use xa_insert().
I wanted to mean xa_store() here.
>
> It is not enough protection. Consider this race:
>
> CPU1                                  CPU2
>
> xa_insert()
>                                       entry = zswap_xa_search_and_erase()
>                                       zswap_entry_free(entry)
> if (entry->length)
>         ...
>
> CPU1 is using the entry after free.
Hmm, right, but I don't see how this race could happen, since the folio we
store is the owner of the swap entry, which can't be deleted meanwhile, right?

Another problem I just noticed is that if xa_store() fails, zswap_same_filled_pages
won't be correct. (Maybe we should move the zswap_same_filled_pages increment.)
Thanks.
On Thu, Feb 29, 2024 at 11:24 PM Chengming Zhou
<[email protected]> wrote:
>
> On 2024/3/1 02:58, Chris Li wrote:
> > Hi Chengming,
> >
> > Thanks for the review and feedback.
> >
> > On Thu, Feb 29, 2024 at 1:44 AM Chengming Zhou
> > <[email protected]> wrote:
> >>
> >> Hi Chris,
> >>
> >> On 2024/2/29 16:46, Chris Li wrote:
> >>> Very deep RB tree requires rebalance at times. That
> >>> contributes to the zswap fault latencies. Xarray does not
> >>> need to perform tree rebalance. Replacing RB tree to xarray
> >>> can have some small performance gain.
> >>>
> >>> One small difference is that xarray insert might fail with
> >>> ENOMEM, while RB tree insert does not allocate additional
> >>> memory.
> >>>
> >>> The zswap_entry size will reduce a bit due to removing the
> >>> RB node, which has two pointers and a color field. Xarray
> >>> store the pointer in the xarray tree rather than the
> >>> zswap_entry. Every entry has one pointer from the xarray
> >>> tree. Overall, switching to xarray should save some memory,
> >>> if the swap entries are densely packed.
> >>>
> >>> Notice the zswap_rb_search and zswap_rb_insert always
> >>> followed by zswap_rb_erase. Fold the entry erase into
> >>> zswap_xa_search_and_erase and zswap_xa_insert. That saves
> >>> one tree lookup as well.
> >>>
> >>> Remove zswap_invalidate_entry due to no need to call
> >>> zswap_rb_erase any more. Use zswap_free_entry instead.
> >>>
> >>> The "struct zswap_tree" has been replaced by "struct xarray".
> >>> The tree spin lock has transferred to the xarray lock.
> >>>
> >>> Thanks to Chengming for providing the kernel build test number.
> >>>
> >>> Run the kernel build testing 5 times for each version, averages:
> >>> (memory.max=2GB, zswap shrinker and writeback enabled, one 50GB swapfile.)
> >>>
> >>> mm-266f922c0b5e zswap-xarray-test
> >>> real 63.43 63.12
> >>> user 1063.78 1062.59
> >>> sys 272.49 265.66
> >>>
> >>> The sys time is about 2.5% faster.
> >>>
> >>> Tested-by: Chengming Zhou <[email protected]>
> >>> ---
> >>>
> >>>
> >>> Signed-off-by: Chris Li <[email protected]>
> >>> ---
> >>> Changes in v2:
> >>> - Replace struct zswap_tree with struct xarray.
> >>> - Remove zswap_tree spinlock, use xarray lock instead.
> >>> - Fold zswap_rb_erase() into zswap_xa_search_and_delete() and zswap_xa_insert().
> >>> - Delete zswap_invalidate_entry(), use zswap_free_entry() instead.
> >>> - Link to v1: https://lore.kernel.org/r/[email protected]
> >>> ---
> >>> mm/zswap.c | 173 +++++++++++++++++++++++--------------------------------------
> >>> 1 file changed, 64 insertions(+), 109 deletions(-)
> >>>
> >>> diff --git a/mm/zswap.c b/mm/zswap.c
> >>> index 011e068eb355..ac9ef14d88be 100644
> >>> --- a/mm/zswap.c
> >>> +++ b/mm/zswap.c
> >>> @@ -20,7 +20,6 @@
> >>> #include <linux/spinlock.h>
> >>> #include <linux/types.h>
> >>> #include <linux/atomic.h>
> >>> -#include <linux/rbtree.h>
> >>> #include <linux/swap.h>
> >>> #include <linux/crypto.h>
> >>> #include <linux/scatterlist.h>
> >>> @@ -71,6 +70,8 @@ static u64 zswap_reject_compress_poor;
> >>> static u64 zswap_reject_alloc_fail;
> >>> /* Store failed because the entry metadata could not be allocated (rare) */
> >>> static u64 zswap_reject_kmemcache_fail;
> >>> +/* Store failed because xarray can't insert the entry*/
> >>> +static u64 zswap_reject_xarray_fail;
> >>>
> >>> /* Shrinker work queue */
> >>> static struct workqueue_struct *shrink_wq;
> >>> @@ -196,7 +197,6 @@ static struct {
> >>> * This structure contains the metadata for tracking a single compressed
> >>> * page within zswap.
> >>> *
> >>> - * rbnode - links the entry into red-black tree for the appropriate swap type
> >>> * swpentry - associated swap entry, the offset indexes into the red-black tree
> >>> * length - the length in bytes of the compressed page data. Needed during
> >>> * decompression. For a same value filled page length is 0, and both
> >>> @@ -208,7 +208,6 @@ static struct {
> >>> * lru - handle to the pool's lru used to evict pages.
> >>> */
> >>> struct zswap_entry {
> >>> - struct rb_node rbnode;
> >>> swp_entry_t swpentry;
> >>> unsigned int length;
> >>> struct zswap_pool *pool;
> >>> @@ -220,12 +219,7 @@ struct zswap_entry {
> >>> struct list_head lru;
> >>> };
> >>>
> >>> -struct zswap_tree {
> >>> - struct rb_root rbroot;
> >>> - spinlock_t lock;
> >>> -};
> >>> -
> >>> -static struct zswap_tree *zswap_trees[MAX_SWAPFILES];
> >>> +static struct xarray *zswap_trees[MAX_SWAPFILES];
> >>> static unsigned int nr_zswap_trees[MAX_SWAPFILES];
> >>>
> >>> /* RCU-protected iteration */
> >>> @@ -253,10 +247,10 @@ static bool zswap_has_pool;
> >>> * helpers and fwd declarations
> >>> **********************************/
> >>>
> >>> -static inline struct zswap_tree *swap_zswap_tree(swp_entry_t swp)
> >>> +static inline struct xarray *swap_zswap_tree(swp_entry_t swp)
> >>> {
> >>> - return &zswap_trees[swp_type(swp)][swp_offset(swp)
> >>> - >> SWAP_ADDRESS_SPACE_SHIFT];
> >>> + return zswap_trees[swp_type(swp)] + (swp_offset(swp)
> >>> + >> SWAP_ADDRESS_SPACE_SHIFT);
> >>> }
> >>>
> >>> #define zswap_pool_debug(msg, p) \
> >>> @@ -805,60 +799,38 @@ void zswap_memcg_offline_cleanup(struct mem_cgroup *memcg)
> >>> }
> >>>
> >>> /*********************************
> >>> -* rbtree functions
> >>> +* xarray functions
> >>> **********************************/
> >>> -static struct zswap_entry *zswap_rb_search(struct rb_root *root, pgoff_t offset)
> >>> +static struct zswap_entry *zswap_xa_search_and_erase(struct xarray *tree, pgoff_t offset)
> >>> {
> >>> - struct rb_node *node = root->rb_node;
> >>> - struct zswap_entry *entry;
> >>> - pgoff_t entry_offset;
> >>> -
> >>> - while (node) {
> >>> - entry = rb_entry(node, struct zswap_entry, rbnode);
> >>> - entry_offset = swp_offset(entry->swpentry);
> >>> - if (entry_offset > offset)
> >>> - node = node->rb_left;
> >>> - else if (entry_offset < offset)
> >>> - node = node->rb_right;
> >>> - else
> >>> - return entry;
> >>> - }
> >>> - return NULL;
> >>> + return xa_erase(tree, offset);
> >>> }
> >>>
> >>> /*
> >>> + * Expects xa_lock to be held on entry.
> >>> * In the case that a entry with the same offset is found, a pointer to
> >>> - * the existing entry is stored in dupentry and the function returns -EEXIST
> >>> + * the existing entry is stored in old and erased from the tree.
> >>> + * Function return error on insert.
> >>> */
> >>> -static int zswap_rb_insert(struct rb_root *root, struct zswap_entry *entry,
> >>> - struct zswap_entry **dupentry)
> >>> +static int zswap_xa_insert(struct xarray *tree, struct zswap_entry *entry,
> >>> + struct zswap_entry **old)
> >>> {
> >>> - struct rb_node **link = &root->rb_node, *parent = NULL;
> >>> - struct zswap_entry *myentry;
> >>> - pgoff_t myentry_offset, entry_offset = swp_offset(entry->swpentry);
> >>> -
> >>> - while (*link) {
> >>> - parent = *link;
> >>> - myentry = rb_entry(parent, struct zswap_entry, rbnode);
> >>> - myentry_offset = swp_offset(myentry->swpentry);
> >>> - if (myentry_offset > entry_offset)
> >>> - link = &(*link)->rb_left;
> >>> - else if (myentry_offset < entry_offset)
> >>> - link = &(*link)->rb_right;
> >>> - else {
> >>> - *dupentry = myentry;
> >>> - return -EEXIST;
> >>> - }
> >>> - }
> >>> - rb_link_node(&entry->rbnode, parent, link);
> >>> - rb_insert_color(&entry->rbnode, root);
> >>> - return 0;
> >>> -}
> >>> + int err;
> >>> + struct zswap_entry *e;
> >>> + pgoff_t offset = swp_offset(entry->swpentry);
> >>>
> >>> -static void zswap_rb_erase(struct rb_root *root, struct zswap_entry *entry)
> >>> -{
> >>> - rb_erase(&entry->rbnode, root);
> >>> - RB_CLEAR_NODE(&entry->rbnode);
> >>> + e = __xa_store(tree, offset, entry, GFP_KERNEL);
> >>> + err = xa_err(e);
> >>> +
> >>> + if (err) {
> >>> + e = __xa_erase(tree, offset);
> >
> > zswap_xa_insert will always erase the old entry, even when __xa_store fails.
> >
> >>> + if (err == -ENOMEM)
> >>> + zswap_reject_alloc_fail++;
> >>> + else
> >>> + zswap_reject_xarray_fail++;
> >>> + }
> >>> + *old = e;
> >
> > Old pointer is set regardless of the error.
>
> Ok, I get it. The "old" pointer is always set on return.
>
> >
> >>> + return err;
> >>> }
> >>>
> >>> /*********************************
> >>> @@ -872,7 +844,6 @@ static struct zswap_entry *zswap_entry_cache_alloc(gfp_t gfp, int nid)
> >>> entry = kmem_cache_alloc_node(zswap_entry_cache, gfp, nid);
> >>> if (!entry)
> >>> return NULL;
> >>> - RB_CLEAR_NODE(&entry->rbnode);
> >>> return entry;
> >>> }
> >>>
> >>> @@ -914,17 +885,6 @@ static void zswap_entry_free(struct zswap_entry *entry)
> >>> zswap_update_total_size();
> >>> }
> >>>
> >>> -/*
> >>> - * The caller hold the tree lock and search the entry from the tree,
> >>> - * so it must be on the tree, remove it from the tree and free it.
> >>> - */
> >>> -static void zswap_invalidate_entry(struct zswap_tree *tree,
> >>> - struct zswap_entry *entry)
> >>> -{
> >>> - zswap_rb_erase(&tree->rbroot, entry);
> >>> - zswap_entry_free(entry);
> >>> -}
> >>> -
> >>> /*********************************
> >>> * compressed storage functions
> >>> **********************************/
> >>> @@ -1113,7 +1073,9 @@ static void zswap_decompress(struct zswap_entry *entry, struct page *page)
> >>> static int zswap_writeback_entry(struct zswap_entry *entry,
> >>> swp_entry_t swpentry)
> >>> {
> >>> - struct zswap_tree *tree;
> >>> + struct xarray *tree;
> >>> + pgoff_t offset = swp_offset(swpentry);
> >>> + struct zswap_entry *e;
> >>> struct folio *folio;
> >>> struct mempolicy *mpol;
> >>> bool folio_was_allocated;
> >>> @@ -1150,19 +1112,14 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
> >>> * be dereferenced.
> >>> */
> >>> tree = swap_zswap_tree(swpentry);
> >>> - spin_lock(&tree->lock);
> >>> - if (zswap_rb_search(&tree->rbroot, swp_offset(swpentry)) != entry) {
> >>> - spin_unlock(&tree->lock);
> >>> + e = zswap_xa_search_and_erase(tree, offset);
> >>> + if (e != entry) {
> >>
> >> IIUC, here we should use xa_cmpxchg() instead of erasing it unconditionally.
> >
> > Good catch, I agree with your suggestion. I will spin a V3 to correct that.
> >
> >>
> >>> delete_from_swap_cache(folio);
> >>> folio_unlock(folio);
> >>> folio_put(folio);
> >>> return -ENOMEM;
> >>> }
> >>>
> >>> - /* Safe to deref entry after the entry is verified above. */
> >>> - zswap_rb_erase(&tree->rbroot, entry);
> >>> - spin_unlock(&tree->lock);
> >>> -
> >>> zswap_decompress(entry, &folio->page);
> >>>
> >>> count_vm_event(ZSWPWB);
> >>> @@ -1471,10 +1428,11 @@ bool zswap_store(struct folio *folio)
> >>> {
> >>> swp_entry_t swp = folio->swap;
> >>> pgoff_t offset = swp_offset(swp);
> >>> - struct zswap_tree *tree = swap_zswap_tree(swp);
> >>> - struct zswap_entry *entry, *dupentry;
> >>> + struct xarray *tree = swap_zswap_tree(swp);
> >>> + struct zswap_entry *entry, *old;
> >>> struct obj_cgroup *objcg = NULL;
> >>> struct mem_cgroup *memcg = NULL;
> >>> + int err;
> >>>
> >>> VM_WARN_ON_ONCE(!folio_test_locked(folio));
> >>> VM_WARN_ON_ONCE(!folio_test_swapcache(folio));
> >>> @@ -1562,21 +1520,25 @@ bool zswap_store(struct folio *folio)
> >>> }
> >>>
> >>> /* map */
> >>> - spin_lock(&tree->lock);
> >>> + xa_lock(tree);
> >>> /*
> >>> * The folio may have been dirtied again, invalidate the
> >>> * possibly stale entry before inserting the new entry.
> >>> */
> >>> - if (zswap_rb_insert(&tree->rbroot, entry, &dupentry) == -EEXIST) {
> >>> - zswap_invalidate_entry(tree, dupentry);
> >>> - WARN_ON(zswap_rb_insert(&tree->rbroot, entry, &dupentry));
> >>> + err = zswap_xa_insert(tree, entry, &old);
> >>> + if (old)
> >>> + zswap_entry_free(old);
> >>
> >> Maybe it's safer to check old only when !err, since the "old" variable is not
> >> initialized to NULL, and zswap_xa_insert() might not overwrite "old" with NULL
> >> when it returns an error?
> >
> > That is the intended behavior.
> >
> > See the above in zswap_xa_insert(). It will always erase and return
> > "old" even when the __xa_store() has an error.
> > That is because, by the time zswap needs to store a new entry at this
> > swap entry, the old data is already outdated. We should just remove
> > the old data. If __xa_store() fails due to being out of memory, that
> > is the same as failing to allocate the entry itself. It is fine to
> > fail zswap_store(); the folio will just stay in the swap cache for
> > the next time.
> >
> > Do you see any ill effects that could be caused by deleting the old
> > entry on xa_insert() error?
>
> No, you're right, we should always delete/free the old zswap entry
> whether the store succeeds or fails.
>
> >
> >>> + if (err) {
> >>> + xa_unlock(tree);
> >>> + goto free_zpool;
> >>> }
> >>> +
> >>> if (entry->length) {
> >>> INIT_LIST_HEAD(&entry->lru);
> >>> zswap_lru_add(&zswap.list_lru, entry);
> >>> atomic_inc(&zswap.nr_stored);
> >>> }
> >>
> >> It seems that we can put this part out of the xarray lock section, then it's enough to
> >> just use xa_insert().
>
> I meant to say xa_store() here.
>
> >
> > It is not enough protection. Consider this race:
> >
> > CPU1 CPU2
> >
> > xa_insert()
> > entry = swap_xa_search_and_erase()
> > zswap_free_entry(entry)
> >
> > if (entry->length)
> > ...
> > CPU1 is using entry after free.
>
> Hmm, right, but I don't see how this race could happen, since the folio we store is
> the owner of the swap entry, which couldn't be deleted meanwhile, right?
I will need to think about it more. Agreed, the current folio can't
delete itself. It is possible the folio lock is enough to prevent the
race.
>
> Another problem I just noticed is that if xa_store() fails, zswap_same_filled_pages
> won't be correct. (Maybe we should move the zswap_same_filled_pages increment.)
You are right, I missed the same-filled pages counter. Will address that in V3.
Chris
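
To make the xa_cmpxchg() suggestion above concrete, a minimal sketch
(the helper name below is hypothetical, not from the posted patch):
erase the slot only if it still holds the entry being written back, so
a racing store that already replaced it is left alone.

	/*
	 * Hypothetical helper sketching the xa_cmpxchg() suggestion:
	 * erase tree[offset] only if it still points at @entry.
	 * xa_cmpxchg() returns the previous value at the index, so a
	 * successful conditional erase returns @entry itself.
	 */
	static bool zswap_xa_erase_if_matches(struct xarray *tree, pgoff_t offset,
					      struct zswap_entry *entry)
	{
		return xa_cmpxchg(tree, offset, entry, NULL, GFP_KERNEL) == entry;
	}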
From: Chris Li
> Sent: 29 February 2024 08:46
>
> Very deep RB tree requires rebalance at times. That
> contributes to the zswap fault latencies. Xarray does not
> need to perform tree rebalance. Replacing RB tree to xarray
> can have some small performance gain.
>
> One small difference is that xarray insert might fail with
> ENOMEM, while RB tree insert does not allocate additional
> memory.
What is the difference in kernel memory use?
IIRC someone pointed out (in the rosebush thread) that xarray
uses a lot of kernel memory if the items are randomly distributed.
David
On Fri, Mar 1, 2024 at 1:58 PM David Laight <[email protected]> wrote:
>
> From: Chris Li
> > Sent: 29 February 2024 08:46
> >
> > Very deep RB tree requires rebalance at times. That
> > contributes to the zswap fault latencies. Xarray does not
> > need to perform tree rebalance. Replacing RB tree to xarray
> > can have some small performance gain.
> >
> > One small difference is that xarray insert might fail with
> > ENOMEM, while RB tree insert does not allocate additional
> > memory.
>
> What is the difference in kernel memory use?
> IIRC someone pointed out (in the rosebush thread) that xarray
> uses a lot of kernel memory if the items are randomly distributed.
Do you have any suggestions on what script I can run to collect that
information for you?
If the swapfile is fully utilized, then every 1G of swapfile space
saves about 4M of memory using the xarray compared to the RB tree.
Every swap entry saves about 2 pointers (ignoring the intermediate
xarray nodes).
For every 1G of swapfile, the fully populated xarray pointers use
about 2M of memory. The RB tree nodes (left pointer, right pointer,
color) would use about 6M of memory.
In the worst case, you are wasting at most 2M of memory on xarray for
every 1G of swapfile.
Hope that calculation helps.
Keep in mind that the swap cache already uses an xarray, with the
same swap entry offset as the index. So whatever memory waste is
caused by a sparse distribution of swap entries, you are already
paying it in the current swap cache.
Chris
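
For anyone who wants to check the arithmetic above, a small userspace
sketch (assuming 4 KiB pages and 8-byte pointers; illustration only,
not kernel code):

	#include <stdio.h>

	int main(void)
	{
		unsigned long slots = (1UL << 30) / 4096;  /* swap slots per 1G: 262144 */
		unsigned long xa_mib = slots * 8 >> 20;    /* one xarray slot pointer each: 2M */
		unsigned long rb_mib = slots * 24 >> 20;   /* rb_node: two pointers + color word: 6M */

		printf("slots=%lu xarray=%luM rbtree=%luM saved=%luM\n",
		       slots, xa_mib, rb_mib, rb_mib - xa_mib);
		return 0;
	}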
Hi Chengming,
On Fri, Mar 1, 2024 at 1:38 AM Chris Li <[email protected]> wrote:
>
> [... full quote of the patch and the earlier review discussion trimmed; see above ...]
> > >>> if (entry->length) {
> > >>> INIT_LIST_HEAD(&entry->lru);
> > >>> zswap_lru_add(&zswap.list_lru, entry);
> > >>> atomic_inc(&zswap.nr_stored);
> > >>> }
> > >>
> > >> It seems that we can put this part out of the xarray lock section, then it's enough to
> > >> just use xa_insert().
> >
> > I meant to say xa_store() here.
> >
> > >
> > > It is not enough protection. Consider this race:
> > >
> > > CPU1 CPU2
> > >
> > > xa_insert()
> > > entry = swap_xa_search_and_erase()
> > > zswap_free_entry(entry)
> > >
> > > if (entry->length)
> > > ...
> > > CPU1 is using entry after free.
> >
> > Hmm, right, but I don't see how this race could happen, since the folio we store is
> > the owner of the swap entry, which couldn't be deleted meanwhile, right?
>
> I will need to think about it more. Agreed, the current folio can't
> delete itself. It is possible the folio lock is enough to prevent the
> race.
After sleeping on it a bit, I think it should be safe to reduce the
locked range and use xa_store() directly.
Does anyone have any other objections? If not, I will use xa_store() directly for V3.
Chris
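
A rough sketch of that simplified store path, assuming xa_store() as
discussed (the error label is hypothetical, and note that unlike the
v2 zswap_xa_insert(), a failed xa_store() leaves any old entry in
place):

	old = xa_store(tree, offset, entry, GFP_KERNEL);
	err = xa_err(old);	/* xa_store() returns the old value, or an xa_err() pointer */
	if (err)
		goto store_failed;	/* hypothetical error label */
	if (old)
		zswap_entry_free(old);

	/*
	 * Per the discussion above, the folio lock should prevent the
	 * use-after-free race, so the LRU setup no longer needs to run
	 * under the tree lock.
	 */
	if (entry->length) {
		INIT_LIST_HEAD(&entry->lru);
		zswap_lru_add(&zswap.list_lru, entry);
		atomic_inc(&zswap.nr_stored);
	}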
On Fri, Mar 01, 2024 at 09:58:43PM +0000, David Laight wrote:
> From: Chris Li
> > Sent: 29 February 2024 08:46
> >
> > Very deep RB tree requires rebalance at times. That
> > contributes to the zswap fault latencies. Xarray does not
> > need to perform tree rebalance. Replacing RB tree to xarray
> > can have some small performance gain.
> >
> > One small difference is that xarray insert might fail with
> > ENOMEM, while RB tree insert does not allocate additional
> > memory.
>
> What is the difference in kernel memory use?
> IIRC someone pointed out (in the rosebush thread) that xarray
> uses a lot of kernel memory if the items are randomly distributed.
I have often thought it would be useful if there were an API for the
xarray to report how much memory it's using. For zswap, we could plumb
this into a debugfs interface or so to measure the tree usage and
compare it to the rbtree. The cost of the rbtree is fixed, IIUC: 3
words per zswap entry.