2023-02-23 03:05:03

by Sergey Senozhatsky

Subject: [PATCHv2 0/6] zsmalloc: fine-grained fullness and new compaction algorithm

Hi,

Existing zsmalloc page fullness grouping leads to suboptimal page
selection for both zs_malloc() and zs_compact(). This patchset
reworks zsmalloc fullness grouping/classification.

Additionally, it implements a new compaction algorithm that is
expected to use fewer CPU cycles (as it potentially does fewer
memcpy-s in zs_object_copy()).

TEST
====

It's very challenging to reliably test this series. I ended up
developing my own synthetic test that has 100% reproducibility.
The test generates significant fragmentation (for each size class)
and then performs compaction for each class individually and tracks
the number of memcpy() in zs_object_copy(), so that we can compare
the amount of work compaction does on a per-class basis.

Total amount of work (zram mm_stat objs_moved)
----------------------------------------------

Old fullness grouping, old compaction algorithm:
323977 memcpy() in zs_object_copy().

Old fullness grouping, new compaction algorithm:
262944 memcpy() in zs_object_copy().

New fullness grouping, new compaction algorithm:
213978 memcpy() in zs_object_copy().
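
As a quick cross-check of the totals above (an editorial sketch, not
part of the series; the helper name reduction_pct() is made up for
illustration), the relative savings can be computed directly from the
three memcpy() counts:

```c
/*
 * Relative reduction, in percent, when going from old_cnt copies to
 * new_cnt copies. Hypothetical helper; assumes new_cnt <= old_cnt.
 */
static double reduction_pct(unsigned long old_cnt, unsigned long new_cnt)
{
	return 100.0 * (double)(old_cnt - new_cnt) / (double)old_cnt;
}
```

reduction_pct(323977, 262944) comes to roughly 18.84 and
reduction_pct(323977, 213978) to roughly 33.95. Both agree with the
per-class differences reported in the T-test section, which is expected
because each total equals the per-class average multiplied by N = 140.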


Per-class compaction memcpy() comparison (T-test)
-------------------------------------------------

x Old fullness grouping, old compaction algorithm
+ Old fullness grouping, new compaction algorithm

    N    Min    Max   Median         Avg     Stddev
x 140    349   3513     2461   2314.1214  806.03271
+ 140    289   2778     2006   1878.1714  641.02073
Difference at 95.0% confidence
-435.95 +/- 170.595
-18.8387% +/- 7.37193%
(Student's t, pooled s = 728.216)


x Old fullness grouping, old compaction algorithm
+ New fullness grouping, new compaction algorithm

    N    Min    Max   Median         Avg     Stddev
x 140    349   3513     2461   2314.1214  806.03271
+ 140    226   2279     1644   1528.4143  524.85268
Difference at 95.0% confidence
-785.707 +/- 159.331
-33.9527% +/- 6.88516%
(Student's t, pooled s = 680.132)

Sergey Senozhatsky (6):
  zsmalloc: remove insert_zspage() ->inuse optimization
  zsmalloc: remove stat and fullness enums
  zsmalloc: fine-grained inuse ratio based fullness grouping
  zsmalloc: rework compaction algorithm
  zsmalloc: extend compaction statistics
  zram: show zsmalloc objs_moved stat in mm_stat

 Documentation/admin-guide/blockdev/zram.rst |   1 +
 drivers/block/zram/zram_drv.c               |   5 +-
 include/linux/zsmalloc.h                    |   2 +
 mm/zsmalloc.c                               | 365 ++++++++++----------
 4 files changed, 188 insertions(+), 185 deletions(-)

--
2.39.2.637.g21b0678d19-goog



2023-02-23 03:05:06

by Sergey Senozhatsky

Subject: [PATCHv2 1/6] zsmalloc: remove insert_zspage() ->inuse optimization

This optimization has no effect. It only ensures that
when a page was added to its corresponding fullness
list, its "inuse" counter was higher or lower than the
"inuse" counter of the page at the head of the list.
The intention was to keep busy pages at the head, so
they could be filled up and moved to the ZS_FULL
fullness group more quickly. However, this doesn't work
as the "inuse" counter of a page can be modified by
obj_free() but the page may still belong to the same
fullness list. So, fix_fullness_group() won't change
the page's position in relation to the head's "inuse"
counter, leading to a largely random order of pages
within the fullness list.

For instance, consider a printout of the "inuse"
counters of the first 10 pages in a class that holds
93 objects per zspage:

ZS_ALMOST_EMPTY: 36 67 68 64 35 54 63 52

As we can see, the page with one of the lowest "inuse"
counters is actually the head of the fullness list.
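
The effect can be reproduced with a toy model (an editorial sketch, not
kernel code; struct toy_list, toy_insert() and toy_free_obj() are
hypothetical names, and an array stands in for the linked list). It
applies the removed head-only comparison on insert and lets a free
change a counter without moving the page:

```c
#include <string.h>

#define TOY_MAX_PAGES 16

struct toy_list {
	int inuse[TOY_MAX_PAGES];	/* index 0 is the list head */
	int len;
};

/* Removed rule: compare the new page against the head only. */
static void toy_insert(struct toy_list *l, int inuse)
{
	int pos;

	if (l->len == TOY_MAX_PAGES)
		return;

	/* Less busy than the head: goes right after it. Otherwise: new head. */
	pos = (l->len > 0 && inuse < l->inuse[0]) ? 1 : 0;

	memmove(&l->inuse[pos + 1], &l->inuse[pos],
		(size_t)(l->len - pos) * sizeof(l->inuse[0]));
	l->inuse[pos] = inuse;
	l->len++;
}

/* A free that keeps the page in its group: counter drops, position stays. */
static void toy_free_obj(struct toy_list *l, int pos)
{
	l->inuse[pos]--;
}
```

Inserting pages with counters 5, 3, 4 and 2 in that order yields the
list 5 2 4 3, which is already unsorted past the head. Freeing three
objects from the head page then leaves 2 2 4 3, so the head is no
longer the busiest page either.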

Signed-off-by: Sergey Senozhatsky <[email protected]>
---
mm/zsmalloc.c | 29 ++++++++---------------------
1 file changed, 8 insertions(+), 21 deletions(-)

diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
index 3aed46ab7e6c..b57a89ed6f30 100644
--- a/mm/zsmalloc.c
+++ b/mm/zsmalloc.c
@@ -753,37 +753,24 @@ static enum fullness_group get_fullness_group(struct size_class *class,
}

/*
- * Each size class maintains various freelists and zspages are assigned
- * to one of these freelists based on the number of live objects they
- * have. This functions inserts the given zspage into the freelist
- * identified by <class, fullness_group>.
+ * This function adds the given zspage to the fullness list identified
+ * by <class, fullness_group>.
*/
static void insert_zspage(struct size_class *class,
- struct zspage *zspage,
- enum fullness_group fullness)
+ struct zspage *zspage,
+ enum fullness_group fullness)
{
- struct zspage *head;
-
class_stat_inc(class, fullness, 1);
- head = list_first_entry_or_null(&class->fullness_list[fullness],
- struct zspage, list);
- /*
- * We want to see more ZS_FULL pages and less almost empty/full.
- * Put pages with higher ->inuse first.
- */
- if (head && get_zspage_inuse(zspage) < get_zspage_inuse(head))
- list_add(&zspage->list, &head->list);
- else
- list_add(&zspage->list, &class->fullness_list[fullness]);
+ list_add(&zspage->list, &class->fullness_list[fullness]);
}

/*
- * This function removes the given zspage from the freelist identified
+ * This function removes the given zspage from the fullness list identified
* by <class, fullness_group>.
*/
static void remove_zspage(struct size_class *class,
- struct zspage *zspage,
- enum fullness_group fullness)
+ struct zspage *zspage,
+ enum fullness_group fullness)
{
VM_BUG_ON(list_empty(&class->fullness_list[fullness]));

--
2.39.2.637.g21b0678d19-goog


2023-02-23 03:05:11

by Sergey Senozhatsky

Subject: [PATCHv2 2/6] zsmalloc: remove stat and fullness enums

The fullness_group enum is nested (sub-enum) within the
class_stat_type enum. zsmalloc requires the values in both
enums to match, because zsmalloc passes these values to
generic functions, e.g. class_stat_inc() and class_stat_dec(),
after casting them to integers.

Replace these enums (and enum nesting) and use simple defines
instead. Also rename some of zsmalloc stats defines, as they
sort of clash with zspage object tags.

Suggested-by: Yosry Ahmed <[email protected]>
Signed-off-by: Sergey Senozhatsky <[email protected]>
---
mm/zsmalloc.c | 104 ++++++++++++++++++++++----------------------------
1 file changed, 45 insertions(+), 59 deletions(-)

diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
index b57a89ed6f30..38ae8963c0eb 100644
--- a/mm/zsmalloc.c
+++ b/mm/zsmalloc.c
@@ -159,26 +159,18 @@
#define ZS_SIZE_CLASSES (DIV_ROUND_UP(ZS_MAX_ALLOC_SIZE - ZS_MIN_ALLOC_SIZE, \
ZS_SIZE_CLASS_DELTA) + 1)

-enum fullness_group {
- ZS_EMPTY,
- ZS_ALMOST_EMPTY,
- ZS_ALMOST_FULL,
- ZS_FULL,
- NR_ZS_FULLNESS,
-};
+#define ZS_EMPTY 0
+#define ZS_ALMOST_EMPTY 1
+#define ZS_ALMOST_FULL 2
+#define ZS_FULL 3
+#define ZS_OBJS_ALLOCATED 4
+#define ZS_OBJS_INUSE 5

-enum class_stat_type {
- CLASS_EMPTY,
- CLASS_ALMOST_EMPTY,
- CLASS_ALMOST_FULL,
- CLASS_FULL,
- OBJ_ALLOCATED,
- OBJ_USED,
- NR_ZS_STAT_TYPE,
-};
+#define NR_ZS_STAT 6
+#define NR_ZS_FULLNESS 4

struct zs_size_stat {
- unsigned long objs[NR_ZS_STAT_TYPE];
+ unsigned long objs[NR_ZS_STAT];
};

#ifdef CONFIG_ZSMALLOC_STAT
@@ -547,8 +539,8 @@ static inline void set_freeobj(struct zspage *zspage, unsigned int obj)
}

static void get_zspage_mapping(struct zspage *zspage,
- unsigned int *class_idx,
- enum fullness_group *fullness)
+ unsigned int *class_idx,
+ int *fullness)
{
BUG_ON(zspage->magic != ZSPAGE_MAGIC);

@@ -557,14 +549,14 @@ static void get_zspage_mapping(struct zspage *zspage,
}

static struct size_class *zspage_class(struct zs_pool *pool,
- struct zspage *zspage)
+ struct zspage *zspage)
{
return pool->size_class[zspage->class];
}

static void set_zspage_mapping(struct zspage *zspage,
- unsigned int class_idx,
- enum fullness_group fullness)
+ unsigned int class_idx,
+ int fullness)
{
zspage->class = class_idx;
zspage->fullness = fullness;
@@ -588,23 +580,20 @@ static int get_size_class_index(int size)
return min_t(int, ZS_SIZE_CLASSES - 1, idx);
}

-/* type can be of enum type class_stat_type or fullness_group */
static inline void class_stat_inc(struct size_class *class,
- int type, unsigned long cnt)
+ int type, unsigned long cnt)
{
class->stats.objs[type] += cnt;
}

-/* type can be of enum type class_stat_type or fullness_group */
static inline void class_stat_dec(struct size_class *class,
- int type, unsigned long cnt)
+ int type, unsigned long cnt)
{
class->stats.objs[type] -= cnt;
}

-/* type can be of enum type class_stat_type or fullness_group */
static inline unsigned long zs_stat_get(struct size_class *class,
- int type)
+ int type)
{
return class->stats.objs[type];
}
@@ -652,10 +641,10 @@ static int zs_stats_size_show(struct seq_file *s, void *v)
continue;

spin_lock(&pool->lock);
- class_almost_full = zs_stat_get(class, CLASS_ALMOST_FULL);
- class_almost_empty = zs_stat_get(class, CLASS_ALMOST_EMPTY);
- obj_allocated = zs_stat_get(class, OBJ_ALLOCATED);
- obj_used = zs_stat_get(class, OBJ_USED);
+ class_almost_full = zs_stat_get(class, ZS_ALMOST_FULL);
+ class_almost_empty = zs_stat_get(class, ZS_ALMOST_EMPTY);
+ obj_allocated = zs_stat_get(class, ZS_OBJS_ALLOCATED);
+ obj_used = zs_stat_get(class, ZS_OBJS_INUSE);
freeable = zs_can_compact(class);
spin_unlock(&pool->lock);

@@ -731,11 +720,10 @@ static inline void zs_pool_stat_destroy(struct zs_pool *pool)
* the pool (not yet implemented). This function returns fullness
* status of the given page.
*/
-static enum fullness_group get_fullness_group(struct size_class *class,
- struct zspage *zspage)
+static int get_fullness_group(struct size_class *class, struct zspage *zspage)
{
int inuse, objs_per_zspage;
- enum fullness_group fg;
+ int fg;

inuse = get_zspage_inuse(zspage);
objs_per_zspage = class->objs_per_zspage;
@@ -754,11 +742,11 @@ static enum fullness_group get_fullness_group(struct size_class *class,

/*
* This function adds the given zspage to the fullness list identified
- * by <class, fullness_group>.
+ * by <class, fullness group>.
*/
static void insert_zspage(struct size_class *class,
struct zspage *zspage,
- enum fullness_group fullness)
+ int fullness)
{
class_stat_inc(class, fullness, 1);
list_add(&zspage->list, &class->fullness_list[fullness]);
@@ -766,11 +754,11 @@ static void insert_zspage(struct size_class *class,

/*
* This function removes the given zspage from the fullness list identified
- * by <class, fullness_group>.
+ * by <class, fullness group>.
*/
static void remove_zspage(struct size_class *class,
struct zspage *zspage,
- enum fullness_group fullness)
+ int fullness)
{
VM_BUG_ON(list_empty(&class->fullness_list[fullness]));

@@ -787,11 +775,10 @@ static void remove_zspage(struct size_class *class,
* page from the freelist of the old fullness group to that of the new
* fullness group.
*/
-static enum fullness_group fix_fullness_group(struct size_class *class,
- struct zspage *zspage)
+static int fix_fullness_group(struct size_class *class, struct zspage *zspage)
{
int class_idx;
- enum fullness_group currfg, newfg;
+ int currfg, newfg;

get_zspage_mapping(zspage, &class_idx, &currfg);
newfg = get_fullness_group(class, zspage);
@@ -964,7 +951,7 @@ static void __free_zspage(struct zs_pool *pool, struct size_class *class,
struct zspage *zspage)
{
struct page *page, *next;
- enum fullness_group fg;
+ int fg;
unsigned int class_idx;

get_zspage_mapping(zspage, &class_idx, &fg);
@@ -990,7 +977,7 @@ static void __free_zspage(struct zs_pool *pool, struct size_class *class,

cache_free_zspage(pool, zspage);

- class_stat_dec(class, OBJ_ALLOCATED, class->objs_per_zspage);
+ class_stat_dec(class, ZS_OBJS_ALLOCATED, class->objs_per_zspage);
atomic_long_sub(class->pages_per_zspage,
&pool->pages_allocated);
}
@@ -1508,7 +1495,7 @@ unsigned long zs_malloc(struct zs_pool *pool, size_t size, gfp_t gfp)
{
unsigned long handle, obj;
struct size_class *class;
- enum fullness_group newfg;
+ int newfg;
struct zspage *zspage;

if (unlikely(!size || size > ZS_MAX_ALLOC_SIZE))
@@ -1530,7 +1517,7 @@ unsigned long zs_malloc(struct zs_pool *pool, size_t size, gfp_t gfp)
/* Now move the zspage to another fullness group, if required */
fix_fullness_group(class, zspage);
record_obj(handle, obj);
- class_stat_inc(class, OBJ_USED, 1);
+ class_stat_inc(class, ZS_OBJS_INUSE, 1);
spin_unlock(&pool->lock);

return handle;
@@ -1552,8 +1539,8 @@ unsigned long zs_malloc(struct zs_pool *pool, size_t size, gfp_t gfp)
record_obj(handle, obj);
atomic_long_add(class->pages_per_zspage,
&pool->pages_allocated);
- class_stat_inc(class, OBJ_ALLOCATED, class->objs_per_zspage);
- class_stat_inc(class, OBJ_USED, 1);
+ class_stat_inc(class, ZS_OBJS_ALLOCATED, class->objs_per_zspage);
+ class_stat_inc(class, ZS_OBJS_INUSE, 1);

/* We completely set up zspage so mark them as movable */
SetZsPageMovable(pool, zspage);
@@ -1609,7 +1596,7 @@ void zs_free(struct zs_pool *pool, unsigned long handle)
struct page *f_page;
unsigned long obj;
struct size_class *class;
- enum fullness_group fullness;
+ int fullness;

if (IS_ERR_OR_NULL((void *)handle))
return;
@@ -1624,7 +1611,7 @@ void zs_free(struct zs_pool *pool, unsigned long handle)
zspage = get_zspage(f_page);
class = zspage_class(pool, zspage);

- class_stat_dec(class, OBJ_USED, 1);
+ class_stat_dec(class, ZS_OBJS_INUSE, 1);

#ifdef CONFIG_ZPOOL
if (zspage->under_reclaim) {
@@ -1828,7 +1815,7 @@ static struct zspage *isolate_zspage(struct size_class *class, bool source)
{
int i;
struct zspage *zspage;
- enum fullness_group fg[2] = {ZS_ALMOST_EMPTY, ZS_ALMOST_FULL};
+ int fg[2] = {ZS_ALMOST_EMPTY, ZS_ALMOST_FULL};

if (!source) {
fg[0] = ZS_ALMOST_FULL;
@@ -1852,12 +1839,11 @@ static struct zspage *isolate_zspage(struct size_class *class, bool source)
* @class: destination class
* @zspage: target page
*
- * Return @zspage's fullness_group
+ * Return @zspage's fullness status
*/
-static enum fullness_group putback_zspage(struct size_class *class,
- struct zspage *zspage)
+static int putback_zspage(struct size_class *class, struct zspage *zspage)
{
- enum fullness_group fullness;
+ int fullness;

fullness = get_fullness_group(class, zspage);
insert_zspage(class, zspage, fullness);
@@ -2121,7 +2107,7 @@ static void async_free_zspage(struct work_struct *work)
int i;
struct size_class *class;
unsigned int class_idx;
- enum fullness_group fullness;
+ int fullness;
struct zspage *zspage, *tmp;
LIST_HEAD(free_pages);
struct zs_pool *pool = container_of(work, struct zs_pool,
@@ -2190,8 +2176,8 @@ static inline void zs_flush_migration(struct zs_pool *pool) { }
static unsigned long zs_can_compact(struct size_class *class)
{
unsigned long obj_wasted;
- unsigned long obj_allocated = zs_stat_get(class, OBJ_ALLOCATED);
- unsigned long obj_used = zs_stat_get(class, OBJ_USED);
+ unsigned long obj_allocated = zs_stat_get(class, ZS_OBJS_ALLOCATED);
+ unsigned long obj_used = zs_stat_get(class, ZS_OBJS_INUSE);

if (obj_allocated <= obj_used)
return 0;
@@ -2616,7 +2602,7 @@ static int zs_reclaim_page(struct zs_pool *pool, unsigned int retries)
unsigned long handle;
struct zspage *zspage;
struct page *page;
- enum fullness_group fullness;
+ int fullness;

/* Lock LRU and fullness list */
spin_lock(&pool->lock);
--
2.39.2.637.g21b0678d19-goog


2023-02-23 03:05:15

by Sergey Senozhatsky

Subject: [PATCHv2 3/6] zsmalloc: fine-grained inuse ratio based fullness grouping

Each zspage maintains an ->inuse counter which keeps track of the
number of objects stored in the page. The ->inuse counter also
determines the page's "fullness group" which is calculated as
the ratio of the "inuse" objects to the total number of objects
the page can hold (objs_per_zspage). The closer the ->inuse
counter is to objs_per_zspage, the better.

Each size class maintains several fullness lists that keep
track of zspages of a particular "fullness". Pages within each
fullness list are stored in random order with regard to the
->inuse counter. This is because sorting the pages by ->inuse
counter each time obj_malloc() or obj_free() is called would
be too expensive. However, the ->inuse counter is still a
crucial factor in many situations.

For the two major zsmalloc operations, zs_malloc() and zs_compact(),
we typically select the head page from the corresponding fullness
list as the best candidate page. However, this assumption is not
always accurate.

For the zs_malloc() operation, the optimal candidate page should
have the highest ->inuse counter. This is because the goal is to
maximize the number of ZS_FULL pages and make full use of all
allocated memory.

For the zs_compact() operation, the optimal candidate page should
have the lowest ->inuse counter. This is because compaction needs
to move objects in use to another page before it can release the
zspage and return its physical pages to the buddy allocator. The
fewer objects in use, the quicker compaction can release the page.
Additionally, compaction is measured by the number of pages it
releases.

This patch reworks the fullness grouping mechanism. The existing
two groups, ZS_ALMOST_EMPTY (usage ratio of 3/4 or below) and
ZS_ALMOST_FULL (usage ratio above 3/4), result in too many pages
being included in the ALMOST_EMPTY group for specific classes.
Instead, size classes now maintain a larger number of fullness
lists that give strict guarantees on the minimum and maximum
->inuse values within each group. Each group represents a 10%
change in the ->inuse ratio compared to neighboring groups. In
essence, there are groups for pages with 0%, 10%, 20% usage
ratios, and so on, up to 100% (with an extra 99% group just
below the full one).

This enhances the selection of candidate pages for both zs_malloc()
and zs_compact(). A printout of the ->inuse counters of the first 7
pages per (random) class fullness group:

class-768 objs_per_zspage 16:
fullness 100%: empty
fullness 99%: empty
fullness 90%: empty
fullness 80%: empty
fullness 70%: empty
fullness 60%: 8 8 9 9 8 8 8
fullness 50%: empty
fullness 40%: 5 5 6 5 5 5 5
fullness 30%: 4 4 4 4 4 4 4
fullness 20%: 2 3 2 3 3 2 2
fullness 10%: 1 1 1 1 1 1 1
fullness 0%: empty
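
The printout can be checked against the mapping with a small standalone
function (an editorial sketch that mirrors the arithmetic of
get_fullness_group() in the diff; the name toy_fullness_group() is
hypothetical). Group indices follow the patch: 0 is the 0% group, 1..9
are 10%..90%, 10 is 99% and 11 is 100%.

```c
/*
 * Map an ->inuse counter to a fullness group index.
 * Returns 0 for an empty page, 11 for a full page, and
 * (ratio / 10 + 1) otherwise, where ratio is the integer
 * percentage of objects in use.
 */
static int toy_fullness_group(int inuse, int objs_per_zspage)
{
	if (inuse == 0)
		return 0;
	if (inuse == objs_per_zspage)
		return 11;

	return (100 * inuse / objs_per_zspage) / 10 + 1;
}
```

With objs_per_zspage = 16, counters 8 and 9 map to index 6 (the 60%
group), 5 and 6 map to index 4 (40%), 4 maps to index 3 (30%), 2 and 3
map to index 2 (20%) and 1 maps to index 1 (10%), matching the class-768
lines above. Note that, with this arithmetic, a page at exactly 30%
usage (say 3 of 10 objects) lands in index 4, the 40% group.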

The zs_malloc() function searches through the groups of pages
starting with the one having the highest usage ratio. This means
that it always selects a page from the group with the least
internal fragmentation (highest usage ratio) and makes it even
less fragmented by increasing its usage ratio.

The zs_compact() function, on the other hand, begins by scanning
the group with the highest fragmentation (lowest usage ratio) to
locate the source page. The first available page is selected, and
then the function moves downward to find a destination page in
the group with the lowest internal fragmentation (highest usage
ratio).
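
The two scan directions can be sketched as follows (an editorial
illustration only; the toy_* names are hypothetical and a per-group page
count stands in for the real fullness lists). The loop bounds follow
find_get_zspage(), isolate_src_zspage() and isolate_dst_zspage() in the
diff:

```c
#define TOY_NR_GROUPS	12	/* 0%, 10% .. 90%, 99%, 100% */
#define TOY_GROUP_10	1
#define TOY_GROUP_99	10

/* zs_malloc(): busiest group below 100% that has a page, or -1. */
static int toy_pick_alloc_group(const int *nr_pages)
{
	for (int fg = TOY_GROUP_99; fg >= 0; fg--)
		if (nr_pages[fg] > 0)
			return fg;
	return -1;
}

/* Compaction source: least busy non-empty group, scanning upward. */
static int toy_pick_src_group(const int *nr_pages)
{
	for (int fg = TOY_GROUP_10; fg <= TOY_GROUP_99; fg++)
		if (nr_pages[fg] > 0)
			return fg;
	return -1;
}

/* Compaction destination: busiest non-full group, scanning downward. */
static int toy_pick_dst_group(const int *nr_pages)
{
	for (int fg = TOY_GROUP_99; fg >= TOY_GROUP_10; fg--)
		if (nr_pages[fg] > 0)
			return fg;
	return -1;
}
```

For the class-768 printout, where only the 10%, 20%, 30%, 40% and 60%
groups hold pages, allocation and the compaction destination both pick
the 60% group (index 6), while the compaction source comes from the 10%
group (index 1).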

Signed-off-by: Sergey Senozhatsky <[email protected]>
---
mm/zsmalloc.c | 183 +++++++++++++++++++++++++++++---------------------
1 file changed, 107 insertions(+), 76 deletions(-)

diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
index 38ae8963c0eb..1a92ebe338eb 100644
--- a/mm/zsmalloc.c
+++ b/mm/zsmalloc.c
@@ -127,7 +127,7 @@
#define OBJ_INDEX_MASK ((_AC(1, UL) << OBJ_INDEX_BITS) - 1)

#define HUGE_BITS 1
-#define FULLNESS_BITS 2
+#define FULLNESS_BITS 4
#define CLASS_BITS 8
#define ISOLATED_BITS 5
#define MAGIC_VAL_BITS 8
@@ -159,15 +159,33 @@
#define ZS_SIZE_CLASSES (DIV_ROUND_UP(ZS_MAX_ALLOC_SIZE - ZS_MIN_ALLOC_SIZE, \
ZS_SIZE_CLASS_DELTA) + 1)

-#define ZS_EMPTY 0
-#define ZS_ALMOST_EMPTY 1
-#define ZS_ALMOST_FULL 2
-#define ZS_FULL 3
-#define ZS_OBJS_ALLOCATED 4
-#define ZS_OBJS_INUSE 5
-
-#define NR_ZS_STAT 6
-#define NR_ZS_FULLNESS 4
+/*
+ * Pages are distinguished by the ratio of used memory (that is the ratio
+ * of ->inuse objects to all objects that page can store). For example,
+ * INUSE_RATIO_30 means that the ratio of used objects is > 20% and <= 30%.
+ *
+ * The number of fullness groups is not random. It allows us to keep
+ * difference between the least busy page in the group (minimum permitted
+ * number of ->inuse objects) and the most busy page (maximum permitted
+ * number of ->inuse objects) at a reasonable value.
+ */
+#define ZS_INUSE_RATIO_0 0
+#define ZS_INUSE_RATIO_10 1
+#define ZS_INUSE_RATIO_20 2
+#define ZS_INUSE_RATIO_30 3
+#define ZS_INUSE_RATIO_40 4
+#define ZS_INUSE_RATIO_50 5
+#define ZS_INUSE_RATIO_60 6
+#define ZS_INUSE_RATIO_70 7
+#define ZS_INUSE_RATIO_80 8
+#define ZS_INUSE_RATIO_90 9
+#define ZS_INUSE_RATIO_99 10
+#define ZS_INUSE_RATIO_100 11
+#define ZS_OBJS_ALLOCATED 12
+#define ZS_OBJS_INUSE 13
+
+#define NR_ZS_INUSE_RATIO 12
+#define NR_ZS_STAT 14

struct zs_size_stat {
unsigned long objs[NR_ZS_STAT];
@@ -177,25 +195,10 @@ struct zs_size_stat {
static struct dentry *zs_stat_root;
#endif

-/*
- * We assign a page to ZS_ALMOST_EMPTY fullness group when:
- * n <= N / f, where
- * n = number of allocated objects
- * N = total number of objects zspage can store
- * f = fullness_threshold_frac
- *
- * Similarly, we assign zspage to:
- * ZS_ALMOST_FULL when n > N / f
- * ZS_EMPTY when n == 0
- * ZS_FULL when n == N
- *
- * (see: fix_fullness_group())
- */
-static const int fullness_threshold_frac = 4;
static size_t huge_class_size;

struct size_class {
- struct list_head fullness_list[NR_ZS_FULLNESS];
+ struct list_head fullness_list[NR_ZS_INUSE_RATIO];
/*
* Size of objects stored in this class. Must be multiple
* of ZS_ALIGN.
@@ -641,8 +644,23 @@ static int zs_stats_size_show(struct seq_file *s, void *v)
continue;

spin_lock(&pool->lock);
- class_almost_full = zs_stat_get(class, ZS_ALMOST_FULL);
- class_almost_empty = zs_stat_get(class, ZS_ALMOST_EMPTY);
+
+ /*
+ * Replicate old behaviour for almost_full and almost_empty
+ * stats.
+ */
+ class_almost_full = zs_stat_get(class, ZS_INUSE_RATIO_99);
+ class_almost_full += zs_stat_get(class, ZS_INUSE_RATIO_90);
+ class_almost_full += zs_stat_get(class, ZS_INUSE_RATIO_80);
+ class_almost_full += zs_stat_get(class, ZS_INUSE_RATIO_70);
+
+ class_almost_empty = zs_stat_get(class, ZS_INUSE_RATIO_60);
+ class_almost_empty += zs_stat_get(class, ZS_INUSE_RATIO_50);
+ class_almost_empty += zs_stat_get(class, ZS_INUSE_RATIO_40);
+ class_almost_empty += zs_stat_get(class, ZS_INUSE_RATIO_30);
+ class_almost_empty += zs_stat_get(class, ZS_INUSE_RATIO_20);
+ class_almost_empty += zs_stat_get(class, ZS_INUSE_RATIO_10);
+
obj_allocated = zs_stat_get(class, ZS_OBJS_ALLOCATED);
obj_used = zs_stat_get(class, ZS_OBJS_INUSE);
freeable = zs_can_compact(class);
@@ -712,32 +730,30 @@ static inline void zs_pool_stat_destroy(struct zs_pool *pool)
}
#endif

-
/*
* For each size class, zspages are divided into different groups
- * depending on how "full" they are. This was done so that we could
- * easily find empty or nearly empty zspages when we try to shrink
- * the pool (not yet implemented). This function returns fullness
+ * depending on their usage ratio. This function returns fullness
* status of the given page.
*/
static int get_fullness_group(struct size_class *class, struct zspage *zspage)
{
- int inuse, objs_per_zspage;
- int fg;
+ int inuse, objs_per_zspage, ratio;

inuse = get_zspage_inuse(zspage);
objs_per_zspage = class->objs_per_zspage;

if (inuse == 0)
- fg = ZS_EMPTY;
- else if (inuse == objs_per_zspage)
- fg = ZS_FULL;
- else if (inuse <= 3 * objs_per_zspage / fullness_threshold_frac)
- fg = ZS_ALMOST_EMPTY;
- else
- fg = ZS_ALMOST_FULL;
+ return ZS_INUSE_RATIO_0;
+ if (inuse == objs_per_zspage)
+ return ZS_INUSE_RATIO_100;

- return fg;
+ ratio = 100 * inuse / objs_per_zspage;
+ /*
+ * Take integer division into consideration: a page with one inuse
+ * object out of 127 possible, will end up having 0 usage ratio,
+ * which is wrong as it belongs in ZS_INUSE_RATIO_10 fullness group.
+ */
+ return ratio / 10 + 1;
}

/*
@@ -769,11 +785,11 @@ static void remove_zspage(struct size_class *class,
/*
* Each size class maintains zspages in different fullness groups depending
* on the number of live objects they contain. When allocating or freeing
- * objects, the fullness status of the page can change, say, from ALMOST_FULL
- * to ALMOST_EMPTY when freeing an object. This function checks if such
- * a status change has occurred for the given page and accordingly moves the
- * page from the freelist of the old fullness group to that of the new
- * fullness group.
+ * objects, the fullness status of the page can change, for instance, from
+ * INUSE_RATIO_80 to INUSE_RATIO_70 when freeing an object. This function
+ * checks if such a status change has occurred for the given page and
+ * accordingly moves the page from the list of the old fullness group to that
+ * of the new fullness group.
*/
static int fix_fullness_group(struct size_class *class, struct zspage *zspage)
{
@@ -959,7 +975,7 @@ static void __free_zspage(struct zs_pool *pool, struct size_class *class,
assert_spin_locked(&pool->lock);

VM_BUG_ON(get_zspage_inuse(zspage));
- VM_BUG_ON(fg != ZS_EMPTY);
+ VM_BUG_ON(fg != ZS_INUSE_RATIO_0);

/* Free all deferred handles from zs_free */
free_handles(pool, class, zspage);
@@ -998,7 +1014,7 @@ static void free_zspage(struct zs_pool *pool, struct size_class *class,
return;
}

- remove_zspage(class, zspage, ZS_EMPTY);
+ remove_zspage(class, zspage, ZS_INUSE_RATIO_0);
#ifdef CONFIG_ZPOOL
list_del(&zspage->lru);
#endif
@@ -1134,9 +1150,9 @@ static struct zspage *find_get_zspage(struct size_class *class)
int i;
struct zspage *zspage;

- for (i = ZS_ALMOST_FULL; i >= ZS_EMPTY; i--) {
+ for (i = ZS_INUSE_RATIO_99; i >= ZS_INUSE_RATIO_0; i--) {
zspage = list_first_entry_or_null(&class->fullness_list[i],
- struct zspage, list);
+ struct zspage, list);
if (zspage)
break;
}
@@ -1629,7 +1645,7 @@ void zs_free(struct zs_pool *pool, unsigned long handle)
obj_free(class->size, obj, NULL);

fullness = fix_fullness_group(class, zspage);
- if (fullness == ZS_EMPTY)
+ if (fullness == ZS_INUSE_RATIO_0)
free_zspage(pool, class, zspage);

spin_unlock(&pool->lock);
@@ -1811,22 +1827,33 @@ static int migrate_zspage(struct zs_pool *pool, struct size_class *class,
return ret;
}

-static struct zspage *isolate_zspage(struct size_class *class, bool source)
+static struct zspage *isolate_src_zspage(struct size_class *class)
{
- int i;
struct zspage *zspage;
- int fg[2] = {ZS_ALMOST_EMPTY, ZS_ALMOST_FULL};
+ int fg;

- if (!source) {
- fg[0] = ZS_ALMOST_FULL;
- fg[1] = ZS_ALMOST_EMPTY;
+ for (fg = ZS_INUSE_RATIO_10; fg <= ZS_INUSE_RATIO_99; fg++) {
+ zspage = list_first_entry_or_null(&class->fullness_list[fg],
+ struct zspage, list);
+ if (zspage) {
+ remove_zspage(class, zspage, fg);
+ return zspage;
+ }
}

- for (i = 0; i < 2; i++) {
- zspage = list_first_entry_or_null(&class->fullness_list[fg[i]],
- struct zspage, list);
+ return zspage;
+}
+
+static struct zspage *isolate_dst_zspage(struct size_class *class)
+{
+ struct zspage *zspage;
+ int fg;
+
+ for (fg = ZS_INUSE_RATIO_99; fg >= ZS_INUSE_RATIO_10; fg--) {
+ zspage = list_first_entry_or_null(&class->fullness_list[fg],
+ struct zspage, list);
if (zspage) {
- remove_zspage(class, zspage, fg[i]);
+ remove_zspage(class, zspage, fg);
return zspage;
}
}
@@ -2119,7 +2146,7 @@ static void async_free_zspage(struct work_struct *work)
continue;

spin_lock(&pool->lock);
- list_splice_init(&class->fullness_list[ZS_EMPTY], &free_pages);
+ list_splice_init(&class->fullness_list[ZS_INUSE_RATIO_0], &free_pages);
spin_unlock(&pool->lock);
}

@@ -2128,7 +2155,7 @@ static void async_free_zspage(struct work_struct *work)
lock_zspage(zspage);

get_zspage_mapping(zspage, &class_idx, &fullness);
- VM_BUG_ON(fullness != ZS_EMPTY);
+ VM_BUG_ON(fullness != ZS_INUSE_RATIO_0);
class = pool->size_class[class_idx];
spin_lock(&pool->lock);
#ifdef CONFIG_ZPOOL
@@ -2201,7 +2228,7 @@ static unsigned long __zs_compact(struct zs_pool *pool,
* as well as zpage allocation/free
*/
spin_lock(&pool->lock);
- while ((src_zspage = isolate_zspage(class, true))) {
+ while ((src_zspage = isolate_src_zspage(class))) {
/* protect someone accessing the zspage(i.e., zs_map_object) */
migrate_write_lock(src_zspage);

@@ -2211,7 +2238,7 @@ static unsigned long __zs_compact(struct zs_pool *pool,
cc.obj_idx = 0;
cc.s_page = get_first_page(src_zspage);

- while ((dst_zspage = isolate_zspage(class, false))) {
+ while ((dst_zspage = isolate_dst_zspage(class))) {
migrate_write_lock_nested(dst_zspage);

cc.d_page = get_first_page(dst_zspage);
@@ -2236,7 +2263,7 @@ static unsigned long __zs_compact(struct zs_pool *pool,
putback_zspage(class, dst_zspage);
migrate_write_unlock(dst_zspage);

- if (putback_zspage(class, src_zspage) == ZS_EMPTY) {
+ if (putback_zspage(class, src_zspage) == ZS_INUSE_RATIO_0) {
migrate_write_unlock(src_zspage);
free_zspage(pool, class, src_zspage);
pages_freed += class->pages_per_zspage;
@@ -2394,7 +2421,7 @@ struct zs_pool *zs_create_pool(const char *name)
int pages_per_zspage;
int objs_per_zspage;
struct size_class *class;
- int fullness = 0;
+ int fullness;

size = ZS_MIN_ALLOC_SIZE + i * ZS_SIZE_CLASS_DELTA;
if (size > ZS_MAX_ALLOC_SIZE)
@@ -2448,9 +2475,12 @@ struct zs_pool *zs_create_pool(const char *name)
class->pages_per_zspage = pages_per_zspage;
class->objs_per_zspage = objs_per_zspage;
pool->size_class[i] = class;
- for (fullness = ZS_EMPTY; fullness < NR_ZS_FULLNESS;
- fullness++)
+
+ fullness = ZS_INUSE_RATIO_0;
+ while (fullness < NR_ZS_INUSE_RATIO) {
INIT_LIST_HEAD(&class->fullness_list[fullness]);
+ fullness++;
+ }

prev_class = class;
}
@@ -2496,11 +2526,12 @@ void zs_destroy_pool(struct zs_pool *pool)
if (class->index != i)
continue;

- for (fg = ZS_EMPTY; fg < NR_ZS_FULLNESS; fg++) {
- if (!list_empty(&class->fullness_list[fg])) {
- pr_info("Freeing non-empty class with size %db, fullness group %d\n",
- class->size, fg);
- }
+ for (fg = ZS_INUSE_RATIO_0; fg < NR_ZS_INUSE_RATIO; fg++) {
+ if (list_empty(&class->fullness_list[fg]))
+ continue;
+
+ pr_err("Class-%d fullness group %d is not empty\n",
+ class->size, fg);
}
kfree(class);
}
@@ -2672,7 +2703,7 @@ static int zs_reclaim_page(struct zs_pool *pool, unsigned int retries)
* while the page is removed from the pool. Fix it
* up for the check in __free_zspage().
*/
- zspage->fullness = ZS_EMPTY;
+ zspage->fullness = ZS_INUSE_RATIO_0;

__free_zspage(pool, class, zspage);
spin_unlock(&pool->lock);
--
2.39.2.637.g21b0678d19-goog


2023-02-23 03:05:22

by Sergey Senozhatsky

Subject: [PATCHv2 4/6] zsmalloc: rework compaction algorithm

The zsmalloc compaction algorithm has the potential to
waste some CPU cycles, particularly when compacting pages
within the same fullness group. This is due to the way it
selects the head page of the fullness list for source and
destination pages, and how it reinserts those pages during
each iteration. The algorithm may first use a page as a
migration destination and then as a migration source,
leading to an unnecessary back-and-forth movement of
objects.

Consider the following fullness list:

PageA PageB PageC PageD PageE

During the first iteration, the compaction algorithm will
select PageA as the source and PageB as the destination.
All of PageA's objects will be moved to PageB, and then
PageA will be released while PageB is reinserted into the
fullness list.

PageB PageC PageD PageE

During the next iteration, the compaction algorithm will
again start from the head of the list, meaning that PageB
will now serve as the source and PageC as the destination.
This will result in objects being moved away from PageB,
including the very objects that were just moved to PageB
in the previous iteration.

To prevent this avalanche effect, the compaction algorithm
should not reinsert the destination page between iterations.
That way, the best destination page continues to be used
and its usage ratio keeps increasing, reducing internal
fragmentation. The destination page should only be reinserted
into the fullness list if:
- It becomes full
- No source page is available
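
The PageA..PageE scenario can be simulated with a toy model (an
editorial sketch under simplifying assumptions, not the kernel
algorithm: pages are plain counters, every page holds at most
TOY_CAPACITY objects, and all toy_* names are hypothetical). It counts
object copies for both strategies and stops once a check modelled on
zs_can_compact() reports that no whole page can be freed:

```c
#define TOY_CAPACITY 4	/* objects per page in this toy model */

/* Modelled on zs_can_compact(): can at least one whole page be freed? */
static int toy_can_compact(const int *inuse, int n)
{
	int live = 0, used = 0;

	for (int i = 0; i < n; i++) {
		if (inuse[i] > 0) {
			live++;
			used += inuse[i];
		}
	}
	return (live * TOY_CAPACITY - used) / TOY_CAPACITY > 0;
}

/* Move as many objects as fit from src to dst; return the copy count. */
static int toy_move(int *inuse, int src, int dst)
{
	int room = TOY_CAPACITY - inuse[dst];
	int cnt = inuse[src] < room ? inuse[src] : room;

	inuse[src] -= cnt;
	inuse[dst] += cnt;
	return cnt;
}

/* Old behaviour: the destination is reinserted and becomes the next source. */
static int toy_compact_reinsert_dst(int *inuse, int n)
{
	int moved = 0, src = 0;

	for (int dst = 1; dst < n && toy_can_compact(inuse, n); dst++) {
		moved += toy_move(inuse, src, dst);
		if (inuse[src] > 0)
			break;	/* toy model stops if src cannot be emptied */
		src = dst;
	}
	return moved;
}

/* New behaviour: keep filling the same destination until it is full. */
static int toy_compact_keep_dst(int *inuse, int n)
{
	int moved = 0, dst = 0;

	for (int src = 1; src < n && toy_can_compact(inuse, n); src++) {
		moved += toy_move(inuse, src, dst);
		if (inuse[dst] == TOY_CAPACITY)
			break;	/* toy model stops once dst is full */
	}
	return moved;
}
```

Starting from five pages with one object each, the old strategy copies
1 + 2 + 3 = 6 objects, while keeping the destination copies only 3. Both
end up releasing three pages.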

Signed-off-by: Sergey Senozhatsky <[email protected]>
---
mm/zsmalloc.c | 82 ++++++++++++++++++++++++---------------------------
1 file changed, 38 insertions(+), 44 deletions(-)

diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
index 1a92ebe338eb..eacf9e32da5c 100644
--- a/mm/zsmalloc.c
+++ b/mm/zsmalloc.c
@@ -1786,15 +1786,14 @@ struct zs_compact_control {
int obj_idx;
};

-static int migrate_zspage(struct zs_pool *pool, struct size_class *class,
- struct zs_compact_control *cc)
+static void migrate_zspage(struct zs_pool *pool, struct size_class *class,
+ struct zs_compact_control *cc)
{
unsigned long used_obj, free_obj;
unsigned long handle;
struct page *s_page = cc->s_page;
struct page *d_page = cc->d_page;
int obj_idx = cc->obj_idx;
- int ret = 0;

while (1) {
handle = find_alloced_obj(class, s_page, &obj_idx);
@@ -1807,10 +1806,8 @@ static int migrate_zspage(struct zs_pool *pool, struct size_class *class,
}

/* Stop if there is no more space */
- if (zspage_full(class, get_zspage(d_page))) {
- ret = -ENOMEM;
+ if (zspage_full(class, get_zspage(d_page)))
break;
- }

used_obj = handle_to_obj(handle);
free_obj = obj_malloc(pool, get_zspage(d_page), handle);
@@ -1823,8 +1820,6 @@ static int migrate_zspage(struct zs_pool *pool, struct size_class *class,
/* Remember last position in this iteration */
cc->s_page = s_page;
cc->obj_idx = obj_idx;
-
- return ret;
}

static struct zspage *isolate_src_zspage(struct size_class *class)
@@ -2228,57 +2223,56 @@ static unsigned long __zs_compact(struct zs_pool *pool,
* as well as zpage allocation/free
*/
spin_lock(&pool->lock);
- while ((src_zspage = isolate_src_zspage(class))) {
- /* protect someone accessing the zspage(i.e., zs_map_object) */
- migrate_write_lock(src_zspage);
-
- if (!zs_can_compact(class))
- break;
-
- cc.obj_idx = 0;
- cc.s_page = get_first_page(src_zspage);
-
- while ((dst_zspage = isolate_dst_zspage(class))) {
- migrate_write_lock_nested(dst_zspage);
-
+ while (1) {
+ if (!dst_zspage) {
+ dst_zspage = isolate_dst_zspage(class);
+ if (!dst_zspage)
+ goto out;
+ migrate_write_lock(dst_zspage);
cc.d_page = get_first_page(dst_zspage);
- /*
- * If there is no more space in dst_page, resched
- * and see if anyone had allocated another zspage.
- */
- if (!migrate_zspage(pool, class, &cc))
- break;
+ }

+ if (!zs_can_compact(class)) {
putback_zspage(class, dst_zspage);
migrate_write_unlock(dst_zspage);
- dst_zspage = NULL;
- if (spin_is_contended(&pool->lock))
- break;
+ goto out;
}

- /* Stop if we couldn't find slot */
- if (dst_zspage == NULL)
- break;
+ src_zspage = isolate_src_zspage(class);
+ if (!src_zspage) {
+ putback_zspage(class, dst_zspage);
+ migrate_write_unlock(dst_zspage);
+ goto out;
+ }

- putback_zspage(class, dst_zspage);
- migrate_write_unlock(dst_zspage);
+ migrate_write_lock_nested(src_zspage);
+
+ cc.obj_idx = 0;
+ cc.s_page = get_first_page(src_zspage);
+ migrate_zspage(pool, class, &cc);

if (putback_zspage(class, src_zspage) == ZS_INUSE_RATIO_0) {
migrate_write_unlock(src_zspage);
free_zspage(pool, class, src_zspage);
pages_freed += class->pages_per_zspage;
- } else
+ } else {
migrate_write_unlock(src_zspage);
- spin_unlock(&pool->lock);
- cond_resched();
- spin_lock(&pool->lock);
- }
+ }

- if (src_zspage) {
- putback_zspage(class, src_zspage);
- migrate_write_unlock(src_zspage);
- }
+ if (get_fullness_group(class, dst_zspage) == ZS_INUSE_RATIO_100
+ || spin_is_contended(&pool->lock)) {
+ putback_zspage(class, dst_zspage);
+ migrate_write_unlock(dst_zspage);
+ dst_zspage = NULL;
+ }

+ if (!dst_zspage) {
+ spin_unlock(&pool->lock);
+ cond_resched();
+ spin_lock(&pool->lock);
+ }
+ }
+out:
spin_unlock(&pool->lock);

return pages_freed;
--
2.39.2.637.g21b0678d19-goog
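
The PageA..PageE walk-through in the commit message above can be reproduced with a small user-space toy model. This is only a sketch under stated assumptions, not the kernel code: it uses a single fullness list, a fixed hypothetical CAPACITY of 8 objects per page, and the made-up helpers move_objs(), compact_old() and compact_new(). It counts how many objects each scheme moves.

```c
#include <assert.h>
#include <string.h>

#define CAPACITY 8	/* hypothetical objects per page */
#define NPAGES   5	/* maximum list length in this toy model */

/* Move objects from src to dst until src is empty or dst is full. */
static int move_objs(int *src, int *dst)
{
	int room = CAPACITY - *dst;
	int n = *src < room ? *src : room;

	*src -= n;
	*dst += n;
	return n;
}

/*
 * Old scheme: source and destination are both taken from the list head,
 * and a destination that is not yet full goes back to the head after
 * every iteration, where it is picked as the next source.
 */
static int compact_old(const int *init, int n)
{
	int list[NPAGES], next[NPAGES];
	int len = n, moved = 0;

	assert(n <= NPAGES);
	memcpy(list, init, n * sizeof(int));
	while (len >= 2) {
		int src = list[0], dst = list[1];
		int rest = len - 2, k = 0;

		moved += move_objs(&src, &dst);
		if (dst < CAPACITY)
			next[k++] = dst;	/* destination reinserted at the head */
		if (src > 0)
			next[k++] = src;	/* source that did not fit goes back too */
		memcpy(next + k, list + 2, rest * sizeof(int));
		len = k + rest;
		memcpy(list, next, len * sizeof(int));
	}
	return moved;
}

/*
 * New scheme: the destination is held across iterations and is only
 * given up when it becomes full or when no source page is left.
 */
static int compact_new(const int *init, int n)
{
	int list[NPAGES];
	int head = 0, moved = 0;

	assert(n <= NPAGES);
	memcpy(list, init, n * sizeof(int));
	while (head < n) {
		int dst = list[head++];		/* isolate the destination once */

		while (dst < CAPACITY && head < n) {
			int src = list[head];

			moved += move_objs(&src, &dst);
			if (src == 0)
				head++;			/* source emptied and freed */
			else
				list[head] = src;	/* destination filled up first */
		}
	}
	return moved;
}
```

With five pages that each hold 2 of 8 objects, compact_old() moves 12 objects in this model while compact_new() moves 6, because the old scheme keeps re-moving the objects it has just placed on the destination page.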


2023-02-23 03:05:34

by Sergey Senozhatsky

Subject: [PATCHv2 5/6] zsmalloc: extend compaction statistics

Extend zsmalloc zs_pool_stats with a new member that
holds the number of objects pool compaction moved
between pool pages.

Signed-off-by: Sergey Senozhatsky <[email protected]>
---
include/linux/zsmalloc.h | 2 ++
mm/zsmalloc.c | 1 +
2 files changed, 3 insertions(+)

diff --git a/include/linux/zsmalloc.h b/include/linux/zsmalloc.h
index a48cd0ffe57d..8b3fa5b4a68c 100644
--- a/include/linux/zsmalloc.h
+++ b/include/linux/zsmalloc.h
@@ -36,6 +36,8 @@ enum zs_mapmode {
struct zs_pool_stats {
/* How many pages were migrated (freed) */
atomic_long_t pages_compacted;
+ /* How many objects were migrated during compaction */
+ atomic_long_t objs_moved;
};

struct zs_pool;
diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
index eacf9e32da5c..f7e69df48fb0 100644
--- a/mm/zsmalloc.c
+++ b/mm/zsmalloc.c
@@ -1815,6 +1815,7 @@ static void migrate_zspage(struct zs_pool *pool, struct size_class *class,
obj_idx++;
record_obj(handle, free_obj);
obj_free(class->size, used_obj, NULL);
+ atomic_long_inc(&pool->stats.objs_moved);
}

/* Remember last position in this iteration */
--
2.39.2.637.g21b0678d19-goog


2023-02-23 03:05:43

by Sergey Senozhatsky

Subject: [PATCHv2 6/6] zram: show zsmalloc objs_moved stat in mm_stat

Extend zram mm_stat with the new objs_moved stat from zs_pool_stats.

Signed-off-by: Sergey Senozhatsky <[email protected]>
---
Documentation/admin-guide/blockdev/zram.rst | 1 +
drivers/block/zram/zram_drv.c | 5 +++--
2 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/Documentation/admin-guide/blockdev/zram.rst b/Documentation/admin-guide/blockdev/zram.rst
index e4551579cb12..699cdbf27e37 100644
--- a/Documentation/admin-guide/blockdev/zram.rst
+++ b/Documentation/admin-guide/blockdev/zram.rst
@@ -267,6 +267,7 @@ line of text and contains the following stats separated by whitespace:
pages_compacted the number of pages freed during compaction
huge_pages the number of incompressible pages
huge_pages_since the number of incompressible pages since zram set up
+ objs_moved The number of objects moved during pool compaction
================ =============================================================

File /sys/block/zram<id>/bd_stat
diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index aa490da3cef2..3194e9254c6f 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -1221,7 +1221,7 @@ static ssize_t mm_stat_show(struct device *dev,
max_used = atomic_long_read(&zram->stats.max_used_pages);

ret = scnprintf(buf, PAGE_SIZE,
- "%8llu %8llu %8llu %8lu %8ld %8llu %8lu %8llu %8llu\n",
+ "%8llu %8llu %8llu %8lu %8ld %8llu %8lu %8llu %8llu %8llu\n",
orig_size << PAGE_SHIFT,
(u64)atomic64_read(&zram->stats.compr_data_size),
mem_used << PAGE_SHIFT,
@@ -1230,7 +1230,8 @@ static ssize_t mm_stat_show(struct device *dev,
(u64)atomic64_read(&zram->stats.same_pages),
atomic_long_read(&pool_stats.pages_compacted),
(u64)atomic64_read(&zram->stats.huge_pages),
- (u64)atomic64_read(&zram->stats.huge_pages_since));
+ (u64)atomic64_read(&zram->stats.huge_pages_since),
+ (u64)atomic64_read(&pool_stats.objs_moved));
up_read(&zram->init_lock);

return ret;
--
2.39.2.637.g21b0678d19-goog
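
For user space, the patch above means that the mm_stat line gains a tenth whitespace-separated column. A minimal sketch of reading it is shown below; the helper name mm_stat_objs_moved() and the column-index macro are hypothetical, and the code assumes the column order given by the format string in this patch.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* objs_moved is the tenth column (index 9), assuming this patch is applied. */
#define MM_STAT_OBJS_MOVED_IDX 9

/*
 * Hypothetical helper: extract objs_moved from one mm_stat line.
 * Returns 0 on success, -1 if the line has fewer than ten numeric
 * columns (for example, on a kernel without this patch).
 */
static int mm_stat_objs_moved(const char *line, unsigned long long *out)
{
	const char *p = line;
	char *end;
	unsigned long long val = 0;
	int i;

	assert(line && out);
	for (i = 0; i <= MM_STAT_OBJS_MOVED_IDX; i++) {
		errno = 0;
		val = strtoull(p, &end, 10);	/* skips leading whitespace */
		if (end == p || errno)
			return -1;		/* column missing or not a number */
		p = end;
	}
	*out = val;
	return 0;
}
```

A caller would read /sys/block/zram<id>/mm_stat into a buffer and pass it to the helper; a return value of -1 is a reasonable signal to fall back to the nine-column layout.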


2023-02-23 23:09:39

by Minchan Kim

Subject: Re: [PATCHv2 1/6] zsmalloc: remove insert_zspage() ->inuse optimization

On Thu, Feb 23, 2023 at 12:04:46PM +0900, Sergey Senozhatsky wrote:
> This optimization has no effect. It only ensures that
> when a page was added to its corresponding fullness
> list, its "inuse" counter was higher or lower than the
> "inuse" counter of the page at the head of the list.
> The intention was to keep busy pages at the head, so
> they could be filled up and moved to the ZS_FULL
> fullness group more quickly. However, this doesn't work
> as the "inuse" counter of a page can be modified by

zspage

Let's use the term zspage instead of page to avoid confusion.

> obj_free() but the page may still belong to the same
> fullness list. So, fix_fullness_group() won't change

Yes. I didn't expect it to be perfect from the beginning, but thought
it would help as a small optimization.

> the page's position in relation to the head's "inuse"
> counter, leading to a largely random order of pages
> within the fullness list.

Good point.

>
> For instance, consider a printout of the "inuse"
> counters of the first 10 pages in a class that holds
> 93 objects per zspage:
>
> ZS_ALMOST_EMPTY: 36 67 68 64 35 54 63 52
>
> As we can see the page with the lowest "inuse" counter
> is actually the head of the fullness list.

Let's state clearly what the patch is doing:

"So, let's remove the pointless optimization", or some better wording.

>
> Signed-off-by: Sergey Senozhatsky <[email protected]>
> ---
> mm/zsmalloc.c | 29 ++++++++---------------------
> 1 file changed, 8 insertions(+), 21 deletions(-)
>
> diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
> index 3aed46ab7e6c..b57a89ed6f30 100644
> --- a/mm/zsmalloc.c
> +++ b/mm/zsmalloc.c
> @@ -753,37 +753,24 @@ static enum fullness_group get_fullness_group(struct size_class *class,
> }
>
> /*
> - * Each size class maintains various freelists and zspages are assigned
> - * to one of these freelists based on the number of live objects they
> - * have. This functions inserts the given zspage into the freelist
> - * identified by <class, fullness_group>.
> + * This function adds the given zspage to the fullness list identified
> + * by <class, fullness_group>.
> */
> static void insert_zspage(struct size_class *class,
> - struct zspage *zspage,
> - enum fullness_group fullness)
> + struct zspage *zspage,
> + enum fullness_group fullness)

Unnecessary changes

> {
> - struct zspage *head;
> -
> class_stat_inc(class, fullness, 1);
> - head = list_first_entry_or_null(&class->fullness_list[fullness],
> - struct zspage, list);
> - /*
> - * We want to see more ZS_FULL pages and less almost empty/full.
> - * Put pages with higher ->inuse first.
> - */
> - if (head && get_zspage_inuse(zspage) < get_zspage_inuse(head))
> - list_add(&zspage->list, &head->list);
> - else
> - list_add(&zspage->list, &class->fullness_list[fullness]);
> + list_add(&zspage->list, &class->fullness_list[fullness]);
> }
>
> /*
> - * This function removes the given zspage from the freelist identified
> + * This function removes the given zspage from the fullness list identified
> * by <class, fullness_group>.
> */
> static void remove_zspage(struct size_class *class,
> - struct zspage *zspage,
> - enum fullness_group fullness)
> + struct zspage *zspage,
> + enum fullness_group fullness)

Ditto.

Other than that, looks good to me.

2023-02-23 23:11:34

by Minchan Kim

Subject: Re: [PATCHv2 2/6] zsmalloc: remove stat and fullness enums

On Thu, Feb 23, 2023 at 12:04:47PM +0900, Sergey Senozhatsky wrote:
> The fullness_group enum is nested (sub-enum) within the
> class_stat_type enum. zsmalloc requires the values in both
> enums to match, because zsmalloc passes these values to
> generic functions, e.g. class_stat_inc() and class_stat_dec(),
> after casting them to integers.
>
> Replace these enums (and enum nesting) and use simple defines
> instead. Also rename some of zsmalloc stats defines, as they
> sort of clash with zspage object tags.
>
> Suggested-by: Yosry Ahmed <[email protected]>
> Signed-off-by: Sergey Senozhatsky <[email protected]>
> ---
> mm/zsmalloc.c | 104 ++++++++++++++++++++++----------------------------
> 1 file changed, 45 insertions(+), 59 deletions(-)
>
> diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
> index b57a89ed6f30..38ae8963c0eb 100644
> --- a/mm/zsmalloc.c
> +++ b/mm/zsmalloc.c
> @@ -159,26 +159,18 @@
> #define ZS_SIZE_CLASSES (DIV_ROUND_UP(ZS_MAX_ALLOC_SIZE - ZS_MIN_ALLOC_SIZE, \
> ZS_SIZE_CLASS_DELTA) + 1)
>
> -enum fullness_group {
> - ZS_EMPTY,
> - ZS_ALMOST_EMPTY,
> - ZS_ALMOST_FULL,
> - ZS_FULL,
> - NR_ZS_FULLNESS,
> -};
> +#define ZS_EMPTY 0
> +#define ZS_ALMOST_EMPTY 1
> +#define ZS_ALMOST_FULL 2
> +#define ZS_FULL 3
> +#define ZS_OBJS_ALLOCATED 4
> +#define ZS_OBJS_INUSE 5
>
> -enum class_stat_type {
> - CLASS_EMPTY,
> - CLASS_ALMOST_EMPTY,
> - CLASS_ALMOST_FULL,
> - CLASS_FULL,
> - OBJ_ALLOCATED,
> - OBJ_USED,
> - NR_ZS_STAT_TYPE,
> -};
> +#define NR_ZS_STAT 6
> +#define NR_ZS_FULLNESS 4

Using a list of defines instead of an enum looks like going backward. :)

Why can't we do this?

enum class_stat_type {
ZS_EMPTY,
ZS_ALMOST_EMPTY,
ZS_ALMOST_FULL,
ZS_FULL,
NR_ZS_FULLNESS,
ZS_OBJ_ALLOCATED = NR_ZS_FULLNESS,
ZS_OBJ_USED,
NR_ZS_STAT,
};
>
> struct zs_size_stat {
> - unsigned long objs[NR_ZS_STAT_TYPE];
> + unsigned long objs[NR_ZS_STAT];
> };
>
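
The single-enum layout suggested in this reply can be checked in isolation. The sketch below is an illustration, not the final patch: the struct mirrors zs_size_stat from the quoted diff, while stat_inc() is a simplified, hypothetical stand-in for class_stat_inc(). The point is that NR_ZS_FULLNESS doubles as the first index of the object counters, so the values match the defines in the patch without a second enum.

```c
#include <assert.h>

enum class_stat_type {
	ZS_EMPTY,
	ZS_ALMOST_EMPTY,
	ZS_ALMOST_FULL,
	ZS_FULL,
	NR_ZS_FULLNESS,				/* number of fullness groups: 4 */
	ZS_OBJ_ALLOCATED = NR_ZS_FULLNESS,	/* counters continue at index 4 */
	ZS_OBJ_USED,
	NR_ZS_STAT,				/* total number of stat slots: 6 */
};

struct zs_size_stat {
	unsigned long objs[NR_ZS_STAT];
};

/* Same sizes as "#define NR_ZS_FULLNESS 4" and "#define NR_ZS_STAT 6". */
static_assert(NR_ZS_FULLNESS == 4, "four fullness groups");
static_assert(NR_ZS_STAT == 6, "four groups plus two object counters");

/* Hypothetical stand-in for class_stat_inc(): one index space, no casts. */
static void stat_inc(struct zs_size_stat *stats, enum class_stat_type type,
		     unsigned long cnt)
{
	stats->objs[type] += cnt;
}
```

Because fullness groups and object counters share one enum, a fullness_list[NR_ZS_FULLNESS] array and an objs[NR_ZS_STAT] array can both be sized from it.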

2023-02-23 23:27:31

by Minchan Kim

Subject: Re: [PATCHv2 3/6] zsmalloc: fine-grained inuse ratio based fullness grouping

On Thu, Feb 23, 2023 at 12:04:48PM +0900, Sergey Senozhatsky wrote:
> Each zspage maintains ->inuse counter which keeps track of the
> number of objects stored in the page. The ->inuse counter also
> determines the page's "fullness group" which is calculated as

zspage's

> the ratio of the "inuse" objects to the total number of objects
> the page can hold (objs_per_zspage). The closer the ->inuse

zspage

> counter is to objs_per_zspage, the better.
>
> Each size class maintains several fullness lists, that keep
> track of zspages of particular "fullness". Pages within each
> fullness list are stored in random order with regard to the
> ->inuse counter. This is because sorting the pages by ->inuse
> counter each time obj_malloc() or obj_free() is called would
> be too expensive. However, the ->inuse counter is still a
> crucial factor in many situations.
>
> For the two major zsmalloc operations, zs_malloc() and zs_compact(),
> we typically select the head page from the corresponding fullness
> list as the best candidate page. However, this assumption is not
> always accurate.
>
> For the zs_malloc() operation, the optimal candidate page should
> have the highest ->inuse counter. This is because the goal is to
> maximize the number of ZS_FULL pages and make full use of all
> allocated memory.
>
> For the zs_compact() operation, the optimal candidate page should

as source page

> have the lowest ->inuse counter. This is because compaction needs
> to move objects in use to another page before it can release the
> zspage and return its physical pages to the buddy allocator. The
> fewer objects in use, the quicker compaction can release the page.
> Additionally, compaction is measured by the number of pages it
> releases.
>
> This patch reworks the fullness grouping mechanism. Instead of
> having two groups - ZS_ALMOST_EMPTY (usage ratio below 3/4) and
> ZS_ALMOST_FULL (usage ratio above 3/4) - that result in too many
> pages being included in the ALMOST_EMPTY group for specific
> classes, size classes maintain a larger number of fullness lists
> that give strict guarantees on the minimum and maximum ->inuse
> values within each group. Each group represents a 10% change in the
> ->inuse ratio compared to neighboring groups. In essence, there
> are groups for pages with 0%, 10%, 20% usage ratios, and so on,
> up to 100%.
>
> This enhances the selection of candidate pages for both zs_malloc()
> and zs_compact(). A printout of the ->inuse counters of the first 7
> pages per (random) class fullness group:
>
> class-768 objs_per_zspage 16:
> fullness 100%: empty
> fullness 99%: empty
> fullness 90%: empty
> fullness 80%: empty
> fullness 70%: empty
> fullness 60%: 8 8 9 9 8 8 8
> fullness 50%: empty
> fullness 40%: 5 5 6 5 5 5 5
> fullness 30%: 4 4 4 4 4 4 4
> fullness 20%: 2 3 2 3 3 2 2
> fullness 10%: 1 1 1 1 1 1 1
> fullness 0%: empty
>
> The zs_malloc() function searches through the groups of pages
> starting with the one having the highest usage ratio. This means
> that it always selects a page from the group with the least
> internal fragmentation (highest usage ratio) and makes it even
> less fragmented by increasing its usage ratio.
>
> The zs_compact() function, on the other hand, begins by scanning
> the group with the highest fragmentation (lowest usage ratio) to
> locate the source page. The first available page is selected, and
> then the function moves downward to find a destination page in
> the group with the lowest internal fragmentation (highest usage
> ratio).

That's nice! I just have a few small nits below.

>
> Signed-off-by: Sergey Senozhatsky <[email protected]>
> ---
> mm/zsmalloc.c | 183 +++++++++++++++++++++++++++++---------------------
> 1 file changed, 107 insertions(+), 76 deletions(-)
>
> diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
> index 38ae8963c0eb..1a92ebe338eb 100644
> --- a/mm/zsmalloc.c
> +++ b/mm/zsmalloc.c
> @@ -127,7 +127,7 @@
> #define OBJ_INDEX_MASK ((_AC(1, UL) << OBJ_INDEX_BITS) - 1)
>
> #define HUGE_BITS 1
> -#define FULLNESS_BITS 2
> +#define FULLNESS_BITS 4
> #define CLASS_BITS 8
> #define ISOLATED_BITS 5
> #define MAGIC_VAL_BITS 8
> @@ -159,15 +159,33 @@
> #define ZS_SIZE_CLASSES (DIV_ROUND_UP(ZS_MAX_ALLOC_SIZE - ZS_MIN_ALLOC_SIZE, \
> ZS_SIZE_CLASS_DELTA) + 1)
>
> -#define ZS_EMPTY 0
> -#define ZS_ALMOST_EMPTY 1
> -#define ZS_ALMOST_FULL 2
> -#define ZS_FULL 3
> -#define ZS_OBJS_ALLOCATED 4
> -#define ZS_OBJS_INUSE 5
> -
> -#define NR_ZS_STAT 6
> -#define NR_ZS_FULLNESS 4
> +/*
> + * Pages are distinguished by the ratio of used memory (that is the ratio
> + * of ->inuse objects to all objects that page can store). For example,
> + * INUSE_RATIO_30 means that the ratio of used objects is > 20% and <= 30%.
> + *
> + * The number of fullness groups is not random. It allows us to keep
> + * difference between the least busy page in the group (minimum permitted
> + * number of ->inuse objects) and the most busy page (maximum permitted
> + * number of ->inuse objects) at a reasonable value.
> + */
> +#define ZS_INUSE_RATIO_0 0

How about keeping ZS_EMPTY and ZS_FULL, since they are used in
multiple places in the source code? It would mean less churn.

> +#define ZS_INUSE_RATIO_10 1
> +#define ZS_INUSE_RATIO_20 2
> +#define ZS_INUSE_RATIO_30 3
> +#define ZS_INUSE_RATIO_40 4
> +#define ZS_INUSE_RATIO_50 5
> +#define ZS_INUSE_RATIO_60 6
> +#define ZS_INUSE_RATIO_70 7
> +#define ZS_INUSE_RATIO_80 8
> +#define ZS_INUSE_RATIO_90 9
> +#define ZS_INUSE_RATIO_99 10

Do we really need a define for every step in the range from 10 to 99?
Can't we do this?

enum class_stat_type {
ZS_EMPTY,
/*
* There are fullness buckets between 10% - 99%.
*/
ZS_FULL = 11,
NR_ZS_FULLNESS,
ZS_OBJ_ALLOCATED = NR_ZS_FULLNESS,
ZS_OBJ_USED,
NR_ZS_STAT,
};

> +#define ZS_INUSE_RATIO_100 11


> +#define ZS_OBJS_ALLOCATED 12
> +#define ZS_OBJS_INUSE 13
> +
> +#define NR_ZS_INUSE_RATIO 12
> +#define NR_ZS_STAT 14
>
> struct zs_size_stat {
> unsigned long objs[NR_ZS_STAT];
> @@ -177,25 +195,10 @@ struct zs_size_stat {
> static struct dentry *zs_stat_root;
> #endif
>
> -/*
> - * We assign a page to ZS_ALMOST_EMPTY fullness group when:
> - * n <= N / f, where
> - * n = number of allocated objects
> - * N = total number of objects zspage can store
> - * f = fullness_threshold_frac
> - *
> - * Similarly, we assign zspage to:
> - * ZS_ALMOST_FULL when n > N / f
> - * ZS_EMPTY when n == 0
> - * ZS_FULL when n == N
> - *
> - * (see: fix_fullness_group())
> - */
> -static const int fullness_threshold_frac = 4;
> static size_t huge_class_size;
>
> struct size_class {
> - struct list_head fullness_list[NR_ZS_FULLNESS];
> + struct list_head fullness_list[NR_ZS_INUSE_RATIO];

With the enum trick, we wouldn't need this change.

> /*
> * Size of objects stored in this class. Must be multiple
> * of ZS_ALIGN.
> @@ -641,8 +644,23 @@ static int zs_stats_size_show(struct seq_file *s, void *v)
> continue;
>
> spin_lock(&pool->lock);
> - class_almost_full = zs_stat_get(class, ZS_ALMOST_FULL);
> - class_almost_empty = zs_stat_get(class, ZS_ALMOST_EMPTY);
> +
> + /*
> + * Replicate old behaviour for almost_full and almost_empty
> + * stats.
> + */
> + class_almost_full = zs_stat_get(class, ZS_INUSE_RATIO_99);
> + class_almost_full += zs_stat_get(class, ZS_INUSE_RATIO_90);
> + class_almost_full += zs_stat_get(class, ZS_INUSE_RATIO_80);
> + class_almost_full += zs_stat_get(class, ZS_INUSE_RATIO_70);

> +
> + class_almost_empty = zs_stat_get(class, ZS_INUSE_RATIO_60);
> + class_almost_empty += zs_stat_get(class, ZS_INUSE_RATIO_50);
> + class_almost_empty += zs_stat_get(class, ZS_INUSE_RATIO_40);
> + class_almost_empty += zs_stat_get(class, ZS_INUSE_RATIO_30);
> + class_almost_empty += zs_stat_get(class, ZS_INUSE_RATIO_20);
> + class_almost_empty += zs_stat_get(class, ZS_INUSE_RATIO_10);

I guess you can just use a loop here, from 1 to 6.

And then another from 7 to 10 for class_almost_full.
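
The loop form suggested here could look roughly like the sketch below. It is an illustration only: the group indexes are copied from the quoted patch, but zs_stat_sum() is a hypothetical helper, and a plain array stands in for the per-class counters that zs_stat_get() reads.

```c
#include <assert.h>

/* Group indexes copied from the quoted patch. */
enum {
	ZS_INUSE_RATIO_10 = 1,
	ZS_INUSE_RATIO_60 = 6,
	ZS_INUSE_RATIO_70 = 7,
	ZS_INUSE_RATIO_99 = 10,
	NR_ZS_STAT = 14,
};

/* Hypothetical helper: sum objs[first..last], both ends inclusive. */
static unsigned long zs_stat_sum(const unsigned long *objs, int first, int last)
{
	unsigned long sum = 0;
	int fg;

	assert(first >= 0 && last < NR_ZS_STAT);
	for (fg = first; fg <= last; fg++)
		sum += objs[fg];
	return sum;
}
```

Under these assumptions, class_almost_empty would be zs_stat_sum(objs, ZS_INUSE_RATIO_10, ZS_INUSE_RATIO_60) and class_almost_full would be zs_stat_sum(objs, ZS_INUSE_RATIO_70, ZS_INUSE_RATIO_99), replacing the ten explicit zs_stat_get() lines.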

> +
> obj_allocated = zs_stat_get(class, ZS_OBJS_ALLOCATED);
> obj_used = zs_stat_get(class, ZS_OBJS_INUSE);
> freeable = zs_can_compact(class);
> @@ -712,32 +730,30 @@ static inline void zs_pool_stat_destroy(struct zs_pool *pool)
> }
> #endif
>
> -
> /*
> * For each size class, zspages are divided into different groups
> - * depending on how "full" they are. This was done so that we could
> - * easily find empty or nearly empty zspages when we try to shrink
> - * the pool (not yet implemented). This function returns fullness
> + * depending on their usage ratio. This function returns fullness
> * status of the given page.
> */
> static int get_fullness_group(struct size_class *class, struct zspage *zspage)
> {
> - int inuse, objs_per_zspage;
> - int fg;
> + int inuse, objs_per_zspage, ratio;
>
> inuse = get_zspage_inuse(zspage);
> objs_per_zspage = class->objs_per_zspage;
>
> if (inuse == 0)
> - fg = ZS_EMPTY;
> - else if (inuse == objs_per_zspage)
> - fg = ZS_FULL;
> - else if (inuse <= 3 * objs_per_zspage / fullness_threshold_frac)
> - fg = ZS_ALMOST_EMPTY;
> - else
> - fg = ZS_ALMOST_FULL;
> + return ZS_INUSE_RATIO_0;
> + if (inuse == objs_per_zspage)
> + return ZS_INUSE_RATIO_100;
>
> - return fg;
> + ratio = 100 * inuse / objs_per_zspage;
> + /*
> + * Take integer division into consideration: a page with one inuse
> + * object out of 127 possible, will end up having 0 usage ratio,
> + * which is wrong as it belongs in ZS_INUSE_RATIO_10 fullness group.
> + */
> + return ratio / 10 + 1;
> }
>
> /*
> @@ -769,11 +785,11 @@ static void remove_zspage(struct size_class *class,
> /*
> * Each size class maintains zspages in different fullness groups depending
> * on the number of live objects they contain. When allocating or freeing
> - * objects, the fullness status of the page can change, say, from ALMOST_FULL
> - * to ALMOST_EMPTY when freeing an object. This function checks if such
> - * a status change has occurred for the given page and accordingly moves the
> - * page from the freelist of the old fullness group to that of the new
> - * fullness group.
> + * objects, the fullness status of the page can change, for instance, from
> + * INUSE_RATIO_80 to INUSE_RATIO_70 when freeing an object. This function
> + * checks if such a status change has occurred for the given page and
> + * accordingly moves the page from the list of the old fullness group to that
> + * of the new fullness group.
> */
> static int fix_fullness_group(struct size_class *class, struct zspage *zspage)
> {
> @@ -959,7 +975,7 @@ static void __free_zspage(struct zs_pool *pool, struct size_class *class,
> assert_spin_locked(&pool->lock);
>
> VM_BUG_ON(get_zspage_inuse(zspage));
> - VM_BUG_ON(fg != ZS_EMPTY);
> + VM_BUG_ON(fg != ZS_INUSE_RATIO_0);
>
> /* Free all deferred handles from zs_free */
> free_handles(pool, class, zspage);
> @@ -998,7 +1014,7 @@ static void free_zspage(struct zs_pool *pool, struct size_class *class,
> return;
> }
>
> - remove_zspage(class, zspage, ZS_EMPTY);
> + remove_zspage(class, zspage, ZS_INUSE_RATIO_0);
> #ifdef CONFIG_ZPOOL
> list_del(&zspage->lru);
> #endif
> @@ -1134,9 +1150,9 @@ static struct zspage *find_get_zspage(struct size_class *class)
> int i;
> struct zspage *zspage;
>
> - for (i = ZS_ALMOST_FULL; i >= ZS_EMPTY; i--) {
> + for (i = ZS_INUSE_RATIO_99; i >= ZS_INUSE_RATIO_0; i--) {
> zspage = list_first_entry_or_null(&class->fullness_list[i],
> - struct zspage, list);
> + struct zspage, list);
> if (zspage)
> break;
> }
> @@ -1629,7 +1645,7 @@ void zs_free(struct zs_pool *pool, unsigned long handle)
> obj_free(class->size, obj, NULL);
>
> fullness = fix_fullness_group(class, zspage);
> - if (fullness == ZS_EMPTY)
> + if (fullness == ZS_INUSE_RATIO_0)
> free_zspage(pool, class, zspage);
>
> spin_unlock(&pool->lock);
> @@ -1811,22 +1827,33 @@ static int migrate_zspage(struct zs_pool *pool, struct size_class *class,
> return ret;
> }
>
> -static struct zspage *isolate_zspage(struct size_class *class, bool source)
> +static struct zspage *isolate_src_zspage(struct size_class *class)
> {
> - int i;
> struct zspage *zspage;
> - int fg[2] = {ZS_ALMOST_EMPTY, ZS_ALMOST_FULL};
> + int fg;
>
> - if (!source) {
> - fg[0] = ZS_ALMOST_FULL;
> - fg[1] = ZS_ALMOST_EMPTY;
> + for (fg = ZS_INUSE_RATIO_10; fg <= ZS_INUSE_RATIO_99; fg++) {
> + zspage = list_first_entry_or_null(&class->fullness_list[fg],
> + struct zspage, list);
> + if (zspage) {
> + remove_zspage(class, zspage, fg);
> + return zspage;
> + }
> }
>
> - for (i = 0; i < 2; i++) {
> - zspage = list_first_entry_or_null(&class->fullness_list[fg[i]],
> - struct zspage, list);
> + return zspage;
> +}
> +
> +static struct zspage *isolate_dst_zspage(struct size_class *class)
> +{
> + struct zspage *zspage;
> + int fg;
> +
> + for (fg = ZS_INUSE_RATIO_99; fg >= ZS_INUSE_RATIO_10; fg--) {
> + zspage = list_first_entry_or_null(&class->fullness_list[fg],
> + struct zspage, list);
> if (zspage) {
> - remove_zspage(class, zspage, fg[i]);
> + remove_zspage(class, zspage, fg);
> return zspage;
> }
> }
> @@ -2119,7 +2146,7 @@ static void async_free_zspage(struct work_struct *work)
> continue;
>
> spin_lock(&pool->lock);
> - list_splice_init(&class->fullness_list[ZS_EMPTY], &free_pages);
> + list_splice_init(&class->fullness_list[ZS_INUSE_RATIO_0], &free_pages);
> spin_unlock(&pool->lock);
> }
>
> @@ -2128,7 +2155,7 @@ static void async_free_zspage(struct work_struct *work)
> lock_zspage(zspage);
>
> get_zspage_mapping(zspage, &class_idx, &fullness);
> - VM_BUG_ON(fullness != ZS_EMPTY);
> + VM_BUG_ON(fullness != ZS_INUSE_RATIO_0);
> class = pool->size_class[class_idx];
> spin_lock(&pool->lock);
> #ifdef CONFIG_ZPOOL
> @@ -2201,7 +2228,7 @@ static unsigned long __zs_compact(struct zs_pool *pool,
> * as well as zpage allocation/free
> */
> spin_lock(&pool->lock);
> - while ((src_zspage = isolate_zspage(class, true))) {
> + while ((src_zspage = isolate_src_zspage(class))) {
> /* protect someone accessing the zspage(i.e., zs_map_object) */
> migrate_write_lock(src_zspage);
>
> @@ -2211,7 +2238,7 @@ static unsigned long __zs_compact(struct zs_pool *pool,
> cc.obj_idx = 0;
> cc.s_page = get_first_page(src_zspage);
>
> - while ((dst_zspage = isolate_zspage(class, false))) {
> + while ((dst_zspage = isolate_dst_zspage(class))) {
> migrate_write_lock_nested(dst_zspage);
>
> cc.d_page = get_first_page(dst_zspage);
> @@ -2236,7 +2263,7 @@ static unsigned long __zs_compact(struct zs_pool *pool,
> putback_zspage(class, dst_zspage);
> migrate_write_unlock(dst_zspage);
>
> - if (putback_zspage(class, src_zspage) == ZS_EMPTY) {
> + if (putback_zspage(class, src_zspage) == ZS_INUSE_RATIO_0) {
> migrate_write_unlock(src_zspage);
> free_zspage(pool, class, src_zspage);
> pages_freed += class->pages_per_zspage;
> @@ -2394,7 +2421,7 @@ struct zs_pool *zs_create_pool(const char *name)
> int pages_per_zspage;
> int objs_per_zspage;
> struct size_class *class;
> - int fullness = 0;
> + int fullness;
>
> size = ZS_MIN_ALLOC_SIZE + i * ZS_SIZE_CLASS_DELTA;
> if (size > ZS_MAX_ALLOC_SIZE)
> @@ -2448,9 +2475,12 @@ struct zs_pool *zs_create_pool(const char *name)
> class->pages_per_zspage = pages_per_zspage;
> class->objs_per_zspage = objs_per_zspage;
> pool->size_class[i] = class;
> - for (fullness = ZS_EMPTY; fullness < NR_ZS_FULLNESS;
> - fullness++)
> +
> + fullness = ZS_INUSE_RATIO_0;
> + while (fullness < NR_ZS_INUSE_RATIO) {
> INIT_LIST_HEAD(&class->fullness_list[fullness]);
> + fullness++;
> + }
>
> prev_class = class;
> }
> @@ -2496,11 +2526,12 @@ void zs_destroy_pool(struct zs_pool *pool)
> if (class->index != i)
> continue;
>
> - for (fg = ZS_EMPTY; fg < NR_ZS_FULLNESS; fg++) {
> - if (!list_empty(&class->fullness_list[fg])) {
> - pr_info("Freeing non-empty class with size %db, fullness group %d\n",
> - class->size, fg);
> - }
> + for (fg = ZS_INUSE_RATIO_0; fg < NR_ZS_INUSE_RATIO; fg++) {
> + if (list_empty(&class->fullness_list[fg]))
> + continue;
> +
> + pr_err("Class-%d fullness group %d is not empty\n",
> + class->size, fg);
> }
> kfree(class);
> }
> @@ -2672,7 +2703,7 @@ static int zs_reclaim_page(struct zs_pool *pool, unsigned int retries)
> * while the page is removed from the pool. Fix it
> * up for the check in __free_zspage().
> */
> - zspage->fullness = ZS_EMPTY;
> + zspage->fullness = ZS_INUSE_RATIO_0;
>
> __free_zspage(pool, class, zspage);
> spin_unlock(&pool->lock);
> --
> 2.39.2.637.g21b0678d19-goog
>
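
To make the group arithmetic in the quoted get_fullness_group() easier to follow, here is a stand-alone sketch of the same computation. The function name fullness_group() and its plain int parameters are assumptions made for illustration; the kernel version takes a size_class and a zspage.

```c
#include <assert.h>

/* Endpoints and two sample groups, numbered as in the quoted patch. */
enum {
	ZS_INUSE_RATIO_0   = 0,
	ZS_INUSE_RATIO_10  = 1,
	ZS_INUSE_RATIO_60  = 6,
	ZS_INUSE_RATIO_99  = 10,
	ZS_INUSE_RATIO_100 = 11,
};

/* Stand-alone version of the mapping from ->inuse to a fullness group. */
static int fullness_group(int inuse, int objs_per_zspage)
{
	int ratio;

	assert(objs_per_zspage > 0 && inuse >= 0 && inuse <= objs_per_zspage);
	if (inuse == 0)
		return ZS_INUSE_RATIO_0;
	if (inuse == objs_per_zspage)
		return ZS_INUSE_RATIO_100;

	/* Integer percentage, rounded down. */
	ratio = 100 * inuse / objs_per_zspage;
	/*
	 * The "+ 1" keeps nearly empty pages out of group 0: one object
	 * out of 127 has ratio 0 but belongs in ZS_INUSE_RATIO_10.
	 */
	return ratio / 10 + 1;
}
```

For the class-768 printout above (16 objects per zspage), a zspage with 8 or 9 objects in use maps to group 6, which is why those values appear under "fullness 60%". Note that with this formula a ratio of exactly 50% already lands in the 60% group.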

2023-02-23 23:33:32

by Yosry Ahmed

Subject: Re: [PATCHv2 2/6] zsmalloc: remove stat and fullness enums

On Thu, Feb 23, 2023 at 3:11 PM Minchan Kim <[email protected]> wrote:
>
> On Thu, Feb 23, 2023 at 12:04:47PM +0900, Sergey Senozhatsky wrote:
> > The fullness_group enum is nested (sub-enum) within the
> > class_stat_type enum. zsmalloc requires the values in both
> > enums to match, because zsmalloc passes these values to
> > generic functions, e.g. class_stat_inc() and class_stat_dec(),
> > after casting them to integers.
> >
> > Replace these enums (and enum nesting) and use simple defines
> > instead. Also rename some of zsmalloc stats defines, as they
> > sort of clash with zspage object tags.
> >
> > Suggested-by: Yosry Ahmed <[email protected]>
> > Signed-off-by: Sergey Senozhatsky <[email protected]>
> > ---
> > mm/zsmalloc.c | 104 ++++++++++++++++++++++----------------------------
> > 1 file changed, 45 insertions(+), 59 deletions(-)
> >
> > diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
> > index b57a89ed6f30..38ae8963c0eb 100644
> > --- a/mm/zsmalloc.c
> > +++ b/mm/zsmalloc.c
> > @@ -159,26 +159,18 @@
> > #define ZS_SIZE_CLASSES (DIV_ROUND_UP(ZS_MAX_ALLOC_SIZE - ZS_MIN_ALLOC_SIZE, \
> > ZS_SIZE_CLASS_DELTA) + 1)
> >
> > -enum fullness_group {
> > - ZS_EMPTY,
> > - ZS_ALMOST_EMPTY,
> > - ZS_ALMOST_FULL,
> > - ZS_FULL,
> > - NR_ZS_FULLNESS,
> > -};
> > +#define ZS_EMPTY 0
> > +#define ZS_ALMOST_EMPTY 1
> > +#define ZS_ALMOST_FULL 2
> > +#define ZS_FULL 3
> > +#define ZS_OBJS_ALLOCATED 4
> > +#define ZS_OBJS_INUSE 5
> >
> > -enum class_stat_type {
> > - CLASS_EMPTY,
> > - CLASS_ALMOST_EMPTY,
> > - CLASS_ALMOST_FULL,
> > - CLASS_FULL,
> > - OBJ_ALLOCATED,
> > - OBJ_USED,
> > - NR_ZS_STAT_TYPE,
> > -};
> > +#define NR_ZS_STAT 6
> > +#define NR_ZS_FULLNESS 4
>
> Using define list instead of enum list looks like going backward. :)
>
> Why can't we do this?
>
> enum class_stat_type {
> ZS_EMPTY,
> ZS_ALMOST_EMPTY,
> ZS_ALMOST_FULL,
> ZS_FULL,
> NR_ZS_FULLNESS,
> ZS_OBJ_ALLOCATED = NR_ZS_FULLNESS,
> ZS_OBJ_USED,
> NR_ZS_STAT,
> }

Right, I suggested getting rid of the extra enums, so merging them
into 1 is great.

>
>
> };
> >
> > struct zs_size_stat {
> > - unsigned long objs[NR_ZS_STAT_TYPE];
> > + unsigned long objs[NR_ZS_STAT];
> > };
> >

2023-02-23 23:46:32

by Minchan Kim

Subject: Re: [PATCHv2 4/6] zsmalloc: rework compaction algorithm

On Thu, Feb 23, 2023 at 12:04:49PM +0900, Sergey Senozhatsky wrote:
> The zsmalloc compaction algorithm has the potential to
> waste some CPU cycles, particularly when compacting pages
> within the same fullness group. This is due to the way it
> selects the head page of the fullness list for source and
> destination pages, and how it reinserts those pages during
> each iteration. The algorithm may first use a page as a
> migration destination and then as a migration source,
> leading to an unnecessary back-and-forth movement of
> objects.
>
> Consider the following fullness list:
>
> PageA PageB PageC PageD PageE
>
> During the first iteration, the compaction algorithm will
> select PageA as the source and PageB as the destination.
> All of PageA's objects will be moved to PageB, and then
> PageA will be released while PageB is reinserted into the
> fullness list.
>
> PageB PageC PageD PageE
>
> During the next iteration, the compaction algorithm will
> again select the head of the list as the source and destination,
> meaning that PageB will now serve as the source and PageC as
> the destination. This will result in the objects being moved
> away from PageB, the same objects that were just moved to PageB
> in the previous iteration.
>
> To prevent this avalanche effect, the compaction algorithm

Good point.

> should not reinsert the destination page between iterations.
> By doing so, the most optimal page will continue to be used
> and its usage ratio will increase, reducing internal
> fragmentation. The destination page should only be reinserted
> into the fullness list if:
> - It becomes full
> - No source page is available.

I think that's really better option, yeah.
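
The back-and-forth movement described above can be counted with a toy
model. This is only a sketch under stated assumptions, not the zsmalloc
code: every name prefixed with toy_ is hypothetical, pages are reduced to
plain object counters, and all objects are assumed to fit into a single
destination page.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Old scheme (toy): the list head is the source, the next page is the
 * destination, and the destination goes back to the head of the list,
 * so it becomes the source of the following iteration.
 */
static unsigned long toy_moves_reinsert(const int *inuse, size_t n)
{
	unsigned long moved = 0;
	int head_objs = n ? inuse[0] : 0;	/* objects held by the list head */

	for (size_t next = 1; next < n; next++) {
		moved += (unsigned long)head_objs;	/* head drained into next page */
		head_objs += inuse[next];		/* that page is the new head */
	}
	return moved;
}

/*
 * New scheme (toy): page 0 stays the destination between iterations,
 * so every other page is drained into it exactly once.
 */
static unsigned long toy_moves_keep_dst(const int *inuse, size_t n)
{
	unsigned long moved = 0;

	for (size_t src = 1; src < n; src++)
		moved += (unsigned long)inuse[src];
	return moved;
}
```

For five pages holding 2 objects each, the first function counts
2 + 4 + 6 + 8 = 20 moves and the second counts 2 + 2 + 2 + 2 = 8.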

>
> Signed-off-by: Sergey Senozhatsky <[email protected]>
> ---
> mm/zsmalloc.c | 82 ++++++++++++++++++++++++---------------------------
> 1 file changed, 38 insertions(+), 44 deletions(-)
>
> diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
> index 1a92ebe338eb..eacf9e32da5c 100644
> --- a/mm/zsmalloc.c
> +++ b/mm/zsmalloc.c
> @@ -1786,15 +1786,14 @@ struct zs_compact_control {
> int obj_idx;
> };
>
> -static int migrate_zspage(struct zs_pool *pool, struct size_class *class,
> - struct zs_compact_control *cc)
> +static void migrate_zspage(struct zs_pool *pool, struct size_class *class,
> + struct zs_compact_control *cc)
> {
> unsigned long used_obj, free_obj;
> unsigned long handle;
> struct page *s_page = cc->s_page;
> struct page *d_page = cc->d_page;
> int obj_idx = cc->obj_idx;
> - int ret = 0;
>
> while (1) {
> handle = find_alloced_obj(class, s_page, &obj_idx);
> @@ -1807,10 +1806,8 @@ static int migrate_zspage(struct zs_pool *pool, struct size_class *class,
> }
>
> /* Stop if there is no more space */
> - if (zspage_full(class, get_zspage(d_page))) {
> - ret = -ENOMEM;
> + if (zspage_full(class, get_zspage(d_page)))
> break;
> - }
>
> used_obj = handle_to_obj(handle);
> free_obj = obj_malloc(pool, get_zspage(d_page), handle);
> @@ -1823,8 +1820,6 @@ static int migrate_zspage(struct zs_pool *pool, struct size_class *class,
> /* Remember last position in this iteration */
> cc->s_page = s_page;
> cc->obj_idx = obj_idx;
> -
> - return ret;
> }
>
> static struct zspage *isolate_src_zspage(struct size_class *class)
> @@ -2228,57 +2223,56 @@ static unsigned long __zs_compact(struct zs_pool *pool,
> * as well as zpage allocation/free
> */
> spin_lock(&pool->lock);
> - while ((src_zspage = isolate_src_zspage(class))) {
> - /* protect someone accessing the zspage(i.e., zs_map_object) */
> - migrate_write_lock(src_zspage);
> -
> - if (!zs_can_compact(class))
> - break;
> -
> - cc.obj_idx = 0;
> - cc.s_page = get_first_page(src_zspage);
> -
> - while ((dst_zspage = isolate_dst_zspage(class))) {
> - migrate_write_lock_nested(dst_zspage);
> -
> + while (1) {

Hmm, I preferred the old loop structure. Did you see any problem
with keeping the old code structure?

Can't we just add a check for whether the destination zspage is
full after migrate_zspage() and put it back if it is? Otherwise,
keep going with the same source zspage, or with a new one once we
have completely migrated everything in the current one. If there are
no more source zspages in the list, we can break out of the loop and
then put the destination zspage back into the right fullness group
after the loop.

> + if (!dst_zspage) {
> + dst_zspage = isolate_dst_zspage(class);
> + if (!dst_zspage)
> + goto out;
> + migrate_write_lock(dst_zspage);
> cc.d_page = get_first_page(dst_zspage);
> - /*
> - * If there is no more space in dst_page, resched
> - * and see if anyone had allocated another zspage.
> - */
> - if (!migrate_zspage(pool, class, &cc))
> - break;
> + }
>
> + if (!zs_can_compact(class)) {
> putback_zspage(class, dst_zspage);
> migrate_write_unlock(dst_zspage);
> - dst_zspage = NULL;
> - if (spin_is_contended(&pool->lock))
> - break;
> + goto out;
> }
>
> - /* Stop if we couldn't find slot */
> - if (dst_zspage == NULL)
> - break;
> + src_zspage = isolate_src_zspage(class);
> + if (!src_zspage) {
> + putback_zspage(class, dst_zspage);
> + migrate_write_unlock(dst_zspage);
> + goto out;
> + }
>
> - putback_zspage(class, dst_zspage);
> - migrate_write_unlock(dst_zspage);
> + migrate_write_lock_nested(src_zspage);
> +
> + cc.obj_idx = 0;
> + cc.s_page = get_first_page(src_zspage);
> + migrate_zspage(pool, class, &cc);
>
> if (putback_zspage(class, src_zspage) == ZS_INUSE_RATIO_0) {
> migrate_write_unlock(src_zspage);
> free_zspage(pool, class, src_zspage);
> pages_freed += class->pages_per_zspage;
> - } else
> + } else {
> migrate_write_unlock(src_zspage);
> - spin_unlock(&pool->lock);
> - cond_resched();
> - spin_lock(&pool->lock);
> - }
> + }
>
> - if (src_zspage) {
> - putback_zspage(class, src_zspage);
> - migrate_write_unlock(src_zspage);
> - }
> + if (get_fullness_group(class, dst_zspage) == ZS_INUSE_RATIO_100
> + || spin_is_contended(&pool->lock)) {
> + putback_zspage(class, dst_zspage);
> + migrate_write_unlock(dst_zspage);
> + dst_zspage = NULL;
> + }
>
> + if (!dst_zspage) {
> + spin_unlock(&pool->lock);
> + cond_resched();
> + spin_lock(&pool->lock);
> + }
> + }
> +out:
> spin_unlock(&pool->lock);
>
> return pages_freed;
> --
> 2.39.2.637.g21b0678d19-goog
>

2023-02-23 23:51:11

by Minchan Kim

Subject: Re: [PATCHv2 5/6] zsmalloc: extend compaction statistics

On Thu, Feb 23, 2023 at 12:04:50PM +0900, Sergey Senozhatsky wrote:
> Extend zsmalloc zs_pool_stats with a new member that
> holds the number of objects pool compaction moved
> between pool pages.

I totally understand this new stat would be very useful for your
development, but I'm not sure it's really useful for workload tuning
or monitoring.

Unless we have a strong use case, I'd like to avoid a new stat.

>
> Signed-off-by: Sergey Senozhatsky <[email protected]>
> ---
> include/linux/zsmalloc.h | 2 ++
> mm/zsmalloc.c | 1 +
> 2 files changed, 3 insertions(+)
>
> diff --git a/include/linux/zsmalloc.h b/include/linux/zsmalloc.h
> index a48cd0ffe57d..8b3fa5b4a68c 100644
> --- a/include/linux/zsmalloc.h
> +++ b/include/linux/zsmalloc.h
> @@ -36,6 +36,8 @@ enum zs_mapmode {
> struct zs_pool_stats {
> /* How many pages were migrated (freed) */
> atomic_long_t pages_compacted;
> + /* How many objects were migrated during compaction */
> + atomic_long_t objs_moved;
> };
>
> struct zs_pool;
> diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
> index eacf9e32da5c..f7e69df48fb0 100644
> --- a/mm/zsmalloc.c
> +++ b/mm/zsmalloc.c
> @@ -1815,6 +1815,7 @@ static void migrate_zspage(struct zs_pool *pool, struct size_class *class,
> obj_idx++;
> record_obj(handle, free_obj);
> obj_free(class->size, used_obj, NULL);
> + atomic_long_inc(&pool->stats.objs_moved);
> }


2023-02-23 23:53:37

by Minchan Kim

Subject: Re: [PATCHv2 0/6] zsmalloc: fine-grained fullness and new compaction algorithm

On Thu, Feb 23, 2023 at 12:04:45PM +0900, Sergey Senozhatsky wrote:
> Hi,
>
> Existing zsmalloc page fullness grouping leads to suboptimal page
> selection for both zs_malloc() and zs_compact(). This patchset
> reworks zsmalloc fullness grouping/classification.
>
> Additinally it also implements new compaction algorithm that is
> expected to use CPU-cycles (as it potentially does fewer memcpy-s
> in zs_object_copy()).
>
> TEST
> ====
>
> It's very challenging to reliably test this series. I ended up
> developing my own synthetic test that has 100% reproducibility.
> The test generates significan fragmentation (for each size class)
> and then performs compaction for each class individually and tracks
> the number of memcpy() in zs_object_copy(), so that we can compare
> the amount work compaction does on per-class basis.
>
> Total amount of work (zram mm_stat objs_moved)
> ----------------------------------------------
>
> Old fullness grouping, old compaction algorithm:
> 323977 memcpy() in zs_object_copy().
>
> Old fullness grouping, new compaction algorithm:
> 262944 memcpy() in zs_object_copy().
>
> New fullness grouping, new compaction algorithm:
> 213978 memcpy() in zs_object_copy().
>
>
> Per-class compaction memcpy() comparison (T-test)

Just out of curiosity: what's the T-test?

> -------------------------------------------------
>
> x Old fullness grouping, old compaction algorithm
> + Old fullness grouping, new compaction algorithm
>
> N Min Max Median Avg Stddev
> x 140 349 3513 2461 2314.1214 806.03271
> + 140 289 2778 2006 1878.1714 641.02073
> Difference at 95.0% confidence
> -435.95 +/- 170.595
> -18.8387% +/- 7.37193%
> (Student's t, pooled s = 728.216)
>
>
> x Old fullness grouping, old compaction algorithm
> + New fullness grouping, new compaction algorithm
>
> N Min Max Median Avg Stddev
> x 140 349 3513 2461 2314.1214 806.03271
> + 140 226 2279 1644 1528.4143 524.85268
> Difference at 95.0% confidence
> -785.707 +/- 159.331
> -33.9527% +/- 6.88516%
> (Student's t, pooled s = 680.132)

What's the difference from the result above? Did you just run it twice
to show that the results are consistent, or is this a new result based
on different testing?

Anyway, this is a really nice improvement. The comments I had in the
thread are just minor.

Thanks, Sergey!

2023-02-26 03:53:16

by Sergey Senozhatsky

Subject: Re: [PATCHv2 0/6] zsmalloc: fine-grained fullness and new compaction algorithm

On (23/02/23 15:53), Minchan Kim wrote:
> > TEST
> > ====
> >
> > It's very challenging to reliably test this series. I ended up
> > developing my own synthetic test that has 100% reproducibility.
> > The test generates significan fragmentation (for each size class)
> > and then performs compaction for each class individually and tracks
> > the number of memcpy() in zs_object_copy(), so that we can compare
> > the amount work compaction does on per-class basis.
> >
> > Total amount of work (zram mm_stat objs_moved)
> > ----------------------------------------------
> >
> > Old fullness grouping, old compaction algorithm:
> > 323977 memcpy() in zs_object_copy().
> >
> > Old fullness grouping, new compaction algorithm:
> > 262944 memcpy() in zs_object_copy().
> >
> > New fullness grouping, new compaction algorithm:
> > 213978 memcpy() in zs_object_copy().
> >
> >
> > Per-class compaction memcpy() comparison (T-test)
>
> Just curiosity: What's the T-test?

T-test is a statistical method used to compare the means
of two independent groups or samples and determine if the
difference between them is statistically significant.
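
As a worked example, the figures of the first comparison can be
re-derived from the Avg and Stddev columns (n = 140 per group). The
critical value 1.960 is an assumption here: it is the large-sample value
that happens to reproduce the reported interval.

```latex
\begin{align*}
\bar{x}_{+} - \bar{x}_{\times} &= 1878.1714 - 2314.1214 = -435.95 \\
s_p &= \sqrt{\frac{(n-1)\,s_{\times}^2 + (n-1)\,s_{+}^2}{2n-2}}
     = \sqrt{\frac{806.03271^2 + 641.02073^2}{2}} \approx 728.216 \\
\text{half-width} &= t \cdot s_p \sqrt{\frac{2}{n}}
     \approx 1.960 \cdot 728.216 \cdot \sqrt{\frac{2}{140}} \approx 170.6 \\
\frac{\bar{x}_{+} - \bar{x}_{\times}}{\bar{x}_{\times}} &= \frac{-435.95}{2314.1214} \approx -18.84\%
\end{align*}
```

The interval -435.95 +/- 170.6 does not include zero, which is what
"statistically significant at 95% confidence" means for this test.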

> > x Old fullness grouping, old compaction algorithm
> > + Old fullness grouping, new compaction algorithm
> >
> > N Min Max Median Avg Stddev
> > x 140 349 3513 2461 2314.1214 806.03271
> > + 140 289 2778 2006 1878.1714 641.02073
> > Difference at 95.0% confidence
> > -435.95 +/- 170.595
> > -18.8387% +/- 7.37193%
> > (Student's t, pooled s = 728.216)
> >
> >
> > x Old fullness grouping, old compaction algorithm
> > + New fullness grouping, new compaction algorithm
> >
> > N Min Max Median Avg Stddev
> > x 140 349 3513 2461 2314.1214 806.03271
> > + 140 226 2279 1644 1528.4143 524.85268
> > Difference at 95.0% confidence
> > -785.707 +/- 159.331
> > -33.9527% +/- 6.88516%
> > (Student's t, pooled s = 680.132)
>
> What's the different with result above? Did you just run two times and
> shows they are consistent or this is new result based on different
> testing?

The test is exactly the same. It is designed to have 0 variability: it
creates exactly the same fragmentation during each run, so we always
compare apples to apples. What changes (and hence is tested) is the
fullness grouping and the compaction algorithm.

The first one tests the effect of the new compaction algorithm alone:
old fullness grouping and old compaction algorithm VS old fullness
grouping and new compaction algorithm. The data show that with a
sufficient level of confidence (95%) we can claim that the new compaction
does make a statistically significant improvement and reduces the number
of memcpy() calls (by 18.8% in this particular case).

The second one tests the effect of the new fullness grouping and the new
compaction algorithm. The data show that with a sufficient level of
confidence we can claim that the new fullness grouping and the new
compaction do make a statistically significant improvement and reduce
the number of memcpy() calls (by 33.9% in this particular case).

2023-02-26 03:57:12

by Sergey Senozhatsky

Subject: Re: [PATCHv2 5/6] zsmalloc: extend compaction statistics

On (23/02/23 15:51), Minchan Kim wrote:
> On Thu, Feb 23, 2023 at 12:04:50PM +0900, Sergey Senozhatsky wrote:
> > Extend zsmalloc zs_pool_stats with a new member that
> > holds the number of objects pool compaction moved
> > between pool pages.
>
> I totally understand this new stat would be very useful for your
> development but not sure it's really useful for workload tune or
> monitoring.
>
> Unless we have strong usecase, I'd like to avoid new stat.

The way I see it, it *can* give some interesting additional data for
periodic compaction (the one that is not triggered by the shrinker): if
the number of moved objects is relatively high but the number of compacted
(freed) pages is relatively low, then the system has fragmentation in
small size classes (which tend to have many objects per zspage but not
too many pages per zspage), and in this case the interval between
periodic compactions can probably be increased. What do you think?
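
A minimal sketch of how a monitoring tool could apply that reasoning to
the two counters. This is an illustration only: the toy_ names and the
threshold are hypothetical and do not come from the patch, and a real
policy would need tuning against an actual workload.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical threshold, chosen only for illustration. */
#define TOY_MOVES_PER_FREED_PAGE_LIMIT	64UL

/*
 * Returns true when compaction moved many objects for each page it
 * freed, i.e. when running it less often may be worth considering.
 */
static bool toy_compaction_is_expensive(unsigned long objs_moved,
					unsigned long pages_compacted)
{
	if (objs_moved == 0)
		return false;		/* nothing was moved at all */
	if (pages_compacted == 0)
		return true;		/* moved objects, freed nothing */
	return objs_moved / pages_compacted > TOY_MOVES_PER_FREED_PAGE_LIMIT;
}
```

With 10000 moved objects and 10 freed pages the ratio is 1000 and the
function reports the run as expensive; with 100 moved objects and 50
freed pages it does not.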

2023-02-26 04:10:11

by Sergey Senozhatsky

Subject: Re: [PATCHv2 4/6] zsmalloc: rework compaction algorithm

On (23/02/23 15:46), Minchan Kim wrote:
> > spin_lock(&pool->lock);
> > - while ((src_zspage = isolate_src_zspage(class))) {
> > - /* protect someone accessing the zspage(i.e., zs_map_object) */
> > - migrate_write_lock(src_zspage);
> > -
> > - if (!zs_can_compact(class))
> > - break;
> > -
> > - cc.obj_idx = 0;
> > - cc.s_page = get_first_page(src_zspage);
> > -
> > - while ((dst_zspage = isolate_dst_zspage(class))) {
> > - migrate_write_lock_nested(dst_zspage);
> > -
> > + while (1) {
>
> Hmm, I preferred the old loop structure. Did you see any problem
> to keep old code structure?

Unfortunately we cannot keep the current structure, as it would create
conflicting/reverse locking patterns.

What we currently have is that the source page is isolated first and its
migration lock is the outer lock:

migrate_write_lock src

The destination page is isolated second and its migration lock is nested:

migrate_write_lock_nested dst

Since the destination page lock is nested, we always need to unlock it
before we unlock the outer lock (the source page migration lock). If we
kept the destination locked (a nested lock held without its outer lock,
which would be a bug), then on the next iteration we would isolate a new
source page and try to migrate_write_lock it, except that now the source
page migration lock would in fact be a nested lock taken under another
nested lock (which is another bug).

Hence we need to flip the structure: we isolate the destination page
first, its lock is the outer lock, we keep it locked for as long as we
need, and the source page lock becomes nested. I think that's the
simplest way.
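
The ordering rule can be modelled with two flags. This is a sketch of
the rule as stated in this message, not of lockdep or of the zsmalloc
locks: all toy_ names are hypothetical, and a "violation" simply means
that a lock was taken or released in an order the rule forbids.

```c
#include <assert.h>
#include <stdbool.h>

struct toy_lock_state {
	bool outer_held;
	bool nested_held;
	bool violated;
};

static void toy_lock_outer(struct toy_lock_state *s)
{
	if (s->outer_held || s->nested_held)
		s->violated = true;	/* outer lock taken under another lock */
	s->outer_held = true;
}

static void toy_lock_nested(struct toy_lock_state *s)
{
	if (!s->outer_held || s->nested_held)
		s->violated = true;	/* nested lock needs exactly the outer one */
	s->nested_held = true;
}

static void toy_unlock_nested(struct toy_lock_state *s)
{
	if (!s->nested_held)
		s->violated = true;
	s->nested_held = false;
}

static void toy_unlock_outer(struct toy_lock_state *s)
{
	if (!s->outer_held || s->nested_held)
		s->violated = true;	/* nested lock must be dropped first */
	s->outer_held = false;
}

/* Old order (source outer, destination nested) with the destination kept. */
static bool toy_keep_dst_old_order(int nr_src)
{
	struct toy_lock_state s = { false, false, false };

	for (int i = 0; i < nr_src; i++) {
		toy_lock_outer(&s);		/* source */
		if (!s.nested_held)
			toy_lock_nested(&s);	/* destination, kept locked */
		toy_unlock_outer(&s);		/* source is done */
	}
	if (s.nested_held)
		toy_unlock_nested(&s);
	return !s.violated;
}

/* Flipped order: destination outer and kept, each source nested. */
static bool toy_keep_dst_new_order(int nr_src)
{
	struct toy_lock_state s = { false, false, false };

	toy_lock_outer(&s);			/* destination */
	for (int i = 0; i < nr_src; i++) {
		toy_lock_nested(&s);		/* source */
		toy_unlock_nested(&s);
	}
	toy_unlock_outer(&s);
	return !s.violated;
}
```

In this model the old order breaks the rule as soon as the first source
is unlocked while the destination is still held, whereas the flipped
order never does.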

2023-02-26 04:38:32

by Sergey Senozhatsky

Subject: Re: [PATCHv2 3/6] zsmalloc: fine-grained inuse ratio based fullness grouping

On (23/02/23 15:27), Minchan Kim wrote:
> > + * Pages are distinguished by the ratio of used memory (that is the ratio
> > + * of ->inuse objects to all objects that page can store). For example,
> > + * INUSE_RATIO_30 means that the ratio of used objects is > 20% and <= 30%.
> > + *
> > + * The number of fullness groups is not random. It allows us to keep
> > + * diffeence between the least busy page in the group (minimum permitted
> > + * number of ->inuse objects) and the most busy page (maximum permitted
> > + * number of ->inuse objects) at a reasonable value.
> > + */
> > +#define ZS_INUSE_RATIO_0 0
>
> How about keeping ZS_EMPTY and ZS_FULL since they are used
> multiple places in source code? It would have less churning.

I have to admit that I sort of like the unified naming
"zspage inuse ratio goes from 0 to 100"

but I can keep ZS_EMPTY / ZS_FULL as two "special" inuse values.
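
For reference, the bucket boundaries from the quoted comment ("> 20% and
<= 30%" for INUSE_RATIO_30) can be sketched as a small helper. This is a
guess at the mapping based on that comment alone, not the function from
the patch; toy_fullness_group() is a hypothetical name and the value 11
for ZS_INUSE_RATIO_100 is assumed to be the index after
ZS_INUSE_RATIO_99.

```c
#include <assert.h>

#define ZS_INUSE_RATIO_0	0
#define ZS_INUSE_RATIO_100	11	/* assumed: index after ZS_INUSE_RATIO_99 */

/* Map an ->inuse count to a group index: 0, 1..10 or 11. */
static int toy_fullness_group(int inuse, int objs_per_zspage)
{
	assert(objs_per_zspage > 0);
	assert(inuse >= 0 && inuse <= objs_per_zspage);

	if (inuse == 0)
		return ZS_INUSE_RATIO_0;
	if (inuse == objs_per_zspage)
		return ZS_INUSE_RATIO_100;

	/* ceil(10 * inuse / objs_per_zspage), so that 30% still maps to 3 */
	return (10 * inuse + objs_per_zspage - 1) / objs_per_zspage;
}
```

For a class that holds 93 objects per zspage, a zspage with 36 objects
in use (about 38.7%) lands in group 4, which is ZS_INUSE_RATIO_40.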

> > +#define ZS_INUSE_RATIO_10 1
> > +#define ZS_INUSE_RATIO_20 2
> > +#define ZS_INUSE_RATIO_30 3
> > +#define ZS_INUSE_RATIO_40 4
> > +#define ZS_INUSE_RATIO_50 5
> > +#define ZS_INUSE_RATIO_60 6
> > +#define ZS_INUSE_RATIO_70 7
> > +#define ZS_INUSE_RATIO_80 8
> > +#define ZS_INUSE_RATIO_90 9
> > +#define ZS_INUSE_RATIO_99 10
>
> Do we really need all the define macro for the range from 10 to 99?
> Can't we do this?
>
> enum class_stat_type {
> ZS_EMPTY,
> /*
> * There are fullness buckets between 10% - 99%.
> */
> ZS_FULL = 11
> NR_ZS_FULLNESS,
> ZS_OBJ_ALLOCATED = NR_ZS_FULLNESS,
> ZS_OBJ_USED,
> NR_ZS_STAT,
> }

This creates undocumented secret constants, which are heavily
used (zspage fullness values, indices in the fullness list arrays,
stats array offsets, etc.) but leave no trace in the code. It also
forces us to use magic numbers in the code. So should the fullness
grouping change, things like, for instance, zs_stat_get(7) will
compile just fine yet will do something very different, and we will
need someone to spot the regression.

So yes, it's 10 lines of defines, it's not even 10 lines of code, but
1) it is documentation, we keep the constants documented
2) more importantly, it protects us from regressions and bugs

From a maintainability point of view, having everything explicitly
documented / spelled out is a win.

As for why I decided to go with defines: zspage fullness values and
class stats are two conceptually different things, they don't really
fit in one single enum, unless the enum's name is "zs_constants".
What do you think?

[..]
> > * Size of objects stored in this class. Must be multiple
> > * of ZS_ALIGN.
> > @@ -641,8 +644,23 @@ static int zs_stats_size_show(struct seq_file *s, void *v)
> > continue;
> >
> > spin_lock(&pool->lock);
> > - class_almost_full = zs_stat_get(class, ZS_ALMOST_FULL);
> > - class_almost_empty = zs_stat_get(class, ZS_ALMOST_EMPTY);
> > +
> > + /*
> > + * Replecate old behaviour for almost_full and almost_empty
> > + * stats.
> > + */
> > + class_almost_full = zs_stat_get(class, ZS_INUSE_RATIO_99);
> > + class_almost_full += zs_stat_get(class, ZS_INUSE_RATIO_90);
> > + class_almost_full += zs_stat_get(class, ZS_INUSE_RATIO_80);
> > + class_almost_full += zs_stat_get(class, ZS_INUSE_RATIO_70);
>
> > +
> > + class_almost_empty = zs_stat_get(class, ZS_INUSE_RATIO_60);
> > + class_almost_empty += zs_stat_get(class, ZS_INUSE_RATIO_50);
> > + class_almost_empty += zs_stat_get(class, ZS_INUSE_RATIO_40);
> > + class_almost_empty += zs_stat_get(class, ZS_INUSE_RATIO_30);
> > + class_almost_empty += zs_stat_get(class, ZS_INUSE_RATIO_20);
> > + class_almost_empty += zs_stat_get(class, ZS_INUSE_RATIO_10);
>
> I guess you can use just loop here from 1 to 6
>
> And then from 7 to 10 for class_almost_full.

I can change it to

for (r = ZS_INUSE_RATIO_10; r <= ZS_INUSE_RATIO_60; r++)
and
for (r = ZS_INUSE_RATIO_70; r <= ZS_INUSE_RATIO_99; r++)

which would be safer than using hard-coded numbers.

Shall we actually report per inuse ratio stats instead? I sort
of don't see too many reasons to keep that below/above 3/4 thing.
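
A self-contained sketch of the loop variant, using the split from the
quoted patch (ratios 10 to 60 counted as almost empty, 70 to 99 as almost
full). The toy_ helpers and the flat stats array are assumptions made for
illustration; the real code reads the values via zs_stat_get().

```c
#include <assert.h>

/* Values as in the quoted patch. */
#define ZS_INUSE_RATIO_10	1
#define ZS_INUSE_RATIO_60	6
#define ZS_INUSE_RATIO_70	7
#define ZS_INUSE_RATIO_99	10

/* Sum the per-group counters for groups first..last, inclusive. */
static unsigned long toy_sum_groups(const unsigned long *objs,
				    int first, int last)
{
	unsigned long sum = 0;

	for (int r = first; r <= last; r++)
		sum += objs[r];
	return sum;
}

static unsigned long toy_almost_empty(const unsigned long *objs)
{
	return toy_sum_groups(objs, ZS_INUSE_RATIO_10, ZS_INUSE_RATIO_60);
}

static unsigned long toy_almost_full(const unsigned long *objs)
{
	return toy_sum_groups(objs, ZS_INUSE_RATIO_70, ZS_INUSE_RATIO_99);
}
```

Because the bounds are named, renumbering the groups later changes the
loops together with the defines instead of silently shifting which
buckets are summed.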

2023-02-26 04:39:41

by Sergey Senozhatsky

Subject: Re: [PATCHv2 2/6] zsmalloc: remove stat and fullness enums

On (23/02/23 15:11), Minchan Kim wrote:
>
> Using define list instead of enum list looks like going backward. :)
>
> Why can't we do this?

I replied in another email, just to keep conversation in a single thread.

2023-02-26 04:41:09

by Sergey Senozhatsky

Subject: Re: [PATCHv2 1/6] zsmalloc: remove insert_zspage() ->inuse optimization

On (23/02/23 15:09), Minchan Kim wrote:
>
> On Thu, Feb 23, 2023 at 12:04:46PM +0900, Sergey Senozhatsky wrote:
> > This optimization has no effect. It only ensures that
> > when a page was added to its corresponding fullness
> > list, its "inuse" counter was higher or lower than the
> > "inuse" counter of the page at the head of the list.
> > The intention was to keep busy pages at the head, so
> > they could be filled up and moved to the ZS_FULL
> > fullness group more quickly. However, this doesn't work
> > as the "inuse" counter of a page can be modified by
>
> zspage
>
> Let's use term zspage instead of page to prevent confusing.
>
> > obj_free() but the page may still belong to the same
> > fullness list. So, fix_fullness_group() won't change
>
> Yes. I didn't expect it should be perfect from the beginning
> but would help just little optimization.
>
> > the page's position in relation to the head's "inuse"
> > counter, leading to a largely random order of pages
> > within the fullness list.
>
> Good point.
>
> >
> > For instance, consider a printout of the "inuse"
> > counters of the first 10 pages in a class that holds
> > 93 objects per zspage:
> >
> > ZS_ALMOST_EMPTY: 36 67 68 64 35 54 63 52
> >
> > As we can see the page with the lowest "inuse" counter
> > is actually the head of the fullness list.
>
> Let's write what the patch is doing cleary
>
> "So, let's remove the pointless optimization" or something better word.

ACK to all feedback (for all the patches). Thanks!

2023-02-28 22:18:03

by Minchan Kim

Subject: Re: [PATCHv2 0/6] zsmalloc: fine-grained fullness and new compaction algorithm

On Sun, Feb 26, 2023 at 12:50:45PM +0900, Sergey Senozhatsky wrote:
> On (23/02/23 15:53), Minchan Kim wrote:
> > > TEST
> > > ====
> > >
> > > It's very challenging to reliably test this series. I ended up
> > > developing my own synthetic test that has 100% reproducibility.
> > > The test generates significan fragmentation (for each size class)
> > > and then performs compaction for each class individually and tracks
> > > the number of memcpy() in zs_object_copy(), so that we can compare
> > > the amount work compaction does on per-class basis.
> > >
> > > Total amount of work (zram mm_stat objs_moved)
> > > ----------------------------------------------
> > >
> > > Old fullness grouping, old compaction algorithm:
> > > 323977 memcpy() in zs_object_copy().
> > >
> > > Old fullness grouping, new compaction algorithm:
> > > 262944 memcpy() in zs_object_copy().
> > >
> > > New fullness grouping, new compaction algorithm:
> > > 213978 memcpy() in zs_object_copy().
> > >
> > >
> > > Per-class compaction memcpy() comparison (T-test)
> >
> > Just curiosity: What's the T-test?
>
> T-test is a statistical method used to compare the means
> of two independent groups or samples and determine if the
> difference between them is statistically significant.
>
> > > x Old fullness grouping, old compaction algorithm
> > > + Old fullness grouping, new compaction algorithm
> > >
> > > N Min Max Median Avg Stddev
> > > x 140 349 3513 2461 2314.1214 806.03271
> > > + 140 289 2778 2006 1878.1714 641.02073
> > > Difference at 95.0% confidence
> > > -435.95 +/- 170.595
> > > -18.8387% +/- 7.37193%
> > > (Student's t, pooled s = 728.216)
> > >
> > >
> > > x Old fullness grouping, old compaction algorithm
> > > + New fullness grouping, new compaction algorithm
> > >
> > > N Min Max Median Avg Stddev
> > > x 140 349 3513 2461 2314.1214 806.03271
> > > + 140 226 2279 1644 1528.4143 524.85268
> > > Difference at 95.0% confidence
> > > -785.707 +/- 159.331
> > > -33.9527% +/- 6.88516%
> > > (Student's t, pooled s = 680.132)
> >
> > What's the different with result above? Did you just run two times and
> > shows they are consistent or this is new result based on different
> > testing?
>
> The test is exactly the same, it is designed to have 0 variability, it
> creates exactly same fragmentation during each run, so we always compare
> apples to apples. What is being changed (and hence tested) are fullness
> grouping and compaction algorithm.
>
> The first one tests the effect of new compaction algorithm alone:
> old fullness grouping and old compaction algorithm VS old fullness
> grouping and new compaction algorithm. The data show that with
> sufficient level of confidence (95%) we can claim that new compaction
> does make a statstically significant improvement and reduce the number
> of memcpy() calls (by 18.3% in this particular case).
>
> The second one tests the effect of new fullness grouping and new
> compaction algorithm. The data show that with sufficient level of
> confidence we can claim that new fullness grouping and new compaction
> do make a statstically significant improvement and reduce the number
> of memcpy() calls (by 33.9% in this particular case).

Thanks for the explanation, Sergey.

Please include the test result data in the description of the patch
that makes the significant change behind it, as well as in the cover
letter.

Otherwise, zsmalloc-remove-insert_zspage-inuse-optimization.patch
carries all the data now, but that patch didn't make such an improvement.

2023-02-28 22:20:50

by Minchan Kim

Subject: Re: [PATCHv2 5/6] zsmalloc: extend compaction statistics

On Sun, Feb 26, 2023 at 12:55:45PM +0900, Sergey Senozhatsky wrote:
> On (23/02/23 15:51), Minchan Kim wrote:
> > On Thu, Feb 23, 2023 at 12:04:50PM +0900, Sergey Senozhatsky wrote:
> > > Extend zsmalloc zs_pool_stats with a new member that
> > > holds the number of objects pool compaction moved
> > > between pool pages.
> >
> > I totally understand this new stat would be very useful for your
> > development but not sure it's really useful for workload tune or
> > monitoring.
> >
> > Unless we have strong usecase, I'd like to avoid new stat.
>
> The way I see is that it *can* give some interesting additional data to
> periodical compaction (the one is not triggeed by the shrinker): if the
> number of moves objects is relatively high but the number of comapcted
> (feeed) pages is relatively low then the system has fragmentation in
> small size classes (that tend to have many objects per zspage but not
> too many pages per zspage) and in this case the interval between
> periodical compactions probably can be increased. What do you think?

In that case, how could we get only the data triggered by periodic
manual compaction?

2023-02-28 22:54:00

by Minchan Kim

Subject: Re: [PATCHv2 3/6] zsmalloc: fine-grained inuse ratio based fullness grouping

On Sun, Feb 26, 2023 at 01:38:22PM +0900, Sergey Senozhatsky wrote:
> On (23/02/23 15:27), Minchan Kim wrote:
> > > + * Pages are distinguished by the ratio of used memory (that is the ratio
> > > + * of ->inuse objects to all objects that page can store). For example,
> > > + * INUSE_RATIO_30 means that the ratio of used objects is > 20% and <= 30%.
> > > + *
> > > + * The number of fullness groups is not random. It allows us to keep
> > > + * diffeence between the least busy page in the group (minimum permitted
> > > + * number of ->inuse objects) and the most busy page (maximum permitted
> > > + * number of ->inuse objects) at a reasonable value.
> > > + */
> > > +#define ZS_INUSE_RATIO_0 0
> >
> > How about keeping ZS_EMPTY and ZS_FULL since they are used
> > multiple places in source code? It would have less churning.
>
> I have to admit that I sort of like the unified naming
> "zspage inuse ratio goes from 0 to 100"
>
> but I can keep ZS_EMPTY / ZS_FULL as two "special" inuse values.
>
> > > +#define ZS_INUSE_RATIO_10 1
> > > +#define ZS_INUSE_RATIO_20 2
> > > +#define ZS_INUSE_RATIO_30 3
> > > +#define ZS_INUSE_RATIO_40 4
> > > +#define ZS_INUSE_RATIO_50 5
> > > +#define ZS_INUSE_RATIO_60 6
> > > +#define ZS_INUSE_RATIO_70 7
> > > +#define ZS_INUSE_RATIO_80 8
> > > +#define ZS_INUSE_RATIO_90 9
> > > +#define ZS_INUSE_RATIO_99 10
> >
> > Do we really need all the define macro for the range from 10 to 99?
> > Can't we do this?
> >
> > enum class_stat_type {
> > ZS_EMPTY,
> > /*
> > * There are fullness buckets between 10% - 99%.
> > */
> > ZS_FULL = 11
> > NR_ZS_FULLNESS,
> > ZS_OBJ_ALLOCATED = NR_ZS_FULLNESS,
> > ZS_OBJ_USED,
> > NR_ZS_STAT,
> > }
>
> This creates undocumented secret constats, which are being heavily
> used (zspage fullness values, indeces in fullness lists arrays,
> stats array offsets, etc.) but have no trace in the code. And this
> also forces us to use magic number in the code. So should fullness
> grouping change, things like, for instance, zs_stat_get(7), will
> compile just fine yet will do something very different and we will
> have someone to spot the regression.
>
> So yes, it's 10 lines of defines, it's not even 10 lines of code, but
> 1) it is documentation, we keep constats documented
> 2) more importantly, it protects us from regressions and bugs
>
> From maintinability point of view, having everything excpliticly
> documented / spelled out is a win.
>
> As of why I decided to go with defines, this is because zspage fullness
> values and class stats are two conceptually different things, they don't
> really fit in one single enum, unless enum's name is "zs_constants".
> What do you think?

Agreed. We don't need to combine them, then.
BTW, I still prefer an enum instead of 10 defines.

enum fullness_group {
ZS_EMPTY,
ZS_INUSE_RATIO_MIN,
ZS_INUSE_RATIO_ALMOST_FULL = 7,
ZS_INUSE_RATIO_MAX = 10,
ZS_FULL,
NR_ZS_FULLNESS,
};

>
> [..]
> > > * Size of objects stored in this class. Must be multiple
> > > * of ZS_ALIGN.
> > > @@ -641,8 +644,23 @@ static int zs_stats_size_show(struct seq_file *s, void *v)
> > > continue;
> > >
> > > spin_lock(&pool->lock);
> > > - class_almost_full = zs_stat_get(class, ZS_ALMOST_FULL);
> > > - class_almost_empty = zs_stat_get(class, ZS_ALMOST_EMPTY);
> > > +
> > > + /*
> > > + * Replecate old behaviour for almost_full and almost_empty
> > > + * stats.
> > > + */
> > > + class_almost_full = zs_stat_get(class, ZS_INUSE_RATIO_99);
> > > + class_almost_full += zs_stat_get(class, ZS_INUSE_RATIO_90);
> > > + class_almost_full += zs_stat_get(class, ZS_INUSE_RATIO_80);
> > > + class_almost_full += zs_stat_get(class, ZS_INUSE_RATIO_70);
> >
> > > +
> > > + class_almost_empty = zs_stat_get(class, ZS_INUSE_RATIO_60);
> > > + class_almost_empty += zs_stat_get(class, ZS_INUSE_RATIO_50);
> > > + class_almost_empty += zs_stat_get(class, ZS_INUSE_RATIO_40);
> > > + class_almost_empty += zs_stat_get(class, ZS_INUSE_RATIO_30);
> > > + class_almost_empty += zs_stat_get(class, ZS_INUSE_RATIO_20);
> > > + class_almost_empty += zs_stat_get(class, ZS_INUSE_RATIO_10);
> >
> > I guess you can use just loop here from 1 to 6
> >
> > And then from 7 to 10 for class_almost_full.
>
> I can change it to
>
> for (r = ZS_INUSE_RATIO_10; r <= ZS_INUSE_RATIO_70; r++)
> and
> for (r = ZS_INUSE_RATIO_80; r <= ZS_INUSE_RATIO_99; r++)
>
> which would be safer than using hard-coded numbers.

I didn't mean to hard-code the numbers either; I just wanted to show
the intention to use a loop.

>
> Shall we actually instead report per inuse ratio stats instead? I sort
> of don't see too many reasons to keep that below/above 3/4 thing.

Oh, yeah. Since it's debugfs, we have an excuse to break it.

2023-02-28 23:14:58

by Minchan Kim

Subject: Re: [PATCHv2 4/6] zsmalloc: rework compaction algorithm

On Thu, Feb 23, 2023 at 12:04:49PM +0900, Sergey Senozhatsky wrote:
> The zsmalloc compaction algorithm has the potential to
> waste some CPU cycles, particularly when compacting pages
> within the same fullness group. This is due to the way it
> selects the head page of the fullness list for source and
> destination pages, and how it reinserts those pages during
> each iteration. The algorithm may first use a page as a
> migration destination and then as a migration source,
> leading to an unnecessary back-and-forth movement of
> objects.
>
> Consider the following fullness list:
>
> PageA PageB PageC PageD PageE
>
> During the first iteration, the compaction algorithm will
> select PageA as the source and PageB as the destination.
> All of PageA's objects will be moved to PageB, and then
> PageA will be released while PageB is reinserted into the
> fullness list.
>
> PageB PageC PageD PageE
>
> During the next iteration, the compaction algorithm will
> again select the head of the list as the source and destination,
> meaning that PageB will now serve as the source and PageC as
> the destination. This will result in the objects being moved
> away from PageB, the same objects that were just moved to PageB
> in the previous iteration.
>
> To prevent this avalanche effect, the compaction algorithm
> should not reinsert the destination page between iterations.
> By doing so, the most optimal page will continue to be used
> and its usage ratio will increase, reducing internal
> fragmentation. The destination page should only be reinserted
> into the fullness list if:
> - It becomes full
> - No source page is available.
>
> Signed-off-by: Sergey Senozhatsky <[email protected]>
> ---
> mm/zsmalloc.c | 82 ++++++++++++++++++++++++---------------------------
> 1 file changed, 38 insertions(+), 44 deletions(-)
>
> diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
> index 1a92ebe338eb..eacf9e32da5c 100644
> --- a/mm/zsmalloc.c
> +++ b/mm/zsmalloc.c
> @@ -1786,15 +1786,14 @@ struct zs_compact_control {
> int obj_idx;
> };
>
> -static int migrate_zspage(struct zs_pool *pool, struct size_class *class,
> - struct zs_compact_control *cc)
> +static void migrate_zspage(struct zs_pool *pool, struct size_class *class,
> + struct zs_compact_control *cc)
> {
> unsigned long used_obj, free_obj;
> unsigned long handle;
> struct page *s_page = cc->s_page;
> struct page *d_page = cc->d_page;
> int obj_idx = cc->obj_idx;
> - int ret = 0;
>
> while (1) {
> handle = find_alloced_obj(class, s_page, &obj_idx);
> @@ -1807,10 +1806,8 @@ static int migrate_zspage(struct zs_pool *pool, struct size_class *class,
> }
>
> /* Stop if there is no more space */
> - if (zspage_full(class, get_zspage(d_page))) {
> - ret = -ENOMEM;
> + if (zspage_full(class, get_zspage(d_page)))
> break;
> - }
>
> used_obj = handle_to_obj(handle);
> free_obj = obj_malloc(pool, get_zspage(d_page), handle);
> @@ -1823,8 +1820,6 @@ static int migrate_zspage(struct zs_pool *pool, struct size_class *class,
> /* Remember last position in this iteration */
> cc->s_page = s_page;
> cc->obj_idx = obj_idx;
> -
> - return ret;
> }
>
> static struct zspage *isolate_src_zspage(struct size_class *class)
> @@ -2228,57 +2223,56 @@ static unsigned long __zs_compact(struct zs_pool *pool,
> * as well as zpage allocation/free
> */
> spin_lock(&pool->lock);
> - while ((src_zspage = isolate_src_zspage(class))) {
> - /* protect someone accessing the zspage(i.e., zs_map_object) */
> - migrate_write_lock(src_zspage);
> -
> - if (!zs_can_compact(class))
> - break;
> -
> - cc.obj_idx = 0;
> - cc.s_page = get_first_page(src_zspage);
> -
> - while ((dst_zspage = isolate_dst_zspage(class))) {
> - migrate_write_lock_nested(dst_zspage);
> -
> + while (1) {
> + if (!dst_zspage) {
> + dst_zspage = isolate_dst_zspage(class);
> + if (!dst_zspage)
> + goto out;
> + migrate_write_lock(dst_zspage);
> cc.d_page = get_first_page(dst_zspage);
> - /*
> - * If there is no more space in dst_page, resched
> - * and see if anyone had allocated another zspage.
> - */
> - if (!migrate_zspage(pool, class, &cc))
> - break;
> + }
>
> + if (!zs_can_compact(class)) {
> putback_zspage(class, dst_zspage);
> migrate_write_unlock(dst_zspage);
> - dst_zspage = NULL;
> - if (spin_is_contended(&pool->lock))
> - break;
> + goto out;

just break instead of goto

> }
>
> - /* Stop if we couldn't find slot */
> - if (dst_zspage == NULL)
> - break;
> + src_zspage = isolate_src_zspage(class);
> + if (!src_zspage) {
> + putback_zspage(class, dst_zspage);
> + migrate_write_unlock(dst_zspage);
> + goto out;

just break instead of goto

> + }
>
> - putback_zspage(class, dst_zspage);
> - migrate_write_unlock(dst_zspage);
> + migrate_write_lock_nested(src_zspage);
> +
> + cc.obj_idx = 0;
> + cc.s_page = get_first_page(src_zspage);
> + migrate_zspage(pool, class, &cc);
>
> if (putback_zspage(class, src_zspage) == ZS_INUSE_RATIO_0) {
> migrate_write_unlock(src_zspage);
> free_zspage(pool, class, src_zspage);
> pages_freed += class->pages_per_zspage;
> - } else
> + } else {
> migrate_write_unlock(src_zspage);

So here, migrate_write_unlock(src_zspage) is done in both branches,
so we could change it like this.

ret = putback_zspage(class, src_zspage);
migrate_write_unlock(src_zspage);
if (ret == ZS_INUSE_RATIO_0 or ZS_EMPTY) {
free_zspage();
xxx
}


> - spin_unlock(&pool->lock);
> - cond_resched();
> - spin_lock(&pool->lock);
> - }
> + }
>
> - if (src_zspage) {
> - putback_zspage(class, src_zspage);
> - migrate_write_unlock(src_zspage);
> - }
> + if (get_fullness_group(class, dst_zspage) == ZS_INUSE_RATIO_100
> + || spin_is_contended(&pool->lock)) {
> + putback_zspage(class, dst_zspage);
> + migrate_write_unlock(dst_zspage);
> + dst_zspage = NULL;

spin_unlock(&pool->lock);
cond_resched();
spin_lock(&pool->lock);
> + }
>
> + if (!dst_zspage) {

Then we could remove the condition logic, here.

> + spin_unlock(&pool->lock);
> + cond_resched();
> + spin_lock(&pool->lock);
> + }
> + }
> +out:
> spin_unlock(&pool->lock);
>
> return pages_freed;

So, how about this on top of your patch?


diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
index eacf9e32da5c..4dfc910f5d89 100644
--- a/mm/zsmalloc.c
+++ b/mm/zsmalloc.c
@@ -2223,40 +2223,33 @@ static unsigned long __zs_compact(struct zs_pool *pool,
* as well as zpage allocation/free
*/
spin_lock(&pool->lock);
- while (1) {
+ while (zs_can_compact(class)) {
+ int ret;
+
if (!dst_zspage) {
dst_zspage = isolate_dst_zspage(class);
if (!dst_zspage)
- goto out;
+ break;
migrate_write_lock(dst_zspage);
cc.d_page = get_first_page(dst_zspage);
}

- if (!zs_can_compact(class)) {
- putback_zspage(class, dst_zspage);
- migrate_write_unlock(dst_zspage);
- goto out;
- }
-
src_zspage = isolate_src_zspage(class);
- if (!src_zspage) {
- putback_zspage(class, dst_zspage);
- migrate_write_unlock(dst_zspage);
- goto out;
- }
+ if (!src_zspage)
+ break;

migrate_write_lock_nested(src_zspage);
-
cc.obj_idx = 0;
cc.s_page = get_first_page(src_zspage);
+
migrate_zspage(pool, class, &cc);
+ ret = putback_zspage(class, src_zspage);
+ migrate_write_unlock(src_zspage);

- if (putback_zspage(class, src_zspage) == ZS_INUSE_RATIO_0) {
- migrate_write_unlock(src_zspage);
+ if (ret == ZS_INUSE_RATIO_0) {
free_zspage(pool, class, src_zspage);
pages_freed += class->pages_per_zspage;
- } else {
- migrate_write_unlock(src_zspage);
+ src_zspage = NULL;
}

if (get_fullness_group(class, dst_zspage) == ZS_INUSE_RATIO_100
@@ -2264,14 +2257,22 @@ static unsigned long __zs_compact(struct zs_pool *pool,
putback_zspage(class, dst_zspage);
migrate_write_unlock(dst_zspage);
dst_zspage = NULL;
- }

- if (!dst_zspage) {
spin_unlock(&pool->lock);
cond_resched();
spin_lock(&pool->lock);
}
}
+
+ if (src_zspage) {
+ putback_zspage(class, src_zspage);
+ migrate_write_unlock(src_zspage);
+ }
+
+ if (dst_zspage) {
+ putback_zspage(class, dst_zspage);
+ migrate_write_unlock(dst_zspage);
+ }
out:
spin_unlock(&pool->lock);
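
To see why holding on to the destination page saves copies, here is a toy userspace model of a single fullness list. It is a sketch under loud assumptions, not the kernel code: pages are plain counters, both source and destination come from the head of the same list (the case the commit message describes), full pages simply leave the list, and the zs_can_compact() check and all locking are left out. All names (toy_list, compact_old, compact_new) are made up for this illustration.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_PAGES 16

/* Toy model of one fullness list; index 0 is the list head. */
struct toy_list {
	int inuse[MAX_PAGES];	/* objects stored in each page */
	size_t nr;		/* pages currently on the list */
	int cap;		/* objects a page can hold */
};

/* Drop page i from the list (freed or moved to the "full" group). */
static void toy_remove(struct toy_list *l, size_t i)
{
	assert(i < l->nr);
	for (; i + 1 < l->nr; i++)
		l->inuse[i] = l->inuse[i + 1];
	l->nr--;
}

/* Move as many objects as fit from page s to page d; return the count. */
static int toy_migrate(struct toy_list *l, size_t s, size_t d)
{
	int room = l->cap - l->inuse[d];
	int n = l->inuse[s] < room ? l->inuse[s] : room;

	l->inuse[s] -= n;
	l->inuse[d] += n;
	return n;
}

/*
 * Old scheme: the head is the source, its neighbour the destination,
 * and the destination is put back after every iteration, so it becomes
 * the next source.
 */
static int compact_old(struct toy_list l)
{
	int moved = 0;

	while (l.nr >= 2) {
		moved += toy_migrate(&l, 0, 1);
		if (l.inuse[1] == l.cap)
			toy_remove(&l, 1);	/* full page leaves the list */
		if (l.inuse[0] == 0)
			toy_remove(&l, 0);	/* empty source is freed */
	}
	return moved;
}

/*
 * New scheme: the head is held as the destination across iterations
 * and is only replaced once it is full.
 */
static int compact_new(struct toy_list l)
{
	int moved = 0;

	while (l.nr >= 2) {
		moved += toy_migrate(&l, 1, 0);
		if (l.inuse[1] == 0)
			toy_remove(&l, 1);	/* empty source is freed */
		if (l.inuse[0] == l.cap)
			toy_remove(&l, 0);	/* pick a new destination */
	}
	return moved;
}
```

For five pages holding one object each, with room for four objects per page, the old scheme copies 6 objects in this model and the new one copies 3; both end with one full page and one page holding a single object.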


2023-03-01 03:47:51

by Sergey Senozhatsky

Subject: Re: [PATCHv2 4/6] zsmalloc: rework compaction algorithm

On (23/02/28 15:14), Minchan Kim wrote:
> So, how about this on top of your patch?
>

Looks good. Let me pick it up.

> diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
> index eacf9e32da5c..4dfc910f5d89 100644
> --- a/mm/zsmalloc.c
> +++ b/mm/zsmalloc.c
> @@ -2223,40 +2223,33 @@ static unsigned long __zs_compact(struct zs_pool *pool,
> * as well as zpage allocation/free
> */
> spin_lock(&pool->lock);
> - while (1) {
> + while (zs_can_compact(class)) {
> + int ret;
> +
> if (!dst_zspage) {
> dst_zspage = isolate_dst_zspage(class);
> if (!dst_zspage)
> - goto out;
> + break;
> migrate_write_lock(dst_zspage);
> cc.d_page = get_first_page(dst_zspage);
> }
>
> - if (!zs_can_compact(class)) {
> - putback_zspage(class, dst_zspage);
> - migrate_write_unlock(dst_zspage);
> - goto out;
> - }
> -
> src_zspage = isolate_src_zspage(class);
> - if (!src_zspage) {
> - putback_zspage(class, dst_zspage);
> - migrate_write_unlock(dst_zspage);
> - goto out;
> - }
> + if (!src_zspage)
> + break;
>
> migrate_write_lock_nested(src_zspage);
> -
> cc.obj_idx = 0;
> cc.s_page = get_first_page(src_zspage);
> +
> migrate_zspage(pool, class, &cc);
> + ret = putback_zspage(class, src_zspage);
> + migrate_write_unlock(src_zspage);
>
> - if (putback_zspage(class, src_zspage) == ZS_INUSE_RATIO_0) {
> - migrate_write_unlock(src_zspage);
> + if (ret == ZS_INUSE_RATIO_0) {
> free_zspage(pool, class, src_zspage);
> pages_freed += class->pages_per_zspage;
> - } else {
> - migrate_write_unlock(src_zspage);
> + src_zspage = NULL;
> }
>
> if (get_fullness_group(class, dst_zspage) == ZS_INUSE_RATIO_100
> @@ -2264,14 +2257,22 @@ static unsigned long __zs_compact(struct zs_pool *pool,
> putback_zspage(class, dst_zspage);
> migrate_write_unlock(dst_zspage);
> dst_zspage = NULL;
> - }
>
> - if (!dst_zspage) {
> spin_unlock(&pool->lock);
> cond_resched();
> spin_lock(&pool->lock);
> }
> }
> +
> + if (src_zspage) {
> + putback_zspage(class, src_zspage);
> + migrate_write_unlock(src_zspage);
> + }
> +
> + if (dst_zspage) {
> + putback_zspage(class, dst_zspage);
> + migrate_write_unlock(dst_zspage);
> + }
> out:
> spin_unlock(&pool->lock);
>

2023-03-01 03:55:05

by Sergey Senozhatsky

Subject: Re: [PATCHv2 5/6] zsmalloc: extend compaction statistics

On (23/02/28 14:20), Minchan Kim wrote:
> On Sun, Feb 26, 2023 at 12:55:45PM +0900, Sergey Senozhatsky wrote:
> > On (23/02/23 15:51), Minchan Kim wrote:
> > > On Thu, Feb 23, 2023 at 12:04:50PM +0900, Sergey Senozhatsky wrote:
> > > > Extend zsmalloc zs_pool_stats with a new member that
> > > > holds the number of objects pool compaction moved
> > > > between pool pages.
> > >
> > > I totally understand this new stat would be very useful for your
> > > development but not sure it's really useful for workload tune or
> > > monitoring.
> > >
> > > Unless we have strong usecase, I'd like to avoid new stat.
> >
> > The way I see is that it *can* give some interesting additional data to
> > periodical compaction (the one that is not triggered by the shrinker): if the
> > number of moved objects is relatively high but the number of compacted
> > (freed) pages is relatively low then the system has fragmentation in
> > small size classes (that tend to have many objects per zspage but not
> > too many pages per zspage) and in this case the interval between
> > periodical compactions probably can be increased. What do you think?
>
> In that case, how could we get only the data triggered by periodical manual
> compaction?

Something very simple like

read zram mm_stat
trigger compaction
read zram mm_stat

can work in most cases, I guess. There can be memory pressure
and shrinkers can compact the pool concurrently, in which case
mm_stat will include shrinker impact, but that's probably not
a problem. If system is under memory pressure then user space
in general does not have to do compaction, since the kernel will
handle it.

Just an idea. It feels like "pages compacted" on its own tells very
little, but I don't insist on exporting that new stat.
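
The read / trigger / read sequence above boils down to diffing one column of two mm_stat snapshots. A minimal userspace sketch follows; it only parses strings and leaves reading /sys/block/zram0/mm_stat and writing to /sys/block/zram0/compact to the caller (the zram0 device name is an assumption). Per the zram documentation the 7th column (index 6) is pages_compacted; the column of the proposed objects-moved counter is not fixed by this thread, so the index is a parameter. The helper names are made up for this sketch.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/*
 * Extract the idx-th (0-based) whitespace-separated number from one
 * mm_stat line. Returns 0 on success, -1 if the field is missing or
 * does not fit in unsigned long long.
 */
static int mm_stat_field(const char *line, int idx, unsigned long long *out)
{
	const char *p = line;
	char *end;

	assert(line && out);
	for (int i = 0; ; i++) {
		errno = 0;
		unsigned long long v = strtoull(p, &end, 10);

		if (end == p || errno)
			return -1;	/* ran out of fields, or overflow */
		if (i == idx) {
			*out = v;
			return 0;
		}
		p = end;
	}
}

/*
 * Difference of one column between a snapshot taken before and one
 * taken after a compaction run. Assumes the counter only grows.
 */
static int mm_stat_delta(const char *before, const char *after, int idx,
			 unsigned long long *delta)
{
	unsigned long long b, a;

	if (mm_stat_field(before, idx, &b) || mm_stat_field(after, idx, &a))
		return -1;
	*delta = a - b;
	return 0;
}
```

As noted above, a concurrent shrinker run would be included in the delta; the sketch makes no attempt to separate the two.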

2023-03-01 03:58:00

by Sergey Senozhatsky

Subject: Re: [PATCHv2 0/6] zsmalloc: fine-grained fullness and new compaction algorithm

On (23/02/28 14:17), Minchan Kim wrote:
> Thanks for the explanation, Sergey.
>
> Please include the test results in the description of the patch
> that made the significant change, as well as in the cover letter.

OK, I can include it into the "new compaction algorithm" patch.

2023-03-01 04:05:32

by Sergey Senozhatsky

Subject: Re: [PATCHv2 3/6] zsmalloc: fine-grained inuse ratio based fullness grouping

On (23/02/28 14:53), Minchan Kim wrote:
[..]
> > As of why I decided to go with defines, this is because zspage fullness
> > values and class stats are two conceptually different things, they don't
> > really fit in one single enum, unless enum's name is "zs_constants".
> > What do you think?
>
> Agree. We don't need to combine them, then.
> BTW, I still prefer the enum instead of 10 define.
>
> enum fullness_group {
> ZS_EMPTY,
> ZS_INUSE_RATIO_MIN,
> ZS_INUSE_RATIO_ALMOST_FULL = 7,
> ZS_INUSE_RATIO_MAX = 10,
> ZS_FULL,
> NR_ZS_FULLNESS,
> }

So we keep enum nesting? Sorry, I'm not exactly following.

We have fullness values (which we use independently) and stats array
which has overlapping offsets with fullness values.

[..]
> > I can change it to
> >
> > for (r = ZS_INUSE_RATIO_10; r <= ZS_INUSE_RATIO_70; r++)
> > and
> > for (r = ZS_INUSE_RATIO_80; r <= ZS_INUSE_RATIO_99; r++)
> >
> > which would be safer than using hard-coded numbers.
>
> I didn't mean to hard-code the numbers either; I just wanted to show
> the intention to use a loop.

Got it. I just wanted to show that being very verbose (having every
constant documented) is nice :)

> >
> > Shall we actually instead report per inuse ratio stats instead? I sort
> > of don't see too many reasons to keep that below/above 3/4 thing.
>
> Oh, yeah. Since it's debugfs, we have an excuse to break it.

This was in my original patch, but I decided to put a comment and keep
the old behavior. I probably will switch to a more precise reporting
(per inuse ratio) in a separate patch, so that we can easily revert it
without any impact on new fullness grouping.

2023-03-01 08:55:55

by Sergey Senozhatsky

Subject: Re: [PATCHv2 3/6] zsmalloc: fine-grained inuse ratio based fullness grouping

On (23/02/28 14:53), Minchan Kim wrote:
> BTW, I still prefer the enum instead of 10 define.
>
> enum fullness_group {
> ZS_EMPTY,
> ZS_INUSE_RATIO_MIN,
> ZS_INUSE_RATIO_ALMOST_FULL = 7,
> ZS_INUSE_RATIO_MAX = 10,
> ZS_FULL,
> NR_ZS_FULLNESS,
> }

For educational purposes, may I ask what enums give us? We
always use integers - int:4 in zspage fullness, int for array
offsets and we cast to plain integers in get/set stats. So those
enums exist only at declaration point, and plain int otherwise.
What are the benefits over #defines?

2023-03-01 23:48:18

by Minchan Kim

Subject: Re: [PATCHv2 5/6] zsmalloc: extend compaction statistics

On Wed, Mar 01, 2023 at 12:54:56PM +0900, Sergey Senozhatsky wrote:
> On (23/02/28 14:20), Minchan Kim wrote:
> > On Sun, Feb 26, 2023 at 12:55:45PM +0900, Sergey Senozhatsky wrote:
> > > On (23/02/23 15:51), Minchan Kim wrote:
> > > > On Thu, Feb 23, 2023 at 12:04:50PM +0900, Sergey Senozhatsky wrote:
> > > > > Extend zsmalloc zs_pool_stats with a new member that
> > > > > holds the number of objects pool compaction moved
> > > > > between pool pages.
> > > >
> > > > I totally understand this new stat would be very useful for your
> > > > development but not sure it's really useful for workload tune or
> > > > monitoring.
> > > >
> > > > Unless we have strong usecase, I'd like to avoid new stat.
> > >
> > > The way I see is that it *can* give some interesting additional data to
> > > periodical compaction (the one that is not triggered by the shrinker): if the
> > > number of moved objects is relatively high but the number of compacted
> > > (freed) pages is relatively low then the system has fragmentation in
> > > small size classes (that tend to have many objects per zspage but not
> > > too many pages per zspage) and in this case the interval between
> > > periodical compactions probably can be increased. What do you think?
> >
> > In that case, how could we get only the data triggered by periodical manual
> > compaction?
>
> Something very simple like
>
> read zram mm_stat
> trigger compaction
> read zram mm_stat
>
> can work in most cases, I guess. There can be memory pressure
> and shrinkers can compact the pool concurrently, in which case
> mm_stat will include shrinker impact, but that's probably not
> a problem. If system is under memory pressure then user space

Agreed.

> in general does not have to do compaction, since the kernel will
> handle it.
>
> Just an idea. It feels like "pages compacted" on its own tells very
> little, but I don't insist on exporting that new stat.

I don't mind adding a simple metric, but I'd like to add it only if
we have a real use case, with a few comments on how it is used
in the real world.

Thanks.

2023-03-01 23:48:26

by Minchan Kim

Subject: Re: [PATCHv2 0/6] zsmalloc: fine-grained fullness and new compaction algorithm

On Wed, Mar 01, 2023 at 12:57:51PM +0900, Sergey Senozhatsky wrote:
> On (23/02/28 14:17), Minchan Kim wrote:
> > Thanks for the explanation, Sergey.
> >
> > Please include the test results in the description of the patch
> > that made the significant change, as well as in the cover letter.
>
> OK, I can include it into the "new compaction algorithm" patch.

Thanks.

2023-03-02 00:13:28

by Minchan Kim

Subject: Re: [PATCHv2 3/6] zsmalloc: fine-grained inuse ratio based fullness grouping

On Wed, Mar 01, 2023 at 01:05:20PM +0900, Sergey Senozhatsky wrote:
> On (23/02/28 14:53), Minchan Kim wrote:
> [..]
> > > As of why I decided to go with defines, this is because zspage fullness
> > > values and class stats are two conceptually different things, they don't
> > > really fit in one single enum, unless enum's name is "zs_constants".
> > > What do you think?
> >
> > Agree. We don't need to combine them, then.
> > BTW, I still prefer the enum instead of 10 define.
> >
> > enum fullness_group {
> > ZS_EMPTY,
> > ZS_INUSE_RATIO_MIN,
> > ZS_INUSE_RATIO_ALMOST_FULL = 7,
> > ZS_INUSE_RATIO_MAX = 10,
> > ZS_FULL,
> > NR_ZS_FULLNESS,
> > }
>
> So we keep enum nesting? Sorry, I'm not exactly following.

Sorry, I meant let's keep separating them since they are different
things conceptually as you mentioned.

>
> We have fullness values (which we use independently) and stats array
> which has overlapping offsets with fullness values.
>
> [..]
> > > I can change it to
> > >
> > > for (r = ZS_INUSE_RATIO_10; r <= ZS_INUSE_RATIO_70; r++)
> > > and
> > > for (r = ZS_INUSE_RATIO_80; r <= ZS_INUSE_RATIO_99; r++)
> > >
> > > which would be safer than using hard-coded numbers.
> >
> > I didn't mean to hard-code the numbers either; I just wanted to show
> > the intention to use a loop.
>
> Got it. I just wanted to show that being very verbose (having every
> constant documented) is nice :)
>
> > >
> > > Shall we actually instead report per inuse ratio stats instead? I sort
> > > of don't see too many reasons to keep that below/above 3/4 thing.
> >
> > Oh, yeah. Since it's debugfs, we have an excuse to break it.
>
> This was in my original patch, but I decided to put a comment and keep
> the old behavior. I probably will switch to a more precise reporting
> (per inuse ratio) in a separate patch, so that we can easily revert it
> without any impact on new fullness grouping.

Sounds good.

2023-03-02 00:28:32

by Minchan Kim

Subject: Re: [PATCHv2 3/6] zsmalloc: fine-grained inuse ratio based fullness grouping

On Wed, Mar 01, 2023 at 05:55:44PM +0900, Sergey Senozhatsky wrote:
> On (23/02/28 14:53), Minchan Kim wrote:
> > BTW, I still prefer the enum instead of 10 define.
> >
> > enum fullness_group {
> > ZS_EMPTY,
> > ZS_INUSE_RATIO_MIN,
> > ZS_INUSE_RATIO_ALMOST_FULL = 7,
> > ZS_INUSE_RATIO_MAX = 10,
> > ZS_FULL,
> > NR_ZS_FULLNESS,
> > }
>
> For educational purposes, may I ask what do enums give us? We
> always use integers - int:4 in zspage fullness, int for arrays
> offsets and we cast to plain integers in get/set stats. So those
> enums exist only at declaration point, and plain int otherwise.
> What are the benefits over #defines?

Well, I just didn't like the list of 12 hard-coded define values
that are never used anywhere except zs_stats_size_show(), since
I thought we could handle zs_stats_size_show() in a loop without
a specific definition for each ratio.

Furthermore, in the above example, the special ZS_INUSE_RATIO_MAX would
be derived instead of being hard-coded to 10:

ZS_INUSE_RATIO_MAX = ZS_INUSE_RATIO_MIN + ZS_INUSER_RATIO_CLASS_SIZE

so, if we want to change the ratio later, we would need only minimal
changes instead of updating every hard-coded definition.
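
A minimal sketch of the derived-constant idea, assuming ten inuse-ratio groups between "empty" and "full". ZS_INUSE_RATIO_GROUPS and fullness_of() are hypothetical names introduced here; the mapping below is one way to bucket a page and is not claimed to be the code from the patch.

```c
#include <assert.h>

/* Hypothetical knob: number of inuse-ratio groups between empty and full. */
#define ZS_INUSE_RATIO_GROUPS 10

enum fullness_group {
	ZS_EMPTY,
	ZS_INUSE_RATIO_MIN,
	/* Derived, so changing the group count needs no other edits. */
	ZS_INUSE_RATIO_MAX = ZS_INUSE_RATIO_MIN + ZS_INUSE_RATIO_GROUPS - 1,
	ZS_FULL,
	NR_ZS_FULLNESS,
};

/* Map a page's usage to its group using only the derived constants. */
static enum fullness_group fullness_of(int inuse, int objs_per_zspage)
{
	assert(objs_per_zspage > 0);
	assert(inuse >= 0 && inuse <= objs_per_zspage);

	if (inuse == 0)
		return ZS_EMPTY;
	if (inuse == objs_per_zspage)
		return ZS_FULL;

	/* 0..99 here, because the full case was handled above */
	int ratio = 100 * inuse / objs_per_zspage;

	return ZS_INUSE_RATIO_MIN + ratio * ZS_INUSE_RATIO_GROUPS / 100;
}
```

With this layout a page with a single object out of 127 lands in ZS_INUSE_RATIO_MIN rather than in ZS_EMPTY, and loops can run from ZS_INUSE_RATIO_MIN to ZS_INUSE_RATIO_MAX without naming each group.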

2023-03-02 00:53:13

by Sergey Senozhatsky

Subject: Re: [PATCHv2 3/6] zsmalloc: fine-grained inuse ratio based fullness grouping

On (23/03/01 16:28), Minchan Kim wrote:
> On Wed, Mar 01, 2023 at 05:55:44PM +0900, Sergey Senozhatsky wrote:
> > On (23/02/28 14:53), Minchan Kim wrote:
> > > BTW, I still prefer the enum instead of 10 define.
> > >
> > > enum fullness_group {
> > > ZS_EMPTY,
> > > ZS_INUSE_RATIO_MIN,
> > > ZS_INUSE_RATIO_ALMOST_FULL = 7,
> > > ZS_INUSE_RATIO_MAX = 10,
> > > ZS_FULL,
> > > NR_ZS_FULLNESS,
> > > }
> >
> > For educational purposes, may I ask what do enums give us? We
> > always use integers - int:4 in zspage fullness, int for arrays
> > offsets and we cast to plain integers in get/set stats. So those
> > enums exist only at declaration point, and plain int otherwise.
> > What are the benefits over #defines?
>
> Well, I just didn't like the list of 12 hard-coded define values
> that are never used anywhere except zs_stats_size_show(), since

If we have two enums, then we need more lines

enum fullness {
ZS_INUSE_RATIO_0
...
ZS_INUSE_RATIO_100
}

enum stats {
INUSE_RATIO_0
...
INUSE_RATIO_100

// the rest of stats
}

and then we use int:4 fullness value to access stats.

> I thought we could handle zs_stats_size_show() in a loop without
> a specific definition for each ratio.

For per inuse ratio zs_stats_size_show() we need to access stats
individually:

inuse10, inuse20, inuse30, ... inuse99

2023-03-03 00:20:50

by Minchan Kim

Subject: Re: [PATCHv2 3/6] zsmalloc: fine-grained inuse ratio based fullness grouping

On Thu, Mar 02, 2023 at 09:53:03AM +0900, Sergey Senozhatsky wrote:
> On (23/03/01 16:28), Minchan Kim wrote:
> > On Wed, Mar 01, 2023 at 05:55:44PM +0900, Sergey Senozhatsky wrote:
> > > On (23/02/28 14:53), Minchan Kim wrote:
> > > > BTW, I still prefer the enum instead of 10 define.
> > > >
> > > > enum fullness_group {
> > > > ZS_EMPTY,
> > > > ZS_INUSE_RATIO_MIN,
> > > > ZS_INUSE_RATIO_ALMOST_FULL = 7,
> > > > ZS_INUSE_RATIO_MAX = 10,
> > > > ZS_FULL,
> > > > NR_ZS_FULLNESS,
> > > > }
> > >
> > > For educational purposes, may I ask what do enums give us? We
> > > always use integers - int:4 in zspage fullness, int for arrays
> > > offsets and we cast to plain integers in get/set stats. So those
> > > enums exist only at declaration point, and plain int otherwise.
> > > What are the benefits over #defines?
> >
> > Well, I just didn't like the list of 12 hard-coded define values
> > that are never used anywhere except zs_stats_size_show(), since
>
> If we have two enums, then we need more lines
>
> enum fullness {
> ZS_INUSE_RATIO_0
> ...
> ZS_INUSE_RATIO_100
> }
>
> enum stats {
> INUSE_RATIO_0
> ...
> INUSE_RATIO_100
>
> // the rest of stats
> }
>
> and then we use int:4 fullness value to access stats.

Yeah. I don't see any problem unless I miss your point.

>
> > I thought we could handle zs_stats_size_show() in a loop without
> > a specific definition for each ratio.
>
> For per inuse ratio zs_stats_size_show() we need to access stats
> individually:
>
> inuse10, inuse20, inuse30, ... inuse99

Does it need specific index in the enum list?

I don't mind having all the hard-coded indices if they're *necessary*,
but I wanted to see whether we could compute the index as base + offset
on demand in the loop via simple arithmetic.

2023-03-03 01:06:52

by Sergey Senozhatsky

Subject: Re: [PATCHv2 3/6] zsmalloc: fine-grained inuse ratio based fullness grouping

On (23/03/02 16:20), Minchan Kim wrote:
> On Thu, Mar 02, 2023 at 09:53:03AM +0900, Sergey Senozhatsky wrote:
> > On (23/03/01 16:28), Minchan Kim wrote:
> > > On Wed, Mar 01, 2023 at 05:55:44PM +0900, Sergey Senozhatsky wrote:
> > > > On (23/02/28 14:53), Minchan Kim wrote:
> > > > > BTW, I still prefer the enum instead of 10 define.
> > > > >
> > > > > enum fullness_group {
> > > > > ZS_EMPTY,
> > > > > ZS_INUSE_RATIO_MIN,
> > > > > ZS_INUSE_RATIO_ALMOST_FULL = 7,
> > > > > ZS_INUSE_RATIO_MAX = 10,
> > > > > ZS_FULL,
> > > > > NR_ZS_FULLNESS,
> > > > > }
> > > >
> > > > For educational purposes, may I ask what do enums give us? We
> > > > always use integers - int:4 in zspage fullness, int for arrays
> > > > offsets and we cast to plain integers in get/set stats. So those
> > > > enums exist only at declaration point, and plain int otherwise.
> > > > What are the benefits over #defines?
> > >
> > > Well, I just didn't like the list of 12 hard-coded define values
> > > that are never used anywhere except zs_stats_size_show(), since
> >
> > If we have two enums, then we need more lines
> >
> > enum fullness {
> > ZS_INUSE_RATIO_0
> > ...
> > ZS_INUSE_RATIO_100
> > }
> >
> > enum stats {
> > INUSE_RATIO_0
> > ...
> > INUSE_RATIO_100
> >
> > // the rest of stats
> > }
> >
> > and then we use int:4 fullness value to access stats.
>
> Yeah. I don't see any problem unless I miss your point.

OK. How about having one enum? E.g. "zs_flags" or something which
will contain all our constants?

Otherwise I can create two big enums for fullness and stats.
What's your preference on inuse_0 and inuse_100 naming? Do we
keep unified naming or should it be INUSE_MIN/INUSE_MAX or
EMPTY/FULL?

> > For per inuse ratio zs_stats_size_show() we need to access stats
> > individually:
> >
> > inuse10, inuse20, inuse30, ... inuse99
>
> Does it need specific index in the enum list?

If we report per inuse group then yes:

sprintf("... %lu %lu ..... %lu %lu ...\n",
...
get_stat(ZS_INUSE_RATIO_10),
get_stat(ZS_INUSE_RATIO_20),
get_stat(ZS_INUSE_RATIO_30),
...
get_stat(ZS_INUSE_RATIO_99),
...);

2023-03-03 01:38:16

by Minchan Kim

Subject: Re: [PATCHv2 3/6] zsmalloc: fine-grained inuse ratio based fullness grouping

On Fri, Mar 03, 2023 at 10:06:43AM +0900, Sergey Senozhatsky wrote:
> On (23/03/02 16:20), Minchan Kim wrote:
> > On Thu, Mar 02, 2023 at 09:53:03AM +0900, Sergey Senozhatsky wrote:
> > > On (23/03/01 16:28), Minchan Kim wrote:
> > > > On Wed, Mar 01, 2023 at 05:55:44PM +0900, Sergey Senozhatsky wrote:
> > > > > On (23/02/28 14:53), Minchan Kim wrote:
> > > > > > BTW, I still prefer the enum instead of 10 define.
> > > > > >
> > > > > > enum fullness_group {
> > > > > > ZS_EMPTY,
> > > > > > ZS_INUSE_RATIO_MIN,
> > > > > > ZS_INUSE_RATIO_ALMOST_FULL = 7,
> > > > > > ZS_INUSE_RATIO_MAX = 10,
> > > > > > ZS_FULL,
> > > > > > NR_ZS_FULLNESS,
> > > > > > }
> > > > >
> > > > > For educational purposes, may I ask what do enums give us? We
> > > > > always use integers - int:4 in zspage fullness, int for arrays
> > > > > offsets and we cast to plain integers in get/set stats. So those
> > > > > enums exist only at declaration point, and plain int otherwise.
> > > > > What are the benefits over #defines?
> > > >
> > > > Well, I just didn't like the list of 12 hard-coded define values
> > > > that are never used anywhere except zs_stats_size_show(), since
> > >
> > > If we have two enums, then we need more lines
> > >
> > > enum fullness {
> > > ZS_INUSE_RATIO_0
> > > ...
> > > ZS_INUSE_RATIO_100
> > > }
> > >
> > > enum stats {
> > > INUSE_RATIO_0
> > > ...
> > > INUSE_RATIO_100
> > >
> > > // the rest of stats
> > > }
> > >
> > > and then we use int:4 fullness value to access stats.
> >
> > Yeah. I don't see any problem unless I miss your point.
>
> OK. How about having one enum? E.g. "zs_flags" or something which
> will contain all our constants?
>
> Otherwise I can create two big enums for fullness and stats.

Let's go with two enums for now, since your great work is not
tied to this problem. If that really becomes a maintenance burden,
we can tidy it up then.

> What's your preference on inuse_0 and inuse_100 naming? Do we
> keep unified naming or should it be INUSE_MIN/INUSE_MAX or
> EMPTY/FULL?

I don't have a strong opinion about it. I will follow your choice. ;-)

>
> > > For per inuse ratio zs_stats_size_show() we need to access stats
> > > individually:
> > >
> > > inuse10, inuse20, inuse30, ... inuse99
> >
> > Does it need specific index in the enum list?
>
> If we report per inuse group then yes:
>
> sprintf("... %lu %lu ..... %lu %lu ...\n",
> ...
> get_stat(ZS_INUSE_RATIO_10),
> get_stat(ZS_INUSE_RATIO_20),
> get_stat(ZS_INUSE_RATIO_30),
> ...
> get_stat(ZS_INUSE_RATIO_99),
> ...);

I thought we could handle it with a loop:

prologue - seq_printf
for (ratio = min; ratio < max; ratio++)
seq_printf(s, "%lu", get_stat(ratio));
epilogue - seq_printf
seq_puts(s, "\n");

2023-03-03 01:44:09

by Sergey Senozhatsky

Subject: Re: [PATCHv2 3/6] zsmalloc: fine-grained inuse ratio based fullness grouping

On (23/03/02 17:38), Minchan Kim wrote:
> > Otherwise I can create two big enums for fullness and stats.
>
> Let's go with two enums for now, since your great work is not
> tied to this problem. If that really becomes a maintenance burden,
> we can tidy it up then.

OK.

>
> > What's your preference on inuse_0 and inuse_100 naming? Do we
> > keep unified naming or should it be INUSE_MIN/INUSE_MAX or
> > EMPTY/FULL?
>
> I don't have a strong opinion about it. I will follow your choice. ;-)

OK :)

> prologue - seq_printf
> for (ratio = min; ratio < max; ratio++)
> seq_printf(s, "%lu", get_stat(ratio));
> epilogue - seq_printf
> seq_puts(s, "\n");

Let me try a loop.
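
What such a loop could look like, sketched in userspace with snprintf() into a buffer standing in for seq_printf() on a seq_file. The index range (INUSE_GROUP_FIRST..INUSE_GROUP_LAST) and the function name format_inuse_columns() are assumptions made for this sketch; the prologue and epilogue columns of the real stats line are omitted.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical index range of the per-ratio counters in the stats array. */
enum { INUSE_GROUP_FIRST = 1, INUSE_GROUP_LAST = 10 };

/*
 * Write one " %lu" column per inuse-ratio group, then a newline.
 * Returns the number of bytes written (excluding the trailing NUL),
 * or -1 if buf is too small.
 */
static int format_inuse_columns(char *buf, size_t len,
				const unsigned long *stats)
{
	size_t off = 0;
	int n;

	assert(buf && stats);
	for (int r = INUSE_GROUP_FIRST; r <= INUSE_GROUP_LAST; r++) {
		n = snprintf(buf + off, len - off, " %lu", stats[r]);
		if (n < 0 || (size_t)n >= len - off)
			return -1;	/* output was truncated */
		off += (size_t)n;
	}

	n = snprintf(buf + off, len - off, "\n");
	if (n < 0 || (size_t)n >= len - off)
		return -1;
	return (int)(off + (size_t)n);
}
```

The loop replaces ten hand-written get_stat() arguments, at the cost of one format call per column instead of a single call with a long format string.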

2023-03-03 01:57:50

by Sergey Senozhatsky

Subject: Re: [PATCHv2 5/6] zsmalloc: extend compaction statistics

On (23/03/01 15:48), Minchan Kim wrote:
> > in general does not have to do compaction, since the kernel will
> > handle it.
> >
> > Just an idea. It feels like "pages compacted" on its own tells very
> > little, but I don't insist on exporting that new stat.
>
> I don't mind adding a simple metric, but I'd like to add it only if
> we have a real use case, with a few comments on how it is used
> in the real world.

I'll drop it from the series for now.