zsmalloc pool can be compacted concurrently by many contexts,
e.g.
cc1 handle_mm_fault()
do_anonymous_page()
__alloc_pages_slowpath()
try_to_free_pages()
do_try_to_free_pages(
lru_gen_shrink_node()
shrink_slab()
do_shrink_slab()
zs_shrinker_scan()
zs_compact()
This creates unnecessary contention as all those processes
compete for access to the same classes. A single compaction
process is enough. Moreover contention that is created by
multiple compaction processes impact other zsmalloc functions,
e.g. zs_malloc(), since zsmalloc uses "global" pool->lock to
synchronize access to pool.
Introduce pool compaction mutex and permit only one compaction
context at a time. This reduces overall pool->lock contention.
/proc/lock-stat after make -j$((`nproc`+1)) linux kernel for
&pool->lock#3:
Base Patched
------------------------------------------
con-bounces 2035730 1540066
contentions 2343871 1774348
waittime-min 0.10 0.10
waittime-max 4004216.24 2745.22
waittime-total 101334168.29 67865414.91
waittime-avg 43.23 38.25
acq-bounces 2895765 2186745
acquisitions 6247686 5136943
holdtime-min 0.07 0.07
holdtime-max 2605507.97 482439.16
holdtime-total 9998599.59 5107151.01
holdtime-avg 1.60 0.99
Test run time:
Base
2775.15user 1709.13system 2:13.82elapsed 3350%CPU
Patched
2608.25user 1439.03system 2:03.63elapsed 3273%CPU
Signed-off-by: Sergey Senozhatsky <[email protected]>
---
mm/zsmalloc.c | 6 ++++++
1 file changed, 6 insertions(+)
diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
index cc81dfba05a0..dfec2fc6a30f 100644
--- a/mm/zsmalloc.c
+++ b/mm/zsmalloc.c
@@ -264,6 +264,7 @@ struct zs_pool {
struct work_struct free_work;
#endif
spinlock_t lock;
+ atomic_t compaction_in_progress;
};
struct zspage {
@@ -2274,6 +2275,9 @@ unsigned long zs_compact(struct zs_pool *pool)
struct size_class *class;
unsigned long pages_freed = 0;
+ if (atomic_xchg(&pool->compaction_in_progress, 1))
+ return 0;
+
for (i = ZS_SIZE_CLASSES - 1; i >= 0; i--) {
class = pool->size_class[i];
if (class->index != i)
@@ -2281,6 +2285,7 @@ unsigned long zs_compact(struct zs_pool *pool)
pages_freed += __zs_compact(pool, class);
}
atomic_long_add(pages_freed, &pool->stats.pages_compacted);
+ atomic_set(&pool->compaction_in_progress, 0);
return pages_freed;
}
@@ -2388,6 +2393,7 @@ struct zs_pool *zs_create_pool(const char *name)
init_deferred_free(pool);
spin_lock_init(&pool->lock);
+ atomic_set(&pool->compaction_in_progress, 0);
pool->name = kstrdup(name, GFP_KERNEL);
if (!pool->name)
--
2.40.0.634.g4ca3ef3211-goog
On Mon, Apr 17, 2023 at 6:54 AM Sergey Senozhatsky
<[email protected]> wrote:
>
> zsmalloc pool can be compacted concurrently by many contexts,
> e.g.
>
> cc1 handle_mm_fault()
> do_anonymous_page()
> __alloc_pages_slowpath()
> try_to_free_pages()
> do_try_to_free_pages(
> lru_gen_shrink_node()
> shrink_slab()
> do_shrink_slab()
> zs_shrinker_scan()
> zs_compact()
>
> This creates unnecessary contention as all those processes
> compete for access to the same classes. A single compaction
> process is enough. Moreover contention that is created by
> multiple compaction processes impact other zsmalloc functions,
> e.g. zs_malloc(), since zsmalloc uses "global" pool->lock to
> synchronize access to pool.
>
> Introduce pool compaction mutex and permit only one compaction
> context at a time. This reduces overall pool->lock contention.
>
> /proc/lock-stat after make -j$((`nproc`+1)) linux kernel for
> &pool->lock#3:
>
> Base Patched
> ------------------------------------------
> con-bounces 2035730 1540066
> contentions 2343871 1774348
> waittime-min 0.10 0.10
> waittime-max 4004216.24 2745.22
> waittime-total 101334168.29 67865414.91
> waittime-avg 43.23 38.25
> acq-bounces 2895765 2186745
> acquisitions 6247686 5136943
> holdtime-min 0.07 0.07
> holdtime-max 2605507.97 482439.16
> holdtime-total 9998599.59 5107151.01
> holdtime-avg 1.60 0.99
The numbers seem to be better when using an atomic vs. a mutex, is
this just noise or significant difference? (I am not familiar with
lock-stat).
>
> Test run time:
> Base
> 2775.15user 1709.13system 2:13.82elapsed 3350%CPU
>
> Patched
> 2608.25user 1439.03system 2:03.63elapsed 3273%CPU
>
> Signed-off-by: Sergey Senozhatsky <[email protected]>
FWIW,
Reviewed-by: Yosry Ahmed <[email protected]>
> ---
> mm/zsmalloc.c | 6 ++++++
> 1 file changed, 6 insertions(+)
>
> diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
> index cc81dfba05a0..dfec2fc6a30f 100644
> --- a/mm/zsmalloc.c
> +++ b/mm/zsmalloc.c
> @@ -264,6 +264,7 @@ struct zs_pool {
> struct work_struct free_work;
> #endif
> spinlock_t lock;
> + atomic_t compaction_in_progress;
> };
>
> struct zspage {
> @@ -2274,6 +2275,9 @@ unsigned long zs_compact(struct zs_pool *pool)
> struct size_class *class;
> unsigned long pages_freed = 0;
>
> + if (atomic_xchg(&pool->compaction_in_progress, 1))
> + return 0;
> +
> for (i = ZS_SIZE_CLASSES - 1; i >= 0; i--) {
> class = pool->size_class[i];
> if (class->index != i)
> @@ -2281,6 +2285,7 @@ unsigned long zs_compact(struct zs_pool *pool)
> pages_freed += __zs_compact(pool, class);
> }
> atomic_long_add(pages_freed, &pool->stats.pages_compacted);
> + atomic_set(&pool->compaction_in_progress, 0);
>
> return pages_freed;
> }
> @@ -2388,6 +2393,7 @@ struct zs_pool *zs_create_pool(const char *name)
>
> init_deferred_free(pool);
> spin_lock_init(&pool->lock);
> + atomic_set(&pool->compaction_in_progress, 0);
>
> pool->name = kstrdup(name, GFP_KERNEL);
> if (!pool->name)
> --
> 2.40.0.634.g4ca3ef3211-goog
>
On (23/04/17 11:32), Yosry Ahmed wrote:
> > /proc/lock-stat after make -j$((`nproc`+1)) linux kernel for
> > &pool->lock#3:
> >
> > Base Patched
> > ------------------------------------------
> > con-bounces 2035730 1540066
> > contentions 2343871 1774348
> > waittime-min 0.10 0.10
> > waittime-max 4004216.24 2745.22
> > waittime-total 101334168.29 67865414.91
> > waittime-avg 43.23 38.25
> > acq-bounces 2895765 2186745
> > acquisitions 6247686 5136943
> > holdtime-min 0.07 0.07
> > holdtime-max 2605507.97 482439.16
> > holdtime-total 9998599.59 5107151.01
> > holdtime-avg 1.60 0.99
>
> The numbers seem to be better when using an atomic vs. a mutex, is
> this just noise or significant difference? (I am not familiar with
> lock-stat).
Pretty sure that's just noise. The test is make -j72 on a system
that swaps out, so it's terribly noisy.
On Mon, 17 Apr 2023 22:54:20 +0900 Sergey Senozhatsky <[email protected]> wrote:
> zsmalloc pool can be compacted concurrently by many contexts,
> e.g.
>
> cc1 handle_mm_fault()
> do_anonymous_page()
> __alloc_pages_slowpath()
> try_to_free_pages()
> do_try_to_free_pages(
> lru_gen_shrink_node()
> shrink_slab()
> do_shrink_slab()
> zs_shrinker_scan()
> zs_compact()
>
> This creates unnecessary contention as all those processes
> compete for access to the same classes. A single compaction
> process is enough. Moreover contention that is created by
> multiple compaction processes impact other zsmalloc functions,
> e.g. zs_malloc(), since zsmalloc uses "global" pool->lock to
> synchronize access to pool.
>
> Introduce pool compaction mutex and permit only one compaction
> context at a time. This reduces overall pool->lock contention.
That isn't what the patch does! Perhaps an earlier version used a mutex?
> /proc/lock-stat after make -j$((`nproc`+1)) linux kernel for
> &pool->lock#3:
>
> Base Patched
> ------------------------------------------
> con-bounces 2035730 1540066
> contentions 2343871 1774348
> waittime-min 0.10 0.10
> waittime-max 4004216.24 2745.22
> waittime-total 101334168.29 67865414.91
> waittime-avg 43.23 38.25
> acq-bounces 2895765 2186745
> acquisitions 6247686 5136943
> holdtime-min 0.07 0.07
> holdtime-max 2605507.97 482439.16
> holdtime-total 9998599.59 5107151.01
> holdtime-avg 1.60 0.99
>
> Test run time:
> Base
> 2775.15user 1709.13system 2:13.82elapsed 3350%CPU
>
> Patched
> 2608.25user 1439.03system 2:03.63elapsed 3273%CPU
>
> ...
>
> @@ -2274,6 +2275,9 @@ unsigned long zs_compact(struct zs_pool *pool)
> struct size_class *class;
> unsigned long pages_freed = 0;
>
> + if (atomic_xchg(&pool->compaction_in_progress, 1))
> + return 0;
> +
A code comment might be appropriate here.
Is the spin_is_contended() test in __zs_compact() still relevant?
And.... single-threading the operation seems a pretty sad way of
addressing a contention issue. zs_compact() is fairly computationally
expensive - surely a large machine would like to be able to
concurrently run many instances of zs_compact()?
On Mon, Apr 17, 2023 at 5:41 PM Andrew Morton <[email protected]> wrote:
>
> On Mon, 17 Apr 2023 22:54:20 +0900 Sergey Senozhatsky <[email protected]> wrote:
>
> > zsmalloc pool can be compacted concurrently by many contexts,
> > e.g.
> >
> > cc1 handle_mm_fault()
> > do_anonymous_page()
> > __alloc_pages_slowpath()
> > try_to_free_pages()
> > do_try_to_free_pages(
> > lru_gen_shrink_node()
> > shrink_slab()
> > do_shrink_slab()
> > zs_shrinker_scan()
> > zs_compact()
> >
> > This creates unnecessary contention as all those processes
> > compete for access to the same classes. A single compaction
> > process is enough. Moreover contention that is created by
> > multiple compaction processes impact other zsmalloc functions,
> > e.g. zs_malloc(), since zsmalloc uses "global" pool->lock to
> > synchronize access to pool.
> >
> > Introduce pool compaction mutex and permit only one compaction
> > context at a time. This reduces overall pool->lock contention.
>
> That isn't what the patch does! Perhaps an earlier version used a mutex?
Yup that's from v1. Thanks for catching that.
>
> > /proc/lock-stat after make -j$((`nproc`+1)) linux kernel for
> > &pool->lock#3:
> >
> > Base Patched
> > ------------------------------------------
> > con-bounces 2035730 1540066
> > contentions 2343871 1774348
> > waittime-min 0.10 0.10
> > waittime-max 4004216.24 2745.22
> > waittime-total 101334168.29 67865414.91
> > waittime-avg 43.23 38.25
> > acq-bounces 2895765 2186745
> > acquisitions 6247686 5136943
> > holdtime-min 0.07 0.07
> > holdtime-max 2605507.97 482439.16
> > holdtime-total 9998599.59 5107151.01
> > holdtime-avg 1.60 0.99
> >
> > Test run time:
> > Base
> > 2775.15user 1709.13system 2:13.82elapsed 3350%CPU
> >
> > Patched
> > 2608.25user 1439.03system 2:03.63elapsed 3273%CPU
> >
> > ...
> >
> > @@ -2274,6 +2275,9 @@ unsigned long zs_compact(struct zs_pool *pool)
> > struct size_class *class;
> > unsigned long pages_freed = 0;
> >
> > + if (atomic_xchg(&pool->compaction_in_progress, 1))
> > + return 0;
> > +
>
> A code comment might be appropriate here.
>
> Is the spin_is_contended() test in __zs_compact() still relevant?
It is, as pool->lock is used to synchronize a lot of operations, not
just compaction.
>
> And.... single-threading the operation seems a pretty sad way of
> addressing a contention issue. zs_compact() is fairly computationally
> expensive - surely a large machine would like to be able to
> concurrently run many instances of zs_compact()?
Yeah that was my initial thought as well, to attack this problem at a
larger scale and break down the pool->lock into class->lock. AFAICT,
zsmalloc pages/objects cannot change classes during their lifetime, so
this sounds reasonable initially.
The problem is that the pool->lock is used to synchronize ~all
zsmalloc operations: zs_map_object(), zs_malloc(), zs_free(),
async_free_zspage(), zs_compact(), zs_reclaim_page(),
zs_page_migrate().
For some of those, changing a pool->lock into a class->lock is fairly
straightforward, specifically the ones where we operate on a
page/zspage (e.g. zs_compact()).
The problem is for operations where we operate on an object (e.g.
zs_map_object()). At this point, we only have a handle that we need to
use to find the underlying object. We have no idea what class the
object belongs to, and finding an object given a handle needs
synchronization as the handle<->object mapping can change by
compaction or page migration.
We can store the class somewhere and pay an extra byte per compressed
object (which doesn't sound too bad) -- but even then we'd have
trouble finding the right place to stash it. Perhaps we can enforce
some alignment on the objects and use the lowermost 8 bits of the
handle pointer, which also means we don't need to pay any extra
overhead, but I am not sure if this is practical.
We can also store the class at the first byte of the object, and
guarantee that through migration/compaction the first byte always
remains a valid class. We can then read the class from the handle
locklessly, acquire the class lock, and then make sure the handle is
still the same.
These are all ideas I am throwing in the air, I didn't think about
them thoroughly or try implementing them.
TLDR: I believe it is possible (in theory at least) to break down the
global pool->lock into class locks, which should benefit virtually
~all zsmalloc concurrent operations, not just compaction. It is not
easy though.
As for this patch, I personally do not observe a lot of compaction in
our production environment, and allowing one thread to perform
compaction while others move on with their lives can be better than
having all of them continuously contending for the pool->lock, which
means more contention with ~all zsmalloc operations, not just
concurrent compactors. I can't say for sure that this is an
improvement, but I *believe* it is.
I will shut up now and leave it up to Sergey to decide how to move
forward, sorry for the very long response :)
On (23/04/17 17:41), Andrew Morton wrote:
> On Mon, 17 Apr 2023 22:54:20 +0900 Sergey Senozhatsky <[email protected]> wrote:
> > Introduce pool compaction mutex and permit only one compaction
> > context at a time. This reduces overall pool->lock contention.
>
> That isn't what the patch does! Perhaps an earlier version used a mutex?
Oh, yes.
[..]
> > @@ -2274,6 +2275,9 @@ unsigned long zs_compact(struct zs_pool *pool)
> > struct size_class *class;
> > unsigned long pages_freed = 0;
> >
> > + if (atomic_xchg(&pool->compaction_in_progress, 1))
> > + return 0;
> > +
>
> A code comment might be appropriate here.
OK.
> Is the spin_is_contended() test in __zs_compact() still relevant?
I'd say yes, pool->lock is "big pool lock", we use it for everything:
zs_map_object() when we read objects, zs_malloc() when we allocate objects,
zs_free() when we free objects, etc.
> And.... single-threading the operation seems a pretty sad way of
> addressing a contention issue. zs_compact() is fairly computationally
> expensive - surely a large machine would like to be able to
> concurrently run many instances of zs_compact()?
Compaction is effective only when zspool suffers from internal fragmentation.
Concurrent compaction threads iterate size classes in the same order and are
likely to compete for pool->lock to just find out that previous pool->lock
owner has compacted the class already and there isn't much left to do.
As of concurrent compaction on big machines, the thing is - concurrent
compaction doesn't really happen. We always have just one thread compacting
size classes under pool->lock, the remaining compaction threads just spin on
pool->lock. I believe it used to be slightly different in the past when we
had per size-class lock instead of "global" pool->lock. Commit c0547d0b6a4b
("zsmalloc: consolidate zs_pool's migrate_lock and size_class's locks") has
basically made compaction single-threaded.
On (23/04/17 19:53), Yosry Ahmed wrote:
>
> As for this patch, I personally do not observe a lot of compaction in
> our production environment, and allowing one thread to perform
> compaction while others move on with their lives can be better than
> having all of them continuously contending for the pool->lock, which
> means more contention with ~all zsmalloc operations, not just
> concurrent compactors. I can't say for sure that this is an
> improvement, but I *believe* it is.
Looking at one of ChromeOS memory-pressure tests, I see that sometimes
(albeit rarely) we can have up to 9 parallel zspool compaction contexts,
perhaps a little bit too many for a 12 CPUs laptop:
[ 2159.378827] zsmalloc: ctx #1 chrome -> zs_compact()
[ 2159.379002] zsmalloc: ctx #2 Chrome_ChildIOT -> zs_compact()
[ 2159.379120] zsmalloc: ctx #3 chrome -> zs_compact()
[ 2159.379135] zsmalloc: ctx #4 chrome -> zs_compact()
[ 2159.379213] zsmalloc: ctx #5 chrome -> zs_compact()
[ 2159.379271] zsmalloc: ctx #6 chrome -> zs_compact()
[ 2159.379276] zsmalloc: ctx #7 chrome -> zs_compact()
[ 2159.382786] zsmalloc: ctx #8 chrome -> zs_compact()
[ 2159.432153] zsmalloc: ctx #9 kswapd0 -> zs_compact()
On Tue, Apr 18, 2023 at 4:24 AM Sergey Senozhatsky
<[email protected]> wrote:
>
> On (23/04/17 19:53), Yosry Ahmed wrote:
> >
> > As for this patch, I personally do not observe a lot of compaction in
> > our production environment, and allowing one thread to perform
> > compaction while others move on with their lives can be better than
> > having all of them continuously contending for the pool->lock, which
> > means more contention with ~all zsmalloc operations, not just
> > concurrent compactors. I can't say for sure that this is an
> > improvement, but I *believe* it is.
>
> Looking at one of ChromeOS memory-pressure tests, I see that sometimes
> (albeit rarely) we can have up to 9 parallel zspool compaction contexts,
> perhaps a little bit too many for a 12 CPUs laptop:
>
> [ 2159.378827] zsmalloc: ctx #1 chrome -> zs_compact()
> [ 2159.379002] zsmalloc: ctx #2 Chrome_ChildIOT -> zs_compact()
> [ 2159.379120] zsmalloc: ctx #3 chrome -> zs_compact()
> [ 2159.379135] zsmalloc: ctx #4 chrome -> zs_compact()
> [ 2159.379213] zsmalloc: ctx #5 chrome -> zs_compact()
> [ 2159.379271] zsmalloc: ctx #6 chrome -> zs_compact()
> [ 2159.379276] zsmalloc: ctx #7 chrome -> zs_compact()
> [ 2159.382786] zsmalloc: ctx #8 chrome -> zs_compact()
> [ 2159.432153] zsmalloc: ctx #9 kswapd0 -> zs_compact()
I haven't looked into such pressure tests on our end, the information
I have is based on the amount of CPU cycles we are spending on
zsmalloc compaction in general.