In our production experience the percpu memory allocator sometimes struggles
to return memory to the system. A typical example is the creation of
several thousand memory cgroups (each has several chunks of percpu data
used for vmstats, vmevents, ref counters etc). Deleting and completely releasing
these cgroups doesn't always shrink the percpu memory.
The underlying problem is fragmentation: to release a chunk, all percpu
allocations in it must be released first. The percpu allocator tends
to top up chunks to improve utilization. This means new small-ish allocations
(e.g. percpu ref counters) are placed onto almost filled old-ish chunks,
effectively pinning them in memory.
This patchset pretends to solve this problem by implementing a partial
depopulation of percpu chunks: chunks with many empty pages are being
asynchronously depopulated and the pages are returned to the system.
To illustrate the problem the following script can be used:
--
#!/bin/bash
cd /sys/fs/cgroup
mkdir percpu_test
echo "+memory" > percpu_test/cgroup.subtree_control
cat /proc/meminfo | grep Percpu
for i in `seq 1 1000`; do
	mkdir percpu_test/cg_"${i}"
	for j in `seq 1 10`; do
		mkdir percpu_test/cg_"${i}"_"${j}"
	done
done
cat /proc/meminfo | grep Percpu
for i in `seq 1 1000`; do
	for j in `seq 1 10`; do
		rmdir percpu_test/cg_"${i}"_"${j}"
	done
done
sleep 10
cat /proc/meminfo | grep Percpu
for i in `seq 1 1000`; do
	rmdir percpu_test/cg_"${i}"
done
rmdir percpu_test
--
It creates 11000 memory cgroups and removes 10 out of every 11.
It prints the initial size of the percpu memory, the size after
creating all cgroups and the size after deleting most of them.
Results:
vanilla:
./percpu_test.sh
Percpu: 7488 kB
Percpu: 481152 kB
Percpu: 481152 kB
with this patchset applied:
./percpu_test.sh
Percpu: 7488 kB
Percpu: 481408 kB
Percpu: 159488 kB
So the total size of the percpu memory was reduced roughly threefold.
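For readers who want to check the arithmetic, here is a small userspace sketch
that recomputes the numbers from the two runs above. The struct and function
names are made up for this illustration; only the kB values come from the
results.

```c
/* Percpu values in kB, copied from the two runs above. */
struct percpu_run {
	long before_kb;	/* before creating any cgroups */
	long peak_kb;	/* after creating all cgroups */
	long after_kb;	/* after removing 10 out of every 11 cgroups */
};

static const struct percpu_run vanilla = { 7488, 481152, 481152 };
static const struct percpu_run patched = { 7488, 481408, 159488 };

/* kB returned to the system between the peak and the final reading. */
static long reclaimed_kb(const struct percpu_run *r)
{
	return r->peak_kb - r->after_kb;
}

/* Ratio of the peak reading to the final reading. */
static double shrink_factor(const struct percpu_run *r)
{
	return (double)r->peak_kb / (double)r->after_kb;
}
```

With these inputs the vanilla run reclaims nothing, while the patched run
gives back 321920 kB, a peak-to-final ratio slightly above 3.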
v2:
- depopulation heuristics changed and optimized
- chunks are put into a separate list, the depopulation scans this list
- chunk->isolated is introduced, chunk->depopulate is dropped
- rearranged patches a bit
- fixed a panic discovered by krobot
- made pcpu_nr_empty_pop_pages per chunk type
- minor fixes
rfc:
https://lwn.net/Articles/850508/
Roman Gushchin (5):
percpu: split __pcpu_balance_workfn()
percpu: make pcpu_nr_empty_pop_pages per chunk type
percpu: generalize pcpu_balance_populated()
percpu: fix a comment about the chunks ordering
percpu: implement partial chunk depopulation
mm/percpu-internal.h | 3 +-
mm/percpu-stats.c | 9 +-
mm/percpu.c | 219 ++++++++++++++++++++++++++++++++++---------
3 files changed, 182 insertions(+), 49 deletions(-)
--
2.30.2
Since the commit 3e54097beb22 ("percpu: manage chunks based on
contig_bits instead of free_bytes") chunks are sorted based on the
size of the biggest continuous free area instead of the total number
of free bytes. Update the corresponding comment to reflect this.
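To illustrate why the two sorting keys can disagree, here is a minimal
userspace sketch. It does not use the real allocator structures: the bool
array is a hypothetical stand-in for a chunk's allocation map, and both
helper names are invented for this example.

```c
#include <stdbool.h>
#include <stddef.h>

/* Total number of free units in the map (true means "in use"). */
static size_t total_free(const bool *used, size_t n)
{
	size_t free_units = 0;

	for (size_t i = 0; i < n; i++)
		if (!used[i])
			free_units++;
	return free_units;
}

/* Length of the longest run of consecutive free units. */
static size_t largest_free_run(const bool *used, size_t n)
{
	size_t best = 0, run = 0;

	for (size_t i = 0; i < n; i++) {
		run = used[i] ? 0 : run + 1;
		if (run > best)
			best = run;
	}
	return best;
}
```

A map with every other unit in use has more free units in total than a map
with one free run of three at the end, yet its largest contiguous free area
is only one unit, so the two keys rank these chunks in opposite order.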
Signed-off-by: Roman Gushchin <[email protected]>
---
mm/percpu.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/mm/percpu.c b/mm/percpu.c
index 25a181328353..e20119668c42 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -99,7 +99,10 @@
#include "percpu-internal.h"
-/* the slots are sorted by free bytes left, 1-31 bytes share the same slot */
+/*
+ * The slots are sorted by the size of the biggest continuous free area.
+ * 1-31 bytes share the same slot.
+ */
#define PCPU_SLOT_BASE_SHIFT 5
/* chunks in slots below this are subject to being sidelined on failed alloc */
#define PCPU_SLOT_FAIL_THRESHOLD 3
--
2.30.2
To prepare for the depopulation of percpu chunks, split out the
populating part of the pcpu_balance_populated() into the new
pcpu_grow_populated() (with an intention to add
pcpu_shrink_populated() in the next commit).
The goal of pcpu_balance_populated() is to determine whether
there is a shortage or an excessive amount of empty percpu pages
and call into the corresponding function.
pcpu_grow_populated() takes a desired number of pages as an argument
(nr_to_pop). If it creates a new chunk, nr_to_pop should be updated
to reflect that the new chunk could be created already populated.
Otherwise an infinite loop might appear.
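The decision logic described above can be modeled in userspace roughly as
follows. This is only a sketch: the function names are invented, and the value
of EMPTY_POP_PAGES_HIGH is an assumption picked for illustration rather than
taken from this patch.

```c
#include <stdbool.h>

#define EMPTY_POP_PAGES_HIGH 4	/* assumed value, for illustration only */

/* How many pages the balance step asks the grow step to populate. */
static int pages_to_populate(bool atomic_alloc_failed, int nr_empty_pop_pages)
{
	if (atomic_alloc_failed)
		return EMPTY_POP_PAGES_HIGH;
	if (nr_empty_pop_pages < EMPTY_POP_PAGES_HIGH)
		return EMPTY_POP_PAGES_HIGH - nr_empty_pop_pages;
	return 0;	/* no shortage, nothing to populate */
}

/*
 * Pages still wanted after a new chunk was created with some pages
 * already populated. Clamped at zero so the retry loop can terminate.
 */
static int remaining_after_new_chunk(int nr_to_pop, int chunk_nr_populated)
{
	int left = nr_to_pop - chunk_nr_populated;

	return left > 0 ? left : 0;
}
```

Once the second helper returns zero, the caller has no reason to jump back to
the retry label, which is how the patch avoids looping forever.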
Signed-off-by: Roman Gushchin <[email protected]>
---
mm/percpu.c | 63 +++++++++++++++++++++++++++++++++--------------------
1 file changed, 39 insertions(+), 24 deletions(-)
diff --git a/mm/percpu.c b/mm/percpu.c
index 0eeeb4e7a2f9..25a181328353 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -1976,7 +1976,7 @@ static void pcpu_balance_free(enum pcpu_chunk_type type)
}
/**
- * pcpu_balance_populated - manage the amount of populated pages
+ * pcpu_grow_populated - populate chunk(s) to satisfy atomic allocations
* @type: chunk type
*
* Maintain a certain amount of populated pages to satisfy atomic allocations.
@@ -1985,35 +1985,15 @@ static void pcpu_balance_free(enum pcpu_chunk_type type)
* allocation causes the failure as it is possible that requests can be
* serviced from already backed regions.
*/
-static void pcpu_balance_populated(enum pcpu_chunk_type type)
+static void pcpu_grow_populated(enum pcpu_chunk_type type, int nr_to_pop)
{
/* gfp flags passed to underlying allocators */
const gfp_t gfp = GFP_KERNEL | __GFP_NORETRY | __GFP_NOWARN;
struct list_head *pcpu_slot = pcpu_chunk_list(type);
struct pcpu_chunk *chunk;
- int slot, nr_to_pop, ret;
+ int slot, ret;
- /*
- * Ensure there are certain number of free populated pages for
- * atomic allocs. Fill up from the most packed so that atomic
- * allocs don't increase fragmentation. If atomic allocation
- * failed previously, always populate the maximum amount. This
- * should prevent atomic allocs larger than PAGE_SIZE from keeping
- * failing indefinitely; however, large atomic allocs are not
- * something we support properly and can be highly unreliable and
- * inefficient.
- */
retry_pop:
- if (pcpu_atomic_alloc_failed) {
- nr_to_pop = PCPU_EMPTY_POP_PAGES_HIGH;
- /* best effort anyway, don't worry about synchronization */
- pcpu_atomic_alloc_failed = false;
- } else {
- nr_to_pop = clamp(PCPU_EMPTY_POP_PAGES_HIGH -
- pcpu_nr_empty_pop_pages[type],
- 0, PCPU_EMPTY_POP_PAGES_HIGH);
- }
-
for (slot = pcpu_size_to_slot(PAGE_SIZE); slot < pcpu_nr_slots; slot++) {
unsigned int nr_unpop = 0, rs, re;
@@ -2057,12 +2037,47 @@ static void pcpu_balance_populated(enum pcpu_chunk_type type)
if (chunk) {
spin_lock_irq(&pcpu_lock);
pcpu_chunk_relocate(chunk, -1);
+ nr_to_pop = max_t(int, 0, nr_to_pop - chunk->nr_populated);
spin_unlock_irq(&pcpu_lock);
- goto retry_pop;
+ if (nr_to_pop)
+ goto retry_pop;
}
}
}
+/**
+ * pcpu_balance_populated - manage the amount of populated pages
+ * @type: chunk type
+ *
+ * Populate or depopulate chunks to maintain a certain amount
+ * of free pages to satisfy atomic allocations, but not waste
+ * large amounts of memory.
+ */
+static void pcpu_balance_populated(enum pcpu_chunk_type type)
+{
+ int nr_to_pop;
+
+ /*
+ * Ensure there are certain number of free populated pages for
+ * atomic allocs. Fill up from the most packed so that atomic
+ * allocs don't increase fragmentation. If atomic allocation
+ * failed previously, always populate the maximum amount. This
+ * should prevent atomic allocs larger than PAGE_SIZE from keeping
+ * failing indefinitely; however, large atomic allocs are not
+ * something we support properly and can be highly unreliable and
+ * inefficient.
+ */
+ if (pcpu_atomic_alloc_failed) {
+ nr_to_pop = PCPU_EMPTY_POP_PAGES_HIGH;
+ /* best effort anyway, don't worry about synchronization */
+ pcpu_atomic_alloc_failed = false;
+ pcpu_grow_populated(type, nr_to_pop);
+ } else if (pcpu_nr_empty_pop_pages[type] < PCPU_EMPTY_POP_PAGES_HIGH) {
+ nr_to_pop = PCPU_EMPTY_POP_PAGES_HIGH - pcpu_nr_empty_pop_pages[type];
+ pcpu_grow_populated(type, nr_to_pop);
+ }
+}
+
/**
* pcpu_balance_workfn - manage the amount of free chunks and populated pages
* @work: unused
--
2.30.2
__pcpu_balance_workfn() became fairly big and hard to follow, but in
fact it consists of two fully independent parts, responsible for
the destruction of excessive free chunks and the population of the
necessary amount of free pages.
In order to simplify the code and prepare for adding new
functionality, split it into two functions:
1) pcpu_balance_free,
2) pcpu_balance_populated.
Move the taking/releasing of the pcpu_alloc_mutex to an upper level
to keep the current synchronization in place.
Signed-off-by: Roman Gushchin <[email protected]>
Reviewed-by: Dennis Zhou <[email protected]>
---
mm/percpu.c | 46 +++++++++++++++++++++++++++++-----------------
1 file changed, 29 insertions(+), 17 deletions(-)
diff --git a/mm/percpu.c b/mm/percpu.c
index 6596a0a4286e..5b505a459028 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -1930,31 +1930,22 @@ void __percpu *__alloc_reserved_percpu(size_t size, size_t align)
}
/**
- * __pcpu_balance_workfn - manage the amount of free chunks and populated pages
+ * pcpu_balance_free - manage the amount of free chunks
* @type: chunk type
*
- * Reclaim all fully free chunks except for the first one. This is also
- * responsible for maintaining the pool of empty populated pages. However,
- * it is possible that this is called when physical memory is scarce causing
- * OOM killer to be triggered. We should avoid doing so until an actual
- * allocation causes the failure as it is possible that requests can be
- * serviced from already backed regions.
+ * Reclaim all fully free chunks except for the first one.
*/
-static void __pcpu_balance_workfn(enum pcpu_chunk_type type)
+static void pcpu_balance_free(enum pcpu_chunk_type type)
{
- /* gfp flags passed to underlying allocators */
- const gfp_t gfp = GFP_KERNEL | __GFP_NORETRY | __GFP_NOWARN;
LIST_HEAD(to_free);
struct list_head *pcpu_slot = pcpu_chunk_list(type);
struct list_head *free_head = &pcpu_slot[pcpu_nr_slots - 1];
struct pcpu_chunk *chunk, *next;
- int slot, nr_to_pop, ret;
/*
* There's no reason to keep around multiple unused chunks and VM
* areas can be scarce. Destroy all free chunks except for one.
*/
- mutex_lock(&pcpu_alloc_mutex);
spin_lock_irq(&pcpu_lock);
list_for_each_entry_safe(chunk, next, free_head, list) {
@@ -1982,6 +1973,25 @@ static void __pcpu_balance_workfn(enum pcpu_chunk_type type)
pcpu_destroy_chunk(chunk);
cond_resched();
}
+}
+
+/**
+ * pcpu_balance_populated - manage the amount of populated pages
+ * @type: chunk type
+ *
+ * Maintain a certain amount of populated pages to satisfy atomic allocations.
+ * It is possible that this is called when physical memory is scarce causing
+ * OOM killer to be triggered. We should avoid doing so until an actual
+ * allocation causes the failure as it is possible that requests can be
+ * serviced from already backed regions.
+ */
+static void pcpu_balance_populated(enum pcpu_chunk_type type)
+{
+ /* gfp flags passed to underlying allocators */
+ const gfp_t gfp = GFP_KERNEL | __GFP_NORETRY | __GFP_NOWARN;
+ struct list_head *pcpu_slot = pcpu_chunk_list(type);
+ struct pcpu_chunk *chunk;
+ int slot, nr_to_pop, ret;
/*
* Ensure there are certain number of free populated pages for
@@ -2051,22 +2061,24 @@ static void __pcpu_balance_workfn(enum pcpu_chunk_type type)
goto retry_pop;
}
}
-
- mutex_unlock(&pcpu_alloc_mutex);
}
/**
* pcpu_balance_workfn - manage the amount of free chunks and populated pages
* @work: unused
*
- * Call __pcpu_balance_workfn() for each chunk type.
+ * Call pcpu_balance_free() and pcpu_balance_populated() for each chunk type.
*/
static void pcpu_balance_workfn(struct work_struct *work)
{
enum pcpu_chunk_type type;
- for (type = 0; type < PCPU_NR_CHUNK_TYPES; type++)
- __pcpu_balance_workfn(type);
+ for (type = 0; type < PCPU_NR_CHUNK_TYPES; type++) {
+ mutex_lock(&pcpu_alloc_mutex);
+ pcpu_balance_free(type);
+ pcpu_balance_populated(type);
+ mutex_unlock(&pcpu_alloc_mutex);
+ }
}
/**
--
2.30.2
This patch implements partial depopulation of percpu chunks.
As of now, a chunk can be depopulated only as a part of the final
destruction, if there are no more outstanding allocations. However,
to minimize memory waste it might be useful to depopulate a
partially filled chunk, if a small number of outstanding allocations
prevents the chunk from being fully reclaimed.
This patch implements the following depopulation process: it scans
over the chunk pages, looks for a range of empty and populated pages
and performs the depopulation. To avoid races with new allocations,
the chunk is isolated beforehand. After the depopulation the chunk is
returned to the original slot (but is appended to the tail of the list
to minimize the chances of population).
Because the pcpu_lock is dropped while calling pcpu_depopulate_chunk(),
the chunk can be concurrently moved to a different slot. To prevent
this, bool chunk->isolated flag is introduced. If set, the chunk can't
be moved to a different slot.
The depopulation is scheduled on the free path. If the chunk:
1) has at least 1/8 of total pages free and populated
2) the system has enough free percpu pages aside of this chunk
3) isn't the reserved chunk
4) isn't the first chunk
5) isn't entirely free
it's a good target for depopulation.
If so, the chunk is moved to a special pcpu_depopulate_list,
chunk->isolated flag is set and the async balancing is scheduled.
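The five conditions can be restated as a single predicate. Below is a
userspace sketch that follows the check added to free_percpu() in this patch;
struct chunk_model, should_depopulate() and the value of EMPTY_POP_PAGES_HIGH
are assumptions made up for the illustration.

```c
#include <stdbool.h>

#define EMPTY_POP_PAGES_HIGH 4	/* assumed value, for illustration only */

/* Hypothetical, simplified view of the fields the check looks at. */
struct chunk_model {
	int nr_pages;		/* pages per unit in the chunk */
	int nr_empty_pop_pages;	/* empty and populated pages */
	bool is_first;
	bool is_reserved;
	bool isolated;		/* already queued for depopulation */
	bool entirely_free;	/* handled by the free-chunk path instead */
};

/* type_empty_pop_pages: empty populated pages for this chunk's type. */
static bool should_depopulate(const struct chunk_model *c,
			      int type_empty_pop_pages)
{
	if (c->entirely_free)
		return false;
	if (c->is_first || c->is_reserved || c->isolated)
		return false;
	if (c->nr_empty_pop_pages < c->nr_pages / 8)
		return false;
	/* Enough empty pages must remain even without this chunk's share. */
	return type_empty_pop_pages >
	       EMPTY_POP_PAGES_HIGH + c->nr_empty_pop_pages;
}
```

The last comparison is what "enough free percpu pages aside of this chunk"
means in practice: the chunk's own empty pages are set aside before comparing
against the high watermark.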
The async balancing moves pcpu_depopulate_list to a local list
(because pcpu_depopulate_list can be changed when pcpu_lock is
released), and then tries to depopulate each chunk. Successfully
or not, at the end all chunks are returned to appropriate slots
and their isolated flags are cleared.
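The page scan itself can be sketched in userspace as below. This is a model of
the loop as posted, not kernel code: the bool array stands in for "page is
empty and populated", and struct page_range and find_depopulate_ranges() are
names invented for the example. The locking and the early exit on a
system-wide shortage of empty pages are left out.

```c
#include <stdbool.h>

struct page_range {
	int start;	/* first page of the range */
	int end;	/* one past the last page, i.e. [start, end) */
};

/*
 * Collect the ranges the forward scan would hand to the depopulation
 * step. Returns the number of ranges stored in out[]; ranges beyond
 * max_out are dropped.
 */
static int find_depopulate_ranges(const bool *empty_and_populated,
				  int nr_pages,
				  struct page_range *out, int max_out)
{
	int i, start = -1, n = 0;

	for (i = 0; i < nr_pages; i++) {
		/* Start or extend the current [start, i) range. */
		if (empty_and_populated[i]) {
			if (start == -1)
				start = i;
			continue;
		}
		/* Page is in use or unpopulated: close the range, if any. */
		if (start == -1)
			continue;
		if (n < max_out) {
			out[n].start = start;
			out[n].end = i;
			n++;
		}
		start = -1;
	}
	return n;
}
```

Note that, like the loop it models, this sketch only closes a range when it
hits a page that is in use or unpopulated, so a run that extends to the last
page of the chunk is not reported.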
Many thanks to Dennis Zhou for his great ideas and a very constructive
discussion which led to many improvements in this patchset!
Signed-off-by: Roman Gushchin <[email protected]>
---
mm/percpu-internal.h | 1 +
mm/percpu.c | 101 ++++++++++++++++++++++++++++++++++++++++++-
2 files changed, 100 insertions(+), 2 deletions(-)
diff --git a/mm/percpu-internal.h b/mm/percpu-internal.h
index 095d7eaa0db4..ff318752915d 100644
--- a/mm/percpu-internal.h
+++ b/mm/percpu-internal.h
@@ -67,6 +67,7 @@ struct pcpu_chunk {
void *data; /* chunk data */
bool immutable; /* no [de]population allowed */
+ bool isolated; /* isolated from chunk slot lists */
int start_offset; /* the overlap with the previous
region to have a page aligned
base_addr */
diff --git a/mm/percpu.c b/mm/percpu.c
index e20119668c42..dae0b870e10a 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -181,6 +181,12 @@ static LIST_HEAD(pcpu_map_extend_chunks);
*/
int pcpu_nr_empty_pop_pages[PCPU_NR_CHUNK_TYPES];
+/*
+ * List of chunks with a lot of free pages. Used to depopulate them
+ * asynchronously.
+ */
+static LIST_HEAD(pcpu_depopulate_list);
+
/*
* The number of populated pages in use by the allocator, protected by
* pcpu_lock. This number is kept per a unit per chunk (i.e. when a page gets
@@ -542,7 +548,7 @@ static void pcpu_chunk_relocate(struct pcpu_chunk *chunk, int oslot)
{
int nslot = pcpu_chunk_slot(chunk);
- if (oslot != nslot)
+ if (!chunk->isolated && oslot != nslot)
__pcpu_chunk_move(chunk, nslot, oslot < nslot);
}
@@ -2048,6 +2054,82 @@ static void pcpu_grow_populated(enum pcpu_chunk_type type, int nr_to_pop)
}
}
+/**
+ * pcpu_shrink_populated - scan chunks and release unused pages to the system
+ * @type: chunk type
+ *
+ * Scan over all chunks, find those marked with the depopulate flag and
+ * try to release unused pages to the system. On every attempt clear the
+ * chunk's depopulate flag to avoid wasting CPU by scanning the same
+ * chunk again and again.
+ */
+static void pcpu_shrink_populated(enum pcpu_chunk_type type)
+{
+ struct pcpu_block_md *block;
+ struct pcpu_chunk *chunk, *tmp;
+ LIST_HEAD(to_depopulate);
+ int i, start;
+
+ spin_lock_irq(&pcpu_lock);
+
+ list_splice_init(&pcpu_depopulate_list, &to_depopulate);
+
+ list_for_each_entry_safe(chunk, tmp, &to_depopulate, list) {
+ WARN_ON(chunk->immutable);
+
+ for (i = 0, start = -1; i < chunk->nr_pages; i++) {
+ /*
+ * If the chunk has no empty pages or
+ * we're short on empty pages in general,
+ * just put the chunk back into the original slot.
+ */
+ if (!chunk->nr_empty_pop_pages ||
+ pcpu_nr_empty_pop_pages[type] <
+ PCPU_EMPTY_POP_PAGES_HIGH)
+ break;
+
+ /*
+ * If the page is empty and populated, start or
+ * extend the [start, i) range.
+ */
+ block = chunk->md_blocks + i;
+ if (block->contig_hint == PCPU_BITMAP_BLOCK_BITS &&
+ test_bit(i, chunk->populated)) {
+ if (start == -1)
+ start = i;
+ continue;
+ }
+
+ /*
+ * Otherwise check if there is an active range,
+ * and if yes, depopulate it.
+ */
+ if (start == -1)
+ continue;
+
+ spin_unlock_irq(&pcpu_lock);
+ pcpu_depopulate_chunk(chunk, start, i);
+ cond_resched();
+ spin_lock_irq(&pcpu_lock);
+
+ pcpu_chunk_depopulated(chunk, start, i);
+
+ /*
+ * Reset the range and continue.
+ */
+ start = -1;
+ }
+
+ /*
+ * Return the chunk to the corresponding slot.
+ */
+ chunk->isolated = false;
+ pcpu_chunk_relocate(chunk, -1);
+ }
+
+ spin_unlock_irq(&pcpu_lock);
+}
+
/**
* pcpu_balance_populated - manage the amount of populated pages
* @type: chunk type
@@ -2078,6 +2160,8 @@ static void pcpu_balance_populated(enum pcpu_chunk_type type)
} else if (pcpu_nr_empty_pop_pages[type] < PCPU_EMPTY_POP_PAGES_HIGH) {
nr_to_pop = PCPU_EMPTY_POP_PAGES_HIGH - pcpu_nr_empty_pop_pages[type];
pcpu_grow_populated(type, nr_to_pop);
+ } else if (!list_empty(&pcpu_depopulate_list)) {
+ pcpu_shrink_populated(type);
}
}
@@ -2135,7 +2219,12 @@ void free_percpu(void __percpu *ptr)
pcpu_memcg_free_hook(chunk, off, size);
- /* if there are more than one fully free chunks, wake up grim reaper */
+ /*
+ * If there are more than one fully free chunks, wake up grim reaper.
+ * Otherwise if at least 1/8 of its pages are empty and there is no
+ * system-wide shortage of empty pages aside from this chunk, isolate
+ * the chunk and schedule an async depopulation.
+ */
if (chunk->free_bytes == pcpu_unit_size) {
struct pcpu_chunk *pos;
@@ -2144,6 +2233,14 @@ void free_percpu(void __percpu *ptr)
need_balance = true;
break;
}
+ } else if (chunk != pcpu_first_chunk && chunk != pcpu_reserved_chunk &&
+ !chunk->isolated &&
+ chunk->nr_empty_pop_pages >= chunk->nr_pages / 8 &&
+ pcpu_nr_empty_pop_pages[pcpu_chunk_type(chunk)] >
+ PCPU_EMPTY_POP_PAGES_HIGH + chunk->nr_empty_pop_pages) {
+ list_move(&chunk->list, &pcpu_depopulate_list);
+ chunk->isolated = true;
+ need_balance = true;
}
trace_percpu_free_percpu(chunk->base_addr, off, ptr);
--
2.30.2
nr_empty_pop_pages is used to guarantee that there are some free
populated pages to satisfy atomic allocations. Accounted and
non-accounted allocations use separate sets of chunks,
so both need to have a surplus of empty pages.
This commit makes pcpu_nr_empty_pop_pages and the corresponding logic
per chunk type.
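As a rough userspace model of the change: the single counter becomes an array
indexed by chunk type, and the stats code sums it for display. The enum
values and helper names below are hypothetical stand-ins, not the kernel's
definitions.

```c
/* Hypothetical chunk types: one for non-accounted, one for accounted. */
enum chunk_type { CHUNK_ROOT, CHUNK_MEMCG, NR_CHUNK_TYPES };

/* Empty populated pages, tracked separately for each chunk type. */
static int nr_empty_pop_pages[NR_CHUNK_TYPES];

/* Adjust the counter of one type by nr (which may be negative). */
static void update_empty_pages(enum chunk_type type, int nr)
{
	nr_empty_pop_pages[type] += nr;
}

/* Sum over all types, as the stats output does after this change. */
static int total_empty_pop_pages(void)
{
	int type, total = 0;

	for (type = 0; type < NR_CHUNK_TYPES; type++)
		total += nr_empty_pop_pages[type];
	return total;
}
```

Watermark checks then look at one type's entry, so a surplus of empty pages
in one set of chunks no longer hides a shortage in the other.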
Signed-off-by: Roman Gushchin <[email protected]>
---
mm/percpu-internal.h | 2 +-
mm/percpu-stats.c | 9 +++++++--
mm/percpu.c | 14 +++++++-------
3 files changed, 15 insertions(+), 10 deletions(-)
diff --git a/mm/percpu-internal.h b/mm/percpu-internal.h
index 18b768ac7dca..095d7eaa0db4 100644
--- a/mm/percpu-internal.h
+++ b/mm/percpu-internal.h
@@ -87,7 +87,7 @@ extern spinlock_t pcpu_lock;
extern struct list_head *pcpu_chunk_lists;
extern int pcpu_nr_slots;
-extern int pcpu_nr_empty_pop_pages;
+extern int pcpu_nr_empty_pop_pages[];
extern struct pcpu_chunk *pcpu_first_chunk;
extern struct pcpu_chunk *pcpu_reserved_chunk;
diff --git a/mm/percpu-stats.c b/mm/percpu-stats.c
index c8400a2adbc2..f6026dbcdf6b 100644
--- a/mm/percpu-stats.c
+++ b/mm/percpu-stats.c
@@ -145,6 +145,7 @@ static int percpu_stats_show(struct seq_file *m, void *v)
int slot, max_nr_alloc;
int *buffer;
enum pcpu_chunk_type type;
+ int nr_empty_pop_pages;
alloc_buffer:
spin_lock_irq(&pcpu_lock);
@@ -165,7 +166,11 @@ static int percpu_stats_show(struct seq_file *m, void *v)
goto alloc_buffer;
}
-#define PL(X) \
+ nr_empty_pop_pages = 0;
+ for (type = 0; type < PCPU_NR_CHUNK_TYPES; type++)
+ nr_empty_pop_pages += pcpu_nr_empty_pop_pages[type];
+
+#define PL(X) \
seq_printf(m, " %-20s: %12lld\n", #X, (long long int)pcpu_stats_ai.X)
seq_printf(m,
@@ -196,7 +201,7 @@ static int percpu_stats_show(struct seq_file *m, void *v)
PU(nr_max_chunks);
PU(min_alloc_size);
PU(max_alloc_size);
- P("empty_pop_pages", pcpu_nr_empty_pop_pages);
+ P("empty_pop_pages", nr_empty_pop_pages);
seq_putc(m, '\n');
#undef PU
diff --git a/mm/percpu.c b/mm/percpu.c
index 5b505a459028..0eeeb4e7a2f9 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -173,10 +173,10 @@ struct list_head *pcpu_chunk_lists __ro_after_init; /* chunk list slots */
static LIST_HEAD(pcpu_map_extend_chunks);
/*
- * The number of empty populated pages, protected by pcpu_lock. The
- * reserved chunk doesn't contribute to the count.
+ * The number of empty populated pages by chunk type, protected by pcpu_lock.
+ * The reserved chunk doesn't contribute to the count.
*/
-int pcpu_nr_empty_pop_pages;
+int pcpu_nr_empty_pop_pages[PCPU_NR_CHUNK_TYPES];
/*
* The number of populated pages in use by the allocator, protected by
@@ -556,7 +556,7 @@ static inline void pcpu_update_empty_pages(struct pcpu_chunk *chunk, int nr)
{
chunk->nr_empty_pop_pages += nr;
if (chunk != pcpu_reserved_chunk)
- pcpu_nr_empty_pop_pages += nr;
+ pcpu_nr_empty_pop_pages[pcpu_chunk_type(chunk)] += nr;
}
/*
@@ -1832,7 +1832,7 @@ static void __percpu *pcpu_alloc(size_t size, size_t align, bool reserved,
mutex_unlock(&pcpu_alloc_mutex);
}
- if (pcpu_nr_empty_pop_pages < PCPU_EMPTY_POP_PAGES_LOW)
+ if (pcpu_nr_empty_pop_pages[type] < PCPU_EMPTY_POP_PAGES_LOW)
pcpu_schedule_balance_work();
/* clear the areas and return address relative to base address */
@@ -2010,7 +2010,7 @@ static void pcpu_balance_populated(enum pcpu_chunk_type type)
pcpu_atomic_alloc_failed = false;
} else {
nr_to_pop = clamp(PCPU_EMPTY_POP_PAGES_HIGH -
- pcpu_nr_empty_pop_pages,
+ pcpu_nr_empty_pop_pages[type],
0, PCPU_EMPTY_POP_PAGES_HIGH);
}
@@ -2592,7 +2592,7 @@ void __init pcpu_setup_first_chunk(const struct pcpu_alloc_info *ai,
/* link the first chunk in */
pcpu_first_chunk = chunk;
- pcpu_nr_empty_pop_pages = pcpu_first_chunk->nr_empty_pop_pages;
+ pcpu_nr_empty_pop_pages[PCPU_CHUNK_ROOT] = pcpu_first_chunk->nr_empty_pop_pages;
pcpu_chunk_relocate(pcpu_first_chunk, -1);
/* include all regions of the first chunk */
--
2.30.2
On Thu, Apr 01, 2021 at 02:43:01PM -0700, Roman Gushchin wrote:
> This patch implements partial depopulation of percpu chunks.
>
> As now, a chunk can be depopulated only as a part of the final
> destruction, if there are no more outstanding allocations. However
> to minimize a memory waste it might be useful to depopulate a
> partially filed chunk, if a small number of outstanding allocations
> prevents the chunk from being fully reclaimed.
>
> This patch implements the following depopulation process: it scans
> over the chunk pages, looks for a range of empty and populated pages
> and performs the depopulation. To avoid races with new allocations,
> the chunk is previously isolated. After the depopulation the chunk is
> returned to the original slot (but is appended to the tail of the list
> to minimize the chances of population).
>
> Because the pcpu_lock is dropped while calling pcpu_depopulate_chunk(),
> the chunk can be concurrently moved to a different slot. To prevent
> this, bool chunk->isolated flag is introduced. If set, the chunk can't
> be moved to a different slot.
>
> The depopulation is scheduled on the free path. Is the chunk:
> 1) has more than 1/8 of total pages free and populated
> 2) the system has enough free percpu pages aside of this chunk
> 3) isn't the reserved chunk
> 4) isn't the first chunk
> 5) isn't entirely free
> it's a good target for depopulation.
>
> If so, the chunk is moved to a special pcpu_depopulate_list,
> chunk->isolate flag is set and the async balancing is scheduled.
>
> The async balancing moves pcpu_depopulate_list to a local list
> (because pcpu_depopulate_list can be changed when pcpu_lock is
> releases), and then tries to depopulate each chunk. Successfully
> or not, at the end all chunks are returned to appropriate slots
> and their isolated flags are cleared.
>
> Many thanks to Dennis Zhou for his great ideas and a very constructive
> discussion which led to many improvements in this patchset!
>
> Signed-off-by: Roman Gushchin <[email protected]>
> ---
> mm/percpu-internal.h | 1 +
> mm/percpu.c | 101 ++++++++++++++++++++++++++++++++++++++++++-
> 2 files changed, 100 insertions(+), 2 deletions(-)
>
> diff --git a/mm/percpu-internal.h b/mm/percpu-internal.h
> index 095d7eaa0db4..ff318752915d 100644
> --- a/mm/percpu-internal.h
> +++ b/mm/percpu-internal.h
> @@ -67,6 +67,7 @@ struct pcpu_chunk {
>
> void *data; /* chunk data */
> bool immutable; /* no [de]population allowed */
> + bool isolated; /* isolated from chunk slot lists */
> int start_offset; /* the overlap with the previous
> region to have a page aligned
> base_addr */
> diff --git a/mm/percpu.c b/mm/percpu.c
> index e20119668c42..dae0b870e10a 100644
> --- a/mm/percpu.c
> +++ b/mm/percpu.c
> @@ -181,6 +181,12 @@ static LIST_HEAD(pcpu_map_extend_chunks);
> */
> int pcpu_nr_empty_pop_pages[PCPU_NR_CHUNK_TYPES];
>
> +/*
> + * List of chunks with a lot of free pages. Used to depopulate them
> + * asynchronously.
> + */
> +static LIST_HEAD(pcpu_depopulate_list);
> +
Now that pcpu_nr_empty_pop_pages is per chunk_type I think the
depopulate_list should be per chunk_type.
> /*
> * The number of populated pages in use by the allocator, protected by
> * pcpu_lock. This number is kept per a unit per chunk (i.e. when a page gets
> @@ -542,7 +548,7 @@ static void pcpu_chunk_relocate(struct pcpu_chunk *chunk, int oslot)
> {
> int nslot = pcpu_chunk_slot(chunk);
>
> - if (oslot != nslot)
> + if (!chunk->isolated && oslot != nslot)
> __pcpu_chunk_move(chunk, nslot, oslot < nslot);
> }
>
> @@ -2048,6 +2054,82 @@ static void pcpu_grow_populated(enum pcpu_chunk_type type, int nr_to_pop)
> }
> }
>
> +/**
> + * pcpu_shrink_populated - scan chunks and release unused pages to the system
> + * @type: chunk type
> + *
> + * Scan over all chunks, find those marked with the depopulate flag and
> + * try to release unused pages to the system. On every attempt clear the
> + * chunk's depopulate flag to avoid wasting CPU by scanning the same
> + * chunk again and again.
> + */
There no longer is a depopulate flag.
> +static void pcpu_shrink_populated(enum pcpu_chunk_type type)
> +{
> + struct pcpu_block_md *block;
> + struct pcpu_chunk *chunk, *tmp;
> + LIST_HEAD(to_depopulate);
> + int i, start;
> +
> + spin_lock_irq(&pcpu_lock);
> +
> + list_splice_init(&pcpu_depopulate_list, &to_depopulate);
> +
> + list_for_each_entry_safe(chunk, tmp, &to_depopulate, list) {
> + WARN_ON(chunk->immutable);
> +
> + for (i = 0, start = -1; i < chunk->nr_pages; i++) {
> + /*
> + * If the chunk has no empty pages or
> + * we're short on empty pages in general,
> + * just put the chunk back into the original slot.
> + */
> + if (!chunk->nr_empty_pop_pages ||
> + pcpu_nr_empty_pop_pages[type] <
> + PCPU_EMPTY_POP_PAGES_HIGH)
> + break;
This isn't ideal because if we do drop below PCPU_EMPTY_POP_PAGES_HIGH
because of the next deallocation range, then we're leaving the region
that's going to get allocated next unpopulated and the populated pages
stranded later on. See below for more discussion.
> +
> + /*
> + * If the page is empty and populated, start or
> + * extend the [start, i) range.
> + */
> + block = chunk->md_blocks + i;
> + if (block->contig_hint == PCPU_BITMAP_BLOCK_BITS &&
> + test_bit(i, chunk->populated)) {
> + if (start == -1)
> + start = i;
> + continue;
> + }
> +
> + /*
> + * Otherwise check if there is an active range,
> + * and if yes, depopulate it.
> + */
> + if (start == -1)
> + continue;
> +
> + spin_unlock_irq(&pcpu_lock);
> + pcpu_depopulate_chunk(chunk, start, i);
> + cond_resched();
> + spin_lock_irq(&pcpu_lock);
> +
> + pcpu_chunk_depopulated(chunk, start, i);
> +
> + /*
> + * Reset the range and continue.
> + */
> + start = -1;
> + }
> +
> + /*
> + * Return the chunk to the corresponding slot.
> + */
> + chunk->isolated = false;
> + pcpu_chunk_relocate(chunk, -1);
> + }
> +
> + spin_unlock_irq(&pcpu_lock);
> +}
> +
> /**
> * pcpu_balance_populated - manage the amount of populated pages
> * @type: chunk type
> @@ -2078,6 +2160,8 @@ static void pcpu_balance_populated(enum pcpu_chunk_type type)
> } else if (pcpu_nr_empty_pop_pages[type] < PCPU_EMPTY_POP_PAGES_HIGH) {
> nr_to_pop = PCPU_EMPTY_POP_PAGES_HIGH - pcpu_nr_empty_pop_pages[type];
> pcpu_grow_populated(type, nr_to_pop);
> + } else if (!list_empty(&pcpu_depopulate_list)) {
> + pcpu_shrink_populated(type);
> }
> }
>
> @@ -2135,7 +2219,12 @@ void free_percpu(void __percpu *ptr)
>
> pcpu_memcg_free_hook(chunk, off, size);
>
> - /* if there are more than one fully free chunks, wake up grim reaper */
> + /*
> + * If there are more than one fully free chunks, wake up grim reaper.
> + * Otherwise if at least 1/8 of its pages are empty and there is no
> + * system-wide shortage of empty pages aside from this chunk, isolate
> + * the chunk and schedule an async depopulation.
> + */
> if (chunk->free_bytes == pcpu_unit_size) {
> struct pcpu_chunk *pos;
>
> @@ -2144,6 +2233,14 @@ void free_percpu(void __percpu *ptr)
> need_balance = true;
> break;
> }
> + } else if (chunk != pcpu_first_chunk && chunk != pcpu_reserved_chunk &&
> + !chunk->isolated &&
> + chunk->nr_empty_pop_pages >= chunk->nr_pages / 8 &&
> + pcpu_nr_empty_pop_pages[pcpu_chunk_type(chunk)] >
> + PCPU_EMPTY_POP_PAGES_HIGH + chunk->nr_empty_pop_pages) {
> + list_move(&chunk->list, &pcpu_depopulate_list);
I'm quite possibly missing something here. Right now, when we free,
we're not doing anything to place the floating empty pages in a
particular chunk (such as the first chunk in slot[PAGE_SIZE]). If the
ordering is bad, we can end up leaving the float pages in some random
chunk and because we scan forwards, they'll be populated pages at the
end of a chunk. Not exactly the most useful.
I wonder if we should free everything in the depopulate list and then
scan up and repopulate a # of pages scanning up from slot(PAGE_SIZE)?
This would add additional churn though. If not we need to switch the
direction of freeing.
So questions:
1. How do we keep the float pages in a way that they're most likely to
be found next by the allocator?
2. Do we need to change the direction of freeing or change the
accounting above?
> + chunk->isolated = true;
> + need_balance = true;
> }
>
> trace_percpu_free_percpu(chunk->base_addr, off, ptr);
> --
> 2.30.2
>
On 4/1/21 11:42 PM, Roman Gushchin wrote:
> In our production experience the percpu memory allocator is sometimes struggling
> with returning the memory to the system. A typical example is a creation of
> several thousands memory cgroups (each has several chunks of the percpu data
> used for vmstats, vmevents, ref counters etc). Deletion and complete releasing
> of these cgroups doesn't always lead to a shrinkage of the percpu memory.
>
> The underlying problem is the fragmentation: to release an underlying chunk
> all percpu allocations should be released first. The percpu allocator tends
> to top up chunks to improve the utilization. It means new small-ish allocations
> (e.g. percpu ref counters) are placed onto almost filled old-ish chunks,
> effectively pinning them in memory.
>
> This patchset pretends to solve this problem by implementing a partial
Really "pretends"? :) Or did you mean "attempts"?