In our production experience the percpu memory allocator sometimes struggles
to return memory to the system. A typical example is the creation of several
thousand memory cgroups (each comes with several percpu allocations used for
vmstats, vmevents, ref counters, etc.). Deleting and completely releasing
these cgroups doesn't always shrink the percpu memory.
The underlying problem is fragmentation: to release an underlying chunk, all
percpu allocations in it must be released first. The percpu allocator tends
to top up existing chunks to improve utilization, which means new small-ish
allocations (e.g. percpu ref counters) are placed into almost-full older
chunks, effectively pinning them in memory.
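To make the pinning concrete, here is a hedged userspace simulation (plain C,
not kernel code; NR_CHUNKS, SLOTS_PER_CHUNK and alloc_obj() are made up for
illustration). It packs objects into the most-utilized chunk that still has
room and then frees ~90% of them at random; since a chunk can only be released
once every object in it is gone, essentially no chunk ends up empty:
--
/*
 * Toy simulation of the pinning effect described above.  None of the
 * structures or names here come from the kernel.
 */
#include <stdio.h>
#include <stdlib.h>

#define NR_CHUNKS	256
#define SLOTS_PER_CHUNK	512

static int used[NR_CHUNKS];			/* live objects per chunk */
static int was_used[NR_CHUNKS];			/* chunk ever held an object */
static int obj_chunk[NR_CHUNKS * SLOTS_PER_CHUNK];

/* place an object into the most-utilized chunk that still has room */
static int alloc_obj(void)
{
	int best = -1;
	int i;

	for (i = 0; i < NR_CHUNKS; i++)
		if (used[i] < SLOTS_PER_CHUNK &&
		    (best < 0 || used[i] > used[best]))
			best = i;
	used[best]++;
	was_used[best] = 1;
	return best;
}

int main(void)
{
	int nr_objs = NR_CHUNKS * SLOTS_PER_CHUNK / 2;
	int populated = 0, empty = 0;
	int i;

	srand(1);

	for (i = 0; i < nr_objs; i++)
		obj_chunk[i] = alloc_obj();

	/* free ~90% of the objects, picked at random */
	for (i = 0; i < nr_objs; i++)
		if (rand() % 10)
			used[obj_chunk[i]]--;

	for (i = 0; i < NR_CHUNKS; i++) {
		populated += was_used[i];
		empty += was_used[i] && used[i] == 0;
	}

	printf("~90%% of objects freed, yet only %d of %d populated chunks are empty\n",
	       empty, populated);
	return 0;
}
--
The survivors are spread across every populated chunk, so all of them stay
pinned even though ~90% of the memory is logically free.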
This patchset addresses the problem by implementing partial depopulation of
percpu chunks: chunks with many empty pages are asynchronously depopulated
and their pages are returned to the system.
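As a rough sketch of what depopulation means here, assuming only that the
allocator can tell which pages of a chunk back no live allocation
(struct toy_chunk and depopulate_chunk() below are hypothetical; the real
patches operate on the kernel's chunk metadata and run asynchronously):
--
/*
 * Toy illustration of partial depopulation: walk a chunk's pages and
 * return the backing pages that are populated but carry no live
 * allocation.
 */
#include <stdbool.h>
#include <stdio.h>

#define PAGES_PER_CHUNK	8

struct toy_chunk {
	bool page_used[PAGES_PER_CHUNK];	/* page backs a live allocation */
	bool page_populated[PAGES_PER_CHUNK];	/* page has backing memory */
};

/* free the backing memory of every populated-but-unused page */
static int depopulate_chunk(struct toy_chunk *chunk)
{
	int i, freed = 0;

	for (i = 0; i < PAGES_PER_CHUNK; i++) {
		if (chunk->page_populated[i] && !chunk->page_used[i]) {
			/* the real code would unmap and free the page here */
			chunk->page_populated[i] = false;
			freed++;
		}
	}
	return freed;
}

int main(void)
{
	struct toy_chunk chunk = {
		.page_used	= { true, false, false, true,
				    false, false, false, true },
		.page_populated	= { true, true, true, true,
				    true, true, true, true },
	};

	printf("returned %d of %d pages to the system\n",
	       depopulate_chunk(&chunk), PAGES_PER_CHUNK);
	return 0;
}
--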
To illustrate the problem, the following script can be used:
--
#!/bin/bash

cd /sys/fs/cgroup

mkdir percpu_test
echo "+memory" > percpu_test/cgroup.subtree_control

# initial percpu usage
cat /proc/meminfo | grep Percpu

# create 11000 memory cgroups
for i in `seq 1 1000`; do
    mkdir percpu_test/cg_"${i}"
    for j in `seq 1 10`; do
        mkdir percpu_test/cg_"${i}"_"${j}"
    done
done

# percpu usage with all cgroups in place
cat /proc/meminfo | grep Percpu

# remove 10 out of every 11 cgroups
for i in `seq 1 1000`; do
    for j in `seq 1 10`; do
        rmdir percpu_test/cg_"${i}"_"${j}"
    done
done

# give asynchronous freeing a chance to run
sleep 10

# percpu usage after the removal
cat /proc/meminfo | grep Percpu

for i in `seq 1 1000`; do
    rmdir percpu_test/cg_"${i}"
done

rmdir percpu_test
--
It creates 11000 memory cgroups and removes 10 out of every 11 of them.
It prints the initial size of the percpu memory, the size after creating
all cgroups, and the size after deleting most of them.
Results:
vanilla:
$ ./percpu_test.sh
Percpu: 7296 kB
Percpu: 481024 kB
Percpu: 481024 kB
with this patchset applied:
$ ./percpu_test.sh
Percpu: 7488 kB
Percpu: 481152 kB
Percpu: 153920 kB
So the total size of the percpu memory was reduced roughly threefold.
Roman Gushchin (4):
percpu: implement partial chunk depopulation
percpu: split __pcpu_balance_workfn()
percpu: on demand chunk depopulation
percpu: fix a comment about the chunks ordering
mm/percpu-internal.h | 1 +
mm/percpu.c | 242 ++++++++++++++++++++++++++++++++++++-------
2 files changed, 203 insertions(+), 40 deletions(-)
--
2.30.2
Since commit 3e54097beb22 ("percpu: manage chunks based on
contig_bits instead of free_bytes"), chunks are sorted based on the
size of the biggest continuous free area instead of the total number
of free bytes. Update the corresponding comment to reflect this
(a rough illustration of the slot bucketing follows the diff below).
Signed-off-by: Roman Gushchin <[email protected]>
---
mm/percpu.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/mm/percpu.c b/mm/percpu.c
index 148137f0fc0b..08fb6e5d3232 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -99,7 +99,10 @@
#include "percpu-internal.h"
-/* the slots are sorted by free bytes left, 1-31 bytes share the same slot */
+/*
+ * The slots are sorted by the size of the biggest continuous free area.
+ * 1-31 bytes share the same slot.
+ */
#define PCPU_SLOT_BASE_SHIFT 5
/* chunks in slots below this are subject to being sidelined on failed alloc */
#define PCPU_SLOT_FAIL_THRESHOLD 3
--
2.30.2
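For context on the updated comment, here is a hedged userspace illustration
of bucketing a chunk by the size of its biggest contiguous free area;
fls_approx() and size_to_slot() are illustrative stand-ins and do not
reproduce the kernel's exact slot formula:
--
/*
 * Toy illustration of keying a chunk's list slot off the size of its
 * biggest contiguous free area.  The only property mirrored here is
 * that sizes below 32 bytes (1 << PCPU_SLOT_BASE_SHIFT) collapse into
 * the lowest slot.
 */
#include <stdio.h>

#define PCPU_SLOT_BASE_SHIFT	5	/* 1-31 bytes share the same slot */

/* position of the highest set bit, 1-based; 0 for x == 0 */
static int fls_approx(unsigned int x)
{
	int bit = 0;

	while (x) {
		bit++;
		x >>= 1;
	}
	return bit;
}

/* map the biggest contiguous free area (in bytes) to a slot index */
static int size_to_slot(unsigned int contig_bytes)
{
	int highbit = fls_approx(contig_bytes);

	if (highbit <= PCPU_SLOT_BASE_SHIFT)
		return 1;
	return highbit - PCPU_SLOT_BASE_SHIFT + 1;
}

int main(void)
{
	unsigned int sizes[] = { 8, 31, 32, 63, 64, 1024, 4096 };
	unsigned int i;

	for (i = 0; i < sizeof(sizes) / sizeof(sizes[0]); i++)
		printf("biggest free area %4u bytes -> slot %d\n",
		       sizes[i], size_to_slot(sizes[i]));
	return 0;
}
--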