2021-04-08 04:02:27

by Roman Gushchin

Subject: [PATCH v3 0/6] percpu: partial chunk depopulation

In our production experience the percpu memory allocator sometimes struggles
to return memory to the system. A typical example is the creation of several
thousand memory cgroups (each of which has several chunks of percpu data used
for vmstats, vmevents, ref counters etc.). Deleting and completely releasing
these cgroups doesn't always lead to a shrinkage of the percpu memory, so
sometimes several GBs of memory are wasted.

The underlying problem is fragmentation: to release a chunk, all percpu
allocations in it have to be released first. The percpu allocator tends to
top up existing chunks to improve utilization, which means new small-ish
allocations (e.g. percpu ref counters) are placed onto almost-full old-ish
chunks, effectively pinning them in memory.

This patchset solves the problem by implementing partial depopulation of
percpu chunks: chunks with many empty pages are asynchronously depopulated
and the pages are returned to the system.

To illustrate the problem the following script can be used:

--
#!/bin/bash

cd /sys/fs/cgroup

mkdir percpu_test
echo "+memory" > percpu_test/cgroup.subtree_control

cat /proc/meminfo | grep Percpu

for i in `seq 1 1000`; do
mkdir percpu_test/cg_"${i}"
for j in `seq 1 10`; do
mkdir percpu_test/cg_"${i}"_"${j}"
done
done

cat /proc/meminfo | grep Percpu

for i in `seq 1 1000`; do
for j in `seq 1 10`; do
rmdir percpu_test/cg_"${i}"_"${j}"
done
done

sleep 10

cat /proc/meminfo | grep Percpu

for i in `seq 1 1000`; do
rmdir percpu_test/cg_"${i}"
done

rmdir percpu_test
--

It creates 11000 memory cgroups and removes 10 out of every 11.
It prints the initial size of the percpu memory, the size after
creating all cgroups, and the size after deleting most of them.
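
The script assumes the cgroup v2 unified hierarchy is mounted at
/sys/fs/cgroup with the memory controller available in the root cgroup.
A minimal pre-flight check along these lines (a sketch, not part of the
original script) can save a confusing run:

--
#!/bin/bash

# the unified hierarchy has to be mounted where the test expects it
mountpoint -q /sys/fs/cgroup || { echo "/sys/fs/cgroup is not mounted"; exit 1; }
grep -q cgroup2 /proc/filesystems || { echo "no cgroup v2 support"; exit 1; }

# the memory controller must be available in the root cgroup
grep -qw memory /sys/fs/cgroup/cgroup.controllers || \
	{ echo "memory controller not available"; exit 1; }
--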

Results:
vanilla:
./percpu_test.sh
Percpu: 7488 kB
Percpu: 481152 kB
Percpu: 481152 kB

with this patchset applied:
./percpu_test.sh
Percpu: 7488 kB
Percpu: 481408 kB
Percpu: 135552 kB

So the total size of the percpu memory was reduced by more than 3.5 times.

v3:
- introduced pcpu_check_chunk_hint()
- fixed a bug related to the hint check
- minor cosmetic changes
- s/pretends/fixes (cc Vlastimil)

v2:
- depopulated chunks are sidelined
- depopulation happens in the reverse order
- depopulate list made per-chunk type
- better results due to better heuristics

v1:
- depopulation heuristics changed and optimized
- chunks are put into a separate list, depopulation scan this list
- chunk->isolated is introduced, chunk->depopulate is dropped
- rearranged patches a bit
- fixed a panic discovered by krobot
- made pcpu_nr_empty_pop_pages per chunk type
- minor fixes

rfc:
https://lwn.net/Articles/850508/


Roman Gushchin (6):
percpu: fix a comment about the chunks ordering
percpu: split __pcpu_balance_workfn()
percpu: make pcpu_nr_empty_pop_pages per chunk type
percpu: generalize pcpu_balance_populated()
percpu: factor out pcpu_check_chunk_hint()
percpu: implement partial chunk depopulation

mm/percpu-internal.h | 4 +-
mm/percpu-stats.c | 9 +-
mm/percpu.c | 306 +++++++++++++++++++++++++++++++++++--------
3 files changed, 261 insertions(+), 58 deletions(-)

--
2.30.2


2021-04-08 04:02:51

by Roman Gushchin

Subject: [PATCH v3 6/6] percpu: implement partial chunk depopulation

This patch implements partial depopulation of percpu chunks.

As of now, a chunk can be depopulated only as part of its final
destruction, when there are no more outstanding allocations. However,
to minimize memory waste it can be useful to depopulate a partially
filled chunk when a small number of outstanding allocations prevents
it from being fully reclaimed.

This patch implements the following depopulation process: it scans
over the chunk's pages, looks for ranges of empty and populated pages
and depopulates them. To avoid races with new allocations, the chunk
is isolated beforehand. After the depopulation the chunk is either
sidelined to a special list or freed. New allocations can't be served
from a sidelined chunk; the chunk can be moved back to a corresponding
slot if there aren't enough chunks with empty populated pages.

The depopulation is scheduled on the free path. If the chunk:
1) has more than 1/4 of its total pages free and populated,
2) the system has enough free percpu pages aside from this chunk,
3) isn't the reserved chunk,
4) isn't the first chunk,
5) isn't entirely free,
it's a good target for depopulation. If it's already depopulated
but has gained free populated pages, it's a good target too.
The chunk is moved to the special pcpu_depopulate_list, its
chunk->isolated flag is set and the async balancing is scheduled.

The async balancing moves pcpu_depopulate_list to a local list
(because pcpu_depopulate_list can change while pcpu_lock is
released), and then tries to depopulate each chunk. Each chunk is
depopulated in the reverse direction to keep populated pages close to
the beginning, and the scan stops once the global number of empty
populated pages reaches the threshold. Depopulated chunks are
sidelined to prevent further allocations. Skipped and fully empty
chunks are returned to the corresponding slot.

On the allocation path, if no suitable chunk is found, the list of
sidelined chunks is scanned before creating a new chunk. If there is
a suitable sidelined chunk, it's placed back into the corresponding
slot and the scan is restarted.

Many thanks to Dennis Zhou for his great ideas and a very constructive
discussion which led to many improvements in this patchset!

Signed-off-by: Roman Gushchin <[email protected]>
---
mm/percpu-internal.h | 2 +
mm/percpu.c | 158 ++++++++++++++++++++++++++++++++++++++++++-
2 files changed, 158 insertions(+), 2 deletions(-)

diff --git a/mm/percpu-internal.h b/mm/percpu-internal.h
index 095d7eaa0db4..8e432663c41e 100644
--- a/mm/percpu-internal.h
+++ b/mm/percpu-internal.h
@@ -67,6 +67,8 @@ struct pcpu_chunk {

void *data; /* chunk data */
bool immutable; /* no [de]population allowed */
+ bool isolated; /* isolated from chunk slot lists */
+ bool depopulated; /* sidelined after depopulation */
int start_offset; /* the overlap with the previous
region to have a page aligned
base_addr */
diff --git a/mm/percpu.c b/mm/percpu.c
index 357fd6994278..5bb294e394b3 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -181,6 +181,19 @@ static LIST_HEAD(pcpu_map_extend_chunks);
*/
int pcpu_nr_empty_pop_pages[PCPU_NR_CHUNK_TYPES];

+/*
+ * List of chunks with a lot of free pages. Used to depopulate them
+ * asynchronously.
+ */
+static struct list_head pcpu_depopulate_list[PCPU_NR_CHUNK_TYPES];
+
+/*
+ * List of previously depopulated chunks. They are not usually used for new
+ * allocations, but can be returned back to service if a need arises.
+ */
+static struct list_head pcpu_sideline_list[PCPU_NR_CHUNK_TYPES];
+
+
/*
* The number of populated pages in use by the allocator, protected by
* pcpu_lock. This number is kept per a unit per chunk (i.e. when a page gets
@@ -562,6 +575,12 @@ static void pcpu_chunk_relocate(struct pcpu_chunk *chunk, int oslot)
{
int nslot = pcpu_chunk_slot(chunk);

+ /*
+ * Keep isolated and depopulated chunks on a sideline.
+ */
+ if (chunk->isolated || chunk->depopulated)
+ return;
+
if (oslot != nslot)
__pcpu_chunk_move(chunk, nslot, oslot < nslot);
}
@@ -1790,6 +1809,19 @@ static void __percpu *pcpu_alloc(size_t size, size_t align, bool reserved,
}
}

+ /* search through sidelined depopulated chunks */
+ list_for_each_entry(chunk, &pcpu_sideline_list[type], list) {
+ /*
+ * If the allocation can fit the chunk, place the chunk back
+ * into corresponding slot and restart the scanning.
+ */
+ if (pcpu_check_chunk_hint(&chunk->chunk_md, bits, bit_align)) {
+ chunk->depopulated = false;
+ pcpu_chunk_relocate(chunk, -1);
+ goto restart;
+ }
+ }
+
spin_unlock_irqrestore(&pcpu_lock, flags);

/*
@@ -2060,6 +2092,106 @@ static void pcpu_grow_populated(enum pcpu_chunk_type type, int nr_to_pop)
}
}

+/**
+ * pcpu_shrink_populated - scan chunks and release unused pages to the system
+ * @type: chunk type
+ *
+ * Scan over chunks in the depopulate list, try to release unused populated
+ * pages to the system. Depopulated chunks are sidelined to prevent further
+ * allocations without a need. Skipped and fully free chunks are returned
+ * to corresponding slots. Stop depopulating if the number of empty populated
+ * pages reaches the threshold. Each chunk is scanned in the reverse order to
+ * keep populated pages close to the beginning of the chunk.
+ */
+static void pcpu_shrink_populated(enum pcpu_chunk_type type)
+{
+ struct pcpu_block_md *block;
+ struct pcpu_chunk *chunk, *tmp;
+ LIST_HEAD(to_depopulate);
+ bool depopulated;
+ int i, end;
+
+ spin_lock_irq(&pcpu_lock);
+
+ list_splice_init(&pcpu_depopulate_list[type], &to_depopulate);
+
+ list_for_each_entry_safe(chunk, tmp, &to_depopulate, list) {
+ WARN_ON(chunk->immutable);
+ depopulated = false;
+
+ /*
+ * Scan chunk's pages in the reverse order to keep populated
+ * pages close to the beginning of the chunk.
+ */
+ for (i = chunk->nr_pages - 1, end = -1; i >= 0; i--) {
+ /*
+ * If the chunk has no empty pages or
+ * we're short on empty pages in general,
+ * just put the chunk back into the original slot.
+ */
+ if (!chunk->nr_empty_pop_pages ||
+ pcpu_nr_empty_pop_pages[type] <=
+ PCPU_EMPTY_POP_PAGES_HIGH)
+ break;
+
+ /*
+ * If the page is empty and populated, start or
+ * extend the (i, end) range. If i == 0, decrease
+ * i and perform the depopulation to cover the last
+ * (first) page in the chunk.
+ */
+ block = chunk->md_blocks + i;
+ if (block->contig_hint == PCPU_BITMAP_BLOCK_BITS &&
+ test_bit(i, chunk->populated)) {
+ if (end == -1)
+ end = i;
+ if (i > 0)
+ continue;
+ i--;
+ }
+
+ /*
+ * Otherwise check if there is an active range,
+ * and if yes, depopulate it.
+ */
+ if (end == -1)
+ continue;
+
+ depopulated = true;
+
+ spin_unlock_irq(&pcpu_lock);
+ pcpu_depopulate_chunk(chunk, i + 1, end + 1);
+ cond_resched();
+ spin_lock_irq(&pcpu_lock);
+
+ pcpu_chunk_depopulated(chunk, i + 1, end + 1);
+
+ /*
+ * Reset the range and continue.
+ */
+ end = -1;
+ }
+
+ chunk->isolated = false;
+ if (chunk->free_bytes == pcpu_unit_size || !depopulated) {
+ /*
+ * If the chunk is empty or hasn't been depopulated,
+ * return it to the original slot.
+ */
+ pcpu_chunk_relocate(chunk, -1);
+ } else {
+ /*
+ * Otherwise put the chunk to the list of depopulated
+ * chunks.
+ */
+ chunk->depopulated = true;
+ list_move(&chunk->list, &pcpu_sideline_list[type]);
+ }
+ }
+
+ spin_unlock_irq(&pcpu_lock);
+}
+
/**
* pcpu_balance_populated - manage the amount of populated pages
* @type: chunk type
@@ -2090,6 +2222,8 @@ static void pcpu_balance_populated(enum pcpu_chunk_type type)
} else if (pcpu_nr_empty_pop_pages[type] < PCPU_EMPTY_POP_PAGES_HIGH) {
nr_to_pop = PCPU_EMPTY_POP_PAGES_HIGH - pcpu_nr_empty_pop_pages[type];
pcpu_grow_populated(type, nr_to_pop);
+ } else if (!list_empty(&pcpu_depopulate_list[type])) {
+ pcpu_shrink_populated(type);
}
}

@@ -2147,7 +2281,13 @@ void free_percpu(void __percpu *ptr)

pcpu_memcg_free_hook(chunk, off, size);

- /* if there are more than one fully free chunks, wake up grim reaper */
+ /*
+ * If there are more than one fully free chunks, wake up grim reaper.
+ * Otherwise if at least 1/4 of its pages are empty and there is no
+ * system-wide shortage of empty pages aside from this chunk, isolate
+ * the chunk and schedule an async depopulation. If the chunk was
+ * depopulated previously and got free pages, depopulate it too.
+ */
if (chunk->free_bytes == pcpu_unit_size) {
struct pcpu_chunk *pos;

@@ -2156,6 +2296,16 @@ void free_percpu(void __percpu *ptr)
need_balance = true;
break;
}
+ } else if (chunk != pcpu_first_chunk && chunk != pcpu_reserved_chunk &&
+ !chunk->isolated &&
+ (pcpu_nr_empty_pop_pages[pcpu_chunk_type(chunk)] >
+ PCPU_EMPTY_POP_PAGES_HIGH + chunk->nr_empty_pop_pages) &&
+ ((chunk->depopulated && chunk->nr_empty_pop_pages) ||
+ (chunk->nr_empty_pop_pages >= chunk->nr_pages / 4))) {
+ list_move(&chunk->list, &pcpu_depopulate_list[pcpu_chunk_type(chunk)]);
+ chunk->isolated = true;
+ chunk->depopulated = false;
+ need_balance = true;
}

trace_percpu_free_percpu(chunk->base_addr, off, ptr);
@@ -2583,10 +2733,14 @@ void __init pcpu_setup_first_chunk(const struct pcpu_alloc_info *ai,
pcpu_nr_slots * sizeof(pcpu_chunk_lists[0]) *
PCPU_NR_CHUNK_TYPES);

- for (type = 0; type < PCPU_NR_CHUNK_TYPES; type++)
+ for (type = 0; type < PCPU_NR_CHUNK_TYPES; type++) {
for (i = 0; i < pcpu_nr_slots; i++)
INIT_LIST_HEAD(&pcpu_chunk_list(type)[i]);

+ INIT_LIST_HEAD(&pcpu_depopulate_list[type]);
+ INIT_LIST_HEAD(&pcpu_sideline_list[type]);
+ }
+
/*
* The end of the static region needs to be aligned with the
* minimum allocation size as this offsets the reserved and
--
2.30.2

2021-04-16 13:56:43

by Pratik R. Sampat

Subject: Re: [PATCH v3 0/6] percpu: partial chunk depopulation

Hello Roman,

I've tried the v3 patch series on a POWER9 and an x86 KVM setup.

My results of the percpu_test are as follows:
Intel KVM 4CPU:4G
Vanilla 5.12-rc6
# ./percpu_test.sh
Percpu:             1952 kB
Percpu:           219648 kB
Percpu:           219648 kB

5.12-rc6 + with patchset applied
# ./percpu_test.sh
Percpu:             2080 kB
Percpu:           219712 kB
Percpu:            72672 kB

I'm able to see an improvement comparable to what you're seeing too.

However, on POWERPC I'm unable to reproduce these improvements with the
patchset in the same configuration.

POWER9 KVM 4CPU:4G
Vanilla 5.12-rc6
# ./percpu_test.sh
Percpu:             5888 kB
Percpu:           118272 kB
Percpu:           118272 kB

5.12-rc6 + with patchset applied
# ./percpu_test.sh
Percpu:             6144 kB
Percpu:           119040 kB
Percpu:           119040 kB

I'm wondering if there's any architecture-specific code that needs plumbing
here?

I will also look through the code to find the reason why POWER isn't
depopulating pages.

Thank you,
Pratik

On 08/04/21 9:27 am, Roman Gushchin wrote:
> In our production experience the percpu memory allocator is sometimes struggling
> with returning the memory to the system. A typical example is a creation of
> several thousands memory cgroups (each has several chunks of the percpu data
> used for vmstats, vmevents, ref counters etc). Deletion and complete releasing
> of these cgroups doesn't always lead to a shrinkage of the percpu memory,
> so that sometimes there are several GB's of memory wasted.
>
> The underlying problem is the fragmentation: to release an underlying chunk
> all percpu allocations should be released first. The percpu allocator tends
> to top up chunks to improve the utilization. It means new small-ish allocations
> (e.g. percpu ref counters) are placed onto almost filled old-ish chunks,
> effectively pinning them in memory.
>
> This patchset solves this problem by implementing a partial depopulation
> of percpu chunks: chunks with many empty pages are being asynchronously
> depopulated and the pages are returned to the system.
>
> To illustrate the problem the following script can be used:
>
> --
> #!/bin/bash
>
> cd /sys/fs/cgroup
>
> mkdir percpu_test
> echo "+memory" > percpu_test/cgroup.subtree_control
>
> cat /proc/meminfo | grep Percpu
>
> for i in `seq 1 1000`; do
> mkdir percpu_test/cg_"${i}"
> for j in `seq 1 10`; do
> mkdir percpu_test/cg_"${i}"_"${j}"
> done
> done
>
> cat /proc/meminfo | grep Percpu
>
> for i in `seq 1 1000`; do
> for j in `seq 1 10`; do
> rmdir percpu_test/cg_"${i}"_"${j}"
> done
> done
>
> sleep 10
>
> cat /proc/meminfo | grep Percpu
>
> for i in `seq 1 1000`; do
> rmdir percpu_test/cg_"${i}"
> done
>
> rmdir percpu_test
> --
>
> It creates 11000 memory cgroups and removes every 10 out of 11.
> It prints the initial size of the percpu memory, the size after
> creating all cgroups and the size after deleting most of them.
>
> Results:
> vanilla:
> ./percpu_test.sh
> Percpu: 7488 kB
> Percpu: 481152 kB
> Percpu: 481152 kB
>
> with this patchset applied:
> ./percpu_test.sh
> Percpu: 7488 kB
> Percpu: 481408 kB
> Percpu: 135552 kB
>
> So the total size of the percpu memory was reduced by more than 3.5 times.
>
> v3:
> - introduced pcpu_check_chunk_hint()
> - fixed a bug related to the hint check
> - minor cosmetic changes
> - s/pretends/fixes (cc Vlastimil)
>
> v2:
> - depopulated chunks are sidelined
> - depopulation happens in the reverse order
> - depopulate list made per-chunk type
> - better results due to better heuristics
>
> v1:
> - depopulation heuristics changed and optimized
> - chunks are put into a separate list, depopulation scan this list
> - chunk->isolated is introduced, chunk->depopulate is dropped
> - rearranged patches a bit
> - fixed a panic discovered by krobot
> - made pcpu_nr_empty_pop_pages per chunk type
> - minor fixes
>
> rfc:
> https://lwn.net/Articles/850508/
>
>
> Roman Gushchin (6):
> percpu: fix a comment about the chunks ordering
> percpu: split __pcpu_balance_workfn()
> percpu: make pcpu_nr_empty_pop_pages per chunk type
> percpu: generalize pcpu_balance_populated()
> percpu: factor out pcpu_check_chunk_hint()
> percpu: implement partial chunk depopulation
>
> mm/percpu-internal.h | 4 +-
> mm/percpu-stats.c | 9 +-
> mm/percpu.c | 306 +++++++++++++++++++++++++++++++++++--------
> 3 files changed, 261 insertions(+), 58 deletions(-)
>

2021-04-16 15:55:20

by Dennis Zhou

Subject: Re: [PATCH v3 0/6] percpu: partial chunk depopulation

Hello,

On Fri, Apr 16, 2021 at 06:26:15PM +0530, Pratik Sampat wrote:
> Hello Roman,
>
> I've tried the v3 patch series on a POWER9 and an x86 KVM setup.
>
> My results of the percpu_test are as follows:
> Intel KVM 4CPU:4G
> Vanilla 5.12-rc6
> # ./percpu_test.sh
> Percpu:             1952 kB
> Percpu:           219648 kB
> Percpu:           219648 kB
>
> 5.12-rc6 + with patchset applied
> # ./percpu_test.sh
> Percpu:             2080 kB
> Percpu:           219712 kB
> Percpu:            72672 kB
>
> I'm able to see improvement comparable to that of what you're see too.
>
> However, on POWERPC I'm unable to reproduce these improvements with the patchset in the same configuration
>
> POWER9 KVM 4CPU:4G
> Vanilla 5.12-rc6
> # ./percpu_test.sh
> Percpu:             5888 kB
> Percpu:           118272 kB
> Percpu:           118272 kB
>
> 5.12-rc6 + with patchset applied
> # ./percpu_test.sh
> Percpu:             6144 kB
> Percpu:           119040 kB
> Percpu:           119040 kB
>
> I'm wondering if there's any architectural specific code that needs plumbing
> here?
>

There shouldn't be. Can you send me the percpu_stats debug output before
and after?
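
Something along these lines works for capturing the two snapshots (a
sketch; it assumes CONFIG_PERCPU_STATS=y and debugfs mounted at
/sys/kernel/debug, and the output file names are arbitrary):

--
cat /sys/kernel/debug/percpu_stats > percpu_stats.before
./percpu_test.sh
cat /sys/kernel/debug/percpu_stats > percpu_stats.after
--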

> I will also look through the code to find the reason why POWER isn't
> depopulating pages.
>
> Thank you,
> Pratik
>
> On 08/04/21 9:27 am, Roman Gushchin wrote:
> > In our production experience the percpu memory allocator is sometimes struggling
> > with returning the memory to the system. A typical example is a creation of
> > several thousands memory cgroups (each has several chunks of the percpu data
> > used for vmstats, vmevents, ref counters etc). Deletion and complete releasing
> > of these cgroups doesn't always lead to a shrinkage of the percpu memory,
> > so that sometimes there are several GB's of memory wasted.
> >
> > The underlying problem is the fragmentation: to release an underlying chunk
> > all percpu allocations should be released first. The percpu allocator tends
> > to top up chunks to improve the utilization. It means new small-ish allocations
> > (e.g. percpu ref counters) are placed onto almost filled old-ish chunks,
> > effectively pinning them in memory.
> >
> > This patchset solves this problem by implementing a partial depopulation
> > of percpu chunks: chunks with many empty pages are being asynchronously
> > depopulated and the pages are returned to the system.
> >
> > To illustrate the problem the following script can be used:
> >
> > --
> > #!/bin/bash
> >
> > cd /sys/fs/cgroup
> >
> > mkdir percpu_test
> > echo "+memory" > percpu_test/cgroup.subtree_control
> >
> > cat /proc/meminfo | grep Percpu
> >
> > for i in `seq 1 1000`; do
> > mkdir percpu_test/cg_"${i}"
> > for j in `seq 1 10`; do
> > mkdir percpu_test/cg_"${i}"_"${j}"
> > done
> > done
> >
> > cat /proc/meminfo | grep Percpu
> >
> > for i in `seq 1 1000`; do
> > for j in `seq 1 10`; do
> > rmdir percpu_test/cg_"${i}"_"${j}"
> > done
> > done
> >
> > sleep 10
> >
> > cat /proc/meminfo | grep Percpu
> >
> > for i in `seq 1 1000`; do
> > rmdir percpu_test/cg_"${i}"
> > done
> >
> > rmdir percpu_test
> > --
> >
> > It creates 11000 memory cgroups and removes every 10 out of 11.
> > It prints the initial size of the percpu memory, the size after
> > creating all cgroups and the size after deleting most of them.
> >
> > Results:
> > vanilla:
> > ./percpu_test.sh
> > Percpu: 7488 kB
> > Percpu: 481152 kB
> > Percpu: 481152 kB
> >
> > with this patchset applied:
> > ./percpu_test.sh
> > Percpu: 7488 kB
> > Percpu: 481408 kB
> > Percpu: 135552 kB
> >
> > So the total size of the percpu memory was reduced by more than 3.5 times.
> >
> > v3:
> > - introduced pcpu_check_chunk_hint()
> > - fixed a bug related to the hint check
> > - minor cosmetic changes
> > - s/pretends/fixes (cc Vlastimil)
> >
> > v2:
> > - depopulated chunks are sidelined
> > - depopulation happens in the reverse order
> > - depopulate list made per-chunk type
> > - better results due to better heuristics
> >
> > v1:
> > - depopulation heuristics changed and optimized
> > - chunks are put into a separate list, depopulation scan this list
> > - chunk->isolated is introduced, chunk->depopulate is dropped
> > - rearranged patches a bit
> > - fixed a panic discovered by krobot
> > - made pcpu_nr_empty_pop_pages per chunk type
> > - minor fixes
> >
> > rfc:
> > https://lwn.net/Articles/850508/
> >
> >
> > Roman Gushchin (6):
> > percpu: fix a comment about the chunks ordering
> > percpu: split __pcpu_balance_workfn()
> > percpu: make pcpu_nr_empty_pop_pages per chunk type
> > percpu: generalize pcpu_balance_populated()
> > percpu: factor out pcpu_check_chunk_hint()
> > percpu: implement partial chunk depopulation
> >
> > mm/percpu-internal.h | 4 +-
> > mm/percpu-stats.c | 9 +-
> > mm/percpu.c | 306 +++++++++++++++++++++++++++++++++++--------
> > 3 files changed, 261 insertions(+), 58 deletions(-)
> >
>

Roman, sorry for the delay. I'm looking to apply this today to for-5.14.

Thanks,
Dennis

2021-04-16 17:59:02

by Pratik R. Sampat

Subject: Re: [PATCH v3 0/6] percpu: partial chunk depopulation

Hello Dennis,

I apologize for the clutter of logs before. I'm pasting the logs from before
and after the percpu test for both the patchset applied on 5.12-rc6 and the
vanilla 5.12-rc6 kernel.

On 16/04/21 7:48 pm, Dennis Zhou wrote:
> Hello,
>
> On Fri, Apr 16, 2021 at 06:26:15PM +0530, Pratik Sampat wrote:
>> Hello Roman,
>>
>> I've tried the v3 patch series on a POWER9 and an x86 KVM setup.
>>
>> My results of the percpu_test are as follows:
>> Intel KVM 4CPU:4G
>> Vanilla 5.12-rc6
>> # ./percpu_test.sh
>> Percpu:             1952 kB
>> Percpu:           219648 kB
>> Percpu:           219648 kB
>>
>> 5.12-rc6 + with patchset applied
>> # ./percpu_test.sh
>> Percpu:             2080 kB
>> Percpu:           219712 kB
>> Percpu:            72672 kB
>>
>> I'm able to see improvement comparable to that of what you're see too.
>>
>> However, on POWERPC I'm unable to reproduce these improvements with the patchset in the same configuration
>>
>> POWER9 KVM 4CPU:4G
>> Vanilla 5.12-rc6
>> # ./percpu_test.sh
>> Percpu:             5888 kB
>> Percpu:           118272 kB
>> Percpu:           118272 kB
>>
>> 5.12-rc6 + with patchset applied
>> # ./percpu_test.sh
>> Percpu:             6144 kB
>> Percpu:           119040 kB
>> Percpu:           119040 kB
>>
>> I'm wondering if there's any architectural specific code that needs plumbing
>> here?
>>
> There shouldn't be. Can you send me the percpu_stats debug output before
> and after?

I'll paste the whole debug stats before and after here.
5.12-rc6 + patchset
-----BEFORE-----
Percpu Memory Statistics
Allocation Info:
----------------------------------------
unit_size : 655360
static_size : 608920
reserved_size : 0
dyn_size : 46440
atom_size : 65536
alloc_size : 655360

Global Stats:
----------------------------------------
nr_alloc : 9040
nr_dealloc : 6994
nr_cur_alloc : 2046
nr_max_alloc : 2208
nr_chunks : 3
nr_max_chunks : 3
min_alloc_size : 4
max_alloc_size : 1072
empty_pop_pages : 12

Per Chunk Stats:
----------------------------------------
Chunk: <- First Chunk
nr_alloc : 859
max_alloc_size : 1072
empty_pop_pages : 0
first_bit : 16384
free_bytes : 0
contig_bytes : 0
sum_frag : 0
max_frag : 0
cur_min_alloc : 4
cur_med_alloc : 8
cur_max_alloc : 1072
memcg_aware : 0

Chunk:
nr_alloc : 827
max_alloc_size : 992
empty_pop_pages : 8
first_bit : 692
free_bytes : 645012
contig_bytes : 460096
sum_frag : 466420
max_frag : 460096
cur_min_alloc : 4
cur_med_alloc : 8
cur_max_alloc : 152
memcg_aware : 0

Chunk:
nr_alloc : 360
max_alloc_size : 1072
empty_pop_pages : 4
first_bit : 29207
free_bytes : 506640
contig_bytes : 506556
sum_frag : 84
max_frag : 32
cur_min_alloc : 4
cur_med_alloc : 156
cur_max_alloc : 1072
memcg_aware : 1

-----AFTER-----
Percpu Memory Statistics
Allocation Info:
----------------------------------------
unit_size : 655360
static_size : 608920
reserved_size : 0
dyn_size : 46440
atom_size : 65536
alloc_size : 655360

Global Stats:
----------------------------------------
nr_alloc : 97048
nr_dealloc : 95002
nr_cur_alloc : 2046
nr_max_alloc : 90054
nr_chunks : 48
nr_max_chunks : 48
min_alloc_size : 4
max_alloc_size : 1072
empty_pop_pages : 61

Per Chunk Stats:
----------------------------------------
Chunk: <- First Chunk
nr_alloc : 859
max_alloc_size : 1072
empty_pop_pages : 0
first_bit : 16384
free_bytes : 0
contig_bytes : 0
sum_frag : 0
max_frag : 0
cur_min_alloc : 4
cur_med_alloc : 8
cur_max_alloc : 1072
memcg_aware : 0

Chunk:
nr_alloc : 827
max_alloc_size : 1072
empty_pop_pages : 8
first_bit : 692
free_bytes : 645012
contig_bytes : 460096
sum_frag : 466420
max_frag : 460096
cur_min_alloc : 4
cur_med_alloc : 8
cur_max_alloc : 152
memcg_aware : 0

Chunk:
nr_alloc : 0
max_alloc_size : 0
empty_pop_pages : 1
first_bit : 0
free_bytes : 655360
contig_bytes : 655360
sum_frag : 0
max_frag : 0
cur_min_alloc : 0
cur_med_alloc : 0
cur_max_alloc : 0
memcg_aware : 0

Chunk:
nr_alloc : 360
max_alloc_size : 1072
empty_pop_pages : 7
first_bit : 29207
free_bytes : 506640
contig_bytes : 506556
sum_frag : 84
max_frag : 32
cur_min_alloc : 4
cur_med_alloc : 156
cur_max_alloc : 1072
memcg_aware : 1

Chunk:
nr_alloc : 0
max_alloc_size : 1072
empty_pop_pages : 0
first_bit : 0
free_bytes : 655360
contig_bytes : 655360
sum_frag : 0
max_frag : 0
cur_min_alloc : 0
cur_med_alloc : 0
cur_max_alloc : 0
memcg_aware : 1


I'm also pasting the before and after logs for the vanilla kernel too.
There is a considerably higher number of chunks in the vanilla kernel than
with the patches, though.

5.12-rc6 vanilla
-----BEFORE-----
Percpu Memory Statistics
Allocation Info:
----------------------------------------
unit_size : 655360
static_size : 608920
reserved_size : 0
dyn_size : 46440
atom_size : 65536
alloc_size : 655360

Global Stats:
----------------------------------------
nr_alloc : 9038
nr_dealloc : 6992
nr_cur_alloc : 2046
nr_max_alloc : 2178
nr_chunks : 3
nr_max_chunks : 3
min_alloc_size : 4
max_alloc_size : 1072
empty_pop_pages : 5

Per Chunk Stats:
----------------------------------------
Chunk: <- First Chunk
nr_alloc : 1088
max_alloc_size : 1072
empty_pop_pages : 0
first_bit : 16384
free_bytes : 0
contig_bytes : 0
sum_frag : 0
max_frag : 0
cur_min_alloc : 4
cur_med_alloc : 8
cur_max_alloc : 1072
memcg_aware : 0

Chunk:
nr_alloc : 598
max_alloc_size : 992
empty_pop_pages : 5
first_bit : 642
free_bytes : 645012
contig_bytes : 504292
sum_frag : 140720
max_frag : 116456
cur_min_alloc : 4
cur_med_alloc : 8
cur_max_alloc : 424
memcg_aware : 0

Chunk:
nr_alloc : 360
max_alloc_size : 1072
empty_pop_pages : 0
first_bit : 27909
free_bytes : 506640
contig_bytes : 506556
sum_frag : 84
max_frag : 36
cur_min_alloc : 4
cur_med_alloc : 156
cur_max_alloc : 1072
memcg_aware : 1

-----AFTER-----
Percpu Memory Statistics
Allocation Info:
----------------------------------------
unit_size : 655360
static_size : 608920
reserved_size : 0
dyn_size : 46440
atom_size : 65536
alloc_size : 655360

Global Stats:
----------------------------------------
nr_alloc : 97046
nr_dealloc : 94237
nr_cur_alloc : 2809
nr_max_alloc : 90054
nr_chunks : 11
nr_max_chunks : 47
min_alloc_size : 4
max_alloc_size : 1072
empty_pop_pages : 29

Per Chunk Stats:
----------------------------------------
Chunk: <- First Chunk
nr_alloc : 1088
max_alloc_size : 1072
empty_pop_pages : 0
first_bit : 16384
free_bytes : 0
contig_bytes : 0
sum_frag : 0
max_frag : 0
cur_min_alloc : 4
cur_med_alloc : 8
cur_max_alloc : 1072
memcg_aware : 0

Chunk:
nr_alloc : 865
max_alloc_size : 1072
empty_pop_pages : 6
first_bit : 789
free_bytes : 640296
contig_bytes : 290672
sum_frag : 349624
max_frag : 169956
cur_min_alloc : 4
cur_med_alloc : 8
cur_max_alloc : 1072
memcg_aware : 0

Chunk:
nr_alloc : 90
max_alloc_size : 1072
empty_pop_pages : 0
first_bit : 536
free_bytes : 595752
contig_bytes : 26164
sum_frag : 575132
max_frag : 26164
cur_min_alloc : 156
cur_med_alloc : 1072
cur_max_alloc : 1072
memcg_aware : 1

Chunk:
nr_alloc : 90
max_alloc_size : 1072
empty_pop_pages : 0
first_bit : 0
free_bytes : 597428
contig_bytes : 26164
sum_frag : 596848
max_frag : 26164
cur_min_alloc : 156
cur_med_alloc : 312
cur_max_alloc : 1072
memcg_aware : 1

Chunk:
nr_alloc : 92
max_alloc_size : 1072
empty_pop_pages : 0
first_bit : 0
free_bytes : 595284
contig_bytes : 26164
sum_frag : 590360
max_frag : 26164
cur_min_alloc : 156
cur_med_alloc : 312
cur_max_alloc : 1072
memcg_aware : 1

Chunk:
nr_alloc : 92
max_alloc_size : 1072
empty_pop_pages : 0
first_bit : 0
free_bytes : 595284
contig_bytes : 26164
sum_frag : 583768
max_frag : 26164
cur_min_alloc : 156
cur_med_alloc : 312
cur_max_alloc : 1072
memcg_aware : 1

Chunk:
nr_alloc : 90
max_alloc_size : 1072
empty_pop_pages : 0
first_bit : 0
free_bytes : 595752
contig_bytes : 26164
sum_frag : 577748
max_frag : 26164
cur_min_alloc : 156
cur_med_alloc : 1072
cur_max_alloc : 1072
memcg_aware : 1

Chunk:
nr_alloc : 30
max_alloc_size : 1072
empty_pop_pages : 6
first_bit : 0
free_bytes : 636608
contig_bytes : 397944
sum_frag : 636500
max_frag : 426720
cur_min_alloc : 156
cur_med_alloc : 312
cur_max_alloc : 1072
memcg_aware : 1

Chunk:
nr_alloc : 360
max_alloc_size : 1072
empty_pop_pages : 7
first_bit : 27909
free_bytes : 506640
contig_bytes : 506556
sum_frag : 84
max_frag : 36
cur_min_alloc : 4
cur_med_alloc : 156
cur_max_alloc : 1072
memcg_aware : 1

Chunk:
nr_alloc : 12
max_alloc_size : 1072
empty_pop_pages : 0
first_bit : 0
free_bytes : 647524
contig_bytes : 563492
sum_frag : 57872
max_frag : 26164
cur_min_alloc : 156
cur_med_alloc : 312
cur_max_alloc : 1072
memcg_aware : 1

Chunk:
nr_alloc : 0
max_alloc_size : 1072
empty_pop_pages : 10
first_bit : 0
free_bytes : 655360
contig_bytes : 655360
sum_frag : 0
max_frag : 0
cur_min_alloc : 0
cur_med_alloc : 0
cur_max_alloc : 0
memcg_aware : 1

>> I will also look through the code to find the reason why POWER isn't
>> depopulating pages.
>>
>> Thank you,
>> Pratik
>>
>> On 08/04/21 9:27 am, Roman Gushchin wrote:
>>> In our production experience the percpu memory allocator is sometimes struggling
>>> with returning the memory to the system. A typical example is a creation of
>>> several thousands memory cgroups (each has several chunks of the percpu data
>>> used for vmstats, vmevents, ref counters etc). Deletion and complete releasing
>>> of these cgroups doesn't always lead to a shrinkage of the percpu memory,
>>> so that sometimes there are several GB's of memory wasted.
>>>
>>> The underlying problem is the fragmentation: to release an underlying chunk
>>> all percpu allocations should be released first. The percpu allocator tends
>>> to top up chunks to improve the utilization. It means new small-ish allocations
>>> (e.g. percpu ref counters) are placed onto almost filled old-ish chunks,
>>> effectively pinning them in memory.
>>>
>>> This patchset solves this problem by implementing a partial depopulation
>>> of percpu chunks: chunks with many empty pages are being asynchronously
>>> depopulated and the pages are returned to the system.
>>>
>>> To illustrate the problem the following script can be used:
>>>
>>> --
>>> #!/bin/bash
>>>
>>> cd /sys/fs/cgroup
>>>
>>> mkdir percpu_test
>>> echo "+memory" > percpu_test/cgroup.subtree_control
>>>
>>> cat /proc/meminfo | grep Percpu
>>>
>>> for i in `seq 1 1000`; do
>>> mkdir percpu_test/cg_"${i}"
>>> for j in `seq 1 10`; do
>>> mkdir percpu_test/cg_"${i}"_"${j}"
>>> done
>>> done
>>>
>>> cat /proc/meminfo | grep Percpu
>>>
>>> for i in `seq 1 1000`; do
>>> for j in `seq 1 10`; do
>>> rmdir percpu_test/cg_"${i}"_"${j}"
>>> done
>>> done
>>>
>>> sleep 10
>>>
>>> cat /proc/meminfo | grep Percpu
>>>
>>> for i in `seq 1 1000`; do
>>> rmdir percpu_test/cg_"${i}"
>>> done
>>>
>>> rmdir percpu_test
>>> --
>>>
>>> It creates 11000 memory cgroups and removes every 10 out of 11.
>>> It prints the initial size of the percpu memory, the size after
>>> creating all cgroups and the size after deleting most of them.
>>>
>>> Results:
>>> vanilla:
>>> ./percpu_test.sh
>>> Percpu: 7488 kB
>>> Percpu: 481152 kB
>>> Percpu: 481152 kB
>>>
>>> with this patchset applied:
>>> ./percpu_test.sh
>>> Percpu: 7488 kB
>>> Percpu: 481408 kB
>>> Percpu: 135552 kB
>>>
>>> So the total size of the percpu memory was reduced by more than 3.5 times.
>>>
>>> v3:
>>> - introduced pcpu_check_chunk_hint()
>>> - fixed a bug related to the hint check
>>> - minor cosmetic changes
>>> - s/pretends/fixes (cc Vlastimil)
>>>
>>> v2:
>>> - depopulated chunks are sidelined
>>> - depopulation happens in the reverse order
>>> - depopulate list made per-chunk type
>>> - better results due to better heuristics
>>>
>>> v1:
>>> - depopulation heuristics changed and optimized
>>> - chunks are put into a separate list, depopulation scan this list
>>> - chunk->isolated is introduced, chunk->depopulate is dropped
>>> - rearranged patches a bit
>>> - fixed a panic discovered by krobot
>>> - made pcpu_nr_empty_pop_pages per chunk type
>>> - minor fixes
>>>
>>> rfc:
>>> https://lwn.net/Articles/850508/
>>>
>>>
>>> Roman Gushchin (6):
>>> percpu: fix a comment about the chunks ordering
>>> percpu: split __pcpu_balance_workfn()
>>> percpu: make pcpu_nr_empty_pop_pages per chunk type
>>> percpu: generalize pcpu_balance_populated()
>>> percpu: factor out pcpu_check_chunk_hint()
>>> percpu: implement partial chunk depopulation
>>>
>>> mm/percpu-internal.h | 4 +-
>>> mm/percpu-stats.c | 9 +-
>>> mm/percpu.c | 306 +++++++++++++++++++++++++++++++++++--------
>>> 3 files changed, 261 insertions(+), 58 deletions(-)
>>>
> Roman, sorry for the delay. I'm looking to apply this today to for-5.14.
>
> Thanks,
> Dennis
Thanks
Pratik

2021-04-16 18:48:18

by Roman Gushchin

Subject: Re: [PATCH v3 0/6] percpu: partial chunk depopulation

On Fri, Apr 16, 2021 at 02:18:10PM +0000, Dennis Zhou wrote:
> Hello,
>
> On Fri, Apr 16, 2021 at 06:26:15PM +0530, Pratik Sampat wrote:
> > Hello Roman,
> >
> > I've tried the v3 patch series on a POWER9 and an x86 KVM setup.
> >
> > My results of the percpu_test are as follows:
> > Intel KVM 4CPU:4G
> > Vanilla 5.12-rc6
> > # ./percpu_test.sh
> > Percpu:             1952 kB
> > Percpu:           219648 kB
> > Percpu:           219648 kB
> >
> > 5.12-rc6 + with patchset applied
> > # ./percpu_test.sh
> > Percpu:             2080 kB
> > Percpu:           219712 kB
> > Percpu:            72672 kB
> >
> > I'm able to see improvement comparable to that of what you're see too.
> >
> > However, on POWERPC I'm unable to reproduce these improvements with the patchset in the same configuration
> >
> > POWER9 KVM 4CPU:4G
> > Vanilla 5.12-rc6
> > # ./percpu_test.sh
> > Percpu:             5888 kB
> > Percpu:           118272 kB
> > Percpu:           118272 kB
> >
> > 5.12-rc6 + with patchset applied
> > # ./percpu_test.sh
> > Percpu:             6144 kB
> > Percpu:           119040 kB
> > Percpu:           119040 kB
> >
> > I'm wondering if there's any architectural specific code that needs plumbing
> > here?
> >
>
> There shouldn't be. Can you send me the percpu_stats debug output before
> and after?

Btw, sidelined chunks are not listed in the debug output. It was actually on my
to-do list, looks like I need to prioritize it a bit.

>
> > I will also look through the code to find the reason why POWER isn't
> > depopulating pages.
> >
> > Thank you,
> > Pratik
> >
> > On 08/04/21 9:27 am, Roman Gushchin wrote:
> > > In our production experience the percpu memory allocator is sometimes struggling
> > > with returning the memory to the system. A typical example is a creation of
> > > several thousands memory cgroups (each has several chunks of the percpu data
> > > used for vmstats, vmevents, ref counters etc). Deletion and complete releasing
> > > of these cgroups doesn't always lead to a shrinkage of the percpu memory,
> > > so that sometimes there are several GB's of memory wasted.
> > >
> > > The underlying problem is the fragmentation: to release an underlying chunk
> > > all percpu allocations should be released first. The percpu allocator tends
> > > to top up chunks to improve the utilization. It means new small-ish allocations
> > > (e.g. percpu ref counters) are placed onto almost filled old-ish chunks,
> > > effectively pinning them in memory.
> > >
> > > This patchset solves this problem by implementing a partial depopulation
> > > of percpu chunks: chunks with many empty pages are being asynchronously
> > > depopulated and the pages are returned to the system.
> > >
> > > To illustrate the problem the following script can be used:
> > >
> > > --
> > > #!/bin/bash
> > >
> > > cd /sys/fs/cgroup
> > >
> > > mkdir percpu_test
> > > echo "+memory" > percpu_test/cgroup.subtree_control
> > >
> > > cat /proc/meminfo | grep Percpu
> > >
> > > for i in `seq 1 1000`; do
> > > mkdir percpu_test/cg_"${i}"
> > > for j in `seq 1 10`; do
> > > mkdir percpu_test/cg_"${i}"_"${j}"
> > > done
> > > done
> > >
> > > cat /proc/meminfo | grep Percpu
> > >
> > > for i in `seq 1 1000`; do
> > > for j in `seq 1 10`; do
> > > rmdir percpu_test/cg_"${i}"_"${j}"
> > > done
> > > done
> > >
> > > sleep 10
> > >
> > > cat /proc/meminfo | grep Percpu
> > >
> > > for i in `seq 1 1000`; do
> > > rmdir percpu_test/cg_"${i}"
> > > done
> > >
> > > rmdir percpu_test
> > > --
> > >
> > > It creates 11000 memory cgroups and removes every 10 out of 11.
> > > It prints the initial size of the percpu memory, the size after
> > > creating all cgroups and the size after deleting most of them.
> > >
> > > Results:
> > > vanilla:
> > > ./percpu_test.sh
> > > Percpu: 7488 kB
> > > Percpu: 481152 kB
> > > Percpu: 481152 kB
> > >
> > > with this patchset applied:
> > > ./percpu_test.sh
> > > Percpu: 7488 kB
> > > Percpu: 481408 kB
> > > Percpu: 135552 kB
> > >
> > > So the total size of the percpu memory was reduced by more than 3.5 times.
> > >
> > > v3:
> > > - introduced pcpu_check_chunk_hint()
> > > - fixed a bug related to the hint check
> > > - minor cosmetic changes
> > > - s/pretends/fixes (cc Vlastimil)
> > >
> > > v2:
> > > - depopulated chunks are sidelined
> > > - depopulation happens in the reverse order
> > > - depopulate list made per-chunk type
> > > - better results due to better heuristics
> > >
> > > v1:
> > > - depopulation heuristics changed and optimized
> > > - chunks are put into a separate list, depopulation scan this list
> > > - chunk->isolated is introduced, chunk->depopulate is dropped
> > > - rearranged patches a bit
> > > - fixed a panic discovered by krobot
> > > - made pcpu_nr_empty_pop_pages per chunk type
> > > - minor fixes
> > >
> > > rfc:
> > > https://lwn.net/Articles/850508/
> > >
> > >
> > > Roman Gushchin (6):
> > > percpu: fix a comment about the chunks ordering
> > > percpu: split __pcpu_balance_workfn()
> > > percpu: make pcpu_nr_empty_pop_pages per chunk type
> > > percpu: generalize pcpu_balance_populated()
> > > percpu: factor out pcpu_check_chunk_hint()
> > > percpu: implement partial chunk depopulation
> > >
> > > mm/percpu-internal.h | 4 +-
> > > mm/percpu-stats.c | 9 +-
> > > mm/percpu.c | 306 +++++++++++++++++++++++++++++++++++--------
> > > 3 files changed, 261 insertions(+), 58 deletions(-)
> > >
> >
>
> Roman, sorry for the delay. I'm looking to apply this today to for-5.14.

Great, thanks!

2021-04-16 18:53:25

by Roman Gushchin

Subject: Re: [PATCH v3 0/6] percpu: partial chunk depopulation

On Fri, Apr 16, 2021 at 08:58:33PM +0530, Pratik Sampat wrote:
> Hello Dennis,
>
> I apologize for the clutter of logs before, I'm pasting the logs of before and
> after the percpu test in the case of the patchset being applied on 5.12-rc6 and
> the vanilla kernel 5.12-rc6.
>
> On 16/04/21 7:48 pm, Dennis Zhou wrote:
> > Hello,
> >
> > On Fri, Apr 16, 2021 at 06:26:15PM +0530, Pratik Sampat wrote:
> > > Hello Roman,
> > >
> > > I've tried the v3 patch series on a POWER9 and an x86 KVM setup.
> > >
> > > My results of the percpu_test are as follows:
> > > Intel KVM 4CPU:4G
> > > Vanilla 5.12-rc6
> > > # ./percpu_test.sh
> > > Percpu:             1952 kB
> > > Percpu:           219648 kB
> > > Percpu:           219648 kB
> > >
> > > 5.12-rc6 + with patchset applied
> > > # ./percpu_test.sh
> > > Percpu:             2080 kB
> > > Percpu:           219712 kB
> > > Percpu:            72672 kB
> > >
> > > I'm able to see improvement comparable to that of what you're see too.
> > >
> > > However, on POWERPC I'm unable to reproduce these improvements with the patchset in the same configuration
> > >
> > > POWER9 KVM 4CPU:4G
> > > Vanilla 5.12-rc6
> > > # ./percpu_test.sh
> > > Percpu:             5888 kB
> > > Percpu:           118272 kB
> > > Percpu:           118272 kB
> > >
> > > 5.12-rc6 + with patchset applied
> > > # ./percpu_test.sh
> > > Percpu:             6144 kB
> > > Percpu:           119040 kB
> > > Percpu:           119040 kB
> > >
> > > I'm wondering if there's any architectural specific code that needs plumbing
> > > here?
> > >
> > There shouldn't be. Can you send me the percpu_stats debug output before
> > and after?
>
> I'll paste the whole debug stats before and after here.
> 5.12-rc6 + patchset
> -----BEFORE-----
> Percpu Memory Statistics
> Allocation Info:


Hm, this looks highly suspicious. Here are your stats in a more compact form:

Vanilla

nr_alloc : 9038 nr_alloc : 97046
nr_dealloc : 6992 nr_dealloc : 94237
nr_cur_alloc : 2046 nr_cur_alloc : 2809
nr_max_alloc : 2178 nr_max_alloc : 90054
nr_chunks : 3 nr_chunks : 11
nr_max_chunks : 3 nr_max_chunks : 47
min_alloc_size : 4 min_alloc_size : 4
max_alloc_size : 1072 max_alloc_size : 1072
empty_pop_pages : 5 empty_pop_pages : 29


Patched

nr_alloc : 9040 nr_alloc : 97048
nr_dealloc : 6994 nr_dealloc : 95002
nr_cur_alloc : 2046 nr_cur_alloc : 2046
nr_max_alloc : 2208 nr_max_alloc : 90054
nr_chunks : 3 nr_chunks : 48
nr_max_chunks : 3 nr_max_chunks : 48
min_alloc_size : 4 min_alloc_size : 4
max_alloc_size : 1072 max_alloc_size : 1072
empty_pop_pages : 12 empty_pop_pages : 61


So it looks like the number of chunks got bigger, as well as the number of
empty_pop_pages? This contradicts what you wrote, so can you please make
sure that the data is correct and we're not mixing up two cases?

So it looks like for some reason sidelined (depopulated) chunks are not getting
freed completely. But I struggle to explain why the initial empty_pop_pages is
bigger with the same number of chunks.

So, can you please apply the following patch and provide updated statistics?

--

From d0d2bfdb891afec6bd63790b3492b852db490640 Mon Sep 17 00:00:00 2001
From: Roman Gushchin <[email protected]>
Date: Fri, 16 Apr 2021 09:54:38 -0700
Subject: [PATCH] percpu: include sidelined and depopulating chunks into debug
output

Information about sidelined chunks and chunks in the depopulate queue
could be extremely valuable for debugging different problems.

Dump information about these chunks alongside the regular chunks
in percpu slots via the percpu stats interface.

Signed-off-by: Roman Gushchin <[email protected]>
---
mm/percpu-internal.h | 2 ++
mm/percpu-stats.c | 10 ++++++++++
mm/percpu.c | 4 ++--
3 files changed, 14 insertions(+), 2 deletions(-)

diff --git a/mm/percpu-internal.h b/mm/percpu-internal.h
index 8e432663c41e..c11f115ced5c 100644
--- a/mm/percpu-internal.h
+++ b/mm/percpu-internal.h
@@ -90,6 +90,8 @@ extern spinlock_t pcpu_lock;
extern struct list_head *pcpu_chunk_lists;
extern int pcpu_nr_slots;
extern int pcpu_nr_empty_pop_pages[];
+extern struct list_head pcpu_depopulate_list[];
+extern struct list_head pcpu_sideline_list[];

extern struct pcpu_chunk *pcpu_first_chunk;
extern struct pcpu_chunk *pcpu_reserved_chunk;
diff --git a/mm/percpu-stats.c b/mm/percpu-stats.c
index f6026dbcdf6b..af09ed1ea5f8 100644
--- a/mm/percpu-stats.c
+++ b/mm/percpu-stats.c
@@ -228,6 +228,16 @@ static int percpu_stats_show(struct seq_file *m, void *v)
}
}
}
+
+ list_for_each_entry(chunk, &pcpu_sideline_list[type], list) {
+ seq_puts(m, "Chunk (sidelined):\n");
+ chunk_map_stats(m, chunk, buffer);
+ }
+
+ list_for_each_entry(chunk, &pcpu_depopulate_list[type], list) {
+ seq_puts(m, "Chunk (to depopulate):\n");
+ chunk_map_stats(m, chunk, buffer);
+ }
}

spin_unlock_irq(&pcpu_lock);
diff --git a/mm/percpu.c b/mm/percpu.c
index 5bb294e394b3..ded3a7541cb2 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -185,13 +185,13 @@ int pcpu_nr_empty_pop_pages[PCPU_NR_CHUNK_TYPES];
* List of chunks with a lot of free pages. Used to depopulate them
* asynchronously.
*/
-static struct list_head pcpu_depopulate_list[PCPU_NR_CHUNK_TYPES];
+struct list_head pcpu_depopulate_list[PCPU_NR_CHUNK_TYPES];

/*
* List of previously depopulated chunks. They are not usually used for new
* allocations, but can be returned back to service if a need arises.
*/
-static struct list_head pcpu_sideline_list[PCPU_NR_CHUNK_TYPES];
+struct list_head pcpu_sideline_list[PCPU_NR_CHUNK_TYPES];


/*
--
2.30.2

2021-04-16 19:15:09

by Roman Gushchin

Subject: Re: [PATCH v3 0/6] percpu: partial chunk depopulation

On Fri, Apr 16, 2021 at 11:57:03PM +0530, Pratik Sampat wrote:
>
>
> On 16/04/21 10:43 pm, Roman Gushchin wrote:
> > On Fri, Apr 16, 2021 at 08:58:33PM +0530, Pratik Sampat wrote:
> > > Hello Dennis,
> > >
> > > I apologize for the clutter of logs before, I'm pasting the logs of before and
> > > after the percpu test in the case of the patchset being applied on 5.12-rc6 and
> > > the vanilla kernel 5.12-rc6.
> > >
> > > On 16/04/21 7:48 pm, Dennis Zhou wrote:
> > > > Hello,
> > > >
> > > > On Fri, Apr 16, 2021 at 06:26:15PM +0530, Pratik Sampat wrote:
> > > > > Hello Roman,
> > > > >
> > > > > I've tried the v3 patch series on a POWER9 and an x86 KVM setup.
> > > > >
> > > > > My results of the percpu_test are as follows:
> > > > > Intel KVM 4CPU:4G
> > > > > Vanilla 5.12-rc6
> > > > > # ./percpu_test.sh
> > > > > Percpu:             1952 kB
> > > > > Percpu:           219648 kB
> > > > > Percpu:           219648 kB
> > > > >
> > > > > 5.12-rc6 + with patchset applied
> > > > > # ./percpu_test.sh
> > > > > Percpu:             2080 kB
> > > > > Percpu:           219712 kB
> > > > > Percpu:            72672 kB
> > > > >
> > > > > I'm able to see improvement comparable to that of what you're see too.
> > > > >
> > > > > However, on POWERPC I'm unable to reproduce these improvements with the patchset in the same configuration
> > > > >
> > > > > POWER9 KVM 4CPU:4G
> > > > > Vanilla 5.12-rc6
> > > > > # ./percpu_test.sh
> > > > > Percpu:             5888 kB
> > > > > Percpu:           118272 kB
> > > > > Percpu:           118272 kB
> > > > >
> > > > > 5.12-rc6 + with patchset applied
> > > > > # ./percpu_test.sh
> > > > > Percpu:             6144 kB
> > > > > Percpu:           119040 kB
> > > > > Percpu:           119040 kB
> > > > >
> > > > > I'm wondering if there's any architectural specific code that needs plumbing
> > > > > here?
> > > > >
> > > > There shouldn't be. Can you send me the percpu_stats debug output before
> > > > and after?
> > > I'll paste the whole debug stats before and after here.
> > > 5.12-rc6 + patchset
> > > -----BEFORE-----
> > > Percpu Memory Statistics
> > > Allocation Info:
> >
> > Hm, this looks highly suspicious. Here is your stats in a more compact form:
> >
> > Vanilla
> >
> > nr_alloc : 9038 nr_alloc : 97046
> > nr_dealloc : 6992 nr_dealloc : 94237
> > nr_cur_alloc : 2046 nr_cur_alloc : 2809
> > nr_max_alloc : 2178 nr_max_alloc : 90054
> > nr_chunks : 3 nr_chunks : 11
> > nr_max_chunks : 3 nr_max_chunks : 47
> > min_alloc_size : 4 min_alloc_size : 4
> > max_alloc_size : 1072 max_alloc_size : 1072
> > empty_pop_pages : 5 empty_pop_pages : 29
> >
> >
> > Patched
> >
> > nr_alloc : 9040 nr_alloc : 97048
> > nr_dealloc : 6994 nr_dealloc : 95002
> > nr_cur_alloc : 2046 nr_cur_alloc : 2046
> > nr_max_alloc : 2208 nr_max_alloc : 90054
> > nr_chunks : 3 nr_chunks : 48
> > nr_max_chunks : 3 nr_max_chunks : 48
> > min_alloc_size : 4 min_alloc_size : 4
> > max_alloc_size : 1072 max_alloc_size : 1072
> > empty_pop_pages : 12 empty_pop_pages : 61
> >
> >
> > So it looks like the number of chunks got bigger, as well as the number of
> > empty_pop_pages? This contradicts to what you wrote, so can you, please, make
> > sure that the data is correct and we're not messing two cases?
> >
> > So it looks like for some reason sidelined (depopulated) chunks are not getting
> > freed completely. But I struggle to explain why the initial empty_pop_pages is
> > bigger with the same amount of chunks.
> >
> > So, can you, please, apply the following patch and provide an updated statistics?
>
> Unfortunately, I'm not completely well versed in this area, but yes the empty
> pop pages number doesn't make sense to me either.
>
> I re-ran the numbers trying to make sure my experiment setup is sane but
> results remain the same.
>
> Vanilla
> nr_alloc : 9040 nr_alloc : 97048
> nr_dealloc : 6994 nr_dealloc : 94404
> nr_cur_alloc : 2046 nr_cur_alloc : 2644
> nr_max_alloc : 2169 nr_max_alloc : 90054
> nr_chunks : 3 nr_chunks : 10
> nr_max_chunks : 3 nr_max_chunks : 47
> min_alloc_size : 4 min_alloc_size : 4
> max_alloc_size : 1072 max_alloc_size : 1072
> empty_pop_pages : 4 empty_pop_pages : 32
>
> With the patchset + debug patch the results are as follows:
> Patched
>
> nr_alloc : 9040 nr_alloc : 97048
> nr_dealloc : 6994 nr_dealloc : 94349
> nr_cur_alloc : 2046 nr_cur_alloc : 2699
> nr_max_alloc : 2194 nr_max_alloc : 90054
> nr_chunks : 3 nr_chunks : 48
> nr_max_chunks : 3 nr_max_chunks : 48
> min_alloc_size : 4 min_alloc_size : 4
> max_alloc_size : 1072 max_alloc_size : 1072
> empty_pop_pages : 12 empty_pop_pages : 54
>
> With the extra tracing I can see 39 entries of "Chunk (sidelined)"
> after the test was run. I don't see any entries for "Chunk (to depopulate)"
>
> I've snipped the results of sidelined chunks because they went on for ~600
> lines, if you need the full logs let me know.

Yes, please! That's the most interesting part!
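
If it helps with trimming, something like this pulls out just the
sidelined entries from a saved dump (a sketch against the stats format
above; percpu_stats.after is an arbitrary file name):

--
# count the sidelined chunks
grep -c 'Chunk (sidelined)' percpu_stats.after

# print only the sidelined-chunk sections
awk '/^Chunk \(sidelined\):/ { p = 1; print; next }
     /^Chunk/               { p = 0 }
     p' percpu_stats.after
--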

2021-04-16 19:17:27

by Pratik R. Sampat

Subject: Re: [PATCH v3 0/6] percpu: partial chunk depopulation



On 17/04/21 12:04 am, Roman Gushchin wrote:
> On Fri, Apr 16, 2021 at 11:57:03PM +0530, Pratik Sampat wrote:
>>
>> On 16/04/21 10:43 pm, Roman Gushchin wrote:
>>> On Fri, Apr 16, 2021 at 08:58:33PM +0530, Pratik Sampat wrote:
>>>> Hello Dennis,
>>>>
>>>> I apologize for the clutter of logs before, I'm pasting the logs of before and
>>>> after the percpu test in the case of the patchset being applied on 5.12-rc6 and
>>>> the vanilla kernel 5.12-rc6.
>>>>
>>>> On 16/04/21 7:48 pm, Dennis Zhou wrote:
>>>>> Hello,
>>>>>
>>>>> On Fri, Apr 16, 2021 at 06:26:15PM +0530, Pratik Sampat wrote:
>>>>>> Hello Roman,
>>>>>>
>>>>>> I've tried the v3 patch series on a POWER9 and an x86 KVM setup.
>>>>>>
>>>>>> My results of the percpu_test are as follows:
>>>>>> Intel KVM 4CPU:4G
>>>>>> Vanilla 5.12-rc6
>>>>>> # ./percpu_test.sh
>>>>>> Percpu:             1952 kB
>>>>>> Percpu:           219648 kB
>>>>>> Percpu:           219648 kB
>>>>>>
>>>>>> 5.12-rc6 + with patchset applied
>>>>>> # ./percpu_test.sh
>>>>>> Percpu:             2080 kB
>>>>>> Percpu:           219712 kB
>>>>>> Percpu:            72672 kB
>>>>>>
>>>>>> I'm able to see improvement comparable to that of what you're see too.
>>>>>>
>>>>>> However, on POWERPC I'm unable to reproduce these improvements with the patchset in the same configuration
>>>>>>
>>>>>> POWER9 KVM 4CPU:4G
>>>>>> Vanilla 5.12-rc6
>>>>>> # ./percpu_test.sh
>>>>>> Percpu:             5888 kB
>>>>>> Percpu:           118272 kB
>>>>>> Percpu:           118272 kB
>>>>>>
>>>>>> 5.12-rc6 + with patchset applied
>>>>>> # ./percpu_test.sh
>>>>>> Percpu:             6144 kB
>>>>>> Percpu:           119040 kB
>>>>>> Percpu:           119040 kB
>>>>>>
>>>>>> I'm wondering if there's any architectural specific code that needs plumbing
>>>>>> here?
>>>>>>
>>>>> There shouldn't be. Can you send me the percpu_stats debug output before
>>>>> and after?
>>>> I'll paste the whole debug stats before and after here.
>>>> 5.12-rc6 + patchset
>>>> -----BEFORE-----
>>>> Percpu Memory Statistics
>>>> Allocation Info:
>>> Hm, this looks highly suspicious. Here is your stats in a more compact form:
>>>
>>> Vanilla
>>>
>>> nr_alloc : 9038 nr_alloc : 97046
>>> nr_dealloc : 6992 nr_dealloc : 94237
>>> nr_cur_alloc : 2046 nr_cur_alloc : 2809
>>> nr_max_alloc : 2178 nr_max_alloc : 90054
>>> nr_chunks : 3 nr_chunks : 11
>>> nr_max_chunks : 3 nr_max_chunks : 47
>>> min_alloc_size : 4 min_alloc_size : 4
>>> max_alloc_size : 1072 max_alloc_size : 1072
>>> empty_pop_pages : 5 empty_pop_pages : 29
>>>
>>>
>>> Patched
>>>
>>> nr_alloc : 9040 nr_alloc : 97048
>>> nr_dealloc : 6994 nr_dealloc : 95002
>>> nr_cur_alloc : 2046 nr_cur_alloc : 2046
>>> nr_max_alloc : 2208 nr_max_alloc : 90054
>>> nr_chunks : 3 nr_chunks : 48
>>> nr_max_chunks : 3 nr_max_chunks : 48
>>> min_alloc_size : 4 min_alloc_size : 4
>>> max_alloc_size : 1072 max_alloc_size : 1072
>>> empty_pop_pages : 12 empty_pop_pages : 61
>>>
>>>
>>> So it looks like the number of chunks got bigger, as well as the number of
>>> empty_pop_pages? This contradicts to what you wrote, so can you, please, make
>>> sure that the data is correct and we're not messing two cases?
>>>
>>> So it looks like for some reason sidelined (depopulated) chunks are not getting
>>> freed completely. But I struggle to explain why the initial empty_pop_pages is
>>> bigger with the same amount of chunks.
>>>
>>> So, can you, please, apply the following patch and provide an updated statistics?
>> Unfortunately, I'm not completely well versed in this area, but yes the empty
>> pop pages number doesn't make sense to me either.
>>
>> I re-ran the numbers trying to make sure my experiment setup is sane but
>> results remain the same.
>>
>> Vanilla
>> nr_alloc : 9040 nr_alloc : 97048
>> nr_dealloc : 6994 nr_dealloc : 94404
>> nr_cur_alloc : 2046 nr_cur_alloc : 2644
>> nr_max_alloc : 2169 nr_max_alloc : 90054
>> nr_chunks : 3 nr_chunks : 10
>> nr_max_chunks : 3 nr_max_chunks : 47
>> min_alloc_size : 4 min_alloc_size : 4
>> max_alloc_size : 1072 max_alloc_size : 1072
>> empty_pop_pages : 4 empty_pop_pages : 32
>>
>> With the patchset + debug patch the results are as follows:
>> Patched
>>
>> nr_alloc : 9040 nr_alloc : 97048
>> nr_dealloc : 6994 nr_dealloc : 94349
>> nr_cur_alloc : 2046 nr_cur_alloc : 2699
>> nr_max_alloc : 2194 nr_max_alloc : 90054
>> nr_chunks : 3 nr_chunks : 48
>> nr_max_chunks : 3 nr_max_chunks : 48
>> min_alloc_size : 4 min_alloc_size : 4
>> max_alloc_size : 1072 max_alloc_size : 1072
>> empty_pop_pages : 12 empty_pop_pages : 54
>>
>> With the extra tracing I can see 39 entries of "Chunk (sidelined)"
>> after the test was run. I don't see any entries for "Chunk (to depopulate)"
>>
>> I've snipped the results of sidelined chunks because they went on for ~600
>> lines, if you need the full logs let me know.
> Yes, please! That's the most interesting part!

Got it. Pasting the full logs from after the percpu experiment completed:

Percpu Memory Statistics
Allocation Info:
----------------------------------------
unit_size : 655360
static_size : 608920
reserved_size : 0
dyn_size : 46440
atom_size : 65536
alloc_size : 655360

Global Stats:
----------------------------------------
nr_alloc : 97048
nr_dealloc : 94349
nr_cur_alloc : 2699
nr_max_alloc : 90054
nr_chunks : 48
nr_max_chunks : 48
min_alloc_size : 4
max_alloc_size : 1072
empty_pop_pages : 54

Per Chunk Stats:
----------------------------------------
Chunk: <- First Chunk
nr_alloc : 1081
max_alloc_size : 1072
empty_pop_pages : 0
first_bit : 16117
free_bytes : 4
contig_bytes : 4
sum_frag : 4
max_frag : 4
cur_min_alloc : 4
cur_med_alloc : 8
cur_max_alloc : 1072
memcg_aware : 0

Chunk:
nr_alloc : 826
max_alloc_size : 1072
empty_pop_pages : 6
first_bit : 819
free_bytes : 640660
contig_bytes : 249896
sum_frag : 464700
max_frag : 306216
cur_min_alloc : 4
cur_med_alloc : 8
cur_max_alloc : 1072
memcg_aware : 0

Chunk:
nr_alloc : 0
max_alloc_size : 0
empty_pop_pages : 1
first_bit : 0
free_bytes : 655360
contig_bytes : 655360
sum_frag : 0
max_frag : 0
cur_min_alloc : 0
cur_med_alloc : 0
cur_max_alloc : 0
memcg_aware : 0

Chunk:
nr_alloc : 90
max_alloc_size : 1072
empty_pop_pages : 0
first_bit : 536
free_bytes : 595752
contig_bytes : 26164
sum_frag : 575132
max_frag : 26164
cur_min_alloc : 156
cur_med_alloc : 1072
cur_max_alloc : 1072
memcg_aware : 1

Chunk:
nr_alloc : 90
max_alloc_size : 1072
empty_pop_pages : 0
first_bit : 0
free_bytes : 597428
contig_bytes : 26164
sum_frag : 596848
max_frag : 26164
cur_min_alloc : 156
cur_med_alloc : 312
cur_max_alloc : 1072
memcg_aware : 1

Chunk:
nr_alloc : 92
max_alloc_size : 1072
empty_pop_pages : 0
first_bit : 0
free_bytes : 595284
contig_bytes : 26164
sum_frag : 590360
max_frag : 26164
cur_min_alloc : 156
cur_med_alloc : 312
cur_max_alloc : 1072
memcg_aware : 1

Chunk:
nr_alloc : 92
max_alloc_size : 1072
empty_pop_pages : 0
first_bit : 0
free_bytes : 595284
contig_bytes : 26164
sum_frag : 583768
max_frag : 26164
cur_min_alloc : 156
cur_med_alloc : 312
cur_max_alloc : 1072
memcg_aware : 1

Chunk:
nr_alloc : 360
max_alloc_size : 1072
empty_pop_pages : 7
first_bit : 26595
free_bytes : 506640
contig_bytes : 506540
sum_frag : 100
max_frag : 36
cur_min_alloc : 4
cur_med_alloc : 156
cur_max_alloc : 1072
memcg_aware : 1

Chunk:
nr_alloc : 12
max_alloc_size : 1072
empty_pop_pages : 3
first_bit : 0
free_bytes : 647524
contig_bytes : 563492
sum_frag : 57872
max_frag : 26164
cur_min_alloc : 156
cur_med_alloc : 312
cur_max_alloc : 1072
memcg_aware : 1

Chunk (sidelined):
nr_alloc : 52
max_alloc_size : 1072
empty_pop_pages : 0
first_bit : 0
free_bytes : 621404
contig_bytes : 203104
sum_frag : 603400
max_frag : 260656
cur_min_alloc : 156
cur_med_alloc : 312
cur_max_alloc : 1072
memcg_aware : 1

Chunk (sidelined):
nr_alloc : 4
max_alloc_size : 1072
empty_pop_pages : 0
first_bit : 0
free_bytes : 652748
contig_bytes : 570600
sum_frag : 570600
max_frag : 570600
cur_min_alloc : 156
cur_med_alloc : 312
cur_max_alloc : 1072
memcg_aware : 1

Chunk (sidelined):
nr_alloc : 0
max_alloc_size : 1072
empty_pop_pages : 1
first_bit : 0
free_bytes : 655360
contig_bytes : 655360
sum_frag : 0
max_frag : 0
cur_min_alloc : 0
cur_med_alloc : 0
cur_max_alloc : 0
memcg_aware : 1

Chunk (sidelined):
nr_alloc : 0
max_alloc_size : 1072
empty_pop_pages : 1
first_bit : 0
free_bytes : 655360
contig_bytes : 655360
sum_frag : 0
max_frag : 0
cur_min_alloc : 0
cur_med_alloc : 0
cur_max_alloc : 0
memcg_aware : 1

Chunk (sidelined):
nr_alloc : 0
max_alloc_size : 1072
empty_pop_pages : 1
first_bit : 0
free_bytes : 655360
contig_bytes : 655360
sum_frag : 0
max_frag : 0
cur_min_alloc : 0
cur_med_alloc : 0
cur_max_alloc : 0
memcg_aware : 1

Chunk (sidelined):
nr_alloc : 0
max_alloc_size : 1072
empty_pop_pages : 1
first_bit : 0
free_bytes : 655360
contig_bytes : 655360
sum_frag : 0
max_frag : 0
cur_min_alloc : 0
cur_med_alloc : 0
cur_max_alloc : 0
memcg_aware : 1

Chunk (sidelined):
nr_alloc : 0
max_alloc_size : 1072
empty_pop_pages : 1
first_bit : 0
free_bytes : 655360
contig_bytes : 655360
sum_frag : 0
max_frag : 0
cur_min_alloc : 0
cur_med_alloc : 0
cur_max_alloc : 0
memcg_aware : 1

Chunk (sidelined):
nr_alloc : 0
max_alloc_size : 1072
empty_pop_pages : 1
first_bit : 0
free_bytes : 655360
contig_bytes : 655360
sum_frag : 0
max_frag : 0
cur_min_alloc : 0
cur_med_alloc : 0
cur_max_alloc : 0
memcg_aware : 1

Chunk (sidelined):
nr_alloc : 0
max_alloc_size : 1072
empty_pop_pages : 1
first_bit : 0
free_bytes : 655360
contig_bytes : 655360
sum_frag : 0
max_frag : 0
cur_min_alloc : 0
cur_med_alloc : 0
cur_max_alloc : 0
memcg_aware : 1

Chunk (sidelined):
nr_alloc : 0
max_alloc_size : 1072
empty_pop_pages : 1
first_bit : 0
free_bytes : 655360
contig_bytes : 655360
sum_frag : 0
max_frag : 0
cur_min_alloc : 0
cur_med_alloc : 0
cur_max_alloc : 0
memcg_aware : 1

Chunk (sidelined):
nr_alloc : 0
max_alloc_size : 1072
empty_pop_pages : 1
first_bit : 0
free_bytes : 655360
contig_bytes : 655360
sum_frag : 0
max_frag : 0
cur_min_alloc : 0
cur_med_alloc : 0
cur_max_alloc : 0
memcg_aware : 1

Chunk (sidelined):
nr_alloc : 0
max_alloc_size : 1072
empty_pop_pages : 1
first_bit : 0
free_bytes : 655360
contig_bytes : 655360
sum_frag : 0
max_frag : 0
cur_min_alloc : 0
cur_med_alloc : 0
cur_max_alloc : 0
memcg_aware : 1

Chunk (sidelined):
nr_alloc : 0
max_alloc_size : 1072
empty_pop_pages : 1
first_bit : 0
free_bytes : 655360
contig_bytes : 655360
sum_frag : 0
max_frag : 0
cur_min_alloc : 0
cur_med_alloc : 0
cur_max_alloc : 0
memcg_aware : 1

Chunk (sidelined):
nr_alloc : 0
max_alloc_size : 1072
empty_pop_pages : 1
first_bit : 0
free_bytes : 655360
contig_bytes : 655360
sum_frag : 0
max_frag : 0
cur_min_alloc : 0
cur_med_alloc : 0
cur_max_alloc : 0
memcg_aware : 1

Chunk (sidelined):
nr_alloc : 0
max_alloc_size : 1072
empty_pop_pages : 1
first_bit : 0
free_bytes : 655360
contig_bytes : 655360
sum_frag : 0
max_frag : 0
cur_min_alloc : 0
cur_med_alloc : 0
cur_max_alloc : 0
memcg_aware : 1

Chunk (sidelined):
nr_alloc : 0
max_alloc_size : 1072
empty_pop_pages : 1
first_bit : 0
free_bytes : 655360
contig_bytes : 655360
sum_frag : 0
max_frag : 0
cur_min_alloc : 0
cur_med_alloc : 0
cur_max_alloc : 0
memcg_aware : 1

Chunk (sidelined):
nr_alloc : 0
max_alloc_size : 1072
empty_pop_pages : 1
first_bit : 0
free_bytes : 655360
contig_bytes : 655360
sum_frag : 0
max_frag : 0
cur_min_alloc : 0
cur_med_alloc : 0
cur_max_alloc : 0
memcg_aware : 1

Chunk (sidelined):
nr_alloc : 0
max_alloc_size : 1072
empty_pop_pages : 1
first_bit : 0
free_bytes : 655360
contig_bytes : 655360
sum_frag : 0
max_frag : 0
cur_min_alloc : 0
cur_med_alloc : 0
cur_max_alloc : 0
memcg_aware : 1

Chunk (sidelined):
nr_alloc : 0
max_alloc_size : 1072
empty_pop_pages : 1
first_bit : 0
free_bytes : 655360
contig_bytes : 655360
sum_frag : 0
max_frag : 0
cur_min_alloc : 0
cur_med_alloc : 0
cur_max_alloc : 0
memcg_aware : 1

Chunk (sidelined):
nr_alloc : 0
max_alloc_size : 1072
empty_pop_pages : 1
first_bit : 0
free_bytes : 655360
contig_bytes : 655360
sum_frag : 0
max_frag : 0
cur_min_alloc : 0
cur_med_alloc : 0
cur_max_alloc : 0
memcg_aware : 1

Chunk (sidelined):
nr_alloc : 0
max_alloc_size : 1072
empty_pop_pages : 1
first_bit : 0
free_bytes : 655360
contig_bytes : 655360
sum_frag : 0
max_frag : 0
cur_min_alloc : 0
cur_med_alloc : 0
cur_max_alloc : 0
memcg_aware : 1

Chunk (sidelined):
nr_alloc : 0
max_alloc_size : 1072
empty_pop_pages : 1
first_bit : 0
free_bytes : 655360
contig_bytes : 655360
sum_frag : 0
max_frag : 0
cur_min_alloc : 0
cur_med_alloc : 0
cur_max_alloc : 0
memcg_aware : 1

Chunk (sidelined):
nr_alloc : 0
max_alloc_size : 1072
empty_pop_pages : 1
first_bit : 0
free_bytes : 655360
contig_bytes : 655360
sum_frag : 0
max_frag : 0
cur_min_alloc : 0
cur_med_alloc : 0
cur_max_alloc : 0
memcg_aware : 1

Chunk (sidelined):
nr_alloc : 0
max_alloc_size : 1072
empty_pop_pages : 1
first_bit : 0
free_bytes : 655360
contig_bytes : 655360
sum_frag : 0
max_frag : 0
cur_min_alloc : 0
cur_med_alloc : 0
cur_max_alloc : 0
memcg_aware : 1

Chunk (sidelined):
nr_alloc : 0
max_alloc_size : 1072
empty_pop_pages : 1
first_bit : 0
free_bytes : 655360
contig_bytes : 655360
sum_frag : 0
max_frag : 0
cur_min_alloc : 0
cur_med_alloc : 0
cur_max_alloc : 0
memcg_aware : 1

Chunk (sidelined):
nr_alloc : 0
max_alloc_size : 1072
empty_pop_pages : 1
first_bit : 0
free_bytes : 655360
contig_bytes : 655360
sum_frag : 0
max_frag : 0
cur_min_alloc : 0
cur_med_alloc : 0
cur_max_alloc : 0
memcg_aware : 1

Chunk (sidelined):
nr_alloc : 0
max_alloc_size : 1072
empty_pop_pages : 1
first_bit : 0
free_bytes : 655360
contig_bytes : 655360
sum_frag : 0
max_frag : 0
cur_min_alloc : 0
cur_med_alloc : 0
cur_max_alloc : 0
memcg_aware : 1

Chunk (sidelined):
nr_alloc : 0
max_alloc_size : 1072
empty_pop_pages : 1
first_bit : 0
free_bytes : 655360
contig_bytes : 655360
sum_frag : 0
max_frag : 0
cur_min_alloc : 0
cur_med_alloc : 0
cur_max_alloc : 0
memcg_aware : 1

Chunk (sidelined):
nr_alloc : 0
max_alloc_size : 1072
empty_pop_pages : 1
first_bit : 0
free_bytes : 655360
contig_bytes : 655360
sum_frag : 0
max_frag : 0
cur_min_alloc : 0
cur_med_alloc : 0
cur_max_alloc : 0
memcg_aware : 1

Chunk (sidelined):
nr_alloc : 0
max_alloc_size : 1072
empty_pop_pages : 1
first_bit : 0
free_bytes : 655360
contig_bytes : 655360
sum_frag : 0
max_frag : 0
cur_min_alloc : 0
cur_med_alloc : 0
cur_max_alloc : 0
memcg_aware : 1

Chunk (sidelined):
nr_alloc : 0
max_alloc_size : 1072
empty_pop_pages : 1
first_bit : 0
free_bytes : 655360
contig_bytes : 655360
sum_frag : 0
max_frag : 0
cur_min_alloc : 0
cur_med_alloc : 0
cur_max_alloc : 0
memcg_aware : 1

Chunk (sidelined):
nr_alloc : 0
max_alloc_size : 1072
empty_pop_pages : 1
first_bit : 0
free_bytes : 655360
contig_bytes : 655360
sum_frag : 0
max_frag : 0
cur_min_alloc : 0
cur_med_alloc : 0
cur_max_alloc : 0
memcg_aware : 1

Chunk (sidelined):
nr_alloc : 0
max_alloc_size : 1072
empty_pop_pages : 1
first_bit : 0
free_bytes : 655360
contig_bytes : 655360
sum_frag : 0
max_frag : 0
cur_min_alloc : 0
cur_med_alloc : 0
cur_max_alloc : 0
memcg_aware : 1

Chunk (sidelined):
nr_alloc : 0
max_alloc_size : 1072
empty_pop_pages : 1
first_bit : 0
free_bytes : 655360
contig_bytes : 655360
sum_frag : 0
max_frag : 0
cur_min_alloc : 0
cur_med_alloc : 0
cur_max_alloc : 0
memcg_aware : 1

Chunk (sidelined):
nr_alloc : 0
max_alloc_size : 1072
empty_pop_pages : 1
first_bit : 0
free_bytes : 655360
contig_bytes : 655360
sum_frag : 0
max_frag : 0
cur_min_alloc : 0
cur_med_alloc : 0
cur_max_alloc : 0
memcg_aware : 1

Chunk (sidelined):
nr_alloc : 0
max_alloc_size : 1072
empty_pop_pages : 1
first_bit : 0
free_bytes : 655360
contig_bytes : 655360
sum_frag : 0
max_frag : 0
cur_min_alloc : 0
cur_med_alloc : 0
cur_max_alloc : 0
memcg_aware : 1

Chunk (sidelined):
nr_alloc : 0
max_alloc_size : 1072
empty_pop_pages : 1
first_bit : 0
free_bytes : 655360
contig_bytes : 655360
sum_frag : 0
max_frag : 0
cur_min_alloc : 0
cur_med_alloc : 0
cur_max_alloc : 0
memcg_aware : 1

Chunk (sidelined):
nr_alloc : 0
max_alloc_size : 1072
empty_pop_pages : 1
first_bit : 0
free_bytes : 655360
contig_bytes : 655360
sum_frag : 0
max_frag : 0
cur_min_alloc : 0
cur_med_alloc : 0
cur_max_alloc : 0
memcg_aware : 1

Chunk (sidelined):
nr_alloc : 0
max_alloc_size : 1072
empty_pop_pages : 1
first_bit : 0
free_bytes : 655360
contig_bytes : 655360
sum_frag : 0
max_frag : 0
cur_min_alloc : 0
cur_med_alloc : 0
cur_max_alloc : 0
memcg_aware : 1


2021-04-16 19:41:53

by Roman Gushchin

[permalink] [raw]
Subject: Re: [PATCH v3 0/6] percpu: partial chunk depopulation

On Sat, Apr 17, 2021 at 12:11:37AM +0530, Pratik Sampat wrote:
>
>
> On 17/04/21 12:04 am, Roman Gushchin wrote:
> > On Fri, Apr 16, 2021 at 11:57:03PM +0530, Pratik Sampat wrote:
> > >
> > > On 16/04/21 10:43 pm, Roman Gushchin wrote:
> > > > On Fri, Apr 16, 2021 at 08:58:33PM +0530, Pratik Sampat wrote:
> > > > > Hello Dennis,
> > > > >
> > > > > I apologize for the clutter of logs before, I'm pasting the logs of before and
> > > > > after the percpu test in the case of the patchset being applied on 5.12-rc6 and
> > > > > the vanilla kernel 5.12-rc6.
> > > > >
> > > > > On 16/04/21 7:48 pm, Dennis Zhou wrote:
> > > > > > Hello,
> > > > > >
> > > > > > On Fri, Apr 16, 2021 at 06:26:15PM +0530, Pratik Sampat wrote:
> > > > > > > Hello Roman,
> > > > > > >
> > > > > > > I've tried the v3 patch series on a POWER9 and an x86 KVM setup.
> > > > > > >
> > > > > > > My results of the percpu_test are as follows:
> > > > > > > Intel KVM 4CPU:4G
> > > > > > > Vanilla 5.12-rc6
> > > > > > > # ./percpu_test.sh
> > > > > > > Percpu:             1952 kB
> > > > > > > Percpu:           219648 kB
> > > > > > > Percpu:           219648 kB
> > > > > > >
> > > > > > > 5.12-rc6 + with patchset applied
> > > > > > > # ./percpu_test.sh
> > > > > > > Percpu:             2080 kB
> > > > > > > Percpu:           219712 kB
> > > > > > > Percpu:            72672 kB
> > > > > > >
> > > > > > > I'm able to see improvement comparable to that of what you're see too.
> > > > > > >
> > > > > > > However, on POWERPC I'm unable to reproduce these improvements with the patchset in the same configuration
> > > > > > >
> > > > > > > POWER9 KVM 4CPU:4G
> > > > > > > Vanilla 5.12-rc6
> > > > > > > # ./percpu_test.sh
> > > > > > > Percpu:             5888 kB
> > > > > > > Percpu:           118272 kB
> > > > > > > Percpu:           118272 kB
> > > > > > >
> > > > > > > 5.12-rc6 + with patchset applied
> > > > > > > # ./percpu_test.sh
> > > > > > > Percpu:             6144 kB
> > > > > > > Percpu:           119040 kB
> > > > > > > Percpu:           119040 kB
> > > > > > >
> > > > > > > I'm wondering if there's any architectural specific code that needs plumbing
> > > > > > > here?
> > > > > > >
> > > > > > There shouldn't be. Can you send me the percpu_stats debug output before
> > > > > > and after?
> > > > > I'll paste the whole debug stats before and after here.
> > > > > 5.12-rc6 + patchset
> > > > > -----BEFORE-----
> > > > > Percpu Memory Statistics
> > > > > Allocation Info:
> > > > Hm, this looks highly suspicious. Here is your stats in a more compact form:
> > > >
> > > > Vanilla
> > > >
> > > > nr_alloc : 9038 nr_alloc : 97046
> > > > nr_dealloc : 6992 nr_dealloc : 94237
> > > > nr_cur_alloc : 2046 nr_cur_alloc : 2809
> > > > nr_max_alloc : 2178 nr_max_alloc : 90054
> > > > nr_chunks : 3 nr_chunks : 11
> > > > nr_max_chunks : 3 nr_max_chunks : 47
> > > > min_alloc_size : 4 min_alloc_size : 4
> > > > max_alloc_size : 1072 max_alloc_size : 1072
> > > > empty_pop_pages : 5 empty_pop_pages : 29
> > > >
> > > >
> > > > Patched
> > > >
> > > > nr_alloc : 9040 nr_alloc : 97048
> > > > nr_dealloc : 6994 nr_dealloc : 95002
> > > > nr_cur_alloc : 2046 nr_cur_alloc : 2046
> > > > nr_max_alloc : 2208 nr_max_alloc : 90054
> > > > nr_chunks : 3 nr_chunks : 48
> > > > nr_max_chunks : 3 nr_max_chunks : 48
> > > > min_alloc_size : 4 min_alloc_size : 4
> > > > max_alloc_size : 1072 max_alloc_size : 1072
> > > > empty_pop_pages : 12 empty_pop_pages : 61
> > > >
> > > >
> > > > So it looks like the number of chunks got bigger, as well as the number of
> > > > empty_pop_pages? This contradicts to what you wrote, so can you, please, make
> > > > sure that the data is correct and we're not messing two cases?
> > > >
> > > > So it looks like for some reason sidelined (depopulated) chunks are not getting
> > > > freed completely. But I struggle to explain why the initial empty_pop_pages is
> > > > bigger with the same amount of chunks.
> > > >
> > > > So, can you, please, apply the following patch and provide an updated statistics?
> > > Unfortunately, I'm not completely well versed in this area, but yes the empty
> > > pop pages number doesn't make sense to me either.
> > >
> > > I re-ran the numbers trying to make sure my experiment setup is sane but
> > > results remain the same.
> > >
> > > Vanilla
> > > nr_alloc : 9040 nr_alloc : 97048
> > > nr_dealloc : 6994 nr_dealloc : 94404
> > > nr_cur_alloc : 2046 nr_cur_alloc : 2644
> > > nr_max_alloc : 2169 nr_max_alloc : 90054
> > > nr_chunks : 3 nr_chunks : 10
> > > nr_max_chunks : 3 nr_max_chunks : 47
> > > min_alloc_size : 4 min_alloc_size : 4
> > > max_alloc_size : 1072 max_alloc_size : 1072
> > > empty_pop_pages : 4 empty_pop_pages : 32
> > >
> > > With the patchset + debug patch the results are as follows:
> > > Patched
> > >
> > > nr_alloc : 9040 nr_alloc : 97048
> > > nr_dealloc : 6994 nr_dealloc : 94349
> > > nr_cur_alloc : 2046 nr_cur_alloc : 2699
> > > nr_max_alloc : 2194 nr_max_alloc : 90054
> > > nr_chunks : 3 nr_chunks : 48
> > > nr_max_chunks : 3 nr_max_chunks : 48
> > > min_alloc_size : 4 min_alloc_size : 4
> > > max_alloc_size : 1072 max_alloc_size : 1072
> > > empty_pop_pages : 12 empty_pop_pages : 54
> > >
> > > With the extra tracing I can see 39 entries of "Chunk (sidelined)"
> > > after the test was run. I don't see any entries for "Chunk (to depopulate)"
> > >
> > > I've snipped the results of slidelined chunks because they went on for ~600
> > > lines, if you need the full logs let me know.
> > Yes, please! That's the most interesting part!
>
> Got it. Pasting the full logs of after the percpu experiment was completed

Thanks!

Would you mind applying the following patch and testing again?

--

diff --git a/mm/percpu.c b/mm/percpu.c
index ded3a7541cb2..532c6a7ebdfd 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -2296,6 +2296,9 @@ void free_percpu(void __percpu *ptr)
need_balance = true;
break;
}
+
+ chunk->depopulated = false;
+ pcpu_chunk_relocate(chunk, -1);
} else if (chunk != pcpu_first_chunk && chunk != pcpu_reserved_chunk &&
!chunk->isolated &&
(pcpu_nr_empty_pop_pages[pcpu_chunk_type(chunk)] >

2021-04-16 20:11:18

by Roman Gushchin

[permalink] [raw]
Subject: Re: [PATCH v3 0/6] percpu: partial chunk depopulation

On Sat, Apr 17, 2021 at 01:14:03AM +0530, Pratik Sampat wrote:
>
>
> On 17/04/21 12:39 am, Roman Gushchin wrote:
> > On Sat, Apr 17, 2021 at 12:11:37AM +0530, Pratik Sampat wrote:
> > >
> > > On 17/04/21 12:04 am, Roman Gushchin wrote:
> > > > On Fri, Apr 16, 2021 at 11:57:03PM +0530, Pratik Sampat wrote:
> > > > > On 16/04/21 10:43 pm, Roman Gushchin wrote:
> > > > > > On Fri, Apr 16, 2021 at 08:58:33PM +0530, Pratik Sampat wrote:
> > > > > > > Hello Dennis,
> > > > > > >
> > > > > > > I apologize for the clutter of logs before, I'm pasting the logs of before and
> > > > > > > after the percpu test in the case of the patchset being applied on 5.12-rc6 and
> > > > > > > the vanilla kernel 5.12-rc6.
> > > > > > >
> > > > > > > On 16/04/21 7:48 pm, Dennis Zhou wrote:
> > > > > > > > Hello,
> > > > > > > >
> > > > > > > > On Fri, Apr 16, 2021 at 06:26:15PM +0530, Pratik Sampat wrote:
> > > > > > > > > Hello Roman,
> > > > > > > > >
> > > > > > > > > I've tried the v3 patch series on a POWER9 and an x86 KVM setup.
> > > > > > > > >
> > > > > > > > > My results of the percpu_test are as follows:
> > > > > > > > > Intel KVM 4CPU:4G
> > > > > > > > > Vanilla 5.12-rc6
> > > > > > > > > # ./percpu_test.sh
> > > > > > > > > Percpu:             1952 kB
> > > > > > > > > Percpu:           219648 kB
> > > > > > > > > Percpu:           219648 kB
> > > > > > > > >
> > > > > > > > > 5.12-rc6 + with patchset applied
> > > > > > > > > # ./percpu_test.sh
> > > > > > > > > Percpu:             2080 kB
> > > > > > > > > Percpu:           219712 kB
> > > > > > > > > Percpu:            72672 kB
> > > > > > > > >
> > > > > > > > > I'm able to see improvement comparable to that of what you're see too.
> > > > > > > > >
> > > > > > > > > However, on POWERPC I'm unable to reproduce these improvements with the patchset in the same configuration
> > > > > > > > >
> > > > > > > > > POWER9 KVM 4CPU:4G
> > > > > > > > > Vanilla 5.12-rc6
> > > > > > > > > # ./percpu_test.sh
> > > > > > > > > Percpu:             5888 kB
> > > > > > > > > Percpu:           118272 kB
> > > > > > > > > Percpu:           118272 kB
> > > > > > > > >
> > > > > > > > > 5.12-rc6 + with patchset applied
> > > > > > > > > # ./percpu_test.sh
> > > > > > > > > Percpu:             6144 kB
> > > > > > > > > Percpu:           119040 kB
> > > > > > > > > Percpu:           119040 kB
> > > > > > > > >
> > > > > > > > > I'm wondering if there's any architectural specific code that needs plumbing
> > > > > > > > > here?
> > > > > > > > >
> > > > > > > > There shouldn't be. Can you send me the percpu_stats debug output before
> > > > > > > > and after?
> > > > > > > I'll paste the whole debug stats before and after here.
> > > > > > > 5.12-rc6 + patchset
> > > > > > > -----BEFORE-----
> > > > > > > Percpu Memory Statistics
> > > > > > > Allocation Info:
> > > > > > Hm, this looks highly suspicious. Here is your stats in a more compact form:
> > > > > >
> > > > > > Vanilla
> > > > > >
> > > > > > nr_alloc : 9038 nr_alloc : 97046
> > > > > > nr_dealloc : 6992 nr_dealloc : 94237
> > > > > > nr_cur_alloc : 2046 nr_cur_alloc : 2809
> > > > > > nr_max_alloc : 2178 nr_max_alloc : 90054
> > > > > > nr_chunks : 3 nr_chunks : 11
> > > > > > nr_max_chunks : 3 nr_max_chunks : 47
> > > > > > min_alloc_size : 4 min_alloc_size : 4
> > > > > > max_alloc_size : 1072 max_alloc_size : 1072
> > > > > > empty_pop_pages : 5 empty_pop_pages : 29
> > > > > >
> > > > > >
> > > > > > Patched
> > > > > >
> > > > > > nr_alloc : 9040 nr_alloc : 97048
> > > > > > nr_dealloc : 6994 nr_dealloc : 95002
> > > > > > nr_cur_alloc : 2046 nr_cur_alloc : 2046
> > > > > > nr_max_alloc : 2208 nr_max_alloc : 90054
> > > > > > nr_chunks : 3 nr_chunks : 48
> > > > > > nr_max_chunks : 3 nr_max_chunks : 48
> > > > > > min_alloc_size : 4 min_alloc_size : 4
> > > > > > max_alloc_size : 1072 max_alloc_size : 1072
> > > > > > empty_pop_pages : 12 empty_pop_pages : 61
> > > > > >
> > > > > >
> > > > > > So it looks like the number of chunks got bigger, as well as the number of
> > > > > > empty_pop_pages? This contradicts to what you wrote, so can you, please, make
> > > > > > sure that the data is correct and we're not messing two cases?
> > > > > >
> > > > > > So it looks like for some reason sidelined (depopulated) chunks are not getting
> > > > > > freed completely. But I struggle to explain why the initial empty_pop_pages is
> > > > > > bigger with the same amount of chunks.
> > > > > >
> > > > > > So, can you, please, apply the following patch and provide an updated statistics?
> > > > > Unfortunately, I'm not completely well versed in this area, but yes the empty
> > > > > pop pages number doesn't make sense to me either.
> > > > >
> > > > > I re-ran the numbers trying to make sure my experiment setup is sane but
> > > > > results remain the same.
> > > > >
> > > > > Vanilla
> > > > > nr_alloc : 9040 nr_alloc : 97048
> > > > > nr_dealloc : 6994 nr_dealloc : 94404
> > > > > nr_cur_alloc : 2046 nr_cur_alloc : 2644
> > > > > nr_max_alloc : 2169 nr_max_alloc : 90054
> > > > > nr_chunks : 3 nr_chunks : 10
> > > > > nr_max_chunks : 3 nr_max_chunks : 47
> > > > > min_alloc_size : 4 min_alloc_size : 4
> > > > > max_alloc_size : 1072 max_alloc_size : 1072
> > > > > empty_pop_pages : 4 empty_pop_pages : 32
> > > > >
> > > > > With the patchset + debug patch the results are as follows:
> > > > > Patched
> > > > >
> > > > > nr_alloc : 9040 nr_alloc : 97048
> > > > > nr_dealloc : 6994 nr_dealloc : 94349
> > > > > nr_cur_alloc : 2046 nr_cur_alloc : 2699
> > > > > nr_max_alloc : 2194 nr_max_alloc : 90054
> > > > > nr_chunks : 3 nr_chunks : 48
> > > > > nr_max_chunks : 3 nr_max_chunks : 48
> > > > > min_alloc_size : 4 min_alloc_size : 4
> > > > > max_alloc_size : 1072 max_alloc_size : 1072
> > > > > empty_pop_pages : 12 empty_pop_pages : 54
> > > > >
> > > > > With the extra tracing I can see 39 entries of "Chunk (sidelined)"
> > > > > after the test was run. I don't see any entries for "Chunk (to depopulate)"
> > > > >
> > > > > I've snipped the results of slidelined chunks because they went on for ~600
> > > > > lines, if you need the full logs let me know.
> > > > Yes, please! That's the most interesting part!
> > > Got it. Pasting the full logs of after the percpu experiment was completed
> > Thanks!
> >
> > Would you mind to apply the following patch and test again?
> >
> > --
> >
> > diff --git a/mm/percpu.c b/mm/percpu.c
> > index ded3a7541cb2..532c6a7ebdfd 100644
> > --- a/mm/percpu.c
> > +++ b/mm/percpu.c
> > @@ -2296,6 +2296,9 @@ void free_percpu(void __percpu *ptr)
> > need_balance = true;
> > break;
> > }
> > +
> > + chunk->depopulated = false;
> > + pcpu_chunk_relocate(chunk, -1);
> > } else if (chunk != pcpu_first_chunk && chunk != pcpu_reserved_chunk &&
> > !chunk->isolated &&
> > (pcpu_nr_empty_pop_pages[pcpu_chunk_type(chunk)] >
> >
> Sure thing.
>
> I see far fewer sidelined chunks. In one such test run I saw zero occurrences
> of sidelined chunks.
>
So looking at the stats, it now works properly. Do you see any savings in
comparison to vanilla? The size of the savings can depend significantly on the exact
size of the cgroup-related objects, how many of them fit into a single chunk, etc.,
so you might want to play with the numbers in the test...
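
As a rough illustration of that dependence (a back-of-the-envelope sketch, not
anything taken from the patchset): using the unit_size from the "Allocation
Info" dumps above and a few allocation sizes that show up in the per-chunk
stats, one can estimate how many percpu allocations of a given size fit into a
single chunk, which is what ultimately bounds how much a full chunk release can
save. It ignores alignment, bitmap granularity and chunk metadata, so the
numbers are only upper bounds.

--

/* Back-of-the-envelope only; constants are taken from the dumps above. */
#include <stdio.h>

int main(void)
{
	const long unit_size = 655360;           /* from the stats dumps above */
	const long sizes[] = { 156, 312, 1072 }; /* allocation sizes seen above */
	size_t i;

	for (i = 0; i < sizeof(sizes) / sizeof(sizes[0]); i++)
		printf("size %4ld -> at most ~%ld allocations per chunk\n",
		       sizes[i], unit_size / sizes[i]);
	return 0;
}

--
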

Anyway, thank you very much for the report and your work on testing follow-up
patches! It helped to reveal a serious bug in the implementation (completely
empty sidelined chunks were not released in some cases), which by pure
coincidence wasn't triggered on x86.

Thanks!

2021-04-16 21:24:12

by Pratik R. Sampat

[permalink] [raw]
Subject: Re: [PATCH v3 0/6] percpu: partial chunk depopulation



On 16/04/21 10:43 pm, Roman Gushchin wrote:
> On Fri, Apr 16, 2021 at 08:58:33PM +0530, Pratik Sampat wrote:
>> Hello Dennis,
>>
>> I apologize for the clutter of logs before, I'm pasting the logs of before and
>> after the percpu test in the case of the patchset being applied on 5.12-rc6 and
>> the vanilla kernel 5.12-rc6.
>>
>> On 16/04/21 7:48 pm, Dennis Zhou wrote:
>>> Hello,
>>>
>>> On Fri, Apr 16, 2021 at 06:26:15PM +0530, Pratik Sampat wrote:
>>>> Hello Roman,
>>>>
>>>> I've tried the v3 patch series on a POWER9 and an x86 KVM setup.
>>>>
>>>> My results of the percpu_test are as follows:
>>>> Intel KVM 4CPU:4G
>>>> Vanilla 5.12-rc6
>>>> # ./percpu_test.sh
>>>> Percpu:             1952 kB
>>>> Percpu:           219648 kB
>>>> Percpu:           219648 kB
>>>>
>>>> 5.12-rc6 + with patchset applied
>>>> # ./percpu_test.sh
>>>> Percpu:             2080 kB
>>>> Percpu:           219712 kB
>>>> Percpu:            72672 kB
>>>>
>>>> I'm able to see improvement comparable to that of what you're see too.
>>>>
>>>> However, on POWERPC I'm unable to reproduce these improvements with the patchset in the same configuration
>>>>
>>>> POWER9 KVM 4CPU:4G
>>>> Vanilla 5.12-rc6
>>>> # ./percpu_test.sh
>>>> Percpu:             5888 kB
>>>> Percpu:           118272 kB
>>>> Percpu:           118272 kB
>>>>
>>>> 5.12-rc6 + with patchset applied
>>>> # ./percpu_test.sh
>>>> Percpu:             6144 kB
>>>> Percpu:           119040 kB
>>>> Percpu:           119040 kB
>>>>
>>>> I'm wondering if there's any architectural specific code that needs plumbing
>>>> here?
>>>>
>>> There shouldn't be. Can you send me the percpu_stats debug output before
>>> and after?
>> I'll paste the whole debug stats before and after here.
>> 5.12-rc6 + patchset
>> -----BEFORE-----
>> Percpu Memory Statistics
>> Allocation Info:
>
> Hm, this looks highly suspicious. Here is your stats in a more compact form:
>
> Vanilla
>
> nr_alloc : 9038 nr_alloc : 97046
> nr_dealloc : 6992 nr_dealloc : 94237
> nr_cur_alloc : 2046 nr_cur_alloc : 2809
> nr_max_alloc : 2178 nr_max_alloc : 90054
> nr_chunks : 3 nr_chunks : 11
> nr_max_chunks : 3 nr_max_chunks : 47
> min_alloc_size : 4 min_alloc_size : 4
> max_alloc_size : 1072 max_alloc_size : 1072
> empty_pop_pages : 5 empty_pop_pages : 29
>
>
> Patched
>
> nr_alloc : 9040 nr_alloc : 97048
> nr_dealloc : 6994 nr_dealloc : 95002
> nr_cur_alloc : 2046 nr_cur_alloc : 2046
> nr_max_alloc : 2208 nr_max_alloc : 90054
> nr_chunks : 3 nr_chunks : 48
> nr_max_chunks : 3 nr_max_chunks : 48
> min_alloc_size : 4 min_alloc_size : 4
> max_alloc_size : 1072 max_alloc_size : 1072
> empty_pop_pages : 12 empty_pop_pages : 61
>
>
> So it looks like the number of chunks got bigger, as well as the number of
> empty_pop_pages? This contradicts to what you wrote, so can you, please, make
> sure that the data is correct and we're not messing two cases?
>
> So it looks like for some reason sidelined (depopulated) chunks are not getting
> freed completely. But I struggle to explain why the initial empty_pop_pages is
> bigger with the same amount of chunks.
>
> So, can you, please, apply the following patch and provide an updated statistics?

Unfortunately, I'm not completely well versed in this area, but yes, the empty
pop pages number doesn't make sense to me either.

I re-ran the numbers, trying to make sure my experiment setup is sane, but the
results remain the same.

Vanilla
nr_alloc : 9040 nr_alloc : 97048
nr_dealloc : 6994 nr_dealloc : 94404
nr_cur_alloc : 2046 nr_cur_alloc : 2644
nr_max_alloc : 2169 nr_max_alloc : 90054
nr_chunks : 3 nr_chunks : 10
nr_max_chunks : 3 nr_max_chunks : 47
min_alloc_size : 4 min_alloc_size : 4
max_alloc_size : 1072 max_alloc_size : 1072
empty_pop_pages : 4 empty_pop_pages : 32

With the patchset + debug patch the results are as follows:
Patched

nr_alloc : 9040 nr_alloc : 97048
nr_dealloc : 6994 nr_dealloc : 94349
nr_cur_alloc : 2046 nr_cur_alloc : 2699
nr_max_alloc : 2194 nr_max_alloc : 90054
nr_chunks : 3 nr_chunks : 48
nr_max_chunks : 3 nr_max_chunks : 48
min_alloc_size : 4 min_alloc_size : 4
max_alloc_size : 1072 max_alloc_size : 1072
empty_pop_pages : 12 empty_pop_pages : 54

With the extra tracing I can see 39 entries of "Chunk (sidelined)"
after the test was run. I don't see any entries for "Chunk (to depopulate)".

I've snipped the results of sidelined chunks because they went on for ~600
lines; if you need the full logs, let me know.

Thank you,
Pratik

> --
>
> From d0d2bfdb891afec6bd63790b3492b852db490640 Mon Sep 17 00:00:00 2001
> From: Roman Gushchin <[email protected]>
> Date: Fri, 16 Apr 2021 09:54:38 -0700
> Subject: [PATCH] percpu: include sidelined and depopulating chunks into debug
> output
>
> Information about sidelined chunks and chunks in the depopulate queue
> could be extremely valuable for debugging different problems.
>
> Dump information about these chunks alongside the regular chunks
> in percpu slots via the percpu stats interface.
>
> Signed-off-by: Roman Gushchin <[email protected]>
> ---
> mm/percpu-internal.h | 2 ++
> mm/percpu-stats.c | 10 ++++++++++
> mm/percpu.c | 4 ++--
> 3 files changed, 14 insertions(+), 2 deletions(-)
>
> diff --git a/mm/percpu-internal.h b/mm/percpu-internal.h
> index 8e432663c41e..c11f115ced5c 100644
> --- a/mm/percpu-internal.h
> +++ b/mm/percpu-internal.h
> @@ -90,6 +90,8 @@ extern spinlock_t pcpu_lock;
> extern struct list_head *pcpu_chunk_lists;
> extern int pcpu_nr_slots;
> extern int pcpu_nr_empty_pop_pages[];
> +extern struct list_head pcpu_depopulate_list[];
> +extern struct list_head pcpu_sideline_list[];
>
> extern struct pcpu_chunk *pcpu_first_chunk;
> extern struct pcpu_chunk *pcpu_reserved_chunk;
> diff --git a/mm/percpu-stats.c b/mm/percpu-stats.c
> index f6026dbcdf6b..af09ed1ea5f8 100644
> --- a/mm/percpu-stats.c
> +++ b/mm/percpu-stats.c
> @@ -228,6 +228,16 @@ static int percpu_stats_show(struct seq_file *m, void *v)
> }
> }
> }
> +
> + list_for_each_entry(chunk, &pcpu_sideline_list[type], list) {
> + seq_puts(m, "Chunk (sidelined):\n");
> + chunk_map_stats(m, chunk, buffer);
> + }
> +
> + list_for_each_entry(chunk, &pcpu_depopulate_list[type], list) {
> + seq_puts(m, "Chunk (to depopulate):\n");
> + chunk_map_stats(m, chunk, buffer);
> + }
> }
>
> spin_unlock_irq(&pcpu_lock);
> diff --git a/mm/percpu.c b/mm/percpu.c
> index 5bb294e394b3..ded3a7541cb2 100644
> --- a/mm/percpu.c
> +++ b/mm/percpu.c
> @@ -185,13 +185,13 @@ int pcpu_nr_empty_pop_pages[PCPU_NR_CHUNK_TYPES];
> * List of chunks with a lot of free pages. Used to depopulate them
> * asynchronously.
> */
> -static struct list_head pcpu_depopulate_list[PCPU_NR_CHUNK_TYPES];
> +struct list_head pcpu_depopulate_list[PCPU_NR_CHUNK_TYPES];
>
> /*
> * List of previously depopulated chunks. They are not usually used for new
> * allocations, but can be returned back to service if a need arises.
> */
> -static struct list_head pcpu_sideline_list[PCPU_NR_CHUNK_TYPES];
> +struct list_head pcpu_sideline_list[PCPU_NR_CHUNK_TYPES];
>
>
> /*
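
As a side note on reading the output this debug patch produces: a quick way to
count the "Chunk (sidelined)" / "Chunk (to depopulate)" entries is a small
helper like the one below. This is only a sketch; the debugfs path is an
assumption (it presumes CONFIG_PERCPU_STATS is enabled, debugfs is mounted in
the usual place, and the program is run as root).

--

/* Count sidelined / to-depopulate chunk entries in the percpu stats dump. */
#include <stdio.h>
#include <string.h>

int main(void)
{
	const char *path = "/sys/kernel/debug/percpu_stats"; /* assumed path */
	char line[256];
	int sidelined = 0, to_depop = 0;
	FILE *f = fopen(path, "r");

	if (!f) {
		perror(path);
		return 1;
	}
	while (fgets(line, sizeof(line), f)) {
		if (strstr(line, "Chunk (sidelined)"))
			sidelined++;
		else if (strstr(line, "Chunk (to depopulate)"))
			to_depop++;
	}
	fclose(f);
	printf("sidelined=%d to_depopulate=%d\n", sidelined, to_depop);
	return 0;
}

--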

2021-04-16 21:57:52

by Pratik R. Sampat

[permalink] [raw]
Subject: Re: [PATCH v3 0/6] percpu: partial chunk depopulation



On 17/04/21 12:39 am, Roman Gushchin wrote:
> On Sat, Apr 17, 2021 at 12:11:37AM +0530, Pratik Sampat wrote:
>>
>> On 17/04/21 12:04 am, Roman Gushchin wrote:
>>> On Fri, Apr 16, 2021 at 11:57:03PM +0530, Pratik Sampat wrote:
>>>> On 16/04/21 10:43 pm, Roman Gushchin wrote:
>>>>> On Fri, Apr 16, 2021 at 08:58:33PM +0530, Pratik Sampat wrote:
>>>>>> Hello Dennis,
>>>>>>
>>>>>> I apologize for the clutter of logs before, I'm pasting the logs of before and
>>>>>> after the percpu test in the case of the patchset being applied on 5.12-rc6 and
>>>>>> the vanilla kernel 5.12-rc6.
>>>>>>
>>>>>> On 16/04/21 7:48 pm, Dennis Zhou wrote:
>>>>>>> Hello,
>>>>>>>
>>>>>>> On Fri, Apr 16, 2021 at 06:26:15PM +0530, Pratik Sampat wrote:
>>>>>>>> Hello Roman,
>>>>>>>>
>>>>>>>> I've tried the v3 patch series on a POWER9 and an x86 KVM setup.
>>>>>>>>
>>>>>>>> My results of the percpu_test are as follows:
>>>>>>>> Intel KVM 4CPU:4G
>>>>>>>> Vanilla 5.12-rc6
>>>>>>>> # ./percpu_test.sh
>>>>>>>> Percpu:             1952 kB
>>>>>>>> Percpu:           219648 kB
>>>>>>>> Percpu:           219648 kB
>>>>>>>>
>>>>>>>> 5.12-rc6 + with patchset applied
>>>>>>>> # ./percpu_test.sh
>>>>>>>> Percpu:             2080 kB
>>>>>>>> Percpu:           219712 kB
>>>>>>>> Percpu:            72672 kB
>>>>>>>>
>>>>>>>> I'm able to see improvement comparable to that of what you're see too.
>>>>>>>>
>>>>>>>> However, on POWERPC I'm unable to reproduce these improvements with the patchset in the same configuration
>>>>>>>>
>>>>>>>> POWER9 KVM 4CPU:4G
>>>>>>>> Vanilla 5.12-rc6
>>>>>>>> # ./percpu_test.sh
>>>>>>>> Percpu:             5888 kB
>>>>>>>> Percpu:           118272 kB
>>>>>>>> Percpu:           118272 kB
>>>>>>>>
>>>>>>>> 5.12-rc6 + with patchset applied
>>>>>>>> # ./percpu_test.sh
>>>>>>>> Percpu:             6144 kB
>>>>>>>> Percpu:           119040 kB
>>>>>>>> Percpu:           119040 kB
>>>>>>>>
>>>>>>>> I'm wondering if there's any architectural specific code that needs plumbing
>>>>>>>> here?
>>>>>>>>
>>>>>>> There shouldn't be. Can you send me the percpu_stats debug output before
>>>>>>> and after?
>>>>>> I'll paste the whole debug stats before and after here.
>>>>>> 5.12-rc6 + patchset
>>>>>> -----BEFORE-----
>>>>>> Percpu Memory Statistics
>>>>>> Allocation Info:
>>>>> Hm, this looks highly suspicious. Here is your stats in a more compact form:
>>>>>
>>>>> Vanilla
>>>>>
>>>>> nr_alloc : 9038 nr_alloc : 97046
>>>>> nr_dealloc : 6992 nr_dealloc : 94237
>>>>> nr_cur_alloc : 2046 nr_cur_alloc : 2809
>>>>> nr_max_alloc : 2178 nr_max_alloc : 90054
>>>>> nr_chunks : 3 nr_chunks : 11
>>>>> nr_max_chunks : 3 nr_max_chunks : 47
>>>>> min_alloc_size : 4 min_alloc_size : 4
>>>>> max_alloc_size : 1072 max_alloc_size : 1072
>>>>> empty_pop_pages : 5 empty_pop_pages : 29
>>>>>
>>>>>
>>>>> Patched
>>>>>
>>>>> nr_alloc : 9040 nr_alloc : 97048
>>>>> nr_dealloc : 6994 nr_dealloc : 95002
>>>>> nr_cur_alloc : 2046 nr_cur_alloc : 2046
>>>>> nr_max_alloc : 2208 nr_max_alloc : 90054
>>>>> nr_chunks : 3 nr_chunks : 48
>>>>> nr_max_chunks : 3 nr_max_chunks : 48
>>>>> min_alloc_size : 4 min_alloc_size : 4
>>>>> max_alloc_size : 1072 max_alloc_size : 1072
>>>>> empty_pop_pages : 12 empty_pop_pages : 61
>>>>>
>>>>>
>>>>> So it looks like the number of chunks got bigger, as well as the number of
>>>>> empty_pop_pages? This contradicts to what you wrote, so can you, please, make
>>>>> sure that the data is correct and we're not messing two cases?
>>>>>
>>>>> So it looks like for some reason sidelined (depopulated) chunks are not getting
>>>>> freed completely. But I struggle to explain why the initial empty_pop_pages is
>>>>> bigger with the same amount of chunks.
>>>>>
>>>>> So, can you, please, apply the following patch and provide an updated statistics?
>>>> Unfortunately, I'm not completely well versed in this area, but yes the empty
>>>> pop pages number doesn't make sense to me either.
>>>>
>>>> I re-ran the numbers trying to make sure my experiment setup is sane but
>>>> results remain the same.
>>>>
>>>> Vanilla
>>>> nr_alloc : 9040 nr_alloc : 97048
>>>> nr_dealloc : 6994 nr_dealloc : 94404
>>>> nr_cur_alloc : 2046 nr_cur_alloc : 2644
>>>> nr_max_alloc : 2169 nr_max_alloc : 90054
>>>> nr_chunks : 3 nr_chunks : 10
>>>> nr_max_chunks : 3 nr_max_chunks : 47
>>>> min_alloc_size : 4 min_alloc_size : 4
>>>> max_alloc_size : 1072 max_alloc_size : 1072
>>>> empty_pop_pages : 4 empty_pop_pages : 32
>>>>
>>>> With the patchset + debug patch the results are as follows:
>>>> Patched
>>>>
>>>> nr_alloc : 9040 nr_alloc : 97048
>>>> nr_dealloc : 6994 nr_dealloc : 94349
>>>> nr_cur_alloc : 2046 nr_cur_alloc : 2699
>>>> nr_max_alloc : 2194 nr_max_alloc : 90054
>>>> nr_chunks : 3 nr_chunks : 48
>>>> nr_max_chunks : 3 nr_max_chunks : 48
>>>> min_alloc_size : 4 min_alloc_size : 4
>>>> max_alloc_size : 1072 max_alloc_size : 1072
>>>> empty_pop_pages : 12 empty_pop_pages : 54
>>>>
>>>> With the extra tracing I can see 39 entries of "Chunk (sidelined)"
>>>> after the test was run. I don't see any entries for "Chunk (to depopulate)"
>>>>
>>>> I've snipped the results of slidelined chunks because they went on for ~600
>>>> lines, if you need the full logs let me know.
>>> Yes, please! That's the most interesting part!
>> Got it. Pasting the full logs of after the percpu experiment was completed
> Thanks!
>
> Would you mind to apply the following patch and test again?
>
> --
>
> diff --git a/mm/percpu.c b/mm/percpu.c
> index ded3a7541cb2..532c6a7ebdfd 100644
> --- a/mm/percpu.c
> +++ b/mm/percpu.c
> @@ -2296,6 +2296,9 @@ void free_percpu(void __percpu *ptr)
> need_balance = true;
> break;
> }
> +
> + chunk->depopulated = false;
> + pcpu_chunk_relocate(chunk, -1);
> } else if (chunk != pcpu_first_chunk && chunk != pcpu_reserved_chunk &&
> !chunk->isolated &&
> (pcpu_nr_empty_pop_pages[pcpu_chunk_type(chunk)] >
>
Sure thing.

I see far fewer sidelined chunks. In one such test run I saw zero occurrences
of sidelined chunks.

Pasting the full logs as an example:

BEFORE
Percpu Memory Statistics
Allocation Info:
----------------------------------------
unit_size : 655360
static_size : 608920
reserved_size : 0
dyn_size : 46440
atom_size : 65536
alloc_size : 655360

Global Stats:
----------------------------------------
nr_alloc : 9038
nr_dealloc : 6992
nr_cur_alloc : 2046
nr_max_alloc : 2200
nr_chunks : 3
nr_max_chunks : 3
min_alloc_size : 4
max_alloc_size : 1072
empty_pop_pages : 12

Per Chunk Stats:
----------------------------------------
Chunk: <- First Chunk
nr_alloc : 1092
max_alloc_size : 1072
empty_pop_pages : 0
first_bit : 16247
free_bytes : 4
contig_bytes : 4
sum_frag : 4
max_frag : 4
cur_min_alloc : 4
cur_med_alloc : 8
cur_max_alloc : 1072
memcg_aware : 0

Chunk:
nr_alloc : 594
max_alloc_size : 992
empty_pop_pages : 8
first_bit : 456
free_bytes : 645008
contig_bytes : 319984
sum_frag : 325024
max_frag : 318680
cur_min_alloc : 4
cur_med_alloc : 8
cur_max_alloc : 424
memcg_aware : 0

Chunk:
nr_alloc : 360
max_alloc_size : 1072
empty_pop_pages : 4
first_bit : 26595
free_bytes : 506640
contig_bytes : 506540
sum_frag : 100
max_frag : 32
cur_min_alloc : 4
cur_med_alloc : 156
cur_max_alloc : 1072
memcg_aware : 1


AFTER
Percpu Memory Statistics
Allocation Info:
----------------------------------------
unit_size : 655360
static_size : 608920
reserved_size : 0
dyn_size : 46440
atom_size : 65536
alloc_size : 655360

Global Stats:
----------------------------------------
nr_alloc : 97046
nr_dealloc : 94304
nr_cur_alloc : 2742
nr_max_alloc : 90054
nr_chunks : 11
nr_max_chunks : 47
min_alloc_size : 4
max_alloc_size : 1072
empty_pop_pages : 18

Per Chunk Stats:
----------------------------------------
Chunk: <- First Chunk
nr_alloc : 1092
max_alloc_size : 1072
empty_pop_pages : 0
first_bit : 16247
free_bytes : 4
contig_bytes : 4
sum_frag : 4
max_frag : 4
cur_min_alloc : 4
cur_med_alloc : 8
cur_max_alloc : 1072
memcg_aware : 0

Chunk:
nr_alloc : 838
max_alloc_size : 1072
empty_pop_pages : 7
first_bit : 464
free_bytes : 640476
contig_bytes : 290672
sum_frag : 349804
max_frag : 304344
cur_min_alloc : 4
cur_med_alloc : 8
cur_max_alloc : 1072
memcg_aware : 0

Chunk:
nr_alloc : 90
max_alloc_size : 1072
empty_pop_pages : 0
first_bit : 536
free_bytes : 595752
contig_bytes : 26164
sum_frag : 575132
max_frag : 26164
cur_min_alloc : 156
cur_med_alloc : 1072
cur_max_alloc : 1072
memcg_aware : 1

Chunk:
nr_alloc : 90
max_alloc_size : 1072
empty_pop_pages : 0
first_bit : 0
free_bytes : 597428
contig_bytes : 26164
sum_frag : 596848
max_frag : 26164
cur_min_alloc : 156
cur_med_alloc : 312
cur_max_alloc : 1072
memcg_aware : 1

Chunk:
nr_alloc : 92
max_alloc_size : 1072
empty_pop_pages : 0
first_bit : 0
free_bytes : 595284
contig_bytes : 26164
sum_frag : 590360
max_frag : 26164
cur_min_alloc : 156
cur_med_alloc : 312
cur_max_alloc : 1072
memcg_aware : 1

Chunk:
nr_alloc : 92
max_alloc_size : 1072
empty_pop_pages : 0
first_bit : 0
free_bytes : 595284
contig_bytes : 26164
sum_frag : 583768
max_frag : 26164
cur_min_alloc : 156
cur_med_alloc : 312
cur_max_alloc : 1072
memcg_aware : 1

Chunk:
nr_alloc : 360
max_alloc_size : 1072
empty_pop_pages : 7
first_bit : 26595
free_bytes : 506640
contig_bytes : 506540
sum_frag : 100
max_frag : 32
cur_min_alloc : 4
cur_med_alloc : 156
cur_max_alloc : 1072
memcg_aware : 1

Chunk:
nr_alloc : 12
max_alloc_size : 1072
empty_pop_pages : 3
first_bit : 0
free_bytes : 647524
contig_bytes : 563492
sum_frag : 57872
max_frag : 26164
cur_min_alloc : 156
cur_med_alloc : 312
cur_max_alloc : 1072
memcg_aware : 1

Chunk:
nr_alloc : 0
max_alloc_size : 1072
empty_pop_pages : 1
first_bit : 0
free_bytes : 655360
contig_bytes : 655360
sum_frag : 0
max_frag : 0
cur_min_alloc : 0
cur_med_alloc : 0
cur_max_alloc : 0
memcg_aware : 1

Chunk (sidelined):
nr_alloc : 72
max_alloc_size : 1072
empty_pop_pages : 0
first_bit : 0
free_bytes : 608344
contig_bytes : 145552
sum_frag : 590340
max_frag : 145552
cur_min_alloc : 156
cur_med_alloc : 312
cur_max_alloc : 1072
memcg_aware : 1

Chunk (sidelined):
nr_alloc : 4
max_alloc_size : 1072
empty_pop_pages : 0
first_bit : 0
free_bytes : 652748
contig_bytes : 426720
sum_frag : 426720
max_frag : 426720
cur_min_alloc : 156
cur_med_alloc : 312
cur_max_alloc : 1072
memcg_aware : 1





2021-04-16 22:11:25

by Dennis Zhou

[permalink] [raw]
Subject: Re: [PATCH v3 0/6] percpu: partial chunk depopulation

Hello,

On Sat, Apr 17, 2021 at 01:14:03AM +0530, Pratik Sampat wrote:
>
>
> On 17/04/21 12:39 am, Roman Gushchin wrote:
> > On Sat, Apr 17, 2021 at 12:11:37AM +0530, Pratik Sampat wrote:
> > >
> > > On 17/04/21 12:04 am, Roman Gushchin wrote:
> > > > On Fri, Apr 16, 2021 at 11:57:03PM +0530, Pratik Sampat wrote:
> > > > > On 16/04/21 10:43 pm, Roman Gushchin wrote:
> > > > > > On Fri, Apr 16, 2021 at 08:58:33PM +0530, Pratik Sampat wrote:
> > > > > > > Hello Dennis,
> > > > > > >
> > > > > > > I apologize for the clutter of logs before, I'm pasting the logs of before and
> > > > > > > after the percpu test in the case of the patchset being applied on 5.12-rc6 and
> > > > > > > the vanilla kernel 5.12-rc6.
> > > > > > >
> > > > > > > On 16/04/21 7:48 pm, Dennis Zhou wrote:
> > > > > > > > Hello,
> > > > > > > >
> > > > > > > > On Fri, Apr 16, 2021 at 06:26:15PM +0530, Pratik Sampat wrote:
> > > > > > > > > Hello Roman,
> > > > > > > > >
> > > > > > > > > I've tried the v3 patch series on a POWER9 and an x86 KVM setup.
> > > > > > > > >
> > > > > > > > > My results of the percpu_test are as follows:
> > > > > > > > > Intel KVM 4CPU:4G
> > > > > > > > > Vanilla 5.12-rc6
> > > > > > > > > # ./percpu_test.sh
> > > > > > > > > Percpu:             1952 kB
> > > > > > > > > Percpu:           219648 kB
> > > > > > > > > Percpu:           219648 kB
> > > > > > > > >
> > > > > > > > > 5.12-rc6 + with patchset applied
> > > > > > > > > # ./percpu_test.sh
> > > > > > > > > Percpu:             2080 kB
> > > > > > > > > Percpu:           219712 kB
> > > > > > > > > Percpu:            72672 kB
> > > > > > > > >
> > > > > > > > > I'm able to see improvement comparable to that of what you're see too.
> > > > > > > > >
> > > > > > > > > However, on POWERPC I'm unable to reproduce these improvements with the patchset in the same configuration
> > > > > > > > >
> > > > > > > > > POWER9 KVM 4CPU:4G
> > > > > > > > > Vanilla 5.12-rc6
> > > > > > > > > # ./percpu_test.sh
> > > > > > > > > Percpu:             5888 kB
> > > > > > > > > Percpu:           118272 kB
> > > > > > > > > Percpu:           118272 kB
> > > > > > > > >
> > > > > > > > > 5.12-rc6 + with patchset applied
> > > > > > > > > # ./percpu_test.sh
> > > > > > > > > Percpu:             6144 kB
> > > > > > > > > Percpu:           119040 kB
> > > > > > > > > Percpu:           119040 kB
> > > > > > > > >
> > > > > > > > > I'm wondering if there's any architectural specific code that needs plumbing
> > > > > > > > > here?
> > > > > > > > >
> > > > > > > > There shouldn't be. Can you send me the percpu_stats debug output before
> > > > > > > > and after?
> > > > > > > I'll paste the whole debug stats before and after here.
> > > > > > > 5.12-rc6 + patchset
> > > > > > > -----BEFORE-----
> > > > > > > Percpu Memory Statistics
> > > > > > > Allocation Info:
> > > > > > Hm, this looks highly suspicious. Here is your stats in a more compact form:
> > > > > >
> > > > > > Vanilla
> > > > > >
> > > > > > nr_alloc : 9038 nr_alloc : 97046
> > > > > > nr_dealloc : 6992 nr_dealloc : 94237
> > > > > > nr_cur_alloc : 2046 nr_cur_alloc : 2809
> > > > > > nr_max_alloc : 2178 nr_max_alloc : 90054
> > > > > > nr_chunks : 3 nr_chunks : 11
> > > > > > nr_max_chunks : 3 nr_max_chunks : 47
> > > > > > min_alloc_size : 4 min_alloc_size : 4
> > > > > > max_alloc_size : 1072 max_alloc_size : 1072
> > > > > > empty_pop_pages : 5 empty_pop_pages : 29
> > > > > >
> > > > > >
> > > > > > Patched
> > > > > >
> > > > > > nr_alloc : 9040 nr_alloc : 97048
> > > > > > nr_dealloc : 6994 nr_dealloc : 95002
> > > > > > nr_cur_alloc : 2046 nr_cur_alloc : 2046
> > > > > > nr_max_alloc : 2208 nr_max_alloc : 90054
> > > > > > nr_chunks : 3 nr_chunks : 48
> > > > > > nr_max_chunks : 3 nr_max_chunks : 48
> > > > > > min_alloc_size : 4 min_alloc_size : 4
> > > > > > max_alloc_size : 1072 max_alloc_size : 1072
> > > > > > empty_pop_pages : 12 empty_pop_pages : 61
> > > > > >
> > > > > >
> > > > > > So it looks like the number of chunks got bigger, as well as the number of
> > > > > > empty_pop_pages? This contradicts to what you wrote, so can you, please, make
> > > > > > sure that the data is correct and we're not messing two cases?
> > > > > >
> > > > > > So it looks like for some reason sidelined (depopulated) chunks are not getting
> > > > > > freed completely. But I struggle to explain why the initial empty_pop_pages is
> > > > > > bigger with the same amount of chunks.
> > > > > >
> > > > > > So, can you, please, apply the following patch and provide an updated statistics?
> > > > > Unfortunately, I'm not completely well versed in this area, but yes the empty
> > > > > pop pages number doesn't make sense to me either.
> > > > >
> > > > > I re-ran the numbers trying to make sure my experiment setup is sane but
> > > > > results remain the same.
> > > > >
> > > > > Vanilla
> > > > > nr_alloc : 9040 nr_alloc : 97048
> > > > > nr_dealloc : 6994 nr_dealloc : 94404
> > > > > nr_cur_alloc : 2046 nr_cur_alloc : 2644
> > > > > nr_max_alloc : 2169 nr_max_alloc : 90054
> > > > > nr_chunks : 3 nr_chunks : 10
> > > > > nr_max_chunks : 3 nr_max_chunks : 47
> > > > > min_alloc_size : 4 min_alloc_size : 4
> > > > > max_alloc_size : 1072 max_alloc_size : 1072
> > > > > empty_pop_pages : 4 empty_pop_pages : 32
> > > > >
> > > > > With the patchset + debug patch the results are as follows:
> > > > > Patched
> > > > >
> > > > > nr_alloc : 9040 nr_alloc : 97048
> > > > > nr_dealloc : 6994 nr_dealloc : 94349
> > > > > nr_cur_alloc : 2046 nr_cur_alloc : 2699
> > > > > nr_max_alloc : 2194 nr_max_alloc : 90054
> > > > > nr_chunks : 3 nr_chunks : 48
> > > > > nr_max_chunks : 3 nr_max_chunks : 48
> > > > > min_alloc_size : 4 min_alloc_size : 4
> > > > > max_alloc_size : 1072 max_alloc_size : 1072
> > > > > empty_pop_pages : 12 empty_pop_pages : 54
> > > > >
> > > > > With the extra tracing I can see 39 entries of "Chunk (sidelined)"
> > > > > after the test was run. I don't see any entries for "Chunk (to depopulate)"
> > > > >
> > > > > I've snipped the results of slidelined chunks because they went on for ~600
> > > > > lines, if you need the full logs let me know.
> > > > Yes, please! That's the most interesting part!
> > > Got it. Pasting the full logs of after the percpu experiment was completed
> > Thanks!
> >
> > Would you mind to apply the following patch and test again?
> >
> > --
> >
> > diff --git a/mm/percpu.c b/mm/percpu.c
> > index ded3a7541cb2..532c6a7ebdfd 100644
> > --- a/mm/percpu.c
> > +++ b/mm/percpu.c
> > @@ -2296,6 +2296,9 @@ void free_percpu(void __percpu *ptr)
> > need_balance = true;
> > break;
> > }
> > +
> > + chunk->depopulated = false;
> > + pcpu_chunk_relocate(chunk, -1);
> > } else if (chunk != pcpu_first_chunk && chunk != pcpu_reserved_chunk &&
> > !chunk->isolated &&
> > (pcpu_nr_empty_pop_pages[pcpu_chunk_type(chunk)] >
> >
> Sure thing.
>
> I see much lower sideline chunks. In one such test run I saw zero occurrences
> of slidelined chunks
>
> Pasting the full logs as an example:
>
> BEFORE
> Percpu Memory Statistics
> Allocation Info:
> ----------------------------------------
> unit_size : 655360
> static_size : 608920
> reserved_size : 0
> dyn_size : 46440
> atom_size : 65536
> alloc_size : 655360
>
> Global Stats:
> ----------------------------------------
> nr_alloc : 9038
> nr_dealloc : 6992
> nr_cur_alloc : 2046
> nr_max_alloc : 2200
> nr_chunks : 3
> nr_max_chunks : 3
> min_alloc_size : 4
> max_alloc_size : 1072
> empty_pop_pages : 12
>
> Per Chunk Stats:
> ----------------------------------------
> Chunk: <- First Chunk
> nr_alloc : 1092
> max_alloc_size : 1072
> empty_pop_pages : 0
> first_bit : 16247
> free_bytes : 4
> contig_bytes : 4
> sum_frag : 4
> max_frag : 4
> cur_min_alloc : 4
> cur_med_alloc : 8
> cur_max_alloc : 1072
> memcg_aware : 0
>
> Chunk:
> nr_alloc : 594
> max_alloc_size : 992
> empty_pop_pages : 8
> first_bit : 456
> free_bytes : 645008
> contig_bytes : 319984
> sum_frag : 325024
> max_frag : 318680
> cur_min_alloc : 4
> cur_med_alloc : 8
> cur_max_alloc : 424
> memcg_aware : 0
>
> Chunk:
> nr_alloc : 360
> max_alloc_size : 1072
> empty_pop_pages : 4
> first_bit : 26595
> free_bytes : 506640
> contig_bytes : 506540
> sum_frag : 100
> max_frag : 32
> cur_min_alloc : 4
> cur_med_alloc : 156
> cur_max_alloc : 1072
> memcg_aware : 1
>
>
> AFTER
> Percpu Memory Statistics
> Allocation Info:
> ----------------------------------------
> unit_size : 655360
> static_size : 608920
> reserved_size : 0
> dyn_size : 46440
> atom_size : 65536
> alloc_size : 655360
>
> Global Stats:
> ----------------------------------------
> nr_alloc : 97046
> nr_dealloc : 94304
> nr_cur_alloc : 2742
> nr_max_alloc : 90054
> nr_chunks : 11
> nr_max_chunks : 47
> min_alloc_size : 4
> max_alloc_size : 1072
> empty_pop_pages : 18
>
> Per Chunk Stats:
> ----------------------------------------
> Chunk: <- First Chunk
> nr_alloc : 1092
> max_alloc_size : 1072
> empty_pop_pages : 0
> first_bit : 16247
> free_bytes : 4
> contig_bytes : 4
> sum_frag : 4
> max_frag : 4
> cur_min_alloc : 4
> cur_med_alloc : 8
> cur_max_alloc : 1072
> memcg_aware : 0
>
> Chunk:
> nr_alloc : 838
> max_alloc_size : 1072
> empty_pop_pages : 7
> first_bit : 464
> free_bytes : 640476
> contig_bytes : 290672
> sum_frag : 349804
> max_frag : 304344
> cur_min_alloc : 4
> cur_med_alloc : 8
> cur_max_alloc : 1072
> memcg_aware : 0
>
> Chunk:
> nr_alloc : 90
> max_alloc_size : 1072
> empty_pop_pages : 0
> first_bit : 536
> free_bytes : 595752
> contig_bytes : 26164
> sum_frag : 575132
> max_frag : 26164
> cur_min_alloc : 156
> cur_med_alloc : 1072
> cur_max_alloc : 1072
> memcg_aware : 1
>
> Chunk:
> nr_alloc : 90
> max_alloc_size : 1072
> empty_pop_pages : 0
> first_bit : 0
> free_bytes : 597428
> contig_bytes : 26164
> sum_frag : 596848
> max_frag : 26164
> cur_min_alloc : 156
> cur_med_alloc : 312
> cur_max_alloc : 1072
> memcg_aware : 1
>
> Chunk:
> nr_alloc : 92
> max_alloc_size : 1072
> empty_pop_pages : 0
> first_bit : 0
> free_bytes : 595284
> contig_bytes : 26164
> sum_frag : 590360
> max_frag : 26164
> cur_min_alloc : 156
> cur_med_alloc : 312
> cur_max_alloc : 1072
> memcg_aware : 1
>
> Chunk:
> nr_alloc : 92
> max_alloc_size : 1072
> empty_pop_pages : 0
> first_bit : 0
> free_bytes : 595284
> contig_bytes : 26164
> sum_frag : 583768
> max_frag : 26164
> cur_min_alloc : 156
> cur_med_alloc : 312
> cur_max_alloc : 1072
> memcg_aware : 1
>
> Chunk:
> nr_alloc : 360
> max_alloc_size : 1072
> empty_pop_pages : 7
> first_bit : 26595
> free_bytes : 506640
> contig_bytes : 506540
> sum_frag : 100
> max_frag : 32
> cur_min_alloc : 4
> cur_med_alloc : 156
> cur_max_alloc : 1072
> memcg_aware : 1
>
> Chunk:
> nr_alloc : 12
> max_alloc_size : 1072
> empty_pop_pages : 3
> first_bit : 0
> free_bytes : 647524
> contig_bytes : 563492
> sum_frag : 57872
> max_frag : 26164
> cur_min_alloc : 156
> cur_med_alloc : 312
> cur_max_alloc : 1072
> memcg_aware : 1
>
> Chunk:
> nr_alloc : 0
> max_alloc_size : 1072
> empty_pop_pages : 1
> first_bit : 0
> free_bytes : 655360
> contig_bytes : 655360
> sum_frag : 0
> max_frag : 0
> cur_min_alloc : 0
> cur_med_alloc : 0
> cur_max_alloc : 0
> memcg_aware : 1
>
> Chunk (sidelined):
> nr_alloc : 72
> max_alloc_size : 1072
> empty_pop_pages : 0
> first_bit : 0
> free_bytes : 608344
> contig_bytes : 145552
> sum_frag : 590340
> max_frag : 145552
> cur_min_alloc : 156
> cur_med_alloc : 312
> cur_max_alloc : 1072
> memcg_aware : 1
>
> Chunk (sidelined):
> nr_alloc : 4
> max_alloc_size : 1072
> empty_pop_pages : 0
> first_bit : 0
> free_bytes : 652748
> contig_bytes : 426720
> sum_frag : 426720
> max_frag : 426720
> cur_min_alloc : 156
> cur_med_alloc : 312
> cur_max_alloc : 1072
> memcg_aware : 1
>
>

Thank you, Pratik, for testing this and working with us to resolve it. I
greatly appreciate it!

Thanks,
Dennis

2021-04-17 07:12:27

by Pratik R. Sampat

[permalink] [raw]
Subject: Re: [PATCH v3 0/6] percpu: partial chunk depopulation



On 17/04/21 1:33 am, Roman Gushchin wrote:
> On Sat, Apr 17, 2021 at 01:14:03AM +0530, Pratik Sampat wrote:
>>
>> On 17/04/21 12:39 am, Roman Gushchin wrote:
>>> On Sat, Apr 17, 2021 at 12:11:37AM +0530, Pratik Sampat wrote:
>>>> On 17/04/21 12:04 am, Roman Gushchin wrote:
>>>>> On Fri, Apr 16, 2021 at 11:57:03PM +0530, Pratik Sampat wrote:
>>>>>> On 16/04/21 10:43 pm, Roman Gushchin wrote:
>>>>>>> On Fri, Apr 16, 2021 at 08:58:33PM +0530, Pratik Sampat wrote:
>>>>>>>> Hello Dennis,
>>>>>>>>
>>>>>>>> I apologize for the clutter of logs before, I'm pasting the logs of before and
>>>>>>>> after the percpu test in the case of the patchset being applied on 5.12-rc6 and
>>>>>>>> the vanilla kernel 5.12-rc6.
>>>>>>>>
>>>>>>>> On 16/04/21 7:48 pm, Dennis Zhou wrote:
>>>>>>>>> Hello,
>>>>>>>>>
>>>>>>>>> On Fri, Apr 16, 2021 at 06:26:15PM +0530, Pratik Sampat wrote:
>>>>>>>>>> Hello Roman,
>>>>>>>>>>
>>>>>>>>>> I've tried the v3 patch series on a POWER9 and an x86 KVM setup.
>>>>>>>>>>
>>>>>>>>>> My results of the percpu_test are as follows:
>>>>>>>>>> Intel KVM 4CPU:4G
>>>>>>>>>> Vanilla 5.12-rc6
>>>>>>>>>> # ./percpu_test.sh
>>>>>>>>>> Percpu:             1952 kB
>>>>>>>>>> Percpu:           219648 kB
>>>>>>>>>> Percpu:           219648 kB
>>>>>>>>>>
>>>>>>>>>> 5.12-rc6 + with patchset applied
>>>>>>>>>> # ./percpu_test.sh
>>>>>>>>>> Percpu:             2080 kB
>>>>>>>>>> Percpu:           219712 kB
>>>>>>>>>> Percpu:            72672 kB
>>>>>>>>>>
>>>>>>>>>> I'm able to see improvement comparable to that of what you're see too.
>>>>>>>>>>
>>>>>>>>>> However, on POWERPC I'm unable to reproduce these improvements with the patchset in the same configuration
>>>>>>>>>>
>>>>>>>>>> POWER9 KVM 4CPU:4G
>>>>>>>>>> Vanilla 5.12-rc6
>>>>>>>>>> # ./percpu_test.sh
>>>>>>>>>> Percpu:             5888 kB
>>>>>>>>>> Percpu:           118272 kB
>>>>>>>>>> Percpu:           118272 kB
>>>>>>>>>>
>>>>>>>>>> 5.12-rc6 + with patchset applied
>>>>>>>>>> # ./percpu_test.sh
>>>>>>>>>> Percpu:             6144 kB
>>>>>>>>>> Percpu:           119040 kB
>>>>>>>>>> Percpu:           119040 kB
>>>>>>>>>>
>>>>>>>>>> I'm wondering if there's any architectural specific code that needs plumbing
>>>>>>>>>> here?
>>>>>>>>>>
>>>>>>>>> There shouldn't be. Can you send me the percpu_stats debug output before
>>>>>>>>> and after?
>>>>>>>> I'll paste the whole debug stats before and after here.
>>>>>>>> 5.12-rc6 + patchset
>>>>>>>> -----BEFORE-----
>>>>>>>> Percpu Memory Statistics
>>>>>>>> Allocation Info:
>>>>>>> Hm, this looks highly suspicious. Here are your stats in a more compact form:
>>>>>>>
>>>>>>> Vanilla
>>>>>>>
>>>>>>> nr_alloc : 9038 nr_alloc : 97046
>>>>>>> nr_dealloc : 6992 nr_dealloc : 94237
>>>>>>> nr_cur_alloc : 2046 nr_cur_alloc : 2809
>>>>>>> nr_max_alloc : 2178 nr_max_alloc : 90054
>>>>>>> nr_chunks : 3 nr_chunks : 11
>>>>>>> nr_max_chunks : 3 nr_max_chunks : 47
>>>>>>> min_alloc_size : 4 min_alloc_size : 4
>>>>>>> max_alloc_size : 1072 max_alloc_size : 1072
>>>>>>> empty_pop_pages : 5 empty_pop_pages : 29
>>>>>>>
>>>>>>>
>>>>>>> Patched
>>>>>>>
>>>>>>> nr_alloc : 9040 nr_alloc : 97048
>>>>>>> nr_dealloc : 6994 nr_dealloc : 95002
>>>>>>> nr_cur_alloc : 2046 nr_cur_alloc : 2046
>>>>>>> nr_max_alloc : 2208 nr_max_alloc : 90054
>>>>>>> nr_chunks : 3 nr_chunks : 48
>>>>>>> nr_max_chunks : 3 nr_max_chunks : 48
>>>>>>> min_alloc_size : 4 min_alloc_size : 4
>>>>>>> max_alloc_size : 1072 max_alloc_size : 1072
>>>>>>> empty_pop_pages : 12 empty_pop_pages : 61
>>>>>>>
>>>>>>>
>>>>>>> So it looks like the number of chunks got bigger, as well as the number of
>>>>>>> empty_pop_pages? This contradicts what you wrote, so can you please make
>>>>>>> sure that the data is correct and we're not mixing up two cases?
>>>>>>>
>>>>>>> So it looks like for some reason sidelined (depopulated) chunks are not getting
>>>>>>> freed completely. But I struggle to explain why the initial empty_pop_pages is
>>>>>>> bigger with the same number of chunks.
>>>>>>>
>>>>>>> So can you please apply the following patch and provide updated statistics?
>>>>>> Unfortunately, I'm not completely well versed in this area, but yes, the empty
>>>>>> pop pages number doesn't make sense to me either.
>>>>>>
>>>>>> I re-ran the numbers trying to make sure my experiment setup is sane but
>>>>>> results remain the same.
>>>>>>
>>>>>> Vanilla
>>>>>> nr_alloc : 9040 nr_alloc : 97048
>>>>>> nr_dealloc : 6994 nr_dealloc : 94404
>>>>>> nr_cur_alloc : 2046 nr_cur_alloc : 2644
>>>>>> nr_max_alloc : 2169 nr_max_alloc : 90054
>>>>>> nr_chunks : 3 nr_chunks : 10
>>>>>> nr_max_chunks : 3 nr_max_chunks : 47
>>>>>> min_alloc_size : 4 min_alloc_size : 4
>>>>>> max_alloc_size : 1072 max_alloc_size : 1072
>>>>>> empty_pop_pages : 4 empty_pop_pages : 32
>>>>>>
>>>>>> With the patchset + debug patch the results are as follows:
>>>>>> Patched
>>>>>>
>>>>>> nr_alloc : 9040 nr_alloc : 97048
>>>>>> nr_dealloc : 6994 nr_dealloc : 94349
>>>>>> nr_cur_alloc : 2046 nr_cur_alloc : 2699
>>>>>> nr_max_alloc : 2194 nr_max_alloc : 90054
>>>>>> nr_chunks : 3 nr_chunks : 48
>>>>>> nr_max_chunks : 3 nr_max_chunks : 48
>>>>>> min_alloc_size : 4 min_alloc_size : 4
>>>>>> max_alloc_size : 1072 max_alloc_size : 1072
>>>>>> empty_pop_pages : 12 empty_pop_pages : 54
>>>>>>
>>>>>> With the extra tracing I can see 39 entries of "Chunk (sidelined)"
>>>>>> after the test was run. I don't see any entries for "Chunk (to depopulate)"
>>>>>>
>>>>>> I've snipped the results of sidelined chunks because they went on for ~600
>>>>>> lines; if you need the full logs, let me know.
>>>>> Yes, please! That's the most interesting part!
>>>> Got it. Pasting the full logs from after the percpu experiment was completed.
>>> Thanks!
>>>
>>> Would you mind applying the following patch and testing again?
>>>
>>> --
>>>
>>> diff --git a/mm/percpu.c b/mm/percpu.c
>>> index ded3a7541cb2..532c6a7ebdfd 100644
>>> --- a/mm/percpu.c
>>> +++ b/mm/percpu.c
>>> @@ -2296,6 +2296,9 @@ void free_percpu(void __percpu *ptr)
>>>                                 need_balance = true;
>>>                                 break;
>>>                         }
>>> +
>>> +               chunk->depopulated = false;
>>> +               pcpu_chunk_relocate(chunk, -1);
>>>         } else if (chunk != pcpu_first_chunk && chunk != pcpu_reserved_chunk &&
>>>                    !chunk->isolated &&
>>>                    (pcpu_nr_empty_pop_pages[pcpu_chunk_type(chunk)] >
>>>
>> Sure thing.
>>
>> I see far fewer sidelined chunks. In one such test run I saw zero occurrences
>> of sidelined chunks.
>>
> So looking at the stats, it now works properly. Do you see any savings in
> comparison to vanilla? The size of the savings can depend significantly on the
> exact size of cgroup-related objects, how many of them fit into a single chunk,
> etc. So you might want to play with the numbers in the test...
>
> Anyway, thank you very much for the report and your work on testing follow-up
> patches! It helped to reveal a serious bug in the implementation (completely
> empty sidelined chunks were not released in some cases), which by pure
> coincidence wasn't triggered on x86.
>
> Thanks!
>
Unfortunately not; I don't see any savings from the test.

# ./percpu_test_roman.sh
Percpu: 6144 kB
Percpu: 122880 kB
Percpu: 122880 kB

I had assumed that because POWER has a larger page size, we would also see
higher fragmentation, which could possibly lead to a lot more savings.

I'll dive deeper into the patches and tweak the setup to see if I can
understand this behavior.

Thanks for helping me understand this patchset a little better, and I'm glad we
found a bug with the sidelined chunks!

I'll get back to you if I do find something interesting and need help
understanding it.

Thank you again,
Pratik
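
To make the intent of Roman's debug patch (the diff quoted above) easier to
follow, here is a small self-contained sketch of the idea behind it: chunks
that have been sidelined for depopulation sit off the regular size-indexed
slot lists, so the reclaim path never revisits them; clearing the flag and
relocating the chunk back onto a slot list when memory is freed into it lets
a now-empty chunk be found and released. The structures and helper names
below are simplified stand-ins for illustration only, not the real
mm/percpu.c code.

#include <stdbool.h>
#include <stdio.h>

#define NR_SLOTS        4
#define UNIT_SIZE       4096

/* a drastically simplified stand-in for struct pcpu_chunk */
struct chunk {
        int free_bytes;
        bool depopulated;       /* sidelined for depopulation, off the slot lists */
        struct chunk *next;     /* singly linked slot list, enough for the toy */
};

/* chunks the allocator actively considers, indexed by how much is free */
static struct chunk *slots[NR_SLOTS];

static int chunk_slot(const struct chunk *c)
{
        return (c->free_bytes * (NR_SLOTS - 1)) / UNIT_SIZE;
}

/* stand-in for pcpu_chunk_relocate(): (re)insert a chunk into its slot list */
static void chunk_relocate(struct chunk *c)
{
        int s = chunk_slot(c);

        c->next = slots[s];
        slots[s] = c;
}

/* free `bytes` back into a chunk, mirroring the intent of the debug patch */
static void toy_free(struct chunk *c, int bytes)
{
        c->free_bytes += bytes;

        if (c->depopulated) {
                /* the fix: make the chunk visible to the reclaim path again */
                c->depopulated = false;
                chunk_relocate(c);
        }
}

/* reclaim walks only the slot lists, so sidelined chunks are invisible to it */
static void toy_reclaim(void)
{
        for (int s = 0; s < NR_SLOTS; s++)
                for (struct chunk *c = slots[s]; c; c = c->next)
                        if (c->free_bytes == UNIT_SIZE)
                                printf("reclaim: freeing fully empty chunk %p\n",
                                       (void *)c);
}

int main(void)
{
        /* a sidelined chunk holding one last 512-byte allocation */
        struct chunk sidelined = {
                .free_bytes = UNIT_SIZE - 512,
                .depopulated = true,
        };

        /*
         * Without the relocate-on-free step, this chunk would become fully
         * empty but never appear on a slot list, so reclaim would never
         * release it -- the situation the debug patch addresses.
         */
        toy_free(&sidelined, 512);
        toy_reclaim();

        return 0;
}

In the real allocator the relocation is done by pcpu_chunk_relocate(chunk, -1),
as in the quoted diff; the toy keeps only the part needed to show why the
missing relocation left completely empty sidelined chunks pinned in memory.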

2021-04-17 07:15:37

by Pratik R. Sampat

[permalink] [raw]
Subject: Re: [PATCH v3 0/6] percpu: partial chunk depopulation



On 17/04/21 3:17 am, Dennis Zhou wrote:
> Hello,
>
> On Sat, Apr 17, 2021 at 01:14:03AM +0530, Pratik Sampat wrote:
>>
>> On 17/04/21 12:39 am, Roman Gushchin wrote:
>>> On Sat, Apr 17, 2021 at 12:11:37AM +0530, Pratik Sampat wrote:
>>>> On 17/04/21 12:04 am, Roman Gushchin wrote:
>>>>> On Fri, Apr 16, 2021 at 11:57:03PM +0530, Pratik Sampat wrote:
>>>>>> On 16/04/21 10:43 pm, Roman Gushchin wrote:
>>>>>>> On Fri, Apr 16, 2021 at 08:58:33PM +0530, Pratik Sampat wrote:
>>>>>>>> Hello Dennis,
>>>>>>>>
>>>>>>>> I apologize for the clutter of logs before, I'm pasting the logs of before and
>>>>>>>> after the percpu test in the case of the patchset being applied on 5.12-rc6 and
>>>>>>>> the vanilla kernel 5.12-rc6.
>>>>>>>>
>>>>>>>> On 16/04/21 7:48 pm, Dennis Zhou wrote:
>>>>>>>>> Hello,
>>>>>>>>>
>>>>>>>>> On Fri, Apr 16, 2021 at 06:26:15PM +0530, Pratik Sampat wrote:
>>>>>>>>>> Hello Roman,
>>>>>>>>>>
>>>>>>>>>> I've tried the v3 patch series on a POWER9 and an x86 KVM setup.
>>>>>>>>>>
>>>>>>>>>> My results of the percpu_test are as follows:
>>>>>>>>>> Intel KVM 4CPU:4G
>>>>>>>>>> Vanilla 5.12-rc6
>>>>>>>>>> # ./percpu_test.sh
>>>>>>>>>> Percpu:             1952 kB
>>>>>>>>>> Percpu:           219648 kB
>>>>>>>>>> Percpu:           219648 kB
>>>>>>>>>>
>>>>>>>>>> 5.12-rc6 + with patchset applied
>>>>>>>>>> # ./percpu_test.sh
>>>>>>>>>> Percpu:             2080 kB
>>>>>>>>>> Percpu:           219712 kB
>>>>>>>>>> Percpu:            72672 kB
>>>>>>>>>>
>>>>>>>>>> I'm able to see an improvement comparable to what you're seeing too.
>>>>>>>>>>
>>>>>>>>>> However, on POWERPC I'm unable to reproduce these improvements with the patchset in the same configuration.
>>>>>>>>>>
>>>>>>>>>> POWER9 KVM 4CPU:4G
>>>>>>>>>> Vanilla 5.12-rc6
>>>>>>>>>> # ./percpu_test.sh
>>>>>>>>>> Percpu:             5888 kB
>>>>>>>>>> Percpu:           118272 kB
>>>>>>>>>> Percpu:           118272 kB
>>>>>>>>>>
>>>>>>>>>> 5.12-rc6 + with patchset applied
>>>>>>>>>> # ./percpu_test.sh
>>>>>>>>>> Percpu:             6144 kB
>>>>>>>>>> Percpu:           119040 kB
>>>>>>>>>> Percpu:           119040 kB
>>>>>>>>>>
>>>>>>>>>> I'm wondering if there's any architecture-specific code that needs plumbing
>>>>>>>>>> here?
>>>>>>>>>>
>>>>>>>>> There shouldn't be. Can you send me the percpu_stats debug output before
>>>>>>>>> and after?
>>>>>>>> I'll paste the whole debug stats before and after here.
>>>>>>>> 5.12-rc6 + patchset
>>>>>>>> -----BEFORE-----
>>>>>>>> Percpu Memory Statistics
>>>>>>>> Allocation Info:
>>>>>>> Hm, this looks highly suspicious. Here are your stats in a more compact form:
>>>>>>>
>>>>>>> Vanilla
>>>>>>>
>>>>>>> nr_alloc : 9038 nr_alloc : 97046
>>>>>>> nr_dealloc : 6992 nr_dealloc : 94237
>>>>>>> nr_cur_alloc : 2046 nr_cur_alloc : 2809
>>>>>>> nr_max_alloc : 2178 nr_max_alloc : 90054
>>>>>>> nr_chunks : 3 nr_chunks : 11
>>>>>>> nr_max_chunks : 3 nr_max_chunks : 47
>>>>>>> min_alloc_size : 4 min_alloc_size : 4
>>>>>>> max_alloc_size : 1072 max_alloc_size : 1072
>>>>>>> empty_pop_pages : 5 empty_pop_pages : 29
>>>>>>>
>>>>>>>
>>>>>>> Patched
>>>>>>>
>>>>>>> nr_alloc : 9040 nr_alloc : 97048
>>>>>>> nr_dealloc : 6994 nr_dealloc : 95002
>>>>>>> nr_cur_alloc : 2046 nr_cur_alloc : 2046
>>>>>>> nr_max_alloc : 2208 nr_max_alloc : 90054
>>>>>>> nr_chunks : 3 nr_chunks : 48
>>>>>>> nr_max_chunks : 3 nr_max_chunks : 48
>>>>>>> min_alloc_size : 4 min_alloc_size : 4
>>>>>>> max_alloc_size : 1072 max_alloc_size : 1072
>>>>>>> empty_pop_pages : 12 empty_pop_pages : 61
>>>>>>>
>>>>>>>
>>>>>>> So it looks like the number of chunks got bigger, as well as the number of
>>>>>>> empty_pop_pages? This contradicts what you wrote, so can you please make
>>>>>>> sure that the data is correct and we're not mixing up two cases?
>>>>>>>
>>>>>>> So it looks like for some reason sidelined (depopulated) chunks are not getting
>>>>>>> freed completely. But I struggle to explain why the initial empty_pop_pages is
>>>>>>> bigger with the same number of chunks.
>>>>>>>
>>>>>>> So can you please apply the following patch and provide updated statistics?
>>>>>> Unfortunately, I'm not completely well versed in this area, but yes, the empty
>>>>>> pop pages number doesn't make sense to me either.
>>>>>>
>>>>>> I re-ran the numbers trying to make sure my experiment setup is sane but
>>>>>> results remain the same.
>>>>>>
>>>>>> Vanilla
>>>>>> nr_alloc : 9040 nr_alloc : 97048
>>>>>> nr_dealloc : 6994 nr_dealloc : 94404
>>>>>> nr_cur_alloc : 2046 nr_cur_alloc : 2644
>>>>>> nr_max_alloc : 2169 nr_max_alloc : 90054
>>>>>> nr_chunks : 3 nr_chunks : 10
>>>>>> nr_max_chunks : 3 nr_max_chunks : 47
>>>>>> min_alloc_size : 4 min_alloc_size : 4
>>>>>> max_alloc_size : 1072 max_alloc_size : 1072
>>>>>> empty_pop_pages : 4 empty_pop_pages : 32
>>>>>>
>>>>>> With the patchset + debug patch the results are as follows:
>>>>>> Patched
>>>>>>
>>>>>> nr_alloc : 9040 nr_alloc : 97048
>>>>>> nr_dealloc : 6994 nr_dealloc : 94349
>>>>>> nr_cur_alloc : 2046 nr_cur_alloc : 2699
>>>>>> nr_max_alloc : 2194 nr_max_alloc : 90054
>>>>>> nr_chunks : 3 nr_chunks : 48
>>>>>> nr_max_chunks : 3 nr_max_chunks : 48
>>>>>> min_alloc_size : 4 min_alloc_size : 4
>>>>>> max_alloc_size : 1072 max_alloc_size : 1072
>>>>>> empty_pop_pages : 12 empty_pop_pages : 54
>>>>>>
>>>>>> With the extra tracing I can see 39 entries of "Chunk (sidelined)"
>>>>>> after the test was run. I don't see any entries for "Chunk (to depopulate)"
>>>>>>
>>>>>> I've snipped the results of sidelined chunks because they went on for ~600
>>>>>> lines; if you need the full logs, let me know.
>>>>> Yes, please! That's the most interesting part!
>>>> Got it. Pasting the full logs from after the percpu experiment was completed.
>>> Thanks!
>>>
>>> Would you mind applying the following patch and testing again?
>>>
>>> --
>>>
>>> diff --git a/mm/percpu.c b/mm/percpu.c
>>> index ded3a7541cb2..532c6a7ebdfd 100644
>>> --- a/mm/percpu.c
>>> +++ b/mm/percpu.c
>>> @@ -2296,6 +2296,9 @@ void free_percpu(void __percpu *ptr)
>>>                                 need_balance = true;
>>>                                 break;
>>>                         }
>>> +
>>> +               chunk->depopulated = false;
>>> +               pcpu_chunk_relocate(chunk, -1);
>>>         } else if (chunk != pcpu_first_chunk && chunk != pcpu_reserved_chunk &&
>>>                    !chunk->isolated &&
>>>                    (pcpu_nr_empty_pop_pages[pcpu_chunk_type(chunk)] >
>>>
>> Sure thing.
>>
>> I see far fewer sidelined chunks. In one such test run I saw zero occurrences
>> of sidelined chunks.
>>
>> Pasting the full logs as an example:
>>
>> BEFORE
>> Percpu Memory Statistics
>> Allocation Info:
>> ----------------------------------------
>> unit_size : 655360
>> static_size : 608920
>> reserved_size : 0
>> dyn_size : 46440
>> atom_size : 65536
>> alloc_size : 655360
>>
>> Global Stats:
>> ----------------------------------------
>> nr_alloc : 9038
>> nr_dealloc : 6992
>> nr_cur_alloc : 2046
>> nr_max_alloc : 2200
>> nr_chunks : 3
>> nr_max_chunks : 3
>> min_alloc_size : 4
>> max_alloc_size : 1072
>> empty_pop_pages : 12
>>
>> Per Chunk Stats:
>> ----------------------------------------
>> Chunk: <- First Chunk
>> nr_alloc : 1092
>> max_alloc_size : 1072
>> empty_pop_pages : 0
>> first_bit : 16247
>> free_bytes : 4
>> contig_bytes : 4
>> sum_frag : 4
>> max_frag : 4
>> cur_min_alloc : 4
>> cur_med_alloc : 8
>> cur_max_alloc : 1072
>> memcg_aware : 0
>>
>> Chunk:
>> nr_alloc : 594
>> max_alloc_size : 992
>> empty_pop_pages : 8
>> first_bit : 456
>> free_bytes : 645008
>> contig_bytes : 319984
>> sum_frag : 325024
>> max_frag : 318680
>> cur_min_alloc : 4
>> cur_med_alloc : 8
>> cur_max_alloc : 424
>> memcg_aware : 0
>>
>> Chunk:
>> nr_alloc : 360
>> max_alloc_size : 1072
>> empty_pop_pages : 4
>> first_bit : 26595
>> free_bytes : 506640
>> contig_bytes : 506540
>> sum_frag : 100
>> max_frag : 32
>> cur_min_alloc : 4
>> cur_med_alloc : 156
>> cur_max_alloc : 1072
>> memcg_aware : 1
>>
>>
>> AFTER
>> Percpu Memory Statistics
>> Allocation Info:
>> ----------------------------------------
>> unit_size : 655360
>> static_size : 608920
>> reserved_size : 0
>> dyn_size : 46440
>> atom_size : 65536
>> alloc_size : 655360
>>
>> Global Stats:
>> ----------------------------------------
>> nr_alloc : 97046
>> nr_dealloc : 94304
>> nr_cur_alloc : 2742
>> nr_max_alloc : 90054
>> nr_chunks : 11
>> nr_max_chunks : 47
>> min_alloc_size : 4
>> max_alloc_size : 1072
>> empty_pop_pages : 18
>>
>> Per Chunk Stats:
>> ----------------------------------------
>> Chunk: <- First Chunk
>> nr_alloc : 1092
>> max_alloc_size : 1072
>> empty_pop_pages : 0
>> first_bit : 16247
>> free_bytes : 4
>> contig_bytes : 4
>> sum_frag : 4
>> max_frag : 4
>> cur_min_alloc : 4
>> cur_med_alloc : 8
>> cur_max_alloc : 1072
>> memcg_aware : 0
>>
>> Chunk:
>> nr_alloc : 838
>> max_alloc_size : 1072
>> empty_pop_pages : 7
>> first_bit : 464
>> free_bytes : 640476
>> contig_bytes : 290672
>> sum_frag : 349804
>> max_frag : 304344
>> cur_min_alloc : 4
>> cur_med_alloc : 8
>> cur_max_alloc : 1072
>> memcg_aware : 0
>>
>> Chunk:
>> nr_alloc : 90
>> max_alloc_size : 1072
>> empty_pop_pages : 0
>> first_bit : 536
>> free_bytes : 595752
>> contig_bytes : 26164
>> sum_frag : 575132
>> max_frag : 26164
>> cur_min_alloc : 156
>> cur_med_alloc : 1072
>> cur_max_alloc : 1072
>> memcg_aware : 1
>>
>> Chunk:
>> nr_alloc : 90
>> max_alloc_size : 1072
>> empty_pop_pages : 0
>> first_bit : 0
>> free_bytes : 597428
>> contig_bytes : 26164
>> sum_frag : 596848
>> max_frag : 26164
>> cur_min_alloc : 156
>> cur_med_alloc : 312
>> cur_max_alloc : 1072
>> memcg_aware : 1
>>
>> Chunk:
>> nr_alloc : 92
>> max_alloc_size : 1072
>> empty_pop_pages : 0
>> first_bit : 0
>> free_bytes : 595284
>> contig_bytes : 26164
>> sum_frag : 590360
>> max_frag : 26164
>> cur_min_alloc : 156
>> cur_med_alloc : 312
>> cur_max_alloc : 1072
>> memcg_aware : 1
>>
>> Chunk:
>> nr_alloc : 92
>> max_alloc_size : 1072
>> empty_pop_pages : 0
>> first_bit : 0
>> free_bytes : 595284
>> contig_bytes : 26164
>> sum_frag : 583768
>> max_frag : 26164
>> cur_min_alloc : 156
>> cur_med_alloc : 312
>> cur_max_alloc : 1072
>> memcg_aware : 1
>>
>> Chunk:
>> nr_alloc : 360
>> max_alloc_size : 1072
>> empty_pop_pages : 7
>> first_bit : 26595
>> free_bytes : 506640
>> contig_bytes : 506540
>> sum_frag : 100
>> max_frag : 32
>> cur_min_alloc : 4
>> cur_med_alloc : 156
>> cur_max_alloc : 1072
>> memcg_aware : 1
>>
>> Chunk:
>> nr_alloc : 12
>> max_alloc_size : 1072
>> empty_pop_pages : 3
>> first_bit : 0
>> free_bytes : 647524
>> contig_bytes : 563492
>> sum_frag : 57872
>> max_frag : 26164
>> cur_min_alloc : 156
>> cur_med_alloc : 312
>> cur_max_alloc : 1072
>> memcg_aware : 1
>>
>> Chunk:
>> nr_alloc : 0
>> max_alloc_size : 1072
>> empty_pop_pages : 1
>> first_bit : 0
>> free_bytes : 655360
>> contig_bytes : 655360
>> sum_frag : 0
>> max_frag : 0
>> cur_min_alloc : 0
>> cur_med_alloc : 0
>> cur_max_alloc : 0
>> memcg_aware : 1
>>
>> Chunk (sidelined):
>> nr_alloc : 72
>> max_alloc_size : 1072
>> empty_pop_pages : 0
>> first_bit : 0
>> free_bytes : 608344
>> contig_bytes : 145552
>> sum_frag : 590340
>> max_frag : 145552
>> cur_min_alloc : 156
>> cur_med_alloc : 312
>> cur_max_alloc : 1072
>> memcg_aware : 1
>>
>> Chunk (sidelined):
>> nr_alloc : 4
>> max_alloc_size : 1072
>> empty_pop_pages : 0
>> first_bit : 0
>> free_bytes : 652748
>> contig_bytes : 426720
>> sum_frag : 426720
>> max_frag : 426720
>> cur_min_alloc : 156
>> cur_med_alloc : 312
>> cur_max_alloc : 1072
>> memcg_aware : 1
>>
>>
>
> Thank you, Pratik, for testing this and working with us to resolve it. I
> greatly appreciate it!
>
> Thanks,
> Dennis

No worries at all, glad I could be of some help!

Thank you,
Pratik