2021-03-24 19:08:22

by Roman Gushchin

[permalink] [raw]
Subject: [PATCH rfc 1/4] percpu: implement partial chunk depopulation

This patch implements partial depopulation of percpu chunks.

As now, a chunk can be depopulated only as a part of the final
destruction, when there are no more outstanding allocations. However
to minimize a memory waste, it might be useful to depopulate a
partially filed chunk, if a small number of outstanding allocations
prevents the chunk from being reclaimed.

This patch implements the following depopulation process: it scans
over the chunk pages, looks for a range of empty and populated pages
and performs the depopulation. To avoid races with new allocations,
the chunk is previously isolated. After the depopulation the chunk is
returned to the original slot (but is appended to the tail of the list
to minimize the chances of population).

Because the pcpu_lock is dropped while calling pcpu_depopulate_chunk(),
the chunk can be concurrently moved to a different slot. So we need
to isolate it again on each step. pcpu_alloc_mutex is held, so the
chunk can't be populated/depopulated asynchronously.

Signed-off-by: Roman Gushchin <[email protected]>
---
mm/percpu.c | 90 +++++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 90 insertions(+)

diff --git a/mm/percpu.c b/mm/percpu.c
index 6596a0a4286e..78c55c73fa28 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -2055,6 +2055,96 @@ static void __pcpu_balance_workfn(enum pcpu_chunk_type type)
mutex_unlock(&pcpu_alloc_mutex);
}

+/**
+ * pcpu_shrink_populated - scan chunks and release unused pages to the system
+ * @type: chunk type
+ *
+ * Scan over all chunks, find those marked with the depopulate flag and
+ * try to release unused pages to the system. On every attempt clear the
+ * chunk's depopulate flag to avoid wasting CPU by scanning the same
+ * chunk again and again.
+ */
+static void pcpu_shrink_populated(enum pcpu_chunk_type type)
+{
+ struct list_head *pcpu_slot = pcpu_chunk_list(type);
+ struct pcpu_chunk *chunk;
+ int slot, i, off, start;
+
+ spin_lock_irq(&pcpu_lock);
+ for (slot = pcpu_nr_slots - 1; slot >= 0; slot--) {
+restart:
+ list_for_each_entry(chunk, &pcpu_slot[slot], list) {
+ bool isolated = false;
+
+ if (pcpu_nr_empty_pop_pages < PCPU_EMPTY_POP_PAGES_HIGH)
+ break;
+
+ for (i = 0, start = -1; i < chunk->nr_pages; i++) {
+ if (!chunk->nr_empty_pop_pages)
+ break;
+
+ /*
+ * If the page is empty and populated, start or
+ * extend the [start, i) range.
+ */
+ if (test_bit(i, chunk->populated)) {
+ off = find_first_bit(
+ pcpu_index_alloc_map(chunk, i),
+ PCPU_BITMAP_BLOCK_BITS);
+ if (off >= PCPU_BITMAP_BLOCK_BITS) {
+ if (start == -1)
+ start = i;
+ continue;
+ }
+ }
+
+ /*
+ * Otherwise check if there is an active range,
+ * and if yes, depopulate it.
+ */
+ if (start == -1)
+ continue;
+
+ /*
+ * Isolate the chunk, so new allocations
+ * wouldn't be served using this chunk.
+ * Async releases can still happen.
+ */
+ if (!list_empty(&chunk->list)) {
+ list_del_init(&chunk->list);
+ isolated = true;
+ }
+
+ spin_unlock_irq(&pcpu_lock);
+ pcpu_depopulate_chunk(chunk, start, i);
+ cond_resched();
+ spin_lock_irq(&pcpu_lock);
+
+ pcpu_chunk_depopulated(chunk, start, i);
+
+ /*
+ * Reset the range and continue.
+ */
+ start = -1;
+ }
+
+ if (isolated) {
+ /*
+ * The chunk could have been moved while
+ * pcpu_lock wasn't held. Make sure we put
+ * the chunk back into the slot and restart
+ * the scanning.
+ */
+ if (list_empty(&chunk->list))
+ list_add_tail(&chunk->list,
+ &pcpu_slot[slot]);
+ goto restart;
+ }
+ }
+ }
+ spin_unlock_irq(&pcpu_lock);
+}
+
/**
* pcpu_balance_workfn - manage the amount of free chunks and populated pages
* @work: unused
--
2.30.2


2021-03-29 17:25:26

by Dennis Zhou

[permalink] [raw]
Subject: Re: [PATCH rfc 1/4] percpu: implement partial chunk depopulation

On Wed, Mar 24, 2021 at 12:06:23PM -0700, Roman Gushchin wrote:
> This patch implements partial depopulation of percpu chunks.
>
> As now, a chunk can be depopulated only as a part of the final
> destruction, when there are no more outstanding allocations. However
> to minimize a memory waste, it might be useful to depopulate a
> partially filed chunk, if a small number of outstanding allocations
> prevents the chunk from being reclaimed.
>
> This patch implements the following depopulation process: it scans
> over the chunk pages, looks for a range of empty and populated pages
> and performs the depopulation. To avoid races with new allocations,
> the chunk is previously isolated. After the depopulation the chunk is
> returned to the original slot (but is appended to the tail of the list
> to minimize the chances of population).
>
> Because the pcpu_lock is dropped while calling pcpu_depopulate_chunk(),
> the chunk can be concurrently moved to a different slot. So we need
> to isolate it again on each step. pcpu_alloc_mutex is held, so the
> chunk can't be populated/depopulated asynchronously.
>
> Signed-off-by: Roman Gushchin <[email protected]>
> ---
> mm/percpu.c | 90 +++++++++++++++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 90 insertions(+)
>
> diff --git a/mm/percpu.c b/mm/percpu.c
> index 6596a0a4286e..78c55c73fa28 100644
> --- a/mm/percpu.c
> +++ b/mm/percpu.c
> @@ -2055,6 +2055,96 @@ static void __pcpu_balance_workfn(enum pcpu_chunk_type type)
> mutex_unlock(&pcpu_alloc_mutex);
> }
>
> +/**
> + * pcpu_shrink_populated - scan chunks and release unused pages to the system
> + * @type: chunk type
> + *
> + * Scan over all chunks, find those marked with the depopulate flag and
> + * try to release unused pages to the system. On every attempt clear the
> + * chunk's depopulate flag to avoid wasting CPU by scanning the same
> + * chunk again and again.
> + */
> +static void pcpu_shrink_populated(enum pcpu_chunk_type type)
> +{
> + struct list_head *pcpu_slot = pcpu_chunk_list(type);
> + struct pcpu_chunk *chunk;
> + int slot, i, off, start;
> +
> + spin_lock_irq(&pcpu_lock);
> + for (slot = pcpu_nr_slots - 1; slot >= 0; slot--) {
> +restart:
> + list_for_each_entry(chunk, &pcpu_slot[slot], list) {
> + bool isolated = false;
> +
> + if (pcpu_nr_empty_pop_pages < PCPU_EMPTY_POP_PAGES_HIGH)
> + break;
> +

Deallocation makes me a little worried for the atomic case as now we
could in theory pathologically scan deallocated chunks before finding a
populated one.

I wonder if we should do something like once a chunk gets depopulated,
it gets deprioritized and then only once we exhaust looking through
allocated chunks we then find a depopulated chunk and add it back into
the rotation. Possibly just add another set of slots? I guess it adds a
few dimensions to pcpu_slots after the memcg change.

> + for (i = 0, start = -1; i < chunk->nr_pages; i++) {
> + if (!chunk->nr_empty_pop_pages)
> + break;
> +
> + /*
> + * If the page is empty and populated, start or
> + * extend the [start, i) range.
> + */
> + if (test_bit(i, chunk->populated)) {
> + off = find_first_bit(
> + pcpu_index_alloc_map(chunk, i),
> + PCPU_BITMAP_BLOCK_BITS);
> + if (off >= PCPU_BITMAP_BLOCK_BITS) {
> + if (start == -1)
> + start = i;
> + continue;
> + }

Here instead of looking at the alloc_map, you can look at the
pcpu_block_md and look for a fully free contig_hint.

> + }
> +
> + /*
> + * Otherwise check if there is an active range,
> + * and if yes, depopulate it.
> + */
> + if (start == -1)
> + continue;
> +
> + /*
> + * Isolate the chunk, so new allocations
> + * wouldn't be served using this chunk.
> + * Async releases can still happen.
> + */
> + if (!list_empty(&chunk->list)) {
> + list_del_init(&chunk->list);
> + isolated = true;

Maybe when freeing a chunk, we should consider just isolating it period
and preventing pcpu_free_area() from being able to add the chunk back
to a pcpu_slot.

> + }
> +
> + spin_unlock_irq(&pcpu_lock);
> + pcpu_depopulate_chunk(chunk, start, i);
> + cond_resched();
> + spin_lock_irq(&pcpu_lock);
> +
> + pcpu_chunk_depopulated(chunk, start, i);
> +
> + /*
> + * Reset the range and continue.
> + */
> + start = -1;
> + }
> +
> + if (isolated) {
> + /*
> + * The chunk could have been moved while
> + * pcpu_lock wasn't held. Make sure we put
> + * the chunk back into the slot and restart
> + * the scanning.
> + */
> + if (list_empty(&chunk->list))
> + list_add_tail(&chunk->list,
> + &pcpu_slot[slot]);
> + goto restart;
> + }
> + }
> + }
> + spin_unlock_irq(&pcpu_lock);
> +}
> +
> /**
> * pcpu_balance_workfn - manage the amount of free chunks and populated pages
> * @work: unused
> --
> 2.30.2
>

2021-03-29 18:31:34

by Roman Gushchin

[permalink] [raw]
Subject: Re: [PATCH rfc 1/4] percpu: implement partial chunk depopulation

On Mon, Mar 29, 2021 at 05:20:55PM +0000, Dennis Zhou wrote:
> On Wed, Mar 24, 2021 at 12:06:23PM -0700, Roman Gushchin wrote:
> > This patch implements partial depopulation of percpu chunks.
> >
> > As now, a chunk can be depopulated only as a part of the final
> > destruction, when there are no more outstanding allocations. However
> > to minimize a memory waste, it might be useful to depopulate a
> > partially filed chunk, if a small number of outstanding allocations
> > prevents the chunk from being reclaimed.
> >
> > This patch implements the following depopulation process: it scans
> > over the chunk pages, looks for a range of empty and populated pages
> > and performs the depopulation. To avoid races with new allocations,
> > the chunk is previously isolated. After the depopulation the chunk is
> > returned to the original slot (but is appended to the tail of the list
> > to minimize the chances of population).
> >
> > Because the pcpu_lock is dropped while calling pcpu_depopulate_chunk(),
> > the chunk can be concurrently moved to a different slot. So we need
> > to isolate it again on each step. pcpu_alloc_mutex is held, so the
> > chunk can't be populated/depopulated asynchronously.
> >
> > Signed-off-by: Roman Gushchin <[email protected]>
> > ---
> > mm/percpu.c | 90 +++++++++++++++++++++++++++++++++++++++++++++++++++++
> > 1 file changed, 90 insertions(+)
> >
> > diff --git a/mm/percpu.c b/mm/percpu.c
> > index 6596a0a4286e..78c55c73fa28 100644
> > --- a/mm/percpu.c
> > +++ b/mm/percpu.c
> > @@ -2055,6 +2055,96 @@ static void __pcpu_balance_workfn(enum pcpu_chunk_type type)
> > mutex_unlock(&pcpu_alloc_mutex);
> > }
> >
> > +/**
> > + * pcpu_shrink_populated - scan chunks and release unused pages to the system
> > + * @type: chunk type
> > + *
> > + * Scan over all chunks, find those marked with the depopulate flag and
> > + * try to release unused pages to the system. On every attempt clear the
> > + * chunk's depopulate flag to avoid wasting CPU by scanning the same
> > + * chunk again and again.
> > + */
> > +static void pcpu_shrink_populated(enum pcpu_chunk_type type)
> > +{
> > + struct list_head *pcpu_slot = pcpu_chunk_list(type);
> > + struct pcpu_chunk *chunk;
> > + int slot, i, off, start;
> > +
> > + spin_lock_irq(&pcpu_lock);
> > + for (slot = pcpu_nr_slots - 1; slot >= 0; slot--) {
> > +restart:
> > + list_for_each_entry(chunk, &pcpu_slot[slot], list) {
> > + bool isolated = false;
> > +
> > + if (pcpu_nr_empty_pop_pages < PCPU_EMPTY_POP_PAGES_HIGH)
> > + break;
> > +
>
> Deallocation makes me a little worried for the atomic case as now we
> could in theory pathologically scan deallocated chunks before finding a
> populated one.
>
> I wonder if we should do something like once a chunk gets depopulated,
> it gets deprioritized and then only once we exhaust looking through
> allocated chunks we then find a depopulated chunk and add it back into
> the rotation. Possibly just add another set of slots? I guess it adds a
> few dimensions to pcpu_slots after the memcg change.

Please, take a look at patch 3 in the series ("percpu: on demand chunk depopulation").
Chunks considered to be a good target for the depopulation are in advance
marked with a special flag, so we'll actually try to depopulate only
few chunks at once. While the total number of chunks is fairly low,
I think it should work.

Another option is to link all such chunks into a list and scan over it,
instead of iterating over all slots.

Adding new dimensions to pcpu_slots is an option too, but I hope we can avoid
this, as it would complicate the code.

>
> > + for (i = 0, start = -1; i < chunk->nr_pages; i++) {
> > + if (!chunk->nr_empty_pop_pages)
> > + break;
> > +
> > + /*
> > + * If the page is empty and populated, start or
> > + * extend the [start, i) range.
> > + */
> > + if (test_bit(i, chunk->populated)) {
> > + off = find_first_bit(
> > + pcpu_index_alloc_map(chunk, i),
> > + PCPU_BITMAP_BLOCK_BITS);
> > + if (off >= PCPU_BITMAP_BLOCK_BITS) {
> > + if (start == -1)
> > + start = i;
> > + continue;
> > + }
>
> Here instead of looking at the alloc_map, you can look at the
> pcpu_block_md and look for a fully free contig_hint.

Good idea, will try in v2.

>
> > + }
> > +
> > + /*
> > + * Otherwise check if there is an active range,
> > + * and if yes, depopulate it.
> > + */
> > + if (start == -1)
> > + continue;
> > +
> > + /*
> > + * Isolate the chunk, so new allocations
> > + * wouldn't be served using this chunk.
> > + * Async releases can still happen.
> > + */
> > + if (!list_empty(&chunk->list)) {
> > + list_del_init(&chunk->list);
> > + isolated = true;
>
> Maybe when freeing a chunk, we should consider just isolating it period
> and preventing pcpu_free_area() from being able to add the chunk back
> to a pcpu_slot.

You mean to add a check in pcpu_free_area() if the chunks is isolated?
Yeah, sounds good to me, will do in v2.

Thank you!

>
> > + }
> > +
> > + spin_unlock_irq(&pcpu_lock);
> > + pcpu_depopulate_chunk(chunk, start, i);
> > + cond_resched();
> > + spin_lock_irq(&pcpu_lock);
> > +
> > + pcpu_chunk_depopulated(chunk, start, i);
> > +
> > + /*
> > + * Reset the range and continue.
> > + */
> > + start = -1;
> > + }
> > +
> > + if (isolated) {
> > + /*
> > + * The chunk could have been moved while
> > + * pcpu_lock wasn't held. Make sure we put
> > + * the chunk back into the slot and restart
> > + * the scanning.
> > + */
> > + if (list_empty(&chunk->list))
> > + list_add_tail(&chunk->list,
> > + &pcpu_slot[slot]);
> > + goto restart;
> > + }
> > + }
> > + }
> > + spin_unlock_irq(&pcpu_lock);
> > +}
> > +
> > /**
> > * pcpu_balance_workfn - manage the amount of free chunks and populated pages
> > * @work: unused
> > --
> > 2.30.2
> >

2021-03-29 19:30:59

by Dennis Zhou

[permalink] [raw]
Subject: Re: [PATCH rfc 1/4] percpu: implement partial chunk depopulation

On Mon, Mar 29, 2021 at 11:29:22AM -0700, Roman Gushchin wrote:
> On Mon, Mar 29, 2021 at 05:20:55PM +0000, Dennis Zhou wrote:
> > On Wed, Mar 24, 2021 at 12:06:23PM -0700, Roman Gushchin wrote:
> > > This patch implements partial depopulation of percpu chunks.
> > >
> > > As now, a chunk can be depopulated only as a part of the final
> > > destruction, when there are no more outstanding allocations. However
> > > to minimize a memory waste, it might be useful to depopulate a
> > > partially filed chunk, if a small number of outstanding allocations
> > > prevents the chunk from being reclaimed.
> > >
> > > This patch implements the following depopulation process: it scans
> > > over the chunk pages, looks for a range of empty and populated pages
> > > and performs the depopulation. To avoid races with new allocations,
> > > the chunk is previously isolated. After the depopulation the chunk is
> > > returned to the original slot (but is appended to the tail of the list
> > > to minimize the chances of population).
> > >
> > > Because the pcpu_lock is dropped while calling pcpu_depopulate_chunk(),
> > > the chunk can be concurrently moved to a different slot. So we need
> > > to isolate it again on each step. pcpu_alloc_mutex is held, so the
> > > chunk can't be populated/depopulated asynchronously.
> > >
> > > Signed-off-by: Roman Gushchin <[email protected]>
> > > ---
> > > mm/percpu.c | 90 +++++++++++++++++++++++++++++++++++++++++++++++++++++
> > > 1 file changed, 90 insertions(+)
> > >
> > > diff --git a/mm/percpu.c b/mm/percpu.c
> > > index 6596a0a4286e..78c55c73fa28 100644
> > > --- a/mm/percpu.c
> > > +++ b/mm/percpu.c
> > > @@ -2055,6 +2055,96 @@ static void __pcpu_balance_workfn(enum pcpu_chunk_type type)
> > > mutex_unlock(&pcpu_alloc_mutex);
> > > }
> > >
> > > +/**
> > > + * pcpu_shrink_populated - scan chunks and release unused pages to the system
> > > + * @type: chunk type
> > > + *
> > > + * Scan over all chunks, find those marked with the depopulate flag and
> > > + * try to release unused pages to the system. On every attempt clear the
> > > + * chunk's depopulate flag to avoid wasting CPU by scanning the same
> > > + * chunk again and again.
> > > + */
> > > +static void pcpu_shrink_populated(enum pcpu_chunk_type type)
> > > +{
> > > + struct list_head *pcpu_slot = pcpu_chunk_list(type);
> > > + struct pcpu_chunk *chunk;
> > > + int slot, i, off, start;
> > > +
> > > + spin_lock_irq(&pcpu_lock);
> > > + for (slot = pcpu_nr_slots - 1; slot >= 0; slot--) {
> > > +restart:
> > > + list_for_each_entry(chunk, &pcpu_slot[slot], list) {
> > > + bool isolated = false;
> > > +
> > > + if (pcpu_nr_empty_pop_pages < PCPU_EMPTY_POP_PAGES_HIGH)
> > > + break;
> > > +
> >
> > Deallocation makes me a little worried for the atomic case as now we
> > could in theory pathologically scan deallocated chunks before finding a
> > populated one.
> >
> > I wonder if we should do something like once a chunk gets depopulated,
> > it gets deprioritized and then only once we exhaust looking through
> > allocated chunks we then find a depopulated chunk and add it back into
> > the rotation. Possibly just add another set of slots? I guess it adds a
> > few dimensions to pcpu_slots after the memcg change.
>
> Please, take a look at patch 3 in the series ("percpu: on demand chunk depopulation").
> Chunks considered to be a good target for the depopulation are in advance
> marked with a special flag, so we'll actually try to depopulate only
> few chunks at once. While the total number of chunks is fairly low,
> I think it should work.
>
> Another option is to link all such chunks into a list and scan over it,
> instead of iterating over all slots.
>
> Adding new dimensions to pcpu_slots is an option too, but I hope we can avoid
> this, as it would complicate the code.
>

Yeah, depopulation has been on the todo list for a while. It adds the
dimension/opportunity of bin packing by sidelining chunks and I'm
wondering if that is the right thing to do.

Do you have a rough idea of the distribution of # of chunks you're
seeing?

> >
> > > + for (i = 0, start = -1; i < chunk->nr_pages; i++) {
> > > + if (!chunk->nr_empty_pop_pages)
> > > + break;
> > > +
> > > + /*
> > > + * If the page is empty and populated, start or
> > > + * extend the [start, i) range.
> > > + */
> > > + if (test_bit(i, chunk->populated)) {
> > > + off = find_first_bit(
> > > + pcpu_index_alloc_map(chunk, i),
> > > + PCPU_BITMAP_BLOCK_BITS);
> > > + if (off >= PCPU_BITMAP_BLOCK_BITS) {
> > > + if (start == -1)
> > > + start = i;
> > > + continue;
> > > + }
> >
> > Here instead of looking at the alloc_map, you can look at the
> > pcpu_block_md and look for a fully free contig_hint.
>
> Good idea, will try in v2.
>
> >
> > > + }
> > > +
> > > + /*
> > > + * Otherwise check if there is an active range,
> > > + * and if yes, depopulate it.
> > > + */
> > > + if (start == -1)
> > > + continue;
> > > +
> > > + /*
> > > + * Isolate the chunk, so new allocations
> > > + * wouldn't be served using this chunk.
> > > + * Async releases can still happen.
> > > + */
> > > + if (!list_empty(&chunk->list)) {
> > > + list_del_init(&chunk->list);
> > > + isolated = true;
> >
> > Maybe when freeing a chunk, we should consider just isolating it period
> > and preventing pcpu_free_area() from being able to add the chunk back
> > to a pcpu_slot.
>
> You mean to add a check in pcpu_free_area() if the chunks is isolated?
> Yeah, sounds good to me, will do in v2.
>

Could also be done in pcpu_chunk_relocate() so it's clear an isolated
chunk shouldn't be moved. But I think pcpu_free_area() should be the
only way the chunk can be moved on the list.

> Thank you!
>
> >
> > > + }
> > > +
> > > + spin_unlock_irq(&pcpu_lock);
> > > + pcpu_depopulate_chunk(chunk, start, i);
> > > + cond_resched();
> > > + spin_lock_irq(&pcpu_lock);
> > > +
> > > + pcpu_chunk_depopulated(chunk, start, i);
> > > +
> > > + /*
> > > + * Reset the range and continue.
> > > + */
> > > + start = -1;
> > > + }
> > > +
> > > + if (isolated) {
> > > + /*
> > > + * The chunk could have been moved while
> > > + * pcpu_lock wasn't held. Make sure we put
> > > + * the chunk back into the slot and restart
> > > + * the scanning.
> > > + */
> > > + if (list_empty(&chunk->list))
> > > + list_add_tail(&chunk->list,
> > > + &pcpu_slot[slot]);
> > > + goto restart;
> > > + }
> > > + }
> > > + }
> > > + spin_unlock_irq(&pcpu_lock);
> > > +}
> > > +
> > > /**
> > > * pcpu_balance_workfn - manage the amount of free chunks and populated pages
> > > * @work: unused
> > > --
> > > 2.30.2
> > >

2021-03-29 19:43:13

by Roman Gushchin

[permalink] [raw]
Subject: Re: [PATCH rfc 1/4] percpu: implement partial chunk depopulation

On Mon, Mar 29, 2021 at 07:28:49PM +0000, Dennis Zhou wrote:
> On Mon, Mar 29, 2021 at 11:29:22AM -0700, Roman Gushchin wrote:
> > On Mon, Mar 29, 2021 at 05:20:55PM +0000, Dennis Zhou wrote:
> > > On Wed, Mar 24, 2021 at 12:06:23PM -0700, Roman Gushchin wrote:
> > > > This patch implements partial depopulation of percpu chunks.
> > > >
> > > > As now, a chunk can be depopulated only as a part of the final
> > > > destruction, when there are no more outstanding allocations. However
> > > > to minimize a memory waste, it might be useful to depopulate a
> > > > partially filed chunk, if a small number of outstanding allocations
> > > > prevents the chunk from being reclaimed.
> > > >
> > > > This patch implements the following depopulation process: it scans
> > > > over the chunk pages, looks for a range of empty and populated pages
> > > > and performs the depopulation. To avoid races with new allocations,
> > > > the chunk is previously isolated. After the depopulation the chunk is
> > > > returned to the original slot (but is appended to the tail of the list
> > > > to minimize the chances of population).
> > > >
> > > > Because the pcpu_lock is dropped while calling pcpu_depopulate_chunk(),
> > > > the chunk can be concurrently moved to a different slot. So we need
> > > > to isolate it again on each step. pcpu_alloc_mutex is held, so the
> > > > chunk can't be populated/depopulated asynchronously.
> > > >
> > > > Signed-off-by: Roman Gushchin <[email protected]>
> > > > ---
> > > > mm/percpu.c | 90 +++++++++++++++++++++++++++++++++++++++++++++++++++++
> > > > 1 file changed, 90 insertions(+)
> > > >
> > > > diff --git a/mm/percpu.c b/mm/percpu.c
> > > > index 6596a0a4286e..78c55c73fa28 100644
> > > > --- a/mm/percpu.c
> > > > +++ b/mm/percpu.c
> > > > @@ -2055,6 +2055,96 @@ static void __pcpu_balance_workfn(enum pcpu_chunk_type type)
> > > > mutex_unlock(&pcpu_alloc_mutex);
> > > > }
> > > >
> > > > +/**
> > > > + * pcpu_shrink_populated - scan chunks and release unused pages to the system
> > > > + * @type: chunk type
> > > > + *
> > > > + * Scan over all chunks, find those marked with the depopulate flag and
> > > > + * try to release unused pages to the system. On every attempt clear the
> > > > + * chunk's depopulate flag to avoid wasting CPU by scanning the same
> > > > + * chunk again and again.
> > > > + */
> > > > +static void pcpu_shrink_populated(enum pcpu_chunk_type type)
> > > > +{
> > > > + struct list_head *pcpu_slot = pcpu_chunk_list(type);
> > > > + struct pcpu_chunk *chunk;
> > > > + int slot, i, off, start;
> > > > +
> > > > + spin_lock_irq(&pcpu_lock);
> > > > + for (slot = pcpu_nr_slots - 1; slot >= 0; slot--) {
> > > > +restart:
> > > > + list_for_each_entry(chunk, &pcpu_slot[slot], list) {
> > > > + bool isolated = false;
> > > > +
> > > > + if (pcpu_nr_empty_pop_pages < PCPU_EMPTY_POP_PAGES_HIGH)
> > > > + break;
> > > > +
> > >
> > > Deallocation makes me a little worried for the atomic case as now we
> > > could in theory pathologically scan deallocated chunks before finding a
> > > populated one.
> > >
> > > I wonder if we should do something like once a chunk gets depopulated,
> > > it gets deprioritized and then only once we exhaust looking through
> > > allocated chunks we then find a depopulated chunk and add it back into
> > > the rotation. Possibly just add another set of slots? I guess it adds a
> > > few dimensions to pcpu_slots after the memcg change.
> >
> > Please, take a look at patch 3 in the series ("percpu: on demand chunk depopulation").
> > Chunks considered to be a good target for the depopulation are in advance
> > marked with a special flag, so we'll actually try to depopulate only
> > few chunks at once. While the total number of chunks is fairly low,
> > I think it should work.
> >
> > Another option is to link all such chunks into a list and scan over it,
> > instead of iterating over all slots.
> >
> > Adding new dimensions to pcpu_slots is an option too, but I hope we can avoid
> > this, as it would complicate the code.
> >
>
> Yeah, depopulation has been on the todo list for a while. It adds the
> dimension/opportunity of bin packing by sidelining chunks and I'm
> wondering if that is the right thing to do.
>
> Do you have a rough idea of the distribution of # of chunks you're
> seeing?

The majority of chunks are almost completely empty and reside on a one
of two last slots (before the slot for empty chunks). These are chunks
I'm mostly targeting.

>
> > >
> > > > + for (i = 0, start = -1; i < chunk->nr_pages; i++) {
> > > > + if (!chunk->nr_empty_pop_pages)
> > > > + break;
> > > > +
> > > > + /*
> > > > + * If the page is empty and populated, start or
> > > > + * extend the [start, i) range.
> > > > + */
> > > > + if (test_bit(i, chunk->populated)) {
> > > > + off = find_first_bit(
> > > > + pcpu_index_alloc_map(chunk, i),
> > > > + PCPU_BITMAP_BLOCK_BITS);
> > > > + if (off >= PCPU_BITMAP_BLOCK_BITS) {
> > > > + if (start == -1)
> > > > + start = i;
> > > > + continue;
> > > > + }
> > >
> > > Here instead of looking at the alloc_map, you can look at the
> > > pcpu_block_md and look for a fully free contig_hint.
> >
> > Good idea, will try in v2.
> >
> > >
> > > > + }
> > > > +
> > > > + /*
> > > > + * Otherwise check if there is an active range,
> > > > + * and if yes, depopulate it.
> > > > + */
> > > > + if (start == -1)
> > > > + continue;
> > > > +
> > > > + /*
> > > > + * Isolate the chunk, so new allocations
> > > > + * wouldn't be served using this chunk.
> > > > + * Async releases can still happen.
> > > > + */
> > > > + if (!list_empty(&chunk->list)) {
> > > > + list_del_init(&chunk->list);
> > > > + isolated = true;
> > >
> > > Maybe when freeing a chunk, we should consider just isolating it period
> > > and preventing pcpu_free_area() from being able to add the chunk back
> > > to a pcpu_slot.
> >
> > You mean to add a check in pcpu_free_area() if the chunks is isolated?
> > Yeah, sounds good to me, will do in v2.
> >
>
> Could also be done in pcpu_chunk_relocate() so it's clear an isolated
> chunk shouldn't be moved. But I think pcpu_free_area() should be the
> only way the chunk can be moved on the list.

Yeah, I thought about putting it into pcpu_chunk_relocate(), but it's
used on new chunks, so it would require more changes, so I'm not sure.

>
> > Thank you!
> >
> > >
> > > > + }
> > > > +
> > > > + spin_unlock_irq(&pcpu_lock);
> > > > + pcpu_depopulate_chunk(chunk, start, i);
> > > > + cond_resched();
> > > > + spin_lock_irq(&pcpu_lock);
> > > > +
> > > > + pcpu_chunk_depopulated(chunk, start, i);
> > > > +
> > > > + /*
> > > > + * Reset the range and continue.
> > > > + */
> > > > + start = -1;
> > > > + }
> > > > +
> > > > + if (isolated) {
> > > > + /*
> > > > + * The chunk could have been moved while
> > > > + * pcpu_lock wasn't held. Make sure we put
> > > > + * the chunk back into the slot and restart
> > > > + * the scanning.
> > > > + */
> > > > + if (list_empty(&chunk->list))
> > > > + list_add_tail(&chunk->list,
> > > > + &pcpu_slot[slot]);
> > > > + goto restart;
> > > > + }
> > > > + }
> > > > + }
> > > > + spin_unlock_irq(&pcpu_lock);
> > > > +}
> > > > +
> > > > /**
> > > > * pcpu_balance_workfn - manage the amount of free chunks and populated pages
> > > > * @work: unused
> > > > --
> > > > 2.30.2
> > > >