This reverts commit 0388536ac29104a478c79b3869541524caec28eb.
Proactive reclaim on the root cgroup is 10x slower after this patch when
MGLRU is enabled, and completion times for proactive reclaim on much
smaller non-root cgroups take ~30% longer (with or without MGLRU). With
root reclaim before the patch, I observe average reclaim rates of
~70k pages/sec before try_to_free_mem_cgroup_pages starts to fail and
the nr_retries counter starts to decrement, eventually ending the
proactive reclaim attempt. After the patch the reclaim rate is
consistently ~6.6k pages/sec due to the reduced nr_pages value causing
scan aborts as soon as SWAP_CLUSTER_MAX pages are reclaimed. The
proactive reclaim doesn't complete after several minutes because
try_to_free_mem_cgroup_pages is still capable of reclaiming pages in
tiny SWAP_CLUSTER_MAX page chunks and nr_retries is never decremented.
The docs for memory.reclaim say, "the kernel can over or under reclaim
from the target cgroup" which this patch was trying to fix. Revert it
until a less costly solution is found.
Signed-off-by: T.J. Mercier <[email protected]>
---
mm/memcontrol.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index e4c8735e7c85..cee536c97151 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -6956,8 +6956,8 @@ static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf,
 		lru_add_drain_all();
 
 		reclaimed = try_to_free_mem_cgroup_pages(memcg,
-					min(nr_to_reclaim - nr_reclaimed, SWAP_CLUSTER_MAX),
-					GFP_KERNEL, reclaim_options);
+					nr_to_reclaim - nr_reclaimed,
+					GFP_KERNEL, reclaim_options);
 
 		if (!reclaimed && !nr_retries--)
 			return -EAGAIN;
--
2.43.0.429.g432eaa2c6b-goog
On Sun, Jan 21, 2024 at 2:44 PM T.J. Mercier <[email protected]> wrote:
>
> This reverts commit 0388536ac29104a478c79b3869541524caec28eb.
>
> Proactive reclaim on the root cgroup is 10x slower after this patch when
> MGLRU is enabled, and completion times for proactive reclaim on much
> smaller non-root cgroups take ~30% longer (with or without MGLRU). With
> root reclaim before the patch, I observe average reclaim rates of
> ~70k pages/sec before try_to_free_mem_cgroup_pages starts to fail and
> the nr_retries counter starts to decrement, eventually ending the
> proactive reclaim attempt. After the patch the reclaim rate is
> consistently ~6.6k pages/sec due to the reduced nr_pages value causing
> scan aborts as soon as SWAP_CLUSTER_MAX pages are reclaimed. The
> proactive reclaim doesn't complete after several minutes because
> try_to_free_mem_cgroup_pages is still capable of reclaiming pages in
> tiny SWAP_CLUSTER_MAX page chunks and nr_retries is never decremented.
>
> The docs for memory.reclaim say, "the kernel can over or under reclaim
> from the target cgroup" which this patch was trying to fix. Revert it
> until a less costly solution is found.
>
> Signed-off-by: T.J. Mercier <[email protected]>
Fixes: 0388536ac291 ("mm:vmscan: fix inaccurate reclaim during proactive reclaim")
Cc: <[email protected]>
On Sun 21-01-24 21:44:12, T.J. Mercier wrote:
> This reverts commit 0388536ac29104a478c79b3869541524caec28eb.
>
> Proactive reclaim on the root cgroup is 10x slower after this patch when
> MGLRU is enabled, and completion times for proactive reclaim on much
> smaller non-root cgroups take ~30% longer (with or without MGLRU).
What is the reclaim target in these pro-active reclaim requests?
> With
> root reclaim before the patch, I observe average reclaim rates of
> ~70k pages/sec before try_to_free_mem_cgroup_pages starts to fail and
> the nr_retries counter starts to decrement, eventually ending the
> proactive reclaim attempt.
Do I understand correctly that the reclaim target is overestimated and
you expect that the reclaim process breaks out early?
> After the patch the reclaim rate is
> consistently ~6.6k pages/sec due to the reduced nr_pages value causing
> scan aborts as soon as SWAP_CLUSTER_MAX pages are reclaimed. The
> proactive reclaim doesn't complete after several minutes because
> try_to_free_mem_cgroup_pages is still capable of reclaiming pages in
> tiny SWAP_CLUSTER_MAX page chunks and nr_retries is never decremented.
I do not understand this part. How does a smaller reclaim target manage
to have reclaimed > 0 while a larger one doesn't?
> The docs for memory.reclaim say, "the kernel can over or under reclaim
> from the target cgroup" which this patch was trying to fix. Revert it
> until a less costly solution is found.
>
> Signed-off-by: T.J. Mercier <[email protected]>
> ---
> mm/memcontrol.c | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index e4c8735e7c85..cee536c97151 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -6956,8 +6956,8 @@ static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf,
> lru_add_drain_all();
>
> reclaimed = try_to_free_mem_cgroup_pages(memcg,
> - min(nr_to_reclaim - nr_reclaimed, SWAP_CLUSTER_MAX),
> - GFP_KERNEL, reclaim_options);
> + nr_to_reclaim - nr_reclaimed,
> + GFP_KERNEL, reclaim_options);
>
> if (!reclaimed && !nr_retries--)
> return -EAGAIN;
> --
> 2.43.0.429.g432eaa2c6b-goog
>
--
Michal Hocko
SUSE Labs
On Tue, Jan 23, 2024 at 1:33 AM Michal Hocko <[email protected]> wrote:
>
> On Sun 21-01-24 21:44:12, T.J. Mercier wrote:
> > This reverts commit 0388536ac29104a478c79b3869541524caec28eb.
> >
> > Proactive reclaim on the root cgroup is 10x slower after this patch when
> > MGLRU is enabled, and completion times for proactive reclaim on much
> > smaller non-root cgroups take ~30% longer (with or without MGLRU).
>
> What is the reclaim target in these pro-active reclaim requests?
Two targets:
1) /sys/fs/cgroup/memory.reclaim
2) /sys/fs/cgroup/uid_0/memory.reclaim (a bunch of Android system services)
Note that lru_gen_shrink_node is used for 1, but shrink_node_memcgs is
used for 2.
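For reference, that split comes from the dispatch in shrink_node(); roughly
(paraphrasing mm/vmscan.c, where the guard has been spelled
global_reclaim()/root_reclaim() depending on kernel version):

	static void shrink_node(pg_data_t *pgdat, struct scan_control *sc)
	{
		/* ... */
		/* MGLRU path: only taken for root/global reclaim, i.e. case 1 */
		if (lru_gen_enabled() && root_reclaim(sc)) {
			lru_gen_shrink_node(pgdat, sc);
			return;
		}
		/* ... */
		/* everything else, including memcg reclaim (case 2), walks memcgs */
		shrink_node_memcgs(pgdat, sc);
		/* ... */
	}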
The 10x comes from the rate of reclaim (~70k pages/sec vs ~6.6k
pages/sec) for 1. After this revert the root reclaim took only about
10 seconds. Before the revert it's still running after about 3 minutes
using a core at 100% the whole time, and I'm too impatient to wait
longer to record times for comparison.
The 30% comes from the average of a few runs for 2:
Before revert:
$ adb wait-for-device && sleep 120 && adb root && adb shell -t 'time echo "" > /sys/fs/cgroup/uid_0/memory.reclaim'
restarting adbd as root
0m09.69s real 0m00.00s user 0m09.19s system
After revert:
$ adb wait-for-device && sleep 120 && adb root && adb shell -t 'time echo "" > /sys/fs/cgroup/uid_0/memory.reclaim'
0m07.31s real 0m00.00s user 0m06.44s system
It's actually a bigger difference for smaller reclaim amounts:
Before revert:
$ adb wait-for-device && sleep 120 && adb root && adb shell -t 'time echo "3G" > /sys/fs/cgroup/uid_0/memory.reclaim'
0m12.04s real 0m00.00s user 0m11.48s system
After revert:
$ adb wait-for-device && sleep 120 && adb root && adb shell -t 'time echo "3G" > /sys/fs/cgroup/uid_0/memory.reclaim'
0m06.65s real 0m00.00s user 0m05.91s system
> > With
> > root reclaim before the patch, I observe average reclaim rates of
> > ~70k pages/sec before try_to_free_mem_cgroup_pages starts to fail and
> > the nr_retries counter starts to decrement, eventually ending the
> > proactive reclaim attempt.
>
> Do I understand correctly that the reclaim target is overestimated and
> you expect that the reclaim process breaks out early?
Yes. I expect memory_reclaim to fail at some point when it becomes
difficult/impossible to reclaim pages where I specify a large amount
to reclaim. The ask here is, "please reclaim as much as possible from
this cgroup, but don't take all day". But it takes minutes to get
there on the root cgroup, working SWAP_CLUSTER_MAX pages at a time.
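For reference, the loop this all happens in is roughly the following
(paraphrased from mm/memcontrol.c::memory_reclaim(); details elided):

	while (nr_reclaimed < nr_to_reclaim) {
		unsigned long reclaimed;

		if (signal_pending(current))
			return -EINTR;

		/* on the final retry, drain pcpu LRU caches to expose more evictable pages */
		if (!nr_retries)
			lru_add_drain_all();

		reclaimed = try_to_free_mem_cgroup_pages(memcg,
					min(nr_to_reclaim - nr_reclaimed, SWAP_CLUSTER_MAX),
					GFP_KERNEL, reclaim_options);

		/* a retry is only consumed when a call reclaims nothing at all */
		if (!reclaimed && !nr_retries--)
			return -EAGAIN;

		nr_reclaimed += reclaimed;
	}

So as long as each 32-page call frees something, nr_retries never moves and
the only exits are reaching nr_to_reclaim or catching a signal.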
> > After the patch the reclaim rate is
> > consistently ~6.6k pages/sec due to the reduced nr_pages value causing
> > scan aborts as soon as SWAP_CLUSTER_MAX pages are reclaimed. The
> > proactive reclaim doesn't complete after several minutes because
> > try_to_free_mem_cgroup_pages is still capable of reclaiming pages in
> > tiny SWAP_CLUSTER_MAX page chunks and nr_retries is never decremented.
>
> I do not understand this part. How does a smaller reclaim target manage
> to have reclaimed > 0 while a larger one doesn't?
They both are able to make progress. The main difference is that a
single iteration of try_to_free_mem_cgroup_pages with MGLRU ends soon
after it reclaims nr_to_reclaim, and before it touches all memcgs. So
a single iteration really will reclaim only about SWAP_CLUSTER_MAX-ish
pages with MGLRU. Without MGLRU the memcg walk is not aborted
immediately after nr_to_reclaim is reached, so a single call to
try_to_free_mem_cgroup_pages can actually reclaim thousands of pages
even when sc->nr_to_reclaim is 32. (i.e., MGLRU overreclaims less.)
https://lore.kernel.org/lkml/[email protected]/
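Schematically (not literal kernel code; paraphrasing mm/vmscan.c), the
difference in loop shape is:

	/* MGLRU global reclaim, lru_gen_shrink_node() -> shrink_many(): */
	while (/* next memcg lruvec on the node's memcg LRU */) {
		shrink_one(lruvec, sc);
		if (sc->nr_reclaimed >= sc->nr_to_reclaim)
			break;	/* the memcg walk stops as soon as the target is met */
	}

	/* Conventional memcg reclaim, shrink_node_memcgs(): */
	memcg = mem_cgroup_iter(target_memcg, NULL, NULL);
	do {
		shrink_lruvec(lruvec, sc);
		shrink_slab(sc->gfp_mask, pgdat->node_id, memcg, sc->priority);
		/* no early break for memcg reclaim; the whole subtree is visited */
	} while ((memcg = mem_cgroup_iter(target_memcg, memcg, NULL)));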
On Tue 23-01-24 05:58:05, T.J. Mercier wrote:
> On Tue, Jan 23, 2024 at 1:33 AM Michal Hocko <[email protected]> wrote:
> >
> > On Sun 21-01-24 21:44:12, T.J. Mercier wrote:
> > > This reverts commit 0388536ac29104a478c79b3869541524caec28eb.
> > >
> > > Proactive reclaim on the root cgroup is 10x slower after this patch when
> > > MGLRU is enabled, and completion times for proactive reclaim on much
> > > smaller non-root cgroups take ~30% longer (with or without MGLRU).
> >
> > What is the reclaim target in these pro-active reclaim requests?
>
> Two targets:
> 1) /sys/fs/cgroup/memory.reclaim
> 2) /sys/fs/cgroup/uid_0/memory.reclaim (a bunch of Android system services)
OK, I was not really clear. I was curious about nr_to_reclaim.
> Note that lru_gen_shrink_node is used for 1, but shrink_node_memcgs is
> used for 2.
>
> The 10x comes from the rate of reclaim (~70k pages/sec vs ~6.6k
> pages/sec) for 1. After this revert the root reclaim took only about
> 10 seconds. Before the revert it's still running after about 3 minutes
> using a core at 100% the whole time, and I'm too impatient to wait
> longer to record times for comparison.
>
> The 30% comes from the average of a few runs for 2:
> Before revert:
> $ adb wait-for-device && sleep 120 && adb root && adb shell -t 'time echo "" > /sys/fs/cgroup/uid_0/memory.reclaim'
Ohh, so you want to reclaim all of it (resp. as much as possible).
[...]
> > > After the patch the reclaim rate is
> > > consistently ~6.6k pages/sec due to the reduced nr_pages value causing
> > > scan aborts as soon as SWAP_CLUSTER_MAX pages are reclaimed. The
> > > proactive reclaim doesn't complete after several minutes because
> > > try_to_free_mem_cgroup_pages is still capable of reclaiming pages in
> > > tiny SWAP_CLUSTER_MAX page chunks and nr_retries is never decremented.
> >
> > I do not understand this part. How does a smaller reclaim target manage
> > to have reclaimed > 0 while a larger one doesn't?
>
> They both are able to make progress. The main difference is that a
> single iteration of try_to_free_mem_cgroup_pages with MGLRU ends soon
> after it reclaims nr_to_reclaim, and before it touches all memcgs. So
> a single iteration really will reclaim only about SWAP_CLUSTER_MAX-ish
> pages with MGLRU. Without MGLRU the memcg walk is not aborted
> immediately after nr_to_reclaim is reached, so a single call to
> try_to_free_mem_cgroup_pages can actually reclaim thousands of pages
> even when sc->nr_to_reclaim is 32. (i.e., MGLRU overreclaims less.)
> https://lore.kernel.org/lkml/[email protected]/
OK, I do see how try_to_free_mem_cgroup_pages might over reclaim but I
do not really follow how increasing the batch actually fixes the issue
that there is always progress being made and therefore memory_reclaim
takes ages to terminate?
--
Michal Hocko
SUSE Labs
The revert isn't a straight-forward solution.
The patch you're reverting fixed conventional reclaim and broke
MGLRU. Your revert fixes MGLRU and breaks conventional reclaim.
On Tue, Jan 23, 2024 at 05:58:05AM -0800, T.J. Mercier wrote:
> They both are able to make progress. The main difference is that a
> single iteration of try_to_free_mem_cgroup_pages with MGLRU ends soon
> after it reclaims nr_to_reclaim, and before it touches all memcgs. So
> a single iteration really will reclaim only about SWAP_CLUSTER_MAX-ish
> pages with MGLRU. Without MGLRU the memcg walk is not aborted
> immediately after nr_to_reclaim is reached, so a single call to
> try_to_free_mem_cgroup_pages can actually reclaim thousands of pages
> even when sc->nr_to_reclaim is 32. (i.e., MGLRU overreclaims less.)
> https://lore.kernel.org/lkml/[email protected]/
Is that a feature or a bug?
* 1. Memcg LRU only applies to global reclaim, and the round-robin incrementing
* of their max_seq counters ensures the eventual fairness to all eligible
* memcgs. For memcg reclaim, it still relies on mem_cgroup_iter().
If it bails out exactly after nr_to_reclaim, it'll overreclaim
less. But with steady reclaim in a complex subtree, it will always hit
the first cgroup returned by mem_cgroup_iter() and then bail. This
seems like a fairness issue.
We should figure out what the right method for balancing fairness with
overreclaim is, regardless of reclaim implementation. Because having
two different approaches and reverting dependent things back and forth
doesn't make sense.
Using an LRU to rotate through memcgs over multiple reclaim cycles
seems like a good idea. Why is this specific to MGLRU? Shouldn't this
be a generic piece of memcg infrastructure?
Then there is the question of why there is an LRU for global reclaim,
but not for subtree reclaim. Reclaiming a container with multiple
subtrees would benefit from the fairness provided by a container-level
LRU order just as much; having fairness for root but not for subtrees
would produce different reclaim and pressure behavior, and can cause
regressions when moving a service from bare-metal into a container.
Figuring out these differences and converging on a method for cgroup
fairness would be the better way of fixing this. Because of the
regression risk to the default reclaim implementation, I'm inclined to
NAK this revert.
On Tue 23-01-24 11:48:19, Johannes Weiner wrote:
> The revert isn't a straight-forward solution.
>
> The patch you're reverting fixed conventional reclaim and broke
> MGLRU. Your revert fixes MGLRU and breaks conventional reclaim.
>
> On Tue, Jan 23, 2024 at 05:58:05AM -0800, T.J. Mercier wrote:
> > They both are able to make progress. The main difference is that a
> > single iteration of try_to_free_mem_cgroup_pages with MGLRU ends soon
> > after it reclaims nr_to_reclaim, and before it touches all memcgs. So
> > a single iteration really will reclaim only about SWAP_CLUSTER_MAX-ish
> > pages with MGLRU. Without MGLRU the memcg walk is not aborted
> > immediately after nr_to_reclaim is reached, so a single call to
> > try_to_free_mem_cgroup_pages can actually reclaim thousands of pages
> > even when sc->nr_to_reclaim is 32. (i.e., MGLRU overreclaims less.)
> > https://lore.kernel.org/lkml/[email protected]/
>
> Is that a feature or a bug?
>
> * 1. Memcg LRU only applies to global reclaim, and the round-robin incrementing
> * of their max_seq counters ensures the eventual fairness to all eligible
> * memcgs. For memcg reclaim, it still relies on mem_cgroup_iter().
>
> If it bails out exactly after nr_to_reclaim, it'll overreclaim
> less. But with steady reclaim in a complex subtree, it will always hit
> the first cgroup returned by mem_cgroup_iter() and then bail. This
> seems like a fairness issue.
Agreed. We would need to re-introduce something like we used to have
before 1ba6fc9af35bf.
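(For reference, the early exit removed by that commit looked roughly like
this in the shrink_node() memcg loop -- quoted from memory, so the exact
condition may differ:)

	/* limit reclaim stops the hierarchy walk once the target is met */
	if (!current_is_kswapd() &&
	    sc->nr_reclaimed >= sc->nr_to_reclaim) {
		mem_cgroup_iter_break(root, memcg);
		break;
	}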
> We should figure out what the right method for balancing fairness with
> overreclaim is, regardless of reclaim implementation. Because having
> two different approaches and reverting dependent things back and forth
> doesn't make sense.
Absolutely agreed!
> Using an LRU to rotate through memcgs over multiple reclaim cycles
> seems like a good idea. Why is this specific to MGLRU? Shouldn't this
> be a generic piece of memcg infrastructure?
>
> Then there is the question of why there is an LRU for global reclaim,
> but not for subtree reclaim. Reclaiming a container with multiple
> subtrees would benefit from the fairness provided by a container-level
> LRU order just as much; having fairness for root but not for subtrees
> would produce different reclaim and pressure behavior, and can cause
> regressions when moving a service from bare-metal into a container.
>
> Figuring out these differences and converging on a method for cgroup
> fairness would be the better way of fixing this. Because of the
> regression risk to the default reclaim implementation, I'm inclined to
> NAK this revert.
I do agree that a simple revert doesn't seem to be the way to go.
--
Michal Hocko
SUSE Labs
On Tue, Jan 23, 2024 at 8:19 AM Michal Hocko <[email protected]> wrote:
>
> On Tue 23-01-24 05:58:05, T.J. Mercier wrote:
> > On Tue, Jan 23, 2024 at 1:33 AM Michal Hocko <[email protected]> wrote:
> > >
> > > On Sun 21-01-24 21:44:12, T.J. Mercier wrote:
> > > > This reverts commit 0388536ac29104a478c79b3869541524caec28eb.
> > > >
> > > > Proactive reclaim on the root cgroup is 10x slower after this patch when
> > > > MGLRU is enabled, and completion times for proactive reclaim on much
> > > > smaller non-root cgroups take ~30% longer (with or without MGLRU).
> > >
> > > What is the reclaim target in these pro-active reclaim requests?
> >
> > Two targets:
> > 1) /sys/fs/cgroup/memory.reclaim
> > 2) /sys/fs/cgroup/uid_0/memory.reclaim (a bunch of Android system services)
>
> OK, I was not really clear. I was curious about nr_to_reclaim.
>
> > Note that lru_gen_shrink_node is used for 1, but shrink_node_memcgs is
> > used for 2.
> >
> > The 10x comes from the rate of reclaim (~70k pages/sec vs ~6.6k
> > pages/sec) for 1. After this revert the root reclaim took only about
> > 10 seconds. Before the revert it's still running after about 3 minutes
> > using a core at 100% the whole time, and I'm too impatient to wait
> > longer to record times for comparison.
> >
> > The 30% comes from the average of a few runs for 2:
> > Before revert:
> > $ adb wait-for-device && sleep 120 && adb root && adb shell -t 'time echo "" > /sys/fs/cgroup/uid_0/memory.reclaim'
>
> Ohh, so you want to reclaim all of it (resp. as much as possible).
>
Right, the main use-case here is we decide an application should be
backgrounded and its cgroup frozen. Before freezing, reclaim as much
as possible so that the frozen processes' RAM use is as low as
possible while they're dormant.
> [...]
>
> > > > After the patch the reclaim rate is
> > > > consistently ~6.6k pages/sec due to the reduced nr_pages value causing
> > > > scan aborts as soon as SWAP_CLUSTER_MAX pages are reclaimed. The
> > > > proactive reclaim doesn't complete after several minutes because
> > > > try_to_free_mem_cgroup_pages is still capable of reclaiming pages in
> > > > tiny SWAP_CLUSTER_MAX page chunks and nr_retries is never decremented.
> > >
> > > I do not understand this part. How does a smaller reclaim target manage
> > > to have reclaimed > 0 while a larger one doesn't?
> >
> > They both are able to make progress. The main difference is that a
> > single iteration of try_to_free_mem_cgroup_pages with MGLRU ends soon
> > after it reclaims nr_to_reclaim, and before it touches all memcgs. So
> > a single iteration really will reclaim only about SWAP_CLUSTER_MAX-ish
> > pages with MGLRU. Without MGLRU the memcg walk is not aborted
> > immediately after nr_to_reclaim is reached, so a single call to
> > try_to_free_mem_cgroup_pages can actually reclaim thousands of pages
> > even when sc->nr_to_reclaim is 32. (i.e., MGLRU overreclaims less.)
> > https://lore.kernel.org/lkml/[email protected]/
>
> OK, I do see how try_to_free_mem_cgroup_pages might over reclaim but I
> do not really follow how increasing the batch actually fixes the issue
> that there is always progress being made and therefore memory_reclaim
> takes ages to terminate?
Oh, because the page reclaim rate with a small batch is just much
lower than with a very large batch. We have to restart reclaim from
fresh each time a batch is completed before we get to a place where
we're actually freeing/swapping pages again. That setup cost is
amortized over many more pages with a large batch size, but appears to
be pretty significant for small batch sizes.
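To put rough numbers on it (assuming 4 KiB pages): a 1 GiB request is
262,144 pages, which with a fixed SWAP_CLUSTER_MAX = 32 batch means on the
order of 8,192 calls to try_to_free_mem_cgroup_pages, each paying that setup
cost again, versus a single call with the pre-patch behavior.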
On Tue, Jan 23, 2024 at 8:48 AM Johannes Weiner <[email protected]> wrote:
>
> The revert isn't a straight-forward solution.
>
> The patch you're reverting fixed conventional reclaim and broke
> MGLRU. Your revert fixes MGLRU and breaks conventional reclaim.
>
> On Tue, Jan 23, 2024 at 05:58:05AM -0800, T.J. Mercier wrote:
> > They both are able to make progress. The main difference is that a
> > single iteration of try_to_free_mem_cgroup_pages with MGLRU ends soon
> > after it reclaims nr_to_reclaim, and before it touches all memcgs. So
> > a single iteration really will reclaim only about SWAP_CLUSTER_MAX-ish
> > pages with MGLRU. Without MGLRU the memcg walk is not aborted
> > immediately after nr_to_reclaim is reached, so a single call to
> > try_to_free_mem_cgroup_pages can actually reclaim thousands of pages
> > even when sc->nr_to_reclaim is 32. (i.e., MGLRU overreclaims less.)
> > https://lore.kernel.org/lkml/[email protected]/
>
> Is that a feature or a bug?
Feature!
> * 1. Memcg LRU only applies to global reclaim, and the round-robin incrementing
> * of their max_seq counters ensures the eventual fairness to all eligible
> * memcgs. For memcg reclaim, it still relies on mem_cgroup_iter().
>
> If it bails out exactly after nr_to_reclaim, it'll overreclaim
> less. But with steady reclaim in a complex subtree, it will always hit
> the first cgroup returned by mem_cgroup_iter() and then bail. This
> seems like a fairness issue.
Right. Because the memcg LRU is maintained in pg_data_t and not in
each cgroup, I think we are currently forced to have the iteration
across all child memcgs for non-root memcg reclaim for fairness.
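(For reference, paraphrasing include/linux/mmzone.h with fields abbreviated --
the list that shrink_many() walks hangs off the node, not the cgroup:)

	typedef struct pglist_data {
		/* ... */
	#ifdef CONFIG_LRU_GEN
		/* per-node memcg LRU used by MGLRU global reclaim */
		struct lru_gen_memcg memcg_lru;
	#endif
		/* ... */
	} pg_data_t;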
> We should figure out what the right method for balancing fairness with
> overreclaim is, regardless of reclaim implementation. Because having
> two different approaches and reverting dependent things back and forth
> doesn't make sense.
>
> Using an LRU to rotate through memcgs over multiple reclaim cycles
> seems like a good idea. Why is this specific to MGLRU? Shouldn't this
> be a generic piece of memcg infrastructure?
It would be pretty sweet if it were. I haven't tried to measure this
part in isolation, but I know we had to abandon attempts to use
per-app memcgs in the past (2018?) because the perf overhead was too
much. In recent tests where this feature is used, I see some perf
gains which I think are probably attributable to this.
> Then there is the question of why there is an LRU for global reclaim,
> but not for subtree reclaim. Reclaiming a container with multiple
> subtrees would benefit from the fairness provided by a container-level
> LRU order just as much; having fairness for root but not for subtrees
> would produce different reclaim and pressure behavior, and can cause
> regressions when moving a service from bare-metal into a container.
>
> Figuring out these differences and converging on a method for cgroup
> fairness would be the better way of fixing this. Because of the
> regression risk to the default reclaim implementation, I'm inclined to
> NAK this revert.
In the meantime, instead of a revert how about changing the batch size
geometrically instead of the SWAP_CLUSTER_MAX constant:
		reclaimed = try_to_free_mem_cgroup_pages(memcg,
-					min(nr_to_reclaim - nr_reclaimed, SWAP_CLUSTER_MAX),
+					(nr_to_reclaim - nr_reclaimed)/2,
					GFP_KERNEL, reclaim_options);
I think that should address the overreclaim concern (it was mentioned
that the upper bound of overreclaim was 2 * request), and this should
also increase the reclaim rate for root reclaim with MGLRU closer to
what it was before.
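For a sense of the convergence (assuming each call reclaims roughly what it
was asked for): a 1 GiB request is 262,144 pages, and halving the remainder
each call drops it below SWAP_CLUSTER_MAX after about log2(262144/32) = 13
calls, plus a final clamped call or two, instead of ~8,192 fixed 32-page
calls.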
On Wed, Jan 24, 2024 at 09:46:23AM -0800, T.J. Mercier wrote:
> In the meantime, instead of a revert how about changing the batch size
> geometrically instead of the SWAP_CLUSTER_MAX constant:
>
> 		reclaimed = try_to_free_mem_cgroup_pages(memcg,
> -					min(nr_to_reclaim - nr_reclaimed, SWAP_CLUSTER_MAX),
> +					(nr_to_reclaim - nr_reclaimed)/2,
> 					GFP_KERNEL, reclaim_options);
>
> I think that should address the overreclaim concern (it was mentioned
> that the upper bound of overreclaim was 2 * request), and this should
> also increase the reclaim rate for root reclaim with MGLRU closer to
> what it was before.
Hahaha. Would /4 work for you?
I genuinely think the idea is worth a shot. /4 would give us a bit
more margin for error, since the bailout/fairness cutoffs have changed
back and forth over time. And it should still give you a reasonable
convergence on MGLRU.
try_to_free_mem_cgroup_pages() already does max(nr_to_reclaim,
SWAP_CLUSTER_MAX) which will avoid the painful final approach loops
the integer division would produce on its own.
Please add a comment mentioning the compromise between the two reclaim
implementations though.
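Concretely (SWAP_CLUSTER_MAX is 32): with 120 pages left to reclaim,
120/4 = 30 gets raised to 32; with 3 pages left, 3/4 = 0 still gets raised
to 32. So the loop keeps requesting forward progress near the target instead
of issuing ever-smaller (eventually zero-sized) requests.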
On Fri, Jan 26, 2024 at 8:34 AM Johannes Weiner <[email protected]> wrote:
>
> On Wed, Jan 24, 2024 at 09:46:23AM -0800, T.J. Mercier wrote:
> > In the meantime, instead of a revert how about changing the batch size
> > geometrically instead of the SWAP_CLUSTER_MAX constant:
> >
> > 		reclaimed = try_to_free_mem_cgroup_pages(memcg,
> > -					min(nr_to_reclaim - nr_reclaimed, SWAP_CLUSTER_MAX),
> > +					(nr_to_reclaim - nr_reclaimed)/2,
> > 					GFP_KERNEL, reclaim_options);
> >
> > I think that should address the overreclaim concern (it was mentioned
> > that the upper bound of overreclaim was 2 * request), and this should
> > also increase the reclaim rate for root reclaim with MGLRU closer to
> > what it was before.
>
> Hahaha. Would /4 work for you?
>
> I genuinely think the idea is worth a shot. /4 would give us a bit
> more margin for error, since the bailout/fairness cutoffs have changed
> back and forth over time. And it should still give you a reasonable
> convergence on MGLRU.
>
> try_to_free_mem_cgroup_pages() already does max(nr_to_reclaim,
> SWAP_CLUSTER_MAX) which will avoid the painful final approach loops
> the integer division would produce on its own.
>
> Please add a comment mentioning the compromise between the two reclaim
> implementations though.
I'll try it out and get back to you. :)
On Tue, Jan 23, 2024 at 9:48 AM Johannes Weiner <[email protected]> wrote:
>
> The revert isn't a straight-forward solution.
>
> The patch you're reverting fixed conventional reclaim and broke
> MGLRU. Your revert fixes MGLRU and breaks conventional reclaim.
This is not true -- the patch being reverted regressed the active/inactive
LRU too, in terms of execution time.
Quoting the commit message: "completion times for proactive reclaim on
much smaller non-root cgroups take ~30% longer (with or without
MGLRU)."
And I wouldn't call the original patch a fix -- it shifted the problem
from space to time, which at best is a tradeoff.
> On Tue, Jan 23, 2024 at 05:58:05AM -0800, T.J. Mercier wrote:
> > They both are able to make progress. The main difference is that a
> > single iteration of try_to_free_mem_cgroup_pages with MGLRU ends soon
> > after it reclaims nr_to_reclaim, and before it touches all memcgs. So
> > a single iteration really will reclaim only about SWAP_CLUSTER_MAX-ish
> > pages with MGLRU. Without MGLRU the memcg walk is not aborted
> > immediately after nr_to_reclaim is reached, so a single call to
> > try_to_free_mem_cgroup_pages can actually reclaim thousands of pages
> > even when sc->nr_to_reclaim is 32. (i.e., MGLRU overreclaims less.)
> > https://lore.kernel.org/lkml/[email protected]/
>
> Is that a feature or a bug?
>
> * 1. Memcg LRU only applies to global reclaim, and the round-robin incrementing
> * of their max_seq counters ensures the eventual fairness to all eligible
> * memcgs. For memcg reclaim, it still relies on mem_cgroup_iter().
>
> If it bails out exactly after nr_to_reclaim, it'll overreclaim
> less. But with steady reclaim in a complex subtree, it will always hit
> the first cgroup returned by mem_cgroup_iter() and then bail. This
> seems like a fairness issue.
>
> We should figure out what the right method for balancing fairness with
> overreclaim is, regardless of reclaim implementation. Because having
> two different approaches and reverting dependent things back and forth
> doesn't make sense.
>
> Using an LRU to rotate through memcgs over multiple reclaim cycles
> seems like a good idea. Why is this specific to MGLRU? Shouldn't this
> be a generic piece of memcg infrastructure?
>
> Then there is the question of why there is an LRU for global reclaim,
> but not for subtree reclaim. Reclaiming a container with multiple
> subtrees would benefit from the fairness provided by a container-level
> LRU order just as much; having fairness for root but not for subtrees
> would produce different reclaim and pressure behavior, and can cause
> regressions when moving a service from bare-metal into a container.
>
> Figuring out these differences and converging on a method for cgroup
> fairness would be the better way of fixing this. Because of the
> regression risk to the default reclaim implementation, I'm inclined to
> NAK this revert.
On Fri, Jan 26, 2024 at 8:41 AM T.J. Mercier <[email protected]> wrote:
>
> On Fri, Jan 26, 2024 at 8:34 AM Johannes Weiner <[email protected]> wrote:
> >
> > On Wed, Jan 24, 2024 at 09:46:23AM -0800, T.J. Mercier wrote:
> > > In the meantime, instead of a revert how about changing the batch size
> > > geometrically instead of the SWAP_CLUSTER_MAX constant:
> > >
> > > 		reclaimed = try_to_free_mem_cgroup_pages(memcg,
> > > -					min(nr_to_reclaim - nr_reclaimed, SWAP_CLUSTER_MAX),
> > > +					(nr_to_reclaim - nr_reclaimed)/2,
> > > 					GFP_KERNEL, reclaim_options);
> > >
> > > I think that should address the overreclaim concern (it was mentioned
> > > that the upper bound of overreclaim was 2 * request), and this should
> > > also increase the reclaim rate for root reclaim with MGLRU closer to
> > > what it was before.
> >
> > Hahaha. Would /4 work for you?
> >
> > I genuinely think the idea is worth a shot. /4 would give us a bit
> > more margin for error, since the bailout/fairness cutoffs have changed
> > back and forth over time. And it should still give you a reasonable
> > convergence on MGLRU.
> >
> > try_to_free_mem_cgroup_pages() already does max(nr_to_reclaim,
> > SWAP_CLUSTER_MAX) which will avoid the painful final approach loops
> > the integer division would produce on its own.
> >
> > Please add a comment mentioning the compromise between the two reclaim
> > implementations though.
>
> I'll try it out and get back to you. :)
Right, so (nr_to_reclaim - nr_reclaimed)/4 looks pretty good to me:
root - full reclaim       pages/sec   time (sec)
pre-0388536ac291      :   68047       10.46
post-0388536ac291     :   13742       inf
(reclaim-reclaimed)/4 :   67352       10.51

/uid_0 - 1G reclaim       pages/sec   time (sec)   overreclaim (MiB)
pre-0388536ac291      :   258822      1.12         107.8
post-0388536ac291     :   105174      2.49         3.5
(reclaim-reclaimed)/4 :   233396      1.12         -7.4

/uid_0 - full reclaim     pages/sec   time (sec)
pre-0388536ac291      :   72334       7.09
post-0388536ac291     :   38105       14.45
(reclaim-reclaimed)/4 :   72914       6.96
So I'll put up a new patch.
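For concreteness, a sketch of what that might look like (hypothetical -- the
actual patch, including the comment wording Johannes asked for, may differ):

		lru_add_drain_all();

		/*
		 * Compromise between the two reclaim implementations: ask for
		 * a quarter of what is left so MGLRU's early bailout doesn't
		 * crawl along in SWAP_CLUSTER_MAX chunks, while bounding how
		 * far the conventional memcg walk can overshoot the target.
		 * try_to_free_mem_cgroup_pages() raises smaller requests back
		 * up to SWAP_CLUSTER_MAX, so this still converges.
		 */
		reclaimed = try_to_free_mem_cgroup_pages(memcg,
					(nr_to_reclaim - nr_reclaimed) / 4,
					GFP_KERNEL, reclaim_options);

		if (!reclaimed && !nr_retries--)
			return -EAGAIN;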
On Tue, Jan 30, 2024 at 12:58:12PM -0800, T.J. Mercier wrote:
> On Fri, Jan 26, 2024 at 8:41 AM T.J. Mercier <[email protected]> wrote:
> >
> > On Fri, Jan 26, 2024 at 8:34 AM Johannes Weiner <[email protected]> wrote:
> > >
> > > On Wed, Jan 24, 2024 at 09:46:23AM -0800, T.J. Mercier wrote:
> > > > In the meantime, instead of a revert how about changing the batch size
> > > > geometrically instead of the SWAP_CLUSTER_MAX constant:
> > > >
> > > > 		reclaimed = try_to_free_mem_cgroup_pages(memcg,
> > > > -					min(nr_to_reclaim - nr_reclaimed, SWAP_CLUSTER_MAX),
> > > > +					(nr_to_reclaim - nr_reclaimed)/2,
> > > > 					GFP_KERNEL, reclaim_options);
> > > >
> > > > I think that should address the overreclaim concern (it was mentioned
> > > > that the upper bound of overreclaim was 2 * request), and this should
> > > > also increase the reclaim rate for root reclaim with MGLRU closer to
> > > > what it was before.
> > >
> > > Hahaha. Would /4 work for you?
> > >
> > > I genuinely think the idea is worth a shot. /4 would give us a bit
> > > more margin for error, since the bailout/fairness cutoffs have changed
> > > back and forth over time. And it should still give you a reasonable
> > > convergence on MGLRU.
> > >
> > > try_to_free_mem_cgroup_pages() already does max(nr_to_reclaim,
> > > SWAP_CLUSTER_MAX) which will avoid the painful final approach loops
> > > the integer division would produce on its own.
> > >
> > > Please add a comment mentioning the compromise between the two reclaim
> > > implementations though.
> >
> > I'll try it out and get back to you. :)
>
> Right, so (nr_to_reclaim - nr_reclaimed)/4 looks pretty good to me:
>
> root - full reclaim       pages/sec   time (sec)
> pre-0388536ac291      :   68047       10.46
> post-0388536ac291     :   13742       inf
> (reclaim-reclaimed)/4 :   67352       10.51
>
> /uid_0 - 1G reclaim       pages/sec   time (sec)   overreclaim (MiB)
> pre-0388536ac291      :   258822      1.12         107.8
> post-0388536ac291     :   105174      2.49         3.5
> (reclaim-reclaimed)/4 :   233396      1.12         -7.4
>
> /uid_0 - full reclaim     pages/sec   time (sec)
> pre-0388536ac291      :   72334       7.09
> post-0388536ac291     :   38105       14.45
> (reclaim-reclaimed)/4 :   72914       6.96
>
> So I'll put up a new patch.
That looks great, thanks for giving it a shot.
Looking forward to your patch.