2022-11-22 01:17:12

by Ivan Babrou

Subject: Low TCP throughput due to vmpressure with swap enabled

Hello,

We have observed a negative TCP throughput behavior from the following commit:

* 8e8ae645249b mm: memcontrol: hook up vmpressure to socket pressure

It landed back in 2016 in v4.5, so it's not exactly a new issue.

The crux of the issue is that in some cases with swap present the
workload can be unfairly throttled in terms of TCP throughput.

I am able to reproduce this issue locally in a VM running v6.1-rc6 with 8
GiB of RAM and zram enabled.

The setup is fairly simple:

1. Run the following go proxy in one cgroup (it has some memory
ballast to simulate useful memory usage):

* https://gist.github.com/bobrik/2c1a8a19b921fefe22caac21fda1be82

sudo systemd-run --scope -p MemoryLimit=6G go run main.go

2. Run the following fio config in another cgroup to simulate mmapped
page cache usage:

[global]
size=8g
bs=256k
iodepth=256
direct=0
ioengine=mmap
group_reporting
time_based
runtime=86400
numjobs=8
name=randread
rw=randread

[job1]
filename=derp

sudo systemd-run --scope fio randread.fio

3. Run curl to request a large file via proxy:

curl -o /dev/null http://localhost:4444

4. Observe low throughput. The exact numbers depend on your setup, but
in my VM the throughput drops from 60MB/s with fio not running to
10MB/s with fio running.

With some perf tracing, I can see that this happens because of the
commit I mentioned:

sudo perf probe --add 'vmpressure:48 memcg->css.cgroup->kn->id scanned
vmpr_scanned=vmpr->scanned reclaimed vmpr_reclaimed=vmpr->reclaimed'
sudo perf probe --add 'vmpressure:72 memcg->css.cgroup->kn->id'

I can record the probes above during curl runtime:

sudo perf record -a -e probe:vmpressure_L48,probe:vmpressure_L72 -- sleep 5

Line 48 allows me to observe scanned and reclaimed page counters, line
72 is the actual throttling.

Here's an example trace showing my go proxy cgroup:

kswapd0 89 [002] 2351.221995: probe:vmpressure_L48: (ffffffed2639dd90)
id=0xf23 scanned=0x140 vmpr_scanned=0x0 reclaimed=0x0
vmpr_reclaimed=0x0
kswapd0 89 [007] 2351.333407: probe:vmpressure_L48: (ffffffed2639dd90)
id=0xf23 scanned=0x2b3 vmpr_scanned=0x140 reclaimed=0x0
vmpr_reclaimed=0x0
kswapd0 89 [007] 2351.333408: probe:vmpressure_L72: (ffffffed2639de2c) id=0xf23

We scanned lots of pages, but weren't able to reclaim anything.
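
These counters are what vmpressure() turns into a pressure level. Below
is a minimal userspace model of that calculation applied to the trace
above (an approximation of mm/vmpressure.c, not the exact kernel code):

#include <stdio.h>

/* Approximation of vmpressure_calc_level(): the share of scanned
 * pages that could not be reclaimed, as a percentage. */
static unsigned long pressure_pct(unsigned long scanned, unsigned long reclaimed)
{
        unsigned long scale = scanned + reclaimed;

        if (!scanned || reclaimed >= scanned)
                return 0;
        return (scale - reclaimed * scale / scanned) * 100 / scale;
}

int main(void)
{
        /* counters from the kswapd trace above: scanned=0x2b3, reclaimed=0 */
        printf("pressure = %lu%%\n", pressure_pct(0x2b3, 0)); /* prints 100 */
        return 0;
}

With reclaimed=0 this comes out at 100%, i.e. "critical" (the thresholds
are 60 for medium and 95 for critical), and any level above "low" is what
makes the commit in question assert memcg->socket_pressure.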

When throttling happens, it's in tcp_prune_queue, where rcv_ssthresh
(TCP window clamp) is set to 4 x advmss:

* https://elixir.bootlin.com/linux/v5.15.76/source/net/ipv4/tcp_input.c#L5373

else if (tcp_under_memory_pressure(sk))
tp->rcv_ssthresh = min(tp->rcv_ssthresh, 4U * tp->advmss);

I can see plenty of memory available in both my go proxy cgroup and in
the system in general:

$ free -h
               total        used        free      shared  buff/cache   available
Mem:           7.8Gi       4.3Gi       104Mi       0.0Ki       3.3Gi       3.3Gi
Swap:           11Gi       242Mi        11Gi

It just so happens that all of the memory is hot and is not eligible
to be reclaimed. Since swap is enabled, the memory is still eligible
to be scanned. If swap is disabled, then my go proxy is not eligible
for scanning anymore (all memory is anonymous, nowhere to reclaim it),
so the whole issue goes away.

Punishing well-behaved programs like that doesn't seem fair. We saw
production bare-metal machines with 200GB of page cache out of 384GB of
RAM, where a well-behaved proxy using 60GB of RAM + 15GB of swap is
throttled like that. The fact that it only happens with swap makes it
extra weird.

I'm not really sure what to do with this. From our end we'll probably
just pass cgroup.memory=nosocket on the kernel command line to disable
this behavior altogether, since it's not like we're running out of TCP
memory (and we can deal with that better if it ever comes to that).
There should probably be a better general-case solution.

I don't know how widespread this issue is. Triggering it requires
enough page cache pressure that reclaim starts scanning anonymous
memory.

Either way, this seems like a bit of a landmine.


2022-11-22 18:16:05

by Eric Dumazet

Subject: Re: Low TCP throughput due to vmpressure with swap enabled

On Mon, Nov 21, 2022 at 4:53 PM Ivan Babrou <[email protected]> wrote:
>
> Hello,
>
> We have observed a negative TCP throughput behavior from the following commit:
>
> * 8e8ae645249b mm: memcontrol: hook up vmpressure to socket pressure
>
> It landed back in 2016 in v4.5, so it's not exactly a new issue.
>
> The crux of the issue is that in some cases with swap present the
> workload can be unfairly throttled in terms of TCP throughput.

I guess defining 'fairness' in such a scenario is nearly impossible.

Have you tried changing /proc/sys/net/ipv4/tcp_rmem (and/or tcp_wmem) ?
Defaults are quite conservative.
If for your workload you want to ensure a minimum amount of memory per
TCP socket, that might be good enough.

Of course, if your proxy has to deal with millions of concurrent TCP
sockets, I fear this is not an option.

>
> I am able to reproduce this issue in a VM locally on v6.1-rc6 with 8
> GiB of RAM with zram enabled.
>
> The setup is fairly simple:
>
> 1. Run the following go proxy in one cgroup (it has some memory
> ballast to simulate useful memory usage):
>
> * https://gist.github.com/bobrik/2c1a8a19b921fefe22caac21fda1be82
>
> sudo systemd-run --scope -p MemoryLimit=6G go run main.go
>
> 2. Run the following fio config in another cgroup to simulate mmapped
> page cache usage:
>
> [global]
> size=8g
> bs=256k
> iodepth=256
> direct=0
> ioengine=mmap
> group_reporting
> time_based
> runtime=86400
> numjobs=8
> name=randread
> rw=randread
>
> [job1]
> filename=derp
>
> sudo systemd-run --scope fio randread.fio
>
> 3. Run curl to request a large file via proxy:
>
> curl -o /dev/null http://localhost:4444
>
> 4. Observe low throughput. The numbers here are dependent on your
> location, but in my VM the throughput drops from 60MB/s to 10MB/s
> depending on whether fio is running or not.
>
> I can see that this happens because of the commit I mentioned with
> some perf tracing:
>
> sudo perf probe --add 'vmpressure:48 memcg->css.cgroup->kn->id scanned
> vmpr_scanned=vmpr->scanned reclaimed vmpr_reclaimed=vmpr->reclaimed'
> sudo perf probe --add 'vmpressure:72 memcg->css.cgroup->kn->id'
>
> I can record the probes above during curl runtime:
>
> sudo perf record -a -e probe:vmpressure_L48,probe:vmpressure_L72 -- sleep 5
>
> Line 48 allows me to observe scanned and reclaimed page counters, line
> 72 is the actual throttling.
>
> Here's an example trace showing my go proxy cgroup:
>
> kswapd0 89 [002] 2351.221995: probe:vmpressure_L48: (ffffffed2639dd90)
> id=0xf23 scanned=0x140 vmpr_scanned=0x0 reclaimed=0x0
> vmpr_reclaimed=0x0
> kswapd0 89 [007] 2351.333407: probe:vmpressure_L48: (ffffffed2639dd90)
> id=0xf23 scanned=0x2b3 vmpr_scanned=0x140 reclaimed=0x0
> vmpr_reclaimed=0x0
> kswapd0 89 [007] 2351.333408: probe:vmpressure_L72: (ffffffed2639de2c) id=0xf23
>
> We scanned lots of pages, but weren't able to reclaim anything.
>
> When throttling happens, it's in tcp_prune_queue, where rcv_ssthresh
> (TCP window clamp) is set to 4 x advmss:
>
> * https://elixir.bootlin.com/linux/v5.15.76/source/net/ipv4/tcp_input.c#L5373
>
> else if (tcp_under_memory_pressure(sk))
> tp->rcv_ssthresh = min(tp->rcv_ssthresh, 4U * tp->advmss);
>
> I can see plenty of memory available in both my go proxy cgroup and in
> the system in general:
>
> $ free -h
> total used free shared buff/cache available
> Mem: 7.8Gi 4.3Gi 104Mi 0.0Ki 3.3Gi 3.3Gi
> Swap: 11Gi 242Mi 11Gi
>
> It just so happens that all of the memory is hot and is not eligible
> to be reclaimed. Since swap is enabled, the memory is still eligible
> to be scanned. If swap is disabled, then my go proxy is not eligible
> for scanning anymore (all memory is anonymous, nowhere to reclaim it),
> so the whole issue goes away.
>
> Punishing well behaving programs like that doesn't seem fair. We saw
> production metals with 200GB page cache out of 384GB of RAM, where a
> well behaved proxy with 60GB of RAM + 15GB of swap is throttled like
> that. The fact that it only happens with swap makes it extra weird.
>
> I'm not really sure what to do with this. From our end we'll probably
> just pass cgroup.memory=nosocket in cmdline to disable this behavior
> altogether, since it's not like we're running out of TCP memory (and
> we can deal with that better if it ever comes to that). There should
> probably be a better general case solution.

Probably :)

>
> I don't know how widespread this issue can be. You need a fair amount
> of page cache pressure to try to go to anonymous memory for reclaim to
> trigger this.
>
> Either way, this seems like a bit of a landmine.

2022-11-22 18:22:14

by Ivan Babrou

Subject: Re: Low TCP throughput due to vmpressure with swap enabled

On Tue, Nov 22, 2022 at 10:01 AM Eric Dumazet <[email protected]> wrote:
>
> On Mon, Nov 21, 2022 at 4:53 PM Ivan Babrou <[email protected]> wrote:
> >
> > Hello,
> >
> > We have observed a negative TCP throughput behavior from the following commit:
> >
> > * 8e8ae645249b mm: memcontrol: hook up vmpressure to socket pressure
> >
> > It landed back in 2016 in v4.5, so it's not exactly a new issue.
> >
> > The crux of the issue is that in some cases with swap present the
> > workload can be unfairly throttled in terms of TCP throughput.
>
> I guess defining 'fairness' in such a scenario is nearly impossible.
>
> Have you tried changing /proc/sys/net/ipv4/tcp_rmem (and/or tcp_wmem) ?
> Defaults are quite conservative.

Yes, our max sizes are much higher than the defaults. I don't see how
it matters though. The issue is that the kernel clamps rcv_ssthresh
at 4 x advmss. No matter how much TCP memory you end up using, the
kernel will clamp based on responsiveness to memory reclaim, which in
turn depends on swap presence. We're seeing it in production with tens
of thousands of sockets and high max tcp_rmem, and I'm able to
replicate the same issue in my VM with the default sysctl values.

> If for your workload you want to ensure a minimum amount of memory per
> TCP socket,
> that might be good enough.

That's not my goal at all. We don't have a problem with TCP memory
consumption. Our issue is low throughput because vmpressure() thinks
that the cgroup is memory constrained when it most definitely is not.

2022-11-22 18:44:19

by Eric Dumazet

Subject: Re: Low TCP throughput due to vmpressure with swap enabled

On Tue, Nov 22, 2022 at 10:11 AM Ivan Babrou <[email protected]> wrote:
>
> On Tue, Nov 22, 2022 at 10:01 AM Eric Dumazet <[email protected]> wrote:
> >
> > On Mon, Nov 21, 2022 at 4:53 PM Ivan Babrou <[email protected]> wrote:
> > >
> > > Hello,
> > >
> > > We have observed a negative TCP throughput behavior from the following commit:
> > >
> > > * 8e8ae645249b mm: memcontrol: hook up vmpressure to socket pressure
> > >
> > > It landed back in 2016 in v4.5, so it's not exactly a new issue.
> > >
> > > The crux of the issue is that in some cases with swap present the
> > > workload can be unfairly throttled in terms of TCP throughput.
> >
> > I guess defining 'fairness' in such a scenario is nearly impossible.
> >
> > Have you tried changing /proc/sys/net/ipv4/tcp_rmem (and/or tcp_wmem) ?
> > Defaults are quite conservative.
>
> Yes, our max sizes are much higher than the defaults. I don't see how
> it matters though. The issue is that the kernel clamps rcv_ssthresh
> at 4 x advmss.

There are some places (e.g. tcp_clamp_window) where we have this
additional condition:

sk_memory_allocated(sk) < sk_prot_mem_limits(sk, 0)

So I was suggesting to maybe add a similar condition to tcp_try_rmem_schedule().

Then adjust tcp_rmem for your needs.
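
A rough sketch of what such a guard could look like if applied to the
rcv_ssthresh clamp quoted earlier (untested and purely illustrative;
helper names as in v5.15):

else if (tcp_under_memory_pressure(sk) &&
         sk_memory_allocated(sk) >= sk_prot_mem_limits(sk, 0))
        tp->rcv_ssthresh = min(tp->rcv_ssthresh, 4U * tp->advmss);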

> No matter how much TCP memory you end up using, the
> kernel will clamp based on responsiveness to memory reclaim, which in
> turn depends on swap presence. We're seeing it in production with tens
> of thousands of sockets and high max tcp_rmem and I'm able to
> replicate the same issue in my vm with the default sysctl values.
>
> > If for your workload you want to ensure a minimum amount of memory per
> > TCP socket,
> > that might be good enough.
>
> That's not my goal at all. We don't have a problem with TCP memory
> consumption. Our issue is low throughput because vmpressure() thinks
> that the cgroup is memory constrained when it most definitely is not.

OK, then I will stop commenting I guess :)

2022-11-22 19:26:08

by Yu Zhao

Subject: Re: Low TCP throughput due to vmpressure with swap enabled

On Mon, Nov 21, 2022 at 5:53 PM Ivan Babrou <[email protected]> wrote:
>
> Hello,
>
> We have observed a negative TCP throughput behavior from the following commit:
>
> * 8e8ae645249b mm: memcontrol: hook up vmpressure to socket pressure
>
> It landed back in 2016 in v4.5, so it's not exactly a new issue.
>
> The crux of the issue is that in some cases with swap present the
> workload can be unfairly throttled in terms of TCP throughput.
>
> I am able to reproduce this issue in a VM locally on v6.1-rc6 with 8
> GiB of RAM with zram enabled.

Hi Ivan,

If it's not too much trouble, could you try again with the following?
CONFIG_LRU_GEN=y
CONFIG_LRU_GEN_ENABLED=y

I haven't tried it myself. But I'll fix whatever doesn't work for you,
since your team is on the top of my customer list :)

Thanks.

2022-11-22 19:55:54

by Yu Zhao

Subject: Re: Low TCP throughput due to vmpressure with swap enabled

On Mon, Nov 21, 2022 at 5:53 PM Ivan Babrou <[email protected]> wrote:
>
> Hello,
>
> We have observed a negative TCP throughput behavior from the following commit:
>
> * 8e8ae645249b mm: memcontrol: hook up vmpressure to socket pressure
>
> It landed back in 2016 in v4.5, so it's not exactly a new issue.
>
> The crux of the issue is that in some cases with swap present the
> workload can be unfairly throttled in terms of TCP throughput.
>
> I am able to reproduce this issue in a VM locally on v6.1-rc6 with 8
> GiB of RAM with zram enabled.
>
> The setup is fairly simple:
>
> 1. Run the following go proxy in one cgroup (it has some memory
> ballast to simulate useful memory usage):
>
> * https://gist.github.com/bobrik/2c1a8a19b921fefe22caac21fda1be82
>
> sudo systemd-run --scope -p MemoryLimit=6G go run main.go
>
> 2. Run the following fio config in another cgroup to simulate mmapped
> page cache usage:
>
> [global]
> size=8g
> bs=256k
> iodepth=256
> direct=0
> ioengine=mmap
> group_reporting
> time_based
> runtime=86400
> numjobs=8
> name=randread
> rw=randread

Is it practical for your workload to apply some madvise/fadvise hint?
For the above repro, it would be fadvise_hint=1 which is mapped into
MADV_RANDOM automatically. The kernel also supports MADV_SEQUENTIAL,
but not POSIX_FADV_NOREUSE at the moment.
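
For the mmap case, the application can also apply the hint itself. A
minimal sketch (illustrative only; the file name is just the one from
the fio job above):

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
        struct stat st;
        int fd = open("derp", O_RDONLY);

        if (fd < 0 || fstat(fd, &st) < 0)
                return 1;

        void *buf = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
        if (buf == MAP_FAILED)
                return 1;

        /* tell the kernel not to expect locality in this mapping */
        madvise(buf, st.st_size, MADV_RANDOM);

        /* ... random reads of buf would go here ... */

        munmap(buf, st.st_size);
        close(fd);
        return 0;
}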

We actually have similar issues but unfortunately I haven't been able
to come up with any solution beyond recommending the above flags.
The problem is that harvesting the accessed bit from mmapped memory is
costly, and when random accesses happen fast enough, the cost of doing
that prevents LRU from collecting more information to make better
decisions. In a nutshell, LRU can't tell whether there is genuine
memory locality with your test case.

It's a very difficult problem to solve from LRU's POV. I'd like to
hear more about your workloads and see whether there are workarounds
other than tackling the problem head-on, if applying hints is not
practical or preferable.

2022-11-22 20:06:38

by Yu Zhao

Subject: Re: Low TCP throughput due to vmpressure with swap enabled

On Tue, Nov 22, 2022 at 12:05 PM Ivan Babrou <[email protected]> wrote:
>
> On Tue, Nov 22, 2022 at 10:59 AM Yu Zhao <[email protected]> wrote:
> >
> > On Mon, Nov 21, 2022 at 5:53 PM Ivan Babrou <[email protected]> wrote:
> > >
> > > Hello,
> > >
> > > We have observed a negative TCP throughput behavior from the following commit:
> > >
> > > * 8e8ae645249b mm: memcontrol: hook up vmpressure to socket pressure
> > >
> > > It landed back in 2016 in v4.5, so it's not exactly a new issue.
> > >
> > > The crux of the issue is that in some cases with swap present the
> > > workload can be unfairly throttled in terms of TCP throughput.
> > >
> > > I am able to reproduce this issue in a VM locally on v6.1-rc6 with 8
> > > GiB of RAM with zram enabled.
> >
> > Hi Ivan,
> >
> > If it's not too much trouble, could you try again with the following?
> > CONFIG_LRU_GEN=y
> > CONFIG_LRU_GEN_ENABLED=y
> >
> > I haven't tried it myself. But I'll fix whatever doesn't work for you,
> > since your team is on the top of my customer list :)
>
> We don't have it in production, since there we have 5.15 LTS (our
> kernel team is testing v6.1).
>
> I do have it enabled in my VM running v6.1-rc6 where I was able to
> replicate this, so it doesn't seem to help.

Thanks. I'll look into this and get back to you.

2022-11-22 20:18:01

by Ivan Babrou

Subject: Re: Low TCP throughput due to vmpressure with swap enabled

On Tue, Nov 22, 2022 at 10:59 AM Yu Zhao <[email protected]> wrote:
>
> On Mon, Nov 21, 2022 at 5:53 PM Ivan Babrou <[email protected]> wrote:
> >
> > Hello,
> >
> > We have observed a negative TCP throughput behavior from the following commit:
> >
> > * 8e8ae645249b mm: memcontrol: hook up vmpressure to socket pressure
> >
> > It landed back in 2016 in v4.5, so it's not exactly a new issue.
> >
> > The crux of the issue is that in some cases with swap present the
> > workload can be unfairly throttled in terms of TCP throughput.
> >
> > I am able to reproduce this issue in a VM locally on v6.1-rc6 with 8
> > GiB of RAM with zram enabled.
>
> Hi Ivan,
>
> If it's not too much trouble, could you try again with the following?
> CONFIG_LRU_GEN=y
> CONFIG_LRU_GEN_ENABLED=y
>
> I haven't tried it myself. But I'll fix whatever doesn't work for you,
> since your team is on the top of my customer list :)

We don't have it in production, since there we have 5.15 LTS (our
kernel team is testing v6.1).

I do have it enabled in my VM running v6.1-rc6 where I was able to
replicate this, so it doesn't seem to help.

2022-11-22 20:25:53

by Yu Zhao

Subject: Re: Low TCP throughput due to vmpressure with swap enabled

On Tue, Nov 22, 2022 at 12:46 PM Yu Zhao <[email protected]> wrote:
>
> On Mon, Nov 21, 2022 at 5:53 PM Ivan Babrou <[email protected]> wrote:
> >
> > Hello,
> >
> > We have observed a negative TCP throughput behavior from the following commit:
> >
> > * 8e8ae645249b mm: memcontrol: hook up vmpressure to socket pressure
> >
> > It landed back in 2016 in v4.5, so it's not exactly a new issue.
> >
> > The crux of the issue is that in some cases with swap present the
> > workload can be unfairly throttled in terms of TCP throughput.
> >
> > I am able to reproduce this issue in a VM locally on v6.1-rc6 with 8
> > GiB of RAM with zram enabled.
> >
> > The setup is fairly simple:
> >
> > 1. Run the following go proxy in one cgroup (it has some memory
> > ballast to simulate useful memory usage):
> >
> > * https://gist.github.com/bobrik/2c1a8a19b921fefe22caac21fda1be82
> >
> > sudo systemd-run --scope -p MemoryLimit=6G go run main.go
> >
> > 2. Run the following fio config in another cgroup to simulate mmapped
> > page cache usage:
> >
> > [global]
> > size=8g
> > bs=256k
> > iodepth=256
> > direct=0
> > ioengine=mmap
> > group_reporting
> > time_based
> > runtime=86400
> > numjobs=8
> > name=randread
> > rw=randread
>
> Is it practical for your workload to apply some madvise/fadvise hint?
> For the above repro, it would be fadvise_hint=1 which is mapped into
> MADV_RANDOM automatically. The kernel also supports MADV_SEQUENTIAL,
> but not POSIX_FADV_NOREUSE at the moment.

Actually fadvise_hint already defaults to 1. At least with MGLRU, the
page cache should be thrown away without causing you any problem. It
might be mapped to POSIX_FADV_RANDOM rather than MADV_RANDOM.
POSIX_FADV_RANDOM is ignored at the moment.

Sorry for all the noise. Let me dig into this and get back to you later today.


> We actually have similar issues but unfortunately I haven't been able
> to come up with any solution beyond recommending the above flags.
> The problem is that harvesting the accessed bit from mmapped memory is
> costly, and when random accesses happen fast enough, the cost of doing
> that prevents LRU from collecting more information to make better
> decisions. In a nutshell, LRU can't tell whether there is genuine
> memory locality with your test case.
>
> It's a very difficult problem to solve from LRU's POV. I'd like to
> hear more about your workloads and see whether there are workarounds
> other than tackling the problem head-on, if applying hints is not
> practical or preferable.

2022-11-22 20:38:25

by Johannes Weiner

Subject: Re: Low TCP throughput due to vmpressure with swap enabled

On Mon, Nov 21, 2022 at 04:53:43PM -0800, Ivan Babrou wrote:
> Hello,
>
> We have observed a negative TCP throughput behavior from the following commit:
>
> * 8e8ae645249b mm: memcontrol: hook up vmpressure to socket pressure
>
> It landed back in 2016 in v4.5, so it's not exactly a new issue.
>
> The crux of the issue is that in some cases with swap present the
> workload can be unfairly throttled in terms of TCP throughput.

Thanks for the detailed analysis, Ivan.

Originally, we pushed back on sockets only when regular page reclaim
had completely failed and we were about to OOM. This patch was an
attempt to be smarter about it and equalize pressure more smoothly
between socket memory, file cache, anonymous pages.

After a recent discussion with Shakeel, I'm no longer quite sure the
kernel is the right place to attempt this sort of balancing. It kind
of depends on the workload which type of memory is more important. And
your report shows that vmpressure is a flawed mechanism to implement
this, anyway.

So I'm thinking we should delete the vmpressure thing, and go back to
socket throttling only if an OOM is imminent. This is in line with
what we do at the system level: sockets get throttled only after
reclaim fails and we hit hard limits. It's then up to the users and
sysadmin to allocate a reasonable amount of buffers given the overall
memory budget.

Cgroup accounting, limiting and OOM enforcement is still there for the
socket buffers, so misbehaving groups will be contained either way.

What do you think? Something like the below patch?

---

From 67757f78d8b4412b72fe1583ebaf13cfd0fc03b0 Mon Sep 17 00:00:00 2001
From: Johannes Weiner <[email protected]>
Date: Tue, 22 Nov 2022 14:40:50 -0500
Subject: [PATCH] Revert "mm: memcontrol: hook up vmpressure to socket
pressure"

This reverts commit 8e8ae645249b85c8ed6c178557f8db8613a6bcc7.

NOT-Signed-off-by: Johannes Weiner <[email protected]>
---
include/linux/memcontrol.h | 6 ++--
include/linux/vmpressure.h | 7 ++---
mm/memcontrol.c | 19 +++++++++----
mm/vmpressure.c | 58 ++++++--------------------------------
mm/vmscan.c | 15 +---------
5 files changed, 29 insertions(+), 76 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index e1644a24009c..e7369363d4c2 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -283,11 +283,11 @@ struct mem_cgroup {
atomic_long_t memory_events[MEMCG_NR_MEMORY_EVENTS];
atomic_long_t memory_events_local[MEMCG_NR_MEMORY_EVENTS];

+ /* Socket memory allocations have failed */
unsigned long socket_pressure;

/* Legacy tcp memory accounting */
bool tcpmem_active;
- int tcpmem_pressure;

#ifdef CONFIG_MEMCG_KMEM
int kmemcg_id;
@@ -1701,10 +1701,10 @@ void mem_cgroup_sk_alloc(struct sock *sk);
void mem_cgroup_sk_free(struct sock *sk);
static inline bool mem_cgroup_under_socket_pressure(struct mem_cgroup *memcg)
{
- if (!cgroup_subsys_on_dfl(memory_cgrp_subsys) && memcg->tcpmem_pressure)
+ if (!cgroup_subsys_on_dfl(memory_cgrp_subsys) && memcg->socket_pressure)
return true;
do {
- if (time_before(jiffies, READ_ONCE(memcg->socket_pressure)))
+ if (memcg->socket_pressure)
return true;
} while ((memcg = parent_mem_cgroup(memcg)));
return false;
diff --git a/include/linux/vmpressure.h b/include/linux/vmpressure.h
index 6a2f51ebbfd3..20d93de37a17 100644
--- a/include/linux/vmpressure.h
+++ b/include/linux/vmpressure.h
@@ -11,9 +11,6 @@
#include <linux/eventfd.h>

struct vmpressure {
- unsigned long scanned;
- unsigned long reclaimed;
-
unsigned long tree_scanned;
unsigned long tree_reclaimed;
/* The lock is used to keep the scanned/reclaimed above in sync. */
@@ -30,7 +27,7 @@ struct vmpressure {
struct mem_cgroup;

#ifdef CONFIG_MEMCG
-extern void vmpressure(gfp_t gfp, struct mem_cgroup *memcg, bool tree,
+extern void vmpressure(gfp_t gfp, struct mem_cgroup *memcg,
unsigned long scanned, unsigned long reclaimed);
extern void vmpressure_prio(gfp_t gfp, struct mem_cgroup *memcg, int prio);

@@ -44,7 +41,7 @@ extern int vmpressure_register_event(struct mem_cgroup *memcg,
extern void vmpressure_unregister_event(struct mem_cgroup *memcg,
struct eventfd_ctx *eventfd);
#else
-static inline void vmpressure(gfp_t gfp, struct mem_cgroup *memcg, bool tree,
+static inline void vmpressure(gfp_t gfp, struct mem_cgroup *memcg,
unsigned long scanned, unsigned long reclaimed) {}
static inline void vmpressure_prio(gfp_t gfp, struct mem_cgroup *memcg,
int prio) {}
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 2d8549ae1b30..066166aebbef 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -7195,10 +7195,10 @@ bool mem_cgroup_charge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages,
struct page_counter *fail;

if (page_counter_try_charge(&memcg->tcpmem, nr_pages, &fail)) {
- memcg->tcpmem_pressure = 0;
+ memcg->socket_pressure = 0;
return true;
}
- memcg->tcpmem_pressure = 1;
+ memcg->socket_pressure = 1;
if (gfp_mask & __GFP_NOFAIL) {
page_counter_charge(&memcg->tcpmem, nr_pages);
return true;
@@ -7206,12 +7206,21 @@ bool mem_cgroup_charge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages,
return false;
}

- if (try_charge(memcg, gfp_mask, nr_pages) == 0) {
- mod_memcg_state(memcg, MEMCG_SOCK, nr_pages);
- return true;
+ if (try_charge(memcg, gfp_mask & ~__GFP_NOFAIL, nr_pages) == 0) {
+ memcg->socket_pressure = 0;
+ goto success;
+ }
+ memcg->socket_pressure = 1;
+ if (gfp_mask & __GFP_NOFAIL) {
+ try_charge(memcg, gfp_mask, nr_pages);
+ goto success;
}

return false;
+
+success:
+ mod_memcg_state(memcg, MEMCG_SOCK, nr_pages);
+ return true;
}

/**
diff --git a/mm/vmpressure.c b/mm/vmpressure.c
index b52644771cc4..4cec90711cf4 100644
--- a/mm/vmpressure.c
+++ b/mm/vmpressure.c
@@ -219,7 +219,6 @@ static void vmpressure_work_fn(struct work_struct *work)
* vmpressure() - Account memory pressure through scanned/reclaimed ratio
* @gfp: reclaimer's gfp mask
* @memcg: cgroup memory controller handle
- * @tree: legacy subtree mode
* @scanned: number of pages scanned
* @reclaimed: number of pages reclaimed
*
@@ -227,16 +226,9 @@ static void vmpressure_work_fn(struct work_struct *work)
* "instantaneous" memory pressure (scanned/reclaimed ratio). The raw
* pressure index is then further refined and averaged over time.
*
- * If @tree is set, vmpressure is in traditional userspace reporting
- * mode: @memcg is considered the pressure root and userspace is
- * notified of the entire subtree's reclaim efficiency.
- *
- * If @tree is not set, reclaim efficiency is recorded for @memcg, and
- * only in-kernel users are notified.
- *
* This function does not return any value.
*/
-void vmpressure(gfp_t gfp, struct mem_cgroup *memcg, bool tree,
+void vmpressure(gfp_t gfp, struct mem_cgroup *memcg,
unsigned long scanned, unsigned long reclaimed)
{
struct vmpressure *vmpr;
@@ -271,46 +263,14 @@ void vmpressure(gfp_t gfp, struct mem_cgroup *memcg, bool tree,
if (!scanned)
return;

- if (tree) {
- spin_lock(&vmpr->sr_lock);
- scanned = vmpr->tree_scanned += scanned;
- vmpr->tree_reclaimed += reclaimed;
- spin_unlock(&vmpr->sr_lock);
-
- if (scanned < vmpressure_win)
- return;
- schedule_work(&vmpr->work);
- } else {
- enum vmpressure_levels level;
-
- /* For now, no users for root-level efficiency */
- if (!memcg || mem_cgroup_is_root(memcg))
- return;
-
- spin_lock(&vmpr->sr_lock);
- scanned = vmpr->scanned += scanned;
- reclaimed = vmpr->reclaimed += reclaimed;
- if (scanned < vmpressure_win) {
- spin_unlock(&vmpr->sr_lock);
- return;
- }
- vmpr->scanned = vmpr->reclaimed = 0;
- spin_unlock(&vmpr->sr_lock);
+ spin_lock(&vmpr->sr_lock);
+ scanned = vmpr->tree_scanned += scanned;
+ vmpr->tree_reclaimed += reclaimed;
+ spin_unlock(&vmpr->sr_lock);

- level = vmpressure_calc_level(scanned, reclaimed);
-
- if (level > VMPRESSURE_LOW) {
- /*
- * Let the socket buffer allocator know that
- * we are having trouble reclaiming LRU pages.
- *
- * For hysteresis keep the pressure state
- * asserted for a second in which subsequent
- * pressure events can occur.
- */
- WRITE_ONCE(memcg->socket_pressure, jiffies + HZ);
- }
- }
+ if (scanned < vmpressure_win)
+ return;
+ schedule_work(&vmpr->work);
}

/**
@@ -340,7 +300,7 @@ void vmpressure_prio(gfp_t gfp, struct mem_cgroup *memcg, int prio)
* to the vmpressure() basically means that we signal 'critical'
* level.
*/
- vmpressure(gfp, memcg, true, vmpressure_win, 0);
+ vmpressure(gfp, memcg, vmpressure_win, 0);
}

#define MAX_VMPRESSURE_ARGS_LEN (strlen("critical") + strlen("hierarchy") + 2)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 04d8b88e5216..d348366d58d4 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -6035,8 +6035,6 @@ static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
memcg = mem_cgroup_iter(target_memcg, NULL, NULL);
do {
struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat);
- unsigned long reclaimed;
- unsigned long scanned;

/*
* This loop can become CPU-bound when target memcgs
@@ -6068,20 +6066,9 @@ static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
memcg_memory_event(memcg, MEMCG_LOW);
}

- reclaimed = sc->nr_reclaimed;
- scanned = sc->nr_scanned;
-
shrink_lruvec(lruvec, sc);
-
shrink_slab(sc->gfp_mask, pgdat->node_id, memcg,
sc->priority);
-
- /* Record the group's reclaim efficiency */
- if (!sc->proactive)
- vmpressure(sc->gfp_mask, memcg, false,
- sc->nr_scanned - scanned,
- sc->nr_reclaimed - reclaimed);
-
} while ((memcg = mem_cgroup_iter(target_memcg, memcg, NULL)));
}

@@ -6111,7 +6098,7 @@ static void shrink_node(pg_data_t *pgdat, struct scan_control *sc)

/* Record the subtree's reclaim efficiency */
if (!sc->proactive)
- vmpressure(sc->gfp_mask, sc->target_mem_cgroup, true,
+ vmpressure(sc->gfp_mask, sc->target_mem_cgroup,
sc->nr_scanned - nr_scanned,
sc->nr_reclaimed - nr_reclaimed);

--
2.38.1

2022-11-22 22:43:15

by Ivan Babrou

Subject: Re: Low TCP throughput due to vmpressure with swap enabled

On Tue, Nov 22, 2022 at 12:05 PM Johannes Weiner <[email protected]> wrote:
>
> On Mon, Nov 21, 2022 at 04:53:43PM -0800, Ivan Babrou wrote:
> > Hello,
> >
> > We have observed a negative TCP throughput behavior from the following commit:
> >
> > * 8e8ae645249b mm: memcontrol: hook up vmpressure to socket pressure
> >
> > It landed back in 2016 in v4.5, so it's not exactly a new issue.
> >
> > The crux of the issue is that in some cases with swap present the
> > workload can be unfairly throttled in terms of TCP throughput.
>
> Thanks for the detailed analysis, Ivan.
>
> Originally, we pushed back on sockets only when regular page reclaim
> had completely failed and we were about to OOM. This patch was an
> attempt to be smarter about it and equalize pressure more smoothly
> between socket memory, file cache, anonymous pages.
>
> After a recent discussion with Shakeel, I'm no longer quite sure the
> kernel is the right place to attempt this sort of balancing. It kind
> of depends on the workload which type of memory is more important. And
> your report shows that vmpressure is a flawed mechanism to implement
> this, anyway.
>
> So I'm thinking we should delete the vmpressure thing, and go back to
> socket throttling only if an OOM is imminent. This is in line with
> what we do at the system level: sockets get throttled only after
> reclaim fails and we hit hard limits. It's then up to the users and
> sysadmin to allocate a reasonable amount of buffers given the overall
> memory budget.
>
> Cgroup accounting, limiting and OOM enforcement is still there for the
> socket buffers, so misbehaving groups will be contained either way.
>
> What do you think? Something like the below patch?

The idea sounds very reasonable to me. I can't really speak for the
patch contents with any sort of authority, but it looks ok to my
non-expert eyes.

There were some conflicts when cherry-picking this into v5.15. I think
the only real one was for the "!sc->proactive" condition not being
present there. For the rest I just accepted the incoming change.

I'm going to be away from my work computer until December 5th, but
I'll try to expedite my backported patch to a production machine today
to confirm that it makes the difference. If I can get some approvals
on my internal PRs, I should be able to provide the results by EOD
tomorrow.

>
> ---
>
> From 67757f78d8b4412b72fe1583ebaf13cfd0fc03b0 Mon Sep 17 00:00:00 2001
> From: Johannes Weiner <[email protected]>
> Date: Tue, 22 Nov 2022 14:40:50 -0500
> Subject: [PATCH] Revert "mm: memcontrol: hook up vmpressure to socket
> pressure"
>
> This reverts commit 8e8ae645249b85c8ed6c178557f8db8613a6bcc7.
>
> NOT-Signed-off-by: Johannes Weiner <[email protected]>
> ---
> include/linux/memcontrol.h | 6 ++--
> include/linux/vmpressure.h | 7 ++---
> mm/memcontrol.c | 19 +++++++++----
> mm/vmpressure.c | 58 ++++++--------------------------------
> mm/vmscan.c | 15 +---------
> 5 files changed, 29 insertions(+), 76 deletions(-)
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index e1644a24009c..e7369363d4c2 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -283,11 +283,11 @@ struct mem_cgroup {
> atomic_long_t memory_events[MEMCG_NR_MEMORY_EVENTS];
> atomic_long_t memory_events_local[MEMCG_NR_MEMORY_EVENTS];
>
> + /* Socket memory allocations have failed */
> unsigned long socket_pressure;
>
> /* Legacy tcp memory accounting */
> bool tcpmem_active;
> - int tcpmem_pressure;
>
> #ifdef CONFIG_MEMCG_KMEM
> int kmemcg_id;
> @@ -1701,10 +1701,10 @@ void mem_cgroup_sk_alloc(struct sock *sk);
> void mem_cgroup_sk_free(struct sock *sk);
> static inline bool mem_cgroup_under_socket_pressure(struct mem_cgroup *memcg)
> {
> - if (!cgroup_subsys_on_dfl(memory_cgrp_subsys) && memcg->tcpmem_pressure)
> + if (!cgroup_subsys_on_dfl(memory_cgrp_subsys) && memcg->socket_pressure)
> return true;
> do {
> - if (time_before(jiffies, READ_ONCE(memcg->socket_pressure)))
> + if (memcg->socket_pressure)
> return true;
> } while ((memcg = parent_mem_cgroup(memcg)));
> return false;
> diff --git a/include/linux/vmpressure.h b/include/linux/vmpressure.h
> index 6a2f51ebbfd3..20d93de37a17 100644
> --- a/include/linux/vmpressure.h
> +++ b/include/linux/vmpressure.h
> @@ -11,9 +11,6 @@
> #include <linux/eventfd.h>
>
> struct vmpressure {
> - unsigned long scanned;
> - unsigned long reclaimed;
> -
> unsigned long tree_scanned;
> unsigned long tree_reclaimed;
> /* The lock is used to keep the scanned/reclaimed above in sync. */
> @@ -30,7 +27,7 @@ struct vmpressure {
> struct mem_cgroup;
>
> #ifdef CONFIG_MEMCG
> -extern void vmpressure(gfp_t gfp, struct mem_cgroup *memcg, bool tree,
> +extern void vmpressure(gfp_t gfp, struct mem_cgroup *memcg,
> unsigned long scanned, unsigned long reclaimed);
> extern void vmpressure_prio(gfp_t gfp, struct mem_cgroup *memcg, int prio);
>
> @@ -44,7 +41,7 @@ extern int vmpressure_register_event(struct mem_cgroup *memcg,
> extern void vmpressure_unregister_event(struct mem_cgroup *memcg,
> struct eventfd_ctx *eventfd);
> #else
> -static inline void vmpressure(gfp_t gfp, struct mem_cgroup *memcg, bool tree,
> +static inline void vmpressure(gfp_t gfp, struct mem_cgroup *memcg,
> unsigned long scanned, unsigned long reclaimed) {}
> static inline void vmpressure_prio(gfp_t gfp, struct mem_cgroup *memcg,
> int prio) {}
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 2d8549ae1b30..066166aebbef 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -7195,10 +7195,10 @@ bool mem_cgroup_charge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages,
> struct page_counter *fail;
>
> if (page_counter_try_charge(&memcg->tcpmem, nr_pages, &fail)) {
> - memcg->tcpmem_pressure = 0;
> + memcg->socket_pressure = 0;
> return true;
> }
> - memcg->tcpmem_pressure = 1;
> + memcg->socket_pressure = 1;
> if (gfp_mask & __GFP_NOFAIL) {
> page_counter_charge(&memcg->tcpmem, nr_pages);
> return true;
> @@ -7206,12 +7206,21 @@ bool mem_cgroup_charge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages,
> return false;
> }
>
> - if (try_charge(memcg, gfp_mask, nr_pages) == 0) {
> - mod_memcg_state(memcg, MEMCG_SOCK, nr_pages);
> - return true;
> + if (try_charge(memcg, gfp_mask & ~__GFP_NOFAIL, nr_pages) == 0) {
> + memcg->socket_pressure = 0;
> + goto success;
> + }
> + memcg->socket_pressure = 1;
> + if (gfp_mask & __GFP_NOFAIL) {
> + try_charge(memcg, gfp_mask, nr_pages);
> + goto success;
> }
>
> return false;
> +
> +success:
> + mod_memcg_state(memcg, MEMCG_SOCK, nr_pages);
> + return true;
> }
>
> /**
> diff --git a/mm/vmpressure.c b/mm/vmpressure.c
> index b52644771cc4..4cec90711cf4 100644
> --- a/mm/vmpressure.c
> +++ b/mm/vmpressure.c
> @@ -219,7 +219,6 @@ static void vmpressure_work_fn(struct work_struct *work)
> * vmpressure() - Account memory pressure through scanned/reclaimed ratio
> * @gfp: reclaimer's gfp mask
> * @memcg: cgroup memory controller handle
> - * @tree: legacy subtree mode
> * @scanned: number of pages scanned
> * @reclaimed: number of pages reclaimed
> *
> @@ -227,16 +226,9 @@ static void vmpressure_work_fn(struct work_struct *work)
> * "instantaneous" memory pressure (scanned/reclaimed ratio). The raw
> * pressure index is then further refined and averaged over time.
> *
> - * If @tree is set, vmpressure is in traditional userspace reporting
> - * mode: @memcg is considered the pressure root and userspace is
> - * notified of the entire subtree's reclaim efficiency.
> - *
> - * If @tree is not set, reclaim efficiency is recorded for @memcg, and
> - * only in-kernel users are notified.
> - *
> * This function does not return any value.
> */
> -void vmpressure(gfp_t gfp, struct mem_cgroup *memcg, bool tree,
> +void vmpressure(gfp_t gfp, struct mem_cgroup *memcg,
> unsigned long scanned, unsigned long reclaimed)
> {
> struct vmpressure *vmpr;
> @@ -271,46 +263,14 @@ void vmpressure(gfp_t gfp, struct mem_cgroup *memcg, bool tree,
> if (!scanned)
> return;
>
> - if (tree) {
> - spin_lock(&vmpr->sr_lock);
> - scanned = vmpr->tree_scanned += scanned;
> - vmpr->tree_reclaimed += reclaimed;
> - spin_unlock(&vmpr->sr_lock);
> -
> - if (scanned < vmpressure_win)
> - return;
> - schedule_work(&vmpr->work);
> - } else {
> - enum vmpressure_levels level;
> -
> - /* For now, no users for root-level efficiency */
> - if (!memcg || mem_cgroup_is_root(memcg))
> - return;
> -
> - spin_lock(&vmpr->sr_lock);
> - scanned = vmpr->scanned += scanned;
> - reclaimed = vmpr->reclaimed += reclaimed;
> - if (scanned < vmpressure_win) {
> - spin_unlock(&vmpr->sr_lock);
> - return;
> - }
> - vmpr->scanned = vmpr->reclaimed = 0;
> - spin_unlock(&vmpr->sr_lock);
> + spin_lock(&vmpr->sr_lock);
> + scanned = vmpr->tree_scanned += scanned;
> + vmpr->tree_reclaimed += reclaimed;
> + spin_unlock(&vmpr->sr_lock);
>
> - level = vmpressure_calc_level(scanned, reclaimed);
> -
> - if (level > VMPRESSURE_LOW) {
> - /*
> - * Let the socket buffer allocator know that
> - * we are having trouble reclaiming LRU pages.
> - *
> - * For hysteresis keep the pressure state
> - * asserted for a second in which subsequent
> - * pressure events can occur.
> - */
> - WRITE_ONCE(memcg->socket_pressure, jiffies + HZ);
> - }
> - }
> + if (scanned < vmpressure_win)
> + return;
> + schedule_work(&vmpr->work);
> }
>
> /**
> @@ -340,7 +300,7 @@ void vmpressure_prio(gfp_t gfp, struct mem_cgroup *memcg, int prio)
> * to the vmpressure() basically means that we signal 'critical'
> * level.
> */
> - vmpressure(gfp, memcg, true, vmpressure_win, 0);
> + vmpressure(gfp, memcg, vmpressure_win, 0);
> }
>
> #define MAX_VMPRESSURE_ARGS_LEN (strlen("critical") + strlen("hierarchy") + 2)
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 04d8b88e5216..d348366d58d4 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -6035,8 +6035,6 @@ static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
> memcg = mem_cgroup_iter(target_memcg, NULL, NULL);
> do {
> struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat);
> - unsigned long reclaimed;
> - unsigned long scanned;
>
> /*
> * This loop can become CPU-bound when target memcgs
> @@ -6068,20 +6066,9 @@ static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
> memcg_memory_event(memcg, MEMCG_LOW);
> }
>
> - reclaimed = sc->nr_reclaimed;
> - scanned = sc->nr_scanned;
> -
> shrink_lruvec(lruvec, sc);
> -
> shrink_slab(sc->gfp_mask, pgdat->node_id, memcg,
> sc->priority);
> -
> - /* Record the group's reclaim efficiency */
> - if (!sc->proactive)
> - vmpressure(sc->gfp_mask, memcg, false,
> - sc->nr_scanned - scanned,
> - sc->nr_reclaimed - reclaimed);
> -
> } while ((memcg = mem_cgroup_iter(target_memcg, memcg, NULL)));
> }
>
> @@ -6111,7 +6098,7 @@ static void shrink_node(pg_data_t *pgdat, struct scan_control *sc)
>
> /* Record the subtree's reclaim efficiency */
> if (!sc->proactive)
> - vmpressure(sc->gfp_mask, sc->target_mem_cgroup, true,
> + vmpressure(sc->gfp_mask, sc->target_mem_cgroup,
> sc->nr_scanned - nr_scanned,
> sc->nr_reclaimed - nr_reclaimed);
>
> --
> 2.38.1
>

2022-11-23 01:13:48

by Yu Zhao

Subject: Re: Low TCP throughput due to vmpressure with swap enabled

On Tue, Nov 22, 2022 at 01:05:32PM -0700, Yu Zhao wrote:
> On Tue, Nov 22, 2022 at 12:46 PM Yu Zhao <[email protected]> wrote:
> >
> > On Mon, Nov 21, 2022 at 5:53 PM Ivan Babrou <[email protected]> wrote:
> > >
> > > Hello,
> > >
> > > We have observed a negative TCP throughput behavior from the following commit:
> > >
> > > * 8e8ae645249b mm: memcontrol: hook up vmpressure to socket pressure
> > >
> > > It landed back in 2016 in v4.5, so it's not exactly a new issue.
> > >
> > > The crux of the issue is that in some cases with swap present the
> > > workload can be unfairly throttled in terms of TCP throughput.
> > >
> > > I am able to reproduce this issue in a VM locally on v6.1-rc6 with 8
> > > GiB of RAM with zram enabled.
> > >
> > > The setup is fairly simple:
> > >
> > > 1. Run the following go proxy in one cgroup (it has some memory
> > > ballast to simulate useful memory usage):
> > >
> > > * https://gist.github.com/bobrik/2c1a8a19b921fefe22caac21fda1be82
> > >
> > > sudo systemd-run --scope -p MemoryLimit=6G go run main.go
> > >
> > > 2. Run the following fio config in another cgroup to simulate mmapped
> > > page cache usage:
> > >
> > > [global]
> > > size=8g
> > > bs=256k
> > > iodepth=256
> > > direct=0
> > > ioengine=mmap
> > > group_reporting
> > > time_based
> > > runtime=86400
> > > numjobs=8
> > > name=randread
> > > rw=randread
> >
> > Is it practical for your workload to apply some madvise/fadvise hint?
> > For the above repro, it would be fadvise_hint=1 which is mapped into
> > MADV_RANDOM automatically. The kernel also supports MADV_SEQUENTIAL,
> > but not POSIX_FADV_NOREUSE at the moment.
>
> Actually fadvise_hint already defaults to 1. At least with MGLRU, the
> page cache should be thrown away without causing you any problem. It
> might be mapped to POSIX_FADV_RANDOM rather than MADV_RANDOM.
> POSIX_FADV_RANDOM is ignored at the moment.
>
> Sorry for all the noise. Let me dig into this and get back to you later today.
>
>
> > We actually have similar issues but unfortunately I haven't been able
> > to come up with any solution beyond recommending the above flags.
> > The problem is that harvesting the accessed bit from mmapped memory is
> > costly, and when random accesses happen fast enough, the cost of doing
> > that prevents LRU from collecting more information to make better
> > decisions. In a nutshell, LRU can't tell whether there is genuine
> > memory locality with your test case.
> >
> > It's a very difficult problem to solve from LRU's POV. I'd like to
> > hear more about your workloads and see whether there are workarounds
> > other than tackling the problem head-on, if applying hints is not
> > practical or preferable.

Apparently MADV_RANDOM isn't properly handled in MGLRU. (I think I broke
it at some point but have yet to find out when.)

I tried the patch below with a test case similar to yours and it shows
improvements for both the baseline and MGLRU. It should fix the problem
for your repro but not your production unless it also does madvise. So
no worries if you don't care to try.


Hi Johannes,

Do you think it makes sense to have the below for both the baseline and
MGLRU or it's some behavior change that the baseline doesn't want to
risk?

Thanks.


diff --git a/include/linux/fs.h b/include/linux/fs.h
index e654435f1651..632e0f462df9 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -166,6 +166,8 @@ typedef int (dio_iodone_t)(struct kiocb *iocb, loff_t offset,
/* File supports DIRECT IO */
#define FMODE_CAN_ODIRECT ((__force fmode_t)0x400000)

+#define FMODE_NOREUSE ((__force fmode_t)0x800000)
+
/* File was opened by fanotify and shouldn't generate fanotify events */
#define FMODE_NONOTIFY ((__force fmode_t)0x4000000)

diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
index 2f53c31216de..4122a4db5b49 100644
--- a/include/linux/mm_inline.h
+++ b/include/linux/mm_inline.h
@@ -595,4 +595,15 @@ pte_install_uffd_wp_if_needed(struct vm_area_struct *vma, unsigned long addr,
#endif
}

+static inline bool vma_has_locality(struct vm_area_struct *vma)
+{
+ if (vma->vm_flags & (VM_SEQ_READ | VM_RAND_READ))
+ return false;
+
+ if (vma->vm_file && (vma->vm_file->f_mode & FMODE_NOREUSE))
+ return false;
+
+ return true;
+}
+
#endif
diff --git a/mm/fadvise.c b/mm/fadvise.c
index c76ee665355a..2ba24d865bf5 100644
--- a/mm/fadvise.c
+++ b/mm/fadvise.c
@@ -80,7 +80,7 @@ int generic_fadvise(struct file *file, loff_t offset, loff_t len, int advice)
case POSIX_FADV_NORMAL:
file->f_ra.ra_pages = bdi->ra_pages;
spin_lock(&file->f_lock);
- file->f_mode &= ~FMODE_RANDOM;
+ file->f_mode &= ~(FMODE_RANDOM | FMODE_NOREUSE);
spin_unlock(&file->f_lock);
break;
case POSIX_FADV_RANDOM:
@@ -107,6 +107,9 @@ int generic_fadvise(struct file *file, loff_t offset, loff_t len, int advice)
force_page_cache_readahead(mapping, file, start_index, nrpages);
break;
case POSIX_FADV_NOREUSE:
+ spin_lock(&file->f_lock);
+ file->f_mode |= FMODE_NOREUSE;
+ spin_unlock(&file->f_lock);
break;
case POSIX_FADV_DONTNEED:
__filemap_fdatawrite_range(mapping, offset, endbyte,
diff --git a/mm/memory.c b/mm/memory.c
index f88c351aecd4..69fba1d58eb2 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1448,8 +1448,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
force_flush = 1;
set_page_dirty(page);
}
- if (pte_young(ptent) &&
- likely(!(vma->vm_flags & VM_SEQ_READ)))
+ if (pte_young(ptent) && likely(vma_has_locality(vma)))
mark_page_accessed(page);
}
rss[mm_counter(page)]--;
@@ -5161,8 +5160,7 @@ static inline void mm_account_fault(struct pt_regs *regs,
#ifdef CONFIG_LRU_GEN
static void lru_gen_enter_fault(struct vm_area_struct *vma)
{
- /* the LRU algorithm doesn't apply to sequential or random reads */
- current->in_lru_fault = !(vma->vm_flags & (VM_SEQ_READ | VM_RAND_READ));
+ current->in_lru_fault = vma_has_locality(vma);
}

static void lru_gen_exit_fault(void)
diff --git a/mm/rmap.c b/mm/rmap.c
index 2ec925e5fa6a..b1bb492115ae 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -823,25 +823,14 @@ static bool folio_referenced_one(struct folio *folio,
}

if (pvmw.pte) {
- if (lru_gen_enabled() && pte_young(*pvmw.pte) &&
- !(vma->vm_flags & (VM_SEQ_READ | VM_RAND_READ))) {
+ if (lru_gen_enabled() && pte_young(*pvmw.pte)) {
lru_gen_look_around(&pvmw);
referenced++;
}

if (ptep_clear_flush_young_notify(vma, address,
- pvmw.pte)) {
- /*
- * Don't treat a reference through
- * a sequentially read mapping as such.
- * If the folio has been used in another mapping,
- * we will catch it; if this other mapping is
- * already gone, the unmap path will have set
- * the referenced flag or activated the folio.
- */
- if (likely(!(vma->vm_flags & VM_SEQ_READ)))
- referenced++;
- }
+ pvmw.pte))
+ referenced++;
} else if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)) {
if (pmdp_clear_flush_young_notify(vma, address,
pvmw.pmd))
@@ -875,7 +864,15 @@ static bool invalid_folio_referenced_vma(struct vm_area_struct *vma, void *arg)
struct folio_referenced_arg *pra = arg;
struct mem_cgroup *memcg = pra->memcg;

- if (!mm_match_cgroup(vma->vm_mm, memcg))
+ if (!vma_has_locality(vma))
+ return true;
+
+ /*
+ * If we are reclaiming on behalf of a cgroup, skip
+ * counting on behalf of references from different
+ * cgroups
+ */
+ if (memcg && !mm_match_cgroup(vma->vm_mm, memcg))
return true;

return false;
@@ -921,14 +918,7 @@ int folio_referenced(struct folio *folio, int is_locked,
return 1;
}

- /*
- * If we are reclaiming on behalf of a cgroup, skip
- * counting on behalf of references from different
- * cgroups
- */
- if (memcg) {
- rwc.invalid_vma = invalid_folio_referenced_vma;
- }
+ rwc.invalid_vma = invalid_folio_referenced_vma;

rmap_walk(folio, &rwc);
*vm_flags = pra.vm_flags;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 7d3cc787b840..97ce5d37d67c 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3742,7 +3742,10 @@ static int should_skip_vma(unsigned long start, unsigned long end, struct mm_wal
if (is_vm_hugetlb_page(vma))
return true;

- if (vma->vm_flags & (VM_LOCKED | VM_SPECIAL | VM_SEQ_READ | VM_RAND_READ))
+ if (!vma_has_locality(vma))
+ return true;
+
+ if (vma->vm_flags & (VM_LOCKED | VM_SPECIAL))
return true;

if (vma == get_gate_vma(vma->vm_mm))

2022-11-23 02:13:43

by Ivan Babrou

Subject: Re: Low TCP throughput due to vmpressure with swap enabled

On Tue, Nov 22, 2022 at 2:11 PM Ivan Babrou <[email protected]> wrote:
>
> On Tue, Nov 22, 2022 at 12:05 PM Johannes Weiner <[email protected]> wrote:
> >
> > On Mon, Nov 21, 2022 at 04:53:43PM -0800, Ivan Babrou wrote:
> > > Hello,
> > >
> > > We have observed a negative TCP throughput behavior from the following commit:
> > >
> > > * 8e8ae645249b mm: memcontrol: hook up vmpressure to socket pressure
> > >
> > > It landed back in 2016 in v4.5, so it's not exactly a new issue.
> > >
> > > The crux of the issue is that in some cases with swap present the
> > > workload can be unfairly throttled in terms of TCP throughput.
> >
> > Thanks for the detailed analysis, Ivan.
> >
> > Originally, we pushed back on sockets only when regular page reclaim
> > had completely failed and we were about to OOM. This patch was an
> > attempt to be smarter about it and equalize pressure more smoothly
> > between socket memory, file cache, anonymous pages.
> >
> > After a recent discussion with Shakeel, I'm no longer quite sure the
> > kernel is the right place to attempt this sort of balancing. It kind
> > of depends on the workload which type of memory is more important. And
> > your report shows that vmpressure is a flawed mechanism to implement
> > this, anyway.
> >
> > So I'm thinking we should delete the vmpressure thing, and go back to
> > socket throttling only if an OOM is imminent. This is in line with
> > what we do at the system level: sockets get throttled only after
> > reclaim fails and we hit hard limits. It's then up to the users and
> > sysadmin to allocate a reasonable amount of buffers given the overall
> > memory budget.
> >
> > Cgroup accounting, limiting and OOM enforcement is still there for the
> > socket buffers, so misbehaving groups will be contained either way.
> >
> > What do you think? Something like the below patch?
>
> The idea sounds very reasonable to me. I can't really speak for the
> patch contents with any sort of authority, but it looks ok to my
> non-expert eyes.
>
> There were some conflicts when cherry-picking this into v5.15. I think
> the only real one was for the "!sc->proactive" condition not being
> present there. For the rest I just accepted the incoming change.
>
> I'm going to be away from my work computer until December 5th, but
> I'll try to expedite my backported patch to a production machine today
> to confirm that it makes the difference. If I can get some approvals
> on my internal PRs, I should be able to provide the results by EOD
> tomorrow.

I tried the patch and something isn't right here.

With the patch applied I'm capped at ~120MB/s, which is a symptom of a
clamped window.

I can't find any sockets with memcg->socket_pressure = 1, but at the
same time I only see the following rcv_ssthresh assigned to sockets:

$ sudo ss -tim dport 6443 | fgrep rcv_ssthresh | sed
's/.*rcv_ssthresh://' | awk '{ print $1 }' | sort -n | uniq -c | sort
-n | tail
1 64076
181 65495
1456 5792
16531 64088

* 64088 is the default value
* 5792 is 4 * advmss (clamped)

Compare this to a machine without the patch but with
cgroup.memory=nosocket in cmdline:
$ sudo ss -tim dport 6443 | fgrep rcv_ssthresh | sed
's/.*rcv_ssthresh://' | awk '{ print $1 }' | sort -n | uniq -c | sort
-n | tail
8 2806862
8 3777338
8 72776
8 86068
10 2024018
12 3777354
23 91172
29 66984
101 65495
5439 64088

There aren't any clamped sockets here and there are many different
rcv_ssthresh values.

2022-11-23 22:12:30

by Johannes Weiner

Subject: Re: Low TCP throughput due to vmpressure with swap enabled

On Tue, Nov 22, 2022 at 05:44:44PM -0700, Yu Zhao wrote:
> Hi Johannes,
>
> Do you think it makes sense to have the below for both the baseline and
> MGLRU or it's some behavior change that the baseline doesn't want to
> risk?

It looks good to me. Besides the new FMODE_NOREUSE, it's also a nice
cleanup on the rmap side!

It would just be good to keep the comment from folio_referenced_one() and
move it to the vma_has_locality() check in invalid_folio_referenced_vma().

Otherwise,

Acked-by: Johannes Weiner <[email protected]>

2022-11-24 01:35:59

by Yu Zhao

Subject: Re: Low TCP throughput due to vmpressure with swap enabled

On Wed, Nov 23, 2022 at 2:22 PM Johannes Weiner <[email protected]> wrote:
>
> On Tue, Nov 22, 2022 at 05:44:44PM -0700, Yu Zhao wrote:
> > Hi Johannes,
> >
> > Do you think it makes sense to have the below for both the baseline and
> > MGLRU or it's some behavior change that the baseline doesn't want to
> > risk?
>
> It looks good to me. Besides the new FMODE_NOREUSE, it's also a nice
> cleanup on the rmap side!
>
> It would just be good to keep the comment from folio_referenced_one() and
> move it to the vma_has_locality() check in invalid_folio_referenced_vma().
>
> Otherwise,
>
> Acked-by: Johannes Weiner <[email protected]>

Thanks.

I've added Ivan's test case to my collection. Interestingly, after
this patch, the download speed increased while fio was running (my
guess is that fio pushed more cold anon into swap):

$ uname
Linux test127 6.1.0-rc6-dirty #2 SMP PREEMPT_DYNAMIC Wed Nov 23
16:51:20 MST 2022 x86_64 x86_64 x86_64 GNU/Linux

$ go version
go version go1.18.1 linux/amd64

$ fio -v
fio-3.28

$ curl --version
curl 7.81.0 (x86_64-pc-linux-gnu) libcurl/7.81.0 OpenSSL/3.0.2
zlib/1.2.11 brotli/1.0.9 zstd/1.4.8 libidn2/2.3.2 libpsl/0.21.0
(+libidn2/2.3.2) libssh/0.9.6/openssl/zlib nghttp2/1.43.0 librtmp/2.3
OpenLDAP/2.5.13
Release-Date: 2022-01-05
Protocols: dict file ftp ftps gopher gophers http https imap imaps
ldap ldaps mqtt pop3 pop3s rtmp rtsp scp sftp smb smbs smtp smtps
telnet tftp
Features: alt-svc AsynchDNS brotli GSS-API HSTS HTTP2 HTTPS-proxy IDN
IPv6 Kerberos Largefile libz NTLM NTLM_WB PSL SPNEGO SSL TLS-SRP
UnixSockets zstd

fio NOT running:

% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 4768M 0 4768M 0 0 83.6M 0 --:--:-- 0:00:57 --:--:-- 87.0M
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 4768M 0 4768M 0 0 82.8M 0 --:--:-- 0:00:57 --:--:-- 79.1M
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 4768M 0 4768M 0 0 82.7M 0 --:--:-- 0:00:57 --:--:-- 89.7M
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 4768M 0 4768M 0 0 87.4M 0 --:--:-- 0:00:54 --:--:-- 94.3M
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 4768M 0 4768M 0 0 88.1M 0 --:--:-- 0:00:54 --:--:-- 94.7M
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 4768M 0 4768M 0 0 82.6M 0 --:--:-- 0:00:57 --:--:-- 83.9M
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 4768M 0 4768M 0 0 86.4M 0 --:--:-- 0:00:55 --:--:-- 90.1M
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 4768M 0 4768M 0 0 82.8M 0 --:--:-- 0:00:57 --:--:-- 67.5M
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 4768M 0 4768M 0 0 83.4M 0 --:--:-- 0:00:57 --:--:-- 78.7M
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 4768M 0 4768M 0 0 84.0M 0 --:--:-- 0:00:56 --:--:-- 87.4M


fio running:

% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 4768M 0 4768M 0 0 86.7M 0 --:--:-- 0:01:11 --:--:-- 88.7M
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 4768M 0 4768M 0 0 87.7M 0 --:--:-- 0:00:54 --:--:-- 93.5M
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 4768M 0 4768M 0 0 88.5M 0 --:--:-- 0:00:53 --:--:-- 95.1M
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 4768M 0 4768M 0 0 91.6M 0 --:--:-- 0:00:52 --:--:-- 94.4M
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 4768M 0 4768M 0 0 89.4M 0 --:--:-- 0:00:53 --:--:-- 86.6M
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 4768M 0 4768M 0 0 88.6M 0 --:--:-- 0:00:53 --:--:-- 84.8M
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 4768M 0 4768M 0 0 84.6M 0 --:--:-- 0:00:56 --:--:-- 87.5M
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 4768M 0 4768M 0 0 86.9M 0 --:--:-- 0:00:54 --:--:-- 81.4M
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 4768M 0 4768M 0 0 89.0M 0 --:--:-- 0:00:53 --:--:-- 86.4M
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 4768M 0 4768M 0 0 91.1M 0 --:--:-- 0:00:52 --:--:-- 90.6M

2022-11-24 01:52:59

by Yu Zhao

[permalink] [raw]
Subject: Re: Low TCP throughput due to vmpressure with swap enabled

On Wed, Nov 23, 2022 at 6:18 PM Yu Zhao <[email protected]> wrote:
>
> On Wed, Nov 23, 2022 at 2:22 PM Johannes Weiner <[email protected]> wrote:
> >
> > On Tue, Nov 22, 2022 at 05:44:44PM -0700, Yu Zhao wrote:
> > > Hi Johannes,
> > >
> > > Do you think it makes sense to have the below for both the baseline and
> > > MGLRU or it's some behavior change that the baseline doesn't want to
> > > risk?
> >
> > It looks good to me. Besides the new FMODE_NOREUSE, it's also a nice
> > cleanup on the rmap side!
> >
> > It would just be good to keep the comment from folio_referenced_one() and
> > move it to the vma_has_locality() check in invalid_folio_referenced_vma().
> >
> > Otherwise,
> >
> > Acked-by: Johannes Weiner <[email protected]>
>
> Thanks.
>
> I've added Ivan's test case to my collection. Interestingly, after
> this patch, the download speed increased while fio was running (my
> guess is that fio pushed more cold anon into swap):
>
> $ uname
> Linux test127 6.1.0-rc6-dirty #2 SMP PREEMPT_DYNAMIC Wed Nov 23
> 16:51:20 MST 2022 x86_64 x86_64 x86_64 GNU/Linux

This is the baseline kernel, not MGLRU.

> $ go version
> go version go1.18.1 linux/amd64
>
> $ fio -v
> fio-3.28
>
> $ curl --version
> curl 7.81.0 (x86_64-pc-linux-gnu) libcurl/7.81.0 OpenSSL/3.0.2
> zlib/1.2.11 brotli/1.0.9 zstd/1.4.8 libidn2/2.3.2 libpsl/0.21.0
> (+libidn2/2.3.2) libssh/0.9.6/openssl/zlib nghttp2/1.43.0 librtmp/2.3
> OpenLDAP/2.5.13
> Release-Date: 2022-01-05
> Protocols: dict file ftp ftps gopher gophers http https imap imaps
> ldap ldaps mqtt pop3 pop3s rtmp rtsp scp sftp smb smbs smtp smtps
> telnet tftp
> Features: alt-svc AsynchDNS brotli GSS-API HSTS HTTP2 HTTPS-proxy IDN
> IPv6 Kerberos Largefile libz NTLM NTLM_WB PSL SPNEGO SSL TLS-SRP
> UnixSockets zstd
>
> fio NOT running:
>
> % Total % Received % Xferd Average Speed Time Time Time Current
> Dload Upload Total Spent Left Speed
> 100 4768M 0 4768M 0 0 83.6M 0 --:--:-- 0:00:57 --:--:-- 87.0M
> % Total % Received % Xferd Average Speed Time Time Time Current
> Dload Upload Total Spent Left Speed
> 100 4768M 0 4768M 0 0 82.8M 0 --:--:-- 0:00:57 --:--:-- 79.1M
> % Total % Received % Xferd Average Speed Time Time Time Current
> Dload Upload Total Spent Left Speed
> 100 4768M 0 4768M 0 0 82.7M 0 --:--:-- 0:00:57 --:--:-- 89.7M
> % Total % Received % Xferd Average Speed Time Time Time Current
> Dload Upload Total Spent Left Speed
> 100 4768M 0 4768M 0 0 87.4M 0 --:--:-- 0:00:54 --:--:-- 94.3M
> % Total % Received % Xferd Average Speed Time Time Time Current
> Dload Upload Total Spent Left Speed
> 100 4768M 0 4768M 0 0 88.1M 0 --:--:-- 0:00:54 --:--:-- 94.7M
> % Total % Received % Xferd Average Speed Time Time Time Current
> Dload Upload Total Spent Left Speed
> 100 4768M 0 4768M 0 0 82.6M 0 --:--:-- 0:00:57 --:--:-- 83.9M
> % Total % Received % Xferd Average Speed Time Time Time Current
> Dload Upload Total Spent Left Speed
> 100 4768M 0 4768M 0 0 86.4M 0 --:--:-- 0:00:55 --:--:-- 90.1M
> % Total % Received % Xferd Average Speed Time Time Time Current
> Dload Upload Total Spent Left Speed
> 100 4768M 0 4768M 0 0 82.8M 0 --:--:-- 0:00:57 --:--:-- 67.5M
> % Total % Received % Xferd Average Speed Time Time Time Current
> Dload Upload Total Spent Left Speed
> 100 4768M 0 4768M 0 0 83.4M 0 --:--:-- 0:00:57 --:--:-- 78.7M
> % Total % Received % Xferd Average Speed Time Time Time Current
> Dload Upload Total Spent Left Speed
> 100 4768M 0 4768M 0 0 84.0M 0 --:--:-- 0:00:56 --:--:-- 87.4M
>
>
> fio running:
>
> % Total % Received % Xferd Average Speed Time Time Time Current
> Dload Upload Total Spent Left Speed
> 100 4768M 0 4768M 0 0 86.7M 0 --:--:-- 0:01:11 --:--:-- 88.7M
> % Total % Received % Xferd Average Speed Time Time Time Current
> Dload Upload Total Spent Left Speed
> 100 4768M 0 4768M 0 0 87.7M 0 --:--:-- 0:00:54 --:--:-- 93.5M
> % Total % Received % Xferd Average Speed Time Time Time Current
> Dload Upload Total Spent Left Speed
> 100 4768M 0 4768M 0 0 88.5M 0 --:--:-- 0:00:53 --:--:-- 95.1M
> % Total % Received % Xferd Average Speed Time Time Time Current
> Dload Upload Total Spent Left Speed
> 100 4768M 0 4768M 0 0 91.6M 0 --:--:-- 0:00:52 --:--:-- 94.4M
> % Total % Received % Xferd Average Speed Time Time Time Current
> Dload Upload Total Spent Left Speed
> 100 4768M 0 4768M 0 0 89.4M 0 --:--:-- 0:00:53 --:--:-- 86.6M
> % Total % Received % Xferd Average Speed Time Time Time Current
> Dload Upload Total Spent Left Speed
> 100 4768M 0 4768M 0 0 88.6M 0 --:--:-- 0:00:53 --:--:-- 84.8M
> % Total % Received % Xferd Average Speed Time Time Time Current
> Dload Upload Total Spent Left Speed
> 100 4768M 0 4768M 0 0 84.6M 0 --:--:-- 0:00:56 --:--:-- 87.5M
> % Total % Received % Xferd Average Speed Time Time Time Current
> Dload Upload Total Spent Left Speed
> 100 4768M 0 4768M 0 0 86.9M 0 --:--:-- 0:00:54 --:--:-- 81.4M
> % Total % Received % Xferd Average Speed Time Time Time Current
> Dload Upload Total Spent Left Speed
> 100 4768M 0 4768M 0 0 89.0M 0 --:--:-- 0:00:53 --:--:-- 86.4M
> % Total % Received % Xferd Average Speed Time Time Time Current
> Dload Upload Total Spent Left Speed
> 100 4768M 0 4768M 0 0 91.1M 0 --:--:-- 0:00:52 --:--:-- 90.6M

2022-11-28 18:34:13

by Johannes Weiner

[permalink] [raw]
Subject: Re: Low TCP throughput due to vmpressure with swap enabled

On Tue, Nov 22, 2022 at 05:28:24PM -0800, Ivan Babrou wrote:
> On Tue, Nov 22, 2022 at 2:11 PM Ivan Babrou <[email protected]> wrote:
> >
> > On Tue, Nov 22, 2022 at 12:05 PM Johannes Weiner <[email protected]> wrote:
> > >
> > > On Mon, Nov 21, 2022 at 04:53:43PM -0800, Ivan Babrou wrote:
> > > > Hello,
> > > >
> > > > We have observed a negative TCP throughput behavior from the following commit:
> > > >
> > > > * 8e8ae645249b mm: memcontrol: hook up vmpressure to socket pressure
> > > >
> > > > It landed back in 2016 in v4.5, so it's not exactly a new issue.
> > > >
> > > > The crux of the issue is that in some cases with swap present the
> > > > workload can be unfairly throttled in terms of TCP throughput.
> > >
> > > Thanks for the detailed analysis, Ivan.
> > >
> > > Originally, we pushed back on sockets only when regular page reclaim
> > > had completely failed and we were about to OOM. This patch was an
> > > attempt to be smarter about it and equalize pressure more smoothly
> > > between socket memory, file cache, anonymous pages.
> > >
> > > After a recent discussion with Shakeel, I'm no longer quite sure the
> > > kernel is the right place to attempt this sort of balancing. It kind
> > > of depends on the workload which type of memory is more important. And
> > > your report shows that vmpressure is a flawed mechanism to implement
> > > this, anyway.
> > >
> > > So I'm thinking we should delete the vmpressure thing, and go back to
> > > socket throttling only if an OOM is imminent. This is in line with
> > > what we do at the system level: sockets get throttled only after
> > > reclaim fails and we hit hard limits. It's then up to the users and
> > > sysadmin to allocate a reasonable amount of buffers given the overall
> > > memory budget.
> > >
> > > Cgroup accounting, limiting and OOM enforcement is still there for the
> > > socket buffers, so misbehaving groups will be contained either way.
> > >
> > > What do you think? Something like the below patch?
> >
> > The idea sounds very reasonable to me. I can't really speak for the
> > patch contents with any sort of authority, but it looks ok to my
> > non-expert eyes.
> >
> > There were some conflicts when cherry-picking this into v5.15. I think
> > the only real one was for the "!sc->proactive" condition not being
> > present there. For the rest I just accepted the incoming change.
> >
> > I'm going to be away from my work computer until December 5th, but
> > I'll try to expedite my backported patch to a production machine today
> > to confirm that it makes the difference. If I can get some approvals
> > on my internal PRs, I should be able to provide the results by EOD
> > tomorrow.
>
> I tried the patch and something isn't right here.

Thanks for giving it a spin.

> With the patch applied I'm capped at ~120MB/s, which is a symptom of a
> clamped window.
>
> I can't find any sockets with memcg->socket_pressure = 1, but at the
> same time I only see the following rcv_ssthresh assigned to sockets:

Hm, I don't see how socket accounting would alter the network behavior
other than through socket_pressure=1.

How do you look for that flag? If you haven't yet done something
comparable, can you try with tracing to rule out sampling errors?

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 066166aebbef..134b623bee6a 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -7211,6 +7211,7 @@ bool mem_cgroup_charge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages,
goto success;
}
memcg->socket_pressure = 1;
+ trace_printk("skmem charge failed nr_pages=%u gfp=%pGg\n", nr_pages, &gfp_mask);
if (gfp_mask & __GFP_NOFAIL) {
try_charge(memcg, gfp_mask, nr_pages);
goto success;

2022-12-05 20:45:49

by Shakeel Butt

[permalink] [raw]
Subject: Re: Low TCP throughput due to vmpressure with swap enabled

On Mon, Nov 28, 2022 at 01:07:25PM -0500, Johannes Weiner wrote:
>
[...]
> > With the patch applied I'm capped at ~120MB/s, which is a symptom of a
> > clamped window.
> >
> > I can't find any sockets with memcg->socket_pressure = 1, but at the
> > same time I only see the following rcv_ssthresh assigned to sockets:
>
> Hm, I don't see how socket accounting would alter the network behavior
> other than through socket_pressure=1.
>

I think what is happening is that the TCP stack is calling
tcp_under_memory_pressure() and making decisions without going through
the memcg charge codepath, which is what sets or resets
memcg->socket_pressure. Most probably the socket gets clamped because
memcg->socket_pressure is set, and the kernel then never tries to grow
its buffers; since it never goes back through the memcg charge codepath,
memcg->socket_pressure is never reset either. That is my guess, but the
network experts CCed can correct me.
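
For reference, the path being described looks roughly like this
(paraphrased from include/net/tcp.h of this era, simplified): the TCP
stack only reads the cached flag and never goes back through the charge
path that would clear it.

static inline bool tcp_under_memory_pressure(const struct sock *sk)
{
        /* the cached per-memcg flag is read directly; nothing on this
         * path charges memory or calls mem_cgroup_charge_skmem() */
        if (mem_cgroup_sockets_enabled && sk->sk_memcg &&
            mem_cgroup_under_socket_pressure(sk->sk_memcg))
                return true;

        /* otherwise fall back to the global TCP pressure state */
        return !!READ_ONCE(tcp_memory_pressure);
}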

Anyway, I don't think a pressure mechanism that relies on successful
charging will work. I am brainstorming towards memory.high based network
throttling: basically, use penalty_jiffies (or something similar) to set
memcg->socket_pressure. However, I want this to be opt-in, as we do have
applications which would rather be killed than throttled. So I am still
working out the fine details of how this can be done without introducing
a rigid API.
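
To make that a bit more concrete, a rough and entirely hypothetical
sketch of the direction (the sock_throttle field and the
mem_cgroup_sock_penalty() helper are made-up names for illustration,
not existing kernel API):

static void mem_cgroup_update_socket_pressure(struct mem_cgroup *memcg)
{
        unsigned long penalty_jiffies;

        /* hypothetical opt-in knob; apps that prefer OOM kills over
         * throttling would leave it off */
        if (!READ_ONCE(memcg->sock_throttle))
                return;

        /* hypothetical helper reusing the memory.high penalty math */
        penalty_jiffies = mem_cgroup_sock_penalty(memcg);
        WRITE_ONCE(memcg->socket_pressure, penalty_jiffies != 0);
}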

2022-12-06 00:20:08

by Ivan Babrou

[permalink] [raw]
Subject: Re: Low TCP throughput due to vmpressure with swap enabled

On Mon, Nov 28, 2022 at 10:07 AM Johannes Weiner <[email protected]> wrote:
>
> On Tue, Nov 22, 2022 at 05:28:24PM -0800, Ivan Babrou wrote:
> > On Tue, Nov 22, 2022 at 2:11 PM Ivan Babrou <[email protected]> wrote:
> > >
> > > On Tue, Nov 22, 2022 at 12:05 PM Johannes Weiner <[email protected]> wrote:
> > > >
> > > > On Mon, Nov 21, 2022 at 04:53:43PM -0800, Ivan Babrou wrote:
> > > > > Hello,
> > > > >
> > > > > We have observed a negative TCP throughput behavior from the following commit:
> > > > >
> > > > > * 8e8ae645249b mm: memcontrol: hook up vmpressure to socket pressure
> > > > >
> > > > > It landed back in 2016 in v4.5, so it's not exactly a new issue.
> > > > >
> > > > > The crux of the issue is that in some cases with swap present the
> > > > > workload can be unfairly throttled in terms of TCP throughput.
> > > >
> > > > Thanks for the detailed analysis, Ivan.
> > > >
> > > > Originally, we pushed back on sockets only when regular page reclaim
> > > > had completely failed and we were about to OOM. This patch was an
> > > > attempt to be smarter about it and equalize pressure more smoothly
> > > > between socket memory, file cache, anonymous pages.
> > > >
> > > > After a recent discussion with Shakeel, I'm no longer quite sure the
> > > > kernel is the right place to attempt this sort of balancing. It kind
> > > > of depends on the workload which type of memory is more important. And
> > > > your report shows that vmpressure is a flawed mechanism to implement
> > > > this, anyway.
> > > >
> > > > So I'm thinking we should delete the vmpressure thing, and go back to
> > > > socket throttling only if an OOM is imminent. This is in line with
> > > > what we do at the system level: sockets get throttled only after
> > > > reclaim fails and we hit hard limits. It's then up to the users and
> > > > sysadmin to allocate a reasonable amount of buffers given the overall
> > > > memory budget.
> > > >
> > > > Cgroup accounting, limiting and OOM enforcement is still there for the
> > > > socket buffers, so misbehaving groups will be contained either way.
> > > >
> > > > What do you think? Something like the below patch?
> > >
> > > The idea sounds very reasonable to me. I can't really speak for the
> > > patch contents with any sort of authority, but it looks ok to my
> > > non-expert eyes.
> > >
> > > There were some conflicts when cherry-picking this into v5.15. I think
> > > the only real one was for the "!sc->proactive" condition not being
> > > present there. For the rest I just accepted the incoming change.
> > >
> > > I'm going to be away from my work computer until December 5th, but
> > > I'll try to expedite my backported patch to a production machine today
> > > to confirm that it makes the difference. If I can get some approvals
> > > on my internal PRs, I should be able to provide the results by EOD
> > > tomorrow.
> >
> > I tried the patch and something isn't right here.
>
> Thanks for giving it a spin.
>
> > With the patch applied I'm capped at ~120MB/s, which is a symptom of a
> > clamped window.
> >
> > I can't find any sockets with memcg->socket_pressure = 1, but at the
> > same time I only see the following rcv_ssthresh assigned to sockets:
>
> Hm, I don't see how socket accounting would alter the network behavior
> other than through socket_pressure=1.
>
> How do you look for that flag? If you haven't yet done something
> comparable, can you try with tracing to rule out sampling errors?

Apologies for the delayed reply; I took a week off away from computers.

I looked with bpftrace (from my bash_history):

$ sudo bpftrace -e '
kprobe:tcp_try_rmem_schedule { @sk[cpu] = arg0; }
kretprobe:tcp_try_rmem_schedule {
    $arg = @sk[cpu];
    if ($arg) {
        $sk = (struct sock *) $arg;
        $id = $sk->sk_memcg->css.cgroup->kn->id;
        $socket_pressure = $sk->sk_memcg->socket_pressure;
        if ($id == 21379) {
            printf("id = %d, socket_pressure = %d\n", $id, $socket_pressure);
        }
    }
}'

I tried your patch on top of v6.1-rc8 (where it applied without
conflicts) in my VM. It still gave me low numbers, and nothing showed
up in /sys/kernel/debug/tracing/trace. To be extra sure, I changed the
trace_printk to a plain printk, and it still didn't show up in dmesg,
even with constantly low throughput:

ivan@vm:~$ curl -o /dev/null https://sim.cfperf.net/cached-assets/zero-5g.bin
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
14 4768M 14 685M 0 0 12.9M 0 0:06:08 0:00:52 0:05:16 13.0M

I still saw a clamped rcv_ssthresh (5840 = 4 * advmss of 1460):

$ sudo ss -tinm dport 443
State   Recv-Q   Send-Q   Local Address:Port   Peer Address:Port    Process
ESTAB   0        0        10.2.0.15:35800      162.159.136.82:443
skmem:(r0,rb2577228,t0,tb46080,f0,w0,o0,bl0,d0) cubic rto:201
rtt:0.42/0.09 ato:40 mss:1460 pmtu:1500 rcvmss:1440 advmss:1460
cwnd:10 bytes_sent:12948 bytes_acked:12949 bytes_received:2915062731
segs_out:506592 segs_in:2025111 data_segs_out:351 data_segs_in:2024911
send 278095238bps lastsnd:824 lastrcv:154 lastack:154 pacing_rate
556190472bps delivery_rate 47868848bps delivered:352 app_limited
busy:147ms rcv_rtt:0.011 rcv_space:82199 rcv_ssthresh:5840
minrtt:0.059 snd_wnd:65535 tcp-ulp-tls rxconf: none txconf: none

I also tried with my detection program for ebpf_exporter (fexit based version):

* https://github.com/cloudflare/ebpf_exporter/pull/172/files

Which also showed signs of a clamped window:

# HELP ebpf_exporter_tcp_window_clamps_total Number of times that TCP
window was clamped to a low value
# TYPE ebpf_exporter_tcp_window_clamps_total counter
ebpf_exporter_tcp_window_clamps_total 53887

In fact, I can replicate this with just curl fetching a public URL while fio is running.

> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 066166aebbef..134b623bee6a 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -7211,6 +7211,7 @@ bool mem_cgroup_charge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages,
> goto success;
> }
> memcg->socket_pressure = 1;
> + trace_printk("skmem charge failed nr_pages=%u gfp=%pGg\n", nr_pages, &gfp_mask);
> if (gfp_mask & __GFP_NOFAIL) {
> try_charge(memcg, gfp_mask, nr_pages);
> goto success;

2022-12-06 01:12:12

by Ivan Babrou

[permalink] [raw]
Subject: Re: Low TCP throughput due to vmpressure with swap enabled

On Mon, Dec 5, 2022 at 3:57 PM Ivan Babrou <[email protected]> wrote:
>
> On Mon, Nov 28, 2022 at 10:07 AM Johannes Weiner <[email protected]> wrote:
> >
> > On Tue, Nov 22, 2022 at 05:28:24PM -0800, Ivan Babrou wrote:
> > > On Tue, Nov 22, 2022 at 2:11 PM Ivan Babrou <[email protected]> wrote:
> > > >
> > > > On Tue, Nov 22, 2022 at 12:05 PM Johannes Weiner <[email protected]> wrote:
> > > > >
> > > > > On Mon, Nov 21, 2022 at 04:53:43PM -0800, Ivan Babrou wrote:
> > > > > > Hello,
> > > > > >
> > > > > > We have observed a negative TCP throughput behavior from the following commit:
> > > > > >
> > > > > > * 8e8ae645249b mm: memcontrol: hook up vmpressure to socket pressure
> > > > > >
> > > > > > It landed back in 2016 in v4.5, so it's not exactly a new issue.
> > > > > >
> > > > > > The crux of the issue is that in some cases with swap present the
> > > > > > workload can be unfairly throttled in terms of TCP throughput.
> > > > >
> > > > > Thanks for the detailed analysis, Ivan.
> > > > >
> > > > > Originally, we pushed back on sockets only when regular page reclaim
> > > > > had completely failed and we were about to OOM. This patch was an
> > > > > attempt to be smarter about it and equalize pressure more smoothly
> > > > > between socket memory, file cache, anonymous pages.
> > > > >
> > > > > After a recent discussion with Shakeel, I'm no longer quite sure the
> > > > > kernel is the right place to attempt this sort of balancing. It kind
> > > > > of depends on the workload which type of memory is more important. And
> > > > > your report shows that vmpressure is a flawed mechanism to implement
> > > > > this, anyway.
> > > > >
> > > > > So I'm thinking we should delete the vmpressure thing, and go back to
> > > > > socket throttling only if an OOM is imminent. This is in line with
> > > > > what we do at the system level: sockets get throttled only after
> > > > > reclaim fails and we hit hard limits. It's then up to the users and
> > > > > sysadmin to allocate a reasonable amount of buffers given the overall
> > > > > memory budget.
> > > > >
> > > > > Cgroup accounting, limiting and OOM enforcement is still there for the
> > > > > socket buffers, so misbehaving groups will be contained either way.
> > > > >
> > > > > What do you think? Something like the below patch?
> > > >
> > > > The idea sounds very reasonable to me. I can't really speak for the
> > > > patch contents with any sort of authority, but it looks ok to my
> > > > non-expert eyes.
> > > >
> > > > There were some conflicts when cherry-picking this into v5.15. I think
> > > > the only real one was for the "!sc->proactive" condition not being
> > > > present there. For the rest I just accepted the incoming change.
> > > >
> > > > I'm going to be away from my work computer until December 5th, but
> > > > I'll try to expedite my backported patch to a production machine today
> > > > to confirm that it makes the difference. If I can get some approvals
> > > > on my internal PRs, I should be able to provide the results by EOD
> > > > tomorrow.
> > >
> > > I tried the patch and something isn't right here.
> >
> > Thanks for giving it a spin.
> >
> > > With the patch applied I'm capped at ~120MB/s, which is a symptom of a
> > > clamped window.
> > >
> > > I can't find any sockets with memcg->socket_pressure = 1, but at the
> > > same time I only see the following rcv_ssthresh assigned to sockets:
> >
> > Hm, I don't see how socket accounting would alter the network behavior
> > other than through socket_pressure=1.
> >
> > How do you look for that flag? If you haven't yet done something
> > comparable, can you try with tracing to rule out sampling errors?
>
> Apologies for a delayed reply, I took a week off away from computers.
>
> I looked with bpftrace (from my bash_history):
>
> $ sudo bpftrace -e 'kprobe:tcp_try_rmem_schedule { @sk[cpu] = arg0; }
> kretprobe:tcp_try_rmem_schedule { $arg = @sk[cpu]; if ($arg) { $sk =
> (struct sock *) $arg; $id = $sk->sk_memcg->css.cgroup->kn->id;
> $socket_pressure = $sk->sk_memcg->socket_pressure; if ($id == 21379) {
> printf("id = %d, socket_pressure = %d\n", $id, $socket_pressure); } }
> }'
>
> I tried your patch on top of v6.1-rc8 (where it produced no conflicts)
> in my vm and it still gave me low numbers and nothing in
> /sys/kernel/debug/tracing/trace. To be extra sure, I changed it from
> trace_printk to just printk and it still didn't show up in dmesg, even
> with constant low throughput:
>
> ivan@vm:~$ curl -o /dev/null https://sim.cfperf.net/cached-assets/zero-5g.bin
> % Total % Received % Xferd Average Speed Time Time Time Current
> Dload Upload Total Spent Left Speed
> 14 4768M 14 685M 0 0 12.9M 0 0:06:08 0:00:52 0:05:16 13.0M
>
> I still saw clamped rcv_ssthresh:
>
> $ sudo ss -tinm dport 443
> State Recv-Q Send-Q
> Local Address:Port
> Peer Address:Port Process
> ESTAB 0 0
> 10.2.0.15:35800
> 162.159.136.82:443
> skmem:(r0,rb2577228,t0,tb46080,f0,w0,o0,bl0,d0) cubic rto:201
> rtt:0.42/0.09 ato:40 mss:1460 pmtu:1500 rcvmss:1440 advmss:1460
> cwnd:10 bytes_sent:12948 bytes_acked:12949 bytes_received:2915062731
> segs_out:506592 segs_in:2025111 data_segs_out:351 data_segs_in:2024911
> send 278095238bps lastsnd:824 lastrcv:154 lastack:154 pacing_rate
> 556190472bps delivery_rate 47868848bps delivered:352 app_limited
> busy:147ms rcv_rtt:0.011 rcv_space:82199 rcv_ssthresh:5840
> minrtt:0.059 snd_wnd:65535 tcp-ulp-tls rxconf: none txconf: none
>
> I also tried with my detection program for ebpf_exporter (fexit based version):
>
> * https://github.com/cloudflare/ebpf_exporter/pull/172/files
>
> Which also showed signs of a clamped window:
>
> # HELP ebpf_exporter_tcp_window_clamps_total Number of times that TCP
> window was clamped to a low value
> # TYPE ebpf_exporter_tcp_window_clamps_total counter
> ebpf_exporter_tcp_window_clamps_total 53887
>
> In fact, I can replicate this with just curl fetching a public URL while fio is running.

I sprinkled some more printk around to get to the bottom of this:

static inline bool mem_cgroup_under_socket_pressure(struct mem_cgroup *memcg)
{
        if (!cgroup_subsys_on_dfl(memory_cgrp_subsys) &&
            memcg->socket_pressure) {
                printk("socket pressure[1]: %lu", memcg->socket_pressure);
                return true;
        }
        do {
                if (memcg->socket_pressure) {
                        printk("socket pressure[2]: %lu",
                               memcg->socket_pressure);
                        return true;
                }
        } while ((memcg = parent_mem_cgroup(memcg)));
        return false;
}

And now I can see plenty of this:

[ 108.156707][ T5175] socket pressure[2]: 4294673429
[ 108.157050][ T5175] socket pressure[2]: 4294673429
[ 108.157301][ T5175] socket pressure[2]: 4294673429
[ 108.157581][ T5175] socket pressure[2]: 4294673429
[ 108.157874][ T5175] socket pressure[2]: 4294673429
[ 108.158254][ T5175] socket pressure[2]: 4294673429

I think the first result below is to blame:

$ rg '.->socket_pressure' mm
mm/memcontrol.c
5280: memcg->socket_pressure = jiffies;
7198: memcg->socket_pressure = 0;
7201: memcg->socket_pressure = 1;
7211: memcg->socket_pressure = 0;
7215: memcg->socket_pressure = 1;

While we set socket_pressure to either zero or one in
mem_cgroup_charge_skmem, it is still initialized to jiffies on memcg
creation. Zero seems like a more appropriate starting point. With that
change I see it working as expected with no TCP speed bumps. My
ebpf_exporter program also looks happy and reports zero clamps in my
brief testing.
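
For reference, the change tested here amounts to a one-liner in
mem_cgroup_alloc() along these lines (a sketch; the patch further down
in this thread instead drops the initialization entirely once the field
becomes a bool):

        memcg->socket_pressure = 0;     /* was: memcg->socket_pressure = jiffies; */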

Since the dmesg output shows "socket pressure[2]" rather than "socket
pressure[1]", it's probably one of the parent cgroups (which isn't
getting charged for socket memory) that is reporting the pressure.

I also think we should downgrade socket_pressure from "unsigned long"
to "bool", as it only holds zero and one now.

2022-12-06 19:04:04

by Johannes Weiner

[permalink] [raw]
Subject: Re: Low TCP throughput due to vmpressure with swap enabled

On Mon, Dec 05, 2022 at 04:50:46PM -0800, Ivan Babrou wrote:
> And now I can see plenty of this:
>
> [ 108.156707][ T5175] socket pressure[2]: 4294673429
> [ 108.157050][ T5175] socket pressure[2]: 4294673429
> [ 108.157301][ T5175] socket pressure[2]: 4294673429
> [ 108.157581][ T5175] socket pressure[2]: 4294673429
> [ 108.157874][ T5175] socket pressure[2]: 4294673429
> [ 108.158254][ T5175] socket pressure[2]: 4294673429
>
> I think the first result below is to blame:
>
> $ rg '.->socket_pressure' mm
> mm/memcontrol.c
> 5280: memcg->socket_pressure = jiffies;
> 7198: memcg->socket_pressure = 0;
> 7201: memcg->socket_pressure = 1;
> 7211: memcg->socket_pressure = 0;
> 7215: memcg->socket_pressure = 1;

Hoo boy, that's a silly mistake indeed. Thanks for tracking it down.

> While we set socket_pressure to either zero or one in
> mem_cgroup_charge_skmem, it is still initialized to jiffies on memcg
> creation. Zero seems like a more appropriate starting point. With that
> change I see it working as expected with no TCP speed bumps. My
> ebpf_exporter program also looks happy and reports zero clamps in my
> brief testing.

Excellent, now this behavior makes sense.

> I also think we should downgrade socket_pressure from "unsigned long"
> to "bool", as it only holds zero and one now.

Sounds good to me!

Attaching the updated patch below. If nobody has any objections, I'll
add a proper changelog, reported-bys, sign-off etc and send it out.

---
include/linux/memcontrol.h | 8 +++---
include/linux/vmpressure.h | 7 ++---
mm/memcontrol.c | 20 +++++++++----
mm/vmpressure.c | 58 ++++++--------------------------------
mm/vmscan.c | 15 +---------
5 files changed, 30 insertions(+), 78 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index e1644a24009c..ef1c388be5b3 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -283,11 +283,11 @@ struct mem_cgroup {
atomic_long_t memory_events[MEMCG_NR_MEMORY_EVENTS];
atomic_long_t memory_events_local[MEMCG_NR_MEMORY_EVENTS];

- unsigned long socket_pressure;
+ /* Socket memory allocations have failed */
+ bool socket_pressure;

/* Legacy tcp memory accounting */
bool tcpmem_active;
- int tcpmem_pressure;

#ifdef CONFIG_MEMCG_KMEM
int kmemcg_id;
@@ -1701,10 +1701,10 @@ void mem_cgroup_sk_alloc(struct sock *sk);
void mem_cgroup_sk_free(struct sock *sk);
static inline bool mem_cgroup_under_socket_pressure(struct mem_cgroup *memcg)
{
- if (!cgroup_subsys_on_dfl(memory_cgrp_subsys) && memcg->tcpmem_pressure)
+ if (!cgroup_subsys_on_dfl(memory_cgrp_subsys) && memcg->socket_pressure)
return true;
do {
- if (time_before(jiffies, READ_ONCE(memcg->socket_pressure)))
+ if (memcg->socket_pressure)
return true;
} while ((memcg = parent_mem_cgroup(memcg)));
return false;
diff --git a/include/linux/vmpressure.h b/include/linux/vmpressure.h
index 6a2f51ebbfd3..20d93de37a17 100644
--- a/include/linux/vmpressure.h
+++ b/include/linux/vmpressure.h
@@ -11,9 +11,6 @@
#include <linux/eventfd.h>

struct vmpressure {
- unsigned long scanned;
- unsigned long reclaimed;
-
unsigned long tree_scanned;
unsigned long tree_reclaimed;
/* The lock is used to keep the scanned/reclaimed above in sync. */
@@ -30,7 +27,7 @@ struct vmpressure {
struct mem_cgroup;

#ifdef CONFIG_MEMCG
-extern void vmpressure(gfp_t gfp, struct mem_cgroup *memcg, bool tree,
+extern void vmpressure(gfp_t gfp, struct mem_cgroup *memcg,
unsigned long scanned, unsigned long reclaimed);
extern void vmpressure_prio(gfp_t gfp, struct mem_cgroup *memcg, int prio);

@@ -44,7 +41,7 @@ extern int vmpressure_register_event(struct mem_cgroup *memcg,
extern void vmpressure_unregister_event(struct mem_cgroup *memcg,
struct eventfd_ctx *eventfd);
#else
-static inline void vmpressure(gfp_t gfp, struct mem_cgroup *memcg, bool tree,
+static inline void vmpressure(gfp_t gfp, struct mem_cgroup *memcg,
unsigned long scanned, unsigned long reclaimed) {}
static inline void vmpressure_prio(gfp_t gfp, struct mem_cgroup *memcg,
int prio) {}
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 2d8549ae1b30..0d4b9dbe775a 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5277,7 +5277,6 @@ static struct mem_cgroup *mem_cgroup_alloc(void)
vmpressure_init(&memcg->vmpressure);
INIT_LIST_HEAD(&memcg->event_list);
spin_lock_init(&memcg->event_list_lock);
- memcg->socket_pressure = jiffies;
#ifdef CONFIG_MEMCG_KMEM
memcg->kmemcg_id = -1;
INIT_LIST_HEAD(&memcg->objcg_list);
@@ -7195,10 +7194,10 @@ bool mem_cgroup_charge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages,
struct page_counter *fail;

if (page_counter_try_charge(&memcg->tcpmem, nr_pages, &fail)) {
- memcg->tcpmem_pressure = 0;
+ memcg->socket_pressure = false;
return true;
}
- memcg->tcpmem_pressure = 1;
+ memcg->socket_pressure = true;
if (gfp_mask & __GFP_NOFAIL) {
page_counter_charge(&memcg->tcpmem, nr_pages);
return true;
@@ -7206,12 +7205,21 @@ bool mem_cgroup_charge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages,
return false;
}

- if (try_charge(memcg, gfp_mask, nr_pages) == 0) {
- mod_memcg_state(memcg, MEMCG_SOCK, nr_pages);
- return true;
+ if (try_charge(memcg, gfp_mask & ~__GFP_NOFAIL, nr_pages) == 0) {
+ memcg->socket_pressure = false;
+ goto success;
+ }
+ memcg->socket_pressure = true;
+ if (gfp_mask & __GFP_NOFAIL) {
+ try_charge(memcg, gfp_mask, nr_pages);
+ goto success;
}

return false;
+
+success:
+ mod_memcg_state(memcg, MEMCG_SOCK, nr_pages);
+ return true;
}

/**
diff --git a/mm/vmpressure.c b/mm/vmpressure.c
index b52644771cc4..4cec90711cf4 100644
--- a/mm/vmpressure.c
+++ b/mm/vmpressure.c
@@ -219,7 +219,6 @@ static void vmpressure_work_fn(struct work_struct *work)
* vmpressure() - Account memory pressure through scanned/reclaimed ratio
* @gfp: reclaimer's gfp mask
* @memcg: cgroup memory controller handle
- * @tree: legacy subtree mode
* @scanned: number of pages scanned
* @reclaimed: number of pages reclaimed
*
@@ -227,16 +226,9 @@ static void vmpressure_work_fn(struct work_struct *work)
* "instantaneous" memory pressure (scanned/reclaimed ratio). The raw
* pressure index is then further refined and averaged over time.
*
- * If @tree is set, vmpressure is in traditional userspace reporting
- * mode: @memcg is considered the pressure root and userspace is
- * notified of the entire subtree's reclaim efficiency.
- *
- * If @tree is not set, reclaim efficiency is recorded for @memcg, and
- * only in-kernel users are notified.
- *
* This function does not return any value.
*/
-void vmpressure(gfp_t gfp, struct mem_cgroup *memcg, bool tree,
+void vmpressure(gfp_t gfp, struct mem_cgroup *memcg,
unsigned long scanned, unsigned long reclaimed)
{
struct vmpressure *vmpr;
@@ -271,46 +263,14 @@ void vmpressure(gfp_t gfp, struct mem_cgroup *memcg, bool tree,
if (!scanned)
return;

- if (tree) {
- spin_lock(&vmpr->sr_lock);
- scanned = vmpr->tree_scanned += scanned;
- vmpr->tree_reclaimed += reclaimed;
- spin_unlock(&vmpr->sr_lock);
-
- if (scanned < vmpressure_win)
- return;
- schedule_work(&vmpr->work);
- } else {
- enum vmpressure_levels level;
-
- /* For now, no users for root-level efficiency */
- if (!memcg || mem_cgroup_is_root(memcg))
- return;
-
- spin_lock(&vmpr->sr_lock);
- scanned = vmpr->scanned += scanned;
- reclaimed = vmpr->reclaimed += reclaimed;
- if (scanned < vmpressure_win) {
- spin_unlock(&vmpr->sr_lock);
- return;
- }
- vmpr->scanned = vmpr->reclaimed = 0;
- spin_unlock(&vmpr->sr_lock);
+ spin_lock(&vmpr->sr_lock);
+ scanned = vmpr->tree_scanned += scanned;
+ vmpr->tree_reclaimed += reclaimed;
+ spin_unlock(&vmpr->sr_lock);

- level = vmpressure_calc_level(scanned, reclaimed);
-
- if (level > VMPRESSURE_LOW) {
- /*
- * Let the socket buffer allocator know that
- * we are having trouble reclaiming LRU pages.
- *
- * For hysteresis keep the pressure state
- * asserted for a second in which subsequent
- * pressure events can occur.
- */
- WRITE_ONCE(memcg->socket_pressure, jiffies + HZ);
- }
- }
+ if (scanned < vmpressure_win)
+ return;
+ schedule_work(&vmpr->work);
}

/**
@@ -340,7 +300,7 @@ void vmpressure_prio(gfp_t gfp, struct mem_cgroup *memcg, int prio)
* to the vmpressure() basically means that we signal 'critical'
* level.
*/
- vmpressure(gfp, memcg, true, vmpressure_win, 0);
+ vmpressure(gfp, memcg, vmpressure_win, 0);
}

#define MAX_VMPRESSURE_ARGS_LEN (strlen("critical") + strlen("hierarchy") + 2)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 04d8b88e5216..d348366d58d4 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -6035,8 +6035,6 @@ static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
memcg = mem_cgroup_iter(target_memcg, NULL, NULL);
do {
struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat);
- unsigned long reclaimed;
- unsigned long scanned;

/*
* This loop can become CPU-bound when target memcgs
@@ -6068,20 +6066,9 @@ static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
memcg_memory_event(memcg, MEMCG_LOW);
}

- reclaimed = sc->nr_reclaimed;
- scanned = sc->nr_scanned;
-
shrink_lruvec(lruvec, sc);
-
shrink_slab(sc->gfp_mask, pgdat->node_id, memcg,
sc->priority);
-
- /* Record the group's reclaim efficiency */
- if (!sc->proactive)
- vmpressure(sc->gfp_mask, memcg, false,
- sc->nr_scanned - scanned,
- sc->nr_reclaimed - reclaimed);
-
} while ((memcg = mem_cgroup_iter(target_memcg, memcg, NULL)));
}

@@ -6111,7 +6098,7 @@ static void shrink_node(pg_data_t *pgdat, struct scan_control *sc)

/* Record the subtree's reclaim efficiency */
if (!sc->proactive)
- vmpressure(sc->gfp_mask, sc->target_mem_cgroup, true,
+ vmpressure(sc->gfp_mask, sc->target_mem_cgroup,
sc->nr_scanned - nr_scanned,
sc->nr_reclaimed - nr_reclaimed);

--
2.38.1

2022-12-06 19:54:35

by Eric Dumazet

[permalink] [raw]
Subject: Re: Low TCP throughput due to vmpressure with swap enabled

On Tue, Dec 6, 2022 at 8:00 PM Johannes Weiner <[email protected]> wrote:
>
> On Mon, Dec 05, 2022 at 04:50:46PM -0800, Ivan Babrou wrote:
> > And now I can see plenty of this:
> >
> > [ 108.156707][ T5175] socket pressure[2]: 4294673429
> > [ 108.157050][ T5175] socket pressure[2]: 4294673429
> > [ 108.157301][ T5175] socket pressure[2]: 4294673429
> > [ 108.157581][ T5175] socket pressure[2]: 4294673429
> > [ 108.157874][ T5175] socket pressure[2]: 4294673429
> > [ 108.158254][ T5175] socket pressure[2]: 4294673429
> >
> > I think the first result below is to blame:
> >
> > $ rg '.->socket_pressure' mm
> > mm/memcontrol.c
> > 5280: memcg->socket_pressure = jiffies;
> > 7198: memcg->socket_pressure = 0;
> > 7201: memcg->socket_pressure = 1;
> > 7211: memcg->socket_pressure = 0;
> > 7215: memcg->socket_pressure = 1;
>
> Hoo boy, that's a silly mistake indeed. Thanks for tracking it down.
>
> > While we set socket_pressure to either zero or one in
> > mem_cgroup_charge_skmem, it is still initialized to jiffies on memcg
> > creation. Zero seems like a more appropriate starting point. With that
> > change I see it working as expected with no TCP speed bumps. My
> > ebpf_exporter program also looks happy and reports zero clamps in my
> > brief testing.
>
> Excellent, now this behavior makes sense.
>
> > I also think we should downgrade socket_pressure from "unsigned long"
> > to "bool", as it only holds zero and one now.
>
> Sounds good to me!
>
> Attaching the updated patch below. If nobody has any objections, I'll
> add a proper changelog, reported-bys, sign-off etc and send it out.
>
> ---
> include/linux/memcontrol.h | 8 +++---
> include/linux/vmpressure.h | 7 ++---
> mm/memcontrol.c | 20 +++++++++----
> mm/vmpressure.c | 58 ++++++--------------------------------
> mm/vmscan.c | 15 +---------
> 5 files changed, 30 insertions(+), 78 deletions(-)
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index e1644a24009c..ef1c388be5b3 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -283,11 +283,11 @@ struct mem_cgroup {
> atomic_long_t memory_events[MEMCG_NR_MEMORY_EVENTS];
> atomic_long_t memory_events_local[MEMCG_NR_MEMORY_EVENTS];
>
> - unsigned long socket_pressure;
> + /* Socket memory allocations have failed */
> + bool socket_pressure;
>
> /* Legacy tcp memory accounting */
> bool tcpmem_active;
> - int tcpmem_pressure;
>
> #ifdef CONFIG_MEMCG_KMEM
> int kmemcg_id;
> @@ -1701,10 +1701,10 @@ void mem_cgroup_sk_alloc(struct sock *sk);
> void mem_cgroup_sk_free(struct sock *sk);
> static inline bool mem_cgroup_under_socket_pressure(struct mem_cgroup *memcg)
> {
> - if (!cgroup_subsys_on_dfl(memory_cgrp_subsys) && memcg->tcpmem_pressure)
> + if (!cgroup_subsys_on_dfl(memory_cgrp_subsys) && memcg->socket_pressure)

&& READ_ONCE(memcg->socket_pressure))

> return true;
> do {
> - if (time_before(jiffies, READ_ONCE(memcg->socket_pressure)))
> + if (memcg->socket_pressure)

if (READ_ONCE(...))

> return true;
> } while ((memcg = parent_mem_cgroup(memcg)));
> return false;
> diff --git a/include/linux/vmpressure.h b/include/linux/vmpressure.h
> index 6a2f51ebbfd3..20d93de37a17 100644
> --- a/include/linux/vmpressure.h
> +++ b/include/linux/vmpressure.h
> @@ -11,9 +11,6 @@
> #include <linux/eventfd.h>
>
> struct vmpressure {
> - unsigned long scanned;
> - unsigned long reclaimed;
> -
> unsigned long tree_scanned;
> unsigned long tree_reclaimed;
> /* The lock is used to keep the scanned/reclaimed above in sync. */
> @@ -30,7 +27,7 @@ struct vmpressure {
> struct mem_cgroup;
>
> #ifdef CONFIG_MEMCG
> -extern void vmpressure(gfp_t gfp, struct mem_cgroup *memcg, bool tree,
> +extern void vmpressure(gfp_t gfp, struct mem_cgroup *memcg,
> unsigned long scanned, unsigned long reclaimed);
> extern void vmpressure_prio(gfp_t gfp, struct mem_cgroup *memcg, int prio);
>
> @@ -44,7 +41,7 @@ extern int vmpressure_register_event(struct mem_cgroup *memcg,
> extern void vmpressure_unregister_event(struct mem_cgroup *memcg,
> struct eventfd_ctx *eventfd);
> #else
> -static inline void vmpressure(gfp_t gfp, struct mem_cgroup *memcg, bool tree,
> +static inline void vmpressure(gfp_t gfp, struct mem_cgroup *memcg,
> unsigned long scanned, unsigned long reclaimed) {}
> static inline void vmpressure_prio(gfp_t gfp, struct mem_cgroup *memcg,
> int prio) {}
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 2d8549ae1b30..0d4b9dbe775a 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -5277,7 +5277,6 @@ static struct mem_cgroup *mem_cgroup_alloc(void)
> vmpressure_init(&memcg->vmpressure);
> INIT_LIST_HEAD(&memcg->event_list);
> spin_lock_init(&memcg->event_list_lock);
> - memcg->socket_pressure = jiffies;
> #ifdef CONFIG_MEMCG_KMEM
> memcg->kmemcg_id = -1;
> INIT_LIST_HEAD(&memcg->objcg_list);
> @@ -7195,10 +7194,10 @@ bool mem_cgroup_charge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages,
> struct page_counter *fail;
>
> if (page_counter_try_charge(&memcg->tcpmem, nr_pages, &fail)) {
> - memcg->tcpmem_pressure = 0;

Orthogonal to your patch, but:

Maybe avoid touching this cache line too often and use READ/WRITE_ONCE() ?

if (READ_ONCE(memcg->socket_pressure))
WRITE_ONCE(memcg->socket_pressure, false);


> + memcg->socket_pressure = false;
> return true;
> }
> - memcg->tcpmem_pressure = 1;
> + memcg->socket_pressure = true;

Same remark.

> if (gfp_mask & __GFP_NOFAIL) {
> page_counter_charge(&memcg->tcpmem, nr_pages);
> return true;
> @@ -7206,12 +7205,21 @@ bool mem_cgroup_charge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages,
> return false;
> }
>
> - if (try_charge(memcg, gfp_mask, nr_pages) == 0) {
> - mod_memcg_state(memcg, MEMCG_SOCK, nr_pages);
> - return true;
> + if (try_charge(memcg, gfp_mask & ~__GFP_NOFAIL, nr_pages) == 0) {
> + memcg->socket_pressure = false;

same remark.

> + goto success;
> + }
> + memcg->socket_pressure = true;

same remark.

> + if (gfp_mask & __GFP_NOFAIL) {
> + try_charge(memcg, gfp_mask, nr_pages);
> + goto success;
> }
>
> return false;
> +
> +success:
> + mod_memcg_state(memcg, MEMCG_SOCK, nr_pages);
> + return true;
> }
>
> /**

2022-12-06 21:19:01

by Johannes Weiner

[permalink] [raw]
Subject: Re: Low TCP throughput due to vmpressure with swap enabled

On Tue, Dec 06, 2022 at 08:13:50PM +0100, Eric Dumazet wrote:
> On Tue, Dec 6, 2022 at 8:00 PM Johannes Weiner <[email protected]> wrote:
> > @@ -1701,10 +1701,10 @@ void mem_cgroup_sk_alloc(struct sock *sk);
> > void mem_cgroup_sk_free(struct sock *sk);
> > static inline bool mem_cgroup_under_socket_pressure(struct mem_cgroup *memcg)
> > {
> > - if (!cgroup_subsys_on_dfl(memory_cgrp_subsys) && memcg->tcpmem_pressure)
> > + if (!cgroup_subsys_on_dfl(memory_cgrp_subsys) && memcg->socket_pressure)
>
> && READ_ONCE(memcg->socket_pressure))
>
> > return true;
> > do {
> > - if (time_before(jiffies, READ_ONCE(memcg->socket_pressure)))
> > + if (memcg->socket_pressure)
>
> if (READ_ONCE(...))

Good point, I'll add those.

> > @@ -7195,10 +7194,10 @@ bool mem_cgroup_charge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages,
> > struct page_counter *fail;
> >
> > if (page_counter_try_charge(&memcg->tcpmem, nr_pages, &fail)) {
> > - memcg->tcpmem_pressure = 0;
>
> Orthogonal to your patch, but:
>
> Maybe avoid touching this cache line too often and use READ/WRITE_ONCE() ?
>
> if (READ_ONCE(memcg->socket_pressure))
> WRITE_ONCE(memcg->socket_pressure, false);

Ah, that's a good idea.

I think it'll be fine in the failure case, since that's associated
with OOM and total performance breakdown anyway.

But certainly, in the common case of the charge succeeding, we should
not keep hammering false into that variable over and over.

How about the delta below? I also flipped the branches around to keep
the common path at the first indentation level, hopefully making that
a bit clearer too.

Thanks for taking a look, Eric!

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index ef1c388be5b3..13ae10116895 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -1701,10 +1701,11 @@ void mem_cgroup_sk_alloc(struct sock *sk);
void mem_cgroup_sk_free(struct sock *sk);
static inline bool mem_cgroup_under_socket_pressure(struct mem_cgroup *memcg)
{
- if (!cgroup_subsys_on_dfl(memory_cgrp_subsys) && memcg->socket_pressure)
+ if (!cgroup_subsys_on_dfl(memory_cgrp_subsys) &&
+ READ_ONCE(memcg->socket_pressure))
return true;
do {
- if (memcg->socket_pressure)
+ if (READ_ONCE(memcg->socket_pressure))
return true;
} while ((memcg = parent_mem_cgroup(memcg)));
return false;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 0d4b9dbe775a..96c4ec0f11ca 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -7193,31 +7193,29 @@ bool mem_cgroup_charge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages,
if (!cgroup_subsys_on_dfl(memory_cgrp_subsys)) {
struct page_counter *fail;

- if (page_counter_try_charge(&memcg->tcpmem, nr_pages, &fail)) {
- memcg->socket_pressure = false;
- return true;
+ if (!page_counter_try_charge(&memcg->tcpmem, nr_pages, &fail)) {
+ WRITE_ONCE(memcg->socket_pressure, true);
+ if (gfp_mask & __GFP_NOFAIL) {
+ page_counter_charge(&memcg->tcpmem, nr_pages);
+ return true;
+ }
+ return false;
}
- memcg->socket_pressure = true;
+ if (unlikely(READ_ONCE(memcg->socket_pressure)))
+ WRITE_ONCE(memcg->socket_pressure, false);
+ }
+
+ if (try_charge(memcg, gfp_mask & ~__GFP_NOFAIL, nr_pages) < 0) {
+ WRITE_ONCE(memcg->socket_pressure, true);
if (gfp_mask & __GFP_NOFAIL) {
- page_counter_charge(&memcg->tcpmem, nr_pages);
+ try_charge(memcg, gfp_mask, nr_pages);
+ mod_memcg_state(memcg, MEMCG_SOCK, nr_pages);
return true;
}
return false;
}
-
- if (try_charge(memcg, gfp_mask & ~__GFP_NOFAIL, nr_pages) == 0) {
- memcg->socket_pressure = false;
- goto success;
- }
- memcg->socket_pressure = true;
- if (gfp_mask & __GFP_NOFAIL) {
- try_charge(memcg, gfp_mask, nr_pages);
- goto success;
- }
-
- return false;
-
-success:
+ if (unlikely(READ_ONCE(memcg->socket_pressure)))
+ WRITE_ONCE(memcg->socket_pressure, false);
mod_memcg_state(memcg, MEMCG_SOCK, nr_pages);
return true;
}

2022-12-06 23:25:49

by Shakeel Butt

[permalink] [raw]
Subject: Re: Low TCP throughput due to vmpressure with swap enabled

On Tue, Dec 06, 2022 at 09:51:01PM +0100, Johannes Weiner wrote:
> On Tue, Dec 06, 2022 at 08:13:50PM +0100, Eric Dumazet wrote:
> > On Tue, Dec 6, 2022 at 8:00 PM Johannes Weiner <[email protected]> wrote:
> > > @@ -1701,10 +1701,10 @@ void mem_cgroup_sk_alloc(struct sock *sk);
> > > void mem_cgroup_sk_free(struct sock *sk);
> > > static inline bool mem_cgroup_under_socket_pressure(struct mem_cgroup *memcg)
> > > {
> > > - if (!cgroup_subsys_on_dfl(memory_cgrp_subsys) && memcg->tcpmem_pressure)
> > > + if (!cgroup_subsys_on_dfl(memory_cgrp_subsys) && memcg->socket_pressure)
> >
> > && READ_ONCE(memcg->socket_pressure))
> >
> > > return true;
> > > do {
> > > - if (time_before(jiffies, READ_ONCE(memcg->socket_pressure)))
> > > + if (memcg->socket_pressure)
> >
> > if (READ_ONCE(...))
>
> Good point, I'll add those.
>
> > > @@ -7195,10 +7194,10 @@ bool mem_cgroup_charge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages,
> > > struct page_counter *fail;
> > >
> > > if (page_counter_try_charge(&memcg->tcpmem, nr_pages, &fail)) {
> > > - memcg->tcpmem_pressure = 0;
> >
> > Orthogonal to your patch, but:
> >
> > Maybe avoid touching this cache line too often and use READ/WRITE_ONCE() ?
> >
> > if (READ_ONCE(memcg->socket_pressure))
> > WRITE_ONCE(memcg->socket_pressure, false);
>
> Ah, that's a good idea.
>
> I think it'll be fine in the failure case, since that's associated
> with OOM and total performance breakdown anyway.
>
> But certainly, in the common case of the charge succeeding, we should
> not keep hammering false into that variable over and over.
>
> How about the delta below? I also flipped the branches around to keep
> the common path at the first indentation level, hopefully making that
> a bit clearer too.
>
> Thanks for taking a look, Eric!
>

I still think we should not set a persistent socket pressure state on an
unsuccessful charge that only gets reset by a successful charge. I think
the better approach would be to limit the pressure state to a time
window, same as today, but set it on the charge path. Something like
below:


diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index d3c8203cab6c..7bd88d443c42 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -287,7 +287,6 @@ struct mem_cgroup {

/* Legacy tcp memory accounting */
bool tcpmem_active;
- int tcpmem_pressure;

#ifdef CONFIG_MEMCG_KMEM
int kmemcg_id;
@@ -1712,8 +1711,6 @@ void mem_cgroup_sk_alloc(struct sock *sk);
void mem_cgroup_sk_free(struct sock *sk);
static inline bool mem_cgroup_under_socket_pressure(struct mem_cgroup *memcg)
{
- if (!cgroup_subsys_on_dfl(memory_cgrp_subsys) && memcg->tcpmem_pressure)
- return true;
do {
if (time_before(jiffies, READ_ONCE(memcg->socket_pressure)))
return true;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 48c44229cf47..290444bcab84 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5286,7 +5286,6 @@ static struct mem_cgroup *mem_cgroup_alloc(void)
vmpressure_init(&memcg->vmpressure);
INIT_LIST_HEAD(&memcg->event_list);
spin_lock_init(&memcg->event_list_lock);
- memcg->socket_pressure = jiffies;
#ifdef CONFIG_MEMCG_KMEM
memcg->kmemcg_id = -1;
INIT_LIST_HEAD(&memcg->objcg_list);
@@ -7252,10 +7251,12 @@ bool mem_cgroup_charge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages,
struct page_counter *fail;

if (page_counter_try_charge(&memcg->tcpmem, nr_pages, &fail)) {
- memcg->tcpmem_pressure = 0;
+ if (READ_ONCE(memcg->socket_pressure))
+ WRITE_ONCE(memcg->socket_pressure, 0);
return true;
}
- memcg->tcpmem_pressure = 1;
+ if (READ_ONCE(memcg->socket_pressure) < jiffies + HZ)
+ WRITE_ONCE(memcg->socket_pressure, jiffies + HZ);
if (gfp_mask & __GFP_NOFAIL) {
page_counter_charge(&memcg->tcpmem, nr_pages);
return true;
@@ -7263,12 +7264,21 @@ bool mem_cgroup_charge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages,
return false;
}

- if (try_charge(memcg, gfp_mask, nr_pages) == 0) {
- mod_memcg_state(memcg, MEMCG_SOCK, nr_pages);
- return true;
+ if (try_charge(memcg, gfp_mask & ~__GFP_NOFAIL, nr_pages) < 0) {
+ if (READ_ONCE(memcg->socket_pressure) < jiffies + HZ)
+ WRITE_ONCE(memcg->socket_pressure, jiffies + HZ);
+ if (gfp_mask & __GFP_NOFAIL) {
+ try_charge(memcg, gfp_mask, nr_pages);
+ goto out;
+ }
+ return false;
}

- return false;
+ if (READ_ONCE(memcg->socket_pressure))
+ WRITE_ONCE(memcg->socket_pressure, 0);
+out:
+ mod_memcg_state(memcg, MEMCG_SOCK, nr_pages);
+ return true;
}

/**

2022-12-07 13:12:33

by Johannes Weiner

[permalink] [raw]
Subject: Re: Low TCP throughput due to vmpressure with swap enabled

On Tue, Dec 06, 2022 at 11:10:49PM +0000, Shakeel Butt wrote:
> On Tue, Dec 06, 2022 at 09:51:01PM +0100, Johannes Weiner wrote:
> > On Tue, Dec 06, 2022 at 08:13:50PM +0100, Eric Dumazet wrote:
> > > On Tue, Dec 6, 2022 at 8:00 PM Johannes Weiner <[email protected]> wrote:
> > > > @@ -1701,10 +1701,10 @@ void mem_cgroup_sk_alloc(struct sock *sk);
> > > > void mem_cgroup_sk_free(struct sock *sk);
> > > > static inline bool mem_cgroup_under_socket_pressure(struct mem_cgroup *memcg)
> > > > {
> > > > - if (!cgroup_subsys_on_dfl(memory_cgrp_subsys) && memcg->tcpmem_pressure)
> > > > + if (!cgroup_subsys_on_dfl(memory_cgrp_subsys) && memcg->socket_pressure)
> > >
> > > && READ_ONCE(memcg->socket_pressure))
> > >
> > > > return true;
> > > > do {
> > > > - if (time_before(jiffies, READ_ONCE(memcg->socket_pressure)))
> > > > + if (memcg->socket_pressure)
> > >
> > > if (READ_ONCE(...))
> >
> > Good point, I'll add those.
> >
> > > > @@ -7195,10 +7194,10 @@ bool mem_cgroup_charge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages,
> > > > struct page_counter *fail;
> > > >
> > > > if (page_counter_try_charge(&memcg->tcpmem, nr_pages, &fail)) {
> > > > - memcg->tcpmem_pressure = 0;
> > >
> > > Orthogonal to your patch, but:
> > >
> > > Maybe avoid touching this cache line too often and use READ/WRITE_ONCE() ?
> > >
> > > if (READ_ONCE(memcg->socket_pressure))
> > > WRITE_ONCE(memcg->socket_pressure, false);
> >
> > Ah, that's a good idea.
> >
> > I think it'll be fine in the failure case, since that's associated
> > with OOM and total performance breakdown anyway.
> >
> > But certainly, in the common case of the charge succeeding, we should
> > not keep hammering false into that variable over and over.
> >
> > How about the delta below? I also flipped the branches around to keep
> > the common path at the first indentation level, hopefully making that
> > a bit clearer too.
> >
> > Thanks for taking a look, Eric!
> >
>
> I still think we should not put a persistent state of socket pressure on
> unsuccessful charge which will only get reset on successful charge. I
> think the better approach would be to limit the pressure state by time
> window same as today but set it on charge path. Something like below:

I don't mind doing that if necessary, but looking at the code I don't
see why it would be.

The socket code sets protocol memory pressure on allocations that run
into limits, and clears it on allocations that succeed and on frees.
Why shouldn't we do the same thing for memcg?

@@ -7237,6 +7235,9 @@ void mem_cgroup_uncharge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages)
mod_memcg_state(memcg, MEMCG_SOCK, -nr_pages);

refill_stock(memcg, nr_pages);
+
+ if (unlikely(READ_ONCE(memcg->socket_pressure)))
+ WRITE_ONCE(memcg->socket_pressure, false);
}
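
For reference, the socket-core behavior being mirrored looks roughly
like this (paraphrased from __sk_mem_raise_allocated() in
net/core/sock.c, heavily simplified):

        /* allocation fits comfortably under the limits: clear pressure */
        if (allocated <= sk_prot_mem_limits(sk, 0)) {
                sk_leave_memory_pressure(sk);
                return 1;
        }

        /* allocation ran into the pressure limit: set pressure */
        if (allocated > sk_prot_mem_limits(sk, 1))
                sk_enter_memory_pressure(sk);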

2022-12-08 01:29:45

by Shakeel Butt

[permalink] [raw]
Subject: Re: Low TCP throughput due to vmpressure with swap enabled

On Wed, Dec 07, 2022 at 01:53:00PM +0100, Johannes Weiner wrote:
[...]
>
> I don't mind doing that if necessary, but looking at the code I don't
> see why it would be.
>
> The socket code sets protocol memory pressure on allocations that run
> into limits, and clears pressure on allocations that succeed and
> frees. Why shouldn't we do the same thing for memcg?
>

I think you are right. Let's go with whatever you have for now as this
will reduce vmpressure dependency.

However I think there are still open issues that need to be addressed
in the future:

1. Unlike TCP memory accounting, memcg has to account/charge user
memory, kernel memory and tcp/netmem. So, it might make more sense to
enter the pressure state in the try_charge_memcg() function. This means
charging of user memory or kernel memory could also put the memcg under
socket pressure.

2. On RX path, the memcg charge can succeed due to GFP_ATOMIC flag.
Should we reset the pressure state in that case?

3. On uncharge path, unlike network stack, should we unconditionally
reset the socket pressure state?

Shakeel