2023-09-20 17:35:45

by Abel Wu

[permalink] [raw]
Subject: [PATCH net-next 1/2] sock: Code cleanup on __sk_mem_raise_allocated()

Code cleanup for both better simplicity and readability.
No functional change intended.

Signed-off-by: Abel Wu <[email protected]>
Acked-by: Shakeel Butt <[email protected]>
---
net/core/sock.c | 22 ++++++++++++----------
1 file changed, 12 insertions(+), 10 deletions(-)

diff --git a/net/core/sock.c b/net/core/sock.c
index a5995750c5c5..379eb8b65562 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -3046,17 +3046,19 @@ EXPORT_SYMBOL(sk_wait_data);
*/
int __sk_mem_raise_allocated(struct sock *sk, int size, int amt, int kind)
{
- bool memcg_charge = mem_cgroup_sockets_enabled && sk->sk_memcg;
+ struct mem_cgroup *memcg = mem_cgroup_sockets_enabled ? sk->sk_memcg : NULL;
struct proto *prot = sk->sk_prot;
- bool charged = true;
+ bool charged = false;
long allocated;

sk_memory_allocated_add(sk, amt);
allocated = sk_memory_allocated(sk);
- if (memcg_charge &&
- !(charged = mem_cgroup_charge_skmem(sk->sk_memcg, amt,
- gfp_memcg_charge())))
- goto suppress_allocation;
+
+ if (memcg) {
+ if (!mem_cgroup_charge_skmem(memcg, amt, gfp_memcg_charge()))
+ goto suppress_allocation;
+ charged = true;
+ }

/* Under limit. */
if (allocated <= sk_prot_mem_limits(sk, 0)) {
@@ -3111,8 +3113,8 @@ int __sk_mem_raise_allocated(struct sock *sk, int size, int amt, int kind)
*/
if (sk->sk_wmem_queued + size >= sk->sk_sndbuf) {
/* Force charge with __GFP_NOFAIL */
- if (memcg_charge && !charged) {
- mem_cgroup_charge_skmem(sk->sk_memcg, amt,
+ if (memcg && !charged) {
+ mem_cgroup_charge_skmem(memcg, amt,
gfp_memcg_charge() | __GFP_NOFAIL);
}
return 1;
@@ -3124,8 +3126,8 @@ int __sk_mem_raise_allocated(struct sock *sk, int size, int amt, int kind)

sk_memory_allocated_sub(sk, amt);

- if (memcg_charge && charged)
- mem_cgroup_uncharge_skmem(sk->sk_memcg, amt);
+ if (charged)
+ mem_cgroup_uncharge_skmem(memcg, amt);

return 0;
}
--
2.37.3


2023-09-21 18:58:32

by Abel Wu

[permalink] [raw]
Subject: [PATCH net-next 2/2] sock: Fix improper heuristic on raising memory

Before sockets became aware of net-memcg's memory pressure since
commit e1aab161e013 ("socket: initial cgroup code."), the memory
usage would be granted to raise if below average even when under
protocol's pressure. This provides fairness among the sockets of
same protocol.

That commit changes this because the heuristic will also be
effective when only memcg is under pressure which makes no sense.
Fix this by skipping this heuristic when under memcg pressure.

Fixes: e1aab161e013 ("socket: initial cgroup code.")
Signed-off-by: Abel Wu <[email protected]>
---
net/core/sock.c | 10 +++++++++-
1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/net/core/sock.c b/net/core/sock.c
index 379eb8b65562..ef5cf6250f17 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -3093,8 +3093,16 @@ int __sk_mem_raise_allocated(struct sock *sk, int size, int amt, int kind)
if (sk_has_memory_pressure(sk)) {
u64 alloc;

- if (!sk_under_memory_pressure(sk))
+ if (memcg && mem_cgroup_under_socket_pressure(memcg))
+ goto suppress_allocation;
+
+ if (!sk_under_global_memory_pressure(sk))
return 1;
+
+ /* Trying to be fair among all the sockets under the
+ * protocol's memory pressure, by allowing the ones
+ * that below average usage to raise.
+ */
alloc = sk_sockets_allocated_read_positive(sk);
if (sk_prot_mem_limits(sk, 2) > alloc *
sk_mem_pages(sk->sk_wmem_queued +
--
2.37.3

2023-09-22 02:22:14

by Shakeel Butt

[permalink] [raw]
Subject: Re: [PATCH net-next 2/2] sock: Fix improper heuristic on raising memory

On Wed, Sep 20, 2023 at 09:25:41PM +0800, Abel Wu wrote:
> Before sockets became aware of net-memcg's memory pressure since
> commit e1aab161e013 ("socket: initial cgroup code."), the memory
> usage would be granted to raise if below average even when under
> protocol's pressure. This provides fairness among the sockets of
> same protocol.
>
> That commit changes this because the heuristic will also be
> effective when only memcg is under pressure which makes no sense.
> Fix this by skipping this heuristic when under memcg pressure.
>
> Fixes: e1aab161e013 ("socket: initial cgroup code.")
> Signed-off-by: Abel Wu <[email protected]>
> ---
> net/core/sock.c | 10 +++++++++-
> 1 file changed, 9 insertions(+), 1 deletion(-)
>
> diff --git a/net/core/sock.c b/net/core/sock.c
> index 379eb8b65562..ef5cf6250f17 100644
> --- a/net/core/sock.c
> +++ b/net/core/sock.c
> @@ -3093,8 +3093,16 @@ int __sk_mem_raise_allocated(struct sock *sk, int size, int amt, int kind)
> if (sk_has_memory_pressure(sk)) {
> u64 alloc;
>
> - if (!sk_under_memory_pressure(sk))
> + if (memcg && mem_cgroup_under_socket_pressure(memcg))
> + goto suppress_allocation;
> +
> + if (!sk_under_global_memory_pressure(sk))
> return 1;

I am onboard with replacing sk_under_memory_pressure() with
sk_under_global_memory_pressure(). However suppressing on memcg pressure
is a behavior change from status quo and need more thought and testing.

I think there are three options for this hunk:

1. proposed patch
2. Consider memcg pressure only for !in_softirq().
3. Don't consider memcg pressure at all.

All three options are behavior change from the status quo but with
different risk levels. (1) may reintroduce the regression fixed by
720ca52bcef22 ("net-memcg: avoid stalls when under memory pressure").
(2) is more inlined with 720ca52bcef22. (3) has the risk to making memcg
limits ineffective.

IMHO we should go with (2) as there is already a precedence in
720ca52bcef22.

thanks,
Shakeel

2023-09-22 16:00:21

by Abel Wu

[permalink] [raw]
Subject: Re: [PATCH net-next 2/2] sock: Fix improper heuristic on raising memory

On 9/22/23 4:36 PM, Abel Wu wrote:
> On 9/22/23 3:01 AM, Shakeel Butt wrote:
>> On Wed, Sep 20, 2023 at 09:25:41PM +0800, Abel Wu wrote:
>>> Before sockets became aware of net-memcg's memory pressure since
>>> commit e1aab161e013 ("socket: initial cgroup code."), the memory
>>> usage would be granted to raise if below average even when under
>>> protocol's pressure. This provides fairness among the sockets of
>>> same protocol.
>>>
>>> That commit changes this because the heuristic will also be
>>> effective when only memcg is under pressure which makes no sense.
>>> Fix this by skipping this heuristic when under memcg pressure.
>>>
>>> Fixes: e1aab161e013 ("socket: initial cgroup code.")
>>> Signed-off-by: Abel Wu <[email protected]>
>>> ---
>>>   net/core/sock.c | 10 +++++++++-
>>>   1 file changed, 9 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/net/core/sock.c b/net/core/sock.c
>>> index 379eb8b65562..ef5cf6250f17 100644
>>> --- a/net/core/sock.c
>>> +++ b/net/core/sock.c
>>> @@ -3093,8 +3093,16 @@ int __sk_mem_raise_allocated(struct sock *sk,
>>> int size, int amt, int kind)
>>>       if (sk_has_memory_pressure(sk)) {
>>>           u64 alloc;
>>> -        if (!sk_under_memory_pressure(sk))
>>> +        if (memcg && mem_cgroup_under_socket_pressure(memcg))
>>> +            goto suppress_allocation;
>>> +
>>> +        if (!sk_under_global_memory_pressure(sk))
>>>               return 1;
>>
>> I am onboard with replacing sk_under_memory_pressure() with
>> sk_under_global_memory_pressure(). However suppressing on memcg pressure
>> is a behavior change from status quo and need more thought and testing.
>>
>> I think there are three options for this hunk:
>>
>> 1. proposed patch
>> 2. Consider memcg pressure only for !in_softirq().
>> 3. Don't consider memcg pressure at all.
>>
>> All three options are behavior change from the status quo but with
>> different risk levels. (1) may reintroduce the regression fixed by
>> 720ca52bcef22 ("net-memcg: avoid stalls when under memory pressure").
>
> Just for the record, it is same for the current upstream implementation
> if the socket reaches average usage. Taking option 2 will fix this too.
>
>> (2) is more inlined with 720ca52bcef22. (3) has the risk to making memcg
>> limits ineffective.
>>
>> IMHO we should go with (2) as there is already a precedence in
>> 720ca52bcef22.
>
> Yes, I agree. Actually applying option(2) would make this patch quite
> similar to the previous version[a], except the below part:
>
>      /* Under limit. */
>      if (allocated <= sk_prot_mem_limits(sk, 0)) {
>          sk_leave_memory_pressure(sk);
> -        return 1;
> +        if (!under_memcg_pressure)
> +            return 1;
>      }

After a second thought, it is still vague to me about the position
the memcg pressure should be in socket memory allocation. It lacks
convincing design. I think the above hunk helps, but not much.

I wonder if we should take option (3) first. Thoughts?

Thanks,
Abel

>
> My original thought is to inherit the behavior of tcpmem pressure.
> There are also 3 levels of memcg pressure named low/medium/critical,
> but considering that the 'low' level is too much conservative for
> socket allocation, I made the following match:
>
>     PROTOCOL    MEMCG        ACTION
>     -----------------------------------------------------
>     low        <medium        allow allocation
>     pressure    medium        be more conservative
>     high        critical    throttle
>
> which also seems align with the design[b] of memcg pressure. Anyway
> I will take option (2) and post v2.
>
> Thanks & Best,
>     Abel
>
> [a]
> https://lore.kernel.org/lkml/[email protected]/
> [b]
> https://docs.kernel.org/admin-guide/cgroup-v1/memory.html#memory-pressure

2023-09-22 16:48:24

by Abel Wu

[permalink] [raw]
Subject: Re: Re: [PATCH net-next 2/2] sock: Fix improper heuristic on raising memory

On 9/22/23 3:01 AM, Shakeel Butt wrote:
> On Wed, Sep 20, 2023 at 09:25:41PM +0800, Abel Wu wrote:
>> Before sockets became aware of net-memcg's memory pressure since
>> commit e1aab161e013 ("socket: initial cgroup code."), the memory
>> usage would be granted to raise if below average even when under
>> protocol's pressure. This provides fairness among the sockets of
>> same protocol.
>>
>> That commit changes this because the heuristic will also be
>> effective when only memcg is under pressure which makes no sense.
>> Fix this by skipping this heuristic when under memcg pressure.
>>
>> Fixes: e1aab161e013 ("socket: initial cgroup code.")
>> Signed-off-by: Abel Wu <[email protected]>
>> ---
>> net/core/sock.c | 10 +++++++++-
>> 1 file changed, 9 insertions(+), 1 deletion(-)
>>
>> diff --git a/net/core/sock.c b/net/core/sock.c
>> index 379eb8b65562..ef5cf6250f17 100644
>> --- a/net/core/sock.c
>> +++ b/net/core/sock.c
>> @@ -3093,8 +3093,16 @@ int __sk_mem_raise_allocated(struct sock *sk, int size, int amt, int kind)
>> if (sk_has_memory_pressure(sk)) {
>> u64 alloc;
>>
>> - if (!sk_under_memory_pressure(sk))
>> + if (memcg && mem_cgroup_under_socket_pressure(memcg))
>> + goto suppress_allocation;
>> +
>> + if (!sk_under_global_memory_pressure(sk))
>> return 1;
>
> I am onboard with replacing sk_under_memory_pressure() with
> sk_under_global_memory_pressure(). However suppressing on memcg pressure
> is a behavior change from status quo and need more thought and testing.
>
> I think there are three options for this hunk:
>
> 1. proposed patch
> 2. Consider memcg pressure only for !in_softirq().
> 3. Don't consider memcg pressure at all.
>
> All three options are behavior change from the status quo but with
> different risk levels. (1) may reintroduce the regression fixed by
> 720ca52bcef22 ("net-memcg: avoid stalls when under memory pressure").

Just for the record, it is same for the current upstream implementation
if the socket reaches average usage. Taking option 2 will fix this too.

> (2) is more inlined with 720ca52bcef22. (3) has the risk to making memcg
> limits ineffective.
>
> IMHO we should go with (2) as there is already a precedence in
> 720ca52bcef22.

Yes, I agree. Actually applying option(2) would make this patch quite
similar to the previous version[a], except the below part:

/* Under limit. */
if (allocated <= sk_prot_mem_limits(sk, 0)) {
sk_leave_memory_pressure(sk);
- return 1;
+ if (!under_memcg_pressure)
+ return 1;
}

My original thought is to inherit the behavior of tcpmem pressure.
There are also 3 levels of memcg pressure named low/medium/critical,
but considering that the 'low' level is too much conservative for
socket allocation, I made the following match:

PROTOCOL MEMCG ACTION
-----------------------------------------------------
low <medium allow allocation
pressure medium be more conservative
high critical throttle

which also seems align with the design[b] of memcg pressure. Anyway
I will take option (2) and post v2.

Thanks & Best,
Abel

[a]
https://lore.kernel.org/lkml/[email protected]/
[b]
https://docs.kernel.org/admin-guide/cgroup-v1/memory.html#memory-pressure

2023-09-25 04:06:40

by Shakeel Butt

[permalink] [raw]
Subject: Re: [PATCH net-next 2/2] sock: Fix improper heuristic on raising memory

On Fri, Sep 22, 2023 at 06:10:06PM +0800, Abel Wu wrote:
[...]
>
> After a second thought, it is still vague to me about the position
> the memcg pressure should be in socket memory allocation. It lacks
> convincing design. I think the above hunk helps, but not much.
>
> I wonder if we should take option (3) first. Thoughts?
>

Let's take a step further. Let's decouple the memcg accounting and
global skmem accounting. __sk_mem_raise_allocated is already very hard
to reason. There are couple of heuristics in it which may or may not
apply to both accounting infrastructures.

Let's explicitly document what heurisitics allows to forcefully succeed
the allocations i.e. irrespective of pressure or over limit for both
accounting infras. I think decoupling them would make the flow of the
code very clear.

There are three heuristics:

1. minimum buffer size even under pressure.

2. allow allocation for a socket whose usage is below average of the
system.

3. socket is over its sndbuf.

Let's discuss which heuristic applies to which accounting infra and
under which state (under pressure or over limit).

thanks,
Shakeel

2023-10-03 12:49:35

by Abel Wu

[permalink] [raw]
Subject: Re: Re: [PATCH net-next 2/2] sock: Fix improper heuristic on raising memory

On 9/24/23 3:28 PM, Shakeel Butt wrote:
> On Fri, Sep 22, 2023 at 06:10:06PM +0800, Abel Wu wrote:
> [...]
>>
>> After a second thought, it is still vague to me about the position
>> the memcg pressure should be in socket memory allocation. It lacks
>> convincing design. I think the above hunk helps, but not much.
>>
>> I wonder if we should take option (3) first. Thoughts?
>>
>
> Let's take a step further. Let's decouple the memcg accounting and
> global skmem accounting. __sk_mem_raise_allocated is already very hard
> to reason. There are couple of heuristics in it which may or may not
> apply to both accounting infrastructures.
>
> Let's explicitly document what heurisitics allows to forcefully succeed
> the allocations i.e. irrespective of pressure or over limit for both
> accounting infras. I think decoupling them would make the flow of the
> code very clear.

I can't agree more.

>
> There are three heuristics:

I found all of them were first introduced in linux-2.4.0-test7pre1 for
TCP only, and then migrated to socket core in linux-2.6.8-rc1 without
functional change.

>
> 1. minimum buffer size even under pressure.

This is required by RFC 7323 (TCP Extensions for High Performance) to
make features like Window Scale option work as expected, and should be
succeeded under global pressure by tcp_{r,w}mem's definition. And IMHO
for same reason, it should also be succeeded under memcg pressure, or
else workloads might suffer performance drop due to bottleneck on
network.

The allocation must not be succeeded either exceed global or memcg's
hard limit, or else a DoS attack can be taken place by spawning lots
of sockets that are under minimum buffer size.

>
> 2. allow allocation for a socket whose usage is below average of the
> system.

Since 'average' is within the scope of global accounting, this one
only makes sense under global memory pressure. Actually this exists
before cgroup was born, hence doesn't take memcg into consideration.

While OTOH the intention of throttling under memcg pressure is to
relief the memcg from heavy reclaim pressure, this heuristic does no
help. And there also seems to be no reason to succeed the allocation
when global or memcg's hard limit is exceeded.

>
> 3. socket is over its sndbuf.

TBH I don't get its point..

>
> Let's discuss which heuristic applies to which accounting infra and
> under which state (under pressure or over limit).

I will follow your suggestion to post a patch to explicitly document
the behaviors once things are cleared.

Thanks,
Abel

2023-10-11 03:04:43

by Abel Wu

[permalink] [raw]
Subject: Re: [PATCH net-next 2/2] sock: Fix improper heuristic on raising memory

Gentle ping :)

在 10/3/23 8:49 PM, Abel Wu Wrote:
> On 9/24/23 3:28 PM, Shakeel Butt wrote:
>> On Fri, Sep 22, 2023 at 06:10:06PM +0800, Abel Wu wrote:
>> [...]
>>>
>>> After a second thought, it is still vague to me about the position
>>> the memcg pressure should be in socket memory allocation. It lacks
>>> convincing design. I think the above hunk helps, but not much.
>>>
>>> I wonder if we should take option (3) first. Thoughts?
>>>
>>
>> Let's take a step further. Let's decouple the memcg accounting and
>> global skmem accounting. __sk_mem_raise_allocated is already very hard
>> to reason. There are couple of heuristics in it which may or may not
>> apply to both accounting infrastructures.
>>
>> Let's explicitly document what heurisitics allows to forcefully succeed
>> the allocations i.e. irrespective of pressure or over limit for both
>> accounting infras. I think decoupling them would make the flow of the
>> code very clear.
>
> I can't agree more.
>
>>
>> There are three heuristics:
>
> I found all of them were first introduced in linux-2.4.0-test7pre1 for
> TCP only, and then migrated to socket core in linux-2.6.8-rc1 without
> functional change.
>
>>
>> 1. minimum buffer size even under pressure.
>
> This is required by RFC 7323 (TCP Extensions for High Performance) to
> make features like Window Scale option work as expected, and should be
> succeeded under global pressure by tcp_{r,w}mem's definition. And IMHO
> for same reason, it should also be succeeded under memcg pressure, or
> else workloads might suffer performance drop due to bottleneck on
> network.
>
> The allocation must not be succeeded either exceed global or memcg's
> hard limit, or else a DoS attack can be taken place by spawning lots
> of sockets that are under minimum buffer size.
>
>>
>> 2. allow allocation for a socket whose usage is below average of the
>> system.
>
> Since 'average' is within the scope of global accounting, this one
> only makes sense under global memory pressure. Actually this exists
> before cgroup was born, hence doesn't take memcg into consideration.
>
> While OTOH the intention of throttling under memcg pressure is to
> relief the memcg from heavy reclaim pressure, this heuristic does no
> help. And there also seems to be no reason to succeed the allocation
> when global or memcg's hard limit is exceeded.
>
>>
>> 3. socket is over its sndbuf.
>
> TBH I don't get its point..
>
>>
>> Let's discuss which heuristic applies to which accounting infra and
>> under which state (under pressure or over limit).
>
> I will follow your suggestion to post a patch to explicitly document
> the behaviors once things are cleared.
>
> Thanks,
>     Abel

2023-10-13 01:09:36

by Shakeel Butt

[permalink] [raw]
Subject: Re: [PATCH net-next 2/2] sock: Fix improper heuristic on raising memory

On Tue, Oct 03, 2023 at 08:49:08PM +0800, Abel Wu wrote:
[...]
> > 1. minimum buffer size even under pressure.
>
> This is required by RFC 7323 (TCP Extensions for High Performance) to
> make features like Window Scale option work as expected, and should be
> succeeded under global pressure by tcp_{r,w}mem's definition. And IMHO
> for same reason, it should also be succeeded under memcg pressure, or
> else workloads might suffer performance drop due to bottleneck on
> network.
>
> The allocation must not be succeeded either exceed global or memcg's
> hard limit, or else a DoS attack can be taken place by spawning lots
> of sockets that are under minimum buffer size.
>

Sounds good.

> >
> > 2. allow allocation for a socket whose usage is below average of the
> > system.
>
> Since 'average' is within the scope of global accounting, this one
> only makes sense under global memory pressure. Actually this exists
> before cgroup was born, hence doesn't take memcg into consideration.
>
> While OTOH the intention of throttling under memcg pressure is to
> relief the memcg from heavy reclaim pressure, this heuristic does no
> help. And there also seems to be no reason to succeed the allocation
> when global or memcg's hard limit is exceeded.
>

Sounds good too.

> >
> > 3. socket is over its sndbuf.
>
> TBH I don't get its point..
>

So, this corresponds to following code in __sk_mem_raise_allocated()

if (kind == SK_MEM_SEND && sk->sk_type == SOCK_STREAM) {
sk_stream_moderate_sndbuf(sk);

/* Fail only if socket is _under_ its sndbuf.
* In this case we cannot block, so that we have to fail.
*/
if (sk->sk_wmem_queued + size >= sk->sk_sndbuf) {
/* Force charge with __GFP_NOFAIL */
if (memcg_charge && !charged) {
mem_cgroup_charge_skmem(sk->sk_memcg, amt,
gfp_memcg_charge() | __GFP_NOFAIL);
}
return 1;
}
}

Here we moderate the sk_sndbuf possibly half of sk_wmem_queued and thus
we always succeed unless user has done SO_SNDBUF on the socket in which
case it interacts with sk_stream_wait_memory() called in sendmsg.

I am not really able to make sense of the interaction between this code
and sk_stream_wait_memory() and will punt to networking experts to shed
some light.

Other than that I think we need to answer if we want to moderate the
sndbuf on memcg charge failure.

> >
> > Let's discuss which heuristic applies to which accounting infra and
> > under which state (under pressure or over limit).
>
> I will follow your suggestion to post a patch to explicitly document
> the behaviors once things are cleared.
>

Let's just post the patch and see what other folks comment as well.