We've caught several cgroup css refcounting issues on 4.15-rc7,
triggered from different release paths. We're using cgroup v2.
I've added a temporary per-memcg sockmem atomic counter
and found that it sometimes falls below 0. It was easy
to reproduce, so I was able to bisect the problem.
It was introduced by the commit 9f1c2674b328 ("net: memcontrol:
defer call to mem_cgroup_sk_alloc()"), which moved
the mem_cgroup_sk_alloc() call from the BH context
into inet_csk_accept().
The problem is that all the memory allocated before
mem_cgroup_sk_alloc() is charged to the socket,
but not to the memcg. So, when we release the socket,
we uncharge more than we've charged.
Fix this by charging the memcg with the amount of memory
already allocated, right after mem_cgroup_sk_alloc() in
inet_csk_accept().
Fixes: 9f1c2674b328 ("net: memcontrol: defer call to mem_cgroup_sk_alloc()")
Signed-off-by: Roman Gushchin <[email protected]>
Cc: Eric Dumazet <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: David S. Miller <[email protected]>
---
net/ipv4/inet_connection_sock.c | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
index 4ca46dc08e63..f439162c2ea2 100644
--- a/net/ipv4/inet_connection_sock.c
+++ b/net/ipv4/inet_connection_sock.c
@@ -434,6 +434,7 @@ struct sock *inet_csk_accept(struct sock *sk, int flags, int *err, bool kern)
struct request_sock *req;
struct sock *newsk;
int error;
+ long amt;
lock_sock(sk);
@@ -476,6 +477,10 @@ struct sock *inet_csk_accept(struct sock *sk, int flags, int *err, bool kern)
spin_unlock_bh(&queue->fastopenq.lock);
}
mem_cgroup_sk_alloc(newsk);
+ amt = sk_memory_allocated(newsk);
+ if (amt && newsk->sk_memcg)
+ mem_cgroup_charge_skmem(newsk->sk_memcg, amt);
+
out:
release_sock(sk);
if (req)
--
2.14.3
From: Roman Gushchin <[email protected]>
Date: Thu, 25 Jan 2018 00:19:11 +0000
> @@ -476,6 +477,10 @@ struct sock *inet_csk_accept(struct sock *sk, int flags, int *err, bool kern)
> spin_unlock_bh(&queue->fastopenq.lock);
> }
> mem_cgroup_sk_alloc(newsk);
> + amt = sk_memory_allocated(newsk);
> + if (amt && newsk->sk_memcg)
> + mem_cgroup_charge_skmem(newsk->sk_memcg, amt);
> +
This looks confusing to me.
sk_memory_allocated() is the total amount of memory used by all
sockets for a particular "struct proto", not just for that specific
socket.
Maybe I don't understand how this socket memcg stuff works, but it
seems like you should be looking instead at how much memory is
allocated to this specific socket.
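For reference, here is roughly what those helpers look like in
include/net/sock.h around this kernel version (paraphrased, not verbatim;
check the actual tree): sk_memory_allocated() reads the per-protocol
counter, while the per-socket reserve lives in sk->sk_forward_alloc and is
converted to pages with sk_mem_pages().

/* Paraphrase of include/net/sock.h (~4.15): this counter is shared by
 * every socket of the same protocol, not tracked per socket.
 */
static inline long sk_memory_allocated(const struct sock *sk)
{
	return atomic_long_read(sk->sk_prot->memory_allocated);
}

/* The per-socket forward-allocated bytes are in sk->sk_forward_alloc and
 * can be converted to pages (the unit mem_cgroup_charge_skmem() takes):
 */
static inline int sk_mem_pages(int amt)
{
	return (amt + SK_MEM_QUANTUM - 1) >> SK_MEM_QUANTUM_SHIFT;
}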
Thanks.
On Thu, Jan 25, 2018 at 12:03:02PM -0500, David Miller wrote:
> From: Roman Gushchin <[email protected]>
> Date: Thu, 25 Jan 2018 00:19:11 +0000
>
> > @@ -476,6 +477,10 @@ struct sock *inet_csk_accept(struct sock *sk, int flags, int *err, bool kern)
> > spin_unlock_bh(&queue->fastopenq.lock);
> > }
> > mem_cgroup_sk_alloc(newsk);
> > + amt = sk_memory_allocated(newsk);
> > + if (amt && newsk->sk_memcg)
> > + mem_cgroup_charge_skmem(newsk->sk_memcg, amt);
> > +
>
> This looks confusing to me.
>
> sk_memory_allocated() is the total amount of memory used by all
> sockets for a particular "struct proto", not just for that specific
> socket.
Oh, I see...
>
> Maybe I don't understand how this socket memcg stuff works, but it
> seems like you should be looking instead at how much memory is
> allocated to this specific socket.
Yes, this is what I wanted to do originally.
Let me find a proper way to do this.
Thank you!
Roman
On Thu, Jan 25, 2018 at 12:03:02PM -0500, David Miller wrote:
> From: Roman Gushchin <[email protected]>
> Date: Thu, 25 Jan 2018 00:19:11 +0000
>
> > @@ -476,6 +477,10 @@ struct sock *inet_csk_accept(struct sock *sk, int flags, int *err, bool kern)
> > spin_unlock_bh(&queue->fastopenq.lock);
> > }
> > mem_cgroup_sk_alloc(newsk);
> > + amt = sk_memory_allocated(newsk);
> > + if (amt && newsk->sk_memcg)
> > + mem_cgroup_charge_skmem(newsk->sk_memcg, amt);
> > +
>
> This looks confusing to me.
>
> sk_memory_allocated() is the total amount of memory used by all
> sockets for a particular "struct proto", not just for that specific
> socket.
>
> Maybe I don't understand how this socket memcg stuff works, but it
> seems like you should be looking instead at how much memory is
> allocated to this specific socket.
So, the patch below takes the per-socket charge into account,
and it _almost_ works: the css leak is a couple of orders
of magnitude smaller, but it still exists. I believe the problem
is that we need additional synchronization for the sk_memcg and
sk_forward_alloc fields, and I'm out of ideas on how to do that
without heavy artillery like introducing a new field for the
unaccounted memcg charge. As far as I can see, we check the
sk_memcg field without the socket lock, and we set it from a
concurrent context.
Most likely I'm missing something...
So I'm really starting to think that reverting 9f1c2674b328
("net: memcontrol: defer call to mem_cgroup_sk_alloc()")
and fixing the original issue differently might be easier
and the proper way to go. Does it make sense?
Thanks!
--
diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
index 4ca46dc08e63..287de1501a30 100644
--- a/net/ipv4/inet_connection_sock.c
+++ b/net/ipv4/inet_connection_sock.c
@@ -476,6 +476,12 @@ struct sock *inet_csk_accept(struct sock *sk, int flags, int *err, bool kern)
spin_unlock_bh(&queue->fastopenq.lock);
}
mem_cgroup_sk_alloc(newsk);
+ if (mem_cgroup_sockets_enabled && newsk->sk_memcg) {
+ int amt = sk_mem_pages(newsk->sk_forward_alloc);
+ if (amt > 0)
+ mem_cgroup_charge_skmem(newsk->sk_memcg, amt);
+ }
+
out:
release_sock(sk);
if (req)
From: Roman Gushchin <[email protected]>
Date: Wed, 31 Jan 2018 21:54:08 +0000
> So I'm really starting to think that reverting 9f1c2674b328
> ("net: memcontrol: defer call to mem_cgroup_sk_alloc()")
> and fixing the original issue differently might be easier
> and the proper way to go. Does it make sense?
You'll need to work that out with Eric Dumazet who added the
change in question which you think we should revert.
On Thu, Feb 01, 2018 at 10:16:55AM -0500, David Miller wrote:
> From: Roman Gushchin <[email protected]>
> Date: Wed, 31 Jan 2018 21:54:08 +0000
>
> > So I'm really starting to think that reverting 9f1c2674b328
> > ("net: memcontrol: defer call to mem_cgroup_sk_alloc()")
> > and fixing the original issue differently might be easier
> > and the proper way to go. Does it make sense?
>
> You'll need to work that out with Eric Dumazet who added the
> change in question which you think we should revert.
Eric,
can you please provide some details about the use-after-free problem
that you've fixed with commit 9f1c2674b328 ("net: memcontrol: defer call
to mem_cgroup_sk_alloc()")? Do you know how to reproduce it?
Deferring mem_cgroup_sk_alloc() breaks socket memory accounting
and makes it much more fragile in general. So I wonder if there are
solutions for the use-after-free problem.
Thank you!
Roman
On Thu, Feb 1, 2018 at 12:22 PM, Roman Gushchin <[email protected]> wrote:
> On Thu, Feb 01, 2018 at 10:16:55AM -0500, David Miller wrote:
>> From: Roman Gushchin <[email protected]>
>> Date: Wed, 31 Jan 2018 21:54:08 +0000
>>
>> > So I'm really starting to think that reverting 9f1c2674b328
>> > ("net: memcontrol: defer call to mem_cgroup_sk_alloc()")
>> > and fixing the original issue differently might be easier
>> > and the proper way to go. Does it make sense?
>>
>> You'll need to work that out with Eric Dumazet who added the
>> change in question which you think we should revert.
>
> Eric,
>
> can you please provide some details about the use-after-free problem
> that you've fixed with commit 9f1c2674b328 ("net: memcontrol: defer call
> to mem_cgroup_sk_alloc()")? Do you know how to reproduce it?
>
> Deferring mem_cgroup_sk_alloc() breaks socket memory accounting
> and makes it much more fragile in general. So I wonder if there are
> solutions for the use-after-free problem.
>
> Thank you!
>
> Roman
Unfortunately bug is not public (Google-Bug-Id 67556600 for Googlers
following this thread )
Our kernel has a debug feature on percpu_ref_get_many() which detects
the typical use-after-free problem of
doing atomic_long_add(nr, &ref->count); while ref->count is 0, or
memory already freed.
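For illustration only (the internal patch isn't public), a minimal sketch
of such a check, assuming it simply warns when the atomic-mode count is
already zero, could look roughly like this on top of the upstream
percpu_ref_get_many() from include/linux/percpu-refcount.h:

/* Hypothetical sketch, not the actual debug patch: warn when taking a
 * reference on a percpu_ref whose count has already dropped to zero,
 * i.e. the typical use-after-free pattern described above.
 */
static inline void percpu_ref_get_many(struct percpu_ref *ref, unsigned long nr)
{
	unsigned long __percpu *percpu_count;

	rcu_read_lock_sched();

	if (__ref_is_percpu(ref, &percpu_count)) {
		this_cpu_add(*percpu_count, nr);
	} else {
		WARN_ONCE(!atomic_long_read(&ref->count),
			  "percpu_ref_get_many() on a released ref\n");
		atomic_long_add(nr, &ref->count);
	}

	rcu_read_unlock_sched();
}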
Bug was serious because css_put() will release the css a second time.
Stack trace looked like :
Oct 8 00:23:14 lphh23 kernel: [27239.568098] <IRQ>
[<ffffffff909d2fb1>] dump_stack+0x4d/0x6c
Oct 8 00:23:14 lphh23 kernel: [27239.568108] [<ffffffff906df6e3>] ?
cgroup_get+0x43/0x50
Oct 8 00:23:14 lphh23 kernel: [27239.568114] [<ffffffff906f2f35>]
warn_slowpath_common+0xac/0xc8
Oct 8 00:23:14 lphh23 kernel: [27239.568117] [<ffffffff906f2f6b>]
warn_slowpath_null+0x1a/0x1c
Oct 8 00:23:14 lphh23 kernel: [27239.568120] [<ffffffff906df6e3>]
cgroup_get+0x43/0x50
Oct 8 00:23:14 lphh23 kernel: [27239.568123] [<ffffffff906e07a4>]
cgroup_sk_alloc+0x64/0x90
Oct 8 00:23:14 lphh23 kernel: [27239.568128] [<ffffffff90bd6e91>]
sk_clone_lock+0x2d1/0x400
Oct 8 00:23:14 lphh23 kernel: [27239.568134] [<ffffffff90bf2d56>]
inet_csk_clone_lock+0x16/0x100
Oct 8 00:23:14 lphh23 kernel: [27239.568138] [<ffffffff90bff163>]
tcp_create_openreq_child+0x23/0x600
Oct 8 00:23:14 lphh23 kernel: [27239.568143] [<ffffffff90c1ba8a>]
tcp_v6_syn_recv_sock+0x26a/0x8f0
Oct 8 00:23:14 lphh23 kernel: [27239.568146] [<ffffffff90bffbfe>]
tcp_check_req+0x1ce/0x440
Oct 8 00:23:14 lphh23 kernel: [27239.568152] [<ffffffff90c6556c>]
tcp_v6_rcv+0x9cc/0x22a0
Oct 8 00:23:14 lphh23 kernel: [27239.568155] [<ffffffff90c67cc2>] ?
ip6table_mangle_hook+0x42/0x190
Oct 8 00:23:14 lphh23 kernel: [27239.568158] [<ffffffff90c61e5b>]
ip6_input+0x1ab/0x400
Oct 8 00:23:14 lphh23 kernel: [27239.568162] [<ffffffff90cd8c0d>] ?
ip6_rcv_finish+0x93/0x93
Oct 8 00:23:14 lphh23 kernel: [27239.568165] [<ffffffff90c61a2d>]
ipv6_rcv+0x32d/0x5b0
Oct 8 00:23:14 lphh23 kernel: [27239.568167] [<ffffffff90cd8b7a>] ?
ip6_fragment+0x965/0x965
Oct 8 00:23:14 lphh23 kernel: [27239.568171] [<ffffffff90c2fd4c>]
process_backlog+0x39c/0xc50
Oct 8 00:23:14 lphh23 kernel: [27239.568177] [<ffffffff907be695>] ?
ktime_get+0x35/0xa0
Oct 8 00:23:14 lphh23 kernel: [27239.568180] [<ffffffff907bf681>] ?
clockevents_program_event+0x81/0x1c0
Oct 8 00:23:14 lphh23 kernel: [27239.568183] [<ffffffff90c2e22e>]
net_rx_action+0x10e/0x360
Oct 8 00:23:14 lphh23 kernel: [27239.568190] [<ffffffff906064f1>]
__do_softirq+0x151/0x2f5
Oct 8 00:23:14 lphh23 kernel: [27239.568196] [<ffffffff90d101dc>]
do_softirq_own_stack+0x1c/0x30
Oct 8 00:23:14 lphh23 kernel: [27239.568197] <EOI>
[<ffffffff9079a12b>] __local_bh_enable_ip+0x6b/0xa0
Oct 8 00:23:14 lphh23 kernel: [27239.568203] [<ffffffff90c609c6>]
ip6_output+0x326/0x1060
Oct 8 00:23:14 lphh23 kernel: [27239.568206] [<ffffffff90c67d3d>] ?
ip6table_mangle_hook+0xbd/0x190
Oct 8 00:23:14 lphh23 kernel: [27239.568209] [<ffffffff90c5f780>] ?
inet6_getname+0x130/0x130
Oct 8 00:23:14 lphh23 kernel: [27239.568212] [<ffffffff90c606a0>] ?
ip6_finish_output+0xf20/0xf20
Oct 8 00:23:14 lphh23 kernel: [27239.568215] [<ffffffff90cd77a7>]
ip6_xmit+0x52d/0x5b6
Oct 8 00:23:14 lphh23 kernel: [27239.568217] [<ffffffff90cd6ffe>] ?
ip6_call_ra_chain+0xc9/0xc9
Oct 8 00:23:14 lphh23 kernel: [27239.568220] [<ffffffff90c4483d>] ?
tcp_ack+0x60d/0x3290
Oct 8 00:23:14 lphh23 kernel: [27239.568223] [<ffffffff90c67521>]
inet6_csk_xmit+0x181/0x2b0
Oct 8 00:23:14 lphh23 kernel: [27239.568225] [<ffffffff90c4bb55>]
tcp_send_ack+0x6f5/0xdf0
Oct 8 00:23:14 lphh23 kernel: [27239.568229] [<ffffffff90bf8311>]
tcp_rcv_state_process+0x8a1/0x2630
Oct 8 00:23:14 lphh23 kernel: [27239.568231] [<ffffffff90c1c24b>]
tcp_v6_do_rcv+0x13b/0x340
Oct 8 00:23:14 lphh23 kernel: [27239.568234] [<ffffffff90c2286c>]
release_sock+0xec/0x180
Oct 8 00:23:14 lphh23 kernel: [27239.568237] [<ffffffff90c08b6f>]
__inet_stream_connect+0x1ef/0x2f0
Oct 8 00:23:14 lphh23 kernel: [27239.568240] [<ffffffff906d8710>] ?
__wake_up_locked_key+0x70/0x70
Oct 8 00:23:14 lphh23 kernel: [27239.568243] [<ffffffff90c08cab>]
inet_stream_connect+0x3b/0x60
Oct 8 00:23:14 lphh23 kernel: [27239.568249] [<ffffffff90bd5564>]
SYSC_connect+0x84/0xc0
On Thu, Feb 01, 2018 at 01:17:56PM -0800, Eric Dumazet wrote:
> On Thu, Feb 1, 2018 at 12:22 PM, Roman Gushchin <[email protected]> wrote:
> > On Thu, Feb 01, 2018 at 10:16:55AM -0500, David Miller wrote:
> >> From: Roman Gushchin <[email protected]>
> >> Date: Wed, 31 Jan 2018 21:54:08 +0000
> >>
> >> > So I'm really starting to think that reverting 9f1c2674b328
> >> > ("net: memcontrol: defer call to mem_cgroup_sk_alloc()")
> >> > and fixing the original issue differently might be easier
> >> > and the proper way to go. Does it make sense?
> >>
> >> You'll need to work that out with Eric Dumazet who added the
> >> change in question which you think we should revert.
> >
> > Eric,
> >
> > can you please provide some details about the use-after-free problem
> > that you've fixed with commit 9f1c2674b328 ("net: memcontrol: defer call
> > to mem_cgroup_sk_alloc()")? Do you know how to reproduce it?
> >
> > Deferring mem_cgroup_sk_alloc() breaks socket memory accounting
> > and makes it much more fragile in general. So I wonder if there are
> > solutions for the use-after-free problem.
> >
> > Thank you!
> >
> > Roman
>
> Unfortunately bug is not public (Google-Bug-Id 67556600 for Googlers
> following this thread )
>
> Our kernel has a debug feature on percpu_ref_get_many() which detects
> the typical use-after-free problem of
> doing atomic_long_add(nr, &ref->count); while ref->count is 0, or
> memory already freed.
>
> Bug was serious because css_put() will release the css a second time.
>
> Stack trace looked like :
>
> Oct 8 00:23:14 lphh23 kernel: [27239.568098] <IRQ>
> [<ffffffff909d2fb1>] dump_stack+0x4d/0x6c
> Oct 8 00:23:14 lphh23 kernel: [27239.568108] [<ffffffff906df6e3>] ?
> cgroup_get+0x43/0x50
> Oct 8 00:23:14 lphh23 kernel: [27239.568114] [<ffffffff906f2f35>]
> warn_slowpath_common+0xac/0xc8
> Oct 8 00:23:14 lphh23 kernel: [27239.568117] [<ffffffff906f2f6b>]
> warn_slowpath_null+0x1a/0x1c
> Oct 8 00:23:14 lphh23 kernel: [27239.568120] [<ffffffff906df6e3>]
> cgroup_get+0x43/0x50
> Oct 8 00:23:14 lphh23 kernel: [27239.568123] [<ffffffff906e07a4>]
> cgroup_sk_alloc+0x64/0x90
Hm, that looks strange... The trace shows cgroup_sk_alloc(),
not mem_cgroup_sk_alloc(), which is the call that 9f1c2674b328
removed from this path. I thought it was the css_get() in
mem_cgroup_sk_alloc() that you removed, but the stack trace
you've posted is different.
void mem_cgroup_sk_alloc(struct sock *sk) {
/*
* Socket cloning can throw us here with sk_memcg already
* filled. It won't however, necessarily happen from
* process context. So the test for root memcg given
* the current task's memcg won't help us in this case.
*
* Respecting the original socket's memcg is a better
* decision in this case.
*/
if (sk->sk_memcg) {
BUG_ON(mem_cgroup_is_root(sk->sk_memcg));
>>> css_get(&sk->sk_memcg->css);
return;
}
Is it possible to reproduce the issue on an upstream kernel?
Any ideas of what can trigger it?
Btw, with the following patch applied and cgroup v2 enabled,
the issue I'm talking about can be reproduced within seconds after reboot
by doing almost any network activity; just sshing to the machine is enough.
The corresponding warning will be printed to dmesg.
What is a proper way to fix the socket memory accounting in this case?
What do you think?
Thank you!
Roman
--
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index c51c589..55fb890 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -276,6 +276,8 @@ struct mem_cgroup {
struct list_head event_list;
spinlock_t event_list_lock;
+ atomic_t tcpcnt;
+
struct mem_cgroup_per_node *nodeinfo[0];
/* WARNING: nodeinfo must be the last member here */
};
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 19eea69..c69ff04 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5623,6 +5623,8 @@ static int memory_stat_show(struct seq_file *m, void *v)
seq_printf(m, "workingset_nodereclaim %lu\n",
stat[WORKINGSET_NODERECLAIM]);
+ seq_printf(m, "tcpcnt %d\n", atomic_read(&memcg->tcpcnt));
+
return 0;
}
@@ -6139,6 +6141,8 @@ bool mem_cgroup_charge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages)
{
gfp_t gfp_mask = GFP_KERNEL;
+ atomic_add(nr_pages, &memcg->tcpcnt);
+
if (!cgroup_subsys_on_dfl(memory_cgrp_subsys)) {
struct page_counter *fail;
@@ -6171,6 +6175,11 @@ bool mem_cgroup_charge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages)
*/
void mem_cgroup_uncharge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages)
{
+ int v = atomic_sub_return(nr_pages, &memcg->tcpcnt);
+ if (v < 0) {
+ pr_info("@@@ %p %d \n", memcg, v);
+ }
+
if (!cgroup_subsys_on_dfl(memory_cgrp_subsys)) {
page_counter_uncharge(&memcg->tcpmem, nr_pages);
return;
Well, this memcg stuff is so confusing.
My recollection is that we had :
https://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git/commit/?id=d979a39d7242e0601bf9b60e89628fb8ac577179
https://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git/commit/?id=75cb070960ade40fba5de32138390f3c85c90941
https://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git/commit/?id=c0576e3975084d4699b7bfef578613fb8e1144f6
And commit a590b90d472f2c176c140576ee3ab44df7f67839 as well
Honestly, the bug was closed months ago for us, based on stack traces in the wild.
No C repro or whatever, but reproducing it would be a matter of
having a TCP listener constantly doing a
socket()/setsockopt(REUSEADDR)/bind()/listen()/close() in a loop,
while connections are attempted to the listening port.
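A rough, untested sketch of such a reproducer (not from the original
report; the port number and backlog are arbitrary, and a separate client
hammering connect() against the same port is assumed to run in parallel):

/* Churn a TCP listener in a tight loop while an external client keeps
 * connecting to the same port, as described above.
 */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
	struct sockaddr_in addr;
	int fd, one = 1;

	memset(&addr, 0, sizeof(addr));
	addr.sin_family = AF_INET;
	addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
	addr.sin_port = htons(12345);	/* arbitrary test port */

	for (;;) {
		fd = socket(AF_INET, SOCK_STREAM, 0);
		if (fd < 0)
			continue;
		setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one));
		if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) == 0)
			listen(fd, 16);
		/* Closing the listener can race with child sockets still
		 * being created from incoming connections in the BH path.
		 */
		close(fd);
	}
	return 0;
}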
On Thu, Feb 1, 2018 at 2:55 PM, Roman Gushchin <[email protected]> wrote:
> On Thu, Feb 01, 2018 at 01:17:56PM -0800, Eric Dumazet wrote:
>> On Thu, Feb 1, 2018 at 12:22 PM, Roman Gushchin <[email protected]> wrote:
>> > On Thu, Feb 01, 2018 at 10:16:55AM -0500, David Miller wrote:
>> >> From: Roman Gushchin <[email protected]>
>> >> Date: Wed, 31 Jan 2018 21:54:08 +0000
>> >>
>> >> > So I'm really starting to think that reverting 9f1c2674b328
>> >> > ("net: memcontrol: defer call to mem_cgroup_sk_alloc()")
>> >> > and fixing the original issue differently might be easier
>> >> > and the proper way to go. Does it make sense?
>> >>
>> >> You'll need to work that out with Eric Dumazet who added the
>> >> change in question which you think we should revert.
>> >
>> > Eric,
>> >
>> > can you please provide some details about the use-after-free problem
>> > that you've fixed with commit 9f1c2674b328 ("net: memcontrol: defer call
>> > to mem_cgroup_sk_alloc()")? Do you know how to reproduce it?
>> >
>> > Deferring mem_cgroup_sk_alloc() breaks socket memory accounting
>> > and makes it much more fragile in general. So I wonder if there are
>> > solutions for the use-after-free problem.
>> >
>> > Thank you!
>> >
>> > Roman
>>
>> Unfortunately bug is not public (Google-Bug-Id 67556600 for Googlers
>> following this thread )
>>
>> Our kernel has a debug feature on percpu_ref_get_many() which detects
>> the typical use-after-free problem of
>> doing atomic_long_add(nr, &ref->count); while ref->count is 0, or
>> memory already freed.
>>
>> Bug was serious because css_put() will release the css a second time.
>>
>> Stack trace looked like :
>>
>> Oct 8 00:23:14 lphh23 kernel: [27239.568098] <IRQ>
>> [<ffffffff909d2fb1>] dump_stack+0x4d/0x6c
>> Oct 8 00:23:14 lphh23 kernel: [27239.568108] [<ffffffff906df6e3>] ?
>> cgroup_get+0x43/0x50
>> Oct 8 00:23:14 lphh23 kernel: [27239.568114] [<ffffffff906f2f35>]
>> warn_slowpath_common+0xac/0xc8
>> Oct 8 00:23:14 lphh23 kernel: [27239.568117] [<ffffffff906f2f6b>]
>> warn_slowpath_null+0x1a/0x1c
>> Oct 8 00:23:14 lphh23 kernel: [27239.568120] [<ffffffff906df6e3>]
>> cgroup_get+0x43/0x50
>> Oct 8 00:23:14 lphh23 kernel: [27239.568123] [<ffffffff906e07a4>]
>> cgroup_sk_alloc+0x64/0x90
>
> Hm, that looks strange... The trace shows cgroup_sk_alloc(),
> not mem_cgroup_sk_alloc(), which is the call that 9f1c2674b328
> removed from this path.
>
> I thought it was the css_get() in mem_cgroup_sk_alloc() that
> you removed, but the stack trace you've posted is different.
>
> void mem_cgroup_sk_alloc(struct sock *sk) {
> /*
> * Socket cloning can throw us here with sk_memcg already
> * filled. It won't however, necessarily happen from
> * process context. So the test for root memcg given
> * the current task's memcg won't help us in this case.
> *
> * Respecting the original socket's memcg is a better
> * decision in this case.
> */
> if (sk->sk_memcg) {
> BUG_ON(mem_cgroup_is_root(sk->sk_memcg));
>>>> css_get(&sk->sk_memcg->css);
> return;
> }
>
> Is it possible to reproduce the issue on an upstream kernel?
> Any ideas of what can trigger it?
>
> Btw, with the following patch applied and cgroup v2 enabled,
> the issue I'm talking about can be reproduced within seconds after reboot
> by doing almost any network activity; just sshing to the machine is enough.
> The corresponding warning will be printed to dmesg.
>
> What is a proper way to fix the socket memory accounting in this case?
> What do you think?
>
> Thank you!
>
> Roman
>
> --
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index c51c589..55fb890 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -276,6 +276,8 @@ struct mem_cgroup {
> struct list_head event_list;
> spinlock_t event_list_lock;
>
> + atomic_t tcpcnt;
> +
> struct mem_cgroup_per_node *nodeinfo[0];
> /* WARNING: nodeinfo must be the last member here */
> };
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 19eea69..c69ff04 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -5623,6 +5623,8 @@ static int memory_stat_show(struct seq_file *m, void *v)
> seq_printf(m, "workingset_nodereclaim %lu\n",
> stat[WORKINGSET_NODERECLAIM]);
>
> + seq_printf(m, "tcpcnt %d\n", atomic_read(&memcg->tcpcnt));
> +
> return 0;
> }
>
> @@ -6139,6 +6141,8 @@ bool mem_cgroup_charge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages)
> {
> gfp_t gfp_mask = GFP_KERNEL;
>
> + atomic_add(nr_pages, &memcg->tcpcnt);
> +
> if (!cgroup_subsys_on_dfl(memory_cgrp_subsys)) {
> struct page_counter *fail;
>
> @@ -6171,6 +6175,11 @@ bool mem_cgroup_charge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages)
> */
> void mem_cgroup_uncharge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages)
> {
> + int v = atomic_sub_return(nr_pages, &memcg->tcpcnt);
> + if (v < 0) {
> + pr_info("@@@ %p %d \n", memcg, v);
> + }
> +
> if (!cgroup_subsys_on_dfl(memory_cgrp_subsys)) {
> page_counter_uncharge(&memcg->tcpmem, nr_pages);
> return;
On Thu, Feb 01, 2018 at 03:27:14PM -0800, Eric Dumazet wrote:
> Well, this memcg stuff is so confusing.
>
> My recollection is that we had :
>
>
> https://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git/commit/?id=d979a39d7242e0601bf9b60e89628fb8ac577179
>
> https://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git/commit/?id=75cb070960ade40fba5de32138390f3c85c90941
>
> https://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git/commit/?id=c0576e3975084d4699b7bfef578613fb8e1144f6
>
> And commit a590b90d472f2c176c140576ee3ab44df7f67839 as well
>
> Honestly, the bug was closed months ago for us, based on stack traces in the wild.
>
> No C repro or whatever, but reproducing it would be a matter of
> having a TCP listener constantly doing a
> socket()/setsockopt(REUSEADDR)/bind()/listen()/close() in a loop,
> while connections are attempted to the listening port.
Oh, I see...
Then I think that we should move mem_cgroup_sk_alloc() back into the BH
context, where cgroup_sk_alloc() is, and repeat all the tricks to avoid
copying dead cgroup/memcg pointers. Do you agree?
I'll try to put together a patch and reproduce the issue.
Thanks!
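For illustration, the call would go back next to cgroup_sk_alloc() in the
clone path, roughly like this (a sketch of the direction only, not the
actual patch; the "tricks" against copying dead pointers are not shown):

	/* net/core/sock.c, sk_clone_lock(): charge the new socket's memcg
	 * from the start again, alongside the existing cgroup_sk_alloc();
	 * the hard part (not shown) is making sure sk_memcg copied from the
	 * parent cannot point to an already-released css.
	 */
	mem_cgroup_sk_alloc(newsk);
	cgroup_sk_alloc(&newsk->sk_cgrp_data);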