From: Jason Xing <[email protected]>
TCP socks cannot be released because of the sock_hold() increasing the
sk_refcnt in the manner of tcp_internal_pacing() when RTO happens.
Therefore, this situation could increase the slab memory and then trigger
the OOM if the machine has beening running for a long time. This issue,
however, can happen on some machine only running a few days.
We add one exception case to avoid unneeded use of sock_hold if the
pacing_timer is enqueued.
Reproduce procedure:
0) cat /proc/slabinfo | grep TCP
1) switch net.ipv4.tcp_congestion_control to bbr
2) using wrk tool something like that to send packages
3) using tc to increase the delay in the dev to simulate the busy case.
4) cat /proc/slabinfo | grep TCP
5) kill the wrk command and observe the number of objects and slabs in TCP.
6) at last, you could notice that the number would not decrease.
Signed-off-by: Jason Xing <[email protected]>
Signed-off-by: liweishi <[email protected]>
Signed-off-by: Shujin Li <[email protected]>
---
net/ipv4/tcp_output.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index cc4ba42..5cf63d9 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -969,7 +969,8 @@ static void tcp_internal_pacing(struct sock *sk, const struct sk_buff *skb)
u64 len_ns;
u32 rate;
- if (!tcp_needs_internal_pacing(sk))
+ if (!tcp_needs_internal_pacing(sk) ||
+ hrtimer_is_queued(&tcp_sk(sk)->pacing_timer))
return;
rate = sk->sk_pacing_rate;
if (!rate || rate == ~0U)
--
1.8.3.1
On Tue, Jun 2, 2020 at 1:05 AM <[email protected]> wrote:
>
> From: Jason Xing <[email protected]>
>
> TCP socks cannot be released because of the sock_hold() increasing the
> sk_refcnt in the manner of tcp_internal_pacing() when RTO happens.
> Therefore, this situation could increase the slab memory and then trigger
> the OOM if the machine has beening running for a long time. This issue,
> however, can happen on some machine only running a few days.
>
> We add one exception case to avoid unneeded use of sock_hold if the
> pacing_timer is enqueued.
>
> Reproduce procedure:
> 0) cat /proc/slabinfo | grep TCP
> 1) switch net.ipv4.tcp_congestion_control to bbr
> 2) using wrk tool something like that to send packages
> 3) using tc to increase the delay in the dev to simulate the busy case.
> 4) cat /proc/slabinfo | grep TCP
> 5) kill the wrk command and observe the number of objects and slabs in TCP.
> 6) at last, you could notice that the number would not decrease.
>
> Signed-off-by: Jason Xing <[email protected]>
> Signed-off-by: liweishi <[email protected]>
> Signed-off-by: Shujin Li <[email protected]>
> ---
> net/ipv4/tcp_output.c | 3 ++-
> 1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> index cc4ba42..5cf63d9 100644
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -969,7 +969,8 @@ static void tcp_internal_pacing(struct sock *sk, const struct sk_buff *skb)
> u64 len_ns;
> u32 rate;
>
> - if (!tcp_needs_internal_pacing(sk))
> + if (!tcp_needs_internal_pacing(sk) ||
> + hrtimer_is_queued(&tcp_sk(sk)->pacing_timer))
> return;
> rate = sk->sk_pacing_rate;
> if (!rate || rate == ~0U)
> --
> 1.8.3.1
>
Hi Jason.
Please do not send patches that do not apply to current upstream trees.
Instead, backport to your kernels the needed fixes.
I suspect that you are not using a pristine linux kernel, but some
heavily modified one and something went wrong in your backports.
Do not ask us to spend time finding what went wrong.
Thank you.
Hi Eric,
I'm sorry that I didn't write enough clearly. We're running the
pristine 4.19.125 linux kernel (the latest LTS version) and have been
haunted by such an issue. This patch is high-important, I think. So
I'm going to resend this email with the [patch 4.19] on the headline
and cc Greg.
Thanks,
Jason
On Tue, Jun 2, 2020 at 9:05 PM Eric Dumazet <[email protected]> wrote:
>
> On Tue, Jun 2, 2020 at 1:05 AM <[email protected]> wrote:
> >
> > From: Jason Xing <[email protected]>
> >
> > TCP socks cannot be released because of the sock_hold() increasing the
> > sk_refcnt in the manner of tcp_internal_pacing() when RTO happens.
> > Therefore, this situation could increase the slab memory and then trigger
> > the OOM if the machine has beening running for a long time. This issue,
> > however, can happen on some machine only running a few days.
> >
> > We add one exception case to avoid unneeded use of sock_hold if the
> > pacing_timer is enqueued.
> >
> > Reproduce procedure:
> > 0) cat /proc/slabinfo | grep TCP
> > 1) switch net.ipv4.tcp_congestion_control to bbr
> > 2) using wrk tool something like that to send packages
> > 3) using tc to increase the delay in the dev to simulate the busy case.
> > 4) cat /proc/slabinfo | grep TCP
> > 5) kill the wrk command and observe the number of objects and slabs in TCP.
> > 6) at last, you could notice that the number would not decrease.
> >
> > Signed-off-by: Jason Xing <[email protected]>
> > Signed-off-by: liweishi <[email protected]>
> > Signed-off-by: Shujin Li <[email protected]>
> > ---
> > net/ipv4/tcp_output.c | 3 ++-
> > 1 file changed, 2 insertions(+), 1 deletion(-)
> >
> > diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> > index cc4ba42..5cf63d9 100644
> > --- a/net/ipv4/tcp_output.c
> > +++ b/net/ipv4/tcp_output.c
> > @@ -969,7 +969,8 @@ static void tcp_internal_pacing(struct sock *sk, const struct sk_buff *skb)
> > u64 len_ns;
> > u32 rate;
> >
> > - if (!tcp_needs_internal_pacing(sk))
> > + if (!tcp_needs_internal_pacing(sk) ||
> > + hrtimer_is_queued(&tcp_sk(sk)->pacing_timer))
> > return;
> > rate = sk->sk_pacing_rate;
> > if (!rate || rate == ~0U)
> > --
> > 1.8.3.1
> >
>
> Hi Jason.
>
> Please do not send patches that do not apply to current upstream trees.
>
> Instead, backport to your kernels the needed fixes.
>
> I suspect that you are not using a pristine linux kernel, but some
> heavily modified one and something went wrong in your backports.
> Do not ask us to spend time finding what went wrong.
>
> Thank you.
From: Jason Xing <[email protected]>
Date: Wed, 3 Jun 2020 09:53:10 +0800
> I'm sorry that I didn't write enough clearly. We're running the
> pristine 4.19.125 linux kernel (the latest LTS version) and have been
> haunted by such an issue. This patch is high-important, I think. So
> I'm going to resend this email with the [patch 4.19] on the headline
> and cc Greg.
That's not the appropriate thing to do.
On Tue, Jun 2, 2020 at 6:53 PM Jason Xing <[email protected]> wrote:
>
> Hi Eric,
>
> I'm sorry that I didn't write enough clearly. We're running the
> pristine 4.19.125 linux kernel (the latest LTS version) and have been
> haunted by such an issue. This patch is high-important, I think. So
> I'm going to resend this email with the [patch 4.19] on the headline
> and cc Greg.
Yes, please always give for which tree a patch is meant for.
Problem is that your patch is not correct.
In these old kernels, tcp_internal_pacing() is called _after_ the
packet has been sent.
It is too late to 'give up pacing'
The packet should not have been sent if the pacing timer is queued
(otherwise this means we do not respect pacing)
So the bug should be caught earlier. check where tcp_pacing_check()
calls are missing.
>
>
> Thanks,
> Jason
>
> On Tue, Jun 2, 2020 at 9:05 PM Eric Dumazet <[email protected]> wrote:
> >
> > On Tue, Jun 2, 2020 at 1:05 AM <[email protected]> wrote:
> > >
> > > From: Jason Xing <[email protected]>
> > >
> > > TCP socks cannot be released because of the sock_hold() increasing the
> > > sk_refcnt in the manner of tcp_internal_pacing() when RTO happens.
> > > Therefore, this situation could increase the slab memory and then trigger
> > > the OOM if the machine has beening running for a long time. This issue,
> > > however, can happen on some machine only running a few days.
> > >
> > > We add one exception case to avoid unneeded use of sock_hold if the
> > > pacing_timer is enqueued.
> > >
> > > Reproduce procedure:
> > > 0) cat /proc/slabinfo | grep TCP
> > > 1) switch net.ipv4.tcp_congestion_control to bbr
> > > 2) using wrk tool something like that to send packages
> > > 3) using tc to increase the delay in the dev to simulate the busy case.
> > > 4) cat /proc/slabinfo | grep TCP
> > > 5) kill the wrk command and observe the number of objects and slabs in TCP.
> > > 6) at last, you could notice that the number would not decrease.
> > >
> > > Signed-off-by: Jason Xing <[email protected]>
> > > Signed-off-by: liweishi <[email protected]>
> > > Signed-off-by: Shujin Li <[email protected]>
> > > ---
> > > net/ipv4/tcp_output.c | 3 ++-
> > > 1 file changed, 2 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> > > index cc4ba42..5cf63d9 100644
> > > --- a/net/ipv4/tcp_output.c
> > > +++ b/net/ipv4/tcp_output.c
> > > @@ -969,7 +969,8 @@ static void tcp_internal_pacing(struct sock *sk, const struct sk_buff *skb)
> > > u64 len_ns;
> > > u32 rate;
> > >
> > > - if (!tcp_needs_internal_pacing(sk))
> > > + if (!tcp_needs_internal_pacing(sk) ||
> > > + hrtimer_is_queued(&tcp_sk(sk)->pacing_timer))
> > > return;
> > > rate = sk->sk_pacing_rate;
> > > if (!rate || rate == ~0U)
> > > --
> > > 1.8.3.1
> > >
> >
> > Hi Jason.
> >
> > Please do not send patches that do not apply to current upstream trees.
> >
> > Instead, backport to your kernels the needed fixes.
> >
> > I suspect that you are not using a pristine linux kernel, but some
> > heavily modified one and something went wrong in your backports.
> > Do not ask us to spend time finding what went wrong.
> >
> > Thank you.
I agree with you. The upstream has already dropped and optimized this
part (commit 864e5c090749), so it would not happen like that. However
the old kernels like LTS still have the problem which causes
large-scale crashes on our thousands of machines after running for a
long while. I will send the fix to the correct tree soon :)
Thanks again,
Jason
On Wed, Jun 3, 2020 at 10:29 AM Eric Dumazet <[email protected]> wrote:
>
> On Tue, Jun 2, 2020 at 6:53 PM Jason Xing <[email protected]> wrote:
> >
> > Hi Eric,
> >
> > I'm sorry that I didn't write enough clearly. We're running the
> > pristine 4.19.125 linux kernel (the latest LTS version) and have been
> > haunted by such an issue. This patch is high-important, I think. So
> > I'm going to resend this email with the [patch 4.19] on the headline
> > and cc Greg.
>
> Yes, please always give for which tree a patch is meant for.
>
> Problem is that your patch is not correct.
> In these old kernels, tcp_internal_pacing() is called _after_ the
> packet has been sent.
> It is too late to 'give up pacing'
>
> The packet should not have been sent if the pacing timer is queued
> (otherwise this means we do not respect pacing)
>
> So the bug should be caught earlier. check where tcp_pacing_check()
> calls are missing.
>
>
>
> >
> >
> > Thanks,
> > Jason
> >
> > On Tue, Jun 2, 2020 at 9:05 PM Eric Dumazet <[email protected]> wrote:
> > >
> > > On Tue, Jun 2, 2020 at 1:05 AM <[email protected]> wrote:
> > > >
> > > > From: Jason Xing <[email protected]>
> > > >
> > > > TCP socks cannot be released because of the sock_hold() increasing the
> > > > sk_refcnt in the manner of tcp_internal_pacing() when RTO happens.
> > > > Therefore, this situation could increase the slab memory and then trigger
> > > > the OOM if the machine has beening running for a long time. This issue,
> > > > however, can happen on some machine only running a few days.
> > > >
> > > > We add one exception case to avoid unneeded use of sock_hold if the
> > > > pacing_timer is enqueued.
> > > >
> > > > Reproduce procedure:
> > > > 0) cat /proc/slabinfo | grep TCP
> > > > 1) switch net.ipv4.tcp_congestion_control to bbr
> > > > 2) using wrk tool something like that to send packages
> > > > 3) using tc to increase the delay in the dev to simulate the busy case.
> > > > 4) cat /proc/slabinfo | grep TCP
> > > > 5) kill the wrk command and observe the number of objects and slabs in TCP.
> > > > 6) at last, you could notice that the number would not decrease.
> > > >
> > > > Signed-off-by: Jason Xing <[email protected]>
> > > > Signed-off-by: liweishi <[email protected]>
> > > > Signed-off-by: Shujin Li <[email protected]>
> > > > ---
> > > > net/ipv4/tcp_output.c | 3 ++-
> > > > 1 file changed, 2 insertions(+), 1 deletion(-)
> > > >
> > > > diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> > > > index cc4ba42..5cf63d9 100644
> > > > --- a/net/ipv4/tcp_output.c
> > > > +++ b/net/ipv4/tcp_output.c
> > > > @@ -969,7 +969,8 @@ static void tcp_internal_pacing(struct sock *sk, const struct sk_buff *skb)
> > > > u64 len_ns;
> > > > u32 rate;
> > > >
> > > > - if (!tcp_needs_internal_pacing(sk))
> > > > + if (!tcp_needs_internal_pacing(sk) ||
> > > > + hrtimer_is_queued(&tcp_sk(sk)->pacing_timer))
> > > > return;
> > > > rate = sk->sk_pacing_rate;
> > > > if (!rate || rate == ~0U)
> > > > --
> > > > 1.8.3.1
> > > >
> > >
> > > Hi Jason.
> > >
> > > Please do not send patches that do not apply to current upstream trees.
> > >
> > > Instead, backport to your kernels the needed fixes.
> > >
> > > I suspect that you are not using a pristine linux kernel, but some
> > > heavily modified one and something went wrong in your backports.
> > > Do not ask us to spend time finding what went wrong.
> > >
> > > Thank you.
On Tue, Jun 2, 2020 at 7:42 PM Jason Xing <[email protected]> wrote:
>
> I agree with you. The upstream has already dropped and optimized this
> part (commit 864e5c090749), so it would not happen like that. However
> the old kernels like LTS still have the problem which causes
> large-scale crashes on our thousands of machines after running for a
> long while. I will send the fix to the correct tree soon :)
If you run BBR at scale (thousands of machines), you probably should
use sch_fq instead of internal pacing,
just saying ;)
>
> Thanks again,
> Jason
>
> On Wed, Jun 3, 2020 at 10:29 AM Eric Dumazet <[email protected]> wrote:
> >
> > On Tue, Jun 2, 2020 at 6:53 PM Jason Xing <[email protected]> wrote:
> > >
> > > Hi Eric,
> > >
> > > I'm sorry that I didn't write enough clearly. We're running the
> > > pristine 4.19.125 linux kernel (the latest LTS version) and have been
> > > haunted by such an issue. This patch is high-important, I think. So
> > > I'm going to resend this email with the [patch 4.19] on the headline
> > > and cc Greg.
> >
> > Yes, please always give for which tree a patch is meant for.
> >
> > Problem is that your patch is not correct.
> > In these old kernels, tcp_internal_pacing() is called _after_ the
> > packet has been sent.
> > It is too late to 'give up pacing'
> >
> > The packet should not have been sent if the pacing timer is queued
> > (otherwise this means we do not respect pacing)
> >
> > So the bug should be caught earlier. check where tcp_pacing_check()
> > calls are missing.
> >
> >
> >
> > >
> > >
> > > Thanks,
> > > Jason
> > >
> > > On Tue, Jun 2, 2020 at 9:05 PM Eric Dumazet <[email protected]> wrote:
> > > >
> > > > On Tue, Jun 2, 2020 at 1:05 AM <[email protected]> wrote:
> > > > >
> > > > > From: Jason Xing <[email protected]>
> > > > >
> > > > > TCP socks cannot be released because of the sock_hold() increasing the
> > > > > sk_refcnt in the manner of tcp_internal_pacing() when RTO happens.
> > > > > Therefore, this situation could increase the slab memory and then trigger
> > > > > the OOM if the machine has beening running for a long time. This issue,
> > > > > however, can happen on some machine only running a few days.
> > > > >
> > > > > We add one exception case to avoid unneeded use of sock_hold if the
> > > > > pacing_timer is enqueued.
> > > > >
> > > > > Reproduce procedure:
> > > > > 0) cat /proc/slabinfo | grep TCP
> > > > > 1) switch net.ipv4.tcp_congestion_control to bbr
> > > > > 2) using wrk tool something like that to send packages
> > > > > 3) using tc to increase the delay in the dev to simulate the busy case.
> > > > > 4) cat /proc/slabinfo | grep TCP
> > > > > 5) kill the wrk command and observe the number of objects and slabs in TCP.
> > > > > 6) at last, you could notice that the number would not decrease.
> > > > >
> > > > > Signed-off-by: Jason Xing <[email protected]>
> > > > > Signed-off-by: liweishi <[email protected]>
> > > > > Signed-off-by: Shujin Li <[email protected]>
> > > > > ---
> > > > > net/ipv4/tcp_output.c | 3 ++-
> > > > > 1 file changed, 2 insertions(+), 1 deletion(-)
> > > > >
> > > > > diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> > > > > index cc4ba42..5cf63d9 100644
> > > > > --- a/net/ipv4/tcp_output.c
> > > > > +++ b/net/ipv4/tcp_output.c
> > > > > @@ -969,7 +969,8 @@ static void tcp_internal_pacing(struct sock *sk, const struct sk_buff *skb)
> > > > > u64 len_ns;
> > > > > u32 rate;
> > > > >
> > > > > - if (!tcp_needs_internal_pacing(sk))
> > > > > + if (!tcp_needs_internal_pacing(sk) ||
> > > > > + hrtimer_is_queued(&tcp_sk(sk)->pacing_timer))
> > > > > return;
> > > > > rate = sk->sk_pacing_rate;
> > > > > if (!rate || rate == ~0U)
> > > > > --
> > > > > 1.8.3.1
> > > > >
> > > >
> > > > Hi Jason.
> > > >
> > > > Please do not send patches that do not apply to current upstream trees.
> > > >
> > > > Instead, backport to your kernels the needed fixes.
> > > >
> > > > I suspect that you are not using a pristine linux kernel, but some
> > > > heavily modified one and something went wrong in your backports.
> > > > Do not ask us to spend time finding what went wrong.
> > > >
> > > > Thank you.
Thanks for reminding me.
I will test the cases through using sch_fq.
Jason
On Wed, Jun 3, 2020 at 10:44 AM Eric Dumazet <[email protected]> wrote:
>
> On Tue, Jun 2, 2020 at 7:42 PM Jason Xing <[email protected]> wrote:
> >
> > I agree with you. The upstream has already dropped and optimized this
> > part (commit 864e5c090749), so it would not happen like that. However
> > the old kernels like LTS still have the problem which causes
> > large-scale crashes on our thousands of machines after running for a
> > long while. I will send the fix to the correct tree soon :)
>
> If you run BBR at scale (thousands of machines), you probably should
> use sch_fq instead of internal pacing,
> just saying ;)
>
>
> >
> > Thanks again,
> > Jason
> >
> > On Wed, Jun 3, 2020 at 10:29 AM Eric Dumazet <[email protected]> wrote:
> > >
> > > On Tue, Jun 2, 2020 at 6:53 PM Jason Xing <[email protected]> wrote:
> > > >
> > > > Hi Eric,
> > > >
> > > > I'm sorry that I didn't write enough clearly. We're running the
> > > > pristine 4.19.125 linux kernel (the latest LTS version) and have been
> > > > haunted by such an issue. This patch is high-important, I think. So
> > > > I'm going to resend this email with the [patch 4.19] on the headline
> > > > and cc Greg.
> > >
> > > Yes, please always give for which tree a patch is meant for.
> > >
> > > Problem is that your patch is not correct.
> > > In these old kernels, tcp_internal_pacing() is called _after_ the
> > > packet has been sent.
> > > It is too late to 'give up pacing'
> > >
> > > The packet should not have been sent if the pacing timer is queued
> > > (otherwise this means we do not respect pacing)
> > >
> > > So the bug should be caught earlier. check where tcp_pacing_check()
> > > calls are missing.
> > >
> > >
> > >
> > > >
> > > >
> > > > Thanks,
> > > > Jason
> > > >
> > > > On Tue, Jun 2, 2020 at 9:05 PM Eric Dumazet <[email protected]> wrote:
> > > > >
> > > > > On Tue, Jun 2, 2020 at 1:05 AM <[email protected]> wrote:
> > > > > >
> > > > > > From: Jason Xing <[email protected]>
> > > > > >
> > > > > > TCP socks cannot be released because of the sock_hold() increasing the
> > > > > > sk_refcnt in the manner of tcp_internal_pacing() when RTO happens.
> > > > > > Therefore, this situation could increase the slab memory and then trigger
> > > > > > the OOM if the machine has beening running for a long time. This issue,
> > > > > > however, can happen on some machine only running a few days.
> > > > > >
> > > > > > We add one exception case to avoid unneeded use of sock_hold if the
> > > > > > pacing_timer is enqueued.
> > > > > >
> > > > > > Reproduce procedure:
> > > > > > 0) cat /proc/slabinfo | grep TCP
> > > > > > 1) switch net.ipv4.tcp_congestion_control to bbr
> > > > > > 2) using wrk tool something like that to send packages
> > > > > > 3) using tc to increase the delay in the dev to simulate the busy case.
> > > > > > 4) cat /proc/slabinfo | grep TCP
> > > > > > 5) kill the wrk command and observe the number of objects and slabs in TCP.
> > > > > > 6) at last, you could notice that the number would not decrease.
> > > > > >
> > > > > > Signed-off-by: Jason Xing <[email protected]>
> > > > > > Signed-off-by: liweishi <[email protected]>
> > > > > > Signed-off-by: Shujin Li <[email protected]>
> > > > > > ---
> > > > > > net/ipv4/tcp_output.c | 3 ++-
> > > > > > 1 file changed, 2 insertions(+), 1 deletion(-)
> > > > > >
> > > > > > diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> > > > > > index cc4ba42..5cf63d9 100644
> > > > > > --- a/net/ipv4/tcp_output.c
> > > > > > +++ b/net/ipv4/tcp_output.c
> > > > > > @@ -969,7 +969,8 @@ static void tcp_internal_pacing(struct sock *sk, const struct sk_buff *skb)
> > > > > > u64 len_ns;
> > > > > > u32 rate;
> > > > > >
> > > > > > - if (!tcp_needs_internal_pacing(sk))
> > > > > > + if (!tcp_needs_internal_pacing(sk) ||
> > > > > > + hrtimer_is_queued(&tcp_sk(sk)->pacing_timer))
> > > > > > return;
> > > > > > rate = sk->sk_pacing_rate;
> > > > > > if (!rate || rate == ~0U)
> > > > > > --
> > > > > > 1.8.3.1
> > > > > >
> > > > >
> > > > > Hi Jason.
> > > > >
> > > > > Please do not send patches that do not apply to current upstream trees.
> > > > >
> > > > > Instead, backport to your kernels the needed fixes.
> > > > >
> > > > > I suspect that you are not using a pristine linux kernel, but some
> > > > > heavily modified one and something went wrong in your backports.
> > > > > Do not ask us to spend time finding what went wrong.
> > > > >
> > > > > Thank you.
Hi Eric,
I'm still trying to understand what you're saying before. Would this
be better as following:
1) discard the tcp_internal_pacing() function.
2) remove where the tcp_internal_pacing() is called in the
__tcp_transmit_skb() function.
If we do so, we could avoid 'too late to give up pacing'. Meanwhile,
should we introduce the tcp_wstamp_ns socket field as commit
(864e5c090749) does?
Thanks,
Jason
On Wed, Jun 3, 2020 at 10:44 AM Eric Dumazet <[email protected]> wrote:
>
> On Tue, Jun 2, 2020 at 7:42 PM Jason Xing <[email protected]> wrote:
> >
> > I agree with you. The upstream has already dropped and optimized this
> > part (commit 864e5c090749), so it would not happen like that. However
> > the old kernels like LTS still have the problem which causes
> > large-scale crashes on our thousands of machines after running for a
> > long while. I will send the fix to the correct tree soon :)
>
> If you run BBR at scale (thousands of machines), you probably should
> use sch_fq instead of internal pacing,
> just saying ;)
>
>
> >
> > Thanks again,
> > Jason
> >
> > On Wed, Jun 3, 2020 at 10:29 AM Eric Dumazet <[email protected]> wrote:
> > >
> > > On Tue, Jun 2, 2020 at 6:53 PM Jason Xing <[email protected]> wrote:
> > > >
> > > > Hi Eric,
> > > >
> > > > I'm sorry that I didn't write enough clearly. We're running the
> > > > pristine 4.19.125 linux kernel (the latest LTS version) and have been
> > > > haunted by such an issue. This patch is high-important, I think. So
> > > > I'm going to resend this email with the [patch 4.19] on the headline
> > > > and cc Greg.
> > >
> > > Yes, please always give for which tree a patch is meant for.
> > >
> > > Problem is that your patch is not correct.
> > > In these old kernels, tcp_internal_pacing() is called _after_ the
> > > packet has been sent.
> > > It is too late to 'give up pacing'
> > >
> > > The packet should not have been sent if the pacing timer is queued
> > > (otherwise this means we do not respect pacing)
> > >
> > > So the bug should be caught earlier. check where tcp_pacing_check()
> > > calls are missing.
> > >
> > >
> > >
> > > >
> > > >
> > > > Thanks,
> > > > Jason
> > > >
> > > > On Tue, Jun 2, 2020 at 9:05 PM Eric Dumazet <[email protected]> wrote:
> > > > >
> > > > > On Tue, Jun 2, 2020 at 1:05 AM <[email protected]> wrote:
> > > > > >
> > > > > > From: Jason Xing <[email protected]>
> > > > > >
> > > > > > TCP socks cannot be released because of the sock_hold() increasing the
> > > > > > sk_refcnt in the manner of tcp_internal_pacing() when RTO happens.
> > > > > > Therefore, this situation could increase the slab memory and then trigger
> > > > > > the OOM if the machine has beening running for a long time. This issue,
> > > > > > however, can happen on some machine only running a few days.
> > > > > >
> > > > > > We add one exception case to avoid unneeded use of sock_hold if the
> > > > > > pacing_timer is enqueued.
> > > > > >
> > > > > > Reproduce procedure:
> > > > > > 0) cat /proc/slabinfo | grep TCP
> > > > > > 1) switch net.ipv4.tcp_congestion_control to bbr
> > > > > > 2) using wrk tool something like that to send packages
> > > > > > 3) using tc to increase the delay in the dev to simulate the busy case.
> > > > > > 4) cat /proc/slabinfo | grep TCP
> > > > > > 5) kill the wrk command and observe the number of objects and slabs in TCP.
> > > > > > 6) at last, you could notice that the number would not decrease.
> > > > > >
> > > > > > Signed-off-by: Jason Xing <[email protected]>
> > > > > > Signed-off-by: liweishi <[email protected]>
> > > > > > Signed-off-by: Shujin Li <[email protected]>
> > > > > > ---
> > > > > > net/ipv4/tcp_output.c | 3 ++-
> > > > > > 1 file changed, 2 insertions(+), 1 deletion(-)
> > > > > >
> > > > > > diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> > > > > > index cc4ba42..5cf63d9 100644
> > > > > > --- a/net/ipv4/tcp_output.c
> > > > > > +++ b/net/ipv4/tcp_output.c
> > > > > > @@ -969,7 +969,8 @@ static void tcp_internal_pacing(struct sock *sk, const struct sk_buff *skb)
> > > > > > u64 len_ns;
> > > > > > u32 rate;
> > > > > >
> > > > > > - if (!tcp_needs_internal_pacing(sk))
> > > > > > + if (!tcp_needs_internal_pacing(sk) ||
> > > > > > + hrtimer_is_queued(&tcp_sk(sk)->pacing_timer))
> > > > > > return;
> > > > > > rate = sk->sk_pacing_rate;
> > > > > > if (!rate || rate == ~0U)
> > > > > > --
> > > > > > 1.8.3.1
> > > > > >
> > > > >
> > > > > Hi Jason.
> > > > >
> > > > > Please do not send patches that do not apply to current upstream trees.
> > > > >
> > > > > Instead, backport to your kernels the needed fixes.
> > > > >
> > > > > I suspect that you are not using a pristine linux kernel, but some
> > > > > heavily modified one and something went wrong in your backports.
> > > > > Do not ask us to spend time finding what went wrong.
> > > > >
> > > > > Thank you.
On Tue, Jun 2, 2020 at 10:05 PM Jason Xing <[email protected]> wrote:
>
> Hi Eric,
>
> I'm still trying to understand what you're saying before. Would this
> be better as following:
> 1) discard the tcp_internal_pacing() function.
> 2) remove where the tcp_internal_pacing() is called in the
> __tcp_transmit_skb() function.
>
> If we do so, we could avoid 'too late to give up pacing'. Meanwhile,
> should we introduce the tcp_wstamp_ns socket field as commit
> (864e5c090749) does?
>
Please do not top-post on netdev mailing list.
I basically suggested double-checking which point in TCP could end up
calling tcp_internal_pacing()
while the timer was already armed.
I guess this is mtu probing.
Please try the following patch : If we still have another bug, a
WARNING should give us a stack trace.
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index cc4ba42052c21b206850594db6751810d8fc72b4..8f4081b228486305222767d4d118b9b6ed0ffda3
100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -977,12 +977,26 @@ static void tcp_internal_pacing(struct sock *sk,
const struct sk_buff *skb)
len_ns = (u64)skb->len * NSEC_PER_SEC;
do_div(len_ns, rate);
+
+ /* If hrtimer is already armed, then our caller has not properly
+ * used tcp_pacing_check().
+ */
+ if (unlikely(hrtimer_is_queued(&tcp_sk(sk)->pacing_timer))) {
+ WARN_ON_ONCE(1);
+ return;
+ }
hrtimer_start(&tcp_sk(sk)->pacing_timer,
ktime_add_ns(ktime_get(), len_ns),
HRTIMER_MODE_ABS_PINNED_SOFT);
sock_hold(sk);
}
+static bool tcp_pacing_check(const struct sock *sk)
+{
+ return tcp_needs_internal_pacing(sk) &&
+ hrtimer_is_queued(&tcp_sk(sk)->pacing_timer);
+}
+
static void tcp_update_skb_after_send(struct tcp_sock *tp, struct sk_buff *skb)
{
skb->skb_mstamp = tp->tcp_mstamp;
@@ -2117,6 +2131,9 @@ static int tcp_mtu_probe(struct sock *sk)
if (!tcp_can_coalesce_send_queue_head(sk, probe_size))
return -1;
+ if (tcp_pacing_check(sk))
+ return -1;
+
/* We're allowed to probe. Build it now. */
nskb = sk_stream_alloc_skb(sk, probe_size, GFP_ATOMIC, false);
if (!nskb)
@@ -2190,11 +2207,6 @@ static int tcp_mtu_probe(struct sock *sk)
return -1;
}
-static bool tcp_pacing_check(const struct sock *sk)
-{
- return tcp_needs_internal_pacing(sk) &&
- hrtimer_is_queued(&tcp_sk(sk)->pacing_timer);
-}
/* TCP Small Queues :
* Control number of packets in qdisc/devices to two packets / or ~1 ms.
> Thanks,
> Jason
>
> On Wed, Jun 3, 2020 at 10:44 AM Eric Dumazet <[email protected]> wrote:
> >
> > On Tue, Jun 2, 2020 at 7:42 PM Jason Xing <[email protected]> wrote:
> > >
> > > I agree with you. The upstream has already dropped and optimized this
> > > part (commit 864e5c090749), so it would not happen like that. However
> > > the old kernels like LTS still have the problem which causes
> > > large-scale crashes on our thousands of machines after running for a
> > > long while. I will send the fix to the correct tree soon :)
> >
> > If you run BBR at scale (thousands of machines), you probably should
> > use sch_fq instead of internal pacing,
> > just saying ;)
> >
> >
> > >
> > > Thanks again,
> > > Jason
> > >
> > > On Wed, Jun 3, 2020 at 10:29 AM Eric Dumazet <[email protected]> wrote:
> > > >
> > > > On Tue, Jun 2, 2020 at 6:53 PM Jason Xing <[email protected]> wrote:
> > > > >
> > > > > Hi Eric,
> > > > >
> > > > > I'm sorry that I didn't write enough clearly. We're running the
> > > > > pristine 4.19.125 linux kernel (the latest LTS version) and have been
> > > > > haunted by such an issue. This patch is high-important, I think. So
> > > > > I'm going to resend this email with the [patch 4.19] on the headline
> > > > > and cc Greg.
> > > >
> > > > Yes, please always give for which tree a patch is meant for.
> > > >
> > > > Problem is that your patch is not correct.
> > > > In these old kernels, tcp_internal_pacing() is called _after_ the
> > > > packet has been sent.
> > > > It is too late to 'give up pacing'
> > > >
> > > > The packet should not have been sent if the pacing timer is queued
> > > > (otherwise this means we do not respect pacing)
> > > >
> > > > So the bug should be caught earlier. check where tcp_pacing_check()
> > > > calls are missing.
> > > >
> > > >
> > > >
> > > > >
> > > > >
> > > > > Thanks,
> > > > > Jason
> > > > >
> > > > > On Tue, Jun 2, 2020 at 9:05 PM Eric Dumazet <[email protected]> wrote:
> > > > > >
> > > > > > On Tue, Jun 2, 2020 at 1:05 AM <[email protected]> wrote:
> > > > > > >
> > > > > > > From: Jason Xing <[email protected]>
> > > > > > >
> > > > > > > TCP socks cannot be released because of the sock_hold() increasing the
> > > > > > > sk_refcnt in the manner of tcp_internal_pacing() when RTO happens.
> > > > > > > Therefore, this situation could increase the slab memory and then trigger
> > > > > > > the OOM if the machine has beening running for a long time. This issue,
> > > > > > > however, can happen on some machine only running a few days.
> > > > > > >
> > > > > > > We add one exception case to avoid unneeded use of sock_hold if the
> > > > > > > pacing_timer is enqueued.
> > > > > > >
> > > > > > > Reproduce procedure:
> > > > > > > 0) cat /proc/slabinfo | grep TCP
> > > > > > > 1) switch net.ipv4.tcp_congestion_control to bbr
> > > > > > > 2) using wrk tool something like that to send packages
> > > > > > > 3) using tc to increase the delay in the dev to simulate the busy case.
> > > > > > > 4) cat /proc/slabinfo | grep TCP
> > > > > > > 5) kill the wrk command and observe the number of objects and slabs in TCP.
> > > > > > > 6) at last, you could notice that the number would not decrease.
> > > > > > >
> > > > > > > Signed-off-by: Jason Xing <[email protected]>
> > > > > > > Signed-off-by: liweishi <[email protected]>
> > > > > > > Signed-off-by: Shujin Li <[email protected]>
> > > > > > > ---
> > > > > > > net/ipv4/tcp_output.c | 3 ++-
> > > > > > > 1 file changed, 2 insertions(+), 1 deletion(-)
> > > > > > >
> > > > > > > diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> > > > > > > index cc4ba42..5cf63d9 100644
> > > > > > > --- a/net/ipv4/tcp_output.c
> > > > > > > +++ b/net/ipv4/tcp_output.c
> > > > > > > @@ -969,7 +969,8 @@ static void tcp_internal_pacing(struct sock *sk, const struct sk_buff *skb)
> > > > > > > u64 len_ns;
> > > > > > > u32 rate;
> > > > > > >
> > > > > > > - if (!tcp_needs_internal_pacing(sk))
> > > > > > > + if (!tcp_needs_internal_pacing(sk) ||
> > > > > > > + hrtimer_is_queued(&tcp_sk(sk)->pacing_timer))
> > > > > > > return;
> > > > > > > rate = sk->sk_pacing_rate;
> > > > > > > if (!rate || rate == ~0U)
> > > > > > > --
> > > > > > > 1.8.3.1
> > > > > > >
> > > > > >
> > > > > > Hi Jason.
> > > > > >
> > > > > > Please do not send patches that do not apply to current upstream trees.
> > > > > >
> > > > > > Instead, backport to your kernels the needed fixes.
> > > > > >
> > > > > > I suspect that you are not using a pristine linux kernel, but some
> > > > > > heavily modified one and something went wrong in your backports.
> > > > > > Do not ask us to spend time finding what went wrong.
> > > > > >
> > > > > > Thank you.
On Wed, Jun 3, 2020 at 1:44 PM Eric Dumazet <[email protected]> wrote:
>
> On Tue, Jun 2, 2020 at 10:05 PM Jason Xing <[email protected]> wrote:
> >
> > Hi Eric,
> >
> > I'm still trying to understand what you're saying before. Would this
> > be better as following:
> > 1) discard the tcp_internal_pacing() function.
> > 2) remove where the tcp_internal_pacing() is called in the
> > __tcp_transmit_skb() function.
> >
> > If we do so, we could avoid 'too late to give up pacing'. Meanwhile,
> > should we introduce the tcp_wstamp_ns socket field as commit
> > (864e5c090749) does?
> >
>
> Please do not top-post on netdev mailing list.
>
>
> I basically suggested double-checking which point in TCP could end up
> calling tcp_internal_pacing()
> while the timer was already armed.
>
> I guess this is mtu probing.
Thanks for suggestions. I will recheck the point.
>
> Please try the following patch : If we still have another bug, a
> WARNING should give us a stack trace.
>
Agreed. I will apply this part of code and test it, then get back some
information here.
If it runs well as we expect, I decide to send this patch as v2 for
4.19 linux kernel.
Jason
>
> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> index cc4ba42052c21b206850594db6751810d8fc72b4..8f4081b228486305222767d4d118b9b6ed0ffda3
> 100644
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -977,12 +977,26 @@ static void tcp_internal_pacing(struct sock *sk,
> const struct sk_buff *skb)
>
> len_ns = (u64)skb->len * NSEC_PER_SEC;
> do_div(len_ns, rate);
> +
> + /* If hrtimer is already armed, then our caller has not properly
> + * used tcp_pacing_check().
> + */
> + if (unlikely(hrtimer_is_queued(&tcp_sk(sk)->pacing_timer))) {
> + WARN_ON_ONCE(1);
> + return;
> + }
> hrtimer_start(&tcp_sk(sk)->pacing_timer,
> ktime_add_ns(ktime_get(), len_ns),
> HRTIMER_MODE_ABS_PINNED_SOFT);
> sock_hold(sk);
> }
>
> +static bool tcp_pacing_check(const struct sock *sk)
> +{
> + return tcp_needs_internal_pacing(sk) &&
> + hrtimer_is_queued(&tcp_sk(sk)->pacing_timer);
> +}
> +
> static void tcp_update_skb_after_send(struct tcp_sock *tp, struct sk_buff *skb)
> {
> skb->skb_mstamp = tp->tcp_mstamp;
> @@ -2117,6 +2131,9 @@ static int tcp_mtu_probe(struct sock *sk)
> if (!tcp_can_coalesce_send_queue_head(sk, probe_size))
> return -1;
>
> + if (tcp_pacing_check(sk))
> + return -1;
> +
> /* We're allowed to probe. Build it now. */
> nskb = sk_stream_alloc_skb(sk, probe_size, GFP_ATOMIC, false);
> if (!nskb)
> @@ -2190,11 +2207,6 @@ static int tcp_mtu_probe(struct sock *sk)
> return -1;
> }
>
> -static bool tcp_pacing_check(const struct sock *sk)
> -{
> - return tcp_needs_internal_pacing(sk) &&
> - hrtimer_is_queued(&tcp_sk(sk)->pacing_timer);
> -}
>
> /* TCP Small Queues :
> * Control number of packets in qdisc/devices to two packets / or ~1 ms.
>
>
>
> > Thanks,
> > Jason
> >
> > On Wed, Jun 3, 2020 at 10:44 AM Eric Dumazet <[email protected]> wrote:
> > >
> > > On Tue, Jun 2, 2020 at 7:42 PM Jason Xing <[email protected]> wrote:
> > > >
> > > > I agree with you. The upstream has already dropped and optimized this
> > > > part (commit 864e5c090749), so it would not happen like that. However
> > > > the old kernels like LTS still have the problem which causes
> > > > large-scale crashes on our thousands of machines after running for a
> > > > long while. I will send the fix to the correct tree soon :)
> > >
> > > If you run BBR at scale (thousands of machines), you probably should
> > > use sch_fq instead of internal pacing,
> > > just saying ;)
> > >
> > >
> > > >
> > > > Thanks again,
> > > > Jason
> > > >
> > > > On Wed, Jun 3, 2020 at 10:29 AM Eric Dumazet <[email protected]> wrote:
> > > > >
> > > > > On Tue, Jun 2, 2020 at 6:53 PM Jason Xing <[email protected]> wrote:
> > > > > >
> > > > > > Hi Eric,
> > > > > >
> > > > > > I'm sorry that I didn't write enough clearly. We're running the
> > > > > > pristine 4.19.125 linux kernel (the latest LTS version) and have been
> > > > > > haunted by such an issue. This patch is high-important, I think. So
> > > > > > I'm going to resend this email with the [patch 4.19] on the headline
> > > > > > and cc Greg.
> > > > >
> > > > > Yes, please always give for which tree a patch is meant for.
> > > > >
> > > > > Problem is that your patch is not correct.
> > > > > In these old kernels, tcp_internal_pacing() is called _after_ the
> > > > > packet has been sent.
> > > > > It is too late to 'give up pacing'
> > > > >
> > > > > The packet should not have been sent if the pacing timer is queued
> > > > > (otherwise this means we do not respect pacing)
> > > > >
> > > > > So the bug should be caught earlier. check where tcp_pacing_check()
> > > > > calls are missing.
> > > > >
> > > > >
> > > > >
> > > > > >
> > > > > >
> > > > > > Thanks,
> > > > > > Jason
> > > > > >
> > > > > > On Tue, Jun 2, 2020 at 9:05 PM Eric Dumazet <[email protected]> wrote:
> > > > > > >
> > > > > > > On Tue, Jun 2, 2020 at 1:05 AM <[email protected]> wrote:
> > > > > > > >
> > > > > > > > From: Jason Xing <[email protected]>
> > > > > > > >
> > > > > > > > TCP socks cannot be released because of the sock_hold() increasing the
> > > > > > > > sk_refcnt in the manner of tcp_internal_pacing() when RTO happens.
> > > > > > > > Therefore, this situation could increase the slab memory and then trigger
> > > > > > > > the OOM if the machine has beening running for a long time. This issue,
> > > > > > > > however, can happen on some machine only running a few days.
> > > > > > > >
> > > > > > > > We add one exception case to avoid unneeded use of sock_hold if the
> > > > > > > > pacing_timer is enqueued.
> > > > > > > >
> > > > > > > > Reproduce procedure:
> > > > > > > > 0) cat /proc/slabinfo | grep TCP
> > > > > > > > 1) switch net.ipv4.tcp_congestion_control to bbr
> > > > > > > > 2) using wrk tool something like that to send packages
> > > > > > > > 3) using tc to increase the delay in the dev to simulate the busy case.
> > > > > > > > 4) cat /proc/slabinfo | grep TCP
> > > > > > > > 5) kill the wrk command and observe the number of objects and slabs in TCP.
> > > > > > > > 6) at last, you could notice that the number would not decrease.
> > > > > > > >
> > > > > > > > Signed-off-by: Jason Xing <[email protected]>
> > > > > > > > Signed-off-by: liweishi <[email protected]>
> > > > > > > > Signed-off-by: Shujin Li <[email protected]>
> > > > > > > > ---
> > > > > > > > net/ipv4/tcp_output.c | 3 ++-
> > > > > > > > 1 file changed, 2 insertions(+), 1 deletion(-)
> > > > > > > >
> > > > > > > > diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> > > > > > > > index cc4ba42..5cf63d9 100644
> > > > > > > > --- a/net/ipv4/tcp_output.c
> > > > > > > > +++ b/net/ipv4/tcp_output.c
> > > > > > > > @@ -969,7 +969,8 @@ static void tcp_internal_pacing(struct sock *sk, const struct sk_buff *skb)
> > > > > > > > u64 len_ns;
> > > > > > > > u32 rate;
> > > > > > > >
> > > > > > > > - if (!tcp_needs_internal_pacing(sk))
> > > > > > > > + if (!tcp_needs_internal_pacing(sk) ||
> > > > > > > > + hrtimer_is_queued(&tcp_sk(sk)->pacing_timer))
> > > > > > > > return;
> > > > > > > > rate = sk->sk_pacing_rate;
> > > > > > > > if (!rate || rate == ~0U)
> > > > > > > > --
> > > > > > > > 1.8.3.1
> > > > > > > >
> > > > > > >
> > > > > > > Hi Jason.
> > > > > > >
> > > > > > > Please do not send patches that do not apply to current upstream trees.
> > > > > > >
> > > > > > > Instead, backport to your kernels the needed fixes.
> > > > > > >
> > > > > > > I suspect that you are not using a pristine linux kernel, but some
> > > > > > > heavily modified one and something went wrong in your backports.
> > > > > > > Do not ask us to spend time finding what went wrong.
> > > > > > >
> > > > > > > Thank you.
On Wed, Jun 3, 2020 at 1:44 AM Eric Dumazet <[email protected]> wrote:
>
> On Tue, Jun 2, 2020 at 10:05 PM Jason Xing <[email protected]> wrote:
> >
> > Hi Eric,
> >
> > I'm still trying to understand what you're saying before. Would this
> > be better as following:
> > 1) discard the tcp_internal_pacing() function.
> > 2) remove where the tcp_internal_pacing() is called in the
> > __tcp_transmit_skb() function.
> >
> > If we do so, we could avoid 'too late to give up pacing'. Meanwhile,
> > should we introduce the tcp_wstamp_ns socket field as commit
> > (864e5c090749) does?
> >
>
> Please do not top-post on netdev mailing list.
>
>
> I basically suggested double-checking which point in TCP could end up
> calling tcp_internal_pacing()
> while the timer was already armed.
>
> I guess this is mtu probing.
Perhaps this could also happen from some of the retransmission code
paths that don't use tcp_xmit_retransmit_queue()? Perhaps
tcp_retransmit_timer() (RTO) and tcp_send_loss_probe() TLP? It seems
they could indirectly cause a call to __tcp_transmit_skb() and thus
tcp_internal_pacing() without first checking if the pacing timer was
already armed?
neal
On Wed, Jun 3, 2020 at 8:02 PM Neal Cardwell <[email protected]> wrote:
>
> On Wed, Jun 3, 2020 at 1:44 AM Eric Dumazet <[email protected]> wrote:
> >
> > On Tue, Jun 2, 2020 at 10:05 PM Jason Xing <[email protected]> wrote:
> > >
> > > Hi Eric,
> > >
> > > I'm still trying to understand what you're saying before. Would this
> > > be better as following:
> > > 1) discard the tcp_internal_pacing() function.
> > > 2) remove where the tcp_internal_pacing() is called in the
> > > __tcp_transmit_skb() function.
> > >
> > > If we do so, we could avoid 'too late to give up pacing'. Meanwhile,
> > > should we introduce the tcp_wstamp_ns socket field as commit
> > > (864e5c090749) does?
> > >
> >
> > Please do not top-post on netdev mailing list.
> >
> >
> > I basically suggested double-checking which point in TCP could end up
> > calling tcp_internal_pacing()
> > while the timer was already armed.
> >
> > I guess this is mtu probing.
I tested the patch Eric suggested and the system display the stack
trace which means there's one more exception we have to take into
consideration. The call trace is listed as following:
Call Trace:
<IRQ>
__tcp_retransmit_skb+0x188/0x7f0
? bbr_set_state+0x7f/0x90 [tcp_bbr]
tcp_retransmit_skb+0x14/0xc0
tcp_retransmit_timer+0x313/0xa10
? native_sched_clock+0x37/0x90
? tcp_write_timer_handler+0x210/0x210
tcp_write_timer_handler+0xb1/0x210
tcp_write_timer+0x6d/0x80
call_timer_fn+0x29/0x110
run_timer_softirq+0x3cb/0x400
? native_sched_clock+0x37/0x90
__do_softirq+0xdf/0x2ed
irq_exit+0xf7/0x100
smp_apic_timer_interrupt+0x68/0x120
apic_timer_interrupt+0xf/0x20
</IRQ>
I admitted that this case is not that easily triggered, but it is the
one that avoids the check during tcp_mtu_probe() period. The first skb
is sent out without being checked by tcp_pacing_check when RTO comes.
>
> Perhaps this could also happen from some of the retransmission code
> paths that don't use tcp_xmit_retransmit_queue()? Perhaps
> tcp_retransmit_timer() (RTO) and tcp_send_loss_probe() TLP? It seems
> they could indirectly cause a call to __tcp_transmit_skb() and thus
> tcp_internal_pacing() without first checking if the pacing timer was
> already armed?
>
Point taken. There are indeed several places using __tcp_transmit_skb
where could cause such an issue, that is to say, slab increasing. All
these particular cases, I think, should all be taken into account.
Thanks,
Jason
> neal
On Wed, Jun 3, 2020 at 5:02 AM Neal Cardwell <[email protected]> wrote:
>
> On Wed, Jun 3, 2020 at 1:44 AM Eric Dumazet <[email protected]> wrote:
> >
> > On Tue, Jun 2, 2020 at 10:05 PM Jason Xing <[email protected]> wrote:
> > >
> > > Hi Eric,
> > >
> > > I'm still trying to understand what you're saying before. Would this
> > > be better as following:
> > > 1) discard the tcp_internal_pacing() function.
> > > 2) remove where the tcp_internal_pacing() is called in the
> > > __tcp_transmit_skb() function.
> > >
> > > If we do so, we could avoid 'too late to give up pacing'. Meanwhile,
> > > should we introduce the tcp_wstamp_ns socket field as commit
> > > (864e5c090749) does?
> > >
> >
> > Please do not top-post on netdev mailing list.
> >
> >
> > I basically suggested double-checking which point in TCP could end up
> > calling tcp_internal_pacing()
> > while the timer was already armed.
> >
> > I guess this is mtu probing.
>
> Perhaps this could also happen from some of the retransmission code
> paths that don't use tcp_xmit_retransmit_queue()? Perhaps
> tcp_retransmit_timer() (RTO) and tcp_send_loss_probe() TLP? It seems
> they could indirectly cause a call to __tcp_transmit_skb() and thus
> tcp_internal_pacing() without first checking if the pacing timer was
> already armed?
I feared this, (see recent commits about very low pacing rates) :/
I am not sure we need to properly fix all these points for old
kernels, since EDT model got rid of these problems.
Maybe we can try to extend the timer.
Something like :
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index cc4ba42052c21b206850594db6751810d8fc72b4..626b9f4f500f7e5270d8d59e6eb16dbfa3efbc7c
100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -966,6 +966,8 @@ enum hrtimer_restart tcp_pace_kick(struct hrtimer *timer)
static void tcp_internal_pacing(struct sock *sk, const struct sk_buff *skb)
{
+ struct tcp_sock *tp = tcp_sk(sk);
+ ktime_t expire, now;
u64 len_ns;
u32 rate;
@@ -977,12 +979,29 @@ static void tcp_internal_pacing(struct sock *sk,
const struct sk_buff *skb)
len_ns = (u64)skb->len * NSEC_PER_SEC;
do_div(len_ns, rate);
- hrtimer_start(&tcp_sk(sk)->pacing_timer,
- ktime_add_ns(ktime_get(), len_ns),
+
+ now = ktime_get();
+ /* If hrtimer is already armed, then our caller has not
+ * used tcp_pacing_check().
+ */
+ if (unlikely(hrtimer_is_queued(&tp->pacing_timer))) {
+ expire = hrtimer_get_softexpires(&tp->pacing_timer);
+ if (ktime_after(expire, now))
+ now = expire;
+ if (hrtimer_try_to_cancel(&tp->pacing_timer) == 1)
+ __sock_put(sk);
+ }
+ hrtimer_start(&tp->pacing_timer, ktime_add_ns(now, len_ns),
HRTIMER_MODE_ABS_PINNED_SOFT);
sock_hold(sk);
}
+static bool tcp_pacing_check(const struct sock *sk)
+{
+ return tcp_needs_internal_pacing(sk) &&
+ hrtimer_is_queued(&tcp_sk(sk)->pacing_timer);
+}
+
static void tcp_update_skb_after_send(struct tcp_sock *tp, struct sk_buff *skb)
{
skb->skb_mstamp = tp->tcp_mstamp;
@@ -2117,6 +2136,9 @@ static int tcp_mtu_probe(struct sock *sk)
if (!tcp_can_coalesce_send_queue_head(sk, probe_size))
return -1;
+ if (tcp_pacing_check(sk))
+ return -1;
+
/* We're allowed to probe. Build it now. */
nskb = sk_stream_alloc_skb(sk, probe_size, GFP_ATOMIC, false);
if (!nskb)
@@ -2190,11 +2212,6 @@ static int tcp_mtu_probe(struct sock *sk)
return -1;
}
-static bool tcp_pacing_check(const struct sock *sk)
-{
- return tcp_needs_internal_pacing(sk) &&
- hrtimer_is_queued(&tcp_sk(sk)->pacing_timer);
-}
/* TCP Small Queues :
* Control number of packets in qdisc/devices to two packets / or ~1 ms.
>
> neal
On Wed, Jun 3, 2020 at 9:55 AM Eric Dumazet <[email protected]> wrote:
>
> On Wed, Jun 3, 2020 at 5:02 AM Neal Cardwell <[email protected]> wrote:
> >
> > On Wed, Jun 3, 2020 at 1:44 AM Eric Dumazet <[email protected]> wrote:
> > >
> > > On Tue, Jun 2, 2020 at 10:05 PM Jason Xing <[email protected]> wrote:
> > > >
> > > > Hi Eric,
> > > >
> > > > I'm still trying to understand what you're saying before. Would this
> > > > be better as following:
> > > > 1) discard the tcp_internal_pacing() function.
> > > > 2) remove where the tcp_internal_pacing() is called in the
> > > > __tcp_transmit_skb() function.
> > > >
> > > > If we do so, we could avoid 'too late to give up pacing'. Meanwhile,
> > > > should we introduce the tcp_wstamp_ns socket field as commit
> > > > (864e5c090749) does?
> > > >
> > >
> > > Please do not top-post on netdev mailing list.
> > >
> > >
> > > I basically suggested double-checking which point in TCP could end up
> > > calling tcp_internal_pacing()
> > > while the timer was already armed.
> > >
> > > I guess this is mtu probing.
> >
> > Perhaps this could also happen from some of the retransmission code
> > paths that don't use tcp_xmit_retransmit_queue()? Perhaps
> > tcp_retransmit_timer() (RTO) and tcp_send_loss_probe() TLP? It seems
> > they could indirectly cause a call to __tcp_transmit_skb() and thus
> > tcp_internal_pacing() without first checking if the pacing timer was
> > already armed?
>
> I feared this, (see recent commits about very low pacing rates) :/
>
> I am not sure we need to properly fix all these points for old
> kernels, since EDT model got rid of these problems.
Agreed.
> Maybe we can try to extend the timer.
Sounds good.
> Something like :
>
>
> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> index cc4ba42052c21b206850594db6751810d8fc72b4..626b9f4f500f7e5270d8d59e6eb16dbfa3efbc7c
> 100644
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -966,6 +966,8 @@ enum hrtimer_restart tcp_pace_kick(struct hrtimer *timer)
>
> static void tcp_internal_pacing(struct sock *sk, const struct sk_buff *skb)
> {
> + struct tcp_sock *tp = tcp_sk(sk);
> + ktime_t expire, now;
> u64 len_ns;
> u32 rate;
>
> @@ -977,12 +979,29 @@ static void tcp_internal_pacing(struct sock *sk,
> const struct sk_buff *skb)
>
> len_ns = (u64)skb->len * NSEC_PER_SEC;
> do_div(len_ns, rate);
> - hrtimer_start(&tcp_sk(sk)->pacing_timer,
> - ktime_add_ns(ktime_get(), len_ns),
> +
> + now = ktime_get();
> + /* If hrtimer is already armed, then our caller has not
> + * used tcp_pacing_check().
> + */
> + if (unlikely(hrtimer_is_queued(&tp->pacing_timer))) {
> + expire = hrtimer_get_softexpires(&tp->pacing_timer);
> + if (ktime_after(expire, now))
> + now = expire;
> + if (hrtimer_try_to_cancel(&tp->pacing_timer) == 1)
> + __sock_put(sk);
> + }
> + hrtimer_start(&tp->pacing_timer, ktime_add_ns(now, len_ns),
> HRTIMER_MODE_ABS_PINNED_SOFT);
> sock_hold(sk);
> }
>
> +static bool tcp_pacing_check(const struct sock *sk)
> +{
> + return tcp_needs_internal_pacing(sk) &&
> + hrtimer_is_queued(&tcp_sk(sk)->pacing_timer);
> +}
> +
> static void tcp_update_skb_after_send(struct tcp_sock *tp, struct sk_buff *skb)
> {
> skb->skb_mstamp = tp->tcp_mstamp;
> @@ -2117,6 +2136,9 @@ static int tcp_mtu_probe(struct sock *sk)
> if (!tcp_can_coalesce_send_queue_head(sk, probe_size))
> return -1;
>
> + if (tcp_pacing_check(sk))
> + return -1;
> +
> /* We're allowed to probe. Build it now. */
> nskb = sk_stream_alloc_skb(sk, probe_size, GFP_ATOMIC, false);
> if (!nskb)
> @@ -2190,11 +2212,6 @@ static int tcp_mtu_probe(struct sock *sk)
> return -1;
> }
>
> -static bool tcp_pacing_check(const struct sock *sk)
> -{
> - return tcp_needs_internal_pacing(sk) &&
> - hrtimer_is_queued(&tcp_sk(sk)->pacing_timer);
> -}
>
> /* TCP Small Queues :
> * Control number of packets in qdisc/devices to two packets / or ~1 ms.
Thanks for your fix, Eric. This fix looks good to me! I agree that
this fix is good enough for older kernels.
thanks,
neal
On Wed, Jun 3, 2020 at 10:08 PM Neal Cardwell <[email protected]> wrote:
>
> On Wed, Jun 3, 2020 at 9:55 AM Eric Dumazet <[email protected]> wrote:
> >
> > On Wed, Jun 3, 2020 at 5:02 AM Neal Cardwell <[email protected]> wrote:
> > >
> > > On Wed, Jun 3, 2020 at 1:44 AM Eric Dumazet <[email protected]> wrote:
> > > >
> > > > On Tue, Jun 2, 2020 at 10:05 PM Jason Xing <[email protected]> wrote:
> > > > >
> > > > > Hi Eric,
> > > > >
> > > > > I'm still trying to understand what you're saying before. Would this
> > > > > be better as following:
> > > > > 1) discard the tcp_internal_pacing() function.
> > > > > 2) remove where the tcp_internal_pacing() is called in the
> > > > > __tcp_transmit_skb() function.
> > > > >
> > > > > If we do so, we could avoid 'too late to give up pacing'. Meanwhile,
> > > > > should we introduce the tcp_wstamp_ns socket field as commit
> > > > > (864e5c090749) does?
> > > > >
> > > >
> > > > Please do not top-post on netdev mailing list.
> > > >
> > > >
> > > > I basically suggested double-checking which point in TCP could end up
> > > > calling tcp_internal_pacing()
> > > > while the timer was already armed.
> > > >
> > > > I guess this is mtu probing.
> > >
> > > Perhaps this could also happen from some of the retransmission code
> > > paths that don't use tcp_xmit_retransmit_queue()? Perhaps
> > > tcp_retransmit_timer() (RTO) and tcp_send_loss_probe() TLP? It seems
> > > they could indirectly cause a call to __tcp_transmit_skb() and thus
> > > tcp_internal_pacing() without first checking if the pacing timer was
> > > already armed?
> >
> > I feared this, (see recent commits about very low pacing rates) :/
> >
> > I am not sure we need to properly fix all these points for old
> > kernels, since EDT model got rid of these problems.
>
> Agreed.
>
> > Maybe we can try to extend the timer.
>
> Sounds good.
>
> > Something like :
> >
> >
> > diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> > index cc4ba42052c21b206850594db6751810d8fc72b4..626b9f4f500f7e5270d8d59e6eb16dbfa3efbc7c
> > 100644
> > --- a/net/ipv4/tcp_output.c
> > +++ b/net/ipv4/tcp_output.c
> > @@ -966,6 +966,8 @@ enum hrtimer_restart tcp_pace_kick(struct hrtimer *timer)
> >
> > static void tcp_internal_pacing(struct sock *sk, const struct sk_buff *skb)
> > {
> > + struct tcp_sock *tp = tcp_sk(sk);
> > + ktime_t expire, now;
> > u64 len_ns;
> > u32 rate;
> >
> > @@ -977,12 +979,29 @@ static void tcp_internal_pacing(struct sock *sk,
> > const struct sk_buff *skb)
> >
> > len_ns = (u64)skb->len * NSEC_PER_SEC;
> > do_div(len_ns, rate);
> > - hrtimer_start(&tcp_sk(sk)->pacing_timer,
> > - ktime_add_ns(ktime_get(), len_ns),
> > +
> > + now = ktime_get();
> > + /* If hrtimer is already armed, then our caller has not
> > + * used tcp_pacing_check().
> > + */
> > + if (unlikely(hrtimer_is_queued(&tp->pacing_timer))) {
> > + expire = hrtimer_get_softexpires(&tp->pacing_timer);
> > + if (ktime_after(expire, now))
> > + now = expire;
> > + if (hrtimer_try_to_cancel(&tp->pacing_timer) == 1)
> > + __sock_put(sk);
> > + }
> > + hrtimer_start(&tp->pacing_timer, ktime_add_ns(now, len_ns),
> > HRTIMER_MODE_ABS_PINNED_SOFT);
> > sock_hold(sk);
> > }
> >
> > +static bool tcp_pacing_check(const struct sock *sk)
> > +{
> > + return tcp_needs_internal_pacing(sk) &&
> > + hrtimer_is_queued(&tcp_sk(sk)->pacing_timer);
> > +}
> > +
> > static void tcp_update_skb_after_send(struct tcp_sock *tp, struct sk_buff *skb)
> > {
> > skb->skb_mstamp = tp->tcp_mstamp;
> > @@ -2117,6 +2136,9 @@ static int tcp_mtu_probe(struct sock *sk)
> > if (!tcp_can_coalesce_send_queue_head(sk, probe_size))
> > return -1;
> >
> > + if (tcp_pacing_check(sk))
> > + return -1;
> > +
> > /* We're allowed to probe. Build it now. */
> > nskb = sk_stream_alloc_skb(sk, probe_size, GFP_ATOMIC, false);
> > if (!nskb)
> > @@ -2190,11 +2212,6 @@ static int tcp_mtu_probe(struct sock *sk)
> > return -1;
> > }
> >
> > -static bool tcp_pacing_check(const struct sock *sk)
> > -{
> > - return tcp_needs_internal_pacing(sk) &&
> > - hrtimer_is_queued(&tcp_sk(sk)->pacing_timer);
> > -}
> >
> > /* TCP Small Queues :
> > * Control number of packets in qdisc/devices to two packets / or ~1 ms.
>
> Thanks for your fix, Eric. This fix looks good to me! I agree that
> this fix is good enough for older kernels.
>
I just tested this patch and it worked well. So it also looks good to me :)
Nice work!
thanks,
Jason
> thanks,
> neal
From: Jason Xing <[email protected]>
When using BBR mode, too many tcp socks cannot be released because of
duplicate use of the sock_hold() in the manner of tcp_internal_pacing()
when RTO happens. Therefore, this situation maddly increases the slab
memory and then constantly triggers the OOM until crash.
Besides, in addition to BBR mode, if some mode applies pacing function,
it could trigger what we've discussed above,
Reproduce procedure:
0) cat /proc/slabinfo | grep TCP
1) switch net.ipv4.tcp_congestion_control to bbr
2) using wrk tool something like that to send packages
3) using tc to increase the delay and loss to simulate the RTO case.
4) cat /proc/slabinfo | grep TCP
5) kill the wrk command and observe the number of objects and slabs in
TCP.
6) at last, you could notice that the number would not decrease.
v2: extend the timer which could cover all those related potential risks
(suggested by Eric Dumazet and Neal Cardwell)
Signed-off-by: Jason Xing <[email protected]>
Signed-off-by: liweishi <[email protected]>
Signed-off-by: Shujin Li <[email protected]>
---
net/ipv4/tcp_output.c | 31 +++++++++++++++++++++++--------
1 file changed, 23 insertions(+), 8 deletions(-)
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index cc4ba42..4626f4e 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -966,6 +966,8 @@ enum hrtimer_restart tcp_pace_kick(struct hrtimer *timer)
static void tcp_internal_pacing(struct sock *sk, const struct sk_buff *skb)
{
+ struct tcp_sock *tp = tcp_sk(sk);
+ ktime_t expire, now;
u64 len_ns;
u32 rate;
@@ -977,12 +979,28 @@ static void tcp_internal_pacing(struct sock *sk, const struct sk_buff *skb)
len_ns = (u64)skb->len * NSEC_PER_SEC;
do_div(len_ns, rate);
- hrtimer_start(&tcp_sk(sk)->pacing_timer,
- ktime_add_ns(ktime_get(), len_ns),
+ now = ktime_get();
+ /* If hrtimer is already armed, then our caller has not
+ * used tcp_pacing_check().
+ */
+ if (unlikely(hrtimer_is_queued(&tp->pacing_timer))) {
+ expire = hrtimer_get_softexpires(&tp->pacing_timer);
+ if (ktime_after(expire, now))
+ now = expire;
+ if (hrtimer_try_to_cancel(&tp->pacing_timer) == 1)
+ __sock_put(sk);
+ }
+ hrtimer_start(&tp->pacing_timer, ktime_add_ns(now, len_ns),
HRTIMER_MODE_ABS_PINNED_SOFT);
sock_hold(sk);
}
+static bool tcp_pacing_check(const struct sock *sk)
+{
+ return tcp_needs_internal_pacing(sk) &&
+ hrtimer_is_queued(&tcp_sk(sk)->pacing_timer);
+}
+
static void tcp_update_skb_after_send(struct tcp_sock *tp, struct sk_buff *skb)
{
skb->skb_mstamp = tp->tcp_mstamp;
@@ -2117,6 +2135,9 @@ static int tcp_mtu_probe(struct sock *sk)
if (!tcp_can_coalesce_send_queue_head(sk, probe_size))
return -1;
+ if (tcp_pacing_check(sk))
+ return -1;
+
/* We're allowed to probe. Build it now. */
nskb = sk_stream_alloc_skb(sk, probe_size, GFP_ATOMIC, false);
if (!nskb)
@@ -2190,12 +2211,6 @@ static int tcp_mtu_probe(struct sock *sk)
return -1;
}
-static bool tcp_pacing_check(const struct sock *sk)
-{
- return tcp_needs_internal_pacing(sk) &&
- hrtimer_is_queued(&tcp_sk(sk)->pacing_timer);
-}
-
/* TCP Small Queues :
* Control number of packets in qdisc/devices to two packets / or ~1 ms.
* (These limits are doubled for retransmits)
--
1.8.3.1
On Thu, Jun 4, 2020 at 2:01 AM <[email protected]> wrote:
>
> From: Jason Xing <[email protected]>
>
> When using BBR mode, too many tcp socks cannot be released because of
> duplicate use of the sock_hold() in the manner of tcp_internal_pacing()
> when RTO happens. Therefore, this situation maddly increases the slab
> memory and then constantly triggers the OOM until crash.
>
> Besides, in addition to BBR mode, if some mode applies pacing function,
> it could trigger what we've discussed above,
>
> Reproduce procedure:
> 0) cat /proc/slabinfo | grep TCP
> 1) switch net.ipv4.tcp_congestion_control to bbr
> 2) using wrk tool something like that to send packages
> 3) using tc to increase the delay and loss to simulate the RTO case.
> 4) cat /proc/slabinfo | grep TCP
> 5) kill the wrk command and observe the number of objects and slabs in
> TCP.
> 6) at last, you could notice that the number would not decrease.
>
> v2: extend the timer which could cover all those related potential risks
> (suggested by Eric Dumazet and Neal Cardwell)
>
> Signed-off-by: Jason Xing <[email protected]>
> Signed-off-by: liweishi <[email protected]>
> Signed-off-by: Shujin Li <[email protected]>
That is not how things work really.
I will submit this properly so that stable teams do not have to guess
how to backport this to various kernels.
Changelog is misleading, this has nothing to do with BBR, we need to be precise.
Thank you.
On Thu, Jun 4, 2020 at 9:10 PM Eric Dumazet <[email protected]> wrote:
>
> On Thu, Jun 4, 2020 at 2:01 AM <[email protected]> wrote:
> >
> > From: Jason Xing <[email protected]>
> >
> > When using BBR mode, too many tcp socks cannot be released because of
> > duplicate use of the sock_hold() in the manner of tcp_internal_pacing()
> > when RTO happens. Therefore, this situation maddly increases the slab
> > memory and then constantly triggers the OOM until crash.
> >
> > Besides, in addition to BBR mode, if some mode applies pacing function,
> > it could trigger what we've discussed above,
> >
> > Reproduce procedure:
> > 0) cat /proc/slabinfo | grep TCP
> > 1) switch net.ipv4.tcp_congestion_control to bbr
> > 2) using wrk tool something like that to send packages
> > 3) using tc to increase the delay and loss to simulate the RTO case.
> > 4) cat /proc/slabinfo | grep TCP
> > 5) kill the wrk command and observe the number of objects and slabs in
> > TCP.
> > 6) at last, you could notice that the number would not decrease.
> >
> > v2: extend the timer which could cover all those related potential risks
> > (suggested by Eric Dumazet and Neal Cardwell)
> >
> > Signed-off-by: Jason Xing <[email protected]>
> > Signed-off-by: liweishi <[email protected]>
> > Signed-off-by: Shujin Li <[email protected]>
>
> That is not how things work really.
>
> I will submit this properly so that stable teams do not have to guess
> how to backport this to various kernels.
>
> Changelog is misleading, this has nothing to do with BBR, we need to be precise.
>
Thanks for your help. I can finally apply this patch into my kernel.
Looking forward to your patchset :)
Jason
> Thank you.
Hi everyone,
Could anyone take a look at this issue? I believe it is of high-importance.
Though Eric gave the proper patch a few months ago, the stable branch
still hasn't applied or merged this fix. It seems this patch was
forgotten :(
Thanks,
Jason
On Thu, Jun 4, 2020 at 9:47 PM Jason Xing <[email protected]> wrote:
>
> On Thu, Jun 4, 2020 at 9:10 PM Eric Dumazet <[email protected]> wrote:
> >
> > On Thu, Jun 4, 2020 at 2:01 AM <[email protected]> wrote:
> > >
> > > From: Jason Xing <[email protected]>
> > >
> > > When using BBR mode, too many tcp socks cannot be released because of
> > > duplicate use of the sock_hold() in the manner of tcp_internal_pacing()
> > > when RTO happens. Therefore, this situation maddly increases the slab
> > > memory and then constantly triggers the OOM until crash.
> > >
> > > Besides, in addition to BBR mode, if some mode applies pacing function,
> > > it could trigger what we've discussed above,
> > >
> > > Reproduce procedure:
> > > 0) cat /proc/slabinfo | grep TCP
> > > 1) switch net.ipv4.tcp_congestion_control to bbr
> > > 2) using wrk tool something like that to send packages
> > > 3) using tc to increase the delay and loss to simulate the RTO case.
> > > 4) cat /proc/slabinfo | grep TCP
> > > 5) kill the wrk command and observe the number of objects and slabs in
> > > TCP.
> > > 6) at last, you could notice that the number would not decrease.
> > >
> > > v2: extend the timer which could cover all those related potential risks
> > > (suggested by Eric Dumazet and Neal Cardwell)
> > >
> > > Signed-off-by: Jason Xing <[email protected]>
> > > Signed-off-by: liweishi <[email protected]>
> > > Signed-off-by: Shujin Li <[email protected]>
> >
> > That is not how things work really.
> >
> > I will submit this properly so that stable teams do not have to guess
> > how to backport this to various kernels.
> >
> > Changelog is misleading, this has nothing to do with BBR, we need to be precise.
> >
>
> Thanks for your help. I can finally apply this patch into my kernel.
>
> Looking forward to your patchset :)
>
> Jason
>
> > Thank you.
On 8/11/20 3:37 AM, Jason Xing wrote:
> Hi everyone,
>
> Could anyone take a look at this issue? I believe it is of high-importance.
> Though Eric gave the proper patch a few months ago, the stable branch
> still hasn't applied or merged this fix. It seems this patch was
> forgotten :(
Sure, I'll take care of this shortly.
Thanks.
>
> Thanks,
> Jason
>
> On Thu, Jun 4, 2020 at 9:47 PM Jason Xing <[email protected]> wrote:
>>
>> On Thu, Jun 4, 2020 at 9:10 PM Eric Dumazet <[email protected]> wrote:
>>>
>>> On Thu, Jun 4, 2020 at 2:01 AM <[email protected]> wrote:
>>>>
>>>> From: Jason Xing <[email protected]>
>>>>
>>>> When using BBR mode, too many tcp socks cannot be released because of
>>>> duplicate use of the sock_hold() in the manner of tcp_internal_pacing()
>>>> when RTO happens. Therefore, this situation maddly increases the slab
>>>> memory and then constantly triggers the OOM until crash.
>>>>
>>>> Besides, in addition to BBR mode, if some mode applies pacing function,
>>>> it could trigger what we've discussed above,
>>>>
>>>> Reproduce procedure:
>>>> 0) cat /proc/slabinfo | grep TCP
>>>> 1) switch net.ipv4.tcp_congestion_control to bbr
>>>> 2) using wrk tool something like that to send packages
>>>> 3) using tc to increase the delay and loss to simulate the RTO case.
>>>> 4) cat /proc/slabinfo | grep TCP
>>>> 5) kill the wrk command and observe the number of objects and slabs in
>>>> TCP.
>>>> 6) at last, you could notice that the number would not decrease.
>>>>
>>>> v2: extend the timer which could cover all those related potential risks
>>>> (suggested by Eric Dumazet and Neal Cardwell)
>>>>
>>>> Signed-off-by: Jason Xing <[email protected]>
>>>> Signed-off-by: liweishi <[email protected]>
>>>> Signed-off-by: Shujin Li <[email protected]>
>>>
>>> That is not how things work really.
>>>
>>> I will submit this properly so that stable teams do not have to guess
>>> how to backport this to various kernels.
>>>
>>> Changelog is misleading, this has nothing to do with BBR, we need to be precise.
>>>
>>
>> Thanks for your help. I can finally apply this patch into my kernel.
>>
>> Looking forward to your patchset :)
>>
>> Jason
>>
>>> Thank you.
On Tue, Aug 11, 2020 at 11:33 PM Eric Dumazet <[email protected]> wrote:
>
>
>
> On 8/11/20 3:37 AM, Jason Xing wrote:
> > Hi everyone,
> >
> > Could anyone take a look at this issue? I believe it is of high-importance.
> > Though Eric gave the proper patch a few months ago, the stable branch
> > still hasn't applied or merged this fix. It seems this patch was
> > forgotten :(
>
>
> Sure, I'll take care of this shortly.
Hi Eric,
It has been a very long time. It seems this issue was left behind and
almost forgotten, I think.
Could you mind taking some time to fix this up if you still consider
it as important?
Our team has been waiting for your patchset. Afterall, it once had a
huge impact on our
thousands and hundreds of machines.
thanks,
Jason
>
> Thanks.
>
> >
> > Thanks,
> > Jason
> >
> > On Thu, Jun 4, 2020 at 9:47 PM Jason Xing <[email protected]> wrote:
> >>
> >> On Thu, Jun 4, 2020 at 9:10 PM Eric Dumazet <[email protected]> wrote:
> >>>
> >>> On Thu, Jun 4, 2020 at 2:01 AM <[email protected]> wrote:
> >>>>
> >>>> From: Jason Xing <[email protected]>
> >>>>
> >>>> When using BBR mode, too many tcp socks cannot be released because of
> >>>> duplicate use of the sock_hold() in the manner of tcp_internal_pacing()
> >>>> when RTO happens. Therefore, this situation maddly increases the slab
> >>>> memory and then constantly triggers the OOM until crash.
> >>>>
> >>>> Besides, in addition to BBR mode, if some mode applies pacing function,
> >>>> it could trigger what we've discussed above,
> >>>>
> >>>> Reproduce procedure:
> >>>> 0) cat /proc/slabinfo | grep TCP
> >>>> 1) switch net.ipv4.tcp_congestion_control to bbr
> >>>> 2) using wrk tool something like that to send packages
> >>>> 3) using tc to increase the delay and loss to simulate the RTO case.
> >>>> 4) cat /proc/slabinfo | grep TCP
> >>>> 5) kill the wrk command and observe the number of objects and slabs in
> >>>> TCP.
> >>>> 6) at last, you could notice that the number would not decrease.
> >>>>
> >>>> v2: extend the timer which could cover all those related potential risks
> >>>> (suggested by Eric Dumazet and Neal Cardwell)
> >>>>
> >>>> Signed-off-by: Jason Xing <[email protected]>
> >>>> Signed-off-by: liweishi <[email protected]>
> >>>> Signed-off-by: Shujin Li <[email protected]>
> >>>
> >>> That is not how things work really.
> >>>
> >>> I will submit this properly so that stable teams do not have to guess
> >>> how to backport this to various kernels.
> >>>
> >>> Changelog is misleading, this has nothing to do with BBR, we need to be precise.
> >>>
> >>
> >> Thanks for your help. I can finally apply this patch into my kernel.
> >>
> >> Looking forward to your patchset :)
> >>
> >> Jason
> >>
> >>> Thank you.