The sk_max_ack_backlog will be set in the callers inet_listen() and
dccp_listen_start(), so it is redundant to set it in
inet_csk_listen_start().
Just remove this setting.
Signed-off-by: Yafang Shao <[email protected]>
---
net/ipv4/inet_connection_sock.c | 1 -
1 file changed, 1 deletion(-)
diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
index dfd5009..cdd5c95 100644
--- a/net/ipv4/inet_connection_sock.c
+++ b/net/ipv4/inet_connection_sock.c
@@ -871,7 +871,6 @@ int inet_csk_listen_start(struct sock *sk, int backlog)
reqsk_queue_alloc(&icsk->icsk_accept_queue);
- sk->sk_max_ack_backlog = backlog;
sk->sk_ack_backlog = 0;
inet_csk_delack_init(sk);
--
1.8.3.1
By default, sk->sk_allocation is GFP_KERNEL, which means that if there is
not enough memory, the allocation will do both direct reclaim and background
reclaim. If the system has a large amount of memory, direct reclaim may cause
a large latency spike.
When we set MSG_DONTWAIT in send syscalls, we really don't want the call to
block, so we'd better clear __GFP_DIRECT_RECLAIM when allocating the skb in
the send path. Then it will return immediately if there is not enough memory
to be allocated, and the application has a chance to do something else
instead of being blocked here.
Signed-off-by: Yafang Shao <[email protected]>
---
net/ipv4/tcp.c | 7 +++++--
1 file changed, 5 insertions(+), 2 deletions(-)
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 43ef83b..fe4f5ce 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1182,6 +1182,7 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
bool process_backlog = false;
bool zc = false;
long timeo;
+ gfp_t gfp;
flags = msg->msg_flags;
@@ -1255,6 +1256,9 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
/* Ok commence sending. */
copied = 0;
+ gfp = flags & MSG_DONTWAIT ? sk->sk_allocation & ~__GFP_DIRECT_RECLAIM :
+ sk->sk_allocation;
+
restart:
mss_now = tcp_send_mss(sk, &size_goal, flags);
@@ -1283,8 +1287,7 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
}
first_skb = tcp_rtx_and_write_queues_empty(sk);
linear = select_size(first_skb, zc);
- skb = sk_stream_alloc_skb(sk, linear, sk->sk_allocation,
- first_skb);
+ skb = sk_stream_alloc_skb(sk, linear, gfp, first_skb);
if (!skb)
goto wait_for_memory;
--
1.8.3.1
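The flag selection in the patch above can be modeled in userspace. This is only a sketch: the MODEL_* constants and send_gfp() are hypothetical stand-ins invented for illustration (the real gfp_t bits and MSG_DONTWAIT are defined in kernel headers and have different values). The one fact relied on is that GFP_KERNEL includes both the direct-reclaim and the kswapd-reclaim bits together with __GFP_IO and __GFP_FS.

```c
#include <assert.h>

/* Hypothetical stand-ins for kernel flag bits; the values are illustrative only. */
typedef unsigned int gfp_model_t;

#define MODEL_GFP_DIRECT_RECLAIM 0x1u
#define MODEL_GFP_KSWAPD_RECLAIM 0x2u
#define MODEL_GFP_IO             0x4u
#define MODEL_GFP_FS             0x8u
#define MODEL_GFP_KERNEL (MODEL_GFP_DIRECT_RECLAIM | MODEL_GFP_KSWAPD_RECLAIM | \
                          MODEL_GFP_IO | MODEL_GFP_FS)
#define MODEL_MSG_DONTWAIT       0x40

/* Mirrors the ternary added by the patch: drop only the direct-reclaim bit
 * when the caller asked for a non-blocking send. */
static gfp_model_t send_gfp(gfp_model_t sk_allocation, int msg_flags)
{
	gfp_model_t gfp = sk_allocation;

	if (msg_flags & MODEL_MSG_DONTWAIT)
		gfp &= ~MODEL_GFP_DIRECT_RECLAIM;

	/* Masking can only remove bits, never add them. */
	assert((gfp & ~sk_allocation) == 0);
	return gfp;
}
```

In this model background (kswapd) reclaim stays enabled for the non-blocking case; only the synchronous reclaim in the caller's context is removed.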
On Tue, Oct 9, 2018 at 5:05 AM Yafang Shao <[email protected]> wrote:
>
> The sk_max_ack_backlog will be set in the caller inet_listen() and
> dccp_listen_start(), so it is redundant to set it in
> inet_csk_listen_start().
> Just remove this setting.
>
> Signed-off-by: Yafang Shao <[email protected]>
> ---
> net/ipv4/inet_connection_sock.c | 1 -
> 1 file changed, 1 deletion(-)
>
> diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
> index dfd5009..cdd5c95 100644
> --- a/net/ipv4/inet_connection_sock.c
> +++ b/net/ipv4/inet_connection_sock.c
> @@ -871,7 +871,6 @@ int inet_csk_listen_start(struct sock *sk, int backlog)
>
> reqsk_queue_alloc(&icsk->icsk_accept_queue);
>
> - sk->sk_max_ack_backlog = backlog;
> sk->sk_ack_backlog = 0;
> inet_csk_delack_init(sk);
You got it wrong again. Can you read my feedback one more time ?
This setting is not redundant, unless you move the ones in
inet_listen() and inet_dccp_listen() earlier.
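The ordering concern can be illustrated with a toy userspace model. It assumes, as the reply above implies, that the callers assign sk_max_ack_backlog only after inet_csk_listen_start() has returned. Every name below (model_sock, model_listen_start, model_listen) is hypothetical, and "publish" merely stands in for the point at which the listener becomes reachable by incoming connections.

```c
#include <assert.h>

struct model_sock {
	int max_ack_backlog;	/* stands in for sk->sk_max_ack_backlog */
	int seen_at_publish;	/* limit visible when the socket becomes reachable */
};

/* Stands in for inet_csk_listen_start(); set_here selects whether the callee
 * assigns the limit itself (current code) or not (the proposed removal). */
static void model_listen_start(struct model_sock *sk, int backlog, int set_here)
{
	assert(backlog >= 0);
	if (set_here)
		sk->max_ack_backlog = backlog;
	/* From here on, other contexts could observe the socket. */
	sk->seen_at_publish = sk->max_ack_backlog;
}

/* Stands in for a caller that assigns the limit only after the start call. */
static void model_listen(struct model_sock *sk, int backlog, int set_in_start)
{
	model_listen_start(sk, backlog, set_in_start);
	sk->max_ack_backlog = backlog;
}
```

With set_in_start == 0 the model publishes the socket while the limit still holds its old value, which is the window the reply points at; moving the caller's assignment before the start call would close it.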
On Tue, Oct 9, 2018 at 5:05 AM Yafang Shao <[email protected]> wrote:
>
> By default, the sk->sk_allocation is GFP_KERNEL, that means if there's
> no enough memory it will do both direct reclaim and background reclaim.
> If the size of system memory is great, the direct reclaim may cause great
> latency spike.
>
> When we set MSG_DONTWAIT in send syscalls, we really don't want it to be
> blocked, so we'd better clear __GFP_DIRECT_RECLAIM when allocate skb in the
> send path. Then, it will return immediately if there's no enough memory to
> be allocated, and then the appliation has a chance to do some other stuffs
> instead of being blocked here.
>
> Signed-off-by: Yafang Shao <[email protected]>
> ---
> net/ipv4/tcp.c | 7 +++++--
> 1 file changed, 5 insertions(+), 2 deletions(-)
>
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index 43ef83b..fe4f5ce 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -1182,6 +1182,7 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
> bool process_backlog = false;
> bool zc = false;
> long timeo;
> + gfp_t gfp;
>
> flags = msg->msg_flags;
>
> @@ -1255,6 +1256,9 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
> /* Ok commence sending. */
> copied = 0;
>
> + gfp = flags & MSG_DONTWAIT ? sk->sk_allocation & ~__GFP_DIRECT_RECLAIM :
> + sk->sk_allocation;
> +
> restart:
> mss_now = tcp_send_mss(sk, &size_goal, flags);
>
> @@ -1283,8 +1287,7 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
> }
> first_skb = tcp_rtx_and_write_queues_empty(sk);
> linear = select_size(first_skb, zc);
> - skb = sk_stream_alloc_skb(sk, linear, sk->sk_allocation,
> - first_skb);
> + skb = sk_stream_alloc_skb(sk, linear, gfp, first_skb);
> if (!skb)
> goto wait_for_memory;
How have you tested this patch exactly ?
Most TCP payload is added in page fragments, and you have not
changed the page fragment allocations.
Also, I do not see how an application will get future notifications
that it can retry the failed system call ?
How are you really going to deal with this in high performance applications ?
I would rather have a setsockopt() that can eventually
flip __GFP_DIRECT_RECLAIM in sk->sk_allocation,
so that we do not add all these tests in the fast path, but honestly I do not see how
applications can really make use of this.
On Tue, Oct 9, 2018 at 10:12 PM Eric Dumazet <[email protected]> wrote:
>
> On Tue, Oct 9, 2018 at 5:05 AM Yafang Shao <[email protected]> wrote:
> >
> > By default, the sk->sk_allocation is GFP_KERNEL, that means if there's
> > no enough memory it will do both direct reclaim and background reclaim.
> > If the size of system memory is great, the direct reclaim may cause great
> > latency spike.
> >
> > When we set MSG_DONTWAIT in send syscalls, we really don't want it to be
> > blocked, so we'd better clear __GFP_DIRECT_RECLAIM when allocate skb in the
> > send path. Then, it will return immediately if there's no enough memory to
> > be allocated, and then the appliation has a chance to do some other stuffs
> > instead of being blocked here.
> >
> > Signed-off-by: Yafang Shao <[email protected]>
> > ---
> > net/ipv4/tcp.c | 7 +++++--
> > 1 file changed, 5 insertions(+), 2 deletions(-)
> >
> > diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> > index 43ef83b..fe4f5ce 100644
> > --- a/net/ipv4/tcp.c
> > +++ b/net/ipv4/tcp.c
> > @@ -1182,6 +1182,7 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
> > bool process_backlog = false;
> > bool zc = false;
> > long timeo;
> > + gfp_t gfp;
> >
> > flags = msg->msg_flags;
> >
> > @@ -1255,6 +1256,9 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
> > /* Ok commence sending. */
> > copied = 0;
> >
> > + gfp = flags & MSG_DONTWAIT ? sk->sk_allocation & ~__GFP_DIRECT_RECLAIM :
> > + sk->sk_allocation;
> > +
> > restart:
> > mss_now = tcp_send_mss(sk, &size_goal, flags);
> >
> > @@ -1283,8 +1287,7 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
> > }
> > first_skb = tcp_rtx_and_write_queues_empty(sk);
> > linear = select_size(first_skb, zc);
> > - skb = sk_stream_alloc_skb(sk, linear, sk->sk_allocation,
> > - first_skb);
> > + skb = sk_stream_alloc_skb(sk, linear, gfp, first_skb);
> > if (!skb)
> > goto wait_for_memory;
>
>
> How have you tested this patch exactly ?
>
We recently saw network latency (hundreds of msecs, or even one second)
in our production environment.
I eventually diagnosed that this latency was caused by direct reclaim
in tcp_sendmsg.
That issue could be resolved by keeping some memory in reserve.
But I wondered why we don't forbid direct reclaim when MSG_DONTWAIT is set.
So I made this change and tested it. The application got an errno
returned instead of being blocked in the send path.
That's why I submitted this patch.
> Most of TCP payloads are added in page fragments, and you have not
> changed the page allocation fragments.
>
> Also, I do not see how an application will get future notifications
> that it can retry the failed system call ?
> How are you really going to deal with this in high performance applications ?
>
I think that returning immediately with an errno is better than being blocked.
Maybe this solution is not good enough.
At least it could tell the application that something is wrong and it
can't send now.
> I would rather prefer a socket setsockopt() to eventually be able to
> flip __GFP_DIRECT_RECLAIM in sk->sk_allocation,
> to not add all these tests in fast path, but honestly I do not see how
> applications can really make use of this.
Maybe an event is needed to tell the application it can send now.
I don't have a better idea either.
Thanks
Yafang
> >
> There was a network latency (hunreds msecs or even one sec ) recently
> on our production enviroment.
> And finally I diagnosed that this latency was caused by direct reclaim
> in tcp_sendmsg.
> That issue could be resovled by keeping a reserved memory.
> But I think deeply that why not forbid direct reclaim if we set MSG_DONWAIT.
> So I did this change and tested it. The application got a errno
> returned instead of being blocked in send path.
> That's why I sumbit this patch.
Sure, and I asked you how you have tested it, because it seems clear
to me that you missed
the real memory allocation point (we fill up to 64 KB of page
fragment memory into one (small) skb).
And how is the application going to use MSG_DONTWAIT in the real
world ? I do wonder as well.
We do not add bloat in the kernel if no application is ever going to
use it, especially in the TCP fast path.
Give us a test, so that we can see how this can be used...
Thanks.
On Tue, Oct 9, 2018 at 7:58 AM Eric Dumazet <[email protected]> wrote:
>
> We do not add bloat in the kernel if no application is ever going to
> use it, especially in the TCP fast path.
>
BTW, are you willing to change all memory allocations in the kernel as well ?
Let's say an application is using a system call providing a pathname
(open(), stat(), ...), how is this system call
going to ask the kernel for no direct reclaim ?
Even allocating a socket with socket() or accept() has no ability to
avoid direct reclaim.
So tcp_sendmsg() is only the tip of the iceberg.
On Tue, Oct 9, 2018 at 11:38 PM Eric Dumazet <[email protected]> wrote:
>
> On Tue, Oct 9, 2018 at 7:58 AM Eric Dumazet <[email protected]> wrote:
> >
>
> > We do not add bloat in the kernel if no application is ever going to
> > use it, especially in the TCP fast path.
> >
>
> BTW, are you willing to change all memory allocations in the kernel as well ?
>
> Let say an application is using a system call providing a pathname
> (open(), stat(), ...), how this system call
> is going to ask the kernel for no direct reclaim ?
>
> Even allocating a socket with socket() or accept() has no ability to
> avoid direct reclaim.
>
> So tcp_sendmsg() is only the tip of the iceberg.
If we can really find a solution that is good enough to handle direct
reclaim in tcp_sendmsg,
we could also implement it in other syscalls.
Unexpected latency is hateful.
Thanks
Yafang
On 10/09/2018 06:30 PM, Yafang Shao wrote:
> On Tue, Oct 9, 2018 at 11:38 PM Eric Dumazet <[email protected]> wrote:
>>
>> On Tue, Oct 9, 2018 at 7:58 AM Eric Dumazet <[email protected]> wrote:
>>>
>>
>>> We do not add bloat in the kernel if no application is ever going to
>>> use it, especially in the TCP fast path.
>>>
>>
>> BTW, are you willing to change all memory allocations in the kernel as well ?
>>
>> Let say an application is using a system call providing a pathname
>> (open(), stat(), ...), how this system call
>> is going to ask the kernel for no direct reclaim ?
>>
>> Even allocating a socket with socket() or accept() has no ability to
>> avoid direct reclaim.
>>
>> So tcp_sendmsg() is only the tip of the iceberg.
>
> If we can really find a solution that is good enough to hanlde direct
> reclaim in tcp_sendmsg,
> we could also implement it in other syscalls.
> Unexpected latency is hateful.
We have thousands of other places in the kernel. I want to find a generic solution,
not patch all the places one by one.
So come back when you have something more generic, and once applications have a way
to gracefully handle these memory allocation issues
(without calling sendmsg() in an infinite loop ...).
How is EPOLLOUT going to be generated ?
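
For context on the notification question, here is a minimal userspace sketch of how a non-blocking sender commonly waits for writability before retrying. The helper name send_all_nonblocking() and its timeout parameter are hypothetical; send(), poll(), MSG_DONTWAIT and MSG_NOSIGNAL are the standard Linux calls and flags.

```c
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Send len bytes without ever blocking inside send(); wait in poll() instead.
 * Returns the number of bytes sent, or -1 with errno set. */
static ssize_t send_all_nonblocking(int fd, const char *buf, size_t len,
				    int timeout_ms)
{
	size_t off = 0;

	assert(buf != NULL || len == 0);

	while (off < len) {
		ssize_t n = send(fd, buf + off, len - off,
				 MSG_DONTWAIT | MSG_NOSIGNAL);
		if (n >= 0) {
			off += (size_t)n;
			continue;
		}
		if (errno == EINTR)
			continue;
		if (errno != EAGAIN && errno != EWOULDBLOCK)
			return -1;

		/* Would block: sleep until the socket reports writable. */
		struct pollfd pfd = { .fd = fd, .events = POLLOUT };
		int r = poll(&pfd, 1, timeout_ms);
		if (r == 0) {
			errno = ETIMEDOUT;
			return -1;
		}
		if (r < 0 && errno != EINTR)
			return -1;
	}
	return (ssize_t)off;
}
```

This pattern relies on POLLOUT being raised when a retry can succeed. If send() were to fail because of memory pressure while the socket still reports itself writable, poll() may return immediately and the loop would spin, which is the infinite-loop concern raised above.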