LinuxLists.cc - Re: Honoring SO_RCVLOWAT in proto

2008-10-05 20:27:55

Subject: Re: Honoring SO_RCVLOWAT in proto_ops.poll methods

Give this patch a try:

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 1ab341e..0e43875 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -384,13 +384,17 @@ unsigned int tcp_poll(struct file *file, struct socket *sock, poll_table *wait)

/* Connected? */
if ((1 << sk->sk_state) & ~(TCPF_SYN_SENT | TCPF_SYN_RECV)) {
+ int target = sock_rcvlowat(sk, 0, INT_MAX);
+
+ if (tp->urg_seq == tp->copied_seq &&
+ !sock_flag(sk, SOCK_URGINLINE) &&
+ tp->urg_data)
+ target--;
+
/* Potential race condition. If read of tp below will
* escape above sk->sk_state, we can be illegally awaken
* in SYN_* states. */
- if ((tp->rcv_nxt != tp->copied_seq) &&
- (tp->urg_seq != tp->copied_seq ||
- tp->rcv_nxt != tp->copied_seq + 1 ||
- sock_flag(sk, SOCK_URGINLINE) || !tp->urg_data))
+ if (target >= tp->rcv_nxt - tp->copied_seq)
mask |= POLLIN | POLLRDNORM;

if (!(sk->sk_shutdown & SEND_SHUTDOWN)) {

2008-10-05 21:46:14

by swivel

[permalink] [raw]

Subject: Re: Honoring SO_RCVLOWAT in proto_ops.poll methods

On Sun, Oct 05, 2008 at 01:27:22PM -0700, David Miller wrote:
>
> Give this patch a try:
>
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index 1ab341e..0e43875 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -384,13 +384,17 @@ unsigned int tcp_poll(struct file *file, struct socket *sock, poll_table *wait)
>
> /* Connected? */
> if ((1 << sk->sk_state) & ~(TCPF_SYN_SENT | TCPF_SYN_RECV)) {
> + int target = sock_rcvlowat(sk, 0, INT_MAX);
> +
> + if (tp->urg_seq == tp->copied_seq &&
> + !sock_flag(sk, SOCK_URGINLINE) &&
> + tp->urg_data)
> + target--;
> +
> /* Potential race condition. If read of tp below will
> * escape above sk->sk_state, we can be illegally awaken
> * in SYN_* states. */
> - if ((tp->rcv_nxt != tp->copied_seq) &&
> - (tp->urg_seq != tp->copied_seq ||
> - tp->rcv_nxt != tp->copied_seq + 1 ||
> - sock_flag(sk, SOCK_URGINLINE) || !tp->urg_data))
> + if (target >= tp->rcv_nxt - tp->copied_seq)
> mask |= POLLIN | POLLRDNORM;
>
> if (!(sk->sk_shutdown & SEND_SHUTDOWN)) {

I will be testing this patch today. At a glance it appears with this
patch we're still not taking rcvlowat into consideration in recv()
with MSG_PEEK flag set. This should probably also be corrected, as
mentioned in the thread previously.

Regards,
Vito Caputo

2008-10-05 22:31:32

by David Miller

[permalink] [raw]

Subject: Re: Honoring SO_RCVLOWAT in proto_ops.poll methods

From: [email protected]
Date: Sun, 5 Oct 2008 16:45:57 -0500

> I will be testing this patch today. At a glance it appears with this
> patch we're still not taking rcvlowat into consideration in recv()
> with MSG_PEEK flag set. This should probably also be corrected, as
> mentioned in the thread previously.

Yes, I remember you mentioning that.

I'll look into it.

2008-10-06 05:17:28

by lkml

[permalink] [raw]

Subject: Re: Honoring SO_RCVLOWAT in proto_ops.poll methods

On Sun, Oct 05, 2008 at 03:30:59PM -0700, David Miller wrote:
> From: [email protected]
> Date: Sun, 5 Oct 2008 16:45:57 -0500
>
> > I will be testing this patch today. At a glance it appears with this
> > patch we're still not taking rcvlowat into consideration in recv()
> > with MSG_PEEK flag set. This should probably also be corrected, as
> > mentioned in the thread previously.
>
> Yes, I remember you mentioning that.
>
> I'll look into it.

Looks like the RCVLOWAT patch breaks the tcp poll logic in the normal
case.

I didn't have a chance to scrutinize the changes but with the patch
applied simple things like a telnet to 127.0.0.1:80 exit immediately
with "Connection closed by foreign host".

In strace of the above telnet failure I see recv() returning EAGAIN
before exiting. Telnet expected a select() to block until data was
available and didn't expect recv() to find nothing available when the
select() reported there was something. Select() was no longer behaving
properly on ipv4 tcp sockets.

Regards,
Vito Caputo

2008-10-06 17:18:47

by David Miller

[permalink] [raw]

Subject: Re: Honoring SO_RCVLOWAT in proto_ops.poll methods

From: [email protected]
Date: Mon, 6 Oct 2008 00:17:18 -0500

> Looks like the RCVLOWAT patch breaks the tcp poll logic in the normal
> case.

Sorry, the condition was reversed, try this one instead:

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 1ab341e..7d81a1e 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -384,13 +384,17 @@ unsigned int tcp_poll(struct file *file, struct socket *sock, poll_table *wait)

/* Connected? */
if ((1 << sk->sk_state) & ~(TCPF_SYN_SENT | TCPF_SYN_RECV)) {
+ int target = sock_rcvlowat(sk, 0, INT_MAX);
+
+ if (tp->urg_seq == tp->copied_seq &&
+ !sock_flag(sk, SOCK_URGINLINE) &&
+ tp->urg_data)
+ target--;
+
/* Potential race condition. If read of tp below will
* escape above sk->sk_state, we can be illegally awaken
* in SYN_* states. */
- if ((tp->rcv_nxt != tp->copied_seq) &&
- (tp->urg_seq != tp->copied_seq ||
- tp->rcv_nxt != tp->copied_seq + 1 ||
- sock_flag(sk, SOCK_URGINLINE) || !tp->urg_data))
+ if (tp->rcv_nxt - tp->copied_seq >= target)
mask |= POLLIN | POLLRDNORM;

if (!(sk->sk_shutdown & SEND_SHUTDOWN)) {

2008-10-06 17:46:33

by David Miller

[permalink] [raw]

Subject: Re: Honoring SO_RCVLOWAT in proto_ops.poll methods

From: David Miller <[email protected]>
Date: Mon, 06 Oct 2008 10:18:14 -0700 (PDT)

> From: [email protected]
> Date: Mon, 6 Oct 2008 00:17:18 -0500
>
> > Looks like the RCVLOWAT patch breaks the tcp poll logic in the normal
> > case.
>
> Sorry, the condition was reversed, try this one instead:

FWIW, since this passed my own testing, I pushed this into
net-next-2.6

2008-10-13 07:35:18

by David Miller

[permalink] [raw]

Subject: Re: Honoring SO_RCVLOWAT in proto_ops.poll methods

From: David Miller <[email protected]>
Date: Sun, 05 Oct 2008 15:30:59 -0700 (PDT)

> From: [email protected]
> Date: Sun, 5 Oct 2008 16:45:57 -0500
>
> > I will be testing this patch today. At a glance it appears with this
> > patch we're still not taking rcvlowat into consideration in recv()
> > with MSG_PEEK flag set. This should probably also be corrected, as
> > mentioned in the thread previously.
>
> Yes, I remember you mentioning that.
>
> I'll look into it.

Were you able to test my updated patch?

If it makes your application work, I might want to defer messing around
with MSG_PEEK semantics in recvmsg().

So I've been waiting for your test results before I proceed.

2008-10-13 08:32:27

by swivel

[permalink] [raw]

Subject: Re: Honoring SO_RCVLOWAT in proto_ops.poll methods

On Mon, Oct 13, 2008 at 12:34:41AM -0700, David Miller wrote:
> From: David Miller <[email protected]>
> Date: Sun, 05 Oct 2008 15:30:59 -0700 (PDT)
>
> > From: [email protected]
> > Date: Sun, 5 Oct 2008 16:45:57 -0500
> >
> > > I will be testing this patch today. At a glance it appears with this
> > > patch we're still not taking rcvlowat into consideration in recv()
> > > with MSG_PEEK flag set. This should probably also be corrected, as
> > > mentioned in the thread previously.
> >
> > Yes, I remember you mentioning that.
> >
> > I'll look into it.
>
> Were you able to test my updated patch?
>
> If it makes your application work, I might want to defer messing around
> with MSG_PEEK semantics in recvmsg().
>
> So I've been waiting for your test results before I proceed.

Didn't test your latest patch after you mentioned that you had tested it
yourself successfully.

The application is still broken having recv() not behave normally with
the MSG_PEEK flag. The app will spin calling recv() repeatedly once any
data has made it into the rcvbuf, because recv() will always immediately
return with the >0 # of bytes in the buffer.

I'm using the pseudo-blocking recv() behavior achieved with SO_RCVTIMEO.
Thus my app expects recv() to block until SO_RCVLOWAT is met or SO_RCVTIMEO
expired. Upon timeout expiration recv() is supposed to return -1 with
errno=EAGAIN. With recv() immediately returning on MSG_PEEK there's no
possibility for timeout expiration. recv() behaves perfectly in this
regard without MSG_PEEK, it just needs some minor adjustment to behave
the same with MSG_PEEK specified.

There's a simple diagram illustrating the application implementation
in general. This may help understand the implications of recv() not
blocking once there is data in the rcvbuf without wasting much of your
time:
http://serverkit.org/modules/httpx/httpx.png

Thanks for the follow-up.

Regards,
Vito Caputo

2008-10-13 10:13:10

by David Miller

[permalink] [raw]

Subject: Re: Honoring SO_RCVLOWAT in proto_ops.poll methods

From: [email protected]
Date: Mon, 13 Oct 2008 03:32:14 -0500

> I'm using the pseudo-blocking recv() behavior achieved with SO_RCVTIMEO.
> Thus my app expects recv() to block until SO_RCVLOWAT is met or SO_RCVTIMEO
> expired.

But if you poll() properly, you'll never call recv() unless the amount
of bytes you want are there.

And since I fixed poll()'s handling of SO_RCVLOWAT it should mostly
work.

2008-10-20 03:58:29

by swivel

[permalink] [raw]

Subject: Re: Honoring SO_RCVLOWAT in proto_ops.poll methods

On Mon, Oct 13, 2008 at 02:58:16AM -0700, David Miller wrote:
> From: [email protected]
> Date: Mon, 13 Oct 2008 03:32:14 -0500
>
> > I'm using the pseudo-blocking recv() behavior achieved with SO_RCVTIMEO.
> > Thus my app expects recv() to block until SO_RCVLOWAT is met or SO_RCVTIMEO
> > expired.
>
> But if you poll() properly, you'll never call recv() unless the amount
> of bytes you want are there.
>
> And since I fixed poll()'s handling of SO_RCVLOWAT it should mostly
> work.

The app already has a kludge in place to work around the current kernel
with broken poll() and recv() with regards to SO_RCVLOWAT.

It's less than ideal but 'mostly works', so I'm already at that point...
Doesn't really make sense to me to rewrite the kludge to depend on a
very modern kernel without even having it be able to use recv()
properly.

I just hope we can have recv() block with MSG_PEEK when SO_RCVLOWAT is
>1 in the near future. My goal was next time I get around to doing a
dist-upgrade the new kernel would have both poll and recv fixed and I
could disable the kludge.

>From what I can see the recv() MSG_PEEK fix is trivial anyways, why not
fix it?

Thanks,
Vito Caputo

2008-10-20 04:26:43

by David Miller

[permalink] [raw]

Subject: Re: Honoring SO_RCVLOWAT in proto_ops.poll methods

From: [email protected]
Date: Sun, 19 Oct 2008 22:58:19 -0500

> From what I can see the recv() MSG_PEEK fix is trivial anyways, why not
> fix it?

Patches welcome :-)

I'm just very busy and that's the only reason I haven't tackled it
myself.

2008-11-05 11:37:26

by David Miller

[permalink] [raw]

Subject: Re: Honoring SO_RCVLOWAT in proto_ops.poll methods

From: [email protected]
Date: Sun, 19 Oct 2008 22:58:19 -0500

> From what I can see the recv() MSG_PEEK fix is trivial anyways, why not
> fix it?

Does this patch work for you?

tcp: Fix recvmsg MSG_PEEK influence of blocking behavior.

Vito Caputo noticed that tcp_recvmsg() returns immediately from
partial reads when MSG_PEEK is used. In particular, this means that
SO_RCVLOWAT is not respected.

Simply remove the test. And this matches the behavior of several
other systems, including BSD.

Signed-off-by: David S. Miller <[email protected]>
---
net/ipv4/tcp.c | 3 +--
1 files changed, 1 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index eccb716..c5aca0b 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1374,8 +1374,7 @@ int tcp_recvmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
sk->sk_state == TCP_CLOSE ||
(sk->sk_shutdown & RCV_SHUTDOWN) ||
!timeo ||
- signal_pending(current) ||
- (flags & MSG_PEEK))
+ signal_pending(current))
break;
} else {
if (sock_flag(sk, SOCK_DONE))
--
1.5.6.5