Hi,
I was glad to see Linux gain SO_SNDTIMEO in kernel 2.4. It is a very use
feature which can avoid complexity and pain in userspace programs.
Unfortunately, it seems to be very buggy. Here are two buggy scenarios.
1)
Create a socketpair(), PF_UNIX, SOCK_STREAM.
Set a 5 second SO_SNDTIMEO on the socket.
write() 100k down the socket in one write(), i.e. enough to cause the
write to have to block.
--> BUG!!! The call blocks indefinitely instead of returning after 5
seconds
(Note that the same test but with SO_RCVTIMEO and a read() works as
expected - I get EAGAIN after 5 seconds).
2)
Create a localhost listening socket - AF_INET, SOCK_STREAM.
Connect to the listening port
Set a 5 second SO_SNDTIMEO on the socket.
write() 1Mb down the socket in one write(), i.e. enough to cause it to
have to block
-> The write() will return after 5 seconds with a partial write count.
GOOD!
Repeat the write() - send another 1Mb.
--> BUG!! The call blocks indefinitely instead of returning with EAGAIN
after 5s.
I hope this is detailled enough. I'm trying to gain access to a FreeBSD
box to compare results..
Cheers
Chris
Hello!
> Unfortunately, it seems to be very buggy. Here are two buggy scenarios.
--- ../vger3-010210/linux/net/ipv4/tcp.c Sat Feb 10 23:16:51 2001
+++ linux/net/ipv4/tcp.c Sat Feb 17 23:27:43 2001
@@ -691,6 +691,8 @@
set_current_state(TASK_INTERRUPTIBLE);
+ if (!timeo)
+ break;
if (signal_pending(current))
break;
if (tcp_memory_free(sk) && !vm_wait)
--- ../vger3-010210/linux/net/core/sock.c Tue Jan 30 21:20:16 2001
+++ linux/net/core/sock.c Sat Feb 17 23:27:44 2001
@@ -727,6 +727,8 @@
clear_bit(SOCK_ASYNC_NOSPACE, &sk->socket->flags);
add_wait_queue(sk->sleep, &wait);
for (;;) {
+ if (!timeo)
+ break;
if (signal_pending(current))
break;
set_bit(SOCK_NOSPACE, &sk->socket->flags);
Alexey,
Damn you are quick! :) Testing immediately
Cheers
Chris
On Sat, 17 Feb 2001 [email protected] wrote:
> Hello!
>
> > Unfortunately, it seems to be very buggy. Here are two buggy scenarios.
>
>
> --- ../vger3-010210/linux/net/ipv4/tcp.c Sat Feb 10 23:16:51 2001
> +++ linux/net/ipv4/tcp.c Sat Feb 17 23:27:43 2001
> @@ -691,6 +691,8 @@
>
> set_current_state(TASK_INTERRUPTIBLE);
>
> + if (!timeo)
> + break;
> if (signal_pending(current))
> break;
> if (tcp_memory_free(sk) && !vm_wait)
> --- ../vger3-010210/linux/net/core/sock.c Tue Jan 30 21:20:16 2001
> +++ linux/net/core/sock.c Sat Feb 17 23:27:44 2001
> @@ -727,6 +727,8 @@
> clear_bit(SOCK_ASYNC_NOSPACE, &sk->socket->flags);
> add_wait_queue(sk->sleep, &wait);
> for (;;) {
> + if (!timeo)
> + break;
> if (signal_pending(current))
> break;
> set_bit(SOCK_NOSPACE, &sk->socket->flags);
>
Hi Alexey,
This patch fixes my simple read()/write() tests, nice one. The behaviour
also now matches BSD (someone kindly donated me a FreeBSD shell for
testing).
Unfortunately, I discovered a bug with SO_SNDTIMEO/sendfile():
- Connect an AF_INET, SOCK_STREAM socket to a local listening socket.
- Set 5 seconds SO_SNDTIMEO on the connected socket
- Do a sendfile() from a big file down the connected socket. Make sure the
size is big (e.g. 1Mb) so the call blocks.
--> BUG!! The call blocks indefinitely rather than being interrupted after
5 seconds.
Cheers
Chris
On Sat, 17 Feb 2001 [email protected] wrote:
> Hello!
>
> > Unfortunately, it seems to be very buggy. Here are two buggy scenarios.
>
>
> --- ../vger3-010210/linux/net/ipv4/tcp.c Sat Feb 10 23:16:51 2001
> +++ linux/net/ipv4/tcp.c Sat Feb 17 23:27:43 2001
> @@ -691,6 +691,8 @@
>
> set_current_state(TASK_INTERRUPTIBLE);
>
> + if (!timeo)
> + break;
> if (signal_pending(current))
> break;
> if (tcp_memory_free(sk) && !vm_wait)
> --- ../vger3-010210/linux/net/core/sock.c Tue Jan 30 21:20:16 2001
> +++ linux/net/core/sock.c Sat Feb 17 23:27:44 2001
> @@ -727,6 +727,8 @@
> clear_bit(SOCK_ASYNC_NOSPACE, &sk->socket->flags);
> add_wait_queue(sk->sleep, &wait);
> for (;;) {
> + if (!timeo)
> + break;
> if (signal_pending(current))
> break;
> set_bit(SOCK_NOSPACE, &sk->socket->flags);
>
Hi,
By the way - I tested SO_RCVLOWAT, another 2.4 addition. Good news this
time - seems to work fine.
Cheers
Chris
Hello!
> Unfortunately, I discovered a bug with SO_SNDTIMEO/sendfile():
None of the options apply to sendfile(). It is not socket level
operation. You have to use alarm for it.
BTW, if you have enough fast network, you probably can observe
that sendfile() is even not interrupted by signals. 8) But this
is possible to fix at least. BTW the same fix will repair SO_*TIMEO
partially, i.e. it will timeout after n*timeo, where n is an arbitrary
number not exceeding size/sndbuf.
Alexey
On Sun, 18 Feb 2001 [email protected] wrote:
> Hello!
>
> > Unfortunately, I discovered a bug with SO_SNDTIMEO/sendfile():
>
> None of the options apply to sendfile(). It is not socket level
> operation. You have to use alarm for it.
Hi Alexey,
Actually sendfile() _does_ timeout using SO_SNDTIMEO. It just takes longer
to timeout because the kernel sendfile() page loop will (usually) need to
timeout a short write, and then timeout a 0 byte write.
So the actual timeout would be 2 * SO_SNDTIMEO.
Unfortunately, I'm seeing timeout at (I think) 3 * SO_SNDTIMEO, which I
can't account for.
Cheers
Chris
Hello!
> So the actual timeout would be 2 * SO_SNDTIMEO.
It will timeout if write of some page blocks for SO_SNDTIMEO.
If transmission of any page never takes more than SO_SNDTIMEO it never
times out.
You can think about sendfile() as subroutine doing:
for (;;) {
read(4K from fdin);
write(4K to fdout);
}
All the options apply to each read()/write() separetely, so that... alas.
Alexey
On Sun, 18 Feb 2001 [email protected] wrote:
> Hello!
>
> > So the actual timeout would be 2 * SO_SNDTIMEO.
>
> It will timeout if write of some page blocks for SO_SNDTIMEO.
.. unless that page was partially written, in which case a short write
count is returned (rather than a timeout error), and the loop goes around
again.
> If transmission of any page never takes more than SO_SNDTIMEO it never
> times out.
Which is good, because SO_SNDTIMEO is an inactivity monitor.
Cheers
Chris
Hello!
> .. unless that page was partially written, in which case a short write
> count is returned (rather than a timeout error), and the loop goes around
> again.
sendfile() does not return on partial write and tries to push more
until error. On fast link it most likely succeeds, so that it is unkillable
even with SIGKILL.
> Which is good, because SO_SNDTIMEO is an inactivity monitor.
Then why did you blame? 8)8)
I do not think so. It is rather scheduling breaker. If connection
is idle 99% of time, but wakes each sndtimeo-1usec, it must yuild,
otherwise thread is lost for production.
Alexey
On Sun, 18 Feb 2001 [email protected] wrote:
> Hello!
>
> > Unfortunately, I discovered a bug with SO_SNDTIMEO/sendfile():
>
> None of the options apply to sendfile(). It is not socket level
> operation. You have to use alarm for it.
>
> BTW, if you have enough fast network, you probably can observe
> that sendfile() is even not interrupted by signals. 8) But this
> is possible to fix at least. BTW the same fix will repair SO_*TIMEO
> partially, i.e. it will timeout after n*timeo, where n is an arbitrary
> number not exceeding size/sndbuf.
Hi Alexey,
You are right - our sendfile() implementation is broken. I have fixed it
(patch at end of mail).
However, I believe something is still wrong in the networking layer, even
with my fix applied.
Before I go into details, I want to step back and describe things from a
_users_ perspective. That is most important after all.
Take two different operations: write() to a socket and sendfile() down a
socket. In both cases, the socket has a send timeout of 10 seconds. From a
users' point of view, these are two socket write operations. The source of
data is different (a buffer or a file descriptor), but that is irrelevant.
The user has the right to expect a timeout after 10 seconds of no
progress, on both operations.
I have tried this on FreeBSD, and this is what happens: both sendfile()
and write() timeout in the same way.
On Linux, this is not the case => bug. I fixed a small sendfile() issue,
which did not recognise partial writes as an interruption, but as I said
above, the bug still remains.
Investigation shows that the Linux network layer is behaving oddly. It
seems that we are writing 4096 bytes to a socket. This proceeds in 4096
byte chunks until the send buffer on the socket is full, and a 4096 byte
write blocks. This blocking write is eventually interrupted by the
timeout, and the write call returns.. wait for it.. 4096! This suggests
there was socket space after all, and the call should not have blocked.
I wonder what is going on? I'd like to get this fixed. I think the FreeBSD
behaviour is definitely correct and we want it on Linux.
Cheers
Chris
--- filemap.c.old Sun Feb 18 23:35:06 2001
+++ filemap.c Mon Feb 19 00:13:38 2001
@@ -1062,7 +1062,7 @@
for (;;) {
struct page *page, **hash;
- unsigned long end_index, nr;
+ unsigned long end_index, nr, actor_ret;
end_index = inode->i_size >> PAGE_CACHE_SHIFT;
if (index > end_index)
@@ -1110,13 +1110,13 @@
* "pos" here (the actor routine has to update the user
buffer
* pointers and the remaining count).
*/
- nr = actor(desc, page, offset, nr);
- offset += nr;
+ actor_ret = actor(desc, page, offset, nr);
+ offset += actor_ret;
index += offset >> PAGE_CACHE_SHIFT;
offset &= ~PAGE_CACHE_MASK;
page_cache_release(page);
- if (nr && desc->count)
+ if (actor_ret == nr && desc->count)
continue;
break;
On Mon, 19 Feb 2001, Chris Evans wrote:
> > BTW, if you have enough fast network, you probably can observe
> > that sendfile() is even not interrupted by signals. 8) But this
> > is possible to fix at least. BTW the same fix will repair SO_*TIMEO
> > partially, i.e. it will timeout after n*timeo, where n is an arbitrary
> > number not exceeding size/sndbuf.
>
> Hi Alexey,
>
> You are right - our sendfile() implementation is broken. I have fixed it
> (patch at end of mail).
Actually the whole mess stems from our broken internal ->write() and
->read() APIs.
The _single_ return value is trying to convery _two_ pieces of information
- always a bad move. They are:
1) Success/failure (and error code if it's a failure)
2) Amount of bytes read or written
This bogon does not allow for the following information to be returned
(assume I asked for 8192 bytes to be written):
"4096 bytes were written, and the operation was aborted due to EINTR"
Cheers
Chris
Hello!
> You are right - our sendfile() implementation is broken. I have fixed it
Thank you!
> Investigation shows that the Linux network layer is behaving oddly. It
> seems that we are writing 4096 bytes to a socket. This proceeds in 4096
> byte chunks until the send buffer on the socket is full, and a 4096 byte
> write blocks. This blocking write is eventually interrupted by the
> timeout, and the write call returns.. wait for it.. 4096! This suggests
> there was socket space after all, and the call should not have blocked.
Wakeup does not happen until _enough_ (1/3 of snbuf) of space in sndbuf
is released, otherwise you will overschedule. So, as soon as
write() goes to sleep, it will sleep waiting until 1/3 is released.
If it is interrupted, it use all the released space immediately before exit.
Again, to make more for in this context. This can be even wrong and, probably,
we should return instantly with -EAGAIN/-EINTR/partial count, but it is most
likely suboptimal (though I have already changed this to instant return).
But this does not look essential from caller's viewpoint, except for
sendfile() of course. 8)
Alexey
On Mon, 19 Feb 2001 [email protected] wrote:
> Wakeup does not happen until _enough_ (1/3 of snbuf) of space in sndbuf
> is released, otherwise you will overschedule. So, as soon as
> write() goes to sleep, it will sleep waiting until 1/3 is released.
Of course. Thank you.
> If it is interrupted, it use all the released space immediately before
> exit. Again, to make more for in this context. This can be even wrong
> and, probably, we should return instantly with -EAGAIN/-EINTR/partial
> count, but it is most likely suboptimal (though I have already changed
> this to instant return). But this does not look essential from
> caller's viewpoint, except for sendfile() of course. 8)
Cool.
I think the proper fix, long term, is to fix our internal I/O routine APIs
so that they are capable of returning a byte count _and_ an error. One
day, that might be a useful thing to export to userspace.
Cheers
Chris
[email protected] said:
> I think the proper fix, long term, is to fix our internal I/O routine
> APIs so that they are capable of returning a byte count _and_ an
> error. One day, that might be a useful thing to export to userspace.
The MTD code already does this.
--
dwmw2