2004-10-06 14:52:31

by Joris van Rantwijk

[permalink] [raw]
Subject: UDP recvmsg blocks after select(), 2.6 bug?

Hello,

I have a problem where the sequence of events is as follows:
- application does select() on a UDP socket descriptor
- select returns success with descriptor ready for reading
- application does recvfrom() on this descriptor and this recvfrom()
blocks forever

My understanding of POSIX is limited, but it seems to me that a read call
must never block after select just said that it's ok to read from the
descriptor. So any such behaviour would be a kernel bug.

This problem occurs repeatedly, but only once per week on average so it is
hard to debug but definitely a real problem. I know for a fact that the
sequence of events is as described above from strace output. My kernel
version is 2.6.7.

>From a brief look at the kernel UDP code, I suspect a problem in
net/ipv4/udp.c, udp_recvmsg(): it reads the first available datagram
from the queue, then checks the UDP checksum. If the UDP checksum fails at
this point, the datagram is discarded and the process blocks until the next
datagram arrives.

Could someone please help me track this problem?
Am I correct in my reasoning that the select() -> recvmsg() sequence must
never block?
If yes, is it possible that this problem is triggered by a failed UDP
checksum in the udp_recvmsg() function?
If yes, can we do something to fix this?

Thanks,
Joris.


2004-10-06 15:01:45

by David Miller

[permalink] [raw]
Subject: Re: UDP recvmsg blocks after select(), 2.6 bug?

On Wed, 6 Oct 2004 16:52:27 +0200 (CEST)
Joris van Rantwijk <[email protected]> wrote:

> My understanding of POSIX is limited, but it seems to me that a read call
> must never block after select just said that it's ok to read from the
> descriptor. So any such behaviour would be a kernel bug.

There is no such guarentee.

2004-10-06 15:10:27

by Richard B. Johnson

[permalink] [raw]
Subject: Re: UDP recvmsg blocks after select(), 2.6 bug?

On Wed, 6 Oct 2004, Joris van Rantwijk wrote:

> Hello,
>
> I have a problem where the sequence of events is as follows:
> - application does select() on a UDP socket descriptor
> - select returns success with descriptor ready for reading
> - application does recvfrom() on this descriptor and this recvfrom()
> blocks forever
>
> My understanding of POSIX is limited, but it seems to me that a read call
> must never block after select just said that it's ok to read from the
> descriptor. So any such behaviour would be a kernel bug.
>

Can you check to see if you have an exception at the same time?
Also, please make sure that the first parameter to select() is
the file-descriptor value + 1. There are things like out-of-band
data that could show up ( MSG_OOB ) under select(). Also, recvfom()
takes a lot of parameters that need to be correct. There is a buffer
length plus a pointer to a socklen_t variable. I've seen people
mess these up and have everything "work" except sometimes...

Cheers,
Dick Johnson
Penguin : Linux version 2.6.5-1.358-noreg on an i686 machine (5537.79 BogoMips).
Note 96.31% of all statistics are fiction.

2004-10-06 15:14:21

by Chris Friesen

[permalink] [raw]
Subject: Re: UDP recvmsg blocks after select(), 2.6 bug?

David S. Miller wrote:
> On Wed, 6 Oct 2004 16:52:27 +0200 (CEST)
> Joris van Rantwijk <[email protected]> wrote:
>
>
>>My understanding of POSIX is limited, but it seems to me that a read call
>>must never block after select just said that it's ok to read from the
>>descriptor. So any such behaviour would be a kernel bug.
>
>
> There is no such guarentee.

Hmm, the man page for select() says:

"Those listed in readfds will be watched to see if characters become available
for reading (more precisely, to see if a read will not block"

Maybe it needs changing?

Chris

2004-10-06 15:18:24

by bert hubert

[permalink] [raw]
Subject: Re: UDP recvmsg blocks after select(), 2.6 bug?

On Wed, Oct 06, 2004 at 04:52:27PM +0200, Joris van Rantwijk wrote:
> Hello,
>
> I have a problem where the sequence of events is as follows:
> - application does select() on a UDP socket descriptor
> - select returns success with descriptor ready for reading
> - application does recvfrom() on this descriptor and this recvfrom()
> blocks forever

This can happen, and is fully to be expected. For a host of reasons the
packet might not in fact appear. Whenever using select for non-blocking IO
always set your sockets to non-blocking as well.

One of the legitimate reasons is the reception of packets which, on copying,
turn out to have a bad checksum.

Stevens has a section on this, recommended reading.

--
http://www.PowerDNS.com Open source, database driven DNS Software
http://lartc.org Linux Advanced Routing & Traffic Control HOWTO

2004-10-06 15:17:20

by Richard B. Johnson

[permalink] [raw]
Subject: Re: UDP recvmsg blocks after select(), 2.6 bug?

On Wed, 6 Oct 2004, David S. Miller wrote:

> On Wed, 6 Oct 2004 16:52:27 +0200 (CEST)
> Joris van Rantwijk <[email protected]> wrote:
>
>> My understanding of POSIX is limited, but it seems to me that a read call
>> must never block after select just said that it's ok to read from the
>> descriptor. So any such behaviour would be a kernel bug.
>
> There is no such guarentee.

Huh? Then why would anybody use select()? It can't return a
'guess" or it's broken. When select() or poll() claims that
there are data available, there damn well better be data available
or software becomes a crap-game. And, if you've decided that
such a game-of-chance is okay then please keep your software
out of the new Ethernet Avionics Buses that Airbus now is using.

Cheers,
Dick Johnson
Penguin : Linux version 2.6.5-1.358-noreg on an i686 machine (5537.79 BogoMips).
Note 96.31% of all statistics are fiction.

2004-10-06 15:23:17

by David Miller

[permalink] [raw]
Subject: Re: UDP recvmsg blocks after select(), 2.6 bug?

On Wed, 6 Oct 2004 11:15:13 -0400 (EDT)
"Richard B. Johnson" <[email protected]> wrote:

> On Wed, 6 Oct 2004, David S. Miller wrote:
>
> > On Wed, 6 Oct 2004 16:52:27 +0200 (CEST)
> > Joris van Rantwijk <[email protected]> wrote:
> >
> >> My understanding of POSIX is limited, but it seems to me that a read call
> >> must never block after select just said that it's ok to read from the
> >> descriptor. So any such behaviour would be a kernel bug.
> >
> > There is no such guarentee.
>
> Huh? Then why would anybody use select()? It can't return a
> 'guess" or it's broken. When select() or poll() claims that
> there are data available, there damn well better be data available
> or software becomes a crap-game.

So if select returns true, and another one of your threads
reads all the data from the file descriptor, what would you
like the behavior to be for the current thread when it calls
read?

So like I said, there is no such guarentee.

2004-10-06 15:31:04

by Richard B. Johnson

[permalink] [raw]
Subject: Re: UDP recvmsg blocks after select(), 2.6 bug?

On Wed, 6 Oct 2004, David S. Miller wrote:

> On Wed, 6 Oct 2004 11:15:13 -0400 (EDT)
> "Richard B. Johnson" <[email protected]> wrote:
>
>> On Wed, 6 Oct 2004, David S. Miller wrote:
>>
>>> On Wed, 6 Oct 2004 16:52:27 +0200 (CEST)
>>> Joris van Rantwijk <[email protected]> wrote:
>>>
>>>> My understanding of POSIX is limited, but it seems to me that a read call
>>>> must never block after select just said that it's ok to read from the
>>>> descriptor. So any such behaviour would be a kernel bug.
>>>
>>> There is no such guarentee.
>>
>> Huh? Then why would anybody use select()? It can't return a
>> 'guess" or it's broken. When select() or poll() claims that
>> there are data available, there damn well better be data available
>> or software becomes a crap-game.
>
> So if select returns true, and another one of your threads
> reads all the data from the file descriptor, what would you
> like the behavior to be for the current thread when it calls
> read?
>
> So like I said, there is no such guarentee.
>

Any code that uses select() on the same file-descriptor
for several threads is broken. You can't explain away
a select() problem with a bad-coding example. Somebody
else responded that a bad checksum could do the same
thing --not. Select must return correct information.
The user-code doesn't know about, doesn't care, and
in many cases can't find out about, the inner workings
of an operating system.

When a function call like select() says there are data
available, there must be data available, period. If
not, it's broken and needs to be fixed. Requiring
one to set sockets to non-blocking is a poor work-
around for an otherwise fatal flaw.

Cheers,
Dick Johnson
Penguin : Linux version 2.6.5-1.358-noreg on an i686 machine (5537.79 BogoMips).
Note 96.31% of all statistics are fiction.

2004-10-06 15:37:51

by Chris Friesen

[permalink] [raw]
Subject: Re: UDP recvmsg blocks after select(), 2.6 bug?

David S. Miller wrote:

> So if select returns true, and another one of your threads
> reads all the data from the file descriptor, what would you
> like the behavior to be for the current thread when it calls
> read?

What about the single-threaded case?

Chris

2004-10-06 15:38:50

by Martijn Sipkema

[permalink] [raw]
Subject: Re: UDP recvmsg blocks after select(), 2.6 bug?

From: "David S. Miller" <[email protected]>
Sent: Wednesday, October 06, 2004 16:21
> On Wed, 6 Oct 2004 11:15:13 -0400 (EDT)
> "Richard B. Johnson" <[email protected]> wrote:
>
> > On Wed, 6 Oct 2004, David S. Miller wrote:
> >
> > > On Wed, 6 Oct 2004 16:52:27 +0200 (CEST)
> > > Joris van Rantwijk <[email protected]> wrote:
> > >
> > >> My understanding of POSIX is limited, but it seems to me that a read call
> > >> must never block after select just said that it's ok to read from the
> > >> descriptor. So any such behaviour would be a kernel bug.
> > >
> > > There is no such guarentee.
> >
> > Huh? Then why would anybody use select()? It can't return a
> > 'guess" or it's broken. When select() or poll() claims that
> > there are data available, there damn well better be data available
> > or software becomes a crap-game.
>
> So if select returns true, and another one of your threads
> reads all the data from the file descriptor, what would you
> like the behavior to be for the current thread when it calls
> read?
>
> So like I said, there is no such guarentee.

Perhaps you should have elaborated then, because this is obviously
not what was meant.

--ms


2004-10-06 15:42:39

by Chris Friesen

[permalink] [raw]
Subject: Re: UDP recvmsg blocks after select(), 2.6 bug?

Richard B. Johnson wrote:
> On Wed, 6 Oct 2004, David S. Miller wrote:

>> There is no such guarentee.
> Huh? Then why would anybody use select()?

To tell you when to try a nonblocking read?

> It can't return a
> 'guess" or it's broken. When select() or poll() claims that
> there are data available, there damn well better be data available
> or software becomes a crap-game.

In the single-threaded case, where you are the only one touching the socket, I
would expect this to be true. But since I'm a belt-and-suspenders kind of guy,
I usually use nonblocking reads anyway.

Chris

2004-10-06 15:45:59

by Lars Marowsky-Bree

[permalink] [raw]
Subject: Re: UDP recvmsg blocks after select(), 2.6 bug?

On 2004-10-06T11:29:22, "Richard B. Johnson" <[email protected]> wrote:

> Any code that uses select() on the same file-descriptor
> for several threads is broken. You can't explain away
> a select() problem with a bad-coding example.

Any code which expects non-blocking behaviour without O_NONBLOCK is
broken.

Go away.


Sincerely,
Lars Marowsky-Br?e <[email protected]>

--
High Availability & Clustering
SUSE Labs, Research and Development
SUSE LINUX AG - A Novell company

2004-10-06 15:45:59

by David Miller

[permalink] [raw]
Subject: Re: UDP recvmsg blocks after select(), 2.6 bug?

On Wed, 6 Oct 2004 11:29:22 -0400 (EDT)
"Richard B. Johnson" <[email protected]> wrote:

> Somebody else responded that a bad checksum could do the same
> thing --not. Select must return correct information.

Guess what, our UDP implementation does exactly that
and has done so for years. It's perfectly fine.

2004-10-06 15:45:31

by David Miller

[permalink] [raw]
Subject: Re: UDP recvmsg blocks after select(), 2.6 bug?

On Wed, 06 Oct 2004 09:31:46 -0600
Chris Friesen <[email protected]> wrote:

> David S. Miller wrote:
>
> > So if select returns true, and another one of your threads
> > reads all the data from the file descriptor, what would you
> > like the behavior to be for the current thread when it calls
> > read?
>
> What about the single-threaded case?

Incorrect UDP checksums could cause the read data to
be discarded. We do the copy into userspace and checksum
computation in parallel. This is totally legal and we've
been doing it since 2.4.x first got released.

Use non-blocking sockets with select()/poll() and be happy.

2004-10-06 15:58:35

by Chris Friesen

[permalink] [raw]
Subject: Re: UDP recvmsg blocks after select(), 2.6 bug?

David S. Miller wrote:
> "Richard B. Johnson" <[email protected]> wrote:

>>Somebody else responded that a bad checksum could do the same
>>thing --not. Select must return correct information.

> Guess what, our UDP implementation does exactly that
> and has done so for years. It's perfectly fine.

We may want change the man page for select() so that this is made clear.

Chris

2004-10-06 16:01:25

by Paul Jackson

[permalink] [raw]
Subject: Re: UDP recvmsg blocks after select(), 2.6 bug?

David wrote:
> So like I said, there is no such guarentee.

The select(2) man page states:

more precisely, to see if a read will not block

It doesn't say _which_ read won't block. Seems to me that the
successful non-block read in the other thread qualifies as the
promised non-blocking read.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373

2004-10-06 16:07:59

by Richard B. Johnson

[permalink] [raw]
Subject: Re: UDP recvmsg blocks after select(), 2.6 bug?

On Wed, 6 Oct 2004, David S. Miller wrote:

> On Wed, 06 Oct 2004 09:31:46 -0600
> Chris Friesen <[email protected]> wrote:
>
>> David S. Miller wrote:
>>
>>> So if select returns true, and another one of your threads
>>> reads all the data from the file descriptor, what would you
>>> like the behavior to be for the current thread when it calls
>>> read?
>>
>> What about the single-threaded case?
>
> Incorrect UDP checksums could cause the read data to
> be discarded. We do the copy into userspace and checksum
> computation in parallel. This is totally legal and we've
> been doing it since 2.4.x first got released.
>
> Use non-blocking sockets with select()/poll() and be happy.

Gawd. This is damn awful. How could this possibly be justified?
You can't have a system call that lies. We already have an OS
that does that. Certainly, no other Unix OS in the past has
thrown away integrity with such aplomb. Next, in the interest
of "performance", you'll probably only occasionally provide
file-data, as well.

You can't do this. When there is some well-defined procedure
such as select() or poll(), that is designed to provide a
reliable way of knowing that a read will succeed, you or
anybody else don't have the authority to declare that it's
not important to actually have it work.

Cheers,
Dick Johnson
Penguin : Linux version 2.6.5-1.358-noreg on an i686 machine (5537.79 BogoMips).
Note 96.31% of all statistics are fiction.

2004-10-06 16:46:23

by Dan Kegel

[permalink] [raw]
Subject: Re: UDP recvmsg blocks after select(), 2.6 bug?

David S. Miller wrote:
> Incorrect UDP checksums could cause the read data to
> be discarded. We do the copy into userspace and checksum
> computation in parallel. This is totally legal and we've
> been doing it since 2.4.x first got released.

Might there be a similar effect for packets with bad IP or TCP checksums?
(http://citeseer.ist.psu.edu/stone00when.html)

And as Bert says, Stevens mentions that with TCP accepts,
the other side might close before you call accept.

BTW this is why I insisted in JSR-51 that Java's NIO not allow
the use of Selector with blocking sockets:

http://java.sun.com/j2se/1.4.2/docs/api/java/nio/channels/SelectableChannel.html
"A channel must be placed into non-blocking mode before being
registered with a selector, and may not be returned to blocking mode until it has been deregistered."

So at least Java programs that use NIO are immune to this
particular user error (unless your implementation of NIO
relaxes this sanity check, tsk).

- Dan

--
My technical stuff: http://kegel.com
My politics: see http://www.misleader.org for examples of why I'm for regime change

2004-10-06 17:03:44

by Neil Horman

[permalink] [raw]
Subject: Re: UDP recvmsg blocks after select(), 2.6 bug?

David S. Miller wrote:
> On Wed, 06 Oct 2004 09:31:46 -0600
> Chris Friesen <[email protected]> wrote:
>
>
>>David S. Miller wrote:
>>
>>
>>>So if select returns true, and another one of your threads
>>>reads all the data from the file descriptor, what would you
>>>like the behavior to be for the current thread when it calls
>>>read?
>>
>>What about the single-threaded case?
>
>
> Incorrect UDP checksums could cause the read data to
> be discarded. We do the copy into userspace and checksum
> computation in parallel. This is totally legal and we've
> been doing it since 2.4.x first got released.
>
> Use non-blocking sockets with select()/poll() and be happy.


I think you could also pass the MSG_ERRQUEUE flag to the recvfrom call
and receive the errored frame, eliminating the case where errored frames
might cause you to block on a read after a good return from a select call.
Neil
--
/***************************************************
*Neil Horman
*Software Engineer
*Red Hat, Inc.
*[email protected]
*gpg keyid: 1024D / 0x92A74FA1
*http://pgp.mit.edu
***************************************************/

2004-10-06 17:44:00

by Alan

[permalink] [raw]
Subject: Re: UDP recvmsg blocks after select(), 2.6 bug?

On Mer, 2004-10-06 at 15:52, Joris van Rantwijk wrote:
> My understanding of POSIX is limited, but it seems to me that a read call
> must never block after select just said that it's ok to read from the
> descriptor. So any such behaviour would be a kernel bug.

Select indicates there may be data. That is all - it might also be an
error, it might turn out to be wrong.

You should always combine select with nonblocking I/O

2004-10-06 18:04:38

by Joris van Rantwijk

[permalink] [raw]
Subject: Re: UDP recvmsg blocks after select(), 2.6 bug?

Hello,

Many thanks to you everybody else for their helpfull comments.

On Wed, 6 Oct 2004, Alan Cox wrote:
> On Mer, 2004-10-06 at 15:52, Joris van Rantwijk wrote:
> > My understanding of POSIX is limited, but it seems to me that a read call
> > must never block after select just said that it's ok to read from the
> > descriptor. So any such behaviour would be a kernel bug.
>
> Select indicates there may be data. That is all - it might also be an
> error, it might turn out to be wrong.
>
> You should always combine select with nonblocking I/O

Ok, thanks. It turns out now that I (and a lot of people with me) have
always been wrong about this. I will go fix the application (dnsmasq) and
try to get the fix to the author.

Sorry about my wrongly blaming the kernel. I do think this issue shows hat
the select(2) manual needs fixing.

For clarity, I'd like to point out that my case has nothing to do with
multi-threading. Using select from multiple threads is a totally different
sort of mistake.

Bye,
Joris.

2004-10-06 19:33:24

by Andries Brouwer

[permalink] [raw]
Subject: Re: UDP recvmsg blocks after select(), 2.6 bug?

On Wed, Oct 06, 2004 at 08:04:29PM +0200, Joris van Rantwijk wrote:

> > On Mer, 2004-10-06 at 15:52, Joris van Rantwijk wrote:
> > > My understanding of POSIX is limited, but it seems to me that a read call
> > > must never block after select just said that it's ok to read from the
> > > descriptor. So any such behaviour would be a kernel bug.

Alan answers - and I don't like his answer a bit:

> > Select indicates there may be data. That is all - it might also be an
> > error, it might turn out to be wrong.
> >
> > You should always combine select with nonblocking I/O

Joris replies again:

> Sorry about my wrongly blaming the kernel. I do think this issue shows hat
> the select(2) manual needs fixing.

It may need fixing in the sense that it must point out that the Linux kernel
might not conform to POSIX in its handling of select on sockets.

We now not only have "man 2 select", but also "man 3p select".
This is the POSIX text:

A descriptor shall be considered ready for reading when a
call to an input function with O_NONBLOCK clear would not
block, whether or not the function would transfer data
successfully. (The function might return data, an end-of-
file indication, or an error other than one indicating
that it is blocked, and in each of these cases the
descriptor shall be considered ready for reading.)

As far as I can interpret these sentences, Linux does not conform.

Andries


Neil Horman wrote:

>> I think you could also pass the MSG_ERRQUEUE flag to the recvfrom call
>> and receive the errored frame, eliminating the case where errored frames
>> might cause you to block on a read after a good return from a select call.

davem wrote:

>> There is no such guarentee.

>> Incorrect UDP checksums could cause the read data to
>> be discarded. We do the copy into userspace and checksum
>> computation in parallel. This is totally legal and we've
>> been doing it since 2.4.x first got released.


ahu wrote:

>> This can happen, and is fully to be expected. For a host of reasons the
>> packet might not in fact appear. Whenever using select for non-blocking IO
>> always set your sockets to non-blocking as well.

>> One of the legitimate reasons is the reception of packets which, on copying,
>> turn out to have a bad checksum.

>> Stevens has a section on this, recommended reading.

Reference?

2004-10-06 19:43:37

by Hua Zhong (hzhong)

[permalink] [raw]
Subject: RE: UDP recvmsg blocks after select(), 2.6 bug?

> It may need fixing in the sense that it must point out that
> the Linux kernel
> might not conform to POSIX in its handling of select on sockets.

Agreed.

> We now not only have "man 2 select", but also "man 3p select".
> This is the POSIX text:
>
> A descriptor shall be considered ready for reading when a
> call to an input function with O_NONBLOCK clear would not
> block, whether or not the function would transfer data
> successfully. (The function might return data, an end-of-
> file indication, or an error other than one indicating
> that it is blocked, and in each of these cases the
> descriptor shall be considered ready for reading.)
>
> As far as I can interpret these sentences, Linux does not conform.

How hard is it to treat the next read to the fd as NON_BLOCKING, even if
it's not set?

> Andries

2004-10-06 19:59:10

by Chris Friesen

[permalink] [raw]
Subject: Re: UDP recvmsg blocks after select(), 2.6 bug?

Hua Zhong wrote:

> How hard is it to treat the next read to the fd as NON_BLOCKING, even if
> it's not set?

Userspace likely would not properly handle EAGAIN on a nonblocking socket.

As far as I can tell, either you block, or you have to scan the checksum before
select() returns.

Would it be so bad to do the checksum before marking the socket readable?
Chances are we're going to receive the message "soon" anyways, so there is at
least a chance it will stay hot in the cache, no?

Chris

2004-10-06 20:03:03

by Hua Zhong (hzhong)

[permalink] [raw]
Subject: RE: UDP recvmsg blocks after select(), 2.6 bug?

> Hua Zhong wrote:
>
> > How hard is it to treat the next read to the fd as
> NON_BLOCKING, even if
> > it's not set?
>
> Userspace likely would not properly handle EAGAIN on a
> nonblocking socket.

But it's better than blocking the call, isn't it?

If the caller is using NON_BLOCKING already, no change in behavior,
otherwise it returns an error which the app may or may not handle, instead
of blocking it (which is usually fatal). Plus it hopefully gives Posix
compliance.

I can see there could be remote DoS attacks by just sending malformed UDP
packets.

> As far as I can tell, either you block, or you have to scan
> the checksum before
> select() returns.
>
> Would it be so bad to do the checksum before marking the
> socket readable?
> Chances are we're going to receive the message "soon"
> anyways, so there is at
> least a chance it will stay hot in the cache, no?
>
> Chris
>

2004-10-06 20:11:46

by David Miller

[permalink] [raw]
Subject: Re: UDP recvmsg blocks after select(), 2.6 bug?

On Wed, 06 Oct 2004 13:54:46 -0600
Chris Friesen <[email protected]> wrote:

> Would it be so bad to do the checksum before marking the socket readable?

Yes, because if we do that we have to make two passes over the
data instead of one. It does make a big difference.

2004-10-06 20:16:24

by Chris Friesen

[permalink] [raw]
Subject: Re: UDP recvmsg blocks after select(), 2.6 bug?

Hua Zhong wrote:

>>>How hard is it to treat the next read to the fd as
>>NON_BLOCKING, even if it's not set?
>>
>>Userspace likely would not properly handle EAGAIN on a
>>nonblocking socket.

I meant blocking, of course, but you caught that.

> But it's better than blocking the call, isn't it?
>
> If the caller is using NON_BLOCKING already, no change in behavior,
> otherwise it returns an error which the app may or may not handle, instead
> of blocking it (which is usually fatal). Plus it hopefully gives Posix
> compliance.

From what Andries posted, we can't block. If select says its readable, we can
"return data, an end-of-file indication, or an error other than one
indicating that it is blocked".

We have no data, network sockets don't have end-of-file indication (or would
returning a length of zero count?), and there is no other suitable errno that I saw.

> I can see there could be remote DoS attacks by just sending malformed UDP
> packets.

Yep.

Chris

2004-10-06 20:27:07

by Olivier Galibert

[permalink] [raw]
Subject: Re: UDP recvmsg blocks after select(), 2.6 bug?

On Wed, Oct 06, 2004 at 12:43:27PM -0700, Hua Zhong wrote:
> How hard is it to treat the next read to the fd as NON_BLOCKING, even if
> it's not set?

Programs don't expect EAGAIN from blocking sockets.

OG.

2004-10-06 20:31:10

by Alan

[permalink] [raw]
Subject: Re: UDP recvmsg blocks after select(), 2.6 bug?

On Mer, 2004-10-06 at 20:30, Andries Brouwer wrote:
> A descriptor shall be considered ready for reading when a
> call to an input function with O_NONBLOCK clear would not
> block, whether or not the function would transfer data
> successfully. (The function might return data, an end-of-
> file indication, or an error other than one indicating
> that it is blocked, and in each of these cases the
> descriptor shall be considered ready for reading.)
>
> As far as I can interpret these sentences, Linux does not conform.

Nor does anything else in that case. I guess we need a POSIX_ME_HARDER
socket option.

As to the Stevens reference - Stevens says nothing about read but does
mention the problem of accept, which is one of the "can't fix" type
examples.

Connection setup pending
select returns
Connection destroyed
accept blocks

Alan

2004-10-06 20:31:08

by Hua Zhong (hzhong)

[permalink] [raw]
Subject: RE: UDP recvmsg blocks after select(), 2.6 bug?

> From what Andries posted, we can't block. If select says
> its readable, we can "return data, an end-of-file indication,
> or an error other than one indicating that it is blocked".

Arrrrgh..I misread it as "or an error indicating that it is blocked"..

So treating it simply as NON_BLOCKING isn't right.

> Hmm...no easy solution then.
>
> In any case, the current behaviour is not compliant with the
> POSIX text that Andries posted. Perhaps this should be
> documented somewhere?
> Alternately, how about having the recvmsg() call return a
> zero, and (if appropriate) the length of the name set to zero? This
> appears to comply with the man page for recvmsg().

If it's a read, returning zero means end-of-file. Not sure what it means
when recvmsg returns zero..But is this just the problem of UDP or
recvmsg? I doubt it.

So I guess the easiest solution is just admit Linux select
isn't posix compliant.

> Chris

2004-10-06 20:49:37

by Andries Brouwer

[permalink] [raw]
Subject: Re: UDP recvmsg blocks after select(), 2.6 bug?

On Wed, Oct 06, 2004 at 02:18:23PM -0600, Chris Friesen wrote:

> In any case, the current behaviour is not compliant with the POSIX text
> that Andries posted. Perhaps this should be documented somewhere?

For the time being I wrote (in select.2)

BUGS
It has been reported (Linux 2.6) that select may report a
socket file descriptor as "ready for reading", while nev-
ertheless a subsequent read blocks. This could perhaps
happen when data has arrived but upon examination has
wrong checksum and is discarded. Thus it may be safer to
use non-blocking I/O.

(I have not yet investigated, just read the lk posts. Does this
really happen? All kernel versions? Is this the explanation for
the reported behaviour?)

> Alternately, how about having the recvmsg() call return a zero, and (if
> appropriate) the length of the name set to zero? This appears to comply
> with the man page for recvmsg().

Returning 0 for a read signifies end-of-file. Not what you want.

Andries

2004-10-06 20:58:40

by Joris van Rantwijk

[permalink] [raw]
Subject: Re: UDP recvmsg blocks after select(), 2.6 bug?


On Wed, 6 Oct 2004, Andries Brouwer wrote:
> Does this really happen?

Yes. Finally got my raw-udp-with-wrong-checksum sending program to work
over localhost and it hangs recvfrom pretty good.

> All kernel versions?

Quick guess: probably since late 2.4. Source of 2.4.27 udp.c is similar to
2.6.9, but 2.4.17 returns EAGAIN even for blocking sockets, apparently
this was "fixed" later on.

> Is this the explanation for the reported behaviour?)

Very hard to say since I can't predict when it will occur again.

Joris.

2004-10-06 21:11:55

by Martijn Sipkema

[permalink] [raw]
Subject: Re: UDP recvmsg blocks after select(), 2.6 bug?

From: "Alan Cox" <[email protected]>
Sent: Wednesday, October 06, 2004 20:23
> On Mer, 2004-10-06 at 20:30, Andries Brouwer wrote:
> > A descriptor shall be considered ready for reading when a
> > call to an input function with O_NONBLOCK clear would not
> > block, whether or not the function would transfer data
> > successfully. (The function might return data, an end-of-
> > file indication, or an error other than one indicating
> > that it is blocked, and in each of these cases the
> > descriptor shall be considered ready for reading.)
> >
> > As far as I can interpret these sentences, Linux does not conform.
>
> Nor does anything else in that case. I guess we need a POSIX_ME_HARDER
> socket option.

The default should be a POSIX compliant socket IMHO; a POSIX_ME_NOT
option could provide better performance.

--ms


2004-10-06 21:20:34

by Martijn Sipkema

[permalink] [raw]
Subject: Re: UDP recvmsg blocks after select(), 2.6 bug?

From: "Chris Friesen" <[email protected]>
Sent: Wednesday, October 06, 2004 21:10
[...]
> From what Andries posted, we can't block. If select says its readable, we can
> "return data, an end-of-file indication, or an error other than one
> indicating that it is blocked".
>
> We have no data, network sockets don't have end-of-file indication (or would
> returning a length of zero count?), and there is no other suitable errno that I saw.

The current behavious is definately not POSIX compliant; returning an error
is better in any case and that error does not necessarily have to be listed in the
ERRORS section of the recvmsg() function description in the standard.
Returning EIO would be an option and is listed as with the errors in POSIX.

--ms


2004-10-06 21:20:33

by Neil Horman

[permalink] [raw]
Subject: Re: UDP recvmsg blocks after select(), 2.6 bug?

Hua Zhong wrote:
>>It may need fixing in the sense that it must point out that
>>the Linux kernel
>>might not conform to POSIX in its handling of select on sockets.
>
>
> Agreed.
>
>
>>We now not only have "man 2 select", but also "man 3p select".
>>This is the POSIX text:
>>
>> A descriptor shall be considered ready for reading when a
>> call to an input function with O_NONBLOCK clear would not
>> block, whether or not the function would transfer data
>> successfully. (The function might return data, an end-of-
>> file indication, or an error other than one indicating
>> that it is blocked, and in each of these cases the
>> descriptor shall be considered ready for reading.)
>>
>>As far as I can interpret these sentences, Linux does not conform.
>

Again, shouldn't this just mean that recvfrom should not be called
without the MSG_ERRQUEUE flag set? From the above description, I read
that select returning with an indication that a descriptor is ready for
reading could mean that it has an error message queued to it, rather
than in-band data. If you then call recvfrom/recv/recvmsg without
indicating that you want to receive the error indications as well, then
isn't that just another example of improper coding, since recvfrom would
have returned immeidately, as select indicated, had the appropriate read
flags been set?

Neil
>
> How hard is it to treat the next read to the fd as NON_BLOCKING, even if
> it's not set?
>
>
>>Andries
>
>
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/


--
/***************************************************
*Neil Horman
*Software Engineer
*Red Hat, Inc.
*[email protected]
*gpg keyid: 1024D / 0x92A74FA1
*http://pgp.mit.edu
***************************************************/

2004-10-06 21:32:34

by Alan

[permalink] [raw]
Subject: Re: UDP recvmsg blocks after select(), 2.6 bug?

On Mer, 2004-10-06 at 23:08, Martijn Sipkema wrote:
> > Nor does anything else in that case. I guess we need a POSIX_ME_HARDER
> > socket option.
>
> The default should be a POSIX compliant socket IMHO; a POSIX_ME_NOT
> option could provide better performance.

The current setup has so far been found to break one app, after what
three years. It can almost double performance. In this case it is very
much POSIX_ME_HARDER, and perhaps longer term suggests the posix/sus
people should revisit their API design.


2004-10-06 20:27:08

by Chris Friesen

[permalink] [raw]
Subject: Re: UDP recvmsg blocks after select(), 2.6 bug?

David S. Miller wrote:
> On Wed, 06 Oct 2004 13:54:46 -0600
> Chris Friesen <[email protected]> wrote:
>
>
>>Would it be so bad to do the checksum before marking the socket readable?
>
>
> Yes, because if we do that we have to make two passes over the
> data instead of one. It does make a big difference.

Hmm...no easy solution then.

In any case, the current behaviour is not compliant with the POSIX text that
Andries posted. Perhaps this should be documented somewhere?

Alternately, how about having the recvmsg() call return a zero, and (if
appropriate) the length of the name set to zero? This appears to comply with
the man page for recvmsg().

Chris

2004-10-06 22:17:04

by Andries Brouwer

[permalink] [raw]
Subject: Re: UDP recvmsg blocks after select(), 2.6 bug?

On Wed, Oct 06, 2004 at 09:25:28PM +0100, Alan Cox wrote:

> The current setup has so far been found to break one app, after what
> three years. It can almost double performance. In this case it is very
> much POSIX_ME_HARDER, and perhaps longer term suggests the posix/sus
> people should revisit their API design.

Maybe. Have we really investigated and concluded that there is no
reasonable way to follow POSIX and not harm performance?

I would hope that checksum failure is not the fast path,
so at zeroth sight, not having looked at the code, it seems
that we could do rather elaborate things on checksum failure
if we wanted to.

One such thing might be to raise a flag "I/O error seen since last read"
where the flag is cleared by read and causes an EIO when there is no
other input.

(There may be many objections - maybe such a setup would break
more user space programs. Or maybe there are more ways select
is broken than just the "discarded because of bad checksum" way.
But it seems too early to just say "too bad, our select is not
the POSIX one".)

Andries

2004-10-06 22:32:16

by Chris Friesen

[permalink] [raw]
Subject: Re: UDP recvmsg blocks after select(), 2.6 bug?

Neil Horman wrote:

> Again, shouldn't this just mean that recvfrom should not be called
> without the MSG_ERRQUEUE flag set?

Does a message with a bad udp checksum even get sent up as a queued error message?

2004-10-06 22:36:55

by David Miller

[permalink] [raw]
Subject: Re: UDP recvmsg blocks after select(), 2.6 bug?

On Wed, 6 Oct 2004 22:38:18 +0200
Andries Brouwer <[email protected]> wrote:

> On Wed, Oct 06, 2004 at 02:18:23PM -0600, Chris Friesen wrote:
>
> > In any case, the current behaviour is not compliant with the POSIX text
> > that Andries posted. Perhaps this should be documented somewhere?
>
> For the time being I wrote (in select.2)
>
> BUGS
> It has been reported (Linux 2.6) that select may report a

2.4.x has identical behavior

2004-10-06 22:45:42

by David Miller

[permalink] [raw]
Subject: Re: UDP recvmsg blocks after select(), 2.6 bug?

On Thu, 7 Oct 2004 00:15:12 +0200
Andries Brouwer <[email protected]> wrote:

> I would hope that checksum failure is not the fast path,
> so at zeroth sight, not having looked at the code, it seems
> that we could do rather elaborate things on checksum failure
> if we wanted to.

The code in question is in net/ipv4/udp.c:udp_recvmsg()

2004-10-06 23:36:04

by YOSHIFUJI Hideaki

[permalink] [raw]
Subject: Re: UDP recvmsg blocks after select(), 2.6 bug?

In article <[email protected]> (at Thu, 7 Oct 2004 00:15:12 +0200), Andries Brouwer <[email protected]> says:

> (There may be many objections - maybe such a setup would break
> more user space programs. Or maybe there are more ways select
> is broken than just the "discarded because of bad checksum" way.
> But it seems too early to just say "too bad, our select is not
> the POSIX one".)

select() != pselect(). :-)

--yoshfuji

2004-10-06 23:40:56

by Neil Horman

[permalink] [raw]
Subject: Re: UDP recvmsg blocks after select(), 2.6 bug?

Chris Friesen wrote:

> Neil Horman wrote:
>
>> Again, shouldn't this just mean that recvfrom should not be called
>> without the MSG_ERRQUEUE flag set?
>
>
> Does a message with a bad udp checksum even get sent up as a queued
> error message?

I thought thats exactly what MSG_ERRQUEUE was for, or am I mistaken?
Neil

--
/***************************************************
*Neil Horman
*Software Engineer
*Red Hat, Inc.
*[email protected]
*gpg keyid: 1024D / 0x92A74FA1
*http://pgp.mit.edu
***************************************************/

2004-10-06 23:40:49

by David Miller

[permalink] [raw]
Subject: Re: UDP recvmsg blocks after select(), 2.6 bug?

On Wed, 6 Oct 2004 22:06:08 +0200
Olivier Galibert <[email protected]> wrote:

> On Wed, Oct 06, 2004 at 12:43:27PM -0700, Hua Zhong wrote:
> > How hard is it to treat the next read to the fd as NON_BLOCKING, even if
> > it's not set?
>
> Programs don't expect EAGAIN from blocking sockets.

That's right, which is why we block instead.

2004-10-06 23:40:56

by David Miller

[permalink] [raw]
Subject: Re: UDP recvmsg blocks after select(), 2.6 bug?

On Wed, 6 Oct 2004 22:45:08 +0100
"Martijn Sipkema" <[email protected]> wrote:

> The current behavious is definately not POSIX compliant; returning an error
> is better in any case

Not true at all. We in fact used to return -EAGAIN if the
checksum failed and the socket was blocking, and Olaf Kirch
pointed that out so we changed it to block instead and wait
for the next packet to arrive instead which is the correct
behavior.

People, get the heck over this. The kernel has behaved this way
for more than 3 years both in 2.4.x and 2.6.x. The code in question
even exists in the 2.2.x sources as well.

Therefore, it would be totally pointless to change the behavior
now since anyone writing an application wishing it to work on
all existing Linux kernels needs to handle this case anyways.

Like stated earlier, use non-blocking sockets with select()/poll()
and be happy.

2004-10-06 23:44:38

by David Miller

[permalink] [raw]
Subject: Re: UDP recvmsg blocks after select(), 2.6 bug?

On Wed, 06 Oct 2004 16:27:11 -0600
Chris Friesen <[email protected]> wrote:

> Neil Horman wrote:
>
> > Again, shouldn't this just mean that recvfrom should not be called
> > without the MSG_ERRQUEUE flag set?
>
> Does a message with a bad udp checksum even get sent up as a queued error message?

No, it doesn't, MSG_ERRQUEUE is used for other things.

2004-10-06 23:19:18

by Willy Tarreau

[permalink] [raw]
Subject: Re: UDP recvmsg blocks after select(), 2.6 bug?

Hi,

On Wed, Oct 06, 2004 at 09:25:28PM +0100, Alan Cox wrote:

> The current setup has so far been found to break one app, after what
> three years. It can almost double performance. In this case it is very
> much POSIX_ME_HARDER, and perhaps longer term suggests the posix/sus
> people should revisit their API design.

Couldn't we simply make recvfrom() return 0 (no data) or -1 (whatever error)
in a case where select() had a reason to believe that there were data, but
that the copy function discovered that it was corrupted data ?

This should not impact performance and would let recvfrom() behave in a
smarter way. After all, I don't see a problem receiving 0 bytes.

Anyway, I'm all for non-blocking I/O, but I can understand the stupidity
of the situation.

Just a few thoughts, of course.
Willy

2004-10-07 00:19:39

by Olivier Galibert

[permalink] [raw]
Subject: Re: UDP recvmsg blocks after select(), 2.6 bug?

On Wed, Oct 06, 2004 at 04:35:21PM -0700, David S. Miller wrote:
> On Wed, 6 Oct 2004 22:06:08 +0200
> Olivier Galibert <[email protected]> wrote:
>
> > On Wed, Oct 06, 2004 at 12:43:27PM -0700, Hua Zhong wrote:
> > > How hard is it to treat the next read to the fd as NON_BLOCKING, even if
> > > it's not set?
> >
> > Programs don't expect EAGAIN from blocking sockets.
>
> That's right, which is why we block instead.

Programs don't expect a read to block after a positive select either,
so it doesn't really help.

OG.

2004-10-07 00:30:46

by David Miller

[permalink] [raw]
Subject: Re: UDP recvmsg blocks after select(), 2.6 bug?

On Thu, 7 Oct 2004 02:19:37 +0200
Olivier Galibert <[email protected]> wrote:

> On Wed, Oct 06, 2004 at 04:35:21PM -0700, David S. Miller wrote:
> > On Wed, 6 Oct 2004 22:06:08 +0200
> > Olivier Galibert <[email protected]> wrote:
> >
> > > On Wed, Oct 06, 2004 at 12:43:27PM -0700, Hua Zhong wrote:
> > > > How hard is it to treat the next read to the fd as NON_BLOCKING, even if
> > > > it's not set?
> > >
> > > Programs don't expect EAGAIN from blocking sockets.
> >
> > That's right, which is why we block instead.
>
> Programs don't expect a read to block after a positive select either,
> so it doesn't really help.

It absolutely does help the programs not using select(), using
blocking sockets, and not expecting -EAGAIN.

2004-10-07 01:17:53

by Paul Jakma

[permalink] [raw]
Subject: Re: UDP recvmsg blocks after select(), 2.6 bug?

On Wed, 6 Oct 2004, Richard B. Johnson wrote:

> thing --not. Select must return correct information.

It does, it's just state that select() reported on changed by the
time user called recvmsg.

> When a function call like select() says there are data available,
> there must be data available, period.

There was, but there wasnt when recvmsg() was called. Time changes
things..

> If not, it's broken and needs to be fixed. Requiring one to set
> sockets to non-blocking is a poor work- around for an otherwise
> fatal flaw.

Any application that expects socket read not to block should set
O_NONBLOCK.

> Cheers,
> Dick Johnson

regards,
--
Paul Jakma [email protected] [email protected] Key ID: 64A2FF6A
Fortune:
Kansas state law requires pedestrians crossing the highways at night to
wear tail lights.

2004-10-07 05:46:24

by Dan Kegel

[permalink] [raw]
Subject: Re: UDP recvmsg blocks after select(), 2.6 bug?

Joris wrote:
> On Wed, 6 Oct 2004, Andries Brouwer wrote:
>> Does this really happen?
>
> Yes. Finally got my raw-udp-with-wrong-checksum sending program to work
> over localhost and it hangs recvfrom pretty good.
>
>> All kernel versions?
>
> Quick guess: probably since late 2.4. Source of 2.4.27 udp.c is similar to
> 2.6.9, but 2.4.17 returns EAGAIN even for blocking sockets, apparently
> this was "fixed" later on.

It would be nice to know how other operating systems behave
when receiving UDP packets with bad checksums. Can someone
try BSD and Solaris?

- Dan

--
My technical stuff: http://kegel.com
My politics: see http://www.misleader.org for examples of why I'm for regime change

2004-10-07 07:12:17

by Chris Friesen

[permalink] [raw]
Subject: Re: UDP recvmsg blocks after select(), 2.6 bug?

Paul Jakma wrote:
> On Wed, 6 Oct 2004, Richard B. Johnson wrote:
>> thing --not. Select must return correct information.
> It does, it's just state that select() reported on changed by the time
> user called recvmsg.

Actually, in the single threaded case, the state did not change. We just didn't
actually check the state before returning from select().

>> When a function call like select() says there are data available,
>> there must be data available, period.

> There was, but there wasnt when recvmsg() was called. Time changes things.

Actually, there wasn't. The data was corrupt, therefore there was no data.
Nothing changed with time, as the corrupt data was already present before we
returned from select().

> Any application that expects socket read not to block should set
> O_NONBLOCK.

POSIX says that if select() says a socket is readable, a read call will not
block. Obviously, we are not POSIX compliant.

There's nothing wrong with not being compliant, but it should be documented and
we shouldn't claim to be compliant.

Chris

2004-10-07 08:08:01

by bert hubert

[permalink] [raw]
Subject: Re: UDP recvmsg blocks after select(), 2.6 bug?

On Wed, Oct 06, 2004 at 09:50:10PM -0700, Dan Kegel wrote:

> It would be nice to know how other operating systems behave
> when receiving UDP packets with bad checksums. Can someone
> try BSD and Solaris?

It does not matter - this behaviour should not be depended upon. There are
lots of other reasons why a packet might in fact not be available, kernels
are allowed to drop UDP packets at will.

--
http://www.PowerDNS.com Open source, database driven DNS Software
http://lartc.org Linux Advanced Routing & Traffic Control HOWTO

2004-10-07 08:28:55

by Adam Heath

[permalink] [raw]
Subject: Re: UDP recvmsg blocks after select(), 2.6 bug?

On Thu, 7 Oct 2004, bert hubert wrote:

> On Wed, Oct 06, 2004 at 09:50:10PM -0700, Dan Kegel wrote:
>
> > It would be nice to know how other operating systems behave
> > when receiving UDP packets with bad checksums. Can someone
> > try BSD and Solaris?
>
> It does not matter - this behaviour should not be depended upon. There are
> lots of other reasons why a packet might in fact not be available, kernels
> are allowed to drop UDP packets at will.

I've been lurking and reading this thread with great interest. I had been
leaning towards thinking the kernel was wrong, until I read this email.

This is a very excellent point.

2004-10-07 09:38:22

by Martijn Sipkema

[permalink] [raw]
Subject: Re: UDP recvmsg blocks after select(), 2.6 bug?

From: "Adam Heath" <[email protected]>
Sent: Thursday, October 07, 2004 09:28
> On Thu, 7 Oct 2004, bert hubert wrote:
>
> > On Wed, Oct 06, 2004 at 09:50:10PM -0700, Dan Kegel wrote:
> >
> > > It would be nice to know how other operating systems behave
> > > when receiving UDP packets with bad checksums. Can someone
> > > try BSD and Solaris?
> >
> > It does not matter - this behaviour should not be depended upon. There are
> > lots of other reasons why a packet might in fact not be available, kernels
> > are allowed to drop UDP packets at will.
>
> I've been lurking and reading this thread with great interest. I had been
> leaning towards thinking the kernel was wrong, until I read this email.
>
> This is a very excellent point.

No, it isn't. If the kernel drops a UDP packet, select() should not return
indicating available data.


--ms

2004-10-07 09:59:48

by Martijn Sipkema

[permalink] [raw]
Subject: Re: UDP recvmsg blocks after select(), 2.6 bug?

From: "David S. Miller" <[email protected]>
Sent: Thursday, October 07, 2004 01:29

> On Thu, 7 Oct 2004 02:19:37 +0200
> Olivier Galibert <[email protected]> wrote:
>
> > On Wed, Oct 06, 2004 at 04:35:21PM -0700, David S. Miller wrote:
> > > On Wed, 6 Oct 2004 22:06:08 +0200
> > > Olivier Galibert <[email protected]> wrote:
> > >
> > > > On Wed, Oct 06, 2004 at 12:43:27PM -0700, Hua Zhong wrote:
> > > > > How hard is it to treat the next read to the fd as NON_BLOCKING, even if
> > > > > it's not set?
> > > >
> > > > Programs don't expect EAGAIN from blocking sockets.
> > >
> > > That's right, which is why we block instead.
> >
> > Programs don't expect a read to block after a positive select either,
> > so it doesn't really help.
>
> It absolutely does help the programs not using select(), using
> blocking sockets, and not expecting -EAGAIN.

Both returning EAGAIN and blocking are not POSIX compliant. Returning EIO
from blocking sockets is an option. Another option would be to have select()
check the data so that when it returns it can guarantee available data. This would
not affect performance of recvmsg() without using select().


--ms


2004-10-07 10:07:50

by Adam Heath

[permalink] [raw]
Subject: Re: UDP recvmsg blocks after select(), 2.6 bug?

On Thu, 7 Oct 2004, Martijn Sipkema wrote:

> > > It does not matter - this behaviour should not be depended upon. There are
> > > lots of other reasons why a packet might in fact not be available, kernels
> > > are allowed to drop UDP packets at will.
> >
> > I've been lurking and reading this thread with great interest. I had been
> > leaning towards thinking the kernel was wrong, until I read this email.
> >
> > This is a very excellent point.
>
> No, it isn't. If the kernel drops a UDP packet, select() should not return
> indicating available data.

The kernel can drop a packet after select() returns, and before read() is
called. That's the whole point of *U*DP.

2004-10-07 10:28:46

by Martijn Sipkema

[permalink] [raw]
Subject: Re: UDP recvmsg blocks after select(), 2.6 bug?

From: "Adam Heath" <[email protected]>
Sent: Thursday, October 07, 2004 11:07


> On Thu, 7 Oct 2004, Martijn Sipkema wrote:
>
> > > > It does not matter - this behaviour should not be depended upon. There are
> > > > lots of other reasons why a packet might in fact not be available, kernels
> > > > are allowed to drop UDP packets at will.
> > >
> > > I've been lurking and reading this thread with great interest. I had been
> > > leaning towards thinking the kernel was wrong, until I read this email.
> > >
> > > This is a very excellent point.
> >
> > No, it isn't. If the kernel drops a UDP packet, select() should not return
> > indicating available data.
>
> The kernel can drop a packet after select() returns, and before read() is
> called. That's the whole point of *U*DP.

I don't think that is the point of UDP and I don't think the kernel should
do that, but there are two options for handling this:

If recvmsg() blocks until valid data is available then so should select().
If recvmsg() returns an error on invalid data then select() would indicate the
socket() as ready without knowing if the data was valid.


--ms

2004-10-07 11:53:54

by Paul Jakma

[permalink] [raw]
Subject: Re: UDP recvmsg blocks after select(), 2.6 bug?

On Thu, 7 Oct 2004, Chris Friesen wrote:

> Actually, in the single threaded case, the state did not change. We just
> didn't actually check the state before returning from select().

Right, so our perception of state (which for all useful purposes /is/
the state) changed - "we have data" -> "we had to throw out data due
to bad checksum" is a change in kernel state at least, if not in the
(now gone) data.

I'm not really a kernel person. From the application POV, in the
single-threaded case (cause the multi-threaded case is fairly
pathological anyway), there /will/ be time between the select and the
recvmsg, things /can/ change, and obviously they do.

Treating select as anything other than a useful hints mechanism is
going to get you into trouble - just see the Stevens' example others
gave for a long-standing example, in addition to this (sane imho)
Linuxism.

> Actually, there wasn't. The data was corrupt, therefore there was
> no data. Nothing changed with time, as the corrupt data was already
> present before we returned from select().

Perception of state is as good as state here.

> POSIX says that if select() says a socket is readable, a read call
> will not block. Obviously, we are not POSIX compliant.

Right, yes, that seems to be clear now.

Though, I'd still say that any app that calls read/write functions
without O_NONBLOCK set and that expects it will not block, is broken.
Basic common sense really, never mind the fine details of POSIX on
select(). ;)

> There's nothing wrong with not being compliant, but it should be
> documented and we shouldn't claim to be compliant.

Right.

> Chris

regards,
--
Paul Jakma [email protected] [email protected] Key ID: 64A2FF6A
Fortune:
A good reputation is more valuable than money.
-- Publilius Syrus

2004-10-07 12:36:13

by Martijn Sipkema

[permalink] [raw]
Subject: Re: UDP recvmsg blocks after select(), 2.6 bug?

From: "Paul Jakma" <[email protected]>
Sent: Thursday, October 07, 2004 12:53


> On Thu, 7 Oct 2004, Chris Friesen wrote:
>
> > Actually, in the single threaded case, the state did not change. We just
> > didn't actually check the state before returning from select().
>
> Right, so our perception of state (which for all useful purposes /is/
> the state) changed - "we have data" -> "we had to throw out data due
> to bad checksum" is a change in kernel state at least, if not in the
> (now gone) data.
>
> I'm not really a kernel person. From the application POV, in the
> single-threaded case (cause the multi-threaded case is fairly
> pathological anyway), there /will/ be time between the select and the
> recvmsg, things /can/ change, and obviously they do.

That there is time between the select() and recvmsg() calls is not the
issue; the data is only checked in the call to recvmsg(). Actually the
longer the time between select() and recvmsg(), the larger the probability
that valid data has been received.

> Treating select as anything other than a useful hints mechanism is
> going to get you into trouble - just see the Stevens' example others
> gave for a long-standing example, in addition to this (sane imho)
> Linuxism.

But the standard clearly says otherwise.

> > Actually, there wasn't. The data was corrupt, therefore there was
> > no data. Nothing changed with time, as the corrupt data was already
> > present before we returned from select().
>
> Perception of state is as good as state here.

Perhaps select()'s perception of state should be made to take possible
corruption into account.

> > POSIX says that if select() says a socket is readable, a read call
> > will not block. Obviously, we are not POSIX compliant.
>
> Right, yes, that seems to be clear now.
>
> Though, I'd still say that any app that calls read/write functions
> without O_NONBLOCK set and that expects it will not block, is broken.
> Basic common sense really, never mind the fine details of POSIX on
> select(). ;)

Why would the POSIX standard say recvmsg() should not block if
it did not intend it to be used in that way?

> > There's nothing wrong with not being compliant, but it should be
> > documented and we shouldn't claim to be compliant.
>
> Right.

Wrong. IMHO it is not exactly a good thing to not be compliant on
such basic functionality.


--ms

2004-10-07 13:00:47

by Paul Jakma

[permalink] [raw]
Subject: Re: UDP recvmsg blocks after select(), 2.6 bug?

On Thu, 7 Oct 2004, Martijn Sipkema wrote:

> That there is time between the select() and recvmsg() calls is not
> the issue; the data is only checked in the call to recvmsg().
> Actually the longer the time between select() and recvmsg(), the
> larger the probability that valid data has been received.

But it is the issue.

Much can change between the select() and recvmsg - things outside of
kernel control too, and it's long been known.

> But the standard clearly says otherwise.

Standards can have bugs too.

It's not healthy to take a corner-case situation from the standard on
select() and apply it globally to all IO. (not in my mind anyway,
whatever the standard says).

> Perhaps select()'s perception of state should be made to take possible
> corruption into account.

You'll /still/ run into problems, on other platforms too. Set socket
to O_NONBLOCK and deal with it ;)

> Why would the POSIX standard say recvmsg() should not block if
> it did not intend it to be used in that way?

POSIX_ME_HARDER? ;)

> Wrong. IMHO it is not exactly a good thing to not be compliant on
> such basic functionality.

Like I said, to my mind, any sane app should try avoiding assumption
that kernel state remains same between select() and read/write - and
O_NONBLOCK exists to deal nicely with the situation.

You really shouldnt assume select state is guaranteed not to change
by time you get round to doing IO. It's not safe, and not just on
Linux - whatever POSIX says.

> --ms

regards,
--
Paul Jakma [email protected] [email protected] Key ID: 64A2FF6A
Fortune:
"An ounce of prevention is worth a pound of purge."

2004-10-07 13:13:56

by Martijn Sipkema

[permalink] [raw]
Subject: Re: UDP recvmsg blocks after select(), 2.6 bug?

From: "Paul Jakma" <[email protected]>
Sent: Thursday, October 07, 2004 13:53


> On Thu, 7 Oct 2004, Martijn Sipkema wrote:
>
> > That there is time between the select() and recvmsg() calls is not
> > the issue; the data is only checked in the call to recvmsg().
> > Actually the longer the time between select() and recvmsg(), the
> > larger the probability that valid data has been received.
>
> But it is the issue.
>
> Much can change between the select() and recvmsg - things outside of
> kernel control too, and it's long been known.

There is no change; the current implementation just checks the validity of
the data in the recvmsg() call and not during select().

> > But the standard clearly says otherwise.
>
> Standards can have bugs too.
>
> It's not healthy to take a corner-case situation from the standard on
> select() and apply it globally to all IO. (not in my mind anyway,
> whatever the standard says).


Read the standard. The behavious of select() on sockets is explicitely
described.

> > Perhaps select()'s perception of state should be made to take possible
> > corruption into account.
>
> You'll /still/ run into problems, on other platforms too. Set socket
> to O_NONBLOCK and deal with it ;)

What problems?

> > Why would the POSIX standard say recvmsg() should not block if
> > it did not intend it to be used in that way?
>
> POSIX_ME_HARDER? ;)

Would you care to provide any real answers or are you just telling
me to shut up because whatever Linux does is good, and not appear
unreasonable by adding a ;) ..?

> > Wrong. IMHO it is not exactly a good thing to not be compliant on
> > such basic functionality.
>
> Like I said, to my mind, any sane app should try avoiding assumption
> that kernel state remains same between select() and read/write - and
> O_NONBLOCK exists to deal nicely with the situation.
>
> You really shouldnt assume select state is guaranteed not to change
> by time you get round to doing IO. It's not safe, and not just on
> Linux - whatever POSIX says.

Any sane application would be written for the POSIX API as described
in the standard, and a sane kernel should IMHO implement that standard
whenever possible.

--ms

2004-10-07 13:17:03

by Richard B. Johnson

[permalink] [raw]
Subject: Re: UDP recvmsg blocks after select(), 2.6 bug?

On Thu, 7 Oct 2004, Paul Jakma wrote:

> On Thu, 7 Oct 2004, Martijn Sipkema wrote:
>
>> That there is time between the select() and recvmsg() calls is not the
>> issue; the data is only checked in the call to recvmsg(). Actually the
>> longer the time between select() and recvmsg(), the larger the probability
>> that valid data has been received.
>
> But it is the issue.
>
> Much can change between the select() and recvmsg - things outside of kernel
> control too, and it's long been known.
>
>> But the standard clearly says otherwise.
>
> Standards can have bugs too.
>
> It's not healthy to take a corner-case situation from the standard on
> select() and apply it globally to all IO. (not in my mind anyway, whatever
> the standard says).
>
>> Perhaps select()'s perception of state should be made to take possible
>> corruption into account.
>
> You'll /still/ run into problems, on other platforms too. Set socket to
> O_NONBLOCK and deal with it ;)
>
>> Why would the POSIX standard say recvmsg() should not block if
>> it did not intend it to be used in that way?
>
> POSIX_ME_HARDER? ;)
>
>> Wrong. IMHO it is not exactly a good thing to not be compliant on such
>> basic functionality.
>
> Like I said, to my mind, any sane app should try avoiding assumption that
> kernel state remains same between select() and read/write - and O_NONBLOCK
> exists to deal nicely with the situation.
>
> You really shouldnt assume select state is guaranteed not to change by time
> you get round to doing IO. It's not safe, and not just on Linux - whatever
> POSIX says.
>
>> --ms
>
> regards,

Hmmm. When somebody is sleeping in select(), waiting for that
wake_up_interruptible() call that signals that data are
present, how could data not be present anymore when the
only listener makes the call to read() the data?

Simple. It's a BUG. A no-nonsense BUG. The wake_up_interruptible()
call must be made only after valid data are available for the
listener to read. Anything else is a BUG. There are several
bugs present in the logic. The first being that it will
take some time for a listener to wake up, therefore it's
okay to signal the listener before data are ready. This
is the primary BUG. The code is just details. With the
new preemptive kernel, it becomes increasingly necessary
to perform things in the correct order. The second known
BUG is that somebody decided to remove basic functionality
to improve some benchmarks.

Now, one can update the man pages and state to not use poll()
or select() for their intended purposes, or they can fix the
BUG. It's just that simple.

In the meantime, what is the function that should be used
to emulate the correct behavior of select()? If there isn't
an answer to that, then the BUG needs to be fixed.


Cheers,
Dick Johnson
Penguin : Linux version 2.6.5-1.358-noreg on an i686 machine (5537.79 BogoMips).
Note 96.31% of all statistics are fiction.

2004-10-07 13:21:42

by Paul Jakma

[permalink] [raw]
Subject: Re: UDP recvmsg blocks after select(), 2.6 bug?

On Thu, 7 Oct 2004, Martijn Sipkema wrote:

> Would you care to provide any real answers or are you just telling
> me to shut up because whatever Linux does is good, and not appear
> unreasonable by adding a ;) ..?

No, I'm saying it's simple good practice to set O_NONBLOCK on sockets
if one expects not to block - any other expectation is not robust,
never mind what POSIX says about select().

regards,
--
Paul Jakma [email protected] [email protected] Key ID: 64A2FF6A
Fortune:
You know you are getting old when you think you should drive the speed limit.
-- E.A. Gilliam

2004-10-07 13:36:48

by Paul Jakma

[permalink] [raw]
Subject: Re: UDP recvmsg blocks after select(), 2.6 bug?

On Thu, 7 Oct 2004, Martijn Sipkema wrote:

> Any sane application would be written for the POSIX API as
> described in the standard, and a sane kernel should IMHO implement
> that standard whenever possible.

NB: I dont disagree with you.

Just the impression I get is that there is no way to avoid this
situation without a serious performance impact, and that the
optimisation shouldnt really any affect any healthy app. (any which
are really should be setting O_NONBLOCK).

If you could follow the spec without significantly harming
performance, then I'd agree spec should be followed.

I dont really have anything useful to say other than that, IMHO, a
sane app should be using O_NONBLOCK if it really does not want to
block, so I shall now quietly back away from this thread.

regards,
--
Paul Jakma [email protected] [email protected] Key ID: 64A2FF6A
Fortune:
What this country needs is a good five cent microcomputer.

2004-10-07 14:52:45

by Alan

[permalink] [raw]
Subject: Re: UDP recvmsg blocks after select(), 2.6 bug?

On Iau, 2004-10-07 at 15:07, Martijn Sipkema wrote:
> > Much can change between the select() and recvmsg - things outside of
> > kernel control too, and it's long been known.
>
> There is no change; the current implementation just checks the validity of
> the data in the recvmsg() call and not during select().

The accept one is documented by Stevens and well known. In the UDP case
currently we could get precise behaviour - by halving performance of UDP
applications like video streaming. We probably don't want to because we
can respond intelligently to OOM situations by freeing the queue if we
don't enforce such a silly rule.

2004-10-07 14:50:40

by Alan

[permalink] [raw]
Subject: Re: UDP recvmsg blocks after select(), 2.6 bug?

> Read the standard. The behavious of select() on sockets is explicitely
> described.

For a strict posix system, but then if we were a strict posix/sus system
you wouldn't be able to use mmap. Also the kernel doesn't claim to
implement posix behaviour, it avoids those areas were posix is stupid.

> > POSIX_ME_HARDER? ;)
>
> Would you care to provide any real answers or are you just telling
> me to shut up because whatever Linux does is good, and not appear
> unreasonable by adding a ;) ..?

POSIX_ME_HARDER was an environment variable GNU tools used when users
wanted them to do stupid but posix mandated things instead of sensible
things. It was later changed to POSIXLY_CORRECT, which lost the point
somewhat.

> > You really shouldnt assume select state is guaranteed not to change
> > by time you get round to doing IO. It's not safe, and not just on
> > Linux - whatever POSIX says.
>
> Any sane application would be written for the POSIX API as described
> in the standard, and a sane kernel should IMHO implement that standard
> whenever possible.

I doubt that. Sane applications are written to the BSD socket API not
POSIX 1003.1g draft 6.4 and relatives.

Alan

2004-10-07 15:01:15

by Richard B. Johnson

[permalink] [raw]
Subject: Re: UDP recvmsg blocks after select(), 2.6 bug?

On Thu, 7 Oct 2004, Alan Cox wrote:

> On Iau, 2004-10-07 at 15:07, Martijn Sipkema wrote:
>>> Much can change between the select() and recvmsg - things outside of
>>> kernel control too, and it's long been known.
>>
>> There is no change; the current implementation just checks the validity of
>> the data in the recvmsg() call and not during select().
>
> The accept one is documented by Stevens and well known. In the UDP case
> currently we could get precise behaviour - by halving performance of UDP
> applications like video streaming. We probably don't want to because we
> can respond intelligently to OOM situations by freeing the queue if we
> don't enforce such a silly rule.
>

Well if you accept(pun_intended) what Stevens says, then check his
web-site on what he teaches about select() and sockets. His demo
code certainly requires that select() not fail.


Cheers,
Dick Johnson
Penguin : Linux version 2.6.5-1.358-noreg on an i686 machine (5537.79 BogoMips).
Note 96.31% of all statistics are fiction.

2004-10-07 15:09:35

by Jean-Sebastien Trottier

[permalink] [raw]
Subject: Re: UDP recvmsg blocks after select(), 2.6 bug?

Just an outsider's view of someone that has been following this thread:

Could select() have 2 different behaviors depending on wether the
O_NONBLOCK flag is set or not on the socket.

1. If O_NONBLOCK is set, it can immediately return that the socket is
ready to be read (while CRC and possibly other checks are being done in
background). A subsequent call to recvfrom may return EWOULDBLOCK if in
the mean time the data was discarded by the kernel. This is current
behavior.

An application using O_NONBLOCK should be ready to deal with
consequences.

2. In the case where O_NONBLOCK is not set, select() could wait for all
the checks to be done before deciding to return or not. In this case the
meaning would be "there is data ready", NOT "there *might* be data
ready".

This way, there should not be any performance hits for (IMHO) well built
applications that use O_NONBLOCK. And the POSIX standard would not be
broken otherwise and applications that don't use O_NONBLOCK can rely on
actual data being read in recvfrom.

Just my 2 cents...
Sebastien

On Thu, Oct 07, 2004 at 02:36:26PM +0100, Paul Jakma wrote:
> On Thu, 7 Oct 2004, Martijn Sipkema wrote:
>
> >Any sane application would be written for the POSIX API as
> >described in the standard, and a sane kernel should IMHO implement
> >that standard whenever possible.
>
> NB: I dont disagree with you.
>
> Just the impression I get is that there is no way to avoid this
> situation without a serious performance impact, and that the
> optimisation shouldnt really any affect any healthy app. (any which
> are really should be setting O_NONBLOCK).
>
> If you could follow the spec without significantly harming
> performance, then I'd agree spec should be followed.
>
> I dont really have anything useful to say other than that, IMHO, a
> sane app should be using O_NONBLOCK if it really does not want to
> block, so I shall now quietly back away from this thread.
>
> regards,
> --
> Paul Jakma [email protected] [email protected] Key ID: 64A2FF6A
> Fortune:
> What this country needs is a good five cent microcomputer.
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/

2004-10-07 15:22:12

by Adam Heath

[permalink] [raw]
Subject: Re: UDP recvmsg blocks after select(), 2.6 bug?

On Thu, 7 Oct 2004, Alan Cox wrote:

> On Iau, 2004-10-07 at 15:07, Martijn Sipkema wrote:
> > > Much can change between the select() and recvmsg - things outside of
> > > kernel control too, and it's long been known.
> >
> > There is no change; the current implementation just checks the validity of
> > the data in the recvmsg() call and not during select().
>
> The accept one is documented by Stevens and well known. In the UDP case
> currently we could get precise behaviour - by halving performance of UDP
> applications like video streaming. We probably don't want to because we
> can respond intelligently to OOM situations by freeing the queue if we
> don't enforce such a silly rule.

Also, pkts could be dropped even for TCP sockets. TCP will cause the pkt to
be retransmitted, but while that is occuring, the read that was prompted by
the select will still block.

So, any app that does not use O_NONBLOCK is broken, if they assume that a
successful select will indicate a nonblocking read/recvmsg.

2004-10-07 15:32:26

by Martijn Sipkema

[permalink] [raw]
Subject: Re: UDP recvmsg blocks after select(), 2.6 bug?

From: "Alan Cox" <[email protected]>
> > Read the standard. The behavious of select() on sockets is explicitely
> > described.
>
> For a strict posix system, but then if we were a strict posix/sus system
> you wouldn't be able to use mmap. Also the kernel doesn't claim to
> implement posix behaviour, it avoids those areas were posix is stupid.

mmap() _is_ in POSIX AFAIK. Also, there are other standards for things
that aren't in POSIX, but these are supersets.

> > > POSIX_ME_HARDER? ;)
> >
> > Would you care to provide any real answers or are you just telling
> > me to shut up because whatever Linux does is good, and not appear
> > unreasonable by adding a ;) ..?
>
> POSIX_ME_HARDER was an environment variable GNU tools used when users
> wanted them to do stupid but posix mandated things instead of sensible
> things. It was later changed to POSIXLY_CORRECT, which lost the point
> somewhat.

I actually also don't agree with this behaviour of the GNU tools..

> > > You really shouldnt assume select state is guaranteed not to change
> > > by time you get round to doing IO. It's not safe, and not just on
> > > Linux - whatever POSIX says.
> >
> > Any sane application would be written for the POSIX API as described
> > in the standard, and a sane kernel should IMHO implement that standard
> > whenever possible.
>
> I doubt that. Sane applications are written to the BSD socket API not
> POSIX 1003.1g draft 6.4 and relatives.

Perhaps... I get the idea that I just seem to value a standard operating
system interface more than you do; it would be a loss IMHO if people
were forced to write for Linux instead of being able to write portable
applications.

The POSIX interface shouldn't become something one reads to get an idea
of how things wil _not_ work on Linux.


--ms


2004-10-07 15:38:41

by Martijn Sipkema

[permalink] [raw]
Subject: Re: UDP recvmsg blocks after select(), 2.6 bug?

From: "Adam Heath" <[email protected]>
> On Thu, 7 Oct 2004, Alan Cox wrote:
>
> > On Iau, 2004-10-07 at 15:07, Martijn Sipkema wrote:
> > > > Much can change between the select() and recvmsg - things outside of
> > > > kernel control too, and it's long been known.
> > >
> > > There is no change; the current implementation just checks the validity of
> > > the data in the recvmsg() call and not during select().
> >
> > The accept one is documented by Stevens and well known. In the UDP case
> > currently we could get precise behaviour - by halving performance of UDP
> > applications like video streaming. We probably don't want to because we
> > can respond intelligently to OOM situations by freeing the queue if we
> > don't enforce such a silly rule.
>
> Also, pkts could be dropped even for TCP sockets. TCP will cause the pkt to
> be retransmitted, but while that is occuring, the read that was prompted by
> the select will still block.
>
> So, any app that does not use O_NONBLOCK is broken, if they assume that a
> successful select will indicate a nonblocking read/recvmsg.

Aaargh... I'm going to shut up about this now, because this is clearly going
nowhere, but you are saying that any application that expects behaviour as
defined in POSIX is broken, and that bothers me..


--ms

2004-10-07 15:54:00

by Alan

[permalink] [raw]
Subject: Re: UDP recvmsg blocks after select(), 2.6 bug?

On Iau, 2004-10-07 at 17:32, Martijn Sipkema wrote:
> > For a strict posix system, but then if we were a strict posix/sus system
> > you wouldn't be able to use mmap. Also the kernel doesn't claim to
> > implement posix behaviour, it avoids those areas were posix is stupid.
>
> mmap() _is_ in POSIX AFAIK. Also, there are other standards for things
> that aren't in POSIX, but these are supersets.

I'll quote SuSv3 to illustrate the danger of "specifications"

"The system shall always zero-fill any partial page at the end of an
object. Further, the system shall never write out any modified portions
of the last page of an object which are beyond its end. References
within the address range starting at pa and continuing for len bytes to
whole pages following the end of an object shall result in delivery of a
SIGBUS signal."

Its a mistake, its not apparent if its actually meant something entirely
different or someone forgot a "not", or what is going on.

It certainly isnt a useful definition of mmap.

> > POSIX_ME_HARDER was an environment variable GNU tools used when users
> > wanted them to do stupid but posix mandated things instead of sensible
> > things. It was later changed to POSIXLY_CORRECT, which lost the point
> > somewhat.
>
> I actually also don't agree with this behaviour of the GNU tools..

Thankfully you seem to be in the minority

> > I doubt that. Sane applications are written to the BSD socket API not
> > POSIX 1003.1g draft 6.4 and relatives.
>
> Perhaps... I get the idea that I just seem to value a standard operating
> system interface more than you do; it would be a loss IMHO if people
> were forced to write for Linux instead of being able to write portable
> applications.

Portable applications don't work if you get that close to the grey areas
of a standard. Also you'll find the SuS spec simply doesn't work sanely
for some applications. Bind() is instant connect can block for example.
Now try implementing netbios - bind blocks, connect is instant, and the
posix spec is toiler paper grade at this point.

Likewise we have an MS-DOS FS. Its not POSIX behaviour. Should we remove
it, require a different set of system calls or decide that religious
application of standards is not useful.

Linux applies standards pragmatically. A setsockopt for strict socket
compliance might make sense. It'll trash the app performance for UDP but
it would allow users to select useful v posix.

Alan

2004-10-07 17:01:17

by Adrian Phillips

[permalink] [raw]
Subject: Re: UDP recvmsg blocks after select(), 2.6 bug?

>>>>> "Andries" == Andries Brouwer <[email protected]> writes:

Andries> On Wed, Oct 06, 2004 at 02:18:23PM -0600, Chris Friesen
Andries> wrote:
>> In any case, the current behaviour is not compliant with the
>> POSIX text that Andries posted. Perhaps this should be
>> documented somewhere?

Andries> For the time being I wrote (in select.2)

Andries> BUGS It has been reported (Linux 2.6) that select may
Andries> report a socket file descriptor as "ready for reading",
Andries> while nev- ertheless a subsequent read blocks. This
Andries> could perhaps happen when data has arrived but upon
Andries> examination has wrong checksum and is discarded. Thus it
Andries> may be safer to use non-blocking I/O.

On my Debian stable and testing boxes the following text is in man 2
select (obtained from ftp://ftp.win.tue.nl/pub/linux-local/manpages)
:-

Three independent sets of descriptors are watched. Those listed in readfds will be watched to
see if characters become available for reading (more precisely, to see if a read will not
^^^^^^^^^^^^^
block - in particular, a file descriptor is also ready on end-of-file), those in writefds will
^^^^^

(and seems to be the same in the latest tarball) which means that
there is a good possibility that a number of people, myself including,
rely on read not blocking on a fd after select indicates that
"characters have become available". This section should be altered in
some way as well (perhaps referencing the BUGS section).

Sincerely,

Adrian Phillips

--
Who really wrote the works of William Shakespeare ?
http://www.pbs.org/wgbh/pages/frontline/shakespeare/

2004-10-07 17:03:28

by Mark Mielke

[permalink] [raw]
Subject: Re: UDP recvmsg blocks after select(), 2.6 bug?

On Thu, Oct 07, 2004 at 05:39:06PM +0100, Martijn Sipkema wrote:
> Aaargh... I'm going to shut up about this now, because this is clearly going
> nowhere, but you are saying that any application that expects behaviour as
> defined in POSIX is broken, and that bothers me..

I like the idea submitted by somebody else. If O_NONBLOCK is enabled,
the current semantics apply. If O_NONBLOCK is not enabled, select()
takes longer to run and verifies that data is actually available. No
cost for applications that people consider to be 'proper', and the
standard, and saner behaviour, is implemented for the rest. If
somebody could provide a patch for this, that was clean, easy to
maintain, and could be proven to have a minimal impact on performance,
I bet the opponents to this would quiet down.

I don't have the time or experience to do this right now. It would take
me over a month just to learn what would need to be done. So, it will
have to be somebody else...

There is one claim I'd like to question - the claim that select()
would be slowed down unnecessarily, even if the behaviour was changed
for both O_NONBLOCK enabled. Isn't it more expensive to allow the
application to be woken up, and poll using read(), than to just do a
quick check in the kernel and not tell the application there is data,
when there really isn't? This sounds to me like a question of
implementation - if select() did the read check, including checksums,
or whatever, the checks are done. The application doesn't get waken
up, or by the time it gets to read(), the data is already
available. No loss. I'm thinking that it must be the current
implementation that would make this expensive to implement, rather
than some convincing theoretical explanation. If this is true, it's
fine - we all live in the practical world, but we should admit it.

Cheers,
mark

--
[email protected]/[email protected]/[email protected] __________________________
. . _ ._ . . .__ . . ._. .__ . . . .__ | Neighbourhood Coder
|\/| |_| |_| |/ |_ |\/| | |_ | |/ |_ |
| | | | | \ | \ |__ . | | .|. |__ |__ | \ |__ | Ottawa, Ontario, Canada

One ring to rule them all, one ring to find them, one ring to bring them all
and in the darkness bind them...

http://mark.mielke.cc/

2004-10-07 17:23:40

by Chris Friesen

[permalink] [raw]
Subject: Re: UDP recvmsg blocks after select(), 2.6 bug?

Mark Mielke wrote:

> Isn't it more expensive to allow the
> application to be woken up, and poll using read(), than to just do a
> quick check in the kernel and not tell the application there is data,
> when there really isn't?

The issue is caching.

We have to do a pass over the data to copy it to userspace. If we do the
checksum verification then, it's basically free since its already in the cache.

If we do it at select() time, we end up having to do two passes over every
packet, one to verify the checksum, and one to pass it to userspace.

I agree with you though, it'd be nice to have select() change behaviour based on
whether the socket is blocking or not.

Chris

2004-10-07 17:03:29

by Chris Friesen

[permalink] [raw]
Subject: Re: UDP recvmsg blocks after select(), 2.6 bug?

Jean-Sebastien Trottier wrote:
> Just an outsider's view of someone that has been following this thread:
>
> Could select() have 2 different behaviors depending on wether the
> O_NONBLOCK flag is set or not on the socket.
>
> 1. If O_NONBLOCK is set, it can immediately return that the socket is
> ready to be read

> 2. In the case where O_NONBLOCK is not set, select() could wait for all
> the checks to be done before deciding to return or not. In this case the
> meaning would be "there is data ready", NOT "there *might* be data
> ready".

This actually sounds quite interesting.

For applications that are prepared to handle the nonblocking case, you get full
speed. For applications coded to POSIX, you get correctness.

It does mean that select() is now a bit more complicated, but applications
become much easier to write.

Chris

2004-10-07 18:28:20

by Hua Zhong (hzhong)

[permalink] [raw]
Subject: RE: UDP recvmsg blocks after select(), 2.6 bug?

> This actually sounds quite interesting.
>
> For applications that are prepared to handle the nonblocking
> case, you get full speed. For applications coded to POSIX,
> you get correctness.
>
> It does mean that select() is now a bit more complicated, but
> applications become much easier to write.

It was my original proposal. The only question is to return which error
code. We cannot return EAGAIN as Posix explicitly disallows it. Is EIO good?
Or some other new error code?

As far as implmentation goes, we probably need a "check_data" function in
the f_ops. Default it could be NULL.

The question is that is it worth the trouble to be Posix complaint? Seems
most developers think not, especially since it has been this way for so long
a time.

Hua

2004-10-07 18:36:39

by Chris Friesen

[permalink] [raw]
Subject: Re: UDP recvmsg blocks after select(), 2.6 bug?

Hua Zhong wrote:

> It was my original proposal. The only question is to return which error
> code. We cannot return EAGAIN as Posix explicitly disallows it. Is EIO good?
> Or some other new error code?

Since we wouldn't be posix compliant anyway in the nonblocking case, we may as
well return EAGAIN--it's the most appropriate.

Chris

2004-10-07 19:35:35

by David Schwartz

[permalink] [raw]
Subject: RE: UDP recvmsg blocks after select(), 2.6 bug?


> I have a problem where the sequence of events is as follows:
> - application does select() on a UDP socket descriptor
> - select returns success with descriptor ready for reading
> - application does recvfrom() on this descriptor and this recvfrom()
> blocks forever

POSIX does not require the kernel to predict the future. The only guarantee
against having a socket operation block is found in non-blocking sockets.

> My understanding of POSIX is limited, but it seems to me that a read call
> must never block after select just said that it's ok to read from the
> descriptor. So any such behaviour would be a kernel bug.

Suppose hypothetically that we add a new network protocol that permits the
sender to 'invalidate' data after it's received by the remote network stack
and before it's accepted by the remote application. Would you argue that
'select'ing must be considered a read in this case? Even though an
application might 'select' on a socket with no intention to follow up with a
read? Remember, the 'select' operation is supposed to be protocol neutral.

> From a brief look at the kernel UDP code, I suspect a problem in
> net/ipv4/udp.c, udp_recvmsg(): it reads the first available datagram
> from the queue, then checks the UDP checksum. If the UDP checksum fails at
> this point, the datagram is discarded and the process blocks
> until the next
> datagram arrives.

You should understand a hit on 'select' to mean that something happened,
and that it would therefore behoove your application to try the operation it
wants to perform again. The 'select' operation is not fine-grained enough to
know what operation you planned, and whether that particular operation would
block.

Suppose, for example, that instead of using 'read' you used 'recvmsg', and
we add an option to 'recvmsg' to allow you to read datagrams with bad
checksums. What should 'select' do if a datagram is received with a bad
checksum? It has no idea what flavor of 'recvmsg' you're going to call, so
it can't know if your operation is going to block or not.

> Could someone please help me track this problem?
> Am I correct in my reasoning that the select() -> recvmsg() sequence must
> never block?

No, you are incorrect. Consider, again, a 'recvmsg' flag to allow you to
receive messages even if they have bad checksums versus one that blocks
until a message with a valid checksum is received. The 'select' function
just isn't smart enough.

Consider a 'select' for write on a TCP socket. How does 'select' know how
many bytes you're going to write? Again, a 'select' hit just indicates
something relevant has happened, it *cannot* guarantee that a future
operation won't block both because 'select' has no idea what operation is
going to take place in the future and because things can change between now
and then.

> If yes, is it possible that this problem is triggered by a failed UDP
> checksum in the udp_recvmsg() function?
> If yes, can we do something to fix this?

The bug is in your application. The kernel behavior might be considered
undesirable, but it's your application that is failing to tell the kernel
that it must not block.

DS


2004-10-07 21:56:10

by Chris Friesen

[permalink] [raw]
Subject: Re: UDP recvmsg blocks after select(), 2.6 bug?

Martijn Sipkema wrote:
> From: "Chris Friesen" <[email protected]>

>>Since we wouldn't be posix compliant anyway in the nonblocking case, we may as
>>well return EAGAIN--it's the most appropriate.
>
>
> No, I don't think so, since POSIX says to return EAGAIN when:
>
> The socket's file descriptor is marked O_NONBLOCK and no data is waiting to
> be received; or MSG_OOB is set and no out-of-band data is available and either
> the socket's file descriptor is marked O_NONBLOCK or the socket does not
> support blocking to await out-of-band data

We are discussing the case where the socket is nonblocking and the udp checksum
is corrupt, right? (Because in the blocking case select() would verify the
checksum.)

In this case, select() returns with the socket readable, we call recvmsg() and
discover the message is corrupt. At this point we throw away the corrupt
message, so we now have no data waiting to be received. We return EAGAIN, and
userspace goes merrily on its way, handling anything else in its loop, then
going back to select().

Seems perfectly suitable.

Chris

2004-10-07 22:03:44

by Andries Brouwer

[permalink] [raw]
Subject: mmap specification - was: ... select specification

On Thu, Oct 07, 2004 at 03:50:31PM +0100, Alan Cox wrote:

> I'll quote SuSv3 to illustrate the danger of "specifications"
>
> "The system shall always zero-fill any partial page at the end of an
> object. Further, the system shall never write out any modified portions
> of the last page of an object which are beyond its end. References
> within the address range starting at pa and continuing for len bytes to
> whole pages following the end of an object shall result in delivery of a
> SIGBUS signal."
>
> Its a mistake, its not apparent if its actually meant something entirely
> different or someone forgot a "not", or what is going on.

What precisely are you thinking of?
You seem to think this is ridiculous.
The way I read this, Linux is compliant, I think.

[I read this as follows: If you mmap a file with MAP_SHARED and modify
the memory at an address so far beyond EOF that it is not in a page
containing stuff from the file, then you get a SIGBUS. -- Linux does this.
Also, that if you modify the memory at an address beyond EOF, then
the file is not modified. -- Again Linux does this.]

Andries

2004-10-07 22:21:00

by Chris Wedgwood

[permalink] [raw]
Subject: Re: mmap specification - was: ... select specification

On Thu, Oct 07, 2004 at 11:58:34PM +0200, Andries Brouwer wrote:

> [I read this as follows: If you mmap a file with MAP_SHARED and
> modify the memory at an address so far beyond EOF that it is not in
> a page containing stuff from the file, then you get a SIGBUS. --
> Linux does this. Also, that if you modify the memory at an address
> beyond EOF, then the file is not modified. -- Again Linux does
> this.]

consider mmaping a 1-byte file ... you can modify bytes 0..4095.
bytes 1..4095 shouldn't be recorded to disk ideally

at one point one fs did actually store this data and it caused cpp
problems (it would expect to see zeroes and where it didn't it got
upset)

2004-10-07 22:26:05

by David Miller

[permalink] [raw]
Subject: Re: UDP recvmsg blocks after select(), 2.6 bug?

On Fri, 8 Oct 2004 00:19:52 +0100
"Martijn Sipkema" <[email protected]> wrote:

> So why not return EIO instead? It would be even better to have select()
> validate the data, but I think returning EIO is better than blocking and most
> likely POSIX compliant.

It's not EIO either, it's "queue empty" which means block.
-EIO is a hard error. You're trying to find some clever way
to say "try again", but:

1) -EAGAIN is illegal on blocking sockets
2) -EIO is a hard error and does not mean "try again"

Therefore we block, which is the correct thing to do in this
situation.

Applications which wish not to block should (SURPRISE SURPRISE)
use non-blocking sockets, otherwise blocking is ok for them.

I can't believe this thread has lasted this long. I think people
had cotton in their ears when I mentioned that every single 2.4.x
and 2.6.x existing system out there has this behavior, therefore
even if we changed the behavior some way today people still need
to handle this to work on all existing Linux systems.

Furthermore, returning -EAGAIN or any other kind of "try again"
_DOES_ break applications, that's why we changed it to block instead.

2004-10-07 22:27:52

by Chris Friesen

[permalink] [raw]
Subject: Re: UDP recvmsg blocks after select(), 2.6 bug?

David S. Miller wrote:
> Chris Friesen <[email protected]> wrote:

>>In this case, select() returns with the socket readable, we call recvmsg() and
>>discover the message is corrupt. At this point we throw away the corrupt
>>message, so we now have no data waiting to be received. We return EAGAIN, and
>>userspace goes merrily on its way, handling anything else in its loop, then
>>going back to select().

> Incorrect. When the user specifies blocking on the file descriptor
> we must give it what it asked for. -EAGAIN on a blocking file descriptor
> is always a bug, in all situations, that's what this code used to do and we
> fixed it because it's a bug.

I believe you misread what I said. Just before the above quote, I said "We are
discussing the case where the socket is nonblocking and the udp checksum is
corrupt, right? "

What I had in mind was that the non-blocking file descriptor have select()
return without verifying the checksum, and if it was discovered to be bad at
recvmsg() time, we return EAGAIN. For a blocking file descriptor, we would
verify the checksum before returning from select().

That way, the blocking case gets the semantics it expects (although worse
performance), while the nonblocking case gets full performance as well as
semantics that it will handle properly.

Chris

2004-10-07 22:30:48

by David Miller

[permalink] [raw]
Subject: Re: UDP recvmsg blocks after select(), 2.6 bug?

On Thu, 07 Oct 2004 16:24:13 -0600
Chris Friesen <[email protected]> wrote:

> I believe you misread what I said. Just before the above quote, I said "We are
> discussing the case where the socket is nonblocking and the udp checksum is
> corrupt, right? "

And in that case we return -EAGAIN and always have.

> What I had in mind was that the non-blocking file descriptor have select()
> return without verifying the checksum, and if it was discovered to be bad at
> recvmsg() time, we return EAGAIN.

That's what we do. In net/ipv4/udp.c:udp_recvmsg()

if (skb->ip_summed==CHECKSUM_UNNECESSARY) {
err = skb_copy_datagram_iovec(skb, sizeof(struct udphdr), msg->msg_iov,
copied);
} else if (msg->msg_flags&MSG_TRUNC) {
if (__udp_checksum_complete(skb))
goto csum_copy_err;
err = skb_copy_datagram_iovec(skb, sizeof(struct udphdr), msg->msg_iov,
copied);
} else {
err = skb_copy_and_csum_datagram_iovec(skb, sizeof(struct udphdr), msg->msg_iov);

if (err == -EINVAL)
goto csum_copy_err;
}
...
csum_copy_err:
UDP_INC_STATS_BH(UDP_MIB_INERRORS);

/* Clear queue. */
if (flags&MSG_PEEK) {
int clear = 0;
spin_lock_irq(&sk->sk_receive_queue.lock);
if (skb == skb_peek(&sk->sk_receive_queue)) {
__skb_unlink(skb, &sk->sk_receive_queue);
clear = 1;
}
spin_unlock_irq(&sk->sk_receive_queue.lock);
if (clear)
kfree_skb(skb);
}

skb_free_datagram(sk, skb);

if (noblock)
return -EAGAIN;
goto try_again;

If socket is non-blocking, return -EAGAIN, else go back to "try_again"
where we block on packet arrival or error.

2004-10-07 22:20:59

by Martijn Sipkema

[permalink] [raw]
Subject: Re: UDP recvmsg blocks after select(), 2.6 bug?


From: "Chris Friesen" <[email protected]>
> Martijn Sipkema wrote:
> > From: "Chris Friesen" <[email protected]>
>
> >>Since we wouldn't be posix compliant anyway in the nonblocking case, we may as
> >>well return EAGAIN--it's the most appropriate.
> >
> >
> > No, I don't think so, since POSIX says to return EAGAIN when:
> >
> > The socket's file descriptor is marked O_NONBLOCK and no data is waiting to
> > be received; or MSG_OOB is set and no out-of-band data is available and either
> > the socket's file descriptor is marked O_NONBLOCK or the socket does not
> > support blocking to await out-of-band data
>
> We are discussing the case where the socket is nonblocking and the udp checksum
> is corrupt, right? (Because in the blocking case select() would verify the
> checksum.)
>
> In this case, select() returns with the socket readable, we call recvmsg() and
> discover the message is corrupt. At this point we throw away the corrupt
> message, so we now have no data waiting to be received. We return EAGAIN, and
> userspace goes merrily on its way, handling anything else in its loop, then
> going back to select().
>
> Seems perfectly suitable.

Oh, I thought it was about the case where select() does not check the data and
a blocking socket is used and I would think EIO better in that case. But shouldn't
a nonblocking socket return EIO also, since the blocking socket would not in
fact block?

--ms

2004-10-07 22:46:23

by Martijn Sipkema

[permalink] [raw]
Subject: Re: UDP recvmsg blocks after select(), 2.6 bug?

From: "David S. Miller" <[email protected]>
> On Thu, 07 Oct 2004 15:49:17 -0600
> Chris Friesen <[email protected]> wrote:
>
> > In this case, select() returns with the socket readable, we call recvmsg() and
> > discover the message is corrupt. At this point we throw away the corrupt
> > message, so we now have no data waiting to be received. We return EAGAIN, and
> > userspace goes merrily on its way, handling anything else in its loop, then
> > going back to select().
>
> Incorrect. When the user specifies blocking on the file descriptor
> we must give it what it asked for. -EAGAIN on a blocking file descriptor
> is always a bug, in all situations, that's what this code used to do and we
> fixed it because it's a bug.

So why not return EIO instead? It would be even better to have select()
validate the data, but I think returning EIO is better than blocking and most
likely POSIX compliant.


--ms

2004-10-07 22:56:37

by Chris Friesen

[permalink] [raw]
Subject: Re: UDP recvmsg blocks after select(), 2.6 bug?

David S. Miller wrote:
> Chris Friesen <[email protected]> wrote:

>>What I had in mind was that the non-blocking file descriptor have select()
>>return without verifying the checksum, and if it was discovered to be bad at
>>recvmsg() time, we return EAGAIN.
>
>
> That's what we do. In net/ipv4/udp.c:udp_recvmsg()

Yes. I realize this, and agree with that behaviour.

However, you chopped off what I consider the interesting part of my post. I
propose that if we call select() on a blocking file descriptor, we verify the
checksum before saying that the socket is readable. Then, at recvmsg() time, if
it hasn't been checked already we would check it (to allow for the case of
blocking socket without select()).

This allows for easy porting of apps that expect a blocking recvmsg() after
select() to always succeed.

Thus, you end up with:

nonblocking socket -- exactly as current
blocking socket without select -- exactly as current
blocking socket with select -- checksum verified before select() returns


Chris

2004-10-07 22:58:35

by Mark Mielke

[permalink] [raw]
Subject: Re: UDP recvmsg blocks after select(), 2.6 bug?

On Thu, Oct 07, 2004 at 03:24:00PM -0700, David S. Miller wrote:
> I can't believe this thread has lasted this long. I think people
> had cotton in their ears when I mentioned that every single 2.4.x
> and 2.6.x existing system out there has this behavior, therefore
> even if we changed the behavior some way today people still need
> to handle this to work on all existing Linux systems.

Nah. Some people hate it when their operating system doesn't do what
it is documented to do. These people include me.

We all know you should use non-blocking sockets for the application
domain we are talking about (one's that use select() / poll() to
determine whether data is available). Sometimes, it doesn't work.
When, you ask? When you are using a higher level language, such as a
version of Perl that didn't provide IO::blocking(), and on some
operating systems, such as HP-UX, it wasn't possible to portably
enable O_NONBLOCK for your sockets. We're talking practice here, so
I'm talking a real live example. Sure, it's nice to demand that people
upgrade to a later version of Perl. Guess what? It isn't happening. It
will be another year or two before we can guarantee people have Perl
5.006 on their system.

Anyways, I'm suspecting that the occurrences of these failures are so
low that they have been either un-reproducable, or people haven't even
noticed. Sometimes a blocking read happens to be ok - you are expecting
data, and eventually it will come. Then, the select() / poll() starts
up again, and the application continues on. For all the observer knows,
there was a load spike, and the application wasn't given enough cpu
seconds to process the request.

> Furthermore, returning -EAGAIN or any other kind of "try again"
> _DOES_ break applications, that's why we changed it to block instead.

This is good. The bad part is select() returning "data available", when
the information it is basing its decision on, is invalid, but it hasn't
taken the resources to prove yet. It's wrong. It's acceptable for
O_NONBLOCK, as no properly implemented application that uses O_NONBLOCK
should fail from seeing O_NONBLOCK in these cases.

Applications that do not set O_NONBLOCK have different expectations.
Blocking after select() says data is ready is wrong. If we want to say
'yes, it is wrong, but it's a corner case, too complicated to work around
and we're assuming that people are using O_NONBLOCK', that is fine. I just
want you to say it. :-)

I don't really care to hear "it's right, because that's how I coded it, and
that's how fast I want it to run."

Cheers,
mark

--
[email protected]/[email protected]/[email protected] __________________________
. . _ ._ . . .__ . . ._. .__ . . . .__ | Neighbourhood Coder
|\/| |_| |_| |/ |_ |\/| | |_ | |/ |_ |
| | | | | \ | \ |__ . | | .|. |__ |__ | \ |__ | Ottawa, Ontario, Canada

One ring to rule them all, one ring to find them, one ring to bring them all
and in the darkness bind them...

http://mark.mielke.cc/

2004-10-07 23:07:47

by Mark Mielke

[permalink] [raw]
Subject: Re: UDP recvmsg blocks after select(), 2.6 bug?

On Thu, Oct 07, 2004 at 03:47:22PM -0700, David S. Miller wrote:
> On Thu, 7 Oct 2004 18:42:42 -0400
> Mark Mielke <[email protected]> wrote:
> > Sure, it's nice to demand that people
> > upgrade to a later version of Perl. Guess what? It isn't happening. It
> > will be another year or two before we can guarantee people have Perl
> > 5.006 on their system.
> If those people are tepid about upgrading perl, I think it would
> be even less likely that they would upgrade their kernels.

Good practical point for the here and now. :-)

The discussion, though, is more about what it should have been.
The combined frustrations of many of us.

To colour the discussion a bit, many of us have had these same
frustrations with Sun, and HP on various issues.

Just say "it's a bug, but one we have chosen not to fix for practical
reasons." That would have kept me out of this discussion. Saying the
behaviour is correct and that POSIX is wrong - that raises hairs -
both the question kind, and the concern kind.

Cheers,
mark

--
[email protected]/[email protected]/[email protected] __________________________
. . _ ._ . . .__ . . ._. .__ . . . .__ | Neighbourhood Coder
|\/| |_| |_| |/ |_ |\/| | |_ | |/ |_ |
| | | | | \ | \ |__ . | | .|. |__ |__ | \ |__ | Ottawa, Ontario, Canada

One ring to rule them all, one ring to find them, one ring to bring them all
and in the darkness bind them...

http://mark.mielke.cc/

2004-10-07 23:12:05

by David Miller

[permalink] [raw]
Subject: Re: UDP recvmsg blocks after select(), 2.6 bug?

On Thu, 7 Oct 2004 19:00:19 -0400
Mark Mielke <[email protected]> wrote:

> Just say "it's a bug, but one we have chosen not to fix for practical
> reasons." That would have kept me out of this discussion. Saying the
> behaviour is correct and that POSIX is wrong - that raises hairs -
> both the question kind, and the concern kind.

If anything, I'm saying we're not POSIX compliant in this case
by choice.

2004-10-07 23:06:25

by Hua Zhong (hzhong)

[permalink] [raw]
Subject: RE: UDP recvmsg blocks after select(), 2.6 bug?

> I can't believe this thread has lasted this long.

The reason is that you haven't just admitted very clearly that
"Linux select isn't Posix compliant and it was a design decision
not to do so for performance reasons". I think this kind of
authorative answer would shut up many people. :-)

> I think people had cotton in their ears when I mentioned
> that every single 2.4.x and 2.6.x existing system out there
> has this behavior, therefore even if we changed the behavior
> some way today people still need to handle this to work on
> all existing Linux systems.

Unfortunately this isn't the best argument..

I think most people just hope Linux would follow the standard.
The old Linux threads weren't posix-compliant either for years,
and people still fixed (most of?) it in 2.6. By your argument
that would not have happened.

Hua

2004-10-07 23:33:11

by Chris Friesen

[permalink] [raw]
Subject: Re: UDP recvmsg blocks after select(), 2.6 bug?

David S. Miller wrote:

> So people who improperly use select() with blocking sockets get punished
> in a different way, with half the performance compared to today?

Yes. Rather than having an app that doesn't work at all in the case of
corrupted packets, they get half the performance.

Chris

2004-10-07 23:32:44

by Kyle Moffett

[permalink] [raw]
Subject: Re: mmap specification - was: ... select specification

On Oct 07, 2004, at 18:46, Andries Brouwer wrote:
> The POSIX text is clear to me, and Linux is compliant.
> On the other hand, I have no idea what you try to say.

> On Thu, Oct 07, 2004 at 06:32:43PM -0400, Kyle Moffett wrote:
>
>>>> "References within the address range starting at pa and continuing
>>>> for len bytes to whole pages following the end of an object shall
>>>> result in delivery of a SIGBUS signal."

Reviewing this once more:

> References within the address range starting at pa and continuing for
> len bytes:
range = {pa ... pa+len};

> To whole pages following the end of an object:
range = {pa ... PAGE_ROUND_UP(pa+len)};

> shall result in delivery of a SIGBUS signal:
pa[ range[n] ]; => SIGBUS


This is clearly not what is meant

Cheers,
Kyle Moffett

-----BEGIN GEEK CODE BLOCK-----
Version: 3.12
GCM/CS/IT/U d- s++: a17 C++++>$ UB/L/X/*++++(+)>$ P+++(++++)>$
L++++(+++) E W++(+) N+++(++) o? K? w--- O? M++ V? PS+() PE+(-) Y+
PGP+++ t+(+++) 5 X R? tv-(--) b++++(++) DI+ D+ G e->++++$ h!*()>++$ r
!y?(-)
------END GEEK CODE BLOCK------


2004-10-07 23:06:24

by David Miller

[permalink] [raw]
Subject: Re: UDP recvmsg blocks after select(), 2.6 bug?

On Thu, 7 Oct 2004 18:42:42 -0400
Mark Mielke <[email protected]> wrote:

> Sure, it's nice to demand that people
> upgrade to a later version of Perl. Guess what? It isn't happening. It
> will be another year or two before we can guarantee people have Perl
> 5.006 on their system.

If those people are tepid about upgrading perl, I think it would
be even less likely that they would upgrade their kernels.

2004-10-07 23:06:23

by David Miller

[permalink] [raw]
Subject: Re: UDP recvmsg blocks after select(), 2.6 bug?

On Thu, 7 Oct 2004 15:46:23 -0700
"Hua Zhong" <[email protected]> wrote:

> The reason is that you haven't just admitted very clearly that
> "Linux select isn't Posix compliant and it was a design decision
> not to do so for performance reasons".

It is, happy now? :-)

I never claimed it to be POSIX compliant.

2004-10-08 00:07:45

by Ben Greear

[permalink] [raw]
Subject: Re: UDP recvmsg blocks after select(), 2.6 bug?

Can't a lot of NICs do UDP checksum in hardware, basically for free?

At least in this case there would be no penalty for select()
looking for the corrupted packet?

Ben

--
Ben Greear <[email protected]>
Candela Technologies Inc http://www.candelatech.com

2004-10-07 22:56:38

by Alan Curry

[permalink] [raw]
Subject: Re: UDP recvmsg blocks after select(), 2.6 bug?

David S. Miller writes the following:
>
>I can't believe this thread has lasted this long. I think people
>had cotton in their ears when I mentioned that every single 2.4.x
>and 2.6.x existing system out there has this behavior, therefore
>even if we changed the behavior some way today people still need
>to handle this to work on all existing Linux systems.

That argument sounds like "I broke something a few years ago, and nobody
complained quickly enough, so I got away with it." It's not impressive.

2004-10-08 00:26:24

by David Schwartz

[permalink] [raw]
Subject: RE: UDP recvmsg blocks after select(), 2.6 bug?


> > POSIX does not require the kernel to predict the future. The
> > only guarantee
> > against having a socket operation block is found in
> > non-blocking sockets.

> It is one thing to implement select()/recvmsg() in a non POSIX compliant
> way; it is another thing to make false claims about that standard. POSIX
> _does_ guarantee that a call to recvmsg() does not block after a call
> to select().

I do not believe this.

> > Suppose, for example, that instead of using 'read' you used
> > 'recvmsg', and
> > we add an option to 'recvmsg' to allow you to read datagrams with bad
> > checksums. What should 'select' do if a datagram is received with a bad
> > checksum? It has no idea what flavor of 'recvmsg' you're going
> > to call, so
> > it can't know if your operation is going to block or not.

> This is all described in detail in the standard.

Where, specifically, does the standard guarantee that a subsequent call to
'recvmsg' will not block?

> > No, you are incorrect. Consider, again, a 'recvmsg' flag to allow you to
> > receive messages even if they have bad checksums versus one that blocks
> > until a message with a valid checksum is received. The 'select' function
> > just isn't smart enough.
> >
> > Consider a 'select' for write on a TCP socket. How does
> > 'select' know how
> > many bytes you're going to write? Again, a 'select' hit just indicates
> > something relevant has happened, it *cannot* guarantee that a future
> > operation won't block both because 'select' has no idea what
> > operation is
> > going to take place in the future and because things can change
> > between now
> > and then.

> You really should read the standard on this..

I have. We obviously disagree on what it says. Since you're the one
claiming a guarantee that I claim does not exist, perhaps you could cite
where you think this guarantee appears.

DS


2004-10-08 00:21:12

by David Miller

[permalink] [raw]
Subject: Re: UDP recvmsg blocks after select(), 2.6 bug?

On Thu, 07 Oct 2004 16:39:06 -0600
Chris Friesen <[email protected]> wrote:

> However, you chopped off what I consider the interesting part of my post. I
> propose that if we call select() on a blocking file descriptor, we verify the
> checksum before saying that the socket is readable. Then, at recvmsg() time, if
> it hasn't been checked already we would check it (to allow for the case of
> blocking socket without select()).

So people who improperly use select() with blocking sockets get punished
in a different way, with half the performance compared to today?

2004-10-08 00:21:10

by Andries Brouwer

[permalink] [raw]
Subject: Re: mmap specification - was: ... select specification

On Thu, Oct 07, 2004 at 06:32:43PM -0400, Kyle Moffett wrote:

>>>"References within the address range starting at pa and continuing
>>> for len bytes to whole pages following the end of an object shall
>>> result in delivery of a SIGBUS signal."
>
> The last bit of the SuS text means:
>
> pa <-- len --> eof <-> page boundary
>
> Anywhere from pa to page boundary will generate SIGBUS.

The POSIX text is clear to me, and Linux is compliant.
On the other hand, I have no idea what you try to say.

Andries

2004-10-08 00:49:51

by Lee Revell

[permalink] [raw]
Subject: Re: UDP recvmsg blocks after select(), 2.6 bug?

On Thu, 2004-10-07 at 18:47, David S. Miller wrote:
> On Thu, 7 Oct 2004 18:42:42 -0400
> Mark Mielke <[email protected]> wrote:
>
> > Sure, it's nice to demand that people
> > upgrade to a later version of Perl. Guess what? It isn't happening. It
> > will be another year or two before we can guarantee people have Perl
> > 5.006 on their system.
>
> If those people are tepid about upgrading perl, I think it would
> be even less likely that they would upgrade their kernels.

Not true. If you upgrade the kernel you may incur a little downtime.
Upgrade perl and you potentially break 1000's of customer CGI scripts.

Lee

2004-10-07 22:56:38

by Kyle Moffett

[permalink] [raw]
Subject: Re: mmap specification - was: ... select specification

On Oct 07, 2004, at 17:58, Andries Brouwer wrote:
> On Thu, Oct 07, 2004 at 03:50:31PM +0100, Alan Cox wrote:
>> "References
>> within the address range starting at pa and continuing for len bytes
>> to
>> whole pages following the end of an object shall result in delivery
>> of a
>> SIGBUS signal."
>
> [I read this as follows: If you mmap a file with MAP_SHARED and modify
> the memory at an address so far beyond EOF that it is not in a page
> containing stuff from the file, then you get a SIGBUS. -- Linux does
> this.
> Also, that if you modify the memory at an address beyond EOF, then
> the file is not modified. -- Again Linux does this.]

The last bit of the SuS text means:

pa <-- len --> eof <-> page boundary

Anywhere from pa to page boundary will generate SIGBUS. This is a
rather useless definition no? I think what they meant is:

"References within the address range starting on the first whole page
at least len bytes after pa shall result in delivery of a SIGBUS
signal."

(I am assuming, of course, that pa is the result of the mmap call. If
I'm
wrong please tell me, thanks!)

Cheers,
Kyle Moffett

-----BEGIN GEEK CODE BLOCK-----
Version: 3.12
GCM/CS/IT/U d- s++: a17 C++++>$ UB/L/X/*++++(+)>$ P+++(++++)>$
L++++(+++) E W++(+) N+++(++) o? K? w--- O? M++ V? PS+() PE+(-) Y+
PGP+++ t+(+++) 5 X R? tv-(--) b++++(++) DI+ D+ G e->++++$ h!*()>++$ r
!y?(-)
------END GEEK CODE BLOCK------


2004-10-08 01:18:44

by Andries Brouwer

[permalink] [raw]
Subject: Re: mmap specification - was: ... select specification

On Thu, Oct 07, 2004 at 03:17:45PM -0700, Chris Wedgwood wrote:
> On Thu, Oct 07, 2004 at 11:58:34PM +0200, Andries Brouwer wrote:
>
> > [I read this as follows: If you mmap a file with MAP_SHARED and
> > modify the memory at an address so far beyond EOF that it is not in
> > a page containing stuff from the file, then you get a SIGBUS. --
> > Linux does this. Also, that if you modify the memory at an address
> > beyond EOF, then the file is not modified. -- Again Linux does
> > this.]
>
> consider mmaping a 1-byte file ... you can modify bytes 0..4095.
> bytes 1..4095 shouldn't be recorded to disk ideally
>
> at one point one fs did actually store this data and it caused cpp
> problems (it would expect to see zeroes and where it didn't it got
> upset)

Is this an answer? Or an anecdote?

[You seem to say anecdotically that at one point in time Linux mmap
was not POSIX compliant, and problems arose.
Alan on the other hand seems to say that POSIX comes with
ridiculous requirements.]

Andries

2004-10-08 01:31:36

by David Miller

[permalink] [raw]
Subject: Re: UDP recvmsg blocks after select(), 2.6 bug?

On Thu, 07 Oct 2004 15:49:17 -0600
Chris Friesen <[email protected]> wrote:

> In this case, select() returns with the socket readable, we call recvmsg() and
> discover the message is corrupt. At this point we throw away the corrupt
> message, so we now have no data waiting to be received. We return EAGAIN, and
> userspace goes merrily on its way, handling anything else in its loop, then
> going back to select().

Incorrect. When the user specifies blocking on the file descriptor
we must give it what it asked for. -EAGAIN on a blocking file descriptor
is always a bug, in all situations, that's what this code used to do and we
fixed it because it's a bug.

It goes "merrily on its way" if it marks the file descriptor as non-blocking.

2004-10-07 21:46:03

by Martijn Sipkema

[permalink] [raw]
Subject: Re: UDP recvmsg blocks after select(), 2.6 bug?

From: "Chris Friesen" <[email protected]>
> Hua Zhong wrote:
>
> > It was my original proposal. The only question is to return which error
> > code. We cannot return EAGAIN as Posix explicitly disallows it. Is EIO good?
> > Or some other new error code?
>
> Since we wouldn't be posix compliant anyway in the nonblocking case, we may as
> well return EAGAIN--it's the most appropriate.

No, I don't think so, since POSIX says to return EAGAIN when:

The socket's file descriptor is marked O_NONBLOCK and no data is waiting to
be received; or MSG_OOB is set and no out-of-band data is available and either
the socket's file descriptor is marked O_NONBLOCK or the socket does not
support blocking to await out-of-band data

So, I think returning EIO is probably better; I think that would be POSIX
compliant.


--ms

2004-10-07 21:40:56

by Martijn Sipkema

[permalink] [raw]
Subject: Re: UDP recvmsg blocks after select(), 2.6 bug?

From: "David Schwartz" <[email protected]>
> > I have a problem where the sequence of events is as follows:
> > - application does select() on a UDP socket descriptor
> > - select returns success with descriptor ready for reading
> > - application does recvfrom() on this descriptor and this recvfrom()
> > blocks forever
>
> POSIX does not require the kernel to predict the future. The only guarantee
> against having a socket operation block is found in non-blocking sockets.

It is one thing to implement select()/recvmsg() in a non POSIX compliant
way; it is another thing to make false claims about that standard. POSIX
_does_ guarantee that a call to recvmsg() does not block after a call
to select().

> > My understanding of POSIX is limited, but it seems to me that a read call
> > must never block after select just said that it's ok to read from the
> > descriptor. So any such behaviour would be a kernel bug.
>
> Suppose hypothetically that we add a new network protocol that permits the
> sender to 'invalidate' data after it's received by the remote network stack
> and before it's accepted by the remote application. Would you argue that
> 'select'ing must be considered a read in this case? Even though an
> application might 'select' on a socket with no intention to follow up with a
> read? Remember, the 'select' operation is supposed to be protocol neutral.

Consider the data accepted by the remote application at the moment it
calls select().

> > From a brief look at the kernel UDP code, I suspect a problem in
> > net/ipv4/udp.c, udp_recvmsg(): it reads the first available datagram
> > from the queue, then checks the UDP checksum. If the UDP checksum fails at
> > this point, the datagram is discarded and the process blocks
> > until the next
> > datagram arrives.
>
> You should understand a hit on 'select' to mean that something happened,
> and that it would therefore behoove your application to try the operation it
> wants to perform again. The 'select' operation is not fine-grained enough to
> know what operation you planned, and whether that particular operation would
> block.
>
> Suppose, for example, that instead of using 'read' you used 'recvmsg', and
> we add an option to 'recvmsg' to allow you to read datagrams with bad
> checksums. What should 'select' do if a datagram is received with a bad
> checksum? It has no idea what flavor of 'recvmsg' you're going to call, so
> it can't know if your operation is going to block or not.

This is all described in detail in the standard.

> > Could someone please help me track this problem?
> > Am I correct in my reasoning that the select() -> recvmsg() sequence must
> > never block?
>
> No, you are incorrect. Consider, again, a 'recvmsg' flag to allow you to
> receive messages even if they have bad checksums versus one that blocks
> until a message with a valid checksum is received. The 'select' function
> just isn't smart enough.
>
> Consider a 'select' for write on a TCP socket. How does 'select' know how
> many bytes you're going to write? Again, a 'select' hit just indicates
> something relevant has happened, it *cannot* guarantee that a future
> operation won't block both because 'select' has no idea what operation is
> going to take place in the future and because things can change between now
> and then.

You really should read the standard on this..

> > If yes, is it possible that this problem is triggered by a failed UDP
> > checksum in the udp_recvmsg() function?
> > If yes, can we do something to fix this?
>
> The bug is in your application. The kernel behavior might be considered
> undesirable, but it's your application that is failing to tell the kernel
> that it must not block.

Actually, the application may be correct, but he should change it
anyway since it is unlikely that Linux will follow the standard on select()
anytime soon...

--ms

2004-10-08 03:01:59

by Mark Mielke

[permalink] [raw]
Subject: Re: UDP recvmsg blocks after select(), 2.6 bug?

On Thu, Oct 07, 2004 at 03:42:04PM -0700, David S. Miller wrote:
> On Thu, 07 Oct 2004 16:39:06 -0600
> Chris Friesen <[email protected]> wrote:
> > However, you chopped off what I consider the interesting part of my
> > post. I propose that if we call select() on a blocking file
> > descriptor, we verify the checksum before saying that the socket is
> > readable. Then, at recvmsg() time, if it hasn't been checked
> > already we would check it (to allow for the case of blocking socket
> > without select()).
> So people who improperly use select() with blocking sockets get punished
> in a different way, with half the performance compared to today?

No. People who use select() and read() in the *other* documented
manner, however ill-advised, would see the expected, and at least from
the perspective of myself and a few other people around here, correct
behaviour. How much it costs to implement the correct behaviour is a
different concern, and perhaps one these people should have considered
when determining whether or not to use O_NONBLOCK...

Your position, I believe has been that the use of select() on a blocking
file descriptor is invalid. Taking this to an extreme, select() should
return EBADF or something to that effect, when passed a file descriptor
that does not have O_NONBLOCK set...

Cheers,
mark

--
[email protected]/[email protected]/[email protected] __________________________
. . _ ._ . . .__ . . ._. .__ . . . .__ | Neighbourhood Coder
|\/| |_| |_| |/ |_ |\/| | |_ | |/ |_ |
| | | | | \ | \ |__ . | | .|. |__ |__ | \ |__ | Ottawa, Ontario, Canada

One ring to rule them all, one ring to find them, one ring to bring them all
and in the darkness bind them...

http://mark.mielke.cc/

2004-10-07 18:25:28

by George Spelvin

[permalink] [raw]
Subject: Re: UDP recvmsg blocks after select(), 2.6 bug?

> There is one claim I'd like to question - the claim that select()
> would be slowed down unnecessarily, even if the behaviour was changed
> for both O_NONBLOCK enabled. Isn't it more expensive to allow the
> application to be woken up, and poll using read(), than to just do a
> quick check in the kernel and not tell the application there is data,
> when there really isn't?

It depends on how often the second check fails.

The 99.9+% case is that the checksum is good, and in that case, you have
to pay the wakeup cost anyway. (So sending bad-checksum packets isn't
even a useful DoS; good-checksum packets still steal more cycles.)

You only get the benefit of saving the wakeup if the check fails.
But every time it passes, you get the benefit of making the check cheaper.
You have to multiply by the relative probabilities to see the net effect.

For something like UDP checksums on packets which have already passed the
ethernet checksum, that's a pretty overwhelming ratio.

2004-10-08 03:41:03

by David Miller

[permalink] [raw]
Subject: Re: UDP recvmsg blocks after select(), 2.6 bug?

On Thu, 7 Oct 2004 22:51:48 -0400
Mark Mielke <[email protected]> wrote:

> Your position, I believe has been that the use of select() on a blocking
> file descriptor is invalid.

Incorrect.

My position is that expecting a blocking file descriptor not to
block is invalid.

2004-10-08 03:53:04

by Mark Mielke

[permalink] [raw]
Subject: Re: UDP recvmsg blocks after select(), 2.6 bug?

On Thu, Oct 07, 2004 at 08:39:43PM -0700, David S. Miller wrote:
> On Thu, 7 Oct 2004 22:51:48 -0400
> Mark Mielke <[email protected]> wrote:
> > Your position, I believe has been that the use of select() on a blocking
> > file descriptor is invalid.
> Incorrect.
> My position is that expecting a blocking file descriptor not to
> block is invalid.

Extrapolated, this would be - use of select() on a blocking file descriptor
is invalid.

Otherwise, what would be the point of using select() only to accidentally
block a small percentage of the time that one couldn't predict?

Cheers,
mark

--
[email protected]/[email protected]/[email protected] __________________________
. . _ ._ . . .__ . . ._. .__ . . . .__ | Neighbourhood Coder
|\/| |_| |_| |/ |_ |\/| | |_ | |/ |_ |
| | | | | \ | \ |__ . | | .|. |__ |__ | \ |__ | Ottawa, Ontario, Canada

One ring to rule them all, one ring to find them, one ring to bring them all
and in the darkness bind them...

http://mark.mielke.cc/

2004-10-08 04:02:52

by David Miller

[permalink] [raw]
Subject: Re: UDP recvmsg blocks after select(), 2.6 bug?

On Thu, 7 Oct 2004 23:48:48 -0400
Mark Mielke <[email protected]> wrote:

> Extrapolated, this would be - use of select() on a blocking file descriptor
> is invalid.

It's valid, but it's asking for trouble. It is going to block on
you under certain circumstances.

2004-10-08 06:14:51

by Theodore Ts'o

[permalink] [raw]
Subject: Re: UDP recvmsg blocks after select(), 2.6 bug?

On Thu, Oct 07, 2004 at 07:00:19PM -0400, Mark Mielke wrote:
>
> Just say "it's a bug, but one we have chosen not to fix for practical
> reasons." That would have kept me out of this discussion. Saying the
> behaviour is correct and that POSIX is wrong - that raises hairs -
> both the question kind, and the concern kind.

Why? POSIX have gotten *lots* of things wrong in the past.

For example, using 512 byte units for df and du (which we ignore, and
for which the POSIX will hopefully eventually catch up with sanity)
and fcntl unlocking semantics (which we adhere to despite the fact
that is broken beyond belief, and very likely will and will continue
to cause application bugs in the feature). What we do when POSIX does
something idiotic is something that has to be addressed on a
case-by-case basis.

- Ted

2004-10-08 06:41:13

by Willy Tarreau

[permalink] [raw]
Subject: Re: UDP recvmsg blocks after select(), 2.6 bug?

Hi David,

On Wed, Oct 06, 2004 at 05:29:59PM -0700, David S. Miller wrote:
> It absolutely does help the programs not using select(), using
> blocking sockets, and not expecting -EAGAIN.

As I asked in a previous mail in this overly long thread, why not returning
zero bytes at all. It is perfectly valid to receive an UDP packet with 0
data bytes, and any application should be able to support this case anyway.
In case of TCP, this would be a problem because the app would think it
indicates the last byte has been received. But in case of UDP, there is no
problem.

BTW, could we enumerate the known cases where select() might report an FD
as readable while finally not ? If there are only very few cases which can
all be worked around at nearly no cost, it might be worth doing it, or at
least documenting them. From this thread, I gathered :

1) multi-threaded apps -> broken design anyway, and should be obvious.
at most, the case can be documented
2) bad UDP checksums -> this is currently the object of this thread
3) packet dropped because of buffer size, load, etc... (not confirmed)
4) others ? any TCP/unix socket/pipe known case ?

Regards,
Willy

2004-10-08 09:19:38

by Andries Brouwer

[permalink] [raw]
Subject: Re: mmap specification - was: ... select specification

On Thu, Oct 07, 2004 at 07:30:53PM -0400, Kyle Moffett wrote:
> On Oct 07, 2004, at 18:46, Andries Brouwer wrote:
> >The POSIX text is clear to me, and Linux is compliant.
> >On the other hand, I have no idea what you try to say.
>
> >On Thu, Oct 07, 2004 at 06:32:43PM -0400, Kyle Moffett wrote:
> >
> >>>>"References within the address range starting at pa and continuing
> >>>>for len bytes to whole pages following the end of an object shall
> >>>>result in delivery of a SIGBUS signal."
>
> Reviewing this once more:
>
> >References within the address range starting at pa and continuing for
> >len bytes:
> range = {pa ... pa+len};
>
> >To whole pages following the end of an object:
> range = {pa ... PAGE_ROUND_UP(pa+len)};

It is here you are wrong.

2004-10-08 15:24:27

by Mark Mielke

[permalink] [raw]
Subject: Re: UDP recvmsg blocks after select(), 2.6 bug?

On Fri, Oct 08, 2004 at 02:10:52AM -0400, Theodore Ts'o wrote:
> On Thu, Oct 07, 2004 at 07:00:19PM -0400, Mark Mielke wrote:
> > Just say "it's a bug, but one we have chosen not to fix for practical
> > reasons." That would have kept me out of this discussion. Saying the
> > behaviour is correct and that POSIX is wrong - that raises hairs -
> > both the question kind, and the concern kind.
> Why? POSIX have gotten *lots* of things wrong in the past.
> [ non-relevant complaints about POSIX ]
> What we do when POSIX does
> something idiotic is something that has to be addressed on a
> case-by-case basis.

In this case, POSIX defines select() / blocking read() to be useful.
Linux defines it to be dangerous.

I have no question in my mind which behaviour is 'correct', in this
case. Deciding between something that works, and something that doesn't,
is a no brainer for me. Talking about performance, and so on, is just a
complete distraction. Who cares about performance when a percentage of
the time the caller will be in a confused state as a result?

I'm ok with case-by-case. I'm not ok with a generic "POSIX sucks lots -
why should we be POSIX compliant?"

Cheers,
mark

--
[email protected]/[email protected]/[email protected] __________________________
. . _ ._ . . .__ . . ._. .__ . . . .__ | Neighbourhood Coder
|\/| |_| |_| |/ |_ |\/| | |_ | |/ |_ |
| | | | | \ | \ |__ . | | .|. |__ |__ | \ |__ | Ottawa, Ontario, Canada

One ring to rule them all, one ring to find them, one ring to bring them all
and in the darkness bind them...

http://mark.mielke.cc/

2004-10-08 15:32:24

by Mark Mielke

[permalink] [raw]
Subject: Re: UDP recvmsg blocks after select(), 2.6 bug?

On Fri, Oct 08, 2004 at 08:41:04AM +0200, Willy Tarreau wrote:
> On Wed, Oct 06, 2004 at 05:29:59PM -0700, David S. Miller wrote:
> > It absolutely does help the programs not using select(), using
> > blocking sockets, and not expecting -EAGAIN.
> As I asked in a previous mail in this overly long thread, why not returning
> zero bytes at all. It is perfectly valid to receive an UDP packet with 0
> data bytes, and any application should be able to support this case anyway.
> In case of TCP, this would be a problem because the app would think it
> indicates the last byte has been received. But in case of UDP, there is no
> problem.

0 isn't correct either. No zero length packet was successfully received.

I agree with the current read() behaviour. It's the select() behaviour
that I consider to be wrong. Patching return values is just a hacky way
of avoiding the issue.

The issue can be more easily avoided by saying 'the Linux developers
believe that the use of select() with blocking file descriptors is
invalid or not recommended, and have chosen not to ensure that this
use of system calls is reliable'. "We're not POSIX compliant in this
case" isn't good enough for me. One acknowledges the issue. The other
ignores it.

Cheers,
mark

--
[email protected]/[email protected]/[email protected] __________________________
. . _ ._ . . .__ . . ._. .__ . . . .__ | Neighbourhood Coder
|\/| |_| |_| |/ |_ |\/| | |_ | |/ |_ |
| | | | | \ | \ |__ . | | .|. |__ |__ | \ |__ | Ottawa, Ontario, Canada

One ring to rule them all, one ring to find them, one ring to bring them all
and in the darkness bind them...

http://mark.mielke.cc/

2004-10-09 12:06:27

by Colin Phipps

[permalink] [raw]
Subject: Re: UDP recvmsg blocks after select(), 2.6 bug?

So the performance gain is significant. And programs that break were
buggy anyway. But that still leaves the question of whether it benefits
users, given that there is a lot of software, buggy by this
interpretation, that can break. In particular, exposing UDP daemons to
denial of service using bad-checksum UDP packets looks like a rather
interesting security issue.

I have just tried syslog and inetd on a couple of machines running
2.6.8.1, and both hang when given a single bad-checksum udp packet.
hping2 -2 -c 1 -b is the tool of choice. Sure, they could have broken
anyway, but this makes them easy targets - and presumably they are the
tip of the iceberg.

--
Colin Phipps <[email protected]>

2004-10-09 13:21:56

by Bodo Eggert

[permalink] [raw]
Subject: Re: UDP recvmsg blocks after select(), 2.6 bug?

David S. Miller wrote:

> People, get the heck over this. The kernel has behaved this way
> for more than 3 years both in 2.4.x and 2.6.x. The code in question
> even exists in the 2.2.x sources as well.
>
> Therefore, it would be totally pointless to change the behavior
> now since anyone writing an application wishing it to work on
> all existing Linux kernels needs to handle this case anyways.

If you want people to write workaround for functions that intentional
break applications depending on dysfunctional behaviour, you should
document it. You didn't, and therefore most applications will be broken:

google survey:
Results 1 - 10 of about 13,100 for udp select recv. (0.18 seconds)
Results 1 - 10 of about 636 for udp select recv o_nonblock. (0.16 seconds)

Results 1 - 10 of about 4,350 for select recv SOCK_DGRAM. (0.40 seconds)
Results 1 - 10 of about 685 for select recv SOCK_DGRAM o_nonblock. (0.39
seconds)


I think nobody complaining results from nobody sending bad UDP packets,
and nobody sending bad UDP packets resulted from nobody complaining.


BTW: If you're breaking select() for blocking sockets, you can as well
return -EBROKEN. It's as close to the specification as waiting after
guaranteeing not to wait, but it will not result in hidden flaws.
--
Fun things to slip into your budget
Request for 'supermodel access' to the UNIX server.
Just don't tell the PHB why your home directory is named 'jpgs.'

2004-10-09 18:21:21

by Martijn Sipkema

[permalink] [raw]
Subject: Re: UDP recvmsg blocks after select(), 2.6 bug?

From: "David Schwartz" <[email protected]>
> > > POSIX does not require the kernel to predict the future. The
> > > only guarantee
> > > against having a socket operation block is found in
> > > non-blocking sockets.
>
> > It is one thing to implement select()/recvmsg() in a non POSIX compliant
> > way; it is another thing to make false claims about that standard. POSIX
> > _does_ guarantee that a call to recvmsg() does not block after a call
> > to select().
>
> I do not believe this.
>
[...]
> Where, specifically, does the standard guarantee that a subsequent call to
> 'recvmsg' will not block?

When using select() on a socket for reading, select will block until
that socket is ready.

According to POSIX:

A descriptor shall be considered ready for reading when a call to an
input function with O_NONBLOCK clear would not block, whether or not
the function would transfer data successfully.

and

If a descriptor refers to a socket, the implied input function is the
recvmsg() function with parameters requesting normal and ancillary data,
such that the presence of either type shall cause the socket to be marked
as readable. The presence of out-of-band data shall be checked if the
socket option SO_OOBINLINE has been enabled, as out-of-band data is
enqueued with normal data. If the socket is currently listening, then it
shall be marked as readable if an incoming connection request has been
received, and a call to the accept() function shall complete without
blocking.

Thus recvmsg() shouldn't in any case block after a select() on a socket.


--ms



2004-10-09 18:28:35

by David Schwartz

[permalink] [raw]
Subject: RE: UDP recvmsg blocks after select(), 2.6 bug?


> [...]
> > Where, specifically, does the standard guarantee that a
> > subsequent call to
> > 'recvmsg' will not block?
>
> When using select() on a socket for reading, select will block until
> that socket is ready.
>
> According to POSIX:
>
> A descriptor shall be considered ready for reading when a call to an
> input function with O_NONBLOCK clear would not block, whether or not
> the function would transfer data successfully.

Note that it says *would* not block, not *will* not block. The definition
of the word "would" is "an expression of probability or likelihood" (or a
"presumption or expectation"). This is *not* a guarantee.

> and
>
> If a descriptor refers to a socket, the implied input function is the
> recvmsg() function with parameters requesting normal and ancillary data,
> such that the presence of either type shall cause the socket to
> be marked
> as readable. The presence of out-of-band data shall be checked if the
> socket option SO_OOBINLINE has been enabled, as out-of-band data is
> enqueued with normal data. If the socket is currently listening, then it
> shall be marked as readable if an incoming connection request has been
> received, and a call to the accept() function shall complete without
> blocking.
>
> Thus recvmsg() shouldn't in any case block after a select() on a socket.

I don't draw that conclusion from that paragraph. It does say the presence
of normal data shall mark the socket readable, but it doesn't require the
kernel to keep that data available, at least not as far as I can see.

As far as I can tell, neither of these two excerpts prohibit an
implementation from, for example, discarding UDP data (say, to save memory)
after it triggered a read hit on a 'select' call. Yes, the 'recvmsg' call
would not have blocked, had it been made at the time the data was available.

DS


2004-10-09 18:53:51

by Mark Mielke

[permalink] [raw]
Subject: Re: UDP recvmsg blocks after select(), 2.6 bug?

On Sat, Oct 09, 2004 at 11:28:28AM -0700, David Schwartz wrote:
> > > Where, specifically, does the standard guarantee that a
> > > subsequent call to
> > > 'recvmsg' will not block?
> > When using select() on a socket for reading, select will block until
> > that socket is ready.
> > According to POSIX:
> > A descriptor shall be considered ready for reading when a call to an
> > input function with O_NONBLOCK clear would not block, whether or not
> > the function would transfer data successfully.
> Note that it says *would* not block, not *will* not block. The definition
> of the word "would" is "an expression of probability or likelihood" (or a
> "presumption or expectation"). This is *not* a guarantee.

Are you sure you aren't confusing 'would' with 'should'?

Would and will are the same, except in terms of time.

This is ridiculous. Everybody in this discussion *knows* that the
existing behaviour is broken. As another poster appeared to show, can
be proven to be usable as a denial of service attack against any
application that doesn't use O_NONBLOCK for UDP packets under Linux.

Please - people who don't agree, just ensure that Linux is documented
to not implement select() on sockets without O_NONBLOCK properly. No
more silly excuses. 'Would' vs 'will' meaning 'probably'... sheesh...

> > If a descriptor refers to a socket, the implied input function is the
> > recvmsg() function with parameters requesting normal and ancillary data,
> > such that the presence of either type shall cause the socket to
> > be marked
> > as readable. The presence of out-of-band data shall be checked if the
> > socket option SO_OOBINLINE has been enabled, as out-of-band data is
> > enqueued with normal data. If the socket is currently listening, then it
> > shall be marked as readable if an incoming connection request has been
> > received, and a call to the accept() function shall complete without
> > blocking.
> > Thus recvmsg() shouldn't in any case block after a select() on a socket.
> I don't draw that conclusion from that paragraph. It does say the presence
> of normal data shall mark the socket readable, but it doesn't require the
> kernel to keep that data available, at least not as far as I can see.

The data was *never* available. select() lied.

> As far as I can tell, neither of these two excerpts prohibit an
> implementation from, for example, discarding UDP data (say, to save memory)
> after it triggered a read hit on a 'select' call.

Your reading let's you have a broken system call interface, and declare that
it is acceptable. Why? What is the purpose of this position? Who does it
benefit?

> Yes, the 'recvmsg' call
> would not have blocked, had it been made at the time the data was available.

Wrong. The data was never available. If the select() was replaced by
recvmsg() it most certainly *would* have blocked. Therefore, select()
should not have said 'data is ready'.

If this understanding isn't clear, I begin to seriously worry about
the competency of a few people...

It's ok to say "we chose to leave it broken because we feel O_NONBLOCK
should be mandatory". Nobody with any sort of authority seems to want
to say this. They would prefer to talk about "POSIX compliancy" as if
the issue was theoretical and irrelevant.

It is disconcerting, to say the least.

Cheers,
mark

--
[email protected]/[email protected]/[email protected] __________________________
. . _ ._ . . .__ . . ._. .__ . . . .__ | Neighbourhood Coder
|\/| |_| |_| |/ |_ |\/| | |_ | |/ |_ |
| | | | | \ | \ |__ . | | .|. |__ |__ | \ |__ | Ottawa, Ontario, Canada

One ring to rule them all, one ring to find them, one ring to bring them all
and in the darkness bind them...

http://mark.mielke.cc/

2004-10-09 20:00:21

by Martijn Sipkema

[permalink] [raw]
Subject: Re: UDP recvmsg blocks after select(), 2.6 bug?

[...]
> Please - people who don't agree, just ensure that Linux is documented
> to not implement select() on sockets without O_NONBLOCK properly.

Actually, the behaviour isn't correct for sockets with O_NONBLOCK
either, since EAGAIN may only be returned when recvmsg() would not
block without O_NONBLOCK.


--ms


2004-10-09 20:10:49

by Martijn Sipkema

[permalink] [raw]
Subject: Re: mmap specification - was: ... select specification

From: "Andries Brouwer" <[email protected]>
> On Thu, Oct 07, 2004 at 07:30:53PM -0400, Kyle Moffett wrote:
> > On Oct 07, 2004, at 18:46, Andries Brouwer wrote:
> > >The POSIX text is clear to me, and Linux is compliant.
> > >On the other hand, I have no idea what you try to say.
> >
> > >On Thu, Oct 07, 2004 at 06:32:43PM -0400, Kyle Moffett wrote:
> > >
> > >>>>"References within the address range starting at pa and continuing
> > >>>>for len bytes to whole pages following the end of an object shall
> > >>>>result in delivery of a SIGBUS signal."
> >
> > Reviewing this once more:
> >
> > >References within the address range starting at pa and continuing for
> > >len bytes:
> > range = {pa ... pa+len};
> >
> > >To whole pages following the end of an object:
> > range = {pa ... PAGE_ROUND_UP(pa+len)};
>
> It is here you are wrong.

Indeed, I think Kyle took ``end of an object'' to mean the end of the
mapping instead of the end of what is mapped, e.g. EOF in case of a file.


--ms


2004-10-09 23:04:17

by Mark Mielke

[permalink] [raw]
Subject: Re: UDP recvmsg blocks after select(), 2.6 bug?

On Sat, Oct 09, 2004 at 10:00:35PM +0100, Martijn Sipkema wrote:
> [...]
> > Please - people who don't agree, just ensure that Linux is documented
> > to not implement select() on sockets without O_NONBLOCK properly.
> Actually, the behaviour isn't correct for sockets with O_NONBLOCK
> either, since EAGAIN may only be returned when recvmsg() would not
> block without O_NONBLOCK.

At least this one is acceptable, though, as most current applications
won't break, or be open to attack. I agree - it should also be documented
as improper, but with enough words to point out that it isn't a big deal.

I think you and I are on the same page.

Cheers,
mark

--
[email protected]/[email protected]/[email protected] __________________________
. . _ ._ . . .__ . . ._. .__ . . . .__ | Neighbourhood Coder
|\/| |_| |_| |/ |_ |\/| | |_ | |/ |_ |
| | | | | \ | \ |__ . | | .|. |__ |__ | \ |__ | Ottawa, Ontario, Canada

One ring to rule them all, one ring to find them, one ring to bring them all
and in the darkness bind them...

http://mark.mielke.cc/

2004-10-15 22:43:09

by Robert White

[permalink] [raw]
Subject: RE: UDP recvmsg blocks after select(), 2.6 bug?



-----Original Message-----
From: [email protected] [mailto:[email protected]]
On Behalf Of Willy Tarreau

> As I asked in a previous mail in this overly long thread, why not returning
> zero bytes at all. It is perfectly valid to receive an UDP packet with 0


Zero bytes is "end of file". Don't go trying to co-opt end of file. That way lies
madness and despair.

You would *then* need a flag on each file descriptor to determine if the most
previous call before the read op was a select that returned the file as readable so
you knew whether to block or return the not-really-end-of-file. Your *app* would
then also need a flag/context to determine whether the end of file just read was
contextually an aborted read after select.


Nope, very very very very very bad idea... 8-)

[On the larger issues, I am surprised that select() doesn't guarantee available data
and one subsequent non-blocking read, but again in the case of a UDP discard after
the select but before the read, that is the only thing that makes sense. I would
vote (were this a democracy 8-) to put a CAVEAT in the manual that listed the _rare_
cases, as examples, where the warrant of available data may prove false; give a nod
to real life, and _firmly suggest_ that if you are using select, you *probably* want
nonblocking file descriptors too.]

2004-10-15 23:35:08

by David Schwartz

[permalink] [raw]
Subject: RE: UDP recvmsg blocks after select(), 2.6 bug?


> > As I asked in a previous mail in this overly long thread, why
> > not returning
> > zero bytes at all. It is perfectly valid to receive an UDP packet with 0

> Zero bytes is "end of file". Don't go trying to co-opt end of
> file. That way lies
> madness and despair.

Not for UDP. Zero bytes means that zero bytes of data were received, a
perfectly legitimate (though seldom useful) number.

> You would *then* need a flag on each file descriptor to determine
> if the most
> previous call before the read op was a select that returned the
> file as readable so
> you knew whether to block or return the not-really-end-of-file.
> Your *app* would
> then also need a flag/context to determine whether the end of
> file just read was
> contextually an aborted read after select.

You mean whether it was a zero-byte datagram or some sort of error.

> [On the larger issues, I am surprised that select() doesn't
> guarantee available data
> and one subsequent non-blocking read, but again in the case of a
> UDP discard after
> the select but before the read, that is the only thing that makes
> sense. I would
> vote (were this a democracy 8-) to put a CAVEAT in the manual
> that listed the _rare_
> cases, as examples, where the warrant of available data may prove
> false; give a nod
> to real life, and _firmly suggest_ that if you are using select,
> you *probably* want
> nonblocking file descriptors too.]

I think it's a really bad idea to make 'select' more complicated by trying
to nail down precise semantics for every possible protocol. The 'select'
function is supposed to be protocol-neutral and trying to say it guarantees
X on protocol Y, where such guarantees constrict what the kernel can do and
do not make user code anything but more fragile, doesn't seem to be a good
idea to me.

For UDP specifically, datagrams are fundamentally discardable whenever that
seems to be a good idea. In general, there are any number of corner cases
for various combinations of protocols and situations where a 'select' hit
will not be followed by an operation that doesn't block.

It just happens to be that 'select' works best when it's a hint that
something has changed and the operation can/should be re-tried. It works
very badly when the results of a 'select' are supposed to change something
because you're supposed to be able to 'select' (and then not perform the
operation) without affecting things. This is level semantics, not edge.

The CAVEAT is that 'select', like every other status information function
provided by the kernel, does not guarantee anything about the future. Just
like 'stat' does not guarantee that the file size will still be the same
later when you call 'read'.

DS


2004-10-16 01:00:06

by Chris Friesen

[permalink] [raw]
Subject: Re: UDP recvmsg blocks after select(), 2.6 bug?

David Schwartz wrote:

> The CAVEAT is that 'select', like every other status information function
> provided by the kernel, does not guarantee anything about the future. Just
> like 'stat' does not guarantee that the file size will still be the same
> later when you call 'read'.

This is not a very good counterexample. If I'm the only one accessing a file,
then the file size should not just change all by itself.

As you say, select is level triggered. However, the very definition of select
returning a file descriptor as readable is that a subsequent read will not
block. This seems very straightforward. Maybe it's not the best from a
performance point of view, but it's very straightforward.

So if we change the semantics slightly to say that select returning readable
really means a subsequent *blocking* read will not block, then apps that use
blocking sockets will get proper semantics, and apps using nonblocking reads
will get full performance.

As it stands, it's fairly straightforward to do a DOS and hang various apps with
a single corrupt packet each. This is suboptimal.

Chris

2004-10-16 02:40:52

by Mark Mielke

[permalink] [raw]
Subject: Re: UDP recvmsg blocks after select(), 2.6 bug?

On Fri, Oct 15, 2004 at 04:33:40PM -0700, David Schwartz wrote:
> I think it's a really bad idea to make 'select' more complicated by trying
> to nail down precise semantics for every possible protocol. The 'select'
> function is supposed to be protocol-neutral and trying to say it guarantees
> X on protocol Y, where such guarantees constrict what the kernel can do and
> do not make user code anything but more fragile, doesn't seem to be a good
> idea to me.

Are you saying that a minimal operating system should feel free to implement
select() to always return true with all bits set?

> For UDP specifically, datagrams are fundamentally discardable whenever that
> seems to be a good idea. In general, there are any number of corner cases
> for various combinations of protocols and situations where a 'select' hit
> will not be followed by an operation that doesn't block.

Like?

In the accept() case, you genuinely have a 'if you had substituted the
select() with an accept(), the accept() would have succeeded.'

In the UDP case, you do NOT have this situation. A recvmesg() in place
of select() would block, therefore, select() should not block.

> It just happens to be that 'select' works best when it's a hint that
> something has changed and the operation can/should be re-tried. It works
> very badly when the results of a 'select' are supposed to change something
> because you're supposed to be able to 'select' (and then not perform the
> operation) without affecting things. This is level semantics, not edge.

We're talking about a packet that was never readable. If you had an
efficient enough check ahead of time (perhaps implemented in
hardware?), it would never get to the point where select() was in this
position. The Linux developer of this section of code decided that they
wanted UDP to be more efficient by delaying the checksum validation until
the last minute. The cost of this, is that they broke the API with regard
to select(). This isn't being admitted. Instead, off-topic challenges of
POSIX, and impractical claims regarding the use of blocking file descriptors
with select() being not recommended have been offered.

These answers are simply wrong. If the decision is to make select() with
blocking file descriptors unreliable, than the decision *IS* to recommend
that select() never be used with blocking file descriptors. What kind of
operating system developers would recommend the use of select() if it is
known that the behaviour is unreliable?

> The CAVEAT is that 'select', like every other status information function
> provided by the kernel, does not guarantee anything about the future. Just
> like 'stat' does not guarantee that the file size will still be the same
> later when you call 'read'.

We're not talking about the future. Get off that horse.

We're talking about the present. At the time select() is invoked, there is
*no* available data. select() is lying.

Cheers,
mark

--
[email protected]/[email protected]/[email protected] __________________________
. . _ ._ . . .__ . . ._. .__ . . . .__ | Neighbourhood Coder
|\/| |_| |_| |/ |_ |\/| | |_ | |/ |_ |
| | | | | \ | \ |__ . | | .|. |__ |__ | \ |__ | Ottawa, Ontario, Canada

One ring to rule them all, one ring to find them, one ring to bring them all
and in the darkness bind them...

http://mark.mielke.cc/

2004-10-16 04:23:55

by David Schwartz

[permalink] [raw]
Subject: RE: UDP recvmsg blocks after select(), 2.6 bug?


> On Fri, Oct 15, 2004 at 04:33:40PM -0700, David Schwartz wrote:
> > I think it's a really bad idea to make 'select' more
> > complicated by trying
> > to nail down precise semantics for every possible protocol. The 'select'
> > function is supposed to be protocol-neutral and trying to say
> > it guarantees
> > X on protocol Y, where such guarantees constrict what the
> > kernel can do and
> > do not make user code anything but more fragile, doesn't seem
> > to be a good
> > idea to me.

> Are you saying that a minimal operating system should feel free
> to implement
> select() to always return true with all bits set?

Yes, though that would obviously be a very poor implementation.

> > For UDP specifically, datagrams are fundamentally discardable
> > whenever that
> > seems to be a good idea. In general, there are any number of
> > corner cases
> > for various combinations of protocols and situations where a
> > 'select' hit
> > will not be followed by an operation that doesn't block.
>
> Like?

Like a TCP 'write'.

> In the accept() case, you genuinely have a 'if you had substituted the
> select() with an accept(), the accept() would have succeeded.'

While this is true, there is an 'as if' here. There's no difference to the
application between a datagram that dropped and a datagram that got
corrupted in transit.

> In the UDP case, you do NOT have this situation. A recvmesg() in place
> of select() would block, therefore, select() should not block.

But it would block if the UDP datagram had been dropped after the 'select'
hit but before the 'recvmsg'. There is no logical reason there should be a
semantic difference between a UDP packet that was dropped and a UDP packet
that was corrupted.

> > It just happens to be that 'select' works best when it's a hint that
> > something has changed and the operation can/should be re-tried. It works
> > very badly when the results of a 'select' are supposed to
> > change something
> > because you're supposed to be able to 'select' (and then not perform the
> > operation) without affecting things. This is level semantics, not edge.

> We're talking about a packet that was never readable.

There is no *application* difference between a dropped packet and a corrupt
packet. If the packet was dropped, it would have been readable.

> If you had an
> efficient enough check ahead of time (perhaps implemented in
> hardware?), it would never get to the point where select() was in this
> position. The Linux developer of this section of code decided that they
> wanted UDP to be more efficient by delaying the checksum validation until
> the last minute. The cost of this, is that they broke the API with regard
> to select(). This isn't being admitted. Instead, off-topic challenges of
> POSIX, and impractical claims regarding the use of blocking file
> descriptors
> with select() being not recommended have been offered.

> These answers are simply wrong. If the decision is to make select() with
> blocking file descriptors unreliable, than the decision *IS* to recommend
> that select() never be used with blocking file descriptors. What kind of
> operating system developers would recommend the use of select() if it is
> known that the behaviour is unreliable?

I have never seen anyone recommend anything different. Every time this
comes up, it astonishes me that anyone would ever use 'select' on a blocking
file descriptor except in the very special case where you plan to change it
to non-blocking before performing the I/O operation. There are so many well
known corner cases, TCP write being one, UDP discard due to memory pressure
being another, connection abort being yet another.

> > The CAVEAT is that 'select', like every other status
> > information function
> > provided by the kernel, does not guarantee anything about the
> > future. Just
> > like 'stat' does not guarantee that the file size will still be the same
> > later when you call 'read'.
>
> We're not talking about the future. Get off that horse.
>
> We're talking about the present. At the time select() is invoked, there is
> *no* available data. select() is lying.

There is no difference to the application between a UDP packet that was
discarded due to memory pressure in-between a 'select' hit and a packet that
was received with a bad checksum. If you can write an application that can
detect the difference, do so. Then I'll agree with you. Until that time,
this argument fails due to the 'as if' rule. If the API can perform
identically under conditions that do not differ as far as the application is
concerned, then the behavior is legal.

DS


2004-10-16 04:40:58

by Mark Mielke

[permalink] [raw]
Subject: Re: UDP recvmsg blocks after select(), 2.6 bug?

On Fri, Oct 15, 2004 at 09:23:44PM -0700, David Schwartz wrote:
> > Are you saying that a minimal operating system should feel free
> > to implement
> > select() to always return true with all bits set?
> Yes, though that would obviously be a very poor implementation.

The 'obviously' ... 'poor' should tell you something.

> > > For UDP specifically, datagrams are fundamentally discardable
> > > whenever that
> > > seems to be a good idea. In general, there are any number of
> > > corner cases
> > > for various combinations of protocols and situations where a
> > > 'select' hit
> > > will not be followed by an operation that doesn't block.
> > Like?
> Like a TCP 'write'.

Not applicable. See the definition of select() in most reasonable
standards. We're talking about read of a packet. You're talking about
write of an arbitrary number of bytes. No standard I have seen
declares that select() guarantees a write of an arbitrary number of
bytes.

> > In the accept() case, you genuinely have a 'if you had substituted the
> > select() with an accept(), the accept() would have succeeded.'
> While this is true, there is an 'as if' here. There's no difference to the
> application between a datagram that dropped and a datagram that got
> corrupted in transit.

There *shouldn't* be, and yet there is. The application can *currently*
watch select() until it returns true, then, if read() returns EAGAIN in
the non-blocking case, or if it blocks (harder to tell - but don't assume
this means impossible), the application can now be aware that a datagram
was corrupted in transit (as opposed to being dropped).

The internal implementation - an OPTIMIZATION - is being exposed to the
application. In the case of non-blocking, the effect is limited to behaviour
which most of us agree is acceptable. In the case of blocking, the effect
is *not* acceptable. As suggested earlier in this thread, daemons which
receive UDP packets that use blocking file descriptors, and select(), can
be easily DOS'd. You think this is acceptable?

> > In the UDP case, you do NOT have this situation. A recvmesg() in place
> > of select() would block, therefore, select() should not block.
> But it would block if the UDP datagram had been dropped after the 'select'
> hit but before the 'recvmsg'. There is no logical reason there should be a
> semantic difference between a UDP packet that was dropped and a UDP packet
> that was corrupted.

You're thinking too fast, and skipping the most important point here:

1) packet was dropped earlier (or was never sent)
- if select() is issued, it blocks
- if recvmesg() is issued, it blocks
2) packet was received, but is corrupt
- if select() is issued, it does not block
- if recvmesg() is issued, it blocks

See the problem?

> > > It just happens to be that 'select' works best when it's a hint that
> > > something has changed and the operation can/should be re-tried. It works
> > > very badly when the results of a 'select' are supposed to
> > > change something
> > > because you're supposed to be able to 'select' (and then not perform the
> > > operation) without affecting things. This is level semantics, not edge.
>
> > We're talking about a packet that was never readable.
> There is no *application* difference between a dropped packet and a corrupt
> packet. If the packet was dropped, it would have been readable.

So I've and a few other people have tried to explain to you. Since it seems
we agree, why do you contradict yourself by allowing select() to expose
the difference to the application?

> > These answers are simply wrong. If the decision is to make select() with
> > blocking file descriptors unreliable, than the decision *IS* to recommend
> > that select() never be used with blocking file descriptors. What kind of
> > operating system developers would recommend the use of select() if it is
> > known that the behaviour is unreliable?
> I have never seen anyone recommend anything different. Every time this
> comes up, it astonishes me that anyone would ever use 'select' on a blocking
> file descriptor except in the very special case where you plan to change it
> to non-blocking before performing the I/O operation. There are so many well
> known corner cases, TCP write being one, UDP discard due to memory pressure
> being another, connection abort being yet another.

It astonishes you that somebody reads any of the UNIX manuals or standards,
and comes to the conclusion that they can use select() and read() together
on a blocking file descriptor?

What astonishes me is how few of these limitations are openly documented.

> > > The CAVEAT is that 'select', like every other status
> > > information function
> > > provided by the kernel, does not guarantee anything about the
> > > future. Just
> > > like 'stat' does not guarantee that the file size will still be the same
> > > later when you call 'read'.
> > We're not talking about the future. Get off that horse.
> > We're talking about the present. At the time select() is invoked, there is
> > *no* available data. select() is lying.
> There is no difference to the application between a UDP packet that was
> discarded due to memory pressure in-between a 'select' hit and a packet that
> was received with a bad checksum. If you can write an application that can
> detect the difference, do so. Then I'll agree with you. Until that time,
> this argument fails due to the 'as if' rule. If the API can perform
> identically under conditions that do not differ as far as the application is
> concerned, then the behavior is legal.

As I said, there obviously is. select() and read() exposes the difference.

Please consider the argument outside of your pre-conceived conclusion. :-)

As I've also said before - I'm OK with optimization being important - IF,
it is readily admitted that Linux has a limitation in this regard. Instead,
I see only arguments to try and suggest that Linux is 'correct', and that
there is no real problem. There *IS* a problem, and developers who author
applications for Linux that make use of select() *SHOULD* be *FORCED* to
be aware of these limitations.

Cheers,
mark

--
[email protected]/[email protected]/[email protected] __________________________
. . _ ._ . . .__ . . ._. .__ . . . .__ | Neighbourhood Coder
|\/| |_| |_| |/ |_ |\/| | |_ | |/ |_ |
| | | | | \ | \ |__ . | | .|. |__ |__ | \ |__ | Ottawa, Ontario, Canada

One ring to rule them all, one ring to find them, one ring to bring them all
and in the darkness bind them...

http://mark.mielke.cc/

2004-10-16 05:00:45

by David Schwartz

[permalink] [raw]
Subject: RE: UDP recvmsg blocks after select(), 2.6 bug?


> You're thinking too fast, and skipping the most important point here:
>
> 1) packet was dropped earlier (or was never sent)
> - if select() is issued, it blocks
> - if recvmesg() is issued, it blocks
> 2) packet was received, but is corrupt
> - if select() is issued, it does not block
> - if recvmesg() is issued, it blocks
>
> See the problem?

I'm talking about the case where it is dropped after the 'select' hit but
before the call to 'recvmsg'. In that case, the select does not block but
the recvmsg does.

> > > We're talking about a packet that was never readable.
> > > There is no *application* difference between a dropped packet
> > > and a corrupt
> > packet. If the packet was dropped, it would have been readable.

> So I've and a few other people have tried to explain to you.
> Since it seems
> we agree, why do you contradict yourself by allowing select() to expose
> the difference to the application?

It does not. You cannot tell the difference between a packet that was
dropped right before you call 'recvmsg' and a packet that was received
corrupted.

> It astonishes you that somebody reads any of the UNIX manuals or
> standards,
> and comes to the conclusion that they can use select() and read() together
> on a blocking file descriptor?

The certainly can. They just have to understand what behavior they're going
to get. If you absolutely must not ever block, you must use blocking sockets
because the kernel cannot guarantee future behavior. Period. End of story.

> What astonishes me is how few of these limitations are openly documented.

That is a *very* good point. But it doesn't help to deny the limitations. A
hit on 'select' does not guarantee that a future operation will not block
unless you have very tight control over circumstances that typical
applications do not have tight control over.

> > There is no difference to the application between a UDP
> > packet that was
> > discarded due to memory pressure in-between a 'select' hit and
> > a packet that
> > was received with a bad checksum. If you can write an
> > application that can
> > detect the difference, do so. Then I'll agree with you. Until that time,
> > this argument fails due to the 'as if' rule. If the API can perform
> > > identically under conditions that do not differ as far as the
> > application is
> > concerned, then the behavior is legal.
>
> As I said, there obviously is. select() and read() exposes the difference.
>
> Please consider the argument outside of your pre-conceived conclusion. :-)

It does not. In fact, quite literally, the UDP packet is dropped right at
the call to 'recvmsg'. This is totally legal behavior -- a UDP packet can be
discarded at *any* time.

> As I've also said before - I'm OK with optimization being important - IF,
> it is readily admitted that Linux has a limitation in this
> regard. Instead,
> I see only arguments to try and suggest that Linux is 'correct', and that
> there is no real problem. There *IS* a problem, and developers who author
> applications for Linux that make use of select() *SHOULD* be *FORCED* to
> be aware of these limitations.

Linux's behavior is correct in the literal sense that it is doing something
that is allowed. It's incorrect in the sense that it's sub-optimal. However,
it will not break any application that could not already be broken by other
circumstances.

DS


2004-10-16 06:30:40

by Mark Mielke

[permalink] [raw]
Subject: Re: UDP recvmsg blocks after select(), 2.6 bug?

On Fri, Oct 15, 2004 at 09:58:38PM -0700, David Schwartz wrote:
> > You're thinking too fast, and skipping the most important point here:
> > 1) packet was dropped earlier (or was never sent)
> > - if select() is issued, it blocks
> > - if recvmesg() is issued, it blocks
> > 2) packet was received, but is corrupt
> > - if select() is issued, it does not block
> > - if recvmesg() is issued, it blocks
> > See the problem?
> I'm talking about the case where it is dropped after the 'select' hit but
> before the call to 'recvmsg'. In that case, the select does not block but
> the recvmsg does.

You are talking about the make believe case that only exists due to
the *current* implementation of Linux UDP packet reading. It doesn't
have to exist. It exists only behaviour nobody saw fit to implement it
with semantics that were reliable, because the implentors didn't foresee
blocking file descriptors being used. It's an implementation oversight.

> > It astonishes you that somebody reads any of the UNIX manuals or
> > standards,
> > and comes to the conclusion that they can use select() and read() together
> > on a blocking file descriptor?
> The certainly can. They just have to understand what behavior they're going
> to get. If you absolutely must not ever block, you must use blocking sockets
> because the kernel cannot guarantee future behavior. Period. End of story.

We're not talking about future behaviour. We're talking about past behaviour.

That the kernel chose to delay making a decision too long to tell the truth
in select() is not reasonable for blocking sockets. The answer *MUST* be,
fix it for blocking sockets, or tell the truth, and just say - blocking
file descriptors for UDP sockets should not be used with select(). Why is
that so hard? Why all the distracting and minimizing language about
recommendations?

> > What astonishes me is how few of these limitations are openly documented.
> That is a *very* good point. But it doesn't help to deny the limitations. A
> hit on 'select' does not guarantee that a future operation will not block
> unless you have very tight control over circumstances that typical
> applications do not have tight control over.

Sure, but this isn't about the future, remember. The kernel already has the
information to know whether there is data, or whether there isn't. It just
isn't doing the work to find out.

> > Please consider the argument outside of your pre-conceived conclusion. :-)
> It does not. In fact, quite literally, the UDP packet is dropped right at
> the call to 'recvmsg'. This is totally legal behavior -- a UDP packet can be
> discarded at *any* time.

That's a liberal understanding. It's also distracting. select() could know
that the packet will be discarded. It chooses to not.

> Linux's behavior is correct in the literal sense that it is doing something
> that is allowed. It's incorrect in the sense that it's sub-optimal. However,
> it will not break any application that could not already be broken by other
> circumstances.

Allowed by who? For select() to say data is ready, and read() to block,
is not allowed by all the standards that I have read. This is the first
time I have ever heard of a situation like this, for select(). It is *not*
the same as writing an arbitrary number of bytes, or accepting.

You say it will not break any application that could not already be broken
by other circumstances. I disagree. For example, a UDP-based server that
only receives, and never sends, would be perfectly happy to select() on
several file descriptors, and readmesg() whenever the UDP file descriptor
says readable. It would not break on other operating systems that implement
select() to be useful for determining whether or not to read() from a
blocking file descriptor.

mark

--
[email protected]/[email protected]/[email protected] __________________________
. . _ ._ . . .__ . . ._. .__ . . . .__ | Neighbourhood Coder
|\/| |_| |_| |/ |_ |\/| | |_ | |/ |_ |
| | | | | \ | \ |__ . | | .|. |__ |__ | \ |__ | Ottawa, Ontario, Canada

One ring to rule them all, one ring to find them, one ring to bring them all
and in the darkness bind them...

http://mark.mielke.cc/

2004-10-16 10:24:49

by Willy Tarreau

[permalink] [raw]
Subject: Re: UDP recvmsg blocks after select(), 2.6 bug?

On Fri, Oct 15, 2004 at 03:42:55PM -0700, Robert White wrote:
> -----Original Message-----
> From: [email protected] [mailto:[email protected]]
> On Behalf Of Willy Tarreau
>
> > As I asked in a previous mail in this overly long thread, why not returning
> > zero bytes at all. It is perfectly valid to receive an UDP packet with 0
>
> Zero bytes is "end of file". Don't go trying to co-opt end of file. That way lies
> madness and despair.

Please explain me what "end of file" means with UDP. If your UDP-based app
expects to receive a zero when the other end stops transmitting, then it
might wait for a very long time. As opposed to TCP, there's no FIN control
flag to tell the remote host that you sent your last packet.

Willy

2004-10-16 13:26:42

by Mark Mielke

[permalink] [raw]
Subject: Re: UDP recvmsg blocks after select(), 2.6 bug?

On Sat, Oct 16, 2004 at 12:24:33PM +0200, Willy Tarreau wrote:
> On Fri, Oct 15, 2004 at 03:42:55PM -0700, Robert White wrote:
> > > As I asked in a previous mail in this overly long thread, why not
> > > returning zero bytes at all. It is perfectly valid to receive an
> > > UDP packet with 0
> > Zero bytes is "end of file". Don't go trying to co-opt end of file.
> > That way lies madness and despair.
> Please explain me what "end of file" means with UDP. If your UDP-based app
> expects to receive a zero when the other end stops transmitting, then it
> might wait for a very long time. As opposed to TCP, there's no FIN control
> flag to tell the remote host that you sent your last packet.

He means zero byte packet. :-)

mark

--
[email protected]/[email protected]/[email protected] __________________________
. . _ ._ . . .__ . . ._. .__ . . . .__ | Neighbourhood Coder
|\/| |_| |_| |/ |_ |\/| | |_ | |/ |_ |
| | | | | \ | \ |__ . | | .|. |__ |__ | \ |__ | Ottawa, Ontario, Canada

One ring to rule them all, one ring to find them, one ring to bring them all
and in the darkness bind them...

http://mark.mielke.cc/

2004-10-16 18:25:50

by Andries Brouwer

[permalink] [raw]
Subject: Re: UDP recvmsg blocks after select(), 2.6 bug?

On Fri, Oct 15, 2004 at 09:58:38PM -0700, David Schwartz wrote:

> Linux's behavior is correct in the literal sense that it is doing something
> that is allowed. It's incorrect in the sense that it's sub-optimal.

"Allowed" by whom? By you?

2004-10-16 21:46:25

by Roland Kuhn

[permalink] [raw]
Subject: Re: UDP recvmsg blocks after select(), 2.6 bug?

Hi Mark!

On Oct 16, 2004, at 8:25 AM, Mark Mielke wrote:

> On Fri, Oct 15, 2004 at 09:58:38PM -0700, David Schwartz wrote:
>>> You're thinking too fast, and skipping the most important point here:
>>> 1) packet was dropped earlier (or was never sent)
>>> - if select() is issued, it blocks
>>> - if recvmesg() is issued, it blocks
>>> 2) packet was received, but is corrupt
>>> - if select() is issued, it does not block
>>> - if recvmesg() is issued, it blocks
>>> See the problem?
>> I'm talking about the case where it is dropped after the 'select' hit
>> but
>> before the call to 'recvmsg'. In that case, the select does not block
>> but
>> the recvmsg does.
>
> You are talking about the make believe case that only exists due to
> the *current* implementation of Linux UDP packet reading. It doesn't
> have to exist. It exists only behaviour nobody saw fit to implement it
> with semantics that were reliable, because the implentors didn't
> foresee
> blocking file descriptors being used. It's an implementation oversight.
>
Well, I haven't read the source to see what would be necessary to
create this behaviour, but David was talking about the situation where
the UDP packet is dropped because of memory pressure. This event cannot
possibly be foretold by select()...

Ciao,
Roland

--
TU Muenchen, Physik-Department E18, James-Franck-Str. 85747 Garching
Telefon 089/289-12592; Telefax 089/289-12570
--
A mouse is a device used to point at
the xterm you want to type in.
Kim Alm on a.s.r.


Attachments:
PGP.sig (186.00 B)
This is a digitally signed message part

2004-10-17 00:11:46

by Mark Mielke

[permalink] [raw]
Subject: Re: UDP recvmsg blocks after select(), 2.6 bug?

On Sat, Oct 16, 2004 at 11:44:21PM +0200, Roland Kuhn wrote:
> >You are talking about the make believe case that only exists due to
> >the *current* implementation of Linux UDP packet reading. It doesn't
> >have to exist. It exists only behaviour nobody saw fit to implement it
> >with semantics that were reliable, because the implentors didn't
> >foresee
> >blocking file descriptors being used. It's an implementation oversight.
> Well, I haven't read the source to see what would be necessary to
> create this behaviour, but David was talking about the situation where
> the UDP packet is dropped because of memory pressure. This event cannot
> possibly be foretold by select()...

I don't think he is, but if he is:

I'm not sure that either is reasonable behaviour. The socket buffers
don't increase or decrease at run time, do they? If they do shrink at
run time, this is news to me...

Cheers,
mark

--
[email protected]/[email protected]/[email protected] __________________________
. . _ ._ . . .__ . . ._. .__ . . . .__ | Neighbourhood Coder
|\/| |_| |_| |/ |_ |\/| | |_ | |/ |_ |
| | | | | \ | \ |__ . | | .|. |__ |__ | \ |__ | Ottawa, Ontario, Canada

One ring to rule them all, one ring to find them, one ring to bring them all
and in the darkness bind them...

http://mark.mielke.cc/

2004-10-17 00:29:39

by David Schwartz

[permalink] [raw]
Subject: RE: UDP recvmsg blocks after select(), 2.6 bug?


> On Fri, Oct 15, 2004 at 09:58:38PM -0700, David Schwartz wrote:
>
> > Linux's behavior is correct in the literal sense that it is
> > doing something
> > that is allowed. It's incorrect in the sense that it's sub-optimal.
>
> "Allowed" by whom? By you?

I clearly explained what I meant in context that you snipped. In summary, I
mean 'allowed' in the sense that it's not prohibited by the standard and
arguing that it's not allowed leads to direct logical contradictions.
Nothing prohibits an implementation from dropping a UDP packet after it has
been received. Nothing in POSIX requires that a subsequent operation
actually does not block.

Would you argue that an implementation cannot drop a UDP packet after it
has indicated a read hit on 'select' because of that packet? If so, where
does POSIX say this? And if not, then it can drop a corrupt packet on a call
to 'recvmsg' as well.

DS



2004-10-17 00:29:54

by David Schwartz

[permalink] [raw]
Subject: RE: UDP recvmsg blocks after select(), 2.6 bug?


> Sure, but this isn't about the future, remember. The kernel
> already has the
> information to know whether there is data, or whether there isn't. It just
> isn't doing the work to find out.

The kernel does not have the information yet. It could get it, but it does
not have it.

> Allowed by who? For select() to say data is ready, and read() to block,
> is not allowed by all the standards that I have read. This is the first
> time I have ever heard of a situation like this, for select(). It is *not*
> the same as writing an arbitrary number of bytes, or accepting.

So is it your position that a kernel cannot drop a UDP packet at any time
after it indicated that the socket was readable because of that packet? If
so, please cite where in POSIX you think you find this requirement.

The kernel elects to drop the packet on the call to 'recvmsg'. This is its
right -- it can drop a UDP packet at any time. POSIX is careful not to imply
that 'select' guarantees future behavior because this is not possible in
principle.

> You say it will not break any application that could not already be broken
> by other circumstances. I disagree. For example, a UDP-based server that
> only receives, and never sends, would be perfectly happy to select() on
> several file descriptors, and readmesg() whenever the UDP file descriptor
> says readable. It would not break on other operating systems that
> implement
> select() to be useful for determining whether or not to read() from a
> blocking file descriptor.

Sure it would. It would break on any platform that dropped a UDP packet
after having triggered a read hit on 'select' because of that packet. POSIX
does not say that a future read will not block because it cannot guarantee
that under at least some circumstances.

DS


2004-10-17 00:32:47

by David Schwartz

[permalink] [raw]
Subject: RE: UDP recvmsg blocks after select(), 2.6 bug?


> I don't think he is, but if he is:
>
> I'm not sure that either is reasonable behaviour. The socket buffers
> don't increase or decrease at run time, do they? If they do shrink at
> run time, this is news to me...

The socket buffers are not guaranteed to indicate a particular number of
bytes in a sense that it meaningful to the application. In fact, on Linux,
they don't mean application bytes.

In any event, we aren't talking about any particular implementation, we are
talking about a standard. So what Linux does or doesn't do in response to
memory pressure isn't relevant. What's relevant is what the standard
actually guarantees and what the semantics of the protocols themselves are.

UDP is not reliable. Packets can be dropped, mangled, and lost. Nothing in
POSIX prohibits an implementation from dropping a packet right before you
call 'recvmsg'.

DS


2004-10-17 12:22:53

by Andries Brouwer

[permalink] [raw]
Subject: Re: UDP recvmsg blocks after select(), 2.6 bug?

On Sat, Oct 16, 2004 at 05:28:22PM -0700, David Schwartz wrote:

> > > Linux's behavior is correct in the literal sense that it is
> > > doing something
> > > that is allowed. It's incorrect in the sense that it's sub-optimal.
> >
> > "Allowed" by whom? By you?
>
> I clearly explained what I meant in context that you snipped. In summary, I
> mean 'allowed' in the sense that it's not prohibited by the standard

Yes, but it is prohibited by the standard in case you refer to POSIX.
I quoted chapter and verse. If you refer to a different standard, be explicit.

Whether the standard is reasonable or not, and whether we care or not,
that is a different matter. But you must keep the facts straight.

Andries

2004-10-17 13:35:49

by Lars Marowsky-Bree

[permalink] [raw]
Subject: Re: UDP recvmsg blocks after select(), 2.6 bug?

On 2004-10-16T17:28:24, David Schwartz <[email protected]> wrote:

> The kernel elects to drop the packet on the call to 'recvmsg'. This is its
> right -- it can drop a UDP packet at any time. POSIX is careful not to imply
> that 'select' guarantees future behavior because this is not possible in
> principle.

I'm sorry, but according to my reading of POSIX and the Austin spec,
this is exactly what select() returning 'ready to read' implies.

The SuV spec is actually quite detailed about the options here:

A descriptor shall be considered ready for reading when a call
to an input function with O_NONBLOCK clear would not block,
whether or not the function would transfer data successfully.
(The function might return data, an end-of-file indication, or
an error other than one indicating that it is blocked, and in
each of these cases the descriptor shall be considered ready for
reading.)

This actually forbids recvmsg() to return EAGAIN and EWOULDBLOCK as
has been suggested. EIO seems to be the best fit.

But I'd have to agree that blocking on recvmsg() after select() has
indicated ready to read does violate the specification, unless the
socket has actually been opened with O_NONBLOCK.


Sincerely,
Lars Marowsky-Br?e <[email protected]>

--
High Availability & Clustering
SUSE Labs, Research and Development
SUSE LINUX AG - A Novell company

2004-10-17 14:17:23

by Buddy Lucas

[permalink] [raw]
Subject: Re: UDP recvmsg blocks after select(), 2.6 bug?

On Sun, 17 Oct 2004 15:35:37 +0200, Lars Marowsky-Bree <[email protected]> wrote:
> On 2004-10-16T17:28:24, David Schwartz <[email protected]> wrote:
>
> > The kernel elects to drop the packet on the call to 'recvmsg'. This is its
> > right -- it can drop a UDP packet at any time. POSIX is careful not to imply
> > that 'select' guarantees future behavior because this is not possible in
> > principle.
>
> I'm sorry, but according to my reading of POSIX and the Austin spec,
> this is exactly what select() returning 'ready to read' implies.
>
> The SuV spec is actually quite detailed about the options here:
>
> A descriptor shall be considered ready for reading when a call
> to an input function with O_NONBLOCK clear would not block,
> whether or not the function would transfer data successfully.
> (The function might return data, an end-of-file indication, or
> an error other than one indicating that it is blocked, and in
> each of these cases the descriptor shall be considered ready for
> reading.)

But it says nowhere that the select()/recvmsg() operation is atomic, right?


Cheers,
Buddy

2004-10-17 14:53:13

by Mark Mielke

[permalink] [raw]
Subject: Re: UDP recvmsg blocks after select(), 2.6 bug?

On Sat, Oct 16, 2004 at 05:30:23PM -0700, David Schwartz wrote:
> > I'm not sure that either is reasonable behaviour. The socket buffers
> > don't increase or decrease at run time, do they? If they do shrink at
> > run time, this is news to me...
> The socket buffers are not guaranteed to indicate a particular number of
> bytes in a sense that it meaningful to the application. In fact, on Linux,
> they don't mean application bytes.

I believe this also means, that it doesn't happen, correct? So why is the
subject being changed?

> In any event, we aren't talking about any particular implementation, we are
> talking about a standard. So what Linux does or doesn't do in response to
> memory pressure isn't relevant. What's relevant is what the standard
> actually guarantees and what the semantics of the protocols themselves are.

Yes, we are talking about a standard. I'm talking about POSIX, and the
related standards that describe select(), and read()/recvmsg().

> UDP is not reliable. Packets can be dropped, mangled, and lost. Nothing in
> POSIX prohibits an implementation from dropping a packet right before you
> call 'recvmsg'.

You seem to believe that the definition of UDP supercedes every and
all standards that describe select() and read()/recvmsg(). Selecting
your standards based on your preference and comfort level. Who gave
you the right to do this for Linux? Why does your comfort level allow
you to take the 'drop' freedom of UDP to the extreme, ignore POSIX, and
so on, that make the behaviour of select() on blocking file descriptors
reliable in certain specific regards (*including* this one, from the
interpretation of more than one person), and then not outright honestly
declare that *nobody* should use select() on blocking file descriptors,
because it isn't reliable under Linux? Why talk of recommendations, or
explanations? select() for blocking file descriptors under Linux leads
to programs being susceptible to DOS attacks. You don't recommend against
it. You fix it, or just say: Don't use the system calls this way - it isn't
supported under Linux.

I've said all this before, and you've said your piece before. Is there
a particular reason you want to just recommend against it's use, instead
of declaring it invalid under Linux? (blocking sockets with select())

Cheers,
mark

--
[email protected]/[email protected]/[email protected] __________________________
. . _ ._ . . .__ . . ._. .__ . . . .__ | Neighbourhood Coder
|\/| |_| |_| |/ |_ |\/| | |_ | |/ |_ |
| | | | | \ | \ |__ . | | .|. |__ |__ | \ |__ | Ottawa, Ontario, Canada

One ring to rule them all, one ring to find them, one ring to bring them all
and in the darkness bind them...

http://mark.mielke.cc/

2004-10-17 14:58:15

by Mark Mielke

[permalink] [raw]
Subject: Re: UDP recvmsg blocks after select(), 2.6 bug?

On Sat, Oct 16, 2004 at 05:28:24PM -0700, David Schwartz wrote:
> > Sure, but this isn't about the future, remember. The kernel
> > already has the
> > information to know whether there is data, or whether there isn't. It just
> > isn't doing the work to find out.
> The kernel does not have the information yet. It could get it, but it does
> not have it.

You are wrong, David. The kernel *does* have the information. I haven't
seen anybody except you deny this.

What has been said, is that it is considered too expensive to do the check
at select() time, based on the information that *is* available.

> > Allowed by who? For select() to say data is ready, and read() to block,
> > is not allowed by all the standards that I have read. This is the first
> > time I have ever heard of a situation like this, for select(). It is *not*
> > the same as writing an arbitrary number of bytes, or accepting.
> So is it your position that a kernel cannot drop a UDP packet at any time
> after it indicated that the socket was readable because of that packet? If
> so, please cite where in POSIX you think you find this requirement.

Not after select() has stated that the packet is ready for
reading. That is correct. I would to the definition of select() that
was posted earlier in this thread. You've chosen to ignore this
definition, it seems.

> The kernel elects to drop the packet on the call to 'recvmsg'. This is its
> right -- it can drop a UDP packet at any time. POSIX is careful not to imply
> that 'select' guarantees future behavior because this is not possible in
> principle.

Not future behaviour. POSIX, based on the definition of select() would seem
to suggest that, in a delayed-drop implementation such as Linux's, that
select() should feel free to drop the corrupt packet. recvmsg() just makes
the use of select() with blocking sockets unreasonable. Don't you see this?

> > You say it will not break any application that could not already be broken
> > by other circumstances. I disagree. For example, a UDP-based server that
> > only receives, and never sends, would be perfectly happy to select() on
> > several file descriptors, and readmesg() whenever the UDP file descriptor
> > says readable. It would not break on other operating systems that
> > implement
> > select() to be useful for determining whether or not to read() from a
> > blocking file descriptor.
> Sure it would. It would break on any platform that dropped a UDP packet
> after having triggered a read hit on 'select' because of that packet. POSIX
> does not say that a future read will not block because it cannot guarantee
> that under at least some circumstances.

Again - we're not talking about the future. Get off that. The packet
is already in the queue. Linux already has the information to know whether
the packet will be dropped or not.

If you want to say that select() on blocking sockets is invalid, please
say it. Don't recommend against it. Say - the use of select() on blocking
sockets is invalid under Linux. Put it in documentation that people will
read.

That's all we want.

Cheers,
mark

--
[email protected]/[email protected]/[email protected] __________________________
. . _ ._ . . .__ . . ._. .__ . . . .__ | Neighbourhood Coder
|\/| |_| |_| |/ |_ |\/| | |_ | |/ |_ |
| | | | | \ | \ |__ . | | .|. |__ |__ | \ |__ | Ottawa, Ontario, Canada

One ring to rule them all, one ring to find them, one ring to bring them all
and in the darkness bind them...

http://mark.mielke.cc/

2004-10-17 15:12:09

by Mark Mielke

[permalink] [raw]
Subject: Re: UDP recvmsg blocks after select(), 2.6 bug?

On Sun, Oct 17, 2004 at 04:17:06PM +0200, Buddy Lucas wrote:
> On Sun, 17 Oct 2004 15:35:37 +0200, Lars Marowsky-Bree <[email protected]> wrote:
> > The SuV spec is actually quite detailed about the options here:
> > A descriptor shall be considered ready for reading when a call
> > to an input function with O_NONBLOCK clear would not block,
> > whether or not the function would transfer data successfully.
> > (The function might return data, an end-of-file indication, or
> > an error other than one indicating that it is blocked, and in
> > each of these cases the descriptor shall be considered ready for
> > reading.)
> But it says nowhere that the select()/recvmsg() operation is atomic, right?

This is a distraction. If the call to select() had been substituted
with a call to recvmsg(), it would have blocked. Instead, select() is
returning 'yes, you can read', and then recvmsg() is blocking. The
select() lied. The information is all sitting in the kernel packet
queue. The content of the packet, and the checksum are sitting in
memory waiting to be considered. select() has all the information it
needs to make a decision as to whether to say 'yes, you can read' or
'there is nothing to read'. The current implementation delays this
check until the very last instant (recvmsg()) in order to get a few
more performance points. For non-blocking sockets, this works fine -
applications should be able to handle EAGAIN unless they are extremely
poorly wirtten. For blocking sockets, it makes select() useless as a
reliable mechanism for determining whether or not the recvmsg() will
block. I say useless, because I don't know why any professional
programmer would ever use an interface that was unreliable in a
production system. Other people here aren't willing to make this claim.
I'm not sure why...

In the above paragraph, I only prove that the atomic argument is
irrelevant and a distraction. The current behaviour might be
acceptable - but only if it is widely known and understood that
select() should not be used with blocking sockets *AT ALL* under
Linux. Somebody showed what looks to be a successful DOS of inetd on
Linux based on this new knowledge. The existence of this thread
suggests that it isn't widely known or understood.

I want to trust Linux with production systems. Any sort of opinion
that 'specs are only to be used as recommendations' or 'we can
interpret a spec however we want to get performance points, and we
don't have to be careful to document the Linux limitations' is
*scary*. The last time this happened that it caused such a mess was
the implementation of kernel-level threads. Linux developers shouldn't
be making these decisions in the dark. It isn't comforting. This
thread has disconcerted me, and my continued participation is an
attempt on my part to aid in the representation of people like me. I'm
disappointed. Please feel my emotions on this and respect my
motivations and don't write this off so quick as "POSIX is wrong" as
some have done.

Many of us want Linux to continue to become professional grade. We
have a love for it. When your love betrays you, it hurts.

Cheers,
mark

--
[email protected]/[email protected]/[email protected] __________________________
. . _ ._ . . .__ . . ._. .__ . . . .__ | Neighbourhood Coder
|\/| |_| |_| |/ |_ |\/| | |_ | |/ |_ |
| | | | | \ | \ |__ . | | .|. |__ |__ | \ |__ | Ottawa, Ontario, Canada

One ring to rule them all, one ring to find them, one ring to bring them all
and in the darkness bind them...

http://mark.mielke.cc/

2004-10-17 15:41:05

by Buddy Lucas

[permalink] [raw]
Subject: Re: UDP recvmsg blocks after select(), 2.6 bug?

On Sun, 17 Oct 2004 11:05:09 -0400, Mark Mielke <[email protected]> wrote:
> On Sun, Oct 17, 2004 at 04:17:06PM +0200, Buddy Lucas wrote:
> > On Sun, 17 Oct 2004 15:35:37 +0200, Lars Marowsky-Bree <[email protected]> wrote:
> > > The SuV spec is actually quite detailed about the options here:
> > > A descriptor shall be considered ready for reading when a call
> > > to an input function with O_NONBLOCK clear would not block,
> > > whether or not the function would transfer data successfully.
> > > (The function might return data, an end-of-file indication, or
> > > an error other than one indicating that it is blocked, and in
> > > each of these cases the descriptor shall be considered ready for
> > > reading.)
> > But it says nowhere that the select()/recvmsg() operation is atomic, right?
>
> This is a distraction. If the call to select() had been substituted
> with a call to recvmsg(), it would have blocked. Instead, select() is
> returning 'yes, you can read', and then recvmsg() is blocking. The
> select() lied. The information is all sitting in the kernel packet

No. A million things might happen between select() and recvmsg(), both
in kernel and application. For a consistent behaviour throughout all
possibilities, you *have* to assume that any read on a blocking fd may
block, and that a fd ready for reading at select() time might not be
readable once the app gets to recvmsg() -- for whatever reason.

And indeed, that implies that select() on blocking fds is generally
not useful if you expect to bypass the blocking through select().
Personally, I think any application that implements this expectation
is broken. (If only because you might have to do a second read() or
recvmsg() which will either result in a crappy select() loop or a
broken read()/recvmsg() loop).

> [snip]

> poorly wirtten. For blocking sockets, it makes select() useless as a
> reliable mechanism for determining whether or not the recvmsg() will
> block. I say useless, because I don't know why any professional

That use of select() *is* useless, there's no doubt about that. It is
an application problem though.

> [snip]

> In the above paragraph, I only prove that the atomic argument is

Where's the proof?! You *define* some behaviour that you think makes most sense.

> irrelevant and a distraction. The current behaviour might be
> acceptable - but only if it is widely known and understood that
> select() should not be used with blocking sockets *AT ALL* under
> Linux. Somebody showed what looks to be a successful DOS of inetd on
> Linux based on this new knowledge. The existence of this thread
> suggests that it isn't widely known or understood.

That is not a DoS, it is an application feature or bug, depending on
what the programmer was thinking.

> I want to trust Linux with production systems. Any sort of opinion

>From a pragmatic point of view, it may be comforting to know that most
applications that can now be considered broken will *still* be broken
even if a recvmsg() will never block after select() has given the
verdict "Thou shalt read".


Cheers,
Buddy

2004-10-17 16:27:38

by Lee Revell

[permalink] [raw]
Subject: Re: UDP recvmsg blocks after select(), 2.6 bug?

On Sun, 2004-10-17 at 11:40, Buddy Lucas wrote:
> > poorly wirtten. For blocking sockets, it makes select() useless as a
> > reliable mechanism for determining whether or not the recvmsg() will
> > block. I say useless, because I don't know why any professional
>
> That use of select() *is* useless, there's no doubt about that. It is
> an application problem though.
>

At this point it's a documentation problem. It's clear that the
behavior is by design and is not changing. It's also clear that this is
not the expected behavior for some people. Can't we just note this in
CAVEATS or something and move on?

Lee



2004-10-17 16:41:17

by Martijn Sipkema

[permalink] [raw]
Subject: Re: UDP recvmsg blocks after select(), 2.6 bug?

From: "Buddy Lucas" <[email protected]>
> On Sun, 17 Oct 2004 11:05:09 -0400, Mark Mielke <[email protected]> wrote:
> > On Sun, Oct 17, 2004 at 04:17:06PM +0200, Buddy Lucas wrote:
> > > On Sun, 17 Oct 2004 15:35:37 +0200, Lars Marowsky-Bree <[email protected]> wrote:
> > > > The SuV spec is actually quite detailed about the options here:
> > > > A descriptor shall be considered ready for reading when a call
> > > > to an input function with O_NONBLOCK clear would not block,
> > > > whether or not the function would transfer data successfully.
> > > > (The function might return data, an end-of-file indication, or
> > > > an error other than one indicating that it is blocked, and in
> > > > each of these cases the descriptor shall be considered ready for
> > > > reading.)
> > > But it says nowhere that the select()/recvmsg() operation is atomic, right?
> >
> > This is a distraction. If the call to select() had been substituted
> > with a call to recvmsg(), it would have blocked. Instead, select() is
> > returning 'yes, you can read', and then recvmsg() is blocking. The
> > select() lied. The information is all sitting in the kernel packet
>
> No. A million things might happen between select() and recvmsg(), both
> in kernel and application. For a consistent behaviour throughout all
> possibilities, you *have* to assume that any read on a blocking fd may
> block, and that a fd ready for reading at select() time might not be
> readable once the app gets to recvmsg() -- for whatever reason.

It is perfectly possible to not have a million things happen between
select() and recvmsg() and POSIX defines what can happen and what
can't; it states that a process calling select() on a socket will not block
on a subsequent recvmsg() on that socket.

> And indeed, that implies that select() on blocking fds is generally
> not useful if you expect to bypass the blocking through select().
> Personally, I think any application that implements this expectation
> is broken. (If only because you might have to do a second read() or
> recvmsg() which will either result in a crappy select() loop or a
> broken read()/recvmsg() loop).

The way select() is defined in POSIX effectively means that once an
application has done a select() on a socket, the data that caused
select() to return is committed, i.e. it can no longer be dropped and
should be considered received by the application; this has nothing
to do with UDP being unreliable and being unreliable for the sake
of it is not what UDP was meant for.

Whether you think an application that is written to use select() as
defined in POSIX is broken is not really important. The fact remains
that Linux currently implements a select() that is _not_ POSIX
compliant and is so solely for performance reasons. I personally think
correct behaviour is much more important.


--ms

2004-10-17 17:23:19

by Lars Marowsky-Bree

[permalink] [raw]
Subject: Re: UDP recvmsg blocks after select(), 2.6 bug?

On 2004-10-17T16:17:06, Buddy Lucas <[email protected]> wrote:

> > The SuV spec is actually quite detailed about the options here:
> >
> > A descriptor shall be considered ready for reading when a call
> > to an input function with O_NONBLOCK clear would not block,
> > whether or not the function would transfer data successfully.
> > (The function might return data, an end-of-file indication, or
> > an error other than one indicating that it is blocked, and in
> > each of these cases the descriptor shall be considered ready for
> > reading.)
> But it says nowhere that the select()/recvmsg() operation is atomic, right?

See, Buddy, the point here is that Linux _does_ violate the
specification. You can try weaseling out of it, but it's not going to
work.

This isn't per se the same as saying that it's not a sensible violation,
but very clearly the specs disagree with the current Linux behaviour.

It's impossible to claim that you are allowed by the spec to block on a
recvmsg directly following a successful select. You are not. You could
claim that, but you'd be wrong.

If the packet has been dropped in between, which _could_ have happened
because UDP is allowed to be dropped basically anywhere, EIO may be
returned. But blocking or returning EAGAIN/EWOULDBLOCK is verboten. The
spec is very clearly on that.

(Now I'd claim that returning EIO after a succesful select is also
slightly suboptimal - the performance optimizations should be turned off
for blocking sockets, IMHO, and the data which caused the select() to
return should be considered comitted - but it would be allowed.)

I'm not so sure what's so hard to accept about that. It may be well that
Linux is following the de-facto industry standard (or even setting it)
here, and I'd agree that if you don't want blocking use O_NONBLOCK, but
in no way can Linux claim POSIX/SuV spec compliance for this behaviour.

I'm not getting why people argue so much to try and weasel the words so
that it comes out as compliant. It's not. It may make sense due to
practical reasons, but it's not compliant.


Sincerely,
Lars Marowsky-Br?e <[email protected]>

--
High Availability & Clustering
SUSE Labs, Research and Development
SUSE LINUX AG - A Novell company

2004-10-17 17:27:24

by Jesper Juhl

[permalink] [raw]
Subject: Re: UDP recvmsg blocks after select(), 2.6 bug?

On Sun, 17 Oct 2004, Buddy Lucas wrote:

> On Sun, 17 Oct 2004 11:05:09 -0400, Mark Mielke <[email protected]> wrote:
> > On Sun, Oct 17, 2004 at 04:17:06PM +0200, Buddy Lucas wrote:
> > > On Sun, 17 Oct 2004 15:35:37 +0200, Lars Marowsky-Bree <[email protected]> wrote:
> > > > The SuV spec is actually quite detailed about the options here:
> > > > A descriptor shall be considered ready for reading when a call
> > > > to an input function with O_NONBLOCK clear would not block,
> > > > whether or not the function would transfer data successfully.
> > > > (The function might return data, an end-of-file indication, or
> > > > an error other than one indicating that it is blocked, and in
> > > > each of these cases the descriptor shall be considered ready for
> > > > reading.)
> > > But it says nowhere that the select()/recvmsg() operation is atomic, right?
> >
> > This is a distraction. If the call to select() had been substituted
> > with a call to recvmsg(), it would have blocked. Instead, select() is
> > returning 'yes, you can read', and then recvmsg() is blocking. The
> > select() lied. The information is all sitting in the kernel packet
>
> No. A million things might happen between select() and recvmsg(), both
> in kernel and application. For a consistent behaviour throughout all
> possibilities, you *have* to assume that any read on a blocking fd may
> block, and that a fd ready for reading at select() time might not be
> readable once the app gets to recvmsg() -- for whatever reason.
>
> And indeed, that implies that select() on blocking fds is generally
> not useful if you expect to bypass the blocking through select().
> Personally, I think any application that implements this expectation
> is broken. (If only because you might have to do a second read() or
> recvmsg() which will either result in a crappy select() loop or a
> broken read()/recvmsg() loop).
>
Personally I agree that if you want non blocking sockets that's what you
should use, but many people expect that if select says a socket is
readable then attempting to recieve that data will not block. Many people
refer to Stevens "UNIX Network Programming" to find out how select and
related networking functions are supposed to behave, and Stevens has this
to say on page 153 under the heading "Under What Conditions Is a
Descriptor Ready?" [1] :

[...]
1. A socket is ready for reading if any of the following four conditions
is true:

a. The number of bytes of data in the socker recieve buffer is greater
than or equal to the current size of the low-water mark for the
socket recieve buffer. A read operation on the socket will not block
and will return a value greater than 0 ...

[...]

He's not saying this is specific to either UDP or TCP sockets nor blocking
or non-blocking sockets, so I suspect that regardless of what POSIX might
say a lot of people will be reading the above and assume that if select
says a socket (any type of socket) is ready for reading, then attempting
to read will not block.
Using blocking sockets with select in this manner may be silly or
inefficient, but I think many people expect it to work without the
subsequent read blocking in any case.


[1] I'm only quoting what I think is relevant to the thread, but I can
quote the rest if wanted. My copy of the book has ISBN 0-13-490012-X in
case someone wants to check.


--
Jesper Juhl


2004-10-17 17:35:15

by Buddy Lucas

[permalink] [raw]
Subject: Re: UDP recvmsg blocks after select(), 2.6 bug?

On Sun, 17 Oct 2004 18:35:34 +0100, Martijn Sipkema <[email protected]> wrote:
> From: "Buddy Lucas" <[email protected]>
> > On Sun, 17 Oct 2004 11:05:09 -0400, Mark Mielke <[email protected]> wrote:
> > > On Sun, Oct 17, 2004 at 04:17:06PM +0200, Buddy Lucas wrote:
> > > > On Sun, 17 Oct 2004 15:35:37 +0200, Lars Marowsky-Bree <[email protected]> wrote:
> > > > > The SuV spec is actually quite detailed about the options here:
> > > > > A descriptor shall be considered ready for reading when a call
> > > > > to an input function with O_NONBLOCK clear would not block,
> > > > > whether or not the function would transfer data successfully.
> > > > > (The function might return data, an end-of-file indication, or
> > > > > an error other than one indicating that it is blocked, and in
> > > > > each of these cases the descriptor shall be considered ready for
> > > > > reading.)
> > > > But it says nowhere that the select()/recvmsg() operation is atomic, right?
> > >
> > > This is a distraction. If the call to select() had been substituted
> > > with a call to recvmsg(), it would have blocked. Instead, select() is
> > > returning 'yes, you can read', and then recvmsg() is blocking. The
> > > select() lied. The information is all sitting in the kernel packet
> >
> > No. A million things might happen between select() and recvmsg(), both
> > in kernel and application. For a consistent behaviour throughout all
> > possibilities, you *have* to assume that any read on a blocking fd may
> > block, and that a fd ready for reading at select() time might not be
> > readable once the app gets to recvmsg() -- for whatever reason.
>
> It is perfectly possible to not have a million things happen between
> select() and recvmsg() and POSIX defines what can happen and what
> can't; it states that a process calling select() on a socket will not block
> on a subsequent recvmsg() on that socket.
>
> > And indeed, that implies that select() on blocking fds is generally
> > not useful if you expect to bypass the blocking through select().
> > Personally, I think any application that implements this expectation
> > is broken. (If only because you might have to do a second read() or
> > recvmsg() which will either result in a crappy select() loop or a
> > broken read()/recvmsg() loop).
>
> The way select() is defined in POSIX effectively means that once an
> application has done a select() on a socket, the data that caused
> select() to return is committed, i.e. it can no longer be dropped and
> should be considered received by the application; this has nothing

That is plainly wrong. Data is never received by an application before
recvmsg() has succeeded.

> to do with UDP being unreliable and being unreliable for the sake
> of it is not what UDP was meant for.
>
> Whether you think an application that is written to use select() as
> defined in POSIX is broken is not really important. The fact remains
> that Linux currently implements a select() that is _not_ POSIX
> compliant and is so solely for performance reasons. I personally think
> correct behaviour is much more important.

All I'm saying is, that applications that are not correct now, will
probably not be correct even if we change the way Linux handles this
situation. The sanest thing really seems to accept the fact that any
read() on a blocking fd might block, even if the programmer thinks it
really shouldn't.

But then I am one of those who thinks it's sane to check for
EWOULDBLOCK on a nonblocking socket after blocking in select().

Let's just document this and move on to something more important.


Cheers,
Buddy

> --ms
>
>

2004-10-17 17:54:40

by Buddy Lucas

[permalink] [raw]
Subject: Re: UDP recvmsg blocks after select(), 2.6 bug?

On Sun, 17 Oct 2004 19:22:44 +0200, Lars Marowsky-Bree <[email protected]> wrote:
> On 2004-10-17T16:17:06, Buddy Lucas <[email protected]> wrote:
>
> > > The SuV spec is actually quite detailed about the options here:
> > >
> > > A descriptor shall be considered ready for reading when a call
> > > to an input function with O_NONBLOCK clear would not block,
> > > whether or not the function would transfer data successfully.
> > > (The function might return data, an end-of-file indication, or
> > > an error other than one indicating that it is blocked, and in
> > > each of these cases the descriptor shall be considered ready for
> > > reading.)
> > But it says nowhere that the select()/recvmsg() operation is atomic, right?
>
> See, Buddy, the point here is that Linux _does_ violate the
> specification. You can try weaseling out of it, but it's not going to
> work.

Sigh. Read the quote to which I responded again. Not a word about
atomicity. Nowhere does it say that a descriptor which was ready for
reading at select() time is still readable at recvmsg() time. There is
no doubt that it would be very nice if select() would say something
useful, but that's not the issue here.

> This isn't per se the same as saying that it's not a sensible violation,
> but very clearly the specs disagree with the current Linux behaviour.

So document it.

> It's impossible to claim that you are allowed by the spec to block on a
> recvmsg directly following a successful select. You are not. You could
> claim that, but you'd be wrong.

Empty statement.

> If the packet has been dropped in between, which _could_ have happened
> because UDP is allowed to be dropped basically anywhere, EIO may be
> returned. But blocking or returning EAGAIN/EWOULDBLOCK is verboten. The
> spec is very clearly on that.

Obviously returning EAGAIN/EWOULDBLOCK while reading from a blocking
fd is not what we want (in the situation at hand). I don't see how it
relates to the discussion.

> (Now I'd claim that returning EIO after a succesful select is also
> slightly suboptimal - the performance optimizations should be turned off
> for blocking sockets, IMHO, and the data which caused the select() to
> return should be considered comitted - but it would be allowed.)
> I'm not so sure what's so hard to accept about that. It may be well that
> Linux is following the de-facto industry standard (or even setting it)
> here, and I'd agree that if you don't want blocking use O_NONBLOCK, but
> in no way can Linux claim POSIX/SuV spec compliance for this behaviour.

It doesn't.


Cheers,
Buddy

2004-10-17 18:04:44

by Buddy Lucas

[permalink] [raw]
Subject: Re: UDP recvmsg blocks after select(), 2.6 bug?

On Sun, 17 Oct 2004 19:35:03 +0200 (CEST), Jesper Juhl <[email protected]> wrote:

> Personally I agree that if you want non blocking sockets that's what you
> should use, but many people expect that if select says a socket is
> readable then attempting to recieve that data will not block. Many people
> refer to Stevens "UNIX Network Programming" to find out how select and
> related networking functions are supposed to behave, and Stevens has this

[ snip ]

Also note the examples that Stevens gives. For instance, he explicitly
checks for EWOULDBLOCK after a read on a nonblocking fd that has been
reported readable by select().


Cheers,
Buddy

2004-10-17 18:05:55

by Lars Marowsky-Bree

[permalink] [raw]
Subject: Re: UDP recvmsg blocks after select(), 2.6 bug?

On 2004-10-17T19:54:04, Buddy Lucas <[email protected]> wrote:

> Sigh. Read the quote to which I responded again. Not a word about
> atomicity. Nowhere does it say that a descriptor which was ready for
> reading at select() time is still readable at recvmsg() time.

That is the _whole point_ of select() on non-O_NONBLOCK sockets.

> There is no doubt that it would be very nice if select() would say
> something useful, but that's not the issue here.

According to the specs, it says something useful. It doesn't on Linux,
agreed.

> > This isn't per se the same as saying that it's not a sensible violation,
> > but very clearly the specs disagree with the current Linux behaviour.
> So document it.

That's one way of doing it, yes.

> > If the packet has been dropped in between, which _could_ have happened
> > because UDP is allowed to be dropped basically anywhere, EIO may be
> > returned. But blocking or returning EAGAIN/EWOULDBLOCK is verboten. The
> > spec is very clearly on that.
> Obviously returning EAGAIN/EWOULDBLOCK while reading from a blocking
> fd is not what we want (in the situation at hand). I don't see how it
> relates to the discussion.

Others have suggested it in this thread as a possible error code, so
that's how it relates to this discussion. Surprise ;-)

> > I'm not so sure what's so hard to accept about that. It may be well that
> > Linux is following the de-facto industry standard (or even setting it)
> > here, and I'd agree that if you don't want blocking use O_NONBLOCK, but
> > in no way can Linux claim POSIX/SuV spec compliance for this behaviour.
> It doesn't.

*sigh* According to the man pages and the Linux Standard Base it does,
and it has been claimed repeatedly in this thread too.

Please get your facts straight.


Sincerely,
Lars Marowsky-Br?e <[email protected]>

--
High Availability & Clustering
SUSE Labs, Research and Development
SUSE LINUX AG - A Novell company

2004-10-17 18:08:03

by Lars Marowsky-Bree

[permalink] [raw]
Subject: Re: UDP recvmsg blocks after select(), 2.6 bug?

On 2004-10-17T20:04:21, Buddy Lucas <[email protected]> wrote:

> [ snip ]
>
> Also note the examples that Stevens gives. For instance, he explicitly
> checks for EWOULDBLOCK after a read on a nonblocking fd that has been
> reported readable by select().

The specs don't disagree with that. On a O_NONBLOCK socket, that is
allowed.


Sincerely,
Lars Marowsky-Br?e <[email protected]>

--
High Availability & Clustering
SUSE Labs, Research and Development
SUSE LINUX AG - A Novell company

2004-10-17 18:12:36

by Mark Mielke

[permalink] [raw]
Subject: Re: UDP recvmsg blocks after select(), 2.6 bug?

On Sun, Oct 17, 2004 at 07:54:04PM +0200, Buddy Lucas wrote:
> > This isn't per se the same as saying that it's not a sensible violation,
> > but very clearly the specs disagree with the current Linux behaviour.
> So document it.

This is all I'm expecting to see. :-)

Buddy: You took the opposite stance, but still stuck to the truth. I give
you points for that.

Cheers,
mark

--
[email protected]/[email protected]/[email protected] __________________________
. . _ ._ . . .__ . . ._. .__ . . . .__ | Neighbourhood Coder
|\/| |_| |_| |/ |_ |\/| | |_ | |/ |_ |
| | | | | \ | \ |__ . | | .|. |__ |__ | \ |__ | Ottawa, Ontario, Canada

One ring to rule them all, one ring to find them, one ring to bring them all
and in the darkness bind them...

http://mark.mielke.cc/

2004-10-17 18:21:08

by Buddy Lucas

[permalink] [raw]
Subject: Re: UDP recvmsg blocks after select(), 2.6 bug?

On Sun, 17 Oct 2004 20:06:29 +0200, Lars Marowsky-Bree <[email protected]> wrote:
> On 2004-10-17T20:04:21, Buddy Lucas <[email protected]> wrote:
>
> > [ snip ]
> >
> > Also note the examples that Stevens gives. For instance, he explicitly
> > checks for EWOULDBLOCK after a read on a nonblocking fd that has been
> > reported readable by select().
>
> The specs don't disagree with that. On a O_NONBLOCK socket, that is
> allowed.

I think the specs got to you, man!


Cheers,
Buddy

2004-10-17 18:53:52

by David Schwartz

[permalink] [raw]
Subject: RE: UDP recvmsg blocks after select(), 2.6 bug?


> It is perfectly possible to not have a million things happen between
> select() and recvmsg() and POSIX defines what can happen and what
> can't; it states that a process calling select() on a socket will
> not block
> on a subsequent recvmsg() on that socket.

I'm sorry, that's an absolutely preposterous view. For one thing, Linux
violates this by allowing processes and threads to share file descriptors
(since another process can steal the data before the call to 'recvmsg'). Oh
well, I guess we'll have to take that out if we want to comply with POSIX on
'select' semantics.

> The way select() is defined in POSIX effectively means that once an
> application has done a select() on a socket, the data that caused
> select() to return is committed, i.e. it can no longer be dropped and
> should be considered received by the application; this has nothing
> to do with UDP being unreliable and being unreliable for the sake
> of it is not what UDP was meant for.

Again, I think this is an absurd reading of the standard. No other status
function provides a future guarantee. And it's semantically ugly to have
'select' change the status of network data when it's purely intended to be a
'get status' function.

> Whether you think an application that is written to use select() as
> defined in POSIX is broken is not really important. The fact remains
> that Linux currently implements a select() that is _not_ POSIX
> compliant and is so solely for performance reasons. I personally think
> correct behaviour is much more important.

This is only because you interpret the standard as providing a future
guarantee that it is literally impossible for any modern operating system to
provide. I certainly don't interpret the standard that way. Look up the word
'would' in the dictionary.

Linux does in fact make the decision to discard the data *after* the call
to 'select'. This is not in any way different from another process that
shared the file descriptor consuming the data after the call to 'select'.

DS


2004-10-17 18:59:00

by Martijn Sipkema

[permalink] [raw]
Subject: Re: UDP recvmsg blocks after select(), 2.6 bug?

From: "Buddy Lucas" <[email protected]>
> On Sun, 17 Oct 2004 18:35:34 +0100, Martijn Sipkema <[email protected]> wrote:
> > From: "Buddy Lucas" <[email protected]>
> > > On Sun, 17 Oct 2004 11:05:09 -0400, Mark Mielke <[email protected]> wrote:
> > > > On Sun, Oct 17, 2004 at 04:17:06PM +0200, Buddy Lucas wrote:
> > > > > On Sun, 17 Oct 2004 15:35:37 +0200, Lars Marowsky-Bree <[email protected]> wrote:
> > > > > > The SuV spec is actually quite detailed about the options here:
> > > > > > A descriptor shall be considered ready for reading when a call
> > > > > > to an input function with O_NONBLOCK clear would not block,
> > > > > > whether or not the function would transfer data successfully.
> > > > > > (The function might return data, an end-of-file indication, or
> > > > > > an error other than one indicating that it is blocked, and in
> > > > > > each of these cases the descriptor shall be considered ready for
> > > > > > reading.)
> > > > > But it says nowhere that the select()/recvmsg() operation is atomic, right?
> > > >
> > > > This is a distraction. If the call to select() had been substituted
> > > > with a call to recvmsg(), it would have blocked. Instead, select() is
> > > > returning 'yes, you can read', and then recvmsg() is blocking. The
> > > > select() lied. The information is all sitting in the kernel packet
> > >
> > > No. A million things might happen between select() and recvmsg(), both
> > > in kernel and application. For a consistent behaviour throughout all
> > > possibilities, you *have* to assume that any read on a blocking fd may
> > > block, and that a fd ready for reading at select() time might not be
> > > readable once the app gets to recvmsg() -- for whatever reason.
> >
> > It is perfectly possible to not have a million things happen between
> > select() and recvmsg() and POSIX defines what can happen and what
> > can't; it states that a process calling select() on a socket will not block
> > on a subsequent recvmsg() on that socket.
> >
> > > And indeed, that implies that select() on blocking fds is generally
> > > not useful if you expect to bypass the blocking through select().
> > > Personally, I think any application that implements this expectation
> > > is broken. (If only because you might have to do a second read() or
> > > recvmsg() which will either result in a crappy select() loop or a
> > > broken read()/recvmsg() loop).
> >
> > The way select() is defined in POSIX effectively means that once an
> > application has done a select() on a socket, the data that caused
> > select() to return is committed, i.e. it can no longer be dropped and
> > should be considered received by the application; this has nothing
>
> That is plainly wrong. Data is never received by an application before
> recvmsg() has succeeded.

I didn't say it was, but that from the view of the UDP protocol it is, i.e.
a UDP packet can not be dropped from that point onwards.

> > to do with UDP being unreliable and being unreliable for the sake
> > of it is not what UDP was meant for.
> >
> > Whether you think an application that is written to use select() as
> > defined in POSIX is broken is not really important. The fact remains
> > that Linux currently implements a select() that is _not_ POSIX
> > compliant and is so solely for performance reasons. I personally think
> > correct behaviour is much more important.
>
> All I'm saying is, that applications that are not correct now, will
> probably not be correct even if we change the way Linux handles this
> situation. The sanest thing really seems to accept the fact that any
> read() on a blocking fd might block, even if the programmer thinks it
> really shouldn't.
>
> But then I am one of those who thinks it's sane to check for
> EWOULDBLOCK on a nonblocking socket after blocking in select().

A POSIX comliant implementation would never do this.

> Let's just document this and move on to something more important.

It actually _is_ important. Just implement select() and recvmsg() as
described in the standard.


--ms

2004-10-17 19:04:39

by Martijn Sipkema

[permalink] [raw]
Subject: Re: UDP recvmsg blocks after select(), 2.6 bug?

From: "Lars Marowsky-Bree" <[email protected]>
>> [ snip ]
>>
>> Also note the examples that Stevens gives. For instance, he explicitly
>> checks for EWOULDBLOCK after a read on a nonblocking fd that has been
>> reported readable by select().
>
> The specs don't disagree with that. On a O_NONBLOCK socket, that is
> allowed.

No, it isn't. select() may not behave differently based on the O_NONBLOCK
flag at the moment of the select() call. And if a call to recvmsg() with O_NONBLOCK
cleared doesn't block and since it can't return EAGAIN, then I don't think a recvmsg()
call with O_NONBLOCK set should return EAGAIN where something like
EIO should have been returned otherwise.


--ms

2004-10-17 19:21:45

by Hua Zhong (hzhong)

[permalink] [raw]
Subject: RE: UDP recvmsg blocks after select(), 2.6 bug?

> > This is a distraction. If the call to select() had been substituted
> > with a call to recvmsg(), it would have blocked. Instead,
> select() is
> > returning 'yes, you can read', and then recvmsg() is blocking. The
> > select() lied. The information is all sitting in the kernel packet
>
> No. A million things might happen between select() and recvmsg(), both
> in kernel and application. For a consistent behaviour throughout all
> possibilities, you *have* to assume that any read on a blocking fd may
> block.

Care to provide a real example?

UDP isn't one. It was done for performance reasons as David admitted and it
could very well be done otherwise: do the checksum before select returns.

David has admitted the only reason Linux chose to do so is performance.

It might be the case that a million things might happen between select and
recvmsg, but none of them, as I can see, *have* to force Linux to work this
way. The only reason as I can see is performance and imlementation
convenience.

Hua

2004-10-17 19:26:46

by Hua Zhong (hzhong)

[permalink] [raw]
Subject: RE: UDP recvmsg blocks after select(), 2.6 bug?

> I'm sorry, that's an absolutely preposterous view. For
> one thing, Linux violates this by allowing processes and
> threads to share file descriptors (since another process
> can steal the data before the call to 'recvmsg'). Oh
> well, I guess we'll have to take that out if we want to
> comply with POSIX on 'select' semantics.

Another typical fake argument. Do not mix kernel and user space problems.

The standard says the [following] recvmsg would not block, not the following
recvmsg *from the same* thread would not block.

Hua


2004-10-17 19:32:20

by Martijn Sipkema

[permalink] [raw]
Subject: Re: UDP recvmsg blocks after select(), 2.6 bug?

From: "David Schwartz" <[email protected]>
> > It is perfectly possible to not have a million things happen between
> > select() and recvmsg() and POSIX defines what can happen and what
> > can't; it states that a process calling select() on a socket will
> > not block
> > on a subsequent recvmsg() on that socket.
>
> I'm sorry, that's an absolutely preposterous view. For one thing, Linux
> violates this by allowing processes and threads to share file descriptors
> (since another process can steal the data before the call to 'recvmsg'). Oh
> well, I guess we'll have to take that out if we want to comply with POSIX on
> 'select' semantics.

I don't think this is comparable, see below.

> > The way select() is defined in POSIX effectively means that once an
> > application has done a select() on a socket, the data that caused
> > select() to return is committed, i.e. it can no longer be dropped and
> > should be considered received by the application; this has nothing
> > to do with UDP being unreliable and being unreliable for the sake
> > of it is not what UDP was meant for.
>
> Again, I think this is an absurd reading of the standard. No other status
> function provides a future guarantee. And it's semantically ugly to have
> 'select' change the status of network data when it's purely intended to be a
> 'get status' function.

It not about a future guarantee; the information as to whether the data
is corrupt or not is available at the time when select() is called and POSIX
requires it to be.

You _chose_ to implement your select() in a way that is not POSIX
compliant.

> > Whether you think an application that is written to use select() as
> > defined in POSIX is broken is not really important. The fact remains
> > that Linux currently implements a select() that is _not_ POSIX
> > compliant and is so solely for performance reasons. I personally think
> > correct behaviour is much more important.
>
> This is only because you interpret the standard as providing a future
> guarantee that it is literally impossible for any modern operating system to
> provide.

Hardly so.

> I certainly don't interpret the standard that way. Look up the word
> 'would' in the dictionary.

would is meant here as "if you were to call recvmsg(), then" and not as
"may or may not..".

Your interpretation of select() is that it merely provides a hint that the
socket may be ready; that may be convenient for you, but it is not what
POSIX describes.

> Linux does in fact make the decision to discard the data *after* the call
> to 'select'. This is not in any way different from another process that
> shared the file descriptor consuming the data after the call to 'select'.

I think it is different. The first recvmsg() from one of these processes
doesn't block; it is the recvmsg() on that file descriptor that select()
guarantees will not block.


--ms


2004-10-17 19:34:57

by Buddy Lucas

[permalink] [raw]
Subject: Re: UDP recvmsg blocks after select(), 2.6 bug?

On Sun, 17 Oct 2004 20:58:39 +0100, Martijn Sipkema <[email protected]> wrote:
> >
> > But then I am one of those who thinks it's sane to check for
> > EWOULDBLOCK on a nonblocking socket after blocking in select().
>
> A POSIX comliant implementation would never do this.

Here's your own quote, from a couple of hundred mails ago:

> According to POSIX:

> A descriptor shall be considered ready for reading when a call to an
> input function with O_NONBLOCK clear would not block, whether or not
> the function would transfer data successfully.

You concluded from this that, if select() says a descriptor is
readable, the subsequent recvmsg() must not block. The point is, from
your quote I cannot deduct anything but: a recvmsg() on a descriptor
that is readable must not block -- which makes perfect sense.

But unless POSIX also says something about the conservability of
"readability" of descriptors, specifically in between select() and
recvmsg(), your conclusion is just wrong.

> > Let's just document this and move on to something more important.
>
> It actually _is_ important. Just implement select() and recvmsg() as
> described in the standard.

I am very glad Linux makes sane decisions while trying to adhere to
the standards as much as possible.


Cheers,
Buddy

2004-10-17 19:43:07

by Martijn Sipkema

[permalink] [raw]
Subject: Re: UDP recvmsg blocks after select(), 2.6 bug?

From: "Buddy Lucas" <[email protected]>
> On Sun, 17 Oct 2004 20:58:39 +0100, Martijn Sipkema <[email protected]> wrote:
> > >
> > > But then I am one of those who thinks it's sane to check for
> > > EWOULDBLOCK on a nonblocking socket after blocking in select().
> >
> > A POSIX comliant implementation would never do this.
>
> Here's your own quote, from a couple of hundred mails ago:
>
> > According to POSIX:
>
> > A descriptor shall be considered ready for reading when a call to an
> > input function with O_NONBLOCK clear would not block, whether or not
> > the function would transfer data successfully.
>
> You concluded from this that, if select() says a descriptor is
> readable, the subsequent recvmsg() must not block. The point is, from
> your quote I cannot deduct anything but: a recvmsg() on a descriptor
> that is readable must not block -- which makes perfect sense.
>
> But unless POSIX also says something about the conservability of
> "readability" of descriptors, specifically in between select() and
> recvmsg(), your conclusion is just wrong.

But with the current Linux implementation it is possible that a call to select()
returns while a call to recvmsg() would have blocked, and that is not
correct according to POSIX.

And I indeed read this as meaning that the first recvmsg() after the select()
may not block.

> > > Let's just document this and move on to something more important.
> >
> > It actually _is_ important. Just implement select() and recvmsg() as
> > described in the standard.
>
> I am very glad Linux makes sane decisions while trying to adhere to
> the standards as much as possible.

I don't think this is ``trying to adhere to the standards''... not by a long shot.


--ms

2004-10-17 20:02:50

by Buddy Lucas

[permalink] [raw]
Subject: Re: UDP recvmsg blocks after select(), 2.6 bug?

On Sun, 17 Oct 2004 21:42:04 +0100, Martijn Sipkema <[email protected]> wrote:
> From: "Buddy Lucas" <[email protected]>
> > On Sun, 17 Oct 2004 20:58:39 +0100, Martijn Sipkema <[email protected]> wrote:
> > > >
> > > > But then I am one of those who thinks it's sane to check for
> > > > EWOULDBLOCK on a nonblocking socket after blocking in select().
> > >
> > > A POSIX comliant implementation would never do this.
> >
> > Here's your own quote, from a couple of hundred mails ago:
> >
> > > According to POSIX:
> >
> > > A descriptor shall be considered ready for reading when a call to an
> > > input function with O_NONBLOCK clear would not block, whether or not
> > > the function would transfer data successfully.
> >
> > You concluded from this that, if select() says a descriptor is
> > readable, the subsequent recvmsg() must not block. The point is, from
> > your quote I cannot deduct anything but: a recvmsg() on a descriptor
> > that is readable must not block -- which makes perfect sense.
> >
> > But unless POSIX also says something about the conservability of
> > "readability" of descriptors, specifically in between select() and
> > recvmsg(), your conclusion is just wrong.
>
> But with the current Linux implementation it is possible that a call to select()
> returns while a call to recvmsg() would have blocked, and that is not
> correct according to POSIX.

Come on. Those are different things. (If POSIX says this explicitly,
it's even more broken than I thought.)

> And I indeed read this as meaning that the first recvmsg() after the select()
> may not block.

Well, then you *are* wrong. ;-)


Cheers,
Buddy

2004-10-17 20:11:37

by Lars Marowsky-Bree

[permalink] [raw]
Subject: Re: UDP recvmsg blocks after select(), 2.6 bug?

On 2004-10-17T21:33:27, Buddy Lucas <[email protected]> wrote:

> You concluded from this that, if select() says a descriptor is
> readable, the subsequent recvmsg() must not block. The point is, from
> your quote I cannot deduct anything but: a recvmsg() on a descriptor
> that is readable must not block -- which makes perfect sense.
>
> But unless POSIX also says something about the conservability of
> "readability" of descriptors, specifically in between select() and
> recvmsg(), your conclusion is just wrong.

What kind of idiotic (and most of all, wrong) hairsplitting are you
doing here, for heaven's sake? That's obviously exactly what the
standard implies.


Sincerely,
Lars Marowsky-Br?e <[email protected]>

--
High Availability & Clustering
SUSE Labs, Research and Development
SUSE LINUX AG - A Novell company

2004-10-17 20:11:37

by Lars Marowsky-Bree

[permalink] [raw]
Subject: Re: UDP recvmsg blocks after select(), 2.6 bug?

On 2004-10-17T21:04:40, Martijn Sipkema <[email protected]> wrote:

> > The specs don't disagree with that. On a O_NONBLOCK socket, that is
> > allowed.
> No, it isn't. select() may not behave differently based on the O_NONBLOCK
> flag at the moment of the select() call.

Yes it may. Though this is getting nitpicking; please re-read:

"A descriptor shall be considered ready for reading when a call to an
input function with O_NONBLOCK clear would not block, whether or not the
function would transfer data successfully. (The function might return
data, an end-of-file indication, or an error other than one indicating
that it is blocked, and in each of these cases the descriptor shall be
considered ready for reading.)"

No claim is made for O_NONBLOCK set; so in that case we can do something
sane.

> And if a call to recvmsg() with O_NONBLOCK cleared doesn't block and
> since it can't return EAGAIN, then I don't think a recvmsg() call with
> O_NONBLOCK set should return EAGAIN where something like EIO should
> have been returned otherwise.

Point taken. For UDP, returning EIO in both cases is just fine (it's
actually also correct, for a checksum error constitutes a "network read
error"...). At least according to the spec.


Sincerely,
Lars Marowsky-Br?e <[email protected]>

--
High Availability & Clustering
SUSE Labs, Research and Development
SUSE LINUX AG - A Novell company

2004-10-17 20:25:51

by Buddy Lucas

[permalink] [raw]
Subject: Re: UDP recvmsg blocks after select(), 2.6 bug?

On Sun, 17 Oct 2004 22:11:18 +0200, Lars Marowsky-Bree <[email protected]> wrote:
> On 2004-10-17T21:33:27, Buddy Lucas <[email protected]> wrote:
>
> > You concluded from this that, if select() says a descriptor is
> > readable, the subsequent recvmsg() must not block. The point is, from
> > your quote I cannot deduct anything but: a recvmsg() on a descriptor
> > that is readable must not block -- which makes perfect sense.
> >
> > But unless POSIX also says something about the conservability of
> > "readability" of descriptors, specifically in between select() and
> > recvmsg(), your conclusion is just wrong.
>
> What kind of idiotic (and most of all, wrong) hairsplitting are you
> doing here, for heaven's sake? That's obviously exactly what the
> standard implies.

Take this discussion off-list please.

2004-10-18 22:25:26

by Robert White

[permalink] [raw]
Subject: RE: UDP recvmsg blocks after select(), 2.6 bug?

Sorry, I was thinking in the generic case of "a protocol read() after select()" and
not specifically a UDP read() after select(); as any semantic chosen needs to be
generic. That may be out of scope for your considerations.

If you allow for zero-length packets on the transport, then you are introducing a
semantic entity to silently displace a procedural entity [e.g. if your messaging
scheme allows for zero-length packets and you have the kernel "fake" a zero-length
packet, then the kernel is triggering semantics instead of and error recognition
process; if zero-length packets are impossible, then the kernel is creating a "new"
return condition with unforeseen consequences in all existing code.]

If your process is *not* prepared to deal with a zero-length return from a receive
message, then you will get a semantic error. [e.g. you "know" that all the packets
you receive have a certain structure, but here you are "receiving" a non-error even
that is outside of your semantic set and so not allowed-for in your existing state.
Etc.]

For every other file handle zero-length read is end of file. So there is this "well
established" semantic meaning for "if ( 0 == read(fd,...))" and you are proposing to
non-trivially create a one-off for the specific case of fd==UDP-socket. So now if
you pass this file handle through a generic mechanism then you break the generic
semantics by creating a "different class of files" where a state problem leads to the
generation of an "in-band, originless, valid receive event" that is completely
dissimilar to the expected meaning of the return value from a standard function call.

Basically, if it is possible to send and receive a zero-length message in a
connectionless protocol, you are _stealing_ the possible semantic meaning of that
message and retroactively claiming it as a signal from the kernel to the program. IF
it isn't already possible to send and receive a zero-length message in that
connectionless transport, then you are adding a semantic that all the existing code
may be completely unable to interpret, or which may "trick" applications into
deciding they are getting the end-of-file condition because they don't know or care
that the transport in question is UDP.

So if you have a generic handling mechanism, centered on select(), that "knows" that
if it sees 0==read(...) then it should close the file handle, and if that mechanism
is given sockets conforming to this proposed modification, then that mechanism will
break.

[I *am* out of my depth about whether UDP allows zero-length messages, it has never
come up for me, but I don't think it matters. If it isn't UDP legal, then you are
adding semantic. If it is legal, then you are overloading known semantic. Both
actions are surprising, so both are wrong.]

So returning zero from a read function on a file descriptor that "can not"
meaningfully know end-of-file (because it doesn't FIN etc) is still a very bad idea
because of the odd-out cases where it will have "impossible" or at least wildly
incorrect semantic consequences.

Not a good space to be mucking around in.

But that's just my opinion. And I am now rambling... 8-)

Rob White.


-----Original Message-----
From: Willy Tarreau [mailto:[email protected]]
Sent: Saturday, October 16, 2004 3:25 AM
To: Robert White
Cc: 'David S. Miller'; 'Olivier Galibert'; [email protected]
Subject: Re: UDP recvmsg blocks after select(), 2.6 bug?

On Fri, Oct 15, 2004 at 03:42:55PM -0700, Robert White wrote:
> -----Original Message-----
> From: [email protected]
[mailto:[email protected]]
> On Behalf Of Willy Tarreau
>
> > As I asked in a previous mail in this overly long thread, why not returning
> > zero bytes at all. It is perfectly valid to receive an UDP packet with 0
>
> Zero bytes is "end of file". Don't go trying to co-opt end of file. That way lies
> madness and despair.

Please explain me what "end of file" means with UDP. If your UDP-based app
expects to receive a zero when the other end stops transmitting, then it
might wait for a very long time. As opposed to TCP, there's no FIN control
flag to tell the remote host that you sent your last packet.

Willy

2004-10-19 01:26:26

by John Pearson

[permalink] [raw]
Subject: Re: UDP recvmsg blocks after select(), 2.6 bug?


This has probably been said before, but just in case I missed
it; there are many viewpoints in this thread, but the two main camps
appear to be arguing at cross-purposes.

As far as I can see:

- YES, Linux select 'lies' and violates POSIX wrt checksums:
a call to recvmsg() might well have blocked when select()
said data was ready, as a result of a currupt UDP packet;

- NO, 'fixing' select() won't guarantee that recvmsg()
will not block/return EAGAIN, because select() only
guarantees that a call to recvmsg() would not have blocked
at that time - as others have observed, it cannot guarantee
that 'valid' data won't subsequently be discarded; any
subsequent call to recvmsg() is only 'immediate' in a fuzzy,
imprecise and inadequate sense.

Can we get back to arguing about something less repetitive
(or at least, make the circle larger and more scenic)?



John.


On Sun, Oct 17, 2004 at 07:22:44PM +0200, Lars Marowsky-Bree wrote
> On 2004-10-17T16:17:06, Buddy Lucas <[email protected]> wrote:
>
> > > The SuV spec is actually quite detailed about the options here:
> > >
> > > A descriptor shall be considered ready for reading when a call
> > > to an input function with O_NONBLOCK clear would not block,
> > > whether or not the function would transfer data successfully.
> > > (The function might return data, an end-of-file indication, or
> > > an error other than one indicating that it is blocked, and in
> > > each of these cases the descriptor shall be considered ready for
> > > reading.)
> > But it says nowhere that the select()/recvmsg() operation is atomic, right?
>
> See, Buddy, the point here is that Linux _does_ violate the
> specification. You can try weaseling out of it, but it's not going to
> work.
>
> This isn't per se the same as saying that it's not a sensible violation,
> but very clearly the specs disagree with the current Linux behaviour.
>
> It's impossible to claim that you are allowed by the spec to block on a
> recvmsg directly following a successful select. You are not. You could
> claim that, but you'd be wrong.
>
> If the packet has been dropped in between, which _could_ have happened
> because UDP is allowed to be dropped basically anywhere, EIO may be
> returned. But blocking or returning EAGAIN/EWOULDBLOCK is verboten. The
> spec is very clearly on that.
>
> (Now I'd claim that returning EIO after a succesful select is also
> slightly suboptimal - the performance optimizations should be turned off
> for blocking sockets, IMHO, and the data which caused the select() to
> return should be considered comitted - but it would be allowed.)
>
> I'm not so sure what's so hard to accept about that. It may be well that
> Linux is following the de-facto industry standard (or even setting it)
> here, and I'd agree that if you don't want blocking use O_NONBLOCK, but
> in no way can Linux claim POSIX/SuV spec compliance for this behaviour.
>
> I'm not getting why people argue so much to try and weasel the words so
> that it comes out as compliant. It's not. It may make sense due to
> practical reasons, but it's not compliant.
>
>
> Sincerely,
> Lars Marowsky-Br?e <[email protected]>
>
> --
> High Availability & Clustering
> SUSE Labs, Research and Development
> SUSE LINUX AG - A Novell company
>
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
>
>
> --__--__--

--
Voice: +61 8 8202 9040
Email: [email protected]

Pracom Ltd
288 Glen Osmond Road
Fullarton, South Australia 5063

Ph: + 61 8 82029000
Fax: +61 8 82029001

CAUTION: This email and any attachments may contain information that is
confidential and subject to copyright. If you are not the
intended recipient, you must not read, use, disseminate, distribute or
copy this email or any attachments. If you have received this
email in error, please notify the sender immediately by reply email and
erase this email and any attachments.

DISCLAIMER: Pracom uses virus-scanning technology but accepts no
responsibility for loss or damage arising from the use of the
information transmitted by this email including damage from virus.

2004-10-19 13:50:19

by Colin Phipps

[permalink] [raw]
Subject: Re: UDP recvmsg blocks after select(), 2.6 bug?

On Tue, Oct 19, 2004 at 10:51:03AM +0930, John Pearson wrote:
> As far as I can see:
>
> - YES, Linux select 'lies' and violates POSIX wrt checksums:
> a call to recvmsg() might well have blocked when select()
> said data was ready, as a result of a currupt UDP packet;
>
> - NO, 'fixing' select() won't guarantee that recvmsg()
> will not block/return EAGAIN, because select() only
> guarantees that a call to recvmsg() would not have blocked
> at that time - as others have observed, it cannot guarantee
> that 'valid' data won't subsequently be discarded; any
> subsequent call to recvmsg() is only 'immediate' in a fuzzy,
> imprecise and inadequate sense.
>
> Can we get back to arguing about something less repetitive
> (or at least, make the circle larger and more scenic)?

I would put a third point on the list; the behaviour makes the failure
case for a lot of broken apps much more likely, and easy to trigger
remotely.

In the interest of making things more "scenic", let's have a few more
broken apps:

glibc RPC - so portmap, statd, ...

It seems there's a common idiom in most of the broken programs.
Programmers assume that they can take a collection of library functions
that do blocking IO, and then multiplex them by sticking a select on the
front to choose when to call them. Given the wording of the POSIX
standard, it could be naively read to endorse this idiom - it says a
socket is readable if it won't block to read. glibc RPC does this; the
underlying functions all assume blocking fds, and it then sticks a
select on the front. This occurs again in inetd, again in syslog, again
in net-snmp, and those are just the ones I see on my desktop machine.
You can easily patch the kernel to have it report them all (just
remember to disable syslog first, as it is one of the culprits).

Sure they are all broken, but now they are all exposed to bad UDP
checksums. Perhaps the people who benefit from the time saved by this
micro-optimisation would care to use the time saved to fix glibc?

2004-10-20 21:33:04

by H. Peter Anvin

[permalink] [raw]
Subject: Re: UDP recvmsg blocks after select(), 2.6 bug?

Followup to: <[email protected]>
By author: Lars Marowsky-Bree <[email protected]>
In newsgroup: linux.dev.kernel
>
> This actually forbids recvmsg() to return EAGAIN and EWOULDBLOCK as
> has been suggested. EIO seems to be the best fit.
>
> But I'd have to agree that blocking on recvmsg() after select() has
> indicated ready to read does violate the specification, unless the
> socket has actually been opened with O_NONBLOCK.
>

EIO seems to be The Right Thing[TM]... it pretty much says "yes, we
received something, but it was bad." What isn't clear to me is how
applications react to EIO. It could easily be considered a fatal
error... :-/

-hpa

2004-10-20 22:00:42

by Chris Friesen

[permalink] [raw]
Subject: Re: UDP recvmsg blocks after select(), 2.6 bug?

H. Peter Anvin wrote:

> EIO seems to be The Right Thing[TM]... it pretty much says "yes, we
> received something, but it was bad." What isn't clear to me is how
> applications react to EIO. It could easily be considered a fatal
> error... :-/

From an application point of view, The Right Thing would be to do the checksum
validation at select() time if the socket is blocking.

If it's nonblocking, then just do as we do now and return EAGAIN at recvmsg() time.

This would ensure that all existing apps get the expected semantics, but the
ones based on blocking sockets would see a performance degredation.

Chris

2004-10-20 22:02:18

by H. Peter Anvin

[permalink] [raw]
Subject: Re: UDP recvmsg blocks after select(), 2.6 bug?

Chris Friesen wrote:
> H. Peter Anvin wrote:
>
>> EIO seems to be The Right Thing[TM]... it pretty much says "yes, we
>> received something, but it was bad." What isn't clear to me is how
>> applications react to EIO. It could easily be considered a fatal
>> error... :-/
>
>
> From an application point of view, The Right Thing would be to do the
> checksum validation at select() time if the socket is blocking.
>
> If it's nonblocking, then just do as we do now and return EAGAIN at
> recvmsg() time.
>
> This would ensure that all existing apps get the expected semantics, but
> the ones based on blocking sockets would see a performance degredation.
>

Doing work twice can hardly be considered The Right Thing.

-hpa

2004-10-20 22:18:37

by Chris Friesen

[permalink] [raw]
Subject: Re: UDP recvmsg blocks after select(), 2.6 bug?

H. Peter Anvin wrote:

> Doing work twice can hardly be considered The Right Thing.

If the alternative is to have apps that can be DOS'd with a single corrupt
packet, I think that yes, it is.

Going forward, all apps should use nonblocking sockets, correct? (With the
current design, because it's the only way to get correctness, with my proposal
because it's the only way to get full speed.)

Given that, we then have to worry about the installed base of binaries out there
that will use blocking sockets. And it's going to take some time before they
all convert. For all those binaries, (which include basic things like syslog,
inetd, portmap, and statd) the existing kernel behaviour does not match what the
app is expecting. With a minor change, we can give the behaviour that the app
expects, though at a performance penalty. Once the app switches over to
nonblocking sockets (which it has to do *anyway* under the current model) then
it gets full performance.

To summarize:
1) apps have to switch to nonblocking sockets (for correctness)
2) my proposal makes them work as expected in the meantime, with a performance cost


Chris

2004-10-20 23:30:19

by David Schwartz

[permalink] [raw]
Subject: RE: UDP recvmsg blocks after select(), 2.6 bug?


> 2) my proposal makes them work as expected in the meantime, with
> a performance cost
>
> Chris

Perhaps I missed the details, but under your proposal, how do you predict
at 'select' time what mode the socket will be in at 'recvmsg' time?!

DS


2004-10-21 01:05:09

by Chris Friesen

[permalink] [raw]
Subject: Re: UDP recvmsg blocks after select(), 2.6 bug?

David Schwartz wrote:

> Perhaps I missed the details, but under your proposal, how do you predict
> at 'select' time what mode the socket will be in at 'recvmsg' time?!

Well, if you've got a blocking socket, and do a nonblocking read with
MSG_DONTWAIT, everything works fine. You lose a bit of performance, but it works.

The problem case is if you create a socket, set O_NONBLOCK, do select, clear
O_NONBLOCK, then do a recvmsg().

I suspect it's not a very common thing to do, so my proposal would still help
the vast majority of existing apps.

Chris

2004-10-21 01:45:57

by David Schwartz

[permalink] [raw]
Subject: RE: UDP recvmsg blocks after select(), 2.6 bug?


> David Schwartz wrote:

> > Perhaps I missed the details, but under your proposal, how
> > do you predict
> > at 'select' time what mode the socket will be in at 'recvmsg' time?!

> Well, if you've got a blocking socket, and do a nonblocking read with
> MSG_DONTWAIT, everything works fine. You lose a bit of
> performance, but it works.
>
> The problem case is if you create a socket, set O_NONBLOCK, do
> select, clear
> O_NONBLOCK, then do a recvmsg().
>
> I suspect it's not a very common thing to do, so my proposal
> would still help
> the vast majority of existing apps.
>
> Chris

I think this is a reasonable thing to do. Applications that select in one
mode and then operate in another are rare, and the suggested change won't
break anything.

DS



2004-10-21 03:05:45

by Michael Clark

[permalink] [raw]
Subject: Re: UDP recvmsg blocks after select(), 2.6 bug?

On 10/21/04 06:00, H. Peter Anvin wrote:
> Chris Friesen wrote:
>
>> H. Peter Anvin wrote:
>>
>>> EIO seems to be The Right Thing[TM]... it pretty much says "yes, we
>>> received something, but it was bad." What isn't clear to me is how
>>> applications react to EIO. It could easily be considered a fatal
>>> error... :-/
>>
>>
>>
>> From an application point of view, The Right Thing would be to do the
>> checksum validation at select() time if the socket is blocking.
>>
>> If it's nonblocking, then just do as we do now and return EAGAIN at
>> recvmsg() time.
>>
>> This would ensure that all existing apps get the expected semantics,
>> but the ones based on blocking sockets would see a performance
>> degredation.
>>
>
> Doing work twice can hardly be considered The Right Thing.

Optimisations that break documented interfaces and age old assumptions
can hardly be considered The Right Thing :)

And you only do the checksum once (just earlier), and the copy_to_user
should be cache hot as most of these UDP apps will call recvmesg right
after the select.

That said, an app with many connected sockets will have a high chance
of losing the cache. Although a handful of unconnected UDP sockets
(one per interface) are more common in the use case of a large number
of clients, so in general the performance difference should be minor.

Doing the same amount work (with chance of lower performance because of
cache loss) is good IMHO if it means the choice of a reliable vs an
unreliable interface. You can only take the performance optimisation
argument so far and when these optimisations start breaking interfaces,
i think that's too far ie. what to give up efficiency vs. correctness?

Just my 2c in favour of !O_NONBLOCK early UDP checksum test in select.

~mc

2004-10-21 04:03:34

by Michael Clark

[permalink] [raw]
Subject: Re: UDP recvmsg blocks after select(), 2.6 bug?

Hi All,

I'm actually trying to write a test case to demonstrate
the problem in a repeatable way.

I first looked at socket(PF_INET, SOCK_RAW) to inject the
packets but it appears I have no control over the IP
checksum, and an invalid UDP sum didn't cause any problems
(the packet was accepted fine).

So i've hacked up something with socket(PF_PACKET, SOCK_RAW)
but the problem i'm getting (although tcpdumps of the packets
look perfect), they are not being accepted by my UDP listening
socket. It uses real interfaces, not loopback for the raw
packet injection as the checksumming appears to be bypassed
on the loopback interface (I always see bad UDP checksum
on normally originated packets on the loopback interface).

Anyway, i've probably done something dumb. Anyone care to look
at my test code? The current code sends the packets out on all
interfaces (using the correct associated IP and MAC). Can change
the #if in main() to make it use normal UDP not my cooked packets
(which currently have correct checksums).

~mc


Attachments:
udptest.c (7.02 kB)

2004-10-21 04:15:22

by H. Peter Anvin

[permalink] [raw]
Subject: Re: UDP recvmsg blocks after select(), 2.6 bug?

Michael Clark wrote:
>>
>> Doing work twice can hardly be considered The Right Thing.
>
> Optimisations that break documented interfaces and age old assumptions
> can hardly be considered The Right Thing :)

The whole point is that it doesn't break the *documented* interface.

-hpa

2004-10-21 05:08:56

by Chris Friesen

[permalink] [raw]
Subject: Re: UDP recvmsg blocks after select(), 2.6 bug?

H. Peter Anvin wrote:

> The whole point is that it doesn't break the *documented* interface.

In my view (and apparently others, as has been verified in current apps using
blocking sockets), current behaviour *does* break the documented interface.

The man page for select says:

"Those listed in readfds will be watched to see if characters become
available for reading (more precisely, to see if a read will not block..."

If I'm the only one touching the socket, select returns with it readable, and I
block when calling recvmsg, then by definition that behaviour does not match the
documented interface.

Chris

2004-10-21 05:18:50

by H. Peter Anvin

[permalink] [raw]
Subject: Re: UDP recvmsg blocks after select(), 2.6 bug?

Chris Friesen wrote:
> H. Peter Anvin wrote:
>
>> The whole point is that it doesn't break the *documented* interface.
>
>
> In my view (and apparently others, as has been verified in current apps
> using blocking sockets), current behaviour *does* break the documented
> interface.
>
> The man page for select says:
>
> "Those listed in readfds will be watched to see if characters
> become available for reading (more precisely, to see if a read will not
> block..."
>
> If I'm the only one touching the socket, select returns with it
> readable, and I block when calling recvmsg, then by definition that
> behaviour does not match the documented interface.
>

I'm talking about returning -1, EIO.

-hpa

2004-10-21 05:54:15

by Chris Friesen

[permalink] [raw]
Subject: Re: UDP recvmsg blocks after select(), 2.6 bug?

H. Peter Anvin wrote:
>> H. Peter Anvin wrote:
>>
>>> The whole point is that it doesn't break the *documented* interface.

> I'm talking about returning -1, EIO.


Ah. By "it", I thought you meant the current performance optimizations, not the
EIO. My apologies.

I think returning EIO is suboptimal, as it is not an expected error value for
recvmsg(). (It's not listed in the man pages for recvmsg() or ip.) It would
certainly work for new apps, but probably not for many existing binaries.

On the other hand, if you simply do the checksum verification at select() time
for blocking sockets, then the existing binaries get exactly the behaviour they
expect.

Chris

2004-10-21 06:02:08

by H. Peter Anvin

[permalink] [raw]
Subject: Re: UDP recvmsg blocks after select(), 2.6 bug?

Chris Friesen wrote:
> H. Peter Anvin wrote:
>
>>> H. Peter Anvin wrote:
>>>
>>>> The whole point is that it doesn't break the *documented* interface.
>
>
>> I'm talking about returning -1, EIO.
>
>
>
> Ah. By "it", I thought you meant the current performance optimizations,
> not the EIO. My apologies.
>
> I think returning EIO is suboptimal, as it is not an expected error
> value for recvmsg(). (It's not listed in the man pages for recvmsg() or
> ip.) It would certainly work for new apps, but probably not for many
> existing binaries.

POSIX specifies:

The recvmsg( ) function shall fail if:

[EAGAIN] or [EWOULDBLOCK] The socket's file descriptor is marked
O_NONBLOCK and no data is waiting to be received; or MSG_OOB is set and
no out-of-band data is available and either the socket s file descriptor
is marked O_NONBLOCK or the socket does not support blocking to await
out-of-band data.

[EBADF] The socket argument is not a valid open file descriptor.

[ECONNRESET] A connection was forcibly closed by a peer.

[EINTR] This function was interrupted by a signal before any data was
available.

[EINVAL] The sum of the iov_len values overflows a ssize_t, or the
MSG_OOB flag is set 37371 and no out-of-band data is available.

[EMSGSIZE] The msg_iovlen member of the msghdr structure pointed to by
message is less 37373 than or equal to 0, or is greater than {IOV_MAX}.

[ENOTCONN] A receive is attempted on a connection-mode socket that is
not connected.

[ENOTSOCK] The socket argument does not refer to a socket.

[EOPNOTSUPP] The specified flags are not supported for this socket type.

[ETIMEDOUT] The connection timed out during connection establishment, or
due to a transmission timeout on active connection.

The recvmsg( ) function may fail if:

[EIO] An I/O error occurred while reading from or writing to the file
system.

[ENOBUFS] Insufficient resources were available in the system to perform
the operation.

[ENOMEM] Insufficient memory was available to fulfill the request.

Since you didn't code to Linux, and didn't code to POSIX... what did you
code to?

> On the other hand, if you simply do the checksum verification at
> select() time for blocking sockets, then the existing binaries get
> exactly the behaviour they expect.
>

... unless the blocking changes. In which case you either have to do
work twice, or it might *never* happen. Not to mention the extra code
complexity.

The performance overhead of checksumming is substantial; I have seen
some real horror examples of what happens when you do it badly.

-hpa

2004-10-21 06:21:44

by Michael Clark

[permalink] [raw]
Subject: Re: UDP recvmsg blocks after select(), 2.6 bug?

On 10/21/04 13:50, Chris Friesen wrote:
> H. Peter Anvin wrote:
>
>>> H. Peter Anvin wrote:
>>>
>>>> The whole point is that it doesn't break the *documented* interface.
>
>
>> I'm talking about returning -1, EIO.
>
>
>
> Ah. By "it", I thought you meant the current performance optimizations,
> not the EIO. My apologies.

Same.

> I think returning EIO is suboptimal, as it is not an expected error
> value for recvmsg(). (It's not listed in the man pages for recvmsg() or
> ip.) It would certainly work for new apps, but probably not for many
> existing binaries.

Yes, big likelyhood of breaking apps although does give you the
deterministic behaviour of recvmsg not blocking after select.

> On the other hand, if you simply do the checksum verification at
> select() time for blocking sockets, then the existing binaries get
> exactly the behaviour they expect.

This would be the best from existing usersapce apps, although i'd much
rather have EIO returned than the current behaviour if that was that
was the only choice (although the offshoot would be the need for patches
to quite a few apps).

~mc

2004-10-21 15:22:26

by Chris Friesen

[permalink] [raw]
Subject: Re: UDP recvmsg blocks after select(), 2.6 bug?

H. Peter Anvin wrote:

> POSIX specifies:

<useful stuff snipped>

> The recvmsg( ) function may fail if:
>
> [EIO] An I/O error occurred while reading from or writing to the file
> system.

<snipped>

> Since you didn't code to Linux, and didn't code to POSIX... what did you
> code to?

I didn't code it--my code generally uses nonblocking sockets or doesn't use
select at all. I'm just commenting on existing apps.

What do you mean by "didn't code to Linux"? The Linux man pages for recvmsg()
and ip do not mention EIO. Hence, I suspect that not many people coding for
Linux will have handled it. Furthermore, the Linux man page for select() says
that a socket that is returned as readable will not block on a subsequent read.

>> On the other hand, if you simply do the checksum verification at
>> select() time for blocking sockets, then the existing binaries get
>> exactly the behaviour they expect.

> ... unless the blocking changes. In which case you either have to do
> work twice, or it might *never* happen. Not to mention the extra code
> complexity.

If you verify the checksum at select time, you could just flag the packet as
verified. Then even if you do a recvmsg() with MSG_DONTWAIT, you wouldn't have
to verify it again. It means an extra pass over the data compared to a full-on
O_NONBLOCK socket, but it's still correct.

If you change from nonblocking to blocking between select() and recvmsg(), then
you have a problem, but you're still no worse off than the current situation.

The extra complexity is a valid point, but I suggest that the expectations of
the installed base are more important.


> The performance overhead of checksumming is substantial; I have seen
> some real horror examples of what happens when you do it badly.

Again, this is only a backwards compatibility thing. All new apps should use
nonblocking sockets anyways, right? So this way, old apps don't suffer from
single-packet-DOS attacks, at the cost of a performance penalty.

Chris