2000-11-02 22:12:04

by Paul Marquis

[permalink] [raw]
Subject: select() bug

I've uncovered a bug in select() when checking if it's okay to write
on a pipe. It will report "false negatives" if there is any unread
data already on the pipe, even a single byte. As soon as the pipe
gets flushed, select() does the right thing.

Normally, I wouldn't think this such a big deal, except that Apache
uses the writability of s pipe to determine if any of its children
that are log file handlers are dead. If select() reports it can't
write immediately, Apache terminates and restarts the child process,
creating unnecessary load on the system.

The bug exists in the 2.2.x kernels up to and including 2.2.17 that
I've tried and I don't know it extends beyond pipes. I've attached
sample code to demonstrate the problem, which works correctly on BSD
and Solaris.

Is this a know bug?

--
Paul Marquis
[email protected]

If it's tourist season, why can't we shoot them?


Attachments:
simple_pipe_test.c (2.81 kB)

2000-11-02 22:28:26

by Alan

[permalink] [raw]
Subject: Re: select() bug

> that are log file handlers are dead. If select() reports it can't
> write immediately, Apache terminates and restarts the child process,
> creating unnecessary load on the system.

Is there anything saying that select has to report ready the instant a byte
would fit. Certainly its better for performance to reduce the context switch
rate by encouraging blocking


2000-11-02 22:43:17

by Richard B. Johnson

[permalink] [raw]
Subject: Re: select() bug

On Thu, 2 Nov 2000, Alan Cox wrote:

> > that are log file handlers are dead. If select() reports it can't
> > write immediately, Apache terminates and restarts the child process,
> > creating unnecessary load on the system.
>
> Is there anything saying that select has to report ready the instant a byte
> would fit. Certainly its better for performance to reduce the context switch
> rate by encouraging blocking
>

It looks as though, when select() is reporting that the pipe is
NOT writable, the code writes anyway -- then complains that it
could be written.

This is a code bug. The pipe can certainly become writable during
the time interval between when it was checked by select() and the
attempt to write the byte.


Cheers,
Dick Johnson

Penguin : Linux version 2.2.17 on an i686 machine (801.18 BogoMips).

"Memory is like gasoline. You use it up when you are running. Of
course you get it all back when you reboot..."; Actual explanation
obtained from the Micro$oft help desk.


2000-11-02 22:54:18

by Paul Marquis

[permalink] [raw]
Subject: Re: select() bug

I guess in theory, you're right, though if a write() could succeed,
shouldn't select() say that it would?

And this assumes you're calling select() with a timeout. In Apache,
the caretaker process wakes up periodically and polls the pipe with a
timeout of zero. If it gets back the pipe is not writable, it kills
the process. With this false negative situation, this is a bad thing.

Alan Cox wrote:
>
> > that are log file handlers are dead. If select() reports it can't
> > write immediately, Apache terminates and restarts the child process,
> > creating unnecessary load on the system.
>
> Is there anything saying that select has to report ready the instant a byte
> would fit. Certainly its better for performance to reduce the context switch
> rate by encouraging blocking

--
Paul Marquis
[email protected]

If it's tourist season, why can't we shoot them?

2000-11-02 22:57:28

by Alan

[permalink] [raw]
Subject: Re: select() bug

> I guess in theory, you're right, though if a write() could succeed,
> shouldn't select() say that it would?

Thats certainly not normal for a lot of devices. Most fast devices wake up
when buffers are half empty for example.

Alan

2000-11-02 22:58:58

by Paul Marquis

[permalink] [raw]
Subject: Re: select() bug

In the code sample, there is a loop that is run four times. During
each iteration, a call to select() and write() is done, while every
other iteration does a read(). Between the 1st and 2nd calls to
write(), as well as the 3rd and 4th, select() fails, but all write()'s
and read()'s succeed. There is no code bug.

Richard B. Johnson wrote:
> On Thu, 2 Nov 2000, Alan Cox wrote:
>
> > > that are log file handlers are dead. If select() reports it can't
> > > write immediately, Apache terminates and restarts the child process,
> > > creating unnecessary load on the system.
> >
> > Is there anything saying that select has to report ready the instant a byte
> > would fit. Certainly its better for performance to reduce the context switch
> > rate by encouraging blocking
> >
>
> It looks as though, when select() is reporting that the pipe is
> NOT writable, the code writes anyway -- then complains that it
> could be written.
>
> This is a code bug. The pipe can certainly become writable during
> the time interval between when it was checked by select() and the
> attempt to write the byte.
>
> Cheers,
> Dick Johnson
>
> Penguin : Linux version 2.2.17 on an i686 machine (801.18 BogoMips).
>
> "Memory is like gasoline. You use it up when you are running. Of
> course you get it all back when you reboot..."; Actual explanation
> obtained from the Micro$oft help desk.
>
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> Please read the FAQ at http://www.tux.org/lkml/

--
Paul Marquis
[email protected]

If it's tourist season, why can't we shoot them?

2000-11-02 23:08:59

by Paul Marquis

[permalink] [raw]
Subject: Re: select() bug

I'm not exactly sure what you mean by this statement. Would you mind
explaining further?

Alan Cox wrote:
> Most fast devices wake up when buffers are half empty for example.

--
Paul Marquis
[email protected]

If it's tourist season, why can't we shoot them?

2000-11-02 23:20:04

by Alan

[permalink] [raw]
Subject: Re: select() bug

> I'm not exactly sure what you mean by this statement. Would you mind
> explaining further?

Well take a socket with 64K of buffering. You don't want to wake processes
waiting in select or in write every time you can scribble another 1460 bytes
to the buffer. Instead you wait until there is 32K of room then wake the
user. That means that there is one wakeup/trip through userspace every 32K
rather than potentially every time a byte is read the other end

2000-11-02 23:44:31

by Paul Marquis

[permalink] [raw]
Subject: Re: select() bug

Okay, I see your point, thanks. A couple of comments/questions:

- Does this make sense with devices with small kernel buffers? From
my experimentation, pipes on Linux have a 4K buffer and tend to be
read and written very quickly.

- If I'm correct that pipes have a 4K kernel buffer, then writing 1
byte shouldn't cause this situation, as the buffer is well more than
half empty. Is this still a bug?

Semantic issues aside, since Apache does the test I mentionned earlier
to determine child status and since it could be misled, should this
feature be turned off?

Thanks for your input.

Alan Cox wrote:
> > I'm not exactly sure what you mean by this statement. Would you mind
> > explaining further?
>
> Well take a socket with 64K of buffering. You don't want to wake processes
> waiting in select or in write every time you can scribble another 1460 bytes
> to the buffer. Instead you wait until there is 32K of room then wake the
> user. That means that there is one wakeup/trip through userspace every 32K
> rather than potentially every time a byte is read the other end

--
Paul Marquis
[email protected]

If it's tourist season, why can't we shoot them?

2000-11-02 23:54:32

by H. Peter Anvin

[permalink] [raw]
Subject: Re: select() bug

Followup to: <[email protected]>
By author: Paul Marquis <[email protected]>
In newsgroup: linux.dev.kernel
>
> Okay, I see your point, thanks. A couple of comments/questions:
>
> - Does this make sense with devices with small kernel buffers? From
> my experimentation, pipes on Linux have a 4K buffer and tend to be
> read and written very quickly.
>
> - If I'm correct that pipes have a 4K kernel buffer, then writing 1
> byte shouldn't cause this situation, as the buffer is well more than
> half empty. Is this still a bug?
>
> Semantic issues aside, since Apache does the test I mentionned earlier
> to determine child status and since it could be misled, should this
> feature be turned off?
>
> Thanks for your input.
>

Has anyone considered the possibility of expanding the buffer of
high-traffic pipes?

-hpa
--
<[email protected]> at work, <[email protected]> in private!
"Unix gives you enough rope to shoot yourself in the foot."
http://www.zytor.com/~hpa/puzzle.txt

2000-11-02 23:55:42

by Alan

[permalink] [raw]
Subject: Re: select() bug

> - Does this make sense with devices with small kernel buffers? From
> my experimentation, pipes on Linux have a 4K buffer and tend to be
> read and written very quickly.

It makes sense for all things I suspect

> - If I'm correct that pipes have a 4K kernel buffer, then writing 1
> byte shouldn't cause this situation, as the buffer is well more than
> half empty. Is this still a bug?

The pipe code uses totally full/empty. Im not sure why that was chosen

> Semantic issues aside, since Apache does the test I mentionned earlier
> to determine child status and since it could be misled, should this
> feature be turned off?

Or made smarter yes

2000-11-03 00:00:52

by Alan

[permalink] [raw]
Subject: Re: select() bug

> Has anyone considered the possibility of expanding the buffer of
> high-traffic pipes?

Do that too much and the data falls out of L1 cache before we context switch.
Its a rather complex juggling game. DaveM's kiovec pipe patches avoid some of
this by cheating and going user->user

2000-11-03 00:01:42

by David Miller

[permalink] [raw]
Subject: Re: select() bug

From: "H. Peter Anvin" <[email protected]>
Date: 2 Nov 2000 15:53:29 -0800

Has anyone considered the possibility of expanding the buffer of
high-traffic pipes?

The kiobuf pipe patches are a more effective performance improvement
for this type of usage. It has the benefit of not requiring a
temporary kernel buffer of any size :-)

Later,
David S. Miller
[email protected]

2000-11-03 00:04:32

by H. Peter Anvin

[permalink] [raw]
Subject: Re: select() bug

"David S. Miller" wrote:
>
> From: "H. Peter Anvin" <[email protected]>
> Date: 2 Nov 2000 15:53:29 -0800
>
> Has anyone considered the possibility of expanding the buffer of
> high-traffic pipes?
>
> The kiobuf pipe patches are a more effective performance improvement
> for this type of usage. It has the benefit of not requiring a
> temporary kernel buffer of any size :-)
>

That's (very) nice, but it does assume there is currently a reader
listening.

-hpa

--
<[email protected]> at work, <[email protected]> in private!
"Unix gives you enough rope to shoot yourself in the foot."
http://www.zytor.com/~hpa/puzzle.txt

2000-11-03 00:07:42

by David Miller

[permalink] [raw]
Subject: Re: select() bug

Date: Thu, 02 Nov 2000 16:04:05 -0800
From: "H. Peter Anvin" <[email protected]>

That's (very) nice, but it does assume there is currently a reader
listening.

No, it has no such assumption.

Later,
David S. Miller
[email protected]

2000-11-03 00:13:53

by H. Peter Anvin

[permalink] [raw]
Subject: Re: select() bug

"David S. Miller" wrote:
>
> Date: Thu, 02 Nov 2000 16:04:05 -0800
> From: "H. Peter Anvin" <[email protected]>
>
> That's (very) nice, but it does assume there is currently a reader
> listening.
>
> No, it has no such assumption.
>

Oh. What do you do if there isn't... link up the contents of the write()
in a kiovec and hold the writer?

-hpa

--
<[email protected]> at work, <[email protected]> in private!
"Unix gives you enough rope to shoot yourself in the foot."
http://www.zytor.com/~hpa/puzzle.txt

2000-11-03 00:20:22

by David Miller

[permalink] [raw]
Subject: Re: select() bug

Date: Thu, 02 Nov 2000 16:13:13 -0800
From: "H. Peter Anvin" <[email protected]>

Oh. What do you do if there isn't... link up the contents of the
write() in a kiovec and hold the writer?

Right, the writer blocks.

I've already posted the patches here within the past week, I'll send
them to you under seperate cover so you can see for yourself how it
works.

Later,
David S. Miller
[email protected]

2000-11-03 00:39:27

by H. Peter Anvin

[permalink] [raw]
Subject: Re: select() bug

"David S. Miller" wrote:
>
> Date: Thu, 02 Nov 2000 16:13:13 -0800
> From: "H. Peter Anvin" <[email protected]>
>
> Oh. What do you do if there isn't... link up the contents of the
> write() in a kiovec and hold the writer?
>
> Right, the writer blocks.
>
> I've already posted the patches here within the past week, I'll send
> them to you under seperate cover so you can see for yourself how it
> works.
>

Sure, but my point was that it would be nice for high-traffic pipes to
allow a larger volume of data with the two processes still running
concurrently.

-hpa

--
<[email protected]> at work, <[email protected]> in private!
"Unix gives you enough rope to shoot yourself in the foot."
http://www.zytor.com/~hpa/puzzle.txt

2000-11-03 01:14:03

by Richard B. Johnson

[permalink] [raw]
Subject: Re: select() bug

On Thu, 2 Nov 2000, Paul Marquis wrote:

> In the code sample, there is a loop that is run four times. During
> each iteration, a call to select() and write() is done, while every
> other iteration does a read(). Between the 1st and 2nd calls to
> write(), as well as the 3rd and 4th, select() fails, but all write()'s
> and read()'s succeed. There is no code bug.

Not as I see it. This is clearly a code bug. Look:
iteration 1
select says pipe is writable
write succeeded - 1 bytes written
iteration 2
select returned 0 -- pipe not writable

for (i = 0; i < 4; i++)
{
printf("iteration %d\n", i + 1);

status = 0;

/* do select */
FD_ZERO(&write_fds);
FD_SET(fd[1], &write_fds);
tv.tv_sec = 0;
tv.tv_usec = 0;
switch (select(fd[1] + 1, NULL, &write_fds, NULL, &tv))
{
case -1:
/* should probably check for EINTR and/or EWOULDBLOCK */
printf("select error - %s\n", strerror(errno));
break;
case 0:
/* no I/O */
puts("select returned 0 -- pipe not writable");
break;

The BREAK breaks out of the switch-statement ---
|
|-----------------------------------------------------------
|
| default:
| /* make sure our fd is writable */
| if (FD_ISSET(fd[1], &write_fds))
| {
| puts("select says pipe is writable");
| status = 1;
| }
| else
| puts("select says pipe is not writable");
| break;
| }
|
| /* do write */
|-------> n = write(fd[1], buf + (i % 2), 1);

Then you write it anyway!

write succeeded - 1 bytes written
select bug




Cheers,
Dick Johnson

Penguin : Linux version 2.2.17 on an i686 machine (801.18 BogoMips).

"Memory is like gasoline. You use it up when you are running. Of
course you get it all back when you reboot..."; Actual explanation
obtained from the Micro$oft help desk.


2000-11-03 05:00:48

by Paul Marquis

[permalink] [raw]
Subject: Re: select() bug

It shows that even though select() says a file descriptor is not
writable, a write() can still succeed. This code is not used anywhere
in the real world, or at least my real world :-P. It just demonstrates
"the bug".

Richard B. Johnson" wrote:
> Not as I see it. This is clearly a code bug. Look:
> iteration 1
> select says pipe is writable
> write succeeded - 1 bytes written
> iteration 2
> select returned 0 -- pipe not writable
>
> for (i = 0; i < 4; i++)
> {
> printf("iteration %d\n", i + 1);
>
> status = 0;
>
> /* do select */
> FD_ZERO(&write_fds);
> FD_SET(fd[1], &write_fds);
> tv.tv_sec = 0;
> tv.tv_usec = 0;
> switch (select(fd[1] + 1, NULL, &write_fds, NULL, &tv))
> {
> case -1:
> /* should probably check for EINTR and/or EWOULDBLOCK */
> printf("select error - %s\n", strerror(errno));
> break;
> case 0:
> /* no I/O */
> puts("select returned 0 -- pipe not writable");
> break;
>
> The BREAK breaks out of the switch-statement ---
> |
> |-----------------------------------------------------------
> |
> | default:
> | /* make sure our fd is writable */
> | if (FD_ISSET(fd[1], &write_fds))
> | {
> | puts("select says pipe is writable");
> | status = 1;
> | }
> | else
> | puts("select says pipe is not writable");
> | break;
> | }
> |
> | /* do write */
> |-------> n = write(fd[1], buf + (i % 2), 1);
>
> Then you write it anyway!
>
> write succeeded - 1 bytes written
> select bug
>
> Cheers,
> Dick Johnson
>
> Penguin : Linux version 2.2.17 on an i686 machine (801.18 BogoMips).
>
> "Memory is like gasoline. You use it up when you are running. Of
> course you get it all back when you reboot..."; Actual explanation
> obtained from the Micro$oft help desk.

--
Paul Marquis
[email protected]

2000-11-03 05:52:46

by dean gaudet

[permalink] [raw]
Subject: Re: select() bug

> > Semantic issues aside, since Apache does the test I mentionned earlier
> > to determine child status and since it could be misled, should this
> > feature be turned off?
>
> Or made smarter yes

i'm scratching my head wondering what i was thinking when i wrote that
code. the specific thing the parent wants to handle there is the case of
a stuck logger... you'd think i would have at least put some debouncing
logic in there.

(the parent is able to replace a logger without disturbing the children
because it keeps a copy of both halves of the logging pipes.)

an alternative would be to look for many children stuck in the logging
state (they make a note in the scoreboard before going into it).

paul -- if you want to just eliminate that feature (it'll still be able to
replace the logger if the logger exits) go into src/http_log.c,
piped_log_maintenance and comment out the OC_REASON_UNWRITEABLE.

-dean

2000-11-03 07:06:04

by Marc Lehmann

[permalink] [raw]
Subject: Re: select() bug

On Thu, Nov 02, 2000 at 11:55:52PM +0000, Alan Cox <[email protected]> wrote:
> > - If I'm correct that pipes have a 4K kernel buffer, then writing 1
> > byte shouldn't cause this situation, as the buffer is well more than
> > half empty. Is this still a bug?
>
> The pipe code uses totally full/empty. Im not sure why that was chosen

Just a quick guess: maybe because of the POSIX atomicity guarantees (if
select returned, write might have to block which is not what is expected),
and maybe this limitation was used not only on write but on read (Although
it's not necessary on the read side, AFAIK).

--
-----==- |
----==-- _ |
---==---(_)__ __ ____ __ Marc Lehmann +--
--==---/ / _ \/ // /\ \/ / [email protected] |e|
-=====/_/_//_/\_,_/ /_/\_\ XX11-RIPE --+
The choice of a GNU generation |
|

2000-11-03 13:06:09

by Richard B. Johnson

[permalink] [raw]
Subject: Re: select() bug

On Fri, 3 Nov 2000, Paul Marquis wrote:

> It shows that even though select() says a file descriptor is not
> writable, a write() can still succeed. This code is not used anywhere
> in the real world, or at least my real world :-P. It just demonstrates
> "the bug".
>

It demonstrates nothing except bad code. Anybody should know that
such a descriptor can become writable at any instant, simply because
the kernel may be finding room for more unread data.

So, the code sees that, at this instant, it's not writable. Then
it writes to it anyway. Sometimes the fd becomes writable between
the time that it was checked, and the time it was written. So
the write succeeded. So what. It shows nothing but bad code. It
demonstrates no bug whatsover.

There may be a bug. However, this code doesn't demonstrate anything
but bad coding practice.

Cheers,
Dick Johnson

Penguin : Linux version 2.2.17 on an i686 machine (801.18 BogoMips).

"Memory is like gasoline. You use it up when you are running. Of
course you get it all back when you reboot..."; Actual explanation
obtained from the Micro$oft help desk.


2000-11-05 16:15:54

by Stanislav Meduna

[permalink] [raw]
Subject: Re: select() bug

> > > - If I'm correct that pipes have a 4K kernel buffer, then writing 1
> > > byte shouldn't cause this situation, as the buffer is well more than
> > > half empty. Is this still a bug?
> >
> > The pipe code uses totally full/empty. Im not sure why that was chosen
>
> Just a quick guess: maybe because of the POSIX atomicity guarantees (if
> select returned, write might have to block which is not what is expected),
> and maybe this limitation was used not only on write but on read (Although
> it's not necessary on the read side, AFAIK).

FWIW: I tried to do some experiments (I was hit by the context
switch 'explosion': the writer writes one byte and then
cannot write another one until the reader wakes up and reads
the single one - for my application this was very bad).

Unfortunately my experiments made the most-used uses slower
and I don't know the kernel internals enough to analyse it :-(

I am repeating here my post from 14. 7. (this was with -test4,
I don't know whether there were any significant changes since then)
under the subject "pipe concurrent r/w performance". Last time
I did not get any replies.

- snip -

I am (again) playing with pipe.c trying to enlarge the
pipe buffer, so the pipe can select for write even when
non-empty (but with more than PIPE_BUF free), aiming
for reducing context switches when two applications
signal something via a pipe without waiting for
an answer. The functionality is (hopefully) OK and
I am testing the performance.

I have written two similar test programs. The
writer side selects the pipe and then writes a byte.
The reader side does either 1) blocking read of one
byte, or 2) selects the pipe for reading and then
reads a byte.

To my surprise the case 1) is slower than the original
version by 20% and there is actually more context switches.
Case 2) is at least 30% faster than the original version
and the number of context switches is dramatically reduced.

The difference probably is, that the reader is faster
than the writer and in the case 1) it nearly always
blocks in the pipe_wait call inside pipe_read.
It seems like as soon as the writer writes something
in the pipe, the reader immediately gets the CPU
instead of allowing the writer to continue.

In the case 2) it blocks in poll_wait inside pipe_poll.

I don't quite understand all the semantics behind
synchronisation primitives, but it seems like the
current code is not optimal for the concurrent
presence of reader and writer in pipe_read/pipe_write.
The code in pipe_poll excluded such possibility.
Could someone understanding these things better
take a look?

My patch and the test programs are available at
http://www.penguin.cz/~stano/code/kernelpipe.tar.gz
The patch is against 2.4.0-test4, my machine is
a SMP one (maybe the situation differs on UP).

- snip -

Regards
--
Stano