2009-11-05 15:40:14

by Max Kellermann

[permalink] [raw]
Subject: [PATCH] pipe: don't block after data has been written

According to the select() / poll() documentation, a write operation on
a file descriptor which is "ready for writing" must not block. Linux
violates this rule: if you pass a very large buffer to write(), the
system call will not return until everything is written, or an error
occurs.

This patch adds a simple check: if at least one byte has already been
written, break from the loop, instead of calling pipe_wait().

Signed-off-by: Max Kellermann <[email protected]>
---

fs/pipe.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/fs/pipe.c b/fs/pipe.c
index ae17d02..9d84f0b 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -582,7 +582,7 @@ redo2:
}
if (bufs < PIPE_BUFFERS)
continue;
- if (filp->f_flags & O_NONBLOCK) {
+ if (filp->f_flags & O_NONBLOCK || ret > 0) {
if (!ret)
ret = -EAGAIN;
break;


2009-11-05 16:20:25

by Cong Wang

[permalink] [raw]
Subject: Re: [PATCH] pipe: don't block after data has been written

On Thu, Nov 05, 2009 at 04:31:47PM +0100, Max Kellermann wrote:
>According to the select() / poll() documentation, a write operation on
>a file descriptor which is "ready for writing" must not block. Linux
>violates this rule: if you pass a very large buffer to write(), the
>system call will not return until everything is written, or an error
>occurs.
>
>This patch adds a simple check: if at least one byte has already been
>written, break from the loop, instead of calling pipe_wait().

Do you have any working test-case for this?

Thanks.

>
>Signed-off-by: Max Kellermann <[email protected]>
>---
>
> fs/pipe.c | 2 +-
> 1 files changed, 1 insertions(+), 1 deletions(-)
>
>diff --git a/fs/pipe.c b/fs/pipe.c
>index ae17d02..9d84f0b 100644
>--- a/fs/pipe.c
>+++ b/fs/pipe.c
>@@ -582,7 +582,7 @@ redo2:
> }
> if (bufs < PIPE_BUFFERS)
> continue;
>- if (filp->f_flags & O_NONBLOCK) {
>+ if (filp->f_flags & O_NONBLOCK || ret > 0) {
> if (!ret)
> ret = -EAGAIN;
> break;
>
>--
>To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
>the body of a message to [email protected]
>More majordomo info at http://vger.kernel.org/majordomo-info.html
>Please read the FAQ at http://www.tux.org/lkml/

--
Live like a child, think like the god.

2009-11-05 16:25:51

by Max Kellermann

[permalink] [raw]
Subject: Re: [PATCH] pipe: don't block after data has been written

On 2009/11/05 17:20, Am?rico Wang <[email protected]> wrote:
> On Thu, Nov 05, 2009 at 04:31:47PM +0100, Max Kellermann wrote:
> >According to the select() / poll() documentation, a write operation on
> >a file descriptor which is "ready for writing" must not block. Linux
> >violates this rule: if you pass a very large buffer to write(), the
> >system call will not return until everything is written, or an error
> >occurs.
> >
> >This patch adds a simple check: if at least one byte has already been
> >written, break from the loop, instead of calling pipe_wait().
>
> Do you have any working test-case for this?

Eric Dumazet posted a program earlier today (in response to my
pipe/splice patch):

http://lkml.org/lkml/2009/11/5/144

With this patch applied, it runs correctly (without it, the write()
blocks until the consumer has read everything):

poll([{fd=4, events=POLLOUT}], 1, -1) = 1 ([{fd=4, revents=POLLOUT}])
[...]
write(4, "\0\0\0\0"..., 1000000) = 65536

2009-11-05 16:28:16

by Eric Dumazet

[permalink] [raw]
Subject: Re: [PATCH] pipe: don't block after data has been written

Max Kellermann a écrit :
> According to the select() / poll() documentation, a write operation on
> a file descriptor which is "ready for writing" must not block. Linux
> violates this rule: if you pass a very large buffer to write(), the
> system call will not return until everything is written, or an error
> occurs.
>
> This patch adds a simple check: if at least one byte has already been
> written, break from the loop, instead of calling pipe_wait().
>
> Signed-off-by: Max Kellermann <[email protected]>
> ---
>
> fs/pipe.c | 2 +-
> 1 files changed, 1 insertions(+), 1 deletions(-)
>
> diff --git a/fs/pipe.c b/fs/pipe.c
> index ae17d02..9d84f0b 100644
> --- a/fs/pipe.c
> +++ b/fs/pipe.c
> @@ -582,7 +582,7 @@ redo2:
> }
> if (bufs < PIPE_BUFFERS)
> continue;
> - if (filp->f_flags & O_NONBLOCK) {
> + if (filp->f_flags & O_NONBLOCK || ret > 0) {
> if (!ret)
> ret = -EAGAIN;
> break;
>

Then select()/poll() documentation is wrong, please correct documentation ?

http://www.opengroup.org/onlinepubs/000095399/functions/write.html

ssize_t write(int fildes, const void *buf, size_t nbyte);

If the O_NONBLOCK flag is clear, a write request may cause the thread to block,
but on normal completion it shall return nbyte.

Every Unix I know behaves the same when writing to a pipe.


Your patch breaks many programs, that dont use poll()/select()

char result[1000000];
main()
{
computethings();
write(1, buffer, 1000000);
}


$ ./program | more

Please learn how useful O_NDELAY can be in a poll()/select() environment.

Thanks

2009-11-05 16:36:00

by Max Kellermann

[permalink] [raw]
Subject: Re: [PATCH] pipe: don't block after data has been written

On 2009/11/05 17:27, Eric Dumazet <[email protected]> wrote:
> Your patch breaks many programs, that dont use poll()/select()
>
> char result[1000000];
> main()
> {
> computethings();
> write(1, buffer, 1000000);
> }

Your code does not check the return value of write(). This is a bug.

So how exactly does my patch break this program?

Let's read some more of the manual you cited: "If a write() requests
that more bytes be written than there is room for (for example, the
process' file size limit or the physical end of a medium), only as
many bytes as there is room for shall be written."

2009-11-05 16:37:50

by Eric Dumazet

[permalink] [raw]
Subject: Re: [PATCH] pipe: don't block after data has been written

Max Kellermann a ?crit :
> On 2009/11/05 17:27, Eric Dumazet <[email protected]> wrote:
>> Your patch breaks many programs, that dont use poll()/select()
>>
>> char result[1000000];
>> main()
>> {
>> computethings();
>> write(1, buffer, 1000000);
>> }
>
> Your code does not check the return value of write(). This is a bug.
>

Welcome to real world.

2009-11-05 17:32:04

by Zan Lynx

[permalink] [raw]
Subject: Re: [PATCH] pipe: don't block after data has been written

On 11/5/09 9:37 AM, Eric Dumazet wrote:
> Max Kellermann a ?crit :
>> On 2009/11/05 17:27, Eric Dumazet<[email protected]> wrote:
>>> Your patch breaks many programs, that dont use poll()/select()
>>>
>>> char result[1000000];
>>> main()
>>> {
>>> computethings();
>>> write(1, buffer, 1000000);
>>> }
>>
>> Your code does not check the return value of write(). This is a bug.
>>
>
> Welcome to real world.

Yes in the real world there are bugs. The decision is to choose which
bug you are going to expose. If it was my decision I would make the code
work as documented, as Max wants to do.

I remember many years ago needing to fix some inetd-called server code
that got unexpected partial writes on blocking sockets. It was either
Solaris or HP/UX. So this is nothing new.

In fact I think that Linux will already do short writes if a signal is
received without restart set for the handler. I found several bugs last
year in glibc and libstdc++ fwrite and iostreams regarding that.
--
Zan Lynx
[email protected]

"Knowledge is Power. Power Corrupts. Study Hard. Be Evil."

2009-11-05 18:31:04

by Alan

[permalink] [raw]
Subject: Re: [PATCH] pipe: don't block after data has been written

> > Welcome to real world.
>
> Yes in the real world there are bugs. The decision is to choose which
> bug you are going to expose. If it was my decision I would make the code
> work as documented, as Max wants to do.

Outside of academia the reality is fairly simple. A system needs to
behave according to the expected behaviour. That is a mix of things
- Standards
- Extrapolation (applying the logic of the standard to cases beyond it)
- Tradition (things that used to work still work)

If you like: How it is defined to work, how it is expected to work and how
it worked last year.

Tradition is a suprisingly large part of it. In the unix world that
tradition includes things like "signals do not interrupt disk I/O writes
causing short writes".

Pipes however is pretty much pure standards behaviour

In blocking mode they block
In non-blocking mode they don't block

Furthermore there are specific rules about writes under a certain size
always occurring in an atomic manner.

> In fact I think that Linux will already do short writes if a signal is
> received without restart set for the handler. I found several bugs last
> year in glibc and libstdc++ fwrite and iostreams regarding that.

The kernel takes great pains not to do this in the cases where tradition
dictates otherwise (notably in disk I/O)

Alan