2007-08-14 11:42:14

by Denys Vlasenko

Subject: O_NONBLOCK is broken

Hi folks,

I apologize for the provocative subject. O_NONBLOCK in Linux is not broken.
It is working as designed. But the design itself has a flaw.

Suppose I want to write a block of data to stdout, without blocking.
I will do the classic thing:

int fl = fcntl(1, F_GETFL, 0);
fcntl(1, F_SETFL, fl | O_NONBLOCK);
ssize_t r = write(1, buf, size);
fcntl(1, F_SETFL, fl); /* restore ASAP! */

The problem is that the O_NONBLOCK flag is not attached to the file
*descriptor* but to the "open file description" described in the fcntl(2)
manpage:

"Each open file description has certain associated status flags, initialized
by open(2) and possibly modified by fcntl(2). Duplicated file descriptors
(made with dup(), fcntl(F_DUPFD), fork(), etc.) refer to the same open file
description, and thus share the same file status flags."

We don't know whether our stdout descriptor #1 is shared with anyone or not,
and if we were started from a shell, it typically is. That's why we try to
restore the flags ASAP.

But "ASAP" isn't soon enough. Between setting and clearing O_NONBLOCK,
any other process that shares fd #1 with us may well be affected
by the file suddenly becoming O_NONBLOCK under its feet.
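
The sharing itself is easy to observe. The following is a minimal sketch, not
taken from the original message: it uses a pipe and dup() in a single process
to stand in for two processes that inherited the same stdout. The function
name is made up for illustration.

```c
#include <fcntl.h>
#include <unistd.h>

/* Returns 1 if O_NONBLOCK set through a dup()ed descriptor is visible
 * through the original descriptor, 0 if it is not, -1 on error. */
int nonblock_leaks_through_dup(void)
{
    int p[2];
    if (pipe(p) < 0)
        return -1;

    /* alias and p[1] refer to the same open file description */
    int alias = dup(p[1]);
    if (alias < 0) {
        close(p[0]);
        close(p[1]);
        return -1;
    }

    int leaked = -1;
    int fl = fcntl(alias, F_GETFL, 0);
    if (fl >= 0 && fcntl(alias, F_SETFL, fl | O_NONBLOCK) == 0) {
        /* Read the flags back through the *other* descriptor. */
        int seen = fcntl(p[1], F_GETFL, 0);
        if (seen >= 0)
            leaked = (seen & O_NONBLOCK) != 0;
    }

    close(alias);
    close(p[0]);
    close(p[1]);
    return leaked;
}
```

On Linux this should return 1: the flag set via one descriptor shows up on the
other, which is exactly what a process sharing our stdout would experience.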

Worse, another process can do the same
int fl = fcntl(1, F_GETFL, 0);
fcntl(1, F_SETFL, fl | O_NONBLOCK);
...
fcntl(1, F_SETFL, fl);
sequence. Its F_GETFL can return flags with O_NONBLOCK already set (because
of us), and then its final F_SETFL will set O_NONBLOCK permanently, which is
not what was intended!
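
A sketch of that interleaving, replayed deterministically in one process. This
is an illustration only: "a" and "b" are two descriptors on one pipe standing
in for two processes, and the function name is hypothetical.

```c
#include <fcntl.h>
#include <unistd.h>

/* Replays the bad interleaving. "a" and "b" play the roles of two
 * processes sharing one open file description.
 * Returns 1 if O_NONBLOCK is left set at the end, 0 if not, -1 on error. */
int nonblock_stuck_after_interleaving(void)
{
    int p[2];
    if (pipe(p) < 0)
        return -1;

    int a = p[1];
    int b = dup(p[1]);
    if (b < 0) {
        close(p[0]);
        close(p[1]);
        return -1;
    }

    int fl_a = fcntl(a, F_GETFL, 0);       /* A: saves flags, O_NONBLOCK clear */
    fcntl(a, F_SETFL, fl_a | O_NONBLOCK);  /* A: sets O_NONBLOCK */
    int fl_b = fcntl(b, F_GETFL, 0);       /* B: saves flags, O_NONBLOCK set by A */
    fcntl(b, F_SETFL, fl_b | O_NONBLOCK);  /* B: sets O_NONBLOCK (no change) */
    fcntl(a, F_SETFL, fl_a);               /* A: restores, flag is cleared */
    fcntl(b, F_SETFL, fl_b);               /* B: "restores", flag is set again */

    int end = fcntl(a, F_GETFL, 0);
    close(b);
    close(p[0]);
    close(p[1]);
    return end < 0 ? -1 : (end & O_NONBLOCK) != 0;
}
```

Both sides did a well-behaved save and restore, yet the descriptor ends up
non-blocking for everyone.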

Another failure mode is that the process can be killed by a signal
between the two fcntl calls, leaving the file in O_NONBLOCK mode.

This isn't a theoretical problem; it actually happens not so rarely, for
example with pagers.

Possible solutions:

a) Introduce a *per-fd* flag, so that one can use
fcntl(1, F_SETFD, fdflag | O_NONBLOCK) instead of
fcntl(1, F_SETFL, flflag | O_NONBLOCK).
Currently there is only one per-fd flag: FD_CLOEXEC, with a value of 1.
O_NONBLOCK is 04000 (octal) on x86.

b) Make recv(fd, buf, size, flags) and send(fd, buf, size, flags)
work with non-socket fds too, for flags==0 or flags==MSG_DONTWAIT.
(It's OK to fail with "socket op on non-socket fd" for other values
of flags.)

Both things are non-standard, but portable programs can test for errors
and fall back to the "standard" UNIX way of doing it.
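
A minimal sketch of what such a "try, then fall back" wrapper could look like
for option b). The helper name write_nonblock is made up for illustration. On
a kernel without the proposed change, send() on a non-socket fails with
ENOTSOCK, so the fallback path is the one that actually runs for pipes and
ttys.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: attempt a per-call non-blocking write first and
 * fall back to the fcntl() dance when fd is not a socket. */
ssize_t write_nonblock(int fd, const void *buf, size_t size)
{
    ssize_t r = send(fd, buf, size, MSG_DONTWAIT);
    if (r >= 0 || errno != ENOTSOCK)
        return r;

    /* Fallback: the "standard" UNIX way, still subject to the race
     * on shared file descriptions described above. */
    int fl = fcntl(fd, F_GETFL, 0);
    if (fl < 0)
        return -1;
    if (!(fl & O_NONBLOCK) && fcntl(fd, F_SETFL, fl | O_NONBLOCK) < 0)
        return -1;

    r = write(fd, buf, size);

    int saved_errno = errno;        /* fcntl() below may clobber errno */
    if (!(fl & O_NONBLOCK))
        fcntl(fd, F_SETFL, fl);     /* restore ASAP */
    errno = saved_errno;
    return r;
}
```

A caller would use it like write(): a return of -1 with errno set to EAGAIN
means the write would have blocked.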


P.S. Hmm, the fcntl GETFL/SETFL interface itself seems to be racy:

int fl = fcntl(fd, F_GETFL, 0);
/* other process can muck with file flags here */
fcntl(fd, F_SETFL, fl | SOME_BITS);

How can I *atomically* add or remove bits from file flags?
--
vda


2007-08-14 12:25:45

by Alan

Subject: Re: O_NONBLOCK is broken

> b) Make recv(fd, buf, size, flags) and send(fd, buf, size, flags);
> work with non-socket fds too, for flags==0 or flags==MSG_DONTWAIT.
> (it's ok to fail with "socket op on non-socket fd" for other values
> of flags)

I think that makes a lot of sense, and to be honest other MSG_ flags make
useful sense and have meaningful semantics that might be helpful
elsewhere if ever coded that way.

If you want to do this the first job is going to be to sort out the way
non-block is propagated to device driver read/write handlers. At the
moment they all check filp->f_flags.

Alan

2007-08-14 17:29:15

by Jan Engelhardt

Subject: Re: O_NONBLOCK is broken


On Aug 14 2007 13:33, Alan Cox wrote:
>
>> b) Make recv(fd, buf, size, flags) and send(fd, buf, size, flags);
>> work with non-socket fds too, for flags==0 or flags==MSG_DONTWAIT.
>> (it's ok to fail with "socket op on non-socket fd" for other values
>> of flags)
>
>I think that makes a lot of sense, and to be honest other MSG_ flags make
>useful sense and have meaningful semantics that might be helpful
>elsewhere if ever coded that way.
>
>If you want to do this the first job is going to be to sort out the way
>non-block is propagated to device driver read/write handlers. At the
>moment they all check filp->f_flags

And as a side effect: kernel code (kthreads) rarely allocates a file
descriptor.


Jan
--

2007-08-14 22:15:53

by David Schwartz

Subject: RE: O_NONBLOCK is broken


> The problem is, O_NONBLOCK flag is not attached to file *descriptor*,
> but to a "file description" mentioned in fcntl manpage:
[snip]
> We don't know whether our stdout descriptor #1 is shared with
> anyone or not,
> and if we were started from shell, it typically is. That's why we try to
> restore flags ASAP.

> But "ASAP" isn't soon enough. Between setting and clearing O_NONBLOCK,
> other process which share fd #1 with us may well be affected
> by file suddenly becoming O_NONBLOCK under its feet.
>
> Worse, other process can do the same
> fcntl(1, F_SETFL, fl | O_NONBLOCK);
> ...
> fcntl(1, F_SETFL, fl);
> sequence, and first fcntl can return flags with O_NONBLOCK set
> (because of
> us), and then second fcntl will set O_NONBLOCK permanently, which is not
> what was intended!
[snip]
> P.S. Hmm, it seems fcntl GETFL/SETFL interface seems to be racy:
>
> int fl = fcntl(fd, F_GETFL, 0);
> /* other process can muck with file flags here */
> fcntl(fd, F_SETFL, fl | SOME_BITS);
>
> How can I *atomically* add or remove bits from file flags?

Simply put, you cannot change file flags on a shared descriptor. It is a bug
to do so, a bug that is sadly present in many common programs.

I like the idea of being able to specify blocking or non-blocking behavior
in the operation. It is not too uncommon to want to perform blocking
operations sometimes and non-blocking operations other times for the same
object and having to keep changing modes, even if it wasn't racy, is a pain.

However, there's a much more fundamental problem here. Processes need a good
way to get exclusive use of their stdin, stdout, and stderr streams and
there is no good way. Perhaps an "exclusive lock" that blocked all other
process' attempts to use the terminal until it was released would be a good
thing.

DS


2007-08-19 12:50:39

by Denys Vlasenko

Subject: [PATCH] allow send/recv(MSG_DONTWAIT) on non-sockets (was Re: O_NONBLOCK is broken)

On Tuesday 14 August 2007 13:33, Alan Cox wrote:
> > b) Make recv(fd, buf, size, flags) and send(fd, buf, size, flags);
> > work with non-socket fds too, for flags==0 or flags==MSG_DONTWAIT.
> > (it's ok to fail with "socket op on non-socket fd" for other values
> > of flags)
>
> I think that makes a lot of sense, and to be honest other MSG_ flags make
> useful sense and have meaningful semantics that might be helpful
> elsewhere if ever coded that way.

Yes, that's my feeling too.

> If you want to do this the first job is going to be to sort out the way
> non-block is propagated to device driver read/write handlers. At the
> moment they all check filp->f_flags

...which happens in ~250 files. I'd rather not touch that much
of code, if possible.

The attached patch detects send/recv(fd, buf, size, MSG_DONTWAIT) on
non-sockets and turns them into non-blocking write/read.
Since filp->f_flags appears to be read and modified without any locking,
I cannot modify it without potentially affecting other processes
accessing the same file through the shared struct file.

Therefore I simply make a temporary copy of struct file, set
O_NONBLOCK in it and pass it to vfs_read/write.
Is this heresy? ;) I see only one spinlock in struct file:

#ifdef CONFIG_EPOLL
spinlock_t f_ep_lock;
#endif /* #ifdef CONFIG_EPOLL */

Do I need to take it?

Also attached is ndelaytest.c, which can be used to test that
send(MSG_DONTWAIT) indeed fails with EAGAIN if the write would block
and that other processes never see O_NONBLOCK set.
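
For comparison, the same two properties can already be checked on sockets,
where MSG_DONTWAIT is supported today. The following is only a sketch under
that assumption and is not the attached ndelaytest.c; it uses a socketpair
rather than a pipe or tty, and the function name is hypothetical.

```c
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Returns 1 if send(MSG_DONTWAIT) eventually fails with EAGAIN or
 * EWOULDBLOCK while O_NONBLOCK stays clear on the socket, 0 otherwise,
 * -1 on setup error. */
int dontwait_leaves_flags_alone(void)
{
    int sv[2];
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
        return -1;

    char chunk[4096];
    memset(chunk, 'x', sizeof chunk);

    int ok = 0;
    /* Nobody reads from sv[1], so the socket buffer fills up. */
    for (int i = 0; i < (1 << 16); i++) {
        ssize_t r = send(sv[0], chunk, sizeof chunk, MSG_DONTWAIT);
        if (r < 0) {
            ok = (errno == EAGAIN || errno == EWOULDBLOCK);
            break;
        }
    }

    /* The per-call flag must not have touched the shared status flags. */
    int fl = fcntl(sv[0], F_GETFL, 0);
    if (fl < 0 || (fl & O_NONBLOCK))
        ok = 0;

    close(sv[0]);
    close(sv[1]);
    return ok;
}
```

With the patch applied, the same check would be expected to pass when sv[0] is
replaced by the write end of a pipe.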

Comments?
--
vda


Attachments:
(No filename) (1.53 kB)
nonblock_linux-2.6.22-rc6.patch (2.83 kB)
ndelaytest.c (1.43 kB)

2007-08-19 12:51:23

by Denys Vlasenko

Subject: Re: O_NONBLOCK is broken

On Tuesday 14 August 2007 22:59, David Schwartz wrote:
> > The problem is, O_NONBLOCK flag is not attached to file *descriptor*,
> > but to a "file description" mentioned in fcntl manpage:
>
> [snip]
>
> > We don't know whether our stdout descriptor #1 is shared with
> > anyone or not,
> > and if we were started from shell, it typically is. That's why we try to
> > restore flags ASAP.
> >
> > But "ASAP" isn't soon enough. Between setting and clearing O_NONBLOCK,
> > other process which share fd #1 with us may well be affected
> > by file suddenly becoming O_NONBLOCK under its feet.
> >
> > Worse, other process can do the same
> > fcntl(1, F_SETFL, fl | O_NONBLOCK);
> > ...
> > fcntl(1, F_SETFL, fl);
> > sequence, and first fcntl can return flags with O_NONBLOCK set
> > (because of
> > us), and then second fcntl will set O_NONBLOCK permanently, which is not
> > what was intended!
>
> [snip]
>
> > P.S. Hmm, it seems fcntl GETFL/SETFL interface seems to be racy:
> >
> > int fl = fcntl(fd, F_GETFL, 0);
> > /* other process can muck with file flags here */
> > fcntl(fd, F_SETFL, fl | SOME_BITS);
> >
> > How can I *atomically* add or remove bits from file flags?
>
> Simply put, you cannot change file flags on a shared descriptor. It is a
> bug to do so, a bug that is sadly present in many common programs.

It means that the design is flawed. If it were done right, the file flags
which are changeable by fcntl (O_NONBLOCK, O_APPEND, O_ASYNC, O_DIRECT,
O_NOATIME) wouldn't be shared; they are useless as shared flags.
IOW, they should be file _descriptor_ flags.

It's unlikely that the kernel tribe leaders will agree to violate POSIX
and make fcntl(F_SETFL) a per-fd thing. There can be users of this
(mis)feature.

Making fcntl(F_SETFD) accept those same flags and making it override
F_SETFL flags may fare slightly better, but may require propagation
of these flags into *a lot* of kernel codepaths.

> I like the idea of being able to specify blocking or non-blocking behavior
> in the operation. It is not too uncommon to want to perform blocking
> operations sometimes and non-blocking operations other times for the same
> object and having to keep changing modes, even if it wasn't racy, is a
> pain.

I am submitting a patch which allows this. Let's see what people say.

Yet another way to fix this problem is to add a new fcntl operation
"duplicate an open file":

fd = fcntl(fd, F_DUPFL, min_fd);

which is analogous to F_DUPFD, but produces a descriptor with its own,
_unshared_ open file description.
You can F_SETFL it as you want, no one else will be affected.
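
F_DUPFL is only a proposal here and does not exist. On Linux, a similar effect
can sometimes be approximated by re-opening the file through /proc/self/fd,
which creates a new open file description. The sketch below assumes /proc is
mounted and that the object can be re-opened at all (it works for pipes, ttys
and regular files, not for sockets, and the usual permission checks apply).
The helper name is made up for illustration.

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper: open a second, unshared open file description
 * for whatever fd refers to. Linux-specific, requires /proc.
 * Returns the new descriptor, or -1 with errno set by open(). */
int reopen_unshared(int fd, int flags)
{
    char path[64];
    snprintf(path, sizeof path, "/proc/self/fd/%d", fd);
    return open(path, flags);
}
```

For example, reopen_unshared(1, O_WRONLY | O_NONBLOCK) would give a
non-blocking descriptor for stdout whose status flags are private to the
caller, while fd 1 itself stays blocking.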

> However, there's a much more fundamental problem here. Processes need a
> good way to get exclusive use of their stdin, stdout, and stderr streams
> and there is no good way. Perhaps an "exclusive lock" that blocked all
> other process' attempts to use the terminal until it was released would be
> a good thing.

Yep, maybe. But this is a different problem.
IOW: there are cases where one doesn't want this kind of locking,
but simply needs to do a non-blocking read/write. That's what I'm trying
to solve.
--
vda