2007-09-28 17:35:18

by Ulrich Drepper

Subject: F_DUPFD_CLOEXEC implementation

One more small change to extend the availability of creation of
file descriptors with FD_CLOEXEC set. Adding a new command to
fcntl() requires no new system call and the overall impact on
code size is minimal.

If this patch gets accepted we will also add this change to the
next revision of the POSIX spec.

To test the patch, use the following little program. Adjust the
value of F_DUPFD_CLOEXEC appropriately.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#ifndef F_DUPFD_CLOEXEC
# define F_DUPFD_CLOEXEC 12
#endif

int
main (int argc, char *argv[])
{
  if (argc > 1)
    {
      if (fcntl (3, F_GETFD) == 0)
        {
          puts ("descriptor not closed");
          exit (1);
        }
      if (errno != EBADF)
        {
          puts ("error not EBADF");
          exit (1);
        }

      exit (0);
    }
  int fd = fcntl (STDOUT_FILENO, F_DUPFD_CLOEXEC, 0);
  if (fd == -1 && errno == EINVAL)
    {
      puts ("F_DUPFD_CLOEXEC not supported");
      return 0;
    }
  if (fd != 3)
    {
      puts ("program called with descriptors other than 0,1,2");
      return 1;
    }

  execl ("/proc/self/exe", "/proc/self/exe", "1", NULL);
  puts ("execl failed");
  return 1;
}
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
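
For a quick check that does not need the re-exec, something along
these lines should also do. It is only a sketch: the helper name
dupfd_cloexec_sets_flag is made up for illustration, and it assumes
the installed headers already define F_DUPFD_CLOEXEC. It duplicates
a descriptor and reads the descriptor flags back with F_GETFD.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper, not part of the patch.
   Returns 1 if the duplicate carries FD_CLOEXEC, 0 if it does not,
   and -1 if fcntl() rejects the command (EINVAL on kernels without
   F_DUPFD_CLOEXEC support).  */
int
dupfd_cloexec_sets_flag (int oldfd)
{
  int fd = fcntl (oldfd, F_DUPFD_CLOEXEC, 0);
  if (fd == -1)
    return -1;

  int flags = fcntl (fd, F_GETFD);
  assert (flags != -1);		/* fd was created just above */
  close (fd);
  return (flags & FD_CLOEXEC) != 0;
}
```

This only shows that the flag is set at creation time; the program
above is still the one that proves the descriptor really is closed
across exec.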

Signed-off-by: Ulrich Drepper <[email protected]>

diff --git a/fs/fcntl.c b/fs/fcntl.c
index 78b2ff0..c9db73f 100644
--- a/fs/fcntl.c
+++ b/fs/fcntl.c
@@ -110,7 +110,7 @@ out:
 	return error;
 }

-static int dupfd(struct file *file, unsigned int start)
+static int dupfd(struct file *file, unsigned int start, int cloexec)
 {
 	struct files_struct * files = current->files;
 	struct fdtable *fdt;
@@ -122,7 +122,10 @@ static int dupfd(struct file *file, unsigned int start)
 		/* locate_fd() may have expanded fdtable, load the ptr */
 		fdt = files_fdtable(files);
 		FD_SET(fd, fdt->open_fds);
-		FD_CLR(fd, fdt->close_on_exec);
+		if (cloexec)
+			FD_SET(fd, fdt->close_on_exec);
+		else
+			FD_CLR(fd, fdt->close_on_exec);
 		spin_unlock(&files->file_lock);
 		fd_install(fd, file);
 	} else {
@@ -195,7 +198,7 @@ asmlinkage long sys_dup(unsigned int fildes)
 	struct file * file = fget(fildes);

 	if (file)
-		ret = dupfd(file, 0);
+		ret = dupfd(file, 0, 0);
 	return ret;
 }

@@ -319,8 +322,9 @@ static long do_fcntl(int fd, unsigned int cmd, unsigned long arg,

 	switch (cmd) {
 	case F_DUPFD:
+	case F_DUPFD_CLOEXEC:
 		get_file(filp);
-		err = dupfd(filp, arg);
+		err = dupfd(filp, arg, cmd == F_DUPFD_CLOEXEC);
 		break;
 	case F_GETFD:
 		err = get_close_on_exec(fd) ? FD_CLOEXEC : 0;
diff --git a/include/asm-generic/fcntl.h b/include/asm-generic/fcntl.h
index b847741..b01408a 100644
--- a/include/asm-generic/fcntl.h
+++ b/include/asm-generic/fcntl.h
@@ -73,6 +73,9 @@
#define F_SETSIG 10 /* for sockets. */
#define F_GETSIG 11 /* for sockets. */
#endif
+#ifndef F_DUPFD_CLOEXEC
+#define F_DUPFD_CLOEXEC 12
+#endif

/* for F_[GET|SET]FL */
#define FD_CLOEXEC 1 /* actually anything with low bit set goes */


2007-09-28 18:19:19

by Davide Libenzi

Subject: Re: F_DUPFD_CLOEXEC implementation

On Fri, 28 Sep 2007, Ulrich Drepper wrote:

> One more small change to extend the availability of creation of
> file descriptors with FD_CLOEXEC set. Adding a new command to
> fcntl() requires no new system call and the overall impact on
> code size is minimal.
>
> If this patch gets accepted we will also add this change to the
> next revision of the POSIX spec.
>
> To test the patch, use the following little program. Adjust the
> value of F_DUPFD_CLOEXEC appropriately.

I think new system calls would have been a cleaner way to accomplish this.
The "small pill at a time" approach may have a better chance of going in,
but will likely result in an uglier userspace interface.
In any case, this is better than *nothing*, if it makes it easier to use
fds inside system libraries.



- Davide


2007-09-28 18:24:14

by Ulrich Drepper

Subject: Re: F_DUPFD_CLOEXEC implementation

Davide Libenzi wrote:
> I think new system calls would have been a cleaner way to accomplish this.
> The "small pill at a time" may have better chance to go in, but will
> likely result in an uglier userspace interface.

We'd need this call anyway since neither dup nor dup2 provides the
functionality of F_DUPFD (but F_DUPFD can be used to implement dup).
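
To make the difference concrete, here is a minimal sketch. The names
dup_via_fcntl and dup_at_least are made up for illustration. The
first expresses dup() through F_DUPFD; the second is the part that
neither dup() nor dup2() offers, namely "lowest free descriptor at
or above a given minimum".

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* dup() expressed through F_DUPFD: lowest free descriptor >= 0.  */
int
dup_via_fcntl (int oldfd)
{
  return fcntl (oldfd, F_DUPFD, 0);
}

/* Lowest free descriptor that is >= minfd.  dup() cannot take a
   minimum, and dup2() takes an exact target and silently closes
   whatever was open there.  */
int
dup_at_least (int oldfd, int minfd)
{
  assert (minfd >= 0);
  return fcntl (oldfd, F_DUPFD, minfd);
}
```

A library that wants to keep its descriptors out of the low range
could call dup_at_least (fd, 50), for example, without having to
know which numbers are already in use.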

For dup2() I will wait until we have a sys_indirect implementation.
I'll try to get this soon.

--
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖

2007-09-30 00:32:24

by Denys Vlasenko

Subject: Re: F_DUPFD_CLOEXEC implementation

Hi Ulrich,

On Friday 28 September 2007 18:34, Ulrich Drepper wrote:
> One more small change to extend the availability of creation of
> file descriptors with FD_CLOEXEC set. Adding a new command to
> fcntl() requires no new system call and the overall impact on
> code size is minimal.

Tangential question: do you have any idea how userspace can
safely do nonblocking read or write on a potentially-shared fd?

IIUC, currently it cannot be done without races:

old_flags = fcntl(fd, F_GETFL);
...other process may change flags!...
fcntl(fd, F_SETFL, old_flags | O_NONBLOCK);
read(fd, ...)
...other process may see flags changed under its feet!...
fcntl(fd, F_SETFL, old_flags);

Can this be fixed?
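
The reason the sequence above is visible to others is that
O_NONBLOCK is a file status flag: it lives in the open file
description, not in the descriptor. A small sketch that shows this
inside one process (the function name is made up; a pipe and dup()
stand in for the shared fd):

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Returns 1 if O_NONBLOCK set through one descriptor shows up on a
   duplicate of it, 0 if it does not, -1 on setup failure.  */
int
nonblock_visible_through_duplicate (void)
{
  int p[2];
  if (pipe (p) == -1)
    return -1;

  int copy = dup (p[0]);	/* shares the open file description */
  if (copy == -1)
    {
      close (p[0]);
      close (p[1]);
      return -1;
    }

  int flags = fcntl (p[0], F_GETFL);
  assert (flags != -1 && !(flags & O_NONBLOCK));
  fcntl (p[0], F_SETFL, flags | O_NONBLOCK);

  /* Look at the flags through the *other* descriptor.  */
  int seen = fcntl (copy, F_GETFL);

  close (copy);
  close (p[0]);
  close (p[1]);
  return seen != -1 && (seen & O_NONBLOCK) != 0;
}
```

The same sharing applies to descriptors inherited across fork().
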
--
vda

2007-09-30 23:11:43

by Davide Libenzi

Subject: Re: F_DUPFD_CLOEXEC implementation

On Sun, 30 Sep 2007, Denys Vlasenko wrote:

> Hi Ulrich,
>
> On Friday 28 September 2007 18:34, Ulrich Drepper wrote:
> > One more small change to extend the availability of creation of
> > file descriptors with FD_CLOEXEC set. Adding a new command to
> > fcntl() requires no new system call and the overall impact on
> > code size is minimal.
>
> Tangential question: do you have any idea how userspace can
> safely do nonblocking read or write on a potentially-shared fd?
>
> IIUC, currently it cannot be done without races:
>
> old_flags = fcntl(fd, F_GETFL);
> ...other process may change flags!...
> fcntl(fd, F_SETFL, old_flags | O_NONBLOCK);
> read(fd, ...)
> ...other process may see flags changed under its feet!...
> fcntl(fd, F_SETFL, old_flags);
>
> Can this be fixed?

I'm not sure I understood your use case correctly. But if you have two
processes/threads randomly switching O_NONBLOCK on and off, your problems
arise not only at F_SETFL time.
If one of the tasks is not expecting an fd to be O_NONBLOCK, it will
likely end up handling read/write-miss situations incorrectly.
In that case it'd be better to keep the fd as O_NONBLOCK, and manually
create blocking behaviour (when needed) with poll+read/write.



- Davide


2007-09-30 23:59:12

by Denys Vlasenko

Subject: Re: F_DUPFD_CLOEXEC implementation

On Monday 01 October 2007 00:11, Davide Libenzi wrote:
> On Sun, 30 Sep 2007, Denys Vlasenko wrote:
>
> > Hi Ulrich,
> >
> > On Friday 28 September 2007 18:34, Ulrich Drepper wrote:
> > > One more small change to extend the availability of creation of
> > > file descriptors with FD_CLOEXEC set. Adding a new command to
> > > fcntl() requires no new system call and the overall impact on
> > > code size is minimal.
> >
> > Tangential question: do you have any idea how userspace can
> > safely do nonblocking read or write on a potentially-shared fd?
> >
> > IIUC, currently it cannot be done without races:
> >
> > old_flags = fcntl(fd, F_GETFL);
> > ...other process may change flags!...
> > fcntl(fd, F_SETFL, old_flags | O_NONBLOCK);
> > read(fd, ...)
> > ...other process may see flags changed under its feet!...
> > fcntl(fd, F_SETFL, old_flags);
> >
> > Can this be fixed?
>
> I'm not sure I understood correctly your use case. But, if you have two
> processes/threads randomly switching O_NONBLOCK on/off, your problems
> arise not only at F_SETFL time.

My use case is: I want to do a nonblocking read on descriptor 0 (stdin).
It may be a pipe or a socket.

There may be other processes which share this descriptor with me,
I simply cannot know that. And they, too, may want to do reads on it.

I want to do nonblocking read in such a way that neither those other
processes will ever see fd switching to O_NONBLOCK and back, and
I also want to be safe from other processes doing the same.

I don't see how this can be done using standard unix primitives.
--
vda

2007-10-01 01:15:39

by Miquel van Smoorenburg

Subject: Re: F_DUPFD_CLOEXEC implementation

In article <[email protected]>,
Denys Vlasenko <[email protected]> wrote:
>Hi Ulrich,
>
>On Friday 28 September 2007 18:34, Ulrich Drepper wrote:
>> One more small change to extend the availability of creation of
>> file descriptors with FD_CLOEXEC set. Adding a new command to
>> fcntl() requires no new system call and the overall impact on
>> code size is minimal.
>
>Tangential question: do you have any idea how userspace can
>safely do nonblocking read or write on a potentially-shared fd?
>
>IIUC, currently it cannot be done without races:
>
>old_flags = fcntl(fd, F_GETFL);
>...other process may change flags!...
>fcntl(fd, F_SETFL, old_flags | O_NONBLOCK);
>read(fd, ...)
>...other process may see flags changed under its feet!...
>fcntl(fd, F_SETFL, old_flags);
>
>Can this be fixed?

This is for sockets, right? Just use recv() instead of read().

n = recv(filedesc, buffer, buflen, MSG_DONTWAIT);

.. is equivalent to setting O_NONBLOCK. See "man recv".
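
A small self-contained sketch of this, using a socketpair() as a
stand-in for the shared socket (the function name is made up). It
checks both halves of the claim: the call does not block, and the
file status flags are left alone.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/socket.h>
#include <unistd.h>

/* Returns 1 if recv(MSG_DONTWAIT) on an empty socket fails with
   EAGAIN/EWOULDBLOCK while O_NONBLOCK stays clear, 0 otherwise,
   -1 on setup failure.  */
int
recv_dontwait_probe (void)
{
  int sv[2];
  if (socketpair (AF_UNIX, SOCK_STREAM, 0, sv) == -1)
    return -1;

  char c;
  ssize_t n = recv (sv[0], &c, 1, MSG_DONTWAIT);
  int would_block = n == -1 && (errno == EAGAIN || errno == EWOULDBLOCK);

  /* The per-call flag must not have touched the shared flags.  */
  int flags = fcntl (sv[0], F_GETFL);
  assert (flags != -1);
  int untouched = !(flags & O_NONBLOCK);

  close (sv[0]);
  close (sv[1]);
  return would_block && untouched;
}
```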

Mike.

2007-10-01 03:15:28

by Davide Libenzi

Subject: Re: F_DUPFD_CLOEXEC implementation

On Mon, 1 Oct 2007, Denys Vlasenko wrote:

> My use case is: I want to do a nonblocking read on descriptor 0 (stdin).
> It may be a pipe or a socket.
>
> There may be other processes which share this descriptor with me,
> I simply cannot know that. And they, too, may want to do reads on it.
>
> I want to do nonblocking read in such a way that neither those other
> processes will ever see fd switching to O_NONBLOCK and back, and
> I also want to be safe from other processes doing the same.
>
> I don't see how this can be done using standard unix primitives.

Indeed. You could simulate non-blocking using poll with zero timeout, but
if another task may read/write on it, your following read/write may end up
blocking even after a poll returned the required events.
One way to solve this would be some sort of readx/writex where you pass an
extra flags parameter (this could be done with sys_indirect, assuming
we ever get that into mainline) where you specify the non-blocking
requirement for this call only, and not as a global per-file flag. Then, of
course, you'll have to modify all the "file->f_flags & O_NONBLOCK" tests
(and there are many of them) to check for that flag too (that can be a
per-task_struct flag).
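
For reference, the poll-with-zero-timeout emulation looks roughly
like this. It is a sketch only; try_read is a made-up name, and the
comment marks the window that makes it unsafe on a shared fd:

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <unistd.h>

/* Emulated non-blocking read: fails with EAGAIN when poll() reports
   nothing readable.  Does not touch the file status flags.  */
ssize_t
try_read (int fd, void *buf, size_t len)
{
  assert (buf != NULL);

  struct pollfd pfd = { .fd = fd, .events = POLLIN };
  int r = poll (&pfd, 1, 0);	/* zero timeout: just a peek */
  if (r == -1)
    return -1;
  if (r == 0)
    {
      errno = EAGAIN;
      return -1;
    }

  /* RACE: if another task drains the fd between poll() and read(),
     this read() blocks after all.  */
  return read (fd, buf, len);
}
```

With a single reader this behaves like a non-blocking read; with
several readers on the same file it is only a best effort.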



- Davide


2007-10-01 10:07:48

by Denys Vlasenko

Subject: Re: F_DUPFD_CLOEXEC implementation

On Monday 01 October 2007 04:15, Davide Libenzi wrote:
> On Mon, 1 Oct 2007, Denys Vlasenko wrote:
>
> > My use case is: I want to do a nonblocking read on descriptor 0 (stdin).
> > It may be a pipe or a socket.
> >
> > There may be other processes which share this descriptor with me,
> > I simply cannot know that. And they, too, may want to do reads on it.
> >
> > I want to do nonblocking read in such a way that neither those other
> > processes will ever see fd switching to O_NONBLOCK and back, and
> > I also want to be safe from other processes doing the same.
> >
> > I don't see how this can be done using standard unix primitives.
>
> Indeed. You could simulate non-blocking using poll with zero timeout, but
> if another task may read/write on it, your following read/write may end up
> blocking even after a poll returned the required events.
> One way to solve this would be some sort of readx/writex where you pass an
> extra flags parameter

We have that already. They are called send and recv. ;)

> (this could be done with sys_indirect, assuming
> we'll ever get that mainline) where you specify the non-blocking
> requirement for-this-call, and not as global per-file flag. Then, of
> course, you'll have to modify all the "file->f_flags & O_NONBLOCK" tests
> (and there are many of them) to check for that flag too (that can be a
> per task_struct flag).

The attached patch detects send/recv(fd, buf, size, MSG_DONTWAIT) on
non-sockets and turns them into non-blocking write/read.
Since filp->f_flags appears to be read and modified without any locking,
I cannot modify it without potentially affecting other processes
accessing the same file through a shared struct file.

Therefore I simply make a temporary copy of struct file, set
O_NONBLOCK in it and pass it to vfs_read/write.
Is this heresy? ;) I see only one spinlock in struct file:

#ifdef CONFIG_EPOLL
spinlock_t f_ep_lock;
#endif /* #ifdef CONFIG_EPOLL */

Do I need to take it?

Also attached is ndelaytest.c which can be used to test that
send(MSG_DONTWAIT) indeed is failing with EAGAIN if write would block
and that other processes never see O_NONBLOCK set.
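
This is not the attached ndelaytest.c, just a minimal sketch of the
baseline it is meant to change (the function name is made up). On a
kernel without such a patch, recv() on a non-socket is rejected with
ENOTSOCK before any read is attempted:

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <sys/socket.h>
#include <unistd.h>

/* Returns 1 if recv(MSG_DONTWAIT) on a pipe fails with ENOTSOCK
   (the unpatched behaviour), 0 otherwise, -1 on setup failure.  */
int
recv_on_pipe_is_enotsock (void)
{
  int p[2];
  if (pipe (p) == -1)
    return -1;

  /* Put a byte in the pipe so that a plain read() would succeed.  */
  ssize_t w = write (p[1], "x", 1);
  assert (w == 1);

  char c;
  ssize_t n = recv (p[0], &c, 1, MSG_DONTWAIT);
  int rejected = n == -1 && errno == ENOTSOCK;

  close (p[0]);
  close (p[1]);
  return rejected;
}
```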

Comments?
--
vda


Attachments:
(No filename) (2.14 kB)
ndelaytest.c (1.43 kB)
nonblock_linux-2.6.22-rc6.patch (2.83 kB)

2007-10-01 18:17:04

by Al Viro

Subject: Re: F_DUPFD_CLOEXEC implementation

On Mon, Oct 01, 2007 at 11:07:15AM +0100, Denys Vlasenko wrote:
> Also attached is ndelaytest.c which can be used to test that
> send(MSG_DONTWAIT) indeed is failing with EAGAIN if write would block
> and that other processes never see O_NONBLOCK set.
>
> Comments?

Never send patches during or approaching hangover?
* it's on a bunch of cyclic lists. Have its neighbor
go away while you are doing all that crap => boom
* there's that thing called current position... It gets buggered.
* overwriting it while another task might be in the middle of
syscall involving it => boom
* non-cooperative tasks reading *in* *parallel* from the same
opened file are going to have a lot more serious problems than agreeing
on O_NONBLOCK anyway, so I really don't understand what the hell is that for.

2007-10-01 18:49:23

by Denys Vlasenko

Subject: Re: F_DUPFD_CLOEXEC implementation

On Monday 01 October 2007 19:16, Al Viro wrote:
> * it's on a bunch of cyclic lists. Have its neighbor
> go away while you are doing all that crap => boom
> * there's that thing called current position... It gets buggered.
> * overwriting it while another task might be in the middle of
> syscall involving it => boom

Hm, I suspected that it's heresy. Any idea how to do it cleanly?

> * non-cooperative tasks reading *in* *parallel* from the same
> opened file are going to have a lot more serious problems than agreeing
> on O_NONBLOCK anyway, so I really don't understand what the hell is that for.

They don't even need to read in parallel, just having shared fd is enough.
Think about pipes, sockets and terminals. A real-world scenario:

* a process started from shell (interactive or shell script)
* it sets O_NONBLOCK and does a read from fd 0...
* it gets killed (kill -9, whatever)
* shell suddenly has its fd 0 in O_NONBLOCK mode
* shell and all subsequent commands started from it unexpectedly have
O_NONBLOCKed stdin.
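
The scenario can be reproduced in a few lines. This is a sketch with
a made-up function name; a pipe stands in for the terminal and
_exit() stands in for the kill:

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Returns 1 if the parent finds O_NONBLOCK set on its descriptor
   after a child set it and died, 0 if not, -1 on setup failure.  */
int
child_leaves_nonblock_behind (void)
{
  int p[2];
  if (pipe (p) == -1)
    return -1;

  pid_t pid = fork ();
  if (pid == -1)
    {
      close (p[0]);
      close (p[1]);
      return -1;
    }
  if (pid == 0)
    {
      /* Child: switch the inherited fd to non-blocking mode and
	 die before getting a chance to restore it.  */
      int flags = fcntl (p[0], F_GETFL);
      fcntl (p[0], F_SETFL, flags | O_NONBLOCK);
      _exit (0);
    }

  int status;
  pid_t done = waitpid (pid, &status, 0);
  assert (done == pid);

  /* Parent: it never called F_SETFL itself.  */
  int flags = fcntl (p[0], F_GETFL);
  close (p[0]);
  close (p[1]);
  return flags != -1 && (flags & O_NONBLOCK) != 0;
}
```
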
--
vda

2007-10-01 19:03:14

by Michael Tokarev

Subject: Re: F_DUPFD_CLOEXEC implementation

Al Viro wrote:
> On Mon, Oct 01, 2007 at 11:07:15AM +0100, Denys Vlasenko wrote:
>> Also attached is ndelaytest.c which can be used to test that
>> send(MSG_DONTWAIT) indeed is failing with EAGAIN if write would block
>> and that other processes never see O_NONBLOCK set.
>>
>> Comments?
>
> Never send patches during or approaching hangover?
> * it's on a bunch of cyclic lists. Have its neighbor
> go away while you are doing all that crap => boom
> * there's that thing called current position... It gets buggered.
> * overwriting it while another task might be in the middle of
> syscall involving it => boom
> * non-cooperative tasks reading *in* *parallel* from the same
> opened file are going to have a lot more serious problems than agreeing
> on O_NONBLOCK anyway, so I really don't understand what the hell is that for.

Good summary... ;)

But for the last part of the last item - sometimes, definitely more than
once, I wondered why there's no equivalent to recv(MSG_DONTWAIT) for
non-sockets -- why for sockets it's as simple as adding an option (a
single bit), while for all the rest it requires two fcntl calls...
Sometimes it's handy. ;)

Not that I'm arguing for or against such a feature anyway..

/mjt

2007-10-01 19:05:13

by Davide Libenzi

Subject: Re: F_DUPFD_CLOEXEC implementation

On Mon, 1 Oct 2007, Denys Vlasenko wrote:

> On Monday 01 October 2007 19:16, Al Viro wrote:
> > * it's on a bunch of cyclic lists. Have its neighbor
> > go away while you are doing all that crap => boom
> > * there's that thing called current position... It gets buggered.
> > * overwriting it while another task might be in the middle of
> > syscall involving it => boom
>
> Hm, I suspected that it's heresy. Any idea how to do it cleanly?
>
> > * non-cooperative tasks reading *in* *parallel* from the same
> > opened file are going to have a lot more serious problems than agreeing
> > on O_NONBLOCK anyway, so I really don't understand what the hell is that for.
>
> They don't even need to read in parallel, just having shared fd is enough.
> Think about pipes, sockets and terminals. A real-world scenario:
>
> * a process started from shell (interactive or shell script)
> * it sets O_NONBLOCK and does a read from fd 0...
> * it gets killed (kill -9, whatever)
> * shell suddenly has its fd 0 in O_NONBLOCK mode
> * shell and all subsequent commands started from it unexpectedly have
> O_NONBLOCKed stdin.

I told you how in the previous email. You cannot use the:

1) set O_NONBLOCK
2) read/write
3) unset O_NONBLOCK

in a race-free fashion without wrapping it in a lock (something that we
don't want to do).



PS: send/recv are socket functions, and you really don't want to overload
them for other fds.



- Davide


2007-10-02 09:28:31

by Denys Vlasenko

Subject: Re: F_DUPFD_CLOEXEC implementation

On Monday 01 October 2007 20:04, Davide Libenzi wrote:
> > They don't even need to read in parallel, just having shared fd is enough.
> > Think about pipes, sockets and terminals. A real-world scenario:
> >
> > * a process started from shell (interactive or shell script)
> > * it sets O_NONBLOCK and does a read from fd 0...
> > * it gets killed (kill -9, whatever)
> > * shell suddenly has its fd 0 in O_NONBLOCK mode
> > * shell and all subsequent commands started from it unexpectedly have
> > O_NONBLOCKed stdin.
>
> I told you how in the previous email. You cannot use the:
>
> 1) set O_NONBLOCK
> 2) read/write
> 3) unset O_NONBLOCK
>
> in a race-free fashion without wrapping it in a lock (something that we
> don't want to do).

I'm confused. I am saying exactly this same thing: that I cannot
do it atomically using standard unix operations, but I still need
to do a nonblocking read. Why are you explaining to me that it
cannot be done? I *know*. I'm asking what API should be
added/extended to make it possible.

I have the following proposals:

* make recv(..., MSG_DONTWAIT) work on any fd

Sounds neat, but not trivial to implement in current kernel.

* new fcntl command F_DUPFL: fcntl(fd, F_DUPFL, n):
Analogous to F_DUPFD, but gives you *unshared* copy of the fd.
Further seeks, fcntl(fd, F_SETFL, O_NONBLOCK), etc won't affect
any other process.

How hard would it be to implement F_DUPFL in the current kernel?
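
For contrast, this is what F_DUPFD gives today: the duplicate shares
everything kept in the open file description, including the file
position. A sketch (made-up function name, tmpfile() used only to
get a seekable file); the proposed F_DUPFL would be the variant for
which this returns 0.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Returns 1 if a seek through one descriptor is visible through
   its F_DUPFD duplicate, 0 if not, -1 on setup failure.  */
int
dupfd_shares_offset (void)
{
  FILE *f = tmpfile ();
  if (f == NULL)
    return -1;

  int fd = fileno (f);
  int copy = fcntl (fd, F_DUPFD, 0);
  if (copy == -1)
    {
      fclose (f);
      return -1;
    }

  off_t moved = lseek (fd, 42, SEEK_SET);
  assert (moved == 42);

  /* Ask for the position through the duplicate, without moving.  */
  off_t seen = lseek (copy, 0, SEEK_CUR);

  close (copy);
  fclose (f);
  return seen == 42;
}
```
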
--
vda

2007-10-02 19:52:52

by Davide Libenzi

Subject: Re: F_DUPFD_CLOEXEC implementation

On Tue, 2 Oct 2007, Denys Vlasenko wrote:

> I have the following proposals:
>
> * make recv(..., MSG_DONTWAIT) work on any fd
>
> Sounds neat, but not trivial to implement in current kernel.

This is mildly ugly, if you ask me. Those are socket functions, and the
flags parameter carries some pretty network-specific meanings.



> * new fcntl command F_DUPFL: fcntl(fd, F_DUPFL, n):
> Analogous to F_DUPFD, but gives you *unshared* copy of the fd.
> Further seeks, fcntl(fd, F_SETFL, O_NONBLOCK), etc won't affect
> any other process.

You'll need an ad-hoc copy function though, since your memcpy-based one is
gonna explode even before memcpy returns ;) You'll have problems with
ref-counting too. And that layer is not designed to cleanly support that
operation.
Unfortunately the "clean" solution would involve changing a whole bunch of
code, and I don't feel exactly sure it'd be worth it.



- Davide