2014-06-01 13:59:29

by Yann Droneaud

[permalink] [raw]
Subject: [PATCHv7 0/6] Getting rid of get_unused_fd() / enable close-on-exec

TL;DR;

These are mostly obvious patches, easy to review, easy to apply:
you want them in the kernel. Now. And get_unused_fd() deserves to
die. Act today !

Hi,

Please find the seventh revision of my patchset to remove get_unused_fd()
macro in order to help subsystems to use get_unused_fd_flags() (or
anon_inode_getfd()) with flags either provided by the userspace or
set to O_CLOEXEC by default where appropriate.

Without get_unused_fd() macro, more subsystems are likely to use
get_unused_fd_flags() (or anon_inode_getfd()) and be taught to
provide an API that let userspace choose the opening flags
of the file descriptor.

Not allowing userspace to provide the "open" flags or not using O_CLOEXEC
by default should be considered bad practice from security point of view:
in most case O_CLOEXEC must be used to not leak file descriptor across
exec().

Not allowing userspace to atomically set close-on-exec flag and not using
O_CLOEXEC should be avoided to protect multi-threaded program
from race condition when it tried to set close-on-exec flag using
fcntl(fd, F_SETFD, FD_CLOEXEC) after opening the file descriptor.

Example:

int fd;
int ret;

fd = open(filename, O_RDONLY);
if (fd < 0) {
perror("open");
return -1;
}

/*
* window opened for another thread to call fork(),
* then the new process can call exec() at any time
* and the file descriptor would be inherited
*/

ret = fcntl(fd, F_SETFD, FD_CLOEXEC)
if (ret < 0) {
perror("fnctl()");
close(fd);
return -1;
}

vs.:

int fd;
fd = open(filaneme, O_RDONLY | O_CLOEXEC);
if (fd < 0) {
perror("open");
return -1;
}

Using O_CLOEXEC by default when flags are not (eg. cannot be) provided
by userspace is the safest bet as it allows userspace to choose, if, when
and where the file descriptor is going to be inherited across exec():
userspace is free to call fcntl() to remove FD_CLOEXEC flag in the child
process that will call exec().

Unfortunately, O_CLOEXEC cannot be made the default for most existing system
calls as it's not the default behavior for POSIX / Unix. Reader interested
in this issue could have a look at "Ghosts of Unix past, part 2: Conflated
designs" [1] article by Neil Brown.

FAQ:

- Why do one want close-on-exec ?

Setting close-on-exec flag on file descriptor ensure it won't be inherited
silently by child, child of child, etc. when executing another program.

If the file descriptor is not closed, some kernel resources can be locked
until the last process with the opened file descriptor exit.

If the file descriptor is not closed, this can lead to a security issue,
eg. making resources available to a less privileged program allowing
information leak and/or deny of service.

- Why do one need atomic close-on-exec ?

Even if it's possible to set close-on-exec flag through call to fcntl()
as shown previously, it introduces a race condition in multi-threaded
process, where a thread call fork() / exec() while another thread
is between call to open() and fcntl().

Additionally, using close-on-exec free the programmer from tracking manually
which file descriptor is to close in a child before calling exec():
in a program using multiple third-party libraries, it's difficult to know
which file descriptor must be closed.
AFAIK, while there's a atexit(), pthread_atfork(), there's no atexec()
userspace function in libc to allow libraries to register a handler in
order to close its opened file descriptor before exec().

- Why default to close-on-exec ?

Some kernel interfaces don't allow userspace to pass a O_CLOEXEC-like
flag when creating a new file descriptor.

In such cases, if possible (see below), O_CLOEXEC must be made
the default so that userspace doesn't have to call fcntl()
which, as demonstrated previously, is open to race condition in
multi-threaded program.

- How to choose between flag 0 or O_CLOEXEC in call to
get_unused_fd_flags() (or anon_inode_getfd()) ?

Short: Will it break existing application ? Will it break kernel ABI ?

If answer is no, use O_CLOEXEC.
If answer is yes, use 0.

Long: If userspace expect to retrieve a file descriptor with plain
old Unix(tm) semantics, O_CLOEXEC must not be the default, as it
could break some applications expecting that the file descriptor
will be inherited across exec().

But for some subsystems, such as InfiniBand, KVM, VFIO, it makes
no sense to have file descriptors inherited across exec() since
those are tied to resources that will vanish when a another program
will replace the current one by mean of exec(), so it's safe to use
O_CLOEXEC in such cases.

For others, like XFS, the file descriptor is retrieved by one
program and will be used by a different program, executed as a
child. In this case, setting O_CLOEXEC would break existing
application which do not expect to have to call fcntl(fd,
F_SETFD, 0) to make it available across exec().

If file descriptor created by a subsystem is not tied to the
current process resources, it's likely legal to use it in a
different process context, thus O_CLOEXEC must not be the default.

If one, as a subsystem maintainer, cannot tell for sure that
no userspace program ever rely current behavior, eg. file descriptor
being inherited across exec(), then the default flag *must*
be kept 0 to not break application.

- This subsystem cannot be turned to use O_CLOEXEC by default:

If O_CLOEXEC cannot be made the default, it would be interesting
to think to extend the API to have a (set of) function(s) taking
a flag parameter so that userspace can atomically request close-on-exec
if it need it (and it should need it).

- Background:

One might want to read "Secure File Descriptor Handling" [2] by
Ulrich Drepper who is responsible of adding O_CLOEXEC flag on open(),
and flags alike on other syscalls.

One might also want to read PEP-446 "Make newly created file descriptors
non-inheritable" [3] by Victor Stinner since it has lot more background
information on file descriptor leaking.

One also like to read "Excuse me son, but your code is leaking !!!" [4]
by Dan Walsh for advice.

[1] http://lwn.net/Articles/412131/
[2] http://udrepper.livejournal.com/20407.html
[3] http://www.python.org/dev/peps/pep-0446/
[4] http://danwalsh.livejournal.com/53603.html

- Statistics:

In linux-next tag next-20140530, they're currently:

- 32 calls to fd_install()
with one call part of anon_inode_getfd()
- 25 calls to get_unused_fd_flags()
with one call part of anon_inode_getfd()
with another part of get_unused_fd() macro
- 11 calls to anon_inode_getfd()
- 9 calls to anon_inode_getfile()
with one call part of anon_inode_getfd()
- 6 calls to get_unused_fd()

Changes from patchset v6 [PATCHSETv6]

- Rebased on top of latest -next
- Added Cc: [email protected] for the first trivials
patches.

Changes from patchset v5 [PATCHSETv5]

- perf: introduce a flag to enable close-on-exec in perf_event_open()
DROPPED: applied upstream, commit a21b0b354d4a.

Changes from patchset v4 [PATCHSETv4]:

- rewrote cover letter following discussion with perf maintainer.
Thanks to Peter Zijlstra.

- modified a bit some commit messages.

- events: use get_unused_fd_flags(0) instead of get_unused_fd()
DROPPED: replaced by following patch.

- perf: introduce a flag to enable close-on-exec in perf_event_open()
NEW: instead of hard coding the flags to 0, this patch
allows userspace to specify close-on-exec flag.

- fanotify: use get_unused_fd_flags(0) instead of get_unused_fd()
DROPPED: replaced by following patch.

- fanotify: enable close-on-exec on events' fd when requested in
fanotify_init()
NEW: instead of hard coding the flags to 0, this patch
enable close-on-exec if userspace request it.

Changes from patchset v3 [PATCHSETv3]:

- industrialio: use anon_inode_getfd() with O_CLOEXEC flag
DROPPED: applied upstream, commit a646fbf0fd11.

Changes from patchset v2 [PATCHSETv2]:

- android/sw_sync: use get_unused_fd_flags(O_CLOEXEC) instead of get_unused_fd()
DROPPED: applied upstream, commit 45acea57335e.

- android/sync: use get_unused_fd_flags(O_CLOEXEC) instead of get_unused_fd()
DROPPED: applied upstream, commit 9c6cd3b39048.

- vfio: use get_unused_fd_flags(0) instead of get_unused_fd()
DROPPED: applied upstream, commit a5d550703d2c.
Additionally subsystem maintainer applied another patch on top
to set the flags to O_CLOEXEC, commit 5d042fbdbb2d.

- industrialio: use anon_inode_getfd() with O_CLOEXEC flag
NEW: propose to use O_CLOEXEC as default flag.

Changes from patchset v1 [PATCHSETv1]:

- explicitly added subsystem maintainers as mail recepients.

- infiniband: use get_unused_fd_flags(0) instead of get_unused_fd()
DROPPED: subsystem maintainer applied another patch
using get_unused_fd_flags(O_CLOEXEC) as suggested,
commit da183c7af844.

- android/sw_sync: use get_unused_fd_flags(0) instead of get_unused_fd()
MODIFIED: use get_unused_fd_flags(O_CLOEXEC) as suggested.

- android/sync: use get_unused_fd_flags(0) instead of get_unused_fd()
MODIFIED: use get_unused_fd_flags(O_CLOEXEC) as suggested.

- xfs: use get_unused_fd_flags(0) instead of get_unused_fd()
DROPPED: applied asis by subsystem maintainer, commit 862a62937e76.

- sctp: use get_unused_fd_flags(0) instead of get_unused_fd()
DROPPED: applied asis by subsystem maintainer, commit 8a59bd3e9b29.

Links:

[PATCHSETv6]
http://lkml.kernel.org/r/[email protected]

[PATCHSETv5]
http://lkml.kernel.org/r/[email protected]

[PATCHSETv4]
http://lkml.kernel.org/r/[email protected]

[PATCHSETv3]
http://lkml.kernel.org/r/[email protected]

[PATCHSETv2]
http://lkml.kernel.org/r/[email protected]

[PATCHSETv1]
http://lkml.kernel.org/r/[email protected]

Yann Droneaud (6):
ia64: use get_unused_fd_flags(0) instead of get_unused_fd()
ppc/cell: use get_unused_fd_flags(0) instead of get_unused_fd()
binfmt_misc: use get_unused_fd_flags(0) instead of get_unused_fd()
file: use get_unused_fd_flags(0) instead of get_unused_fd()
fanotify: enable close-on-exec on events' fd when requested in
fanotify_init()
file: remove macro get_unused_fd()

arch/ia64/kernel/perfmon.c | 2 +-
arch/powerpc/platforms/cell/spufs/inode.c | 4 ++--
fs/binfmt_misc.c | 2 +-
fs/file.c | 2 +-
fs/notify/fanotify/fanotify_user.c | 2 +-
include/linux/file.h | 1 -
6 files changed, 6 insertions(+), 7 deletions(-)

--
1.9.3


2014-06-01 13:59:23

by Yann Droneaud

[permalink] [raw]
Subject: [PATCHv7 1/6] ia64: use get_unused_fd_flags(0) instead of get_unused_fd()

Macro get_unused_fd() is used to allocate a file descriptor with
default flags. Those default flags (0) can be "unsafe":
O_CLOEXEC must be used by default to not leak file descriptor
across exec().

Instead of macro get_unused_fd(), function get_unused_fd_flags()
(or anon_inode_getfd()) should be used with flags given by userspace.
If not possible, flags should be set to O_CLOEXEC to provide userspace
with a default safe behavor.

In a further patch, get_unused_fd() will be removed so that new code
start using get_unused_fd_flags() (or anon_inode_getfd()) with correct
flags.

This patch replaces calls to get_unused_fd() with equivalent call to
get_unused_fd_flags(0) to preserve current behavor for existing code.

The hard coded flag value (0) should be reviewed on a per-subsystem
basis, and, if possible, set to O_CLOEXEC.

Link: http://lkml.kernel.org/r/[email protected]
Cc: Al Viro <[email protected]>
Cc: Andrew Morton <[email protected]>
Signed-off-by: Yann Droneaud <[email protected]>
---
arch/ia64/kernel/perfmon.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/ia64/kernel/perfmon.c b/arch/ia64/kernel/perfmon.c
index 5845ffea67c3..dc063fe6646a 100644
--- a/arch/ia64/kernel/perfmon.c
+++ b/arch/ia64/kernel/perfmon.c
@@ -2662,7 +2662,7 @@ pfm_context_create(pfm_context_t *ctx, void *arg, int count, struct pt_regs *reg

ret = -ENOMEM;

- fd = get_unused_fd();
+ fd = get_unused_fd_flags(0);
if (fd < 0)
return fd;

--
1.9.3

2014-06-01 13:59:35

by Yann Droneaud

[permalink] [raw]
Subject: [PATCHv7 2/6] ppc/cell: use get_unused_fd_flags(0) instead of get_unused_fd()

Macro get_unused_fd() is used to allocate a file descriptor with
default flags. Those default flags (0) can be "unsafe":
O_CLOEXEC must be used by default to not leak file descriptor
across exec().

Instead of macro get_unused_fd(), function get_unused_fd_flags()
(or anon_inode_getfd()) should be used with flags given by userspace.
If not possible, flags should be set to O_CLOEXEC to provide userspace
with a default safe behavor.

In a further patch, get_unused_fd() will be removed so that new code
start using get_unused_fd_flags() (or anon_inode_getfd()) with correct
flags.

This patch replaces calls to get_unused_fd() with equivalent call to
get_unused_fd_flags(0) to preserve current behavor for existing code.

The hard coded flag value (0) should be reviewed on a per-subsystem
basis, and, if possible, set to O_CLOEXEC.

Link: http://lkml.kernel.org/r/[email protected]
Cc: Al Viro <[email protected]>
Cc: Andrew Morton <[email protected]>
Signed-off-by: Yann Droneaud <[email protected]>
---
arch/powerpc/platforms/cell/spufs/inode.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/platforms/cell/spufs/inode.c b/arch/powerpc/platforms/cell/spufs/inode.c
index 87ba7cf99cd7..51effcec30d8 100644
--- a/arch/powerpc/platforms/cell/spufs/inode.c
+++ b/arch/powerpc/platforms/cell/spufs/inode.c
@@ -301,7 +301,7 @@ static int spufs_context_open(struct path *path)
int ret;
struct file *filp;

- ret = get_unused_fd();
+ ret = get_unused_fd_flags(0);
if (ret < 0)
return ret;

@@ -518,7 +518,7 @@ static int spufs_gang_open(struct path *path)
int ret;
struct file *filp;

- ret = get_unused_fd();
+ ret = get_unused_fd_flags(0);
if (ret < 0)
return ret;

--
1.9.3

2014-06-01 13:59:43

by Yann Droneaud

[permalink] [raw]
Subject: [PATCHv7 4/6] file: use get_unused_fd_flags(0) instead of get_unused_fd()

Macro get_unused_fd() is used to allocate a file descriptor with
default flags. Those default flags (0) can be "unsafe":
O_CLOEXEC must be made the default to not leak file descriptor
across exec().

In a further patch, get_unused_fd() will be removed so that
new code start using get_unused_fd_flags() (or anon_inode_getfd())
with correct flags.

This patch replaces calls to get_unused_fd() with equivalent call to
get_unused_fd_flags(0).

Link: http://lkml.kernel.org/r/[email protected]
Cc: Al Viro <[email protected]>
Cc: Andrew Morton <[email protected]>
Signed-off-by: Yann Droneaud <[email protected]>
---
fs/file.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/file.c b/fs/file.c
index 66923fe3176e..4186f81cd9f1 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -868,7 +868,7 @@ SYSCALL_DEFINE1(dup, unsigned int, fildes)
struct file *file = fget_raw(fildes);

if (file) {
- ret = get_unused_fd();
+ ret = get_unused_fd_flags(0);
if (ret >= 0)
fd_install(ret, file);
else
--
1.9.3

2014-06-01 13:59:40

by Yann Droneaud

[permalink] [raw]
Subject: [PATCHv7 3/6] binfmt_misc: use get_unused_fd_flags(0) instead of get_unused_fd()

Macro get_unused_fd() is used to allocate a file descriptor with
default flags. Those default flags (0) can be "unsafe":
O_CLOEXEC must be used by default to not leak file descriptor
across exec().

Instead of macro get_unused_fd(), function get_unused_fd_flags()
(or anon_inode_getfd()) should be used with flags given by userspace.
If not possible, flags should be set to O_CLOEXEC to provide userspace
with a default safe behavor.

In a further patch, get_unused_fd() will be removed so that new code
start using get_unused_fd_flags() (or anon_inode_getfd()) with correct
flags.

This patch replaces calls to get_unused_fd() with equivalent call to
get_unused_fd_flags(0) to preserve current behavor for existing code.

The hard coded flag value (0) should be reviewed on a per-subsystem
basis, and, if possible, set to O_CLOEXEC.

Link: http://lkml.kernel.org/r/[email protected]
Cc: Al Viro <[email protected]>
Cc: Andrew Morton <[email protected]>
Signed-off-by: Yann Droneaud <[email protected]>
---
fs/binfmt_misc.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/binfmt_misc.c b/fs/binfmt_misc.c
index b60500300dd7..3c9fcc08aec9 100644
--- a/fs/binfmt_misc.c
+++ b/fs/binfmt_misc.c
@@ -138,7 +138,7 @@ static int load_misc_binary(struct linux_binprm *bprm)
/* if the binary should be opened on behalf of the
* interpreter than keep it open and assign descriptor
* to it */
- fd_binary = get_unused_fd();
+ fd_binary = get_unused_fd_flags(0);
if (fd_binary < 0) {
retval = fd_binary;
goto _ret;
--
1.9.3

2014-06-01 13:59:49

by Yann Droneaud

[permalink] [raw]
Subject: [PATCHv7 5/6] fanotify: enable close-on-exec on events' fd when requested in fanotify_init()

According to commit 80af258867648, file descriptor created as part
of file access notification events inherit flags from the event_f_flags
argument passed to syscall fanotify_init(2).

So while it is legal for userspace to call fanotify_init() with O_CLOEXEC
as part of its second argument, O_CLOEXEC is currently silently ignored.

Indeed event_f_flags are only given to dentry_open(), which only
seems to care about O_ACCMODE and O_PATH in do_dentry_open(), O_DIRECT
in open_check_o_direct() and O_LARGEFILE in generic_file_open().

More, there's no effective check on event_f_flags value that would
catch unknown / unsupported values, unlike the one on f_flags argument
of the syscall (see FAN_ALL_INIT_FLAGS in include/uapi/linux/fanotify.h).

Reading article "Botching up ioctls"[1] by Daniel Vetter might make one
feel uncomfortable when trying to add extension to an API that doesn't
check for unrecognized values.

But it seems logical to set close-on-exec flag on the file descriptor
if userspace is allowed to request it with O_CLOEXEC.

In fact, according to some lookup on http://codesearch.debian.net/ and
various search engine, there's already some userspace code requesting it:

- in systemd's readahead[2]:

fanotify_fd = fanotify_init(FAN_CLOEXEC|FAN_NONBLOCK, O_RDONLY|O_LARGEFILE|O_CLOEXEC|O_NOATIME);

- in clsync[3]:

#define FANOTIFY_EVFLAGS (O_LARGEFILE|O_RDONLY|O_CLOEXEC)

int fanotify_d = fanotify_init(FANOTIFY_FLAGS, FANOTIFY_EVFLAGS);

- in examples [4] from "Filesystem monitoring in the Linux kernel"
article[5] by Aleksander Morgado:

if ((fanotify_fd = fanotify_init (FAN_CLOEXEC,
O_RDONLY | O_CLOEXEC | O_LARGEFILE)) < 0)

Lookup also returned some wrong usage of the syscall:

- in Gonk HAL from Mozilla Firefox OS sources[6]:

mFd = fanotify_init(FAN_CLASS_NOTIF, FAN_CLOEXEC);

Adding support for O_CLOEXEC in fanotify won't magically enable it for
Gonk since FAN_CLOEXEC is defined as 0x1, which is likely equal to
O_WRONLY when used in open flag context. In the other hand, it won't
hurt it either.

So this patch replaces call to macro get_unused_fd() by a call to
function get_unused_fd_flags() with event_f_flags value as argument.
This way O_CLOEXEC flag in the second argument of fanotify_init(2)
syscall is interpreted so that close-on-exec get enabled.

[1] http://blog.ffwll.ch/2013/11/botching-up-ioctls.html
[2] http://cgit.freedesktop.org/systemd/systemd/tree/src/readahead/readahead-collect.c?id=v208#n294
[3] https://github.com/xaionaro/clsync/blob/v0.2.1/sync.c#L1631
https://github.com/xaionaro/clsync/blob/v0.2.1/configuration.h#L38
[4] http://www.lanedo.com/~aleksander/fanotify/fanotify-example.c
[5] http://www.lanedo.com/2013/filesystem-monitoring-linux-kernel/
[6] http://hg.mozilla.org/mozilla-central/file/325c74addeba/hal/gonk/GonkDiskSpaceWatcher.cpp#l167

Link: http://lkml.kernel.org/r/[email protected]
Cc: Richard Guy Briggs <[email protected]>
Cc: Eric Paris <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: [email protected]
Signed-off-by: Yann Droneaud <[email protected]>
---
fs/notify/fanotify/fanotify_user.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/notify/fanotify/fanotify_user.c b/fs/notify/fanotify/fanotify_user.c
index 3fdc8a3e1134..6782226e2868 100644
--- a/fs/notify/fanotify/fanotify_user.c
+++ b/fs/notify/fanotify/fanotify_user.c
@@ -78,7 +78,7 @@ static int create_fd(struct fsnotify_group *group,

pr_debug("%s: group=%p event=%p\n", __func__, group, event);

- client_fd = get_unused_fd();
+ client_fd = get_unused_fd_flags(group->fanotify_data.f_flags);
if (client_fd < 0)
return client_fd;

--
1.9.3

2014-06-01 18:59:22

by Yann Droneaud

[permalink] [raw]
Subject: [PATCHv7 6/6] file: remove macro get_unused_fd()

Macro get_unused_fd() allocates a file descriptor without enabling
close-on-exec: it calls function get_unused_fd_flags() without
O_CLOEXEC flag.

This can be seen as an unsafe default: in most case close-on-exec
should be enabled to not leak file descriptor across exec().

This patch removes get_unused_fd() instead of updating it to use
O_CLOEXEC so that out of tree modules won't be affect by a runtime
behavor change which might introduce other kind of bug. It's better
to catch the change at build time, making it easier to fix.

Removing the macro will also promote use of get_unused_fd_flags()
(or anon_inode_getfd()) with flags provided by userspace. Or, if
flags cannot be given by userspace, with flags set to O_CLOEXEC set
by default.

Link: http://lkml.kernel.org/r/[email protected]
Cc: Al Viro <[email protected]>
Cc: Andrew Morton <[email protected]>
Signed-off-by: Yann Droneaud <[email protected]>
---
include/linux/file.h | 1 -
1 file changed, 1 deletion(-)

diff --git a/include/linux/file.h b/include/linux/file.h
index 4d69123377a2..f87d30882a24 100644
--- a/include/linux/file.h
+++ b/include/linux/file.h
@@ -66,7 +66,6 @@ extern void set_close_on_exec(unsigned int fd, int flag);
extern bool get_close_on_exec(unsigned int fd);
extern void put_filp(struct file *);
extern int get_unused_fd_flags(unsigned flags);
-#define get_unused_fd() get_unused_fd_flags(0)
extern void put_unused_fd(unsigned int fd);

extern void fd_install(unsigned int fd, struct file *file);
--
1.9.3