Hi,
Please find the fifth revision of my patchset to remove get_unused_fd()
macro in order to help subsystems to use get_unused_fd_flags() (or
anon_inode_getfd()) with flags either provided by the userspace or
set to O_CLOEXEC by default where appropriate.
Without get_unused_fd() macro, more subsystems are likely to use
get_unused_fd_flags() (or anon_inode_getfd()) and be teached to
provide an API that let userspace choose the opening flags
of the file descriptor.
Not allowing userspace to provide the "open" flags or not using O_CLOEXEC
by default should be considered bad practice from security point of view:
in most case O_CLOEXEC must be used to not leak file descriptor across
exec().
Not allowing userspace to atomically set close-on-exec flag and not using
O_CLOEXEC should be avoided to protect multi-threaded program
from race condition when it tried to set close-on-exec flag using
fcntl(fd, F_SETFD, FD_CLOEXEC) after opening the file descriptor.
Example:
int fd;
int ret;
fd = open(filename, O_RDONLY);
if (fd < 0) {
perror("open");
return -1;
}
/*
* window opened for another thread to call fork(),
* then the new process can call exec() at any time
* and the file descriptor would be inherited
*/
ret = fcntl(fd, F_SETFD, FD_CLOEXEC)
if (ret < 0) {
perror("fnctl()");
close(fd);
return -1;
}
vs.:
int fd;
fd = open(filaneme, O_RDONLY | O_CLOEXEC);
if (fd < 0) {
perror("open");
return -1;
}
Using O_CLOEXEC by default when flags are not (eg. cannot be) provided
by userspace is the safest bet as it allows userspace to choose, if, when
and where the file descriptor is going to be inherited across exec():
userspace is free to call fcntl() to remove FD_CLOEXEC flag in the child
process that will call exec().
Unfortunately, O_CLOEXEC cannot be made the default for most existing system
calls as it's not the default behavior for POSIX / Unix. Reader interested
in this issue could have a look at "Ghosts of Unix past, part 2: Conflated
designs" [1] article by Neil Brown.
FAQ:
- Why do one want close-on-exec ?
Setting close-on-exec flag on file descriptor ensure it won't be inherited
silently by child, child of child, etc. when executing another program.
If the file descriptor is not closed, some kernel resources can be locked
until the last process with the opened file descriptor exit.
If the file descriptor is not closed, this can lead to a security issue,
eg. making resources available to a less privileged program allowing
information leak and/or deny of service.
- Why do one need atomic close-on-exec ?
Even if it's possible to set close-on-exec flag through call to fcntl()
as shown previously, it introduces a race condition in multi-threaded
process, where a thread call fork() / exec() while another thread
is between call to open() and fcntl().
Additionally, using close-on-exec free the programmer from tracking manually
which file descriptor is to close in a child before calling exec():
in a program using multiple third-party libraries, it's difficult to know
which file descriptor must be closed.
AFAIK, while there's a atexit(), pthread_atfork(), there's no atexec()
userspace function in libc to allow libraries to register a handler in
order to close its opened file descriptor before exec().
- Why default to close-on-exec ?
Some kernel interfaces don't allow userspace to pass a O_CLOEXEC-like
flag when creating a new file descriptor.
In such cases, if possible (see below), O_CLOEXEC must be made
the default so that userspace doesn't have to call fcntl()
which, as demonstrated previously, is open to race condition in
multi-threaded program.
- How to choose between flag 0 or O_CLOEXEC in call to
get_unused_fd_flags() (or anon_inode_getfd()) ?
Short: Will it break existing application ? Will it break kernel ABI ?
If answer is no, use O_CLOEXEC.
If answer is yes, use 0.
Long: If userspace expect to retrieve a file descriptor with plain
old Unix(tm) semantics, O_CLOEXEC must not be the default, as it
could break some applications expecting that the file descriptor
will be inherited across exec().
But for some subsystems, such as InfiniBand, KVM, VFIO, it makes
no sense to have file descriptors inherited across exec() since
those are tied to resources that will vanish when a another program
will replace the current one by mean of exec(), so it's safe to use
O_CLOEXEC in such cases.
For others, like XFS, the file descriptor is retrieved by one
program and will be used by a different program, executed as a
child. In this case, setting O_CLOEXEC would break existing
application which do not expect to have to call fcntl(fd,
F_SETFD, 0) to make it available across exec().
If file descriptor created by a subsystem is not tied to the
current process resources, it's likely legal to use it in a
different process context, thus O_CLOEXEC must not be the default.
If one, as a subsystem maintainer, cannot tell for sure that
no userspace program ever rely current behavior, eg. file descriptor
being inherited across exec(), then the default flag *must*
be kept 0 to not break application.
- This subsystem cannot be turned to use O_CLOEXEC by default:
If O_CLOEXEC cannot be made the default, it would be interesting
to think to extend the API to have a (set of) function(s) taking
a flag parameter so that userspace can atomically request close-on-exec
if it need it (and it should need it).
- Background:
One might want to read "Secure File Descriptor Handling" [2] by
Ulrich Drepper who is responsible of adding O_CLOEXEC flag on open(),
and flags alike on other syscalls.
One might also want to read PEP-446 "Make newly created file descriptors
non-inheritable" [3] by Victor Stinner since it has lot more background
information on file descriptor leaking.
One also like to read "Excuse me son, but your code is leaking !!!" [4]
by Dan Walsh for advice.
[1] http://lwn.net/Articles/412131/
[2] http://udrepper.livejournal.com/20407.html
[3] http://www.python.org/dev/peps/pep-0446/
[4] http://danwalsh.livejournal.com/53603.html
- Statistics:
In linux-next tag 20131224, they're currently:
- 32 calls to fd_install()
with one call part of anon_inode_getfd()
- 24 calls to get_unused_fd_flags()
with one call part of anon_inode_getfd()
with another part of get_unused_fd() macro
- 11 calls to anon_inode_getfd()
- 8 calls to anon_inode_getfile()
with one call part of anon_inode_getfd()
- 7 calls to get_unused_fd()
Changes from patchset v4 [PATCHSETv4]:
- rewrote cover letter following discussion with perf maintainer.
Thanks to Peter Zijlstra.
- modified a bit some commit messages.
- events: use get_unused_fd_flags(0) instead of get_unused_fd()
DROPPED: replaced by following patch
- perf: introduce a flag to enable close-on-exec in perf_event_open()
NEW: instead of hard coding the flags to 0, this patch
allows userspace to specify close-on-exec flag.
- fanotify: use get_unused_fd_flags(0) instead of get_unused_fd()
DROPPED: replaced by following patch
- fanotify: enable close-on-exec on events' fd when requested in
fanotify_init()
NEW: instead of hard coding the flags to 0, this patch
enable close-on-exec if userspace request it.
Changes from patchset v3 [PATCHSETv3]:
- industrialio: use anon_inode_getfd() with O_CLOEXEC flag
DROPPED: applied upstream
Changes from patchset v2 [PATCHSETv2]:
- android/sw_sync: use get_unused_fd_flags(O_CLOEXEC) instead of get_unused_fd()
DROPPED: applied upstream
- android/sync: use get_unused_fd_flags(O_CLOEXEC) instead of get_unused_fd()
DROPPED: applied upstream
- vfio: use get_unused_fd_flags(0) instead of get_unused_fd()
DROPPED: applied upstream.
Additionally subsystem maintainer applied another patch on top
to set the flags to O_CLOEXEC.
- industrialio: use anon_inode_getfd() with O_CLOEXEC flag
NEW: propose to use O_CLOEXEC as default flag.
Changes from patchset v1 [PATCHSETv1]:
- explicitly added subsystem maintainers as mail recepients.
- infiniband: use get_unused_fd_flags(0) instead of get_unused_fd()
DROPPED: subsystem maintainer applied another patch
using get_unused_fd_flags(O_CLOEXEC) as suggested.
- android/sw_sync: use get_unused_fd_flags(0) instead of get_unused_fd()
MODIFIED: use get_unused_fd_flags(O_CLOEXEC) as suggested
- android/sync: use get_unused_fd_flags(0) instead of get_unused_fd()
MODIFIED: use get_unused_fd_flags(O_CLOEXEC) as suggested
- xfs: use get_unused_fd_flags(0) instead of get_unused_fd()
DROPPED: applied asis by subsystem maintainer.
- sctp: use get_unused_fd_flags(0) instead of get_unused_fd()
DROPPED: applied asis by subsystem maintainer.
Links:
[PATCHSETv4]
http://lkml.kernel.org/r/[email protected]
[PATCHSETv3]
http://lkml.kernel.org/r/[email protected]
[PATCHSETv2]
http://lkml.kernel.org/r/[email protected]
[PATCHSETv1]
http://lkml.kernel.org/r/[email protected]
PS: Happy new (gregorian calendar's) year 2014 and best wishes ;)
Yann Droneaud (7):
ia64: use get_unused_fd_flags(0) instead of get_unused_fd()
ppc/cell: use get_unused_fd_flags(0) instead of get_unused_fd()
binfmt_misc: use get_unused_fd_flags(0) instead of get_unused_fd()
file: use get_unused_fd_flags(0) instead of get_unused_fd()
fanotify: enable close-on-exec on events' fd when requested in
fanotify_init()
perf: introduce a flag to enable close-on-exec in perf_event_open()
file: remove macro get_unused_fd()
arch/ia64/kernel/perfmon.c | 2 +-
arch/powerpc/platforms/cell/spufs/inode.c | 4 ++--
fs/binfmt_misc.c | 2 +-
fs/file.c | 2 +-
fs/notify/fanotify/fanotify_user.c | 2 +-
include/linux/file.h | 1 -
include/uapi/linux/perf_event.h | 1 +
kernel/events/core.c | 12 +++++++++---
8 files changed, 16 insertions(+), 10 deletions(-)
--
1.8.4.2
Macro get_unused_fd() is used to allocate a file descriptor with
default flags. Those default flags (0) can be "unsafe":
O_CLOEXEC must be used by default to not leak file descriptor
across exec().
Instead of macro get_unused_fd(), function get_unused_fd_flags()
(or anon_inode_getfd()) should be used with flags given by userspace.
If not possible, flags should be set to O_CLOEXEC to provide userspace
with a default safe behavor.
In a further patch, get_unused_fd() will be removed so that new code
start using get_unused_fd_flags() (or anon_inode_getfd()) with correct
flags.
This patch replaces calls to get_unused_fd() with equivalent call to
get_unused_fd_flags(0) to preserve current behavor for existing code.
The hard coded flag value (0) should be reviewed on a per-subsystem
basis, and, if possible, set to O_CLOEXEC.
Cc: Al Viro <[email protected]>
Signed-off-by: Yann Droneaud <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
---
arch/powerpc/platforms/cell/spufs/inode.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/arch/powerpc/platforms/cell/spufs/inode.c b/arch/powerpc/platforms/cell/spufs/inode.c
index 87ba7cf99cd7..51effcec30d8 100644
--- a/arch/powerpc/platforms/cell/spufs/inode.c
+++ b/arch/powerpc/platforms/cell/spufs/inode.c
@@ -301,7 +301,7 @@ static int spufs_context_open(struct path *path)
int ret;
struct file *filp;
- ret = get_unused_fd();
+ ret = get_unused_fd_flags(0);
if (ret < 0)
return ret;
@@ -518,7 +518,7 @@ static int spufs_gang_open(struct path *path)
int ret;
struct file *filp;
- ret = get_unused_fd();
+ ret = get_unused_fd_flags(0);
if (ret < 0)
return ret;
--
1.8.4.2
Macro get_unused_fd() is used to allocate a file descriptor with
default flags. Those default flags (0) can be "unsafe":
O_CLOEXEC must be made the default to not leak file descriptor
across exec().
In a further patch, get_unused_fd() will be removed so that
new code start using get_unused_fd_flags() (or anon_inode_getfd())
with correct flags.
This patch replaces calls to get_unused_fd() with equivalent call to
get_unused_fd_flags(0).
Cc: Al Viro <[email protected]>
Signed-off-by: Yann Droneaud <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
---
fs/file.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/fs/file.c b/fs/file.c
index 4a78f981557a..1420d2890b43 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -897,7 +897,7 @@ SYSCALL_DEFINE1(dup, unsigned int, fildes)
struct file *file = fget_raw(fildes);
if (file) {
- ret = get_unused_fd();
+ ret = get_unused_fd_flags(0);
if (ret >= 0)
fd_install(ret, file);
else
--
1.8.4.2
Macro get_unused_fd() is used to allocate a file descriptor with
default flags. Those default flags (0) can be "unsafe":
O_CLOEXEC must be used by default to not leak file descriptor
across exec().
Instead of macro get_unused_fd(), function get_unused_fd_flags()
(or anon_inode_getfd()) should be used with flags given by userspace.
If not possible, flags should be set to O_CLOEXEC to provide userspace
with a default safe behavor.
In a further patch, get_unused_fd() will be removed so that new code
start using get_unused_fd_flags() (or anon_inode_getfd()) with correct
flags.
This patch replaces calls to get_unused_fd() with equivalent call to
get_unused_fd_flags(0) to preserve current behavor for existing code.
The hard coded flag value (0) should be reviewed on a per-subsystem
basis, and, if possible, set to O_CLOEXEC.
Cc: Al Viro <[email protected]>
Signed-off-by: Yann Droneaud <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
---
fs/binfmt_misc.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/fs/binfmt_misc.c b/fs/binfmt_misc.c
index 1c740e152f38..052f6dc8e549 100644
--- a/fs/binfmt_misc.c
+++ b/fs/binfmt_misc.c
@@ -138,7 +138,7 @@ static int load_misc_binary(struct linux_binprm *bprm)
/* if the binary should be opened on behalf of the
* interpreter than keep it open and assign descriptor
* to it */
- fd_binary = get_unused_fd();
+ fd_binary = get_unused_fd_flags(0);
if (fd_binary < 0) {
retval = fd_binary;
goto _ret;
--
1.8.4.2
Macro get_unused_fd() is used to allocate a file descriptor with
default flags. Those default flags (0) can be "unsafe":
O_CLOEXEC must be used by default to not leak file descriptor
across exec().
Instead of macro get_unused_fd(), function get_unused_fd_flags()
(or anon_inode_getfd()) should be used with flags given by userspace.
If not possible, flags should be set to O_CLOEXEC to provide userspace
with a default safe behavor.
In a further patch, get_unused_fd() will be removed so that new code
start using get_unused_fd_flags() (or anon_inode_getfd()) with correct
flags.
This patch replaces calls to get_unused_fd() with equivalent call to
get_unused_fd_flags(0) to preserve current behavor for existing code.
The hard coded flag value (0) should be reviewed on a per-subsystem
basis, and, if possible, set to O_CLOEXEC.
Cc: Al Viro <[email protected]>
Signed-off-by: Yann Droneaud <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
---
arch/ia64/kernel/perfmon.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/ia64/kernel/perfmon.c b/arch/ia64/kernel/perfmon.c
index cb592773c78b..44b148fbb560 100644
--- a/arch/ia64/kernel/perfmon.c
+++ b/arch/ia64/kernel/perfmon.c
@@ -2662,7 +2662,7 @@ pfm_context_create(pfm_context_t *ctx, void *arg, int count, struct pt_regs *reg
ret = -ENOMEM;
- fd = get_unused_fd();
+ fd = get_unused_fd_flags(0);
if (fd < 0)
return fd;
--
1.8.4.2
According to commit 80af258867648, file descriptor created as part
of file access notification events inherit flags from the event_f_flags
argument passed to syscall fanotify_init(2).
So while it is legal for userspace to call fanotify_init() with O_CLOEXEC
as part of its second argument, O_CLOEXEC is currently silently ignored.
Indeed event_f_flags are only given to dentry_open(), which only
seems to care about O_ACCMODE and O_PATH in do_dentry_open(), O_DIRECT
in open_check_o_direct() and O_LARGEFILE in generic_file_open().
More, there's no effective check on event_f_flags value that would
catch unknown / unsupported values, unlike the one on f_flags argument
of the syscall (see FAN_ALL_INIT_FLAGS in include/uapi/linux/fanotify.h).
Reading article "Botching up ioctls"[1] by Daniel Vetter might make one
feel uncomfortable when trying to add extension to an API that doesn't
check for unrecognized values.
But it seems logical to set close-on-exec flag on the file descriptor
if userspace is allowed to request it with O_CLOEXEC.
In fact, according to some lookup on http://codesearch.debian.net/ and
various search engine, there's already some userspace code requesting it:
- in systemd's readahead[2]:
fanotify_fd = fanotify_init(FAN_CLOEXEC|FAN_NONBLOCK, O_RDONLY|O_LARGEFILE|O_CLOEXEC|O_NOATIME);
- in clsync[3]:
#define FANOTIFY_EVFLAGS (O_LARGEFILE|O_RDONLY|O_CLOEXEC)
int fanotify_d = fanotify_init(FANOTIFY_FLAGS, FANOTIFY_EVFLAGS);
- in examples [4] from "Filesystem monitoring in the Linux kernel"
article[5] by Aleksander Morgado:
if ((fanotify_fd = fanotify_init (FAN_CLOEXEC,
O_RDONLY | O_CLOEXEC | O_LARGEFILE)) < 0)
Lookup also returned some wrong usage of the syscall:
- in Gonk HAL from Mozilla Firefox OS sources[6]:
mFd = fanotify_init(FAN_CLASS_NOTIF, FAN_CLOEXEC);
Adding support for O_CLOEXEC in fanotify won't magically enable it for
Gonk since FAN_CLOEXEC is defined as 0x1, which is likely equal to
O_WRONLY when used in open flag context. In the other hand, it won't
hurt it either.
So this patch replaces call to macro get_unused_fd() by a call to
function get_unused_fd_flags() with event_f_flags value as argument.
This way O_CLOEXEC flag in the second argument of fanotify_init(2)
syscall is interpreted so that close-on-exec get enabled.
[1] http://blog.ffwll.ch/2013/11/botching-up-ioctls.html
[2] http://cgit.freedesktop.org/systemd/systemd/tree/src/readahead/readahead-collect.c?id=v208#n294
[3] https://github.com/xaionaro/clsync/blob/v0.2.1/sync.c#L1631
https://github.com/xaionaro/clsync/blob/v0.2.1/configuration.h#L38
[4] http://www.lanedo.com/~aleksander/fanotify/fanotify-example.c
[5] http://www.lanedo.com/2013/filesystem-monitoring-linux-kernel/
[6] http://hg.mozilla.org/mozilla-central/file/325c74addeba/hal/gonk/GonkDiskSpaceWatcher.cpp#l167
Cc: Eric Paris <[email protected]>
Cc: Al Viro <[email protected]>
Cc: [email protected]
Signed-off-by: Yann Droneaud <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
---
fs/notify/fanotify/fanotify_user.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/fs/notify/fanotify/fanotify_user.c b/fs/notify/fanotify/fanotify_user.c
index 57d7c083cb4b..6d0eaabba02e 100644
--- a/fs/notify/fanotify/fanotify_user.c
+++ b/fs/notify/fanotify/fanotify_user.c
@@ -71,7 +71,7 @@ static int create_fd(struct fsnotify_group *group,
pr_debug("%s: group=%p event=%p\n", __func__, group, event);
- client_fd = get_unused_fd();
+ client_fd = get_unused_fd_flags(group->fanotify_data.f_flags);
if (client_fd < 0)
return client_fd;
--
1.8.4.2
Unlike recent modern userspace API such as
epoll_create1 (EPOLL_CLOEXEC), eventfd (EFD_CLOEXEC),
fanotify_init (FAN_CLOEXEC), inotify_init1 (IN_CLOEXEC),
signalfd (SFD_CLOEXEC), timerfd_create (TFD_CLOEXEC),
or the venerable general purpose open (O_CLOEXEC),
perf_event_open() syscall lack a flag to atomically set FD_CLOEXEC
(eg. close-on-exec) flag on file descriptor it returns to userspace.
The present patch adds a PERF_FLAG_FD_CLOEXEC flag to allow
perf_event_open() syscall to atomically set close-on-exec.
Having this flag will enable userspace to remove the file descriptor
from the list of file descriptors being inherited across exec,
without the need to call fcntl(fd, F_SETFD, FD_CLOEXEC) and the
associated race condition between the current thread and another
thread calling fork(2) then execve(2).
Links:
- Secure File Descriptor Handling (Ulrich Drepper, 2008)
http://udrepper.livejournal.com/20407.html
- Excuse me son, but your code is leaking !!! (Dan Walsh, March 2012)
http://danwalsh.livejournal.com/53603.html
- Notes in DMA buffer sharing: leak and security hole
http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/Documentation/dma-buf-sharing.txt?id=v3.13-rc3#n428
Cc: Peter Zijlstra <[email protected]>
Cc: Al Viro <[email protected]>
Signed-off-by: Yann Droneaud <[email protected]>
Link: http://lkml.kernel.org/r/1383168950%[email protected]
Link: http://lkml.kernel.org/r/1383170413.9171.10.camel@dworkin
Link: http://lkml.kernel.org/r/[email protected]
---
include/uapi/linux/perf_event.h | 1 +
kernel/events/core.c | 12 +++++++++---
2 files changed, 10 insertions(+), 3 deletions(-)
diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index 959d454f76a1..e244ed412745 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -725,6 +725,7 @@ enum perf_callchain_context {
#define PERF_FLAG_FD_NO_GROUP (1U << 0)
#define PERF_FLAG_FD_OUTPUT (1U << 1)
#define PERF_FLAG_PID_CGROUP (1U << 2) /* pid=cgroup id, per-cpu mode only */
+#define PERF_FLAG_FD_CLOEXEC (1U << 3) /* O_CLOEXEC */
union perf_mem_data_src {
__u64 val;
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 4b743b299124..86e6a49abf8e 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -119,7 +119,8 @@ static int cpu_function_call(int cpu, int (*func) (void *info), void *info)
#define PERF_FLAG_ALL (PERF_FLAG_FD_NO_GROUP |\
PERF_FLAG_FD_OUTPUT |\
- PERF_FLAG_PID_CGROUP)
+ PERF_FLAG_PID_CGROUP |\
+ PERF_FLAG_FD_CLOEXEC)
/*
* branch priv levels that need permission checks
@@ -6995,6 +6996,7 @@ SYSCALL_DEFINE5(perf_event_open,
int event_fd;
int move_group = 0;
int err;
+ int f_flags = O_RDWR;
/* for future expandability... */
if (flags & ~PERF_FLAG_ALL)
@@ -7023,7 +7025,10 @@ SYSCALL_DEFINE5(perf_event_open,
if ((flags & PERF_FLAG_PID_CGROUP) && (pid == -1 || cpu == -1))
return -EINVAL;
- event_fd = get_unused_fd();
+ if (flags & PERF_FLAG_FD_CLOEXEC)
+ f_flags |= O_CLOEXEC;
+
+ event_fd = get_unused_fd_flags(f_flags);
if (event_fd < 0)
return event_fd;
@@ -7145,7 +7150,8 @@ SYSCALL_DEFINE5(perf_event_open,
goto err_context;
}
- event_file = anon_inode_getfile("[perf_event]", &perf_fops, event, O_RDWR);
+ event_file = anon_inode_getfile("[perf_event]", &perf_fops, event,
+ f_flags);
if (IS_ERR(event_file)) {
err = PTR_ERR(event_file);
goto err_context;
--
1.8.4.2
Macro get_unused_fd() allocates a file descriptor without enabling
close-on-exec: it calls function get_unused_fd_flags() without
O_CLOEXEC flag.
This can be seen as an unsafe default: in most case close-on-exec
should be enabled to not leak file descriptor across exec().
This patch removes get_unused_fd() instead of updating it to use
O_CLOEXEC so that out of tree modules won't be affect by a runtime
behavor change which might introduce other kind of bug. It's better
to catch the change at build time, making it easier to fix.
Removing the macro will also promote use of get_unused_fd_flags()
(or anon_inode_getfd()) with flags provided by userspace. Or, if
flags cannot be given by userspace, with flags set to O_CLOEXEC set
by default.
Cc: Al Viro <[email protected]>
Signed-off-by: Yann Droneaud <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
---
include/linux/file.h | 1 -
1 file changed, 1 deletion(-)
diff --git a/include/linux/file.h b/include/linux/file.h
index cbacf4faf447..866600259c07 100644
--- a/include/linux/file.h
+++ b/include/linux/file.h
@@ -63,7 +63,6 @@ extern void set_close_on_exec(unsigned int fd, int flag);
extern bool get_close_on_exec(unsigned int fd);
extern void put_filp(struct file *);
extern int get_unused_fd_flags(unsigned flags);
-#define get_unused_fd() get_unused_fd_flags(0)
extern void put_unused_fd(unsigned int fd);
extern void fd_install(unsigned int fd, struct file *file);
--
1.8.4.2
On Sun, Jan 05, 2014 at 09:36:33PM +0100, Yann Droneaud wrote:
> Unlike recent modern userspace API such as
> epoll_create1 (EPOLL_CLOEXEC), eventfd (EFD_CLOEXEC),
> fanotify_init (FAN_CLOEXEC), inotify_init1 (IN_CLOEXEC),
> signalfd (SFD_CLOEXEC), timerfd_create (TFD_CLOEXEC),
> or the venerable general purpose open (O_CLOEXEC),
> perf_event_open() syscall lack a flag to atomically set FD_CLOEXEC
> (eg. close-on-exec) flag on file descriptor it returns to userspace.
>
> The present patch adds a PERF_FLAG_FD_CLOEXEC flag to allow
> perf_event_open() syscall to atomically set close-on-exec.
>
> Having this flag will enable userspace to remove the file descriptor
> from the list of file descriptors being inherited across exec,
> without the need to call fcntl(fd, F_SETFD, FD_CLOEXEC) and the
> associated race condition between the current thread and another
> thread calling fork(2) then execve(2).
>
> Links:
>
> - Secure File Descriptor Handling (Ulrich Drepper, 2008)
> http://udrepper.livejournal.com/20407.html
>
> - Excuse me son, but your code is leaking !!! (Dan Walsh, March 2012)
> http://danwalsh.livejournal.com/53603.html
>
> - Notes in DMA buffer sharing: leak and security hole
> http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/Documentation/dma-buf-sharing.txt?id=v3.13-rc3#n428
>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Al Viro <[email protected]>
> Signed-off-by: Yann Droneaud <[email protected]>
Seems sane enough.
Thanks!
In a previous patch [1][2], flag PERF_FLAG_FD_CLOEXEC was
added to perf_event_open(2) syscall to allows userspace
to enable close-on-exec behavor atomically when creating
the file descriptor.
This patch makes perf tools use the new flag.
Beware that perf tools compiled with the new flag won't work
on older kernel which do not support flag PERF_FLAG_FD_CLOEXEC.
[1] http://lkml.kernel.org/r/8c03f54e1598b1727c19706f3af03f98685d9fe6.1388952061.git.ydroneaud@opteya.com
[2] https://patchwork.kernel.org/patch/3434971/
Cc: Peter Zijlstra <[email protected]>
Signed-off-by: Yann Droneaud <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
---
tools/perf/bench/mem-memcpy.c | 3 ++-
tools/perf/bench/mem-memset.c | 3 ++-
tools/perf/builtin-sched.c | 2 +-
tools/perf/tests/bp_signal.c | 2 +-
tools/perf/tests/bp_signal_overflow.c | 2 +-
tools/perf/tests/rdpmc.c | 2 +-
tools/perf/util/evsel.c | 4 ++--
tools/perf/util/record.c | 9 ++++++---
tools/testing/selftests/powerpc/pmu/event.c | 3 ++-
9 files changed, 18 insertions(+), 12 deletions(-)
diff --git a/tools/perf/bench/mem-memcpy.c b/tools/perf/bench/mem-memcpy.c
index 5ce71d3b72cf..c1be6c17b382 100644
--- a/tools/perf/bench/mem-memcpy.c
+++ b/tools/perf/bench/mem-memcpy.c
@@ -83,7 +83,8 @@ static struct perf_event_attr cycle_attr = {
static void init_cycle(void)
{
- cycle_fd = sys_perf_event_open(&cycle_attr, getpid(), -1, -1, 0);
+ cycle_fd = sys_perf_event_open(&cycle_attr, getpid(), -1, -1,
+ PERF_FLAG_FD_CLOEXEC);
if (cycle_fd < 0 && errno == ENOSYS)
die("No CONFIG_PERF_EVENTS=y kernel support configured?\n");
diff --git a/tools/perf/bench/mem-memset.c b/tools/perf/bench/mem-memset.c
index 9af79d2b18e5..3b92b7da9cf2 100644
--- a/tools/perf/bench/mem-memset.c
+++ b/tools/perf/bench/mem-memset.c
@@ -83,7 +83,8 @@ static struct perf_event_attr cycle_attr = {
static void init_cycle(void)
{
- cycle_fd = sys_perf_event_open(&cycle_attr, getpid(), -1, -1, 0);
+ cycle_fd = sys_perf_event_open(&cycle_attr, getpid(), -1, -1,
+ PERF_FLAG_FD_CLOEXEC);
if (cycle_fd < 0 && errno == ENOSYS)
die("No CONFIG_PERF_EVENTS=y kernel support configured?\n");
diff --git a/tools/perf/builtin-sched.c b/tools/perf/builtin-sched.c
index 0f3c65518a2c..933a15783a54 100644
--- a/tools/perf/builtin-sched.c
+++ b/tools/perf/builtin-sched.c
@@ -435,7 +435,7 @@ static int self_open_counters(void)
attr.type = PERF_TYPE_SOFTWARE;
attr.config = PERF_COUNT_SW_TASK_CLOCK;
- fd = sys_perf_event_open(&attr, 0, -1, -1, 0);
+ fd = sys_perf_event_open(&attr, 0, -1, -1, PERF_FLAG_FD_CLOEXEC);
if (fd < 0)
pr_err("Error: sys_perf_event_open() syscall returned "
diff --git a/tools/perf/tests/bp_signal.c b/tools/perf/tests/bp_signal.c
index aba095489193..de33526ff9d6 100644
--- a/tools/perf/tests/bp_signal.c
+++ b/tools/perf/tests/bp_signal.c
@@ -78,7 +78,7 @@ static int bp_event(void *fn, int setup_signal)
pe.exclude_kernel = 1;
pe.exclude_hv = 1;
- fd = sys_perf_event_open(&pe, 0, -1, -1, 0);
+ fd = sys_perf_event_open(&pe, 0, -1, -1, PERF_FLAG_FD_CLOEXEC);
if (fd < 0) {
pr_debug("failed opening event %llx\n", pe.config);
return TEST_FAIL;
diff --git a/tools/perf/tests/bp_signal_overflow.c b/tools/perf/tests/bp_signal_overflow.c
index 44ac82179708..e54bf622adb2 100644
--- a/tools/perf/tests/bp_signal_overflow.c
+++ b/tools/perf/tests/bp_signal_overflow.c
@@ -91,7 +91,7 @@ int test__bp_signal_overflow(void)
pe.exclude_kernel = 1;
pe.exclude_hv = 1;
- fd = sys_perf_event_open(&pe, 0, -1, -1, 0);
+ fd = sys_perf_event_open(&pe, 0, -1, -1, PERF_FLAG_FD_CLOEXEC);
if (fd < 0) {
pr_debug("failed opening event %llx\n", pe.config);
return TEST_FAIL;
diff --git a/tools/perf/tests/rdpmc.c b/tools/perf/tests/rdpmc.c
index 46649c25fa5e..46664eca77af 100644
--- a/tools/perf/tests/rdpmc.c
+++ b/tools/perf/tests/rdpmc.c
@@ -104,7 +104,7 @@ static int __test__rdpmc(void)
sa.sa_sigaction = segfault_handler;
sigaction(SIGSEGV, &sa, NULL);
- fd = sys_perf_event_open(&attr, 0, -1, -1, 0);
+ fd = sys_perf_event_open(&attr, 0, -1, -1, PERF_FLAG_FD_CLOEXEC);
if (fd < 0) {
pr_err("Error: sys_perf_event_open() syscall returned "
"with %d (%s)\n", fd, strerror(errno));
diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
index 01ff4cfde1f5..8f5d6bc2ad3e 100644
--- a/tools/perf/util/evsel.c
+++ b/tools/perf/util/evsel.c
@@ -970,7 +970,7 @@ static int __perf_evsel__open(struct perf_evsel *evsel, struct cpu_map *cpus,
struct thread_map *threads)
{
int cpu, thread;
- unsigned long flags = 0;
+ unsigned long flags = PERF_FLAG_FD_CLOEXEC;
int pid = -1, err;
enum { NO_CHANGE, SET_TO_MAX, INCREASED_MAX } set_rlimit = NO_CHANGE;
@@ -979,7 +979,7 @@ static int __perf_evsel__open(struct perf_evsel *evsel, struct cpu_map *cpus,
return -ENOMEM;
if (evsel->cgrp) {
- flags = PERF_FLAG_PID_CGROUP;
+ flags |= PERF_FLAG_PID_CGROUP;
pid = evsel->cgrp->fd;
}
diff --git a/tools/perf/util/record.c b/tools/perf/util/record.c
index e5104538c354..1146fa9003ab 100644
--- a/tools/perf/util/record.c
+++ b/tools/perf/util/record.c
@@ -22,14 +22,16 @@ static int perf_do_probe_api(setup_probe_fn_t fn, int cpu, const char *str)
evsel = perf_evlist__first(evlist);
- fd = sys_perf_event_open(&evsel->attr, -1, cpu, -1, 0);
+ fd = sys_perf_event_open(&evsel->attr, -1, cpu, -1,
+ PERF_FLAG_FD_CLOEXEC);
if (fd < 0)
goto out_delete;
close(fd);
fn(evsel);
- fd = sys_perf_event_open(&evsel->attr, -1, cpu, -1, 0);
+ fd = sys_perf_event_open(&evsel->attr, -1, cpu, -1,
+ PERF_FLAG_FD_CLOEXEC);
if (fd < 0) {
if (errno == EINVAL)
err = -EINVAL;
@@ -204,7 +206,8 @@ bool perf_evlist__can_select_event(struct perf_evlist *evlist, const char *str)
cpu = evlist->cpus->map[0];
}
- fd = sys_perf_event_open(&evsel->attr, -1, cpu, -1, 0);
+ fd = sys_perf_event_open(&evsel->attr, -1, cpu, -1,
+ PERF_FLAG_FD_CLOEXEC);
if (fd >= 0) {
close(fd);
ret = true;
diff --git a/tools/testing/selftests/powerpc/pmu/event.c b/tools/testing/selftests/powerpc/pmu/event.c
index 2b2d11df2450..a705e42b6132 100644
--- a/tools/testing/selftests/powerpc/pmu/event.c
+++ b/tools/testing/selftests/powerpc/pmu/event.c
@@ -45,7 +45,8 @@ void event_init_named(struct event *e, u64 config, char *name)
int event_open_with_options(struct event *e, pid_t pid, int cpu, int group_fd)
{
- e->fd = perf_event_open(&e->attr, pid, cpu, group_fd, 0);
+ e->fd = perf_event_open(&e->attr, pid, cpu, group_fd,
+ PERF_FLAG_FD_CLOEXEC);
if (e->fd == -1) {
perror("perf_event_open");
return -1;
--
1.8.4.2
On Mon, Jan 06, 2014 at 11:51:25AM +0100, Yann Droneaud wrote:
> In a previous patch [1][2], flag PERF_FLAG_FD_CLOEXEC was
> added to perf_event_open(2) syscall to allows userspace
> to enable close-on-exec behavor atomically when creating
> the file descriptor.
>
> This patch makes perf tools use the new flag.
>
> Beware that perf tools compiled with the new flag won't work
> on older kernel which do not support flag PERF_FLAG_FD_CLOEXEC.
>
> [1] http://lkml.kernel.org/r/8c03f54e1598b1727c19706f3af03f98685d9fe6.1388952061.git.ydroneaud@opteya.com
> [2] https://patchwork.kernel.org/patch/3434971/
>
> Cc: Peter Zijlstra <[email protected]>
> Signed-off-by: Yann Droneaud <[email protected]>
> Link: http://lkml.kernel.org/r/[email protected]
acme, ACK? I was thinking I'd keep these two patches together so the
entire things lands in tip in one go?
On Mon, Jan 06, 2014 at 11:51:25AM +0100, Yann Droneaud wrote:
> In a previous patch [1][2], flag PERF_FLAG_FD_CLOEXEC was
> added to perf_event_open(2) syscall to allows userspace
> to enable close-on-exec behavor atomically when creating
> the file descriptor.
>
> This patch makes perf tools use the new flag.
>
> Beware that perf tools compiled with the new flag won't work
> on older kernel which do not support flag PERF_FLAG_FD_CLOEXEC.
I think we should enhance the api probe routines (perf_do_probe_api)
to detect that, than just bypass us from running on older kernels
jirka
Em Mon, Jan 06, 2014 at 12:24:36PM +0100, Peter Zijlstra escreveu:
> On Mon, Jan 06, 2014 at 11:51:25AM +0100, Yann Droneaud wrote:
> > In a previous patch [1][2], flag PERF_FLAG_FD_CLOEXEC was
> > added to perf_event_open(2) syscall to allows userspace
> > to enable close-on-exec behavor atomically when creating
> > the file descriptor.
> > This patch makes perf tools use the new flag.
> > Beware that perf tools compiled with the new flag won't work
> > on older kernel which do not support flag PERF_FLAG_FD_CLOEXEC.
> > [1] http://lkml.kernel.org/r/8c03f54e1598b1727c19706f3af03f98685d9fe6.1388952061.git.ydroneaud@opteya.com
> > [2] https://patchwork.kernel.org/patch/3434971/
> acme, ACK? I was thinking I'd keep these two patches together so the
> entire things lands in tip in one go?
Nope, it should notice the EINVAL, drop the flag that doesn't work on
older kernels, retry, so that new tools continue to work on older
kernels, with yet another fallback.
Please take a look at __perf_evsel__open(), probably it will be best to
add a flag to perf_missing_features, like the one we have for the
perf_event_attr.mmap2 flag, so that we fail just once, etc.
- Arnaldo
Hi,
Le lundi 06 janvier 2014 à 15:22 +0100, Jiri Olsa a écrit :
> On Mon, Jan 06, 2014 at 11:51:25AM +0100, Yann Droneaud wrote:
> > In a previous patch [1][2], flag PERF_FLAG_FD_CLOEXEC was
> > added to perf_event_open(2) syscall to allows userspace
> > to enable close-on-exec behavor atomically when creating
> > the file descriptor.
> >
> > This patch makes perf tools use the new flag.
> >
> > Beware that perf tools compiled with the new flag won't work
> > on older kernel which do not support flag PERF_FLAG_FD_CLOEXEC.
>
> I think we should enhance the api probe routines (perf_do_probe_api)
> to detect that, than just bypass us from running on older kernels
>
perf_probe_api() is only available in tools/perf/util/record.c module
(eg. it's a static function), while I've modified all calls to
perf_event_open().
So if you want to follow the probe path, a new function should be added.
This function must be used to retrieve the value of the
PERF_FLAG_FD_CLOEXEC flag if available.
For example:
static int cloexec = PERF_FLAG_FD_CLOEXEC;
int perf_flag_fd_cloexec(void)
{
static int probed;
if (!probed) {
int fd = perf_event_open(&attr, 0, -1, -1,
PERF_FLAG_FD_CLOEXEC);
probed = 1;
if (fd >= 0)
close(fd);
else
cloexec = 0;
}
return cloexec;
}
But I don't know how to setup a struct perf_event_attr which will work
in "all" case (and do no harm).
I was able to run this code with attr struct cleared (eg. set to 0), but
I don't know perf internals enough to be confident.
Regards.
--
Yann Droneaud
OPTEYA
On Mon, Jan 06, 2014 at 11:51:25AM +0100, Yann Droneaud wrote:
> In a previous patch [1][2], flag PERF_FLAG_FD_CLOEXEC was
> added to perf_event_open(2) syscall to allows userspace
> to enable close-on-exec behavor atomically when creating
> the file descriptor.
>
> This patch makes perf tools use the new flag.
What is that good for? I can see why for a threaded program, but
"perf" is not threaded.
-Andi
On Mon, Jan 06, 2014 at 08:27:54AM -0800, Andi Kleen wrote:
> On Mon, Jan 06, 2014 at 11:51:25AM +0100, Yann Droneaud wrote:
> > In a previous patch [1][2], flag PERF_FLAG_FD_CLOEXEC was
> > added to perf_event_open(2) syscall to allows userspace
> > to enable close-on-exec behavor atomically when creating
> > the file descriptor.
> >
> > This patch makes perf tools use the new flag.
>
> What is that good for? I can see why for a threaded program, but
> "perf" is not threaded.
AFAICT its got nothing to do with threaded or not, but only with exec()
and we do in fact call exec() quite a lot in perf.
It ensures we do not leak open perf FDs into our child processes. Now
I'm not entirely sure how we do the exec these days but I think we were
good about not not leaking them anyway, but more paranoia never really
hurts.
> AFAICT its got nothing to do with threaded or not, but only with exec()
> and we do in fact call exec() quite a lot in perf.
>
> It ensures we do not leak open perf FDs into our child processes. Now
> I'm not entirely sure how we do the exec these days but I think we were
> good about not not leaking them anyway, but more paranoia never really
> hurts.
Then you can just set it with fcntl, which works everywhere,
instead of doing extra feature tests.
The atomic setup is only needed for threaded programs to avoid
races with other threads doing exec.
-Andi
--
[email protected] -- Speaking for myself only
Hi,
Le lundi 06 janvier 2014 à 08:52 -0800, Andi Kleen a écrit :
> > AFAICT its got nothing to do with threaded or not, but only with exec()
> > and we do in fact call exec() quite a lot in perf.
> >
> > It ensures we do not leak open perf FDs into our child processes. Now
> > I'm not entirely sure how we do the exec these days but I think we were
> > good about not not leaking them anyway, but more paranoia never really
> > hurts.
>
> Then you can just set it with fcntl, which works everywhere,
> instead of doing extra feature tests.
>
> The atomic setup is only needed for threaded programs to avoid
> races with other threads doing exec.
>
True.
The purpose of this patch was more about exercising the new flag
proposed for perf_event_open() as a way to demonstrate its usage.
Having it used in the perf code base will likely help to spread it to
other tools based on perf event API.
Trying to add a close-on-exec enable flag late emphasis it should have
been done earlier and perhaps made the default.
As I wrote in "[PATCH v5 0/7] Getting rid of get_unused_fd() / enable
close-on-exec" [1], setting close-on-exec at open time is easier to
write for the programmer and less error prone than trying to add call to
fcntl() and forget to add it in some code path. (forgetting to add the
proper flag to open() is probably more error prone ... so one might want
to make close-on-exec the *default* for any new API).
[1] http://lkml.kernel.org/r/[email protected]
Regards.
--
Yann Droneaud
OPTEYA
Hi,
Le lundi 06 janvier 2014 à 11:43 -0300, Arnaldo Carvalho de Melo a
écrit :
> Em Mon, Jan 06, 2014 at 12:24:36PM +0100, Peter Zijlstra escreveu:
> > acme, ACK? I was thinking I'd keep these two patches together so the
> > entire things lands in tip in one go?
>
> Nope, it should notice the EINVAL, drop the flag that doesn't work on
> older kernels, retry, so that new tools continue to work on older
> kernels, with yet another fallback.
>
But it may be difficult to distinguish the 'origin' of EINVAL: is it
from FD_CLOEXEC flag or from an unsupported parameter in the attributes.
> Please take a look at __perf_evsel__open(), probably it will be best to
> add a flag to perf_missing_features, like the one we have for the
> perf_event_attr.mmap2 flag, so that we fail just once, etc.
>
Unfortunately perf_event_open() is called in multiple places, not only
in __perf_evsel__open(). So a more generic solution should be designed.
Is something like the function proposed in message
<[email protected]> [1]
ok to be added to its own module:
static int cloexec = PERF_FLAG_FD_CLOEXEC;
int perf_flag_fd_cloexec(void)
{
static int probed;
if (!probed) {
struct perf_event_attr attr = { 0 };
int fd = perf_event_open(&attr, 0, -1, -1,
PERF_FLAG_FD_CLOEXEC);
probed = 1;
if (fd >= 0)
close(fd);
else
cloexec = 0;
}
return cloexec;
}
This function should be used to build the flag passed to
perf_event_open().
Regards.
[1]
http://lkml.kernel.org/r/[email protected]
--
Yann Droneaud
OPTEYA
Em Mon, Jan 06, 2014 at 10:01:35PM +0100, Yann Droneaud escreveu:
> Hi,
>
> Le lundi 06 janvier 2014 ? 11:43 -0300, Arnaldo Carvalho de Melo a
> ?crit :
> > Em Mon, Jan 06, 2014 at 12:24:36PM +0100, Peter Zijlstra escreveu:
>
> > > acme, ACK? I was thinking I'd keep these two patches together so the
> > > entire things lands in tip in one go?
> >
> > Nope, it should notice the EINVAL, drop the flag that doesn't work on
> > older kernels, retry, so that new tools continue to work on older
> > kernels, with yet another fallback.
> >
>
> But it may be difficult to distinguish the 'origin' of EINVAL: is it
> from FD_CLOEXEC flag or from an unsupported parameter in the attributes.
It definitely is, that is why the first thing to fall back is the most
recently added feature.
> > Please take a look at __perf_evsel__open(), probably it will be best to
> > add a flag to perf_missing_features, like the one we have for the
> > perf_event_attr.mmap2 flag, so that we fail just once, etc.
> >
>
> Unfortunately perf_event_open() is called in multiple places, not only
> in __perf_evsel__open(). So a more generic solution should be designed.
>
> Is something like the function proposed in message
> <[email protected]> [1]
> ok to be added to its own module:
Well, at least make it set a perf_missing_features flag instead of that
'cloexec' global, that way the routines that use perf_evsel__open would
reuse it.
- Arnaldo
> static int cloexec = PERF_FLAG_FD_CLOEXEC;
>
> int perf_flag_fd_cloexec(void)
> {
> static int probed;
>
> if (!probed) {
> struct perf_event_attr attr = { 0 };
> int fd = perf_event_open(&attr, 0, -1, -1,
> PERF_FLAG_FD_CLOEXEC);
> probed = 1;
> if (fd >= 0)
> close(fd);
> else
> cloexec = 0;
> }
>
> return cloexec;
> }
>
> This function should be used to build the flag passed to
> perf_event_open().
>
> Regards.
>
> [1]
> http://lkml.kernel.org/r/[email protected]
>
> --
> Yann Droneaud
> OPTEYA
>
In a previous patch [1], flag PERF_FLAG_FD_CLOEXEC was
added to perf_event_open(2) syscall to allows userspace
to enable close-on-exec behavor atomically when creating
the file descriptor.
This patch makes perf tool use the new flag if supported
by the kernel, so that the event file descriptors got automatically
closed if perf tool exec a sub-command.
Changes from v0 [2]:
- add probing for PERF_FLAG_FD_CLOEXEC flag allowing
perf to run on older kernel:
* use "missing feature" pattern in evsel to disable
PERF_FLAG_FD_CLOEXEC in perf_evsel__open() if not
supported by kernel;
* add a dedicated function to probe for
PERF_FLAG_FD_CLOEXEC support in the current kernel.
This function is to be used by other functions
calling sys_perf_event_open() directly.
- remove PERF_FLAG_FD_CLOEXEC from PowerPC selftest
as it's not related to perf tool: it should be part
of a separate patch.
[1] http://lkml.kernel.org/r/8c03f54e1598b1727c19706f3af03f98685d9fe6.1388952061.git.ydroneaud@opteya.com
https://patchwork.kernel.org/patch/3434971/
[2] http://lkml.kernel.org/r/[email protected]
https://patchwork.kernel.org/patch/3437111/
Cc: Peter Zijlstra <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Andi Kleen <[email protected]>
Signed-off-by: Yann Droneaud <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
---
tools/perf/Makefile.perf | 1 +
tools/perf/bench/mem-memcpy.c | 4 ++-
tools/perf/bench/mem-memset.c | 4 ++-
tools/perf/builtin-sched.c | 4 ++-
tools/perf/tests/bp_signal.c | 4 ++-
tools/perf/tests/bp_signal_overflow.c | 4 ++-
tools/perf/tests/rdpmc.c | 4 ++-
tools/perf/util/cloexec.c | 54 +++++++++++++++++++++++++++++++++++
tools/perf/util/cloexec.h | 6 ++++
tools/perf/util/evsel.c | 12 ++++++--
tools/perf/util/record.c | 10 +++++--
11 files changed, 95 insertions(+), 12 deletions(-)
create mode 100644 tools/perf/util/cloexec.c
create mode 100644 tools/perf/util/cloexec.h
diff --git a/tools/perf/Makefile.perf b/tools/perf/Makefile.perf
index 97a2145e4cc6..0428b11402b8 100644
--- a/tools/perf/Makefile.perf
+++ b/tools/perf/Makefile.perf
@@ -372,6 +372,7 @@ LIB_OBJS += $(OUTPUT)util/stat.o
LIB_OBJS += $(OUTPUT)util/record.o
LIB_OBJS += $(OUTPUT)util/srcline.o
LIB_OBJS += $(OUTPUT)util/data.o
+LIB_OBJS += $(OUTPUT)util/cloexec.o
LIB_OBJS += $(OUTPUT)ui/setup.o
LIB_OBJS += $(OUTPUT)ui/helpline.o
diff --git a/tools/perf/bench/mem-memcpy.c b/tools/perf/bench/mem-memcpy.c
index 5ce71d3b72cf..bf5a21b848a9 100644
--- a/tools/perf/bench/mem-memcpy.c
+++ b/tools/perf/bench/mem-memcpy.c
@@ -10,6 +10,7 @@
#include "../util/util.h"
#include "../util/parse-options.h"
#include "../util/header.h"
+#include "../util/cloexec.h"
#include "bench.h"
#include "mem-memcpy-arch.h"
@@ -83,7 +84,8 @@ static struct perf_event_attr cycle_attr = {
static void init_cycle(void)
{
- cycle_fd = sys_perf_event_open(&cycle_attr, getpid(), -1, -1, 0);
+ cycle_fd = sys_perf_event_open(&cycle_attr, getpid(), -1, -1,
+ perf_event_open_cloexec_flag());
if (cycle_fd < 0 && errno == ENOSYS)
die("No CONFIG_PERF_EVENTS=y kernel support configured?\n");
diff --git a/tools/perf/bench/mem-memset.c b/tools/perf/bench/mem-memset.c
index 9af79d2b18e5..260747ea1e0e 100644
--- a/tools/perf/bench/mem-memset.c
+++ b/tools/perf/bench/mem-memset.c
@@ -10,6 +10,7 @@
#include "../util/util.h"
#include "../util/parse-options.h"
#include "../util/header.h"
+#include "../util/cloexec.h"
#include "bench.h"
#include "mem-memset-arch.h"
@@ -83,7 +84,8 @@ static struct perf_event_attr cycle_attr = {
static void init_cycle(void)
{
- cycle_fd = sys_perf_event_open(&cycle_attr, getpid(), -1, -1, 0);
+ cycle_fd = sys_perf_event_open(&cycle_attr, getpid(), -1, -1,
+ perf_event_open_cloexec_flag());
if (cycle_fd < 0 && errno == ENOSYS)
die("No CONFIG_PERF_EVENTS=y kernel support configured?\n");
diff --git a/tools/perf/builtin-sched.c b/tools/perf/builtin-sched.c
index 0f3c65518a2c..bce5d4efea9c 100644
--- a/tools/perf/builtin-sched.c
+++ b/tools/perf/builtin-sched.c
@@ -10,6 +10,7 @@
#include "util/header.h"
#include "util/session.h"
#include "util/tool.h"
+#include "util/cloexec.h"
#include "util/parse-options.h"
#include "util/trace-event.h"
@@ -435,7 +436,8 @@ static int self_open_counters(void)
attr.type = PERF_TYPE_SOFTWARE;
attr.config = PERF_COUNT_SW_TASK_CLOCK;
- fd = sys_perf_event_open(&attr, 0, -1, -1, 0);
+ fd = sys_perf_event_open(&attr, 0, -1, -1,
+ perf_event_open_cloexec_flag());
if (fd < 0)
pr_err("Error: sys_perf_event_open() syscall returned "
diff --git a/tools/perf/tests/bp_signal.c b/tools/perf/tests/bp_signal.c
index aba095489193..fdc0d3e185f9 100644
--- a/tools/perf/tests/bp_signal.c
+++ b/tools/perf/tests/bp_signal.c
@@ -25,6 +25,7 @@
#include "tests.h"
#include "debug.h"
#include "perf.h"
+#include "../util/cloexec.h"
static int fd1;
static int fd2;
@@ -78,7 +79,8 @@ static int bp_event(void *fn, int setup_signal)
pe.exclude_kernel = 1;
pe.exclude_hv = 1;
- fd = sys_perf_event_open(&pe, 0, -1, -1, 0);
+ fd = sys_perf_event_open(&pe, 0, -1, -1,
+ perf_event_open_cloexec_flag());
if (fd < 0) {
pr_debug("failed opening event %llx\n", pe.config);
return TEST_FAIL;
diff --git a/tools/perf/tests/bp_signal_overflow.c b/tools/perf/tests/bp_signal_overflow.c
index 44ac82179708..b0b17415f18c 100644
--- a/tools/perf/tests/bp_signal_overflow.c
+++ b/tools/perf/tests/bp_signal_overflow.c
@@ -24,6 +24,7 @@
#include "tests.h"
#include "debug.h"
#include "perf.h"
+#include "../util/cloexec.h"
static int overflows;
@@ -91,7 +92,8 @@ int test__bp_signal_overflow(void)
pe.exclude_kernel = 1;
pe.exclude_hv = 1;
- fd = sys_perf_event_open(&pe, 0, -1, -1, 0);
+ fd = sys_perf_event_open(&pe, 0, -1, -1,
+ perf_event_open_cloexec_flag());
if (fd < 0) {
pr_debug("failed opening event %llx\n", pe.config);
return TEST_FAIL;
diff --git a/tools/perf/tests/rdpmc.c b/tools/perf/tests/rdpmc.c
index 46649c25fa5e..c1e55ff18774 100644
--- a/tools/perf/tests/rdpmc.c
+++ b/tools/perf/tests/rdpmc.c
@@ -6,6 +6,7 @@
#include "perf.h"
#include "debug.h"
#include "tests.h"
+#include "../util/cloexec.h"
#if defined(__x86_64__) || defined(__i386__)
@@ -104,7 +105,8 @@ static int __test__rdpmc(void)
sa.sa_sigaction = segfault_handler;
sigaction(SIGSEGV, &sa, NULL);
- fd = sys_perf_event_open(&attr, 0, -1, -1, 0);
+ fd = sys_perf_event_open(&attr, 0, -1, -1,
+ perf_event_open_cloexec_flag());
if (fd < 0) {
pr_err("Error: sys_perf_event_open() syscall returned "
"with %d (%s)\n", fd, strerror(errno));
diff --git a/tools/perf/util/cloexec.c b/tools/perf/util/cloexec.c
new file mode 100644
index 000000000000..3889a7b9e2d6
--- /dev/null
+++ b/tools/perf/util/cloexec.c
@@ -0,0 +1,54 @@
+#include "util.h"
+#include "../perf.h"
+#include "cloexec.h"
+
+static unsigned long flag = PERF_FLAG_FD_CLOEXEC;
+
+static int perf_flag_probe(void)
+{
+ struct perf_event_attr attr;
+ int fd;
+ int err;
+
+ /* check cloexec flag */
+ memset(&attr, 0, sizeof(attr));
+ fd = sys_perf_event_open(&attr, 0, -1, -1,
+ PERF_FLAG_FD_CLOEXEC);
+ if (fd >= 0) {
+ close(fd);
+ return 1;
+ }
+
+ if (errno != EINVAL) {
+ err = errno;
+ pr_warning("sys_perf_event_open() syscall returned "
+ "%d (%d: %s)\n", fd, err, strerror(err));
+ }
+
+ /* not supported, confirm error related to PERF_FLAG_FD_CLOEXEC */
+ memset(&attr, 0, sizeof(attr));
+ fd = sys_perf_event_open(&attr, 0, -1, -1, 0);
+ if (fd >= 0) {
+ close(fd);
+ return 0;
+ }
+
+ err = errno;
+ die("sys_perf_event_open() syscall returned "
+ "%d (%d: %s)\n", fd, err, strerror(err));
+
+ return -1;
+}
+
+unsigned long perf_event_open_cloexec_flag(void)
+{
+ static int probed;
+
+ if (!probed) {
+ if (perf_flag_probe() <= 0) {
+ flag = 0;
+ }
+ }
+
+ return flag;
+}
diff --git a/tools/perf/util/cloexec.h b/tools/perf/util/cloexec.h
new file mode 100644
index 000000000000..94a5a7d829d5
--- /dev/null
+++ b/tools/perf/util/cloexec.h
@@ -0,0 +1,6 @@
+#ifndef __PERF_CLOEXEC_H
+#define __PERF_CLOEXEC_H
+
+unsigned long perf_event_open_cloexec_flag(void);
+
+#endif /* __PERF_CLOEXEC_H */
diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
index 01ff4cfde1f5..19953fc7f90e 100644
--- a/tools/perf/util/evsel.c
+++ b/tools/perf/util/evsel.c
@@ -29,6 +29,7 @@ static struct {
bool sample_id_all;
bool exclude_guest;
bool mmap2;
+ bool cloexec;
} perf_missing_features;
#define FD(e, x, y) (*(int *)xyarray__entry(e->fd, x, y))
@@ -970,7 +971,7 @@ static int __perf_evsel__open(struct perf_evsel *evsel, struct cpu_map *cpus,
struct thread_map *threads)
{
int cpu, thread;
- unsigned long flags = 0;
+ unsigned long flags = PERF_FLAG_FD_CLOEXEC;
int pid = -1, err;
enum { NO_CHANGE, SET_TO_MAX, INCREASED_MAX } set_rlimit = NO_CHANGE;
@@ -979,11 +980,13 @@ static int __perf_evsel__open(struct perf_evsel *evsel, struct cpu_map *cpus,
return -ENOMEM;
if (evsel->cgrp) {
- flags = PERF_FLAG_PID_CGROUP;
+ flags |= PERF_FLAG_PID_CGROUP;
pid = evsel->cgrp->fd;
}
fallback_missing_features:
+ if (perf_missing_features.cloexec)
+ flags &= ~(unsigned long)PERF_FLAG_FD_CLOEXEC;
if (perf_missing_features.mmap2)
evsel->attr.mmap2 = 0;
if (perf_missing_features.exclude_guest)
@@ -1052,7 +1055,10 @@ try_fallback:
if (err != -EINVAL || cpu > 0 || thread > 0)
goto out_close;
- if (!perf_missing_features.mmap2 && evsel->attr.mmap2) {
+ if (!perf_missing_features.cloexec && (flags & PERF_FLAG_FD_CLOEXEC)) {
+ perf_missing_features.cloexec = true;
+ goto fallback_missing_features;
+ } else if (!perf_missing_features.mmap2 && evsel->attr.mmap2) {
perf_missing_features.mmap2 = true;
goto fallback_missing_features;
} else if (!perf_missing_features.exclude_guest &&
diff --git a/tools/perf/util/record.c b/tools/perf/util/record.c
index e5104538c354..d51db8341437 100644
--- a/tools/perf/util/record.c
+++ b/tools/perf/util/record.c
@@ -4,6 +4,7 @@
#include "parse-events.h"
#include "fs.h"
#include "util.h"
+#include "cloexec.h"
typedef void (*setup_probe_fn_t)(struct perf_evsel *evsel);
@@ -22,14 +23,16 @@ static int perf_do_probe_api(setup_probe_fn_t fn, int cpu, const char *str)
evsel = perf_evlist__first(evlist);
- fd = sys_perf_event_open(&evsel->attr, -1, cpu, -1, 0);
+ fd = sys_perf_event_open(&evsel->attr, -1, cpu, -1,
+ perf_event_open_cloexec_flag());
if (fd < 0)
goto out_delete;
close(fd);
fn(evsel);
- fd = sys_perf_event_open(&evsel->attr, -1, cpu, -1, 0);
+ fd = sys_perf_event_open(&evsel->attr, -1, cpu, -1,
+ perf_event_open_cloexec_flag());
if (fd < 0) {
if (errno == EINVAL)
err = -EINVAL;
@@ -204,7 +207,8 @@ bool perf_evlist__can_select_event(struct perf_evlist *evlist, const char *str)
cpu = evlist->cpus->map[0];
}
- fd = sys_perf_event_open(&evsel->attr, -1, cpu, -1, 0);
+ fd = sys_perf_event_open(&evsel->attr, -1, cpu, -1,
+ perf_event_open_cloexec_flag());
if (fd >= 0) {
close(fd);
ret = true;
--
1.8.4.2
Commit-ID: a21b0b354d4ac39be691f51c53562e2c24443d9e
Gitweb: http://git.kernel.org/tip/a21b0b354d4ac39be691f51c53562e2c24443d9e
Author: Yann Droneaud <[email protected]>
AuthorDate: Sun, 5 Jan 2014 21:36:33 +0100
Committer: Ingo Molnar <[email protected]>
CommitDate: Sun, 12 Jan 2014 10:16:59 +0100
perf: Introduce a flag to enable close-on-exec in perf_event_open()
Unlike recent modern userspace API such as:
epoll_create1 (EPOLL_CLOEXEC), eventfd (EFD_CLOEXEC),
fanotify_init (FAN_CLOEXEC), inotify_init1 (IN_CLOEXEC),
signalfd (SFD_CLOEXEC), timerfd_create (TFD_CLOEXEC),
or the venerable general purpose open (O_CLOEXEC),
perf_event_open() syscall lack a flag to atomically set FD_CLOEXEC
(eg. close-on-exec) flag on file descriptor it returns to userspace.
The present patch adds a PERF_FLAG_FD_CLOEXEC flag to allow
perf_event_open() syscall to atomically set close-on-exec.
Having this flag will enable userspace to remove the file descriptor
from the list of file descriptors being inherited across exec,
without the need to call fcntl(fd, F_SETFD, FD_CLOEXEC) and the
associated race condition between the current thread and another
thread calling fork(2) then execve(2).
Links:
- Secure File Descriptor Handling (Ulrich Drepper, 2008)
http://udrepper.livejournal.com/20407.html
- Excuse me son, but your code is leaking !!! (Dan Walsh, March 2012)
http://danwalsh.livejournal.com/53603.html
- Notes in DMA buffer sharing: leak and security hole
http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/Documentation/dma-buf-sharing.txt?id=v3.13-rc3#n428
Signed-off-by: Yann Droneaud <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Linus Torvalds <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/8c03f54e1598b1727c19706f3af03f98685d9fe6.1388952061.git.ydroneaud@opteya.com
Signed-off-by: Ingo Molnar <[email protected]>
---
include/uapi/linux/perf_event.h | 1 +
kernel/events/core.c | 12 +++++++++---
2 files changed, 10 insertions(+), 3 deletions(-)
diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index e1802d6..ca018b4 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -724,6 +724,7 @@ enum perf_callchain_context {
#define PERF_FLAG_FD_NO_GROUP (1U << 0)
#define PERF_FLAG_FD_OUTPUT (1U << 1)
#define PERF_FLAG_PID_CGROUP (1U << 2) /* pid=cgroup id, per-cpu mode only */
+#define PERF_FLAG_FD_CLOEXEC (1U << 3) /* O_CLOEXEC */
union perf_mem_data_src {
__u64 val;
diff --git a/kernel/events/core.c b/kernel/events/core.c
index c3b6c27..5c87264 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -119,7 +119,8 @@ static int cpu_function_call(int cpu, int (*func) (void *info), void *info)
#define PERF_FLAG_ALL (PERF_FLAG_FD_NO_GROUP |\
PERF_FLAG_FD_OUTPUT |\
- PERF_FLAG_PID_CGROUP)
+ PERF_FLAG_PID_CGROUP |\
+ PERF_FLAG_FD_CLOEXEC)
/*
* branch priv levels that need permission checks
@@ -6982,6 +6983,7 @@ SYSCALL_DEFINE5(perf_event_open,
int event_fd;
int move_group = 0;
int err;
+ int f_flags = O_RDWR;
/* for future expandability... */
if (flags & ~PERF_FLAG_ALL)
@@ -7010,7 +7012,10 @@ SYSCALL_DEFINE5(perf_event_open,
if ((flags & PERF_FLAG_PID_CGROUP) && (pid == -1 || cpu == -1))
return -EINVAL;
- event_fd = get_unused_fd();
+ if (flags & PERF_FLAG_FD_CLOEXEC)
+ f_flags |= O_CLOEXEC;
+
+ event_fd = get_unused_fd_flags(f_flags);
if (event_fd < 0)
return event_fd;
@@ -7132,7 +7137,8 @@ SYSCALL_DEFINE5(perf_event_open,
goto err_context;
}
- event_file = anon_inode_getfile("[perf_event]", &perf_fops, event, O_RDWR);
+ event_file = anon_inode_getfile("[perf_event]", &perf_fops, event,
+ f_flags);
if (IS_ERR(event_file)) {
err = PTR_ERR(event_file);
goto err_context;
In a previous patch [1], flag PERF_FLAG_FD_CLOEXEC was
added to perf_event_open(2) syscall to allows userspace
to enable close-on-exec behavor atomically when creating
the file descriptor.
This patch makes perf tools use the new flag if supported
by the kernel, so that the event file descriptors got
automatically closed if perf tool exec a sub-command.
Changes from v1 [2]:
- don't probe PERF_FLAG_FD_CLOEXEC for each call to
perf_event_open_cloexec_flag(): don't forget to set
'probed' variable once flag was probed.
- call perf_event_open_cloexec_flag() only once in
util/record.c:perf_do_probe_api().
- fixed minor coding style issue (unneeded braces)
in util/cloexec.c
Changes from v0 [3]:
- add probing for PERF_FLAG_FD_CLOEXEC flag allowing
perf to run on older kernel:
* use "missing feature" pattern in evsel to disable
PERF_FLAG_FD_CLOEXEC in perf_evsel__open() if not
supported by kernel;
* add a dedicated function to probe for
PERF_FLAG_FD_CLOEXEC support in the current kernel.
This function is to be used by other functions
calling sys_perf_event_open() directly.
- remove PERF_FLAG_FD_CLOEXEC from PowerPC selftest
as it's not related to perf tool: it should be part
of a separate patch.
[1] http://lkml.kernel.org/r/8c03f54e1598b1727c19706f3af03f98685d9fe6.1388952061.git.ydroneaud@opteya.com
https://patchwork.kernel.org/patch/3434971/
[2] http://lkml.kernel.org/r/[email protected]
https://patchwork.kernel.org/patch/3469571/
[3] http://lkml.kernel.org/r/[email protected]
https://patchwork.kernel.org/patch/3437111/
Cc: Peter Zijlstra <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Andi Kleen <[email protected]>
Signed-off-by: Yann Droneaud <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
---
tools/perf/Makefile.perf | 1 +
tools/perf/bench/mem-memcpy.c | 4 ++-
tools/perf/bench/mem-memset.c | 4 ++-
tools/perf/builtin-sched.c | 4 ++-
tools/perf/tests/bp_signal.c | 4 ++-
tools/perf/tests/bp_signal_overflow.c | 4 ++-
tools/perf/tests/rdpmc.c | 4 ++-
tools/perf/util/cloexec.c | 54 +++++++++++++++++++++++++++++++++++
tools/perf/util/cloexec.h | 6 ++++
tools/perf/util/evsel.c | 12 ++++++--
tools/perf/util/record.c | 9 ++++--
11 files changed, 94 insertions(+), 12 deletions(-)
create mode 100644 tools/perf/util/cloexec.c
create mode 100644 tools/perf/util/cloexec.h
diff --git a/tools/perf/Makefile.perf b/tools/perf/Makefile.perf
index 3638b0bd20dc..bcce558b5407 100644
--- a/tools/perf/Makefile.perf
+++ b/tools/perf/Makefile.perf
@@ -372,6 +372,7 @@ LIB_OBJS += $(OUTPUT)util/stat.o
LIB_OBJS += $(OUTPUT)util/record.o
LIB_OBJS += $(OUTPUT)util/srcline.o
LIB_OBJS += $(OUTPUT)util/data.o
+LIB_OBJS += $(OUTPUT)util/cloexec.o
LIB_OBJS += $(OUTPUT)ui/setup.o
LIB_OBJS += $(OUTPUT)ui/helpline.o
diff --git a/tools/perf/bench/mem-memcpy.c b/tools/perf/bench/mem-memcpy.c
index 5ce71d3b72cf..bf5a21b848a9 100644
--- a/tools/perf/bench/mem-memcpy.c
+++ b/tools/perf/bench/mem-memcpy.c
@@ -10,6 +10,7 @@
#include "../util/util.h"
#include "../util/parse-options.h"
#include "../util/header.h"
+#include "../util/cloexec.h"
#include "bench.h"
#include "mem-memcpy-arch.h"
@@ -83,7 +84,8 @@ static struct perf_event_attr cycle_attr = {
static void init_cycle(void)
{
- cycle_fd = sys_perf_event_open(&cycle_attr, getpid(), -1, -1, 0);
+ cycle_fd = sys_perf_event_open(&cycle_attr, getpid(), -1, -1,
+ perf_event_open_cloexec_flag());
if (cycle_fd < 0 && errno == ENOSYS)
die("No CONFIG_PERF_EVENTS=y kernel support configured?\n");
diff --git a/tools/perf/bench/mem-memset.c b/tools/perf/bench/mem-memset.c
index 9af79d2b18e5..260747ea1e0e 100644
--- a/tools/perf/bench/mem-memset.c
+++ b/tools/perf/bench/mem-memset.c
@@ -10,6 +10,7 @@
#include "../util/util.h"
#include "../util/parse-options.h"
#include "../util/header.h"
+#include "../util/cloexec.h"
#include "bench.h"
#include "mem-memset-arch.h"
@@ -83,7 +84,8 @@ static struct perf_event_attr cycle_attr = {
static void init_cycle(void)
{
- cycle_fd = sys_perf_event_open(&cycle_attr, getpid(), -1, -1, 0);
+ cycle_fd = sys_perf_event_open(&cycle_attr, getpid(), -1, -1,
+ perf_event_open_cloexec_flag());
if (cycle_fd < 0 && errno == ENOSYS)
die("No CONFIG_PERF_EVENTS=y kernel support configured?\n");
diff --git a/tools/perf/builtin-sched.c b/tools/perf/builtin-sched.c
index 6a76a07b6789..54017bdec88c 100644
--- a/tools/perf/builtin-sched.c
+++ b/tools/perf/builtin-sched.c
@@ -10,6 +10,7 @@
#include "util/header.h"
#include "util/session.h"
#include "util/tool.h"
+#include "util/cloexec.h"
#include "util/parse-options.h"
#include "util/trace-event.h"
@@ -435,7 +436,8 @@ static int self_open_counters(void)
attr.type = PERF_TYPE_SOFTWARE;
attr.config = PERF_COUNT_SW_TASK_CLOCK;
- fd = sys_perf_event_open(&attr, 0, -1, -1, 0);
+ fd = sys_perf_event_open(&attr, 0, -1, -1,
+ perf_event_open_cloexec_flag());
if (fd < 0)
pr_err("Error: sys_perf_event_open() syscall returned "
diff --git a/tools/perf/tests/bp_signal.c b/tools/perf/tests/bp_signal.c
index aba095489193..fdc0d3e185f9 100644
--- a/tools/perf/tests/bp_signal.c
+++ b/tools/perf/tests/bp_signal.c
@@ -25,6 +25,7 @@
#include "tests.h"
#include "debug.h"
#include "perf.h"
+#include "../util/cloexec.h"
static int fd1;
static int fd2;
@@ -78,7 +79,8 @@ static int bp_event(void *fn, int setup_signal)
pe.exclude_kernel = 1;
pe.exclude_hv = 1;
- fd = sys_perf_event_open(&pe, 0, -1, -1, 0);
+ fd = sys_perf_event_open(&pe, 0, -1, -1,
+ perf_event_open_cloexec_flag());
if (fd < 0) {
pr_debug("failed opening event %llx\n", pe.config);
return TEST_FAIL;
diff --git a/tools/perf/tests/bp_signal_overflow.c b/tools/perf/tests/bp_signal_overflow.c
index 44ac82179708..b0b17415f18c 100644
--- a/tools/perf/tests/bp_signal_overflow.c
+++ b/tools/perf/tests/bp_signal_overflow.c
@@ -24,6 +24,7 @@
#include "tests.h"
#include "debug.h"
#include "perf.h"
+#include "../util/cloexec.h"
static int overflows;
@@ -91,7 +92,8 @@ int test__bp_signal_overflow(void)
pe.exclude_kernel = 1;
pe.exclude_hv = 1;
- fd = sys_perf_event_open(&pe, 0, -1, -1, 0);
+ fd = sys_perf_event_open(&pe, 0, -1, -1,
+ perf_event_open_cloexec_flag());
if (fd < 0) {
pr_debug("failed opening event %llx\n", pe.config);
return TEST_FAIL;
diff --git a/tools/perf/tests/rdpmc.c b/tools/perf/tests/rdpmc.c
index 46649c25fa5e..c1e55ff18774 100644
--- a/tools/perf/tests/rdpmc.c
+++ b/tools/perf/tests/rdpmc.c
@@ -6,6 +6,7 @@
#include "perf.h"
#include "debug.h"
#include "tests.h"
+#include "../util/cloexec.h"
#if defined(__x86_64__) || defined(__i386__)
@@ -104,7 +105,8 @@ static int __test__rdpmc(void)
sa.sa_sigaction = segfault_handler;
sigaction(SIGSEGV, &sa, NULL);
- fd = sys_perf_event_open(&attr, 0, -1, -1, 0);
+ fd = sys_perf_event_open(&attr, 0, -1, -1,
+ perf_event_open_cloexec_flag());
if (fd < 0) {
pr_err("Error: sys_perf_event_open() syscall returned "
"with %d (%s)\n", fd, strerror(errno));
diff --git a/tools/perf/util/cloexec.c b/tools/perf/util/cloexec.c
new file mode 100644
index 000000000000..06f6ee087fa1
--- /dev/null
+++ b/tools/perf/util/cloexec.c
@@ -0,0 +1,54 @@
+#include "util.h"
+#include "../perf.h"
+#include "cloexec.h"
+
+static unsigned long flag = PERF_FLAG_FD_CLOEXEC;
+
+static int perf_flag_probe(void)
+{
+ struct perf_event_attr attr;
+ int fd;
+ int err;
+
+ /* check cloexec flag */
+ memset(&attr, 0, sizeof(attr));
+ fd = sys_perf_event_open(&attr, 0, -1, -1,
+ PERF_FLAG_FD_CLOEXEC);
+ if (fd >= 0) {
+ close(fd);
+ return 1;
+ }
+
+ if (errno != EINVAL) {
+ err = errno;
+ pr_warning("sys_perf_event_open() syscall returned "
+ "%d (%d: %s)\n", fd, err, strerror(err));
+ }
+
+ /* not supported, confirm error related to PERF_FLAG_FD_CLOEXEC */
+ memset(&attr, 0, sizeof(attr));
+ fd = sys_perf_event_open(&attr, 0, -1, -1, 0);
+ if (fd >= 0) {
+ close(fd);
+ return 0;
+ }
+
+ err = errno;
+ die("sys_perf_event_open() syscall returned "
+ "%d (%d: %s)\n", fd, err, strerror(err));
+
+ return -1;
+}
+
+unsigned long perf_event_open_cloexec_flag(void)
+{
+ static bool probed;
+
+ if (!probed) {
+ if (perf_flag_probe() <= 0)
+ flag = 0;
+ probed = true;
+ }
+
+ return flag;
+}
diff --git a/tools/perf/util/cloexec.h b/tools/perf/util/cloexec.h
new file mode 100644
index 000000000000..94a5a7d829d5
--- /dev/null
+++ b/tools/perf/util/cloexec.h
@@ -0,0 +1,6 @@
+#ifndef __PERF_CLOEXEC_H
+#define __PERF_CLOEXEC_H
+
+unsigned long perf_event_open_cloexec_flag(void);
+
+#endif /* __PERF_CLOEXEC_H */
diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
index ade8d9c1c431..ff845c7e7a4f 100644
--- a/tools/perf/util/evsel.c
+++ b/tools/perf/util/evsel.c
@@ -29,6 +29,7 @@ static struct {
bool sample_id_all;
bool exclude_guest;
bool mmap2;
+ bool cloexec;
} perf_missing_features;
#define FD(e, x, y) (*(int *)xyarray__entry(e->fd, x, y))
@@ -968,7 +969,7 @@ static int __perf_evsel__open(struct perf_evsel *evsel, struct cpu_map *cpus,
struct thread_map *threads)
{
int cpu, thread;
- unsigned long flags = 0;
+ unsigned long flags = PERF_FLAG_FD_CLOEXEC;
int pid = -1, err;
enum { NO_CHANGE, SET_TO_MAX, INCREASED_MAX } set_rlimit = NO_CHANGE;
@@ -977,11 +978,13 @@ static int __perf_evsel__open(struct perf_evsel *evsel, struct cpu_map *cpus,
return -ENOMEM;
if (evsel->cgrp) {
- flags = PERF_FLAG_PID_CGROUP;
+ flags |= PERF_FLAG_PID_CGROUP;
pid = evsel->cgrp->fd;
}
fallback_missing_features:
+ if (perf_missing_features.cloexec)
+ flags &= ~(unsigned long)PERF_FLAG_FD_CLOEXEC;
if (perf_missing_features.mmap2)
evsel->attr.mmap2 = 0;
if (perf_missing_features.exclude_guest)
@@ -1050,7 +1053,10 @@ try_fallback:
if (err != -EINVAL || cpu > 0 || thread > 0)
goto out_close;
- if (!perf_missing_features.mmap2 && evsel->attr.mmap2) {
+ if (!perf_missing_features.cloexec && (flags & PERF_FLAG_FD_CLOEXEC)) {
+ perf_missing_features.cloexec = true;
+ goto fallback_missing_features;
+ } else if (!perf_missing_features.mmap2 && evsel->attr.mmap2) {
perf_missing_features.mmap2 = true;
goto fallback_missing_features;
} else if (!perf_missing_features.exclude_guest &&
diff --git a/tools/perf/util/record.c b/tools/perf/util/record.c
index 104a47563d39..6483ef5df31b 100644
--- a/tools/perf/util/record.c
+++ b/tools/perf/util/record.c
@@ -4,6 +4,7 @@
#include "parse-events.h"
#include "fs.h"
#include "util.h"
+#include "cloexec.h"
typedef void (*setup_probe_fn_t)(struct perf_evsel *evsel);
@@ -11,6 +12,7 @@ static int perf_do_probe_api(setup_probe_fn_t fn, int cpu, const char *str)
{
struct perf_evlist *evlist;
struct perf_evsel *evsel;
+ unsigned long flags = perf_event_open_cloexec_flag();
int err = -EAGAIN, fd;
evlist = perf_evlist__new();
@@ -22,14 +24,14 @@ static int perf_do_probe_api(setup_probe_fn_t fn, int cpu, const char *str)
evsel = perf_evlist__first(evlist);
- fd = sys_perf_event_open(&evsel->attr, -1, cpu, -1, 0);
+ fd = sys_perf_event_open(&evsel->attr, -1, cpu, -1, flags);
if (fd < 0)
goto out_delete;
close(fd);
fn(evsel);
- fd = sys_perf_event_open(&evsel->attr, -1, cpu, -1, 0);
+ fd = sys_perf_event_open(&evsel->attr, -1, cpu, -1, flags);
if (fd < 0) {
if (errno == EINVAL)
err = -EINVAL;
@@ -203,7 +205,8 @@ bool perf_evlist__can_select_event(struct perf_evlist *evlist, const char *str)
cpu = evlist->cpus->map[0];
}
- fd = sys_perf_event_open(&evsel->attr, -1, cpu, -1, 0);
+ fd = sys_perf_event_open(&evsel->attr, -1, cpu, -1,
+ perf_event_open_cloexec_flag());
if (fd >= 0) {
close(fd);
ret = true;
--
1.8.4.2
Em Mon, Jan 13, 2014 at 11:09:30AM +0100, Yann Droneaud escreveu:
> +++ b/tools/perf/util/cloexec.c
> @@ -0,0 +1,54 @@
> +#include "util.h"
> +#include "../perf.h"
> +#include "cloexec.h"
> +
> +static unsigned long flag = PERF_FLAG_FD_CLOEXEC;
> +
> +static int perf_flag_probe(void)
> +{
> + struct perf_event_attr attr;
> + int fd;
> + int err;
> +
> + /* check cloexec flag */
> + memset(&attr, 0, sizeof(attr));
If you do this you'll use attr.type = PERF_TYPE_HARDWARE and
attr.config then means PERF_COUNT_HW_CPU_CYCLES, which will fail
where we don't have a PMU, see perf_evsel__fallback(), perhaps it is
better for you to go with the fallback that is most likely supported in
all arches:
evsel->attr.type = PERF_TYPE_SOFTWARE; // 1
evsel->attr.config = PERF_COUNT_SW_CPU_CLOCK; // 0
I.e. just do a:
struct perf_event_attr attr = { .type = 1, }; and ditch the
memset (or do a attr.type = PERF_COUNT_SW_CPU_CLOCK;, up to you).
Thanks for doing the __perf_evsel__open fallbacking!
- Arnaldo
> + fd = sys_perf_event_open(&attr, 0, -1, -1,
> + PERF_FLAG_FD_CLOEXEC);
> + if (fd >= 0) {
> + close(fd);
> + return 1;
> + }
> +
> + if (errno != EINVAL) {
> + err = errno;
> + pr_warning("sys_perf_event_open() syscall returned "
> + "%d (%d: %s)\n", fd, err, strerror(err));
> + }
> +
> + /* not supported, confirm error related to PERF_FLAG_FD_CLOEXEC */
> + memset(&attr, 0, sizeof(attr));
> + fd = sys_perf_event_open(&attr, 0, -1, -1, 0);
> + if (fd >= 0) {
> + close(fd);
> + return 0;
> + }
> +
> + err = errno;
> + die("sys_perf_event_open() syscall returned "
> + "%d (%d: %s)\n", fd, err, strerror(err));
> +
> + return -1;
> +}
> +
> +unsigned long perf_event_open_cloexec_flag(void)
> +{
> + static bool probed;
> +
> + if (!probed) {
> + if (perf_flag_probe() <= 0)
> + flag = 0;
> + probed = true;
> + }
> +
> + return flag;
> +}
> diff --git a/tools/perf/util/cloexec.h b/tools/perf/util/cloexec.h
> new file mode 100644
> index 000000000000..94a5a7d829d5
> --- /dev/null
> +++ b/tools/perf/util/cloexec.h
> @@ -0,0 +1,6 @@
> +#ifndef __PERF_CLOEXEC_H
> +#define __PERF_CLOEXEC_H
> +
> +unsigned long perf_event_open_cloexec_flag(void);
> +
> +#endif /* __PERF_CLOEXEC_H */
> diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
> index ade8d9c1c431..ff845c7e7a4f 100644
> --- a/tools/perf/util/evsel.c
> +++ b/tools/perf/util/evsel.c
> @@ -29,6 +29,7 @@ static struct {
> bool sample_id_all;
> bool exclude_guest;
> bool mmap2;
> + bool cloexec;
> } perf_missing_features;
>
> #define FD(e, x, y) (*(int *)xyarray__entry(e->fd, x, y))
> @@ -968,7 +969,7 @@ static int __perf_evsel__open(struct perf_evsel *evsel, struct cpu_map *cpus,
> struct thread_map *threads)
> {
> int cpu, thread;
> - unsigned long flags = 0;
> + unsigned long flags = PERF_FLAG_FD_CLOEXEC;
> int pid = -1, err;
> enum { NO_CHANGE, SET_TO_MAX, INCREASED_MAX } set_rlimit = NO_CHANGE;
>
> @@ -977,11 +978,13 @@ static int __perf_evsel__open(struct perf_evsel *evsel, struct cpu_map *cpus,
> return -ENOMEM;
>
> if (evsel->cgrp) {
> - flags = PERF_FLAG_PID_CGROUP;
> + flags |= PERF_FLAG_PID_CGROUP;
> pid = evsel->cgrp->fd;
> }
>
> fallback_missing_features:
> + if (perf_missing_features.cloexec)
> + flags &= ~(unsigned long)PERF_FLAG_FD_CLOEXEC;
> if (perf_missing_features.mmap2)
> evsel->attr.mmap2 = 0;
> if (perf_missing_features.exclude_guest)
> @@ -1050,7 +1053,10 @@ try_fallback:
> if (err != -EINVAL || cpu > 0 || thread > 0)
> goto out_close;
>
> - if (!perf_missing_features.mmap2 && evsel->attr.mmap2) {
> + if (!perf_missing_features.cloexec && (flags & PERF_FLAG_FD_CLOEXEC)) {
> + perf_missing_features.cloexec = true;
> + goto fallback_missing_features;
> + } else if (!perf_missing_features.mmap2 && evsel->attr.mmap2) {
> perf_missing_features.mmap2 = true;
> goto fallback_missing_features;
> } else if (!perf_missing_features.exclude_guest &&
> diff --git a/tools/perf/util/record.c b/tools/perf/util/record.c
> index 104a47563d39..6483ef5df31b 100644
> --- a/tools/perf/util/record.c
> +++ b/tools/perf/util/record.c
> @@ -4,6 +4,7 @@
> #include "parse-events.h"
> #include "fs.h"
> #include "util.h"
> +#include "cloexec.h"
>
> typedef void (*setup_probe_fn_t)(struct perf_evsel *evsel);
>
> @@ -11,6 +12,7 @@ static int perf_do_probe_api(setup_probe_fn_t fn, int cpu, const char *str)
> {
> struct perf_evlist *evlist;
> struct perf_evsel *evsel;
> + unsigned long flags = perf_event_open_cloexec_flag();
> int err = -EAGAIN, fd;
>
> evlist = perf_evlist__new();
> @@ -22,14 +24,14 @@ static int perf_do_probe_api(setup_probe_fn_t fn, int cpu, const char *str)
>
> evsel = perf_evlist__first(evlist);
>
> - fd = sys_perf_event_open(&evsel->attr, -1, cpu, -1, 0);
> + fd = sys_perf_event_open(&evsel->attr, -1, cpu, -1, flags);
> if (fd < 0)
> goto out_delete;
> close(fd);
>
> fn(evsel);
>
> - fd = sys_perf_event_open(&evsel->attr, -1, cpu, -1, 0);
> + fd = sys_perf_event_open(&evsel->attr, -1, cpu, -1, flags);
> if (fd < 0) {
> if (errno == EINVAL)
> err = -EINVAL;
> @@ -203,7 +205,8 @@ bool perf_evlist__can_select_event(struct perf_evlist *evlist, const char *str)
> cpu = evlist->cpus->map[0];
> }
>
> - fd = sys_perf_event_open(&evsel->attr, -1, cpu, -1, 0);
> + fd = sys_perf_event_open(&evsel->attr, -1, cpu, -1,
> + perf_event_open_cloexec_flag());
> if (fd >= 0) {
> close(fd);
> ret = true;
> --
> 1.8.4.2
Hi Lino,
I've recently sent this patch which may fix a bug in the way fanotify
creates its file descriptors.
I failed to noticed you've been working recently on fanotify subsystem
and miss the opportunity to add you in the Cc:s.
Please have a look to this patch and help to get it merged.
Le dimanche 05 janvier 2014 à 21:36 +0100, Yann Droneaud a écrit :
> According to commit 80af258867648, file descriptor created as part
> of file access notification events inherit flags from the event_f_flags
> argument passed to syscall fanotify_init(2).
>
> So while it is legal for userspace to call fanotify_init() with O_CLOEXEC
> as part of its second argument, O_CLOEXEC is currently silently ignored.
>
> Indeed event_f_flags are only given to dentry_open(), which only
> seems to care about O_ACCMODE and O_PATH in do_dentry_open(), O_DIRECT
> in open_check_o_direct() and O_LARGEFILE in generic_file_open().
>
> More, there's no effective check on event_f_flags value that would
> catch unknown / unsupported values, unlike the one on f_flags argument
> of the syscall (see FAN_ALL_INIT_FLAGS in include/uapi/linux/fanotify.h).
...
> But it seems logical to set close-on-exec flag on the file descriptor
> if userspace is allowed to request it with O_CLOEXEC.
>
...
> So this patch replaces call to macro get_unused_fd() by a call to
> function get_unused_fd_flags() with event_f_flags value as argument.
> This way O_CLOEXEC flag in the second argument of fanotify_init(2)
> syscall is interpreted so that close-on-exec get enabled.
...
> Cc: Eric Paris <[email protected]>
> Cc: Al Viro <[email protected]>
> Cc: [email protected]
> Signed-off-by: Yann Droneaud <[email protected]>
> Link:
> http://lkml.kernel.org/r/[email protected]
>
> ---
> fs/notify/fanotify/fanotify_user.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/fs/notify/fanotify/fanotify_user.c b/fs/notify/fanotify/fanotify_user.c
> index 57d7c083cb4b..6d0eaabba02e 100644
> --- a/fs/notify/fanotify/fanotify_user.c
> +++ b/fs/notify/fanotify/fanotify_user.c
> @@ -71,7 +71,7 @@ static int create_fd(struct fsnotify_group *group,
>
> pr_debug("%s: group=%p event=%p\n", __func__, group, event);
>
> - client_fd = get_unused_fd();
> + client_fd = get_unused_fd_flags(group->fanotify_data.f_flags);
> if (client_fd < 0)
> return client_fd;
>
Regards.
--
Yann Droneaud
OPTEYA
In commit a21b0b354d4a, flag PERF_FLAG_FD_CLOEXEC was
added to perf_event_open(2) syscall to allows userspace
to atomically enable close-on-exec behavor when creating
the file descriptor.
This patch makes perf tools use the new flag if supported
by the kernel, so that the event file descriptors got
automatically closed if perf tool exec a sub-command.
Changes from v2 [1]:
- use safest perf attr configuration in probe function,
as used in perf_evsel__fallback().
Changes from v1 [2]:
- don't probe PERF_FLAG_FD_CLOEXEC for each call to
perf_event_open_cloexec_flag(): don't forget to set
'probed' variable once flag was probed.
- call perf_event_open_cloexec_flag() only once in
util/record.c:perf_do_probe_api().
- fixed minor coding style issue (unneeded braces)
in util/cloexec.c
Changes from v0 [3]:
- add probing for PERF_FLAG_FD_CLOEXEC flag allowing
perf to run on older kernel:
* use "missing feature" pattern in evsel to disable
PERF_FLAG_FD_CLOEXEC in perf_evsel__open() if not
supported by kernel;
* add a dedicated function to probe for
PERF_FLAG_FD_CLOEXEC support in the current kernel.
This function is to be used by other functions
calling sys_perf_event_open() directly.
- remove PERF_FLAG_FD_CLOEXEC from PowerPC selftest
as it's not related to perf tool: it should be part
of a separate patch.
[1] http://lkml.kernel.org/r/[email protected]
https://patchwork.kernel.org/patch/3474141/
[2] http://lkml.kernel.org/r/[email protected]
https://patchwork.kernel.org/patch/3469571/
[3] http://lkml.kernel.org/r/[email protected]
https://patchwork.kernel.org/patch/3437111/
Cc: Peter Zijlstra <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Ingo Molnar <[email protected]>
Signed-off-by: Yann Droneaud <[email protected]>
---
tools/perf/Makefile.perf | 1 +
tools/perf/bench/mem-memcpy.c | 4 ++-
tools/perf/bench/mem-memset.c | 4 ++-
tools/perf/builtin-sched.c | 4 ++-
tools/perf/tests/bp_signal.c | 4 ++-
tools/perf/tests/bp_signal_overflow.c | 4 ++-
tools/perf/tests/rdpmc.c | 4 ++-
tools/perf/util/cloexec.c | 56 +++++++++++++++++++++++++++++++++++
tools/perf/util/cloexec.h | 6 ++++
tools/perf/util/evsel.c | 12 ++++++--
tools/perf/util/record.c | 9 ++++--
11 files changed, 96 insertions(+), 12 deletions(-)
create mode 100644 tools/perf/util/cloexec.c
create mode 100644 tools/perf/util/cloexec.h
diff --git a/tools/perf/Makefile.perf b/tools/perf/Makefile.perf
index 7257e7e9e38a..6aeefd97c9db 100644
--- a/tools/perf/Makefile.perf
+++ b/tools/perf/Makefile.perf
@@ -373,6 +373,7 @@ LIB_OBJS += $(OUTPUT)util/stat.o
LIB_OBJS += $(OUTPUT)util/record.o
LIB_OBJS += $(OUTPUT)util/srcline.o
LIB_OBJS += $(OUTPUT)util/data.o
+LIB_OBJS += $(OUTPUT)util/cloexec.o
LIB_OBJS += $(OUTPUT)ui/setup.o
LIB_OBJS += $(OUTPUT)ui/helpline.o
diff --git a/tools/perf/bench/mem-memcpy.c b/tools/perf/bench/mem-memcpy.c
index 5ce71d3b72cf..bf5a21b848a9 100644
--- a/tools/perf/bench/mem-memcpy.c
+++ b/tools/perf/bench/mem-memcpy.c
@@ -10,6 +10,7 @@
#include "../util/util.h"
#include "../util/parse-options.h"
#include "../util/header.h"
+#include "../util/cloexec.h"
#include "bench.h"
#include "mem-memcpy-arch.h"
@@ -83,7 +84,8 @@ static struct perf_event_attr cycle_attr = {
static void init_cycle(void)
{
- cycle_fd = sys_perf_event_open(&cycle_attr, getpid(), -1, -1, 0);
+ cycle_fd = sys_perf_event_open(&cycle_attr, getpid(), -1, -1,
+ perf_event_open_cloexec_flag());
if (cycle_fd < 0 && errno == ENOSYS)
die("No CONFIG_PERF_EVENTS=y kernel support configured?\n");
diff --git a/tools/perf/bench/mem-memset.c b/tools/perf/bench/mem-memset.c
index 9af79d2b18e5..260747ea1e0e 100644
--- a/tools/perf/bench/mem-memset.c
+++ b/tools/perf/bench/mem-memset.c
@@ -10,6 +10,7 @@
#include "../util/util.h"
#include "../util/parse-options.h"
#include "../util/header.h"
+#include "../util/cloexec.h"
#include "bench.h"
#include "mem-memset-arch.h"
@@ -83,7 +84,8 @@ static struct perf_event_attr cycle_attr = {
static void init_cycle(void)
{
- cycle_fd = sys_perf_event_open(&cycle_attr, getpid(), -1, -1, 0);
+ cycle_fd = sys_perf_event_open(&cycle_attr, getpid(), -1, -1,
+ perf_event_open_cloexec_flag());
if (cycle_fd < 0 && errno == ENOSYS)
die("No CONFIG_PERF_EVENTS=y kernel support configured?\n");
diff --git a/tools/perf/builtin-sched.c b/tools/perf/builtin-sched.c
index 6a76a07b6789..54017bdec88c 100644
--- a/tools/perf/builtin-sched.c
+++ b/tools/perf/builtin-sched.c
@@ -10,6 +10,7 @@
#include "util/header.h"
#include "util/session.h"
#include "util/tool.h"
+#include "util/cloexec.h"
#include "util/parse-options.h"
#include "util/trace-event.h"
@@ -435,7 +436,8 @@ static int self_open_counters(void)
attr.type = PERF_TYPE_SOFTWARE;
attr.config = PERF_COUNT_SW_TASK_CLOCK;
- fd = sys_perf_event_open(&attr, 0, -1, -1, 0);
+ fd = sys_perf_event_open(&attr, 0, -1, -1,
+ perf_event_open_cloexec_flag());
if (fd < 0)
pr_err("Error: sys_perf_event_open() syscall returned "
diff --git a/tools/perf/tests/bp_signal.c b/tools/perf/tests/bp_signal.c
index aba095489193..fdc0d3e185f9 100644
--- a/tools/perf/tests/bp_signal.c
+++ b/tools/perf/tests/bp_signal.c
@@ -25,6 +25,7 @@
#include "tests.h"
#include "debug.h"
#include "perf.h"
+#include "../util/cloexec.h"
static int fd1;
static int fd2;
@@ -78,7 +79,8 @@ static int bp_event(void *fn, int setup_signal)
pe.exclude_kernel = 1;
pe.exclude_hv = 1;
- fd = sys_perf_event_open(&pe, 0, -1, -1, 0);
+ fd = sys_perf_event_open(&pe, 0, -1, -1,
+ perf_event_open_cloexec_flag());
if (fd < 0) {
pr_debug("failed opening event %llx\n", pe.config);
return TEST_FAIL;
diff --git a/tools/perf/tests/bp_signal_overflow.c b/tools/perf/tests/bp_signal_overflow.c
index 44ac82179708..b0b17415f18c 100644
--- a/tools/perf/tests/bp_signal_overflow.c
+++ b/tools/perf/tests/bp_signal_overflow.c
@@ -24,6 +24,7 @@
#include "tests.h"
#include "debug.h"
#include "perf.h"
+#include "../util/cloexec.h"
static int overflows;
@@ -91,7 +92,8 @@ int test__bp_signal_overflow(void)
pe.exclude_kernel = 1;
pe.exclude_hv = 1;
- fd = sys_perf_event_open(&pe, 0, -1, -1, 0);
+ fd = sys_perf_event_open(&pe, 0, -1, -1,
+ perf_event_open_cloexec_flag());
if (fd < 0) {
pr_debug("failed opening event %llx\n", pe.config);
return TEST_FAIL;
diff --git a/tools/perf/tests/rdpmc.c b/tools/perf/tests/rdpmc.c
index 46649c25fa5e..c1e55ff18774 100644
--- a/tools/perf/tests/rdpmc.c
+++ b/tools/perf/tests/rdpmc.c
@@ -6,6 +6,7 @@
#include "perf.h"
#include "debug.h"
#include "tests.h"
+#include "../util/cloexec.h"
#if defined(__x86_64__) || defined(__i386__)
@@ -104,7 +105,8 @@ static int __test__rdpmc(void)
sa.sa_sigaction = segfault_handler;
sigaction(SIGSEGV, &sa, NULL);
- fd = sys_perf_event_open(&attr, 0, -1, -1, 0);
+ fd = sys_perf_event_open(&attr, 0, -1, -1,
+ perf_event_open_cloexec_flag());
if (fd < 0) {
pr_err("Error: sys_perf_event_open() syscall returned "
"with %d (%s)\n", fd, strerror(errno));
diff --git a/tools/perf/util/cloexec.c b/tools/perf/util/cloexec.c
new file mode 100644
index 000000000000..771ed36fcc61
--- /dev/null
+++ b/tools/perf/util/cloexec.c
@@ -0,0 +1,56 @@
+#include "util.h"
+#include "../perf.h"
+#include "cloexec.h"
+
+static unsigned long flag = PERF_FLAG_FD_CLOEXEC;
+
+static int perf_flag_probe(void)
+{
+ /* use 'safest' configuration as used in perf_evsel__fallback() */
+ struct perf_event_attr attr = {
+ .type = PERF_COUNT_SW_CPU_CLOCK,
+ .config = PERF_COUNT_SW_CPU_CLOCK,
+ };
+ int fd;
+ int err;
+
+ /* check cloexec flag */
+ fd = sys_perf_event_open(&attr, 0, -1, -1,
+ PERF_FLAG_FD_CLOEXEC);
+ if (fd >= 0) {
+ close(fd);
+ return 1;
+ }
+
+ if (errno != EINVAL) {
+ err = errno;
+ pr_warning("sys_perf_event_open() syscall returned "
+ "%d (%d: %s)\n", fd, err, strerror(err));
+ }
+
+ /* not supported, confirm error related to PERF_FLAG_FD_CLOEXEC */
+ fd = sys_perf_event_open(&attr, 0, -1, -1, 0);
+ if (fd >= 0) {
+ close(fd);
+ return 0;
+ }
+
+ err = errno;
+ die("sys_perf_event_open() syscall returned "
+ "%d (%d: %s)\n", fd, err, strerror(err));
+
+ return -1;
+}
+
+unsigned long perf_event_open_cloexec_flag(void)
+{
+ static bool probed;
+
+ if (!probed) {
+ if (perf_flag_probe() <= 0)
+ flag = 0;
+ probed = true;
+ }
+
+ return flag;
+}
diff --git a/tools/perf/util/cloexec.h b/tools/perf/util/cloexec.h
new file mode 100644
index 000000000000..94a5a7d829d5
--- /dev/null
+++ b/tools/perf/util/cloexec.h
@@ -0,0 +1,6 @@
+#ifndef __PERF_CLOEXEC_H
+#define __PERF_CLOEXEC_H
+
+unsigned long perf_event_open_cloexec_flag(void);
+
+#endif /* __PERF_CLOEXEC_H */
diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
index 22e18a26b7e6..0a44fba07b29 100644
--- a/tools/perf/util/evsel.c
+++ b/tools/perf/util/evsel.c
@@ -29,6 +29,7 @@ static struct {
bool sample_id_all;
bool exclude_guest;
bool mmap2;
+ bool cloexec;
} perf_missing_features;
#define FD(e, x, y) (*(int *)xyarray__entry(e->fd, x, y))
@@ -969,7 +970,7 @@ static int __perf_evsel__open(struct perf_evsel *evsel, struct cpu_map *cpus,
struct thread_map *threads)
{
int cpu, thread;
- unsigned long flags = 0;
+ unsigned long flags = PERF_FLAG_FD_CLOEXEC;
int pid = -1, err;
enum { NO_CHANGE, SET_TO_MAX, INCREASED_MAX } set_rlimit = NO_CHANGE;
@@ -978,11 +979,13 @@ static int __perf_evsel__open(struct perf_evsel *evsel, struct cpu_map *cpus,
return -ENOMEM;
if (evsel->cgrp) {
- flags = PERF_FLAG_PID_CGROUP;
+ flags |= PERF_FLAG_PID_CGROUP;
pid = evsel->cgrp->fd;
}
fallback_missing_features:
+ if (perf_missing_features.cloexec)
+ flags &= ~(unsigned long)PERF_FLAG_FD_CLOEXEC;
if (perf_missing_features.mmap2)
evsel->attr.mmap2 = 0;
if (perf_missing_features.exclude_guest)
@@ -1051,7 +1054,10 @@ try_fallback:
if (err != -EINVAL || cpu > 0 || thread > 0)
goto out_close;
- if (!perf_missing_features.mmap2 && evsel->attr.mmap2) {
+ if (!perf_missing_features.cloexec && (flags & PERF_FLAG_FD_CLOEXEC)) {
+ perf_missing_features.cloexec = true;
+ goto fallback_missing_features;
+ } else if (!perf_missing_features.mmap2 && evsel->attr.mmap2) {
perf_missing_features.mmap2 = true;
goto fallback_missing_features;
} else if (!perf_missing_features.exclude_guest &&
diff --git a/tools/perf/util/record.c b/tools/perf/util/record.c
index 373762501dad..fa543aab6963 100644
--- a/tools/perf/util/record.c
+++ b/tools/perf/util/record.c
@@ -4,6 +4,7 @@
#include "parse-events.h"
#include "fs.h"
#include "util.h"
+#include "cloexec.h"
typedef void (*setup_probe_fn_t)(struct perf_evsel *evsel);
@@ -11,6 +12,7 @@ static int perf_do_probe_api(setup_probe_fn_t fn, int cpu, const char *str)
{
struct perf_evlist *evlist;
struct perf_evsel *evsel;
+ unsigned long flags = perf_event_open_cloexec_flag();
int err = -EAGAIN, fd;
evlist = perf_evlist__new();
@@ -22,14 +24,14 @@ static int perf_do_probe_api(setup_probe_fn_t fn, int cpu, const char *str)
evsel = perf_evlist__first(evlist);
- fd = sys_perf_event_open(&evsel->attr, -1, cpu, -1, 0);
+ fd = sys_perf_event_open(&evsel->attr, -1, cpu, -1, flags);
if (fd < 0)
goto out_delete;
close(fd);
fn(evsel);
- fd = sys_perf_event_open(&evsel->attr, -1, cpu, -1, 0);
+ fd = sys_perf_event_open(&evsel->attr, -1, cpu, -1, flags);
if (fd < 0) {
if (errno == EINVAL)
err = -EINVAL;
@@ -203,7 +205,8 @@ bool perf_evlist__can_select_event(struct perf_evlist *evlist, const char *str)
cpu = evlist->cpus->map[0];
}
- fd = sys_perf_event_open(&evsel->attr, -1, cpu, -1, 0);
+ fd = sys_perf_event_open(&evsel->attr, -1, cpu, -1,
+ perf_event_open_cloexec_flag());
if (fd >= 0) {
close(fd);
ret = true;
--
1.8.5.3