2020-04-02 08:42:49

by Alexey Budankov

Subject: [PATCH v8 00/12] Introduce CAP_PERFMON to secure system performance monitoring and observability


Changes in v8:
- added Acked-by and Reviewed-by tags acquired so far
- rebased on the top of tip perf/core repository:
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip perf/core
sha1: 629b3df7ecb01fddfdf71cb5d3c563d143117c33
Changes in v7:
- updated and extended kernel.rst and perf-security.rst documentation
files with the information about CAP_PERFMON capability and its use cases
- documented the case of double audit logging of CAP_PERFMON and CAP_SYS_ADMIN
capabilities on a SELinux enabled system
Changes in v6:
- avoided noaudit checks in perfmon_capable() to explicitly advertise
CAP_PERFMON usage through audit logs to secure system performance
monitoring and observability
Changes in v5:
- renamed CAP_SYS_PERFMON to CAP_PERFMON
- extended perfmon_capable() with noaudit checks
Changes in v4:
- converted perfmon_capable() into an inline function
- made perf_events kprobes, uprobes, hw breakpoints and namespaces data
available to CAP_SYS_PERFMON privileged processes
- applied perfmon_capable() to drivers/perf and drivers/oprofile
- extended __cmd_ftrace() with support of CAP_SYS_PERFMON
Changes in v3:
- implemented perfmon_capable() macros aggregating required capabilities
checks
Changes in v2:
- made perf_events trace points available to CAP_SYS_PERFMON privileged
processes
- made perf_event_paranoid_check() treat CAP_SYS_PERFMON equally to
CAP_SYS_ADMIN
- applied CAP_SYS_PERFMON to i915_perf, bpf_trace, powerpc and parisc
system performance monitoring and observability related subsystems

Currently, access to perf_events, i915_perf and other performance
monitoring and observability subsystems of the kernel is open only to
a privileged process [1] with the CAP_SYS_ADMIN capability enabled in the
process effective set [2].

This patch set introduces CAP_PERFMON capability designed to secure
system performance monitoring and observability operations so that
CAP_PERFMON would assist CAP_SYS_ADMIN capability in its governing role
for performance monitoring and observability subsystems of the kernel.

CAP_PERFMON intends to harden system security and integrity during
performance monitoring and observability operations by decreasing the
attack surface that is available to a CAP_SYS_ADMIN privileged process [2].
Providing access to performance monitoring and observability
operations under the CAP_PERFMON capability alone, without the rest of
the CAP_SYS_ADMIN credentials, removes the chance to misuse those
credentials and makes the operation more secure. Thus, CAP_PERFMON
implements the principle of least privilege for performance monitoring
and observability operations (POSIX IEEE 1003.1e: 2.2.2.39 principle of
least privilege: A security design principle that states that a process
or program be granted only those privileges (e.g., capabilities)
necessary to accomplish its legitimate function, and only for the time
that such privileges are actually required).

CAP_PERFMON intends to meet the demand for secure system performance
monitoring and observability operations in security sensitive,
restricted, multiuser production environments (e.g. HPC clusters, cloud
and virtual compute environments), where root or CAP_SYS_ADMIN
credentials are not available to the general users of a system, and to
securely open up system performance monitoring and observability
operations beyond root and CAP_SYS_ADMIN use cases.

CAP_PERFMON intends to take over the CAP_SYS_ADMIN credentials related
to system performance monitoring and observability operations and
thereby reduce the overloading of CAP_SYS_ADMIN, following the
recommendation in the capabilities man page [2] for CAP_SYS_ADMIN:
"Note: this capability is overloaded; see Notes to kernel developers,
below." For backward compatibility reasons, access to system performance
monitoring and observability subsystems of the kernel remains open for
CAP_SYS_ADMIN privileged processes, but CAP_SYS_ADMIN capability usage
for secure system performance monitoring and observability operations
is discouraged in favor of the CAP_PERFMON capability.

A possible alternative to this security hardening and capability
balancing work could be to use the existing CAP_SYS_PTRACE capability
to govern system performance monitoring and observability subsystems.
However, CAP_SYS_PTRACE still provides users with more credentials than
are required for secure performance monitoring and observability
operations, and CAP_PERFMON avoids this excess.

Although software running under CAP_PERFMON cannot ensure avoidance of
related hardware issues, the software can still mitigate those issues
following the official hardware issues mitigation procedure [3]. The
bugs in the software itself can be fixed following the standard kernel
development process [4] to maintain and harden security of system
performance monitoring and observability operations. Finally, the patch
set is structured so that tracking down any issues it might induce [5]
is as simple as possible.

---
Alexey Budankov (12):
capabilities: introduce CAP_PERFMON to kernel and user space
perf/core: open access to the core for CAP_PERFMON privileged process
perf/core: open access to probes for CAP_PERFMON privileged process
perf tool: extend Perf tool with CAP_PERFMON capability support
drm/i915/perf: open access for CAP_PERFMON privileged process
trace/bpf_trace: open access for CAP_PERFMON privileged process
powerpc/perf: open access for CAP_PERFMON privileged process
parisc/perf: open access for CAP_PERFMON privileged process
drivers/perf: open access for CAP_PERFMON privileged process
drivers/oprofile: open access for CAP_PERFMON privileged process
doc/admin-guide: update perf-security.rst with CAP_PERFMON information
doc/admin-guide: update kernel.rst with CAP_PERFMON information

Documentation/admin-guide/perf-security.rst | 65 +++++++++++++--------
Documentation/admin-guide/sysctl/kernel.rst | 16 +++--
arch/parisc/kernel/perf.c | 2 +-
arch/powerpc/perf/imc-pmu.c | 4 +-
drivers/gpu/drm/i915/i915_perf.c | 13 ++---
drivers/oprofile/event_buffer.c | 2 +-
drivers/perf/arm_spe_pmu.c | 4 +-
include/linux/capability.h | 4 ++
include/linux/perf_event.h | 6 +-
include/uapi/linux/capability.h | 8 ++-
kernel/events/core.c | 6 +-
kernel/trace/bpf_trace.c | 2 +-
security/selinux/include/classmap.h | 4 +-
tools/perf/builtin-ftrace.c | 5 +-
tools/perf/design.txt | 3 +-
tools/perf/util/cap.h | 4 ++
tools/perf/util/evsel.c | 10 ++--
tools/perf/util/util.c | 1 +
18 files changed, 98 insertions(+), 61 deletions(-)

---
Validation (Intel Skylake, 8 cores, Fedora 29, 5.5.0-rc3+, x86_64):

libcap library [6], [7], [8] and Perf tool can be used to apply
CAP_PERFMON capability for secure system performance monitoring and
observability beyond the scope permitted by the system wide
perf_event_paranoid kernel setting [9] and below are the steps for
evaluation:

- patch, build and boot the kernel
- patch, build Perf tool e.g. to /home/user/perf
...
# git clone git://git.kernel.org/pub/scm/libs/libcap/libcap.git libcap
# pushd libcap
# patch libcap/include/uapi/linux/capability.h with [PATCH 1]
# make
# pushd progs
# ./setcap "cap_perfmon,cap_sys_ptrace,cap_syslog=ep" /home/user/perf
# ./setcap -v "cap_perfmon,cap_sys_ptrace,cap_syslog=ep" /home/user/perf
/home/user/perf: OK
# ./getcap /home/user/perf
/home/user/perf = cap_sys_ptrace,cap_syslog,cap_perfmon+ep
# echo 2 > /proc/sys/kernel/perf_event_paranoid
# cat /proc/sys/kernel/perf_event_paranoid
2
...
$ /home/user/perf top
... works as expected ...
$ cat /proc/`pidof perf`/status
Name: perf
Umask: 0002
State: S (sleeping)
Tgid: 2958
Ngid: 0
Pid: 2958
PPid: 9847
TracerPid: 0
Uid: 500 500 500 500
Gid: 500 500 500 500
FDSize: 256
...
CapInh: 0000000000000000
CapPrm: 0000004400080000
CapEff: 0000004400080000 => 01000100 00000000 00001000 00000000 00000000
cap_perfmon,cap_sys_ptrace,cap_syslog
CapBnd: 0000007fffffffff
CapAmb: 0000000000000000
NoNewPrivs: 0
Seccomp: 0
Speculation_Store_Bypass: thread vulnerable
Cpus_allowed: ff
Cpus_allowed_list: 0-7
...
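
The CapEff line in the status output above can also be checked
programmatically. The following is a minimal userspace sketch, not part
of the patch set: the helper names and the EX_ constants are
hypothetical, and the capability numbers are the ones shown in this
cover letter (19, 34, 38).

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Capability numbers as listed in this cover letter (illustrative names). */
#define EX_CAP_SYS_PTRACE 19
#define EX_CAP_SYSLOG     34
#define EX_CAP_PERFMON    38

/*
 * Hypothetical helper: extract the mask from one "CapEff:" line of
 * /proc/<pid>/status. Returns 0 on success, -1 if the line is not a
 * CapEff line or the hex value cannot be parsed.
 */
static int parse_capeff_line(const char *line, uint64_t *mask)
{
	unsigned long long value;

	if (strncmp(line, "CapEff:", 7) != 0)
		return -1;
	/* %llx skips the whitespace between the label and the value */
	if (sscanf(line + 7, "%llx", &value) != 1)
		return -1;
	*mask = (uint64_t)value;
	return 0;
}

/* Return 1 if capability number 'cap' is set in 'mask', else 0. */
static int mask_has_cap(uint64_t mask, int cap)
{
	return (int)((mask >> cap) & 1);
}
```

Feeding the CapEff line from the listing above into parse_capeff_line()
and testing bits 38, 34 and 19 with mask_has_cap() confirms the three
capabilities assigned by setcap.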

Using cap_perfmon avoids granting an excess of unused credentials:

- with cap_sys_admin:
CapEff: 0000007fffffffff => 01111111 11111111 11111111 11111111 11111111

- with cap_perfmon:
CapEff: 0000004400080000 => 01000100 00000000 00001000 00000000 00000000
bits set: 38 (perfmon), 34 (syslog), 19 (sys_ptrace)
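
To put a number on the difference between the two masks, one can count
the set bits. This is an illustrative sketch only; count_caps() is a
hypothetical helper, not something the patch set provides.

```c
#include <stdint.h>

/* Count how many capability bits are set in an effective-set mask. */
static int count_caps(uint64_t mask)
{
	int n = 0;

	while (mask) {
		mask &= mask - 1;	/* clear the lowest set bit */
		n++;
	}
	return n;
}
```

For the two CapEff values listed above, count_caps(0x7fffffffffULL)
yields 39 and count_caps(0x4400080000ULL) yields 3.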

---
[1] https://www.kernel.org/doc/html/latest/admin-guide/perf-security.html
[2] http://man7.org/linux/man-pages/man7/capabilities.7.html
[3] https://www.kernel.org/doc/html/latest/process/embargoed-hardware-issues.html
[4] https://www.kernel.org/doc/html/latest/admin-guide/security-bugs.html
[5] https://www.kernel.org/doc/html/latest/process/management-style.html#decisions
[6] http://man7.org/linux/man-pages/man8/setcap.8.html
[7] https://git.kernel.org/pub/scm/libs/libcap/libcap.git
[8] https://sites.google.com/site/fullycapable/, posix_1003.1e-990310.pdf
[9] http://man7.org/linux/man-pages/man2/perf_event_open.2.html


2020-04-02 08:46:09

by Alexey Budankov

Subject: [PATCH v8 01/12] capabilities: introduce CAP_PERFMON to kernel and user space


Introduce CAP_PERFMON capability designed to secure system performance
monitoring and observability operations so that CAP_PERFMON would assist
CAP_SYS_ADMIN capability in its governing role for performance monitoring
and observability subsystems.

CAP_PERFMON hardens system security and integrity during performance
monitoring and observability operations by decreasing the attack surface that
is available to a CAP_SYS_ADMIN privileged process [1]. Providing access
to system performance monitoring and observability operations under the
CAP_PERFMON capability alone, without the rest of the CAP_SYS_ADMIN credentials,
removes the chance to misuse those credentials and makes the operation more
secure. Thus, CAP_PERFMON implements the principle of least privilege for
performance monitoring and observability operations (POSIX IEEE 1003.1e:
2.2.2.39 principle of least privilege: A security design principle that states
that a process or program be granted only those privileges (e.g., capabilities)
necessary to accomplish its legitimate function, and only for the time that
such privileges are actually required).

CAP_PERFMON meets the demand to secure system performance monitoring and
observability operations for adoption in security sensitive, restricted,
multiuser production environments (e.g. HPC clusters, cloud and virtual compute
environments), where root or CAP_SYS_ADMIN credentials are not available to
mass users of a system, and securely unblocks applicability and scalability
of system performance monitoring and observability operations beyond root
and CAP_SYS_ADMIN use cases.

CAP_PERFMON takes over the CAP_SYS_ADMIN credentials related to system
performance monitoring and observability operations and thereby reduces the
overloading of CAP_SYS_ADMIN, following the recommendation in the capabilities
man page [1] for CAP_SYS_ADMIN: "Note: this capability is overloaded; see Notes
to kernel developers, below." For backward compatibility reasons, access to
system performance monitoring and observability subsystems of the kernel remains
open for CAP_SYS_ADMIN privileged processes, but CAP_SYS_ADMIN capability
usage for secure system performance monitoring and observability operations
is discouraged in favor of the CAP_PERFMON capability.
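
The backward compatible rule described above (either capability grants
access) can be modelled in userspace as follows. This is a sketch for
illustration only: ex_capable() is a hypothetical stand-in for the
kernel's capable() that tests a bit in a plain mask, and the EX_
constants are illustrative names for capability numbers 21 and 38.

```c
#include <stdbool.h>
#include <stdint.h>

#define EX_CAP_SYS_ADMIN 21
#define EX_CAP_PERFMON   38

/* Userspace stand-in for capable(): tests one bit of an effective mask. */
static bool ex_capable(uint64_t effective, int cap)
{
	return (effective >> cap) & 1;
}

/* Same shape as perfmon_capable(): CAP_PERFMON first, CAP_SYS_ADMIN second. */
static bool ex_perfmon_capable(uint64_t effective)
{
	return ex_capable(effective, EX_CAP_PERFMON) ||
	       ex_capable(effective, EX_CAP_SYS_ADMIN);
}
```

A mask with only bit 38 set and a mask with only bit 21 set both pass
the check; a mask with neither bit does not.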

Although the software running under CAP_PERFMON cannot ensure avoidance
of related hardware issues, the software can still mitigate these issues
following the official hardware issues mitigation procedure [2]. The bugs
in the software itself can be fixed following the standard kernel development
process [3] to maintain and harden security of system performance monitoring
and observability operations.

[1] http://man7.org/linux/man-pages/man7/capabilities.7.html
[2] https://www.kernel.org/doc/html/latest/process/embargoed-hardware-issues.html
[3] https://www.kernel.org/doc/html/latest/admin-guide/security-bugs.html

Signed-off-by: Alexey Budankov <[email protected]>
Acked-by: Song Liu <[email protected]>
Acked-by: Stephen Smalley <[email protected]>
Acked-by: James Morris <[email protected]>
Acked-by: Serge E. Hallyn <[email protected]>
---
include/linux/capability.h | 4 ++++
include/uapi/linux/capability.h | 8 +++++++-
security/selinux/include/classmap.h | 4 ++--
3 files changed, 13 insertions(+), 3 deletions(-)

diff --git a/include/linux/capability.h b/include/linux/capability.h
index ecce0f43c73a..027d7e4a853b 100644
--- a/include/linux/capability.h
+++ b/include/linux/capability.h
@@ -251,6 +251,10 @@ extern bool privileged_wrt_inode_uidgid(struct user_namespace *ns, const struct
extern bool capable_wrt_inode_uidgid(const struct inode *inode, int cap);
extern bool file_ns_capable(const struct file *file, struct user_namespace *ns, int cap);
extern bool ptracer_capable(struct task_struct *tsk, struct user_namespace *ns);
+static inline bool perfmon_capable(void)
+{
+ return capable(CAP_PERFMON) || capable(CAP_SYS_ADMIN);
+}

/* audit system wants to get cap info from files as well */
extern int get_vfs_caps_from_disk(const struct dentry *dentry, struct cpu_vfs_cap_data *cpu_caps);
diff --git a/include/uapi/linux/capability.h b/include/uapi/linux/capability.h
index 272dc69fa080..e58c9636741b 100644
--- a/include/uapi/linux/capability.h
+++ b/include/uapi/linux/capability.h
@@ -367,8 +367,14 @@ struct vfs_ns_cap_data {

#define CAP_AUDIT_READ 37

+/*
+ * Allow system performance and observability privileged operations
+ * using perf_events, i915_perf and other kernel subsystems
+ */
+
+#define CAP_PERFMON 38

-#define CAP_LAST_CAP CAP_AUDIT_READ
+#define CAP_LAST_CAP CAP_PERFMON

#define cap_valid(x) ((x) >= 0 && (x) <= CAP_LAST_CAP)

diff --git a/security/selinux/include/classmap.h b/security/selinux/include/classmap.h
index 986f3ac14282..d233ab3f1533 100644
--- a/security/selinux/include/classmap.h
+++ b/security/selinux/include/classmap.h
@@ -27,9 +27,9 @@
"audit_control", "setfcap"

#define COMMON_CAP2_PERMS "mac_override", "mac_admin", "syslog", \
- "wake_alarm", "block_suspend", "audit_read"
+ "wake_alarm", "block_suspend", "audit_read", "perfmon"

-#if CAP_LAST_CAP > CAP_AUDIT_READ
+#if CAP_LAST_CAP > CAP_PERFMON
#error New capability defined, please update COMMON_CAP2_PERMS.
#endif

--
2.24.1

2020-04-02 08:47:03

by Alexey Budankov

Subject: [PATCH v8 02/12] perf/core: open access to the core for CAP_PERFMON privileged process


Open access to monitoring of kernel code, CPUs, tracepoints and namespaces
data for a CAP_PERFMON privileged process. Providing the access under the
CAP_PERFMON capability alone, without the rest of the CAP_SYS_ADMIN credentials,
removes the chance to misuse those credentials and makes the operation more
secure.

CAP_PERFMON implements the principle of least privilege for performance
monitoring and observability operations (POSIX IEEE 1003.1e 2.2.2.39 principle
of least privilege: A security design principle that states that a process or
program be granted only those privileges (e.g., capabilities) necessary to
accomplish its legitimate function, and only for the time that such privileges
are actually required).

For backward compatibility reasons, access to the perf_events subsystem remains
open for CAP_SYS_ADMIN privileged processes, but CAP_SYS_ADMIN usage for secure
perf_events monitoring is discouraged in favor of the CAP_PERFMON capability.
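
As a reading aid, the three paranoid-level checks touched by this patch
can be modelled in userspace as below. This is a simplified sketch, not
kernel code: the ex_ function names are hypothetical, the capability
test is passed in as a boolean, and the security_perf_event_open() LSM
hook that the real helpers call afterwards is not modelled.

```c
#include <errno.h>
#include <stdbool.h>

/* Kernel profiling: refused at perf_event_paranoid > 1 without the capability. */
static int ex_perf_allow_kernel(int paranoid, bool perfmon_capable)
{
	if (paranoid > 1 && !perfmon_capable)
		return -EACCES;
	return 0;
}

/* CPU-wide events: refused at perf_event_paranoid > 0 without the capability. */
static int ex_perf_allow_cpu(int paranoid, bool perfmon_capable)
{
	if (paranoid > 0 && !perfmon_capable)
		return -EACCES;
	return 0;
}

/* Tracepoints: refused at perf_event_paranoid > -1 without the capability. */
static int ex_perf_allow_tracepoint(int paranoid, bool perfmon_capable)
{
	if (paranoid > -1 && !perfmon_capable)
		return -EPERM;
	return 0;
}
```

With perf_event_paranoid set to 2, as in the validation steps of the
cover letter, all three models refuse an unprivileged caller and accept
a caller for which the capability test succeeds.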

Signed-off-by: Alexey Budankov <[email protected]>
Reviewed-by: James Morris <[email protected]>
---
include/linux/perf_event.h | 6 +++---
kernel/events/core.c | 2 +-
2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 8768a39b5258..9adf62ebb202 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -1304,7 +1304,7 @@ static inline int perf_is_paranoid(void)

static inline int perf_allow_kernel(struct perf_event_attr *attr)
{
- if (sysctl_perf_event_paranoid > 1 && !capable(CAP_SYS_ADMIN))
+ if (sysctl_perf_event_paranoid > 1 && !perfmon_capable())
return -EACCES;

return security_perf_event_open(attr, PERF_SECURITY_KERNEL);
@@ -1312,7 +1312,7 @@ static inline int perf_allow_kernel(struct perf_event_attr *attr)

static inline int perf_allow_cpu(struct perf_event_attr *attr)
{
- if (sysctl_perf_event_paranoid > 0 && !capable(CAP_SYS_ADMIN))
+ if (sysctl_perf_event_paranoid > 0 && !perfmon_capable())
return -EACCES;

return security_perf_event_open(attr, PERF_SECURITY_CPU);
@@ -1320,7 +1320,7 @@ static inline int perf_allow_cpu(struct perf_event_attr *attr)

static inline int perf_allow_tracepoint(struct perf_event_attr *attr)
{
- if (sysctl_perf_event_paranoid > -1 && !capable(CAP_SYS_ADMIN))
+ if (sysctl_perf_event_paranoid > -1 && !perfmon_capable())
return -EPERM;

return security_perf_event_open(attr, PERF_SECURITY_TRACEPOINT);
diff --git a/kernel/events/core.c b/kernel/events/core.c
index d22e4ba59dfa..2af0f4557b63 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -11360,7 +11360,7 @@ SYSCALL_DEFINE5(perf_event_open,
}

if (attr.namespaces) {
- if (!capable(CAP_SYS_ADMIN))
+ if (!perfmon_capable())
return -EACCES;
}

--
2.24.1

2020-04-02 08:47:54

by Alexey Budankov

Subject: [PATCH v8 03/12] perf/core: open access to probes for CAP_PERFMON privileged process


Open access to monitoring via kprobes and uprobes and eBPF tracing for
a CAP_PERFMON privileged process. Providing the access under the CAP_PERFMON
capability alone, without the rest of the CAP_SYS_ADMIN credentials, removes
the chance to misuse those credentials and makes the operation more secure.

perf kprobes and uprobes are used by ftrace and eBPF. perf probe uses
ftrace to define new kprobe events, and those events are treated as
tracepoint events. eBPF defines new probes via the perf_event_open interface
and then the probes are used in eBPF tracing.

CAP_PERFMON implements the principle of least privilege for performance
monitoring and observability operations (POSIX IEEE 1003.1e 2.2.2.39 principle
of least privilege: A security design principle that states that a process or
program be granted only those privileges (e.g., capabilities) necessary to
accomplish its legitimate function, and only for the time that such privileges
are actually required).

For backward compatibility reasons, access to the perf_events subsystem remains
open for CAP_SYS_ADMIN privileged processes, but CAP_SYS_ADMIN usage for
secure perf_events monitoring is discouraged in favor of the CAP_PERFMON
capability.
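
For context on the perf_event_open path mentioned above: the kprobe and
uprobe PMUs are dynamic, so a userspace caller first has to look up the
PMU type id, which perf_event_open(2) documents as readable from
/sys/bus/event_source/devices/<pmu>/type. The sketch below only covers
that lookup; read_pmu_type() is a hypothetical helper and the rest of
the perf_event_attr setup is left out.

```c
#include <stdio.h>

/*
 * Read a dynamic PMU type id from a sysfs "type" file, for example
 * /sys/bus/event_source/devices/kprobe/type. Returns the id, or -1
 * if the file is missing or does not contain a number.
 */
static int read_pmu_type(const char *path)
{
	FILE *f = fopen(path, "r");
	int type = -1;

	if (!f)
		return -1;
	if (fscanf(f, "%d", &type) != 1)
		type = -1;
	fclose(f);
	return type;
}
```

The returned value is what a caller would place in perf_event_attr.type;
the kernel side then compares it against perf_kprobe.type or
perf_uprobe.type before reaching the capability check changed below.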

Signed-off-by: Alexey Budankov <[email protected]>
Acked-by: James Morris <[email protected]>
Reviewed-by: James Morris <[email protected]>
---
kernel/events/core.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 2af0f4557b63..364c233c3f25 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -9259,7 +9259,7 @@ static int perf_kprobe_event_init(struct perf_event *event)
if (event->attr.type != perf_kprobe.type)
return -ENOENT;

- if (!capable(CAP_SYS_ADMIN))
+ if (!perfmon_capable())
return -EACCES;

/*
@@ -9319,7 +9319,7 @@ static int perf_uprobe_event_init(struct perf_event *event)
if (event->attr.type != perf_uprobe.type)
return -ENOENT;

- if (!capable(CAP_SYS_ADMIN))
+ if (!perfmon_capable())
return -EACCES;

/*
--
2.24.1


2020-04-02 08:49:17

by Alexey Budankov

Subject: [PATCH v8 04/12] perf tool: extend Perf tool with CAP_PERFMON capability support


Extend error messages to mention the CAP_PERFMON capability as an option
to substitute for the CAP_SYS_ADMIN capability for secure system performance
monitoring and observability operations. Make perf_event_paranoid_check()
and __cmd_ftrace() aware of the CAP_PERFMON capability.

CAP_PERFMON implements the principle of least privilege for performance
monitoring and observability operations (POSIX IEEE 1003.1e 2.2.2.39
principle of least privilege: A security design principle that states
that a process or program be granted only those privileges (e.g.,
capabilities) necessary to accomplish its legitimate function, and only
for the time that such privileges are actually required).

For backward compatibility reasons, access to the perf_events subsystem remains
open for CAP_SYS_ADMIN privileged processes, but CAP_SYS_ADMIN usage for
secure perf_events monitoring is discouraged in favor of the CAP_PERFMON
capability.
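
The decision made by perf_event_paranoid_check() after this patch can be
sketched outside the Perf tool as below. This is an illustration under
stated assumptions, not the tool's implementation: the ex_ names are
hypothetical, the effective set is read with the raw capget(2) syscall
instead of perf_cap__capable(), and the paranoid level is passed in by
the caller instead of being read from /proc.

```c
#include <linux/capability.h>
#include <stdbool.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef CAP_PERFMON
#define CAP_PERFMON 38	/* same fallback value the patch adds to util/cap.h */
#endif

/* Fetch the calling thread's effective capability set via capget(2). */
static int ex_read_effective(uint64_t *mask)
{
	struct __user_cap_header_struct hdr = {
		.version = _LINUX_CAPABILITY_VERSION_3,
		.pid = 0,	/* 0 selects the calling thread */
	};
	struct __user_cap_data_struct data[_LINUX_CAPABILITY_U32S_3] = { { 0 } };

	if (syscall(SYS_capget, &hdr, data) != 0)
		return -1;
	*mask = ((uint64_t)data[1].effective << 32) | data[0].effective;
	return 0;
}

/* Either capability, or a permissive enough paranoid level, is sufficient. */
static bool ex_paranoid_check(uint64_t effective, int paranoid, int max_level)
{
	return ((effective >> CAP_SYS_ADMIN) & 1) ||
	       ((effective >> CAP_PERFMON) & 1) ||
	       paranoid <= max_level;
}
```

A caller would obtain the mask with ex_read_effective() and pass it,
together with the current perf_event_paranoid value, to
ex_paranoid_check().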

Signed-off-by: Alexey Budankov <[email protected]>
Reviewed-by: James Morris <[email protected]>
---
tools/perf/builtin-ftrace.c | 5 +++--
tools/perf/design.txt | 3 ++-
tools/perf/util/cap.h | 4 ++++
tools/perf/util/evsel.c | 10 +++++-----
tools/perf/util/util.c | 1 +
5 files changed, 15 insertions(+), 8 deletions(-)

diff --git a/tools/perf/builtin-ftrace.c b/tools/perf/builtin-ftrace.c
index d5adc417a4ca..55eda54240fb 100644
--- a/tools/perf/builtin-ftrace.c
+++ b/tools/perf/builtin-ftrace.c
@@ -284,10 +284,11 @@ static int __cmd_ftrace(struct perf_ftrace *ftrace, int argc, const char **argv)
.events = POLLIN,
};

- if (!perf_cap__capable(CAP_SYS_ADMIN)) {
+ if (!(perf_cap__capable(CAP_PERFMON) ||
+ perf_cap__capable(CAP_SYS_ADMIN))) {
pr_err("ftrace only works for %s!\n",
#ifdef HAVE_LIBCAP_SUPPORT
- "users with the SYS_ADMIN capability"
+ "users with the CAP_PERFMON or CAP_SYS_ADMIN capability"
#else
"root"
#endif
diff --git a/tools/perf/design.txt b/tools/perf/design.txt
index 0453ba26cdbd..a42fab308ff6 100644
--- a/tools/perf/design.txt
+++ b/tools/perf/design.txt
@@ -258,7 +258,8 @@ gets schedule to. Per task counters can be created by any user, for
their own tasks.

A 'pid == -1' and 'cpu == x' counter is a per CPU counter that counts
-all events on CPU-x. Per CPU counters need CAP_SYS_ADMIN privilege.
+all events on CPU-x. Per CPU counters need CAP_PERFMON or CAP_SYS_ADMIN
+privilege.

The 'flags' parameter is currently unused and must be zero.

diff --git a/tools/perf/util/cap.h b/tools/perf/util/cap.h
index 051dc590ceee..ae52878c0b2e 100644
--- a/tools/perf/util/cap.h
+++ b/tools/perf/util/cap.h
@@ -29,4 +29,8 @@ static inline bool perf_cap__capable(int cap __maybe_unused)
#define CAP_SYSLOG 34
#endif

+#ifndef CAP_PERFMON
+#define CAP_PERFMON 38
+#endif
+
#endif /* __PERF_CAP_H */
diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
index 816d930d774e..2696922f06bc 100644
--- a/tools/perf/util/evsel.c
+++ b/tools/perf/util/evsel.c
@@ -2507,14 +2507,14 @@ int perf_evsel__open_strerror(struct evsel *evsel, struct target *target,
"You may not have permission to collect %sstats.\n\n"
"Consider tweaking /proc/sys/kernel/perf_event_paranoid,\n"
"which controls use of the performance events system by\n"
- "unprivileged users (without CAP_SYS_ADMIN).\n\n"
+ "unprivileged users (without CAP_PERFMON or CAP_SYS_ADMIN).\n\n"
"The current value is %d:\n\n"
" -1: Allow use of (almost) all events by all users\n"
" Ignore mlock limit after perf_event_mlock_kb without CAP_IPC_LOCK\n"
- ">= 0: Disallow ftrace function tracepoint by users without CAP_SYS_ADMIN\n"
- " Disallow raw tracepoint access by users without CAP_SYS_ADMIN\n"
- ">= 1: Disallow CPU event access by users without CAP_SYS_ADMIN\n"
- ">= 2: Disallow kernel profiling by users without CAP_SYS_ADMIN\n\n"
+ ">= 0: Disallow ftrace function tracepoint by users without CAP_PERFMON or CAP_SYS_ADMIN\n"
+ " Disallow raw tracepoint access by users without CAP_PERFMON or CAP_SYS_ADMIN\n"
+ ">= 1: Disallow CPU event access by users without CAP_PERFMON or CAP_SYS_ADMIN\n"
+ ">= 2: Disallow kernel profiling by users without CAP_PERFMON or CAP_SYS_ADMIN\n\n"
"To make this setting permanent, edit /etc/sysctl.conf too, e.g.:\n\n"
" kernel.perf_event_paranoid = -1\n" ,
target->system_wide ? "system-wide " : "",
diff --git a/tools/perf/util/util.c b/tools/perf/util/util.c
index d707c9624dd9..37a9492edb3e 100644
--- a/tools/perf/util/util.c
+++ b/tools/perf/util/util.c
@@ -290,6 +290,7 @@ int perf_event_paranoid(void)
bool perf_event_paranoid_check(int max_level)
{
return perf_cap__capable(CAP_SYS_ADMIN) ||
+ perf_cap__capable(CAP_PERFMON) ||
perf_event_paranoid() <= max_level;
}

--
2.24.1

2020-04-02 08:49:33

by Alexey Budankov

Subject: [PATCH v8 05/12] drm/i915/perf: open access for CAP_PERFMON privileged process


Open access to i915_perf monitoring for a CAP_PERFMON privileged process.
Providing the access under the CAP_PERFMON capability alone, without the
rest of the CAP_SYS_ADMIN credentials, removes the chance to misuse those
credentials and makes the operation more secure.

CAP_PERFMON implements the principle of least privilege for performance
monitoring and observability operations (POSIX IEEE 1003.1e 2.2.2.39
principle of least privilege: A security design principle that states that
a process or program be granted only those privileges (e.g., capabilities)
necessary to accomplish its legitimate function, and only for the time
that such privileges are actually required).

For backward compatibility reasons, access to the i915_perf subsystem remains
open for CAP_SYS_ADMIN privileged processes, but CAP_SYS_ADMIN usage for
secure i915_perf monitoring is discouraged in favor of the CAP_PERFMON
capability.

Signed-off-by: Alexey Budankov <[email protected]>
Acked-by: Lionel Landwerlin <[email protected]>
Reviewed-by: James Morris <[email protected]>
---
drivers/gpu/drm/i915/i915_perf.c | 13 ++++++-------
1 file changed, 6 insertions(+), 7 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_perf.c b/drivers/gpu/drm/i915/i915_perf.c
index 3b6b913bd27a..f59265cebe1e 100644
--- a/drivers/gpu/drm/i915/i915_perf.c
+++ b/drivers/gpu/drm/i915/i915_perf.c
@@ -3402,10 +3402,10 @@ i915_perf_open_ioctl_locked(struct i915_perf *perf,
/* Similar to perf's kernel.perf_paranoid_cpu sysctl option
* we check a dev.i915.perf_stream_paranoid sysctl option
* to determine if it's ok to access system wide OA counters
- * without CAP_SYS_ADMIN privileges.
+ * without CAP_PERFMON or CAP_SYS_ADMIN privileges.
*/
if (privileged_op &&
- i915_perf_stream_paranoid && !capable(CAP_SYS_ADMIN)) {
+ i915_perf_stream_paranoid && !perfmon_capable()) {
DRM_DEBUG("Insufficient privileges to open i915 perf stream\n");
ret = -EACCES;
goto err_ctx;
@@ -3598,9 +3598,8 @@ static int read_properties_unlocked(struct i915_perf *perf,
} else
oa_freq_hz = 0;

- if (oa_freq_hz > i915_oa_max_sample_rate &&
- !capable(CAP_SYS_ADMIN)) {
- DRM_DEBUG("OA exponent would exceed the max sampling frequency (sysctl dev.i915.oa_max_sample_rate) %uHz without root privileges\n",
+ if (oa_freq_hz > i915_oa_max_sample_rate && !perfmon_capable()) {
+ DRM_DEBUG("OA exponent would exceed the max sampling frequency (sysctl dev.i915.oa_max_sample_rate) %uHz without CAP_PERFMON or CAP_SYS_ADMIN privileges\n",
i915_oa_max_sample_rate);
return -EACCES;
}
@@ -4021,7 +4020,7 @@ int i915_perf_add_config_ioctl(struct drm_device *dev, void *data,
return -EINVAL;
}

- if (i915_perf_stream_paranoid && !capable(CAP_SYS_ADMIN)) {
+ if (i915_perf_stream_paranoid && !perfmon_capable()) {
DRM_DEBUG("Insufficient privileges to add i915 OA config\n");
return -EACCES;
}
@@ -4168,7 +4167,7 @@ int i915_perf_remove_config_ioctl(struct drm_device *dev, void *data,
return -ENOTSUPP;
}

- if (i915_perf_stream_paranoid && !capable(CAP_SYS_ADMIN)) {
+ if (i915_perf_stream_paranoid && !perfmon_capable()) {
DRM_DEBUG("Insufficient privileges to remove i915 OA config\n");
return -EACCES;
}
--
2.24.1


2020-04-02 08:50:09

by Alexey Budankov

Subject: [PATCH v8 06/12] trace/bpf_trace: open access for CAP_PERFMON privileged process


Open access to bpf_trace monitoring for a CAP_PERFMON privileged process.
Providing the access under the CAP_PERFMON capability alone, without the
rest of the CAP_SYS_ADMIN credentials, removes the chance to misuse those
credentials and makes the operation more secure.

CAP_PERFMON implements the principle of least privilege for performance
monitoring and observability operations (POSIX IEEE 1003.1e 2.2.2.39
principle of least privilege: A security design principle that states
that a process or program be granted only those privileges (e.g.,
capabilities) necessary to accomplish its legitimate function, and only
for the time that such privileges are actually required).

For backward compatibility reasons, access to bpf_trace monitoring
remains open for CAP_SYS_ADMIN privileged processes, but CAP_SYS_ADMIN
usage for secure bpf_trace monitoring is discouraged in favor of the
CAP_PERFMON capability.

Signed-off-by: Alexey Budankov <[email protected]>
Acked-by: Song Liu <[email protected]>
Acked-by: James Morris <[email protected]>
Reviewed-by: James Morris <[email protected]>
---
kernel/trace/bpf_trace.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index 19e793aa441a..70e8249eebe5 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -1416,7 +1416,7 @@ int perf_event_query_prog_array(struct perf_event *event, void __user *info)
u32 *ids, prog_cnt, ids_len;
int ret;

- if (!capable(CAP_SYS_ADMIN))
+ if (!perfmon_capable())
return -EPERM;
if (event->attr.type != PERF_TYPE_TRACEPOINT)
return -EINVAL;
--
2.24.1


2020-04-02 08:51:05

by Alexey Budankov

Subject: [PATCH v8 07/12] powerpc/perf: open access for CAP_PERFMON privileged process


Open access to monitoring for a CAP_PERFMON privileged process.
Providing the access under the CAP_PERFMON capability alone, without
the rest of the CAP_SYS_ADMIN credentials, removes the chance to misuse
those credentials and makes the operation more secure.

CAP_PERFMON implements the principle of least privilege for performance
monitoring and observability operations (POSIX IEEE 1003.1e 2.2.2.39
principle of least privilege: A security design principle that states
that a process or program be granted only those privileges (e.g.,
capabilities) necessary to accomplish its legitimate function, and
only for the time that such privileges are actually required).

For backward compatibility reasons, access to the monitoring remains
open for CAP_SYS_ADMIN privileged processes, but CAP_SYS_ADMIN usage
for secure monitoring is discouraged in favor of the CAP_PERFMON
capability.

Signed-off-by: Alexey Budankov <[email protected]>
Acked-by: Anju T Sudhakar <[email protected]>
Acked-by: James Morris <[email protected]>
Reviewed-by: James Morris <[email protected]>
---
arch/powerpc/perf/imc-pmu.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/perf/imc-pmu.c b/arch/powerpc/perf/imc-pmu.c
index cb50a9e1fd2d..e837717492e4 100644
--- a/arch/powerpc/perf/imc-pmu.c
+++ b/arch/powerpc/perf/imc-pmu.c
@@ -898,7 +898,7 @@ static int thread_imc_event_init(struct perf_event *event)
if (event->attr.type != event->pmu->type)
return -ENOENT;

- if (!capable(CAP_SYS_ADMIN))
+ if (!perfmon_capable())
return -EACCES;

/* Sampling not supported */
@@ -1307,7 +1307,7 @@ static int trace_imc_event_init(struct perf_event *event)
if (event->attr.type != event->pmu->type)
return -ENOENT;

- if (!capable(CAP_SYS_ADMIN))
+ if (!perfmon_capable())
return -EACCES;

/* Return if this is a couting event */
--
2.24.1

2020-04-02 08:52:10

by Alexey Budankov

Subject: [PATCH v8 08/12] parisc/perf: open access for CAP_PERFMON privileged process


Open access to monitoring for a CAP_PERFMON privileged process.
Providing the access under the CAP_PERFMON capability alone, without
the rest of the CAP_SYS_ADMIN credentials, removes the chance to misuse
those credentials and makes the operation more secure.

CAP_PERFMON implements the principle of least privilege for performance
monitoring and observability operations (POSIX IEEE 1003.1e 2.2.2.39
principle of least privilege: A security design principle that states
that a process or program be granted only those privileges (e.g.,
capabilities) necessary to accomplish its legitimate function, and only
for the time that such privileges are actually required).

For backward compatibility reasons, access to the monitoring remains
open for CAP_SYS_ADMIN privileged processes, but CAP_SYS_ADMIN usage
for secure monitoring is discouraged in favor of the CAP_PERFMON
capability.

Signed-off-by: Alexey Budankov <[email protected]>
Acked-by: Helge Deller <[email protected]>
Acked-by: James Morris <[email protected]>
Reviewed-by: James Morris <[email protected]>
---
arch/parisc/kernel/perf.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/parisc/kernel/perf.c b/arch/parisc/kernel/perf.c
index e1a8fee3ad49..d46b6709ec56 100644
--- a/arch/parisc/kernel/perf.c
+++ b/arch/parisc/kernel/perf.c
@@ -300,7 +300,7 @@ static ssize_t perf_write(struct file *file, const char __user *buf,
else
return -EFAULT;

- if (!capable(CAP_SYS_ADMIN))
+ if (!perfmon_capable())
return -EACCES;

if (count != sizeof(uint32_t))
--
2.24.1

2020-04-02 08:53:06

by Alexey Budankov

Subject: [PATCH v8 09/12] drivers/perf: open access for CAP_PERFMON privileged process


Open access to monitoring for CAP_PERFMON privileged process.
Providing the access under CAP_PERFMON capability singly, without
the rest of CAP_SYS_ADMIN credentials, excludes chances to misuse
the credentials and makes operation more secure.

CAP_PERFMON implements the principle of least privilege for performance
monitoring and observability operations (POSIX IEEE 1003.1e 2.2.2.39
principle of least privilege: A security design principle that states
that a process or program be granted only those privileges (e.g.,
capabilities) necessary to accomplish its legitimate function, and
only for the time that such privileges are actually required)

For backward compatibility reasons access to the monitoring remains
open for CAP_SYS_ADMIN privileged processes but CAP_SYS_ADMIN usage
for secure monitoring is discouraged with respect to CAP_PERFMON
capability.

Signed-off-by: Alexey Budankov <[email protected]>
Acked-by: Will Deacon <[email protected]>
Acked-by: James Morris <[email protected]>
Reviewed-by: James Morris <[email protected]>
---
drivers/perf/arm_spe_pmu.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/perf/arm_spe_pmu.c b/drivers/perf/arm_spe_pmu.c
index 4e4984a55cd1..5dff81bc3324 100644
--- a/drivers/perf/arm_spe_pmu.c
+++ b/drivers/perf/arm_spe_pmu.c
@@ -274,7 +274,7 @@ static u64 arm_spe_event_to_pmscr(struct perf_event *event)
if (!attr->exclude_kernel)
reg |= BIT(SYS_PMSCR_EL1_E1SPE_SHIFT);

- if (IS_ENABLED(CONFIG_PID_IN_CONTEXTIDR) && capable(CAP_SYS_ADMIN))
+ if (IS_ENABLED(CONFIG_PID_IN_CONTEXTIDR) && perfmon_capable())
reg |= BIT(SYS_PMSCR_EL1_CX_SHIFT);

return reg;
@@ -700,7 +700,7 @@ static int arm_spe_pmu_event_init(struct perf_event *event)
return -EOPNOTSUPP;

reg = arm_spe_event_to_pmscr(event);
- if (!capable(CAP_SYS_ADMIN) &&
+ if (!perfmon_capable() &&
(reg & (BIT(SYS_PMSCR_EL1_PA_SHIFT) |
BIT(SYS_PMSCR_EL1_CX_SHIFT) |
BIT(SYS_PMSCR_EL1_PCT_SHIFT))))
--
2.24.1

2020-04-02 08:53:53

by Alexey Budankov

Subject: [PATCH v8 10/12] drivers/oprofile: open access for CAP_PERFMON privileged process


Open access to monitoring for CAP_PERFMON privileged process.
Providing the access under CAP_PERFMON capability singly, without
the rest of CAP_SYS_ADMIN credentials, excludes chances to misuse
the credentials and makes operation more secure.

CAP_PERFMON implements the principle of least privilege for performance
monitoring and observability operations (POSIX IEEE 1003.1e 2.2.2.39
principle of least privilege: A security design principle that states
that a process or program be granted only those privileges (e.g.,
capabilities) necessary to accomplish its legitimate function, and only
for the time that such privileges are actually required)

For backward compatibility reasons access to the monitoring remains
open for CAP_SYS_ADMIN privileged processes but CAP_SYS_ADMIN usage
for secure monitoring is discouraged with respect to CAP_PERFMON
capability.

Signed-off-by: Alexey Budankov <[email protected]>
Acked-by: James Morris <[email protected]>
---
drivers/oprofile/event_buffer.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/oprofile/event_buffer.c b/drivers/oprofile/event_buffer.c
index 12ea4a4ad607..6c9edc8bbc95 100644
--- a/drivers/oprofile/event_buffer.c
+++ b/drivers/oprofile/event_buffer.c
@@ -113,7 +113,7 @@ static int event_buffer_open(struct inode *inode, struct file *file)
{
int err = -EPERM;

- if (!capable(CAP_SYS_ADMIN))
+ if (!perfmon_capable())
return -EPERM;

if (test_and_set_bit_lock(0, &buffer_opened))
--
2.24.1

2020-04-02 08:54:59

by Alexey Budankov

Subject: [PATCH v8 11/12] doc/admin-guide: update perf-security.rst with CAP_PERFMON information


Update perf-security.rst documentation file with the information
related to usage of CAP_PERFMON capability to secure performance
monitoring and observability operations in system.

Signed-off-by: Alexey Budankov <[email protected]>
---
Documentation/admin-guide/perf-security.rst | 65 +++++++++++++--------
1 file changed, 40 insertions(+), 25 deletions(-)

diff --git a/Documentation/admin-guide/perf-security.rst b/Documentation/admin-guide/perf-security.rst
index 72effa7c23b9..81202d46a1ae 100644
--- a/Documentation/admin-guide/perf-security.rst
+++ b/Documentation/admin-guide/perf-security.rst
@@ -1,6 +1,6 @@
.. _perf_security:

-Perf Events and tool security
+Perf events and tool security
=============================

Overview
@@ -42,11 +42,11 @@ categories:
Data that belong to the fourth category can potentially contain
sensitive process data. If PMUs in some monitoring modes capture values
of execution context registers or data from process memory then access
-to such monitoring capabilities requires to be ordered and secured
-properly. So, perf_events/Perf performance monitoring is the subject for
-security access control management [5]_ .
+to such monitoring modes requires to be ordered and secured properly.
+So, perf_events performance monitoring and observability operations is
+the subject for security access control management [5]_ .

-perf_events/Perf access control
+perf_events access control
-------------------------------

To perform security checks, the Linux implementation splits processes
@@ -66,11 +66,25 @@ into distinct units, known as capabilities [6]_ , which can be
independently enabled and disabled on per-thread basis for processes and
files of unprivileged users.

-Unprivileged processes with enabled CAP_SYS_ADMIN capability are treated
+Unprivileged processes with enabled CAP_PERFMON capability are treated
as privileged processes with respect to perf_events performance
-monitoring and bypass *scope* permissions checks in the kernel.
-
-Unprivileged processes using perf_events system call API is also subject
+monitoring and observability operations, thus, bypass *scope* permissions
+checks in the kernel. CAP_PERFMON implements the principle of least
+privilege [13]_ (POSIX 1003.1e: 2.2.2.39) for performance monitoring and
+observability operations in the kernel and provides a secure approach to
+performance monitoring and observability in the system.
+
+For backward compatibility reasons access to perf_events monitoring and
+observability operations is also open for CAP_SYS_ADMIN privileged
+processes but CAP_SYS_ADMIN usage for secure monitoring and observability
+use cases is discouraged with respect to CAP_PERFMON capability.
+If system audit records [14]_ for a process using perf_events system call
+API contain denial records of acquiring both CAP_PERFMON and CAP_SYS_ADMIN
+capabilities then providing the process with CAP_PERFMON capability singly
+is recommended as the preferred secure approach to resolve double access
+denial logging related to usage of performance monitoring and observability.
+
+Unprivileged processes using perf_events system call are also subject
for PTRACE_MODE_READ_REALCREDS ptrace access mode check [7]_ , whose
outcome determines whether monitoring is permitted. So unprivileged
processes provided with CAP_SYS_PTRACE capability are effectively
@@ -82,14 +96,14 @@ performance analysis of monitored processes or a system. For example,
CAP_SYSLOG capability permits reading kernel space memory addresses from
/proc/kallsyms file.

-perf_events/Perf privileged users
+Privileged Perf users groups
---------------------------------

Mechanisms of capabilities, privileged capability-dumb files [6]_ and
-file system ACLs [10]_ can be used to create a dedicated group of
-perf_events/Perf privileged users who are permitted to execute
-performance monitoring without scope limits. The following steps can be
-taken to create such a group of privileged Perf users.
+file system ACLs [10]_ can be used to create dedicated groups of
+privileged Perf users who are permitted to execute performance monitoring
+and observability without scope limits. The following steps can be
+taken to create such groups of privileged Perf users.

1. Create perf_users group of privileged Perf users, assign perf_users
group to Perf tool executable and limit access to the executable for
@@ -108,30 +122,30 @@ taken to create such a group of privileged Perf users.
-rwxr-x--- 2 root perf_users 11M Oct 19 15:12 perf

2. Assign the required capabilities to the Perf tool executable file and
- enable members of perf_users group with performance monitoring
+ enable members of perf_users group with monitoring and observability
privileges [6]_ :

::

- # setcap "cap_sys_admin,cap_sys_ptrace,cap_syslog=ep" perf
- # setcap -v "cap_sys_admin,cap_sys_ptrace,cap_syslog=ep" perf
+ # setcap "cap_perfmon,cap_sys_ptrace,cap_syslog=ep" perf
+ # setcap -v "cap_perfmon,cap_sys_ptrace,cap_syslog=ep" perf
perf: OK
# getcap perf
- perf = cap_sys_ptrace,cap_sys_admin,cap_syslog+ep
+ perf = cap_sys_ptrace,cap_syslog,cap_perfmon+ep

As a result, members of perf_users group are capable of conducting
-performance monitoring by using functionality of the configured Perf
-tool executable that, when executes, passes perf_events subsystem scope
-checks.
+performance monitoring and observability by using functionality of the
+configured Perf tool executable that, when executes, passes perf_events
+subsystem scope checks.

This specific access control management is only available to superuser
or root running processes with CAP_SETPCAP, CAP_SETFCAP [6]_
capabilities.

-perf_events/Perf unprivileged users
+Unprivileged users
-----------------------------------

-perf_events/Perf *scope* and *access* control for unprivileged processes
+perf_events *scope* and *access* control for unprivileged processes
is governed by perf_event_paranoid [2]_ setting:

-1:
@@ -166,7 +180,7 @@ is governed by perf_event_paranoid [2]_ setting:
perf_event_mlock_kb locking limit is imposed but ignored for
unprivileged processes with CAP_IPC_LOCK capability.

-perf_events/Perf resource control
+Resource control
---------------------------------

Open file descriptors
@@ -227,4 +241,5 @@ Bibliography
.. [10] `<http://man7.org/linux/man-pages/man5/acl.5.html>`_
.. [11] `<http://man7.org/linux/man-pages/man2/getrlimit.2.html>`_
.. [12] `<http://man7.org/linux/man-pages/man5/limits.conf.5.html>`_
-
+.. [13] `<https://sites.google.com/site/fullycapable>`_
+.. [14] `<http://man7.org/linux/man-pages/man8/auditd.8.html>`_
--
2.24.1
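
After the setcap step in the documentation above, it can be useful to confirm from inside a process whether CAP_PERFMON actually ended up in its effective set. The following is a minimal sketch that uses the raw capget(2) system call rather than libcap. The helper names cap_in_effective() and current_has_cap() are hypothetical, and the fallback definition of CAP_PERFMON as 38 is taken from patch 04/12 for systems whose headers predate this series.

```c
#include <linux/capability.h>
#include <stdbool.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef CAP_PERFMON
#define CAP_PERFMON 38	/* value used by this patch series */
#endif

/* Test bit `cap` in the two 32-bit words filled in by capget(2). */
static bool cap_in_effective(const struct __user_cap_data_struct data[2],
			     int cap)
{
	return (data[cap / 32].effective >> (cap % 32)) & 1U;
}

/*
 * Query the calling process (pid 0 means "self").
 * Returns 1 if `cap` is effective, 0 if it is not, -1 if capget() failed.
 */
static int current_has_cap(int cap)
{
	struct __user_cap_header_struct hdr = {
		.version = _LINUX_CAPABILITY_VERSION_3,
		.pid = 0,
	};
	struct __user_cap_data_struct data[2] = { { 0 } };

	if (syscall(SYS_capget, &hdr, data) != 0)
		return -1;

	return cap_in_effective(data, cap) ? 1 : 0;
}
```

Calling current_has_cap(CAP_PERFMON) from a binary configured as in step 2 would be expected to return 1 on a kernel that knows the capability. On older kernels the bit is simply never set, so the result should be treated as a hint rather than a guarantee of access.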


2020-04-02 08:55:58

by Alexey Budankov

Subject: [PATCH v8 12/12] doc/admin-guide: update kernel.rst with CAP_PERFMON information


Update kernel.rst documentation file with the information
related to usage of CAP_PERFMON capability to secure performance
monitoring and observability operations in system.

Signed-off-by: Alexey Budankov <[email protected]>
---
Documentation/admin-guide/sysctl/kernel.rst | 16 +++++++++++-----
1 file changed, 11 insertions(+), 5 deletions(-)

diff --git a/Documentation/admin-guide/sysctl/kernel.rst b/Documentation/admin-guide/sysctl/kernel.rst
index def074807cee..b06ae9389809 100644
--- a/Documentation/admin-guide/sysctl/kernel.rst
+++ b/Documentation/admin-guide/sysctl/kernel.rst
@@ -720,20 +720,26 @@ perf_event_paranoid:
====================

Controls use of the performance events system by unprivileged
-users (without CAP_SYS_ADMIN). The default value is 2.
+users (without CAP_PERFMON). The default value is 2.
+
+For backward compatibility reasons access to system performance
+monitoring and observability remains open for CAP_SYS_ADMIN
+privileged processes but CAP_SYS_ADMIN usage for secure system
+performance monitoring and observability operations is discouraged
+with respect to CAP_PERFMON use cases.

=== ==================================================================
-1 Allow use of (almost) all events by all users

Ignore mlock limit after perf_event_mlock_kb without CAP_IPC_LOCK

->=0 Disallow ftrace function tracepoint by users without CAP_SYS_ADMIN
+>=0 Disallow ftrace function tracepoint by users without CAP_PERFMON

- Disallow raw tracepoint access by users without CAP_SYS_ADMIN
+ Disallow raw tracepoint access by users without CAP_PERFMON

->=1 Disallow CPU event access by users without CAP_SYS_ADMIN
+>=1 Disallow CPU event access by users without CAP_PERFMON

->=2 Disallow kernel profiling by users without CAP_SYS_ADMIN
+>=2 Disallow kernel profiling by users without CAP_PERFMON
=== ==================================================================


--
2.24.1
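
The perf_event_paranoid table in the hunk above is cumulative: each level adds a restriction on top of the lower ones. A small sketch of that reading follows. The enum, its member names and model_restrictions() are hypothetical and exist only for illustration; the `privileged` flag stands for a process holding CAP_PERFMON (or CAP_SYS_ADMIN, for backward compatibility).

```c
#include <stdbool.h>

/* One bit per row group of the perf_event_paranoid table. */
enum model_perf_restriction {
	MODEL_NO_TRACEPOINTS      = 1 << 0, /* >=0: ftrace function and raw tracepoints */
	MODEL_NO_CPU_EVENTS       = 1 << 1, /* >=1: CPU event access */
	MODEL_NO_KERNEL_PROFILING = 1 << 2, /* >=2: kernel profiling */
};

/* Return the set of operations denied at a given paranoid level. */
static unsigned int model_restrictions(int paranoid, bool privileged)
{
	unsigned int mask = 0;

	/* Privileged processes are not subject to the table at all. */
	if (privileged)
		return 0;

	if (paranoid >= 0)
		mask |= MODEL_NO_TRACEPOINTS;
	if (paranoid >= 1)
		mask |= MODEL_NO_CPU_EVENTS;
	if (paranoid >= 2)
		mask |= MODEL_NO_KERNEL_PROFILING;

	return mask;
}
```

With the default value of 2, an unprivileged process is denied all three groups, while -1 denies none of them.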

2020-04-03 11:26:55

by Jiri Olsa

Subject: Re: [PATCH v8 04/12] perf tool: extend Perf tool with CAP_PERFMON capability support

On Thu, Apr 02, 2020 at 11:47:35AM +0300, Alexey Budankov wrote:
>
> Extend error messages to mention CAP_PERFMON capability as an option
> to substitute CAP_SYS_ADMIN capability for secure system performance
> monitoring and observability operations. Make perf_event_paranoid_check()
> and __cmd_ftrace() to be aware of CAP_PERFMON capability.
>
> CAP_PERFMON implements the principle of least privilege for performance
> monitoring and observability operations (POSIX IEEE 1003.1e 2.2.2.39
> principle of least privilege: A security design principle that states
> that a process or program be granted only those privileges (e.g.,
> capabilities) necessary to accomplish its legitimate function, and only
> for the time that such privileges are actually required)
>
> For backward compatibility reasons access to perf_events subsystem remains
> open for CAP_SYS_ADMIN privileged processes but CAP_SYS_ADMIN usage for
> secure perf_events monitoring is discouraged with respect to CAP_PERFMON
> capability.
>
> Signed-off-by: Alexey Budankov <[email protected]>
> Reviewed-by: James Morris <[email protected]>

Acked-by: Jiri Olsa <[email protected]>

thanks,
jirka

> ---
> tools/perf/builtin-ftrace.c | 5 +++--
> tools/perf/design.txt | 3 ++-
> tools/perf/util/cap.h | 4 ++++
> tools/perf/util/evsel.c | 10 +++++-----
> tools/perf/util/util.c | 1 +
> 5 files changed, 15 insertions(+), 8 deletions(-)
>
> diff --git a/tools/perf/builtin-ftrace.c b/tools/perf/builtin-ftrace.c
> index d5adc417a4ca..55eda54240fb 100644
> --- a/tools/perf/builtin-ftrace.c
> +++ b/tools/perf/builtin-ftrace.c
> @@ -284,10 +284,11 @@ static int __cmd_ftrace(struct perf_ftrace *ftrace, int argc, const char **argv)
> .events = POLLIN,
> };
>
> - if (!perf_cap__capable(CAP_SYS_ADMIN)) {
> + if (!(perf_cap__capable(CAP_PERFMON) ||
> + perf_cap__capable(CAP_SYS_ADMIN))) {
> pr_err("ftrace only works for %s!\n",
> #ifdef HAVE_LIBCAP_SUPPORT
> - "users with the SYS_ADMIN capability"
> + "users with the CAP_PERFMON or CAP_SYS_ADMIN capability"
> #else
> "root"
> #endif
> diff --git a/tools/perf/design.txt b/tools/perf/design.txt
> index 0453ba26cdbd..a42fab308ff6 100644
> --- a/tools/perf/design.txt
> +++ b/tools/perf/design.txt
> @@ -258,7 +258,8 @@ gets schedule to. Per task counters can be created by any user, for
> their own tasks.
>
> A 'pid == -1' and 'cpu == x' counter is a per CPU counter that counts
> -all events on CPU-x. Per CPU counters need CAP_SYS_ADMIN privilege.
> +all events on CPU-x. Per CPU counters need CAP_PERFMON or CAP_SYS_ADMIN
> +privilege.
>
> The 'flags' parameter is currently unused and must be zero.
>
> diff --git a/tools/perf/util/cap.h b/tools/perf/util/cap.h
> index 051dc590ceee..ae52878c0b2e 100644
> --- a/tools/perf/util/cap.h
> +++ b/tools/perf/util/cap.h
> @@ -29,4 +29,8 @@ static inline bool perf_cap__capable(int cap __maybe_unused)
> #define CAP_SYSLOG 34
> #endif
>
> +#ifndef CAP_PERFMON
> +#define CAP_PERFMON 38
> +#endif
> +
> #endif /* __PERF_CAP_H */
> diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
> index 816d930d774e..2696922f06bc 100644
> --- a/tools/perf/util/evsel.c
> +++ b/tools/perf/util/evsel.c
> @@ -2507,14 +2507,14 @@ int perf_evsel__open_strerror(struct evsel *evsel, struct target *target,
> "You may not have permission to collect %sstats.\n\n"
> "Consider tweaking /proc/sys/kernel/perf_event_paranoid,\n"
> "which controls use of the performance events system by\n"
> - "unprivileged users (without CAP_SYS_ADMIN).\n\n"
> + "unprivileged users (without CAP_PERFMON or CAP_SYS_ADMIN).\n\n"
> "The current value is %d:\n\n"
> " -1: Allow use of (almost) all events by all users\n"
> " Ignore mlock limit after perf_event_mlock_kb without CAP_IPC_LOCK\n"
> - ">= 0: Disallow ftrace function tracepoint by users without CAP_SYS_ADMIN\n"
> - " Disallow raw tracepoint access by users without CAP_SYS_ADMIN\n"
> - ">= 1: Disallow CPU event access by users without CAP_SYS_ADMIN\n"
> - ">= 2: Disallow kernel profiling by users without CAP_SYS_ADMIN\n\n"
> + ">= 0: Disallow ftrace function tracepoint by users without CAP_PERFMON or CAP_SYS_ADMIN\n"
> + " Disallow raw tracepoint access by users without CAP_PERFMON or CAP_SYS_ADMIN\n"
> + ">= 1: Disallow CPU event access by users without CAP_PERFMON or CAP_SYS_ADMIN\n"
> + ">= 2: Disallow kernel profiling by users without CAP_PERFMON or CAP_SYS_ADMIN\n\n"
> "To make this setting permanent, edit /etc/sysctl.conf too, e.g.:\n\n"
> " kernel.perf_event_paranoid = -1\n" ,
> target->system_wide ? "system-wide " : "",
> diff --git a/tools/perf/util/util.c b/tools/perf/util/util.c
> index d707c9624dd9..37a9492edb3e 100644
> --- a/tools/perf/util/util.c
> +++ b/tools/perf/util/util.c
> @@ -290,6 +290,7 @@ int perf_event_paranoid(void)
> bool perf_event_paranoid_check(int max_level)
> {
> return perf_cap__capable(CAP_SYS_ADMIN) ||
> + perf_cap__capable(CAP_PERFMON) ||
> perf_event_paranoid() <= max_level;
> }
>
> --
> 2.24.1
>
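
The util.c hunk quoted above boils down to a three-way OR: either capability is sufficient, and otherwise the sysctl value decides. A minimal user-space sketch of that logic follows. The functions model_read_paranoid() and model_paranoid_check() are hypothetical names, the capability flags are passed in as plain booleans instead of being queried, and the fallback value returned when the sysctl file cannot be read is an assumption of this sketch, not behaviour taken from the perf tool.

```c
#include <stdbool.h>
#include <stdio.h>

/*
 * Read an integer sysctl such as /proc/sys/kernel/perf_event_paranoid.
 * Returns `fallback` if the file is missing or cannot be parsed.
 */
static int model_read_paranoid(const char *path, int fallback)
{
	FILE *f = fopen(path, "r");
	int value = fallback;

	if (!f)
		return fallback;
	if (fscanf(f, "%d", &value) != 1)
		value = fallback;
	fclose(f);

	return value;
}

/* Same shape as the patched perf_event_paranoid_check(). */
static bool model_paranoid_check(bool has_sys_admin, bool has_perfmon,
				 int paranoid, int max_level)
{
	return has_sys_admin || has_perfmon || paranoid <= max_level;
}
```

A caller would combine the two, for example model_paranoid_check(false, true, model_read_paranoid("/proc/sys/kernel/perf_event_paranoid", 2), -1), which succeeds because CAP_PERFMON alone is enough regardless of the sysctl value.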

2020-04-03 13:09:48

by Alexey Budankov

Subject: Re: [PATCH v8 04/12] perf tool: extend Perf tool with CAP_PERFMON capability support


On 03.04.2020 14:08, Jiri Olsa wrote:
> On Thu, Apr 02, 2020 at 11:47:35AM +0300, Alexey Budankov wrote:
>>
>> Extend error messages to mention CAP_PERFMON capability as an option
>> to substitute CAP_SYS_ADMIN capability for secure system performance
>> monitoring and observability operations. Make perf_event_paranoid_check()
>> and __cmd_ftrace() to be aware of CAP_PERFMON capability.
>>
>> CAP_PERFMON implements the principle of least privilege for performance
>> monitoring and observability operations (POSIX IEEE 1003.1e 2.2.2.39
>> principle of least privilege: A security design principle that states
>> that a process or program be granted only those privileges (e.g.,
>> capabilities) necessary to accomplish its legitimate function, and only
>> for the time that such privileges are actually required)
>>
>> For backward compatibility reasons access to perf_events subsystem remains
>> open for CAP_SYS_ADMIN privileged processes but CAP_SYS_ADMIN usage for
>> secure perf_events monitoring is discouraged with respect to CAP_PERFMON
>> capability.
>>
>> Signed-off-by: Alexey Budankov <[email protected]>
>> Reviewed-by: James Morris <[email protected]>
>
> Acked-by: Jiri Olsa <[email protected]>

Thanks! I appreciate your support.

~Alexey

>
> thanks,
> jirka
>
>> ---
>> tools/perf/builtin-ftrace.c | 5 +++--
>> tools/perf/design.txt | 3 ++-
>> tools/perf/util/cap.h | 4 ++++
>> tools/perf/util/evsel.c | 10 +++++-----
>> tools/perf/util/util.c | 1 +
>> 5 files changed, 15 insertions(+), 8 deletions(-)
>>
>> diff --git a/tools/perf/builtin-ftrace.c b/tools/perf/builtin-ftrace.c
>> index d5adc417a4ca..55eda54240fb 100644
>> --- a/tools/perf/builtin-ftrace.c
>> +++ b/tools/perf/builtin-ftrace.c
>> @@ -284,10 +284,11 @@ static int __cmd_ftrace(struct perf_ftrace *ftrace, int argc, const char **argv)
>> .events = POLLIN,
>> };
>>
>> - if (!perf_cap__capable(CAP_SYS_ADMIN)) {
>> + if (!(perf_cap__capable(CAP_PERFMON) ||
>> + perf_cap__capable(CAP_SYS_ADMIN))) {
>> pr_err("ftrace only works for %s!\n",
>> #ifdef HAVE_LIBCAP_SUPPORT
>> - "users with the SYS_ADMIN capability"
>> + "users with the CAP_PERFMON or CAP_SYS_ADMIN capability"
>> #else
>> "root"
>> #endif
>> diff --git a/tools/perf/design.txt b/tools/perf/design.txt
>> index 0453ba26cdbd..a42fab308ff6 100644
>> --- a/tools/perf/design.txt
>> +++ b/tools/perf/design.txt
>> @@ -258,7 +258,8 @@ gets schedule to. Per task counters can be created by any user, for
>> their own tasks.
>>
>> A 'pid == -1' and 'cpu == x' counter is a per CPU counter that counts
>> -all events on CPU-x. Per CPU counters need CAP_SYS_ADMIN privilege.
>> +all events on CPU-x. Per CPU counters need CAP_PERFMON or CAP_SYS_ADMIN
>> +privilege.
>>
>> The 'flags' parameter is currently unused and must be zero.
>>
>> diff --git a/tools/perf/util/cap.h b/tools/perf/util/cap.h
>> index 051dc590ceee..ae52878c0b2e 100644
>> --- a/tools/perf/util/cap.h
>> +++ b/tools/perf/util/cap.h
>> @@ -29,4 +29,8 @@ static inline bool perf_cap__capable(int cap __maybe_unused)
>> #define CAP_SYSLOG 34
>> #endif
>>
>> +#ifndef CAP_PERFMON
>> +#define CAP_PERFMON 38
>> +#endif
>> +
>> #endif /* __PERF_CAP_H */
>> diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
>> index 816d930d774e..2696922f06bc 100644
>> --- a/tools/perf/util/evsel.c
>> +++ b/tools/perf/util/evsel.c
>> @@ -2507,14 +2507,14 @@ int perf_evsel__open_strerror(struct evsel *evsel, struct target *target,
>> "You may not have permission to collect %sstats.\n\n"
>> "Consider tweaking /proc/sys/kernel/perf_event_paranoid,\n"
>> "which controls use of the performance events system by\n"
>> - "unprivileged users (without CAP_SYS_ADMIN).\n\n"
>> + "unprivileged users (without CAP_PERFMON or CAP_SYS_ADMIN).\n\n"
>> "The current value is %d:\n\n"
>> " -1: Allow use of (almost) all events by all users\n"
>> " Ignore mlock limit after perf_event_mlock_kb without CAP_IPC_LOCK\n"
>> - ">= 0: Disallow ftrace function tracepoint by users without CAP_SYS_ADMIN\n"
>> - " Disallow raw tracepoint access by users without CAP_SYS_ADMIN\n"
>> - ">= 1: Disallow CPU event access by users without CAP_SYS_ADMIN\n"
>> - ">= 2: Disallow kernel profiling by users without CAP_SYS_ADMIN\n\n"
>> + ">= 0: Disallow ftrace function tracepoint by users without CAP_PERFMON or CAP_SYS_ADMIN\n"
>> + " Disallow raw tracepoint access by users without CAP_PERFMON or CAP_SYS_ADMIN\n"
>> + ">= 1: Disallow CPU event access by users without CAP_PERFMON or CAP_SYS_ADMIN\n"
>> + ">= 2: Disallow kernel profiling by users without CAP_PERFMON or CAP_SYS_ADMIN\n\n"
>> "To make this setting permanent, edit /etc/sysctl.conf too, e.g.:\n\n"
>> " kernel.perf_event_paranoid = -1\n" ,
>> target->system_wide ? "system-wide " : "",
>> diff --git a/tools/perf/util/util.c b/tools/perf/util/util.c
>> index d707c9624dd9..37a9492edb3e 100644
>> --- a/tools/perf/util/util.c
>> +++ b/tools/perf/util/util.c
>> @@ -290,6 +290,7 @@ int perf_event_paranoid(void)
>> bool perf_event_paranoid_check(int max_level)
>> {
>> return perf_cap__capable(CAP_SYS_ADMIN) ||
>> + perf_cap__capable(CAP_PERFMON) ||
>> perf_event_paranoid() <= max_level;
>> }
>>
>> --
>> 2.24.1
>>
>

2020-04-04 02:19:12

by Namhyung Kim

Subject: Re: [PATCH v8 04/12] perf tool: extend Perf tool with CAP_PERFMON capability support

Hello,

On Thu, Apr 2, 2020 at 5:47 PM Alexey Budankov
<[email protected]> wrote:
>
>
> Extend error messages to mention CAP_PERFMON capability as an option
> to substitute CAP_SYS_ADMIN capability for secure system performance
> monitoring and observability operations. Make perf_event_paranoid_check()
> and __cmd_ftrace() to be aware of CAP_PERFMON capability.
>
> CAP_PERFMON implements the principle of least privilege for performance
> monitoring and observability operations (POSIX IEEE 1003.1e 2.2.2.39
> principle of least privilege: A security design principle that states
> that a process or program be granted only those privileges (e.g.,
> capabilities) necessary to accomplish its legitimate function, and only
> for the time that such privileges are actually required)
>
> For backward compatibility reasons access to perf_events subsystem remains
> open for CAP_SYS_ADMIN privileged processes but CAP_SYS_ADMIN usage for
> secure perf_events monitoring is discouraged with respect to CAP_PERFMON
> capability.
>
> Signed-off-by: Alexey Budankov <[email protected]>
> Reviewed-by: James Morris <[email protected]>

Acked-by: Namhyung Kim <[email protected]>

Thanks
Namhyung

2020-04-04 08:19:43

by Alexey Budankov

Subject: Re: [PATCH v8 04/12] perf tool: extend Perf tool with CAP_PERFMON capability support

Hi Namhyung,

On 04.04.2020 5:18, Namhyung Kim wrote:
> Hello,
>
> On Thu, Apr 2, 2020 at 5:47 PM Alexey Budankov
> <[email protected]> wrote:
>>
>>
>> Extend error messages to mention CAP_PERFMON capability as an option
>> to substitute CAP_SYS_ADMIN capability for secure system performance
>> monitoring and observability operations. Make perf_event_paranoid_check()
>> and __cmd_ftrace() to be aware of CAP_PERFMON capability.
>>
>> CAP_PERFMON implements the principle of least privilege for performance
>> monitoring and observability operations (POSIX IEEE 1003.1e 2.2.2.39
>> principle of least privilege: A security design principle that states
>> that a process or program be granted only those privileges (e.g.,
>> capabilities) necessary to accomplish its legitimate function, and only
>> for the time that such privileges are actually required)
>>
>> For backward compatibility reasons access to perf_events subsystem remains
>> open for CAP_SYS_ADMIN privileged processes but CAP_SYS_ADMIN usage for
>> secure perf_events monitoring is discouraged with respect to CAP_PERFMON
>> capability.
>>
>> Signed-off-by: Alexey Budankov <[email protected]>
>> Reviewed-by: James Morris <[email protected]>
>
> Acked-by: Namhyung Kim <[email protected]>

Thanks! I appreciate your involvement and effort.

~Alexey

>
> Thanks
> Namhyung
>

2020-04-05 14:11:20

by Arnaldo Carvalho de Melo

Subject: Re: [PATCH v8 12/12] doc/admin-guide: update kernel.rst with CAP_PERFMON information

Em Thu, Apr 02, 2020 at 11:54:39AM +0300, Alexey Budankov escreveu:
>
> Update kernel.rst documentation file with the information
> related to usage of CAP_PERFMON capability to secure performance
> monitoring and observability operations in system.

This one is failing in my perf/core branch, please take a look. I'm
pushing my perf/core branch with this series applied, please check that
everything is ok, I'll do some testing now, but it all seems ok.

Thanks,

- Arnaldo

> Signed-off-by: Alexey Budankov <[email protected]>
> ---
> Documentation/admin-guide/sysctl/kernel.rst | 16 +++++++++++-----
> 1 file changed, 11 insertions(+), 5 deletions(-)
>
> diff --git a/Documentation/admin-guide/sysctl/kernel.rst b/Documentation/admin-guide/sysctl/kernel.rst
> index def074807cee..b06ae9389809 100644
> --- a/Documentation/admin-guide/sysctl/kernel.rst
> +++ b/Documentation/admin-guide/sysctl/kernel.rst
> @@ -720,20 +720,26 @@ perf_event_paranoid:
> ====================
>
> Controls use of the performance events system by unprivileged
> -users (without CAP_SYS_ADMIN). The default value is 2.
> +users (without CAP_PERFMON). The default value is 2.
> +
> +For backward compatibility reasons access to system performance
> +monitoring and observability remains open for CAP_SYS_ADMIN
> +privileged processes but CAP_SYS_ADMIN usage for secure system
> +performance monitoring and observability operations is discouraged
> +with respect to CAP_PERFMON use cases.
>
> === ==================================================================
> -1 Allow use of (almost) all events by all users
>
> Ignore mlock limit after perf_event_mlock_kb without CAP_IPC_LOCK
>
> ->=0 Disallow ftrace function tracepoint by users without CAP_SYS_ADMIN
> +>=0 Disallow ftrace function tracepoint by users without CAP_PERFMON
>
> - Disallow raw tracepoint access by users without CAP_SYS_ADMIN
> + Disallow raw tracepoint access by users without CAP_PERFMON
>
> ->=1 Disallow CPU event access by users without CAP_SYS_ADMIN
> +>=1 Disallow CPU event access by users without CAP_PERFMON
>
> ->=2 Disallow kernel profiling by users without CAP_SYS_ADMIN
> +>=2 Disallow kernel profiling by users without CAP_PERFMON
> === ==================================================================
>
>
> --
> 2.24.1
>

--

- Arnaldo

2020-04-05 14:47:52

by Alexey Budankov

Subject: Re: [PATCH v8 12/12] doc/admin-guide: update kernel.rst with CAP_PERFMON information


On 05.04.2020 17:10, Arnaldo Carvalho de Melo wrote:
> Em Thu, Apr 02, 2020 at 11:54:39AM +0300, Alexey Budankov escreveu:
>>
>> Update kernel.rst documentation file with the information
>> related to usage of CAP_PERFMON capability to secure performance
>> monitoring and observability operations in system.
>
> This one is failing in my perf/core branch, please take a look. I'm

Trying to reproduce right now. What kind of failure do you see?
Please share some specifics so I could follow up properly.

Thanks,
Alexey

> pushing my perf/core branch with this series applied, please check that
> everything is ok, I'll do some testing now, but it all seems ok.
>
> Thanks,
>
> - Arnaldo
>
>> Signed-off-by: Alexey Budankov <[email protected]>
>> ---
>> Documentation/admin-guide/sysctl/kernel.rst | 16 +++++++++++-----
>> 1 file changed, 11 insertions(+), 5 deletions(-)
>>
>> diff --git a/Documentation/admin-guide/sysctl/kernel.rst b/Documentation/admin-guide/sysctl/kernel.rst
>> index def074807cee..b06ae9389809 100644
>> --- a/Documentation/admin-guide/sysctl/kernel.rst
>> +++ b/Documentation/admin-guide/sysctl/kernel.rst
>> @@ -720,20 +720,26 @@ perf_event_paranoid:
>> ====================
>>
>> Controls use of the performance events system by unprivileged
>> -users (without CAP_SYS_ADMIN). The default value is 2.
>> +users (without CAP_PERFMON). The default value is 2.
>> +
>> +For backward compatibility reasons access to system performance
>> +monitoring and observability remains open for CAP_SYS_ADMIN
>> +privileged processes but CAP_SYS_ADMIN usage for secure system
>> +performance monitoring and observability operations is discouraged
>> +with respect to CAP_PERFMON use cases.
>>
>> === ==================================================================
>> -1 Allow use of (almost) all events by all users
>>
>> Ignore mlock limit after perf_event_mlock_kb without CAP_IPC_LOCK
>>
>> ->=0 Disallow ftrace function tracepoint by users without CAP_SYS_ADMIN
>> +>=0 Disallow ftrace function tracepoint by users without CAP_PERFMON
>>
>> - Disallow raw tracepoint access by users without CAP_SYS_ADMIN
>> + Disallow raw tracepoint access by users without CAP_PERFMON
>>
>> ->=1 Disallow CPU event access by users without CAP_SYS_ADMIN
>> +>=1 Disallow CPU event access by users without CAP_PERFMON
>>
>> ->=2 Disallow kernel profiling by users without CAP_SYS_ADMIN
>> +>=2 Disallow kernel profiling by users without CAP_PERFMON
>> === ==================================================================
>>
>>
>> --
>> 2.24.1
>>
>

2020-04-05 14:55:44

by Alexey Budankov

Subject: Re: [PATCH v8 12/12] doc/admin-guide: update kernel.rst with CAP_PERFMON information


On 05.04.2020 17:41, Alexey Budankov wrote:
>
> On 05.04.2020 17:10, Arnaldo Carvalho de Melo wrote:
>> Em Thu, Apr 02, 2020 at 11:54:39AM +0300, Alexey Budankov escreveu:
>>>
>>> Update kernel.rst documentation file with the information
>>> related to usage of CAP_PERFMON capability to secure performance
>>> monitoring and observability operations in system.
>>
>> This one is failing in my perf/core branch, please take a look. I'm

Please try applying this:

---
Documentation/admin-guide/sysctl/kernel.rst | 16 +++++++++++-----
1 file changed, 11 insertions(+), 5 deletions(-)

diff --git a/Documentation/admin-guide/sysctl/kernel.rst b/Documentation/admin-guide/sysctl/kernel.rst
index 335696d3360d..aaa5bbcd1e33 100644
--- a/Documentation/admin-guide/sysctl/kernel.rst
+++ b/Documentation/admin-guide/sysctl/kernel.rst
@@ -709,7 +709,13 @@ perf_event_paranoid
===================

Controls use of the performance events system by unprivileged
-users (without CAP_SYS_ADMIN). The default value is 2.
+users (without CAP_PERFMON). The default value is 2.
+
+For backward compatibility reasons access to system performance
+monitoring and observability remains open for CAP_SYS_ADMIN
+privileged processes but CAP_SYS_ADMIN usage for secure system
+performance monitoring and observability operations is discouraged
+with respect to CAP_PERFMON use cases.

=== ==================================================================
-1 Allow use of (almost) all events by all users.
@@ -718,13 +724,13 @@ users (without CAP_SYS_ADMIN). The default value is 2.
``CAP_IPC_LOCK``.

>=0 Disallow ftrace function tracepoint by users without
- ``CAP_SYS_ADMIN``.
+ ``CAP_PERFMON``.

- Disallow raw tracepoint access by users without ``CAP_SYS_ADMIN``.
+ Disallow raw tracepoint access by users without ``CAP_PERFMON``.

->=1 Disallow CPU event access by users without ``CAP_SYS_ADMIN``.
+>=1 Disallow CPU event access by users without ``CAP_PERFMON``.

->=2 Disallow kernel profiling by users without ``CAP_SYS_ADMIN``.
+>=2 Disallow kernel profiling by users without ``CAP_PERFMON``.
=== ==================================================================

---

Thanks,
Alexey

>
> Trying to reproduce right now. What kind of failure do you see?
> Please share some specifics so I could follow up properly.
>
> Thanks,
> Alexey
>
>> pushing my perf/core branch with this series applied, please check that
>> everything is ok, I'll do some testing now, but it all seems ok.
>>
>> Thanks,
>>
>> - Arnaldo
>>
>>> Signed-off-by: Alexey Budankov <[email protected]>
>>> ---
>>> Documentation/admin-guide/sysctl/kernel.rst | 16 +++++++++++-----
>>> 1 file changed, 11 insertions(+), 5 deletions(-)
>>>
>>> diff --git a/Documentation/admin-guide/sysctl/kernel.rst b/Documentation/admin-guide/sysctl/kernel.rst
>>> index def074807cee..b06ae9389809 100644
>>> --- a/Documentation/admin-guide/sysctl/kernel.rst
>>> +++ b/Documentation/admin-guide/sysctl/kernel.rst
>>> @@ -720,20 +720,26 @@ perf_event_paranoid:
>>> ====================
>>>
>>> Controls use of the performance events system by unprivileged
>>> -users (without CAP_SYS_ADMIN). The default value is 2.
>>> +users (without CAP_PERFMON). The default value is 2.
>>> +
>>> +For backward compatibility reasons access to system performance
>>> +monitoring and observability remains open for CAP_SYS_ADMIN
>>> +privileged processes but CAP_SYS_ADMIN usage for secure system
>>> +performance monitoring and observability operations is discouraged
>>> +with respect to CAP_PERFMON use cases.
>>>
>>> === ==================================================================
>>> -1 Allow use of (almost) all events by all users
>>>
>>> Ignore mlock limit after perf_event_mlock_kb without CAP_IPC_LOCK
>>>
>>> ->=0 Disallow ftrace function tracepoint by users without CAP_SYS_ADMIN
>>> +>=0 Disallow ftrace function tracepoint by users without CAP_PERFMON
>>>
>>> - Disallow raw tracepoint access by users without CAP_SYS_ADMIN
>>> + Disallow raw tracepoint access by users without CAP_PERFMON
>>>
>>> ->=1 Disallow CPU event access by users without CAP_SYS_ADMIN
>>> +>=1 Disallow CPU event access by users without CAP_PERFMON
>>>
>>> ->=2 Disallow kernel profiling by users without CAP_SYS_ADMIN
>>> +>=2 Disallow kernel profiling by users without CAP_PERFMON
>>> === ==================================================================
>>>
>>>
>>> --
>>> 2.24.1
>>>
>>

2020-04-05 15:09:57

by Arnaldo Carvalho de Melo

Subject: Re: [PATCH v8 12/12] doc/admin-guide: update kernel.rst with CAP_PERFMON information

Em Sun, Apr 05, 2020 at 05:54:37PM +0300, Alexey Budankov escreveu:
>
> On 05.04.2020 17:41, Alexey Budankov wrote:
> >
> > On 05.04.2020 17:10, Arnaldo Carvalho de Melo wrote:
> >> Em Thu, Apr 02, 2020 at 11:54:39AM +0300, Alexey Budankov escreveu:
> >>>
> >>> Update kernel.rst documentation file with the information
> >>> related to usage of CAP_PERFMON capability to secure performance
> >>> monitoring and observability operations in system.
> >>
> >> This one is failing in my perf/core branch, please take a look. I'm
>
> Please try applying this:

Thanks, applied with the original commit log message,

- Arnaldo

> ---
> Documentation/admin-guide/sysctl/kernel.rst | 16 +++++++++++-----
> 1 file changed, 11 insertions(+), 5 deletions(-)
>
> diff --git a/Documentation/admin-guide/sysctl/kernel.rst b/Documentation/admin-guide/sysctl/kernel.rst
> index 335696d3360d..aaa5bbcd1e33 100644
> --- a/Documentation/admin-guide/sysctl/kernel.rst
> +++ b/Documentation/admin-guide/sysctl/kernel.rst
> @@ -709,7 +709,13 @@ perf_event_paranoid
> ===================
>
> Controls use of the performance events system by unprivileged
> -users (without CAP_SYS_ADMIN). The default value is 2.
> +users (without CAP_PERFMON). The default value is 2.
> +
> +For backward compatibility reasons access to system performance
> +monitoring and observability remains open for CAP_SYS_ADMIN
> +privileged processes but CAP_SYS_ADMIN usage for secure system
> +performance monitoring and observability operations is discouraged
> +with respect to CAP_PERFMON use cases.
>
> === ==================================================================
> -1 Allow use of (almost) all events by all users.
> @@ -718,13 +724,13 @@ users (without CAP_SYS_ADMIN). The default value is 2.
> ``CAP_IPC_LOCK``.
>
> >=0 Disallow ftrace function tracepoint by users without
> - ``CAP_SYS_ADMIN``.
> + ``CAP_PERFMON``.
>
> - Disallow raw tracepoint access by users without ``CAP_SYS_ADMIN``.
> + Disallow raw tracepoint access by users without ``CAP_PERFMON``.
>
> ->=1 Disallow CPU event access by users without ``CAP_SYS_ADMIN``.
> +>=1 Disallow CPU event access by users without ``CAP_PERFMON``.
>
> ->=2 Disallow kernel profiling by users without ``CAP_SYS_ADMIN``.
> +>=2 Disallow kernel profiling by users without ``CAP_PERFMON``.
> === ==================================================================
>
> ---
>
> Thanks,
> Alexey
>
> >
> > Trying to reproduce right now. What kind of failure do you see?
> > Please share some specifics so I could follow up properly.
> >
> > Thanks,
> > Alexey
> >
> >> pushing my perf/core branch with this series applied, please check that
> >> everything is ok, I'll do some testing now, but it all seems ok.
> >>
> >> Thanks,
> >>
> >> - Arnaldo
> >>
> >>> Signed-off-by: Alexey Budankov <[email protected]>
> >>> ---
> >>> Documentation/admin-guide/sysctl/kernel.rst | 16 +++++++++++-----
> >>> 1 file changed, 11 insertions(+), 5 deletions(-)
> >>>
> >>> diff --git a/Documentation/admin-guide/sysctl/kernel.rst b/Documentation/admin-guide/sysctl/kernel.rst
> >>> index def074807cee..b06ae9389809 100644
> >>> --- a/Documentation/admin-guide/sysctl/kernel.rst
> >>> +++ b/Documentation/admin-guide/sysctl/kernel.rst
> >>> @@ -720,20 +720,26 @@ perf_event_paranoid:
> >>> ====================
> >>>
> >>> Controls use of the performance events system by unprivileged
> >>> -users (without CAP_SYS_ADMIN). The default value is 2.
> >>> +users (without CAP_PERFMON). The default value is 2.
> >>> +
> >>> +For backward compatibility reasons access to system performance
> >>> +monitoring and observability remains open for CAP_SYS_ADMIN
> >>> +privileged processes but CAP_SYS_ADMIN usage for secure system
> >>> +performance monitoring and observability operations is discouraged
> >>> +with respect to CAP_PERFMON use cases.
> >>>
> >>> === ==================================================================
> >>> -1 Allow use of (almost) all events by all users
> >>>
> >>> Ignore mlock limit after perf_event_mlock_kb without CAP_IPC_LOCK
> >>>
> >>> ->=0 Disallow ftrace function tracepoint by users without CAP_SYS_ADMIN
> >>> +>=0 Disallow ftrace function tracepoint by users without CAP_PERFMON
> >>>
> >>> - Disallow raw tracepoint access by users without CAP_SYS_ADMIN
> >>> + Disallow raw tracepoint access by users without CAP_PERFMON
> >>>
> >>> ->=1 Disallow CPU event access by users without CAP_SYS_ADMIN
> >>> +>=1 Disallow CPU event access by users without CAP_PERFMON
> >>>
> >>> ->=2 Disallow kernel profiling by users without CAP_SYS_ADMIN
> >>> +>=2 Disallow kernel profiling by users without CAP_PERFMON
> >>> === ==================================================================
> >>>
> >>>
> >>> --
> >>> 2.24.1
> >>>
> >>

--

- Arnaldo

2020-04-05 15:52:46

by Alexey Budankov

Subject: Re: [PATCH v8 12/12] doc/admin-guide: update kernel.rst with CAP_PERFMON information


On 05.04.2020 18:05, Arnaldo Carvalho de Melo wrote:
> Em Sun, Apr 05, 2020 at 05:54:37PM +0300, Alexey Budankov escreveu:
>>
>> On 05.04.2020 17:41, Alexey Budankov wrote:
>>>
>>> On 05.04.2020 17:10, Arnaldo Carvalho de Melo wrote:
>>>> Em Thu, Apr 02, 2020 at 11:54:39AM +0300, Alexey Budankov escreveu:
>>>>>
>>>>> Update kernel.rst documentation file with the information
>>>>> related to usage of CAP_PERFMON capability to secure performance
>>>>> monitoring and observability operations in system.
>>>>
>>>> This one is failing in my perf/core branch, please take a look. I'm
>>
>> Please try applying this:
>
> Thanks, applied with the original commit log message,

Thanks,
Alexey

2020-04-07 14:31:50

by Arnaldo Carvalho de Melo

Subject: Re: [PATCH v8 00/12] Introduce CAP_PERFMON to secure system performance monitoring and observability

Em Thu, Apr 02, 2020 at 11:42:05AM +0300, Alexey Budankov escreveu:
> This patch set introduces CAP_PERFMON capability designed to secure
> system performance monitoring and observability operations so that
> CAP_PERFMON would assist CAP_SYS_ADMIN capability in its governing role
> for performance monitoring and observability subsystems of the kernel.

So, what am I doing wrong?

[perf@five ~]$ type perf
perf is hashed (/home/perf/bin/perf)
[perf@five ~]$
[perf@five ~]$ ls -lahF /home/perf/bin/perf
-rwxr-x---. 1 root perf_users 24M Apr 7 10:34 /home/perf/bin/perf*
[perf@five ~]$
[perf@five ~]$ getcap /home/perf/bin/perf
[perf@five ~]$ perf top --stdio
Error:
You may not have permission to collect system-wide stats.

Consider tweaking /proc/sys/kernel/perf_event_paranoid,
which controls use of the performance events system by
unprivileged users (without CAP_PERFMON or CAP_SYS_ADMIN).

The current value is 2:

-1: Allow use of (almost) all events by all users
Ignore mlock limit after perf_event_mlock_kb without CAP_IPC_LOCK
>= 0: Disallow ftrace function tracepoint by users without CAP_PERFMON or CAP_SYS_ADMIN
Disallow raw tracepoint access by users without CAP_SYS_PERFMON or CAP_SYS_ADMIN
>= 1: Disallow CPU event access by users without CAP_PERFMON or CAP_SYS_ADMIN
>= 2: Disallow kernel profiling by users without CAP_PERFMON or CAP_SYS_ADMIN

To make this setting permanent, edit /etc/sysctl.conf too, e.g.:

kernel.perf_event_paranoid = -1

[perf@five ~]$
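
The level table in the error output above can be read as a simple lookup.
The following is only a sketch of that table, not code from the perf tool
or the kernel; the names RESTRICTIONS and disallowed() are made up for
illustration, and "privileged" stands for holding CAP_PERFMON or
CAP_SYS_ADMIN.

```python
# Restrictions from the perf_event_paranoid table quoted in the error message.
# Each entry is (lowest level at which the restriction applies, what is
# disallowed). The names here are illustrative only.
RESTRICTIONS = [
    (0, "ftrace function tracepoint"),
    (0, "raw tracepoint access"),
    (1, "CPU event access"),
    (2, "kernel profiling"),
]


def disallowed(level, privileged=False):
    """List what a user may not do at the given perf_event_paranoid level.

    `privileged` means the process holds CAP_PERFMON or CAP_SYS_ADMIN,
    in which case none of the listed restrictions apply.
    """
    if privileged:
        return []
    return [what for minimum, what in RESTRICTIONS if level >= minimum]


# With the default level of 2 everything in the table is off limits
# for a process without the capability.
print(disallowed(2))
print(disallowed(2, privileged=True))
```

On a live system the current level can be read from
/proc/sys/kernel/perf_event_paranoid and passed in as `level`.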

Ok, the message says I need CAP_PERFMON, so let's set it. This libcap is
unpatched and doesn't know the name, but we can use 38, the CAP_PERFMON
value, instead. I tested this with a patched libcap as well, with the
same results:

As root:

[root@five bin]# setcap "38,cap_sys_ptrace,cap_syslog=ep" perf
[root@five bin]#
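
The bare number 38 works because a capability is just a bit position in
the capability masks. A minimal sketch of how those masks can be decoded,
assuming the bit numbers from the uapi linux/capability.h header (14, 19,
21, 34) plus 38 for CAP_PERFMON as stated in this thread; the helper names
are hypothetical:

```python
# Bit positions of the capabilities discussed in this thread. 38 is the
# CAP_PERFMON value given above; the others are assumed from the uapi
# linux/capability.h header.
CAPS = {
    "cap_ipc_lock": 14,
    "cap_sys_ptrace": 19,
    "cap_sys_admin": 21,
    "cap_syslog": 34,
    "cap_perfmon": 38,
}


def cap_bit(cap):
    """Accept either a known name or a raw number, like setcap does."""
    return CAPS[cap] if isinstance(cap, str) else int(cap)


def cap_mask(*caps):
    """Build a capability bitmask from names and/or raw numbers."""
    mask = 0
    for cap in caps:
        mask |= 1 << cap_bit(cap)
    return mask


def has_cap(mask, cap):
    """Check whether the given capability bit is set in the mask."""
    return bool(mask >> cap_bit(cap) & 1)


def parse_cap_eff(status_text):
    """Extract the effective set from the text of /proc/<pid>/status."""
    for line in status_text.splitlines():
        if line.startswith("CapEff:"):
            return int(line.split()[1], 16)
    raise ValueError("no CapEff line found")


mask = cap_mask(38, "cap_sys_ptrace", "cap_syslog")
print(hex(mask), has_cap(mask, "cap_perfmon"), has_cap(mask, "cap_ipc_lock"))
```

Feeding parse_cap_eff() the contents of /proc/<pid>/status of a running
perf process is one way to confirm that bit 38 actually made it into the
effective set.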

Back to the 'perf' user in the 'perf_users' group: now 'perf record -a'
works for system-wide sampling of cycles:u, i.e. only userspace samples,
but 'perf top' is failing:

[perf@five ~]$ type perf
perf is hashed (/home/perf/bin/perf)
[perf@five ~]$ getcap /home/perf/bin/perf
/home/perf/bin/perf = cap_sys_ptrace,cap_syslog,38+ep
[perf@five ~]$ groups
perf perf_users
[perf@five ~]$ id
uid=1002(perf) gid=1002(perf) groups=1002(perf),1003(perf_users) context=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023
[perf@five ~]$ perf top --stdio
Error:
Failed to mmap with 1 (Operation not permitted)
[perf@five ~]$ perf record -a
^C[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 1.177 MB perf.data (1552 samples) ]

[perf@five ~]$ perf evlist
cycles:u
[perf@five ~]$

- Arnaldo

2020-04-07 14:37:31

by Arnaldo Carvalho de Melo

Subject: Re: [PATCH v8 00/12] Introduce CAP_PERFMON to secure system performance monitoring and observability

Em Tue, Apr 07, 2020 at 11:30:14AM -0300, Arnaldo Carvalho de Melo escreveu:
> [perf@five ~]$ type perf
> perf is hashed (/home/perf/bin/perf)
> [perf@five ~]$ getcap /home/perf/bin/perf
> /home/perf/bin/perf = cap_sys_ptrace,cap_syslog,38+ep
> [perf@five ~]$ groups
> perf perf_users
> [perf@five ~]$ id
> uid=1002(perf) gid=1002(perf) groups=1002(perf),1003(perf_users) context=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023
> [perf@five ~]$ perf top --stdio
> Error:
> Failed to mmap with 1 (Operation not permitted)
> [perf@five ~]$ perf record -a
> ^C[ perf record: Woken up 1 times to write data ]
> [ perf record: Captured and wrote 1.177 MB perf.data (1552 samples) ]
>
> [perf@five ~]$ perf evlist
> cycles:u
> [perf@five ~]$

Humm, perf record falls back to cycles:u after initially trying cycles
(i.e. kernel and userspace), so lemme try 'perf top -e cycles:u' as
well. Humm, that doesn't help either:

[perf@five ~]$ perf top --stdio -e cycles:u
Error:
Failed to mmap with 1 (Operation not permitted)
[perf@five ~]$ perf record -e cycles:u -a sleep 1
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 1.123 MB perf.data (132 samples) ]
[perf@five ~]$

Back to debugging this.

- Arnaldo

2020-04-07 14:55:25

by Alexey Budankov

Subject: Re: [PATCH v8 00/12] Introduce CAP_PERFMON to secure system performance monitoring and observability

On 07.04.2020 17:35, Arnaldo Carvalho de Melo wrote:
> Em Tue, Apr 07, 2020 at 11:30:14AM -0300, Arnaldo Carvalho de Melo escreveu:
>> [perf@five ~]$ type perf
>> perf is hashed (/home/perf/bin/perf)
>> [perf@five ~]$ getcap /home/perf/bin/perf
>> /home/perf/bin/perf = cap_sys_ptrace,cap_syslog,38+ep
>> [perf@five ~]$ groups
>> perf perf_users
>> [perf@five ~]$ id
>> uid=1002(perf) gid=1002(perf) groups=1002(perf),1003(perf_users) context=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023
>> [perf@five ~]$ perf top --stdio
>> Error:
>> Failed to mmap with 1 (Operation not permitted)
>> [perf@five ~]$ perf record -a
>> ^C[ perf record: Woken up 1 times to write data ]
>> [ perf record: Captured and wrote 1.177 MB perf.data (1552 samples) ]
>>
>> [perf@five ~]$ perf evlist
>> cycles:u
>> [perf@five ~]$
>
> Humm, perf record falls back to cycles:u after initially trying cycles
> (i.e. kernel and userspace), lemme see trying 'perf top -e cycles:u',
> lemme test, humm not really:
>
> [perf@five ~]$ perf top --stdio -e cycles:u
> Error:
> Failed to mmap with 1 (Operation not permitted)
> [perf@five ~]$ perf record -e cycles:u -a sleep 1
> [ perf record: Woken up 1 times to write data ]
> [ perf record: Captured and wrote 1.123 MB perf.data (132 samples) ]
> [perf@five ~]$
>
> Back to debugging this.

It could make sense to add cap_ipc_lock to the binary to rule out this check:

kernel/events/core.c: 6101
if ((locked > lock_limit) && perf_is_paranoid() &&
!capable(CAP_IPC_LOCK)) {
ret = -EPERM;
goto unlock;
}
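
Read as plain logic, the quoted check can be modelled as below. This is
only a sketch of the condition, not kernel code; perf_mmap_check() is a
made-up name, and the assumption that perf_is_paranoid() means
"perf_event_paranoid > -1" is stated in the comment.

```python
import errno


def perf_mmap_check(locked, lock_limit, paranoid_level, has_cap_ipc_lock):
    """Model of the quoted condition; returns 0 or -EPERM.

    Assumption: perf_is_paranoid() is true when perf_event_paranoid > -1.
    `locked` and `lock_limit` are in the same unit (e.g. pages).
    """
    is_paranoid = paranoid_level > -1
    if locked > lock_limit and is_paranoid and not has_cap_ipc_lock:
        return -errno.EPERM
    return 0


# Over the limit at level 2 without CAP_IPC_LOCK: the mmap is refused,
# which matches the "Failed to mmap with 1" error (EPERM is errno 1).
print(perf_mmap_check(locked=10, lock_limit=5, paranoid_level=2,
                      has_cap_ipc_lock=False))
# Same request with CAP_IPC_LOCK passes this particular check.
print(perf_mmap_check(locked=10, lock_limit=5, paranoid_level=2,
                      has_cap_ipc_lock=True))
```

So there are three ways past the check: stay under the limit, set
perf_event_paranoid to -1, or grant CAP_IPC_LOCK.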

~Alexey

2020-04-07 16:37:58

by Arnaldo Carvalho de Melo

Subject: Re: [PATCH v8 00/12] Introduce CAP_PERFMON to secure system performance monitoring and observability

Em Tue, Apr 07, 2020 at 05:54:27PM +0300, Alexey Budankov escreveu:
> On 07.04.2020 17:35, Arnaldo Carvalho de Melo wrote:
> > Em Tue, Apr 07, 2020 at 11:30:14AM -0300, Arnaldo Carvalho de Melo escreveu:
> >> [perf@five ~]$ type perf
> >> perf is hashed (/home/perf/bin/perf)
> >> [perf@five ~]$ getcap /home/perf/bin/perf
> >> /home/perf/bin/perf = cap_sys_ptrace,cap_syslog,38+ep
> >> [perf@five ~]$ groups
> >> perf perf_users
> >> [perf@five ~]$ id
> >> uid=1002(perf) gid=1002(perf) groups=1002(perf),1003(perf_users) context=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023
> >> [perf@five ~]$ perf top --stdio
> >> Error:
> >> Failed to mmap with 1 (Operation not permitted)
> >> [perf@five ~]$ perf record -a
> >> ^C[ perf record: Woken up 1 times to write data ]
> >> [ perf record: Captured and wrote 1.177 MB perf.data (1552 samples) ]
> >>
> >> [perf@five ~]$ perf evlist
> >> cycles:u
> >> [perf@five ~]$
> >
> > Humm, perf record falls back to cycles:u after initially trying cycles
> > (i.e. kernel and userspace), lemme see trying 'perf top -e cycles:u',
> > lemme test, humm not really:
> >
> > [perf@five ~]$ perf top --stdio -e cycles:u
> > Error:
> > Failed to mmap with 1 (Operation not permitted)
> > [perf@five ~]$ perf record -e cycles:u -a sleep 1
> > [ perf record: Woken up 1 times to write data ]
> > [ perf record: Captured and wrote 1.123 MB perf.data (132 samples) ]
> > [perf@five ~]$
> >
> > Back to debugging this.
>
> It could make sense to add cap_ipc_lock to the binary to rule out this check:
>
> kernel/events/core.c: 6101
> if ((locked > lock_limit) && perf_is_paranoid() &&
> !capable(CAP_IPC_LOCK)) {
> ret = -EPERM;
> goto unlock;
> }


That did the trick, I'll update the documentation and include this in my
"Committer testing" section:

[perf@five ~]$ groups
perf perf_users
[perf@five ~]$ ls -lahF bin/perf
-rwxr-x---. 1 root perf_users 24M Apr 7 10:34 bin/perf*
[perf@five ~]$ getcap bin/perf
bin/perf = cap_ipc_lock,cap_sys_ptrace,cap_syslog,38+ep
[perf@five ~]$
[perf@five ~]$ perf top --stdio


PerfTop: 652 irqs/sec kernel:73.8% exact: 99.7% lost: 0/0 drop: 0/0 [4000Hz cycles:u], (all, 12 CPUs)
---------------------------------------------------------------------------------------------------------------

13.03% [kernel] [k] module_get_kallsym
5.25% [kernel] [k] kallsyms_expand_symbol.constprop.0
5.00% libc-2.30.so [.] __GI_____strtoull_l_internal
4.41% [kernel] [k] memcpy
3.42% [kernel] [k] vsnprintf
2.98% perf [.] map__process_kallsym_symbol
2.86% [kernel] [k] format_decode
2.73% [kernel] [k] number
2.70% perf [.] rb_next
2.59% perf [.] maps__split_kallsyms
2.54% [kernel] [k] string_nocheck
1.90% libc-2.30.so [.] _IO_getdelim
1.86% [kernel] [k] __x86_indirect_thunk_rax
1.53% libc-2.30.so [.] _int_malloc
1.48% libc-2.30.so [.] __memmove_avx_unaligned_erms
1.40% [kernel] [k] clear_page_rep
1.07% perf [.] rb_insert_color
1.01% libc-2.30.so [.] _IO_feof
0.99% perf [.] __dso__load_kallsyms
0.98% [kernel] [k] s_next
0.96% perf [.] __rblist__findnew
0.95% [kernel] [k] strlen
0.95% perf [.] arch__symbols__fixup_end
0.94% libpixman-1.so.0.38.4 [.] 0x000000000006f4af
0.94% perf [.] symbol__new
0.89% libpixman-1.so.0.38.4 [.] 0x000000000006f4a0
0.86% [kernel] [k] seq_read
0.81% libpixman-1.so.0.38.4 [.] 0x000000000006f4ab
0.80% perf [.] __symbols__insert
0.73% libpixman-1.so.0.38.4 [.] 0x000000000006f4a7
0.67% [kernel] [k] s_show
0.66% libc-2.30.so [.] __libc_calloc
0.61% libpixman-1.so.0.38.4 [.] 0x000000000006f4bb
0.59% [kernel] [k] get_page_from_freelist
0.59% perf [.] memcpy@plt
0.58% perf [.] eprintf
exiting.
[perf@five ~]$

There is still something strange here: the event is cycles:u (see the
PerfTop line), but it is getting kernel samples :-\

- Arnaldo

2020-04-07 16:42:42

by Arnaldo Carvalho de Melo

Subject: Re: [PATCH v8 00/12] Introduce CAP_PERFMON to secure system performance monitoring and observability

Em Tue, Apr 07, 2020 at 01:36:54PM -0300, Arnaldo Carvalho de Melo escreveu:
> Em Tue, Apr 07, 2020 at 05:54:27PM +0300, Alexey Budankov escreveu:
> > On 07.04.2020 17:35, Arnaldo Carvalho de Melo wrote:
> > > Em Tue, Apr 07, 2020 at 11:30:14AM -0300, Arnaldo Carvalho de Melo escreveu:
> > >> [perf@five ~]$ type perf
> > >> perf is hashed (/home/perf/bin/perf)
> > >> [perf@five ~]$ getcap /home/perf/bin/perf
> > >> /home/perf/bin/perf = cap_sys_ptrace,cap_syslog,38+ep
> > >> [perf@five ~]$ groups
> > >> perf perf_users
> > >> [perf@five ~]$ id
> > >> uid=1002(perf) gid=1002(perf) groups=1002(perf),1003(perf_users) context=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023
> > >> [perf@five ~]$ perf top --stdio
> > >> Error:
> > >> Failed to mmap with 1 (Operation not permitted)
> > >> [perf@five ~]$ perf record -a
> > >> ^C[ perf record: Woken up 1 times to write data ]
> > >> [ perf record: Captured and wrote 1.177 MB perf.data (1552 samples) ]
> > >>
> > >> [perf@five ~]$ perf evlist
> > >> cycles:u
> > >> [perf@five ~]$
> > >
> > > Humm, perf record falls back to cycles:u after initially trying cycles
> > > (i.e. kernel and userspace), lemme see trying 'perf top -e cycles:u',
> > > lemme test, humm not really:
> > >
> > > [perf@five ~]$ perf top --stdio -e cycles:u
> > > Error:
> > > Failed to mmap with 1 (Operation not permitted)
> > > [perf@five ~]$ perf record -e cycles:u -a sleep 1
> > > [ perf record: Woken up 1 times to write data ]
> > > [ perf record: Captured and wrote 1.123 MB perf.data (132 samples) ]
> > > [perf@five ~]$
> > >
> > > Back to debugging this.
> >
> > It could make sense to add cap_ipc_lock to the binary to rule out this check:
> >
> > kernel/events/core.c: 6101
> > if ((locked > lock_limit) && perf_is_paranoid() &&
> > !capable(CAP_IPC_LOCK)) {
> > ret = -EPERM;
> > goto unlock;
> > }
>
>
> That did the trick, I'll update the documentation and include in my
> "Committer testing" section:

I amended that patch with this, please check the wording:

- Arnaldo

diff --git a/Documentation/admin-guide/perf-security.rst b/Documentation/admin-guide/perf-security.rst
index c0ca0c1a6804..ed33682e26b0 100644
--- a/Documentation/admin-guide/perf-security.rst
+++ b/Documentation/admin-guide/perf-security.rst
@@ -127,12 +127,19 @@ taken to create such groups of privileged Perf users.

::

- # setcap "cap_perfmon,cap_sys_ptrace,cap_syslog=ep" perf
- # setcap -v "cap_perfmon,cap_sys_ptrace,cap_syslog=ep" perf
+ # setcap "cap_perfmon,cap_ipc_lock,cap_sys_ptrace,cap_syslog=ep" perf
+ # setcap -v "cap_perfmon,cap_ipc_lock,cap_sys_ptrace,cap_syslog=ep" perf
perf: OK
# getcap perf
perf = cap_sys_ptrace,cap_syslog,cap_perfmon+ep

+If the libcap installed doesn't yet support "cap_perfmon", use "38" instead,
+i.e.:
+
+::
+
+ # setcap "38,cap_ipc_lock,cap_sys_ptrace,cap_syslog=ep" perf
+
As a result, members of perf_users group are capable of conducting
performance monitoring and observability by using functionality of the
configured Perf tool executable that, when executes, passes perf_events

2020-04-07 16:53:58

by Alexey Budankov

Subject: Re: [PATCH v8 00/12] Introduce CAP_PERFMON to secure system performance monitoring and observability


On 07.04.2020 19:36, Arnaldo Carvalho de Melo wrote:
> Em Tue, Apr 07, 2020 at 05:54:27PM +0300, Alexey Budankov escreveu:
>> On 07.04.2020 17:35, Arnaldo Carvalho de Melo wrote:
>>> Em Tue, Apr 07, 2020 at 11:30:14AM -0300, Arnaldo Carvalho de Melo escreveu:
>>>> [perf@five ~]$ type perf
<SNIP>
>>>> perf is hashed (/home/perf/bin/perf)
>>>> [perf@five ~]$
>>>
>>> Humm, perf record falls back to cycles:u after initially trying cycles
>>> (i.e. kernel and userspace), lemme see trying 'perf top -e cycles:u',
>>> lemme test, humm not really:
>>>
>>> [perf@five ~]$ perf top --stdio -e cycles:u
>>> Error:
>>> Failed to mmap with 1 (Operation not permitted)
>>> [perf@five ~]$ perf record -e cycles:u -a sleep 1
>>> [ perf record: Woken up 1 times to write data ]
>>> [ perf record: Captured and wrote 1.123 MB perf.data (132 samples) ]
>>> [perf@five ~]$
>>>
>>> Back to debugging this.
>>
>> It could make sense to add cap_ipc_lock to the binary to rule out this check:
>>
>> kernel/events/core.c: 6101
>> if ((locked > lock_limit) && perf_is_paranoid() &&
>> !capable(CAP_IPC_LOCK)) {
>> ret = -EPERM;
>> goto unlock;
>> }
>
>
> That did the trick, I'll update the documentation and include in my
> "Committer testing" section:

It looks like top mode somehow reaches the perf mmap limit described in [1].
Using the -m option solves the issue without cap_ipc_lock on my 8-core machine:
perf top -e cycles -m 1
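
The arithmetic behind this can be sketched as follows. It assumes 4 KiB
pages, one metadata page per perf mmap in addition to the data pages
requested with -m, and the per-CPU reading of perf_event_mlock_kb with
the 516 KiB / 8 CPU example from the memory allocation section in [1].
The function names are made up, and how many buffers 'perf top' really
maps is not modelled here.

```python
PAGE_SIZE = 4096  # assumption: 4 KiB pages


def mlock_budget_kib(perf_event_mlock_kb, online_cpus):
    """Memory a user may map for perf buffers above RLIMIT_MEMLOCK."""
    return perf_event_mlock_kb * online_cpus


def ring_buffer_kib(data_pages, page_size=PAGE_SIZE):
    """Size of one perf mmap: one metadata page plus the data pages."""
    return (1 + data_pages) * page_size // 1024


def fits(data_pages, n_buffers, perf_event_mlock_kb=516, online_cpus=8):
    """Would n_buffers mmaps of this size stay within the budget?"""
    needed = ring_buffer_kib(data_pages) * n_buffers
    return needed <= mlock_budget_kib(perf_event_mlock_kb, online_cpus)


print(mlock_budget_kib(516, 8))       # budget on an 8-CPU machine, in KiB
print(ring_buffer_kib(1) * 8)         # eight buffers opened with -m 1
print(fits(1, 8), fits(256, 8))
```

With -m 1 the eight buffers need far less than the budget, which is
consistent with the run above succeeding without cap_ipc_lock.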

~Alexey

[1] https://www.kernel.org/doc/html/latest/admin-guide/perf-security.html#memory-allocation

2020-04-07 16:57:45

by Arnaldo Carvalho de Melo

Subject: Re: [PATCH v8 00/12] Introduce CAP_PERFMON to secure system performance monitoring and observability

Em Tue, Apr 07, 2020 at 01:36:54PM -0300, Arnaldo Carvalho de Melo escreveu:
> Em Tue, Apr 07, 2020 at 05:54:27PM +0300, Alexey Budankov escreveu:
> > On 07.04.2020 17:35, Arnaldo Carvalho de Melo wrote:
> > > Em Tue, Apr 07, 2020 at 11:30:14AM -0300, Arnaldo Carvalho de Melo escreveu:
> > >> [perf@five ~]$ type perf
> > >> perf is hashed (/home/perf/bin/perf)
> > >> [perf@five ~]$ getcap /home/perf/bin/perf
> > >> /home/perf/bin/perf = cap_sys_ptrace,cap_syslog,38+ep
> > >> [perf@five ~]$ groups
> > >> perf perf_users
> > >> [perf@five ~]$ id
> > >> uid=1002(perf) gid=1002(perf) groups=1002(perf),1003(perf_users) context=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023
> > >> [perf@five ~]$ perf top --stdio
> > >> Error:
> > >> Failed to mmap with 1 (Operation not permitted)
> > >> [perf@five ~]$ perf record -a
> > >> ^C[ perf record: Woken up 1 times to write data ]
> > >> [ perf record: Captured and wrote 1.177 MB perf.data (1552 samples) ]
> > >>
> > >> [perf@five ~]$ perf evlist
> > >> cycles:u
> > >> [perf@five ~]$
> > >
> > > Humm, perf record falls back to cycles:u after initially trying cycles
> > > (i.e. kernel and userspace), lemme see trying 'perf top -e cycles:u',
> > > lemme test, humm not really:
> > >
> > > [perf@five ~]$ perf top --stdio -e cycles:u
> > > Error:
> > > Failed to mmap with 1 (Operation not permitted)
> > > [perf@five ~]$ perf record -e cycles:u -a sleep 1
> > > [ perf record: Woken up 1 times to write data ]
> > > [ perf record: Captured and wrote 1.123 MB perf.data (132 samples) ]
> > > [perf@five ~]$
> > >
> > > Back to debugging this.
> >
> > It could make sense to add cap_ipc_lock to the binary to rule out this check:
> >
> > kernel/events/core.c: 6101
> > if ((locked > lock_limit) && perf_is_paranoid() &&
> > !capable(CAP_IPC_LOCK)) {
> > ret = -EPERM;
> > goto unlock;
> > }
>
>
> That did the trick, I'll update the documentation and include in my
> "Committer testing" section:
>
> [perf@five ~]$ groups
> perf perf_users
> [perf@five ~]$ ls -lahF bin/perf
> -rwxr-x---. 1 root perf_users 24M Apr 7 10:34 bin/perf*
> [perf@five ~]$ getcap bin/perf
> bin/perf = cap_ipc_lock,cap_sys_ptrace,cap_syslog,38+ep
> [perf@five ~]$
> [perf@five ~]$ perf top --stdio
>
>
> PerfTop: 652 irqs/sec kernel:73.8% exact: 99.7% lost: 0/0 drop: 0/0 [4000Hz cycles:u], (all, 12 CPUs)
> ---------------------------------------------------------------------------------------------------------------
>
> 13.03% [kernel] [k] module_get_kallsym
> 5.25% [kernel] [k] kallsyms_expand_symbol.constprop.0
> 5.00% libc-2.30.so [.] __GI_____strtoull_l_internal
> 4.41% [kernel] [k] memcpy
> 3.42% [kernel] [k] vsnprintf
> 2.98% perf [.] map__process_kallsym_symbol
> 2.86% [kernel] [k] format_decode
> 2.73% [kernel] [k] number
> 2.70% perf [.] rb_next
> 2.59% perf [.] maps__split_kallsyms
> 2.54% [kernel] [k] string_nocheck
> 1.90% libc-2.30.so [.] _IO_getdelim
> 1.86% [kernel] [k] __x86_indirect_thunk_rax
> 1.53% libc-2.30.so [.] _int_malloc
> 1.48% libc-2.30.so [.] __memmove_avx_unaligned_erms
> 1.40% [kernel] [k] clear_page_rep
> 1.07% perf [.] rb_insert_color
> 1.01% libc-2.30.so [.] _IO_feof
> 0.99% perf [.] __dso__load_kallsyms
> 0.98% [kernel] [k] s_next
> 0.96% perf [.] __rblist__findnew
> 0.95% [kernel] [k] strlen
> 0.95% perf [.] arch__symbols__fixup_end
> 0.94% libpixman-1.so.0.38.4 [.] 0x000000000006f4af
> 0.94% perf [.] symbol__new
> 0.89% libpixman-1.so.0.38.4 [.] 0x000000000006f4a0
> 0.86% [kernel] [k] seq_read
> 0.81% libpixman-1.so.0.38.4 [.] 0x000000000006f4ab
> 0.80% perf [.] __symbols__insert
> 0.73% libpixman-1.so.0.38.4 [.] 0x000000000006f4a7
> 0.67% [kernel] [k] s_show
> 0.66% libc-2.30.so [.] __libc_calloc
> 0.61% libpixman-1.so.0.38.4 [.] 0x000000000006f4bb
> 0.59% [kernel] [k] get_page_from_freelist
> 0.59% perf [.] memcpy@plt
> 0.58% perf [.] eprintf
> exiting.
> [perf@five ~]$
>
> There is still something strange here: the event is cycles:u (see the
> PerfTop line), but it is getting kernel samples :-\

So running with 'perf top --stdio -vv 2> /tmp/output' I see that we try to
create three events. The first is some capability querying, then we try
to determine the max precision level, but we continue with
attr.exclude_kernel=1, which shouldn't be the case. Perhaps the tooling
sees that it is not running as root and sets that to 1 because,
previously, we knew the request would fail otherwise. If so, we should
switch to checking whether we have cap_perfmon too. I will check that:

------------------------------------------------------------
perf_event_attr:
type 1
size 120
config 0x9
watermark 1
sample_id_all 1
bpf_event 1
{ wakeup_events, wakeup_watermark } 1
------------------------------------------------------------
------------------------------------------------------------
perf_event_attr:
size 120
{ sample_period, sample_freq } 4000
sample_type IP|TID|TIME|CPU|PERIOD
read_format ID
disabled 1
inherit 1
exclude_kernel 1
mmap 1
comm 1
freq 1
task 1
precise_ip 3
sample_id_all 1
exclude_guest 1
mmap2 1
comm_exec 1
ksymbol 1
bpf_event 1
------------------------------------------------------------
------------------------------------------------------------
perf_event_attr:
size 120
{ sample_period, sample_freq } 4000
sample_type IP|TID|TIME|CPU|PERIOD
read_format ID
disabled 1
inherit 1
exclude_kernel 1
mmap 1
comm 1
freq 1
task 1
precise_ip 2
sample_id_all 1
exclude_guest 1
mmap2 1
comm_exec 1
ksymbol 1
bpf_event 1
------------------------------------------------------------
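
The suspected tool-side decision could look roughly like the sketch
below. This is not the perf source; can_profile_kernel() and
exclude_kernel() are hypothetical names that only illustrate the
difference between a root-only test and one that also accepts
CAP_PERFMON or CAP_SYS_ADMIN.

```python
def can_profile_kernel(euid, paranoid_level,
                       has_cap_perfmon=False, has_cap_sys_admin=False):
    """Hypothetical check: may this process ask for kernel samples?

    Kernel profiling is disallowed for unprivileged users once
    perf_event_paranoid is >= 2, so levels <= 1 are always fine.
    """
    if euid == 0 or has_cap_perfmon or has_cap_sys_admin:
        return True
    return paranoid_level <= 1


def exclude_kernel(euid, paranoid_level, **caps):
    """Value the tool would put into attr.exclude_kernel."""
    return 0 if can_profile_kernel(euid, paranoid_level, **caps) else 1


# A root-only view of the world sets exclude_kernel=1 for uid 1002 ...
print(exclude_kernel(1002, 2))
# ... while a capability-aware check leaves kernel sampling enabled.
print(exclude_kernel(1002, 2, has_cap_perfmon=True))
```

If the tool only looks at the effective uid, the first case is what the
dumps above show even though the binary carries cap_perfmon.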

But then, even with attr.exclude_kernel set to 1, we _still_ get
kernel samples, which looks like another bug. Trying now with strace
leads us to another rabbit hole:

[perf@five ~]$ strace -e perf_event_open -o /tmp/out.put perf top --stdio
Error:
You may not have permission to collect system-wide stats.

Consider tweaking /proc/sys/kernel/perf_event_paranoid,
which controls use of the performance events system by
unprivileged users (without CAP_PERFMON or CAP_SYS_ADMIN).

The current value is 2:

-1: Allow use of (almost) all events by all users
Ignore mlock limit after perf_event_mlock_kb without CAP_IPC_LOCK
>= 0: Disallow ftrace function tracepoint by users without CAP_PERFMON or CAP_SYS_ADMIN
Disallow raw tracepoint access by users without CAP_SYS_PERFMON or CAP_SYS_ADMIN
>= 1: Disallow CPU event access by users without CAP_PERFMON or CAP_SYS_ADMIN
>= 2: Disallow kernel profiling by users without CAP_PERFMON or CAP_SYS_ADMIN

To make this setting permanent, edit /etc/sysctl.conf too, e.g.:

kernel.perf_event_paranoid = -1

[perf@five ~]$

If I remove that strace -e ... from the front, 'perf top' works again
as a non-cap_sys_admin user, just with cap_perfmon.
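
A possible explanation, offered here as an assumption rather than
something established in this thread: execve(2) documents that file
capabilities are ignored when the calling process is being ptraced, and
strace attaches via ptrace. A small sketch for checking this from the
text of /proc/<pid>/status (the helper names are made up; 38 is the
CAP_PERFMON bit):

```python
def parse_status(status_text):
    """Turn the 'Key:\tvalue' lines of /proc/<pid>/status into a dict."""
    fields = {}
    for line in status_text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def diagnose(status_text, cap_bit=38):
    """Return (is_traced, has_capability) for one process.

    TracerPid is non-zero when a tracer such as strace is attached;
    CapEff is the effective capability mask in hexadecimal.
    """
    fields = parse_status(status_text)
    traced = int(fields.get("TracerPid", "0")) != 0
    cap_eff = int(fields.get("CapEff", "0"), 16)
    return traced, bool(cap_eff >> cap_bit & 1)


# Made-up sample contents, not captured from the machine in this thread.
under_strace = "Name:\tperf\nTracerPid:\t4242\nCapEff:\t0000000000000000\n"
standalone = "Name:\tperf\nTracerPid:\t0\nCapEff:\t0000004400084000\n"
print(diagnose(under_strace))
print(diagnose(standalone))
```

If the traced run shows an empty effective set while the standalone run
has bit 38, the permission error under strace comes from the lost file
capabilities and not from perf itself.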

- Arnaldo

2020-04-07 17:04:57

by Arnaldo Carvalho de Melo

Subject: Re: [PATCH v8 00/12] Introduce CAP_PERFMON to secure system performance monitoring and observability

Em Tue, Apr 07, 2020 at 07:52:56PM +0300, Alexey Budankov escreveu:
>
> On 07.04.2020 19:36, Arnaldo Carvalho de Melo wrote:
> > Em Tue, Apr 07, 2020 at 05:54:27PM +0300, Alexey Budankov escreveu:
> >> It could make sense to add cap_ipc_lock to the binary to isolate the problem from this check:

> >> kernel/events/core.c: 6101
> >> if ((locked > lock_limit) && perf_is_paranoid() &&
> >> !capable(CAP_IPC_LOCK)) {
> >> ret = -EPERM;
> >> goto unlock;
> >> }

> > That did the trick, I'll update the documentation and include in my
> > "Committer testing" section:

> Looks like top mode somehow reaches perf mmap limit described here [1].
> Using -m option solves the issue avoiding cap_ipc_lock on my 8 cores machine:
> perf top -e cycles -m 1

So this would read better?

diff --git a/Documentation/admin-guide/perf-security.rst b/Documentation/admin-guide/perf-security.rst
index ed33682e26b0..d44dd24b0244 100644
--- a/Documentation/admin-guide/perf-security.rst
+++ b/Documentation/admin-guide/perf-security.rst
@@ -127,8 +127,8 @@ taken to create such groups of privileged Perf users.

::

- # setcap "cap_perfmon,cap_ipc_lock,cap_sys_ptrace,cap_syslog=ep" perf
- # setcap -v "cap_perfmon,cap_ipc_lock,cap_sys_ptrace,cap_syslog=ep" perf
+ # setcap "cap_perfmon,cap_sys_ptrace,cap_syslog=ep" perf
+ # setcap -v "cap_perfmon,cap_sys_ptrace,cap_syslog=ep" perf
perf: OK
# getcap perf
perf = cap_sys_ptrace,cap_syslog,cap_perfmon+ep
@@ -140,6 +140,10 @@ i.e.:

# setcap "38,cap_ipc_lock,cap_sys_ptrace,cap_syslog=ep" perf

+Note that you may need to have 'cap_ipc_lock' in the mix for tools such as
+'perf top', alternatively use 'perf top -m N', to reduce the memory that
+it uses for the perf ring buffer, see the memory allocation section below.
+
As a result, members of perf_users group are capable of conducting
performance monitoring and observability by using functionality of the
configured Perf tool executable that, when executes, passes perf_events

2020-04-07 17:18:10

by Alexey Budankov

Subject: Re: [PATCH v8 00/12] Introduce CAP_PERFMON to secure system performance monitoring and observability


On 07.04.2020 19:40, Arnaldo Carvalho de Melo wrote:
> Em Tue, Apr 07, 2020 at 01:36:54PM -0300, Arnaldo Carvalho de Melo escreveu:
>> Em Tue, Apr 07, 2020 at 05:54:27PM +0300, Alexey Budankov escreveu:
>>> On 07.04.2020 17:35, Arnaldo Carvalho de Melo wrote:
>>>> Em Tue, Apr 07, 2020 at 11:30:14AM -0300, Arnaldo Carvalho de Melo escreveu:
>>>>> [perf@five ~]$ type perf
>>>>> perf is hashed (/home/perf/bin/perf)
>>>>> [perf@five ~]$ getcap /home/perf/bin/perf
>>>>> /home/perf/bin/perf = cap_sys_ptrace,cap_syslog,38+ep
>>>>> [perf@five ~]$ groups
>>>>> perf perf_users
>>>>> [perf@five ~]$ id
>>>>> uid=1002(perf) gid=1002(perf) groups=1002(perf),1003(perf_users) context=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023
>>>>> [perf@five ~]$ perf top --stdio
>>>>> Error:
>>>>> Failed to mmap with 1 (Operation not permitted)
>>>>> [perf@five ~]$ perf record -a
>>>>> ^C[ perf record: Woken up 1 times to write data ]
>>>>> [ perf record: Captured and wrote 1.177 MB perf.data (1552 samples) ]
>>>>>
>>>>> [perf@five ~]$ perf evlist
>>>>> cycles:u
>>>>> [perf@five ~]$
>>>>
>>>> Humm, perf record falls back to cycles:u after initially trying cycles
>>>> (i.e. kernel and userspace), lemme see trying 'perf top -e cycles:u',
>>>> lemme test, humm not really:
>>>>
>>>> [perf@five ~]$ perf top --stdio -e cycles:u
>>>> Error:
>>>> Failed to mmap with 1 (Operation not permitted)
>>>> [perf@five ~]$ perf record -e cycles:u -a sleep 1
>>>> [ perf record: Woken up 1 times to write data ]
>>>> [ perf record: Captured and wrote 1.123 MB perf.data (132 samples) ]
>>>> [perf@five ~]$
>>>>
>>>> Back to debugging this.
>>>
>>> It could make sense to add cap_ipc_lock to the binary to isolate the problem from this check:
>>>
>>> kernel/events/core.c: 6101
>>> if ((locked > lock_limit) && perf_is_paranoid() &&
>>> !capable(CAP_IPC_LOCK)) {
>>> ret = -EPERM;
>>> goto unlock;
>>> }
>>
>>
>> That did the trick, I'll update the documentation and include in my
>> "Committer testing" section:
>
> I amended this to that patch, please check the wording:
>
> - Arnaldo
>
> diff --git a/Documentation/admin-guide/perf-security.rst b/Documentation/admin-guide/perf-security.rst
> index c0ca0c1a6804..ed33682e26b0 100644
> --- a/Documentation/admin-guide/perf-security.rst
> +++ b/Documentation/admin-guide/perf-security.rst
> @@ -127,12 +127,19 @@ taken to create such groups of privileged Perf users.
>
> ::
>
> - # setcap "cap_perfmon,cap_sys_ptrace,cap_syslog=ep" perf
> - # setcap -v "cap_perfmon,cap_sys_ptrace,cap_syslog=ep" perf
> + # setcap "cap_perfmon,cap_ipc_lock,cap_sys_ptrace,cap_syslog=ep" perf
> + # setcap -v "cap_perfmon,cap_ipc_lock,cap_sys_ptrace,cap_syslog=ep" perf
> perf: OK
> # getcap perf
> perf = cap_sys_ptrace,cap_syslog,cap_perfmon+ep
>
> +If the libcap installed doesn't yet support "cap_perfmon", use "38" instead,
> +i.e.:
> +
> +::
> +
> + # setcap "38,cap_ipc_lock,cap_sys_ptrace,cap_syslog=ep" perf
> +
> As a result, members of perf_users group are capable of conducting
> performance monitoring and observability by using functionality of the
> configured Perf tool executable that, when executes, passes perf_events
>

Looks good to me. The paragraph just above should then also be extended to
mention that perf_events subsystem memory limit is ignored due to usage of
CAP_IPC_LOCK:

"As a result, members of perf_users group are capable of conducting
performance monitoring and observability by using functionality of the
configured Perf tool executable that, when executes, passes perf_events
subsystem scope and perf_event_mlock_kb locking limit checks."

~Alexey
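
For readers following the cap_ipc_lock discussion, here is a minimal
user-space sketch of the check quoted from kernel/events/core.c. It is not
kernel code: struct mmap_request, its fields and check_mmap_lock_limit() are
hypothetical names made up for illustration, and perf_is_paranoid() is
modelled as "perf_event_paranoid > -1", an assumption based on the
"-1: ... Ignore mlock limit" line in the perf error text.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of the inputs to the quoted check; not a kernel type. */
struct mmap_request {
	unsigned long locked;      /* pages that would be locked after this mmap */
	unsigned long lock_limit;  /* pages the process is allowed to lock */
	int paranoid;              /* value of kernel.perf_event_paranoid */
	bool has_cap_ipc_lock;     /* stands in for capable(CAP_IPC_LOCK) */
};

/* Returns 0 if the ring buffer mmap may proceed, -EPERM otherwise. */
int check_mmap_lock_limit(const struct mmap_request *req)
{
	assert(req != NULL);

	/* Assumed model of perf_is_paranoid(): any level above -1. */
	bool is_paranoid = req->paranoid > -1;

	if (req->locked > req->lock_limit && is_paranoid &&
	    !req->has_cap_ipc_lock)
		return -EPERM;

	return 0;
}
```

In this model a request over the limit is refused unless cap_ipc_lock is
present or perf_event_paranoid is -1, while a smaller buffer (as with
'perf top -m N') stays under the limit and needs no extra capability.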

2020-04-07 17:25:08

by Arnaldo Carvalho de Melo

Subject: Re: [PATCH v8 00/12] Introduce CAP_PERFMON to secure system performance monitoring and observability

Em Tue, Apr 07, 2020 at 01:56:43PM -0300, Arnaldo Carvalho de Melo escreveu:
>
> But then, even with that attr.exclude_kernel set to 1 we _still_ get
> kernel samples, which looks like another bug, now trying with strace,
> which leads us to another rabbit hole:
>
> [perf@five ~]$ strace -e perf_event_open -o /tmp/out.put perf top --stdio
> Error:
> You may not have permission to collect system-wide stats.
>
> Consider tweaking /proc/sys/kernel/perf_event_paranoid,
> which controls use of the performance events system by
> unprivileged users (without CAP_PERFMON or CAP_SYS_ADMIN).
>
> The current value is 2:
>
> -1: Allow use of (almost) all events by all users
> Ignore mlock limit after perf_event_mlock_kb without CAP_IPC_LOCK
> >= 0: Disallow ftrace function tracepoint by users without CAP_PERFMON or CAP_SYS_ADMIN
> Disallow raw tracepoint access by users without CAP_SYS_PERFMON or CAP_SYS_ADMIN
> >= 1: Disallow CPU event access by users without CAP_PERFMON or CAP_SYS_ADMIN
> >= 2: Disallow kernel profiling by users without CAP_PERFMON or CAP_SYS_ADMIN
>
> To make this setting permanent, edit /etc/sysctl.conf too, e.g.:
>
> kernel.perf_event_paranoid = -1
>
> [perf@five ~]$
>
> If I remove that strace -e ... from the front, 'perf top' is back
> working as a non-cap_sys_admin user, just with cap_perfmon.
>

So far I couldn't figure out why exclude_kernel is being set to 1, since
perf-top, when no event is passed, goes through this to find out what to
use as a default event:

perf_evlist__add_default(top.evlist)
perf_evsel__new_cycles(true);
struct perf_event_attr attr = {
.type = PERF_TYPE_HARDWARE,
.config = PERF_COUNT_HW_CPU_CYCLES,
.exclude_kernel = !perf_event_can_profile_kernel(),
};

perf_event_paranoid_check(1);
return perf_cap__capable(CAP_SYS_ADMIN) ||
perf_cap__capable(CAP_PERFMON) ||
perf_event_paranoid() <= max_level;


And then that second condition should hold, so it returns true, and
.exclude_kernel should be set to !true, i.e. zero.
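
The call chain above can be sketched as a small stand-alone model. This is
only an illustration under assumptions: struct perf_privs and both function
names are hypothetical, and the three fields stand in for what
perf_cap__capable() and perf_event_paranoid() would report in the tool.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical snapshot of what the tool sees; not a perf data structure. */
struct perf_privs {
	bool cap_sys_admin;  /* stands in for perf_cap__capable(CAP_SYS_ADMIN) */
	bool cap_perfmon;    /* stands in for perf_cap__capable(CAP_PERFMON) */
	int paranoid;        /* stands in for perf_event_paranoid() */
};

/* Same three-way OR as the perf_event_paranoid_check() body quoted above. */
bool paranoid_check(const struct perf_privs *p, int max_level)
{
	assert(p != NULL);
	return p->cap_sys_admin || p->cap_perfmon || p->paranoid <= max_level;
}

/* Value the default cycles event would get for attr.exclude_kernel. */
unsigned int default_exclude_kernel(const struct perf_privs *p)
{
	return !paranoid_check(p, 1);
}
```

In this model, exclude_kernel only ends up as 1 at paranoid level 2 when
neither capability is reported to the tool, which is the case to look for
when the binary really does carry cap_perfmon.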

Now the wallclock says I need to stop being a programmer and turn into a
daycare provider for Pedro, cya!

- Arnaldo

2020-04-07 17:33:44

by Alexey Budankov

Subject: Re: [PATCH v8 00/12] Introduce CAP_PERFMON to secure system performance monitoring and observability


On 07.04.2020 20:02, Arnaldo Carvalho de Melo wrote:
> Em Tue, Apr 07, 2020 at 07:52:56PM +0300, Alexey Budankov escreveu:
>>
>> On 07.04.2020 19:36, Arnaldo Carvalho de Melo wrote:
>>> Em Tue, Apr 07, 2020 at 05:54:27PM +0300, Alexey Budankov escreveu:
>>>> It could make sense to add cap_ipc_lock to the binary to isolate the problem from this check:
>
>>>> kernel/events/core.c: 6101
>>>> if ((locked > lock_limit) && perf_is_paranoid() &&
>>>> !capable(CAP_IPC_LOCK)) {
>>>> ret = -EPERM;
>>>> goto unlock;
>>>> }
>
>>> That did the trick, I'll update the documentation and include in my
>>> "Committer testing" section:
>
>> Looks like top mode somehow reaches perf mmap limit described here [1].
>> Using -m option solves the issue avoiding cap_ipc_lock on my 8 cores machine:
>> perf top -e cycles -m 1
>
> So this would read better?
>
> diff --git a/Documentation/admin-guide/perf-security.rst b/Documentation/admin-guide/perf-security.rst
> index ed33682e26b0..d44dd24b0244 100644
> --- a/Documentation/admin-guide/perf-security.rst
> +++ b/Documentation/admin-guide/perf-security.rst
> @@ -127,8 +127,8 @@ taken to create such groups of privileged Perf users.
>
> ::
>
> - # setcap "cap_perfmon,cap_ipc_lock,cap_sys_ptrace,cap_syslog=ep" perf
> - # setcap -v "cap_perfmon,cap_ipc_lock,cap_sys_ptrace,cap_syslog=ep" perf
> + # setcap "cap_perfmon,cap_sys_ptrace,cap_syslog=ep" perf
> + # setcap -v "cap_perfmon,cap_sys_ptrace,cap_syslog=ep" perf
> perf: OK
> # getcap perf
> perf = cap_sys_ptrace,cap_syslog,cap_perfmon+ep
> @@ -140,6 +140,10 @@ i.e.:
>
> # setcap "38,cap_ipc_lock,cap_sys_ptrace,cap_syslog=ep" perf
>
> +Note that you may need to have 'cap_ipc_lock' in the mix for tools such as
> +'perf top', alternatively use 'perf top -m N', to reduce the memory that
> +it uses for the perf ring buffer, see the memory allocation section below.
> +

Let's stay with the first variant of your addition to this patch and also
extend the paragraph below as suggested in the other mail in the thread.

> As a result, members of perf_users group are capable of conducting
> performance monitoring and observability by using functionality of the
> configured Perf tool executable that, when executes, passes perf_events
>

Thanks,
Alexey

Subject: [tip: perf/core] doc/admin-guide: Update perf-security.rst with CAP_PERFMON information

The following commit has been merged into the perf/core branch of tip:

Commit-ID: 902a8dcc5ba6c5dc3332e8806b01be2f0f7ef2e4
Gitweb: https://git.kernel.org/tip/902a8dcc5ba6c5dc3332e8806b01be2f0f7ef2e4
Author: Alexey Budankov <[email protected]>
AuthorDate: Thu, 02 Apr 2020 11:54:01 +03:00
Committer: Arnaldo Carvalho de Melo <[email protected]>
CommitterDate: Thu, 16 Apr 2020 12:19:10 -03:00

doc/admin-guide: Update perf-security.rst with CAP_PERFMON information

Update perf-security.rst documentation file with the information
related to usage of CAP_PERFMON capability to secure performance
monitoring and observability operations in system.

Committer notes:

While testing 'perf top' under cap_perfmon I noticed that it needs
some more capability and Alexey pointed out cap_ipc_lock, as needed by
this kernel chunk:

kernel/events/core.c: 6101
if ((locked > lock_limit) && perf_is_paranoid() &&
!capable(CAP_IPC_LOCK)) {
ret = -EPERM;
goto unlock;
}

So I added it to the documentation, and also mentioned that if the
libcap version doesn't yet support 'cap_perfmon', its numeric value can
be used instead, i.e. if:

# setcap "cap_perfmon,cap_ipc_lock,cap_sys_ptrace,cap_syslog=ep" perf

Fails, try:

# setcap "38,cap_ipc_lock,cap_sys_ptrace,cap_syslog=ep" perf

I also added a paragraph stating that using an unpatched libcap will
fail the check for CAP_PERFMON, as it checks the cap number against a
maximum to see if it is valid. That makes perf use 'cycles:u' as the
default event, even though a cap_perfmon capable perf binary can get
kernel samples. To work around that, just use, e.g.:

# perf top -e cycles
# perf record -e cycles

And it will sample kernel and user modes.
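
The "checks the cap number against a maximum" part can be illustrated with
a tiny sketch. It assumes the library's validity test has the same shape as
the kernel's cap_valid() macro; cap_valid_for() and the DEMO_* constants are
hypothetical names, and libcap's real internals may differ.

```c
#include <assert.h>
#include <stdbool.h>

/* Capability numbers as in include/uapi/linux/capability.h after this series. */
enum {
	DEMO_CAP_AUDIT_READ = 37,  /* last capability known to an older library */
	DEMO_CAP_PERFMON    = 38,  /* new capability added by this series */
};

static_assert(DEMO_CAP_PERFMON == DEMO_CAP_AUDIT_READ + 1,
	      "CAP_PERFMON directly follows CAP_AUDIT_READ");

/* Same shape as cap_valid(), with the maximum passed in explicitly. */
bool cap_valid_for(int cap, int last_cap)
{
	return cap >= 0 && cap <= last_cap;
}
```

With a maximum of 37 the number 38 is rejected, so the capability query
fails and the tool behaves as if cap_perfmon were absent; with a maximum of
38 the same query passes the validity test.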

Signed-off-by: Alexey Budankov <[email protected]>
Tested-by: Arnaldo Carvalho de Melo <[email protected]>
Cc: Alexei Starovoitov <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Igor Lubashev <[email protected]>
Cc: James Morris <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Serge Hallyn <[email protected]>
Cc: Song Liu <[email protected]>
Cc: Stephane Eranian <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Link: http://lore.kernel.org/lkml/[email protected]
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
---
Documentation/admin-guide/perf-security.rst | 86 ++++++++++++++------
1 file changed, 61 insertions(+), 25 deletions(-)

diff --git a/Documentation/admin-guide/perf-security.rst b/Documentation/admin-guide/perf-security.rst
index 72effa7..1307b52 100644
--- a/Documentation/admin-guide/perf-security.rst
+++ b/Documentation/admin-guide/perf-security.rst
@@ -1,6 +1,6 @@
.. _perf_security:

-Perf Events and tool security
+Perf events and tool security
=============================

Overview
@@ -42,11 +42,11 @@ categories:
Data that belong to the fourth category can potentially contain
sensitive process data. If PMUs in some monitoring modes capture values
of execution context registers or data from process memory then access
-to such monitoring capabilities requires to be ordered and secured
-properly. So, perf_events/Perf performance monitoring is the subject for
-security access control management [5]_ .
+to such monitoring modes requires to be ordered and secured properly.
+So, perf_events performance monitoring and observability operations are
+the subject for security access control management [5]_ .

-perf_events/Perf access control
+perf_events access control
-------------------------------

To perform security checks, the Linux implementation splits processes
@@ -66,11 +66,25 @@ into distinct units, known as capabilities [6]_ , which can be
independently enabled and disabled on per-thread basis for processes and
files of unprivileged users.

-Unprivileged processes with enabled CAP_SYS_ADMIN capability are treated
+Unprivileged processes with enabled CAP_PERFMON capability are treated
as privileged processes with respect to perf_events performance
-monitoring and bypass *scope* permissions checks in the kernel.
-
-Unprivileged processes using perf_events system call API is also subject
+monitoring and observability operations, thus, bypass *scope* permissions
+checks in the kernel. CAP_PERFMON implements the principle of least
+privilege [13]_ (POSIX 1003.1e: 2.2.2.39) for performance monitoring and
+observability operations in the kernel and provides a secure approach to
+performance monitoring and observability in the system.
+
+For backward compatibility reasons the access to perf_events monitoring and
+observability operations is also open for CAP_SYS_ADMIN privileged
+processes but CAP_SYS_ADMIN usage for secure monitoring and observability
+use cases is discouraged with respect to the CAP_PERFMON capability.
+If system audit records [14]_ for a process using perf_events system call
+API contain denial records of acquiring both CAP_PERFMON and CAP_SYS_ADMIN
+capabilities then providing the process with CAP_PERFMON capability singly
+is recommended as the preferred secure approach to resolve double access
+denial logging related to usage of performance monitoring and observability.
+
+Unprivileged processes using perf_events system call are also subject
for PTRACE_MODE_READ_REALCREDS ptrace access mode check [7]_ , whose
outcome determines whether monitoring is permitted. So unprivileged
processes provided with CAP_SYS_PTRACE capability are effectively
@@ -82,14 +96,14 @@ performance analysis of monitored processes or a system. For example,
CAP_SYSLOG capability permits reading kernel space memory addresses from
/proc/kallsyms file.

-perf_events/Perf privileged users
+Privileged Perf users groups
---------------------------------

Mechanisms of capabilities, privileged capability-dumb files [6]_ and
-file system ACLs [10]_ can be used to create a dedicated group of
-perf_events/Perf privileged users who are permitted to execute
-performance monitoring without scope limits. The following steps can be
-taken to create such a group of privileged Perf users.
+file system ACLs [10]_ can be used to create dedicated groups of
+privileged Perf users who are permitted to execute performance monitoring
+and observability without scope limits. The following steps can be
+taken to create such groups of privileged Perf users.

1. Create perf_users group of privileged Perf users, assign perf_users
group to Perf tool executable and limit access to the executable for
@@ -108,30 +122,51 @@ taken to create such a group of privileged Perf users.
-rwxr-x--- 2 root perf_users 11M Oct 19 15:12 perf

2. Assign the required capabilities to the Perf tool executable file and
- enable members of perf_users group with performance monitoring
+ enable members of perf_users group with monitoring and observability
privileges [6]_ :

::

- # setcap "cap_sys_admin,cap_sys_ptrace,cap_syslog=ep" perf
- # setcap -v "cap_sys_admin,cap_sys_ptrace,cap_syslog=ep" perf
+ # setcap "cap_perfmon,cap_sys_ptrace,cap_syslog=ep" perf
+ # setcap -v "cap_perfmon,cap_sys_ptrace,cap_syslog=ep" perf
perf: OK
# getcap perf
- perf = cap_sys_ptrace,cap_sys_admin,cap_syslog+ep
+ perf = cap_sys_ptrace,cap_syslog,cap_perfmon+ep
+
+If the libcap installed doesn't yet support "cap_perfmon", use "38" instead,
+i.e.:
+
+::
+
+ # setcap "38,cap_ipc_lock,cap_sys_ptrace,cap_syslog=ep" perf
+
+Note that you may need to have 'cap_ipc_lock' in the mix for tools such as
+'perf top', alternatively use 'perf top -m N', to reduce the memory that
+it uses for the perf ring buffer, see the memory allocation section below.
+
+Using a libcap without support for CAP_PERFMON will make cap_get_flag(caps, 38,
+CAP_EFFECTIVE, &val) fail, which will lead the default event to be 'cycles:u',
+so as a workaround explicitly ask for the 'cycles' event, i.e.:
+
+::
+
+ # perf top -e cycles
+
+To get kernel and user samples with a perf binary with just CAP_PERFMON.

As a result, members of perf_users group are capable of conducting
-performance monitoring by using functionality of the configured Perf
-tool executable that, when executes, passes perf_events subsystem scope
-checks.
+performance monitoring and observability by using functionality of the
+configured Perf tool executable that, when executes, passes perf_events
+subsystem scope checks.

This specific access control management is only available to superuser
or root running processes with CAP_SETPCAP, CAP_SETFCAP [6]_
capabilities.

-perf_events/Perf unprivileged users
+Unprivileged users
-----------------------------------

-perf_events/Perf *scope* and *access* control for unprivileged processes
+perf_events *scope* and *access* control for unprivileged processes
is governed by perf_event_paranoid [2]_ setting:

-1:
@@ -166,7 +201,7 @@ is governed by perf_event_paranoid [2]_ setting:
perf_event_mlock_kb locking limit is imposed but ignored for
unprivileged processes with CAP_IPC_LOCK capability.

-perf_events/Perf resource control
+Resource control
---------------------------------

Open file descriptors
@@ -227,4 +262,5 @@ Bibliography
.. [10] `<http://man7.org/linux/man-pages/man5/acl.5.html>`_
.. [11] `<http://man7.org/linux/man-pages/man2/getrlimit.2.html>`_
.. [12] `<http://man7.org/linux/man-pages/man5/limits.conf.5.html>`_
-
+.. [13] `<https://sites.google.com/site/fullycapable>`_
+.. [14] `<http://man7.org/linux/man-pages/man8/auditd.8.html>`_

Subject: [tip: perf/core] capabilities: Introduce CAP_PERFMON to kernel and user space

The following commit has been merged into the perf/core branch of tip:

Commit-ID: 980737282232b752bb14dab96d77665c15889c36
Gitweb: https://git.kernel.org/tip/980737282232b752bb14dab96d77665c15889c36
Author: Alexey Budankov <[email protected]>
AuthorDate: Thu, 02 Apr 2020 11:45:31 +03:00
Committer: Arnaldo Carvalho de Melo <[email protected]>
CommitterDate: Thu, 16 Apr 2020 12:19:06 -03:00

capabilities: Introduce CAP_PERFMON to kernel and user space

Introduce the CAP_PERFMON capability designed to secure system
performance monitoring and observability operations so that CAP_PERFMON
can assist CAP_SYS_ADMIN capability in its governing role for
performance monitoring and observability subsystems.

CAP_PERFMON hardens system security and integrity during performance
monitoring and observability operations by decreasing attack surface that
is available to a CAP_SYS_ADMIN privileged process [2]. Providing the access
to system performance monitoring and observability operations under CAP_PERFMON
capability singly, without the rest of CAP_SYS_ADMIN credentials, excludes
chances to misuse the credentials and makes the operation more secure.

Thus, CAP_PERFMON implements the principle of least privilege for
performance monitoring and observability operations (POSIX IEEE 1003.1e:
2.2.2.39 principle of least privilege: A security design principle that
states that a process or program be granted only those privileges
(e.g., capabilities) necessary to accomplish its legitimate function,
and only for the time that such privileges are actually required).

CAP_PERFMON meets the demand to secure system performance monitoring and
observability operations for adoption in security sensitive, restricted,
multiuser production environments (e.g. HPC clusters, cloud and virtual compute
environments), where root or CAP_SYS_ADMIN credentials are not available to
mass users of a system, and securely unblocks applicability and scalability
of system performance monitoring and observability operations beyond root
and CAP_SYS_ADMIN use cases.

CAP_PERFMON takes over CAP_SYS_ADMIN credentials related to system performance
monitoring and observability operations and balances amount of CAP_SYS_ADMIN
credentials following the recommendations in the capabilities man page [1]
for CAP_SYS_ADMIN: "Note: this capability is overloaded; see Notes to kernel
developers, below." For backward compatibility reasons access to system
performance monitoring and observability subsystems of the kernel remains
open for CAP_SYS_ADMIN privileged processes but CAP_SYS_ADMIN capability
usage for secure system performance monitoring and observability operations
is discouraged with respect to the designed CAP_PERFMON capability.

Although the software running under CAP_PERFMON can not ensure avoidance
of related hardware issues, the software can still mitigate these issues
following the official hardware issues mitigation procedure [2]. The bugs
in the software itself can be fixed following the standard kernel development
process [3] to maintain and harden security of system performance monitoring
and observability operations.

[1] http://man7.org/linux/man-pages/man7/capabilities.7.html
[2] https://www.kernel.org/doc/html/latest/process/embargoed-hardware-issues.html
[3] https://www.kernel.org/doc/html/latest/admin-guide/security-bugs.html

Signed-off-by: Alexey Budankov <[email protected]>
Acked-by: James Morris <[email protected]>
Acked-by: Serge E. Hallyn <[email protected]>
Acked-by: Song Liu <[email protected]>
Acked-by: Stephen Smalley <[email protected]>
Tested-by: Arnaldo Carvalho de Melo <[email protected]>
Cc: Alexei Starovoitov <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Igor Lubashev <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Stephane Eranian <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Link: http://lore.kernel.org/lkml/[email protected]
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
---
include/linux/capability.h | 4 ++++
include/uapi/linux/capability.h | 8 +++++++-
security/selinux/include/classmap.h | 4 ++--
3 files changed, 13 insertions(+), 3 deletions(-)

diff --git a/include/linux/capability.h b/include/linux/capability.h
index ecce0f4..027d7e4 100644
--- a/include/linux/capability.h
+++ b/include/linux/capability.h
@@ -251,6 +251,10 @@ extern bool privileged_wrt_inode_uidgid(struct user_namespace *ns, const struct
extern bool capable_wrt_inode_uidgid(const struct inode *inode, int cap);
extern bool file_ns_capable(const struct file *file, struct user_namespace *ns, int cap);
extern bool ptracer_capable(struct task_struct *tsk, struct user_namespace *ns);
+static inline bool perfmon_capable(void)
+{
+ return capable(CAP_PERFMON) || capable(CAP_SYS_ADMIN);
+}

/* audit system wants to get cap info from files as well */
extern int get_vfs_caps_from_disk(const struct dentry *dentry, struct cpu_vfs_cap_data *cpu_caps);
diff --git a/include/uapi/linux/capability.h b/include/uapi/linux/capability.h
index 272dc69..e58c963 100644
--- a/include/uapi/linux/capability.h
+++ b/include/uapi/linux/capability.h
@@ -367,8 +367,14 @@ struct vfs_ns_cap_data {

#define CAP_AUDIT_READ 37

+/*
+ * Allow system performance and observability privileged operations
+ * using perf_events, i915_perf and other kernel subsystems
+ */
+
+#define CAP_PERFMON 38

-#define CAP_LAST_CAP CAP_AUDIT_READ
+#define CAP_LAST_CAP CAP_PERFMON

#define cap_valid(x) ((x) >= 0 && (x) <= CAP_LAST_CAP)

diff --git a/security/selinux/include/classmap.h b/security/selinux/include/classmap.h
index 986f3ac..d233ab3 100644
--- a/security/selinux/include/classmap.h
+++ b/security/selinux/include/classmap.h
@@ -27,9 +27,9 @@
"audit_control", "setfcap"

#define COMMON_CAP2_PERMS "mac_override", "mac_admin", "syslog", \
- "wake_alarm", "block_suspend", "audit_read"
+ "wake_alarm", "block_suspend", "audit_read", "perfmon"

-#if CAP_LAST_CAP > CAP_AUDIT_READ
+#if CAP_LAST_CAP > CAP_PERFMON
#error New capability defined, please update COMMON_CAP2_PERMS.
#endif

Subject: [tip: perf/core] trace/bpf_trace: Open access for CAP_PERFMON privileged process

The following commit has been merged into the perf/core branch of tip:

Commit-ID: 031258da05956646c5606023ab0abe10a7e68ea1
Gitweb: https://git.kernel.org/tip/031258da05956646c5606023ab0abe10a7e68ea1
Author: Alexey Budankov <[email protected]>
AuthorDate: Thu, 02 Apr 2020 11:48:54 +03:00
Committer: Arnaldo Carvalho de Melo <[email protected]>
CommitterDate: Thu, 16 Apr 2020 12:19:08 -03:00

trace/bpf_trace: Open access for CAP_PERFMON privileged process

Open access to bpf_trace monitoring for CAP_PERFMON privileged process.
Providing the access under CAP_PERFMON capability singly, without the
rest of CAP_SYS_ADMIN credentials, excludes chances to misuse the
credentials and makes operation more secure.

CAP_PERFMON implements the principle of least privilege for performance
monitoring and observability operations (POSIX IEEE 1003.1e 2.2.2.39
principle of least privilege: A security design principle that states
that a process or program be granted only those privileges (e.g.,
capabilities) necessary to accomplish its legitimate function, and only
for the time that such privileges are actually required).

For backward compatibility reasons access to bpf_trace monitoring
remains open for CAP_SYS_ADMIN privileged processes but CAP_SYS_ADMIN
usage for secure bpf_trace monitoring is discouraged with respect to
CAP_PERFMON capability.

Signed-off-by: Alexey Budankov <[email protected]>
Reviewed-by: James Morris <[email protected]>
Acked-by: Song Liu <[email protected]>
Cc: Alexei Starovoitov <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Igor Lubashev <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Serge Hallyn <[email protected]>
Cc: Stephane Eranian <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Link: http://lore.kernel.org/lkml/[email protected]
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
---
kernel/trace/bpf_trace.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index ca17967..d7d8800 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -1468,7 +1468,7 @@ int perf_event_query_prog_array(struct perf_event *event, void __user *info)
u32 *ids, prog_cnt, ids_len;
int ret;

- if (!capable(CAP_SYS_ADMIN))
+ if (!perfmon_capable())
return -EPERM;
if (event->attr.type != PERF_TYPE_TRACEPOINT)
return -EINVAL;

Subject: [tip: perf/core] perf/core: open access to probes for CAP_PERFMON privileged process

The following commit has been merged into the perf/core branch of tip:

Commit-ID: c9e0924e5c2b59365f9c0d43ff8722e79ecf4088
Gitweb: https://git.kernel.org/tip/c9e0924e5c2b59365f9c0d43ff8722e79ecf4088
Author: Alexey Budankov <[email protected]>
AuthorDate: Thu, 02 Apr 2020 11:47:01 +03:00
Committer: Arnaldo Carvalho de Melo <[email protected]>
CommitterDate: Thu, 16 Apr 2020 12:19:08 -03:00

perf/core: open access to probes for CAP_PERFMON privileged process

Open access to monitoring via kprobes and uprobes and eBPF tracing for
CAP_PERFMON privileged process. Providing the access under CAP_PERFMON
capability singly, without the rest of CAP_SYS_ADMIN credentials,
excludes chances to misuse the credentials and makes operation more
secure.

perf kprobes and uprobes are used by ftrace and eBPF. perf probe uses
ftrace to define new kprobe events, and those events are treated as
tracepoint events. eBPF defines new probes via perf_event_open interface
and then the probes are used in eBPF tracing.

CAP_PERFMON implements the principle of least privilege for performance
monitoring and observability operations (POSIX IEEE 1003.1e 2.2.2.39
principle of least privilege: A security design principle that states
that a process or program be granted only those privileges (e.g.,
capabilities) necessary to accomplish its legitimate function, and only
for the time that such privileges are actually required).

For backward compatibility reasons access to perf_events subsystem
remains open for CAP_SYS_ADMIN privileged processes but CAP_SYS_ADMIN
usage for secure perf_events monitoring is discouraged with respect to
CAP_PERFMON capability.

Signed-off-by: Alexey Budankov <[email protected]>
Reviewed-by: James Morris <[email protected]>
Tested-by: Arnaldo Carvalho de Melo <[email protected]>
Cc: Alexei Starovoitov <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Igor Lubashev <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Serge Hallyn <[email protected]>
Cc: Song Liu <[email protected]>
Cc: Stephane Eranian <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Link: http://lore.kernel.org/lkml/[email protected]
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
---
kernel/events/core.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 74025b7..52951e9 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -9397,7 +9397,7 @@ static int perf_kprobe_event_init(struct perf_event *event)
if (event->attr.type != perf_kprobe.type)
return -ENOENT;

- if (!capable(CAP_SYS_ADMIN))
+ if (!perfmon_capable())
return -EACCES;

/*
@@ -9457,7 +9457,7 @@ static int perf_uprobe_event_init(struct perf_event *event)
if (event->attr.type != perf_uprobe.type)
return -ENOENT;

- if (!capable(CAP_SYS_ADMIN))
+ if (!perfmon_capable())
return -EACCES;

/*

Subject: [tip: perf/core] perf/core: Open access to the core for CAP_PERFMON privileged process

The following commit has been merged into the perf/core branch of tip:

Commit-ID: 18aa18566218d4a46d940049b835314d2b071cc2
Gitweb: https://git.kernel.org/tip/18aa18566218d4a46d940049b835314d2b071cc2
Author: Alexey Budankov <[email protected]>
AuthorDate: Thu, 02 Apr 2020 11:46:24 +03:00
Committer: Arnaldo Carvalho de Melo <[email protected]>
CommitterDate: Thu, 16 Apr 2020 12:19:08 -03:00

perf/core: Open access to the core for CAP_PERFMON privileged process

Open access to monitoring of kernel code, CPUs, tracepoints and
namespaces data for a CAP_PERFMON privileged process. Providing the
access under CAP_PERFMON capability singly, without the rest of
CAP_SYS_ADMIN credentials, excludes chances to misuse the credentials
and makes operation more secure.

CAP_PERFMON implements the principle of least privilege for performance
monitoring and observability operations (POSIX IEEE 1003.1e 2.2.2.39
principle of least privilege: A security design principle that states
that a process or program be granted only those privileges (e.g.,
capabilities) necessary to accomplish its legitimate function, and only
for the time that such privileges are actually required)

For backward compatibility reasons the access to perf_events subsystem
remains open for CAP_SYS_ADMIN privileged processes but CAP_SYS_ADMIN
usage for secure perf_events monitoring is discouraged with respect to
CAP_PERFMON capability.

Signed-off-by: Alexey Budankov <[email protected]>
Reviewed-by: James Morris <[email protected]>
Tested-by: Arnaldo Carvalho de Melo <[email protected]>
Cc: Alexei Starovoitov <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Igor Lubashev <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: [email protected]
Cc: Namhyung Kim <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Serge Hallyn <[email protected]>
Cc: Song Liu <[email protected]>
Cc: Stephane Eranian <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Link: http://lore.kernel.org/lkml/[email protected]
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
---
include/linux/perf_event.h | 6 +++---
kernel/events/core.c | 2 +-
2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 9c3e761..87e2168 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -1305,7 +1305,7 @@ static inline int perf_is_paranoid(void)

static inline int perf_allow_kernel(struct perf_event_attr *attr)
{
- if (sysctl_perf_event_paranoid > 1 && !capable(CAP_SYS_ADMIN))
+ if (sysctl_perf_event_paranoid > 1 && !perfmon_capable())
return -EACCES;

return security_perf_event_open(attr, PERF_SECURITY_KERNEL);
@@ -1313,7 +1313,7 @@ static inline int perf_allow_kernel(struct perf_event_attr *attr)

static inline int perf_allow_cpu(struct perf_event_attr *attr)
{
- if (sysctl_perf_event_paranoid > 0 && !capable(CAP_SYS_ADMIN))
+ if (sysctl_perf_event_paranoid > 0 && !perfmon_capable())
return -EACCES;

return security_perf_event_open(attr, PERF_SECURITY_CPU);
@@ -1321,7 +1321,7 @@ static inline int perf_allow_cpu(struct perf_event_attr *attr)

static inline int perf_allow_tracepoint(struct perf_event_attr *attr)
{
- if (sysctl_perf_event_paranoid > -1 && !capable(CAP_SYS_ADMIN))
+ if (sysctl_perf_event_paranoid > -1 && !perfmon_capable())
return -EPERM;

return security_perf_event_open(attr, PERF_SECURITY_TRACEPOINT);
diff --git a/kernel/events/core.c b/kernel/events/core.c
index bc9b98a..74025b7 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -11504,7 +11504,7 @@ SYSCALL_DEFINE5(perf_event_open,
}

if (attr.namespaces) {
- if (!capable(CAP_SYS_ADMIN))
+ if (!perfmon_capable())
return -EACCES;
}
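
The three helpers patched above differ only in the perf_event_paranoid threshold and the error code they return. The following is a minimal userspace sketch of that decision logic, not the kernel implementation. It assumes perfmon_capable() accepts either CAP_PERFMON or CAP_SYS_ADMIN, as the commit messages and documentation hunks in this series describe. The struct task_model type and the model_* function names are hypothetical, introduced only for illustration, and the LSM hook security_perf_event_open() that the real helpers call next is left out.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Capability numbers: CAP_SYS_ADMIN is 21, CAP_PERFMON is 38 in this series. */
#define CAP_SYS_ADMIN 21
#define CAP_PERFMON   38

/* Hypothetical stand-in for the state the kernel helpers consult. */
struct task_model {
	uint64_t effective; /* bit N set => capability N is in the effective set */
	int paranoid;       /* modelled kernel.perf_event_paranoid value */
};

static bool model_capable(const struct task_model *t, int cap)
{
	assert(cap >= 0 && cap < 64);
	return (t->effective >> cap) & 1u;
}

/* Assumption: either capability satisfies the check. */
static bool model_perfmon_capable(const struct task_model *t)
{
	return model_capable(t, CAP_PERFMON) || model_capable(t, CAP_SYS_ADMIN);
}

/* Kernel profiling: restricted when perf_event_paranoid > 1. */
static int model_allow_kernel(const struct task_model *t)
{
	if (t->paranoid > 1 && !model_perfmon_capable(t))
		return -EACCES;
	return 0; /* the real helper goes on to call security_perf_event_open() */
}

/* CPU-wide events: restricted when perf_event_paranoid > 0. */
static int model_allow_cpu(const struct task_model *t)
{
	if (t->paranoid > 0 && !model_perfmon_capable(t))
		return -EACCES;
	return 0;
}

/* Tracepoints: restricted when perf_event_paranoid > -1; note -EPERM here. */
static int model_allow_tracepoint(const struct task_model *t)
{
	if (t->paranoid > -1 && !model_perfmon_capable(t))
		return -EPERM;
	return 0;
}
```

With the default perf_event_paranoid value of 2, a task holding neither capability is refused by all three helpers in this model, while a task holding only CAP_PERFMON passes all three.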

Subject: [tip: perf/core] drm/i915/perf: Open access for CAP_PERFMON privileged process

The following commit has been merged into the perf/core branch of tip:

Commit-ID: 4e3d3456b78fa5a70e65de0d7c5309b814281ae3
Gitweb: https://git.kernel.org/tip/4e3d3456b78fa5a70e65de0d7c5309b814281ae3
Author: Alexey Budankov <[email protected]>
AuthorDate: Thu, 02 Apr 2020 11:48:15 +03:00
Committer: Arnaldo Carvalho de Melo <[email protected]>
CommitterDate: Thu, 16 Apr 2020 12:19:08 -03:00

drm/i915/perf: Open access for CAP_PERFMON privileged process

Open access to i915_perf monitoring for CAP_PERFMON privileged process.
Providing the access under CAP_PERFMON capability singly, without the
rest of CAP_SYS_ADMIN credentials, excludes chances to misuse the
credentials and makes operation more secure.

CAP_PERFMON implements the principle of least privilege for performance
monitoring and observability operations (POSIX IEEE 1003.1e 2.2.2.39
principle of least privilege: A security design principle that states
that a process or program be granted only those privileges (e.g.,
capabilities) necessary to accomplish its legitimate function, and only
for the time that such privileges are actually required)

For backward compatibility reasons access to i915_perf subsystem remains
open for CAP_SYS_ADMIN privileged processes but CAP_SYS_ADMIN usage for
secure i915_perf monitoring is discouraged with respect to CAP_PERFMON
capability.

Signed-off-by: Alexey Budankov <[email protected]>
Reviewed-by: James Morris <[email protected]>
Acked-by: Lionel Landwerlin <[email protected]>
Cc: Alexei Starovoitov <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Igor Lubashev <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Serge Hallyn <[email protected]>
Cc: Song Liu <[email protected]>
Cc: Stephane Eranian <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Link: http://lore.kernel.org/lkml/[email protected]
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
---
drivers/gpu/drm/i915/i915_perf.c | 13 ++++++-------
1 file changed, 6 insertions(+), 7 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_perf.c b/drivers/gpu/drm/i915/i915_perf.c
index 551be58..5fb1749 100644
--- a/drivers/gpu/drm/i915/i915_perf.c
+++ b/drivers/gpu/drm/i915/i915_perf.c
@@ -3433,10 +3433,10 @@ i915_perf_open_ioctl_locked(struct i915_perf *perf,
/* Similar to perf's kernel.perf_paranoid_cpu sysctl option
* we check a dev.i915.perf_stream_paranoid sysctl option
* to determine if it's ok to access system wide OA counters
- * without CAP_SYS_ADMIN privileges.
+ * without CAP_PERFMON or CAP_SYS_ADMIN privileges.
*/
if (privileged_op &&
- i915_perf_stream_paranoid && !capable(CAP_SYS_ADMIN)) {
+ i915_perf_stream_paranoid && !perfmon_capable()) {
DRM_DEBUG("Insufficient privileges to open i915 perf stream\n");
ret = -EACCES;
goto err_ctx;
@@ -3629,9 +3629,8 @@ static int read_properties_unlocked(struct i915_perf *perf,
} else
oa_freq_hz = 0;

- if (oa_freq_hz > i915_oa_max_sample_rate &&
- !capable(CAP_SYS_ADMIN)) {
- DRM_DEBUG("OA exponent would exceed the max sampling frequency (sysctl dev.i915.oa_max_sample_rate) %uHz without root privileges\n",
+ if (oa_freq_hz > i915_oa_max_sample_rate && !perfmon_capable()) {
+ DRM_DEBUG("OA exponent would exceed the max sampling frequency (sysctl dev.i915.oa_max_sample_rate) %uHz without CAP_PERFMON or CAP_SYS_ADMIN privileges\n",
i915_oa_max_sample_rate);
return -EACCES;
}
@@ -4052,7 +4051,7 @@ int i915_perf_add_config_ioctl(struct drm_device *dev, void *data,
return -EINVAL;
}

- if (i915_perf_stream_paranoid && !capable(CAP_SYS_ADMIN)) {
+ if (i915_perf_stream_paranoid && !perfmon_capable()) {
DRM_DEBUG("Insufficient privileges to add i915 OA config\n");
return -EACCES;
}
@@ -4199,7 +4198,7 @@ int i915_perf_remove_config_ioctl(struct drm_device *dev, void *data,
return -ENOTSUPP;
}

- if (i915_perf_stream_paranoid && !capable(CAP_SYS_ADMIN)) {
+ if (i915_perf_stream_paranoid && !perfmon_capable()) {
DRM_DEBUG("Insufficient privileges to remove i915 OA config\n");
return -EACCES;
}

Subject: [tip: perf/core] parisc/perf: open access for CAP_PERFMON privileged process

The following commit has been merged into the perf/core branch of tip:

Commit-ID: cf91baf3f7f39a0cd29072e21ed0e4bb1ab3b382
Gitweb: https://git.kernel.org/tip/cf91baf3f7f39a0cd29072e21ed0e4bb1ab3b382
Author: Alexey Budankov <[email protected]>
AuthorDate: Thu, 02 Apr 2020 11:50:15 +03:00
Committer: Arnaldo Carvalho de Melo <[email protected]>
CommitterDate: Thu, 16 Apr 2020 12:19:08 -03:00

parisc/perf: open access for CAP_PERFMON privileged process

Open access to monitoring for CAP_PERFMON privileged process. Providing
the access under CAP_PERFMON capability singly, without the rest of
CAP_SYS_ADMIN credentials, excludes chances to misuse the credentials
and makes operation more secure.

CAP_PERFMON implements the principle of least privilege for performance
monitoring and observability operations (POSIX IEEE 1003.1e 2.2.2.39
principle of least privilege: A security design principle that states
that a process or program be granted only those privileges (e.g.,
capabilities) necessary to accomplish its legitimate function, and only
for the time that such privileges are actually required)

For backward compatibility reasons access to the monitoring remains open
for CAP_SYS_ADMIN privileged processes but CAP_SYS_ADMIN usage for
secure monitoring is discouraged with respect to CAP_PERFMON capability.

Signed-off-by: Alexey Budankov <[email protected]>
Reviewed-by: James Morris <[email protected]>
Acked-by: Helge Deller <[email protected]>
Cc: Alexei Starovoitov <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Igor Lubashev <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Serge Hallyn <[email protected]>
Cc: Song Liu <[email protected]>
Cc: Stephane Eranian <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Link: http://lore.kernel.org/lkml/[email protected]
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
---
arch/parisc/kernel/perf.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/parisc/kernel/perf.c b/arch/parisc/kernel/perf.c
index e1a8fee..d46b670 100644
--- a/arch/parisc/kernel/perf.c
+++ b/arch/parisc/kernel/perf.c
@@ -300,7 +300,7 @@ static ssize_t perf_write(struct file *file, const char __user *buf,
else
return -EFAULT;

- if (!capable(CAP_SYS_ADMIN))
+ if (!perfmon_capable())
return -EACCES;

if (count != sizeof(uint32_t))

Subject: [tip: perf/core] drivers/perf: Open access for CAP_PERFMON privileged process

The following commit has been merged into the perf/core branch of tip:

Commit-ID: cea7d0d4a59b4efd0e1fe067130b4c06ab4d412f
Gitweb: https://git.kernel.org/tip/cea7d0d4a59b4efd0e1fe067130b4c06ab4d412f
Author: Alexey Budankov <[email protected]>
AuthorDate: Thu, 02 Apr 2020 11:51:21 +03:00
Committer: Arnaldo Carvalho de Melo <[email protected]>
CommitterDate: Thu, 16 Apr 2020 12:19:09 -03:00

drivers/perf: Open access for CAP_PERFMON privileged process

Open access to monitoring for CAP_PERFMON privileged process. Providing
the access under CAP_PERFMON capability singly, without the rest of
CAP_SYS_ADMIN credentials, excludes chances to misuse the credentials
and makes operation more secure.

CAP_PERFMON implements the principle of least privilege for performance
monitoring and observability operations (POSIX IEEE 1003.1e 2.2.2.39
principle of least privilege: A security design principle that states
that a process or program be granted only those privileges (e.g.,
capabilities) necessary to accomplish its legitimate function, and only
for the time that such privileges are actually required)

For backward compatibility reasons access to the monitoring remains open
for CAP_SYS_ADMIN privileged processes but CAP_SYS_ADMIN usage for
secure monitoring is discouraged with respect to CAP_PERFMON capability.

Signed-off-by: Alexey Budankov <[email protected]>
Reviewed-by: James Morris <[email protected]>
Acked-by: Will Deacon <[email protected]>
Cc: Alexei Starovoitov <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Igor Lubashev <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Serge Hallyn <[email protected]>
Cc: Song Liu <[email protected]>
Cc: Stephane Eranian <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Link: http://lore.kernel.org/lkml/[email protected]
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
---
drivers/perf/arm_spe_pmu.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/perf/arm_spe_pmu.c b/drivers/perf/arm_spe_pmu.c
index b72c048..0e0961a 100644
--- a/drivers/perf/arm_spe_pmu.c
+++ b/drivers/perf/arm_spe_pmu.c
@@ -274,7 +274,7 @@ static u64 arm_spe_event_to_pmscr(struct perf_event *event)
if (!attr->exclude_kernel)
reg |= BIT(SYS_PMSCR_EL1_E1SPE_SHIFT);

- if (IS_ENABLED(CONFIG_PID_IN_CONTEXTIDR) && capable(CAP_SYS_ADMIN))
+ if (IS_ENABLED(CONFIG_PID_IN_CONTEXTIDR) && perfmon_capable())
reg |= BIT(SYS_PMSCR_EL1_CX_SHIFT);

return reg;
@@ -700,7 +700,7 @@ static int arm_spe_pmu_event_init(struct perf_event *event)
return -EOPNOTSUPP;

reg = arm_spe_event_to_pmscr(event);
- if (!capable(CAP_SYS_ADMIN) &&
+ if (!perfmon_capable() &&
(reg & (BIT(SYS_PMSCR_EL1_PA_SHIFT) |
BIT(SYS_PMSCR_EL1_CX_SHIFT) |
BIT(SYS_PMSCR_EL1_PCT_SHIFT))))

Subject: [tip: perf/core] drivers/oprofile: Open access for CAP_PERFMON privileged process

The following commit has been merged into the perf/core branch of tip:

Commit-ID: ab76878bb720cbd35a05ae868387f4373a58c949
Gitweb: https://git.kernel.org/tip/ab76878bb720cbd35a05ae868387f4373a58c949
Author: Alexey Budankov <[email protected]>
AuthorDate: Thu, 02 Apr 2020 11:53:07 +03:00
Committer: Arnaldo Carvalho de Melo <[email protected]>
CommitterDate: Thu, 16 Apr 2020 12:19:09 -03:00

drivers/oprofile: Open access for CAP_PERFMON privileged process

Open access to monitoring for CAP_PERFMON privileged process. Providing
the access under CAP_PERFMON capability singly, without the rest of
CAP_SYS_ADMIN credentials, excludes chances to misuse the credentials
and makes operation more secure.

CAP_PERFMON implements the principle of least privilege for performance
monitoring and observability operations (POSIX IEEE 1003.1e 2.2.2.39
principle of least privilege: A security design principle that states
that a process or program be granted only those privileges (e.g.,
capabilities) necessary to accomplish its legitimate function, and only
for the time that such privileges are actually required)

For backward compatibility reasons access to the monitoring remains open
for CAP_SYS_ADMIN privileged processes but CAP_SYS_ADMIN usage for
secure monitoring is discouraged with respect to CAP_PERFMON capability.

Signed-off-by: Alexey Budankov <[email protected]>
Acked-by: James Morris <[email protected]>
Cc: Alexei Starovoitov <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Igor Lubashev <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Serge Hallyn <[email protected]>
Cc: Song Liu <[email protected]>
Cc: Stephane Eranian <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Link: http://lore.kernel.org/lkml/[email protected]
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
---
drivers/oprofile/event_buffer.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/oprofile/event_buffer.c b/drivers/oprofile/event_buffer.c
index 12ea4a4..6c9edc8 100644
--- a/drivers/oprofile/event_buffer.c
+++ b/drivers/oprofile/event_buffer.c
@@ -113,7 +113,7 @@ static int event_buffer_open(struct inode *inode, struct file *file)
{
int err = -EPERM;

- if (!capable(CAP_SYS_ADMIN))
+ if (!perfmon_capable())
return -EPERM;

if (test_and_set_bit_lock(0, &buffer_opened))

Subject: [tip: perf/core] powerpc/perf: open access for CAP_PERFMON privileged process

The following commit has been merged into the perf/core branch of tip:

Commit-ID: ff46758313e688fca7d762b3e6ead32843999511
Gitweb: https://git.kernel.org/tip/ff46758313e688fca7d762b3e6ead32843999511
Author: Alexey Budankov <[email protected]>
AuthorDate: Thu, 02 Apr 2020 11:49:36 +03:00
Committer: Arnaldo Carvalho de Melo <[email protected]>
CommitterDate: Thu, 16 Apr 2020 12:19:08 -03:00

powerpc/perf: open access for CAP_PERFMON privileged process

Open access to monitoring for CAP_PERFMON privileged process. Providing
the access under CAP_PERFMON capability singly, without the rest of
CAP_SYS_ADMIN credentials, excludes chances to misuse the credentials
and makes operation more secure.

CAP_PERFMON implements the principle of least privilege for performance
monitoring and observability operations (POSIX IEEE 1003.1e 2.2.2.39
principle of least privilege: A security design principle that states
that a process or program be granted only those privileges (e.g.,
capabilities) necessary to accomplish its legitimate function, and only
for the time that such privileges are actually required)

For backward compatibility reasons access to the monitoring remains open
for CAP_SYS_ADMIN privileged processes but CAP_SYS_ADMIN usage for
secure monitoring is discouraged with respect to CAP_PERFMON capability.

Signed-off-by: Alexey Budankov <[email protected]>
Reviewed-by: James Morris <[email protected]>
Acked-by: Anju T Sudhakar <[email protected]>
Cc: Alexei Starovoitov <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Igor Lubashev <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Serge Hallyn <[email protected]>
Cc: Song Liu <[email protected]>
Cc: Stephane Eranian <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Link: http://lore.kernel.org/lkml/[email protected]
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
---
arch/powerpc/perf/imc-pmu.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/perf/imc-pmu.c b/arch/powerpc/perf/imc-pmu.c
index eb82dda..0edcfd0 100644
--- a/arch/powerpc/perf/imc-pmu.c
+++ b/arch/powerpc/perf/imc-pmu.c
@@ -976,7 +976,7 @@ static int thread_imc_event_init(struct perf_event *event)
if (event->attr.type != event->pmu->type)
return -ENOENT;

- if (!capable(CAP_SYS_ADMIN))
+ if (!perfmon_capable())
return -EACCES;

/* Sampling not supported */
@@ -1412,7 +1412,7 @@ static int trace_imc_event_init(struct perf_event *event)
if (event->attr.type != event->pmu->type)
return -ENOENT;

- if (!capable(CAP_SYS_ADMIN))
+ if (!perfmon_capable())
return -EACCES;

/* Return if this is a couting event */

Subject: [tip: perf/core] doc/admin-guide: update kernel.rst with CAP_PERFMON information

The following commit has been merged into the perf/core branch of tip:

Commit-ID: 025b16f81dd7f51f29d0109399d669438c63b6ce
Gitweb: https://git.kernel.org/tip/025b16f81dd7f51f29d0109399d669438c63b6ce
Author: Alexey Budankov <[email protected]>
AuthorDate: Thu, 02 Apr 2020 11:54:39 +03:00
Committer: Arnaldo Carvalho de Melo <[email protected]>
CommitterDate: Thu, 16 Apr 2020 12:19:12 -03:00

doc/admin-guide: update kernel.rst with CAP_PERFMON information

Update the kernel.rst documentation file with information about using
the CAP_PERFMON capability to secure performance monitoring and
observability operations in the system.

Signed-off-by: Alexey Budankov <[email protected]>
Cc: Alexei Starovoitov <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Igor Lubashev <[email protected]>
Cc: James Morris <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Serge Hallyn <[email protected]>
Cc: Song Liu <[email protected]>
Cc: Stephane Eranian <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Link: http://lore.kernel.org/lkml/[email protected]
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
---
Documentation/admin-guide/sysctl/kernel.rst | 16 +++++++++++-----
1 file changed, 11 insertions(+), 5 deletions(-)

diff --git a/Documentation/admin-guide/sysctl/kernel.rst b/Documentation/admin-guide/sysctl/kernel.rst
index 39c95c0..7e4c28d 100644
--- a/Documentation/admin-guide/sysctl/kernel.rst
+++ b/Documentation/admin-guide/sysctl/kernel.rst
@@ -730,7 +730,13 @@ perf_event_paranoid
===================

Controls use of the performance events system by unprivileged
-users (without CAP_SYS_ADMIN). The default value is 2.
+users (without CAP_PERFMON). The default value is 2.
+
+For backward compatibility reasons access to system performance
+monitoring and observability remains open for CAP_SYS_ADMIN
+privileged processes but CAP_SYS_ADMIN usage for secure system
+performance monitoring and observability operations is discouraged
+with respect to CAP_PERFMON use cases.

=== ==================================================================
-1 Allow use of (almost) all events by all users.
@@ -739,13 +745,13 @@ users (without CAP_SYS_ADMIN). The default value is 2.
``CAP_IPC_LOCK``.

>=0 Disallow ftrace function tracepoint by users without
- ``CAP_SYS_ADMIN``.
+ ``CAP_PERFMON``.

- Disallow raw tracepoint access by users without ``CAP_SYS_ADMIN``.
+ Disallow raw tracepoint access by users without ``CAP_PERFMON``.

->=1 Disallow CPU event access by users without ``CAP_SYS_ADMIN``.
+>=1 Disallow CPU event access by users without ``CAP_PERFMON``.

->=2 Disallow kernel profiling by users without ``CAP_SYS_ADMIN``.
+>=2 Disallow kernel profiling by users without ``CAP_PERFMON``.
=== ==================================================================


Subject: [tip: perf/core] perf tools: Support CAP_PERFMON capability

The following commit has been merged into the perf/core branch of tip:

Commit-ID: 6b3e0e2e04615df128b2d38fa1dd1fcb84f2504c
Gitweb: https://git.kernel.org/tip/6b3e0e2e04615df128b2d38fa1dd1fcb84f2504c
Author: Alexey Budankov <[email protected]>
AuthorDate: Thu, 02 Apr 2020 11:47:35 +03:00
Committer: Arnaldo Carvalho de Melo <[email protected]>
CommitterDate: Thu, 16 Apr 2020 12:19:08 -03:00

perf tools: Support CAP_PERFMON capability

Extend error messages to mention CAP_PERFMON capability as an option to
substitute CAP_SYS_ADMIN capability for secure system performance
monitoring and observability operations. Make
perf_event_paranoid_check() and __cmd_ftrace() aware of the CAP_PERFMON
capability.

CAP_PERFMON implements the principle of least privilege for performance
monitoring and observability operations (POSIX IEEE 1003.1e 2.2.2.39
principle of least privilege: A security design principle that states
that a process or program be granted only those privileges (e.g.,
capabilities) necessary to accomplish its legitimate function, and only
for the time that such privileges are actually required)

For backward compatibility reasons access to perf_events subsystem remains
open for CAP_SYS_ADMIN privileged processes but CAP_SYS_ADMIN usage for
secure perf_events monitoring is discouraged with respect to CAP_PERFMON
capability.

Committer testing:

Using a libcap with this patch:

diff --git a/libcap/include/uapi/linux/capability.h b/libcap/include/uapi/linux/capability.h
index 78b2fd4c8a95..89b5b0279b60 100644
--- a/libcap/include/uapi/linux/capability.h
+++ b/libcap/include/uapi/linux/capability.h
@@ -366,8 +366,9 @@ struct vfs_ns_cap_data {

#define CAP_AUDIT_READ 37

+#define CAP_PERFMON 38

-#define CAP_LAST_CAP CAP_AUDIT_READ
+#define CAP_LAST_CAP CAP_PERFMON

#define cap_valid(x) ((x) >= 0 && (x) <= CAP_LAST_CAP)

Note that using '38' in place of 'cap_perfmon' works to some degree with
an old libcap; it is only when cap_get_flag() is called that libcap
performs an error check based on the maximum capability value it knows
about, and that check fails.

This makes determining the default of perf_event_attr.exclude_kernel
fail, as perf can't determine whether CAP_PERFMON is in place.

Using 'perf top -e cycles' avoids the default check and sets
perf_event_attr.exclude_kernel to 1.

As root, with a libcap supporting CAP_PERFMON:

# groupadd perf_users
# adduser perf -g perf_users
# mkdir ~perf/bin
# cp ~acme/bin/perf ~perf/bin/
# chgrp perf_users ~perf/bin/perf
# setcap "cap_perfmon,cap_sys_ptrace,cap_syslog=ep" ~perf/bin/perf
# getcap ~perf/bin/perf
/home/perf/bin/perf = cap_sys_ptrace,cap_syslog,cap_perfmon+ep
# ls -la ~perf/bin/perf
-rwxr-xr-x. 1 root perf_users 16968552 Apr 9 13:10 /home/perf/bin/perf

As the 'perf' user in the 'perf_users' group:

$ perf top -a --stdio
Error:
Failed to mmap with 1 (Operation not permitted)
$

Either add the cap_ipc_lock capability to the perf binary or reduce the
ring buffer size to some smaller value:

$ perf top -m10 -a --stdio
rounding mmap pages size to 64K (16 pages)
Error:
Failed to mmap with 1 (Operation not permitted)
$ perf top -m4 -a --stdio
Error:
Failed to mmap with 1 (Operation not permitted)
$ perf top -m2 -a --stdio
PerfTop: 762 irqs/sec kernel:49.7% exact: 100.0% lost: 0/0 drop: 0/0 [4000Hz cycles], (all, 4 CPUs)
------------------------------------------------------------------------------------------------------

9.83% perf [.] __symbols__insert
8.58% perf [.] rb_next
5.91% [kernel] [k] module_get_kallsym
5.66% [kernel] [k] kallsyms_expand_symbol.constprop.0
3.98% libc-2.29.so [.] __GI_____strtoull_l_internal
3.66% perf [.] rb_insert_color
2.34% [kernel] [k] vsnprintf
2.30% [kernel] [k] string_nocheck
2.16% libc-2.29.so [.] _IO_getdelim
2.15% [kernel] [k] number
2.13% [kernel] [k] format_decode
1.58% libc-2.29.so [.] _IO_feof
1.52% libc-2.29.so [.] __strcmp_avx2
1.50% perf [.] rb_set_parent_color
1.47% libc-2.29.so [.] __libc_calloc
1.24% [kernel] [k] do_syscall_64
1.17% [kernel] [k] __x86_indirect_thunk_rax

$ perf record -a sleep 1
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.552 MB perf.data (74 samples) ]
$ perf evlist
cycles
$ perf evlist -v
cycles: size: 120, { sample_period, sample_freq }: 4000, sample_type: IP|TID|TIME|CPU|PERIOD, read_format: ID, disabled: 1, inherit: 1, mmap: 1, comm: 1, freq: 1, task: 1, precise_ip: 3, sample_id_all: 1, exclude_guest: 1, mmap2: 1, comm_exec: 1, ksymbol: 1, bpf_event: 1
$ perf report | head -20
# To display the perf.data header info, please use --header/--header-only options.
#
#
# Total Lost Samples: 0
#
# Samples: 74 of event 'cycles'
# Event count (approx.): 15694834
#
# Overhead Command Shared Object Symbol
# ........ ............... .......................... ......................................
#
19.62% perf [kernel.vmlinux] [k] strnlen_user
13.88% swapper [kernel.vmlinux] [k] intel_idle
13.83% ksoftirqd/0 [kernel.vmlinux] [k] pfifo_fast_dequeue
13.51% swapper [kernel.vmlinux] [k] kmem_cache_free
6.31% gnome-shell [kernel.vmlinux] [k] kmem_cache_free
5.66% kworker/u8:3+ix [kernel.vmlinux] [k] delay_tsc
4.42% perf [kernel.vmlinux] [k] __set_cpus_allowed_ptr
3.45% kworker/2:1-eve [kernel.vmlinux] [k] shmem_truncate_range
2.29% gnome-shell libgobject-2.0.so.0.6000.7 [.] g_closure_ref
$

Signed-off-by: Alexey Budankov <[email protected]>
Reviewed-by: James Morris <[email protected]>
Acked-by: Jiri Olsa <[email protected]>
Acked-by: Namhyung Kim <[email protected]>
Tested-by: Arnaldo Carvalho de Melo <[email protected]>
Cc: Alexei Starovoitov <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Igor Lubashev <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Serge Hallyn <[email protected]>
Cc: Song Liu <[email protected]>
Cc: Stephane Eranian <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Link: http://lore.kernel.org/lkml/[email protected]
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
---
tools/perf/builtin-ftrace.c | 5 +++--
tools/perf/design.txt | 3 ++-
tools/perf/util/cap.h | 4 ++++
tools/perf/util/evsel.c | 10 +++++-----
tools/perf/util/util.c | 1 +
5 files changed, 15 insertions(+), 8 deletions(-)

diff --git a/tools/perf/builtin-ftrace.c b/tools/perf/builtin-ftrace.c
index d5adc41..55eda54 100644
--- a/tools/perf/builtin-ftrace.c
+++ b/tools/perf/builtin-ftrace.c
@@ -284,10 +284,11 @@ static int __cmd_ftrace(struct perf_ftrace *ftrace, int argc, const char **argv)
.events = POLLIN,
};

- if (!perf_cap__capable(CAP_SYS_ADMIN)) {
+ if (!(perf_cap__capable(CAP_PERFMON) ||
+ perf_cap__capable(CAP_SYS_ADMIN))) {
pr_err("ftrace only works for %s!\n",
#ifdef HAVE_LIBCAP_SUPPORT
- "users with the SYS_ADMIN capability"
+ "users with the CAP_PERFMON or CAP_SYS_ADMIN capability"
#else
"root"
#endif
diff --git a/tools/perf/design.txt b/tools/perf/design.txt
index 0453ba2..a42fab3 100644
--- a/tools/perf/design.txt
+++ b/tools/perf/design.txt
@@ -258,7 +258,8 @@ gets schedule to. Per task counters can be created by any user, for
their own tasks.

A 'pid == -1' and 'cpu == x' counter is a per CPU counter that counts
-all events on CPU-x. Per CPU counters need CAP_SYS_ADMIN privilege.
+all events on CPU-x. Per CPU counters need CAP_PERFMON or CAP_SYS_ADMIN
+privilege.

The 'flags' parameter is currently unused and must be zero.

diff --git a/tools/perf/util/cap.h b/tools/perf/util/cap.h
index 051dc59..ae52878 100644
--- a/tools/perf/util/cap.h
+++ b/tools/perf/util/cap.h
@@ -29,4 +29,8 @@ static inline bool perf_cap__capable(int cap __maybe_unused)
#define CAP_SYSLOG 34
#endif

+#ifndef CAP_PERFMON
+#define CAP_PERFMON 38
+#endif
+
#endif /* __PERF_CAP_H */
diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
index eb880ef..d23db67 100644
--- a/tools/perf/util/evsel.c
+++ b/tools/perf/util/evsel.c
@@ -2523,14 +2523,14 @@ int perf_evsel__open_strerror(struct evsel *evsel, struct target *target,
"You may not have permission to collect %sstats.\n\n"
"Consider tweaking /proc/sys/kernel/perf_event_paranoid,\n"
"which controls use of the performance events system by\n"
- "unprivileged users (without CAP_SYS_ADMIN).\n\n"
+ "unprivileged users (without CAP_PERFMON or CAP_SYS_ADMIN).\n\n"
"The current value is %d:\n\n"
" -1: Allow use of (almost) all events by all users\n"
" Ignore mlock limit after perf_event_mlock_kb without CAP_IPC_LOCK\n"
- ">= 0: Disallow ftrace function tracepoint by users without CAP_SYS_ADMIN\n"
- " Disallow raw tracepoint access by users without CAP_SYS_ADMIN\n"
- ">= 1: Disallow CPU event access by users without CAP_SYS_ADMIN\n"
- ">= 2: Disallow kernel profiling by users without CAP_SYS_ADMIN\n\n"
+ ">= 0: Disallow ftrace function tracepoint by users without CAP_PERFMON or CAP_SYS_ADMIN\n"
+ " Disallow raw tracepoint access by users without CAP_SYS_PERFMON or CAP_SYS_ADMIN\n"
+ ">= 1: Disallow CPU event access by users without CAP_PERFMON or CAP_SYS_ADMIN\n"
+ ">= 2: Disallow kernel profiling by users without CAP_PERFMON or CAP_SYS_ADMIN\n\n"
"To make this setting permanent, edit /etc/sysctl.conf too, e.g.:\n\n"
" kernel.perf_event_paranoid = -1\n" ,
target->system_wide ? "system-wide " : "",
diff --git a/tools/perf/util/util.c b/tools/perf/util/util.c
index d707c96..37a9492 100644
--- a/tools/perf/util/util.c
+++ b/tools/perf/util/util.c
@@ -290,6 +290,7 @@ int perf_event_paranoid(void)
bool perf_event_paranoid_check(int max_level)
{
return perf_cap__capable(CAP_SYS_ADMIN) ||
+ perf_cap__capable(CAP_PERFMON) ||
perf_event_paranoid() <= max_level;
}

2020-07-10 13:35:05

by Ravi Bangoria

[permalink] [raw]
Subject: Re: [PATCH v8 00/12] Introduce CAP_PERFMON to secure system performance monitoring and observability

Hi Alexey,

> Currently access to perf_events, i915_perf and other performance
> monitoring and observability subsystems of the kernel is open only for
> a privileged process [1] with CAP_SYS_ADMIN capability enabled in the
> process effective set [2].
>
> This patch set introduces CAP_PERFMON capability designed to secure
> system performance monitoring and observability operations so that
> CAP_PERFMON would assist CAP_SYS_ADMIN capability in its governing role
> for performance monitoring and observability subsystems of the kernel.

I'm seeing an issue with CAP_PERFMON when I try to record data for a
specific target. I don't know whether this is sort of a regression or
an expected behavior.

Without setting CAP_PERFMON:

$ getcap ./perf
$ ./perf stat -a ls
Error:
Access to performance monitoring and observability operations is limited.
$ ./perf stat ls
Performance counter stats for 'ls':

2.06 msec task-clock:u # 0.418 CPUs utilized
0 context-switches:u # 0.000 K/sec
0 cpu-migrations:u # 0.000 K/sec

With CAP_PERFMON:

$ getcap ./perf
./perf = cap_perfmon+ep
$ ./perf stat -a ls
Performance counter stats for 'system wide':

142.42 msec cpu-clock # 25.062 CPUs utilized
182 context-switches # 0.001 M/sec
48 cpu-migrations # 0.337 K/sec
$ ./perf stat ls
Error:
Access to performance monitoring and observability operations is limited.

Am I missing something silly?

Analysis:
---------
A bit more analysis led me to the kernel code below, in fs/exec.c:

begin_new_exec()
{
...
if (bprm->interp_flags & BINPRM_FLAGS_ENFORCE_NONDUMP ||
!(uid_eq(current_euid(), current_uid()) &&
gid_eq(current_egid(), current_gid())))
set_dumpable(current->mm, suid_dumpable);
else
set_dumpable(current->mm, SUID_DUMP_USER);

...
commit_creds(bprm->cred);
}

When I execute './perf stat ls', it takes the else branch and thus sets the
dumpable flag to SUID_DUMP_USER. Then in commit_creds():

int commit_creds(struct cred *new)
{
...
/* dumpability changes */
if (...
!cred_cap_issubset(old, new)) {
if (task->mm)
set_dumpable(task->mm, suid_dumpable);
}

!cred_cap_issubset(old, new) is false for perf without any capability, so
set_dumpable() is not executed. For perf with CAP_PERFMON the condition is
true, so the old value (SUID_DUMP_USER) is overwritten with suid_dumpable in
mm_flags. On Ubuntu, the default suid_dumpable value is SUID_DUMP_ROOT; on
Fedora, it is SUID_DUMP_DISABLE (see /proc/sys/fs/suid_dumpable).
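One way to observe the flag discussed above from userspace is
prctl(PR_GET_DUMPABLE), which returns the calling process's dumpable
attribute. The following is only a minimal sketch; the helper names
current_dumpable() and dumpable_name() are made up for illustration and are
not part of perf or the kernel:

```c
#include <sys/prctl.h>

/* Returns the calling process's dumpable flag (0, 1 or 2), or -1 on error. */
static int current_dumpable(void)
{
	return prctl(PR_GET_DUMPABLE, 0, 0, 0, 0);
}

/* Maps a dumpable value to the kernel's SUID_DUMP_* name (hypothetical helper). */
static const char *dumpable_name(int value)
{
	switch (value) {
	case 0:
		return "SUID_DUMP_DISABLE";
	case 1:
		return "SUID_DUMP_USER";
	case 2:
		return "SUID_DUMP_ROOT";
	default:
		return "unknown";
	}
}
```

Calling dumpable_name(current_dumpable()) early in a binary that carries file
capabilities should show whether exec left the flag at SUID_DUMP_USER or
replaced it with the suid_dumpable sysctl value.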

Now while opening an event:

perf_event_open()
ptrace_may_access()
__ptrace_may_access() {
...
if (mm &&
((get_dumpable(mm) != SUID_DUMP_USER) &&
!ptrace_has_cap(cred, mm->user_ns, mode)))
return -EPERM;
}

This condition is true for perf with CAP_PERFMON, so -EPERM is returned. It is
false for perf without CAP_PERFMON, so the check goes ahead and returns
success. So opening an event fails when perf has CAP_PERFMON and tries to open
a process-specific event as a normal user.
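The chain above can be condensed into a small userspace model. This is only a
sketch of the three quoted snippets, not kernel code: struct exec_model and
both functions are hypothetical names, the BINPRM_FLAGS_ENFORCE_NONDUMP case
is left out, and the other checks in __ptrace_may_access() are ignored.

```c
#include <errno.h>
#include <stdbool.h>

enum { SUID_DUMP_DISABLE = 0, SUID_DUMP_USER = 1, SUID_DUMP_ROOT = 2 };

/* Inputs of the model; every field is an assumption made for illustration. */
struct exec_model {
	bool ids_changed;    /* euid != uid or egid != gid at exec time */
	bool gains_caps;     /* !cred_cap_issubset(old, new), e.g. file caps */
	int suid_dumpable;   /* value of /proc/sys/fs/suid_dumpable */
	bool has_ptrace_cap; /* caller holds CAP_SYS_PTRACE */
};

/* Dumpable flag after begin_new_exec() followed by commit_creds(). */
static int model_dumpable(const struct exec_model *m)
{
	int dumpable = m->ids_changed ? m->suid_dumpable : SUID_DUMP_USER;

	/* commit_creds() overwrites the flag when the new creds gain caps. */
	if (m->gains_caps)
		dumpable = m->suid_dumpable;
	return dumpable;
}

/* The dumpable part of __ptrace_may_access(): 0 on success, -EPERM otherwise. */
static int model_ptrace_may_access(const struct exec_model *m)
{
	if (model_dumpable(m) != SUID_DUMP_USER && !m->has_ptrace_cap)
		return -EPERM;
	return 0;
}
```

In this model, perf with cap_perfmon+ep started from a plain shell
corresponds to gains_caps = true; with suid_dumpable = 2 and no
CAP_SYS_PTRACE the result is -EPERM, which matches the failure reported
above.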

Workarounds:
------------
Based on the above analysis, I found a couple of workarounds (examples are on
Ubuntu 18.04.4 powerpc):

Workaround1:
Setting SUID_DUMP_USER as default (in /proc/sys/fs/suid_dumpable) solves the
issue.

# echo 1 > /proc/sys/fs/suid_dumpable
$ getcap ./perf
./perf = cap_perfmon+ep
$ ./perf stat ls
Performance counter stats for 'ls':

1.47 msec task-clock # 0.806 CPUs utilized
0 context-switches # 0.000 K/sec
0 cpu-migrations # 0.000 K/sec

Workaround2:
Using CAP_SYS_PTRACE along with CAP_PERFMON solves the issue.

$ cat /proc/sys/fs/suid_dumpable
2
# setcap "cap_perfmon,cap_sys_ptrace=ep" ./perf
$ ./perf stat ls
Performance counter stats for 'ls':

1.41 msec task-clock # 0.826 CPUs utilized
0 context-switches # 0.000 K/sec
0 cpu-migrations # 0.000 K/sec

Workaround3:
Adding CAP_PERFMON to parent of perf (/bin/bash) also solves the issue.

$ cat /proc/sys/fs/suid_dumpable
2
# setcap "cap_perfmon=ep" /bin/bash
# setcap "cap_perfmon=ep" ./perf
$ bash
$ ./perf stat ls
Performance counter stats for 'ls':

1.47 msec task-clock # 0.806 CPUs utilized
0 context-switches # 0.000 K/sec
0 cpu-migrations # 0.000 K/sec

- Ravi

2020-07-10 14:31:39

by Alexey Budankov

[permalink] [raw]
Subject: Re: [PATCH v8 00/12] Introduce CAP_PERFMON to secure system performance monitoring and observability


Hi Ravi,

On 10.07.2020 16:31, Ravi Bangoria wrote:
> Hi Alexey,
>
>> Currently access to perf_events, i915_perf and other performance
>> monitoring and observability subsystems of the kernel is open only for
>> a privileged process [1] with CAP_SYS_ADMIN capability enabled in the
>> process effective set [2].
>>
>> This patch set introduces CAP_PERFMON capability designed to secure
>> system performance monitoring and observability operations so that
>> CAP_PERFMON would assist CAP_SYS_ADMIN capability in its governing role
>> for performance monitoring and observability subsystems of the kernel.
>
> I'm seeing an issue with CAP_PERFMON when I try to record data for a
> specific target. I don't know whether this is sort of a regression or
> an expected behavior.

Thanks for reporting and root-causing this case. The behavior looks more or
less expected, since currently CAP_PERFMON takes over only the related part
of CAP_SYS_ADMIN credentials. Actually, the Perf security docs [1] say
that access control is also subject to CAP_SYS_PTRACE credentials.

CAP_PERFMON could be used to extend and substitute ptrace_may_access()
check in perf_events subsystem to simplify user experience at least in
this specific case.

Alexei

[1] https://www.kernel.org/doc/html/latest/admin-guide/perf-security.html


2020-07-10 17:12:16

by Arnaldo Carvalho de Melo

[permalink] [raw]
Subject: Re: [PATCH v8 00/12] Introduce CAP_PERFMON to secure system performance monitoring and observability

Em Fri, Jul 10, 2020 at 05:30:50PM +0300, Alexey Budankov escreveu:
> On 10.07.2020 16:31, Ravi Bangoria wrote:
> >> Currently access to perf_events, i915_perf and other performance
> >> monitoring and observability subsystems of the kernel is open only for
> >> a privileged process [1] with CAP_SYS_ADMIN capability enabled in the
> >> process effective set [2].

> >> This patch set introduces CAP_PERFMON capability designed to secure
> >> system performance monitoring and observability operations so that
> >> CAP_PERFMON would assist CAP_SYS_ADMIN capability in its governing role
> >> for performance monitoring and observability subsystems of the kernel.

> > I'm seeing an issue with CAP_PERFMON when I try to record data for a
> > specific target. I don't know whether this is sort of a regression or
> > an expected behavior.

> Thanks for reporting and root causing this case. The behavior looks like
> kind of expected since currently CAP_PERFMON takes over the related part
> of CAP_SYS_ADMIN credentials only. Actually Perf security docs [1] say
> that access control is also subject to CAP_SYS_PTRACE credentials.

I think that stating that in the error message would be helpful, after
all, who reads docs? 8-)

I.e., this:

$ ./perf stat ls
  Error:
  Access to performance monitoring and observability operations is limited.
$

Could become:

$ ./perf stat ls
  Error:
  Access to performance monitoring and observability operations is limited.
Right now only CAP_PERFMON is granted, you may need CAP_SYS_PTRACE.
$

- Arnaldo

> CAP_PERFMON could be used to extend and substitute ptrace_may_access()
> check in perf_events subsystem to simplify user experience at least in
> this specific case.
>
> Alexei
>
> [1] https://www.kernel.org/doc/html/latest/admin-guide/perf-security.html

--

- Arnaldo

2020-07-13 09:49:19

by Alexey Budankov

[permalink] [raw]
Subject: Re: [PATCH v8 00/12] Introduce CAP_PERFMON to secure system performance monitoring and observability


On 10.07.2020 20:09, Arnaldo Carvalho de Melo wrote:
> Em Fri, Jul 10, 2020 at 05:30:50PM +0300, Alexey Budankov escreveu:
>> On 10.07.2020 16:31, Ravi Bangoria wrote:
>>>> Currently access to perf_events, i915_perf and other performance
>>>> monitoring and observability subsystems of the kernel is open only for
>>>> a privileged process [1] with CAP_SYS_ADMIN capability enabled in the
>>>> process effective set [2].
>
>>>> This patch set introduces CAP_PERFMON capability designed to secure
>>>> system performance monitoring and observability operations so that
>>>> CAP_PERFMON would assist CAP_SYS_ADMIN capability in its governing role
>>>> for performance monitoring and observability subsystems of the kernel.
>
>>> I'm seeing an issue with CAP_PERFMON when I try to record data for a
>>> specific target. I don't know whether this is sort of a regression or
>>> an expected behavior.
>
>> Thanks for reporting and root causing this case. The behavior looks like
>> kind of expected since currently CAP_PERFMON takes over the related part
>> of CAP_SYS_ADMIN credentials only. Actually Perf security docs [1] say
>> that access control is also subject to CAP_SYS_PTRACE credentials.
>
> I think that stating that in the error message would be helpful, after
> all, who reads docs? 8-)

At least those who write it :D ...

>
> I.e., this:
>
> $ ./perf stat ls
>   Error:
>   Access to performance monitoring and observability operations is limited.
> $
>
> Could become:
>
> $ ./perf stat ls
>   Error:
>   Access to performance monitoring and observability operations is limited.
> Right now only CAP_PERFMON is granted, you may need CAP_SYS_PTRACE.
> $

It would be better to provide a reference to the perf security docs in the
tool output. It looks like extending the ptrace_may_access() check for
perf_events with CAP_PERFMON makes monitoring simpler and even more secure to
use, since the Perf tool does not need to start/stop/single-step and
read/write registers and memory and so on, like a debugger or strace-like
tool. What do you think?

Alexei

>
> - Arnaldo
>
>> CAP_PERFMON could be used to extend and substitute ptrace_may_access()
>> check in perf_events subsystem to simplify user experience at least in
>> this specific case.
>>
>> Alexei
>>
>> [1] https://www.kernel.org/doc/html/latest/admin-guide/perf-security.html

2020-07-13 12:20:21

by Arnaldo Carvalho de Melo

[permalink] [raw]
Subject: Re: [PATCH v8 00/12] Introduce CAP_PERFMON to secure system performance monitoring and observability

Em Mon, Jul 13, 2020 at 12:48:25PM +0300, Alexey Budankov escreveu:
>
> On 10.07.2020 20:09, Arnaldo Carvalho de Melo wrote:
> > Em Fri, Jul 10, 2020 at 05:30:50PM +0300, Alexey Budankov escreveu:
> >> On 10.07.2020 16:31, Ravi Bangoria wrote:
> >>>> Currently access to perf_events, i915_perf and other performance
> >>>> monitoring and observability subsystems of the kernel is open only for
> >>>> a privileged process [1] with CAP_SYS_ADMIN capability enabled in the
> >>>> process effective set [2].

> >>>> This patch set introduces CAP_PERFMON capability designed to secure
> >>>> system performance monitoring and observability operations so that
> >>>> CAP_PERFMON would assist CAP_SYS_ADMIN capability in its governing role
> >>>> for performance monitoring and observability subsystems of the kernel.

> >>> I'm seeing an issue with CAP_PERFMON when I try to record data for a
> >>> specific target. I don't know whether this is sort of a regression or
> >>> an expected behavior.

> >> Thanks for reporting and root causing this case. The behavior looks like
> >> kind of expected since currently CAP_PERFMON takes over the related part
> >> of CAP_SYS_ADMIN credentials only. Actually Perf security docs [1] say
> >> that access control is also subject to CAP_SYS_PTRACE credentials.

> > I think that stating that in the error message would be helpful, after
> > all, who reads docs? 8-)

> At least those who write it :D ...

Everybody should read it, sure :-)

> > I.e., this:
> >
> > $ ./perf stat ls
> >   Error:
> >   Access to performance monitoring and observability operations is limited.
> > $
> >
> > Could become:
> >
> > $ ./perf stat ls
> >   Error:
> >   Access to performance monitoring and observability operations is limited.
> > Right now only CAP_PERFMON is granted, you may need CAP_SYS_PTRACE.
> > $
>
> It would better provide reference to perf security docs in the tool output.

So add a 3rd line:

$ ./perf stat ls
  Error:
  Access to performance monitoring and observability operations is limited.
Right now only CAP_PERFMON is granted, you may need CAP_SYS_PTRACE.
Please read the 'Perf events and tool security' document:
https://www.kernel.org/doc/html/latest/admin-guide/perf-security.html

> Looks like extending ptrace_may_access() check for perf_events with CAP_PERFMON

You mean the following?

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 856d98c36f56..a2397f724c10 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -11595,7 +11595,7 @@ SYSCALL_DEFINE5(perf_event_open,
* perf_event_exit_task() that could imply).
*/
err = -EACCES;
- if (!ptrace_may_access(task, PTRACE_MODE_READ_REALCREDS))
+ if (!perfmon_capable() && !ptrace_may_access(task, PTRACE_MODE_READ_REALCREDS))
goto err_cred;
}
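To spell out what the one-line change would mean, here is a tiny userspace
truth-table sketch. model_open_task_event() is a made-up name, and its two
boolean arguments stand in for the results of perfmon_capable() and
ptrace_may_access(); it is not the kernel code itself.

```c
#include <errno.h>
#include <stdbool.h>

/*
 * Models the proposed check in perf_event_open() for a per-task event:
 * access is refused only when the caller is neither perfmon-capable nor
 * allowed by the ptrace access check.
 */
static int model_open_task_event(bool perfmon_capable, bool ptrace_ok)
{
	if (!perfmon_capable && !ptrace_ok)
		return -EACCES;
	return 0;
}
```

With this shape, the case Ravi hit (CAP_PERFMON granted, ptrace check failing
because of the dumpable flag) would map to
model_open_task_event(true, false), which succeeds, while an unprivileged
caller still depends on the ptrace check alone.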

> makes monitoring simpler and even more secure to use since Perf tool need
> not to start/stop/single-step and read/write registers and memory and so on
> like a debugger or strace-like tool. What do you think?

I tend to agree, Peter?

> Alexei
>
> >
> > - Arnaldo
> >
> >> CAP_PERFMON could be used to extend and substitute ptrace_may_access()
> >> check in perf_events subsystem to simplify user experience at least in
> >> this specific case.
> >>
> >> Alexei
> >>
> >> [1] https://www.kernel.org/doc/html/latest/admin-guide/perf-security.html
> >>
> >>>
> >>> Without setting CAP_PERFMON:
> >>>
> >>> ? $ getcap ./perf
> >>> ? $ ./perf stat -a ls
> >>> ??? Error:
> >>> ??? Access to performance monitoring and observability operations is limited.
> >>> ? $ ./perf stat ls
> >>> ??? Performance counter stats for 'ls':
> >>> ?? ???????????????? 2.06 msec task-clock:u????????????? #??? 0.418 CPUs utilized
> >>> ??????????????????? 0????? context-switches:u??????? #??? 0.000 K/sec
> >>> ??????????????????? 0????? cpu-migrations:u????????? #??? 0.000 K/sec
> >>>
> >>> With CAP_PERFMON:
> >>>
> >>> ? $ getcap ./perf
> >>> ??? ./perf = cap_perfmon+ep
> >>> ? $ ./perf stat -a ls
> >>> ??? Performance counter stats for 'system wide':
> >>> ?? ?????????????? 142.42 msec cpu-clock???????????????? #?? 25.062 CPUs utilized
> >>> ????????????????? 182????? context-switches????????? #??? 0.001 M/sec
> >>> ?????????????????? 48????? cpu-migrations??????????? #??? 0.337 K/sec
> >>> ? $ ./perf stat ls
> >>> ??? Error:
> >>> ??? Access to performance monitoring and observability operations is limited.
> >>>
> >>> Am I missing something silly?
> >>>
> >>> Analysis:
> >>> ---------
> >>> A bit more analysis lead me to below kernel code fs/exec.c:
> >>>
> >>> ? begin_new_exec()
> >>> ? {
> >>> ??????? ...
> >>> ??????? if (bprm->interp_flags & BINPRM_FLAGS_ENFORCE_NONDUMP ||
> >>> ??????????? !(uid_eq(current_euid(), current_uid()) &&
> >>> ????????????? gid_eq(current_egid(), current_gid())))
> >>> ??????????????? set_dumpable(current->mm, suid_dumpable);
> >>> ??????? else
> >>> ??????????????? set_dumpable(current->mm, SUID_DUMP_USER);
> >>>
> >>> ??????? ...
> >>> ??????? commit_creds(bprm->cred);
> >>> ? }
> >>>
> >>> When I execute './perf stat ls', it's going into else condition and thus sets
> >>> dumpable flag as SUID_DUMP_USER. Then in commit_creds():
> >>>
> >>> ? int commit_creds(struct cred *new)
> >>> ? {
> >>> ??????? ...
> >>> ??????? /* dumpability changes */
> >>> ??????? if (...
> >>> ??????????? !cred_cap_issubset(old, new)) {
> >>> ??????????????? if (task->mm)
> >>> ??????????????????????? set_dumpable(task->mm, suid_dumpable);
> >>> ? }
> >>>
> >>> !cred_cap_issubset(old, new) fails for perf without any capability and thus
> >>> it doesn't execute set_dumpable(). Whereas that condition passes for perf
> >>> with CAP_PERFMON and thus it overwrites old value (SUID_DUMP_USER) with
> >>> suid_dumpable in mm_flags. On an Ubuntu, suid_dumpable default value is
> >>> SUID_DUMP_ROOT. On Fedora, it's SUID_DUMP_DISABLE. (/proc/sys/fs/suid_dumpable).
> >>>
> >>> Now while opening an event:
> >>>
> >>> ? perf_event_open()
> >>> ??? ptrace_may_access()
> >>> ????? __ptrace_may_access() {
> >>> ??????????????? ...
> >>> ??????????????? if (mm &&
> >>> ??????????????????? ((get_dumpable(mm) != SUID_DUMP_USER) &&
> >>> ???????????????????? !ptrace_has_cap(cred, mm->user_ns, mode)))
> >>> ??????????????????? return -EPERM;
> >>> ????? }
> >>>
> >>> This if condition passes for perf with CAP_PERFMON and thus it returns -EPERM.
> >>> But it fails for perf without CAP_PERFMON and thus it goes ahead and returns
> >>> success. So opening an event fails when perf has CAP_PREFMON and tries to open
> >>> process specific event as normal user.
> >>>
> >>> Workarounds:
> >>> ------------
> >>> Based on above analysis, I found couple of workarounds (examples are on
> >>> Ubuntu 18.04.4 powerpc):
> >>>
> >>> Workaround1:
> >>> Setting SUID_DUMP_USER as default (in /proc/sys/fs/suid_dumpable) solves the
> >>> issue.
> >>>
> >>>   # echo 1 > /proc/sys/fs/suid_dumpable
> >>>   $ getcap ./perf
> >>>     ./perf = cap_perfmon+ep
> >>>   $ ./perf stat ls
> >>>     Performance counter stats for 'ls':
> >>>                     1.47 msec task-clock                #    0.806 CPUs utilized
> >>>                     0      context-switches          #    0.000 K/sec
> >>>                     0      cpu-migrations            #    0.000 K/sec
> >>>
> >>> Workaround2:
> >>> Using CAP_SYS_PTRACE along with CAP_PERFMON solves the issue.
> >>>
> >>>   $ cat /proc/sys/fs/suid_dumpable
> >>>     2
> >>>   # setcap "cap_perfmon,cap_sys_ptrace=ep" ./perf
> >>>   $ ./perf stat ls
> >>>     Performance counter stats for 'ls':
> >>>                     1.41 msec task-clock                #    0.826 CPUs utilized
> >>>                     0      context-switches          #    0.000 K/sec
> >>>                     0      cpu-migrations            #    0.000 K/sec
> >>>
> >>> Workaround3:
> >>> Adding CAP_PERFMON to parent of perf (/bin/bash) also solves the issue.
> >>>
> >>>   $ cat /proc/sys/fs/suid_dumpable
> >>>     2
> >>>   # setcap "cap_perfmon=ep" /bin/bash
> >>>   # setcap "cap_perfmon=ep" ./perf
> >>>   $ bash
> >>>   $ ./perf stat ls
> >>>     Performance counter stats for 'ls':
> >>>                     1.47 msec task-clock                #    0.806 CPUs utilized
> >>>                     0      context-switches          #    0.000 K/sec
> >>>                     0      cpu-migrations            #    0.000 K/sec
> >>>
> >>> - Ravi
> >
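
The sequence Ravi describes above can be modeled in user space. The sketch
below is only an illustration of the two conditions quoted from
commit_creds() and __ptrace_may_access(); the function names and the
boolean parameters are hypothetical stand-ins for kernel state, not kernel
APIs, and the model assumes the old dumpable value is SUID_DUMP_USER as in
the report.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Values of /proc/sys/fs/suid_dumpable as named in the analysis above. */
enum { SUID_DUMP_DISABLE = 0, SUID_DUMP_USER = 1, SUID_DUMP_ROOT = 2 };

/*
 * Model of the dumpability change in commit_creds(): when the new
 * capabilities are not a subset of the old ones, the dumpable flag is
 * overwritten with the suid_dumpable sysctl value. Otherwise the old
 * value (assumed to be SUID_DUMP_USER) is kept.
 */
static int dumpable_after_commit_creds(bool caps_subset, int suid_dumpable)
{
	assert(suid_dumpable >= SUID_DUMP_DISABLE &&
	       suid_dumpable <= SUID_DUMP_ROOT);
	return caps_subset ? SUID_DUMP_USER : suid_dumpable;
}

/*
 * Model of the dumpable test in __ptrace_may_access(): access is denied
 * when the target is not SUID_DUMP_USER and the caller lacks the ptrace
 * capability.
 */
static int ptrace_dumpable_check(int dumpable, bool has_ptrace_cap)
{
	if (dumpable != SUID_DUMP_USER && !has_ptrace_cap)
		return -EPERM;
	return 0;
}
```

In this model, perf with only CAP_PERFMON on a system with suid_dumpable = 2
ends up with a non-user dumpable value and gets -EPERM, while each of the
three workarounds flips one of the two inputs so that the check returns 0.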

--

- Arnaldo

2020-07-13 12:38:59

by Alexey Budankov

Subject: Re: [PATCH v8 00/12] Introduce CAP_PERFMON to secure system performance monitoring and observability


On 13.07.2020 15:17, Arnaldo Carvalho de Melo wrote:
> Em Mon, Jul 13, 2020 at 12:48:25PM +0300, Alexey Budankov escreveu:
>>
>> On 10.07.2020 20:09, Arnaldo Carvalho de Melo wrote:
>>> Em Fri, Jul 10, 2020 at 05:30:50PM +0300, Alexey Budankov escreveu:
>>>> On 10.07.2020 16:31, Ravi Bangoria wrote:
>>>>>> Currently access to perf_events, i915_perf and other performance
>>>>>> monitoring and observability subsystems of the kernel is open only for
>>>>>> a privileged process [1] with CAP_SYS_ADMIN capability enabled in the
>>>>>> process effective set [2].
>
>>>>>> This patch set introduces CAP_PERFMON capability designed to secure
>>>>>> system performance monitoring and observability operations so that
>>>>>> CAP_PERFMON would assist CAP_SYS_ADMIN capability in its governing role
>>>>>> for performance monitoring and observability subsystems of the kernel.
>
>>>>> I'm seeing an issue with CAP_PERFMON when I try to record data for a
>>>>> specific target. I don't know whether this is sort of a regression or
>>>>> an expected behavior.
>
>>>> Thanks for reporting and root causing this case. The behavior looks like
>>>> kind of expected since currently CAP_PERFMON takes over the related part
>>>> of CAP_SYS_ADMIN credentials only. Actually Perf security docs [1] say
>>>> that access control is also subject to CAP_SYS_PTRACE credentials.
>
>>> I think that stating that in the error message would be helpful, after
>>> all, who reads docs? 8-)
>
>> At least those who write it :D ...
>
> Everybody should read it, sure :-)
>
>>> I.e., this:
>>>
>>> $ ./perf stat ls
>>>   Error:
>>>   Access to performance monitoring and observability operations is limited.
>>> $
>>>
>>> Could become:
>>>
>>> $ ./perf stat ls
>>>   Error:
>>>   Access to performance monitoring and observability operations is limited.
>>> Right now only CAP_PERFMON is granted, you may need CAP_SYS_PTRACE.
>>> $
>>
>> It would better provide reference to perf security docs in the tool output.
>
> So add a 3rd line:
>
> $ ./perf stat ls
>   Error:
>   Access to performance monitoring and observability operations is limited.
> Right now only CAP_PERFMON is granted, you may need CAP_SYS_PTRACE.
> Please read the 'Perf events and tool security' document:
> https://www.kernel.org/doc/html/latest/admin-guide/perf-security.html
With the patch below applied, the message change would not be required.
However, these two sentences at the end of the whole message would still be useful:
"Please read the 'Perf events and tool security' document:
https://www.kernel.org/doc/html/latest/admin-guide/perf-security.html"

>
>> Looks like extending ptrace_may_access() check for perf_events with CAP_PERFMON
>
> You mean the following?

Exactly that.

>
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index 856d98c36f56..a2397f724c10 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -11595,7 +11595,7 @@ SYSCALL_DEFINE5(perf_event_open,
>  		 * perf_event_exit_task() that could imply).
>  		 */
>  		err = -EACCES;
> -		if (!ptrace_may_access(task, PTRACE_MODE_READ_REALCREDS))
> +		if (!perfmon_capable() && !ptrace_may_access(task, PTRACE_MODE_READ_REALCREDS))
>  			goto err_cred;
>  	}
>
>> makes monitoring simpler and even more secure to use since Perf tool need
>> not to start/stop/single-step and read/write registers and memory and so on
>> like a debugger or strace-like tool. What do you think?
>
> I tend to agree, Peter?
>
>> Alexei
>>
>>>
>>> - Arnaldo

Alexei

2020-07-13 18:52:36

by Arnaldo Carvalho de Melo

Subject: Re: [PATCH v8 00/12] Introduce CAP_PERFMON to secure system performance monitoring and observability

Em Mon, Jul 13, 2020 at 03:37:51PM +0300, Alexey Budankov escreveu:
>
> On 13.07.2020 15:17, Arnaldo Carvalho de Melo wrote:
> > Em Mon, Jul 13, 2020 at 12:48:25PM +0300, Alexey Budankov escreveu:
> >>
> >> On 10.07.2020 20:09, Arnaldo Carvalho de Melo wrote:
> >>> Em Fri, Jul 10, 2020 at 05:30:50PM +0300, Alexey Budankov escreveu:
> >>>> On 10.07.2020 16:31, Ravi Bangoria wrote:
> >>>>>> Currently access to perf_events, i915_perf and other performance
> >>>>>> monitoring and observability subsystems of the kernel is open only for
> >>>>>> a privileged process [1] with CAP_SYS_ADMIN capability enabled in the
> >>>>>> process effective set [2].
> >
> >>>>>> This patch set introduces CAP_PERFMON capability designed to secure
> >>>>>> system performance monitoring and observability operations so that
> >>>>>> CAP_PERFMON would assist CAP_SYS_ADMIN capability in its governing role
> >>>>>> for performance monitoring and observability subsystems of the kernel.
> >
> >>>>> I'm seeing an issue with CAP_PERFMON when I try to record data for a
> >>>>> specific target. I don't know whether this is sort of a regression or
> >>>>> an expected behavior.
> >
> >>>> Thanks for reporting and root causing this case. The behavior looks like
> >>>> kind of expected since currently CAP_PERFMON takes over the related part
> >>>> of CAP_SYS_ADMIN credentials only. Actually Perf security docs [1] say
> >>>> that access control is also subject to CAP_SYS_PTRACE credentials.
> >
> >>> I think that stating that in the error message would be helpful, after
> >>> all, who reads docs? 8-)
> >
> >> At least those who write it :D ...
> >
> > Everybody should read it, sure :-)
> >
> >>> I.e., this:
> >>>
> >>> $ ./perf stat ls
> > >>>   Error:
> > >>>   Access to performance monitoring and observability operations is limited.
> >>> $
> >>>
> >>> Could become:
> >>>
> >>> $ ./perf stat ls
> > >>>   Error:
> > >>>   Access to performance monitoring and observability operations is limited.
> >>> Right now only CAP_PERFMON is granted, you may need CAP_SYS_PTRACE.
> >>> $
> >>
> >> It would better provide reference to perf security docs in the tool output.
> >
> > So add a 3rd line:
> >
> > $ ./perf stat ls
> > >   Error:
> > >   Access to performance monitoring and observability operations is limited.
> > Right now only CAP_PERFMON is granted, you may need CAP_SYS_PTRACE.
> > Please read the 'Perf events and tool security' document:
> > https://www.kernel.org/doc/html/latest/admin-guide/perf-security.html

> With the patch below applied, the message change would not be required.

Sure, but the tool should continue to work and provide useful messages
when running on kernels without that change. Pointing to the document is
valid and should be done, that is an agreed point. But the tool can do
some checks, narrow down the possible causes for the error message and
provide something that in most cases will make the user make progress.
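
As a rough illustration of the kind of check meant here, the tool could look
at its own effective capabilities (the CapEff line of /proc/self/status) and
pick a hint from that. This is only a sketch: parse_cap_eff() and
eacces_hint() are hypothetical helper names, not existing perf functions,
and the wording of the last hint is made up for the example. The capability
numbers are the ones from include/uapi/linux/capability.h.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Capability numbers from include/uapi/linux/capability.h */
#define CAP_SYS_PTRACE 19
#define CAP_SYS_ADMIN  21
#define CAP_PERFMON    38

/* Test one bit of a 64-bit capability mask. */
static int has_cap(uint64_t mask, int cap)
{
	assert(cap >= 0 && cap < 64);
	return (int)((mask >> cap) & 1);
}

/*
 * Extract the effective capability mask from the text of
 * /proc/<pid>/status. Returns 0 on success, -1 if no CapEff line is found.
 */
static int parse_cap_eff(const char *status, uint64_t *mask)
{
	const char *p = strstr(status, "CapEff:");
	unsigned long long v;

	if (!p || sscanf(p + strlen("CapEff:"), "%llx", &v) != 1)
		return -1;
	*mask = v;
	return 0;
}

/*
 * Pick an extra hint for a failed perf_event_open() on a task.
 * Returns NULL when capabilities are unlikely to be the cause.
 */
static const char *eacces_hint(uint64_t cap_eff)
{
	if (has_cap(cap_eff, CAP_SYS_ADMIN) || has_cap(cap_eff, CAP_SYS_PTRACE))
		return NULL;
	if (has_cap(cap_eff, CAP_PERFMON))
		return "Right now only CAP_PERFMON is granted, you may need CAP_SYS_PTRACE.";
	/* Hypothetical wording, not taken from the tool. */
	return "No CAP_PERFMON, CAP_SYS_PTRACE or CAP_SYS_ADMIN is granted.";
}
```

The caller would read /proc/self/status into a buffer, pass it to
parse_cap_eff(), and append the returned hint (if any) to the generic
"Access ... is limited" message together with the document link.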

> However, these two sentences at the end of the whole message would still be useful:
> "Please read the 'Perf events and tool security' document:
> https://www.kernel.org/doc/html/latest/admin-guide/perf-security.html"

We're in violent agreement here. :-)

> >
> >> Looks like extending ptrace_may_access() check for perf_events with CAP_PERFMON
> >
> > You mean the following?
>
> Exactly that.

Sure, let's then wait for others to chime in and then you can go ahead
and submit that patch.

Peter?

- Arnaldo

> >
> > diff --git a/kernel/events/core.c b/kernel/events/core.c
> > index 856d98c36f56..a2397f724c10 100644
> > --- a/kernel/events/core.c
> > +++ b/kernel/events/core.c
> > @@ -11595,7 +11595,7 @@ SYSCALL_DEFINE5(perf_event_open,
> >  		 * perf_event_exit_task() that could imply).
> >  		 */
> >  		err = -EACCES;
> > -		if (!ptrace_may_access(task, PTRACE_MODE_READ_REALCREDS))
> > +		if (!perfmon_capable() && !ptrace_may_access(task, PTRACE_MODE_READ_REALCREDS))
> >  			goto err_cred;
> >  	}
> >
> >> makes monitoring simpler and even more secure to use since Perf tool need
> >> not to start/stop/single-step and read/write registers and memory and so on
> >> like a debugger or strace-like tool. What do you think?
> >
> > I tend to agree, Peter?
> >
> >> Alexei
> >>
> >>>
> >>> - Arnaldo
>
> Alexei

--

- Arnaldo

2020-07-14 11:03:15

by Peter Zijlstra

Subject: Re: [PATCH v8 00/12] Introduce CAP_PERFMON to secure system performance monitoring and observability

On Mon, Jul 13, 2020 at 03:51:52PM -0300, Arnaldo Carvalho de Melo wrote:

> > > diff --git a/kernel/events/core.c b/kernel/events/core.c
> > > index 856d98c36f56..a2397f724c10 100644
> > > --- a/kernel/events/core.c
> > > +++ b/kernel/events/core.c
> > > @@ -11595,7 +11595,7 @@ SYSCALL_DEFINE5(perf_event_open,
> > >  		 * perf_event_exit_task() that could imply).
> > >  		 */
> > >  		err = -EACCES;
> > > -		if (!ptrace_may_access(task, PTRACE_MODE_READ_REALCREDS))
> > > +		if (!perfmon_capable() && !ptrace_may_access(task, PTRACE_MODE_READ_REALCREDS))
> > >  			goto err_cred;
> > >  	}
> > >
> > >> makes monitoring simpler and even more secure to use since Perf tool need
> > >> not to start/stop/single-step and read/write registers and memory and so on
> > >> like a debugger or strace-like tool. What do you think?
> > >
> > > I tend to agree, Peter?

So this basically says that if CAP_PERFMON, we don't care about the
ptrace() permissions? Just like how CAP_SYS_PTRACE would always allow
the ptrace checks?

I suppose that makes sense.
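
To spell out what the one-line change does to the decision, here is a small
user-space model of the old and new conditions. It is a sketch only: struct
caller and the two function names are hypothetical, and the two booleans
stand in for the results of perfmon_capable() and
ptrace_may_access(task, PTRACE_MODE_READ_REALCREDS) in the kernel.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the caller's state at perf_event_open() time. */
struct caller {
	bool perfmon_capable;   /* result of perfmon_capable() */
	bool ptrace_may_access; /* result of ptrace_may_access(task, ...) */
};

/* Condition before the proposed patch: only the ptrace check counts. */
static int task_event_access_old(const struct caller *c)
{
	assert(c != NULL);
	return c->ptrace_may_access ? 0 : -EACCES;
}

/* Condition after the proposed patch: CAP_PERFMON short-circuits it. */
static int task_event_access_new(const struct caller *c)
{
	assert(c != NULL);
	if (!c->perfmon_capable && !c->ptrace_may_access)
		return -EACCES;
	return 0;
}
```

The only input for which the two functions differ is a caller that is
perfmon capable but fails the ptrace check, which is the case from Ravi's
report.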

2020-07-14 15:30:45

by Arnaldo Carvalho de Melo

Subject: Re: [PATCH v8 00/12] Introduce CAP_PERFMON to secure system performance monitoring and observability

Em Tue, Jul 14, 2020 at 12:59:34PM +0200, Peter Zijlstra escreveu:
> On Mon, Jul 13, 2020 at 03:51:52PM -0300, Arnaldo Carvalho de Melo wrote:
>
> > > > diff --git a/kernel/events/core.c b/kernel/events/core.c
> > > > index 856d98c36f56..a2397f724c10 100644
> > > > --- a/kernel/events/core.c
> > > > +++ b/kernel/events/core.c
> > > > @@ -11595,7 +11595,7 @@ SYSCALL_DEFINE5(perf_event_open,
> > > >  		 * perf_event_exit_task() that could imply).
> > > >  		 */
> > > >  		err = -EACCES;
> > > > -		if (!ptrace_may_access(task, PTRACE_MODE_READ_REALCREDS))
> > > > +		if (!perfmon_capable() && !ptrace_may_access(task, PTRACE_MODE_READ_REALCREDS))
> > > >  			goto err_cred;
> > > >  	}

> > > >> makes monitoring simpler and even more secure to use since Perf tool need
> > > >> not to start/stop/single-step and read/write registers and memory and so on
> > > >> like a debugger or strace-like tool. What do you think?

> > > > I tend to agree, Peter?

> So this basically says that if CAP_PERFMON, we don't care about the
> ptrace() permissions? Just like how CAP_SYS_PTRACE would always allow
> the ptrace checks?

> I suppose that makes sense.

Yeah, it in fact addresses the comment right above it:

	if (task) {
		err = mutex_lock_interruptible(&task->signal->exec_update_mutex);
		if (err)
			goto err_task;

		/*
		 * Reuse ptrace permission checks for now.
		 *
		 * We must hold exec_update_mutex across this and any potential
		 * perf_install_in_context() call for this new event to
		 * serialize against exec() altering our credentials (and the
		 * perf_event_exit_task() that could imply).
		 */
		err = -EACCES;
		if (!ptrace_may_access(task, PTRACE_MODE_READ_REALCREDS))
			goto err_cred;
	}


that "for now" part :-)

The idea is to not require CAP_SYS_PTRACE for that, i.e. the attack surface of
the perf binary is reduced.

- Arnaldo

2020-07-21 13:07:25

by Alexey Budankov

Subject: Re: [PATCH v8 00/12] Introduce CAP_PERFMON to secure system performance monitoring and observability


On 13.07.2020 21:51, Arnaldo Carvalho de Melo wrote:
> Em Mon, Jul 13, 2020 at 03:37:51PM +0300, Alexey Budankov escreveu:
>>
>> On 13.07.2020 15:17, Arnaldo Carvalho de Melo wrote:
>>> Em Mon, Jul 13, 2020 at 12:48:25PM +0300, Alexey Budankov escreveu:
>>>>
>>>> On 10.07.2020 20:09, Arnaldo Carvalho de Melo wrote:
>>>>> Em Fri, Jul 10, 2020 at 05:30:50PM +0300, Alexey Budankov escreveu:
>>>>>> On 10.07.2020 16:31, Ravi Bangoria wrote:
>>>>>>>> Currently access to perf_events, i915_perf and other performance
>>>>>>>> monitoring and observability subsystems of the kernel is open only for
>>>>>>>> a privileged process [1] with CAP_SYS_ADMIN capability enabled in the
>>>>>>>> process effective set [2].
>>>
>>>>>>>> This patch set introduces CAP_PERFMON capability designed to secure
>>>>>>>> system performance monitoring and observability operations so that
>>>>>>>> CAP_PERFMON would assist CAP_SYS_ADMIN capability in its governing role
>>>>>>>> for performance monitoring and observability subsystems of the kernel.
>>>
>>>>>>> I'm seeing an issue with CAP_PERFMON when I try to record data for a
>>>>>>> specific target. I don't know whether this is sort of a regression or
>>>>>>> an expected behavior.
>>>
>>>>>> Thanks for reporting and root causing this case. The behavior looks like
>>>>>> kind of expected since currently CAP_PERFMON takes over the related part
>>>>>> of CAP_SYS_ADMIN credentials only. Actually Perf security docs [1] say
>>>>>> that access control is also subject to CAP_SYS_PTRACE credentials.
>>>
>>>>> I think that stating that in the error message would be helpful, after
>>>>> all, who reads docs? 8-)
>>>
>>>> At least those who write it :D ...
>>>
>>> Everybody should read it, sure :-)
>>>
>>>>> I.e., this:
>>>>>
>>>>> $ ./perf stat ls
>>>>>   Error:
>>>>>   Access to performance monitoring and observability operations is limited.
>>>>> $
>>>>>
>>>>> Could become:
>>>>>
>>>>> $ ./perf stat ls
>>>>>   Error:
>>>>>   Access to performance monitoring and observability operations is limited.
>>>>> Right now only CAP_PERFMON is granted, you may need CAP_SYS_PTRACE.
>>>>> $
>>>>
>>>> It would better provide reference to perf security docs in the tool output.
>>>
>>> So add a 3rd line:
>>>
>>> $ ./perf stat ls
>>>   Error:
>>>   Access to performance monitoring and observability operations is limited.
>>> Right now only CAP_PERFMON is granted, you may need CAP_SYS_PTRACE.
>>> Please read the 'Perf events and tool security' document:
>>> https://www.kernel.org/doc/html/latest/admin-guide/perf-security.html
>
>> With the patch below applied, the message change would not be required.
>
> Sure, but the tool should continue to work and provide useful messages
> when running on kernels without that change. Pointing to the document is
> valid and should be done, that is an agreed point. But the tool can do
> some checks, narrow down the possible causes for the error message and
> provide something that in most cases will make the user make progress.
>
>> However, these two sentences at the end of the whole message would still be useful:
>> "Please read the 'Perf events and tool security' document:
>> https://www.kernel.org/doc/html/latest/admin-guide/perf-security.html"
>
> We're in violent agreement here. :-)

Here is the message draft mentioning a) CAP_SYS_PTRACE, for kernels prior to
v5.8, and b) the Perf security document link. The plan is to send a patch extending
perf_events with a CAP_PERFMON check [1] for ptrace_may_access() and extending
the tool with this message.

"Access to performance monitoring and observability operations is limited.
Enforced MAC policy settings (SELinux) can limit access to performance
monitoring and observability operations. Inspect system audit records for
more perf_event access control information and adjusting the policy.
Consider adjusting /proc/sys/kernel/perf_event_paranoid setting to open
access to performance monitoring and observability operations for processes
without CAP_PERFMON, CAP_SYS_PTRACE or CAP_SYS_ADMIN Linux capability.
More information can be found at 'Perf events and tool security' document:
https://www.kernel.org/doc/html/latest/admin-guide/perf-security.html
perf_event_paranoid setting is -1:
  -1: Allow use of (almost) all events by all users
      Ignore mlock limit after perf_event_mlock_kb without CAP_IPC_LOCK
>= 0: Disallow raw and ftrace function tracepoint access
>= 1: Disallow CPU event access
>= 2: Disallow kernel profiling
To make the adjusted perf_event_paranoid setting permanent preserve it
in /etc/sysctl.conf (e.g. kernel.perf_event_paranoid = <setting>)"
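
For reference, the cumulative meaning of the perf_event_paranoid levels
listed in the draft can be sketched as below. This is an illustration, not
the tool's code: parse_paranoid() and paranoid_restrictions() are
hypothetical helper names, and the strings are copied from the draft above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Parse the text read from /proc/sys/kernel/perf_event_paranoid.
 * Returns 0 on success, -1 if no integer could be read.
 */
static int parse_paranoid(const char *text, int *level)
{
	return sscanf(text, "%d", level) == 1 ? 0 : -1;
}

/*
 * Fill out[] with the restrictions that apply at the given level and
 * return how many there are. The levels are cumulative: a setting of 2
 * also implies the restrictions of 0 and 1. A result of 0 (level -1)
 * means (almost) all events are allowed for all users.
 */
static int paranoid_restrictions(int level, const char *out[3])
{
	int n = 0;

	assert(out != NULL);
	if (level >= 0)
		out[n++] = "Disallow raw and ftrace function tracepoint access";
	if (level >= 1)
		out[n++] = "Disallow CPU event access";
	if (level >= 2)
		out[n++] = "Disallow kernel profiling";
	return n;
}
```

A caller would read the sysctl file, pass its contents to parse_paranoid(),
and print the strings returned by paranoid_restrictions() for the current
level.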

Alexei

[1] https://lore.kernel.org/lkml/[email protected]/

>
>>>
>>>> Looks like extending ptrace_may_access() check for perf_events with CAP_PERFMON
>>>
>>> You mean the following?
>>
>> Exactly that.
>
> Sure, let's then wait for others to chime in and then you can go ahead
> and submit that patch.
>
> Peter?
>
> - Arnaldo
>
>>>
>>> diff --git a/kernel/events/core.c b/kernel/events/core.c
>>> index 856d98c36f56..a2397f724c10 100644
>>> --- a/kernel/events/core.c
>>> +++ b/kernel/events/core.c
>>> @@ -11595,7 +11595,7 @@ SYSCALL_DEFINE5(perf_event_open,
>>>  		 * perf_event_exit_task() that could imply).
>>>  		 */
>>>  		err = -EACCES;
>>> -		if (!ptrace_may_access(task, PTRACE_MODE_READ_REALCREDS))
>>> +		if (!perfmon_capable() && !ptrace_may_access(task, PTRACE_MODE_READ_REALCREDS))
>>>  			goto err_cred;
>>>  	}
>>>
>>>> makes monitoring simpler and even more secure to use since Perf tool need
>>>> not to start/stop/single-step and read/write registers and memory and so on
>>>> like a debugger or strace-like tool. What do you think?
>>>
>>> I tend to agree, Peter?
>>>
>>>> Alexei
>>>>
>>>>>
>>>>> - Arnaldo
>>
>> Alexei
>

2020-07-22 11:33:01

by Arnaldo Carvalho de Melo

Subject: Re: [PATCH v8 00/12] Introduce CAP_PERFMON to secure system performance monitoring and observability

Em Tue, Jul 21, 2020 at 04:06:34PM +0300, Alexey Budankov escreveu:
>
> On 13.07.2020 21:51, Arnaldo Carvalho de Melo wrote:
> > Em Mon, Jul 13, 2020 at 03:37:51PM +0300, Alexey Budankov escreveu:
> >>
> >> On 13.07.2020 15:17, Arnaldo Carvalho de Melo wrote:
> >>> Em Mon, Jul 13, 2020 at 12:48:25PM +0300, Alexey Budankov escreveu:
> >> With the patch below applied, the message change would not be required.

> > Sure, but the tool should continue to work and provide useful messages
> > when running on kernels without that change. Pointing to the document is
> > valid and should be done, that is an agreed point. But the tool can do
> > some checks, narrow down the possible causes for the error message and
> > provide something that in most cases will make the user make progress.

> >> However, these two sentences at the end of the whole message would still be useful:
> >> "Please read the 'Perf events and tool security' document:
> >> https://www.kernel.org/doc/html/latest/admin-guide/perf-security.html"

> > We're in violent agreement here. :-)

> Here is the message draft mentioning a) CAP_SYS_PTRACE, for kernels prior to
> v5.8, and b) the Perf security document link. The plan is to send a patch extending
> perf_events with a CAP_PERFMON check [1] for ptrace_may_access() and extending
> the tool with this message.

> "Access to performance monitoring and observability operations is limited.
> Enforced MAC policy settings (SELinux) can limit access to performance
> monitoring and observability operations. Inspect system audit records for
> more perf_event access control information and adjusting the policy.
> Consider adjusting /proc/sys/kernel/perf_event_paranoid setting to open
> access to performance monitoring and observability operations for processes
> without CAP_PERFMON, CAP_SYS_PTRACE or CAP_SYS_ADMIN Linux capability.
> More information can be found at 'Perf events and tool security' document:
> https://www.kernel.org/doc/html/latest/admin-guide/perf-security.html
> perf_event_paranoid setting is -1:
>   -1: Allow use of (almost) all events by all users
>       Ignore mlock limit after perf_event_mlock_kb without CAP_IPC_LOCK
> >= 0: Disallow raw and ftrace function tracepoint access
> >= 1: Disallow CPU event access
> >= 2: Disallow kernel profiling
> To make the adjusted perf_event_paranoid setting permanent preserve it
> in /etc/sysctl.conf (e.g. kernel.perf_event_paranoid = <setting>)"

Looks ok! Lots of knobs to control access as one needs.

- Arnaldo

> Alexei
>
> [1] https://lore.kernel.org/lkml/[email protected]/