2018-01-18 18:41:06

by Tvrtko Ursulin

[permalink] [raw]
Subject: [RFC] perf: Allow fine-grained PMU access control

From: Tvrtko Ursulin <[email protected]>

For situations where sysadmins might want to allow different level of
of access control for different PMUs, we start creating per-PMU
perf_event_paranoid controls in sysfs.

These work in equivalent fashion as the existing perf_event_paranoid
sysctl, which now becomes the parent control for each PMU.

On PMU registration the global/parent value will be inherited by each PMU,
as it will be propagated to all registered PMUs when the sysctl is
updated.

At any later point individual PMU access controls, located in
<sysfs>/device/<pmu-name>/perf_event_paranoid, can be adjusted to achieve
fine grained access control.

Signed-off-by: Tvrtko Ursulin <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Alexander Shishkin <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: [email protected]
---
arch/x86/events/intel/bts.c | 2 +-
arch/x86/events/intel/core.c | 2 +-
arch/x86/events/intel/p4.c | 2 +-
include/linux/perf_event.h | 18 +++++---
kernel/events/core.c | 99 ++++++++++++++++++++++++++++++++++-------
kernel/sysctl.c | 4 +-
kernel/trace/trace_event_perf.c | 6 ++-
7 files changed, 105 insertions(+), 28 deletions(-)

diff --git a/arch/x86/events/intel/bts.c b/arch/x86/events/intel/bts.c
index 24ffa1e88cf9..e416c9e2400a 100644
--- a/arch/x86/events/intel/bts.c
+++ b/arch/x86/events/intel/bts.c
@@ -555,7 +555,7 @@ static int bts_event_init(struct perf_event *event)
* Note that the default paranoia setting permits unprivileged
* users to profile the kernel.
*/
- if (event->attr.exclude_kernel && perf_paranoid_kernel() &&
+ if (event->attr.exclude_kernel && perf_paranoid_kernel(event->pmu) &&
!capable(CAP_SYS_ADMIN))
return -EACCES;

diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index 731153a4681e..d623db13f212 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -3009,7 +3009,7 @@ static int intel_pmu_hw_config(struct perf_event *event)
if (x86_pmu.version < 3)
return -EINVAL;

- if (perf_paranoid_cpu() && !capable(CAP_SYS_ADMIN))
+ if (perf_paranoid_cpu(event->pmu) && !capable(CAP_SYS_ADMIN))
return -EACCES;

event->hw.config |= ARCH_PERFMON_EVENTSEL_ANY;
diff --git a/arch/x86/events/intel/p4.c b/arch/x86/events/intel/p4.c
index d32c0eed38ca..878451ef1ace 100644
--- a/arch/x86/events/intel/p4.c
+++ b/arch/x86/events/intel/p4.c
@@ -776,7 +776,7 @@ static int p4_validate_raw_event(struct perf_event *event)
* the user needs special permissions to be able to use it
*/
if (p4_ht_active() && p4_event_bind_map[v].shared) {
- if (perf_paranoid_cpu() && !capable(CAP_SYS_ADMIN))
+ if (perf_paranoid_cpu(event->pmu) && !capable(CAP_SYS_ADMIN))
return -EACCES;
}

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 7546822a1d74..1cb4e00d7f96 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -271,6 +271,9 @@ struct pmu {
/* number of address filters this PMU can do */
unsigned int nr_addr_filters;

+ /* fine grained access control */
+ int perf_event_paranoid;
+
/*
* Fully disable/enable this PMU, can be used to protect from the PMI
* as well as for lazy/batch writing of the MSRs.
@@ -1141,6 +1144,9 @@ extern int sysctl_perf_cpu_time_max_percent;

extern void perf_sample_event_took(u64 sample_len_ns);

+extern int perf_proc_paranoid_handler(struct ctl_table *table, int write,
+ void __user *buffer, size_t *lenp,
+ loff_t *ppos);
extern int perf_proc_update_handler(struct ctl_table *table, int write,
void __user *buffer, size_t *lenp,
loff_t *ppos);
@@ -1151,19 +1157,19 @@ extern int perf_cpu_time_max_percent_handler(struct ctl_table *table, int write,
int perf_event_max_stack_handler(struct ctl_table *table, int write,
void __user *buffer, size_t *lenp, loff_t *ppos);

-static inline bool perf_paranoid_tracepoint_raw(void)
+static inline bool perf_paranoid_tracepoint_raw(const struct pmu *pmu)
{
- return sysctl_perf_event_paranoid > -1;
+ return pmu->perf_event_paranoid > -1;
}

-static inline bool perf_paranoid_cpu(void)
+static inline bool perf_paranoid_cpu(const struct pmu *pmu)
{
- return sysctl_perf_event_paranoid > 0;
+ return pmu->perf_event_paranoid > 0;
}

-static inline bool perf_paranoid_kernel(void)
+static inline bool perf_paranoid_kernel(const struct pmu *pmu)
{
- return sysctl_perf_event_paranoid > 1;
+ return pmu->perf_event_paranoid > 1;
}

extern void perf_event_init(void);
diff --git a/kernel/events/core.c b/kernel/events/core.c
index b4152da656fa..21fd4430df66 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -432,6 +432,24 @@ static void update_perf_cpu_limits(void)

static int perf_rotate_context(struct perf_cpu_context *cpuctx);

+int perf_proc_paranoid_handler(struct ctl_table *table, int write,
+ void __user *buffer, size_t *lenp,
+ loff_t *ppos)
+{
+ int ret = proc_dointvec_minmax(table, write, buffer, lenp, ppos);
+ struct pmu *pmu;
+
+ if (ret || !write)
+ return ret;
+
+ mutex_lock(&pmus_lock);
+ list_for_each_entry(pmu, &pmus, entry)
+ pmu->perf_event_paranoid = sysctl_perf_event_paranoid;
+ mutex_unlock(&pmus_lock);
+
+ return 0;
+}
+
int perf_proc_update_handler(struct ctl_table *table, int write,
void __user *buffer, size_t *lenp,
loff_t *ppos)
@@ -3772,7 +3790,7 @@ find_get_context(struct pmu *pmu, struct task_struct *task,

if (!task) {
/* Must be root to operate on a CPU event: */
- if (perf_paranoid_cpu() && !capable(CAP_SYS_ADMIN))
+ if (perf_paranoid_cpu(pmu) && !capable(CAP_SYS_ADMIN))
return ERR_PTR(-EACCES);

cpuctx = per_cpu_ptr(pmu->pmu_cpu_context, cpu);
@@ -5313,7 +5331,7 @@ static int perf_mmap(struct file *file, struct vm_area_struct *vma)
lock_limit >>= PAGE_SHIFT;
locked = vma->vm_mm->pinned_vm + extra;

- if ((locked > lock_limit) && perf_paranoid_tracepoint_raw() &&
+ if ((locked > lock_limit) && perf_paranoid_tracepoint_raw(event->pmu) &&
!capable(CAP_IPC_LOCK)) {
ret = -EPERM;
goto unlock;
@@ -8880,6 +8898,41 @@ static void free_pmu_context(struct pmu *pmu)
mutex_unlock(&pmus_lock);
}

+/*
+ * Fine-grained access control:
+ */
+static ssize_t
+perf_event_paranoid_show(struct device *dev,
+ struct device_attribute *attr,
+ char *page)
+{
+ struct pmu *pmu = dev_get_drvdata(dev);
+
+ return snprintf(page, PAGE_SIZE - 1, "%d\n", pmu->perf_event_paranoid);
+}
+
+static ssize_t
+perf_event_paranoid_store(struct device *dev,
+ struct device_attribute *attr,
+ const char *buf, size_t count)
+{
+ struct pmu *pmu = dev_get_drvdata(dev);
+ int ret, val;
+
+ ret = kstrtoint(buf, 0, &val);
+ if (ret)
+ return ret;
+
+ if (val < -1 || val > 2)
+ return -EINVAL;
+
+ pmu->perf_event_paranoid = val;
+
+ return count;
+}
+
+DEVICE_ATTR_RW(perf_event_paranoid);
+
/*
* Let userspace know that this PMU supports address range filtering:
*/
@@ -8994,6 +9047,11 @@ static int pmu_dev_alloc(struct pmu *pmu)
if (ret)
goto free_dev;

+ /* Add fine-grained access control attribute. */
+ ret = device_create_file(pmu->dev, &dev_attr_perf_event_paranoid);
+ if (ret)
+ goto del_dev;
+
/* For PMUs with address filters, throw in an extra attribute: */
if (pmu->nr_addr_filters)
ret = device_create_file(pmu->dev, &dev_attr_nr_addr_filters);
@@ -9025,6 +9083,7 @@ int perf_pmu_register(struct pmu *pmu, const char *name, int type)
if (!pmu->pmu_disable_count)
goto unlock;

+ pmu->perf_event_paranoid = sysctl_perf_event_paranoid;
pmu->type = -1;
if (!name)
goto skip_type;
@@ -9634,10 +9693,6 @@ static int perf_copy_attr(struct perf_event_attr __user *uattr,
*/
attr->branch_sample_type = mask;
}
- /* privileged levels capture (kernel, hv): check permissions */
- if ((mask & PERF_SAMPLE_BRANCH_PERM_PLM)
- && perf_paranoid_kernel() && !capable(CAP_SYS_ADMIN))
- return -EACCES;
}

if (attr->sample_type & PERF_SAMPLE_REGS_USER) {
@@ -9851,11 +9906,6 @@ SYSCALL_DEFINE5(perf_event_open,
if (err)
return err;

- if (!attr.exclude_kernel) {
- if (perf_paranoid_kernel() && !capable(CAP_SYS_ADMIN))
- return -EACCES;
- }
-
if (attr.namespaces) {
if (!capable(CAP_SYS_ADMIN))
return -EACCES;
@@ -9869,11 +9919,6 @@ SYSCALL_DEFINE5(perf_event_open,
return -EINVAL;
}

- /* Only privileged users can get physical addresses */
- if ((attr.sample_type & PERF_SAMPLE_PHYS_ADDR) &&
- perf_paranoid_kernel() && !capable(CAP_SYS_ADMIN))
- return -EACCES;
-
if (!attr.sample_max_stack)
attr.sample_max_stack = sysctl_perf_event_max_stack;

@@ -9946,6 +9991,28 @@ SYSCALL_DEFINE5(perf_event_open,
goto err_cred;
}

+ if (!attr.exclude_kernel) {
+ if (perf_paranoid_kernel(event->pmu) &&
+ !capable(CAP_SYS_ADMIN)) {
+ err = -EACCES;
+ goto err_alloc;
+ }
+ }
+
+ /* Only privileged users can get physical addresses */
+ if ((attr.sample_type & PERF_SAMPLE_PHYS_ADDR) &&
+ perf_paranoid_kernel(event->pmu) && !capable(CAP_SYS_ADMIN)) {
+ err = -EACCES;
+ goto err_alloc;
+ }
+
+ /* privileged levels capture (kernel, hv): check permissions */
+ if ((attr.branch_sample_type & PERF_SAMPLE_BRANCH_PERM_PLM) &&
+ perf_paranoid_kernel(event->pmu) && !capable(CAP_SYS_ADMIN)) {
+ err = -EACCES;
+ goto err_alloc;
+ }
+
if (is_sampling_event(event)) {
if (event->pmu->capabilities & PERF_PMU_CAP_NO_INTERRUPT) {
err = -EOPNOTSUPP;
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 557d46728577..2f724b951e92 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1129,7 +1129,9 @@ static struct ctl_table kern_table[] = {
.data = &sysctl_perf_event_paranoid,
.maxlen = sizeof(sysctl_perf_event_paranoid),
.mode = 0644,
- .proc_handler = proc_dointvec,
+ .proc_handler = perf_proc_paranoid_handler,
+ .extra1 = &neg_one,
+ .extra2 = &two,
},
{
.procname = "perf_event_mlock_kb",
diff --git a/kernel/trace/trace_event_perf.c b/kernel/trace/trace_event_perf.c
index 55d6dff37daf..f23e3560237e 100644
--- a/kernel/trace/trace_event_perf.c
+++ b/kernel/trace/trace_event_perf.c
@@ -44,7 +44,8 @@ static int perf_trace_event_perm(struct trace_event_call *tp_event,

/* The ftrace function trace is allowed only for root. */
if (ftrace_event_is_function(tp_event)) {
- if (perf_paranoid_tracepoint_raw() && !capable(CAP_SYS_ADMIN))
+ if (perf_paranoid_tracepoint_raw(p_event->pmu) &&
+ !capable(CAP_SYS_ADMIN))
return -EPERM;

if (!is_sampling_event(p_event))
@@ -80,7 +81,8 @@ static int perf_trace_event_perm(struct trace_event_call *tp_event,
* ...otherwise raw tracepoint data can be a severe data leak,
* only allow root to have these.
*/
- if (perf_paranoid_tracepoint_raw() && !capable(CAP_SYS_ADMIN))
+ if (perf_paranoid_tracepoint_raw(p_event->pmu) &&
+ !capable(CAP_SYS_ADMIN))
return -EPERM;

return 0;
--
2.14.1



2018-01-19 16:47:56

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC] perf: Allow fine-grained PMU access control

On Thu, Jan 18, 2018 at 06:40:07PM +0000, Tvrtko Ursulin wrote:
> From: Tvrtko Ursulin <[email protected]>
>
> For situations where sysadmins might want to allow different level of
> of access control for different PMUs, we start creating per-PMU
> perf_event_paranoid controls in sysfs.

You've completely and utterly failed to explain why.

2018-01-19 17:12:39

by Tvrtko Ursulin

[permalink] [raw]
Subject: Re: [Intel-gfx] [RFC] perf: Allow fine-grained PMU access control


Hi,

On 19/01/2018 16:45, Peter Zijlstra wrote:
> On Thu, Jan 18, 2018 at 06:40:07PM +0000, Tvrtko Ursulin wrote:
>> From: Tvrtko Ursulin <[email protected]>
>>
>> For situations where sysadmins might want to allow different level of
>> of access control for different PMUs, we start creating per-PMU
>> perf_event_paranoid controls in sysfs.
>
> You've completely and utterly failed to explain why.

On an abstract level, if there is a desire to decrease the security knob
on one particular PMU provider, it is better to be able to do it just
for the one, rather for the whole system.

On a more concrete level, we have customers who want to look at certain
i915 metrics, most probably engine utilization or queue depth, in order
to make load-balancing decisions. (The two would be roughly analogous to
CPU usage and load.)

This data needs to be available to their userspaces dynamically and
would be used to pick a best GPU engine (mostly analogous to a CPU core)
to run a particular packet of work.

It would be impossible to run their product as root, and while one
option could be to write a proxy daemon which would allow unprivileged
queries, it is also a significant complication which introduces a time
shift problem on the PMU data as well.

So my thinking was that a per-PMU paranoid control should not be a
problematic concept in general. And my gut feeling anyway was that not
all PMU providers are the same class data, security wise, which was
another reason I thought per-PMU controls would be fine.

There is one more way of thinking about it, and that is that the access
control could even be extended to be per-event, and not just per-PMU.
That would allow registered PMUs to let the core know which counters are
potentially security sensitive, and which are not.

I've sent another RFC along those lines some time ago, but afterwards
I've changed my mind and thought the approach from this patch should be
less controversial since it retains all control fully in the perf core
and in the hands of sysadmins.

Regards,

Tvrtko



2018-02-23 15:59:30

by Tvrtko Ursulin

[permalink] [raw]
Subject: Re: [Intel-gfx] [RFC] perf: Allow fine-grained PMU access control


Hi,

On 19/01/2018 17:10, Tvrtko Ursulin wrote:
>
> Hi,
>
> On 19/01/2018 16:45, Peter Zijlstra wrote:
>> On Thu, Jan 18, 2018 at 06:40:07PM +0000, Tvrtko Ursulin wrote:
>>> From: Tvrtko Ursulin <[email protected]>
>>>
>>> For situations where sysadmins might want to allow different level of
>>> of access control for different PMUs, we start creating per-PMU
>>> perf_event_paranoid controls in sysfs.
>>
>> You've completely and utterly failed to explain why.
>
> On an abstract level, if there is a desire to decrease the security knob
> on one particular PMU provider, it is better to be able to do it just
> for the one, rather for the whole system.
>
> On a more concrete level, we have customers who want to look at certain
> i915 metrics, most probably engine utilization or queue depth, in order
> to make load-balancing decisions. (The two would be roughly analogous to
> CPU usage and load.)
>
> This data needs to be available to their userspaces dynamically and
> would be used to pick a best GPU engine (mostly analogous to a CPU core)
> to run a particular packet of work.
>
> It would be impossible to run their product as root, and while one
> option could be to write a proxy daemon which would allow unprivileged
> queries, it is also a significant complication which introduces a time
> shift problem on the PMU data as well.
>
> So my thinking was that a per-PMU paranoid control should not be a
> problematic concept in general. And my gut feeling anyway was that not
> all PMU providers are the same class data, security wise, which was
> another reason I thought per-PMU controls would be fine.
>
> There is one more way of thinking about it, and that is that the access
> control could even be extended to be per-event, and not just per-PMU.
> That would allow registered PMUs to let the core know which counters are
> potentially security sensitive, and which are not.
>
> I've sent another RFC along those lines some time ago, but afterwards
> I've changed my mind and thought the approach from this patch should be
> less controversial since it retains all control fully in the perf core
> and in the hands of sysadmins.

Any thoughts on this one? Is the approach acceptable?

Regards,

Tvrtko