2015-06-26 17:52:26

by Prarit Bhargava

[permalink] [raw]
Subject: [PATCH] x86, msr: Allow read access to /dev/cpu/X/msr

Customers write system monitoring software for single systems as well as
clusters. In load-balancing software it is useful to know how "busy" a
core is. Unfortunately the only way to get this data is to run as root,
or use setcap to allow userspace access for particular programs. Both of
these options are clunky at best.

This patch allows read access to the msr dev files which should be okay.
No damage can be done by reading the MSR values and it allows non-root
users to run system monitoring software.

The turbostat code specifically checks for CAP_SYS_RAWIO, which it
shouldn't have to and I've removed that code. Additionally I've modified
the turbostat man page to remove documentation about configuring
CAP_SYS_RAW_IO.

Note: Write access to msr is still restricted with this patch.

Cc: "H. Peter Anvin" <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: [email protected]
Cc: Len Brown <[email protected]>
Cc: Prarit Bhargava <[email protected]>
Cc: Dasaratharaman Chandramouli <[email protected]>
Signed-off-by: Prarit Bhargava <[email protected]>
---
arch/x86/kernel/msr.c | 11 ++++++++---
tools/power/x86/turbostat/turbostat.8 | 8 --------
tools/power/x86/turbostat/turbostat.c | 17 -----------------
3 files changed, 8 insertions(+), 28 deletions(-)

diff --git a/arch/x86/kernel/msr.c b/arch/x86/kernel/msr.c
index 113e707..380d2ac 100644
--- a/arch/x86/kernel/msr.c
+++ b/arch/x86/kernel/msr.c
@@ -105,6 +105,9 @@ static ssize_t msr_write(struct file *file, const char __user *buf,
int err = 0;
ssize_t bytes = 0;

+ if (!capable(CAP_SYS_RAWIO))
+ return -EPERM;
+
if (count % 8)
return -EINVAL; /* Invalid chunk size */

@@ -148,6 +151,10 @@ static long msr_ioctl(struct file *file, unsigned int ioc, unsigned long arg)
break;

case X86_IOC_WRMSR_REGS:
+ if (!capable(CAP_SYS_RAWIO)) {
+ err = -EPERM;
+ break;
+ }
if (!(file->f_mode & FMODE_WRITE)) {
err = -EBADF;
break;
@@ -176,9 +183,6 @@ static int msr_open(struct inode *inode, struct file *file)
unsigned int cpu = iminor(file_inode(file));
struct cpuinfo_x86 *c;

- if (!capable(CAP_SYS_RAWIO))
- return -EPERM;
-
if (cpu >= nr_cpu_ids || !cpu_online(cpu))
return -ENXIO; /* No such CPU */

@@ -241,6 +245,7 @@ static struct notifier_block __refdata msr_class_cpu_notifier = {

static char *msr_devnode(struct device *dev, umode_t *mode)
{
+ *mode = (S_IRUSR | S_IWUSR | S_IRGRP | S_IROTH);
return kasprintf(GFP_KERNEL, "cpu/%u/msr", MINOR(dev->devt));
}

diff --git a/tools/power/x86/turbostat/turbostat.8 b/tools/power/x86/turbostat/turbostat.8
index 05b8fc3..7b3fce4 100644
--- a/tools/power/x86/turbostat/turbostat.8
+++ b/tools/power/x86/turbostat/turbostat.8
@@ -204,14 +204,6 @@ not including any non-busy idle time.
.SH NOTES

.B "turbostat "
-must be run as root.
-Alternatively, non-root users can be enabled to run turbostat this way:
-
-# setcap cap_sys_rawio=ep ./turbostat
-
-# chmod +r /dev/cpu/*/msr
-
-.B "turbostat "
reads hardware counters, but doesn't write them.
So it will not interfere with the OS or other programs, including
multiple invocations of itself.
diff --git a/tools/power/x86/turbostat/turbostat.c b/tools/power/x86/turbostat/turbostat.c
index 323b65e..d043ae8 100644
--- a/tools/power/x86/turbostat/turbostat.c
+++ b/tools/power/x86/turbostat/turbostat.c
@@ -1705,26 +1705,9 @@ void check_dev_msr()

void check_permissions()
{
- struct __user_cap_header_struct cap_header_data;
- cap_user_header_t cap_header = &cap_header_data;
- struct __user_cap_data_struct cap_data_data;
- cap_user_data_t cap_data = &cap_data_data;
- extern int capget(cap_user_header_t hdrp, cap_user_data_t datap);
int do_exit = 0;
char pathname[32];

- /* check for CAP_SYS_RAWIO */
- cap_header->pid = getpid();
- cap_header->version = _LINUX_CAPABILITY_VERSION;
- if (capget(cap_header, cap_data) < 0)
- err(-6, "capget(2) failed");
-
- if ((cap_data->effective & (1 << CAP_SYS_RAWIO)) == 0) {
- do_exit++;
- warnx("capget(CAP_SYS_RAWIO) failed,"
- " try \"# setcap cap_sys_rawio=ep %s\"", progname);
- }
-
/* test file permissions */
sprintf(pathname, "/dev/cpu/%d/msr", base_cpu);
if (euidaccess(pathname, R_OK)) {
--
1.7.9.3


2015-06-26 18:46:04

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [PATCH] x86, msr: Allow read access to /dev/cpu/X/msr

On 06/26/2015 10:52 AM, Prarit Bhargava wrote:
>
> This patch allows read access to the msr dev files which should be okay.
> No damage can be done by reading the MSR values and it allows non-root
> users to run system monitoring software.
>

I don't believe that is true.

-hpa

2015-06-26 19:23:49

by Brian Gerst

[permalink] [raw]
Subject: Re: [PATCH] x86, msr: Allow read access to /dev/cpu/X/msr

On Fri, Jun 26, 2015 at 1:52 PM, Prarit Bhargava <[email protected]> wrote:
> Customers write system monitoring software for single systems as well as
> clusters. In load-balancing software it is useful to know how "busy" a
> core is. Unfortunately the only way to get this data is to run as root,
> or use setcap to allow userspace access for particular programs. Both of
> these options are clunky at best.
>
> This patch allows read access to the msr dev files which should be okay.
> No damage can be done by reading the MSR values and it allows non-root
> users to run system monitoring software.
>
> The turbostat code specifically checks for CAP_SYS_RAWIO, which it
> shouldn't have to and I've removed that code. Additionally I've modified
> the turbostat man page to remove documentation about configuring
> CAP_SYS_RAW_IO.
>
> Note: Write access to msr is still restricted with this patch.

Allowing unrestricted read access to all MSRs is wrong. Some MSRs
contain addresses of kernel data structures, which can be used in
security exploits.

The proper way to do this is to write a driver to only expose the MSRs
that the user tools need, and nothing else.

--
Brian Gerst

2015-06-26 21:26:40

by Prarit Bhargava

[permalink] [raw]
Subject: Re: [PATCH] x86, msr: Allow read access to /dev/cpu/X/msr



On 06/26/2015 03:23 PM, Brian Gerst wrote:
> On Fri, Jun 26, 2015 at 1:52 PM, Prarit Bhargava <[email protected]> wrote:
>> Customers write system monitoring software for single systems as well as
>> clusters. In load-balancing software it is useful to know how "busy" a
>> core is. Unfortunately the only way to get this data is to run as root,
>> or use setcap to allow userspace access for particular programs. Both of
>> these options are clunky at best.
>>
>> This patch allows read access to the msr dev files which should be okay.
>> No damage can be done by reading the MSR values and it allows non-root
>> users to run system monitoring software.
>>
>> The turbostat code specifically checks for CAP_SYS_RAWIO, which it
>> shouldn't have to and I've removed that code. Additionally I've modified
>> the turbostat man page to remove documentation about configuring
>> CAP_SYS_RAW_IO.
>>
>> Note: Write access to msr is still restricted with this patch.
>
> Allowing unrestricted read access to all MSRs is wrong. Some MSRs
> contain addresses of kernel data structures, which can be used in
> security exploits.
>
> The proper way to do this is to write a driver to only expose the MSRs
> that the user tools need, and nothing else.

Will do -- At least I got everyone's attention with this :).

P.

>
> --
> Brian Gerst
>

2015-06-27 08:34:10

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH] x86, msr: Allow read access to /dev/cpu/X/msr


* Prarit Bhargava <[email protected]> wrote:

> Customers write system monitoring software for single systems as well as
> clusters. In load-balancing software it is useful to know how "busy" a
> core is. Unfortunately the only way to get this data is to run as root,
> or use setcap to allow userspace access for particular programs. Both of
> these options are clunky at best.
>
> This patch allows read access to the msr dev files which should be okay.
> No damage can be done by reading the MSR values and it allows non-root
> users to run system monitoring software.
>
> The turbostat code specifically checks for CAP_SYS_RAWIO, which it
> shouldn't have to and I've removed that code. Additionally I've modified
> the turbostat man page to remove documentation about configuring
> CAP_SYS_RAW_IO.
>
> Note: Write access to msr is still restricted with this patch.
>
> Cc: "H. Peter Anvin" <[email protected]>
> Cc: Thomas Gleixner <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: [email protected]
> Cc: Len Brown <[email protected]>
> Cc: Prarit Bhargava <[email protected]>
> Cc: Dasaratharaman Chandramouli <[email protected]>
> Signed-off-by: Prarit Bhargava <[email protected]>
> ---
> arch/x86/kernel/msr.c | 11 ++++++++---
> tools/power/x86/turbostat/turbostat.8 | 8 --------
> tools/power/x86/turbostat/turbostat.c | 17 -----------------
> 3 files changed, 8 insertions(+), 28 deletions(-)

So what's wrong with exposing them as a simplified PMU driver?

That way we only expose the ones we want to - plus tooling can use all the rich
perf features that can be used around this. (sampling, counting, call chains,
etc.)

Thanks,

Ingo

2015-06-27 08:39:37

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH] x86, msr: Allow read access to /dev/cpu/X/msr


* Ingo Molnar <[email protected]> wrote:

> So what's wrong with exposing them as a simplified PMU driver?
>
> That way we only expose the ones we want to - plus tooling can use all the rich
> perf features that can be used around this. (sampling, counting, call chains,
> etc.)

See below code from Andy that exposes a single MSR via perf. At the core of the
PMU driver is a single rdmsrl():

+static void aperfmperf_event_start(struct perf_event *event, int flags)
+{
+ u64 now;
+
+ rdmsrl(event->hw.event_base, now);
+ local64_set(&event->hw.prev_count, now);
+}

Now I think what we really want is to expose not a single MSR but multiple MSRs in
a single driver, i.e. don't have one PMU driver per MSR, but have a driver that
allows the exposure of select MSRs as counters.

There should also be a maker/family/model filter mechanism, so that certain MSRs
are only exposed on models that are known to support them, etc.

Thanks,

Ingo

----- Forwarded message from Andy Lutomirski <[email protected]> -----

Date: Tue, 28 Apr 2015 14:25:37 -0700
From: Andy Lutomirski <[email protected]>
To: Len Brown <[email protected]>, Peter Zijlstra <[email protected]>, "[email protected]" <[email protected]>
Cc: Paul Mackerras <[email protected]>, Ingo Molnar <[email protected]>, Arnaldo Carvalho de Melo <[email protected]>, Andy Lutomirski <[email protected]>
Subject: [RFC] x86, perf: Add an aperfmperf driver

Signed-off-by: Andy Lutomirski <[email protected]>
---

This driver seems a little bit silly, but I can imagine it being useful. For
example, I think that turbostat could do some of its work without being
root if we had a driver like this.

Thoughts? Would it make sense at all? Did I wire it up right? This is
the only PMU driver I've ever written, and it could have any number of
issues.

arch/x86/kernel/cpu/Makefile | 2 +
arch/x86/kernel/cpu/perf_event_aperfmperf.c | 119 ++++++++++++++++++++++++++++
2 files changed, 121 insertions(+)
create mode 100644 arch/x86/kernel/cpu/perf_event_aperfmperf.c

diff --git a/arch/x86/kernel/cpu/Makefile b/arch/x86/kernel/cpu/Makefile
index 80091ae54c2b..fadc822efc90 100644
--- a/arch/x86/kernel/cpu/Makefile
+++ b/arch/x86/kernel/cpu/Makefile
@@ -45,6 +45,8 @@ obj-$(CONFIG_PERF_EVENTS_INTEL_UNCORE) += perf_event_intel_uncore.o \
perf_event_intel_uncore_snb.o \
perf_event_intel_uncore_snbep.o \
perf_event_intel_uncore_nhmex.o
+obj-$(CONFIG_CPU_SUP_INTEL) += perf_event_aperf_mperf.o
+obj-$(CONFIG_CPU_SUP_AMD) += perf_event_aperf_mperf.o
endif


diff --git a/arch/x86/kernel/cpu/perf_event_aperfmperf.c b/arch/x86/kernel/cpu/perf_event_aperfmperf.c
new file mode 100644
index 000000000000..6e6d113bd9ce
--- /dev/null
+++ b/arch/x86/kernel/cpu/perf_event_aperfmperf.c
@@ -0,0 +1,119 @@
+#include <linux/perf_event.h>
+
+#define APERFMPERF_EVENT_APERF 0
+#define APERFMPERF_EVENT_MPERF 1
+
+PMU_EVENT_ATTR_STRING(aperf, evattr_aperf, "event=0x00");
+PMU_EVENT_ATTR_STRING(mperf, evattr_mperf, "event=0x01");
+static struct attribute *events_attrs[] = {
+ &evattr_aperf.attr.attr,
+ &evattr_mperf.attr.attr,
+ NULL,
+};
+static struct attribute_group events_attr_group = {
+ .name = "events",
+ .attrs = events_attrs,
+};
+
+PMU_FORMAT_ATTR(event, "config:0-63");
+static struct attribute *format_attrs[] = {
+ &format_attr_event.attr,
+ NULL,
+};
+static struct attribute_group format_attr_group = {
+ .name = "format",
+ .attrs = format_attrs,
+};
+
+static const struct attribute_group *attr_groups[] = {
+ &events_attr_group,
+ &format_attr_group,
+ NULL,
+};
+
+static int aperfmperf_event_init(struct perf_event *event)
+{
+ if (event->attr.type != event->pmu->type)
+ return -ENOENT;
+
+ if (event->attr.config != APERFMPERF_EVENT_APERF &&
+ event->attr.config != APERFMPERF_EVENT_MPERF)
+ return -ENOENT;
+
+ if (event->attr.config1 != 0)
+ return -ENOENT;
+
+ /* no sampling */
+ if (event->hw.sample_period)
+ return -EINVAL;
+
+ /* unsupported modes and filters */
+ if (event->attr.exclude_user ||
+ event->attr.exclude_kernel ||
+ event->attr.exclude_hv ||
+ event->attr.exclude_idle ||
+ event->attr.exclude_host ||
+ event->attr.exclude_guest ||
+ event->attr.freq ||
+ event->attr.sample_period) /* no sampling */
+ return -EINVAL;
+
+ event->hw.idx = -1;
+ event->hw.event_base = (event->attr.config == APERFMPERF_EVENT_APERF ?
+ MSR_IA32_APERF : MSR_IA32_MPERF);
+
+ return 0;
+}
+
+static void aperfmperf_event_update(struct perf_event *event)
+{
+ u64 prev;
+ u64 now;
+
+ rdmsrl(event->hw.event_base, now);
+ prev = local64_xchg(&event->hw.prev_count, now);
+ local64_add(now - prev, &event->count);
+}
+
+static void aperfmperf_event_start(struct perf_event *event, int flags)
+{
+ u64 now;
+
+ rdmsrl(event->hw.event_base, now);
+ local64_set(&event->hw.prev_count, now);
+}
+
+static void aperfmperf_event_stop_or_del(struct perf_event *event, int flags)
+{
+ aperfmperf_event_update(event);
+}
+
+static int aperfmperf_event_add(struct perf_event *event, int flags)
+{
+ if (flags & PERF_EF_START)
+ aperfmperf_event_start(event, flags);
+
+ return 0;
+}
+
+static struct pmu pmu_aperfmperf = {
+ .task_ctx_nr = perf_invalid_context,
+ .attr_groups = attr_groups,
+ .event_init = aperfmperf_event_init,
+ .add = aperfmperf_event_add,
+ .del = aperfmperf_event_stop_or_del,
+ .start = aperfmperf_event_start,
+ .stop = aperfmperf_event_stop_or_del,
+ .read = aperfmperf_event_update,
+};
+
+static int __init aperfmperf_init(void)
+{
+ if (!boot_cpu_has(X86_FEATURE_APERFMPERF))
+ return -ENODEV;
+
+ perf_pmu_register(&pmu_aperfmperf, "aperfmperf", -1);
+
+ return 0;
+}
+device_initcall(aperfmperf_init);
--
2.3.0

2015-06-27 15:52:30

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH] x86, msr: Allow read access to /dev/cpu/X/msr

On Sat, Jun 27, 2015 at 1:39 AM, Ingo Molnar <[email protected]> wrote:
>
> * Ingo Molnar <[email protected]> wrote:
>
>> So what's wrong with exposing them as a simplified PMU driver?
>>
>> That way we only expose the ones we want to - plus tooling can use all the rich
>> perf features that can be used around this. (sampling, counting, call chains,
>> etc.)
>
> See below code from Andy that exposes a single MSR via perf. At the core of the
> PMU driver is a single rdmsrl():
>
> +static void aperfmperf_event_start(struct perf_event *event, int flags)
> +{
> + u64 now;
> +
> + rdmsrl(event->hw.event_base, now);
> + local64_set(&event->hw.prev_count, now);
> +}
>
> Now I think what we really want is to expose not a single MSR but multiple MSRs in
> a single driver, i.e. don't have one PMU driver per MSR, but have a driver that
> allows the exposure of select MSRs as counters.

I'm way ahead of you: this driver can expose *two* MSRs as counters :)

Seriously, though, it would be straightforward to make it handle a
more general list, complete with non-architectural stuff (such as the
upcoming PPERF in Skylake).

This driver only knows how to handle counters, though. I'm not sure
whether all of the MSRs that turbostat needs are counters.

--Andy

2015-06-28 14:34:15

by Prarit Bhargava

[permalink] [raw]
Subject: Re: [PATCH] x86, msr: Allow read access to /dev/cpu/X/msr



On 06/27/2015 11:52 AM, Andy Lutomirski wrote:
> On Sat, Jun 27, 2015 at 1:39 AM, Ingo Molnar <[email protected]> wrote:
>>
>> * Ingo Molnar <[email protected]> wrote:
>>
>>> So what's wrong with exposing them as a simplified PMU driver?
>>>
>>> That way we only expose the ones we want to - plus tooling can use all the rich
>>> perf features that can be used around this. (sampling, counting, call chains,
>>> etc.)
>>
>> See below code from Andy that exposes a single MSR via perf. At the core of the
>> PMU driver is a single rdmsrl():
>>
>> +static void aperfmperf_event_start(struct perf_event *event, int flags)
>> +{
>> + u64 now;
>> +
>> + rdmsrl(event->hw.event_base, now);
>> + local64_set(&event->hw.prev_count, now);
>> +}
>>

I just sat down to do something similar to what Andy had proposed here :).

>> Now I think what we really want is to expose not a single MSR but multiple MSRs in
>> a single driver, i.e. don't have one PMU driver per MSR, but have a driver that
>> allows the exposure of select MSRs as counters.
>
> I'm way ahead of you: this driver can expose *two* MSRs as counters :)


>
> Seriously, though, it would be straightforward to make it handle a
> more general list, complete with non-architectural stuff (such as the
> upcoming PPERF in Skylake).

Is it easier to blacklist MSRs we don't want generally exposed, or only expose
the ones that we think are safe? That's sort of a devil's advocate sort of
question ;) and I'm wondering what the shorter list is.

>
> This driver only knows how to handle counters, though. I'm not sure
> whether all of the MSRs that turbostat needs are counters.

I knew that turbostat only did MSR reads and that's why turbostat's code was
changed in this patch. TBH I'm more concerned for software that monitors system
power consumption, performance, and load.

I'll take what Andy has proposed and expand on it.

P.

>
> --Andy
>

Subject: Re: [PATCH] x86, msr: Allow read access to /dev/cpu/X/msr

On Sun, 28 Jun 2015, Prarit Bhargava wrote:
> Is it easier to blacklist MSRs we don't want generally exposed, or only expose
> the ones that we think are safe? That's sort of a devil's advocate sort of
> question ;) and I'm wondering what the shorter list is.

The only way to make MSR access safe is to allow it only by whitelisting.
The x86 platform restricts all MSR access to ring 0 for a damn good reason.

Also, such a whitelist would most likely need to be vendor and model-aware,
and to differentiate "allow reads" from "allow writes"...

--
"One disk to rule them all, One disk to find them. One disk to bring
them all and in the darkness grind them. In the Land of Redmond
where the shadows lie." -- The Silicon Valley Tarot
Henrique Holschuh

Subject: Re: [PATCH] x86, msr: Allow read access to /dev/cpu/X/msr

On Fri, 26 Jun 2015, Prarit Bhargava wrote:
> > The proper way to do this is to write a driver to only expose the MSRs
> > that the user tools need, and nothing else.
>
> Will do -- At least I got everyone's attention with this :).

IMHO we need both a new driver that exposes semanthic functionality instead
of raw MSRs, and also to lock down the current MSR driver using processor
vendor, family and model-aware whitelists.

We have absolutely no idea of the real security impact of the CONFIG_X86_MSR
/dev/cpu/cpu#/msr driver, as that driver allows CAP_SYS_RAWIO userspace to
have complete access to all documented *and undocumented* MSRs.

Maybe we could build on the ideas and data already colleced by the msr-safe
LLNS code?

https://github.com/scalability-llnl/msr-safe
http://www.vi-hps.org/upload/program/espt-sc14/vi-hps-ESPT14-Shoga.pdf

At the very least, their work on building a list of safe MSRs should help...
Their code seems to be licensed under the GPLv3, which is a rather strange
choice of license for a kernel module.


A quick look using Debian's codesearch found these users of
/dev/cpu/cpu#/msr:

* cpufrequtils
* flashrom
* i7z
* intel-gpu-tools
* inteltool
* mcelog
* msrtool, msr-tools
* PAPI (can use msr_safe)
* powertop
* qemu
* slurm-llnl (maybe this can also use msr_safe?)
* stressapptest
* turbostat
* wmlongrun, longrun
* x86info
* xserver-xorg-video-geode

As well as the other stuff that ships from the Linux kernel tree.

Looking at what they need the MSR access for (well, except for msrtools,
which is just a way for shell scripts to easily talk to the MSR driver)
might help define the problem space better.

--
"One disk to rule them all, One disk to find them. One disk to bring
them all and in the darkness grind them. In the Land of Redmond
where the shadows lie." -- The Silicon Valley Tarot
Henrique Holschuh

2015-06-29 06:42:57

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH] x86, msr: Allow read access to /dev/cpu/X/msr


* Henrique de Moraes Holschuh <[email protected]> wrote:

> On Sun, 28 Jun 2015, Prarit Bhargava wrote:
>
> > Is it easier to blacklist MSRs we don't want generally exposed, or only expose
> > the ones that we think are safe? That's sort of a devil's advocate sort of
> > question ;) and I'm wondering what the shorter list is.
>
> The only way to make MSR access safe is to allow it only by whitelisting. The
> x86 platform restricts all MSR access to ring 0 for a damn good reason.

Exactly.

We also want to document them along the way: just exposing all doesn't achieve
that.

> Also, such a whitelist would most likely need to be vendor and model-aware, and
> to differentiate "allow reads" from "allow writes"...

Initially it should only allow reads - which I believe fully meets turbostat's
needs.

Thanks,

Ingo

2015-06-29 10:59:04

by Matt Fleming

[permalink] [raw]
Subject: Re: [PATCH] x86, msr: Allow read access to /dev/cpu/X/msr

On Sun, 28 Jun, at 12:10:49PM, Henrique de Moraes Holschuh wrote:
> On Sun, 28 Jun 2015, Prarit Bhargava wrote:
> > Is it easier to blacklist MSRs we don't want generally exposed, or only expose
> > the ones that we think are safe? That's sort of a devil's advocate sort of
> > question ;) and I'm wondering what the shorter list is.
>
> The only way to make MSR access safe is to allow it only by whitelisting.
> The x86 platform restricts all MSR access to ring 0 for a damn good reason.

Blacklisting also breaks horribly if you run old kernels on new
hardware.

We need to "fail-closed" if someone tries to access an MSR the kernel
doesn't know about.

--
Matt Fleming, Intel Open Source Technology Center

2015-06-29 19:52:58

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [PATCH] x86, msr: Allow read access to /dev/cpu/X/msr

On 06/28/2015 07:34 AM, Prarit Bhargava wrote:
>>
>> Seriously, though, it would be straightforward to make it handle a
>> more general list, complete with non-architectural stuff (such as the
>> upcoming PPERF in Skylake).
>
> Is it easier to blacklist MSRs we don't want generally exposed, or only expose
> the ones that we think are safe? That's sort of a devil's advocate sort of
> question ;) and I'm wondering what the shorter list is.
>

The second is the only option. Blacklisting MSRs is not safe, as you
have no idea what new MSRs might be introduced.

-hpa

2015-06-30 12:21:05

by Prarit Bhargava

[permalink] [raw]
Subject: Re: [PATCH] x86, msr: Allow read access to /dev/cpu/X/msr


The MSR exposure seems to be okay with the following statements:

- complete read of /dev/cpu/X/msr is bad, whitelist instead
- needs to be dependent on either CPU version or reading MSRs for support.
IIRC the Intel documentaton on the MSRs indicated that there are ways to
check to see if a particular MSR is supported by a processor. That doesn't
seem difficult to do IMO.

Here are the options that have been discussed in this thread, as well as a
third that I'm proposing:

1. Andy's PMU driver suggestion (eventually exposed via sysfs)
2. Standalone driver (LLNL) which exposes values in sysfs (one value per
sysfs file)
3. Make /dev/cpu/X/msr readable for those addresses in the whitelist.
ie) Allow read access to address for IA32_MPERF, etc.

I find exposing MSRs via sysfs difficult to maintain as we move forward.
I suppose we could name them according to their MSR names in the Intel
documentation, however from a userspace point of view I still find it
cumbersome to code that way. Doing this in the existing /dev/ space has the
benefit that we wouldn't have to change any userspace code. For
those users who did (in some crazy situation) want full read access, they can
still do 'setcap' on that particular executable.

Using Henrique's list of packages that use /dev/cpu/X/msr,

* cpufrequtils
* flashrom
* i7z
* intel-gpu-tools
* inteltool
* mcelog
* msrtool, msr-tools
* PAPI (can use msr_safe)
* powertop
* qemu
* slurm-llnl (maybe this can also use msr_safe?)
* stressapptest
* turbostat
* wmlongrun, longrun
* x86info
* xserver-xorg-video-geode

it seems like visiting changes on each of these packages (and the other
packages that I'm sure I've missed) will be moderately difficult.

Thoughts?

P.

2015-06-30 12:44:40

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH] x86, msr: Allow read access to /dev/cpu/X/msr

On Tue, Jun 30, 2015 at 08:20:55AM -0400, Prarit Bhargava wrote:
> it seems like visiting changes on each of these packages (and the other
> packages that I'm sure I've missed) will be moderately difficult.
>
> Thoughts?

Start by changing the ones users want to run most and leave the rest
requiring root privs until someone has the time to convert them.

Its not like we can remove the msr driver any time soon anyway.

So I would suggest starting with the perf MSR driver thingy for all
those MSRs that count things and see if you can convert say
turbostat/cpufrequtils/powertop over to that.

I suspect there's MSR that are useful to expose but are not counting,
I'm not sure perf is the right interface for those.

Making an inventory on which MSRs are required by these tools and what
kind of data they provide might give a good idea on how to continue.

If most of these tools only use counting MSRs that can be serviced with
the perf-msr driver then that would be great.

2015-06-30 12:57:47

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH] x86, msr: Allow read access to /dev/cpu/X/msr


* Peter Zijlstra <[email protected]> wrote:

> On Tue, Jun 30, 2015 at 08:20:55AM -0400, Prarit Bhargava wrote:
> > it seems like visiting changes on each of these packages (and the other
> > packages that I'm sure I've missed) will be moderately difficult.
> >
> > Thoughts?
>
> Start by changing the ones users want to run most and leave the rest
> requiring root privs until someone has the time to convert them.

Yes, start by just converting the ones used by a single tool, say turbostat, and
then convert turbostat (while keeping /dev/msr fall-back code as well) just to see
what the tooling fallout is.

> Its not like we can remove the msr driver any time soon anyway.

Yes.

> So I would suggest starting with the perf MSR driver thingy for all
> those MSRs that count things and see if you can convert say
> turbostat/cpufrequtils/powertop over to that.

One of those tools would be enough, to keep the complexity of the initial
submission low - and to allow a change of plans if necessary.

> I suspect there's MSR that are useful to expose but are not counting, I'm not
> sure perf is the right interface for those.
>
> Making an inventory on which MSRs are required by these tools and what kind of
> data they provide might give a good idea on how to continue.
>
> If most of these tools only use counting MSRs that can be serviced with the
> perf-msr driver then that would be great.

Fully agreed!

Thanks,

Ingo

2015-06-30 13:23:11

by Prarit Bhargava

[permalink] [raw]
Subject: Re: [PATCH] x86, msr: Allow read access to /dev/cpu/X/msr



On 06/30/2015 08:44 AM, Peter Zijlstra wrote:
> On Tue, Jun 30, 2015 at 08:20:55AM -0400, Prarit Bhargava wrote:
>> it seems like visiting changes on each of these packages (and the other
>> packages that I'm sure I've missed) will be moderately difficult.
>>
>> Thoughts?
>
> Start by changing the ones users want to run most and leave the rest
> requiring root privs until someone has the time to convert them.
>
> Its not like we can remove the msr driver any time soon anyway.
>

Yes, I agree.

> So I would suggest starting with the perf MSR driver thingy for all
> those MSRs that count things and see if you can convert say
> turbostat/cpufrequtils/powertop over to that.

Yep, that was my plan -- I'm going to do the in-tree utilities first then look
at the external programs.

>
> I suspect there's MSR that are useful to expose but are not counting,
> I'm not sure perf is the right interface for those.
>
> Making an inventory on which MSRs are required by these tools and what
> kind of data they provide might give a good idea on how to continue.
>
> If most of these tools only use counting MSRs that can be serviced with
> the perf-msr driver then that would be great.
>

Thanks Peter!

P.

2015-07-01 16:39:00

by Brown, Len

[permalink] [raw]
Subject: RE: [PATCH] x86, msr: Allow read access to /dev/cpu/X/msr

> This driver only knows how to handle counters, though. I'm not sure
> whether all of the MSRs that turbostat needs are counters.

turbostat --debug
dumps a lot of configuration MSRs that are not counters.

"--debug" is not an obscure option, it is the only way that
the turbostat is used by advanced users, since it shows not just
how fast a system is running, but explains why.

turbostat -M or -C 'address'
will dump an arbitrary MSR at offset 'address' in hex or counter format.

This allows somebody to use yesterday's turbostat application
to dump an MSR that didn't exist when the application was written.
(and since the MSR driver doesn't have specific MSR addresses
hard-coded into it, that is permitted)

> I knew that turbostat only did MSR reads

FYI, There have been proposals for turbostat to do MSR writes,
and I've resisted them because I like that multiple turbostats
can run at different intervals and even different users,
and not interfere with each other. One of the proposals
was due to a short counter that tends to over-flow. That,
I think would be better done in a central place in the kernel,
though it shouldn't poll unless it is actually being used.
The other is to be able to fix configuration bits that
are recognized to be incorrect or non-optimal.

BTW. I've had a discussion w/ LLNL about their needs,
both for security and performance. For security, as concluded
by this thread, a white list is the only way to go.
I'm thinking a bit-vector of allowed MSR offsets...
For performance, they absolutely can not afford a system call
for every single MSR access. Here an ioctl to have the
msr driver perform a vector of accesses in a single system
call seems the way to go. I can prototype both of these
using turbostat as the customer.

-Len

????{.n?+???????+%?????ݶ??w??{.n?+????{??G?????{ay?ʇڙ?,j??f???h?????????z_??(?階?ݢj"???m??????G????????????&???~???iO???z??v?^?m???? ????????I?

2015-07-01 17:34:21

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH] x86, msr: Allow read access to /dev/cpu/X/msr

On Wed, Jul 1, 2015 at 9:38 AM, Brown, Len <[email protected]> wrote:
> BTW. I've had a discussion w/ LLNL about their needs,
> both for security and performance. For security, as concluded
> by this thread, a white list is the only way to go.
> I'm thinking a bit-vector of allowed MSR offsets...
> For performance, they absolutely can not afford a system call
> for every single MSR access.

I'm surprised. On a sane kernel, a syscall is about 120 cycles. Just
rdmsr to an unoptimized MSR is probably fifty cycles, I'd guess.

Of course, LLNL is probably using NOHZ_FULL, which is currently very,
very slow. Work is afoot to fix that.

> Here an ioctl to have the
> msr driver perform a vector of accesses in a single system
> call seems the way to go. I can prototype both of these
> using turbostat as the customer.

How about preadv?

--Andy

2015-07-02 09:15:41

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH] x86, msr: Allow read access to /dev/cpu/X/msr


* Andy Lutomirski <[email protected]> wrote:

> On Wed, Jul 1, 2015 at 9:38 AM, Brown, Len <[email protected]> wrote:
>
> > BTW. I've had a discussion w/ LLNL about their needs, both for security and
> > performance. For security, as concluded by this thread, a white list is the
> > only way to go. I'm thinking a bit-vector of allowed MSR offsets... For
> > performance, they absolutely can not afford a system call for every single MSR
> > access.
>
> I'm surprised. On a sane kernel, a syscall is about 120 cycles. Just rdmsr to
> an unoptimized MSR is probably fifty cycles, I'd guess.

RDMSR to a non-fastpath MSR is more like a hundred cycles:

[ 104.151166] x86/bench: ---------------------------
[ 104.155350] x86/bench: | Running x86 benchmarks: |
[ 104.159530] x86/bench: -------------------------------------------------------------------
[ 104.167604] x86/bench: | RDTSC-cycles: hot (?noise) / cold (?noise)
[ 104.175870] x86/bench: -------------------------------------------------------------------

Ancient box (10 years old):

x86/bench: rdmsr : 36 / 17 (?29.4%)
x86/bench: wrmsr : 198 / 245

AMD box (2 years old):
...
[ 173.208130] x86/bench: rdmsr : 121 / 169 (?18.9%)
[ 174.633653] x86/bench: wrmsr : 365 / 422 (? 9.2%)

Intel box (1 year old):
...
[ 130.185195] x86/bench: rdmsr : 100 / 112
[ 131.263560] x86/bench: wrmsr : 492 / 728 (?15.3%)

so the RDMSR cost got progressively worse as MSRs got farther and farther away
from the core and microcode execution got progressively worse as well.

Thanks,

Ingo

2015-07-02 19:24:50

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [PATCH] x86, msr: Allow read access to /dev/cpu/X/msr

On 07/01/2015 09:38 AM, Brown, Len wrote:
>
> BTW. I've had a discussion w/ LLNL about their needs,
> both for security and performance. For security, as concluded
> by this thread, a white list is the only way to go.
> I'm thinking a bit-vector of allowed MSR offsets...
> For performance, they absolutely can not afford a system call
> for every single MSR access. Here an ioctl to have the
> msr driver perform a vector of accesses in a single system
> call seems the way to go. I can prototype both of these
> using turbostat as the customer.
>

Every time I have heard about people having issues with performance for
MSR access, it is because they are doing cross-CPU accesses which means
a neverending stream of IPIs. You get immensely better performance by
tying a thread to a CPU and only accessing the local CPU from that
thread. This has addressed any performance problems anyone has ever
come to me with. As Andy and Ingo have already pointed out, the MSR
access itself is pretty much as expensive as the system call overhead.

-hpa

2015-07-02 19:26:38

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH] x86, msr: Allow read access to /dev/cpu/X/msr

On Thu, Jul 2, 2015 at 12:22 PM, H. Peter Anvin <[email protected]> wrote:
> On 07/01/2015 09:38 AM, Brown, Len wrote:
>>
>> BTW. I've had a discussion w/ LLNL about their needs,
>> both for security and performance. For security, as concluded
>> by this thread, a white list is the only way to go.
>> I'm thinking a bit-vector of allowed MSR offsets...
>> For performance, they absolutely can not afford a system call
>> for every single MSR access. Here an ioctl to have the
>> msr driver perform a vector of accesses in a single system
>> call seems the way to go. I can prototype both of these
>> using turbostat as the customer.
>>
>
> Every time I have heard about people having issues with performance for
> MSR access, it is because they are doing cross-CPU accesses which means
> a neverending stream of IPIs. You get immensely better performance by
> tying a thread to a CPU and only accessing the local CPU from that
> thread. This has addressed any performance problems anyone has ever
> come to me with. As Andy and Ingo have already pointed out, the MSR
> access itself is pretty much as expensive as the system call overhead.

To be fair, before we had opportunistic sysret,
CONFIG_CONTEXT_TRACKING was *extremely* expensive. Even now, it's
still pretty bad.

Len, do you know what configuration and kernel version this was on or
what the apparent syscall overhead was?

--Andy

2015-07-03 07:42:29

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH] x86, msr: Allow read access to /dev/cpu/X/msr


* Brown, Len <[email protected]> wrote:

> > This driver only knows how to handle counters, though. I'm not sure
> > whether all of the MSRs that turbostat needs are counters.
>
> turbostat --debug
> dumps a lot of configuration MSRs that are not counters.
>
> "--debug" is not an obscure option, it is the only way that the turbostat is
> used by advanced users, since it shows not just how fast a system is running,
> but explains why.

I see no problems there: the main property for perf is read-only access - so it
will work fine with non-counting MSRs as well.

Really, for system-wide statistics perf is the way to go, it has rich kernel side
and user side support. We already kind of expose MSRs via the RAPL energy PMU
bits. All we need is to extend it to an enumerated list of MSRs.

Thanks,

Ingo