2019-01-17 00:38:06

by Fenghua Yu

[permalink] [raw]
Subject: [PATCH v2 0/3] x86/umwait: Enable user wait instructions

User wait instructions (umwait, umonitor, and tpause) and
IA32_MWAIT_CONTROL MSR to control umwait/umonitor/tpause behaviors
will be available in Tremont and other future x86 processors.

This patch set enumerates the instructions, adds a sysfs interface for
user to configure the umwait/umonitor/tpause instructions.

The sysfs interface files are in /sys/devices/system/cpu/umwait_control/
because it's hard to find an existing place to host the files.

The instructions generates #GP if CR4.TSD=1 and CPL>0. If user worries
security issues from the instructions, disabling TSC can prevent
malicious code from executing the instructions.

Detailed information on the instructions and the MSR can be found in
the latest Intel Architecture Instruction Set Extensions and Future
Features Programming Reference.

Changelog:
v2:
- Address comments from Thomas Gleixner and Andy Lutomirski
- Remove vDSO functions
- Add sysfs control file for umwait max time

v1:
Based on comments from Thomas:
- Change user APIs to vDSO functions
- Change sysfs to positive logic and enable file name
- Change patch descriptions etc

Fenghua Yu (3):
x86/cpufeatures: Enumerate user wait instructions
x86/umwait: Setup umwait C0.2 state
x86/umwait: Control umwait maximum time

arch/x86/include/asm/cpufeatures.h | 1 +
arch/x86/include/asm/msr-index.h | 6 ++
arch/x86/power/Makefile | 1 +
arch/x86/power/umwait.c | 154 +++++++++++++++++++++++++++++
4 files changed, 162 insertions(+)
create mode 100644 arch/x86/power/umwait.c

--
2.19.1



2019-01-17 00:38:25

by Fenghua Yu

[permalink] [raw]
Subject: [PATCH v2 3/3] x86/umwait: Control umwait maximum time

IA32_UMWAIT_CONTROL[31:2] determines the maximum time in TSC-quanta
that processor can stay in C0.1 or C0.2.

The maximum time value in IA32_UMWAIT_CONTROL[31-2] is set as zero which
means there is no global time limit for UMWAIT and TPAUSE instructions.
Each process sets its own umwait maximum time as the instructions operand.

User can specify global umwait maximum time through interface:
/sys/devices/system/cpu/umwait_control/umwait_max_time
The value in the interface is in decimal in TSC-quanta. Bits[1:0]
are cleared when the value is stored.

Signed-off-by: Fenghua Yu <[email protected]>
---
arch/x86/include/asm/msr-index.h | 2 ++
arch/x86/power/umwait.c | 42 +++++++++++++++++++++++++++++++-
2 files changed, 43 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index b56bfecae0de..42b9104fc15b 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -62,6 +62,8 @@
#define MSR_IA32_UMWAIT_CONTROL 0xe1
#define UMWAIT_CONTROL_C02_BIT 0x0
#define UMWAIT_CONTROL_C02_MASK 0x00000001
+#define UMWAIT_CONTROL_MAX_TIME_BIT 0x2
+#define UMWAIT_CONTROL_MAX_TIME_MASK 0xfffffffc

#define MSR_PKG_CST_CONFIG_CONTROL 0x000000e2
#define NHM_C3_AUTO_DEMOTE (1UL << 25)
diff --git a/arch/x86/power/umwait.c b/arch/x86/power/umwait.c
index 95b3867aac1e..4a1a507d3bb7 100644
--- a/arch/x86/power/umwait.c
+++ b/arch/x86/power/umwait.c
@@ -10,6 +10,7 @@
#include <asm/msr.h>

static int umwait_enable_c0_2 = 1; /* 0: disable C0.2. 1: enable C0.2. */
+static u32 umwait_max_time; /* In TSC-quanta. Only bits [31:2] are used. */
static DEFINE_MUTEX(umwait_lock);

/* Return value that will be used to set umwait control MSR */
@@ -20,7 +21,8 @@ static inline u32 umwait_control_val(void)
* When bit 0 is 1, C0.2 is disabled. Otherwise, C0.2 is enabled.
* So value in bit 0 is opposite of umwait_enable_c0_2.
*/
- return ~umwait_enable_c0_2 & UMWAIT_CONTROL_C02_MASK;
+ return (~umwait_enable_c0_2 & UMWAIT_CONTROL_C02_MASK) |
+ umwait_max_time;
}

static ssize_t umwait_enable_c0_2_show(struct device *dev,
@@ -61,8 +63,46 @@ static ssize_t umwait_enable_c0_2_store(struct device *dev,

static DEVICE_ATTR_RW(umwait_enable_c0_2);

+static ssize_t umwait_max_time_show(struct device *kobj,
+ struct device_attribute *attr, char *buf)
+{
+ return sprintf(buf, "%u\n", umwait_max_time);
+}
+
+static ssize_t umwait_max_time_store(struct device *kobj,
+ struct device_attribute *attr,
+ const char *buf, size_t count)
+{
+ u32 msr_val, max_time;
+ int cpu, ret;
+
+ ret = kstrtou32(buf, 10, &max_time);
+ if (ret)
+ return ret;
+
+ mutex_lock(&umwait_lock);
+
+ /* Only get max time value from bits [31:2] */
+ max_time &= UMWAIT_CONTROL_MAX_TIME_MASK;
+ /* Update the max time value in memory */
+ umwait_max_time = max_time;
+ msr_val = umwait_control_val();
+ get_online_cpus();
+ /* All CPUs have same umwait max time */
+ for_each_online_cpu(cpu)
+ wrmsr_on_cpu(cpu, MSR_IA32_UMWAIT_CONTROL, msr_val, 0);
+ put_online_cpus();
+
+ mutex_unlock(&umwait_lock);
+
+ return count;
+}
+
+static DEVICE_ATTR_RW(umwait_max_time);
+
static struct attribute *umwait_attrs[] = {
&dev_attr_umwait_enable_c0_2.attr,
+ &dev_attr_umwait_max_time.attr,
NULL
};

--
2.19.1


2019-01-17 00:38:34

by Fenghua Yu

[permalink] [raw]
Subject: [PATCH v2 1/3] x86/cpufeatures: Enumerate user wait instructions

UMONITOR, UMWAIT, and TPAUSE are a set of user wait instructions.

UMONITOR arms address monitoring hardware using an address. A store
to an address within the specified address range triggers the
monitoring hardware to wake up the processor waiting in umwait.

UMWAIT instructs the processor to enter an implementation-dependent
optimized state while monitoring a range of addresses. The optimized
state may be either a light-weight power/performance optimized state
(c0.1 state) or an improved power/performance optimized state
(c0.2 state).

The UMONITOR and UMWAIT operate together to provide power saving
in idle.

TPAUSE instructs the processor to enter an implementation-dependent
optimized state c0.1 or c0.2 state and wake up when time-stamp counter
reaches specified timeout.

The three instructions may be executed at any privilege level.

Availability of the user wait instructions is indicated by the presence
of the CPUID feature flag WAITPKG CPUID.0x07.0x0:ECX[5].

Please check the latest Intel Architecture Instruction Set Extensions
and Future Features Programming Reference for more details on the
instructions and CPUID feature WAITPKG flag.

Signed-off-by: Fenghua Yu <[email protected]>
---
arch/x86/include/asm/cpufeatures.h | 1 +
1 file changed, 1 insertion(+)

diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index 6d6122524711..344874df5dc3 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -322,6 +322,7 @@
#define X86_FEATURE_UMIP (16*32+ 2) /* User Mode Instruction Protection */
#define X86_FEATURE_PKU (16*32+ 3) /* Protection Keys for Userspace */
#define X86_FEATURE_OSPKE (16*32+ 4) /* OS Protection Keys Enable */
+#define X86_FEATURE_WAITPKG (16*32+ 5) /* UMONITOR/UMWAIT/TPAUSE Instructions */
#define X86_FEATURE_AVX512_VBMI2 (16*32+ 6) /* Additional AVX512 Vector Bit Manipulation Instructions */
#define X86_FEATURE_GFNI (16*32+ 8) /* Galois Field New Instructions */
#define X86_FEATURE_VAES (16*32+ 9) /* Vector AES */
--
2.19.1


2019-01-17 00:39:23

by Fenghua Yu

[permalink] [raw]
Subject: [PATCH v2 2/3] x86/umwait: Setup umwait C0.2 state

UMWAIT or TPAUSE called by user process makes processor to reside in
a light-weight power/performance optimized state (C0.1 state) or an
improved power/performance optimized state (C0.2 state).

IA32_UMWAIT_CONTROL MSR register allows OS to set global maximum umwait
time and disable C0.2 on the processor.

By default C0.2 is enabled so user wait instructions can enter the state
if user wants to save more power but wakeup time is slower. In some cases
e.g. real time, user wants to disable C0.2 and all C0.2 requests revert
to C0.1.

A new "/sys/devices/system/cpu/umwait_control/umwait_enable_c0_2" file is
created to allow user to check if C0.2 is enabled or disabled and also
allow user to enable or disable C0.2. Value "0" in the file means C0.2 is
disabled. Value "1" means C0.2 is enabled.

Signed-off-by: Fenghua Yu <[email protected]>
---
arch/x86/include/asm/msr-index.h | 4 ++
arch/x86/power/Makefile | 1 +
arch/x86/power/umwait.c | 114 +++++++++++++++++++++++++++++++
3 files changed, 119 insertions(+)
create mode 100644 arch/x86/power/umwait.c

diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index 8e40c2446fd1..b56bfecae0de 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -59,6 +59,10 @@
#define MSR_PLATFORM_INFO_CPUID_FAULT_BIT 31
#define MSR_PLATFORM_INFO_CPUID_FAULT BIT_ULL(MSR_PLATFORM_INFO_CPUID_FAULT_BIT)

+#define MSR_IA32_UMWAIT_CONTROL 0xe1
+#define UMWAIT_CONTROL_C02_BIT 0x0
+#define UMWAIT_CONTROL_C02_MASK 0x00000001
+
#define MSR_PKG_CST_CONFIG_CONTROL 0x000000e2
#define NHM_C3_AUTO_DEMOTE (1UL << 25)
#define NHM_C1_AUTO_DEMOTE (1UL << 26)
diff --git a/arch/x86/power/Makefile b/arch/x86/power/Makefile
index 37923d715741..62e2c609d1fe 100644
--- a/arch/x86/power/Makefile
+++ b/arch/x86/power/Makefile
@@ -8,3 +8,4 @@ CFLAGS_cpu.o := $(nostackp)

obj-$(CONFIG_PM_SLEEP) += cpu.o
obj-$(CONFIG_HIBERNATION) += hibernate_$(BITS).o hibernate_asm_$(BITS).o hibernate.o
+obj-y += umwait.o
diff --git a/arch/x86/power/umwait.c b/arch/x86/power/umwait.c
new file mode 100644
index 000000000000..95b3867aac1e
--- /dev/null
+++ b/arch/x86/power/umwait.c
@@ -0,0 +1,114 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Sysfs interface for umwait control
+ *
+ * Copyright (C) 2018, Intel Corporation.
+ *
+ * Author: Fenghua Yu <[email protected]>
+ */
+#include <linux/cpu.h>
+#include <asm/msr.h>
+
+static int umwait_enable_c0_2 = 1; /* 0: disable C0.2. 1: enable C0.2. */
+static DEFINE_MUTEX(umwait_lock);
+
+/* Return value that will be used to set umwait control MSR */
+static inline u32 umwait_control_val(void)
+{
+ /*
+ * Enable or disable C0.2 (bit 0) based on global setting on all CPUs.
+ * When bit 0 is 1, C0.2 is disabled. Otherwise, C0.2 is enabled.
+ * So value in bit 0 is opposite of umwait_enable_c0_2.
+ */
+ return ~umwait_enable_c0_2 & UMWAIT_CONTROL_C02_MASK;
+}
+
+static ssize_t umwait_enable_c0_2_show(struct device *dev,
+ struct device_attribute *attr,
+ char *buf)
+{
+ return sprintf(buf, "%d\n", umwait_enable_c0_2);
+}
+
+static ssize_t umwait_enable_c0_2_store(struct device *dev,
+ struct device_attribute *attr,
+ const char *buf, size_t count)
+{
+ int enable_c0_2, cpu, ret;
+ u32 msr_val;
+
+ ret = kstrtou32(buf, 10, &enable_c0_2);
+ if (ret)
+ return ret;
+
+ if (enable_c0_2 != 1 && enable_c0_2 != 0)
+ return -EINVAL;
+
+ mutex_lock(&umwait_lock);
+
+ umwait_enable_c0_2 = enable_c0_2;
+ msr_val = umwait_control_val();
+ get_online_cpus();
+ /* All CPUs have same umwait control setting */
+ for_each_online_cpu(cpu)
+ wrmsr_on_cpu(cpu, MSR_IA32_UMWAIT_CONTROL, msr_val, 0);
+ put_online_cpus();
+
+ mutex_unlock(&umwait_lock);
+
+ return count;
+}
+
+static DEVICE_ATTR_RW(umwait_enable_c0_2);
+
+static struct attribute *umwait_attrs[] = {
+ &dev_attr_umwait_enable_c0_2.attr,
+ NULL
+};
+
+static struct attribute_group umwait_attr_group = {
+ .attrs = umwait_attrs,
+ .name = "umwait_control",
+};
+
+/* Set up umwait control MSR on this CPU using the current global setting. */
+static int umwait_cpu_online(unsigned int cpu)
+{
+ u32 msr_val;
+
+ mutex_lock(&umwait_lock);
+
+ msr_val = umwait_control_val();
+ wrmsr(MSR_IA32_UMWAIT_CONTROL, msr_val, 0);
+
+ mutex_unlock(&umwait_lock);
+
+ return 0;
+}
+
+static int __init umwait_init(void)
+{
+ struct device *dev;
+ int ret;
+
+ if (!boot_cpu_has(X86_FEATURE_WAITPKG))
+ return -ENODEV;
+
+ /* Add CPU global user wait interface to control umwait. */
+ dev = cpu_subsys.dev_root;
+ ret = sysfs_create_group(&dev->kobj, &umwait_attr_group);
+ if (ret)
+ return ret;
+
+ ret = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "umwait/intel:online",
+ umwait_cpu_online, NULL);
+ if (ret < 0)
+ goto out_group;
+
+ return 0;
+out_group:
+ sysfs_remove_group(&dev->kobj, &umwait_attr_group);
+
+ return ret;
+}
+device_initcall(umwait_init);
--
2.19.1


2019-01-17 01:18:45

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH v2 2/3] x86/umwait: Setup umwait C0.2 state

On Wed, Jan 16, 2019 at 1:24 PM Fenghua Yu <[email protected]> wrote:
>
> UMWAIT or TPAUSE called by user process makes processor to reside in
> a light-weight power/performance optimized state (C0.1 state) or an
> improved power/performance optimized state (C0.2 state).
>
> IA32_UMWAIT_CONTROL MSR register allows OS to set global maximum umwait
> time and disable C0.2 on the processor.
>
> By default C0.2 is enabled so user wait instructions can enter the state
> if user wants to save more power but wakeup time is slower. In some cases
> e.g. real time, user wants to disable C0.2 and all C0.2 requests revert
> to C0.1.
>
> A new "/sys/devices/system/cpu/umwait_control/umwait_enable_c0_2" file is
> created to allow user to check if C0.2 is enabled or disabled and also
> allow user to enable or disable C0.2. Value "0" in the file means C0.2 is
> disabled. Value "1" means C0.2 is enabled.

Do you have any sense as to what the actual C0.2 entry and exit latency is?

--Andy

2019-01-17 01:21:59

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH v2 3/3] x86/umwait: Control umwait maximum time

On Wed, Jan 16, 2019 at 1:24 PM Fenghua Yu <[email protected]> wrote:
>
> IA32_UMWAIT_CONTROL[31:2] determines the maximum time in TSC-quanta
> that processor can stay in C0.1 or C0.2.
>
> The maximum time value in IA32_UMWAIT_CONTROL[31-2] is set as zero which
> means there is no global time limit for UMWAIT and TPAUSE instructions.
> Each process sets its own umwait maximum time as the instructions operand.
>
> User can specify global umwait maximum time through interface:
> /sys/devices/system/cpu/umwait_control/umwait_max_time
> The value in the interface is in decimal in TSC-quanta. Bits[1:0]
> are cleared when the value is stored.
>
> Signed-off-by: Fenghua Yu <[email protected]>
> ---
> arch/x86/include/asm/msr-index.h | 2 ++
> arch/x86/power/umwait.c | 42 +++++++++++++++++++++++++++++++-
> 2 files changed, 43 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
> index b56bfecae0de..42b9104fc15b 100644
> --- a/arch/x86/include/asm/msr-index.h
> +++ b/arch/x86/include/asm/msr-index.h
> @@ -62,6 +62,8 @@
> #define MSR_IA32_UMWAIT_CONTROL 0xe1
> #define UMWAIT_CONTROL_C02_BIT 0x0
> #define UMWAIT_CONTROL_C02_MASK 0x00000001
> +#define UMWAIT_CONTROL_MAX_TIME_BIT 0x2
> +#define UMWAIT_CONTROL_MAX_TIME_MASK 0xfffffffc
>
> #define MSR_PKG_CST_CONFIG_CONTROL 0x000000e2
> #define NHM_C3_AUTO_DEMOTE (1UL << 25)
> diff --git a/arch/x86/power/umwait.c b/arch/x86/power/umwait.c
> index 95b3867aac1e..4a1a507d3bb7 100644
> --- a/arch/x86/power/umwait.c
> +++ b/arch/x86/power/umwait.c
> @@ -10,6 +10,7 @@
> #include <asm/msr.h>
>
> static int umwait_enable_c0_2 = 1; /* 0: disable C0.2. 1: enable C0.2. */
> +static u32 umwait_max_time; /* In TSC-quanta. Only bits [31:2] are used. */
> static DEFINE_MUTEX(umwait_lock);
>
> /* Return value that will be used to set umwait control MSR */
> @@ -20,7 +21,8 @@ static inline u32 umwait_control_val(void)
> * When bit 0 is 1, C0.2 is disabled. Otherwise, C0.2 is enabled.
> * So value in bit 0 is opposite of umwait_enable_c0_2.
> */
> - return ~umwait_enable_c0_2 & UMWAIT_CONTROL_C02_MASK;
> + return (~umwait_enable_c0_2 & UMWAIT_CONTROL_C02_MASK) |
> + umwait_max_time;
> }
>
> static ssize_t umwait_enable_c0_2_show(struct device *dev,
> @@ -61,8 +63,46 @@ static ssize_t umwait_enable_c0_2_store(struct device *dev,
>
> static DEVICE_ATTR_RW(umwait_enable_c0_2);
>
> +static ssize_t umwait_max_time_show(struct device *kobj,
> + struct device_attribute *attr, char *buf)
> +{
> + return sprintf(buf, "%u\n", umwait_max_time);
> +}
> +
> +static ssize_t umwait_max_time_store(struct device *kobj,
> + struct device_attribute *attr,
> + const char *buf, size_t count)
> +{
> + u32 msr_val, max_time;
> + int cpu, ret;
> +
> + ret = kstrtou32(buf, 10, &max_time);
> + if (ret)
> + return ret;
> +
> + mutex_lock(&umwait_lock);
> +
> + /* Only get max time value from bits [31:2] */
> + max_time &= UMWAIT_CONTROL_MAX_TIME_MASK;
> + /* Update the max time value in memory */
> + umwait_max_time = max_time;
> + msr_val = umwait_control_val();
> + get_online_cpus();
> + /* All CPUs have same umwait max time */
> + for_each_online_cpu(cpu)
> + wrmsr_on_cpu(cpu, MSR_IA32_UMWAIT_CONTROL, msr_val, 0);
> + put_online_cpus();
> +
> + mutex_unlock(&umwait_lock);
> +
> + return count;
> +}
> +
> +static DEVICE_ATTR_RW(umwait_max_time);
> +
> static struct attribute *umwait_attrs[] = {
> &dev_attr_umwait_enable_c0_2.attr,
> + &dev_attr_umwait_max_time.attr,
> NULL
> };

You need something to make sure that newly onlined CPUs get the right
value in the MSR. You also need to make sure you restore it on resume
from suspend. Something like cpu_init() might be the right place.

Also, as previously discussed, I think we should set the default to
something quite small, maybe 100 microseconds. IMO the goal is to
pick a value that is a high enough multiple of the C0.2 entry+exit
latency that we get most of the power and SMT resource savings while
being small enough that no one things that UMWAIT is more than a
glorified, slightly improved, and far more misleading version of REP
NOP.

Andrew, would having Linux default to a small value do much to
mitigate your concerns that UMWAIT is problematic for hypervisors?

Also, can someone who understands the hardware clarify just how
dangerous UMWAIT is from a perspective of making speculation attacks
more dangerous than they already are? I'm wondering what events will
wake up a UMONITOR. I bet that CLFLUSH does. I wonder if a faulting
write to a read-only page also does. Or a load from a remote node.
Or a speculative store that does not subsequently retire. This
instruction seems quite delightful as a tool to create a
highish-bandwidth covert channel, and it's possibly quite nice to
agument Spectre-like attacks. If it ends up being bad enough, we
might need to set the default timeout to the absolute minimum value
and possibly ask Intel to give us a way to turn it off entirely.

2019-01-17 01:26:52

by Fenghua Yu

[permalink] [raw]
Subject: Re: [PATCH v2 3/3] x86/umwait: Control umwait maximum time

On Wed, Jan 16, 2019 at 04:00:29PM -0800, Andy Lutomirski wrote:
> On Wed, Jan 16, 2019 at 1:24 PM Fenghua Yu <[email protected]> wrote:
> >
> > IA32_UMWAIT_CONTROL[31:2] determines the maximum time in TSC-quanta
> > that processor can stay in C0.1 or C0.2.
> >
> > The maximum time value in IA32_UMWAIT_CONTROL[31-2] is set as zero which
> > means there is no global time limit for UMWAIT and TPAUSE instructions.
> > Each process sets its own umwait maximum time as the instructions operand.
> >
> > User can specify global umwait maximum time through interface:
> > /sys/devices/system/cpu/umwait_control/umwait_max_time
> > The value in the interface is in decimal in TSC-quanta. Bits[1:0]
> > are cleared when the value is stored.
> >
> > Signed-off-by: Fenghua Yu <[email protected]>
> > ---
> > arch/x86/include/asm/msr-index.h | 2 ++
> > arch/x86/power/umwait.c | 42 +++++++++++++++++++++++++++++++-
> > 2 files changed, 43 insertions(+), 1 deletion(-)
> >
> > diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
> > index b56bfecae0de..42b9104fc15b 100644
> > --- a/arch/x86/include/asm/msr-index.h
> > +++ b/arch/x86/include/asm/msr-index.h
> > @@ -62,6 +62,8 @@
> > #define MSR_IA32_UMWAIT_CONTROL 0xe1
> > #define UMWAIT_CONTROL_C02_BIT 0x0
> > #define UMWAIT_CONTROL_C02_MASK 0x00000001
> > +#define UMWAIT_CONTROL_MAX_TIME_BIT 0x2
> > +#define UMWAIT_CONTROL_MAX_TIME_MASK 0xfffffffc
> >
> > #define MSR_PKG_CST_CONFIG_CONTROL 0x000000e2
> > #define NHM_C3_AUTO_DEMOTE (1UL << 25)
> > diff --git a/arch/x86/power/umwait.c b/arch/x86/power/umwait.c
> > index 95b3867aac1e..4a1a507d3bb7 100644
> > --- a/arch/x86/power/umwait.c
> > +++ b/arch/x86/power/umwait.c
> > @@ -10,6 +10,7 @@
> > #include <asm/msr.h>
> >
> > static int umwait_enable_c0_2 = 1; /* 0: disable C0.2. 1: enable C0.2. */
> > +static u32 umwait_max_time; /* In TSC-quanta. Only bits [31:2] are used. */
> > static DEFINE_MUTEX(umwait_lock);
> >
> > /* Return value that will be used to set umwait control MSR */
> > @@ -20,7 +21,8 @@ static inline u32 umwait_control_val(void)
> > * When bit 0 is 1, C0.2 is disabled. Otherwise, C0.2 is enabled.
> > * So value in bit 0 is opposite of umwait_enable_c0_2.
> > */
> > - return ~umwait_enable_c0_2 & UMWAIT_CONTROL_C02_MASK;
> > + return (~umwait_enable_c0_2 & UMWAIT_CONTROL_C02_MASK) |
> > + umwait_max_time;
> > }
> >
> > static ssize_t umwait_enable_c0_2_show(struct device *dev,
> > @@ -61,8 +63,46 @@ static ssize_t umwait_enable_c0_2_store(struct device *dev,
> >
> > static DEVICE_ATTR_RW(umwait_enable_c0_2);
> >
> > +static ssize_t umwait_max_time_show(struct device *kobj,
> > + struct device_attribute *attr, char *buf)
> > +{
> > + return sprintf(buf, "%u\n", umwait_max_time);
> > +}
> > +
> > +static ssize_t umwait_max_time_store(struct device *kobj,
> > + struct device_attribute *attr,
> > + const char *buf, size_t count)
> > +{
> > + u32 msr_val, max_time;
> > + int cpu, ret;
> > +
> > + ret = kstrtou32(buf, 10, &max_time);
> > + if (ret)
> > + return ret;
> > +
> > + mutex_lock(&umwait_lock);
> > +
> > + /* Only get max time value from bits [31:2] */
> > + max_time &= UMWAIT_CONTROL_MAX_TIME_MASK;
> > + /* Update the max time value in memory */
> > + umwait_max_time = max_time;
> > + msr_val = umwait_control_val();
> > + get_online_cpus();
> > + /* All CPUs have same umwait max time */
> > + for_each_online_cpu(cpu)
> > + wrmsr_on_cpu(cpu, MSR_IA32_UMWAIT_CONTROL, msr_val, 0);
> > + put_online_cpus();
> > +
> > + mutex_unlock(&umwait_lock);
> > +
> > + return count;
> > +}
> > +
> > +static DEVICE_ATTR_RW(umwait_max_time);
> > +
> > static struct attribute *umwait_attrs[] = {
> > &dev_attr_umwait_enable_c0_2.attr,
> > + &dev_attr_umwait_max_time.attr,
> > NULL
> > };
>
> You need something to make sure that newly onlined CPUs get the right
> value in the MSR.

Onlined CPU takes the umwait_control value in umwait_cpu_online() in
patch 2. Please check if it's correct.

> You also need to make sure you restore it on resume
> from suspend. Something like cpu_init() might be the right place.
>
> Also, as previously discussed, I think we should set the default to
> something quite small, maybe 100 microseconds. IMO the goal is to
> pick a value that is a high enough multiple of the C0.2 entry+exit
> latency that we get most of the power and SMT resource savings while
> being small enough that no one things that UMWAIT is more than a
> glorified, slightly improved, and far more misleading version of REP
> NOP.
>
> Andrew, would having Linux default to a small value do much to
> mitigate your concerns that UMWAIT is problematic for hypervisors?
>
> Also, can someone who understands the hardware clarify just how
> dangerous UMWAIT is from a perspective of making speculation attacks
> more dangerous than they already are? I'm wondering what events will
> wake up a UMONITOR. I bet that CLFLUSH does. I wonder if a faulting
> write to a read-only page also does. Or a load from a remote node.
> Or a speculative store that does not subsequently retire. This
> instruction seems quite delightful as a tool to create a
> highish-bandwidth covert channel, and it's possibly quite nice to
> agument Spectre-like attacks. If it ends up being bad enough, we
> might need to set the default timeout to the absolute minimum value
> and possibly ask Intel to give us a way to turn it off entirely.

If CR4.TSD=1 and CPL>0, umwait and tpause generate #GP. So if user thinks
the instructions are dangerous, he can disable TSC.

Is this the right handling for security concerns?

Thanks.

-Fenghua

2019-01-17 01:29:08

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH v2 3/3] x86/umwait: Control umwait maximum time

On Wed, Jan 16, 2019 at 4:13 PM Fenghua Yu <[email protected]> wrote:
>
> On Wed, Jan 16, 2019 at 04:00:29PM -0800, Andy Lutomirski wrote:
> > On Wed, Jan 16, 2019 at 1:24 PM Fenghua Yu <[email protected]> wrote:
> > >
> > > IA32_UMWAIT_CONTROL[31:2] determines the maximum time in TSC-quanta
> > > that processor can stay in C0.1 or C0.2.
> > >
> > > The maximum time value in IA32_UMWAIT_CONTROL[31-2] is set as zero which
> > > means there is no global time limit for UMWAIT and TPAUSE instructions.
> > > Each process sets its own umwait maximum time as the instructions operand.
> > >
> > > User can specify global umwait maximum time through interface:
> > > /sys/devices/system/cpu/umwait_control/umwait_max_time
> > > The value in the interface is in decimal in TSC-quanta. Bits[1:0]
> > > are cleared when the value is stored.
> > >
> > > Signed-off-by: Fenghua Yu <[email protected]>
> > > ---
> > > arch/x86/include/asm/msr-index.h | 2 ++
> > > arch/x86/power/umwait.c | 42 +++++++++++++++++++++++++++++++-
> > > 2 files changed, 43 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
> > > index b56bfecae0de..42b9104fc15b 100644
> > > --- a/arch/x86/include/asm/msr-index.h
> > > +++ b/arch/x86/include/asm/msr-index.h
> > > @@ -62,6 +62,8 @@
> > > #define MSR_IA32_UMWAIT_CONTROL 0xe1
> > > #define UMWAIT_CONTROL_C02_BIT 0x0
> > > #define UMWAIT_CONTROL_C02_MASK 0x00000001
> > > +#define UMWAIT_CONTROL_MAX_TIME_BIT 0x2
> > > +#define UMWAIT_CONTROL_MAX_TIME_MASK 0xfffffffc
> > >
> > > #define MSR_PKG_CST_CONFIG_CONTROL 0x000000e2
> > > #define NHM_C3_AUTO_DEMOTE (1UL << 25)
> > > diff --git a/arch/x86/power/umwait.c b/arch/x86/power/umwait.c
> > > index 95b3867aac1e..4a1a507d3bb7 100644
> > > --- a/arch/x86/power/umwait.c
> > > +++ b/arch/x86/power/umwait.c
> > > @@ -10,6 +10,7 @@
> > > #include <asm/msr.h>
> > >
> > > static int umwait_enable_c0_2 = 1; /* 0: disable C0.2. 1: enable C0.2. */
> > > +static u32 umwait_max_time; /* In TSC-quanta. Only bits [31:2] are used. */
> > > static DEFINE_MUTEX(umwait_lock);
> > >
> > > /* Return value that will be used to set umwait control MSR */
> > > @@ -20,7 +21,8 @@ static inline u32 umwait_control_val(void)
> > > * When bit 0 is 1, C0.2 is disabled. Otherwise, C0.2 is enabled.
> > > * So value in bit 0 is opposite of umwait_enable_c0_2.
> > > */
> > > - return ~umwait_enable_c0_2 & UMWAIT_CONTROL_C02_MASK;
> > > + return (~umwait_enable_c0_2 & UMWAIT_CONTROL_C02_MASK) |
> > > + umwait_max_time;
> > > }
> > >
> > > static ssize_t umwait_enable_c0_2_show(struct device *dev,
> > > @@ -61,8 +63,46 @@ static ssize_t umwait_enable_c0_2_store(struct device *dev,
> > >
> > > static DEVICE_ATTR_RW(umwait_enable_c0_2);
> > >
> > > +static ssize_t umwait_max_time_show(struct device *kobj,
> > > + struct device_attribute *attr, char *buf)
> > > +{
> > > + return sprintf(buf, "%u\n", umwait_max_time);
> > > +}
> > > +
> > > +static ssize_t umwait_max_time_store(struct device *kobj,
> > > + struct device_attribute *attr,
> > > + const char *buf, size_t count)
> > > +{
> > > + u32 msr_val, max_time;
> > > + int cpu, ret;
> > > +
> > > + ret = kstrtou32(buf, 10, &max_time);
> > > + if (ret)
> > > + return ret;
> > > +
> > > + mutex_lock(&umwait_lock);
> > > +
> > > + /* Only get max time value from bits [31:2] */
> > > + max_time &= UMWAIT_CONTROL_MAX_TIME_MASK;
> > > + /* Update the max time value in memory */
> > > + umwait_max_time = max_time;
> > > + msr_val = umwait_control_val();
> > > + get_online_cpus();
> > > + /* All CPUs have same umwait max time */
> > > + for_each_online_cpu(cpu)
> > > + wrmsr_on_cpu(cpu, MSR_IA32_UMWAIT_CONTROL, msr_val, 0);
> > > + put_online_cpus();
> > > +
> > > + mutex_unlock(&umwait_lock);
> > > +
> > > + return count;
> > > +}
> > > +
> > > +static DEVICE_ATTR_RW(umwait_max_time);
> > > +
> > > static struct attribute *umwait_attrs[] = {
> > > &dev_attr_umwait_enable_c0_2.attr,
> > > + &dev_attr_umwait_max_time.attr,
> > > NULL
> > > };
> >
> > You need something to make sure that newly onlined CPUs get the right
> > value in the MSR.
>
> Onlined CPU takes the umwait_control value in umwait_cpu_online() in
> patch 2. Please check if it's correct.
>
> > You also need to make sure you restore it on resume
> > from suspend. Something like cpu_init() might be the right place.
> >
> > Also, as previously discussed, I think we should set the default to
> > something quite small, maybe 100 microseconds. IMO the goal is to
> > pick a value that is a high enough multiple of the C0.2 entry+exit
> > latency that we get most of the power and SMT resource savings while
> > being small enough that no one things that UMWAIT is more than a
> > glorified, slightly improved, and far more misleading version of REP
> > NOP.
> >
> > Andrew, would having Linux default to a small value do much to
> > mitigate your concerns that UMWAIT is problematic for hypervisors?
> >
> > Also, can someone who understands the hardware clarify just how
> > dangerous UMWAIT is from a perspective of making speculation attacks
> > more dangerous than they already are? I'm wondering what events will
> > wake up a UMONITOR. I bet that CLFLUSH does. I wonder if a faulting
> > write to a read-only page also does. Or a load from a remote node.
> > Or a speculative store that does not subsequently retire. This
> > instruction seems quite delightful as a tool to create a
> > highish-bandwidth covert channel, and it's possibly quite nice to
> > agument Spectre-like attacks. If it ends up being bad enough, we
> > might need to set the default timeout to the absolute minimum value
> > and possibly ask Intel to give us a way to turn it off entirely.
>
> If CR4.TSD=1 and CPL>0, umwait and tpause generate #GP. So if user thinks
> the instructions are dangerous, he can disable TSC.
>
> Is this the right handling for security concerns?
>

Setting CR4.TSD=1 systemwide would utterly destroy performance of
quite a few workloads. And my argument for setting the value to a
lowish but not minimal value is about functionality, not security.

2019-01-17 02:13:45

by Fenghua Yu

[permalink] [raw]
Subject: Re: [PATCH v2 3/3] x86/umwait: Control umwait maximum time

On Wed, Jan 16, 2019 at 04:30:59PM -0800, Andy Lutomirski wrote:
> On Wed, Jan 16, 2019 at 4:13 PM Fenghua Yu <[email protected]> wrote:
> >
> > On Wed, Jan 16, 2019 at 04:00:29PM -0800, Andy Lutomirski wrote:
> > > On Wed, Jan 16, 2019 at 1:24 PM Fenghua Yu <[email protected]> wrote:
> > > >
> > > > IA32_UMWAIT_CONTROL[31:2] determines the maximum time in TSC-quanta
> > > > that processor can stay in C0.1 or C0.2.
> > > >
> > > > The maximum time value in IA32_UMWAIT_CONTROL[31-2] is set as zero which
> > > > means there is no global time limit for UMWAIT and TPAUSE instructions.
> > > > Each process sets its own umwait maximum time as the instructions operand.
> > > >
> > > > User can specify global umwait maximum time through interface:
> > > > /sys/devices/system/cpu/umwait_control/umwait_max_time
> > > > The value in the interface is in decimal in TSC-quanta. Bits[1:0]
> > > > are cleared when the value is stored.
> > > >
> > > > Signed-off-by: Fenghua Yu <[email protected]>
> > > > ---
> > > > arch/x86/include/asm/msr-index.h | 2 ++
> > > > arch/x86/power/umwait.c | 42 +++++++++++++++++++++++++++++++-
> > > > 2 files changed, 43 insertions(+), 1 deletion(-)
> > > >
> > > > diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
> > > > index b56bfecae0de..42b9104fc15b 100644
> > > > --- a/arch/x86/include/asm/msr-index.h
> > > > +++ b/arch/x86/include/asm/msr-index.h
> > > > @@ -62,6 +62,8 @@
> > > > #define MSR_IA32_UMWAIT_CONTROL 0xe1
> > > > #define UMWAIT_CONTROL_C02_BIT 0x0
> > > > #define UMWAIT_CONTROL_C02_MASK 0x00000001
> > > > +#define UMWAIT_CONTROL_MAX_TIME_BIT 0x2
> > > > +#define UMWAIT_CONTROL_MAX_TIME_MASK 0xfffffffc
> > > >
> > > > #define MSR_PKG_CST_CONFIG_CONTROL 0x000000e2
> > > > #define NHM_C3_AUTO_DEMOTE (1UL << 25)
> > > > diff --git a/arch/x86/power/umwait.c b/arch/x86/power/umwait.c
> > > > index 95b3867aac1e..4a1a507d3bb7 100644
> > > > --- a/arch/x86/power/umwait.c
> > > > +++ b/arch/x86/power/umwait.c
> > > > @@ -10,6 +10,7 @@
> > > > #include <asm/msr.h>
> > > >
> > > > static int umwait_enable_c0_2 = 1; /* 0: disable C0.2. 1: enable C0.2. */
> > > > +static u32 umwait_max_time; /* In TSC-quanta. Only bits [31:2] are used. */
> > > > static DEFINE_MUTEX(umwait_lock);
> > > >
> > > > /* Return value that will be used to set umwait control MSR */
> > > > @@ -20,7 +21,8 @@ static inline u32 umwait_control_val(void)
> > > > * When bit 0 is 1, C0.2 is disabled. Otherwise, C0.2 is enabled.
> > > > * So value in bit 0 is opposite of umwait_enable_c0_2.
> > > > */
> > > > - return ~umwait_enable_c0_2 & UMWAIT_CONTROL_C02_MASK;
> > > > + return (~umwait_enable_c0_2 & UMWAIT_CONTROL_C02_MASK) |
> > > > + umwait_max_time;
> > > > }
> > > >
> > > > static ssize_t umwait_enable_c0_2_show(struct device *dev,
> > > > @@ -61,8 +63,46 @@ static ssize_t umwait_enable_c0_2_store(struct device *dev,
> > > >
> > > > static DEVICE_ATTR_RW(umwait_enable_c0_2);
> > > >
> > > > +static ssize_t umwait_max_time_show(struct device *kobj,
> > > > + struct device_attribute *attr, char *buf)
> > > > +{
> > > > + return sprintf(buf, "%u\n", umwait_max_time);
> > > > +}
> > > > +
> > > > +static ssize_t umwait_max_time_store(struct device *kobj,
> > > > + struct device_attribute *attr,
> > > > + const char *buf, size_t count)
> > > > +{
> > > > + u32 msr_val, max_time;
> > > > + int cpu, ret;
> > > > +
> > > > + ret = kstrtou32(buf, 10, &max_time);
> > > > + if (ret)
> > > > + return ret;
> > > > +
> > > > + mutex_lock(&umwait_lock);
> > > > +
> > > > + /* Only get max time value from bits [31:2] */
> > > > + max_time &= UMWAIT_CONTROL_MAX_TIME_MASK;
> > > > + /* Update the max time value in memory */
> > > > + umwait_max_time = max_time;
> > > > + msr_val = umwait_control_val();
> > > > + get_online_cpus();
> > > > + /* All CPUs have same umwait max time */
> > > > + for_each_online_cpu(cpu)
> > > > + wrmsr_on_cpu(cpu, MSR_IA32_UMWAIT_CONTROL, msr_val, 0);
> > > > + put_online_cpus();
> > > > +
> > > > + mutex_unlock(&umwait_lock);
> > > > +
> > > > + return count;
> > > > +}
> > > > +
> > > > +static DEVICE_ATTR_RW(umwait_max_time);
> > > > +
> > > > static struct attribute *umwait_attrs[] = {
> > > > &dev_attr_umwait_enable_c0_2.attr,
> > > > + &dev_attr_umwait_max_time.attr,
> > > > NULL
> > > > };
> > >
> > > You need something to make sure that newly onlined CPUs get the right
> > > value in the MSR.
> >
> > Onlined CPU takes the umwait_control value in umwait_cpu_online() in
> > patch 2. Please check if it's correct.
> >
> > > You also need to make sure you restore it on resume
> > > from suspend. Something like cpu_init() might be the right place.
> > >
> > > Also, as previously discussed, I think we should set the default to
> > > something quite small, maybe 100 microseconds. IMO the goal is to
> > > pick a value that is a high enough multiple of the C0.2 entry+exit
> > > latency that we get most of the power and SMT resource savings while
> > > being small enough that no one things that UMWAIT is more than a
> > > glorified, slightly improved, and far more misleading version of REP
> > > NOP.
> > >
> > > Andrew, would having Linux default to a small value do much to
> > > mitigate your concerns that UMWAIT is problematic for hypervisors?
> > >
> > > Also, can someone who understands the hardware clarify just how
> > > dangerous UMWAIT is from a perspective of making speculation attacks
> > > more dangerous than they already are? I'm wondering what events will
> > > wake up a UMONITOR. I bet that CLFLUSH does. I wonder if a faulting
> > > write to a read-only page also does. Or a load from a remote node.
> > > Or a speculative store that does not subsequently retire. This
> > > instruction seems quite delightful as a tool to create a
> > > highish-bandwidth covert channel, and it's possibly quite nice to
> > > agument Spectre-like attacks. If it ends up being bad enough, we
> > > might need to set the default timeout to the absolute minimum value
> > > and possibly ask Intel to give us a way to turn it off entirely.
> >
> > If CR4.TSD=1 and CPL>0, umwait and tpause generate #GP. So if user thinks
> > the instructions are dangerous, he can disable TSC.
> >
> > Is this the right handling for security concerns?
> >
>
> Setting CR4.TSD=1 systemwide would utterly destroy performance of
> quite a few workloads. And my argument for setting the value to a
> lowish but not minimal value is about functionality, not security.

CR4.TSD can be set up per process by prctl(PR_SET_TSC, PR_TSC_ENABLE)
to enable TSC or prctl(PR_SET_TSC, PR_TSC_SIGSEGV) to disable TSC.

For high performance workloads, user can turn TSC on.

Does it make sense?

Thanks.

-Fenghua


2019-01-17 07:16:06

by Fenghua Yu

[permalink] [raw]
Subject: [PATCH v2 3/3] x86/umwait: Control umwait maximum time

IA32_UMWAIT_CONTROL[31:2] determines the maximum time in TSC-quanta
that processor can stay in C0.1 or C0.2.

The maximum time value in IA32_UMWAIT_CONTROL[31-2] is set as zero which
means there is no global time limit for UMWAIT and TPAUSE instructions.
Each process sets its own umwait maximum time as the instructions operand.

User can specify global umwait maximum time through interface:
/sys/devices/system/cpu/umwait_control/umwait_max_time
The value in the interface is in decimal in TSC-quanta. Bits[1:0]
are cleared when the value is stored.

Signed-off-by: Fenghua Yu <[email protected]>
---
arch/x86/include/asm/msr-index.h | 2 ++
arch/x86/power/umwait.c | 42 +++++++++++++++++++++++++++++++-
2 files changed, 43 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index b56bfecae0de..42b9104fc15b 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -62,6 +62,8 @@
#define MSR_IA32_UMWAIT_CONTROL 0xe1
#define UMWAIT_CONTROL_C02_BIT 0x0
#define UMWAIT_CONTROL_C02_MASK 0x00000001
+#define UMWAIT_CONTROL_MAX_TIME_BIT 0x2
+#define UMWAIT_CONTROL_MAX_TIME_MASK 0xfffffffc

#define MSR_PKG_CST_CONFIG_CONTROL 0x000000e2
#define NHM_C3_AUTO_DEMOTE (1UL << 25)
diff --git a/arch/x86/power/umwait.c b/arch/x86/power/umwait.c
index 95b3867aac1e..4a1a507d3bb7 100644
--- a/arch/x86/power/umwait.c
+++ b/arch/x86/power/umwait.c
@@ -10,6 +10,7 @@
#include <asm/msr.h>

static int umwait_enable_c0_2 = 1; /* 0: disable C0.2. 1: enable C0.2. */
+static u32 umwait_max_time; /* In TSC-quanta. Only bits [31:2] are used. */
static DEFINE_MUTEX(umwait_lock);

/* Return value that will be used to set umwait control MSR */
@@ -20,7 +21,8 @@ static inline u32 umwait_control_val(void)
* When bit 0 is 1, C0.2 is disabled. Otherwise, C0.2 is enabled.
* So value in bit 0 is opposite of umwait_enable_c0_2.
*/
- return ~umwait_enable_c0_2 & UMWAIT_CONTROL_C02_MASK;
+ return (~umwait_enable_c0_2 & UMWAIT_CONTROL_C02_MASK) |
+ umwait_max_time;
}

static ssize_t umwait_enable_c0_2_show(struct device *dev,
@@ -61,8 +63,46 @@ static ssize_t umwait_enable_c0_2_store(struct device *dev,

static DEVICE_ATTR_RW(umwait_enable_c0_2);

+static ssize_t umwait_max_time_show(struct device *kobj,
+ struct device_attribute *attr, char *buf)
+{
+ return sprintf(buf, "%u\n", umwait_max_time);
+}
+
+static ssize_t umwait_max_time_store(struct device *kobj,
+ struct device_attribute *attr,
+ const char *buf, size_t count)
+{
+ u32 msr_val, max_time;
+ int cpu, ret;
+
+ ret = kstrtou32(buf, 10, &max_time);
+ if (ret)
+ return ret;
+
+ mutex_lock(&umwait_lock);
+
+ /* Only get max time value from bits [31:2] */
+ max_time &= UMWAIT_CONTROL_MAX_TIME_MASK;
+ /* Update the max time value in memory */
+ umwait_max_time = max_time;
+ msr_val = umwait_control_val();
+ get_online_cpus();
+ /* All CPUs have same umwait max time */
+ for_each_online_cpu(cpu)
+ wrmsr_on_cpu(cpu, MSR_IA32_UMWAIT_CONTROL, msr_val, 0);
+ put_online_cpus();
+
+ mutex_unlock(&umwait_lock);
+
+ return count;
+}
+
+static DEVICE_ATTR_RW(umwait_max_time);
+
static struct attribute *umwait_attrs[] = {
&dev_attr_umwait_enable_c0_2.attr,
+ &dev_attr_umwait_max_time.attr,
NULL
};

--
2.19.1


2019-01-17 07:17:06

by Fenghua Yu

[permalink] [raw]
Subject: [PATCH v2 0/3] x86/umwait: Enable user wait instructions

User wait instructions (umwait, umonitor, and tpause) and
IA32_MWAIT_CONTROL MSR to control umwait/umonitor/tpause behaviors
will be available in Tremont and other future x86 processors.

This patch set enumerates the instructions, adds a sysfs interface for
user to configure the umwait/umonitor/tpause instructions.

The sysfs interface files are in /sys/devices/system/cpu/umwait_control/
because it's hard to find an existing place to host the files.

The instructions generates #GP if CR4.TSD=1 and CPL>0. If user worries
security issues from the instructions, disabling TSC can prevent
malicious code from executing the instructions.

Detailed information on the instructions and the MSR can be found in
the latest Intel Architecture Instruction Set Extensions and Future
Features Programming Reference.

Changelog:
v2:
- Address comments from Thomas Gleixner and Andy Lutomirski
- Remove vDSO functions
- Add sysfs control file for umwait max time

v1:
Based on comments from Thomas:
- Change user APIs to vDSO functions
- Change sysfs to positive logic and enable file name
- Change patch descriptions etc

Fenghua Yu (3):
x86/cpufeatures: Enumerate user wait instructions
x86/umwait: Setup umwait C0.2 state
x86/umwait: Control umwait maximum time

arch/x86/include/asm/cpufeatures.h | 1 +
arch/x86/include/asm/msr-index.h | 6 ++
arch/x86/power/Makefile | 1 +
arch/x86/power/umwait.c | 154 +++++++++++++++++++++++++++++
4 files changed, 162 insertions(+)
create mode 100644 arch/x86/power/umwait.c

--
2.19.1


2019-01-17 07:51:33

by Fenghua Yu

[permalink] [raw]
Subject: [PATCH v2 1/3] x86/cpufeatures: Enumerate user wait instructions

UMONITOR, UMWAIT, and TPAUSE are a set of user wait instructions.

UMONITOR arms address monitoring hardware using an address. A store
to an address within the specified address range triggers the
monitoring hardware to wake up the processor waiting in umwait.

UMWAIT instructs the processor to enter an implementation-dependent
optimized state while monitoring a range of addresses. The optimized
state may be either a light-weight power/performance optimized state
(c0.1 state) or an improved power/performance optimized state
(c0.2 state).

The UMONITOR and UMWAIT operate together to provide power saving
in idle.

TPAUSE instructs the processor to enter an implementation-dependent
optimized state c0.1 or c0.2 state and wake up when time-stamp counter
reaches specified timeout.

The three instructions may be executed at any privilege level.

Availability of the user wait instructions is indicated by the presence
of the CPUID feature flag WAITPKG CPUID.0x07.0x0:ECX[5].

Please check the latest Intel Architecture Instruction Set Extensions
and Future Features Programming Reference for more details on the
instructions and CPUID feature WAITPKG flag.

Signed-off-by: Fenghua Yu <[email protected]>
---
arch/x86/include/asm/cpufeatures.h | 1 +
1 file changed, 1 insertion(+)

diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index 6d6122524711..344874df5dc3 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -322,6 +322,7 @@
#define X86_FEATURE_UMIP (16*32+ 2) /* User Mode Instruction Protection */
#define X86_FEATURE_PKU (16*32+ 3) /* Protection Keys for Userspace */
#define X86_FEATURE_OSPKE (16*32+ 4) /* OS Protection Keys Enable */
+#define X86_FEATURE_WAITPKG (16*32+ 5) /* UMONITOR/UMWAIT/TPAUSE Instructions */
#define X86_FEATURE_AVX512_VBMI2 (16*32+ 6) /* Additional AVX512 Vector Bit Manipulation Instructions */
#define X86_FEATURE_GFNI (16*32+ 8) /* Galois Field New Instructions */
#define X86_FEATURE_VAES (16*32+ 9) /* Vector AES */
--
2.19.1


2019-01-17 07:51:36

by Fenghua Yu

[permalink] [raw]
Subject: [PATCH v2 2/3] x86/umwait: Setup umwait C0.2 state

UMWAIT or TPAUSE called by user process makes processor to reside in
a light-weight power/performance optimized state (C0.1 state) or an
improved power/performance optimized state (C0.2 state).

IA32_UMWAIT_CONTROL MSR register allows OS to set global maximum umwait
time and disable C0.2 on the processor.

By default C0.2 is enabled so user wait instructions can enter the state
if user wants to save more power but wakeup time is slower. In some cases
e.g. real time, user wants to disable C0.2 and all C0.2 requests revert
to C0.1.

A new "/sys/devices/system/cpu/umwait_control/umwait_enable_c0_2" file is
created to allow user to check if C0.2 is enabled or disabled and also
allow user to enable or disable C0.2. Value "0" in the file means C0.2 is
disabled. Value "1" means C0.2 is enabled.

Signed-off-by: Fenghua Yu <[email protected]>
---
arch/x86/include/asm/msr-index.h | 4 ++
arch/x86/power/Makefile | 1 +
arch/x86/power/umwait.c | 114 +++++++++++++++++++++++++++++++
3 files changed, 119 insertions(+)
create mode 100644 arch/x86/power/umwait.c

diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index 8e40c2446fd1..b56bfecae0de 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -59,6 +59,10 @@
#define MSR_PLATFORM_INFO_CPUID_FAULT_BIT 31
#define MSR_PLATFORM_INFO_CPUID_FAULT BIT_ULL(MSR_PLATFORM_INFO_CPUID_FAULT_BIT)

+#define MSR_IA32_UMWAIT_CONTROL 0xe1
+#define UMWAIT_CONTROL_C02_BIT 0x0
+#define UMWAIT_CONTROL_C02_MASK 0x00000001
+
#define MSR_PKG_CST_CONFIG_CONTROL 0x000000e2
#define NHM_C3_AUTO_DEMOTE (1UL << 25)
#define NHM_C1_AUTO_DEMOTE (1UL << 26)
diff --git a/arch/x86/power/Makefile b/arch/x86/power/Makefile
index 37923d715741..62e2c609d1fe 100644
--- a/arch/x86/power/Makefile
+++ b/arch/x86/power/Makefile
@@ -8,3 +8,4 @@ CFLAGS_cpu.o := $(nostackp)

obj-$(CONFIG_PM_SLEEP) += cpu.o
obj-$(CONFIG_HIBERNATION) += hibernate_$(BITS).o hibernate_asm_$(BITS).o hibernate.o
+obj-y += umwait.o
diff --git a/arch/x86/power/umwait.c b/arch/x86/power/umwait.c
new file mode 100644
index 000000000000..95b3867aac1e
--- /dev/null
+++ b/arch/x86/power/umwait.c
@@ -0,0 +1,114 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Sysfs interface for umwait control
+ *
+ * Copyright (C) 2018, Intel Corporation.
+ *
+ * Author: Fenghua Yu <[email protected]>
+ */
+#include <linux/cpu.h>
+#include <asm/msr.h>
+
+static int umwait_enable_c0_2 = 1; /* 0: disable C0.2. 1: enable C0.2. */
+static DEFINE_MUTEX(umwait_lock);
+
+/* Return value that will be used to set umwait control MSR */
+static inline u32 umwait_control_val(void)
+{
+ /*
+ * Enable or disable C0.2 (bit 0) based on global setting on all CPUs.
+ * When bit 0 is 1, C0.2 is disabled. Otherwise, C0.2 is enabled.
+ * So value in bit 0 is opposite of umwait_enable_c0_2.
+ */
+ return ~umwait_enable_c0_2 & UMWAIT_CONTROL_C02_MASK;
+}
+
+static ssize_t umwait_enable_c0_2_show(struct device *dev,
+ struct device_attribute *attr,
+ char *buf)
+{
+ return sprintf(buf, "%d\n", umwait_enable_c0_2);
+}
+
+static ssize_t umwait_enable_c0_2_store(struct device *dev,
+ struct device_attribute *attr,
+ const char *buf, size_t count)
+{
+ int enable_c0_2, cpu, ret;
+ u32 msr_val;
+
+ ret = kstrtou32(buf, 10, &enable_c0_2);
+ if (ret)
+ return ret;
+
+ if (enable_c0_2 != 1 && enable_c0_2 != 0)
+ return -EINVAL;
+
+ mutex_lock(&umwait_lock);
+
+ umwait_enable_c0_2 = enable_c0_2;
+ msr_val = umwait_control_val();
+ get_online_cpus();
+ /* All CPUs have same umwait control setting */
+ for_each_online_cpu(cpu)
+ wrmsr_on_cpu(cpu, MSR_IA32_UMWAIT_CONTROL, msr_val, 0);
+ put_online_cpus();
+
+ mutex_unlock(&umwait_lock);
+
+ return count;
+}
+
+static DEVICE_ATTR_RW(umwait_enable_c0_2);
+
+static struct attribute *umwait_attrs[] = {
+ &dev_attr_umwait_enable_c0_2.attr,
+ NULL
+};
+
+static struct attribute_group umwait_attr_group = {
+ .attrs = umwait_attrs,
+ .name = "umwait_control",
+};
+
+/* Set up umwait control MSR on this CPU using the current global setting. */
+static int umwait_cpu_online(unsigned int cpu)
+{
+ u32 msr_val;
+
+ mutex_lock(&umwait_lock);
+
+ msr_val = umwait_control_val();
+ wrmsr(MSR_IA32_UMWAIT_CONTROL, msr_val, 0);
+
+ mutex_unlock(&umwait_lock);
+
+ return 0;
+}
+
+static int __init umwait_init(void)
+{
+ struct device *dev;
+ int ret;
+
+ if (!boot_cpu_has(X86_FEATURE_WAITPKG))
+ return -ENODEV;
+
+ /* Add CPU global user wait interface to control umwait. */
+ dev = cpu_subsys.dev_root;
+ ret = sysfs_create_group(&dev->kobj, &umwait_attr_group);
+ if (ret)
+ return ret;
+
+ ret = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "umwait/intel:online",
+ umwait_cpu_online, NULL);
+ if (ret < 0)
+ goto out_group;
+
+ return 0;
+out_group:
+ sysfs_remove_group(&dev->kobj, &umwait_attr_group);
+
+ return ret;
+}
+device_initcall(umwait_init);
--
2.19.1


2019-01-20 19:14:42

by Andrew Cooper

[permalink] [raw]
Subject: Re: [PATCH v2 3/3] x86/umwait: Control umwait maximum time

On 17/01/2019 00:00, Andy Lutomirski wrote:
> On Wed, Jan 16, 2019 at 1:24 PM Fenghua Yu <[email protected]> wrote:
>> IA32_UMWAIT_CONTROL[31:2] determines the maximum time in TSC-quanta
>> that processor can stay in C0.1 or C0.2.
>>
>> The maximum time value in IA32_UMWAIT_CONTROL[31-2] is set as zero which
>> means there is no global time limit for UMWAIT and TPAUSE instructions.
>> Each process sets its own umwait maximum time as the instructions operand.
>>
>> User can specify global umwait maximum time through interface:
>> /sys/devices/system/cpu/umwait_control/umwait_max_time
>> The value in the interface is in decimal in TSC-quanta. Bits[1:0]
>> are cleared when the value is stored.
>>
>> Signed-off-by: Fenghua Yu <[email protected]>
>> ---
>> arch/x86/include/asm/msr-index.h | 2 ++
>> arch/x86/power/umwait.c | 42 +++++++++++++++++++++++++++++++-
>> 2 files changed, 43 insertions(+), 1 deletion(-)
>>
>> diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
>> index b56bfecae0de..42b9104fc15b 100644
>> --- a/arch/x86/include/asm/msr-index.h
>> +++ b/arch/x86/include/asm/msr-index.h
>> @@ -62,6 +62,8 @@
>> #define MSR_IA32_UMWAIT_CONTROL 0xe1
>> #define UMWAIT_CONTROL_C02_BIT 0x0
>> #define UMWAIT_CONTROL_C02_MASK 0x00000001
>> +#define UMWAIT_CONTROL_MAX_TIME_BIT 0x2
>> +#define UMWAIT_CONTROL_MAX_TIME_MASK 0xfffffffc
>>
>> #define MSR_PKG_CST_CONFIG_CONTROL 0x000000e2
>> #define NHM_C3_AUTO_DEMOTE (1UL << 25)
>> diff --git a/arch/x86/power/umwait.c b/arch/x86/power/umwait.c
>> index 95b3867aac1e..4a1a507d3bb7 100644
>> --- a/arch/x86/power/umwait.c
>> +++ b/arch/x86/power/umwait.c
>> @@ -10,6 +10,7 @@
>> #include <asm/msr.h>
>>
>> static int umwait_enable_c0_2 = 1; /* 0: disable C0.2. 1: enable C0.2. */
>> +static u32 umwait_max_time; /* In TSC-quanta. Only bits [31:2] are used. */
>> static DEFINE_MUTEX(umwait_lock);
>>
>> /* Return value that will be used to set umwait control MSR */
>> @@ -20,7 +21,8 @@ static inline u32 umwait_control_val(void)
>> * When bit 0 is 1, C0.2 is disabled. Otherwise, C0.2 is enabled.
>> * So value in bit 0 is opposite of umwait_enable_c0_2.
>> */
>> - return ~umwait_enable_c0_2 & UMWAIT_CONTROL_C02_MASK;
>> + return (~umwait_enable_c0_2 & UMWAIT_CONTROL_C02_MASK) |
>> + umwait_max_time;
>> }
>>
>> static ssize_t umwait_enable_c0_2_show(struct device *dev,
>> @@ -61,8 +63,46 @@ static ssize_t umwait_enable_c0_2_store(struct device *dev,
>>
>> static DEVICE_ATTR_RW(umwait_enable_c0_2);
>>
>> +static ssize_t umwait_max_time_show(struct device *kobj,
>> + struct device_attribute *attr, char *buf)
>> +{
>> + return sprintf(buf, "%u\n", umwait_max_time);
>> +}
>> +
>> +static ssize_t umwait_max_time_store(struct device *kobj,
>> + struct device_attribute *attr,
>> + const char *buf, size_t count)
>> +{
>> + u32 msr_val, max_time;
>> + int cpu, ret;
>> +
>> + ret = kstrtou32(buf, 10, &max_time);
>> + if (ret)
>> + return ret;
>> +
>> + mutex_lock(&umwait_lock);
>> +
>> + /* Only get max time value from bits [31:2] */
>> + max_time &= UMWAIT_CONTROL_MAX_TIME_MASK;
>> + /* Update the max time value in memory */
>> + umwait_max_time = max_time;
>> + msr_val = umwait_control_val();
>> + get_online_cpus();
>> + /* All CPUs have same umwait max time */
>> + for_each_online_cpu(cpu)
>> + wrmsr_on_cpu(cpu, MSR_IA32_UMWAIT_CONTROL, msr_val, 0);
>> + put_online_cpus();
>> +
>> + mutex_unlock(&umwait_lock);
>> +
>> + return count;
>> +}
>> +
>> +static DEVICE_ATTR_RW(umwait_max_time);
>> +
>> static struct attribute *umwait_attrs[] = {
>> &dev_attr_umwait_enable_c0_2.attr,
>> + &dev_attr_umwait_max_time.attr,
>> NULL
>> };
> You need something to make sure that newly onlined CPUs get the right
> value in the MSR. You also need to make sure you restore it on resume
> from suspend. Something like cpu_init() might be the right place.
>
> Also, as previously discussed, I think we should set the default to
> something quite small, maybe 100 microseconds. IMO the goal is to
> pick a value that is a high enough multiple of the C0.2 entry+exit
> latency that we get most of the power and SMT resource savings while
> being small enough that no one things that UMWAIT is more than a
> glorified, slightly improved, and far more misleading version of REP
> NOP.
>
> Andrew, would having Linux default to a small value do much to
> mitigate your concerns that UMWAIT is problematic for hypervisors?

Sadly no - not really.

Being an MSR, there is no way the guest kernel is having unfiltered
access, so the hypervisor can set whatever bound it wishes.

For any non-trivial wait period, it would be better for the system as a
whole to switch to a different vcpu, but the semantics don't allow for
that.  Shortening the timeout just results in userspace taking over
again, and most likely concluding that there was an early wakeup and
going back to sleep.

More useful semantics would be something similar to pause-loop-exiting
so we can swap contexts while the processor is logically idle in userspace.

~Andrew

2019-01-21 09:13:03

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH v2 3/3] x86/umwait: Control umwait maximum time



> On Jan 20, 2019, at 11:12 AM, Andrew Cooper <[email protected]> wrote:
>
>> On 17/01/2019 00:00, Andy Lutomirski wrote:
>>> On Wed, Jan 16, 2019 at 1:24 PM Fenghua Yu <[email protected]> wrote:
>>> IA32_UMWAIT_CONTROL[31:2] determines the maximum time in TSC-quanta
>>> that processor can stay in C0.1 or C0.2.
>>>
>>> The maximum time value in IA32_UMWAIT_CONTROL[31-2] is set as zero which
>>> means there is no global time limit for UMWAIT and TPAUSE instructions.
>>> Each process sets its own umwait maximum time as the instructions operand.
>>>
>>> User can specify global umwait maximum time through interface:
>>> /sys/devices/system/cpu/umwait_control/umwait_max_time
>>> The value in the interface is in decimal in TSC-quanta. Bits[1:0]
>>> are cleared when the value is stored.
>>>
>>> Signed-off-by: Fenghua Yu <[email protected]>
>>> ---
>>> arch/x86/include/asm/msr-index.h | 2 ++
>>> arch/x86/power/umwait.c | 42 +++++++++++++++++++++++++++++++-
>>> 2 files changed, 43 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
>>> index b56bfecae0de..42b9104fc15b 100644
>>> --- a/arch/x86/include/asm/msr-index.h
>>> +++ b/arch/x86/include/asm/msr-index.h
>>> @@ -62,6 +62,8 @@
>>> #define MSR_IA32_UMWAIT_CONTROL 0xe1
>>> #define UMWAIT_CONTROL_C02_BIT 0x0
>>> #define UMWAIT_CONTROL_C02_MASK 0x00000001
>>> +#define UMWAIT_CONTROL_MAX_TIME_BIT 0x2
>>> +#define UMWAIT_CONTROL_MAX_TIME_MASK 0xfffffffc
>>>
>>> #define MSR_PKG_CST_CONFIG_CONTROL 0x000000e2
>>> #define NHM_C3_AUTO_DEMOTE (1UL << 25)
>>> diff --git a/arch/x86/power/umwait.c b/arch/x86/power/umwait.c
>>> index 95b3867aac1e..4a1a507d3bb7 100644
>>> --- a/arch/x86/power/umwait.c
>>> +++ b/arch/x86/power/umwait.c
>>> @@ -10,6 +10,7 @@
>>> #include <asm/msr.h>
>>>
>>> static int umwait_enable_c0_2 = 1; /* 0: disable C0.2. 1: enable C0.2. */
>>> +static u32 umwait_max_time; /* In TSC-quanta. Only bits [31:2] are used. */
>>> static DEFINE_MUTEX(umwait_lock);
>>>
>>> /* Return value that will be used to set umwait control MSR */
>>> @@ -20,7 +21,8 @@ static inline u32 umwait_control_val(void)
>>> * When bit 0 is 1, C0.2 is disabled. Otherwise, C0.2 is enabled.
>>> * So value in bit 0 is opposite of umwait_enable_c0_2.
>>> */
>>> - return ~umwait_enable_c0_2 & UMWAIT_CONTROL_C02_MASK;
>>> + return (~umwait_enable_c0_2 & UMWAIT_CONTROL_C02_MASK) |
>>> + umwait_max_time;
>>> }
>>>
>>> static ssize_t umwait_enable_c0_2_show(struct device *dev,
>>> @@ -61,8 +63,46 @@ static ssize_t umwait_enable_c0_2_store(struct device *dev,
>>>
>>> static DEVICE_ATTR_RW(umwait_enable_c0_2);
>>>
>>> +static ssize_t umwait_max_time_show(struct device *kobj,
>>> + struct device_attribute *attr, char *buf)
>>> +{
>>> + return sprintf(buf, "%u\n", umwait_max_time);
>>> +}
>>> +
>>> +static ssize_t umwait_max_time_store(struct device *kobj,
>>> + struct device_attribute *attr,
>>> + const char *buf, size_t count)
>>> +{
>>> + u32 msr_val, max_time;
>>> + int cpu, ret;
>>> +
>>> + ret = kstrtou32(buf, 10, &max_time);
>>> + if (ret)
>>> + return ret;
>>> +
>>> + mutex_lock(&umwait_lock);
>>> +
>>> + /* Only get max time value from bits [31:2] */
>>> + max_time &= UMWAIT_CONTROL_MAX_TIME_MASK;
>>> + /* Update the max time value in memory */
>>> + umwait_max_time = max_time;
>>> + msr_val = umwait_control_val();
>>> + get_online_cpus();
>>> + /* All CPUs have same umwait max time */
>>> + for_each_online_cpu(cpu)
>>> + wrmsr_on_cpu(cpu, MSR_IA32_UMWAIT_CONTROL, msr_val, 0);
>>> + put_online_cpus();
>>> +
>>> + mutex_unlock(&umwait_lock);
>>> +
>>> + return count;
>>> +}
>>> +
>>> +static DEVICE_ATTR_RW(umwait_max_time);
>>> +
>>> static struct attribute *umwait_attrs[] = {
>>> &dev_attr_umwait_enable_c0_2.attr,
>>> + &dev_attr_umwait_max_time.attr,
>>> NULL
>>> };
>> You need something to make sure that newly onlined CPUs get the right
>> value in the MSR. You also need to make sure you restore it on resume
>> from suspend. Something like cpu_init() might be the right place.
>>
>> Also, as previously discussed, I think we should set the default to
>> something quite small, maybe 100 microseconds. IMO the goal is to
>> pick a value that is a high enough multiple of the C0.2 entry+exit
>> latency that we get most of the power and SMT resource savings while
>> being small enough that no one things that UMWAIT is more than a
>> glorified, slightly improved, and far more misleading version of REP
>> NOP.
>>
>> Andrew, would having Linux default to a small value do much to
>> mitigate your concerns that UMWAIT is problematic for hypervisors?
>
> Sadly no - not really.
>
> Being an MSR, there is no way the guest kernel is having unfiltered
> access, so the hypervisor can set whatever bound it wishes.
>
> For any non-trivial wait period, it would be better for the system as a
> whole to switch to a different vcpu, but the semantics don't allow for
> that. Shortening the timeout just results in userspace taking over
> again, and most likely concluding that there was an early wakeup and
> going back to sleep.

What I mean is: if Linux makes the timeout short for everyone, then applications that use UNWAIT will have to be written with the expectation that they are spinning, so the incidence of problematic cases may drop.

>
> More useful semantics would be something similar to pause-loop-exiting
> so we can swap contexts while the processor is logically idle in userspace.
>
> ~Andrew

2019-02-08 18:53:04

by Fenghua Yu

[permalink] [raw]
Subject: RE: [PATCH v2 3/3] x86/umwait: Control umwait maximum time

> From: Andrew Cooper [mailto:[email protected]]
> On 17/01/2019 00:00, Andy Lutomirski wrote:
> > On Wed, Jan 16, 2019 at 1:24 PM Fenghua Yu <[email protected]>
> wrote:
> >> IA32_UMWAIT_CONTROL[31:2] determines the maximum time in TSC-
> quanta
> >> that processor can stay in C0.1 or C0.2.
> >>
> >> The maximum time value in IA32_UMWAIT_CONTROL[31-2] is set as zero
> >> which means there is no global time limit for UMWAIT and TPAUSE
> instructions.
> >> Each process sets its own umwait maximum time as the instructions
> operand.
> >>
> >> User can specify global umwait maximum time through interface:
> >> /sys/devices/system/cpu/umwait_control/umwait_max_time
> >> The value in the interface is in decimal in TSC-quanta. Bits[1:0] are
> >> cleared when the value is stored.
> >>
> >> Signed-off-by: Fenghua Yu <[email protected]>
> >> ---
> >> arch/x86/include/asm/msr-index.h | 2 ++
> >> arch/x86/power/umwait.c | 42
> +++++++++++++++++++++++++++++++-
> >> 2 files changed, 43 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/arch/x86/include/asm/msr-index.h
> >> b/arch/x86/include/asm/msr-index.h
> >> index b56bfecae0de..42b9104fc15b 100644
> >> --- a/arch/x86/include/asm/msr-index.h
> >> +++ b/arch/x86/include/asm/msr-index.h
> >> @@ -62,6 +62,8 @@
> >> #define MSR_IA32_UMWAIT_CONTROL 0xe1
> >> #define UMWAIT_CONTROL_C02_BIT 0x0
> >> #define UMWAIT_CONTROL_C02_MASK 0x00000001
> >> +#define UMWAIT_CONTROL_MAX_TIME_BIT 0x2
> >> +#define UMWAIT_CONTROL_MAX_TIME_MASK 0xfffffffc
> >>
> >> #define MSR_PKG_CST_CONFIG_CONTROL 0x000000e2
> >> #define NHM_C3_AUTO_DEMOTE (1UL << 25)
> >> diff --git a/arch/x86/power/umwait.c b/arch/x86/power/umwait.c index
> >> 95b3867aac1e..4a1a507d3bb7 100644
> >> --- a/arch/x86/power/umwait.c
> >> +++ b/arch/x86/power/umwait.c
> >> @@ -10,6 +10,7 @@
> >> #include <asm/msr.h>
> >>
> >> static int umwait_enable_c0_2 = 1; /* 0: disable C0.2. 1: enable
> >> C0.2. */
> >> +static u32 umwait_max_time; /* In TSC-quanta. Only bits [31:2] are
> >> +used. */
> >> static DEFINE_MUTEX(umwait_lock);
> >>
> >> /* Return value that will be used to set umwait control MSR */ @@
> >> -20,7 +21,8 @@ static inline u32 umwait_control_val(void)
> >> * When bit 0 is 1, C0.2 is disabled. Otherwise, C0.2 is enabled.
> >> * So value in bit 0 is opposite of umwait_enable_c0_2.
> >> */
> >> - return ~umwait_enable_c0_2 & UMWAIT_CONTROL_C02_MASK;
> >> + return (~umwait_enable_c0_2 & UMWAIT_CONTROL_C02_MASK) |
> >> + umwait_max_time;
> >> }
> >>
> >> static ssize_t umwait_enable_c0_2_show(struct device *dev, @@ -61,8
> >> +63,46 @@ static ssize_t umwait_enable_c0_2_store(struct device *dev,
> >>
> >> static DEVICE_ATTR_RW(umwait_enable_c0_2);
> >>
> >> +static ssize_t umwait_max_time_show(struct device *kobj,
> >> + struct device_attribute *attr,
> >> +char *buf) {
> >> + return sprintf(buf, "%u\n", umwait_max_time); }
> >> +
> >> +static ssize_t umwait_max_time_store(struct device *kobj,
> >> + struct device_attribute *attr,
> >> + const char *buf, size_t count) {
> >> + u32 msr_val, max_time;
> >> + int cpu, ret;
> >> +
> >> + ret = kstrtou32(buf, 10, &max_time);
> >> + if (ret)
> >> + return ret;
> >> +
> >> + mutex_lock(&umwait_lock);
> >> +
> >> + /* Only get max time value from bits [31:2] */
> >> + max_time &= UMWAIT_CONTROL_MAX_TIME_MASK;
> >> + /* Update the max time value in memory */
> >> + umwait_max_time = max_time;
> >> + msr_val = umwait_control_val();
> >> + get_online_cpus();
> >> + /* All CPUs have same umwait max time */
> >> + for_each_online_cpu(cpu)
> >> + wrmsr_on_cpu(cpu, MSR_IA32_UMWAIT_CONTROL, msr_val,
> 0);
> >> + put_online_cpus();
> >> +
> >> + mutex_unlock(&umwait_lock);
> >> +
> >> + return count;
> >> +}
> >> +
> >> +static DEVICE_ATTR_RW(umwait_max_time);
> >> +
> >> static struct attribute *umwait_attrs[] = {
> >> &dev_attr_umwait_enable_c0_2.attr,
> >> + &dev_attr_umwait_max_time.attr,
> >> NULL
> >> };
> > You need something to make sure that newly onlined CPUs get the right
> > value in the MSR. You also need to make sure you restore it on resume
> > from suspend. Something like cpu_init() might be the right place.
> >
> > Also, as previously discussed, I think we should set the default to
> > something quite small, maybe 100 microseconds. IMO the goal is to
> > pick a value that is a high enough multiple of the C0.2 entry+exit
> > latency that we get most of the power and SMT resource savings while
> > being small enough that no one things that UMWAIT is more than a
> > glorified, slightly improved, and far more misleading version of REP
> > NOP.
> >
> > Andrew, would having Linux default to a small value do much to
> > mitigate your concerns that UMWAIT is problematic for hypervisors?
>
> Sadly no - not really.
>
> Being an MSR, there is no way the guest kernel is having unfiltered access,
> so the hypervisor can set whatever bound it wishes.
>
> For any non-trivial wait period, it would be better for the system as a whole
> to switch to a different vcpu, but the semantics don't allow for
> that.  Shortening the timeout just results in userspace taking over again,
> and most likely concluding that there was an early wakeup and going back
> to sleep.
>
> More useful semantics would be something similar to pause-loop-exiting so
> we can swap contexts while the processor is logically idle in userspace.

So do we still keep the umwait max time out value as 0 which means there is no global time out for umwait?
Sys admin can always change it to different time out based on usage.

BTW, latency exiting from umwait/tpause varies depending on sleeping in C0.1 or C0.2 states.
On machine, it shows a few cycles to hundreds cycles. But I guess it could be different on different machine as well. So I guess it's hard to get a uniform latency value and use it.

Thanks.

-Fenghua


2019-02-21 06:38:27

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH v2 1/3] x86/cpufeatures: Enumerate user wait instructions

On Wed, Feb 20, 2019 at 7:44 PM Tao Xu <[email protected]> wrote:
>
> From: Fenghua Yu <[email protected]>

>
> From patchwork Wed Jan 16 21:18:41 2019
> Content-Type: text/plain; charset="utf-8"

[snipped more stuff like this]

What happened here?

> +/* Return value that will be used to set umwait control MSR */
> +static inline u32 umwait_control_val(void)
> +{
> + /*
> + * Enable or disable C0.2 (bit 0) based on global setting on all CPUs.
> + * When bit 0 is 1, C0.2 is disabled. Otherwise, C0.2 is enabled.
> + * So value in bit 0 is opposite of umwait_enable_c0_2.
> + */
> + return ~umwait_enable_c0_2 & UMWAIT_CONTROL_C02_MASK;
> +}

This function is horribly named. How about something like
umwait_compute_msr_value() or something liek that? Also, what
happened to the maximum wait time?

> +
> +static ssize_t umwait_enable_c0_2_show(struct device *dev,
> + struct device_attribute *attr,
> + char *buf)
> +{
> + return sprintf(buf, "%d\n", umwait_enable_c0_2);

I realize that it's traditional to totally ignore races in sysfs and
such, but it's a bad tradition. Please either READ_ONCE it with a
comment or take the mutex.

> +static ssize_t umwait_enable_c0_2_store(struct device *dev,
> + struct device_attribute *attr,
> + const char *buf, size_t count)
> +{
> + int enable_c0_2, cpu, ret;
> + u32 msr_val;
> +
> + ret = kstrtou32(buf, 10, &enable_c0_2);
> + if (ret)
> + return ret;
> +
> + if (enable_c0_2 != 1 && enable_c0_2 != 0)
> + return -EINVAL;

How about if (enable_c0_2 > 1)?

> +
> + mutex_lock(&umwait_lock);
> +
> + umwait_enable_c0_2 = enable_c0_2;
> + msr_val = umwait_control_val();
> + get_online_cpus();
> + /* All CPUs have same umwait control setting */
> + for_each_online_cpu(cpu)
> + wrmsr_on_cpu(cpu, MSR_IA32_UMWAIT_CONTROL, msr_val, 0);
> + put_online_cpus();
> +
> + mutex_unlock(&umwait_lock);

Please factor this thing out into a helper like
umwait_update_all_cpus(). That helper can assert that the lock is
held.

> +/* Set up umwait control MSR on this CPU using the current global setting. */
> +static int umwait_cpu_online(unsigned int cpu)
> +{
> + u32 msr_val;
> +
> + mutex_lock(&umwait_lock);
> +
> + msr_val = umwait_control_val();
> + wrmsr(MSR_IA32_UMWAIT_CONTROL, msr_val, 0);
> +
> + mutex_unlock(&umwait_lock);
> +
> + return 0;
> +}
> +
> +static int __init umwait_init(void)
> +{
> + struct device *dev;
> + int ret;
> +
> + if (!boot_cpu_has(X86_FEATURE_WAITPKG))
> + return -ENODEV;
> +
> + /* Add CPU global user wait interface to control umwait. */
> + dev = cpu_subsys.dev_root;
> + ret = sysfs_create_group(&dev->kobj, &umwait_attr_group);
> + if (ret)
> + return ret;
> +
> + ret = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "umwait/intel:online",
> + umwait_cpu_online, NULL);

This hotplug notifier thing is awful. Thomas, do we have a function
that gets called every time a CPU is brought up (via BSP boot, AP
boot, hotplug, hibernation resume, etc) where we can just put all
these things? cpu_init() is almost appropriate, except that it's
called at somewhat erratic times (quite different for BSP and AP IIRC)
and it's not called AFAICT during hibernation restore. I suppose we
could add a new thing that is called by cpu_init() and
restore_processor_state().

Also, surely you should actually write the MSR in this function, too.

>
> static int umwait_enable_c0_2 = 1; /* 0: disable C0.2. 1: enable C0.2. */
> +static u32 umwait_max_time; /* In TSC-quanta. Only bits [31:2] are used. */

I still think the default should be some reasonable nonzero value. It
should be long enough that we get decent C0.2 residency and short
enough that UMWAIT never gives the impression that it is anything
other than a fancy way to save a bit of power and SMT resources when
spinning. I don't want to see a situation where some library uses
UMWAIT under the expectation that it genuinely waits for an event,
appears to work well on bare metal on an otherwise idle system, and
falls apart when it's run in a VM guest or with other software
running. IOW, programs more or less must be written to expect many
spurious wakeups, so I think we should pick a default value so that
there are essentially always many spurious wakeups.

As a guess, I think that the default wait time should be well under 1
ms but at least 20x the C0.2 entry+exit latency.

--Andy
> static ssize_t umwait_enable_c0_2_show(struct device *dev,
> @@ -61,8 +63,46 @@ static ssize_t umwait_enable_c0_2_store(struct device *dev,
>
> static DEVICE_ATTR_RW(umwait_enable_c0_2);
>
> +static ssize_t umwait_max_time_show(struct device *kobj,
> + struct device_attribute *attr, char *buf)
> +{
> + return sprintf(buf, "%u\n", umwait_max_time);
> +}
> +
> +static ssize_t umwait_max_time_store(struct device *kobj,
> + struct device_attribute *attr,
> + const char *buf, size_t count)
> +{
> + u32 msr_val, max_time;
> + int cpu, ret;
> +
> + ret = kstrtou32(buf, 10, &max_time);
> + if (ret)
> + return ret;
> +
> + mutex_lock(&umwait_lock);
> +
> + /* Only get max time value from bits [31:2] */
> + max_time &= UMWAIT_CONTROL_MAX_TIME_MASK;
> + /* Update the max time value in memory */
> + umwait_max_time = max_time;
> + msr_val = umwait_control_val();
> + get_online_cpus();
> + /* All CPUs have same umwait max time */
> + for_each_online_cpu(cpu)
> + wrmsr_on_cpu(cpu, MSR_IA32_UMWAIT_CONTROL, msr_val, 0);
> + put_online_cpus();
> +
> + mutex_unlock(&umwait_lock);
> +
> + return count;
> +}
> +
> +static DEVICE_ATTR_RW(umwait_max_time);
> +
> static struct attribute *umwait_attrs[] = {
> &dev_attr_umwait_enable_c0_2.attr,
> + &dev_attr_umwait_max_time.attr,
> NULL
> };
>

2019-02-21 09:24:24

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v2 1/3] x86/cpufeatures: Enumerate user wait instructions

On Wed, Feb 20, 2019 at 10:37:27PM -0800, Andy Lutomirski wrote:
> On Wed, Feb 20, 2019 at 7:44 PM Tao Xu <[email protected]> wrote:

> > +static ssize_t umwait_enable_c0_2_show(struct device *dev,
> > + struct device_attribute *attr,
> > + char *buf)
> > +{
> > + return sprintf(buf, "%d\n", umwait_enable_c0_2);
>
> I realize that it's traditional to totally ignore races in sysfs and
> such, but it's a bad tradition. Please either READ_ONCE it with a
> comment or take the mutex.

Note that when using READ_ONCE(), the other side must use WRITE_ONCE().
Mixed usage is not valid.

2019-02-21 19:31:47

by Fenghua Yu

[permalink] [raw]
Subject: Re: [PATCH v2 1/3] x86/cpufeatures: Enumerate user wait instructions

On Wed, Feb 20, 2019 at 10:37:27PM -0800, Andy Lutomirski wrote:
> On Wed, Feb 20, 2019 at 7:44 PM Tao Xu <[email protected]> wrote:
> >
> > From: Fenghua Yu <[email protected]>
>
> >
> > From patchwork Wed Jan 16 21:18:41 2019
> > Content-Type: text/plain; charset="utf-8"
>
> [snipped more stuff like this]
>
> What happened here?

I don't know either:(

Tao never talked to me before he sent out this patch set. And the email
format is totally wrong and he didn't change any code in this patch set.

I have no idea why he sent out this patch set. As of now I haven't got
any response from him yet.

I believe he made a mistake here.

>
> > +/* Return value that will be used to set umwait control MSR */
> > +static inline u32 umwait_control_val(void)
> > +{
> > + /*
> > + * Enable or disable C0.2 (bit 0) based on global setting on all CPUs.
> > + * When bit 0 is 1, C0.2 is disabled. Otherwise, C0.2 is enabled.
> > + * So value in bit 0 is opposite of umwait_enable_c0_2.
> > + */
> > + return ~umwait_enable_c0_2 & UMWAIT_CONTROL_C02_MASK;
> > +}
>
> This function is horribly named. How about something like
> umwait_compute_msr_value() or something liek that?
OK. Will change the name.

> Also, what
> happened to the maximum wait time?
>
> > +
> > +static ssize_t umwait_enable_c0_2_show(struct device *dev,
> > + struct device_attribute *attr,
> > + char *buf)
> > +{
> > + return sprintf(buf, "%d\n", umwait_enable_c0_2);
>
> I realize that it's traditional to totally ignore races in sysfs and
> such, but it's a bad tradition. Please either READ_ONCE it with a
> comment or take the mutex.

Will change to READ_ONCE.

>
> > +static ssize_t umwait_enable_c0_2_store(struct device *dev,
> > + struct device_attribute *attr,
> > + const char *buf, size_t count)
> > +{
> > + int enable_c0_2, cpu, ret;
> > + u32 msr_val;
> > +
> > + ret = kstrtou32(buf, 10, &enable_c0_2);
> > + if (ret)
> > + return ret;
> > +
> > + if (enable_c0_2 != 1 && enable_c0_2 != 0)
> > + return -EINVAL;
>
> How about if (enable_c0_2 > 1)?
Ok. Will change to this.

>
> > +
> > + mutex_lock(&umwait_lock);
> > +
> > + umwait_enable_c0_2 = enable_c0_2;
> > + msr_val = umwait_control_val();
> > + get_online_cpus();
> > + /* All CPUs have same umwait control setting */
> > + for_each_online_cpu(cpu)
> > + wrmsr_on_cpu(cpu, MSR_IA32_UMWAIT_CONTROL, msr_val, 0);
> > + put_online_cpus();
> > +
> > + mutex_unlock(&umwait_lock);
>
> Please factor this thing out into a helper like
> umwait_update_all_cpus(). That helper can assert that the lock is
> held.
Ok.

>
> > +/* Set up umwait control MSR on this CPU using the current global setting. */
> > +static int umwait_cpu_online(unsigned int cpu)
> > +{
> > + u32 msr_val;
> > +
> > + mutex_lock(&umwait_lock);
> > +
> > + msr_val = umwait_control_val();
> > + wrmsr(MSR_IA32_UMWAIT_CONTROL, msr_val, 0);
> > +
> > + mutex_unlock(&umwait_lock);
> > +
> > + return 0;
> > +}
> > +
> > +static int __init umwait_init(void)
> > +{
> > + struct device *dev;
> > + int ret;
> > +
> > + if (!boot_cpu_has(X86_FEATURE_WAITPKG))
> > + return -ENODEV;
> > +
> > + /* Add CPU global user wait interface to control umwait. */
> > + dev = cpu_subsys.dev_root;
> > + ret = sysfs_create_group(&dev->kobj, &umwait_attr_group);
> > + if (ret)
> > + return ret;
> > +
> > + ret = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "umwait/intel:online",
> > + umwait_cpu_online, NULL);
>
> This hotplug notifier thing is awful. Thomas, do we have a function
> that gets called every time a CPU is brought up (via BSP boot, AP
> boot, hotplug, hibernation resume, etc) where we can just put all
> these things? cpu_init() is almost appropriate, except that it's
> called at somewhat erratic times (quite different for BSP and AP IIRC)
> and it's not called AFAICT during hibernation restore. I suppose we
> could add a new thing that is called by cpu_init() and
> restore_processor_state().
>
> Also, surely you should actually write the MSR in this function, too.
>
> >
> > static int umwait_enable_c0_2 = 1; /* 0: disable C0.2. 1: enable C0.2. */
> > +static u32 umwait_max_time; /* In TSC-quanta. Only bits [31:2] are used. */
>
> I still think the default should be some reasonable nonzero value. It
> should be long enough that we get decent C0.2 residency and short
> enough that UMWAIT never gives the impression that it is anything
> other than a fancy way to save a bit of power and SMT resources when
> spinning. I don't want to see a situation where some library uses
> UMWAIT under the expectation that it genuinely waits for an event,
> appears to work well on bare metal on an otherwise idle system, and
> falls apart when it's run in a VM guest or with other software
> running. IOW, programs more or less must be written to expect many
> spurious wakeups, so I think we should pick a default value so that
> there are essentially always many spurious wakeups.
>
> As a guess, I think that the default wait time should be well under 1
> ms but at least 20x the C0.2 entry+exit latency.

It's hard to agree on a global maximum wait time value as we discussed
before:
1. Andrew doesn't like the short wait time value
2. C0.1/C0.2 entry/exit latency is hardware implementation specific and
may vary from platform to platform. My measurement on one machine is
about hundreds of cycles. It's hard to justify why 1ms or any value is
right value on one machine. Or simply set it as 1ms (converted to tsc).

Or to simplify the situation, how about we still use zero as global max wait
time (i.e. no limitation for global wait time) as implemented in this
patch set? User can still change the value to any value based on their
usage. Isn't this a right way to take care of any usage?

Thanks.

-Fenghua

2019-02-21 22:59:20

by Fenghua Yu

[permalink] [raw]
Subject: RE: [PATCH v2 1/3] x86/cpufeatures: Enumerate user wait instructions

> From: Fenghua Yu [mailto:[email protected]]
> On Wed, Feb 20, 2019 at 10:37:27PM -0800, Andy Lutomirski wrote:
> Or to simplify the situation, how about we still use zero as global max wait
> time (i.e. no limitation for global wait time) as implemented in this patch
> set? User can still change the value to any value based on their usage. Isn't
> this a right way to take care of any usage?

Plus, if scheduler timers works, umwait will be forced time out by timer in HZ anyway.

Only for case without scheduler timer, sysadmin may need to set a small max umwait time to force timout. But in this case (real time), I doubt user app really wants to call umwait to save power with long latency of waking up from umwait. So likely in this case, user app won't call umwait and there is no usage to set a small value for max umwait time.

Thanks.

-Fenghua

2019-02-22 02:11:28

by Tao Xu

[permalink] [raw]
Subject: Re: [PATCH v2 1/3] x86/cpufeatures: Enumerate user wait instructions

On 2/22/2019 3:24 AM, Yu, Fenghua wrote:
> On Wed, Feb 20, 2019 at 10:37:27PM -0800, Andy Lutomirski wrote:
>> On Wed, Feb 20, 2019 at 7:44 PM Tao Xu <[email protected]> wrote:
>>>
>>> From: Fenghua Yu <[email protected]>
>>
>>>
>>> From patchwork Wed Jan 16 21:18:41 2019
>>> Content-Type: text/plain; charset="utf-8"
>>
>> [snipped more stuff like this]
>>
>> What happened here?
>
> I don't know either:(
>
> Tao never talked to me before he sent out this patch set. And the email
> format is totally wrong and he didn't change any code in this patch set.
>
> I have no idea why he sent out this patch set. As of now I haven't got
> any response from him yet.
>
> I believe he made a mistake here.
>
I am really sorry for sending the wrong email. Yesterday I make a
mistake to cc Fenghua's former patch file when I sending other patch.

I'm sorry for this trouble and I will be more careful in my work.

2019-02-24 19:47:48

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH v2 1/3] x86/cpufeatures: Enumerate user wait instructions

On Thu, Feb 21, 2019 at 2:57 PM Yu, Fenghua <[email protected]> wrote:
>
> > From: Fenghua Yu [mailto:[email protected]]
> > On Wed, Feb 20, 2019 at 10:37:27PM -0800, Andy Lutomirski wrote:
> > Or to simplify the situation, how about we still use zero as global max wait
> > time (i.e. no limitation for global wait time) as implemented in this patch
> > set? User can still change the value to any value based on their usage. Isn't
> > this a right way to take care of any usage?
>
> Plus, if scheduler timers works, umwait will be forced time out by timer in HZ anyway.

Indeed. But, on some configs, the scheduler timer intentionally will
*not* fire.

>
> Only for case without scheduler timer, sysadmin may need to set a small max umwait time to force timout. But in this case (real time), I doubt user app really wants to call umwait to save power with long latency of waking up from umwait. So likely in this case, user app won't call umwait and there is no usage to set a small value for max umwait time.

I don't really know why an application would use umwait at all. About
the only legitimate use I can think of is to treat it as a somewhat
power-saving variant of REP NOP. What I want to avoid is the case
where it works dramatically differently on NO_HZ_FULL systems as
compared to everything else. Also, UMWAIT may behave a bit
differently if the max timeout is hit, and I'd like that path to get
exercised widely by making it happen even on default configs.

So I propose setting the timeout to either 100 microseconds or 100k
"cycles" by default. In the event someone determines that they save
materially more power or gets materially better performance with a
longer timeout, we can revisit the value.

--Andy

2019-02-26 20:01:18

by Fenghua Yu

[permalink] [raw]
Subject: Re: [PATCH v2 1/3] x86/cpufeatures: Enumerate user wait instructions

On Sun, Feb 24, 2019 at 11:45:35AM -0800, Andy Lutomirski wrote:
> On Thu, Feb 21, 2019 at 2:57 PM Yu, Fenghua <[email protected]> wrote:
> >
> > > From: Fenghua Yu [mailto:[email protected]]
> > > On Wed, Feb 20, 2019 at 10:37:27PM -0800, Andy Lutomirski wrote:
> > > Or to simplify the situation, how about we still use zero as global max wait
> > > time (i.e. no limitation for global wait time) as implemented in this patch
> > > set? User can still change the value to any value based on their usage. Isn't
> > > this a right way to take care of any usage?
> >
> > Plus, if scheduler timers works, umwait will be forced time out by timer in HZ anyway.
>
> Indeed. But, on some configs, the scheduler timer intentionally will
> *not* fire.
>
> >
> > Only for case without scheduler timer, sysadmin may need to set a small max umwait time to force timout. But in this case (real time), I doubt user app really wants to call umwait to save power with long latency of waking up from umwait. So likely in this case, user app won't call umwait and there is no usage to set a small value for max umwait time.
>
> I don't really know why an application would use umwait at all. About
> the only legitimate use I can think of is to treat it as a somewhat
> power-saving variant of REP NOP. What I want to avoid is the case
> where it works dramatically differently on NO_HZ_FULL systems as
> compared to everything else. Also, UMWAIT may behave a bit
> differently if the max timeout is hit, and I'd like that path to get
> exercised widely by making it happen even on default configs.
>
> So I propose setting the timeout to either 100 microseconds or 100k
> "cycles" by default. In the event someone determines that they save
> materially more power or gets materially better performance with a
> longer timeout, we can revisit the value.

OK. I will set the default umwait max time value to 100K cycles.

Thanks.

-Fenghua

2019-02-26 20:51:10

by Fenghua Yu

[permalink] [raw]
Subject: Re: [PATCH v2 1/3] x86/cpufeatures: Enumerate user wait instructions

On Wed, Feb 20, 2019 at 10:37:27PM -0800, Andy Lutomirski wrote:
> On Wed, Feb 20, 2019 at 7:44 PM Tao Xu <[email protected]> wrote:
> >
> > From: Fenghua Yu <[email protected]>
>
> >
> > From patchwork Wed Jan 16 21:18:41 2019
> > Content-Type: text/plain; charset="utf-8"
>
> [snipped more stuff like this]
>
> What happened here?
>
> > +/* Return value that will be used to set umwait control MSR */
> > +static inline u32 umwait_control_val(void)
> > +{
> > + /*
> > + * Enable or disable C0.2 (bit 0) based on global setting on all CPUs.
> > + * When bit 0 is 1, C0.2 is disabled. Otherwise, C0.2 is enabled.
> > + * So value in bit 0 is opposite of umwait_enable_c0_2.
> > + */
> > + return ~umwait_enable_c0_2 & UMWAIT_CONTROL_C02_MASK;
> > +}
>
> This function is horribly named. How about something like
> umwait_compute_msr_value() or something liek that? Also, what
> happened to the maximum wait time?
>
> > +
> > +static ssize_t umwait_enable_c0_2_show(struct device *dev,
> > + struct device_attribute *attr,
> > + char *buf)
> > +{
> > + return sprintf(buf, "%d\n", umwait_enable_c0_2);
>
> I realize that it's traditional to totally ignore races in sysfs and
> such, but it's a bad tradition. Please either READ_ONCE it with a
> comment or take the mutex.
>
> > +static ssize_t umwait_enable_c0_2_store(struct device *dev,
> > + struct device_attribute *attr,
> > + const char *buf, size_t count)
> > +{
> > + int enable_c0_2, cpu, ret;
> > + u32 msr_val;
> > +
> > + ret = kstrtou32(buf, 10, &enable_c0_2);
> > + if (ret)
> > + return ret;
> > +
> > + if (enable_c0_2 != 1 && enable_c0_2 != 0)
> > + return -EINVAL;
>
> How about if (enable_c0_2 > 1)?
>
> > +
> > + mutex_lock(&umwait_lock);
> > +
> > + umwait_enable_c0_2 = enable_c0_2;
> > + msr_val = umwait_control_val();
> > + get_online_cpus();
> > + /* All CPUs have same umwait control setting */
> > + for_each_online_cpu(cpu)
> > + wrmsr_on_cpu(cpu, MSR_IA32_UMWAIT_CONTROL, msr_val, 0);
> > + put_online_cpus();
> > +
> > + mutex_unlock(&umwait_lock);
>
> Please factor this thing out into a helper like
> umwait_update_all_cpus(). That helper can assert that the lock is
> held.
>
> > +/* Set up umwait control MSR on this CPU using the current global setting. */
> > +static int umwait_cpu_online(unsigned int cpu)
> > +{
> > + u32 msr_val;
> > +
> > + mutex_lock(&umwait_lock);
> > +
> > + msr_val = umwait_control_val();
> > + wrmsr(MSR_IA32_UMWAIT_CONTROL, msr_val, 0);
> > +
> > + mutex_unlock(&umwait_lock);
> > +
> > + return 0;
> > +}
> > +
> > +static int __init umwait_init(void)
> > +{
> > + struct device *dev;
> > + int ret;
> > +
> > + if (!boot_cpu_has(X86_FEATURE_WAITPKG))
> > + return -ENODEV;
> > +
> > + /* Add CPU global user wait interface to control umwait. */
> > + dev = cpu_subsys.dev_root;
> > + ret = sysfs_create_group(&dev->kobj, &umwait_attr_group);
> > + if (ret)
> > + return ret;
> > +
> > + ret = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "umwait/intel:online",
> > + umwait_cpu_online, NULL);
>
> This hotplug notifier thing is awful. Thomas, do we have a function
> that gets called every time a CPU is brought up (via BSP boot, AP
> boot, hotplug, hibernation resume, etc) where we can just put all
> these things? cpu_init() is almost appropriate, except that it's
> called at somewhat erratic times (quite different for BSP and AP IIRC)
> and it's not called AFAICT during hibernation restore. I suppose we
> could add a new thing that is called by cpu_init() and
> restore_processor_state().
>
> Also, surely you should actually write the MSR in this function, too.

Seems the current patch set misses pm_notifier for hibernation on BSP.
All APs are all updated by the online funciton in the current patch set.
If adding hiberation pm_notifier to update MSR 0xe1 on BSP, is that good
enough?

Thanks.

-Fenghua