2018-06-17 00:34:59

by Keno Fischer

Subject: [RFC PATCH] x86/arch_prctl: Add ARCH_SET_XCR0 to mask XCR0 per-thread

From: Keno Fischer <[email protected]>

The rr (http://rr-project.org/) debugger provides user space
record-and-replay functionality by carefully controlling the process
environment in order to ensure completely deterministic execution
of recorded traces. The recently added ARCH_SET_CPUID arch_prctl
allows rr to move traces across (Intel) machines, by allowing cpuid
invocations to be reliably recorded and replayed. This works very
well, with one catch: It is currently not possible to replay a
recording from a machine supporting a smaller set of XCR0 state
components on one supporting a larger set. This is because the
value of XCR0 is observable in userspace (either by explicit
xgetbv or by looking at the result of xsave) and since glibc
does observe this value, replay divergence is almost immediate.
I also suspect that people interested in process (or container)
live-migration may eventually care about this if a migration happens
in between a userspace xsave and a corresponding xrstor.

We encounter this problem quite frequently since most of our users
are using pre-Skylake systems (and thus don't support the AVX512
state components), while we recently upgraded our main development
machines to Skylake.

This patch attempts to provide a way to disable XCR0 state components
on a per-thread basis, such that rr may use this feature to emulate
the recording machine's XCR0 for the replayed process.

We do this by providing a new ARCH_SET_XCR0 arch_prctl that takes
as its argument the desired XCR0 value. The system call fails if:
- XSAVE is not available
- The user attempts to enable a state component that would not regularly
be enabled by the kernel.
- The value of XCR0 is illegal (in the sense that setting it would
cause a fault).
- Any state component being disabled is not in its initial state.

The last restriction is debatable, but it seemed sensible to me,
since it means the kernel need not be in the business of trying to
maintain the disabled state components when the ordinary xsave/xrstor
will no longer save/restore them.

The patch does not currently provide a corresponding ARCH_GET_XCR0,
since the `xgetbv` instruction fulfills this purpose. However, we
may want to provide this for consistency. It may also be useful (and
perhaps more useful) to provide a mechanism to obtain the kernel's
XCR0 (i.e. the upper bound on which bits are allowed to be set through
this interface).

On the kernel side, this value is stored as a mask, rather than a value
to set. The thought behind this was to be defensive about new state
components being added but forgetting to update the validity check
in the arch_prctl. However, getting this wrong in either direction
is probably bad (e.g. if Intel added an AVX1024 extension that always
required AVX512, then this would be the wrong way to be defensive about
it), so I'd appreciate some advice on which would be preferred.

Please take this patch as an RFC that I hope is sufficient to enable
discussion about this feature, rather than a complete patch.
In particular, this patch is missing:
- Cleanup
- A selftest exercising all the corner cases
- There's code sharing opportunities with KVM (which has to provide
similar functionality for virtual machines modifying XCR0), which
I did not take advantage of in this patch to keep the changes local.
A full patch would probably involve some refactoring there.

There is one remaining TODO in the code, which has to do with the
xcomp_bv xsave header. The `xrstors` instruction requires that
no bits in this header field be set that are not active in XCR0.
However, this unfortunately means that changing the value of XCR0
can change the layout of the compacted xsave area, which so far the
kernel does not do anywhere else (except for some handling in the
KVM context). For my use case, it would be sufficient to simply disallow
any value of XCR0 with "holes" in it, that would change the layout
to anything other than a strict prefix, but I would appreciate
hearing any alternative solutions you may have.

Please also let me know if I missed a fundamental way in which this
causes a problem. I realize that this is a very sensitive part of
the kernel code. Since this patch changes XCR0, any xsave/xrstor or any
use of XCR0-enabled features in kernel space, not guarded by
kernel_fpu_begin() (which I modified for this patch set), would cause
a problem. I don't have a sufficiently wide view of the kernel
to know if there are any such uses, so please let me know if I missed
something.

Signed-off-by: Keno Fischer <[email protected]>
---
arch/x86/include/asm/fpu/xstate.h | 1 +
arch/x86/include/asm/processor.h | 2 +
arch/x86/include/asm/thread_info.h | 5 +-
arch/x86/include/uapi/asm/prctl.h | 2 +
arch/x86/kernel/fpu/core.c | 10 ++-
arch/x86/kernel/fpu/xstate.c | 3 +-
arch/x86/kernel/process.c | 125 ++++++++++++++++++++++++++++++++++++-
arch/x86/kernel/process_64.c | 16 ++---
8 files changed, 152 insertions(+), 12 deletions(-)

diff --git a/arch/x86/include/asm/fpu/xstate.h b/arch/x86/include/asm/fpu/xstate.h
index 48581988d78c..d8e5547ec5b6 100644
--- a/arch/x86/include/asm/fpu/xstate.h
+++ b/arch/x86/include/asm/fpu/xstate.h
@@ -47,6 +47,7 @@ extern void __init update_regset_xstate_info(unsigned int size,

void fpu__xstate_clear_all_cpu_caps(void);
void *get_xsave_addr(struct xregs_state *xsave, int xstate);
+int xfeature_size(int xfeature_nr);
const void *get_xsave_field_ptr(int xstate_field);
int using_compacted_format(void);
int copy_xstate_to_kernel(void *kbuf, struct xregs_state *xsave, unsigned int offset, unsigned int size);
diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index e28add6b791f..60d54731af66 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -479,6 +479,8 @@ struct thread_struct {
unsigned long debugreg6;
/* Keep track of the exact dr7 value set by the user */
unsigned long ptrace_dr7;
+ /* Keeps track of which XCR0 bits the user wants masked out */
+ unsigned long xcr0_mask;
/* Fault info: */
unsigned long cr2;
unsigned long trap_nr;
diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h
index 2ff2a30a264f..e5f928f8ef93 100644
--- a/arch/x86/include/asm/thread_info.h
+++ b/arch/x86/include/asm/thread_info.h
@@ -86,6 +86,7 @@ struct thread_info {
#define TIF_USER_RETURN_NOTIFY 11 /* notify kernel of userspace return */
#define TIF_UPROBE 12 /* breakpointed or singlestepping */
#define TIF_PATCH_PENDING 13 /* pending live patching update */
+#define TIF_MASKXCR0 14 /* XCR0 is masked in this thread */
#define TIF_NOCPUID 15 /* CPUID is not accessible in userland */
#define TIF_NOTSC 16 /* TSC is not accessible in userland */
#define TIF_IA32 17 /* IA32 compatibility process */
@@ -113,6 +114,7 @@ struct thread_info {
#define _TIF_USER_RETURN_NOTIFY (1 << TIF_USER_RETURN_NOTIFY)
#define _TIF_UPROBE (1 << TIF_UPROBE)
#define _TIF_PATCH_PENDING (1 << TIF_PATCH_PENDING)
+#define _TIF_MASKXCR0 (1 << TIF_MASKXCR0)
#define _TIF_NOCPUID (1 << TIF_NOCPUID)
#define _TIF_NOTSC (1 << TIF_NOTSC)
#define _TIF_IA32 (1 << TIF_IA32)
@@ -146,7 +148,8 @@ struct thread_info {

/* flags to check in __switch_to() */
#define _TIF_WORK_CTXSW \
- (_TIF_IO_BITMAP|_TIF_NOCPUID|_TIF_NOTSC|_TIF_BLOCKSTEP|_TIF_SSBD)
+ (_TIF_IO_BITMAP|_TIF_NOCPUID|_TIF_NOTSC|_TIF_BLOCKSTEP| \
+ _TIF_SSBD|_TIF_MASKXCR0)

#define _TIF_WORK_CTXSW_PREV (_TIF_WORK_CTXSW|_TIF_USER_RETURN_NOTIFY)
#define _TIF_WORK_CTXSW_NEXT (_TIF_WORK_CTXSW)
diff --git a/arch/x86/include/uapi/asm/prctl.h b/arch/x86/include/uapi/asm/prctl.h
index 5a6aac9fa41f..5a31a8420baa 100644
--- a/arch/x86/include/uapi/asm/prctl.h
+++ b/arch/x86/include/uapi/asm/prctl.h
@@ -10,6 +10,8 @@
#define ARCH_GET_CPUID 0x1011
#define ARCH_SET_CPUID 0x1012

+#define ARCH_SET_XCR0 0x1021
+
#define ARCH_MAP_VDSO_X32 0x2001
#define ARCH_MAP_VDSO_32 0x2002
#define ARCH_MAP_VDSO_64 0x2003
diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
index f92a6593de1e..e8e4150319f8 100644
--- a/arch/x86/kernel/fpu/core.c
+++ b/arch/x86/kernel/fpu/core.c
@@ -101,6 +101,8 @@ void __kernel_fpu_begin(void)
kernel_fpu_disable();

if (fpu->initialized) {
+ if (unlikely(test_thread_flag(TIF_MASKXCR0)))
+ xsetbv(XCR_XFEATURE_ENABLED_MASK, xfeatures_mask);
/*
* Ignore return value -- we don't care if reg state
* is clobbered.
@@ -116,9 +118,15 @@ void __kernel_fpu_end(void)
{
struct fpu *fpu = &current->thread.fpu;

- if (fpu->initialized)
+ if (fpu->initialized) {
copy_kernel_to_fpregs(&fpu->state);

+ if (unlikely(test_thread_flag(TIF_MASKXCR0))) {
+ xsetbv(XCR_XFEATURE_ENABLED_MASK,
+ xfeatures_mask & ~current->thread.xcr0_mask);
+ }
+ }
+
kernel_fpu_enable();
}
EXPORT_SYMBOL(__kernel_fpu_end);
diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index 87a57b7642d3..cb9a9e57feae 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -454,7 +454,7 @@ static int xfeature_uncompacted_offset(int xfeature_nr)
return ebx;
}

-static int xfeature_size(int xfeature_nr)
+int xfeature_size(int xfeature_nr)
{
u32 eax, ebx, ecx, edx;

@@ -462,6 +462,7 @@ static int xfeature_size(int xfeature_nr)
cpuid_count(XSTATE_CPUID, xfeature_nr, &eax, &ebx, &ecx, &edx);
return eax;
}
+EXPORT_SYMBOL_GPL(xfeature_size);

/*
* 'XSAVES' implies two different things:
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index 30ca2d1a9231..4a774602d34a 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -244,6 +244,55 @@ static int set_cpuid_mode(struct task_struct *task, unsigned long cpuid_enabled)
return 0;
}

+static void change_xcr0_mask(unsigned long prev_mask, unsigned long next_mask)
+{
+ unsigned long deactivated_features = next_mask & ~prev_mask;
+
+ if (deactivated_features) {
+ /*
+ * Clear any state components that were active before,
+ * but are not active now (xrstor would not touch
+ * it otherwise, exposing the previous values).
+ */
+ xsetbv(XCR_XFEATURE_ENABLED_MASK, xfeatures_mask);
+ __copy_kernel_to_fpregs(&init_fpstate, deactivated_features);
+ }
+
+ xsetbv(XCR_XFEATURE_ENABLED_MASK, xfeatures_mask & ~next_mask);
+}
+
+void reset_xcr0_mask(void)
+{
+ preempt_disable();
+ if (test_and_clear_thread_flag(TIF_MASKXCR0)) {
+ current->thread.xcr0_mask = 0;
+ xsetbv(XCR_XFEATURE_ENABLED_MASK, xfeatures_mask);
+ }
+ preempt_enable();
+}
+
+void set_xcr0_mask(unsigned long mask)
+{
+ if (mask == 0) {
+ reset_xcr0_mask();
+ } else {
+ struct xregs_state *xsave = &current->thread.fpu.state.xsave;
+
+ preempt_disable();
+
+ change_xcr0_mask(current->thread.xcr0_mask, mask);
+
+ xsave->header.xfeatures = xsave->header.xfeatures & ~mask;
+ /* TODO: We may have to compress the xstate here */
+ xsave->header.xcomp_bv = xsave->header.xcomp_bv & ~mask;
+
+ set_thread_flag(TIF_MASKXCR0);
+ current->thread.xcr0_mask = mask;
+
+ preempt_enable();
+ }
+}
+
/*
* Called immediately after a successful exec.
*/
@@ -252,6 +301,8 @@ void arch_setup_new_exec(void)
/* If cpuid was previously disabled for this task, re-enable it. */
if (test_thread_flag(TIF_NOCPUID))
enable_cpuid();
+ if (test_thread_flag(TIF_MASKXCR0))
+ reset_xcr0_mask();
}

static inline void switch_to_bitmap(struct tss_struct *tss,
@@ -455,6 +506,10 @@ void __switch_to_xtra(struct task_struct *prev_p, struct task_struct *next_p,

if ((tifp ^ tifn) & _TIF_SSBD)
__speculative_store_bypass_update(tifn);
+
+ if ((tifp | tifn) & _TIF_MASKXCR0 &&
+ prev->xcr0_mask != next->xcr0_mask)
+ change_xcr0_mask(prev->xcr0_mask, next->xcr0_mask);
}

/*
@@ -783,14 +838,80 @@ unsigned long get_wchan(struct task_struct *p)
return ret;
}

+static int xcr0_is_legal(unsigned long xcr0)
+{
+ // Conservatively disallow anything above bit 9,
+ // to avoid accidentally allowing the disabling of
+ // new features without updating these checks
+ if (xcr0 & ~((1 << 10) - 1))
+ return 0;
+ if (!(xcr0 & XFEATURE_MASK_FP))
+ return 0;
+ if ((xcr0 & XFEATURE_MASK_YMM) && !(xcr0 & XFEATURE_MASK_SSE))
+ return 0;
+ if ((!(xcr0 & XFEATURE_MASK_BNDREGS)) !=
+ (!(xcr0 & XFEATURE_MASK_BNDCSR)))
+ return 0;
+ if (xcr0 & XFEATURE_MASK_AVX512) {
+ if (!(xcr0 & XFEATURE_MASK_YMM))
+ return 0;
+ if ((xcr0 & XFEATURE_MASK_AVX512) != XFEATURE_MASK_AVX512)
+ return 0;
+ }
+ return 1;
+}
+
+static int xstate_is_initial(unsigned long mask)
+{
+ int i, j;
+ unsigned long max_bit = __ffs(mask);
+
+ for (i = 0; i < max_bit; ++i) {
+ if (mask & (1 << i)) {
+ char *xfeature_addr = (char *)get_xsave_addr(
+ &current->thread.fpu.state.xsave,
+ 1 << i);
+ unsigned long feature_size = xfeature_size(i);
+
+ for (j = 0; j < feature_size; ++j) {
+ if (xfeature_addr[j] != 0)
+ return 0;
+ }
+ }
+ }
+ return 1;
+}
+
long do_arch_prctl_common(struct task_struct *task, int option,
- unsigned long cpuid_enabled)
+ unsigned long arg2)
{
switch (option) {
case ARCH_GET_CPUID:
return get_cpuid_mode();
case ARCH_SET_CPUID:
- return set_cpuid_mode(task, cpuid_enabled);
+ return set_cpuid_mode(task, arg2);
+ case ARCH_SET_XCR0: {
+ unsigned long mask = xfeatures_mask & ~arg2;
+
+ if (!use_xsave())
+ return -ENODEV;
+
+ if (arg2 & ~xfeatures_mask)
+ return -ENODEV;
+
+ if (!xcr0_is_legal(arg2))
+ return -EINVAL;
+
+ /*
+ * We require that any state components being disabled by
+ * this prctl be currently in their initial state.
+ */
+ if (!xstate_is_initial(mask))
+ return -EPERM;
+
+ set_xcr0_mask(mask);
+ return 0;
+ }
}

return -EINVAL;
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index 12bb445fb98d..d220d93f5ffa 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -469,6 +469,15 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
load_seg_legacy(prev->gsindex, prev->gsbase,
next->gsindex, next->gsbase, GS);

+ /*
+ * Now maybe reload the debug registers and handle I/O bitmaps.
+ * N.B.: This may change XCR0 and must thus happen before
+ * `switch_fpu_finish`.
+ */
+ if (unlikely(task_thread_info(next_p)->flags & _TIF_WORK_CTXSW_NEXT ||
+ task_thread_info(prev_p)->flags & _TIF_WORK_CTXSW_PREV))
+ __switch_to_xtra(prev_p, next_p, tss);
+
switch_fpu_finish(next_fpu, cpu);

/*
@@ -480,13 +489,6 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
/* Reload sp0. */
update_sp0(next_p);

- /*
- * Now maybe reload the debug registers and handle I/O bitmaps
- */
- if (unlikely(task_thread_info(next_p)->flags & _TIF_WORK_CTXSW_NEXT ||
- task_thread_info(prev_p)->flags & _TIF_WORK_CTXSW_PREV))
- __switch_to_xtra(prev_p, next_p, tss);
-
#ifdef CONFIG_XEN_PV
/*
* On Xen PV, IOPL bits in pt_regs->flags have no effect, and
--
2.14.1



2018-06-17 16:36:14

by Andi Kleen

Subject: Re: [RFC PATCH] x86/arch_prctl: Add ARCH_SET_XCR0 to mask XCR0 per-thread


Patch seems pointless if you can already control CPUID, which rr
supports. Just disable AVX512 in CPUID. All code using AVX should check
cpuid (or will fail anyways).

-Andi

2018-06-17 16:52:27

by Keno Fischer

Subject: Re: [RFC PATCH] x86/arch_prctl: Add ARCH_SET_XCR0 to mask XCR0 per-thread

> Patch seems pointless if you can already control CPUID, which rr
> supports. Just disable AVX512 in CPUID. All code using AVX should check
> cpuid (or will fail anyways).

Unfortunately, that is insufficient. Almost difference in CPU behavior
between the replayer
and the replayee. In particular, the problems for rr here are
1) xgetbv, which can introduce differing register contents (and if
   code branches on this, potentially differences in control flow).
2) xsave writing more memory than the trace expects, causing
   divergence in the memory contents of the process.

Robert O'Callahan has a blog post that goes into some detail here:
https://robert.ocallahan.org/2018/04/cpuid-features-xsave-and-rr-trace.html

I hope that makes sense. Please let me know if you'd like me to explain the
problem in more detail or give an example. FWIW, I do have a version of rr
that uses the proposed feature in this patch and it does fix our problem with
replaying AVX traces on skylake machines. Without this patch, rr does emulate
the cpuid response of the recording machine, but the replay fails because
of the mentioned issue.

2018-06-17 18:25:02

by Keno Fischer

Subject: Re: [RFC PATCH] x86/arch_prctl: Add ARCH_SET_XCR0 to mask XCR0 per-thread

> Almost difference in CPU behavior
> between the replayer and the replayee.

Not sure what happened to this sentence.
What I meant to say was:

Almost any difference in CPU behavior between
the replayer and the replayee will cause problems
for the determinism of the trace.

To elaborate on this, even a register whose content
differs between the recording and the replay can
cause problems down the line, e.g. if it is spilled
to the stack and that stack address is then re-used later.
In order for rr to work, we basically rely on the CPU
behaving *exactly* the same during the record and the
replay (down to counting performance counters the same).
This works well, but there are corner cases (like this XCR0
one).

2018-06-18 12:48:28

by Dave Hansen

Subject: Re: [RFC PATCH] x86/arch_prctl: Add ARCH_SET_XCR0 to mask XCR0 per-thread

On 06/16/2018 05:33 PM, Keno Fischer wrote:
> For my use case, it would be sufficient to simply disallow
> any value of XCR0 with "holes" in it,
But what if the hardware you are migrating to/from *has* holes?

There's no way this is even close to viable until it has been made to
cope with holes.

FWIW, I just don't think this is going to be viable. I have the feeling
that there's way too much stuff that hard-codes assumptions about XCR0
inside the kernel and out. This is just going to make it much more fragile.

Folks that want this level of container migration are probably better
off running one of the hardware-based containers and migrating _those_.
Or, just ensuring the places to/from they want to migrate have a
homogeneous XCR0 mix.

> @@ -252,6 +301,8 @@ void arch_setup_new_exec(void)
> /* If cpuid was previously disabled for this task, re-enable it. */
> if (test_thread_flag(TIF_NOCPUID))
> enable_cpuid();
> + if (test_thread_flag(TIF_MASKXCR0))
> + reset_xcr0_mask();
> }

So the mask is cleared on exec(). Does that mean that *every*
individual process using this interface has to set up its own mask
before anything in the C library establishes its cached value of XCR0?
I'd want to see how that's being accomplished.

> +static int xstate_is_initial(unsigned long mask)
> +{
> + int i, j;
> + unsigned long max_bit = __ffs(mask);
> +
> + for (i = 0; i < max_bit; ++i) {
> + if (mask & (1 << i)) {
> + char *xfeature_addr = (char *)get_xsave_addr(
> + &current->thread.fpu.state.xsave,
> + 1 << i);
> + unsigned long feature_size = xfeature_size(i);
> +
> + for (j = 0; j < feature_size; ++j) {
> + if (xfeature_addr[j] != 0)
> + return 0;
> + }
> + }
> + }
> + return 1;
> +}

There is nothing architectural saying that the init state has to be 0.

> + case ARCH_SET_XCR0: {

The interface is a bit murky. The SET_XCR0 operation masks out the
"set" value from the current value? That's a bit counterintuitive.

> + unsigned long mask = xfeatures_mask & ~arg2;
> +
> + if (!use_xsave())
> + return -ENODEV;
> +
> + if (arg2 & ~xfeatures_mask)
> + return -ENODEV;

This is rather unfortunately comment-free. "Are you trying to clear a
bit that was not set in the first place?"

Also, shouldn't this be dealing with the new task->xcr0, *not* the
global xfeatures_mask? What if someone calls this more than once?

> + if (!xcr0_is_legal(arg2))
> + return -EINVAL;

FWIW, I don't really get the point of disallowing some of the values
made illegal in there. Sure, you shoot yourself in the foot, but the
worst you'll probably see is a general-protection-fault from the XSETBV,
or from the first XRSTOR*. We can cope with those, and I'd rather not
be trying to keep a list of things you're not allowed to do with XSAVE.

I also don't see any sign of checking for supervisor features anywhere.

> + /*
> + * We require that any state components being disabled by
> + * this prctl be currently in their initial state.
> + */
> + if (!xstate_is_initial(mask))
> + return -EPERM;

Aside: I would *not* refer to the "initial state", for fear that we
could confuse it with the hardware-defined "init state". From software,
we really have zero control over when the hardware is in its "init state".

But, in any case, so how is this supposed to work?

// get features we are disabling into values matching the
// hardware "init state".
__asm__("XRSTOR %reg1,%reg2", ...);
prctl(PRCTL_SET_XCR0, something);

?

That would be *really* fragile code from userspace. Adding a printk()
between those two lines would probably break it, for instance.

I'd probably just not have these checks.

2018-06-18 14:44:59

by Keno Fischer

Subject: Re: [RFC PATCH] x86/arch_prctl: Add ARCH_SET_XCR0 to mask XCR0 per-thread

Hi Dave,

thank you for your thorough comments, replies inline:

On Mon, Jun 18, 2018 at 8:47 AM, Dave Hansen
<[email protected]> wrote:
> On 06/16/2018 05:33 PM, Keno Fischer wrote:
>> For my use case, it would be sufficient to simply disallow
>> any value of XCR0 with "holes" in it,
> But what if the hardware you are migrating to/from *has* holes?

Yes, that would be the problem ;). I'm not aware of that being the
case for user space states on current hardware, but perhaps,
I'm just missing a case.

> There's no way this is even close to viable until it has been made to
> cope with holes.

Ok, it seems to me the first decision is whether it should be allowed
to have the compacted format (with holes) in an in-memory xstate
image. Doing so seems possible, but would require more invasive
changes to the kernel, so I'm gonna ignore it for now.

If we want to keep the in-memory xstate layout constant after boot
time, I see four ways to do that for this feature.

1) Set up XCR0 in a different place, so that the kernel xsaves/xrstors
use an XCR0 that matches the layout the kernel expects.
2) Use xsaveopt when this particular thread flag is set and have the
kernel be able to deal with non-compressed xstate images, even
if xsaves is available.
3) What's in this patch now, but fix up the xstate after saving it.
4) Catch the fault thrown by xsaves/xrstors in this situation, update
XCR0, redo the xsaves/xrstors, put XCR0 back and continue
execution after the faulting instruction.

Option 1) seems easiest, but it would also require adding code
somewhere on the common path, which I assume is a no-go ;).

Option 3) seems doable entirely in the slow path for this feature.
If we xsaves with the modified XCR0, we can then fix up the memory
image to match the expected layout. Similarly, we could xrestors
in the slow path (from standard layout) and rely on the
`fpregs_state_valid` mechanism to prevent the fault.

Option 4) seems also doable, but of course messing around with
traps makes things quite a bit more complicated.

> FWIW, I just don't think this is going to be viable. I have the feeling
> that there's way too much stuff that hard-codes assumptions about XCR0
> inside the kernel and out. This is just going to make it much more fragile.
>
> Folks that want this level of container migration are probably better
> off running one of the hardware-based containers and migrating _those_.
> Or, just ensuring the places to/from they want to migrate have a
> homogeneous XCR0 mix.

Fair enough :). I figured I'd throw it out there as an RFC and see if I can
get it into an acceptable shape for you to include upstream. For my
particular use case (replaying old traces on newer hardware),
since I control the replay hardware, I'm happy if there's a patch
for me to apply locally to solve my problem. Of course there's also
the opposite use case (recording a trace on a new machine, but
wanting to replay on an older one), which would also require the
recording machine to have this patch. The reason I make this
distinction is that we don't usually control the recording machine
(since those are our users' machines), so that tends to have a
much more varied hardware/kernel mix and it's harder to get kernel
patches applied. I'm happy to keep chipping away at this for as
long as there is feedback, but I understand if it won't make it
in at the end of the process.

>> @@ -252,6 +301,8 @@ void arch_setup_new_exec(void)
>> /* If cpuid was previously disabled for this task, re-enable it. */
>> if (test_thread_flag(TIF_NOCPUID))
>> enable_cpuid();
>> + if (test_thread_flag(TIF_MASKXCR0))
>> + reset_xcr0_mask();
>> }
>
> So the mask is cleared on exec(). Does that mean that *every*
> individual process using this interface has to set up its own mask
> before anything in the C library establishes its cached value of XCR0?
> I'd want to see how that's being accomplished.

Yes, that is correct. In my rr patch this is done by injecting an
arch_prctl syscall using ptrace, at the same spot that we inject
the arch_prctl syscall for enabling cpuid faulting (i.e. the interface
that sets the `TIF_NOCPUID` flag in the check prior to this one).
Since these two features would seem to make the most sense when
used together, I figured it would be best to align semantics.
Happy to entertain alternative suggestions though.

>> +static int xstate_is_initial(unsigned long mask)
>> +{
>> + int i, j;
>> + unsigned long max_bit = __ffs(mask);
>> +
>> + for (i = 0; i < max_bit; ++i) {
>> + if (mask & (1 << i)) {
>> + char *xfeature_addr = (char *)get_xsave_addr(
>> + &current->thread.fpu.state.xsave,
>> + 1 << i);
>> + unsigned long feature_size = xfeature_size(i);
>> +
>> + for (j = 0; j < feature_size; ++j) {
>> + if (xfeature_addr[j] != 0)
>> + return 0;
>> + }
>> + }
>> + }
>> + return 1;
>> +}
>
> There is nothing architectural saying that the init state has to be 0.

True, but it was my understanding that this was true for all currently
defined state components (unless I misread 13.6 in the SDM, which
is totally possible). Since I don't currently allow setting any bits
other than the ones currently defined, I think this is ok at the moment,
but this function is certainly one of those that make this patch "RFC" ;)

>> + case ARCH_SET_XCR0: {
>
> The interface is a bit murky. The SET_XCR0 operation masks out the
> "set" value from the current value? That's a bit counterintuitive.

I apologize for the confusion. The SET_XCR0 userspace ABI takes
the value of XCR0 to set, which then gets translated into a mask
for the kernel to save (I discussed this a bit in the original PATCH -
I guess this confusion further proves that I should just stick
with one or the other).

>> + unsigned long mask = xfeatures_mask & ~arg2;
>> +
>> + if (!use_xsave())
>> + return -ENODEV;
>> +
>> + if (arg2 & ~xfeatures_mask)
>> + return -ENODEV;
>
> This is rather unfortunately comment-free. "Are you trying to clear a
> bit that was not set in the first place?"
>
> Also, shouldn't this be dealing with the new task->xcr0, *not* the
> global xfeatures_mask? What if someone calls this more than once?

The intention of this was to disallow setting any bits in xcr0 that weren't
enabled in the `xfeatures_mask` (because that means the kernel doesn't
know about them). I was fine with re-enabling xcr0 bits later, which is
why this doesn't currently look at the task's xcr0. I personally don't need
this, but it seemed like an arbitrary choice to disallow it.

>> + if (!xcr0_is_legal(arg2))
>> + return -EINVAL;
>
> FWIW, I don't really get the point of disallowing some of the values
> made illegal in there. Sure, you shoot yourself in the foot, but the
> worst you'll probably see is a general-protection-fault from the XSETBV,
> or from the first XRSTOR*. We can cope with those, and I'd rather not
> be trying to keep a list of things you're not allowed to do with XSAVE.

That's fair. I guess I am scared of creating places in the kernel that trap,
perhaps unreasonably so ;). I should point out however, that KVM already
does keep such a list for precisely the same reason, so if I refactor
this to share such code, at least there'd be only one such list, which
may be slightly better?

> I also don't see any sign of checking for supervisor features anywhere.

At least currently, it is my understanding that `xfeatures_mask` only has
user features, am I mistaken about that?

>> + /*
>> + * We require that any state components being disabled by
>> + * this prctl be currently in their initial state.
>> + */
>> + if (!xstate_is_initial(mask))
>> + return -EPERM;
>
> Aside: I would *not* refer to the "initial state", for fear that we
> could confuse it with the hardware-defined "init state". From software,
> we really have zero control over when the hardware is in its "init state".

I meant to have this refer to the initial configuration as defined in 13.6
of the SDM, i.e. what xrstor restores when the corresponding xstate_bv
field is cleared.

> But, in any case, so how is this supposed to work?
>
> // get features we are disabling into values matching the
> // hardware "init state".
> __asm__("XRSTOR %reg1,%reg2", ...);
> prctl(PRCTL_SET_XCR0, something);
>
> ?

I was primarily thinking of the ptracer use case,

ptrace(PTRACE_SETFPXREGS, <recorded regs>)
<inject arch_prctl using ptrace>

in which case there isn't a problem, because the unrecorded regs
should be in the initial state.

> That would be *really* fragile code from userspace. Adding a printk()
> between those two lines would probably break it, for instance.
>
> I'd probably just not have these checks.

Ok, that is fine with me. It also to some extent informs the possibilities
for the compacted storage question above, since not having these
checks would require us to save the current register state such that
we can re-activate it later and restore it properly.

2018-06-18 15:07:09

by Dave Hansen

Subject: Re: [RFC PATCH] x86/arch_prctl: Add ARCH_SET_XCR0 to mask XCR0 per-thread

On 06/18/2018 07:42 AM, Keno Fischer wrote:
>> There's no way this is even close to viable until it has been made to
>> cope with holes.
>
> Ok, it seems to me the first decision is whether it should be allowed
> to have the compacted format (with holes) in an in-memory xstate
> image. Doing so seems possible, but would require more invasive
> changes to the kernel, so I'm gonna ignore it for now.
>
> If we want to keep the in-memory xstate layout constant after boot
> time, I see four ways to do that for this feature.
>
> 1) Set up XCR0 in a different place, so that the kernel xsaves/xrstors
> use an XCR0 that matches the layout the kernel expects.

... and do XCR0 writes before every XSAVES/XRSTORS, including in the
context-switch path?

> 2) Use xsaveopt when this particular thread flag is set and have the
> kernel be able to deal with non-compressed xstate images, even
> if xsaves is available.

What about if there is supervisor state in place?

> 3) What's in this patch now, but fix up the xstate after saving it.

That's _maybe_ viable. But, it's going to slow things down quite a bit
and has to be done with preempt disabled.

> 4) Catch the fault thrown by xsaves/xrstors in this situation, update
> XCR0, redo the xsaves/xrstors, put XCR0 back and continue
> execution after the faulting instruction.

I'm worried about the kernel pieces that go digging in the XSAVE data
getting confused more than the hardware getting confused.

> Option 1) seems easiest, but it would also require adding code
> somewhere on the common path, which I assume is a no-go ;).

Yeah, that would be horribly slow.

You then also have to be really careful with interrupts and preempt when
you're in a funky XCR0 configuration.

> Option 3) seems doable entirely in the slow path for this feature.
> If we xsaves with the modified XCR0, we can then fix up the memory
> image to match the expected layout. Similarly, we could xrstors
> in the slow path (from standard layout) and rely on the
> `fpregs_state_valid` mechanism to prevent the fault.

... with one more modification. You need two buffers _anyway_ if you do
this. So you would probably just XSAVE to a new buffer and then convert
that back to the "real" buffer in the thread struct.

> At least currently, it is my understanding that `xfeatures_mask` only has
> user features, am I mistaken about that?

We're slowly adding supervisor support. I think accounting for
supervisor features is a requirement for any new XSAVE code.

2018-06-18 15:16:15

by Keno Fischer

Subject: Re: [RFC PATCH] x86/arch_prctl: Add ARCH_SET_XCR0 to mask XCR0 per-thread

>> 4) Catch the fault thrown by xsaves/xrstors in this situation, update
>> XCR0, redo the xsaves/xrstors, put XCR0 back and continue
>> execution after the faulting instruction.
>
> I'm worried about the kernel pieces that go digging in the XSAVE data
> getting confused more than the hardware getting confused.

So you prefer this option? If so, I can try to have a go at implementing it
this way and seeing if I run into any trouble.

>> At least currently, it is my understanding that `xfeatures_mask` only has
>> user features, am I mistaken about that?
>
> We're slowly adding supervisor support. I think accounting for
> supervisor features is a requirement for any new XSAVE code.

Sure, I don't think this is in any way incompatible with that (though
it probably also suggests that we want to keep the memory layout the
same if possible).

2018-06-18 16:17:15

by Dave Hansen

Subject: Re: [RFC PATCH] x86/arch_prctl: Add ARCH_SET_XCR0 to mask XCR0 per-thread

On 06/18/2018 08:13 AM, Keno Fischer wrote:
>>> 4) Catch the fault thrown by xsaves/xrstors in this situation, update
>>> XCR0, redo the xsaves/xrstors, put XCR0 back and continue
>>> execution after the faulting instruction.
>>
>> I'm worried about the kernel pieces that go digging in the XSAVE data
>> getting confused more than the hardware getting confused.
>
> So you prefer this option? If so, I can try to have a go at implementing it
> this way and seeing if I run into any trouble.

No, I'm saying that depending on faults is not a viable solution. We
are not guaranteed to get faults in all the cases you would need to fix up.

XSAVE*/XRSTOR* are not even *called* in some of those cases.

>>> At least currently, it is my understanding that `xfeatures_mask` only has
>>> user features, am I mistaken about that?
>>
>> We're slowly adding supervisor support. I think accounting for
>> supervisor features is a requirement for any new XSAVE code.
>
> Sure, I don't think this is in any way incompatible with that (though
> it probably also suggests that we want to keep the memory layout the
> same if possible).

I think you've tried to simplify your implementation by ignoring
features, like holes. However, the existing implementation actually
*does* handle those things and we've spent a significant amount of time
ensuring that it works, despite the fact that you can't buy an
off-the-shelf CPU that creates a hole without help from a hypervisor today.


2018-06-18 17:01:00

by Andi Kleen

Subject: Re: [RFC PATCH] x86/arch_prctl: Add ARCH_SET_XCR0 to mask XCR0 per-thread

> Unfortunately, that is insufficient. Almost any difference in CPU
> behavior between the recording machine and the replaying machine
> matters. In particular, the problems for rr here are
> 1) xgetbv, which can introduce differing register contents (and, if
> code branches on this, potentially differences in control flow).

So you're saying that software checks XGETBV first before checking
CPUID?

That seems broken. Without AVX XGETBV may not exist.

You always have to check CPUID first. And then it should just branch
around the XGETBV.
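The ordering described above would look roughly like this in user space. This is a sketch: the function names `xgetbv0` and `usable_xfeatures` are made up for illustration, and xgetbv is emitted as raw opcode bytes so the snippet assembles without special compiler flags.

```c
#include <cpuid.h>
#include <stdint.h>

/* XGETBV with ECX=0 reads XCR0.  Encoded as raw bytes so no
 * -mxsave compiler flag is needed. */
static uint64_t xgetbv0(void)
{
    uint32_t eax, edx;
    __asm__ volatile(".byte 0x0f, 0x01, 0xd0"
                     : "=a"(eax), "=d"(edx) : "c"(0));
    return ((uint64_t)edx << 32) | eax;
}

/* Return the OS-enabled state-component mask, or 0 if XGETBV must
 * not be executed at all. */
static uint64_t usable_xfeatures(void)
{
    uint32_t eax, ebx, ecx, edx;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return 0;
    /* CPUID.1:ECX.OSXSAVE[bit 27] must be checked before XGETBV;
     * without it, executing XGETBV raises #UD. */
    if (!(ecx & (1u << 27)))
        return 0;
    return xgetbv0();
}
```

Code that runs XGETBV without the OSXSAVE check first is exactly the "broken software" case being discussed.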

So we're talking about a workaround for broken software. The question
is how widespread it is.


> 2) xsave writing more memory than the trace expects, causing
> divergence in the memory contents of the process.

Do memory contents which are never read by the application matter?


-Andi


2018-06-18 17:25:05

by Keno Fischer

Subject: Re: [RFC PATCH] x86/arch_prctl: Add ARCH_SET_XCR0 to mask XCR0 per-thread

On Mon, Jun 18, 2018 at 12:16 PM, Dave Hansen
<[email protected]> wrote:
> On 06/18/2018 08:13 AM, Keno Fischer wrote:
>>>> 4) Catch the fault thrown by xsaves/xrstors in this situation, update
>>>> XCR0, redo the xsaves/xrstors, put XCR0 back and continue
>>>> execution after the faulting instruction.
>>>
>>> I'm worried about the kernel pieces that go digging in the XSAVE data
>>> getting confused more than the hardware getting confused.
>>
>> So you prefer this option? If so, I can try to have a go at implementing it
>> this way and seeing if I run into any trouble.
>
> No, I'm saying that depending on faults is not a viable solution. We
> are not guaranteed to get faults in all the cases you would need to fix up.
>
> XSAVE*/XRSTOR* are not even *called* in some of those cases.

Ah, my apologies, I was under the mistaken impression that xsaves also
reads xcomp_bv to determine the layout, rather than using the RFBM and
then updating the xcomp_bv field. Let me think about this some more and
see what I can come up with.

2018-06-18 17:30:52

by Dave Hansen

Subject: Re: [RFC PATCH] x86/arch_prctl: Add ARCH_SET_XCR0 to mask XCR0 per-thread

On 06/18/2018 10:22 AM, Keno Fischer wrote:
>> No, I'm saying that depending on faults is not a viable solution. We
>> are not guaranteed to get faults in all the cases you would need to fix up.
>>
>> XSAVE*/XRSTOR* are not even *called* in some of those cases.
> Ah, my apologies, I was under the mistaken impression that xsaves also
> reads xcomp_bv to determine the layout, rather than using the RFBM and
> then updating the xcomp_bv field. Let me think about this some more and
> see what I can come up with.

It's not even that. If you do this, xstate_comp_offsets[] will
potentially need to be recalculated for each XCR0 value. Granted, we've
pretty nicely consolidated these references in the kernel into very few
spots, but you still need to fix that up.

2018-06-18 17:44:49

by Dave Hansen

Subject: Re: [RFC PATCH] x86/arch_prctl: Add ARCH_SET_XCR0 to mask XCR0 per-thread

On 06/18/2018 07:42 AM, Keno Fischer wrote:
>> But, in any case, so how is this supposed to work?
>>
>> // get features we are disabling into values matching the
>> // hardware "init state".
>> __asm__("XRSTOR %reg1,%reg2", ...);
>> prctl(PRCTL_SET_XCR0, something);
>>
>> ?
> I was primarily thinking of the ptracer use case,
>
> ptrace(PTRACE_SETFPXREGS, <recorded regs>)
> <inject arch_prctl using ptrace>
>
> in which case there isn't a problem, because the unrecorded regs
> should be in the initial state.

So, to be useful, this interface needs to be called before an
application can run XGETBV or XSAVE for the first time and caches a
"bad" value. I think that means that it might not be feasible to use
outside of cases where you ptrace() something and inject things before
it has a chance to run any real instructions.

Fundamentally, I think that makes _this_ interface pretty useless in
practice. The only practical option is to have a _future_ XCR0 value
set by the prctl() and then have it get made active by the kernel at
execve().

IMNSHO, if you haven't guessed yet, I think this whole exercise is a dead
end. Just boot an identical XCR0 VM on your new hardware and do replay
there. Done.

2018-06-18 17:52:16

by Keno Fischer

Subject: Re: [RFC PATCH] x86/arch_prctl: Add ARCH_SET_XCR0 to mask XCR0 per-thread

> So we're talking about a workaround for broken software. The question
> is how widespread it is.

For rr to work, it must replicate the process state *exactly*. That means:

1. The same instructions executed in the same order
2. The exact same register state at those instructions
3. The same memory state, bit-by-bit

In particular 1) means that any extra instructions executed/not executed
will cause a replay divergence (in practice rr counts retired conditional
branches rather than instructions, because the instruction counter is
not accurate while the branch counter is). This alone causes a problem
in the present case, because glibc branches on the xcr0 value before
it branches on the cpuid value for AVX512. Glibc does check for the
correct cpuid bit before calling xgetbv, so one option would be to
disable xsave entirely during recording by masking it out of CPUID,
but that would make rr quite a bit less useful, since it couldn't
record any bugs that require AVX to be used. However, the xsave
problem is worse, because xcr0 determines how much memory `xsave`
writes: if we emulate cpuid to pretend that AVX512 does not exist,
and the user-space application uses that to determine the size of
the required buffer, we now suddenly overflow that buffer (unless
the application uses cpuid to compute a minimal RFBM for xsave,
which no application seems to do).
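For reference, the sizes in question come from CPUID leaf 0xD, subleaf 0: EBX reports the XSAVE area size for the *current* XCR0 value, while ECX reports the worst case for every component the CPU supports. A minimal sketch of the query (the function name `xsave_sizes` is made up for illustration):

```c
#include <cpuid.h>
#include <stdint.h>

/* Size in bytes that XSAVE will write for the current XCR0 value
 * (CPUID.0xD[0]:EBX), and the worst-case size for every component
 * the CPU supports (CPUID.0xD[0]:ECX).  Both are 0 if the leaf is
 * not enumerated. */
static void xsave_sizes(uint32_t *cur, uint32_t *max)
{
    uint32_t eax, ebx, ecx, edx;
    *cur = *max = 0;
    if (!__get_cpuid_count(0xD, 0, &eax, &ebx, &ecx, &edx))
        return;
    *cur = ebx;
    *max = ecx;
}
```

An application sizing its buffer from a masked view of this leaf will under-allocate, while the hardware xsave still writes according to the real XCR0, hence the overflow described above.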

> Do memory contents which are never read by the application matter?

In theory, no. However, in practice, I've seen most memory divergences
(especially on the stack) end up causing control-flow divergences down
the line, because some code somewhere picks up the uninitialized memory
and branches on it.

Hope that helps,
Keno

2018-06-18 18:17:36

by Keno Fischer

Subject: Re: [RFC PATCH] x86/arch_prctl: Add ARCH_SET_XCR0 to mask XCR0 per-thread

> So, to be useful, this interface needs to be called before an
> application can run XGETBV or XSAVE for the first time and caches a
> "bad" value. I think that means that it might not be feasible to use
> outside of cases where you ptrace() something and inject things before
> it has a chance to run any real instructions.
>
> Fundamentally, I think that makes _this_ interface pretty useless in
> practice. The only practical option is to have a _future_ XCR0 value
> set by the prctl() and then have it get made active by the kernel at
> execve().

Fair enough, but I don't see this as fundamentally different from the
cpuid masking use case, which has the same problem and the same
interface. I'm also not convinced that there is *no* use case where one
may want to turn on certain XCR0 features while the process is running
and then turn them off again. To give a concrete example in this
context, it can be useful to write a small program into the memory
space of the replayed program and use it to analyze the memory state of
the program (e.g. to checksum the memory - or in our case to perform a
GC state validation). Such implants may want to use the AVX512
registers for performance, so it would be nice if that was possible.

> IMNSHO, if you haven't guessed yet, I think this whole exercise is a dead
> end. Just boot an identical XCR0 VM on your new hardware and do replay
> there. Done.

I had a hunch ;). However, there are a couple of considerations that
make me still want this in the kernel proper:
1. The recording-side application of this feature - getting our users
to do everything in a VM to send us a recording is not easy. Part
of the appeal of rr over VM-based record/replay techniques
is that it "just works" on basically any Linux host.
2. Starting a VM generally requires root permissions, which may
not be available.
3. And probably the biggest from my perspective is performance. rr
needs to do a lot of twiddling with the performance counters, which
I've seen have significant performance overhead in a virtualized
environment. There's of course also per-VM resource consumption,
but presumably we could keep one VM per XCR0 value and replay
multiple traces per VM.

2018-06-19 13:44:39

by Andi Kleen

Subject: Re: [RFC PATCH] x86/arch_prctl: Add ARCH_SET_XCR0 to mask XCR0 per-thread

> In particular 1) means that any extra instructions executed/not executed
> will cause a replay divergence (in practice rr counts retired conditional
> branches rather than instructions, because the instruction counter is
> not accurate while the branch counter is). This alone causes a problem
> in the present case, because glibc branches on the xcr0 value before
> it branches on the cpuid value for AVX512. Glibc does check for the
> correct cpuid bit before calling xgetbv, so one option would be to
> disable xsave entirely during recording by masking it out of CPUID,
> but that would make rr quite a bit less useful, since it couldn't

Ah, I see it now. This problem was introduced with glibc's change to
save the AVX registers using XSAVE instead of doing it manually.

It still seems this has a straightforward fix in glibc, though.

It could always allocate the worst-case buffer, and also verify XGETBV
against CPUID first. I'm sure this can be done in a way that the
executed branches don't differ.

AFAIK manual use of XSAVE is not that common, so hopefully
these problems are not wide spread in other programs.

Of course longer term you'll just need to have matching
ISAs in record and replay. Trying to patch around this
is likely always difficult.

-Andi