A #GP fault is generated when the ENQCMD instruction is executed without
a valid PASID value programmed in the current thread's PASID MSR. The
#GP fault handler will initialize the MSR if a PASID has been allocated
for this process.
Decoding the user instruction is ugly and sets a bad architecture
precedent. It may not function if the faulting instruction is modified
after #GP.
Thomas suggested providing a reason for the #GP caused by executing ENQCMD
without a valid PASID value programmed. #GP error codes are 16 bits and all
16 bits are taken. Refer to SDM Vol 3, Chapter 16.13 for details. The other
choice was to reflect the error code in an MSR. But ENQCMD can also cause
#GP when loading from the source operand, so an MSR-based error code would
not cover all the reasons either. Rather than special-casing ENQCMD, Intel
may in the future choose a different fault mechanism for such cases if
recovery is needed on #GP.
The following heuristic is used to avoid decoding the user instructions
to determine the precise reason for the #GP fault:
1) If the mm for the process has not been allocated a PASID, this #GP
cannot be fixed.
2) If the PASID MSR is already initialized, then the #GP was for some
other reason.
3) Try initializing the PASID MSR and returning. If the #GP was from
an ENQCMD this will fix it. If not, the #GP fault will be repeated
and will hit case "2".
Suggested-by: Thomas Gleixner <[email protected]>
Signed-off-by: Fenghua Yu <[email protected]>
Reviewed-by: Tony Luck <[email protected]>
Reviewed-by: Lu Baolu <[email protected]>
---
v5:
- Use cpu_feature_enabled() and remove redundant CONFIG_INTEL_IOMMU_SVM
check which is included in cpu_feature_enabled() in
fixup_pasid_exception() (PeterZ and Dave Hansen)
- Reviewed by Lu Baolu
v4:
- Change PASID type to u32 (Christoph)
v3:
- Check and set current->has_valid_pasid in fixup() (PeterZ)
- Add fpu__pasid_write() to update the MSR (PeterZ)
- Add ioasid_find() sanity check in fixup()
v2:
- Update the first paragraph of the commit message (Thomas)
- Add reasons why don't decode the user instruction and don't use
#GP error code (Thomas)
- Change get_task_mm() to current->mm (Thomas)
- Add comments on why IRQ is disabled during PASID fixup (Thomas)
- Add comment in fixup() that the function is called when #GP is from
user (so mm is not NULL) (Dave Hansen)
arch/x86/include/asm/iommu.h | 1 +
arch/x86/kernel/traps.c | 12 ++++++
drivers/iommu/intel/svm.c | 78 ++++++++++++++++++++++++++++++++++++
3 files changed, 91 insertions(+)
diff --git a/arch/x86/include/asm/iommu.h b/arch/x86/include/asm/iommu.h
index ed41259fe7ac..e9365a5d6f7d 100644
--- a/arch/x86/include/asm/iommu.h
+++ b/arch/x86/include/asm/iommu.h
@@ -27,5 +27,6 @@ arch_rmrr_sanity_check(struct acpi_dmar_reserved_memory *rmrr)
}
void __free_pasid(struct mm_struct *mm);
+bool __fixup_pasid_exception(void);
#endif /* _ASM_X86_IOMMU_H */
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index f58679e487f6..fe0f7d00523b 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -59,6 +59,7 @@
#include <asm/umip.h>
#include <asm/insn.h>
#include <asm/insn-eval.h>
+#include <asm/iommu.h>
#ifdef CONFIG_X86_64
#include <asm/x86_init.h>
@@ -518,6 +519,14 @@ static enum kernel_gp_hint get_kernel_gp_address(struct pt_regs *regs,
return GP_CANONICAL;
}
+static bool fixup_pasid_exception(void)
+{
+ if (!cpu_feature_enabled(X86_FEATURE_ENQCMD))
+ return false;
+
+ return __fixup_pasid_exception();
+}
+
#define GPFSTR "general protection fault"
DEFINE_IDTENTRY_ERRORCODE(exc_general_protection)
@@ -530,6 +539,9 @@ DEFINE_IDTENTRY_ERRORCODE(exc_general_protection)
cond_local_irq_enable(regs);
+ if (user_mode(regs) && fixup_pasid_exception())
+ goto exit;
+
if (static_cpu_has(X86_FEATURE_UMIP)) {
if (user_mode(regs) && fixup_umip_exception(regs))
goto exit;
diff --git a/drivers/iommu/intel/svm.c b/drivers/iommu/intel/svm.c
index 4c788880b037..4a84c82a4f8c 100644
--- a/drivers/iommu/intel/svm.c
+++ b/drivers/iommu/intel/svm.c
@@ -1105,3 +1105,81 @@ void __free_pasid(struct mm_struct *mm)
*/
ioasid_free(pasid);
}
+
+/*
+ * Write the current task's PASID MSR/state. This is called only when PASID
+ * is enabled.
+ */
+static void fpu__pasid_write(u32 pasid)
+{
+ u64 msr_val = pasid | MSR_IA32_PASID_VALID;
+
+ fpregs_lock();
+
+ /*
+ * If the MSR is active and owned by the current task's FPU, it can
+ * be directly written.
+ *
+ * Otherwise, write the fpstate.
+ */
+ if (!test_thread_flag(TIF_NEED_FPU_LOAD)) {
+ wrmsrl(MSR_IA32_PASID, msr_val);
+ } else {
+ struct ia32_pasid_state *ppasid_state;
+
+ ppasid_state = get_xsave_addr(&current->thread.fpu.state.xsave,
+ XFEATURE_PASID);
+ /*
+ * ppasid_state shouldn't be NULL because XFEATURE_PASID
+ * is enabled.
+ */
+ WARN_ON_ONCE(!ppasid_state);
+ ppasid_state->pasid = msr_val;
+ }
+ fpregs_unlock();
+}
+
+/*
+ * Apply some heuristics to see if the #GP fault was caused by a thread
+ * that hasn't had the IA32_PASID MSR initialized. If it looks like that
+ * is the problem, try initializing the IA32_PASID MSR. If the heuristic
+ * guesses incorrectly, take one more #GP fault.
+ */
+bool __fixup_pasid_exception(void)
+{
+ struct intel_svm *svm;
+ struct mm_struct *mm;
+ u32 pasid;
+
+ /*
+ * If the current task already has a valid PASID in the MSR,
+ * the #GP must be for some other reason.
+ */
+ if (current->has_valid_pasid)
+ return false;
+
+ /*
+ * This function is called only when this #GP was triggered from user
+ * space. So the mm cannot be NULL.
+ */
+ mm = current->mm;
+ pasid = mm->pasid;
+
+ /*
+ * If the PASID isn't found, cannot help.
+ *
+ * Don't care if the PASID is bound to the mm here. #PF will handle the
+ * case that the MSR is fixed up by an unbound PASID.
+ */
+ svm = ioasid_find(NULL, pasid, NULL);
+ if (IS_ERR_OR_NULL(svm))
+ return false;
+
+ /* Fix up the MSR by the PASID in the mm. */
+ fpu__pasid_write(pasid);
+
+ /* Now the current task has a valid PASID in the MSR. */
+ current->has_valid_pasid = 1;
+
+ return true;
+}
--
2.19.1
On Mon, Jul 13, 2020 at 4:48 PM Fenghua Yu <[email protected]> wrote:
>
> A #GP fault is generated when ENQCMD instruction is executed without
> a valid PASID value programmed in the current thread's PASID MSR. The
> #GP fault handler will initialize the MSR if a PASID has been allocated
> for this process.
>
> Decoding the user instruction is ugly and sets a bad architecture
> precedent. It may not function if the faulting instruction is modified
> after #GP.
>
> Thomas suggested to provide a reason for the #GP caused by executing ENQCMD
> without a valid PASID value programmed. #GP error codes are 16 bits and all
> 16 bits are taken. Refer to SDM Vol 3, Chapter 16.13 for details. The other
> choice was to reflect the error code in an MSR. ENQCMD can also cause #GP
> when loading from the source operand, so its not fully comprehending all
> the reasons. Rather than special case the ENQCMD, in future Intel may
> choose a different fault mechanism for such cases if recovery is needed on
> #GP.
Decoding the user instruction is ugly and sets a bad architecture
precedent, but we already do it in #GP for UMIP. So I'm unconvinced.
Memo to Intel, though: you REALLY need to start thinking about what
the heck an OS is supposed to do with all these new faults you're
coming up with. The new #NM for TILE is utterly nonsensical. Sure,
it works for an OS that does not use CR0.TS and as long as no one
tries to extend the same mechanism for some new optional piece of
state, but as soon as Intel tries to use the same mechanism for
anything else, it falls apart.
Please do better.
> +
> +/*
> + * Write the current task's PASID MSR/state. This is called only when PASID
> + * is enabled.
> + */
> +static void fpu__pasid_write(u32 pasid)
> +{
> + u64 msr_val = pasid | MSR_IA32_PASID_VALID;
> +
> + fpregs_lock();
> +
> + /*
> + * If the MSR is active and owned by the current task's FPU, it can
> + * be directly written.
> + *
> + * Otherwise, write the fpstate.
> + */
> + if (!test_thread_flag(TIF_NEED_FPU_LOAD)) {
> + wrmsrl(MSR_IA32_PASID, msr_val);
> + } else {
> + struct ia32_pasid_state *ppasid_state;
> +
> + ppasid_state = get_xsave_addr(&current->thread.fpu.state.xsave,
> + XFEATURE_PASID);
> + /*
> + * ppasid_state shouldn't be NULL because XFEATURE_PASID
> + * is enabled.
> + */
> + WARN_ON_ONCE(!ppasid_state);
> + ppasid_state->pasid = msr_val;
WARN instead of BUG is nice, but you'll immediately oops if this fails.
How about:
	if (!WARN_ON_ONCE(!ppasid_state))
		ppasid_state->pasid = msr_val;
Hi, Andy,
On Fri, Jul 31, 2020 at 04:34:11PM -0700, Andy Lutomirski wrote:
> On Mon, Jul 13, 2020 at 4:48 PM Fenghua Yu <[email protected]> wrote:
> >
> > A #GP fault is generated when ENQCMD instruction is executed without
> > a valid PASID value programmed in the current thread's PASID MSR. The
> > #GP fault handler will initialize the MSR if a PASID has been allocated
> > for this process.
> >
> > Decoding the user instruction is ugly and sets a bad architecture
> > precedent. It may not function if the faulting instruction is modified
> > after #GP.
> >
> > Thomas suggested to provide a reason for the #GP caused by executing ENQCMD
> > without a valid PASID value programmed. #GP error codes are 16 bits and all
> > 16 bits are taken. Refer to SDM Vol 3, Chapter 16.13 for details. The other
> > choice was to reflect the error code in an MSR. ENQCMD can also cause #GP
> > when loading from the source operand, so its not fully comprehending all
> > the reasons. Rather than special case the ENQCMD, in future Intel may
> > choose a different fault mechanism for such cases if recovery is needed on
> > #GP.
>
> Decoding the user instruction is ugly and sets a bad architecture
> precedent, but we already do it in #GP for UMIP. So I'm unconvinced.
Maybe just remove the "Decoding the user instruction ... bad architecture
precedent" sentence? The sentence is vague.
As described in the following "It may not function ..." sentence, the real
issue with parsing the instruction is that it may be modified by another
processor before it's parsed in the #GP handler.
If we just keep the "It may not function ..." sentence, is that good enough
to explain why we don't parse the faulting instruction?
>
> Memo to Intel, though: you REALLY need to start thinking about what
> the heck an OS is supposed to do with all these new faults you're
> coming up with. The new #NM for TILE is utterly nonsensical. Sure,
> it works for an OS that does not use CR0.TS and as long as no one
> tries to extend the same mechanism for some new optional piece of
> state, but as soon as Intel tries to use the same mechanism for
> anything else, it falls apart.
>
> Please do better.
Internally we did discuss the error code in #GP for PASID with HW architects.
But due to some uarch reasons, it's not simple to report an error code for
PASID:( Please see the previous discussion on the error code for PASID:
https://lore.kernel.org/lkml/20200427224646.GA103955@otc-nc-03/
It's painful for our SW guys to check exception reasons if the hardware
doesn't explicitly tell us.
Hopefully the heuristic in this patch (fix up the PASID MSR if the process
already has a valid PASID but the MSR doesn't have one yet) is acceptable.
>
> > +
> > +/*
> > + * Write the current task's PASID MSR/state. This is called only when PASID
> > + * is enabled.
> > + */
> > +static void fpu__pasid_write(u32 pasid)
> > +{
> > + u64 msr_val = pasid | MSR_IA32_PASID_VALID;
> > +
> > + fpregs_lock();
> > +
> > + /*
> > + * If the MSR is active and owned by the current task's FPU, it can
> > + * be directly written.
> > + *
> > + * Otherwise, write the fpstate.
> > + */
> > + if (!test_thread_flag(TIF_NEED_FPU_LOAD)) {
> > + wrmsrl(MSR_IA32_PASID, msr_val);
> > + } else {
> > + struct ia32_pasid_state *ppasid_state;
> > +
> > + ppasid_state = get_xsave_addr(&current->thread.fpu.state.xsave,
> > + XFEATURE_PASID);
> > + /*
> > + * ppasid_state shouldn't be NULL because XFEATURE_PASID
> > + * is enabled.
> > + */
> > + WARN_ON_ONCE(!ppasid_state);
> > + ppasid_state->pasid = msr_val;
>
> WARN instead of BUG is nice, but you'll immediate oops if this fails.
> How about:
>
> if (!WARN_ON_ONCE(!ppasid_state))
> ppasid_state->pasid = msr_val;
OK. I will fix this issue.
Thank you very much for your review!
-Fenghua
On Mon, Jul 13, 2020 at 4:48 PM Fenghua Yu <[email protected]> wrote:
>
> A #GP fault is generated when ENQCMD instruction is executed without
> a valid PASID value programmed in the current thread's PASID MSR. The
> #GP fault handler will initialize the MSR if a PASID has been allocated
> for this process.
Let's take a step back here. Why are we trying to avoid IPIs? If you
call munmap(), you IPI other CPUs running tasks in the current mm. If
you do perf_event_open() and thus acquire RDPMC permission, you IPI
other CPUs running tasks in the current mm. If you call modify_ldt(),
you IPI other CPUs running tasks in the current mm. These events can
all happen more than once per process.
Now we have ENQCMD. An mm can be assigned a PASID *once* in the model
that these patches support. Why not just send an IPI using
essentially identical code to the LDT sync or the CR4.PCE sync?
--Andy
On 7/31/20 4:34 PM, Andy Lutomirski wrote:
>> Thomas suggested to provide a reason for the #GP caused by executing ENQCMD
>> without a valid PASID value programmed. #GP error codes are 16 bits and all
>> 16 bits are taken. Refer to SDM Vol 3, Chapter 16.13 for details. The other
>> choice was to reflect the error code in an MSR. ENQCMD can also cause #GP
>> when loading from the source operand, so its not fully comprehending all
>> the reasons. Rather than special case the ENQCMD, in future Intel may
>> choose a different fault mechanism for such cases if recovery is needed on
>> #GP.
> Decoding the user instruction is ugly and sets a bad architecture
> precedent, but we already do it in #GP for UMIP. So I'm unconvinced.
I'll try to do one more bit of convincing. :)
In the end, we need a way to figure out if the #GP was from a known "OK"
source that we can fix up. You're right that we could fire up the
instruction decoder to help answer that question. But, it (also)
doesn't easily yield a perfect answer as to the source of the #GP, it
always involves a user copy, and it's a larger code impact than what
we've got.
I think I went and looked at fixup_umip_exception(), and compared it to
the alternative which is essentially just these three lines of code:
> + /*
> + * If the current task already has a valid PASID in the MSR,
> + * the #GP must be for some other reason.
> + */
> + if (current->has_valid_pasid)
> + return false;
...> + /* Now the current task has a valid PASID in the MSR. */
> + current->has_valid_pasid = 1;
and *I* was convinced that instruction decoding wasn't worth it.
There's a lot of stuff that fixup_umip_exception() does which we don't
have to duplicate, but it's going to be really hard to get it anywhere
near as compact as what we've got.
I guess Fenghua could go give instruction decoding a shot and see how it
looks, though.
On Mon, Aug 3, 2020 at 8:03 AM Dave Hansen <[email protected]> wrote:
>
> On 7/31/20 4:34 PM, Andy Lutomirski wrote:
> >> Thomas suggested to provide a reason for the #GP caused by executing ENQCMD
> >> without a valid PASID value programmed. #GP error codes are 16 bits and all
> >> 16 bits are taken. Refer to SDM Vol 3, Chapter 16.13 for details. The other
> >> choice was to reflect the error code in an MSR. ENQCMD can also cause #GP
> >> when loading from the source operand, so its not fully comprehending all
> >> the reasons. Rather than special case the ENQCMD, in future Intel may
> >> choose a different fault mechanism for such cases if recovery is needed on
> >> #GP.
> > Decoding the user instruction is ugly and sets a bad architecture
> > precedent, but we already do it in #GP for UMIP. So I'm unconvinced.
>
> I'll try to do one more bit of convincing. :)
>
> In the end, we need a way to figure out if the #GP was from a known "OK"
> source that we can fix up. You're right that we could fire up the
> instruction decoder to help answer that question. But, it (also)
> doesn't easily yield a perfect answer as to the source of the #GP, it
> always involves a user copy, and it's a larger code impact than what
> we've got.
>
> I think I went and looked at fixup_umip_exception(), and compared it to
> the alternative which is essentially just these three lines of code:
>
> > + /*
> > + * If the current task already has a valid PASID in the MSR,
> > + * the #GP must be for some other reason.
> > + */
> > + if (current->has_valid_pasid)
> > + return false;
> ...> + /* Now the current task has a valid PASID in the MSR. */
> > + current->has_valid_pasid = 1;
>
> and *I* was convinced that instruction decoding wasn't worth it.
>
> There's a lot of stuff that fixup_umip_exception() does which we don't
> have to duplicate, but it's going to be really hard to get it anywhere
> near as compact as what we've got.
>
I could easily be convinced that the PASID fixup is so trivial and so
obviously free of misfiring in a way that causes an infinite loop that
this code is fine. But I think we first need to answer the bigger
question of why we're doing a lazy fixup in the first place.
--Andy
On Mon, Aug 03, 2020 at 08:12:18AM -0700, Andy Lutomirski wrote:
> On Mon, Aug 3, 2020 at 8:03 AM Dave Hansen <[email protected]> wrote:
> >
> > On 7/31/20 4:34 PM, Andy Lutomirski wrote:
> > >> Thomas suggested to provide a reason for the #GP caused by executing ENQCMD
> > >> without a valid PASID value programmed. #GP error codes are 16 bits and all
> > >> 16 bits are taken. Refer to SDM Vol 3, Chapter 16.13 for details. The other
> > >> choice was to reflect the error code in an MSR. ENQCMD can also cause #GP
> > >> when loading from the source operand, so its not fully comprehending all
> > >> the reasons. Rather than special case the ENQCMD, in future Intel may
> > >> choose a different fault mechanism for such cases if recovery is needed on
> > >> #GP.
> > > Decoding the user instruction is ugly and sets a bad architecture
> > > precedent, but we already do it in #GP for UMIP. So I'm unconvinced.
> >
> > I'll try to do one more bit of convincing. :)
> >
> > In the end, we need a way to figure out if the #GP was from a known "OK"
> > source that we can fix up. You're right that we could fire up the
> > instruction decoder to help answer that question. But, it (also)
> > doesn't easily yield a perfect answer as to the source of the #GP, it
> > always involves a user copy, and it's a larger code impact than what
> > we've got.
> >
> > I think I went and looked at fixup_umip_exception(), and compared it to
> > the alternative which is essentially just these three lines of code:
> >
> > > + /*
> > > + * If the current task already has a valid PASID in the MSR,
> > > + * the #GP must be for some other reason.
> > > + */
> > > + if (current->has_valid_pasid)
> > > + return false;
> > ...> + /* Now the current task has a valid PASID in the MSR. */
> > > + current->has_valid_pasid = 1;
> >
> > and *I* was convinced that instruction decoding wasn't worth it.
> >
> > There's a lot of stuff that fixup_umip_exception() does which we don't
> > have to duplicate, but it's going to be really hard to get it anywhere
> > near as compact as what we've got.
> >
>
> I could easily be convinced that the PASID fixup is so trivial and so
> obviously free of misfiring in a way that causes an infinite loop that
> this code is fine. But I think we first need to answer the bigger
> question of why we're doing a lazy fixup in the first place.
We chose lazy fixup for two reasons.
- If some threads were already created before the MSR is programmed, then
we need to fix those up in a race-free way, e.g. by scheduling some
task-work. We did do that early on, but decided it was ugly.
- Not all threads need to submit ENQCMD, so force-feeding the MSR isn't
even required for all of them. The overhead probably isn't big, but it
may be unnecessary for many threads.
We would have needed to fix up the MSR in two different ways. To keep the
code simple, the choice was to only fix up on #GP, which eliminated the
extra code needed to support case 1.
Cheers,
Ashok
On 8/3/20 8:12 AM, Andy Lutomirski wrote:
> I could easily be convinced that the PASID fixup is so trivial and so
> obviously free of misfiring in a way that causes an infinite loop that
> this code is fine. But I think we first need to answer the bigger
> question of why we're doing a lazy fixup in the first place.
There was an (internal to Intel) implementation of this about a year ago
that used smp_call_function_many() to force the MSR state into all
threads of a process. I took one look at it, decided there was a 0%
chance of it actually functioning and recommended we find another way.
While I'm sure it could be done more efficiently, the implementation I
looked at took ~200 lines of code and comments. It started to look too
much like another instance of mm->cpumask for my comfort.
The only other option I could think of would be an ABI where threads
were required to call into the kernel at least once after creation
before calling ENQCMD. All ENQCMDs would be required to be "wrapped" by
code doing this syscall. Something like this:
	if (!thread_local(did_pasid_init)) {
		sys_pasid_init(); // new syscall or prctl
		thread_local(did_pasid_init) = 1;
	}
	// ENQCMD here
On Mon, Aug 3, 2020 at 9:37 AM Dave Hansen <[email protected]> wrote:
>
> On 8/3/20 8:12 AM, Andy Lutomirski wrote:
> > I could easily be convinced that the PASID fixup is so trivial and so
> > obviously free of misfiring in a way that causes an infinite loop that
> > this code is fine. But I think we first need to answer the bigger
> > question of why we're doing a lazy fixup in the first place.
>
> There was an (internal to Intel) implementation of this about a year ago
> that used smp_call_function_many() to force the MSR state into all
> threads of a process. I took one look at it, decided there was a 0%
> chance of it actually functioning and recommended we find another way.
> While I'm sure it could be done more efficiently, the implementation I
> looked at took ~200 lines of code and comments. It started to look too
> much like another instance of mm->cpumask for my comfort.
If I were implementing this, I would try making switch_mm_irqs_off()
do, roughly:
	void load_mm_pasid(...) {
		if (cpu_feature_enabled(X86_FEATURE_ENQCMD))
			tsk->xstate[offset] = READ_ONCE(next->context.pasid);
	}
This costs one cache miss, although the cache line in question is
about to be read anyway. It might be faster to, instead, do:
	void load_mm_pasid(...) {
		u32 pasid = READ_ONCE(next->context.pasid);

		if (tsk->xstate[offset] != pasid)
			tsk->xstate[offset] = pasid;
	}
so we don't dirty the cache line in the common case. The actual
generated code ought to be pretty good -- surely the offset of PASID
in XSTATE is an entry in an array somewhere that can be found with a
single read, right?
The READ_ONCE is because this could race against a write to
context.pasid, so this code needs to be at the end of the function
where it's protected by mm_cpumask. With all this done, the pasid
update is just on_each_cpu_mask(mm_cpumask(mm), load_mm_pasid, mm,
true).
This looks like maybe 20 lines of code. As an added bonus, it lets us
free PASIDs early if we ever decide we want to.
May I take this opportunity to ask Intel to please put some real
thought into future pieces of CPU state? Here's a summary of some
things we have:
- Normal extended state (FPU, XMM, etc): genuinely per thread and only
ever used explicitly. Actually makes sense with XSAVE(C/S).
- PKRU: affects CPL0-originated memory accesses, so it needs to be
eagerly loaded in the kernel. Does not make sense with XRSTOR(C/S),
but it's in there anyway.
- CR3: per-mm state. Makes some sense in CR3, but it's tangled up
with CR4 in nasty ways.
- LDTR: per-mm on Linux and mostly obsolete everywhere. It's in its own
register, so it's not a big deal.
- PASID: per-mm state (surely Intel always intended it to be per-mm,
since it's for shared _virtual memory_!). But for some reason it's in
an MSR (which is slow), and it's cleverly, but not that cleverly,
accessible with XSAVES/XRSTORS. Doesn't actually make sense. Also,
PASID is lazy-loadable, but the mechanism for telling the kernel that
a lazy load is needed got flubbed.
- TILE: genuinely per-thread, but it's expensive so it's
lazy-loadable. But the lazy-load mechanism reuses #NM, and it's not
fully disambiguated from the other use of #NM. So it sort of works,
but it's gross.
- "KERNEL_GS_BASE", i.e. the shadow GS base. This is logically
per-user-thread state, but it's only accessible in MSRs. For some
reason this is *not* in XSAVES/XRSTORS state, nor is there any
efficient way to access it at all.
- Segment registers: can't be properly saved except by hypervisors,
and can almost, but not quite, be properly loaded (assuming the state
was sane to begin with) by normal kernels. Just don't try to load 1,
2, or 3 into any of them.
Sometimes I think that this is all intended to be as confusing as
possible and that it's all a ploy to keep context switches slow and
complicated. Maybe Intel doesn't actually want to compete with other
architectures that can context switch quickly?
It would be really nice if we had a clean way to load per-mm state
(see my private emails about this), a clean way to load CPL3 register
state, and a clean way to load per-user-thread *kernel* register state
(e.g. PKRU and probably PKRS). And there should be an exception that
says "user code accessed a lazy-loaded resource that isn't loaded, and
this is the resource it tried to access".
Hi, Andy,
On Fri, Jul 31, 2020 at 06:28:37PM -0700, Andy Lutomirski wrote:
> On Mon, Jul 13, 2020 at 4:48 PM Fenghua Yu <[email protected]> wrote:
> >
> > A #GP fault is generated when ENQCMD instruction is executed without
> > a valid PASID value programmed in the current thread's PASID MSR. The
> > #GP fault handler will initialize the MSR if a PASID has been allocated
> > for this process.
>
> Let's take a step back here. Why are we trying to avoid IPIs? If you
> call munmap(), you IPI other CPUs running tasks in the current mm. If
> you do perf_event_open() and thus acquire RDPMC permission, you IPI
> other CPUs running tasks in the current mm. If you call modify_ldt(),
> you IPI other CPUs running tasks in the current mm. These events can
> all happen more than once per process.
>
> Now we have ENQCMD. An mm can be assigned a PASID *once* in the model
> that these patches support. Why not just send an IPI using
> essentially identical code to the LDT sync or the CR4.PCE sync?
LDT (or the other two cases) is different from ENQCMD: the PASID MSR
is per-task and is supported by XSAVES.
The per-task PASID MSR needs to be updated in ALL tasks. That means an IPI,
which only updates running tasks' MSRs, is not enough. All tasks' MSRs
need to be updated when a PASID is allocated.
This difference increases the complexity: sending IPIs to running tasks
and updating sleeping tasks' MSRs with locking etc.
Of course, it's doable not to update the MSRs in all tasks when a new PASID
is allocated to the mm. But that means we would need to discard XSAVES
support for the MSR and create our own switch function to load the MSR.
That increases complexity.
We tried a similar IPI way to update the PASID in about 200 lines of code.
As Dave Hansen pointed out, it's too complex. The current lazy update of the
MSR only takes 3 essential lines of code in #GP.
Does it make sense to still use the current fixup method to update the MSR?
Thanks.
-Fenghua
On 8/3/20 10:16 AM, Andy Lutomirski wrote:
> - TILE: genuinely per-thread, but it's expensive so it's
> lazy-loadable. But the lazy-load mechanism reuses #NM, and it's not
> fully disambiguated from the other use of #NM. So it sort of works,
> but it's gross.
For those playing along at home, there's a new whitepaper out from Intel
about some new CPU features which are going to be fun:
> https://software.intel.com/content/dam/develop/public/us/en/documents/architecture-instruction-set-extensions-programming-reference.pdf
Which part were you worried about? I thought it was fully disambiguated
from this:
> When XFD causes an instruction to generate #NM, the processor loads
> the IA32_XFD_ERR MSR to identify the disabled state component(s).
> Specifically, the MSR is loaded with the logical AND of the IA32_XFD
> MSR and the bitmap corresponding to the state components required by
> the faulting instruction.
>
> Device-not-available exceptions that are not due to XFD — those
> resulting from setting CR0.TS to 1 — do not modify the IA32_XFD_ERR
> MSR.
So if you always make sure to *clear* IA32_XFD_ERR after handling an XFD
exception, any #NMs with a clear IA32_XFD_ERR are from "legacy"
CR0.TS=1. Any bits set in IA32_XFD_ERR mean a new-style XFD exception.
Am I missing something?
> On Aug 3, 2020, at 10:34 AM, Dave Hansen <[email protected]> wrote:
>
> On 8/3/20 10:16 AM, Andy Lutomirski wrote:
>> - TILE: genuinely per-thread, but it's expensive so it's
>> lazy-loadable. But the lazy-load mechanism reuses #NM, and it's not
>> fully disambiguated from the other use of #NM. So it sort of works,
>> but it's gross.
>
> For those playing along at home, there's a new whitepaper out from Intel
> about some new CPU features which are going to be fun:
>
>> https://software.intel.com/content/dam/develop/public/us/en/documents/architecture-instruction-set-extensions-programming-reference.pdf
>
> Which part were you worried about? I thought it was fully disambuguated
> from this:
>
>> When XFD causes an instruction to generate #NM, the processor loads
>> the IA32_XFD_ERR MSR to identify the disabled state component(s).
>> Specifically, the MSR is loaded with the logical AND of the IA32_XFD
>> MSR and the bitmap corresponding to the state components required by
>> the faulting instruction.
>>
>> Device-not-available exceptions that are not due to XFD — those
>> resulting from setting CR0.TS to 1 — do not modify the IA32_XFD_ERR
>> MSR.
>
> So if you always make sure to *clear* IA32_XFD_ERR after handing and XFD
> exception, any #NM's with a clear IA32_XFD_ERR are from "legacy"
> CR0.TS=1. Any bits set in IA32_XFD_ERR mean a new-style XFD exception.
>
> Am I missing something?
I don’t think you’re missing anything, but this mechanism seems to be solidly in the category of “just barely works”, kind of like #DB and DR6, which also just barely work.
And this PASID vs #GP mess is just sad. We’re trying to engineer around an issue that has no need to exist in the first place. For some reason we have two lazy-loading-fault mechanisms showing up in x86 in rapid succession, one of them is maybe 3/4-baked, and the other is totally separate and maybe 1/4 baked.
If ENQCMD instead raised #NM, this would be trivial. (And it even makes sense — the error is, literally, “an instruction tried to use something in XSTATE that isn’t initialized”.) Or if someone had noticed that, kind of like PKRU, PASID didn’t really belong in XSTATE, we wouldn’t have this mess.
Yes, obviously Linux can get all this stuff to work, but maybe Intel could aspire to design features that are straightforward to use well instead of merely possible to use well?