2020-11-02 20:55:57

by Ira Weiny

Subject: [PATCH V2 00/10] PKS: Add Protection Keys Supervisor (PKS) support

From: Ira Weiny <[email protected]>

Changes from V1
    Rebase to TIP master; resolve conflicts and test
    Clean up some kernel doc updates missed in V1
    Add irqentry_state_t kernel doc for the PKRS field
    Removed redundant irq_state->pkrs
        This is only needed when we add the global state; it somehow
        ended up in this patch series. It will come back when we add
        the global functionality.
    From Thomas Gleixner
        Update commit messages
        Add kernel doc for struct irqentry_state_t
    From Dave Hansen
        Add flags to pks_key_alloc()

Changes from RFC V3[3]
    Rebase to TIP master
    Update test error output
    Standardize on 'irq_state' for state variables
    From Dave Hansen
        Update commit messages
        Add/clean up comments
        Add X86_FEATURE_PKS to disabled-features.h and remove some
            explicit CONFIG checks
        Move saved_pkrs member of thread_struct
        Remove superfluous preempt_disable()
        s/irq_save_pks/irq_save_set_pks/
        Ensure PKRS is not seen in faults if not configured or not
            supported
        s/pks_mknoaccess/pks_mk_noaccess/
        s/pks_mkread/pks_mk_readonly/
        s/pks_mkrdwr/pks_mk_readwrite/
        Change pks_key_alloc() return to -EOPNOTSUPP when not supported
    From Peter Zijlstra
        Clean up attribution
        Remove superfluous preempt_disable()
        Add union to differentiate exit_rcu/lockdep use in
            irqentry_state_t
    From Thomas Gleixner
        Add preliminary clean up patch and adjust series as needed


Introduce a new page protection mechanism for supervisor pages: Protection
Keys Supervisor (PKS).

Two use cases for PKS are being developed: trusted keys and PMEM. Trusted
keys is a newer use case which is still being explored. PMEM was submitted
as part of the RFC (v2) series[1]. However, since then it was found that
some callers of kmap() require a global implementation of PKS;
specifically, some users of kmap() expect mappings to be available to all
kernel threads. While global use of PKS is rare, it needs to be included
for correctness. Unfortunately, the kmap() updates required a large patch
series to make the needed changes at the various kmap() call sites, so
that patch set has been split out. Because the global PKS feature is only
required for that use case, it is deferred to that set as well.[2] This
patch set is being submitted as a precursor to both use cases.

For an overview of the entire PKS ecosystem, a git tree including this
series and the two proposed use cases can be found here:

https://lore.kernel.org/lkml/[email protected]/
https://lore.kernel.org/lkml/[email protected]/


PKS enables protections on 'domains' of supervisor pages to limit
supervisor mode access to those pages beyond the normal paging
protections. PKS works in a similar fashion to user space pkeys (PKU).
As with PKU, supervisor pkeys are checked in addition to the normal paging
protections, and accesses or writes can be disabled via an MSR update
without TLB flushes when permissions change. Also like PKU, a page
mapping is assigned to a domain by setting pkey bits in the page table
entry for that mapping.

Access is controlled through the PKRS register, which is updated via
WRMSR/RDMSR.

XSAVE is not supported for the PKRS MSR. Therefore, the implementation
saves/restores the MSR across context switches and during exceptions.
Nested exceptions are supported by giving each exception its own PKS
state.

For consistent behavior with current paging protections, pkey 0 is reserved and
configured to allow full access via the pkey mechanism, thus preserving the
default paging protections on mappings with the default pkey value of 0.

The remaining keys (1-15) are allocated by an allocator, which prepares
us for key contention from day one. Kernel users should be prepared for
the allocator to fail, either because of key exhaustion or because PKS is
not supported on the arch and/or CPU instance.
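
For illustration, a kernel user of this API would follow roughly this
pattern (a sketch assembled from the kernel API and test code in the
patches below; error handling is elided):

	#include <linux/pkeys.h>

	int pkey;
	void *ptr;

	/* May fail with -EOPNOTSUPP or on key exhaustion */
	pkey = pks_key_alloc("my driver", PKS_FLAG_EXCLUSIVE);
	if (pkey < 0)
		return pkey;

	/* Assign the mapping to the domain via pkey bits in the PTE */
	ptr = __vmalloc_node_range(PAGE_SIZE, 1, VMALLOC_START, VMALLOC_END,
				   GFP_KERNEL, PAGE_KERNEL_PKEY(pkey), 0,
				   NUMA_NO_NODE, __builtin_return_address(0));

	pks_mk_readwrite(pkey);		/* open the domain for this thread */
	/* ... access ptr ... */
	pks_mk_noaccess(pkey);		/* close the domain again */

	vfree(ptr);
	pks_key_free(pkey);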

The following are key attributes of PKS.

1) Fast switching of permissions
1a) Prevents access without page table manipulations
1b) No TLB flushes required
2) Works on a per thread basis

PKS is available with 4- and 5-level paging. As with user pkeys, 4 bits
in the PTE store the pkey for the entry.


[1] https://lore.kernel.org/lkml/[email protected]/
[2] https://lore.kernel.org/lkml/[email protected]/
[3] https://lore.kernel.org/lkml/[email protected]/


Fenghua Yu (2):
x86/pks: Enable Protection Keys Supervisor (PKS)
x86/pks: Add PKS kernel API

Ira Weiny (7):
x86/pkeys: Create pkeys_common.h
x86/fpu: Refactor arch_set_user_pkey_access() for PKS support
x86/pks: Preserve the PKRS MSR on context switch
x86/entry: Pass irqentry_state_t by reference
x86/entry: Preserve PKRS MSR across exceptions
x86/fault: Report the PKRS state on fault
x86/pks: Add PKS test code

Thomas Gleixner (1):
x86/entry: Move nmi entry/exit into common code

Documentation/core-api/protection-keys.rst | 103 ++-
arch/x86/Kconfig | 1 +
arch/x86/entry/common.c | 64 +-
arch/x86/include/asm/cpufeatures.h | 1 +
arch/x86/include/asm/disabled-features.h | 8 +-
arch/x86/include/asm/idtentry.h | 28 +-
arch/x86/include/asm/msr-index.h | 1 +
arch/x86/include/asm/pgtable.h | 13 +-
arch/x86/include/asm/pgtable_types.h | 12 +
arch/x86/include/asm/pkeys.h | 15 +
arch/x86/include/asm/pkeys_common.h | 40 ++
arch/x86/include/asm/processor.h | 18 +-
arch/x86/include/uapi/asm/processor-flags.h | 2 +
arch/x86/kernel/cpu/common.c | 15 +
arch/x86/kernel/cpu/mce/core.c | 6 +-
arch/x86/kernel/fpu/xstate.c | 22 +-
arch/x86/kernel/kvm.c | 6 +-
arch/x86/kernel/nmi.c | 6 +-
arch/x86/kernel/process.c | 26 +
arch/x86/kernel/traps.c | 24 +-
arch/x86/mm/fault.c | 87 ++-
arch/x86/mm/pkeys.c | 194 +++++-
include/linux/entry-common.h | 64 +-
include/linux/pgtable.h | 4 +
include/linux/pkeys.h | 24 +
kernel/entry/common.c | 62 +-
lib/Kconfig.debug | 12 +
lib/Makefile | 3 +
lib/pks/Makefile | 3 +
lib/pks/pks_test.c | 691 ++++++++++++++++++++
mm/Kconfig | 2 +
tools/testing/selftests/x86/Makefile | 3 +-
tools/testing/selftests/x86/test_pks.c | 66 ++
33 files changed, 1465 insertions(+), 161 deletions(-)
create mode 100644 arch/x86/include/asm/pkeys_common.h
create mode 100644 lib/pks/Makefile
create mode 100644 lib/pks/pks_test.c
create mode 100644 tools/testing/selftests/x86/test_pks.c

--
2.28.0.rc0.12.gb6a658bd00c9


2020-11-02 20:56:00

by Ira Weiny

Subject: [PATCH V2 01/10] x86/pkeys: Create pkeys_common.h

From: Ira Weiny <[email protected]>

Protection Keys User (PKU) and Protection Keys Supervisor (PKS) work in
similar fashions and can share common defines. Specifically PKS and PKU
each have:

1. A single control register
2. The same number of keys
3. The same number of bits in the register per key
4. Access and Write disable in the same bit locations

That means that we can share all the macros that synthesize and
manipulate register values between the two features. Normally, these
macros would be put in asm/pkeys.h to be used internally and externally
to the arch code. However, the defines are required in pgtable.h and
inclusion of pkeys.h in that header creates complex dependencies which
are best resolved in a separate header.

Share these defines by moving them into a new header, change their names
to reflect the common use, and include the header where needed.
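
As a small illustration, the renamed defines synthesize register values
identically for either feature; for example:

	/* Access-disable keys 1 and 2; key 0 remains fully accessible */
	u32 pkr_val = PKR_AD_KEY(1) | PKR_AD_KEY(2);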

Reviewed-by: Dave Hansen <[email protected]>
Signed-off-by: Ira Weiny <[email protected]>

---
NOTE: The initialization of init_pkru_value causes checkpatch errors
because of the space after the '(' in the macros. We leave this as is
because it is more readable in this format and it was existing code.

---
Changes from RFC V3
    Per Dave Hansen
        Update commit message
        Add comment to PKR_AD_KEY macro
---
arch/x86/include/asm/pgtable.h | 13 ++++++-------
arch/x86/include/asm/pkeys.h | 2 ++
arch/x86/include/asm/pkeys_common.h | 15 +++++++++++++++
arch/x86/kernel/fpu/xstate.c | 8 ++++----
arch/x86/mm/pkeys.c | 14 ++++++--------
5 files changed, 33 insertions(+), 19 deletions(-)
create mode 100644 arch/x86/include/asm/pkeys_common.h

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index a02c67291cfc..bfbfb951fe65 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -1360,9 +1360,7 @@ static inline pmd_t pmd_swp_clear_uffd_wp(pmd_t pmd)
}
#endif /* CONFIG_HAVE_ARCH_USERFAULTFD_WP */

-#define PKRU_AD_BIT 0x1
-#define PKRU_WD_BIT 0x2
-#define PKRU_BITS_PER_PKEY 2
+#include <asm/pkeys_common.h>

#ifdef CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS
extern u32 init_pkru_value;
@@ -1372,18 +1370,19 @@ extern u32 init_pkru_value;

static inline bool __pkru_allows_read(u32 pkru, u16 pkey)
{
- int pkru_pkey_bits = pkey * PKRU_BITS_PER_PKEY;
- return !(pkru & (PKRU_AD_BIT << pkru_pkey_bits));
+ int pkru_pkey_bits = pkey * PKR_BITS_PER_PKEY;
+
+ return !(pkru & (PKR_AD_BIT << pkru_pkey_bits));
}

static inline bool __pkru_allows_write(u32 pkru, u16 pkey)
{
- int pkru_pkey_bits = pkey * PKRU_BITS_PER_PKEY;
+ int pkru_pkey_bits = pkey * PKR_BITS_PER_PKEY;
/*
* Access-disable disables writes too so we need to check
* both bits here.
*/
- return !(pkru & ((PKRU_AD_BIT|PKRU_WD_BIT) << pkru_pkey_bits));
+ return !(pkru & ((PKR_AD_BIT|PKR_WD_BIT) << pkru_pkey_bits));
}

static inline u16 pte_flags_pkey(unsigned long pte_flags)
diff --git a/arch/x86/include/asm/pkeys.h b/arch/x86/include/asm/pkeys.h
index 2ff9b98812b7..f9feba80894b 100644
--- a/arch/x86/include/asm/pkeys.h
+++ b/arch/x86/include/asm/pkeys.h
@@ -2,6 +2,8 @@
#ifndef _ASM_X86_PKEYS_H
#define _ASM_X86_PKEYS_H

+#include <asm/pkeys_common.h>
+
#define ARCH_DEFAULT_PKEY 0

/*
diff --git a/arch/x86/include/asm/pkeys_common.h b/arch/x86/include/asm/pkeys_common.h
new file mode 100644
index 000000000000..737d916f476c
--- /dev/null
+++ b/arch/x86/include/asm/pkeys_common.h
@@ -0,0 +1,15 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_X86_PKEYS_INTERNAL_H
+#define _ASM_X86_PKEYS_INTERNAL_H
+
+#define PKR_AD_BIT 0x1
+#define PKR_WD_BIT 0x2
+#define PKR_BITS_PER_PKEY 2
+
+/*
+ * Generate an Access-Disable mask for the given pkey. Several of these can be
+ * OR'd together to generate pkey register values.
+ */
+#define PKR_AD_KEY(pkey) (PKR_AD_BIT << ((pkey) * PKR_BITS_PER_PKEY))
+
+#endif /*_ASM_X86_PKEYS_INTERNAL_H */
diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index 5d8047441a0a..a99afc70cc0a 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -995,7 +995,7 @@ int arch_set_user_pkey_access(struct task_struct *tsk, int pkey,
unsigned long init_val)
{
u32 old_pkru;
- int pkey_shift = (pkey * PKRU_BITS_PER_PKEY);
+ int pkey_shift = (pkey * PKR_BITS_PER_PKEY);
u32 new_pkru_bits = 0;

/*
@@ -1014,16 +1014,16 @@ int arch_set_user_pkey_access(struct task_struct *tsk, int pkey,

/* Set the bits we need in PKRU: */
if (init_val & PKEY_DISABLE_ACCESS)
- new_pkru_bits |= PKRU_AD_BIT;
+ new_pkru_bits |= PKR_AD_BIT;
if (init_val & PKEY_DISABLE_WRITE)
- new_pkru_bits |= PKRU_WD_BIT;
+ new_pkru_bits |= PKR_WD_BIT;

/* Shift the bits in to the correct place in PKRU for pkey: */
new_pkru_bits <<= pkey_shift;

/* Get old PKRU and mask off any old bits in place: */
old_pkru = read_pkru();
- old_pkru &= ~((PKRU_AD_BIT|PKRU_WD_BIT) << pkey_shift);
+ old_pkru &= ~((PKR_AD_BIT|PKR_WD_BIT) << pkey_shift);

/* Write old part along with new part: */
write_pkru(old_pkru | new_pkru_bits);
diff --git a/arch/x86/mm/pkeys.c b/arch/x86/mm/pkeys.c
index 8873ed1438a9..f5efb4007e74 100644
--- a/arch/x86/mm/pkeys.c
+++ b/arch/x86/mm/pkeys.c
@@ -111,19 +111,17 @@ int __arch_override_mprotect_pkey(struct vm_area_struct *vma, int prot, int pkey
return vma_pkey(vma);
}

-#define PKRU_AD_KEY(pkey) (PKRU_AD_BIT << ((pkey) * PKRU_BITS_PER_PKEY))
-
/*
* Make the default PKRU value (at execve() time) as restrictive
* as possible. This ensures that any threads clone()'d early
* in the process's lifetime will not accidentally get access
* to data which is pkey-protected later on.
*/
-u32 init_pkru_value = PKRU_AD_KEY( 1) | PKRU_AD_KEY( 2) | PKRU_AD_KEY( 3) |
- PKRU_AD_KEY( 4) | PKRU_AD_KEY( 5) | PKRU_AD_KEY( 6) |
- PKRU_AD_KEY( 7) | PKRU_AD_KEY( 8) | PKRU_AD_KEY( 9) |
- PKRU_AD_KEY(10) | PKRU_AD_KEY(11) | PKRU_AD_KEY(12) |
- PKRU_AD_KEY(13) | PKRU_AD_KEY(14) | PKRU_AD_KEY(15);
+u32 init_pkru_value = PKR_AD_KEY( 1) | PKR_AD_KEY( 2) | PKR_AD_KEY( 3) |
+ PKR_AD_KEY( 4) | PKR_AD_KEY( 5) | PKR_AD_KEY( 6) |
+ PKR_AD_KEY( 7) | PKR_AD_KEY( 8) | PKR_AD_KEY( 9) |
+ PKR_AD_KEY(10) | PKR_AD_KEY(11) | PKR_AD_KEY(12) |
+ PKR_AD_KEY(13) | PKR_AD_KEY(14) | PKR_AD_KEY(15);

/*
* Called from the FPU code when creating a fresh set of FPU
@@ -173,7 +171,7 @@ static ssize_t init_pkru_write_file(struct file *file,
* up immediately if someone attempts to disable access
* or writes to pkey 0.
*/
- if (new_init_pkru & (PKRU_AD_BIT|PKRU_WD_BIT))
+ if (new_init_pkru & (PKR_AD_BIT|PKR_WD_BIT))
return -EINVAL;

WRITE_ONCE(init_pkru_value, new_init_pkru);
--
2.28.0.rc0.12.gb6a658bd00c9

2020-11-02 20:56:01

by Ira Weiny

Subject: [PATCH V2 04/10] x86/pks: Preserve the PKRS MSR on context switch

From: Ira Weiny <[email protected]>

The PKRS MSR is defined as a per-logical-processor register. This
isolates memory access by logical CPU. Unfortunately, the MSR is not
managed by XSAVE. Therefore, tasks must save/restore the MSR value on
context switch.

Define a saved PKRS value in the task struct, as well as a cached
per-logical-processor MSR value which mirrors the MSR value of the
current CPU. Initialize all tasks with the default MSR value. Then, on
schedule in, compare the task's saved PKRS value with the per-cpu cached
value and write the MSR only if they differ; otherwise, skip the write
and avoid its overhead.

Follow-on patches will update the saved PKRS as well as the MSR if
needed.

Finally, it should be noted that the underlying WRMSR(MSR_IA32_PKRS) is
not serializing but still maintains ordering properties similar to
WRPKRU. The current SDM section on PKRS needs updating but should be the
same as that of WRPKRU. So to quote from the WRPKRU text:

	WRPKRU will never execute transiently. Memory accesses affected
	by PKRU register will not execute (even transiently) until all
	prior executions of WRPKRU have completed execution and updated
	the PKRU register.
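
Condensed, the schedule-in path implemented below boils down to the
following (a sketch of write_pkrs() from the diff; the per-cpu cache
avoids the WRMSR in the common case):

	u32 *pkrs = get_cpu_ptr(&pkrs_cache);

	/* Write the MSR only when the cached value differs */
	if (*pkrs != current->thread.saved_pkrs) {
		*pkrs = current->thread.saved_pkrs;
		wrmsrl(MSR_IA32_PKRS, *pkrs);
	}
	put_cpu_ptr(pkrs);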

Co-developed-by: Fenghua Yu <[email protected]>
Signed-off-by: Fenghua Yu <[email protected]>
Co-developed-by: Peter Zijlstra <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Signed-off-by: Ira Weiny <[email protected]>

---
Changes from V1
    Rebase to latest tip/master
    Resolve conflicts with INIT_THREAD changes

Changes since RFC V3
    Per Dave Hansen
        Update commit message
        Move saved_pkrs to be in a nicer place
    Per Peter Zijlstra
        Add comment from Peter
        Clean up white space
        Update authorship
---
arch/x86/include/asm/msr-index.h | 1 +
arch/x86/include/asm/pkeys_common.h | 20 +++++++++++++++++++
arch/x86/include/asm/processor.h | 18 ++++++++++++++++-
arch/x86/kernel/cpu/common.c | 2 ++
arch/x86/kernel/process.c | 26 ++++++++++++++++++++++++
arch/x86/mm/pkeys.c | 31 +++++++++++++++++++++++++++++
6 files changed, 97 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index 972a34d93505..ddb125e44408 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -754,6 +754,7 @@

#define MSR_IA32_TSC_DEADLINE 0x000006E0

+#define MSR_IA32_PKRS 0x000006E1

#define MSR_TSX_FORCE_ABORT 0x0000010F

diff --git a/arch/x86/include/asm/pkeys_common.h b/arch/x86/include/asm/pkeys_common.h
index 737d916f476c..801a75615209 100644
--- a/arch/x86/include/asm/pkeys_common.h
+++ b/arch/x86/include/asm/pkeys_common.h
@@ -12,4 +12,24 @@
*/
#define PKR_AD_KEY(pkey) (PKR_AD_BIT << ((pkey) * PKR_BITS_PER_PKEY))

+/*
+ * Define a default PKRS value for each task.
+ *
+ * Key 0 has no restriction. All other keys are set to the most restrictive
+ * value which is access disabled (AD=1).
+ *
+ * NOTE: This needs to be a macro to be used as part of the INIT_THREAD macro.
+ */
+#define INIT_PKRS_VALUE (PKR_AD_KEY(1) | PKR_AD_KEY(2) | PKR_AD_KEY(3) | \
+ PKR_AD_KEY(4) | PKR_AD_KEY(5) | PKR_AD_KEY(6) | \
+ PKR_AD_KEY(7) | PKR_AD_KEY(8) | PKR_AD_KEY(9) | \
+ PKR_AD_KEY(10) | PKR_AD_KEY(11) | PKR_AD_KEY(12) | \
+ PKR_AD_KEY(13) | PKR_AD_KEY(14) | PKR_AD_KEY(15))
+
+#ifdef CONFIG_ARCH_HAS_SUPERVISOR_PKEYS
+void write_pkrs(u32 new_pkrs);
+#else
+static inline void write_pkrs(u32 new_pkrs) { }
+#endif
+
#endif /*_ASM_X86_PKEYS_INTERNAL_H */
diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index 60dbcdcb833f..78eb5f483410 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -18,6 +18,7 @@ struct vm86;
#include <asm/cpufeatures.h>
#include <asm/page.h>
#include <asm/pgtable_types.h>
+#include <asm/pkeys_common.h>
#include <asm/percpu.h>
#include <asm/msr.h>
#include <asm/desc_defs.h>
@@ -522,6 +523,12 @@ struct thread_struct {
unsigned long cr2;
unsigned long trap_nr;
unsigned long error_code;
+
+#ifdef CONFIG_ARCH_HAS_SUPERVISOR_PKEYS
+ /* Saved Protection key register for supervisor mappings */
+ u32 saved_pkrs;
+#endif
+
#ifdef CONFIG_VM86
/* Virtual 86 mode info */
struct vm86 *vm86;
@@ -787,7 +794,16 @@ static inline void spin_lock_prefetch(const void *x)
#define KSTK_ESP(task) (task_pt_regs(task)->sp)

#else
-#define INIT_THREAD { }
+
+#ifdef CONFIG_ARCH_HAS_SUPERVISOR_PKEYS
+#define INIT_THREAD_PKRS .saved_pkrs = INIT_PKRS_VALUE
+#else
+#define INIT_THREAD_PKRS 0
+#endif
+
+#define INIT_THREAD { \
+ INIT_THREAD_PKRS, \
+}

extern unsigned long KSTK_ESP(struct task_struct *task);

diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 6a9ca938d9a9..f8929a557d72 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -58,6 +58,7 @@
#include <asm/intel-family.h>
#include <asm/cpu_device_id.h>
#include <asm/uv/uv.h>
+#include <linux/pkeys.h>

#include "cpu.h"

@@ -1503,6 +1504,7 @@ static void setup_pks(void)
if (!cpu_feature_enabled(X86_FEATURE_PKS))
return;

+ write_pkrs(INIT_PKRS_VALUE);
cr4_set_bits(X86_CR4_PKS);
}

diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index ba4593a913fa..aa2ae5292ff1 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -43,6 +43,7 @@
#include <asm/io_bitmap.h>
#include <asm/proto.h>
#include <asm/frame.h>
+#include <asm/pkeys_common.h>

#include "process.h"

@@ -187,6 +188,27 @@ int copy_thread(unsigned long clone_flags, unsigned long sp, unsigned long arg,
return ret;
}

+#ifdef CONFIG_ARCH_HAS_SUPERVISOR_PKEYS
+DECLARE_PER_CPU(u32, pkrs_cache);
+static inline void pks_init_task(struct task_struct *tsk)
+{
+ /* New tasks get the most restrictive PKRS value */
+ tsk->thread.saved_pkrs = INIT_PKRS_VALUE;
+}
+static inline void pks_sched_in(void)
+{
+ /*
+ * PKRS is only temporarily changed during specific code paths. Only a
+ * preemption during these windows away from the default value would
+ * require updating the MSR. write_pkrs() handles this optimization.
+ */
+ write_pkrs(current->thread.saved_pkrs);
+}
+#else
+static inline void pks_init_task(struct task_struct *tsk) { }
+static inline void pks_sched_in(void) { }
+#endif
+
void flush_thread(void)
{
struct task_struct *tsk = current;
@@ -195,6 +217,8 @@ void flush_thread(void)
memset(tsk->thread.tls_array, 0, sizeof(tsk->thread.tls_array));

fpu__clear_all(&tsk->thread.fpu);
+
+ pks_init_task(tsk);
}

void disable_TSC(void)
@@ -644,6 +668,8 @@ void __switch_to_xtra(struct task_struct *prev_p, struct task_struct *next_p)

if ((tifp ^ tifn) & _TIF_SLD)
switch_to_sld(tifn);
+
+ pks_sched_in();
}

/*
diff --git a/arch/x86/mm/pkeys.c b/arch/x86/mm/pkeys.c
index d1dfe743e79f..76a62419c446 100644
--- a/arch/x86/mm/pkeys.c
+++ b/arch/x86/mm/pkeys.c
@@ -231,3 +231,34 @@ u32 update_pkey_val(u32 pk_reg, int pkey, unsigned int flags)

return pk_reg;
}
+
+DEFINE_PER_CPU(u32, pkrs_cache);
+
+/**
+ * write_pkrs() optimizes MSR writes by maintaining a per cpu cache which can
+ * be checked quickly.
+ *
+ * It should also be noted that the underlying WRMSR(MSR_IA32_PKRS) is not
+ * serializing but still maintains ordering properties similar to WRPKRU.
+ * The current SDM section on PKRS needs updating but should be the same as
+ * that of WRPKRU. So to quote from the WRPKRU text:
+ *
+ * WRPKRU will never execute transiently. Memory accesses
+ * affected by PKRU register will not execute (even transiently)
+ * until all prior executions of WRPKRU have completed execution
+ * and updated the PKRU register.
+ */
+void write_pkrs(u32 new_pkrs)
+{
+ u32 *pkrs;
+
+ if (!static_cpu_has(X86_FEATURE_PKS))
+ return;
+
+ pkrs = get_cpu_ptr(&pkrs_cache);
+ if (*pkrs != new_pkrs) {
+ *pkrs = new_pkrs;
+ wrmsrl(MSR_IA32_PKRS, new_pkrs);
+ }
+ put_cpu_ptr(pkrs);
+}
--
2.28.0.rc0.12.gb6a658bd00c9

2020-11-02 20:56:00

by Ira Weiny

Subject: [PATCH V2 10/10] x86/pks: Add PKS test code

From: Ira Weiny <[email protected]>

The core PKS functionality provides an interface for kernel users to
reserve keys for their domains, set up page tables with those keys, and
control access to those domains when needed.

Define test code which exercises the core functionality of PKS via a
debugfs entry. Basic checks can be triggered on boot with a kernel
command line option, while both basic and preemption checks can be
triggered with separate debugfs values.

debugfs controls are:

'0' -- Run access tests with a single pkey
'1' -- Set up the pkey register with no access for the pkey allocated to
       this fd
'2' -- Check that the pkey register updated in '1' is still the same.
       (To be used after a forced context switch.)
'3' -- Allocate all pkeys possible and run tests on each pkey allocated.
       DEFAULT when run at boot.

Closing the fd will clean up and release the pkey; therefore, to
exercise context switch testing, a user space program is provided in:

.../tools/testing/selftests/x86/test_pks.c

Reviewed-by: Dave Hansen <[email protected]>
Co-developed-by: Fenghua Yu <[email protected]>
Signed-off-by: Fenghua Yu <[email protected]>
Signed-off-by: Ira Weiny <[email protected]>

---
Changes for V1
    Update for new pks_key_alloc()

Changes from RFC V3
    Comments from Dave Hansen
        Clean up whitespace damage
        Clean up Kconfig help
        Clean up user test error output
        s/pks_mknoaccess/pks_mk_noaccess/
        s/pks_mkread/pks_mk_readonly/
        s/pks_mkrdwr/pks_mk_readwrite/
    Comments from Jing Han
        Remove duplicate stdio.h
---
Documentation/core-api/protection-keys.rst | 1 +
arch/x86/mm/fault.c | 23 +
lib/Kconfig.debug | 12 +
lib/Makefile | 3 +
lib/pks/Makefile | 3 +
lib/pks/pks_test.c | 691 +++++++++++++++++++++
tools/testing/selftests/x86/Makefile | 3 +-
tools/testing/selftests/x86/test_pks.c | 66 ++
8 files changed, 801 insertions(+), 1 deletion(-)
create mode 100644 lib/pks/Makefile
create mode 100644 lib/pks/pks_test.c
create mode 100644 tools/testing/selftests/x86/test_pks.c

diff --git a/Documentation/core-api/protection-keys.rst b/Documentation/core-api/protection-keys.rst
index c4e6c480562f..8ffdfbff013c 100644
--- a/Documentation/core-api/protection-keys.rst
+++ b/Documentation/core-api/protection-keys.rst
@@ -164,3 +164,4 @@ of WRPKRU. So to quote from the WRPKRU text:
until all prior executions of WRPKRU have completed execution
and updated the PKRU register.

+Example code can be found in lib/pks/pks_test.c
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 931603102010..0a51b168e8ee 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -18,6 +18,7 @@
#include <linux/uaccess.h> /* faulthandler_disabled() */
#include <linux/efi.h> /* efi_recover_from_page_fault()*/
#include <linux/mm_types.h>
+#include <linux/pkeys.h>

#include <asm/cpufeature.h> /* boot_cpu_has, ... */
#include <asm/traps.h> /* dotraplinkage, ... */
@@ -1149,6 +1150,25 @@ bool fault_in_kernel_space(unsigned long address)
return address >= TASK_SIZE_MAX;
}

+#ifdef CONFIG_PKS_TESTING
+bool pks_test_callback(irqentry_state_t *irq_state);
+static bool handle_pks_testing(unsigned long hw_error_code, irqentry_state_t *irq_state)
+{
+ /*
+ * If we get a protection key exception it could be because we
+ * are running the PKS test. If so, pks_test_callback() will
+ * clear the protection mechanism and return true to indicate
+ * the fault was handled.
+ */
+ return (hw_error_code & X86_PF_PK) && pks_test_callback(irq_state);
+}
+#else
+static bool handle_pks_testing(unsigned long hw_error_code, irqentry_state_t *irq_state)
+{
+ return false;
+}
+#endif
+
/*
* Called for all faults where 'address' is part of the kernel address
* space. Might get called for faults that originate from *code* that
@@ -1165,6 +1185,9 @@ do_kern_addr_fault(struct pt_regs *regs, unsigned long hw_error_code,
if (!cpu_feature_enabled(X86_FEATURE_PKS))
WARN_ON_ONCE(hw_error_code & X86_PF_PK);

+ if (handle_pks_testing(hw_error_code, irq_state))
+ return;
+
#ifdef CONFIG_X86_32
/*
* We can fault-in kernel-space virtual memory on-demand. The
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index d7a7bc3b6098..028beedd86f5 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -2444,6 +2444,18 @@ config HYPERV_TESTING
help
Select this option to enable Hyper-V vmbus testing.

+config PKS_TESTING
+ bool "PKey (S)upervisor testing"
+ default n
+ depends on ARCH_HAS_SUPERVISOR_PKEYS
+ help
+ Select this option to enable testing of PKS core software and
+ hardware. The PKS core provides a mechanism to allocate keys as well
+ as maintain the protection settings across context switches.
+ Answer N if you don't know what supervisor keys are.
+
+ If unsure, say N.
+
endmenu # "Kernel Testing and Coverage"

endmenu # Kernel hacking
diff --git a/lib/Makefile b/lib/Makefile
index ce45af50983a..6a402bc1b9a0 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -352,3 +352,6 @@ obj-$(CONFIG_BITFIELD_KUNIT) += bitfield_kunit.o
obj-$(CONFIG_LIST_KUNIT_TEST) += list-test.o
obj-$(CONFIG_LINEAR_RANGES_TEST) += test_linear_ranges.o
obj-$(CONFIG_BITS_TEST) += test_bits.o
+
+# PKS test
+obj-y += pks/
diff --git a/lib/pks/Makefile b/lib/pks/Makefile
new file mode 100644
index 000000000000..7d1df7563db9
--- /dev/null
+++ b/lib/pks/Makefile
@@ -0,0 +1,3 @@
+# SPDX-License-Identifier: GPL-2.0
+
+obj-$(CONFIG_PKS_TESTING) += pks_test.o
diff --git a/lib/pks/pks_test.c b/lib/pks/pks_test.c
new file mode 100644
index 000000000000..fab4a855edc3
--- /dev/null
+++ b/lib/pks/pks_test.c
@@ -0,0 +1,691 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright(c) 2020 Intel Corporation. All rights reserved.
+ *
+ * Implement PKS testing
+ * Access to run this test can be with a command line parameter
+ * ("pks-test-on-boot") or more detailed tests can be triggered through:
+ *
+ * /sys/kernel/debug/x86/run_pks
+ *
+ * debugfs controls are:
+ *
+ * '0' -- Run access tests with a single pkey
+ *
+ * '1' -- Set up the pkey register with no access for the pkey allocated to
+ * this fd
+ * '2' -- Check that the pkey register updated in '1' is still the same. (To
+ * be used after a forced context switch.)
+ *
+ * '3' -- Allocate all pkeys possible and run tests on each pkey allocated.
+ * DEFAULT when run at boot.
+ *
+ * Closing the fd will cleanup and release the pkey.
+ *
+ * A companion user space program is provided in:
+ *
+ * .../tools/testing/selftests/x86/test_pks.c
+ *
+ * which will better test the context switching.
+ *
+ */
+
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
+#include <linux/debugfs.h>
+#include <linux/delay.h>
+#include <linux/entry-common.h>
+#include <linux/fs.h>
+#include <linux/list.h>
+#include <linux/mman.h>
+#include <linux/module.h>
+#include <linux/mutex.h>
+#include <linux/percpu-defs.h>
+#include <linux/pgtable.h>
+#include <linux/pkeys.h>
+#include <linux/slab.h>
+#include <linux/vmalloc.h>
+
+#define PKS_TEST_MEM_SIZE (PAGE_SIZE)
+
+/*
+ * run_on_boot defaults to false; checkpatch complains about explicitly
+ * initializing it, so we don't.
+ */
+static bool run_on_boot;
+static struct dentry *pks_test_dentry;
+static bool run_9;
+
+/*
+ * We must lock the following globals for brief periods while the fault handler
+ * checks/updates them.
+ */
+static DEFINE_SPINLOCK(test_lock);
+static int test_armed_key;
+static unsigned long prev_cnt;
+static unsigned long fault_cnt;
+
+struct pks_test_ctx {
+ bool pass;
+ bool pks_cpu_enabled;
+ int pkey;
+ char data[64];
+};
+static struct pks_test_ctx *test_exception_ctx;
+
+static pte_t *walk_table(void *ptr)
+{
+ struct page *page = NULL;
+ pgd_t *pgdp;
+ p4d_t *p4dp;
+ pud_t *pudp;
+ pmd_t *pmdp;
+ pte_t *ret = NULL;
+
+ pgdp = pgd_offset_k((unsigned long)ptr);
+ if (pgd_none(*pgdp) || pgd_bad(*pgdp))
+ goto error;
+
+ p4dp = p4d_offset(pgdp, (unsigned long)ptr);
+ if (p4d_none(*p4dp) || p4d_bad(*p4dp))
+ goto error;
+
+ pudp = pud_offset(p4dp, (unsigned long)ptr);
+ if (pud_none(*pudp) || pud_bad(*pudp))
+ goto error;
+
+ pmdp = pmd_offset(pudp, (unsigned long)ptr);
+ if (pmd_none(*pmdp) || pmd_bad(*pmdp))
+ goto error;
+
+ ret = pte_offset_map(pmdp, (unsigned long)ptr);
+ if (pte_present(*ret)) {
+ page = pte_page(*ret);
+ if (!page) {
+ pte_unmap(ret);
+ goto error;
+ }
+ pr_info("page 0x%lx; flags 0x%lx\n",
+ (unsigned long)page, page->flags);
+ }
+
+error:
+ return ret;
+}
+
+static bool check_pkey_val(u32 pk_reg, int pkey, u32 expected)
+{
+ u32 pkey_shift = pkey * PKR_BITS_PER_PKEY;
+ u32 pkey_mask = ((1 << PKR_BITS_PER_PKEY) - 1) << pkey_shift;
+
+ pk_reg = (pk_reg & pkey_mask) >> pkey_shift;
+ return (pk_reg == expected);
+}
+
+/*
+ * Check if the register @pkey value matches @expected value
+ *
+ * Both the cached and actual MSR must match.
+ */
+static bool check_pkrs(int pkey, u32 expected)
+{
+ bool ret = true;
+ u64 pkrs;
+ u32 *tmp_cache;
+
+ tmp_cache = get_cpu_ptr(&pkrs_cache);
+ if (!check_pkey_val(*tmp_cache, pkey, expected))
+ ret = false;
+ put_cpu_ptr(tmp_cache);
+
+ rdmsrl(MSR_IA32_PKRS, pkrs);
+ if (!check_pkey_val(pkrs, pkey, expected))
+ ret = false;
+
+ return ret;
+}
+
+static void check_exception(irqentry_state_t *irq_state)
+{
+ /* Check the thread saved state */
+ if (!check_pkey_val(irq_state->pkrs, test_armed_key, PKEY_DISABLE_WRITE)) {
+ pr_err(" FAIL: checking irq_state->pkrs\n");
+ test_exception_ctx->pass = false;
+ }
+
+ /* Check the exception state */
+ if (!check_pkrs(test_armed_key, PKEY_DISABLE_ACCESS)) {
+ pr_err(" FAIL: PKRS cache and MSR\n");
+ test_exception_ctx->pass = false;
+ }
+
+ /*
+ * Check we can update the value during exception without affecting the
+ * calling thread. The calling thread is checked after exception...
+ */
+ pks_mk_readwrite(test_armed_key);
+ if (!check_pkrs(test_armed_key, 0)) {
+ pr_err(" FAIL: exception did not change register to 0\n");
+ test_exception_ctx->pass = false;
+ }
+ pks_mk_noaccess(test_armed_key);
+ if (!check_pkrs(test_armed_key, PKEY_DISABLE_ACCESS)) {
+ pr_err(" FAIL: exception did not change register to 0x%x\n",
+ PKEY_DISABLE_ACCESS);
+ test_exception_ctx->pass = false;
+ }
+}
+
+/* Silence prototype warning */
+bool pks_test_callback(irqentry_state_t *irq_state);
+
+/**
+ * pks_test_callback() is exported so that the fault handler can detect
+ * and report back status of intentional faults.
+ *
+ * NOTE: It clears the protection key from the page such that the fault handler
+ * will not re-trigger.
+ */
+bool pks_test_callback(irqentry_state_t *irq_state)
+{
+ bool armed = (test_armed_key != 0);
+
+ if (test_exception_ctx) {
+ check_exception(irq_state);
+ /*
+ * Stop this check within the exception because the fault
+ * handler cleanup code will call us twice while checking the
+ * PMD entry, and we don't need to check this again.
+ */
+ test_exception_ctx = NULL;
+ }
+
+ if (armed) {
+ /* Enable read and write to stop faults */
+ irq_state->pkrs = update_pkey_val(irq_state->pkrs, test_armed_key, 0);
+ fault_cnt++;
+ }
+
+ return armed;
+}
+EXPORT_SYMBOL(pks_test_callback);
+
+static bool exception_caught(void)
+{
+ bool ret = (fault_cnt != prev_cnt);
+
+ prev_cnt = fault_cnt;
+ return ret;
+}
+
+static void report_pkey_settings(void *unused)
+{
+ u8 pkey;
+ unsigned long long msr = 0;
+ unsigned int cpu = smp_processor_id();
+
+ rdmsrl(MSR_IA32_PKRS, msr);
+
+ pr_info("for CPU %d : 0x%llx\n", cpu, msr);
+ for (pkey = 0; pkey < PKS_NUM_KEYS; pkey++) {
+ int ad, wd;
+
+ ad = (msr >> (pkey * PKR_BITS_PER_PKEY)) & PKEY_DISABLE_ACCESS;
+ wd = (msr >> (pkey * PKR_BITS_PER_PKEY)) & PKEY_DISABLE_WRITE;
+ pr_info(" %u: A:%d W:%d\n", pkey, ad, wd);
+ }
+}
+
+enum pks_access_mode {
+ PKS_TEST_NO_ACCESS,
+ PKS_TEST_RDWR,
+ PKS_TEST_RDONLY
+};
+
+static char *get_mode_str(enum pks_access_mode mode)
+{
+ switch (mode) {
+ case PKS_TEST_NO_ACCESS:
+ return "No Access";
+ case PKS_TEST_RDWR:
+ return "Read Write";
+ case PKS_TEST_RDONLY:
+ return "Read Only";
+ default:
+ pr_err("BUG in test invalid mode\n");
+ break;
+ }
+
+ return "";
+}
+
+struct pks_access_test {
+ enum pks_access_mode mode;
+ bool write;
+ bool exception;
+};
+
+static struct pks_access_test pkey_test_ary[] = {
+ /* disable both */
+ { PKS_TEST_NO_ACCESS, true, true },
+ { PKS_TEST_NO_ACCESS, false, true },
+
+ /* enable both */
+ { PKS_TEST_RDWR, true, false },
+ { PKS_TEST_RDWR, false, false },
+
+ /* enable read only */
+ { PKS_TEST_RDONLY, true, true },
+ { PKS_TEST_RDONLY, false, false },
+};
+
+static int test_it(struct pks_test_ctx *ctx, struct pks_access_test *test, void *ptr)
+{
+ bool exception;
+ int ret = 0;
+
+ spin_lock(&test_lock);
+ WRITE_ONCE(test_armed_key, ctx->pkey);
+
+ if (test->write)
+ memcpy(ptr, ctx->data, 8);
+ else
+ memcpy(ctx->data, ptr, 8);
+
+ exception = exception_caught();
+
+ WRITE_ONCE(test_armed_key, 0);
+ spin_unlock(&test_lock);
+
+ if (test->exception != exception) {
+ pr_err("pkey test FAILED: mode %s; write %s; exception %s != %s\n",
+ get_mode_str(test->mode),
+ test->write ? "TRUE" : "FALSE",
+ test->exception ? "TRUE" : "FALSE",
+ exception ? "TRUE" : "FALSE");
+ ret = -EFAULT;
+ }
+
+ return ret;
+}
+
+static int run_access_test(struct pks_test_ctx *ctx,
+ struct pks_access_test *test,
+ void *ptr)
+{
+ switch (test->mode) {
+ case PKS_TEST_NO_ACCESS:
+ pks_mk_noaccess(ctx->pkey);
+ break;
+ case PKS_TEST_RDWR:
+ pks_mk_readwrite(ctx->pkey);
+ break;
+ case PKS_TEST_RDONLY:
+ pks_mk_readonly(ctx->pkey);
+ break;
+ default:
+ pr_err("BUG in test invalid mode\n");
+ break;
+ }
+
+ return test_it(ctx, test, ptr);
+}
+
+static void *alloc_test_page(int pkey)
+{
+ return __vmalloc_node_range(PKS_TEST_MEM_SIZE, 1, VMALLOC_START, VMALLOC_END,
+ GFP_KERNEL, PAGE_KERNEL_PKEY(pkey), 0,
+ NUMA_NO_NODE, __builtin_return_address(0));
+}
+
+static void test_mem_access(struct pks_test_ctx *ctx)
+{
+ int i, rc;
+ u8 pkey;
+ void *ptr = NULL;
+ pte_t *ptep;
+
+ ptr = alloc_test_page(ctx->pkey);
+ if (!ptr) {
+ pr_err("Failed to vmalloc page???\n");
+ ctx->pass = false;
+ return;
+ }
+
+ ptep = walk_table(ptr);
+ if (!ptep) {
+ pr_err("Failed to walk table???\n");
+ ctx->pass = false;
+ goto done;
+ }
+
+ pkey = pte_flags_pkey(ptep->pte);
+ pr_info("ptep flags 0x%lx pkey %u\n",
+ (unsigned long)ptep->pte, pkey);
+
+ if (pkey != ctx->pkey) {
+ pr_err("invalid pkey found: %u, test_pkey: %u\n",
+ pkey, ctx->pkey);
+ ctx->pass = false;
+ goto unmap;
+ }
+
+ if (!ctx->pks_cpu_enabled) {
+ pr_err("not CPU enabled; skipping access tests...\n");
+ ctx->pass = true;
+ goto unmap;
+ }
+
+ for (i = 0; i < ARRAY_SIZE(pkey_test_ary); i++) {
+ rc = run_access_test(ctx, &pkey_test_ary[i], ptr);
+
+ /* saving only the last error is fine */
+ if (rc)
+ ctx->pass = false;
+ }
+
+unmap:
+ pte_unmap(ptep);
+done:
+ vfree(ptr);
+}
+
+static void pks_run_test(struct pks_test_ctx *ctx)
+{
+ ctx->pass = true;
+
+ pr_info("\n");
+ pr_info("\n");
+ pr_info(" ***** BEGIN: Testing (CPU enabled : %s) *****\n",
+ ctx->pks_cpu_enabled ? "TRUE" : "FALSE");
+
+ if (ctx->pks_cpu_enabled)
+ on_each_cpu(report_pkey_settings, NULL, 1);
+
+ pr_info(" BEGIN: pkey %d Testing\n", ctx->pkey);
+ test_mem_access(ctx);
+ pr_info(" END: PAGE_KERNEL_PKEY Testing : %s\n",
+ ctx->pass ? "PASS" : "FAIL");
+
+ pr_info(" ***** END: Testing *****\n");
+ pr_info("\n");
+ pr_info("\n");
+}
+
+static ssize_t pks_read_file(struct file *file, char __user *user_buf,
+ size_t count, loff_t *ppos)
+{
+ struct pks_test_ctx *ctx = file->private_data;
+ char buf[32];
+ unsigned int len;
+
+ if (!ctx)
+ len = sprintf(buf, "not run\n");
+ else
+ len = sprintf(buf, "%s\n", ctx->pass ? "PASS" : "FAIL");
+
+ return simple_read_from_buffer(user_buf, count, ppos, buf, len);
+}
+
+static struct pks_test_ctx *alloc_ctx(const char *name)
+{
+ struct pks_test_ctx *ctx = kzalloc(sizeof(*ctx), GFP_KERNEL);
+
+ if (!ctx) {
+ pr_err("Failed to allocate memory for test context\n");
+ return ERR_PTR(-ENOMEM);
+ }
+
+ ctx->pkey = pks_key_alloc(name, PKS_FLAG_EXCLUSIVE);
+ if (ctx->pkey <= 0) {
+ pr_err("Failed to allocate memory for test context\n");
+ kfree(ctx);
+ return ERR_PTR(-ENOMEM);
+ }
+
+ ctx->pks_cpu_enabled = cpu_feature_enabled(X86_FEATURE_PKS);
+ sprintf(ctx->data, "%s", "DEADBEEF");
+ return ctx;
+}
+
+static void free_ctx(struct pks_test_ctx *ctx)
+{
+ pks_key_free(ctx->pkey);
+ kfree(ctx);
+}
+
+static void run_exception_test(void)
+{
+ void *ptr = NULL;
+ bool pass = true;
+ struct pks_test_ctx *ctx;
+
+ pr_info(" ***** BEGIN: exception checking\n");
+
+ ctx = alloc_ctx("Exception test");
+ if (IS_ERR(ctx)) {
+ pr_err(" FAIL: no context\n");
+ pass = false;
+ goto result;
+ }
+ ctx->pass = true;
+
+ ptr = alloc_test_page(ctx->pkey);
+ if (!ptr) {
+ pr_err(" FAIL: no vmalloc page\n");
+ pass = false;
+ goto free_context;
+ }
+
+ pks_mk_readonly(ctx->pkey);
+
+ spin_lock(&test_lock);
+ WRITE_ONCE(test_exception_ctx, ctx);
+ WRITE_ONCE(test_armed_key, ctx->pkey);
+
+ memcpy(ptr, ctx->data, 8);
+
+ if (!exception_caught()) {
+ pr_err(" FAIL: did not get an exception\n");
+ pass = false;
+ }
+
+ /*
+ * NOTE The exception code has to enable access (b00) to keep the
+ * fault from looping forever. So we don't see the write disabled
+ * restored but rather full access restored. Also note that as part
+ * of this test the exception callback attempted to disable access
+ * completely (b11) and so we ensure that we are seeing the proper
+ * thread value restored here.
+ */
+ if (!check_pkrs(test_armed_key, 0)) {
+ pr_err(" FAIL: PKRS not restored\n");
+ pass = false;
+ }
+
+ if (!ctx->pass)
+ pass = false;
+
+ WRITE_ONCE(test_armed_key, 0);
+ spin_unlock(&test_lock);
+
+ vfree(ptr);
+free_context:
+ free_ctx(ctx);
+result:
+ pr_info(" ***** END: exception checking : %s\n",
+ pass ? "PASS" : "FAIL");
+}
+
+static void run_all(void)
+{
+ struct pks_test_ctx *ctx[PKS_NUM_KEYS];
+ static char name[PKS_NUM_KEYS][64];
+ int i;
+
+ for (i = 1; i < PKS_NUM_KEYS; i++) {
+ sprintf(name[i], "pks ctx %d", i);
+ ctx[i] = alloc_ctx((const char *)name[i]);
+ }
+
+ for (i = 1; i < PKS_NUM_KEYS; i++) {
+ if (!IS_ERR(ctx[i]))
+ pks_run_test(ctx[i]);
+ }
+
+ for (i = 1; i < PKS_NUM_KEYS; i++) {
+ if (!IS_ERR(ctx[i]))
+ free_ctx(ctx[i]);
+ }
+
+ run_exception_test();
+}
+
+static void crash_it(void)
+{
+ struct pks_test_ctx *ctx;
+ void *ptr;
+
+ pr_warn(" ***** BEGIN: Unhandled fault test *****\n");
+
+ ctx = alloc_ctx("crashing kernel\n");
+
+ ptr = alloc_test_page(ctx->pkey);
+ if (!ptr) {
+ pr_err("Failed to vmalloc page???\n");
+ ctx->pass = false;
+ return;
+ }
+
+ pks_mk_noaccess(ctx->pkey);
+
+ spin_lock(&test_lock);
+ WRITE_ONCE(test_armed_key, 0);
+ /* This purposely faults */
+ memcpy(ptr, ctx->data, 8);
+ spin_unlock(&test_lock);
+
+ vfree(ptr);
+ free_ctx(ctx);
+}
+
+static ssize_t pks_write_file(struct file *file, const char __user *user_buf,
+ size_t count, loff_t *ppos)
+{
+ char buf[2];
+ struct pks_test_ctx *ctx = file->private_data;
+
+ if (copy_from_user(buf, user_buf, 1))
+ return -EFAULT;
+ buf[1] = '\0';
+
+ /*
+ * WARNING: Test "9" will crash the kernel.
+ *
+ * So we arm the test and print a warning. A second "9" will run the
+ * test.
+ */
+ if (!strcmp(buf, "9")) {
+ if (run_9) {
+ crash_it();
+ run_9 = false;
+ } else {
+ pr_warn("CAUTION: Test 9 will crash in the kernel.\n");
+ pr_warn(" Specify 9 a second time to run\n");
+ pr_warn(" run any other test to clear\n");
+ run_9 = true;
+ }
+ } else {
+ run_9 = false;
+ }
+
+ /*
+ * Test "3" will test allocating all keys. Do it first without
+ * using "ctx".
+ */
+ if (!strcmp(buf, "3"))
+ run_all();
+
+ if (!ctx) {
+ ctx = alloc_ctx("pks test");
+ if (IS_ERR(ctx))
+ return -ENOMEM;
+ file->private_data = ctx;
+ }
+
+ if (!strcmp(buf, "0"))
+ pks_run_test(ctx);
+
+ /* start of context switch test */
+ if (!strcmp(buf, "1")) {
+ /* Ensure a known state to test context switch */
+ pks_mk_noaccess(ctx->pkey);
+ }
+
+ /* After context switch msr should be restored */
+ if (!strcmp(buf, "2") && ctx->pks_cpu_enabled) {
+ unsigned long reg_pkrs;
+ int access;
+
+ rdmsrl(MSR_IA32_PKRS, reg_pkrs);
+
+ access = (reg_pkrs >> (ctx->pkey * PKR_BITS_PER_PKEY)) &
+ PKEY_ACCESS_MASK;
+ if (access != (PKEY_DISABLE_ACCESS | PKEY_DISABLE_WRITE)) {
+ ctx->pass = false;
+ pr_err("Context switch check failed\n");
+ }
+ }
+
+ return count;
+}
+
+static int pks_release_file(struct inode *inode, struct file *file)
+{
+ struct pks_test_ctx *ctx = file->private_data;
+
+ if (!ctx)
+ return 0;
+
+ free_ctx(ctx);
+ return 0;
+}
+
+static const struct file_operations fops_init_pks = {
+ .read = pks_read_file,
+ .write = pks_write_file,
+ .llseek = default_llseek,
+ .release = pks_release_file,
+};
+
+static int __init parse_pks_test_options(char *str)
+{
+ run_on_boot = true;
+
+ return 0;
+}
+early_param("pks-test-on-boot", parse_pks_test_options);
+
+static int __init pks_test_init(void)
+{
+ if (cpu_feature_enabled(X86_FEATURE_PKS)) {
+ if (run_on_boot)
+ run_all();
+
+ pks_test_dentry = debugfs_create_file("run_pks", 0600, arch_debugfs_dir,
+ NULL, &fops_init_pks);
+ }
+
+ return 0;
+}
+late_initcall(pks_test_init);
+
+static void __exit pks_test_exit(void)
+{
+ debugfs_remove(pks_test_dentry);
+ pr_info("test exit\n");
+}
+module_exit(pks_test_exit);
+
+MODULE_AUTHOR("Intel Corporation");
+MODULE_LICENSE("GPL v2");
diff --git a/tools/testing/selftests/x86/Makefile b/tools/testing/selftests/x86/Makefile
index 6703c7906b71..f5c80f952eab 100644
--- a/tools/testing/selftests/x86/Makefile
+++ b/tools/testing/selftests/x86/Makefile
@@ -13,7 +13,8 @@ CAN_BUILD_WITH_NOPIE := $(shell ./check_cc.sh $(CC) trivial_program.c -no-pie)
TARGETS_C_BOTHBITS := single_step_syscall sysret_ss_attrs syscall_nt test_mremap_vdso \
check_initial_reg_state sigreturn iopl ioperm \
test_vdso test_vsyscall mov_ss_trap \
- syscall_arg_fault fsgsbase_restore
+ syscall_arg_fault fsgsbase_restore test_pks
+
TARGETS_C_32BIT_ONLY := entry_from_vm86 test_syscall_vdso unwind_vdso \
test_FCMOV test_FCOMI test_FISTTP \
vdso_restorer
diff --git a/tools/testing/selftests/x86/test_pks.c b/tools/testing/selftests/x86/test_pks.c
new file mode 100644
index 000000000000..cd40f930b172
--- /dev/null
+++ b/tools/testing/selftests/x86/test_pks.c
@@ -0,0 +1,66 @@
+// SPDX-License-Identifier: GPL-2.0-only
+
+#define _GNU_SOURCE
+#include <sched.h>
+#include <stdlib.h>
+#include <unistd.h>
+#include <assert.h>
+#include <stdio.h>
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <fcntl.h>
+
+#define PKS_TEST_FILE "/sys/kernel/debug/x86/run_pks"
+
+int main(void)
+{
+ cpu_set_t cpuset;
+ char result[32];
+ pid_t pid;
+ int fd;
+
+ CPU_ZERO(&cpuset);
+ CPU_SET(0, &cpuset);
+ /* Two processes run on CPU 0 so that they go through context switch. */
+ sched_setaffinity(getpid(), sizeof(cpu_set_t), &cpuset);
+
+ pid = fork();
+ if (pid == 0) {
+ fd = open(PKS_TEST_FILE, O_RDWR);
+ if (fd < 0) {
+ printf("cannot open %s\n", PKS_TEST_FILE);
+ return -1;
+ }
+
+ /* Allocate test_pkey1 and run test. */
+ write(fd, "0", 1);
+
+ /* Arm for context switch test */
+ write(fd, "1", 1);
+
+ /* Context switch out... */
+ sleep(4);
+
+ /* Check msr restored */
+ write(fd, "2", 1);
+ } else {
+ sleep(2);
+
+ fd = open(PKS_TEST_FILE, O_RDWR);
+ if (fd < 0) {
+ printf("cannot open %s\n", PKS_TEST_FILE);
+ return -1;
+ }
+
+ /* run test with alternate pkey */
+ write(fd, "0", 1);
+ }
+
+ read(fd, result, 10);
+ printf("#PF, context switch, pkey allocation and free tests: %s\n",
+ result);
+
+ close(fd);
+
+ return 0;
+}
--
2.28.0.rc0.12.gb6a658bd00c9

2020-11-02 20:56:05

by Ira Weiny

Subject: [PATCH V2 03/10] x86/pks: Enable Protection Keys Supervisor (PKS)

From: Fenghua Yu <[email protected]>

Protection Keys for Supervisor pages (PKS) enables fast,
hardware-thread-specific manipulation of permission restrictions on
supervisor page mappings. It uses the same Protection Keys mechanism as
user mappings but applies it to supervisor mappings using a
supervisor-specific MSR.

Kernel users can thus define 'domains' of page mappings which have an
extra level of protection beyond those specified in the supervisor page
table entries.

Define ARCH_HAS_SUPERVISOR_PKEYS to distinguish this functionality from
the existing ARCH_HAS_PKEYS, and enable PKS when configured and
indicated by the CPU instance. While not strictly necessary in this
patch, ARCH_HAS_SUPERVISOR_PKEYS separates this functionality throughout
the patch series, so it is introduced here.
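
With DISABLE_PKS wired into disabled-features.h (see the diff below),
cpu_feature_enabled() folds to a compile-time false when the feature is
not configured, so the enabling path in this patch reduces to:

	static void setup_pks(void)
	{
		/* Compiled out entirely when PKS is not configured */
		if (!cpu_feature_enabled(X86_FEATURE_PKS))
			return;

		cr4_set_bits(X86_CR4_PKS);
	}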

Co-developed-by: Ira Weiny <[email protected]>
Signed-off-by: Ira Weiny <[email protected]>
Signed-off-by: Fenghua Yu <[email protected]>

---
Changes since RFC V3
    Per Dave Hansen
        Update comment
        Add X86_FEATURE_PKS to disabled-features.h
    Rebase based on latest TIP tree
---
arch/x86/Kconfig | 1 +
arch/x86/include/asm/cpufeatures.h | 1 +
arch/x86/include/asm/disabled-features.h | 8 +++++++-
arch/x86/include/uapi/asm/processor-flags.h | 2 ++
arch/x86/kernel/cpu/common.c | 13 +++++++++++++
mm/Kconfig | 2 ++
6 files changed, 26 insertions(+), 1 deletion(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index f6946b81f74a..78c4c749c6a9 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1876,6 +1876,7 @@ config X86_INTEL_MEMORY_PROTECTION_KEYS
depends on X86_64 && (CPU_SUP_INTEL || CPU_SUP_AMD)
select ARCH_USES_HIGH_VMA_FLAGS
select ARCH_HAS_PKEYS
+ select ARCH_HAS_SUPERVISOR_PKEYS
help
Memory Protection Keys provides a mechanism for enforcing
page-based protections, but without requiring modification of the
diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index dad350d42ecf..4deb580324e8 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -356,6 +356,7 @@
#define X86_FEATURE_MOVDIRI (16*32+27) /* MOVDIRI instruction */
#define X86_FEATURE_MOVDIR64B (16*32+28) /* MOVDIR64B instruction */
#define X86_FEATURE_ENQCMD (16*32+29) /* ENQCMD and ENQCMDS instructions */
+#define X86_FEATURE_PKS (16*32+31) /* Protection Keys for Supervisor pages */

/* AMD-defined CPU features, CPUID level 0x80000007 (EBX), word 17 */
#define X86_FEATURE_OVERFLOW_RECOV (17*32+ 0) /* MCA overflow recovery support */
diff --git a/arch/x86/include/asm/disabled-features.h b/arch/x86/include/asm/disabled-features.h
index 5861d34f9771..82540f0c5b6c 100644
--- a/arch/x86/include/asm/disabled-features.h
+++ b/arch/x86/include/asm/disabled-features.h
@@ -44,6 +44,12 @@
# define DISABLE_OSPKE (1<<(X86_FEATURE_OSPKE & 31))
#endif /* CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS */

+#ifdef CONFIG_ARCH_HAS_SUPERVISOR_PKEYS
+# define DISABLE_PKS 0
+#else
+# define DISABLE_PKS (1<<(X86_FEATURE_PKS & 31))
+#endif
+
#ifdef CONFIG_X86_5LEVEL
# define DISABLE_LA57 0
#else
@@ -82,7 +88,7 @@
#define DISABLED_MASK14 0
#define DISABLED_MASK15 0
#define DISABLED_MASK16 (DISABLE_PKU|DISABLE_OSPKE|DISABLE_LA57|DISABLE_UMIP| \
- DISABLE_ENQCMD)
+ DISABLE_ENQCMD|DISABLE_PKS)
#define DISABLED_MASK17 0
#define DISABLED_MASK18 0
#define DISABLED_MASK_CHECK BUILD_BUG_ON_ZERO(NCAPINTS != 19)
diff --git a/arch/x86/include/uapi/asm/processor-flags.h b/arch/x86/include/uapi/asm/processor-flags.h
index bcba3c643e63..191c574b2390 100644
--- a/arch/x86/include/uapi/asm/processor-flags.h
+++ b/arch/x86/include/uapi/asm/processor-flags.h
@@ -130,6 +130,8 @@
#define X86_CR4_SMAP _BITUL(X86_CR4_SMAP_BIT)
#define X86_CR4_PKE_BIT 22 /* enable Protection Keys support */
#define X86_CR4_PKE _BITUL(X86_CR4_PKE_BIT)
+#define X86_CR4_PKS_BIT 24 /* enable Protection Keys for Supervisor */
+#define X86_CR4_PKS _BITUL(X86_CR4_PKS_BIT)

/*
* x86-64 Task Priority Register, CR8
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 35ad8480c464..6a9ca938d9a9 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -1494,6 +1494,18 @@ static void validate_apic_and_package_id(struct cpuinfo_x86 *c)
#endif
}

+/*
+ * PKS is independent of PKU and either or both may be supported on a CPU.
+ * Configure PKS if the CPU supports the feature.
+ */
+static void setup_pks(void)
+{
+ if (!cpu_feature_enabled(X86_FEATURE_PKS))
+ return;
+
+ cr4_set_bits(X86_CR4_PKS);
+}
+
/*
* This does the hard work of actually picking apart the CPU stuff...
*/
@@ -1591,6 +1603,7 @@ static void identify_cpu(struct cpuinfo_x86 *c)

x86_init_rdrand(c);
setup_pku(c);
+ setup_pks();

/*
* Clear/Set all flags overridden by options, need do it
diff --git a/mm/Kconfig b/mm/Kconfig
index d42423f884a7..fc9ce7f65683 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -826,6 +826,8 @@ config ARCH_USES_HIGH_VMA_FLAGS
bool
config ARCH_HAS_PKEYS
bool
+config ARCH_HAS_SUPERVISOR_PKEYS
+ bool

config PERCPU_STATS
bool "Collect percpu memory statistics"
--
2.28.0.rc0.12.gb6a658bd00c9

2020-11-02 20:56:10

by Ira Weiny

Subject: [PATCH V2 06/10] x86/entry: Move nmi entry/exit into common code

From: Thomas Gleixner <[email protected]>

Lockdep state handling on NMI enter and exit is nothing specific to x86;
it is no different on other architectures. Also, the extra state type is
not necessary: irqentry_state_t can carry the necessary information as
well.

Move it to common code and extend irqentry_state_t to carry lockdep state.

Ira: Make exit_rcu and lockdep a union as they are mutually exclusive
between the IRQ and NMI exceptions, and add kernel documentation for
struct irqentry_state_t
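
Each converted NMI-like entry point then follows the same shape (a
sketch; the diff below performs this conversion at every site):

	irqentry_state_t irq_state = irqentry_nmi_enter(regs);

	instrumentation_begin();
	/* ... NMI handling ... */
	instrumentation_end();

	irqentry_nmi_exit(regs, irq_state);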

Signed-off-by: Thomas Gleixner <[email protected]>
Signed-off-by: Ira Weiny <[email protected]>

---
Changes from V1
    Update commit message
    Add kernel doc for struct irqentry_state_t
        With Thomas' help
---
arch/x86/entry/common.c | 34 ----------------------------
arch/x86/include/asm/idtentry.h | 3 ---
arch/x86/kernel/cpu/mce/core.c | 6 ++---
arch/x86/kernel/nmi.c | 6 ++---
arch/x86/kernel/traps.c | 13 ++++++-----
include/linux/entry-common.h | 39 ++++++++++++++++++++++++++++++++-
kernel/entry/common.c | 36 ++++++++++++++++++++++++++++++
7 files changed, 87 insertions(+), 50 deletions(-)

diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index 870efeec8bda..18d8f17f755c 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -209,40 +209,6 @@ SYSCALL_DEFINE0(ni_syscall)
return -ENOSYS;
}

-noinstr bool idtentry_enter_nmi(struct pt_regs *regs)
-{
- bool irq_state = lockdep_hardirqs_enabled();
-
- __nmi_enter();
- lockdep_hardirqs_off(CALLER_ADDR0);
- lockdep_hardirq_enter();
- rcu_nmi_enter();
-
- instrumentation_begin();
- trace_hardirqs_off_finish();
- ftrace_nmi_enter();
- instrumentation_end();
-
- return irq_state;
-}
-
-noinstr void idtentry_exit_nmi(struct pt_regs *regs, bool restore)
-{
- instrumentation_begin();
- ftrace_nmi_exit();
- if (restore) {
- trace_hardirqs_on_prepare();
- lockdep_hardirqs_on_prepare(CALLER_ADDR0);
- }
- instrumentation_end();
-
- rcu_nmi_exit();
- lockdep_hardirq_exit();
- if (restore)
- lockdep_hardirqs_on(CALLER_ADDR0);
- __nmi_exit();
-}
-
#ifdef CONFIG_XEN_PV
#ifndef CONFIG_PREEMPTION
/*
diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentry.h
index b2442eb0ac2f..247a60a47331 100644
--- a/arch/x86/include/asm/idtentry.h
+++ b/arch/x86/include/asm/idtentry.h
@@ -11,9 +11,6 @@

#include <asm/irq_stack.h>

-bool idtentry_enter_nmi(struct pt_regs *regs);
-void idtentry_exit_nmi(struct pt_regs *regs, bool irq_state);
-
/**
* DECLARE_IDTENTRY - Declare functions for simple IDT entry points
* No error code pushed by hardware
diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index 51bf910b1e9d..403561a89c28 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -1981,7 +1981,7 @@ void (*machine_check_vector)(struct pt_regs *) = unexpected_machine_check;

static __always_inline void exc_machine_check_kernel(struct pt_regs *regs)
{
- bool irq_state;
+ irqentry_state_t irq_state;

WARN_ON_ONCE(user_mode(regs));

@@ -1993,7 +1993,7 @@ static __always_inline void exc_machine_check_kernel(struct pt_regs *regs)
mce_check_crashing_cpu())
return;

- irq_state = idtentry_enter_nmi(regs);
+ irq_state = irqentry_nmi_enter(regs);
/*
* The call targets are marked noinstr, but objtool can't figure
* that out because it's an indirect call. Annotate it.
@@ -2004,7 +2004,7 @@ static __always_inline void exc_machine_check_kernel(struct pt_regs *regs)
if (regs->flags & X86_EFLAGS_IF)
trace_hardirqs_on_prepare();
instrumentation_end();
- idtentry_exit_nmi(regs, irq_state);
+ irqentry_nmi_exit(regs, irq_state);
}

static __always_inline void exc_machine_check_user(struct pt_regs *regs)
diff --git a/arch/x86/kernel/nmi.c b/arch/x86/kernel/nmi.c
index 4bc77aaf1303..bf250a339655 100644
--- a/arch/x86/kernel/nmi.c
+++ b/arch/x86/kernel/nmi.c
@@ -475,7 +475,7 @@ static DEFINE_PER_CPU(unsigned long, nmi_dr7);

DEFINE_IDTENTRY_RAW(exc_nmi)
{
- bool irq_state;
+ irqentry_state_t irq_state;

/*
* Re-enable NMIs right here when running as an SEV-ES guest. This might
@@ -502,14 +502,14 @@ DEFINE_IDTENTRY_RAW(exc_nmi)

this_cpu_write(nmi_dr7, local_db_save());

- irq_state = idtentry_enter_nmi(regs);
+ irq_state = irqentry_nmi_enter(regs);

inc_irq_stat(__nmi_count);

if (!ignore_nmis)
default_do_nmi(regs);

- idtentry_exit_nmi(regs, irq_state);
+ irqentry_nmi_exit(regs, irq_state);

local_db_restore(this_cpu_read(nmi_dr7));

diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index e19df6cde35d..e1b78829d909 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -405,7 +405,7 @@ DEFINE_IDTENTRY_DF(exc_double_fault)
}
#endif

- idtentry_enter_nmi(regs);
+ irqentry_nmi_enter(regs);
instrumentation_begin();
notify_die(DIE_TRAP, str, regs, error_code, X86_TRAP_DF, SIGSEGV);

@@ -651,12 +651,13 @@ DEFINE_IDTENTRY_RAW(exc_int3)
instrumentation_end();
irqentry_exit_to_user_mode(regs);
} else {
- bool irq_state = idtentry_enter_nmi(regs);
+ irqentry_state_t irq_state = irqentry_nmi_enter(regs);
+
instrumentation_begin();
if (!do_int3(regs))
die("int3", regs, 0);
instrumentation_end();
- idtentry_exit_nmi(regs, irq_state);
+ irqentry_nmi_exit(regs, irq_state);
}
}

@@ -851,7 +852,7 @@ static __always_inline void exc_debug_kernel(struct pt_regs *regs,
* includes the entry stack is excluded for everything.
*/
unsigned long dr7 = local_db_save();
- bool irq_state = idtentry_enter_nmi(regs);
+ irqentry_state_t irq_state = irqentry_nmi_enter(regs);
instrumentation_begin();

/*
@@ -908,7 +909,7 @@ static __always_inline void exc_debug_kernel(struct pt_regs *regs,
regs->flags &= ~X86_EFLAGS_TF;
out:
instrumentation_end();
- idtentry_exit_nmi(regs, irq_state);
+ irqentry_nmi_exit(regs, irq_state);

local_db_restore(dr7);
}
@@ -926,7 +927,7 @@ static __always_inline void exc_debug_user(struct pt_regs *regs,

/*
* NB: We can't easily clear DR7 here because
- * idtentry_exit_to_usermode() can invoke ptrace, schedule, access
+ * irqentry_exit_to_usermode() can invoke ptrace, schedule, access
* user memory, etc. This means that a recursive #DB is possible. If
* this happens, that #DB will hit exc_debug_kernel() and clear DR7.
* Since we're not on the IST stack right now, everything will be
diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h
index 474f29638d2c..7dff07713a07 100644
--- a/include/linux/entry-common.h
+++ b/include/linux/entry-common.h
@@ -341,8 +341,26 @@ void irqentry_enter_from_user_mode(struct pt_regs *regs);
void irqentry_exit_to_user_mode(struct pt_regs *regs);

#ifndef irqentry_state
+/**
+ * struct irqentry_state - Opaque object for exception state storage
+ * @exit_rcu: Used exclusively in the irqentry_*() calls; signals whether the
+ * exit path has to invoke rcu_irq_exit().
+ * @lockdep: Used exclusively in the irqentry_nmi_*() calls; ensures that
+ * lockdep state is restored correctly on exit from nmi.
+ *
+ * This opaque object is filled in by the irqentry_*_enter() functions and
+ * must be passed back into the corresponding irqentry_*_exit() functions
+ * when the exception is complete.
+ *
+ * Callers of irqentry_*_[enter|exit]() must consider this structure opaque
+ * and all members private. Descriptions of the members are provided to aid in
+ * the maintenance of the irqentry_*() functions.
+ */
typedef struct irqentry_state {
- bool exit_rcu;
+ union {
+ bool exit_rcu;
+ bool lockdep;
+ };
} irqentry_state_t;
#endif

@@ -402,4 +420,23 @@ void irqentry_exit_cond_resched(void);
*/
void noinstr irqentry_exit(struct pt_regs *regs, irqentry_state_t state);

+/**
+ * irqentry_nmi_enter - Handle NMI entry
+ * @regs: Pointer to current's pt_regs
+ *
+ * Similar to irqentry_enter() but taking care of the NMI constraints.
+ */
+irqentry_state_t noinstr irqentry_nmi_enter(struct pt_regs *regs);
+
+/**
+ * irqentry_nmi_exit - Handle return from NMI handling
+ * @regs: Pointer to pt_regs (NMI entry regs)
+ * @irq_state: Return value from matching call to irqentry_nmi_enter()
+ *
+ * Last action before returning to the low level assembly code.
+ *
+ * Counterpart to irqentry_nmi_enter().
+ */
+void noinstr irqentry_nmi_exit(struct pt_regs *regs, irqentry_state_t irq_state);
+
#endif
diff --git a/kernel/entry/common.c b/kernel/entry/common.c
index acb6bec04dc7..372f14cc347f 100644
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -397,3 +397,39 @@ noinstr void irqentry_exit(struct pt_regs *regs, irqentry_state_t state)
rcu_irq_exit();
}
}
+
+irqentry_state_t noinstr irqentry_nmi_enter(struct pt_regs *regs)
+{
+ irqentry_state_t irq_state;
+
+ irq_state.lockdep = lockdep_hardirqs_enabled();
+
+ __nmi_enter();
+ lockdep_hardirqs_off(CALLER_ADDR0);
+ lockdep_hardirq_enter();
+ rcu_nmi_enter();
+
+ instrumentation_begin();
+ trace_hardirqs_off_finish();
+ ftrace_nmi_enter();
+ instrumentation_end();
+
+ return irq_state;
+}
+
+void noinstr irqentry_nmi_exit(struct pt_regs *regs, irqentry_state_t irq_state)
+{
+ instrumentation_begin();
+ ftrace_nmi_exit();
+ if (irq_state.lockdep) {
+ trace_hardirqs_on_prepare();
+ lockdep_hardirqs_on_prepare(CALLER_ADDR0);
+ }
+ instrumentation_end();
+
+ rcu_nmi_exit();
+ lockdep_hardirq_exit();
+ if (irq_state.lockdep)
+ lockdep_hardirqs_on(CALLER_ADDR0);
+ __nmi_exit();
+}
--
2.28.0.rc0.12.gb6a658bd00c9

2020-11-02 20:56:31

by Ira Weiny

[permalink] [raw]
Subject: [PATCH V2 07/10] x86/entry: Pass irqentry_state_t by reference

From: Ira Weiny <[email protected]>

Currently, struct irqentry_state_t contains only a single bool value,
which makes passing it by value reasonable. However, future patches
propose to add information to this struct, for example the PKRS
register/thread state.

Adding information to irqentry_state_t makes passing by value less
efficient. Therefore, change the entry/exit calls to pass irq_state by
reference.

While at it, make the code easier to follow by changing all the usage
sites to consistently use the variable name 'irq_state'.
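
A minimal sketch of the calling convention change (illustrative only;
the function names match the series, but this snippet is not part of
the diff below):

	/* Before: the callee returns the state by value */
	irqentry_state_t irq_state = irqentry_enter(regs);
	/* ... handler body ... */
	irqentry_exit(regs, irq_state);

	/* After: the caller provides storage and passes a pointer */
	irqentry_state_t irq_state;

	irqentry_enter(regs, &irq_state);
	/* ... handler body ... */
	irqentry_exit(regs, &irq_state);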

Signed-off-by: Ira Weiny <[email protected]>

---
Changes from V1
From Thomas: Update commit message
Further clean up Kernel doc and comments
Missed some 'return' comments which are no longer valid

Changes from RFC V3
Clean up @irq_state comments
Standardize on 'irq_state' for the state variable name
Refactor based on new patch from Thomas Gleixner
Also addresses Peter Zijlstra's comment
---
arch/x86/entry/common.c | 8 ++++----
arch/x86/include/asm/idtentry.h | 25 ++++++++++++++----------
arch/x86/kernel/cpu/mce/core.c | 4 ++--
arch/x86/kernel/kvm.c | 6 +++---
arch/x86/kernel/nmi.c | 4 ++--
arch/x86/kernel/traps.c | 21 ++++++++++++--------
arch/x86/mm/fault.c | 6 +++---
include/linux/entry-common.h | 18 +++++++++--------
kernel/entry/common.c | 34 +++++++++++++--------------------
9 files changed, 65 insertions(+), 61 deletions(-)

diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index 18d8f17f755c..87dea56a15d2 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -259,9 +259,9 @@ __visible noinstr void xen_pv_evtchn_do_upcall(struct pt_regs *regs)
{
struct pt_regs *old_regs;
bool inhcall;
- irqentry_state_t state;
+ irqentry_state_t irq_state;

- state = irqentry_enter(regs);
+ irqentry_enter(regs, &irq_state);
old_regs = set_irq_regs(regs);

instrumentation_begin();
@@ -271,13 +271,13 @@ __visible noinstr void xen_pv_evtchn_do_upcall(struct pt_regs *regs)
set_irq_regs(old_regs);

inhcall = get_and_clear_inhcall();
- if (inhcall && !WARN_ON_ONCE(state.exit_rcu)) {
+ if (inhcall && !WARN_ON_ONCE(irq_state.exit_rcu)) {
instrumentation_begin();
irqentry_exit_cond_resched();
instrumentation_end();
restore_inhcall(inhcall);
} else {
- irqentry_exit(regs, state);
+ irqentry_exit(regs, &irq_state);
}
}
#endif /* CONFIG_XEN_PV */
diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentry.h
index 247a60a47331..282d2413b6a1 100644
--- a/arch/x86/include/asm/idtentry.h
+++ b/arch/x86/include/asm/idtentry.h
@@ -49,12 +49,13 @@ static __always_inline void __##func(struct pt_regs *regs); \
\
__visible noinstr void func(struct pt_regs *regs) \
{ \
- irqentry_state_t state = irqentry_enter(regs); \
+ irqentry_state_t irq_state; \
\
+ irqentry_enter(regs, &irq_state); \
instrumentation_begin(); \
__##func (regs); \
instrumentation_end(); \
- irqentry_exit(regs, state); \
+ irqentry_exit(regs, &irq_state); \
} \
\
static __always_inline void __##func(struct pt_regs *regs)
@@ -96,12 +97,13 @@ static __always_inline void __##func(struct pt_regs *regs, \
__visible noinstr void func(struct pt_regs *regs, \
unsigned long error_code) \
{ \
- irqentry_state_t state = irqentry_enter(regs); \
+ irqentry_state_t irq_state; \
\
+ irqentry_enter(regs, &irq_state); \
instrumentation_begin(); \
__##func (regs, error_code); \
instrumentation_end(); \
- irqentry_exit(regs, state); \
+ irqentry_exit(regs, &irq_state); \
} \
\
static __always_inline void __##func(struct pt_regs *regs, \
@@ -192,15 +194,16 @@ static __always_inline void __##func(struct pt_regs *regs, u8 vector); \
__visible noinstr void func(struct pt_regs *regs, \
unsigned long error_code) \
{ \
- irqentry_state_t state = irqentry_enter(regs); \
+ irqentry_state_t irq_state; \
\
+ irqentry_enter(regs, &irq_state); \
instrumentation_begin(); \
irq_enter_rcu(); \
kvm_set_cpu_l1tf_flush_l1d(); \
__##func (regs, (u8)error_code); \
irq_exit_rcu(); \
instrumentation_end(); \
- irqentry_exit(regs, state); \
+ irqentry_exit(regs, &irq_state); \
} \
\
static __always_inline void __##func(struct pt_regs *regs, u8 vector)
@@ -234,15 +237,16 @@ static void __##func(struct pt_regs *regs); \
\
__visible noinstr void func(struct pt_regs *regs) \
{ \
- irqentry_state_t state = irqentry_enter(regs); \
+ irqentry_state_t irq_state; \
\
+ irqentry_enter(regs, &irq_state); \
instrumentation_begin(); \
irq_enter_rcu(); \
kvm_set_cpu_l1tf_flush_l1d(); \
run_sysvec_on_irqstack_cond(__##func, regs); \
irq_exit_rcu(); \
instrumentation_end(); \
- irqentry_exit(regs, state); \
+ irqentry_exit(regs, &irq_state); \
} \
\
static noinline void __##func(struct pt_regs *regs)
@@ -263,15 +267,16 @@ static __always_inline void __##func(struct pt_regs *regs); \
\
__visible noinstr void func(struct pt_regs *regs) \
{ \
- irqentry_state_t state = irqentry_enter(regs); \
+ irqentry_state_t irq_state; \
\
+ irqentry_enter(regs, &irq_state); \
instrumentation_begin(); \
__irq_enter_raw(); \
kvm_set_cpu_l1tf_flush_l1d(); \
__##func (regs); \
__irq_exit_raw(); \
instrumentation_end(); \
- irqentry_exit(regs, state); \
+ irqentry_exit(regs, &irq_state); \
} \
\
static __always_inline void __##func(struct pt_regs *regs)
diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index 403561a89c28..169168e535f2 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -1993,7 +1993,7 @@ static __always_inline void exc_machine_check_kernel(struct pt_regs *regs)
mce_check_crashing_cpu())
return;

- irq_state = irqentry_nmi_enter(regs);
+ irqentry_nmi_enter(regs, &irq_state);
/*
* The call targets are marked noinstr, but objtool can't figure
* that out because it's an indirect call. Annotate it.
@@ -2004,7 +2004,7 @@ static __always_inline void exc_machine_check_kernel(struct pt_regs *regs)
if (regs->flags & X86_EFLAGS_IF)
trace_hardirqs_on_prepare();
instrumentation_end();
- irqentry_nmi_exit(regs, irq_state);
+ irqentry_nmi_exit(regs, &irq_state);
}

static __always_inline void exc_machine_check_user(struct pt_regs *regs)
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 7f57ede3cb8e..ed7427c6e74d 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -238,12 +238,12 @@ EXPORT_SYMBOL_GPL(kvm_read_and_reset_apf_flags);
noinstr bool __kvm_handle_async_pf(struct pt_regs *regs, u32 token)
{
u32 flags = kvm_read_and_reset_apf_flags();
- irqentry_state_t state;
+ irqentry_state_t irq_state;

if (!flags)
return false;

- state = irqentry_enter(regs);
+ irqentry_enter(regs, &irq_state);
instrumentation_begin();

/*
@@ -264,7 +264,7 @@ noinstr bool __kvm_handle_async_pf(struct pt_regs *regs, u32 token)
}

instrumentation_end();
- irqentry_exit(regs, state);
+ irqentry_exit(regs, &irq_state);
return true;
}

diff --git a/arch/x86/kernel/nmi.c b/arch/x86/kernel/nmi.c
index bf250a339655..1fd7780e99dd 100644
--- a/arch/x86/kernel/nmi.c
+++ b/arch/x86/kernel/nmi.c
@@ -502,14 +502,14 @@ DEFINE_IDTENTRY_RAW(exc_nmi)

this_cpu_write(nmi_dr7, local_db_save());

- irq_state = irqentry_nmi_enter(regs);
+ irqentry_nmi_enter(regs, &irq_state);

inc_irq_stat(__nmi_count);

if (!ignore_nmis)
default_do_nmi(regs);

- irqentry_nmi_exit(regs, irq_state);
+ irqentry_nmi_exit(regs, &irq_state);

local_db_restore(this_cpu_read(nmi_dr7));

diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index e1b78829d909..8481cc373794 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -245,7 +245,7 @@ static noinstr bool handle_bug(struct pt_regs *regs)

DEFINE_IDTENTRY_RAW(exc_invalid_op)
{
- irqentry_state_t state;
+ irqentry_state_t irq_state;

/*
* We use UD2 as a short encoding for 'CALL __WARN', as such
@@ -255,11 +255,11 @@ DEFINE_IDTENTRY_RAW(exc_invalid_op)
if (!user_mode(regs) && handle_bug(regs))
return;

- state = irqentry_enter(regs);
+ irqentry_enter(regs, &irq_state);
instrumentation_begin();
handle_invalid_op(regs);
instrumentation_end();
- irqentry_exit(regs, state);
+ irqentry_exit(regs, &irq_state);
}

DEFINE_IDTENTRY(exc_coproc_segment_overrun)
@@ -343,6 +343,7 @@ __visible void __noreturn handle_stack_overflow(const char *message,
*/
DEFINE_IDTENTRY_DF(exc_double_fault)
{
+ irqentry_state_t irq_state;
static const char str[] = "double fault";
struct task_struct *tsk = current;

@@ -405,7 +406,7 @@ DEFINE_IDTENTRY_DF(exc_double_fault)
}
#endif

- irqentry_nmi_enter(regs);
+ irqentry_nmi_enter(regs, &irq_state);
instrumentation_begin();
notify_die(DIE_TRAP, str, regs, error_code, X86_TRAP_DF, SIGSEGV);

@@ -651,13 +652,15 @@ DEFINE_IDTENTRY_RAW(exc_int3)
instrumentation_end();
irqentry_exit_to_user_mode(regs);
} else {
- irqentry_state_t irq_state = irqentry_nmi_enter(regs);
+ irqentry_state_t irq_state;
+
+ irqentry_nmi_enter(regs, &irq_state);

instrumentation_begin();
if (!do_int3(regs))
die("int3", regs, 0);
instrumentation_end();
- irqentry_nmi_exit(regs, irq_state);
+ irqentry_nmi_exit(regs, &irq_state);
}
}

@@ -852,7 +855,9 @@ static __always_inline void exc_debug_kernel(struct pt_regs *regs,
* includes the entry stack is excluded for everything.
*/
unsigned long dr7 = local_db_save();
- irqentry_state_t irq_state = irqentry_nmi_enter(regs);
+ irqentry_state_t irq_state;
+
+ irqentry_nmi_enter(regs, &irq_state);
instrumentation_begin();

/*
@@ -909,7 +914,7 @@ static __always_inline void exc_debug_kernel(struct pt_regs *regs,
regs->flags &= ~X86_EFLAGS_TF;
out:
instrumentation_end();
- irqentry_nmi_exit(regs, irq_state);
+ irqentry_nmi_exit(regs, &irq_state);

local_db_restore(dr7);
}
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 82bf37a5c9ec..8d20c4c13abf 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -1441,7 +1441,7 @@ handle_page_fault(struct pt_regs *regs, unsigned long error_code,
DEFINE_IDTENTRY_RAW_ERRORCODE(exc_page_fault)
{
unsigned long address = read_cr2();
- irqentry_state_t state;
+ irqentry_state_t irq_state;

prefetchw(&current->mm->mmap_lock);

@@ -1479,11 +1479,11 @@ DEFINE_IDTENTRY_RAW_ERRORCODE(exc_page_fault)
* code reenabled RCU to avoid subsequent wreckage which helps
* debuggability.
*/
- state = irqentry_enter(regs);
+ irqentry_enter(regs, &irq_state);

instrumentation_begin();
handle_page_fault(regs, error_code, address);
instrumentation_end();

- irqentry_exit(regs, state);
+ irqentry_exit(regs, &irq_state);
}
diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h
index 7dff07713a07..0e10d0b85817 100644
--- a/include/linux/entry-common.h
+++ b/include/linux/entry-common.h
@@ -367,6 +367,8 @@ typedef struct irqentry_state {
/**
* irqentry_enter - Handle state tracking on ordinary interrupt entries
* @regs: Pointer to pt_regs of interrupted context
+ * @irq_state: Pointer to an opaque object to store state information; to be
+ * passed back to irqentry_exit()
*
* Invokes:
* - lockdep irqflag state tracking as low level ASM entry disabled
@@ -392,10 +394,8 @@ typedef struct irqentry_state {
* For user mode entries irqentry_enter_from_user_mode() is invoked to
* establish the proper context for NOHZ_FULL. Otherwise scheduling on exit
* would not be possible.
- *
- * Returns: An opaque object that must be passed to idtentry_exit()
*/
-irqentry_state_t noinstr irqentry_enter(struct pt_regs *regs);
+void noinstr irqentry_enter(struct pt_regs *regs, irqentry_state_t *irq_state);

/**
* irqentry_exit_cond_resched - Conditionally reschedule on return from interrupt
@@ -407,7 +407,7 @@ void irqentry_exit_cond_resched(void);
/**
* irqentry_exit - Handle return from exception that used irqentry_enter()
* @regs: Pointer to pt_regs (exception entry regs)
- * @state: Return value from matching call to irqentry_enter()
+ * @irq_state: Pointer to state information passed to irqentry_enter()
*
* Depending on the return target (kernel/user) this runs the necessary
* preemption and work checks if possible and required and returns to
@@ -418,25 +418,27 @@ void irqentry_exit_cond_resched(void);
*
* Counterpart to irqentry_enter().
*/
-void noinstr irqentry_exit(struct pt_regs *regs, irqentry_state_t state);
+void noinstr irqentry_exit(struct pt_regs *regs, irqentry_state_t *irq_state);

/**
* irqentry_nmi_enter - Handle NMI entry
* @regs: Pointer to current's pt_regs
+ * @irq_state: Pointer to an opaque object to store state information; to be
+ * passed back to irqentry_nmi_exit()
*
* Similar to irqentry_enter() but taking care of the NMI constraints.
*/
-irqentry_state_t noinstr irqentry_nmi_enter(struct pt_regs *regs);
+void noinstr irqentry_nmi_enter(struct pt_regs *regs, irqentry_state_t *irq_state);

/**
* irqentry_nmi_exit - Handle return from NMI handling
* @regs: Pointer to pt_regs (NMI entry regs)
- * @irq_state: Return value from matching call to irqentry_nmi_enter()
+ * @irq_state: Pointer to state information passed to irqentry_nmi_enter()
*
* Last action before returning to the low level assembly code.
*
* Counterpart to irqentry_nmi_enter().
*/
-void noinstr irqentry_nmi_exit(struct pt_regs *regs, irqentry_state_t irq_state);
+void noinstr irqentry_nmi_exit(struct pt_regs *regs, irqentry_state_t *irq_state);

#endif
diff --git a/kernel/entry/common.c b/kernel/entry/common.c
index 372f14cc347f..359688400c6b 100644
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -281,15 +281,13 @@ noinstr void irqentry_exit_to_user_mode(struct pt_regs *regs)
exit_to_user_mode();
}

-noinstr irqentry_state_t irqentry_enter(struct pt_regs *regs)
+noinstr void irqentry_enter(struct pt_regs *regs, irqentry_state_t *irq_state)
{
- irqentry_state_t ret = {
- .exit_rcu = false,
- };
+ irq_state->exit_rcu = false;

if (user_mode(regs)) {
irqentry_enter_from_user_mode(regs);
- return ret;
+ return;
}

/*
@@ -327,8 +325,8 @@ noinstr irqentry_state_t irqentry_enter(struct pt_regs *regs)
trace_hardirqs_off_finish();
instrumentation_end();

- ret.exit_rcu = true;
- return ret;
+ irq_state->exit_rcu = true;
+ return;
}

/*
@@ -342,8 +340,6 @@ noinstr irqentry_state_t irqentry_enter(struct pt_regs *regs)
/* Use the combo lockdep/tracing function */
trace_hardirqs_off();
instrumentation_end();
-
- return ret;
}

void irqentry_exit_cond_resched(void)
@@ -358,7 +354,7 @@ void irqentry_exit_cond_resched(void)
}
}

-noinstr void irqentry_exit(struct pt_regs *regs, irqentry_state_t state)
+noinstr void irqentry_exit(struct pt_regs *regs, irqentry_state_t *irq_state)
{
lockdep_assert_irqs_disabled();

@@ -371,7 +367,7 @@ noinstr void irqentry_exit(struct pt_regs *regs, irqentry_state_t state)
* carefully and needs the same ordering of lockdep/tracing
* and RCU as the return to user mode path.
*/
- if (state.exit_rcu) {
+ if (irq_state->exit_rcu) {
instrumentation_begin();
/* Tell the tracer that IRET will enable interrupts */
trace_hardirqs_on_prepare();
@@ -393,16 +389,14 @@ noinstr void irqentry_exit(struct pt_regs *regs, irqentry_state_t state)
* IRQ flags state is correct already. Just tell RCU if it
* was not watching on entry.
*/
- if (state.exit_rcu)
+ if (irq_state->exit_rcu)
rcu_irq_exit();
}
}

-irqentry_state_t noinstr irqentry_nmi_enter(struct pt_regs *regs)
+void noinstr irqentry_nmi_enter(struct pt_regs *regs, irqentry_state_t *irq_state)
{
- irqentry_state_t irq_state;
-
- irq_state.lockdep = lockdep_hardirqs_enabled();
+ irq_state->lockdep = lockdep_hardirqs_enabled();

__nmi_enter();
lockdep_hardirqs_off(CALLER_ADDR0);
@@ -413,15 +407,13 @@ irqentry_state_t noinstr irqentry_nmi_enter(struct pt_regs *regs)
trace_hardirqs_off_finish();
ftrace_nmi_enter();
instrumentation_end();
-
- return irq_state;
}

-void noinstr irqentry_nmi_exit(struct pt_regs *regs, irqentry_state_t irq_state)
+void noinstr irqentry_nmi_exit(struct pt_regs *regs, irqentry_state_t *irq_state)
{
instrumentation_begin();
ftrace_nmi_exit();
- if (irq_state.lockdep) {
+ if (irq_state->lockdep) {
trace_hardirqs_on_prepare();
lockdep_hardirqs_on_prepare(CALLER_ADDR0);
}
@@ -429,7 +421,7 @@ void noinstr irqentry_nmi_exit(struct pt_regs *regs, irqentry_state_t irq_state)

rcu_nmi_exit();
lockdep_hardirq_exit();
- if (irq_state.lockdep)
+ if (irq_state->lockdep)
lockdep_hardirqs_on(CALLER_ADDR0);
__nmi_exit();
}
--
2.28.0.rc0.12.gb6a658bd00c9

2020-11-02 20:56:56

by Ira Weiny

[permalink] [raw]
Subject: [PATCH V2 08/10] x86/entry: Preserve PKRS MSR across exceptions

From: Ira Weiny <[email protected]>

The PKRS MSR is not managed by XSAVE. It is preserved across context
switches, but that alone leaves exception handling code open to memory
accesses during exceptions.

Two possible places for preserving this state were considered,
irqentry_state_t or pt_regs.[1] pt_regs was much more complicated and
was potentially fraught with unintended consequences.[2]
irqentry_state_t is already used in the exception handling, is
straightforward, and easily tracks any number of nested states. It can
eventually be enhanced to store the reference counting required to
support PKS through kmap reentry.

Preserve the current task's PKRS value in irqentry_state_t on exception
entry and restore it on exception exit.

Each nested exception saves its own copy, allowing for any number of
levels of exception handling.
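
As an illustration (a sketch only, not part of the patch;
some_handler() is a hypothetical exception handler), the state lives on
the stack of each handler invocation, so nesting naturally stacks the
saved values:

	__visible noinstr void some_handler(struct pt_regs *regs)
	{
		/* Per-invocation state on this stack frame */
		irqentry_state_t irq_state;

		/* Saves the task's PKRS, then loads INIT_PKRS_VALUE */
		irqentry_nmi_enter(regs, &irq_state);
		/*
		 * A nested exception taken here repeats the same
		 * save/restore with its own on-stack irq_state.
		 */
		irqentry_nmi_exit(regs, &irq_state);	/* restore PKRS */
	}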

Peter and Thomas both suggested parts of the patch, IDT and NMI respectively.

[1] https://lore.kernel.org/lkml/CALCETrVe1i5JdyzD_BcctxQJn+ZE3T38EFPgjxN1F577M36g+w@mail.gmail.com/
[2] https://lore.kernel.org/lkml/[email protected]/#t

Cc: Dave Hansen <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Suggested-by: Peter Zijlstra <[email protected]>
Suggested-by: Thomas Gleixner <[email protected]>
Signed-off-by: Ira Weiny <[email protected]>

---
Changes from V1
remove redundant irq_state->pkrs
This value is only needed for the global tracking. So
it should be included in that patch and not in this one.

Changes from RFC V3
Standardize on 'irq_state' variable name
Per Dave Hansen
irq_save_pkrs() -> irq_save_set_pkrs()
Rebased based on clean up patch by Thomas Gleixner
This includes moving irq_[save_set|restore]_pkrs() to
the core as well.
---
arch/x86/entry/common.c | 38 +++++++++++++++++++++++++++++
arch/x86/include/asm/pkeys_common.h | 5 ++--
arch/x86/mm/pkeys.c | 2 +-
include/linux/entry-common.h | 13 ++++++++++
kernel/entry/common.c | 14 +++++++++--
5 files changed, 67 insertions(+), 5 deletions(-)

diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index 87dea56a15d2..1b6a419a6fac 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -19,6 +19,7 @@
#include <linux/nospec.h>
#include <linux/syscalls.h>
#include <linux/uaccess.h>
+#include <linux/pkeys.h>

#ifdef CONFIG_XEN_PV
#include <xen/xen-ops.h>
@@ -209,6 +210,41 @@ SYSCALL_DEFINE0(ni_syscall)
return -ENOSYS;
}

+#ifdef CONFIG_ARCH_HAS_SUPERVISOR_PKEYS
+/*
+ * PKRS is a per-logical-processor MSR which overlays additional protection for
+ * pages which have been mapped with a protection key.
+ *
+ * The register is not managed by XSAVE, so we have to maintain the MSR
+ * value in software during context switch and exception handling.
+ *
+ * Context switches save the MSR in the task struct, thus taking that value to
+ * other processors if necessary.
+ *
+ * To protect against exceptions having access to this memory we save the
+ * current running value and set the PKRS value for the duration of the
+ * exception, thus preventing exception handlers from having the elevated
+ * access of the interrupted task.
+ */
+noinstr void irq_save_set_pkrs(irqentry_state_t *irq_state, u32 val)
+{
+ if (!cpu_feature_enabled(X86_FEATURE_PKS))
+ return;
+
+ irq_state->thread_pkrs = current->thread.saved_pkrs;
+ write_pkrs(INIT_PKRS_VALUE);
+}
+
+noinstr void irq_restore_pkrs(irqentry_state_t *irq_state)
+{
+ if (!cpu_feature_enabled(X86_FEATURE_PKS))
+ return;
+
+ write_pkrs(irq_state->thread_pkrs);
+ current->thread.saved_pkrs = irq_state->thread_pkrs;
+}
+#endif /* CONFIG_ARCH_HAS_SUPERVISOR_PKEYS */
+
#ifdef CONFIG_XEN_PV
#ifndef CONFIG_PREEMPTION
/*
@@ -272,6 +308,8 @@ __visible noinstr void xen_pv_evtchn_do_upcall(struct pt_regs *regs)

inhcall = get_and_clear_inhcall();
if (inhcall && !WARN_ON_ONCE(irq_state.exit_rcu)) {
+ /* Normally done by irqentry_exit(); we must restore PKRS here */
+ irq_restore_pkrs(&irq_state);
instrumentation_begin();
irqentry_exit_cond_resched();
instrumentation_end();
diff --git a/arch/x86/include/asm/pkeys_common.h b/arch/x86/include/asm/pkeys_common.h
index cd492c23b28c..f921c58793f9 100644
--- a/arch/x86/include/asm/pkeys_common.h
+++ b/arch/x86/include/asm/pkeys_common.h
@@ -31,9 +31,10 @@
#define PKS_NUM_KEYS 16

#ifdef CONFIG_ARCH_HAS_SUPERVISOR_PKEYS
-void write_pkrs(u32 new_pkrs);
+DECLARE_PER_CPU(u32, pkrs_cache);
+noinstr void write_pkrs(u32 new_pkrs);
#else
-static inline void write_pkrs(u32 new_pkrs) { }
+static __always_inline void write_pkrs(u32 new_pkrs) { }
#endif

#endif /*_ASM_X86_PKEYS_INTERNAL_H */
diff --git a/arch/x86/mm/pkeys.c b/arch/x86/mm/pkeys.c
index 0dc77409957a..39ec034cf379 100644
--- a/arch/x86/mm/pkeys.c
+++ b/arch/x86/mm/pkeys.c
@@ -252,7 +252,7 @@ DEFINE_PER_CPU(u32, pkrs_cache);
* until all prior executions of WRPKRU have completed execution
* and updated the PKRU register.
*/
-void write_pkrs(u32 new_pkrs)
+noinstr void write_pkrs(u32 new_pkrs)
{
u32 *pkrs;

diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h
index 0e10d0b85817..235110f34a06 100644
--- a/include/linux/entry-common.h
+++ b/include/linux/entry-common.h
@@ -343,6 +343,8 @@ void irqentry_exit_to_user_mode(struct pt_regs *regs);
#ifndef irqentry_state
/**
* struct irqentry_state - Opaque object for exception state storage
+ * @thread_pkrs: The thread's supervisor pkey (PKRS) value to be restored when
+ * the exception is complete.
* @exit_rcu: Used exclusively in the irqentry_*() calls; signals whether the
* exit path has to invoke rcu_irq_exit().
* @lockdep: Used exclusively in the irqentry_nmi_*() calls; ensures that
@@ -357,6 +359,9 @@ void irqentry_exit_to_user_mode(struct pt_regs *regs);
* the maintenance of the irqentry_*() functions.
*/
typedef struct irqentry_state {
+#ifdef CONFIG_ARCH_HAS_SUPERVISOR_PKEYS
+ u32 thread_pkrs;
+#endif
union {
bool exit_rcu;
bool lockdep;
@@ -364,6 +369,14 @@ typedef struct irqentry_state {
} irqentry_state_t;
#endif

+#ifdef CONFIG_ARCH_HAS_SUPERVISOR_PKEYS
+noinstr void irq_save_set_pkrs(irqentry_state_t *irq_state, u32 val);
+noinstr void irq_restore_pkrs(irqentry_state_t *irq_state);
+#else
+static __always_inline void irq_save_set_pkrs(irqentry_state_t *irq_state, u32 val) { }
+static __always_inline void irq_restore_pkrs(irqentry_state_t *irq_state) { }
+#endif
+
/**
* irqentry_enter - Handle state tracking on ordinary interrupt entries
* @regs: Pointer to pt_regs of interrupted context
diff --git a/kernel/entry/common.c b/kernel/entry/common.c
index 359688400c6b..2340011a5a06 100644
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -326,7 +326,7 @@ noinstr void irqentry_enter(struct pt_regs *regs, irqentry_state_t *irq_state)
instrumentation_end();

irq_state->exit_rcu = true;
- return;
+ goto done;
}

/*
@@ -340,6 +340,9 @@ noinstr void irqentry_enter(struct pt_regs *regs, irqentry_state_t *irq_state)
/* Use the combo lockdep/tracing function */
trace_hardirqs_off();
instrumentation_end();
+
+done:
+ irq_save_set_pkrs(irq_state, INIT_PKRS_VALUE);
}

void irqentry_exit_cond_resched(void)
@@ -361,7 +364,12 @@ noinstr void irqentry_exit(struct pt_regs *regs, irqentry_state_t *irq_state)
/* Check whether this returns to user mode */
if (user_mode(regs)) {
irqentry_exit_to_user_mode(regs);
- } else if (!regs_irqs_disabled(regs)) {
+ return;
+ }
+
+ irq_restore_pkrs(irq_state);
+
+ if (!regs_irqs_disabled(regs)) {
/*
* If RCU was not watching on entry this needs to be done
* carefully and needs the same ordering of lockdep/tracing
@@ -407,10 +415,12 @@ void noinstr irqentry_nmi_enter(struct pt_regs *regs, irqentry_state_t *irq_stat
trace_hardirqs_off_finish();
ftrace_nmi_enter();
instrumentation_end();
+ irq_save_set_pkrs(irq_state, INIT_PKRS_VALUE);
}

void noinstr irqentry_nmi_exit(struct pt_regs *regs, irqentry_state_t *irq_state)
{
+ irq_restore_pkrs(irq_state);
instrumentation_begin();
ftrace_nmi_exit();
if (irq_state->lockdep) {
--
2.28.0.rc0.12.gb6a658bd00c9

2020-11-02 20:57:00

by Ira Weiny

[permalink] [raw]
Subject: [PATCH V2 09/10] x86/fault: Report the PKRS state on fault

From: Ira Weiny <[email protected]>

When only user space pkeys are enabled, a pkey fault within the kernel
was an unexpected condition which should never happen, so a WARN_ON in
the kernel fault handler would detect if it ever did. That is no longer
the case if PKS is enabled and supported.

Report a pkey fault with a normal splat and add the PKRS state to the
fault splat text. Note the PKRS MSR is reset during an exception;
therefore the saved PKRS value from before the beginning of the
exception is passed down.

If PKS is not enabled, or not active, maintain the WARN_ON_ONCE() from
before.

Because each fault has its own state, the PKRS information will be
correctly reported even if a fault itself faults.
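
A condensed sketch of the reporting flow (illustrative only; the call
chain is abbreviated):

	/*
	 * exc_page_fault()
	 *   irqentry_enter(regs, &irq_state)   saves thread PKRS and
	 *                                      resets the live MSR
	 *   handle_page_fault(regs, error_code, address, &irq_state)
	 *     do_kern_addr_fault() -> ... -> show_fault_oops()
	 *       pr_alert("PKRS: 0x%x\n", irq_state->thread_pkrs)
	 *         reports the saved, not the live, value
	 */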

Suggested-by: Andy Lutomirski <[email protected]>
Signed-off-by: Ira Weiny <[email protected]>

---
Changes from RFC V3
Update commit message
Per Dave Hansen
Don't print PKRS if !cpu_feature_enabled(X86_FEATURE_PKS)
Fix comment
Remove check on CONFIG_ARCH_HAS_SUPERVISOR_PKEYS in favor of
disabled-features.h
---
arch/x86/mm/fault.c | 58 ++++++++++++++++++++++++++-------------------
1 file changed, 33 insertions(+), 25 deletions(-)

diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 8d20c4c13abf..931603102010 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -504,7 +504,8 @@ static void show_ldttss(const struct desc_ptr *gdt, const char *name, u16 index)
}

static void
-show_fault_oops(struct pt_regs *regs, unsigned long error_code, unsigned long address)
+show_fault_oops(struct pt_regs *regs, unsigned long error_code, unsigned long address,
+ irqentry_state_t *irq_state)
{
if (!oops_may_print())
return;
@@ -548,6 +549,11 @@ show_fault_oops(struct pt_regs *regs, unsigned long error_code, unsigned long ad
(error_code & X86_PF_PK) ? "protection keys violation" :
"permissions violation");

+#ifdef CONFIG_ARCH_HAS_SUPERVISOR_PKEYS
+ if (cpu_feature_enabled(X86_FEATURE_PKS) && irq_state && (error_code & X86_PF_PK))
+ pr_alert("PKRS: 0x%x\n", irq_state->pkrs);
+#endif
+
if (!(error_code & X86_PF_USER) && user_mode(regs)) {
struct desc_ptr idt, gdt;
u16 ldtr, tr;
@@ -626,7 +632,8 @@ static void set_signal_archinfo(unsigned long address,

static noinline void
no_context(struct pt_regs *regs, unsigned long error_code,
- unsigned long address, int signal, int si_code)
+ unsigned long address, int signal, int si_code,
+ irqentry_state_t *irq_state)
{
struct task_struct *tsk = current;
unsigned long flags;
@@ -732,7 +739,7 @@ no_context(struct pt_regs *regs, unsigned long error_code,
*/
flags = oops_begin();

- show_fault_oops(regs, error_code, address);
+ show_fault_oops(regs, error_code, address, irq_state);

if (task_stack_end_corrupted(tsk))
printk(KERN_EMERG "Thread overran stack, or stack corrupted\n");
@@ -785,7 +792,8 @@ static bool is_vsyscall_vaddr(unsigned long vaddr)

static void
__bad_area_nosemaphore(struct pt_regs *regs, unsigned long error_code,
- unsigned long address, u32 pkey, int si_code)
+ unsigned long address, u32 pkey, int si_code,
+ irqentry_state_t *irq_state)
{
struct task_struct *tsk = current;

@@ -832,14 +840,14 @@ __bad_area_nosemaphore(struct pt_regs *regs, unsigned long error_code,
if (is_f00f_bug(regs, address))
return;

- no_context(regs, error_code, address, SIGSEGV, si_code);
+ no_context(regs, error_code, address, SIGSEGV, si_code, irq_state);
}

static noinline void
bad_area_nosemaphore(struct pt_regs *regs, unsigned long error_code,
- unsigned long address)
+ unsigned long address, irqentry_state_t *irq_state)
{
- __bad_area_nosemaphore(regs, error_code, address, 0, SEGV_MAPERR);
+ __bad_area_nosemaphore(regs, error_code, address, 0, SEGV_MAPERR, irq_state);
}

static void
@@ -853,7 +861,7 @@ __bad_area(struct pt_regs *regs, unsigned long error_code,
*/
mmap_read_unlock(mm);

- __bad_area_nosemaphore(regs, error_code, address, pkey, si_code);
+ __bad_area_nosemaphore(regs, error_code, address, pkey, si_code, NULL);
}

static noinline void
@@ -923,7 +931,7 @@ do_sigbus(struct pt_regs *regs, unsigned long error_code, unsigned long address,
{
/* Kernel mode? Handle exceptions or die: */
if (!(error_code & X86_PF_USER)) {
- no_context(regs, error_code, address, SIGBUS, BUS_ADRERR);
+ no_context(regs, error_code, address, SIGBUS, BUS_ADRERR, NULL);
return;
}

@@ -957,7 +965,7 @@ mm_fault_error(struct pt_regs *regs, unsigned long error_code,
unsigned long address, vm_fault_t fault)
{
if (fatal_signal_pending(current) && !(error_code & X86_PF_USER)) {
- no_context(regs, error_code, address, 0, 0);
+ no_context(regs, error_code, address, 0, 0, NULL);
return;
}

@@ -965,7 +973,7 @@ mm_fault_error(struct pt_regs *regs, unsigned long error_code,
/* Kernel mode? Handle exceptions or die: */
if (!(error_code & X86_PF_USER)) {
no_context(regs, error_code, address,
- SIGSEGV, SEGV_MAPERR);
+ SIGSEGV, SEGV_MAPERR, NULL);
return;
}

@@ -980,7 +988,7 @@ mm_fault_error(struct pt_regs *regs, unsigned long error_code,
VM_FAULT_HWPOISON_LARGE))
do_sigbus(regs, error_code, address, fault);
else if (fault & VM_FAULT_SIGSEGV)
- bad_area_nosemaphore(regs, error_code, address);
+ bad_area_nosemaphore(regs, error_code, address, NULL);
else
BUG();
}
@@ -1148,14 +1156,14 @@ bool fault_in_kernel_space(unsigned long address)
*/
static void
do_kern_addr_fault(struct pt_regs *regs, unsigned long hw_error_code,
- unsigned long address)
+ unsigned long address, irqentry_state_t *irq_state)
{
/*
- * Protection keys exceptions only happen on user pages. We
- * have no user pages in the kernel portion of the address
- * space, so do not expect them here.
+ * PF_PK is only expected on kernel addresses when supervisor pkeys are
+ * enabled.
*/
- WARN_ON_ONCE(hw_error_code & X86_PF_PK);
+ if (!cpu_feature_enabled(X86_FEATURE_PKS))
+ WARN_ON_ONCE(hw_error_code & X86_PF_PK);

#ifdef CONFIG_X86_32
/*
@@ -1204,7 +1212,7 @@ do_kern_addr_fault(struct pt_regs *regs, unsigned long hw_error_code,
* Don't take the mm semaphore here. If we fixup a prefetch
* fault we could otherwise deadlock:
*/
- bad_area_nosemaphore(regs, hw_error_code, address);
+ bad_area_nosemaphore(regs, hw_error_code, address, irq_state);
}
NOKPROBE_SYMBOL(do_kern_addr_fault);

@@ -1245,7 +1253,7 @@ void do_user_addr_fault(struct pt_regs *regs,
!(hw_error_code & X86_PF_USER) &&
!(regs->flags & X86_EFLAGS_AC)))
{
- bad_area_nosemaphore(regs, hw_error_code, address);
+ bad_area_nosemaphore(regs, hw_error_code, address, NULL);
return;
}

@@ -1254,7 +1262,7 @@ void do_user_addr_fault(struct pt_regs *regs,
* in a region with pagefaults disabled then we must not take the fault
*/
if (unlikely(faulthandler_disabled() || !mm)) {
- bad_area_nosemaphore(regs, hw_error_code, address);
+ bad_area_nosemaphore(regs, hw_error_code, address, NULL);
return;
}

@@ -1316,7 +1324,7 @@ void do_user_addr_fault(struct pt_regs *regs,
* Fault from code in kernel from
* which we do not expect faults.
*/
- bad_area_nosemaphore(regs, hw_error_code, address);
+ bad_area_nosemaphore(regs, hw_error_code, address, NULL);
return;
}
retry:
@@ -1375,7 +1383,7 @@ void do_user_addr_fault(struct pt_regs *regs,
if (fault_signal_pending(fault, regs)) {
if (!user_mode(regs))
no_context(regs, hw_error_code, address, SIGBUS,
- BUS_ADRERR);
+ BUS_ADRERR, NULL);
return;
}

@@ -1415,7 +1423,7 @@ trace_page_fault_entries(struct pt_regs *regs, unsigned long error_code,

static __always_inline void
handle_page_fault(struct pt_regs *regs, unsigned long error_code,
- unsigned long address)
+ unsigned long address, irqentry_state_t *irq_state)
{
trace_page_fault_entries(regs, error_code, address);

@@ -1424,7 +1432,7 @@ handle_page_fault(struct pt_regs *regs, unsigned long error_code,

/* Was the fault on kernel-controlled part of the address space? */
if (unlikely(fault_in_kernel_space(address))) {
- do_kern_addr_fault(regs, error_code, address);
+ do_kern_addr_fault(regs, error_code, address, irq_state);
} else {
do_user_addr_fault(regs, error_code, address);
/*
@@ -1482,7 +1490,7 @@ DEFINE_IDTENTRY_RAW_ERRORCODE(exc_page_fault)
irqentry_enter(regs, &irq_state);

instrumentation_begin();
- handle_page_fault(regs, error_code, address);
+ handle_page_fault(regs, error_code, address, &irq_state);
instrumentation_end();

irqentry_exit(regs, &irq_state);
--
2.28.0.rc0.12.gb6a658bd00c9

2020-11-02 23:38:10

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH V2 00/10] PKS: Add Protection Keys Supervisor (PKS) support

On Mon, Nov 02 2020 at 12:53, ira weiny wrote:
> Fenghua Yu (2):
> x86/pks: Enable Protection Keys Supervisor (PKS)
> x86/pks: Add PKS kernel API
>
> Ira Weiny (7):
> x86/pkeys: Create pkeys_common.h
> x86/fpu: Refactor arch_set_user_pkey_access() for PKS support
> x86/pks: Preserve the PKRS MSR on context switch
> x86/entry: Pass irqentry_state_t by reference
> x86/entry: Preserve PKRS MSR across exceptions
> x86/fault: Report the PKRS state on fault
> x86/pks: Add PKS test code
>
> Thomas Gleixner (1):
> x86/entry: Move nmi entry/exit into common code

So the actual patch ordering is:

x86/pkeys: Create pkeys_common.h
x86/fpu: Refactor arch_set_user_pkey_access() for PKS support
x86/pks: Enable Protection Keys Supervisor (PKS)
x86/pks: Preserve the PKRS MSR on context switch
x86/pks: Add PKS kernel API

x86/entry: Move nmi entry/exit into common code
x86/entry: Pass irqentry_state_t by reference

x86/entry: Preserve PKRS MSR across exceptions
x86/fault: Report the PKRS state on fault
x86/pks: Add PKS test code

This is the wrong ordering, really.

x86/entry: Move nmi entry/exit into common code

is a general cleanup and has absolutely nothing to do with PKRS. So this
wants to go first.

Also:

x86/entry: Move nmi entry/exit into common code

is a prerequisite for the rest. So why is it in the middle of the
series?

And then you enable all that muck _before_ it is usable:

Patch 3/N: x86/pks: Enable Protection Keys Supervisor (PKS)

Bisectability is overrated, right?

Once again: Read and understand Documentation/process/*

Aside of that, using a spell checker is not optional.

Thanks,

tglx

2020-11-02 23:45:08

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH V2 00/10] PKS: Add Protection Keys Supervisor (PKS) support

On Tue, Nov 03 2020 at 00:36, Thomas Gleixner wrote:
> On Mon, Nov 02 2020 at 12:53, ira weiny wrote:
>
> This is the wrong ordering, really.
>
> x86/entry: Move nmi entry/exit into common code
>
> is a general cleanup and has absolutely nothing to do with PKRS. So this
> wants to go first.
>
> Also:
>
> x86/entry: Move nmi entry/exit into common code

this should be

x86/entry: Pass irqentry_state_t by reference

of course. Copy&pasta fail...

2020-11-04 20:32:52

by Ira Weiny

[permalink] [raw]
Subject: Re: [PATCH V2 00/10] PKS: Add Protection Keys Supervisor (PKS) support

On Tue, Nov 03, 2020 at 12:36:16AM +0100, Thomas Gleixner wrote:
> On Mon, Nov 02 2020 at 12:53, ira weiny wrote:
> > Fenghua Yu (2):
> > x86/pks: Enable Protection Keys Supervisor (PKS)
> > x86/pks: Add PKS kernel API
> >
> > Ira Weiny (7):
> > x86/pkeys: Create pkeys_common.h
> > x86/fpu: Refactor arch_set_user_pkey_access() for PKS support
> > x86/pks: Preserve the PKRS MSR on context switch
> > x86/entry: Pass irqentry_state_t by reference
> > x86/entry: Preserve PKRS MSR across exceptions
> > x86/fault: Report the PKRS state on fault
> > x86/pks: Add PKS test code
> >
> > Thomas Gleixner (1):
> > x86/entry: Move nmi entry/exit into common code
>
> So the actual patch ordering is:
>
> x86/pkeys: Create pkeys_common.h
> x86/fpu: Refactor arch_set_user_pkey_access() for PKS support
> x86/pks: Enable Protection Keys Supervisor (PKS)
> x86/pks: Preserve the PKRS MSR on context switch
> x86/pks: Add PKS kernel API
>
> x86/entry: Move nmi entry/exit into common code
> x86/entry: Pass irqentry_state_t by reference
>
> x86/entry: Preserve PKRS MSR across exceptions
> x86/fault: Report the PKRS state on fault
> x86/pks: Add PKS test code
>
> This is the wrong ordering, really.
>
> x86/entry: Move nmi entry/exit into common code
>
> is a general cleanup and has absolutely nothing to do with PKRS. So this
> wants to go first.
>

Sorry, yes this should be a pre-patch.

> Also:
>
> x86/entry: Move nmi entry/exit into common code
> [from other email]
> > x86/entry: Pass irqentry_state_t by reference
>
> is a prerequisite for the rest. So why is it in the middle of the
> series?

It is in the middle because passing by reference is not needed until additional
information is added to irqentry_state_t which is done immediately after this
patch by:

x86/entry: Preserve PKRS MSR across exceptions

I debated squashing the two, but it made review harder IMO. I thought
keeping them in order together made a lot of sense.

>
> And then you enable all that muck _before_ it is usable:
>

Strictly speaking you are correct, sorry. I will reorder the series.

>
> Bisectability is overrated, right?

Agreed, bisectability is important. I thought I had it covered but I was
wrong.

>
> Once again: Read and understand Documentation/process/*
>
> Aside of that, using a spell checker is not optional.

Agreed.

In looking closer at the entry code I've found a couple of other
instances, so I'll add another precursor patch.

I've also found other errors with the series which I should have
caught. My apologies; I made some last-minute changes which I should
have checked more thoroughly.

Thanks,
Ira

2020-11-04 22:04:53

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH V2 00/10] PKS: Add Protection Keys Supervisor (PKS) support

On Wed, Nov 04 2020 at 09:46, Ira Weiny wrote:
> On Tue, Nov 03, 2020 at 12:36:16AM +0100, Thomas Gleixner wrote:
>> This is the wrong ordering, really.
>>
>> x86/entry: Move nmi entry/exit into common code
>>
>> is a general cleanup and has absolutely nothing to do with PKRS. So this
>> wants to go first.
>
> Sorry, yes this should be a pre-patch.

I picked it out of the series and applied it to tip core/entry as I have
other stuff coming up in that area.

Thanks,

tglx

Subject: [tip: core/entry] x86/entry: Move nmi entry/exit into common code

The following commit has been merged into the core/entry branch of tip:

Commit-ID: b6be002bcd1dd1dedb926abf3c90c794eacb77dc
Gitweb: https://git.kernel.org/tip/b6be002bcd1dd1dedb926abf3c90c794eacb77dc
Author: Thomas Gleixner <[email protected]>
AuthorDate: Mon, 02 Nov 2020 12:53:16 -08:00
Committer: Thomas Gleixner <[email protected]>
CommitterDate: Wed, 04 Nov 2020 22:55:36 +01:00

x86/entry: Move nmi entry/exit into common code

Lockdep state handling on NMI enter and exit is nothing specific to X86. It's
not any different on other architectures. Also the extra state type is not
necessary; irqentry_state_t can carry the necessary information as well.

Move it to common code and extend irqentry_state_t to carry lockdep state.

[ Ira: Make exit_rcu and lockdep a union as they are mutually exclusive
between the IRQ and NMI exceptions, and add kernel documentation for
struct irqentry_state_t ]

Signed-off-by: Thomas Gleixner <[email protected]>
Signed-off-by: Ira Weiny <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
---
arch/x86/entry/common.c | 34 +----------------------------
arch/x86/include/asm/idtentry.h | 3 +--
arch/x86/kernel/cpu/mce/core.c | 6 ++---
arch/x86/kernel/nmi.c | 6 ++---
arch/x86/kernel/traps.c | 13 +++++------
include/linux/entry-common.h | 39 +++++++++++++++++++++++++++++++-
kernel/entry/common.c | 36 ++++++++++++++++++++++++++++++-
7 files changed, 87 insertions(+), 50 deletions(-)

diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index 870efee..18d8f17 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -209,40 +209,6 @@ SYSCALL_DEFINE0(ni_syscall)
return -ENOSYS;
}

-noinstr bool idtentry_enter_nmi(struct pt_regs *regs)
-{
- bool irq_state = lockdep_hardirqs_enabled();
-
- __nmi_enter();
- lockdep_hardirqs_off(CALLER_ADDR0);
- lockdep_hardirq_enter();
- rcu_nmi_enter();
-
- instrumentation_begin();
- trace_hardirqs_off_finish();
- ftrace_nmi_enter();
- instrumentation_end();
-
- return irq_state;
-}
-
-noinstr void idtentry_exit_nmi(struct pt_regs *regs, bool restore)
-{
- instrumentation_begin();
- ftrace_nmi_exit();
- if (restore) {
- trace_hardirqs_on_prepare();
- lockdep_hardirqs_on_prepare(CALLER_ADDR0);
- }
- instrumentation_end();
-
- rcu_nmi_exit();
- lockdep_hardirq_exit();
- if (restore)
- lockdep_hardirqs_on(CALLER_ADDR0);
- __nmi_exit();
-}
-
#ifdef CONFIG_XEN_PV
#ifndef CONFIG_PREEMPTION
/*
diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentry.h
index b2442eb..247a60a 100644
--- a/arch/x86/include/asm/idtentry.h
+++ b/arch/x86/include/asm/idtentry.h
@@ -11,9 +11,6 @@

#include <asm/irq_stack.h>

-bool idtentry_enter_nmi(struct pt_regs *regs);
-void idtentry_exit_nmi(struct pt_regs *regs, bool irq_state);
-
/**
* DECLARE_IDTENTRY - Declare functions for simple IDT entry points
* No error code pushed by hardware
diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index 4102b86..f5c860b 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -1983,7 +1983,7 @@ void (*machine_check_vector)(struct pt_regs *) = unexpected_machine_check;

static __always_inline void exc_machine_check_kernel(struct pt_regs *regs)
{
- bool irq_state;
+ irqentry_state_t irq_state;

WARN_ON_ONCE(user_mode(regs));

@@ -1995,7 +1995,7 @@ static __always_inline void exc_machine_check_kernel(struct pt_regs *regs)
mce_check_crashing_cpu())
return;

- irq_state = idtentry_enter_nmi(regs);
+ irq_state = irqentry_nmi_enter(regs);
/*
* The call targets are marked noinstr, but objtool can't figure
* that out because it's an indirect call. Annotate it.
@@ -2006,7 +2006,7 @@ static __always_inline void exc_machine_check_kernel(struct pt_regs *regs)
if (regs->flags & X86_EFLAGS_IF)
trace_hardirqs_on_prepare();
instrumentation_end();
- idtentry_exit_nmi(regs, irq_state);
+ irqentry_nmi_exit(regs, irq_state);
}

static __always_inline void exc_machine_check_user(struct pt_regs *regs)
diff --git a/arch/x86/kernel/nmi.c b/arch/x86/kernel/nmi.c
index 4bc77aa..bf250a3 100644
--- a/arch/x86/kernel/nmi.c
+++ b/arch/x86/kernel/nmi.c
@@ -475,7 +475,7 @@ static DEFINE_PER_CPU(unsigned long, nmi_dr7);

DEFINE_IDTENTRY_RAW(exc_nmi)
{
- bool irq_state;
+ irqentry_state_t irq_state;

/*
* Re-enable NMIs right here when running as an SEV-ES guest. This might
@@ -502,14 +502,14 @@ nmi_restart:

this_cpu_write(nmi_dr7, local_db_save());

- irq_state = idtentry_enter_nmi(regs);
+ irq_state = irqentry_nmi_enter(regs);

inc_irq_stat(__nmi_count);

if (!ignore_nmis)
default_do_nmi(regs);

- idtentry_exit_nmi(regs, irq_state);
+ irqentry_nmi_exit(regs, irq_state);

local_db_restore(this_cpu_read(nmi_dr7));

diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index e19df6c..e1b7882 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -405,7 +405,7 @@ DEFINE_IDTENTRY_DF(exc_double_fault)
}
#endif

- idtentry_enter_nmi(regs);
+ irqentry_nmi_enter(regs);
instrumentation_begin();
notify_die(DIE_TRAP, str, regs, error_code, X86_TRAP_DF, SIGSEGV);

@@ -651,12 +651,13 @@ DEFINE_IDTENTRY_RAW(exc_int3)
instrumentation_end();
irqentry_exit_to_user_mode(regs);
} else {
- bool irq_state = idtentry_enter_nmi(regs);
+ irqentry_state_t irq_state = irqentry_nmi_enter(regs);
+
instrumentation_begin();
if (!do_int3(regs))
die("int3", regs, 0);
instrumentation_end();
- idtentry_exit_nmi(regs, irq_state);
+ irqentry_nmi_exit(regs, irq_state);
}
}

@@ -851,7 +852,7 @@ static __always_inline void exc_debug_kernel(struct pt_regs *regs,
* includes the entry stack is excluded for everything.
*/
unsigned long dr7 = local_db_save();
- bool irq_state = idtentry_enter_nmi(regs);
+ irqentry_state_t irq_state = irqentry_nmi_enter(regs);
instrumentation_begin();

/*
@@ -908,7 +909,7 @@ static __always_inline void exc_debug_kernel(struct pt_regs *regs,
regs->flags &= ~X86_EFLAGS_TF;
out:
instrumentation_end();
- idtentry_exit_nmi(regs, irq_state);
+ irqentry_nmi_exit(regs, irq_state);

local_db_restore(dr7);
}
@@ -926,7 +927,7 @@ static __always_inline void exc_debug_user(struct pt_regs *regs,

/*
* NB: We can't easily clear DR7 here because
- * idtentry_exit_to_usermode() can invoke ptrace, schedule, access
+ * irqentry_exit_to_usermode() can invoke ptrace, schedule, access
* user memory, etc. This means that a recursive #DB is possible. If
* this happens, that #DB will hit exc_debug_kernel() and clear DR7.
* Since we're not on the IST stack right now, everything will be
diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h
index b9711e8..1a128ba 100644
--- a/include/linux/entry-common.h
+++ b/include/linux/entry-common.h
@@ -346,8 +346,26 @@ void irqentry_enter_from_user_mode(struct pt_regs *regs);
void irqentry_exit_to_user_mode(struct pt_regs *regs);

#ifndef irqentry_state
+/**
+ * struct irqentry_state - Opaque object for exception state storage
+ * @exit_rcu: Used exclusively in the irqentry_*() calls; signals whether the
+ * exit path has to invoke rcu_irq_exit().
+ * @lockdep: Used exclusively in the irqentry_nmi_*() calls; ensures that
+ * lockdep state is restored correctly on exit from NMI.
+ *
+ * This opaque object is filled in by the irqentry_*_enter() functions and
+ * must be passed back into the corresponding irqentry_*_exit() functions
+ * when the exception is complete.
+ *
+ * Callers of irqentry_*_[enter|exit]() must consider this structure opaque
+ * and all members private. Descriptions of the members are provided to aid in
+ * the maintenance of the irqentry_*() functions.
+ */
typedef struct irqentry_state {
- bool exit_rcu;
+ union {
+ bool exit_rcu;
+ bool lockdep;
+ };
} irqentry_state_t;
#endif

@@ -407,4 +425,23 @@ void irqentry_exit_cond_resched(void);
*/
void noinstr irqentry_exit(struct pt_regs *regs, irqentry_state_t state);

+/**
+ * irqentry_nmi_enter - Handle NMI entry
+ * @regs: Pointer to current's pt_regs
+ *
+ * Similar to irqentry_enter() but taking care of the NMI constraints.
+ */
+irqentry_state_t noinstr irqentry_nmi_enter(struct pt_regs *regs);
+
+/**
+ * irqentry_nmi_exit - Handle return from NMI handling
+ * @regs: Pointer to pt_regs (NMI entry regs)
+ * @irq_state: Return value from matching call to irqentry_nmi_enter()
+ *
+ * Last action before returning to the low level assembly code.
+ *
+ * Counterpart to irqentry_nmi_enter().
+ */
+void noinstr irqentry_nmi_exit(struct pt_regs *regs, irqentry_state_t irq_state);
+
#endif
diff --git a/kernel/entry/common.c b/kernel/entry/common.c
index 3a1dfec..bc75c11 100644
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -405,3 +405,39 @@ noinstr void irqentry_exit(struct pt_regs *regs, irqentry_state_t state)
rcu_irq_exit();
}
}
+
+irqentry_state_t noinstr irqentry_nmi_enter(struct pt_regs *regs)
+{
+ irqentry_state_t irq_state;
+
+ irq_state.lockdep = lockdep_hardirqs_enabled();
+
+ __nmi_enter();
+ lockdep_hardirqs_off(CALLER_ADDR0);
+ lockdep_hardirq_enter();
+ rcu_nmi_enter();
+
+ instrumentation_begin();
+ trace_hardirqs_off_finish();
+ ftrace_nmi_enter();
+ instrumentation_end();
+
+ return irq_state;
+}
+
+void noinstr irqentry_nmi_exit(struct pt_regs *regs, irqentry_state_t irq_state)
+{
+ instrumentation_begin();
+ ftrace_nmi_exit();
+ if (irq_state.lockdep) {
+ trace_hardirqs_on_prepare();
+ lockdep_hardirqs_on_prepare(CALLER_ADDR0);
+ }
+ instrumentation_end();
+
+ rcu_nmi_exit();
+ lockdep_hardirq_exit();
+ if (irq_state.lockdep)
+ lockdep_hardirqs_on(CALLER_ADDR0);
+ __nmi_exit();
+}

2020-11-04 22:57:24

by Ira Weiny

[permalink] [raw]
Subject: Re: [PATCH V2 00/10] PKS: Add Protection Keys Supervisor (PKS) support

On Wed, Nov 04, 2020 at 02:45:54PM -0800, 'Ira Weiny' wrote:
> On Wed, Nov 04, 2020 at 11:00:04PM +0100, Thomas Gleixner wrote:
> > On Wed, Nov 04 2020 at 09:46, Ira Weiny wrote:
> > > On Tue, Nov 03, 2020 at 12:36:16AM +0100, Thomas Gleixner wrote:
> > >> This is the wrong ordering, really.
> > >>
> > >> x86/entry: Move nmi entry/exit into common code
> > >>
> > >> is a general cleanup and has absolutely nothing to do with PKRS. So this
> > >> wants to go first.
> > >
> > > Sorry, yes this should be a pre-patch.
> >
> > I picked it out of the series and applied it to tip core/entry as I have
> > other stuff coming up in that area.
>
> Thanks! I'll rebase to that tree.
>
> I assume you fixed the spelling error? Sorry about that.

I'll fix it and send with the other spelling errors I found.

Ira

2020-11-05 02:31:32

by Ira Weiny

[permalink] [raw]
Subject: Re: [PATCH V2 00/10] PKS: Add Protection Keys Supervisor (PKS) support

On Wed, Nov 04, 2020 at 11:00:04PM +0100, Thomas Gleixner wrote:
> On Wed, Nov 04 2020 at 09:46, Ira Weiny wrote:
> > On Tue, Nov 03, 2020 at 12:36:16AM +0100, Thomas Gleixner wrote:
> >> This is the wrong ordering, really.
> >>
> >> x86/entry: Move nmi entry/exit into common code
> >>
> >> is a general cleanup and has absolutely nothing to do with PKRS. So this
> >> wants to go first.
> >
> > Sorry, yes this should be a pre-patch.
>
> I picked it out of the series and applied it to tip core/entry as I have
> other stuff coming up in that area.

Thanks! I'll rebase to that tree.

I assume you fixed the spelling error? Sorry about that.

Ira