2020-10-23 13:03:12

by Ira Weiny

[permalink] [raw]
Subject: [PATCH 00/10] PKS: Add Protection Keys Supervisor (PKS) support

From: Ira Weiny <[email protected]>

Changes from RFC V3[3]
Rebase to TIP master
Update test error output
Standardize on 'irq_state' for state variables
From Dave Hansen
Update commit messages
Add/clean up comments
Add X86_FEATURE_PKS to disabled-features.h and remove some
explicit CONFIG checks
Move saved_pkrs member of thread_struct
Remove superfluous preempt_disable()
s/irq_save_pks/irq_save_set_pks/
Ensure PKRS is not seen in faults if not configured or not
supported
s/pks_mknoaccess/pks_mk_noaccess/
s/pks_mkread/pks_mk_readonly/
s/pks_mkrdwr/pks_mk_readwrite/
Change pks_key_alloc return to -EOPNOTSUPP when not supported
From Peter Zijlstra
Clean up Attribution
Remove superfluous preempt_disable()
Add union to differentiate exit_rcu/lockdep use in
irqentry_state_t
From Thomas Gleixner
Add preliminary clean up patch and adjust series as needed


Introduce a new page protection mechanism for supervisor pages, Protection Key
Supervisor (PKS).

2 use cases for PKS are being developed, trusted keys and PMEM. Trusted keys
is a newer use case which is still being explored. PMEM was submitted as part
of the RFC (v2) series[1]. However, since then it was found that some callers
of kmap() require a global implementation of PKS. Specifically some users of
kmap() expect mappings to be available to all kernel threads. While global use
of PKS is rare it needs to be included for correctness. Unfortunately the
kmap() updates required a large patch series to make the needed changes at the
various kmap() call sites so that patch set has been split out. Because the
global PKS feature is only required for that use case it will be deferred to
that set as well.[2] This patch set is being submitted as a precursor to both
of the use cases.

For an overview of the entire PKS ecosystem, a git tree including this series
and 2 proposed use cases can be found here:

https://lore.kernel.org/lkml/[email protected]/
https://lore.kernel.org/lkml/[email protected]/


PKS enables protections on 'domains' of supervisor pages to limit supervisor
mode access to those pages beyond the normal paging protections. PKS works in
a similar fashion to user space pkeys, PKU. As with PKU, supervisor pkeys are
checked in addition to normal paging protections and Access or Writes can be
disabled via a MSR update without TLB flushes when permissions change. Also
like PKU, a page mapping is assigned to a domain by setting pkey bits in the
page table entry for that mapping.

Access is controlled through a PKRS register which is updated via WRMSR/RDMSR.

XSAVE is not supported for the PKRS MSR. Therefore the implementation
saves/restores the MSR across context switches and during exceptions. Nested
exceptions are supported by each exception getting a new PKS state.

For consistent behavior with current paging protections, pkey 0 is reserved and
configured to allow full access via the pkey mechanism, thus preserving the
default paging protections on mappings with the default pkey value of 0.

Other keys, (1-15) are allocated by an allocator which prepares us for key
contention from day one. Kernel users should be prepared for the allocator to
fail either because of key exhaustion or due to PKS not being supported on the
arch and/or CPU instance.

The following are key attributes of PKS.

1) Fast switching of permissions
1a) Prevents access without page table manipulations
1b) No TLB flushes required
2) Works on a per thread basis

PKS is available with 4 and 5 level paging. Like PKRU it consumes 4 bits from
the PTE to store the pkey within the entry.


[1] https://lore.kernel.org/lkml/[email protected]/
[2] https://lore.kernel.org/lkml/[email protected]/
[3] https://lore.kernel.org/lkml/[email protected]/


Fenghua Yu (2):
x86/pks: Enable Protection Keys Supervisor (PKS)
x86/pks: Add PKS kernel API

Ira Weiny (7):
x86/pkeys: Create pkeys_common.h
x86/fpu: Refactor arch_set_user_pkey_access() for PKS support
x86/pks: Preserve the PKRS MSR on context switch
x86/entry: Pass irqentry_state_t by reference
x86/entry: Preserve PKRS MSR across exceptions
x86/fault: Report the PKRS state on fault
x86/pks: Add PKS test code

Thomas Gleixner (1):
x86/entry: Move nmi entry/exit into common code

Documentation/core-api/protection-keys.rst | 102 ++-
arch/x86/Kconfig | 1 +
arch/x86/entry/common.c | 65 +-
arch/x86/include/asm/cpufeatures.h | 1 +
arch/x86/include/asm/disabled-features.h | 8 +-
arch/x86/include/asm/idtentry.h | 28 +-
arch/x86/include/asm/msr-index.h | 1 +
arch/x86/include/asm/pgtable.h | 13 +-
arch/x86/include/asm/pgtable_types.h | 12 +
arch/x86/include/asm/pkeys.h | 15 +
arch/x86/include/asm/pkeys_common.h | 40 ++
arch/x86/include/asm/processor.h | 14 +
arch/x86/include/uapi/asm/processor-flags.h | 2 +
arch/x86/kernel/cpu/common.c | 15 +
arch/x86/kernel/cpu/mce/core.c | 6 +-
arch/x86/kernel/fpu/xstate.c | 22 +-
arch/x86/kernel/kvm.c | 6 +-
arch/x86/kernel/nmi.c | 6 +-
arch/x86/kernel/process.c | 26 +
arch/x86/kernel/traps.c | 24 +-
arch/x86/mm/fault.c | 87 ++-
arch/x86/mm/pkeys.c | 191 +++++-
include/linux/entry-common.h | 46 +-
include/linux/pgtable.h | 4 +
include/linux/pkeys.h | 22 +
kernel/entry/common.c | 62 +-
lib/Kconfig.debug | 12 +
lib/Makefile | 3 +
lib/pks/Makefile | 3 +
lib/pks/pks_test.c | 691 ++++++++++++++++++++
mm/Kconfig | 2 +
tools/testing/selftests/x86/Makefile | 3 +-
tools/testing/selftests/x86/test_pks.c | 66 ++
33 files changed, 1441 insertions(+), 158 deletions(-)
create mode 100644 arch/x86/include/asm/pkeys_common.h
create mode 100644 lib/pks/Makefile
create mode 100644 lib/pks/pks_test.c
create mode 100644 tools/testing/selftests/x86/test_pks.c

--
2.28.0.rc0.12.gb6a658bd00c9


2020-10-23 13:10:38

by Ira Weiny

[permalink] [raw]
Subject: [PATCH 02/10] x86/fpu: Refactor arch_set_user_pkey_access() for PKS support

From: Ira Weiny <[email protected]>

Define a helper, update_pkey_val(), which will be used to support both
Protection Key User (PKU) and the new Protection Key for Supervisor
(PKS) in subsequent patches.

Co-developed-by: Peter Zijlstra <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Signed-off-by: Ira Weiny <[email protected]>

---
Changes from RFC V3:
Per Dave Hansen
Update and add comments per Dave's review
Per Peter
Correct attribution
---
arch/x86/include/asm/pkeys.h | 2 ++
arch/x86/kernel/fpu/xstate.c | 22 ++++------------------
arch/x86/mm/pkeys.c | 23 +++++++++++++++++++++++
3 files changed, 29 insertions(+), 18 deletions(-)

diff --git a/arch/x86/include/asm/pkeys.h b/arch/x86/include/asm/pkeys.h
index f9feba80894b..4526245b03e5 100644
--- a/arch/x86/include/asm/pkeys.h
+++ b/arch/x86/include/asm/pkeys.h
@@ -136,4 +136,6 @@ static inline int vma_pkey(struct vm_area_struct *vma)
return (vma->vm_flags & vma_pkey_mask) >> VM_PKEY_SHIFT;
}

+u32 update_pkey_val(u32 pk_reg, int pkey, unsigned int flags);
+
#endif /*_ASM_X86_PKEYS_H */
diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index a99afc70cc0a..a3bca3211eba 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -994,9 +994,7 @@ const void *get_xsave_field_ptr(int xfeature_nr)
int arch_set_user_pkey_access(struct task_struct *tsk, int pkey,
unsigned long init_val)
{
- u32 old_pkru;
- int pkey_shift = (pkey * PKR_BITS_PER_PKEY);
- u32 new_pkru_bits = 0;
+ u32 pkru;

/*
* This check implies XSAVE support. OSPKE only gets
@@ -1012,21 +1010,9 @@ int arch_set_user_pkey_access(struct task_struct *tsk, int pkey,
*/
WARN_ON_ONCE(pkey >= arch_max_pkey());

- /* Set the bits we need in PKRU: */
- if (init_val & PKEY_DISABLE_ACCESS)
- new_pkru_bits |= PKR_AD_BIT;
- if (init_val & PKEY_DISABLE_WRITE)
- new_pkru_bits |= PKR_WD_BIT;
-
- /* Shift the bits in to the correct place in PKRU for pkey: */
- new_pkru_bits <<= pkey_shift;
-
- /* Get old PKRU and mask off any old bits in place: */
- old_pkru = read_pkru();
- old_pkru &= ~((PKR_AD_BIT|PKR_WD_BIT) << pkey_shift);
-
- /* Write old part along with new part: */
- write_pkru(old_pkru | new_pkru_bits);
+ pkru = read_pkru();
+ pkru = update_pkey_val(pkru, pkey, init_val);
+ write_pkru(pkru);

return 0;
}
diff --git a/arch/x86/mm/pkeys.c b/arch/x86/mm/pkeys.c
index f5efb4007e74..d1dfe743e79f 100644
--- a/arch/x86/mm/pkeys.c
+++ b/arch/x86/mm/pkeys.c
@@ -208,3 +208,26 @@ static __init int setup_init_pkru(char *opt)
return 1;
}
__setup("init_pkru=", setup_init_pkru);
+
+/*
+ * Replace disable bits for @pkey with values from @flags
+ *
+ * Kernel users use the same flags as user space:
+ * PKEY_DISABLE_ACCESS
+ * PKEY_DISABLE_WRITE
+ */
+u32 update_pkey_val(u32 pk_reg, int pkey, unsigned int flags)
+{
+ int pkey_shift = pkey * PKR_BITS_PER_PKEY;
+
+ /* Mask out old bit values */
+ pk_reg &= ~(((1 << PKR_BITS_PER_PKEY) - 1) << pkey_shift);
+
+ /* Or in new values */
+ if (flags & PKEY_DISABLE_ACCESS)
+ pk_reg |= PKR_AD_BIT << pkey_shift;
+ if (flags & PKEY_DISABLE_WRITE)
+ pk_reg |= PKR_WD_BIT << pkey_shift;
+
+ return pk_reg;
+}
--
2.28.0.rc0.12.gb6a658bd00c9

2020-10-23 13:10:38

by Ira Weiny

[permalink] [raw]
Subject: [PATCH 08/10] x86/entry: Preserve PKRS MSR across exceptions

From: Ira Weiny <[email protected]>

The PKRS MSR is not managed by XSAVE. It is preserved through a context
switch but this support leaves exception handling code open to memory
accesses during exceptions.

2 possible places for preserving this state were considered,
irqentry_state_t or pt_regs.[1] pt_regs was much more complicated and
was potentially fraught with unintended consequences.[2]
irqentry_state_t was already an object being used in the exception
handling and is straightforward. It is also easy for any number of
nested states to be tracked and eventually can be enhanced to store the
reference counting required to support PKS through kmap reentry

Preserve the current task's PKRS values in irqentry_state_t on exception
entry and restoring them on exception exit.

Each nested exception is further saved allowing for any number of levels
of exception handling.

Peter and Thomas both suggested parts of the patch, IDT and NMI respectively.

[1] https://lore.kernel.org/lkml/CALCETrVe1i5JdyzD_BcctxQJn+ZE3T38EFPgjxN1F577M36g+w@mail.gmail.com/
[2] https://lore.kernel.org/lkml/[email protected]/#t

Cc: Dave Hansen <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Suggested-by: Peter Zijlstra <[email protected]>
Suggested-by: Thomas Gleixner <[email protected]>
Signed-off-by: Ira Weiny <[email protected]>

---
Changes from RFC V3
Standardize on 'irq_state' variable name
Per Dave Hansen
irq_save_pkrs() -> irq_save_set_pkrs()
Rebased based on clean up patch by Thomas Gleixner
This includes moving irq_[save_set|restore]_pkrs() to
the core as well.
---
arch/x86/entry/common.c | 39 +++++++++++++++++++++++++++++
arch/x86/include/asm/pkeys_common.h | 5 ++--
arch/x86/mm/pkeys.c | 2 +-
include/linux/entry-common.h | 12 +++++++++
kernel/entry/common.c | 14 +++++++++--
5 files changed, 67 insertions(+), 5 deletions(-)

diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index 87dea56a15d2..c047ea57bb8c 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -19,6 +19,7 @@
#include <linux/nospec.h>
#include <linux/syscalls.h>
#include <linux/uaccess.h>
+#include <linux/pkeys.h>

#ifdef CONFIG_XEN_PV
#include <xen/xen-ops.h>
@@ -209,6 +210,42 @@ SYSCALL_DEFINE0(ni_syscall)
return -ENOSYS;
}

+#ifdef CONFIG_ARCH_HAS_SUPERVISOR_PKEYS
+/*
+ * PKRS is a per-logical-processor MSR which overlays additional protection for
+ * pages which have been mapped with a protection key.
+ *
+ * The register is not maintained with XSAVE so we have to maintain the MSR
+ * value in software during context switch and exception handling.
+ *
+ * Context switches save the MSR in the task struct thus taking that value to
+ * other processors if necessary.
+ *
+ * To protect against exceptions having access to this memory we save the
+ * current running value and set the PKRS value for the duration of the
+ * exception. Thus preventing exception handlers from having the elevated
+ * access of the interrupted task.
+ */
+noinstr void irq_save_set_pkrs(irqentry_state_t *irq_state, u32 val)
+{
+ if (!cpu_feature_enabled(X86_FEATURE_PKS))
+ return;
+
+ irq_state->thread_pkrs = current->thread.saved_pkrs;
+ irq_state->pkrs = this_cpu_read(pkrs_cache);
+ write_pkrs(INIT_PKRS_VALUE);
+}
+
+noinstr void irq_restore_pkrs(irqentry_state_t *irq_state)
+{
+ if (!cpu_feature_enabled(X86_FEATURE_PKS))
+ return;
+
+ write_pkrs(irq_state->pkrs);
+ current->thread.saved_pkrs = irq_state->thread_pkrs;
+}
+#endif /* CONFIG_ARCH_HAS_SUPERVISOR_PKEYS */
+
#ifdef CONFIG_XEN_PV
#ifndef CONFIG_PREEMPTION
/*
@@ -272,6 +309,8 @@ __visible noinstr void xen_pv_evtchn_do_upcall(struct pt_regs *regs)

inhcall = get_and_clear_inhcall();
if (inhcall && !WARN_ON_ONCE(irq_state.exit_rcu)) {
+ /* Normally called by irqentry_exit, we must restore pkrs here */
+ irq_restore_pkrs(&irq_state);
instrumentation_begin();
irqentry_exit_cond_resched();
instrumentation_end();
diff --git a/arch/x86/include/asm/pkeys_common.h b/arch/x86/include/asm/pkeys_common.h
index cd492c23b28c..f921c58793f9 100644
--- a/arch/x86/include/asm/pkeys_common.h
+++ b/arch/x86/include/asm/pkeys_common.h
@@ -31,9 +31,10 @@
#define PKS_NUM_KEYS 16

#ifdef CONFIG_ARCH_HAS_SUPERVISOR_PKEYS
-void write_pkrs(u32 new_pkrs);
+DECLARE_PER_CPU(u32, pkrs_cache);
+noinstr void write_pkrs(u32 new_pkrs);
#else
-static inline void write_pkrs(u32 new_pkrs) { }
+static __always_inline void write_pkrs(u32 new_pkrs) { }
#endif

#endif /*_ASM_X86_PKEYS_INTERNAL_H */
diff --git a/arch/x86/mm/pkeys.c b/arch/x86/mm/pkeys.c
index fd5c4d34c3a5..a05f4edf2f80 100644
--- a/arch/x86/mm/pkeys.c
+++ b/arch/x86/mm/pkeys.c
@@ -252,7 +252,7 @@ DEFINE_PER_CPU(u32, pkrs_cache);
* until all prior executions of WRPKRU have completed execution
* and updated the PKRU register.
*/
-void write_pkrs(u32 new_pkrs)
+noinstr void write_pkrs(u32 new_pkrs)
{
u32 *pkrs;

diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h
index b79704af744f..3395931d55d0 100644
--- a/include/linux/entry-common.h
+++ b/include/linux/entry-common.h
@@ -342,6 +342,10 @@ void irqentry_exit_to_user_mode(struct pt_regs *regs);

#ifndef irqentry_state
typedef struct irqentry_state {
+#ifdef CONFIG_ARCH_HAS_SUPERVISOR_PKEYS
+ u32 pkrs;
+ u32 thread_pkrs;
+#endif
union {
bool exit_rcu;
bool lockdep;
@@ -349,6 +353,14 @@ typedef struct irqentry_state {
} irqentry_state_t;
#endif

+#ifdef CONFIG_ARCH_HAS_SUPERVISOR_PKEYS
+noinstr void irq_save_set_pkrs(irqentry_state_t *irq_state, u32 val);
+noinstr void irq_restore_pkrs(irqentry_state_t *irq_state);
+#else
+static __always_inline void irq_save_set_pkrs(irqentry_state_t *irq_state, u32 val) { }
+static __always_inline void irq_restore_pkrs(irqentry_state_t *irq_state) { }
+#endif
+
/**
* irqentry_enter - Handle state tracking on ordinary interrupt entries
* @regs: Pointer to pt_regs of interrupted context
diff --git a/kernel/entry/common.c b/kernel/entry/common.c
index e4f745b3a229..1d3df88e1657 100644
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -327,7 +327,7 @@ noinstr void irqentry_enter(struct pt_regs *regs, irqentry_state_t *irq_state)
instrumentation_end();

irq_state->exit_rcu = true;
- return;
+ goto done;
}

/*
@@ -341,6 +341,9 @@ noinstr void irqentry_enter(struct pt_regs *regs, irqentry_state_t *irq_state)
/* Use the combo lockdep/tracing function */
trace_hardirqs_off();
instrumentation_end();
+
+done:
+ irq_save_set_pkrs(irq_state, INIT_PKRS_VALUE);
}

void irqentry_exit_cond_resched(void)
@@ -362,7 +365,12 @@ noinstr void irqentry_exit(struct pt_regs *regs, irqentry_state_t *irq_state)
/* Check whether this returns to user mode */
if (user_mode(regs)) {
irqentry_exit_to_user_mode(regs);
- } else if (!regs_irqs_disabled(regs)) {
+ return;
+ }
+
+ irq_restore_pkrs(irq_state);
+
+ if (!regs_irqs_disabled(regs)) {
/*
* If RCU was not watching on entry this needs to be done
* carefully and needs the same ordering of lockdep/tracing
@@ -408,10 +416,12 @@ void noinstr irqentry_nmi_enter(struct pt_regs *regs, irqentry_state_t *irq_stat
trace_hardirqs_off_finish();
ftrace_nmi_enter();
instrumentation_end();
+ irq_save_set_pkrs(irq_state, INIT_PKRS_VALUE);
}

void noinstr irqentry_nmi_exit(struct pt_regs *regs, irqentry_state_t *irq_state)
{
+ irq_restore_pkrs(irq_state);
instrumentation_begin();
ftrace_nmi_exit();
if (irq_state->lockdep) {
--
2.28.0.rc0.12.gb6a658bd00c9

2020-10-23 13:11:50

by Ira Weiny

[permalink] [raw]
Subject: [PATCH 07/10] x86/entry: Pass irqentry_state_t by reference

From: Ira Weiny <[email protected]>

In preparation for adding PKS information to struct irqentry_state_t
change all call sites and usages to pass the struct by reference
instead of by value.

While we are editing the call sites it is a good time to standardize on
the name 'irq_state'.

Signed-off-by: Ira Weiny <[email protected]>

---
Changes from RFC V3
Clean up @irq_state comments
Standardize on 'irq_state' for the state variable name
Refactor based on new patch from Thomas Gleixner
Also addresses Peter Zijlstra's comment
---
arch/x86/entry/common.c | 8 ++++----
arch/x86/include/asm/idtentry.h | 25 ++++++++++++++----------
arch/x86/kernel/cpu/mce/core.c | 4 ++--
arch/x86/kernel/kvm.c | 6 +++---
arch/x86/kernel/nmi.c | 4 ++--
arch/x86/kernel/traps.c | 21 ++++++++++++--------
arch/x86/mm/fault.c | 6 +++---
include/linux/entry-common.h | 16 ++++++++++------
kernel/entry/common.c | 34 +++++++++++++--------------------
9 files changed, 65 insertions(+), 59 deletions(-)

diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index 18d8f17f755c..87dea56a15d2 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -259,9 +259,9 @@ __visible noinstr void xen_pv_evtchn_do_upcall(struct pt_regs *regs)
{
struct pt_regs *old_regs;
bool inhcall;
- irqentry_state_t state;
+ irqentry_state_t irq_state;

- state = irqentry_enter(regs);
+ irqentry_enter(regs, &irq_state);
old_regs = set_irq_regs(regs);

instrumentation_begin();
@@ -271,13 +271,13 @@ __visible noinstr void xen_pv_evtchn_do_upcall(struct pt_regs *regs)
set_irq_regs(old_regs);

inhcall = get_and_clear_inhcall();
- if (inhcall && !WARN_ON_ONCE(state.exit_rcu)) {
+ if (inhcall && !WARN_ON_ONCE(irq_state.exit_rcu)) {
instrumentation_begin();
irqentry_exit_cond_resched();
instrumentation_end();
restore_inhcall(inhcall);
} else {
- irqentry_exit(regs, state);
+ irqentry_exit(regs, &irq_state);
}
}
#endif /* CONFIG_XEN_PV */
diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentry.h
index 247a60a47331..282d2413b6a1 100644
--- a/arch/x86/include/asm/idtentry.h
+++ b/arch/x86/include/asm/idtentry.h
@@ -49,12 +49,13 @@ static __always_inline void __##func(struct pt_regs *regs); \
\
__visible noinstr void func(struct pt_regs *regs) \
{ \
- irqentry_state_t state = irqentry_enter(regs); \
+ irqentry_state_t irq_state; \
\
+ irqentry_enter(regs, &irq_state); \
instrumentation_begin(); \
__##func (regs); \
instrumentation_end(); \
- irqentry_exit(regs, state); \
+ irqentry_exit(regs, &irq_state); \
} \
\
static __always_inline void __##func(struct pt_regs *regs)
@@ -96,12 +97,13 @@ static __always_inline void __##func(struct pt_regs *regs, \
__visible noinstr void func(struct pt_regs *regs, \
unsigned long error_code) \
{ \
- irqentry_state_t state = irqentry_enter(regs); \
+ irqentry_state_t irq_state; \
\
+ irqentry_enter(regs, &irq_state); \
instrumentation_begin(); \
__##func (regs, error_code); \
instrumentation_end(); \
- irqentry_exit(regs, state); \
+ irqentry_exit(regs, &irq_state); \
} \
\
static __always_inline void __##func(struct pt_regs *regs, \
@@ -192,15 +194,16 @@ static __always_inline void __##func(struct pt_regs *regs, u8 vector); \
__visible noinstr void func(struct pt_regs *regs, \
unsigned long error_code) \
{ \
- irqentry_state_t state = irqentry_enter(regs); \
+ irqentry_state_t irq_state; \
\
+ irqentry_enter(regs, &irq_state); \
instrumentation_begin(); \
irq_enter_rcu(); \
kvm_set_cpu_l1tf_flush_l1d(); \
__##func (regs, (u8)error_code); \
irq_exit_rcu(); \
instrumentation_end(); \
- irqentry_exit(regs, state); \
+ irqentry_exit(regs, &irq_state); \
} \
\
static __always_inline void __##func(struct pt_regs *regs, u8 vector)
@@ -234,15 +237,16 @@ static void __##func(struct pt_regs *regs); \
\
__visible noinstr void func(struct pt_regs *regs) \
{ \
- irqentry_state_t state = irqentry_enter(regs); \
+ irqentry_state_t irq_state; \
\
+ irqentry_enter(regs, &irq_state); \
instrumentation_begin(); \
irq_enter_rcu(); \
kvm_set_cpu_l1tf_flush_l1d(); \
run_sysvec_on_irqstack_cond(__##func, regs); \
irq_exit_rcu(); \
instrumentation_end(); \
- irqentry_exit(regs, state); \
+ irqentry_exit(regs, &irq_state); \
} \
\
static noinline void __##func(struct pt_regs *regs)
@@ -263,15 +267,16 @@ static __always_inline void __##func(struct pt_regs *regs); \
\
__visible noinstr void func(struct pt_regs *regs) \
{ \
- irqentry_state_t state = irqentry_enter(regs); \
+ irqentry_state_t irq_state; \
\
+ irqentry_enter(regs, &irq_state); \
instrumentation_begin(); \
__irq_enter_raw(); \
kvm_set_cpu_l1tf_flush_l1d(); \
__##func (regs); \
__irq_exit_raw(); \
instrumentation_end(); \
- irqentry_exit(regs, state); \
+ irqentry_exit(regs, &irq_state); \
} \
\
static __always_inline void __##func(struct pt_regs *regs)
diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index eb3338c0bbc1..f90f2163e366 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -1995,7 +1995,7 @@ static __always_inline void exc_machine_check_kernel(struct pt_regs *regs)
mce_check_crashing_cpu())
return;

- irq_state = irqentry_nmi_enter(regs);
+ irqentry_nmi_enter(regs, &irq_state);
/*
* The call targets are marked noinstr, but objtool can't figure
* that out because it's an indirect call. Annotate it.
@@ -2006,7 +2006,7 @@ static __always_inline void exc_machine_check_kernel(struct pt_regs *regs)
if (regs->flags & X86_EFLAGS_IF)
trace_hardirqs_on_prepare();
instrumentation_end();
- irqentry_nmi_exit(regs, irq_state);
+ irqentry_nmi_exit(regs, &irq_state);
}

static __always_inline void exc_machine_check_user(struct pt_regs *regs)
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 1c0f2560a41c..78416af3d3a0 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -238,12 +238,12 @@ EXPORT_SYMBOL_GPL(kvm_read_and_reset_apf_flags);
noinstr bool __kvm_handle_async_pf(struct pt_regs *regs, u32 token)
{
u32 flags = kvm_read_and_reset_apf_flags();
- irqentry_state_t state;
+ irqentry_state_t irq_state;

if (!flags)
return false;

- state = irqentry_enter(regs);
+ irqentry_enter(regs, &irq_state);
instrumentation_begin();

/*
@@ -264,7 +264,7 @@ noinstr bool __kvm_handle_async_pf(struct pt_regs *regs, u32 token)
}

instrumentation_end();
- irqentry_exit(regs, state);
+ irqentry_exit(regs, &irq_state);
return true;
}

diff --git a/arch/x86/kernel/nmi.c b/arch/x86/kernel/nmi.c
index bf250a339655..1fd7780e99dd 100644
--- a/arch/x86/kernel/nmi.c
+++ b/arch/x86/kernel/nmi.c
@@ -502,14 +502,14 @@ DEFINE_IDTENTRY_RAW(exc_nmi)

this_cpu_write(nmi_dr7, local_db_save());

- irq_state = irqentry_nmi_enter(regs);
+ irqentry_nmi_enter(regs, &irq_state);

inc_irq_stat(__nmi_count);

if (!ignore_nmis)
default_do_nmi(regs);

- irqentry_nmi_exit(regs, irq_state);
+ irqentry_nmi_exit(regs, &irq_state);

local_db_restore(this_cpu_read(nmi_dr7));

diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index bffbbe29fc8c..5566a0a38f63 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -245,7 +245,7 @@ static noinstr bool handle_bug(struct pt_regs *regs)

DEFINE_IDTENTRY_RAW(exc_invalid_op)
{
- irqentry_state_t state;
+ irqentry_state_t irq_state;

/*
* We use UD2 as a short encoding for 'CALL __WARN', as such
@@ -255,11 +255,11 @@ DEFINE_IDTENTRY_RAW(exc_invalid_op)
if (!user_mode(regs) && handle_bug(regs))
return;

- state = irqentry_enter(regs);
+ irqentry_enter(regs, &irq_state);
instrumentation_begin();
handle_invalid_op(regs);
instrumentation_end();
- irqentry_exit(regs, state);
+ irqentry_exit(regs, &irq_state);
}

DEFINE_IDTENTRY(exc_coproc_segment_overrun)
@@ -343,6 +343,7 @@ __visible void __noreturn handle_stack_overflow(const char *message,
*/
DEFINE_IDTENTRY_DF(exc_double_fault)
{
+ irqentry_state_t irq_state;
static const char str[] = "double fault";
struct task_struct *tsk = current;

@@ -405,7 +406,7 @@ DEFINE_IDTENTRY_DF(exc_double_fault)
}
#endif

- irqentry_nmi_enter(regs);
+ irqentry_nmi_enter(regs, &irq_state);
instrumentation_begin();
notify_die(DIE_TRAP, str, regs, error_code, X86_TRAP_DF, SIGSEGV);

@@ -651,13 +652,15 @@ DEFINE_IDTENTRY_RAW(exc_int3)
instrumentation_end();
irqentry_exit_to_user_mode(regs);
} else {
- irqentry_state_t irq_state = irqentry_nmi_enter(regs);
+ irqentry_state_t irq_state;
+
+ irqentry_nmi_enter(regs, &irq_state);

instrumentation_begin();
if (!do_int3(regs))
die("int3", regs, 0);
instrumentation_end();
- irqentry_nmi_exit(regs, irq_state);
+ irqentry_nmi_exit(regs, &irq_state);
}
}

@@ -865,7 +868,9 @@ static __always_inline void exc_debug_kernel(struct pt_regs *regs,
* includes the entry stack is excluded for everything.
*/
unsigned long dr7 = local_db_save();
- irqentry_state_t irq_state = irqentry_nmi_enter(regs);
+ irqentry_state_t irq_state;
+
+ irqentry_nmi_enter(regs, &irq_state);
instrumentation_begin();

/*
@@ -908,7 +913,7 @@ static __always_inline void exc_debug_kernel(struct pt_regs *regs,
regs->flags &= ~X86_EFLAGS_TF;
out:
instrumentation_end();
- irqentry_nmi_exit(regs, irq_state);
+ irqentry_nmi_exit(regs, &irq_state);

local_db_restore(dr7);
}
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 42606a04ae85..5e3fd7763315 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -1441,7 +1441,7 @@ handle_page_fault(struct pt_regs *regs, unsigned long error_code,
DEFINE_IDTENTRY_RAW_ERRORCODE(exc_page_fault)
{
unsigned long address = read_cr2();
- irqentry_state_t state;
+ irqentry_state_t irq_state;

prefetchw(&current->mm->mmap_lock);

@@ -1476,11 +1476,11 @@ DEFINE_IDTENTRY_RAW_ERRORCODE(exc_page_fault)
* code reenabled RCU to avoid subsequent wreckage which helps
* debugability.
*/
- state = irqentry_enter(regs);
+ irqentry_enter(regs, &irq_state);

instrumentation_begin();
handle_page_fault(regs, error_code, address);
instrumentation_end();

- irqentry_exit(regs, state);
+ irqentry_exit(regs, &irq_state);
}
diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h
index 47f9a0658acf..b79704af744f 100644
--- a/include/linux/entry-common.h
+++ b/include/linux/entry-common.h
@@ -352,6 +352,8 @@ typedef struct irqentry_state {
/**
* irqentry_enter - Handle state tracking on ordinary interrupt entries
* @regs: Pointer to pt_regs of interrupted context
+ * @irq_state: Pointer to state information; to be passed back to
+ * irqentry_exit()
*
* Invokes:
* - lockdep irqflag state tracking as low level ASM entry disabled
@@ -380,7 +382,7 @@ typedef struct irqentry_state {
*
* Returns: An opaque object that must be passed to idtentry_exit()
*/
-irqentry_state_t noinstr irqentry_enter(struct pt_regs *regs);
+void noinstr irqentry_enter(struct pt_regs *regs, irqentry_state_t *irq_state);

/**
* irqentry_exit_cond_resched - Conditionally reschedule on return from interrupt
@@ -392,7 +394,7 @@ void irqentry_exit_cond_resched(void);
/**
* irqentry_exit - Handle return from exception that used irqentry_enter()
* @regs: Pointer to pt_regs (exception entry regs)
- * @state: Return value from matching call to irqentry_enter()
+ * @irq_state: Pointer to state information passed to irqentry_enter()
*
* Depending on the return target (kernel/user) this runs the necessary
* preemption and work checks if possible and reguired and returns to
@@ -403,25 +405,27 @@ void irqentry_exit_cond_resched(void);
*
* Counterpart to irqentry_enter().
*/
-void noinstr irqentry_exit(struct pt_regs *regs, irqentry_state_t state);
+void noinstr irqentry_exit(struct pt_regs *regs, irqentry_state_t *irq_state);

/**
* irqentry_nmi_enter - Handle NMI entry
* @regs: Pointer to currents pt_regs
+ * @irq_state: Pointer to state information; to be passed back to
+ * irqentry_nmi_exit()
*
* Similar to irqentry_enter() but taking care of the NMI constraints.
*/
-irqentry_state_t noinstr irqentry_nmi_enter(struct pt_regs *regs);
+void noinstr irqentry_nmi_enter(struct pt_regs *regs, irqentry_state_t *irq_state);

/**
* irqentry_nmi_exit - Handle return from NMI handling
* @regs: Pointer to pt_regs (NMI entry regs)
- * @irq_state: Return value from matching call to irqentry_nmi_enter()
+ * @irq_state: Pointer to state information passed to irqentry_nmi_enter()
*
* Last action before returning to the low level assmenbly code.
*
* Counterpart to irqentry_nmi_enter().
*/
-void noinstr irqentry_nmi_exit(struct pt_regs *regs, irqentry_state_t irq_state);
+void noinstr irqentry_nmi_exit(struct pt_regs *regs, irqentry_state_t *irq_state);

#endif
diff --git a/kernel/entry/common.c b/kernel/entry/common.c
index 5cc2e4174d7c..e4f745b3a229 100644
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -282,15 +282,13 @@ noinstr void irqentry_exit_to_user_mode(struct pt_regs *regs)
exit_to_user_mode();
}

-noinstr irqentry_state_t irqentry_enter(struct pt_regs *regs)
+noinstr void irqentry_enter(struct pt_regs *regs, irqentry_state_t *irq_state)
{
- irqentry_state_t ret = {
- .exit_rcu = false,
- };
+ irq_state->exit_rcu = false;

if (user_mode(regs)) {
irqentry_enter_from_user_mode(regs);
- return ret;
+ return;
}

/*
@@ -328,8 +326,8 @@ noinstr irqentry_state_t irqentry_enter(struct pt_regs *regs)
trace_hardirqs_off_finish();
instrumentation_end();

- ret.exit_rcu = true;
- return ret;
+ irq_state->exit_rcu = true;
+ return;
}

/*
@@ -343,8 +341,6 @@ noinstr irqentry_state_t irqentry_enter(struct pt_regs *regs)
/* Use the combo lockdep/tracing function */
trace_hardirqs_off();
instrumentation_end();
-
- return ret;
}

void irqentry_exit_cond_resched(void)
@@ -359,7 +355,7 @@ void irqentry_exit_cond_resched(void)
}
}

-noinstr void irqentry_exit(struct pt_regs *regs, irqentry_state_t state)
+noinstr void irqentry_exit(struct pt_regs *regs, irqentry_state_t *irq_state)
{
lockdep_assert_irqs_disabled();

@@ -372,7 +368,7 @@ noinstr void irqentry_exit(struct pt_regs *regs, irqentry_state_t state)
* carefully and needs the same ordering of lockdep/tracing
* and RCU as the return to user mode path.
*/
- if (state.exit_rcu) {
+ if (irq_state->exit_rcu) {
instrumentation_begin();
/* Tell the tracer that IRET will enable interrupts */
trace_hardirqs_on_prepare();
@@ -394,16 +390,14 @@ noinstr void irqentry_exit(struct pt_regs *regs, irqentry_state_t state)
* IRQ flags state is correct already. Just tell RCU if it
* was not watching on entry.
*/
- if (state.exit_rcu)
+ if (irq_state->exit_rcu)
rcu_irq_exit();
}
}

-irqentry_state_t noinstr irqentry_nmi_enter(struct pt_regs *regs)
+void noinstr irqentry_nmi_enter(struct pt_regs *regs, irqentry_state_t *irq_state)
{
- irqentry_state_t irq_state;
-
- irq_state.lockdep = lockdep_hardirqs_enabled();
+ irq_state->lockdep = lockdep_hardirqs_enabled();

__nmi_enter();
lockdep_hardirqs_off(CALLER_ADDR0);
@@ -414,15 +408,13 @@ irqentry_state_t noinstr irqentry_nmi_enter(struct pt_regs *regs)
trace_hardirqs_off_finish();
ftrace_nmi_enter();
instrumentation_end();
-
- return irq_state;
}

-void noinstr irqentry_nmi_exit(struct pt_regs *regs, irqentry_state_t irq_state)
+void noinstr irqentry_nmi_exit(struct pt_regs *regs, irqentry_state_t *irq_state)
{
instrumentation_begin();
ftrace_nmi_exit();
- if (irq_state.lockdep) {
+ if (irq_state->lockdep) {
trace_hardirqs_on_prepare();
lockdep_hardirqs_on_prepare(CALLER_ADDR0);
}
@@ -430,7 +422,7 @@ void noinstr irqentry_nmi_exit(struct pt_regs *regs, irqentry_state_t irq_state)

rcu_nmi_exit();
lockdep_hardirq_exit();
- if (irq_state.lockdep)
+ if (irq_state->lockdep)
lockdep_hardirqs_on(CALLER_ADDR0);
__nmi_exit();
}
--
2.28.0.rc0.12.gb6a658bd00c9

2020-10-24 00:41:57

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH 07/10] x86/entry: Pass irqentry_state_t by reference

On Thu, Oct 22 2020 at 15:26, ira weiny wrote:
> From: Ira Weiny <[email protected]>
>
> In preparation for adding PKS information to struct irqentry_state_t
> change all call sites and usages to pass the struct by reference
> instead of by value.

This still does not explain WHY you need to do that. 'Preparation' is a
pretty useless information.

What is the actual reason for this? Just because PKS information feels
better that way?

Also what is PKS information? Changelogs have to make sense on their own
and not only in the context of a larger series of changes.

> While we are editing the call sites it is a good time to standardize on
> the name 'irq_state'.

While at it change all usage sites to consistently use the variable
name 'irq_state'.

Or something like that. See Documentation/process/...

Thanks,

tglx

2020-10-27 14:12:24

by Ira Weiny

[permalink] [raw]
Subject: Re: [PATCH 07/10] x86/entry: Pass irqentry_state_t by reference

On Fri, Oct 23, 2020 at 11:56:33PM +0200, Thomas Gleixner wrote:
> On Thu, Oct 22 2020 at 15:26, ira weiny wrote:
> > From: Ira Weiny <[email protected]>
> >
> > In preparation for adding PKS information to struct irqentry_state_t
> > change all call sites and usages to pass the struct by reference
> > instead of by value.
>
> This still does not explain WHY you need to do that. 'Preparation' is a
> pretty useless information.
>
> What is the actual reason for this? Just because PKS information feels
> better that way?
>
> Also what is PKS information? Changelogs have to make sense on their own
> and not only in the context of a larger series of changes.

I've reworded this to explain the addition of new members which would make
passing by value less efficient with additional structure changes being added
later in the series.

>
> > While we are editing the call sites it is a good time to standardize on
> > the name 'irq_state'.
>
> While at it change all usage sites to consistently use the variable
> name 'irq_state'.
>
> Or something like that. See Documentation/process/...

Sorry, bad habit.

Fixed.

Thanks,
Ira

>
> Thanks,
>
> tglx
>