2022-03-10 23:52:38

by Ira Weiny

Subject: [PATCH V9 00/45] PKS/PMEM: Add Stray Write Protection

From: Ira Weiny <[email protected]>


I'm looking for Intel acks on the series prior to submitting to the
maintainers.  Most of the changes from V8 to V9 were in getting the tests
straightened out, but there are also some improvements in the actual code.


Changes for V9

    Review and update all commit messages
    Update cover letter below

    PKS Core
        Separate user and supervisor pkey code in the headers
            Create linux/pks.h for supervisor calls
            This facilitated making the pmem code more efficient
        Completely rearchitect the test code
            [After Dave Hansen and Rick Edgecombe found issues in the
            test code it was easier to rearchitect the code completely
            rather than attempt to fix it.]
            Remove pks_test_callback in favor of using fault hooks
                Fault hooks also keep the fault callbacks from seeing
                false positives when non-test consumers are running
            Add an additional PKS_TEST_RUN_ALL Kconfig option which is
            mutually exclusive with any non-test PKS consumer
                PKS_TEST_RUN_ALL takes over all pkey callbacks
            Ensure that each test runs within its own context and is
            mutually exclusive from running while any other test is
            running
            Ensure test session and context memory is cleaned up on
            file close
            Use pr_debug() and dynamic debug for in-kernel debug
            messages
        Enhance the test_pks selftest
            Add the ability to run all tests, not just the context
            switch test
            Standardize output: [PASS][FAIL][SKIP]
            Add a '-d' option which enables dynamic debug to see the
            kernel debug messages

    Incorporate feedback from Rick Edgecombe
        Update all pkey types to u8
        Fix up test code barriers
    Move the patch declaring PKS_INIT_VALUE ahead of the patch which
    enables PKS so that PKS_INIT_VALUE can be used when pks_setup() is
    first created
    From Dan Williams
        Use macros instead of an enum for a pkey allocation scheme
        which is predicated on the config options of consumers
            This almost worked perfectly.  It required a bit of
            tweaking to be able to allocate all of the keys.

    From Dave Hansen
        Reposition some code to be near/similar to user pkeys
        s/pks_write_current/x86_pkrs_load/
        s/pks_saved_pkrs/pkrs/
        Update Documentation
        s/PKR_{RW,AD,WD}_KEY/PKR_{RW,AD,WD}_MASK/
        Consistently use lower case for pkey
        Update commit messages
        Add Acks

    PMEM Stray Write
        Building on the pks_mk_*() to pks_set_*() function rename:
            s/pgmap_mk_*/pgmap_set_*/
            s/dax_mk_*/dax_set_*/
        From Dan Williams
            Avoid adding new dax operations by teaching dax_device
            about pgmap
            Remove the pgmap_protection_flag_invalid() patch (just let
            kmap'ings fail)


PKS/PMEM Stray write protection
===============================

This series is broken into 2 parts.

1) Introduce Protection Key Supervisor (PKS), testing, and
documentation
2) Use PKS to protect PMEM from stray writes

Introduce Protection Key Supervisor (PKS)
-----------------------------------------

PKS enables protections on 'domains' of supervisor pages to limit supervisor
mode access to pages beyond the normal paging protections.  PKS works in a
similar fashion to user space pkeys, PKU.  As with PKU, supervisor pkeys are
checked in addition to normal paging protections, and page mappings are
assigned to a domain by setting a 4-bit pkey in the PTE of that mapping.

Unlike PKU, permissions are changed via an MSR update.  This update avoids
TLB flushes, making it a more efficient way to alter protections than PTE
updates.

Also, unlike PTE updates, PKS permission changes apply only to the current
processor.  Therefore a permission change affects only that thread and not
any other CPU/process.  This allows protections to remain in place on other
CPUs for additional protection and isolation.
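
As a rough illustration, each pkey owns a 2-bit permission field in the PKRS
MSR.  A sketch of the value computation, mirroring the pkey_update_pkval()
helper added later in this series (illustrative only, not the exact patch
code):

	/* Compute a new pkey register value for one pkey; sketch only. */
	u32 update_pkval(u32 pkval, u8 pkey, u32 accessbits)
	{
		int shift = pkey * PKR_BITS_PER_PKEY;	/* 2 bits per pkey */

		pkval &= ~(PKEY_ACCESS_MASK << shift);	/* clear old bits */
		return pkval | (accessbits << shift);	/* install new bits */
	}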

Even though PKS updates are thread-local, XSAVE is not supported for the PKRS
MSR.  Therefore this implementation saves and restores the MSR across context
switches and during exceptions in software.  Nested exceptions are
supported by each exception getting a new PKS state.

For consistent behavior with current paging protections, pkey 0 is reserved and
configured to allow full access via the pkey mechanism, thus preserving the
default paging protections because PTEs naturally have a pkey value of 0.

The other keys (1-15) are statically allocated by kernel consumers when
configured.  This is done by adding the appropriate PKS_NEW_KEY and
PKS_DECLARE_INIT_VALUE macros to pks-keys.h, as sketched below.
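
A rough sketch of what a consumer addition to pks-keys.h might look like.
PKS_KEY_MY_FEATURE and CONFIG_MY_FEATURE are hypothetical names, and the
exact macro arguments are defined by the key allocation patch, so treat this
purely as an illustration:

	/* include/linux/pks-keys.h -- illustrative only */
	PKS_NEW_KEY(PKS_KEY_MY_FEATURE, CONFIG_MY_FEATURE)
	PKS_DECLARE_INIT_VALUE(PKS_KEY_MY_FEATURE, DISABLE_ACCESS,
			       CONFIG_MY_FEATURE)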

Two PKS consumers, PKS_TEST and PMEM stray write protection, are included in
this series.  When the number of users grows larger, the sharing of keys will
need to be resolved depending on the needs of the users at that time.  Many
methods have been contemplated, but the number of kernel users and use cases
envisioned is still quite small, much less than the 15 available keys.

To summarize, the following are key attributes of PKS.

1) Fast switching of permissions
1a) Prevents access without page table manipulations
1b) No TLB flushes required
2) Works on a per thread basis, thus allowing protections to be
preserved on threads which are not actively accessing data through
the mapping.

PKS is available with 4- and 5-level paging.  For this reason, and for
simplicity of implementation, the feature is restricted to x86_64.


Use PKS to protect PMEM from stray writes
-----------------------------------------

DAX leverages the direct-map to enable 'struct page' services for PMEM.
Given that PMEM may have an order of magnitude higher capacity than System
RAM, it presents a large vulnerability surface to stray writes.  Such a stray
write becomes a silent data corruption bug.

Stray pointers to System RAM may result in a crash or other undesirable
behavior which, while unfortunate, is usually recoverable with a reboot.
Stray writes to PMEM are permanent in nature and thus are more likely to
result in permanent user data loss.  Given that PMEM access from the kernel
is limited to a constrained set of locations (PMEM driver, Filesystem-DAX,
direct-I/O, and any properly kmap'ed page), it is amenable to PKS protection.

Set up an infrastructure for extra device access protection. Then implement the
protection using the new Protection Keys Supervisor (PKS) on architectures
which support it.

Because PMEM pages are all associated with a struct dev_pagemap, and because
flags in struct page are a scarce resource, the protection flag is stored in
struct dev_pagemap instead.  All PMEM is protected by the same pkey, so a
single flag is all that is needed in each dev_pagemap to indicate protection.
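
A minimal sketch of the resulting check.  PGMAP_PROTECTION is the flag
introduced by this series; the helper body here is illustrative and assumes
a ZONE_DEVICE page, so the actual implementation may differ:

	static bool devmap_protected(struct page *page)
	{
		struct dev_pagemap *pgmap = page->pgmap;

		return pgmap && (pgmap->flags & PGMAP_PROTECTION);
	}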

General access in the kernel is supported by modifying the kmap
infrastructure, which detects whether a page is PKS protected and enables
access until the corresponding unmap is called.
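
The expected thread-local access pattern is unchanged from any other
kmap_local_page() user; for example:

	char buf[64];
	void *addr;

	addr = kmap_local_page(page);	/* enables PKS access if protected */
	memcpy(buf, addr, sizeof(buf));
	kunmap_local(addr);		/* access is revoked again */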

Because PKS is a thread-local mechanism and because kmap was never really
intended to create a long-term mapping, this implementation does not support
the kmap()/kunmap() calls.  Calling kmap() on a PMEM protected page is
allowed, but accessing that mapping will cause a fault.

Originally this series modified many of the kmap call sites to indicate they
were thread-local.[1]  An attempt to support kmap() was also made.[2]  But
now that kmap_local_page() has been developed[3] and is in more widespread
use, kmap() can safely be left unsupported.

How the fault is handled is configurable via a new module parameter
memremap.pks_fault_mode. Two modes are supported.

'relaxed' (default) -- WARN_ONCE, disable the protection and allow
access

'strict' -- prevent any unguarded access to a protected dev_pagemap
range
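
For example, a deployment that wants to fail closed can boot with the
following on the kernel command line:

	memremap.pks_fault_mode=strict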

This 'safety valve' has already proven useful during the development of this
series.


[1] https://lore.kernel.org/lkml/[email protected]/

[2] https://lore.kernel.org/lkml/[email protected]/

[3] https://lore.kernel.org/lkml/[email protected]/
https://lore.kernel.org/lkml/[email protected]/
https://lore.kernel.org/lkml/[email protected]/
https://lore.kernel.org/lkml/[email protected]/


----------------------------------------------------------------------------
Changes for V8

Feedback from Thomas
* clean up noinstr mess
* Fix static PKEY allocation mess
* Ensure all functions are consistently named.
* Split up patches to do 1 thing per patch
* pkey_update_pkval() implementation
* Streamline the use of pks_write_pkrs() by not disabling preemption
- Leave this to the callers who require it.
- Use documentation and lockdep to prevent errors
* Clean up commit messages to explain in detail _why_ each patch is
there.

Feedback from Dave H.
* Leave out pks_mk_readonly() as it is not used by the PMEM use case

Feedback from Peter Anvin
* Replace pks_abandon_pkey() with pks_update_exception()
This is an even greater simplification in that it no longer
attempts to shield users from faults.  The main use case for
abandoning a key was to allow a system to continue running
even with an error.  This should be a rare event, so the
performance should not be an issue.

* Simplify ARCH_ENABLE_SUPERVISOR_PKEYS

* Update PKS Test code
- Add default value test
- Split up the test code into patches which follow each feature
addition
- simplify test code processing
- ensure consistent reporting of errors.

* Ensure all entry points to the PKS code are protected by
cpu_feature_enabled(X86_FEATURE_PKS)
- At the same time make sure non-entry points or sub-functions to the
PKS code are not _unnecessarily_ protected by the feature check

* Update documentation
- Use kernel docs to place the docs with the code for easier internal
developer use

* Adjust the PMEM use cases for the core changes

* Split the PMEM patches up to be 1 change per patch and help clarify review

* Review all header files and remove those no longer needed

* Review/update/clarify all commit messages

Fenghua Yu (1):
mm/pkeys: Define PKS page table macros

Ira Weiny (43):
entry: Create an internal irqentry_exit_cond_resched() call
Documentation/protection-keys: Clean up documentation for User Space
pkeys
x86/pkeys: Clarify PKRU_AD_KEY macro
x86/pkeys: Make PKRU macros generic
x86/fpu: Refactor arch_set_user_pkey_access()
mm/pkeys: Add Kconfig options for PKS
x86/pkeys: Add PKS CPU feature bit
x86/fault: Adjust WARN_ON for pkey fault
Documentation/pkeys: Add initial PKS documentation
mm/pkeys: Provide for PKS key allocation
x86/pkeys: Enable PKS on cpus which support it
mm/pkeys: PKS testing, add initial test code
x86/selftests: Add test_pks
x86/pkeys: Introduce pks_write_pkrs()
x86/pkeys: Preserve the PKS MSR on context switch
mm/pkeys: Introduce pks_set_readwrite()
mm/pkeys: Introduce pks_set_noaccess()
mm/pkeys: PKS testing, add a fault call back
mm/pkeys: PKS testing, add pks_set_*() tests
mm/pkeys: PKS testing, test context switching
x86/entry: Add auxiliary pt_regs space
entry: Split up irqentry_exit_cond_resched()
entry: Add calls for save/restore auxiliary pt_regs
x86/entry: Define arch_{save|restore}_auxiliary_pt_regs()
x86/pkeys: Preserve PKRS MSR across exceptions
x86/fault: Print PKS MSR on fault
mm/pkeys: PKS testing, Add exception test
mm/pkeys: Introduce pks_update_exception()
mm/pkeys: PKS testing, test pks_update_exception()
mm/pkeys: PKS testing, add test for all keys
mm/pkeys: Add pks_available()
memremap_pages: Add Kconfig for DEVMAP_ACCESS_PROTECTION
memremap_pages: Introduce pgmap_protection_available()
memremap_pages: Introduce a PGMAP_PROTECTION flag
memremap_pages: Introduce devmap_protected()
memremap_pages: Reserve a PKS pkey for eventual use by PMEM
memremap_pages: Set PKS pkey in PTEs if requested
memremap_pages: Define pgmap_set_{readwrite|noaccess}() calls
memremap_pages: Add memremap.pks_fault_mode
kmap: Make kmap work for devmap protected pages
dax: Stray access protection for dax_direct_access()
nvdimm/pmem: Enable stray access protection
devdax: Enable stray access protection

Rick Edgecombe (1):
mm/pkeys: Introduce PKS fault callbacks

.../admin-guide/kernel-parameters.txt | 12 +
Documentation/core-api/protection-keys.rst | 130 ++-
arch/x86/Kconfig | 6 +
arch/x86/entry/calling.h | 20 +
arch/x86/entry/common.c | 2 +-
arch/x86/entry/entry_64.S | 22 +
arch/x86/entry/entry_64_compat.S | 6 +
arch/x86/include/asm/cpufeatures.h | 1 +
arch/x86/include/asm/disabled-features.h | 8 +-
arch/x86/include/asm/entry-common.h | 15 +
arch/x86/include/asm/msr-index.h | 1 +
arch/x86/include/asm/pgtable_types.h | 22 +
arch/x86/include/asm/pkeys.h | 2 +
arch/x86/include/asm/pkeys_common.h | 18 +
arch/x86/include/asm/pkru.h | 20 +-
arch/x86/include/asm/pks.h | 46 ++
arch/x86/include/asm/processor.h | 15 +-
arch/x86/include/asm/ptrace.h | 21 +
arch/x86/include/uapi/asm/processor-flags.h | 2 +
arch/x86/kernel/asm-offsets_64.c | 15 +
arch/x86/kernel/cpu/common.c | 2 +
arch/x86/kernel/dumpstack.c | 32 +-
arch/x86/kernel/fpu/xstate.c | 22 +-
arch/x86/kernel/head_64.S | 6 +
arch/x86/kernel/process_64.c | 3 +
arch/x86/mm/fault.c | 17 +-
arch/x86/mm/pkeys.c | 320 +++++++-
drivers/dax/device.c | 2 +
drivers/dax/super.c | 59 ++
drivers/md/dm-writecache.c | 8 +-
drivers/nvdimm/pmem.c | 26 +
fs/dax.c | 8 +
fs/fuse/virtio_fs.c | 2 +
include/linux/dax.h | 5 +
include/linux/entry-common.h | 15 +-
include/linux/highmem-internal.h | 4 +
include/linux/memremap.h | 1 +
include/linux/mm.h | 72 ++
include/linux/pgtable.h | 4 +
include/linux/pks-keys.h | 92 +++
include/linux/pks.h | 73 ++
include/linux/sched.h | 7 +
include/uapi/asm-generic/mman-common.h | 1 +
init/init_task.c | 3 +
kernel/entry/common.c | 44 +-
kernel/sched/core.c | 40 +-
lib/Kconfig.debug | 33 +
lib/Makefile | 3 +
lib/pks/Makefile | 3 +
lib/pks/pks_test.c | 755 ++++++++++++++++++
mm/Kconfig | 32 +
mm/memremap.c | 132 +++
tools/testing/selftests/x86/Makefile | 2 +-
tools/testing/selftests/x86/test_pks.c | 514 ++++++++++++
54 files changed, 2617 insertions(+), 109 deletions(-)
create mode 100644 arch/x86/include/asm/pkeys_common.h
create mode 100644 arch/x86/include/asm/pks.h
create mode 100644 include/linux/pks-keys.h
create mode 100644 include/linux/pks.h
create mode 100644 lib/pks/Makefile
create mode 100644 lib/pks/pks_test.c
create mode 100644 tools/testing/selftests/x86/test_pks.c

--
2.35.1


2022-03-11 00:14:37

by Ira Weiny

Subject: [PATCH V9 29/45] mm/pkeys: PKS testing, Add exception test

From: Ira Weiny <[email protected]>

During an exception the interrupted thread's PKRS value is preserved
and the exception receives the default value for that pkey.  Upon
return from the exception the thread's PKRS value is restored.

Add a PKS test which forces a fault to check that this works as
intended.  Check that both the thread and the exception PKRS state
are correct before, during, and after the exception.

Add the test to the test_pks app.

Signed-off-by: Ira Weiny <[email protected]>

---
Changes for V9
Add test to test_pks
Clean up the globals shared with the fault handler
Use the PKS Test specific fault callback
s/pks_mk*/pks_set*/
Change pkey type to u8
From Dave Hansen
Use pkey

Changes for V8
Split this test off from the testing patch and place it after
the exception saving code.
---
arch/x86/mm/pkeys.c | 2 +-
include/linux/pks.h | 6 ++
lib/pks/pks_test.c | 133 +++++++++++++++++++++++++
tools/testing/selftests/x86/test_pks.c | 5 +-
4 files changed, 144 insertions(+), 2 deletions(-)

diff --git a/arch/x86/mm/pkeys.c b/arch/x86/mm/pkeys.c
index 7c8e4ea9f022..6327e32d7237 100644
--- a/arch/x86/mm/pkeys.c
+++ b/arch/x86/mm/pkeys.c
@@ -215,7 +215,7 @@ u32 pkey_update_pkval(u32 pkval, u8 pkey, u32 accessbits)

#ifdef CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS

-static DEFINE_PER_CPU(u32, pkrs_cache);
+__static_or_pks_test DEFINE_PER_CPU(u32, pkrs_cache);

/**
* DOC: DEFINE_PKS_FAULT_CALLBACK
diff --git a/include/linux/pks.h b/include/linux/pks.h
index 208f88fcb48c..224fc3bbd072 100644
--- a/include/linux/pks.h
+++ b/include/linux/pks.h
@@ -46,9 +46,15 @@ static inline void pks_set_readwrite(u8 pkey) {}

#ifdef CONFIG_PKS_TEST

+#define __static_or_pks_test
+
bool pks_test_fault_callback(struct pt_regs *regs, unsigned long address,
bool write);

+#else /* !CONFIG_PKS_TEST */
+
+#define __static_or_pks_test static
+
#endif /* CONFIG_PKS_TEST */

#endif /* _LINUX_PKS_H */
diff --git a/lib/pks/pks_test.c b/lib/pks/pks_test.c
index 86af2f61393d..762f4a19cb7d 100644
--- a/lib/pks/pks_test.c
+++ b/lib/pks/pks_test.c
@@ -48,19 +48,30 @@
#define RUN_SINGLE 1
#define ARM_CTX_SWITCH 2
#define CHECK_CTX_SWITCH 3
+#define RUN_EXCEPTION 4
#define RUN_CRASH_TEST 9

+DECLARE_PER_CPU(u32, pkrs_cache);
+
static struct dentry *pks_test_dentry;

DEFINE_MUTEX(test_run_lock);

struct pks_test_ctx {
u8 pkey;
+ bool pass;
char data[64];
void *test_page;
bool fault_seen;
+ bool validate_exp_handling;
};

+static bool check_pkey_val(u32 pk_reg, u8 pkey, u32 expected)
+{
+ pk_reg = (pk_reg >> PKR_PKEY_SHIFT(pkey)) & PKEY_ACCESS_MASK;
+ return (pk_reg == expected);
+}
+
static void debug_context(const char *label, struct pks_test_ctx *ctx)
{
pr_debug("%s [%d] %s <-> %p\n",
@@ -96,6 +107,63 @@ static void debug_result(const char *label, int test_num,
sd->last_test_pass ? "PASS" : "FAIL");
}

+/*
+ * Check if the register @pkey value matches @expected value
+ *
+ * Both the cached and actual MSR must match.
+ */
+static bool check_pkrs(u8 pkey, u8 expected)
+{
+ bool ret = true;
+ u64 pkrs;
+ u32 *tmp_cache;
+
+ tmp_cache = get_cpu_ptr(&pkrs_cache);
+ if (!check_pkey_val(*tmp_cache, pkey, expected))
+ ret = false;
+ put_cpu_ptr(tmp_cache);
+
+ rdmsrl(MSR_IA32_PKRS, pkrs);
+ if (!check_pkey_val(pkrs, pkey, expected))
+ ret = false;
+
+ return ret;
+}
+
+static void validate_exception(struct pks_test_ctx *ctx, u32 thread_pkrs)
+{
+ u8 pkey = ctx->pkey;
+
+ /* Check that the thread state was saved */
+ if (!check_pkey_val(thread_pkrs, pkey, PKEY_DISABLE_WRITE)) {
+ pr_err(" FAIL: checking aux_pt_regs->thread_pkrs\n");
+ ctx->pass = false;
+ }
+
+ /* Check that the exception received the default of disabled access */
+ if (!check_pkrs(pkey, PKEY_DISABLE_ACCESS)) {
+ pr_err(" FAIL: PKRS cache and MSR\n");
+ ctx->pass = false;
+ }
+
+ /*
+ * Ensure an update can occur during exception without affecting the
+ * interrupted thread. The interrupted thread is verified after the
+ * exception returns.
+ */
+ pks_set_readwrite(pkey);
+ if (!check_pkrs(pkey, 0)) {
+ pr_err(" FAIL: exception did not change register to 0\n");
+ ctx->pass = false;
+ }
+ pks_set_noaccess(pkey);
+ if (!check_pkrs(pkey, PKEY_DISABLE_ACCESS)) {
+ pr_err(" FAIL: exception did not change register to 0x%x\n",
+ PKEY_DISABLE_ACCESS);
+ ctx->pass = false;
+ }
+}
+
/* Global data protected by test_run_lock */
struct pks_test_ctx *g_ctx_under_test;

@@ -122,6 +190,16 @@ bool pks_test_fault_callback(struct pt_regs *regs, unsigned long address,
if (!g_ctx_under_test)
return false;

+ if (g_ctx_under_test->validate_exp_handling) {
+ validate_exception(g_ctx_under_test, pkrs);
+ /*
+ * Stop this check directly within the exception because the
+ * fault handler cleanup code will call back here while checking
+ * the PMD entry and there is no need to check this again.
+ */
+ g_ctx_under_test->validate_exp_handling = false;
+ }
+
aux_pt_regs->pkrs = pkey_update_pkval(pkrs, g_ctx_under_test->pkey, 0);
g_ctx_under_test->fault_seen = true;
return true;
@@ -255,6 +333,7 @@ static struct pks_test_ctx *alloc_ctx(u8 pkey)
return ERR_PTR(-ENOMEM);

ctx->pkey = pkey;
+ ctx->pass = true;
sprintf(ctx->data, "%s", "DEADBEEF");

ctx->test_page = alloc_test_page(ctx->pkey);
@@ -295,6 +374,56 @@ static bool run_single(struct pks_session_data *sd)
return rc;
}

+static bool run_exception_test(struct pks_session_data *sd)
+{
+ bool pass = true;
+ struct pks_test_ctx *ctx;
+
+ ctx = alloc_ctx(PKS_KEY_TEST);
+ if (IS_ERR(ctx)) {
+ pr_debug(" FAIL: no context\n");
+ return false;
+ }
+
+ set_ctx_data(sd, ctx);
+
+ /*
+ * Set the thread pkey value to something other than the default of
+ * access disable but something which still causes a fault, disable
+ * writes.
+ */
+ pks_update_protection(ctx->pkey, PKEY_DISABLE_WRITE);
+
+ ctx->validate_exp_handling = true;
+ set_context_for_fault(ctx);
+
+ memcpy(ctx->test_page, ctx->data, 8);
+
+ if (!ctx->fault_seen) {
+ pr_err(" FAIL: did not get an exception\n");
+ pass = false;
+ }
+
+ /*
+ * The exception code has to enable access to keep the fault from
+ * looping forever. Therefore full access is seen here rather than
+ * write disabled.
+ *
+ * However, this does verify that the exception state was independent
+ * of the interrupted thread's state because validate_exception()
+ * disabled access during the exception.
+ */
+ if (!check_pkrs(ctx->pkey, 0)) {
+ pr_err(" FAIL: PKRS not restored\n");
+ pass = false;
+ }
+
+ if (!ctx->pass)
+ pass = false;
+
+ return pass;
+}
+
static void crash_it(struct pks_session_data *sd)
{
struct pks_test_ctx *ctx;
@@ -451,6 +580,10 @@ static ssize_t pks_write_file(struct file *file, const char __user *user_buf,
pr_debug("Checking Context switch test\n");
check_ctx_switch(file->private_data);
break;
+ case RUN_EXCEPTION:
+ pr_debug("Exception checking\n");
+ sd->last_test_pass = run_exception_test(file->private_data);
+ break;
default:
pr_debug("Unknown test\n");
sd->last_test_pass = false;
diff --git a/tools/testing/selftests/x86/test_pks.c b/tools/testing/selftests/x86/test_pks.c
index 5a32645a6e6d..817df7a14923 100644
--- a/tools/testing/selftests/x86/test_pks.c
+++ b/tools/testing/selftests/x86/test_pks.c
@@ -35,6 +35,7 @@
#define RUN_SINGLE "1"
#define ARM_CTX_SWITCH "2"
#define CHECK_CTX_SWITCH "3"
+#define RUN_EXCEPTION "4"
#define RUN_CRASH_TEST "9"

time_t g_start_time;
@@ -61,6 +62,7 @@ enum {
TEST_DEFAULTS = 0,
TEST_SINGLE,
TEST_CTX_SWITCH,
+ TEST_EXCEPTION,
MAX_TESTS,
} tests;

@@ -74,7 +76,8 @@ struct test_item {
} test_list[] = {
{ "check_defaults", CHECK_DEFAULTS, do_simple_test },
{ "single", RUN_SINGLE, do_simple_test },
- { "context_switch", ARM_CTX_SWITCH, do_context_switch }
+ { "context_switch", ARM_CTX_SWITCH, do_context_switch },
+ { "exception", RUN_EXCEPTION, do_simple_test }
};

static char *get_test_name(int test_num)
--
2.35.1

2022-03-11 01:33:48

by Ira Weiny

Subject: [PATCH V9 21/45] mm/pkeys: PKS testing, add pks_set_*() tests

From: Ira Weiny <[email protected]>

Test that the pks_set_*() functions operate as intended.

First, verify that the pkey was properly set in the PTE.

Second, use the fault callback mechanism to detect whether a fault
occurred when expected and, if so, clear the fault.

The test iterates each of the following test cases.

PKS_TEST_NO_ACCESS, WRITE, FAULT_EXPECTED
PKS_TEST_NO_ACCESS, READ, FAULT_EXPECTED

PKS_TEST_RDWR, WRITE, NO_FAULT_EXPECTED
PKS_TEST_RDWR, READ, NO_FAULT_EXPECTED

Add documentation.

Signed-off-by: Ira Weiny <[email protected]>

---
Changes for V9
Update commit message
Clarify use of global state for faults to be used by all tests
Add test to test_pks user app
Remove an incorrect comment in the kdoc
Change pkey type to u8
From Dave Hansen
s/pks_mk*/pks_set*/
From Rick Edgecombe
Use standard fault callback instead of the custom PKS
test one

Changes for V8
Remove readonly test, as that patch is not needed for PMEM
Split this off into a patch which follows the pks_mk_*()
patches. Thus allowing for a better view of how the
test works compared to the functionality added with
those patches.
Remove unneeded prints
---
lib/pks/pks_test.c | 161 ++++++++++++++++++++++++-
tools/testing/selftests/x86/test_pks.c | 5 +-
2 files changed, 162 insertions(+), 4 deletions(-)

diff --git a/lib/pks/pks_test.c b/lib/pks/pks_test.c
index 37f2cd7d0f56..3e14c621bde6 100644
--- a/lib/pks/pks_test.c
+++ b/lib/pks/pks_test.c
@@ -33,11 +33,14 @@
#include <linux/module.h>
#include <linux/slab.h>
#include <linux/vmalloc.h>
+#include <linux/pgtable.h>
+#include <linux/pks.h>
#include <linux/pks-keys.h>

#define PKS_TEST_MEM_SIZE (PAGE_SIZE)

#define CHECK_DEFAULTS 0
+#define RUN_SINGLE 1
#define RUN_CRASH_TEST 9

static struct dentry *pks_test_dentry;
@@ -48,6 +51,7 @@ struct pks_test_ctx {
u8 pkey;
char data[64];
void *test_page;
+ bool fault_seen;
};

static void debug_context(const char *label, struct pks_test_ctx *ctx)
@@ -85,10 +89,103 @@ static void debug_result(const char *label, int test_num,
sd->last_test_pass ? "PASS" : "FAIL");
}

+/* Global data protected by test_run_lock */
+struct pks_test_ctx *g_ctx_under_test;
+
+/*
+ * Call set_context_for_fault() after the context has been set up and prior to
+ * the expected fault.
+ */
+static void set_context_for_fault(struct pks_test_ctx *ctx)
+{
+ g_ctx_under_test = ctx;
+ /* Ensure the state of the global context is correct prior to a fault */
+ barrier();
+}
+
bool pks_test_fault_callback(struct pt_regs *regs, unsigned long address,
bool write)
{
- return false;
+ pr_debug("PKS Fault callback: ctx %p\n", g_ctx_under_test);
+
+ if (!g_ctx_under_test)
+ return false;
+
+ pks_set_readwrite(g_ctx_under_test->pkey);
+ g_ctx_under_test->fault_seen = true;
+ return true;
+}
+
+enum pks_access_mode {
+ PKS_TEST_NO_ACCESS,
+ PKS_TEST_RDWR,
+};
+
+#define PKS_WRITE true
+#define PKS_READ false
+#define PKS_FAULT_EXPECTED true
+#define PKS_NO_FAULT_EXPECTED false
+
+static char *get_mode_str(enum pks_access_mode mode)
+{
+ switch (mode) {
+ case PKS_TEST_NO_ACCESS:
+ return "No Access";
+ case PKS_TEST_RDWR:
+ return "Read Write";
+ }
+
+ return "";
+}
+
+struct pks_access_test {
+ enum pks_access_mode mode;
+ bool write;
+ bool fault;
+};
+
+static struct pks_access_test pkey_test_ary[] = {
+ { PKS_TEST_NO_ACCESS, PKS_WRITE, PKS_FAULT_EXPECTED },
+ { PKS_TEST_NO_ACCESS, PKS_READ, PKS_FAULT_EXPECTED },
+
+ { PKS_TEST_RDWR, PKS_WRITE, PKS_NO_FAULT_EXPECTED },
+ { PKS_TEST_RDWR, PKS_READ, PKS_NO_FAULT_EXPECTED },
+};
+
+static bool run_access_test(struct pks_test_ctx *ctx,
+ struct pks_access_test *test,
+ void *ptr)
+{
+ switch (test->mode) {
+ case PKS_TEST_NO_ACCESS:
+ pks_set_noaccess(ctx->pkey);
+ break;
+ case PKS_TEST_RDWR:
+ pks_set_readwrite(ctx->pkey);
+ break;
+ default:
+ pr_debug("BUG in test, invalid mode\n");
+ return false;
+ }
+
+ ctx->fault_seen = false;
+ set_context_for_fault(ctx);
+
+ if (test->write)
+ memcpy(ptr, ctx->data, 8);
+ else
+ memcpy(ctx->data, ptr, 8);
+
+ if (test->fault != ctx->fault_seen) {
+ pr_err("pkey test FAILED: mode %s; write %s; fault %s != %s\n",
+ get_mode_str(test->mode),
+ test->write ? "TRUE" : "FALSE",
+ test->fault ? "YES" : "NO",
+ ctx->fault_seen ? "YES" : "NO");
+ return false;
+ }
+
+ return true;
}

static void *alloc_test_page(u8 pkey)
@@ -108,6 +205,37 @@ static void free_ctx(struct pks_test_ctx *ctx)
kfree(ctx);
}

+static bool test_ctx(struct pks_test_ctx *ctx)
+{
+ bool rc = true;
+ int i;
+ u8 pkey;
+ void *ptr = ctx->test_page;
+ pte_t *ptep = NULL;
+ unsigned int level;
+
+ ptep = lookup_address((unsigned long)ptr, &level);
+ if (!ptep) {
+ pr_err("Failed to lookup address???\n");
+ return false;
+ }
+
+ pkey = pte_flags_pkey(ptep->pte);
+ if (pkey != ctx->pkey) {
+ pr_err("invalid pkey found: %u, test_pkey: %u\n",
+ pkey, ctx->pkey);
+ return false;
+ }
+
+ for (i = 0; i < ARRAY_SIZE(pkey_test_ary); i++) {
+ /* sticky fail */
+ if (!run_access_test(ctx, &pkey_test_ary[i], ptr))
+ rc = false;
+ }
+
+ return rc;
+}
+
static struct pks_test_ctx *alloc_ctx(u8 pkey)
{
struct pks_test_ctx *ctx = kzalloc(sizeof(*ctx), GFP_KERNEL);
@@ -139,6 +267,23 @@ static void set_ctx_data(struct pks_session_data *sd, struct pks_test_ctx *ctx)
sd->ctx = ctx;
}

+static bool run_single(struct pks_session_data *sd)
+{
+ struct pks_test_ctx *ctx;
+ bool rc;
+
+ ctx = alloc_ctx(PKS_KEY_TEST);
+ if (IS_ERR(ctx))
+ return false;
+
+ set_ctx_data(sd, ctx);
+
+ rc = test_ctx(ctx);
+ pks_set_noaccess(ctx->pkey);
+
+ return rc;
+}
+
static void crash_it(struct pks_session_data *sd)
{
struct pks_test_ctx *ctx;
@@ -203,6 +348,12 @@ static ssize_t pks_read_file(struct file *file, char __user *user_buf,
return simple_read_from_buffer(user_buf, count, ppos, buf, len);
}

+static void cleanup_test(void)
+{
+ g_ctx_under_test = NULL;
+ mutex_unlock(&test_run_lock);
+}
+
static ssize_t pks_write_file(struct file *file, const char __user *user_buf,
size_t count, loff_t *ppos)
{
@@ -235,6 +386,10 @@ static ssize_t pks_write_file(struct file *file, const char __user *user_buf,
pr_debug("check defaults test: 0x%lx\n", PKS_INIT_VALUE);
on_each_cpu(check_pkey_settings, file->private_data, 1);
break;
+ case RUN_SINGLE:
+ pr_debug("Single key\n");
+ sd->last_test_pass = run_single(file->private_data);
+ break;
default:
pr_debug("Unknown test\n");
sd->last_test_pass = false;
@@ -251,7 +406,7 @@ static ssize_t pks_write_file(struct file *file, const char __user *user_buf,
* Normal exit; clear up the locking flag
*/
sd->need_unlock = false;
- mutex_unlock(&test_run_lock);
+ cleanup_test();
debug_result("Test complete", test_num, sd);
return count;
}
@@ -282,7 +437,7 @@ static int pks_release_file(struct inode *inode, struct file *file)
* not exit normally.
*/
if (sd->need_unlock)
- mutex_unlock(&test_run_lock);
+ cleanup_test();
free_ctx(sd->ctx);
kfree(sd);
return 0;
diff --git a/tools/testing/selftests/x86/test_pks.c b/tools/testing/selftests/x86/test_pks.c
index df5bde9bfdbe..2c10b6c50416 100644
--- a/tools/testing/selftests/x86/test_pks.c
+++ b/tools/testing/selftests/x86/test_pks.c
@@ -31,6 +31,7 @@

/* Values from the kernel */
#define CHECK_DEFAULTS "0"
+#define RUN_SINGLE "1"
#define RUN_CRASH_TEST "9"

time_t g_start_time;
@@ -53,6 +54,7 @@ static int do_simple_test(const char *debugfs_str);
*/
enum {
TEST_DEFAULTS = 0,
+ TEST_SINGLE,
MAX_TESTS,
} tests;

@@ -64,7 +66,8 @@ struct test_item {
const char *debugfs_str;
int (*test_fn)(const char *debugfs_str);
} test_list[] = {
- { "check_defaults", CHECK_DEFAULTS, do_simple_test }
+ { "check_defaults", CHECK_DEFAULTS, do_simple_test },
+ { "single", RUN_SINGLE, do_simple_test }
};

static char *get_test_name(int test_num)
--
2.35.1

2022-03-11 02:20:27

by Ira Weiny

Subject: [PATCH V9 23/45] x86/entry: Add auxiliary pt_regs space

From: Ira Weiny <[email protected]>

The PKRS MSR is not managed by XSAVE.  Therefore, in order for the MSR
to be preserved during an exception, the current CPU MSR value needs to
be saved somewhere during the exception and restored when returning to
the previous context.

Two possible places for preserving this state were considered:
irqentry_state_t and pt_regs.[1]  pt_regs was much more complicated and
was potentially fraught with unintended consequences.[2]  However, Andy
Lutomirski came up with a way to hide additional values on the stack
which can be accessed as "extended_pt_regs".[3]  This method allows any
function with access to pt_regs to obtain access to the extra
information without expanding the use of irqentry_state_t, while
leaving pt_regs intact for compatibility with outside tools like BPF.
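
A sketch of how a C handler reaches the hidden space through the accessor
added below:

	/* regs points at the pt_regs member of a pt_regs_extended */
	struct pt_regs_auxiliary *aux = &to_extended_pt_regs(regs)->aux;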

Prepare the assembly code to add a hidden auxiliary pt_regs space. To
simplify, the assembly code only adds space on the stack as defined by
the C code which needs it. The use of this space is left to the C code
which is required to select ARCH_HAS_PTREGS_AUXILIARY to enable this
support.

Each nested exception gets another copy of this auxiliary space allowing
for any number of levels of exception handling.

Initially the space is left empty and results in no code changes because
ARCH_HAS_PTREGS_AUXILIARY is not set. Subsequent patches adding data to
pt_regs_auxiliary must set ARCH_HAS_PTREGS_AUXILIARY or a build failure
will occur. The use of ARCH_HAS_PTREGS_AUXILIARY also avoids the
introduction of 2 instructions (addq/subq) on every entry call when the
extra space is not needed.

32-bit is specifically excluded, as the current consumer of this, PKS,
will not support 32-bit either.

Peter, Thomas, Andy, Dave, and Dan all suggested parts of the patch or
aided in the development of the patch.

[1] https://lore.kernel.org/lkml/CALCETrVe1i5JdyzD_BcctxQJn+ZE3T38EFPgjxN1F577M36g+w@mail.gmail.com/
[2] https://lore.kernel.org/lkml/[email protected]/#t
[3] https://lore.kernel.org/lkml/CALCETrUHwZPic89oExMMe-WyDY8-O3W68NcZvse3=PGW+iW5=w@mail.gmail.com/

Cc: Dave Hansen <[email protected]>
Cc: Dan Williams <[email protected]>
Suggested-by: Dave Hansen <[email protected]>
Suggested-by: Dan Williams <[email protected]>
Suggested-by: Peter Zijlstra <[email protected]>
Suggested-by: Thomas Gleixner <[email protected]>
Suggested-by: Andy Lutomirski <[email protected]>
Signed-off-by: Ira Weiny <[email protected]>

---
Changes for V9:
Update commit message

Changes for V8:
Exclude 32bit
Introduce ARCH_HAS_PTREGS_AUXILIARY to optimize this away when
not needed.
From Thomas
s/EXTENDED_PT_REGS_SIZE/PT_REGS_AUX_SIZE
Fix up PTREGS_AUX_SIZE macro to be based on the
structures and used in assembly code via the
nifty asm-offset macros
Bound calls into C code with [PUSH|POP]_PTREGS_AUXILIARY
instead of using a macro 'call'
Split this patch out and put the PKS specific stuff in a
separate patch

Changes for V7:
Rebased to 5.14 entry code
declare write_pkrs() in pks.h
s/INIT_PKRS_VALUE/pkrs_init_value
Remove unnecessary INIT_PKRS_VALUE def
s/pkrs_save_set_irq/pkrs_save_irq/
The initial value for exceptions is best managed
completely within the pkey code.
---
arch/x86/Kconfig | 4 ++++
arch/x86/entry/calling.h | 20 ++++++++++++++++++++
arch/x86/entry/entry_64.S | 22 ++++++++++++++++++++++
arch/x86/entry/entry_64_compat.S | 6 ++++++
arch/x86/include/asm/ptrace.h | 18 ++++++++++++++++++
arch/x86/kernel/asm-offsets_64.c | 15 +++++++++++++++
arch/x86/kernel/head_64.S | 6 ++++++
7 files changed, 91 insertions(+)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 459948622a73..64348c94477e 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1878,6 +1878,10 @@ config X86_INTEL_MEMORY_PROTECTION_KEYS

If unsure, say y.

+config ARCH_HAS_PTREGS_AUXILIARY
+ depends on X86_64
+ bool
+
choice
prompt "TSX enable mode"
depends on CPU_SUP_INTEL
diff --git a/arch/x86/entry/calling.h b/arch/x86/entry/calling.h
index a4c061fb7c6e..d0ebf9b069c9 100644
--- a/arch/x86/entry/calling.h
+++ b/arch/x86/entry/calling.h
@@ -63,6 +63,26 @@ For 32-bit we have the following conventions - kernel is built with
* for assembly code:
*/

+
+#ifdef CONFIG_ARCH_HAS_PTREGS_AUXILIARY
+
+.macro PUSH_PTREGS_AUXILIARY
+ /* add space for pt_regs_auxiliary */
+ subq $PTREGS_AUX_SIZE, %rsp
+.endm
+
+.macro POP_PTREGS_AUXILIARY
+ /* remove space for pt_regs_auxiliary */
+ addq $PTREGS_AUX_SIZE, %rsp
+.endm
+
+#else
+
+#define PUSH_PTREGS_AUXILIARY
+#define POP_PTREGS_AUXILIARY
+
+#endif
+
.macro PUSH_REGS rdx=%rdx rax=%rax save_ret=0
.if \save_ret
pushq %rsi /* pt_regs->si */
diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 466df3e50276..0684a8093965 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -332,7 +332,9 @@ SYM_CODE_END(ret_from_fork)
movq $-1, ORIG_RAX(%rsp) /* no syscall to restart */
.endif

+ PUSH_PTREGS_AUXILIARY
call \cfunc
+ POP_PTREGS_AUXILIARY

jmp error_return
.endm
@@ -435,7 +437,9 @@ SYM_CODE_START(\asmsym)

movq %rsp, %rdi /* pt_regs pointer */

+ PUSH_PTREGS_AUXILIARY
call \cfunc
+ POP_PTREGS_AUXILIARY

jmp paranoid_exit

@@ -496,7 +500,9 @@ SYM_CODE_START(\asmsym)
* stack.
*/
movq %rsp, %rdi /* pt_regs pointer */
+ PUSH_PTREGS_AUXILIARY
call vc_switch_off_ist
+ POP_PTREGS_AUXILIARY
movq %rax, %rsp /* Switch to new stack */

UNWIND_HINT_REGS
@@ -507,7 +513,9 @@ SYM_CODE_START(\asmsym)

movq %rsp, %rdi /* pt_regs pointer */

+ PUSH_PTREGS_AUXILIARY
call kernel_\cfunc
+ POP_PTREGS_AUXILIARY

/*
* No need to switch back to the IST stack. The current stack is either
@@ -542,7 +550,9 @@ SYM_CODE_START(\asmsym)
movq %rsp, %rdi /* pt_regs pointer into first argument */
movq ORIG_RAX(%rsp), %rsi /* get error code into 2nd argument*/
movq $-1, ORIG_RAX(%rsp) /* no syscall to restart */
+ PUSH_PTREGS_AUXILIARY
call \cfunc
+ POP_PTREGS_AUXILIARY

jmp paranoid_exit

@@ -784,7 +794,9 @@ SYM_CODE_START_LOCAL(exc_xen_hypervisor_callback)
movq %rdi, %rsp /* we don't return, adjust the stack frame */
UNWIND_HINT_REGS

+ PUSH_PTREGS_AUXILIARY
call xen_pv_evtchn_do_upcall
+ POP_PTREGS_AUXILIARY

jmp error_return
SYM_CODE_END(exc_xen_hypervisor_callback)
@@ -984,7 +996,9 @@ SYM_CODE_START_LOCAL(error_entry)
/* Put us onto the real thread stack. */
popq %r12 /* save return addr in %12 */
movq %rsp, %rdi /* arg0 = pt_regs pointer */
+ PUSH_PTREGS_AUXILIARY
call sync_regs
+ POP_PTREGS_AUXILIARY
movq %rax, %rsp /* switch stack */
ENCODE_FRAME_POINTER
pushq %r12
@@ -1040,7 +1054,9 @@ SYM_CODE_START_LOCAL(error_entry)
* as if we faulted immediately after IRET.
*/
mov %rsp, %rdi
+ PUSH_PTREGS_AUXILIARY
call fixup_bad_iret
+ POP_PTREGS_AUXILIARY
mov %rax, %rsp
jmp .Lerror_entry_from_usermode_after_swapgs
SYM_CODE_END(error_entry)
@@ -1146,7 +1162,9 @@ SYM_CODE_START(asm_exc_nmi)

movq %rsp, %rdi
movq $-1, %rsi
+ PUSH_PTREGS_AUXILIARY
call exc_nmi
+ POP_PTREGS_AUXILIARY

/*
* Return back to user mode. We must *not* do the normal exit
@@ -1182,6 +1200,8 @@ SYM_CODE_START(asm_exc_nmi)
* +---------------------------------------------------------+
* | pt_regs |
* +---------------------------------------------------------+
+ * | (Optionally) pt_regs_extended |
+ * +---------------------------------------------------------+
*
* The "original" frame is used by hardware. Before re-enabling
* NMIs, we need to be done with it, and we need to leave enough
@@ -1358,7 +1378,9 @@ end_repeat_nmi:

movq %rsp, %rdi
movq $-1, %rsi
+ PUSH_PTREGS_AUXILIARY
call exc_nmi
+ POP_PTREGS_AUXILIARY

/* Always restore stashed CR3 value (see paranoid_entry) */
RESTORE_CR3 scratch_reg=%r15 save_reg=%r14
diff --git a/arch/x86/entry/entry_64_compat.S b/arch/x86/entry/entry_64_compat.S
index 0051cf5c792d..c6859d8acae4 100644
--- a/arch/x86/entry/entry_64_compat.S
+++ b/arch/x86/entry/entry_64_compat.S
@@ -136,7 +136,9 @@ SYM_INNER_LABEL(entry_SYSENTER_compat_after_hwframe, SYM_L_GLOBAL)
.Lsysenter_flags_fixed:

movq %rsp, %rdi
+ PUSH_PTREGS_AUXILIARY
call do_SYSENTER_32
+ POP_PTREGS_AUXILIARY
/* XEN PV guests always use IRET path */
ALTERNATIVE "testl %eax, %eax; jz swapgs_restore_regs_and_return_to_usermode", \
"jmp swapgs_restore_regs_and_return_to_usermode", X86_FEATURE_XENPV
@@ -253,7 +255,9 @@ SYM_INNER_LABEL(entry_SYSCALL_compat_after_hwframe, SYM_L_GLOBAL)
UNWIND_HINT_REGS

movq %rsp, %rdi
+ PUSH_PTREGS_AUXILIARY
call do_fast_syscall_32
+ POP_PTREGS_AUXILIARY
/* XEN PV guests always use IRET path */
ALTERNATIVE "testl %eax, %eax; jz swapgs_restore_regs_and_return_to_usermode", \
"jmp swapgs_restore_regs_and_return_to_usermode", X86_FEATURE_XENPV
@@ -410,6 +414,8 @@ SYM_CODE_START(entry_INT80_compat)
cld

movq %rsp, %rdi
+ PUSH_PTREGS_AUXILIARY
call do_int80_syscall_32
+ POP_PTREGS_AUXILIARY
jmp swapgs_restore_regs_and_return_to_usermode
SYM_CODE_END(entry_INT80_compat)
diff --git a/arch/x86/include/asm/ptrace.h b/arch/x86/include/asm/ptrace.h
index 703663175a5a..5e7f6e48c0ab 100644
--- a/arch/x86/include/asm/ptrace.h
+++ b/arch/x86/include/asm/ptrace.h
@@ -2,6 +2,7 @@
#ifndef _ASM_X86_PTRACE_H
#define _ASM_X86_PTRACE_H

+#include <linux/container_of.h>
#include <asm/segment.h>
#include <asm/page_types.h>
#include <uapi/asm/ptrace.h>
@@ -91,6 +92,23 @@ struct pt_regs {
/* top of stack page */
};

+/*
+ * NOTE: Features which add data to pt_regs_auxiliary must select
+ * ARCH_HAS_PTREGS_AUXILIARY. Failure to do so will result in a build failure.
+ */
+struct pt_regs_auxiliary {
+};
+
+struct pt_regs_extended {
+ struct pt_regs_auxiliary aux;
+ struct pt_regs pt_regs __aligned(8);
+};
+
+static inline struct pt_regs_extended *to_extended_pt_regs(struct pt_regs *regs)
+{
+ return container_of(regs, struct pt_regs_extended, pt_regs);
+}
+
#endif /* !__i386__ */

#ifdef CONFIG_PARAVIRT
diff --git a/arch/x86/kernel/asm-offsets_64.c b/arch/x86/kernel/asm-offsets_64.c
index b14533af7676..66f08ac3507a 100644
--- a/arch/x86/kernel/asm-offsets_64.c
+++ b/arch/x86/kernel/asm-offsets_64.c
@@ -4,6 +4,7 @@
#endif

#include <asm/ia32.h>
+#include <asm/ptrace.h>

#if defined(CONFIG_KVM_GUEST) && defined(CONFIG_PARAVIRT_SPINLOCKS)
#include <asm/kvm_para.h>
@@ -60,5 +61,19 @@ int main(void)
DEFINE(stack_canary_offset, offsetof(struct fixed_percpu_data, stack_canary));
BLANK();
#endif
+
+#ifdef CONFIG_ARCH_HAS_PTREGS_AUXILIARY
+ /* Size of Auxiliary pt_regs data */
+ DEFINE(PTREGS_AUX_SIZE, sizeof(struct pt_regs_extended) -
+ sizeof(struct pt_regs));
+#else
+ /*
+ * Adding data to struct pt_regs_auxiliary requires setting
+ * ARCH_HAS_PTREGS_AUXILIARY
+ */
+ BUILD_BUG_ON((sizeof(struct pt_regs_extended) -
+ sizeof(struct pt_regs)) != 0);
+#endif
+
return 0;
}
diff --git a/arch/x86/kernel/head_64.S b/arch/x86/kernel/head_64.S
index 9c63fc5988cd..8418d9de8d70 100644
--- a/arch/x86/kernel/head_64.S
+++ b/arch/x86/kernel/head_64.S
@@ -336,8 +336,10 @@ SYM_CODE_START_NOALIGN(vc_boot_ghcb)
movq %rsp, %rdi
movq ORIG_RAX(%rsp), %rsi
movq initial_vc_handler(%rip), %rax
+ PUSH_PTREGS_AUXILIARY
ANNOTATE_RETPOLINE_SAFE
call *%rax
+ POP_PTREGS_AUXILIARY

/* Unwind pt_regs */
POP_REGS
@@ -414,7 +416,9 @@ SYM_CODE_START_LOCAL(early_idt_handler_common)
UNWIND_HINT_REGS

movq %rsp,%rdi /* RDI = pt_regs; RSI is already trapnr */
+ PUSH_PTREGS_AUXILIARY
call do_early_exception
+ POP_PTREGS_AUXILIARY

decl early_recursion_flag(%rip)
jmp restore_regs_and_return_to_kernel
@@ -438,7 +442,9 @@ SYM_CODE_START_NOALIGN(vc_no_ghcb)
/* Call C handler */
movq %rsp, %rdi
movq ORIG_RAX(%rsp), %rsi
+ PUSH_PTREGS_AUXILIARY
call do_vc_no_ghcb
+ POP_PTREGS_AUXILIARY

/* Unwind pt_regs */
POP_REGS
--
2.35.1

2022-03-11 03:16:27

by Ira Weiny

Subject: [PATCH V9 11/45] x86/pkeys: Enable PKS on cpus which support it

From: Ira Weiny <[email protected]>

Protection Keys for Supervisor pages (PKS) enables fast,
hardware-thread-specific manipulation of permission restrictions on
supervisor page mappings.  It uses a supervisor-specific MSR to assign
permissions to the pkeys.

When PKS is configured and the CPU supports PKS, initialize the MSR and
enable the hardware.

Add asm/pks.h to store new internal functions and structures such as
pks_setup().

Co-developed-by: Fenghua Yu <[email protected]>
Signed-off-by: Fenghua Yu <[email protected]>
Signed-off-by: Ira Weiny <[email protected]>

---
Changes for V9
Reword commit message
Move this after the patch defining PKS_INIT_VALUE

Changes for V8
Move setup_pks() into this patch with a default of all access
for all pkeys.
From Thomas
s/setup_pks/pks_setup/
Update Change log to better reflect exactly what this patch does.
---
arch/x86/include/asm/msr-index.h | 1 +
arch/x86/include/asm/pks.h | 15 +++++++++++++++
arch/x86/include/uapi/asm/processor-flags.h | 2 ++
arch/x86/kernel/cpu/common.c | 2 ++
arch/x86/mm/pkeys.c | 17 +++++++++++++++++
5 files changed, 37 insertions(+)
create mode 100644 arch/x86/include/asm/pks.h

diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index a4a39c3e0f19..6b0a6e0300a4 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -787,6 +787,7 @@

#define MSR_IA32_TSC_DEADLINE 0x000006E0

+#define MSR_IA32_PKRS 0x000006E1

#define MSR_TSX_FORCE_ABORT 0x0000010F

diff --git a/arch/x86/include/asm/pks.h b/arch/x86/include/asm/pks.h
new file mode 100644
index 000000000000..8180fc59790b
--- /dev/null
+++ b/arch/x86/include/asm/pks.h
@@ -0,0 +1,15 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_X86_PKS_H
+#define _ASM_X86_PKS_H
+
+#ifdef CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS
+
+void pks_setup(void);
+
+#else /* !CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS */
+
+static inline void pks_setup(void) { }
+
+#endif /* CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS */
+
+#endif /* _ASM_X86_PKS_H */
diff --git a/arch/x86/include/uapi/asm/processor-flags.h b/arch/x86/include/uapi/asm/processor-flags.h
index bcba3c643e63..191c574b2390 100644
--- a/arch/x86/include/uapi/asm/processor-flags.h
+++ b/arch/x86/include/uapi/asm/processor-flags.h
@@ -130,6 +130,8 @@
#define X86_CR4_SMAP _BITUL(X86_CR4_SMAP_BIT)
#define X86_CR4_PKE_BIT 22 /* enable Protection Keys support */
#define X86_CR4_PKE _BITUL(X86_CR4_PKE_BIT)
+#define X86_CR4_PKS_BIT 24 /* enable Protection Keys for Supervisor */
+#define X86_CR4_PKS _BITUL(X86_CR4_PKS_BIT)

/*
* x86-64 Task Priority Register, CR8
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 7b8382c11788..83c1abce7d93 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -59,6 +59,7 @@
#include <asm/cpu_device_id.h>
#include <asm/uv/uv.h>
#include <asm/sigframe.h>
+#include <asm/pks.h>

#include "cpu.h"

@@ -1632,6 +1633,7 @@ static void identify_cpu(struct cpuinfo_x86 *c)

x86_init_rdrand(c);
setup_pku(c);
+ pks_setup();

/*
* Clear/Set all flags overridden by options, need do it
diff --git a/arch/x86/mm/pkeys.c b/arch/x86/mm/pkeys.c
index 7c90b2188c5f..f904376570f4 100644
--- a/arch/x86/mm/pkeys.c
+++ b/arch/x86/mm/pkeys.c
@@ -6,6 +6,7 @@
#include <linux/debugfs.h> /* debugfs_create_u32() */
#include <linux/mm_types.h> /* mm_struct, vma, etc... */
#include <linux/pkeys.h> /* PKEY_* */
+#include <linux/pks-keys.h>
#include <uapi/asm-generic/mman-common.h>

#include <asm/cpufeature.h> /* boot_cpu_has, ... */
@@ -209,3 +210,19 @@ u32 pkey_update_pkval(u32 pkval, u8 pkey, u32 accessbits)
pkval &= ~(PKEY_ACCESS_MASK << shift);
return pkval | accessbits << shift;
}
+
+#ifdef CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS
+
+/*
+ * PKS is independent of PKU and either or both may be supported on a CPU.
+ */
+void pks_setup(void)
+{
+ if (!cpu_feature_enabled(X86_FEATURE_PKS))
+ return;
+
+ wrmsrl(MSR_IA32_PKRS, PKS_INIT_VALUE);
+ cr4_set_bits(X86_CR4_PKS);
+}
+
+#endif /* CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS */
--
2.35.1

2022-03-11 03:30:20

by Ira Weiny

Subject: [PATCH V9 24/45] entry: Split up irqentry_exit_cond_resched()

From: Ira Weiny <[email protected]>

Auxiliary pt_regs space needs to be manipulated by the generic
entry/exit code.

Normally irqentry_exit() would take care of handling any auxiliary
pt_regs on exit.  Unfortunately, the call to
irqentry_exit_cond_resched() from xen_pv_evtchn_do_upcall() bypasses
the normal irqentry_exit() call.  Because of this bypass,
irqentry_exit_cond_resched() will be required to handle any auxiliary
pt_regs exit handling.  However, this prevents irqentry_exit() from
calling irqentry_exit_cond_resched() while maintaining control of the
auxiliary pt_regs.

Separate out the common functionality of irqentry_exit_cond_resched() so
that functionality can be used by irqentry_exit(). Add a pt_regs
parameter in anticipation of having irqentry_exit_cond_resched() handle
the auxiliary pt_regs separately from irqentry_exit().

Signed-off-by: Ira Weiny <[email protected]>

---
Changes for V9
Update commit message

Changes for V8
New Patch
---
arch/x86/entry/common.c | 2 +-
include/linux/entry-common.h | 3 ++-
kernel/entry/common.c | 9 +++++++--
3 files changed, 10 insertions(+), 4 deletions(-)

diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index 6c2826417b33..f1ba770d035d 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -309,7 +309,7 @@ __visible noinstr void xen_pv_evtchn_do_upcall(struct pt_regs *regs)

inhcall = get_and_clear_inhcall();
if (inhcall && !WARN_ON_ONCE(state.exit_rcu)) {
- irqentry_exit_cond_resched();
+ irqentry_exit_cond_resched(regs);
instrumentation_end();
restore_inhcall(inhcall);
} else {
diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h
index ddaffc983e62..14fd329847e7 100644
--- a/include/linux/entry-common.h
+++ b/include/linux/entry-common.h
@@ -451,10 +451,11 @@ irqentry_state_t noinstr irqentry_enter(struct pt_regs *regs);

/**
* irqentry_exit_cond_resched - Conditionally reschedule on return from interrupt
+ * @regs: Pointer to pt_regs of interrupted context
*
* Conditional reschedule with additional sanity checks.
*/
-void irqentry_exit_cond_resched(void);
+void irqentry_exit_cond_resched(struct pt_regs *regs);

void __irqentry_exit_cond_resched(void);
#ifdef CONFIG_PREEMPT_DYNAMIC
diff --git a/kernel/entry/common.c b/kernel/entry/common.c
index 490442a48332..f4210a7fc84d 100644
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -395,7 +395,7 @@ void __irqentry_exit_cond_resched(void)
DEFINE_STATIC_CALL(__irqentry_exit_cond_resched, __irqentry_exit_cond_resched);
#endif

-void irqentry_exit_cond_resched(void)
+static void exit_cond_resched(void)
{
if (IS_ENABLED(CONFIG_PREEMPTION)) {
#ifdef CONFIG_PREEMPT_DYNAMIC
@@ -406,6 +406,11 @@ void irqentry_exit_cond_resched(void)
}
}

+void irqentry_exit_cond_resched(struct pt_regs *regs)
+{
+ exit_cond_resched();
+}
+
noinstr void irqentry_exit(struct pt_regs *regs, irqentry_state_t state)
{
lockdep_assert_irqs_disabled();
@@ -431,7 +436,7 @@ noinstr void irqentry_exit(struct pt_regs *regs, irqentry_state_t state)
}

instrumentation_begin();
- irqentry_exit_cond_resched();
+ exit_cond_resched();
/* Covers both tracing and lockdep */
trace_hardirqs_on();
instrumentation_end();
--
2.35.1

2022-03-11 04:35:03

by Ira Weiny

Subject: [PATCH V9 17/45] mm/pkeys: Introduce pks_set_readwrite()

From: Ira Weiny <[email protected]>

When kernel code needs access to a PKS-protected page, it will need to
change the protection for the pkey to Read/Write.

Define pks_set_readwrite() to update the specified pkey. Define
pks_update_protection() as a helper to do the heavy lifting and allow
for subsequent pks_set_*() calls.

Define PKEY_READ_WRITE rather than use a magic value of '0' in
pks_update_protection().

Finally, ensure preemption is disabled for pks_write_pkrs() because the
context of this call cannot generally be predicted.

pks.h is created to avoid conflicts and header dependencies with the
user space pkey code.

Add documentation.
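
A minimal sketch of the intended consumer usage (PKS_KEY_MY_FEATURE is a
hypothetical key; pks_set_noaccess() is introduced in a subsequent patch):

	pks_set_readwrite(PKS_KEY_MY_FEATURE);	/* open the domain for
						 * this thread only */
	memcpy(dst, src, len);			/* access pkey-mapped pages */
	pks_set_noaccess(PKS_KEY_MY_FEATURE);	/* close it again */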

Signed-off-by: Ira Weiny <[email protected]>

---
Changes for V9
Move MSR documentation note to this patch
Move declarations to include/linux/pks.h
From Rick Edgecombe
Change pkey type to u8
Validate pkey range in pks_update_protection()
From 0day
Fix documentation link
From Dave Hansen
s/pks_mk_*/pks_set_*/
Use pkey
s/pks_saved_pkrs/pkrs/

Changes for V8
Define PKEY_READ_WRITE
Make the call inline
Clean up the names
Use pks_write_pkrs() with preemption disabled
Split this out from 'add pks kernel api'
Include documentation in this patch
---
Documentation/core-api/protection-keys.rst | 15 +++++++++++
arch/x86/mm/pkeys.c | 31 ++++++++++++++++++++++
include/linux/pks.h | 31 ++++++++++++++++++++++
include/uapi/asm-generic/mman-common.h | 1 +
4 files changed, 78 insertions(+)
create mode 100644 include/linux/pks.h

diff --git a/Documentation/core-api/protection-keys.rst b/Documentation/core-api/protection-keys.rst
index 23330a7d53eb..e6564f5336b7 100644
--- a/Documentation/core-api/protection-keys.rst
+++ b/Documentation/core-api/protection-keys.rst
@@ -143,6 +143,21 @@ Adding pages to a pkey protected domain
.. kernel-doc:: arch/x86/include/asm/pgtable_types.h
:doc: PKS_KEY_ASSIGNMENT

+Changing permissions of individual keys
+---------------------------------------
+
+.. kernel-doc:: include/linux/pks.h
+ :identifiers: pks_set_readwrite
+
+MSR details
+~~~~~~~~~~~
+
+WRMSR is typically an architecturally serializing instruction. However,
+WRMSR(MSR_IA32_PKRS) is an exception. It is not a serializing instruction and
+instead maintains ordering properties similar to WRPKRU. Thus it is safe to
+immediately use a mapping when the pks_set*() functions return.  Check the
+latest SDM for details.
+
Testing
-------

diff --git a/arch/x86/mm/pkeys.c b/arch/x86/mm/pkeys.c
index 39e4c2cbc279..e4cbc79686ea 100644
--- a/arch/x86/mm/pkeys.c
+++ b/arch/x86/mm/pkeys.c
@@ -6,6 +6,7 @@
#include <linux/debugfs.h> /* debugfs_create_u32() */
#include <linux/mm_types.h> /* mm_struct, vma, etc... */
#include <linux/pkeys.h> /* PKEY_* */
+#include <linux/pks.h>
#include <linux/pks-keys.h>
#include <uapi/asm-generic/mman-common.h>

@@ -275,4 +276,34 @@ void pks_setup(void)
cr4_set_bits(X86_CR4_PKS);
}

+/*
+ * Do not call this directly, see pks_set*().
+ *
+ * @pkey: Key for the domain to change
+ * @protection: protection bits to be used
+ *
+ * Protection utilizes the same protection bits specified for User pkeys
+ * PKEY_DISABLE_ACCESS
+ * PKEY_DISABLE_WRITE
+ *
+ */
+void pks_update_protection(u8 pkey, u8 protection)
+{
+ u32 pkrs;
+
+ if (!cpu_feature_enabled(X86_FEATURE_PKS))
+ return;
+
+ if (WARN_ON_ONCE(pkey >= PKS_KEY_MAX))
+ return;
+
+ pkrs = current->thread.pkrs;
+ current->thread.pkrs = pkey_update_pkval(pkrs, pkey,
+ protection);
+ preempt_disable();
+ pks_write_pkrs(current->thread.pkrs);
+ preempt_enable();
+}
+EXPORT_SYMBOL_GPL(pks_update_protection);
+
#endif /* CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS */
diff --git a/include/linux/pks.h b/include/linux/pks.h
new file mode 100644
index 000000000000..8b705a937b19
--- /dev/null
+++ b/include/linux/pks.h
@@ -0,0 +1,31 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_PKS_H
+#define _LINUX_PKS_H
+
+#ifdef CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS
+
+#include <linux/types.h>
+
+#include <uapi/asm-generic/mman-common.h>
+
+void pks_update_protection(u8 pkey, u8 protection);
+
+/**
+ * pks_set_readwrite() - Make the domain Read/Write
+ * @pkey: the pkey for which the access should change.
+ *
+ * Allow all access, read and write, to the domain specified by pkey. This is
+ * not a global update and only affects the current running thread.
+ */
+static inline void pks_set_readwrite(u8 pkey)
+{
+ pks_update_protection(pkey, PKEY_READ_WRITE);
+}
+
+#else /* !CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS */
+
+static inline void pks_set_readwrite(u8 pkey) {}
+
+#endif /* CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS */
+
+#endif /* _LINUX_PKS_H */
diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
index 1567a3294c3d..3da6ac9e5ded 100644
--- a/include/uapi/asm-generic/mman-common.h
+++ b/include/uapi/asm-generic/mman-common.h
@@ -78,6 +78,7 @@
/* compatibility flags */
#define MAP_FILE 0

+#define PKEY_READ_WRITE 0x0
#define PKEY_DISABLE_ACCESS 0x1
#define PKEY_DISABLE_WRITE 0x2
#define PKEY_ACCESS_MASK (PKEY_DISABLE_ACCESS |\
--
2.35.1

2022-03-11 05:06:34

by Ira Weiny

Subject: [PATCH V9 05/45] x86/fpu: Refactor arch_set_user_pkey_access()

From: Ira Weiny <[email protected]>

Both PKU and PKS update their register values in the same way. They can
therefore share the update code.

Define a helper, pkey_update_pkval(), which will be used to support both
Protection Key User (PKU) and the new Protection Key for Supervisor
(PKS) in subsequent patches.

pkey_update_pkval() contributed by Thomas

Acked-by: Dave Hansen <[email protected]>
Co-developed-by: Thomas Gleixner <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Signed-off-by: Ira Weiny <[email protected]>

---
Update for V8:
From Rick Edgecombe
Change pkey type to u8
Replace the code Peter provided in update_pkey_reg() for
Thomas' pkey_update_pkval()
-- https://lore.kernel.org/lkml/[email protected]/
---
arch/x86/include/asm/pkeys.h | 2 ++
arch/x86/kernel/fpu/xstate.c | 22 ++++------------------
arch/x86/mm/pkeys.c | 16 ++++++++++++++++
3 files changed, 22 insertions(+), 18 deletions(-)

diff --git a/arch/x86/include/asm/pkeys.h b/arch/x86/include/asm/pkeys.h
index 1d5f14aff5f6..26616cbe19e2 100644
--- a/arch/x86/include/asm/pkeys.h
+++ b/arch/x86/include/asm/pkeys.h
@@ -131,4 +131,6 @@ static inline int vma_pkey(struct vm_area_struct *vma)
return (vma->vm_flags & vma_pkey_mask) >> VM_PKEY_SHIFT;
}

+u32 pkey_update_pkval(u32 pkval, u8 pkey, u32 accessbits);
+
#endif /*_ASM_X86_PKEYS_H */
diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index d090867c9de3..c8a8dadd9f87 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -1071,8 +1071,7 @@ void *get_xsave_addr(struct xregs_state *xsave, int xfeature_nr)
int arch_set_user_pkey_access(struct task_struct *tsk, int pkey,
unsigned long init_val)
{
- u32 old_pkru, new_pkru_bits = 0;
- int pkey_shift;
+ u32 pkru;

/*
* This check implies XSAVE support. OSPKE only gets
@@ -1089,22 +1088,9 @@ int arch_set_user_pkey_access(struct task_struct *tsk, int pkey,
if (WARN_ON_ONCE(pkey >= arch_max_pkey()))
return -EINVAL;

- /* Set the bits needed in PKRU: */
- if (init_val & PKEY_DISABLE_ACCESS)
- new_pkru_bits |= PKR_AD_BIT;
- if (init_val & PKEY_DISABLE_WRITE)
- new_pkru_bits |= PKR_WD_BIT;
-
- /* Shift the bits in to the correct place in PKRU for pkey: */
- pkey_shift = pkey * PKR_BITS_PER_PKEY;
- new_pkru_bits <<= pkey_shift;
-
- /* Get old PKRU and mask off any old bits in place: */
- old_pkru = read_pkru();
- old_pkru &= ~((PKR_AD_BIT|PKR_WD_BIT) << pkey_shift);
-
- /* Write old part along with new part: */
- write_pkru(old_pkru | new_pkru_bits);
+ pkru = read_pkru();
+ pkru = pkey_update_pkval(pkru, pkey, init_val);
+ write_pkru(pkru);

return 0;
}
diff --git a/arch/x86/mm/pkeys.c b/arch/x86/mm/pkeys.c
index e1527b4619e1..7c90b2188c5f 100644
--- a/arch/x86/mm/pkeys.c
+++ b/arch/x86/mm/pkeys.c
@@ -193,3 +193,19 @@ static __init int setup_init_pkru(char *opt)
return 1;
}
__setup("init_pkru=", setup_init_pkru);
+
+/*
+ * Kernel users use the same flags as user space:
+ * PKEY_DISABLE_ACCESS
+ * PKEY_DISABLE_WRITE
+ */
+u32 pkey_update_pkval(u32 pkval, u8 pkey, u32 accessbits)
+{
+ int shift = pkey * PKR_BITS_PER_PKEY;
+
+ if (WARN_ON_ONCE(accessbits & ~PKEY_ACCESS_MASK))
+ accessbits &= PKEY_ACCESS_MASK;
+
+ pkval &= ~(PKEY_ACCESS_MASK << shift);
+ return pkval | accessbits << shift;
+}
--
2.35.1

2022-03-11 05:18:27

by Ira Weiny

[permalink] [raw]
Subject: [PATCH V9 27/45] x86/pkeys: Preserve PKRS MSR across exceptions

From: Ira Weiny <[email protected]>

PKRS is a per-logical-processor MSR which overlays additional protection
for pages which have been mapped with a protection key. PKS pages must
remain protected while exception code executes, yet exception code must
still be able to access PKS pages through the proper pks_set_*() calls.

To do this, the current thread's value must be saved, the CPU MSR value set
to the default value during the exception, and the saved thread value
restored upon completion. This can be done with the new auxiliary
pt_regs space.
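
Conceptually the flow is (a sketch; the real hooks are the
arch_{save,restore}_aux_pt_regs() calls in the diff below):

	/* exception entry: preserve the interrupted thread's PKRS */
	aux_pt_regs->pkrs = current->thread.pkrs;
	pks_write_pkrs(PKS_INIT_VALUE);

	/* ... exception handler runs, may call pks_set_*() ... */

	/* exception exit: restore the saved value */
	current->thread.pkrs = aux_pt_regs->pkrs;
	pks_write_pkrs(current->thread.pkrs);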

When PKS is configured, configure auxiliary pt_regs, add space to
pt_regs_auxiliary, and define save/restore functions.

Update the PKS test code to maintain functionality by clearing the test
pkey's protection bits in the saved PKRS value before returning.

Peter, Thomas, Andy, Dave, and Dan all suggested parts of the patch or
aided in the development of the patch.

[1] https://lore.kernel.org/lkml/CALCETrVe1i5JdyzD_BcctxQJn+ZE3T38EFPgjxN1F577M36g+w@mail.gmail.com/
[2] https://lore.kernel.org/lkml/[email protected]/#t
[3] https://lore.kernel.org/lkml/CALCETrUHwZPic89oExMMe-WyDY8-O3W68NcZvse3=PGW+iW5=w@mail.gmail.com/

Cc: Dave Hansen <[email protected]>
Cc: Dan Williams <[email protected]>
Suggested-by: Dave Hansen <[email protected]>
Suggested-by: Dan Williams <[email protected]>
Suggested-by: Peter Zijlstra <[email protected]>
Suggested-by: Thomas Gleixner <[email protected]>
Suggested-by: Andy Lutomirski <[email protected]>
Signed-off-by: Ira Weiny <[email protected]>

---
Changes for V9:
Update commit message
s/pks_thread_pkrs/pkrs/
From Dave Hansen
s/pks_saved_pkrs/pkrs/

Changes for V8:
Tie this into the new generic auxiliary pt_regs support.
Build this on the new irqentry_*() refactoring patches
Split this patch off from the PKS portion of the auxiliary
pt_regs functionality.
From Thomas
Fix noinstr mess
s/write_pkrs/pks_write_pkrs
s/pkrs_init_value/PKRS_INIT_VALUE
Simplify the number and location of the save/restore calls.
Cover entry from user space as well.

Changes for V7:
Rebased to 5.14 entry code
declare write_pkrs() in pks.h
s/INIT_PKRS_VALUE/pkrs_init_value
Remove unnecessary INIT_PKRS_VALUE def
s/pkrs_save_set_irq/pkrs_save_irq/
The initial value for exceptions is best managed
completely within the pkey code.
---
arch/x86/Kconfig | 3 ++-
arch/x86/include/asm/entry-common.h | 3 +++
arch/x86/include/asm/pks.h | 4 ++++
arch/x86/include/asm/ptrace.h | 3 +++
arch/x86/mm/pkeys.c | 32 +++++++++++++++++++++++++++++
lib/pks/pks_test.c | 9 +++++++-
6 files changed, 52 insertions(+), 2 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 64348c94477e..f13fd7a73535 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1879,8 +1879,9 @@ config X86_INTEL_MEMORY_PROTECTION_KEYS
If unsure, say y.

config ARCH_HAS_PTREGS_AUXILIARY
+ def_bool y
depends on X86_64
- bool
+ depends on ARCH_ENABLE_SUPERVISOR_PKEYS

choice
prompt "TSX enable mode"
diff --git a/arch/x86/include/asm/entry-common.h b/arch/x86/include/asm/entry-common.h
index 5fa5dd2d539c..803727b95b3a 100644
--- a/arch/x86/include/asm/entry-common.h
+++ b/arch/x86/include/asm/entry-common.h
@@ -8,6 +8,7 @@
#include <asm/nospec-branch.h>
#include <asm/io_bitmap.h>
#include <asm/fpu/api.h>
+#include <asm/pks.h>

/* Check that the stack and regs on entry from user mode are sane. */
static __always_inline void arch_check_user_regs(struct pt_regs *regs)
@@ -99,10 +100,12 @@ static __always_inline void arch_exit_to_user_mode(void)

static inline void arch_save_aux_pt_regs(struct pt_regs *regs)
{
+ pks_save_pt_regs(regs);
}

static inline void arch_restore_aux_pt_regs(struct pt_regs *regs)
{
+ pks_restore_pt_regs(regs);
}

#endif
diff --git a/arch/x86/include/asm/pks.h b/arch/x86/include/asm/pks.h
index e9ad3ecd7ed0..b69e03a141fe 100644
--- a/arch/x86/include/asm/pks.h
+++ b/arch/x86/include/asm/pks.h
@@ -6,6 +6,8 @@

void pks_setup(void);
void x86_pkrs_load(struct thread_struct *thread);
+void pks_save_pt_regs(struct pt_regs *regs);
+void pks_restore_pt_regs(struct pt_regs *regs);

bool pks_handle_key_fault(struct pt_regs *regs, unsigned long hw_error_code,
unsigned long address);
@@ -14,6 +16,8 @@ bool pks_handle_key_fault(struct pt_regs *regs, unsigned long hw_error_code,

static inline void pks_setup(void) { }
static inline void x86_pkrs_load(struct thread_struct *thread) { }
+static inline void pks_save_pt_regs(struct pt_regs *regs) { }
+static inline void pks_restore_pt_regs(struct pt_regs *regs) { }

static inline bool pks_handle_key_fault(struct pt_regs *regs,
unsigned long hw_error_code,
diff --git a/arch/x86/include/asm/ptrace.h b/arch/x86/include/asm/ptrace.h
index 5e7f6e48c0ab..a3b00ad0d69b 100644
--- a/arch/x86/include/asm/ptrace.h
+++ b/arch/x86/include/asm/ptrace.h
@@ -97,6 +97,9 @@ struct pt_regs {
* ARCH_HAS_PTREGS_AUXILIARY. Failure to do so will result in a build failure.
*/
struct pt_regs_auxiliary {
+#ifdef CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS
+ u32 pkrs;
+#endif
};

struct pt_regs_extended {
diff --git a/arch/x86/mm/pkeys.c b/arch/x86/mm/pkeys.c
index 39867d39460b..29885dfb0980 100644
--- a/arch/x86/mm/pkeys.c
+++ b/arch/x86/mm/pkeys.c
@@ -346,6 +346,38 @@ void x86_pkrs_load(struct thread_struct *thread)
pks_write_pkrs(thread->pkrs);
}

+/*
+ * PKRS is a per-logical-processor MSR which overlays additional protection for
+ * pages which have been mapped with a protection key.
+ *
+ * To protect against exceptions having potentially privileged access to memory
+ * of an interrupted thread, save the current thread value and set the PKRS
+ * value to be used during the exception.
+ */
+void pks_save_pt_regs(struct pt_regs *regs)
+{
+ struct pt_regs_auxiliary *aux_pt_regs;
+
+ if (!cpu_feature_enabled(X86_FEATURE_PKS))
+ return;
+
+ aux_pt_regs = &to_extended_pt_regs(regs)->aux;
+ aux_pt_regs->pkrs = current->thread.pkrs;
+ pks_write_pkrs(PKS_INIT_VALUE);
+}
+
+void pks_restore_pt_regs(struct pt_regs *regs)
+{
+ struct pt_regs_auxiliary *aux_pt_regs;
+
+ if (!cpu_feature_enabled(X86_FEATURE_PKS))
+ return;
+
+ aux_pt_regs = &to_extended_pt_regs(regs)->aux;
+ current->thread.pkrs = aux_pt_regs->pkrs;
+ pks_write_pkrs(current->thread.pkrs);
+}
+
/*
* PKS is independent of PKU and either or both may be supported on a CPU.
*
diff --git a/lib/pks/pks_test.c b/lib/pks/pks_test.c
index 16aa44cf498a..86af2f61393d 100644
--- a/lib/pks/pks_test.c
+++ b/lib/pks/pks_test.c
@@ -34,11 +34,14 @@
#include <linux/slab.h>
#include <linux/vmalloc.h>
#include <linux/pgtable.h>
+#include <linux/pkeys.h>
#include <linux/pks.h>
#include <linux/pks-keys.h>

#include <uapi/asm-generic/mman-common.h>

+#include <asm/ptrace.h>
+
#define PKS_TEST_MEM_SIZE (PAGE_SIZE)

#define CHECK_DEFAULTS 0
@@ -110,12 +113,16 @@ static void set_context_for_fault(struct pks_test_ctx *ctx)
bool pks_test_fault_callback(struct pt_regs *regs, unsigned long address,
bool write)
{
+ struct pt_regs_extended *ept_regs = to_extended_pt_regs(regs);
+ struct pt_regs_auxiliary *aux_pt_regs = &ept_regs->aux;
+ u32 pkrs = aux_pt_regs->pkrs;
+
pr_debug("PKS Fault callback: ctx %p\n", g_ctx_under_test);

if (!g_ctx_under_test)
return false;

- pks_set_readwrite(g_ctx_under_test->pkey);
+ aux_pt_regs->pkrs = pkey_update_pkval(pkrs, g_ctx_under_test->pkey, 0);
g_ctx_under_test->fault_seen = true;
return true;
}
--
2.35.1

2022-03-11 05:54:45

by Ira Weiny

[permalink] [raw]
Subject: [PATCH V9 30/45] mm/pkeys: Introduce pks_update_exception()

From: Ira Weiny <[email protected]>

Some PKS use cases will want to catch permissions violations with the
fault callback mechanism and optionally allow the access.

The pks_set_*() calls update the protection of the currently running
context. They will not work to change the protections of a thread which
has been interrupted. Therefore, updating a thread from within an
exception requires a different method.

Introduce pks_update_exception() which updates the faulted thread's
protections in addition to the current context.
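
For example, a hypothetical consumer's fault callback could log the
violation and then open the pkey for both the exception and the
interrupted thread (PKS_KEY_MY_FEATURE is a made-up key used only for
illustration):

	static bool my_feature_fault_callback(struct pt_regs *regs,
					      unsigned long address,
					      bool write)
	{
		pr_warn("my_feature: stray %s at 0x%lx; relaxing protections\n",
			write ? "write" : "access", address);
		pks_update_exception(regs, PKS_KEY_MY_FEATURE,
				     PKEY_READ_WRITE);
		return true;
	}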

Add documentation.

Signed-off-by: Ira Weiny <[email protected]>

---
Changes for V9
Add preemption disable around pkrs per cpu cache
Update commit message
Change pkey type to u8
s/pks_saved_pkrs/pkrs

Changes for V8
Remove the concept of abandoning a pkey in favor of using the
custom fault handler via this new pks_update_exception()
call
Without an abandon call there is no need for an abandon mask on
sched in, new thread creation, or within exceptions...
This now lets all invalid accesses fault
Ensure that all entry points into the PKS code have feature checks...
Place abandon fault check before the test callback to ensure
testing does not detect the double fault of the abandon
code and flag it incorrectly as a fault.
Change return type of pks_handle_abandoned_pkeys() to bool
---
Documentation/core-api/protection-keys.rst | 3 ++
arch/x86/mm/pkeys.c | 58 +++++++++++++++++++---
include/linux/pks.h | 5 ++
3 files changed, 58 insertions(+), 8 deletions(-)

diff --git a/Documentation/core-api/protection-keys.rst b/Documentation/core-api/protection-keys.rst
index 5fdc83a39d4e..22ad58a93423 100644
--- a/Documentation/core-api/protection-keys.rst
+++ b/Documentation/core-api/protection-keys.rst
@@ -149,6 +149,9 @@ Changing permissions of individual keys
.. kernel-doc:: include/linux/pks.h
:identifiers: pks_set_readwrite pks_set_noaccess

+.. kernel-doc:: arch/x86/mm/pkeys.c
+ :identifiers: pks_update_exception
+
Overriding Default Fault Behavior
---------------------------------

diff --git a/arch/x86/mm/pkeys.c b/arch/x86/mm/pkeys.c
index 6327e32d7237..9b2a6a62d433 100644
--- a/arch/x86/mm/pkeys.c
+++ b/arch/x86/mm/pkeys.c
@@ -409,6 +409,18 @@ void pks_setup(void)
cr4_set_bits(X86_CR4_PKS);
}

+static void __pks_update_protection(u8 pkey, u8 protection)
+{
+ u32 pkrs;
+
+ pkrs = current->thread.pkrs;
+ current->thread.pkrs = pkey_update_pkval(pkrs, pkey, protection);
+
+ preempt_disable();
+ pks_write_pkrs(current->thread.pkrs);
+ preempt_enable();
+}
+
/*
* Do not call this directly, see pks_set*().
*
@@ -422,21 +434,51 @@ void pks_setup(void)
*/
void pks_update_protection(u8 pkey, u8 protection)
{
- u32 pkrs;
-
if (!cpu_feature_enabled(X86_FEATURE_PKS))
return;

if (WARN_ON_ONCE(pkey >= PKS_KEY_MAX))
return;

- pkrs = current->thread.pkrs;
- current->thread.pkrs = pkey_update_pkval(pkrs, pkey,
- protection);
- preempt_disable();
- pks_write_pkrs(current->thread.pkrs);
- preempt_enable();
+ __pks_update_protection(pkey, protection);
}
EXPORT_SYMBOL_GPL(pks_update_protection);

+/**
+ * pks_update_exception() - Update the protections of a faulted thread
+ *
+ * @regs: Faulting thread registers
+ * @pkey: pkey to update
+ * @protection: protection bits to use.
+ *
+ * CONTEXT: Exception
+ *
+ * pks_update_exception() updates the faulted thread's protections in addition
+ * to the protections within the exception.
+ *
+ * This is useful because the pks_set_*() functions will not work to change the
+ * protections of a thread which has been interrupted. Only the current
+ * context is updated by those functions. Therefore, if a PKS fault callback
+ * wants to update the faulted thread's protections, it must call
+ * pks_update_exception().
+ */
+void pks_update_exception(struct pt_regs *regs, u8 pkey, u8 protection)
+{
+ struct pt_regs_extended *ept_regs;
+ u32 old;
+
+ if (!cpu_feature_enabled(X86_FEATURE_PKS))
+ return;
+
+ if (WARN_ON_ONCE(pkey >= PKS_KEY_MAX))
+ return;
+
+ __pks_update_protection(pkey, protection);
+
+ ept_regs = to_extended_pt_regs(regs);
+ old = ept_regs->aux.pkrs;
+ ept_regs->aux.pkrs = pkey_update_pkval(old, pkey, protection);
+}
+EXPORT_SYMBOL_GPL(pks_update_exception);
+
#endif /* CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS */
diff --git a/include/linux/pks.h b/include/linux/pks.h
index 224fc3bbd072..45156f358776 100644
--- a/include/linux/pks.h
+++ b/include/linux/pks.h
@@ -9,6 +9,7 @@
#include <uapi/asm-generic/mman-common.h>

void pks_update_protection(u8 pkey, u8 protection);
+void pks_update_exception(struct pt_regs *regs, u8 pkey, u8 protection);

/**
* pks_set_noaccess() - Disable all access to the domain
@@ -41,6 +42,10 @@ typedef bool (*pks_key_callback)(struct pt_regs *regs, unsigned long address,

static inline void pks_set_noaccess(u8 pkey) {}
static inline void pks_set_readwrite(u8 pkey) {}
+static inline void pks_update_exception(struct pt_regs *regs,
+ u8 pkey,
+ u8 protection)
+{ }

#endif /* CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS */

--
2.35.1

2022-03-11 05:59:31

by Ira Weiny

[permalink] [raw]
Subject: [PATCH V9 31/45] mm/pkeys: PKS testing, test pks_update_exception()

From: Ira Weiny <[email protected]>

A common use case for the custom fault callbacks will be to warn of the
violation and relax the permissions rather than crash the kernel.
pks_update_exception() was added for this purpose.

Add a test which uses pks_update_exception() to clear the pkey
permissions. Verify that the permissions are changed in the interrupted
thread.

Signed-off-by: Ira Weiny <[email protected]>

---
Changes for V9
Update the commit message
Clean up test name
Add test_pks support
s/pks_mk_*/pks_set_*/
Simplify the use of globals for the faults
From Rick Edgecombe
Use WRITE_ONCE to protect against races with the fault
handler
s/RUN_FAULT_ABANDON/RUN_FAULT_CALLBACK

Changes for V8
New test developed just to double check for regressions while
reworking the code.
---
lib/pks/pks_test.c | 60 ++++++++++++++++++++++++++
tools/testing/selftests/x86/test_pks.c | 5 ++-
2 files changed, 64 insertions(+), 1 deletion(-)

diff --git a/lib/pks/pks_test.c b/lib/pks/pks_test.c
index 762f4a19cb7d..a9cd2a49abfa 100644
--- a/lib/pks/pks_test.c
+++ b/lib/pks/pks_test.c
@@ -49,6 +49,7 @@
#define ARM_CTX_SWITCH 2
#define CHECK_CTX_SWITCH 3
#define RUN_EXCEPTION 4
+#define RUN_EXCEPTION_UPDATE 5
#define RUN_CRASH_TEST 9

DECLARE_PER_CPU(u32, pkrs_cache);
@@ -64,6 +65,7 @@ struct pks_test_ctx {
void *test_page;
bool fault_seen;
bool validate_exp_handling;
+ bool validate_update_exp;
};

static bool check_pkey_val(u32 pk_reg, u8 pkey, u32 expected)
@@ -164,6 +166,16 @@ static void validate_exception(struct pks_test_ctx *ctx, u32 thread_pkrs)
}
}

+static bool handle_update_exception(struct pt_regs *regs, struct pks_test_ctx *ctx)
+{
+ pr_debug("Updating pkey %d during exception\n", ctx->pkey);
+
+ ctx->fault_seen = true;
+ pks_update_exception(regs, ctx->pkey, 0);
+
+ return true;
+}
+
/* Global data protected by test_run_lock */
struct pks_test_ctx *g_ctx_under_test;

@@ -190,6 +202,9 @@ bool pks_test_fault_callback(struct pt_regs *regs, unsigned long address,
if (!g_ctx_under_test)
return false;

+ if (g_ctx_under_test->validate_update_exp)
+ return handle_update_exception(regs, g_ctx_under_test);
+
if (g_ctx_under_test->validate_exp_handling) {
validate_exception(g_ctx_under_test, pkrs);
/*
@@ -518,6 +533,47 @@ static void check_ctx_switch(struct pks_session_data *sd)
}
}

+static bool run_exception_update(struct pks_session_data *sd)
+{
+ struct pks_test_ctx *ctx;
+
+ ctx = alloc_ctx(PKS_KEY_TEST);
+ if (IS_ERR(ctx))
+ return false;
+
+ set_ctx_data(sd, ctx);
+
+ ctx->fault_seen = false;
+ ctx->validate_update_exp = true;
+ pks_set_noaccess(ctx->pkey);
+
+ set_context_for_fault(ctx);
+
+ /* fault */
+ memcpy(ctx->test_page, ctx->data, 8);
+
+ if (!ctx->fault_seen) {
+ pr_err("Failed to see the callback\n");
+ return false;
+ }
+
+ ctx->fault_seen = false;
+ ctx->validate_update_exp = false;
+
+ set_context_for_fault(ctx);
+
+ /* no fault */
+ memcpy(ctx->test_page, ctx->data, 8);
+
+ if (ctx->fault_seen) {
+ pr_err("Pkey %d failed to be set RD/WR in the callback\n",
+ ctx->pkey);
+ return false;
+ }
+
+ return true;
+}
+
static ssize_t pks_read_file(struct file *file, char __user *user_buf,
size_t count, loff_t *ppos)
{
@@ -584,6 +640,10 @@ static ssize_t pks_write_file(struct file *file, const char __user *user_buf,
pr_debug("Exception checking\n");
sd->last_test_pass = run_exception_test(file->private_data);
break;
+ case RUN_EXCEPTION_UPDATE:
+ pr_debug("Fault clear test\n");
+ sd->last_test_pass = run_exception_update(file->private_data);
+ break;
default:
pr_debug("Unknown test\n");
sd->last_test_pass = false;
diff --git a/tools/testing/selftests/x86/test_pks.c b/tools/testing/selftests/x86/test_pks.c
index 817df7a14923..243347e48228 100644
--- a/tools/testing/selftests/x86/test_pks.c
+++ b/tools/testing/selftests/x86/test_pks.c
@@ -36,6 +36,7 @@
#define ARM_CTX_SWITCH "2"
#define CHECK_CTX_SWITCH "3"
#define RUN_EXCEPTION "4"
+#define RUN_EXCEPTION_UPDATE "5"
#define RUN_CRASH_TEST "9"

time_t g_start_time;
@@ -63,6 +64,7 @@ enum {
TEST_SINGLE,
TEST_CTX_SWITCH,
TEST_EXCEPTION,
+ TEST_FAULT_CALLBACK,
MAX_TESTS,
} tests;

@@ -77,7 +79,8 @@ struct test_item {
{ "check_defaults", CHECK_DEFAULTS, do_simple_test },
{ "single", RUN_SINGLE, do_simple_test },
{ "context_switch", ARM_CTX_SWITCH, do_context_switch },
- { "exception", RUN_EXCEPTION, do_simple_test }
+ { "exception", RUN_EXCEPTION, do_simple_test },
+ { "exception_update", RUN_EXCEPTION_UPDATE, do_simple_test }
};

static char *get_test_name(int test_num)
--
2.35.1

2022-03-11 10:16:42

by Ira Weiny

[permalink] [raw]
Subject: [PATCH V9 06/45] mm/pkeys: Add Kconfig options for PKS

From: Ira Weiny <[email protected]>

Consumers wishing to implement additional protections on memory pages
can use PKS. However, PKS is only available on some architectures.

For this reason, PKS code, both in the core and in the consumers, is
dead code unless PKS is both available and used.

Add Kconfig options to allow for the elimination of unneeded code by
detecting architecture PKS support (ARCH_HAS_SUPERVISOR_PKEYS) and
requiring an indication of consumer need (ARCH_ENABLE_SUPERVISOR_PKEYS).

In this patch ARCH_ENABLE_SUPERVISOR_PKEYS remains off until the first
kernel consumer sets it.
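
For reference, a consumer is expected to wire itself up roughly as
follows (CONFIG_MY_PKS_CONSUMER is hypothetical; the pattern matches the
PKS_TEST Kconfig entry added later in the series):

	config MY_PKS_CONSUMER
		bool "Hypothetical PKS consumer"
		depends on ARCH_HAS_SUPERVISOR_PKEYS
		select ARCH_ENABLE_SUPERVISOR_PKEYS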

Cc: "Moger, Babu" <[email protected]>
Signed-off-by: Ira Weiny <[email protected]>

---
Changes for V9
Dave Hansen
Don't exclude AMD, cpu supported bits will properly turn
the feature off.
Clarify commit message
Depend on CPU_SUP_INTEL

Changes for V8
Split this out to a single change patch
---
arch/x86/Kconfig | 1 +
mm/Kconfig | 4 ++++
2 files changed, 5 insertions(+)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 9f5bd41bf660..459948622a73 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1868,6 +1868,7 @@ config X86_INTEL_MEMORY_PROTECTION_KEYS
depends on X86_64 && (CPU_SUP_INTEL || CPU_SUP_AMD)
select ARCH_USES_HIGH_VMA_FLAGS
select ARCH_HAS_PKEYS
+ select ARCH_HAS_SUPERVISOR_PKEYS
help
Memory Protection Keys provides a mechanism for enforcing
page-based protections, but without requiring modification of the
diff --git a/mm/Kconfig b/mm/Kconfig
index 3326ee3903f3..46f2bb15aa4e 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -804,6 +804,10 @@ config ARCH_USES_HIGH_VMA_FLAGS
bool
config ARCH_HAS_PKEYS
bool
+config ARCH_HAS_SUPERVISOR_PKEYS
+ bool
+config ARCH_ENABLE_SUPERVISOR_PKEYS
+ bool

config PERCPU_STATS
bool "Collect percpu memory statistics"
--
2.35.1

2022-03-11 10:23:35

by Ira Weiny

[permalink] [raw]
Subject: [PATCH V9 18/45] mm/pkeys: Introduce pks_set_noaccess()

From: Ira Weiny <[email protected]>

After a valid access, consumers will want to change PKS protections back
to No Access for their pkey.

Define pks_set_noaccess() to update the specified pkey.
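
The expected consumer pattern then becomes (a sketch; PKS_KEY_MY_FEATURE
is a placeholder key):

	pks_set_readwrite(PKS_KEY_MY_FEATURE);	/* open the domain */
	memcpy(protected_page, src, len);	/* perform the valid access */
	pks_set_noaccess(PKS_KEY_MY_FEATURE);	/* close it again */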

Add documentation.

Signed-off-by: Ira Weiny <[email protected]>

---
Changes for V9
Move to pks.h
Change pkey type to u8
From 0day
Fix documentation link
From Dave Hansen
use pkey
s/pks_mk*/pks_set*/

Changes for V8
Make the call inline
Split this patch out from 'Add PKS kernel API'
Include documentation in this patch
---
Documentation/core-api/protection-keys.rst | 2 +-
include/linux/pks.h | 13 +++++++++++++
2 files changed, 14 insertions(+), 1 deletion(-)

diff --git a/Documentation/core-api/protection-keys.rst b/Documentation/core-api/protection-keys.rst
index e6564f5336b7..2ec35349ecfd 100644
--- a/Documentation/core-api/protection-keys.rst
+++ b/Documentation/core-api/protection-keys.rst
@@ -147,7 +147,7 @@ Changing permissions of individual keys
---------------------------------------

.. kernel-doc:: include/linux/pks.h
- :identifiers: pks_set_readwrite
+ :identifiers: pks_set_readwrite pks_set_noaccess

MSR details
~~~~~~~~~~~
diff --git a/include/linux/pks.h b/include/linux/pks.h
index 8b705a937b19..9f18f8b4cbb1 100644
--- a/include/linux/pks.h
+++ b/include/linux/pks.h
@@ -10,6 +10,18 @@

void pks_update_protection(u8 pkey, u8 protection);

+/**
+ * pks_set_noaccess() - Disable all access to the domain
+ * @pkey: the pkey for which the access should change.
+ *
+ * Disable all access to the domain specified by pkey. This is not a global
+ * update and only affects the current running thread.
+ */
+static inline void pks_set_noaccess(u8 pkey)
+{
+ pks_update_protection(pkey, PKEY_DISABLE_ACCESS);
+}
+
/**
* pks_set_readwrite() - Make the domain Read/Write
* @pkey: the pkey for which the access should change.
@@ -24,6 +36,7 @@ static inline void pks_set_readwrite(u8 pkey)

#else /* !CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS */

+static inline void pks_set_noaccess(u8 pkey) {}
static inline void pks_set_readwrite(u8 pkey) {}

#endif /* CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS */
--
2.35.1

2022-03-11 11:43:43

by Ira Weiny

[permalink] [raw]
Subject: [PATCH V9 02/45] Documentation/protection-keys: Clean up documentation for User Space pkeys

From: Ira Weiny <[email protected]>

The documentation for user space pkeys was a bit dated, including
things such as Amazon EC2 and distribution testing information which are
irrelevant now.

Update the documentation. This also streamlines adding the Supervisor
pkey documentation later on.

Cc: "Moger, Babu" <[email protected]>
Signed-off-by: Ira Weiny <[email protected]>

---
Changes for V9:
use pkey
Change information on which CPU's have PKU
---
Documentation/core-api/protection-keys.rst | 44 +++++++++++-----------
1 file changed, 21 insertions(+), 23 deletions(-)

diff --git a/Documentation/core-api/protection-keys.rst b/Documentation/core-api/protection-keys.rst
index ec575e72d0b2..bf28ac0401f3 100644
--- a/Documentation/core-api/protection-keys.rst
+++ b/Documentation/core-api/protection-keys.rst
@@ -4,31 +4,29 @@
Memory Protection Keys
======================

-Memory Protection Keys for Userspace (PKU aka PKEYs) is a feature
-which is found on Intel's Skylake (and later) "Scalable Processor"
-Server CPUs. It will be available in future non-server Intel parts
-and future AMD processors.
-
-For anyone wishing to test or use this feature, it is available in
-Amazon's EC2 C5 instances and is known to work there using an Ubuntu
-17.04 image.
-
-Memory Protection Keys provides a mechanism for enforcing page-based
-protections, but without requiring modification of the page tables
-when an application changes protection domains. It works by
-dedicating 4 previously ignored bits in each page table entry to a
-"protection key", giving 16 possible keys.
-
-There is also a new user-accessible register (PKRU) with two separate
-bits (Access Disable and Write Disable) for each key. Being a CPU
-register, PKRU is inherently thread-local, potentially giving each
+Memory Protection Keys provide a mechanism for enforcing page-based
+protections, but without requiring modification of the page tables when an
+application changes protection domains.
+
+Pkeys Userspace (PKU) is a feature which can be found on:
+ * Intel server CPUs, Skylake and later
+ * Intel client CPUs, Tiger Lake (11th Gen Core) and later
+ * Future AMD CPUs
+
+Pkeys work by dedicating 4 previously Reserved bits in each page table entry to
+a "protection key", giving 16 possible keys.
+
+Protections for each key are defined with a per-CPU user-accessible register
+(PKRU). Each of these is a 32-bit register storing two bits (Access Disable
+and Write Disable) for each of 16 keys.
+
+Being a CPU register, PKRU is inherently thread-local, potentially giving each
thread a different set of protections from every other thread.

-There are two new instructions (RDPKRU/WRPKRU) for reading and writing
-to the new register. The feature is only available in 64-bit mode,
-even though there is theoretically space in the PAE PTEs. These
-permissions are enforced on data access only and have no effect on
-instruction fetches.
+There are two instructions (RDPKRU/WRPKRU) for reading and writing to the
+register. The feature is only available in 64-bit mode, even though there is
+theoretically space in the PAE PTEs. These permissions are enforced on data
+access only and have no effect on instruction fetches.

Syscalls
========
--
2.35.1

2022-03-11 15:09:40

by Ira Weiny

[permalink] [raw]
Subject: [PATCH V9 28/45] x86/fault: Print PKS MSR on fault

From: Ira Weiny <[email protected]>

If a PKS fault occurs it will be easier to debug if the PKS MSR value at
the time of the fault is known.

Add pks_show_regs() to __show_regs() to show the PKRS MSR on fault if
enabled.

An 'executive summary' of the pt_regs is saved in __die_header(), which
ensures that the first fault's registers are preserved in the event of
multiple faults. Teach this code about the extended pt_regs such that
the PKS code can get to the original pkrs value as well.

Suggested-by: Andy Lutomirski <[email protected]>
Suggested-by: Dave Hansen <[email protected]>
Signed-off-by: Ira Weiny <[email protected]>

---
Changes for V9
From Dave Hansen
Move this output to __show_regs() next to the PKRU
register dump

Changes for V8
Split this into its own patch.
---
arch/x86/include/asm/pks.h | 3 +++
arch/x86/kernel/dumpstack.c | 32 ++++++++++++++++++++++++++++++--
arch/x86/kernel/process_64.c | 1 +
arch/x86/mm/pkeys.c | 11 +++++++++++
4 files changed, 45 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/pks.h b/arch/x86/include/asm/pks.h
index b69e03a141fe..de67d5b5a2af 100644
--- a/arch/x86/include/asm/pks.h
+++ b/arch/x86/include/asm/pks.h
@@ -8,6 +8,7 @@ void pks_setup(void);
void x86_pkrs_load(struct thread_struct *thread);
void pks_save_pt_regs(struct pt_regs *regs);
void pks_restore_pt_regs(struct pt_regs *regs);
+void pks_show_regs(struct pt_regs *regs, const char *log_lvl);

bool pks_handle_key_fault(struct pt_regs *regs, unsigned long hw_error_code,
unsigned long address);
@@ -18,6 +19,8 @@ static inline void pks_setup(void) { }
static inline void x86_pkrs_load(struct thread_struct *thread) { }
static inline void pks_save_pt_regs(struct pt_regs *regs) { }
static inline void pks_restore_pt_regs(struct pt_regs *regs) { }
+static inline void pks_show_regs(struct pt_regs *regs,
+ const char *log_lvl) { }

static inline bool pks_handle_key_fault(struct pt_regs *regs,
unsigned long hw_error_code,
diff --git a/arch/x86/kernel/dumpstack.c b/arch/x86/kernel/dumpstack.c
index 53de044e5654..38be69d15431 100644
--- a/arch/x86/kernel/dumpstack.c
+++ b/arch/x86/kernel/dumpstack.c
@@ -27,8 +27,36 @@ int panic_on_unrecovered_nmi;
int panic_on_io_nmi;
static int die_counter;

+#ifdef CONFIG_ARCH_HAS_PTREGS_AUXILIARY
+
+static struct pt_regs_extended exec_summary_regs;
+
+static void save_exec_summary(struct pt_regs *regs)
+{
+ exec_summary_regs = *(to_extended_pt_regs(regs));
+}
+
+static struct pt_regs *retrieve_exec_summary(void)
+{
+ return &exec_summary_regs.pt_regs;
+}
+
+#else /* !CONFIG_ARCH_HAS_PTREGS_AUXILIARY */
+
static struct pt_regs exec_summary_regs;

+static void save_exec_summary(struct pt_regs *regs)
+{
+ exec_summary_regs = *regs;
+}
+
+static struct pt_regs *retrieve_exec_summary(void)
+{
+ return &exec_summary_regs;
+}
+
+#endif /* CONFIG_ARCH_HAS_PTREGS_AUXILIARY */
+
bool noinstr in_task_stack(unsigned long *stack, struct task_struct *task,
struct stack_info *info)
{
@@ -369,7 +397,7 @@ void oops_end(unsigned long flags, struct pt_regs *regs, int signr)
oops_exit();

/* Executive summary in case the oops scrolled away */
- __show_regs(&exec_summary_regs, SHOW_REGS_ALL, KERN_DEFAULT);
+ __show_regs(retrieve_exec_summary(), SHOW_REGS_ALL, KERN_DEFAULT);

if (!signr)
return;
@@ -396,7 +424,7 @@ static void __die_header(const char *str, struct pt_regs *regs, long err)

/* Save the regs of the first oops for the executive summary later. */
if (!die_counter)
- exec_summary_regs = *regs;
+ save_exec_summary(regs);

if (IS_ENABLED(CONFIG_PREEMPTION))
pr = IS_ENABLED(CONFIG_PREEMPT_RT) ? " PREEMPT_RT" : " PREEMPT";
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index e703cc451128..68d998ea3571 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -140,6 +140,7 @@ void __show_regs(struct pt_regs *regs, enum show_regs_mode mode,

if (cpu_feature_enabled(X86_FEATURE_OSPKE))
printk("%sPKRU: %08x\n", log_lvl, read_pkru());
+ pks_show_regs(regs, log_lvl);
}

void release_thread(struct task_struct *dead_task)
diff --git a/arch/x86/mm/pkeys.c b/arch/x86/mm/pkeys.c
index 29885dfb0980..7c8e4ea9f022 100644
--- a/arch/x86/mm/pkeys.c
+++ b/arch/x86/mm/pkeys.c
@@ -378,6 +378,17 @@ void pks_restore_pt_regs(struct pt_regs *regs)
pks_write_pkrs(current->thread.pkrs);
}

+void pks_show_regs(struct pt_regs *regs, const char *log_lvl)
+{
+ struct pt_regs_auxiliary *aux_pt_regs;
+
+ if (!cpu_feature_enabled(X86_FEATURE_PKS))
+ return;
+
+ aux_pt_regs = &to_extended_pt_regs(regs)->aux;
+ printk("%sPKRS: 0x%x\n", log_lvl, aux_pt_regs->pkrs);
+}
+
/*
* PKS is independent of PKU and either or both may be supported on a CPU.
*
--
2.35.1

2022-03-11 16:50:05

by Ira Weiny

[permalink] [raw]
Subject: [PATCH V9 01/45] entry: Create an internal irqentry_exit_cond_resched() call

From: Ira Weiny <[email protected]>

The static call to irqentry_exit_cond_resched() was not properly being
overridden when called from xen_pv_evtchn_do_upcall().

Define __irqentry_exit_cond_resched() as the static call and place the
override logic in irqentry_exit_cond_resched().

Cc: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: Ira Weiny <[email protected]>

---
Changes for V9
Update the commit message a bit

Because this was found via code inspection and does not actually fix
any observed bug, I've not added a Fixes tag.

But for reference:
Fixes: 40607ee97e4e ("preempt/dynamic: Provide irqentry_exit_cond_resched() static call")
---
include/linux/entry-common.h | 5 ++++-
kernel/entry/common.c | 23 +++++++++++++--------
kernel/sched/core.c | 40 ++++++++++++++++++------------------
3 files changed, 38 insertions(+), 30 deletions(-)

diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h
index 2e2b8d6140ed..ddaffc983e62 100644
--- a/include/linux/entry-common.h
+++ b/include/linux/entry-common.h
@@ -455,10 +455,13 @@ irqentry_state_t noinstr irqentry_enter(struct pt_regs *regs);
* Conditional reschedule with additional sanity checks.
*/
void irqentry_exit_cond_resched(void);
+
+void __irqentry_exit_cond_resched(void);
#ifdef CONFIG_PREEMPT_DYNAMIC
-DECLARE_STATIC_CALL(irqentry_exit_cond_resched, irqentry_exit_cond_resched);
+DECLARE_STATIC_CALL(__irqentry_exit_cond_resched, __irqentry_exit_cond_resched);
#endif

+
/**
* irqentry_exit - Handle return from exception that used irqentry_enter()
* @regs: Pointer to pt_regs (exception entry regs)
diff --git a/kernel/entry/common.c b/kernel/entry/common.c
index bad713684c2e..490442a48332 100644
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -380,7 +380,7 @@ noinstr irqentry_state_t irqentry_enter(struct pt_regs *regs)
return ret;
}

-void irqentry_exit_cond_resched(void)
+void __irqentry_exit_cond_resched(void)
{
if (!preempt_count()) {
/* Sanity check RCU and thread stack */
@@ -392,9 +392,20 @@ void irqentry_exit_cond_resched(void)
}
}
#ifdef CONFIG_PREEMPT_DYNAMIC
-DEFINE_STATIC_CALL(irqentry_exit_cond_resched, irqentry_exit_cond_resched);
+DEFINE_STATIC_CALL(__irqentry_exit_cond_resched, __irqentry_exit_cond_resched);
#endif

+void irqentry_exit_cond_resched(void)
+{
+ if (IS_ENABLED(CONFIG_PREEMPTION)) {
+#ifdef CONFIG_PREEMPT_DYNAMIC
+ static_call(__irqentry_exit_cond_resched)();
+#else
+ __irqentry_exit_cond_resched();
+#endif
+ }
+}
+
noinstr void irqentry_exit(struct pt_regs *regs, irqentry_state_t state)
{
lockdep_assert_irqs_disabled();
@@ -420,13 +431,7 @@ noinstr void irqentry_exit(struct pt_regs *regs, irqentry_state_t state)
}

instrumentation_begin();
- if (IS_ENABLED(CONFIG_PREEMPTION)) {
-#ifdef CONFIG_PREEMPT_DYNAMIC
- static_call(irqentry_exit_cond_resched)();
-#else
- irqentry_exit_cond_resched();
-#endif
- }
+ irqentry_exit_cond_resched();
/* Covers both tracing and lockdep */
trace_hardirqs_on();
instrumentation_end();
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 9745613d531c..f56db4bd9730 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6571,29 +6571,29 @@ EXPORT_STATIC_CALL_TRAMP(preempt_schedule_notrace);
* SC:might_resched
* SC:preempt_schedule
* SC:preempt_schedule_notrace
- * SC:irqentry_exit_cond_resched
+ * SC:__irqentry_exit_cond_resched
*
*
* NONE:
- * cond_resched <- __cond_resched
- * might_resched <- RET0
- * preempt_schedule <- NOP
- * preempt_schedule_notrace <- NOP
- * irqentry_exit_cond_resched <- NOP
+ * cond_resched <- __cond_resched
+ * might_resched <- RET0
+ * preempt_schedule <- NOP
+ * preempt_schedule_notrace <- NOP
+ * __irqentry_exit_cond_resched <- NOP
*
* VOLUNTARY:
- * cond_resched <- __cond_resched
- * might_resched <- __cond_resched
- * preempt_schedule <- NOP
- * preempt_schedule_notrace <- NOP
- * irqentry_exit_cond_resched <- NOP
+ * cond_resched <- __cond_resched
+ * might_resched <- __cond_resched
+ * preempt_schedule <- NOP
+ * preempt_schedule_notrace <- NOP
+ * __irqentry_exit_cond_resched <- NOP
*
* FULL:
- * cond_resched <- RET0
- * might_resched <- RET0
- * preempt_schedule <- preempt_schedule
- * preempt_schedule_notrace <- preempt_schedule_notrace
- * irqentry_exit_cond_resched <- irqentry_exit_cond_resched
+ * cond_resched <- RET0
+ * might_resched <- RET0
+ * preempt_schedule <- preempt_schedule
+ * preempt_schedule_notrace <- preempt_schedule_notrace
+ * __irqentry_exit_cond_resched <- __irqentry_exit_cond_resched
*/

enum {
@@ -6629,7 +6629,7 @@ void sched_dynamic_update(int mode)
static_call_update(might_resched, __cond_resched);
static_call_update(preempt_schedule, __preempt_schedule_func);
static_call_update(preempt_schedule_notrace, __preempt_schedule_notrace_func);
- static_call_update(irqentry_exit_cond_resched, irqentry_exit_cond_resched);
+ static_call_update(__irqentry_exit_cond_resched, __irqentry_exit_cond_resched);

switch (mode) {
case preempt_dynamic_none:
@@ -6637,7 +6637,7 @@ void sched_dynamic_update(int mode)
static_call_update(might_resched, (void *)&__static_call_return0);
static_call_update(preempt_schedule, NULL);
static_call_update(preempt_schedule_notrace, NULL);
- static_call_update(irqentry_exit_cond_resched, NULL);
+ static_call_update(__irqentry_exit_cond_resched, NULL);
pr_info("Dynamic Preempt: none\n");
break;

@@ -6646,7 +6646,7 @@ void sched_dynamic_update(int mode)
static_call_update(might_resched, __cond_resched);
static_call_update(preempt_schedule, NULL);
static_call_update(preempt_schedule_notrace, NULL);
- static_call_update(irqentry_exit_cond_resched, NULL);
+ static_call_update(__irqentry_exit_cond_resched, NULL);
pr_info("Dynamic Preempt: voluntary\n");
break;

@@ -6655,7 +6655,7 @@ void sched_dynamic_update(int mode)
static_call_update(might_resched, (void *)&__static_call_return0);
static_call_update(preempt_schedule, __preempt_schedule_func);
static_call_update(preempt_schedule_notrace, __preempt_schedule_notrace_func);
- static_call_update(irqentry_exit_cond_resched, irqentry_exit_cond_resched);
+ static_call_update(__irqentry_exit_cond_resched, __irqentry_exit_cond_resched);
pr_info("Dynamic Preempt: full\n");
break;
}
--
2.35.1

2022-03-11 20:23:17

by Ira Weiny

[permalink] [raw]
Subject: [PATCH V9 13/45] mm/pkeys: PKS testing, add initial test code

From: Ira Weiny <[email protected]>

Define a PKS consumer for testing.

Two initial tests are created: one to check that the default values
have been properly assigned, and a second which purposely causes a fault.

Add documentation.

Signed-off-by: Ira Weiny <[email protected]>

---
Changes for V9
Simplify the commit message
Simplify documentation in favor of using test_pks
Complete re-arch of test code...
Return -ENOENT for unknown tests
Adjust the key allocation
Reduce the globals used during fault detection
Introduce a session structure to track information as long as the
debugfs file remains open.
Use pr_debug() for internal debug output.
Document how to run tests from debugfs with trace_printk()
output.
Feedback from Rick Edgecombe
Change pkey type to u8
remove pks_test_exit
set file data within the crash test to be cleaned up on
file close
Resolve when memory barriers are needed
From Dave Hansen
Place a lock around the execution of tests so that only
a single thread execute at a time.

Changes for V8
Ensure that unknown tests are flagged as failures.
Split out the various tests into their own patches which test
the functionality as the series goes.
Move this basic test forward in the series

Changes for V7
Add testing for pks_abandon_protections()
Adjust pkrs_init_value
Adjust for new defines
Clean up comments
Adjust test for static allocation of pkeys
Use lookup_address() instead of follow_pte()
follow_pte only works on IO and raw PFN mappings, use
lookup_address() instead. lookup_address() is
constrained to architectures which support it.
---
Documentation/core-api/protection-keys.rst | 6 +
include/linux/pks-keys.h | 8 +-
lib/Kconfig.debug | 12 +
lib/Makefile | 3 +
lib/pks/Makefile | 3 +
lib/pks/pks_test.c | 301 +++++++++++++++++++++
6 files changed, 331 insertions(+), 2 deletions(-)
create mode 100644 lib/pks/Makefile
create mode 100644 lib/pks/pks_test.c

diff --git a/Documentation/core-api/protection-keys.rst b/Documentation/core-api/protection-keys.rst
index fe63acf5abbe..4d99ca41c914 100644
--- a/Documentation/core-api/protection-keys.rst
+++ b/Documentation/core-api/protection-keys.rst
@@ -142,3 +142,9 @@ Adding pages to a pkey protected domain

.. kernel-doc:: arch/x86/include/asm/pgtable_types.h
:doc: PKS_KEY_ASSIGNMENT
+
+Testing
+-------
+
+.. kernel-doc:: lib/pks/pks_test.c
+ :doc: PKS_TEST
diff --git a/include/linux/pks-keys.h b/include/linux/pks-keys.h
index c914afecb2d3..43e4ae42db2e 100644
--- a/include/linux/pks-keys.h
+++ b/include/linux/pks-keys.h
@@ -60,17 +60,21 @@

/* PKS_KEY_DEFAULT must be 0 */
#define PKS_KEY_DEFAULT 0
-#define PKS_KEY_MAX PKS_NEW_KEY(PKS_KEY_DEFAULT, 1)
+#define PKS_KEY_TEST PKS_NEW_KEY(PKS_KEY_DEFAULT, CONFIG_PKS_TEST)
+#define PKS_KEY_MAX PKS_NEW_KEY(PKS_KEY_TEST, 1)

/* PKS_KEY_DEFAULT_INIT must be RW */
#define PKS_KEY_DEFAULT_INIT PKS_DECLARE_INIT_VALUE(PKS_KEY_DEFAULT, RW, 1)
+#define PKS_KEY_TEST_INIT PKS_DECLARE_INIT_VALUE(PKS_KEY_TEST, AD, \
+ CONFIG_PKS_TEST)

#define PKS_ALL_AD_MASK \
GENMASK(PKS_NUM_PKEYS * PKR_BITS_PER_PKEY, \
PKS_KEY_MAX * PKR_BITS_PER_PKEY)

#define PKS_INIT_VALUE ((PKS_ALL_AD & PKS_ALL_AD_MASK) | \
- PKS_KEY_DEFAULT_INIT \
+ PKS_KEY_DEFAULT_INIT | \
+ PKS_KEY_TEST_INIT \
)

#endif /* CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS */
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index 14b89aa37c5c..5cab2100c133 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -2685,6 +2685,18 @@ config HYPERV_TESTING
help
Select this option to enable Hyper-V vmbus testing.

+config PKS_TEST
+ bool "PKey (S)upervisor testing"
+ depends on ARCH_HAS_SUPERVISOR_PKEYS
+ select ARCH_ENABLE_SUPERVISOR_PKEYS
+ help
+ Select this option to enable testing of PKS core software and
+ hardware.
+
+ Answer N if you don't know what supervisor keys are.
+
+ If unsure, say N.
+
endmenu # "Kernel Testing and Coverage"

source "Documentation/Kconfig"
diff --git a/lib/Makefile b/lib/Makefile
index 300f569c626b..038a93c89714 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -398,3 +398,6 @@ $(obj)/$(TEST_FORTIFY_LOG): $(addprefix $(obj)/, $(TEST_FORTIFY_LOGS)) FORCE
ifeq ($(CONFIG_FORTIFY_SOURCE),y)
$(obj)/string.o: $(obj)/$(TEST_FORTIFY_LOG)
endif
+
+# PKS test
+obj-y += pks/
diff --git a/lib/pks/Makefile b/lib/pks/Makefile
new file mode 100644
index 000000000000..9daccba4f7c4
--- /dev/null
+++ b/lib/pks/Makefile
@@ -0,0 +1,3 @@
+# SPDX-License-Identifier: GPL-2.0
+
+obj-$(CONFIG_PKS_TEST) += pks_test.o
diff --git a/lib/pks/pks_test.c b/lib/pks/pks_test.c
new file mode 100644
index 000000000000..2fc92aaa54e8
--- /dev/null
+++ b/lib/pks/pks_test.c
@@ -0,0 +1,301 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright(c) 2022 Intel Corporation. All rights reserved.
+ */
+
+/**
+ * DOC: PKS_TEST
+ *
+ * When CONFIG_PKS_TEST is enabled a debugfs file is created to facilitate in
+ * kernel testing. Tests can be triggered by writing a test number to
+ * /sys/kernel/debug/x86/run_pks
+ *
+ * Results and debug output can be seen through dynamic debug.
+ *
+ * Example:
+ *
+ * .. code-block:: sh
+ *
+ * # Enable kernel debug
+ * echo "file pks_test.c +pflm" > /sys/kernel/debug/dynamic_debug/control
+ *
+ * # Run test
+ * echo 0 > /sys/kernel/debug/x86/run_pks
+ *
+ * # Turn off kernel debug
+ * echo "file pks_test.c -p" > /sys/kernel/debug/dynamic_debug/control
+ *
+ * # view kernel debugging output
+ * dmesg -H | grep pks_test
+ */
+
+#include <linux/debugfs.h>
+#include <linux/module.h>
+#include <linux/slab.h>
+#include <linux/vmalloc.h>
+#include <linux/pks-keys.h>
+
+#define PKS_TEST_MEM_SIZE (PAGE_SIZE)
+
+#define CHECK_DEFAULTS 0
+#define RUN_CRASH_TEST 9
+
+static struct dentry *pks_test_dentry;
+
+DEFINE_MUTEX(test_run_lock);
+
+struct pks_test_ctx {
+ u8 pkey;
+ char data[64];
+ void *test_page;
+};
+
+static void debug_context(const char *label, struct pks_test_ctx *ctx)
+{
+ pr_debug("%s [%d] %s <-> %p\n",
+ label,
+ ctx->pkey,
+ ctx->data,
+ ctx->test_page);
+}
+
+struct pks_session_data {
+ struct pks_test_ctx *ctx;
+ bool need_unlock;
+ bool crash_armed;
+ bool last_test_pass;
+};
+
+static void debug_session(const char *label, struct pks_session_data *sd)
+{
+ pr_debug("%s ctx %p; unlock %d; crash %d; last test %s\n",
+ label,
+ sd->ctx,
+ sd->need_unlock,
+ sd->crash_armed,
+ sd->last_test_pass ? "PASS" : "FAIL");
+
+}
+
+static void debug_result(const char *label, int test_num,
+ struct pks_session_data *sd)
+{
+ pr_debug("%s [%d]: %s\n",
+ label, test_num,
+ sd->last_test_pass ? "PASS" : "FAIL");
+}
+
+static void *alloc_test_page(u8 pkey)
+{
+ return __vmalloc_node_range(PKS_TEST_MEM_SIZE, 1, VMALLOC_START,
+ VMALLOC_END, GFP_KERNEL,
+ PAGE_KERNEL_PKEY(pkey), 0,
+ NUMA_NO_NODE, __builtin_return_address(0));
+}
+
+static void free_ctx(struct pks_test_ctx *ctx)
+{
+ if (!ctx)
+ return;
+
+ vfree(ctx->test_page);
+ kfree(ctx);
+}
+
+static struct pks_test_ctx *alloc_ctx(u8 pkey)
+{
+ struct pks_test_ctx *ctx = kzalloc(sizeof(*ctx), GFP_KERNEL);
+
+ if (!ctx)
+ return ERR_PTR(-ENOMEM);
+
+ ctx->pkey = pkey;
+ sprintf(ctx->data, "%s", "DEADBEEF");
+
+ ctx->test_page = alloc_test_page(ctx->pkey);
+ if (!ctx->test_page) {
+ pr_debug("Test page allocation failed\n");
+ kfree(ctx);
+ return ERR_PTR(-ENOMEM);
+ }
+
+ debug_context("Context allocated", ctx);
+ return ctx;
+}
+
+static void set_ctx_data(struct pks_session_data *sd, struct pks_test_ctx *ctx)
+{
+ if (sd->ctx) {
+ pr_debug("Context data already set\n");
+ free_ctx(sd->ctx);
+ }
+ pr_debug("Setting context data; %p\n", ctx);
+ sd->ctx = ctx;
+}
+
+static void crash_it(struct pks_session_data *sd)
+{
+ struct pks_test_ctx *ctx;
+
+ ctx = alloc_ctx(PKS_KEY_TEST);
+ if (IS_ERR(ctx)) {
+ pr_err("Failed to allocate context???\n");
+ sd->last_test_pass = false;
+ return;
+ }
+ set_ctx_data(sd, ctx);
+
+ pr_debug("Purposely faulting...\n");
+ memcpy(ctx->test_page, ctx->data, 8);
+
+ pr_err("ERROR: Should never get here...\n");
+ sd->last_test_pass = false;
+}
+
+static void check_pkey_settings(void *data)
+{
+ struct pks_session_data *sd = data;
+ unsigned long long msr = 0;
+ unsigned int cpu = smp_processor_id();
+
+ rdmsrl(MSR_IA32_PKRS, msr);
+ pr_debug("cpu %d 0x%llx\n", cpu, msr);
+ if (msr != PKS_INIT_VALUE) {
+ pr_err("cpu %d value incorrect : 0x%llx expected 0x%lx\n",
+ cpu, msr, PKS_INIT_VALUE);
+ sd->last_test_pass = false;
+ }
+}
+
+static void arm_or_run_crash_test(struct pks_session_data *sd)
+{
+
+ /*
+ * WARNING: Test "9" will crash.
+ * Arm the test.
+ * A second "9" will run the test.
+ */
+ if (!sd->crash_armed) {
+ pr_debug("Arming crash test\n");
+ sd->crash_armed = true;
+ return;
+ }
+
+ sd->crash_armed = false;
+ crash_it(sd);
+}
+
+static ssize_t pks_read_file(struct file *file, char __user *user_buf,
+ size_t count, loff_t *ppos)
+{
+ struct pks_session_data *sd = file->private_data;
+ char buf[64];
+ unsigned int len;
+
+ len = sprintf(buf, "%s\n", sd->last_test_pass ? "PASS" : "FAIL");
+
+ return simple_read_from_buffer(user_buf, count, ppos, buf, len);
+}
+
+static ssize_t pks_write_file(struct file *file, const char __user *user_buf,
+ size_t count, loff_t *ppos)
+{
+ struct pks_session_data *sd = file->private_data;
+ long test_num;
+ char buf[2];
+
+ pr_debug("Begin...\n");
+ sd->last_test_pass = false;
+
+ if (copy_from_user(buf, user_buf, 1))
+ return -EFAULT;
+ buf[1] = '\0';
+
+ if (kstrtol(buf, 0, &test_num))
+ return -EINVAL;
+
+ if (mutex_lock_interruptible(&test_run_lock))
+ return -EBUSY;
+
+ sd->need_unlock = true;
+ sd->last_test_pass = true;
+
+ switch (test_num) {
+ case RUN_CRASH_TEST:
+ pr_debug("crash test\n");
+ arm_or_run_crash_test(file->private_data);
+ goto unlock_test;
+ case CHECK_DEFAULTS:
+ pr_debug("check defaults test: 0x%lx\n", PKS_INIT_VALUE);
+ on_each_cpu(check_pkey_settings, file->private_data, 1);
+ break;
+ default:
+ pr_debug("Unknown test\n");
+ sd->last_test_pass = false;
+ count = -ENOENT;
+ break;
+ }
+
+ /* Clear arming on any test run */
+ pr_debug("Clearing crash test arm\n");
+ sd->crash_armed = false;
+
+unlock_test:
+ /*
+ * Normal exit; clear up the locking flag
+ */
+ sd->need_unlock = false;
+ mutex_unlock(&test_run_lock);
+ debug_result("Test complete", test_num, sd);
+ return count;
+}
+
+static int pks_open_file(struct inode *inode, struct file *file)
+{
+ struct pks_session_data *sd = kzalloc(sizeof(*sd), GFP_KERNEL);
+
+ if (!sd)
+ return -ENOMEM;
+
+ debug_session("Allocated session", sd);
+ file->private_data = sd;
+
+ return 0;
+}
+
+static int pks_release_file(struct inode *inode, struct file *file)
+{
+ struct pks_session_data *sd = file->private_data;
+
+ debug_session("Freeing session", sd);
+
+ /*
+ * Some tests may fault and not return through the normal write
+ * syscall. The crash test is specifically designed to do this. Clean
+ * up the run lock when the file is closed if the write syscall does
+ * not exit normally.
+ */
+ if (sd->need_unlock)
+ mutex_unlock(&test_run_lock);
+ free_ctx(sd->ctx);
+ kfree(sd);
+ return 0;
+}
+
+static const struct file_operations fops_init_pks = {
+ .read = pks_read_file,
+ .write = pks_write_file,
+ .llseek = default_llseek,
+ .open = pks_open_file,
+ .release = pks_release_file,
+};
+
+static int __init pks_test_init(void)
+{
+ if (cpu_feature_enabled(X86_FEATURE_PKS))
+ pks_test_dentry = debugfs_create_file("run_pks", 0600, arch_debugfs_dir,
+ NULL, &fops_init_pks);
+
+ return 0;
+}
+late_initcall(pks_test_init);
--
2.35.1

2022-03-11 20:42:25

by Ira Weiny

[permalink] [raw]
Subject: [PATCH V9 44/45] nvdimm/pmem: Enable stray access protection

From: Ira Weiny <[email protected]>

The persistent memory (PMEM) driver uses the memremap_pages facility to
provide 'struct page' metadata (vmemmap) for PMEM. Given that PMEM
capacity may be orders of magnitude higher than that of System RAM, it
presents a large vulnerability surface to stray writes. Unlike stray
writes to System RAM, which may result in a crash or other undesirable
behavior, stray writes to PMEM are more likely to result in permanent
data loss. Reboot is not a remediation for PMEM corruption like it is
for System RAM.

Now that all valid kernel accesses to PMEM have been annotated with
{__}pgmap_set_{readwrite,noaccess}(), PGMAP_PROTECTION is safe to enable
in the pmem layer.

Set PGMAP_PROTECTION if pgmap protections are available and set the
pgmap property of the dax device for its use.

Internally, the pmem driver uses a cached virtual address,
pmem->virt_addr (pmem_addr). Call __pgmap_set_{readwrite,noaccess}()
directly when PGMAP_PROTECTION is active on those mappings.

Signed-off-by: Ira Weiny <[email protected]>

---
Changes for V9
Remove the dax operations and pass the pgmap to the dax_device
for its use.
s/pgmap_mk_*/pgmap_set_*/
s/pmem_mk_*/pmem_set_*/

Changes for V8
Rebase to 5.17-rc1
Remove global param
Add internal structure which uses the pmem device and pgmap
device directly in the *_mk_*() calls.
Add pmem dax ops callbacks
Use pgmap_protection_available()
s/PGMAP_PKEY_PROTECT/PGMAP_PROTECTION
---
drivers/nvdimm/pmem.c | 26 ++++++++++++++++++++++++++
1 file changed, 26 insertions(+)

diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 58d95242a836..2c7b18da7974 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -138,6 +138,18 @@ static blk_status_t read_pmem(struct page *page, unsigned int off,
return BLK_STS_OK;
}

+static void pmem_set_readwrite(struct pmem_device *pmem)
+{
+ if (pmem->pgmap.flags & PGMAP_PROTECTION)
+ __pgmap_set_readwrite(&pmem->pgmap);
+}
+
+static void pmem_set_noaccess(struct pmem_device *pmem)
+{
+ if (pmem->pgmap.flags & PGMAP_PROTECTION)
+ __pgmap_set_noaccess(&pmem->pgmap);
+}
+
static blk_status_t pmem_do_read(struct pmem_device *pmem,
struct page *page, unsigned int page_off,
sector_t sector, unsigned int len)
@@ -149,7 +161,11 @@ static blk_status_t pmem_do_read(struct pmem_device *pmem,
if (unlikely(is_bad_pmem(&pmem->bb, sector, len)))
return BLK_STS_IOERR;

+ /* Enable direct use of pmem->virt_addr */
+ pmem_set_readwrite(pmem);
rc = read_pmem(page, page_off, pmem_addr, len);
+ pmem_set_noaccess(pmem);
+
flush_dcache_page(page);
return rc;
}
@@ -181,11 +197,15 @@ static blk_status_t pmem_do_write(struct pmem_device *pmem,
* after clear poison.
*/
flush_dcache_page(page);
+
+ /* Enable direct use of pmem->virt_addr */
+ pmem_set_readwrite(pmem);
write_pmem(pmem_addr, page, page_off, len);
if (unlikely(bad_pmem)) {
rc = pmem_clear_poison(pmem, pmem_off, len);
write_pmem(pmem_addr, page, page_off, len);
}
+ pmem_set_noaccess(pmem);

return rc;
}
@@ -427,6 +447,8 @@ static int pmem_attach_disk(struct device *dev,
pmem->pfn_flags = PFN_DEV;
if (is_nd_pfn(dev)) {
pmem->pgmap.type = MEMORY_DEVICE_FS_DAX;
+ if (pgmap_protection_available())
+ pmem->pgmap.flags |= PGMAP_PROTECTION;
addr = devm_memremap_pages(dev, &pmem->pgmap);
pfn_sb = nd_pfn->pfn_sb;
pmem->data_offset = le64_to_cpu(pfn_sb->dataoff);
@@ -440,6 +462,8 @@ static int pmem_attach_disk(struct device *dev,
pmem->pgmap.range.end = res->end;
pmem->pgmap.nr_range = 1;
pmem->pgmap.type = MEMORY_DEVICE_FS_DAX;
+ if (pgmap_protection_available())
+ pmem->pgmap.flags |= PGMAP_PROTECTION;
addr = devm_memremap_pages(dev, &pmem->pgmap);
pmem->pfn_flags |= PFN_MAP;
bb_range = pmem->pgmap.range;
@@ -481,6 +505,8 @@ static int pmem_attach_disk(struct device *dev,
}
set_dax_nocache(dax_dev);
set_dax_nomc(dax_dev);
+ if (pmem->pgmap.flags & PGMAP_PROTECTION)
+ set_dax_pgmap(dax_dev, &pmem->pgmap);
if (is_nvdimm_sync(nd_region))
set_dax_synchronous(dax_dev);
rc = dax_add_host(dax_dev, disk);
--
2.35.1

2022-03-11 21:04:24

by Ira Weiny

[permalink] [raw]
Subject: [PATCH V9 19/45] mm/pkeys: Introduce PKS fault callbacks

From: Rick Edgecombe <[email protected]>

Some PKS consumers will want special handling on violations of pkey
permissions. One such consumer is PMEM, which will want a mode
that logs the access violation, disables protection, and continues
rather than oops'ing the machine.

Provide an API to assign callbacks for individual pkeys.

Since PKS faults do not provide the key that faulted, this information
needs to be recovered by walking the page tables and extracting it from
the leaf entry. The key can then be used to call the proper callback.

Add documentation.

Co-developed-by: Ira Weiny <[email protected]>
Signed-off-by: Ira Weiny <[email protected]>
Signed-off-by: Rick Edgecombe <[email protected]>

---
Changes for V9:
Rework commit message
Adjust for the use of linux/pks.h
From the new key allocation: s/PKS_NR_CONSUMERS/PKS_KEY_MAX
From Dave Hansen
use pkey
Fix conflicts with other users in the test code by
moving this forward in the series

Changes for V8:
Add pt_regs to the callback signature so that
pks_update_exception() can be called if needed.
Update commit message
Determine if page is large prior to not present
Update commit message with more clarity as to why this was kept
separate from pks_abandon_protections() and
pks_test_callback()
Embed documentation in c file.
Move handle_pks_key_fault() to pkeys.c
s/handle_pks_key_fault/pks_handle_key_fault/
This consolidates the PKS code nicely
Add feature check to pks_handle_key_fault()
From Rick Edgecombe
Fix key value check
From kernel test robot
Add static to handle_pks_key_fault

Changes for V7:
New patch
---
Documentation/core-api/protection-keys.rst | 6 ++
arch/x86/include/asm/pks.h | 10 +++
arch/x86/mm/fault.c | 17 +++--
arch/x86/mm/pkeys.c | 86 ++++++++++++++++++++++
include/linux/pks.h | 3 +
5 files changed, 116 insertions(+), 6 deletions(-)

diff --git a/Documentation/core-api/protection-keys.rst b/Documentation/core-api/protection-keys.rst
index 2ec35349ecfd..5fdc83a39d4e 100644
--- a/Documentation/core-api/protection-keys.rst
+++ b/Documentation/core-api/protection-keys.rst
@@ -149,6 +149,12 @@ Changing permissions of individual keys
.. kernel-doc:: include/linux/pks.h
:identifiers: pks_set_readwrite pks_set_noaccess

+Overriding Default Fault Behavior
+---------------------------------
+
+.. kernel-doc:: arch/x86/mm/pkeys.c
+ :doc: DEFINE_PKS_FAULT_CALLBACK
+
MSR details
~~~~~~~~~~~

diff --git a/arch/x86/include/asm/pks.h b/arch/x86/include/asm/pks.h
index a7bad7301783..e9ad3ecd7ed0 100644
--- a/arch/x86/include/asm/pks.h
+++ b/arch/x86/include/asm/pks.h
@@ -7,11 +7,21 @@
void pks_setup(void);
void x86_pkrs_load(struct thread_struct *thread);

+bool pks_handle_key_fault(struct pt_regs *regs, unsigned long hw_error_code,
+ unsigned long address);
+
#else /* !CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS */

static inline void pks_setup(void) { }
static inline void x86_pkrs_load(struct thread_struct *thread) { }

+static inline bool pks_handle_key_fault(struct pt_regs *regs,
+ unsigned long hw_error_code,
+ unsigned long address)
+{
+ return false;
+}
+
#endif /* CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS */

#endif /* _ASM_X86_PKS_H */
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 5599109d1124..e8934df1b886 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -33,6 +33,7 @@
#include <asm/kvm_para.h> /* kvm_handle_async_pf */
#include <asm/vdso.h> /* fixup_vdso_exception() */
#include <asm/irq_stack.h>
+#include <asm/pks.h> /* pks_handle_key_fault() */

#define CREATE_TRACE_POINTS
#include <asm/trace/exceptions.h>
@@ -1147,12 +1148,16 @@ static void
do_kern_addr_fault(struct pt_regs *regs, unsigned long hw_error_code,
unsigned long address)
{
- /*
- * PF_PF faults should only occur on kernel
- * addresses when supervisor pkeys are enabled.
- */
- WARN_ON_ONCE(!cpu_feature_enabled(X86_FEATURE_PKS) &&
- (hw_error_code & X86_PF_PK));
+ if (hw_error_code & X86_PF_PK) {
+ /*
+ * PF_PF faults should only occur on kernel
+ * addresses when supervisor pkeys are enabled.
+ */
+ WARN_ON_ONCE(!cpu_feature_enabled(X86_FEATURE_PKS));
+
+ if (pks_handle_key_fault(regs, hw_error_code, address))
+ return;
+ }

#ifdef CONFIG_X86_32
/*
diff --git a/arch/x86/mm/pkeys.c b/arch/x86/mm/pkeys.c
index e4cbc79686ea..a3b27b7811da 100644
--- a/arch/x86/mm/pkeys.c
+++ b/arch/x86/mm/pkeys.c
@@ -12,6 +12,7 @@

#include <asm/cpufeature.h> /* boot_cpu_has, ... */
#include <asm/mmu_context.h> /* vma_pkey() */
+#include <asm/trap_pf.h> /* X86_PF_WRITE */

int __execute_only_pkey(struct mm_struct *mm)
{
@@ -216,6 +217,91 @@ u32 pkey_update_pkval(u32 pkval, u8 pkey, u32 accessbits)

static DEFINE_PER_CPU(u32, pkrs_cache);

+/**
+ * DOC: DEFINE_PKS_FAULT_CALLBACK
+ *
+ * Users may also provide a fault handler which can handle a fault differently
+ * than an oops. For example if 'MY_FEATURE' wanted to define a handler they
+ * can do so by adding the coresponding entry to the pks_key_callbacks array.
+ *
+ * .. code-block:: c
+ *
+ * #ifdef CONFIG_MY_FEATURE
+ * bool my_feature_pks_fault_callback(struct pt_regs *regs,
+ * unsigned long address, bool write)
+ * {
+ * if (my_feature_fault_is_ok)
+ * return true;
+ * return false;
+ * }
+ * #endif
+ *
+ * static const pks_key_callback pks_key_callbacks[PKS_KEY_MAX] = {
+ * [PKS_KEY_DEFAULT] = NULL,
+ * #ifdef CONFIG_MY_FEATURE
+ * [PKS_KEY_MY_FEATURE] = my_feature_pks_fault_callback,
+ * #endif
+ * };
+ */
+static const pks_key_callback pks_key_callbacks[PKS_KEY_MAX] = { 0 };
+
+static bool pks_call_fault_callback(struct pt_regs *regs, unsigned long address,
+ bool write, u16 key)
+{
+ if (key >= PKS_KEY_MAX)
+ return false;
+
+ if (pks_key_callbacks[key])
+ return pks_key_callbacks[key](regs, address, write);
+
+ return false;
+}
+
+bool pks_handle_key_fault(struct pt_regs *regs, unsigned long hw_error_code,
+ unsigned long address)
+{
+ bool write;
+ pgd_t pgd;
+ p4d_t p4d;
+ pud_t pud;
+ pmd_t pmd;
+ pte_t pte;
+
+ if (!cpu_feature_enabled(X86_FEATURE_PKS))
+ return false;
+
+ write = (hw_error_code & X86_PF_WRITE);
+
+ pgd = READ_ONCE(*(init_mm.pgd + pgd_index(address)));
+ if (!pgd_present(pgd))
+ return false;
+
+ p4d = READ_ONCE(*p4d_offset(&pgd, address));
+ if (p4d_large(p4d))
+ return pks_call_fault_callback(regs, address, write,
+ pte_flags_pkey(p4d_val(p4d)));
+ if (!p4d_present(p4d))
+ return false;
+
+ pud = READ_ONCE(*pud_offset(&p4d, address));
+ if (pud_large(pud))
+ return pks_call_fault_callback(regs, address, write,
+ pte_flags_pkey(pud_val(pud)));
+ if (!pud_present(pud))
+ return false;
+
+ pmd = READ_ONCE(*pmd_offset(&pud, address));
+ if (pmd_large(pmd))
+ return pks_call_fault_callback(regs, address, write,
+ pte_flags_pkey(pmd_val(pmd)));
+ if (!pmd_present(pmd))
+ return false;
+
+ pte = READ_ONCE(*pte_offset_kernel(&pmd, address));
+ return pks_call_fault_callback(regs, address, write,
+ pte_flags_pkey(pte_val(pte)));
+}
+
/*
* pks_write_pkrs() - Write the pkrs of the current CPU
* @new_pkrs: New value to write to the current CPU register
diff --git a/include/linux/pks.h b/include/linux/pks.h
index 9f18f8b4cbb1..d0d8bf1aaa1d 100644
--- a/include/linux/pks.h
+++ b/include/linux/pks.h
@@ -34,6 +34,9 @@ static inline void pks_set_readwrite(u8 pkey)
pks_update_protection(pkey, PKEY_READ_WRITE);
}

+typedef bool (*pks_key_callback)(struct pt_regs *regs, unsigned long address,
+ bool write);
+
#else /* !CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS */

static inline void pks_set_noaccess(u8 pkey) {}
--
2.35.1

2022-03-11 21:50:07

by Ira Weiny

[permalink] [raw]
Subject: [PATCH V9 16/45] x86/pkeys: Preserve the PKS MSR on context switch

From: Ira Weiny <[email protected]>

The PKS MSR (PKRS) is a per-logical-processor register. Unfortunately,
the MSR is not managed by XSAVE. Therefore, software must save/restore
the MSR value on context switch.

Allocate space in thread_struct to hold the saved MSR value. Ensure all
tasks, including the init_task, are properly initialized. Set the CPU
PKRS value when a task is scheduled.
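
In effect each task gets its own PKRS view. A condensed timeline of the
result (illustration only; pks_set_readwrite() is introduced later in
the series):

	task A: pks_set_readwrite(pkey);	  /* A's thread.pkrs grants RW */
	switch A -> B: x86_pkrs_load(&B->thread); /* B's pkrs (PKS_INIT_VALUE) hits the MSR */
	switch B -> A: x86_pkrs_load(&A->thread); /* A's RW grant is restored */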

Co-developed-by: Fenghua Yu <[email protected]>
Signed-off-by: Fenghua Yu <[email protected]>
Signed-off-by: Ira Weiny <[email protected]>

---
Changes for V9
From Dave Hansen
Clarify the commit message
s/pks_saved_pkrs/pkrs/
s/pks_write_current/x86_pkrs_load/
Change x86_pkrs_load to take the next thread instead of
'current'

Changes for V8
From Thomas
Ensure pkrs_write_current() does not suffer the overhead
of preempt disable.
Fix setting of initial value
Remove flawed and broken create_initial_pkrs_value() in
favor of a much simpler and robust macro default
Update function names to be consistent.

s/pkrs_write_current/pks_write_current
This is a more consistent name
s/saved_pkrs/pks_saved_pkrs
s/pkrs_init_value/PKS_INIT_VALUE
Remove pks_init_task()
This function was added mainly to avoid the header file
issue. Adding pks-keys.h solved that and saves the
complexity.

Changes for V7
Move definitions from asm/processor.h to asm/pks.h
s/INIT_PKRS_VALUE/pkrs_init_value
Change pks_init_task()/pks_sched_in() to functions
s/pks_sched_in/pks_write_current to be used more generically
later in the series
---
arch/x86/include/asm/pks.h | 2 ++
arch/x86/include/asm/processor.h | 15 ++++++++++++++-
arch/x86/kernel/process_64.c | 2 ++
arch/x86/mm/pkeys.c | 9 +++++++++
4 files changed, 27 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/pks.h b/arch/x86/include/asm/pks.h
index 8180fc59790b..a7bad7301783 100644
--- a/arch/x86/include/asm/pks.h
+++ b/arch/x86/include/asm/pks.h
@@ -5,10 +5,12 @@
#ifdef CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS

void pks_setup(void);
+void x86_pkrs_load(struct thread_struct *thread);

#else /* !CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS */

static inline void pks_setup(void) { }
+static inline void x86_pkrs_load(struct thread_struct *thread) { }

#endif /* CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS */

diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index 2c5f12ae7d04..e3874c2d175e 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -2,6 +2,8 @@
#ifndef _ASM_X86_PROCESSOR_H
#define _ASM_X86_PROCESSOR_H

+#include <linux/pks-keys.h>
+
#include <asm/processor-flags.h>

/* Forward declaration, a strange C thing */
@@ -527,6 +529,10 @@ struct thread_struct {
* PKRU is the hardware itself.
*/
u32 pkru;
+#ifdef CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS
+ /* Saved Protection key register for supervisor mappings */
+ u32 pkrs;
+#endif

/* Floating point and extended processor state */
struct fpu fpu;
@@ -769,7 +775,14 @@ static inline void spin_lock_prefetch(const void *x)
#define KSTK_ESP(task) (task_pt_regs(task)->sp)

#else
-#define INIT_THREAD { }
+
+#ifdef CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS
+#define INIT_THREAD { \
+ .pkrs = PKS_INIT_VALUE, \
+}
+#else
+#define INIT_THREAD { }
+#endif

extern unsigned long KSTK_ESP(struct task_struct *task);

diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index 3402edec236c..e703cc451128 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -59,6 +59,7 @@
/* Not included via unistd.h */
#include <asm/unistd_32_ia32.h>
#endif
+#include <asm/pks.h>

#include "process.h"

@@ -612,6 +613,7 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
x86_fsgsbase_load(prev, next);

x86_pkru_load(prev, next);
+ x86_pkrs_load(next);

/*
* Switch the PDA and FPU contexts.
diff --git a/arch/x86/mm/pkeys.c b/arch/x86/mm/pkeys.c
index 10521f1a292e..39e4c2cbc279 100644
--- a/arch/x86/mm/pkeys.c
+++ b/arch/x86/mm/pkeys.c
@@ -246,6 +246,15 @@ static inline void pks_write_pkrs(u32 new_pkrs)
}
}

+/* x86_pkrs_load() - Update CPU with the incoming thread pkrs value */
+void x86_pkrs_load(struct thread_struct *thread)
+{
+ if (!cpu_feature_enabled(X86_FEATURE_PKS))
+ return;
+
+ pks_write_pkrs(thread->pkrs);
+}
+
/*
* PKS is independent of PKU and either or both may be supported on a CPU.
*
--
2.35.1

2022-03-11 21:59:08

by Ira Weiny

[permalink] [raw]
Subject: [PATCH V9 25/45] entry: Add calls for save/restore auxiliary pt_regs

From: Ira Weiny <[email protected]>

Some architectures have auxiliary pt_regs space which is available to
store extra information on the stack. For ease of implementation, the
common C code is left to fill in the data when needed.

Add calls to the architecture save and restore auxiliary pt_regs
functions. Define empty calls for any architecture which does not have
auxiliary pt_regs.

NOTE: Due to the split nature of the Xen exit code
irqentry_exit_cond_resched() requires an unbalanced call to
arch_restore_aux_pt_regs() regardless of the preemption configuration.
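
For an architecture which does not opt in, the hooks below compile away
to empty stubs. An opting-in architecture selects
CONFIG_ARCH_HAS_PTREGS_AUXILIARY and supplies its own pair, roughly of
this shape (hypothetical sketch; the x86 versions come later in this
series):

	/* hypothetical asm/entry-common.h for an opting-in arch */
	static inline void arch_save_aux_pt_regs(struct pt_regs *regs)
	{
		/* stash extra state alongside pt_regs on entry */
	}

	static inline void arch_restore_aux_pt_regs(struct pt_regs *regs)
	{
		/* write the saved state back on exit */
	}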

Signed-off-by: Ira Weiny <[email protected]>

---
Changes for V9
Update commit message

Changes for V8
New patch which introduces a generic auxiliary pt_register save
restore.
---
include/linux/entry-common.h | 7 +++++++
kernel/entry/common.c | 16 ++++++++++++++--
2 files changed, 21 insertions(+), 2 deletions(-)

diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h
index 14fd329847e7..b243f1cfd491 100644
--- a/include/linux/entry-common.h
+++ b/include/linux/entry-common.h
@@ -99,6 +99,13 @@ static inline __must_check int arch_syscall_enter_tracehook(struct pt_regs *regs
}
#endif

+#ifndef CONFIG_ARCH_HAS_PTREGS_AUXILIARY
+
+static inline void arch_save_aux_pt_regs(struct pt_regs *regs) { }
+static inline void arch_restore_aux_pt_regs(struct pt_regs *regs) { }
+
+#endif
+
/**
* enter_from_user_mode - Establish state when coming from user mode
*
diff --git a/kernel/entry/common.c b/kernel/entry/common.c
index f4210a7fc84d..c778e9783361 100644
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -323,7 +323,7 @@ noinstr irqentry_state_t irqentry_enter(struct pt_regs *regs)

if (user_mode(regs)) {
irqentry_enter_from_user_mode(regs);
- return ret;
+ goto aux_save;
}

/*
@@ -362,7 +362,7 @@ noinstr irqentry_state_t irqentry_enter(struct pt_regs *regs)
instrumentation_end();

ret.exit_rcu = true;
- return ret;
+ goto aux_save;
}

/*
@@ -377,6 +377,11 @@ noinstr irqentry_state_t irqentry_enter(struct pt_regs *regs)
trace_hardirqs_off_finish();
instrumentation_end();

+aux_save:
+ instrumentation_begin();
+ arch_save_aux_pt_regs(regs);
+ instrumentation_end();
+
return ret;
}

@@ -408,6 +413,7 @@ static void exit_cond_resched(void)

void irqentry_exit_cond_resched(struct pt_regs *regs)
{
+ arch_restore_aux_pt_regs(regs);
exit_cond_resched();
}

@@ -415,6 +421,10 @@ noinstr void irqentry_exit(struct pt_regs *regs, irqentry_state_t state)
{
lockdep_assert_irqs_disabled();

+ instrumentation_begin();
+ arch_restore_aux_pt_regs(regs);
+ instrumentation_end();
+
/* Check whether this returns to user mode */
if (user_mode(regs)) {
irqentry_exit_to_user_mode(regs);
@@ -464,6 +474,7 @@ irqentry_state_t noinstr irqentry_nmi_enter(struct pt_regs *regs)
instrumentation_begin();
trace_hardirqs_off_finish();
ftrace_nmi_enter();
+ arch_save_aux_pt_regs(regs);
instrumentation_end();

return irq_state;
@@ -472,6 +483,7 @@ irqentry_state_t noinstr irqentry_nmi_enter(struct pt_regs *regs)
void noinstr irqentry_nmi_exit(struct pt_regs *regs, irqentry_state_t irq_state)
{
instrumentation_begin();
+ arch_restore_aux_pt_regs(regs);
ftrace_nmi_exit();
if (irq_state.lockdep) {
trace_hardirqs_on_prepare();
--
2.35.1

2022-03-11 22:20:41

by Ira Weiny

[permalink] [raw]
Subject: [PATCH V9 39/45] memremap_pages: Set PKS pkey in PTEs if requested

From: Ira Weiny <[email protected]>

When a devmap caller requests protections, the dev_pagemap PTEs need to
have a pkey set.

When PGMAP_PROTECTION is requested, add the pkey to the page
protections.
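
The adjustment amounts to OR-ing the pkey bits into the page protection
used for the mapping, i.e. roughly:

	params.pgprot = __pgprot(pgprot_val(params.pgprot) |
				 _PAGE_PKEY(PKS_KEY_PGMAP_PROTECTION));

so every PTE created for the pagemap carries the reserved pkey.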

Signed-off-by: Ira Weiny <[email protected]>

---
Changes for V9
From Dave Hansen
use pkey
---
mm/memremap.c | 13 +++++++++++++
1 file changed, 13 insertions(+)

diff --git a/mm/memremap.c b/mm/memremap.c
index 38d321cc59c2..cefdf541bcc1 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -82,6 +82,14 @@ static void devmap_protection_enable(void)
static_branch_inc(&dev_pgmap_protection_static_key);
}

+static pgprot_t devmap_protection_adjust_pgprot(pgprot_t prot)
+{
+ pgprotval_t val;
+
+ val = pgprot_val(prot);
+ return __pgprot(val | _PAGE_PKEY(PKS_KEY_PGMAP_PROTECTION));
+}
+
static void devmap_protection_disable(void)
{
static_branch_dec(&dev_pgmap_protection_static_key);
@@ -92,6 +100,10 @@ static void devmap_protection_disable(void)
static void devmap_protection_enable(void) { }
static void devmap_protection_disable(void) { }

+static pgprot_t devmap_protection_adjust_pgprot(pgprot_t prot)
+{
+ return prot;
+}
#endif /* CONFIG_DEVMAP_ACCESS_PROTECTION */

static void pgmap_array_delete(struct range *range)
@@ -346,6 +358,7 @@ void *memremap_pages(struct dev_pagemap *pgmap, int nid)
if (!pgmap_protection_available())
return ERR_PTR(-EINVAL);
devmap_protection_enable();
+ params.pgprot = devmap_protection_adjust_pgprot(params.pgprot);
}

switch (pgmap->type) {
--
2.35.1

2022-03-11 22:47:31

by Ira Weiny

[permalink] [raw]
Subject: [PATCH V9 40/45] memremap_pages: Define pgmap_set_{readwrite|noaccess}() calls

From: Ira Weiny <[email protected]>

A thread that wants to access memory protected by PGMAP protections must
first enable access, and then disable access when it is done.

Introduce pgmap_set_{readwrite|noaccess}() for this purpose. The two
calls are intended to be used by the kmap API and take a struct page for
convenience. They determine if the page is protected and, if so,
perform the requested operation.

Toggling between Read/Write and No Access was chosen as it fits well
with the accessibility of a kmap'ed page. Discussions did occur
regarding a finer grained Read Only mapping, but that is something
which can be added at a later date.

In addition, two lower level functions are exported. They take the
dev_pagemap object directly for internal consumers who have knowledge of
the dev_pagemap.

All changes to the protections must be made through the above calls. They
abstract the protection implementation (currently the PKS API) from
upper layer consumers.

The calls are made nestable by the use of a per task reference count.
This ensures that the first call to re-enable protection does not
'break' the last access of the device memory. Expansion of the task
struct is unavoidable due to the desire to maintain kmap_local_page() as
non-atomic and migratable. The only alternative considered was tracking
the count in a per-cpu variable. However, doing so would make
kmap_local_page() equivalent to kmap_atomic(), which is undesirable.

Access to device memory during exceptions (#PF) is expected only from
user faults. Therefore there is no need to maintain the reference count
during exceptions.

NOTE: It is not anticipated that any code path will directly nest these
calls. For this reason multiple reviewers, including Dan and Thomas,
asked why this reference counting was needed at this level rather than
in a higher level call such as kmap_local_page(). The reason is that
pgmap_set_readwrite() can nest with kmap_{atomic,local_page}().
Therefore this reference counting is pushed to the lower level to ensure
that any combination of calls is nestable.
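
To illustrate why the per-task count makes the calls nestable, below is
a small user space model of the logic (the pks_set_*() calls are
replaced by prints; this is an illustration, not the patch code):

	#include <stdio.h>

	/* models task_struct.pgmap_prot_count from this patch */
	static unsigned int pgmap_prot_count;

	static void __pgmap_set_readwrite(void)
	{
		if (!pgmap_prot_count++)
			printf("pks_set_readwrite()\n");  /* 0 -> 1: MSR write */
	}

	static void __pgmap_set_noaccess(void)
	{
		if (!--pgmap_prot_count)
			printf("pks_set_noaccess()\n");   /* 1 -> 0: MSR write */
	}

	int main(void)
	{
		__pgmap_set_readwrite();	/* outer kmap_local_page() */
		__pgmap_set_readwrite();	/* nested access: no MSR write */
		__pgmap_set_noaccess();		/* inner unmap: access stays enabled */
		__pgmap_set_noaccess();		/* outer unmap: protection restored */
		return 0;
	}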

Signed-off-by: Ira Weiny <[email protected]>

---
Changes for V9
From Dan Williams
Update the commit message with details on why the thread
struct needs to be expanded.
Following on Dave Hansens suggestion for pks_mk
s/pgmap_mk_*/pgmap_set_*/

Changes for V8
Split these functions into their own patch.
This helps to clarify the commit message and usage.
---
include/linux/mm.h | 35 +++++++++++++++++++++++++++++++++++
include/linux/sched.h | 7 +++++++
init/init_task.c | 3 +++
mm/memremap.c | 14 ++++++++++++++
4 files changed, 59 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 4ca24329848a..c85189b24eca 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1168,8 +1168,43 @@ static inline bool devmap_protected(struct page *page)
return false;
}

+void __pgmap_set_readwrite(struct dev_pagemap *pgmap);
+void __pgmap_set_noaccess(struct dev_pagemap *pgmap);
+
+static inline bool pgmap_check_pgmap_prot(struct page *page)
+{
+ if (!devmap_protected(page))
+ return false;
+
+ /*
+ * There is no known use case to change permissions in an irq for pgmap
+ * pages, so assert that this is not interrupt context.
+ */
+ lockdep_assert(!in_interrupt());
+ return true;
+}
+
+static inline void pgmap_set_readwrite(struct page *page)
+{
+ if (!pgmap_check_pgmap_prot(page))
+ return;
+ __pgmap_set_readwrite(page->pgmap);
+}
+
+static inline void pgmap_set_noaccess(struct page *page)
+{
+ if (!pgmap_check_pgmap_prot(page))
+ return;
+ __pgmap_set_noaccess(page->pgmap);
+}
+
#else

+static inline void __pgmap_set_readwrite(struct dev_pagemap *pgmap) { }
+static inline void __pgmap_set_noaccess(struct dev_pagemap *pgmap) { }
+static inline void pgmap_set_readwrite(struct page *page) { }
+static inline void pgmap_set_noaccess(struct page *page) { }
+
static inline bool pgmap_protection_available(void)
{
return false;
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 75ba8aa60248..a79f2090e291 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1492,6 +1492,13 @@ struct task_struct {
struct callback_head l1d_flush_kill;
#endif

+#ifdef CONFIG_DEVMAP_ACCESS_PROTECTION
+ /*
+ * NOTE: pgmap_prot_count is modified within a single thread of
+ * execution. So it does not need to be atomic_t.
+ */
+ u32 pgmap_prot_count;
+#endif
/*
* New fields for task_struct should be added above here, so that
* they are included in the randomized portion of task_struct.
diff --git a/init/init_task.c b/init/init_task.c
index 73cc8f03511a..948b32cf8139 100644
--- a/init/init_task.c
+++ b/init/init_task.c
@@ -209,6 +209,9 @@ struct task_struct init_task
#ifdef CONFIG_SECCOMP_FILTER
.seccomp = { .filter_count = ATOMIC_INIT(0) },
#endif
+#ifdef CONFIG_DEVMAP_ACCESS_PROTECTION
+ .pgmap_prot_count = 0,
+#endif
};
EXPORT_SYMBOL(init_task);

diff --git a/mm/memremap.c b/mm/memremap.c
index cefdf541bcc1..6fa259748a0b 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -95,6 +95,20 @@ static void devmap_protection_disable(void)
static_branch_dec(&dev_pgmap_protection_static_key);
}

+void __pgmap_set_readwrite(struct dev_pagemap *pgmap)
+{
+ if (!current->pgmap_prot_count++)
+ pks_set_readwrite(PKS_KEY_PGMAP_PROTECTION);
+}
+EXPORT_SYMBOL_GPL(__pgmap_set_readwrite);
+
+void __pgmap_set_noaccess(struct dev_pagemap *pgmap)
+{
+ if (!--current->pgmap_prot_count)
+ pks_set_noaccess(PKS_KEY_PGMAP_PROTECTION);
+}
+EXPORT_SYMBOL_GPL(__pgmap_set_noaccess);
+
#else /* !CONFIG_DEVMAP_ACCESS_PROTECTION */

static void devmap_protection_enable(void) { }
--
2.35.1

2022-03-11 22:52:14

by Ira Weiny

[permalink] [raw]
Subject: [PATCH V9 38/45] memremap_pages: Reserve a PKS pkey for eventual use by PMEM

From: Ira Weiny <[email protected]>

Reserve a pkey for use by the memremap_pages facility and set its default
protections to Access Disabled.
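
Because the PKS_NEW_KEY() allocation scheme is predicated on consumer
config options, the key values depend on the kernel configuration. With
both CONFIG_PKS_TEST and CONFIG_DEVMAP_ACCESS_PROTECTION enabled the
keys resolve as (illustrative values, not part of the patch):

	PKS_KEY_DEFAULT          == 0
	PKS_KEY_TEST             == 1
	PKS_KEY_PGMAP_PROTECTION == 2
	PKS_KEY_MAX              == 3

With one of those configs disabled, no key is consumed for it and the
subsequent values shift down accordingly.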

Signed-off-by: Ira Weiny <[email protected]>

---
Changes for V9
Adjust for new key allocation
From Dave Hansen
use pkey
---
include/linux/pks-keys.h | 9 +++++++--
1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/include/linux/pks-keys.h b/include/linux/pks-keys.h
index f7e82e462659..32075ac54964 100644
--- a/include/linux/pks-keys.h
+++ b/include/linux/pks-keys.h
@@ -61,7 +61,9 @@
/* PKS_KEY_DEFAULT must be 0 */
#define PKS_KEY_DEFAULT 0
#define PKS_KEY_TEST PKS_NEW_KEY(PKS_KEY_DEFAULT, CONFIG_PKS_TEST)
-#define PKS_KEY_MAX PKS_NEW_KEY(PKS_KEY_TEST, 1)
+#define PKS_KEY_PGMAP_PROTECTION \
+ PKS_NEW_KEY(PKS_KEY_TEST, CONFIG_DEVMAP_ACCESS_PROTECTION)
+#define PKS_KEY_MAX PKS_NEW_KEY(PKS_KEY_PGMAP_PROTECTION, 1)

#ifdef CONFIG_PKS_TEST_ALL_KEYS
#undef PKS_KEY_MAX
@@ -72,6 +74,8 @@
#define PKS_KEY_DEFAULT_INIT PKS_DECLARE_INIT_VALUE(PKS_KEY_DEFAULT, RW, 1)
#define PKS_KEY_TEST_INIT PKS_DECLARE_INIT_VALUE(PKS_KEY_TEST, AD, \
CONFIG_PKS_TEST)
+#define PKS_KEY_PGMAP_INIT PKS_DECLARE_INIT_VALUE(PKS_KEY_PGMAP_PROTECTION, \
+ AD, CONFIG_DEVMAP_ACCESS_PROTECTION)

#define PKS_ALL_AD_MASK \
GENMASK(PKS_NUM_PKEYS * PKR_BITS_PER_PKEY, \
@@ -79,7 +83,8 @@

#define PKS_INIT_VALUE ((PKS_ALL_AD & PKS_ALL_AD_MASK) | \
PKS_KEY_DEFAULT_INIT | \
- PKS_KEY_TEST_INIT \
+ PKS_KEY_TEST_INIT | \
+ PKS_KEY_PGMAP_INIT \
)

#endif /* CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS */
--
2.35.1

2022-03-11 23:31:00

by Ira Weiny

[permalink] [raw]
Subject: [PATCH V9 36/45] memremap_pages: Introduce a PGMAP_PROTECTION flag

From: Ira Weiny <[email protected]>

The persistent memory (PMEM) driver uses the memremap_pages facility to
provide 'struct page' metadata (vmemmap) for PMEM. Given that PMEM
capacity may be orders of magnitude higher than that of System RAM, it
presents a large vulnerability surface to stray writes. Unlike stray
writes to System RAM, which may result in a crash or other undesirable
behavior, stray writes to PMEM are more likely to result in
permanent data loss. Reboot is not a remediation for PMEM corruption
like it is for System RAM.

Given that PMEM access from the kernel is limited to a constrained set
of locations (PMEM driver, Filesystem-DAX, and direct-I/O to a DAX
page), it is amenable to supervisor pkey protection.

Some systems which have configured DEVMAP_ACCESS_PROTECTION may not have
PMEM installed. Or the PMEM may not be mapped into the direct map. In
addition, some callers of memremap_pages() will not want the mapped
pages protected.

Define a new PGMAP flag to distinguish page maps which are protected.
Use this flag to enable runtime protection support. A static key is
used to optimize the runtime support.

Specifying this flag on a system which can't support protections will
fail. Callers are expected to check if protections are supported via
pgmap_protection_available(). Having callers specify the flag and then
check whether the returned dev_pagemap object was protected was
considered, but rejected as less efficient than a direct check
beforehand.
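
A minimal sketch of the expected caller pattern (hypothetical caller;
the real pmem hookup comes later in this series):

	if (!pgmap_protection_available())
		return -ENXIO;	/* hypothetical caller policy */

	pgmap->flags |= PGMAP_PROTECTION;
	addr = memremap_pages(pgmap, nid); /* ERR_PTR(-EINVAL) if unsupported */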

Signed-off-by: Ira Weiny <[email protected]>

---
Changes for V9
Clean up commit message

Changes for V8
Split this out into it's own patch
---
include/linux/memremap.h | 1 +
mm/memremap.c | 40 ++++++++++++++++++++++++++++++++++++++++
2 files changed, 41 insertions(+)

diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index 1fafcc38acba..84402f73712c 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -80,6 +80,7 @@ struct dev_pagemap_ops {
};

#define PGMAP_ALTMAP_VALID (1 << 0)
+#define PGMAP_PROTECTION (1 << 1)

/**
* struct dev_pagemap - metadata for ZONE_DEVICE mappings
diff --git a/mm/memremap.c b/mm/memremap.c
index 6aa5f0c2d11f..38d321cc59c2 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -63,6 +63,37 @@ static void devmap_managed_enable_put(struct dev_pagemap *pgmap)
}
#endif /* CONFIG_DEV_PAGEMAP_OPS */

+#ifdef CONFIG_DEVMAP_ACCESS_PROTECTION
+
+/*
+ * Note: all devices which have asked for protections share the same key. The
+ * key may, or may not, have been provided by the core. If not, protection
+ * will be disabled. Key acquisition is attempted when the first ZONE_DEVICE
+ * user requests it, and the key is freed when all zones have been unmapped.
+ *
+ * Also this must be EXPORT_SYMBOL rather than EXPORT_SYMBOL_GPL because it is
+ * intended to be used in the kmap API.
+ */
+DEFINE_STATIC_KEY_FALSE(dev_pgmap_protection_static_key);
+EXPORT_SYMBOL(dev_pgmap_protection_static_key);
+
+static void devmap_protection_enable(void)
+{
+ static_branch_inc(&dev_pgmap_protection_static_key);
+}
+
+static void devmap_protection_disable(void)
+{
+ static_branch_dec(&dev_pgmap_protection_static_key);
+}
+
+#else /* !CONFIG_DEVMAP_ACCESS_PROTECTION */
+
+static void devmap_protection_enable(void) { }
+static void devmap_protection_disable(void) { }
+
+#endif /* CONFIG_DEVMAP_ACCESS_PROTECTION */
+
static void pgmap_array_delete(struct range *range)
{
xa_store_range(&pgmap_array, PHYS_PFN(range->start), PHYS_PFN(range->end),
@@ -162,6 +193,9 @@ void memunmap_pages(struct dev_pagemap *pgmap)

WARN_ONCE(pgmap->altmap.alloc, "failed to free all reserved pages\n");
devmap_managed_enable_put(pgmap);
+
+ if (pgmap->flags & PGMAP_PROTECTION)
+ devmap_protection_disable();
}
EXPORT_SYMBOL_GPL(memunmap_pages);

@@ -308,6 +342,12 @@ void *memremap_pages(struct dev_pagemap *pgmap, int nid)
if (WARN_ONCE(!nr_range, "nr_range must be specified\n"))
return ERR_PTR(-EINVAL);

+ if (pgmap->flags & PGMAP_PROTECTION) {
+ if (!pgmap_protection_available())
+ return ERR_PTR(-EINVAL);
+ devmap_protection_enable();
+ }
+
switch (pgmap->type) {
case MEMORY_DEVICE_PRIVATE:
if (!IS_ENABLED(CONFIG_DEVICE_PRIVATE)) {
--
2.35.1

2022-04-01 15:41:11

by Ira Weiny

[permalink] [raw]
Subject: Re: [PATCH V9 00/45] PKS/PMEM: Add Stray Write Protection

On Thu, Mar 10, 2022 at 09:19:34AM -0800, Ira wrote:
> From: Ira Weiny <[email protected]>
>
>
> I'm looking for Intel acks on the series prior to submitting to maintainers.
> Most of the changes from V8 to V9 was in getting the tests straightened out.
> But there are some improvements in the actual code.

Is there any feedback on this?

Ira

>
>
> Changes for V9
>
> Review and update all commit messages.
> Update cover letter below
>
> PKS Core
> Separate user and supervisor pkey code in the headers
> create linux/pks.h for supervisor calls
> This facilitated making the pmem code more efficient
> Completely rearchitect the test code
> [After Dave Hansen and Rick Edgecombe found issues in the test
> code it was easier to rearchitect the code completely
> rather than attempt to fix it.]
> Remove pks_test_callback in favor of using fault hooks
> Fault hooks also isolate the fault callbacks from being
> false positives if non-test consumers are running
> Make additional PKS_TEST_RUN_ALL Kconfig option which is
> mutually exclusive to any non-test PKS consumer
> PKS_TEST_RUN_ALL takes over all pkey callbacks
> Ensure that each test runs within it's own context and is
> mutually exclusive from running while any other test is
> running.
> Ensure test session and context memory is cleaned up on file
> close
> Use pr_debug() and dynamic debug for in kernel debug messages
> Enhance test_pks selftest
> Add the ability to run all tests not just the context
> switch test
> Standardize output [PASS][FAIL][SKIP]
> Add '-d' option enables dynamic debug to see the kernel
> debug messages
>
> Incorporate feedback from Rick Edgecombe
> Update all pkey types to u8
> Fix up test code barriers
> Move patch declaring PKS_INIT_VALUE ahead of the patch which enables
> PKS so that PKS_INIT_VALUE can be used when pks_setup() is
> first created
> From Dan Williams
> Use macros instead of an enum for a pkey allocation scheme
> which is predicated on the config options of consumers
> This almost worked perfectly. It required a bit of
> tweeking to be able to allocate all of the keys.
>
> From Dave Hansen
> Reposition some code to be near/similar to user pkeys
> s/pks_write_current/x86_pkrs_load
> s/pks_saved_pkrs/pkrs
> Update Documentation
> s/PKR_{RW,AD,WD}_KEY/PKR_{RW,AD,WD}_MASK
> Consistently use lower case for pkey
> Update commit messages
> Add Acks
>
> PMEM Stray Write
> Building on the change to the pks_mk_*() function rename
> s/pgmap_mk_*/pgmap_set_*/
> s/dax_mk_*/dax_set_*/
> From Dan Williams
> Avoid adding new dax operations by teaching dax_device about pgmap
> Remove pgmap_protection_flag_invalid() patch (Just let
> kmap'ings fail)
>
>
> PKS/PMEM Stray write protection
> ===============================
>
> This series is broken into 2 parts.
>
> 1) Introduce Protection Key Supervisor (PKS), testing, and
> documentation
> 2) Use PKS to protect PMEM from stray writes
>
> Introduce Protection Key Supervisor (PKS)
> -----------------------------------------
>
> PKS enables protections on 'domains' of supervisor pages to limit supervisor
> mode access to pages beyond the normal paging protections. PKS works in a
> similar fashion to user space pkeys, PKU. As with PKU, supervisor pkeys are
> checked in addition to normal paging protections. And page mappings are
> assigned to a domain by setting a 4 bit pkey in the PTE of that mapping.
>
> Unlike PKU, permissions are changed via a MSR update. This update avoids TLB
> flushes making this an efficient way to alter protections vs PTE updates.
>
> Also, unlike PTE updates PKS permission changes apply only to the current
> processor. Therefore changing permissions apply only to that thread and not
> any other cpu/process. This allows protections to remain in place on other
> cpus for additional protection and isolation.
>
> Even though PKS updates are thread local, XSAVE is not supported for the PKRS
> MSR. Therefore this implementation saves and restores the MSR across context
> switches and during exceptions within software. Nested exceptions are
> supported by each exception getting a new PKS state.
>
> For consistent behavior with current paging protections, pkey 0 is reserved and
> configured to allow full access via the pkey mechanism, thus preserving the
> default paging protections because PTEs naturally have a pkey value of 0.
>
> Other keys, (1-15) are statically allocated by kernel consumers when
> configured. This is done by adding the appropriate PKS_NEW_KEY and
> PKS_DECLARE_INIT_VALUE macros to pks-keys.h.
>
> Two PKS consumers, PKS_TEST and PMEM stray write protection, are included in
> this series. When the number of users grows larger the sharing of keys will
> need to be resolved depending on the needs of the users at that time. Many
> methods have been contemplated but the number of kernel users and use cases
> envisioned is still quite small, much less than the 15 available keys.
>
> To summarize, the following are key attributes of PKS.
>
> 1) Fast switching of permissions
> 1a) Prevents access without page table manipulations
> 1b) No TLB flushes required
> 2) Works on a per thread basis, thus allowing protections to be
> preserved on threads which are not actively accessing data through
> the mapping.
>
> PKS is available with 4 and 5 level paging. For this and simplicity of
> implementation, the feature is restricted to x86_64.
>
>
> Use PKS to protect PMEM from stray writes
> -----------------------------------------
>
> DAX leverages the direct-map to enable 'struct page' services for PMEM. Given
> that PMEM capacity may be an order of magnitude higher than that of System RAM,
> it presents a large vulnerability surface to stray writes. Such a stray write
> becomes a silent data corruption bug.
>
> Stray pointers to System RAM may result in a crash or other undesirable
> behavior which, while unfortunate, is usually recoverable with a reboot.
> Stray writes to PMEM are permanent in nature and thus are more likely to result
> in permanent user data loss. Given that PMEM access from the kernel is limited
> to a constrained set of locations (PMEM driver, Filesystem-DAX, direct-I/O, and
> any properly kmap'ed page), it is amenable to PKS protection.
>
> Set up an infrastructure for extra device access protection. Then implement the
> protection using the new Protection Keys Supervisor (PKS) on architectures
> which support it.
>
> Because PMEM pages are all associated with a struct dev_pagemap and flags in
> struct page are valuable the flag of protecting memory can be stored in struct
> dev_pagemap. All PMEM is protected by the same pkey. So a single flag is all
> that is needed in each dev_pagemap to indicate protection.
>
> General access in the kernel is supported by modifying the kmap infrastructure
> which can detect if a page is pks protected and enable access until the
> corresponding unmap is called.
>
> Because PKS is a thread local mechanism and because kmap was never really
> intended to create a long term mapping, this implementation does not support
> the kmap()/kunmap() calls. Calling kmap() on a PMEM protected page is allowed
> but accessing that mapping will cause a fault.
>
> Originally this series modified many of the kmap call sites to indicate they
> were thread local.[1] And an attempt to support kmap()[2] was made. But now
> that kmap_local_page() has been developed[3] and in more widespread use,
> kmap() can safely be left unsupported.
>
> How the fault is handled is configurable via a new module parameter
> memremap.pks_fault_mode. Two modes are supported.
>
> 'relaxed' (default) -- WARN_ONCE, disable the protection and allow
> access
>
> 'strict' -- prevent any unguarded access to a protected dev_pagemap
> range
>
> This 'safety valve' feature has already been useful in the development of this
> feature.
>
>
> [1] https://lore.kernel.org/lkml/[email protected]/
>
> [2] https://lore.kernel.org/lkml/[email protected]/
>
> [3] https://lore.kernel.org/lkml/[email protected]/
> https://lore.kernel.org/lkml/[email protected]/
> https://lore.kernel.org/lkml/[email protected]/
> https://lore.kernel.org/lkml/[email protected]/
>
>
> ----------------------------------------------------------------------------
> Changes for V8
>
> Feedback from Thomas
> * clean up noinstr mess
> * Fix static PKEY allocation mess
> * Ensure all functions are consistently named.
> * Split up patches to do 1 thing per patch
> * pkey_update_pkval() implementation
> * Streamline the use of pks_write_pkrs() by not disabling preemption
> - Leave this to the callers who require it.
> - Use documentation and lockdep to prevent errors
> * Clean up commit messages to explain in detail _why_ each patch is
> there.
>
> Feedback from Dave H.
> * Leave out pks_mk_readonly() as it is not used by the PMEM use case
>
> Feedback from Peter Anvin
> * Replace pks_abandon_pkey() with pks_update_exception()
> This is an even greater simplification in that it no longer
> attempts to shield users from faults. As the main use case for
> abandoning a key was to allow a system to continue running even
> with an error. This should be a rare event so the performance
> should not be an issue.
>
> * Simplify ARCH_ENABLE_SUPERVISOR_PKEYS
>
> * Update PKS Test code
> - Add default value test
> - Split up the test code into patches which follow each feature
> addition
> - simplify test code processing
> - ensure consistent reporting of errors.
>
> * Ensure all entry points to the PKS code are protected by
> cpu_feature_enabled(X86_FEATURE_PKS)
> - At the same time make sure non-entry points or sub-functions to the
> PKS code are not _unnecessarily_ protected by the feature check
>
> * Update documentation
> - Use kernel docs to place the docs with the code for easier internal
> developer use
>
> * Adjust the PMEM use cases for the core changes
>
> * Split the PMEM patches up to be 1 change per patch and help clarify review
>
> * Review all header files and remove those no longer needed
>
> * Review/update/clarify all commit messages
>
> Fenghua Yu (1):
> mm/pkeys: Define PKS page table macros
>
> Ira Weiny (43):
> entry: Create an internal irqentry_exit_cond_resched() call
> Documentation/protection-keys: Clean up documentation for User Space
> pkeys
> x86/pkeys: Clarify PKRU_AD_KEY macro
> x86/pkeys: Make PKRU macros generic
> x86/fpu: Refactor arch_set_user_pkey_access()
> mm/pkeys: Add Kconfig options for PKS
> x86/pkeys: Add PKS CPU feature bit
> x86/fault: Adjust WARN_ON for pkey fault
> Documentation/pkeys: Add initial PKS documentation
> mm/pkeys: Provide for PKS key allocation
> x86/pkeys: Enable PKS on cpus which support it
> mm/pkeys: PKS testing, add initial test code
> x86/selftests: Add test_pks
> x86/pkeys: Introduce pks_write_pkrs()
> x86/pkeys: Preserve the PKS MSR on context switch
> mm/pkeys: Introduce pks_set_readwrite()
> mm/pkeys: Introduce pks_set_noaccess()
> mm/pkeys: PKS testing, add a fault call back
> mm/pkeys: PKS testing, add pks_set_*() tests
> mm/pkeys: PKS testing, test context switching
> x86/entry: Add auxiliary pt_regs space
> entry: Split up irqentry_exit_cond_resched()
> entry: Add calls for save/restore auxiliary pt_regs
> x86/entry: Define arch_{save|restore}_auxiliary_pt_regs()
> x86/pkeys: Preserve PKRS MSR across exceptions
> x86/fault: Print PKS MSR on fault
> mm/pkeys: PKS testing, Add exception test
> mm/pkeys: Introduce pks_update_exception()
> mm/pkeys: PKS testing, test pks_update_exception()
> mm/pkeys: PKS testing, add test for all keys
> mm/pkeys: Add pks_available()
> memremap_pages: Add Kconfig for DEVMAP_ACCESS_PROTECTION
> memremap_pages: Introduce pgmap_protection_available()
> memremap_pages: Introduce a PGMAP_PROTECTION flag
> memremap_pages: Introduce devmap_protected()
> memremap_pages: Reserve a PKS pkey for eventual use by PMEM
> memremap_pages: Set PKS pkey in PTEs if requested
> memremap_pages: Define pgmap_set_{readwrite|noaccess}() calls
> memremap_pages: Add memremap.pks_fault_mode
> kmap: Make kmap work for devmap protected pages
> dax: Stray access protection for dax_direct_access()
> nvdimm/pmem: Enable stray access protection
> devdax: Enable stray access protection
>
> Rick Edgecombe (1):
> mm/pkeys: Introduce PKS fault callbacks
>
> .../admin-guide/kernel-parameters.txt | 12 +
> Documentation/core-api/protection-keys.rst | 130 ++-
> arch/x86/Kconfig | 6 +
> arch/x86/entry/calling.h | 20 +
> arch/x86/entry/common.c | 2 +-
> arch/x86/entry/entry_64.S | 22 +
> arch/x86/entry/entry_64_compat.S | 6 +
> arch/x86/include/asm/cpufeatures.h | 1 +
> arch/x86/include/asm/disabled-features.h | 8 +-
> arch/x86/include/asm/entry-common.h | 15 +
> arch/x86/include/asm/msr-index.h | 1 +
> arch/x86/include/asm/pgtable_types.h | 22 +
> arch/x86/include/asm/pkeys.h | 2 +
> arch/x86/include/asm/pkeys_common.h | 18 +
> arch/x86/include/asm/pkru.h | 20 +-
> arch/x86/include/asm/pks.h | 46 ++
> arch/x86/include/asm/processor.h | 15 +-
> arch/x86/include/asm/ptrace.h | 21 +
> arch/x86/include/uapi/asm/processor-flags.h | 2 +
> arch/x86/kernel/asm-offsets_64.c | 15 +
> arch/x86/kernel/cpu/common.c | 2 +
> arch/x86/kernel/dumpstack.c | 32 +-
> arch/x86/kernel/fpu/xstate.c | 22 +-
> arch/x86/kernel/head_64.S | 6 +
> arch/x86/kernel/process_64.c | 3 +
> arch/x86/mm/fault.c | 17 +-
> arch/x86/mm/pkeys.c | 320 +++++++-
> drivers/dax/device.c | 2 +
> drivers/dax/super.c | 59 ++
> drivers/md/dm-writecache.c | 8 +-
> drivers/nvdimm/pmem.c | 26 +
> fs/dax.c | 8 +
> fs/fuse/virtio_fs.c | 2 +
> include/linux/dax.h | 5 +
> include/linux/entry-common.h | 15 +-
> include/linux/highmem-internal.h | 4 +
> include/linux/memremap.h | 1 +
> include/linux/mm.h | 72 ++
> include/linux/pgtable.h | 4 +
> include/linux/pks-keys.h | 92 +++
> include/linux/pks.h | 73 ++
> include/linux/sched.h | 7 +
> include/uapi/asm-generic/mman-common.h | 1 +
> init/init_task.c | 3 +
> kernel/entry/common.c | 44 +-
> kernel/sched/core.c | 40 +-
> lib/Kconfig.debug | 33 +
> lib/Makefile | 3 +
> lib/pks/Makefile | 3 +
> lib/pks/pks_test.c | 755 ++++++++++++++++++
> mm/Kconfig | 32 +
> mm/memremap.c | 132 +++
> tools/testing/selftests/x86/Makefile | 2 +-
> tools/testing/selftests/x86/test_pks.c | 514 ++++++++++++
> 54 files changed, 2617 insertions(+), 109 deletions(-)
> create mode 100644 arch/x86/include/asm/pkeys_common.h
> create mode 100644 arch/x86/include/asm/pks.h
> create mode 100644 include/linux/pks-keys.h
> create mode 100644 include/linux/pks.h
> create mode 100644 lib/pks/Makefile
> create mode 100644 lib/pks/pks_test.c
> create mode 100644 tools/testing/selftests/x86/test_pks.c
>
> --
> 2.35.1
>

2022-04-07 08:24:59

by Ira Weiny

[permalink] [raw]
Subject: Re: [PATCH V9 01/45] entry: Create an internal irqentry_exit_cond_resched() call

On Thu, Mar 10, 2022 at 09:19:35AM -0800, Ira wrote:
> From: Ira Weiny <[email protected]>

Rebasing to 5.18-rc1 revealed that a different fix has been applied for this
work.[1]

Please disregard this patch.

Ira

[1] 4624a14f4daa ("sched/preempt: Simplify irqentry_exit_cond_resched()
callers")

>
> The static call to irqentry_exit_cond_resched() was not properly being
> overridden when called from xen_pv_evtchn_do_upcall().
>
> Define __irqentry_exit_cond_resched() as the static call and place the
> override logic in irqentry_exit_cond_resched().
>
> Cc: Peter Zijlstra (Intel) <[email protected]>
> Signed-off-by: Ira Weiny <[email protected]>
>
> ---
> Changes for V9
> Update the commit message a bit
>
> Because this was found via code inspection and it does not actually fix
> any seen bug I've not added a fixes tag.
>
> But for reference:
> Fixes: 40607ee97e4e ("preempt/dynamic: Provide irqentry_exit_cond_resched() static call")
> ---
> include/linux/entry-common.h | 5 ++++-
> kernel/entry/common.c | 23 +++++++++++++--------
> kernel/sched/core.c | 40 ++++++++++++++++++------------------
> 3 files changed, 38 insertions(+), 30 deletions(-)
>
> diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h
> index 2e2b8d6140ed..ddaffc983e62 100644
> --- a/include/linux/entry-common.h
> +++ b/include/linux/entry-common.h
> @@ -455,10 +455,13 @@ irqentry_state_t noinstr irqentry_enter(struct pt_regs *regs);
> * Conditional reschedule with additional sanity checks.
> */
> void irqentry_exit_cond_resched(void);
> +
> +void __irqentry_exit_cond_resched(void);
> #ifdef CONFIG_PREEMPT_DYNAMIC
> -DECLARE_STATIC_CALL(irqentry_exit_cond_resched, irqentry_exit_cond_resched);
> +DECLARE_STATIC_CALL(__irqentry_exit_cond_resched, __irqentry_exit_cond_resched);
> #endif
>
> +
> /**
> * irqentry_exit - Handle return from exception that used irqentry_enter()
> * @regs: Pointer to pt_regs (exception entry regs)
> diff --git a/kernel/entry/common.c b/kernel/entry/common.c
> index bad713684c2e..490442a48332 100644
> --- a/kernel/entry/common.c
> +++ b/kernel/entry/common.c
> @@ -380,7 +380,7 @@ noinstr irqentry_state_t irqentry_enter(struct pt_regs *regs)
> return ret;
> }
>
> -void irqentry_exit_cond_resched(void)
> +void __irqentry_exit_cond_resched(void)
> {
> if (!preempt_count()) {
> /* Sanity check RCU and thread stack */
> @@ -392,9 +392,20 @@ void irqentry_exit_cond_resched(void)
> }
> }
> #ifdef CONFIG_PREEMPT_DYNAMIC
> -DEFINE_STATIC_CALL(irqentry_exit_cond_resched, irqentry_exit_cond_resched);
> +DEFINE_STATIC_CALL(__irqentry_exit_cond_resched, __irqentry_exit_cond_resched);
> #endif
>
> +void irqentry_exit_cond_resched(void)
> +{
> + if (IS_ENABLED(CONFIG_PREEMPTION)) {
> +#ifdef CONFIG_PREEMPT_DYNAMIC
> + static_call(__irqentry_exit_cond_resched)();
> +#else
> + __irqentry_exit_cond_resched();
> +#endif
> + }
> +}
> +
> noinstr void irqentry_exit(struct pt_regs *regs, irqentry_state_t state)
> {
> lockdep_assert_irqs_disabled();
> @@ -420,13 +431,7 @@ noinstr void irqentry_exit(struct pt_regs *regs, irqentry_state_t state)
> }
>
> instrumentation_begin();
> - if (IS_ENABLED(CONFIG_PREEMPTION)) {
> -#ifdef CONFIG_PREEMPT_DYNAMIC
> - static_call(irqentry_exit_cond_resched)();
> -#else
> - irqentry_exit_cond_resched();
> -#endif
> - }
> + irqentry_exit_cond_resched();
> /* Covers both tracing and lockdep */
> trace_hardirqs_on();
> instrumentation_end();
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 9745613d531c..f56db4bd9730 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -6571,29 +6571,29 @@ EXPORT_STATIC_CALL_TRAMP(preempt_schedule_notrace);
> * SC:might_resched
> * SC:preempt_schedule
> * SC:preempt_schedule_notrace
> - * SC:irqentry_exit_cond_resched
> + * SC:__irqentry_exit_cond_resched
> *
> *
> * NONE:
> - * cond_resched <- __cond_resched
> - * might_resched <- RET0
> - * preempt_schedule <- NOP
> - * preempt_schedule_notrace <- NOP
> - * irqentry_exit_cond_resched <- NOP
> + * cond_resched <- __cond_resched
> + * might_resched <- RET0
> + * preempt_schedule <- NOP
> + * preempt_schedule_notrace <- NOP
> + * __irqentry_exit_cond_resched <- NOP
> *
> * VOLUNTARY:
> - * cond_resched <- __cond_resched
> - * might_resched <- __cond_resched
> - * preempt_schedule <- NOP
> - * preempt_schedule_notrace <- NOP
> - * irqentry_exit_cond_resched <- NOP
> + * cond_resched <- __cond_resched
> + * might_resched <- __cond_resched
> + * preempt_schedule <- NOP
> + * preempt_schedule_notrace <- NOP
> + * __irqentry_exit_cond_resched <- NOP
> *
> * FULL:
> - * cond_resched <- RET0
> - * might_resched <- RET0
> - * preempt_schedule <- preempt_schedule
> - * preempt_schedule_notrace <- preempt_schedule_notrace
> - * irqentry_exit_cond_resched <- irqentry_exit_cond_resched
> + * cond_resched <- RET0
> + * might_resched <- RET0
> + * preempt_schedule <- preempt_schedule
> + * preempt_schedule_notrace <- preempt_schedule_notrace
> + * __irqentry_exit_cond_resched <- __irqentry_exit_cond_resched
> */
>
> enum {
> @@ -6629,7 +6629,7 @@ void sched_dynamic_update(int mode)
> static_call_update(might_resched, __cond_resched);
> static_call_update(preempt_schedule, __preempt_schedule_func);
> static_call_update(preempt_schedule_notrace, __preempt_schedule_notrace_func);
> - static_call_update(irqentry_exit_cond_resched, irqentry_exit_cond_resched);
> + static_call_update(__irqentry_exit_cond_resched, __irqentry_exit_cond_resched);
>
> switch (mode) {
> case preempt_dynamic_none:
> @@ -6637,7 +6637,7 @@ void sched_dynamic_update(int mode)
> static_call_update(might_resched, (void *)&__static_call_return0);
> static_call_update(preempt_schedule, NULL);
> static_call_update(preempt_schedule_notrace, NULL);
> - static_call_update(irqentry_exit_cond_resched, NULL);
> + static_call_update(__irqentry_exit_cond_resched, NULL);
> pr_info("Dynamic Preempt: none\n");
> break;
>
> @@ -6646,7 +6646,7 @@ void sched_dynamic_update(int mode)
> static_call_update(might_resched, __cond_resched);
> static_call_update(preempt_schedule, NULL);
> static_call_update(preempt_schedule_notrace, NULL);
> - static_call_update(irqentry_exit_cond_resched, NULL);
> + static_call_update(__irqentry_exit_cond_resched, NULL);
> pr_info("Dynamic Preempt: voluntary\n");
> break;
>
> @@ -6655,7 +6655,7 @@ void sched_dynamic_update(int mode)
> static_call_update(might_resched, (void *)&__static_call_return0);
> static_call_update(preempt_schedule, __preempt_schedule_func);
> static_call_update(preempt_schedule_notrace, __preempt_schedule_notrace_func);
> - static_call_update(irqentry_exit_cond_resched, irqentry_exit_cond_resched);
> + static_call_update(__irqentry_exit_cond_resched, __irqentry_exit_cond_resched);
> pr_info("Dynamic Preempt: full\n");
> break;
> }
> --
> 2.35.1
>

2022-04-07 20:28:39

by Ira Weiny

[permalink] [raw]
Subject: Re: [PATCH V9 24/45] entry: Split up irqentry_exit_cond_resched()

On Thu, Mar 10, 2022 at 09:19:58AM -0800, Ira wrote:
> From: Ira Weiny <[email protected]>
>
> Auxiliary pt_regs space needs to be manipulated by the generic
> entry/exit code.

Because of a fix to the irqentry_exit_cond_resched() code[1] this patch needed
rework upon rebasing to 5.18-rc1.

The basic design of this patch remains, but the code is different. The
irqentry_exit_cond_resched() still needs to have pt_regs passed into it.

However, this could be safely ignored for this review cycle as well.

As soon as I have a series based on 5.18 I'll resend the full series.

Thanks for understanding,
Ira

[1] 4624a14f4daa ("sched/preempt: Simplify irqentry_exit_cond_resched()
callers")

>
> Normally irqentry_exit() would take care of handling any auxiliary
> pt_regs on exit. Unfortunately, the call to
> irqentry_exit_cond_resched() from xen_pv_evtchn_do_upcall() bypasses the
> normal irqentry_exit() call. Because of this bypass
> irqentry_exit_cond_resched() will be required to handle any auxiliary
> pt_regs exit handling. However, this prevents irqentry_exit() from
> being able to call irqentry_exit_cond_resched() and while maintaining
> control of the auxiliary pt_regs.
>
> Separate out the common functionality of irqentry_exit_cond_resched() so
> that functionality can be used by irqentry_exit(). Add a pt_regs
> parameter in anticipation of having irqentry_exit_cond_resched() handle
> the auxiliary pt_regs separately from irqentry_exit().
>
> Signed-off-by: Ira Weiny <[email protected]>
>
> ---
> Changes for V9
> Update commit message
>
> Changes for V8
> New Patch
> ---
> arch/x86/entry/common.c | 2 +-
> include/linux/entry-common.h | 3 ++-
> kernel/entry/common.c | 9 +++++++--
> 3 files changed, 10 insertions(+), 4 deletions(-)
>
> diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
> index 6c2826417b33..f1ba770d035d 100644
> --- a/arch/x86/entry/common.c
> +++ b/arch/x86/entry/common.c
> @@ -309,7 +309,7 @@ __visible noinstr void xen_pv_evtchn_do_upcall(struct pt_regs *regs)
>
> inhcall = get_and_clear_inhcall();
> if (inhcall && !WARN_ON_ONCE(state.exit_rcu)) {
> - irqentry_exit_cond_resched();
> + irqentry_exit_cond_resched(regs);
> instrumentation_end();
> restore_inhcall(inhcall);
> } else {
> diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h
> index ddaffc983e62..14fd329847e7 100644
> --- a/include/linux/entry-common.h
> +++ b/include/linux/entry-common.h
> @@ -451,10 +451,11 @@ irqentry_state_t noinstr irqentry_enter(struct pt_regs *regs);
>
> /**
> * irqentry_exit_cond_resched - Conditionally reschedule on return from interrupt
> + * @regs: Pointer to pt_regs of interrupted context
> *
> * Conditional reschedule with additional sanity checks.
> */
> -void irqentry_exit_cond_resched(void);
> +void irqentry_exit_cond_resched(struct pt_regs *regs);
>
> void __irqentry_exit_cond_resched(void);
> #ifdef CONFIG_PREEMPT_DYNAMIC
> diff --git a/kernel/entry/common.c b/kernel/entry/common.c
> index 490442a48332..f4210a7fc84d 100644
> --- a/kernel/entry/common.c
> +++ b/kernel/entry/common.c
> @@ -395,7 +395,7 @@ void __irqentry_exit_cond_resched(void)
> DEFINE_STATIC_CALL(__irqentry_exit_cond_resched, __irqentry_exit_cond_resched);
> #endif
>
> -void irqentry_exit_cond_resched(void)
> +static void exit_cond_resched(void)
> {
> if (IS_ENABLED(CONFIG_PREEMPTION)) {
> #ifdef CONFIG_PREEMPT_DYNAMIC
> @@ -406,6 +406,11 @@ void irqentry_exit_cond_resched(void)
> }
> }
>
> +void irqentry_exit_cond_resched(struct pt_regs *regs)
> +{
> + exit_cond_resched();
> +}
> +
> noinstr void irqentry_exit(struct pt_regs *regs, irqentry_state_t state)
> {
> lockdep_assert_irqs_disabled();
> @@ -431,7 +436,7 @@ noinstr void irqentry_exit(struct pt_regs *regs, irqentry_state_t state)
> }
>
> instrumentation_begin();
> - irqentry_exit_cond_resched();
> + exit_cond_resched();
> /* Covers both tracing and lockdep */
> trace_hardirqs_on();
> instrumentation_end();
> --
> 2.35.1
>