2023-01-11 13:50:06

by Kirill A. Shutemov

Subject: [PATCHv14 00/17] Linear Address Masking enabling

Linear Address Masking[1] (LAM) modifies the checking that is applied to
64-bit linear addresses, allowing software to use the untranslated
address bits for metadata.

The capability can be used for efficient implementation of address
sanitizers (ASAN) and for optimizations in JITs and virtual machines.

The patchset brings support for LAM for userspace addresses. Only LAM_U57
is supported at this time.
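
As a rough illustration of the intended userspace usage, here is a minimal
sketch based on the selftests later in the series (the ARCH_ENABLE_TAGGED_ADDR
value below is the one this series defines; it is not in any released uapi
header yet):

	#include <stdint.h>
	#include <stdio.h>
	#include <stdlib.h>
	#include <sys/syscall.h>
	#include <unistd.h>

	#define ARCH_ENABLE_TAGGED_ADDR	0x4002	/* from this series */

	int main(void)
	{
		/* Ask for 6 tag bits, i.e. LAM_U57 (bits 62:57 masked) */
		if (syscall(SYS_arch_prctl, ARCH_ENABLE_TAGGED_ADDR, 6))
			return 1;	/* no LAM support */

		uint64_t *p = malloc(sizeof(*p));
		*p = 42;

		/* Stash metadata in bits 62:57; the tagged pointer still
		   dereferences to the same memory once LAM is enabled. */
		uint64_t *tagged = (void *)((uint64_t)p | (0x2aULL << 57));
		printf("%llu\n", (unsigned long long)*tagged);

		free(p);
		return 0;
	}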

Please review and consider applying.

git://git.kernel.org/pub/scm/linux/kernel/git/kas/linux.git lam

v14:
- Rework address range check in get_user() and put_user();
- Introduce CONFIG_ADDRESS_MASKING;
- Cache untag masking in per-CPU variable;
- Reject LAM enabling via PTRACE_ARCH_PRCTL;
- Fix locking around untagged_addr_remote();
- Fix typo in MM_CONTEXT_* conversion patch;
- Fix selftest;
v13:
- Fix race between untagged_addr() and LAM enabling:
+ Do not allow enabling LAM after the process has spawned a second thread;
+ untagged_addr() untags the address according to the rules of the current
process;
+ untagged_addr_remote() can be used to untag addresses of a foreign
process. It requires the target process's mmap lock to be taken;
v12:
- Rebased onto tip/x86/mm;
- Drop VM_WARN_ON() that may produce a false positive on a race between
context switch and LAM enabling;
- Adjust comments to explain the possible race;
- Use READ_ONCE() in mm_lam_cr3_mask();
- Do not assume &init_mm == mm in initialize_tlbstate_and_flush();
- Ack by Andy;
v11:
- Move untag_mask to /proc/$PID/status;
- s/SVM/SVA/g;
- static inline arch_pgtable_dma_compat() instead of macros;
- Replace pasid_valid() with mm_valid_pasid();
- Acks from Ashok and Jacob (forgot to apply from v9);
v10:
- Rebased to v6.1-rc1;
- Add selftest for SVM vs LAM;
v9:
- Fix race between LAM enabling and the check that KVM memslot addresses
don't have any tags;
- Reduce untagged_addr() overhead until the first LAM user;
- Clarify SVM vs. LAM semantics;
- Use mmap_lock to serialize LAM enabling;
v8:
- Drop redundant smp_mb() in prctl_enable_tagged_addr();
- Cleanup code around build_cr3();
- Fix commit messages;
- Selftests updates;
- Acked/Reviewed/Tested-bys from Alexander and Peter;
v7:
- Drop redundant smp_mb() in prctl_enable_tagged_addr();
- Cleanup code around build_cr3();
- Fix commit message;
- Fix indentation;
v6:
- Rebased onto v6.0-rc1
- LAM_U48 excluded from the patchset. Still available in the git tree;
- add ARCH_GET_MAX_TAG_BITS;
- Fix build without CONFIG_DEBUG_VM;
- Update comments;
- Reviewed/Tested-by from Alexander;
v5:
- Do not use switch_mm() in enable_lam_func()
- Use mb()/READ_ONCE() pair on LAM enabling;
- Add self-test by Weihong Zhang;
- Add comments;
v4:
- Fix untagged_addr() for LAM_U48;
- Remove no-threads restriction on LAM enabling;
- Fix mm_struct access from /proc/$PID/arch_status
- Fix LAM handling in initialize_tlbstate_and_flush()
- Pack tlb_state better;
- Comments and commit messages;
v3:
- Rebased onto v5.19-rc1
- Per-process enabling;
- API overhaul (again);
- Avoid branches and costly computations in the fast path;
- LAM_U48 is in optional patch.
v2:
- Rebased onto v5.18-rc1
- New arch_prctl(2)-based API
- Expose status of LAM (or other thread features) in
/proc/$PID/arch_status

[1] ISE, Chapter 10. https://cdrdv2.intel.com/v1/dl/getContent/671368

Kirill A. Shutemov (12):
x86/mm: Rework address range check in get_user() and put_user()
x86: Allow atomic MM_CONTEXT flags setting
x86: CPUID and CR3/CR4 flags for Linear Address Masking
x86/mm: Handle LAM on context switch
mm: Introduce untagged_addr_remote()
x86/uaccess: Provide untagged_addr() and remove tags before address
check
x86/mm: Provide arch_prctl() interface for LAM
x86/mm: Reduce untagged_addr() overhead until the first LAM user
mm: Expose untagging mask in /proc/$PID/status
iommu/sva: Replace pasid_valid() helper with mm_valid_pasid()
x86/mm/iommu/sva: Make LAM and SVA mutually exclusive
selftests/x86/lam: Add test cases for LAM vs thread creation

Weihong Zhang (5):
selftests/x86/lam: Add malloc and tag-bits test cases for
linear-address masking
selftests/x86/lam: Add mmap and SYSCALL test cases for linear-address
masking
selftests/x86/lam: Add io_uring test cases for linear-address masking
selftests/x86/lam: Add inherit test cases for linear-address masking
selftests/x86/lam: Add ARCH_FORCE_TAGGED_SVA test cases for
linear-address masking

arch/arm64/include/asm/mmu_context.h | 6 +
arch/sparc/include/asm/mmu_context_64.h | 6 +
arch/sparc/include/asm/uaccess_64.h | 2 +
arch/x86/Kconfig | 11 +
arch/x86/entry/vsyscall/vsyscall_64.c | 2 +-
arch/x86/include/asm/cpufeatures.h | 1 +
arch/x86/include/asm/mmu.h | 18 +-
arch/x86/include/asm/mmu_context.h | 49 +-
arch/x86/include/asm/processor-flags.h | 2 +
arch/x86/include/asm/tlbflush.h | 48 +-
arch/x86/include/asm/uaccess.h | 35 +-
arch/x86/include/uapi/asm/prctl.h | 5 +
arch/x86/include/uapi/asm/processor-flags.h | 6 +
arch/x86/kernel/process.c | 6 +
arch/x86/kernel/process_64.c | 70 +-
arch/x86/kernel/traps.c | 6 +-
arch/x86/lib/getuser.S | 83 +-
arch/x86/lib/putuser.S | 54 +-
arch/x86/mm/init.c | 5 +
arch/x86/mm/tlb.c | 53 +-
drivers/iommu/iommu-sva.c | 8 +-
drivers/vfio/vfio_iommu_type1.c | 2 +-
fs/proc/array.c | 6 +
fs/proc/task_mmu.c | 9 +-
include/linux/ioasid.h | 9 -
include/linux/mm.h | 11 -
include/linux/mmu_context.h | 14 +
include/linux/sched/mm.h | 8 +-
include/linux/uaccess.h | 22 +
mm/debug.c | 1 +
mm/gup.c | 4 +-
mm/madvise.c | 5 +-
mm/migrate.c | 11 +-
tools/testing/selftests/x86/Makefile | 2 +-
tools/testing/selftests/x86/lam.c | 1241 +++++++++++++++++++
35 files changed, 1673 insertions(+), 148 deletions(-)
create mode 100644 tools/testing/selftests/x86/lam.c

--
2.38.2


2023-01-11 13:50:37

by Kirill A. Shutemov

Subject: [PATCHv14 15/17] selftests/x86/lam: Add inherit test cases for linear-address masking

From: Weihong Zhang <[email protected]>

LAM is enabled per-thread and gets inherited on fork(2)/clone(2). exec()
reverts LAM status to the default disabled state.

There are two test scenarios:

- Fork test cases:

These cases test that LAM is inherited across fork(2): a child process
created by fork() should inherit the LAM feature from its parent and
report the same LAM mode as the parent.

- Execve test cases:

Processes created by execve() differ from those created by fork(): they
revert LAM to the disabled state.
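
The inherit cases are selected with test bit 0x20 (a usage sketch, assuming
the selftest binary is built as "lam" as in the rest of the series):

	./lam -t 0x20

The execve case re-executes the test binary with "-t 0x0"; in that mode the
binary does not run any tests and only reports its current LAM mode back to
the parent via its exit status.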

Signed-off-by: Weihong Zhang <[email protected]>
Signed-off-by: Kirill A. Shutemov <[email protected]>
---
tools/testing/selftests/x86/lam.c | 125 +++++++++++++++++++++++++++++-
1 file changed, 121 insertions(+), 4 deletions(-)

diff --git a/tools/testing/selftests/x86/lam.c b/tools/testing/selftests/x86/lam.c
index 52750ebd0887..ebabd4333b7d 100644
--- a/tools/testing/selftests/x86/lam.c
+++ b/tools/testing/selftests/x86/lam.c
@@ -37,8 +37,9 @@
#define FUNC_MMAP 0x4
#define FUNC_SYSCALL 0x8
#define FUNC_URING 0x10
+#define FUNC_INHERITE 0x20

-#define TEST_MASK 0x1f
+#define TEST_MASK 0x3f

#define LOW_ADDR (0x1UL << 30)
#define HIGH_ADDR (0x3UL << 48)
@@ -174,6 +175,28 @@ static unsigned long get_default_tag_bits(void)
return lam;
}

+/*
+ * Read back the untag mask and translate it into a LAM mode:
+ * LAM_U57_BITS, LAM_NONE, or -1 on failure/unexpected mask.
+ */
+static int get_lam(void)
+{
+ uint64_t ptr = 0;
+ int ret = -1;
+ /* Get untagged mask */
+ if (syscall(SYS_arch_prctl, ARCH_GET_UNTAG_MASK, &ptr) == -1)
+ return -1;
+
+ /* Check mask returned is expected */
+ if (ptr == ~(LAM_U57_MASK))
+ ret = LAM_U57_BITS;
+ else if (ptr == -1ULL)
+ ret = LAM_NONE;
+
+
+ return ret;
+}
+
/* According to LAM mode, set metadata in high bits */
static uint64_t set_metadata(uint64_t src, unsigned long lam)
{
@@ -581,7 +604,7 @@ int do_uring(unsigned long lam)

switch (lam) {
case LAM_U57_BITS: /* Clear bits 62:57 */
- addr = (addr & ~(0x3fULL << 57));
+ addr = (addr & ~(LAM_U57_MASK));
break;
}
free((void *)addr);
@@ -632,6 +655,72 @@ static int fork_test(struct testcases *test)
return ret;
}

+static int handle_execve(struct testcases *test)
+{
+ int ret, child_ret;
+ int lam = test->lam;
+ pid_t pid;
+
+ pid = fork();
+ if (pid < 0) {
+ perror("Fork failed.");
+ ret = 1;
+ } else if (pid == 0) {
+ char path[PATH_MAX];
+
+ /* Set LAM mode in the child before it re-execs itself */
+ if (set_lam(lam) != 0)
+ return 1;
+
+ /* Get current binary's path and the binary was run by execve */
+ if (readlink("/proc/self/exe", path, PATH_MAX) <= 0)
+ exit(-1);
+
+ /* run binary to get LAM mode and return to parent process */
+ if (execlp(path, path, "-t 0x0", NULL) < 0) {
+ perror("error on exec");
+ exit(-1);
+ }
+ } else {
+ wait(&child_ret);
+ ret = WEXITSTATUS(child_ret);
+ if (ret != LAM_NONE)
+ return 1;
+ }
+
+ return 0;
+}
+
+static int handle_inheritance(struct testcases *test)
+{
+ int ret, child_ret;
+ int lam = test->lam;
+ pid_t pid;
+
+ /* Set LAM mode in parent process */
+ if (set_lam(lam) != 0)
+ return 1;
+
+ pid = fork();
+ if (pid < 0) {
+ perror("Fork failed.");
+ return 1;
+ } else if (pid == 0) {
+ /* Read the LAM mode in the child process */
+ int child_lam = get_lam();
+
+ exit(child_lam);
+ } else {
+ wait(&child_ret);
+ ret = WEXITSTATUS(child_ret);
+
+ if (lam != ret)
+ return 1;
+ }
+
+ return 0;
+}
+
static void run_test(struct testcases *test, int count)
{
int i, ret = 0;
@@ -740,11 +829,26 @@ static struct testcases mmap_cases[] = {
},
};

+static struct testcases inheritance_cases[] = {
+ {
+ .expected = 0,
+ .lam = LAM_U57_BITS,
+ .test_func = handle_inheritance,
+ .msg = "FORK: LAM_U57, child process should get LAM mode same as parent\n",
+ },
+ {
+ .expected = 0,
+ .lam = LAM_U57_BITS,
+ .test_func = handle_execve,
+ .msg = "EXECVE: LAM_U57, child process should get disabled LAM mode\n",
+ },
+};
+
static void cmd_help(void)
{
printf("usage: lam [-h] [-t test list]\n");
printf("\t-t test list: run tests specified in the test list, default:0x%x\n", TEST_MASK);
- printf("\t\t0x1:malloc; 0x2:max_bits; 0x4:mmap; 0x8:syscall; 0x10:io_uring.\n");
+ printf("\t\t0x1:malloc; 0x2:max_bits; 0x4:mmap; 0x8:syscall; 0x10:io_uring; 0x20:inherit;\n");
printf("\t-h: help\n");
}

@@ -764,7 +868,7 @@ int main(int argc, char **argv)
switch (c) {
case 't':
tests = strtoul(optarg, NULL, 16);
- if (!(tests & TEST_MASK)) {
+ if (tests && !(tests & TEST_MASK)) {
ksft_print_msg("Invalid argument!\n");
return -1;
}
@@ -778,6 +882,16 @@ int main(int argc, char **argv)
}
}

+ /*
+ * When tests is 0, this is not a real test run:
+ * the execve test case uses it to check the LAM mode of a process
+ * created by execve(). The process reads back its LAM mode and
+ * reports it to the parent via the exit status.
+ */
+ if (!tests)
+ return (get_lam());
+
+ /* Run test cases */
if (tests & FUNC_MALLOC)
run_test(malloc_cases, ARRAY_SIZE(malloc_cases));

@@ -793,6 +907,9 @@ int main(int argc, char **argv)
if (tests & FUNC_URING)
run_test(uring_cases, ARRAY_SIZE(uring_cases));

+ if (tests & FUNC_INHERITE)
+ run_test(inheritance_cases, ARRAY_SIZE(inheritance_cases));
+
ksft_set_plan(tests_cnt);

return ksft_exit_pass();
--
2.38.2

2023-01-11 13:50:37

by Kirill A. Shutemov

Subject: [PATCHv14 14/17] selftests/x86/lam: Add io_uring test cases for linear-address masking

From: Weihong Zhang <[email protected]>

LAM should also cover accesses done on behalf of the process by kernel
threads; use io_uring to verify this. The test cases read a file through
io_uring, using an iovec array as the receive buffer. Depending on the LAM
mode, metadata is set in the high bits of each buffer pointer.

io_uring has to handle buffers whose pointers carry metadata in the high
bits.
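
The io_uring cases are selected with test bit 0x10 (a usage sketch, assuming
the selftest binary is built as "lam"):

	./lam -t 0x10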

Signed-off-by: Weihong Zhang <[email protected]>
Signed-off-by: Kirill A. Shutemov <[email protected]>
---
tools/testing/selftests/x86/lam.c | 341 +++++++++++++++++++++++++++++-
1 file changed, 339 insertions(+), 2 deletions(-)

diff --git a/tools/testing/selftests/x86/lam.c b/tools/testing/selftests/x86/lam.c
index 39ebfc511685..52750ebd0887 100644
--- a/tools/testing/selftests/x86/lam.c
+++ b/tools/testing/selftests/x86/lam.c
@@ -9,8 +9,12 @@
#include <sys/mman.h>
#include <sys/utsname.h>
#include <sys/wait.h>
+#include <sys/stat.h>
+#include <fcntl.h>
#include <inttypes.h>

+#include <sys/uio.h>
+#include <linux/io_uring.h>
#include "../kselftest.h"

#ifndef __x86_64__
@@ -32,8 +36,9 @@
#define FUNC_BITS 0x2
#define FUNC_MMAP 0x4
#define FUNC_SYSCALL 0x8
+#define FUNC_URING 0x10

-#define TEST_MASK 0xf
+#define TEST_MASK 0x1f

#define LOW_ADDR (0x1UL << 30)
#define HIGH_ADDR (0x3UL << 48)
@@ -42,6 +47,13 @@

#define PAGE_SIZE (4 << 10)

+#define barrier() ({ \
+ __asm__ __volatile__("" : : : "memory"); \
+})
+
+#define URING_QUEUE_SZ 1
+#define URING_BLOCK_SZ 2048
+
struct testcases {
unsigned int later;
int expected; /* 2: SIGSEGV Error; 1: other errors */
@@ -51,6 +63,33 @@ struct testcases {
const char *msg;
};

+/* Used by CQ of uring, source file handler and file's size */
+struct file_io {
+ int file_fd;
+ off_t file_sz;
+ struct iovec iovecs[];
+};
+
+struct io_uring_queue {
+ unsigned int *head;
+ unsigned int *tail;
+ unsigned int *ring_mask;
+ unsigned int *ring_entries;
+ unsigned int *flags;
+ unsigned int *array;
+ union {
+ struct io_uring_cqe *cqes;
+ struct io_uring_sqe *sqes;
+ } queue;
+ size_t ring_sz;
+};
+
+struct io_ring {
+ int ring_fd;
+ struct io_uring_queue sq_ring;
+ struct io_uring_queue cq_ring;
+};
+
int tests_cnt;
jmp_buf segv_env;

@@ -294,6 +333,285 @@ static int handle_syscall(struct testcases *test)
return ret;
}

+int sys_uring_setup(unsigned int entries, struct io_uring_params *p)
+{
+ return (int)syscall(__NR_io_uring_setup, entries, p);
+}
+
+int sys_uring_enter(int fd, unsigned int to, unsigned int min, unsigned int flags)
+{
+ return (int)syscall(__NR_io_uring_enter, fd, to, min, flags, NULL, 0);
+}
+
+/* Init submission queue and completion queue */
+int mmap_io_uring(struct io_uring_params p, struct io_ring *s)
+{
+ struct io_uring_queue *sring = &s->sq_ring;
+ struct io_uring_queue *cring = &s->cq_ring;
+
+ sring->ring_sz = p.sq_off.array + p.sq_entries * sizeof(unsigned int);
+ cring->ring_sz = p.cq_off.cqes + p.cq_entries * sizeof(struct io_uring_cqe);
+
+ if (p.features & IORING_FEAT_SINGLE_MMAP) {
+ if (cring->ring_sz > sring->ring_sz)
+ sring->ring_sz = cring->ring_sz;
+
+ cring->ring_sz = sring->ring_sz;
+ }
+
+ void *sq_ptr = mmap(0, sring->ring_sz, PROT_READ | PROT_WRITE,
+ MAP_SHARED | MAP_POPULATE, s->ring_fd,
+ IORING_OFF_SQ_RING);
+
+ if (sq_ptr == MAP_FAILED) {
+ perror("sub-queue!");
+ return 1;
+ }
+
+ void *cq_ptr = sq_ptr;
+
+ if (!(p.features & IORING_FEAT_SINGLE_MMAP)) {
+ cq_ptr = mmap(0, cring->ring_sz, PROT_READ | PROT_WRITE,
+ MAP_SHARED | MAP_POPULATE, s->ring_fd,
+ IORING_OFF_CQ_RING);
+ if (cq_ptr == MAP_FAILED) {
+ perror("cpl-queue!");
+ munmap(sq_ptr, sring->ring_sz);
+ return 1;
+ }
+ }
+
+ sring->head = sq_ptr + p.sq_off.head;
+ sring->tail = sq_ptr + p.sq_off.tail;
+ sring->ring_mask = sq_ptr + p.sq_off.ring_mask;
+ sring->ring_entries = sq_ptr + p.sq_off.ring_entries;
+ sring->flags = sq_ptr + p.sq_off.flags;
+ sring->array = sq_ptr + p.sq_off.array;
+
+ /* Map a queue as mem map */
+ s->sq_ring.queue.sqes = mmap(0, p.sq_entries * sizeof(struct io_uring_sqe),
+ PROT_READ | PROT_WRITE, MAP_SHARED | MAP_POPULATE,
+ s->ring_fd, IORING_OFF_SQES);
+ if (s->sq_ring.queue.sqes == MAP_FAILED) {
+ munmap(sq_ptr, sring->ring_sz);
+ if (sq_ptr != cq_ptr) {
+ ksft_print_msg("failed to mmap uring queue!");
+ munmap(cq_ptr, cring->ring_sz);
+ return 1;
+ }
+ }
+
+ cring->head = cq_ptr + p.cq_off.head;
+ cring->tail = cq_ptr + p.cq_off.tail;
+ cring->ring_mask = cq_ptr + p.cq_off.ring_mask;
+ cring->ring_entries = cq_ptr + p.cq_off.ring_entries;
+ cring->queue.cqes = cq_ptr + p.cq_off.cqes;
+
+ return 0;
+}
+
+/* Init io_uring queues */
+int setup_io_uring(struct io_ring *s)
+{
+ struct io_uring_params para;
+
+ memset(&para, 0, sizeof(para));
+ s->ring_fd = sys_uring_setup(URING_QUEUE_SZ, &para);
+ if (s->ring_fd < 0)
+ return 1;
+
+ return mmap_io_uring(para, s);
+}
+
+/*
+ * Get data from the completion queue. The data buffers hold the file data.
+ * Return 0 on success; anything else is an error.
+ */
+int handle_uring_cq(struct io_ring *s)
+{
+ struct file_io *fi = NULL;
+ struct io_uring_queue *cring = &s->cq_ring;
+ struct io_uring_cqe *cqe;
+ unsigned int head;
+ off_t len = 0;
+
+ head = *cring->head;
+
+ do {
+ barrier();
+ if (head == *cring->tail)
+ break;
+ /* Get the entry */
+ cqe = &cring->queue.cqes[head & *s->cq_ring.ring_mask];
+ fi = (struct file_io *)cqe->user_data;
+ if (cqe->res < 0)
+ break;
+
+ int blocks = (int)(fi->file_sz + URING_BLOCK_SZ - 1) / URING_BLOCK_SZ;
+
+ for (int i = 0; i < blocks; i++)
+ len += fi->iovecs[i].iov_len;
+
+ head++;
+ } while (1);
+
+ *cring->head = head;
+ barrier();
+
+ return (len != fi->file_sz);
+}
+
+/*
+ * Submit to the submission queue using IORING_OP_READV.
+ * The buffers have metadata set in their high bits according to the LAM mode.
+ */
+int handle_uring_sq(struct io_ring *ring, struct file_io *fi, unsigned long lam)
+{
+ int file_fd = fi->file_fd;
+ struct io_uring_queue *sring = &ring->sq_ring;
+ unsigned int index = 0, cur_block = 0, tail = 0, next_tail = 0;
+ struct io_uring_sqe *sqe;
+
+ off_t remain = fi->file_sz;
+ int blocks = (int)(remain + URING_BLOCK_SZ - 1) / URING_BLOCK_SZ;
+
+ while (remain) {
+ off_t bytes = remain;
+ void *buf;
+
+ if (bytes > URING_BLOCK_SZ)
+ bytes = URING_BLOCK_SZ;
+
+ fi->iovecs[cur_block].iov_len = bytes;
+
+ if (posix_memalign(&buf, URING_BLOCK_SZ, URING_BLOCK_SZ))
+ return 1;
+
+ fi->iovecs[cur_block].iov_base = (void *)set_metadata((uint64_t)buf, lam);
+ remain -= bytes;
+ cur_block++;
+ }
+
+ next_tail = *sring->tail;
+ tail = next_tail;
+ next_tail++;
+
+ barrier();
+
+ index = tail & *ring->sq_ring.ring_mask;
+
+ sqe = &ring->sq_ring.queue.sqes[index];
+ sqe->fd = file_fd;
+ sqe->flags = 0;
+ sqe->opcode = IORING_OP_READV;
+ sqe->addr = (unsigned long)fi->iovecs;
+ sqe->len = blocks;
+ sqe->off = 0;
+ sqe->user_data = (uint64_t)fi;
+
+ sring->array[index] = index;
+ tail = next_tail;
+
+ if (*sring->tail != tail) {
+ *sring->tail = tail;
+ barrier();
+ }
+
+ if (sys_uring_enter(ring->ring_fd, 1, 1, IORING_ENTER_GETEVENTS) < 0)
+ return 1;
+
+ return 0;
+}
+
+/*
+ * Test LAM with async I/O via io_uring: read the current binary through
+ * io_uring, with metadata set in the iovec buffer pointers.
+ */
+int do_uring(unsigned long lam)
+{
+ struct io_ring *ring;
+ struct file_io *fi;
+ struct stat st;
+ int ret = 1;
+ char path[PATH_MAX];
+
+ /* get current process path */
+ if (readlink("/proc/self/exe", path, PATH_MAX) <= 0)
+ return 1;
+
+ int file_fd = open(path, O_RDONLY);
+
+ if (file_fd < 0)
+ return 1;
+
+ if (fstat(file_fd, &st) < 0)
+ return 1;
+
+ off_t file_sz = st.st_size;
+
+ int blocks = (int)(file_sz + URING_BLOCK_SZ - 1) / URING_BLOCK_SZ;
+
+ fi = malloc(sizeof(*fi) + sizeof(struct iovec) * blocks);
+ if (!fi)
+ return 1;
+
+ fi->file_sz = file_sz;
+ fi->file_fd = file_fd;
+
+ ring = malloc(sizeof(*ring));
+ if (!ring)
+ return 1;
+
+ memset(ring, 0, sizeof(struct io_ring));
+
+ if (setup_io_uring(ring))
+ goto out;
+
+ if (handle_uring_sq(ring, fi, lam))
+ goto out;
+
+ ret = handle_uring_cq(ring);
+
+out:
+ free(ring);
+
+ for (int i = 0; i < blocks; i++) {
+ if (fi->iovecs[i].iov_base) {
+ uint64_t addr = ((uint64_t)fi->iovecs[i].iov_base);
+
+ switch (lam) {
+ case LAM_U57_BITS: /* Clear bits 62:57 */
+ addr = (addr & ~(0x3fULL << 57));
+ break;
+ }
+ free((void *)addr);
+ fi->iovecs[i].iov_base = NULL;
+ }
+ }
+
+ free(fi);
+
+ return ret;
+}
+
+int handle_uring(struct testcases *test)
+{
+ int ret = 0;
+
+ if (test->later == 0 && test->lam != 0)
+ if (set_lam(test->lam) != 0)
+ return 1;
+
+ if (sigsetjmp(segv_env, 1) == 0) {
+ signal(SIGSEGV, segv_handler);
+ ret = do_uring(test->lam);
+ } else {
+ ret = 2;
+ }
+
+ return ret;
+}
+
static int fork_test(struct testcases *test)
{
int ret, child_ret;
@@ -340,6 +658,22 @@ static void run_test(struct testcases *test, int count)
}
}

+static struct testcases uring_cases[] = {
+ {
+ .later = 0,
+ .lam = LAM_U57_BITS,
+ .test_func = handle_uring,
+ .msg = "URING: LAM_U57. Dereferencing pointer with metadata\n",
+ },
+ {
+ .later = 1,
+ .expected = 1,
+ .lam = LAM_U57_BITS,
+ .test_func = handle_uring,
+ .msg = "URING:[Negative] Disable LAM. Dereferencing pointer with metadata.\n",
+ },
+};
+
static struct testcases malloc_cases[] = {
{
.later = 0,
@@ -410,7 +744,7 @@ static void cmd_help(void)
{
printf("usage: lam [-h] [-t test list]\n");
printf("\t-t test list: run tests specified in the test list, default:0x%x\n", TEST_MASK);
- printf("\t\t0x1:malloc; 0x2:max_bits; 0x4:mmap; 0x8:syscall.\n");
+ printf("\t\t0x1:malloc; 0x2:max_bits; 0x4:mmap; 0x8:syscall; 0x10:io_uring.\n");
printf("\t-h: help\n");
}

@@ -456,6 +790,9 @@ int main(int argc, char **argv)
if (tests & FUNC_SYSCALL)
run_test(syscall_cases, ARRAY_SIZE(syscall_cases));

+ if (tests & FUNC_URING)
+ run_test(uring_cases, ARRAY_SIZE(uring_cases));
+
ksft_set_plan(tests_cnt);

return ksft_exit_pass();
--
2.38.2

2023-01-11 13:51:35

by Kirill A. Shutemov

Subject: [PATCHv14 04/17] x86/mm: Handle LAM on context switch

The Linear Address Masking mode for userspace pointers is encoded in CR3
bits. The mode is selected per-process and stored in mm_context_t.

switch_mm_irqs_off() now respects the selected LAM mode and constructs CR3
accordingly.

The active LAM mode gets recorded in the tlb_state.
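
Schematically, for a process with LAM_U57 enabled the CR3 value built by
build_cr3() below becomes (a sketch; per the ISE spec LAM_U57 is CR3 bit 61
and LAM_U48 is CR3 bit 62):

	CR3 = __sme_pa(pgd) | kern_pcid(asid) | X86_CR3_LAM_U57

tlb_state only caches the LAM bits shifted down by X86_CR3_LAM_U57_BIT, so
they fit in a u8 and can be OR'ed back into CR3 via tlbstate_lam_cr3_mask().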

Signed-off-by: Kirill A. Shutemov <[email protected]>
Acked-by: Peter Zijlstra (Intel) <[email protected]>
Tested-by: Alexander Potapenko <[email protected]>
---
arch/x86/include/asm/mmu.h | 5 +++
arch/x86/include/asm/mmu_context.h | 24 ++++++++++++++
arch/x86/include/asm/tlbflush.h | 38 ++++++++++++++++++++-
arch/x86/mm/tlb.c | 53 +++++++++++++++++++++---------
4 files changed, 103 insertions(+), 17 deletions(-)

diff --git a/arch/x86/include/asm/mmu.h b/arch/x86/include/asm/mmu.h
index efa3eaee522c..22fc9fbf1d0a 100644
--- a/arch/x86/include/asm/mmu.h
+++ b/arch/x86/include/asm/mmu.h
@@ -42,6 +42,11 @@ typedef struct {
unsigned long flags;
#endif

+#ifdef CONFIG_ADDRESS_MASKING
+ /* Active LAM mode: X86_CR3_LAM_U48 or X86_CR3_LAM_U57 or 0 (disabled) */
+ unsigned long lam_cr3_mask;
+#endif
+
struct mutex lock;
void __user *vdso; /* vdso base address */
const struct vdso_image *vdso_image; /* vdso image in use */
diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
index 53ef591a6166..8388fccc4700 100644
--- a/arch/x86/include/asm/mmu_context.h
+++ b/arch/x86/include/asm/mmu_context.h
@@ -91,6 +91,29 @@ static inline void switch_ldt(struct mm_struct *prev, struct mm_struct *next)
}
#endif

+#ifdef CONFIG_ADDRESS_MASKING
+static inline unsigned long mm_lam_cr3_mask(struct mm_struct *mm)
+{
+ return READ_ONCE(mm->context.lam_cr3_mask);
+}
+
+static inline void dup_lam(struct mm_struct *oldmm, struct mm_struct *mm)
+{
+ mm->context.lam_cr3_mask = oldmm->context.lam_cr3_mask;
+}
+
+#else
+
+static inline unsigned long mm_lam_cr3_mask(struct mm_struct *mm)
+{
+ return 0;
+}
+
+static inline void dup_lam(struct mm_struct *oldmm, struct mm_struct *mm)
+{
+}
+#endif
+
#define enter_lazy_tlb enter_lazy_tlb
extern void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk);

@@ -168,6 +191,7 @@ static inline int arch_dup_mmap(struct mm_struct *oldmm, struct mm_struct *mm)
{
arch_dup_pkeys(oldmm, mm);
paravirt_arch_dup_mmap(oldmm, mm);
+ dup_lam(oldmm, mm);
return ldt_dup_context(oldmm, mm);
}

diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index cda3118f3b27..e8b47f57bd4a 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -2,7 +2,7 @@
#ifndef _ASM_X86_TLBFLUSH_H
#define _ASM_X86_TLBFLUSH_H

-#include <linux/mm.h>
+#include <linux/mm_types.h>
#include <linux/sched.h>

#include <asm/processor.h>
@@ -12,6 +12,7 @@
#include <asm/invpcid.h>
#include <asm/pti.h>
#include <asm/processor-flags.h>
+#include <asm/pgtable.h>

void __flush_tlb_all(void);

@@ -101,6 +102,16 @@ struct tlb_state {
*/
bool invalidate_other;

+#ifdef CONFIG_ADDRESS_MASKING
+ /*
+ * Active LAM mode.
+ *
+ * X86_CR3_LAM_U57/U48 shifted right by X86_CR3_LAM_U57_BIT or 0 if LAM
+ * disabled.
+ */
+ u8 lam;
+#endif
+
/*
* Mask that contains TLB_NR_DYN_ASIDS+1 bits to indicate
* the corresponding user PCID needs a flush next time we
@@ -357,6 +368,31 @@ static inline bool huge_pmd_needs_flush(pmd_t oldpmd, pmd_t newpmd)
}
#define huge_pmd_needs_flush huge_pmd_needs_flush

+#ifdef CONFIG_ADDRESS_MASKING
+static inline u64 tlbstate_lam_cr3_mask(void)
+{
+ u64 lam = this_cpu_read(cpu_tlbstate.lam);
+
+ return lam << X86_CR3_LAM_U57_BIT;
+}
+
+static inline void set_tlbstate_lam_mode(struct mm_struct *mm)
+{
+ this_cpu_write(cpu_tlbstate.lam,
+ mm->context.lam_cr3_mask >> X86_CR3_LAM_U57_BIT);
+}
+
+#else
+
+static inline u64 tlbstate_lam_cr3_mask(void)
+{
+ return 0;
+}
+
+static inline void set_tlbstate_lam_mode(struct mm_struct *mm)
+{
+}
+#endif
#endif /* !MODULE */

static inline void __native_tlb_flush_global(unsigned long cr4)
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index c1e31e9a85d7..8c330a6d0ece 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -154,26 +154,30 @@ static inline u16 user_pcid(u16 asid)
return ret;
}

-static inline unsigned long build_cr3(pgd_t *pgd, u16 asid)
+static inline unsigned long build_cr3(pgd_t *pgd, u16 asid, unsigned long lam)
{
+ unsigned long cr3 = __sme_pa(pgd) | lam;
+
if (static_cpu_has(X86_FEATURE_PCID)) {
- return __sme_pa(pgd) | kern_pcid(asid);
+ VM_WARN_ON_ONCE(asid > MAX_ASID_AVAILABLE);
+ cr3 |= kern_pcid(asid);
} else {
VM_WARN_ON_ONCE(asid != 0);
- return __sme_pa(pgd);
}
+
+ return cr3;
}

-static inline unsigned long build_cr3_noflush(pgd_t *pgd, u16 asid)
+static inline unsigned long build_cr3_noflush(pgd_t *pgd, u16 asid,
+ unsigned long lam)
{
- VM_WARN_ON_ONCE(asid > MAX_ASID_AVAILABLE);
/*
* Use boot_cpu_has() instead of this_cpu_has() as this function
* might be called during early boot. This should work even after
* boot because all CPU's the have same capabilities:
*/
VM_WARN_ON_ONCE(!boot_cpu_has(X86_FEATURE_PCID));
- return __sme_pa(pgd) | kern_pcid(asid) | CR3_NOFLUSH;
+ return build_cr3(pgd, asid, lam) | CR3_NOFLUSH;
}

/*
@@ -274,15 +278,16 @@ static inline void invalidate_user_asid(u16 asid)
(unsigned long *)this_cpu_ptr(&cpu_tlbstate.user_pcid_flush_mask));
}

-static void load_new_mm_cr3(pgd_t *pgdir, u16 new_asid, bool need_flush)
+static void load_new_mm_cr3(pgd_t *pgdir, u16 new_asid, unsigned long lam,
+ bool need_flush)
{
unsigned long new_mm_cr3;

if (need_flush) {
invalidate_user_asid(new_asid);
- new_mm_cr3 = build_cr3(pgdir, new_asid);
+ new_mm_cr3 = build_cr3(pgdir, new_asid, lam);
} else {
- new_mm_cr3 = build_cr3_noflush(pgdir, new_asid);
+ new_mm_cr3 = build_cr3_noflush(pgdir, new_asid, lam);
}

/*
@@ -491,6 +496,7 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
{
struct mm_struct *real_prev = this_cpu_read(cpu_tlbstate.loaded_mm);
u16 prev_asid = this_cpu_read(cpu_tlbstate.loaded_mm_asid);
+ unsigned long new_lam = mm_lam_cr3_mask(next);
bool was_lazy = this_cpu_read(cpu_tlbstate_shared.is_lazy);
unsigned cpu = smp_processor_id();
u64 next_tlb_gen;
@@ -520,7 +526,8 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
* isn't free.
*/
#ifdef CONFIG_DEBUG_VM
- if (WARN_ON_ONCE(__read_cr3() != build_cr3(real_prev->pgd, prev_asid))) {
+ if (WARN_ON_ONCE(__read_cr3() != build_cr3(real_prev->pgd, prev_asid,
+ tlbstate_lam_cr3_mask()))) {
/*
* If we were to BUG here, we'd be very likely to kill
* the system so hard that we don't see the call trace.
@@ -552,9 +559,15 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
* instruction.
*/
if (real_prev == next) {
+ /* Not actually switching mm's */
VM_WARN_ON(this_cpu_read(cpu_tlbstate.ctxs[prev_asid].ctx_id) !=
next->context.ctx_id);

+ /*
+ * If this races with another thread that enables lam, 'new_lam'
+ * might not match tlbstate_lam_cr3_mask().
+ */
+
/*
* Even in lazy TLB mode, the CPU should stay set in the
* mm_cpumask. The TLB shootdown code can figure out from
@@ -622,15 +635,16 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
barrier();
}

+ set_tlbstate_lam_mode(next);
if (need_flush) {
this_cpu_write(cpu_tlbstate.ctxs[new_asid].ctx_id, next->context.ctx_id);
this_cpu_write(cpu_tlbstate.ctxs[new_asid].tlb_gen, next_tlb_gen);
- load_new_mm_cr3(next->pgd, new_asid, true);
+ load_new_mm_cr3(next->pgd, new_asid, new_lam, true);

trace_tlb_flush(TLB_FLUSH_ON_TASK_SWITCH, TLB_FLUSH_ALL);
} else {
/* The new ASID is already up to date. */
- load_new_mm_cr3(next->pgd, new_asid, false);
+ load_new_mm_cr3(next->pgd, new_asid, new_lam, false);

trace_tlb_flush(TLB_FLUSH_ON_TASK_SWITCH, 0);
}
@@ -691,6 +705,10 @@ void initialize_tlbstate_and_flush(void)
/* Assert that CR3 already references the right mm. */
WARN_ON((cr3 & CR3_ADDR_MASK) != __pa(mm->pgd));

+ /* LAM expected to be disabled */
+ WARN_ON(cr3 & (X86_CR3_LAM_U48 | X86_CR3_LAM_U57));
+ WARN_ON(mm_lam_cr3_mask(mm));
+
/*
* Assert that CR4.PCIDE is set if needed. (CR4.PCIDE initialization
* doesn't work like other CR4 bits because it can only be set from
@@ -699,8 +717,8 @@ void initialize_tlbstate_and_flush(void)
WARN_ON(boot_cpu_has(X86_FEATURE_PCID) &&
!(cr4_read_shadow() & X86_CR4_PCIDE));

- /* Force ASID 0 and force a TLB flush. */
- write_cr3(build_cr3(mm->pgd, 0));
+ /* Disable LAM, force ASID 0 and force a TLB flush. */
+ write_cr3(build_cr3(mm->pgd, 0, 0));

/* Reinitialize tlbstate. */
this_cpu_write(cpu_tlbstate.last_user_mm_spec, LAST_USER_MM_INIT);
@@ -708,6 +726,7 @@ void initialize_tlbstate_and_flush(void)
this_cpu_write(cpu_tlbstate.next_asid, 1);
this_cpu_write(cpu_tlbstate.ctxs[0].ctx_id, mm->context.ctx_id);
this_cpu_write(cpu_tlbstate.ctxs[0].tlb_gen, tlb_gen);
+ set_tlbstate_lam_mode(mm);

for (i = 1; i < TLB_NR_DYN_ASIDS; i++)
this_cpu_write(cpu_tlbstate.ctxs[i].ctx_id, 0);
@@ -1071,8 +1090,10 @@ void flush_tlb_kernel_range(unsigned long start, unsigned long end)
*/
unsigned long __get_current_cr3_fast(void)
{
- unsigned long cr3 = build_cr3(this_cpu_read(cpu_tlbstate.loaded_mm)->pgd,
- this_cpu_read(cpu_tlbstate.loaded_mm_asid));
+ unsigned long cr3 =
+ build_cr3(this_cpu_read(cpu_tlbstate.loaded_mm)->pgd,
+ this_cpu_read(cpu_tlbstate.loaded_mm_asid),
+ tlbstate_lam_cr3_mask());

/* For now, be very restrictive about when this can be called. */
VM_WARN_ON(in_nmi() || preemptible());
--
2.38.2

2023-01-11 13:51:39

by Kirill A. Shutemov

Subject: [PATCHv14 10/17] iommu/sva: Replace pasid_valid() helper with mm_valid_pasid()

The kernel has a few users of pasid_valid() and all but one check whether
the process has a PASID allocated. The helper takes ioasid_t as input.

Replace the helper with mm_valid_pasid(), which takes mm_struct as the
argument. The only caller that checks a PASID not tied to an mm_struct is
now open-coded.

This is a preparatory patch. It helps avoid ifdeffery: there is no need to
dereference mm->pasid in generic code to check whether the process has a
PASID.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
arch/x86/kernel/traps.c | 6 +++---
drivers/iommu/iommu-sva.c | 4 ++--
include/linux/ioasid.h | 9 ---------
include/linux/sched/mm.h | 8 +++++++-
4 files changed, 12 insertions(+), 15 deletions(-)

diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index d317dc3d06a3..8b83d8fbce71 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -671,15 +671,15 @@ static bool try_fixup_enqcmd_gp(void)
if (!cpu_feature_enabled(X86_FEATURE_ENQCMD))
return false;

- pasid = current->mm->pasid;
-
/*
* If the mm has not been allocated a
* PASID, the #GP can not be fixed up.
*/
- if (!pasid_valid(pasid))
+ if (!mm_valid_pasid(current->mm))
return false;

+ pasid = current->mm->pasid;
+
/*
* Did this thread already have its PASID activated?
* If so, the #GP must be from something else.
diff --git a/drivers/iommu/iommu-sva.c b/drivers/iommu/iommu-sva.c
index 24bf9b2b58aa..4ee2929f0d7a 100644
--- a/drivers/iommu/iommu-sva.c
+++ b/drivers/iommu/iommu-sva.c
@@ -34,14 +34,14 @@ int iommu_sva_alloc_pasid(struct mm_struct *mm, ioasid_t min, ioasid_t max)

mutex_lock(&iommu_sva_lock);
/* Is a PASID already associated with this mm? */
- if (pasid_valid(mm->pasid)) {
+ if (mm_valid_pasid(mm)) {
if (mm->pasid < min || mm->pasid >= max)
ret = -EOVERFLOW;
goto out;
}

pasid = ioasid_alloc(&iommu_sva_pasid, min, max, mm);
- if (!pasid_valid(pasid))
+ if (pasid == INVALID_IOASID)
ret = -ENOMEM;
else
mm_pasid_set(mm, pasid);
diff --git a/include/linux/ioasid.h b/include/linux/ioasid.h
index af1c9d62e642..836ae09e92c2 100644
--- a/include/linux/ioasid.h
+++ b/include/linux/ioasid.h
@@ -40,10 +40,6 @@ void *ioasid_find(struct ioasid_set *set, ioasid_t ioasid,
int ioasid_register_allocator(struct ioasid_allocator_ops *allocator);
void ioasid_unregister_allocator(struct ioasid_allocator_ops *allocator);
int ioasid_set_data(ioasid_t ioasid, void *data);
-static inline bool pasid_valid(ioasid_t ioasid)
-{
- return ioasid != INVALID_IOASID;
-}

#else /* !CONFIG_IOASID */
static inline ioasid_t ioasid_alloc(struct ioasid_set *set, ioasid_t min,
@@ -74,10 +70,5 @@ static inline int ioasid_set_data(ioasid_t ioasid, void *data)
return -ENOTSUPP;
}

-static inline bool pasid_valid(ioasid_t ioasid)
-{
- return false;
-}
-
#endif /* CONFIG_IOASID */
#endif /* __LINUX_IOASID_H */
diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
index 2a243616f222..b69fe7e8c0ac 100644
--- a/include/linux/sched/mm.h
+++ b/include/linux/sched/mm.h
@@ -457,6 +457,11 @@ static inline void mm_pasid_init(struct mm_struct *mm)
mm->pasid = INVALID_IOASID;
}

+static inline bool mm_valid_pasid(struct mm_struct *mm)
+{
+ return mm->pasid != INVALID_IOASID;
+}
+
/* Associate a PASID with an mm_struct: */
static inline void mm_pasid_set(struct mm_struct *mm, u32 pasid)
{
@@ -465,13 +470,14 @@ static inline void mm_pasid_set(struct mm_struct *mm, u32 pasid)

static inline void mm_pasid_drop(struct mm_struct *mm)
{
- if (pasid_valid(mm->pasid)) {
+ if (mm_valid_pasid(mm)) {
ioasid_free(mm->pasid);
mm->pasid = INVALID_IOASID;
}
}
#else
static inline void mm_pasid_init(struct mm_struct *mm) {}
+static inline bool mm_valid_pasid(struct mm_struct *mm) { return false; }
static inline void mm_pasid_set(struct mm_struct *mm, u32 pasid) {}
static inline void mm_pasid_drop(struct mm_struct *mm) {}
#endif
--
2.38.2

2023-01-11 13:56:53

by Kirill A. Shutemov

Subject: [PATCHv14 12/17] selftests/x86/lam: Add malloc and tag-bits test cases for linear-address masking

From: Weihong Zhang <[email protected]>

LAM is supported only in 64-bit mode and applies only to addresses used for
data accesses. In 64-bit mode, linear addresses have 64 bits. LAM is applied
to 64-bit linear addresses and allows software to use the high bits for
metadata.
LAM supports configurations that differ in which pointer bits are masked
and can be used for metadata.

LAM includes the following mode:

- LAM_U57: pointer bits in positions 62:57 are masked (LAM width 6),
allowing bits 62:57 of a user pointer to be used as metadata.

There are several new arch_prctl()s:
ARCH_ENABLE_TAGGED_ADDR: enable LAM mode, mask the high bits of a user pointer.
ARCH_GET_UNTAG_MASK: get the current untag mask.
ARCH_GET_MAX_TAG_BITS: the maximum number of tag bits a user can request;
zero if LAM is not supported.
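
For LAM_U57 the numbers work out as follows (a worked example matching the
LAM_U57_MASK definition in the test below): the tag occupies bits 62:57, so
the tag mask is 0x3f << 57 = 0x7e00000000000000, ARCH_GET_UNTAG_MASK returns
its complement 0x81ffffffffffffff, and ARCH_GET_MAX_TAG_BITS returns 6. With
LAM disabled the untag mask is all-ones, 0xffffffffffffffff.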

The LAM mode is per-process and a process has only one chance to set it.
There is no API to disable LAM mode, so all test cases are run in a child
process.

Functions of this test:

MALLOC

- LAM_U57 masks bits 62:57 of a user pointer. A userspace process
can dereference such pointers.

- With LAM disabled, dereferencing a pointer with metadata above bit 47 or
bit 56 triggers SIGSEGV.

TAG_BITS

- The maximum number of tag bits for LAM_U57 is 6.

Signed-off-by: Weihong Zhang <[email protected]>
Signed-off-by: Kirill A. Shutemov <[email protected]>
---
tools/testing/selftests/x86/Makefile | 2 +-
tools/testing/selftests/x86/lam.c | 326 +++++++++++++++++++++++++++
2 files changed, 327 insertions(+), 1 deletion(-)
create mode 100644 tools/testing/selftests/x86/lam.c

diff --git a/tools/testing/selftests/x86/Makefile b/tools/testing/selftests/x86/Makefile
index 0388c4d60af0..c1a16a9d4f2f 100644
--- a/tools/testing/selftests/x86/Makefile
+++ b/tools/testing/selftests/x86/Makefile
@@ -18,7 +18,7 @@ TARGETS_C_32BIT_ONLY := entry_from_vm86 test_syscall_vdso unwind_vdso \
test_FCMOV test_FCOMI test_FISTTP \
vdso_restorer
TARGETS_C_64BIT_ONLY := fsgsbase sysret_rip syscall_numbering \
- corrupt_xstate_header amx
+ corrupt_xstate_header amx lam
# Some selftests require 32bit support enabled also on 64bit systems
TARGETS_C_32BIT_NEEDED := ldt_gdt ptrace_syscall

diff --git a/tools/testing/selftests/x86/lam.c b/tools/testing/selftests/x86/lam.c
new file mode 100644
index 000000000000..268c1d2749af
--- /dev/null
+++ b/tools/testing/selftests/x86/lam.c
@@ -0,0 +1,326 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/syscall.h>
+#include <time.h>
+#include <signal.h>
+#include <setjmp.h>
+#include <sys/mman.h>
+#include <sys/wait.h>
+#include <inttypes.h>
+
+#include "../kselftest.h"
+
+#ifndef __x86_64__
+# error This test is 64-bit only
+#endif
+
+/* LAM modes, these definitions were copied from kernel code */
+#define LAM_NONE 0
+#define LAM_U57_BITS 6
+
+#define LAM_U57_MASK (0x3fULL << 57)
+/* arch prctl for LAM */
+#define ARCH_GET_UNTAG_MASK 0x4001
+#define ARCH_ENABLE_TAGGED_ADDR 0x4002
+#define ARCH_GET_MAX_TAG_BITS 0x4003
+
+/* Specified test function bits */
+#define FUNC_MALLOC 0x1
+#define FUNC_BITS 0x2
+
+#define TEST_MASK 0x3
+
+#define MALLOC_LEN 32
+
+struct testcases {
+ unsigned int later;
+ int expected; /* 2: SIGSEGV Error; 1: other errors */
+ unsigned long lam;
+ uint64_t addr;
+ int (*test_func)(struct testcases *test);
+ const char *msg;
+};
+
+int tests_cnt;
+jmp_buf segv_env;
+
+static void segv_handler(int sig)
+{
+ ksft_print_msg("Get segmentation fault(%d).", sig);
+ siglongjmp(segv_env, 1);
+}
+
+static inline int cpu_has_lam(void)
+{
+ unsigned int cpuinfo[4];
+
+ __cpuid_count(0x7, 1, cpuinfo[0], cpuinfo[1], cpuinfo[2], cpuinfo[3]);
+
+ return (cpuinfo[0] & (1 << 26));
+}
+
+/*
+ * Set the LAM mode and read back the untag mask.
+ * Check if the untag mask is as expected.
+ *
+ * @return:
+ * 0: Set LAM mode successfully
+ * others: failed to set LAM
+ */
+static int set_lam(unsigned long lam)
+{
+ int ret = 0;
+ uint64_t ptr = 0;
+
+ if (lam != LAM_U57_BITS && lam != LAM_NONE)
+ return -1;
+
+ /* Skip check return */
+ syscall(SYS_arch_prctl, ARCH_ENABLE_TAGGED_ADDR, lam);
+
+ /* Get untagged mask */
+ syscall(SYS_arch_prctl, ARCH_GET_UNTAG_MASK, &ptr);
+
+ /* Check mask returned is expected */
+ if (lam == LAM_U57_BITS)
+ ret = (ptr != ~(LAM_U57_MASK));
+ else if (lam == LAM_NONE)
+ ret = (ptr != -1ULL);
+
+ return ret;
+}
+
+static unsigned long get_default_tag_bits(void)
+{
+ pid_t pid;
+ int lam = LAM_NONE;
+ int ret = 0;
+
+ pid = fork();
+ if (pid < 0) {
+ perror("Fork failed.");
+ } else if (pid == 0) {
+ /* Set LAM mode in child process */
+ if (set_lam(LAM_U57_BITS) == 0)
+ lam = LAM_U57_BITS;
+ else
+ lam = LAM_NONE;
+ exit(lam);
+ } else {
+ wait(&ret);
+ lam = WEXITSTATUS(ret);
+ }
+
+ return lam;
+}
+
+/* According to LAM mode, set metadata in high bits */
+static uint64_t set_metadata(uint64_t src, unsigned long lam)
+{
+ uint64_t metadata;
+
+ srand(time(NULL));
+
+ switch (lam) {
+ case LAM_U57_BITS: /* Set metadata in bits 62:57 */
+ /* Get a random non-zero value as metadata */
+ metadata = (rand() % ((1UL << LAM_U57_BITS) - 1) + 1) << 57;
+ metadata |= (src & ~(LAM_U57_MASK));
+ break;
+ default:
+ metadata = src;
+ break;
+ }
+
+ return metadata;
+}
+
+/*
+ * Set metadata in a user pointer and compare the new pointer with the
+ * original pointer. Both should point to the same address.
+ *
+ * @return:
+ * 0: the value seen through the pointer with metadata matches the original
+ * 1: the values differ.
+ */
+static int handle_lam_test(void *src, unsigned int lam)
+{
+ char *ptr;
+
+ strcpy((char *)src, "USER POINTER");
+
+ ptr = (char *)set_metadata((uint64_t)src, lam);
+ if (src == ptr)
+ return 0;
+
+ /* Copy a string into the pointer with metadata */
+ strcpy((char *)ptr, "METADATA POINTER");
+
+ return (!!strcmp((char *)src, (char *)ptr));
+}
+
+
+int handle_max_bits(struct testcases *test)
+{
+ unsigned long exp_bits = get_default_tag_bits();
+ unsigned long bits = 0;
+
+ if (exp_bits != LAM_NONE)
+ exp_bits = LAM_U57_BITS;
+
+ /* Get LAM max tag bits */
+ if (syscall(SYS_arch_prctl, ARCH_GET_MAX_TAG_BITS, &bits) == -1)
+ return 1;
+
+ return (exp_bits != bits);
+}
+
+/*
+ * Test the LAM feature by dereferencing a pointer obtained from malloc().
+ * @return 0: test passed; 1: failure during test; 2: got SIGSEGV
+ */
+static int handle_malloc(struct testcases *test)
+{
+ char *ptr = NULL;
+ int ret = 0;
+
+ if (test->later == 0 && test->lam != 0)
+ if (set_lam(test->lam) == -1)
+ return 1;
+
+ ptr = (char *)malloc(MALLOC_LEN);
+ if (ptr == NULL) {
+ perror("malloc() failure\n");
+ return 1;
+ }
+
+ /* Set signal handler */
+ if (sigsetjmp(segv_env, 1) == 0) {
+ signal(SIGSEGV, segv_handler);
+ ret = handle_lam_test(ptr, test->lam);
+ } else {
+ ret = 2;
+ }
+
+ if (test->later != 0 && test->lam != 0)
+ if (set_lam(test->lam) == -1 && ret == 0)
+ ret = 1;
+
+ free(ptr);
+
+ return ret;
+}
+
+static int fork_test(struct testcases *test)
+{
+ int ret, child_ret;
+ pid_t pid;
+
+ pid = fork();
+ if (pid < 0) {
+ perror("Fork failed.");
+ ret = 1;
+ } else if (pid == 0) {
+ ret = test->test_func(test);
+ exit(ret);
+ } else {
+ wait(&child_ret);
+ ret = WEXITSTATUS(child_ret);
+ }
+
+ return ret;
+}
+
+static void run_test(struct testcases *test, int count)
+{
+ int i, ret = 0;
+
+ for (i = 0; i < count; i++) {
+ struct testcases *t = test + i;
+
+ /* fork a process to run test case */
+ ret = fork_test(t);
+ if (ret != 0)
+ ret = (t->expected == ret);
+ else
+ ret = !(t->expected);
+
+ tests_cnt++;
+ ksft_test_result(ret, t->msg);
+ }
+}
+
+static struct testcases malloc_cases[] = {
+ {
+ .later = 0,
+ .lam = LAM_U57_BITS,
+ .test_func = handle_malloc,
+ .msg = "MALLOC: LAM_U57. Dereferencing pointer with metadata\n",
+ },
+ {
+ .later = 1,
+ .expected = 2,
+ .lam = LAM_U57_BITS,
+ .test_func = handle_malloc,
+ .msg = "MALLOC:[Negative] Disable LAM. Dereferencing pointer with metadata.\n",
+ },
+};
+
+
+static struct testcases bits_cases[] = {
+ {
+ .test_func = handle_max_bits,
+ .msg = "BITS: Check default tag bits\n",
+ },
+};
+
+static void cmd_help(void)
+{
+ printf("usage: lam [-h] [-t test list]\n");
+ printf("\t-t test list: run tests specified in the test list, default:0x%x\n", TEST_MASK);
+ printf("\t\t0x1:malloc; 0x2:max_bits;\n");
+ printf("\t-h: help\n");
+}
+
+int main(int argc, char **argv)
+{
+ int c = 0;
+ unsigned int tests = TEST_MASK;
+
+ tests_cnt = 0;
+
+ if (!cpu_has_lam()) {
+ ksft_print_msg("Unsupported LAM feature!\n");
+ return -1;
+ }
+
+ while ((c = getopt(argc, argv, "ht:")) != -1) {
+ switch (c) {
+ case 't':
+ tests = strtoul(optarg, NULL, 16);
+ if (!(tests & TEST_MASK)) {
+ ksft_print_msg("Invalid argument!\n");
+ return -1;
+ }
+ break;
+ case 'h':
+ cmd_help();
+ return 0;
+ default:
+ ksft_print_msg("Invalid argument\n");
+ return -1;
+ }
+ }
+
+ if (tests & FUNC_MALLOC)
+ run_test(malloc_cases, ARRAY_SIZE(malloc_cases));
+
+ if (tests & FUNC_BITS)
+ run_test(bits_cases, ARRAY_SIZE(bits_cases));
+
+ ksft_set_plan(tests_cnt);
+
+ return ksft_exit_pass();
+}
--
2.38.2

2023-01-11 13:59:26

by Kirill A. Shutemov

Subject: [PATCHv14 09/17] mm: Expose untagging mask in /proc/$PID/status

Add a line to /proc/$PID/status to report untag_mask. It can be used to
find out the LAM status of a process from the outside, which is useful
for debuggers.
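
For example (output format as implemented below; the LAM_U57 value is what
the x86 untag mask in this series works out to):

	$ grep untag_mask /proc/self/status
	untag_mask:	0xffffffffffffffff

A process that has enabled LAM_U57 reports 0x81ffffffffffffff instead.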

Signed-off-by: Kirill A. Shutemov <[email protected]>
Acked-by: Catalin Marinas <[email protected]>
---
arch/arm64/include/asm/mmu_context.h | 6 ++++++
arch/sparc/include/asm/mmu_context_64.h | 6 ++++++
arch/x86/include/asm/mmu_context.h | 6 ++++++
fs/proc/array.c | 6 ++++++
include/linux/mmu_context.h | 7 +++++++
5 files changed, 31 insertions(+)

diff --git a/arch/arm64/include/asm/mmu_context.h b/arch/arm64/include/asm/mmu_context.h
index 72dbd6400549..56911691bef0 100644
--- a/arch/arm64/include/asm/mmu_context.h
+++ b/arch/arm64/include/asm/mmu_context.h
@@ -288,6 +288,12 @@ void post_ttbr_update_workaround(void);
unsigned long arm64_mm_context_get(struct mm_struct *mm);
void arm64_mm_context_put(struct mm_struct *mm);

+#define mm_untag_mask mm_untag_mask
+static inline unsigned long mm_untag_mask(struct mm_struct *mm)
+{
+ return -1UL >> 8;
+}
+
#include <asm-generic/mmu_context.h>

#endif /* !__ASSEMBLY__ */
diff --git a/arch/sparc/include/asm/mmu_context_64.h b/arch/sparc/include/asm/mmu_context_64.h
index 7a8380c63aab..799e797c5cdd 100644
--- a/arch/sparc/include/asm/mmu_context_64.h
+++ b/arch/sparc/include/asm/mmu_context_64.h
@@ -185,6 +185,12 @@ static inline void finish_arch_post_lock_switch(void)
}
}

+#define mm_untag_mask mm_untag_mask
+static inline unsigned long mm_untag_mask(struct mm_struct *mm)
+{
+ return -1UL >> adi_nbits();
+}
+
#include <asm-generic/mmu_context.h>

#endif /* !(__ASSEMBLY__) */
diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
index 1d0b743daebb..4b77c96da07f 100644
--- a/arch/x86/include/asm/mmu_context.h
+++ b/arch/x86/include/asm/mmu_context.h
@@ -103,6 +103,12 @@ static inline void dup_lam(struct mm_struct *oldmm, struct mm_struct *mm)
mm->context.untag_mask = oldmm->context.untag_mask;
}

+#define mm_untag_mask mm_untag_mask
+static inline unsigned long mm_untag_mask(struct mm_struct *mm)
+{
+ return mm->context.untag_mask;
+}
+
static inline void mm_reset_untag_mask(struct mm_struct *mm)
{
mm->context.untag_mask = -1UL;
diff --git a/fs/proc/array.c b/fs/proc/array.c
index 49283b8103c7..d2a94eafe9a3 100644
--- a/fs/proc/array.c
+++ b/fs/proc/array.c
@@ -428,6 +428,11 @@ static inline void task_thp_status(struct seq_file *m, struct mm_struct *mm)
seq_printf(m, "THP_enabled:\t%d\n", thp_enabled);
}

+static inline void task_untag_mask(struct seq_file *m, struct mm_struct *mm)
+{
+ seq_printf(m, "untag_mask:\t%#lx\n", mm_untag_mask(mm));
+}
+
int proc_pid_status(struct seq_file *m, struct pid_namespace *ns,
struct pid *pid, struct task_struct *task)
{
@@ -443,6 +448,7 @@ int proc_pid_status(struct seq_file *m, struct pid_namespace *ns,
task_mem(m, mm);
task_core_dumping(m, task);
task_thp_status(m, mm);
+ task_untag_mask(m, mm);
mmput(mm);
}
task_sig(m, task);
diff --git a/include/linux/mmu_context.h b/include/linux/mmu_context.h
index b9b970f7ab45..14b9c1fa05c4 100644
--- a/include/linux/mmu_context.h
+++ b/include/linux/mmu_context.h
@@ -28,4 +28,11 @@ static inline void leave_mm(int cpu) { }
# define task_cpu_possible(cpu, p) cpumask_test_cpu((cpu), task_cpu_possible_mask(p))
#endif

+#ifndef mm_untag_mask
+static inline unsigned long mm_untag_mask(struct mm_struct *mm)
+{
+ return -1UL;
+}
+#endif
+
#endif
--
2.38.2

2023-01-11 14:04:02

by Kirill A. Shutemov

Subject: [PATCHv14 11/17] x86/mm/iommu/sva: Make LAM and SVA mutually exclusive

IOMMU and SVA-capable devices know nothing about LAM and only expect
canonical addresses. An attempt to pass down a tagged pointer will lead to
an address translation failure.

By default, do not allow enabling both LAM and SVA in the same process.

The new ARCH_FORCE_TAGGED_SVA arch_prctl() overrides the limitation.
By using the arch_prctl(), userspace takes responsibility for never passing
tagged addresses to the device.
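
A sketch of the intended userspace flow (ARCH_FORCE_TAGGED_SVA is the value
added below; the other arch_prctl() comes from earlier in the series):

	/* A process that wants both SVA and LAM opts in explicitly first: */
	syscall(SYS_arch_prctl, ARCH_FORCE_TAGGED_SVA, 0);

	/* Now LAM can be enabled and the device bound via SVA in any order. */
	syscall(SYS_arch_prctl, ARCH_ENABLE_TAGGED_ADDR, 6);

Without the opt-in, whichever of the two comes second fails:
iommu_sva_alloc_pasid() returns -EBUSY for an mm with LAM enabled, and
ARCH_ENABLE_TAGGED_ADDR fails for an mm that already has a PASID.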

Signed-off-by: Kirill A. Shutemov <[email protected]>
Reviewed-by: Ashok Raj <[email protected]>
Reviewed-by: Jacob Pan <[email protected]>
---
arch/x86/include/asm/mmu.h | 2 ++
arch/x86/include/asm/mmu_context.h | 6 ++++++
arch/x86/include/uapi/asm/prctl.h | 1 +
arch/x86/kernel/process_64.c | 7 +++++++
drivers/iommu/iommu-sva.c | 4 ++++
include/linux/mmu_context.h | 7 +++++++
6 files changed, 27 insertions(+)

diff --git a/arch/x86/include/asm/mmu.h b/arch/x86/include/asm/mmu.h
index e80762e998ce..0da5c227f490 100644
--- a/arch/x86/include/asm/mmu.h
+++ b/arch/x86/include/asm/mmu.h
@@ -14,6 +14,8 @@
#define MM_CONTEXT_HAS_VSYSCALL 1
/* Do not allow changing LAM mode */
#define MM_CONTEXT_LOCK_LAM 2
+/* Allow LAM and SVA coexisting */
+#define MM_CONTEXT_FORCE_TAGGED_SVA 3

/*
* x86 has arch-specific MMU state beyond what lives in mm_struct.
diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
index 4b77c96da07f..6ffc42dfd59d 100644
--- a/arch/x86/include/asm/mmu_context.h
+++ b/arch/x86/include/asm/mmu_context.h
@@ -114,6 +114,12 @@ static inline void mm_reset_untag_mask(struct mm_struct *mm)
mm->context.untag_mask = -1UL;
}

+#define arch_pgtable_dma_compat arch_pgtable_dma_compat
+static inline bool arch_pgtable_dma_compat(struct mm_struct *mm)
+{
+ return !mm_lam_cr3_mask(mm) ||
+ test_bit(MM_CONTEXT_FORCE_TAGGED_SVA, &mm->context.flags);
+}
#else

static inline unsigned long mm_lam_cr3_mask(struct mm_struct *mm)
diff --git a/arch/x86/include/uapi/asm/prctl.h b/arch/x86/include/uapi/asm/prctl.h
index a31e27b95b19..eb290d89cb32 100644
--- a/arch/x86/include/uapi/asm/prctl.h
+++ b/arch/x86/include/uapi/asm/prctl.h
@@ -23,5 +23,6 @@
#define ARCH_GET_UNTAG_MASK 0x4001
#define ARCH_ENABLE_TAGGED_ADDR 0x4002
#define ARCH_GET_MAX_TAG_BITS 0x4003
+#define ARCH_FORCE_TAGGED_SVA 0x4004

#endif /* _ASM_X86_PRCTL_H */
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index 1b41c60ebf6e..0831d2be190f 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -759,6 +759,10 @@ static int prctl_enable_tagged_addr(struct mm_struct *mm, unsigned long nr_bits)
if (current->mm != mm)
return -EINVAL;

+ if (mm_valid_pasid(mm) &&
+ !test_bit(MM_CONTEXT_FORCE_TAGGED_SVA, &mm->context.flags))
+ return -EINTR;
+
if (mmap_write_lock_killable(mm))
return -EINTR;

@@ -882,6 +886,9 @@ long do_arch_prctl_64(struct task_struct *task, int option, unsigned long arg2)
(unsigned long __user *)arg2);
case ARCH_ENABLE_TAGGED_ADDR:
return prctl_enable_tagged_addr(task->mm, arg2);
+ case ARCH_FORCE_TAGGED_SVA:
+ set_bit(MM_CONTEXT_FORCE_TAGGED_SVA, &task->mm->context.flags);
+ return 0;
case ARCH_GET_MAX_TAG_BITS:
if (!cpu_feature_enabled(X86_FEATURE_LAM))
return put_user(0, (unsigned long __user *)arg2);
diff --git a/drivers/iommu/iommu-sva.c b/drivers/iommu/iommu-sva.c
index 4ee2929f0d7a..dd76a1a09cf7 100644
--- a/drivers/iommu/iommu-sva.c
+++ b/drivers/iommu/iommu-sva.c
@@ -2,6 +2,7 @@
/*
* Helpers for IOMMU drivers implementing SVA
*/
+#include <linux/mmu_context.h>
#include <linux/mutex.h>
#include <linux/sched/mm.h>
#include <linux/iommu.h>
@@ -32,6 +33,9 @@ int iommu_sva_alloc_pasid(struct mm_struct *mm, ioasid_t min, ioasid_t max)
min == 0 || max < min)
return -EINVAL;

+ if (!arch_pgtable_dma_compat(mm))
+ return -EBUSY;
+
mutex_lock(&iommu_sva_lock);
/* Is a PASID already associated with this mm? */
if (mm_valid_pasid(mm)) {
diff --git a/include/linux/mmu_context.h b/include/linux/mmu_context.h
index 14b9c1fa05c4..f2b7a3f04099 100644
--- a/include/linux/mmu_context.h
+++ b/include/linux/mmu_context.h
@@ -35,4 +35,11 @@ static inline unsigned long mm_untag_mask(struct mm_struct *mm)
}
#endif

+#ifndef arch_pgtable_dma_compat
+static inline bool arch_pgtable_dma_compat(struct mm_struct *mm)
+{
+ return true;
+}
+#endif
+
#endif
--
2.38.2

2023-01-11 14:20:03

by Kirill A. Shutemov

Subject: [PATCHv14 17/17] selftests/x86/lam: Add test cases for LAM vs thread creation

LAM enabling is only allowed when the process has a single thread.
LAM mode is inherited by child threads.

Trying to enable LAM after spawning a thread has to fail.
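
A minimal sketch of the expected behaviour on a LAM-capable CPU (error
handling elided; ARCH_ENABLE_TAGGED_ADDR is the arch_prctl() added earlier
in the series):

	/* Single-threaded: succeeds */
	syscall(SYS_arch_prctl, ARCH_ENABLE_TAGGED_ADDR, 6);

whereas, in a process that has not enabled LAM yet:

	pthread_create(&tid, NULL, worker, NULL);
	/* A second thread exists now: this must fail */
	syscall(SYS_arch_prctl, ARCH_ENABLE_TAGGED_ADDR, 6);

The selftest below exercises this with clone(2): reading the LAM mode from a
child thread, trying to enable LAM from a child thread, and trying to enable
LAM in the parent after a thread has been created.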

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
tools/testing/selftests/x86/lam.c | 92 +++++++++++++++++++++++++++++++
1 file changed, 92 insertions(+)

diff --git a/tools/testing/selftests/x86/lam.c b/tools/testing/selftests/x86/lam.c
index a8c91829b616..eb0e46905bf9 100644
--- a/tools/testing/selftests/x86/lam.c
+++ b/tools/testing/selftests/x86/lam.c
@@ -1,4 +1,5 @@
// SPDX-License-Identifier: GPL-2.0
+#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
@@ -12,6 +13,7 @@
#include <sys/stat.h>
#include <fcntl.h>
#include <inttypes.h>
+#include <sched.h>

#include <sys/uio.h>
#include <linux/io_uring.h>
@@ -50,6 +52,8 @@

#define PAGE_SIZE (4 << 10)

+#define STACK_SIZE 65536
+
#define barrier() ({ \
__asm__ __volatile__("" : : : "memory"); \
})
@@ -731,6 +735,75 @@ static int handle_inheritance(struct testcases *test)
return 0;
}

+static int thread_fn_get_lam(void *arg)
+{
+ return get_lam();
+}
+
+static int thread_fn_set_lam(void *arg)
+{
+ struct testcases *test = arg;
+
+ return set_lam(test->lam);
+}
+
+static int handle_thread(struct testcases *test)
+{
+ char stack[STACK_SIZE];
+ int ret, child_ret;
+ int lam = 0;
+ pid_t pid;
+
+ /* Set LAM mode in parent process */
+ if (!test->later) {
+ lam = test->lam;
+ if (set_lam(lam) != 0)
+ return 1;
+ }
+
+ pid = clone(thread_fn_get_lam, stack + STACK_SIZE,
+ SIGCHLD | CLONE_FILES | CLONE_FS | CLONE_VM, NULL);
+ if (pid < 0) {
+ perror("Clone failed.");
+ return 1;
+ }
+
+ waitpid(pid, &child_ret, 0);
+ ret = WEXITSTATUS(child_ret);
+
+ if (lam != ret)
+ return 1;
+
+ if (test->later) {
+ if (set_lam(test->lam) != 0)
+ return 1;
+ }
+
+ return 0;
+}
+
+static int handle_thread_enable(struct testcases *test)
+{
+ char stack[STACK_SIZE];
+ int ret, child_ret;
+ int lam = test->lam;
+ pid_t pid;
+
+ pid = clone(thread_fn_set_lam, stack + STACK_SIZE,
+ SIGCHLD | CLONE_FILES | CLONE_FS | CLONE_VM, test);
+ if (pid < 0) {
+ perror("Clone failed.");
+ return 1;
+ }
+
+ waitpid(pid, &child_ret, 0);
+ ret = WEXITSTATUS(child_ret);
+
+ if (lam != ret)
+ return 1;
+
+ return 0;
+}
static void run_test(struct testcases *test, int count)
{
int i, ret = 0;
@@ -846,6 +919,25 @@ static struct testcases inheritance_cases[] = {
.test_func = handle_inheritance,
.msg = "FORK: LAM_U57, child process should get LAM mode same as parent\n",
},
+ {
+ .expected = 0,
+ .lam = LAM_U57_BITS,
+ .test_func = handle_thread,
+ .msg = "THREAD: LAM_U57, child thread should get LAM mode same as parent\n",
+ },
+ {
+ .expected = 1,
+ .lam = LAM_U57_BITS,
+ .test_func = handle_thread_enable,
+ .msg = "THREAD: [NEGATIVE] Enable LAM in child.\n",
+ },
+ {
+ .expected = 1,
+ .later = 1,
+ .lam = LAM_U57_BITS,
+ .test_func = handle_thread,
+ .msg = "THREAD: [NEGATIVE] Enable LAM in parent after thread created.\n",
+ },
{
.expected = 0,
.lam = LAM_U57_BITS,
--
2.38.2

2023-01-11 14:50:39

by Kirill A. Shutemov

Subject: Re: [PATCHv14 04/17] x86/mm: Handle LAM on context switch

On Wed, Jan 11, 2023 at 07:49:30AM -0600, Linus Torvalds wrote:
> Sorry, I'm traveling, and right now only on mobile. So I'm reading patches
> on my phone, and answering on it too, so html garbage..
>
> On Wed, Jan 11, 2023, 07:24 Kirill A. Shutemov <
> [email protected]> wrote:
>
> >
> > +static inline unsigned long mm_lam_cr3_mask(struct mm_struct *mm)
> > +{
> > + return READ_ONCE(mm->context.lam_cr3_mask);
> > +}
> >
>
> I mentioned this before - in the original version this needed (but didn't
> have, iirc) that READ_ONCE, but in the new non-thread situation I don't
> think that's true. There should be no concurrent changes that can interfere
> with the read, no?

It should be safe, yes.

It is reachable from iommu_sva_bind_device(), but it is only called with mm
== current->mm.

I will drop the READ_ONCE().

--
Kiryl Shutsemau / Kirill A. Shutemov

2023-01-11 14:51:42

by Kirill A. Shutemov

Subject: [PATCHv14.1 04/17] x86/mm: Handle LAM on context switch

The Linear Address Masking mode for userspace pointers is encoded in CR3
bits. The mode is selected per-process and stored in mm_context_t.

switch_mm_irqs_off() now respects the selected LAM mode and constructs CR3
accordingly.

The active LAM mode gets recorded in the tlb_state.

Signed-off-by: Kirill A. Shutemov <[email protected]>
Acked-by: Peter Zijlstra (Intel) <[email protected]>
Tested-by: Alexander Potapenko <[email protected]>
---
v14.1:
- Drop unneeded READ_ONCE();
---
arch/x86/include/asm/mmu.h | 5 +++
arch/x86/include/asm/mmu_context.h | 24 ++++++++++++++
arch/x86/include/asm/tlbflush.h | 38 ++++++++++++++++++++-
arch/x86/mm/tlb.c | 53 +++++++++++++++++++++---------
4 files changed, 103 insertions(+), 17 deletions(-)

diff --git a/arch/x86/include/asm/mmu.h b/arch/x86/include/asm/mmu.h
index efa3eaee522c..22fc9fbf1d0a 100644
--- a/arch/x86/include/asm/mmu.h
+++ b/arch/x86/include/asm/mmu.h
@@ -42,6 +42,11 @@ typedef struct {
unsigned long flags;
#endif

+#ifdef CONFIG_ADDRESS_MASKING
+ /* Active LAM mode: X86_CR3_LAM_U48 or X86_CR3_LAM_U57 or 0 (disabled) */
+ unsigned long lam_cr3_mask;
+#endif
+
struct mutex lock;
void __user *vdso; /* vdso base address */
const struct vdso_image *vdso_image; /* vdso image in use */
diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
index 53ef591a6166..a62e70801ea8 100644
--- a/arch/x86/include/asm/mmu_context.h
+++ b/arch/x86/include/asm/mmu_context.h
@@ -91,6 +91,29 @@ static inline void switch_ldt(struct mm_struct *prev, struct mm_struct *next)
}
#endif

+#ifdef CONFIG_ADDRESS_MASKING
+static inline unsigned long mm_lam_cr3_mask(struct mm_struct *mm)
+{
+ return mm->context.lam_cr3_mask;
+}
+
+static inline void dup_lam(struct mm_struct *oldmm, struct mm_struct *mm)
+{
+ mm->context.lam_cr3_mask = oldmm->context.lam_cr3_mask;
+}
+
+#else
+
+static inline unsigned long mm_lam_cr3_mask(struct mm_struct *mm)
+{
+ return 0;
+}
+
+static inline void dup_lam(struct mm_struct *oldmm, struct mm_struct *mm)
+{
+}
+#endif
+
#define enter_lazy_tlb enter_lazy_tlb
extern void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk);

@@ -168,6 +191,7 @@ static inline int arch_dup_mmap(struct mm_struct *oldmm, struct mm_struct *mm)
{
arch_dup_pkeys(oldmm, mm);
paravirt_arch_dup_mmap(oldmm, mm);
+ dup_lam(oldmm, mm);
return ldt_dup_context(oldmm, mm);
}

diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index cda3118f3b27..e8b47f57bd4a 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -2,7 +2,7 @@
#ifndef _ASM_X86_TLBFLUSH_H
#define _ASM_X86_TLBFLUSH_H

-#include <linux/mm.h>
+#include <linux/mm_types.h>
#include <linux/sched.h>

#include <asm/processor.h>
@@ -12,6 +12,7 @@
#include <asm/invpcid.h>
#include <asm/pti.h>
#include <asm/processor-flags.h>
+#include <asm/pgtable.h>

void __flush_tlb_all(void);

@@ -101,6 +102,16 @@ struct tlb_state {
*/
bool invalidate_other;

+#ifdef CONFIG_ADDRESS_MASKING
+ /*
+ * Active LAM mode.
+ *
+ * X86_CR3_LAM_U57/U48 shifted right by X86_CR3_LAM_U57_BIT or 0 if LAM
+ * disabled.
+ */
+ u8 lam;
+#endif
+
/*
* Mask that contains TLB_NR_DYN_ASIDS+1 bits to indicate
* the corresponding user PCID needs a flush next time we
@@ -357,6 +368,31 @@ static inline bool huge_pmd_needs_flush(pmd_t oldpmd, pmd_t newpmd)
}
#define huge_pmd_needs_flush huge_pmd_needs_flush

+#ifdef CONFIG_ADDRESS_MASKING
+static inline u64 tlbstate_lam_cr3_mask(void)
+{
+ u64 lam = this_cpu_read(cpu_tlbstate.lam);
+
+ return lam << X86_CR3_LAM_U57_BIT;
+}
+
+static inline void set_tlbstate_lam_mode(struct mm_struct *mm)
+{
+ this_cpu_write(cpu_tlbstate.lam,
+ mm->context.lam_cr3_mask >> X86_CR3_LAM_U57_BIT);
+}
+
+#else
+
+static inline u64 tlbstate_lam_cr3_mask(void)
+{
+ return 0;
+}
+
+static inline void set_tlbstate_lam_mode(struct mm_struct *mm)
+{
+}
+#endif
#endif /* !MODULE */

static inline void __native_tlb_flush_global(unsigned long cr4)
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index c1e31e9a85d7..8c330a6d0ece 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -154,26 +154,30 @@ static inline u16 user_pcid(u16 asid)
return ret;
}

-static inline unsigned long build_cr3(pgd_t *pgd, u16 asid)
+static inline unsigned long build_cr3(pgd_t *pgd, u16 asid, unsigned long lam)
{
+ unsigned long cr3 = __sme_pa(pgd) | lam;
+
if (static_cpu_has(X86_FEATURE_PCID)) {
- return __sme_pa(pgd) | kern_pcid(asid);
+ VM_WARN_ON_ONCE(asid > MAX_ASID_AVAILABLE);
+ cr3 |= kern_pcid(asid);
} else {
VM_WARN_ON_ONCE(asid != 0);
- return __sme_pa(pgd);
}
+
+ return cr3;
}

-static inline unsigned long build_cr3_noflush(pgd_t *pgd, u16 asid)
+static inline unsigned long build_cr3_noflush(pgd_t *pgd, u16 asid,
+ unsigned long lam)
{
- VM_WARN_ON_ONCE(asid > MAX_ASID_AVAILABLE);
/*
* Use boot_cpu_has() instead of this_cpu_has() as this function
* might be called during early boot. This should work even after
* boot because all CPU's the have same capabilities:
*/
VM_WARN_ON_ONCE(!boot_cpu_has(X86_FEATURE_PCID));
- return __sme_pa(pgd) | kern_pcid(asid) | CR3_NOFLUSH;
+ return build_cr3(pgd, asid, lam) | CR3_NOFLUSH;
}

/*
@@ -274,15 +278,16 @@ static inline void invalidate_user_asid(u16 asid)
(unsigned long *)this_cpu_ptr(&cpu_tlbstate.user_pcid_flush_mask));
}

-static void load_new_mm_cr3(pgd_t *pgdir, u16 new_asid, bool need_flush)
+static void load_new_mm_cr3(pgd_t *pgdir, u16 new_asid, unsigned long lam,
+ bool need_flush)
{
unsigned long new_mm_cr3;

if (need_flush) {
invalidate_user_asid(new_asid);
- new_mm_cr3 = build_cr3(pgdir, new_asid);
+ new_mm_cr3 = build_cr3(pgdir, new_asid, lam);
} else {
- new_mm_cr3 = build_cr3_noflush(pgdir, new_asid);
+ new_mm_cr3 = build_cr3_noflush(pgdir, new_asid, lam);
}

/*
@@ -491,6 +496,7 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
{
struct mm_struct *real_prev = this_cpu_read(cpu_tlbstate.loaded_mm);
u16 prev_asid = this_cpu_read(cpu_tlbstate.loaded_mm_asid);
+ unsigned long new_lam = mm_lam_cr3_mask(next);
bool was_lazy = this_cpu_read(cpu_tlbstate_shared.is_lazy);
unsigned cpu = smp_processor_id();
u64 next_tlb_gen;
@@ -520,7 +526,8 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
* isn't free.
*/
#ifdef CONFIG_DEBUG_VM
- if (WARN_ON_ONCE(__read_cr3() != build_cr3(real_prev->pgd, prev_asid))) {
+ if (WARN_ON_ONCE(__read_cr3() != build_cr3(real_prev->pgd, prev_asid,
+ tlbstate_lam_cr3_mask()))) {
/*
* If we were to BUG here, we'd be very likely to kill
* the system so hard that we don't see the call trace.
@@ -552,9 +559,15 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
* instruction.
*/
if (real_prev == next) {
+ /* Not actually switching mm's */
VM_WARN_ON(this_cpu_read(cpu_tlbstate.ctxs[prev_asid].ctx_id) !=
next->context.ctx_id);

+ /*
+ * If this races with another thread that enables lam, 'new_lam'
+ * might not match tlbstate_lam_cr3_mask().
+ */
+
/*
* Even in lazy TLB mode, the CPU should stay set in the
* mm_cpumask. The TLB shootdown code can figure out from
@@ -622,15 +635,16 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
barrier();
}

+ set_tlbstate_lam_mode(next);
if (need_flush) {
this_cpu_write(cpu_tlbstate.ctxs[new_asid].ctx_id, next->context.ctx_id);
this_cpu_write(cpu_tlbstate.ctxs[new_asid].tlb_gen, next_tlb_gen);
- load_new_mm_cr3(next->pgd, new_asid, true);
+ load_new_mm_cr3(next->pgd, new_asid, new_lam, true);

trace_tlb_flush(TLB_FLUSH_ON_TASK_SWITCH, TLB_FLUSH_ALL);
} else {
/* The new ASID is already up to date. */
- load_new_mm_cr3(next->pgd, new_asid, false);
+ load_new_mm_cr3(next->pgd, new_asid, new_lam, false);

trace_tlb_flush(TLB_FLUSH_ON_TASK_SWITCH, 0);
}
@@ -691,6 +705,10 @@ void initialize_tlbstate_and_flush(void)
/* Assert that CR3 already references the right mm. */
WARN_ON((cr3 & CR3_ADDR_MASK) != __pa(mm->pgd));

+ /* LAM expected to be disabled */
+ WARN_ON(cr3 & (X86_CR3_LAM_U48 | X86_CR3_LAM_U57));
+ WARN_ON(mm_lam_cr3_mask(mm));
+
/*
* Assert that CR4.PCIDE is set if needed. (CR4.PCIDE initialization
* doesn't work like other CR4 bits because it can only be set from
@@ -699,8 +717,8 @@ void initialize_tlbstate_and_flush(void)
WARN_ON(boot_cpu_has(X86_FEATURE_PCID) &&
!(cr4_read_shadow() & X86_CR4_PCIDE));

- /* Force ASID 0 and force a TLB flush. */
- write_cr3(build_cr3(mm->pgd, 0));
+ /* Disable LAM, force ASID 0 and force a TLB flush. */
+ write_cr3(build_cr3(mm->pgd, 0, 0));

/* Reinitialize tlbstate. */
this_cpu_write(cpu_tlbstate.last_user_mm_spec, LAST_USER_MM_INIT);
@@ -708,6 +726,7 @@ void initialize_tlbstate_and_flush(void)
this_cpu_write(cpu_tlbstate.next_asid, 1);
this_cpu_write(cpu_tlbstate.ctxs[0].ctx_id, mm->context.ctx_id);
this_cpu_write(cpu_tlbstate.ctxs[0].tlb_gen, tlb_gen);
+ set_tlbstate_lam_mode(mm);

for (i = 1; i < TLB_NR_DYN_ASIDS; i++)
this_cpu_write(cpu_tlbstate.ctxs[i].ctx_id, 0);
@@ -1071,8 +1090,10 @@ void flush_tlb_kernel_range(unsigned long start, unsigned long end)
*/
unsigned long __get_current_cr3_fast(void)
{
- unsigned long cr3 = build_cr3(this_cpu_read(cpu_tlbstate.loaded_mm)->pgd,
- this_cpu_read(cpu_tlbstate.loaded_mm_asid));
+ unsigned long cr3 =
+ build_cr3(this_cpu_read(cpu_tlbstate.loaded_mm)->pgd,
+ this_cpu_read(cpu_tlbstate.loaded_mm_asid),
+ tlbstate_lam_cr3_mask());

/* For now, be very restrictive about when this can be called. */
VM_WARN_ON(in_nmi() || preemptible());
--
2.38.2

2023-01-18 16:55:44

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCHv14 00/17] Linear Address Masking enabling

On Wed, Jan 11, 2023 at 03:37:19PM +0300, Kirill A. Shutemov wrote:

> Kirill A. Shutemov (12):
> x86/mm: Rework address range check in get_user() and put_user()
> x86: Allow atomic MM_CONTEXT flags setting
> x86: CPUID and CR3/CR4 flags for Linear Address Masking
> x86/mm: Handle LAM on context switch
> mm: Introduce untagged_addr_remote()
> x86/uaccess: Provide untagged_addr() and remove tags before address
> check
> x86/mm: Provide arch_prctl() interface for LAM
> x86/mm: Reduce untagged_addr() overhead until the first LAM user
> mm: Expose untagging mask in /proc/$PID/status
> iommu/sva: Replace pasid_valid() helper with mm_valid_pasid()
> x86/mm/iommu/sva: Make LAM and SVA mutually exclusive

+- the static_branch() thing,

Acked-by: Peter Zijlstra (Intel) <[email protected]>