Linear Address Masking[1] (LAM) modifies the checking that is applied to
64-bit linear addresses, allowing software to use the untranslated
address bits for metadata.
The patchset brings support for LAM for userspace addresses. Only
LAM_U57 is supported at this time.
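As a rough illustration of what LAM_U57 permits, here is a small userspace sketch (the helper names are hypothetical, not part of the patchset) that packs a 6-bit tag into bits 62:57 of a user pointer and strips it again. With LAM_U57 enabled, both pointers dereference the same memory because the hardware ignores bits 62:57 during translation:

	#define LAM_U57_MASK	(0x3fULL << 57)

	/* Hypothetical helpers for illustration only */
	static inline void *tag_pointer(void *p, unsigned long tag)
	{
		return (void *)(((unsigned long)p & ~LAM_U57_MASK) |
				((tag & 0x3f) << 57));
	}

	static inline void *strip_tag(void *p)
	{
		return (void *)((unsigned long)p & ~LAM_U57_MASK);
	}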
Please review and consider applying.
git://git.kernel.org/pub/scm/linux/kernel/git/kas/linux.git lam
v8:
- Drop redundant smp_mb() in prctl_enable_tagged_addr();
- Cleanup code around build_cr3();
- Fix commit messages;
- Selftests updates;
- Acked/Reviewed/Tested-bys from Alexander and Peter;
v7:
- Drop redundant smp_mb() in prctl_enable_tagged_addr();
- Cleanup code around build_cr3();
- Fix commit message;
- Fix indentation;
v6:
- Rebased onto v6.0-rc1
- LAM_U48 excluded from the patchset. Still available in the git tree;
- add ARCH_GET_MAX_TAG_BITS;
- Fix build without CONFIG_DEBUG_VM;
- Update comments;
- Reviewed/Tested-by from Alexander;
v5:
- Do not use switch_mm() in enable_lam_func()
- Use mb()/READ_ONCE() pair on LAM enabling;
- Add self-test by Weihong Zhang;
- Add comments;
v4:
- Fix untagged_addr() for LAM_U48;
- Remove no-threads restriction on LAM enabling;
- Fix mm_struct access from /proc/$PID/arch_status
- Fix LAM handling in initialize_tlbstate_and_flush()
- Pack tlb_state better;
- Comments and commit messages;
v3:
- Rebased onto v5.19-rc1
- Per-process enabling;
- API overhaul (again);
- Avoid branches and costly computations in the fast path;
- LAM_U48 is in an optional patch.
v2:
- Rebased onto v5.18-rc1
- New arch_prctl(2)-based API
- Expose status of LAM (or other thread features) in
/proc/$PID/arch_status
[1] ISE, Chapter 10. https://cdrdv2.intel.com/v1/dl/getContent/671368
Kirill A. Shutemov (7):
x86/mm: Fix CR3_ADDR_MASK
x86: CPUID and CR3/CR4 flags for Linear Address Masking
mm: Pass down mm_struct to untagged_addr()
x86/mm: Handle LAM on context switch
x86/uaccess: Provide untagged_addr() and remove tags before address
check
x86/mm: Provide arch_prctl() interface for LAM
x86: Expose untagging mask in /proc/$PID/arch_status
Weihong Zhang (4):
selftests/x86/lam: Add malloc and tag-bits test cases for
linear-address masking
selftests/x86/lam: Add mmap and SYSCALL test cases for linear-address
masking
selftests/x86/lam: Add io_uring test cases for linear-address masking
selftests/x86/lam: Add inherit test cases for linear-address masking
arch/arm64/include/asm/memory.h | 4 +-
arch/arm64/include/asm/signal.h | 2 +-
arch/arm64/include/asm/uaccess.h | 4 +-
arch/arm64/kernel/hw_breakpoint.c | 2 +-
arch/arm64/kernel/traps.c | 4 +-
arch/arm64/mm/fault.c | 10 +-
arch/sparc/include/asm/pgtable_64.h | 2 +-
arch/sparc/include/asm/uaccess_64.h | 2 +
arch/x86/include/asm/cpufeatures.h | 1 +
arch/x86/include/asm/mmu.h | 6 +
arch/x86/include/asm/mmu_context.h | 45 +
arch/x86/include/asm/processor-flags.h | 4 +-
arch/x86/include/asm/tlbflush.h | 35 +
arch/x86/include/asm/uaccess.h | 42 +-
arch/x86/include/uapi/asm/prctl.h | 4 +
arch/x86/include/uapi/asm/processor-flags.h | 6 +
arch/x86/kernel/Makefile | 2 +
arch/x86/kernel/fpu/xstate.c | 47 -
arch/x86/kernel/proc.c | 60 ++
arch/x86/kernel/process.c | 3 +
arch/x86/kernel/process_64.c | 64 +-
arch/x86/mm/tlb.c | 48 +-
.../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c | 2 +-
drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c | 2 +-
drivers/gpu/drm/radeon/radeon_gem.c | 2 +-
drivers/infiniband/hw/mlx4/mr.c | 2 +-
drivers/media/common/videobuf2/frame_vector.c | 2 +-
drivers/media/v4l2-core/videobuf-dma-contig.c | 2 +-
.../staging/media/atomisp/pci/hmm/hmm_bo.c | 2 +-
drivers/tee/tee_shm.c | 2 +-
drivers/vfio/vfio_iommu_type1.c | 2 +-
fs/proc/task_mmu.c | 2 +-
include/linux/mm.h | 11 -
include/linux/uaccess.h | 15 +
lib/strncpy_from_user.c | 2 +-
lib/strnlen_user.c | 2 +-
mm/gup.c | 6 +-
mm/madvise.c | 2 +-
mm/mempolicy.c | 6 +-
mm/migrate.c | 2 +-
mm/mincore.c | 2 +-
mm/mlock.c | 4 +-
mm/mmap.c | 2 +-
mm/mprotect.c | 2 +-
mm/mremap.c | 2 +-
mm/msync.c | 2 +-
tools/testing/selftests/x86/Makefile | 2 +-
tools/testing/selftests/x86/lam.c | 909 ++++++++++++++++++
virt/kvm/kvm_main.c | 2 +-
49 files changed, 1268 insertions(+), 122 deletions(-)
create mode 100644 arch/x86/kernel/proc.c
create mode 100644 tools/testing/selftests/x86/lam.c
--
2.35.1
untagged_addr() is a helper used by the core-mm to strip tag bits and
bring the address to its canonical shape. It only handles userspace
addresses. The untagging mask is stored in mmu_context and will be set
on enabling LAM for the process.
The tags must not be included in the check of whether it is okay to
access the userspace address.
Strip tags in access_ok().
get_user() and put_user() don't use access_ok(), but check access
against TASK_SIZE directly in assembly. Strip tags before calling into
the assembly helper.
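For reference, a worked example of the branchless untagging trick used below; the mask value assumes LAM_U57, i.e. untag_mask == ~GENMASK_ULL(62, 57). For a user pointer bit 63 is clear, so 'sign' is zero and the tag bits get cleared; for a kernel pointer 'sign' is all-ones and the address is left intact:

	u64 uaddr = 0x7e0000123456789aULL;	/* user pointer, tag 0x3f in bits 62:57 */
	s64 sign  = (s64)uaddr >> 63;		/* 0 */
	uaddr &= ~GENMASK_ULL(62, 57) | sign;	/* -> 0x000000123456789a */

	u64 kaddr = 0xffff888012345678ULL;	/* kernel pointer */
	sign  = (s64)kaddr >> 63;		/* 0xffffffffffffffff */
	kaddr &= ~GENMASK_ULL(62, 57) | sign;	/* unchanged */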
Signed-off-by: Kirill A. Shutemov <[email protected]>
Tested-by: Alexander Potapenko <[email protected]>
Acked-by: Peter Zijlstra (Intel) <[email protected]>
---
arch/x86/include/asm/mmu.h | 3 +++
arch/x86/include/asm/mmu_context.h | 11 ++++++++
arch/x86/include/asm/uaccess.h | 42 +++++++++++++++++++++++++++---
arch/x86/kernel/process.c | 3 +++
4 files changed, 56 insertions(+), 3 deletions(-)
diff --git a/arch/x86/include/asm/mmu.h b/arch/x86/include/asm/mmu.h
index 002889ca8978..2fdb390040b5 100644
--- a/arch/x86/include/asm/mmu.h
+++ b/arch/x86/include/asm/mmu.h
@@ -43,6 +43,9 @@ typedef struct {
/* Active LAM mode: X86_CR3_LAM_U48 or X86_CR3_LAM_U57 or 0 (disabled) */
unsigned long lam_cr3_mask;
+
+ /* Significant bits of the virtual address. Excludes tag bits. */
+ u64 untag_mask;
#endif
struct mutex lock;
diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
index 69c943b2ae90..5bd3d46685dc 100644
--- a/arch/x86/include/asm/mmu_context.h
+++ b/arch/x86/include/asm/mmu_context.h
@@ -100,6 +100,12 @@ static inline unsigned long mm_lam_cr3_mask(struct mm_struct *mm)
static inline void dup_lam(struct mm_struct *oldmm, struct mm_struct *mm)
{
mm->context.lam_cr3_mask = oldmm->context.lam_cr3_mask;
+ mm->context.untag_mask = oldmm->context.untag_mask;
+}
+
+static inline void mm_reset_untag_mask(struct mm_struct *mm)
+{
+ mm->context.untag_mask = -1UL;
}
#else
@@ -112,6 +118,10 @@ static inline unsigned long mm_lam_cr3_mask(struct mm_struct *mm)
static inline void dup_lam(struct mm_struct *oldmm, struct mm_struct *mm)
{
}
+
+static inline void mm_reset_untag_mask(struct mm_struct *mm)
+{
+}
#endif
#define enter_lazy_tlb enter_lazy_tlb
@@ -138,6 +148,7 @@ static inline int init_new_context(struct task_struct *tsk,
mm->context.execute_only_pkey = -1;
}
#endif
+ mm_reset_untag_mask(mm);
init_new_context_ldt(mm);
return 0;
}
diff --git a/arch/x86/include/asm/uaccess.h b/arch/x86/include/asm/uaccess.h
index 913e593a3b45..803241dfc473 100644
--- a/arch/x86/include/asm/uaccess.h
+++ b/arch/x86/include/asm/uaccess.h
@@ -6,6 +6,7 @@
*/
#include <linux/compiler.h>
#include <linux/kasan-checks.h>
+#include <linux/mm_types.h>
#include <linux/string.h>
#include <asm/asm.h>
#include <asm/page.h>
@@ -20,6 +21,30 @@ static inline bool pagefault_disabled(void);
# define WARN_ON_IN_IRQ()
#endif
+#ifdef CONFIG_X86_64
+/*
+ * Mask out tag bits from the address.
+ *
+ * Magic with the 'sign' allows to untag userspace pointer without any branches
+ * while leaving kernel addresses intact.
+ */
+#define untagged_addr(mm, addr) ({ \
+ u64 __addr = (__force u64)(addr); \
+ s64 sign = (s64)__addr >> 63; \
+ __addr &= (mm)->context.untag_mask | sign; \
+ (__force __typeof__(addr))__addr; \
+})
+
+#define untagged_ptr(mm, ptr) ({ \
+ u64 __ptrval = (__force u64)(ptr); \
+ __ptrval = untagged_addr(mm, __ptrval); \
+ (__force __typeof__(*(ptr)) *)__ptrval; \
+})
+#else
+#define untagged_addr(mm, addr) (addr)
+#define untagged_ptr(mm, ptr) (ptr)
+#endif
+
/**
* access_ok - Checks if a user space pointer is valid
* @addr: User space pointer to start of block to check
@@ -40,7 +65,7 @@ static inline bool pagefault_disabled(void);
#define access_ok(addr, size) \
({ \
WARN_ON_IN_IRQ(); \
- likely(__access_ok(addr, size)); \
+ likely(__access_ok(untagged_addr(current->mm, addr), size)); \
})
#include <asm-generic/access_ok.h>
@@ -125,7 +150,13 @@ extern int __get_user_bad(void);
* Return: zero on success, or -EFAULT on error.
* On error, the variable @x is set to zero.
*/
-#define get_user(x,ptr) ({ might_fault(); do_get_user_call(get_user,x,ptr); })
+#define get_user(x,ptr) \
+({ \
+ __typeof__(*(ptr)) __user *__ptr_clean; \
+ __ptr_clean = untagged_ptr(current->mm, ptr); \
+ might_fault(); \
+ do_get_user_call(get_user,x,__ptr_clean); \
+})
/**
* __get_user - Get a simple variable from user space, with less checking.
@@ -222,7 +253,12 @@ extern void __put_user_nocheck_8(void);
*
* Return: zero on success, or -EFAULT on error.
*/
-#define put_user(x, ptr) ({ might_fault(); do_put_user_call(put_user,x,ptr); })
+#define put_user(x, ptr) ({ \
+ __typeof__(*(ptr)) __user *__ptr_clean; \
+ __ptr_clean = untagged_ptr(current->mm, ptr); \
+ might_fault(); \
+ do_put_user_call(put_user,x,__ptr_clean); \
+})
/**
* __put_user - Write a simple value into user space, with less checking.
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index 58a6ea472db9..b0e86fb11ffa 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -47,6 +47,7 @@
#include <asm/frame.h>
#include <asm/unwind.h>
#include <asm/tdx.h>
+#include <asm/mmu_context.h>
#include "process.h"
@@ -367,6 +368,8 @@ void arch_setup_new_exec(void)
task_clear_spec_ssb_noexec(current);
speculation_ctrl_update(read_thread_flags());
}
+
+ mm_reset_untag_mask(current->mm);
}
#ifdef CONFIG_X86_IOPL_IOPERM
--
2.35.1
Add a couple of arch_prctl() handlers:
- ARCH_ENABLE_TAGGED_ADDR enables LAM. The argument is the required
  number of tag bits. It is rounded up to the nearest LAM mode that can
  provide it. For now only LAM_U57 is supported, with 6 tag bits.
- ARCH_GET_UNTAG_MASK returns the untag mask. It indicates where the
  tag bits are located in the address.
- ARCH_GET_MAX_TAG_BITS returns the maximum number of tag bits a user
  can request. Zero if LAM is not supported.
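A minimal userspace usage sketch (error handling omitted; the ARCH_* constants are the new values from asm/prctl.h and the expected mask assumes LAM_U57 with its 6 tag bits):

	#include <sys/syscall.h>
	#include <unistd.h>

	unsigned long bits = 0, mask = 0;

	/* Reports 0 if LAM is not supported */
	syscall(SYS_arch_prctl, ARCH_GET_MAX_TAG_BITS, &bits);

	/* Request 6 tag bits; rounded up to the nearest LAM mode (LAM_U57) */
	syscall(SYS_arch_prctl, ARCH_ENABLE_TAGGED_ADDR, 6);

	/* For LAM_U57 the mask is ~GENMASK(62, 57), i.e. 0x81ffffffffffffff */
	syscall(SYS_arch_prctl, ARCH_GET_UNTAG_MASK, &mask);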
Signed-off-by: Kirill A. Shutemov <[email protected]>
Tested-by: Alexander Potapenko <[email protected]>
Reviewed-by: Alexander Potapenko <[email protected]>
Acked-by: Peter Zijlstra (Intel) <[email protected]>
---
arch/x86/include/uapi/asm/prctl.h | 4 ++
arch/x86/kernel/process_64.c | 64 ++++++++++++++++++++++++++++++-
2 files changed, 67 insertions(+), 1 deletion(-)
diff --git a/arch/x86/include/uapi/asm/prctl.h b/arch/x86/include/uapi/asm/prctl.h
index 500b96e71f18..a31e27b95b19 100644
--- a/arch/x86/include/uapi/asm/prctl.h
+++ b/arch/x86/include/uapi/asm/prctl.h
@@ -20,4 +20,8 @@
#define ARCH_MAP_VDSO_32 0x2002
#define ARCH_MAP_VDSO_64 0x2003
+#define ARCH_GET_UNTAG_MASK 0x4001
+#define ARCH_ENABLE_TAGGED_ADDR 0x4002
+#define ARCH_GET_MAX_TAG_BITS 0x4003
+
#endif /* _ASM_X86_PRCTL_H */
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index 1962008fe743..455601a3cc3f 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -742,6 +742,59 @@ static long prctl_map_vdso(const struct vdso_image *image, unsigned long addr)
}
#endif
+static void enable_lam_func(void *mm)
+{
+ struct mm_struct *loaded_mm = this_cpu_read(cpu_tlbstate.loaded_mm);
+ unsigned long lam_mask;
+ unsigned long cr3;
+
+ if (loaded_mm != mm)
+ return;
+
+ lam_mask = READ_ONCE(loaded_mm->context.lam_cr3_mask);
+
+ /* Update CR3 to get LAM active on the CPU */
+ cr3 = __read_cr3();
+ cr3 &= ~(X86_CR3_LAM_U48 | X86_CR3_LAM_U57);
+ cr3 |= lam_mask;
+ write_cr3(cr3);
+ set_tlbstate_cr3_lam_mask(lam_mask);
+}
+
+#define LAM_U57_BITS 6
+
+static int prctl_enable_tagged_addr(struct mm_struct *mm, unsigned long nr_bits)
+{
+ int ret = 0;
+
+ if (!cpu_feature_enabled(X86_FEATURE_LAM))
+ return -ENODEV;
+
+ mutex_lock(&mm->context.lock);
+
+ /* Already enabled? */
+ if (mm->context.lam_cr3_mask) {
+ ret = -EBUSY;
+ goto out;
+ }
+
+ if (!nr_bits) {
+ ret = -EINVAL;
+ goto out;
+ } else if (nr_bits <= LAM_U57_BITS) {
+ mm->context.lam_cr3_mask = X86_CR3_LAM_U57;
+ mm->context.untag_mask = ~GENMASK(62, 57);
+ } else {
+ ret = -EINVAL;
+ goto out;
+ }
+
+ on_each_cpu_mask(mm_cpumask(mm), enable_lam_func, mm, true);
+out:
+ mutex_unlock(&mm->context.lock);
+ return ret;
+}
+
long do_arch_prctl_64(struct task_struct *task, int option, unsigned long arg2)
{
int ret = 0;
@@ -829,7 +882,16 @@ long do_arch_prctl_64(struct task_struct *task, int option, unsigned long arg2)
case ARCH_MAP_VDSO_64:
return prctl_map_vdso(&vdso_image_64, arg2);
#endif
-
+ case ARCH_GET_UNTAG_MASK:
+ return put_user(task->mm->context.untag_mask,
+ (unsigned long __user *)arg2);
+ case ARCH_ENABLE_TAGGED_ADDR:
+ return prctl_enable_tagged_addr(task->mm, arg2);
+ case ARCH_GET_MAX_TAG_BITS:
+ if (!cpu_feature_enabled(X86_FEATURE_LAM))
+ return put_user(0, (unsigned long __user *)arg2);
+ else
+ return put_user(LAM_U57_BITS, (unsigned long __user *)arg2);
default:
ret = -EINVAL;
break;
--
2.35.1
From: Weihong Zhang <[email protected]>
Add mmap and SYSCALL test cases.
SYSCALL test cases:
- LAM_U57 allows metadata to be set in the high bits 62:57 of a user
  pointer. Pass such a pointer to a syscall; the syscall should
  dereference the pointer and return the correct result.
- With LAM disabled, pass a pointer with metadata in the high bits to a
  syscall; the syscall returns -1 (EFAULT).
MMAP test cases:
- Enable LAM_U57, mmap at a low address (below bit 47), set metadata in
  the high bits of the address; dereferencing the address should be
  allowed.
- Enable LAM_U57, mmap at a high address (above bit 47), set metadata
  in the high bits of the address; dereferencing the address should be
  allowed.
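For instance, the negative SYSCALL case amounts to roughly the following sketch (set_metadata() is the selftest helper that packs a random tag into bits 62:57):

	struct utsname u;
	struct utsname *tagged;

	/* LAM not enabled, so the kernel does not strip the tag bits */
	tagged = (struct utsname *)set_metadata((uint64_t)&u, LAM_U57_BITS);
	int ret = uname(tagged);	/* expected: -1 with errno == EFAULT */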
Signed-off-by: Weihong Zhang <[email protected]>
Signed-off-by: Kirill A. Shutemov <[email protected]>
Acked-by: Peter Zijlstra (Intel) <[email protected]>
---
tools/testing/selftests/x86/lam.c | 135 +++++++++++++++++++++++++++++-
1 file changed, 132 insertions(+), 3 deletions(-)
diff --git a/tools/testing/selftests/x86/lam.c b/tools/testing/selftests/x86/lam.c
index 900a3a0fb709..b88e007ee0a3 100644
--- a/tools/testing/selftests/x86/lam.c
+++ b/tools/testing/selftests/x86/lam.c
@@ -7,6 +7,7 @@
#include <signal.h>
#include <setjmp.h>
#include <sys/mman.h>
+#include <sys/utsname.h>
#include <sys/wait.h>
#include <inttypes.h>
@@ -29,11 +30,18 @@
/* Specified test function bits */
#define FUNC_MALLOC 0x1
#define FUNC_BITS 0x2
+#define FUNC_MMAP 0x4
+#define FUNC_SYSCALL 0x8
-#define TEST_MASK 0x3
+#define TEST_MASK 0xf
+
+#define LOW_ADDR (0x1UL << 30)
+#define HIGH_ADDR (0x3UL << 48)
#define MALLOC_LEN 32
+#define PAGE_SIZE (4 << 10)
+
struct testcases {
unsigned int later;
int expected; /* 2: SIGSEGV Error; 1: other errors */
@@ -49,6 +57,7 @@ jmp_buf segv_env;
static void segv_handler(int sig)
{
ksft_print_msg("Get segmentation fault(%d).", sig);
+
siglongjmp(segv_env, 1);
}
@@ -61,6 +70,16 @@ static inline int cpu_has_lam(void)
return (cpuinfo[0] & (1 << 26));
}
+/* Check 5-level page table feature in CPUID.(EAX=07H, ECX=00H):ECX.[bit 16] */
+static inline int cpu_has_la57(void)
+{
+ unsigned int cpuinfo[4];
+
+ __cpuid_count(0x7, 0, cpuinfo[0], cpuinfo[1], cpuinfo[2], cpuinfo[3]);
+
+ return (cpuinfo[2] & (1 << 16));
+}
+
/*
* Set tagged address and read back untag mask.
* check if the untagged mask is expected.
@@ -213,6 +232,68 @@ static int handle_malloc(struct testcases *test)
return ret;
}
+static int handle_mmap(struct testcases *test)
+{
+ void *ptr;
+ unsigned int flags = MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED;
+ int ret = 0;
+
+ if (test->later == 0 && test->lam != 0)
+ if (set_lam(test->lam) != 0)
+ return 1;
+
+ ptr = mmap((void *)test->addr, PAGE_SIZE, PROT_READ | PROT_WRITE,
+ flags, -1, 0);
+ if (ptr == MAP_FAILED) {
+ if (test->addr == HIGH_ADDR)
+ if (!cpu_has_la57())
+ return 3; /* unsupport LA57 */
+ return 1;
+ }
+
+ if (test->later != 0 && test->lam != 0)
+ if (set_lam(test->lam) != 0)
+ ret = 1;
+
+ if (ret == 0) {
+ if (sigsetjmp(segv_env, 1) == 0) {
+ signal(SIGSEGV, segv_handler);
+ ret = handle_lam_test(ptr, test->lam);
+ } else {
+ ret = 2;
+ }
+ }
+
+ munmap(ptr, PAGE_SIZE);
+ return ret;
+}
+
+static int handle_syscall(struct testcases *test)
+{
+ struct utsname unme, *pu;
+ int ret = 0;
+
+ if (test->later == 0 && test->lam != 0)
+ if (set_lam(test->lam) != 0)
+ return 1;
+
+ if (sigsetjmp(segv_env, 1) == 0) {
+ signal(SIGSEGV, segv_handler);
+ pu = (struct utsname *)set_metadata((uint64_t)&unme, test->lam);
+ ret = uname(pu);
+ if (ret < 0)
+ ret = 1;
+ } else {
+ ret = 2;
+ }
+
+ if (test->later != 0 && test->lam != 0)
+ if (set_lam(test->lam) != -1 && ret == 0)
+ ret = 1;
+
+ return ret;
+}
+
static int fork_test(struct testcases *test)
{
int ret, child_ret;
@@ -268,7 +349,6 @@ static struct testcases malloc_cases[] = {
},
};
-
static struct testcases bits_cases[] = {
{
.test_func = handle_max_bits,
@@ -276,11 +356,54 @@ static struct testcases bits_cases[] = {
},
};
+static struct testcases syscall_cases[] = {
+ {
+ .later = 0,
+ .lam = LAM_U57_BITS,
+ .test_func = handle_syscall,
+ .msg = "SYSCALL: LAM_U57. syscall with metadata\n",
+ },
+ {
+ .later = 1,
+ .expected = 1,
+ .lam = LAM_U57_BITS,
+ .test_func = handle_syscall,
+ .msg = "SYSCALL:[Negative] Disable LAM. Dereferencing pointer with metadata.\n",
+ },
+};
+
+static struct testcases mmap_cases[] = {
+ {
+ .later = 1,
+ .expected = 0,
+ .lam = LAM_U57_BITS,
+ .addr = HIGH_ADDR,
+ .test_func = handle_mmap,
+ .msg = "MMAP: First mmap high address, then set LAM_U57.\n",
+ },
+ {
+ .later = 0,
+ .expected = 0,
+ .lam = LAM_U57_BITS,
+ .addr = HIGH_ADDR,
+ .test_func = handle_mmap,
+ .msg = "MMAP: First LAM_U57, then High address.\n",
+ },
+ {
+ .later = 0,
+ .expected = 0,
+ .lam = LAM_U57_BITS,
+ .addr = LOW_ADDR,
+ .test_func = handle_mmap,
+ .msg = "MMAP: First LAM_U57, then Low address.\n",
+ },
+};
+
static void cmd_help(void)
{
printf("usage: lam [-h] [-t test list]\n");
printf("\t-t test list: run tests specified in the test list, default:0x%x\n", TEST_MASK);
- printf("\t\t0x1:malloc; 0x2:max_bits;\n");
+ printf("\t\t0x1:malloc; 0x2:max_bits; 0x4:mmap; 0x8:syscall.\n");
printf("\t-h: help\n");
}
@@ -320,6 +443,12 @@ int main(int argc, char **argv)
if (tests & FUNC_BITS)
run_test(bits_cases, ARRAY_SIZE(bits_cases));
+ if (tests & FUNC_MMAP)
+ run_test(mmap_cases, ARRAY_SIZE(mmap_cases));
+
+ if (tests & FUNC_SYSCALL)
+ run_test(syscall_cases, ARRAY_SIZE(syscall_cases));
+
ksft_set_plan(tests_cnt);
return ksft_exit_pass();
--
2.35.1
From: Weihong Zhang <[email protected]>
LAM should be supported in kernel threads; use io_uring to verify the
LAM feature.
The test cases read a file through io_uring. They use an iovec array as
the receiving buffer and, according to the LAM mode, set metadata in the
high bits of these buffers.
io_uring must be able to handle buffers whose pointers carry metadata in
the high bits.
Signed-off-by: Weihong Zhang <[email protected]>
Signed-off-by: Kirill A. Shutemov <[email protected]>
Acked-by: Peter Zijlstra (Intel) <[email protected]>
---
tools/testing/selftests/x86/lam.c | 341 +++++++++++++++++++++++++++++-
1 file changed, 339 insertions(+), 2 deletions(-)
diff --git a/tools/testing/selftests/x86/lam.c b/tools/testing/selftests/x86/lam.c
index b88e007ee0a3..25e6d0fc4a83 100644
--- a/tools/testing/selftests/x86/lam.c
+++ b/tools/testing/selftests/x86/lam.c
@@ -9,8 +9,12 @@
#include <sys/mman.h>
#include <sys/utsname.h>
#include <sys/wait.h>
+#include <sys/stat.h>
+#include <fcntl.h>
#include <inttypes.h>
+#include <sys/uio.h>
+#include <linux/io_uring.h>
#include "../kselftest.h"
#ifndef __x86_64__
@@ -32,8 +36,9 @@
#define FUNC_BITS 0x2
#define FUNC_MMAP 0x4
#define FUNC_SYSCALL 0x8
+#define FUNC_URING 0x10
-#define TEST_MASK 0xf
+#define TEST_MASK 0x1f
#define LOW_ADDR (0x1UL << 30)
#define HIGH_ADDR (0x3UL << 48)
@@ -42,6 +47,13 @@
#define PAGE_SIZE (4 << 10)
+#define barrier() ({ \
+ __asm__ __volatile__("" : : : "memory"); \
+})
+
+#define URING_QUEUE_SZ 1
+#define URING_BLOCK_SZ 2048
+
struct testcases {
unsigned int later;
int expected; /* 2: SIGSEGV Error; 1: other errors */
@@ -51,6 +63,33 @@ struct testcases {
const char *msg;
};
+/* Used by CQ of uring, source file handler and file's size */
+struct file_io {
+ int file_fd;
+ off_t file_sz;
+ struct iovec iovecs[];
+};
+
+struct io_uring_queue {
+ unsigned int *head;
+ unsigned int *tail;
+ unsigned int *ring_mask;
+ unsigned int *ring_entries;
+ unsigned int *flags;
+ unsigned int *array;
+ union {
+ struct io_uring_cqe *cqes;
+ struct io_uring_sqe *sqes;
+ } queue;
+ size_t ring_sz;
+};
+
+struct io_ring {
+ int ring_fd;
+ struct io_uring_queue sq_ring;
+ struct io_uring_queue cq_ring;
+};
+
int tests_cnt;
jmp_buf segv_env;
@@ -294,6 +333,285 @@ static int handle_syscall(struct testcases *test)
return ret;
}
+int sys_uring_setup(unsigned int entries, struct io_uring_params *p)
+{
+ return (int)syscall(__NR_io_uring_setup, entries, p);
+}
+
+int sys_uring_enter(int fd, unsigned int to, unsigned int min, unsigned int flags)
+{
+ return (int)syscall(__NR_io_uring_enter, fd, to, min, flags, NULL, 0);
+}
+
+/* Init submission queue and completion queue */
+int mmap_io_uring(struct io_uring_params p, struct io_ring *s)
+{
+ struct io_uring_queue *sring = &s->sq_ring;
+ struct io_uring_queue *cring = &s->cq_ring;
+
+ sring->ring_sz = p.sq_off.array + p.sq_entries * sizeof(unsigned int);
+ cring->ring_sz = p.cq_off.cqes + p.cq_entries * sizeof(struct io_uring_cqe);
+
+ if (p.features & IORING_FEAT_SINGLE_MMAP) {
+ if (cring->ring_sz > sring->ring_sz)
+ sring->ring_sz = cring->ring_sz;
+
+ cring->ring_sz = sring->ring_sz;
+ }
+
+ void *sq_ptr = mmap(0, sring->ring_sz, PROT_READ | PROT_WRITE,
+ MAP_SHARED | MAP_POPULATE, s->ring_fd,
+ IORING_OFF_SQ_RING);
+
+ if (sq_ptr == MAP_FAILED) {
+ perror("sub-queue!");
+ return 1;
+ }
+
+ void *cq_ptr = sq_ptr;
+
+ if (!(p.features & IORING_FEAT_SINGLE_MMAP)) {
+ cq_ptr = mmap(0, cring->ring_sz, PROT_READ | PROT_WRITE,
+ MAP_SHARED | MAP_POPULATE, s->ring_fd,
+ IORING_OFF_CQ_RING);
+ if (cq_ptr == MAP_FAILED) {
+ perror("cpl-queue!");
+ munmap(sq_ptr, sring->ring_sz);
+ return 1;
+ }
+ }
+
+ sring->head = sq_ptr + p.sq_off.head;
+ sring->tail = sq_ptr + p.sq_off.tail;
+ sring->ring_mask = sq_ptr + p.sq_off.ring_mask;
+ sring->ring_entries = sq_ptr + p.sq_off.ring_entries;
+ sring->flags = sq_ptr + p.sq_off.flags;
+ sring->array = sq_ptr + p.sq_off.array;
+
+ /* Map a queue as mem map */
+ s->sq_ring.queue.sqes = mmap(0, p.sq_entries * sizeof(struct io_uring_sqe),
+ PROT_READ | PROT_WRITE, MAP_SHARED | MAP_POPULATE,
+ s->ring_fd, IORING_OFF_SQES);
+ if (s->sq_ring.queue.sqes == MAP_FAILED) {
+ munmap(sq_ptr, sring->ring_sz);
+ if (sq_ptr != cq_ptr) {
+ ksft_print_msg("failed to mmap uring queue!");
+ munmap(cq_ptr, cring->ring_sz);
+ return 1;
+ }
+ }
+
+ cring->head = cq_ptr + p.cq_off.head;
+ cring->tail = cq_ptr + p.cq_off.tail;
+ cring->ring_mask = cq_ptr + p.cq_off.ring_mask;
+ cring->ring_entries = cq_ptr + p.cq_off.ring_entries;
+ cring->queue.cqes = cq_ptr + p.cq_off.cqes;
+
+ return 0;
+}
+
+/* Init io_uring queues */
+int setup_io_uring(struct io_ring *s)
+{
+ struct io_uring_params para;
+
+ memset(¶, 0, sizeof(para));
+ s->ring_fd = sys_uring_setup(URING_QUEUE_SZ, ¶);
+ if (s->ring_fd < 0)
+ return 1;
+
+ return mmap_io_uring(para, s);
+}
+
+/*
+ * Get data from completion queue. the data buffer saved the file data
+ * return 0: success; others: error;
+ */
+int handle_uring_cq(struct io_ring *s)
+{
+ struct file_io *fi = NULL;
+ struct io_uring_queue *cring = &s->cq_ring;
+ struct io_uring_cqe *cqe;
+ unsigned int head;
+ off_t len = 0;
+
+ head = *cring->head;
+
+ do {
+ barrier();
+ if (head == *cring->tail)
+ break;
+ /* Get the entry */
+ cqe = &cring->queue.cqes[head & *s->cq_ring.ring_mask];
+ fi = (struct file_io *)cqe->user_data;
+ if (cqe->res < 0)
+ break;
+
+ int blocks = (int)(fi->file_sz + URING_BLOCK_SZ - 1) / URING_BLOCK_SZ;
+
+ for (int i = 0; i < blocks; i++)
+ len += fi->iovecs[i].iov_len;
+
+ head++;
+ } while (1);
+
+ *cring->head = head;
+ barrier();
+
+ return (len != fi->file_sz);
+}
+
+/*
+ * Submit squeue. specify via IORING_OP_READV.
+ * the buffer need to be set metadata according to LAM mode
+ */
+int handle_uring_sq(struct io_ring *ring, struct file_io *fi, unsigned long lam)
+{
+ int file_fd = fi->file_fd;
+ struct io_uring_queue *sring = &ring->sq_ring;
+ unsigned int index = 0, cur_block = 0, tail = 0, next_tail = 0;
+ struct io_uring_sqe *sqe;
+
+ off_t remain = fi->file_sz;
+ int blocks = (int)(remain + URING_BLOCK_SZ - 1) / URING_BLOCK_SZ;
+
+ while (remain) {
+ off_t bytes = remain;
+ void *buf;
+
+ if (bytes > URING_BLOCK_SZ)
+ bytes = URING_BLOCK_SZ;
+
+ fi->iovecs[cur_block].iov_len = bytes;
+
+ if (posix_memalign(&buf, URING_BLOCK_SZ, URING_BLOCK_SZ))
+ return 1;
+
+ fi->iovecs[cur_block].iov_base = (void *)set_metadata((uint64_t)buf, lam);
+ remain -= bytes;
+ cur_block++;
+ }
+
+ next_tail = *sring->tail;
+ tail = next_tail;
+ next_tail++;
+
+ barrier();
+
+ index = tail & *ring->sq_ring.ring_mask;
+
+ sqe = &ring->sq_ring.queue.sqes[index];
+ sqe->fd = file_fd;
+ sqe->flags = 0;
+ sqe->opcode = IORING_OP_READV;
+ sqe->addr = (unsigned long)fi->iovecs;
+ sqe->len = blocks;
+ sqe->off = 0;
+ sqe->user_data = (uint64_t)fi;
+
+ sring->array[index] = index;
+ tail = next_tail;
+
+ if (*sring->tail != tail) {
+ *sring->tail = tail;
+ barrier();
+ }
+
+ if (sys_uring_enter(ring->ring_fd, 1, 1, IORING_ENTER_GETEVENTS) < 0)
+ return 1;
+
+ return 0;
+}
+
+/*
+ * Test LAM in async I/O and io_uring, read current binery through io_uring
+ * Set metadata in pointers to iovecs buffer.
+ */
+int do_uring(unsigned long lam)
+{
+ struct io_ring *ring;
+ struct file_io *fi;
+ struct stat st;
+ int ret = 1;
+ char path[PATH_MAX];
+
+ /* get current process path */
+ if (readlink("/proc/self/exe", path, PATH_MAX) <= 0)
+ return 1;
+
+ int file_fd = open(path, O_RDONLY);
+
+ if (file_fd < 0)
+ return 1;
+
+ if (fstat(file_fd, &st) < 0)
+ return 1;
+
+ off_t file_sz = st.st_size;
+
+ int blocks = (int)(file_sz + URING_BLOCK_SZ - 1) / URING_BLOCK_SZ;
+
+ fi = malloc(sizeof(*fi) + sizeof(struct iovec) * blocks);
+ if (!fi)
+ return 1;
+
+ fi->file_sz = file_sz;
+ fi->file_fd = file_fd;
+
+ ring = malloc(sizeof(*ring));
+ if (!ring)
+ return 1;
+
+ memset(ring, 0, sizeof(struct io_ring));
+
+ if (setup_io_uring(ring))
+ goto out;
+
+ if (handle_uring_sq(ring, fi, lam))
+ goto out;
+
+ ret = handle_uring_cq(ring);
+
+out:
+ free(ring);
+
+ for (int i = 0; i < blocks; i++) {
+ if (fi->iovecs[i].iov_base) {
+ uint64_t addr = ((uint64_t)fi->iovecs[i].iov_base);
+
+ switch (lam) {
+ case LAM_U57_BITS: /* Clear bits 62:57 */
+ addr = (addr & ~(0x3fULL << 57));
+ break;
+ }
+ free((void *)addr);
+ fi->iovecs[i].iov_base = NULL;
+ }
+ }
+
+ free(fi);
+
+ return ret;
+}
+
+int handle_uring(struct testcases *test)
+{
+ int ret = 0;
+
+ if (test->later == 0 && test->lam != 0)
+ if (set_lam(test->lam) != 0)
+ return 1;
+
+ if (sigsetjmp(segv_env, 1) == 0) {
+ signal(SIGSEGV, segv_handler);
+ ret = do_uring(test->lam);
+ } else {
+ ret = 2;
+ }
+
+ return ret;
+}
+
static int fork_test(struct testcases *test)
{
int ret, child_ret;
@@ -333,6 +651,22 @@ static void run_test(struct testcases *test, int count)
}
}
+static struct testcases uring_cases[] = {
+ {
+ .later = 0,
+ .lam = LAM_U57_BITS,
+ .test_func = handle_uring,
+ .msg = "URING: LAM_U57. Dereferencing pointer with metadata\n",
+ },
+ {
+ .later = 1,
+ .expected = 1,
+ .lam = LAM_U57_BITS,
+ .test_func = handle_uring,
+ .msg = "URING:[Negative] Disable LAM. Dereferencing pointer with metadata.\n",
+ },
+};
+
static struct testcases malloc_cases[] = {
{
.later = 0,
@@ -403,7 +737,7 @@ static void cmd_help(void)
{
printf("usage: lam [-h] [-t test list]\n");
printf("\t-t test list: run tests specified in the test list, default:0x%x\n", TEST_MASK);
- printf("\t\t0x1:malloc; 0x2:max_bits; 0x4:mmap; 0x8:syscall.\n");
+ printf("\t\t0x1:malloc; 0x2:max_bits; 0x4:mmap; 0x8:syscall; 0x10:io_uring.\n");
printf("\t-h: help\n");
}
@@ -449,6 +783,9 @@ int main(int argc, char **argv)
if (tests & FUNC_SYSCALL)
run_test(syscall_cases, ARRAY_SIZE(syscall_cases));
+ if (tests & FUNC_URING)
+ run_test(uring_cases, ARRAY_SIZE(uring_cases));
+
ksft_set_plan(tests_cnt);
return ksft_exit_pass();
--
2.35.1
From: Weihong Zhang <[email protected]>
LAM is supported only in 64-bit mode and applies only to addresses used
for data accesses. In 64-bit mode, linear addresses have 64 bits. LAM is
applied to 64-bit linear addresses and allows software to use the high
bits for metadata.
LAM supports configurations that differ regarding which pointer bits are masked
and can be used for metadata.
LAM includes the following mode:
- LAM_U57, pointer bits in positions 62:57 are masked (LAM width 6),
allows bits 62:57 of a user pointer to be used as metadata.
There are some arch_prctls:
ARCH_ENABLE_TAGGED_ADDR: enable LAM mode, mask high bits of a user pointer.
ARCH_GET_UNTAG_MASK: get current untagged mask.
ARCH_GET_MAX_TAG_BITS: the maximum number of tag bits a user can
request. Zero if LAM is not supported.
The LAM mode is per-process; a process has only one chance to set the
LAM mode, and there is no API to disable it. Therefore all test cases
run in a child process.
Functions of this test:
MALLOC
- LAM_U57 masks bits 62:57 of a user pointer. Userspace processes can
  dereference such pointers.
- With LAM disabled, dereferencing a pointer with metadata above bit 48
  or bit 57 triggers SIGSEGV.
TAG_BITS
- The maximum number of tag bits for LAM_U57 is 6.
Signed-off-by: Weihong Zhang <[email protected]>
Signed-off-by: Kirill A. Shutemov <[email protected]>
Acked-by: Peter Zijlstra (Intel) <[email protected]>
---
tools/testing/selftests/x86/Makefile | 2 +-
tools/testing/selftests/x86/lam.c | 326 +++++++++++++++++++++++++++
2 files changed, 327 insertions(+), 1 deletion(-)
create mode 100644 tools/testing/selftests/x86/lam.c
diff --git a/tools/testing/selftests/x86/Makefile b/tools/testing/selftests/x86/Makefile
index 0388c4d60af0..c1a16a9d4f2f 100644
--- a/tools/testing/selftests/x86/Makefile
+++ b/tools/testing/selftests/x86/Makefile
@@ -18,7 +18,7 @@ TARGETS_C_32BIT_ONLY := entry_from_vm86 test_syscall_vdso unwind_vdso \
test_FCMOV test_FCOMI test_FISTTP \
vdso_restorer
TARGETS_C_64BIT_ONLY := fsgsbase sysret_rip syscall_numbering \
- corrupt_xstate_header amx
+ corrupt_xstate_header amx lam
# Some selftests require 32bit support enabled also on 64bit systems
TARGETS_C_32BIT_NEEDED := ldt_gdt ptrace_syscall
diff --git a/tools/testing/selftests/x86/lam.c b/tools/testing/selftests/x86/lam.c
new file mode 100644
index 000000000000..900a3a0fb709
--- /dev/null
+++ b/tools/testing/selftests/x86/lam.c
@@ -0,0 +1,326 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/syscall.h>
+#include <time.h>
+#include <signal.h>
+#include <setjmp.h>
+#include <sys/mman.h>
+#include <sys/wait.h>
+#include <inttypes.h>
+
+#include "../kselftest.h"
+
+#ifndef __x86_64__
+# error This test is 64-bit only
+#endif
+
+/* LAM modes, these definitions were copied from kernel code */
+#define LAM_NONE 0
+#define LAM_U57_BITS 6
+
+#define LAM_U57_MASK (0x3fULL << 57)
+/* arch prctl for LAM */
+#define ARCH_GET_UNTAG_MASK 0x4001
+#define ARCH_ENABLE_TAGGED_ADDR 0x4002
+#define ARCH_GET_MAX_TAG_BITS 0x4003
+
+/* Specified test function bits */
+#define FUNC_MALLOC 0x1
+#define FUNC_BITS 0x2
+
+#define TEST_MASK 0x3
+
+#define MALLOC_LEN 32
+
+struct testcases {
+ unsigned int later;
+ int expected; /* 2: SIGSEGV Error; 1: other errors */
+ unsigned long lam;
+ uint64_t addr;
+ int (*test_func)(struct testcases *test);
+ const char *msg;
+};
+
+int tests_cnt;
+jmp_buf segv_env;
+
+static void segv_handler(int sig)
+{
+ ksft_print_msg("Get segmentation fault(%d).", sig);
+ siglongjmp(segv_env, 1);
+}
+
+static inline int cpu_has_lam(void)
+{
+ unsigned int cpuinfo[4];
+
+ __cpuid_count(0x7, 1, cpuinfo[0], cpuinfo[1], cpuinfo[2], cpuinfo[3]);
+
+ return (cpuinfo[0] & (1 << 26));
+}
+
+/*
+ * Set tagged address and read back untag mask.
+ * check if the untagged mask is expected.
+ *
+ * @return:
+ * 0: Set LAM mode successfully
+ * others: failed to set LAM
+ */
+static int set_lam(unsigned long lam)
+{
+ int ret = 0;
+ uint64_t ptr = 0;
+
+ if (lam != LAM_U57_BITS && lam != LAM_NONE)
+ return -1;
+
+ /* Skip check return */
+ syscall(SYS_arch_prctl, ARCH_ENABLE_TAGGED_ADDR, lam);
+
+ /* Get untagged mask */
+ syscall(SYS_arch_prctl, ARCH_GET_UNTAG_MASK, &ptr);
+
+ /* Check mask returned is expected */
+ if (lam == LAM_U57_BITS)
+ ret = (ptr != ~(LAM_U57_MASK));
+ else if (lam == LAM_NONE)
+ ret = (ptr != -1ULL);
+
+ return ret;
+}
+
+static unsigned long get_default_tag_bits(void)
+{
+ pid_t pid;
+ int lam = LAM_NONE;
+ int ret = 0;
+
+ pid = fork();
+ if (pid < 0) {
+ perror("Fork failed.");
+ } else if (pid == 0) {
+ /* Set LAM mode in child process */
+ if (set_lam(LAM_U57_BITS) == 0)
+ lam = LAM_U57_BITS;
+ else
+ lam = LAM_NONE;
+ exit(lam);
+ } else {
+ wait(&ret);
+ lam = WEXITSTATUS(ret);
+ }
+
+ return lam;
+}
+
+/* According to LAM mode, set metadata in high bits */
+static uint64_t set_metadata(uint64_t src, unsigned long lam)
+{
+ uint64_t metadata;
+
+ srand(time(NULL));
+ /* Get a random value as metadata */
+ metadata = rand();
+
+ switch (lam) {
+ case LAM_U57_BITS: /* Set metadata in bits 62:57 */
+ metadata = (src & ~(LAM_U57_MASK)) | ((metadata & 0x3f) << 57);
+ break;
+ default:
+ metadata = src;
+ break;
+ }
+
+ return metadata;
+}
+
+/*
+ * Set metadata in user pointer, compare new pointer with original pointer.
+ * both pointers should point to the same address.
+ *
+ * @return:
+ * 0: value on the pointer with metadate and value on original are same
+ * 1: not same.
+ */
+static int handle_lam_test(void *src, unsigned int lam)
+{
+ char *ptr;
+
+ strcpy((char *)src, "USER POINTER");
+
+ ptr = (char *)set_metadata((uint64_t)src, lam);
+ if (src == ptr)
+ return 0;
+
+ /* Copy a string into the pointer with metadata */
+ strcpy((char *)ptr, "METADATA POINTER");
+
+ return (!!strcmp((char *)src, (char *)ptr));
+}
+
+
+int handle_max_bits(struct testcases *test)
+{
+ unsigned long exp_bits = get_default_tag_bits();
+ unsigned long bits = 0;
+
+ if (exp_bits != LAM_NONE)
+ exp_bits = LAM_U57_BITS;
+
+ /* Get LAM max tag bits */
+ if (syscall(SYS_arch_prctl, ARCH_GET_MAX_TAG_BITS, &bits) == -1)
+ return 1;
+
+ return (exp_bits != bits);
+}
+
+/*
+ * Test lam feature through dereference pointer get from malloc.
+ * @return 0: Pass test. 1: Get failure during test 2: Get SIGSEGV
+ */
+static int handle_malloc(struct testcases *test)
+{
+ char *ptr = NULL;
+ int ret = 0;
+
+ if (test->later == 0 && test->lam != 0)
+ if (set_lam(test->lam) == -1)
+ return 1;
+
+ ptr = (char *)malloc(MALLOC_LEN);
+ if (ptr == NULL) {
+ perror("malloc() failure\n");
+ return 1;
+ }
+
+ /* Set signal handler */
+ if (sigsetjmp(segv_env, 1) == 0) {
+ signal(SIGSEGV, segv_handler);
+ ret = handle_lam_test(ptr, test->lam);
+ } else {
+ ret = 2;
+ }
+
+ if (test->later != 0 && test->lam != 0)
+ if (set_lam(test->lam) == -1 && ret == 0)
+ ret = 1;
+
+ free(ptr);
+
+ return ret;
+}
+
+static int fork_test(struct testcases *test)
+{
+ int ret, child_ret;
+ pid_t pid;
+
+ pid = fork();
+ if (pid < 0) {
+ perror("Fork failed.");
+ ret = 1;
+ } else if (pid == 0) {
+ ret = test->test_func(test);
+ exit(ret);
+ } else {
+ wait(&child_ret);
+ ret = WEXITSTATUS(child_ret);
+ }
+
+ return ret;
+}
+
+static void run_test(struct testcases *test, int count)
+{
+ int i, ret = 0;
+
+ for (i = 0; i < count; i++) {
+ struct testcases *t = test + i;
+
+ /* fork a process to run test case */
+ ret = fork_test(t);
+ if (ret != 0)
+ ret = (t->expected == ret);
+ else
+ ret = !(t->expected);
+
+ tests_cnt++;
+ ksft_test_result(ret, t->msg);
+ }
+}
+
+static struct testcases malloc_cases[] = {
+ {
+ .later = 0,
+ .lam = LAM_U57_BITS,
+ .test_func = handle_malloc,
+ .msg = "MALLOC: LAM_U57. Dereferencing pointer with metadata\n",
+ },
+ {
+ .later = 1,
+ .expected = 2,
+ .lam = LAM_U57_BITS,
+ .test_func = handle_malloc,
+ .msg = "MALLOC:[Negative] Disable LAM. Dereferencing pointer with metadata.\n",
+ },
+};
+
+
+static struct testcases bits_cases[] = {
+ {
+ .test_func = handle_max_bits,
+ .msg = "BITS: Check default tag bits\n",
+ },
+};
+
+static void cmd_help(void)
+{
+ printf("usage: lam [-h] [-t test list]\n");
+ printf("\t-t test list: run tests specified in the test list, default:0x%x\n", TEST_MASK);
+ printf("\t\t0x1:malloc; 0x2:max_bits;\n");
+ printf("\t-h: help\n");
+}
+
+int main(int argc, char **argv)
+{
+ int c = 0;
+ unsigned int tests = TEST_MASK;
+
+ tests_cnt = 0;
+
+ if (!cpu_has_lam()) {
+ ksft_print_msg("Unsupported LAM feature!\n");
+ return -1;
+ }
+
+ while ((c = getopt(argc, argv, "ht:")) != -1) {
+ switch (c) {
+ case 't':
+ tests = strtoul(optarg, NULL, 16);
+ if (!(tests & TEST_MASK)) {
+ ksft_print_msg("Invalid argument!\n");
+ return -1;
+ }
+ break;
+ case 'h':
+ cmd_help();
+ return 0;
+ default:
+ ksft_print_msg("Invalid argument\n");
+ return -1;
+ }
+ }
+
+ if (tests & FUNC_MALLOC)
+ run_test(malloc_cases, ARRAY_SIZE(malloc_cases));
+
+ if (tests & FUNC_BITS)
+ run_test(bits_cases, ARRAY_SIZE(bits_cases));
+
+ ksft_set_plan(tests_cnt);
+
+ return ksft_exit_pass();
+}
--
2.35.1
Add a line to /proc/$PID/arch_status to report untag_mask. It can be
used to find out the LAM status of a process from the outside, which is
useful for debuggers.
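For example, with LAM_U57 enabled the new line is expected to look like this (0x81ffffffffffffff is ~GENMASK(62, 57), matching the seq_printf() format in the patch below); with LAM disabled the mask reads 0xffffffffffffffff:

	untag_mask:	0x81ffffffffffffff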
Signed-off-by: Kirill A. Shutemov <[email protected]>
Tested-by: Alexander Potapenko <[email protected]>
Acked-by: Peter Zijlstra (Intel) <[email protected]>
---
arch/x86/include/asm/mmu_context.h | 10 +++++
arch/x86/kernel/Makefile | 2 +
arch/x86/kernel/fpu/xstate.c | 47 -----------------------
arch/x86/kernel/proc.c | 60 ++++++++++++++++++++++++++++++
4 files changed, 72 insertions(+), 47 deletions(-)
create mode 100644 arch/x86/kernel/proc.c
diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
index 5bd3d46685dc..b0e9ea23758b 100644
--- a/arch/x86/include/asm/mmu_context.h
+++ b/arch/x86/include/asm/mmu_context.h
@@ -103,6 +103,11 @@ static inline void dup_lam(struct mm_struct *oldmm, struct mm_struct *mm)
mm->context.untag_mask = oldmm->context.untag_mask;
}
+static inline unsigned long mm_untag_mask(struct mm_struct *mm)
+{
+ return mm->context.untag_mask;
+}
+
static inline void mm_reset_untag_mask(struct mm_struct *mm)
{
mm->context.untag_mask = -1UL;
@@ -119,6 +124,11 @@ static inline void dup_lam(struct mm_struct *oldmm, struct mm_struct *mm)
{
}
+static inline unsigned long mm_untag_mask(struct mm_struct *mm)
+{
+ return -1UL;
+}
+
static inline void mm_reset_untag_mask(struct mm_struct *mm)
{
}
diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index a20a5ebfacd7..fada0e36031b 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -139,6 +139,8 @@ obj-$(CONFIG_UNWINDER_GUESS) += unwind_guess.o
obj-$(CONFIG_AMD_MEM_ENCRYPT) += sev.o
+obj-$(CONFIG_PROC_FS) += proc.o
+
###
# 64 bit specific files
ifeq ($(CONFIG_X86_64),y)
diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index c8340156bfd2..838a6f0627fd 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -10,8 +10,6 @@
#include <linux/mman.h>
#include <linux/nospec.h>
#include <linux/pkeys.h>
-#include <linux/seq_file.h>
-#include <linux/proc_fs.h>
#include <linux/vmalloc.h>
#include <asm/fpu/api.h>
@@ -1745,48 +1743,3 @@ long fpu_xstate_prctl(int option, unsigned long arg2)
return -EINVAL;
}
}
-
-#ifdef CONFIG_PROC_PID_ARCH_STATUS
-/*
- * Report the amount of time elapsed in millisecond since last AVX512
- * use in the task.
- */
-static void avx512_status(struct seq_file *m, struct task_struct *task)
-{
- unsigned long timestamp = READ_ONCE(task->thread.fpu.avx512_timestamp);
- long delta;
-
- if (!timestamp) {
- /*
- * Report -1 if no AVX512 usage
- */
- delta = -1;
- } else {
- delta = (long)(jiffies - timestamp);
- /*
- * Cap to LONG_MAX if time difference > LONG_MAX
- */
- if (delta < 0)
- delta = LONG_MAX;
- delta = jiffies_to_msecs(delta);
- }
-
- seq_put_decimal_ll(m, "AVX512_elapsed_ms:\t", delta);
- seq_putc(m, '\n');
-}
-
-/*
- * Report architecture specific information
- */
-int proc_pid_arch_status(struct seq_file *m, struct pid_namespace *ns,
- struct pid *pid, struct task_struct *task)
-{
- /*
- * Report AVX512 state if the processor and build option supported.
- */
- if (cpu_feature_enabled(X86_FEATURE_AVX512F))
- avx512_status(m, task);
-
- return 0;
-}
-#endif /* CONFIG_PROC_PID_ARCH_STATUS */
diff --git a/arch/x86/kernel/proc.c b/arch/x86/kernel/proc.c
new file mode 100644
index 000000000000..9765b4d05ce4
--- /dev/null
+++ b/arch/x86/kernel/proc.c
@@ -0,0 +1,60 @@
+// SPDX-License-Identifier: GPL-2.0-only
+#include <linux/proc_fs.h>
+#include <linux/sched/mm.h>
+#include <linux/seq_file.h>
+#include <uapi/asm/prctl.h>
+#include <asm/mmu_context.h>
+
+/*
+ * Report the amount of time elapsed in millisecond since last AVX512
+ * use in the task.
+ */
+static void avx512_status(struct seq_file *m, struct task_struct *task)
+{
+ unsigned long timestamp = READ_ONCE(task->thread.fpu.avx512_timestamp);
+ long delta;
+
+ if (!timestamp) {
+ /*
+ * Report -1 if no AVX512 usage
+ */
+ delta = -1;
+ } else {
+ delta = (long)(jiffies - timestamp);
+ /*
+ * Cap to LONG_MAX if time difference > LONG_MAX
+ */
+ if (delta < 0)
+ delta = LONG_MAX;
+ delta = jiffies_to_msecs(delta);
+ }
+
+ seq_put_decimal_ll(m, "AVX512_elapsed_ms:\t", delta);
+ seq_putc(m, '\n');
+}
+
+/*
+ * Report architecture specific information
+ */
+int proc_pid_arch_status(struct seq_file *m, struct pid_namespace *ns,
+ struct pid *pid, struct task_struct *task)
+{
+ struct mm_struct *mm;
+ unsigned long untag_mask = -1UL;
+
+ /*
+ * Report AVX512 state if the processor and build option supported.
+ */
+ if (cpu_feature_enabled(X86_FEATURE_AVX512F))
+ avx512_status(m, task);
+
+ mm = get_task_mm(task);
+ if (mm) {
+ untag_mask = mm_untag_mask(task->mm);
+ mmput(mm);
+ }
+
+ seq_printf(m, "untag_mask:\t%#lx\n", untag_mask);
+
+ return 0;
+}
--
2.35.1
Intel Linear Address Masking (LAM) brings per-mm untagging rules. Pass
down mm_struct to the untagging helper. It will help to apply the
untagging policy correctly.
In most cases, current->mm is the one to use, but there are some
exceptions, such as get_user_pages_remote().
Move the dummy implementation of untagged_addr() from <linux/mm.h> to
<linux/uaccess.h>. <asm/uaccess.h> can override the implementation.
Moving the dummy implementation out of <linux/mm.h> helps to avoid
header hell if you need to dereference mm_struct within the helper.
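A short sketch of the resulting calling convention (illustrative only):

	/* Syscall context: untagging rules of the calling process apply */
	addr = untagged_addr(current->mm, addr);

	/* Remote access, e.g. GUP on another process: use the target mm */
	start = untagged_addr(mm, start);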
Signed-off-by: Kirill A. Shutemov <[email protected]>
Reviewed-by: Rick Edgecombe <[email protected]>
Reviewed-by: Alexander Potapenko <[email protected]>
Tested-by: Alexander Potapenko <[email protected]>
Acked-by: Peter Zijlstra (Intel) <[email protected]>
---
arch/arm64/include/asm/memory.h | 4 ++--
arch/arm64/include/asm/signal.h | 2 +-
arch/arm64/include/asm/uaccess.h | 4 ++--
arch/arm64/kernel/hw_breakpoint.c | 2 +-
arch/arm64/kernel/traps.c | 4 ++--
arch/arm64/mm/fault.c | 10 +++++-----
arch/sparc/include/asm/pgtable_64.h | 2 +-
arch/sparc/include/asm/uaccess_64.h | 2 ++
drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c | 2 +-
drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c | 2 +-
drivers/gpu/drm/radeon/radeon_gem.c | 2 +-
drivers/infiniband/hw/mlx4/mr.c | 2 +-
drivers/media/common/videobuf2/frame_vector.c | 2 +-
drivers/media/v4l2-core/videobuf-dma-contig.c | 2 +-
drivers/staging/media/atomisp/pci/hmm/hmm_bo.c | 2 +-
drivers/tee/tee_shm.c | 2 +-
drivers/vfio/vfio_iommu_type1.c | 2 +-
fs/proc/task_mmu.c | 2 +-
include/linux/mm.h | 11 -----------
include/linux/uaccess.h | 15 +++++++++++++++
lib/strncpy_from_user.c | 2 +-
lib/strnlen_user.c | 2 +-
mm/gup.c | 6 +++---
mm/madvise.c | 2 +-
mm/mempolicy.c | 6 +++---
mm/migrate.c | 2 +-
mm/mincore.c | 2 +-
mm/mlock.c | 4 ++--
mm/mmap.c | 2 +-
mm/mprotect.c | 2 +-
mm/mremap.c | 2 +-
mm/msync.c | 2 +-
virt/kvm/kvm_main.c | 2 +-
33 files changed, 59 insertions(+), 53 deletions(-)
diff --git a/arch/arm64/include/asm/memory.h b/arch/arm64/include/asm/memory.h
index 9dd08cd339c3..5b24ef93c6b9 100644
--- a/arch/arm64/include/asm/memory.h
+++ b/arch/arm64/include/asm/memory.h
@@ -227,8 +227,8 @@ static inline unsigned long kaslr_offset(void)
#define __untagged_addr(addr) \
((__force __typeof__(addr))sign_extend64((__force u64)(addr), 55))
-#define untagged_addr(addr) ({ \
- u64 __addr = (__force u64)(addr); \
+#define untagged_addr(mm, addr) ({ \
+ u64 __addr = (__force u64)(addr); \
__addr &= __untagged_addr(__addr); \
(__force __typeof__(addr))__addr; \
})
diff --git a/arch/arm64/include/asm/signal.h b/arch/arm64/include/asm/signal.h
index ef449f5f4ba8..0899c355c398 100644
--- a/arch/arm64/include/asm/signal.h
+++ b/arch/arm64/include/asm/signal.h
@@ -18,7 +18,7 @@ static inline void __user *arch_untagged_si_addr(void __user *addr,
if (sig == SIGTRAP && si_code == TRAP_BRKPT)
return addr;
- return untagged_addr(addr);
+ return untagged_addr(current->mm, addr);
}
#define arch_untagged_si_addr arch_untagged_si_addr
diff --git a/arch/arm64/include/asm/uaccess.h b/arch/arm64/include/asm/uaccess.h
index 2fc9f0861769..0e17f44cf997 100644
--- a/arch/arm64/include/asm/uaccess.h
+++ b/arch/arm64/include/asm/uaccess.h
@@ -44,7 +44,7 @@ static inline int access_ok(const void __user *addr, unsigned long size)
*/
if (IS_ENABLED(CONFIG_ARM64_TAGGED_ADDR_ABI) &&
(current->flags & PF_KTHREAD || test_thread_flag(TIF_TAGGED_ADDR)))
- addr = untagged_addr(addr);
+ addr = untagged_addr(current->mm, addr);
return likely(__access_ok(addr, size));
}
@@ -217,7 +217,7 @@ static inline void __user *__uaccess_mask_ptr(const void __user *ptr)
" csel %0, %1, xzr, eq\n"
: "=&r" (safe_ptr)
: "r" (ptr), "r" (TASK_SIZE_MAX - 1),
- "r" (untagged_addr(ptr))
+ "r" (untagged_addr(current->mm, ptr))
: "cc");
csdb();
diff --git a/arch/arm64/kernel/hw_breakpoint.c b/arch/arm64/kernel/hw_breakpoint.c
index b29a311bb055..d637cee7b771 100644
--- a/arch/arm64/kernel/hw_breakpoint.c
+++ b/arch/arm64/kernel/hw_breakpoint.c
@@ -715,7 +715,7 @@ static u64 get_distance_from_watchpoint(unsigned long addr, u64 val,
u64 wp_low, wp_high;
u32 lens, lene;
- addr = untagged_addr(addr);
+ addr = untagged_addr(current->mm, addr);
lens = __ffs(ctrl->len);
lene = __fls(ctrl->len);
diff --git a/arch/arm64/kernel/traps.c b/arch/arm64/kernel/traps.c
index b7fed33981f7..9edef0e155b6 100644
--- a/arch/arm64/kernel/traps.c
+++ b/arch/arm64/kernel/traps.c
@@ -476,7 +476,7 @@ void arm64_notify_segfault(unsigned long addr)
int code;
mmap_read_lock(current->mm);
- if (find_vma(current->mm, untagged_addr(addr)) == NULL)
+ if (find_vma(current->mm, untagged_addr(current->mm, addr)) == NULL)
code = SEGV_MAPERR;
else
code = SEGV_ACCERR;
@@ -540,7 +540,7 @@ static void user_cache_maint_handler(unsigned long esr, struct pt_regs *regs)
int ret = 0;
tagged_address = pt_regs_read_reg(regs, rt);
- address = untagged_addr(tagged_address);
+ address = untagged_addr(current->mm, tagged_address);
switch (crm) {
case ESR_ELx_SYS64_ISS_CRM_DC_CVAU: /* DC CVAU, gets promoted */
diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
index c33f1fad2745..1fa0f1166ac0 100644
--- a/arch/arm64/mm/fault.c
+++ b/arch/arm64/mm/fault.c
@@ -454,7 +454,7 @@ static void set_thread_esr(unsigned long address, unsigned long esr)
static void do_bad_area(unsigned long far, unsigned long esr,
struct pt_regs *regs)
{
- unsigned long addr = untagged_addr(far);
+ unsigned long addr = untagged_addr(current->mm, far);
/*
* If we are in kernel mode at this point, we have no context to
@@ -524,7 +524,7 @@ static int __kprobes do_page_fault(unsigned long far, unsigned long esr,
vm_fault_t fault;
unsigned long vm_flags;
unsigned int mm_flags = FAULT_FLAG_DEFAULT;
- unsigned long addr = untagged_addr(far);
+ unsigned long addr = untagged_addr(mm, far);
if (kprobe_page_fault(regs, esr))
return 0;
@@ -679,7 +679,7 @@ static int __kprobes do_translation_fault(unsigned long far,
unsigned long esr,
struct pt_regs *regs)
{
- unsigned long addr = untagged_addr(far);
+ unsigned long addr = untagged_addr(current->mm, far);
if (is_ttbr0_addr(addr))
return do_page_fault(far, esr, regs);
@@ -723,7 +723,7 @@ static int do_sea(unsigned long far, unsigned long esr, struct pt_regs *regs)
* UNKNOWN for synchronous external aborts. Mask them out now
* so that userspace doesn't see them.
*/
- siaddr = untagged_addr(far);
+ siaddr = untagged_addr(current->mm, far);
}
arm64_notify_die(inf->name, regs, inf->sig, inf->code, siaddr, esr);
@@ -813,7 +813,7 @@ static const struct fault_info fault_info[] = {
void do_mem_abort(unsigned long far, unsigned long esr, struct pt_regs *regs)
{
const struct fault_info *inf = esr_to_fault_info(esr);
- unsigned long addr = untagged_addr(far);
+ unsigned long addr = untagged_addr(current->mm, far);
if (!inf->fn(far, esr, regs))
return;
diff --git a/arch/sparc/include/asm/pgtable_64.h b/arch/sparc/include/asm/pgtable_64.h
index a779418ceba9..aa996ffe5c8c 100644
--- a/arch/sparc/include/asm/pgtable_64.h
+++ b/arch/sparc/include/asm/pgtable_64.h
@@ -1052,7 +1052,7 @@ static inline unsigned long __untagged_addr(unsigned long start)
return start;
}
-#define untagged_addr(addr) \
+#define untagged_addr(mm, addr) \
((__typeof__(addr))(__untagged_addr((unsigned long)(addr))))
static inline bool pte_access_permitted(pte_t pte, bool write)
diff --git a/arch/sparc/include/asm/uaccess_64.h b/arch/sparc/include/asm/uaccess_64.h
index 94266a5c5b04..b825a5dd0210 100644
--- a/arch/sparc/include/asm/uaccess_64.h
+++ b/arch/sparc/include/asm/uaccess_64.h
@@ -8,8 +8,10 @@
#include <linux/compiler.h>
#include <linux/string.h>
+#include <linux/mm_types.h>
#include <asm/asi.h>
#include <asm/spitfire.h>
+#include <asm/pgtable.h>
#include <asm/processor.h>
#include <asm-generic/access_ok.h>
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
index a699134a1e8c..71babd0eb70a 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
@@ -1653,7 +1653,7 @@ int amdgpu_amdkfd_gpuvm_alloc_memory_of_gpu(
if (flags & KFD_IOC_ALLOC_MEM_FLAGS_USERPTR) {
if (!offset || !*offset)
return -EINVAL;
- user_addr = untagged_addr(*offset);
+ user_addr = untagged_addr(current->mm, *offset);
} else if (flags & (KFD_IOC_ALLOC_MEM_FLAGS_DOORBELL |
KFD_IOC_ALLOC_MEM_FLAGS_MMIO_REMAP)) {
bo_type = ttm_bo_type_sg;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
index 8ef31d687ef3..691dfb3f2c0e 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
@@ -382,7 +382,7 @@ int amdgpu_gem_userptr_ioctl(struct drm_device *dev, void *data,
uint32_t handle;
int r;
- args->addr = untagged_addr(args->addr);
+ args->addr = untagged_addr(current->mm, args->addr);
if (offset_in_page(args->addr | args->size))
return -EINVAL;
diff --git a/drivers/gpu/drm/radeon/radeon_gem.c b/drivers/gpu/drm/radeon/radeon_gem.c
index 261fcbae88d7..cba2f4b19838 100644
--- a/drivers/gpu/drm/radeon/radeon_gem.c
+++ b/drivers/gpu/drm/radeon/radeon_gem.c
@@ -371,7 +371,7 @@ int radeon_gem_userptr_ioctl(struct drm_device *dev, void *data,
uint32_t handle;
int r;
- args->addr = untagged_addr(args->addr);
+ args->addr = untagged_addr(current->mm, args->addr);
if (offset_in_page(args->addr | args->size))
return -EINVAL;
diff --git a/drivers/infiniband/hw/mlx4/mr.c b/drivers/infiniband/hw/mlx4/mr.c
index 04a67b481608..b2860feeae3c 100644
--- a/drivers/infiniband/hw/mlx4/mr.c
+++ b/drivers/infiniband/hw/mlx4/mr.c
@@ -379,7 +379,7 @@ static struct ib_umem *mlx4_get_umem_mr(struct ib_device *device, u64 start,
* again
*/
if (!ib_access_writable(access_flags)) {
- unsigned long untagged_start = untagged_addr(start);
+ unsigned long untagged_start = untagged_addr(current->mm, start);
struct vm_area_struct *vma;
mmap_read_lock(current->mm);
diff --git a/drivers/media/common/videobuf2/frame_vector.c b/drivers/media/common/videobuf2/frame_vector.c
index 542dde9d2609..7e62f7a2555d 100644
--- a/drivers/media/common/videobuf2/frame_vector.c
+++ b/drivers/media/common/videobuf2/frame_vector.c
@@ -47,7 +47,7 @@ int get_vaddr_frames(unsigned long start, unsigned int nr_frames,
if (WARN_ON_ONCE(nr_frames > vec->nr_allocated))
nr_frames = vec->nr_allocated;
- start = untagged_addr(start);
+ start = untagged_addr(mm, start);
ret = pin_user_pages_fast(start, nr_frames,
FOLL_FORCE | FOLL_WRITE | FOLL_LONGTERM,
diff --git a/drivers/media/v4l2-core/videobuf-dma-contig.c b/drivers/media/v4l2-core/videobuf-dma-contig.c
index 52312ce2ba05..a1444f8afa05 100644
--- a/drivers/media/v4l2-core/videobuf-dma-contig.c
+++ b/drivers/media/v4l2-core/videobuf-dma-contig.c
@@ -157,8 +157,8 @@ static void videobuf_dma_contig_user_put(struct videobuf_dma_contig_memory *mem)
static int videobuf_dma_contig_user_get(struct videobuf_dma_contig_memory *mem,
struct videobuf_buffer *vb)
{
- unsigned long untagged_baddr = untagged_addr(vb->baddr);
struct mm_struct *mm = current->mm;
+ unsigned long untagged_baddr = untagged_addr(mm, vb->baddr);
struct vm_area_struct *vma;
unsigned long prev_pfn, this_pfn;
unsigned long pages_done, user_address;
diff --git a/drivers/staging/media/atomisp/pci/hmm/hmm_bo.c b/drivers/staging/media/atomisp/pci/hmm/hmm_bo.c
index f50494123f03..a43c65950554 100644
--- a/drivers/staging/media/atomisp/pci/hmm/hmm_bo.c
+++ b/drivers/staging/media/atomisp/pci/hmm/hmm_bo.c
@@ -794,7 +794,7 @@ static int alloc_user_pages(struct hmm_buffer_object *bo,
* and map to user space
*/
- userptr = untagged_addr(userptr);
+ userptr = untagged_addr(current->mm, userptr);
if (vma->vm_flags & (VM_IO | VM_PFNMAP)) {
page_nr = pin_user_pages((unsigned long)userptr, bo->pgnr,
diff --git a/drivers/tee/tee_shm.c b/drivers/tee/tee_shm.c
index f2b1bcefcadd..386be09cb2cd 100644
--- a/drivers/tee/tee_shm.c
+++ b/drivers/tee/tee_shm.c
@@ -261,7 +261,7 @@ register_shm_helper(struct tee_context *ctx, unsigned long addr,
shm->flags = flags;
shm->ctx = ctx;
shm->id = id;
- addr = untagged_addr(addr);
+ addr = untagged_addr(current->mm, addr);
start = rounddown(addr, PAGE_SIZE);
shm->offset = addr - start;
shm->size = length;
diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index db516c90a977..1ab5adba6482 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -562,7 +562,7 @@ static int vaddr_get_pfns(struct mm_struct *mm, unsigned long vaddr,
goto done;
}
- vaddr = untagged_addr(vaddr);
+ vaddr = untagged_addr(mm, vaddr);
retry:
vma = vma_lookup(mm, vaddr);
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index a3398d0f1927..9255596b690f 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -1662,7 +1662,7 @@ static ssize_t pagemap_read(struct file *file, char __user *buf,
/* watch out for wraparound */
start_vaddr = end_vaddr;
if (svpfn <= (ULONG_MAX >> PAGE_SHIFT))
- start_vaddr = untagged_addr(svpfn << PAGE_SHIFT);
+ start_vaddr = untagged_addr(mm, svpfn << PAGE_SHIFT);
/* Ensure the address is inside the task */
if (start_vaddr > mm->task_size)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 3bedc449c14d..ae2f8c9fbc4d 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -95,17 +95,6 @@ extern int mmap_rnd_compat_bits __read_mostly;
#include <asm/page.h>
#include <asm/processor.h>
-/*
- * Architectures that support memory tagging (assigning tags to memory regions,
- * embedding these tags into addresses that point to these memory regions, and
- * checking that the memory and the pointer tags match on memory accesses)
- * redefine this macro to strip tags from pointers.
- * It's defined as noop for architectures that don't support memory tagging.
- */
-#ifndef untagged_addr
-#define untagged_addr(addr) (addr)
-#endif
-
#ifndef __pa_symbol
#define __pa_symbol(x) __pa(RELOC_HIDE((unsigned long)(x), 0))
#endif
diff --git a/include/linux/uaccess.h b/include/linux/uaccess.h
index 47e5d374c7eb..aed9555aed67 100644
--- a/include/linux/uaccess.h
+++ b/include/linux/uaccess.h
@@ -10,6 +10,21 @@
#include <asm/uaccess.h>
+/*
+ * Architectures that support memory tagging (assigning tags to memory regions,
+ * embedding these tags into addresses that point to these memory regions, and
+ * checking that the memory and the pointer tags match on memory accesses)
+ * redefine this macro to strip tags from pointers.
+ *
+ * Passing down mm_struct allows to define untagging rules on per-process
+ * basis.
+ *
+ * It's defined as noop for architectures that don't support memory tagging.
+ */
+#ifndef untagged_addr
+#define untagged_addr(mm, addr) (addr)
+#endif
+
/*
* Architectures should provide two primitives (raw_copy_{to,from}_user())
* and get rid of their private instances of copy_{to,from}_user() and
diff --git a/lib/strncpy_from_user.c b/lib/strncpy_from_user.c
index 6432b8c3e431..6e1e2aa0c994 100644
--- a/lib/strncpy_from_user.c
+++ b/lib/strncpy_from_user.c
@@ -121,7 +121,7 @@ long strncpy_from_user(char *dst, const char __user *src, long count)
return 0;
max_addr = TASK_SIZE_MAX;
- src_addr = (unsigned long)untagged_addr(src);
+ src_addr = (unsigned long)untagged_addr(current->mm, src);
if (likely(src_addr < max_addr)) {
unsigned long max = max_addr - src_addr;
long retval;
diff --git a/lib/strnlen_user.c b/lib/strnlen_user.c
index feeb935a2299..abc096a68f05 100644
--- a/lib/strnlen_user.c
+++ b/lib/strnlen_user.c
@@ -97,7 +97,7 @@ long strnlen_user(const char __user *str, long count)
return 0;
max_addr = TASK_SIZE_MAX;
- src_addr = (unsigned long)untagged_addr(str);
+ src_addr = (unsigned long)untagged_addr(current->mm, str);
if (likely(src_addr < max_addr)) {
unsigned long max = max_addr - src_addr;
long retval;
diff --git a/mm/gup.c b/mm/gup.c
index 732825157430..9f2a14f7e77a 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1125,7 +1125,7 @@ static long __get_user_pages(struct mm_struct *mm,
if (!nr_pages)
return 0;
- start = untagged_addr(start);
+ start = untagged_addr(mm, start);
VM_BUG_ON(!!pages != !!(gup_flags & (FOLL_GET | FOLL_PIN)));
@@ -1307,7 +1307,7 @@ int fixup_user_fault(struct mm_struct *mm,
struct vm_area_struct *vma;
vm_fault_t ret;
- address = untagged_addr(address);
+ address = untagged_addr(mm, address);
if (unlocked)
fault_flags |= FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE;
@@ -2935,7 +2935,7 @@ static int internal_get_user_pages_fast(unsigned long start,
if (!(gup_flags & FOLL_FAST_ONLY))
might_lock_read(¤t->mm->mmap_lock);
- start = untagged_addr(start) & PAGE_MASK;
+ start = untagged_addr(current->mm, start) & PAGE_MASK;
len = nr_pages << PAGE_SHIFT;
if (check_add_overflow(start, len, &end))
return 0;
diff --git a/mm/madvise.c b/mm/madvise.c
index 5f0f0948a50e..fb78881329c4 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -1373,7 +1373,7 @@ int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int beh
size_t len;
struct blk_plug plug;
- start = untagged_addr(start);
+ start = untagged_addr(mm, start);
if (!madvise_behavior_valid(behavior))
return -EINVAL;
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index b73d3248d976..92819bccb082 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1456,7 +1456,7 @@ static long kernel_mbind(unsigned long start, unsigned long len,
int lmode = mode;
int err;
- start = untagged_addr(start);
+ start = untagged_addr(current->mm, start);
err = sanitize_mpol_flags(&lmode, &mode_flags);
if (err)
return err;
@@ -1479,7 +1479,7 @@ SYSCALL_DEFINE4(set_mempolicy_home_node, unsigned long, start, unsigned long, le
unsigned long end;
int err = -ENOENT;
- start = untagged_addr(start);
+ start = untagged_addr(mm, start);
if (start & ~PAGE_MASK)
return -EINVAL;
/*
@@ -1682,7 +1682,7 @@ static int kernel_get_mempolicy(int __user *policy,
if (nmask != NULL && maxnode < nr_node_ids)
return -EINVAL;
- addr = untagged_addr(addr);
+ addr = untagged_addr(current->mm, addr);
err = do_get_mempolicy(&pval, &nodes, addr, flags);
diff --git a/mm/migrate.c b/mm/migrate.c
index 6a1597c92261..e05e56605bce 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1768,7 +1768,7 @@ static int do_pages_move(struct mm_struct *mm, nodemask_t task_nodes,
goto out_flush;
if (get_user(node, nodes + i))
goto out_flush;
- addr = (unsigned long)untagged_addr(p);
+ addr = (unsigned long)untagged_addr(mm, p);
err = -ENODEV;
if (node < 0 || node >= MAX_NUMNODES)
diff --git a/mm/mincore.c b/mm/mincore.c
index fa200c14185f..72c55bd9d184 100644
--- a/mm/mincore.c
+++ b/mm/mincore.c
@@ -236,7 +236,7 @@ SYSCALL_DEFINE3(mincore, unsigned long, start, size_t, len,
unsigned long pages;
unsigned char *tmp;
- start = untagged_addr(start);
+ start = untagged_addr(current->mm, start);
/* Check the start address: needs to be page-aligned.. */
if (start & ~PAGE_MASK)
diff --git a/mm/mlock.c b/mm/mlock.c
index b14e929084cc..1ea100fe3db4 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -571,7 +571,7 @@ static __must_check int do_mlock(unsigned long start, size_t len, vm_flags_t fla
unsigned long lock_limit;
int error = -ENOMEM;
- start = untagged_addr(start);
+ start = untagged_addr(current->mm, start);
if (!can_do_mlock())
return -EPERM;
@@ -634,7 +634,7 @@ SYSCALL_DEFINE2(munlock, unsigned long, start, size_t, len)
{
int ret;
- start = untagged_addr(start);
+ start = untagged_addr(current->mm, start);
len = PAGE_ALIGN(len + (offset_in_page(start)));
start &= PAGE_MASK;
diff --git a/mm/mmap.c b/mm/mmap.c
index c035020d0c89..90047dc5098a 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2877,7 +2877,7 @@ EXPORT_SYMBOL(vm_munmap);
SYSCALL_DEFINE2(munmap, unsigned long, addr, size_t, len)
{
- addr = untagged_addr(addr);
+ addr = untagged_addr(current->mm, addr);
return __vm_munmap(addr, len, true);
}
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 3a23dde73723..78adf9635194 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -669,7 +669,7 @@ static int do_mprotect_pkey(unsigned long start, size_t len,
(prot & PROT_READ);
struct mmu_gather tlb;
- start = untagged_addr(start);
+ start = untagged_addr(current->mm, start);
prot &= ~(PROT_GROWSDOWN|PROT_GROWSUP);
if (grows == (PROT_GROWSDOWN|PROT_GROWSUP)) /* can't be both */
diff --git a/mm/mremap.c b/mm/mremap.c
index b522cd0259a0..f76648bc4f67 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -906,7 +906,7 @@ SYSCALL_DEFINE5(mremap, unsigned long, addr, unsigned long, old_len,
*
* See Documentation/arm64/tagged-address-abi.rst for more information.
*/
- addr = untagged_addr(addr);
+ addr = untagged_addr(mm, addr);
if (flags & ~(MREMAP_FIXED | MREMAP_MAYMOVE | MREMAP_DONTUNMAP))
return ret;
diff --git a/mm/msync.c b/mm/msync.c
index 137d1c104f3e..5fe989bd3c4b 100644
--- a/mm/msync.c
+++ b/mm/msync.c
@@ -37,7 +37,7 @@ SYSCALL_DEFINE3(msync, unsigned long, start, size_t, len, int, flags)
int unmapped_error = 0;
int error = -EINVAL;
- start = untagged_addr(start);
+ start = untagged_addr(mm, start);
if (flags & ~(MS_ASYNC | MS_INVALIDATE | MS_SYNC))
goto out;
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 515dfe9d3bcf..d2239aa85cf5 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1943,7 +1943,7 @@ int __kvm_set_memory_region(struct kvm *kvm,
return -EINVAL;
/* We can read the guest memory with __xxx_user() later on. */
if ((mem->userspace_addr & (PAGE_SIZE - 1)) ||
- (mem->userspace_addr != untagged_addr(mem->userspace_addr)) ||
+ (mem->userspace_addr != untagged_addr(kvm->mm, mem->userspace_addr)) ||
!access_ok((void __user *)(unsigned long)mem->userspace_addr,
mem->memory_size))
return -EINVAL;
--
2.35.1
Enumerate Linear Address Masking and provide defines for CR3 and CR4
flags.
Signed-off-by: Kirill A. Shutemov <[email protected]>
Reviewed-by: Alexander Potapenko <[email protected]>
Tested-by: Alexander Potapenko <[email protected]>
Acked-by: Peter Zijlstra (Intel) <[email protected]>
---
arch/x86/include/asm/cpufeatures.h | 1 +
arch/x86/include/asm/processor-flags.h | 2 ++
arch/x86/include/uapi/asm/processor-flags.h | 6 ++++++
3 files changed, 9 insertions(+)
diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index 235dc85c91c3..73c0cf5bd8a1 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -308,6 +308,7 @@
/* Intel-defined CPU features, CPUID level 0x00000007:1 (EAX), word 12 */
#define X86_FEATURE_AVX_VNNI (12*32+ 4) /* AVX VNNI instructions */
#define X86_FEATURE_AVX512_BF16 (12*32+ 5) /* AVX512 BFLOAT16 instructions */
+#define X86_FEATURE_LAM (12*32+26) /* Linear Address Masking */
/* AMD-defined CPU features, CPUID level 0x80000008 (EBX), word 13 */
#define X86_FEATURE_CLZERO (13*32+ 0) /* CLZERO instruction */
diff --git a/arch/x86/include/asm/processor-flags.h b/arch/x86/include/asm/processor-flags.h
index a7f3d9100adb..d8cccadc83a6 100644
--- a/arch/x86/include/asm/processor-flags.h
+++ b/arch/x86/include/asm/processor-flags.h
@@ -28,6 +28,8 @@
* On systems with SME, one bit (in a variable position!) is stolen to indicate
* that the top-level paging structure is encrypted.
*
+ * On systems with LAM, bits 61 and 62 are used to indicate LAM mode.
+ *
* All of the remaining bits indicate the physical address of the top-level
* paging structure.
*
diff --git a/arch/x86/include/uapi/asm/processor-flags.h b/arch/x86/include/uapi/asm/processor-flags.h
index c47cc7f2feeb..d898432947ff 100644
--- a/arch/x86/include/uapi/asm/processor-flags.h
+++ b/arch/x86/include/uapi/asm/processor-flags.h
@@ -82,6 +82,10 @@
#define X86_CR3_PCID_BITS 12
#define X86_CR3_PCID_MASK (_AC((1UL << X86_CR3_PCID_BITS) - 1, UL))
+#define X86_CR3_LAM_U57_BIT 61 /* Activate LAM for userspace, 62:57 bits masked */
+#define X86_CR3_LAM_U57 _BITULL(X86_CR3_LAM_U57_BIT)
+#define X86_CR3_LAM_U48_BIT 62 /* Activate LAM for userspace, 62:48 bits masked */
+#define X86_CR3_LAM_U48 _BITULL(X86_CR3_LAM_U48_BIT)
#define X86_CR3_PCID_NOFLUSH_BIT 63 /* Preserve old PCID */
#define X86_CR3_PCID_NOFLUSH _BITULL(X86_CR3_PCID_NOFLUSH_BIT)
@@ -132,6 +136,8 @@
#define X86_CR4_PKE _BITUL(X86_CR4_PKE_BIT)
#define X86_CR4_CET_BIT 23 /* enable Control-flow Enforcement Technology */
#define X86_CR4_CET _BITUL(X86_CR4_CET_BIT)
+#define X86_CR4_LAM_SUP_BIT 28 /* LAM for supervisor pointers */
+#define X86_CR4_LAM_SUP _BITUL(X86_CR4_LAM_SUP_BIT)
/*
* x86-64 Task Priority Register, CR8
--
2.35.1
From: Weihong Zhang <[email protected]>
LAM is enabled per-thread and gets inherited on fork(2)/clone(2). exec()
reverts LAM status to the default disabled state.
There are two test scenarios:
- Fork test cases:
These cases test the inheritance of LAM: a child process created by
fork() should inherit the LAM feature from its parent and report the
same LAM mode as the parent.
- Execve test cases:
Processes created by execve() differ from processes created by fork():
they revert LAM status to the disabled state.
Signed-off-by: Weihong Zhang <[email protected]>
Signed-off-by: Kirill A. Shutemov <[email protected]>
Acked-by: Peter Zijlstra (Intel) <[email protected]>
---
tools/testing/selftests/x86/lam.c | 125 +++++++++++++++++++++++++++++-
1 file changed, 121 insertions(+), 4 deletions(-)
diff --git a/tools/testing/selftests/x86/lam.c b/tools/testing/selftests/x86/lam.c
index 25e6d0fc4a83..0d76f883efdd 100644
--- a/tools/testing/selftests/x86/lam.c
+++ b/tools/testing/selftests/x86/lam.c
@@ -37,8 +37,9 @@
#define FUNC_MMAP 0x4
#define FUNC_SYSCALL 0x8
#define FUNC_URING 0x10
+#define FUNC_INHERITE 0x20
-#define TEST_MASK 0x1f
+#define TEST_MASK 0x3f
#define LOW_ADDR (0x1UL << 30)
#define HIGH_ADDR (0x3UL << 48)
@@ -174,6 +175,28 @@ static unsigned long get_default_tag_bits(void)
return lam;
}
+/*
+ * Set tagged address and read back untag mask.
+ * check if the untag mask is expected.
+ */
+static int get_lam(void)
+{
+ uint64_t ptr = 0;
+ int ret = -1;
+ /* Get untagged mask */
+ if (syscall(SYS_arch_prctl, ARCH_GET_UNTAG_MASK, &ptr) == -1)
+ return -1;
+
+ /* Check mask returned is expected */
+ if (ptr == ~(LAM_U57_MASK))
+ ret = LAM_U57_BITS;
+ else if (ptr == -1ULL)
+ ret = LAM_NONE;
+
+
+ return ret;
+}
+
/* According to LAM mode, set metadata in high bits */
static uint64_t set_metadata(uint64_t src, unsigned long lam)
{
@@ -581,7 +604,7 @@ int do_uring(unsigned long lam)
switch (lam) {
case LAM_U57_BITS: /* Clear bits 62:57 */
- addr = (addr & ~(0x3fULL << 57));
+ addr = (addr & ~(LAM_U57_MASK));
break;
}
free((void *)addr);
@@ -632,6 +655,72 @@ static int fork_test(struct testcases *test)
return ret;
}
+static int handle_execve(struct testcases *test)
+{
+ int ret, child_ret;
+ int lam = test->lam;
+ pid_t pid;
+
+ pid = fork();
+ if (pid < 0) {
+ perror("Fork failed.");
+ ret = 1;
+ } else if (pid == 0) {
+ char path[PATH_MAX];
+
+ /* Set LAM mode in the child before exec */
+ if (set_lam(lam) != 0)
+ return 1;
+
+ /* Get the current binary's path; it will be re-run via execve() */
+ if (readlink("/proc/self/exe", path, PATH_MAX) <= 0)
+ exit(-1);
+
+ /* Re-run the binary; it reports its LAM mode back to the parent */
+ if (execlp(path, path, "-t 0x0", NULL) < 0) {
+ perror("error on exec");
+ exit(-1);
+ }
+ } else {
+ wait(&child_ret);
+ ret = WEXITSTATUS(child_ret);
+ if (ret != LAM_NONE)
+ return 1;
+ }
+
+ return 0;
+}
+
+static int handle_inheritance(struct testcases *test)
+{
+ int ret, child_ret;
+ int lam = test->lam;
+ pid_t pid;
+
+ /* Set LAM mode in parent process */
+ if (set_lam(lam) != 0)
+ return 1;
+
+ pid = fork();
+ if (pid < 0) {
+ perror("Fork failed.");
+ return 1;
+ } else if (pid == 0) {
+ /* Read LAM mode in the child process */
+ int child_lam = get_lam();
+
+ exit(child_lam);
+ } else {
+ wait(&child_ret);
+ ret = WEXITSTATUS(child_ret);
+
+ if (lam != ret)
+ return 1;
+ }
+
+ return 0;
+}
+
static void run_test(struct testcases *test, int count)
{
int i, ret = 0;
@@ -733,11 +822,26 @@ static struct testcases mmap_cases[] = {
},
};
+static struct testcases inheritance_cases[] = {
+ {
+ .expected = 0,
+ .lam = LAM_U57_BITS,
+ .test_func = handle_inheritance,
+ .msg = "FORK: LAM_U57, child process should get LAM mode same as parent\n",
+ },
+ {
+ .expected = 0,
+ .lam = LAM_U57_BITS,
+ .test_func = handle_execve,
+ .msg = "EXECVE: LAM_U57, child process should get disabled LAM mode\n",
+ },
+};
+
static void cmd_help(void)
{
printf("usage: lam [-h] [-t test list]\n");
printf("\t-t test list: run tests specified in the test list, default:0x%x\n", TEST_MASK);
- printf("\t\t0x1:malloc; 0x2:max_bits; 0x4:mmap; 0x8:syscall; 0x10:io_uring.\n");
+ printf("\t\t0x1:malloc; 0x2:max_bits; 0x4:mmap; 0x8:syscall; 0x10:io_uring; 0x20:inherit;\n");
printf("\t-h: help\n");
}
@@ -757,7 +861,7 @@ int main(int argc, char **argv)
switch (c) {
case 't':
tests = strtoul(optarg, NULL, 16);
- if (!(tests & TEST_MASK)) {
+ if (tests && !(tests & TEST_MASK)) {
ksft_print_msg("Invalid argument!\n");
return -1;
}
@@ -771,6 +875,16 @@ int main(int argc, char **argv)
}
}
+ /*
+ * When tests is 0, it is not a real test case; it is used by the
+ * execve test case to check the LAM mode in the process created by
+ * execve(): the new process reads back its LAM mode and reports it
+ * to the parent via the exit status.
+ */
+ if (!tests)
+ return (get_lam());
+
+ /* Run test cases */
if (tests & FUNC_MALLOC)
run_test(malloc_cases, ARRAY_SIZE(malloc_cases));
@@ -786,6 +900,9 @@ int main(int argc, char **argv)
if (tests & FUNC_URING)
run_test(uring_cases, ARRAY_SIZE(uring_cases));
+ if (tests & FUNC_INHERITE)
+ run_test(inheritance_cases, ARRAY_SIZE(inheritance_cases));
+
ksft_set_plan(tests_cnt);
return ksft_exit_pass();
--
2.35.1
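For illustration only (not part of the posted patch), a minimal userspace
sketch of the behaviour these inherit/execve cases exercise, assuming the
arch_prctl() values proposed earlier in the series (ARCH_ENABLE_TAGGED_ADDR
= 0x4002, ARCH_GET_UNTAG_MASK = 0x4001) and LAM_U57's 6 tag bits:

#include <stdio.h>
#include <sys/syscall.h>
#include <sys/wait.h>
#include <unistd.h>

/* arch_prctl() values proposed by this series. */
#define ARCH_GET_UNTAG_MASK	0x4001
#define ARCH_ENABLE_TAGGED_ADDR	0x4002

int main(void)
{
	unsigned long mask = -1UL;

	/* Ask for LAM_U57: 6 tag bits. */
	if (syscall(SYS_arch_prctl, ARCH_ENABLE_TAGGED_ADDR, 6))
		return 1;

	if (fork() == 0) {
		/* The child inherits the parent's untag mask... */
		syscall(SYS_arch_prctl, ARCH_GET_UNTAG_MASK, &mask);
		printf("child untag mask: %lx\n", mask);
		/* ...while an execve()'d image starts with LAM disabled again. */
		execlp("true", "true", NULL);
		return 1;
	}
	wait(NULL);
	return 0;
}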
Hi Kirill,
On Tue, Aug 30, 2022 at 04:00:53AM +0300, Kirill A. Shutemov wrote:
> Linear Address Masking[1] (LAM) modifies the checking that is applied to
> 64-bit linear addresses, allowing software to use of the untranslated
> address bits for metadata.
We discussed this internally, but didn't bubble up here.
Given that we are working on enabling Shared Virtual Addressing (SVA)
within the IOMMU, which permits the user to share VAs directly with the
device, the device can participate even in fixing page-faults and such.
The IOMMU enforces canonical addressing; since we are hijacking the top
order bits for metadata, the address will fail the sanity check and we
would return a failure back to the device on any page-fault from the
device.
It also complicates how device TLB and ATS work, and needs some major
improvements to detect device capability to accept tagged pointers and
adjust the devtlb to act accordingly.
Both are orthogonal features, but there is an intersection of the two
that is fundamentally incompatible.
It's even more important since an application might be using SVA under
the cover of some library, without the application's knowledge.
The path would be:
1. Ensure both LAM and SVM are incompatible by design, without major
changes.
- If LAM is enabled already and SVM enabling is requested later by the
user, that should fail, and vice versa.
- Provide an API for the user to ask for an opt-out. Then they know
they must sanitize the pointers before sending them to the device, or
that the working set is already isolated and needs no work.
2. I suppose for any syscalls that take tagged pointers you would maybe
relax the checks for how many bits to ignore for canonicality. This is
required so the user doesn't need to do the same sanitization before
every syscall.
If you have it fail, the library might choose a less optimal path if
one is available.
Cheers,
Ashok
On Thu, Sep 01, 2022 at 05:45:08PM +0000, Ashok Raj wrote:
> Hi Kirill,
>
> On Tue, Aug 30, 2022 at 04:00:53AM +0300, Kirill A. Shutemov wrote:
> > Linear Address Masking[1] (LAM) modifies the checking that is applied to
> > 64-bit linear addresses, allowing software to use of the untranslated
> > address bits for metadata.
>
> We discussed this internally, but didn't bubble up here.
>
> Given that we are working on enabling Shared Virtual Addressing (SVA)
> within the IOMMU. This permits user to share VA directly with the device,
> and the device can participate even in fixing page-faults and such.
>
> IOMMU enforces canonical addressing, since we are hijacking the top order
> bits for meta-data, it will fail sanity check and we would return a failure
> back to device on any page-faults from device.
>
> It also complicates how device TLB and ATS work, and needs some major
> improvements to detect device capability to accept tagged pointers, adjust
> the devtlb to act accordingly.
>
>
> Both are orthogonal features, but there is an intersection of both
> that are fundamentally incompatible.
>
> Its even more important, since an application might be using SVA under the
> cover provided by some library that's used without their knowledge.
>
> The path would be:
>
> 1. Ensure both LAM and SVM are incompatible by design, without major
> changes.
> - If LAM is enabled already and later SVM enabling is requested by
> user, that should fail. and Vice versa.
> - Provide an API to user to ask for opt-out. Now they know they
> must sanitize the pointers before sending to device, or the
> working set is already isolated and needs no work.
The patch below implements something like this. It is a PoC, build-tested only.
To be honest, I hate it. It is clearly a layering violation. It feels
dirty. But I don't see any better way as we tie orthogonal features
together.
Also I have no idea how to do forced PASID allocation if LAM is enabled.
What should the API look like?
Any comments?
> 2. I suppose for any syscalls that take tagged pointers you would maybe
> relax checks for how many bits to ignore for canonicallity. This is
> required so user don't need to do the same for everything sanitization
> before every syscall.
I don't quite follow this. For syscalls that allow tagged pointers, we do
untagged_addr() now. Not sure what else is needed.
> If you have it fail, the library might choose a less optimal path if one is
> available.
>
> Cheers,
> Ashok
diff --git a/arch/x86/include/uapi/asm/prctl.h b/arch/x86/include/uapi/asm/prctl.h
index a31e27b95b19..e5c04ced36c9 100644
--- a/arch/x86/include/uapi/asm/prctl.h
+++ b/arch/x86/include/uapi/asm/prctl.h
@@ -23,5 +23,6 @@
#define ARCH_GET_UNTAG_MASK 0x4001
#define ARCH_ENABLE_TAGGED_ADDR 0x4002
#define ARCH_GET_MAX_TAG_BITS 0x4003
+#define ARCH_ENABLE_TAGGED_ADDR_FORCED 0x4004
#endif /* _ASM_X86_PRCTL_H */
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index 337f80a0862f..7d89a2fd1a55 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -774,7 +774,8 @@ static bool lam_u48_allowed(void)
#define LAM_U48_BITS 15
#define LAM_U57_BITS 6
-static int prctl_enable_tagged_addr(struct mm_struct *mm, unsigned long nr_bits)
+static int prctl_enable_tagged_addr(struct mm_struct *mm, unsigned long nr_bits,
+ bool forced)
{
int ret = 0;
@@ -793,6 +794,11 @@ static int prctl_enable_tagged_addr(struct mm_struct *mm, unsigned long nr_bits)
goto out;
}
+ if (pasid_valid(mm->pasid) && !forced) {
+ ret = -EBUSY;
+ goto out;
+ }
+
if (!nr_bits) {
ret = -EINVAL;
goto out;
@@ -910,7 +916,9 @@ long do_arch_prctl_64(struct task_struct *task, int option, unsigned long arg2)
return put_user(task->mm->context.untag_mask,
(unsigned long __user *)arg2);
case ARCH_ENABLE_TAGGED_ADDR:
- return prctl_enable_tagged_addr(task->mm, arg2);
+ return prctl_enable_tagged_addr(task->mm, arg2, false);
+ case ARCH_ENABLE_TAGGED_ADDR_FORCED:
+ return prctl_enable_tagged_addr(task->mm, arg2, true);
case ARCH_GET_MAX_TAG_BITS: {
int nr_bits;
diff --git a/drivers/iommu/iommu-sva-lib.c b/drivers/iommu/iommu-sva-lib.c
index 106506143896..a6ec17de1937 100644
--- a/drivers/iommu/iommu-sva-lib.c
+++ b/drivers/iommu/iommu-sva-lib.c
@@ -4,6 +4,7 @@
*/
#include <linux/mutex.h>
#include <linux/sched/mm.h>
+#include <asm/mmu_context.h>
#include "iommu-sva-lib.h"
@@ -32,6 +33,15 @@ int iommu_sva_alloc_pasid(struct mm_struct *mm, ioasid_t min, ioasid_t max)
return -EINVAL;
mutex_lock(&iommu_sva_lock);
+
+ /* Serialize against LAM enabling */
+ mutex_lock(&mm->context.lock);
+
+ if (mm_lam_cr3_mask(mm)) {
+ ret = -EBUSY;
+ goto out;
+ }
+
/* Is a PASID already associated with this mm? */
if (pasid_valid(mm->pasid)) {
if (mm->pasid < min || mm->pasid >= max)
@@ -45,6 +55,7 @@ int iommu_sva_alloc_pasid(struct mm_struct *mm, ioasid_t min, ioasid_t max)
else
mm_pasid_set(mm, pasid);
out:
+ mutex_unlock(&mm->context.lock);
mutex_unlock(&iommu_sva_lock);
return ret;
}
--
Kiryl Shutsemau / Kirill A. Shutemov
On Tue, Aug 30, 2022 at 04:00:53AM +0300, Kirill A. Shutemov wrote:
> Linear Address Masking[1] (LAM) modifies the checking that is applied to
> 64-bit linear addresses, allowing software to use of the untranslated
> address bits for metadata.
>
> The patchset brings support for LAM for userspace addresses. Only LAM_U57 at
> this time.
>
> Please review and consider applying.
>
> git://git.kernel.org/pub/scm/linux/kernel/git/kas/linux.git lam
+Bharata, Ananth.
Do you folks have any feedback on the patchset?
Looks like the AMD version of the tagged pointers feature is not getting
traction as of now, but I want to be sure that the interface introduced
here is suitable for your future plans.
Do you see anything in the interface that would prevent it from being
extended to the AMD feature?
--
Kiryl Shutsemau / Kirill A. Shutemov
Hi Kirill,
On 9/4/2022 6:30 AM, Kirill A. Shutemov wrote:
> On Tue, Aug 30, 2022 at 04:00:53AM +0300, Kirill A. Shutemov wrote:
>> Linear Address Masking[1] (LAM) modifies the checking that is applied to
>> 64-bit linear addresses, allowing software to use of the untranslated
>> address bits for metadata.
>>
>> The patchset brings support for LAM for userspace addresses. Only LAM_U57 at
>> this time.
>>
>> Please review and consider applying.
>>
>> git://git.kernel.org/pub/scm/linux/kernel/git/kas/linux.git lam
>
> +Bharata, Ananth.
>
> Do you folks have any feedback on the patchset?
>
> Looks like AMD version of the tagged pointers feature does not get
> traction as of now, but I want to be sure that the interface introduced
> here can be suitable for your future plans.
>
> Do you see anything in the interface that can prevent it to be extended to
> the AMD feature?
The arch_prctl() extension is generic enough that it should be good.
The untagged_addr() macro looks like this from one of the callers:
start = untagged_addr(mm, start);
ffffffff814d39bb: 48 8b 8d 40 ff ff ff mov -0xc0(%rbp),%rcx
ffffffff814d39c2: 48 89 f2 mov %rsi,%rdx
ffffffff814d39c5: 48 c1 fa 3f sar $0x3f,%rdx
ffffffff814d39c9: 48 0b 91 50 03 00 00 or 0x350(%rcx),%rdx
ffffffff814d39d0: 48 21 f2 and %rsi,%rdx
ffffffff814d39d3: 49 89 d6 mov %rdx,%r14
Can this overhead of a few additional instructions be removed for
platforms that don't have the LAM feature? I haven't measured how much
overhead this effectively contributes in real workloads, but I wonder
if it is worth optimizing for non-LAM platforms.
Regards,
Bharata.
On Mon, Sep 05, 2022 at 10:35:44AM +0530, Bharata B Rao wrote:
> Hi Kirill,
>
> On 9/4/2022 6:30 AM, Kirill A. Shutemov wrote:
> > On Tue, Aug 30, 2022 at 04:00:53AM +0300, Kirill A. Shutemov wrote:
> >> Linear Address Masking[1] (LAM) modifies the checking that is applied to
> >> 64-bit linear addresses, allowing software to use of the untranslated
> >> address bits for metadata.
> >>
> >> The patchset brings support for LAM for userspace addresses. Only LAM_U57 at
> >> this time.
> >>
> >> Please review and consider applying.
> >>
> >> git://git.kernel.org/pub/scm/linux/kernel/git/kas/linux.git lam
> >
> > +Bharata, Ananth.
> >
> > Do you folks have any feedback on the patchset?
> >
> > Looks like AMD version of the tagged pointers feature does not get
> > traction as of now, but I want to be sure that the interface introduced
> > here can be suitable for your future plans.
> >
> > Do you see anything in the interface that can prevent it to be extended to
> > the AMD feature?
>
> The arch_prctl() extensions is generic enough that it should be good.
>
> The untagged_addr() macro looks like this from one of the callers:
>
> start = untagged_addr(mm, start);
> ffffffff814d39bb: 48 8b 8d 40 ff ff ff mov -0xc0(%rbp),%rcx
> ffffffff814d39c2: 48 89 f2 mov %rsi,%rdx
> ffffffff814d39c5: 48 c1 fa 3f sar $0x3f,%rdx
> ffffffff814d39c9: 48 0b 91 50 03 00 00 or 0x350(%rcx),%rdx
> ffffffff814d39d0: 48 21 f2 and %rsi,%rdx
> ffffffff814d39d3: 49 89 d6 mov %rdx,%r14
>
> Can this overhead of a few additional instructions be removed for
> platforms that don't have LAM feature? I haven't measured how much
> overhead this effectively contributes to in real, but wonder if it is
> worth optimizing for non-LAM platforms.
I'm not sure what the optimization should look like. I guess we can stick
static_cpu_has() there, but I'm not convinced that adding jumps there will
be beneficial.
--
Kiryl Shutsemau / Kirill A. Shutemov
On Mon, Sep 05, 2022 at 04:44:57PM +0300, Kirill A. Shutemov wrote:
> On Mon, Sep 05, 2022 at 10:35:44AM +0530, Bharata B Rao wrote:
> > Hi Kirill,
> >
> > On 9/4/2022 6:30 AM, Kirill A. Shutemov wrote:
> > > On Tue, Aug 30, 2022 at 04:00:53AM +0300, Kirill A. Shutemov wrote:
> > >> Linear Address Masking[1] (LAM) modifies the checking that is applied to
> > >> 64-bit linear addresses, allowing software to use of the untranslated
> > >> address bits for metadata.
> > >>
> > >> The patchset brings support for LAM for userspace addresses. Only LAM_U57 at
> > >> this time.
> > >>
> > >> Please review and consider applying.
> > >>
> > >> git://git.kernel.org/pub/scm/linux/kernel/git/kas/linux.git lam
> > >
> > > +Bharata, Ananth.
> > >
> > > Do you folks have any feedback on the patchset?
> > >
> > > Looks like AMD version of the tagged pointers feature does not get
> > > traction as of now, but I want to be sure that the interface introduced
> > > here can be suitable for your future plans.
> > >
> > > Do you see anything in the interface that can prevent it to be extended to
> > > the AMD feature?
> >
> > The arch_prctl() extensions is generic enough that it should be good.
> >
> > The untagged_addr() macro looks like this from one of the callers:
> >
> > start = untagged_addr(mm, start);
> > ffffffff814d39bb: 48 8b 8d 40 ff ff ff mov -0xc0(%rbp),%rcx
> > ffffffff814d39c2: 48 89 f2 mov %rsi,%rdx
> > ffffffff814d39c5: 48 c1 fa 3f sar $0x3f,%rdx
> > ffffffff814d39c9: 48 0b 91 50 03 00 00 or 0x350(%rcx),%rdx
> > ffffffff814d39d0: 48 21 f2 and %rsi,%rdx
> > ffffffff814d39d3: 49 89 d6 mov %rdx,%r14
> >
> > Can this overhead of a few additional instructions be removed for
> > platforms that don't have LAM feature? I haven't measured how much
> > overhead this effectively contributes to in real, but wonder if it is
> > worth optimizing for non-LAM platforms.
>
> I'm not sure how the optimization should look like. I guess we can stick
> static_cpu_has() there, but I'm not convinced that adding jumps there will
> be beneficial.
I suppose the critical bit is the memory load. That can stall and then
you're sad. A jump_label is easy enough to add.
On Mon, Sep 05, 2022 at 04:30:25PM +0200, Peter Zijlstra wrote:
> On Mon, Sep 05, 2022 at 04:44:57PM +0300, Kirill A. Shutemov wrote:
> > On Mon, Sep 05, 2022 at 10:35:44AM +0530, Bharata B Rao wrote:
> > > Hi Kirill,
> > >
> > > On 9/4/2022 6:30 AM, Kirill A. Shutemov wrote:
> > > > On Tue, Aug 30, 2022 at 04:00:53AM +0300, Kirill A. Shutemov wrote:
> > > >> Linear Address Masking[1] (LAM) modifies the checking that is applied to
> > > >> 64-bit linear addresses, allowing software to use of the untranslated
> > > >> address bits for metadata.
> > > >>
> > > >> The patchset brings support for LAM for userspace addresses. Only LAM_U57 at
> > > >> this time.
> > > >>
> > > >> Please review and consider applying.
> > > >>
> > > >> git://git.kernel.org/pub/scm/linux/kernel/git/kas/linux.git lam
> > > >
> > > > +Bharata, Ananth.
> > > >
> > > > Do you folks have any feedback on the patchset?
> > > >
> > > > Looks like AMD version of the tagged pointers feature does not get
> > > > traction as of now, but I want to be sure that the interface introduced
> > > > here can be suitable for your future plans.
> > > >
> > > > Do you see anything in the interface that can prevent it to be extended to
> > > > the AMD feature?
> > >
> > > The arch_prctl() extensions is generic enough that it should be good.
> > >
> > > The untagged_addr() macro looks like this from one of the callers:
> > >
> > > start = untagged_addr(mm, start);
> > > ffffffff814d39bb: 48 8b 8d 40 ff ff ff mov -0xc0(%rbp),%rcx
> > > ffffffff814d39c2: 48 89 f2 mov %rsi,%rdx
> > > ffffffff814d39c5: 48 c1 fa 3f sar $0x3f,%rdx
> > > ffffffff814d39c9: 48 0b 91 50 03 00 00 or 0x350(%rcx),%rdx
> > > ffffffff814d39d0: 48 21 f2 and %rsi,%rdx
> > > ffffffff814d39d3: 49 89 d6 mov %rdx,%r14
> > >
> > > Can this overhead of a few additional instructions be removed for
> > > platforms that don't have LAM feature? I haven't measured how much
> > > overhead this effectively contributes to in real, but wonder if it is
> > > worth optimizing for non-LAM platforms.
> >
> > I'm not sure how the optimization should look like. I guess we can stick
> > static_cpu_has() there, but I'm not convinced that adding jumps there will
> > be beneficial.
>
> I suppose the critical bit is the memory load. That can stall and then
> you're sad. A jump_label is easy enough to add.
What about something like this?
diff --git a/arch/x86/include/asm/uaccess.h b/arch/x86/include/asm/uaccess.h
index 803241dfc473..868d2730884b 100644
--- a/arch/x86/include/asm/uaccess.h
+++ b/arch/x86/include/asm/uaccess.h
@@ -30,8 +30,10 @@ static inline bool pagefault_disabled(void);
*/
#define untagged_addr(mm, addr) ({ \
u64 __addr = (__force u64)(addr); \
- s64 sign = (s64)__addr >> 63; \
- __addr &= (mm)->context.untag_mask | sign; \
+ if (static_cpu_has(X86_FEATURE_LAM)) { \
+ s64 sign = (s64)__addr >> 63; \
+ __addr &= (mm)->context.untag_mask | sign; \
+ } \
(__force __typeof__(addr))__addr; \
})
--
Kiryl Shutsemau / Kirill A. Shutemov
On Mon, Sep 05, 2022 at 06:35:17PM +0300, Kirill A. Shutemov wrote:
> What about something like this?
>
> diff --git a/arch/x86/include/asm/uaccess.h b/arch/x86/include/asm/uaccess.h
> index 803241dfc473..868d2730884b 100644
> --- a/arch/x86/include/asm/uaccess.h
> +++ b/arch/x86/include/asm/uaccess.h
> @@ -30,8 +30,10 @@ static inline bool pagefault_disabled(void);
> */
> #define untagged_addr(mm, addr) ({ \
> u64 __addr = (__force u64)(addr); \
> - s64 sign = (s64)__addr >> 63; \
> - __addr &= (mm)->context.untag_mask | sign; \
> + if (static_cpu_has(X86_FEATURE_LAM)) { \
> + s64 sign = (s64)__addr >> 63; \
> + __addr &= (mm)->context.untag_mask | sign; \
> + } \
> (__force __typeof__(addr))__addr; \
> })
Well, if you go throught the trouble of adding it, might as well use a
regular static branch and only enable it once there's an actual user,
no?
On Mon, Sep 05, 2022 at 05:46:49PM +0200, Peter Zijlstra wrote:
> On Mon, Sep 05, 2022 at 06:35:17PM +0300, Kirill A. Shutemov wrote:
> > What about something like this?
> >
> > diff --git a/arch/x86/include/asm/uaccess.h b/arch/x86/include/asm/uaccess.h
> > index 803241dfc473..868d2730884b 100644
> > --- a/arch/x86/include/asm/uaccess.h
> > +++ b/arch/x86/include/asm/uaccess.h
> > @@ -30,8 +30,10 @@ static inline bool pagefault_disabled(void);
> > */
> > #define untagged_addr(mm, addr) ({ \
> > u64 __addr = (__force u64)(addr); \
> > - s64 sign = (s64)__addr >> 63; \
> > - __addr &= (mm)->context.untag_mask | sign; \
> > + if (static_cpu_has(X86_FEATURE_LAM)) { \
> > + s64 sign = (s64)__addr >> 63; \
> > + __addr &= (mm)->context.untag_mask | sign; \
> > + } \
> > (__force __typeof__(addr))__addr; \
> > })
>
> Well, if you go throught the trouble of adding it, might as well use a
> regular static branch and only enable it once there's an actual user,
> no?
Fair enough. How about this?
diff --git a/arch/x86/include/asm/uaccess.h b/arch/x86/include/asm/uaccess.h
index 803241dfc473..1a03c65a9c0f 100644
--- a/arch/x86/include/asm/uaccess.h
+++ b/arch/x86/include/asm/uaccess.h
@@ -22,6 +22,8 @@ static inline bool pagefault_disabled(void);
#endif
#ifdef CONFIG_X86_64
+DECLARE_STATIC_KEY_FALSE(tagged_addr_key);
+
/*
* Mask out tag bits from the address.
*
@@ -30,8 +32,10 @@ static inline bool pagefault_disabled(void);
*/
#define untagged_addr(mm, addr) ({ \
u64 __addr = (__force u64)(addr); \
- s64 sign = (s64)__addr >> 63; \
- __addr &= (mm)->context.untag_mask | sign; \
+ if (static_branch_unlikely(&tagged_addr_key)) { \
+ s64 sign = (s64)__addr >> 63; \
+ __addr &= (mm)->context.untag_mask | sign; \
+ } \
(__force __typeof__(addr))__addr; \
})
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index 337f80a0862f..63194bf43c9a 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -742,6 +742,9 @@ static long prctl_map_vdso(const struct vdso_image *image, unsigned long addr)
}
#endif
+DEFINE_STATIC_KEY_FALSE(tagged_addr_key);
+EXPORT_SYMBOL_GPL(tagged_addr_key);
+
static void enable_lam_func(void *mm)
{
struct mm_struct *loaded_mm = this_cpu_read(cpu_tlbstate.loaded_mm);
@@ -813,6 +816,7 @@ static int prctl_enable_tagged_addr(struct mm_struct *mm, unsigned long nr_bits)
}
on_each_cpu_mask(mm_cpumask(mm), enable_lam_func, mm, true);
+ static_branch_enable(&tagged_addr_key);
out:
mutex_unlock(&mm->context.lock);
mmap_write_unlock(mm);
--
Kiryl Shutsemau / Kirill A. Shutemov
On Mon, Sep 05, 2022 at 07:47:08PM +0300, Kirill A. Shutemov wrote:
> Fair enough. How about this?
>
> diff --git a/arch/x86/include/asm/uaccess.h b/arch/x86/include/asm/uaccess.h
> index 803241dfc473..1a03c65a9c0f 100644
> --- a/arch/x86/include/asm/uaccess.h
> +++ b/arch/x86/include/asm/uaccess.h
> @@ -22,6 +22,8 @@ static inline bool pagefault_disabled(void);
> #endif
>
> #ifdef CONFIG_X86_64
> +DECLARE_STATIC_KEY_FALSE(tagged_addr_key);
> +
> /*
> * Mask out tag bits from the address.
> *
> @@ -30,8 +32,10 @@ static inline bool pagefault_disabled(void);
> */
> #define untagged_addr(mm, addr) ({ \
> u64 __addr = (__force u64)(addr); \
> - s64 sign = (s64)__addr >> 63; \
> - __addr &= (mm)->context.untag_mask | sign; \
> + if (static_branch_unlikely(&tagged_addr_key)) { \
> + s64 sign = (s64)__addr >> 63; \
> + __addr &= (mm)->context.untag_mask | sign; \
> + } \
> (__force __typeof__(addr))__addr; \
> })
>
> diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
> index 337f80a0862f..63194bf43c9a 100644
> --- a/arch/x86/kernel/process_64.c
> +++ b/arch/x86/kernel/process_64.c
> @@ -742,6 +742,9 @@ static long prctl_map_vdso(const struct vdso_image *image, unsigned long addr)
> }
> #endif
>
> +DEFINE_STATIC_KEY_FALSE(tagged_addr_key);
So here you use the false-unlikely scenario, which seems suboptimal in
this case; I was thinking the false-likely case would generate better
code (see the comment in linux/jump_label.h).
> +EXPORT_SYMBOL_GPL(tagged_addr_key);
> +
> static void enable_lam_func(void *mm)
> {
> struct mm_struct *loaded_mm = this_cpu_read(cpu_tlbstate.loaded_mm);
> @@ -813,6 +816,7 @@ static int prctl_enable_tagged_addr(struct mm_struct *mm, unsigned long nr_bits)
> }
>
> on_each_cpu_mask(mm_cpumask(mm), enable_lam_func, mm, true);
> + static_branch_enable(&tagged_addr_key);
> out:
> mutex_unlock(&mm->context.lock);
> mmap_write_unlock(mm);
Aside from the one nit above, this looks about right.
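For reference, a sketch (not code posted in the thread) of the false-likely
variant being suggested: keep DEFINE_STATIC_KEY_FALSE() for tagged_addr_key,
but test it with static_branch_likely(), so that once LAM is enabled the
masking runs inline behind a NOP instead of jumping to an out-of-line block:

#define untagged_addr(mm, addr)	({					\
	u64 __addr = (__force u64)(addr);				\
	/* Disabled: one patched jump over the masking code.		\
	 * Enabled: NOP followed by the inline masking below. */	\
	if (static_branch_likely(&tagged_addr_key)) {			\
		s64 sign = (s64)__addr >> 63;				\
		__addr &= (mm)->context.untag_mask | sign;		\
	}								\
	(__force __typeof__(addr))__addr;				\
})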
On Tue, 2022-08-30 at 04:01 +0300, Kirill A. Shutemov wrote:
> From: Weihong Zhang <[email protected]>
>
> Add mmap and SYSCALL test cases.
>
> SYSCALL test cases:
>
> - LAM supports set metadata in high bits 62:57 (LAM_U57) of a user
> pointer, pass
> the pointer to SYSCALL, SYSCALL can dereference the pointer and
> return correct
> result.
>
> - Disable LAM, pass a pointer with metadata in high bits to SYSCALL,
> SYSCALL returns -1 (EFAULT).
>
> MMAP test cases:
>
> - Enable LAM_U57, MMAP with low address (below bits 47), set
> metadata
> in high bits of the address, dereference the address should be
> allowed.
>
> - Enable LAM_U57, MMAP with high address (above bits 47), set
> metadata
> in high bits of the address, dereference the address should be
> allowed.
>
> Signed-off-by: Weihong Zhang <[email protected]>
> Signed-off-by: Kirill A. Shutemov <[email protected]>
> Acked-by: Peter Zijlstra (Intel) <[email protected]>
> ---
> tools/testing/selftests/x86/lam.c | 135
> +++++++++++++++++++++++++++++-
> 1 file changed, 132 insertions(+), 3 deletions(-)
>
> diff --git a/tools/testing/selftests/x86/lam.c
> b/tools/testing/selftests/x86/lam.c
> index 900a3a0fb709..b88e007ee0a3 100644
> --- a/tools/testing/selftests/x86/lam.c
> +++ b/tools/testing/selftests/x86/lam.c
> @@ -7,6 +7,7 @@
> #include <signal.h>
> #include <setjmp.h>
> #include <sys/mman.h>
> +#include <sys/utsname.h>
> #include <sys/wait.h>
> #include <inttypes.h>
>
> @@ -29,11 +30,18 @@
> /* Specified test function bits */
> #define FUNC_MALLOC 0x1
> #define FUNC_BITS 0x2
> +#define FUNC_MMAP 0x4
> +#define FUNC_SYSCALL 0x8
>
> -#define TEST_MASK 0x3
> +#define TEST_MASK 0xf
> +
> +#define LOW_ADDR (0x1UL << 30)
> +#define HIGH_ADDR (0x3UL << 48)
>
> #define MALLOC_LEN 32
>
> +#define PAGE_SIZE (4 << 10)
> +
> struct testcases {
> unsigned int later;
> int expected; /* 2: SIGSEGV Error; 1: other errors */
> @@ -49,6 +57,7 @@ jmp_buf segv_env;
> static void segv_handler(int sig)
> {
> ksft_print_msg("Get segmentation fault(%d).", sig);
> +
> siglongjmp(segv_env, 1);
> }
>
> @@ -61,6 +70,16 @@ static inline int cpu_has_lam(void)
> return (cpuinfo[0] & (1 << 26));
> }
>
> +/* Check 5-level page table feature in CPUID.(EAX=07H,
> ECX=00H):ECX.[bit 16] */
> +static inline int cpu_has_la57(void)
> +{
> + unsigned int cpuinfo[4];
> +
> + __cpuid_count(0x7, 0, cpuinfo[0], cpuinfo[1], cpuinfo[2],
> cpuinfo[3]);
> +
> + return (cpuinfo[2] & (1 << 16));
> +}
> +
> /*
> * Set tagged address and read back untag mask.
> * check if the untagged mask is expected.
> @@ -213,6 +232,68 @@ static int handle_malloc(struct testcases *test)
> return ret;
> }
>
> +static int handle_mmap(struct testcases *test)
> +{
> + void *ptr;
> + unsigned int flags = MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED;
> + int ret = 0;
> +
> + if (test->later == 0 && test->lam != 0)
> + if (set_lam(test->lam) != 0)
> + return 1;
> +
> + ptr = mmap((void *)test->addr, PAGE_SIZE, PROT_READ |
> PROT_WRITE,
> + flags, -1, 0);
> + if (ptr == MAP_FAILED) {
> + if (test->addr == HIGH_ADDR)
> + if (!cpu_has_la57())
> + return 3; /* unsupport LA57 */
I think the return 3 here is meant to indicate a skipped case.
Perhaps you can report skipped cases like this? Just FYI, I'm not familiar
with selftests yet.
@@ -321,8 +323,10 @@ static int handle_mmap(struct testcases *test)
flags, -1, 0);
if (ptr == MAP_FAILED) {
if (test->addr == HIGH_ADDR)
- if (!cpu_has_la57())
+ if (!cpu_has_la57()) {
+ perror("Unsupport LA57. Skip");
return 3; /* unsupport LA57 */
+ }
return 1;
}
@@ -746,12 +750,16 @@ static void run_test(struct testcases *test, int
count)
/* fork a process to run test case */
ret = fork_test(t);
+ tests_cnt++;
+ if (ret == 3) {
+ ksft_test_result_skip(t->msg);
+ continue;
+ }
if (ret != 0)
ret = (t->expected == ret);
else
ret = !(t->expected);
- tests_cnt++;
ksft_test_result(ret, t->msg);
}
}
> + return 1;
> + }
> +
> + if (test->later != 0 && test->lam != 0)
> + if (set_lam(test->lam) != 0)
> + ret = 1;
> +
> + if (ret == 0) {
> + if (sigsetjmp(segv_env, 1) == 0) {
> + signal(SIGSEGV, segv_handler);
> + ret = handle_lam_test(ptr, test->lam);
> + } else {
> + ret = 2;
> + }
> + }
> +
> + munmap(ptr, PAGE_SIZE);
> + return ret;
> +}
> +
> +static int handle_syscall(struct testcases *test)
> +{
> + struct utsname unme, *pu;
> + int ret = 0;
> +
> + if (test->later == 0 && test->lam != 0)
> + if (set_lam(test->lam) != 0)
> + return 1;
> +
> + if (sigsetjmp(segv_env, 1) == 0) {
> + signal(SIGSEGV, segv_handler);
> + pu = (struct utsname *)set_metadata((uint64_t)&unme,
> test->lam);
> + ret = uname(pu);
> + if (ret < 0)
> + ret = 1;
> + } else {
> + ret = 2;
> + }
> +
> + if (test->later != 0 && test->lam != 0)
> + if (set_lam(test->lam) != -1 && ret == 0)
> + ret = 1;
> +
> + return ret;
> +}
> +
> static int fork_test(struct testcases *test)
> {
> int ret, child_ret;
> @@ -268,7 +349,6 @@ static struct testcases malloc_cases[] = {
> },
> };
>
> -
> static struct testcases bits_cases[] = {
> {
> .test_func = handle_max_bits,
> @@ -276,11 +356,54 @@ static struct testcases bits_cases[] = {
> },
> };
>
> +static struct testcases syscall_cases[] = {
> + {
> + .later = 0,
> + .lam = LAM_U57_BITS,
> + .test_func = handle_syscall,
> + .msg = "SYSCALL: LAM_U57. syscall with metadata\n",
> + },
> + {
> + .later = 1,
> + .expected = 1,
> + .lam = LAM_U57_BITS,
> + .test_func = handle_syscall,
> + .msg = "SYSCALL:[Negative] Disable LAM. Dereferencing
> pointer with metadata.\n",
> + },
> +};
> +
> +static struct testcases mmap_cases[] = {
> + {
> + .later = 1,
> + .expected = 0,
> + .lam = LAM_U57_BITS,
> + .addr = HIGH_ADDR,
> + .test_func = handle_mmap,
> + .msg = "MMAP: First mmap high address, then set
> LAM_U57.\n",
> + },
> + {
> + .later = 0,
> + .expected = 0,
> + .lam = LAM_U57_BITS,
> + .addr = HIGH_ADDR,
> + .test_func = handle_mmap,
> + .msg = "MMAP: First LAM_U57, then High address.\n",
> + },
> + {
> + .later = 0,
> + .expected = 0,
> + .lam = LAM_U57_BITS,
> + .addr = LOW_ADDR,
> + .test_func = handle_mmap,
> + .msg = "MMAP: First LAM_U57, then Low address.\n",
> + },
> +};
> +
> static void cmd_help(void)
> {
> printf("usage: lam [-h] [-t test list]\n");
> printf("\t-t test list: run tests specified in the test list,
> default:0x%x\n", TEST_MASK);
> - printf("\t\t0x1:malloc; 0x2:max_bits;\n");
> + printf("\t\t0x1:malloc; 0x2:max_bits; 0x4:mmap;
> 0x8:syscall.\n");
> printf("\t-h: help\n");
> }
>
> @@ -320,6 +443,12 @@ int main(int argc, char **argv)
> if (tests & FUNC_BITS)
> run_test(bits_cases, ARRAY_SIZE(bits_cases));
>
> + if (tests & FUNC_MMAP)
> + run_test(mmap_cases, ARRAY_SIZE(mmap_cases));
> +
> + if (tests & FUNC_SYSCALL)
> + run_test(syscall_cases, ARRAY_SIZE(syscall_cases));
> +
> ksft_set_plan(tests_cnt);
>
> return ksft_exit_pass();
> -----Original Message-----
> From: Robert Hoo <[email protected]>
> Sent: Wednesday, September 7, 2022 11:20 AM
> To: Kirill A. Shutemov <[email protected]>; Dave Hansen
> <[email protected]>; Lutomirski, Andy <[email protected]>; Peter
> Zijlstra <[email protected]>
> Cc: [email protected]; Kostya Serebryany <[email protected]>; Andrey Ryabinin
> <[email protected]>; Andrey Konovalov <[email protected]>;
> Alexander Potapenko <[email protected]>; Taras Madan
> <[email protected]>; Dmitry Vyukov <[email protected]>; H . J .
> Lu <[email protected]>; Andi Kleen <[email protected]>; Edgecombe,
> Rick P <[email protected]>; [email protected]; linux-
> [email protected]; Zhang, Weihong <[email protected]>
> Subject: Re: [PATCHv8 09/11] selftests/x86/lam: Add mmap and SYSCALL
> test cases for linear-address masking
>
> On Tue, 2022-08-30 at 04:01 +0300, Kirill A. Shutemov wrote:
> > From: Weihong Zhang <[email protected]>
> >
> > Add mmap and SYSCALL test cases.
> >
> > SYSCALL test cases:
> >
> > - LAM supports set metadata in high bits 62:57 (LAM_U57) of a user
> > pointer, pass
> > the pointer to SYSCALL, SYSCALL can dereference the pointer and
> > return correct
> > result.
> >
> > - Disable LAM, pass a pointer with metadata in high bits to SYSCALL,
> > SYSCALL returns -1 (EFAULT).
> >
> > MMAP test cases:
> >
> > - Enable LAM_U57, MMAP with low address (below bits 47), set
> metadata
> > in high bits of the address, dereference the address should be
> > allowed.
> >
> > - Enable LAM_U57, MMAP with high address (above bits 47), set
> > metadata
> > in high bits of the address, dereference the address should be
> > allowed.
> >
> > Signed-off-by: Weihong Zhang <[email protected]>
> > Signed-off-by: Kirill A. Shutemov <[email protected]>
> > Acked-by: Peter Zijlstra (Intel) <[email protected]>
> > ---
> > tools/testing/selftests/x86/lam.c | 135
> > +++++++++++++++++++++++++++++-
> > 1 file changed, 132 insertions(+), 3 deletions(-)
> >
> > diff --git a/tools/testing/selftests/x86/lam.c
> > b/tools/testing/selftests/x86/lam.c
> > index 900a3a0fb709..b88e007ee0a3 100644
> > --- a/tools/testing/selftests/x86/lam.c
> > +++ b/tools/testing/selftests/x86/lam.c
> > @@ -7,6 +7,7 @@
> > #include <signal.h>
> > #include <setjmp.h>
> > #include <sys/mman.h>
> > +#include <sys/utsname.h>
> > #include <sys/wait.h>
> > #include <inttypes.h>
> >
> > @@ -29,11 +30,18 @@
> > /* Specified test function bits */
> > #define FUNC_MALLOC 0x1
> > #define FUNC_BITS 0x2
> > +#define FUNC_MMAP 0x4
> > +#define FUNC_SYSCALL 0x8
> >
> > -#define TEST_MASK 0x3
> > +#define TEST_MASK 0xf
> > +
> > +#define LOW_ADDR (0x1UL << 30)
> > +#define HIGH_ADDR (0x3UL << 48)
> >
> > #define MALLOC_LEN 32
> >
> > +#define PAGE_SIZE (4 << 10)
> > +
> > struct testcases {
> > unsigned int later;
> > int expected; /* 2: SIGSEGV Error; 1: other errors */ @@ -49,6 +57,7
> > @@ jmp_buf segv_env; static void segv_handler(int sig) {
> > ksft_print_msg("Get segmentation fault(%d).", sig);
> > +
> > siglongjmp(segv_env, 1);
> > }
> >
> > @@ -61,6 +70,16 @@ static inline int cpu_has_lam(void)
> > return (cpuinfo[0] & (1 << 26));
> > }
> >
> > +/* Check 5-level page table feature in CPUID.(EAX=07H,
> > ECX=00H):ECX.[bit 16] */
> > +static inline int cpu_has_la57(void)
> > +{
> > + unsigned int cpuinfo[4];
> > +
> > + __cpuid_count(0x7, 0, cpuinfo[0], cpuinfo[1], cpuinfo[2],
> > cpuinfo[3]);
> > +
> > + return (cpuinfo[2] & (1 << 16));
> > +}
> > +
> > /*
> > * Set tagged address and read back untag mask.
> > * check if the untagged mask is expected.
> > @@ -213,6 +232,68 @@ static int handle_malloc(struct testcases *test)
> > return ret;
> > }
> >
> > +static int handle_mmap(struct testcases *test) {
> > + void *ptr;
> > + unsigned int flags = MAP_PRIVATE | MAP_ANONYMOUS |
> MAP_FIXED;
> > + int ret = 0;
> > +
> > + if (test->later == 0 && test->lam != 0)
> > + if (set_lam(test->lam) != 0)
> > + return 1;
> > +
> > + ptr = mmap((void *)test->addr, PAGE_SIZE, PROT_READ |
> > PROT_WRITE,
> > + flags, -1, 0);
> > + if (ptr == MAP_FAILED) {
> > + if (test->addr == HIGH_ADDR)
> > + if (!cpu_has_la57())
> > + return 3; /* unsupport LA57 */
>
> I think here return 3 to indicate skip cases.
>
> Perhaps you can enable skip case like this? Just FYI, I'm not familiar with
> selftests yet.
>
Yes, this case can be regarded as a skip case.
> @@ -321,8 +323,10 @@ static int handle_mmap(struct testcases *test)
> flags, -1, 0);
> if (ptr == MAP_FAILED) {
> if (test->addr == HIGH_ADDR)
> - if (!cpu_has_la57())
> + if (!cpu_has_la57()) {
> + perror("Unsupport LA57. Skip");
> return 3; /* unsupport LA57 */
> + }
> return 1;
> }
>
> @@ -746,12 +750,16 @@ static void run_test(struct testcases *test, int
> count)
>
> /* fork a process to run test case */
> ret = fork_test(t);
> + tests_cnt++;
> + if (ret == 3) {
> + ksft_test_result_skip(t->msg);
> + continue;
> + }
> if (ret != 0)
> ret = (t->expected == ret);
> else
> ret = !(t->expected);
>
> - tests_cnt++;
> ksft_test_result(ret, t->msg);
> }
> }
>
> > + return 1;
> > + }
> > +
> > + if (test->later != 0 && test->lam != 0)
> > + if (set_lam(test->lam) != 0)
> > + ret = 1;
> > +
> > + if (ret == 0) {
> > + if (sigsetjmp(segv_env, 1) == 0) {
> > + signal(SIGSEGV, segv_handler);
> > + ret = handle_lam_test(ptr, test->lam);
> > + } else {
> > + ret = 2;
> > + }
> > + }
> > +
> > + munmap(ptr, PAGE_SIZE);
> > + return ret;
> > +}
> > +
> > +static int handle_syscall(struct testcases *test) {
> > + struct utsname unme, *pu;
> > + int ret = 0;
> > +
> > + if (test->later == 0 && test->lam != 0)
> > + if (set_lam(test->lam) != 0)
> > + return 1;
> > +
> > + if (sigsetjmp(segv_env, 1) == 0) {
> > + signal(SIGSEGV, segv_handler);
> > + pu = (struct utsname *)set_metadata((uint64_t)&unme,
> > test->lam);
> > + ret = uname(pu);
> > + if (ret < 0)
> > + ret = 1;
> > + } else {
> > + ret = 2;
> > + }
> > +
> > + if (test->later != 0 && test->lam != 0)
> > + if (set_lam(test->lam) != -1 && ret == 0)
> > + ret = 1;
> > +
> > + return ret;
> > +}
> > +
> > static int fork_test(struct testcases *test) {
> > int ret, child_ret;
> > @@ -268,7 +349,6 @@ static struct testcases malloc_cases[] = {
> > },
> > };
> >
> > -
> > static struct testcases bits_cases[] = {
> > {
> > .test_func = handle_max_bits,
> > @@ -276,11 +356,54 @@ static struct testcases bits_cases[] = {
> > },
> > };
> >
> > +static struct testcases syscall_cases[] = {
> > + {
> > + .later = 0,
> > + .lam = LAM_U57_BITS,
> > + .test_func = handle_syscall,
> > + .msg = "SYSCALL: LAM_U57. syscall with metadata\n",
> > + },
> > + {
> > + .later = 1,
> > + .expected = 1,
> > + .lam = LAM_U57_BITS,
> > + .test_func = handle_syscall,
> > + .msg = "SYSCALL:[Negative] Disable LAM. Dereferencing
> > pointer with metadata.\n",
> > + },
> > +};
> > +
> > +static struct testcases mmap_cases[] = {
> > + {
> > + .later = 1,
> > + .expected = 0,
> > + .lam = LAM_U57_BITS,
> > + .addr = HIGH_ADDR,
> > + .test_func = handle_mmap,
> > + .msg = "MMAP: First mmap high address, then set
> > LAM_U57.\n",
> > + },
> > + {
> > + .later = 0,
> > + .expected = 0,
> > + .lam = LAM_U57_BITS,
> > + .addr = HIGH_ADDR,
> > + .test_func = handle_mmap,
> > + .msg = "MMAP: First LAM_U57, then High address.\n",
> > + },
> > + {
> > + .later = 0,
> > + .expected = 0,
> > + .lam = LAM_U57_BITS,
> > + .addr = LOW_ADDR,
> > + .test_func = handle_mmap,
> > + .msg = "MMAP: First LAM_U57, then Low address.\n",
> > + },
> > +};
> > +
> > static void cmd_help(void)
> > {
> > printf("usage: lam [-h] [-t test list]\n");
> > printf("\t-t test list: run tests specified in the test list,
> > default:0x%x\n", TEST_MASK);
> > - printf("\t\t0x1:malloc; 0x2:max_bits;\n");
> > + printf("\t\t0x1:malloc; 0x2:max_bits; 0x4:mmap;
> > 0x8:syscall.\n");
> > printf("\t-h: help\n");
> > }
> >
> > @@ -320,6 +443,12 @@ int main(int argc, char **argv)
> > if (tests & FUNC_BITS)
> > run_test(bits_cases, ARRAY_SIZE(bits_cases));
> >
> > + if (tests & FUNC_MMAP)
> > + run_test(mmap_cases, ARRAY_SIZE(mmap_cases));
> > +
> > + if (tests & FUNC_SYSCALL)
> > + run_test(syscall_cases, ARRAY_SIZE(syscall_cases));
> > +
> > ksft_set_plan(tests_cnt);
> >
> > return ksft_exit_pass();
On Sun, Sep 04, 2022 at 03:39:52AM +0300, Kirill A. Shutemov wrote:
> On Thu, Sep 01, 2022 at 05:45:08PM +0000, Ashok Raj wrote:
> > Hi Kirill,
> >
> > On Tue, Aug 30, 2022 at 04:00:53AM +0300, Kirill A. Shutemov wrote:
> > > Linear Address Masking[1] (LAM) modifies the checking that is applied to
> > > 64-bit linear addresses, allowing software to use of the untranslated
> > > address bits for metadata.
> >
> > We discussed this internally, but didn't bubble up here.
> >
> > Given that we are working on enabling Shared Virtual Addressing (SVA)
> > within the IOMMU. This permits user to share VA directly with the device,
> > and the device can participate even in fixing page-faults and such.
> >
> > IOMMU enforces canonical addressing, since we are hijacking the top order
> > bits for meta-data, it will fail sanity check and we would return a failure
> > back to device on any page-faults from device.
> >
> > It also complicates how device TLB and ATS work, and needs some major
> > improvements to detect device capability to accept tagged pointers, adjust
> > the devtlb to act accordingly.
> >
> >
> > Both are orthogonal features, but there is an intersection of both
> > that are fundamentally incompatible.
> >
> > Its even more important, since an application might be using SVA under the
> > cover provided by some library that's used without their knowledge.
> >
> > The path would be:
> >
> > 1. Ensure both LAM and SVM are incompatible by design, without major
> > changes.
> > - If LAM is enabled already and later SVM enabling is requested by
> > user, that should fail. and Vice versa.
> > - Provide an API to user to ask for opt-out. Now they know they
> > must sanitize the pointers before sending to device, or the
> > working set is already isolated and needs no work.
>
> The patch below implements something like this. It is PoC, build-tested only.
>
> To be honest, I hate it. It is clearly a layering violation. It feels
> dirty. But I don't see any better way as we tie orthogonal features
> together.
>
> Also I have no idea how to make forced PASID allocation if LAM enabled.
> What the API has to look like?
>
> Any comments?
Looking through it, it seems to be sane enough... I feel dirty too :-) but
don't see a better way.
I'm Ccing JasonG since we are reworking the IOMMU interfaces right now, and
Jacob who is in the middle of some refactoring.
>
> > 2. I suppose for any syscalls that take tagged pointers you would maybe
> > relax checks for how many bits to ignore for canonicallity. This is
> > required so user don't need to do the same for everything sanitization
> > before every syscall.
>
> I'm not quite follow this. For syscalls that allow tagged pointers, we do
> untagged_addr() now. Not sure what else needed.
>
> > If you have it fail, the library might choose a less optimal path if one is
> > available.
> >
> > Cheers,
> > Ashok
>
> diff --git a/arch/x86/include/uapi/asm/prctl.h b/arch/x86/include/uapi/asm/prctl.h
> index a31e27b95b19..e5c04ced36c9 100644
> --- a/arch/x86/include/uapi/asm/prctl.h
> +++ b/arch/x86/include/uapi/asm/prctl.h
> @@ -23,5 +23,6 @@
> #define ARCH_GET_UNTAG_MASK 0x4001
> #define ARCH_ENABLE_TAGGED_ADDR 0x4002
> #define ARCH_GET_MAX_TAG_BITS 0x4003
> +#define ARCH_ENABLE_TAGGED_ADDR_FORCED 0x4004
>
> #endif /* _ASM_X86_PRCTL_H */
> diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
> index 337f80a0862f..7d89a2fd1a55 100644
> --- a/arch/x86/kernel/process_64.c
> +++ b/arch/x86/kernel/process_64.c
> @@ -774,7 +774,8 @@ static bool lam_u48_allowed(void)
> #define LAM_U48_BITS 15
> #define LAM_U57_BITS 6
>
> -static int prctl_enable_tagged_addr(struct mm_struct *mm, unsigned long nr_bits)
> +static int prctl_enable_tagged_addr(struct mm_struct *mm, unsigned long nr_bits,
> + bool forced)
> {
> int ret = 0;
>
> @@ -793,6 +794,11 @@ static int prctl_enable_tagged_addr(struct mm_struct *mm, unsigned long nr_bits)
> goto out;
> }
>
> + if (pasid_valid(mm->pasid) && !forced) {
> + ret = -EBUSY;
> + goto out;
> + }
> +
> if (!nr_bits) {
> ret = -EINVAL;
> goto out;
> @@ -910,7 +916,9 @@ long do_arch_prctl_64(struct task_struct *task, int option, unsigned long arg2)
> return put_user(task->mm->context.untag_mask,
> (unsigned long __user *)arg2);
> case ARCH_ENABLE_TAGGED_ADDR:
> - return prctl_enable_tagged_addr(task->mm, arg2);
> + return prctl_enable_tagged_addr(task->mm, arg2, false);
> + case ARCH_ENABLE_TAGGED_ADDR_FORCED:
> + return prctl_enable_tagged_addr(task->mm, arg2, true);
> case ARCH_GET_MAX_TAG_BITS: {
> int nr_bits;
>
> diff --git a/drivers/iommu/iommu-sva-lib.c b/drivers/iommu/iommu-sva-lib.c
> index 106506143896..a6ec17de1937 100644
> --- a/drivers/iommu/iommu-sva-lib.c
> +++ b/drivers/iommu/iommu-sva-lib.c
> @@ -4,6 +4,7 @@
> */
> #include <linux/mutex.h>
> #include <linux/sched/mm.h>
> +#include <asm/mmu_context.h>
>
> #include "iommu-sva-lib.h"
>
> @@ -32,6 +33,15 @@ int iommu_sva_alloc_pasid(struct mm_struct *mm, ioasid_t min, ioasid_t max)
> return -EINVAL;
>
> mutex_lock(&iommu_sva_lock);
> +
> + /* Serialize against LAM enabling */
> + mutex_lock(&mm->context.lock);
> +
> + if (mm_lam_cr3_mask(mm)) {
> + ret = -EBUSY;
> + goto out;
> + }
> +
> /* Is a PASID already associated with this mm? */
> if (pasid_valid(mm->pasid)) {
> if (mm->pasid < min || mm->pasid >= max)
> @@ -45,6 +55,7 @@ int iommu_sva_alloc_pasid(struct mm_struct *mm, ioasid_t min, ioasid_t max)
> else
> mm_pasid_set(mm, pasid);
> out:
> + mutex_unlock(&mm->context.lock);
> mutex_unlock(&iommu_sva_lock);
> return ret;
> }
> --
> Kiryl Shutsemau / Kirill A. Shutemov
Hi Ashok,
On Fri, 9 Sep 2022 16:08:02 +0000, Ashok Raj <[email protected]>
wrote:
> > diff --git a/arch/x86/include/uapi/asm/prctl.h
> > b/arch/x86/include/uapi/asm/prctl.h index a31e27b95b19..e5c04ced36c9
> > 100644 --- a/arch/x86/include/uapi/asm/prctl.h
> > +++ b/arch/x86/include/uapi/asm/prctl.h
> > @@ -23,5 +23,6 @@
> > #define ARCH_GET_UNTAG_MASK 0x4001
> > #define ARCH_ENABLE_TAGGED_ADDR 0x4002
> > #define ARCH_GET_MAX_TAG_BITS 0x4003
> > +#define ARCH_ENABLE_TAGGED_ADDR_FORCED 0x4004
> >
> > #endif /* _ASM_X86_PRCTL_H */
> > diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
> > index 337f80a0862f..7d89a2fd1a55 100644
> > --- a/arch/x86/kernel/process_64.c
> > +++ b/arch/x86/kernel/process_64.c
> > @@ -774,7 +774,8 @@ static bool lam_u48_allowed(void)
> > #define LAM_U48_BITS 15
> > #define LAM_U57_BITS 6
> >
> > -static int prctl_enable_tagged_addr(struct mm_struct *mm, unsigned
> > long nr_bits) +static int prctl_enable_tagged_addr(struct mm_struct
> > *mm, unsigned long nr_bits,
> > + bool forced)
> > {
> > int ret = 0;
> >
> > @@ -793,6 +794,11 @@ static int prctl_enable_tagged_addr(struct
> > mm_struct *mm, unsigned long nr_bits) goto out;
> > }
> >
> > + if (pasid_valid(mm->pasid) && !forced) {
I don't think this works, since we have lazy PASID free. For example,
after all the devices have done sva_unbind(), mm->pasid will remain valid
until mmdrop(). LAM should be supported in this case.
Perhaps we could introduce another prctl flag for SVA, PR_GET_SVA?
Both the IOMMU driver and LAM could set/query the flag. LAM applications
may not be the only ones that want to know whether shared virtual
addressing is on.
> > + ret = -EBUSY;
> > + goto out;
> > + }
> > +
Thanks,
Jacob
On 9/12/22 13:39, Jacob Pan wrote:
>>> + if (pasid_valid(mm->pasid) && !forced) {
> I don't think this works since we have lazy pasid free. for example,
> after all the devices did sva_unbind, mm->pasid we'll remain valid until
> mmdrop(). LAM should be supported in this case.
Nah, it works fine.
It just means that the rules are "you can't do LAM if your process
*EVER* got a PASID" instead of "you can't do LAM if you are actively
using your PASID".
We knew that PASID use would be a one-way trip for a process when we
moved to the simplified implementation. This is just more fallout from
that. It's fine.
> Perhaps, we could introduce another prctl flag for SVA, PR_GET_SVA?
> Both iommu driver and LAM can set/query the flag. LAM applications may not
> be the only ones want to know if share virtual addressing is on.
I don't think it's a good idea to add yet more UABI around this issue.
Won't the IOMMU folks eventually get their hardware in line with LAM?
Isn't this situation temporary?
On Sun, Sep 04, 2022 at 03:39:52AM +0300, Kirill A. Shutemov wrote:
> On Thu, Sep 01, 2022 at 05:45:08PM +0000, Ashok Raj wrote:
> > Hi Kirill,
> >
> > On Tue, Aug 30, 2022 at 04:00:53AM +0300, Kirill A. Shutemov wrote:
> > > Linear Address Masking[1] (LAM) modifies the checking that is applied to
> > > 64-bit linear addresses, allowing software to use of the untranslated
> > > address bits for metadata.
> >
> > We discussed this internally, but didn't bubble up here.
> >
> > Given that we are working on enabling Shared Virtual Addressing (SVA)
> > within the IOMMU. This permits user to share VA directly with the device,
> > and the device can participate even in fixing page-faults and such.
> >
> > IOMMU enforces canonical addressing, since we are hijacking the top order
> > bits for meta-data, it will fail sanity check and we would return a failure
> > back to device on any page-faults from device.
> >
> > It also complicates how device TLB and ATS work, and needs some major
> > improvements to detect device capability to accept tagged pointers, adjust
> > the devtlb to act accordingly.
> >
> >
> > Both are orthogonal features, but there is an intersection of both
> > that are fundamentally incompatible.
> >
> > Its even more important, since an application might be using SVA under the
> > cover provided by some library that's used without their knowledge.
> >
> > The path would be:
> >
> > 1. Ensure both LAM and SVM are incompatible by design, without major
> > changes.
> > - If LAM is enabled already and later SVM enabling is requested by
> > user, that should fail. and Vice versa.
> > - Provide an API to user to ask for opt-out. Now they know they
> > must sanitize the pointers before sending to device, or the
> > working set is already isolated and needs no work.
>
> The patch below implements something like this. It is PoC, build-tested only.
>
> To be honest, I hate it. It is clearly a layering violation. It feels
> dirty. But I don't see any better way as we tie orthogonal features
> together.
>
> Also I have no idea how to make forced PASID allocation if LAM enabled.
> What the API has to look like?
Jacob, Ashok, any comment on this part?
I expect that in many cases LAM will be enabled very early in process
startup (like before malloc is functional), which makes PASID allocation
always fail.
Any way out?
--
Kiryl Shutsemau / Kirill A. Shutemov
Hi Dave,
On Mon, 12 Sep 2022 14:41:56 -0700, Dave Hansen <[email protected]>
wrote:
> On 9/12/22 13:39, Jacob Pan wrote:
> >>> + if (pasid_valid(mm->pasid) && !forced) {
> > I don't think this works since we have lazy pasid free. for example,
> > after all the devices did sva_unbind, mm->pasid we'll remain valid
> > until mmdrop(). LAM should be supported in this case.
>
> Nah, it works fine.
> It just means that the rules are "you can't do LAM if your process
> *EVER* got a PASID" instead of "you can't do LAM if you are actively
> using your PASID".
Sure, it works if you change the rules, but this case needs to be documented.
>
> We knew that PASID use would be a one-way trip for a process when we
> moved to the simplified implementation. This is just more fallout from
> that. It's fine.
>
Is LAM also a one-way trip?
> > Perhaps, we could introduce another prctl flag for SVA, PR_GET_SVA?
> > Both iommu driver and LAM can set/query the flag. LAM applications may
> > not be the only ones want to know if share virtual addressing is on.
>
> I don't think it's a good idea to add yet more UABI around this issue.
> Won't the IOMMU folks eventually get their hardware in line with LAM?
> Isn't this situation temporary?
Thanks,
Jacob
Hi Kirill,
On Tue, 13 Sep 2022 01:49:30 +0300, "Kirill A. Shutemov"
<[email protected]> wrote:
> On Sun, Sep 04, 2022 at 03:39:52AM +0300, Kirill A. Shutemov wrote:
> > On Thu, Sep 01, 2022 at 05:45:08PM +0000, Ashok Raj wrote:
> > > Hi Kirill,
> > >
> > > On Tue, Aug 30, 2022 at 04:00:53AM +0300, Kirill A. Shutemov wrote:
> > > > Linear Address Masking[1] (LAM) modifies the checking that is
> > > > applied to 64-bit linear addresses, allowing software to use of the
> > > > untranslated address bits for metadata.
> > >
> > > We discussed this internally, but didn't bubble up here.
> > >
> > > Given that we are working on enabling Shared Virtual Addressing (SVA)
> > > within the IOMMU. This permits user to share VA directly with the
> > > device, and the device can participate even in fixing page-faults and
> > > such.
> > >
> > > IOMMU enforces canonical addressing, since we are hijacking the top
> > > order bits for meta-data, it will fail sanity check and we would
> > > return a failure back to device on any page-faults from device.
> > >
> > > It also complicates how device TLB and ATS work, and needs some major
> > > improvements to detect device capability to accept tagged pointers,
> > > adjust the devtlb to act accordingly.
> > >
> > >
> > > Both are orthogonal features, but there is an intersection of both
> > > that are fundamentally incompatible.
> > >
> > > Its even more important, since an application might be using SVA
> > > under the cover provided by some library that's used without their
> > > knowledge.
> > >
> > > The path would be:
> > >
> > > 1. Ensure both LAM and SVM are incompatible by design, without major
> > > changes.
> > > - If LAM is enabled already and later SVM enabling is
> > > requested by user, that should fail. and Vice versa.
> > > - Provide an API to user to ask for opt-out. Now they know
> > > they must sanitize the pointers before sending to device, or the
> > > working set is already isolated and needs no work.
> >
> > The patch below implements something like this. It is PoC, build-tested
> > only.
> >
> > To be honest, I hate it. It is clearly a layering violation. It feels
> > dirty. But I don't see any better way as we tie orthogonal features
> > together.
> >
> > Also I have no idea how to make forced PASID allocation if LAM enabled.
> > What the API has to look like?
>
> Jacob, Ashok, any comment on this part?
>
> I expect in many cases LAM will be enabled very early (like before malloc
> is functinal) in process start and it makes PASID allocation always fail.
>
Is there a generic flag LAM can set on the mm?
We can't check an x86-specific feature in the IOMMU SVA API, i.e.:
@@ -32,6 +33,15 @@ int iommu_sva_alloc_pasid(struct mm_struct *mm, ioasid_t min, ioasid_t max)
return -EINVAL;
mutex_lock(&iommu_sva_lock);
+
+ /* Serialize against LAM enabling */
+ mutex_lock(&mm->context.lock);
+
+ if (mm_lam_cr3_mask(mm)) {
+ ret = -EBUSY;
+ goto out;
+ }
+
> Any way out?
>
Thanks,
Jacob
On Mon, Sep 12, 2022 at 03:55:02PM -0700, Jacob Pan wrote:
> Hi Dave,
>
> On Mon, 12 Sep 2022 14:41:56 -0700, Dave Hansen <[email protected]>
> wrote:
>
> > On 9/12/22 13:39, Jacob Pan wrote:
> > >>> + if (pasid_valid(mm->pasid) && !forced) {
> > > I don't think this works since we have lazy pasid free. for example,
> > > after all the devices did sva_unbind, mm->pasid we'll remain valid
> > > until mmdrop(). LAM should be supported in this case.
> >
> > Nah, it works fine.
> > It just means that the rules are "you can't do LAM if your process
> > *EVER* got a PASID" instead of "you can't do LAM if you are actively
> > using your PASID".
> Sure it works if you change the rules, but this case need to documented.
>
> >
> > We knew that PASID use would be a one-way trip for a process when we
> > moved to the simplified implementation. This is just more fallout from
> > that. It's fine.
> >
> Is LAM also a one-way trip?
Yes.
--
Kiryl Shutsemau / Kirill A. Shutemov
On Mon, Sep 12, 2022 at 05:08:09PM -0700, Jacob Pan wrote:
> Hi Kirill,
>
> On Tue, 13 Sep 2022 01:49:30 +0300, "Kirill A. Shutemov"
> <[email protected]> wrote:
>
> > On Sun, Sep 04, 2022 at 03:39:52AM +0300, Kirill A. Shutemov wrote:
> > > On Thu, Sep 01, 2022 at 05:45:08PM +0000, Ashok Raj wrote:
> > > > Hi Kirill,
> > > >
> > > > On Tue, Aug 30, 2022 at 04:00:53AM +0300, Kirill A. Shutemov wrote:
> > > > > Linear Address Masking[1] (LAM) modifies the checking that is
> > > > > applied to 64-bit linear addresses, allowing software to use of the
> > > > > untranslated address bits for metadata.
> > > >
> > > > We discussed this internally, but didn't bubble up here.
> > > >
> > > > Given that we are working on enabling Shared Virtual Addressing (SVA)
> > > > within the IOMMU. This permits user to share VA directly with the
> > > > device, and the device can participate even in fixing page-faults and
> > > > such.
> > > >
> > > > IOMMU enforces canonical addressing, since we are hijacking the top
> > > > order bits for meta-data, it will fail sanity check and we would
> > > > return a failure back to device on any page-faults from device.
> > > >
> > > > It also complicates how device TLB and ATS work, and needs some major
> > > > improvements to detect device capability to accept tagged pointers,
> > > > adjust the devtlb to act accordingly.
> > > >
> > > >
> > > > Both are orthogonal features, but there is an intersection of both
> > > > that are fundamentally incompatible.
> > > >
> > > > Its even more important, since an application might be using SVA
> > > > under the cover provided by some library that's used without their
> > > > knowledge.
> > > >
> > > > The path would be:
> > > >
> > > > 1. Ensure both LAM and SVM are incompatible by design, without major
> > > > changes.
> > > > - If LAM is enabled already and later SVM enabling is
> > > > requested by user, that should fail. and Vice versa.
> > > > - Provide an API to user to ask for opt-out. Now they know
> > > > they must sanitize the pointers before sending to device, or the
> > > > working set is already isolated and needs no work.
> > >
> > > The patch below implements something like this. It is PoC, build-tested
> > > only.
> > >
> > > To be honest, I hate it. It is clearly a layering violation. It feels
> > > dirty. But I don't see any better way as we tie orthogonal features
> > > together.
> > >
> > > Also I have no idea how to make forced PASID allocation if LAM enabled.
> > > What the API has to look like?
> >
> > Jacob, Ashok, any comment on this part?
> >
> > I expect in many cases LAM will be enabled very early (like before malloc
> > is functinal) in process start and it makes PASID allocation always fail.
> >
> Is there a generic flag LAM can set on the mm?
Hm. Not really.
I thought we could use untagged_addr(mm, -1UL) != -1UL as such a check, but
-1UL is a kernel address and untagged_addr() would not untag such an
address for LAM.
I guess we could add a helper for this.
But the tagged-address implementation is rather different across platforms
and the semantics can be hard to define, e.g. whether tagged addresses are
enabled per-thread, per-process, or globally.
Maybe just add an arch hook there? arch_can_alloc_pasid(mm) or something.
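Roughly, such a hook could look like the sketch below (placement and the
x86 condition are simplified and purely illustrative):

/* include/linux/mmu_context.h: generic fallback, sketch only */
#ifndef arch_can_alloc_pasid
#define arch_can_alloc_pasid(mm)	true
#endif

/* arch/x86/include/asm/mmu_context.h: x86 override, sketch only.
 * Refuse PASID allocation once LAM is active for this mm. */
#define arch_can_alloc_pasid(mm)	(!mm_lam_cr3_mask(mm))

The IOMMU SVA code would then call arch_can_alloc_pasid(mm) before
allocating a PASID and fail with -EBUSY when it returns false.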
--
Kiryl Shutsemau / Kirill A. Shutemov
On Mon, Sep 12, 2022 at 02:41:56PM -0700, Dave Hansen wrote:
> On 9/12/22 13:39, Jacob Pan wrote:
> >>> + if (pasid_valid(mm->pasid) && !forced) {
> > I don't think this works since we have lazy pasid free. for example,
> > after all the devices did sva_unbind, mm->pasid we'll remain valid until
> > mmdrop(). LAM should be supported in this case.
>
> Nah, it works fine.
>
> It just means that the rules are "you can't do LAM if your process
> *EVER* got a PASID" instead of "you can't do LAM if you are actively
> using your PASID".
>
> We knew that PASID use would be a one-way trip for a process when we
> moved to the simplified implementation. This is just more fallout from
> that. It's fine.
Agree.
>
> > Perhaps, we could introduce another prctl flag for SVA, PR_GET_SVA?
> > Both iommu driver and LAM can set/query the flag. LAM applications may not
> > be the only ones want to know if share virtual addressing is on.
>
> I don't think it's a good idea to add yet more UABI around this issue.
> Won't the IOMMU folks eventually get their hardware in line with LAM?
> Isn't this situation temporary?
This is more than just an IOMMU change, since it involves the device end
as well: the ability to report a tagging feature, communicating the width
to ignore, etc. I suspect PCIe changes are required on the device side,
which would be a long pole.
I suspect this would be moderately permanent :-) Memory tagging is more
of a niche use case, and the general I/O device side has lots of design
touch points that make it difficult to close in short order.
On Tue, Sep 13, 2022 at 01:49:30AM +0300, Kirill A. Shutemov wrote:
> On Sun, Sep 04, 2022 at 03:39:52AM +0300, Kirill A. Shutemov wrote:
> > On Thu, Sep 01, 2022 at 05:45:08PM +0000, Ashok Raj wrote:
> > > Hi Kirill,
> > >
> > > On Tue, Aug 30, 2022 at 04:00:53AM +0300, Kirill A. Shutemov wrote:
> > > > Linear Address Masking[1] (LAM) modifies the checking that is applied to
> > > > 64-bit linear addresses, allowing software to use of the untranslated
> > > > address bits for metadata.
> > >
> > > We discussed this internally, but didn't bubble up here.
> > >
> > > Given that we are working on enabling Shared Virtual Addressing (SVA)
> > > within the IOMMU. This permits user to share VA directly with the device,
> > > and the device can participate even in fixing page-faults and such.
> > >
> > > IOMMU enforces canonical addressing, since we are hijacking the top order
> > > bits for meta-data, it will fail sanity check and we would return a failure
> > > back to device on any page-faults from device.
> > >
> > > It also complicates how device TLB and ATS work, and needs some major
> > > improvements to detect device capability to accept tagged pointers, adjust
> > > the devtlb to act accordingly.
> > >
> > >
> > > Both are orthogonal features, but there is an intersection of both
> > > that are fundamentally incompatible.
> > >
> > > Its even more important, since an application might be using SVA under the
> > > cover provided by some library that's used without their knowledge.
> > >
> > > The path would be:
> > >
> > > 1. Ensure both LAM and SVM are incompatible by design, without major
> > > changes.
> > > - If LAM is enabled already and later SVM enabling is requested by
> > > user, that should fail. and Vice versa.
> > > - Provide an API to user to ask for opt-out. Now they know they
> > > must sanitize the pointers before sending to device, or the
> > > working set is already isolated and needs no work.
> >
> > The patch below implements something like this. It is PoC, build-tested only.
> >
> > To be honest, I hate it. It is clearly a layering violation. It feels
> > dirty. But I don't see any better way as we tie orthogonal features
> > together.
> >
> > Also I have no idea how to make forced PASID allocation if LAM enabled.
> > What the API has to look like?
>
> Jacob, Ashok, any comment on this part?
>
> I expect in many cases LAM will be enabled very early (like before malloc
> is functinal) in process start and it makes PASID allocation always fail.
>
> Any way out?
We need closure on this to proceed. Any clue?
--
Kiryl Shutsemau / Kirill A. Shutemov
On Wed, Sep 14, 2022 at 05:45:18PM +0300, Kirill A. Shutemov wrote:
> On Tue, Sep 13, 2022 at 01:49:30AM +0300, Kirill A. Shutemov wrote:
> > On Sun, Sep 04, 2022 at 03:39:52AM +0300, Kirill A. Shutemov wrote:
> > > On Thu, Sep 01, 2022 at 05:45:08PM +0000, Ashok Raj wrote:
> > > > Hi Kirill,
> > > >
> > > > On Tue, Aug 30, 2022 at 04:00:53AM +0300, Kirill A. Shutemov wrote:
> > > > > Linear Address Masking[1] (LAM) modifies the checking that is applied to
> > > > > 64-bit linear addresses, allowing software to use of the untranslated
> > > > > address bits for metadata.
> > > >
> > > > We discussed this internally, but didn't bubble up here.
> > > >
> > > > Given that we are working on enabling Shared Virtual Addressing (SVA)
> > > > within the IOMMU. This permits user to share VA directly with the device,
> > > > and the device can participate even in fixing page-faults and such.
> > > >
> > > > IOMMU enforces canonical addressing, since we are hijacking the top order
> > > > bits for meta-data, it will fail sanity check and we would return a failure
> > > > back to device on any page-faults from device.
> > > >
> > > > It also complicates how device TLB and ATS work, and needs some major
> > > > improvements to detect device capability to accept tagged pointers, adjust
> > > > the devtlb to act accordingly.
> > > >
> > > >
> > > > Both are orthogonal features, but there is an intersection of both
> > > > that are fundamentally incompatible.
> > > >
> > > > Its even more important, since an application might be using SVA under the
> > > > cover provided by some library that's used without their knowledge.
> > > >
> > > > The path would be:
> > > >
> > > > 1. Ensure both LAM and SVM are incompatible by design, without major
> > > > changes.
> > > > - If LAM is enabled already and later SVM enabling is requested by
> > > > user, that should fail. and Vice versa.
> > > > - Provide an API to user to ask for opt-out. Now they know they
> > > > must sanitize the pointers before sending to device, or the
> > > > working set is already isolated and needs no work.
> > >
> > > The patch below implements something like this. It is PoC, build-tested only.
> > >
> > > To be honest, I hate it. It is clearly a layering violation. It feels
> > > dirty. But I don't see any better way as we tie orthogonal features
> > > together.
> > >
> > > Also I have no idea how to make forced PASID allocation if LAM enabled.
> > > What the API has to look like?
> >
> > Jacob, Ashok, any comment on this part?
> >
> > I expect in many cases LAM will be enabled very early (like before malloc
> > is functinal) in process start and it makes PASID allocation always fail.
> >
> > Any way out?
>
> We need closure on this to proceed. Any clue?
Failing PASID allocation seems like the right thing to do here. If the
application is explicitly allocating PASIDs, it can opt out using a
mechanism similar to the one you have for LAM enabling. The user then
takes responsibility for sanitizing pointers.
If some library is using an accelerator without the application's
knowledge, it would use the failure as a signal to take an alternate path
if one exists.
I don't know whether both LAM and SVM need a separate forced opt-in (or
rather, I don't have an opinion). Is this what you were asking?
+ Joerg, JasonG in case they have an opinion.
Cheers,
Ashok
On Wed, Sep 14, 2022 at 06:18:18PM +0300, Kirill A. Shutemov wrote:
> > > > >
> > > > > The patch below implements something like this. It is PoC, build-tested only.
> > > > >
> > > > > To be honest, I hate it. It is clearly a layering violation. It feels
> > > > > dirty. But I don't see any better way as we tie orthogonal features
> > > > > together.
> > > > >
> > > > > Also I have no idea how to make forced PASID allocation if LAM enabled.
> > > > > What the API has to look like?
> > > >
> > > > Jacob, Ashok, any comment on this part?
> > > >
> > > > I expect in many cases LAM will be enabled very early (like before malloc
> > > > is functinal) in process start and it makes PASID allocation always fail.
> > > >
> > > > Any way out?
> > >
> > > We need closure on this to proceed. Any clue?
> >
> > Failing PASID allocation seems like the right thing to do here. If the
> > application is explicitly allocating PASID's it can opt-out using the
> > similar mechanism you have for LAM enabling. So user takes
> > responsibility for sanitizing pointers.
> >
> > If some library is using an accelerator without application knowledge,
> > that would use the failure as a mechanism to use an alternate path if
> > one exists.
> >
> > I don't know if both LAM and SVM need a separate forced opt-in (or i
> > don't have an opinion rather). Is this what you were asking?
> >
> > + Joerg, JasonG in case they have an opinion.
>
> My point is that the patch provides a way to override LAM vs. PASID mutual
> exclusion, but only if PASID allocated first. If we enabled LAM before
> PASID is allcoated there's no way to forcefully allocate PASID, bypassing
> LAM check. I think there should be one, no?
Yes, we should have one for force-enabling SVM too, if the application
asks for forgiveness.
On Wed, Sep 14, 2022 at 08:11:19AM -0700, Ashok Raj wrote:
> On Wed, Sep 14, 2022 at 05:45:18PM +0300, Kirill A. Shutemov wrote:
> > On Tue, Sep 13, 2022 at 01:49:30AM +0300, Kirill A. Shutemov wrote:
> > > On Sun, Sep 04, 2022 at 03:39:52AM +0300, Kirill A. Shutemov wrote:
> > > > On Thu, Sep 01, 2022 at 05:45:08PM +0000, Ashok Raj wrote:
> > > > > Hi Kirill,
> > > > >
> > > > > On Tue, Aug 30, 2022 at 04:00:53AM +0300, Kirill A. Shutemov wrote:
> > > > > > Linear Address Masking[1] (LAM) modifies the checking that is applied to
> > > > > > 64-bit linear addresses, allowing software to use of the untranslated
> > > > > > address bits for metadata.
> > > > >
> > > > > We discussed this internally, but didn't bubble up here.
> > > > >
> > > > > Given that we are working on enabling Shared Virtual Addressing (SVA)
> > > > > within the IOMMU. This permits user to share VA directly with the device,
> > > > > and the device can participate even in fixing page-faults and such.
> > > > >
> > > > > IOMMU enforces canonical addressing, since we are hijacking the top order
> > > > > bits for meta-data, it will fail sanity check and we would return a failure
> > > > > back to device on any page-faults from device.
> > > > >
> > > > > It also complicates how device TLB and ATS work, and needs some major
> > > > > improvements to detect device capability to accept tagged pointers, adjust
> > > > > the devtlb to act accordingly.
> > > > >
> > > > >
> > > > > Both are orthogonal features, but there is an intersection of both
> > > > > that are fundamentally incompatible.
> > > > >
> > > > > Its even more important, since an application might be using SVA under the
> > > > > cover provided by some library that's used without their knowledge.
> > > > >
> > > > > The path would be:
> > > > >
> > > > > 1. Ensure both LAM and SVM are incompatible by design, without major
> > > > > changes.
> > > > > - If LAM is enabled already and later SVM enabling is requested by
> > > > > user, that should fail. and Vice versa.
> > > > > - Provide an API to user to ask for opt-out. Now they know they
> > > > > must sanitize the pointers before sending to device, or the
> > > > > working set is already isolated and needs no work.
> > > >
> > > > The patch below implements something like this. It is PoC, build-tested only.
> > > >
> > > > To be honest, I hate it. It is clearly a layering violation. It feels
> > > > dirty. But I don't see any better way as we tie orthogonal features
> > > > together.
> > > >
> > > > Also I have no idea how to make forced PASID allocation if LAM enabled.
> > > > What the API has to look like?
> > >
> > > Jacob, Ashok, any comment on this part?
> > >
> > > I expect in many cases LAM will be enabled very early (like before malloc
> > > is functinal) in process start and it makes PASID allocation always fail.
> > >
> > > Any way out?
> >
> > We need closure on this to proceed. Any clue?
>
> Failing PASID allocation seems like the right thing to do here. If the
> application is explicitly allocating PASID's it can opt-out using the
> similar mechanism you have for LAM enabling. So user takes
> responsibility for sanitizing pointers.
>
> If some library is using an accelerator without application knowledge,
> that would use the failure as a mechanism to use an alternate path if
> one exists.
>
> I don't know if both LAM and SVM need a separate forced opt-in (or i
> don't have an opinion rather). Is this what you were asking?
>
> + Joerg, JasonG in case they have an opinion.
My point is that the patch provides a way to override the LAM vs. PASID
mutual exclusion, but only if the PASID is allocated first. If we enabled
LAM before the PASID is allocated, there's no way to forcefully allocate a
PASID, bypassing the LAM check. I think there should be one, no?
--
Kiryl Shutsemau / Kirill A. Shutemov
On Wed, Sep 14, 2022 at 08:31:56AM -0700, Ashok Raj wrote:
> On Wed, Sep 14, 2022 at 06:18:18PM +0300, Kirill A. Shutemov wrote:
> > > > > >
> > > > > > The patch below implements something like this. It is PoC, build-tested only.
> > > > > >
> > > > > > To be honest, I hate it. It is clearly a layering violation. It feels
> > > > > > dirty. But I don't see any better way as we tie orthogonal features
> > > > > > together.
> > > > > >
> > > > > > Also I have no idea how to make forced PASID allocation if LAM enabled.
> > > > > > What the API has to look like?
> > > > >
> > > > > Jacob, Ashok, any comment on this part?
> > > > >
> > > > > I expect in many cases LAM will be enabled very early (like before malloc
> > > > > is functinal) in process start and it makes PASID allocation always fail.
> > > > >
> > > > > Any way out?
> > > >
> > > > We need closure on this to proceed. Any clue?
> > >
> > > Failing PASID allocation seems like the right thing to do here. If the
> > > application is explicitly allocating PASID's it can opt-out using the
> > > similar mechanism you have for LAM enabling. So user takes
> > > responsibility for sanitizing pointers.
> > >
> > > If some library is using an accelerator without application knowledge,
> > > that would use the failure as a mechanism to use an alternate path if
> > > one exists.
> > >
> > > I don't know if both LAM and SVM need a separate forced opt-in (or i
> > > don't have an opinion rather). Is this what you were asking?
> > >
> > > + Joerg, JasonG in case they have an opinion.
> >
> > My point is that the patch provides a way to override LAM vs. PASID mutual
> > exclusion, but only if PASID allocated first. If we enabled LAM before
> > PASID is allcoated there's no way to forcefully allocate PASID, bypassing
> > LAM check. I think there should be one, no?
>
> Yes, we should have one for force enabling SVM too if the application
> asks for forgiveness.
What is the right API here?
--
Kiryl Shutsemau / Kirill A. Shutemov
Hi Kirill,
On Wed, 14 Sep 2022 18:45:32 +0300, "Kirill A. Shutemov"
<[email protected]> wrote:
> On Wed, Sep 14, 2022 at 08:31:56AM -0700, Ashok Raj wrote:
> > On Wed, Sep 14, 2022 at 06:18:18PM +0300, Kirill A. Shutemov wrote:
> > > > > > >
> > > > > > > The patch below implements something like this. It is PoC,
> > > > > > > build-tested only.
> > > > > > >
> > > > > > > To be honest, I hate it. It is clearly a layering violation.
> > > > > > > It feels dirty. But I don't see any better way as we tie
> > > > > > > orthogonal features together.
> > > > > > >
> > > > > > > Also I have no idea how to make forced PASID allocation if
> > > > > > > LAM enabled. What the API has to look like?
> > > > > >
> > > > > > Jacob, Ashok, any comment on this part?
> > > > > >
> > > > > > I expect in many cases LAM will be enabled very early (like
> > > > > > before malloc is functinal) in process start and it makes PASID
> > > > > > allocation always fail.
> > > > > >
> > > > > > Any way out?
> > > > >
> > > > > We need closure on this to proceed. Any clue?
> > > >
> > > > Failing PASID allocation seems like the right thing to do here. If
> > > > the application is explicitly allocating PASID's it can opt-out
> > > > using the similar mechanism you have for LAM enabling. So user takes
> > > > responsibility for sanitizing pointers.
> > > >
> > > > If some library is using an accelerator without application
> > > > knowledge, that would use the failure as a mechanism to use an
> > > > alternate path if one exists.
> > > >
> > > > I don't know if both LAM and SVM need a separate forced opt-in (or i
> > > > don't have an opinion rather). Is this what you were asking?
> > > >
> > > > + Joerg, JasonG in case they have an opinion.
> > >
> > > My point is that the patch provides a way to override LAM vs. PASID
> > > mutual exclusion, but only if PASID allocated first. If we enabled
> > > LAM before PASID is allcoated there's no way to forcefully allocate
> > > PASID, bypassing LAM check. I think there should be one, no?
> >
> > Yes, we should have one for force enabling SVM too if the application
> > asks for forgiveness.
>
> What is the right API here?
>
It seems very difficult to implement a UAPI for applications to
override this at runtime. Currently, SVM bind is under the control of
individual drivers; it could happen at open() time or via some ioctl.
Perhaps this could be a platform policy via some command-line option, e.g.
intel_iommu=sva_lam_coexist.
Thanks,
Jacob
On Wed, Sep 14, 2022 at 04:51:16PM -0700, Jacob Pan wrote:
> Hi Kirill,
>
> On Wed, 14 Sep 2022 18:45:32 +0300, "Kirill A. Shutemov"
> <[email protected]> wrote:
>
> > On Wed, Sep 14, 2022 at 08:31:56AM -0700, Ashok Raj wrote:
> > > On Wed, Sep 14, 2022 at 06:18:18PM +0300, Kirill A. Shutemov wrote:
> > > > > > > >
> > > > > > > > The patch below implements something like this. It is PoC,
> > > > > > > > build-tested only.
> > > > > > > >
> > > > > > > > To be honest, I hate it. It is clearly a layering violation.
> > > > > > > > It feels dirty. But I don't see any better way as we tie
> > > > > > > > orthogonal features together.
> > > > > > > >
> > > > > > > > Also I have no idea how to make forced PASID allocation if
> > > > > > > > LAM enabled. What the API has to look like?
> > > > > > >
> > > > > > > Jacob, Ashok, any comment on this part?
> > > > > > >
> > > > > > > I expect in many cases LAM will be enabled very early (like
> > > > > > > before malloc is functinal) in process start and it makes PASID
> > > > > > > allocation always fail.
> > > > > > >
> > > > > > > Any way out?
> > > > > >
> > > > > > We need closure on this to proceed. Any clue?
> > > > >
> > > > > Failing PASID allocation seems like the right thing to do here. If
> > > > > the application is explicitly allocating PASID's it can opt-out
> > > > > using the similar mechanism you have for LAM enabling. So user takes
> > > > > responsibility for sanitizing pointers.
> > > > >
> > > > > If some library is using an accelerator without application
> > > > > knowledge, that would use the failure as a mechanism to use an
> > > > > alternate path if one exists.
> > > > >
> > > > > I don't know if both LAM and SVM need a separate forced opt-in (or i
> > > > > don't have an opinion rather). Is this what you were asking?
> > > > >
> > > > > + Joerg, JasonG in case they have an opinion.
> > > >
> > > > My point is that the patch provides a way to override LAM vs. PASID
> > > > mutual exclusion, but only if PASID allocated first. If we enabled
> > > > LAM before PASID is allcoated there's no way to forcefully allocate
> > > > PASID, bypassing LAM check. I think there should be one, no?
> > >
> > > Yes, we should have one for force enabling SVM too if the application
> > > asks for forgiveness.
> >
> > What is the right API here?
> >
> It seems very difficult to implement a UAPI for the applications to
> override at a runtime. Currently, SVM bind is under the control of
> individual drivers. It could be at the time of open or some ioctl.
>
> Perhaps, this can be a platform policy via some commandline option. e.g.
> intel_iommu=sva_lam_coexist.
I think it has to be per-process, not a system-wide setting.
Maybe a separate arch_prctl() to allow LAM/SVM coexistence?
It would cover both sides of the API, relaxing the check for both.
--
Kiryl Shutsemau / Kirill A. Shutemov
On Thu, Sep 15, 2022 at 12:01:35PM +0300, Kirill A. Shutemov wrote:
> On Wed, Sep 14, 2022 at 04:51:16PM -0700, Jacob Pan wrote:
> > Hi Kirill,
> >
> > On Wed, 14 Sep 2022 18:45:32 +0300, "Kirill A. Shutemov"
> > <[email protected]> wrote:
> >
> > > On Wed, Sep 14, 2022 at 08:31:56AM -0700, Ashok Raj wrote:
> > > > On Wed, Sep 14, 2022 at 06:18:18PM +0300, Kirill A. Shutemov wrote:
> > > > > > > > >
> > > > > > > > > The patch below implements something like this. It is PoC,
> > > > > > > > > build-tested only.
> > > > > > > > >
> > > > > > > > > To be honest, I hate it. It is clearly a layering violation.
> > > > > > > > > It feels dirty. But I don't see any better way as we tie
> > > > > > > > > orthogonal features together.
> > > > > > > > >
> > > > > > > > > Also I have no idea how to make forced PASID allocation if
> > > > > > > > > LAM enabled. What the API has to look like?
> > > > > > > >
> > > > > > > > Jacob, Ashok, any comment on this part?
> > > > > > > >
> > > > > > > > I expect in many cases LAM will be enabled very early (like
> > > > > > > > before malloc is functinal) in process start and it makes PASID
> > > > > > > > allocation always fail.
> > > > > > > >
> > > > > > > > Any way out?
> > > > > > >
> > > > > > > We need closure on this to proceed. Any clue?
> > > > > >
> > > > > > Failing PASID allocation seems like the right thing to do here. If
> > > > > > the application is explicitly allocating PASID's it can opt-out
> > > > > > using the similar mechanism you have for LAM enabling. So user takes
> > > > > > responsibility for sanitizing pointers.
> > > > > >
> > > > > > If some library is using an accelerator without application
> > > > > > knowledge, that would use the failure as a mechanism to use an
> > > > > > alternate path if one exists.
> > > > > >
> > > > > > I don't know if both LAM and SVM need a separate forced opt-in (or i
> > > > > > don't have an opinion rather). Is this what you were asking?
> > > > > >
> > > > > > + Joerg, JasonG in case they have an opinion.
> > > > >
> > > > > My point is that the patch provides a way to override LAM vs. PASID
> > > > > mutual exclusion, but only if PASID allocated first. If we enabled
> > > > > LAM before PASID is allcoated there's no way to forcefully allocate
> > > > > PASID, bypassing LAM check. I think there should be one, no?
> > > >
> > > > Yes, we should have one for force enabling SVM too if the application
> > > > asks for forgiveness.
> > >
> > > What is the right API here?
> > >
> > It seems very difficult to implement a UAPI for the applications to
> > override at a runtime. Currently, SVM bind is under the control of
> > individual drivers. It could be at the time of open or some ioctl.
> >
> > Perhaps, this can be a platform policy via some commandline option. e.g.
> > intel_iommu=sva_lam_coexist.
>
> I think it has to be per-process, not a system-wide handle.
>
> Maybe a separate arch_prctl() to allow to enable LAM/SVM coexisting?
> It would cover both sides of the API, relaxing check for both.
Maybe something like the patch below. Build-tested only.
I really struggle with naming here. Any suggestions on what XXX should be
replaced with? I don't think it has to be limited to LAM, as some other
tagging implementation may come later.
diff --git a/arch/x86/include/asm/mmu.h b/arch/x86/include/asm/mmu.h
index 2fdb390040b5..0a38b52b7b5e 100644
--- a/arch/x86/include/asm/mmu.h
+++ b/arch/x86/include/asm/mmu.h
@@ -12,6 +12,8 @@
#define MM_CONTEXT_UPROBE_IA32 BIT(0)
/* vsyscall page is accessible on this MM */
#define MM_CONTEXT_HAS_VSYSCALL BIT(1)
+/* Allow LAM and SVM coexisting */
+#define MM_CONTEXT_XXX BIT(2)
/*
* x86 has arch-specific MMU state beyond what lives in mm_struct.
diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
index 3736f41948e9..d4a0994e5bc7 100644
--- a/arch/x86/include/asm/mmu_context.h
+++ b/arch/x86/include/asm/mmu_context.h
@@ -113,6 +113,8 @@ static inline void mm_reset_untag_mask(struct mm_struct *mm)
mm->context.untag_mask = -1UL;
}
+#define arch_can_alloc_pasid(mm) \
+ (!mm_lam_cr3_mask(mm) || (mm->context.flags & MM_CONTEXT_XXX))
#else
static inline unsigned long mm_lam_cr3_mask(struct mm_struct *mm)
diff --git a/arch/x86/include/uapi/asm/prctl.h b/arch/x86/include/uapi/asm/prctl.h
index a31e27b95b19..3b77d51c7e6c 100644
--- a/arch/x86/include/uapi/asm/prctl.h
+++ b/arch/x86/include/uapi/asm/prctl.h
@@ -23,5 +23,6 @@
#define ARCH_GET_UNTAG_MASK 0x4001
#define ARCH_ENABLE_TAGGED_ADDR 0x4002
#define ARCH_GET_MAX_TAG_BITS 0x4003
+#define ARCH_XXX 0x4004
#endif /* _ASM_X86_PRCTL_H */
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index 9aa85e74e59e..111843c9dd40 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -793,6 +793,11 @@ static int prctl_enable_tagged_addr(struct mm_struct *mm, unsigned long nr_bits)
goto out;
}
+ if (pasid_valid(mm->pasid) && !(mm->context.flags & MM_CONTEXT_XXX)) {
+ ret = -EBUSY;
+ goto out;
+ }
+
if (!nr_bits) {
ret = -EINVAL;
goto out;
@@ -911,6 +916,12 @@ long do_arch_prctl_64(struct task_struct *task, int option, unsigned long arg2)
(unsigned long __user *)arg2);
case ARCH_ENABLE_TAGGED_ADDR:
return prctl_enable_tagged_addr(task->mm, arg2);
+ case ARCH_XXX:
+ if (mmap_write_lock_killable(task->mm))
+ return -EINTR;
+ task->mm->context.flags |= MM_CONTEXT_XXX;
+ mmap_write_unlock(task->mm);
+ return 0;
case ARCH_GET_MAX_TAG_BITS: {
int nr_bits;
diff --git a/drivers/iommu/iommu-sva-lib.c b/drivers/iommu/iommu-sva-lib.c
index 106506143896..ed76cdfa3e6b 100644
--- a/drivers/iommu/iommu-sva-lib.c
+++ b/drivers/iommu/iommu-sva-lib.c
@@ -2,6 +2,8 @@
/*
* Helpers for IOMMU drivers implementing SVA
*/
+#include <linux/mm.h>
+#include <linux/mmu_context.h>
#include <linux/mutex.h>
#include <linux/sched/mm.h>
@@ -31,7 +33,17 @@ int iommu_sva_alloc_pasid(struct mm_struct *mm, ioasid_t min, ioasid_t max)
min == 0 || max < min)
return -EINVAL;
+ /* Serialize against address tagging enabling */
+ if (mmap_write_lock_killable(mm))
+ return -EINTR;
+
+ if (!arch_can_alloc_pasid(mm)) {
+ mmap_write_unlock(mm);
+ return -EBUSY;
+ }
+
mutex_lock(&iommu_sva_lock);
+
/* Is a PASID already associated with this mm? */
if (pasid_valid(mm->pasid)) {
if (mm->pasid < min || mm->pasid >= max)
@@ -46,6 +58,7 @@ int iommu_sva_alloc_pasid(struct mm_struct *mm, ioasid_t min, ioasid_t max)
mm_pasid_set(mm, pasid);
out:
mutex_unlock(&iommu_sva_lock);
+ mmap_write_unlock(mm);
return ret;
}
EXPORT_SYMBOL_GPL(iommu_sva_alloc_pasid);
diff --git a/include/linux/mmu_context.h b/include/linux/mmu_context.h
index b9b970f7ab45..1649b080d844 100644
--- a/include/linux/mmu_context.h
+++ b/include/linux/mmu_context.h
@@ -28,4 +28,8 @@ static inline void leave_mm(int cpu) { }
# define task_cpu_possible(cpu, p) cpumask_test_cpu((cpu), task_cpu_possible_mask(p))
#endif
+#ifndef arch_can_alloc_pasid
+#define arch_can_alloc_pasid(mm) true
+#endif
+
#endif
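For completeness, here is a sketch (not part of the patch) of how an
application that sanitizes its own pointers could use this interface,
assuming the PoC above is applied and ARCH_XXX keeps its placeholder name
and value:

#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

#define ARCH_ENABLE_TAGGED_ADDR	0x4002
#define ARCH_XXX		0x4004	/* placeholder name from the PoC */

int main(void)
{
	/* Opt in to LAM/SVA coexistence first, then enable LAM_U57
	 * (6 tag bits). Without the opt-in, whichever feature comes
	 * second would be refused with -EBUSY. */
	if (syscall(SYS_arch_prctl, ARCH_XXX, 0) < 0)
		perror("ARCH_XXX");
	if (syscall(SYS_arch_prctl, ARCH_ENABLE_TAGGED_ADDR, 6) < 0)
		perror("ARCH_ENABLE_TAGGED_ADDR");
	return 0;
}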
--
Kiryl Shutsemau / Kirill A. Shutemov
On Thu, Sep 15, 2022 at 08:28:58PM +0300, Kirill A. Shutemov wrote:
> @@ -31,7 +33,17 @@ int iommu_sva_alloc_pasid(struct mm_struct *mm, ioasid_t min, ioasid_t max)
> min == 0 || max < min)
> return -EINVAL;
>
> + /* Serialize against address tagging enabling */
> + if (mmap_write_lock_killable(mm))
> + return -EINTR;
> +
> + if (!arch_can_alloc_pasid(mm)) {
> + mmap_write_unlock(mm);
> + return -EBUSY;
> + }
This has nothing to do with "allocating a PASID".
Rather it should be: "is the CPU page table compatible with the IOMMU HW
page table walker?"
For this I would rather have a function that queries the format of the
page table under the mm_struct, with enum values like
INTEL_NORMAL and INTEL_LAM as possible values.
The IOMMU driver will block incompatible page table formats, and when
it starts up it should assert something that blocks changing the
format.
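As a rough sketch of the kind of query being suggested (all names below
are made up for illustration; nothing like this exists in the tree):

#include <linux/types.h>
#include <linux/mm_types.h>

/* Sketch only: identify the ruleset governing an mm's page tables. */
enum mm_pgtable_fmt {
	MM_PGTABLE_X86_4LEVEL,
	MM_PGTABLE_X86_5LEVEL,
	MM_PGTABLE_X86_4LEVEL_LAM,
	MM_PGTABLE_X86_5LEVEL_LAM,
};

/* Provided by arch code. */
enum mm_pgtable_fmt mm_pgtable_format(struct mm_struct *mm);

/* An IOMMU driver would compare the format against what its HW walker
 * understands before sharing the mm, for example: */
static bool iommu_hw_can_walk(struct mm_struct *mm)
{
	switch (mm_pgtable_format(mm)) {
	case MM_PGTABLE_X86_4LEVEL:
	case MM_PGTABLE_X86_5LEVEL:
		return true;	/* plain formats are fine */
	default:
		return false;	/* e.g. LAM variants, unsupported */
	}
}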
Jason
On Tue, Sep 20, 2022 at 10:14:54AM -0300, Jason Gunthorpe wrote:
> On Thu, Sep 15, 2022 at 08:28:58PM +0300, Kirill A. Shutemov wrote:
>
> > @@ -31,7 +33,17 @@ int iommu_sva_alloc_pasid(struct mm_struct *mm, ioasid_t min, ioasid_t max)
> > min == 0 || max < min)
> > return -EINVAL;
> >
> > + /* Serialize against address tagging enabling */
> > + if (mmap_write_lock_killable(mm))
> > + return -EINTR;
> > +
> > + if (!arch_can_alloc_pasid(mm)) {
> > + mmap_write_unlock(mm);
> > + return -EBUSY;
> > + }
>
> This has nothing to do with "allocating pasid"
>
> Rather should be: "is the CPU page table compatible with the IOMMU HW
> page table walker"
Thanks Jason, this certainly looks like a more logical way to look at it
than the functional association with allocating PASIDs.
>
> For this I would rather have a function that queries the format of the
> page table under the mm_struct and we have enum values like
> INTEL_NORMAL and INTEL_LAM as possible values.
>
> The iommu driver will block incompatible page table formats, and when
> it starts up it should assert something that blocks changing the
> format.
>
> Jason
>
On 9/20/22 06:14, Jason Gunthorpe wrote:
> For this I would rather have a function that queries the format of the
> page table under the mm_struct and we have enum values like
> INTEL_NORMAL and INTEL_LAM as possible values.
>
> The iommu driver will block incompatible page table formats, and when
> it starts up it should assert something that blocks changing the
> format.
That doesn't sound too bad. Except, please don't call it a "page table
format". The format of the page tables does not change with LAM. It's
entirely how the CPU interprets addresses that changes.
I guess someone could make an argument that, with LAM, there is a "hole"
that can't be filled in and _that_ constitutes a format change, but
that'd be a stretch.
The thing that matters when LAM is on is that some CPU might be stashing
addresses somewhere that only have meaning when you interpret them with
LAM rules. That's really a property of the mm, not of the page tables.
Oh, and please don't call things "INTEL_WHATEVER". It looks silly and
confuses the heck out of people when a second CPU vendor needs to use
the same code.
On Tue, Sep 20, 2022 at 09:06:32AM -0700, Dave Hansen wrote:
> On 9/20/22 06:14, Jason Gunthorpe wrote:
> > For this I would rather have a function that queries the format of the
> > page table under the mm_struct and we have enum values like
> > INTEL_NORMAL and INTEL_LAM as possible values.
> >
> > The iommu driver will block incompatible page table formats, and when
> > it starts up it should assert something that blocks changing the
> > format.
>
> That doesn't sound too bad. Except, please don't call it a "page table
> format". The format of the page tables does not change with LAM. It's
> entirely how the CPU interprets addresses that changes.
Sure it does. The rules for how the page table is walked change. The
actual bits stored in memory might not be different, but that doesn't
mean the format didn't change. If it didn't change we wouldn't have an
incompatibility with the IOMMU HW walker.
> Oh, and please don't call things "INTEL_WHATEVER". It looks silly and
> confuses the heck out of people when a second CPU vendor needs to use
> the same code.
I don't mean something where a second CPU vendor would re-use the
code; I mean literally specifying the exact format of the page tables,
where each CPU and configuration gets its own value, e.g. "Intel
x86-64, 5-level, no LAM".
At the end of the day the IOMMU drivers need to know this information
so they can program the HW to walk the same tables.
Today it is all implicit: if you have an IOMMU driver then it can
make assumptions about the layout, but if we add an API we may as well
be a bit more explicit at the same time.
Jason
On Tue, Sep 20, 2022 at 11:41:04AM -0700, Jacob Pan wrote:
> Hi Jason,
>
> On Tue, 20 Sep 2022 13:27:27 -0300, Jason Gunthorpe <[email protected]> wrote:
>
> > On Tue, Sep 20, 2022 at 09:06:32AM -0700, Dave Hansen wrote:
> > > On 9/20/22 06:14, Jason Gunthorpe wrote:
> > > > For this I would rather have a function that queries the format of the
> > > > page table under the mm_struct and we have enum values like
> > > > INTEL_NORMAL and INTEL_LAM as possible values.
> > > >
> > > > The iommu driver will block incompatible page table formats, and when
> > > > it starts up it should assert something that blocks changing the
> > > > format.
> > >
> > > That doesn't sound too bad. Except, please don't call it a "page table
> > > format". The format of the page tables does not change with LAM. It's
> > > entirely how the CPU interprets addresses that changes.
> >
> > Sure it does. The rules for how the page table is walked change. The
> > actual bits stored in memory might not be different, but that doesn't
> > mean the format didn't change. If it didn't change we wouldn't have an
> > incompatibility with the IOMMU HW walker.
>
> There are many CPU-IOMMU compatibility checks before we do for SVA,e.g. we
> check paging mode in sva_bind. We are delegating these checks in
> arch/platform code. So why can't we let arch code decide how to convey
> mm-IOMMU SVA compatibility? let it be a flag ( as in this patch) or some
> callback.
In general I'm not so keen on arch unique code for general ideas like
this (ARM probably has the same issue), but sure it could work.
> Perhaps a more descriptive name
> s/arch_can_alloc_pasid(mm)/arch_can_support_sva(mm)/ is all we disagreeing
> :)
Except that still isn't what it is doing. "sva" can mean lots of
things. You need to assert that the page table format is one of the
formats that the iommu understands and configure the iommu to match
it. It is a very simple question about what ruleset and memory layout
govern the page table memory used by the CPU.
And I think every CPU should be able to define a couple of its
configurations in some enum; most of the PTE handling code is
hardwired, so I don't think we really support that many combinations
anyhow?
Jason
Hi Jason,
On Tue, 20 Sep 2022 13:27:27 -0300, Jason Gunthorpe <[email protected]> wrote:
> On Tue, Sep 20, 2022 at 09:06:32AM -0700, Dave Hansen wrote:
> > On 9/20/22 06:14, Jason Gunthorpe wrote:
> > > For this I would rather have a function that queries the format of the
> > > page table under the mm_struct and we have enum values like
> > > INTEL_NORMAL and INTEL_LAM as possible values.
> > >
> > > The iommu driver will block incompatible page table formats, and when
> > > it starts up it should assert something that blocks changing the
> > > format.
> >
> > That doesn't sound too bad. Except, please don't call it a "page table
> > format". The format of the page tables does not change with LAM. It's
> > entirely how the CPU interprets addresses that changes.
>
> Sure it does. The rules for how the page table is walked change. The
> actual bits stored in memory might not be different, but that doesn't
> mean the format didn't change. If it didn't change we wouldn't have an
> incompatibility with the IOMMU HW walker.
There are many CPU-IOMMU compatibility checks we already do before SVA,
e.g. we check the paging mode in sva_bind. We are delegating these checks
to arch/platform code. So why can't we let the arch code decide how to
convey mm-IOMMU SVA compatibility? Let it be a flag (as in this patch) or
some callback.
Perhaps a more descriptive name,
s/arch_can_alloc_pasid(mm)/arch_can_support_sva(mm)/, is all we are
disagreeing about :)
Thanks,
Jacob
Hi Jason,
On Tue, 20 Sep 2022 15:50:33 -0300, Jason Gunthorpe <[email protected]> wrote:
> On Tue, Sep 20, 2022 at 11:41:04AM -0700, Jacob Pan wrote:
> > Hi Jason,
> >
> > On Tue, 20 Sep 2022 13:27:27 -0300, Jason Gunthorpe <[email protected]>
> > wrote:
> > > On Tue, Sep 20, 2022 at 09:06:32AM -0700, Dave Hansen wrote:
> > > > On 9/20/22 06:14, Jason Gunthorpe wrote:
> > > > > For this I would rather have a function that queries the format
> > > > > of the page table under the mm_struct and we have enum values like
> > > > > INTEL_NORMAL and INTEL_LAM as possible values.
> > > > >
> > > > > The iommu driver will block incompatible page table formats, and
> > > > > when it starts up it should assert something that blocks changing
> > > > > the format.
> > > >
> > > > That doesn't sound too bad. Except, please don't call it a "page
> > > > table format". The format of the page tables does not change with
> > > > LAM. It's entirely how the CPU interprets addresses that changes.
> > > >
> > >
> > > Sure it does. The rules for how the page table is walked change. The
> > > actual bits stored in memory might not be different, but that doesn't
> > > mean the format didn't change. If it didn't change we wouldn't have an
> > > incompatibility with the IOMMU HW walker.
> >
> > There are many CPU-IOMMU compatibility checks before we do for SVA,e.g.
> > we check paging mode in sva_bind. We are delegating these checks in
> > arch/platform code. So why can't we let arch code decide how to convey
> > mm-IOMMU SVA compatibility? let it be a flag ( as in this patch) or some
> > callback.
>
> In general I'm not so keen on arch unique code for general ideas like
> this (ARM probably has the same issue), but sure it could work.
>
Creating an abstraction seems to belong to a separate endeavor, once we
have more than one user. For now, are you OK with the current approach?
> > Perhaps a more descriptive name
> > s/arch_can_alloc_pasid(mm)/arch_can_support_sva(mm)/ is all we
> > disagreeing :)
>
> Except that still isn't what it is doing. "sva" can mean lots of
> things.
True, but sharing the page table is the root cause of the incompatibility.
IMHO, SVA means sharing the page table at the highest level.
> You need to assert that the page table format is one of the
> formats that the iommu understands and configure the iommu to match
> it. It is a very simple question about what ruleset and memory layout
> govern the page table memory used by the CPU.
The problem is more about things like the canonical-address requirement
than the page table format.
> And I think every CPU should be able to define a couple of their
> configurations in some enum, most of the PTE handling code is all
> hardwired, so I don't think we really support that many combinations
> anyhow?
Sounds like a nice but separate effort; I don't know enough here.
Thanks,
Jacob
On Tue, Sep 20, 2022 at 01:44:30PM -0700, Jacob Pan wrote:
> > In general I'm not so keen on arch unique code for general ideas like
> > this (ARM probably has the same issue), but sure it could work.
> >
> Creating an abstraction seems to belong to a separate endeavor when we
> have more than one user. For now, are you ok with the current approach?
Sure, just don't give it a silly name like pasid or sva
> > You need to assert that the page table format is one of the
> > formats that the iommu understands and configure the iommu to match
> > it. It is a very simple question about what ruleset and memory layout
> > govern the page table memory used by the CPU.
> the problem is more relevant to things like canonical address
> requirement than page table format.
The translation of a VA to a PA is entirely within the realm of the
page table format, IMHO. If the rules change, then the format clearly
differs, even if your arch documents talk about "canonical addresses" or
something to define the different parsing behaviors.
Jason
> From: Kirill A. Shutemov <[email protected]>
> Sent: Friday, September 16, 2022 1:29 AM
>
> I really struggle with naming here. Any suggestions on what XXX has to be
> replaced with? I don't think it has to be limited to LAM as some other
> tagging implementation may come later.
What about ARCH_ENABLE_TAGGED_ADDR_CPU, to mark that the application
tags addresses only on the CPU and takes care to untag them before an
address is programmed into a device?
With ARCH_ENABLE_TAGGED_ADDR_CPU, LAM and SVA can then coexist.
The original ARCH_ENABLE_TAGGED_ADDR means that tagged addresses are
used on both CPU and device. Enabling SVA on a device behind an IOMMU
which doesn't support LAM is then rejected if LAM has been enabled, and
vice versa.
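To make the distinction concrete, a sketch of how the enabling path might
tell the two flavors apart (the ARCH_ENABLE_TAGGED_ADDR_CPU number and the
three-argument prctl_enable_tagged_addr() are borrowed from the PoC
patches earlier in this thread and are hypothetical):

#include <linux/errno.h>
#include <linux/mm_types.h>

#define ARCH_ENABLE_TAGGED_ADDR		0x4002
#define ARCH_ENABLE_TAGGED_ADDR_CPU	0x4004	/* hypothetical */

static int do_enable_tagged_addr(struct mm_struct *mm, int option,
				 unsigned long nr_bits)
{
	switch (option) {
	case ARCH_ENABLE_TAGGED_ADDR:
		/* Tags may reach devices: conflicts with SVA. */
		return prctl_enable_tagged_addr(mm, nr_bits, false);
	case ARCH_ENABLE_TAGGED_ADDR_CPU:
		/* The application promises to untag addresses before
		 * programming them into a device, so LAM and SVA may
		 * coexist. */
		return prctl_enable_tagged_addr(mm, nr_bits, true);
	default:
		return -EINVAL;
	}
}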
Thanks
Kevin
On 9/15/22 10:28, Kirill A. Shutemov wrote:
> +	/* Serialize against address tagging enabling */
> + if (mmap_write_lock_killable(mm))
> + return -EINTR;
> +
> + if (!arch_can_alloc_pasid(mm)) {
> + mmap_write_unlock(mm);
> + return -EBUSY;
> + }
Shouldn't this actually be some kind of *device* check?
The question here is whether the gunk in the mm's address space is
compatible with the device.
* Can the device walk the page tables in use under the mm?
* Does the device interpret addresses the same way as the CPUs
using the mm?
The page table format is, right now, wholly determined at boot at the
latest. But, it probably wouldn't hurt to pretend like it _might_
change at runtime.
The address interpretation part is, of course, what LAM changes. It's
also arguable that it includes features like protection keys. I can
totally see a case where folks might want to be careful and disallow
device access to an mm where pkeys are in use.
On Wed, Sep 21, 2022 at 09:57:47AM -0700, Dave Hansen wrote:
> On 9/15/22 10:28, Kirill A. Shutemov wrote:
> > +	/* Serialize against address tagging enabling */
> > + if (mmap_write_lock_killable(mm))
> > + return -EINTR;
> > +
> > + if (!arch_can_alloc_pasid(mm)) {
> > + mmap_write_unlock(mm);
> > + return -EBUSY;
> > + }
>
> Shouldn't this actually be some kind of *device* check?
The device will enable SVM only when it is capable of it, and it performs
all the normal capability checks like PASID, ATS, etc. before enabling it.
This is the final step before the mm is hooked up to the IOMMU.
>
> The question here is whether the gunk in the mm's address space is
> compatible with the device.
>
> * Can the device walk the page tables in use under the mm?
> * Does the device interpret addresses the same way as the CPUs
> using the mm?
>
> The page table format is, right now, wholly determined at boot at the
> latest. But, it probably wouldn't hurt to pretend like it _might_
> change at runtime.
>
> The address interpretation part is, of course, what LAM changes. It's
> also arguable that it includes features like protection keys. I can
> totally see a case where folks might want to be careful and disallow
> device access to an mm where pkeys are in use.
On 9/21/22 10:08, Ashok Raj wrote:
> On Wed, Sep 21, 2022 at 09:57:47AM -0700, Dave Hansen wrote:
>> On 9/15/22 10:28, Kirill A. Shutemov wrote:
>>> +	/* Serialize against address tagging enabling */
>>> + if (mmap_write_lock_killable(mm))
>>> + return -EINTR;
>>> +
>>> + if (!arch_can_alloc_pasid(mm)) {
>>> + mmap_write_unlock(mm);
>>> + return -EBUSY;
>>> + }
>> Shouldn't this actually be some kind of *device* check?
> The device will enable svm only when its capable of it, and performs all
> the normal capability checks like PASID, ATS etc before enabling it.
> This is the final step before the mm is hooked up with the IOMMU.
What does that mean, though?
Are you saying that any device compatibility with an mm is solely
determined by the IOMMU in play, so the IOMMU code should host the mm
compatibility checks?
On Wed, Sep 21, 2022 at 10:11:46AM -0700, Dave Hansen wrote:
> On 9/21/22 10:08, Ashok Raj wrote:
> > On Wed, Sep 21, 2022 at 09:57:47AM -0700, Dave Hansen wrote:
> >> On 9/15/22 10:28, Kirill A. Shutemov wrote:
> >>> +	/* Serialize against address tagging enabling */
> >>> + if (mmap_write_lock_killable(mm))
> >>> + return -EINTR;
> >>> +
> >>> + if (!arch_can_alloc_pasid(mm)) {
> >>> + mmap_write_unlock(mm);
> >>> + return -EBUSY;
> >>> + }
> >> Shouldn't this actually be some kind of *device* check?
> > The device will enable svm only when its capable of it, and performs all
> > the normal capability checks like PASID, ATS etc before enabling it.
> > This is the final step before the mm is hooked up with the IOMMU.
>
> What does that mean, though?
>
> Are you saying that any device compatibility with an mm is solely
> determined by the IOMMU in play, so the IOMMU code should host the mm
> compatibility checks?
>
To check if a device supports SVM-like capabilities, it needs to support
the following PCIe capabilities:
- PASID
- Page Request Interface (PRI) for dynamic page-faulting
- ATS - for quick VA->PA lookups.
The device works purely with memory addresses and caches them in
its device TLB after a lookup via ATS.
When the device does ATS, it sends a translation request, and the IOMMU
walks the page tables to return the PA. The device can use that
translation until it gets an invalidation.
So the device doesn't need to know page-table formats, but if you use
tagged pointers, device support is something you want to check for. I
don't think there are any plans right now to support something like the
following:
- Check the device's ability to work with tagged pointers.
- The OS should configure the width of bits to ignore, etc.
- Device TLBs should properly handle the tagged portion without creating
aliasing, etc.
In order for LAM and SVM to play nicely you need
#1 IOMMU support for tagged pointers
#2 Device ability to handle tagged pointers.
#2 above is an additional check to perform in addition to the PASID, PRI
and ATS checks we do today.
Cheers,
Ashok
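The three capabilities above map onto existing kernel helpers. A hedged
sketch of how an IOMMU driver might sequence the enables (error handling
trimmed, parameters chosen for illustration; not taken from a specific
driver):

#include <linux/pci.h>
#include <linux/pci-ats.h>

static int enable_svm_caps(struct pci_dev *pdev)
{
	int ret;

	ret = pci_enable_pasid(pdev, 0);	/* process address space IDs */
	if (ret)
		return ret;

	ret = pci_enable_pri(pdev, 32);		/* page request interface, 32 outstanding requests */
	if (ret)
		goto out_pasid;

	ret = pci_enable_ats(pdev, PAGE_SHIFT);	/* address translation services (device TLB) */
	if (ret)
		goto out_pri;

	return 0;

out_pri:
	pci_disable_pri(pdev);
out_pasid:
	pci_disable_pasid(pdev);
	return ret;
}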
On Wed, Sep 21, 2022 at 10:11:46AM -0700, Dave Hansen wrote:
> Are you saying that any device compatibility with an mm is solely
> determined by the IOMMU in play, so the IOMMU code should host the mm
> compatibility checks?
Yes, exactly. Only the HW entity that walks the page tables needs to
understand their parsing rules and in this case that is only the IOMMU
block.
Jason
On Wed, Sep 21, 2022 at 03:11:34PM -0300, Jason Gunthorpe wrote:
> On Wed, Sep 21, 2022 at 10:11:46AM -0700, Dave Hansen wrote:
>
> > Are you saying that any device compatibility with an mm is solely
> > determined by the IOMMU in play, so the IOMMU code should host the mm
> > compatibility checks?
>
> Yes, exactly. Only the HW entity that walks the page tables needs to
> understand their parsing rules and in this case that is only the IOMMU
> block.
But the device has to know which bits of the virtual address are
significant to handle device TLB lookups/stores correctly, no?
--
Kiryl Shutsemau / Kirill A. Shutemov
On Fri, Sep 23, 2022 at 03:42:39AM +0300, Kirill A. Shutemov wrote:
> On Wed, Sep 21, 2022 at 03:11:34PM -0300, Jason Gunthorpe wrote:
> > On Wed, Sep 21, 2022 at 10:11:46AM -0700, Dave Hansen wrote:
> >
> > > Are you saying that any device compatibility with an mm is solely
> > > determined by the IOMMU in play, so the IOMMU code should host the mm
> > > compatibility checks?
> >
> > Yes, exactly. Only the HW entity that walks the page tables needs to
> > understand their parsing rules and in this case that is only the IOMMU
> > block.
>
> But device has to know what bits of the virtual address are significant to
> handle device TLB lookup/store correctly, no?
For a device that also cares about tagging, yes. But in the current
world we don't have such devices. The IOMMU only knows about the shared
CR3 we placed in the PASID table entries to walk the page tables. I hope
the page tables don't factor in the metadata bits, correct? So I would
assume an untagged pointer should be just fine for the IOMMU to walk.
The IOMMU currently wants canonical addresses for VAs.
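To make "untagged" concrete: with LAM_U57 only bits 62:57 of a user
pointer carry metadata, so clearing them recovers the canonical address
that the page-table walk (CPU or IOMMU) actually resolves. A rough
sketch that mirrors the idea behind the series' untagged_addr(), not its
exact code:

#include <linux/bits.h>

/* LAM_U57: bits 62:57 are metadata; bit 63 stays 0 for user addresses,
 * so masking the tag bits yields a canonical user VA. */
static inline unsigned long untag_lam_u57(unsigned long addr)
{
	return addr & ~GENMASK_ULL(62, 57);
}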
On Thu, Sep 22, 2022 at 10:27:45PM -0700, Ashok Raj wrote:
> On Fri, Sep 23, 2022 at 03:42:39AM +0300, Kirill A. Shutemov wrote:
> > On Wed, Sep 21, 2022 at 03:11:34PM -0300, Jason Gunthorpe wrote:
> > > On Wed, Sep 21, 2022 at 10:11:46AM -0700, Dave Hansen wrote:
> > >
> > > > Are you saying that any device compatibility with an mm is solely
> > > > determined by the IOMMU in play, so the IOMMU code should host the mm
> > > > compatibility checks?
> > >
> > > Yes, exactly. Only the HW entity that walks the page tables needs to
> > > understand their parsing rules and in this case that is only the IOMMU
> > > block.
> >
> > But device has to know what bits of the virtual address are significant to
> > handle device TLB lookup/store correctly, no?
>
> For a device that also cares about tagging, yes. But in the current
> world we don't have such devices. IOMMU only knows about the shared cr3
> we placed in the PASID table entries to walk page-tables. I hope the
> page-tables don't factor the meta-data bits correct?
Correct. Page tables contain only physical addresses, so they have no
exposure to tags in the virtual addresses.
> So I would assume an untagged pointer should just be fine for the IOMMU
> to walk. IOMMU currently wants canonical addresses for VA.
Right. But it means that LAM compatibility can be blocked on two layers:
the IOMMU and the device. The IOMMU is not the only HW entity that has
to be aware of tagged pointers.
So where do we put a compatibility check that can cover both the device
and the IOMMU, considering that in the future both can become aware of
tagged pointers?
--
Kiryl Shutsemau / Kirill A. Shutemov
On Fri, Sep 23, 2022 at 12:38:26PM +0300, Kirill A. Shutemov wrote:
> > So I would assume an untagged pointer should just be fine for the IOMMU
> > to walk. IOMMU currently wants canonical addresses for VA.
>
> Right. But it means that LAM compatibility can be block on two layers:
> IOMMU and device. IOMMU is not the only HW entity that has to be aware of
> tagged pointers.
Why does a device need to care about this? What do you imagine a
device doing with it?
The userspace should program the device with the tagged address, the
device should present the tagged address on the bus, the IOMMU should
translate the tagged address the same as the CPU by ignoring the upper
bits.
Jason
On 9/23/22 04:46, Jason Gunthorpe wrote:
> On Fri, Sep 23, 2022 at 12:38:26PM +0300, Kirill A. Shutemov wrote:
>>> So I would assume an untagged pointer should just be fine for the IOMMU
>>> to walk. IOMMU currently wants canonical addresses for VA.
>> Right. But it means that LAM compatibility can be block on two layers:
>> IOMMU and device. IOMMU is not the only HW entity that has to be aware of
>> tagged pointers.
> Why does a device need to care about this? What do you imagine a
> device doing with it?
>
> The userspace should program the device with the tagged address, the
> device should present the tagged address on the bus, the IOMMU should
> translate the tagged address the same as the CPU by ignoring the upper
> bits.
Is this how *every* access works? Every single device access to the
address space goes through the IOMMU?
I thought devices also cached address translation responses from the
IOMMU and stashed them in their own device-local TLB. If the device is
unaware of the tags, then how does device TLB invalidation work? Would
all device TLB flushes be full flushes of the device's TLB? If something
tried to use single-address invalidation, it would need to invalidate
every possible tag alias because the device wouldn't know that the tags
*are* tags instead of actual virtual addresses.
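A back-of-the-envelope for that aliasing concern, assuming LAM_U57's six
metadata bits: one linear page can appear under 2^6 = 64 distinct raw
pointer values, so a tag-unaware device TLB would need up to 64
single-address invalidations (or a full flush) to cover a single page.

/* Illustration only, not from the thread. */
#define LAM_U57_TAG_BITS	6
#define LAM_U57_TAG_ALIASES	(1UL << LAM_U57_TAG_BITS)	/* 64 raw-VA aliases per linear address */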
On Fri, Sep 23, 2022 at 07:18:42AM -0700, Dave Hansen wrote:
> On 9/23/22 04:46, Jason Gunthorpe wrote:
> > On Fri, Sep 23, 2022 at 12:38:26PM +0300, Kirill A. Shutemov wrote:
> >>> So I would assume an untagged pointer should just be fine for the IOMMU
> >>> to walk. IOMMU currently wants canonical addresses for VA.
> >> Right. But it means that LAM compatibility can be block on two layers:
> >> IOMMU and device. IOMMU is not the only HW entity that has to be aware of
> >> tagged pointers.
> > Why does a device need to care about this? What do you imagine a
> > device doing with it?
> >
> > The userspace should program the device with the tagged address, the
> > device should present the tagged address on the bus, the IOMMU should
> > translate the tagged address the same as the CPU by ignoring the upper
> > bits.
>
> Is this how *every* access works? Every single device access to the
> address space goes through the IOMMU?
>
> I thought devices also cached address translation responses from the
> IOMMU and stashed them in their own device-local TLB.
Ah, you are worried about invalidation.
There is an optional PCI feature called ATS that provides this caching,
and it is mandatory if the IOMMU will use the CPU page table.
In ATS the invalidation is triggered by the IOMMU driver in a
device-agnostic way.
The PCI spec has no provision to invalidate with a mask; only linear
chunks of address space can be invalidated.
Presumably when someone wants to teach the IOMMU to process LAM they
will also have to teach the PCI spec how to do LAM too.
I would not like to see a world where drivers have to deal with this.
ATS is completely transparent to the driver, and it should stay that
way. Devices should handle LAM through some device-agnostic PCI SIG
approach, and that will be contained entirely in the IOMMU driver.
Jason
On Fri, Sep 23, 2022 at 11:42:39AM -0300, Jason Gunthorpe wrote:
> On Fri, Sep 23, 2022 at 07:18:42AM -0700, Dave Hansen wrote:
> > On 9/23/22 04:46, Jason Gunthorpe wrote:
> > > On Fri, Sep 23, 2022 at 12:38:26PM +0300, Kirill A. Shutemov wrote:
> > >>> So I would assume an untagged pointer should just be fine for the IOMMU
> > >>> to walk. IOMMU currently wants canonical addresses for VA.
> > >> Right. But it means that LAM compatibility can be block on two layers:
> > >> IOMMU and device. IOMMU is not the only HW entity that has to be aware of
> > >> tagged pointers.
> > > Why does a device need to care about this? What do you imagine a
> > > device doing with it?
> > >
> > > The userspace should program the device with the tagged address, the
> > > device should present the tagged address on the bus, the IOMMU should
> > > translate the tagged address the same as the CPU by ignoring the upper
> > > bits.
> >
> > Is this how *every* access works? Every single device access to the
> > address space goes through the IOMMU?
> >
> > I thought devices also cached address translation responses from the
> > IOMMU and stashed them in their own device-local TLB.
>
> Ah, you are worried about invalidation.
>
> There is an optional PCI feature called ATS that is this caching, and
> it is mandatory if the IOMMU will use the CPU page table.
>
> In ATS the invalidation is triggered by the iommu driver in a device
> agnostic way.
>
> The PCI spec has no provision to invalidate with a mask, only linear
> chunks of address space can be invalidated.
This part is currently being worked on in the PCI SIG. Once we have
something like this we can tell the device which portion of the VA to
mask. Of course it will take a while before the PCI SIG standardizes it
and we start seeing devices that support it.
BTW, this is a problem for all vendors that support SVM and
LAM/MTT. I see the ARM SMMU 3.1 doc states it doesn't support MTT
at the moment, just like the Intel IOMMU.
I hope the API we develop works across all vendors.
On Fri, Sep 23, 2022 at 07:18:42AM -0700, Dave Hansen wrote:
> On 9/23/22 04:46, Jason Gunthorpe wrote:
> > On Fri, Sep 23, 2022 at 12:38:26PM +0300, Kirill A. Shutemov wrote:
> >>> So I would assume an untagged pointer should just be fine for the IOMMU
> >>> to walk. IOMMU currently wants canonical addresses for VA.
> >> Right. But it means that LAM compatibility can be block on two layers:
> >> IOMMU and device. IOMMU is not the only HW entity that has to be aware of
> >> tagged pointers.
> > Why does a device need to care about this? What do you imagine a
> > device doing with it?
> >
> > The userspace should program the device with the tagged address, the
> > device should present the tagged address on the bus, the IOMMU should
> > translate the tagged address the same as the CPU by ignoring the upper
> > bits.
>
> Is this how *every* access works? Every single device access to the
> address space goes through the IOMMU?
>
> I thought devices also cached address translation responses from the
> IOMMU and stashed them in their own device-local TLB. If the device is
> unaware of the tags, then how does device TLB invalidation work? Would
This is coming full circle now :-)
Since the device doesn't understand tagging, SVM and tagging aren't
compatible. If you need SVM, you can only send sanitized pointers to the
device, period. In fact, our page-request path even performs canonicality
checks before doing the page-faulting.
> all device TLB flushes be full flushes of the devices TLB? If something
> tried to use single-address invalidation, it would need to invalidate
> every possible tag alias because the device wouldn't know that the tags
> *are* tags instead of actual virtual addresses.
Once tagging is extended into the PCI SIG and devices know how to work
with it, so will the IOMMU, and then they can all play on the same
field. Until then they are isolated, or SVM only works with untagged VAs.
On 9/23/22 08:28, Ashok Raj wrote:
>>
>> I thought devices also cached address translation responses from the
>> IOMMU and stashed them in their own device-local TLB. If the device is
>> unaware of the tags, then how does device TLB invalidation work? Would
> This is coming a full circle now :-)
>
> Since the device doesn't understand tagging, SVM and tagging aren't
> compatible. If you need SVM, you can only send sanitized pointers to the
> device period. In fact our page-request even looks for canonical checks
> before doing the page-faulting.
>
>> all device TLB flushes be full flushes of the devices TLB? If something
>> tried to use single-address invalidation, it would need to invalidate
>> every possible tag alias because the device wouldn't know that the tags
>> *are* tags instead of actual virtual addresses.
> Once tagging is extended into the PCI SIG, and devices know to work with
> them, so will the IOMMU, then they can all play in the same field. Until
> then they are isolated, or let SVM only work with untagged VA's.
But, the point that Kirill and I were getting at is still that devices
*have* a role to play here. The idea that this can be hidden at the
IOMMU layer is pure fantasy. Right?
On Fri, Sep 23, 2022 at 08:31:13AM -0700, Dave Hansen wrote:
> On 9/23/22 08:28, Ashok Raj wrote:
> >>
> >> I thought devices also cached address translation responses from the
> >> IOMMU and stashed them in their own device-local TLB. If the device is
> >> unaware of the tags, then how does device TLB invalidation work? Would
> > This is coming a full circle now :-)
> >
> > Since the device doesn't understand tagging, SVM and tagging aren't
> > compatible. If you need SVM, you can only send sanitized pointers to the
> > device period. In fact our page-request even looks for canonical checks
> > before doing the page-faulting.
> >
> >> all device TLB flushes be full flushes of the devices TLB? If something
> >> tried to use single-address invalidation, it would need to invalidate
> >> every possible tag alias because the device wouldn't know that the tags
> >> *are* tags instead of actual virtual addresses.
> > Once tagging is extended into the PCI SIG, and devices know to work with
> > them, so will the IOMMU, then they can all play in the same field. Until
> > then they are isolated, or let SVM only work with untagged VA's.
>
> But, the point that Kirill and I were getting at is still that devices
> *have* a role to play here. The idea that this can be hidden at the
> IOMMU layer is pure fantasy. Right?
If you *can't* send tagged memory to the device, what role does the
device need to play?
For now you can only send proper VAs that are canonical.
On 9/23/22 08:44, Ashok Raj wrote:
>> But, the point that Kirill and I were getting at is still that devices
>> *have* a role to play here. The idea that this can be hidden at the
>> IOMMU layer is pure fantasy. Right?
> If you *can't* send tagged memory to the device, what is the
> role the device need to play?
>
> For now you can only send proper VA's that are canonical.
Today, yes, you have to keep tagged addresses away from devices. They
must be sequestered in a place that only the CPU can find them.
The observation that Kirill and I had is that there are things that are
done solely on the device today -- like accessing a translated address
twice -- without IOMMU involvement. We were trying to figure out how
that would work in the future once tagged addresses are exposed to
devices and they implement all the new PCI magic.
After our private chat, I think the answer is that devices *have* a role
to play. Device-side logic must know how to untag memory before asking
for translation or even *deciding* to ask for address translation. But,
hopefully, communicating that untagging information to the device
will be done in a device-agnostic, standardized way, just as PASIDs or
ATS are handled today.
On Fri, Sep 23, 2022 at 09:23:40AM -0700, Dave Hansen wrote:
> After our private chat, I think the answer is that devices *have* a role
> to play. Device-side logic must know how to untag memory before asking
> for translation or even *deciding* to ask for address translation. But,
> hopefully, the communicating that untagging information to the device
> will be done in a device-agnostic, standardized way, just how PASIDs or
> ATS are handled today.
Right, that would be my hope; we will see what the PCI SIG standardizes.
Jason