2022-10-25 01:30:53

by Kirill A. Shutemov

Subject: [PATCHv11 00/16] Linear Address Masking enabling

Linear Address Masking[1] (LAM) modifies the checking that is applied to
64-bit linear addresses, allowing software to use the untranslated
address bits for metadata.

The capability can be used for efficient address sanitizer (ASAN)
implementations and for optimizations in JITs and virtual machines.

The patchset brings support for LAM for userspace addresses. Only LAM_U57 is
supported at this time.
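
For illustration only (not part of the patchset): a minimal userspace sketch
of what LAM_U57 permits, packing a 6-bit tag into bits 62:57 of a user
pointer. LAM_U57_MASK mirrors the value used by the selftests below; the
software-side untagging assumes a user pointer (bit 63 clear).

#include <stdint.h>

#define LAM_U57_MASK	(0x3fULL << 57)		/* tag bits 62:57 */

/* Pack a 6-bit tag into the untranslated bits of a user pointer. */
static inline void *tag_pointer(void *p, uint64_t tag)
{
	uint64_t addr = (uint64_t)p & ~LAM_U57_MASK;

	return (void *)(addr | ((tag & 0x3f) << 57));
}

/* Strip the tag again in software, e.g. for pointer comparisons. */
static inline void *untag_pointer(void *p)
{
	return (void *)((uint64_t)p & ~LAM_U57_MASK);
}

Once LAM_U57 is enabled for the process, dereferencing tag_pointer(p, 42)
accesses the same memory as dereferencing p.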

Please review and consider applying.

Results for the self-tests:

ok 1 MALLOC: LAM_U57. Dereferencing pointer with metadata
# Get segmentation fault(11).ok 2 MALLOC:[Negative] Disable LAM. Dereferencing pointer with metadata.
ok 3 BITS: Check default tag bits
ok 4 # SKIP MMAP: First mmap high address, then set LAM_U57.
ok 5 # SKIP MMAP: First LAM_U57, then High address.
ok 6 MMAP: First LAM_U57, then Low address.
ok 7 SYSCALL: LAM_U57. syscall with metadata
ok 8 SYSCALL:[Negative] Disable LAM. Dereferencing pointer with metadata.
ok 9 URING: LAM_U57. Dereferencing pointer with metadata
ok 10 URING:[Negative] Disable LAM. Dereferencing pointer with metadata.
ok 11 FORK: LAM_U57, child process should get LAM mode same as parent
ok 12 EXECVE: LAM_U57, child process should get disabled LAM mode
open: Device or resource busy
ok 13 PASID: [Negative] Execute LAM, PASID, SVA in sequence
ok 14 PASID: Execute LAM, SVA, PASID in sequence
ok 15 PASID: [Negative] Execute PASID, LAM, SVA in sequence
ok 16 PASID: Execute PASID, SVA, LAM in sequence
ok 17 PASID: Execute SVA, LAM, PASID in sequence
ok 18 PASID: Execute SVA, PASID, LAM in sequence
1..18

git://git.kernel.org/pub/scm/linux/kernel/git/kas/linux.git lam

v11:
- Move untag_mask to /proc/$PID/status;
- s/SVM/SVA/g;
- static inline arch_pgtable_dma_compat() instead of macros;
- Replace pasid_valid() with mm_valid_pasid();
- Acks from Ashok and Jacob (forgot to apply from v9);
v10:
- Rebased to v6.1-rc1;
- Add selftest for SVM vs LAM;
v9:
- Fix race between LAM enabling and check that KVM memslot address doesn't
have any tags;
- Reduce untagged_addr() overhead until the first LAM user;
- Clarify SVM vs. LAM semantics;
- Use mmap_lock to serialize LAM enabling;
v8:
- Drop redundant smp_mb() in prctl_enable_tagged_addr();
- Cleanup code around build_cr3();
- Fix commit messages;
- Selftests updates;
- Acked/Reviewed/Tested-bys from Alexander and Peter;
v7:
- Drop redundant smp_mb() in prctl_enable_tagged_addr();
- Cleanup code around build_cr3();
- Fix commit message;
- Fix indentation;
v6:
- Rebased onto v6.0-rc1
- LAM_U48 excluded from the patchset. Still available in the git tree;
- add ARCH_GET_MAX_TAG_BITS;
- Fix build without CONFIG_DEBUG_VM;
- Update comments;
- Reviewed/Tested-by from Alexander;
v5:
- Do not use switch_mm() in enable_lam_func()
- Use mb()/READ_ONCE() pair on LAM enabling;
- Add self-test by Weihong Zhang;
- Add comments;
v4:
- Fix untagged_addr() for LAM_U48;
- Remove no-threads restriction on LAM enabling;
- Fix mm_struct access from /proc/$PID/arch_status
- Fix LAM handling in initialize_tlbstate_and_flush()
- Pack tlb_state better;
- Comments and commit messages;
v3:
- Rebased onto v5.19-rc1
- Per-process enabling;
- API overhaul (again);
- Avoid branches and costly computations in the fast path;
- LAM_U48 is in optional patch.
v2:
- Rebased onto v5.18-rc1
- New arch_prctl(2)-based API
- Expose status of LAM (or other thread features) in
/proc/$PID/arch_status

[1] ISE, Chapter 10. https://cdrdv2.intel.com/v1/dl/getContent/671368

Kirill A. Shutemov (11):
x86/mm: Fix CR3_ADDR_MASK
x86: CPUID and CR3/CR4 flags for Linear Address Masking
mm: Pass down mm_struct to untagged_addr()
x86/mm: Handle LAM on context switch
x86/uaccess: Provide untagged_addr() and remove tags before address
check
KVM: Serialize tagged address check against tagging enabling
x86/mm: Provide arch_prctl() interface for LAM
x86/mm: Reduce untagged_addr() overhead until the first LAM user
mm: Expose untagging mask in /proc/$PID/status
iommu/sva: Replace pasid_valid() helper with mm_valid_pasid()
x86/mm, iommu/sva: Make LAM and SVA mutually exclusive

Weihong Zhang (5):
selftests/x86/lam: Add malloc and tag-bits test cases for
linear-address masking
selftests/x86/lam: Add mmap and SYSCALL test cases for linear-address
masking
selftests/x86/lam: Add io_uring test cases for linear-address masking
selftests/x86/lam: Add inherit test cases for linear-address masking
selftests/x86/lam: Add ARCH_FORCE_TAGGED_SVA test cases for
linear-address masking

arch/arm64/include/asm/memory.h | 4 +-
arch/arm64/include/asm/mmu_context.h | 6 +
arch/arm64/include/asm/signal.h | 2 +-
arch/arm64/include/asm/uaccess.h | 2 +-
arch/arm64/kernel/hw_breakpoint.c | 2 +-
arch/arm64/kernel/traps.c | 4 +-
arch/arm64/mm/fault.c | 10 +-
arch/sparc/include/asm/mmu_context_64.h | 6 +
arch/sparc/include/asm/pgtable_64.h | 2 +-
arch/sparc/include/asm/uaccess_64.h | 2 +
arch/x86/include/asm/cpufeatures.h | 1 +
arch/x86/include/asm/mmu.h | 12 +-
arch/x86/include/asm/mmu_context.h | 47 +
arch/x86/include/asm/processor-flags.h | 4 +-
arch/x86/include/asm/tlbflush.h | 34 +
arch/x86/include/asm/uaccess.h | 46 +-
arch/x86/include/uapi/asm/prctl.h | 5 +
arch/x86/include/uapi/asm/processor-flags.h | 6 +
arch/x86/kernel/process.c | 3 +
arch/x86/kernel/process_64.c | 81 +-
arch/x86/kernel/traps.c | 6 +-
arch/x86/mm/tlb.c | 48 +-
.../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c | 2 +-
drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c | 2 +-
drivers/gpu/drm/radeon/radeon_gem.c | 2 +-
drivers/infiniband/hw/mlx4/mr.c | 2 +-
drivers/iommu/iommu-sva-lib.c | 16 +-
drivers/media/common/videobuf2/frame_vector.c | 2 +-
drivers/media/v4l2-core/videobuf-dma-contig.c | 2 +-
.../staging/media/atomisp/pci/hmm/hmm_bo.c | 2 +-
drivers/tee/tee_shm.c | 2 +-
drivers/vfio/vfio_iommu_type1.c | 2 +-
fs/proc/array.c | 6 +
fs/proc/task_mmu.c | 2 +-
include/linux/ioasid.h | 9 -
include/linux/mm.h | 11 -
include/linux/mmu_context.h | 14 +
include/linux/sched/mm.h | 8 +-
include/linux/uaccess.h | 15 +
lib/strncpy_from_user.c | 2 +-
lib/strnlen_user.c | 2 +-
mm/gup.c | 6 +-
mm/madvise.c | 2 +-
mm/mempolicy.c | 6 +-
mm/migrate.c | 2 +-
mm/mincore.c | 2 +-
mm/mlock.c | 4 +-
mm/mmap.c | 2 +-
mm/mprotect.c | 2 +-
mm/mremap.c | 2 +-
mm/msync.c | 2 +-
tools/testing/selftests/x86/Makefile | 2 +-
tools/testing/selftests/x86/lam.c | 1149 +++++++++++++++++
virt/kvm/kvm_main.c | 14 +-
54 files changed, 1539 insertions(+), 92 deletions(-)
create mode 100644 tools/testing/selftests/x86/lam.c

--
2.38.0


2022-10-25 01:31:18

by Kirill A. Shutemov

Subject: [PATCHv11 03/16] mm: Pass down mm_struct to untagged_addr()

Intel Linear Address Masking (LAM) brings per-mm untagging rules. Pass
down mm_struct to the untagging helper. This helps apply the untagging
policy correctly.

In most cases, current->mm is the one to use, but there are some
exceptions, such as get_user_pages_remote().
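
For illustration (not from the patch): most callers simply pass current->mm,
while code that operates on another process's address space passes the mm it
already holds. A hypothetical helper sketching the remote case, built on the
existing get_task_mm()/mmput() interfaces:

/* Illustrative only: untag an address under another task's rules. */
static unsigned long untag_in_task(struct task_struct *task, unsigned long addr)
{
	struct mm_struct *mm = get_task_mm(task);
	unsigned long untagged = addr;

	if (mm) {
		/* The target's untagging policy applies, not current->mm's. */
		untagged = untagged_addr(mm, addr);
		mmput(mm);
	}

	return untagged;
}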

Move the dummy implementation of untagged_addr() from <linux/mm.h> to
<linux/uaccess.h>. <asm/uaccess.h> can override the implementation.
Moving the dummy implementation out of <linux/mm.h> helps to avoid header
hell if the helper needs to dereference mm_struct.

Signed-off-by: Kirill A. Shutemov <[email protected]>
Reviewed-by: Rick Edgecombe <[email protected]>
Reviewed-by: Alexander Potapenko <[email protected]>
Tested-by: Alexander Potapenko <[email protected]>
Acked-by: Peter Zijlstra (Intel) <[email protected]>
---
arch/arm64/include/asm/memory.h | 4 ++--
arch/arm64/include/asm/signal.h | 2 +-
arch/arm64/include/asm/uaccess.h | 2 +-
arch/arm64/kernel/hw_breakpoint.c | 2 +-
arch/arm64/kernel/traps.c | 4 ++--
arch/arm64/mm/fault.c | 10 +++++-----
arch/sparc/include/asm/pgtable_64.h | 2 +-
arch/sparc/include/asm/uaccess_64.h | 2 ++
drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c | 2 +-
drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c | 2 +-
drivers/gpu/drm/radeon/radeon_gem.c | 2 +-
drivers/infiniband/hw/mlx4/mr.c | 2 +-
drivers/media/common/videobuf2/frame_vector.c | 2 +-
drivers/media/v4l2-core/videobuf-dma-contig.c | 2 +-
drivers/staging/media/atomisp/pci/hmm/hmm_bo.c | 2 +-
drivers/tee/tee_shm.c | 2 +-
drivers/vfio/vfio_iommu_type1.c | 2 +-
fs/proc/task_mmu.c | 2 +-
include/linux/mm.h | 11 -----------
include/linux/uaccess.h | 15 +++++++++++++++
lib/strncpy_from_user.c | 2 +-
lib/strnlen_user.c | 2 +-
mm/gup.c | 6 +++---
mm/madvise.c | 2 +-
mm/mempolicy.c | 6 +++---
mm/migrate.c | 2 +-
mm/mincore.c | 2 +-
mm/mlock.c | 4 ++--
mm/mmap.c | 2 +-
mm/mprotect.c | 2 +-
mm/mremap.c | 2 +-
mm/msync.c | 2 +-
virt/kvm/kvm_main.c | 2 +-
33 files changed, 58 insertions(+), 52 deletions(-)

diff --git a/arch/arm64/include/asm/memory.h b/arch/arm64/include/asm/memory.h
index 9dd08cd339c3..5b24ef93c6b9 100644
--- a/arch/arm64/include/asm/memory.h
+++ b/arch/arm64/include/asm/memory.h
@@ -227,8 +227,8 @@ static inline unsigned long kaslr_offset(void)
#define __untagged_addr(addr) \
((__force __typeof__(addr))sign_extend64((__force u64)(addr), 55))

-#define untagged_addr(addr) ({ \
- u64 __addr = (__force u64)(addr); \
+#define untagged_addr(mm, addr) ({ \
+ u64 __addr = (__force u64)(addr); \
__addr &= __untagged_addr(__addr); \
(__force __typeof__(addr))__addr; \
})
diff --git a/arch/arm64/include/asm/signal.h b/arch/arm64/include/asm/signal.h
index ef449f5f4ba8..0899c355c398 100644
--- a/arch/arm64/include/asm/signal.h
+++ b/arch/arm64/include/asm/signal.h
@@ -18,7 +18,7 @@ static inline void __user *arch_untagged_si_addr(void __user *addr,
if (sig == SIGTRAP && si_code == TRAP_BRKPT)
return addr;

- return untagged_addr(addr);
+ return untagged_addr(current->mm, addr);
}
#define arch_untagged_si_addr arch_untagged_si_addr

diff --git a/arch/arm64/include/asm/uaccess.h b/arch/arm64/include/asm/uaccess.h
index 5c7b2f9d5913..122d894a4136 100644
--- a/arch/arm64/include/asm/uaccess.h
+++ b/arch/arm64/include/asm/uaccess.h
@@ -44,7 +44,7 @@ static inline int access_ok(const void __user *addr, unsigned long size)
*/
if (IS_ENABLED(CONFIG_ARM64_TAGGED_ADDR_ABI) &&
(current->flags & PF_KTHREAD || test_thread_flag(TIF_TAGGED_ADDR)))
- addr = untagged_addr(addr);
+ addr = untagged_addr(current->mm, addr);

return likely(__access_ok(addr, size));
}
diff --git a/arch/arm64/kernel/hw_breakpoint.c b/arch/arm64/kernel/hw_breakpoint.c
index b29a311bb055..d637cee7b771 100644
--- a/arch/arm64/kernel/hw_breakpoint.c
+++ b/arch/arm64/kernel/hw_breakpoint.c
@@ -715,7 +715,7 @@ static u64 get_distance_from_watchpoint(unsigned long addr, u64 val,
u64 wp_low, wp_high;
u32 lens, lene;

- addr = untagged_addr(addr);
+ addr = untagged_addr(current->mm, addr);

lens = __ffs(ctrl->len);
lene = __fls(ctrl->len);
diff --git a/arch/arm64/kernel/traps.c b/arch/arm64/kernel/traps.c
index 23d281ed7621..f40f3885b674 100644
--- a/arch/arm64/kernel/traps.c
+++ b/arch/arm64/kernel/traps.c
@@ -477,7 +477,7 @@ void arm64_notify_segfault(unsigned long addr)
int code;

mmap_read_lock(current->mm);
- if (find_vma(current->mm, untagged_addr(addr)) == NULL)
+ if (find_vma(current->mm, untagged_addr(current->mm, addr)) == NULL)
code = SEGV_MAPERR;
else
code = SEGV_ACCERR;
@@ -551,7 +551,7 @@ static void user_cache_maint_handler(unsigned long esr, struct pt_regs *regs)
int ret = 0;

tagged_address = pt_regs_read_reg(regs, rt);
- address = untagged_addr(tagged_address);
+ address = untagged_addr(current->mm, tagged_address);

switch (crm) {
case ESR_ELx_SYS64_ISS_CRM_DC_CVAU: /* DC CVAU, gets promoted */
diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
index 5b391490e045..b8799e9c7e1b 100644
--- a/arch/arm64/mm/fault.c
+++ b/arch/arm64/mm/fault.c
@@ -454,7 +454,7 @@ static void set_thread_esr(unsigned long address, unsigned long esr)
static void do_bad_area(unsigned long far, unsigned long esr,
struct pt_regs *regs)
{
- unsigned long addr = untagged_addr(far);
+ unsigned long addr = untagged_addr(current->mm, far);

/*
* If we are in kernel mode at this point, we have no context to
@@ -524,7 +524,7 @@ static int __kprobes do_page_fault(unsigned long far, unsigned long esr,
vm_fault_t fault;
unsigned long vm_flags;
unsigned int mm_flags = FAULT_FLAG_DEFAULT;
- unsigned long addr = untagged_addr(far);
+ unsigned long addr = untagged_addr(mm, far);

if (kprobe_page_fault(regs, esr))
return 0;
@@ -679,7 +679,7 @@ static int __kprobes do_translation_fault(unsigned long far,
unsigned long esr,
struct pt_regs *regs)
{
- unsigned long addr = untagged_addr(far);
+ unsigned long addr = untagged_addr(current->mm, far);

if (is_ttbr0_addr(addr))
return do_page_fault(far, esr, regs);
@@ -726,7 +726,7 @@ static int do_sea(unsigned long far, unsigned long esr, struct pt_regs *regs)
* UNKNOWN for synchronous external aborts. Mask them out now
* so that userspace doesn't see them.
*/
- siaddr = untagged_addr(far);
+ siaddr = untagged_addr(current->mm, far);
}
arm64_notify_die(inf->name, regs, inf->sig, inf->code, siaddr, esr);

@@ -816,7 +816,7 @@ static const struct fault_info fault_info[] = {
void do_mem_abort(unsigned long far, unsigned long esr, struct pt_regs *regs)
{
const struct fault_info *inf = esr_to_fault_info(esr);
- unsigned long addr = untagged_addr(far);
+ unsigned long addr = untagged_addr(current->mm, far);

if (!inf->fn(far, esr, regs))
return;
diff --git a/arch/sparc/include/asm/pgtable_64.h b/arch/sparc/include/asm/pgtable_64.h
index a779418ceba9..aa996ffe5c8c 100644
--- a/arch/sparc/include/asm/pgtable_64.h
+++ b/arch/sparc/include/asm/pgtable_64.h
@@ -1052,7 +1052,7 @@ static inline unsigned long __untagged_addr(unsigned long start)

return start;
}
-#define untagged_addr(addr) \
+#define untagged_addr(mm, addr) \
((__typeof__(addr))(__untagged_addr((unsigned long)(addr))))

static inline bool pte_access_permitted(pte_t pte, bool write)
diff --git a/arch/sparc/include/asm/uaccess_64.h b/arch/sparc/include/asm/uaccess_64.h
index 94266a5c5b04..b825a5dd0210 100644
--- a/arch/sparc/include/asm/uaccess_64.h
+++ b/arch/sparc/include/asm/uaccess_64.h
@@ -8,8 +8,10 @@

#include <linux/compiler.h>
#include <linux/string.h>
+#include <linux/mm_types.h>
#include <asm/asi.h>
#include <asm/spitfire.h>
+#include <asm/pgtable.h>

#include <asm/processor.h>
#include <asm-generic/access_ok.h>
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
index 978d3970b5cc..173f0b5ccba1 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
@@ -1659,7 +1659,7 @@ int amdgpu_amdkfd_gpuvm_alloc_memory_of_gpu(
if (flags & KFD_IOC_ALLOC_MEM_FLAGS_USERPTR) {
if (!offset || !*offset)
return -EINVAL;
- user_addr = untagged_addr(*offset);
+ user_addr = untagged_addr(current->mm, *offset);
} else if (flags & (KFD_IOC_ALLOC_MEM_FLAGS_DOORBELL |
KFD_IOC_ALLOC_MEM_FLAGS_MMIO_REMAP)) {
bo_type = ttm_bo_type_sg;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
index 8ef31d687ef3..691dfb3f2c0e 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
@@ -382,7 +382,7 @@ int amdgpu_gem_userptr_ioctl(struct drm_device *dev, void *data,
uint32_t handle;
int r;

- args->addr = untagged_addr(args->addr);
+ args->addr = untagged_addr(current->mm, args->addr);

if (offset_in_page(args->addr | args->size))
return -EINVAL;
diff --git a/drivers/gpu/drm/radeon/radeon_gem.c b/drivers/gpu/drm/radeon/radeon_gem.c
index 261fcbae88d7..cba2f4b19838 100644
--- a/drivers/gpu/drm/radeon/radeon_gem.c
+++ b/drivers/gpu/drm/radeon/radeon_gem.c
@@ -371,7 +371,7 @@ int radeon_gem_userptr_ioctl(struct drm_device *dev, void *data,
uint32_t handle;
int r;

- args->addr = untagged_addr(args->addr);
+ args->addr = untagged_addr(current->mm, args->addr);

if (offset_in_page(args->addr | args->size))
return -EINVAL;
diff --git a/drivers/infiniband/hw/mlx4/mr.c b/drivers/infiniband/hw/mlx4/mr.c
index a40bf58bcdd3..383ac9e40dfa 100644
--- a/drivers/infiniband/hw/mlx4/mr.c
+++ b/drivers/infiniband/hw/mlx4/mr.c
@@ -379,7 +379,7 @@ static struct ib_umem *mlx4_get_umem_mr(struct ib_device *device, u64 start,
* again
*/
if (!ib_access_writable(access_flags)) {
- unsigned long untagged_start = untagged_addr(start);
+ unsigned long untagged_start = untagged_addr(current->mm, start);
struct vm_area_struct *vma;

mmap_read_lock(current->mm);
diff --git a/drivers/media/common/videobuf2/frame_vector.c b/drivers/media/common/videobuf2/frame_vector.c
index 542dde9d2609..7e62f7a2555d 100644
--- a/drivers/media/common/videobuf2/frame_vector.c
+++ b/drivers/media/common/videobuf2/frame_vector.c
@@ -47,7 +47,7 @@ int get_vaddr_frames(unsigned long start, unsigned int nr_frames,
if (WARN_ON_ONCE(nr_frames > vec->nr_allocated))
nr_frames = vec->nr_allocated;

- start = untagged_addr(start);
+ start = untagged_addr(mm, start);

ret = pin_user_pages_fast(start, nr_frames,
FOLL_FORCE | FOLL_WRITE | FOLL_LONGTERM,
diff --git a/drivers/media/v4l2-core/videobuf-dma-contig.c b/drivers/media/v4l2-core/videobuf-dma-contig.c
index 52312ce2ba05..a1444f8afa05 100644
--- a/drivers/media/v4l2-core/videobuf-dma-contig.c
+++ b/drivers/media/v4l2-core/videobuf-dma-contig.c
@@ -157,8 +157,8 @@ static void videobuf_dma_contig_user_put(struct videobuf_dma_contig_memory *mem)
static int videobuf_dma_contig_user_get(struct videobuf_dma_contig_memory *mem,
struct videobuf_buffer *vb)
{
- unsigned long untagged_baddr = untagged_addr(vb->baddr);
struct mm_struct *mm = current->mm;
+ unsigned long untagged_baddr = untagged_addr(mm, vb->baddr);
struct vm_area_struct *vma;
unsigned long prev_pfn, this_pfn;
unsigned long pages_done, user_address;
diff --git a/drivers/staging/media/atomisp/pci/hmm/hmm_bo.c b/drivers/staging/media/atomisp/pci/hmm/hmm_bo.c
index f50494123f03..a43c65950554 100644
--- a/drivers/staging/media/atomisp/pci/hmm/hmm_bo.c
+++ b/drivers/staging/media/atomisp/pci/hmm/hmm_bo.c
@@ -794,7 +794,7 @@ static int alloc_user_pages(struct hmm_buffer_object *bo,
* and map to user space
*/

- userptr = untagged_addr(userptr);
+ userptr = untagged_addr(current->mm, userptr);

if (vma->vm_flags & (VM_IO | VM_PFNMAP)) {
page_nr = pin_user_pages((unsigned long)userptr, bo->pgnr,
diff --git a/drivers/tee/tee_shm.c b/drivers/tee/tee_shm.c
index 27295bda3e0b..5c85445f3a65 100644
--- a/drivers/tee/tee_shm.c
+++ b/drivers/tee/tee_shm.c
@@ -262,7 +262,7 @@ register_shm_helper(struct tee_context *ctx, unsigned long addr,
shm->flags = flags;
shm->ctx = ctx;
shm->id = id;
- addr = untagged_addr(addr);
+ addr = untagged_addr(current->mm, addr);
start = rounddown(addr, PAGE_SIZE);
shm->offset = addr - start;
shm->size = length;
diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 23c24fe98c00..74b6aecea8b0 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -573,7 +573,7 @@ static int vaddr_get_pfns(struct mm_struct *mm, unsigned long vaddr,
goto done;
}

- vaddr = untagged_addr(vaddr);
+ vaddr = untagged_addr(mm, vaddr);

retry:
vma = vma_lookup(mm, vaddr);
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 8b4f3073f8f5..665e36885f21 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -1685,7 +1685,7 @@ static ssize_t pagemap_read(struct file *file, char __user *buf,
/* watch out for wraparound */
start_vaddr = end_vaddr;
if (svpfn <= (ULONG_MAX >> PAGE_SHIFT))
- start_vaddr = untagged_addr(svpfn << PAGE_SHIFT);
+ start_vaddr = untagged_addr(mm, svpfn << PAGE_SHIFT);

/* Ensure the address is inside the task */
if (start_vaddr > mm->task_size)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 8bbcccbc5565..bfac5a166cb8 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -95,17 +95,6 @@ extern int mmap_rnd_compat_bits __read_mostly;
#include <asm/page.h>
#include <asm/processor.h>

-/*
- * Architectures that support memory tagging (assigning tags to memory regions,
- * embedding these tags into addresses that point to these memory regions, and
- * checking that the memory and the pointer tags match on memory accesses)
- * redefine this macro to strip tags from pointers.
- * It's defined as noop for architectures that don't support memory tagging.
- */
-#ifndef untagged_addr
-#define untagged_addr(addr) (addr)
-#endif
-
#ifndef __pa_symbol
#define __pa_symbol(x) __pa(RELOC_HIDE((unsigned long)(x), 0))
#endif
diff --git a/include/linux/uaccess.h b/include/linux/uaccess.h
index afb18f198843..46680189d761 100644
--- a/include/linux/uaccess.h
+++ b/include/linux/uaccess.h
@@ -10,6 +10,21 @@

#include <asm/uaccess.h>

+/*
+ * Architectures that support memory tagging (assigning tags to memory regions,
+ * embedding these tags into addresses that point to these memory regions, and
+ * checking that the memory and the pointer tags match on memory accesses)
+ * redefine this macro to strip tags from pointers.
+ *
+ * Passing down mm_struct allows to define untagging rules on per-process
+ * basis.
+ *
+ * It's defined as noop for architectures that don't support memory tagging.
+ */
+#ifndef untagged_addr
+#define untagged_addr(mm, addr) (addr)
+#endif
+
/*
* Architectures should provide two primitives (raw_copy_{to,from}_user())
* and get rid of their private instances of copy_{to,from}_user() and
diff --git a/lib/strncpy_from_user.c b/lib/strncpy_from_user.c
index 6432b8c3e431..6e1e2aa0c994 100644
--- a/lib/strncpy_from_user.c
+++ b/lib/strncpy_from_user.c
@@ -121,7 +121,7 @@ long strncpy_from_user(char *dst, const char __user *src, long count)
return 0;

max_addr = TASK_SIZE_MAX;
- src_addr = (unsigned long)untagged_addr(src);
+ src_addr = (unsigned long)untagged_addr(current->mm, src);
if (likely(src_addr < max_addr)) {
unsigned long max = max_addr - src_addr;
long retval;
diff --git a/lib/strnlen_user.c b/lib/strnlen_user.c
index feeb935a2299..abc096a68f05 100644
--- a/lib/strnlen_user.c
+++ b/lib/strnlen_user.c
@@ -97,7 +97,7 @@ long strnlen_user(const char __user *str, long count)
return 0;

max_addr = TASK_SIZE_MAX;
- src_addr = (unsigned long)untagged_addr(str);
+ src_addr = (unsigned long)untagged_addr(current->mm, str);
if (likely(src_addr < max_addr)) {
unsigned long max = max_addr - src_addr;
long retval;
diff --git a/mm/gup.c b/mm/gup.c
index fe195d47de74..f585e4a185ca 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1168,7 +1168,7 @@ static long __get_user_pages(struct mm_struct *mm,
if (!nr_pages)
return 0;

- start = untagged_addr(start);
+ start = untagged_addr(mm, start);

VM_BUG_ON(!!pages != !!(gup_flags & (FOLL_GET | FOLL_PIN)));

@@ -1342,7 +1342,7 @@ int fixup_user_fault(struct mm_struct *mm,
struct vm_area_struct *vma;
vm_fault_t ret;

- address = untagged_addr(address);
+ address = untagged_addr(mm, address);

if (unlocked)
fault_flags |= FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE;
@@ -3027,7 +3027,7 @@ static int internal_get_user_pages_fast(unsigned long start,
if (!(gup_flags & FOLL_FAST_ONLY))
might_lock_read(&current->mm->mmap_lock);

- start = untagged_addr(start) & PAGE_MASK;
+ start = untagged_addr(current->mm, start) & PAGE_MASK;
len = nr_pages << PAGE_SHIFT;
if (check_add_overflow(start, len, &end))
return 0;
diff --git a/mm/madvise.c b/mm/madvise.c
index 2baa93ca2310..1319a18da8bc 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -1382,7 +1382,7 @@ int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int beh
size_t len;
struct blk_plug plug;

- start = untagged_addr(start);
+ start = untagged_addr(mm, start);

if (!madvise_behavior_valid(behavior))
return -EINVAL;
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index a937eaec5b68..4fdeef477fbd 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1467,7 +1467,7 @@ static long kernel_mbind(unsigned long start, unsigned long len,
int lmode = mode;
int err;

- start = untagged_addr(start);
+ start = untagged_addr(current->mm, start);
err = sanitize_mpol_flags(&lmode, &mode_flags);
if (err)
return err;
@@ -1491,7 +1491,7 @@ SYSCALL_DEFINE4(set_mempolicy_home_node, unsigned long, start, unsigned long, le
int err = -ENOENT;
VMA_ITERATOR(vmi, mm, start);

- start = untagged_addr(start);
+ start = untagged_addr(mm, start);
if (start & ~PAGE_MASK)
return -EINVAL;
/*
@@ -1692,7 +1692,7 @@ static int kernel_get_mempolicy(int __user *policy,
if (nmask != NULL && maxnode < nr_node_ids)
return -EINVAL;

- addr = untagged_addr(addr);
+ addr = untagged_addr(current->mm, addr);

err = do_get_mempolicy(&pval, &nodes, addr, flags);

diff --git a/mm/migrate.c b/mm/migrate.c
index 1379e1912772..8e7823bef31d 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1795,7 +1795,7 @@ static int do_pages_move(struct mm_struct *mm, nodemask_t task_nodes,
goto out_flush;
if (get_user(node, nodes + i))
goto out_flush;
- addr = (unsigned long)untagged_addr(p);
+ addr = (unsigned long)untagged_addr(mm, p);

err = -ENODEV;
if (node < 0 || node >= MAX_NUMNODES)
diff --git a/mm/mincore.c b/mm/mincore.c
index fa200c14185f..72c55bd9d184 100644
--- a/mm/mincore.c
+++ b/mm/mincore.c
@@ -236,7 +236,7 @@ SYSCALL_DEFINE3(mincore, unsigned long, start, size_t, len,
unsigned long pages;
unsigned char *tmp;

- start = untagged_addr(start);
+ start = untagged_addr(current->mm, start);

/* Check the start address: needs to be page-aligned.. */
if (start & ~PAGE_MASK)
diff --git a/mm/mlock.c b/mm/mlock.c
index 7032f6dd0ce1..d969703c08ff 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -570,7 +570,7 @@ static __must_check int do_mlock(unsigned long start, size_t len, vm_flags_t fla
unsigned long lock_limit;
int error = -ENOMEM;

- start = untagged_addr(start);
+ start = untagged_addr(current->mm, start);

if (!can_do_mlock())
return -EPERM;
@@ -633,7 +633,7 @@ SYSCALL_DEFINE2(munlock, unsigned long, start, size_t, len)
{
int ret;

- start = untagged_addr(start);
+ start = untagged_addr(current->mm, start);

len = PAGE_ALIGN(len + (offset_in_page(start)));
start &= PAGE_MASK;
diff --git a/mm/mmap.c b/mm/mmap.c
index bf2122af94e7..bb8037840160 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2796,7 +2796,7 @@ EXPORT_SYMBOL(vm_munmap);

SYSCALL_DEFINE2(munmap, unsigned long, addr, size_t, len)
{
- addr = untagged_addr(addr);
+ addr = untagged_addr(current->mm, addr);
return __vm_munmap(addr, len, true);
}

diff --git a/mm/mprotect.c b/mm/mprotect.c
index 668bfaa6ed2a..dee44e3a0527 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -680,7 +680,7 @@ static int do_mprotect_pkey(unsigned long start, size_t len,
struct mmu_gather tlb;
MA_STATE(mas, &current->mm->mm_mt, 0, 0);

- start = untagged_addr(start);
+ start = untagged_addr(current->mm, start);

prot &= ~(PROT_GROWSDOWN|PROT_GROWSUP);
if (grows == (PROT_GROWSDOWN|PROT_GROWSUP)) /* can't be both */
diff --git a/mm/mremap.c b/mm/mremap.c
index e465ffe279bb..81c857281a52 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -909,7 +909,7 @@ SYSCALL_DEFINE5(mremap, unsigned long, addr, unsigned long, old_len,
*
* See Documentation/arm64/tagged-address-abi.rst for more information.
*/
- addr = untagged_addr(addr);
+ addr = untagged_addr(mm, addr);

if (flags & ~(MREMAP_FIXED | MREMAP_MAYMOVE | MREMAP_DONTUNMAP))
return ret;
diff --git a/mm/msync.c b/mm/msync.c
index ac4c9bfea2e7..f941e9bb610f 100644
--- a/mm/msync.c
+++ b/mm/msync.c
@@ -37,7 +37,7 @@ SYSCALL_DEFINE3(msync, unsigned long, start, size_t, len, int, flags)
int unmapped_error = 0;
int error = -EINVAL;

- start = untagged_addr(start);
+ start = untagged_addr(mm, start);

if (flags & ~(MS_ASYNC | MS_INVALIDATE | MS_SYNC))
goto out;
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index e30f1b4ecfa5..8c86b06b35da 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1945,7 +1945,7 @@ int __kvm_set_memory_region(struct kvm *kvm,
return -EINVAL;
/* We can read the guest memory with __xxx_user() later on. */
if ((mem->userspace_addr & (PAGE_SIZE - 1)) ||
- (mem->userspace_addr != untagged_addr(mem->userspace_addr)) ||
+ (mem->userspace_addr != untagged_addr(kvm->mm, mem->userspace_addr)) ||
!access_ok((void __user *)(unsigned long)mem->userspace_addr,
mem->memory_size))
return -EINVAL;
--
2.38.0

2022-10-25 01:32:32

by Kirill A. Shutemov

Subject: [PATCHv11 15/16] selftests/x86/lam: Add inherit test cases for linear-address masking

From: Weihong Zhang <[email protected]>

LAM is enabled per-process and gets inherited on fork(2)/clone(2). exec()
reverts LAM status to the default disabled state.

There are two test scenarios:

- Fork test cases:

These cases test that LAM is inherited across fork(): a child process
created by fork() should inherit the LAM feature from its parent and
report the same LAM mode as the parent.

- Execve test cases:

Processes created by execve() are different from processes created by
fork(): they revert LAM status to the disabled state.
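
A minimal userspace sketch of the expected semantics (illustrative only; the
arch_prctl values match the ones used by the selftest):

#include <stdio.h>
#include <sys/syscall.h>
#include <sys/wait.h>
#include <unistd.h>

#define ARCH_GET_UNTAG_MASK	0x4001
#define ARCH_ENABLE_TAGGED_ADDR	0x4002

int main(void)
{
	unsigned long mask = 0;

	/* Parent enables LAM_U57 (6 tag bits). */
	syscall(SYS_arch_prctl, ARCH_ENABLE_TAGGED_ADDR, 6UL);

	if (fork() == 0) {
		/* The child inherits LAM: mask reads back as ~(0x3fUL << 57). */
		syscall(SYS_arch_prctl, ARCH_GET_UNTAG_MASK, &mask);
		printf("child untag mask: %lx\n", mask);

		/* After execve() the new image starts with LAM disabled
		 * again, i.e. the mask would read back as -1UL. */
		execl("/bin/true", "true", (char *)NULL);
	}
	wait(NULL);
	return 0;
}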

Signed-off-by: Weihong Zhang <[email protected]>
Signed-off-by: Kirill A. Shutemov <[email protected]>
---
tools/testing/selftests/x86/lam.c | 125 +++++++++++++++++++++++++++++-
1 file changed, 121 insertions(+), 4 deletions(-)

diff --git a/tools/testing/selftests/x86/lam.c b/tools/testing/selftests/x86/lam.c
index 8ea1fcef4c9f..cfc9073c0262 100644
--- a/tools/testing/selftests/x86/lam.c
+++ b/tools/testing/selftests/x86/lam.c
@@ -37,8 +37,9 @@
#define FUNC_MMAP 0x4
#define FUNC_SYSCALL 0x8
#define FUNC_URING 0x10
+#define FUNC_INHERITE 0x20

-#define TEST_MASK 0x1f
+#define TEST_MASK 0x3f

#define LOW_ADDR (0x1UL << 30)
#define HIGH_ADDR (0x3UL << 48)
@@ -174,6 +175,28 @@ static unsigned long get_default_tag_bits(void)
return lam;
}

+/*
+ * Set tagged address and read back untag mask.
+ * check if the untag mask is expected.
+ */
+static int get_lam(void)
+{
+ uint64_t ptr = 0;
+ int ret = -1;
+ /* Get untagged mask */
+ if (syscall(SYS_arch_prctl, ARCH_GET_UNTAG_MASK, &ptr) == -1)
+ return -1;
+
+ /* Check mask returned is expected */
+ if (ptr == ~(LAM_U57_MASK))
+ ret = LAM_U57_BITS;
+ else if (ptr == -1ULL)
+ ret = LAM_NONE;
+
+
+ return ret;
+}
+
/* According to LAM mode, set metadata in high bits */
static uint64_t set_metadata(uint64_t src, unsigned long lam)
{
@@ -581,7 +604,7 @@ int do_uring(unsigned long lam)

switch (lam) {
case LAM_U57_BITS: /* Clear bits 62:57 */
- addr = (addr & ~(0x3fULL << 57));
+ addr = (addr & ~(LAM_U57_MASK));
break;
}
free((void *)addr);
@@ -632,6 +655,72 @@ static int fork_test(struct testcases *test)
return ret;
}

+static int handle_execve(struct testcases *test)
+{
+ int ret, child_ret;
+ int lam = test->lam;
+ pid_t pid;
+
+ pid = fork();
+ if (pid < 0) {
+ perror("Fork failed.");
+ ret = 1;
+ } else if (pid == 0) {
+ char path[PATH_MAX];
+
+ /* Set LAM mode in parent process */
+ if (set_lam(lam) != 0)
+ return 1;
+
+ /* Get current binary's path and the binary was run by execve */
+ if (readlink("/proc/self/exe", path, PATH_MAX) <= 0)
+ exit(-1);
+
+ /* run binary to get LAM mode and return to parent process */
+ if (execlp(path, path, "-t 0x0", NULL) < 0) {
+ perror("error on exec");
+ exit(-1);
+ }
+ } else {
+ wait(&child_ret);
+ ret = WEXITSTATUS(child_ret);
+ if (ret != LAM_NONE)
+ return 1;
+ }
+
+ return 0;
+}
+
+static int handle_inheritance(struct testcases *test)
+{
+ int ret, child_ret;
+ int lam = test->lam;
+ pid_t pid;
+
+ /* Set LAM mode in parent process */
+ if (set_lam(lam) != 0)
+ return 1;
+
+ pid = fork();
+ if (pid < 0) {
+ perror("Fork failed.");
+ return 1;
+ } else if (pid == 0) {
+ /* Set LAM mode in parent process */
+ int child_lam = get_lam();
+
+ exit(child_lam);
+ } else {
+ wait(&child_ret);
+ ret = WEXITSTATUS(child_ret);
+
+ if (lam != ret)
+ return 1;
+ }
+
+ return 0;
+}
+
static void run_test(struct testcases *test, int count)
{
int i, ret = 0;
@@ -740,11 +829,26 @@ static struct testcases mmap_cases[] = {
},
};

+static struct testcases inheritance_cases[] = {
+ {
+ .expected = 0,
+ .lam = LAM_U57_BITS,
+ .test_func = handle_inheritance,
+ .msg = "FORK: LAM_U57, child process should get LAM mode same as parent\n",
+ },
+ {
+ .expected = 0,
+ .lam = LAM_U57_BITS,
+ .test_func = handle_execve,
+ .msg = "EXECVE: LAM_U57, child process should get disabled LAM mode\n",
+ },
+};
+
static void cmd_help(void)
{
printf("usage: lam [-h] [-t test list]\n");
printf("\t-t test list: run tests specified in the test list, default:0x%x\n", TEST_MASK);
- printf("\t\t0x1:malloc; 0x2:max_bits; 0x4:mmap; 0x8:syscall; 0x10:io_uring.\n");
+ printf("\t\t0x1:malloc; 0x2:max_bits; 0x4:mmap; 0x8:syscall; 0x10:io_uring; 0x20:inherit;\n");
printf("\t-h: help\n");
}

@@ -764,7 +868,7 @@ int main(int argc, char **argv)
switch (c) {
case 't':
tests = strtoul(optarg, NULL, 16);
- if (!(tests & TEST_MASK)) {
+ if (tests && !(tests & TEST_MASK)) {
ksft_print_msg("Invalid argument!\n");
return -1;
}
@@ -778,6 +882,16 @@ int main(int argc, char **argv)
}
}

+ /*
+ * When tests is 0, it is not a real test case;
+ * the option used by test case(execve) to check the lam mode in
+ * process generated by execve, the process read back lam mode and
+ * check with lam mode in parent process.
+ */
+ if (!tests)
+ return (get_lam());
+
+ /* Run test cases */
if (tests & FUNC_MALLOC)
run_test(malloc_cases, ARRAY_SIZE(malloc_cases));

@@ -793,6 +907,9 @@ int main(int argc, char **argv)
if (tests & FUNC_URING)
run_test(uring_cases, ARRAY_SIZE(uring_cases));

+ if (tests & FUNC_INHERITE)
+ run_test(inheritance_cases, ARRAY_SIZE(inheritance_cases));
+
ksft_set_plan(tests_cnt);

return ksft_exit_pass();
--
2.38.0

2022-10-25 02:06:17

by Kirill A. Shutemov

Subject: [PATCHv11 16/16] selftests/x86/lam: Add ARCH_FORCE_TAGGED_SVA test cases for linear-address masking

From: Weihong Zhang <[email protected]>

By default, a process is not allowed to both enable LAM and use SVA.
The new ARCH_FORCE_TAGGED_SVA arch_prctl() overrides this limitation.

Add new test cases for the new arch_prctl:
before ARCH_FORCE_TAGGED_SVA is used, enabling LAM and SVA in the same
process must be refused, so those test cases are negative.

The test depends on the idxd driver and the IOMMU. Before running the
test, add "intel_iommu=on,sm_on" to the kernel command line and load the
idxd driver.

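The sequences below sketch the intended behaviour (illustrative only, assuming
the usual syscall()/open() headers; each sequence runs in a fresh process and
the device path is the shared DSA workqueue configured by the test):

#define ARCH_ENABLE_TAGGED_ADDR	0x4002
#define ARCH_FORCE_TAGGED_SVA	0x4004

/* Negative: LAM first, then an SVA PASID allocation. Opening the shared
 * workqueue is refused (the cover letter's log shows "Device or resource
 * busy" here). */
syscall(SYS_arch_prctl, ARCH_ENABLE_TAGGED_ADDR, 6UL);
fd = open("/dev/dsa/wq0.1", O_RDWR);		/* fails */

/* Positive: ARCH_FORCE_TAGGED_SVA before the PASID allocation lets LAM
 * and SVA coexist in the same process. */
syscall(SYS_arch_prctl, ARCH_ENABLE_TAGGED_ADDR, 6UL);
syscall(SYS_arch_prctl, ARCH_FORCE_TAGGED_SVA);
fd = open("/dev/dsa/wq0.1", O_RDWR);		/* succeeds */
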
Signed-off-by: Weihong Zhang <[email protected]>
Signed-off-by: Kirill A. Shutemov <[email protected]>
---
tools/testing/selftests/x86/lam.c | 237 +++++++++++++++++++++++++++++-
1 file changed, 235 insertions(+), 2 deletions(-)

diff --git a/tools/testing/selftests/x86/lam.c b/tools/testing/selftests/x86/lam.c
index cfc9073c0262..52a876a7ccb8 100644
--- a/tools/testing/selftests/x86/lam.c
+++ b/tools/testing/selftests/x86/lam.c
@@ -30,6 +30,7 @@
#define ARCH_GET_UNTAG_MASK 0x4001
#define ARCH_ENABLE_TAGGED_ADDR 0x4002
#define ARCH_GET_MAX_TAG_BITS 0x4003
+#define ARCH_FORCE_TAGGED_SVA 0x4004

/* Specified test function bits */
#define FUNC_MALLOC 0x1
@@ -38,8 +39,9 @@
#define FUNC_SYSCALL 0x8
#define FUNC_URING 0x10
#define FUNC_INHERITE 0x20
+#define FUNC_PASID 0x40

-#define TEST_MASK 0x3f
+#define TEST_MASK 0x7f

#define LOW_ADDR (0x1UL << 30)
#define HIGH_ADDR (0x3UL << 48)
@@ -55,11 +57,19 @@
#define URING_QUEUE_SZ 1
#define URING_BLOCK_SZ 2048

+/* Pasid test define */
+#define LAM_CMD_BIT 0x1
+#define PAS_CMD_BIT 0x2
+#define SVA_CMD_BIT 0x4
+
+#define PAS_CMD(cmd1, cmd2, cmd3) (((cmd3) << 8) | ((cmd2) << 4) | ((cmd1) << 0))
+
struct testcases {
unsigned int later;
int expected; /* 2: SIGSEGV Error; 1: other errors */
unsigned long lam;
uint64_t addr;
+ uint64_t cmd;
int (*test_func)(struct testcases *test);
const char *msg;
};
@@ -556,7 +566,7 @@ int do_uring(unsigned long lam)
struct file_io *fi;
struct stat st;
int ret = 1;
- char path[PATH_MAX];
+ char path[PATH_MAX] = {0};

/* get current process path */
if (readlink("/proc/self/exe", path, PATH_MAX) <= 0)
@@ -852,6 +862,226 @@ static void cmd_help(void)
printf("\t-h: help\n");
}

+/* Check for file existence */
+uint8_t file_Exists(const char *fileName)
+{
+ struct stat buffer;
+
+ uint8_t ret = (stat(fileName, &buffer) == 0);
+
+ return ret;
+}
+
+/* Sysfs idxd files */
+const char *dsa_configs[] = {
+ "echo 1 > /sys/bus/dsa/devices/dsa0/wq0.1/group_id",
+ "echo shared > /sys/bus/dsa/devices/dsa0/wq0.1/mode",
+ "echo 10 > /sys/bus/dsa/devices/dsa0/wq0.1/priority",
+ "echo 16 > /sys/bus/dsa/devices/dsa0/wq0.1/size",
+ "echo 15 > /sys/bus/dsa/devices/dsa0/wq0.1/threshold",
+ "echo user > /sys/bus/dsa/devices/dsa0/wq0.1/type",
+ "echo MyApp1 > /sys/bus/dsa/devices/dsa0/wq0.1/name",
+ "echo 1 > /sys/bus/dsa/devices/dsa0/engine0.1/group_id",
+ "echo dsa0 > /sys/bus/dsa/drivers/idxd/bind",
+ /* bind files and devices, generated a device file in /dev */
+ "echo wq0.1 > /sys/bus/dsa/drivers/user/bind",
+};
+
+/* DSA device file */
+const char *dsaDeviceFile = "/dev/dsa/wq0.1";
+/* file for io*/
+const char *dsaPasidEnable = "/sys/bus/dsa/devices/dsa0/pasid_enabled";
+
+/*
+ * DSA depends on kernel cmdline "intel_iommu=on,sm_on"
+ * return pasid_enabled (0: disable 1:enable)
+ */
+int Check_DSA_Kernel_Setting(void)
+{
+ char command[256] = "";
+ char buf[256] = "";
+ char *ptr;
+ int rv = -1;
+
+ snprintf(command, sizeof(command) - 1, "cat %s", dsaPasidEnable);
+
+ FILE *cmd = popen(command, "r");
+
+ if (cmd) {
+ while (fgets(buf, sizeof(buf) - 1, cmd) != NULL);
+
+ pclose(cmd);
+ rv = strtol(buf, &ptr, 16);
+ }
+
+ return rv;
+}
+
+/*
+ * Config DSA's sysfs files as shared DSA's WQ.
+ * Generated a device file /dev/dsa/wq0.1
+ * Return: 0 OK; 1 Failed; 3 Skip(SVA disabled).
+ */
+int Dsa_Init_Sysfs(void)
+{
+ uint len = ARRAY_SIZE(dsa_configs);
+ const char **p = dsa_configs;
+
+ if (file_Exists(dsaDeviceFile) == 1)
+ return 0;
+
+ /* check the idxd driver */
+ if (file_Exists(dsaPasidEnable) != 1) {
+ printf("Please make sure idxd driver was loaded\n");
+ return 3;
+ }
+
+ /* Check SVA feature */
+ if (Check_DSA_Kernel_Setting() != 1) {
+ printf("Please enable SVA.(Add intel_iommu=on,sm_on in kernel cmdline)\n");
+ return 3;
+ }
+
+ /* Check the idxd device file on /dev/dsa/ */
+ for (int i = 0; i < len; i++) {
+ if (system(p[i]))
+ return 1;
+ }
+
+ /* After config, /dev/dsa/wq0.1 should be generated */
+ return (file_Exists(dsaDeviceFile) != 1);
+}
+
+/*
+ * Open DSA device file, triger API: iommu_sva_alloc_pasid
+ */
+void *allocate_dsa_pasid(void)
+{
+ int fd;
+ void *wq;
+
+ fd = open(dsaDeviceFile, O_RDWR);
+ if (fd < 0) {
+ perror("open");
+ return MAP_FAILED;
+ }
+
+ wq = mmap(NULL, 0x1000, PROT_WRITE,
+ MAP_SHARED | MAP_POPULATE, fd, 0);
+ if (wq == MAP_FAILED)
+ perror("mmap");
+
+ return wq;
+}
+
+int set_force_svm(void)
+{
+ int ret = 0;
+
+ ret = syscall(SYS_arch_prctl, ARCH_FORCE_TAGGED_SVA);
+
+ return ret;
+}
+
+int handle_pasid(struct testcases *test)
+{
+ uint tmp = test->cmd;
+ uint runed = 0x0;
+ int ret = 0;
+ void *wq = NULL;
+
+ ret = Dsa_Init_Sysfs();
+ if (ret != 0)
+ return ret;
+
+ for (int i = 0; i < 3; i++) {
+ int err = 0;
+
+ if (tmp & 0x1) {
+ /* run set lam mode*/
+ if ((runed & 0x1) == 0) {
+ err = set_lam(LAM_U57_BITS);
+ runed = runed | 0x1;
+ } else
+ err = 1;
+ } else if (tmp & 0x4) {
+ /* run force svm */
+ if ((runed & 0x4) == 0) {
+ err = set_force_svm();
+ runed = runed | 0x4;
+ } else
+ err = 1;
+ } else if (tmp & 0x2) {
+ /* run allocate pasid */
+ if ((runed & 0x2) == 0) {
+ runed = runed | 0x2;
+ wq = allocate_dsa_pasid();
+ if (wq == MAP_FAILED)
+ err = 1;
+ } else
+ err = 1;
+ }
+
+ ret = ret + err;
+ if (ret > 0)
+ break;
+
+ tmp = tmp >> 4;
+ }
+
+ if (wq != MAP_FAILED && wq != NULL)
+ if (munmap(wq, 0x1000))
+ printf("munmap failed %d\n", errno);
+
+ if (runed != 0x7)
+ ret = 1;
+
+ return (ret != 0);
+}
+
+/*
+ * Pasid test depends on idxd and SVA, kernel should enable iommu and sm.
+ * command line(intel_iommu=on,sm_on)
+ */
+static struct testcases pasid_cases[] = {
+ {
+ .expected = 1,
+ .cmd = PAS_CMD(LAM_CMD_BIT, PAS_CMD_BIT, SVA_CMD_BIT),
+ .test_func = handle_pasid,
+ .msg = "PASID: [Negative] Execute LAM, PASID, SVA in sequence\n",
+ },
+ {
+ .expected = 0,
+ .cmd = PAS_CMD(LAM_CMD_BIT, SVA_CMD_BIT, PAS_CMD_BIT),
+ .test_func = handle_pasid,
+ .msg = "PASID: Execute LAM, SVA, PASID in sequence\n",
+ },
+ {
+ .expected = 1,
+ .cmd = PAS_CMD(PAS_CMD_BIT, LAM_CMD_BIT, SVA_CMD_BIT),
+ .test_func = handle_pasid,
+ .msg = "PASID: [Negative] Execute PASID, LAM, SVA in sequence\n",
+ },
+ {
+ .expected = 0,
+ .cmd = PAS_CMD(PAS_CMD_BIT, SVA_CMD_BIT, LAM_CMD_BIT),
+ .test_func = handle_pasid,
+ .msg = "PASID: Execute PASID, SVA, LAM in sequence\n",
+ },
+ {
+ .expected = 0,
+ .cmd = PAS_CMD(SVA_CMD_BIT, LAM_CMD_BIT, PAS_CMD_BIT),
+ .test_func = handle_pasid,
+ .msg = "PASID: Execute SVA, LAM, PASID in sequence\n",
+ },
+ {
+ .expected = 0,
+ .cmd = PAS_CMD(SVA_CMD_BIT, PAS_CMD_BIT, LAM_CMD_BIT),
+ .test_func = handle_pasid,
+ .msg = "PASID: Execute SVA, PASID, LAM in sequence\n",
+ },
+};
+
int main(int argc, char **argv)
{
int c = 0;
@@ -910,6 +1140,9 @@ int main(int argc, char **argv)
if (tests & FUNC_INHERITE)
run_test(inheritance_cases, ARRAY_SIZE(inheritance_cases));

+ if (tests & FUNC_PASID)
+ run_test(pasid_cases, ARRAY_SIZE(pasid_cases));
+
ksft_set_plan(tests_cnt);

return ksft_exit_pass();
--
2.38.0

2022-10-25 02:09:37

by Kirill A. Shutemov

Subject: [PATCHv11 13/16] selftests/x86/lam: Add mmap and SYSCALL test cases for linear-address masking

From: Weihong Zhang <[email protected]>

Add mmap and SYSCALL test cases.

SYSCALL test cases:

- LAM supports setting metadata in the high bits 62:57 (LAM_U57) of a user
pointer. Pass such a pointer to a SYSCALL; the SYSCALL can dereference the
pointer and return the correct result.

- With LAM disabled, pass a pointer with metadata in the high bits to a
SYSCALL; the SYSCALL returns -1 (EFAULT).

MMAP test cases:

- Enable LAM_U57, MMAP with a low address (below bit 47), set metadata in
the high bits of the address; dereferencing the address should be
allowed.

- Enable LAM_U57, MMAP with a high address (above bit 47), set metadata in
the high bits of the address; dereferencing the address should be
allowed.
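
A minimal sketch of the SYSCALL case (illustrative; it mirrors what
handle_syscall() below does with uname(2)):

#include <stdint.h>
#include <sys/syscall.h>
#include <sys/utsname.h>
#include <unistd.h>

#define ARCH_ENABLE_TAGGED_ADDR	0x4002
#define LAM_U57_MASK		(0x3fULL << 57)

int main(void)
{
	struct utsname un;
	struct utsname *tagged;

	syscall(SYS_arch_prctl, ARCH_ENABLE_TAGGED_ADDR, 6UL);

	/* Plant metadata in bits 62:57 of the pointer. */
	tagged = (struct utsname *)(((uint64_t)&un & ~LAM_U57_MASK) |
				    (0x2aULL << 57));

	/* With LAM_U57 enabled the kernel untags the pointer and uname()
	 * succeeds; with LAM disabled it returns -1 (EFAULT). */
	return uname(tagged);
}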

Signed-off-by: Weihong Zhang <[email protected]>
Signed-off-by: Kirill A. Shutemov <[email protected]>
---
tools/testing/selftests/x86/lam.c | 144 +++++++++++++++++++++++++++++-
1 file changed, 140 insertions(+), 4 deletions(-)

diff --git a/tools/testing/selftests/x86/lam.c b/tools/testing/selftests/x86/lam.c
index 900a3a0fb709..cdc6e40e00e0 100644
--- a/tools/testing/selftests/x86/lam.c
+++ b/tools/testing/selftests/x86/lam.c
@@ -7,6 +7,7 @@
#include <signal.h>
#include <setjmp.h>
#include <sys/mman.h>
+#include <sys/utsname.h>
#include <sys/wait.h>
#include <inttypes.h>

@@ -29,11 +30,18 @@
/* Specified test function bits */
#define FUNC_MALLOC 0x1
#define FUNC_BITS 0x2
+#define FUNC_MMAP 0x4
+#define FUNC_SYSCALL 0x8

-#define TEST_MASK 0x3
+#define TEST_MASK 0xf
+
+#define LOW_ADDR (0x1UL << 30)
+#define HIGH_ADDR (0x3UL << 48)

#define MALLOC_LEN 32

+#define PAGE_SIZE (4 << 10)
+
struct testcases {
unsigned int later;
int expected; /* 2: SIGSEGV Error; 1: other errors */
@@ -49,6 +57,7 @@ jmp_buf segv_env;
static void segv_handler(int sig)
{
ksft_print_msg("Get segmentation fault(%d).", sig);
+
siglongjmp(segv_env, 1);
}

@@ -61,6 +70,16 @@ static inline int cpu_has_lam(void)
return (cpuinfo[0] & (1 << 26));
}

+/* Check 5-level page table feature in CPUID.(EAX=07H, ECX=00H):ECX.[bit 16] */
+static inline int cpu_has_la57(void)
+{
+ unsigned int cpuinfo[4];
+
+ __cpuid_count(0x7, 0, cpuinfo[0], cpuinfo[1], cpuinfo[2], cpuinfo[3]);
+
+ return (cpuinfo[2] & (1 << 16));
+}
+
/*
* Set tagged address and read back untag mask.
* check if the untagged mask is expected.
@@ -213,6 +232,68 @@ static int handle_malloc(struct testcases *test)
return ret;
}

+static int handle_mmap(struct testcases *test)
+{
+ void *ptr;
+ unsigned int flags = MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED;
+ int ret = 0;
+
+ if (test->later == 0 && test->lam != 0)
+ if (set_lam(test->lam) != 0)
+ return 1;
+
+ ptr = mmap((void *)test->addr, PAGE_SIZE, PROT_READ | PROT_WRITE,
+ flags, -1, 0);
+ if (ptr == MAP_FAILED) {
+ if (test->addr == HIGH_ADDR)
+ if (!cpu_has_la57())
+ return 3; /* unsupport LA57 */
+ return 1;
+ }
+
+ if (test->later != 0 && test->lam != 0)
+ if (set_lam(test->lam) != 0)
+ ret = 1;
+
+ if (ret == 0) {
+ if (sigsetjmp(segv_env, 1) == 0) {
+ signal(SIGSEGV, segv_handler);
+ ret = handle_lam_test(ptr, test->lam);
+ } else {
+ ret = 2;
+ }
+ }
+
+ munmap(ptr, PAGE_SIZE);
+ return ret;
+}
+
+static int handle_syscall(struct testcases *test)
+{
+ struct utsname unme, *pu;
+ int ret = 0;
+
+ if (test->later == 0 && test->lam != 0)
+ if (set_lam(test->lam) != 0)
+ return 1;
+
+ if (sigsetjmp(segv_env, 1) == 0) {
+ signal(SIGSEGV, segv_handler);
+ pu = (struct utsname *)set_metadata((uint64_t)&unme, test->lam);
+ ret = uname(pu);
+ if (ret < 0)
+ ret = 1;
+ } else {
+ ret = 2;
+ }
+
+ if (test->later != 0 && test->lam != 0)
+ if (set_lam(test->lam) != -1 && ret == 0)
+ ret = 1;
+
+ return ret;
+}
+
static int fork_test(struct testcases *test)
{
int ret, child_ret;
@@ -241,13 +322,20 @@ static void run_test(struct testcases *test, int count)
struct testcases *t = test + i;

/* fork a process to run test case */
+ tests_cnt++;
ret = fork_test(t);
+
+ /* return 3 is not support LA57, the case should be skipped */
+ if (ret == 3) {
+ ksft_test_result_skip(t->msg);
+ continue;
+ }
+
if (ret != 0)
ret = (t->expected == ret);
else
ret = !(t->expected);

- tests_cnt++;
ksft_test_result(ret, t->msg);
}
}
@@ -268,7 +356,6 @@ static struct testcases malloc_cases[] = {
},
};

-
static struct testcases bits_cases[] = {
{
.test_func = handle_max_bits,
@@ -276,11 +363,54 @@ static struct testcases bits_cases[] = {
},
};

+static struct testcases syscall_cases[] = {
+ {
+ .later = 0,
+ .lam = LAM_U57_BITS,
+ .test_func = handle_syscall,
+ .msg = "SYSCALL: LAM_U57. syscall with metadata\n",
+ },
+ {
+ .later = 1,
+ .expected = 1,
+ .lam = LAM_U57_BITS,
+ .test_func = handle_syscall,
+ .msg = "SYSCALL:[Negative] Disable LAM. Dereferencing pointer with metadata.\n",
+ },
+};
+
+static struct testcases mmap_cases[] = {
+ {
+ .later = 1,
+ .expected = 0,
+ .lam = LAM_U57_BITS,
+ .addr = HIGH_ADDR,
+ .test_func = handle_mmap,
+ .msg = "MMAP: First mmap high address, then set LAM_U57.\n",
+ },
+ {
+ .later = 0,
+ .expected = 0,
+ .lam = LAM_U57_BITS,
+ .addr = HIGH_ADDR,
+ .test_func = handle_mmap,
+ .msg = "MMAP: First LAM_U57, then High address.\n",
+ },
+ {
+ .later = 0,
+ .expected = 0,
+ .lam = LAM_U57_BITS,
+ .addr = LOW_ADDR,
+ .test_func = handle_mmap,
+ .msg = "MMAP: First LAM_U57, then Low address.\n",
+ },
+};
+
static void cmd_help(void)
{
printf("usage: lam [-h] [-t test list]\n");
printf("\t-t test list: run tests specified in the test list, default:0x%x\n", TEST_MASK);
- printf("\t\t0x1:malloc; 0x2:max_bits;\n");
+ printf("\t\t0x1:malloc; 0x2:max_bits; 0x4:mmap; 0x8:syscall.\n");
printf("\t-h: help\n");
}

@@ -320,6 +450,12 @@ int main(int argc, char **argv)
if (tests & FUNC_BITS)
run_test(bits_cases, ARRAY_SIZE(bits_cases));

+ if (tests & FUNC_MMAP)
+ run_test(mmap_cases, ARRAY_SIZE(mmap_cases));
+
+ if (tests & FUNC_SYSCALL)
+ run_test(syscall_cases, ARRAY_SIZE(syscall_cases));
+
ksft_set_plan(tests_cnt);

return ksft_exit_pass();
--
2.38.0

2022-10-25 02:14:19

by Kirill A. Shutemov

Subject: [PATCHv11 07/16] x86/mm: Provide arch_prctl() interface for LAM

Add a couple of arch_prctl() handlers:

- ARCH_ENABLE_TAGGED_ADDR enables LAM. The argument is the required number
of tag bits. It is rounded up to the nearest LAM mode that can
provide it. For now only LAM_U57 is supported, with 6 tag bits.

- ARCH_GET_UNTAG_MASK returns the untag mask. It indicates where the tag
bits are located in the address.

- ARCH_GET_MAX_TAG_BITS returns the maximum number of tag bits a user can
request. Zero if LAM is not supported.
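
Expected userspace flow (a sketch, assuming the syscall(2) wrapper and the
usual headers; the constants match the uapi header below):

unsigned long max_bits = 0, mask = 0;

syscall(SYS_arch_prctl, ARCH_GET_MAX_TAG_BITS, &max_bits);
/* max_bits == 6 with LAM_U57, 0 if LAM is unsupported. */

if (max_bits) {
	syscall(SYS_arch_prctl, ARCH_ENABLE_TAGGED_ADDR, max_bits);
	syscall(SYS_arch_prctl, ARCH_GET_UNTAG_MASK, &mask);
	/* mask == ~(0x3fUL << 57) for LAM_U57. */
}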

Signed-off-by: Kirill A. Shutemov <[email protected]>
Tested-by: Alexander Potapenko <[email protected]>
Reviewed-by: Alexander Potapenko <[email protected]>
Acked-by: Peter Zijlstra (Intel) <[email protected]>
---
arch/x86/include/uapi/asm/prctl.h | 4 ++
arch/x86/kernel/process_64.c | 65 ++++++++++++++++++++++++++++++-
2 files changed, 68 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/uapi/asm/prctl.h b/arch/x86/include/uapi/asm/prctl.h
index 500b96e71f18..a31e27b95b19 100644
--- a/arch/x86/include/uapi/asm/prctl.h
+++ b/arch/x86/include/uapi/asm/prctl.h
@@ -20,4 +20,8 @@
#define ARCH_MAP_VDSO_32 0x2002
#define ARCH_MAP_VDSO_64 0x2003

+#define ARCH_GET_UNTAG_MASK 0x4001
+#define ARCH_ENABLE_TAGGED_ADDR 0x4002
+#define ARCH_GET_MAX_TAG_BITS 0x4003
+
#endif /* _ASM_X86_PRCTL_H */
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index 6b3418bff326..a98536101447 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -743,6 +743,60 @@ static long prctl_map_vdso(const struct vdso_image *image, unsigned long addr)
}
#endif

+static void enable_lam_func(void *mm)
+{
+ struct mm_struct *loaded_mm = this_cpu_read(cpu_tlbstate.loaded_mm);
+ unsigned long lam_mask;
+ unsigned long cr3;
+
+ if (loaded_mm != mm)
+ return;
+
+ lam_mask = READ_ONCE(loaded_mm->context.lam_cr3_mask);
+
+ /* Update CR3 to get LAM active on the CPU */
+ cr3 = __read_cr3();
+ cr3 &= ~(X86_CR3_LAM_U48 | X86_CR3_LAM_U57);
+ cr3 |= lam_mask;
+ write_cr3(cr3);
+ set_tlbstate_cr3_lam_mask(lam_mask);
+}
+
+#define LAM_U57_BITS 6
+
+static int prctl_enable_tagged_addr(struct mm_struct *mm, unsigned long nr_bits)
+{
+ int ret = 0;
+
+ if (!cpu_feature_enabled(X86_FEATURE_LAM))
+ return -ENODEV;
+
+ if (mmap_write_lock_killable(mm))
+ return -EINTR;
+
+ /* Already enabled? */
+ if (mm->context.lam_cr3_mask) {
+ ret = -EBUSY;
+ goto out;
+ }
+
+ if (!nr_bits) {
+ ret = -EINVAL;
+ goto out;
+ } else if (nr_bits <= LAM_U57_BITS) {
+ mm->context.lam_cr3_mask = X86_CR3_LAM_U57;
+ mm->context.untag_mask = ~GENMASK(62, 57);
+ } else {
+ ret = -EINVAL;
+ goto out;
+ }
+
+ on_each_cpu_mask(mm_cpumask(mm), enable_lam_func, mm, true);
+out:
+ mmap_write_unlock(mm);
+ return ret;
+}
+
long do_arch_prctl_64(struct task_struct *task, int option, unsigned long arg2)
{
int ret = 0;
@@ -830,7 +884,16 @@ long do_arch_prctl_64(struct task_struct *task, int option, unsigned long arg2)
case ARCH_MAP_VDSO_64:
return prctl_map_vdso(&vdso_image_64, arg2);
#endif
-
+ case ARCH_GET_UNTAG_MASK:
+ return put_user(task->mm->context.untag_mask,
+ (unsigned long __user *)arg2);
+ case ARCH_ENABLE_TAGGED_ADDR:
+ return prctl_enable_tagged_addr(task->mm, arg2);
+ case ARCH_GET_MAX_TAG_BITS:
+ if (!cpu_feature_enabled(X86_FEATURE_LAM))
+ return put_user(0, (unsigned long __user *)arg2);
+ else
+ return put_user(LAM_U57_BITS, (unsigned long __user *)arg2);
default:
ret = -EINVAL;
break;
--
2.38.0

2022-10-25 02:15:56

by Kirill A. Shutemov

Subject: [PATCHv11 08/16] x86/mm: Reduce untagged_addr() overhead until the first LAM user

Use a static key to reduce untagged_addr() overhead.

The key only gets enabled when the first process enables LAM.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
arch/x86/include/asm/uaccess.h | 8 ++++++--
arch/x86/kernel/process_64.c | 4 ++++
2 files changed, 10 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/uaccess.h b/arch/x86/include/asm/uaccess.h
index c6062c07ccd2..820234f1f750 100644
--- a/arch/x86/include/asm/uaccess.h
+++ b/arch/x86/include/asm/uaccess.h
@@ -23,6 +23,8 @@ static inline bool pagefault_disabled(void);
#endif

#ifdef CONFIG_X86_64
+DECLARE_STATIC_KEY_FALSE(tagged_addr_key);
+
/*
* Mask out tag bits from the address.
*
@@ -31,8 +33,10 @@ static inline bool pagefault_disabled(void);
*/
#define untagged_addr(mm, addr) ({ \
u64 __addr = (__force u64)(addr); \
- s64 sign = (s64)__addr >> 63; \
- __addr &= (mm)->context.untag_mask | sign; \
+ if (static_branch_likely(&tagged_addr_key)) { \
+ s64 sign = (s64)__addr >> 63; \
+ __addr &= (mm)->context.untag_mask | sign; \
+ } \
(__force __typeof__(addr))__addr; \
})

diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index a98536101447..9952e9f517ec 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -743,6 +743,9 @@ static long prctl_map_vdso(const struct vdso_image *image, unsigned long addr)
}
#endif

+DEFINE_STATIC_KEY_FALSE(tagged_addr_key);
+EXPORT_SYMBOL_GPL(tagged_addr_key);
+
static void enable_lam_func(void *mm)
{
struct mm_struct *loaded_mm = this_cpu_read(cpu_tlbstate.loaded_mm);
@@ -792,6 +795,7 @@ static int prctl_enable_tagged_addr(struct mm_struct *mm, unsigned long nr_bits)
}

on_each_cpu_mask(mm_cpumask(mm), enable_lam_func, mm, true);
+ static_branch_enable(&tagged_addr_key);
out:
mmap_write_unlock(mm);
return ret;
--
2.38.0

2022-10-25 02:24:50

by Kirill A. Shutemov

Subject: [PATCHv11 01/16] x86/mm: Fix CR3_ADDR_MASK

The mask must not include bits above the physical address mask. These bits
are reserved and can be used for other things. Bits 61 and 62 are used
for Linear Address Masking.
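
For context, a rough sketch of the CR3 layout this change respects (the LAM
bit positions are the ones introduced later in the series; cr3_pgd_pa() is
just an illustrative name):

/*
 * CR3, 64-bit mode (simplified):
 *   bit  63    - CR3_NOFLUSH (with PCID)
 *   bit  62    - LAM_U57 (claimed later in this series)
 *   bit  61    - LAM_U48 (claimed later in this series)
 *   high bits down to bit 12 - page-table base, bounded by the
 *                              physical address width
 *   bits 11:0  - PCID
 */
static inline unsigned long cr3_pgd_pa(unsigned long cr3)
{
	/* Must not clobber bits 61-63, hence PHYSICAL_PAGE_MASK. */
	return cr3 & CR3_ADDR_MASK;
}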

Signed-off-by: Kirill A. Shutemov <[email protected]>
Reviewed-by: Rick Edgecombe <[email protected]>
Reviewed-by: Alexander Potapenko <[email protected]>
Tested-by: Alexander Potapenko <[email protected]>
Acked-by: Peter Zijlstra (Intel) <[email protected]>
---
arch/x86/include/asm/processor-flags.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/processor-flags.h b/arch/x86/include/asm/processor-flags.h
index 02c2cbda4a74..a7f3d9100adb 100644
--- a/arch/x86/include/asm/processor-flags.h
+++ b/arch/x86/include/asm/processor-flags.h
@@ -35,7 +35,7 @@
*/
#ifdef CONFIG_X86_64
/* Mask off the address space ID and SME encryption bits. */
-#define CR3_ADDR_MASK __sme_clr(0x7FFFFFFFFFFFF000ull)
+#define CR3_ADDR_MASK __sme_clr(PHYSICAL_PAGE_MASK)
#define CR3_PCID_MASK 0xFFFull
#define CR3_NOFLUSH BIT_ULL(63)

--
2.38.0

2022-10-25 02:27:13

by Kirill A. Shutemov

Subject: [PATCHv11 12/16] selftests/x86/lam: Add malloc and tag-bits test cases for linear-address masking

From: Weihong Zhang <[email protected]>

LAM is supported only in 64-bit mode and applies only to addresses used for
data accesses. In 64-bit mode, linear addresses have 64 bits. LAM is applied
to 64-bit linear addresses and allows software to use the high bits for
metadata.
LAM supports configurations that differ in which pointer bits are masked and
can be used for metadata.

LAM includes the following mode:

- LAM_U57: pointer bits in positions 62:57 are masked (LAM width 6), which
allows bits 62:57 of a user pointer to be used as metadata.

There are some arch_prctls:
ARCH_ENABLE_TAGGED_ADDR: enable LAM mode, masking the high bits of a user pointer.
ARCH_GET_UNTAG_MASK: get the current untag mask.
ARCH_GET_MAX_TAG_BITS: the maximum number of tag bits a user can request;
zero if LAM is not supported.

LAM mode is per-process and a process has only one chance to set it; there
is no API to disable LAM mode. So all test cases are run in a child process
(see the sketch below).
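
For example, a per-case fork pattern along these lines (a sketch; the actual
helper in lam.c is fork_test(), run_one() is just an illustrative name):

static int run_one(struct testcases *t)
{
	int status;
	pid_t pid = fork();

	if (pid == 0)
		exit(t->test_func(t));	/* the child runs a single case */

	waitpid(pid, &status, 0);
	return WEXITSTATUS(status);	/* LAM state dies with the child */
}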

Functions of this test:

MALLOC

- LAM_U57 masks bits 62:57 of a user pointer. A process in user space can
dereference such pointers.

- With LAM disabled, dereferencing a pointer with metadata above bit 48 or
bit 57 triggers SIGSEGV.

TAG_BITS

- The maximum number of tag bits for LAM_U57 is 6.

Signed-off-by: Weihong Zhang <[email protected]>
Signed-off-by: Kirill A. Shutemov <[email protected]>
---
tools/testing/selftests/x86/Makefile | 2 +-
tools/testing/selftests/x86/lam.c | 326 +++++++++++++++++++++++++++
2 files changed, 327 insertions(+), 1 deletion(-)
create mode 100644 tools/testing/selftests/x86/lam.c

diff --git a/tools/testing/selftests/x86/Makefile b/tools/testing/selftests/x86/Makefile
index 0388c4d60af0..c1a16a9d4f2f 100644
--- a/tools/testing/selftests/x86/Makefile
+++ b/tools/testing/selftests/x86/Makefile
@@ -18,7 +18,7 @@ TARGETS_C_32BIT_ONLY := entry_from_vm86 test_syscall_vdso unwind_vdso \
test_FCMOV test_FCOMI test_FISTTP \
vdso_restorer
TARGETS_C_64BIT_ONLY := fsgsbase sysret_rip syscall_numbering \
- corrupt_xstate_header amx
+ corrupt_xstate_header amx lam
# Some selftests require 32bit support enabled also on 64bit systems
TARGETS_C_32BIT_NEEDED := ldt_gdt ptrace_syscall

diff --git a/tools/testing/selftests/x86/lam.c b/tools/testing/selftests/x86/lam.c
new file mode 100644
index 000000000000..900a3a0fb709
--- /dev/null
+++ b/tools/testing/selftests/x86/lam.c
@@ -0,0 +1,326 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/syscall.h>
+#include <time.h>
+#include <signal.h>
+#include <setjmp.h>
+#include <sys/mman.h>
+#include <sys/wait.h>
+#include <inttypes.h>
+
+#include "../kselftest.h"
+
+#ifndef __x86_64__
+# error This test is 64-bit only
+#endif
+
+/* LAM modes, these definitions were copied from kernel code */
+#define LAM_NONE 0
+#define LAM_U57_BITS 6
+
+#define LAM_U57_MASK (0x3fULL << 57)
+/* arch prctl for LAM */
+#define ARCH_GET_UNTAG_MASK 0x4001
+#define ARCH_ENABLE_TAGGED_ADDR 0x4002
+#define ARCH_GET_MAX_TAG_BITS 0x4003
+
+/* Specified test function bits */
+#define FUNC_MALLOC 0x1
+#define FUNC_BITS 0x2
+
+#define TEST_MASK 0x3
+
+#define MALLOC_LEN 32
+
+struct testcases {
+ unsigned int later;
+ int expected; /* 2: SIGSEGV Error; 1: other errors */
+ unsigned long lam;
+ uint64_t addr;
+ int (*test_func)(struct testcases *test);
+ const char *msg;
+};
+
+int tests_cnt;
+jmp_buf segv_env;
+
+static void segv_handler(int sig)
+{
+ ksft_print_msg("Get segmentation fault(%d).", sig);
+ siglongjmp(segv_env, 1);
+}
+
+static inline int cpu_has_lam(void)
+{
+ unsigned int cpuinfo[4];
+
+ __cpuid_count(0x7, 1, cpuinfo[0], cpuinfo[1], cpuinfo[2], cpuinfo[3]);
+
+ return (cpuinfo[0] & (1 << 26));
+}
+
+/*
+ * Set tagged address and read back untag mask.
+ * check if the untagged mask is expected.
+ *
+ * @return:
+ * 0: Set LAM mode successfully
+ * others: failed to set LAM
+ */
+static int set_lam(unsigned long lam)
+{
+ int ret = 0;
+ uint64_t ptr = 0;
+
+ if (lam != LAM_U57_BITS && lam != LAM_NONE)
+ return -1;
+
+ /* Skip check return */
+ syscall(SYS_arch_prctl, ARCH_ENABLE_TAGGED_ADDR, lam);
+
+ /* Get untagged mask */
+ syscall(SYS_arch_prctl, ARCH_GET_UNTAG_MASK, &ptr);
+
+ /* Check mask returned is expected */
+ if (lam == LAM_U57_BITS)
+ ret = (ptr != ~(LAM_U57_MASK));
+ else if (lam == LAM_NONE)
+ ret = (ptr != -1ULL);
+
+ return ret;
+}
+
+static unsigned long get_default_tag_bits(void)
+{
+ pid_t pid;
+ int lam = LAM_NONE;
+ int ret = 0;
+
+ pid = fork();
+ if (pid < 0) {
+ perror("Fork failed.");
+ } else if (pid == 0) {
+ /* Set LAM mode in child process */
+ if (set_lam(LAM_U57_BITS) == 0)
+ lam = LAM_U57_BITS;
+ else
+ lam = LAM_NONE;
+ exit(lam);
+ } else {
+ wait(&ret);
+ lam = WEXITSTATUS(ret);
+ }
+
+ return lam;
+}
+
+/* According to LAM mode, set metadata in high bits */
+static uint64_t set_metadata(uint64_t src, unsigned long lam)
+{
+ uint64_t metadata;
+
+ srand(time(NULL));
+ /* Get a random value as metadata */
+ metadata = rand();
+
+ switch (lam) {
+ case LAM_U57_BITS: /* Set metadata in bits 62:57 */
+ metadata = (src & ~(LAM_U57_MASK)) | ((metadata & 0x3f) << 57);
+ break;
+ default:
+ metadata = src;
+ break;
+ }
+
+ return metadata;
+}
+
+/*
+ * Set metadata in a user pointer and compare the new pointer with the original.
+ * Both pointers should point to the same address.
+ *
+ * @return:
+ * 0: the value at the pointer with metadata and the value at the original are the same
+ * 1: not the same.
+ */
+static int handle_lam_test(void *src, unsigned int lam)
+{
+ char *ptr;
+
+ strcpy((char *)src, "USER POINTER");
+
+ ptr = (char *)set_metadata((uint64_t)src, lam);
+ if (src == ptr)
+ return 0;
+
+ /* Copy a string into the pointer with metadata */
+ strcpy((char *)ptr, "METADATA POINTER");
+
+ return (!!strcmp((char *)src, (char *)ptr));
+}
+
+
+int handle_max_bits(struct testcases *test)
+{
+ unsigned long exp_bits = get_default_tag_bits();
+ unsigned long bits = 0;
+
+ if (exp_bits != LAM_NONE)
+ exp_bits = LAM_U57_BITS;
+
+ /* Get LAM max tag bits */
+ if (syscall(SYS_arch_prctl, ARCH_GET_MAX_TAG_BITS, &bits) == -1)
+ return 1;
+
+ return (exp_bits != bits);
+}
+
+/*
+ * Test the LAM feature by dereferencing a pointer obtained from malloc().
+ * @return 0: test passed. 1: failure during the test. 2: got SIGSEGV.
+ */
+static int handle_malloc(struct testcases *test)
+{
+ char *ptr = NULL;
+ int ret = 0;
+
+ if (test->later == 0 && test->lam != 0)
+ if (set_lam(test->lam) == -1)
+ return 1;
+
+ ptr = (char *)malloc(MALLOC_LEN);
+ if (ptr == NULL) {
+ perror("malloc() failure\n");
+ return 1;
+ }
+
+ /* Set signal handler */
+ if (sigsetjmp(segv_env, 1) == 0) {
+ signal(SIGSEGV, segv_handler);
+ ret = handle_lam_test(ptr, test->lam);
+ } else {
+ ret = 2;
+ }
+
+ if (test->later != 0 && test->lam != 0)
+ if (set_lam(test->lam) == -1 && ret == 0)
+ ret = 1;
+
+ free(ptr);
+
+ return ret;
+}
+
+static int fork_test(struct testcases *test)
+{
+ int ret, child_ret;
+ pid_t pid;
+
+ pid = fork();
+ if (pid < 0) {
+ perror("Fork failed.");
+ ret = 1;
+ } else if (pid == 0) {
+ ret = test->test_func(test);
+ exit(ret);
+ } else {
+ wait(&child_ret);
+ ret = WEXITSTATUS(child_ret);
+ }
+
+ return ret;
+}
+
+static void run_test(struct testcases *test, int count)
+{
+ int i, ret = 0;
+
+ for (i = 0; i < count; i++) {
+ struct testcases *t = test + i;
+
+ /* fork a process to run test case */
+ ret = fork_test(t);
+ if (ret != 0)
+ ret = (t->expected == ret);
+ else
+ ret = !(t->expected);
+
+ tests_cnt++;
+ ksft_test_result(ret, t->msg);
+ }
+}
+
+static struct testcases malloc_cases[] = {
+ {
+ .later = 0,
+ .lam = LAM_U57_BITS,
+ .test_func = handle_malloc,
+ .msg = "MALLOC: LAM_U57. Dereferencing pointer with metadata\n",
+ },
+ {
+ .later = 1,
+ .expected = 2,
+ .lam = LAM_U57_BITS,
+ .test_func = handle_malloc,
+ .msg = "MALLOC:[Negative] Disable LAM. Dereferencing pointer with metadata.\n",
+ },
+};
+
+
+static struct testcases bits_cases[] = {
+ {
+ .test_func = handle_max_bits,
+ .msg = "BITS: Check default tag bits\n",
+ },
+};
+
+static void cmd_help(void)
+{
+ printf("usage: lam [-h] [-t test list]\n");
+ printf("\t-t test list: run tests specified in the test list, default:0x%x\n", TEST_MASK);
+ printf("\t\t0x1:malloc; 0x2:max_bits;\n");
+ printf("\t-h: help\n");
+}
+
+int main(int argc, char **argv)
+{
+ int c = 0;
+ unsigned int tests = TEST_MASK;
+
+ tests_cnt = 0;
+
+ if (!cpu_has_lam()) {
+ ksft_print_msg("Unsupported LAM feature!\n");
+ return -1;
+ }
+
+ while ((c = getopt(argc, argv, "ht:")) != -1) {
+ switch (c) {
+ case 't':
+ tests = strtoul(optarg, NULL, 16);
+ if (!(tests & TEST_MASK)) {
+ ksft_print_msg("Invalid argument!\n");
+ return -1;
+ }
+ break;
+ case 'h':
+ cmd_help();
+ return 0;
+ default:
+ ksft_print_msg("Invalid argument\n");
+ return -1;
+ }
+ }
+
+ if (tests & FUNC_MALLOC)
+ run_test(malloc_cases, ARRAY_SIZE(malloc_cases));
+
+ if (tests & FUNC_BITS)
+ run_test(bits_cases, ARRAY_SIZE(bits_cases));
+
+ ksft_set_plan(tests_cnt);
+
+ return ksft_exit_pass();
+}
--
2.38.0

2022-10-25 02:32:00

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCHv11 05/16] x86/uaccess: Provide untagged_addr() and remove tags before address check

untagged_addr() is a helper used by the core-mm to strip tag bits and
get the address into canonical shape. It only handles userspace
addresses. The untagging mask is stored in mmu_context and will be set
on enabling LAM for the process.

The tags must not be included in the check of whether it's okay to access the
userspace address.

Strip tags in access_ok().

get_user() and put_user() don't use access_ok(), but check access
against TASK_SIZE directly in assembly. Strip the tags before calling into
the assembly helper.
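
To illustrate the 'sign' trick used below, here is a standalone
re-implementation of the macro with two worked cases (illustration only, not
kernel code; the mask value assumes LAM_U57):

#include <stdint.h>
#include <stdio.h>

/* LAM_U57: bits 62:57 are tag bits, so the untag mask keeps 63 and 56:0 */
#define UNTAG_MASK 0x81ffffffffffffffULL

static uint64_t untag(uint64_t addr)
{
        int64_t sign = (int64_t)addr >> 63;     /* 0 for user, -1 for kernel */

        return addr & (UNTAG_MASK | (uint64_t)sign);
}

int main(void)
{
        uint64_t user   = 0x54007f1234567000ULL;  /* tag 0x2a in bits 62:57 */
        uint64_t kernel = 0xffff888012345000ULL;

        /* User pointer: sign == 0, the tag bits are cleared */
        printf("%#llx\n", (unsigned long long)untag(user));   /* 0x7f1234567000 */
        /* Kernel pointer: sign == -1, mask becomes all-ones, address intact */
        printf("%#llx\n", (unsigned long long)untag(kernel)); /* 0xffff888012345000 */
        return 0;
}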

Signed-off-by: Kirill A. Shutemov <[email protected]>
Tested-by: Alexander Potapenko <[email protected]>
Acked-by: Peter Zijlstra (Intel) <[email protected]>
---
arch/x86/include/asm/mmu.h | 3 +++
arch/x86/include/asm/mmu_context.h | 11 ++++++++
arch/x86/include/asm/uaccess.h | 42 +++++++++++++++++++++++++++---
arch/x86/kernel/process.c | 3 +++
4 files changed, 56 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/mmu.h b/arch/x86/include/asm/mmu.h
index 002889ca8978..2fdb390040b5 100644
--- a/arch/x86/include/asm/mmu.h
+++ b/arch/x86/include/asm/mmu.h
@@ -43,6 +43,9 @@ typedef struct {

/* Active LAM mode: X86_CR3_LAM_U48 or X86_CR3_LAM_U57 or 0 (disabled) */
unsigned long lam_cr3_mask;
+
+ /* Significant bits of the virtual address. Excludes tag bits. */
+ u64 untag_mask;
#endif

struct mutex lock;
diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
index 69c943b2ae90..5bd3d46685dc 100644
--- a/arch/x86/include/asm/mmu_context.h
+++ b/arch/x86/include/asm/mmu_context.h
@@ -100,6 +100,12 @@ static inline unsigned long mm_lam_cr3_mask(struct mm_struct *mm)
static inline void dup_lam(struct mm_struct *oldmm, struct mm_struct *mm)
{
mm->context.lam_cr3_mask = oldmm->context.lam_cr3_mask;
+ mm->context.untag_mask = oldmm->context.untag_mask;
+}
+
+static inline void mm_reset_untag_mask(struct mm_struct *mm)
+{
+ mm->context.untag_mask = -1UL;
}

#else
@@ -112,6 +118,10 @@ static inline unsigned long mm_lam_cr3_mask(struct mm_struct *mm)
static inline void dup_lam(struct mm_struct *oldmm, struct mm_struct *mm)
{
}
+
+static inline void mm_reset_untag_mask(struct mm_struct *mm)
+{
+}
#endif

#define enter_lazy_tlb enter_lazy_tlb
@@ -138,6 +148,7 @@ static inline int init_new_context(struct task_struct *tsk,
mm->context.execute_only_pkey = -1;
}
#endif
+ mm_reset_untag_mask(mm);
init_new_context_ldt(mm);
return 0;
}
diff --git a/arch/x86/include/asm/uaccess.h b/arch/x86/include/asm/uaccess.h
index 8bc614cfe21b..c6062c07ccd2 100644
--- a/arch/x86/include/asm/uaccess.h
+++ b/arch/x86/include/asm/uaccess.h
@@ -7,6 +7,7 @@
#include <linux/compiler.h>
#include <linux/instrumented.h>
#include <linux/kasan-checks.h>
+#include <linux/mm_types.h>
#include <linux/string.h>
#include <asm/asm.h>
#include <asm/page.h>
@@ -21,6 +22,30 @@ static inline bool pagefault_disabled(void);
# define WARN_ON_IN_IRQ()
#endif

+#ifdef CONFIG_X86_64
+/*
+ * Mask out tag bits from the address.
+ *
+ * Magic with the 'sign' allows to untag userspace pointer without any branches
+ * while leaving kernel addresses intact.
+ */
+#define untagged_addr(mm, addr) ({ \
+ u64 __addr = (__force u64)(addr); \
+ s64 sign = (s64)__addr >> 63; \
+ __addr &= (mm)->context.untag_mask | sign; \
+ (__force __typeof__(addr))__addr; \
+})
+
+#define untagged_ptr(mm, ptr) ({ \
+ u64 __ptrval = (__force u64)(ptr); \
+ __ptrval = untagged_addr(mm, __ptrval); \
+ (__force __typeof__(*(ptr)) *)__ptrval; \
+})
+#else
+#define untagged_addr(mm, addr) (addr)
+#define untagged_ptr(mm, ptr) (ptr)
+#endif
+
/**
* access_ok - Checks if a user space pointer is valid
* @addr: User space pointer to start of block to check
@@ -41,7 +66,7 @@ static inline bool pagefault_disabled(void);
#define access_ok(addr, size) \
({ \
WARN_ON_IN_IRQ(); \
- likely(__access_ok(addr, size)); \
+ likely(__access_ok(untagged_addr(current->mm, addr), size)); \
})

#include <asm-generic/access_ok.h>
@@ -127,7 +152,13 @@ extern int __get_user_bad(void);
* Return: zero on success, or -EFAULT on error.
* On error, the variable @x is set to zero.
*/
-#define get_user(x,ptr) ({ might_fault(); do_get_user_call(get_user,x,ptr); })
+#define get_user(x,ptr) \
+({ \
+ __typeof__(*(ptr)) __user *__ptr_clean; \
+ __ptr_clean = untagged_ptr(current->mm, ptr); \
+ might_fault(); \
+ do_get_user_call(get_user,x,__ptr_clean); \
+})

/**
* __get_user - Get a simple variable from user space, with less checking.
@@ -227,7 +258,12 @@ extern void __put_user_nocheck_8(void);
*
* Return: zero on success, or -EFAULT on error.
*/
-#define put_user(x, ptr) ({ might_fault(); do_put_user_call(put_user,x,ptr); })
+#define put_user(x, ptr) ({ \
+ __typeof__(*(ptr)) __user *__ptr_clean; \
+ __ptr_clean = untagged_ptr(current->mm, ptr); \
+ might_fault(); \
+ do_put_user_call(put_user,x,__ptr_clean); \
+})

/**
* __put_user - Write a simple value into user space, with less checking.
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index c21b7347a26d..d1e83ba21130 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -47,6 +47,7 @@
#include <asm/frame.h>
#include <asm/unwind.h>
#include <asm/tdx.h>
+#include <asm/mmu_context.h>

#include "process.h"

@@ -367,6 +368,8 @@ void arch_setup_new_exec(void)
task_clear_spec_ssb_noexec(current);
speculation_ctrl_update(read_thread_flags());
}
+
+ mm_reset_untag_mask(current->mm);
}

#ifdef CONFIG_X86_IOPL_IOPERM
--
2.38.0

2022-10-25 02:33:44

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCHv11 04/16] x86/mm: Handle LAM on context switch

Linear Address Masking mode for userspace pointers is encoded in CR3 bits.
The mode is selected per-process and stored in mm_context_t.

switch_mm_irqs_off() now respects the selected LAM mode and constructs CR3
accordingly.

The active LAM mode gets recorded in the tlb_state.
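
For reference, a simplified sketch of how the CR3 value is composed (per the
ISE, LAM_U48 is CR3 bit 61 and LAM_U57 is bit 62; SME, NOFLUSH and the exact
PCID encoding are omitted, so this is only an illustration of the layout):

#include <stdint.h>
#include <stdio.h>

#define X86_CR3_LAM_U48 (1ULL << 61)
#define X86_CR3_LAM_U57 (1ULL << 62)

/* Rough shape of build_cr3(): page-table root | LAM mode | PCID */
static uint64_t sketch_build_cr3(uint64_t pgd_pa, uint16_t pcid, uint64_t lam)
{
        return pgd_pa | lam | pcid;
}

int main(void)
{
        /* e.g. a LAM_U57 process with PCID 5 and pgd at 0x1234000 */
        printf("%#llx\n", (unsigned long long)
               sketch_build_cr3(0x1234000, 5, X86_CR3_LAM_U57)); /* 0x4000000001234005 */
        return 0;
}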

Signed-off-by: Kirill A. Shutemov <[email protected]>
Tested-by: Alexander Potapenko <[email protected]>
Acked-by: Peter Zijlstra (Intel) <[email protected]>
---
arch/x86/include/asm/mmu.h | 3 ++
arch/x86/include/asm/mmu_context.h | 24 +++++++++++++++
arch/x86/include/asm/tlbflush.h | 34 +++++++++++++++++++++
arch/x86/mm/tlb.c | 48 ++++++++++++++++++++----------
4 files changed, 93 insertions(+), 16 deletions(-)

diff --git a/arch/x86/include/asm/mmu.h b/arch/x86/include/asm/mmu.h
index 5d7494631ea9..002889ca8978 100644
--- a/arch/x86/include/asm/mmu.h
+++ b/arch/x86/include/asm/mmu.h
@@ -40,6 +40,9 @@ typedef struct {

#ifdef CONFIG_X86_64
unsigned short flags;
+
+ /* Active LAM mode: X86_CR3_LAM_U48 or X86_CR3_LAM_U57 or 0 (disabled) */
+ unsigned long lam_cr3_mask;
#endif

struct mutex lock;
diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
index b8d40ddeab00..69c943b2ae90 100644
--- a/arch/x86/include/asm/mmu_context.h
+++ b/arch/x86/include/asm/mmu_context.h
@@ -91,6 +91,29 @@ static inline void switch_ldt(struct mm_struct *prev, struct mm_struct *next)
}
#endif

+#ifdef CONFIG_X86_64
+static inline unsigned long mm_lam_cr3_mask(struct mm_struct *mm)
+{
+ return mm->context.lam_cr3_mask;
+}
+
+static inline void dup_lam(struct mm_struct *oldmm, struct mm_struct *mm)
+{
+ mm->context.lam_cr3_mask = oldmm->context.lam_cr3_mask;
+}
+
+#else
+
+static inline unsigned long mm_lam_cr3_mask(struct mm_struct *mm)
+{
+ return 0;
+}
+
+static inline void dup_lam(struct mm_struct *oldmm, struct mm_struct *mm)
+{
+}
+#endif
+
#define enter_lazy_tlb enter_lazy_tlb
extern void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk);

@@ -168,6 +191,7 @@ static inline int arch_dup_mmap(struct mm_struct *oldmm, struct mm_struct *mm)
{
arch_dup_pkeys(oldmm, mm);
paravirt_arch_dup_mmap(oldmm, mm);
+ dup_lam(oldmm, mm);
return ldt_dup_context(oldmm, mm);
}

diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index cda3118f3b27..662598dea937 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -101,6 +101,16 @@ struct tlb_state {
*/
bool invalidate_other;

+#ifdef CONFIG_X86_64
+ /*
+ * Active LAM mode.
+ *
+ * X86_CR3_LAM_U57/U48 shifted right by X86_CR3_LAM_U57_BIT or 0 if LAM
+ * disabled.
+ */
+ u8 lam;
+#endif
+
/*
* Mask that contains TLB_NR_DYN_ASIDS+1 bits to indicate
* the corresponding user PCID needs a flush next time we
@@ -357,6 +367,30 @@ static inline bool huge_pmd_needs_flush(pmd_t oldpmd, pmd_t newpmd)
}
#define huge_pmd_needs_flush huge_pmd_needs_flush

+#ifdef CONFIG_X86_64
+static inline unsigned long tlbstate_lam_cr3_mask(void)
+{
+ unsigned long lam = this_cpu_read(cpu_tlbstate.lam);
+
+ return lam << X86_CR3_LAM_U57_BIT;
+}
+
+static inline void set_tlbstate_cr3_lam_mask(unsigned long mask)
+{
+ this_cpu_write(cpu_tlbstate.lam, mask >> X86_CR3_LAM_U57_BIT);
+}
+
+#else
+
+static inline unsigned long tlbstate_lam_cr3_mask(void)
+{
+ return 0;
+}
+
+static inline void set_tlbstate_cr3_lam_mask(u64 mask)
+{
+}
+#endif
#endif /* !MODULE */

static inline void __native_tlb_flush_global(unsigned long cr4)
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index c1e31e9a85d7..d6c9c15d2ad2 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -154,26 +154,30 @@ static inline u16 user_pcid(u16 asid)
return ret;
}

-static inline unsigned long build_cr3(pgd_t *pgd, u16 asid)
+static inline unsigned long build_cr3(pgd_t *pgd, u16 asid, unsigned long lam)
{
+ unsigned long cr3 = __sme_pa(pgd) | lam;
+
if (static_cpu_has(X86_FEATURE_PCID)) {
- return __sme_pa(pgd) | kern_pcid(asid);
+ VM_WARN_ON_ONCE(asid > MAX_ASID_AVAILABLE);
+ cr3 |= kern_pcid(asid);
} else {
VM_WARN_ON_ONCE(asid != 0);
- return __sme_pa(pgd);
}
+
+ return cr3;
}

-static inline unsigned long build_cr3_noflush(pgd_t *pgd, u16 asid)
+static inline unsigned long build_cr3_noflush(pgd_t *pgd, u16 asid,
+ unsigned long lam)
{
- VM_WARN_ON_ONCE(asid > MAX_ASID_AVAILABLE);
/*
* Use boot_cpu_has() instead of this_cpu_has() as this function
* might be called during early boot. This should work even after
* boot because all CPU's the have same capabilities:
*/
VM_WARN_ON_ONCE(!boot_cpu_has(X86_FEATURE_PCID));
- return __sme_pa(pgd) | kern_pcid(asid) | CR3_NOFLUSH;
+ return build_cr3(pgd, asid, lam) | CR3_NOFLUSH;
}

/*
@@ -274,15 +278,16 @@ static inline void invalidate_user_asid(u16 asid)
(unsigned long *)this_cpu_ptr(&cpu_tlbstate.user_pcid_flush_mask));
}

-static void load_new_mm_cr3(pgd_t *pgdir, u16 new_asid, bool need_flush)
+static void load_new_mm_cr3(pgd_t *pgdir, u16 new_asid, unsigned long lam,
+ bool need_flush)
{
unsigned long new_mm_cr3;

if (need_flush) {
invalidate_user_asid(new_asid);
- new_mm_cr3 = build_cr3(pgdir, new_asid);
+ new_mm_cr3 = build_cr3(pgdir, new_asid, lam);
} else {
- new_mm_cr3 = build_cr3_noflush(pgdir, new_asid);
+ new_mm_cr3 = build_cr3_noflush(pgdir, new_asid, lam);
}

/*
@@ -491,6 +496,8 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
{
struct mm_struct *real_prev = this_cpu_read(cpu_tlbstate.loaded_mm);
u16 prev_asid = this_cpu_read(cpu_tlbstate.loaded_mm_asid);
+ unsigned long prev_lam = tlbstate_lam_cr3_mask();
+ unsigned long new_lam = mm_lam_cr3_mask(next);
bool was_lazy = this_cpu_read(cpu_tlbstate_shared.is_lazy);
unsigned cpu = smp_processor_id();
u64 next_tlb_gen;
@@ -520,7 +527,7 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
* isn't free.
*/
#ifdef CONFIG_DEBUG_VM
- if (WARN_ON_ONCE(__read_cr3() != build_cr3(real_prev->pgd, prev_asid))) {
+ if (WARN_ON_ONCE(__read_cr3() != build_cr3(real_prev->pgd, prev_asid, prev_lam))) {
/*
* If we were to BUG here, we'd be very likely to kill
* the system so hard that we don't see the call trace.
@@ -554,6 +561,7 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
if (real_prev == next) {
VM_WARN_ON(this_cpu_read(cpu_tlbstate.ctxs[prev_asid].ctx_id) !=
next->context.ctx_id);
+ VM_WARN_ON(prev_lam != new_lam);

/*
* Even in lazy TLB mode, the CPU should stay set in the
@@ -622,15 +630,16 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
barrier();
}

+ set_tlbstate_cr3_lam_mask(new_lam);
if (need_flush) {
this_cpu_write(cpu_tlbstate.ctxs[new_asid].ctx_id, next->context.ctx_id);
this_cpu_write(cpu_tlbstate.ctxs[new_asid].tlb_gen, next_tlb_gen);
- load_new_mm_cr3(next->pgd, new_asid, true);
+ load_new_mm_cr3(next->pgd, new_asid, new_lam, true);

trace_tlb_flush(TLB_FLUSH_ON_TASK_SWITCH, TLB_FLUSH_ALL);
} else {
/* The new ASID is already up to date. */
- load_new_mm_cr3(next->pgd, new_asid, false);
+ load_new_mm_cr3(next->pgd, new_asid, new_lam, false);

trace_tlb_flush(TLB_FLUSH_ON_TASK_SWITCH, 0);
}
@@ -691,6 +700,10 @@ void initialize_tlbstate_and_flush(void)
/* Assert that CR3 already references the right mm. */
WARN_ON((cr3 & CR3_ADDR_MASK) != __pa(mm->pgd));

+ /* LAM expected to be disabled in CR3 and init_mm */
+ WARN_ON(cr3 & (X86_CR3_LAM_U48 | X86_CR3_LAM_U57));
+ WARN_ON(mm_lam_cr3_mask(&init_mm));
+
/*
* Assert that CR4.PCIDE is set if needed. (CR4.PCIDE initialization
* doesn't work like other CR4 bits because it can only be set from
@@ -699,8 +712,8 @@ void initialize_tlbstate_and_flush(void)
WARN_ON(boot_cpu_has(X86_FEATURE_PCID) &&
!(cr4_read_shadow() & X86_CR4_PCIDE));

- /* Force ASID 0 and force a TLB flush. */
- write_cr3(build_cr3(mm->pgd, 0));
+ /* Disable LAM, force ASID 0 and force a TLB flush. */
+ write_cr3(build_cr3(mm->pgd, 0, 0));

/* Reinitialize tlbstate. */
this_cpu_write(cpu_tlbstate.last_user_mm_spec, LAST_USER_MM_INIT);
@@ -708,6 +721,7 @@ void initialize_tlbstate_and_flush(void)
this_cpu_write(cpu_tlbstate.next_asid, 1);
this_cpu_write(cpu_tlbstate.ctxs[0].ctx_id, mm->context.ctx_id);
this_cpu_write(cpu_tlbstate.ctxs[0].tlb_gen, tlb_gen);
+ set_tlbstate_cr3_lam_mask(0);

for (i = 1; i < TLB_NR_DYN_ASIDS; i++)
this_cpu_write(cpu_tlbstate.ctxs[i].ctx_id, 0);
@@ -1071,8 +1085,10 @@ void flush_tlb_kernel_range(unsigned long start, unsigned long end)
*/
unsigned long __get_current_cr3_fast(void)
{
- unsigned long cr3 = build_cr3(this_cpu_read(cpu_tlbstate.loaded_mm)->pgd,
- this_cpu_read(cpu_tlbstate.loaded_mm_asid));
+ unsigned long cr3 =
+ build_cr3(this_cpu_read(cpu_tlbstate.loaded_mm)->pgd,
+ this_cpu_read(cpu_tlbstate.loaded_mm_asid),
+ tlbstate_lam_cr3_mask());

/* For now, be very restrictive about when this can be called. */
VM_WARN_ON(in_nmi() || preemptible());
--
2.38.0

2022-10-25 02:34:22

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCHv11 10/16] iommu/sva: Replace pasid_valid() helper with mm_valid_pasid()

The kernel has a few users of pasid_valid(), and all but one check whether the
process has a PASID allocated. The helper takes ioasid_t as the input.

Replace the helper with mm_valid_pasid(), which takes mm_struct as the
argument. The only call that checks a PASID not tied to an mm_struct is
open-coded now.

This is a preparatory patch. It helps avoid ifdeffery: there is no need to
dereference mm->pasid in generic code to check whether the process has a PASID.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
arch/x86/kernel/traps.c | 6 +++---
drivers/iommu/iommu-sva-lib.c | 4 ++--
include/linux/ioasid.h | 9 ---------
include/linux/sched/mm.h | 8 +++++++-
4 files changed, 12 insertions(+), 15 deletions(-)

diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index 178015a820f0..3bdc37ae7d1c 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -664,15 +664,15 @@ static bool try_fixup_enqcmd_gp(void)
if (!cpu_feature_enabled(X86_FEATURE_ENQCMD))
return false;

- pasid = current->mm->pasid;
-
/*
* If the mm has not been allocated a
* PASID, the #GP can not be fixed up.
*/
- if (!pasid_valid(pasid))
+ if (!mm_valid_pasid(current->mm))
return false;

+ pasid = current->mm->pasid;
+
/*
* Did this thread already have its PASID activated?
* If so, the #GP must be from something else.
diff --git a/drivers/iommu/iommu-sva-lib.c b/drivers/iommu/iommu-sva-lib.c
index 106506143896..27be6b81e0b5 100644
--- a/drivers/iommu/iommu-sva-lib.c
+++ b/drivers/iommu/iommu-sva-lib.c
@@ -33,14 +33,14 @@ int iommu_sva_alloc_pasid(struct mm_struct *mm, ioasid_t min, ioasid_t max)

mutex_lock(&iommu_sva_lock);
/* Is a PASID already associated with this mm? */
- if (pasid_valid(mm->pasid)) {
+ if (mm_valid_pasid(mm)) {
if (mm->pasid < min || mm->pasid >= max)
ret = -EOVERFLOW;
goto out;
}

pasid = ioasid_alloc(&iommu_sva_pasid, min, max, mm);
- if (!pasid_valid(pasid))
+ if (pasid == INVALID_IOASID)
ret = -ENOMEM;
else
mm_pasid_set(mm, pasid);
diff --git a/include/linux/ioasid.h b/include/linux/ioasid.h
index af1c9d62e642..836ae09e92c2 100644
--- a/include/linux/ioasid.h
+++ b/include/linux/ioasid.h
@@ -40,10 +40,6 @@ void *ioasid_find(struct ioasid_set *set, ioasid_t ioasid,
int ioasid_register_allocator(struct ioasid_allocator_ops *allocator);
void ioasid_unregister_allocator(struct ioasid_allocator_ops *allocator);
int ioasid_set_data(ioasid_t ioasid, void *data);
-static inline bool pasid_valid(ioasid_t ioasid)
-{
- return ioasid != INVALID_IOASID;
-}

#else /* !CONFIG_IOASID */
static inline ioasid_t ioasid_alloc(struct ioasid_set *set, ioasid_t min,
@@ -74,10 +70,5 @@ static inline int ioasid_set_data(ioasid_t ioasid, void *data)
return -ENOTSUPP;
}

-static inline bool pasid_valid(ioasid_t ioasid)
-{
- return false;
-}
-
#endif /* CONFIG_IOASID */
#endif /* __LINUX_IOASID_H */
diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
index 2a243616f222..b69fe7e8c0ac 100644
--- a/include/linux/sched/mm.h
+++ b/include/linux/sched/mm.h
@@ -457,6 +457,11 @@ static inline void mm_pasid_init(struct mm_struct *mm)
mm->pasid = INVALID_IOASID;
}

+static inline bool mm_valid_pasid(struct mm_struct *mm)
+{
+ return mm->pasid != INVALID_IOASID;
+}
+
/* Associate a PASID with an mm_struct: */
static inline void mm_pasid_set(struct mm_struct *mm, u32 pasid)
{
@@ -465,13 +470,14 @@ static inline void mm_pasid_set(struct mm_struct *mm, u32 pasid)

static inline void mm_pasid_drop(struct mm_struct *mm)
{
- if (pasid_valid(mm->pasid)) {
+ if (mm_valid_pasid(mm)) {
ioasid_free(mm->pasid);
mm->pasid = INVALID_IOASID;
}
}
#else
static inline void mm_pasid_init(struct mm_struct *mm) {}
+static inline bool mm_valid_pasid(struct mm_struct *mm) { return false; }
static inline void mm_pasid_set(struct mm_struct *mm, u32 pasid) {}
static inline void mm_pasid_drop(struct mm_struct *mm) {}
#endif
--
2.38.0

2022-11-07 11:46:31

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCHv11 00/16] Linear Address Masking enabling

On Tue, Oct 25, 2022 at 03:17:06AM +0300, Kirill A. Shutemov wrote:
> Linear Address Masking[1] (LAM) modifies the checking that is applied to
> 64-bit linear addresses, allowing software to use of the untranslated
> address bits for metadata.
>
> The capability can be used for efficient address sanitizers (ASAN)
> implementation and for optimizations in JITs and virtual machines.
>
> The patchset brings support for LAM for userspace addresses. Only LAM_U57 at
> this time.
>
> Please review and consider applying.

Ping?

--
Kiryl Shutsemau / Kirill A. Shutemov

2022-11-07 15:15:41

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCHv11 00/16] Linear Address Masking enabling

On 10/24/22 17:17, Kirill A. Shutemov wrote:
> Linear Address Masking[1] (LAM) modifies the checking that is applied to
> 64-bit linear addresses, allowing software to use of the untranslated
> address bits for metadata.
>
> The capability can be used for efficient address sanitizers (ASAN)
> implementation and for optimizations in JITs and virtual machines.
>
> The patchset brings support for LAM for userspace addresses. Only LAM_U57 at
> this time.
>
> Please review and consider applying.

I'm pretty happy with this series other than my two comments (switch_mm
race and explaining why untagged_addr needs to preserve kernel addresses).

>
> Results for the self-tests:
>
> ok 1 MALLOC: LAM_U57. Dereferencing pointer with metadata
> # Get segmentation fault(11).ok 2 MALLOC:[Negative] Disable LAM. Dereferencing pointer with metadata.
> ok 3 BITS: Check default tag bits
> ok 4 # SKIP MMAP: First mmap high address, then set LAM_U57.
> ok 5 # SKIP MMAP: First LAM_U57, then High address.
> ok 6 MMAP: First LAM_U57, then Low address.
> ok 7 SYSCALL: LAM_U57. syscall with metadata
> ok 8 SYSCALL:[Negative] Disable LAM. Dereferencing pointer with metadata.
> ok 9 URING: LAM_U57. Dereferencing pointer with metadata
> ok 10 URING:[Negative] Disable LAM. Dereferencing pointer with metadata.
> ok 11 FORK: LAM_U57, child process should get LAM mode same as parent
> ok 12 EXECVE: LAM_U57, child process should get disabled LAM mode
> open: Device or resource busy
> ok 13 PASID: [Negative] Execute LAM, PASID, SVA in sequence
> ok 14 PASID: Execute LAM, SVA, PASID in sequence
> ok 15 PASID: [Negative] Execute PASID, LAM, SVA in sequence
> ok 16 PASID: Execute PASID, SVA, LAM in sequence
> ok 17 PASID: Execute SVA, LAM, PASID in sequence
> ok 18 PASID: Execute SVA, PASID, LAM in sequence
> 1..18
>
> git://git.kernel.org/pub/scm/linux/kernel/git/kas/linux.git lam
>
> v11:
> - Move untag_mask to /proc/$PID/status;
> - s/SVM/SVA/g;
> - static inline arch_pgtable_dma_compat() instead of macros;
> - Replace pasid_valid() with mm_valid_pasid();
> - Acks from Ashok and Jacob (forgot to apply from v9);
> v10:
> - Rebased to v6.1-rc1;
> - Add selftest for SVM vs LAM;
> v9:
> - Fix race between LAM enabling and check that KVM memslot address doesn't
> have any tags;
> - Reduce untagged_addr() overhead until the first LAM user;
> - Clarify SVM vs. LAM semantics;
> - Use mmap_lock to serialize LAM enabling;
> v8:
> - Drop redundant smb_mb() in prctl_enable_tagged_addr();
> - Cleanup code around build_cr3();
> - Fix commit messages;
> - Selftests updates;
> - Acked/Reviewed/Tested-bys from Alexander and Peter;
> v7:
> - Drop redundant smb_mb() in prctl_enable_tagged_addr();
> - Cleanup code around build_cr3();
> - Fix commit message;
> - Fix indentation;
> v6:
> - Rebased onto v6.0-rc1
> - LAM_U48 excluded from the patchet. Still available in the git tree;
> - add ARCH_GET_MAX_TAG_BITS;
> - Fix build without CONFIG_DEBUG_VM;
> - Update comments;
> - Reviewed/Tested-by from Alexander;
> v5:
> - Do not use switch_mm() in enable_lam_func()
> - Use mb()/READ_ONCE() pair on LAM enabling;
> - Add self-test by Weihong Zhang;
> - Add comments;
> v4:
> - Fix untagged_addr() for LAM_U48;
> - Remove no-threads restriction on LAM enabling;
> - Fix mm_struct access from /proc/$PID/arch_status
> - Fix LAM handling in initialize_tlbstate_and_flush()
> - Pack tlb_state better;
> - Comments and commit messages;
> v3:
> - Rebased onto v5.19-rc1
> - Per-process enabling;
> - API overhaul (again);
> - Avoid branches and costly computations in the fast path;
> - LAM_U48 is in optional patch.
> v2:
> - Rebased onto v5.18-rc1
> - New arch_prctl(2)-based API
> - Expose status of LAM (or other thread features) in
> /proc/$PID/arch_status
>
> [1] ISE, Chapter 10. https://cdrdv2.intel.com/v1/dl/getContent/671368
> Kirill A. Shutemov (11):
> x86/mm: Fix CR3_ADDR_MASK
> x86: CPUID and CR3/CR4 flags for Linear Address Masking
> mm: Pass down mm_struct to untagged_addr()
> x86/mm: Handle LAM on context switch
> x86/uaccess: Provide untagged_addr() and remove tags before address
> check
> KVM: Serialize tagged address check against tagging enabling
> x86/mm: Provide arch_prctl() interface for LAM
> x86/mm: Reduce untagged_addr() overhead until the first LAM user
> mm: Expose untagging mask in /proc/$PID/status
> iommu/sva: Replace pasid_valid() helper with mm_valid_pasid()
> x86/mm, iommu/sva: Make LAM and SVA mutually exclusive
>
> Weihong Zhang (5):
> selftests/x86/lam: Add malloc and tag-bits test cases for
> linear-address masking
> selftests/x86/lam: Add mmap and SYSCALL test cases for linear-address
> masking
> selftests/x86/lam: Add io_uring test cases for linear-address masking
> selftests/x86/lam: Add inherit test cases for linear-address masking
> selftests/x86/lam: Add ARCH_FORCE_TAGGED_SVA test cases for
> linear-address masking
>
> arch/arm64/include/asm/memory.h | 4 +-
> arch/arm64/include/asm/mmu_context.h | 6 +
> arch/arm64/include/asm/signal.h | 2 +-
> arch/arm64/include/asm/uaccess.h | 2 +-
> arch/arm64/kernel/hw_breakpoint.c | 2 +-
> arch/arm64/kernel/traps.c | 4 +-
> arch/arm64/mm/fault.c | 10 +-
> arch/sparc/include/asm/mmu_context_64.h | 6 +
> arch/sparc/include/asm/pgtable_64.h | 2 +-
> arch/sparc/include/asm/uaccess_64.h | 2 +
> arch/x86/include/asm/cpufeatures.h | 1 +
> arch/x86/include/asm/mmu.h | 12 +-
> arch/x86/include/asm/mmu_context.h | 47 +
> arch/x86/include/asm/processor-flags.h | 4 +-
> arch/x86/include/asm/tlbflush.h | 34 +
> arch/x86/include/asm/uaccess.h | 46 +-
> arch/x86/include/uapi/asm/prctl.h | 5 +
> arch/x86/include/uapi/asm/processor-flags.h | 6 +
> arch/x86/kernel/process.c | 3 +
> arch/x86/kernel/process_64.c | 81 +-
> arch/x86/kernel/traps.c | 6 +-
> arch/x86/mm/tlb.c | 48 +-
> .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c | 2 +-
> drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c | 2 +-
> drivers/gpu/drm/radeon/radeon_gem.c | 2 +-
> drivers/infiniband/hw/mlx4/mr.c | 2 +-
> drivers/iommu/iommu-sva-lib.c | 16 +-
> drivers/media/common/videobuf2/frame_vector.c | 2 +-
> drivers/media/v4l2-core/videobuf-dma-contig.c | 2 +-
> .../staging/media/atomisp/pci/hmm/hmm_bo.c | 2 +-
> drivers/tee/tee_shm.c | 2 +-
> drivers/vfio/vfio_iommu_type1.c | 2 +-
> fs/proc/array.c | 6 +
> fs/proc/task_mmu.c | 2 +-
> include/linux/ioasid.h | 9 -
> include/linux/mm.h | 11 -
> include/linux/mmu_context.h | 14 +
> include/linux/sched/mm.h | 8 +-
> include/linux/uaccess.h | 15 +
> lib/strncpy_from_user.c | 2 +-
> lib/strnlen_user.c | 2 +-
> mm/gup.c | 6 +-
> mm/madvise.c | 2 +-
> mm/mempolicy.c | 6 +-
> mm/migrate.c | 2 +-
> mm/mincore.c | 2 +-
> mm/mlock.c | 4 +-
> mm/mmap.c | 2 +-
> mm/mprotect.c | 2 +-
> mm/mremap.c | 2 +-
> mm/msync.c | 2 +-
> tools/testing/selftests/x86/Makefile | 2 +-
> tools/testing/selftests/x86/lam.c | 1149 +++++++++++++++++
> virt/kvm/kvm_main.c | 14 +-
> 54 files changed, 1539 insertions(+), 92 deletions(-)
> create mode 100644 tools/testing/selftests/x86/lam.c
>


2022-11-07 15:30:10

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCHv11 05/16] x86/uaccess: Provide untagged_addr() and remove tags before address check

On 10/24/22 17:17, Kirill A. Shutemov wrote:
> untagged_addr() is a helper used by the core-mm to strip tag bits and
> get the address to the canonical shape. In only handles userspace
> addresses. The untagging mask is stored in mmu_context and will be set
> on enabling LAM for the process.
>
> The tags must not be included into check whether it's okay to access the
> userspace address.
>
> Strip tags in access_ok().
>
> get_user() and put_user() don't use access_ok(), but check access
> against TASK_SIZE directly in assembly. Strip tags, before calling into
> the assembly helper.
>
> Signed-off-by: Kirill A. Shutemov <[email protected]>
> Tested-by: Alexander Potapenko <[email protected]>
> Acked-by: Peter Zijlstra (Intel) <[email protected]>
> ---
> arch/x86/include/asm/mmu.h | 3 +++
> arch/x86/include/asm/mmu_context.h | 11 ++++++++
> arch/x86/include/asm/uaccess.h | 42 +++++++++++++++++++++++++++---
> arch/x86/kernel/process.c | 3 +++
> 4 files changed, 56 insertions(+), 3 deletions(-)
>
> diff --git a/arch/x86/include/asm/mmu.h b/arch/x86/include/asm/mmu.h
> index 002889ca8978..2fdb390040b5 100644
> --- a/arch/x86/include/asm/mmu.h
> +++ b/arch/x86/include/asm/mmu.h
> @@ -43,6 +43,9 @@ typedef struct {
>
> /* Active LAM mode: X86_CR3_LAM_U48 or X86_CR3_LAM_U57 or 0 (disabled) */
> unsigned long lam_cr3_mask;
> +
> + /* Significant bits of the virtual address. Excludes tag bits. */
> + u64 untag_mask;
> #endif
>
> struct mutex lock;
> diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
> index 69c943b2ae90..5bd3d46685dc 100644
> --- a/arch/x86/include/asm/mmu_context.h
> +++ b/arch/x86/include/asm/mmu_context.h
> @@ -100,6 +100,12 @@ static inline unsigned long mm_lam_cr3_mask(struct mm_struct *mm)
> static inline void dup_lam(struct mm_struct *oldmm, struct mm_struct *mm)
> {
> mm->context.lam_cr3_mask = oldmm->context.lam_cr3_mask;
> + mm->context.untag_mask = oldmm->context.untag_mask;
> +}
> +
> +static inline void mm_reset_untag_mask(struct mm_struct *mm)
> +{
> + mm->context.untag_mask = -1UL;
> }
>
> #else
> @@ -112,6 +118,10 @@ static inline unsigned long mm_lam_cr3_mask(struct mm_struct *mm)
> static inline void dup_lam(struct mm_struct *oldmm, struct mm_struct *mm)
> {
> }
> +
> +static inline void mm_reset_untag_mask(struct mm_struct *mm)
> +{
> +}
> #endif
>
> #define enter_lazy_tlb enter_lazy_tlb
> @@ -138,6 +148,7 @@ static inline int init_new_context(struct task_struct *tsk,
> mm->context.execute_only_pkey = -1;
> }
> #endif
> + mm_reset_untag_mask(mm);
> init_new_context_ldt(mm);
> return 0;
> }
> diff --git a/arch/x86/include/asm/uaccess.h b/arch/x86/include/asm/uaccess.h
> index 8bc614cfe21b..c6062c07ccd2 100644
> --- a/arch/x86/include/asm/uaccess.h
> +++ b/arch/x86/include/asm/uaccess.h
> @@ -7,6 +7,7 @@
> #include <linux/compiler.h>
> #include <linux/instrumented.h>
> #include <linux/kasan-checks.h>
> +#include <linux/mm_types.h>
> #include <linux/string.h>
> #include <asm/asm.h>
> #include <asm/page.h>
> @@ -21,6 +22,30 @@ static inline bool pagefault_disabled(void);
> # define WARN_ON_IN_IRQ()
> #endif
>
> +#ifdef CONFIG_X86_64
> +/*
> + * Mask out tag bits from the address.
> + *
> + * Magic with the 'sign' allows to untag userspace pointer without any branches
> + * while leaving kernel addresses intact.
> + */
> +#define untagged_addr(mm, addr) ({ \
> + u64 __addr = (__force u64)(addr); \
> + s64 sign = (s64)__addr >> 63; \
> + __addr &= (mm)->context.untag_mask | sign; \
> + (__force __typeof__(addr))__addr; \
> +})
> +

I think this implementation is correct, but I'm wondering if there are
any callers of untagged_addr that actually need to preserve kernel
addresses. Are there? (There certainly *were* back when we had set_fs().)

I'm also mildly uneasy about a potential edge case. Naively, one would
expect:

untagged_addr(current->mm, addr) + size ==
untagged_addr(current->mm, addr + size)

at least for an address that is valid enough to be potentially
dereferenced. This isn't true any more for size that overflows into the
tag bit range.

I *think* we're okay though -- __access_ok requires that addr <= limit -
size, so any range that overflows into tag bits will be rejected even if
the entire range consists of valid (tagged) user addresses.
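
A standalone illustration of that point, reusing the untagging expression from
the patch with the LAM_U57 mask (the numbers are made up):

#include <stdint.h>
#include <stdio.h>

#define UNTAG_MASK 0x81ffffffffffffffULL   /* LAM_U57: bits 62:57 are tags */

static uint64_t untag(uint64_t a)
{
        return a & (UNTAG_MASK | (uint64_t)((int64_t)a >> 63));
}

int main(void)
{
        /*
         * A tagged pointer (tag == 1) whose low 57 bits are near their
         * maximum, plus a size that carries into the tag bits.
         */
        uint64_t addr = (1ULL << 57) | 0x01fffffffffff000ULL;
        uint64_t size = 0x2000;

        printf("%#llx\n", (unsigned long long)(untag(addr) + size)); /* 0x200000000001000 */
        printf("%#llx\n", (unsigned long long)untag(addr + size));   /* 0x1000 */

        /*
         * As noted above, such a range is rejected by __access_ok()
         * anyway: the untagged base already exceeds any user limit.
         */
        return 0;
}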

So:

Acked-by: Andy Lutomirski <[email protected]>


2022-11-07 16:03:19

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCHv11 04/16] x86/mm: Handle LAM on context switch

On 10/24/22 17:17, Kirill A. Shutemov wrote:
> Linear Address Masking mode for userspace pointers encoded in CR3 bits.
> The mode is selected per-process and stored in mm_context_t.
>
> switch_mm_irqs_off() now respects selected LAM mode and constructs CR3
> accordingly.
>
> The active LAM mode gets recorded in the tlb_state.
>
> Signed-off-by: Kirill A. Shutemov <[email protected]>
> Tested-by: Alexander Potapenko <[email protected]>
> Acked-by: Peter Zijlstra (Intel) <[email protected]>
> ---
> arch/x86/include/asm/mmu.h | 3 ++
> arch/x86/include/asm/mmu_context.h | 24 +++++++++++++++
> arch/x86/include/asm/tlbflush.h | 34 +++++++++++++++++++++
> arch/x86/mm/tlb.c | 48 ++++++++++++++++++++----------
> 4 files changed, 93 insertions(+), 16 deletions(-)
>
> diff --git a/arch/x86/include/asm/mmu.h b/arch/x86/include/asm/mmu.h
> index 5d7494631ea9..002889ca8978 100644
> --- a/arch/x86/include/asm/mmu.h
> +++ b/arch/x86/include/asm/mmu.h
> @@ -40,6 +40,9 @@ typedef struct {
>
> #ifdef CONFIG_X86_64
> unsigned short flags;
> +
> + /* Active LAM mode: X86_CR3_LAM_U48 or X86_CR3_LAM_U57 or 0 (disabled) */
> + unsigned long lam_cr3_mask;
> #endif
>
> struct mutex lock;
> diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
> index b8d40ddeab00..69c943b2ae90 100644
> --- a/arch/x86/include/asm/mmu_context.h
> +++ b/arch/x86/include/asm/mmu_context.h
> @@ -91,6 +91,29 @@ static inline void switch_ldt(struct mm_struct *prev, struct mm_struct *next)
> }
> #endif
>
> +#ifdef CONFIG_X86_64
> +static inline unsigned long mm_lam_cr3_mask(struct mm_struct *mm)
> +{
> + return mm->context.lam_cr3_mask;
> +}
> +
> +static inline void dup_lam(struct mm_struct *oldmm, struct mm_struct *mm)
> +{
> + mm->context.lam_cr3_mask = oldmm->context.lam_cr3_mask;
> +}
> +
> +#else
> +
> +static inline unsigned long mm_lam_cr3_mask(struct mm_struct *mm)
> +{
> + return 0;
> +}
> +
> +static inline void dup_lam(struct mm_struct *oldmm, struct mm_struct *mm)
> +{
> +}
> +#endif
> +
> #define enter_lazy_tlb enter_lazy_tlb
> extern void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk);
>
> @@ -168,6 +191,7 @@ static inline int arch_dup_mmap(struct mm_struct *oldmm, struct mm_struct *mm)
> {
> arch_dup_pkeys(oldmm, mm);
> paravirt_arch_dup_mmap(oldmm, mm);
> + dup_lam(oldmm, mm);
> return ldt_dup_context(oldmm, mm);
> }
>
> diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
> index cda3118f3b27..662598dea937 100644
> --- a/arch/x86/include/asm/tlbflush.h
> +++ b/arch/x86/include/asm/tlbflush.h
> @@ -101,6 +101,16 @@ struct tlb_state {
> */
> bool invalidate_other;
>
> +#ifdef CONFIG_X86_64
> + /*
> + * Active LAM mode.
> + *
> + * X86_CR3_LAM_U57/U48 shifted right by X86_CR3_LAM_U57_BIT or 0 if LAM
> + * disabled.
> + */
> + u8 lam;
> +#endif
> +
> /*
> * Mask that contains TLB_NR_DYN_ASIDS+1 bits to indicate
> * the corresponding user PCID needs a flush next time we
> @@ -357,6 +367,30 @@ static inline bool huge_pmd_needs_flush(pmd_t oldpmd, pmd_t newpmd)
> }
> #define huge_pmd_needs_flush huge_pmd_needs_flush
>
> +#ifdef CONFIG_X86_64
> +static inline unsigned long tlbstate_lam_cr3_mask(void)
> +{
> + unsigned long lam = this_cpu_read(cpu_tlbstate.lam);
> +
> + return lam << X86_CR3_LAM_U57_BIT;
> +}
> +
> +static inline void set_tlbstate_cr3_lam_mask(unsigned long mask)
> +{
> + this_cpu_write(cpu_tlbstate.lam, mask >> X86_CR3_LAM_U57_BIT);
> +}
> +
> +#else
> +
> +static inline unsigned long tlbstate_lam_cr3_mask(void)
> +{
> + return 0;
> +}
> +
> +static inline void set_tlbstate_cr3_lam_mask(u64 mask)
> +{
> +}
> +#endif
> #endif /* !MODULE */
>
> static inline void __native_tlb_flush_global(unsigned long cr4)
> diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
> index c1e31e9a85d7..d6c9c15d2ad2 100644
> --- a/arch/x86/mm/tlb.c
> +++ b/arch/x86/mm/tlb.c
> @@ -154,26 +154,30 @@ static inline u16 user_pcid(u16 asid)
> return ret;
> }
>
> -static inline unsigned long build_cr3(pgd_t *pgd, u16 asid)
> +static inline unsigned long build_cr3(pgd_t *pgd, u16 asid, unsigned long lam)
> {
> + unsigned long cr3 = __sme_pa(pgd) | lam;
> +
> if (static_cpu_has(X86_FEATURE_PCID)) {
> - return __sme_pa(pgd) | kern_pcid(asid);
> + VM_WARN_ON_ONCE(asid > MAX_ASID_AVAILABLE);
> + cr3 |= kern_pcid(asid);
> } else {
> VM_WARN_ON_ONCE(asid != 0);
> - return __sme_pa(pgd);
> }
> +
> + return cr3;
> }
>
> -static inline unsigned long build_cr3_noflush(pgd_t *pgd, u16 asid)
> +static inline unsigned long build_cr3_noflush(pgd_t *pgd, u16 asid,
> + unsigned long lam)
> {
> - VM_WARN_ON_ONCE(asid > MAX_ASID_AVAILABLE);
> /*
> * Use boot_cpu_has() instead of this_cpu_has() as this function
> * might be called during early boot. This should work even after
> * boot because all CPU's the have same capabilities:
> */
> VM_WARN_ON_ONCE(!boot_cpu_has(X86_FEATURE_PCID));
> - return __sme_pa(pgd) | kern_pcid(asid) | CR3_NOFLUSH;
> + return build_cr3(pgd, asid, lam) | CR3_NOFLUSH;
> }
>
> /*
> @@ -274,15 +278,16 @@ static inline void invalidate_user_asid(u16 asid)
> (unsigned long *)this_cpu_ptr(&cpu_tlbstate.user_pcid_flush_mask));
> }
>
> -static void load_new_mm_cr3(pgd_t *pgdir, u16 new_asid, bool need_flush)
> +static void load_new_mm_cr3(pgd_t *pgdir, u16 new_asid, unsigned long lam,
> + bool need_flush)
> {
> unsigned long new_mm_cr3;
>
> if (need_flush) {
> invalidate_user_asid(new_asid);
> - new_mm_cr3 = build_cr3(pgdir, new_asid);
> + new_mm_cr3 = build_cr3(pgdir, new_asid, lam);
> } else {
> - new_mm_cr3 = build_cr3_noflush(pgdir, new_asid);
> + new_mm_cr3 = build_cr3_noflush(pgdir, new_asid, lam);
> }
>
> /*
> @@ -491,6 +496,8 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
> {
> struct mm_struct *real_prev = this_cpu_read(cpu_tlbstate.loaded_mm);
> u16 prev_asid = this_cpu_read(cpu_tlbstate.loaded_mm_asid);
> + unsigned long prev_lam = tlbstate_lam_cr3_mask();
> + unsigned long new_lam = mm_lam_cr3_mask(next);
> bool was_lazy = this_cpu_read(cpu_tlbstate_shared.is_lazy);
> unsigned cpu = smp_processor_id();
> u64 next_tlb_gen;
> @@ -520,7 +527,7 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
> * isn't free.
> */
> #ifdef CONFIG_DEBUG_VM
> - if (WARN_ON_ONCE(__read_cr3() != build_cr3(real_prev->pgd, prev_asid))) {
> + if (WARN_ON_ONCE(__read_cr3() != build_cr3(real_prev->pgd, prev_asid, prev_lam))) {
> /*
> * If we were to BUG here, we'd be very likely to kill
> * the system so hard that we don't see the call trace.
> @@ -554,6 +561,7 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
> if (real_prev == next) {
> VM_WARN_ON(this_cpu_read(cpu_tlbstate.ctxs[prev_asid].ctx_id) !=
> next->context.ctx_id);
> + VM_WARN_ON(prev_lam != new_lam);

What prevents this warning from firing if a remote cpu does
prctl_enable_tagged_addr() and this cpu hits this code path before
getting the LAM-enabling IPI? Conceptually this would be like if we
asserted that LDTR matched the mm_context's ldt setting in this code path.

I think (haven't really verified) that you can fix this by removing the
warning and adding a comment explaining that CR3 can be out of sync due
to a race against changes to LAM settings. I don't think there's any
way to eliminate the race -- there is no lock you can take while
changing lam that prevents a remote CPU from switching mm or scheduling.

--Andy

2022-11-07 17:52:58

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCHv11 05/16] x86/uaccess: Provide untagged_addr() and remove tags before address check

On Mon, Nov 07, 2022 at 06:50:51AM -0800, Andy Lutomirski wrote:
> > @@ -21,6 +22,30 @@ static inline bool pagefault_disabled(void);
> > # define WARN_ON_IN_IRQ()
> > #endif
> > +#ifdef CONFIG_X86_64
> > +/*
> > + * Mask out tag bits from the address.
> > + *
> > + * Magic with the 'sign' allows to untag userspace pointer without any branches
> > + * while leaving kernel addresses intact.
> > + */
> > +#define untagged_addr(mm, addr) ({ \
> > + u64 __addr = (__force u64)(addr); \
> > + s64 sign = (s64)__addr >> 63; \
> > + __addr &= (mm)->context.untag_mask | sign; \
> > + (__force __typeof__(addr))__addr; \
> > +})
> > +
>
> I think this implementation is correct, but I'm wondering if there are any
> callers of untagged_addr that actually need to preserve kernel addresses.
> Are there? (There certainly *were* back when we had set_fs().)

I don't think there's any.

CONFIG_KASAN_SW_TAGS uses untagged_addr() on kernel addresses, but it is
only enabled on arm64. On x86, it will use CR4.LAM_SUP and the enabling
would require a new helper for untagging kernel addresses.

That said, I would rather stay on the safe side.

> I'm also mildly uneasy about a potential edge case. Naively, one would
> expect:
>
> untagged_addr(current->mm, addr) + size ==
> untagged_addr(current->mm, addr + size)
>
> at least for an address that is valid enough to be potentially dereferenced.
> This isn't true any more for size that overflows into the tag bit range.

That's definitely a new edge case.

From a quick grep, only CONFIG_KASAN_SW_TAGS code obviously does
arithmetic on addresses before untagging.

> I *think* we're okay though -- __access_ok requires that addr <= limit -
> size, so any range that overflows into tag bits will be rejected even if the
> entire range consists of valid (tagged) user addresses.

True.

> So:
>
> Acked-by: Andy Lutomirski <[email protected]>

Thanks!

--
Kiryl Shutsemau / Kirill A. Shutemov

2022-11-07 18:03:00

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCHv11 04/16] x86/mm: Handle LAM on context switch

On Mon, Nov 07, 2022 at 06:58:59AM -0800, Andy Lutomirski wrote:
> > @@ -554,6 +561,7 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
> > if (real_prev == next) {
> > VM_WARN_ON(this_cpu_read(cpu_tlbstate.ctxs[prev_asid].ctx_id) !=
> > next->context.ctx_id);
> > + VM_WARN_ON(prev_lam != new_lam);
>
> What prevents this warning from firing if a remote cpu does
> prctl_enable_tagged_addr() and this cpu hits this code path before getting
> the LAM-enabling IPI? Conceptually this would be like if we asserted that
> LDTR matched the mm_context's ldt setting in this code path.
>
> I think (haven't really verified) that you can fix this by removing the
> warning and adding a comment explaining that CR3 can be out of sync due to a
> race against changes to LAM settings. I don't think there's any way to
> eliminate the race -- there is no lock you can take while changing lam that
> prevents a remote CPU from switching mm or scheduling.

Something like this?

diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index d6c9c15d2ad2..c6cac1a1bc64 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -561,7 +561,15 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
if (real_prev == next) {
VM_WARN_ON(this_cpu_read(cpu_tlbstate.ctxs[prev_asid].ctx_id) !=
next->context.ctx_id);
- VM_WARN_ON(prev_lam != new_lam);
+
+ /*
+ * 'prev_lam' does not necessarily match 'new_lam' here. In case
+ * of a race with LAM enabling, the updated 'lam_cr3_mask' can be
+ * seen before the LAM-enabling IPI kicks in.
+ *
+ * The race is harmless: it is okay to update CR3 with new LAM
+ * mode. The IPI will rewrite CR3 shortly.
+ */

/*
* Even in lazy TLB mode, the CPU should stay set in the
--
Kiryl Shutsemau / Kirill A. Shutemov

2022-11-07 18:09:45

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCHv11 04/16] x86/mm: Handle LAM on context switch

On 11/7/22 09:14, Kirill A. Shutemov wrote:
> --- a/arch/x86/mm/tlb.c
> +++ b/arch/x86/mm/tlb.c
> @@ -561,7 +561,15 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
> if (real_prev == next) {
> VM_WARN_ON(this_cpu_read(cpu_tlbstate.ctxs[prev_asid].ctx_id) !=
> next->context.ctx_id);
> - VM_WARN_ON(prev_lam != new_lam);
> +
> + /*
> + * 'prev_lam' does not necessary match 'new_lam' here. In case
> + * of race with LAM enabling, the updated 'lam_cr3_mask' can be
> + * been before LAM-enabling IPI kicks in.
> + *
> + * The race is harmless: it is okay to update CR3 with new LAM
> + * mode. The IPI will rewrite CR3 shortly.
> + */

So, let's do something like this in switch_mm_irqs_off():

/* Not actually switching mm's */
VM_WARN_ON(this_cpu_read(cpu_tlbstate....

/*
* If this races with another thread that enables
* lam, 'new_lam' might not match 'prev_lam'.
*/

Then, in enable_lam_func(), something like this:

/*
* Update CR3 to get LAM active on the CPU
*
* This might not actually need to update CR3 if a context
* switch happened between updating 'lam_cr3_mask' and
* running this IPI handler. Update it unconditionally for
* simplicity.
*/
cr3 = __read_cr3();
cr3 &= ~(X86_CR3_LAM_U48 | X86_CR3_LAM_U57);
cr3 |= lam_mask;
write_cr3(cr3);
set_tlbstate_cr3_lam_mask(lam_mask);


I'd much rather get folks thinking about IPI races in the IPI handler
rather than thinking about the IPI handler in the context switch path.

It's kinda silly to be describing the occasional superfluous
enable_lam_func() activity from switch_mm_irqs_off().

2022-11-07 22:14:29

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCHv11.1 04/16] x86/mm: Handle LAM on context switch

Linear Address Masking mode for userspace pointers is encoded in CR3 bits.
The mode is selected per-process and stored in mm_context_t.

switch_mm_irqs_off() now respects the selected LAM mode and constructs CR3
accordingly.

The active LAM mode gets recorded in the tlb_state.

Signed-off-by: Kirill A. Shutemov <[email protected]>
Tested-by: Alexander Potapenko <[email protected]>
Acked-by: Peter Zijlstra (Intel) <[email protected]>
---
arch/x86/include/asm/mmu.h | 3 ++
arch/x86/include/asm/mmu_context.h | 24 ++++++++++++++
arch/x86/include/asm/tlbflush.h | 34 +++++++++++++++++++
arch/x86/mm/tlb.c | 53 +++++++++++++++++++++---------
4 files changed, 98 insertions(+), 16 deletions(-)

diff --git a/arch/x86/include/asm/mmu.h b/arch/x86/include/asm/mmu.h
index 5d7494631ea9..002889ca8978 100644
--- a/arch/x86/include/asm/mmu.h
+++ b/arch/x86/include/asm/mmu.h
@@ -40,6 +40,9 @@ typedef struct {

#ifdef CONFIG_X86_64
unsigned short flags;
+
+ /* Active LAM mode: X86_CR3_LAM_U48 or X86_CR3_LAM_U57 or 0 (disabled) */
+ unsigned long lam_cr3_mask;
#endif

struct mutex lock;
diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
index b8d40ddeab00..69c943b2ae90 100644
--- a/arch/x86/include/asm/mmu_context.h
+++ b/arch/x86/include/asm/mmu_context.h
@@ -91,6 +91,29 @@ static inline void switch_ldt(struct mm_struct *prev, struct mm_struct *next)
}
#endif

+#ifdef CONFIG_X86_64
+static inline unsigned long mm_lam_cr3_mask(struct mm_struct *mm)
+{
+ return mm->context.lam_cr3_mask;
+}
+
+static inline void dup_lam(struct mm_struct *oldmm, struct mm_struct *mm)
+{
+ mm->context.lam_cr3_mask = oldmm->context.lam_cr3_mask;
+}
+
+#else
+
+static inline unsigned long mm_lam_cr3_mask(struct mm_struct *mm)
+{
+ return 0;
+}
+
+static inline void dup_lam(struct mm_struct *oldmm, struct mm_struct *mm)
+{
+}
+#endif
+
#define enter_lazy_tlb enter_lazy_tlb
extern void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk);

@@ -168,6 +191,7 @@ static inline int arch_dup_mmap(struct mm_struct *oldmm, struct mm_struct *mm)
{
arch_dup_pkeys(oldmm, mm);
paravirt_arch_dup_mmap(oldmm, mm);
+ dup_lam(oldmm, mm);
return ldt_dup_context(oldmm, mm);
}

diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index cda3118f3b27..662598dea937 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -101,6 +101,16 @@ struct tlb_state {
*/
bool invalidate_other;

+#ifdef CONFIG_X86_64
+ /*
+ * Active LAM mode.
+ *
+ * X86_CR3_LAM_U57/U48 shifted right by X86_CR3_LAM_U57_BIT or 0 if LAM
+ * disabled.
+ */
+ u8 lam;
+#endif
+
/*
* Mask that contains TLB_NR_DYN_ASIDS+1 bits to indicate
* the corresponding user PCID needs a flush next time we
@@ -357,6 +367,30 @@ static inline bool huge_pmd_needs_flush(pmd_t oldpmd, pmd_t newpmd)
}
#define huge_pmd_needs_flush huge_pmd_needs_flush

+#ifdef CONFIG_X86_64
+static inline unsigned long tlbstate_lam_cr3_mask(void)
+{
+ unsigned long lam = this_cpu_read(cpu_tlbstate.lam);
+
+ return lam << X86_CR3_LAM_U57_BIT;
+}
+
+static inline void set_tlbstate_cr3_lam_mask(unsigned long mask)
+{
+ this_cpu_write(cpu_tlbstate.lam, mask >> X86_CR3_LAM_U57_BIT);
+}
+
+#else
+
+static inline unsigned long tlbstate_lam_cr3_mask(void)
+{
+ return 0;
+}
+
+static inline void set_tlbstate_cr3_lam_mask(u64 mask)
+{
+}
+#endif
#endif /* !MODULE */

static inline void __native_tlb_flush_global(unsigned long cr4)
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index c1e31e9a85d7..4380776b3c61 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -154,26 +154,30 @@ static inline u16 user_pcid(u16 asid)
return ret;
}

-static inline unsigned long build_cr3(pgd_t *pgd, u16 asid)
+static inline unsigned long build_cr3(pgd_t *pgd, u16 asid, unsigned long lam)
{
+ unsigned long cr3 = __sme_pa(pgd) | lam;
+
if (static_cpu_has(X86_FEATURE_PCID)) {
- return __sme_pa(pgd) | kern_pcid(asid);
+ VM_WARN_ON_ONCE(asid > MAX_ASID_AVAILABLE);
+ cr3 |= kern_pcid(asid);
} else {
VM_WARN_ON_ONCE(asid != 0);
- return __sme_pa(pgd);
}
+
+ return cr3;
}

-static inline unsigned long build_cr3_noflush(pgd_t *pgd, u16 asid)
+static inline unsigned long build_cr3_noflush(pgd_t *pgd, u16 asid,
+ unsigned long lam)
{
- VM_WARN_ON_ONCE(asid > MAX_ASID_AVAILABLE);
/*
* Use boot_cpu_has() instead of this_cpu_has() as this function
* might be called during early boot. This should work even after
* boot because all CPU's the have same capabilities:
*/
VM_WARN_ON_ONCE(!boot_cpu_has(X86_FEATURE_PCID));
- return __sme_pa(pgd) | kern_pcid(asid) | CR3_NOFLUSH;
+ return build_cr3(pgd, asid, lam) | CR3_NOFLUSH;
}

/*
@@ -274,15 +278,16 @@ static inline void invalidate_user_asid(u16 asid)
(unsigned long *)this_cpu_ptr(&cpu_tlbstate.user_pcid_flush_mask));
}

-static void load_new_mm_cr3(pgd_t *pgdir, u16 new_asid, bool need_flush)
+static void load_new_mm_cr3(pgd_t *pgdir, u16 new_asid, unsigned long lam,
+ bool need_flush)
{
unsigned long new_mm_cr3;

if (need_flush) {
invalidate_user_asid(new_asid);
- new_mm_cr3 = build_cr3(pgdir, new_asid);
+ new_mm_cr3 = build_cr3(pgdir, new_asid, lam);
} else {
- new_mm_cr3 = build_cr3_noflush(pgdir, new_asid);
+ new_mm_cr3 = build_cr3_noflush(pgdir, new_asid, lam);
}

/*
@@ -491,6 +496,8 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
{
struct mm_struct *real_prev = this_cpu_read(cpu_tlbstate.loaded_mm);
u16 prev_asid = this_cpu_read(cpu_tlbstate.loaded_mm_asid);
+ unsigned long prev_lam = tlbstate_lam_cr3_mask();
+ unsigned long new_lam = mm_lam_cr3_mask(next);
bool was_lazy = this_cpu_read(cpu_tlbstate_shared.is_lazy);
unsigned cpu = smp_processor_id();
u64 next_tlb_gen;
@@ -520,7 +527,7 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
* isn't free.
*/
#ifdef CONFIG_DEBUG_VM
- if (WARN_ON_ONCE(__read_cr3() != build_cr3(real_prev->pgd, prev_asid))) {
+ if (WARN_ON_ONCE(__read_cr3() != build_cr3(real_prev->pgd, prev_asid, prev_lam))) {
/*
* If we were to BUG here, we'd be very likely to kill
* the system so hard that we don't see the call trace.
@@ -552,9 +559,15 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
* instruction.
*/
if (real_prev == next) {
+ /* Not actually switching mm's */
VM_WARN_ON(this_cpu_read(cpu_tlbstate.ctxs[prev_asid].ctx_id) !=
next->context.ctx_id);

+ /*
+ * If this races with another thread that enables lam, 'new_lam'
+ * might not match 'prev_lam'.
+ */
+
/*
* Even in lazy TLB mode, the CPU should stay set in the
* mm_cpumask. The TLB shootdown code can figure out from
@@ -622,15 +635,16 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
barrier();
}

+ set_tlbstate_cr3_lam_mask(new_lam);
if (need_flush) {
this_cpu_write(cpu_tlbstate.ctxs[new_asid].ctx_id, next->context.ctx_id);
this_cpu_write(cpu_tlbstate.ctxs[new_asid].tlb_gen, next_tlb_gen);
- load_new_mm_cr3(next->pgd, new_asid, true);
+ load_new_mm_cr3(next->pgd, new_asid, new_lam, true);

trace_tlb_flush(TLB_FLUSH_ON_TASK_SWITCH, TLB_FLUSH_ALL);
} else {
/* The new ASID is already up to date. */
- load_new_mm_cr3(next->pgd, new_asid, false);
+ load_new_mm_cr3(next->pgd, new_asid, new_lam, false);

trace_tlb_flush(TLB_FLUSH_ON_TASK_SWITCH, 0);
}
@@ -691,6 +705,10 @@ void initialize_tlbstate_and_flush(void)
/* Assert that CR3 already references the right mm. */
WARN_ON((cr3 & CR3_ADDR_MASK) != __pa(mm->pgd));

+ /* LAM expected to be disabled in CR3 and init_mm */
+ WARN_ON(cr3 & (X86_CR3_LAM_U48 | X86_CR3_LAM_U57));
+ WARN_ON(mm_lam_cr3_mask(&init_mm));
+
/*
* Assert that CR4.PCIDE is set if needed. (CR4.PCIDE initialization
* doesn't work like other CR4 bits because it can only be set from
@@ -699,8 +717,8 @@ void initialize_tlbstate_and_flush(void)
WARN_ON(boot_cpu_has(X86_FEATURE_PCID) &&
!(cr4_read_shadow() & X86_CR4_PCIDE));

- /* Force ASID 0 and force a TLB flush. */
- write_cr3(build_cr3(mm->pgd, 0));
+ /* Disable LAM, force ASID 0 and force a TLB flush. */
+ write_cr3(build_cr3(mm->pgd, 0, 0));

/* Reinitialize tlbstate. */
this_cpu_write(cpu_tlbstate.last_user_mm_spec, LAST_USER_MM_INIT);
@@ -708,6 +726,7 @@ void initialize_tlbstate_and_flush(void)
this_cpu_write(cpu_tlbstate.next_asid, 1);
this_cpu_write(cpu_tlbstate.ctxs[0].ctx_id, mm->context.ctx_id);
this_cpu_write(cpu_tlbstate.ctxs[0].tlb_gen, tlb_gen);
+ set_tlbstate_cr3_lam_mask(0);

for (i = 1; i < TLB_NR_DYN_ASIDS; i++)
this_cpu_write(cpu_tlbstate.ctxs[i].ctx_id, 0);
@@ -1071,8 +1090,10 @@ void flush_tlb_kernel_range(unsigned long start, unsigned long end)
*/
unsigned long __get_current_cr3_fast(void)
{
- unsigned long cr3 = build_cr3(this_cpu_read(cpu_tlbstate.loaded_mm)->pgd,
- this_cpu_read(cpu_tlbstate.loaded_mm_asid));
+ unsigned long cr3 =
+ build_cr3(this_cpu_read(cpu_tlbstate.loaded_mm)->pgd,
+ this_cpu_read(cpu_tlbstate.loaded_mm_asid),
+ tlbstate_lam_cr3_mask());

/* For now, be very restrictive about when this can be called. */
VM_WARN_ON(in_nmi() || preemptible());
--
2.38.0


2022-11-07 22:16:16

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCHv11.1 07/16] x86/mm: Provide arch_prctl() interface for LAM

Add a few arch_prctl() handles:

- ARCH_ENABLE_TAGGED_ADDR enables LAM. The argument is the required number
  of tag bits. It is rounded up to the nearest LAM mode that can
  provide it. For now only LAM_U57 is supported, with 6 tag bits.

- ARCH_GET_UNTAG_MASK returns the untag mask. It indicates where the tag
  bits are located in the address.

- ARCH_GET_MAX_TAG_BITS returns the maximum number of tag bits a user can
  request. Zero if LAM is not supported. A usage sketch follows below.
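
For illustration only, here is a minimal userspace sketch of the interface
above (not part of the patch; it assumes the syscall(2) wrapper for
arch_prctl and re-defines the prctl values from the uapi header below --
roughly what the selftests in this series exercise):

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Values from arch/x86/include/uapi/asm/prctl.h in this patch. */
#define ARCH_GET_UNTAG_MASK		0x4001
#define ARCH_ENABLE_TAGGED_ADDR		0x4002
#define ARCH_GET_MAX_TAG_BITS		0x4003

int main(void)
{
	unsigned long max_bits, untag_mask;
	uint64_t *p = malloc(sizeof(*p));
	uint64_t *tagged;

	if (!p)
		return 1;

	/* How many tag bits can be requested? Zero means no LAM. */
	if (syscall(SYS_arch_prctl, ARCH_GET_MAX_TAG_BITS, &max_bits) || !max_bits)
		return 1;

	/* Ask for 6 tag bits; rounded up to LAM_U57. -EBUSY if already enabled. */
	if (syscall(SYS_arch_prctl, ARCH_ENABLE_TAGGED_ADDR, 6UL))
		return 1;

	/* Expected to be ~GENMASK(62, 57) for LAM_U57. */
	syscall(SYS_arch_prctl, ARCH_GET_UNTAG_MASK, &untag_mask);
	printf("untag mask: %#lx\n", untag_mask);

	/* With LAM_U57 active, bits 62:57 of the pointer are ignored. */
	*p = 42;
	tagged = (uint64_t *)((uintptr_t)p | (0x2aUL << 57));
	printf("*tagged = %llu\n", (unsigned long long)*tagged);

	free(p);
	return 0;
}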

Signed-off-by: Kirill A. Shutemov <[email protected]>
Tested-by: Alexander Potapenko <[email protected]>
Reviewed-by: Alexander Potapenko <[email protected]>
Acked-by: Peter Zijlstra (Intel) <[email protected]>
---
arch/x86/include/uapi/asm/prctl.h | 4 ++
arch/x86/kernel/process_64.c | 71 ++++++++++++++++++++++++++++++-
2 files changed, 74 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/uapi/asm/prctl.h b/arch/x86/include/uapi/asm/prctl.h
index 500b96e71f18..a31e27b95b19 100644
--- a/arch/x86/include/uapi/asm/prctl.h
+++ b/arch/x86/include/uapi/asm/prctl.h
@@ -20,4 +20,8 @@
#define ARCH_MAP_VDSO_32 0x2002
#define ARCH_MAP_VDSO_64 0x2003

+#define ARCH_GET_UNTAG_MASK 0x4001
+#define ARCH_ENABLE_TAGGED_ADDR 0x4002
+#define ARCH_GET_MAX_TAG_BITS 0x4003
+
#endif /* _ASM_X86_PRCTL_H */
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index 6b3418bff326..b8f2558a3aeb 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -743,6 +743,66 @@ static long prctl_map_vdso(const struct vdso_image *image, unsigned long addr)
}
#endif

+static void enable_lam_func(void *mm)
+{
+ struct mm_struct *loaded_mm = this_cpu_read(cpu_tlbstate.loaded_mm);
+ unsigned long lam_mask;
+ unsigned long cr3;
+
+ if (loaded_mm != mm)
+ return;
+
+ lam_mask = READ_ONCE(loaded_mm->context.lam_cr3_mask);
+
+ /*
+ * Update CR3 to get LAM active on the CPU.
+ *
+ * This might not actually need to update CR3 if a context switch
+ * happened between updating 'lam_cr3_mask' and running this IPI
+ * handler. Update it unconditionally for simplicity.
+ */
+ cr3 = __read_cr3();
+ cr3 &= ~(X86_CR3_LAM_U48 | X86_CR3_LAM_U57);
+ cr3 |= lam_mask;
+ write_cr3(cr3);
+ set_tlbstate_cr3_lam_mask(lam_mask);
+}
+
+#define LAM_U57_BITS 6
+
+static int prctl_enable_tagged_addr(struct mm_struct *mm, unsigned long nr_bits)
+{
+ int ret = 0;
+
+ if (!cpu_feature_enabled(X86_FEATURE_LAM))
+ return -ENODEV;
+
+ if (mmap_write_lock_killable(mm))
+ return -EINTR;
+
+ /* Already enabled? */
+ if (mm->context.lam_cr3_mask) {
+ ret = -EBUSY;
+ goto out;
+ }
+
+ if (!nr_bits) {
+ ret = -EINVAL;
+ goto out;
+ } else if (nr_bits <= LAM_U57_BITS) {
+ mm->context.lam_cr3_mask = X86_CR3_LAM_U57;
+ mm->context.untag_mask = ~GENMASK(62, 57);
+ } else {
+ ret = -EINVAL;
+ goto out;
+ }
+
+ on_each_cpu_mask(mm_cpumask(mm), enable_lam_func, mm, true);
+out:
+ mmap_write_unlock(mm);
+ return ret;
+}
+
long do_arch_prctl_64(struct task_struct *task, int option, unsigned long arg2)
{
int ret = 0;
@@ -830,7 +890,16 @@ long do_arch_prctl_64(struct task_struct *task, int option, unsigned long arg2)
case ARCH_MAP_VDSO_64:
return prctl_map_vdso(&vdso_image_64, arg2);
#endif
-
+ case ARCH_GET_UNTAG_MASK:
+ return put_user(task->mm->context.untag_mask,
+ (unsigned long __user *)arg2);
+ case ARCH_ENABLE_TAGGED_ADDR:
+ return prctl_enable_tagged_addr(task->mm, arg2);
+ case ARCH_GET_MAX_TAG_BITS:
+ if (!cpu_feature_enabled(X86_FEATURE_LAM))
+ return put_user(0, (unsigned long __user *)arg2);
+ else
+ return put_user(LAM_U57_BITS, (unsigned long __user *)arg2);
default:
ret = -EINVAL;
break;
--
2.38.0


2022-11-09 04:20:46

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCHv11.1 04/16] x86/mm: Handle LAM on context switch

On 11/7/22 13:35, Kirill A. Shutemov wrote:
> Linear Address Masking mode for userspace pointers encoded in CR3 bits.
> The mode is selected per-process and stored in mm_context_t.
>
> switch_mm_irqs_off() now respects selected LAM mode and constructs CR3
> accordingly.
>
> The active LAM mode gets recorded in the tlb_state.
>

> +static inline unsigned long mm_lam_cr3_mask(struct mm_struct *mm)
> +{
> + return mm->context.lam_cr3_mask;

READ_ONCE -- otherwise this has a data race and might generate sanitizer
complaints.

> +}

> @@ -491,6 +496,8 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
> {
> struct mm_struct *real_prev = this_cpu_read(cpu_tlbstate.loaded_mm);
> u16 prev_asid = this_cpu_read(cpu_tlbstate.loaded_mm_asid);
> + unsigned long prev_lam = tlbstate_lam_cr3_mask();
> + unsigned long new_lam = mm_lam_cr3_mask(next);

So I'm reading this again after drinking a cup of coffee. new_lam is
next's LAM mask according to mm_struct (and thus can change
asynchronously due to a remote CPU). prev_lam is based on tlbstate and
can't change asynchronously, at least not with IRQs off.


> bool was_lazy = this_cpu_read(cpu_tlbstate_shared.is_lazy);
> unsigned cpu = smp_processor_id();
> u64 next_tlb_gen;
> @@ -520,7 +527,7 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
> * isn't free.
> */
> #ifdef CONFIG_DEBUG_VM
> - if (WARN_ON_ONCE(__read_cr3() != build_cr3(real_prev->pgd, prev_asid))) {
> + if (WARN_ON_ONCE(__read_cr3() != build_cr3(real_prev->pgd, prev_asid, prev_lam))) {

So is the only purpose of tlbstate_lam_cr3_mask() to enable this warning
to work?

> /*
> * If we were to BUG here, we'd be very likely to kill
> * the system so hard that we don't see the call trace.
> @@ -552,9 +559,15 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
> * instruction.
> */
> if (real_prev == next) {
> + /* Not actually switching mm's */
> VM_WARN_ON(this_cpu_read(cpu_tlbstate.ctxs[prev_asid].ctx_id) !=
> next->context.ctx_id);
>
> + /*
> + * If this races with another thread that enables lam, 'new_lam'
> + * might not match 'prev_lam'.
> + */
> +

Indeed.

> /*
> * Even in lazy TLB mode, the CPU should stay set in the
> * mm_cpumask. The TLB shootdown code can figure out from
> @@ -622,15 +635,16 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
> barrier();
> }

> @@ -691,6 +705,10 @@ void initialize_tlbstate_and_flush(void)
> /* Assert that CR3 already references the right mm. */
> WARN_ON((cr3 & CR3_ADDR_MASK) != __pa(mm->pgd));
>
> + /* LAM expected to be disabled in CR3 and init_mm */
> + WARN_ON(cr3 & (X86_CR3_LAM_U48 | X86_CR3_LAM_U57));
> + WARN_ON(mm_lam_cr3_mask(&init_mm));
> +

I think the callers all have init_mm selected, but the rest of this
function is not really written with this assumption. (But it does force
ASID 0, which is at least a bizarre thing to do for non-init-mm.)

What's the purpose of this warning? I'm okay with keeping it, but maybe
also add a warning that fires if mm != &init_mm.
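
Concretely, that would just be something like the sketch below, next to the
existing assertions (untested, only to make the suggestion explicit):

	/* This path is only expected to run with init_mm loaded. */
	WARN_ON(mm != &init_mm);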


2022-11-09 10:55:53

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCHv11.1 04/16] x86/mm: Handle LAM on context switch

On Tue, Nov 08, 2022 at 07:54:35PM -0800, Andy Lutomirski wrote:
> On 11/7/22 13:35, Kirill A. Shutemov wrote:
> > Linear Address Masking mode for userspace pointers encoded in CR3 bits.
> > The mode is selected per-process and stored in mm_context_t.
> >
> > switch_mm_irqs_off() now respects selected LAM mode and constructs CR3
> > accordingly.
> >
> > The active LAM mode gets recorded in the tlb_state.
> >
>
> > +static inline unsigned long mm_lam_cr3_mask(struct mm_struct *mm)
> > +{
> > + return mm->context.lam_cr3_mask;
>
> READ_ONCE -- otherwise this has a data race and might generate sanitizer
> complaints.

Yep, thanks for pointing it out.

> > +}
>
> > @@ -491,6 +496,8 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
> > {
> > struct mm_struct *real_prev = this_cpu_read(cpu_tlbstate.loaded_mm);
> > u16 prev_asid = this_cpu_read(cpu_tlbstate.loaded_mm_asid);
> > + unsigned long prev_lam = tlbstate_lam_cr3_mask();
> > + unsigned long new_lam = mm_lam_cr3_mask(next);
>
> So I'm reading this again after drinking a cup of coffee. new_lam is next's
> LAM mask according to mm_struct (and thus can change asynchronously due to a
> remote CPU). prev_lam is based on tlbstate and can't change asynchronously,
> at least not with IRQs off.
>
>
> > bool was_lazy = this_cpu_read(cpu_tlbstate_shared.is_lazy);
> > unsigned cpu = smp_processor_id();
> > u64 next_tlb_gen;
> > @@ -520,7 +527,7 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
> > * isn't free.
> > */
> > #ifdef CONFIG_DEBUG_VM
> > - if (WARN_ON_ONCE(__read_cr3() != build_cr3(real_prev->pgd, prev_asid))) {
> > + if (WARN_ON_ONCE(__read_cr3() != build_cr3(real_prev->pgd, prev_asid, prev_lam))) {
>
> So is the only purpose of tlbstate_lam_cr3_mask() to enable this warning to
> work?

Right. And disabling CONFIG_DEBUG_VM leads to a warning ('prev_lam' becomes
unused). See the fixup below.

> > /*
> > * If we were to BUG here, we'd be very likely to kill
> > * the system so hard that we don't see the call trace.
> > @@ -552,9 +559,15 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
> > * instruction.
> > */
> > if (real_prev == next) {
> > + /* Not actually switching mm's */
> > VM_WARN_ON(this_cpu_read(cpu_tlbstate.ctxs[prev_asid].ctx_id) !=
> > next->context.ctx_id);
> > + /*
> > + * If this races with another thread that enables lam, 'new_lam'
> > + * might not match 'prev_lam'.
> > + */
> > +
>
> Indeed.
>
> > /*
> > * Even in lazy TLB mode, the CPU should stay set in the
> > * mm_cpumask. The TLB shootdown code can figure out from
> > @@ -622,15 +635,16 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
> > barrier();
> > }
>
> > @@ -691,6 +705,10 @@ void initialize_tlbstate_and_flush(void)
> > /* Assert that CR3 already references the right mm. */
> > WARN_ON((cr3 & CR3_ADDR_MASK) != __pa(mm->pgd));
> > + /* LAM expected to be disabled in CR3 and init_mm */
> > + WARN_ON(cr3 & (X86_CR3_LAM_U48 | X86_CR3_LAM_U57));
> > + WARN_ON(mm_lam_cr3_mask(&init_mm));
> > +
>
> I think the callers all have init_mm selected, but the rest of this function
> is not really written with this assumption. (But it does force ASID 0,
> which is at least a bizarre thing to do for non-init-mm.)

Hm. It uses the tlb_gen of init_mm, so I assumed mm == &init_mm, but yeah,
it is not strictly correct.

> What's the purpose of this warning? I'm okay with keeping it, but maybe
> also add a warning that fires if mm != &init_mm.

Just to make sure we are in a sane state. I can drop the init_mm reference
if it helps.

The fixup based on your feedback:

diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
index 1ab7ecf61659..6f5b58a5f951 100644
--- a/arch/x86/include/asm/mmu_context.h
+++ b/arch/x86/include/asm/mmu_context.h
@@ -94,7 +94,7 @@ static inline void switch_ldt(struct mm_struct *prev, struct mm_struct *next)
#ifdef CONFIG_X86_64
static inline unsigned long mm_lam_cr3_mask(struct mm_struct *mm)
{
- return mm->context.lam_cr3_mask;
+ return READ_ONCE(mm->context.lam_cr3_mask);
}

static inline void dup_lam(struct mm_struct *oldmm, struct mm_struct *mm)
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 4380776b3c61..ab66a48f38ce 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -496,7 +496,6 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
{
struct mm_struct *real_prev = this_cpu_read(cpu_tlbstate.loaded_mm);
u16 prev_asid = this_cpu_read(cpu_tlbstate.loaded_mm_asid);
- unsigned long prev_lam = tlbstate_lam_cr3_mask();
unsigned long new_lam = mm_lam_cr3_mask(next);
bool was_lazy = this_cpu_read(cpu_tlbstate_shared.is_lazy);
unsigned cpu = smp_processor_id();
@@ -527,7 +526,8 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
* isn't free.
*/
#ifdef CONFIG_DEBUG_VM
- if (WARN_ON_ONCE(__read_cr3() != build_cr3(real_prev->pgd, prev_asid, prev_lam))) {
+ if (WARN_ON_ONCE(__read_cr3() != build_cr3(real_prev->pgd, prev_asid,
+ tlbstate_lam_cr3_mask()))) {
/*
* If we were to BUG here, we'd be very likely to kill
* the system so hard that we don't see the call trace.
@@ -565,7 +565,7 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,

/*
* If this races with another thread that enables lam, 'new_lam'
- * might not match 'prev_lam'.
+ * might not match tlbstate_lam_cr3_mask().
*/

/*
@@ -705,9 +705,9 @@ void initialize_tlbstate_and_flush(void)
/* Assert that CR3 already references the right mm. */
WARN_ON((cr3 & CR3_ADDR_MASK) != __pa(mm->pgd));

- /* LAM expected to be disabled in CR3 and init_mm */
+ /* LAM expected to be disabled */
WARN_ON(cr3 & (X86_CR3_LAM_U48 | X86_CR3_LAM_U57));
- WARN_ON(mm_lam_cr3_mask(&init_mm));
+ WARN_ON(mm_lam_cr3_mask(mm));

/*
* Assert that CR4.PCIDE is set if needed. (CR4.PCIDE initialization
--
Kiryl Shutsemau / Kirill A. Shutemov