2019-05-06 16:32:28

by Andrey Konovalov

[permalink] [raw]
Subject: [PATCH v15 00/17] arm64: untag user pointers passed to the kernel

=== Overview

arm64 has a feature called Top Byte Ignore, which allows to embed pointer
tags into the top byte of each pointer. Userspace programs (such as
HWASan, a memory debugging tool [1]) might use this feature and pass
tagged user pointers to the kernel through syscalls or other interfaces.

Right now the kernel is already able to handle user faults with tagged
pointers, due to these patches:

1. 81cddd65 ("arm64: traps: fix userspace cache maintenance emulation on a
tagged pointer")
2. 7dcd9dd8 ("arm64: hw_breakpoint: fix watchpoint matching for tagged
pointers")
3. 276e9327 ("arm64: entry: improve data abort handling of tagged
pointers")

This patchset extends tagged pointer support to syscall arguments.

As per the proposed ABI change [3], tagged pointers are only allowed to be
passed to syscalls when they point to memory ranges obtained by anonymous
mmap() or sbrk() (see the patchset [3] for more details).

For non-memory syscalls this is done by untaging user pointers when the
kernel performs pointer checking to find out whether the pointer comes
from userspace (most notably in access_ok). The untagging is done only
when the pointer is being checked, the tag is preserved as the pointer
makes its way through the kernel and stays tagged when the kernel
dereferences the pointer when perfoming user memory accesses.

Memory syscalls (mmap, mprotect, etc.) don't do user memory accesses but
rather deal with memory ranges, and untagged pointers are better suited to
describe memory ranges internally. Thus for memory syscalls we untag
pointers completely when they enter the kernel.

=== Other approaches

One of the alternative approaches to untagging that was considered is to
completely strip the pointer tag as the pointer enters the kernel with
some kind of a syscall wrapper, but that won't work with the countless
number of different ioctl calls. With this approach we would need a custom
wrapper for each ioctl variation, which doesn't seem practical.

An alternative approach to untagging pointers in memory syscalls prologues
is to inspead allow tagged pointers to be passed to find_vma() (and other
vma related functions) and untag them there. Unfortunately, a lot of
find_vma() callers then compare or subtract the returned vma start and end
fields against the pointer that was being searched. Thus this approach
would still require changing all find_vma() callers.

=== Testing

The following testing approaches has been taken to find potential issues
with user pointer untagging:

1. Static testing (with sparse [2] and separately with a custom static
analyzer based on Clang) to track casts of __user pointers to integer
types to find places where untagging needs to be done.

2. Static testing with grep to find parts of the kernel that call
find_vma() (and other similar functions) or directly compare against
vm_start/vm_end fields of vma.

3. Static testing with grep to find parts of the kernel that compare
user pointers with TASK_SIZE or other similar consts and macros.

4. Dynamic testing: adding BUG_ON(has_tag(addr)) to find_vma() and running
a modified syzkaller version that passes tagged pointers to the kernel.

Based on the results of the testing the requried patches have been added
to the patchset.

=== Notes

This patchset is meant to be merged together with "arm64 relaxed ABI" [3].

This patchset is a prerequisite for ARM's memory tagging hardware feature
support [4].

This patchset has been merged into the Pixel 2 & 3 kernel trees and is
now being used to enable testing of Pixel phones with HWASan.

Thanks!

[1] http://clang.llvm.org/docs/HardwareAssistedAddressSanitizerDesign.html

[2] https://github.com/lucvoo/sparse-dev/commit/5f960cb10f56ec2017c128ef9d16060e0145f292

[3] https://lkml.org/lkml/2019/3/18/819

[4] https://community.arm.com/processors/b/blog/posts/arm-a-profile-architecture-2018-developments-armv85a

Changes in v15:
- Removed unnecessary untagging from radeon_ttm_tt_set_userptr().
- Removed unnecessary untagging from amdgpu_ttm_tt_set_userptr().
- Moved untagging to validate_range() in userfaultfd code.
- Moved untagging to ib_uverbs_(re)reg_mr() from mlx4_get_umem_mr().
- Rebased onto 5.1.

Changes in v14:
- Moved untagging for most memory syscalls to an arm64 specific
implementation, instead of doing that in the common code.
- Dropped "net, arm64: untag user pointers in tcp_zerocopy_receive", since
the provided user pointers don't come from an anonymous map and thus are
not covered by this ABI relaxation.
- Dropped "kernel, arm64: untag user pointers in prctl_set_mm*".
- Moved untagging from __check_mem_type() to tee_shm_register().
- Updated untagging for the amdgpu and radeon drivers to cover the MMU
notifier, as suggested by Felix.
- Since this ABI relaxation doesn't actually allow tagged instruction
pointers, dropped the following patches:
- Dropped "tracing, arm64: untag user pointers in seq_print_user_ip".
- Dropped "uprobes, arm64: untag user pointers in find_active_uprobe".
- Dropped "bpf, arm64: untag user pointers in stack_map_get_build_id_offset".
- Rebased onto 5.1-rc7 (37624b58).

Changes in v13:
- Simplified untagging in tcp_zerocopy_receive().
- Looked at find_vma() callers in drivers/, which allowed to identify a
few other places where untagging is needed.
- Added patch "mm, arm64: untag user pointers in get_vaddr_frames".
- Added patch "drm/amdgpu, arm64: untag user pointers in
amdgpu_ttm_tt_get_user_pages".
- Added patch "drm/radeon, arm64: untag user pointers in
radeon_ttm_tt_pin_userptr".
- Added patch "IB/mlx4, arm64: untag user pointers in mlx4_get_umem_mr".
- Added patch "media/v4l2-core, arm64: untag user pointers in
videobuf_dma_contig_user_get".
- Added patch "tee/optee, arm64: untag user pointers in check_mem_type".
- Added patch "vfio/type1, arm64: untag user pointers".

Changes in v12:
- Changed untagging in tcp_zerocopy_receive() to also untag zc->address.
- Fixed untagging in prctl_set_mm* to only untag pointers for vma lookups
and validity checks, but leave them as is for actual user space accesses.
- Updated the link to the v2 of the "arm64 relaxed ABI" patchset [3].
- Dropped the documentation patch, as the "arm64 relaxed ABI" patchset [3]
handles that.

Changes in v11:
- Added "uprobes, arm64: untag user pointers in find_active_uprobe" patch.
- Added "bpf, arm64: untag user pointers in stack_map_get_build_id_offset"
patch.
- Fixed "tracing, arm64: untag user pointers in seq_print_user_ip" to
correctly perform subtration with a tagged addr.
- Moved untagged_addr() from SYSCALL_DEFINE3(mprotect) and
SYSCALL_DEFINE4(pkey_mprotect) to do_mprotect_pkey().
- Moved untagged_addr() definition for other arches from
include/linux/memory.h to include/linux/mm.h.
- Changed untagging in strn*_user() to perform userspace accesses through
tagged pointers.
- Updated the documentation to mention that passing tagged pointers to
memory syscalls is allowed.
- Updated the test to use malloc'ed memory instead of stack memory.

Changes in v10:
- Added "mm, arm64: untag user pointers passed to memory syscalls" back.
- New patch "fs, arm64: untag user pointers in fs/userfaultfd.c".
- New patch "net, arm64: untag user pointers in tcp_zerocopy_receive".
- New patch "kernel, arm64: untag user pointers in prctl_set_mm*".
- New patch "tracing, arm64: untag user pointers in seq_print_user_ip".

Changes in v9:
- Rebased onto 4.20-rc6.
- Used u64 instead of __u64 in type casts in the untagged_addr macro for
arm64.
- Added braces around (addr) in the untagged_addr macro for other arches.

Changes in v8:
- Rebased onto 65102238 (4.20-rc1).
- Added a note to the cover letter on why syscall wrappers/shims that untag
user pointers won't work.
- Added a note to the cover letter that this patchset has been merged into
the Pixel 2 kernel tree.
- Documentation fixes, in particular added a list of syscalls that don't
support tagged user pointers.

Changes in v7:
- Rebased onto 17b57b18 (4.19-rc6).
- Dropped the "arm64: untag user address in __do_user_fault" patch, since
the existing patches already handle user faults properly.
- Dropped the "usb, arm64: untag user addresses in devio" patch, since the
passed pointer must come from a vma and therefore be untagged.
- Dropped the "arm64: annotate user pointers casts detected by sparse"
patch (see the discussion to the replies of the v6 of this patchset).
- Added more context to the cover letter.
- Updated Documentation/arm64/tagged-pointers.txt.

Changes in v6:
- Added annotations for user pointer casts found by sparse.
- Rebased onto 050cdc6c (4.19-rc1+).

Changes in v5:
- Added 3 new patches that add untagging to places found with static
analysis.
- Rebased onto 44c929e1 (4.18-rc8).

Changes in v4:
- Added a selftest for checking that passing tagged pointers to the
kernel succeeds.
- Rebased onto 81e97f013 (4.18-rc1+).

Changes in v3:
- Rebased onto e5c51f30 (4.17-rc6+).
- Added linux-arch@ to the list of recipients.

Changes in v2:
- Rebased onto 2d618bdf (4.17-rc3+).
- Removed excessive untagging in gup.c.
- Removed untagging pointers returned from __uaccess_mask_ptr.

Changes in v1:
- Rebased onto 4.17-rc1.

Changes in RFC v2:
- Added "#ifndef untagged_addr..." fallback in linux/uaccess.h instead of
defining it for each arch individually.
- Updated Documentation/arm64/tagged-pointers.txt.
- Dropped "mm, arm64: untag user addresses in memory syscalls".
- Rebased onto 3eb2ce82 (4.16-rc7).

Signed-off-by: Andrey Konovalov <[email protected]>

Andrey Konovalov (17):
uaccess: add untagged_addr definition for other arches
arm64: untag user pointers in access_ok and __uaccess_mask_ptr
lib, arm64: untag user pointers in strn*_user
mm: add ksys_ wrappers to memory syscalls
arms64: untag user pointers passed to memory syscalls
mm: untag user pointers in do_pages_move
mm, arm64: untag user pointers in mm/gup.c
mm, arm64: untag user pointers in get_vaddr_frames
fs, arm64: untag user pointers in copy_mount_options
fs, arm64: untag user pointers in fs/userfaultfd.c
drm/amdgpu, arm64: untag user pointers
drm/radeon, arm64: untag user pointers in radeon_gem_userptr_ioctl
IB, arm64: untag user pointers in ib_uverbs_(re)reg_mr()
media/v4l2-core, arm64: untag user pointers in
videobuf_dma_contig_user_get
tee, arm64: untag user pointers in tee_shm_register
vfio/type1, arm64: untag user pointers in vaddr_get_pfn
selftests, arm64: add a selftest for passing tagged pointers to kernel

arch/arm64/include/asm/uaccess.h | 10 +-
arch/arm64/kernel/sys.c | 128 ++++++++++++++++-
.../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c | 2 +-
drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c | 2 +
drivers/gpu/drm/radeon/radeon_gem.c | 2 +
drivers/infiniband/core/uverbs_cmd.c | 4 +
drivers/media/v4l2-core/videobuf-dma-contig.c | 9 +-
drivers/tee/tee_shm.c | 1 +
drivers/vfio/vfio_iommu_type1.c | 2 +
fs/namespace.c | 2 +-
fs/userfaultfd.c | 22 +--
include/linux/mm.h | 4 +
include/linux/syscalls.h | 22 +++
ipc/shm.c | 7 +-
lib/strncpy_from_user.c | 3 +-
lib/strnlen_user.c | 3 +-
mm/frame_vector.c | 2 +
mm/gup.c | 4 +
mm/madvise.c | 129 +++++++++---------
mm/mempolicy.c | 21 ++-
mm/migrate.c | 1 +
mm/mincore.c | 57 ++++----
mm/mlock.c | 20 ++-
mm/mmap.c | 30 +++-
mm/mprotect.c | 6 +-
mm/mremap.c | 27 ++--
mm/msync.c | 35 +++--
tools/testing/selftests/arm64/.gitignore | 1 +
tools/testing/selftests/arm64/Makefile | 11 ++
.../testing/selftests/arm64/run_tags_test.sh | 12 ++
tools/testing/selftests/arm64/tags_test.c | 21 +++
31 files changed, 436 insertions(+), 164 deletions(-)
create mode 100644 tools/testing/selftests/arm64/.gitignore
create mode 100644 tools/testing/selftests/arm64/Makefile
create mode 100755 tools/testing/selftests/arm64/run_tags_test.sh
create mode 100644 tools/testing/selftests/arm64/tags_test.c

--
2.21.0.1020.gf2820cf01a-goog


2019-05-06 16:32:36

by Andrey Konovalov

[permalink] [raw]
Subject: [PATCH v15 01/17] uaccess: add untagged_addr definition for other arches

To allow arm64 syscalls to accept tagged pointers from userspace, we must
untag them when they are passed to the kernel. Since untagging is done in
generic parts of the kernel, the untagged_addr macro needs to be defined
for all architectures.

Define it as a noop for architectures other than arm64.

Acked-by: Catalin Marinas <[email protected]>
Signed-off-by: Andrey Konovalov <[email protected]>
---
include/linux/mm.h | 4 ++++
1 file changed, 4 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 6b10c21630f5..44041df804a6 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -99,6 +99,10 @@ extern int mmap_rnd_compat_bits __read_mostly;
#include <asm/pgtable.h>
#include <asm/processor.h>

+#ifndef untagged_addr
+#define untagged_addr(addr) (addr)
+#endif
+
#ifndef __pa_symbol
#define __pa_symbol(x) __pa(RELOC_HIDE((unsigned long)(x), 0))
#endif
--
2.21.0.1020.gf2820cf01a-goog

2019-05-06 16:32:56

by Andrey Konovalov

[permalink] [raw]
Subject: [PATCH v15 07/17] mm, arm64: untag user pointers in mm/gup.c

This patch is a part of a series that extends arm64 kernel ABI to allow to
pass tagged user pointers (with the top byte set to something else other
than 0x00) as syscall arguments.

mm/gup.c provides a kernel interface that accepts user addresses and
manipulates user pages directly (for example get_user_pages, that is used
by the futex syscall). Since a user can provided tagged addresses, we need
to handle this case.

Add untagging to gup.c functions that use user addresses for vma lookups.

Signed-off-by: Andrey Konovalov <[email protected]>
---
mm/gup.c | 4 ++++
1 file changed, 4 insertions(+)

diff --git a/mm/gup.c b/mm/gup.c
index 91819b8ad9cc..2f477a0a7180 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -696,6 +696,8 @@ static long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
if (!nr_pages)
return 0;

+ start = untagged_addr(start);
+
VM_BUG_ON(!!pages != !!(gup_flags & FOLL_GET));

/*
@@ -858,6 +860,8 @@ int fixup_user_fault(struct task_struct *tsk, struct mm_struct *mm,
struct vm_area_struct *vma;
vm_fault_t ret, major = 0;

+ address = untagged_addr(address);
+
if (unlocked)
fault_flags |= FAULT_FLAG_ALLOW_RETRY;

--
2.21.0.1020.gf2820cf01a-goog

2019-05-06 16:32:57

by Andrey Konovalov

[permalink] [raw]
Subject: [PATCH v15 06/17] mm: untag user pointers in do_pages_move

This patch is a part of a series that extends arm64 kernel ABI to allow to
pass tagged user pointers (with the top byte set to something else other
than 0x00) as syscall arguments.

do_pages_move() is used in the implementation of the move_pages syscall.

Untag user pointers in this function.

Signed-off-by: Andrey Konovalov <[email protected]>
---
mm/migrate.c | 1 +
1 file changed, 1 insertion(+)

diff --git a/mm/migrate.c b/mm/migrate.c
index 663a5449367a..c014a07135f0 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1617,6 +1617,7 @@ static int do_pages_move(struct mm_struct *mm, nodemask_t task_nodes,
if (get_user(node, nodes + i))
goto out_flush;
addr = (unsigned long)p;
+ addr = untagged_addr(addr);

err = -ENODEV;
if (node < 0 || node >= MAX_NUMNODES)
--
2.21.0.1020.gf2820cf01a-goog

2019-05-06 16:33:07

by Andrey Konovalov

[permalink] [raw]
Subject: [PATCH v15 08/17] mm, arm64: untag user pointers in get_vaddr_frames

This patch is a part of a series that extends arm64 kernel ABI to allow to
pass tagged user pointers (with the top byte set to something else other
than 0x00) as syscall arguments.

get_vaddr_frames uses provided user pointers for vma lookups, which can
only by done with untagged pointers. Instead of locating and changing
all callers of this function, perform untagging in it.

Signed-off-by: Andrey Konovalov <[email protected]>
---
mm/frame_vector.c | 2 ++
1 file changed, 2 insertions(+)

diff --git a/mm/frame_vector.c b/mm/frame_vector.c
index c64dca6e27c2..c431ca81dad5 100644
--- a/mm/frame_vector.c
+++ b/mm/frame_vector.c
@@ -46,6 +46,8 @@ int get_vaddr_frames(unsigned long start, unsigned int nr_frames,
if (WARN_ON_ONCE(nr_frames > vec->nr_allocated))
nr_frames = vec->nr_allocated;

+ start = untagged_addr(start);
+
down_read(&mm->mmap_sem);
locked = 1;
vma = find_vma_intersection(mm, start, start + 1);
--
2.21.0.1020.gf2820cf01a-goog

2019-05-06 16:33:15

by Andrey Konovalov

[permalink] [raw]
Subject: [PATCH v15 09/17] fs, arm64: untag user pointers in copy_mount_options

This patch is a part of a series that extends arm64 kernel ABI to allow to
pass tagged user pointers (with the top byte set to something else other
than 0x00) as syscall arguments.

In copy_mount_options a user address is being subtracted from TASK_SIZE.
If the address is lower than TASK_SIZE, the size is calculated to not
allow the exact_copy_from_user() call to cross TASK_SIZE boundary.
However if the address is tagged, then the size will be calculated
incorrectly.

Untag the address before subtracting.

Signed-off-by: Andrey Konovalov <[email protected]>
---
fs/namespace.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index c9cab307fa77..c27e5713bf04 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -2825,7 +2825,7 @@ void *copy_mount_options(const void __user * data)
* the remainder of the page.
*/
/* copy_from_user cannot cross TASK_SIZE ! */
- size = TASK_SIZE - (unsigned long)data;
+ size = TASK_SIZE - (unsigned long)untagged_addr(data);
if (size > PAGE_SIZE)
size = PAGE_SIZE;

--
2.21.0.1020.gf2820cf01a-goog

2019-05-06 16:33:25

by Andrey Konovalov

[permalink] [raw]
Subject: [PATCH v15 13/17] IB, arm64: untag user pointers in ib_uverbs_(re)reg_mr()

This patch is a part of a series that extends arm64 kernel ABI to allow to
pass tagged user pointers (with the top byte set to something else other
than 0x00) as syscall arguments.

ib_uverbs_(re)reg_mr() use provided user pointers for vma lookups (through
e.g. mlx4_get_umem_mr()), which can only by done with untagged pointers.

Untag user pointers in these functions.

Signed-off-by: Andrey Konovalov <[email protected]>
---
drivers/infiniband/core/uverbs_cmd.c | 4 ++++
1 file changed, 4 insertions(+)

diff --git a/drivers/infiniband/core/uverbs_cmd.c b/drivers/infiniband/core/uverbs_cmd.c
index 062a86c04123..36e7b52577d0 100644
--- a/drivers/infiniband/core/uverbs_cmd.c
+++ b/drivers/infiniband/core/uverbs_cmd.c
@@ -708,6 +708,8 @@ static int ib_uverbs_reg_mr(struct uverbs_attr_bundle *attrs)
if (ret)
return ret;

+ cmd.start = untagged_addr(cmd.start);
+
if ((cmd.start & ~PAGE_MASK) != (cmd.hca_va & ~PAGE_MASK))
return -EINVAL;

@@ -790,6 +792,8 @@ static int ib_uverbs_rereg_mr(struct uverbs_attr_bundle *attrs)
if (ret)
return ret;

+ cmd.start = untagged_addr(cmd.start);
+
if (cmd.flags & ~IB_MR_REREG_SUPPORTED || !cmd.flags)
return -EINVAL;

--
2.21.0.1020.gf2820cf01a-goog

2019-05-06 16:33:31

by Andrey Konovalov

[permalink] [raw]
Subject: [PATCH v15 15/17] tee, arm64: untag user pointers in tee_shm_register

This patch is a part of a series that extends arm64 kernel ABI to allow to
pass tagged user pointers (with the top byte set to something else other
than 0x00) as syscall arguments.

tee_shm_register()->optee_shm_unregister()->check_mem_type() uses provided
user pointers for vma lookups (via __check_mem_type()), which can only by
done with untagged pointers.

Untag user pointers in this function.

Signed-off-by: Andrey Konovalov <[email protected]>
---
drivers/tee/tee_shm.c | 1 +
1 file changed, 1 insertion(+)

diff --git a/drivers/tee/tee_shm.c b/drivers/tee/tee_shm.c
index 0b9ab1d0dd45..8e7b52ab6c63 100644
--- a/drivers/tee/tee_shm.c
+++ b/drivers/tee/tee_shm.c
@@ -263,6 +263,7 @@ struct tee_shm *tee_shm_register(struct tee_context *ctx, unsigned long addr,
shm->teedev = teedev;
shm->ctx = ctx;
shm->id = -1;
+ addr = untagged_addr(addr);
start = rounddown(addr, PAGE_SIZE);
shm->offset = addr - start;
shm->size = length;
--
2.21.0.1020.gf2820cf01a-goog

2019-05-06 16:33:41

by Andrey Konovalov

[permalink] [raw]
Subject: [PATCH v15 17/17] selftests, arm64: add a selftest for passing tagged pointers to kernel

This patch is a part of a series that extends arm64 kernel ABI to allow to
pass tagged user pointers (with the top byte set to something else other
than 0x00) as syscall arguments.

This patch adds a simple test, that calls the uname syscall with a
tagged user pointer as an argument. Without the kernel accepting tagged
user pointers the test fails with EFAULT.

Signed-off-by: Andrey Konovalov <[email protected]>
---
tools/testing/selftests/arm64/.gitignore | 1 +
tools/testing/selftests/arm64/Makefile | 11 ++++++++++
.../testing/selftests/arm64/run_tags_test.sh | 12 +++++++++++
tools/testing/selftests/arm64/tags_test.c | 21 +++++++++++++++++++
4 files changed, 45 insertions(+)
create mode 100644 tools/testing/selftests/arm64/.gitignore
create mode 100644 tools/testing/selftests/arm64/Makefile
create mode 100755 tools/testing/selftests/arm64/run_tags_test.sh
create mode 100644 tools/testing/selftests/arm64/tags_test.c

diff --git a/tools/testing/selftests/arm64/.gitignore b/tools/testing/selftests/arm64/.gitignore
new file mode 100644
index 000000000000..e8fae8d61ed6
--- /dev/null
+++ b/tools/testing/selftests/arm64/.gitignore
@@ -0,0 +1 @@
+tags_test
diff --git a/tools/testing/selftests/arm64/Makefile b/tools/testing/selftests/arm64/Makefile
new file mode 100644
index 000000000000..a61b2e743e99
--- /dev/null
+++ b/tools/testing/selftests/arm64/Makefile
@@ -0,0 +1,11 @@
+# SPDX-License-Identifier: GPL-2.0
+
+# ARCH can be overridden by the user for cross compiling
+ARCH ?= $(shell uname -m 2>/dev/null || echo not)
+
+ifneq (,$(filter $(ARCH),aarch64 arm64))
+TEST_GEN_PROGS := tags_test
+TEST_PROGS := run_tags_test.sh
+endif
+
+include ../lib.mk
diff --git a/tools/testing/selftests/arm64/run_tags_test.sh b/tools/testing/selftests/arm64/run_tags_test.sh
new file mode 100755
index 000000000000..745f11379930
--- /dev/null
+++ b/tools/testing/selftests/arm64/run_tags_test.sh
@@ -0,0 +1,12 @@
+#!/bin/sh
+# SPDX-License-Identifier: GPL-2.0
+
+echo "--------------------"
+echo "running tags test"
+echo "--------------------"
+./tags_test
+if [ $? -ne 0 ]; then
+ echo "[FAIL]"
+else
+ echo "[PASS]"
+fi
diff --git a/tools/testing/selftests/arm64/tags_test.c b/tools/testing/selftests/arm64/tags_test.c
new file mode 100644
index 000000000000..2bd1830a7ebe
--- /dev/null
+++ b/tools/testing/selftests/arm64/tags_test.c
@@ -0,0 +1,21 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <unistd.h>
+#include <stdint.h>
+#include <sys/utsname.h>
+
+#define SHIFT_TAG(tag) ((uint64_t)(tag) << 56)
+#define SET_TAG(ptr, tag) (((uint64_t)(ptr) & ~SHIFT_TAG(0xff)) | \
+ SHIFT_TAG(tag))
+
+int main(void)
+{
+ struct utsname *ptr = (struct utsname *)malloc(sizeof(*ptr));
+ void *tagged_ptr = (void *)SET_TAG(ptr, 0x42);
+ int err = uname(tagged_ptr);
+
+ free(ptr);
+ return err;
+}
--
2.21.0.1020.gf2820cf01a-goog

2019-05-06 16:33:59

by Andrey Konovalov

[permalink] [raw]
Subject: [PATCH v15 14/17] media/v4l2-core, arm64: untag user pointers in videobuf_dma_contig_user_get

This patch is a part of a series that extends arm64 kernel ABI to allow to
pass tagged user pointers (with the top byte set to something else other
than 0x00) as syscall arguments.

videobuf_dma_contig_user_get() uses provided user pointers for vma
lookups, which can only by done with untagged pointers.

Untag the pointers in this function.

Signed-off-by: Andrey Konovalov <[email protected]>
---
drivers/media/v4l2-core/videobuf-dma-contig.c | 9 +++++----
1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/drivers/media/v4l2-core/videobuf-dma-contig.c b/drivers/media/v4l2-core/videobuf-dma-contig.c
index e1bf50df4c70..8a1ddd146b17 100644
--- a/drivers/media/v4l2-core/videobuf-dma-contig.c
+++ b/drivers/media/v4l2-core/videobuf-dma-contig.c
@@ -160,6 +160,7 @@ static void videobuf_dma_contig_user_put(struct videobuf_dma_contig_memory *mem)
static int videobuf_dma_contig_user_get(struct videobuf_dma_contig_memory *mem,
struct videobuf_buffer *vb)
{
+ unsigned long untagged_baddr = untagged_addr(vb->baddr);
struct mm_struct *mm = current->mm;
struct vm_area_struct *vma;
unsigned long prev_pfn, this_pfn;
@@ -167,22 +168,22 @@ static int videobuf_dma_contig_user_get(struct videobuf_dma_contig_memory *mem,
unsigned int offset;
int ret;

- offset = vb->baddr & ~PAGE_MASK;
+ offset = untagged_baddr & ~PAGE_MASK;
mem->size = PAGE_ALIGN(vb->size + offset);
ret = -EINVAL;

down_read(&mm->mmap_sem);

- vma = find_vma(mm, vb->baddr);
+ vma = find_vma(mm, untagged_baddr);
if (!vma)
goto out_up;

- if ((vb->baddr + mem->size) > vma->vm_end)
+ if ((untagged_baddr + mem->size) > vma->vm_end)
goto out_up;

pages_done = 0;
prev_pfn = 0; /* kill warning */
- user_address = vb->baddr;
+ user_address = untagged_baddr;

while (pages_done < (mem->size >> PAGE_SHIFT)) {
ret = follow_pfn(vma, user_address, &this_pfn);
--
2.21.0.1020.gf2820cf01a-goog

2019-05-06 16:34:11

by Andrey Konovalov

[permalink] [raw]
Subject: [PATCH v15 11/17] drm/amdgpu, arm64: untag user pointers

This patch is a part of a series that extends arm64 kernel ABI to allow to
pass tagged user pointers (with the top byte set to something else other
than 0x00) as syscall arguments.

In amdgpu_gem_userptr_ioctl() and amdgpu_amdkfd_gpuvm.c/init_user_pages()
an MMU notifier is set up with a (tagged) userspace pointer. The untagged
address should be used so that MMU notifiers for the untagged address get
correctly matched up with the right BO. This patch untag user pointers in
amdgpu_gem_userptr_ioctl() for the GEM case and in amdgpu_amdkfd_gpuvm_
alloc_memory_of_gpu() for the KFD case. This also makes sure that an
untagged pointer is passed to amdgpu_ttm_tt_get_user_pages(), which uses
it for vma lookups.

Suggested-by: Kuehling, Felix <[email protected]>
Signed-off-by: Andrey Konovalov <[email protected]>
---
drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c | 2 +-
drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c | 2 ++
2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
index 1921dec3df7a..20cac44ed449 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
@@ -1121,7 +1121,7 @@ int amdgpu_amdkfd_gpuvm_alloc_memory_of_gpu(
alloc_flags = 0;
if (!offset || !*offset)
return -EINVAL;
- user_addr = *offset;
+ user_addr = untagged_addr(*offset);
} else if (flags & ALLOC_MEM_FLAGS_DOORBELL) {
domain = AMDGPU_GEM_DOMAIN_GTT;
alloc_domain = AMDGPU_GEM_DOMAIN_CPU;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
index d21dd2f369da..985cb82b2aa6 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
@@ -286,6 +286,8 @@ int amdgpu_gem_userptr_ioctl(struct drm_device *dev, void *data,
uint32_t handle;
int r;

+ args->addr = untagged_addr(args->addr);
+
if (offset_in_page(args->addr | args->size))
return -EINVAL;

--
2.21.0.1020.gf2820cf01a-goog

2019-05-06 16:34:14

by Andrey Konovalov

[permalink] [raw]
Subject: [PATCH v15 12/17] drm/radeon, arm64: untag user pointers in radeon_gem_userptr_ioctl

This patch is a part of a series that extends arm64 kernel ABI to allow to
pass tagged user pointers (with the top byte set to something else other
than 0x00) as syscall arguments.

In radeon_gem_userptr_ioctl() an MMU notifier is set up with a (tagged)
userspace pointer. The untagged address should be used so that MMU
notifiers for the untagged address get correctly matched up with the right
BO. This funcation also calls radeon_ttm_tt_pin_userptr(), which uses
provided user pointers for vma lookups, which can only by done with
untagged pointers.

This patch untags user pointers in radeon_gem_userptr_ioctl().

Signed-off-by: Andrey Konovalov <[email protected]>
---
drivers/gpu/drm/radeon/radeon_gem.c | 2 ++
1 file changed, 2 insertions(+)

diff --git a/drivers/gpu/drm/radeon/radeon_gem.c b/drivers/gpu/drm/radeon/radeon_gem.c
index 44617dec8183..90eb78fb5eb2 100644
--- a/drivers/gpu/drm/radeon/radeon_gem.c
+++ b/drivers/gpu/drm/radeon/radeon_gem.c
@@ -291,6 +291,8 @@ int radeon_gem_userptr_ioctl(struct drm_device *dev, void *data,
uint32_t handle;
int r;

+ args->addr = untagged_addr(args->addr);
+
if (offset_in_page(args->addr | args->size))
return -EINVAL;

--
2.21.0.1020.gf2820cf01a-goog

2019-05-06 16:34:42

by Andrey Konovalov

[permalink] [raw]
Subject: [PATCH v15 10/17] fs, arm64: untag user pointers in fs/userfaultfd.c

This patch is a part of a series that extends arm64 kernel ABI to allow to
pass tagged user pointers (with the top byte set to something else other
than 0x00) as syscall arguments.

userfaultfd code use provided user pointers for vma lookups, which can
only by done with untagged pointers.

Untag user pointers in validate_range().

Signed-off-by: Andrey Konovalov <[email protected]>
---
fs/userfaultfd.c | 22 ++++++++++++----------
1 file changed, 12 insertions(+), 10 deletions(-)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index f5de1e726356..aa47ed0969dd 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -1261,21 +1261,23 @@ static __always_inline void wake_userfault(struct userfaultfd_ctx *ctx,
}

static __always_inline int validate_range(struct mm_struct *mm,
- __u64 start, __u64 len)
+ __u64 *start, __u64 len)
{
__u64 task_size = mm->task_size;

- if (start & ~PAGE_MASK)
+ *start = untagged_addr(*start);
+
+ if (*start & ~PAGE_MASK)
return -EINVAL;
if (len & ~PAGE_MASK)
return -EINVAL;
if (!len)
return -EINVAL;
- if (start < mmap_min_addr)
+ if (*start < mmap_min_addr)
return -EINVAL;
- if (start >= task_size)
+ if (*start >= task_size)
return -EINVAL;
- if (len > task_size - start)
+ if (len > task_size - *start)
return -EINVAL;
return 0;
}
@@ -1325,7 +1327,7 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx,
goto out;
}

- ret = validate_range(mm, uffdio_register.range.start,
+ ret = validate_range(mm, &uffdio_register.range.start,
uffdio_register.range.len);
if (ret)
goto out;
@@ -1514,7 +1516,7 @@ static int userfaultfd_unregister(struct userfaultfd_ctx *ctx,
if (copy_from_user(&uffdio_unregister, buf, sizeof(uffdio_unregister)))
goto out;

- ret = validate_range(mm, uffdio_unregister.start,
+ ret = validate_range(mm, &uffdio_unregister.start,
uffdio_unregister.len);
if (ret)
goto out;
@@ -1665,7 +1667,7 @@ static int userfaultfd_wake(struct userfaultfd_ctx *ctx,
if (copy_from_user(&uffdio_wake, buf, sizeof(uffdio_wake)))
goto out;

- ret = validate_range(ctx->mm, uffdio_wake.start, uffdio_wake.len);
+ ret = validate_range(ctx->mm, &uffdio_wake.start, uffdio_wake.len);
if (ret)
goto out;

@@ -1705,7 +1707,7 @@ static int userfaultfd_copy(struct userfaultfd_ctx *ctx,
sizeof(uffdio_copy)-sizeof(__s64)))
goto out;

- ret = validate_range(ctx->mm, uffdio_copy.dst, uffdio_copy.len);
+ ret = validate_range(ctx->mm, &uffdio_copy.dst, uffdio_copy.len);
if (ret)
goto out;
/*
@@ -1761,7 +1763,7 @@ static int userfaultfd_zeropage(struct userfaultfd_ctx *ctx,
sizeof(uffdio_zeropage)-sizeof(__s64)))
goto out;

- ret = validate_range(ctx->mm, uffdio_zeropage.range.start,
+ ret = validate_range(ctx->mm, &uffdio_zeropage.range.start,
uffdio_zeropage.range.len);
if (ret)
goto out;
--
2.21.0.1020.gf2820cf01a-goog

2019-05-06 16:34:45

by Andrey Konovalov

[permalink] [raw]
Subject: [PATCH v15 16/17] vfio/type1, arm64: untag user pointers in vaddr_get_pfn

This patch is a part of a series that extends arm64 kernel ABI to allow to
pass tagged user pointers (with the top byte set to something else other
than 0x00) as syscall arguments.

vaddr_get_pfn() uses provided user pointers for vma lookups, which can
only by done with untagged pointers.

Untag user pointers in this function.

Signed-off-by: Andrey Konovalov <[email protected]>
---
drivers/vfio/vfio_iommu_type1.c | 2 ++
1 file changed, 2 insertions(+)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index d0f731c9920a..5daa966d799e 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -382,6 +382,8 @@ static int vaddr_get_pfn(struct mm_struct *mm, unsigned long vaddr,

down_read(&mm->mmap_sem);

+ vaddr = untagged_addr(vaddr);
+
vma = find_vma_intersection(mm, vaddr, vaddr + 1);

if (vma && vma->vm_flags & VM_PFNMAP) {
--
2.21.0.1020.gf2820cf01a-goog

2019-05-06 16:35:10

by Andrey Konovalov

[permalink] [raw]
Subject: [PATCH v15 05/17] arms64: untag user pointers passed to memory syscalls

This patch is a part of a series that extends arm64 kernel ABI to allow to
pass tagged user pointers (with the top byte set to something else other
than 0x00) as syscall arguments.

This patch allows tagged pointers to be passed to the following memory
syscalls: brk, get_mempolicy, madvise, mbind, mincore, mlock, mlock2,
mmap, mmap_pgoff, mprotect, mremap, msync, munlock, munmap,
remap_file_pages, shmat and shmdt.

This is done by untagging pointers passed to these syscalls in the
prologues of their handlers.

Signed-off-by: Andrey Konovalov <[email protected]>
---
arch/arm64/kernel/sys.c | 128 +++++++++++++++++++++++++++++++++++++++-
1 file changed, 127 insertions(+), 1 deletion(-)

diff --git a/arch/arm64/kernel/sys.c b/arch/arm64/kernel/sys.c
index b44065fb1616..933bb9f3d6ec 100644
--- a/arch/arm64/kernel/sys.c
+++ b/arch/arm64/kernel/sys.c
@@ -35,10 +35,33 @@ SYSCALL_DEFINE6(mmap, unsigned long, addr, unsigned long, len,
{
if (offset_in_page(off) != 0)
return -EINVAL;
-
+ addr = untagged_addr(addr);
return ksys_mmap_pgoff(addr, len, prot, flags, fd, off >> PAGE_SHIFT);
}

+SYSCALL_DEFINE6(arm64_mmap_pgoff, unsigned long, addr, unsigned long, len,
+ unsigned long, prot, unsigned long, flags,
+ unsigned long, fd, unsigned long, pgoff)
+{
+ addr = untagged_addr(addr);
+ return ksys_mmap_pgoff(addr, len, prot, flags, fd, pgoff);
+}
+
+SYSCALL_DEFINE5(arm64_mremap, unsigned long, addr, unsigned long, old_len,
+ unsigned long, new_len, unsigned long, flags,
+ unsigned long, new_addr)
+{
+ addr = untagged_addr(addr);
+ new_addr = untagged_addr(new_addr);
+ return ksys_mremap(addr, old_len, new_len, flags, new_addr);
+}
+
+SYSCALL_DEFINE2(arm64_munmap, unsigned long, addr, size_t, len)
+{
+ addr = untagged_addr(addr);
+ return ksys_munmap(addr, len);
+}
+
SYSCALL_DEFINE1(arm64_personality, unsigned int, personality)
{
if (personality(personality) == PER_LINUX32 &&
@@ -47,10 +70,113 @@ SYSCALL_DEFINE1(arm64_personality, unsigned int, personality)
return ksys_personality(personality);
}

+SYSCALL_DEFINE1(arm64_brk, unsigned long, brk)
+{
+ brk = untagged_addr(brk);
+ return ksys_brk(brk);
+}
+
+SYSCALL_DEFINE5(arm64_get_mempolicy, int __user *, policy,
+ unsigned long __user *, nmask, unsigned long, maxnode,
+ unsigned long, addr, unsigned long, flags)
+{
+ addr = untagged_addr(addr);
+ return ksys_get_mempolicy(policy, nmask, maxnode, addr, flags);
+}
+
+SYSCALL_DEFINE3(arm64_madvise, unsigned long, start,
+ size_t, len_in, int, behavior)
+{
+ start = untagged_addr(start);
+ return ksys_madvise(start, len_in, behavior);
+}
+
+SYSCALL_DEFINE6(arm64_mbind, unsigned long, start, unsigned long, len,
+ unsigned long, mode, const unsigned long __user *, nmask,
+ unsigned long, maxnode, unsigned int, flags)
+{
+ start = untagged_addr(start);
+ return ksys_mbind(start, len, mode, nmask, maxnode, flags);
+}
+
+SYSCALL_DEFINE2(arm64_mlock, unsigned long, start, size_t, len)
+{
+ start = untagged_addr(start);
+ return ksys_mlock(start, len, VM_LOCKED);
+}
+
+SYSCALL_DEFINE2(arm64_mlock2, unsigned long, start, size_t, len)
+{
+ start = untagged_addr(start);
+ return ksys_mlock(start, len, VM_LOCKED);
+}
+
+SYSCALL_DEFINE2(arm64_munlock, unsigned long, start, size_t, len)
+{
+ start = untagged_addr(start);
+ return ksys_munlock(start, len);
+}
+
+SYSCALL_DEFINE3(arm64_mprotect, unsigned long, start, size_t, len,
+ unsigned long, prot)
+{
+ start = untagged_addr(start);
+ return ksys_mprotect_pkey(start, len, prot, -1);
+}
+
+SYSCALL_DEFINE3(arm64_msync, unsigned long, start, size_t, len, int, flags)
+{
+ start = untagged_addr(start);
+ return ksys_msync(start, len, flags);
+}
+
+SYSCALL_DEFINE3(arm64_mincore, unsigned long, start, size_t, len,
+ unsigned char __user *, vec)
+{
+ start = untagged_addr(start);
+ return ksys_mincore(start, len, vec);
+}
+
+SYSCALL_DEFINE5(arm64_remap_file_pages, unsigned long, start,
+ unsigned long, size, unsigned long, prot,
+ unsigned long, pgoff, unsigned long, flags)
+{
+ start = untagged_addr(start);
+ return ksys_remap_file_pages(start, size, prot, pgoff, flags);
+}
+
+SYSCALL_DEFINE3(arm64_shmat, int, shmid, char __user *, shmaddr, int, shmflg)
+{
+ shmaddr = untagged_addr(shmaddr);
+ return ksys_shmat(shmid, shmaddr, shmflg);
+}
+
+SYSCALL_DEFINE1(arm64_shmdt, char __user *, shmaddr)
+{
+ shmaddr = untagged_addr(shmaddr);
+ return ksys_shmdt(shmaddr);
+}
+
/*
* Wrappers to pass the pt_regs argument.
*/
#define sys_personality sys_arm64_personality
+#define sys_mmap_pgoff sys_arm64_mmap_pgoff
+#define sys_mremap sys_arm64_mremap
+#define sys_munmap sys_arm64_munmap
+#define sys_brk sys_arm64_brk
+#define sys_get_mempolicy sys_arm64_get_mempolicy
+#define sys_madvise sys_arm64_madvise
+#define sys_mbind sys_arm64_mbind
+#define sys_mlock sys_arm64_mlock
+#define sys_mlock2 sys_arm64_mlock2
+#define sys_munlock sys_arm64_munlock
+#define sys_mprotect sys_arm64_mprotect
+#define sys_msync sys_arm64_msync
+#define sys_mincore sys_arm64_mincore
+#define sys_remap_file_pages sys_arm64_remap_file_pages
+#define sys_shmat sys_arm64_shmat
+#define sys_shmdt sys_arm64_shmdt

asmlinkage long sys_ni_syscall(const struct pt_regs *);
#define __arm64_sys_ni_syscall sys_ni_syscall
--
2.21.0.1020.gf2820cf01a-goog

2019-05-06 16:35:10

by Andrey Konovalov

[permalink] [raw]
Subject: [PATCH v15 04/17] mm: add ksys_ wrappers to memory syscalls

This patch is a part of a series that extends arm64 kernel ABI to allow to
pass tagged user pointers (with the top byte set to something else other
than 0x00) as syscall arguments.

This patch adds ksys_ wrappers to the following memory syscalls:

brk, get_mempolicy (renamed kernel_get_mempolicy -> ksys_get_mempolicy),
madvise, mbind (renamed kernel_mbind -> ksys_mbind), mincore,
mlock (renamed do_mlock -> ksys_mlock), mlock2, mmap_pgoff,
mprotect (renamed do_mprotect_pkey -> ksys_mprotect_pkey), mremap, msync,
munlock, munmap, remap_file_pages, shmat, shmdt.

The next patch in this series will add a custom implementation for these
syscalls that makes them accept tagged pointers on arm64.

Signed-off-by: Andrey Konovalov <[email protected]>
---
include/linux/syscalls.h | 22 +++++++
ipc/shm.c | 7 ++-
mm/madvise.c | 129 ++++++++++++++++++++-------------------
mm/mempolicy.c | 21 +++----
mm/mincore.c | 57 +++++++++--------
mm/mlock.c | 20 ++++--
mm/mmap.c | 30 ++++++---
mm/mprotect.c | 6 +-
mm/mremap.c | 27 +++++---
mm/msync.c | 35 ++++++-----
10 files changed, 213 insertions(+), 141 deletions(-)

diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index e446806a561f..70008f5ed84f 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -1260,6 +1260,28 @@ int ksys_ipc(unsigned int call, int first, unsigned long second,
unsigned long third, void __user * ptr, long fifth);
int compat_ksys_ipc(u32 call, int first, int second,
u32 third, u32 ptr, u32 fifth);
+unsigned long ksys_mremap(unsigned long addr, unsigned long old_len,
+ unsigned long new_len, unsigned long flags,
+ unsigned long new_addr);
+int ksys_munmap(unsigned long addr, size_t len);
+unsigned long ksys_brk(unsigned long brk);
+int ksys_get_mempolicy(int __user *policy, unsigned long __user *nmask,
+ unsigned long maxnode, unsigned long addr, unsigned long flags);
+int ksys_madvise(unsigned long start, size_t len_in, int behavior);
+long ksys_mbind(unsigned long start, unsigned long len,
+ unsigned long mode, const unsigned long __user *nmask,
+ unsigned long maxnode, unsigned int flags);
+__must_check int ksys_mlock(unsigned long start, size_t len, vm_flags_t flags);
+__must_check int ksys_mlock2(unsigned long start, size_t len, vm_flags_t flags);
+int ksys_munlock(unsigned long start, size_t len);
+int ksys_mprotect_pkey(unsigned long start, size_t len,
+ unsigned long prot, int pkey);
+int ksys_msync(unsigned long start, size_t len, int flags);
+long ksys_mincore(unsigned long start, size_t len, unsigned char __user *vec);
+unsigned long ksys_remap_file_pages(unsigned long start, unsigned long size,
+ unsigned long prot, unsigned long pgoff, unsigned long flags);
+long ksys_shmat(int shmid, char __user *shmaddr, int shmflg);
+long ksys_shmdt(char __user *shmaddr);

/*
* The following kernel syscall equivalents are just wrappers to fs-internal
diff --git a/ipc/shm.c b/ipc/shm.c
index ce1ca9f7c6e9..557b43968c0e 100644
--- a/ipc/shm.c
+++ b/ipc/shm.c
@@ -1588,7 +1588,7 @@ long do_shmat(int shmid, char __user *shmaddr, int shmflg,
return err;
}

-SYSCALL_DEFINE3(shmat, int, shmid, char __user *, shmaddr, int, shmflg)
+long ksys_shmat(int shmid, char __user *shmaddr, int shmflg)
{
unsigned long ret;
long err;
@@ -1600,6 +1600,11 @@ SYSCALL_DEFINE3(shmat, int, shmid, char __user *, shmaddr, int, shmflg)
return (long)ret;
}

+SYSCALL_DEFINE3(shmat, int, shmid, char __user *, shmaddr, int, shmflg)
+{
+ return ksys_shmat(shmid, shmaddr, shmflg);
+}
+
#ifdef CONFIG_COMPAT

#ifndef COMPAT_SHMLBA
diff --git a/mm/madvise.c b/mm/madvise.c
index 21a7881a2db4..c27f5f14e2ee 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -738,68 +738,7 @@ madvise_behavior_valid(int behavior)
}
}

-/*
- * The madvise(2) system call.
- *
- * Applications can use madvise() to advise the kernel how it should
- * handle paging I/O in this VM area. The idea is to help the kernel
- * use appropriate read-ahead and caching techniques. The information
- * provided is advisory only, and can be safely disregarded by the
- * kernel without affecting the correct operation of the application.
- *
- * behavior values:
- * MADV_NORMAL - the default behavior is to read clusters. This
- * results in some read-ahead and read-behind.
- * MADV_RANDOM - the system should read the minimum amount of data
- * on any access, since it is unlikely that the appli-
- * cation will need more than what it asks for.
- * MADV_SEQUENTIAL - pages in the given range will probably be accessed
- * once, so they can be aggressively read ahead, and
- * can be freed soon after they are accessed.
- * MADV_WILLNEED - the application is notifying the system to read
- * some pages ahead.
- * MADV_DONTNEED - the application is finished with the given range,
- * so the kernel can free resources associated with it.
- * MADV_FREE - the application marks pages in the given range as lazy free,
- * where actual purges are postponed until memory pressure happens.
- * MADV_REMOVE - the application wants to free up the given range of
- * pages and associated backing store.
- * MADV_DONTFORK - omit this area from child's address space when forking:
- * typically, to avoid COWing pages pinned by get_user_pages().
- * MADV_DOFORK - cancel MADV_DONTFORK: no longer omit this area when forking.
- * MADV_WIPEONFORK - present the child process with zero-filled memory in this
- * range after a fork.
- * MADV_KEEPONFORK - undo the effect of MADV_WIPEONFORK
- * MADV_HWPOISON - trigger memory error handler as if the given memory range
- * were corrupted by unrecoverable hardware memory failure.
- * MADV_SOFT_OFFLINE - try to soft-offline the given range of memory.
- * MADV_MERGEABLE - the application recommends that KSM try to merge pages in
- * this area with pages of identical content from other such areas.
- * MADV_UNMERGEABLE- cancel MADV_MERGEABLE: no longer merge pages with others.
- * MADV_HUGEPAGE - the application wants to back the given range by transparent
- * huge pages in the future. Existing pages might be coalesced and
- * new pages might be allocated as THP.
- * MADV_NOHUGEPAGE - mark the given range as not worth being backed by
- * transparent huge pages so the existing pages will not be
- * coalesced into THP and new pages will not be allocated as THP.
- * MADV_DONTDUMP - the application wants to prevent pages in the given range
- * from being included in its core dump.
- * MADV_DODUMP - cancel MADV_DONTDUMP: no longer exclude from core dump.
- *
- * return values:
- * zero - success
- * -EINVAL - start + len < 0, start is not page-aligned,
- * "behavior" is not a valid value, or application
- * is attempting to release locked or shared pages,
- * or the specified address range includes file, Huge TLB,
- * MAP_SHARED or VMPFNMAP range.
- * -ENOMEM - addresses in the specified range are not currently
- * mapped, or are outside the AS of the process.
- * -EIO - an I/O error occurred while paging in data.
- * -EBADF - map exists, but area maps something that isn't a file.
- * -EAGAIN - a kernel resource was temporarily unavailable.
- */
-SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)
+int ksys_madvise(unsigned long start, size_t len_in, int behavior)
{
unsigned long end, tmp;
struct vm_area_struct *vma, *prev;
@@ -894,3 +833,69 @@ SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)

return error;
}
+
+/*
+ * The madvise(2) system call.
+ *
+ * Applications can use madvise() to advise the kernel how it should
+ * handle paging I/O in this VM area. The idea is to help the kernel
+ * use appropriate read-ahead and caching techniques. The information
+ * provided is advisory only, and can be safely disregarded by the
+ * kernel without affecting the correct operation of the application.
+ *
+ * behavior values:
+ * MADV_NORMAL - the default behavior is to read clusters. This
+ * results in some read-ahead and read-behind.
+ * MADV_RANDOM - the system should read the minimum amount of data
+ * on any access, since it is unlikely that the appli-
+ * cation will need more than what it asks for.
+ * MADV_SEQUENTIAL - pages in the given range will probably be accessed
+ * once, so they can be aggressively read ahead, and
+ * can be freed soon after they are accessed.
+ * MADV_WILLNEED - the application is notifying the system to read
+ * some pages ahead.
+ * MADV_DONTNEED - the application is finished with the given range,
+ * so the kernel can free resources associated with it.
+ * MADV_FREE - the application marks pages in the given range as lazy free,
+ * where actual purges are postponed until memory pressure happens.
+ * MADV_REMOVE - the application wants to free up the given range of
+ * pages and associated backing store.
+ * MADV_DONTFORK - omit this area from child's address space when forking:
+ * typically, to avoid COWing pages pinned by get_user_pages().
+ * MADV_DOFORK - cancel MADV_DONTFORK: no longer omit this area when forking.
+ * MADV_WIPEONFORK - present the child process with zero-filled memory in this
+ * range after a fork.
+ * MADV_KEEPONFORK - undo the effect of MADV_WIPEONFORK
+ * MADV_HWPOISON - trigger memory error handler as if the given memory range
+ * were corrupted by unrecoverable hardware memory failure.
+ * MADV_SOFT_OFFLINE - try to soft-offline the given range of memory.
+ * MADV_MERGEABLE - the application recommends that KSM try to merge pages in
+ * this area with pages of identical content from other such areas.
+ * MADV_UNMERGEABLE- cancel MADV_MERGEABLE: no longer merge pages with others.
+ * MADV_HUGEPAGE - the application wants to back the given range by transparent
+ * huge pages in the future. Existing pages might be coalesced and
+ * new pages might be allocated as THP.
+ * MADV_NOHUGEPAGE - mark the given range as not worth being backed by
+ * transparent huge pages so the existing pages will not be
+ * coalesced into THP and new pages will not be allocated as THP.
+ * MADV_DONTDUMP - the application wants to prevent pages in the given range
+ * from being included in its core dump.
+ * MADV_DODUMP - cancel MADV_DONTDUMP: no longer exclude from core dump.
+ *
+ * return values:
+ * zero - success
+ * -EINVAL - start + len < 0, start is not page-aligned,
+ * "behavior" is not a valid value, or application
+ * is attempting to release locked or shared pages,
+ * or the specified address range includes file, Huge TLB,
+ * MAP_SHARED or VMPFNMAP range.
+ * -ENOMEM - addresses in the specified range are not currently
+ * mapped, or are outside the AS of the process.
+ * -EIO - an I/O error occurred while paging in data.
+ * -EBADF - map exists, but area maps something that isn't a file.
+ * -EAGAIN - a kernel resource was temporarily unavailable.
+ */
+SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)
+{
+ return ksys_madvise(start, len_in, behavior);
+}
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 2219e747df49..c2f82a045ceb 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1352,9 +1352,9 @@ static int copy_nodes_to_user(unsigned long __user *mask, unsigned long maxnode,
return copy_to_user(mask, nodes_addr(*nodes), copy) ? -EFAULT : 0;
}

-static long kernel_mbind(unsigned long start, unsigned long len,
- unsigned long mode, const unsigned long __user *nmask,
- unsigned long maxnode, unsigned int flags)
+long ksys_mbind(unsigned long start, unsigned long len,
+ unsigned long mode, const unsigned long __user *nmask,
+ unsigned long maxnode, unsigned int flags)
{
nodemask_t nodes;
int err;
@@ -1377,7 +1377,7 @@ SYSCALL_DEFINE6(mbind, unsigned long, start, unsigned long, len,
unsigned long, mode, const unsigned long __user *, nmask,
unsigned long, maxnode, unsigned int, flags)
{
- return kernel_mbind(start, len, mode, nmask, maxnode, flags);
+ return ksys_mbind(start, len, mode, nmask, maxnode, flags);
}

/* Set the process memory policy */
@@ -1507,11 +1507,8 @@ SYSCALL_DEFINE4(migrate_pages, pid_t, pid, unsigned long, maxnode,


/* Retrieve NUMA policy */
-static int kernel_get_mempolicy(int __user *policy,
- unsigned long __user *nmask,
- unsigned long maxnode,
- unsigned long addr,
- unsigned long flags)
+int ksys_get_mempolicy(int __user *policy, unsigned long __user *nmask,
+ unsigned long maxnode, unsigned long addr, unsigned long flags)
{
int err;
int uninitialized_var(pval);
@@ -1538,7 +1535,7 @@ SYSCALL_DEFINE5(get_mempolicy, int __user *, policy,
unsigned long __user *, nmask, unsigned long, maxnode,
unsigned long, addr, unsigned long, flags)
{
- return kernel_get_mempolicy(policy, nmask, maxnode, addr, flags);
+ return ksys_get_mempolicy(policy, nmask, maxnode, addr, flags);
}

#ifdef CONFIG_COMPAT
@@ -1559,7 +1556,7 @@ COMPAT_SYSCALL_DEFINE5(get_mempolicy, int __user *, policy,
if (nmask)
nm = compat_alloc_user_space(alloc_size);

- err = kernel_get_mempolicy(policy, nm, nr_bits+1, addr, flags);
+ err = ksys_get_mempolicy(policy, nm, nr_bits+1, addr, flags);

if (!err && nmask) {
unsigned long copy_size;
@@ -1613,7 +1610,7 @@ COMPAT_SYSCALL_DEFINE6(mbind, compat_ulong_t, start, compat_ulong_t, len,
return -EFAULT;
}

- return kernel_mbind(start, len, mode, nm, nr_bits+1, flags);
+ return ksys_mbind(start, len, mode, nm, nr_bits+1, flags);
}

COMPAT_SYSCALL_DEFINE4(migrate_pages, compat_pid_t, pid,
diff --git a/mm/mincore.c b/mm/mincore.c
index 218099b5ed31..a609bd8128da 100644
--- a/mm/mincore.c
+++ b/mm/mincore.c
@@ -197,32 +197,7 @@ static long do_mincore(unsigned long addr, unsigned long pages, unsigned char *v
return (end - addr) >> PAGE_SHIFT;
}

-/*
- * The mincore(2) system call.
- *
- * mincore() returns the memory residency status of the pages in the
- * current process's address space specified by [addr, addr + len).
- * The status is returned in a vector of bytes. The least significant
- * bit of each byte is 1 if the referenced page is in memory, otherwise
- * it is zero.
- *
- * Because the status of a page can change after mincore() checks it
- * but before it returns to the application, the returned vector may
- * contain stale information. Only locked pages are guaranteed to
- * remain in memory.
- *
- * return values:
- * zero - success
- * -EFAULT - vec points to an illegal address
- * -EINVAL - addr is not a multiple of PAGE_SIZE
- * -ENOMEM - Addresses in the range [addr, addr + len] are
- * invalid for the address space of this process, or
- * specify one or more pages which are not currently
- * mapped
- * -EAGAIN - A kernel resource was temporarily unavailable.
- */
-SYSCALL_DEFINE3(mincore, unsigned long, start, size_t, len,
- unsigned char __user *, vec)
+long ksys_mincore(unsigned long start, size_t len, unsigned char __user *vec)
{
long retval;
unsigned long pages;
@@ -271,3 +246,33 @@ SYSCALL_DEFINE3(mincore, unsigned long, start, size_t, len,
free_page((unsigned long) tmp);
return retval;
}
+
+/*
+ * The mincore(2) system call.
+ *
+ * mincore() returns the memory residency status of the pages in the
+ * current process's address space specified by [addr, addr + len).
+ * The status is returned in a vector of bytes. The least significant
+ * bit of each byte is 1 if the referenced page is in memory, otherwise
+ * it is zero.
+ *
+ * Because the status of a page can change after mincore() checks it
+ * but before it returns to the application, the returned vector may
+ * contain stale information. Only locked pages are guaranteed to
+ * remain in memory.
+ *
+ * return values:
+ * zero - success
+ * -EFAULT - vec points to an illegal address
+ * -EINVAL - addr is not a multiple of PAGE_SIZE
+ * -ENOMEM - Addresses in the range [addr, addr + len] are
+ * invalid for the address space of this process, or
+ * specify one or more pages which are not currently
+ * mapped
+ * -EAGAIN - A kernel resource was temporarily unavailable.
+ */
+SYSCALL_DEFINE3(mincore, unsigned long, start, size_t, len,
+ unsigned char __user *, vec)
+{
+ return ksys_mincore(start, len, vec);
+}
diff --git a/mm/mlock.c b/mm/mlock.c
index 080f3b36415b..09e449447539 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -668,7 +668,7 @@ static int count_mm_mlocked_page_nr(struct mm_struct *mm,
return count >> PAGE_SHIFT;
}

-static __must_check int do_mlock(unsigned long start, size_t len, vm_flags_t flags)
+__must_check int ksys_mlock(unsigned long start, size_t len, vm_flags_t flags)
{
unsigned long locked;
unsigned long lock_limit;
@@ -715,10 +715,10 @@ static __must_check int do_mlock(unsigned long start, size_t len, vm_flags_t fla

SYSCALL_DEFINE2(mlock, unsigned long, start, size_t, len)
{
- return do_mlock(start, len, VM_LOCKED);
+ return ksys_mlock(start, len, VM_LOCKED);
}

-SYSCALL_DEFINE3(mlock2, unsigned long, start, size_t, len, int, flags)
+__must_check int ksys_mlock2(unsigned long start, size_t len, vm_flags_t flags)
{
vm_flags_t vm_flags = VM_LOCKED;

@@ -728,10 +728,15 @@ SYSCALL_DEFINE3(mlock2, unsigned long, start, size_t, len, int, flags)
if (flags & MLOCK_ONFAULT)
vm_flags |= VM_LOCKONFAULT;

- return do_mlock(start, len, vm_flags);
+ return ksys_mlock(start, len, vm_flags);
}

-SYSCALL_DEFINE2(munlock, unsigned long, start, size_t, len)
+SYSCALL_DEFINE3(mlock2, unsigned long, start, size_t, len, int, flags)
+{
+ return ksys_mlock2(start, len, flags);
+}
+
+int ksys_munlock(unsigned long start, size_t len)
{
int ret;

@@ -746,6 +751,11 @@ SYSCALL_DEFINE2(munlock, unsigned long, start, size_t, len)
return ret;
}

+SYSCALL_DEFINE2(munlock, unsigned long, start, size_t, len)
+{
+ return ksys_munlock(start, len);
+}
+
/*
* Take the MCL_* flags passed into mlockall (or 0 if called from munlockall)
* and translate into the appropriate modifications to mm->def_flags and/or the
diff --git a/mm/mmap.c b/mm/mmap.c
index bd7b9f293b39..09bfaf36b961 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -189,7 +189,8 @@ static struct vm_area_struct *remove_vma(struct vm_area_struct *vma)

static int do_brk_flags(unsigned long addr, unsigned long request, unsigned long flags,
struct list_head *uf);
-SYSCALL_DEFINE1(brk, unsigned long, brk)
+
+unsigned long ksys_brk(unsigned long brk)
{
unsigned long retval;
unsigned long newbrk, oldbrk, origbrk;
@@ -288,6 +289,11 @@ SYSCALL_DEFINE1(brk, unsigned long, brk)
return retval;
}

+SYSCALL_DEFINE1(brk, unsigned long, brk)
+{
+ return ksys_brk(brk);
+}
+
static long vma_compute_subtree_gap(struct vm_area_struct *vma)
{
unsigned long max, prev_end, subtree_gap;
@@ -2870,18 +2876,19 @@ int vm_munmap(unsigned long start, size_t len)
}
EXPORT_SYMBOL(vm_munmap);

-SYSCALL_DEFINE2(munmap, unsigned long, addr, size_t, len)
+int ksys_munmap(unsigned long addr, size_t len)
{
profile_munmap(addr);
return __vm_munmap(addr, len, true);
}

+SYSCALL_DEFINE2(munmap, unsigned long, addr, size_t, len)
+{
+ return ksys_munmap(addr, len);
+}

-/*
- * Emulation of deprecated remap_file_pages() syscall.
- */
-SYSCALL_DEFINE5(remap_file_pages, unsigned long, start, unsigned long, size,
- unsigned long, prot, unsigned long, pgoff, unsigned long, flags)
+unsigned long ksys_remap_file_pages(unsigned long start, unsigned long size,
+ unsigned long prot, unsigned long pgoff, unsigned long flags)
{

struct mm_struct *mm = current->mm;
@@ -2976,6 +2983,15 @@ SYSCALL_DEFINE5(remap_file_pages, unsigned long, start, unsigned long, size,
return ret;
}

+/*
+ * Emulation of deprecated remap_file_pages() syscall.
+ */
+SYSCALL_DEFINE5(remap_file_pages, unsigned long, start, unsigned long, size,
+ unsigned long, prot, unsigned long, pgoff, unsigned long, flags)
+{
+ return ksys_remap_file_pages(start, size, prot, pgoff, flags);
+}
+
/*
* this is really a simplified "do_mmap". it only handles
* anonymous maps. eventually we may be able to do some
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 028c724dcb1a..07344bdd7a04 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -454,7 +454,7 @@ mprotect_fixup(struct vm_area_struct *vma, struct vm_area_struct **pprev,
/*
* pkey==-1 when doing a legacy mprotect()
*/
-static int do_mprotect_pkey(unsigned long start, size_t len,
+int ksys_mprotect_pkey(unsigned long start, size_t len,
unsigned long prot, int pkey)
{
unsigned long nstart, end, tmp, reqprot;
@@ -578,7 +578,7 @@ static int do_mprotect_pkey(unsigned long start, size_t len,
SYSCALL_DEFINE3(mprotect, unsigned long, start, size_t, len,
unsigned long, prot)
{
- return do_mprotect_pkey(start, len, prot, -1);
+ return ksys_mprotect_pkey(start, len, prot, -1);
}

#ifdef CONFIG_ARCH_HAS_PKEYS
@@ -586,7 +586,7 @@ SYSCALL_DEFINE3(mprotect, unsigned long, start, size_t, len,
SYSCALL_DEFINE4(pkey_mprotect, unsigned long, start, size_t, len,
unsigned long, prot, int, pkey)
{
- return do_mprotect_pkey(start, len, prot, pkey);
+ return ksys_mprotect_pkey(start, len, prot, pkey);
}

SYSCALL_DEFINE2(pkey_alloc, unsigned long, flags, unsigned long, init_val)
diff --git a/mm/mremap.c b/mm/mremap.c
index e3edef6b7a12..fec1f9911388 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -584,16 +584,9 @@ static int vma_expandable(struct vm_area_struct *vma, unsigned long delta)
return 1;
}

-/*
- * Expand (or shrink) an existing mapping, potentially moving it at the
- * same time (controlled by the MREMAP_MAYMOVE flag and available VM space)
- *
- * MREMAP_FIXED option added 5-Dec-1999 by Benjamin LaHaise
- * This option implies MREMAP_MAYMOVE.
- */
-SYSCALL_DEFINE5(mremap, unsigned long, addr, unsigned long, old_len,
- unsigned long, new_len, unsigned long, flags,
- unsigned long, new_addr)
+unsigned long ksys_mremap(unsigned long addr, unsigned long old_len,
+ unsigned long new_len, unsigned long flags,
+ unsigned long new_addr)
{
struct mm_struct *mm = current->mm;
struct vm_area_struct *vma;
@@ -726,3 +719,17 @@ SYSCALL_DEFINE5(mremap, unsigned long, addr, unsigned long, old_len,
userfaultfd_unmap_complete(mm, &uf_unmap);
return ret;
}
+
+/*
+ * Expand (or shrink) an existing mapping, potentially moving it at the
+ * same time (controlled by the MREMAP_MAYMOVE flag and available VM space)
+ *
+ * MREMAP_FIXED option added 5-Dec-1999 by Benjamin LaHaise
+ * This option implies MREMAP_MAYMOVE.
+ */
+SYSCALL_DEFINE5(mremap, unsigned long, addr, unsigned long, old_len,
+ unsigned long, new_len, unsigned long, flags,
+ unsigned long, new_addr)
+{
+ return ksys_mremap(addr, old_len, new_len, flags, new_addr);
+}
diff --git a/mm/msync.c b/mm/msync.c
index ef30a429623a..b5a013549626 100644
--- a/mm/msync.c
+++ b/mm/msync.c
@@ -15,21 +15,7 @@
#include <linux/syscalls.h>
#include <linux/sched.h>

-/*
- * MS_SYNC syncs the entire file - including mappings.
- *
- * MS_ASYNC does not start I/O (it used to, up to 2.5.67).
- * Nor does it marks the relevant pages dirty (it used to up to 2.6.17).
- * Now it doesn't do anything, since dirty pages are properly tracked.
- *
- * The application may now run fsync() to
- * write out the dirty pages and wait on the writeout and check the result.
- * Or the application may run fadvise(FADV_DONTNEED) against the fd to start
- * async writeout immediately.
- * So by _not_ starting I/O in MS_ASYNC we provide complete flexibility to
- * applications.
- */
-SYSCALL_DEFINE3(msync, unsigned long, start, size_t, len, int, flags)
+int ksys_msync(unsigned long start, size_t len, int flags)
{
unsigned long end;
struct mm_struct *mm = current->mm;
@@ -106,3 +92,22 @@ SYSCALL_DEFINE3(msync, unsigned long, start, size_t, len, int, flags)
out:
return error ? : unmapped_error;
}
+
+/*
+ * MS_SYNC syncs the entire file - including mappings.
+ *
+ * MS_ASYNC does not start I/O (it used to, up to 2.5.67).
+ * Nor does it marks the relevant pages dirty (it used to up to 2.6.17).
+ * Now it doesn't do anything, since dirty pages are properly tracked.
+ *
+ * The application may now run fsync() to
+ * write out the dirty pages and wait on the writeout and check the result.
+ * Or the application may run fadvise(FADV_DONTNEED) against the fd to start
+ * async writeout immediately.
+ * So by _not_ starting I/O in MS_ASYNC we provide complete flexibility to
+ * applications.
+ */
+SYSCALL_DEFINE3(msync, unsigned long, start, size_t, len, int, flags)
+{
+ return ksys_msync(start, len, flags);
+}
--
2.21.0.1020.gf2820cf01a-goog

2019-05-06 16:35:19

by Andrey Konovalov

[permalink] [raw]
Subject: [PATCH v15 03/17] lib, arm64: untag user pointers in strn*_user

This patch is a part of a series that extends arm64 kernel ABI to allow to
pass tagged user pointers (with the top byte set to something else other
than 0x00) as syscall arguments.

strncpy_from_user and strnlen_user accept user addresses as arguments, and
do not go through the same path as copy_from_user and others, so here we
need to handle the case of tagged user addresses separately.

Untag user pointers passed to these functions.

Note, that this patch only temporarily untags the pointers to perform
validity checks, but then uses them as is to perform user memory accesses.

Signed-off-by: Andrey Konovalov <[email protected]>
---
lib/strncpy_from_user.c | 3 ++-
lib/strnlen_user.c | 3 ++-
2 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/lib/strncpy_from_user.c b/lib/strncpy_from_user.c
index 58eacd41526c..6209bb9507c7 100644
--- a/lib/strncpy_from_user.c
+++ b/lib/strncpy_from_user.c
@@ -6,6 +6,7 @@
#include <linux/uaccess.h>
#include <linux/kernel.h>
#include <linux/errno.h>
+#include <linux/mm.h>

#include <asm/byteorder.h>
#include <asm/word-at-a-time.h>
@@ -107,7 +108,7 @@ long strncpy_from_user(char *dst, const char __user *src, long count)
return 0;

max_addr = user_addr_max();
- src_addr = (unsigned long)src;
+ src_addr = (unsigned long)untagged_addr(src);
if (likely(src_addr < max_addr)) {
unsigned long max = max_addr - src_addr;
long retval;
diff --git a/lib/strnlen_user.c b/lib/strnlen_user.c
index 1c1a1b0e38a5..8ca3d2ac32ec 100644
--- a/lib/strnlen_user.c
+++ b/lib/strnlen_user.c
@@ -2,6 +2,7 @@
#include <linux/kernel.h>
#include <linux/export.h>
#include <linux/uaccess.h>
+#include <linux/mm.h>

#include <asm/word-at-a-time.h>

@@ -109,7 +110,7 @@ long strnlen_user(const char __user *str, long count)
return 0;

max_addr = user_addr_max();
- src_addr = (unsigned long)str;
+ src_addr = (unsigned long)untagged_addr(str);
if (likely(src_addr < max_addr)) {
unsigned long max = max_addr - src_addr;
long retval;
--
2.21.0.1020.gf2820cf01a-goog

2019-05-06 16:35:20

by Andrey Konovalov

[permalink] [raw]
Subject: [PATCH v15 02/17] arm64: untag user pointers in access_ok and __uaccess_mask_ptr

This patch is a part of a series that extends arm64 kernel ABI to allow to
pass tagged user pointers (with the top byte set to something else other
than 0x00) as syscall arguments.

copy_from_user (and a few other similar functions) are used to copy data
from user memory into the kernel memory or vice versa. Since a user can
provided a tagged pointer to one of the syscalls that use copy_from_user,
we need to correctly handle such pointers.

Do this by untagging user pointers in access_ok and in __uaccess_mask_ptr,
before performing access validity checks.

Note, that this patch only temporarily untags the pointers to perform the
checks, but then passes them as is into the kernel internals.

Reviewed-by: Catalin Marinas <[email protected]>
Signed-off-by: Andrey Konovalov <[email protected]>
---
arch/arm64/include/asm/uaccess.h | 10 ++++++----
1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/arch/arm64/include/asm/uaccess.h b/arch/arm64/include/asm/uaccess.h
index e5d5f31c6d36..9164ecb5feca 100644
--- a/arch/arm64/include/asm/uaccess.h
+++ b/arch/arm64/include/asm/uaccess.h
@@ -94,7 +94,7 @@ static inline unsigned long __range_ok(const void __user *addr, unsigned long si
return ret;
}

-#define access_ok(addr, size) __range_ok(addr, size)
+#define access_ok(addr, size) __range_ok(untagged_addr(addr), size)
#define user_addr_max get_fs

#define _ASM_EXTABLE(from, to) \
@@ -226,7 +226,8 @@ static inline void uaccess_enable_not_uao(void)

/*
* Sanitise a uaccess pointer such that it becomes NULL if above the
- * current addr_limit.
+ * current addr_limit. In case the pointer is tagged (has the top byte set),
+ * untag the pointer before checking.
*/
#define uaccess_mask_ptr(ptr) (__typeof__(ptr))__uaccess_mask_ptr(ptr)
static inline void __user *__uaccess_mask_ptr(const void __user *ptr)
@@ -234,10 +235,11 @@ static inline void __user *__uaccess_mask_ptr(const void __user *ptr)
void __user *safe_ptr;

asm volatile(
- " bics xzr, %1, %2\n"
+ " bics xzr, %3, %2\n"
" csel %0, %1, xzr, eq\n"
: "=&r" (safe_ptr)
- : "r" (ptr), "r" (current_thread_info()->addr_limit)
+ : "r" (ptr), "r" (current_thread_info()->addr_limit),
+ "r" (untagged_addr(ptr))
: "cc");

csdb();
--
2.21.0.1020.gf2820cf01a-goog

2019-05-06 19:51:30

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH v15 13/17] IB, arm64: untag user pointers in ib_uverbs_(re)reg_mr()

On Mon, May 06, 2019 at 06:30:59PM +0200, Andrey Konovalov wrote:
> This patch is a part of a series that extends arm64 kernel ABI to allow to
> pass tagged user pointers (with the top byte set to something else other
> than 0x00) as syscall arguments.
>
> ib_uverbs_(re)reg_mr() use provided user pointers for vma lookups (through
> e.g. mlx4_get_umem_mr()), which can only by done with untagged pointers.
>
> Untag user pointers in these functions.
>
> Signed-off-by: Andrey Konovalov <[email protected]>
> ---
> drivers/infiniband/core/uverbs_cmd.c | 4 ++++
> 1 file changed, 4 insertions(+)

I think this is OK.. We should really get it tested though.. Leon?

Jason

2019-05-07 06:34:58

by Leon Romanovsky

[permalink] [raw]
Subject: Re: [PATCH v15 13/17] IB, arm64: untag user pointers in ib_uverbs_(re)reg_mr()

On Mon, May 06, 2019 at 04:50:20PM -0300, Jason Gunthorpe wrote:
> On Mon, May 06, 2019 at 06:30:59PM +0200, Andrey Konovalov wrote:
> > This patch is a part of a series that extends arm64 kernel ABI to allow to
> > pass tagged user pointers (with the top byte set to something else other
> > than 0x00) as syscall arguments.
> >
> > ib_uverbs_(re)reg_mr() use provided user pointers for vma lookups (through
> > e.g. mlx4_get_umem_mr()), which can only by done with untagged pointers.
> >
> > Untag user pointers in these functions.
> >
> > Signed-off-by: Andrey Konovalov <[email protected]>
> > ---
> > drivers/infiniband/core/uverbs_cmd.c | 4 ++++
> > 1 file changed, 4 insertions(+)
>
> I think this is OK.. We should really get it tested though.. Leon?

It can be done after v5.2-rc1.

Thanks

>
> Jason

2019-05-07 16:45:03

by Felix Kuehling

[permalink] [raw]
Subject: Re: [PATCH v15 11/17] drm/amdgpu, arm64: untag user pointers

On 2019-05-06 12:30 p.m., Andrey Konovalov wrote:
> [CAUTION: External Email]
>
> This patch is a part of a series that extends arm64 kernel ABI to allow to
> pass tagged user pointers (with the top byte set to something else other
> than 0x00) as syscall arguments.
>
> In amdgpu_gem_userptr_ioctl() and amdgpu_amdkfd_gpuvm.c/init_user_pages()
> an MMU notifier is set up with a (tagged) userspace pointer. The untagged
> address should be used so that MMU notifiers for the untagged address get
> correctly matched up with the right BO. This patch untag user pointers in
> amdgpu_gem_userptr_ioctl() for the GEM case and in amdgpu_amdkfd_gpuvm_
> alloc_memory_of_gpu() for the KFD case. This also makes sure that an
> untagged pointer is passed to amdgpu_ttm_tt_get_user_pages(), which uses
> it for vma lookups.
>
> Suggested-by: Kuehling, Felix <[email protected]>
> Signed-off-by: Andrey Konovalov <[email protected]>

Acked-by: Felix Kuehling <[email protected]>


> ---
> drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c | 2 +-
> drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c | 2 ++
> 2 files changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
> index 1921dec3df7a..20cac44ed449 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
> @@ -1121,7 +1121,7 @@ int amdgpu_amdkfd_gpuvm_alloc_memory_of_gpu(
> alloc_flags = 0;
> if (!offset || !*offset)
> return -EINVAL;
> - user_addr = *offset;
> + user_addr = untagged_addr(*offset);
> } else if (flags & ALLOC_MEM_FLAGS_DOORBELL) {
> domain = AMDGPU_GEM_DOMAIN_GTT;
> alloc_domain = AMDGPU_GEM_DOMAIN_CPU;
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
> index d21dd2f369da..985cb82b2aa6 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
> @@ -286,6 +286,8 @@ int amdgpu_gem_userptr_ioctl(struct drm_device *dev, void *data,
> uint32_t handle;
> int r;
>
> + args->addr = untagged_addr(args->addr);
> +
> if (offset_in_page(args->addr | args->size))
> return -EINVAL;
>
> --
> 2.21.0.1020.gf2820cf01a-goog
>

2019-05-07 16:46:44

by Felix Kuehling

[permalink] [raw]
Subject: Re: [PATCH v15 12/17] drm/radeon, arm64: untag user pointers in radeon_gem_userptr_ioctl

On 2019-05-06 12:30 p.m., Andrey Konovalov wrote:
> [CAUTION: External Email]
>
> This patch is a part of a series that extends arm64 kernel ABI to allow to
> pass tagged user pointers (with the top byte set to something else other
> than 0x00) as syscall arguments.
>
> In radeon_gem_userptr_ioctl() an MMU notifier is set up with a (tagged)
> userspace pointer. The untagged address should be used so that MMU
> notifiers for the untagged address get correctly matched up with the right
> BO. This funcation also calls radeon_ttm_tt_pin_userptr(), which uses
> provided user pointers for vma lookups, which can only by done with
> untagged pointers.
>
> This patch untags user pointers in radeon_gem_userptr_ioctl().
>
> Signed-off-by: Andrey Konovalov <[email protected]>
Acked-by: Felix Kuehling <[email protected]>


> ---
> drivers/gpu/drm/radeon/radeon_gem.c | 2 ++
> 1 file changed, 2 insertions(+)
>
> diff --git a/drivers/gpu/drm/radeon/radeon_gem.c b/drivers/gpu/drm/radeon/radeon_gem.c
> index 44617dec8183..90eb78fb5eb2 100644
> --- a/drivers/gpu/drm/radeon/radeon_gem.c
> +++ b/drivers/gpu/drm/radeon/radeon_gem.c
> @@ -291,6 +291,8 @@ int radeon_gem_userptr_ioctl(struct drm_device *dev, void *data,
> uint32_t handle;
> int r;
>
> + args->addr = untagged_addr(args->addr);
> +
> if (offset_in_page(args->addr | args->size))
> return -EINVAL;
>
> --
> 2.21.0.1020.gf2820cf01a-goog
>

2019-05-17 14:51:07

by Catalin Marinas

[permalink] [raw]
Subject: Re: [PATCH v15 00/17] arm64: untag user pointers passed to the kernel

Hi Andrey,

On Mon, May 06, 2019 at 06:30:46PM +0200, Andrey Konovalov wrote:
> One of the alternative approaches to untagging that was considered is to
> completely strip the pointer tag as the pointer enters the kernel with
> some kind of a syscall wrapper, but that won't work with the countless
> number of different ioctl calls. With this approach we would need a custom
> wrapper for each ioctl variation, which doesn't seem practical.

The more I look at this problem, the less convinced I am that we can
solve it in a way that results in a stable ABI covering ioctls(). While
for the Android kernel codebase it could be simpler as you don't upgrade
the kernel version every 2.5 months, for the mainline kernel this
doesn't scale. Any run-time checks are relatively limited in terms of
drivers covered. Better static checking would be nice as a long term
solution but we didn't get anywhere with the discussion last year.

IMO (RFC for now), I see two ways forward:

1. Make this a user space problem and do not allow tagged pointers into
the syscall ABI. A libc wrapper would have to convert structures,
parameters before passing them into the kernel. Note that we can
still support the hardware MTE in the kernel by enabling tagged
memory ranges, saving/restoring tags etc. but not allowing tagged
addresses at the syscall boundary.

2. Similar shim to the above libc wrapper but inside the kernel
(arch/arm64 only; most pointer arguments could be covered with an
__SC_CAST similar to the s390 one). There are two differences from
what we've discussed in the past:

a) this is an opt-in by the user which would have to explicitly call
prctl(). If it returns -ENOTSUPP etc., the user won't be allowed
to pass tagged pointers to the kernel. This would probably be the
responsibility of the C lib to make sure it doesn't tag heap
allocations. If the user did not opt-in, the syscalls are routed
through the normal path (no untagging address shim).

b) ioctl() and other blacklisted syscalls (prctl) will not accept
tagged pointers (to be documented in Vicenzo's ABI patches).

It doesn't solve the problems we are trying to address but 2.a saves us
from blindly relaxing the ABI without knowing how to easily assess new
code being merged (over 500K lines between kernel versions). Existing
applications (who don't opt-in) won't inadvertently start using the new
ABI which could risk becoming de-facto ABI that we need to support on
the long run.

Option 1 wouldn't solve the ioctl() problem either and while it makes
things simpler for the kernel, I am aware that it's slightly more
complicated in user space (but I really don't mind if you prefer option
1 ;)).

The tagged pointers (whether hwasan or MTE) should ideally be a
transparent feature for the application writer but I don't think we can
solve it entirely and make it seamless for the multitude of ioctls().
I'd say you only opt in to such feature if you know what you are doing
and the user code takes care of specific cases like ioctl(), hence the
prctl() proposal even for the hwasan.

Comments welcomed.

--
Catalin

2019-05-20 23:55:19

by Evgenii Stepanov

[permalink] [raw]
Subject: Re: [PATCH v15 00/17] arm64: untag user pointers passed to the kernel

On Fri, May 17, 2019 at 7:49 AM Catalin Marinas <[email protected]> wrote:
>
> Hi Andrey,
>
> On Mon, May 06, 2019 at 06:30:46PM +0200, Andrey Konovalov wrote:
> > One of the alternative approaches to untagging that was considered is to
> > completely strip the pointer tag as the pointer enters the kernel with
> > some kind of a syscall wrapper, but that won't work with the countless
> > number of different ioctl calls. With this approach we would need a custom
> > wrapper for each ioctl variation, which doesn't seem practical.
>
> The more I look at this problem, the less convinced I am that we can
> solve it in a way that results in a stable ABI covering ioctls(). While
> for the Android kernel codebase it could be simpler as you don't upgrade
> the kernel version every 2.5 months, for the mainline kernel this
> doesn't scale. Any run-time checks are relatively limited in terms of
> drivers covered. Better static checking would be nice as a long term
> solution but we didn't get anywhere with the discussion last year.
>
> IMO (RFC for now), I see two ways forward:
>
> 1. Make this a user space problem and do not allow tagged pointers into
> the syscall ABI. A libc wrapper would have to convert structures,
> parameters before passing them into the kernel. Note that we can
> still support the hardware MTE in the kernel by enabling tagged
> memory ranges, saving/restoring tags etc. but not allowing tagged
> addresses at the syscall boundary.
>
> 2. Similar shim to the above libc wrapper but inside the kernel
> (arch/arm64 only; most pointer arguments could be covered with an
> __SC_CAST similar to the s390 one). There are two differences from
> what we've discussed in the past:
>
> a) this is an opt-in by the user which would have to explicitly call
> prctl(). If it returns -ENOTSUPP etc., the user won't be allowed
> to pass tagged pointers to the kernel. This would probably be the
> responsibility of the C lib to make sure it doesn't tag heap
> allocations. If the user did not opt-in, the syscalls are routed
> through the normal path (no untagging address shim).
>
> b) ioctl() and other blacklisted syscalls (prctl) will not accept
> tagged pointers (to be documented in Vicenzo's ABI patches).
>
> It doesn't solve the problems we are trying to address but 2.a saves us
> from blindly relaxing the ABI without knowing how to easily assess new
> code being merged (over 500K lines between kernel versions). Existing
> applications (who don't opt-in) won't inadvertently start using the new
> ABI which could risk becoming de-facto ABI that we need to support on
> the long run.
>
> Option 1 wouldn't solve the ioctl() problem either and while it makes
> things simpler for the kernel, I am aware that it's slightly more
> complicated in user space (but I really don't mind if you prefer option
> 1 ;)).
>
> The tagged pointers (whether hwasan or MTE) should ideally be a
> transparent feature for the application writer but I don't think we can
> solve it entirely and make it seamless for the multitude of ioctls().
> I'd say you only opt in to such feature if you know what you are doing
> and the user code takes care of specific cases like ioctl(), hence the
> prctl() proposal even for the hwasan.
>
> Comments welcomed.

Any userspace shim approach is problematic for Android because of the
apps that use raw system calls. AFAIK, all apps written in Go are in
that camp - I'm not sure how common they are, but getting them all
recompiled is probably not realistic.

The way I see it, a patch that breaks handling of tagged pointers is
not that different from, say, a patch that adds a wild pointer
dereference. Both are bugs; the difference is that (a) the former
breaks a relatively uncommon target and (b) it's arguably an easier
mistake to make. If MTE adoption goes well, (a) will not be the case
for long.

This is a bit of a chicken-and-egg problem. In a world where memory
allocators on one or several popular platforms generate pointers with
non-zero tags, any such breakage will be caught in testing.
Unfortunately to reach that state we need the kernel to start
accepting tagged pointers first, and then hold on for a couple of
years until userspace catches up.

Perhaps we can start by whitelisting ioctls by driver?

2019-05-21 18:32:49

by Catalin Marinas

[permalink] [raw]
Subject: Re: [PATCH v15 00/17] arm64: untag user pointers passed to the kernel

On Mon, May 20, 2019 at 04:53:07PM -0700, Evgenii Stepanov wrote:
> On Fri, May 17, 2019 at 7:49 AM Catalin Marinas <[email protected]> wrote:
> > IMO (RFC for now), I see two ways forward:
> >
> > 1. Make this a user space problem and do not allow tagged pointers into
> > the syscall ABI. A libc wrapper would have to convert structures,
> > parameters before passing them into the kernel. Note that we can
> > still support the hardware MTE in the kernel by enabling tagged
> > memory ranges, saving/restoring tags etc. but not allowing tagged
> > addresses at the syscall boundary.
> >
> > 2. Similar shim to the above libc wrapper but inside the kernel
> > (arch/arm64 only; most pointer arguments could be covered with an
> > __SC_CAST similar to the s390 one). There are two differences from
> > what we've discussed in the past:
> >
> > a) this is an opt-in by the user which would have to explicitly call
> > prctl(). If it returns -ENOTSUPP etc., the user won't be allowed
> > to pass tagged pointers to the kernel. This would probably be the
> > responsibility of the C lib to make sure it doesn't tag heap
> > allocations. If the user did not opt-in, the syscalls are routed
> > through the normal path (no untagging address shim).
> >
> > b) ioctl() and other blacklisted syscalls (prctl) will not accept
> > tagged pointers (to be documented in Vicenzo's ABI patches).
[...]
> Any userspace shim approach is problematic for Android because of the
> apps that use raw system calls. AFAIK, all apps written in Go are in
> that camp - I'm not sure how common they are, but getting them all
> recompiled is probably not realistic.

That's a fair point (I wasn't expecting it would get much traction
anyway ;)). OTOH, it allows upstreaming of the MTE patches while we
continue the discussions around TBI.

> The way I see it, a patch that breaks handling of tagged pointers is
> not that different from, say, a patch that adds a wild pointer
> dereference. Both are bugs; the difference is that (a) the former
> breaks a relatively uncommon target and (b) it's arguably an easier
> mistake to make. If MTE adoption goes well, (a) will not be the case
> for long.

It's also the fact such patch would go unnoticed for a long time until
someone exercises that code path. And when they do, the user would be
pretty much in the dark trying to figure what what went wrong, why a
SIGSEGV or -EFAULT happened. What's worse, we can't even say we fixed
all the places where it matters in the current kernel codebase (ignoring
future patches).

I think we should revisit the static checking discussions we had last
year. Run-time checking (even with compiler instrumentation and
syzkaller fuzzing) would only cover the code paths specific to a Linux
or Android installation.

> This is a bit of a chicken-and-egg problem. In a world where memory
> allocators on one or several popular platforms generate pointers with
> non-zero tags, any such breakage will be caught in testing.
> Unfortunately to reach that state we need the kernel to start
> accepting tagged pointers first, and then hold on for a couple of
> years until userspace catches up.

Would the kernel also catch up with providing a stable ABI? Because we
have two moving targets.

On one hand, you have Android or some Linux distro that stick to a
stable kernel version for some time, so they have better chance of
clearing most of the problems. On the other hand, we have mainline
kernel that gets over 500K lines every release. As maintainer, I can't
rely on my testing alone as this is on a limited number of platforms. So
my concern is that every kernel release has a significant chance of
breaking the ABI, unless we have a better way of identifying potential
issues.

> Perhaps we can start by whitelisting ioctls by driver?

This was also raised by Ruben in private but without a (static) tool to
to check, manually going through all the drivers doesn't scale. It's
very likely that most drivers don't care, just a get_user/put_user is
already handled by these patches. Searching for find_vma() was
identifying one such use-case but is this sufficient? Are there other
cases we need to explicitly untag a pointer?


The other point I'd like feedback on is 2.a above. I see _some_ value
into having the user opt-in to this relaxed ABI rather than blinding
exposing it to all applications. Dave suggested (in private) a new
personality (e.g. PER_LINUX_TBI) inherited by children. It would be the
responsibility of the C library to check the current personality bits
and only tag pointers on allocation *if* the kernel allowed it. The
kernel could provide the AT_FLAGS bit as in Vincenzo's patches if the
personality was set but can't set it retrospectively if the user called
sys_personality. By default, /sbin/init would not have this personality
and libc would not tag pointers, so we can guarantee that your distro
boots normally with a new kernel version. We could have an envp that
gets caught by /sbin/init so you can pass it on the kernel command line
(or a dynamic loader at run-time). But the default should be the current
ABI behaviour.

We can enforce the current behaviour by having access_ok() check the
personality or a TIF flag but we may relax this enforcement at some
point in the future as we learn more about the implications of TBI.

Thanks.

--
Catalin

2019-05-21 18:50:18

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH v15 00/17] arm64: untag user pointers passed to the kernel

On Fri, May 17, 2019 at 03:49:31PM +0100, Catalin Marinas wrote:

> The tagged pointers (whether hwasan or MTE) should ideally be a
> transparent feature for the application writer but I don't think we can
> solve it entirely and make it seamless for the multitude of ioctls().
> I'd say you only opt in to such feature if you know what you are doing
> and the user code takes care of specific cases like ioctl(), hence the
> prctl() proposal even for the hwasan.

I'm not sure such a dire view is warrented..

The ioctl situation is not so bad, other than a few special cases,
most drivers just take a 'void __user *' and pass it as an argument to
some function that accepts a 'void __user *'. sparse et al verify
this.

As long as the core functions do the right thing the drivers will be
OK.

The only place things get dicy is if someone casts to unsigned long
(ie for vma work) but I think that reflects that our driver facing
APIs for VMAs are compatible with static analysis (ie I have no
earthly idea why get_user_pages() accepts an unsigned long), not that
this is too hard.

Jason

2019-05-22 00:07:04

by Kees Cook

[permalink] [raw]
Subject: Re: [PATCH v15 00/17] arm64: untag user pointers passed to the kernel

On Tue, May 21, 2019 at 07:29:33PM +0100, Catalin Marinas wrote:
> On Mon, May 20, 2019 at 04:53:07PM -0700, Evgenii Stepanov wrote:
> > On Fri, May 17, 2019 at 7:49 AM Catalin Marinas <[email protected]> wrote:
> > > IMO (RFC for now), I see two ways forward:
> > > [...]
> > > 2. Similar shim to the above libc wrapper but inside the kernel
> > > (arch/arm64 only; most pointer arguments could be covered with an
> > > __SC_CAST similar to the s390 one). There are two differences from
> > > what we've discussed in the past:
> > >
> > > a) this is an opt-in by the user which would have to explicitly call
> > > prctl(). If it returns -ENOTSUPP etc., the user won't be allowed
> > > to pass tagged pointers to the kernel. This would probably be the
> > > responsibility of the C lib to make sure it doesn't tag heap
> > > allocations. If the user did not opt-in, the syscalls are routed
> > > through the normal path (no untagging address shim).
> > >
> > > b) ioctl() and other blacklisted syscalls (prctl) will not accept
> > > tagged pointers (to be documented in Vicenzo's ABI patches).
> >
> > The way I see it, a patch that breaks handling of tagged pointers is
> > not that different from, say, a patch that adds a wild pointer
> > dereference. Both are bugs; the difference is that (a) the former
> > breaks a relatively uncommon target and (b) it's arguably an easier
> > mistake to make. If MTE adoption goes well, (a) will not be the case
> > for long.
>
> It's also the fact such patch would go unnoticed for a long time until
> someone exercises that code path. And when they do, the user would be
> pretty much in the dark trying to figure what what went wrong, why a
> SIGSEGV or -EFAULT happened. What's worse, we can't even say we fixed
> all the places where it matters in the current kernel codebase (ignoring
> future patches).

So, looking forward a bit, this isn't going to be an ARM-specific issue
for long. In fact, I think we shouldn't have arm-specific syscall wrappers
in this series: I think untagged_addr() should likely be added at the
top-level and have it be a no-op for other architectures. So given this
becoming a kernel-wide multi-architecture issue (under the assumption
that x86, RISC-V, and others will gain similar TBI or MTE things),
we should solve it in a way that we can re-use.

We need something that is going to work everywhere. And it needs to be
supported by the kernel for the simple reason that the kernel needs to
do MTE checks during copy_from_user(): having that information stripped
means we lose any userspace-assigned MTE protections if they get handled
by the kernel, which is a total non-starter, IMO.

As an aside: I think Sparc ADI support in Linux actually side-stepped
this[1] (i.e. chose "solution 1"): "All addresses passed to kernel must
be non-ADI tagged addresses." (And sadly, "Kernel does not enable ADI
for kernel code.") I think this was a mistake we should not repeat for
arm64 (we do seem to be at least in agreement about this, I think).

[1] https://lore.kernel.org/patchwork/patch/654481/

> > This is a bit of a chicken-and-egg problem. In a world where memory
> > allocators on one or several popular platforms generate pointers with
> > non-zero tags, any such breakage will be caught in testing.
> > Unfortunately to reach that state we need the kernel to start
> > accepting tagged pointers first, and then hold on for a couple of
> > years until userspace catches up.
>
> Would the kernel also catch up with providing a stable ABI? Because we
> have two moving targets.
>
> On one hand, you have Android or some Linux distro that stick to a
> stable kernel version for some time, so they have better chance of
> clearing most of the problems. On the other hand, we have mainline
> kernel that gets over 500K lines every release. As maintainer, I can't
> rely on my testing alone as this is on a limited number of platforms. So
> my concern is that every kernel release has a significant chance of
> breaking the ABI, unless we have a better way of identifying potential
> issues.

I just want to make sure I fully understand your concern about this
being an ABI break, and I work best with examples. The closest situation
I can see would be:

- some program has no idea about MTE
- malloc() starts returning MTE-tagged addresses
- program doesn't break from that change
- program uses some syscall that is missing untagged_addr() and fails
- kernel has now broken userspace that used to work

The trouble I see with this is that it is largely theoretical and
requires part of userspace to collude to start using a new CPU feature
that tickles a bug in the kernel. As I understand the golden rule,
this is a bug in the kernel (a missed ioctl() or such) to be fixed,
not a global breaking of some userspace behavior.

I feel like I'm missing something about this being seen as an ABI
break. The kernel already fails on userspace addresses that have high
bits set -- are there things that _depend_ on this failure to operate?

--
Kees Cook

2019-05-22 10:13:14

by Catalin Marinas

[permalink] [raw]
Subject: Re: [PATCH v15 00/17] arm64: untag user pointers passed to the kernel

Hi Kees,

Thanks for joining the thread ;).

On Tue, May 21, 2019 at 05:04:39PM -0700, Kees Cook wrote:
> On Tue, May 21, 2019 at 07:29:33PM +0100, Catalin Marinas wrote:
> > On Mon, May 20, 2019 at 04:53:07PM -0700, Evgenii Stepanov wrote:
> > > On Fri, May 17, 2019 at 7:49 AM Catalin Marinas <[email protected]> wrote:
> > > > IMO (RFC for now), I see two ways forward:
> > > > [...]
> > > > 2. Similar shim to the above libc wrapper but inside the kernel
> > > > (arch/arm64 only; most pointer arguments could be covered with an
> > > > __SC_CAST similar to the s390 one). There are two differences from
> > > > what we've discussed in the past:
> > > >
> > > > a) this is an opt-in by the user which would have to explicitly call
> > > > prctl(). If it returns -ENOTSUPP etc., the user won't be allowed
> > > > to pass tagged pointers to the kernel. This would probably be the
> > > > responsibility of the C lib to make sure it doesn't tag heap
> > > > allocations. If the user did not opt-in, the syscalls are routed
> > > > through the normal path (no untagging address shim).
> > > >
> > > > b) ioctl() and other blacklisted syscalls (prctl) will not accept
> > > > tagged pointers (to be documented in Vicenzo's ABI patches).
> > >
> > > The way I see it, a patch that breaks handling of tagged pointers is
> > > not that different from, say, a patch that adds a wild pointer
> > > dereference. Both are bugs; the difference is that (a) the former
> > > breaks a relatively uncommon target and (b) it's arguably an easier
> > > mistake to make. If MTE adoption goes well, (a) will not be the case
> > > for long.
> >
> > It's also the fact such patch would go unnoticed for a long time until
> > someone exercises that code path. And when they do, the user would be
> > pretty much in the dark trying to figure what what went wrong, why a
> > SIGSEGV or -EFAULT happened. What's worse, we can't even say we fixed
> > all the places where it matters in the current kernel codebase (ignoring
> > future patches).
>
> So, looking forward a bit, this isn't going to be an ARM-specific issue
> for long.

I do hope so.

> In fact, I think we shouldn't have arm-specific syscall wrappers
> in this series: I think untagged_addr() should likely be added at the
> top-level and have it be a no-op for other architectures.

That's what the current patchset does, so we have this as a starting
point. Kostya raised another potential issue with the syscall wrappers:
with MTE the kernel will be forced to enable the match-all (wildcard)
pointers for user space accesses since copy_from_user() would only get a
0 tag. So it has wider implications than just uaccess routines not
checking the colour.

> So given this becoming a kernel-wide multi-architecture issue (under
> the assumption that x86, RISC-V, and others will gain similar TBI or
> MTE things), we should solve it in a way that we can re-use.

Can we do any better to aid the untagged_addr() placement (e.g. better
type annotations, better static analysis)? We have to distinguish
between user pointers that may be dereferenced by the kernel (I think
almost fully covered with this patchset) and user addresses represented
as ulong that may:

a) be converted to a user pointer and dereferenced; I think that's the
case for many overloaded ulong/u64 arguments

b) used for address space management, rbtree look-ups etc. where the tag
is no longer relevant and it even gets in the way

We tried last year to identify void __user * casts to unsigned long
using sparse on the assumption that pointers can be tagged while ulong
is about address space management and needs to lose such tag. I think we
could have pushed this further. For example, get_user_pages() takes an
unsigned long but it is perfectly capable of untagging the address
itself. Shall we change its first argument to void __user * (together
with all its callers)?

find_vma(), OTOH, could untag the address but it doesn't help since
vm_start/end don't have such information (that's more about the content
or type that the user decided) and the callers check against it.

Are there any other places where this matters? These patches tracked
down find_vma() as some heuristics but we may need better static
analysis to identify other cases.

> We need something that is going to work everywhere. And it needs to be
> supported by the kernel for the simple reason that the kernel needs to
> do MTE checks during copy_from_user(): having that information stripped
> means we lose any userspace-assigned MTE protections if they get handled
> by the kernel, which is a total non-starter, IMO.

Such feedback is welcomed ;).

> As an aside: I think Sparc ADI support in Linux actually side-stepped
> this[1] (i.e. chose "solution 1"): "All addresses passed to kernel must
> be non-ADI tagged addresses." (And sadly, "Kernel does not enable ADI
> for kernel code.") I think this was a mistake we should not repeat for
> arm64 (we do seem to be at least in agreement about this, I think).
>
> [1] https://lore.kernel.org/patchwork/patch/654481/

I tried to drag the SPARC guys into this discussion but without much
success.

> > > This is a bit of a chicken-and-egg problem. In a world where memory
> > > allocators on one or several popular platforms generate pointers with
> > > non-zero tags, any such breakage will be caught in testing.
> > > Unfortunately to reach that state we need the kernel to start
> > > accepting tagged pointers first, and then hold on for a couple of
> > > years until userspace catches up.
> >
> > Would the kernel also catch up with providing a stable ABI? Because we
> > have two moving targets.
> >
> > On one hand, you have Android or some Linux distro that stick to a
> > stable kernel version for some time, so they have better chance of
> > clearing most of the problems. On the other hand, we have mainline
> > kernel that gets over 500K lines every release. As maintainer, I can't
> > rely on my testing alone as this is on a limited number of platforms. So
> > my concern is that every kernel release has a significant chance of
> > breaking the ABI, unless we have a better way of identifying potential
> > issues.
>
> I just want to make sure I fully understand your concern about this
> being an ABI break, and I work best with examples. The closest situation
> I can see would be:
>
> - some program has no idea about MTE

Apart from some libraries like libc (and maybe those that handle
specific device ioctls), I think most programs should have no idea about
MTE. I wouldn't expect programmers to have to change their app just
because we have a new feature that colours heap allocations.

> - malloc() starts returning MTE-tagged addresses
> - program doesn't break from that change
> - program uses some syscall that is missing untagged_addr() and fails
> - kernel has now broken userspace that used to work

That's one aspect though probably more of a case of plugging in a new
device (graphics card, network etc.) and the ioctl to the new device
doesn't work.

The other is that, assuming we reach a point where the kernel entirely
supports this relaxed ABI, can we guarantee that it won't break in the
future. Let's say some subsequent kernel change (some refactoring)
misses out an untagged_addr(). This renders a previously TBI/MTE-capable
syscall unusable. Can we rely only on testing?

> The trouble I see with this is that it is largely theoretical and
> requires part of userspace to collude to start using a new CPU feature
> that tickles a bug in the kernel. As I understand the golden rule,
> this is a bug in the kernel (a missed ioctl() or such) to be fixed,
> not a global breaking of some userspace behavior.

Yes, we should follow the rule that it's a kernel bug but it doesn't
help the user that a newly installed kernel causes user space to no
longer reach a prompt. Hence the proposal of an opt-in via personality
(for MTE we would need an explicit opt-in by the user anyway since the
top byte is no longer ignored but checked against the allocation tag).

> I feel like I'm missing something about this being seen as an ABI
> break. The kernel already fails on userspace addresses that have high
> bits set -- are there things that _depend_ on this failure to operate?

It's about providing a relaxed ABI which allows non-zero top byte and
breaking it later inadvertently without having something better in place
to analyse the kernel changes.

Thanks.

--
Catalin

2019-05-22 10:42:32

by Catalin Marinas

[permalink] [raw]
Subject: Re: [PATCH v15 03/17] lib, arm64: untag user pointers in strn*_user

On Mon, May 06, 2019 at 06:30:49PM +0200, Andrey Konovalov wrote:
> This patch is a part of a series that extends arm64 kernel ABI to allow to
> pass tagged user pointers (with the top byte set to something else other
> than 0x00) as syscall arguments.
>
> strncpy_from_user and strnlen_user accept user addresses as arguments, and
> do not go through the same path as copy_from_user and others, so here we
> need to handle the case of tagged user addresses separately.
>
> Untag user pointers passed to these functions.
>
> Note, that this patch only temporarily untags the pointers to perform
> validity checks, but then uses them as is to perform user memory accesses.
>
> Signed-off-by: Andrey Konovalov <[email protected]>

Just to keep track of where I am with the reviews while the ABI
discussion continues:

Reviewed-by: Catalin Marinas <[email protected]>

2019-05-22 10:58:46

by Catalin Marinas

[permalink] [raw]
Subject: Re: [PATCH v15 04/17] mm: add ksys_ wrappers to memory syscalls

On Mon, May 06, 2019 at 06:30:50PM +0200, Andrey Konovalov wrote:
> This patch is a part of a series that extends arm64 kernel ABI to allow to
> pass tagged user pointers (with the top byte set to something else other
> than 0x00) as syscall arguments.
>
> This patch adds ksys_ wrappers to the following memory syscalls:
>
> brk, get_mempolicy (renamed kernel_get_mempolicy -> ksys_get_mempolicy),
> madvise, mbind (renamed kernel_mbind -> ksys_mbind), mincore,
> mlock (renamed do_mlock -> ksys_mlock), mlock2, mmap_pgoff,
> mprotect (renamed do_mprotect_pkey -> ksys_mprotect_pkey), mremap, msync,
> munlock, munmap, remap_file_pages, shmat, shmdt.
>
> The next patch in this series will add a custom implementation for these
> syscalls that makes them accept tagged pointers on arm64.
>
> Signed-off-by: Andrey Konovalov <[email protected]>

Reviewed-by: Catalin Marinas <[email protected]>

2019-05-22 11:50:43

by Catalin Marinas

[permalink] [raw]
Subject: Re: [PATCH v15 05/17] arms64: untag user pointers passed to memory syscalls

On Mon, May 06, 2019 at 06:30:51PM +0200, Andrey Konovalov wrote:
> This patch is a part of a series that extends arm64 kernel ABI to allow to
> pass tagged user pointers (with the top byte set to something else other
> than 0x00) as syscall arguments.
>
> This patch allows tagged pointers to be passed to the following memory
> syscalls: brk, get_mempolicy, madvise, mbind, mincore, mlock, mlock2,
> mmap, mmap_pgoff, mprotect, mremap, msync, munlock, munmap,
> remap_file_pages, shmat and shmdt.
>
> This is done by untagging pointers passed to these syscalls in the
> prologues of their handlers.

I'll go through them one by one to see if we can tighten the expected
ABI while having the MTE in mind.

> diff --git a/arch/arm64/kernel/sys.c b/arch/arm64/kernel/sys.c
> index b44065fb1616..933bb9f3d6ec 100644
> --- a/arch/arm64/kernel/sys.c
> +++ b/arch/arm64/kernel/sys.c
> @@ -35,10 +35,33 @@ SYSCALL_DEFINE6(mmap, unsigned long, addr, unsigned long, len,
> {
> if (offset_in_page(off) != 0)
> return -EINVAL;
> -
> + addr = untagged_addr(addr);
> return ksys_mmap_pgoff(addr, len, prot, flags, fd, off >> PAGE_SHIFT);
> }

If user passes a tagged pointer to mmap() and the address is honoured
(or MAP_FIXED is given), what is the expected return pointer? Does it
need to be tagged with the value from the hint?

With MTE, we may want to use this as a request for the default colour of
the mapped pages (still under discussion).

> +SYSCALL_DEFINE6(arm64_mmap_pgoff, unsigned long, addr, unsigned long, len,
> + unsigned long, prot, unsigned long, flags,
> + unsigned long, fd, unsigned long, pgoff)
> +{
> + addr = untagged_addr(addr);
> + return ksys_mmap_pgoff(addr, len, prot, flags, fd, pgoff);
> +}

We don't have __NR_mmap_pgoff on arm64.

> +SYSCALL_DEFINE5(arm64_mremap, unsigned long, addr, unsigned long, old_len,
> + unsigned long, new_len, unsigned long, flags,
> + unsigned long, new_addr)
> +{
> + addr = untagged_addr(addr);
> + new_addr = untagged_addr(new_addr);
> + return ksys_mremap(addr, old_len, new_len, flags, new_addr);
> +}

Similar comment as for mmap(), do we want the tag from new_addr to be
preserved? In addition, should we check that the two tags are identical
or mremap() should become a way to repaint a memory region?

> +SYSCALL_DEFINE2(arm64_munmap, unsigned long, addr, size_t, len)
> +{
> + addr = untagged_addr(addr);
> + return ksys_munmap(addr, len);
> +}

This looks fine.

> +SYSCALL_DEFINE1(arm64_brk, unsigned long, brk)
> +{
> + brk = untagged_addr(brk);
> + return ksys_brk(brk);
> +}

I wonder whether brk() should simply not accept tags, and should not
return them (similar to the prctl(PR_SET_MM) discussion). We could
document this in the ABI requirements.

> +SYSCALL_DEFINE5(arm64_get_mempolicy, int __user *, policy,
> + unsigned long __user *, nmask, unsigned long, maxnode,
> + unsigned long, addr, unsigned long, flags)
> +{
> + addr = untagged_addr(addr);
> + return ksys_get_mempolicy(policy, nmask, maxnode, addr, flags);
> +}
> +
> +SYSCALL_DEFINE3(arm64_madvise, unsigned long, start,
> + size_t, len_in, int, behavior)
> +{
> + start = untagged_addr(start);
> + return ksys_madvise(start, len_in, behavior);
> +}
> +
> +SYSCALL_DEFINE6(arm64_mbind, unsigned long, start, unsigned long, len,
> + unsigned long, mode, const unsigned long __user *, nmask,
> + unsigned long, maxnode, unsigned int, flags)
> +{
> + start = untagged_addr(start);
> + return ksys_mbind(start, len, mode, nmask, maxnode, flags);
> +}
> +
> +SYSCALL_DEFINE2(arm64_mlock, unsigned long, start, size_t, len)
> +{
> + start = untagged_addr(start);
> + return ksys_mlock(start, len, VM_LOCKED);
> +}
> +
> +SYSCALL_DEFINE2(arm64_mlock2, unsigned long, start, size_t, len)
> +{
> + start = untagged_addr(start);
> + return ksys_mlock(start, len, VM_LOCKED);
> +}
> +
> +SYSCALL_DEFINE2(arm64_munlock, unsigned long, start, size_t, len)
> +{
> + start = untagged_addr(start);
> + return ksys_munlock(start, len);
> +}
> +
> +SYSCALL_DEFINE3(arm64_mprotect, unsigned long, start, size_t, len,
> + unsigned long, prot)
> +{
> + start = untagged_addr(start);
> + return ksys_mprotect_pkey(start, len, prot, -1);
> +}
> +
> +SYSCALL_DEFINE3(arm64_msync, unsigned long, start, size_t, len, int, flags)
> +{
> + start = untagged_addr(start);
> + return ksys_msync(start, len, flags);
> +}
> +
> +SYSCALL_DEFINE3(arm64_mincore, unsigned long, start, size_t, len,
> + unsigned char __user *, vec)
> +{
> + start = untagged_addr(start);
> + return ksys_mincore(start, len, vec);
> +}

These look fine.

> +SYSCALL_DEFINE5(arm64_remap_file_pages, unsigned long, start,
> + unsigned long, size, unsigned long, prot,
> + unsigned long, pgoff, unsigned long, flags)
> +{
> + start = untagged_addr(start);
> + return ksys_remap_file_pages(start, size, prot, pgoff, flags);
> +}

While this has been deprecated for some time, I presume user space still
invokes it?

> +SYSCALL_DEFINE3(arm64_shmat, int, shmid, char __user *, shmaddr, int, shmflg)
> +{
> + shmaddr = untagged_addr(shmaddr);
> + return ksys_shmat(shmid, shmaddr, shmflg);
> +}
> +
> +SYSCALL_DEFINE1(arm64_shmdt, char __user *, shmaddr)
> +{
> + shmaddr = untagged_addr(shmaddr);
> + return ksys_shmdt(shmaddr);
> +}

Do we actually want to allow shared tagged memory? Who's going to tag
it? If not, we can document it as not supported.

--
Catalin

2019-05-22 11:53:11

by Catalin Marinas

[permalink] [raw]
Subject: Re: [PATCH v15 06/17] mm: untag user pointers in do_pages_move

On Mon, May 06, 2019 at 06:30:52PM +0200, Andrey Konovalov wrote:
> This patch is a part of a series that extends arm64 kernel ABI to allow to
> pass tagged user pointers (with the top byte set to something else other
> than 0x00) as syscall arguments.
>
> do_pages_move() is used in the implementation of the move_pages syscall.
>
> Untag user pointers in this function.
>
> Signed-off-by: Andrey Konovalov <[email protected]>

Reviewed-by: Catalin Marinas <[email protected]>

2019-05-22 11:58:22

by Catalin Marinas

[permalink] [raw]
Subject: Re: [PATCH v15 07/17] mm, arm64: untag user pointers in mm/gup.c

On Mon, May 06, 2019 at 06:30:53PM +0200, Andrey Konovalov wrote:
> This patch is a part of a series that extends arm64 kernel ABI to allow to
> pass tagged user pointers (with the top byte set to something else other
> than 0x00) as syscall arguments.
>
> mm/gup.c provides a kernel interface that accepts user addresses and
> manipulates user pages directly (for example get_user_pages, that is used
> by the futex syscall). Since a user can provided tagged addresses, we need
> to handle this case.
>
> Add untagging to gup.c functions that use user addresses for vma lookups.
>
> Signed-off-by: Andrey Konovalov <[email protected]>

Reviewed-by: Catalin Marinas <[email protected]>

2019-05-22 12:10:59

by Catalin Marinas

[permalink] [raw]
Subject: Re: [PATCH v15 09/17] fs, arm64: untag user pointers in copy_mount_options

On Mon, May 06, 2019 at 06:30:55PM +0200, Andrey Konovalov wrote:
> This patch is a part of a series that extends arm64 kernel ABI to allow to
> pass tagged user pointers (with the top byte set to something else other
> than 0x00) as syscall arguments.
>
> In copy_mount_options a user address is being subtracted from TASK_SIZE.
> If the address is lower than TASK_SIZE, the size is calculated to not
> allow the exact_copy_from_user() call to cross TASK_SIZE boundary.
> However if the address is tagged, then the size will be calculated
> incorrectly.
>
> Untag the address before subtracting.
>
> Signed-off-by: Andrey Konovalov <[email protected]>

Reviewed-by: Catalin Marinas <[email protected]>

2019-05-22 13:51:43

by Dave Martin

[permalink] [raw]
Subject: Re: [PATCH v15 00/17] arm64: untag user pointers passed to the kernel

On Tue, May 21, 2019 at 03:48:56PM -0300, Jason Gunthorpe wrote:
> On Fri, May 17, 2019 at 03:49:31PM +0100, Catalin Marinas wrote:
>
> > The tagged pointers (whether hwasan or MTE) should ideally be a
> > transparent feature for the application writer but I don't think we can
> > solve it entirely and make it seamless for the multitude of ioctls().
> > I'd say you only opt in to such feature if you know what you are doing
> > and the user code takes care of specific cases like ioctl(), hence the
> > prctl() proposal even for the hwasan.
>
> I'm not sure such a dire view is warrented..
>
> The ioctl situation is not so bad, other than a few special cases,
> most drivers just take a 'void __user *' and pass it as an argument to
> some function that accepts a 'void __user *'. sparse et al verify
> this.
>
> As long as the core functions do the right thing the drivers will be
> OK.
>
> The only place things get dicy is if someone casts to unsigned long
> (ie for vma work) but I think that reflects that our driver facing
> APIs for VMAs are compatible with static analysis (ie I have no
> earthly idea why get_user_pages() accepts an unsigned long), not that
> this is too hard.

If multiple people will care about this, perhaps we should try to
annotate types more explicitly in SYSCALL_DEFINEx() and ABI data
structures.

For example, we could have a couple of mutually exclusive modifiers

T __object *
T __vaddr * (or U __vaddr)

In the first case the pointer points to an object (in the C sense)
that the call may dereference but not use for any other purpose.

In the latter case the pointer (or other type) is a virtual address
that the call does not dereference but my do other things with.

Also

U __really(T)

to tell static analysers the real type of pointers smuggled through
UAPI disguised as other types (*cough* KVM, etc.)

We could gradually make sparse more strict about the presence of
annotations and allowed conversions, add get/put_user() variants
that demand explicit annotation, etc.

find_vma() wouldn't work with a __object pointer, for example. A
get_user_pages_for_dereference() might be needed for __object pointers
(embodying a promise from the caller that only the object will be
dereferenced within the mapped pages).

Thoughts?

This kind of thing would need widespread buy-in in order to be viable.

Cheers
---Dave

2019-05-22 14:17:55

by Catalin Marinas

[permalink] [raw]
Subject: Re: [PATCH v15 17/17] selftests, arm64: add a selftest for passing tagged pointers to kernel

On Mon, May 06, 2019 at 06:31:03PM +0200, Andrey Konovalov wrote:
> This patch is a part of a series that extends arm64 kernel ABI to allow to
> pass tagged user pointers (with the top byte set to something else other
> than 0x00) as syscall arguments.
>
> This patch adds a simple test, that calls the uname syscall with a
> tagged user pointer as an argument. Without the kernel accepting tagged
> user pointers the test fails with EFAULT.

That's probably sufficient for a simple example. Something we could add
to Documentation maybe is a small library that can be LD_PRELOAD'ed so
that you can run a lot more tests like LTP.

We could add this to selftests but I think it's too glibc specific.

--------------------8<------------------------------------
#include <stdlib.h>

#define TAG_SHIFT (56)
#define TAG_MASK (0xffUL << TAG_SHIFT)

void *__libc_malloc(size_t size);
void __libc_free(void *ptr);
void *__libc_realloc(void *ptr, size_t size);
void *__libc_calloc(size_t nmemb, size_t size);

static void *tag_ptr(void *ptr)
{
unsigned long tag = rand() & 0xff;
if (!ptr)
return ptr;
return (void *)((unsigned long)ptr | (tag << TAG_SHIFT));
}

static void *untag_ptr(void *ptr)
{
return (void *)((unsigned long)ptr & ~TAG_MASK);
}

void *malloc(size_t size)
{
return tag_ptr(__libc_malloc(size));
}

void free(void *ptr)
{
__libc_free(untag_ptr(ptr));
}

void *realloc(void *ptr, size_t size)
{
return tag_ptr(__libc_realloc(untag_ptr(ptr), size));
}

void *calloc(size_t nmemb, size_t size)
{
return tag_ptr(__libc_calloc(nmemb, size));
}

2019-05-22 15:56:26

by enh

[permalink] [raw]
Subject: Re: [PATCH v15 00/17] arm64: untag user pointers passed to the kernel

On Wed, May 22, 2019 at 3:11 AM Catalin Marinas <[email protected]> wrote:
>
> Hi Kees,
>
> Thanks for joining the thread ;).
>
> On Tue, May 21, 2019 at 05:04:39PM -0700, Kees Cook wrote:
> > On Tue, May 21, 2019 at 07:29:33PM +0100, Catalin Marinas wrote:
> > > On Mon, May 20, 2019 at 04:53:07PM -0700, Evgenii Stepanov wrote:
> > > > On Fri, May 17, 2019 at 7:49 AM Catalin Marinas <[email protected]> wrote:
> > > > > IMO (RFC for now), I see two ways forward:
> > > > > [...]
> > > > > 2. Similar shim to the above libc wrapper but inside the kernel
> > > > > (arch/arm64 only; most pointer arguments could be covered with an
> > > > > __SC_CAST similar to the s390 one). There are two differences from
> > > > > what we've discussed in the past:
> > > > >
> > > > > a) this is an opt-in by the user which would have to explicitly call
> > > > > prctl(). If it returns -ENOTSUPP etc., the user won't be allowed
> > > > > to pass tagged pointers to the kernel. This would probably be the
> > > > > responsibility of the C lib to make sure it doesn't tag heap
> > > > > allocations. If the user did not opt-in, the syscalls are routed
> > > > > through the normal path (no untagging address shim).
> > > > >
> > > > > b) ioctl() and other blacklisted syscalls (prctl) will not accept
> > > > > tagged pointers (to be documented in Vicenzo's ABI patches).
> > > >
> > > > The way I see it, a patch that breaks handling of tagged pointers is
> > > > not that different from, say, a patch that adds a wild pointer
> > > > dereference. Both are bugs; the difference is that (a) the former
> > > > breaks a relatively uncommon target and (b) it's arguably an easier
> > > > mistake to make. If MTE adoption goes well, (a) will not be the case
> > > > for long.
> > >
> > > It's also the fact such patch would go unnoticed for a long time until
> > > someone exercises that code path. And when they do, the user would be
> > > pretty much in the dark trying to figure what what went wrong, why a
> > > SIGSEGV or -EFAULT happened. What's worse, we can't even say we fixed
> > > all the places where it matters in the current kernel codebase (ignoring
> > > future patches).
> >
> > So, looking forward a bit, this isn't going to be an ARM-specific issue
> > for long.
>
> I do hope so.
>
> > In fact, I think we shouldn't have arm-specific syscall wrappers
> > in this series: I think untagged_addr() should likely be added at the
> > top-level and have it be a no-op for other architectures.
>
> That's what the current patchset does, so we have this as a starting
> point. Kostya raised another potential issue with the syscall wrappers:
> with MTE the kernel will be forced to enable the match-all (wildcard)
> pointers for user space accesses since copy_from_user() would only get a
> 0 tag. So it has wider implications than just uaccess routines not
> checking the colour.
>
> > So given this becoming a kernel-wide multi-architecture issue (under
> > the assumption that x86, RISC-V, and others will gain similar TBI or
> > MTE things), we should solve it in a way that we can re-use.
>
> Can we do any better to aid the untagged_addr() placement (e.g. better
> type annotations, better static analysis)? We have to distinguish
> between user pointers that may be dereferenced by the kernel (I think
> almost fully covered with this patchset) and user addresses represented
> as ulong that may:
>
> a) be converted to a user pointer and dereferenced; I think that's the
> case for many overloaded ulong/u64 arguments
>
> b) used for address space management, rbtree look-ups etc. where the tag
> is no longer relevant and it even gets in the way
>
> We tried last year to identify void __user * casts to unsigned long
> using sparse on the assumption that pointers can be tagged while ulong
> is about address space management and needs to lose such tag. I think we
> could have pushed this further. For example, get_user_pages() takes an
> unsigned long but it is perfectly capable of untagging the address
> itself. Shall we change its first argument to void __user * (together
> with all its callers)?
>
> find_vma(), OTOH, could untag the address but it doesn't help since
> vm_start/end don't have such information (that's more about the content
> or type that the user decided) and the callers check against it.
>
> Are there any other places where this matters? These patches tracked
> down find_vma() as some heuristics but we may need better static
> analysis to identify other cases.
>
> > We need something that is going to work everywhere. And it needs to be
> > supported by the kernel for the simple reason that the kernel needs to
> > do MTE checks during copy_from_user(): having that information stripped
> > means we lose any userspace-assigned MTE protections if they get handled
> > by the kernel, which is a total non-starter, IMO.
>
> Such feedback is welcomed ;).
>
> > As an aside: I think Sparc ADI support in Linux actually side-stepped
> > this[1] (i.e. chose "solution 1"): "All addresses passed to kernel must
> > be non-ADI tagged addresses." (And sadly, "Kernel does not enable ADI
> > for kernel code.") I think this was a mistake we should not repeat for
> > arm64 (we do seem to be at least in agreement about this, I think).
> >
> > [1] https://lore.kernel.org/patchwork/patch/654481/
>
> I tried to drag the SPARC guys into this discussion but without much
> success.
>
> > > > This is a bit of a chicken-and-egg problem. In a world where memory
> > > > allocators on one or several popular platforms generate pointers with
> > > > non-zero tags, any such breakage will be caught in testing.
> > > > Unfortunately to reach that state we need the kernel to start
> > > > accepting tagged pointers first, and then hold on for a couple of
> > > > years until userspace catches up.
> > >
> > > Would the kernel also catch up with providing a stable ABI? Because we
> > > have two moving targets.
> > >
> > > On one hand, you have Android or some Linux distro that stick to a
> > > stable kernel version for some time, so they have better chance of
> > > clearing most of the problems. On the other hand, we have mainline
> > > kernel that gets over 500K lines every release. As maintainer, I can't
> > > rely on my testing alone as this is on a limited number of platforms. So
> > > my concern is that every kernel release has a significant chance of
> > > breaking the ABI, unless we have a better way of identifying potential
> > > issues.
> >
> > I just want to make sure I fully understand your concern about this
> > being an ABI break, and I work best with examples. The closest situation
> > I can see would be:
> >
> > - some program has no idea about MTE
>
> Apart from some libraries like libc (and maybe those that handle
> specific device ioctls), I think most programs should have no idea about
> MTE. I wouldn't expect programmers to have to change their app just
> because we have a new feature that colours heap allocations.

obviously i'm biased as a libc maintainer, but...

i don't think it helps to move this to libc --- now you just have an
extra dependency where to have a guaranteed working system you need to
update your kernel and libc together. (or at least update your libc to
understand new ioctls etc _before_ you can update your kernel.)

> > - malloc() starts returning MTE-tagged addresses
> > - program doesn't break from that change
> > - program uses some syscall that is missing untagged_addr() and fails
> > - kernel has now broken userspace that used to work
>
> That's one aspect though probably more of a case of plugging in a new
> device (graphics card, network etc.) and the ioctl to the new device
> doesn't work.
>
> The other is that, assuming we reach a point where the kernel entirely
> supports this relaxed ABI, can we guarantee that it won't break in the
> future. Let's say some subsequent kernel change (some refactoring)
> misses out an untagged_addr(). This renders a previously TBI/MTE-capable
> syscall unusable. Can we rely only on testing?
>
> > The trouble I see with this is that it is largely theoretical and
> > requires part of userspace to collude to start using a new CPU feature
> > that tickles a bug in the kernel. As I understand the golden rule,
> > this is a bug in the kernel (a missed ioctl() or such) to be fixed,
> > not a global breaking of some userspace behavior.
>
> Yes, we should follow the rule that it's a kernel bug but it doesn't
> help the user that a newly installed kernel causes user space to no
> longer reach a prompt. Hence the proposal of an opt-in via personality
> (for MTE we would need an explicit opt-in by the user anyway since the
> top byte is no longer ignored but checked against the allocation tag).

but realistically would this actually get used in this way? or would
any given system either be MTE or non-MTE. in which case a kernel
configuration option would seem to make more sense. (because either
way, the hypothetical user basically needs to recompile the kernel to
get back on their feet. or all of userspace.)

i'm not sure i see this new way for a kernel update to break my system
and need to be fixed forward/rolled back as any different from any of
the existing ways in which this can happen :-) as an end-user i have
to rely on whoever's sending me software updates to test adequately
enough that they find the problems. as an end user, there isn't any
difference between "my phone rebooted when i tried to take a photo
because of a kernel/driver leak", say, and "my phone rebooted when i
tried to take a photo because of missing untagging of a pointer passed
via ioctl".

i suspect you and i have very different people in mind when we say "user" :-)

> > I feel like I'm missing something about this being seen as an ABI
> > break. The kernel already fails on userspace addresses that have high
> > bits set -- are there things that _depend_ on this failure to operate?
>
> It's about providing a relaxed ABI which allows non-zero top byte and
> breaking it later inadvertently without having something better in place
> to analyse the kernel changes.
>
> Thanks.
>
> --
> Catalin

2019-05-22 16:38:23

by Catalin Marinas

[permalink] [raw]
Subject: Re: [PATCH v15 00/17] arm64: untag user pointers passed to the kernel

On Wed, May 22, 2019 at 08:30:21AM -0700, enh wrote:
> On Wed, May 22, 2019 at 3:11 AM Catalin Marinas <[email protected]> wrote:
> > On Tue, May 21, 2019 at 05:04:39PM -0700, Kees Cook wrote:
> > > I just want to make sure I fully understand your concern about this
> > > being an ABI break, and I work best with examples. The closest situation
> > > I can see would be:
> > >
> > > - some program has no idea about MTE
> >
> > Apart from some libraries like libc (and maybe those that handle
> > specific device ioctls), I think most programs should have no idea about
> > MTE. I wouldn't expect programmers to have to change their app just
> > because we have a new feature that colours heap allocations.
>
> obviously i'm biased as a libc maintainer, but...
>
> i don't think it helps to move this to libc --- now you just have an
> extra dependency where to have a guaranteed working system you need to
> update your kernel and libc together. (or at least update your libc to
> understand new ioctls etc _before_ you can update your kernel.)

That's not what I meant (or I misunderstood you). If we have a relaxed
ABI in the kernel and a libc that returns tagged pointers on malloc() I
wouldn't expect the programmer to do anything different in the
application code like explicit untagging. Basically the program would
continue to run unmodified irrespective of whether you use an old libc
without tagged pointers or a new one which tags heap allocations.

What I do expect is that the libc checks for the presence of the relaxed
ABI, currently proposed as an AT_FLAGS bit (for MTE we'd have a
HWCAP_MTE), and only tag the malloc() pointers if the kernel supports
the relaxed ABI. As you said, you shouldn't expect that the C library
and kernel are upgraded together, so they should be able to work in any
new/old version combination.

> > > The trouble I see with this is that it is largely theoretical and
> > > requires part of userspace to collude to start using a new CPU feature
> > > that tickles a bug in the kernel. As I understand the golden rule,
> > > this is a bug in the kernel (a missed ioctl() or such) to be fixed,
> > > not a global breaking of some userspace behavior.
> >
> > Yes, we should follow the rule that it's a kernel bug but it doesn't
> > help the user that a newly installed kernel causes user space to no
> > longer reach a prompt. Hence the proposal of an opt-in via personality
> > (for MTE we would need an explicit opt-in by the user anyway since the
> > top byte is no longer ignored but checked against the allocation tag).
>
> but realistically would this actually get used in this way? or would
> any given system either be MTE or non-MTE. in which case a kernel
> configuration option would seem to make more sense. (because either
> way, the hypothetical user basically needs to recompile the kernel to
> get back on their feet. or all of userspace.)

The two hard requirements I have for supporting any new hardware feature
in Linux are (1) a single kernel image binary continues to run on old
hardware while making use of the new feature if available and (2) old
user space continues to run on new hardware while new user space can
take advantage of the new feature.

The distro user space usually has a hard requirement that it continues
to run on (certain) old hardware. We can't enforce this in the kernel
but we offer the option to user space developers of checking feature
availability through HWCAP bits.

The Android story may be different as you have more control about which
kernel configurations are deployed on specific SoCs. I'm looking more
from a Linux distro angle where you just get an off-the-shelf OS image
and install it on your hardware, either taking advantage of new features
or just not using them if the software was not updated. Or, if updated
software is installed on old hardware, it would just run.

For MTE, we just can't enable it by default since there are applications
who use the top byte of a pointer and expect it to be ignored rather
than failing with a mismatched tag. Just think of a hwasan compiled
binary where TBI is expected to work and you try to run it with MTE
turned on.

I would also expect the C library or dynamic loader to check for the
presence of a HWCAP_MTE bit before starting to tag memory allocations,
otherwise it would get SIGILL on the first MTE instruction it tries to
execute.

> i'm not sure i see this new way for a kernel update to break my system
> and need to be fixed forward/rolled back as any different from any of
> the existing ways in which this can happen :-) as an end-user i have
> to rely on whoever's sending me software updates to test adequately
> enough that they find the problems. as an end user, there isn't any
> difference between "my phone rebooted when i tried to take a photo
> because of a kernel/driver leak", say, and "my phone rebooted when i
> tried to take a photo because of missing untagging of a pointer passed
> via ioctl".
>
> i suspect you and i have very different people in mind when we say "user" :-)

Indeed, I think we have different users in mind. I didn't mean the end
user who doesn't really care which C library version it's running on
their phone but rather advanced users (not necessarily kernel
developers) that prefer to build their own kernels with every release.
We could extend this to kernel developers who don't have time to track
down why a new kernel triggers lots of SIGSEGVs during boot.

--
Catalin

2019-05-22 17:23:55

by enh

[permalink] [raw]
Subject: Re: [PATCH v15 00/17] arm64: untag user pointers passed to the kernel

On Wed, May 22, 2019 at 9:35 AM Catalin Marinas <[email protected]> wrote:
>
> On Wed, May 22, 2019 at 08:30:21AM -0700, enh wrote:
> > On Wed, May 22, 2019 at 3:11 AM Catalin Marinas <[email protected]> wrote:
> > > On Tue, May 21, 2019 at 05:04:39PM -0700, Kees Cook wrote:
> > > > I just want to make sure I fully understand your concern about this
> > > > being an ABI break, and I work best with examples. The closest situation
> > > > I can see would be:
> > > >
> > > > - some program has no idea about MTE
> > >
> > > Apart from some libraries like libc (and maybe those that handle
> > > specific device ioctls), I think most programs should have no idea about
> > > MTE. I wouldn't expect programmers to have to change their app just
> > > because we have a new feature that colours heap allocations.
> >
> > obviously i'm biased as a libc maintainer, but...
> >
> > i don't think it helps to move this to libc --- now you just have an
> > extra dependency where to have a guaranteed working system you need to
> > update your kernel and libc together. (or at least update your libc to
> > understand new ioctls etc _before_ you can update your kernel.)
>
> That's not what I meant (or I misunderstood you). If we have a relaxed
> ABI in the kernel and a libc that returns tagged pointers on malloc() I
> wouldn't expect the programmer to do anything different in the
> application code like explicit untagging. Basically the program would
> continue to run unmodified irrespective of whether you use an old libc
> without tagged pointers or a new one which tags heap allocations.
>
> What I do expect is that the libc checks for the presence of the relaxed
> ABI, currently proposed as an AT_FLAGS bit (for MTE we'd have a
> HWCAP_MTE), and only tag the malloc() pointers if the kernel supports
> the relaxed ABI. As you said, you shouldn't expect that the C library
> and kernel are upgraded together, so they should be able to work in any
> new/old version combination.

yes, that part makes sense. i do think we'd use the AT_FLAGS bit, for
exactly this.

i was questioning the argument about the ioctl issues, and saying that
from my perspective, untagging bugs are not really any different than
any other kind of kernel bug.

> > > > The trouble I see with this is that it is largely theoretical and
> > > > requires part of userspace to collude to start using a new CPU feature
> > > > that tickles a bug in the kernel. As I understand the golden rule,
> > > > this is a bug in the kernel (a missed ioctl() or such) to be fixed,
> > > > not a global breaking of some userspace behavior.
> > >
> > > Yes, we should follow the rule that it's a kernel bug but it doesn't
> > > help the user that a newly installed kernel causes user space to no
> > > longer reach a prompt. Hence the proposal of an opt-in via personality
> > > (for MTE we would need an explicit opt-in by the user anyway since the
> > > top byte is no longer ignored but checked against the allocation tag).
> >
> > but realistically would this actually get used in this way? or would
> > any given system either be MTE or non-MTE. in which case a kernel
> > configuration option would seem to make more sense. (because either
> > way, the hypothetical user basically needs to recompile the kernel to
> > get back on their feet. or all of userspace.)
>
> The two hard requirements I have for supporting any new hardware feature
> in Linux are (1) a single kernel image binary continues to run on old
> hardware while making use of the new feature if available and (2) old
> user space continues to run on new hardware while new user space can
> take advantage of the new feature.
>
> The distro user space usually has a hard requirement that it continues
> to run on (certain) old hardware. We can't enforce this in the kernel
> but we offer the option to user space developers of checking feature
> availability through HWCAP bits.
>
> The Android story may be different as you have more control about which
> kernel configurations are deployed on specific SoCs. I'm looking more
> from a Linux distro angle where you just get an off-the-shelf OS image
> and install it on your hardware, either taking advantage of new features
> or just not using them if the software was not updated. Or, if updated
> software is installed on old hardware, it would just run.
>
> For MTE, we just can't enable it by default since there are applications
> who use the top byte of a pointer and expect it to be ignored rather
> than failing with a mismatched tag. Just think of a hwasan compiled
> binary where TBI is expected to work and you try to run it with MTE
> turned on.
>
> I would also expect the C library or dynamic loader to check for the
> presence of a HWCAP_MTE bit before starting to tag memory allocations,
> otherwise it would get SIGILL on the first MTE instruction it tries to
> execute.

(a bit off-topic, but i thought the MTE instructions were encoded in
the no-op space, to avoid this?)

> > i'm not sure i see this new way for a kernel update to break my system
> > and need to be fixed forward/rolled back as any different from any of
> > the existing ways in which this can happen :-) as an end-user i have
> > to rely on whoever's sending me software updates to test adequately
> > enough that they find the problems. as an end user, there isn't any
> > difference between "my phone rebooted when i tried to take a photo
> > because of a kernel/driver leak", say, and "my phone rebooted when i
> > tried to take a photo because of missing untagging of a pointer passed
> > via ioctl".
> >
> > i suspect you and i have very different people in mind when we say "user" :-)
>
> Indeed, I think we have different users in mind. I didn't mean the end
> user who doesn't really care which C library version it's running on
> their phone but rather advanced users (not necessarily kernel
> developers) that prefer to build their own kernels with every release.
> We could extend this to kernel developers who don't have time to track
> down why a new kernel triggers lots of SIGSEGVs during boot.

i still don't see how this isn't just a regular testing/CI issue, the
same as any other kind of kernel bug. it's already the case that i can
get a bad kernel...

> --
> Catalin

2019-05-22 20:05:51

by Kees Cook

[permalink] [raw]
Subject: Re: [PATCH v15 00/17] arm64: untag user pointers passed to the kernel

On Wed, May 22, 2019 at 08:30:21AM -0700, enh wrote:
> On Wed, May 22, 2019 at 3:11 AM Catalin Marinas <[email protected]> wrote:
> > On Tue, May 21, 2019 at 05:04:39PM -0700, Kees Cook wrote:
> > > I just want to make sure I fully understand your concern about this
> > > being an ABI break, and I work best with examples. The closest situation
> > > I can see would be:
> > >
> > > - some program has no idea about MTE
> >
> > Apart from some libraries like libc (and maybe those that handle
> > specific device ioctls), I think most programs should have no idea about
> > MTE. I wouldn't expect programmers to have to change their app just
> > because we have a new feature that colours heap allocations.

Right -- things should Just Work from the application perspective.

> obviously i'm biased as a libc maintainer, but...
>
> i don't think it helps to move this to libc --- now you just have an
> extra dependency where to have a guaranteed working system you need to
> update your kernel and libc together. (or at least update your libc to
> understand new ioctls etc _before_ you can update your kernel.)

I think (hope?) we've all agreed that we shouldn't pass this off to
userspace. At the very least, it reduces the utility of MTE, and at worst
it complicates userspace when this is clearly a kernel/architecture issue.

>
> > > - malloc() starts returning MTE-tagged addresses
> > > - program doesn't break from that change
> > > - program uses some syscall that is missing untagged_addr() and fails
> > > - kernel has now broken userspace that used to work
> >
> > That's one aspect though probably more of a case of plugging in a new
> > device (graphics card, network etc.) and the ioctl to the new device
> > doesn't work.

I think MTE will likely be rather like NX/PXN and SMAP/PAN: there will
be glitches, and we can disable stuff either via CONFIG or (as is more
common now) via a kernel commandline with untagged_addr() containing a
static branch, etc. But I actually don't think we need to go this route
(see below...)

> > The other is that, assuming we reach a point where the kernel entirely
> > supports this relaxed ABI, can we guarantee that it won't break in the
> > future. Let's say some subsequent kernel change (some refactoring)
> > misses out an untagged_addr(). This renders a previously TBI/MTE-capable
> > syscall unusable. Can we rely only on testing?
> >
> > > The trouble I see with this is that it is largely theoretical and
> > > requires part of userspace to collude to start using a new CPU feature
> > > that tickles a bug in the kernel. As I understand the golden rule,
> > > this is a bug in the kernel (a missed ioctl() or such) to be fixed,
> > > not a global breaking of some userspace behavior.
> >
> > Yes, we should follow the rule that it's a kernel bug but it doesn't
> > help the user that a newly installed kernel causes user space to no
> > longer reach a prompt. Hence the proposal of an opt-in via personality
> > (for MTE we would need an explicit opt-in by the user anyway since the
> > top byte is no longer ignored but checked against the allocation tag).
>
> but realistically would this actually get used in this way? or would
> any given system either be MTE or non-MTE. in which case a kernel
> configuration option would seem to make more sense. (because either
> way, the hypothetical user basically needs to recompile the kernel to
> get back on their feet. or all of userspace.)

Right: the point is to design things so that we do our best to not break
userspace that is using the new feature (which I think this series has
done well). But supporting MTE/TBI is just like supporting PAN: if someone
refactors a driver and swaps a copy_from_user() to a memcpy(), it's going
to break under PAN. There will be the same long tail of these bugs like
any other, but my sense is that they are small and rare. But I agree:
they're going to be pretty weird bugs to track down. The final result,
however, will be excellent annotation in the kernel for where userspace
addresses get used and people make assumptions about them.

The sooner we get the series landed and gain QEMU support (or real
hardware), the faster we can hammer out these missed corner-cases.
What's the timeline for either of those things, BTW?

> > > I feel like I'm missing something about this being seen as an ABI
> > > break. The kernel already fails on userspace addresses that have high
> > > bits set -- are there things that _depend_ on this failure to operate?
> >
> > It's about providing a relaxed ABI which allows non-zero top byte and
> > breaking it later inadvertently without having something better in place
> > to analyse the kernel changes.

It sounds like the question is how to switch a process in or out of this
ABI (but I don't think that's the real issue: I think it's just a matter
of whether or not a process uses tags at all). Doing it at the prctl()
level doesn't make sense to me, except maybe to detect MTE support or
something. ("Should I tag allocations?") And that state is controlled
by the kernel: the kernel does it or it doesn't.

If a process wants to not tag, that's also up to the allocator where
it can decide not to ask the kernel, and just not tag. Nothing breaks in
userspace if a process is NOT tagging and untagged_addr() exists or is
missing. This, I think, is the core way this doesn't trip over the
golden rule: an old system image will run fine (because it's not
tagging). A *new* system may encounter bugs with tagging because it's a
new feature: this is The Way Of Things. But we don't break old userspace
because old userspace isn't using tags.

So the agreement appears to be between the kernel and the allocator.
Kernel says "I support this" or not. Telling the allocator to not tag if
something breaks sounds like an entirely userspace decision, yes?

--
Kees Cook

2019-05-22 20:17:38

by enh

[permalink] [raw]
Subject: Re: [PATCH v15 00/17] arm64: untag user pointers passed to the kernel

On Wed, May 22, 2019 at 12:21 PM Kees Cook <[email protected]> wrote:
>
> On Wed, May 22, 2019 at 08:30:21AM -0700, enh wrote:
> > On Wed, May 22, 2019 at 3:11 AM Catalin Marinas <[email protected]> wrote:
> > > On Tue, May 21, 2019 at 05:04:39PM -0700, Kees Cook wrote:
> > > > I just want to make sure I fully understand your concern about this
> > > > being an ABI break, and I work best with examples. The closest situation
> > > > I can see would be:
> > > >
> > > > - some program has no idea about MTE
> > >
> > > Apart from some libraries like libc (and maybe those that handle
> > > specific device ioctls), I think most programs should have no idea about
> > > MTE. I wouldn't expect programmers to have to change their app just
> > > because we have a new feature that colours heap allocations.
>
> Right -- things should Just Work from the application perspective.
>
> > obviously i'm biased as a libc maintainer, but...
> >
> > i don't think it helps to move this to libc --- now you just have an
> > extra dependency where to have a guaranteed working system you need to
> > update your kernel and libc together. (or at least update your libc to
> > understand new ioctls etc _before_ you can update your kernel.)
>
> I think (hope?) we've all agreed that we shouldn't pass this off to
> userspace. At the very least, it reduces the utility of MTE, and at worst
> it complicates userspace when this is clearly a kernel/architecture issue.
>
> >
> > > > - malloc() starts returning MTE-tagged addresses
> > > > - program doesn't break from that change
> > > > - program uses some syscall that is missing untagged_addr() and fails
> > > > - kernel has now broken userspace that used to work
> > >
> > > That's one aspect though probably more of a case of plugging in a new
> > > device (graphics card, network etc.) and the ioctl to the new device
> > > doesn't work.
>
> I think MTE will likely be rather like NX/PXN and SMAP/PAN: there will
> be glitches, and we can disable stuff either via CONFIG or (as is more
> common now) via a kernel commandline with untagged_addr() containing a
> static branch, etc. But I actually don't think we need to go this route
> (see below...)
>
> > > The other is that, assuming we reach a point where the kernel entirely
> > > supports this relaxed ABI, can we guarantee that it won't break in the
> > > future. Let's say some subsequent kernel change (some refactoring)
> > > misses out an untagged_addr(). This renders a previously TBI/MTE-capable
> > > syscall unusable. Can we rely only on testing?
> > >
> > > > The trouble I see with this is that it is largely theoretical and
> > > > requires part of userspace to collude to start using a new CPU feature
> > > > that tickles a bug in the kernel. As I understand the golden rule,
> > > > this is a bug in the kernel (a missed ioctl() or such) to be fixed,
> > > > not a global breaking of some userspace behavior.
> > >
> > > Yes, we should follow the rule that it's a kernel bug but it doesn't
> > > help the user that a newly installed kernel causes user space to no
> > > longer reach a prompt. Hence the proposal of an opt-in via personality
> > > (for MTE we would need an explicit opt-in by the user anyway since the
> > > top byte is no longer ignored but checked against the allocation tag).
> >
> > but realistically would this actually get used in this way? or would
> > any given system either be MTE or non-MTE. in which case a kernel
> > configuration option would seem to make more sense. (because either
> > way, the hypothetical user basically needs to recompile the kernel to
> > get back on their feet. or all of userspace.)
>
> Right: the point is to design things so that we do our best to not break
> userspace that is using the new feature (which I think this series has
> done well). But supporting MTE/TBI is just like supporting PAN: if someone
> refactors a driver and swaps a copy_from_user() to a memcpy(), it's going
> to break under PAN. There will be the same long tail of these bugs like
> any other, but my sense is that they are small and rare. But I agree:
> they're going to be pretty weird bugs to track down. The final result,
> however, will be excellent annotation in the kernel for where userspace
> addresses get used and people make assumptions about them.
>
> The sooner we get the series landed and gain QEMU support (or real
> hardware), the faster we can hammer out these missed corner-cases.
> What's the timeline for either of those things, BTW?
>
> > > > I feel like I'm missing something about this being seen as an ABI
> > > > break. The kernel already fails on userspace addresses that have high
> > > > bits set -- are there things that _depend_ on this failure to operate?
> > >
> > > It's about providing a relaxed ABI which allows non-zero top byte and
> > > breaking it later inadvertently without having something better in place
> > > to analyse the kernel changes.
>
> It sounds like the question is how to switch a process in or out of this
> ABI (but I don't think that's the real issue: I think it's just a matter
> of whether or not a process uses tags at all). Doing it at the prctl()
> level doesn't make sense to me, except maybe to detect MTE support or
> something. ("Should I tag allocations?") And that state is controlled
> by the kernel: the kernel does it or it doesn't.
>
> If a process wants to not tag, that's also up to the allocator where
> it can decide not to ask the kernel, and just not tag. Nothing breaks in
> userspace if a process is NOT tagging and untagged_addr() exists or is
> missing. This, I think, is the core way this doesn't trip over the
> golden rule: an old system image will run fine (because it's not
> tagging). A *new* system may encounter bugs with tagging because it's a
> new feature: this is The Way Of Things. But we don't break old userspace
> because old userspace isn't using tags.
>
> So the agreement appears to be between the kernel and the allocator.
> Kernel says "I support this" or not. Telling the allocator to not tag if
> something breaks sounds like an entirely userspace decision, yes?

sgtm, and the AT_FLAGS suggestion sounds fine for our needs in that regard.

> --
> Kees Cook

2019-05-22 20:50:24

by Kees Cook

[permalink] [raw]
Subject: Re: [PATCH v15 00/17] arm64: untag user pointers passed to the kernel

On Wed, May 22, 2019 at 05:35:27PM +0100, Catalin Marinas wrote:
> The two hard requirements I have for supporting any new hardware feature
> in Linux are (1) a single kernel image binary continues to run on old
> hardware while making use of the new feature if available and (2) old
> user space continues to run on new hardware while new user space can
> take advantage of the new feature.

Agreed! And I think the series meets these requirements, yes?

> For MTE, we just can't enable it by default since there are applications
> who use the top byte of a pointer and expect it to be ignored rather
> than failing with a mismatched tag. Just think of a hwasan compiled
> binary where TBI is expected to work and you try to run it with MTE
> turned on.

Ah! Okay, here's the use-case I wasn't thinking of: the concern is TBI
conflicting with MTE. And anything that starts using TBI suddenly can't
run in the future because it's being interpreted as MTE bits? (Is that
the ABI concern? I feel like we got into the weeds about ioctl()s and
one-off bugs...)

So there needs to be some way to let the kernel know which of three
things it should be doing:
1- leaving userspace addresses as-is (present)
2- wiping the top bits before using (this series)
3- wiping the top bits for most things, but retaining them for MTE as
needed (the future)

I expect MTE to be the "default" in the future. Once a system's libc has
grown support for it, everything will be trying to use MTE. TBI will be
the special case (but TBI is effectively a prerequisite).

AFAICT, the only difference I see between 2 and 3 will be the tag handling
in usercopy (all other places will continue to ignore the top bits). Is
that accurate?

Is "1" a per-process state we want to keep? (I assume not, but rather it
is available via no TBI/MTE CONFIG or a boot-time option, if at all?)

To choose between "2" and "3", it seems we need a per-process flag to
opt into TBI (and out of MTE). For userspace, how would a future binary
choose TBI over MTE? If it's a library issue, we can't use an ELF bit,
since the choice may be "late" after ELF load (this implies the need
for a prctl().) If it's binary-only ("built with HWKASan") then an ELF
bit seems sufficient. And without the marking, I'd expect the kernel to
enforce MTE when there are high bits.

> I would also expect the C library or dynamic loader to check for the
> presence of a HWCAP_MTE bit before starting to tag memory allocations,
> otherwise it would get SIGILL on the first MTE instruction it tries to
> execute.

I've got the same question as Elliot: aren't MTE instructions just NOP
to older CPUs? I.e. if the CPU (or kernel) don't support it, it just
gets entirely ignored: checking is only needed to satisfy curiosity
or behavioral expectations.

To me, the conflict seems to be using TBI in the face of expecting MTE to
be the default state of the future. (But the internal changes needed
for TBI -- this series -- is a prereq for MTE.)

--
Kees Cook

2019-05-22 21:18:36

by Evgenii Stepanov

[permalink] [raw]
Subject: Re: [PATCH v15 05/17] arms64: untag user pointers passed to memory syscalls

On Wed, May 22, 2019 at 4:49 AM Catalin Marinas <[email protected]> wrote:
>
> On Mon, May 06, 2019 at 06:30:51PM +0200, Andrey Konovalov wrote:
> > This patch is a part of a series that extends arm64 kernel ABI to allow to
> > pass tagged user pointers (with the top byte set to something else other
> > than 0x00) as syscall arguments.
> >
> > This patch allows tagged pointers to be passed to the following memory
> > syscalls: brk, get_mempolicy, madvise, mbind, mincore, mlock, mlock2,
> > mmap, mmap_pgoff, mprotect, mremap, msync, munlock, munmap,
> > remap_file_pages, shmat and shmdt.
> >
> > This is done by untagging pointers passed to these syscalls in the
> > prologues of their handlers.
>
> I'll go through them one by one to see if we can tighten the expected
> ABI while having the MTE in mind.
>
> > diff --git a/arch/arm64/kernel/sys.c b/arch/arm64/kernel/sys.c
> > index b44065fb1616..933bb9f3d6ec 100644
> > --- a/arch/arm64/kernel/sys.c
> > +++ b/arch/arm64/kernel/sys.c
> > @@ -35,10 +35,33 @@ SYSCALL_DEFINE6(mmap, unsigned long, addr, unsigned long, len,
> > {
> > if (offset_in_page(off) != 0)
> > return -EINVAL;
> > -
> > + addr = untagged_addr(addr);
> > return ksys_mmap_pgoff(addr, len, prot, flags, fd, off >> PAGE_SHIFT);
> > }
>
> If user passes a tagged pointer to mmap() and the address is honoured
> (or MAP_FIXED is given), what is the expected return pointer? Does it
> need to be tagged with the value from the hint?

For HWASan the most convenient would be to use the tag from the hint.
But since in the TBI (not MTE) mode the kernel has no idea what
meaning userspace assigns to pointer tags, perhaps it should not try
to guess, and should return raw (zero-tagged) address instead.

> With MTE, we may want to use this as a request for the default colour of
> the mapped pages (still under discussion).

I like this - and in that case it would make sense to return the
pointer that can be immediately dereferenced without crashing the
process, i.e. with the matching tag.

> > +SYSCALL_DEFINE6(arm64_mmap_pgoff, unsigned long, addr, unsigned long, len,
> > + unsigned long, prot, unsigned long, flags,
> > + unsigned long, fd, unsigned long, pgoff)
> > +{
> > + addr = untagged_addr(addr);
> > + return ksys_mmap_pgoff(addr, len, prot, flags, fd, pgoff);
> > +}
>
> We don't have __NR_mmap_pgoff on arm64.
>
> > +SYSCALL_DEFINE5(arm64_mremap, unsigned long, addr, unsigned long, old_len,
> > + unsigned long, new_len, unsigned long, flags,
> > + unsigned long, new_addr)
> > +{
> > + addr = untagged_addr(addr);
> > + new_addr = untagged_addr(new_addr);
> > + return ksys_mremap(addr, old_len, new_len, flags, new_addr);
> > +}
>
> Similar comment as for mmap(), do we want the tag from new_addr to be
> preserved? In addition, should we check that the two tags are identical
> or mremap() should become a way to repaint a memory region?
>
> > +SYSCALL_DEFINE2(arm64_munmap, unsigned long, addr, size_t, len)
> > +{
> > + addr = untagged_addr(addr);
> > + return ksys_munmap(addr, len);
> > +}
>
> This looks fine.
>
> > +SYSCALL_DEFINE1(arm64_brk, unsigned long, brk)
> > +{
> > + brk = untagged_addr(brk);
> > + return ksys_brk(brk);
> > +}
>
> I wonder whether brk() should simply not accept tags, and should not
> return them (similar to the prctl(PR_SET_MM) discussion). We could
> document this in the ABI requirements.
>
> > +SYSCALL_DEFINE5(arm64_get_mempolicy, int __user *, policy,
> > + unsigned long __user *, nmask, unsigned long, maxnode,
> > + unsigned long, addr, unsigned long, flags)
> > +{
> > + addr = untagged_addr(addr);
> > + return ksys_get_mempolicy(policy, nmask, maxnode, addr, flags);
> > +}
> > +
> > +SYSCALL_DEFINE3(arm64_madvise, unsigned long, start,
> > + size_t, len_in, int, behavior)
> > +{
> > + start = untagged_addr(start);
> > + return ksys_madvise(start, len_in, behavior);
> > +}
> > +
> > +SYSCALL_DEFINE6(arm64_mbind, unsigned long, start, unsigned long, len,
> > + unsigned long, mode, const unsigned long __user *, nmask,
> > + unsigned long, maxnode, unsigned int, flags)
> > +{
> > + start = untagged_addr(start);
> > + return ksys_mbind(start, len, mode, nmask, maxnode, flags);
> > +}
> > +
> > +SYSCALL_DEFINE2(arm64_mlock, unsigned long, start, size_t, len)
> > +{
> > + start = untagged_addr(start);
> > + return ksys_mlock(start, len, VM_LOCKED);
> > +}
> > +
> > +SYSCALL_DEFINE2(arm64_mlock2, unsigned long, start, size_t, len)
> > +{
> > + start = untagged_addr(start);
> > + return ksys_mlock(start, len, VM_LOCKED);
> > +}
> > +
> > +SYSCALL_DEFINE2(arm64_munlock, unsigned long, start, size_t, len)
> > +{
> > + start = untagged_addr(start);
> > + return ksys_munlock(start, len);
> > +}
> > +
> > +SYSCALL_DEFINE3(arm64_mprotect, unsigned long, start, size_t, len,
> > + unsigned long, prot)
> > +{
> > + start = untagged_addr(start);
> > + return ksys_mprotect_pkey(start, len, prot, -1);
> > +}
> > +
> > +SYSCALL_DEFINE3(arm64_msync, unsigned long, start, size_t, len, int, flags)
> > +{
> > + start = untagged_addr(start);
> > + return ksys_msync(start, len, flags);
> > +}
> > +
> > +SYSCALL_DEFINE3(arm64_mincore, unsigned long, start, size_t, len,
> > + unsigned char __user *, vec)
> > +{
> > + start = untagged_addr(start);
> > + return ksys_mincore(start, len, vec);
> > +}
>
> These look fine.
>
> > +SYSCALL_DEFINE5(arm64_remap_file_pages, unsigned long, start,
> > + unsigned long, size, unsigned long, prot,
> > + unsigned long, pgoff, unsigned long, flags)
> > +{
> > + start = untagged_addr(start);
> > + return ksys_remap_file_pages(start, size, prot, pgoff, flags);
> > +}
>
> While this has been deprecated for some time, I presume user space still
> invokes it?
>
> > +SYSCALL_DEFINE3(arm64_shmat, int, shmid, char __user *, shmaddr, int, shmflg)
> > +{
> > + shmaddr = untagged_addr(shmaddr);
> > + return ksys_shmat(shmid, shmaddr, shmflg);
> > +}
> > +
> > +SYSCALL_DEFINE1(arm64_shmdt, char __user *, shmaddr)
> > +{
> > + shmaddr = untagged_addr(shmaddr);
> > + return ksys_shmdt(shmaddr);
> > +}
>
> Do we actually want to allow shared tagged memory? Who's going to tag
> it? If not, we can document it as not supported.
>
> --
> Catalin

2019-05-22 23:05:23

by Evgenii Stepanov

[permalink] [raw]
Subject: Re: [PATCH v15 00/17] arm64: untag user pointers passed to the kernel

On Wed, May 22, 2019 at 1:47 PM Kees Cook <[email protected]> wrote:
>
> On Wed, May 22, 2019 at 05:35:27PM +0100, Catalin Marinas wrote:
> > The two hard requirements I have for supporting any new hardware feature
> > in Linux are (1) a single kernel image binary continues to run on old
> > hardware while making use of the new feature if available and (2) old
> > user space continues to run on new hardware while new user space can
> > take advantage of the new feature.
>
> Agreed! And I think the series meets these requirements, yes?
>
> > For MTE, we just can't enable it by default since there are applications
> > who use the top byte of a pointer and expect it to be ignored rather
> > than failing with a mismatched tag. Just think of a hwasan compiled
> > binary where TBI is expected to work and you try to run it with MTE
> > turned on.
>
> Ah! Okay, here's the use-case I wasn't thinking of: the concern is TBI
> conflicting with MTE. And anything that starts using TBI suddenly can't
> run in the future because it's being interpreted as MTE bits? (Is that
> the ABI concern? I feel like we got into the weeds about ioctl()s and
> one-off bugs...)
>
> So there needs to be some way to let the kernel know which of three
> things it should be doing:
> 1- leaving userspace addresses as-is (present)
> 2- wiping the top bits before using (this series)
> 3- wiping the top bits for most things, but retaining them for MTE as
> needed (the future)
>
> I expect MTE to be the "default" in the future. Once a system's libc has
> grown support for it, everything will be trying to use MTE. TBI will be
> the special case (but TBI is effectively a prerequisite).
>
> AFAICT, the only difference I see between 2 and 3 will be the tag handling
> in usercopy (all other places will continue to ignore the top bits). Is
> that accurate?
>
> Is "1" a per-process state we want to keep? (I assume not, but rather it
> is available via no TBI/MTE CONFIG or a boot-time option, if at all?)
>
> To choose between "2" and "3", it seems we need a per-process flag to
> opt into TBI (and out of MTE). For userspace, how would a future binary
> choose TBI over MTE? If it's a library issue, we can't use an ELF bit,
> since the choice may be "late" after ELF load (this implies the need
> for a prctl().) If it's binary-only ("built with HWKASan") then an ELF
> bit seems sufficient. And without the marking, I'd expect the kernel to
> enforce MTE when there are high bits.
>
> > I would also expect the C library or dynamic loader to check for the
> > presence of a HWCAP_MTE bit before starting to tag memory allocations,
> > otherwise it would get SIGILL on the first MTE instruction it tries to
> > execute.
>
> I've got the same question as Elliot: aren't MTE instructions just NOP
> to older CPUs? I.e. if the CPU (or kernel) don't support it, it just
> gets entirely ignored: checking is only needed to satisfy curiosity
> or behavioral expectations.

MTE instructions are not NOP. Most of them have side effects (changing
register values, zeroing memory).
This only matters for stack tagging, though. Heap tagging is a runtime
decision in the allocator.

If an image needs to run on old hardware, it will have to do heap tagging only.

> To me, the conflict seems to be using TBI in the face of expecting MTE to
> be the default state of the future. (But the internal changes needed
> for TBI -- this series -- is a prereq for MTE.)
>
> --
> Kees Cook

2019-05-22 23:11:17

by enh

[permalink] [raw]
Subject: Re: [PATCH v15 00/17] arm64: untag user pointers passed to the kernel

On Wed, May 22, 2019 at 4:03 PM Evgenii Stepanov <[email protected]> wrote:
>
> On Wed, May 22, 2019 at 1:47 PM Kees Cook <[email protected]> wrote:
> >
> > On Wed, May 22, 2019 at 05:35:27PM +0100, Catalin Marinas wrote:
> > > The two hard requirements I have for supporting any new hardware feature
> > > in Linux are (1) a single kernel image binary continues to run on old
> > > hardware while making use of the new feature if available and (2) old
> > > user space continues to run on new hardware while new user space can
> > > take advantage of the new feature.
> >
> > Agreed! And I think the series meets these requirements, yes?
> >
> > > For MTE, we just can't enable it by default since there are applications
> > > who use the top byte of a pointer and expect it to be ignored rather
> > > than failing with a mismatched tag. Just think of a hwasan compiled
> > > binary where TBI is expected to work and you try to run it with MTE
> > > turned on.
> >
> > Ah! Okay, here's the use-case I wasn't thinking of: the concern is TBI
> > conflicting with MTE. And anything that starts using TBI suddenly can't
> > run in the future because it's being interpreted as MTE bits? (Is that
> > the ABI concern? I feel like we got into the weeds about ioctl()s and
> > one-off bugs...)
> >
> > So there needs to be some way to let the kernel know which of three
> > things it should be doing:
> > 1- leaving userspace addresses as-is (present)
> > 2- wiping the top bits before using (this series)
> > 3- wiping the top bits for most things, but retaining them for MTE as
> > needed (the future)
> >
> > I expect MTE to be the "default" in the future. Once a system's libc has
> > grown support for it, everything will be trying to use MTE. TBI will be
> > the special case (but TBI is effectively a prerequisite).
> >
> > AFAICT, the only difference I see between 2 and 3 will be the tag handling
> > in usercopy (all other places will continue to ignore the top bits). Is
> > that accurate?
> >
> > Is "1" a per-process state we want to keep? (I assume not, but rather it
> > is available via no TBI/MTE CONFIG or a boot-time option, if at all?)
> >
> > To choose between "2" and "3", it seems we need a per-process flag to
> > opt into TBI (and out of MTE). For userspace, how would a future binary
> > choose TBI over MTE? If it's a library issue, we can't use an ELF bit,
> > since the choice may be "late" after ELF load (this implies the need
> > for a prctl().) If it's binary-only ("built with HWKASan") then an ELF
> > bit seems sufficient. And without the marking, I'd expect the kernel to
> > enforce MTE when there are high bits.
> >
> > > I would also expect the C library or dynamic loader to check for the
> > > presence of a HWCAP_MTE bit before starting to tag memory allocations,
> > > otherwise it would get SIGILL on the first MTE instruction it tries to
> > > execute.
> >
> > I've got the same question as Elliot: aren't MTE instructions just NOP
> > to older CPUs? I.e. if the CPU (or kernel) don't support it, it just
> > gets entirely ignored: checking is only needed to satisfy curiosity
> > or behavioral expectations.
>
> MTE instructions are not NOP. Most of them have side effects (changing
> register values, zeroing memory).

no, i meant "they're encoded in a space that was previously no-ops, so
running on MTE code on old hardware doesn't cause SIGILL".

> This only matters for stack tagging, though. Heap tagging is a runtime
> decision in the allocator.
>
> If an image needs to run on old hardware, it will have to do heap tagging only.
>
> > To me, the conflict seems to be using TBI in the face of expecting MTE to
> > be the default state of the future. (But the internal changes needed
> > for TBI -- this series -- is a prereq for MTE.)
> >
> > --
> > Kees Cook

2019-05-23 00:22:20

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH v15 00/17] arm64: untag user pointers passed to the kernel

On Wed, May 22, 2019 at 02:49:28PM +0100, Dave Martin wrote:
> On Tue, May 21, 2019 at 03:48:56PM -0300, Jason Gunthorpe wrote:
> > On Fri, May 17, 2019 at 03:49:31PM +0100, Catalin Marinas wrote:
> >
> > > The tagged pointers (whether hwasan or MTE) should ideally be a
> > > transparent feature for the application writer but I don't think we can
> > > solve it entirely and make it seamless for the multitude of ioctls().
> > > I'd say you only opt in to such feature if you know what you are doing
> > > and the user code takes care of specific cases like ioctl(), hence the
> > > prctl() proposal even for the hwasan.
> >
> > I'm not sure such a dire view is warrented..
> >
> > The ioctl situation is not so bad, other than a few special cases,
> > most drivers just take a 'void __user *' and pass it as an argument to
> > some function that accepts a 'void __user *'. sparse et al verify
> > this.
> >
> > As long as the core functions do the right thing the drivers will be
> > OK.
> >
> > The only place things get dicy is if someone casts to unsigned long
> > (ie for vma work) but I think that reflects that our driver facing
> > APIs for VMAs are compatible with static analysis (ie I have no
> > earthly idea why get_user_pages() accepts an unsigned long), not that
> > this is too hard.
>
> If multiple people will care about this, perhaps we should try to
> annotate types more explicitly in SYSCALL_DEFINEx() and ABI data
> structures.
>
> For example, we could have a couple of mutually exclusive modifiers
>
> T __object *
> T __vaddr * (or U __vaddr)
>
> In the first case the pointer points to an object (in the C sense)
> that the call may dereference but not use for any other purpose.

How would you use these two differently?

So far the kernel has worked that __user should tag any pointer that
is from userspace and then you can't do anything with it until you
transform it into a kernel something

> to tell static analysers the real type of pointers smuggled through
> UAPI disguised as other types (*cough* KVM, etc.)

Yes, that would help alot, we often have to pass pointers through a
u64 in the uAPI, and there is no static checker support to make sure
they are run through the u64_to_user_ptr() helper.

Jason

2019-05-23 07:37:16

by Catalin Marinas

[permalink] [raw]
Subject: Re: [PATCH v15 00/17] arm64: untag user pointers passed to the kernel

On Wed, May 22, 2019 at 04:09:31PM -0700, enh wrote:
> On Wed, May 22, 2019 at 4:03 PM Evgenii Stepanov <[email protected]> wrote:
> > On Wed, May 22, 2019 at 1:47 PM Kees Cook <[email protected]> wrote:
> > > On Wed, May 22, 2019 at 05:35:27PM +0100, Catalin Marinas wrote:
> > > > I would also expect the C library or dynamic loader to check for the
> > > > presence of a HWCAP_MTE bit before starting to tag memory allocations,
> > > > otherwise it would get SIGILL on the first MTE instruction it tries to
> > > > execute.
> > >
> > > I've got the same question as Elliot: aren't MTE instructions just NOP
> > > to older CPUs? I.e. if the CPU (or kernel) don't support it, it just
> > > gets entirely ignored: checking is only needed to satisfy curiosity
> > > or behavioral expectations.
> >
> > MTE instructions are not NOP. Most of them have side effects (changing
> > register values, zeroing memory).
>
> no, i meant "they're encoded in a space that was previously no-ops, so
> running on MTE code on old hardware doesn't cause SIGILL".

It does result in SIGILL, there wasn't enough encoding left in the NOP
space for old/current CPU implementations (in hindsight, we should have
reserved a bigger NOP space).

As Evgenii said, the libc needs to be careful when tagging the heap as
it would cause a SIGILL if the HWCAP_MTE is not set. The standard
application doesn't need to be recompiled as it would not issue MTE
colouring instructions, just standard LDR/STR.

Stack tagging is problematic if you want to colour each frame
individually, the function prologue would need the non-NOP MTE
instructions. The best we can do here is just having the (thread) stacks
of different colours.

--
Catalin

2019-05-23 09:06:45

by Catalin Marinas

[permalink] [raw]
Subject: Re: [PATCH v15 05/17] arms64: untag user pointers passed to memory syscalls

On Wed, May 22, 2019 at 02:16:57PM -0700, Evgenii Stepanov wrote:
> On Wed, May 22, 2019 at 4:49 AM Catalin Marinas <[email protected]> wrote:
> > On Mon, May 06, 2019 at 06:30:51PM +0200, Andrey Konovalov wrote:
> > > This patch is a part of a series that extends arm64 kernel ABI to allow to
> > > pass tagged user pointers (with the top byte set to something else other
> > > than 0x00) as syscall arguments.
> > >
> > > This patch allows tagged pointers to be passed to the following memory
> > > syscalls: brk, get_mempolicy, madvise, mbind, mincore, mlock, mlock2,
> > > mmap, mmap_pgoff, mprotect, mremap, msync, munlock, munmap,
> > > remap_file_pages, shmat and shmdt.
> > >
> > > This is done by untagging pointers passed to these syscalls in the
> > > prologues of their handlers.
> >
> > I'll go through them one by one to see if we can tighten the expected
> > ABI while having the MTE in mind.
> >
> > > diff --git a/arch/arm64/kernel/sys.c b/arch/arm64/kernel/sys.c
> > > index b44065fb1616..933bb9f3d6ec 100644
> > > --- a/arch/arm64/kernel/sys.c
> > > +++ b/arch/arm64/kernel/sys.c
> > > @@ -35,10 +35,33 @@ SYSCALL_DEFINE6(mmap, unsigned long, addr, unsigned long, len,
> > > {
> > > if (offset_in_page(off) != 0)
> > > return -EINVAL;
> > > -
> > > + addr = untagged_addr(addr);
> > > return ksys_mmap_pgoff(addr, len, prot, flags, fd, off >> PAGE_SHIFT);
> > > }
> >
> > If user passes a tagged pointer to mmap() and the address is honoured
> > (or MAP_FIXED is given), what is the expected return pointer? Does it
> > need to be tagged with the value from the hint?
>
> For HWASan the most convenient would be to use the tag from the hint.
> But since in the TBI (not MTE) mode the kernel has no idea what
> meaning userspace assigns to pointer tags, perhaps it should not try
> to guess, and should return raw (zero-tagged) address instead.

Then, just to relax the ABI for hwasan, shall we simply disallow tagged
pointers on mmap() arguments? We can leave them in for
mremap(old_address), madvise().

> > With MTE, we may want to use this as a request for the default colour of
> > the mapped pages (still under discussion).
>
> I like this - and in that case it would make sense to return the
> pointer that can be immediately dereferenced without crashing the
> process, i.e. with the matching tag.

This came up from the Android investigation work where large memory
allocations (using mmap) could be more efficiently pre-tagged by the
kernel on page fault. Not sure about the implementation details yet.

--
Catalin

2019-05-23 10:45:48

by Dave Martin

[permalink] [raw]
Subject: Re: [PATCH v15 00/17] arm64: untag user pointers passed to the kernel

On Wed, May 22, 2019 at 09:20:52PM -0300, Jason Gunthorpe wrote:
> On Wed, May 22, 2019 at 02:49:28PM +0100, Dave Martin wrote:
> > On Tue, May 21, 2019 at 03:48:56PM -0300, Jason Gunthorpe wrote:
> > > On Fri, May 17, 2019 at 03:49:31PM +0100, Catalin Marinas wrote:
> > >
> > > > The tagged pointers (whether hwasan or MTE) should ideally be a
> > > > transparent feature for the application writer but I don't think we can
> > > > solve it entirely and make it seamless for the multitude of ioctls().
> > > > I'd say you only opt in to such feature if you know what you are doing
> > > > and the user code takes care of specific cases like ioctl(), hence the
> > > > prctl() proposal even for the hwasan.
> > >
> > > I'm not sure such a dire view is warrented..
> > >
> > > The ioctl situation is not so bad, other than a few special cases,
> > > most drivers just take a 'void __user *' and pass it as an argument to
> > > some function that accepts a 'void __user *'. sparse et al verify
> > > this.
> > >
> > > As long as the core functions do the right thing the drivers will be
> > > OK.
> > >
> > > The only place things get dicy is if someone casts to unsigned long
> > > (ie for vma work) but I think that reflects that our driver facing
> > > APIs for VMAs are compatible with static analysis (ie I have no
> > > earthly idea why get_user_pages() accepts an unsigned long), not that
> > > this is too hard.
> >
> > If multiple people will care about this, perhaps we should try to
> > annotate types more explicitly in SYSCALL_DEFINEx() and ABI data
> > structures.
> >
> > For example, we could have a couple of mutually exclusive modifiers
> >
> > T __object *
> > T __vaddr * (or U __vaddr)
> >
> > In the first case the pointer points to an object (in the C sense)
> > that the call may dereference but not use for any other purpose.
>
> How would you use these two differently?
>
> So far the kernel has worked that __user should tag any pointer that
> is from userspace and then you can't do anything with it until you
> transform it into a kernel something

Ultimately it would be good to disallow casting __object pointers execpt
to compatible __object pointer types, and to make get_user etc. demand
__object.

__vaddr pointers / addresses would be freely castable, but not to
__object and so would not be dereferenceable even indirectly.

Or that's the general idea. Figuring out a sane set of rules that we
could actually check / enforce would require a bit of work.

(Whether the __vaddr base type is a pointer or an integer type is
probably moot, due to the restrictions we would place on the use of
these anyway.)

> > to tell static analysers the real type of pointers smuggled through
> > UAPI disguised as other types (*cough* KVM, etc.)
>
> Yes, that would help alot, we often have to pass pointers through a
> u64 in the uAPI, and there is no static checker support to make sure
> they are run through the u64_to_user_ptr() helper.

Agreed.

Cheers
---Dave

2019-05-23 14:46:42

by Catalin Marinas

[permalink] [raw]
Subject: Re: [PATCH v15 00/17] arm64: untag user pointers passed to the kernel

On Wed, May 22, 2019 at 01:47:36PM -0700, Kees Cook wrote:
> On Wed, May 22, 2019 at 05:35:27PM +0100, Catalin Marinas wrote:
> > The two hard requirements I have for supporting any new hardware feature
> > in Linux are (1) a single kernel image binary continues to run on old
> > hardware while making use of the new feature if available and (2) old
> > user space continues to run on new hardware while new user space can
> > take advantage of the new feature.
>
> Agreed! And I think the series meets these requirements, yes?

Yes. I mentioned this just to make sure people don't expect different
kernel builds for different hardware features.

There is also the obvious requirement which I didn't mention: new user
space continues to run on new/subsequent kernel versions. That's one of
the points of contention for this series (ignoring MTE) with the
maintainers having to guarantee this without much effort. IOW, do the
500K+ new lines in a subsequent kernel version break any user space out
there? I'm only talking about the relaxed TBI ABI. Are the usual LTP,
syskaller sufficient? Better static analysis would definitely help.

> > For MTE, we just can't enable it by default since there are applications
> > who use the top byte of a pointer and expect it to be ignored rather
> > than failing with a mismatched tag. Just think of a hwasan compiled
> > binary where TBI is expected to work and you try to run it with MTE
> > turned on.
>
> Ah! Okay, here's the use-case I wasn't thinking of: the concern is TBI
> conflicting with MTE. And anything that starts using TBI suddenly can't
> run in the future because it's being interpreted as MTE bits? (Is that
> the ABI concern?

That's another aspect to figure out when we add the MTE support. I don't
think we'd be able to do this without an explicit opt-in by the user.

Or, if we ever want MTE to be turned on by default (i.e. tag checking),
even if everything is tagged with 0, we have to disallow TBI for user
and this includes hwasan. There were a small number of programs using
the TBI (I think some JavaScript compilers tried this). But now we are
bringing in the hwasan support and this can be a large user base. Shall
we add an ELF note for such binaries that use TBI/hwasan?

This series is still required for MTE but we may decide not to relax the
ABI blindly, therefore the opt-in (prctl) or personality idea.

> I feel like we got into the weeds about ioctl()s and one-off bugs...)

This needs solving as well. Most driver developers won't know why
untagged_addr() is needed unless we have more rigorous types or type
annotations and a tool to check them (we should probably revive the old
sparse thread).

> So there needs to be some way to let the kernel know which of three
> things it should be doing:
> 1- leaving userspace addresses as-is (present)
> 2- wiping the top bits before using (this series)

(I'd say tolerating rather than wiping since get_user still uses the tag
in the current series)

The current series does not allow any choice between 1 and 2, the
default ABI basically becomes option 2.

> 3- wiping the top bits for most things, but retaining them for MTE as
> needed (the future)

2 and 3 are not entirely compatible as a tagged pointer may be checked
against the memory colour by the hardware. So you can't have hwasan
binary with MTE enabled.

> I expect MTE to be the "default" in the future. Once a system's libc has
> grown support for it, everything will be trying to use MTE. TBI will be
> the special case (but TBI is effectively a prerequisite).

The kernel handling of tagged pointers is indeed a prerequisite. The ABI
distinction between the above 2 and 3 needs to be solved.

> AFAICT, the only difference I see between 2 and 3 will be the tag handling
> in usercopy (all other places will continue to ignore the top bits). Is
> that accurate?

Yes, mostly (for the kernel). If MTE is enabled by default for a hwasan
binary, it will SEGFAULT (either in user space or in kernel uaccess).
How does the kernel choose between 2 and 3?

> Is "1" a per-process state we want to keep? (I assume not, but rather it
> is available via no TBI/MTE CONFIG or a boot-time option, if at all?)

Possibly, though not necessarily per process. For testing or if
something goes wrong during boot, a command line option with a static
label would do. The AT_FLAGS bit needs to be checked by user space. My
preference would be per-process.

> To choose between "2" and "3", it seems we need a per-process flag to
> opt into TBI (and out of MTE).

Or leave option 2 the default and get it to opt in to MTE.

> For userspace, how would a future binary choose TBI over MTE? If it's
> a library issue, we can't use an ELF bit, since the choice may be
> "late" after ELF load (this implies the need for a prctl().) If it's
> binary-only ("built with HWKASan") then an ELF bit seems sufficient.
> And without the marking, I'd expect the kernel to enforce MTE when
> there are high bits.

The current plan is that a future binary issues a prctl(), after
checking the HWCAP_MTE bit (as I replied to Elliot, the MTE instructions
are not in the current NOP space). I'd expect this to be done by the
libc or dynamic loader under the assumption that the binaries it loads
do _not_ use the top pointer byte for anything else. With hwasan
compiled objects this gets more confusing (any ELF note to identify
them?).

(there is also the risk of existing applications using TBI already but
I'm not aware of any still using this feature other than hwasan)

--
Catalin

2019-05-23 15:10:01

by Catalin Marinas

[permalink] [raw]
Subject: Re: [PATCH v15 00/17] arm64: untag user pointers passed to the kernel

On Wed, May 22, 2019 at 12:21:27PM -0700, Kees Cook wrote:
> If a process wants to not tag, that's also up to the allocator where
> it can decide not to ask the kernel, and just not tag. Nothing breaks in
> userspace if a process is NOT tagging and untagged_addr() exists or is
> missing. This, I think, is the core way this doesn't trip over the
> golden rule: an old system image will run fine (because it's not
> tagging). A *new* system may encounter bugs with tagging because it's a
> new feature: this is The Way Of Things. But we don't break old userspace
> because old userspace isn't using tags.

With this series and hwasan binaries, at some point in the future they
will be considered "old userspace" and they do use pointer tags which
expect to be ignored by both the hardware and the kernel. MTE breaks
this assumption.

--
Catalin

2019-05-23 15:23:09

by Catalin Marinas

[permalink] [raw]
Subject: Re: [PATCH v15 00/17] arm64: untag user pointers passed to the kernel

On Wed, May 22, 2019 at 09:58:22AM -0700, enh wrote:
> i was questioning the argument about the ioctl issues, and saying that
> from my perspective, untagging bugs are not really any different than
> any other kind of kernel bug.

Once this series gets in, they are indeed just kernel bugs. What I want
is an easier way to identify them, ideally before they trigger in the
field.

> i still don't see how this isn't just a regular testing/CI issue, the
> same as any other kind of kernel bug. it's already the case that i can
> get a bad kernel...

The testing would have a smaller code coverage in terms of drivers,
filesystems than something like a static checker (though one does not
exclude the other).

--
Catalin

2019-05-23 15:46:02

by enh

[permalink] [raw]
Subject: Re: [PATCH v15 00/17] arm64: untag user pointers passed to the kernel

On Thu, May 23, 2019 at 7:45 AM Catalin Marinas <[email protected]> wrote:
>
> On Wed, May 22, 2019 at 01:47:36PM -0700, Kees Cook wrote:
> > On Wed, May 22, 2019 at 05:35:27PM +0100, Catalin Marinas wrote:
> > > The two hard requirements I have for supporting any new hardware feature
> > > in Linux are (1) a single kernel image binary continues to run on old
> > > hardware while making use of the new feature if available and (2) old
> > > user space continues to run on new hardware while new user space can
> > > take advantage of the new feature.
> >
> > Agreed! And I think the series meets these requirements, yes?
>
> Yes. I mentioned this just to make sure people don't expect different
> kernel builds for different hardware features.
>
> There is also the obvious requirement which I didn't mention: new user
> space continues to run on new/subsequent kernel versions. That's one of
> the points of contention for this series (ignoring MTE) with the
> maintainers having to guarantee this without much effort. IOW, do the
> 500K+ new lines in a subsequent kernel version break any user space out
> there? I'm only talking about the relaxed TBI ABI. Are the usual LTP,
> syskaller sufficient? Better static analysis would definitely help.
>
> > > For MTE, we just can't enable it by default since there are applications
> > > who use the top byte of a pointer and expect it to be ignored rather
> > > than failing with a mismatched tag. Just think of a hwasan compiled
> > > binary where TBI is expected to work and you try to run it with MTE
> > > turned on.
> >
> > Ah! Okay, here's the use-case I wasn't thinking of: the concern is TBI
> > conflicting with MTE. And anything that starts using TBI suddenly can't
> > run in the future because it's being interpreted as MTE bits? (Is that
> > the ABI concern?
>
> That's another aspect to figure out when we add the MTE support. I don't
> think we'd be able to do this without an explicit opt-in by the user.
>
> Or, if we ever want MTE to be turned on by default (i.e. tag checking),
> even if everything is tagged with 0, we have to disallow TBI for user
> and this includes hwasan. There were a small number of programs using
> the TBI (I think some JavaScript compilers tried this). But now we are
> bringing in the hwasan support and this can be a large user base. Shall
> we add an ELF note for such binaries that use TBI/hwasan?
>
> This series is still required for MTE but we may decide not to relax the
> ABI blindly, therefore the opt-in (prctl) or personality idea.
>
> > I feel like we got into the weeds about ioctl()s and one-off bugs...)
>
> This needs solving as well. Most driver developers won't know why
> untagged_addr() is needed unless we have more rigorous types or type
> annotations and a tool to check them (we should probably revive the old
> sparse thread).
>
> > So there needs to be some way to let the kernel know which of three
> > things it should be doing:
> > 1- leaving userspace addresses as-is (present)
> > 2- wiping the top bits before using (this series)
>
> (I'd say tolerating rather than wiping since get_user still uses the tag
> in the current series)
>
> The current series does not allow any choice between 1 and 2, the
> default ABI basically becomes option 2.
>
> > 3- wiping the top bits for most things, but retaining them for MTE as
> > needed (the future)
>
> 2 and 3 are not entirely compatible as a tagged pointer may be checked
> against the memory colour by the hardware. So you can't have hwasan
> binary with MTE enabled.
>
> > I expect MTE to be the "default" in the future. Once a system's libc has
> > grown support for it, everything will be trying to use MTE. TBI will be
> > the special case (but TBI is effectively a prerequisite).
>
> The kernel handling of tagged pointers is indeed a prerequisite. The ABI
> distinction between the above 2 and 3 needs to be solved.
>
> > AFAICT, the only difference I see between 2 and 3 will be the tag handling
> > in usercopy (all other places will continue to ignore the top bits). Is
> > that accurate?
>
> Yes, mostly (for the kernel). If MTE is enabled by default for a hwasan
> binary, it will SEGFAULT (either in user space or in kernel uaccess).
> How does the kernel choose between 2 and 3?
>
> > Is "1" a per-process state we want to keep? (I assume not, but rather it
> > is available via no TBI/MTE CONFIG or a boot-time option, if at all?)
>
> Possibly, though not necessarily per process. For testing or if
> something goes wrong during boot, a command line option with a static
> label would do. The AT_FLAGS bit needs to be checked by user space. My
> preference would be per-process.
>
> > To choose between "2" and "3", it seems we need a per-process flag to
> > opt into TBI (and out of MTE).
>
> Or leave option 2 the default and get it to opt in to MTE.
>
> > For userspace, how would a future binary choose TBI over MTE? If it's
> > a library issue, we can't use an ELF bit, since the choice may be
> > "late" after ELF load (this implies the need for a prctl().) If it's
> > binary-only ("built with HWKASan") then an ELF bit seems sufficient.
> > And without the marking, I'd expect the kernel to enforce MTE when
> > there are high bits.
>
> The current plan is that a future binary issues a prctl(), after
> checking the HWCAP_MTE bit (as I replied to Elliot, the MTE instructions
> are not in the current NOP space). I'd expect this to be done by the
> libc or dynamic loader under the assumption that the binaries it loads
> do _not_ use the top pointer byte for anything else.

yeah, it sounds like to support hwasan and MTE, the dynamic linker
will need to not use either itself.

> With hwasan
> compiled objects this gets more confusing (any ELF note to identify
> them?).

no, at the moment code that wants to know checks for the presence of
__hwasan_init. (and bionic doesn't actually look at any ELF notes
right now.) but we can always add something if we need to.

> (there is also the risk of existing applications using TBI already but
> I'm not aware of any still using this feature other than hwasan)
>
> --
> Catalin

2019-05-23 16:39:56

by Kees Cook

[permalink] [raw]
Subject: Re: [PATCH v15 00/17] arm64: untag user pointers passed to the kernel

On Thu, May 23, 2019 at 03:44:49PM +0100, Catalin Marinas wrote:
> There is also the obvious requirement which I didn't mention: new user
> space continues to run on new/subsequent kernel versions. That's one of
> the points of contention for this series (ignoring MTE) with the
> maintainers having to guarantee this without much effort. IOW, do the
> 500K+ new lines in a subsequent kernel version break any user space out
> there? I'm only talking about the relaxed TBI ABI. Are the usual LTP,
> syskaller sufficient? Better static analysis would definitely help.

We can't have perfect coverage of people actively (or accidentally)
working to trick static analyzers (and the compiler) into "forgetting"
about a __user annotation. We can certainly improve analysis (I see
the other sub-thread on that), but I'd like that work not to block
this series.

What on this front would you be comfortable with? Given it's a new
feature isn't it sufficient to have a CONFIG (and/or boot option)?

> Or, if we ever want MTE to be turned on by default (i.e. tag checking),
> even if everything is tagged with 0, we have to disallow TBI for user
> and this includes hwasan. There were a small number of programs using
> the TBI (I think some JavaScript compilers tried this). But now we are
> bringing in the hwasan support and this can be a large user base. Shall
> we add an ELF note for such binaries that use TBI/hwasan?

Just to be clear, you say "disallow TBI for user" -- you mean a
particular process, yes? i.e. there is no architectural limitation that
says once we're using MTE nothing can switch to TBI. i.e. a process is
either doing MTE or TBI (or nothing, but that's the same as TBI).

> This needs solving as well. Most driver developers won't know why
> untagged_addr() is needed unless we have more rigorous types or type
> annotations and a tool to check them (we should probably revive the old
> sparse thread).

This seems like a parallel concern: we can do that separately from this
series. Without landing it, is it much harder for people to test it,
look for bugs, help with types/annotations, etc.

> > So there needs to be some way to let the kernel know which of three
> > things it should be doing:
> > 1- leaving userspace addresses as-is (present)
> > 2- wiping the top bits before using (this series)
>
> (I'd say tolerating rather than wiping since get_user still uses the tag
> in the current series)
>
> The current series does not allow any choice between 1 and 2, the
> default ABI basically becomes option 2.

What about testing tools that intentionally insert high bits for syscalls
and are _expecting_ them to fail? It seems the TBI series will break them?
In that case, do we need to opt into TBI as well?

> > 3- wiping the top bits for most things, but retaining them for MTE as
> > needed (the future)
>
> 2 and 3 are not entirely compatible as a tagged pointer may be checked
> against the memory colour by the hardware. So you can't have hwasan
> binary with MTE enabled.

Right: a process must be either MTE or TBI, not both.

> > I expect MTE to be the "default" in the future. Once a system's libc has
> > grown support for it, everything will be trying to use MTE. TBI will be
> > the special case (but TBI is effectively a prerequisite).
>
> The kernel handling of tagged pointers is indeed a prerequisite. The ABI
> distinction between the above 2 and 3 needs to be solved.

Does that need solving now or when the MTE series appears? As there is
no reason to distinguish between "normal" and "TBI", that doesn't seem
to need solving at this point?

> > AFAICT, the only difference I see between 2 and 3 will be the tag handling
> > in usercopy (all other places will continue to ignore the top bits). Is
> > that accurate?
>
> Yes, mostly (for the kernel). If MTE is enabled by default for a hwasan
> binary, it will SEGFAULT (either in user space or in kernel uaccess).
> How does the kernel choose between 2 and 3?

Right -- that was my question as well.

> > Is "1" a per-process state we want to keep? (I assume not, but rather it
> > is available via no TBI/MTE CONFIG or a boot-time option, if at all?)
>
> Possibly, though not necessarily per process. For testing or if
> something goes wrong during boot, a command line option with a static
> label would do. The AT_FLAGS bit needs to be checked by user space. My
> preference would be per-process.

I would agree.

> > To choose between "2" and "3", it seems we need a per-process flag to
> > opt into TBI (and out of MTE).
>
> Or leave option 2 the default and get it to opt in to MTE.

Given that MTE has to "start" at some point in the binary lifetime, I'm
fine with opting into MTE. I do expect, though, this will feel redundant
in a couple years as everything will immediately opt-in. But, okay, this
is therefore an issue for the MTE series.

> The current plan is that a future binary issues a prctl(), after
> checking the HWCAP_MTE bit (as I replied to Elliot, the MTE instructions
> are not in the current NOP space). I'd expect this to be done by the
> libc or dynamic loader under the assumption that the binaries it loads
> do _not_ use the top pointer byte for anything else. With hwasan
> compiled objects this gets more confusing (any ELF note to identify
> them?).

Okay, sounds fine.

> (there is also the risk of existing applications using TBI already but
> I'm not aware of any still using this feature other than hwasan)

Correct.


Alright, the tl;dr appears to be:
- you want more assurances that we can find __user stripping in the
kernel more easily. (But this seems like a parallel problem.)
- we might need to opt in to TBI with a prctl()
- all other concerns are for the future MTE series (though it sounds
like HWCAP_MTE and a prctl() solve those issues too).

Is this accurate? What do you see as the blockers for this series at
this point?

--
Kees Cook

2019-05-23 16:58:57

by Catalin Marinas

[permalink] [raw]
Subject: Re: [PATCH v15 00/17] arm64: untag user pointers passed to the kernel

On Thu, May 23, 2019 at 11:42:57AM +0100, Dave P Martin wrote:
> On Wed, May 22, 2019 at 09:20:52PM -0300, Jason Gunthorpe wrote:
> > On Wed, May 22, 2019 at 02:49:28PM +0100, Dave Martin wrote:
> > > If multiple people will care about this, perhaps we should try to
> > > annotate types more explicitly in SYSCALL_DEFINEx() and ABI data
> > > structures.
> > >
> > > For example, we could have a couple of mutually exclusive modifiers
> > >
> > > T __object *
> > > T __vaddr * (or U __vaddr)
> > >
> > > In the first case the pointer points to an object (in the C sense)
> > > that the call may dereference but not use for any other purpose.
> >
> > How would you use these two differently?
> >
> > So far the kernel has worked that __user should tag any pointer that
> > is from userspace and then you can't do anything with it until you
> > transform it into a kernel something
>
> Ultimately it would be good to disallow casting __object pointers execpt
> to compatible __object pointer types, and to make get_user etc. demand
> __object.
>
> __vaddr pointers / addresses would be freely castable, but not to
> __object and so would not be dereferenceable even indirectly.

I think it gets too complicated and there are ambiguous cases that we
may not be able to distinguish. For example copy_from_user() may be used
to copy a user data structure into the kernel, hence __object would
work, while the same function may be used to copy opaque data to a file,
so __vaddr may be a better option (unless I misunderstood your
proposal).

We currently have T __user * and I think it's a good starting point. The
prior attempt [1] was shut down because it was just hiding the cast
using __force. We'd need to work through those cases again and rather
start changing the function prototypes to avoid unnecessary casting in
the callers (e.g. get_user_pages(void __user *) or come up with a new
type) while changing the explicit casting to a macro where it needs to
be obvious that we are converting a user pointer, potentially typed
(tagged), to an untyped address range. We may need a user_ptr_to_ulong()
macro or similar (it seems that we have a u64_to_user_ptr, wasn't aware
of it).

It may actually not be far from what you suggested but I'd keep the
current T __user * to denote possible dereference.

[1] https://lore.kernel.org/lkml/5d54526e5ff2e5ad63d0dfdd9ab17cf359afa4f2.1535629099.git.andreyknvl@google.com/

--
Catalin

2019-05-23 17:02:23

by Catalin Marinas

[permalink] [raw]
Subject: Re: [PATCH v15 00/17] arm64: untag user pointers passed to the kernel

On Thu, May 23, 2019 at 08:44:12AM -0700, enh wrote:
> On Thu, May 23, 2019 at 7:45 AM Catalin Marinas <[email protected]> wrote:
> > On Wed, May 22, 2019 at 01:47:36PM -0700, Kees Cook wrote:
> > > For userspace, how would a future binary choose TBI over MTE? If it's
> > > a library issue, we can't use an ELF bit, since the choice may be
> > > "late" after ELF load (this implies the need for a prctl().) If it's
> > > binary-only ("built with HWKASan") then an ELF bit seems sufficient.
> > > And without the marking, I'd expect the kernel to enforce MTE when
> > > there are high bits.
> >
> > The current plan is that a future binary issues a prctl(), after
> > checking the HWCAP_MTE bit (as I replied to Elliot, the MTE instructions
> > are not in the current NOP space). I'd expect this to be done by the
> > libc or dynamic loader under the assumption that the binaries it loads
> > do _not_ use the top pointer byte for anything else.
>
> yeah, it sounds like to support hwasan and MTE, the dynamic linker
> will need to not use either itself.
>
> > With hwasan compiled objects this gets more confusing (any ELF note
> > to identify them?).
>
> no, at the moment code that wants to know checks for the presence of
> __hwasan_init. (and bionic doesn't actually look at any ELF notes
> right now.) but we can always add something if we need to.

It's a userspace decision to make. In the kernel, we are proposing that
bionic calls a prctl() to enable MTE explicitly. It could first check
for the presence of __hwasan_init.

--
Catalin

2019-05-23 17:46:40

by Catalin Marinas

[permalink] [raw]
Subject: Re: [PATCH v15 00/17] arm64: untag user pointers passed to the kernel

On Thu, May 23, 2019 at 09:38:19AM -0700, Kees Cook wrote:
> On Thu, May 23, 2019 at 03:44:49PM +0100, Catalin Marinas wrote:
> > There is also the obvious requirement which I didn't mention: new user
> > space continues to run on new/subsequent kernel versions. That's one of
> > the points of contention for this series (ignoring MTE) with the
> > maintainers having to guarantee this without much effort. IOW, do the
> > 500K+ new lines in a subsequent kernel version break any user space out
> > there? I'm only talking about the relaxed TBI ABI. Are the usual LTP,
> > syskaller sufficient? Better static analysis would definitely help.
>
> We can't have perfect coverage of people actively (or accidentally)
> working to trick static analyzers (and the compiler) into "forgetting"
> about a __user annotation. We can certainly improve analysis (I see
> the other sub-thread on that), but I'd like that work not to block
> this series.

I don't want to block this series either as it's a prerequisite for MTE
but at the moment I'm not confident we tracked down all the places where
things can break and what the future effort will be to analyse new
kernel changes.

We've made lots of progress since March last year (I think when the
first version was posted) by both identifying new places in the kernel
that need untagging and limiting the ranges that user space can tag
(e.g. an mmap() on a device should not allow tagged pointers). Can we
say we are done or more investigation is needed?

> What on this front would you be comfortable with? Given it's a new
> feature isn't it sufficient to have a CONFIG (and/or boot option)?

I'd rather avoid re-building kernels. A boot option would do, unless we
see value in a per-process (inherited) personality or prctl. The
per-process option has the slight advantage that I can boot my machine
normally while being able to run LTP with a relaxed ABI (and a new libc
which tags pointers). I admit I don't have a very strong argument on a
per-process opt-in.

Another option would be a sysctl control together with a cmdline option.

> > Or, if we ever want MTE to be turned on by default (i.e. tag checking),
> > even if everything is tagged with 0, we have to disallow TBI for user
> > and this includes hwasan. There were a small number of programs using
> > the TBI (I think some JavaScript compilers tried this). But now we are
> > bringing in the hwasan support and this can be a large user base. Shall
> > we add an ELF note for such binaries that use TBI/hwasan?
>
> Just to be clear, you say "disallow TBI for user" -- you mean a
> particular process, yes? i.e. there is no architectural limitation that
> says once we're using MTE nothing can switch to TBI. i.e. a process is
> either doing MTE or TBI (or nothing, but that's the same as TBI).

I may have answered this below. The architecture does not allow TBI
(i.e. faults) if MTE is enabled for a process and address range.

> > > So there needs to be some way to let the kernel know which of three
> > > things it should be doing:
> > > 1- leaving userspace addresses as-is (present)
> > > 2- wiping the top bits before using (this series)
> >
> > (I'd say tolerating rather than wiping since get_user still uses the tag
> > in the current series)
> >
> > The current series does not allow any choice between 1 and 2, the
> > default ABI basically becomes option 2.
>
> What about testing tools that intentionally insert high bits for syscalls
> and are _expecting_ them to fail? It seems the TBI series will break them?
> In that case, do we need to opt into TBI as well?

If there are such tools, then we may need a per-process control. It's
basically an ABI change.

> > > 3- wiping the top bits for most things, but retaining them for MTE as
> > > needed (the future)
> >
> > 2 and 3 are not entirely compatible as a tagged pointer may be checked
> > against the memory colour by the hardware. So you can't have hwasan
> > binary with MTE enabled.
>
> Right: a process must be either MTE or TBI, not both.

Indeed.

> > > I expect MTE to be the "default" in the future. Once a system's libc has
> > > grown support for it, everything will be trying to use MTE. TBI will be
> > > the special case (but TBI is effectively a prerequisite).
> >
> > The kernel handling of tagged pointers is indeed a prerequisite. The ABI
> > distinction between the above 2 and 3 needs to be solved.
>
> Does that need solving now or when the MTE series appears? As there is
> no reason to distinguish between "normal" and "TBI", that doesn't seem
> to need solving at this point?

We don't need to solve 2 vs 3 as long as option 3 is an explicit opt-in
(prctl) by the user. Otherwise the kernel cannot guess whether the user
is using TBI or not (and if it does and still opts in to MTE, we can say
it's a user problem ;)).

> > > To choose between "2" and "3", it seems we need a per-process flag to
> > > opt into TBI (and out of MTE).
> >
> > Or leave option 2 the default and get it to opt in to MTE.
>
> Given that MTE has to "start" at some point in the binary lifetime, I'm
> fine with opting into MTE. I do expect, though, this will feel redundant
> in a couple years as everything will immediately opt-in. But, okay, this
> is therefore an issue for the MTE series.

Unless Google phases out hwasan soon ;), they may live in parallel for a
while, so a choice of TBI vs MTE.

> Alright, the tl;dr appears to be:
> - you want more assurances that we can find __user stripping in the
> kernel more easily. (But this seems like a parallel problem.)

Yes, and that we found all (most) cases now. The reason I don't see it
as a parallel problem is that, as maintainer, I promise an ABI to user
and I'd rather stick to it. I don't want, for example, ncurses to stop
working because of some ioctl() rejecting tagged pointers.

If it's just the occasional one-off bug I'm fine to deal with it. But
no-one convinced me yet that this is the case.

As for the generic driver code (filesystems or other subsystems),
without some clear direction for developers, together with static
checking/sparse, on how user pointers are cast to longs (one example),
it would become my responsibility to identify and fix them up with any
kernel release. This series is not providing such guidance, just adding
untagged_addr() in some places that we think matter.

> - we might need to opt in to TBI with a prctl()

Yes, although still up for discussion.

> - all other concerns are for the future MTE series (though it sounds
> like HWCAP_MTE and a prctl() solve those issues too).

Indeed.

> Is this accurate? What do you see as the blockers for this series at
> this point?

Well, see my comment on your first bullet point above ;).

And thanks for summarising it.

--
Catalin

2019-05-23 17:54:04

by Khalid Aziz

[permalink] [raw]
Subject: Re: [PATCH v15 00/17] arm64: untag user pointers passed to the kernel

On 5/21/19 6:04 PM, Kees Cook wrote:
> As an aside: I think Sparc ADI support in Linux actually side-stepped
> this[1] (i.e. chose "solution 1"): "All addresses passed to kernel must
> be non-ADI tagged addresses." (And sadly, "Kernel does not enable ADI
> for kernel code.") I think this was a mistake we should not repeat for
> arm64 (we do seem to be at least in agreement about this, I think).
>
> [1] https://lore.kernel.org/patchwork/patch/654481/


That is a very early version of the sparc ADI patch. Support for tagged
addresses in syscalls was added in later versions and is in the patch
that is in the kernel.

That part "Kernel does not enable ADI for kernel code." is correct. It
is a possible enhancement for future.

--
Khalid

2019-05-23 20:13:08

by Catalin Marinas

[permalink] [raw]
Subject: Re: [PATCH v15 00/17] arm64: untag user pointers passed to the kernel

Hi Khalid,

On Thu, May 23, 2019 at 11:51:40AM -0600, Khalid Aziz wrote:
> On 5/21/19 6:04 PM, Kees Cook wrote:
> > As an aside: I think Sparc ADI support in Linux actually side-stepped
> > this[1] (i.e. chose "solution 1"): "All addresses passed to kernel must
> > be non-ADI tagged addresses." (And sadly, "Kernel does not enable ADI
> > for kernel code.") I think this was a mistake we should not repeat for
> > arm64 (we do seem to be at least in agreement about this, I think).
> >
> > [1] https://lore.kernel.org/patchwork/patch/654481/
>
> That is a very early version of the sparc ADI patch. Support for tagged
> addresses in syscalls was added in later versions and is in the patch
> that is in the kernel.

I tried to figure out but I'm not familiar with the sparc port. How did
you solve the tagged address going into various syscall implementations
in the kernel (e.g. sys_write)? Is the tag removed on kernel entry or it
ends up deeper in the core code?

Thanks.

--
Catalin

2019-05-23 21:33:06

by Kees Cook

[permalink] [raw]
Subject: Re: [PATCH v15 00/17] arm64: untag user pointers passed to the kernel

On Thu, May 23, 2019 at 06:43:46PM +0100, Catalin Marinas wrote:
> On Thu, May 23, 2019 at 09:38:19AM -0700, Kees Cook wrote:
> > What on this front would you be comfortable with? Given it's a new
> > feature isn't it sufficient to have a CONFIG (and/or boot option)?
>
> I'd rather avoid re-building kernels. A boot option would do, unless we
> see value in a per-process (inherited) personality or prctl. The

I think I've convinced myself that the normal<->TBI ABI control should
be a boot parameter. More below...

> > What about testing tools that intentionally insert high bits for syscalls
> > and are _expecting_ them to fail? It seems the TBI series will break them?
> > In that case, do we need to opt into TBI as well?
>
> If there are such tools, then we may need a per-process control. It's
> basically an ABI change.

syzkaller already attempts to randomly inject non-canonical and
0xFFFF....FFFF addresses for user pointers in syscalls in an effort to
find bugs like CVE-2017-5123 where waitid() via unchecked put_user() was
able to write directly to kernel memory[1].

It seems that using TBI by default and not allowing a switch back to
"normal" ABI without a reboot actually means that userspace cannot inject
kernel pointers into syscalls any more, since they'll get universally
stripped now. Is my understanding correct, here? i.e. exploiting
CVE-2017-5123 would be impossible under TBI?

If so, then I think we should commit to the TBI ABI and have a boot
flag to disable it, but NOT have a process flag, as that would allow
attackers to bypass the masking. The only flag should be "TBI or MTE".

If so, can I get top byte masking for other architectures too? Like,
just to strip high bits off userspace addresses? ;)

(Oh, in looking I see this is implemented with sign-extension... why
not just a mask? So it'll either be valid userspace address or forced
into the non-canonical range?)

[1] https://salls.github.io/Linux-Kernel-CVE-2017-5123/

> > Alright, the tl;dr appears to be:
> > - you want more assurances that we can find __user stripping in the
> > kernel more easily. (But this seems like a parallel problem.)
>
> Yes, and that we found all (most) cases now. The reason I don't see it
> as a parallel problem is that, as maintainer, I promise an ABI to user
> and I'd rather stick to it. I don't want, for example, ncurses to stop
> working because of some ioctl() rejecting tagged pointers.

But this is what I don't understand: it would need to be ncurses _using
TBI_, that would stop working (having started to work before, but then
regress due to a newly added one-off bug). Regular ncurses will be fine
because it's not using TBI. So The Golden Rule isn't violated, and by
definition, it's a specific regression caused by some bug (since TBI
would have had to have worked _before_ in the situation to be considered
a regression now). Which describes the normal path for kernel
development... add feature, find corner cases where it doesn't work,
fix them, encounter new regressions, fix those, repeat forever.

> If it's just the occasional one-off bug I'm fine to deal with it. But
> no-one convinced me yet that this is the case.

You believe there still to be some systemic cases that haven't been
found yet? And even if so -- isn't it better to work on that
incrementally?

> As for the generic driver code (filesystems or other subsystems),
> without some clear direction for developers, together with static
> checking/sparse, on how user pointers are cast to longs (one example),
> it would become my responsibility to identify and fix them up with any
> kernel release. This series is not providing such guidance, just adding
> untagged_addr() in some places that we think matter.

What about adding a nice bit of .rst documentation that describes the
situation and shows how to use untagged_addr(). This is the kind of
kernel-wide change that "everyone" needs to know about, and shouldn't
be the arch maintainer's sole responsibility to fix.

> > - we might need to opt in to TBI with a prctl()
>
> Yes, although still up for discussion.

I think I've talked myself out of it. I say boot param only! :)


So what do you say to these next steps:

- change untagged_addr() to use a static branch that is controlled with
a boot parameter.
- add, say, Documentation/core-api/user-addresses.rst to describe
proper care and handling of user space pointers with untagged_addr(),
with examples based on all the cases seen so far in this series.
- continue work to improve static analysis.

Thanks for wading through this with me! :)

--
Kees Cook

2019-05-23 21:44:57

by Khalid Aziz

[permalink] [raw]
Subject: Re: [PATCH v15 00/17] arm64: untag user pointers passed to the kernel

On 5/23/19 2:11 PM, Catalin Marinas wrote:
> Hi Khalid,
>
> On Thu, May 23, 2019 at 11:51:40AM -0600, Khalid Aziz wrote:
>> On 5/21/19 6:04 PM, Kees Cook wrote:
>>> As an aside: I think Sparc ADI support in Linux actually side-stepped
>>> this[1] (i.e. chose "solution 1"): "All addresses passed to kernel must
>>> be non-ADI tagged addresses." (And sadly, "Kernel does not enable ADI
>>> for kernel code.") I think this was a mistake we should not repeat for
>>> arm64 (we do seem to be at least in agreement about this, I think).
>>>
>>> [1] https://lore.kernel.org/patchwork/patch/654481/
>>
>> That is a very early version of the sparc ADI patch. Support for tagged
>> addresses in syscalls was added in later versions and is in the patch
>> that is in the kernel.
>
> I tried to figure out but I'm not familiar with the sparc port. How did
> you solve the tagged address going into various syscall implementations
> in the kernel (e.g. sys_write)? Is the tag removed on kernel entry or it
> ends up deeper in the core code?

Tag is not removed from the user addresses. Kernel passes tagged
addresses to copy_from_user and copy_to_user. MMU checks the tag
embedded in the address when kernel accesses userspace addresses. This
maintains the ADI integrity even when userspace attempts to access any
userspace addresses through system calls.

On sparc, access_ok() is defined as:

#define access_ok(addr, size) __access_ok((unsigned long)(addr), size)
#define __access_ok(addr, size) (__user_ok((addr) & get_fs().seg, (size)))
#define __user_ok(addr, size) ({ (void)(size); (addr) < STACK_TOP; })

STACK_TOP for M7 processor (which is the first sparc processor to
support ADI) is 0xfff8000000000000UL. Tagged addresses pass the
access_ok() check fine. Any tag mismatches that happen during kernel
access to userspace addresses are handled by do_mcd_err().

--
Khalid

2019-05-23 21:51:25

by Khalid Aziz

[permalink] [raw]
Subject: Re: [PATCH v15 00/17] arm64: untag user pointers passed to the kernel

On 5/23/19 2:11 PM, Catalin Marinas wrote:
> Hi Khalid,
>
> On Thu, May 23, 2019 at 11:51:40AM -0600, Khalid Aziz wrote:
>> On 5/21/19 6:04 PM, Kees Cook wrote:
>>> As an aside: I think Sparc ADI support in Linux actually side-stepped
>>> this[1] (i.e. chose "solution 1"): "All addresses passed to kernel must
>>> be non-ADI tagged addresses." (And sadly, "Kernel does not enable ADI
>>> for kernel code.") I think this was a mistake we should not repeat for
>>> arm64 (we do seem to be at least in agreement about this, I think).
>>>
>>> [1] https://lore.kernel.org/patchwork/patch/654481/
>>
>> That is a very early version of the sparc ADI patch. Support for tagged
>> addresses in syscalls was added in later versions and is in the patch
>> that is in the kernel.
>
> I tried to figure out but I'm not familiar with the sparc port. How did
> you solve the tagged address going into various syscall implementations
> in the kernel (e.g. sys_write)? Is the tag removed on kernel entry or it
> ends up deeper in the core code?
>

Another spot I should point out in ADI patch - Tags are not stored in
VMAs and IOMMU does not support ADI tags on M7. ADI tags are stripped
before userspace addresses are passed to IOMMU in the following snippet
from the patch:

diff --git a/arch/sparc/mm/gup.c b/arch/sparc/mm/gup.c
index 5335ba3c850e..357b6047653a 100644
--- a/arch/sparc/mm/gup.c
+++ b/arch/sparc/mm/gup.c
@@ -201,6 +202,24 @@ int __get_user_pages_fast(unsigned long start, int
nr_pages
, int write,
pgd_t *pgdp;
int nr = 0;

+#ifdef CONFIG_SPARC64
+ if (adi_capable()) {
+ long addr = start;
+
+ /* If userspace has passed a versioned address, kernel
+ * will not find it in the VMAs since it does not store
+ * the version tags in the list of VMAs. Storing version
+ * tags in list of VMAs is impractical since they can be
+ * changed any time from userspace without dropping into
+ * kernel. Any address search in VMAs will be done with
+ * non-versioned addresses. Ensure the ADI version bits
+ * are dropped here by sign extending the last bit before
+ * ADI bits. IOMMU does not implement version tags.
+ */
+ addr = (addr << (long)adi_nbits()) >> (long)adi_nbits();
+ start = addr;
+ }
+#endif
start &= PAGE_MASK;
addr = start;
len = (unsigned long) nr_pages << PAGE_SHIFT;


--
Khalid


2019-05-24 04:25:09

by Evgenii Stepanov

[permalink] [raw]
Subject: Re: [PATCH v15 05/17] arms64: untag user pointers passed to memory syscalls

On Thu, May 23, 2019 at 2:04 AM Catalin Marinas <[email protected]> wrote:
>
> On Wed, May 22, 2019 at 02:16:57PM -0700, Evgenii Stepanov wrote:
> > On Wed, May 22, 2019 at 4:49 AM Catalin Marinas <[email protected]> wrote:
> > > On Mon, May 06, 2019 at 06:30:51PM +0200, Andrey Konovalov wrote:
> > > > This patch is a part of a series that extends arm64 kernel ABI to allow to
> > > > pass tagged user pointers (with the top byte set to something else other
> > > > than 0x00) as syscall arguments.
> > > >
> > > > This patch allows tagged pointers to be passed to the following memory
> > > > syscalls: brk, get_mempolicy, madvise, mbind, mincore, mlock, mlock2,
> > > > mmap, mmap_pgoff, mprotect, mremap, msync, munlock, munmap,
> > > > remap_file_pages, shmat and shmdt.
> > > >
> > > > This is done by untagging pointers passed to these syscalls in the
> > > > prologues of their handlers.
> > >
> > > I'll go through them one by one to see if we can tighten the expected
> > > ABI while having the MTE in mind.
> > >
> > > > diff --git a/arch/arm64/kernel/sys.c b/arch/arm64/kernel/sys.c
> > > > index b44065fb1616..933bb9f3d6ec 100644
> > > > --- a/arch/arm64/kernel/sys.c
> > > > +++ b/arch/arm64/kernel/sys.c
> > > > @@ -35,10 +35,33 @@ SYSCALL_DEFINE6(mmap, unsigned long, addr, unsigned long, len,
> > > > {
> > > > if (offset_in_page(off) != 0)
> > > > return -EINVAL;
> > > > -
> > > > + addr = untagged_addr(addr);
> > > > return ksys_mmap_pgoff(addr, len, prot, flags, fd, off >> PAGE_SHIFT);
> > > > }
> > >
> > > If user passes a tagged pointer to mmap() and the address is honoured
> > > (or MAP_FIXED is given), what is the expected return pointer? Does it
> > > need to be tagged with the value from the hint?
> >
> > For HWASan the most convenient would be to use the tag from the hint.
> > But since in the TBI (not MTE) mode the kernel has no idea what
> > meaning userspace assigns to pointer tags, perhaps it should not try
> > to guess, and should return raw (zero-tagged) address instead.
>
> Then, just to relax the ABI for hwasan, shall we simply disallow tagged
> pointers on mmap() arguments? We can leave them in for
> mremap(old_address), madvise().

I think this would be fine. We should allow tagged in pointers in
mprotect though.

> > > With MTE, we may want to use this as a request for the default colour of
> > > the mapped pages (still under discussion).
> >
> > I like this - and in that case it would make sense to return the
> > pointer that can be immediately dereferenced without crashing the
> > process, i.e. with the matching tag.
>
> This came up from the Android investigation work where large memory
> allocations (using mmap) could be more efficiently pre-tagged by the
> kernel on page fault. Not sure about the implementation details yet.
>
> --
> Catalin

2019-05-24 10:13:41

by Catalin Marinas

[permalink] [raw]
Subject: Re: [PATCH v15 00/17] arm64: untag user pointers passed to the kernel

On Thu, May 23, 2019 at 03:49:05PM -0600, Khalid Aziz wrote:
> On 5/23/19 2:11 PM, Catalin Marinas wrote:
> > On Thu, May 23, 2019 at 11:51:40AM -0600, Khalid Aziz wrote:
> >> On 5/21/19 6:04 PM, Kees Cook wrote:
> >>> As an aside: I think Sparc ADI support in Linux actually side-stepped
> >>> this[1] (i.e. chose "solution 1"): "All addresses passed to kernel must
> >>> be non-ADI tagged addresses." (And sadly, "Kernel does not enable ADI
> >>> for kernel code.") I think this was a mistake we should not repeat for
> >>> arm64 (we do seem to be at least in agreement about this, I think).
> >>>
> >>> [1] https://lore.kernel.org/patchwork/patch/654481/
> >>
> >> That is a very early version of the sparc ADI patch. Support for tagged
> >> addresses in syscalls was added in later versions and is in the patch
> >> that is in the kernel.
> >
> > I tried to figure out but I'm not familiar with the sparc port. How did
> > you solve the tagged address going into various syscall implementations
> > in the kernel (e.g. sys_write)? Is the tag removed on kernel entry or it
> > ends up deeper in the core code?
>
> Another spot I should point out in ADI patch - Tags are not stored in
> VMAs and IOMMU does not support ADI tags on M7. ADI tags are stripped
> before userspace addresses are passed to IOMMU in the following snippet
> from the patch:
>
> diff --git a/arch/sparc/mm/gup.c b/arch/sparc/mm/gup.c
> index 5335ba3c850e..357b6047653a 100644
> --- a/arch/sparc/mm/gup.c
> +++ b/arch/sparc/mm/gup.c
> @@ -201,6 +202,24 @@ int __get_user_pages_fast(unsigned long start, int
> nr_pages
> , int write,
> pgd_t *pgdp;
> int nr = 0;
>
> +#ifdef CONFIG_SPARC64
> + if (adi_capable()) {
> + long addr = start;
> +
> + /* If userspace has passed a versioned address, kernel
> + * will not find it in the VMAs since it does not store
> + * the version tags in the list of VMAs. Storing version
> + * tags in list of VMAs is impractical since they can be
> + * changed any time from userspace without dropping into
> + * kernel. Any address search in VMAs will be done with
> + * non-versioned addresses. Ensure the ADI version bits
> + * are dropped here by sign extending the last bit before
> + * ADI bits. IOMMU does not implement version tags.
> + */
> + addr = (addr << (long)adi_nbits()) >> (long)adi_nbits();
> + start = addr;
> + }
> +#endif
> start &= PAGE_MASK;
> addr = start;
> len = (unsigned long) nr_pages << PAGE_SHIFT;

Thanks Khalid. I missed that sparc does not enable HAVE_GENERIC_GUP, so
you fix this case here. If we add the generic untagged_addr() macro in
the generic code, I think sparc can start making use of it rather than
open-coding the shifts.

There are a few other other places where tags can leak and the core code
would get confused (for example, madvise()). I presume your user space
doesn't exercise them. On arm64 we plan to just allow the C library to
tag any new memory allocation, so those core code paths would need to be
covered.

And similarly, devices, IOMMU, any DMA would ignore tags.

--
Catalin

2019-05-24 11:22:00

by Catalin Marinas

[permalink] [raw]
Subject: Re: [PATCH v15 00/17] arm64: untag user pointers passed to the kernel

On Thu, May 23, 2019 at 02:31:16PM -0700, Kees Cook wrote:
> On Thu, May 23, 2019 at 06:43:46PM +0100, Catalin Marinas wrote:
> > On Thu, May 23, 2019 at 09:38:19AM -0700, Kees Cook wrote:
> > > What about testing tools that intentionally insert high bits for syscalls
> > > and are _expecting_ them to fail? It seems the TBI series will break them?
> > > In that case, do we need to opt into TBI as well?
> >
> > If there are such tools, then we may need a per-process control. It's
> > basically an ABI change.
>
> syzkaller already attempts to randomly inject non-canonical and
> 0xFFFF....FFFF addresses for user pointers in syscalls in an effort to
> find bugs like CVE-2017-5123 where waitid() via unchecked put_user() was
> able to write directly to kernel memory[1].
>
> It seems that using TBI by default and not allowing a switch back to
> "normal" ABI without a reboot actually means that userspace cannot inject
> kernel pointers into syscalls any more, since they'll get universally
> stripped now. Is my understanding correct, here? i.e. exploiting
> CVE-2017-5123 would be impossible under TBI?

Unless the kernel is also using TBI (khwasan), in which case masking out
the top byte wouldn't help. Anyway, as per this discussion, we want the
tagged pointer to remain intact all the way to put_user(), so nothing
gets masked out. I don't think this would have helped with the waitid()
bug.

> If so, then I think we should commit to the TBI ABI and have a boot
> flag to disable it, but NOT have a process flag, as that would allow
> attackers to bypass the masking. The only flag should be "TBI or MTE".
>
> If so, can I get top byte masking for other architectures too? Like,
> just to strip high bits off userspace addresses? ;)

But you didn't like my option 2 shim proposal which strips the tag on
kernel entry because it lowers the value of MTE ;).

> (Oh, in looking I see this is implemented with sign-extension... why
> not just a mask? So it'll either be valid userspace address or forced
> into the non-canonical range?)

The TTBR0/1 selection on memory accesses is done based on bit 63 if TBI
is disabled and bit 55 when enabled. Sign-extension allows us to use the
same macro for both user and kernel tagged pointers. With MTE tag 0
would be match-all for TTBR0 and 0xff for TTBR1 (so that we don't modify
the virtual address space of the kernel; I need to check the latest spec
to be sure). Note that the VA space for both user and kernel is limited
to 52-bit architecturally so, on access, bits 55..52 must be the same, 0
or 1, otherwise you get a fault.

Since the syzkaller tests would also need to set bits 55-52 (actually 48
for kernel addresses, we haven't merged the 52-bit kernel VA patches
yet) to hit a valid kernel address, I don't think ignoring the top byte
makes much difference to the expected failure scenario.

> > > Alright, the tl;dr appears to be:
> > > - you want more assurances that we can find __user stripping in the
> > > kernel more easily. (But this seems like a parallel problem.)
> >
> > Yes, and that we found all (most) cases now. The reason I don't see it
> > as a parallel problem is that, as maintainer, I promise an ABI to user
> > and I'd rather stick to it. I don't want, for example, ncurses to stop
> > working because of some ioctl() rejecting tagged pointers.
>
> But this is what I don't understand: it would need to be ncurses _using
> TBI_, that would stop working (having started to work before, but then
> regress due to a newly added one-off bug). Regular ncurses will be fine
> because it's not using TBI. So The Golden Rule isn't violated,

Once we introduced TBI and the libc starts tagging heap allocations,
this becomes the new "regular" user space behaviour (i.e. using TBI). So
a new bug would break the golden rule. It could also be an old bug that
went unnoticed (i.e. you changed the graphics card and its driver gets
confused by tagged pointers coming from user-space).

> and by definition, it's a specific regression caused by some bug
> (since TBI would have had to have worked _before_ in the situation to
> be considered a regression now). Which describes the normal path for
> kernel development... add feature, find corner cases where it doesn't
> work, fix them, encounter new regressions, fix those, repeat forever.
>
> > If it's just the occasional one-off bug I'm fine to deal with it. But
> > no-one convinced me yet that this is the case.
>
> You believe there still to be some systemic cases that haven't been
> found yet? And even if so -- isn't it better to work on that
> incrementally?

I want some way to systematically identify potential issues (sparse?).
Since problems are most likely in drivers, I don't have all devices to
check and not all users have the knowledge to track down why something
failed.

I think we can do this incrementally as long the TBI ABI is not the
default. Even better if we made it per process.

> > As for the generic driver code (filesystems or other subsystems),
> > without some clear direction for developers, together with static
> > checking/sparse, on how user pointers are cast to longs (one example),
> > it would become my responsibility to identify and fix them up with any
> > kernel release. This series is not providing such guidance, just adding
> > untagged_addr() in some places that we think matter.
>
> What about adding a nice bit of .rst documentation that describes the
> situation and shows how to use untagged_addr(). This is the kind of
> kernel-wide change that "everyone" needs to know about, and shouldn't
> be the arch maintainer's sole responsibility to fix.

This works (if people read it) but we also need to be more prescriptive
in how casting is done and how we differentiate between a pointer for
dereference (T __user *) and address space management (usually unsigned
long). On top of that, we'd get sparse to check for such conversions and
maybe even checkpatch for some low-hanging fruit.

> > > - we might need to opt in to TBI with a prctl()
> >
> > Yes, although still up for discussion.
>
> I think I've talked myself out of it. I say boot param only! :)

I hope I talked you in again ;). I don't see TBI as improving kernel
security.

> So what do you say to these next steps:
>
> - change untagged_addr() to use a static branch that is controlled with
> a boot parameter.

access_ok() as well.

> - add, say, Documentation/core-api/user-addresses.rst to describe
> proper care and handling of user space pointers with untagged_addr(),
> with examples based on all the cases seen so far in this series.

We have u64_to_user_ptr(). What about the reverse? And maybe changing
get_user_pages() to take void __user *.

> - continue work to improve static analysis.

Andrew Murray in the ARM kernel team started revisiting the old sparse
threads, let's see how it goes.

--
Catalin

2019-05-24 13:17:08

by Mauro Carvalho Chehab

[permalink] [raw]
Subject: Re: [PATCH v15 14/17] media/v4l2-core, arm64: untag user pointers in videobuf_dma_contig_user_get

Em Mon, 6 May 2019 18:31:00 +0200
Andrey Konovalov <[email protected]> escreveu:

> This patch is a part of a series that extends arm64 kernel ABI to allow to
> pass tagged user pointers (with the top byte set to something else other
> than 0x00) as syscall arguments.
>
> videobuf_dma_contig_user_get() uses provided user pointers for vma
> lookups, which can only by done with untagged pointers.
>
> Untag the pointers in this function.
>
> Signed-off-by: Andrey Konovalov <[email protected]>

Acked-by: Mauro Carvalho Chehab <[email protected]>

> ---
> drivers/media/v4l2-core/videobuf-dma-contig.c | 9 +++++----
> 1 file changed, 5 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/media/v4l2-core/videobuf-dma-contig.c b/drivers/media/v4l2-core/videobuf-dma-contig.c
> index e1bf50df4c70..8a1ddd146b17 100644
> --- a/drivers/media/v4l2-core/videobuf-dma-contig.c
> +++ b/drivers/media/v4l2-core/videobuf-dma-contig.c
> @@ -160,6 +160,7 @@ static void videobuf_dma_contig_user_put(struct videobuf_dma_contig_memory *mem)
> static int videobuf_dma_contig_user_get(struct videobuf_dma_contig_memory *mem,
> struct videobuf_buffer *vb)
> {
> + unsigned long untagged_baddr = untagged_addr(vb->baddr);
> struct mm_struct *mm = current->mm;
> struct vm_area_struct *vma;
> unsigned long prev_pfn, this_pfn;
> @@ -167,22 +168,22 @@ static int videobuf_dma_contig_user_get(struct videobuf_dma_contig_memory *mem,
> unsigned int offset;
> int ret;
>
> - offset = vb->baddr & ~PAGE_MASK;
> + offset = untagged_baddr & ~PAGE_MASK;
> mem->size = PAGE_ALIGN(vb->size + offset);
> ret = -EINVAL;
>
> down_read(&mm->mmap_sem);
>
> - vma = find_vma(mm, vb->baddr);
> + vma = find_vma(mm, untagged_baddr);
> if (!vma)
> goto out_up;
>
> - if ((vb->baddr + mem->size) > vma->vm_end)
> + if ((untagged_baddr + mem->size) > vma->vm_end)
> goto out_up;
>
> pages_done = 0;
> prev_pfn = 0; /* kill warning */
> - user_address = vb->baddr;
> + user_address = untagged_baddr;
>
> while (pages_done < (mem->size >> PAGE_SHIFT)) {
> ret = follow_pfn(vma, user_address, &this_pfn);



Thanks,
Mauro

2019-05-24 14:25:33

by Dave Martin

[permalink] [raw]
Subject: Re: [PATCH v15 00/17] arm64: untag user pointers passed to the kernel

On Thu, May 23, 2019 at 05:57:09PM +0100, Catalin Marinas wrote:
> On Thu, May 23, 2019 at 11:42:57AM +0100, Dave P Martin wrote:
> > On Wed, May 22, 2019 at 09:20:52PM -0300, Jason Gunthorpe wrote:
> > > On Wed, May 22, 2019 at 02:49:28PM +0100, Dave Martin wrote:
> > > > If multiple people will care about this, perhaps we should try to
> > > > annotate types more explicitly in SYSCALL_DEFINEx() and ABI data
> > > > structures.
> > > >
> > > > For example, we could have a couple of mutually exclusive modifiers
> > > >
> > > > T __object *
> > > > T __vaddr * (or U __vaddr)
> > > >
> > > > In the first case the pointer points to an object (in the C sense)
> > > > that the call may dereference but not use for any other purpose.
> > >
> > > How would you use these two differently?
> > >
> > > So far the kernel has worked that __user should tag any pointer that
> > > is from userspace and then you can't do anything with it until you
> > > transform it into a kernel something
> >
> > Ultimately it would be good to disallow casting __object pointers execpt
> > to compatible __object pointer types, and to make get_user etc. demand
> > __object.
> >
> > __vaddr pointers / addresses would be freely castable, but not to
> > __object and so would not be dereferenceable even indirectly.
>
> I think it gets too complicated and there are ambiguous cases that we
> may not be able to distinguish. For example copy_from_user() may be used
> to copy a user data structure into the kernel, hence __object would
> work, while the same function may be used to copy opaque data to a file,
> so __vaddr may be a better option (unless I misunderstood your
> proposal).

Can you illustrate? I'm not sure of your point here.

> We currently have T __user * and I think it's a good starting point. The
> prior attempt [1] was shut down because it was just hiding the cast
> using __force. We'd need to work through those cases again and rather
> start changing the function prototypes to avoid unnecessary casting in
> the callers (e.g. get_user_pages(void __user *) or come up with a new
> type) while changing the explicit casting to a macro where it needs to
> be obvious that we are converting a user pointer, potentially typed
> (tagged), to an untyped address range. We may need a user_ptr_to_ulong()
> macro or similar (it seems that we have a u64_to_user_ptr, wasn't aware
> of it).
>
> It may actually not be far from what you suggested but I'd keep the
> current T __user * to denote possible dereference.

This may not have been clear, but __object and __vaddr would be
orthogonal to __user. Since __object and __vaddr strictly constrain
what can be done with an lvalue, they could be cast on, but not be
cast off without __force.

Syscall arguments and pointer in ioctl structs etc. would typically
be annotated as __object __user * or __vaddr __user *. Plain old
__user * would work as before, but would be more permissive and give
static analysers less information to go on.

Conversion or use or __object or __vaddr pointers would require specific
APIs in the kernel, so that we can be clear about the semantics.

Doing things this way would allow migration to annotation of most or all
ABI pointers with __object or __vaddr over time, but we wouldn't have to
do it all in one go. Problem cases (which won't be the majority) could
continue to be plain __user.


This does not magically solve the challenges of MTE, but might provide
tools that are useful to help avoid bitrot and regressions over time.

I agree though that there might be a fair number of of cases that don't
conveniently fall under __object or __vaddr semantics. It's hard to
know without trying it.

_Most_ syscall arguments seem to be fairly obviously one or another
though, and this approach has some possibility of scaling to ioctls
and other odd interfaces.

Cheers
---Dave

2019-05-24 14:28:00

by Khalid Aziz

[permalink] [raw]
Subject: Re: [PATCH v15 00/17] arm64: untag user pointers passed to the kernel

On 5/24/19 4:11 AM, Catalin Marinas wrote:
> On Thu, May 23, 2019 at 03:49:05PM -0600, Khalid Aziz wrote:
>> On 5/23/19 2:11 PM, Catalin Marinas wrote:
>>> On Thu, May 23, 2019 at 11:51:40AM -0600, Khalid Aziz wrote:
>>>> On 5/21/19 6:04 PM, Kees Cook wrote:
>>>>> As an aside: I think Sparc ADI support in Linux actually side-stepped
>>>>> this[1] (i.e. chose "solution 1"): "All addresses passed to kernel must
>>>>> be non-ADI tagged addresses." (And sadly, "Kernel does not enable ADI
>>>>> for kernel code.") I think this was a mistake we should not repeat for
>>>>> arm64 (we do seem to be at least in agreement about this, I think).
>>>>>
>>>>> [1] https://lore.kernel.org/patchwork/patch/654481/
>>>>
>>>> That is a very early version of the sparc ADI patch. Support for tagged
>>>> addresses in syscalls was added in later versions and is in the patch
>>>> that is in the kernel.
>>>
>>> I tried to figure out but I'm not familiar with the sparc port. How did
>>> you solve the tagged address going into various syscall implementations
>>> in the kernel (e.g. sys_write)? Is the tag removed on kernel entry or it
>>> ends up deeper in the core code?
>>
>> Another spot I should point out in ADI patch - Tags are not stored in
>> VMAs and IOMMU does not support ADI tags on M7. ADI tags are stripped
>> before userspace addresses are passed to IOMMU in the following snippet
>> from the patch:
>>
>> diff --git a/arch/sparc/mm/gup.c b/arch/sparc/mm/gup.c
>> index 5335ba3c850e..357b6047653a 100644
>> --- a/arch/sparc/mm/gup.c
>> +++ b/arch/sparc/mm/gup.c
>> @@ -201,6 +202,24 @@ int __get_user_pages_fast(unsigned long start, int
>> nr_pages
>> , int write,
>> pgd_t *pgdp;
>> int nr = 0;
>>
>> +#ifdef CONFIG_SPARC64
>> + if (adi_capable()) {
>> + long addr = start;
>> +
>> + /* If userspace has passed a versioned address, kernel
>> + * will not find it in the VMAs since it does not store
>> + * the version tags in the list of VMAs. Storing version
>> + * tags in list of VMAs is impractical since they can be
>> + * changed any time from userspace without dropping into
>> + * kernel. Any address search in VMAs will be done with
>> + * non-versioned addresses. Ensure the ADI version bits
>> + * are dropped here by sign extending the last bit before
>> + * ADI bits. IOMMU does not implement version tags.
>> + */
>> + addr = (addr << (long)adi_nbits()) >> (long)adi_nbits();
>> + start = addr;
>> + }
>> +#endif
>> start &= PAGE_MASK;
>> addr = start;
>> len = (unsigned long) nr_pages << PAGE_SHIFT;
>
> Thanks Khalid. I missed that sparc does not enable HAVE_GENERIC_GUP, so
> you fix this case here. If we add the generic untagged_addr() macro in
> the generic code, I think sparc can start making use of it rather than
> open-coding the shifts.

Hi Catalin,

Yes, that will be good. Right now addresses are untagged in sparc code
in only two spots but that will expand as we expand use of tags.
Scalabale solution is definitely better.

>
> There are a few other other places where tags can leak and the core code
> would get confused (for example, madvise()). I presume your user space
> doesn't exercise them. On arm64 we plan to just allow the C library to
> tag any new memory allocation, so those core code paths would need to be
> covered.
>
> And similarly, devices, IOMMU, any DMA would ignore tags.
>

Right. You are doing lot more with tags than sparc code intended to do.
I had looked into implementing just malloc(), mmap() and possibly
shmat() in library that automatically tags pointers. Expanding tags to
any pointers in C library will require covering lot more paths in kernel.

--
Khalid


2019-05-24 15:43:21

by Andrew Murray

[permalink] [raw]
Subject: Re: [PATCH v15 05/17] arms64: untag user pointers passed to memory syscalls

On Mon, May 06, 2019 at 06:30:51PM +0200, Andrey Konovalov wrote:
> This patch is a part of a series that extends arm64 kernel ABI to allow to
> pass tagged user pointers (with the top byte set to something else other
> than 0x00) as syscall arguments.
>
> This patch allows tagged pointers to be passed to the following memory
> syscalls: brk, get_mempolicy, madvise, mbind, mincore, mlock, mlock2,
> mmap, mmap_pgoff, mprotect, mremap, msync, munlock, munmap,
> remap_file_pages, shmat and shmdt.
>
> This is done by untagging pointers passed to these syscalls in the
> prologues of their handlers.
>
> Signed-off-by: Andrey Konovalov <[email protected]>
> ---


> +SYSCALL_DEFINE2(arm64_mlock, unsigned long, start, size_t, len)
> +{
> + start = untagged_addr(start);
> + return ksys_mlock(start, len, VM_LOCKED);
> +}
> +
> +SYSCALL_DEFINE2(arm64_mlock2, unsigned long, start, size_t, len)
> +{
> + start = untagged_addr(start);
> + return ksys_mlock(start, len, VM_LOCKED);
> +}

I think this may be a copy/paste error...

Shouldn't mlock2 have a third 'flags' argument to distinguish is from mlock?

Thanks,

Andrew Murray

> +
> +SYSCALL_DEFINE2(arm64_munlock, unsigned long, start, size_t, len)
> +{
> + start = untagged_addr(start);
> + return ksys_munlock(start, len);
> +}
> +
> +SYSCALL_DEFINE3(arm64_mprotect, unsigned long, start, size_t, len,
> + unsigned long, prot)
> +{
> + start = untagged_addr(start);
> + return ksys_mprotect_pkey(start, len, prot, -1);
> +}
> +
> +SYSCALL_DEFINE3(arm64_msync, unsigned long, start, size_t, len, int, flags)
> +{
> + start = untagged_addr(start);
> + return ksys_msync(start, len, flags);
> +}
> +
> +SYSCALL_DEFINE3(arm64_mincore, unsigned long, start, size_t, len,
> + unsigned char __user *, vec)
> +{
> + start = untagged_addr(start);
> + return ksys_mincore(start, len, vec);
> +}
> +
> +SYSCALL_DEFINE5(arm64_remap_file_pages, unsigned long, start,
> + unsigned long, size, unsigned long, prot,
> + unsigned long, pgoff, unsigned long, flags)
> +{
> + start = untagged_addr(start);
> + return ksys_remap_file_pages(start, size, prot, pgoff, flags);
> +}
> +
> +SYSCALL_DEFINE3(arm64_shmat, int, shmid, char __user *, shmaddr, int, shmflg)
> +{
> + shmaddr = untagged_addr(shmaddr);
> + return ksys_shmat(shmid, shmaddr, shmflg);
> +}
> +
> +SYSCALL_DEFINE1(arm64_shmdt, char __user *, shmaddr)
> +{
> + shmaddr = untagged_addr(shmaddr);
> + return ksys_shmdt(shmaddr);
> +}
> +
> /*
> * Wrappers to pass the pt_regs argument.
> */
> #define sys_personality sys_arm64_personality
> +#define sys_mmap_pgoff sys_arm64_mmap_pgoff
> +#define sys_mremap sys_arm64_mremap
> +#define sys_munmap sys_arm64_munmap
> +#define sys_brk sys_arm64_brk
> +#define sys_get_mempolicy sys_arm64_get_mempolicy
> +#define sys_madvise sys_arm64_madvise
> +#define sys_mbind sys_arm64_mbind
> +#define sys_mlock sys_arm64_mlock
> +#define sys_mlock2 sys_arm64_mlock2
> +#define sys_munlock sys_arm64_munlock
> +#define sys_mprotect sys_arm64_mprotect
> +#define sys_msync sys_arm64_msync
> +#define sys_mincore sys_arm64_mincore
> +#define sys_remap_file_pages sys_arm64_remap_file_pages
> +#define sys_shmat sys_arm64_shmat
> +#define sys_shmdt sys_arm64_shmdt
>
> asmlinkage long sys_ni_syscall(const struct pt_regs *);
> #define __arm64_sys_ni_syscall sys_ni_syscall
> --
> 2.21.0.1020.gf2820cf01a-goog
>
>
> _______________________________________________
> linux-arm-kernel mailing list
> [email protected]
> http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

2019-05-25 10:00:13

by Catalin Marinas

[permalink] [raw]
Subject: Re: [PATCH v15 05/17] arms64: untag user pointers passed to memory syscalls

On Mon, May 06, 2019 at 06:30:51PM +0200, Andrey Konovalov wrote:
> +SYSCALL_DEFINE5(arm64_get_mempolicy, int __user *, policy,
> + unsigned long __user *, nmask, unsigned long, maxnode,
> + unsigned long, addr, unsigned long, flags)
> +{
> + addr = untagged_addr(addr);
> + return ksys_get_mempolicy(policy, nmask, maxnode, addr, flags);
> +}
[...]
> +SYSCALL_DEFINE6(arm64_mbind, unsigned long, start, unsigned long, len,
> + unsigned long, mode, const unsigned long __user *, nmask,
> + unsigned long, maxnode, unsigned int, flags)
> +{
> + start = untagged_addr(start);
> + return ksys_mbind(start, len, mode, nmask, maxnode, flags);
> +}

The kernel fails to build with CONFIG_NUMA disabled because the above
are in mm/mempolicy.c which is no longer compiled in.

--
Catalin

2019-05-27 09:44:27

by Catalin Marinas

[permalink] [raw]
Subject: Re: [PATCH v15 05/17] arms64: untag user pointers passed to memory syscalls

On Mon, May 06, 2019 at 06:30:51PM +0200, Andrey Konovalov wrote:
> +SYSCALL_DEFINE2(arm64_mlock2, unsigned long, start, size_t, len)
> +{
> + start = untagged_addr(start);
> + return ksys_mlock(start, len, VM_LOCKED);
> +}

Copy/paste error: sys_mlock2() has 3 arguments and should call
ksys_mlock2().

Still tracking down an LTP failure on test mlock01.

--
Catalin

2019-05-27 14:38:54

by Catalin Marinas

[permalink] [raw]
Subject: Re: [PATCH v15 05/17] arms64: untag user pointers passed to memory syscalls

On Mon, May 06, 2019 at 06:30:51PM +0200, Andrey Konovalov wrote:
> This patch is a part of a series that extends arm64 kernel ABI to allow to
> pass tagged user pointers (with the top byte set to something else other
> than 0x00) as syscall arguments.
>
> This patch allows tagged pointers to be passed to the following memory
> syscalls: brk, get_mempolicy, madvise, mbind, mincore, mlock, mlock2,
> mmap, mmap_pgoff, mprotect, mremap, msync, munlock, munmap,
> remap_file_pages, shmat and shmdt.
>
> This is done by untagging pointers passed to these syscalls in the
> prologues of their handlers.
>
> Signed-off-by: Andrey Konovalov <[email protected]>

Actually, I don't think any of these wrappers get called (have you
tested this patch?). Following commit 4378a7d4be30 ("arm64: implement
syscall wrappers"), I think we have other macro names for overriding the
sys_* ones.

--
Catalin

2019-05-28 13:08:38

by Catalin Marinas

[permalink] [raw]
Subject: Re: [PATCH v15 05/17] arms64: untag user pointers passed to memory syscalls

On Mon, May 06, 2019 at 06:30:51PM +0200, Andrey Konovalov wrote:
> /*
> * Wrappers to pass the pt_regs argument.
> */
> #define sys_personality sys_arm64_personality
> +#define sys_mmap_pgoff sys_arm64_mmap_pgoff
> +#define sys_mremap sys_arm64_mremap
> +#define sys_munmap sys_arm64_munmap
> +#define sys_brk sys_arm64_brk
> +#define sys_get_mempolicy sys_arm64_get_mempolicy
> +#define sys_madvise sys_arm64_madvise
> +#define sys_mbind sys_arm64_mbind
> +#define sys_mlock sys_arm64_mlock
> +#define sys_mlock2 sys_arm64_mlock2
> +#define sys_munlock sys_arm64_munlock
> +#define sys_mprotect sys_arm64_mprotect
> +#define sys_msync sys_arm64_msync
> +#define sys_mincore sys_arm64_mincore
> +#define sys_remap_file_pages sys_arm64_remap_file_pages
> +#define sys_shmat sys_arm64_shmat
> +#define sys_shmdt sys_arm64_shmdt

This hunk should be (I sent a separate patch for sys_personality):

@@ -160,23 +163,23 @@ SYSCALL_DEFINE1(arm64_shmdt, char __user *, shmaddr)
/*
* Wrappers to pass the pt_regs argument.
*/
-#define sys_personality sys_arm64_personality
-#define sys_mmap_pgoff sys_arm64_mmap_pgoff
-#define sys_mremap sys_arm64_mremap
-#define sys_munmap sys_arm64_munmap
-#define sys_brk sys_arm64_brk
-#define sys_get_mempolicy sys_arm64_get_mempolicy
-#define sys_madvise sys_arm64_madvise
-#define sys_mbind sys_arm64_mbind
-#define sys_mlock sys_arm64_mlock
-#define sys_mlock2 sys_arm64_mlock2
-#define sys_munlock sys_arm64_munlock
-#define sys_mprotect sys_arm64_mprotect
-#define sys_msync sys_arm64_msync
-#define sys_mincore sys_arm64_mincore
-#define sys_remap_file_pages sys_arm64_remap_file_pages
-#define sys_shmat sys_arm64_shmat
-#define sys_shmdt sys_arm64_shmdt
+#define __arm64_sys_personality __arm64_sys_arm64_personality
+#define __arm64_sys_mmap_pgoff __arm64_sys_arm64_mmap_pgoff
+#define __arm64_sys_mremap __arm64_sys_arm64_mremap
+#define __arm64_sys_munmap __arm64_sys_arm64_munmap
+#define __arm64_sys_brk __arm64_sys_arm64_brk
+#define __arm64_sys_get_mempolicy __arm64_sys_arm64_get_mempolicy
+#define __arm64_sys_madvise __arm64_sys_arm64_madvise
+#define __arm64_sys_mbind __arm64_sys_arm64_mbind
+#define __arm64_sys_mlock __arm64_sys_arm64_mlock
+#define __arm64_sys_mlock2 __arm64_sys_arm64_mlock2
+#define __arm64_sys_munlock __arm64_sys_arm64_munlock
+#define __arm64_sys_mprotect __arm64_sys_arm64_mprotect
+#define __arm64_sys_msync __arm64_sys_arm64_msync
+#define __arm64_sys_mincore __arm64_sys_arm64_mincore
+#define __arm64_sys_remap_file_pages __arm64_sys_arm64_remap_file_pages
+#define __arm64_sys_shmat __arm64_sys_arm64_shmat
+#define __arm64_sys_shmdt __arm64_sys_arm64_shmdt

asmlinkage long sys_ni_syscall(const struct pt_regs *);
#define __arm64_sys_ni_syscall sys_ni_syscall

--
Catalin

2019-05-28 14:31:13

by Andrey Konovalov

[permalink] [raw]
Subject: Re: [PATCH v15 00/17] arm64: untag user pointers passed to the kernel

Thanks for a lot of valuable input! I've read through all the replies
and got somewhat lost. What are the changes I need to do to this
series?

1. Should I move untagging for memory syscalls back to the generic
code so other arches would make use of it as well, or should I keep
the arm64 specific memory syscalls wrappers and address the comments
on that patch?

2. Should I make untagging opt-in and controlled by a command line argument?

3. Should I "add Documentation/core-api/user-addresses.rst to describe
proper care and handling of user space pointers with untagged_addr(),
with examples based on all the cases seen so far in this series"?
Which examples specifically should it cover?

Is there something else?

2019-05-28 14:55:37

by Andrew Murray

[permalink] [raw]
Subject: Re: [PATCH v15 05/17] arms64: untag user pointers passed to memory syscalls

On Mon, May 27, 2019 at 03:37:20PM +0100, Catalin Marinas wrote:
> On Mon, May 06, 2019 at 06:30:51PM +0200, Andrey Konovalov wrote:
> > This patch is a part of a series that extends arm64 kernel ABI to allow to
> > pass tagged user pointers (with the top byte set to something else other
> > than 0x00) as syscall arguments.
> >
> > This patch allows tagged pointers to be passed to the following memory
> > syscalls: brk, get_mempolicy, madvise, mbind, mincore, mlock, mlock2,
> > mmap, mmap_pgoff, mprotect, mremap, msync, munlock, munmap,
> > remap_file_pages, shmat and shmdt.
> >
> > This is done by untagging pointers passed to these syscalls in the
> > prologues of their handlers.
> >
> > Signed-off-by: Andrey Konovalov <[email protected]>
>
> Actually, I don't think any of these wrappers get called (have you
> tested this patch?). Following commit 4378a7d4be30 ("arm64: implement
> syscall wrappers"), I think we have other macro names for overriding the
> sys_* ones.

What is the value in adding these wrappers?

The untagged_addr macro is defined for all in linux/mm.h and these patches
already use untagged_addr in generic code. Thus adding a few more
untagged_addr in the generic syscall handlers (which turn to a nop for most)
is surely better than adding wrappers?

Even if other architectures implement untagged_addr in the future it would
be more consistent if they untagged in the same places and thus not adding
these wrappers enforces that.

Thanks,

Andrew Murray

>
> --
> Catalin
>
> _______________________________________________
> linux-arm-kernel mailing list
> [email protected]
> http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

2019-05-28 15:48:34

by Catalin Marinas

[permalink] [raw]
Subject: Re: [PATCH v15 05/17] arms64: untag user pointers passed to memory syscalls

On Tue, May 28, 2019 at 03:54:11PM +0100, Andrew Murray wrote:
> On Mon, May 27, 2019 at 03:37:20PM +0100, Catalin Marinas wrote:
> > On Mon, May 06, 2019 at 06:30:51PM +0200, Andrey Konovalov wrote:
> > > This patch is a part of a series that extends arm64 kernel ABI to allow to
> > > pass tagged user pointers (with the top byte set to something else other
> > > than 0x00) as syscall arguments.
> > >
> > > This patch allows tagged pointers to be passed to the following memory
> > > syscalls: brk, get_mempolicy, madvise, mbind, mincore, mlock, mlock2,
> > > mmap, mmap_pgoff, mprotect, mremap, msync, munlock, munmap,
> > > remap_file_pages, shmat and shmdt.
> > >
> > > This is done by untagging pointers passed to these syscalls in the
> > > prologues of their handlers.
> > >
> > > Signed-off-by: Andrey Konovalov <[email protected]>
> >
> > Actually, I don't think any of these wrappers get called (have you
> > tested this patch?). Following commit 4378a7d4be30 ("arm64: implement
> > syscall wrappers"), I think we have other macro names for overriding the
> > sys_* ones.
>
> What is the value in adding these wrappers?

Not much value, initially proposed just to keep the core changes small.
I'm fine with moving them back in the generic code (but see below).

I think another aspect is how we define the ABI. Is allowing tags to
mlock() for example something specific to arm64 or would sparc ADI need
the same? In the absence of other architectures defining such ABI, my
preference would be to keep the wrappers in the arch code.

Assuming sparc won't implement untagged_addr(), we can place the macros
back in the generic code but, as per the review here, we need to be more
restrictive on where we allow tagged addresses. For example, if mmap()
gets a tagged address with MAP_FIXED, is it expected to return the tag?

My thoughts on allowing tags (quick look):

brk - no
get_mempolicy - yes
madvise - yes
mbind - yes
mincore - yes
mlock, mlock2, munlock - yes
mmap - no (we may change this with MTE but not for TBI)
mmap_pgoff - not used on arm64
mprotect - yes
mremap - yes for old_address, no for new_address (on par with mmap)
msync - yes
munmap - probably no (mmap does not return tagged ptrs)
remap_file_pages - no (also deprecated syscall)
shmat, shmdt - shall we allow tagged addresses on shared memory?

The above is only about the TBI ABI while ignoring hardware MTE. For the
latter, we may want to change the mmap() to allow pre-colouring on page
fault which means that munmap()/mprotect() should also support tagged
pointers. Possibly mremap() as well but we need to decide whether it
should allow re-colouring the page (probably no, in which case
old_address and new_address should have the same tag). For some of these
we'll end up with arm64 specific wrappers again, unless sparc ADI adopts
exactly the same ABI restrictions.

--
Catalin

2019-05-28 15:58:29

by Dave Martin

[permalink] [raw]
Subject: Re: [PATCH v15 05/17] arms64: untag user pointers passed to memory syscalls

On Tue, May 28, 2019 at 04:40:58PM +0100, Catalin Marinas wrote:

[...]

> My thoughts on allowing tags (quick look):
>
> brk - no

[...]

> mlock, mlock2, munlock - yes
> mmap - no (we may change this with MTE but not for TBI)

[...]

> mprotect - yes

I haven't following this discussion closely... what's the rationale for
the inconsistencies here (feel free to refer me back to the discussion
if it's elsewhere).

Cheers
---Dave

2019-05-28 16:38:07

by Catalin Marinas

[permalink] [raw]
Subject: Re: [PATCH v15 05/17] arms64: untag user pointers passed to memory syscalls

On Tue, May 28, 2019 at 04:56:45PM +0100, Dave P Martin wrote:
> On Tue, May 28, 2019 at 04:40:58PM +0100, Catalin Marinas wrote:
>
> [...]
>
> > My thoughts on allowing tags (quick look):
> >
> > brk - no
>
> [...]
>
> > mlock, mlock2, munlock - yes
> > mmap - no (we may change this with MTE but not for TBI)
>
> [...]
>
> > mprotect - yes
>
> I haven't following this discussion closely... what's the rationale for
> the inconsistencies here (feel free to refer me back to the discussion
> if it's elsewhere).

_My_ rationale (feel free to disagree) is that mmap() by default would
not return a tagged address (ignoring MTE for now). If it gets passed a
tagged address or a "tagged NULL" (for lack of a better name) we don't
have clear semantics of whether the returned address should be tagged in
this ABI relaxation. I'd rather reserve this specific behaviour if we
overload the non-zero tag meaning of mmap() for MTE. Similar reasoning
for mremap(), at least on the new_address argument (not entirely sure
about old_address).

munmap() should probably follow the mmap() rules.

As for brk(), I don't see why the user would need to pass a tagged
address, we can't associate any meaning to this tag.

For the rest, since it's likely such addresses would have been tagged by
malloc() in user space, we should allow tagged pointers.

--
Catalin

2019-05-28 17:04:17

by Catalin Marinas

[permalink] [raw]
Subject: Re: [PATCH v15 00/17] arm64: untag user pointers passed to the kernel

On Thu, May 23, 2019 at 02:31:16PM -0700, Kees Cook wrote:
> syzkaller already attempts to randomly inject non-canonical and
> 0xFFFF....FFFF addresses for user pointers in syscalls in an effort to
> find bugs like CVE-2017-5123 where waitid() via unchecked put_user() was
> able to write directly to kernel memory[1].
>
> It seems that using TBI by default and not allowing a switch back to
> "normal" ABI without a reboot actually means that userspace cannot inject
> kernel pointers into syscalls any more, since they'll get universally
> stripped now. Is my understanding correct, here? i.e. exploiting
> CVE-2017-5123 would be impossible under TBI?
>
> If so, then I think we should commit to the TBI ABI and have a boot
> flag to disable it, but NOT have a process flag, as that would allow
> attackers to bypass the masking. The only flag should be "TBI or MTE".
>
> If so, can I get top byte masking for other architectures too? Like,
> just to strip high bits off userspace addresses? ;)

Just for fun, hack/attempt at your idea which should not interfere with
TBI. Only briefly tested on arm64 (and the s390 __TYPE_IS_PTR macro is
pretty weird ;)):

--------------------------8<---------------------------------
diff --git a/arch/s390/include/asm/compat.h b/arch/s390/include/asm/compat.h
index 63b46e30b2c3..338455a74eff 100644
--- a/arch/s390/include/asm/compat.h
+++ b/arch/s390/include/asm/compat.h
@@ -11,9 +11,6 @@

#include <asm-generic/compat.h>

-#define __TYPE_IS_PTR(t) (!__builtin_types_compatible_p( \
- typeof(0?(__force t)0:0ULL), u64))
-
#define __SC_DELOUSE(t,v) ({ \
BUILD_BUG_ON(sizeof(t) > 4 && !__TYPE_IS_PTR(t)); \
(__force t)(__TYPE_IS_PTR(t) ? ((v) & 0x7fffffff) : (v)); \
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index e2870fe1be5b..b1b9fe8502da 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -119,8 +119,15 @@ struct io_uring_params;
#define __TYPE_IS_L(t) (__TYPE_AS(t, 0L))
#define __TYPE_IS_UL(t) (__TYPE_AS(t, 0UL))
#define __TYPE_IS_LL(t) (__TYPE_AS(t, 0LL) || __TYPE_AS(t, 0ULL))
+#define __TYPE_IS_PTR(t) (!__builtin_types_compatible_p(typeof(0 ? (__force t)0 : 0ULL), u64))
#define __SC_LONG(t, a) __typeof(__builtin_choose_expr(__TYPE_IS_LL(t), 0LL, 0L)) a
+#ifdef CONFIG_64BIT
+#define __SC_CAST(t, a) (__TYPE_IS_PTR(t) \
+ ? (__force t) ((__u64)a & ~(1UL << 55)) \
+ : (__force t) a)
+#else
#define __SC_CAST(t, a) (__force t) a
+#endif
#define __SC_ARGS(t, a) a
#define __SC_TEST(t, a) (void)BUILD_BUG_ON_ZERO(!__TYPE_IS_LL(t) && sizeof(t) > sizeof(long))


--
Catalin

2019-05-28 23:37:45

by Khalid Aziz

[permalink] [raw]
Subject: Re: [PATCH v15 05/17] arms64: untag user pointers passed to memory syscalls

On Tue, 2019-05-28 at 16:40 +0100, Catalin Marinas wrote:
> On Tue, May 28, 2019 at 03:54:11PM +0100, Andrew Murray wrote:
> > On Mon, May 27, 2019 at 03:37:20PM +0100, Catalin Marinas wrote:
> > > On Mon, May 06, 2019 at 06:30:51PM +0200, Andrey Konovalov wrote:
> > > > This patch is a part of a series that extends arm64 kernel ABI
> > > > to allow to
> > > > pass tagged user pointers (with the top byte set to something
> > > > else other
> > > > than 0x00) as syscall arguments.
> > > >
> > > > This patch allows tagged pointers to be passed to the following
> > > > memory
> > > > syscalls: brk, get_mempolicy, madvise, mbind, mincore, mlock,
> > > > mlock2,
> > > > mmap, mmap_pgoff, mprotect, mremap, msync, munlock, munmap,
> > > > remap_file_pages, shmat and shmdt.
> > > >
> > > > This is done by untagging pointers passed to these syscalls in
> > > > the
> > > > prologues of their handlers.
> > > >
> > > > Signed-off-by: Andrey Konovalov <[email protected]>
> > >
> > > Actually, I don't think any of these wrappers get called (have
> > > you
> > > tested this patch?). Following commit 4378a7d4be30 ("arm64:
> > > implement
> > > syscall wrappers"), I think we have other macro names for
> > > overriding the
> > > sys_* ones.
> >
> > What is the value in adding these wrappers?
>
> Not much value, initially proposed just to keep the core changes
> small.
> I'm fine with moving them back in the generic code (but see below).
>
> I think another aspect is how we define the ABI. Is allowing tags to
> mlock() for example something specific to arm64 or would sparc ADI
> need
> the same? In the absence of other architectures defining such ABI, my
> preference would be to keep the wrappers in the arch code.
>
> Assuming sparc won't implement untagged_addr(), we can place the
> macros
> back in the generic code but, as per the review here, we need to be
> more
> restrictive on where we allow tagged addresses. For example, if
> mmap()
> gets a tagged address with MAP_FIXED, is it expected to return the
> tag?

I would recommend against any ABI differences between ARM64 MTE/TBI and
sparc ADI unless it simply can not be helped. My understanding of
MTE/TBI is limited, so I will explain how sparc ADI works. On sparc, a
tagged address has no meaning until following steps happen:

1. set the user mode PSTATE.mcde bit. This acts as the master switch to
enable ADI for a process.

2. set TTE.mcd bit on TLB entries that match the address range ADI is
being enabled on.

3. Store version tag for the range of addresses userspace wants ADI
enabled on using "stxa" instruction. These tags are stored in physical
memory by the memory controller.

Steps 1 and 2 are accomplished by userspace by calling mprotect() with
PROT_ADI. Tags are set by storing tags in a loop, for example:

version = 10;
tmp_addr = shmaddr;
end = shmaddr + BUFFER_SIZE;
while (tmp_addr < end) {
asm volatile(
"stxa %1, [%0]0x90\n\t"
:
: "r" (tmp_addr), "r" (version));
tmp_addr += adi_blksz;
}

With these semantics, giving mmap() or shamat() a tagged address is
meaningless since no tags have been stored at the addresses mmap() will
allocate and one can not store tags before memory range has been
allocated. If we choose to allow tagged addresses to come into mmap()
and shmat(), sparc code can strip the tags unconditionally and that may
help simplify ABI and/or code.

>
> My thoughts on allowing tags (quick look):
>
> brk - no
> get_mempolicy - yes
> madvise - yes
> mbind - yes
> mincore - yes
> mlock, mlock2, munlock - yes
> mmap - no (we may change this with MTE but not for TBI)
> mmap_pgoff - not used on arm64
> mprotect - yes
> mremap - yes for old_address, no for new_address (on par with mmap)
> msync - yes
> munmap - probably no (mmap does not return tagged ptrs)
> remap_file_pages - no (also deprecated syscall)
> shmat, shmdt - shall we allow tagged addresses on shared memory?
>
> The above is only about the TBI ABI while ignoring hardware MTE. For
> the
> latter, we may want to change the mmap() to allow pre-colouring on
> page
> fault which means that munmap()/mprotect() should also support tagged
> pointers. Possibly mremap() as well but we need to decide whether it
> should allow re-colouring the page (probably no, in which case
> old_address and new_address should have the same tag). For some of
> these
> we'll end up with arm64 specific wrappers again, unless sparc ADI
> adopts
> exactly the same ABI restrictions.
>

Let us keep any restrictions common across ARM64 and sparc. pre-
coloring on sparc in the kernel would mean kernel will have to execute
stxa instructions in a loop for each page being faulted in. Not that
big a deal but doesn't that assume the entire page has the same tag
which is dedcued from the upper bits of address? Shouldn't we support
tags at the same granularity level as what the hardware supports? We
went through this discussion for sparc and decision was to support tags
at the same granularity as hardware. That means we can not deduce tags
from the first address that pioints into an mmap or shmat region. Those
tags and the upper bytes of colored address could change for every
cacheline sized block (64-bytes on sparc M7). We can try to store tags
for an entire region in vma but that is expensive, plus on sparc tags
are set in userspace with no participation from kernel and now we need
a way for userspace to communicate the tags to kernel. From sparc point
of view, making kernel responsible for assigning tags to a page on page
fault is full of pitfalls.

Thanks,
Khalid

2019-05-29 06:12:53

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH v15 00/17] arm64: untag user pointers passed to the kernel

On Tue, May 28, 2019 at 04:14:45PM +0200, Andrey Konovalov wrote:
> Thanks for a lot of valuable input! I've read through all the replies
> and got somewhat lost. What are the changes I need to do to this
> series?
>
> 1. Should I move untagging for memory syscalls back to the generic
> code so other arches would make use of it as well, or should I keep
> the arm64 specific memory syscalls wrappers and address the comments
> on that patch?

It absolutely needs to move to common code. Having arch code leads
to pointless (often unintentional) semantic difference between
architectures, and lots of boilerplate code.

Btw, can anyone of the arm crowd or Khalid comment on the linux-mm
thread on generic gup where I'm dealing with the pre-existing ADI
case of pointer untagging?

2019-05-29 12:14:01

by Catalin Marinas

[permalink] [raw]
Subject: Re: [PATCH v15 00/17] arm64: untag user pointers passed to the kernel

On Tue, May 28, 2019 at 11:11:26PM -0700, Christoph Hellwig wrote:
> On Tue, May 28, 2019 at 04:14:45PM +0200, Andrey Konovalov wrote:
> > Thanks for a lot of valuable input! I've read through all the replies
> > and got somewhat lost. What are the changes I need to do to this
> > series?
> >
> > 1. Should I move untagging for memory syscalls back to the generic
> > code so other arches would make use of it as well, or should I keep
> > the arm64 specific memory syscalls wrappers and address the comments
> > on that patch?
>
> It absolutely needs to move to common code. Having arch code leads
> to pointless (often unintentional) semantic difference between
> architectures, and lots of boilerplate code.

That's fine by me as long as we agree on the semantics (which shouldn't
be hard; Khalid already following up). We should probably also move the
proposed ABI document [1] into a common place (or part of since we'll
have arm64-specifics like prctl() calls to explicitly opt in to memory
tagging).

[1] https://lore.kernel.org/lkml/[email protected]/T/#u

--
Catalin

2019-05-29 12:45:00

by Dave Martin

[permalink] [raw]
Subject: Re: [PATCH v15 05/17] arms64: untag user pointers passed to memory syscalls

On Tue, May 28, 2019 at 05:34:00PM +0100, Catalin Marinas wrote:
> On Tue, May 28, 2019 at 04:56:45PM +0100, Dave P Martin wrote:
> > On Tue, May 28, 2019 at 04:40:58PM +0100, Catalin Marinas wrote:
> >
> > [...]
> >
> > > My thoughts on allowing tags (quick look):
> > >
> > > brk - no
> >
> > [...]
> >
> > > mlock, mlock2, munlock - yes
> > > mmap - no (we may change this with MTE but not for TBI)
> >
> > [...]
> >
> > > mprotect - yes
> >
> > I haven't following this discussion closely... what's the rationale for
> > the inconsistencies here (feel free to refer me back to the discussion
> > if it's elsewhere).
>
> _My_ rationale (feel free to disagree) is that mmap() by default would
> not return a tagged address (ignoring MTE for now). If it gets passed a
> tagged address or a "tagged NULL" (for lack of a better name) we don't
> have clear semantics of whether the returned address should be tagged in
> this ABI relaxation. I'd rather reserve this specific behaviour if we
> overload the non-zero tag meaning of mmap() for MTE. Similar reasoning
> for mremap(), at least on the new_address argument (not entirely sure
> about old_address).
>
> munmap() should probably follow the mmap() rules.
>
> As for brk(), I don't see why the user would need to pass a tagged
> address, we can't associate any meaning to this tag.
>
> For the rest, since it's likely such addresses would have been tagged by
> malloc() in user space, we should allow tagged pointers.

Those arguments seem reasonable. We should try to capture this
somewhere when documenting the ABI.

To be clear, I'm not sure that we should guarantee anywhere that a
tagged pointer is rejected: rather the behaviour should probably be
left unspecified. Then we can tidy it up incrementally.

(The behaviour is unspecified today, in any case.)

Cheers
---Dave

2019-05-29 13:26:18

by Catalin Marinas

[permalink] [raw]
Subject: Re: [PATCH v15 05/17] arms64: untag user pointers passed to memory syscalls

On Wed, May 29, 2019 at 01:42:25PM +0100, Dave P Martin wrote:
> On Tue, May 28, 2019 at 05:34:00PM +0100, Catalin Marinas wrote:
> > On Tue, May 28, 2019 at 04:56:45PM +0100, Dave P Martin wrote:
> > > On Tue, May 28, 2019 at 04:40:58PM +0100, Catalin Marinas wrote:
> > >
> > > [...]
> > >
> > > > My thoughts on allowing tags (quick look):
> > > >
> > > > brk - no
> > >
> > > [...]
> > >
> > > > mlock, mlock2, munlock - yes
> > > > mmap - no (we may change this with MTE but not for TBI)
> > >
> > > [...]
> > >
> > > > mprotect - yes
> > >
> > > I haven't following this discussion closely... what's the rationale for
> > > the inconsistencies here (feel free to refer me back to the discussion
> > > if it's elsewhere).
> >
> > _My_ rationale (feel free to disagree) is that mmap() by default would
> > not return a tagged address (ignoring MTE for now). If it gets passed a
> > tagged address or a "tagged NULL" (for lack of a better name) we don't
> > have clear semantics of whether the returned address should be tagged in
> > this ABI relaxation. I'd rather reserve this specific behaviour if we
> > overload the non-zero tag meaning of mmap() for MTE. Similar reasoning
> > for mremap(), at least on the new_address argument (not entirely sure
> > about old_address).
> >
> > munmap() should probably follow the mmap() rules.
> >
> > As for brk(), I don't see why the user would need to pass a tagged
> > address, we can't associate any meaning to this tag.
> >
> > For the rest, since it's likely such addresses would have been tagged by
> > malloc() in user space, we should allow tagged pointers.
>
> Those arguments seem reasonable. We should try to capture this
> somewhere when documenting the ABI.
>
> To be clear, I'm not sure that we should guarantee anywhere that a
> tagged pointer is rejected: rather the behaviour should probably be
> left unspecified. Then we can tidy it up incrementally.
>
> (The behaviour is unspecified today, in any case.)

What is specified (or rather de-facto ABI) today is that passing a user
address above TASK_SIZE (e.g. non-zero top byte) would fail in most
cases. If we relax this with the TBI we may end up with some de-facto
ABI before we actually get MTE hardware. Tightening it afterwards may be
slightly more problematic, although MTE needs to be an explicit opt-in.

IOW, I wouldn't want to unnecessarily relax the ABI if we don't need to.

--
Catalin

2019-05-29 14:22:51

by Catalin Marinas

[permalink] [raw]
Subject: Re: [PATCH v15 05/17] arms64: untag user pointers passed to memory syscalls

Hi Khalid,

On Tue, May 28, 2019 at 05:33:04PM -0600, Khalid Aziz wrote:
> On Tue, 2019-05-28 at 16:40 +0100, Catalin Marinas wrote:
> > I think another aspect is how we define the ABI. Is allowing tags to
> > mlock() for example something specific to arm64 or would sparc ADI
> > need the same? In the absence of other architectures defining such
> > ABI, my preference would be to keep the wrappers in the arch code.
> >
> > Assuming sparc won't implement untagged_addr(), we can place the
> > macros back in the generic code but, as per the review here, we need
> > to be more restrictive on where we allow tagged addresses. For
> > example, if mmap() gets a tagged address with MAP_FIXED, is it
> > expected to return the tag?
>
> I would recommend against any ABI differences between ARM64 MTE/TBI and
> sparc ADI unless it simply can not be helped. My understanding of
> MTE/TBI is limited, so I will explain how sparc ADI works. On sparc, a
> tagged address has no meaning until following steps happen:

Before we go into the MTE/ADI similarities or differences, just to
clarify that TBI is something that we supported from the start of the
arm64 kernel port. TBI (top byte ignore) allows a user pointer to have
non-zero top byte and dereference it without causing a fault (the
hardware masks it out). The user/kernel ABI does not allow such tagged
pointers into the kernel, nor would the kernel return any such tagged
addresses.

With MTE (memory tagging extensions), the top-byte meaning is changed
from no longer being ignored to actually being checked against a tag in
the physical RAM (we call it allocation tag).

> 1. set the user mode PSTATE.mcde bit. This acts as the master switch to
> enable ADI for a process.
>
> 2. set TTE.mcd bit on TLB entries that match the address range ADI is
> being enabled on.

Something close enough for MTE, with the difference that enabling it is
not a PSTATE bit but rather a system control bit (SCTLR_EL1 register),
so only the kernel can turn it on/off for the user.

> 3. Store version tag for the range of addresses userspace wants ADI
> enabled on using "stxa" instruction. These tags are stored in physical
> memory by the memory controller.

Do you have an "ldxa" instruction to load the tags from physical memory?

> Steps 1 and 2 are accomplished by userspace by calling mprotect() with
> PROT_ADI. Tags are set by storing tags in a loop, for example:
>
> version = 10;
> tmp_addr = shmaddr;
> end = shmaddr + BUFFER_SIZE;
> while (tmp_addr < end) {
> asm volatile(
> "stxa %1, [%0]0x90\n\t"
> :
> : "r" (tmp_addr), "r" (version));
> tmp_addr += adi_blksz;
> }

On arm64, a sequence similar to the above would live in the libc. So a
malloc() call will tag the memory and return the tagged address to the
user.

We were not planning for a PROT_ADI/MTE but rather have MTE enabled for
all user memory ranges. We may revisit this before we upstream the MTE
support (probably some marginal benefit for the hardware not fetching
the tags from memory if we don't need to, e.g. code sections).

Given that we already have the TBI feature and with MTE enabled the top
byte is no longer ignored, we are planning for an explicit opt-in by the
user via prctl() to enable MTE.

> With these semantics, giving mmap() or shamat() a tagged address is
> meaningless since no tags have been stored at the addresses mmap() will
> allocate and one can not store tags before memory range has been
> allocated. If we choose to allow tagged addresses to come into mmap()
> and shmat(), sparc code can strip the tags unconditionally and that may
> help simplify ABI and/or code.

We could say that with TBI (pre-MTE support), the top byte is actually
ignored on mmap(). Now, if you pass a MAP_FIXED with a tagged address,
should the user expect the same tagged address back or stripping the tag
is acceptable? If we want to keep the current mmap() semantics, I'd say
the same tag is returned. However, with MTE this also implies that the
memory was coloured.

> > My thoughts on allowing tags (quick look):
> >
> > brk - no
> > get_mempolicy - yes
> > madvise - yes
> > mbind - yes
> > mincore - yes
> > mlock, mlock2, munlock - yes
> > mmap - no (we may change this with MTE but not for TBI)
> > mmap_pgoff - not used on arm64
> > mprotect - yes
> > mremap - yes for old_address, no for new_address (on par with mmap)
> > msync - yes
> > munmap - probably no (mmap does not return tagged ptrs)
> > remap_file_pages - no (also deprecated syscall)
> > shmat, shmdt - shall we allow tagged addresses on shared memory?
> >
> > The above is only about the TBI ABI while ignoring hardware MTE. For
> > the latter, we may want to change the mmap() to allow pre-colouring
> > on page fault which means that munmap()/mprotect() should also
> > support tagged pointers. Possibly mremap() as well but we need to
> > decide whether it should allow re-colouring the page (probably no,
> > in which case old_address and new_address should have the same tag).
> > For some of these we'll end up with arm64 specific wrappers again,
> > unless sparc ADI adopts exactly the same ABI restrictions.
>
> Let us keep any restrictions common across ARM64 and sparc. pre-
> coloring on sparc in the kernel would mean kernel will have to execute
> stxa instructions in a loop for each page being faulted in.

Since the user can probe the pre-existing colour in a faulted-in page
(either with some 'ldxa' instruction or by performing a tag-checked
access), the kernel should always pre-colour (even if colour 0) any
allocated page. There might not be an obvious security risk but I feel
uneasy about letting colours leak between address spaces (different user
processes or between kernel and user).

Since we already need such loop in the kernel, we might as well allow
user space to require a certain colour. This comes in handy for large
malloc() and another advantage is that the C library won't be stuck
trying to paint the whole range (think GB).

> Not that big a deal but doesn't that assume the entire page has the
> same tag which is dedcued from the upper bits of address? Shouldn't we
> support tags at the same granularity level as what the hardware
> supports?

That's mostly about large malloc() optimisation via mmap(), the latter
working on page granularity already. There is another use-case for
pre-coloured thread stacks, also allocated via anonymous mmap().

> We went through this discussion for sparc and decision was to support
> tags at the same granularity as hardware. That means we can not deduce
> tags from the first address that pioints into an mmap or shmat region.
> Those tags and the upper bytes of colored address could change for
> every cacheline sized block (64-bytes on sparc M7).

It's 16-byte for arm64, so smaller than the cacheline.

> We can try to store tags for an entire region in vma but that is
> expensive, plus on sparc tags are set in userspace with no
> participation from kernel and now we need a way for userspace to
> communicate the tags to kernel.

We can't support finer granularity through the mmap() syscall and, as
you said, the vma is not the right thing to store the individual tags.
With the above extension to mmap(), we'd have to store a colour per vma
and prevent merging if different colours (we could as well use the
pkeys mechanism we already have in the kernel but use a colour per vma
instead of a key).

Of course, the user is allowed to change the in-memory colours at a
finer granularity and the kernel will preserve them during swapping
out/in, page migration etc. The above mmap() proposal is just for the
first fault-in of a page in a given range/vma.

> From sparc point of view, making kernel responsible for assigning tags
> to a page on page fault is full of pitfalls.

This could be just some arm64-specific but if you plan to deploy it more
generically for sparc (at the C library level), you may find this
useful.

--
Catalin

2019-05-29 14:52:16

by Khalid Aziz

[permalink] [raw]
Subject: Re: [PATCH v15 01/17] uaccess: add untagged_addr definition for other arches

On Mon, 2019-05-06 at 18:30 +0200, Andrey Konovalov wrote:
> To allow arm64 syscalls to accept tagged pointers from userspace, we
> must
> untag them when they are passed to the kernel. Since untagging is
> done in
> generic parts of the kernel, the untagged_addr macro needs to be
> defined
> for all architectures.
>
> Define it as a noop for architectures other than arm64.
>
> Acked-by: Catalin Marinas <[email protected]>
> Signed-off-by: Andrey Konovalov <[email protected]>
> ---
> include/linux/mm.h | 4 ++++
> 1 file changed, 4 insertions(+)

As discussed in the other thread Chris started, there is a generic need
to untag addresses in kernel and this patch gets us ready for that.

Reviewed-by: Khalid Aziz <[email protected]>

>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 6b10c21630f5..44041df804a6 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -99,6 +99,10 @@ extern int mmap_rnd_compat_bits __read_mostly;
> #include <asm/pgtable.h>
> #include <asm/processor.h>
>
> +#ifndef untagged_addr
> +#define untagged_addr(addr) (addr)
> +#endif
> +
> #ifndef __pa_symbol
> #define __pa_symbol(x) __pa(RELOC_HIDE((unsigned long)(x), 0))
> #endif

2019-05-29 15:20:23

by Dave Martin

[permalink] [raw]
Subject: Re: [PATCH v15 05/17] arms64: untag user pointers passed to memory syscalls

On Wed, May 29, 2019 at 02:23:42PM +0100, Catalin Marinas wrote:
> On Wed, May 29, 2019 at 01:42:25PM +0100, Dave P Martin wrote:
> > On Tue, May 28, 2019 at 05:34:00PM +0100, Catalin Marinas wrote:
> > > On Tue, May 28, 2019 at 04:56:45PM +0100, Dave P Martin wrote:
> > > > On Tue, May 28, 2019 at 04:40:58PM +0100, Catalin Marinas wrote:
> > > >
> > > > [...]
> > > >
> > > > > My thoughts on allowing tags (quick look):
> > > > >
> > > > > brk - no
> > > >
> > > > [...]
> > > >
> > > > > mlock, mlock2, munlock - yes
> > > > > mmap - no (we may change this with MTE but not for TBI)
> > > >
> > > > [...]
> > > >
> > > > > mprotect - yes
> > > >
> > > > I haven't following this discussion closely... what's the rationale for
> > > > the inconsistencies here (feel free to refer me back to the discussion
> > > > if it's elsewhere).
> > >
> > > _My_ rationale (feel free to disagree) is that mmap() by default would
> > > not return a tagged address (ignoring MTE for now). If it gets passed a
> > > tagged address or a "tagged NULL" (for lack of a better name) we don't
> > > have clear semantics of whether the returned address should be tagged in
> > > this ABI relaxation. I'd rather reserve this specific behaviour if we
> > > overload the non-zero tag meaning of mmap() for MTE. Similar reasoning
> > > for mremap(), at least on the new_address argument (not entirely sure
> > > about old_address).
> > >
> > > munmap() should probably follow the mmap() rules.
> > >
> > > As for brk(), I don't see why the user would need to pass a tagged
> > > address, we can't associate any meaning to this tag.
> > >
> > > For the rest, since it's likely such addresses would have been tagged by
> > > malloc() in user space, we should allow tagged pointers.
> >
> > Those arguments seem reasonable. We should try to capture this
> > somewhere when documenting the ABI.
> >
> > To be clear, I'm not sure that we should guarantee anywhere that a
> > tagged pointer is rejected: rather the behaviour should probably be
> > left unspecified. Then we can tidy it up incrementally.
> >
> > (The behaviour is unspecified today, in any case.)
>
> What is specified (or rather de-facto ABI) today is that passing a user
> address above TASK_SIZE (e.g. non-zero top byte) would fail in most
> cases. If we relax this with the TBI we may end up with some de-facto

I may be being too picky, but "would fail in most cases" sounds like
"unspecified" ?

> ABI before we actually get MTE hardware. Tightening it afterwards may be
> slightly more problematic, although MTE needs to be an explicit opt-in.
>
> IOW, I wouldn't want to unnecessarily relax the ABI if we don't need to.

So long we don't block foreseeable future developments unnecessarily
either -- I agree there's a balance to be struck.

I guess this can be reviewed when we have nailed down the details a bit
further.

Cheers
---Dave

2019-05-29 19:19:32

by Khalid Aziz

[permalink] [raw]
Subject: Re: [PATCH v15 05/17] arms64: untag user pointers passed to memory syscalls

On 5/29/19 8:20 AM, Catalin Marinas wrote:
> Hi Khalid,
>
> On Tue, May 28, 2019 at 05:33:04PM -0600, Khalid Aziz wrote:
>> On Tue, 2019-05-28 at 16:40 +0100, Catalin Marinas wrote:
>>> I think another aspect is how we define the ABI. Is allowing tags to
>>> mlock() for example something specific to arm64 or would sparc ADI
>>> need the same? In the absence of other architectures defining such
>>> ABI, my preference would be to keep the wrappers in the arch code.
>>>
>>> Assuming sparc won't implement untagged_addr(), we can place the
>>> macros back in the generic code but, as per the review here, we need
>>> to be more restrictive on where we allow tagged addresses. For
>>> example, if mmap() gets a tagged address with MAP_FIXED, is it
>>> expected to return the tag?
>>
>> I would recommend against any ABI differences between ARM64 MTE/TBI and
>> sparc ADI unless it simply can not be helped. My understanding of
>> MTE/TBI is limited, so I will explain how sparc ADI works. On sparc, a
>> tagged address has no meaning until following steps happen:
>
> Before we go into the MTE/ADI similarities or differences, just to
> clarify that TBI is something that we supported from the start of the
> arm64 kernel port. TBI (top byte ignore) allows a user pointer to have
> non-zero top byte and dereference it without causing a fault (the
> hardware masks it out). The user/kernel ABI does not allow such tagged
> pointers into the kernel, nor would the kernel return any such tagged
> addresses.
>
> With MTE (memory tagging extensions), the top-byte meaning is changed
> from no longer being ignored to actually being checked against a tag in
> the physical RAM (we call it allocation tag).
>
>> 1. set the user mode PSTATE.mcde bit. This acts as the master switch to
>> enable ADI for a process.
>>
>> 2. set TTE.mcd bit on TLB entries that match the address range ADI is
>> being enabled on.
>
> Something close enough for MTE, with the difference that enabling it is
> not a PSTATE bit but rather a system control bit (SCTLR_EL1 register),
> so only the kernel can turn it on/off for the user.
>
>> 3. Store version tag for the range of addresses userspace wants ADI
>> enabled on using "stxa" instruction. These tags are stored in physical
>> memory by the memory controller.
>
> Do you have an "ldxa" instruction to load the tags from physical memory?

Yes, "ldxa" can be used to read current tag for any memory location.
Kernel uses it to read the tags for a physical page being swapped out
and restores those tags when the page is swapped back in.

>
>> Steps 1 and 2 are accomplished by userspace by calling mprotect() with
>> PROT_ADI. Tags are set by storing tags in a loop, for example:
>>
>> version = 10;
>> tmp_addr = shmaddr;
>> end = shmaddr + BUFFER_SIZE;
>> while (tmp_addr < end) {
>> asm volatile(
>> "stxa %1, [%0]0x90\n\t"
>> :
>> : "r" (tmp_addr), "r" (version));
>> tmp_addr += adi_blksz;
>> }
>
> On arm64, a sequence similar to the above would live in the libc. So a
> malloc() call will tag the memory and return the tagged address to thePre-coloring could easily be done by
> user.
>
> We were not planning for a PROT_ADI/MTE but rather have MTE enabled for
> all user memory ranges. We may revisit this before we upstream the MTE
> support (probably some marginal benefit for the hardware not fetching
> the tags from memory if we don't need to, e.g. code sections).
>
> Given that we already have the TBI feature and with MTE enabled the top
> byte is no longer ignored, we are planning for an explicit opt-in by the
> user via prctl() to enable MTE.

OK. I had initially proposed enabling ADI for a process using prctl().
Feedback I got was prctl was not a desirable interface and I ended up
making mprotect() with PROT_ADI enable ADI on the process instead. Just
something to keep in mind.

>
>> With these semantics, giving mmap() or shamat() a tagged address is
>> meaningless since no tags have been stored at the addresses mmap() will
>> allocate and one can not store tags before memory range has been
>> allocated. If we choose to allow tagged addresses to come into mmap()
>> and shmat(), sparc code can strip the tags unconditionally and that may
>> help simplify ABI and/or code.
>
> We could say that with TBI (pre-MTE support), the top byte is actually
> ignored on mmap(). Now, if you pass a MAP_FIXED with a tagged address,
> should the user expect the same tagged address back or stripping the tag
> is acceptable? If we want to keep the current mmap() semantics, I'd say
> the same tag is returned. However, with MTE this also implies that the
> memory was coloured.
>

Is assigning a tag aprivileged operationon ARM64? I am thinking not
since you mentioned libc could do it in a loop for malloc'd memory.
mmap() can return the same tagged address but I am uneasy about kernel
pre-coloring the pages. Database can mmap 100's of GB of memory. That is
lot of work being offloaded to the kernel to pre-color the page even if
done in batches as pages are faulted in.

>>> My thoughts on allowing tags (quick look):
>>>
>>> brk - no
>>> get_mempolicy - yes
>>> madvise - yes
>>> mbind - yes
>>> mincore - yes
>>> mlock, mlock2, munlock - yes
>>> mmap - no (we may change this with MTE but not for TBI)
>>> mmap_pgoff - not used on arm64
>>> mprotect - yes
>>> mremap - yes for old_address, no for new_address (on par with mmap)
>>> msync - yes
>>> munmap - probably no (mmap does not return tagged ptrs)
>>> remap_file_pages - no (also deprecated syscall)
>>> shmat, shmdt - shall we allow tagged addresses on shared memory?
>>>
>>> The above is only about the TBI ABI while ignoring hardware MTE. For
>>> the latter, we may want to change the mmap() to allow pre-colouring
>>> on page fault which means that munmap()/mprotect() should also
>>> support tagged pointers. Possibly mremap() as well but we need to
>>> decide whether it should allow re-colouring the page (probably no,
>>> in which case old_address and new_address should have the same tag).
>>> For some of these we'll end up with arm64 specific wrappers again,
>>> unless sparc ADI adopts exactly the same ABI restrictions.
>>
>> Let us keep any restrictions common across ARM64 and sparc. pre-
>> coloring on sparc in the kernel would mean kernel will have to execute
>> stxa instructions in a loop for each page being faulted in.
>
> Since the user can probe the pre-existing colour in a faulted-in page
> (either with some 'ldxa' instruction or by performing a tag-checked
> access), the kernel should always pre-colour (even if colour 0) any
> allocated page. There might not be an obvious security risk but I feel
> uneasy about letting colours leak between address spaces (different user
> processes or between kernel and user).

On sparc, tags 0 and 15 are special in that 0 means untagged memory and
15 means match any tag in the address. Colour 0 is the default for any
newly faulted in page on sparc.

>
> Since we already need such loop in the kernel, we might as well allow
> user space to require a certain colour. This comes in handy for large
> malloc() and another advantage is that the C library won't be stuck
> trying to paint the whole range (think GB).

If kernel is going to pre-color all pages in a vma, we will need to
store the default tag in the vma. It will add more time to page fault
handling code. On sparc M7, kernel will need to execute additional 128
stxa instructions to set the tags on a normal page.

>
>> Not that big a deal but doesn't that assume the entire page has the
>> same tag which is dedcued from the upper bits of address? Shouldn't we
>> support tags at the same granularity level as what the hardware
>> supports?
>
> That's mostly about large malloc() optimisation via mmap(), the latter
> working on page granularity already. There is another use-case for
> pre-coloured thread stacks, also allocated via anonymous mmap().
>
>> We went through this discussion for sparc and decision was to support
>> tags at the same granularity as hardware. That means we can not deduce
>> tags from the first address that pioints into an mmap or shmat region.
>> Those tags and the upper bytes of colored address could change for
>> every cacheline sized block (64-bytes on sparc M7).
>
> It's 16-byte for arm64, so smaller than the cacheline.
>
>> We can try to store tags for an entire region in vma but that is
>> expensive, plus on sparc tags are set in userspace with no
>> participation from kernel and now we need a way for userspace to
>> communicate the tags to kernel.
>
> We can't support finer granularity through the mmap() syscall and, as
> you said, the vma is not the right thing to store the individual tags.
> With the above extension to mmap(), we'd have to store a colour per vma
> and prevent merging if different colours (we could as well use the
> pkeys mechanism we already have in the kernel but use a colour per vma
> instead of a key).

Since tags can change on any part of mmap region on sparc at any time
without kernel being involved, I am not sure I see much reason for
kernel to enforce any tag related restrictions.

>
> Of course, the user is allowed to change the in-memory colours at a
> finer granularity and the kernel will preserve them during swapping
> out/in, page migration etc. The above mmap() proposal is just for the
> first fault-in of a page in a given range/vma.
>
>> From sparc point of view, making kernel responsible for assigning tags
>> to a page on page fault is full of pitfalls.
>
> This could be just some arm64-specific but if you plan to deploy it more
> generically for sparc (at the C library level), you may find this
> useful.
>

Common semantics from app developer point of view will be very useful to
maintain. If arm64 says mmap with MAP_FIXED and a tagged address will
return a pre-colored page, I would rather have it be the same on any
architecture. Is there a use case that justifies kernel doing this extra
work?

--
Khalid

2019-05-30 15:14:22

by Catalin Marinas

[permalink] [raw]
Subject: Re: [PATCH v15 05/17] arms64: untag user pointers passed to memory syscalls

On Wed, May 29, 2019 at 01:16:37PM -0600, Khalid Aziz wrote:
> On 5/29/19 8:20 AM, Catalin Marinas wrote:
> > On Tue, May 28, 2019 at 05:33:04PM -0600, Khalid Aziz wrote:
> >> Steps 1 and 2 are accomplished by userspace by calling mprotect() with
> >> PROT_ADI. Tags are set by storing tags in a loop, for example:
> >>
> >> version = 10;
> >> tmp_addr = shmaddr;
> >> end = shmaddr + BUFFER_SIZE;
> >> while (tmp_addr < end) {
> >> asm volatile(
> >> "stxa %1, [%0]0x90\n\t"
> >> :
> >> : "r" (tmp_addr), "r" (version));
> >> tmp_addr += adi_blksz;
> >> }
> >
> > On arm64, a sequence similar to the above would live in the libc. So a
> > malloc() call will tag the memory and return the tagged address to thePre-coloring could easily be done by
> > user.
> >
> > We were not planning for a PROT_ADI/MTE but rather have MTE enabled for
> > all user memory ranges. We may revisit this before we upstream the MTE
> > support (probably some marginal benefit for the hardware not fetching
> > the tags from memory if we don't need to, e.g. code sections).
> >
> > Given that we already have the TBI feature and with MTE enabled the top
> > byte is no longer ignored, we are planning for an explicit opt-in by the
> > user via prctl() to enable MTE.
>
> OK. I had initially proposed enabling ADI for a process using prctl().
> Feedback I got was prctl was not a desirable interface and I ended up
> making mprotect() with PROT_ADI enable ADI on the process instead. Just
> something to keep in mind.

Thanks for the feedback. We'll keep this in mind when adding MTE
support. In the way we plan to deploy this, it would be a libc decision
to invoke the mmap() with the right flag.

This could actually simplify the automatic page faulting below brk(),
basically no tagged/coloured memory allowed implicitly. It needs
feedback from the bionic/glibc folk.

> >> With these semantics, giving mmap() or shamat() a tagged address is
> >> meaningless since no tags have been stored at the addresses mmap() will
> >> allocate and one can not store tags before memory range has been
> >> allocated. If we choose to allow tagged addresses to come into mmap()
> >> and shmat(), sparc code can strip the tags unconditionally and that may
> >> help simplify ABI and/or code.
> >
> > We could say that with TBI (pre-MTE support), the top byte is actually
> > ignored on mmap(). Now, if you pass a MAP_FIXED with a tagged address,
> > should the user expect the same tagged address back or stripping the tag
> > is acceptable? If we want to keep the current mmap() semantics, I'd say
> > the same tag is returned. However, with MTE this also implies that the
> > memory was coloured.
>
> Is assigning a tag aprivileged operationon ARM64? I am thinking not
> since you mentioned libc could do it in a loop for malloc'd memory.

Indeed it's not, the user can do it.

> mmap() can return the same tagged address but I am uneasy about kernel
> pre-coloring the pages. Database can mmap 100's of GB of memory. That is
> lot of work being offloaded to the kernel to pre-color the page even if
> done in batches as pages are faulted in.

For anonymous mmap() for example, the kernel would have to zero the
faulted in pages anyway. We can handle the colouring at the same time in
clear_user_page() (as I said below, we have to clear the colour anyway
from previous uses, so it's simply extending this to support something
other than tag/colour 0 by default with no additional overhead).

> > Since the user can probe the pre-existing colour in a faulted-in page
> > (either with some 'ldxa' instruction or by performing a tag-checked
> > access), the kernel should always pre-colour (even if colour 0) any
> > allocated page. There might not be an obvious security risk but I feel
> > uneasy about letting colours leak between address spaces (different user
> > processes or between kernel and user).
>
> On sparc, tags 0 and 15 are special in that 0 means untagged memory and
> 15 means match any tag in the address. Colour 0 is the default for any
> newly faulted in page on sparc.

With MTE we don't have match-all/any tag in memory, only in the virtual
address/pointer. So if we turn on MTE for all pages and the user
accesses an address with a 0 tag, the underlying memory needs to be
coloured with the same 0 value.

> > Since we already need such loop in the kernel, we might as well allow
> > user space to require a certain colour. This comes in handy for large
> > malloc() and another advantage is that the C library won't be stuck
> > trying to paint the whole range (think GB).
>
> If kernel is going to pre-color all pages in a vma, we will need to
> store the default tag in the vma. It will add more time to page fault
> handling code. On sparc M7, kernel will need to execute additional 128
> stxa instructions to set the tags on a normal page.

As I said, since the user can retrieve an old colour using ldxa, the
kernel should perform this operation anyway on any newly allocated page
(unless you clear the existing colour on page freeing).

> >> We can try to store tags for an entire region in vma but that is
> >> expensive, plus on sparc tags are set in userspace with no
> >> participation from kernel and now we need a way for userspace to
> >> communicate the tags to kernel.
> >
> > We can't support finer granularity through the mmap() syscall and, as
> > you said, the vma is not the right thing to store the individual tags.
> > With the above extension to mmap(), we'd have to store a colour per vma
> > and prevent merging if different colours (we could as well use the
> > pkeys mechanism we already have in the kernel but use a colour per vma
> > instead of a key).
>
> Since tags can change on any part of mmap region on sparc at any time
> without kernel being involved, I am not sure I see much reason for
> kernel to enforce any tag related restrictions.

It's not enforcing a tag, more like the default colour for a faulted in
page. Anyway, if sparc is going with default 0/untagged, that's fine as
well. We may add this mmap() option to arm64 only.

> >> From sparc point of view, making kernel responsible for assigning tags
> >> to a page on page fault is full of pitfalls.
> >
> > This could be just some arm64-specific but if you plan to deploy it more
> > generically for sparc (at the C library level), you may find this
> > useful.
>
> Common semantics from app developer point of view will be very useful to
> maintain. If arm64 says mmap with MAP_FIXED and a tagged address will
> return a pre-colored page, I would rather have it be the same on any
> architecture. Is there a use case that justifies kernel doing this extra
> work?

So if a database program is doing an anonymous mmap(PROT_TBI) of 100GB,
IIUC for sparc the faulted-in pages will have random colours (on 64-byte
granularity). Ignoring the information leak from prior uses of such
pages, it would be the responsibility of the db program to issue the
stxa. On arm64, since we also want to do this via malloc(), any large
allocation would require all pages to be faulted in so that malloc() can
set the write colour before being handed over to the user. That's what
we want to avoid and the user is free to repaint the memory as it likes.

--
Catalin

2019-05-30 16:08:57

by Khalid Aziz

[permalink] [raw]
Subject: Re: [PATCH v15 05/17] arms64: untag user pointers passed to memory syscalls

On 5/30/19 9:11 AM, Catalin Marinas wrote:
> On Wed, May 29, 2019 at 01:16:37PM -0600, Khalid Aziz wrote:
>> mmap() can return the same tagged address but I am uneasy about kernel
>> pre-coloring the pages. Database can mmap 100's of GB of memory. That is
>> lot of work being offloaded to the kernel to pre-color the page even if
>> done in batches as pages are faulted in.
>
> For anonymous mmap() for example, the kernel would have to zero the
> faulted in pages anyway. We can handle the colouring at the same time in
> clear_user_page() (as I said below, we have to clear the colour anyway
> from previous uses, so it's simply extending this to support something
> other than tag/colour 0 by default with no additional overhead).
>

On sparc M7, clear_user_page() ends up in M7clear_user_page defined in
arch/sparc/lib/M7memset.S. M7 code use Block Init Store (BIS) to clear
the page. BIS on M7 clears the memory tags as well and no separate
instructions are needed to clear the tags. As a result when kernel
clears a page before returning it to user, the page is not only zeroed
out, its tags are also cleared to 0.

>>> Since we already need such loop in the kernel, we might as well allow
>>> user space to require a certain colour. This comes in handy for large
>>> malloc() and another advantage is that the C library won't be stuck
>>> trying to paint the whole range (think GB).
>>
>> If kernel is going to pre-color all pages in a vma, we will need to
>> store the default tag in the vma. It will add more time to page fault
>> handling code. On sparc M7, kernel will need to execute additional 128
>> stxa instructions to set the tags on a normal page.
>
> As I said, since the user can retrieve an old colour using ldxa, the
> kernel should perform this operation anyway on any newly allocated page
> (unless you clear the existing colour on page freeing).>

Tags are not cleared on sparc on freeing. They get cleared when the page
is allocated again.

>>>> We can try to store tags for an entire region in vma but that is
>>>> expensive, plus on sparc tags are set in userspace with no
>>>> participation from kernel and now we need a way for userspace to
>>>> communicate the tags to kernel.
>>>
>>> We can't support finer granularity through the mmap() syscall and, as
>>> you said, the vma is not the right thing to store the individual tags.
>>> With the above extension to mmap(), we'd have to store a colour per vma
>>> and prevent merging if different colours (we could as well use the
>>> pkeys mechanism we already have in the kernel but use a colour per vma
>>> instead of a key).
>>
>> Since tags can change on any part of mmap region on sparc at any time
>> without kernel being involved, I am not sure I see much reason for
>> kernel to enforce any tag related restrictions.
>
> It's not enforcing a tag, more like the default colour for a faulted in
> page. Anyway, if sparc is going with default 0/untagged, that's fine as
> well. We may add this mmap() option to arm64 only.
>
>>>> From sparc point of view, making kernel responsible for assigning tags
>>>> to a page on page fault is full of pitfalls.
>>>
>>> This could be just some arm64-specific but if you plan to deploy it more
>>> generically for sparc (at the C library level), you may find this
>>> useful.
>>
>> Common semantics from app developer point of view will be very useful to
>> maintain. If arm64 says mmap with MAP_FIXED and a tagged address will
>> return a pre-colored page, I would rather have it be the same on any
>> architecture. Is there a use case that justifies kernel doing this extra
>> work?
>
> So if a database program is doing an anonymous mmap(PROT_TBI) of 100GB,
> IIUC for sparc the faulted-in pages will have random colours (on 64-byte
> granularity). Ignoring the information leak from prior uses of such
> pages, it would be the responsibility of the db program to issue the
> stxa. On arm64, since we also want to do this via malloc(), any large
> allocation would require all pages to be faulted in so that malloc() can
> set the write colour before being handed over to the user. That's what
> we want to avoid and the user is free to repaint the memory as it likes.
>

On sparc, any newly allocated page is cleared along with any old tags on
it. Since clearing tag happens automatically when page is cleared on
sparc, clear_user_page() will need to execute additional stxa
instructions to set a new tag. It is doable. In a way it is done already
if page is being pre-colored with tag 0 always ;) Where would the
pre-defined tag be stored - as part of address stored in vm_start or a
new field in vm_area_struct?

--
Khalid

2019-05-30 16:59:38

by Catalin Marinas

[permalink] [raw]
Subject: Re: [PATCH v15 05/17] arms64: untag user pointers passed to memory syscalls

On Thu, May 30, 2019 at 10:05:55AM -0600, Khalid Aziz wrote:
> On 5/30/19 9:11 AM, Catalin Marinas wrote:
> > So if a database program is doing an anonymous mmap(PROT_TBI) of 100GB,
> > IIUC for sparc the faulted-in pages will have random colours (on 64-byte
> > granularity). Ignoring the information leak from prior uses of such
> > pages, it would be the responsibility of the db program to issue the
> > stxa. On arm64, since we also want to do this via malloc(), any large
> > allocation would require all pages to be faulted in so that malloc() can
> > set the write colour before being handed over to the user. That's what
> > we want to avoid and the user is free to repaint the memory as it likes.
>
> On sparc, any newly allocated page is cleared along with any old tags on
> it. Since clearing tag happens automatically when page is cleared on
> sparc, clear_user_page() will need to execute additional stxa
> instructions to set a new tag. It is doable. In a way it is done already
> if page is being pre-colored with tag 0 always ;)

Ah, good to know. On arm64 we'd have to use different instructions,
although the same loop.

> Where would the pre-defined tag be stored - as part of address stored
> in vm_start or a new field in vm_area_struct?

I think we can discuss the details when we post the actual MTE patches.
In our internal hack we overloaded the VM_HIGH_ARCH_* flags and selected
CONFIG_ARCH_USES_HIGH_VMA_FLAGS (used for pkeys on x86).

For the time being, I'd rather restrict tagged addresses passed to
mmap() until we agreed that they have any meaning. If we allowed them
now but get ignored (though probably no-one would be doing this), I feel
it's slightly harder to change the semantics afterwards.

Thanks.

--
Catalin

2019-05-30 17:17:14

by Catalin Marinas

[permalink] [raw]
Subject: Re: [PATCH v15 00/17] arm64: untag user pointers passed to the kernel

On Tue, May 28, 2019 at 04:14:45PM +0200, Andrey Konovalov wrote:
> Thanks for a lot of valuable input! I've read through all the replies
> and got somewhat lost. What are the changes I need to do to this
> series?
>
> 1. Should I move untagging for memory syscalls back to the generic
> code so other arches would make use of it as well, or should I keep
> the arm64 specific memory syscalls wrappers and address the comments
> on that patch?

Keep them generic again but make sure we get agreement with Khalid on
the actual ABI implications for sparc.

> 2. Should I make untagging opt-in and controlled by a command line argument?

Opt-in, yes, but per task rather than kernel command line option.
prctl() is a possibility of opting in.

> 3. Should I "add Documentation/core-api/user-addresses.rst to describe
> proper care and handling of user space pointers with untagged_addr(),
> with examples based on all the cases seen so far in this series"?
> Which examples specifically should it cover?

I think we can leave 3 for now as not too urgent. What I'd like is for
Vincenzo's TBI user ABI document to go into a more common place since we
can expand it to cover both sparc and arm64. We'd need an arm64-specific
doc as well for things like prctl() and later MTE that sparc may support
differently.

--
Catalin

2019-05-31 14:23:57

by Andrey Konovalov

[permalink] [raw]
Subject: Re: [PATCH v15 17/17] selftests, arm64: add a selftest for passing tagged pointers to kernel

On Wed, May 22, 2019 at 4:16 PM Catalin Marinas <[email protected]> wrote:
>
> On Mon, May 06, 2019 at 06:31:03PM +0200, Andrey Konovalov wrote:
> > This patch is a part of a series that extends arm64 kernel ABI to allow to
> > pass tagged user pointers (with the top byte set to something else other
> > than 0x00) as syscall arguments.
> >
> > This patch adds a simple test, that calls the uname syscall with a
> > tagged user pointer as an argument. Without the kernel accepting tagged
> > user pointers the test fails with EFAULT.
>
> That's probably sufficient for a simple example. Something we could add
> to Documentation maybe is a small library that can be LD_PRELOAD'ed so
> that you can run a lot more tests like LTP.

Should I add this into this series, or should this go into Vincenzo's patchset?

>
> We could add this to selftests but I think it's too glibc specific.
>
> --------------------8<------------------------------------
> #include <stdlib.h>
>
> #define TAG_SHIFT (56)
> #define TAG_MASK (0xffUL << TAG_SHIFT)
>
> void *__libc_malloc(size_t size);
> void __libc_free(void *ptr);
> void *__libc_realloc(void *ptr, size_t size);
> void *__libc_calloc(size_t nmemb, size_t size);
>
> static void *tag_ptr(void *ptr)
> {
> unsigned long tag = rand() & 0xff;
> if (!ptr)
> return ptr;
> return (void *)((unsigned long)ptr | (tag << TAG_SHIFT));
> }
>
> static void *untag_ptr(void *ptr)
> {
> return (void *)((unsigned long)ptr & ~TAG_MASK);
> }
>
> void *malloc(size_t size)
> {
> return tag_ptr(__libc_malloc(size));
> }
>
> void free(void *ptr)
> {
> __libc_free(untag_ptr(ptr));
> }
>
> void *realloc(void *ptr, size_t size)
> {
> return tag_ptr(__libc_realloc(untag_ptr(ptr), size));
> }
>
> void *calloc(size_t nmemb, size_t size)
> {
> return tag_ptr(__libc_calloc(nmemb, size));
> }

2019-05-31 14:32:03

by Andrey Konovalov

[permalink] [raw]
Subject: Re: [PATCH v15 00/17] arm64: untag user pointers passed to the kernel

On Thu, May 30, 2019 at 7:15 PM Catalin Marinas <[email protected]> wrote:
>
> On Tue, May 28, 2019 at 04:14:45PM +0200, Andrey Konovalov wrote:
> > Thanks for a lot of valuable input! I've read through all the replies
> > and got somewhat lost. What are the changes I need to do to this
> > series?
> >
> > 1. Should I move untagging for memory syscalls back to the generic
> > code so other arches would make use of it as well, or should I keep
> > the arm64 specific memory syscalls wrappers and address the comments
> > on that patch?
>
> Keep them generic again but make sure we get agreement with Khalid on
> the actual ABI implications for sparc.

OK, will do. I find it hard to understand what the ABI implications
are. I'll post the next version without untagging in brk, mmap,
munmap, mremap (for new_address), mmap_pgoff, remap_file_pages, shmat
and shmdt.

>
> > 2. Should I make untagging opt-in and controlled by a command line argument?
>
> Opt-in, yes, but per task rather than kernel command line option.
> prctl() is a possibility of opting in.

OK. Should I store a flag somewhere in task_struct? Should it be
inheritable on clone?

>
> > 3. Should I "add Documentation/core-api/user-addresses.rst to describe
> > proper care and handling of user space pointers with untagged_addr(),
> > with examples based on all the cases seen so far in this series"?
> > Which examples specifically should it cover?
>
> I think we can leave 3 for now as not too urgent. What I'd like is for
> Vincenzo's TBI user ABI document to go into a more common place since we
> can expand it to cover both sparc and arm64. We'd need an arm64-specific
> doc as well for things like prctl() and later MTE that sparc may support
> differently.

OK.

>
> --
> Catalin

2019-05-31 16:22:03

by Catalin Marinas

[permalink] [raw]
Subject: Re: [PATCH v15 00/17] arm64: untag user pointers passed to the kernel

On Fri, May 31, 2019 at 04:29:10PM +0200, Andrey Konovalov wrote:
> On Thu, May 30, 2019 at 7:15 PM Catalin Marinas <[email protected]> wrote:
> > On Tue, May 28, 2019 at 04:14:45PM +0200, Andrey Konovalov wrote:
> > > Thanks for a lot of valuable input! I've read through all the replies
> > > and got somewhat lost. What are the changes I need to do to this
> > > series?
> > >
> > > 1. Should I move untagging for memory syscalls back to the generic
> > > code so other arches would make use of it as well, or should I keep
> > > the arm64 specific memory syscalls wrappers and address the comments
> > > on that patch?
> >
> > Keep them generic again but make sure we get agreement with Khalid on
> > the actual ABI implications for sparc.
>
> OK, will do. I find it hard to understand what the ABI implications
> are. I'll post the next version without untagging in brk, mmap,
> munmap, mremap (for new_address), mmap_pgoff, remap_file_pages, shmat
> and shmdt.

It's more about not relaxing the ABI to accept non-zero top-byte unless
we have a use-case for it. For mmap() etc., I don't think that's needed
but if you think otherwise, please raise it.

> > > 2. Should I make untagging opt-in and controlled by a command line argument?
> >
> > Opt-in, yes, but per task rather than kernel command line option.
> > prctl() is a possibility of opting in.
>
> OK. Should I store a flag somewhere in task_struct? Should it be
> inheritable on clone?

A TIF flag would do but I'd say leave it out for now (default opted in)
until we figure out the best way to do this (can be a patch on top of
this series).

Thanks.

--
Catalin

2019-05-31 16:23:46

by Catalin Marinas

[permalink] [raw]
Subject: Re: [PATCH v15 17/17] selftests, arm64: add a selftest for passing tagged pointers to kernel

On Fri, May 31, 2019 at 04:21:48PM +0200, Andrey Konovalov wrote:
> On Wed, May 22, 2019 at 4:16 PM Catalin Marinas <[email protected]> wrote:
> > On Mon, May 06, 2019 at 06:31:03PM +0200, Andrey Konovalov wrote:
> > > This patch is a part of a series that extends arm64 kernel ABI to allow to
> > > pass tagged user pointers (with the top byte set to something else other
> > > than 0x00) as syscall arguments.
> > >
> > > This patch adds a simple test, that calls the uname syscall with a
> > > tagged user pointer as an argument. Without the kernel accepting tagged
> > > user pointers the test fails with EFAULT.
> >
> > That's probably sufficient for a simple example. Something we could add
> > to Documentation maybe is a small library that can be LD_PRELOAD'ed so
> > that you can run a lot more tests like LTP.
>
> Should I add this into this series, or should this go into Vincenzo's patchset?

If you can tweak the selftest Makefile to build a library and force it
with LD_PRELOAD, you can keep it with this patch. It would be easier to
extend to other syscall tests, signal handling etc.

--
Catalin

2019-05-31 16:25:52

by Andrey Konovalov

[permalink] [raw]
Subject: Re: [PATCH v15 00/17] arm64: untag user pointers passed to the kernel

On Fri, May 31, 2019 at 6:20 PM Catalin Marinas <[email protected]> wrote:
>
> On Fri, May 31, 2019 at 04:29:10PM +0200, Andrey Konovalov wrote:
> > On Thu, May 30, 2019 at 7:15 PM Catalin Marinas <[email protected]> wrote:
> > > On Tue, May 28, 2019 at 04:14:45PM +0200, Andrey Konovalov wrote:
> > > > Thanks for a lot of valuable input! I've read through all the replies
> > > > and got somewhat lost. What are the changes I need to do to this
> > > > series?
> > > >
> > > > 1. Should I move untagging for memory syscalls back to the generic
> > > > code so other arches would make use of it as well, or should I keep
> > > > the arm64 specific memory syscalls wrappers and address the comments
> > > > on that patch?
> > >
> > > Keep them generic again but make sure we get agreement with Khalid on
> > > the actual ABI implications for sparc.
> >
> > OK, will do. I find it hard to understand what the ABI implications
> > are. I'll post the next version without untagging in brk, mmap,
> > munmap, mremap (for new_address), mmap_pgoff, remap_file_pages, shmat
> > and shmdt.
>
> It's more about not relaxing the ABI to accept non-zero top-byte unless
> we have a use-case for it. For mmap() etc., I don't think that's needed
> but if you think otherwise, please raise it.
>
> > > > 2. Should I make untagging opt-in and controlled by a command line argument?
> > >
> > > Opt-in, yes, but per task rather than kernel command line option.
> > > prctl() is a possibility of opting in.
> >
> > OK. Should I store a flag somewhere in task_struct? Should it be
> > inheritable on clone?
>
> A TIF flag would do but I'd say leave it out for now (default opted in)
> until we figure out the best way to do this (can be a patch on top of
> this series).

You mean leave the whole opt-in/prctl part out? So the only change
would be to move untagging for memory syscalls into generic code?

>
> Thanks.
>
> --
> Catalin

2019-05-31 16:47:43

by Catalin Marinas

[permalink] [raw]
Subject: Re: [PATCH v15 00/17] arm64: untag user pointers passed to the kernel

On Fri, May 31, 2019 at 06:24:06PM +0200, Andrey Konovalov wrote:
> On Fri, May 31, 2019 at 6:20 PM Catalin Marinas <[email protected]> wrote:
> > On Fri, May 31, 2019 at 04:29:10PM +0200, Andrey Konovalov wrote:
> > > On Thu, May 30, 2019 at 7:15 PM Catalin Marinas <[email protected]> wrote:
> > > > On Tue, May 28, 2019 at 04:14:45PM +0200, Andrey Konovalov wrote:
> > > > > Thanks for a lot of valuable input! I've read through all the replies
> > > > > and got somewhat lost. What are the changes I need to do to this
> > > > > series?
> > > > >
> > > > > 1. Should I move untagging for memory syscalls back to the generic
> > > > > code so other arches would make use of it as well, or should I keep
> > > > > the arm64 specific memory syscalls wrappers and address the comments
> > > > > on that patch?
> > > >
> > > > Keep them generic again but make sure we get agreement with Khalid on
> > > > the actual ABI implications for sparc.
> > >
> > > OK, will do. I find it hard to understand what the ABI implications
> > > are. I'll post the next version without untagging in brk, mmap,
> > > munmap, mremap (for new_address), mmap_pgoff, remap_file_pages, shmat
> > > and shmdt.
> >
> > It's more about not relaxing the ABI to accept non-zero top-byte unless
> > we have a use-case for it. For mmap() etc., I don't think that's needed
> > but if you think otherwise, please raise it.
> >
> > > > > 2. Should I make untagging opt-in and controlled by a command line argument?
> > > >
> > > > Opt-in, yes, but per task rather than kernel command line option.
> > > > prctl() is a possibility of opting in.
> > >
> > > OK. Should I store a flag somewhere in task_struct? Should it be
> > > inheritable on clone?
> >
> > A TIF flag would do but I'd say leave it out for now (default opted in)
> > until we figure out the best way to do this (can be a patch on top of
> > this series).
>
> You mean leave the whole opt-in/prctl part out? So the only change
> would be to move untagging for memory syscalls into generic code?

Yes (or just wait until next week to see if the discussion settles
down).

--
Catalin

2019-06-02 05:10:22

by Kees Cook

[permalink] [raw]
Subject: Re: [PATCH v15 00/17] arm64: untag user pointers passed to the kernel

On Tue, May 28, 2019 at 06:02:45PM +0100, Catalin Marinas wrote:
> On Thu, May 23, 2019 at 02:31:16PM -0700, Kees Cook wrote:
> > syzkaller already attempts to randomly inject non-canonical and
> > 0xFFFF....FFFF addresses for user pointers in syscalls in an effort to
> > find bugs like CVE-2017-5123 where waitid() via unchecked put_user() was
> > able to write directly to kernel memory[1].
> >
> > It seems that using TBI by default and not allowing a switch back to
> > "normal" ABI without a reboot actually means that userspace cannot inject
> > kernel pointers into syscalls any more, since they'll get universally
> > stripped now. Is my understanding correct, here? i.e. exploiting
> > CVE-2017-5123 would be impossible under TBI?
> >
> > If so, then I think we should commit to the TBI ABI and have a boot
> > flag to disable it, but NOT have a process flag, as that would allow
> > attackers to bypass the masking. The only flag should be "TBI or MTE".
> >
> > If so, can I get top byte masking for other architectures too? Like,
> > just to strip high bits off userspace addresses? ;)
>
> Just for fun, hack/attempt at your idea which should not interfere with
> TBI. Only briefly tested on arm64 (and the s390 __TYPE_IS_PTR macro is
> pretty weird ;)):

OMG, this is amazing and bonkers. I love it.

> --------------------------8<---------------------------------
> diff --git a/arch/s390/include/asm/compat.h b/arch/s390/include/asm/compat.h
> index 63b46e30b2c3..338455a74eff 100644
> --- a/arch/s390/include/asm/compat.h
> +++ b/arch/s390/include/asm/compat.h
> @@ -11,9 +11,6 @@
>
> #include <asm-generic/compat.h>
>
> -#define __TYPE_IS_PTR(t) (!__builtin_types_compatible_p( \
> - typeof(0?(__force t)0:0ULL), u64))
> -
> #define __SC_DELOUSE(t,v) ({ \
> BUILD_BUG_ON(sizeof(t) > 4 && !__TYPE_IS_PTR(t)); \
> (__force t)(__TYPE_IS_PTR(t) ? ((v) & 0x7fffffff) : (v)); \
> diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
> index e2870fe1be5b..b1b9fe8502da 100644
> --- a/include/linux/syscalls.h
> +++ b/include/linux/syscalls.h
> @@ -119,8 +119,15 @@ struct io_uring_params;
> #define __TYPE_IS_L(t) (__TYPE_AS(t, 0L))
> #define __TYPE_IS_UL(t) (__TYPE_AS(t, 0UL))
> #define __TYPE_IS_LL(t) (__TYPE_AS(t, 0LL) || __TYPE_AS(t, 0ULL))
> +#define __TYPE_IS_PTR(t) (!__builtin_types_compatible_p(typeof(0 ? (__force t)0 : 0ULL), u64))
> #define __SC_LONG(t, a) __typeof(__builtin_choose_expr(__TYPE_IS_LL(t), 0LL, 0L)) a
> +#ifdef CONFIG_64BIT
> +#define __SC_CAST(t, a) (__TYPE_IS_PTR(t) \
> + ? (__force t) ((__u64)a & ~(1UL << 55)) \
> + : (__force t) a)
> +#else
> #define __SC_CAST(t, a) (__force t) a
> +#endif
> #define __SC_ARGS(t, a) a
> #define __SC_TEST(t, a) (void)BUILD_BUG_ON_ZERO(!__TYPE_IS_LL(t) && sizeof(t) > sizeof(long))

I'm laughing, I'm crying. Now I have to go look at the disassembly to
see how this actually looks. (I mean, it _does_ solve my specific case
of the waitid() flaw, but wouldn't help with pointers deeper in structs,
etc, though TBI does, I think still help with that. I have to catch back
up on the thread...) Anyway, if it's not too expensive it'd block
reachability for those kinds of flaws.

I wonder what my chances are of actually getting this landed? :)
(Though, I guess I need to find a "VA width" macro instead of a raw 55.)

Thanks for thinking of __SC_CAST() and __TYPE_IS_PTR() together. Really
made my day. :)

--
Kees Cook