2018-12-10 14:29:44

by Andrey Konovalov

[permalink] [raw]
Subject: [PATCH v9 0/8] arm64: untag user pointers passed to the kernel

arm64 has a feature called Top Byte Ignore, which allows to embed pointer
tags into the top byte of each pointer. Userspace programs (such as
HWASan, a memory debugging tool [1]) might use this feature and pass
tagged user pointers to the kernel through syscalls or other interfaces.

Right now the kernel is already able to handle user faults with tagged
pointers, due to these patches:

1. 81cddd65 ("arm64: traps: fix userspace cache maintenance emulation on a
tagged pointer")
2. 7dcd9dd8 ("arm64: hw_breakpoint: fix watchpoint matching for tagged
pointers")
3. 276e9327 ("arm64: entry: improve data abort handling of tagged
pointers")

When passing tagged pointers to syscalls, there's a special case of such a
pointer being passed to one of the memory syscalls (mmap, mprotect, etc.).
These syscalls don't do memory accesses but rather deal with memory
ranges, hence an untagged pointer is better suited.

This patchset extends tagged pointer support to non-memory syscalls. This
is done by reusing the untagged_addr macro to untag user pointers when the
kernel performs pointer checking to find out whether the pointer comes
from userspace (most notably in access_ok). The untagging is done only
when the pointer is being checked, the tag is preserved as the pointer
makes its way through the kernel.

One of the alternative approaches to untagging that was considered is to
completely strip the pointer tag as the pointer enters the kernel with
some kind of a syscall wrapper, but that won't work with the countless
number of different ioctl calls. With this approach we would need a custom
wrapper for each ioctl variation, which doesn't seem practical.

The following testing approaches has been taken to find potential issues
with user pointer untagging:

1. Static testing (with sparse [2] and separately with a custom static
analyzer based on Clang) to track casts of __user pointers to integer
types to find places where untagging needs to be done.

2. Dynamic testing: adding BUG_ON(has_tag(addr)) to find_vma() and running
a modified syzkaller version that passes tagged pointers to the kernel.

Based on the results of the testing the requried patches have been added
to the patchset.

This patchset has been merged into the Pixel 2 kernel tree and is now
being used to enable testing of Pixel 2 phones with HWASan.

This patchset is a prerequisite for ARM's memory tagging hardware feature
support [3].

Thanks!

[1] http://clang.llvm.org/docs/HardwareAssistedAddressSanitizerDesign.html

[2] https://github.com/lucvoo/sparse-dev/commit/5f960cb10f56ec2017c128ef9d16060e0145f292

[3] https://community.arm.com/processors/b/blog/posts/arm-a-profile-architecture-2018-developments-armv85a

Changes in v9:
- Rebased onto 4.20-rc6.
- Used u64 instead of __u64 in type casts in the untagged_addr macro for
arm64.
- Added braces around (addr) in the untagged_addr macro for other arches.

Changes in v8:
- Rebased onto 65102238 (4.20-rc1).
- Added a note to the cover letter on why syscall wrappers/shims that untag
user pointers won't work.
- Added a note to the cover letter that this patchset has been merged into
the Pixel 2 kernel tree.
- Documentation fixes, in particular added a list of syscalls that don't
support tagged user pointers.

Changes in v7:
- Rebased onto 17b57b18 (4.19-rc6).
- Dropped the "arm64: untag user address in __do_user_fault" patch, since
the existing patches already handle user faults properly.
- Dropped the "usb, arm64: untag user addresses in devio" patch, since the
passed pointer must come from a vma and therefore be untagged.
- Dropped the "arm64: annotate user pointers casts detected by sparse"
patch (see the discussion to the replies of the v6 of this patchset).
- Added more context to the cover letter.
- Updated Documentation/arm64/tagged-pointers.txt.

Changes in v6:
1 From 502466b9652c57a23af3bd72124144319212f30b Mon Sep 17 00:00:00 2001
- Added annotations for user pointer casts found by sparse.
1 From 502466b9652c57a23af3bd72124144319212f30b Mon Sep 17 00:00:00 2001
- Rebased onto 050cdc6c (4.19-rc1+).
1 From 502466b9652c57a23af3bd72124144319212f30b Mon Sep 17 00:00:00 2001

Changes in v5:
- Added 3 new patches that add untagging to places found with static
analysis.
- Rebased onto 44c929e1 (4.18-rc8).

Changes in v4:
- Added a selftest for checking that passing tagged pointers to the
kernel succeeds.
- Rebased onto 81e97f013 (4.18-rc1+).

Changes in v3:
- Rebased onto e5c51f30 (4.17-rc6+).
- Added linux-arch@ to the list of recipients.

Changes in v2:
- Rebased onto 2d618bdf (4.17-rc3+).
- Removed excessive untagging in gup.c.
- Removed untagging pointers returned from __uaccess_mask_ptr.

Changes in v1:
- Rebased onto 4.17-rc1.

Changes in RFC v2:
- Added "#ifndef untagged_addr..." fallback in linux/uaccess.h instead of
defining it for each arch individually.
- Updated Documentation/arm64/tagged-pointers.txt.
- Dropped "mm, arm64: untag user addresses in memory syscalls".
- Rebased onto 3eb2ce82 (4.16-rc7).

Reviewed-by: Luc Van Oostenryck <[email protected]>
Signed-off-by: Andrey Konovalov <[email protected]>

Andrey Konovalov (8):
arm64: add type casts to untagged_addr macro
uaccess: add untagged_addr definition for other arches
arm64: untag user addresses in access_ok and __uaccess_mask_ptr
mm, arm64: untag user addresses in mm/gup.c
lib, arm64: untag addrs passed to strncpy_from_user and strnlen_user
fs, arm64: untag user address in copy_mount_options
arm64: update Documentation/arm64/tagged-pointers.txt
selftests, arm64: add a selftest for passing tagged pointers to kernel

Documentation/arm64/tagged-pointers.txt | 25 +++++++++++--------
arch/arm64/include/asm/uaccess.h | 14 +++++++----
fs/namespace.c | 2 +-
include/linux/uaccess.h | 4 +++
lib/strncpy_from_user.c | 2 ++
lib/strnlen_user.c | 2 ++
mm/gup.c | 4 +++
tools/testing/selftests/arm64/.gitignore | 1 +
tools/testing/selftests/arm64/Makefile | 11 ++++++++
.../testing/selftests/arm64/run_tags_test.sh | 12 +++++++++
tools/testing/selftests/arm64/tags_test.c | 19 ++++++++++++++
11 files changed, 80 insertions(+), 16 deletions(-)
create mode 100644 tools/testing/selftests/arm64/.gitignore
create mode 100644 tools/testing/selftests/arm64/Makefile
create mode 100755 tools/testing/selftests/arm64/run_tags_test.sh
create mode 100644 tools/testing/selftests/arm64/tags_test.c

--
2.20.0.rc2.403.gdbc3b29805-goog



2018-12-10 12:55:44

by Andrey Konovalov

[permalink] [raw]
Subject: [PATCH v9 8/8] selftests, arm64: add a selftest for passing tagged pointers to kernel

This patch adds a simple test, that calls the uname syscall with a
tagged user pointer as an argument. Without the kernel accepting tagged
user pointers the test fails with EFAULT.

Signed-off-by: Andrey Konovalov <[email protected]>
---
tools/testing/selftests/arm64/.gitignore | 1 +
tools/testing/selftests/arm64/Makefile | 11 +++++++++++
.../testing/selftests/arm64/run_tags_test.sh | 12 ++++++++++++
tools/testing/selftests/arm64/tags_test.c | 19 +++++++++++++++++++
4 files changed, 43 insertions(+)
create mode 100644 tools/testing/selftests/arm64/.gitignore
create mode 100644 tools/testing/selftests/arm64/Makefile
create mode 100755 tools/testing/selftests/arm64/run_tags_test.sh
create mode 100644 tools/testing/selftests/arm64/tags_test.c

diff --git a/tools/testing/selftests/arm64/.gitignore b/tools/testing/selftests/arm64/.gitignore
new file mode 100644
index 000000000000..e8fae8d61ed6
--- /dev/null
+++ b/tools/testing/selftests/arm64/.gitignore
@@ -0,0 +1 @@
+tags_test
diff --git a/tools/testing/selftests/arm64/Makefile b/tools/testing/selftests/arm64/Makefile
new file mode 100644
index 000000000000..a61b2e743e99
--- /dev/null
+++ b/tools/testing/selftests/arm64/Makefile
@@ -0,0 +1,11 @@
+# SPDX-License-Identifier: GPL-2.0
+
+# ARCH can be overridden by the user for cross compiling
+ARCH ?= $(shell uname -m 2>/dev/null || echo not)
+
+ifneq (,$(filter $(ARCH),aarch64 arm64))
+TEST_GEN_PROGS := tags_test
+TEST_PROGS := run_tags_test.sh
+endif
+
+include ../lib.mk
diff --git a/tools/testing/selftests/arm64/run_tags_test.sh b/tools/testing/selftests/arm64/run_tags_test.sh
new file mode 100755
index 000000000000..745f11379930
--- /dev/null
+++ b/tools/testing/selftests/arm64/run_tags_test.sh
@@ -0,0 +1,12 @@
+#!/bin/sh
+# SPDX-License-Identifier: GPL-2.0
+
+echo "--------------------"
+echo "running tags test"
+echo "--------------------"
+./tags_test
+if [ $? -ne 0 ]; then
+ echo "[FAIL]"
+else
+ echo "[PASS]"
+fi
diff --git a/tools/testing/selftests/arm64/tags_test.c b/tools/testing/selftests/arm64/tags_test.c
new file mode 100644
index 000000000000..1452ed7d33f9
--- /dev/null
+++ b/tools/testing/selftests/arm64/tags_test.c
@@ -0,0 +1,19 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <stdio.h>
+#include <unistd.h>
+#include <stdint.h>
+#include <sys/utsname.h>
+
+#define SHIFT_TAG(tag) ((uint64_t)(tag) << 56)
+#define SET_TAG(ptr, tag) (((uint64_t)(ptr) & ~SHIFT_TAG(0xff)) | \
+ SHIFT_TAG(tag))
+
+int main(void)
+{
+ struct utsname utsname;
+ void *ptr = &utsname;
+ void *tagged_ptr = (void *)SET_TAG(ptr, 0x42);
+ int err = uname(tagged_ptr);
+ return err;
+}
--
2.20.0.rc2.403.gdbc3b29805-goog


2018-12-10 12:55:53

by Andrey Konovalov

[permalink] [raw]
Subject: [PATCH v9 4/8] mm, arm64: untag user addresses in mm/gup.c

mm/gup.c provides a kernel interface that accepts user addresses and
manipulates user pages directly (for example get_user_pages, that is used
by the futex syscall). Since a user can provided tagged addresses, we need
to handle such case.

Add untagging to gup.c functions that use user addresses for vma lookup.

Signed-off-by: Andrey Konovalov <[email protected]>
---
mm/gup.c | 4 ++++
1 file changed, 4 insertions(+)

diff --git a/mm/gup.c b/mm/gup.c
index 8cb68a50dbdf..409aedb1e2d5 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -683,6 +683,8 @@ static long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
if (!nr_pages)
return 0;

+ start = untagged_addr(start);
+
VM_BUG_ON(!!pages != !!(gup_flags & FOLL_GET));

/*
@@ -845,6 +847,8 @@ int fixup_user_fault(struct task_struct *tsk, struct mm_struct *mm,
struct vm_area_struct *vma;
vm_fault_t ret, major = 0;

+ address = untagged_addr(address);
+
if (unlocked)
fault_flags |= FAULT_FLAG_ALLOW_RETRY;

--
2.20.0.rc2.403.gdbc3b29805-goog


2018-12-10 12:56:15

by Andrey Konovalov

[permalink] [raw]
Subject: [PATCH v9 1/8] arm64: add type casts to untagged_addr macro

This patch makes the untagged_addr macro accept all kinds of address types
(void *, unsigned long, etc.) and allows not to specify type casts in each
place where it is used. This is done by using __typeof__.

Acked-by: Catalin Marinas <[email protected]>
Signed-off-by: Andrey Konovalov <[email protected]>
---
arch/arm64/include/asm/uaccess.h | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/arm64/include/asm/uaccess.h b/arch/arm64/include/asm/uaccess.h
index 07c34087bd5e..3c3864ba3cc1 100644
--- a/arch/arm64/include/asm/uaccess.h
+++ b/arch/arm64/include/asm/uaccess.h
@@ -101,7 +101,8 @@ static inline unsigned long __range_ok(const void __user *addr, unsigned long si
* up with a tagged userland pointer. Clear the tag to get a sane pointer to
* pass on to access_ok(), for instance.
*/
-#define untagged_addr(addr) sign_extend64(addr, 55)
+#define untagged_addr(addr) \
+ ((__typeof__(addr))sign_extend64((u64)(addr), 55))

#define access_ok(type, addr, size) __range_ok(addr, size)
#define user_addr_max get_fs
--
2.20.0.rc2.403.gdbc3b29805-goog


2018-12-10 12:56:17

by Andrey Konovalov

[permalink] [raw]
Subject: [PATCH v9 6/8] fs, arm64: untag user address in copy_mount_options

In copy_mount_options a user address is being subtracted from TASK_SIZE.
If the address is lower than TASK_SIZE, the size is calculated to not
allow the exact_copy_from_user() call to cross TASK_SIZE boundary.
However if the address is tagged, then the size will be calculated
incorrectly.

Untag the address before subtracting.

Signed-off-by: Andrey Konovalov <[email protected]>
---
fs/namespace.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index a7f91265ea67..694dcedb7e7d 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -2686,7 +2686,7 @@ void *copy_mount_options(const void __user * data)
* the remainder of the page.
*/
/* copy_from_user cannot cross TASK_SIZE ! */
- size = TASK_SIZE - (unsigned long)data;
+ size = TASK_SIZE - (unsigned long)untagged_addr(data);
if (size > PAGE_SIZE)
size = PAGE_SIZE;

--
2.20.0.rc2.403.gdbc3b29805-goog


2018-12-10 12:57:06

by Andrey Konovalov

[permalink] [raw]
Subject: [PATCH v9 7/8] arm64: update Documentation/arm64/tagged-pointers.txt

Document the changes in Documentation/arm64/tagged-pointers.txt.

Signed-off-by: Andrey Konovalov <[email protected]>
---
Documentation/arm64/tagged-pointers.txt | 25 +++++++++++++++----------
1 file changed, 15 insertions(+), 10 deletions(-)

diff --git a/Documentation/arm64/tagged-pointers.txt b/Documentation/arm64/tagged-pointers.txt
index a25a99e82bb1..f4cf1f5cf362 100644
--- a/Documentation/arm64/tagged-pointers.txt
+++ b/Documentation/arm64/tagged-pointers.txt
@@ -17,13 +17,22 @@ this byte for application use.
Passing tagged addresses to the kernel
--------------------------------------

-All interpretation of userspace memory addresses by the kernel assumes
-an address tag of 0x00.
+The kernel supports tags in pointer arguments (including pointers in
+structures) for a limited set of syscalls, the exceptions are:

-This includes, but is not limited to, addresses found in:
+ - memory syscalls: brk, madvise, mbind, mincore, mlock, mlock2, move_pages,
+ mprotect, mremap, msync, munlock, munmap, pkey_mprotect, process_vm_readv,
+ process_vm_writev, remap_file_pages;

- - pointer arguments to system calls, including pointers in structures
- passed to system calls,
+ - ioctls that accept user pointers that describe virtual memory ranges;
+
+ - TCP_ZEROCOPY_RECEIVE setsockopt.
+
+The kernel supports tags in user fault addresses. However the fault_address
+field in the sigcontext struct will contain an untagged address.
+
+All other interpretations of userspace memory addresses by the kernel
+assume an address tag of 0x00, in particular:

- the stack pointer (sp), e.g. when interpreting it to deliver a
signal,
@@ -33,11 +42,7 @@ This includes, but is not limited to, addresses found in:

Using non-zero address tags in any of these locations may result in an
error code being returned, a (fatal) signal being raised, or other modes
-of failure.
-
-For these reasons, passing non-zero address tags to the kernel via
-system calls is forbidden, and using a non-zero address tag for sp is
-strongly discouraged.
+of failure. Using a non-zero address tag for sp is strongly discouraged.

Programs maintaining a frame pointer and frame records that use non-zero
address tags may suffer impaired or inaccurate debug and profiling
--
2.20.0.rc2.403.gdbc3b29805-goog


2018-12-10 12:57:27

by Andrey Konovalov

[permalink] [raw]
Subject: [PATCH v9 5/8] lib, arm64: untag addrs passed to strncpy_from_user and strnlen_user

strncpy_from_user and strnlen_user accept user addresses as arguments, and
do not go through the same path as copy_from_user and others, so here we
need to handle the case of tagged user addresses separately.

Untag user pointers passed to these functions.

Signed-off-by: Andrey Konovalov <[email protected]>
---
lib/strncpy_from_user.c | 2 ++
lib/strnlen_user.c | 2 ++
2 files changed, 4 insertions(+)

diff --git a/lib/strncpy_from_user.c b/lib/strncpy_from_user.c
index b53e1b5d80f4..97467cd2bc59 100644
--- a/lib/strncpy_from_user.c
+++ b/lib/strncpy_from_user.c
@@ -106,6 +106,8 @@ long strncpy_from_user(char *dst, const char __user *src, long count)
if (unlikely(count <= 0))
return 0;

+ src = untagged_addr(src);
+
max_addr = user_addr_max();
src_addr = (unsigned long)src;
if (likely(src_addr < max_addr)) {
diff --git a/lib/strnlen_user.c b/lib/strnlen_user.c
index 60d0bbda8f5e..8b5f56466e00 100644
--- a/lib/strnlen_user.c
+++ b/lib/strnlen_user.c
@@ -108,6 +108,8 @@ long strnlen_user(const char __user *str, long count)
if (unlikely(count <= 0))
return 0;

+ str = untagged_addr(str);
+
max_addr = user_addr_max();
src_addr = (unsigned long)str;
if (likely(src_addr < max_addr)) {
--
2.20.0.rc2.403.gdbc3b29805-goog


2018-12-10 14:29:59

by Andrey Konovalov

[permalink] [raw]
Subject: [PATCH v9 2/8] uaccess: add untagged_addr definition for other arches

To allow arm64 syscalls accept tagged pointers from userspace, we must
untag them when they are passed to the kernel. Since untagging is done in
generic parts of the kernel, the untagged_addr macro needs to be defined
for all architectures.

Define it as a noop for other architectures besides arm64.

Acked-by: Catalin Marinas <[email protected]>
Signed-off-by: Andrey Konovalov <[email protected]>
---
include/linux/uaccess.h | 4 ++++
1 file changed, 4 insertions(+)

diff --git a/include/linux/uaccess.h b/include/linux/uaccess.h
index efe79c1cdd47..42b7a4ac65e2 100644
--- a/include/linux/uaccess.h
+++ b/include/linux/uaccess.h
@@ -13,6 +13,10 @@

#include <asm/uaccess.h>

+#ifndef untagged_addr
+#define untagged_addr(addr) (addr)
+#endif
+
/*
* Architectures should provide two primitives (raw_copy_{to,from}_user())
* and get rid of their private instances of copy_{to,from}_user() and
--
2.20.0.rc2.403.gdbc3b29805-goog


2018-12-10 14:31:59

by Andrey Konovalov

[permalink] [raw]
Subject: [PATCH v9 3/8] arm64: untag user addresses in access_ok and __uaccess_mask_ptr

copy_from_user (and a few other similar functions) are used to copy data
from user memory into the kernel memory or vice versa. Since a user can
provided a tagged pointer to one of the syscalls that use copy_from_user,
we need to correctly handle such pointers.

Do this by untagging user pointers in access_ok and in __uaccess_mask_ptr,
before performing access validity checks.

Reviewed-by: Catalin Marinas <[email protected]>
Signed-off-by: Andrey Konovalov <[email protected]>
---
arch/arm64/include/asm/uaccess.h | 11 +++++++----
1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/arch/arm64/include/asm/uaccess.h b/arch/arm64/include/asm/uaccess.h
index 3c3864ba3cc1..d28c3b1314ce 100644
--- a/arch/arm64/include/asm/uaccess.h
+++ b/arch/arm64/include/asm/uaccess.h
@@ -104,7 +104,8 @@ static inline unsigned long __range_ok(const void __user *addr, unsigned long si
#define untagged_addr(addr) \
((__typeof__(addr))sign_extend64((u64)(addr), 55))

-#define access_ok(type, addr, size) __range_ok(addr, size)
+#define access_ok(type, addr, size) \
+ __range_ok(untagged_addr(addr), size)
#define user_addr_max get_fs

#define _ASM_EXTABLE(from, to) \
@@ -236,7 +237,8 @@ static inline void uaccess_enable_not_uao(void)

/*
* Sanitise a uaccess pointer such that it becomes NULL if above the
- * current addr_limit.
+ * current addr_limit. In case the pointer is tagged (has the top byte set),
+ * untag the pointer before checking.
*/
#define uaccess_mask_ptr(ptr) (__typeof__(ptr))__uaccess_mask_ptr(ptr)
static inline void __user *__uaccess_mask_ptr(const void __user *ptr)
@@ -244,10 +246,11 @@ static inline void __user *__uaccess_mask_ptr(const void __user *ptr)
void __user *safe_ptr;

asm volatile(
- " bics xzr, %1, %2\n"
+ " bics xzr, %3, %2\n"
" csel %0, %1, xzr, eq\n"
: "=&r" (safe_ptr)
- : "r" (ptr), "r" (current_thread_info()->addr_limit)
+ : "r" (ptr), "r" (current_thread_info()->addr_limit),
+ "r" (untagged_addr(ptr))
: "cc");

csdb();
--
2.20.0.rc2.403.gdbc3b29805-goog


2018-12-10 14:57:59

by Vincenzo Frascino

[permalink] [raw]
Subject: [RFC][PATCH 0/3] arm64 relaxed ABI

On arm64 the TCR_EL1.TBI0 bit has been set since Linux 3.x hence
the userspace (EL0) is allowed to set a non-zero value in the top
byte but the resulting pointers are not allowed at the user-kernel
syscall ABI boundary.

This patchset proposes a relaxation of the ABI and a mechanism to
advertise it to the userspace via an AT_FLAGS.

The rationale behind the choice of AT_FLAGS is that the Unix System V
ABI defines AT_FLAGS as "flags", leaving some degree of freedom in
interpretation.
There are two previous attempts of using AT_FLAGS in the Linux Kernel
for different reasons: the first was more generic and was used to expose
the support for the GNU STACK NX feature [1] and the second was done for
the MIPS architecture and was used to expose the support of "MIPS ABI
Extension for IEEE Std 754 Non-Compliant Interlinking" [2].
Both the changes are currently _not_ merged in mainline.
The only architecture that reserves some of the bits in AT_FLAGS is
currently MIPS, which introduced the concept of platform specific ABI
(psABI) reserving the top-byte [3].

When ARM64_AT_FLAGS_SYSCALL_TBI is set the kernel is advertising
to the userspace that a relaxed ABI is supported hence this type
of pointers are now allowed to be passed to the syscalls when they are
in memory ranges obtained by anonymous mmap() or brk().

The userspace _must_ verify that the flag is set before passing tagged
pointers to the syscalls allowed by this relaxation.

More in general, exposing the ARM64_AT_FLAGS_SYSCALL_TBI flag and mandating
to the software to check that the feature is present, before using the
associated functionality, it provides a degree of control on the decision
of disabling such a feature in future without consequently breaking the
userspace.

The change required a modification of the elf common code, because in Linux
the AT_FLAGS are currently set to zero by default by the kernel.

The newly added flag has been verified on arm64 using the code below.
#include <stdio.h>
#include <stdbool.h>
#include <sys/auxv.h>

#define ARM64_AT_FLAGS_SYSCALL_TBI (1 << 0)

bool arm64_syscall_tbi_is_present(void)
{
unsigned long at_flags = getauxval(AT_FLAGS);
if (at_flags & ARM64_AT_FLAGS_SYSCALL_TBI)
return true;

return false;
}

void main()
{
if (arm64_syscall_tbi_is_present())
printf("ARM64_AT_FLAGS_SYSCALL_TBI is present\n");
}

This patchset should be merged together with [4].

[1] https://patchwork.ozlabs.org/patch/579578/
[2] https://lore.kernel.org/patchwork/cover/618280/
[3] ftp://http://www.linux-mips.org/pub/linux/mips/doc/ABI/psABI_mips3.0.pdf
[4] https://patchwork.kernel.org/cover/10674351/

ABI References:
---------------
Sco SysV ABI: http://www.sco.com/developers/gabi/2003-12-17/contents.html
PowerPC AUXV: http://openpowerfoundation.org/wp-content/uploads/resources/leabi/content/dbdoclet.50655242_98651.html
AMD64 ABI: https://www.cs.tufts.edu/comp/40-2012f/readings/amd64-abi.pdf
x86 ABI: https://www.uclibc.org/docs/psABI-i386.pdf
MIPS ABI: ftp://http://www.linux-mips.org/pub/linux/mips/doc/ABI/psABI_mips3.0.pdf
ARM ABI: http://infocenter.arm.com/help/topic/com.arm.doc.ihi0044f/IHI0044F_aaelf.pdf
SPARC ABI: http://math-atlas.sourceforge.net/devel/assembly/abi_sysV_sparc.pdf

Cc: Catalin Marinas <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Robin Murphy <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Kate Stewart <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: "Kirill A . Shutemov" <[email protected]>
Cc: Shuah Khan <[email protected]>
Cc: Chintan Pandya <[email protected]>
Cc: Jacob Bramley <[email protected]>
Cc: Ruben Ayrapetyan <[email protected]>
Cc: Andrey Konovalov <[email protected]>
Cc: Lee Smith <[email protected]>
Cc: Kostya Serebryany <[email protected]>
Cc: Dmitry Vyukov <[email protected]>,
Cc: Ramana Radhakrishnan <[email protected]>
Cc: Luc Van Oostenryck <[email protected]>
Cc: Evgeniy Stepanov <[email protected]>
CC: Alexander Viro <[email protected]>
Signed-off-by: Vincenzo Frascino <[email protected]>

Vincenzo Frascino (3):
elf: Make AT_FLAGS arch configurable
arm64: Define Documentation/arm64/elf_at_flags.txt
arm64: elf: Advertise relaxed ABI

Documentation/arm64/elf_at_flags.txt | 111 ++++++++++++++++++++++++++
arch/arm64/include/asm/atflags.h | 7 ++
arch/arm64/include/asm/elf.h | 5 ++
arch/arm64/include/uapi/asm/atflags.h | 8 ++
fs/binfmt_elf.c | 6 +-
fs/binfmt_elf_fdpic.c | 6 +-
fs/compat_binfmt_elf.c | 5 ++
7 files changed, 146 insertions(+), 2 deletions(-)
create mode 100644 Documentation/arm64/elf_at_flags.txt
create mode 100644 arch/arm64/include/asm/atflags.h
create mode 100644 arch/arm64/include/uapi/asm/atflags.h

--
2.19.2


2018-12-10 14:58:05

by Vincenzo Frascino

[permalink] [raw]
Subject: [RFC][PATCH 1/3] elf: Make AT_FLAGS arch configurable

Currently, the AT_FLAGS in the elf auxiliary vector are set to 0
by default by the kernel.
Some architectures might need to expose to the userspace a non-zero
value to advertise some platform specific ABI functionalities.

This patch makes AT_FLAGS configurable by the architectures that
require it.

Cc: Catalin Marinas <[email protected]>
Cc: Will Deacon <[email protected]>
CC: Andrey Konovalov <[email protected]>
CC: Alexander Viro <[email protected]>
Signed-off-by: Vincenzo Frascino <[email protected]>
---
fs/binfmt_elf.c | 6 +++++-
fs/binfmt_elf_fdpic.c | 6 +++++-
fs/compat_binfmt_elf.c | 5 +++++
3 files changed, 15 insertions(+), 2 deletions(-)

diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
index 54207327f98f..9fa20cc4a437 100644
--- a/fs/binfmt_elf.c
+++ b/fs/binfmt_elf.c
@@ -86,6 +86,10 @@ static int elf_core_dump(struct coredump_params *cprm);
#define ELF_CORE_EFLAGS 0
#endif

+#ifndef ELF_AT_FLAGS
+#define ELF_AT_FLAGS 0
+#endif
+
#define ELF_PAGESTART(_v) ((_v) & ~(unsigned long)(ELF_MIN_ALIGN-1))
#define ELF_PAGEOFFSET(_v) ((_v) & (ELF_MIN_ALIGN-1))
#define ELF_PAGEALIGN(_v) (((_v) + ELF_MIN_ALIGN - 1) & ~(ELF_MIN_ALIGN - 1))
@@ -251,7 +255,7 @@ create_elf_tables(struct linux_binprm *bprm, struct elfhdr *exec,
NEW_AUX_ENT(AT_PHENT, sizeof(struct elf_phdr));
NEW_AUX_ENT(AT_PHNUM, exec->e_phnum);
NEW_AUX_ENT(AT_BASE, interp_load_addr);
- NEW_AUX_ENT(AT_FLAGS, 0);
+ NEW_AUX_ENT(AT_FLAGS, ELF_AT_FLAGS);
NEW_AUX_ENT(AT_ENTRY, exec->e_entry);
NEW_AUX_ENT(AT_UID, from_kuid_munged(cred->user_ns, cred->uid));
NEW_AUX_ENT(AT_EUID, from_kuid_munged(cred->user_ns, cred->euid));
diff --git a/fs/binfmt_elf_fdpic.c b/fs/binfmt_elf_fdpic.c
index b53bb3729ac1..cf1e680a6b88 100644
--- a/fs/binfmt_elf_fdpic.c
+++ b/fs/binfmt_elf_fdpic.c
@@ -82,6 +82,10 @@ static int elf_fdpic_map_file_by_direct_mmap(struct elf_fdpic_params *,
static int elf_fdpic_core_dump(struct coredump_params *cprm);
#endif

+#ifndef ELF_AT_FLAGS
+#define ELF_AT_FLAGS 0
+#endif
+
static struct linux_binfmt elf_fdpic_format = {
.module = THIS_MODULE,
.load_binary = load_elf_fdpic_binary,
@@ -651,7 +655,7 @@ static int create_elf_fdpic_tables(struct linux_binprm *bprm,
NEW_AUX_ENT(AT_PHENT, sizeof(struct elf_phdr));
NEW_AUX_ENT(AT_PHNUM, exec_params->hdr.e_phnum);
NEW_AUX_ENT(AT_BASE, interp_params->elfhdr_addr);
- NEW_AUX_ENT(AT_FLAGS, 0);
+ NEW_AUX_ENT(AT_FLAGS, ELF_AT_FLAGS);
NEW_AUX_ENT(AT_ENTRY, exec_params->entry_addr);
NEW_AUX_ENT(AT_UID, (elf_addr_t) from_kuid_munged(cred->user_ns, cred->uid));
NEW_AUX_ENT(AT_EUID, (elf_addr_t) from_kuid_munged(cred->user_ns, cred->euid));
diff --git a/fs/compat_binfmt_elf.c b/fs/compat_binfmt_elf.c
index 15f6e96b3bd9..a21cf99701ae 100644
--- a/fs/compat_binfmt_elf.c
+++ b/fs/compat_binfmt_elf.c
@@ -79,6 +79,11 @@
#define ELF_HWCAP2 COMPAT_ELF_HWCAP2
#endif

+#ifdef COMPAT_ELF_AT_FLAGS
+#undef ELF_AT_FLAGS
+#define ELF_AT_FLAGS COMPAT_ELF_AT_FLAGS
+#endif
+
#ifdef COMPAT_ARCH_DLINFO
#undef ARCH_DLINFO
#define ARCH_DLINFO COMPAT_ARCH_DLINFO
--
2.19.2


2018-12-10 14:59:19

by Vincenzo Frascino

[permalink] [raw]
Subject: [RFC][PATCH 3/3] arm64: elf: Advertise relaxed ABI

On arm64 the TCR_EL1.TBI0 bit has been set since Linux 3.x hence
the userspace (EL0) is allowed to set a non-zero value in the top
byte but the resulting pointers are not allowed at the user-kernel
syscall ABI boundary.

This patch sets ARM64_AT_FLAGS_SYSCALL_TBI (bit[0]) in the AT_FLAGS
to advertise the relaxation of the ABI to the userspace.

Cc: Catalin Marinas <[email protected]>
Cc: Will Deacon <[email protected]>
CC: Andrey Konovalov <[email protected]>
Signed-off-by: Vincenzo Frascino <[email protected]>
---
arch/arm64/include/asm/atflags.h | 7 +++++++
arch/arm64/include/asm/elf.h | 5 +++++
arch/arm64/include/uapi/asm/atflags.h | 8 ++++++++
3 files changed, 20 insertions(+)
create mode 100644 arch/arm64/include/asm/atflags.h
create mode 100644 arch/arm64/include/uapi/asm/atflags.h

diff --git a/arch/arm64/include/asm/atflags.h b/arch/arm64/include/asm/atflags.h
new file mode 100644
index 000000000000..b20093d61bf2
--- /dev/null
+++ b/arch/arm64/include/asm/atflags.h
@@ -0,0 +1,7 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __ASM_ATFLAGS_H
+#define __ASM_ATFLAGS_H
+
+#include <uapi/asm/atflags.h>
+
+#endif
diff --git a/arch/arm64/include/asm/elf.h b/arch/arm64/include/asm/elf.h
index 433b9554c6a1..da5a6d310ff4 100644
--- a/arch/arm64/include/asm/elf.h
+++ b/arch/arm64/include/asm/elf.h
@@ -16,6 +16,7 @@
#ifndef __ASM_ELF_H
#define __ASM_ELF_H

+#include <asm/atflags.h>
#include <asm/hwcap.h>

/*
@@ -163,6 +164,10 @@ do { \
NEW_AUX_ENT(AT_IGNORE, 0); \
} while (0)

+/* Platform specific AT_FLAGS */
+#define ELF_AT_FLAGS ARM64_AT_FLAGS_SYSCALL_TBI
+#define COMPAT_ELF_AT_FLAGS 0
+
#define ARCH_HAS_SETUP_ADDITIONAL_PAGES
struct linux_binprm;
extern int arch_setup_additional_pages(struct linux_binprm *bprm,
diff --git a/arch/arm64/include/uapi/asm/atflags.h b/arch/arm64/include/uapi/asm/atflags.h
new file mode 100644
index 000000000000..1cf25692ffd6
--- /dev/null
+++ b/arch/arm64/include/uapi/asm/atflags.h
@@ -0,0 +1,8 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __UAPI_ASM_ATFLAGS_H
+#define __UAPI_ASM_ATFLAGS_H
+
+/* Platform specific AT_FLAGS */
+#define ARM64_AT_FLAGS_SYSCALL_TBI (1 << 0)
+
+#endif
--
2.19.2


2018-12-10 14:59:19

by Vincenzo Frascino

[permalink] [raw]
Subject: [RFC][PATCH 2/3] arm64: Define Documentation/arm64/elf_at_flags.txt

On arm64 the TCR_EL1.TBI0 bit has been set since Linux 3.x hence
the userspace (EL0) is allowed to set a non-zero value in the
top byte but the resulting pointers are not allowed at the
user-kernel syscall ABI boundary.

With the relaxed ABI proposed through this document, it is now possible
to pass tagged pointers to the syscalls, when these pointers are in
memory ranges obtained by an anonymous (MAP_ANONYMOUS) mmap() or brk().

This change in the ABI requires a mechanism to inform the userspace
that such an option is available.

This patch specifies and documents the way on which AT_FLAGS can be
used to advertise this feature to the userspace.

Cc: Catalin Marinas <[email protected]>
Cc: Will Deacon <[email protected]>
CC: Andrey Konovalov <[email protected]>
Signed-off-by: Vincenzo Frascino <[email protected]>
---
Documentation/arm64/elf_at_flags.txt | 111 +++++++++++++++++++++++++++
1 file changed, 111 insertions(+)
create mode 100644 Documentation/arm64/elf_at_flags.txt

diff --git a/Documentation/arm64/elf_at_flags.txt b/Documentation/arm64/elf_at_flags.txt
new file mode 100644
index 000000000000..153e657c058a
--- /dev/null
+++ b/Documentation/arm64/elf_at_flags.txt
@@ -0,0 +1,111 @@
+ARM64 ELF AT_FLAGS
+==================
+
+This document describes the usage and semantics of AT_FLAGS on arm64.
+
+1. Introduction
+---------------
+
+AT_FLAGS is part of the Auxiliary Vector, contains the flags and it
+is currently set to zero by the kernel on arm64.
+
+The auxiliary vector can be accessed by the userspace using the
+getauxval() API provided by the C library.
+getauxval() returns an unsigned long and when a flag is present in
+the AT_FLAGS, the corresponding bit in the returned value is set to 1.
+
+The AT_FLAGS with a "defined semantic" on arm64 are exposed to the
+userspace via user API (uapi/asm/atflags.h).
+The AT_FLAGS bits with "undefined semantics" are set to zero by default.
+This means that the AT_FLAGS bits to which this document does not assign
+an explicit meaning are to be intended reserved for future use.
+The kernel will populate all such bits with zero until meanings are
+assigned to them. If and when meanings are assigned, it is guaranteed
+that they will not impact the functional operation of existing userspace
+software. Userspace software should ignore any AT_FLAGS bit whose meaning
+is not defined when the software is written.
+
+The userspace software can test for features by acquiring the AT_FLAGS
+entry of the auxiliary vector, and testing whether a relevant flag
+is set.
+
+Example of a userspace test function:
+
+bool feature_x_is_present(void)
+{
+ unsigned long at_flags = getauxval(AT_FLAGS);
+ if (at_flags & FEATURE_X)
+ return true;
+
+ return false;
+}
+
+Where the software relies on a feature advertised by AT_FLAGS, it
+should check that the feature is present before attempting to
+use it.
+
+2. Features exposed via AT_FLAGS
+--------------------------------
+
+bit[0]: ARM64_AT_FLAGS_SYSCALL_TBI
+
+ On arm64 the TCR_EL1.TBI0 bit has been set since Linux 3.x hence
+ the userspace (EL0) is allowed to set a non-zero value in the top
+ byte but the resulting pointers are not allowed at the user-kernel
+ syscall ABI boundary.
+ When bit[0] is set to 1 the kernel is advertising to the userspace
+ that a relaxed ABI is supported hence this type of pointers are now
+ allowed to be passed to the syscalls, when these pointers are in
+ memory ranges obtained by anonymous (MAP_ANONYMOUS) mmap() or brk().
+ In these cases the tag is preserved as the pointer goes through the
+ kernel. Only when the kernel needs to check if a pointer is coming
+ from userspace (i.e. access_ok()) an untag operation is required.
+
+3. ARM64_AT_FLAGS_SYSCALL_TBI
+-----------------------------
+
+When ARM64_AT_FLAGS_SYSCALL_TBI is enabled every syscalls can accept tagged
+pointers, when these pointers are in memory ranges obtained by an anonymous
+(MAP_ANONYMOUS) mmap() or brk().
+
+A definition of the meaning of tagged pointers on arm64 can be found in:
+Documentation/arm64/tagged-pointers.txt.
+
+When a pointer does not are in a memory range obtained by an anonymous mmap()
+or brk(), this can not be passed to a syscall if it is tagged.
+
+To be more explicit: a syscall can accept pointers whose memory range is
+obtained by a non-anonymous mmap() or brk() if and only if the tag encoded in
+the top-byte is 0x00.
+
+When a new syscall is added, this can accept tagged pointers if and only if
+these pointers are in memory ranges obtained by an anonymous (MAP_ANONYMOUS)
+mmap() or brk(). In all the other cases, the tag encoded in the top-byte is
+expected to be 0x00.
+
+Example of correct usage (pseudo-code) for a userspace application:
+
+bool arm64_syscall_tbi_is_present(void)
+{
+ unsigned long at_flags = getauxval(AT_FLAGS);
+ if (at_flags & ARM64_AT_FLAGS_SYSCALL_TBI)
+ return true;
+
+ return false;
+}
+
+void main(void)
+{
+ char *addr = mmap(NULL, PAGE_SIZE, PROT_READ | PROT_WRITE,
+ MAP_ANONYMOUS, -1, 0);
+
+ /* Check if the relaxed ABI is supported */
+ if (arm64_syscall_tbi_is_present()) {
+ /* Add a tag to the pointer and to the memory */
+ addr = tag_pointer_and_memory(addr);
+ }
+
+ /* Write to memory */
+ strcpy("Hello World\n", addr);
+}
+
--
2.19.2


2018-12-12 14:26:38

by Andrey Konovalov

[permalink] [raw]
Subject: Re: [RFC][PATCH 0/3] arm64 relaxed ABI

On Mon, Dec 10, 2018 at 3:31 PM Vincenzo Frascino
<[email protected]> wrote:
>
> On arm64 the TCR_EL1.TBI0 bit has been set since Linux 3.x hence
> the userspace (EL0) is allowed to set a non-zero value in the top
> byte but the resulting pointers are not allowed at the user-kernel
> syscall ABI boundary.
>
> This patchset proposes a relaxation of the ABI and a mechanism to
> advertise it to the userspace via an AT_FLAGS.
>
> The rationale behind the choice of AT_FLAGS is that the Unix System V
> ABI defines AT_FLAGS as "flags", leaving some degree of freedom in
> interpretation.
> There are two previous attempts of using AT_FLAGS in the Linux Kernel
> for different reasons: the first was more generic and was used to expose
> the support for the GNU STACK NX feature [1] and the second was done for
> the MIPS architecture and was used to expose the support of "MIPS ABI
> Extension for IEEE Std 754 Non-Compliant Interlinking" [2].
> Both the changes are currently _not_ merged in mainline.
> The only architecture that reserves some of the bits in AT_FLAGS is
> currently MIPS, which introduced the concept of platform specific ABI
> (psABI) reserving the top-byte [3].
>
> When ARM64_AT_FLAGS_SYSCALL_TBI is set the kernel is advertising
> to the userspace that a relaxed ABI is supported hence this type
> of pointers are now allowed to be passed to the syscalls when they are
> in memory ranges obtained by anonymous mmap() or brk().
>
> The userspace _must_ verify that the flag is set before passing tagged
> pointers to the syscalls allowed by this relaxation.
>
> More in general, exposing the ARM64_AT_FLAGS_SYSCALL_TBI flag and mandating
> to the software to check that the feature is present, before using the
> associated functionality, it provides a degree of control on the decision
> of disabling such a feature in future without consequently breaking the
> userspace.
>
> The change required a modification of the elf common code, because in Linux
> the AT_FLAGS are currently set to zero by default by the kernel.
>
> The newly added flag has been verified on arm64 using the code below.
> #include <stdio.h>
> #include <stdbool.h>
> #include <sys/auxv.h>
>
> #define ARM64_AT_FLAGS_SYSCALL_TBI (1 << 0)
>
> bool arm64_syscall_tbi_is_present(void)
> {
> unsigned long at_flags = getauxval(AT_FLAGS);
> if (at_flags & ARM64_AT_FLAGS_SYSCALL_TBI)
> return true;
>
> return false;
> }
>
> void main()
> {
> if (arm64_syscall_tbi_is_present())
> printf("ARM64_AT_FLAGS_SYSCALL_TBI is present\n");
> }
>
> This patchset should be merged together with [4].
>
> [1] https://patchwork.ozlabs.org/patch/579578/
> [2] https://lore.kernel.org/patchwork/cover/618280/
> [3] ftp://http://www.linux-mips.org/pub/linux/mips/doc/ABI/psABI_mips3.0.pdf
> [4] https://patchwork.kernel.org/cover/10674351/
>
> ABI References:
> ---------------
> Sco SysV ABI: http://www.sco.com/developers/gabi/2003-12-17/contents.html
> PowerPC AUXV: http://openpowerfoundation.org/wp-content/uploads/resources/leabi/content/dbdoclet.50655242_98651.html
> AMD64 ABI: https://www.cs.tufts.edu/comp/40-2012f/readings/amd64-abi.pdf
> x86 ABI: https://www.uclibc.org/docs/psABI-i386.pdf
> MIPS ABI: ftp://http://www.linux-mips.org/pub/linux/mips/doc/ABI/psABI_mips3.0.pdf
> ARM ABI: http://infocenter.arm.com/help/topic/com.arm.doc.ihi0044f/IHI0044F_aaelf.pdf
> SPARC ABI: http://math-atlas.sourceforge.net/devel/assembly/abi_sysV_sparc.pdf
>
> Cc: Catalin Marinas <[email protected]>
> Cc: Will Deacon <[email protected]>
> Cc: Mark Rutland <[email protected]>
> Cc: Robin Murphy <[email protected]>
> Cc: Kees Cook <[email protected]>
> Cc: Kate Stewart <[email protected]>
> Cc: Greg Kroah-Hartman <[email protected]>
> Cc: Andrew Morton <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: "Kirill A . Shutemov" <[email protected]>
> Cc: Shuah Khan <[email protected]>
> Cc: Chintan Pandya <[email protected]>
> Cc: Jacob Bramley <[email protected]>
> Cc: Ruben Ayrapetyan <[email protected]>
> Cc: Andrey Konovalov <[email protected]>
> Cc: Lee Smith <[email protected]>
> Cc: Kostya Serebryany <[email protected]>
> Cc: Dmitry Vyukov <[email protected]>,
> Cc: Ramana Radhakrishnan <[email protected]>
> Cc: Luc Van Oostenryck <[email protected]>
> Cc: Evgeniy Stepanov <[email protected]>
> CC: Alexander Viro <[email protected]>
> Signed-off-by: Vincenzo Frascino <[email protected]>
>
> Vincenzo Frascino (3):
> elf: Make AT_FLAGS arch configurable
> arm64: Define Documentation/arm64/elf_at_flags.txt
> arm64: elf: Advertise relaxed ABI
>
> Documentation/arm64/elf_at_flags.txt | 111 ++++++++++++++++++++++++++
> arch/arm64/include/asm/atflags.h | 7 ++
> arch/arm64/include/asm/elf.h | 5 ++
> arch/arm64/include/uapi/asm/atflags.h | 8 ++
> fs/binfmt_elf.c | 6 +-
> fs/binfmt_elf_fdpic.c | 6 +-
> fs/compat_binfmt_elf.c | 5 ++
> 7 files changed, 146 insertions(+), 2 deletions(-)
> create mode 100644 Documentation/arm64/elf_at_flags.txt
> create mode 100644 arch/arm64/include/asm/atflags.h
> create mode 100644 arch/arm64/include/uapi/asm/atflags.h
>
> --
> 2.19.2
>

Acked-by: Andrey Konovalov <[email protected]>

2018-12-12 15:05:05

by Catalin Marinas

[permalink] [raw]
Subject: Re: [RFC][PATCH 0/3] arm64 relaxed ABI

Hi Andrey,

On Wed, Dec 12, 2018 at 03:23:25PM +0100, Andrey Konovalov wrote:
> On Mon, Dec 10, 2018 at 3:31 PM Vincenzo Frascino
> <[email protected]> wrote:
> > On arm64 the TCR_EL1.TBI0 bit has been set since Linux 3.x hence
> > the userspace (EL0) is allowed to set a non-zero value in the top
> > byte but the resulting pointers are not allowed at the user-kernel
> > syscall ABI boundary.
> >
> > This patchset proposes a relaxation of the ABI and a mechanism to
> > advertise it to the userspace via an AT_FLAGS.
> >
> > The rationale behind the choice of AT_FLAGS is that the Unix System V
> > ABI defines AT_FLAGS as "flags", leaving some degree of freedom in
> > interpretation.
> > There are two previous attempts of using AT_FLAGS in the Linux Kernel
> > for different reasons: the first was more generic and was used to expose
> > the support for the GNU STACK NX feature [1] and the second was done for
> > the MIPS architecture and was used to expose the support of "MIPS ABI
> > Extension for IEEE Std 754 Non-Compliant Interlinking" [2].
> > Both the changes are currently _not_ merged in mainline.
> > The only architecture that reserves some of the bits in AT_FLAGS is
> > currently MIPS, which introduced the concept of platform specific ABI
> > (psABI) reserving the top-byte [3].
> >
> > When ARM64_AT_FLAGS_SYSCALL_TBI is set the kernel is advertising
> > to the userspace that a relaxed ABI is supported hence this type
> > of pointers are now allowed to be passed to the syscalls when they are
> > in memory ranges obtained by anonymous mmap() or brk().
> >
> > The userspace _must_ verify that the flag is set before passing tagged
> > pointers to the syscalls allowed by this relaxation.
> >
> > More in general, exposing the ARM64_AT_FLAGS_SYSCALL_TBI flag and mandating
> > to the software to check that the feature is present, before using the
> > associated functionality, it provides a degree of control on the decision
> > of disabling such a feature in future without consequently breaking the
> > userspace.
[...]
> Acked-by: Andrey Konovalov <[email protected]>

Thanks for the ack. However, if we go ahead with this ABI proposal it
means that your patches need to be reworked to allow a non-zero top byte
in all syscalls, including mmap() and friends, ioctl(). There are ABI
concerns in either case but I'd rather have this discussion in the open.
It doesn't necessarily mean that I endorse this proposal, I would like
feedback and not just from kernel developers but user space ones.

The summary of our internal discussions (mostly between kernel
developers) is that we can't properly describe a user ABI that covers
future syscalls or syscall extensions while not all syscalls accept
tagged pointers. So we tweaked the requirements slightly to only allow
tagged pointers back into the kernel *if* the originating address is
from an anonymous mmap() or below sbrk(0). This should cover some of the
ioctls or getsockopt(TCP_ZEROCOPY_RECEIVE) where the user passes a
pointer to a buffer obtained via mmap() on the device operations.

(sorry for not being clear on what Vincenzo's proposal implies)

--
Catalin

2018-12-12 17:04:24

by Dave Martin

[permalink] [raw]
Subject: Re: [PATCH v9 0/8] arm64: untag user pointers passed to the kernel

On Mon, Dec 10, 2018 at 01:50:57PM +0100, Andrey Konovalov wrote:
> arm64 has a feature called Top Byte Ignore, which allows to embed pointer
> tags into the top byte of each pointer. Userspace programs (such as
> HWASan, a memory debugging tool [1]) might use this feature and pass
> tagged user pointers to the kernel through syscalls or other interfaces.
>
> Right now the kernel is already able to handle user faults with tagged
> pointers, due to these patches:
>
> 1. 81cddd65 ("arm64: traps: fix userspace cache maintenance emulation on a
> tagged pointer")
> 2. 7dcd9dd8 ("arm64: hw_breakpoint: fix watchpoint matching for tagged
> pointers")
> 3. 276e9327 ("arm64: entry: improve data abort handling of tagged
> pointers")
>
> When passing tagged pointers to syscalls, there's a special case of such a
> pointer being passed to one of the memory syscalls (mmap, mprotect, etc.).
> These syscalls don't do memory accesses but rather deal with memory
> ranges, hence an untagged pointer is better suited.
>
> This patchset extends tagged pointer support to non-memory syscalls. This
> is done by reusing the untagged_addr macro to untag user pointers when the
> kernel performs pointer checking to find out whether the pointer comes
> from userspace (most notably in access_ok). The untagging is done only
> when the pointer is being checked, the tag is preserved as the pointer
> makes its way through the kernel.
>
> One of the alternative approaches to untagging that was considered is to
> completely strip the pointer tag as the pointer enters the kernel with
> some kind of a syscall wrapper, but that won't work with the countless
> number of different ioctl calls. With this approach we would need a custom
> wrapper for each ioctl variation, which doesn't seem practical.
>
> The following testing approaches has been taken to find potential issues
> with user pointer untagging:
>
> 1. Static testing (with sparse [2] and separately with a custom static
> analyzer based on Clang) to track casts of __user pointers to integer
> types to find places where untagging needs to be done.
>
> 2. Dynamic testing: adding BUG_ON(has_tag(addr)) to find_vma() and running
> a modified syzkaller version that passes tagged pointers to the kernel.
>
> Based on the results of the testing the requried patches have been added
> to the patchset.
>
> This patchset has been merged into the Pixel 2 kernel tree and is now
> being used to enable testing of Pixel 2 phones with HWASan.

Do you have an idea of how much of the user/kernel interface is covered
by this workload?

> This patchset is a prerequisite for ARM's memory tagging hardware feature
> support [3].

It looks like there's been a lot of progress made here towards smoking
out most of the sites in the kernel where pointers need to be untagged.

However, I do think that we need a clear policy for how existing kernel
interfaces are to be interpreted in the presence of tagged pointers.
Unless we have that nailed down, we are likely to be able to make only
vague guarantees to userspace about what works, and the ongoing risk
of ABI regressions and inconsistencies seems high.

I don't really see how we can advertise a full system interface if we
know some subset of it doesn't work for foreseeable userspace
environments. I feel that presenting the current changes as an ABI
relaxation may be a recipe for future problems, since the forwards
compatibility guarantees we're able to make today are few and rather
vague.

Can we define an opt-in for tagged-pointer userspace, that rejects all
syscalls that we haven't checked and whitelisted (or that are
uncheckable like ioctl)? This reflects the reality that we don't have
a regular userspace environment in which standards-compliant software
that uses address tags in a reasonable way will just work.

It might be feasible to promote this to be enabled by default later on,
if it becomes sufficiently complete.


In the meantime, I think we really need to nail down the kernel's
policies on

* in the default configuration (without opt-in), is the presence of
non-address bits in pointers exchanged with the kernel simply
considered broken? (Even with this series, the de factor answer
generally seems to be "yes", although many specific things will now
work fine)

* if not, how do we tighten syscall / interface specifications to
describe what happens with pointers containing non-address bits, while
keeping the existing behaviour for untagged pointers?

We would want a general recipe that gives clear guidance on what
userspace should expect an arbitrarily chosen syscall to do with its
pointers, without having to enumerate each and every case.

To be sustainable, we would also need to solve that in a way that
doesn't need to be reintented per-arch.


There may already be some background on these topics -- can you throw me
a link if so?

Cheers
---Dave

2018-12-12 17:36:48

by Dave Martin

[permalink] [raw]
Subject: Re: [RFC][PATCH 2/3] arm64: Define Documentation/arm64/elf_at_flags.txt

On Mon, Dec 10, 2018 at 02:30:43PM +0000, Vincenzo Frascino wrote:
> On arm64 the TCR_EL1.TBI0 bit has been set since Linux 3.x hence
> the userspace (EL0) is allowed to set a non-zero value in the
> top byte but the resulting pointers are not allowed at the
> user-kernel syscall ABI boundary.
>
> With the relaxed ABI proposed through this document, it is now possible
> to pass tagged pointers to the syscalls, when these pointers are in
> memory ranges obtained by an anonymous (MAP_ANONYMOUS) mmap() or brk().

What about other anonymous memory such as the process stack?

What about MAP_PRIVATE mappings of /dev/zero (i.e., oldskool "anonymous"
mappings)?

I wonder whether this should really say MAP_PRIVATE rather than
MAP_ANONYMOUS. There are two requirements here:

* the memory must be the exclusive property of a single process,
otherwise tagging it on-the-fly could break some other process;

* the memory should be regular memory, i.e., not something like a
mapped device. Since copy-on-write mappings of devices make little
sense, and we want writes to devices to propagate to the hardware
directly, MAP_PRIVATE doesn't make a lot of sense for such mappings.

MAP_ANONYMOUS | MAP_SHARED is theoretically possible, and we should
make sure that it does the expected thing. Tagging such memory is
probably a bad idea, just like any other MAP_SHARED memory.

I'm assuming here that tagging some currently shared copy-on-write pages
would throw a page fault and trigger a copy, so that we end up tagging
the calling process's private copy of the page.

I also don't see how the above requirements conflict with regular file-
backed mappings (which would need to work if you want to be able to
tag objects in .bss or .data etc.)

> This change in the ABI requires a mechanism to inform the userspace
> that such an option is available.
>
> This patch specifies and documents the way on which AT_FLAGS can be
> used to advertise this feature to the userspace.
>
> Cc: Catalin Marinas <[email protected]>
> Cc: Will Deacon <[email protected]>
> CC: Andrey Konovalov <[email protected]>
> Signed-off-by: Vincenzo Frascino <[email protected]>
> ---
> Documentation/arm64/elf_at_flags.txt | 111 +++++++++++++++++++++++++++
> 1 file changed, 111 insertions(+)
> create mode 100644 Documentation/arm64/elf_at_flags.txt
>
> diff --git a/Documentation/arm64/elf_at_flags.txt b/Documentation/arm64/elf_at_flags.txt
> new file mode 100644
> index 000000000000..153e657c058a
> --- /dev/null
> +++ b/Documentation/arm64/elf_at_flags.txt
> @@ -0,0 +1,111 @@
> +ARM64 ELF AT_FLAGS
> +==================
> +
> +This document describes the usage and semantics of AT_FLAGS on arm64.
> +
> +1. Introduction
> +---------------
> +
> +AT_FLAGS is part of the Auxiliary Vector, contains the flags and it
> +is currently set to zero by the kernel on arm64.
> +
> +The auxiliary vector can be accessed by the userspace using the
> +getauxval() API provided by the C library.
> +getauxval() returns an unsigned long and when a flag is present in
> +the AT_FLAGS, the corresponding bit in the returned value is set to 1.
> +
> +The AT_FLAGS with a "defined semantic" on arm64 are exposed to the
> +userspace via user API (uapi/asm/atflags.h).
> +The AT_FLAGS bits with "undefined semantics" are set to zero by default.
> +This means that the AT_FLAGS bits to which this document does not assign
> +an explicit meaning are to be intended reserved for future use.
> +The kernel will populate all such bits with zero until meanings are
> +assigned to them. If and when meanings are assigned, it is guaranteed
> +that they will not impact the functional operation of existing userspace
> +software. Userspace software should ignore any AT_FLAGS bit whose meaning
> +is not defined when the software is written.
> +
> +The userspace software can test for features by acquiring the AT_FLAGS
> +entry of the auxiliary vector, and testing whether a relevant flag
> +is set.
> +
> +Example of a userspace test function:
> +
> +bool feature_x_is_present(void)
> +{
> + unsigned long at_flags = getauxval(AT_FLAGS);
> + if (at_flags & FEATURE_X)
> + return true;
> +
> + return false;
> +}
> +
> +Where the software relies on a feature advertised by AT_FLAGS, it
> +should check that the feature is present before attempting to
> +use it.
> +
> +2. Features exposed via AT_FLAGS
> +--------------------------------
> +
> +bit[0]: ARM64_AT_FLAGS_SYSCALL_TBI
> +
> + On arm64 the TCR_EL1.TBI0 bit has been set since Linux 3.x hence
> + the userspace (EL0) is allowed to set a non-zero value in the top
> + byte but the resulting pointers are not allowed at the user-kernel
> + syscall ABI boundary.
> + When bit[0] is set to 1 the kernel is advertising to the userspace
> + that a relaxed ABI is supported hence this type of pointers are now
> + allowed to be passed to the syscalls, when these pointers are in
> + memory ranges obtained by anonymous (MAP_ANONYMOUS) mmap() or brk().

"TBI" is a slightly odd name.

The kernel seems not to be ignoring the top byte, otherwise how could
it make a difference whehter the memory is anonymous or something else?

(With memory tagging enabled, the top byte is also not architecturally
ignored.)

> + In these cases the tag is preserved as the pointer goes through the
> + kernel. Only when the kernel needs to check if a pointer is coming
> + from userspace (i.e. access_ok()) an untag operation is required.

Does the last sentence belong here? That's about kernel internals,
whereas the rest all seems user-facing.

> +
> +3. ARM64_AT_FLAGS_SYSCALL_TBI
> +-----------------------------
> +
> +When ARM64_AT_FLAGS_SYSCALL_TBI is enabled every syscalls can accept tagged
> +pointers, when these pointers are in memory ranges obtained by an anonymous
> +(MAP_ANONYMOUS) mmap() or brk().
> +
> +A definition of the meaning of tagged pointers on arm64 can be found in:
> +Documentation/arm64/tagged-pointers.txt.
> +
> +When a pointer does not are in a memory range obtained by an anonymous mmap()
> +or brk(), this can not be passed to a syscall if it is tagged.
> +
> +To be more explicit: a syscall can accept pointers whose memory range is
> +obtained by a non-anonymous mmap() or brk() if and only if the tag encoded in
> +the top-byte is 0x00.
> +
> +When a new syscall is added, this can accept tagged pointers if and only if
> +these pointers are in memory ranges obtained by an anonymous (MAP_ANONYMOUS)
> +mmap() or brk(). In all the other cases, the tag encoded in the top-byte is
> +expected to be 0x00.

Does this apply to kernel interfaces that are not syscalls?

And does it apply to ioctls in general (I think from discussions
elsewhere that it can't).

What about things that flow through the kernel, like an
si_value.sival_ptr that propagates from sigqueue(2) to the signal frame
of the signalled thread, or registered with the kernel via aio_read(2)?

This kind of thing is why I would like to define a set of rules for
making an educated guess about how the kernel should interpret
arbitrary arguments (in an ideal world).


With issues like this in the mix, it seems difficult to extract general
guarantees from ARM64_AT_FLAGS_SYSCALL_TBI. If it means that just
certain specific uses of a few specific syscalls work with tagged
pointers, that may not be very useful by ifself? It sounds like if
you are tagging memory at all, you suddenly need to port every random
library you're using.

[...]

Cheers
---Dave

2018-12-18 15:07:37

by Andrey Konovalov

[permalink] [raw]
Subject: Re: [RFC][PATCH 0/3] arm64 relaxed ABI

On Wed, Dec 12, 2018 at 4:02 PM Catalin Marinas <[email protected]> wrote:
>
> Hi Andrey,
>
> On Wed, Dec 12, 2018 at 03:23:25PM +0100, Andrey Konovalov wrote:
> > On Mon, Dec 10, 2018 at 3:31 PM Vincenzo Frascino
> > <[email protected]> wrote:
> > > On arm64 the TCR_EL1.TBI0 bit has been set since Linux 3.x hence
> > > the userspace (EL0) is allowed to set a non-zero value in the top
> > > byte but the resulting pointers are not allowed at the user-kernel
> > > syscall ABI boundary.
> > >
> > > This patchset proposes a relaxation of the ABI and a mechanism to
> > > advertise it to the userspace via an AT_FLAGS.
> > >
> > > The rationale behind the choice of AT_FLAGS is that the Unix System V
> > > ABI defines AT_FLAGS as "flags", leaving some degree of freedom in
> > > interpretation.
> > > There are two previous attempts of using AT_FLAGS in the Linux Kernel
> > > for different reasons: the first was more generic and was used to expose
> > > the support for the GNU STACK NX feature [1] and the second was done for
> > > the MIPS architecture and was used to expose the support of "MIPS ABI
> > > Extension for IEEE Std 754 Non-Compliant Interlinking" [2].
> > > Both the changes are currently _not_ merged in mainline.
> > > The only architecture that reserves some of the bits in AT_FLAGS is
> > > currently MIPS, which introduced the concept of platform specific ABI
> > > (psABI) reserving the top-byte [3].
> > >
> > > When ARM64_AT_FLAGS_SYSCALL_TBI is set the kernel is advertising
> > > to the userspace that a relaxed ABI is supported hence this type
> > > of pointers are now allowed to be passed to the syscalls when they are
> > > in memory ranges obtained by anonymous mmap() or brk().
> > >
> > > The userspace _must_ verify that the flag is set before passing tagged
> > > pointers to the syscalls allowed by this relaxation.
> > >
> > > More in general, exposing the ARM64_AT_FLAGS_SYSCALL_TBI flag and mandating
> > > to the software to check that the feature is present, before using the
> > > associated functionality, it provides a degree of control on the decision
> > > of disabling such a feature in future without consequently breaking the
> > > userspace.
> [...]
> > Acked-by: Andrey Konovalov <[email protected]>
>
> Thanks for the ack. However, if we go ahead with this ABI proposal it
> means that your patches need to be reworked to allow a non-zero top byte
> in all syscalls, including mmap() and friends, ioctl(). There are ABI
> concerns in either case but I'd rather have this discussion in the open.
> It doesn't necessarily mean that I endorse this proposal, I would like
> feedback and not just from kernel developers but user space ones.
>
> The summary of our internal discussions (mostly between kernel
> developers) is that we can't properly describe a user ABI that covers
> future syscalls or syscall extensions while not all syscalls accept
> tagged pointers. So we tweaked the requirements slightly to only allow
> tagged pointers back into the kernel *if* the originating address is
> from an anonymous mmap() or below sbrk(0). This should cover some of the
> ioctls or getsockopt(TCP_ZEROCOPY_RECEIVE) where the user passes a
> pointer to a buffer obtained via mmap() on the device operations.
>
> (sorry for not being clear on what Vincenzo's proposal implies)

OK, I see. So I need to make the following changes to my patchset AFAIU.

1. Make sure that we only allow tagged user addresses that originate
from an anonymous mmap() or below sbrk(0). How exactly should this
check be performed?

2. Allow tagged addressed passed to memory syscalls (as long as (1) is
satisfied). Do I understand correctly that this means that I need to
locate all find_vma() callers outside of mm/ and fix them up as well?

2018-12-18 18:36:37

by Andrey Konovalov

[permalink] [raw]
Subject: Re: [PATCH v9 0/8] arm64: untag user pointers passed to the kernel

On Wed, Dec 12, 2018 at 6:01 PM Dave Martin <[email protected]> wrote:
>
> On Mon, Dec 10, 2018 at 01:50:57PM +0100, Andrey Konovalov wrote:
> > arm64 has a feature called Top Byte Ignore, which allows to embed pointer
> > tags into the top byte of each pointer. Userspace programs (such as
> > HWASan, a memory debugging tool [1]) might use this feature and pass
> > tagged user pointers to the kernel through syscalls or other interfaces.
> >
> > Right now the kernel is already able to handle user faults with tagged
> > pointers, due to these patches:
> >
> > 1. 81cddd65 ("arm64: traps: fix userspace cache maintenance emulation on a
> > tagged pointer")
> > 2. 7dcd9dd8 ("arm64: hw_breakpoint: fix watchpoint matching for tagged
> > pointers")
> > 3. 276e9327 ("arm64: entry: improve data abort handling of tagged
> > pointers")
> >
> > When passing tagged pointers to syscalls, there's a special case of such a
> > pointer being passed to one of the memory syscalls (mmap, mprotect, etc.).
> > These syscalls don't do memory accesses but rather deal with memory
> > ranges, hence an untagged pointer is better suited.
> >
> > This patchset extends tagged pointer support to non-memory syscalls. This
> > is done by reusing the untagged_addr macro to untag user pointers when the
> > kernel performs pointer checking to find out whether the pointer comes
> > from userspace (most notably in access_ok). The untagging is done only
> > when the pointer is being checked, the tag is preserved as the pointer
> > makes its way through the kernel.
> >
> > One of the alternative approaches to untagging that was considered is to
> > completely strip the pointer tag as the pointer enters the kernel with
> > some kind of a syscall wrapper, but that won't work with the countless
> > number of different ioctl calls. With this approach we would need a custom
> > wrapper for each ioctl variation, which doesn't seem practical.
> >
> > The following testing approaches has been taken to find potential issues
> > with user pointer untagging:
> >
> > 1. Static testing (with sparse [2] and separately with a custom static
> > analyzer based on Clang) to track casts of __user pointers to integer
> > types to find places where untagging needs to be done.
> >
> > 2. Dynamic testing: adding BUG_ON(has_tag(addr)) to find_vma() and running
> > a modified syzkaller version that passes tagged pointers to the kernel.
> >
> > Based on the results of the testing the requried patches have been added
> > to the patchset.
> >
> > This patchset has been merged into the Pixel 2 kernel tree and is now
> > being used to enable testing of Pixel 2 phones with HWASan.

Hi, Dave,

>
> Do you have an idea of how much of the user/kernel interface is covered
> by this workload?

Not really. I don't even know what kind of measurements can be used to
obtain this estimate. But Pixel 2 kernel with these patches + Android
runtime instrumented with HWASan works.

>
> > This patchset is a prerequisite for ARM's memory tagging hardware feature
> > support [3].
>
> It looks like there's been a lot of progress made here towards smoking
> out most of the sites in the kernel where pointers need to be untagged.
>
> However, I do think that we need a clear policy for how existing kernel
> interfaces are to be interpreted in the presence of tagged pointers.
> Unless we have that nailed down, we are likely to be able to make only
> vague guarantees to userspace about what works, and the ongoing risk
> of ABI regressions and inconsistencies seems high.
>
> I don't really see how we can advertise a full system interface if we
> know some subset of it doesn't work for foreseeable userspace
> environments. I feel that presenting the current changes as an ABI
> relaxation may be a recipe for future problems, since the forwards
> compatibility guarantees we're able to make today are few and rather
> vague.
>
> Can we define an opt-in for tagged-pointer userspace, that rejects all
> syscalls that we haven't checked and whitelisted (or that are
> uncheckable like ioctl)? This reflects the reality that we don't have
> a regular userspace environment in which standards-compliant software
> that uses address tags in a reasonable way will just work.
>
> It might be feasible to promote this to be enabled by default later on,
> if it becomes sufficiently complete.
>
>
> In the meantime, I think we really need to nail down the kernel's
> policies on
>
> * in the default configuration (without opt-in), is the presence of
> non-address bits in pointers exchanged with the kernel simply
> considered broken? (Even with this series, the de factor answer
> generally seems to be "yes", although many specific things will now
> work fine)
>
> * if not, how do we tighten syscall / interface specifications to
> describe what happens with pointers containing non-address bits, while
> keeping the existing behaviour for untagged pointers?
>
> We would want a general recipe that gives clear guidance on what
> userspace should expect an arbitrarily chosen syscall to do with its
> pointers, without having to enumerate each and every case.
>
> To be sustainable, we would also need to solve that in a way that
> doesn't need to be reintented per-arch.

As I understand your main concern is userspace/kernel ABI changes
these patches introduce. This concern was already pointed out by
Catalin, and working out the details is still in progress.

>
> There may already be some background on these topics -- can you throw me
> a link if so?

I don't have a single link, I would suggest to look at the comments
for all the previous versions of this patchset. I see you saw the
pathset by Vincenzo, it also has some information about this.

>
> Cheers
> ---Dave

Thanks!

2018-12-18 18:39:08

by Catalin Marinas

[permalink] [raw]
Subject: Re: [RFC][PATCH 0/3] arm64 relaxed ABI

On Tue, Dec 18, 2018 at 04:03:38PM +0100, Andrey Konovalov wrote:
> On Wed, Dec 12, 2018 at 4:02 PM Catalin Marinas <[email protected]> wrote:
> > The summary of our internal discussions (mostly between kernel
> > developers) is that we can't properly describe a user ABI that covers
> > future syscalls or syscall extensions while not all syscalls accept
> > tagged pointers. So we tweaked the requirements slightly to only allow
> > tagged pointers back into the kernel *if* the originating address is
> > from an anonymous mmap() or below sbrk(0). This should cover some of the
> > ioctls or getsockopt(TCP_ZEROCOPY_RECEIVE) where the user passes a
> > pointer to a buffer obtained via mmap() on the device operations.
> >
> > (sorry for not being clear on what Vincenzo's proposal implies)
>
> OK, I see. So I need to make the following changes to my patchset AFAIU.
>
> 1. Make sure that we only allow tagged user addresses that originate
> from an anonymous mmap() or below sbrk(0). How exactly should this
> check be performed?

I don't think we should perform such checks. That's rather stating that
the kernel only guarantees that the tagged pointers work if they
originated from these memory ranges.

> 2. Allow tagged addressed passed to memory syscalls (as long as (1) is
> satisfied). Do I understand correctly that this means that I need to
> locate all find_vma() callers outside of mm/ and fix them up as well?

Yes (unless anyone as a better idea or objections to this approach).

BTW, I'll be off until the new year, so won't be able to follow up.

--
Catalin

2018-12-19 15:37:40

by Dave Martin

[permalink] [raw]
Subject: Re: [RFC][PATCH 0/3] arm64 relaxed ABI

On Tue, Dec 18, 2018 at 05:59:38PM +0000, Catalin Marinas wrote:
> On Tue, Dec 18, 2018 at 04:03:38PM +0100, Andrey Konovalov wrote:
> > On Wed, Dec 12, 2018 at 4:02 PM Catalin Marinas <[email protected]> wrote:
> > > The summary of our internal discussions (mostly between kernel
> > > developers) is that we can't properly describe a user ABI that covers
> > > future syscalls or syscall extensions while not all syscalls accept
> > > tagged pointers. So we tweaked the requirements slightly to only allow
> > > tagged pointers back into the kernel *if* the originating address is
> > > from an anonymous mmap() or below sbrk(0). This should cover some of the
> > > ioctls or getsockopt(TCP_ZEROCOPY_RECEIVE) where the user passes a
> > > pointer to a buffer obtained via mmap() on the device operations.
> > >
> > > (sorry for not being clear on what Vincenzo's proposal implies)
> >
> > OK, I see. So I need to make the following changes to my patchset AFAIU.
> >
> > 1. Make sure that we only allow tagged user addresses that originate
> > from an anonymous mmap() or below sbrk(0). How exactly should this
> > check be performed?
>
> I don't think we should perform such checks. That's rather stating that
> the kernel only guarantees that the tagged pointers work if they
> originated from these memory ranges.

I concur.

Really, the kernel should do the expected thing with all "non-weird"
memory.

In lieu of a proper definition of "non-weird", I think we should have
some lists of things that are explicitly included, and also excluded:

OK:
kernel-allocated process stack
brk area
MAP_ANONYMOUS | MAP_PRIVATE
MAP_PRIVATE mappings of /dev/zero

Not OK:
MAP_SHARED
mmaps of non-memory-like devices
mmaps of anything that is not a regular file
the VDSO
...

In general, userspace can tag memory that it "owns", and we do not assume
a transfer of ownership except in the "OK" list above. Otherwise, it's
the kernel's memory, or the owner is simply not well defined.


I would also like to see advice for userspace developers, particularly
things like (strawman, please challenge!):

* Userspace should set tags at the point of allocation only.

* If you don't know how an object was allocated, you cannot modify the
tag, period.

* A single C object should be accessed using a single, fixed pointer tag
throughout its entire lifetime.

* Tags can be changed only when there are no outstanding pointers to
the affected object or region that may be used to access the object
or region (i.e., if the object were allocated from the C heap and
is it safe to realloc() it, then it is safe to change the tag; for
other types of allocation, analogous arguments can be applied).

* When the kernel dereferences a pointer on userspace's behalf, it
shall behave equivalently to userspace dereferencing the same pointer,
including use of the same tag (where passed by userspace).

* Where the pointer tag affects pointer dereference behaviour (i.e.,
with hardware memory colouring) the kernel makes no guarantee to
honour pointer tags correctly for every location a buffer based on a
pointer passed by userspace to the kernel.

(This means for example that for a read(fd, buf, size), we can check
the tag for a single arbitrary location in *(char (*)[size])buf
before passing the buffer to get_user_pages(). Hopefully this could
be done in get_user_pages() itself rather than hunting call sites.
For userspace, it means that you're on your own if you ask the
kernel to operate on a buffer than spans multiple, independently-
allocated objects, or a deliberately striped single object.)

* The kernel shall not extend the lifetime of user pointers in ways
that are not clear from the specification of the syscall or
interface to which the pointer is passed (and in any case shall not
extend pointer lifetimes without good reason).

So no clever transparent caching between syscalls, unless it _really_
is transparent in the presence of tags.

* For purposes other than dereference, the kernel shall accept any
legitimately tagged pointer (according to the above rules) as
identifying the associated memory location.

So, mprotect(some_page_aligned_object, ...); is valid irrespective
of where page_aligned_object() came from. There is no implicit
derefence by the kernel here, hence no tag check.

The kernel does not guarantee to work correctly if the wrong tag
is used, but there is not always a well-defined "right" tag, so
we can't really guarantee to check it. So a pointer derived by
any reasonable means by userspace has to be treated as equally
valid.


We would need to get some cross-arch buy-in for this, otherwise core
maintainers might just refuse to maintain the necessary guarantees.


> > 2. Allow tagged addressed passed to memory syscalls (as long as (1) is
> > satisfied). Do I understand correctly that this means that I need to
> > locate all find_vma() callers outside of mm/ and fix them up as well?
>
> Yes (unless anyone as a better idea or objections to this approach).

Also, watch out for code that pokes about inside struct vma directly.

I'm wondering, could we define an explicit type, say,

struct user_vaddr {
unsigned long addr;
};

to replace the unsigned longs in struct vma the mm API? This would
turn ad-hoc (unsigned long) casts into build breaks. We could have
an explicit conversion functions, say,

struct user_vaddr __user_vaddr_unsafe(void __user *);
void __user *__user_ptr_unsafe(struct user_vaddr);

that we robotically insert in all the relevant places to mark
unaudited code.

This allows us to keep the kernel buildable, while flagging things
that will need review. We would also need to warn the mm folks to
reject any new code using these unsafe conversions.

Of course, it would be a non-trivial effort...

>
> BTW, I'll be off until the new year, so won't be able to follow up.

Cheers
---Dave

2019-01-09 13:38:53

by Vincenzo Frascino

[permalink] [raw]
Subject: Re: [RFC][PATCH 2/3] arm64: Define Documentation/arm64/elf_at_flags.txt

On 12/12/2018 17:34, Dave Martin wrote:
> On Mon, Dec 10, 2018 at 02:30:43PM +0000, Vincenzo Frascino wrote:
>> On arm64 the TCR_EL1.TBI0 bit has been set since Linux 3.x hence
>> the userspace (EL0) is allowed to set a non-zero value in the
>> top byte but the resulting pointers are not allowed at the
>> user-kernel syscall ABI boundary.
>>
>> With the relaxed ABI proposed through this document, it is now possible
>> to pass tagged pointers to the syscalls, when these pointers are in
>> memory ranges obtained by an anonymous (MAP_ANONYMOUS) mmap() or brk().
>
> What about other anonymous memory such as the process stack?
>

The process stack is set up by the kernel as a consequence of the exec* family
of system calls in a way that should be transparent to the userspace.
For what concerns the userspace applications, the ones that are aware of the
AT_FLAGS[0] should assume that the stack is always tagged.

> What about MAP_PRIVATE mappings of /dev/zero (i.e., oldskool "anonymous"
> mappings)?
>

MAP_PRIVATE of /dev/zero is equivalent to MAP_ANONYMOUS | MAP_PRIVATE, hence it
should follow the same rule of MAP_ANONYMOUS obtained memory ranges.

> I wonder whether this should really say MAP_PRIVATE rather than
> MAP_ANONYMOUS. There are two requirements here:
>
> * the memory must be the exclusive property of a single process,
> otherwise tagging it on-the-fly could break some other process;
>

I think that it is responsibility of the userspace to tag or not a shared area.
I do not think that we should exclude MAP_ANONYMOUS | MAP_SHARED because if such
an area is tagged improperly by a userspace process, the kernel will be able to
detect the issue and report it.

> * the memory should be regular memory, i.e., not something like a
> mapped device. Since copy-on-write mappings of devices make little
> sense, and we want writes to devices to propagate to the hardware
> directly, MAP_PRIVATE doesn't make a lot of sense for such mappings.
>

What we are trying to target with this ABI change are only pointers in NORMAL
memory ranges. DEVICE memory should be untagged.

...

> I'm assuming here that tagging some currently shared copy-on-write pages
> would throw a page fault and trigger a copy, so that we end up tagging
> the calling process's private copy of the page.
>
> I also don't see how the above requirements conflict with regular file-
> backed mappings (which would need to work if you want to be able to
> tag objects in .bss or .data etc.)
>

I agree that we should look at the more general cases to make sure that we will
not face any incompatibility in future, but, what I am proposing here, assumes
that the memory areas that can be tagged are only the stack and the heap. This
is because the features that will use it will be mostly looking for memory
related errors in these areas.

>> This change in the ABI requires a mechanism to inform the userspace
>> that such an option is available.
>>
>> This patch specifies and documents the way on which AT_FLAGS can be
>> used to advertise this feature to the userspace.
>>
>> Cc: Catalin Marinas <[email protected]>
>> Cc: Will Deacon <[email protected]>
>> CC: Andrey Konovalov <[email protected]>
>> Signed-off-by: Vincenzo Frascino <[email protected]>
>> ---
>> Documentation/arm64/elf_at_flags.txt | 111 +++++++++++++++++++++++++++
>> 1 file changed, 111 insertions(+)
>> create mode 100644 Documentation/arm64/elf_at_flags.txt
>>
>> diff --git a/Documentation/arm64/elf_at_flags.txt b/Documentation/arm64/elf_at_flags.txt
>> new file mode 100644
>> index 000000000000..153e657c058a
>> --- /dev/null
>> +++ b/Documentation/arm64/elf_at_flags.txt
>> @@ -0,0 +1,111 @@
>> +ARM64 ELF AT_FLAGS
>> +==================
>> +
>> +This document describes the usage and semantics of AT_FLAGS on arm64.
>> +
>> +1. Introduction
>> +---------------
>> +
>> +AT_FLAGS is part of the Auxiliary Vector, contains the flags and it
>> +is currently set to zero by the kernel on arm64.
>> +
>> +The auxiliary vector can be accessed by the userspace using the
>> +getauxval() API provided by the C library.
>> +getauxval() returns an unsigned long and when a flag is present in
>> +the AT_FLAGS, the corresponding bit in the returned value is set to 1.
>> +
>> +The AT_FLAGS with a "defined semantic" on arm64 are exposed to the
>> +userspace via user API (uapi/asm/atflags.h).
>> +The AT_FLAGS bits with "undefined semantics" are set to zero by default.
>> +This means that the AT_FLAGS bits to which this document does not assign
>> +an explicit meaning are to be intended reserved for future use.
>> +The kernel will populate all such bits with zero until meanings are
>> +assigned to them. If and when meanings are assigned, it is guaranteed
>> +that they will not impact the functional operation of existing userspace
>> +software. Userspace software should ignore any AT_FLAGS bit whose meaning
>> +is not defined when the software is written.
>> +
>> +The userspace software can test for features by acquiring the AT_FLAGS
>> +entry of the auxiliary vector, and testing whether a relevant flag
>> +is set.
>> +
>> +Example of a userspace test function:
>> +
>> +bool feature_x_is_present(void)
>> +{
>> + unsigned long at_flags = getauxval(AT_FLAGS);
>> + if (at_flags & FEATURE_X)
>> + return true;
>> +
>> + return false;
>> +}
>> +
>> +Where the software relies on a feature advertised by AT_FLAGS, it
>> +should check that the feature is present before attempting to
>> +use it.
>> +
>> +2. Features exposed via AT_FLAGS
>> +--------------------------------
>> +
>> +bit[0]: ARM64_AT_FLAGS_SYSCALL_TBI
>> +
>> + On arm64 the TCR_EL1.TBI0 bit has been set since Linux 3.x hence
>> + the userspace (EL0) is allowed to set a non-zero value in the top
>> + byte but the resulting pointers are not allowed at the user-kernel
>> + syscall ABI boundary.
>> + When bit[0] is set to 1 the kernel is advertising to the userspace
>> + that a relaxed ABI is supported hence this type of pointers are now
>> + allowed to be passed to the syscalls, when these pointers are in
>> + memory ranges obtained by anonymous (MAP_ANONYMOUS) mmap() or brk().
>
> "TBI" is a slightly odd name.
>
> The kernel seems not to be ignoring the top byte, otherwise how could
> it make a difference whehter the memory is anonymous or something else?
>
> (With memory tagging enabled, the top byte is also not architecturally
> ignored.)
>

The top-byte is ignored for MMU translations hence it can carry information that
are relevant for other aspects. This means that the kernel can not ignore that
the top-byte could be set.

>> + In these cases the tag is preserved as the pointer goes through the
>> + kernel. Only when the kernel needs to check if a pointer is coming
>> + from userspace (i.e. access_ok()) an untag operation is required.
>
> Does the last sentence belong here? That's about kernel internals,
> whereas the rest all seems user-facing.
>

I put it here just to give an idea to the user on when the kernel is going to
perform an untag operation. But if we want to keep the document completely
user-facing, I am fine with removing it.

>> +
>> +3. ARM64_AT_FLAGS_SYSCALL_TBI
>> +-----------------------------
>> +
>> +When ARM64_AT_FLAGS_SYSCALL_TBI is enabled every syscalls can accept tagged
>> +pointers, when these pointers are in memory ranges obtained by an anonymous
>> +(MAP_ANONYMOUS) mmap() or brk().
>> +
>> +A definition of the meaning of tagged pointers on arm64 can be found in:
>> +Documentation/arm64/tagged-pointers.txt.
>> +
>> +When a pointer does not are in a memory range obtained by an anonymous mmap()
>> +or brk(), this can not be passed to a syscall if it is tagged.
>> +
>> +To be more explicit: a syscall can accept pointers whose memory range is
>> +obtained by a non-anonymous mmap() or brk() if and only if the tag encoded in
>> +the top-byte is 0x00.
>> +
>> +When a new syscall is added, this can accept tagged pointers if and only if
>> +these pointers are in memory ranges obtained by an anonymous (MAP_ANONYMOUS)
>> +mmap() or brk(). In all the other cases, the tag encoded in the top-byte is
>> +expected to be 0x00.
>
> Does this apply to kernel interfaces that are not syscalls?
>

Not at the moment, but it does not seem incompatible with a future extension.

> And does it apply to ioctls in general (I think from discussions
> elsewhere that it can't).
>

Requiring that the pointers accepted by the syscalls belong to ranges that are
obtained by anonymous mappings, should cover the drivers mmap() cases.

> What about things that flow through the kernel, like an
> si_value.sival_ptr that propagates from sigqueue(2) to the signal frame
> of the signalled thread, or registered with the kernel via aio_read(2)?
>
> This kind of thing is why I would like to define a set of rules for
> making an educated guess about how the kernel should interpret
> arbitrary arguments (in an ideal world).
>

I think that to make an educated choice we need to restrict the problem we are
trying to target (making sure that we do not take any assumption that could be
incompatible with the general case) as a first step. Once we are confident that
it works as expected we can try to extend it progressively.

>
> With issues like this in the mix, it seems difficult to extract general
> guarantees from ARM64_AT_FLAGS_SYSCALL_TBI. If it means that just
> certain specific uses of a few specific syscalls work with tagged
> pointers, that may not be very useful by ifself? It sounds like if
> you are tagging memory at all, you suddenly need to port every random
> library you're using.
>

The ABI presented in this document seems quite general (it is not confined to
specific syscalls). This means that an application that tags the memory can use
any non tag aware random library as is (without porting it), as far as it
sanitizes the pointers before and after calling it.

> [...]
>
> Cheers
> ---Dave
>

--
Regards,
Vincenzo

2019-02-11 11:36:11

by Catalin Marinas

[permalink] [raw]
Subject: Re: [PATCH v9 0/8] arm64: untag user pointers passed to the kernel

Hi Dave,

On Wed, Dec 12, 2018 at 05:01:12PM +0000, Dave P Martin wrote:
> On Mon, Dec 10, 2018 at 01:50:57PM +0100, Andrey Konovalov wrote:
> > arm64 has a feature called Top Byte Ignore, which allows to embed pointer
> > tags into the top byte of each pointer. Userspace programs (such as
> > HWASan, a memory debugging tool [1]) might use this feature and pass
> > tagged user pointers to the kernel through syscalls or other interfaces.
[...]
> It looks like there's been a lot of progress made here towards smoking
> out most of the sites in the kernel where pointers need to be untagged.

In summary, based on last summer's analysis, there are two main (and
rather broad) scenarios of __user pointers use in the kernel: (a)
uaccess macros, together with access_ok() checks and (b) identifying
of user address ranges (find_vma() and related, some ioctls). The
patches here handle the former by allowing sign-extension in access_ok()
and subsequent uaccess routines work fine with tagged pointers.
Identifying the latter is a bit more problematic and the approach we
took was tracking down pointer to long conversion which seems to cover
the majority of cases. However, this approach doesn't scale as, for
example, we'd rather change get_user_pages() to sign-extend the address
rather than all the callers. In lots of other cases we don't even need
untagging as we don't expect user space to tag such pointers (i.e.
mmap() of device memory).

We might be able to improve the static analysis by introducing a
virt_addr_t but that's significant effort and we still won't cover all
cases (e.g. it doesn't necessarily catch tcp_zerocopy_receive() which
wouldn't use a pointer, just a u64 for address).

> However, I do think that we need a clear policy for how existing kernel
> interfaces are to be interpreted in the presence of tagged pointers.
> Unless we have that nailed down, we are likely to be able to make only
> vague guarantees to userspace about what works, and the ongoing risk
> of ABI regressions and inconsistencies seems high.

I agree.

> Can we define an opt-in for tagged-pointer userspace, that rejects all
> syscalls that we haven't checked and whitelisted (or that are
> uncheckable like ioctl)?

Defining an opt-in is not a problem, however, rejecting all syscalls
that we haven't whitelisted is not feasible. We can have an opt-in per
process (that's what we were going to do with MTE) but the only thing
we can reasonably do is change the behaviour of access_ok(). That's too
big a knob and a new syscall that we haven't got around to whitelist may
just work. This eventually leads to de-facto ABI and our whitelist would
simply be ignored.

I'm not really keen on a big syscall shim in the arm64 kernel which
checks syscall arguments, including in-struct values. If we are to do
this, I'd rather keep it in user space as part of the C library.

> In the meantime, I think we really need to nail down the kernel's
> policies on
>
> * in the default configuration (without opt-in), is the presence of
> non-address bits in pointers exchanged with the kernel simply
> considered broken? (Even with this series, the de factor answer
> generally seems to be "yes", although many specific things will now
> work fine)

Without these patches, passing non-address bits in pointers is
considered broken. I couldn't find a case where it would still work with
non-zero tag but maybe I haven't looked hard enough.

> * if not, how do we tighten syscall / interface specifications to
> describe what happens with pointers containing non-address bits, while
> keeping the existing behaviour for untagged pointers?
>
> We would want a general recipe that gives clear guidance on what
> userspace should expect an arbitrarily chosen syscall to do with its
> pointers, without having to enumerate each and every case.

That's what we are aiming with the pointer origins, to move away from a
syscall whitelist to a generic definition. That said, the two approaches
are orthogonal, we can use the pointer origins as the base rule for
which syscalls can be whitelisted.

If we step back a bit to look at the use-case for TBI (and MTE), the
normal application programmer shouldn't really care about this ABI
(well, most of the time). The app gets a tagged pointer from the C
library as a result of a malloc()/realloc() (possibly alloca()) call and
it expects to be able to pass it back into the kernel (usually via the C
library) without any awareness of the non-address bits. Now, we can't
define a user/kernel ABI based on the provenance of the pointer in user
space (i.e. we only support tags for heap and stack), so we are trying
to generalise this based where the pointer originated from in the kernel
(e.g. anonymous mmap()).

> There may already be some background on these topics -- can you throw me
> a link if so?

That's an interesting sub-thread to read:

https://lore.kernel.org/lkml/5d54526e5ff2e5ad63d0dfdd9ab17cf359afa4f2.1535629099.git.andreyknvl@google.com/

--
Catalin

2019-02-11 20:28:52

by Dave Martin

[permalink] [raw]
Subject: Re: [PATCH v9 0/8] arm64: untag user pointers passed to the kernel

On Mon, Feb 11, 2019 at 11:35:12AM +0000, Catalin Marinas wrote:
> Hi Dave,
>
> On Wed, Dec 12, 2018 at 05:01:12PM +0000, Dave P Martin wrote:
> > On Mon, Dec 10, 2018 at 01:50:57PM +0100, Andrey Konovalov wrote:
> > > arm64 has a feature called Top Byte Ignore, which allows to embed pointer
> > > tags into the top byte of each pointer. Userspace programs (such as
> > > HWASan, a memory debugging tool [1]) might use this feature and pass
> > > tagged user pointers to the kernel through syscalls or other interfaces.
> [...]
> > It looks like there's been a lot of progress made here towards smoking
> > out most of the sites in the kernel where pointers need to be untagged.
>
> In summary, based on last summer's analysis, there are two main (and
> rather broad) scenarios of __user pointers use in the kernel: (a)
> uaccess macros, together with access_ok() checks and (b) identifying
> of user address ranges (find_vma() and related, some ioctls). The
> patches here handle the former by allowing sign-extension in access_ok()
> and subsequent uaccess routines work fine with tagged pointers.
> Identifying the latter is a bit more problematic and the approach we
> took was tracking down pointer to long conversion which seems to cover
> the majority of cases. However, this approach doesn't scale as, for
> example, we'd rather change get_user_pages() to sign-extend the address
> rather than all the callers. In lots of other cases we don't even need
> untagging as we don't expect user space to tag such pointers (i.e.
> mmap() of device memory).
>
> We might be able to improve the static analysis by introducing a
> virt_addr_t but that's significant effort and we still won't cover all
> cases (e.g. it doesn't necessarily catch tcp_zerocopy_receive() which
> wouldn't use a pointer, just a u64 for address).
>
> > However, I do think that we need a clear policy for how existing kernel
> > interfaces are to be interpreted in the presence of tagged pointers.
> > Unless we have that nailed down, we are likely to be able to make only
> > vague guarantees to userspace about what works, and the ongoing risk
> > of ABI regressions and inconsistencies seems high.
>
> I agree.
>
> > Can we define an opt-in for tagged-pointer userspace, that rejects all
> > syscalls that we haven't checked and whitelisted (or that are
> > uncheckable like ioctl)?
>
> Defining an opt-in is not a problem, however, rejecting all syscalls
> that we haven't whitelisted is not feasible. We can have an opt-in per
> process (that's what we were going to do with MTE) but the only thing
> we can reasonably do is change the behaviour of access_ok(). That's too
> big a knob and a new syscall that we haven't got around to whitelist may
> just work. This eventually leads to de-facto ABI and our whitelist would
> simply be ignored.
>
> I'm not really keen on a big syscall shim in the arm64 kernel which
> checks syscall arguments, including in-struct values. If we are to do
> this, I'd rather keep it in user space as part of the C library.
>
> > In the meantime, I think we really need to nail down the kernel's
> > policies on
> >
> > * in the default configuration (without opt-in), is the presence of
> > non-address bits in pointers exchanged with the kernel simply
> > considered broken? (Even with this series, the de factor answer
> > generally seems to be "yes", although many specific things will now
> > work fine)
>
> Without these patches, passing non-address bits in pointers is
> considered broken. I couldn't find a case where it would still work with
> non-zero tag but maybe I haven't looked hard enough.
>
> > * if not, how do we tighten syscall / interface specifications to
> > describe what happens with pointers containing non-address bits, while
> > keeping the existing behaviour for untagged pointers?
> >
> > We would want a general recipe that gives clear guidance on what
> > userspace should expect an arbitrarily chosen syscall to do with its
> > pointers, without having to enumerate each and every case.
>
> That's what we are aiming with the pointer origins, to move away from a
> syscall whitelist to a generic definition. That said, the two approaches
> are orthogonal, we can use the pointer origins as the base rule for
> which syscalls can be whitelisted.
>
> If we step back a bit to look at the use-case for TBI (and MTE), the
> normal application programmer shouldn't really care about this ABI
> (well, most of the time). The app gets a tagged pointer from the C
> library as a result of a malloc()/realloc() (possibly alloca()) call and
> it expects to be able to pass it back into the kernel (usually via the C
> library) without any awareness of the non-address bits. Now, we can't
> define a user/kernel ABI based on the provenance of the pointer in user
> space (i.e. we only support tags for heap and stack), so we are trying
> to generalise this based where the pointer originated from in the kernel
> (e.g. anonymous mmap()).

This sounds generally reasonable.

It is not adequate for describing changing the tag on already-tagged
memory (which a memory allocator will definitely do), but we may be able
to come up with some weasel words to cover that.

It is also not adequete for describing tagging (and retagging) regions
of the stack -- but as you say, we can rule that use-case out for now
in the interest of simplicity, since we know we wouldn't be able to
deploy it widely for now anyway due to the incompability with non-MTE-
capable hardware.


Ideally we would clarify user/kernel interface semantics in terms of
object and pointer lifetimes and accessibility, but that's a larger
project that should be pursued separately (if at all).

I could also quibble about whether "anonymous mmap" is the right thing
here -- we should still give specific examples of things that do / don't
qualify, to make it clear what we mean.

Cheers
---Dave

2019-02-11 20:34:07

by Kevin Brodsky

[permalink] [raw]
Subject: Re: [RFC][PATCH 0/3] arm64 relaxed ABI

On 19/12/2018 12:52, Dave Martin wrote:
> On Tue, Dec 18, 2018 at 05:59:38PM +0000, Catalin Marinas wrote:
>> On Tue, Dec 18, 2018 at 04:03:38PM +0100, Andrey Konovalov wrote:
>>> On Wed, Dec 12, 2018 at 4:02 PM Catalin Marinas<[email protected]> wrote:
>>>> The summary of our internal discussions (mostly between kernel
>>>> developers) is that we can't properly describe a user ABI that covers
>>>> future syscalls or syscall extensions while not all syscalls accept
>>>> tagged pointers. So we tweaked the requirements slightly to only allow
>>>> tagged pointers back into the kernel *if* the originating address is
>>>> from an anonymous mmap() or below sbrk(0). This should cover some of the
>>>> ioctls or getsockopt(TCP_ZEROCOPY_RECEIVE) where the user passes a
>>>> pointer to a buffer obtained via mmap() on the device operations.
>>>>
>>>> (sorry for not being clear on what Vincenzo's proposal implies)
>>> OK, I see. So I need to make the following changes to my patchset AFAIU.
>>>
>>> 1. Make sure that we only allow tagged user addresses that originate
>>> from an anonymous mmap() or below sbrk(0). How exactly should this
>>> check be performed?
>> I don't think we should perform such checks. That's rather stating that
>> the kernel only guarantees that the tagged pointers work if they
>> originated from these memory ranges.
> I concur.
>
> Really, the kernel should do the expected thing with all "non-weird"
> memory.
>
> In lieu of a proper definition of "non-weird", I think we should have
> some lists of things that are explicitly included, and also excluded:
>
> OK:
> kernel-allocated process stack
> brk area
> MAP_ANONYMOUS | MAP_PRIVATE
> MAP_PRIVATE mappings of /dev/zero
>
> Not OK:
> MAP_SHARED
> mmaps of non-memory-like devices
> mmaps of anything that is not a regular file
> the VDSO
> ...
>
> In general, userspace can tag memory that it "owns", and we do not assume
> a transfer of ownership except in the "OK" list above. Otherwise, it's
> the kernel's memory, or the owner is simply not well defined.

Agreed on the general idea: a process should be able to pass tagged pointers at the
syscall interface, as long as they point to memory privately owned by the process. I
think it would be possible to simplify the definition of "non-weird" memory by using
only this "OK" list:
- mmap() done by the process itself, where either:
  * flags = MAP_PRIVATE | MAP_ANONYMOUS
  * flags = MAP_PRIVATE and fd refers to a regular file or a well-defined list of
device files (like /dev/zero)
- brk() done by the process itself
- Any memory mapped by the kernel in the new process's address space during execve(),
with the same restrictions as above ([vdso]/[vvar] are therefore excluded)

> I would also like to see advice for userspace developers, particularly
> things like (strawman, please challenge!):

To some extent, one could call me a userspace developer, so I'll try to help :)

> * Userspace should set tags at the point of allocation only.

Yes, tags are only supposed to be set at the point of either allocation or
deallocation/reallocation. However, allocators can in principle be nested, so an
allocator could  take a region allocated by malloc() as input and subdivide it
(changing tags in the process). That said, this suballocator must not free() that
region until all the suballocations themselves have been freed (thereby restoring the
tags initially set by malloc()).

> * If you don't know how an object was allocated, you cannot modify the
> tag, period.

Agreed, allocators that tag memory can only operate on memory with a well-defined
provenance (for instance anonymous mmap() or malloc()).

> * A single C object should be accessed using a single, fixed pointer tag
> throughout its entire lifetime.

Agreed. Allocators themselves may need to be excluded though, depending on how they
represent their managed memory.

> * Tags can be changed only when there are no outstanding pointers to
> the affected object or region that may be used to access the object
> or region (i.e., if the object were allocated from the C heap and
> is it safe to realloc() it, then it is safe to change the tag; for
> other types of allocation, analogous arguments can be applied).

Tags can only be changed at the point of deallocation/reallocation. Pointers to the
object become invalid and cannot be used after it has been deallocated; memory
tagging allows to catch such invalid usage.

> * When the kernel dereferences a pointer on userspace's behalf, it
> shall behave equivalently to userspace dereferencing the same pointer,
> including use of the same tag (where passed by userspace).
>
> * Where the pointer tag affects pointer dereference behaviour (i.e.,
> with hardware memory colouring) the kernel makes no guarantee to
> honour pointer tags correctly for every location a buffer based on a
> pointer passed by userspace to the kernel.
>
> (This means for example that for a read(fd, buf, size), we can check
> the tag for a single arbitrary location in *(char (*)[size])buf
> before passing the buffer to get_user_pages(). Hopefully this could
> be done in get_user_pages() itself rather than hunting call sites.
> For userspace, it means that you're on your own if you ask the
> kernel to operate on a buffer than spans multiple, independently-
> allocated objects, or a deliberately striped single object.)

I think both points are reasonable. It is very valuable for the kernel to access
userspace memory using the user-provided tag, because it enables kernel accesses to
be checked in the same way as user accesses, allowing to detect bugs that are
potentially hard to find. For instance, if a pointer to an object is passed to the
kernel after it has been deallocated, this is invalid and should be detected.
However, you are absolutely right that the kernel cannot *guarantee* that such a
check is carried out for the entire memory range (or in fact at all); it should be a
best-effort approach.

> * The kernel shall not extend the lifetime of user pointers in ways
> that are not clear from the specification of the syscall or
> interface to which the pointer is passed (and in any case shall not
> extend pointer lifetimes without good reason).
>
> So no clever transparent caching between syscalls, unless it _really_
> is transparent in the presence of tags.

Do you have any particular case in mind? If such caching is really valuable, it is
always possible to access the object while ignoring the tag. For sure, the
user-provided tag can only be used during the syscall handling itself, not
asynchronously later on, unless otherwise specified.

> * For purposes other than dereference, the kernel shall accept any
> legitimately tagged pointer (according to the above rules) as
> identifying the associated memory location.
>
> So, mprotect(some_page_aligned_object, ...); is valid irrespective
> of where page_aligned_object() came from. There is no implicit
> derefence by the kernel here, hence no tag check.
>
> The kernel does not guarantee to work correctly if the wrong tag
> is used, but there is not always a well-defined "right" tag, so
> we can't really guarantee to check it. So a pointer derived by
> any reasonable means by userspace has to be treated as equally
> valid.

This is a disputed point :) In my opinion, this is the the most reasonable approach.

Cheers,
Kevin

> We would need to get some cross-arch buy-in for this, otherwise core
> maintainers might just refuse to maintain the necessary guarantees.
>
>
>>> 2. Allow tagged addressed passed to memory syscalls (as long as (1) is
>>> satisfied). Do I understand correctly that this means that I need to
>>> locate all find_vma() callers outside of mm/ and fix them up as well?
>> Yes (unless anyone as a better idea or objections to this approach).
> Also, watch out for code that pokes about inside struct vma directly.
>
> I'm wondering, could we define an explicit type, say,
>
> struct user_vaddr {
> unsigned long addr;
> };
>
> to replace the unsigned longs in struct vma the mm API? This would
> turn ad-hoc (unsigned long) casts into build breaks. We could have
> an explicit conversion functions, say,
>
> struct user_vaddr __user_vaddr_unsafe(void __user *);
> void __user *__user_ptr_unsafe(struct user_vaddr);
>
> that we robotically insert in all the relevant places to mark
> unaudited code.
>
> This allows us to keep the kernel buildable, while flagging things
> that will need review. We would also need to warn the mm folks to
> reject any new code using these unsafe conversions.
>
> Of course, it would be a non-trivial effort...
>
>> BTW, I'll be off until the new year, so won't be able to follow up.
> Cheers
> ---Dave
>
> _______________________________________________
> linux-arm-kernel mailing list
> [email protected]
> http://lists.infradead.org/mailman/listinfo/linux-arm-kernel


2019-02-11 20:47:04

by Evgenii Stepanov

[permalink] [raw]
Subject: Re: [RFC][PATCH 0/3] arm64 relaxed ABI

On Mon, Feb 11, 2019 at 9:28 AM Kevin Brodsky <[email protected]> wrote:
>
> On 19/12/2018 12:52, Dave Martin wrote:
> > On Tue, Dec 18, 2018 at 05:59:38PM +0000, Catalin Marinas wrote:
> >> On Tue, Dec 18, 2018 at 04:03:38PM +0100, Andrey Konovalov wrote:
> >>> On Wed, Dec 12, 2018 at 4:02 PM Catalin Marinas<[email protected]> wrote:
> >>>> The summary of our internal discussions (mostly between kernel
> >>>> developers) is that we can't properly describe a user ABI that covers
> >>>> future syscalls or syscall extensions while not all syscalls accept
> >>>> tagged pointers. So we tweaked the requirements slightly to only allow
> >>>> tagged pointers back into the kernel *if* the originating address is
> >>>> from an anonymous mmap() or below sbrk(0). This should cover some of the
> >>>> ioctls or getsockopt(TCP_ZEROCOPY_RECEIVE) where the user passes a
> >>>> pointer to a buffer obtained via mmap() on the device operations.
> >>>>
> >>>> (sorry for not being clear on what Vincenzo's proposal implies)
> >>> OK, I see. So I need to make the following changes to my patchset AFAIU.
> >>>
> >>> 1. Make sure that we only allow tagged user addresses that originate
> >>> from an anonymous mmap() or below sbrk(0). How exactly should this
> >>> check be performed?
> >> I don't think we should perform such checks. That's rather stating that
> >> the kernel only guarantees that the tagged pointers work if they
> >> originated from these memory ranges.
> > I concur.
> >
> > Really, the kernel should do the expected thing with all "non-weird"
> > memory.
> >
> > In lieu of a proper definition of "non-weird", I think we should have
> > some lists of things that are explicitly included, and also excluded:
> >
> > OK:
> > kernel-allocated process stack
> > brk area
> > MAP_ANONYMOUS | MAP_PRIVATE
> > MAP_PRIVATE mappings of /dev/zero
> >
> > Not OK:
> > MAP_SHARED
> > mmaps of non-memory-like devices
> > mmaps of anything that is not a regular file
> > the VDSO
> > ...
> >
> > In general, userspace can tag memory that it "owns", and we do not assume
> > a transfer of ownership except in the "OK" list above. Otherwise, it's
> > the kernel's memory, or the owner is simply not well defined.
>
> Agreed on the general idea: a process should be able to pass tagged pointers at the
> syscall interface, as long as they point to memory privately owned by the process. I
> think it would be possible to simplify the definition of "non-weird" memory by using
> only this "OK" list:
> - mmap() done by the process itself, where either:
> * flags = MAP_PRIVATE | MAP_ANONYMOUS
> * flags = MAP_PRIVATE and fd refers to a regular file or a well-defined list of
> device files (like /dev/zero)
> - brk() done by the process itself
> - Any memory mapped by the kernel in the new process's address space during execve(),
> with the same restrictions as above ([vdso]/[vvar] are therefore excluded)
>
> > I would also like to see advice for userspace developers, particularly
> > things like (strawman, please challenge!):
>
> To some extent, one could call me a userspace developer, so I'll try to help :)
>
> > * Userspace should set tags at the point of allocation only.
>
> Yes, tags are only supposed to be set at the point of either allocation or
> deallocation/reallocation. However, allocators can in principle be nested, so an
> allocator could take a region allocated by malloc() as input and subdivide it
> (changing tags in the process). That said, this suballocator must not free() that
> region until all the suballocations themselves have been freed (thereby restoring the
> tags initially set by malloc()).
>
> > * If you don't know how an object was allocated, you cannot modify the
> > tag, period.
>
> Agreed, allocators that tag memory can only operate on memory with a well-defined
> provenance (for instance anonymous mmap() or malloc()).
>
> > * A single C object should be accessed using a single, fixed pointer tag
> > throughout its entire lifetime.
>
> Agreed. Allocators themselves may need to be excluded though, depending on how they
> represent their managed memory.
>
> > * Tags can be changed only when there are no outstanding pointers to
> > the affected object or region that may be used to access the object
> > or region (i.e., if the object were allocated from the C heap and
> > is it safe to realloc() it, then it is safe to change the tag; for
> > other types of allocation, analogous arguments can be applied).
>
> Tags can only be changed at the point of deallocation/reallocation. Pointers to the
> object become invalid and cannot be used after it has been deallocated; memory
> tagging allows to catch such invalid usage.
>
> > * When the kernel dereferences a pointer on userspace's behalf, it
> > shall behave equivalently to userspace dereferencing the same pointer,
> > including use of the same tag (where passed by userspace).
> >
> > * Where the pointer tag affects pointer dereference behaviour (i.e.,
> > with hardware memory colouring) the kernel makes no guarantee to
> > honour pointer tags correctly for every location a buffer based on a
> > pointer passed by userspace to the kernel.
> >
> > (This means for example that for a read(fd, buf, size), we can check
> > the tag for a single arbitrary location in *(char (*)[size])buf
> > before passing the buffer to get_user_pages(). Hopefully this could
> > be done in get_user_pages() itself rather than hunting call sites.
> > For userspace, it means that you're on your own if you ask the
> > kernel to operate on a buffer than spans multiple, independently-
> > allocated objects, or a deliberately striped single object.)
>
> I think both points are reasonable. It is very valuable for the kernel to access
> userspace memory using the user-provided tag, because it enables kernel accesses to
> be checked in the same way as user accesses, allowing to detect bugs that are
> potentially hard to find. For instance, if a pointer to an object is passed to the
> kernel after it has been deallocated, this is invalid and should be detected.
> However, you are absolutely right that the kernel cannot *guarantee* that such a
> check is carried out for the entire memory range (or in fact at all); it should be a
> best-effort approach.

It would also be valuable to narrow down the set of "relaxed" (i.e.
not tag-checking) syscalls as reasonably possible. We would want to
provide tag-checking userspace wrappers for any important calls that
are not checked in the kernel. Is it correct to assume that anything
that goes through copy_from_user / copy_to_user is checked?

>
> > * The kernel shall not extend the lifetime of user pointers in ways
> > that are not clear from the specification of the syscall or
> > interface to which the pointer is passed (and in any case shall not
> > extend pointer lifetimes without good reason).
> >
> > So no clever transparent caching between syscalls, unless it _really_
> > is transparent in the presence of tags.
>
> Do you have any particular case in mind? If such caching is really valuable, it is
> always possible to access the object while ignoring the tag. For sure, the
> user-provided tag can only be used during the syscall handling itself, not
> asynchronously later on, unless otherwise specified.

For aio* operations it would be nice if the tag was checked at the
time of the actual userspace read/write, either instead of or in
addition to at the time of the system call.

>
> > * For purposes other than dereference, the kernel shall accept any
> > legitimately tagged pointer (according to the above rules) as
> > identifying the associated memory location.
> >
> > So, mprotect(some_page_aligned_object, ...); is valid irrespective
> > of where page_aligned_object() came from. There is no implicit
> > derefence by the kernel here, hence no tag check.
> >
> > The kernel does not guarantee to work correctly if the wrong tag
> > is used, but there is not always a well-defined "right" tag, so
> > we can't really guarantee to check it. So a pointer derived by
> > any reasonable means by userspace has to be treated as equally
> > valid.
>
> This is a disputed point :) In my opinion, this is the the most reasonable approach.

Yes, it would be nice if the kernel explicitly promised, ex.
mprotect() over a range of differently tagged pages to be allowed
(i.e. address tag should be unchecked).


>
> Cheers,
> Kevin
>
> > We would need to get some cross-arch buy-in for this, otherwise core
> > maintainers might just refuse to maintain the necessary guarantees.
> >
> >
> >>> 2. Allow tagged addressed passed to memory syscalls (as long as (1) is
> >>> satisfied). Do I understand correctly that this means that I need to
> >>> locate all find_vma() callers outside of mm/ and fix them up as well?
> >> Yes (unless anyone as a better idea or objections to this approach).
> > Also, watch out for code that pokes about inside struct vma directly.
> >
> > I'm wondering, could we define an explicit type, say,
> >
> > struct user_vaddr {
> > unsigned long addr;
> > };
> >
> > to replace the unsigned longs in struct vma the mm API? This would
> > turn ad-hoc (unsigned long) casts into build breaks. We could have
> > an explicit conversion functions, say,
> >
> > struct user_vaddr __user_vaddr_unsafe(void __user *);
> > void __user *__user_ptr_unsafe(struct user_vaddr);
> >
> > that we robotically insert in all the relevant places to mark
> > unaudited code.
> >
> > This allows us to keep the kernel buildable, while flagging things
> > that will need review. We would also need to warn the mm folks to
> > reject any new code using these unsafe conversions.
> >
> > Of course, it would be a non-trivial effort...
> >
> >> BTW, I'll be off until the new year, so won't be able to follow up.
> > Cheers
> > ---Dave
> >
> > _______________________________________________
> > linux-arm-kernel mailing list
> > [email protected]
> > http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
>

2019-02-12 18:04:30

by Catalin Marinas

[permalink] [raw]
Subject: Re: [RFC][PATCH 0/3] arm64 relaxed ABI

On Mon, Feb 11, 2019 at 12:32:55PM -0800, Evgenii Stepanov wrote:
> On Mon, Feb 11, 2019 at 9:28 AM Kevin Brodsky <[email protected]> wrote:
> > On 19/12/2018 12:52, Dave Martin wrote:
> > > Really, the kernel should do the expected thing with all "non-weird"
> > > memory.
> > >
> > > In lieu of a proper definition of "non-weird", I think we should have
> > > some lists of things that are explicitly included, and also excluded:
> > >
> > > OK:
> > > kernel-allocated process stack
> > > brk area
> > > MAP_ANONYMOUS | MAP_PRIVATE
> > > MAP_PRIVATE mappings of /dev/zero
> > >
> > > Not OK:
> > > MAP_SHARED
> > > mmaps of non-memory-like devices
> > > mmaps of anything that is not a regular file
> > > the VDSO
> > > ...
> > >
> > > In general, userspace can tag memory that it "owns", and we do not assume
> > > a transfer of ownership except in the "OK" list above. Otherwise, it's
> > > the kernel's memory, or the owner is simply not well defined.
> >
> > Agreed on the general idea: a process should be able to pass tagged pointers at the
> > syscall interface, as long as they point to memory privately owned by the process. I
> > think it would be possible to simplify the definition of "non-weird" memory by using
> > only this "OK" list:
> > - mmap() done by the process itself, where either:
> > * flags = MAP_PRIVATE | MAP_ANONYMOUS
> > * flags = MAP_PRIVATE and fd refers to a regular file or a well-defined list of
> > device files (like /dev/zero)
> > - brk() done by the process itself
> > - Any memory mapped by the kernel in the new process's address space during execve(),
> > with the same restrictions as above ([vdso]/[vvar] are therefore excluded)

Sounds reasonable.

> > > * Userspace should set tags at the point of allocation only.
> >
> > Yes, tags are only supposed to be set at the point of either allocation or
> > deallocation/reallocation. However, allocators can in principle be nested, so an
> > allocator could take a region allocated by malloc() as input and subdivide it
> > (changing tags in the process). That said, this suballocator must not free() that
> > region until all the suballocations themselves have been freed (thereby restoring the
> > tags initially set by malloc()).
> >
> > > * If you don't know how an object was allocated, you cannot modify the
> > > tag, period.
> >
> > Agreed, allocators that tag memory can only operate on memory with a well-defined
> > provenance (for instance anonymous mmap() or malloc()).
> >
> > > * A single C object should be accessed using a single, fixed pointer tag
> > > throughout its entire lifetime.
> >
> > Agreed. Allocators themselves may need to be excluded though, depending on how they
> > represent their managed memory.
> >
> > > * Tags can be changed only when there are no outstanding pointers to
> > > the affected object or region that may be used to access the object
> > > or region (i.e., if the object were allocated from the C heap and
> > > is it safe to realloc() it, then it is safe to change the tag; for
> > > other types of allocation, analogous arguments can be applied).
> >
> > Tags can only be changed at the point of deallocation/reallocation. Pointers to the
> > object become invalid and cannot be used after it has been deallocated; memory
> > tagging allows to catch such invalid usage.

All the above sound well but that's mostly a guideline on what a C
library can do. It doesn't help much with defining the kernel ABI.
Anyway, it's good to clarify the use-cases.

> > > * When the kernel dereferences a pointer on userspace's behalf, it
> > > shall behave equivalently to userspace dereferencing the same pointer,
> > > including use of the same tag (where passed by userspace).
> > >
> > > * Where the pointer tag affects pointer dereference behaviour (i.e.,
> > > with hardware memory colouring) the kernel makes no guarantee to
> > > honour pointer tags correctly for every location a buffer based on a
> > > pointer passed by userspace to the kernel.
> > >
> > > (This means for example that for a read(fd, buf, size), we can check
> > > the tag for a single arbitrary location in *(char (*)[size])buf
> > > before passing the buffer to get_user_pages(). Hopefully this could
> > > be done in get_user_pages() itself rather than hunting call sites.
> > > For userspace, it means that you're on your own if you ask the
> > > kernel to operate on a buffer than spans multiple, independently-
> > > allocated objects, or a deliberately striped single object.)
> >
> > I think both points are reasonable. It is very valuable for the kernel to access
> > userspace memory using the user-provided tag, because it enables kernel accesses to
> > be checked in the same way as user accesses, allowing to detect bugs that are
> > potentially hard to find. For instance, if a pointer to an object is passed to the
> > kernel after it has been deallocated, this is invalid and should be detected.
> > However, you are absolutely right that the kernel cannot *guarantee* that such a
> > check is carried out for the entire memory range (or in fact at all); it should be a
> > best-effort approach.
>
> It would also be valuable to narrow down the set of "relaxed" (i.e.
> not tag-checking) syscalls as reasonably possible. We would want to
> provide tag-checking userspace wrappers for any important calls that
> are not checked in the kernel. Is it correct to assume that anything
> that goes through copy_from_user / copy_to_user is checked?

I lost track of the context of this thread but if it's just about
relaxing the ABI for hwasan, the kernel has no idea of the compiler
generated structures in user space, so nothing is checked.

If we talk about tags in the context of MTE, than yes, with the current
proposal the tag would be checked by copy_*_user() functions.

> > > * The kernel shall not extend the lifetime of user pointers in ways
> > > that are not clear from the specification of the syscall or
> > > interface to which the pointer is passed (and in any case shall not
> > > extend pointer lifetimes without good reason).
> > >
> > > So no clever transparent caching between syscalls, unless it _really_
> > > is transparent in the presence of tags.
> >
> > Do you have any particular case in mind? If such caching is really valuable, it is
> > always possible to access the object while ignoring the tag. For sure, the
> > user-provided tag can only be used during the syscall handling itself, not
> > asynchronously later on, unless otherwise specified.
>
> For aio* operations it would be nice if the tag was checked at the
> time of the actual userspace read/write, either instead of or in
> addition to at the time of the system call.

With aio* (and synchronous iovec-based syscalls), the kernel may access
the memory while the corresponding user process is scheduled out. Given
that such access is not done in the context of the user process (and
using the user VA like copy_*_user), the kernel cannot handle potential
tag faults. Moreover, the transfer may be done by DMA and the device
does not understand tags.

I'd like to keep tags as a property of the pointer in a specific virtual
address space. The moment you convert it to a different address space
(e.g. kernel linear map, physical address), the tag property is stripped
and I don't think we should re-build it (and have it checked).

> > > * For purposes other than dereference, the kernel shall accept any
> > > legitimately tagged pointer (according to the above rules) as
> > > identifying the associated memory location.
> > >
> > > So, mprotect(some_page_aligned_object, ...); is valid irrespective
> > > of where page_aligned_object() came from. There is no implicit
> > > derefence by the kernel here, hence no tag check.
> > >
> > > The kernel does not guarantee to work correctly if the wrong tag
> > > is used, but there is not always a well-defined "right" tag, so
> > > we can't really guarantee to check it. So a pointer derived by
> > > any reasonable means by userspace has to be treated as equally
> > > valid.
> >
> > This is a disputed point :) In my opinion, this is the the most reasonable approach.
>
> Yes, it would be nice if the kernel explicitly promised, ex.
> mprotect() over a range of differently tagged pages to be allowed
> (i.e. address tag should be unchecked).

I don't think mprotect() over differently tagged pages was ever a
problem. I originally asked that mprotect() and friends do not accept
tagged pointers since these functions deal with memory ranges rather
than dereferencing such pointer (the reason being minimal kernel
changes). However, given how complicated it is to specify an ABI, I came
to the conclusion that a pointer passed to such function should be
allowed to have non-zero top byte. It would be the kernel's
responsibility to strip it out as appropriate.

--
Catalin

2019-02-13 15:33:31

by Dave Martin

[permalink] [raw]
Subject: Re: [RFC][PATCH 0/3] arm64 relaxed ABI

On Tue, Feb 12, 2019 at 06:02:24PM +0000, Catalin Marinas wrote:
> On Mon, Feb 11, 2019 at 12:32:55PM -0800, Evgenii Stepanov wrote:
> > On Mon, Feb 11, 2019 at 9:28 AM Kevin Brodsky <[email protected]> wrote:
> > > On 19/12/2018 12:52, Dave Martin wrote:

[...]

> > > > * A single C object should be accessed using a single, fixed
> > > > pointer tag throughout its entire lifetime.
> > >
> > > Agreed. Allocators themselves may need to be excluded though,
> > > depending on how they represent their managed memory.
> > >
> > > > * Tags can be changed only when there are no outstanding pointers to
> > > > the affected object or region that may be used to access the object
> > > > or region (i.e., if the object were allocated from the C heap and
> > > > is it safe to realloc() it, then it is safe to change the tag; for
> > > > other types of allocation, analogous arguments can be applied).
> > >
> > > Tags can only be changed at the point of deallocation/
> > > reallocation. Pointers to the object become invalid and cannot
> > > be used after it has been deallocated; memory tagging allows to
> > > catch such invalid usage.
>
> All the above sound well but that's mostly a guideline on what a C
> library can do. It doesn't help much with defining the kernel ABI.
> Anyway, it's good to clarify the use-cases.

My aim was to clarify the use case in userspace, since I wasn't directly
involved in that. The kernel ABI needs to be compatible with the the
use case, but doesn't need to specify must of it.

I'm wondering whether we can piggy-back on existing concepts.

We could say that recolouring memory is safe when and only when
unmapping of the page or removing permissions on the page (via
munmap/mremap/mprotect) would be safe. Otherwise, the resulting
behaviour of the process is undefined.

Hopefully there are friendly fuzzers testing this kind of thing.

[...]

> > It would also be valuable to narrow down the set of "relaxed" (i.e.
> > not tag-checking) syscalls as reasonably possible. We would want to
> > provide tag-checking userspace wrappers for any important calls that
> > are not checked in the kernel. Is it correct to assume that anything
> > that goes through copy_from_user / copy_to_user is checked?
>
> I lost track of the context of this thread but if it's just about
> relaxing the ABI for hwasan, the kernel has no idea of the compiler
> generated structures in user space, so nothing is checked.
>
> If we talk about tags in the context of MTE, than yes, with the current
> proposal the tag would be checked by copy_*_user() functions.

Also put_user() and friends?

It might be reasonable to do the check in access_ok() and skip it in
__put_user() etc.

(I seem to remember some separate discussion about abolishing
__put_user() and friends though, due to the accident risk they pose.)

> > For aio* operations it would be nice if the tag was checked at the
> > time of the actual userspace read/write, either instead of or in
> > addition to at the time of the system call.
>
> With aio* (and synchronous iovec-based syscalls), the kernel may access
> the memory while the corresponding user process is scheduled out. Given
> that such access is not done in the context of the user process (and
> using the user VA like copy_*_user), the kernel cannot handle potential
> tag faults. Moreover, the transfer may be done by DMA and the device
> does not understand tags.
>
> I'd like to keep tags as a property of the pointer in a specific virtual
> address space. The moment you convert it to a different address space
> (e.g. kernel linear map, physical address), the tag property is stripped
> and I don't think we should re-build it (and have it checked).

This is probably reasonable.

Ideally we would check the tag at the point of stripping it off, but
most likely it's going to be rather best-effort.

If memory tagging is essential a debugging feature then this seems
an acceptable compromise.

> > > > * For purposes other than dereference, the kernel shall accept any
> > > > legitimately tagged pointer (according to the above rules) as
> > > > identifying the associated memory location.
> > > >
> > > > So, mprotect(some_page_aligned_object, ...); is valid irrespective
> > > > of where page_aligned_object() came from. There is no implicit
> > > > derefence by the kernel here, hence no tag check.
> > > >
> > > > The kernel does not guarantee to work correctly if the wrong tag
> > > > is used, but there is not always a well-defined "right" tag, so
> > > > we can't really guarantee to check it. So a pointer derived by
> > > > any reasonable means by userspace has to be treated as equally
> > > > valid.
> > >
> > > This is a disputed point :) In my opinion, this is the the most
> > > reasonable approach.
> >
> > Yes, it would be nice if the kernel explicitly promised, ex.
> > mprotect() over a range of differently tagged pages to be allowed
> > (i.e. address tag should be unchecked).
>
> I don't think mprotect() over differently tagged pages was ever a
> problem. I originally asked that mprotect() and friends do not accept
> tagged pointers since these functions deal with memory ranges rather
> than dereferencing such pointer (the reason being minimal kernel
> changes). However, given how complicated it is to specify an ABI, I came
> to the conclusion that a pointer passed to such function should be
> allowed to have non-zero top byte. It would be the kernel's
> responsibility to strip it out as appropriate.

I think that if the page range is all the same colour then it should be
legitimate to pass a matching tag.

But it doesn't seem reasonable for the kernel to require this. If
free() calls munmap(), the page(s) will contain possibly randomly-
coloured garbage. There's no correct tag to pass in such a case.

The most obvious solution is just to ignore the tags passed by userspace
to such syscalls. This would imply that the kernel must explicitly
strip it out, as you suggest.

The number of affected syscalls is relatively small though.

Cheers
---Dave

2019-02-13 16:43:35

by Kevin Brodsky

[permalink] [raw]
Subject: Re: [RFC][PATCH 0/3] arm64 relaxed ABI

(+Cc other people with MTE experience: Branislav, Ruben)

On 13/02/2019 14:58, Dave Martin wrote:
> On Tue, Feb 12, 2019 at 06:02:24PM +0000, Catalin Marinas wrote:
>> On Mon, Feb 11, 2019 at 12:32:55PM -0800, Evgenii Stepanov wrote:
>>> On Mon, Feb 11, 2019 at 9:28 AM Kevin Brodsky <[email protected]> wrote:
>>>> On 19/12/2018 12:52, Dave Martin wrote:
> [...]
>
>>>>> * A single C object should be accessed using a single, fixed
>>>>> pointer tag throughout its entire lifetime.
>>>> Agreed. Allocators themselves may need to be excluded though,
>>>> depending on how they represent their managed memory.
>>>>
>>>>> * Tags can be changed only when there are no outstanding pointers to
>>>>> the affected object or region that may be used to access the object
>>>>> or region (i.e., if the object were allocated from the C heap and
>>>>> is it safe to realloc() it, then it is safe to change the tag; for
>>>>> other types of allocation, analogous arguments can be applied).
>>>> Tags can only be changed at the point of deallocation/
>>>> reallocation. Pointers to the object become invalid and cannot
>>>> be used after it has been deallocated; memory tagging allows to
>>>> catch such invalid usage.
>> All the above sound well but that's mostly a guideline on what a C
>> library can do. It doesn't help much with defining the kernel ABI.
>> Anyway, it's good to clarify the use-cases.
> My aim was to clarify the use case in userspace, since I wasn't directly
> involved in that. The kernel ABI needs to be compatible with the the
> use case, but doesn't need to specify must of it.
>
> I'm wondering whether we can piggy-back on existing concepts.
>
> We could say that recolouring memory is safe when and only when
> unmapping of the page or removing permissions on the page (via
> munmap/mremap/mprotect) would be safe. Otherwise, the resulting
> behaviour of the process is undefined.

Is that a sufficient requirement? I don't think that anything prevents you from using
mprotect() on say [vvar], but we don't necessarily want to map [vvar] as tagged. I'm
not sure it's easy to define what "safe" would mean here.

> Hopefully there are friendly fuzzers testing this kind of thing.
>
> [...]
>
>>> It would also be valuable to narrow down the set of "relaxed" (i.e.
>>> not tag-checking) syscalls as reasonably possible. We would want to
>>> provide tag-checking userspace wrappers for any important calls that
>>> are not checked in the kernel. Is it correct to assume that anything
>>> that goes through copy_from_user / copy_to_user is checked?
>> I lost track of the context of this thread but if it's just about
>> relaxing the ABI for hwasan, the kernel has no idea of the compiler
>> generated structures in user space, so nothing is checked.
>>
>> If we talk about tags in the context of MTE, than yes, with the current
>> proposal the tag would be checked by copy_*_user() functions.
> Also put_user() and friends?
>
> It might be reasonable to do the check in access_ok() and skip it in
> __put_user() etc.
>
> (I seem to remember some separate discussion about abolishing
> __put_user() and friends though, due to the accident risk they pose.)

Keep in mind that with MTE, there is no need to do any explicit check when accessing
user memory via a user-provided pointer. The tagged user pointer is directly passed
to copy_*_user() or put_user(). If the load/store causes a tag fault, then it is
handled just like a page fault (i.e. invoking the fixup handler). As far as I can
tell, there's no need to do anything special in access_ok() in that case.

[The above applies to precise mode. In imprecise mode, some more work will be needed
after the load/store to check whether a tag fault happened.]

>
>>> For aio* operations it would be nice if the tag was checked at the
>>> time of the actual userspace read/write, either instead of or in
>>> addition to at the time of the system call.
>> With aio* (and synchronous iovec-based syscalls), the kernel may access
>> the memory while the corresponding user process is scheduled out. Given
>> that such access is not done in the context of the user process (and
>> using the user VA like copy_*_user), the kernel cannot handle potential
>> tag faults. Moreover, the transfer may be done by DMA and the device
>> does not understand tags.
>>
>> I'd like to keep tags as a property of the pointer in a specific virtual
>> address space. The moment you convert it to a different address space
>> (e.g. kernel linear map, physical address), the tag property is stripped
>> and I don't think we should re-build it (and have it checked).
> This is probably reasonable.
>
> Ideally we would check the tag at the point of stripping it off, but
> most likely it's going to be rather best-effort.
>
> If memory tagging is essential a debugging feature then this seems
> an acceptable compromise.

There are many possible ways to deploy MTE, and debugging is just one of them. For
instance, you may want to turn on heap colouring for some processes in the system,
including in production.

Regarding those cases where it is impossible to check tags at the point of accessing
user memory, it is indeed possible to check the memory tags at the point of stripping
the tag from the user pointer. Given that some MTE use-cases favour performance over
tag check coverage, the ideal approach would be to make these checks configurable
(e.g. check one granule, check all of them, or check none). I don't know how feasible
this is in practice.

Kevin

>
>>>>> * For purposes other than dereference, the kernel shall accept any
>>>>> legitimately tagged pointer (according to the above rules) as
>>>>> identifying the associated memory location.
>>>>>
>>>>> So, mprotect(some_page_aligned_object, ...); is valid irrespective
>>>>> of where page_aligned_object() came from. There is no implicit
>>>>> derefence by the kernel here, hence no tag check.
>>>>>
>>>>> The kernel does not guarantee to work correctly if the wrong tag
>>>>> is used, but there is not always a well-defined "right" tag, so
>>>>> we can't really guarantee to check it. So a pointer derived by
>>>>> any reasonable means by userspace has to be treated as equally
>>>>> valid.
>>>> This is a disputed point :) In my opinion, this is the the most
>>>> reasonable approach.
>>> Yes, it would be nice if the kernel explicitly promised, ex.
>>> mprotect() over a range of differently tagged pages to be allowed
>>> (i.e. address tag should be unchecked).
>> I don't think mprotect() over differently tagged pages was ever a
>> problem. I originally asked that mprotect() and friends do not accept
>> tagged pointers since these functions deal with memory ranges rather
>> than dereferencing such pointer (the reason being minimal kernel
>> changes). However, given how complicated it is to specify an ABI, I came
>> to the conclusion that a pointer passed to such function should be
>> allowed to have non-zero top byte. It would be the kernel's
>> responsibility to strip it out as appropriate.
> I think that if the page range is all the same colour then it should be
> legitimate to pass a matching tag.
>
> But it doesn't seem reasonable for the kernel to require this. If
> free() calls munmap(), the page(s) will contain possibly randomly-
> coloured garbage. There's no correct tag to pass in such a case.
>
> The most obvious solution is just to ignore the tags passed by userspace
> to such syscalls. This would imply that the kernel must explicitly
> strip it out, as you suggest.
>
> The number of affected syscalls is relatively small though.
>
> Cheers
> ---Dave


2019-02-13 23:08:13

by Dave Martin

[permalink] [raw]
Subject: Re: [RFC][PATCH 0/3] arm64 relaxed ABI

On Wed, Feb 13, 2019 at 04:42:11PM +0000, Kevin Brodsky wrote:
> (+Cc other people with MTE experience: Branislav, Ruben)

[...]

> >I'm wondering whether we can piggy-back on existing concepts.
> >
> >We could say that recolouring memory is safe when and only when
> >unmapping of the page or removing permissions on the page (via
> >munmap/mremap/mprotect) would be safe. Otherwise, the resulting
> >behaviour of the process is undefined.
>
> Is that a sufficient requirement? I don't think that anything prevents you
> from using mprotect() on say [vvar], but we don't necessarily want to map
> [vvar] as tagged. I'm not sure it's easy to define what "safe" would mean
> here.

I think the origin rules have to apply too: [vvar] is not a regular,
private page but a weird, shared thing mapped for you by the kernel.

Presumably userspace _cannot_ do mprotect(PROT_WRITE) on it.

I'm also assuming that userspace cannot recolour memory in read-only
pages. That sounds bad if there's no way to prevent it.

[...]

> >It might be reasonable to do the check in access_ok() and skip it in
> >__put_user() etc.
> >
> >(I seem to remember some separate discussion about abolishing
> >__put_user() and friends though, due to the accident risk they pose.)
>
> Keep in mind that with MTE, there is no need to do any explicit check when
> accessing user memory via a user-provided pointer. The tagged user pointer
> is directly passed to copy_*_user() or put_user(). If the load/store causes
> a tag fault, then it is handled just like a page fault (i.e. invoking the
> fixup handler). As far as I can tell, there's no need to do anything special
> in access_ok() in that case.
>
> [The above applies to precise mode. In imprecise mode, some more work will
> be needed after the load/store to check whether a tag fault happened.]

Fair enough, I'm a bit hazy on the details as of right now..

[...]

> There are many possible ways to deploy MTE, and debugging is just one of
> them. For instance, you may want to turn on heap colouring for some
> processes in the system, including in production.

To implement enforceable protection, or as a diagnostic tool for when
something goes wrong?

In the latter case it's still OK for the kernel's tag checking not to be
exhaustive.

> Regarding those cases where it is impossible to check tags at the point of
> accessing user memory, it is indeed possible to check the memory tags at the
> point of stripping the tag from the user pointer. Given that some MTE
> use-cases favour performance over tag check coverage, the ideal approach
> would be to make these checks configurable (e.g. check one granule, check
> all of them, or check none). I don't know how feasible this is in practice.

Check all granules of a massive DMA buffer?

That doesn't sounds feasible without explicit support in the hardware to
have the DMA check tags itself as the memory is accessed. MTE by itself
doesn't provide for this IIUC (at least, it would require support in the
platform, not just the CPU).

We do not want to bake any assumptions into the ABI about whether a
given data transfer may or may not be offloaded to DMA. That feels
like a slippery slope.

Providing we get the checks for free in put_user/get_user/
copy_{to,from}_user(), those will cover a lot of cases though, for
non-bulk-IO cases.


My assumption has been that at this point in time we are mainly aiming
to support the debug/diagnostic use cases today.

At least, those are the low(ish)-hanging fruit.

Others are better placed than me to comment on the goals here.

Cheers
---Dave

2019-02-14 09:00:08

by Evgenii Stepanov

[permalink] [raw]
Subject: Re: [RFC][PATCH 0/3] arm64 relaxed ABI

On Wed, Feb 13, 2019 at 9:43 AM Dave Martin <[email protected]> wrote:
>
> On Wed, Feb 13, 2019 at 04:42:11PM +0000, Kevin Brodsky wrote:
> > (+Cc other people with MTE experience: Branislav, Ruben)
>
> [...]
>
> > >I'm wondering whether we can piggy-back on existing concepts.
> > >
> > >We could say that recolouring memory is safe when and only when
> > >unmapping of the page or removing permissions on the page (via
> > >munmap/mremap/mprotect) would be safe. Otherwise, the resulting
> > >behaviour of the process is undefined.
> >
> > Is that a sufficient requirement? I don't think that anything prevents you
> > from using mprotect() on say [vvar], but we don't necessarily want to map
> > [vvar] as tagged. I'm not sure it's easy to define what "safe" would mean
> > here.
>
> I think the origin rules have to apply too: [vvar] is not a regular,
> private page but a weird, shared thing mapped for you by the kernel.
>
> Presumably userspace _cannot_ do mprotect(PROT_WRITE) on it.
>
> I'm also assuming that userspace cannot recolour memory in read-only
> pages. That sounds bad if there's no way to prevent it.

That sounds like something we would like to do to catch out of bounds
read of .rodata globals.
Another potentially interesting use case for MTE is infinite hardware
watchpoints - that would require trapping reads for individual tagging
granules, include those in read-only binary segment.

>
> [...]
>
> > >It might be reasonable to do the check in access_ok() and skip it in
> > >__put_user() etc.
> > >
> > >(I seem to remember some separate discussion about abolishing
> > >__put_user() and friends though, due to the accident risk they pose.)
> >
> > Keep in mind that with MTE, there is no need to do any explicit check when
> > accessing user memory via a user-provided pointer. The tagged user pointer
> > is directly passed to copy_*_user() or put_user(). If the load/store causes
> > a tag fault, then it is handled just like a page fault (i.e. invoking the
> > fixup handler). As far as I can tell, there's no need to do anything special
> > in access_ok() in that case.
> >
> > [The above applies to precise mode. In imprecise mode, some more work will
> > be needed after the load/store to check whether a tag fault happened.]
>
> Fair enough, I'm a bit hazy on the details as of right now..
>
> [...]
>
> > There are many possible ways to deploy MTE, and debugging is just one of
> > them. For instance, you may want to turn on heap colouring for some
> > processes in the system, including in production.
>
> To implement enforceable protection, or as a diagnostic tool for when
> something goes wrong?
>
> In the latter case it's still OK for the kernel's tag checking not to be
> exhaustive.
>
> > Regarding those cases where it is impossible to check tags at the point of
> > accessing user memory, it is indeed possible to check the memory tags at the
> > point of stripping the tag from the user pointer. Given that some MTE
> > use-cases favour performance over tag check coverage, the ideal approach
> > would be to make these checks configurable (e.g. check one granule, check
> > all of them, or check none). I don't know how feasible this is in practice.
>
> Check all granules of a massive DMA buffer?
>
> That doesn't sounds feasible without explicit support in the hardware to
> have the DMA check tags itself as the memory is accessed. MTE by itself
> doesn't provide for this IIUC (at least, it would require support in the
> platform, not just the CPU).
>
> We do not want to bake any assumptions into the ABI about whether a
> given data transfer may or may not be offloaded to DMA. That feels
> like a slippery slope.
>
> Providing we get the checks for free in put_user/get_user/
> copy_{to,from}_user(), those will cover a lot of cases though, for
> non-bulk-IO cases.
>
>
> My assumption has been that at this point in time we are mainly aiming
> to support the debug/diagnostic use cases today.
>
> At least, those are the low(ish)-hanging fruit.
>
> Others are better placed than me to comment on the goals here.
>
> Cheers
> ---Dave

2019-02-14 18:34:18

by Kevin Brodsky

[permalink] [raw]
Subject: Re: [RFC][PATCH 0/3] arm64 relaxed ABI

On 13/02/2019 21:41, Evgenii Stepanov wrote:
> On Wed, Feb 13, 2019 at 9:43 AM Dave Martin <[email protected]> wrote:
>> On Wed, Feb 13, 2019 at 04:42:11PM +0000, Kevin Brodsky wrote:
>>> (+Cc other people with MTE experience: Branislav, Ruben)
>> [...]
>>
>>>> I'm wondering whether we can piggy-back on existing concepts.
>>>>
>>>> We could say that recolouring memory is safe when and only when
>>>> unmapping of the page or removing permissions on the page (via
>>>> munmap/mremap/mprotect) would be safe. Otherwise, the resulting
>>>> behaviour of the process is undefined.
>>> Is that a sufficient requirement? I don't think that anything prevents you
>>> from using mprotect() on say [vvar], but we don't necessarily want to map
>>> [vvar] as tagged. I'm not sure it's easy to define what "safe" would mean
>>> here.
>> I think the origin rules have to apply too: [vvar] is not a regular,
>> private page but a weird, shared thing mapped for you by the kernel.
>>
>> Presumably userspace _cannot_ do mprotect(PROT_WRITE) on it.
>>
>> I'm also assuming that userspace cannot recolour memory in read-only
>> pages. That sounds bad if there's no way to prevent it.
> That sounds like something we would like to do to catch out of bounds
> read of .rodata globals.
> Another potentially interesting use case for MTE is infinite hardware
> watchpoints - that would require trapping reads for individual tagging
> granules, include those in read-only binary segment.

I think we should keep this discussion for a later, separate thread. Vincenzo's
proposal is about allowing userspace to pass tags at the syscall interface. The set
of mappings allowed to be tagged by userspace (in MTE) should be contained in the set
of mappings that userspace can pass tagged pointers to (at the syscall interface),
but they are not necessarily the same. Private read-only mappings are an edge case
(you can pass tagged pointers to them, the memory may or may not be mapped as tagged,
but in any case it is not possible to change the memory tags via such mapping).

>
>> [...]
>>
>>>> It might be reasonable to do the check in access_ok() and skip it in
>>>> __put_user() etc.
>>>>
>>>> (I seem to remember some separate discussion about abolishing
>>>> __put_user() and friends though, due to the accident risk they pose.)
>>> Keep in mind that with MTE, there is no need to do any explicit check when
>>> accessing user memory via a user-provided pointer. The tagged user pointer
>>> is directly passed to copy_*_user() or put_user(). If the load/store causes
>>> a tag fault, then it is handled just like a page fault (i.e. invoking the
>>> fixup handler). As far as I can tell, there's no need to do anything special
>>> in access_ok() in that case.
>>>
>>> [The above applies to precise mode. In imprecise mode, some more work will
>>> be needed after the load/store to check whether a tag fault happened.]
>> Fair enough, I'm a bit hazy on the details as of right now..
>>
>> [...]
>>
>>> There are many possible ways to deploy MTE, and debugging is just one of
>>> them. For instance, you may want to turn on heap colouring for some
>>> processes in the system, including in production.
>> To implement enforceable protection, or as a diagnostic tool for when
>> something goes wrong?
>>
>> In the latter case it's still OK for the kernel's tag checking not to be
>> exhaustive.
>>
>>> Regarding those cases where it is impossible to check tags at the point of
>>> accessing user memory, it is indeed possible to check the memory tags at the
>>> point of stripping the tag from the user pointer. Given that some MTE
>>> use-cases favour performance over tag check coverage, the ideal approach
>>> would be to make these checks configurable (e.g. check one granule, check
>>> all of them, or check none). I don't know how feasible this is in practice.
>> Check all granules of a massive DMA buffer?
>>
>> That doesn't sounds feasible without explicit support in the hardware to
>> have the DMA check tags itself as the memory is accessed. MTE by itself
>> doesn't provide for this IIUC (at least, it would require support in the
>> platform, not just the CPU).
>>
>> We do not want to bake any assumptions into the ABI about whether a
>> given data transfer may or may not be offloaded to DMA. That feels
>> like a slippery slope.
>>
>> Providing we get the checks for free in put_user/get_user/
>> copy_{to,from}_user(), those will cover a lot of cases though, for
>> non-bulk-IO cases.
>>
>>
>> My assumption has been that at this point in time we are mainly aiming
>> to support the debug/diagnostic use cases today.

MTE can be used both for diagnostics (imprecise mode is especially suitable for
that), and to halt execution when something wrong is detected. Even in the latter
case, one cannot expect exhaustive checking from MTE, because the way it works is
fundamentally statistical; an invalid pointer may by chance have the right tag to
access the given location. So again, I think that a best-effort approach is
appropriate when the kernel accesses user memory, in terms of checking that tags match.

More specifically, different use-cases come with different tradeoffs (performance /
tag check coverage). That's why I am suggesting that in the cases where tag checks
would need to be done _explicitly_ (before losing the user-provided tag), it would be
nice to be able to choose how much should be checked. I am not suggesting that always
checking all the granules by default is sane. Maybe checking just the first granule
is the right default.

I don't think we need to get to the bottom of this specific aspect at this point.
This ABI proposal is not about memory tagging, so there is no need to specify how or
when tag checking is done. As long as this ABI allows tagged pointers, pointing to
mappings that could be potentially tagged, to be passed to syscalls, I don't think
further relaxations are needed to enable memory tagging.

Kevin

>>
>> At least, those are the low(ish)-hanging fruit.
>>
>> Others are better placed than me to comment on the goals here.
>>
>> Cheers
>> ---Dave


2019-02-19 18:40:26

by Szabolcs Nagy

[permalink] [raw]
Subject: Re: [RFC][PATCH 0/3] arm64 relaxed ABI

On 12/02/2019 18:02, Catalin Marinas wrote:
> On Mon, Feb 11, 2019 at 12:32:55PM -0800, Evgenii Stepanov wrote:
>> On Mon, Feb 11, 2019 at 9:28 AM Kevin Brodsky <[email protected]> wrote:
>>> On 19/12/2018 12:52, Dave Martin wrote:
>>>> Really, the kernel should do the expected thing with all "non-weird"
>>>> memory.
>>>>
>>>> In lieu of a proper definition of "non-weird", I think we should have
>>>> some lists of things that are explicitly included, and also excluded:
>>>>
>>>> OK:
>>>> kernel-allocated process stack
>>>> brk area
>>>> MAP_ANONYMOUS | MAP_PRIVATE
>>>> MAP_PRIVATE mappings of /dev/zero
>>>>
>>>> Not OK:
>>>> MAP_SHARED
>>>> mmaps of non-memory-like devices
>>>> mmaps of anything that is not a regular file
>>>> the VDSO
>>>> ...
>>>>
>>>> In general, userspace can tag memory that it "owns", and we do not assume
>>>> a transfer of ownership except in the "OK" list above. Otherwise, it's
>>>> the kernel's memory, or the owner is simply not well defined.
>>>
>>> Agreed on the general idea: a process should be able to pass tagged pointers at the
>>> syscall interface, as long as they point to memory privately owned by the process. I
>>> think it would be possible to simplify the definition of "non-weird" memory by using
>>> only this "OK" list:
>>> - mmap() done by the process itself, where either:
>>> * flags = MAP_PRIVATE | MAP_ANONYMOUS
>>> * flags = MAP_PRIVATE and fd refers to a regular file or a well-defined list of
>>> device files (like /dev/zero)
>>> - brk() done by the process itself
>>> - Any memory mapped by the kernel in the new process's address space during execve(),
>>> with the same restrictions as above ([vdso]/[vvar] are therefore excluded)
>
> Sounds reasonable.

OK. this non-weird memory definition works for me too.

rule 1: if weird memory pointers are passed to the kernel
with top byte set then the behaviour is undefined.

>>>> * When the kernel dereferences a pointer on userspace's behalf, it
>>>> shall behave equivalently to userspace dereferencing the same pointer,
>>>> including use of the same tag (where passed by userspace).
>>>>
>>>> * Where the pointer tag affects pointer dereference behaviour (i.e.,
>>>> with hardware memory colouring) the kernel makes no guarantee to
>>>> honour pointer tags correctly for every location a buffer based on a
>>>> pointer passed by userspace to the kernel.
>>>>
>>>> (This means for example that for a read(fd, buf, size), we can check
>>>> the tag for a single arbitrary location in *(char (*)[size])buf
>>>> before passing the buffer to get_user_pages(). Hopefully this could
>>>> be done in get_user_pages() itself rather than hunting call sites.
>>>> For userspace, it means that you're on your own if you ask the
>>>> kernel to operate on a buffer than spans multiple, independently-
>>>> allocated objects, or a deliberately striped single object.)
>>>
>>> I think both points are reasonable. It is very valuable for the kernel to access
>>> userspace memory using the user-provided tag, because it enables kernel accesses to
>>> be checked in the same way as user accesses, allowing to detect bugs that are
>>> potentially hard to find. For instance, if a pointer to an object is passed to the
>>> kernel after it has been deallocated, this is invalid and should be detected.
>>> However, you are absolutely right that the kernel cannot *guarantee* that such a
>>> check is carried out for the entire memory range (or in fact at all); it should be a
>>> best-effort approach.
>>
>> It would also be valuable to narrow down the set of "relaxed" (i.e.
>> not tag-checking) syscalls as reasonably possible. We would want to
>> provide tag-checking userspace wrappers for any important calls that
>> are not checked in the kernel. Is it correct to assume that anything
>> that goes through copy_from_user / copy_to_user is checked?
>
> I lost track of the context of this thread but if it's just about
> relaxing the ABI for hwasan, the kernel has no idea of the compiler
> generated structures in user space, so nothing is checked.
>
> If we talk about tags in the context of MTE, than yes, with the current
> proposal the tag would be checked by copy_*_user() functions.

rule 2: kernel derefs as if user derefs when non-weird memory
pointers are passed to the kernel.

note that the important bit is what happens on valid pointer
derefs, invalid pointer deref is usually undefined for user
programs, so what happens in case of mte tag failures is
more of a QoI issue than abi i think.

(e.g. EFAULT is not guaranteed by the kernel currently, i can
successfully do write(open("/dev/null",O_WRONLY), 0, 1), or
get a crash when passing invalid pointer to a vdso function,
so userspace should not rely on some strict EFAULT behaviour).

>>>> * The kernel shall not extend the lifetime of user pointers in ways
>>>> that are not clear from the specification of the syscall or
>>>> interface to which the pointer is passed (and in any case shall not
>>>> extend pointer lifetimes without good reason).
>>>>
>>>> So no clever transparent caching between syscalls, unless it _really_
>>>> is transparent in the presence of tags.
>>>
>>> Do you have any particular case in mind? If such caching is really valuable, it is
>>> always possible to access the object while ignoring the tag. For sure, the
>>> user-provided tag can only be used during the syscall handling itself, not
>>> asynchronously later on, unless otherwise specified.
>>
>> For aio* operations it would be nice if the tag was checked at the
>> time of the actual userspace read/write, either instead of or in
>> addition to at the time of the system call.
>
> With aio* (and synchronous iovec-based syscalls), the kernel may access
> the memory while the corresponding user process is scheduled out. Given
> that such access is not done in the context of the user process (and
> using the user VA like copy_*_user), the kernel cannot handle potential
> tag faults. Moreover, the transfer may be done by DMA and the device
> does not understand tags.
>
> I'd like to keep tags as a property of the pointer in a specific virtual
> address space. The moment you convert it to a different address space
> (e.g. kernel linear map, physical address), the tag property is stripped
> and I don't think we should re-build it (and have it checked).

OK.

i don't think the new abi needs special rules about
pointer lifetime.

>>>> * For purposes other than dereference, the kernel shall accept any
>>>> legitimately tagged pointer (according to the above rules) as
>>>> identifying the associated memory location.
>>>>
>>>> So, mprotect(some_page_aligned_object, ...); is valid irrespective
>>>> of where page_aligned_object() came from. There is no implicit
>>>> derefence by the kernel here, hence no tag check.
>>>>
>>>> The kernel does not guarantee to work correctly if the wrong tag
>>>> is used, but there is not always a well-defined "right" tag, so
>>>> we can't really guarantee to check it. So a pointer derived by
>>>> any reasonable means by userspace has to be treated as equally
>>>> valid.
>>>
>>> This is a disputed point :) In my opinion, this is the the most reasonable approach.
>>
>> Yes, it would be nice if the kernel explicitly promised, ex.
>> mprotect() over a range of differently tagged pages to be allowed
>> (i.e. address tag should be unchecked).
>
> I don't think mprotect() over differently tagged pages was ever a
> problem. I originally asked that mprotect() and friends do not accept
> tagged pointers since these functions deal with memory ranges rather
> than dereferencing such pointer (the reason being minimal kernel
> changes). However, given how complicated it is to specify an ABI, I came
> to the conclusion that a pointer passed to such function should be
> allowed to have non-zero top byte. It would be the kernel's
> responsibility to strip it out as appropriate.

OK.

rule 3: kernel accepts legitimately tagged non-weird memory
pointers and untags them before usage other than deref.

this is relevant if a syscall uses pointers for address range
specification, instead of deref. (mprotect, madvise,...)

i also propose:

rule 4: kernel keeps legitimate tags on non-weird memory
pointers that it returns to the user.

e.g. clone passes stack/arg/tls pointers on without dropping
tags, same for set/get_robust_list. i'm not sure if there
are pointer values observable in /proc etc but those should
keep tags too.

"legitimately tagged" may not always be obvious, but the
illegitimately tagged case can be left unspecified i think,
so dropping tags is ok, but not required if tbi is off and
mte is not used (i.e. tag is illegitimate).

i think these rules work for the cases i care about, a more
tricky question is when/how to check for the new syscall abi
and when/how the TCR_EL1.TBI0 setting may be turned off.
consider the following cases (tb == top byte):

binary 1: user tb = any, syscall tb = 0
tbi is on, "legacy binary"

binary 2: user tb = any, syscall tb = any
tbi is on, "new binary using tb"
for backward compat it needs to check for new syscall abi.

binary 3: user tb = 0, syscall tb = 0
tbi can be off, "new binary",
binary is marked to indicate unused tb,
kernel may turn tbi off: additional pac bits.

binary 4: user tb = mte, syscall tb = mte
like binary 3, but with mte, "new binary using mte"
does it have to check for new syscall abi?
or MTE HWCAP would imply it?
(is it possible to use mte without new syscall abi?)

in userspace we want most binaries to be like binary 3 and 4
eventually, i.e. marked as not-relying-on-tbi, if a dso is
loaded that is unmarked (legacy or new tb user), then either
the load fails (e.g. if mte is already used? or can we turn
mte off at runtime?) or tbi has to be enabled (prctl? does
this work with pac? or multi-threads?).

as for checking the new syscall abi: i don't see much semantic
difference between AT_HWCAP and AT_FLAGS (either way, the user
has to check a feature flag before using the feature of the
underlying system and it does not matter much if it's a syscall
abi feature or cpu feature), but i don't see anything wrong
with AT_FLAGS if the kernel prefers that.

the discussion here was mostly about binary 2, but for
me the open question is if we can make binary 3/4 work.
(which requires some elf binary marking, that is recognised
by the kernel and dynamic loader, and efficient handling of
the TBI0 bit, ..if it's not possible, then i don't see how
mte will be deployed).

and i guess on the kernel side the open question is if the
rules 1/2/3/4 can be made to work in corner cases e.g. when
pointers embedded into structs are passed down in ioctl.

2019-02-25 16:58:06

by Catalin Marinas

[permalink] [raw]
Subject: Re: [RFC][PATCH 0/3] arm64 relaxed ABI

Hi Szabolcs,

Thanks for looking into this. Comments below.

On Tue, Feb 19, 2019 at 06:38:31PM +0000, Szabolcs Nagy wrote:
> i think these rules work for the cases i care about, a more
> tricky question is when/how to check for the new syscall abi
> and when/how the TCR_EL1.TBI0 setting may be turned off.

I don't think turning TBI0 off is critical (it's handy for PAC with
52-bit VA but then it's short-lived if you want more security features
like MTE).

> consider the following cases (tb == top byte):
>
> binary 1: user tb = any, syscall tb = 0
> tbi is on, "legacy binary"
>
> binary 2: user tb = any, syscall tb = any
> tbi is on, "new binary using tb"
> for backward compat it needs to check for new syscall abi.
>
> binary 3: user tb = 0, syscall tb = 0
> tbi can be off, "new binary",
> binary is marked to indicate unused tb,
> kernel may turn tbi off: additional pac bits.
>
> binary 4: user tb = mte, syscall tb = mte
> like binary 3, but with mte, "new binary using mte"
> does it have to check for new syscall abi?
> or MTE HWCAP would imply it?
> (is it possible to use mte without new syscall abi?)

I think MTE HWCAP should imply it.

> in userspace we want most binaries to be like binary 3 and 4
> eventually, i.e. marked as not-relying-on-tbi, if a dso is
> loaded that is unmarked (legacy or new tb user), then either
> the load fails (e.g. if mte is already used? or can we turn
> mte off at runtime?) or tbi has to be enabled (prctl? does
> this work with pac? or multi-threads?).

We could enable it via prctl. That's the plan for MTE as well (in
addition maybe to some ELF flag).

> as for checking the new syscall abi: i don't see much semantic
> difference between AT_HWCAP and AT_FLAGS (either way, the user
> has to check a feature flag before using the feature of the
> underlying system and it does not matter much if it's a syscall
> abi feature or cpu feature), but i don't see anything wrong
> with AT_FLAGS if the kernel prefers that.

The AT_FLAGS is aimed at capturing binary 2 case above, i.e. the
relaxation of the syscall ABI to accept tb = any. The MTE support will
have its own AT_HWCAP, likely in addition to AT_FLAGS. Arguably,
AT_FLAGS is either redundant here if MTE implies it (and no harm in
keeping it around) or the meaning is different: a tb != 0 may be checked
by the kernel against the allocation tag (i.e. get_user() could fail,
the tag is not entirely ignored).

> the discussion here was mostly about binary 2,

That's because passing tb != 0 into the syscall ABI is the main blocker
here that needs clearing out before merging the MTE support. There is,
of course, a variation of binary 1 for MTE:

binary 5: user tb = mte, syscall tb = 0

but this requires a lot of C lib changes to support properly.

> but for
> me the open question is if we can make binary 3/4 work.
> (which requires some elf binary marking, that is recognised
> by the kernel and dynamic loader, and efficient handling of
> the TBI0 bit, ..if it's not possible, then i don't see how
> mte will be deployed).

If we ignore binary 3, we can keep TBI0 = 1 permanently, whether we have
MTE or not.

> and i guess on the kernel side the open question is if the
> rules 1/2/3/4 can be made to work in corner cases e.g. when
> pointers embedded into structs are passed down in ioctl.

We've been trying to track these down since last summer and we came to
the conclusion that it should be (mostly) fine for the non-weird memory
described above.

--
Catalin

2019-02-25 18:03:43

by Szabolcs Nagy

[permalink] [raw]
Subject: Re: [RFC][PATCH 0/3] arm64 relaxed ABI

On 25/02/2019 16:57, Catalin Marinas wrote:
> On Tue, Feb 19, 2019 at 06:38:31PM +0000, Szabolcs Nagy wrote:
>> i think these rules work for the cases i care about, a more
>> tricky question is when/how to check for the new syscall abi
>> and when/how the TCR_EL1.TBI0 setting may be turned off.
>
> I don't think turning TBI0 off is critical (it's handy for PAC with
> 52-bit VA but then it's short-lived if you want more security features
> like MTE).

yes, i made a mistake assuming TBI0 off is
required for (or at least compatible with) MTE.

if TBI0 needs to be on for MTE then some of my
analysis is wrong, and i expect TBI0 to be on
in the foreseeable future.

>> consider the following cases (tb == top byte):
>>
>> binary 1: user tb = any, syscall tb = 0
>> tbi is on, "legacy binary"
>>
>> binary 2: user tb = any, syscall tb = any
>> tbi is on, "new binary using tb"
>> for backward compat it needs to check for new syscall abi.
>>
>> binary 3: user tb = 0, syscall tb = 0
>> tbi can be off, "new binary",
>> binary is marked to indicate unused tb,
>> kernel may turn tbi off: additional pac bits.
>>
>> binary 4: user tb = mte, syscall tb = mte
>> like binary 3, but with mte, "new binary using mte"

so this should be "like binary 2, but with mte".

>> does it have to check for new syscall abi?
>> or MTE HWCAP would imply it?
>> (is it possible to use mte without new syscall abi?)
>
> I think MTE HWCAP should imply it.
>
>> in userspace we want most binaries to be like binary 3 and 4
>> eventually, i.e. marked as not-relying-on-tbi, if a dso is
>> loaded that is unmarked (legacy or new tb user), then either
>> the load fails (e.g. if mte is already used? or can we turn
>> mte off at runtime?) or tbi has to be enabled (prctl? does
>> this work with pac? or multi-threads?).
>
> We could enable it via prctl. That's the plan for MTE as well (in
> addition maybe to some ELF flag).
>
>> as for checking the new syscall abi: i don't see much semantic
>> difference between AT_HWCAP and AT_FLAGS (either way, the user
>> has to check a feature flag before using the feature of the
>> underlying system and it does not matter much if it's a syscall
>> abi feature or cpu feature), but i don't see anything wrong
>> with AT_FLAGS if the kernel prefers that.
>
> The AT_FLAGS is aimed at capturing binary 2 case above, i.e. the
> relaxation of the syscall ABI to accept tb = any. The MTE support will
> have its own AT_HWCAP, likely in addition to AT_FLAGS. Arguably,
> AT_FLAGS is either redundant here if MTE implies it (and no harm in
> keeping it around) or the meaning is different: a tb != 0 may be checked
> by the kernel against the allocation tag (i.e. get_user() could fail,
> the tag is not entirely ignored).
>
>> the discussion here was mostly about binary 2,
>
> That's because passing tb != 0 into the syscall ABI is the main blocker
> here that needs clearing out before merging the MTE support. There is,
> of course, a variation of binary 1 for MTE:
>
> binary 5: user tb = mte, syscall tb = 0
>
> but this requires a lot of C lib changes to support properly.

yes, i don't think we want to do that.

but it's ok to have both syscall tbi AT_FLAGS and MTE HWCAP.

>> but for
>> me the open question is if we can make binary 3/4 work.
>> (which requires some elf binary marking, that is recognised
>> by the kernel and dynamic loader, and efficient handling of
>> the TBI0 bit, ..if it's not possible, then i don't see how
>> mte will be deployed).
>
> If we ignore binary 3, we can keep TBI0 = 1 permanently, whether we have
> MTE or not.
>
>> and i guess on the kernel side the open question is if the
>> rules 1/2/3/4 can be made to work in corner cases e.g. when
>> pointers embedded into structs are passed down in ioctl.
>
> We've been trying to track these down since last summer and we came to
> the conclusion that it should be (mostly) fine for the non-weird memory
> described above.

i think an interesting case is when userspace passes
a pointer to the kernel and later gets it back,
which is why i proposed rule 4 (kernel has to keep
the tag then).

but i wonder what's the right thing to do for sp
(user can malloc thread/sigalt/makecontext stack
which will be mte tagged in practice with mte on)
does tagged sp work? should userspace untag the
stack memory before setting it up as a stack?
(but then user pointers to that allocation may get
broken..)

2019-02-26 17:33:04

by Kevin Brodsky

[permalink] [raw]
Subject: Re: [RFC][PATCH 0/3] arm64 relaxed ABI

On 25/02/2019 18:02, Szabolcs Nagy wrote:
> On 25/02/2019 16:57, Catalin Marinas wrote:
>> On Tue, Feb 19, 2019 at 06:38:31PM +0000, Szabolcs Nagy wrote:
>>> i think these rules work for the cases i care about, a more
>>> tricky question is when/how to check for the new syscall abi
>>> and when/how the TCR_EL1.TBI0 setting may be turned off.
>> I don't think turning TBI0 off is critical (it's handy for PAC with
>> 52-bit VA but then it's short-lived if you want more security features
>> like MTE).
> yes, i made a mistake assuming TBI0 off is
> required for (or at least compatible with) MTE.
>
> if TBI0 needs to be on for MTE then some of my
> analysis is wrong, and i expect TBI0 to be on
> in the foreseeable future.
>
>>> consider the following cases (tb == top byte):
>>>
>>> binary 1: user tb = any, syscall tb = 0
>>> tbi is on, "legacy binary"
>>>
>>> binary 2: user tb = any, syscall tb = any
>>> tbi is on, "new binary using tb"
>>> for backward compat it needs to check for new syscall abi.
>>>
>>> binary 3: user tb = 0, syscall tb = 0
>>> tbi can be off, "new binary",
>>> binary is marked to indicate unused tb,
>>> kernel may turn tbi off: additional pac bits.
>>>
>>> binary 4: user tb = mte, syscall tb = mte
>>> like binary 3, but with mte, "new binary using mte"
> so this should be "like binary 2, but with mte".
>
>>> does it have to check for new syscall abi?
>>> or MTE HWCAP would imply it?
>>> (is it possible to use mte without new syscall abi?)
>> I think MTE HWCAP should imply it.
>>
>>> in userspace we want most binaries to be like binary 3 and 4
>>> eventually, i.e. marked as not-relying-on-tbi, if a dso is
>>> loaded that is unmarked (legacy or new tb user), then either
>>> the load fails (e.g. if mte is already used? or can we turn
>>> mte off at runtime?) or tbi has to be enabled (prctl? does
>>> this work with pac? or multi-threads?).
>> We could enable it via prctl. That's the plan for MTE as well (in
>> addition maybe to some ELF flag).
>>
>>> as for checking the new syscall abi: i don't see much semantic
>>> difference between AT_HWCAP and AT_FLAGS (either way, the user
>>> has to check a feature flag before using the feature of the
>>> underlying system and it does not matter much if it's a syscall
>>> abi feature or cpu feature), but i don't see anything wrong
>>> with AT_FLAGS if the kernel prefers that.
>> The AT_FLAGS is aimed at capturing binary 2 case above, i.e. the
>> relaxation of the syscall ABI to accept tb = any. The MTE support will
>> have its own AT_HWCAP, likely in addition to AT_FLAGS. Arguably,
>> AT_FLAGS is either redundant here if MTE implies it (and no harm in
>> keeping it around) or the meaning is different: a tb != 0 may be checked
>> by the kernel against the allocation tag (i.e. get_user() could fail,
>> the tag is not entirely ignored).
>>
>>> the discussion here was mostly about binary 2,
>> That's because passing tb != 0 into the syscall ABI is the main blocker
>> here that needs clearing out before merging the MTE support. There is,
>> of course, a variation of binary 1 for MTE:
>>
>> binary 5: user tb = mte, syscall tb = 0
>>
>> but this requires a lot of C lib changes to support properly.
> yes, i don't think we want to do that.
>
> but it's ok to have both syscall tbi AT_FLAGS and MTE HWCAP.
>
>>> but for
>>> me the open question is if we can make binary 3/4 work.
>>> (which requires some elf binary marking, that is recognised
>>> by the kernel and dynamic loader, and efficient handling of
>>> the TBI0 bit, ..if it's not possible, then i don't see how
>>> mte will be deployed).
>> If we ignore binary 3, we can keep TBI0 = 1 permanently, whether we have
>> MTE or not.
>>
>>> and i guess on the kernel side the open question is if the
>>> rules 1/2/3/4 can be made to work in corner cases e.g. when
>>> pointers embedded into structs are passed down in ioctl.
>> We've been trying to track these down since last summer and we came to
>> the conclusion that it should be (mostly) fine for the non-weird memory
>> described above.
> i think an interesting case is when userspace passes
> a pointer to the kernel and later gets it back,
> which is why i proposed rule 4 (kernel has to keep
> the tag then).
>
> but i wonder what's the right thing to do for sp
> (user can malloc thread/sigalt/makecontext stack
> which will be mte tagged in practice with mte on)
> does tagged sp work? should userspace untag the
> stack memory before setting it up as a stack?
> (but then user pointers to that allocation may get
> broken..)

Tagged SP does work, and it is actually a good idea (it avoids using the default tag
for the stack). It would be quite easy for the kernel to tag the initial SP and the
stack on execve(). For other stacks, it is up to userspace, as you say, and would be
made easier by making it possible to choose how a mapping should be tagged by the
kernel via a new mmap() flag. Some software that makes too many assumptions on the
address of stack variables will be disturbed by a tagged SP, but this should be
fairly rare.

In any case, I don't think this impacts this ABI proposal (beyond the fact that
passing tagged pointers to the stack needs to be allowed).

Kevin