Hi all. Here's long overdue update on LAM enabling.
# Description #
Linear Address Masking[1] (LAM) modifies the checking that is applied to
64-bit linear addresses, allowing software to use of the untranslated
address bits for metadata.
The patchset brings support for LAM for userspace addresses.
The most sensitive part of enabling is change in tlb.c, where CR3 flags
get set. Please take a look that what I'm doing makes sense.
The feature competes for bits with 5-level paging: LAM_U48 makes it
impossible to map anything about 47-bits. The patchset made these
capability mutually exclusive: whatever used first wins. LAM_U57 can be
combined with mappings above 47-bits.
[1] ISE, Chapter 14.
https://software.intel.com/content/dam/develop/external/us/en/documents-tps/architecture-instruction-set-extensions-programming-reference.pdf
# What's new #
The main change is interface rework. It is now arch_prctl(2)-based and
suppose to be extendable to CET.
QEMU implementation is also updated. It can now be applied onto current
master branch. QEMU patch as it is was rejected by upstream, but it is
functinal and can be used for testing.
Please take a look. Any suggestions are welcome.
v2:
- Rebased onto v5.18-rc1
- New arch_prctl(2)-based API
- Expose status of LAM (or other thread features) in
/proc/$PID/arch_status.
Kirill A. Shutemov (10):
x86/mm: Fix CR3_ADDR_MASK
x86: CPUID and CR3/CR4 flags for Linear Address Masking
x86: Introduce userspace API to handle per-thread features
x86/mm: Introduce X86_THREAD_LAM_U48 and X86_THREAD_LAM_U57
x86/mm: Provide untagged_addr() helper
x86/uaccess: Remove tags from the address before checking
x86/mm: Handle tagged memory accesses from kernel threads
x86/mm: Make LAM_U48 and mappings above 47-bits mutually exclusive
x86/mm: Add userspace API to enable Linear Address Masking
x86: Expose thread features status in /proc/$PID/arch_status
arch/x86/include/asm/cpufeatures.h | 1 +
arch/x86/include/asm/elf.h | 3 +-
arch/x86/include/asm/mmu.h | 1 +
arch/x86/include/asm/mmu_context.h | 13 +++
arch/x86/include/asm/page_32.h | 3 +
arch/x86/include/asm/page_64.h | 20 ++++
arch/x86/include/asm/processor-flags.h | 2 +-
arch/x86/include/asm/processor.h | 3 +
arch/x86/include/asm/tlbflush.h | 5 +
arch/x86/include/asm/uaccess.h | 15 ++-
arch/x86/include/uapi/asm/prctl.h | 8 ++
arch/x86/include/uapi/asm/processor-flags.h | 6 +
arch/x86/kernel/Makefile | 2 +
arch/x86/kernel/fpu/xstate.c | 47 --------
arch/x86/kernel/proc.c | 63 ++++++++++
arch/x86/kernel/process.c | 56 +++++++++
arch/x86/kernel/process.h | 2 +
arch/x86/kernel/process_64.c | 46 ++++++++
arch/x86/kernel/sys_x86_64.c | 5 +-
arch/x86/mm/hugetlbpage.c | 6 +-
arch/x86/mm/mmap.c | 9 +-
arch/x86/mm/tlb.c | 123 +++++++++++++++++---
22 files changed, 367 insertions(+), 72 deletions(-)
create mode 100644 arch/x86/kernel/proc.c
--
2.35.1
Allow to enable Linear Address Masking via ARCH_THREAD_FEATURE_ENABLE
arch_prctl(2).
Signed-off-by: Kirill A. Shutemov <[email protected]>
---
arch/x86/kernel/process.c | 21 +++++++++++++++-
arch/x86/kernel/process.h | 2 ++
arch/x86/kernel/process_64.c | 46 ++++++++++++++++++++++++++++++++++++
3 files changed, 68 insertions(+), 1 deletion(-)
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index cb8fc28f2eae..911c24321312 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -46,6 +46,8 @@
#include <asm/proto.h>
#include <asm/frame.h>
#include <asm/unwind.h>
+#include <asm/mmu_context.h>
+#include <asm/compat.h>
#include "process.h"
@@ -992,7 +994,9 @@ unsigned long __get_wchan(struct task_struct *p)
static long thread_feature_prctl(struct task_struct *task, int option,
unsigned long features)
{
- const unsigned long known_features = 0;
+ const unsigned long known_features =
+ X86_THREAD_LAM_U48 |
+ X86_THREAD_LAM_U57;
if (features & ~known_features)
return -EINVAL;
@@ -1013,8 +1017,23 @@ static long thread_feature_prctl(struct task_struct *task, int option,
/* Handle ARCH_THREAD_FEATURE_ENABLE */
+ if (features & (X86_THREAD_LAM_U48 | X86_THREAD_LAM_U57)) {
+ long ret;
+
+ /* LAM is only available in long mode */
+ if (in_32bit_syscall())
+ return -EINVAL;
+
+ ret = enable_lam(task, features);
+ if (ret)
+ return ret;
+ }
+
task->thread.features |= features;
out:
+ /* Update CR3 to get LAM active */
+ switch_mm(task->mm, task->mm, task);
+
return task->thread.features;
}
diff --git a/arch/x86/kernel/process.h b/arch/x86/kernel/process.h
index 76b547b83232..b8fa0e599c6e 100644
--- a/arch/x86/kernel/process.h
+++ b/arch/x86/kernel/process.h
@@ -4,6 +4,8 @@
#include <asm/spec-ctrl.h>
+long enable_lam(struct task_struct *task, unsigned long features);
+
void __switch_to_xtra(struct task_struct *prev_p, struct task_struct *next_p);
/*
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index e459253649be..a25c51da7005 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -729,6 +729,52 @@ void set_personality_ia32(bool x32)
}
EXPORT_SYMBOL_GPL(set_personality_ia32);
+static bool lam_u48_allowed(void)
+{
+ struct mm_struct *mm = current->mm;
+
+ if (!full_va_allowed(mm))
+ return true;
+
+ return find_vma(mm, DEFAULT_MAP_WINDOW) == NULL;
+}
+
+long enable_lam(struct task_struct *task, unsigned long features)
+{
+ features |= task->thread.features;
+
+ /* LAM_U48 and LAM_U57 are mutually exclusive */
+ if ((features & X86_THREAD_LAM_U48) && (features & X86_THREAD_LAM_U57))
+ return -EINVAL;
+
+ if (!cpu_feature_enabled(X86_FEATURE_LAM))
+ return -ENXIO;
+
+ if (mmap_write_lock_killable(task->mm))
+ return -EINTR;
+
+ if ((features & X86_THREAD_LAM_U48) && !lam_u48_allowed()) {
+ mmap_write_unlock(task->mm);
+ return -EINVAL;
+ }
+
+ /*
+ * Record the most permissive (allowing the widest tags) LAM
+ * mode to the mm context. It determinates if a mappings above
+ * 47 bit is allowed for the process.
+ *
+ * The mode is also used by a kernel thread when it does work
+ * on behalf of the process (like async I/O, io_uring, etc.)
+ */
+ if (features & X86_THREAD_LAM_U48)
+ current->mm->context.lam = LAM_U48;
+ else if (current->mm->context.lam == LAM_NONE)
+ current->mm->context.lam = LAM_U57;
+
+ mmap_write_unlock(task->mm);
+ return 0;
+}
+
#ifdef CONFIG_CHECKPOINT_RESTORE
static long prctl_map_vdso(const struct vdso_image *image, unsigned long addr)
{
--
2.35.1
On Wed, May 11, 2022 at 05:27:50AM +0300, Kirill A. Shutemov wrote:
> @@ -1013,8 +1017,23 @@ static long thread_feature_prctl(struct task_struct *task, int option,
>
> /* Handle ARCH_THREAD_FEATURE_ENABLE */
>
> + if (features & (X86_THREAD_LAM_U48 | X86_THREAD_LAM_U57)) {
> + long ret;
> +
> + /* LAM is only available in long mode */
> + if (in_32bit_syscall())
> + return -EINVAL;
So what happens if userspace sets up a 32bit code entry in the LDT and
does the LAM thing as a 64bit syscamm but then goes run 32bit code?
> +
> + ret = enable_lam(task, features);
> + if (ret)
> + return ret;
> + }
> +
> task->thread.features |= features;
> out:
> + /* Update CR3 to get LAM active */
> + switch_mm(task->mm, task->mm, task);
> +
> return task->thread.features;
> }
>
When a kernel thread performs memory access on behalf of a process (like
in async I/O, io_uring, etc.) it has to respect tagging setup of the
process as user addresses can include tags.
Normally, LAM setup is per-thread and recorded in thread features, but
for this use case kernel also tracks LAM setup per-mm. mm->context.lam
would record LAM that allows the most tag bits among the threads of
the mm.
The info used by switch_mm_irqs_off() to construct CR3 if the task is
kernel thread. Thread featrues of the kernel thread get updated
according to mm->context.lam. It allows untagged_addr() to work
correctly.
Signed-off-by: Kirill A. Shutemov <[email protected]>
---
arch/x86/include/asm/mmu.h | 1 +
arch/x86/mm/tlb.c | 28 ++++++++++++++++++++++++++++
2 files changed, 29 insertions(+)
diff --git a/arch/x86/include/asm/mmu.h b/arch/x86/include/asm/mmu.h
index 5d7494631ea9..52f3749f14e8 100644
--- a/arch/x86/include/asm/mmu.h
+++ b/arch/x86/include/asm/mmu.h
@@ -40,6 +40,7 @@ typedef struct {
#ifdef CONFIG_X86_64
unsigned short flags;
+ u8 lam;
#endif
struct mutex lock;
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index f9fe71d1f42c..b320556e1c22 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -185,6 +185,34 @@ static u8 gen_lam(struct task_struct *tsk, struct mm_struct *mm)
if (!tsk)
return LAM_NONE;
+ if (tsk->flags & PF_KTHREAD) {
+ /*
+ * For kernel thread use the most permissive LAM
+ * used by the mm. It's required to handle kernel thread
+ * memory accesses on behalf of a process.
+ *
+ * Adjust thread flags accodringly, so untagged_addr() would
+ * work correctly.
+ */
+
+ tsk->thread.features &= ~(X86_THREAD_LAM_U48 |
+ X86_THREAD_LAM_U57);
+
+ switch (mm->context.lam) {
+ case LAM_NONE:
+ return LAM_NONE;
+ case LAM_U57:
+ tsk->thread.features |= X86_THREAD_LAM_U57;
+ return LAM_U57;
+ case LAM_U48:
+ tsk->thread.features |= X86_THREAD_LAM_U48;
+ return LAM_U48;
+ default:
+ WARN_ON_ONCE(1);
+ return LAM_NONE;
+ }
+ }
+
if (tsk->thread.features & X86_THREAD_LAM_U57)
return LAM_U57;
if (tsk->thread.features & X86_THREAD_LAM_U48)
--
2.35.1
On Tue, May 10, 2022 at 7:29 PM Kirill A. Shutemov
<[email protected]> wrote:
>
> Allow to enable Linear Address Masking via ARCH_THREAD_FEATURE_ENABLE
> arch_prctl(2).
>
> Signed-off-by: Kirill A. Shutemov <[email protected]>
> ---
> arch/x86/kernel/process.c | 21 +++++++++++++++-
> arch/x86/kernel/process.h | 2 ++
> arch/x86/kernel/process_64.c | 46 ++++++++++++++++++++++++++++++++++++
> 3 files changed, 68 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
> index cb8fc28f2eae..911c24321312 100644
> --- a/arch/x86/kernel/process.c
> +++ b/arch/x86/kernel/process.c
> @@ -46,6 +46,8 @@
> #include <asm/proto.h>
> #include <asm/frame.h>
> #include <asm/unwind.h>
> +#include <asm/mmu_context.h>
> +#include <asm/compat.h>
>
> #include "process.h"
>
> @@ -992,7 +994,9 @@ unsigned long __get_wchan(struct task_struct *p)
> static long thread_feature_prctl(struct task_struct *task, int option,
> unsigned long features)
Since this arch_prctl will also be used for CET, which supports
32-bit processes,
shouldn't int, instead of long, be used?
> {
> - const unsigned long known_features = 0;
> + const unsigned long known_features =
> + X86_THREAD_LAM_U48 |
> + X86_THREAD_LAM_U57;
>
> if (features & ~known_features)
> return -EINVAL;
> @@ -1013,8 +1017,23 @@ static long thread_feature_prctl(struct task_struct *task, int option,
>
> /* Handle ARCH_THREAD_FEATURE_ENABLE */
>
> + if (features & (X86_THREAD_LAM_U48 | X86_THREAD_LAM_U57)) {
> + long ret;
> +
> + /* LAM is only available in long mode */
> + if (in_32bit_syscall())
> + return -EINVAL;
> +
> + ret = enable_lam(task, features);
> + if (ret)
> + return ret;
> + }
> +
> task->thread.features |= features;
> out:
> + /* Update CR3 to get LAM active */
> + switch_mm(task->mm, task->mm, task);
> +
> return task->thread.features;
> }
>
> diff --git a/arch/x86/kernel/process.h b/arch/x86/kernel/process.h
> index 76b547b83232..b8fa0e599c6e 100644
> --- a/arch/x86/kernel/process.h
> +++ b/arch/x86/kernel/process.h
> @@ -4,6 +4,8 @@
>
> #include <asm/spec-ctrl.h>
>
> +long enable_lam(struct task_struct *task, unsigned long features);
> +
> void __switch_to_xtra(struct task_struct *prev_p, struct task_struct *next_p);
>
> /*
> diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
> index e459253649be..a25c51da7005 100644
> --- a/arch/x86/kernel/process_64.c
> +++ b/arch/x86/kernel/process_64.c
> @@ -729,6 +729,52 @@ void set_personality_ia32(bool x32)
> }
> EXPORT_SYMBOL_GPL(set_personality_ia32);
>
> +static bool lam_u48_allowed(void)
> +{
> + struct mm_struct *mm = current->mm;
> +
> + if (!full_va_allowed(mm))
> + return true;
> +
> + return find_vma(mm, DEFAULT_MAP_WINDOW) == NULL;
> +}
> +
> +long enable_lam(struct task_struct *task, unsigned long features)
> +{
> + features |= task->thread.features;
> +
> + /* LAM_U48 and LAM_U57 are mutually exclusive */
> + if ((features & X86_THREAD_LAM_U48) && (features & X86_THREAD_LAM_U57))
> + return -EINVAL;
> +
> + if (!cpu_feature_enabled(X86_FEATURE_LAM))
> + return -ENXIO;
> +
> + if (mmap_write_lock_killable(task->mm))
> + return -EINTR;
> +
> + if ((features & X86_THREAD_LAM_U48) && !lam_u48_allowed()) {
> + mmap_write_unlock(task->mm);
> + return -EINVAL;
> + }
> +
> + /*
> + * Record the most permissive (allowing the widest tags) LAM
> + * mode to the mm context. It determinates if a mappings above
> + * 47 bit is allowed for the process.
> + *
> + * The mode is also used by a kernel thread when it does work
> + * on behalf of the process (like async I/O, io_uring, etc.)
> + */
> + if (features & X86_THREAD_LAM_U48)
> + current->mm->context.lam = LAM_U48;
> + else if (current->mm->context.lam == LAM_NONE)
> + current->mm->context.lam = LAM_U57;
> +
> + mmap_write_unlock(task->mm);
> + return 0;
> +}
> +
> #ifdef CONFIG_CHECKPOINT_RESTORE
> static long prctl_map_vdso(const struct vdso_image *image, unsigned long addr)
> {
> --
> 2.35.1
>
--
H.J.
On 5/12/22 12:39, Thomas Gleixner wrote:
>> It's OK for a debugging build that runs on one kind of hardware. But,
>> if we want LAM-using binaries to be portable, we have to do something
>> different.
>>
>> One of the stated reasons for adding LAM hardware is that folks want to
>> use sanitizers outside of debugging environments. To me, that means
>> that LAM is something that the same binary might run with or without.
> On/off yes, but is there an actual use case where such a mechanism would
> at start time dynamically chose the number of bits?
I'd love to hear from folks doing the userspace side of this. Will
userspace be saying: "Give me all the bits you can!". Or, will it
really just be looking for 6 bits only, and it doesn't care whether it
gets 6 or 15, it will use only 6?
Do the sanitizers have more overhead with more bits? Or *less* overhead
because they can store more metadata in the pointers?
Will anyone care about the difference about potentially missing 1/64
issues with U57 versus 1/32768 with U48?
On Thu, May 12 2022 at 10:22, Dave Hansen wrote:
> On 5/10/22 23:49, Peter Zijlstra wrote:
>>> The feature competes for bits with 5-level paging: LAM_U48 makes it
>>> impossible to map anything about 47-bits. The patchset made these
>>> capability mutually exclusive: whatever used first wins. LAM_U57 can be
>>> combined with mappings above 47-bits.
>> So aren't we creating a problem with LAM_U48 where programs relying on
>> it are of limited sustainability?
>
> I think allowing apps to say, "It's LAM_U48 or bust!" is a mistake.
That'd be outright stupid.
> It's OK for a debugging build that runs on one kind of hardware. But,
> if we want LAM-using binaries to be portable, we have to do something
> different.
>
> One of the stated reasons for adding LAM hardware is that folks want to
> use sanitizers outside of debugging environments. To me, that means
> that LAM is something that the same binary might run with or without.
On/off yes, but is there an actual use case where such a mechanism would
at start time dynamically chose the number of bits?
> It's totally fine with me if the kernel only initially supports LAM_U57.
> But, I'd ideally like to make sure that the ABI can support LAM_U57,
> LAM_U48, AMD's UAI (in whatever form it settles), or other masks.
Sure. No argument here.
Thanks,
tglx
On Thu, May 12 2022 at 17:46, Dave Hansen wrote:
> On 5/12/22 17:08, H.J. Lu wrote:
> If I had to take a shot at this today, I think I'd opt for:
>
> mask = sys_enable_masking(bits=6, flags=FUZZY_NR_BITS);
>
> although I'm not super confident about the "fuzzy" flag. I also don't
> think I'd totally hate the "blind" interface where the kernel just gets
> to pick unilaterally and takes zero input from userspace.
That's the only sane choice and you can make it simple for userspace:
ret = prctl(GET_XXX_MASK, &mask);
and then let it decide based on @ret and @mask whether to use it or not.
But of course nobody thought about this as a generic feature and so we
have the ARM64 TBI muck as a precedence.
So much for coordination and portability...
I'm so tired of this short sighted 'cram my feature in' approach of
_all_ involved parties.
Thanks,
tglx
On Thu, May 12 2022 at 20:05, Dave Hansen wrote:
> On 5/12/22 18:27, Thomas Gleixner wrote:
>> On Thu, May 12 2022 at 17:46, Dave Hansen wrote:
>>> On 5/12/22 17:08, H.J. Lu wrote:
>>> If I had to take a shot at this today, I think I'd opt for:
>>>
>>> mask = sys_enable_masking(bits=6, flags=FUZZY_NR_BITS);
>>>
>>> although I'm not super confident about the "fuzzy" flag. I also don't
>>> think I'd totally hate the "blind" interface where the kernel just gets
>>> to pick unilaterally and takes zero input from userspace.
>> That's the only sane choice and you can make it simple for userspace:
>>
>> ret = prctl(GET_XXX_MASK, &mask);
>>
>> and then let it decide based on @ret and @mask whether to use it or not.
>>
>> But of course nobody thought about this as a generic feature and so we
>> have the ARM64 TBI muck as a precedence.
>
> Well, not quite *nobody*:
>
> https://lore.kernel.org/linux-arm-kernel/[email protected]/
Sigh....
On Thu, May 12 2022 at 15:10, H. J. Lu wrote:
> On Thu, May 12, 2022 at 2:51 PM Dave Hansen <[email protected]> wrote:
>> On 5/12/22 12:39, Thomas Gleixner wrote:
>> >> It's OK for a debugging build that runs on one kind of hardware. But,
>> >> if we want LAM-using binaries to be portable, we have to do something
>> >> different.
>> >>
>> >> One of the stated reasons for adding LAM hardware is that folks want to
>> >> use sanitizers outside of debugging environments. To me, that means
>> >> that LAM is something that the same binary might run with or without.
>> > On/off yes, but is there an actual use case where such a mechanism would
>> > at start time dynamically chose the number of bits?
>>
>> I'd love to hear from folks doing the userspace side of this. Will
>> userspace be saying: "Give me all the bits you can!". Or, will it
>> really just be looking for 6 bits only, and it doesn't care whether it
>> gets 6 or 15, it will use only 6?
>>
>> Do the sanitizers have more overhead with more bits? Or *less* overhead
>> because they can store more metadata in the pointers?
>>
>> Will anyone care about the difference about potentially missing 1/64
>> issues with U57 versus 1/32768 with U48?
>
> The only LAM usage I know so far is LAM_U57 in HWASAN.
That's at least a halfways useful answer.
> An application can ask for LAM_U48 or LAM_U57. But the decision should
> be made by application.
It can ask for whatever, but the decision whether it's granted is made
by the kernel for obvious reasons.
> When an application asks for LAM_U57, I expect it will store tags in
> upper 6 bits, even if the kernel enables LAM_U48.
The kernel does not enable LAM_U48 when the application only wants to
have LAM_U57, because that would restrict the address space of the
application to 47 bits on 5-level capable system for no reason.
So what are you trying to tell me?
Thanks,
tglx
On 5/12/22 17:08, H.J. Lu wrote:
> I am expecting applications to ask for LAM_U48 or LAM_U57, not just
> LAM.
If AMD comes along with UAI that doesn't match LAM_U48 or LAM_U57, apps
will specifically be coded to ask for one of the three? That seems like
an awfully rigid ABI.
That also seems like a surefire way to have non-portable users of this
feature. It basically guarantees that userspace code will look like this:
if (support_lam_57()) {
sys_enable_masking(LAM_57);
mask = LAM_57_MASK;
} else if (support_lam_48()) {
sys_enable_masking(LAM_48);
mask = LAM_48_MASK;
} else if (...)
... others
Which is *ENTIRELY* non-portable and needs to get patched if anything
changes in the slightest. Where, if we move that logic into the kernel,
it's something more like:
mask = sys_enable_masking(...);
if (bitmap_weight(&mask) < MINIMUM_BITS)
goto whoops;
That actually works for all underlying implementations and doesn't
hard-code any assumptions about the implementation other than a basic
sanity check.
There are three choices we'd have to make for a more generic ABI that I
can think of:
ABI Question #1:
Should userspace be asking the kernel for a specific type of masking,
like a number of bits to mask or a mask itself? If not, the enabling
syscall is dirt simple: it's "mask = sys_enable_masking()". The kernel
picks what it wants to mask unilaterally and just tells userspace.
ABI Question #2:
Assuming that userspace is asking for a specific kind of address
masking: Should that request be made in terms of an actual mask or a
number of bits? For instance, if userspace asks for 0xf000000000000000,
it would fit UAI or ARM TBI. If it asks for 0x7e00000000000000, it
would match LAM_U57 behavior.
Or, does userspace ask for "8 bits", or "6 bits" or "15 bits"?
ABI Question #3:
If userspace asks for something that the kernel can't satisfy exactly,
like "8 bits" on a LAM system, is it OK for the kernel to fall back to
the next-largest mask? For instance sys_enable_masking(bits=8), could
the kernel unilaterally return a LAM_U48 mask because LAM_U48 means
supports 15>8 bits? Or, could this "fuzzy" behavior be an opt-in?
If I had to take a shot at this today, I think I'd opt for:
mask = sys_enable_masking(bits=6, flags=FUZZY_NR_BITS);
although I'm not super confident about the "fuzzy" flag. I also don't
think I'd totally hate the "blind" interface where the kernel just gets
to pick unilaterally and takes zero input from userspace.
On Wed, May 11 2022 at 05:27, Kirill A. Shutemov wrote:
> diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
> index f9fe71d1f42c..b320556e1c22 100644
> --- a/arch/x86/mm/tlb.c
> +++ b/arch/x86/mm/tlb.c
> @@ -185,6 +185,34 @@ static u8 gen_lam(struct task_struct *tsk, struct mm_struct *mm)
> if (!tsk)
> return LAM_NONE;
>
> + if (tsk->flags & PF_KTHREAD) {
> + /*
> + * For kernel thread use the most permissive LAM
> + * used by the mm. It's required to handle kernel thread
> + * memory accesses on behalf of a process.
> + *
> + * Adjust thread flags accodringly, so untagged_addr() would
> + * work correctly.
> + */
> +
> + tsk->thread.features &= ~(X86_THREAD_LAM_U48 |
> + X86_THREAD_LAM_U57);
> +
> + switch (mm->context.lam) {
> + case LAM_NONE:
> + return LAM_NONE;
> + case LAM_U57:
> + tsk->thread.features |= X86_THREAD_LAM_U57;
> + return LAM_U57;
> + case LAM_U48:
> + tsk->thread.features |= X86_THREAD_LAM_U48;
> + return LAM_U48;
Pretending that LAM is configurable per thread and then having a magic
override in the per process mm when accessing that process' memory from
a kernel thread is inconsistent, a horrible hack and a recipe for
hard to diagnose problems.
LAM has to be enabled by the process _before_ creating threads and then
stay enabled until the whole thing dies. That's the only sensible use
case.
I understand that tsk->thread.features is conveniant for the untagging
mechanism, but the whole setup should be:
prctl(ENABLE, which)
if (can_enable_lam(which)) {
mm->lam.c3_mask = CR3_LAM(which);
mm->lam.untag_mask = UNTAG_LAM(which);
current->thread.lam_untag_mask = mm->lam.untag_mask;
}
and
can_enable_lam(which)
if (current_is_multithreaded())
return -ETOOLATE;
if (current->mm->lam_cr3_mask)
return -EBUSY;
....
Now vs. kernel threads. Doing this like the above is just the wrong
place. If a kernel thread accesses user space memory of a process then
it has to invoke kthread_use_mm(), right? So the obvious point to cache
that setting is in kthread_use_mm() and kthread_unuse_mm() clears it:
kthread_use_mm()
current->thread.lam_untag_mask = mm->lam.untag_mask;
kthread_unuse_mm()
current->thread.lam_untag_mask = 0;
This makes all of the mechanics trivial because CR3 switch then simply
does:
new_cr3 |= mm->lam.c3_mask;
No conditionals and evaluations, nothing. Just straight forward and
comprehensible code.
Thanks,
tglx
On Fri, May 13, 2022 at 1:28 PM David Laight <[email protected]> wrote:
>
> ...
> > Once we have the possibility to store tags in the pointers, we don't
> > need redzones for heap/stack objects anymore, which saves quite a bit
> > of memory.
>
> You still need redzones.
> The high bits are ignored for actual memory accesses.
>
> To do otherwise you'd need the high bits to be in the PTE,
> copied to the TLB and finally get into the cache tag.
>
> Then you'd have to use the correct tags for each page.
Sorry, I don't understand how this is relevant to HWASan in the userspace.
Like in ASan, we have a custom allocator that assigns tags to heap
objects. The assigned tag is stored in both the shadow memory for the
object and the pointer returned by the allocator.
Instrumentation inserted by the compiler checks the pointer before
every memory access and ensures that its tag matches the tag of the
object in the shadow memory.
A tag mismatch is reported as an out-of-bounds or a use-after-free,
depending on whether the accessed memory is still considered
allocated.
Because objects with different tags follow each other, there is no
need to add extra redzones to the objects to detect buffer overflows.
(We might need to increase the object alignment though, but that's a
different story).
>
> David
>
> -
> Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
> Registration No: 1397386 (Wales)
--
Alexander Potapenko
Software Engineer
Google Germany GmbH
Erika-Mann-Straße, 33
80636 München
Geschäftsführer: Paul Manicle, Liana Sebastian
Registergericht und -nummer: Hamburg, HRB 86891
Sitz der Gesellschaft: Hamburg
Diese E-Mail ist vertraulich. Falls Sie diese fälschlicherweise
erhalten haben sollten, leiten Sie diese bitte nicht an jemand anderes
weiter, löschen Sie alle Kopien und Anhänge davon und lassen Sie mich
bitte wissen, dass die E-Mail an die falsche Person gesendet wurde.
This e-mail is confidential. If you received this communication by
mistake, please don't forward it to anyone else, please erase all
copies and attachments, and please let me know that it has gone to the
wrong person.
On Wed, May 11 2022 at 07:15, H. J. Lu wrote:
>> @@ -992,7 +994,9 @@ unsigned long __get_wchan(struct task_struct *p)
>> static long thread_feature_prctl(struct task_struct *task, int option,
>> unsigned long features)
>
> Since this arch_prctl will also be used for CET, which supports
> 32-bit processes,
> shouldn't int, instead of long, be used?
This is kernel internal code and the compat syscall takes care of the
int to long conversion. So yes, that could use unsigned int, but it does
not matter.
Thanks,
tglx
On Thu, May 12, 2022 at 08:05:26PM -0700, Dave Hansen wrote:
> On 5/12/22 18:27, Thomas Gleixner wrote:
> > On Thu, May 12 2022 at 17:46, Dave Hansen wrote:
> >> On 5/12/22 17:08, H.J. Lu wrote:
> >> If I had to take a shot at this today, I think I'd opt for:
> >>
> >> mask = sys_enable_masking(bits=6, flags=FUZZY_NR_BITS);
> >>
> >> although I'm not super confident about the "fuzzy" flag. I also don't
> >> think I'd totally hate the "blind" interface where the kernel just gets
> >> to pick unilaterally and takes zero input from userspace.
> > That's the only sane choice and you can make it simple for userspace:
> >
> > ret = prctl(GET_XXX_MASK, &mask);
> >
> > and then let it decide based on @ret and @mask whether to use it or not.
> >
> > But of course nobody thought about this as a generic feature and so we
> > have the ARM64 TBI muck as a precedence.
>
> Well, not quite *nobody*:
>
> https://lore.kernel.org/linux-arm-kernel/[email protected]/
In the first RFC I tried to get ARM TBI interface generic. I resurrect it
if it fits better:
https://lore.kernel.org/all/[email protected]/
--
Kirill A. Shutemov
On Thu, May 12, 2022 at 2:51 PM Dave Hansen <[email protected]> wrote:
>
> On 5/12/22 12:39, Thomas Gleixner wrote:
> >> It's OK for a debugging build that runs on one kind of hardware. But,
> >> if we want LAM-using binaries to be portable, we have to do something
> >> different.
> >>
> >> One of the stated reasons for adding LAM hardware is that folks want to
> >> use sanitizers outside of debugging environments. To me, that means
> >> that LAM is something that the same binary might run with or without.
> > On/off yes, but is there an actual use case where such a mechanism would
> > at start time dynamically chose the number of bits?
>
> I'd love to hear from folks doing the userspace side of this. Will
> userspace be saying: "Give me all the bits you can!". Or, will it
> really just be looking for 6 bits only, and it doesn't care whether it
> gets 6 or 15, it will use only 6?
>
> Do the sanitizers have more overhead with more bits? Or *less* overhead
> because they can store more metadata in the pointers?
>
> Will anyone care about the difference about potentially missing 1/64
> issues with U57 versus 1/32768 with U48?
The only LAM usage I know so far is LAM_U57 in HWASAN. An application
can ask for LAM_U48 or LAM_U57. But the decision should be made by
application. When an application asks for LAM_U57, I expect it will store
tags in upper 6 bits, even if the kernel enables LAM_U48.
--
H.J.
...
> Once we have the possibility to store tags in the pointers, we don't
> need redzones for heap/stack objects anymore, which saves quite a bit
> of memory.
You still need redzones.
The high bits are ignored for actual memory accesses.
To do otherwise you'd need the high bits to be in the PTE,
copied to the TLB and finally get into the cache tag.
Then you'd have to use the correct tags for each page.
David
-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
On Thu, May 12 2022 at 21:31, Thomas Gleixner wrote:
> On Thu, May 12 2022 at 19:56, Kirill A. Shutemov wrote:
>> On Thu, May 12, 2022 at 05:42:58PM +0200, Thomas Gleixner wrote:
>>> On Wed, May 11 2022 at 08:49, Peter Zijlstra wrote:
>>> > On Wed, May 11, 2022 at 05:27:40AM +0300, Kirill A. Shutemov wrote:
>>> > So aren't we creating a problem with LAM_U48 where programs relying on
>>> > it are of limited sustainability?
>>> >
>>> > Any such program simply *cannot* run on 5 level pagetables. Why do we
>>> > want to do this?
>>>
>>> More bits are better :)
>>>
>>> Seriously, I agree that restricting it to LAM57, which gives us 6 bits,
>>> makes a lot of sense _and_ makes the whole thing way simpler.
>>>
>>> So supporting both needs a truly good justification and a real world use
>>> case.
>>
>> I asked the question before[1]. Basically, more bits more better:
>>
>> For HWASAN #bits == detection probability.
>> For MarkUS #bits == exponential cost reduction
>
> What is MarkUS? It's not really helpful to provide acronyms which are
> not decodable.
>
>> I would really like to have only LAM_U57, but IIUC 6 bits is not always
>> enough.
>>
>> Dmitry, could you elaborate?
>>
>> [1] https://mobile.twitter.com/dvyukov/status/1342019823400837120
>
> I don't know whether he reacts on posting a link to his twitter
> account. I've CC'ed him now. Maybe that works better.
Duh. I should have looked at 'To:' and not only at 'Cc:'
Maybe someday I get used to this email thing.
Thanks,
tglx
From: Alexander Potapenko
> Sent: 13 May 2022 13:26
>
> On Fri, May 13, 2022 at 1:28 PM David Laight <[email protected]> wrote:
> >
> > ...
> > > Once we have the possibility to store tags in the pointers, we don't
> > > need redzones for heap/stack objects anymore, which saves quite a bit
> > > of memory.
> >
> > You still need redzones.
> > The high bits are ignored for actual memory accesses.
> >
> > To do otherwise you'd need the high bits to be in the PTE,
> > copied to the TLB and finally get into the cache tag.
> >
> > Then you'd have to use the correct tags for each page.
>
> Sorry, I don't understand how this is relevant to HWASan in the userspace.
> Like in ASan, we have a custom allocator that assigns tags to heap
> objects. The assigned tag is stored in both the shadow memory for the
> object and the pointer returned by the allocator.
> Instrumentation inserted by the compiler checks the pointer before
> every memory access and ensures that its tag matches the tag of the
> object in the shadow memory.
Doesn't that add so much overhead that the system runs like a sick pig?
I don't see any point adding overhead to a generic kernel to support
such operation.
> A tag mismatch is reported as an out-of-bounds or a use-after-free,
> depending on whether the accessed memory is still considered
> allocated.
> Because objects with different tags follow each other, there is no
> need to add extra redzones to the objects to detect buffer overflows.
> (We might need to increase the object alignment though, but that's a
> different story).
How does all that help if a system call (eg read()) is given
an invalid length.
If that length is checked then the 'unmasked' address can be
passed to the kernel.
David
-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
On Thu, May 12, 2022 at 4:35 PM Thomas Gleixner <[email protected]> wrote:
>
> On Thu, May 12 2022 at 15:10, H. J. Lu wrote:
> > On Thu, May 12, 2022 at 2:51 PM Dave Hansen <[email protected]> wrote:
> >> On 5/12/22 12:39, Thomas Gleixner wrote:
> >> >> It's OK for a debugging build that runs on one kind of hardware. But,
> >> >> if we want LAM-using binaries to be portable, we have to do something
> >> >> different.
> >> >>
> >> >> One of the stated reasons for adding LAM hardware is that folks want to
> >> >> use sanitizers outside of debugging environments. To me, that means
> >> >> that LAM is something that the same binary might run with or without.
> >> > On/off yes, but is there an actual use case where such a mechanism would
> >> > at start time dynamically chose the number of bits?
> >>
> >> I'd love to hear from folks doing the userspace side of this. Will
> >> userspace be saying: "Give me all the bits you can!". Or, will it
> >> really just be looking for 6 bits only, and it doesn't care whether it
> >> gets 6 or 15, it will use only 6?
> >>
> >> Do the sanitizers have more overhead with more bits? Or *less* overhead
> >> because they can store more metadata in the pointers?
> >>
> >> Will anyone care about the difference about potentially missing 1/64
> >> issues with U57 versus 1/32768 with U48?
> >
> > The only LAM usage I know so far is LAM_U57 in HWASAN.
>
> That's at least a halfways useful answer.
>
> > An application can ask for LAM_U48 or LAM_U57. But the decision should
> > be made by application.
>
> It can ask for whatever, but the decision whether it's granted is made
> by the kernel for obvious reasons.
>
> > When an application asks for LAM_U57, I expect it will store tags in
> > upper 6 bits, even if the kernel enables LAM_U48.
>
> The kernel does not enable LAM_U48 when the application only wants to
> have LAM_U57, because that would restrict the address space of the
> application to 47 bits on 5-level capable system for no reason.
>
> So what are you trying to tell me?
>
I am expecting applications to ask for LAM_U48 or LAM_U57, not just
LAM.
--
H.J.
On Fri, May 13 2022 at 10:14, Catalin Marinas wrote:
> On Fri, May 13, 2022 at 03:27:24AM +0200, Thomas Gleixner wrote:
>> On Thu, May 12 2022 at 17:46, Dave Hansen wrote:
>> > On 5/12/22 17:08, H.J. Lu wrote:
>> > If I had to take a shot at this today, I think I'd opt for:
>> >
>> > mask = sys_enable_masking(bits=6, flags=FUZZY_NR_BITS);
>> >
>> > although I'm not super confident about the "fuzzy" flag. I also don't
>> > think I'd totally hate the "blind" interface where the kernel just gets
>> > to pick unilaterally and takes zero input from userspace.
>>
>> That's the only sane choice and you can make it simple for userspace:
>>
>> ret = prctl(GET_XXX_MASK, &mask);
>>
>> and then let it decide based on @ret and @mask whether to use it or not.
>
> Getting the mask would work for arm64 as well (it's always 0xffUL << 56,
> top-byte-ignore). Setting the mask from user space won't be of any use
> to us, it's baked in hardware.
Sure.
> Dave indeed mentioned passing a mask to allow a more flexible control
> but, as already mentioned in the old thread, for arm64 the feature was
> already on, so it didn't make much sense, it seemed more like
> over-engineering. Had we known that Intel is pursing something similar,
> maybe we'd have designed the interface differently (we didn't get the
> hint).
Fair enough
> Intel's LAM has more flexibility but I don't see the arm64 TBI getting
> in the way. Just don't use it as an example because they evolved in
> different ways. I'm happy for arm64 to adopt a more flexible interface
> while keeping the current one around for backwards compatibility). But
> on arm64 we can't control the masking, not even disable it per process
> since it has always been on.
That's fine. The point is that we want uniform interfaces for the same
functionality. It's obviously hardware specific which subset of the
interface is supported.
>> I'm so tired of this short sighted 'cram my feature in' approach of
>> _all_ involved parties.
>
> Unfortunately it happens occasionally, especially when developers can't
> disclose that their companies work on similar features (resctrl is a
> good example where arm64 would have benefited from a more generic
> approach but at the time MPAM was not public).
Yeah. It's a constant pain.
Thanks,
tglx
On Thu, May 12 2022 at 17:08, H. J. Lu wrote:
> On Thu, May 12, 2022 at 4:35 PM Thomas Gleixner <[email protected]> wrote:
>> > When an application asks for LAM_U57, I expect it will store tags in
>> > upper 6 bits, even if the kernel enables LAM_U48.
>>
>> The kernel does not enable LAM_U48 when the application only wants to
>> have LAM_U57, because that would restrict the address space of the
>> application to 47 bits on 5-level capable system for no reason.
>>
>> So what are you trying to tell me?
>>
> I am expecting applications to ask for LAM_U48 or LAM_U57, not just
> LAM.
That still does not tell anything.
You can as well tell me, that this will depend on the moon phase. That
would be more telling at least.
If we commit to an ABI, which we have to support forever, then we want
factual arguments, not expectations.
The fact that hardware supports 5 variants does not mean that all of
them make sense to support in the OS.
Quite the contrary, 99% of all 'flexible' hardware features are based on
bogus assumptions. The worst of these assumptions is:
'We can handle this in software'
Sure, most of the trainwrecks hardware people provide can be 'handled'
in software, but you have to consider the price for doing so.
The price is that we amount technical debt.
You are well aware of this as you have contributed your share of
technical debt by cramming unsupportable crap into user space projects
prematurely.
So can you please take a step back and seriously think about the
semantics and long term consequences instead of providing handwaving
expectations which are biased by uninformed wishful thinking, AKA
marketing?
Thanks,
tglx
On Thu, May 12, 2022 at 11:51 PM Dave Hansen <[email protected]> wrote:
>
> On 5/12/22 12:39, Thomas Gleixner wrote:
> >> It's OK for a debugging build that runs on one kind of hardware. But,
> >> if we want LAM-using binaries to be portable, we have to do something
> >> different.
> >>
> >> One of the stated reasons for adding LAM hardware is that folks want to
> >> use sanitizers outside of debugging environments. To me, that means
> >> that LAM is something that the same binary might run with or without.
> > On/off yes, but is there an actual use case where such a mechanism would
> > at start time dynamically chose the number of bits?
>
> I'd love to hear from folks doing the userspace side of this. Will
> userspace be saying: "Give me all the bits you can!". Or, will it
> really just be looking for 6 bits only, and it doesn't care whether it
> gets 6 or 15, it will use only 6?
(speaking more or less on behalf of the userspace folks here)
I think it is safe to assume that in the upcoming year or two HWASan
will be fine having just 6 bits for the tags on x86 machines.
We are interested in running it on kernels with and without
CONFIG_X86_5LEVEL=y, so U48 is not an option in some cases anyway.
> Do the sanitizers have more overhead with more bits? Or *less* overhead
> because they can store more metadata in the pointers?
Once we have the possibility to store tags in the pointers, we don't
need redzones for heap/stack objects anymore, which saves quite a bit
of memory.
Also, HWASan doesn't use quarantine and has smaller shadow memory size
(see https://clang.llvm.org/docs/HardwareAssistedAddressSanitizerDesign.html
for more details).
Having more bits increases the probability to detect a UAF or buffer
overflow and reduces the shadow memory size further, but that one is
small enough already.
> Will anyone care about the difference about potentially missing 1/64
> issues with U57 versus 1/32768 with U48?
I don't think anyone will.
Having said that, I agree with Dave that it would be nice to have an
interface that would just request the mask from the system.
That way we could have support for U57 in the kernel now and keep the
possibility to add U48 in the future without breaking existing users.
I also may be missing something obvious, but I can't come up with a
case where different apps in the system may request U48 and U57 at the
same time.
It seems natural to me to let the OS decide which of the modes is
supported and give the app the freedom to use it or lose it.
--
Alexander Potapenko
Software Engineer
Google Germany GmbH
Erika-Mann-Straße, 33
80636 München
Geschäftsführer: Paul Manicle, Liana Sebastian
Registergericht und -nummer: Hamburg, HRB 86891
Sitz der Gesellschaft: Hamburg
Diese E-Mail ist vertraulich. Falls Sie diese fälschlicherweise
erhalten haben sollten, leiten Sie diese bitte nicht an jemand anderes
weiter, löschen Sie alle Kopien und Anhänge davon und lassen Sie mich
bitte wissen, dass die E-Mail an die falsche Person gesendet wurde.
This e-mail is confidential. If you received this communication by
mistake, please don't forward it to anyone else, please erase all
copies and attachments, and please let me know that it has gone to the
wrong person.
On Fri, May 13, 2022 at 4:26 PM David Laight <[email protected]> wrote:
>
> From: Alexander Potapenko
> > Sent: 13 May 2022 13:26
> >
> > On Fri, May 13, 2022 at 1:28 PM David Laight <[email protected]> wrote:
> > >
> > > ...
> > > > Once we have the possibility to store tags in the pointers, we don't
> > > > need redzones for heap/stack objects anymore, which saves quite a bit
> > > > of memory.
> > >
> > > You still need redzones.
> > > The high bits are ignored for actual memory accesses.
> > >
> > > To do otherwise you'd need the high bits to be in the PTE,
> > > copied to the TLB and finally get into the cache tag.
> > >
> > > Then you'd have to use the correct tags for each page.
> >
> > Sorry, I don't understand how this is relevant to HWASan in the userspace.
> > Like in ASan, we have a custom allocator that assigns tags to heap
> > objects. The assigned tag is stored in both the shadow memory for the
> > object and the pointer returned by the allocator.
> > Instrumentation inserted by the compiler checks the pointer before
> > every memory access and ensures that its tag matches the tag of the
> > object in the shadow memory.
>
> Doesn't that add so much overhead that the system runs like a sick pig?
> I don't see any point adding overhead to a generic kernel to support
> such operation.
Let me ensure we are on the same page here. Right now we are talking
about LAM support for userspace addresses.
At this point nobody is going to add instrumentation to a generic
kernel - just a prctl (let aside how exactly it works) that makes the
CPU ignore certain address bits in a particular userspace process.
The whole system is not supposed to run slower because of that - even
if one or many processes choose to enable LAM.
Now let's consider ASan (https://clang.llvm.org/docs/AddressSanitizer.html).
It is a powerful detector of memory corruptions in the userspace, but
it comes with a cost:
- compiler instrumentation bloats the code (by roughly 50%) and slows
down the execution (by up to 2x);
- redzones around stack and heap objects, memory quarantine and
shadow memory increase the memory consumption (by up to 4x).
In short, for each 8 bytes of app memory ASan stores one byte of the
metadata (shadow memory) indicating the addressability of those 8
bytes.
It then uses compiler instrumentation to verify that every memory
access in the program accesses only addressable memory.
ASan is widely used for testing and to some extent can be used in
production, but for big server-side apps the RAM overhead becomes
critical.
This is where HWASan
(https://clang.llvm.org/docs/HardwareAssistedAddressSanitizerDesign.html)
comes to the rescue.
Instead of storing addressability info in the shadow memory, it stores
a 1-byte tag for every 16 bytes of app memory (see
https://arxiv.org/pdf/1802.09517.pdf for other options).
As I mentioned before, the custom userspace memory allocator assigns
the tags to memory chunks and returns tagged pointers.
Like ASan, HWASan uses compiler instrumentation to verify that every
memory access is touching valid memory, but in this case it must
ensure that the pointer tag matches the tag stored in the shadow
memory.
Because of instrumentation, HWASan still has comparable code size and
execution overheads, but it uses way less memory (10-35% of the
original app memory consumption).
This lets us test beefy applications, e.g. feeding real-world queries
to production services. Even smaller applications benefit from it,
e.g. because of reduced cache pressure.
HWASan has been available for Android for a while now, and proved itself useful.
> > A tag mismatch is reported as an out-of-bounds or a use-after-free,
> > depending on whether the accessed memory is still considered
> > allocated.
> > Because objects with different tags follow each other, there is no
> > need to add extra redzones to the objects to detect buffer overflows.
> > (We might need to increase the object alignment though, but that's a
> > different story).
>
> How does all that help if a system call (eg read()) is given
> an invalid length.
Neither ASan nor HWASan care about what happens in the kernel.
Instead, they wrap system calls (along with some important libc
functions) and check their arguments to ensure there are no buffer
overflows.
HTH,
Alex
--
Alexander Potapenko
Software Engineer
Google Germany GmbH
Erika-Mann-Straße, 33
80636 München
Geschäftsführer: Paul Manicle, Liana Sebastian
Registergericht und -nummer: Hamburg, HRB 86891
Sitz der Gesellschaft: Hamburg
Diese E-Mail ist vertraulich. Falls Sie diese fälschlicherweise
erhalten haben sollten, leiten Sie diese bitte nicht an jemand anderes
weiter, löschen Sie alle Kopien und Anhänge davon und lassen Sie mich
bitte wissen, dass die E-Mail an die falsche Person gesendet wurde.
This e-mail is confidential. If you received this communication by
mistake, please don't forward it to anyone else, please erase all
copies and attachments, and please let me know that it has gone to the
wrong person.
On Thu, May 12 2022 at 21:39, Thomas Gleixner wrote:
> On Thu, May 12 2022 at 10:22, Dave Hansen wrote:
>> One of the stated reasons for adding LAM hardware is that folks want to
>> use sanitizers outside of debugging environments. To me, that means
>> that LAM is something that the same binary might run with or without.
>
> On/off yes, but is there an actual use case where such a mechanism would
> at start time dynamically chose the number of bits?
This would need cooperation from the application because it has to tell
the magic facility whether it intends to utilize the large VA space on a
5-level paging system or not.
I have no idea how that is supposed to work, but what do I know about
magic.
>> It's totally fine with me if the kernel only initially supports LAM_U57.
>> But, I'd ideally like to make sure that the ABI can support LAM_U57,
>> LAM_U48, AMD's UAI (in whatever form it settles), or other masks.
>
> Sure. No argument here.
Assumed that the acronym of the day, which uses this, has a real benefit
from the larger number of bits, we can support it.
But we are not going to make this a per thread selectable thing.
It's a process wide decision at startup simply because it does no buy
thread A anything to select U57 if thread B selects U48 before thread A
was able to map something into the U48 covered address space. Same issue
the other way round as then B has to fallback to U57 or NONE. That does
not make any sense at all.
I'm all for flexible, but not just because we can. It has to make sense.
Making it process wide and once on startup puts the 'complexity' into
the prctl(), but keeps the runtime overhead as small as it gets:
- CR3 switching needs just the | mm->lam_cr3_mask
- untagging one of the uglies Peter and I came up with
Making U48/U57 hardcoded would not buy much.
Thanks,
tglx
On Fri, May 13, 2022 at 03:27:24AM +0200, Thomas Gleixner wrote:
> On Thu, May 12 2022 at 17:46, Dave Hansen wrote:
> > On 5/12/22 17:08, H.J. Lu wrote:
> > If I had to take a shot at this today, I think I'd opt for:
> >
> > mask = sys_enable_masking(bits=6, flags=FUZZY_NR_BITS);
> >
> > although I'm not super confident about the "fuzzy" flag. I also don't
> > think I'd totally hate the "blind" interface where the kernel just gets
> > to pick unilaterally and takes zero input from userspace.
>
> That's the only sane choice and you can make it simple for userspace:
>
> ret = prctl(GET_XXX_MASK, &mask);
>
> and then let it decide based on @ret and @mask whether to use it or not.
Getting the mask would work for arm64 as well (it's always 0xffUL << 56,
top-byte-ignore). Setting the mask from user space won't be of any use
to us, it's baked in hardware.
> But of course nobody thought about this as a generic feature and so we
> have the ARM64 TBI muck as a precedence.
>
> So much for coordination and portability...
Well, we had TBI in the architecture and enabled for user-space since
the first arm64 kernel port (2012), no ABI controls needed. It had some
specific uses like JITs to avoid masking out type bits encoded in
pointers.
In 2019 sanitisers appeared and we relaxed the TBI at the syscall level
but, to avoid potentially confusing some programs, we added a control
which only changes the behaviour of access_ok(). More of a safety thing,
we might have as well skipped it. There is no hardware configuration
toggled by this control, nor is the user address space layout (max
52-bit on arm64). Since sanitisers require compiler instrumentation (or,
with MTE, arm64-specific libc changes), it's pretty much all within the
arm64-specific user codebase. MTE came along and we added some more bits
on top which, again, are hardware specific and contained within the
arm64 libc startup code (tag checking modes etc).
Dave indeed mentioned passing a mask to allow a more flexible control
but, as already mentioned in the old thread, for arm64 the feature was
already on, so it didn't make much sense, it seemed more like
over-engineering. Had we known that Intel is pursing something similar,
maybe we'd have designed the interface differently (we didn't get the
hint).
Intel's LAM has more flexibility but I don't see the arm64 TBI getting
in the way. Just don't use it as an example because they evolved in
different ways. I'm happy for arm64 to adopt a more flexible interface
while keeping the current one around for backwards compatibility). But
on arm64 we can't control the masking, not even disable it per process
since it has always been on.
> I'm so tired of this short sighted 'cram my feature in' approach of
> _all_ involved parties.
Unfortunately it happens occasionally, especially when developers can't
disclose that their companies work on similar features (resctrl is a
good example where arm64 would have benefited from a more generic
approach but at the time MPAM was not public).
--
Catalin
On Wed, May 11 2022 at 09:26, Peter Zijlstra wrote:
> On Wed, May 11, 2022 at 05:27:50AM +0300, Kirill A. Shutemov wrote:
>> @@ -1013,8 +1017,23 @@ static long thread_feature_prctl(struct task_struct *task, int option,
>>
>> /* Handle ARCH_THREAD_FEATURE_ENABLE */
>>
>> + if (features & (X86_THREAD_LAM_U48 | X86_THREAD_LAM_U57)) {
>> + long ret;
>> +
>> + /* LAM is only available in long mode */
>> + if (in_32bit_syscall())
>> + return -EINVAL;
>
> So what happens if userspace sets up a 32bit code entry in the LDT and
> does the LAM thing as a 64bit syscamm but then goes run 32bit code?
AFAICS, nothing happens. The only requirements are CR4.PAE = 1,
IA32_EFER.LME = 1. Those are unaffected from user space running 32bit
code, no?
32bit code can't use 64bit pointers so it can't have metadata bits
set. But x32 can and is excluded by the above too.
So the whole muck must be conditional on CONFIG_X86_64=y and does not
need any other restrictions IMO.
Thanks,
tglx
On Thu, May 12, 2022 at 11:24:27PM +0200, Thomas Gleixner wrote:
> On Thu, May 12 2022 at 21:39, Thomas Gleixner wrote:
> > On Thu, May 12 2022 at 10:22, Dave Hansen wrote:
> >> One of the stated reasons for adding LAM hardware is that folks want to
> >> use sanitizers outside of debugging environments. To me, that means
> >> that LAM is something that the same binary might run with or without.
> >
> > On/off yes, but is there an actual use case where such a mechanism would
> > at start time dynamically chose the number of bits?
>
> This would need cooperation from the application because it has to tell
> the magic facility whether it intends to utilize the large VA space on a
> 5-level paging system or not.
Taking a step back ...
Consider an application like 'grep'. It operates on streams of data,
and has a fairly bounded amount of virtual memory that it will use.
Let's say it's 1GB (*handwave*). So it has 33 bits of address space that
it can use for "security" of one form or another. ASLR is one thing which
is vying for bits, and now ASAN is another. Somehow it is supposed to
tell the kernel "I want to use 6 bits for ASAN and 27 bits for ASLR".
Or "I want to use 15 bits for ASAN and 18 bits for ASLR".
So how does grep make that decision? How does it find out what the
kernel supports? Deciding which tradeoff is more valuable to grep is
clearly not something the kernel can help with, but I do think that the
kernel needs to have an API to query what's available.
Something like Libreoffice or Firefox is obviously going to be much more
complex; it doesn't really have a bounded amount of virtual memory to
work with. But if we can't even solve this problem for applications
with bounded address space, we stand no chance of solving it for hard
cases.
On Thu, May 12 2022 at 19:56, Kirill A. Shutemov wrote:
> On Thu, May 12, 2022 at 05:42:58PM +0200, Thomas Gleixner wrote:
>> On Wed, May 11 2022 at 08:49, Peter Zijlstra wrote:
>> > On Wed, May 11, 2022 at 05:27:40AM +0300, Kirill A. Shutemov wrote:
>> > So aren't we creating a problem with LAM_U48 where programs relying on
>> > it are of limited sustainability?
>> >
>> > Any such program simply *cannot* run on 5 level pagetables. Why do we
>> > want to do this?
>>
>> More bits are better :)
>>
>> Seriously, I agree that restricting it to LAM57, which gives us 6 bits,
>> makes a lot of sense _and_ makes the whole thing way simpler.
>>
>> So supporting both needs a truly good justification and a real world use
>> case.
>
> I asked the question before[1]. Basically, more bits more better:
>
> For HWASAN #bits == detection probability.
> For MarkUS #bits == exponential cost reduction
What is MarkUS? It's not really helpful to provide acronyms which are
not decodable.
> I would really like to have only LAM_U57, but IIUC 6 bits is not always
> enough.
>
> Dmitry, could you elaborate?
>
> [1] https://mobile.twitter.com/dvyukov/status/1342019823400837120
I don't know whether he reacts on posting a link to his twitter
account. I've CC'ed him now. Maybe that works better.
Thanks,
tglx
On 5/12/22 18:27, Thomas Gleixner wrote:
> On Thu, May 12 2022 at 17:46, Dave Hansen wrote:
>> On 5/12/22 17:08, H.J. Lu wrote:
>> If I had to take a shot at this today, I think I'd opt for:
>>
>> mask = sys_enable_masking(bits=6, flags=FUZZY_NR_BITS);
>>
>> although I'm not super confident about the "fuzzy" flag. I also don't
>> think I'd totally hate the "blind" interface where the kernel just gets
>> to pick unilaterally and takes zero input from userspace.
> That's the only sane choice and you can make it simple for userspace:
>
> ret = prctl(GET_XXX_MASK, &mask);
>
> and then let it decide based on @ret and @mask whether to use it or not.
>
> But of course nobody thought about this as a generic feature and so we
> have the ARM64 TBI muck as a precedence.
Well, not quite *nobody*:
https://lore.kernel.org/linux-arm-kernel/[email protected]/
On Thu, May 12, 2022 at 11:24:27PM +0200, Thomas Gleixner wrote:
> On Thu, May 12 2022 at 21:39, Thomas Gleixner wrote:
> > On Thu, May 12 2022 at 10:22, Dave Hansen wrote:
> >> One of the stated reasons for adding LAM hardware is that folks want to
> >> use sanitizers outside of debugging environments. To me, that means
> >> that LAM is something that the same binary might run with or without.
> >
> > On/off yes, but is there an actual use case where such a mechanism would
> > at start time dynamically chose the number of bits?
>
> This would need cooperation from the application because it has to tell
> the magic facility whether it intends to utilize the large VA space on a
> 5-level paging system or not.
>
> I have no idea how that is supposed to work, but what do I know about
> magic.
>
> >> It's totally fine with me if the kernel only initially supports LAM_U57.
> >> But, I'd ideally like to make sure that the ABI can support LAM_U57,
> >> LAM_U48, AMD's UAI (in whatever form it settles), or other masks.
> >
> > Sure. No argument here.
>
> Assumed that the acronym of the day, which uses this, has a real benefit
> from the larger number of bits, we can support it.
>
> But we are not going to make this a per thread selectable thing.
>
> It's a process wide decision at startup simply because it does no buy
> thread A anything to select U57 if thread B selects U48 before thread A
> was able to map something into the U48 covered address space. Same issue
> the other way round as then B has to fallback to U57 or NONE. That does
> not make any sense at all.
>
> I'm all for flexible, but not just because we can. It has to make sense.
Some JVMs arn javascript engines are known for using tags in high bit of
pointers (and clearing them manually on dereferencing as of now).
One use-case I had in mind was having a thread that runs VM/JIT, like
javascript/JVM/LUA/whatever that serves the rest of the application.
The thread uses LAM while the rest of the application does not. Leaking
tagged pointer into into thread that does not use LAM would indicate an
issue and SIGSEGV would be deserved.
No idea if it is practical.
--
Kirill A. Shutemov
On Fri, May 13, 2022 at 01:07:43PM +0200, Alexander Potapenko wrote:
> On Thu, May 12, 2022 at 11:51 PM Dave Hansen <[email protected]> wrote:
> >
> > On 5/12/22 12:39, Thomas Gleixner wrote:
> > >> It's OK for a debugging build that runs on one kind of hardware. But,
> > >> if we want LAM-using binaries to be portable, we have to do something
> > >> different.
> > >>
> > >> One of the stated reasons for adding LAM hardware is that folks want to
> > >> use sanitizers outside of debugging environments. To me, that means
> > >> that LAM is something that the same binary might run with or without.
> > > On/off yes, but is there an actual use case where such a mechanism would
> > > at start time dynamically chose the number of bits?
> >
> > I'd love to hear from folks doing the userspace side of this. Will
> > userspace be saying: "Give me all the bits you can!". Or, will it
> > really just be looking for 6 bits only, and it doesn't care whether it
> > gets 6 or 15, it will use only 6?
>
> (speaking more or less on behalf of the userspace folks here)
> I think it is safe to assume that in the upcoming year or two HWASan
> will be fine having just 6 bits for the tags on x86 machines.
> We are interested in running it on kernels with and without
> CONFIG_X86_5LEVEL=y, so U48 is not an option in some cases anyway.
Just to be clear: LAM_U48 works on machine with 5-level paging enabled as
long as the process doesn't map anything above 47-bit.
--
Kirill A. Shutemov
On Sat, May 14 2022 at 02:01, Kirill A. Shutemov wrote:
> On Fri, May 13, 2022 at 01:07:43PM +0200, Alexander Potapenko wrote:
>> On Thu, May 12, 2022 at 11:51 PM Dave Hansen <[email protected]> wrote:
>> >
>> > On 5/12/22 12:39, Thomas Gleixner wrote:
>> > >> It's OK for a debugging build that runs on one kind of hardware. But,
>> > >> if we want LAM-using binaries to be portable, we have to do something
>> > >> different.
>> > >>
>> > >> One of the stated reasons for adding LAM hardware is that folks want to
>> > >> use sanitizers outside of debugging environments. To me, that means
>> > >> that LAM is something that the same binary might run with or without.
>> > > On/off yes, but is there an actual use case where such a mechanism would
>> > > at start time dynamically chose the number of bits?
>> >
>> > I'd love to hear from folks doing the userspace side of this. Will
>> > userspace be saying: "Give me all the bits you can!". Or, will it
>> > really just be looking for 6 bits only, and it doesn't care whether it
>> > gets 6 or 15, it will use only 6?
>>
>> (speaking more or less on behalf of the userspace folks here)
>> I think it is safe to assume that in the upcoming year or two HWASan
>> will be fine having just 6 bits for the tags on x86 machines.
>> We are interested in running it on kernels with and without
>> CONFIG_X86_5LEVEL=y, so U48 is not an option in some cases anyway.
>
> Just to be clear: LAM_U48 works on machine with 5-level paging enabled as
> long as the process doesn't map anything above 47-bit.
That's the whole point. If you use LAM_U48 in one thread for some
obscure reason, you prevent the whole process from utilizing the full
virtual address space. The other way round is also weird. If one thread
manages to have a virtual address above bit 47 then you can't get U48
for the other anymore.
Aside of that the whole per-thread approach can only ever work when an
application is crafted carefully to handle it. Think about shared data
structures with pointers inside. This surely can be made work, but is it
something which is generally safe? No.
Plus it comes with inconsistent behaviour in the kthread_use_mm() case.
What could work is a mechanism which tells the loader that it should apply
U48 and restrict the address space randomization to 47 bits, but
anything else is going to create more problems than it solves.
Thanks,
tglx