Folks!
Back in the good old spectre v2 days (2018) we decided to not use
IBRS. In hindsight this might have been the wrong decision because it did
not force people to come up with alternative approaches.
It was already discussed back then to try software based call depth
accounting and RSB stuffing on underflow for Intel SKL[-X] systems to avoid
the insane overhead of IBRS.
This was tried in 2018 and rejected due to the massive overhead and other
shortcomings of the approach of putting the accounting into each function
prologue:
1) Text size increase which is inflicted on everyone. While CPUs are
good at ignoring NOPs, they still pollute the I-cache.
2) The prologue accounting over-accounts tail calls, which can be exploited.
Disabling tail calls is not an option either, and adding a 10 byte padding
in front of every direct call is even worse in terms of text size and
I-cache impact. We could also patch the calls to point past the accounting
in the function prologue, but that becomes a nightmare vs. ENDBR.
As IBRS is a performance horror show, Peter Zijlstra and I revisited the
call depth tracking approach and implemented it in a way which is hopefully
more palatable and avoids the downsides of the original attempt.
We both unsurprisingly hate the result with a passion...
The way we approached this is:
1) objtool creates a list of function entry points and a list of direct
call sites into new sections which can be discarded after init.
2) On affected machines, use the new sections, allocate module memory
and create a call thunk per function (16 bytes without
debug/statistics). Then patch all direct calls to invoke the thunk,
which does the call accounting and then jumps to the original call
destination, i.e. the called function (sketched after this list).
3) Utilize the retbleed return thunk mechanism by making the jump
target run-time configurable. Add the accounting counterpart and
stuff RSB on underflow in that alternate implementation.
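For illustration, such a call thunk boils down to roughly the following (a
sketch only, not the literal code of this series; the per-CPU counter name
__x86_call_depth is taken from the discussion further down in this thread):
        /*
         * Per-function call thunk (~16 bytes without debug/statistics):
         * account one call, then jump to the function which the patched
         * call site originally targeted.
         */
        sarq    $5, PER_CPU_VAR(__x86_call_depth)
        jmp     original_function
The run-time configurable return thunk does the inverse accounting and, once
the counter indicates that the RSB is about to run dry, refills it with
speculation traps (see the stuffing sketch in the theory section below).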
This does not need a new compiler and avoids almost all overhead for
non-affected machines. It can be selected via 'retbleed=stuff' on the
kernel command line.
The memory consumption is impressive. On an affected server with a Debian
config this results in about 1.8MB of call thunk memory and 2MB of btree
memory to keep track of the thunks. The call thunk memory is overcommitted
due to the way objtool collects symbols. This could probably be cut in half,
but we need to allocate a 2MB region anyway due to the following:
Our initial approach of just using module_alloc() turned out to create a
massive ITLB miss rate, which is not surprising as each call goes out to a
randomly distributed call thunk. Peter added a method to allocate that
memory with large 2MB mappings, which made the problem mostly go away and
gave us another nice performance gain. It obviously does not reduce the
I-cache overhead, nor does it cure the problem that the ITLB has one entry
permanently occupied by the call-thunk mapping.
The theory behind call depth tracking is:
RSB is a stack with depth 16 which is filled on every call. On the return
path speculation "pops" entries to speculate down the call chain. Once the
speculative RSB is empty it switches to other predictors, e.g. the Branch
History Buffer, which can be mistrained by user space and misguides the
speculation path to a disclosure gadget as described in the retbleed paper.
Call depth tracking is designed to break this speculation path by stuffing
speculation trap calls into the RSB which never get a corresponding return
executed. This stalls the prediction path until it gets resteered.
The assumption is that stuffing at the 12th return is sufficient to break
the speculation before it hits the underflow and the fallback to the other
predictors. Testing confirms that it works. Johannes, one of the retbleed
researchers, tried to attack this approach and confirmed that it brings
the signal to noise ratio down to the crystal ball level.
There is obviously no scientific proof that this will withstand future
research progress, but all we can do right now is to speculate about that.
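For illustration, the stuffing itself is conceptually the classic RSB fill
sequence (a sketch only, not the literal return thunk of this series):
        /*
         * Push 16 RSB entries whose speculative target is a trap.
         * Architecturally each call just falls through to the next
         * iteration; the return addresses pushed onto the real stack
         * are dropped again before the real return executes.
         */
        .rept   16
        call    1f
        int3                    /* speculation trap */
1:
        .endr
        add     $(16*8), %rsp   /* drop the 16 pushed return addresses */
        ret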
So much for the theory. Here are numbers for various tests I did myself and
FIO results provided by Tim Chen.
Just to make everyone feel as bad as I feel, this comes with two overhead
values: one against mitigations=off and one against retbleed=off.
hackbench               Baseline:    mitigations=off    retbleed=off
----------------------------------------------------------------------------
mitigations=off          24.41s           0.00%           -15.26%
retbleed=off             28.80s         +18.00%             0.00%
retbleed=ibrs            34.92s         +43.07%           +21.24%
retbleed=stuff           31.95s         +30.91%           +10.94%
----------------------------------------------------------------------------
sys_time loop           Baseline:    mitigations=off    retbleed=off
----------------------------------------------------------------------------
mitigations=off           1.28s           0.00%           -78.66%
retbleed=off              5.98s         +368.50%            0.00%
retbleed=ibrs             9.73s         +662.81%          +62.82%
retbleed=stuff            6.15s         +381.93%           +2.87%
----------------------------------------------------------------------------
kernel build            Baseline:    mitigations=off    retbleed=off
----------------------------------------------------------------------------
mitigations=off         197.55s           0.00%            -3.64%
retbleed=off            205.02s          +3.78%             0.00%
retbleed=ibrs           218.00s         +10.35%            +6.33%
retbleed=stuff          209.92s          +6.26%            +2.39%
----------------------------------------------------------------------------
microbench deep
callchain               Baseline:    mitigations=off    retbleed=off
----------------------------------------------------------------------------
mitigations=off           1.72s           0.00%           -39.88%
retbleed=off              2.86s         +66.34%             0.00%
retbleed=ibrs             3.92s         +128.23%          +37.20%
retbleed=stuff            3.39s         +97.05%           +18.46%
----------------------------------------------------------------------------
fio rand-read           Baseline:    mitigations=off    retbleed=off
----------------------------------------------------------------------------
mitigations=off        352.7 kIOPS        0.00%            +3.42%
retbleed=off           341.0 kIOPS       -3.31%             0.00%
retbleed=ibrs          242.0 kIOPS      -31.38%           -29.03%
retbleed=stuff         288.3 kIOPS      -18.24%           -15.44%
----------------------------------------------------------------------------
fio 70/30               Baseline:    mitigations=off    retbleed=off
----------------------------------------------------------------------------
mitigations=off        349.0 kIOPS        0.00%           +10.49%
retbleed=off           315.9 kIOPS       -9.49%             0.00%
retbleed=ibrs          241.5 kIOPS      -30.80%           -23.54%
retbleed=stuff         276.2 kIOPS      -20.86%           -12.56%
----------------------------------------------------------------------------
fio write               Baseline:    mitigations=off    retbleed=off
----------------------------------------------------------------------------
mitigations=off        299.3 kIOPS        0.00%           +10.49%
retbleed=off           275.6 kIOPS       -9.49%             0.00%
retbleed=ibrs          232.7 kIOPS      -22.27%           -15.60%
retbleed=stuff         259.3 kIOPS      -13.36%            -5.93%
----------------------------------------------------------------------------
Sockperf 14 byte
localhost               Baseline:    mitigations=off    retbleed=off
----------------------------------------------------------------------------
mitigations=off         4.495 MBps        0.00%           +33.19%
retbleed=off            3.375 MBps      -24.92%             0.00%
retbleed=ibrs           2.573 MBps      -42.76%           -23.76%
retbleed=stuff          2.725 MBps      -39.38%           -19.26%
----------------------------------------------------------------------------
Sockperf 1472 byte
localhost               Baseline:    mitigations=off    retbleed=off
----------------------------------------------------------------------------
mitigations=off       425.494 MBps        0.00%           +26.21%
retbleed=off          337.132 MBps      -20.77%             0.00%
retbleed=ibrs         261.243 MBps      -38.60%           -22.51%
retbleed=stuff        275.111 MBps      -35.34%           -18.40%
----------------------------------------------------------------------------
The micro benchmark is just an artificial 30-call-deep call chain through
a syscall with empty functions to measure the overhead, but it's also
interestingly close to the pathological FIO random read case; it just
emphasizes the overhead slightly more. I wrote it because I was not able to
take real FIO numbers myself, as every variant was close to the max-out
point of the not-so-silly-fast SSD/NVMe in my machine.
So the benefit varies depending on hardware and workload scenarios. At
least it does not get worse than IBeeRS.
The whole lot is also available from git:
git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git depthtracking
The ftrace and BPF changes need some extra scrutiny from the respective
experts.
Peter and I went to great lengths to analyze the overhead. Unsurprisingly
the call thunks contribute the largest amount after the actual stuffing in
the return thunk. The call thunks create trouble in various ways:
1) Extra ITLB pressure, which was very prominent in our first experiments
where we had 4k PTEs instead of the 2M mapping we use now, and which
still causes ITLB misses in certain workloads.
2) The I-cache footprint is obviously larger and the out-of-line call
thunks are not prefetcher friendly.
3) The extra jump adds overhead for the predictors and the branch history.
A potential solution for this is to let the compiler add a 16 byte padding
area in front of every function. That allows bringing the call thunk into
ITLB/I-cache locality and avoids the extra jump.
Obviously only the affected SKL variants would go through that code; for
every other machine it is just padding in the compiled binary. Contrary to
adding the accounting to the prologue and NOPing it out, this avoids the
I-cache overhead for the non-SKL case completely.
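Roughly, the layout then looks like this (a sketch; the directives match the
experimental GCC patch further down, the patched-in template is illustrative):
        /* What the compiler emits; on non-affected machines it stays int3 filler: */
        .align  16
        .skip   16, 0xcc
func:
        ...
        /*
         * What an affected SKL machine executes after boot-time patching
         * (sketch): direct calls to func are redirected into the padding
         * and the accounting falls through into the function without an
         * extra jump. The remaining padding bytes become NOPs.
         */
        .align  16
        sarq    $5, PER_CPU_VAR(__x86_call_depth)
func:
        ...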
The function alignment option does not work for that, because it only
guarantees that the next function entry is aligned. The padding size
depends on where the last instruction of the previous function ends, which
can obviously be anything between 0 and padsize-1 bytes, so it is not a
reliable place to put 10 bytes of accounting code.
I hacked up GCC to emit such padding and from first experimentation it
brings quite some performance back.
                        IBRS       stuff      stuff(pad)
sockperf 14 bytes:    -23.76%    -19.26%      -14.31%
sockperf 1472 bytes:  -22.51%    -18.40%      -12.25%
microbench:           +37.20%    +18.46%      +15.47%
hackbench:            +21.24%    +10.94%      +10.12%
For FIO I don't have numbers yet, but I expect FIO to get a significant
gain too.
From a quick survey it seems to have no impact for the case where the
thunks are not used. But that really needs some deep investigation and
there is a potential conflict with the clang CFI efforts.
The kernel text size increases with a Debian config from 9.9M to 10.4M, so
about 5%. If the thunk is not 16 byte aligned, the text size increase is
about 3%, but it turned out that 16 byte aligned is slightly faster.
The 16 byte function alignment turned out to be beneficial in general even
without the thunks. Not much of an improvement, but measurable. We should
revisit this independent of these horrors.
The implementation falls back to the allocated thunks when padding is not
available. I'll send out the GCC patch and the required kernel patch as a
reply to this series after polishing it a bit.
We also evaluated another approach to solve this problem, which does not
involve call thunks and solely relies on return thunks. The idea is to have
a return thunk which is not reliably predictable. That option would be
"retbleed=confuse". It looks like this:
retthunk:
        test    $somebit, something
        jz      1f
        ret
        int3
1:
        test    $someotherbit, something
        jnz     2f
        ret
        int3
2:
        ....
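For illustration, with low RSP bits playing the role of $somebit (which is
what the experimental patch further down in this thread does, relying on
randomize_kstack_offset=y for variation), one such path decision is simply:
        testq   $3, %rsp        /* low RSP bits vary with the randomized kstack offset */
        jz      1f
        ret
        int3
1:
        ...                     /* next randomized return path */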
There are two questions about this approach:
1) How does $something get enough entropy to be effective?
Ideally we can use a register for that, because the frequent memory
operation in the return thunk obviously has a significant performance
penalty.
In combination with stack randomization the RSP looks like a feasible
option, but for kernel centric workloads that does not create much
entropy.
Though keep in mind that the attacks rely on syscalls and observing
their side effects.
Which in turn means that entry stack randomization is a good entropy
source. But it's easy enough to create a scenario where the attacking
thread spins and relies on interrupts to cause the kernel entry.
Interrupt entry from user-space does not do kernel task stack
randomization today, but that's easy enough to fix.
If we can't use RSP because it turns out to be too predictable, then we
can still create entropy in a per CPU variable, but RSP is definitely
faster than a memory access.
2) How "easy" is it to circumvent the "randomization"?
It makes it impossible for the occasional exploit writer like myself
and probably for the vast majority of script kiddies, but we really
need input from the researchers who seem to have way more nasty ideas
than we can imagine.
I asked some of those folks, but the answers were not conclusive at
all. There are some legitimate concerns, but the question is whether
they are real threats or purely academic problems (at the moment).
That retthunk "randomization" can easily be extended to have several
return thunks and they can be assigned during patching randomly. As
that's a boot time decision it's possible to figure it out, but in the
worst case we can randomly reassign them once in a while.
But this approach also has quite some performance impact:
For 2 RET paths randomized with randomize_kstack_offset=y and RSP bit 3:
                        IBRS       stuff      stuff(pad)    confuse
microbench:           +37.20%    +18.46%      +15.47%       +6.76%
sockperf 14 bytes:    -23.76%    -19.26%      -14.31%       -9.04%
sockperf 1472 bytes:  -22.51%    -18.40%      -12.25%       -7.51%
For 3 RET paths randomized with randomize_kstack_offset=y and RSP bit 3, 6:
                        IBRS       stuff      stuff(pad)    confuse
microbench:           +37.20%    +18.46%      +15.47%       +7.12%
sockperf 14 bytes:    -23.76%    -19.26%      -14.31%      -12.03%
sockperf 1472 bytes:  -22.51%    -18.40%      -12.25%      -10.49%
For 4 RET paths randomized with randomize_kstack_offset=y and RSP bit 3, 6, 5:
                        IBRS       stuff      stuff(pad)    confuse
microbench:           +37.20%    +18.46%      +15.47%       +7.46%
sockperf 14 bytes:    -23.76%    -19.26%      -14.31%      -16.80%
sockperf 1472 bytes:  -22.51%    -18.40%      -12.25%      -15.95%
So for the more randomized variants sockperf tanks and is already slower
than stuffing with thunks in the compiler-provided padding space.
I'll send out a patch in reply to this series which implements that variant,
but we need input from the security researchers on how protective this is.
If we could get away with 2 RET paths (perhaps multiple instances with
different bits), that would be amazing.
Thanks go to Andrew Cooper for helpful discussions around this, Johannes
Wikner for spending the effort of trying to break the stuffing defense and
to Tim Chen for getting the FIO numbers to us.
And of course many thanks to Peter for working on this with me. We went
through quite some mispredicted branches and had to retire some "brilliant"
ideas on the way.
Thanks,
tglx
---
arch/x86/Kconfig | 37 +
arch/x86/entry/entry_64.S | 26
arch/x86/entry/vdso/Makefile | 11
arch/x86/include/asm/alternative.h | 69 ++
arch/x86/include/asm/cpufeatures.h | 1
arch/x86/include/asm/disabled-features.h | 9
arch/x86/include/asm/module.h | 2
arch/x86/include/asm/nospec-branch.h | 166 +++++
arch/x86/include/asm/paravirt.h | 5
arch/x86/include/asm/paravirt_types.h | 21
arch/x86/include/asm/processor.h | 1
arch/x86/include/asm/text-patching.h | 2
arch/x86/kernel/Makefile | 2
arch/x86/kernel/alternative.c | 114 +++
arch/x86/kernel/callthunks.c | 799 ++++++++++++++++++++++++++++
arch/x86/kernel/cpu/bugs.c | 32 +
arch/x86/kernel/cpu/common.c | 17
arch/x86/kernel/ftrace.c | 20
arch/x86/kernel/ftrace_64.S | 31 +
arch/x86/kernel/kprobes/core.c | 1
arch/x86/kernel/module.c | 86 ---
arch/x86/kernel/static_call.c | 3
arch/x86/kernel/unwind_orc.c | 21
arch/x86/kernel/vmlinux.lds.S | 14
arch/x86/lib/retpoline.S | 106 +++
arch/x86/mm/Makefile | 2
arch/x86/mm/module_alloc.c | 68 ++
arch/x86/net/bpf_jit_comp.c | 56 +
include/linux/btree.h | 6
include/linux/kallsyms.h | 24
include/linux/module.h | 72 +-
include/linux/static_call.h | 2
include/linux/vmalloc.h | 3
init/main.c | 2
kernel/kallsyms.c | 23
kernel/kprobes.c | 52 +
kernel/module/internal.h | 8
kernel/module/main.c | 6
kernel/module/tree_lookup.c | 17
kernel/static_call_inline.c | 23
kernel/trace/trace_selftest.c | 5
lib/btree.c | 8
mm/vmalloc.c | 33 -
samples/ftrace/ftrace-direct-modify.c | 2
samples/ftrace/ftrace-direct-multi-modify.c | 2
samples/ftrace/ftrace-direct-multi.c | 1
samples/ftrace/ftrace-direct-too.c | 1
samples/ftrace/ftrace-direct.c | 1
scripts/Makefile.lib | 3
tools/objtool/arch/x86/decode.c | 26
tools/objtool/builtin-check.c | 7
tools/objtool/check.c | 160 ++++-
tools/objtool/include/objtool/arch.h | 4
tools/objtool/include/objtool/builtin.h | 1
tools/objtool/include/objtool/check.h | 20
tools/objtool/include/objtool/elf.h | 2
tools/objtool/include/objtool/objtool.h | 1
tools/objtool/objtool.c | 1
58 files changed, 1970 insertions(+), 268 deletions(-)
From: Thomas Gleixner
> Sent: 17 July 2022 00:17
> Folks!
>
> Back in the good old spectre v2 days (2018) we decided to not use
> IBRS. In hindsight this might have been the wrong decision because it did
> not force people to come up with alternative approaches.
>
> It was already discussed back then to try software based call depth
> accounting and RSB stuffing on underflow for Intel SKL[-X] systems to avoid
> the insane overhead of IBRS.
>
> This has been tried in 2018 and was rejected due to the massive overhead
> and other shortcomings of the approach to put the accounting into each
> function prologue:
>
> 1) Text size increase which is inflicted on everyone. While CPUs are
> good in ignoring NOPs they still pollute the I-cache.
>
> 2) That results in tail call over-accounting which can be exploited.
>
> Disabling tail calls is not an option either and adding a 10 byte padding
> in front of every direct call is even worse in terms of text size and
> I-cache impact. We also could patch calls past the accounting in the
> function prologue but that becomes a nightmare vs. ENDBR.
>
> As IBRS is a performance horror show, Peter Zijstra and me revisited the
> call depth tracking approach and implemented it in a way which is hopefully
> more palatable and avoids the downsides of the original attempt.
>
> We both unsurprisingly hate the result with a passion...
>
> The way we approached this is:
>
> 1) objtool creates a list of function entry points and a list of direct
> call sites into new sections which can be discarded after init.
>
> 2) On affected machines, use the new sections, allocate module memory
> and create a call thunk per function (16 bytes without
> debug/statistics). Then patch all direct calls to invoke the thunk,
> which does the call accounting and then jumps to the original call
> site.
>
> 3) Utilize the retbleed return thunk mechanism by making the jump
> target run-time configurable. Add the accounting counterpart and
> stuff RSB on underflow in that alternate implementation.
What happens to indirect calls?
The above would imply that they miss the function entry thunk, but
get the return one.
Won't this lead to mis-counting of the RSB?
I also thought that retpolines would trash the return stack?
Using a single retpoline thunk would pretty much ensure that
they are never correctly predicted from the BTB, but it only
gives a single BTB entry that needs 'setting up' to get mis-
prediction.
I'm also sure I managed to infer from a document of instruction
timings and architectures that some x86 cpu actually used the BTB
for normal conditional jumps?
Possibly to avoid passing the full %ip value all down the
cpu pipeline.
David
On Sun, Jul 17 2022 at 09:45, David Laight wrote:
> From: Thomas Gleixner
>>
>> 3) Utilize the retbleed return thunk mechanism by making the jump
>> target run-time configurable. Add the accounting counterpart and
>> stuff RSB on underflow in that alternate implementation.
>
> What happens to indirect calls?
> The above would imply that they miss the function entry thunk, but
> get the return one.
> Won't this lead to mis-counting of the RSB?
That is accounted for in the indirect call thunk. This mitigation requires
retpolines to be enabled.
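For illustration only (the label is made up and this is not necessarily the
literal layout of the series): the accounting can be folded into the
retpoline path which every indirect call already takes, e.g.:
__x86_indirect_call_thunk_rax:                          /* illustrative name */
        sarq    $5, PER_CPU_VAR(__x86_call_depth)       /* account the call */
        call    1f                                      /* regular retpoline below */
2:      pause                                           /* speculation trap */
        lfence
        jmp     2b
1:      mov     %rax, (%rsp)
        ret                                             /* effectively jumps to *%rax */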
> I also thought that retpolines would trash the return stack?
No. They prevent the CPU from misspeculating an indirect call due to a
mistrained BTB.
> Using a single retpoline thunk would pretty much ensure that
> they are never correctly predicted from the BTB, but it only
> gives a single BTB entry that needs 'setting up' to get mis-
> prediction.
BTB != RSB
The intra-function call in the retpoline of course adds an RSB entry
which points to the speculation trap, but that entry is popped immediately
afterwards by the return which goes to the called function.
But that does not prevent the RSB underflow problem. As I described, the
RSB is a stack with depth 16. A call pushes, a RET pops. So if speculation
runs ahead and has emptied the RSB while speculating down the RETs, then
the next speculated RET falls back to the other prediction mechanisms,
which is what the SKL-specific retbleed variant exploits via BHB
mistraining.
> I'm also sure I managed to infer from a document of instruction
> timings and architectures that some x86 cpu actually used the BTB
> for normal conditional jumps?
That's relevant to the problem at hand in which way?
Thanks,
tglx
From: Thomas Gleixner
> Sent: 17 July 2022 16:07
>
> On Sun, Jul 17 2022 at 09:45, David Laight wrote:
> > From: Thomas Gleixner
> >>
> >> 3) Utilize the retbleed return thunk mechanism by making the jump
> >> target run-time configurable. Add the accounting counterpart and
> >> stuff RSB on underflow in that alternate implementation.
> >
> > What happens to indirect calls?
> > The above would imply that they miss the function entry thunk, but
> > get the return one.
> > Won't this lead to mis-counting of the RSB?
>
> That's accounted in the indirect call thunk. This mitigation requires
> retpolines enabled.
Thanks, that wasn't in the summary.
> > I also thought that retpolines would trash the return stack?
>
> No. They prevent that the CPU misspeculates an indirect call due to a
> mistrained BTB.
>
> > Using a single retpoline thunk would pretty much ensure that
> > they are never correctly predicted from the BTB, but it only
> > gives a single BTB entry that needs 'setting up' to get mis-
> > prediction.
>
> BTB != RSB
I was thinking about what happens after the RSB has underflowed.
Which is when (I presume) the BTB based speculation happens.
> The intra function call in the retpoline is of course adding a RSB entry
> which points to the speculation trap, but that gets popped immediately
> after that by the return which goes to the called function.
I'm remembering the 'active' instructions in a retpoline being 'push; ret'.
Which is an RSB imbalance.
...
> > I'm also sure I managed to infer from a document of instruction
> > timings and architectures that some x86 cpu actually used the BTB
> > for normal conditional jumps?
>
> That's relevant to the problem at hand in which way?
The next problem :-)
David
On Sun, Jul 17 2022 at 17:56, David Laight wrote:
> From: Thomas Gleixner
>> On Sun, Jul 17 2022 at 09:45, David Laight wrote:
> I was thinking about what happens after the RSB has underflowed.
> Which is when (I presume) the BTB based speculation happens.
>
>> The intra function call in the retpoline is of course adding a RSB entry
>> which points to the speculation trap, but that gets popped immediately
>> after that by the return which goes to the called function.
>
> I'm remembering the 'active' instructions in a retpoline being 'push; ret'.
> Which is an RSB imbalance.
Looking at the code might help to remember correctly:
        call    1f
        speculation trap
1:      mov     %reg, (%rsp)
        ret
Thanks,
tglx
On Sun, Jul 17 2022 at 01:17, Thomas Gleixner wrote:
> The function alignment option does not work for that because it just
> guarantees that the next function entry is aligned, but the padding size
> depends on the position of the last instruction of the previous function
> which might be anything between 0 and padsize-1 obviously, which is not a
> good starting point to put 10 bytes of accounting code into it reliably.
>
> I hacked up GCC to emit such padding and from first experimentation it
> brings quite some performance back.
>
> IBRS stuff stuff(pad)
> sockperf 14 bytes: -23.76% -19.26% -14.31%
> sockperf 1472 bytes: -22.51% -18.40% -12.25%
> microbench: +37.20% +18.46% +15.47%
> hackbench: +21.24% +10.94% +10.12%
>
> For FIO I don't have numbers yet, but I expect FIO to get a significant
> gain too.
>
> From a quick survey it seems to have no impact for the case where the
> thunks are not used. But that really needs some deep investigation and
> there is a potential conflict with the clang CFI efforts.
>
> The kernel text size increases with a Debian config from 9.9M to 10.4M, so
> about 5%. If the thunk is not 16 byte aligned, the text size increase is
> about 3%, but it turned out that 16 byte aligned is slightly faster.
>
> The 16 byte function alignment turned out to be beneficial in general even
> without the thunks. Not much of an improvement, but measurable. We should
> revisit this independent of these horrors.
>
> The implementation falls back to the allocated thunks when padding is not
> available. I'll send out the GCC patch and the required kernel patch as a
> reply to this series after polishing it a bit.
Here it goes. GCC hackery first.
---
Subject: gcc: Add padding in front of function entry points
From: Thomas Gleixner <[email protected]>
Date: Fri, 15 Jul 2022 14:37:53 +0200
For testing purposes:
Add a 16 byte padding filled with int3 in front of each function entry
so the kernel can put call depth accounting into it.
Not-Signed-off-by: Thomas Gleixner <[email protected]>
---
gcc/config/i386/i386.cc | 11 +++++++++++
gcc/config/i386/i386.h | 7 +++++++
gcc/config/i386/i386.opt | 4 ++++
gcc/doc/invoke.texi | 6 ++++++
4 files changed, 28 insertions(+)
--- a/gcc/config/i386/i386.cc
+++ b/gcc/config/i386/i386.cc
@@ -6182,6 +6182,17 @@ ix86_code_end (void)
file_end_indicate_split_stack ();
}
+void
+x86_asm_output_function_prefix (FILE *asm_out_file,
+ const char *fnname ATTRIBUTE_UNUSED)
+{
+ if (flag_force_function_padding)
+ {
+ fprintf (asm_out_file, "\t.align 16\n");
+ fprintf (asm_out_file, "\t.skip 16,0xcc\n");
+ }
+}
+
/* Emit code for the SET_GOT patterns. */
const char *
--- a/gcc/config/i386/i386.h
+++ b/gcc/config/i386/i386.h
@@ -2860,6 +2860,13 @@ extern enum attr_cpu ix86_schedule;
#define LIBGCC2_UNWIND_ATTRIBUTE __attribute__((target ("no-mmx,no-sse")))
#endif
+#include <stdio.h>
+extern void
+x86_asm_output_function_prefix (FILE *asm_out_file,
+ const char *fnname ATTRIBUTE_UNUSED);
+#undef ASM_OUTPUT_FUNCTION_PREFIX
+#define ASM_OUTPUT_FUNCTION_PREFIX x86_asm_output_function_prefix
+
/*
Local variables:
version-control: t
--- a/gcc/config/i386/i386.opt
+++ b/gcc/config/i386/i386.opt
@@ -1064,6 +1064,10 @@ mindirect-branch=
Target RejectNegative Joined Enum(indirect_branch) Var(ix86_indirect_branch) Init(indirect_branch_keep)
Convert indirect call and jump to call and return thunks.
+mforce-function-padding
+Target Var(flag_force_function_padding) Init(0)
+Put a 16 byte padding area before each function
+
mfunction-return=
Target RejectNegative Joined Enum(indirect_branch) Var(ix86_function_return) Init(indirect_branch_keep)
Convert function return to call and return thunk.
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -1451,6 +1451,7 @@ See RS/6000 and PowerPC Options.
-mindirect-branch=@var{choice} -mfunction-return=@var{choice} @gol
-mindirect-branch-register -mharden-sls=@var{choice} @gol
-mindirect-branch-cs-prefix -mneeded -mno-direct-extern-access}
+-mforce-function-padding @gol
@emph{x86 Windows Options}
@gccoptlist{-mconsole -mcygwin -mno-cygwin -mdll @gol
@@ -32849,6 +32850,11 @@ Force all calls to functions to be indir
when using Intel Processor Trace where it generates more precise timing
information for function calls.
+@item -mforce-function-padding
+@opindex -mforce-function-padding
+Force a 16 byte padding area before each function which allows run-time
+code patching to put a special prologue before the function entry.
+
@item -mmanual-endbr
@opindex mmanual-endbr
Insert ENDBR instruction at function entry only via the @code{cf_check}
On Mon, Jul 18 2022 at 21:29, Thomas Gleixner wrote:
>> The implementation falls back to the allocated thunks when padding is not
>> available. I'll send out the GCC patch and the required kernel patch as a
>> reply to this series after polishing it a bit.
>
> Here it goes. GCC hackery first.
And the kernel counterpart.
---
Subject: x86/callthunks: Put thunks into compiler provided padding area
From: Thomas Gleixner <[email protected]>
Date: Fri, 15 Jul 2022 16:12:47 +0200
- NOT FOR INCLUSION -
Let the compiler add a 16 byte padding in front of each function entry
point and put the call depth accounting there. That avoids calling out
into the module area and reduces ITLB pressure.
Not-Signed-off-by: Thomas Gleixner <[email protected]>
---
arch/x86/Kconfig | 14 ++++++
arch/x86/Makefile | 4 +
arch/x86/kernel/callthunks.c | 99 ++++++++++++++++++++++++++++++++++++++++++-
3 files changed, 115 insertions(+), 2 deletions(-)
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -2440,6 +2440,9 @@ config CC_HAS_SLS
config CC_HAS_RETURN_THUNK
def_bool $(cc-option,-mfunction-return=thunk-extern)
+config CC_HAS_PADDING
+ def_bool $(cc-option,-mforce-function-padding)
+
config HAVE_CALL_THUNKS
def_bool y
depends on RETHUNK && OBJTOOL
@@ -2512,6 +2515,17 @@ config CALL_DEPTH_TRACKING
of this option is marginal as the call depth tracking is using
run-time generated call thunks and call patching.
+config CALL_THUNKS_IN_PADDING
+ bool "Put call depth into padding area before function"
+ depends on CALL_DEPTH_TRACKING && CC_HAS_PADDING
+ default n
+ help
+ Put the call depth accounting into a padding area before the
+ function entry. This padding area is generated by the
+ compiler. This increases text size by ~5%. For non affected
+ systems this space is unused. On affected SKL systems this
+ results in a significant performance gain.
+
config CALL_THUNKS_DEBUG
bool "Enable call thunks and call depth tracking debugging"
depends on CALL_DEPTH_TRACKING
--- a/arch/x86/Makefile
+++ b/arch/x86/Makefile
@@ -197,6 +197,10 @@ ifdef CONFIG_SLS
KBUILD_CFLAGS += -mharden-sls=all
endif
+ifdef CONFIG_CALL_THUNKS_IN_PADDING
+ KBUILD_CFLAGS += -mforce-function-padding
+endif
+
KBUILD_LDFLAGS += -m elf_$(UTS_MACHINE)
ifdef CONFIG_LTO_CLANG
--- a/arch/x86/kernel/callthunks.c
+++ b/arch/x86/kernel/callthunks.c
@@ -92,6 +92,7 @@ struct thunk_mem {
struct thunk_mem_area {
struct thunk_mem *tmem;
+ unsigned long *dests;
unsigned long start;
unsigned long nthunks;
};
@@ -181,6 +182,16 @@ static __init_or_module void callthunk_f
tmem->base + area->start * callthunk_desc.thunk_size,
area->start, area->nthunks);
+ /* Remove thunks in the padding area */
+ for (i = 0; area->dests && i < area->nthunks; i++) {
+ void *dest = (void *)area->dests[i];
+
+ if (!dest)
+ continue;
+ pr_info("Remove %px at index %u\n", dest, i);
+ btree_remove64(&call_thunks, (unsigned long)dest);
+ }
+
/* Jump starts right after the template */
thunk = tmem->base + area->start * callthunk_desc.thunk_size;
tp = thunk + callthunk_desc.template_size;
@@ -204,6 +215,7 @@ static __init_or_module void callthunk_f
size = area->nthunks * callthunk_desc.thunk_size;
text_poke_set_locked(thunk, 0xcc, size);
}
+ vfree(area->dests);
kfree(area);
}
@@ -289,7 +301,8 @@ patch_paravirt_call_sites(struct paravir
patch_call(p->instr, layout);
}
-static struct thunk_mem_area *callthunks_alloc(unsigned int nthunks)
+static struct thunk_mem_area *callthunks_alloc(unsigned int nthunks,
+ bool module)
{
struct thunk_mem_area *area;
unsigned int size, mapsize;
@@ -299,6 +312,13 @@ static struct thunk_mem_area *callthunks
if (!area)
return NULL;
+ if (module) {
+ area->dests = vzalloc(nthunks * sizeof(unsigned long));
+ if (!area->dests)
+ goto free_area;
+ pr_info("Allocated dests array: %px\n", area->dests);
+ }
+
list_for_each_entry(tmem, &thunk_mem_list, list) {
unsigned long start;
@@ -340,6 +360,7 @@ static struct thunk_mem_area *callthunks
free_tmem:
kfree(tmem);
free_area:
+ vfree(area->dests);
kfree(area);
return NULL;
}
@@ -372,6 +393,73 @@ static __init_or_module int callthunk_se
return 0;
}
+int setup_padding_thunks(s32 *start, s32 *end, struct thunk_mem_area *area,
+ struct module_layout *layout)
+{
+ int nthunks = 0, idx = 0;
+ s32 *s;
+
+ if (callthunk_desc.template_size > 16)
+ return 0;
+
+ for (s = start; s < end; s++) {
+ void *thunk, *tp, *dest = (void *)s + *s;
+ unsigned long key = (unsigned long)dest;
+ int fail, i;
+ u8 opcode;
+
+ if (is_inittext(layout, dest)) {
+ prdbg("Ignoring init dest: %pS %px\n", dest, dest);
+ return 0;
+ }
+
+ /* Multiple symbols can have the same location. */
+ if (btree_lookup64(&call_thunks, key)) {
+ prdbg("Ignoring duplicate dest: %pS %px\n", dest, dest);
+ continue;
+ }
+
+ thunk = tp = dest - 16;
+ prdbg("Probing dest: %pS %px at %px\n", dest, dest, tp);
+ pagefault_disable();
+ fail = 0;
+ for (i = 0; !fail && i < 16; i++) {
+ if (get_kernel_nofault(opcode, tp + i)) {
+ fail = 1;
+ } else if (opcode != 0xcc) {
+ fail = 2;
+ }
+ }
+ pagefault_enable();
+ switch (fail) {
+ case 1:
+ prdbg("Faulted for dest: %pS %px\n", dest, dest);
+ nthunks++;
+ continue;
+ case 2:
+ prdbg("No padding for dest: %pS %px\n", dest, dest);
+ nthunks++;
+ continue;
+ }
+
+ prdbg("Thunk for dest: %pS %px at %px\n", dest, dest, tp);
+ memcpy(tp, callthunk_desc.template, callthunk_desc.template_size);
+ tp += callthunk_desc.template_size;
+ memcpy(tp, x86_nops[6], 6);
+
+ if (area->dests) {
+ pr_info("Insert %px at index %d\n", dest, idx);
+ area->dests[idx++] = key;
+ }
+
+ fail = btree_insert64(&call_thunks, key, (void *)thunk, GFP_KERNEL);
+ if (fail)
+ return fail;
+ }
+ prdbg("%d external thunks required\n", nthunks);
+ return 0;
+}
+
static __init_or_module int callthunks_setup(struct callthunk_sites *cs,
struct module_layout *layout)
{
@@ -394,7 +482,7 @@ static __init_or_module int callthunks_s
if (!nthunks)
goto patch;
- area = callthunks_alloc(nthunks);
+ area = callthunks_alloc(nthunks, !!layout->mtn.mod);
if (!area)
return -ENOMEM;
@@ -420,6 +508,13 @@ static __init_or_module int callthunks_s
prdbg("Using thunk vbuf %px\n", vbuf);
}
+ if (IS_ENABLED(CONFIG_CALL_THUNKS_IN_PADDING)) {
+ ret = setup_padding_thunks(cs->syms_start, cs->syms_end,
+ area, layout);
+ if (ret < 0)
+ goto fail;
+ }
+
for (s = cs->syms_start; s < cs->syms_end; s++) {
void *dest = (void *)s + *s;
On Mon, Jul 18, 2022 at 12:30 PM Thomas Gleixner <[email protected]> wrote:
>
> Let the compiler add a 16 byte padding in front of each function entry
> point and put the call depth accounting there. That avoids calling out
> into the module area and reduces ITLB pressure.
Ooh.
I actually like this a lot better.
Could we just say "use this instead if you have SKL and care about the issue?"
I don't hate your module thunk trick, but this does seem *so* much
simpler, and if it performs better anyway, it really does seem like
the better approach.
And people and distros who care would have an easy time adding that
simple compiler patch instead.
I do think that for generality, the "-mforce-function-padding" option
should perhaps take as an argument how much padding (and how much
alignment) to force:
-mforce-function-padding=5:16
would force 5 bytes of minimum padding, and align functions to 16
bytes. It should be easy to generate (no more complexity than your
current one) by just making the output do
        .skip 5,0xcc
        .p2align 4,0xcc
and now you can specify that you only need X bytes of padding, for example.
Linus
On Sun, Jul 17 2022 at 01:17, Thomas Gleixner wrote:
> For 4 RET pathes randomized with randomize_kstack_offset=y and RSP bit 3, 6, 5:
>
> IBRS stuff stuff(pad) confuse
> microbench: +37.20% +18.46% +15.47% +7.46%
> sockperf 14 bytes: -23.76% -19.26% -14.31% -16.80%
> sockperf 1472 bytes: -22.51% -18.40% -12.25% -15.95%
>
> So for the more randomized variant sockperf tanks and is already slower
> than stuffing with thunks in the compiler provided padding space.
>
> I send out a patch in reply to this series which implements that variant,
> but there needs to be input from the security researchers how protective
> this is. If we could get away with 2 RET pathes (perhaps multiple instances
> with different bits), that would be amazing.
Here it goes.
---
Subject: x86/retbleed: Add confusion mitigation
From: Thomas Gleixner <[email protected]>
Date: Fri, 15 Jul 2022 11:41:05 +0200
- NOT FOR INCLUSION -
Experimental option to confuse the return path by randomization.
The following command line options enable this:
retbleed=confuse      4 return paths
retbleed=confuse,4    4 return paths
retbleed=confuse,3    3 return paths
retbleed=confuse,2    2 return paths
This needs scrutiny by security researchers.
Not-Signed-off-by: Thomas Gleixner <[email protected]>
---
arch/x86/Kconfig | 12 ++++++
arch/x86/include/asm/nospec-branch.h | 23 +++++++++++
arch/x86/kernel/cpu/bugs.c | 41 +++++++++++++++++++++
arch/x86/lib/retpoline.S | 68 +++++++++++++++++++++++++++++++++++
include/linux/randomize_kstack.h | 6 +++
kernel/entry/common.c | 3 +
6 files changed, 153 insertions(+)
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -2538,6 +2538,18 @@ config CALL_THUNKS_DEBUG
Only enable this, when you are debugging call thunks as this
creates a noticable runtime overhead. If unsure say N.
+config RETURN_CONFUSION
+ bool "Mitigate RSB underflow with return confusion"
+ depends on CPU_SUP_INTEL && RETHUNK && RANDOMIZE_KSTACK_OFFSET
+ default y
+ help
+ Compile the kernel with return path confusion to mitigate the
+ Intel SKL Return-Speculation-Buffer (RSB) underflow issue. The
+ mitigation is off by default and needs to be enabled on the
+ kernel command line via the retbleed=confuse option. For
+ non-affected systems the overhead of this option is marginal as
+ the return thunk jumps are patched to direct ret instructions.
+
config CPU_IBPB_ENTRY
bool "Enable IBPB on kernel entry"
depends on CPU_SUP_AMD
--- a/arch/x86/include/asm/nospec-branch.h
+++ b/arch/x86/include/asm/nospec-branch.h
@@ -312,6 +312,29 @@ static inline void x86_set_skl_return_th
#endif
+#ifdef CONFIG_RETURN_CONFUSION
+extern void __x86_return_confused_skl2(void);
+extern void __x86_return_confused_skl3(void);
+extern void __x86_return_confused_skl4(void);
+
+static inline void x86_set_skl_confused_return_thunk(int which)
+{
+ switch (which) {
+ case 2:
+ x86_return_thunk = &__x86_return_confused_skl2;
+ break;
+ case 3:
+ x86_return_thunk = &__x86_return_confused_skl3;
+ break;
+ case 4:
+ x86_return_thunk = &__x86_return_confused_skl4;
+ break;
+ }
+}
+#else
+static inline void x86_set_skl_confused_return_thunk(int which) { }
+#endif
+
#ifdef CONFIG_RETPOLINE
#define GEN(reg) \
--- a/arch/x86/kernel/cpu/bugs.c
+++ b/arch/x86/kernel/cpu/bugs.c
@@ -14,6 +14,7 @@
#include <linux/module.h>
#include <linux/nospec.h>
#include <linux/prctl.h>
+#include <linux/randomize_kstack.h>
#include <linux/sched/smt.h>
#include <linux/pgtable.h>
#include <linux/bpf.h>
@@ -785,6 +786,7 @@ enum retbleed_mitigation {
RETBLEED_MITIGATION_IBRS,
RETBLEED_MITIGATION_EIBRS,
RETBLEED_MITIGATION_STUFF,
+ RETBLEED_MITIGATION_CONFUSE,
};
enum retbleed_mitigation_cmd {
@@ -793,6 +795,7 @@ enum retbleed_mitigation_cmd {
RETBLEED_CMD_UNRET,
RETBLEED_CMD_IBPB,
RETBLEED_CMD_STUFF,
+ RETBLEED_CMD_CONFUSE,
};
const char * const retbleed_strings[] = {
@@ -802,6 +805,7 @@ const char * const retbleed_strings[] =
[RETBLEED_MITIGATION_IBRS] = "Mitigation: IBRS",
[RETBLEED_MITIGATION_EIBRS] = "Mitigation: Enhanced IBRS",
[RETBLEED_MITIGATION_STUFF] = "Mitigation: Stuffing",
+ [RETBLEED_MITIGATION_CONFUSE] = "Mitigation: Return confusion",
};
static enum retbleed_mitigation retbleed_mitigation __ro_after_init =
@@ -810,6 +814,7 @@ static enum retbleed_mitigation_cmd retb
RETBLEED_CMD_AUTO;
static int __ro_after_init retbleed_nosmt = false;
+static int __ro_after_init rethunk_confuse_skl = 4;
static int __init retbleed_parse_cmdline(char *str)
{
@@ -833,8 +838,19 @@ static int __init retbleed_parse_cmdline
retbleed_cmd = RETBLEED_CMD_IBPB;
} else if (!strcmp(str, "stuff")) {
retbleed_cmd = RETBLEED_CMD_STUFF;
+ } else if (!strcmp(str, "confuse")) {
+ retbleed_cmd = RETBLEED_CMD_CONFUSE;
} else if (!strcmp(str, "nosmt")) {
retbleed_nosmt = true;
+ } else if (retbleed_cmd == RETBLEED_CMD_CONFUSE &&
+ !kstrtouint(str, 10, &rethunk_confuse_skl)) {
+
+ if (rethunk_confuse_skl < 2 ||
+ rethunk_confuse_skl > 4) {
+ pr_err("Ignoring out-of-bound stuff count (%d).",
+ rethunk_confuse_skl);
+ rethunk_confuse_skl = 4;
+ }
} else {
pr_err("Ignoring unknown retbleed option (%s).", str);
}
@@ -896,6 +912,25 @@ static void __init retbleed_select_mitig
}
break;
+ case RETBLEED_CMD_CONFUSE:
+ if (IS_ENABLED(CONFIG_RETURN_CONFUSION) &&
+ spectre_v2_enabled == SPECTRE_V2_RETPOLINE &&
+ random_kstack_offset_enabled()) {
+ retbleed_mitigation = RETBLEED_MITIGATION_CONFUSE;
+ } else {
+ if (IS_ENABLED(CONFIG_RETURN_CONFUSION) &&
+ random_kstack_offset_enabled())
+ pr_err("WARNING: retbleed=confuse depends on randomize_kstack_offset=y\n");
+ else if (IS_ENABLED(CONFIG_RETURN_CONFUSION) &&
+ spectre_v2_enabled != SPECTRE_V2_RETPOLINE)
+ pr_err("WARNING: retbleed=confuse depends on spectre_v2=retpoline\n");
+ else
+ pr_err("WARNING: kernel not compiled with RETURN_CONFUSION.\n");
+
+ goto do_cmd_auto;
+ }
+ break;
+
do_cmd_auto:
case RETBLEED_CMD_AUTO:
default:
@@ -939,6 +974,11 @@ static void __init retbleed_select_mitig
x86_set_skl_return_thunk();
break;
+ case RETBLEED_MITIGATION_CONFUSE:
+ setup_force_cpu_cap(X86_FEATURE_RETHUNK);
+ x86_set_skl_confused_return_thunk(rethunk_confuse_skl);
+ break;
+
default:
break;
}
@@ -1389,6 +1429,7 @@ static void __init spectre_v2_select_mit
boot_cpu_has_bug(X86_BUG_RETBLEED) &&
retbleed_cmd != RETBLEED_CMD_OFF &&
retbleed_cmd != RETBLEED_CMD_STUFF &&
+ retbleed_cmd != RETBLEED_CMD_CONFUSE &&
boot_cpu_has(X86_FEATURE_IBRS) &&
boot_cpu_data.x86_vendor == X86_VENDOR_INTEL) {
mode = SPECTRE_V2_IBRS;
--- a/arch/x86/lib/retpoline.S
+++ b/arch/x86/lib/retpoline.S
@@ -230,3 +230,71 @@ SYM_FUNC_START(__x86_return_skl)
SYM_FUNC_END(__x86_return_skl)
#endif /* CONFIG_CALL_DEPTH_TRACKING */
+
+#ifdef CONFIG_RETURN_CONFUSION
+ .align 64
+SYM_FUNC_START(__x86_return_confused_skl4)
+ ANNOTATE_NOENDBR
+ testq $3, %rsp
+ jz 1f
+
+ ANNOTATE_UNRET_SAFE
+ ret
+ int3
+1:
+ testq $6, %rsp
+ jz 2f
+ ANNOTATE_UNRET_SAFE
+ ret
+ int3
+
+2:
+ testq $5, %rsp
+ jz 3f
+ ANNOTATE_UNRET_SAFE
+ ret
+ int3
+3:
+ ANNOTATE_UNRET_SAFE
+ ret
+ int3
+SYM_FUNC_END(__x86_return_confused_skl4)
+
+ .align 64
+SYM_FUNC_START(__x86_return_confused_skl3)
+ ANNOTATE_NOENDBR
+ testq $3, %rsp
+ jz 1f
+
+ ANNOTATE_UNRET_SAFE
+ ret
+ int3
+1:
+ testq $6, %rsp
+ jz 2f
+ ANNOTATE_UNRET_SAFE
+ ret
+ int3
+
+2:
+ ANNOTATE_UNRET_SAFE
+ ret
+ int3
+SYM_FUNC_END(__x86_return_confused_skl3)
+
+ .align 64
+SYM_FUNC_START(__x86_return_confused_skl2)
+ ANNOTATE_NOENDBR
+ testq $3, %rsp
+ jz 1f
+
+ ANNOTATE_UNRET_SAFE
+ ret
+ int3
+1:
+ ANNOTATE_UNRET_SAFE
+ ret
+ int3
+SYM_FUNC_END(__x86_return_confused_skl2)
+
+#endif /* CONFIG_RETURN_CONFUSION */
--- a/include/linux/randomize_kstack.h
+++ b/include/linux/randomize_kstack.h
@@ -84,9 +84,15 @@ DECLARE_PER_CPU(u32, kstack_offset);
raw_cpu_write(kstack_offset, offset); \
} \
} while (0)
+
+#define random_kstack_offset_enabled() \
+ static_branch_maybe(CONFIG_RANDOMIZE_KSTACK_OFFSET_DEFAULT, \
+ &randomize_kstack_offset)
+
#else /* CONFIG_RANDOMIZE_KSTACK_OFFSET */
#define add_random_kstack_offset() do { } while (0)
#define choose_random_kstack_offset(rand) do { } while (0)
+#define random_kstack_offset_enabled() false
#endif /* CONFIG_RANDOMIZE_KSTACK_OFFSET */
#endif
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -298,6 +298,7 @@ void syscall_exit_to_user_mode_work(stru
noinstr void irqentry_enter_from_user_mode(struct pt_regs *regs)
{
+ add_random_kstack_offset();
__enter_from_user_mode(regs);
}
@@ -444,6 +445,8 @@ irqentry_state_t noinstr irqentry_nmi_en
{
irqentry_state_t irq_state;
+ if (user_mode(regs))
+ add_random_kstack_offset();
irq_state.lockdep = lockdep_hardirqs_enabled();
__nmi_enter();
On Mon, Jul 18 2022 at 12:51, Linus Torvalds wrote:
> On Mon, Jul 18, 2022 at 12:30 PM Thomas Gleixner <[email protected]> wrote:
>>
>> Let the compiler add a 16 byte padding in front of each function entry
>> point and put the call depth accounting there. That avoids calling out
>> into the module area and reduces ITLB pressure.
>
> Ooh.
>
> I actually like this a lot better.
>
> Could we just say "use this instead if you have SKL and care about the issue?"
>
> I don't hate your module thunk trick, but this does seem *so* much
> simpler, and if it performs better anyway, it really does seem like
> the better approach.
Yes, Peter and I started out from avoiding a new compiler and the overhead
for everyone that comes with putting the padding into the code. Only when
staring at the perf data did we realize that this padding in front of the function
might be an acceptable solution. I did some more tests today on different
machines with mitigations=off with kernels compiled with and without
that padding. I couldn't find a single test case where the result was
outside of the usual noise. But then my tests are definitely incomplete.
> And people and distros who care would have an easy time adding that
> simple compiler patch instead.
>
> I do think that for generality, the "-mforce-function-padding" option
> should perhaps take as an argument how much padding (and how much
> alignment) to force:
>
> -mforce-function-padding=5:16
>
> would force 5 bytes of minimum padding, and align functions to 16
> bytes. It should be easy to generate (no more complexity than your
> current one) by just making the output do
>
> .skip 5,0xcc
> .palign 4,0xcc
>
> and now you can specify that you only need X bytes of padding, for example.
Yes, I know. But it was horrible enough to find the right spot in that
gcc maze. Then I was happy that I figured out how to add the boolean
option. I'll let the real compiler people take care of the rest. HJL???
And we need input from the Clang folks because their CFI work also puts
stuff in front of the function entry, which nicely collides.
Thanks,
tglx
On Mon, Jul 18, 2022 at 1:44 PM Thomas Gleixner <[email protected]> wrote:
>
> Yes, Peter and I came from avoiding a new compiler and the overhead for
> everyone when putting the padding into the code. We realized only when
> staring at the perf data that this padding in front of the function
> might be an acceptable solution. I did some more tests today on different
> machines with mitigations=off with kernels compiled with and without
> that padding. I couldn't find a single test case where the result was
> outside of the usual noise. But then my tests are definitely incomplete.
Well, it sounds like it most definitely isn't a huge and obvious problem.
> Yes, I know. But it was horrible enough to find the right spot in that
> gcc maze. Then I was happy that I figured how to add the boolean
> option. I let real compiler people take care of the rest. HJL???
>
> And we need input from the Clang folks because their CFI work also puts
> stuff in front of the function entry, which nicely collides.
Yeah, looking at the gcc sources (I have them locally because it helps
with the gcc bug reports I've done over the years), that
ASM_OUTPUT_FUNCTION_PREFIX is very convenient, but it's too late to do
any inter-function alignment for, because it's already after the usual
function-alignment output.
So I guess the padding thing is largely tied together with alignment
of the function start, so that idea of having different padding and
alignment bytes doesn't work that well.
At least not in that ASM_OUTPUT_FUNCTION_PREFIX model, which is how
the gcc patch ends up being so small.
Linus
On Mon, Jul 18, 2022 at 10:44:14PM +0200, Thomas Gleixner wrote:
> > I do think that for generality, the "-mforce-function-padding" option
> > should perhaps take as an argument how much padding (and how much
> > alignment) to force:
> >
> > -mforce-function-padding=5:16
> >
> > would force 5 bytes of minimum padding, and align functions to 16
> > bytes. It should be easy to generate (no more complexity than your
> > current one) by just making the output do
> >
> > .skip 5,0xcc
> > .palign 4,0xcc
> >
> > and now you can specify that you only need X bytes of padding, for example.
>
> Yes, I know. But it was horrible enough to find the right spot in that
> gcc maze. Then I was happy that I figured how to add the boolean
> option. I let real compiler people take care of the rest. HJL???
>
> And we need input from the Clang folks because their CFI work also puts
> stuff in front of the function entry, which nicely collides.
Right, I need to go look at the latest kCFI patches, that sorta got
side-tracked for working on all the retbleed muck :/
Basically kCFI wants to preface every (indirect callable) function with:
__cfi_\func:
        int3
        movl    $0x12345678, %rax
        int3
        int3
\func:
        endbr
\func_direct:
Ofc, we can still put the whole:
sarq $5, PER_CPU_VAR(__x86_call_depth);
jmp \func_direct
thing in front of that. But it does somewhat destroy the version I had
that only needs the 10 bytes padding for the sarq.
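To illustrate the difference (a sketch with made-up labels, combining the
snippets above): without the kCFI preamble the accounting can sit directly
in the padding and fall through into the function, while with the preamble
in between it needs a jump over it:
        /* padding-only variant: just the ~10 byte sarq, falls through */
        sarq    $5, PER_CPU_VAR(__x86_call_depth)
func:
        ...
        /* with the kCFI preamble, a jmp over it is required */
        sarq    $5, PER_CPU_VAR(__x86_call_depth)
        jmp     func
__cfi_func:
        int3
        movl    $0x12345678, %rax
        int3
        int3
func:
        endbr
        ...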
On Mon, Jul 18, 2022 at 02:01:43PM -0700, Linus Torvalds wrote:
> On Mon, Jul 18, 2022 at 1:44 PM Thomas Gleixner <[email protected]> wrote:
> >
> > Yes, Peter and I came from avoiding a new compiler and the overhead for
> > everyone when putting the padding into the code. We realized only when
> > staring at the perf data that this padding in front of the function
> > might be an acceptable solution. I did some more tests today on different
> > machines with mitigations=off with kernels compiled with and without
> > that padding. I couldn't find a single test case where the result was
> > outside of the usual noise. But then my tests are definitely incomplete.
>
> Well, it sounds like it most definitely isn't a huge and obvious problem.
>
> > Yes, I know. But it was horrible enough to find the right spot in that
> > gcc maze. Then I was happy that I figured how to add the boolean
> > option. I let real compiler people take care of the rest. HJL???
> >
> > And we need input from the Clang folks because their CFI work also puts
> > stuff in front of the function entry, which nicely collides.
>
> Yeah, looking at the gcc sources (I have them locally because it helps
> with the gcc bug reports I've done over the years), that
> ASM_OUTPUT_FUNCTION_PREFIX is very convenient, but it's too late to do
> any inter-function alignment for, because it's already after the usual
> function-alignment output.
>
> So I guess the padding thing is largely tied together with alignment
> of the function start, so that idea of having different padding and
> alignment bytes doesn't workl that well.
>
> At least not in that ASM_OUTPUT_FUNCTION_PREFIX model, which is how
> the gcc patch ends up being so small.
FWIW, when I was poking at this last week, I found that -falign-functions
only seems to apply to the normal .text section and not to random other
sections with text we create.
Or rather, I was seeing a lot of unaligned functions that all had custom
sections despite explicitly using the (what I thought was a global)
function alignment toggle.
On Mon, Jul 18 2022 at 23:18, Peter Zijlstra wrote:
> On Mon, Jul 18, 2022 at 10:44:14PM +0200, Thomas Gleixner wrote:
>> And we need input from the Clang folks because their CFI work also puts
>> stuff in front of the function entry, which nicely collides.
>
> Right, I need to go look at the latest kCFI patches, that sorta got
> side-tracked for working on all the retbleed muck :/
>
> Basically kCFI wants to preface every (indirect callable) function with:
>
> __cfi_\func:
> int3
> movl $0x12345678, %rax
> int3
> int3
> \func:
> endbr
> \func_direct:
>
> Ofc, we can still put the whole:
>
> sarq $5, PER_CPU_VAR(__x86_call_depth);
> jmp \func_direct
>
> thing in front of that. But it does somewhat destroy the version I had
> that only needs the 10 bytes padding for the sarq.
Right, because it needs the jump. I was just chatting with Joao about
that over IRC.
The jump slows things down. Joao has ideas and will reply soonish.
Thanks,
tglx
On Mon, Jul 18, 2022 at 2:43 PM Peter Zijlstra <[email protected]> wrote:
>
> FWIW, when I was poking at this last week, I found that -falign-function
> only seems to apply to the normal .text section and not to random other
> sections with text we create.
>
> Or rather, I was seeind a lot of unaligned functions that all had custom
> sections despite explicitly using the (what I thought was a global)
> function alignment toggle.
Hmm. This triggers a memory..
I think we may have two different issues at play.
One is that I think our linker script only aligns code sections to 8
bytes by default. Grep for ALIGN_FUNCTION.
And I think that any .align directive (or .p2align) only aligns
relative to that section, so if the section itself wasn't aligned, it
doesn't help to have some alignment within the section.
I may be wrong.
But I can definitely see gcc not aligning functions too, and doing a
nm vmlinux | grep ' t ' | grep -v '0 t ' | grep -v '\.cold$' | sort
shows a _lot_ of them for me.
I think the main cause is that the ACPI code builds with
ccflags-y := -Os -D_LINUX -DBUILDING_ACPICA
and that '-Os' will disable all function alignment. I think there's a
few other places that do that too.
I don't see the same effect in my clang build, so I think that -Os
behavior is likely gcc-specific.
In my clang build, I do see a few unaligned function symbols, but they
seem to be all our own assembler ones (eg "nested_nmi") and they seem
to be intentional (ie that "nested_nmi" thing is in the middle of the
"asm_exc_nmi" function, which is the real thing and which _is_
aligned).
Linus
On Mon, Jul 18, 2022 at 2:18 PM Peter Zijlstra <[email protected]> wrote:
>
> On Mon, Jul 18, 2022 at 10:44:14PM +0200, Thomas Gleixner wrote:
> > And we need input from the Clang folks because their CFI work also puts
> > stuff in front of the function entry, which nicely collides.
>
> Right, I need to go look at the latest kCFI patches, that sorta got
> side-tracked for working on all the retbleed muck :/
>
> Basically kCFI wants to preface every (indirect callable) function with:
>
> __cfi_\func:
> int3
> movl $0x12345678, %rax
> int3
> int3
> \func:
Yes, and in order to avoid scattering the code with call target
gadgets, the preamble should remain immediately before the function.
> Ofc, we can still put the whole:
>
> sarq $5, PER_CPU_VAR(__x86_call_depth);
> jmp \func_direct
>
> thing in front of that.
Sure, that would work.
> But it does somewhat destroy the version I had that only needs the
> 10 bytes padding for the sarq.
There's also the question of how function alignment should work in the
KCFI case. Currently, the __cfi_ preamble is 16-byte aligned, which
obviously means the function itself isn't.
Sami
On 2022-07-18 15:22, Thomas Gleixner wrote:
> On Mon, Jul 18 2022 at 23:18, Peter Zijlstra wrote:
>> On Mon, Jul 18, 2022 at 10:44:14PM +0200, Thomas Gleixner wrote:
>>> And we need input from the Clang folks because their CFI work also
>>> puts
>>> stuff in front of the function entry, which nicely collides.
>>
>> Right, I need to go look at the latest kCFI patches, that sorta got
>> side-tracked for working on all the retbleed muck :/
>>
>> Basically kCFI wants to preface every (indirect callable) function
>> with:
>>
>> __cfi_\func:
>> int3
>> movl $0x12345678, %rax
>> int3
>> int3
>> \func:
>> endbr
>> \func_direct:
>>
>> Ofc, we can still put the whole:
>>
>> sarq $5, PER_CPU_VAR(__x86_call_depth);
>> jmp \func_direct
>>
>> thing in front of that. But it does somewhat destroy the version I had
>> that only needs the 10 bytes padding for the sarq.
>
> Right, because it needs the jump. I was just chatting with Jaoa about
> that over IRC.
>
> The jump slow things down. Jaoa has ideas and will reply soonish.
So, IIRC, kCFI will do something like this to validate call targets
based on the hash as described on Peter's e-mail:
func_whatever:
        ...
        cmpl    $0x\hash, -6(%rax)
        je      1f
        ud2
1:
        call    *%rax
        ...
Thus the hash will be 6 bytes before the function entry point. Then we
can get the compiler to emit a padding area before the __cfi_\func
snippet and, during boot, if the CPU needs the call depth tracking
mitigation, we:
- move the __cfi_func into the padding area
- patch the call depth tracking snippet ahead of it (overwriting the old
__cfi_\func:)
- fix the cmpl offset in the caller
func_whatever:
        ...
        cmpl    $0x\hash, -FIXED_OFFSET(%rax)
        je      1f
        ud2
1:
        call    *%rax
        ...
This approach is very similar to what we discussed in the past for
replacing kCFI with FineIBT if CET is available. Also, it would avoid the
need for any jump and would keep the additional padding area at 10 bytes.
Tks,
Joao
On Mon, Jul 18, 2022 at 3:47 PM Joao Moreira <[email protected]> wrote:
> Thus the hash will be 6 bytes before the function entry point. Then we
> can get the compiler to emit a padding area before the __cfi_\func
> snippet and, during boot, if the CPU needs the call depth tracking
> mitigation, we:
> - move the __cfi_func into the padding area
> - patch the call depth tracking snippet ahead of it (overwriting the old
> __cfi_\func:)
> - fix the cmpl offset in the caller
>
> func_whatever:
> ...
> cmpl $0x\hash, -FIXED_OFFSET(%rax)
> je 1f
> ud2
> 1:
> call *%rax
> ...
The problem with this is that the cmpl instruction contains the full
type hash, which means that any instruction that's FIXED_OFFSET from
the cmpl is a valid indirect call target as far as KCFI is concerned.
-6 was chosen specifically to make the ud2 the only possible target.
Sami
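The collision can be seen directly in the caller bytes quoted elsewhere in
this thread (cmpl $0x12345678, -6(%r11) encodes as 41 81 7b fa 78 56 34 12).
A small standalone sketch that scans those bytes for the hash, the way one
would look for a stray marker:

#include <stdint.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
        const uint8_t caller[] = {
                0x41, 0x81, 0x7b, 0xfa, 0x78, 0x56, 0x34, 0x12, /* cmpl $hash, -6(%r11) */
                0x74, 0x02,                                     /* je 1f                */
                0x0f, 0x0b,                                     /* ud2                  */
        };
        const uint8_t hash_le[] = { 0x78, 0x56, 0x34, 0x12 };
        size_t i;

        for (i = 0; i + sizeof(hash_le) <= sizeof(caller); i++) {
                if (memcmp(caller + i, hash_le, sizeof(hash_le)))
                        continue;
                /* a "hash at target-6" check accepts target = i + 6 */
                printf("hash at offset %zu, implied target at offset %zu\n",
                       i, i + 6);
        }
        return 0;
}

With the -6 offset the implied target is offset 10, which is the ud2; that is
why that particular offset is safe.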
On Mon, Jul 18 2022 at 15:48, Sami Tolvanen wrote:
> On Mon, Jul 18, 2022 at 2:18 PM Peter Zijlstra <[email protected]> wrote:
>>
>> On Mon, Jul 18, 2022 at 10:44:14PM +0200, Thomas Gleixner wrote:
>> > And we need input from the Clang folks because their CFI work also puts
>> > stuff in front of the function entry, which nicely collides.
>>
>> Right, I need to go look at the latest kCFI patches, that sorta got
>> side-tracked for working on all the retbleed muck :/
>>
>> Basically kCFI wants to preface every (indirect callable) function with:
>>
>> __cfi_\func:
>> int3
>> movl $0x12345678, %rax
>> int3
>> int3
>> \func:
>
> Yes, and in order to avoid scattering the code with call target
> gadgets, the preamble should remain immediately before the function.
>
>> Ofc, we can still put the whole:
>>
>> sarq $5, PER_CPU_VAR(__x86_call_depth);
>> jmp \func_direct
>>
>> thing in front of that.
>
> Sure, that would work.
>
>> But it does somewhat destroy the version I had that only needs the
>> 10 bytes padding for the sarq.
>
> There's also the question of how function alignment should work in the
> KCFI case. Currently, the __cfi_ preamble is 16-byte aligned, which
> obviously means the function itself isn't.
That's bad. The function entry should be 16-byte aligned, and as I just
learned, for AMD the ideal alignment would possibly be 32 bytes as that's
their I-fetch width. But my experiments show that 16-byte alignment,
independent of the padding muck, is beneficial for both AMD and Intel
over the 4-byte alignment we have right now.
This really needs a lot of thought and performance analysis before we
commit to anything here. Peter's and my investigations have shown how
sensitive this is.
We can't just add stuff without taking the whole picture into account
(independent of the proposed padding muck).
Thanks,
tglx
> The problem with this is that the cmpl instruction contains the full
> type hash, which means that any instruction that's FIXED_OFFSET from
> the cmpl is a valid indirect call target as far as KCFI is concerned.
> -6 was chosen specifically to make the ud2 the only possible target.
Ugh. The bitter truth. I'll think a bit further.
On Mon, Jul 18, 2022 at 3:59 PM Thomas Gleixner <[email protected]> wrote:
>
> On Mon, Jul 18 2022 at 15:48, Sami Tolvanen wrote:
> > On Mon, Jul 18, 2022 at 2:18 PM Peter Zijlstra <[email protected]> wrote:
> >>
> >> On Mon, Jul 18, 2022 at 10:44:14PM +0200, Thomas Gleixner wrote:
> >> > And we need input from the Clang folks because their CFI work also puts
> >> > stuff in front of the function entry, which nicely collides.
> >>
> >> Right, I need to go look at the latest kCFI patches, that sorta got
> >> side-tracked for working on all the retbleed muck :/
> >>
> >> Basically kCFI wants to preface every (indirect callable) function with:
> >>
> >> __cfi_\func:
> >> int3
> >> movl $0x12345678, %rax
> >> int3
> >> int3
> >> \func:
> >
> > Yes, and in order to avoid scattering the code with call target
> > gadgets, the preamble should remain immediately before the function.
> >
> >> Ofc, we can still put the whole:
> >>
> >> sarq $5, PER_CPU_VAR(__x86_call_depth);
> >> jmp \func_direct
> >>
> >> thing in front of that.
> >
> > Sure, that would work.
> >
> >> But it does somewhat destroy the version I had that only needs the
> >> 10 bytes padding for the sarq.
> >
> > There's also the question of how function alignment should work in the
> > KCFI case. Currently, the __cfi_ preamble is 16-byte aligned, which
> > obviously means the function itself isn't.
>
> That's bad. The function entry should be 16-byte aligned, and as I just
> learned, for AMD the ideal alignment would possibly be 32 bytes as that's
> their I-fetch width. But my experiments show that 16-byte alignment,
> independent of the padding muck, is beneficial for both AMD and Intel
> over the 4-byte alignment we have right now.
OK, that's what I thought. KCFI hasn't landed in Clang yet, so it
shouldn't be a problem to fix this.
Sami
On Mon, Jul 18, 2022 at 3:59 PM Thomas Gleixner <[email protected]> wrote:
>
> That's bad. The function entry should be 16 byte aligned and as I just
> learned for AMD the ideal alignment would be possibly 32 byte as that's
> their I-fetch width.
In general, the L1 cache line size is likely the only "ideal" alignment.
Even if (I think) intel fetches just 16 bytes per cycle from the L1 I$
when decoding, being cacheline aligned still means that the L2->L1
transfer for the beginning of the function starts out better.
But Intel's current optimization manual actually ends up having special rules:
When executing code from the Decoded Icache, direct branches that
are mostly taken should have all their instruction bytes in a 64B
cache line and nearer the end of that cache line. Their targets should
be at or near the beginning of a 64B cache line.
When executing code from the legacy decode pipeline, direct
branches that are mostly taken should have all their instruction bytes
in a 16B aligned chunk of memory and nearer the end of that 16B
aligned chunk. Their targets should be at or near the beginning of a
16B aligned chunk of memory.
So note how the branch itself should be at the end of the chunk, but
the branch _target_ should be at the beginning of the chunk. And the
chunk size is 16 bytes for decoding new instructions, and 64 bytes for
predecoded.
I suspect that for the kernel, and for the beginning of the function
(ie the target case), the 16-byte thing is still the main thing.
Because the L0 I$ ("uop cache", "Decoded Icache", whatever you want to
call it) is probably too small for a lot of kernel loads where user
space has flushed things in between system calls or faults.
Older versions of the intel code just said "All branch targets should
be 16-byte aligned".
So I think 16 bytes for function alignment ends up being what we
generally want, but yes, it's slowly changing. AMD fetches 32-byte
chunks, and Intel has that 64-byte predecode thing.
Linus
On Mon, Jul 18 2022 at 15:55, Sami Tolvanen wrote:
> On Mon, Jul 18, 2022 at 3:47 PM Joao Moreira <[email protected]> wrote:
>> Thus the hash will be 6 bytes before the function entry point. Then we
>> can get the compiler to emit a padding area before the __cfi_\func
>> snippet and, during boot, if the CPU needs the call depth tracking
>> mitigation, we:
>> - move the __cfi_func into the padding area
>> - patch the call depth tracking snippet ahead of it (overwriting the old
>> __cfi_\func:)
>> - fix the cmpl offset in the caller
>>
>> func_whatever:
>> ...
>> cmpl $0x\hash, -FIXED_OFFSET(%rax)
>> je 1f
>> ud2
>> 1:
>> call *%rax
>> ...
>
> The problem with this is that the cmpl instruction contains the full
> type hash, which means that any instruction that's FIXED_OFFSET from
> the cmpl is a valid indirect call target as far as KCFI is concerned.
> -6 was chosen specifically to make the ud2 the only possible target.
But that's an implementation detail, right? Whatever we put in between
will still be a fixed offset, no? It's a different offset, but that's
what patching can deal with.
Thanks,
tglx
On Mon, Jul 18, 2022 at 03:34:51PM -0700, Linus Torvalds wrote:
> I think the main cause is that the ACPI code builds with
>
> ccflags-y := -Os -D_LINUX -DBUILDING_ACPICA
>
> and that '-Os' will disable all function alignment. I think there's a
> few other places that do that too.
Urgh, that's -Ostupid, right?
On Mon, Jul 18, 2022 at 03:48:04PM -0700, Sami Tolvanen wrote:
> On Mon, Jul 18, 2022 at 2:18 PM Peter Zijlstra <[email protected]> wrote:
> >
> > On Mon, Jul 18, 2022 at 10:44:14PM +0200, Thomas Gleixner wrote:
> > > And we need input from the Clang folks because their CFI work also puts
> > > stuff in front of the function entry, which nicely collides.
> >
> > Right, I need to go look at the latest kCFI patches, that sorta got
> > side-tracked for working on all the retbleed muck :/
> >
> > Basically kCFI wants to preface every (indirect callable) function with:
> >
> > __cfi_\func:
> > int3
> > movl $0x12345678, %rax
> > int3
> > int3
> > \func:
>
> Yes, and in order to avoid scattering the code with call target
> gadgets, the preamble should remain immediately before the function.
I think we have a little room, but yes, -6 is just right to hit the UD2.
> > Ofc, we can still put the whole:
> >
> > sarq $5, PER_CPU_VAR(__x86_call_depth);
> > jmp \func_direct
> >
> > thing in front of that.
>
> Sure, that would work.
So if we assume \func starts with ENDBR, and further assume we've fixed
up every direct jmp/call to land at +4, we can overwrite the ENDBR with
part of the SARQ; that leaves us 6 more bytes, placing the immediate at
-10 if I'm not mis-counting.
Now, the call sites are:
41 81 7b fa 78 56 34 12 cmpl $0x12345678, -6(%r11)
74 02 je 1f
0f 0b ud2
e8 00 00 00 00 1: call __x86_indirect_thunk_r11
That means the offset of +10 lands in the middle of the CALL
instruction, and since we only have 16 thunks there is a limited number
of byte patterns available there.
This really isn't as nice as the -6 but might just work well enough,
hmm?
> > But it does somewhat destroy the version I had that only needs the
> > 10 bytes padding for the sarq.
>
> There's also the question of how function alignment should work in the
> KCFI case. Currently, the __cfi_ preamble is 16-byte aligned, which
> obviously means the function itself isn't.
That seems unfortunate, at least the intel parts have a 16 byte i-fetch
window (IIRC) so aligning the actual instructions at 16 bytes gets you
the best bang for the buck wrt ifetch (functions are random access and
not sequential).
Also, since we're talking at least 4 bytes more padding over the 7 that
are required by the kCFI scheme, the FineIBT alternative gets a little
more room to breathe. I'm thinking we can have the FineIBT landing site
at -16.
__fineibt_\func:
endbr64 # 4
xorl $0x12345678, %r10d # 7
je \func+4 # 2
ud2 # 2
\func:
nop4
...
With the callsite looking like:
nop7
movl $0x12345678, %r10d # 7
call *%r11 # 3
or something like that (everything having IBT has eIBRS at the very
least).
On Mon, Jul 18, 2022 at 4:19 PM Thomas Gleixner <[email protected]> wrote:
>
> But that's an implementation detail, right? Whatever we put in between
> will still be a fixed offset, no? It's a different offset, but that's
> what patching can deal with.
No, what Sami is saying is that because the "cmpl" *inside* the function
that checks the hash value will have that same (valid) hash value
encoded as part of it, you actually have *two* valid markers with
that hash value.
You have the "real" marker before the function.
But you also have the "false" marker that is part of the hash check
that is *inside* the function.
The "real marker + 6" points to the function head itself, and so is ok
as a target (normal operation).
The "false marker + 6" points to the "UD2", and so is *also* ok as a
target (bad guy trying to mis-use the false marker gets trapped by
UD2).
Linus
On Mon, Jul 18, 2022 at 04:52:09PM -0700, Linus Torvalds wrote:
> I also happen to believe that the kCFI code should have entirely
> different targets for direct jumps and for indirect jumps, but that's
> a separate issue. Maybe it already does that?
kCFI is purely about indirect calls.
On Mon, Jul 18, 2022 at 4:42 PM Linus Torvalds
<[email protected]> wrote:
>
> You have the "real" marker before the function.
>
> But you also have the "false" marker that is part of the hash check
> that is *inside* the function.
>
> The "real marker + 6" points to the function head itself, and so is ok
> as a target (normal operation).
Of course, one fix for that is to make the hash be only 24 bits, and
make the int3 byte part of the value you check, and not have the same
pattern in the checking code at all.
Honestly, I think that would be a better model - yes, you lose 8 bits
of hash, but considering that apparently the current KCFI code
*guarantees* that the hash pattern will exist even outside the actual
target pattern, I think it's still a better model.
I also happen to believe that the kCFI code should have entirely
different targets for direct jumps and for indirect jumps, but that's
a separate issue. Maybe it already does that?
Linus
On Mon, Jul 18, 2022 at 4:52 PM Linus Torvalds
<[email protected]> wrote:
>
> Honestly, I think that would be a better model - yes, you lose 8 bits
> of hash, but considering that apparently the current KCFI code
> *guarantees* that the hash pattern will exist even outside the actual
> target pattern,
Gaah, I'm being stupid. You still get the value collision, since the
int3 byte pattern would just be part of the compare pattern.
You'd have to use some multi-instruction compare to avoid having the
pattern in the instruction stream. Probably with another register.
Like
movl -FIXED_OFFSET(%rax),%edx
addl $ANTI_PATTERN,%edx
je ok
so that the "compare" wouldn't use the same pattern value, but be an
add with the negated pattern value instead.
The extra instruction is likely less of a problem than the extra register used.
Linus
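A standalone sketch of the arithmetic behind that negated-pattern check (hash
value and macro names are made up); it only demonstrates the equality test,
the actual no-hash-immediate guarantee would still have to come from emitting
the assembly sequence directly:

#include <stdint.h>
#include <stdio.h>

#define KCFI_HASH       0x12345678u
#define ANTI_PATTERN    (0u - KCFI_HASH)        /* the only constant in the check */

static int hash_ok(uint32_t stored)
{
        /* stored + (-HASH) wraps to 0 iff stored == HASH; only ZF is needed */
        return (uint32_t)(stored + ANTI_PATTERN) == 0;
}

int main(void)
{
        printf("match: %d, mismatch: %d\n",
               hash_ok(KCFI_HASH), hash_ok(0xdeadbeef));
        return 0;
}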
On Mon, Jul 18, 2022 at 4:58 PM Peter Zijlstra <[email protected]> wrote:
>
> On Mon, Jul 18, 2022 at 04:52:09PM -0700, Linus Torvalds wrote:
> > I also happen to believe that the kCFI code should have entirely
> > different targets for direct jumps and for indirect jumps, but that's
> > a separate issue. Maybe it already does that?
>
> kCFI is purely about indirect calls.
So it already only adds the pattern to things that have their address
taken, not all functions?
If so, that's simple enough to sort out: don't do any RSB stack
adjustment for those thunks AT ALL.
Because they should just then end up with a jump to the "real" target,
and that real target will do the RSB stack thing.
Linus
On Mon, Jul 18, 2022 at 05:11:27PM -0700, Linus Torvalds wrote:
> On Mon, Jul 18, 2022 at 5:03 PM Linus Torvalds
> <[email protected]> wrote:
> >
> > So it already only adds the pattern to things that have their address
> > taken, not all functions?
> >
> > If so, that's simple enough to sort out: don't do any RSB stack
> > adjustment for those thunks AT ALL.
> >
> > Because they should just then end up with a jump to the "real" target,
> > and that real target will do the RSB stack thing.
>
> Put another way, let's say that you have a function that looks like this:
>
> int silly(void)
> {
> return 0;
> }
>
> and now you have two cases:
>
> - the "direct callable version" of that function looks exactly the
> way it always has looked, and gets the 16 bytes of padding for it, and
> the RSB counting can happen in that padding
>
> - the "somebody took the address of this function" creates code that
> has the hash marker before it, and has the hash check, and then does a
> "jmp silly" to actually jump to the real code.
>
> So what the RSB counting does is just ignore that second case entirely
> as far as the RSB code generation goes. No need to have any padding
> for it at all, it has that (completely different) kCFI padding
> instead.
>
> Instead, only the "real" silly function gets that RSB code, and the
> "jmp silly" from the kCFI thunk needs to be updated to point to the
> RSB thunk in front of it.
>
> Yes, yes, it makes indirect calls slightly more expensive than direct
> calls (because that kCFI thing can't just fall through to the real
> thing), but considering all the *other* costs of indirect calls, the
> cost of having that one "jmp" instruction doesn't really seem to
> matter, does it?
So it's like 2:15 am here, so I might not be following things right, but
doesn't the above mean you have to play funny games with what a function
pointer is?
That is, the content of a function pointer (address taken) no longer
matches the actual function? That gives grief with things like
static_call(), ftrace and other things that write call instructions
instead of doing indirect calls.
On Mon, Jul 18, 2022 at 5:03 PM Linus Torvalds
<[email protected]> wrote:
>
> So it already only adds the pattern to things that have their address
> taken, not all functions?
>
> If so, that's simple enough to sort out: don't do any RSB stack
> adjustment for those thunks AT ALL.
>
> Because they should just then end up with a jump to the "real" target,
> and that real target will do the RSB stack thing.
Put another way, let's say that you have a function that looks like this:
int silly(void)
{
return 0;
}
and now you have two cases:
- the "direct callable version" of that function looks exactly the
way it always has looked, and gets the 16 bytes of padding for it, and
the RSB counting can happen in that padding
- the "somebody took the address of this function" creates code that
has the hash marker before it, and has the hash check, and then does a
"jmp silly" to actually jump to the real code.
So what the RSB counting does is just ignore that second case entirely
as far as the RSB code generation goes. No need to have any padding
for it at all, it has that (completely different) kCFI padding
instead.
Instead, only the "real" silly function gets that RSB code, and the
"jmp silly" from the kCFI thunk needs to be updated to point to the
RSB thunk in front of it.
Yes, yes, it makes indirect calls slightly more expensive than direct
calls (because that kCFI thing can't just fall through to the real
thing), but considering all the *other* costs of indirect calls, the
cost of having that one "jmp" instruction doesn't really seem to
matter, does it?
Linus
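A rough C-level sketch of the two-entry-point shape described above; the
_kcfi suffix and the hand-written wrapper are purely illustrative, since in
practice the compiler would have to emit this, not the programmer:

#include <stdio.h>

int silly(void)                         /* direct-call entry; would get the  */
{                                       /* padding and the RSB accounting    */
        return 0;
}

int silly_kcfi(void)                    /* hypothetical address-taken entry: */
{                                       /* hash preamble sits in front of    */
        return silly();                 /* it, body is in effect "jmp silly" */
}

int (*silly_ptr)(void) = silly_kcfi;    /* "&silly" would be rewritten to this */

int main(void)
{
        printf("direct %d, indirect %d\n", silly(), silly_ptr());
        return 0;
}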
On Mon, Jul 18, 2022 at 5:23 PM Peter Zijlstra <[email protected]> wrote:
>
> So it's like 2:15 am here, so I might not be following things right, but
> doesn't the above mean you have to play funny games with what a function
> pointer is?
Yes, but probably no more than compilers already do.
On x86, function pointers are simple, and just pointers to the first
instruction of the function.
But that's actually not true in general, and several other
architectures have *much* more complicated function pointers, where
they are pointers to special "function descriptor blocks" etc.
So I bet gcc has all that infrastructure in place anyway.
And the whole "use a different address for a direct call than for an
indirect call" is still much simpler than having an actual separate
function descriptor thing.
At worst, you'd actually always generate the thunk for the indirect
call case, and let the linker kill unused cases. The compiler wouldn't
even have to know about the two cases, except to use a different name
for the direct call case.
Do I claim it would be *pretty*? No. But I bet the existing CFI
patches already do things like this anyway.
(I have llvm sources on my machine too, because I used to build my own
clang from source back when I was testing the asm goto stuff. But
unlike gcc, I've never really *looked* at llvm, so I'm not familiar
with it at all, and I'm not going to try to figure out what the CFI
code actually does, and instead just handwave widely while saying "I
bet it already does this".)
Linus
> The extra instruction is likely less of a problem than the extra
> register used.
>
FWIW, per the ABI, R11 is a scratch reg and should be usable without hard
consequences in this scenario.
Joao
From: Linus Torvalds
> Sent: 19 July 2022 01:02
>
> On Mon, Jul 18, 2022 at 4:52 PM Linus Torvalds
> <[email protected]> wrote:
> >
> > Honestly, I think that would be a better model - yes, you lose 8 bits
> > of hash, but considering that apparently the current KCFI code
> > *guarantees* that the hash pattern will exist even outside the actual
> > target pattern,
>
> Gaah, I'm being stupid. You still get the value collision, since the
> int3 byte pattern would just be part of the compare pattern.
>
> You'd have to use some multi-instruction compare to avoid having the
> pattern in the instruction stream. Probably with another register.
> Like
>
> movl -FIXED_OFFSET(%rax),%edx
> addl $ANTI_PATTERN,%edx
> je ok
>
> so that the "compare" wouldn't use the same pattern value, but be an
> add with the negated pattern value instead.
>
> The extra instruction is likely less of a problem than the extra register used.
Shouldn't it be testing the value the caller supplied?
The extra instruction is likely to be one clock - I doubt it will
sensibly run in parallel with code later in the function.
The larger costs are (probably) polluting the D$ with I addresses
(already done by the caller) and the likely mispredicted 'je ok'.
Unless the function has been recently called the 'je ok' gets
static prediction.
While traditionally that would predict a forwards branch 'not taken',
ISTR more recent Intel CPUs just use the predictor output - ie random.
Not at all sure about AMD CPUs though.
David
On Tue, Jul 19, 2022 at 1:26 AM David Laight <[email protected]> wrote:
>
> Shouldn't it be testing the value the caller supplied?
Actually, I'm just all confused.
All that verification code is *in* the caller, before the call - to
verify that the target looks fine.
I think I was confused by the hash thunk above the function also being
generated with a "cmpl $hash". And I don't even know why that is, and
why it wasn't just the bare constant.
Linus
On Tue, Jul 19, 2022 at 09:27:02AM -0700, Linus Torvalds wrote:
> On Tue, Jul 19, 2022 at 1:26 AM David Laight <[email protected]> wrote:
> >
> > Shouldn't it be testing the value the caller supplied?
>
> Actually, I'm just all confused.
>
> All that verification code is *in* the caller, before the call - to
> verify that the target looks fine.
>
> I think I was confused by the hash thunk above the function also being
> generated with a "cmpl $hash". And I don't even know why that is, and
> why it wasn't just the bare constant.
The preamble hash is encoded into an instruction just to avoid special
casing objtool, which would otherwise get confused about the random
bytes. On arm64, we just emit a bare constant before the function.
Sami
On Mon, Jul 18, 2022 at 05:11:27PM -0700, Linus Torvalds wrote:
> On Mon, Jul 18, 2022 at 5:03 PM Linus Torvalds
> <[email protected]> wrote:
> >
> > So it already only adds the pattern to things that have their address
> > taken, not all functions?
The preamble is added to address-taken static functions and all global
functions, because those might be indirectly called from other
translation units. With LTO, we could prune unnecessary preambles from
non-address-taken globals too.
> > If so, that's simple enough to sort out: don't do any RSB stack
> > adjustment for those thunks AT ALL.
> >
> > Because they should just then end up with a jump to the "real" target,
> > and that real target will do the RSB stack thing.
>
> Put another way, let's say that you have a function that looks like this:
>
> int silly(void)
> {
> return 0;
> }
>
> and now you have two cases:
>
> - the "direct callable version" of that function looks exactly the
> way it always has looked, and gets the 16 bytes of padding for it, and
> the RSB counting can happen in that padding
>
> - the "somebody took the address of this function" creates code that
> has the hash marker before it, and has the hash check, and then does a
> "jmp silly" to actually jump to the real code.
Clang's current CFI implementation is somewhat similar to this. It
creates separate thunks for address-taken functions and changes
function addresses in C code to point to the thunks instead.
While this works, it creates painful situations when interacting with
assembly (e.g. a function address taken in assembly cannot be used
for indirect calls in C as it doesn't point to the thunk) and needs
unpleasant hacks when we want to take the actual function address in C
(i.e. scattering the code with function_nocfi() calls).
I have to agree with Peter on this, I would rather avoid messing with
function pointers in KCFI to avoid these issues.
Sami
On Mon, Jul 18, 2022 at 05:19:13PM -0700, Joao Moreira wrote:
> > The extra instruction is likely less of a problem than the extra
> > register used.
> >
> FWIW, per the ABI, R11 is a scratch reg and should be usable without hard
> consequences in this scenario.
Clang always uses r11 for the indirect call with retpolines, so we'd
need to use another register. Nevertheless, splitting the constant into
two instructions would solve the call target gadget issue.
Sami
> Clang always uses r11 for the indirect call with retpolines, so we'd
> need to use another register. Nevertheless, splitting the constant into
> two instructions would solve the call target gadget issue.
Yeah, it clicked later yesterday. But, FWIW, R10 is also considered a
scratch register, although it is used for passing static chain pointers,
which I think is not a thing in kernel context. As a last resort we can
always do liveness analysis and I doubt we'll have a significant (if
any) number of spills.
If we are comparing through registers, I would suggest using a sub
instruction instead of a cmp, as this will destroy the contents of the
register and prevent it from being re-used on further unprotected
indirect branches, if any exist.
On Tue, Jul 19, 2022 at 10:23 AM Sami Tolvanen <[email protected]> wrote:
>
> The preamble hash is encoded into an instruction just to avoid special
> casing objtool, which would otherwise get confused about the random
> bytes. On arm64, we just emit a bare constant before the function.
Ahh.
I think objtool would want to understand about kCFI anyway, so I think
in the long run that hack isn't a good idea.
But I get why you'd do it as a "do this as just a compiler thing and
hide it from objtool" as a development strategy.
Linus
On Tue, Jul 19, 2022 at 10:27:00AM -0700, Linus Torvalds wrote:
> On Tue, Jul 19, 2022 at 10:23 AM Sami Tolvanen <[email protected]> wrote:
> >
> > The preamble hash is encoded into an instruction just to avoid special
> > casing objtool, which would otherwise get confused about the random
> > bytes. On arm64, we just emit a bare constant before the function.
>
> Ahh.
>
> I think objtool would want to understand about kCFI anyway, so I think
> > in the long run that hack isn't a good idea.
>
> But I get why you'd do it as a "do this as just a compiler thing and
> hide it from objtool" as a development strategy.
I believe it was actually Peter's idea to use an instruction. :) In
earlier revisions of KCFI, I did teach objtool about the preambles, but
that was just so it can ignore them.
Sami
On Tue, Jul 19, 2022 at 11:06:40AM -0700, Sami Tolvanen wrote:
> On Tue, Jul 19, 2022 at 10:27:00AM -0700, Linus Torvalds wrote:
> > On Tue, Jul 19, 2022 at 10:23 AM Sami Tolvanen <[email protected]> wrote:
> > >
> > > The preamble hash is encoded into an instruction just to avoid special
> > > casing objtool, which would otherwise get confused about the random
> > > bytes. On arm64, we just emit a bare constant before the function.
> >
> > Ahh.
> >
> > I think objtool would want to understand about kCFI anyway, so I think
> > > in the long run that hack isn't a good idea.
> >
> > But I get why you'd do it as a "do this as just a compiler thing and
> > hide it from objtool" as a development strategy.
>
> I believe it was actually Peter's idea to use an instruction. :) In
> earlier revisions of KCFI, I did teach objtool about the preambles, but
> that was just so it can ignore them.
Right; even if we teach objtool about kCFI, having text be actual
instructions makes things much nicer. Objdump and friends also shit
their pants if you put random bytes in. It only costs a single byte to
encode the immediate, so why not.
Specifically, the encoding used is:
movl $0x12345678, %eax
and that is 0xb8 followed by the constant, but there's plenty other
single byte ops that could be used.
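For reference, a tiny standalone sketch that recovers the hash from that
encoding the way a naive parser would (the bytes are just the movl form
quoted above):

#include <stdint.h>
#include <stdio.h>

int main(void)
{
        /* movl $0x12345678, %eax  ->  b8 78 56 34 12 */
        const uint8_t preamble[5] = { 0xb8, 0x78, 0x56, 0x34, 0x12 };
        uint32_t hash = (uint32_t)preamble[1]
                      | (uint32_t)preamble[2] << 8
                      | (uint32_t)preamble[3] << 16
                      | (uint32_t)preamble[4] << 24;

        printf("opcode %#x, hash %#x\n",
               (unsigned int)preamble[0], (unsigned int)hash);
        return 0;
}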
On Tue, Jul 19 2022 at 01:51, Peter Zijlstra wrote:
> On Mon, Jul 18, 2022 at 03:48:04PM -0700, Sami Tolvanen wrote:
>> On Mon, Jul 18, 2022 at 2:18 PM Peter Zijlstra <[email protected]> wrote:
>> > Ofc, we can still put the whole:
>> >
>> > sarq $5, PER_CPU_VAR(__x86_call_depth);
>> > jmp \func_direct
>> >
>> > thing in front of that.
>>
>> Sure, that would work.
>
> So if we assume \func starts with ENDBR, and further assume we've fixed
> up every direct jmp/call to land at +4, we can overwrite the ENDBR with
> part of the SARQ; that leaves us 6 more bytes, placing the immediate at
> -10 if I'm not mis-counting.
>
> Now, the call sites are:
>
> 41 81 7b fa 78 56 34 12 cmpl $0x12345678, -6(%r11)
> 74 02 je 1f
> 0f 0b ud2
> e8 00 00 00 00 1: call __x86_indirect_thunk_r11
>
> That means the offset of +10 lands in the middle of the CALL
> instruction, and since we only have 16 thunks there is a limited number
> of byte patterns available there.
>
> This really isn't as nice as the -6 but might just work well enough,
> hmm?
So I added 32-byte padding and put the thunk at the start:
sarq $5, PER_CPU_VAR(__x86_call_depth);
jmp \func_direct
For sockperf that costs about 1% performance vs. the 16 byte
variant. For mitigations=off it's a ~0.5% drop.
That's on a SKL. Did not check on other systems yet.
Thanks,
tglx
I just ran my ftrace test suite against v5.19-rc7 to see if anything pops
up. And one of my boot tests failed with:
[ 2.459713] Last level iTLB entries: 4KB 1024, 2MB 1024, 4MB 1024
[ 2.460712] Last level dTLB entries: 4KB 1024, 2MB 1024, 4MB 1024, 1GB 4
[ 2.461712] Spectre V1 : Mitigation: usercopy/swapgs barriers and __user pointer sanitization
[ 2.462713] Spectre V2 : Mitigation: Retpolines
[ 2.464712] Spectre V2 : Spectre v2 / SpectreRSB mitigation: Filling RSB on context switch
[ 2.465712] Speculative Store Bypass: Vulnerable
[ 2.466713] MDS: Vulnerable: Clear CPU buffers attempted, no microcode
[ 2.467712] SRBDS: Vulnerable: No microcode
[ 2.488002] ------------[ cut here ]------------
[ 2.488712] WARNING: CPU: 0 PID: 0 at arch/x86/kernel/alternative.c:558 apply_returns+0xa3/0x1ec
[ 2.489712] Modules linked in:
[ 2.490712] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.19.0-rc7-test #65
[ 2.491712] Hardware name: MSI MS-7823/CSM-H87M-G43 (MS-7823), BIOS V1.6 02/22/2014
[ 2.492712] RIP: 0010:apply_returns+0xa3/0x1ec
[ 2.493712] Code: 0f b6 5d a2 48 63 45 88 48 01 c3 4c 01 e3 48 89 da 4c 89 e7 e8 c2 25 00 00 84 c0 0f 85 1f 01 00 00 48 81 fb c0 49 40 a1 74 07 <0f> 0b e9 0f 01 00 00 83 3d a1 83 4f 02 00 74 24 0f b6 55 a2 48 63
[ 2.494712] RSP: 0000:ffffffffa1e03df0 EFLAGS: 00010206
[ 2.495711] RAX: 0000000000000000 RBX: ffffffffa173e8ad RCX: 0000000000000001
[ 2.496711] RDX: 0000000000000003 RSI: ffffffffa161b20c RDI: ffffffffa173e8ad
[ 2.497711] RBP: ffffffffa1e03ea8 R08: 00000000fffffff1 R09: 000000000000000f
[ 2.498711] R10: ffffffffa1e03db8 R11: 0000000000000b03 R12: ffffffffa173e8a8
[ 2.499711] R13: ffffffffa2550d30 R14: 0000000000000000 R15: ffffffffa1e32138
[ 2.500712] FS: 0000000000000000(0000) GS:ffff8e8796400000(0000) knlGS:0000000000000000
[ 2.501711] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 2.502711] CR2: ffff8e879edff000 CR3: 0000000014e2a001 CR4: 00000000001706f0
[ 2.503712] Call Trace:
[ 2.504712] <TASK>
[ 2.505721] alternative_instructions+0x39/0xe9
[ 2.506712] check_bugs+0x310/0x330
[ 2.507712] start_kernel+0x605/0x63e
[ 2.508715] x86_64_start_reservations+0x24/0x2a
[ 2.509712] x86_64_start_kernel+0x8d/0x97
[ 2.510713] secondary_startup_64_no_verify+0xe0/0xeb
[ 2.511719] </TASK>
[ 2.512712] irq event stamp: 142170
[ 2.513711] hardirqs last enabled at (142180): [<ffffffffa014427c>] __up_console_sem+0x4b/0x53
[ 2.514711] hardirqs last disabled at (142189): [<ffffffffa014425c>] __up_console_sem+0x2b/0x53
[ 2.515712] softirqs last enabled at (8013): [<ffffffffa1400389>] __do_softirq+0x389/0x3c8
[ 2.516711] softirqs last disabled at (8006): [<ffffffffa00e2daa>] __irq_exit_rcu+0x72/0xcb
[ 2.517711] ---[ end trace 0000000000000000 ]---
[ 2.529074] Freeing SMP alternatives memory: 44K
[ 2.633924] smpboot: CPU0: Intel(R) Core(TM) i3-4130 CPU @ 3.40GHz (family: 0x6, model: 0x3c, stepping: 0x3)
[ 2.636420] cblist_init_generic: Setting adjustable number of callback queues.
[ 2.636712] cblist_init_generic: Setting shift to 2 and lim to 1.
[ 2.637821] cblist_init_generic: Setting shift to 2 and lim to 1.
[ 2.638822] Running RCU-tasks wait API self tests
[ 2.742759] Performance Events: PEBS fmt2+, Haswell events, 16-deep LBR, full-width counters, Intel PMU driver.
[ 2.743718] ... version: 3
[ 2.744712] ... bit width: 48
[ 2.745712] ... generic registers: 4
[ 2.746712] ... value mask: 0000ffffffffffff
[ 2.747712] ... max period: 00007fffffffffff
[ 2.748712] ... fixed-purpose events: 3
[ 2.749712] ... event mask: 000000070000000f
[ 2.751038] Estimated ratio of average max frequency by base frequency (times 1024): 1024
[ 2.751848] rcu: Hierarchical SRCU implementation.
[ 2.755405] smp: Bringing up secondary CPUs ...
[ 2.756470] x86: Booting SMP configuration:
[ 2.757713] .... node #0, CPUs: #1
[ 1.309155] numa_add_cpu cpu 1 node 0: mask now 0-1
[ 2.767842] #2
[ 1.309155] numa_add_cpu cpu 2 node 0: mask now 0-2
[ 2.774816] MDS CPU bug present and SMT on, data leak possible. See https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/mds.html for more details.
[ 2.777343] #3
[ 1.309155] numa_add_cpu cpu 3 node 0: mask now 0-3
[ 2.784230] smp: Brought up 1 node, 4 CPUs
[ 2.784713] smpboot: Max logical packages: 1
Attached is the config.
-- Steve
On Tue, Jul 19, 2022 at 01:51:14AM +0200, Peter Zijlstra wrote:
> So if we assume \func starts with ENDBR, and further assume we've fixed
> up every direct jmp/call to land at +4, we can overwrite the ENDBR with
> part of the SARQ; that leaves us 6 more bytes, placing the immediate at
> -10 if I'm not mis-counting.
>
> Now, the call sites are:
>
> 41 81 7b fa 78 56 34 12 cmpl $0x12345678, -6(%r11)
> 74 02 je 1f
> 0f 0b ud2
> e8 00 00 00 00 1: call __x86_indirect_thunk_r11
>
> That means the offset of +10 lands in the middle of the CALL
> instruction, and since we only have 16 thunks there is a limited number
> of byte patterns available there.
>
> This really isn't as nice as the -6 but might just work well enough,
> hmm?
I agree, this is probably fine, or at least low enough risk.
Did you have thoughts about changing the check instruction sequence
to split the hash into multiple instructions and thus avoiding this
issue altogether?
Sami
On Wed, Jul 20, 2022 at 9:57 AM Steven Rostedt <[email protected]> wrote:
>
> [ 2.488712] WARNING: CPU: 0 PID: 0 at arch/x86/kernel/alternative.c:558 apply_returns+0xa3/0x1ec
That warning is kind of annoying, in how it doesn't actually give any
information about where the problem is.
I do note that we only fix up JMP32_INSN_OPCODE, and I wonder if we
have a "jmp __x86_return_thunk" that is close enough to the return
thunk that it actually uses a byte offset?
But that WARN_ON_ONCE() should probably be changed to actually give
some information about where the problem is.
The silly thing is, there's even debug output in that function that
you could enable, but it will enable output for the *normal* case, not
for the WARN_ON_ONCE() case or the "we didn't do anything" case. That
seems a bit backwards.
Linus
On Wed, 20 Jul 2022 19:24:32 +0200
Peter Zijlstra <[email protected]> wrote:
> There's a patch for that:
>
> https://lkml.kernel.org/r/[email protected]
>
> it seems to have gotten lost, let me go queue that.
With this applied, I now have:
[ 2.442118] MDS: Vulnerable: Clear CPU buffers attempted, no microcode
[ 2.443117] SRBDS: Vulnerable: No microcode
[ 2.463339] ------------[ cut here ]------------
[ 2.464117] missing return thunk: lkdtm_rodata_do_nothing+0x0/0x8-lkdtm_rodata_do_nothing+0x5/0x8: e9 00 00 00 00
[ 2.464128] WARNING: CPU: 0 PID: 0 at arch/x86/kernel/alternative.c:558 apply_returns+0xcb/0x219
[ 2.466117] Modules linked in:
[ 2.467117] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.19.0-rc7-test+ #66
[ 2.468117] Hardware name: MSI MS-7823/CSM-H87M-G43 (MS-7823), BIOS V1.6 02/22/2014
[ 2.469117] RIP: 0010:apply_returns+0xcb/0x219
[ 2.470117] Code: 80 3d d1 32 06 02 00 75 59 49 89 d8 b9 05 00 00 00 4c 89 e2 48 89 de 48 c7 c7 63 98 b4 ae c6 05 b3 32 06 02 01 e8 f8 4f fb 00 <0f> 0b eb 34 44 0f b6 65 a2 31 c0 48 8d 55 c1 c6 45 c0 c3 48 89 d7
[ 2.471117] RSP: 0000:ffffffffaee03df0 EFLAGS: 00010286
[ 2.472117] RAX: 0000000000000000 RBX: ffffffffae73e8a8 RCX: ffffffffae056641
[ 2.473117] RDX: ffffffffaee03d78 RSI: 0000000000000001 RDI: 00000000ffffffff
[ 2.474117] RBP: ffffffffaee03ea8 R08: 0000000000000000 R09: 00000000ffffffea
[ 2.475117] R10: 000000000000001f R11: ffffffffaee03a4d R12: ffffffffae73e8ad
[ 2.476117] R13: ffffffffaf550d30 R14: 0000000000000000 R15: ffffffffaee32138
[ 2.477117] FS: 0000000000000000(0000) GS:ffff9d71d6400000(0000) knlGS:0000000000000000
[ 2.478117] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
-- Steve
On Wed, Jul 20, 2022 at 10:50 AM Steven Rostedt <[email protected]> wrote:
>
> [ 2.464117] missing return thunk: lkdtm_rodata_do_nothing+0x0/0x8-lkdtm_rodata_do_nothing+0x5/0x8: e9 00 00 00 00
Well, that looks like a "jmp" instruction that has never been relocated.
The 'e9' is 'jmp', the four zeros after it are either "I'm jumping to
the next instruction" or "I haven't been filled in".
I'm assuming it's the second case.
That lkdtm_rodata_do_nothing thing is odd, and does
OBJCOPYFLAGS_rodata_objcopy.o := \
--rename-section
.noinstr.text=.rodata,alloc,readonly,load,contents
to put the code in an odd section. I'm assuming this hackery is
related to it then not getting relocated.
Linus
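A tiny standalone sketch of what those five bytes decode to (instruction
bytes taken from the warning above):

#include <stdint.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
        /* from the splat: e9 00 00 00 00 at lkdtm_rodata_do_nothing+0x0 */
        const uint8_t insn[5] = { 0xe9, 0x00, 0x00, 0x00, 0x00 };
        int32_t rel;

        memcpy(&rel, insn + 1, sizeof(rel));    /* little-endian rel32 */
        /* target = address of the next instruction + rel; rel == 0 is either
         * "jump to the next instruction" or a relocation that never happened */
        printf("e9 rel32 = %d\n", (int)rel);
        return 0;
}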
On Wed, Jul 20, 2022 at 10:09:37AM -0700, Linus Torvalds wrote:
> On Wed, Jul 20, 2022 at 9:57 AM Steven Rostedt <[email protected]> wrote:
> >
> > [ 2.488712] WARNING: CPU: 0 PID: 0 at arch/x86/kernel/alternative.c:558 apply_returns+0xa3/0x1ec
>
> That warning is kind of annoying, in how it doesn't actually give any
> information about where the problem is.
>
> I do note that we only fix up JMP32_INSN_OPCODE, and I wonder if we
> have a "jmp __x86_return_thunk" that is close enough to the return
> thunk that it actually uses a byte offset?
>
> But that WARN_ON_ONCE() should probably be changed to actually give
> some information about where the problem is.
There's a patch for that:
https://lkml.kernel.org/r/[email protected]
it seems to have gotten lost, let me go queue that.
On Wed, 20 Jul 2022 11:07:26 -0700
Linus Torvalds <[email protected]> wrote:
> On Wed, Jul 20, 2022 at 10:50 AM Steven Rostedt <[email protected]> wrote:
> >
> > [ 2.464117] missing return thunk: lkdtm_rodata_do_nothing+0x0/0x8-lkdtm_rodata_do_nothing+0x5/0x8: e9 00 00 00 00
>
> Well, that looks like a "jmp" instruction that has never been relocated.
>
> The 'e9' is 'jmp', the four zeros after it are either "I'm jumping to
> the next instruction" or "I haven't been filled in".
>
> I'm assuming it's the second case.
>
> That lkdtm_rodata_do_nothing thing is odd, and does
>
> OBJCOPYFLAGS_rodata_objcopy.o := \
> --rename-section
> .noinstr.text=.rodata,alloc,readonly,load,contents
>
> to put the code in an odd section. I'm assuming this hackery is
> related to it then not getting relocated.
>
Right, because this looks to be some magic being done for testing purposes:
static void lkdtm_EXEC_RODATA(void)
{
execute_location(dereference_function_descriptor(lkdtm_rodata_do_nothing),
CODE_AS_IS);
}
static void *setup_function_descriptor(func_desc_t *fdesc, void *dst)
{
if (!have_function_descriptors())
return dst;
memcpy(fdesc, do_nothing, sizeof(*fdesc));
fdesc->addr = (unsigned long)dst;
barrier();
return fdesc;
}
static noinline void execute_location(void *dst, bool write)
{
void (*func)(void);
func_desc_t fdesc;
void *do_nothing_text = dereference_function_descriptor(do_nothing);
pr_info("attempting ok execution at %px\n", do_nothing_text);
do_nothing();
if (write == CODE_WRITE) {
memcpy(dst, do_nothing_text, EXEC_SIZE);
flush_icache_range((unsigned long)dst,
(unsigned long)dst + EXEC_SIZE);
}
pr_info("attempting bad execution at %px\n", dst);
func = setup_function_descriptor(&fdesc, dst);
func();
pr_err("FAIL: func returned\n");
}
And it appears that it wants to crash, as the code is located in read-only
data.
OBJCOPYFLAGS_rodata_objcopy.o := \
--rename-section .noinstr.text=.rodata,alloc,readonly,load,contents
And because the alternatives fixup tries to write to it, and fails due to
it being readonly, I'm guessing we get this warning.
Thus, is there a way to keep this file from being entered into the
return_sites section?
-- Steve
On Wed, Jul 20, 2022 at 11:31 AM Steven Rostedt <[email protected]> wrote:
>
> Thus, is there a way to keep this file from being entered into the
> return_sites section?
I think the whole concept is broken.
Testing known-broken code on the expectation that "this won't work
anyway, so we can jump off to code that is broken" is not acceptable.
*If* the test were to fail, it would start executing random code that
hasn't been relocated or fixed up properly.
So I think the whole concept is broken. It relies on the compiler
generating code that can work in a read-only data section, and it's
not clear that that is even physically possible (ie the data section
might be far enough away from a code section that any relocation just
fundamentally cannot happen).
I think it worked purely by mistake, because the code was simple
enough that it didn't need any relocation at all before. But even
without RETHUNK, that was never guaranteed, because any random tracing
or debug code or whatever could have made even that empty function
have code in it that just fundamentally wouldn't work in a non-text
section.
So honestly, I think that test should be removed as a "we used this,
it happened to work almost by mistake, but it doesn't work any more
and it is unfixably broken".
Maybe somebody can come up with an entirely different way to do that
test that isn't so broken, but if so, I think it's going to be using
some other machinery (eg bpf and explicitly marking it read-only and
non-executable), and removing this broken model is the right thing
regardless.
So unless somebody has some one-liner workaround, I really suspect the
fix is to remove all this. The amount of hackery to make it work in
the first place is kind of disgusting anyway.
Since this was a WARN_ONCE(), can you make sure that with this case
removed, nothing else triggers?
Linus
On Wed, 20 Jul 2022 11:43:37 -0700
Linus Torvalds <[email protected]> wrote:
> So unless somebody has some one-liner workaround, I really suspect the
> fix is to remove all this. The amount of hackery to make it work in
> the first place is kind of disgusting anyway.
>
> Since this was a WARN_ONCE(), can you make sure that with this case
> removed, nothing else triggers?
Actually, this fixes it too:
(and this config boots to completion without warnings).
I'll add this to my full test suite and see if it finishes.
-- Steve
diff --git a/arch/x86/Makefile b/arch/x86/Makefile
index 1f40dad30d50..2dd61d8594f4 100644
--- a/arch/x86/Makefile
+++ b/arch/x86/Makefile
@@ -25,6 +25,7 @@ endif
ifdef CONFIG_RETHUNK
RETHUNK_CFLAGS := -mfunction-return=thunk-extern
RETPOLINE_CFLAGS += $(RETHUNK_CFLAGS)
+export RETHUNK_CFLAGS
endif
export RETPOLINE_CFLAGS
diff --git a/drivers/misc/lkdtm/Makefile b/drivers/misc/lkdtm/Makefile
index 2e0aa74ac185..fd96ac1617f7 100644
--- a/drivers/misc/lkdtm/Makefile
+++ b/drivers/misc/lkdtm/Makefile
@@ -16,7 +16,7 @@ lkdtm-$(CONFIG_PPC_64S_HASH_MMU) += powerpc.o
KASAN_SANITIZE_rodata.o := n
KASAN_SANITIZE_stackleak.o := n
KCOV_INSTRUMENT_rodata.o := n
-CFLAGS_REMOVE_rodata.o += $(CC_FLAGS_LTO)
+CFLAGS_REMOVE_rodata.o += $(CC_FLAGS_LTO) $(RETHUNK_CFLAGS)
OBJCOPYFLAGS :=
OBJCOPYFLAGS_rodata_objcopy.o := \
On Tue, Jul 19, 2022 at 01:51:14AM +0200, Peter Zijlstra wrote:
> That means the offset of +10 lands in the middle of the CALL
> instruction, and since we only have 16 thunks there is a limited number
> of byte patterns available there.
>
> This really isn't as nice as the -6 but might just work well enough,
> hmm?
pcc pointed out that we can also just add two more ud2 instructions to
the check sequence if we want to be safe, at the cost of an extra four
bytes per callsite.
> Also, since we're talking at least 4 bytes more padding over the 7 that
> are required by the kCFI scheme, the FineIBT alternative gets a little
> more room to breathe. I'm thinking we can have the FineIBT landing site
> at -16.
>
> __fineibt_\func:
> endbr64 # 4
> xorl $0x12345678, %r10d # 7
> je \func+4 # 2
> ud2 # 2
>
> \func:
> nop4
> ...
I assume this means the preamble must also be 16-byte aligned to avoid
performance issues with the FineIBT alternative? Which means we'll have
a 16-byte preamble preceded by the usual nop padding.
Sami
On Wed, Jul 20, 2022 at 11:07:26AM -0700, Linus Torvalds wrote:
> On Wed, Jul 20, 2022 at 10:50 AM Steven Rostedt <[email protected]> wrote:
> >
> > [ 2.464117] missing return thunk: lkdtm_rodata_do_nothing+0x0/0x8-lkdtm_rodata_do_nothing+0x5/0x8: e9 00 00 00 00
>
> Well, that looks like a "jmp" instruction that has never been relocated.
Peter, Josh, and I drilled down into this recently[1] and discussed
some solutions[2].
This test is doing what's expected: it needed an arch-agnostic way to do
a "return", and when the way to do that changed, it also changed (which
would normally be good, but in this case broke it). It's been happily
being used as part of the per-section architectural behavior testing[3]
of execution-vs-expected-memory-permissions for quite a long while now.
I'd rather not remove it (or do it dynamically) since the point is to
test what has been generated by the toolchain/build process and stuffed
into the .rodata section. i.e. making sure gadgets there can't be
executed, that the boot-time section permission-setting works correctly,
etc. Before the retbleed mitigation, this test worked for all
architectures; I'd hate to regress it. :(
-Kees
[1] https://lore.kernel.org/lkml/[email protected]/
[2] https://lore.kernel.org/lkml/[email protected]/
[3] e.g. https://linux.kernelci.org/test/plan/id/62d61ee8ef31e0f0faa39bff/
--
Kees Cook
On Wed, 20 Jul 2022 12:36:38 -0700
Kees Cook <[email protected]> wrote:
> I'd rather not remove it (or do it dynamically) since the point is to
> test what has been generated by the toolchain/build process and stuffed
> into the .rodata section. i.e. making sure gadgets there can't be
> executed, that the boot-time section permission-setting works correctly,
> etc. Before the retbleed mitigation, this test worked for all
> architectures; I'd hate to regress it. :(
If you haven't noticed my reply, I wasn't able to come up with a one line
workaround, but I was able to come up with a two line workaround. Hopefully
that will be good enough to keep your little feature.
https://lore.kernel.org/all/[email protected]/
I'm currently running it under my entire ftrace test suite. If it passes,
I'll submit a formal patch.
-- Steve
On Tue, Jul 19, 2022 at 10:19:18AM -0700, Sami Tolvanen wrote:
> Clang's current CFI implementation is somewhat similar to this. It
> creates separate thunks for address-taken functions and changes
> function addresses in C code to point to the thunks instead.
>
> While this works, it creates painful situations when interacting with
> assembly (e.g. a function address taken in assembly cannot be used
> for indirect calls in C as it doesn't point to the thunk) and needs
> unpleasant hacks when we want to take the actual function address in C
> (i.e. scattering the code with function_nocfi() calls).
>
> I have to agree with Peter on this, I would rather avoid messing with
> function pointers in KCFI to avoid these issues.
It is either this (and I think I can avoid the worst of it, see below),
or grow the indirect callsites to obscure the immediate (as Linus
suggested). There are around ~16k indirect callsites in a defconfig-ish
kernel, so growing them isn't too horrible, but it isn't nice either.
The prettiest option to obscure the immediate at the callsite I could
conjure up is something like:
kcfi_caller_linus:
movl $0x12345600, %r10d
movb $0x78, %r10b
cmpl %r10d, -OFFSET(%r11)
je 1f
ud2
1: call __x86_thunk_indirect_r11
Which comes to around 22 bytes (+5 over the original).
Joao suggested putting part of that in the retpoline thunk like:
kcfi_caller_joao:
movl $0x12345600, %r10d
movb $0x78, %r10b
call __x86_thunk_indirect_cfi
__x86_thunk_indirect_cfi:
cmpl %r10d, -OFFSET(%r11)
je 1f
ud2
1:
call 1f
int3
1:
mov %r11, (%rsp)
ret
int3
The only down-side there is that eIBRS hardware doesn't need retpolines
(given we currently default to ignoring Spectre-BHB) and as such this
doesn't really work nicely (we don't want to re-introduce funneling).
The other option I came up with, alluded to above, is below, and having
written it out, I'm pretty sure I favour just growing the indirect
callsite as per Linus' option above.
Suppose:
indirect_callsite:
cmpl $0x12345678, -6(%r11) # 8
je 1f # 2
ud2 # 2
call __x86_indirect_thunk_r11 # 5 (-> .retpoline_sites)
__cfi_\func:
movl $0x12345678, %eax # 5
int3 # 1
int3 # 1
\func: # aligned 16
endbr # 4
nop12 # 12
call __fentry__ # 5
...
And for functions that do not get their address taken:
\func: # aligned 16
nop16 # 16
call __fentry__ # 5
...
Instead, extend the objtool .call_sites to also include tail-calls and
for:
- regular (!SKL, !IBT) systems;
* patch all direct calls/jmps to +16 (.call_sites)
* static_call/ftrace/etc.. can trivially add the +16
* retpolines can do +16 for the indirect calls
* return thunks are patched to ret;int3 (.return_sites)
(indirect calls for eIBRS which don't use retpoline
simply eat the nops)
- SKL systems;
* patch the first 16 bytes into:
nop6
sarq $5, PER_CPU_VAR(__x86_call_depth)
* patch all direct calls to +6 (.call_sites)
* patch all direct jumps to +16 (.call_sites)
* static_call/ftrace adjust to +6/+16 depending on instruction type
* retpolines are split between call/jmp and do +6/+16 resp.
* return thunks are patched to x86_return_skl (.return_sites)
- IBT systes;
* patch the first 16 bytes to:
endbr # 4
xorl $0x12345678, %r10d # 7
je 1f # 2
ud2 # 2
nop # 1
1:
* patch the callsites to: (.retpoline_sites)
movl $0x12345678, %r10d # 7
call *$r11 # 3
nop7 # 7
* patch all the direct calls/jmps to +16 (.call_sites)
* static_call/ftrace/etc.. add +16
* return thunks are patched to ret;int3 (.return_sites)
Yes, frobbing the address for static_call/ftrace/etc.. is a bit
horrible, but at least &sym remains exactly that address and not
something magical.
Note: It is possible to shift the __fentry__ call, but that would mean
that we loose alignment or get to carry .call_sites at runtime (and it
is *huge*)
On Wed, Jul 20, 2022 at 12:36:38PM -0700, Kees Cook wrote:
> On Wed, Jul 20, 2022 at 11:07:26AM -0700, Linus Torvalds wrote:
> > On Wed, Jul 20, 2022 at 10:50 AM Steven Rostedt <[email protected]> wrote:
> > >
> > > [ 2.464117] missing return thunk: lkdtm_rodata_do_nothing+0x0/0x8-lkdtm_rodata_do_nothing+0x5/0x8: e9 00 00 00 00
> >
> > Well, that looks like a "jmp" instruction that has never been relocated.
>
> Peter, Josh, and I drilled down into this recently[1] and discussed
> some solutions[2].
>
> This test is doing what's expected: it needed an arch-agnostic way to do
> a "return", and when the way to do that changed, it also changed (which
> would normally be good, but in this case broke it). It's been happily
> being used as part of the per-section architectural behavior testing[3]
> of execution-vs-expected-memory-permissions for quite a long while now.
>
> I'd rather not remove it (or do it dynamically) since the point is to
> test what has been generated by the toolchain/build process and stuffed
> into the .rodata section. i.e. making sure gadgets there can't be
> executed, that the boot-time section permission-setting works correctly,
> etc. Before the retbleed mitigation, this test worked for all
> architectures; I'd hate to regress it. :(
>
> -Kees
>
> [1] https://lore.kernel.org/lkml/[email protected]/
> [2] https://lore.kernel.org/lkml/[email protected]/
> [3] e.g. https://linux.kernelci.org/test/plan/id/62d61ee8ef31e0f0faa39bff/
Josh posted this:
https://lkml.kernel.org/r/8ec0039712f252693049c70ed3891d39a2357112.1658155446.git.jpoimboe@kernel.org
which I picked up today; barring robot fail I'll push it to x86/urgent
tomorrow.
From: Peter Zijlstra
> Sent: 20 July 2022 22:13
...
> The prettiest option to obscure the immediate at the callsite I could
> conjure up is something like:
>
> kcfi_caller_linus:
> movl $0x12345600, %r10d
> movb $0x78, %r10b
> cmpl %r10d, -OFFSET(%r11)
> je 1f
> ud2
> 1: call __x86_thunk_indirect_r11
>
> Which comes to around 22 bytes (+5 over the original).
You'd be better doing:
movl $0x12345678-0xaa, %r10d
addl $0xaa, %r10d
so that the immediate is obscured even if the low bits are zero.
David
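A quick standalone sketch of that split-constant arithmetic (hash and split
values as in the example above); as with the earlier negated-pattern idea,
this only shows the reconstruction, the real guarantee comes from emitting
the two instructions directly:

#include <stdint.h>
#include <stdio.h>

#define KCFI_HASH       0x12345678u
#define SPLIT           0xaau

int main(void)
{
        uint32_t r10d = KCFI_HASH - SPLIT;      /* movl $hash-0xaa, %r10d */

        r10d += SPLIT;                          /* addl $0xaa, %r10d      */
        printf("reconstructed %#x, matches hash: %d\n",
               (unsigned int)r10d, r10d == KCFI_HASH);
        return 0;
}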
From: David Laight
> Sent: 21 July 2022 09:22
>
> From: Peter Zijlstra
> > Sent: 20 July 2022 22:13
> ...
> > The prettiest option to obscure the immediate at the callsite I could
> > conjure up is something like:
> >
> > kcfi_caller_linus:
> > movl $0x12345600, %r10d
> > movb $0x78, %r10b
> > cmpl %r10d, -OFFSET(%r11)
> > je 1f
> > ud2
> > 1: call __x86_thunk_indirect_r11
> >
> > Which comes to around 22 bytes (+5 over the original).
>
> You'd be better doing:
> movl $0x12345678-0xaa, %r10d
> addl $0xaa, %r10d
> so that the immediate is obscured even if the low bits are zero.
Actually, can't you use %eax instead of %r10d?
IIRC it is only used for the number of FP registers in a varargs
call - and that isn't used in the kernel.
That removes the 3 'REG' prefixes and lets you use the
2-byte 04-xx instruction to add to %al.
Although I'm sure I remember something about a penalty for
accessing %al just after the full register.
So the 3-byte sign extending add may be better.
David
On Wed, Jul 20, 2022 at 11:13:16PM +0200, Peter Zijlstra wrote:
> On Tue, Jul 19, 2022 at 10:19:18AM -0700, Sami Tolvanen wrote:
>
> > Clang's current CFI implementation is somewhat similar to this. It
> > creates separate thunks for address-taken functions and changes
> > function addresses in C code to point to the thunks instead.
> >
> > While this works, it creates painful situations when interacting with
> > assembly (e.g. a function address taken in assembly cannot be used
> > for indirect calls in C as it doesn't point to the thunk) and needs
> > unpleasant hacks when we want take the actual function address in C
> > (i.e. scattering the code with function_nocfi() calls).
> >
> > I have to agree with Peter on this, I would rather avoid messing with
> > function pointers in KCFI to avoid these issues.
>
> It is either this; and I think I can avoid the worst of it (see below);
> or grow the indirect_callsites to obscure the immediate (as Linus
> suggested), there's around ~16k indirect callsites in a defconfig-ish
> kernel, so growing it isn't too horrible, but it isn't nice either.
>
> The prettiest option to obscure the immediate at the callsite I could
> conjure up is something like:
>
> kcfi_caller_linus:
> movl $0x12345600, %r10d
> movb $0x78, %r10b
> cmpl %r10d, -OFFSET(%r11)
> je 1f
> ud2
> 1: call __x86_thunk_indirect_r11
>
> Which comes to around 22 bytes (+5 over the original).
My very firstest LLVM patch; except it explodes at runtime and I'm not
sure where to start looking...
On top of sami-llvm/kcfi
If I comment out the orl and cmpl it compiles stuff; put either one back
and it explodes with some very unhelpful message.
The idea is the above callthunk that makes any offset work by not having
the full hash as an immediate and allow kCFI along with
-fpatchable-function-entry=N,M and include M in the offset.
Specifically, I meant to use -fpatchable-function-entry=16,16, but alas,
I never got that far.
Help ?
---
diff --git a/llvm/lib/Target/X86/X86AsmPrinter.cpp b/llvm/lib/Target/X86/X86AsmPrinter.cpp
index 5e011d409ee8..ffdb95324da7 100644
--- a/llvm/lib/Target/X86/X86AsmPrinter.cpp
+++ b/llvm/lib/Target/X86/X86AsmPrinter.cpp
@@ -124,23 +124,12 @@ void X86AsmPrinter::emitKCFITypeId(const MachineFunction &MF,
OutStreamer->emitSymbolAttribute(FnSym, MCSA_ELF_TypeFunction);
OutStreamer->emitLabel(FnSym);
- // Emit int3 padding to allow runtime patching of the preamble.
- EmitAndCountInstruction(MCInstBuilder(X86::INT3));
- EmitAndCountInstruction(MCInstBuilder(X86::INT3));
-
// Embed the type hash in the X86::MOV32ri instruction to avoid special
// casing object file parsers.
EmitAndCountInstruction(MCInstBuilder(X86::MOV32ri)
.addReg(X86::EAX)
.addImm(Type->getZExtValue()));
- // The type hash is encoded in the last four bytes of the X86::MOV32ri
- // instruction. Emit additional X86::INT3 padding to ensure the hash is
- // at offset -6 from the function start to avoid potential call target
- // gadgets in checks emitted by X86AsmPrinter::LowerKCFI_CHECK.
- EmitAndCountInstruction(MCInstBuilder(X86::INT3));
- EmitAndCountInstruction(MCInstBuilder(X86::INT3));
-
if (MAI->hasDotTypeDotSizeDirective()) {
MCSymbol *EndSym = OutContext.createTempSymbol("cfi_func_end");
OutStreamer->emitLabel(EndSym);
diff --git a/llvm/lib/Target/X86/X86MCInstLower.cpp b/llvm/lib/Target/X86/X86MCInstLower.cpp
index 16c4d2e45970..d72e82f4f63a 100644
--- a/llvm/lib/Target/X86/X86MCInstLower.cpp
+++ b/llvm/lib/Target/X86/X86MCInstLower.cpp
@@ -1340,22 +1340,34 @@ void X86AsmPrinter::LowerKCFI_CHECK(const MachineInstr &MI) {
assert(std::next(MI.getIterator())->isCall() &&
"KCFI_CHECK not followed by a call instruction");
- // The type hash is encoded in the last four bytes of the X86::CMP32mi
- // instruction. If we decided to place the hash immediately before
- // indirect call targets (offset -4), the X86::JCC_1 instruction we'll
- // emit next would be a potential indirect call target as it's preceded
- // by a valid type hash.
- //
- // To avoid generating useful gadgets, X86AsmPrinter::emitKCFITypeId
- // emits the type hash prefix at offset -6, which makes X86::TRAP the
- // only possible target in this instruction sequence.
- EmitAndCountInstruction(MCInstBuilder(X86::CMP32mi)
+ const Function &F = MF->getFunction();
+ unsigned Imm = MI.getOperand(1).getImm();
+ unsigned Num;
+
+ if (F.hasFnAttribute("patchable-function-prefix")) {
+ if (F.getFnAttribute("patchable-function-prefix")
+ .getValueAsString()
+ .getAsInteger(10, Num))
+ Num = 0;
+ }
+
+ // movl $0x12345600, %r10d
+ EmitAndCountInstruction(MCInstBuilder(X86::MOV32ri)
+ .addReg(X86::R10)
+ .addImm(Imm & ~0xff));
+
+ // orl $0x78, %r10d
+ EmitAndCountInstruction(MCInstBuilder(X86::OR32ri8)
+ .addReg(X86::R10)
+ .addImm(Imm & 0xff));
+
+ // cmpl %r10, -off(%r11)
+ EmitAndCountInstruction(MCInstBuilder(X86::CMP32rm)
+ .addReg(X86::R10)
.addReg(MI.getOperand(0).getReg())
.addImm(1)
.addReg(X86::NoRegister)
- .addImm(-6)
- .addReg(X86::NoRegister)
- .addImm(MI.getOperand(1).getImm()));
+ .addImm(-(Num + 4)));
MCSymbol *Pass = OutContext.createTempSymbol();
EmitAndCountInstruction(
On Thu, Jul 21, 2022 at 10:56 AM Peter Zijlstra <[email protected]> wrote:
>
> this seems to work, let me go hack the kernel..
Am I missing something?
Isn't this generating
movl $~IMM,%r10d
negl %r10d
cmpl %r10d,-4(%calldest)
for the sequence?
That seems bogus for two reasons:
(a) 'neg' is not the opposite of '~'. Did you mean 'notl' or did you mean '-'?
Or am I missing something entirely?
(b) since you have that r10 use anyway, why can't you just generate the simpler
movl $-IMM,%r10d
addl -4(%calldest),%r10d
instead? You only need ZF anyway.
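A minimal C model of that add-based check (values are illustrative); ZF after
the addl is exactly the equality test:
#include <assert.h>
#include <stdint.h>

/* stored is the 32-bit word at -4(%calldest), expected is IMM. */
static int kcfi_matches(uint32_t stored, uint32_t expected)
{
	uint32_t acc = 0u - expected;	/* movl $-IMM, %r10d         */

	acc += stored;			/* addl -4(%calldest), %r10d */
	return acc == 0;		/* je/jne only needs ZF      */
}

int main(void)
{
	assert(kcfi_matches(0x12345678, 0x12345678));
	assert(!kcfi_matches(0x12345678, 0x87654321));
	return 0;
}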
Maybe you need to add some "r10 is clobbered" thing, I don't know.
But again: I don't know llvm, so the above is basically me just doing
the "pattern matching monkey" thing.
Linus
On Thu, Jul 21, 2022 at 05:54:38PM +0200, Peter Zijlstra wrote:
> My very firstest LLVM patch; except it explodes at runtime and I'm not
> sure where to start looking...
>
> On top of sami-llvm/kcfi
Thanks Sami!
this seems to work, let me go hack the kernel..
---
clang/lib/Driver/SanitizerArgs.cpp | 12 ---------
llvm/lib/Target/X86/X86AsmPrinter.cpp | 11 --------
llvm/lib/Target/X86/X86MCInstLower.cpp | 47 ++++++++++++++++++++++------------
3 files changed, 31 insertions(+), 39 deletions(-)
diff --git a/clang/lib/Driver/SanitizerArgs.cpp b/clang/lib/Driver/SanitizerArgs.cpp
index 373a74399df0..b6ebc8ad1842 100644
--- a/clang/lib/Driver/SanitizerArgs.cpp
+++ b/clang/lib/Driver/SanitizerArgs.cpp
@@ -719,18 +719,6 @@ SanitizerArgs::SanitizerArgs(const ToolChain &TC,
D.Diag(diag::err_drv_argument_not_allowed_with)
<< "-fsanitize=kcfi"
<< lastArgumentForMask(D, Args, SanitizerKind::CFI);
-
- if (Arg *A = Args.getLastArg(options::OPT_fpatchable_function_entry_EQ)) {
- StringRef S = A->getValue();
- unsigned N, M;
- // With -fpatchable-function-entry=N,M, where M > 0,
- // llvm::AsmPrinter::emitFunctionHeader injects nops before the
- // KCFI type identifier, which is currently unsupported.
- if (!S.consumeInteger(10, N) && S.consume_front(",") &&
- !S.consumeInteger(10, M) && M > 0)
- D.Diag(diag::err_drv_argument_not_allowed_with)
- << "-fsanitize=kcfi" << A->getAsString(Args);
- }
}
Stats = Args.hasFlag(options::OPT_fsanitize_stats,
diff --git a/llvm/lib/Target/X86/X86AsmPrinter.cpp b/llvm/lib/Target/X86/X86AsmPrinter.cpp
index 5e011d409ee8..ffdb95324da7 100644
--- a/llvm/lib/Target/X86/X86AsmPrinter.cpp
+++ b/llvm/lib/Target/X86/X86AsmPrinter.cpp
@@ -124,23 +124,12 @@ void X86AsmPrinter::emitKCFITypeId(const MachineFunction &MF,
OutStreamer->emitSymbolAttribute(FnSym, MCSA_ELF_TypeFunction);
OutStreamer->emitLabel(FnSym);
- // Emit int3 padding to allow runtime patching of the preamble.
- EmitAndCountInstruction(MCInstBuilder(X86::INT3));
- EmitAndCountInstruction(MCInstBuilder(X86::INT3));
-
// Embed the type hash in the X86::MOV32ri instruction to avoid special
// casing object file parsers.
EmitAndCountInstruction(MCInstBuilder(X86::MOV32ri)
.addReg(X86::EAX)
.addImm(Type->getZExtValue()));
- // The type hash is encoded in the last four bytes of the X86::MOV32ri
- // instruction. Emit additional X86::INT3 padding to ensure the hash is
- // at offset -6 from the function start to avoid potential call target
- // gadgets in checks emitted by X86AsmPrinter::LowerKCFI_CHECK.
- EmitAndCountInstruction(MCInstBuilder(X86::INT3));
- EmitAndCountInstruction(MCInstBuilder(X86::INT3));
-
if (MAI->hasDotTypeDotSizeDirective()) {
MCSymbol *EndSym = OutContext.createTempSymbol("cfi_func_end");
OutStreamer->emitLabel(EndSym);
diff --git a/llvm/lib/Target/X86/X86MCInstLower.cpp b/llvm/lib/Target/X86/X86MCInstLower.cpp
index 16c4d2e45970..4ed23348aa7c 100644
--- a/llvm/lib/Target/X86/X86MCInstLower.cpp
+++ b/llvm/lib/Target/X86/X86MCInstLower.cpp
@@ -1340,22 +1340,37 @@ void X86AsmPrinter::LowerKCFI_CHECK(const MachineInstr &MI) {
assert(std::next(MI.getIterator())->isCall() &&
"KCFI_CHECK not followed by a call instruction");
- // The type hash is encoded in the last four bytes of the X86::CMP32mi
- // instruction. If we decided to place the hash immediately before
- // indirect call targets (offset -4), the X86::JCC_1 instruction we'll
- // emit next would be a potential indirect call target as it's preceded
- // by a valid type hash.
- //
- // To avoid generating useful gadgets, X86AsmPrinter::emitKCFITypeId
- // emits the type hash prefix at offset -6, which makes X86::TRAP the
- // only possible target in this instruction sequence.
- EmitAndCountInstruction(MCInstBuilder(X86::CMP32mi)
- .addReg(MI.getOperand(0).getReg())
- .addImm(1)
- .addReg(X86::NoRegister)
- .addImm(-6)
- .addReg(X86::NoRegister)
- .addImm(MI.getOperand(1).getImm()));
+ const Function &F = MF->getFunction();
+ unsigned Imm = MI.getOperand(1).getImm();
+ unsigned Num = 0;
+
+ if (F.hasFnAttribute("patchable-function-prefix")) {
+ if (F.getFnAttribute("patchable-function-prefix")
+ .getValueAsString()
+ .getAsInteger(10, Num))
+ Num = 0;
+ }
+
+ // movl $(~0x12345678), %r10d
+ EmitAndCountInstruction(MCInstBuilder(X86::MOV32ri)
+ .addReg(X86::R10D) // dst
+ .addImm(~Imm));
+
+ // negl %r10d
+ EmitAndCountInstruction(MCInstBuilder(X86::NEG32r)
+ .addReg(X86::R10D) // dst
+ .addReg(X86::R10D) // src
+ );
+
+ // cmpl %r10d, -off(%r11)
+ EmitAndCountInstruction(MCInstBuilder(X86::CMP32mr)
+ .addReg(MI.getOperand(0).getReg()) // base
+ .addImm(0) // scale
+ .addReg(0) // index
+ .addImm(-(Num+4)) // offset
+ .addReg(0) // segment
+ .addReg(X86::R10D) // reg
+ );
MCSymbol *Pass = OutContext.createTempSymbol();
EmitAndCountInstruction(
On Thu, Jul 21, 2022 at 11:06:42AM -0700, Linus Torvalds wrote:
> On Thu, Jul 21, 2022 at 10:56 AM Peter Zijlstra <[email protected]> wrote:
> >
> > this seems to work, let me go hack the kernel..
>
> Am I missing something?
>
> Isn't this generating
>
> movl $~IMM,%r10d
> negl %r10d
> cmpl %r10d,-4(%calldest)
>
> for the sequence?
>
> That seems bogus for two reasons:
>
> (a) 'neg' is not the opposite of '~'. Did you mean 'notl' or did you mean '-'?
>
> Or am I missing something entirely?
No, you're right, I'm being daft again.
> (b) since you have that r10 use anyway, why can't you just generate the simpler
>
> movl $-IMM,%r10d
> addl -4(%calldest),%r10d
>
> instead? You only need ZF anyway.
Right, lemme see if I can wrangle llvm to generate that.
> Maybe you need to add some "r10 is clobbered" thing, I don't know.
R10 and R11 are caller-saved, and since this is the actual call site, the
caller must already have saved them or marked them clobbered.
On Thu, Jul 21, 2022 at 11:27 AM Peter Zijlstra <[email protected]> wrote:
>
> R11,R11 are caller-saved, and since this is the actual call site, the
> caller must already have saved them or marked them clobbered.
Ok. I don't know the context, but I was thinking along the lines of
the same hash value perhaps being used multiple times because it has
the same function type. Then using the "addl" trick means that the
hash value in %r10 will be changing and cannot be re-used.
But I guess this is *much* too late for those kinds of optimizations,
as it literally just outputs the raw instruction sequence, and so the
(negated) hash value will always be re-generated anyway, no re-use
possible.
Linus
> Ok. I don't know the context, but I was thinking along the lines of
> the same hash value perhaps being used multiple times because it has
> the same function type. Then using the "addl" trick means that the
> hash value in %r10 will be changing and cannot be re-used.
FWIW, even if the %r10 value was not being destroyed by the "addl", the
call right after the check implies that you cannot trust the contents of
%r10 anymore (it may have been messed up within the called function).
From: Linus Torvalds
> Sent: 21 July 2022 19:07
...
> (b) since you have that r10 use anyway, why can't you just generate the simpler
>
> movl $-IMM,%r10d
> addl -4(%calldest),%r10d
>
> instead? You only need ZF anyway.
>
> Maybe you need to add some "r10 is clobbered" thing, I don't know.
>
> But again: I don't know llvm, so the above is basically me just doing
> the "pattern matching monkey" thing.
>
> Linus
Since: "If the callee is a variadic function, then the number of floating
point arguments passed to the function in vector registers must be provided
by the caller in the AL register."
And since that never happens in the kernel, you can use %eax instead
of %r10d.
Even in userspace %al can be set non-zero after the signature check.
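For reference, the %al rule only matters for variadic calls like the userspace
sketch below; the kernel never makes FP varargs calls, so the count would
always be zero there:
#include <stdio.h>

int main(void)
{
	/* printf() is variadic: the SysV x86-64 ABI makes the caller put the
	 * number of vector-register arguments in %al, so compilers emit
	 * something like "movl $1, %eax" (or "xorl %eax,%eax" when it is 0)
	 * right before the call. */
	printf("%f\n", 1.5);	/* one FP argument, so %al = 1 here */
	return 0;
}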
If you are willing to cut the signature down to 26 bits, ensure that one
of the bytes of -IMM (or ~IMM if you use xor) is 0xcc, and jump back to
that byte on error, the check becomes:
movl $-IMM,%eax
1: addl -4(%calldest),%eax
jnz 1b-1 // or -2, -3, -4
add $num_fp_args,%eax // If needed non-zero
call %calldest
I think that adds 10 bytes to the call site.
Although with retpoline thunks (and no fp varargs calls)
all but the initial movl can go into the thunk.
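A rough C model of the constraint that makes the jump-back trick work; the
hash below is made up, and restricting hashes so that such a 0xcc byte exists
is what costs the signature bits:
#include <stdint.h>
#include <stdio.h>

/* Return which byte (0..3, low to high) of the "movl $-IMM,%eax" immediate
 * is 0xcc, i.e. an int3 the failing jnz can land on, or -1 if there is none. */
static int int3_byte(uint32_t imm)
{
	uint32_t neg = 0u - imm;
	int i;

	for (i = 0; i < 4; i++)
		if (((neg >> (8 * i)) & 0xff) == 0xcc)
			return i;
	return -1;
}

int main(void)
{
	uint32_t imm = 0x12345734;	/* made up: -imm == 0xedcba8cc */
	int i = int3_byte(imm);

	if (i >= 0)
		/* Byte i of the immediate sits 4 - i bytes before the addl
		 * at label 1, so the error path is "jnz 1b-(4-i)". */
		printf("jnz 1b-%d lands on the int3 byte\n", 4 - i);
	return 0;
}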
David
On Thu, Jul 21, 2022 at 08:27:12PM +0200, Peter Zijlstra wrote:
> On Thu, Jul 21, 2022 at 11:06:42AM -0700, Linus Torvalds wrote:
> > On Thu, Jul 21, 2022 at 10:56 AM Peter Zijlstra <[email protected]> wrote:
> > >
> > > this seems to work, let me go hack the kernel..
> >
> > Am I missing something?
> >
> > Isn't this generating
> >
> > movl $~IMM,%r10d
> > negl %r10d
> > cmpl %r10d,-4(%calldest)
> >
> > for the sequence?
> >
> > That seems bogus for two reasons:
> >
> > (a) 'neg' is not the opposite of '~'. Did you mean 'notl' or did you mean '-'?
> >
> > Or am I missing something entirely?
>
> No, you're right, I'm being daft again.
>
> > (b) since you have that r10 use anyway, why can't you just generate the simpler
> >
> > movl $-IMM,%r10d
> > addl -4(%calldest),%r10d
> >
> > instead? You only need ZF anyway.
>
> Right, lemme see if I can wrangle llvm to generate that.
That looks good to me. I updated my LLVM tree to generate this code
for the checks:
https://github.com/samitolvanen/llvm-project/commits/kcfi
Sami
On Thu, Jul 21, 2022 at 05:16:14PM -0700, Sami Tolvanen wrote:
> That looks good to me. I updated my LLVM tree to generate this code
> for the checks:
>
> https://github.com/samitolvanen/llvm-project/commits/kcfi
Thanks!
The alignment thing you added:
// Emit int3 padding before the type information to maintain alignment.
// The X86::MOV32ri instruction we emit is 5 bytes long.
uint64_t Padding = offsetToAlignment(5, MF.getAlignment());
while (Padding--)
EmitAndCountInstruction(MCInstBuilder(X86::INT3));
Doesn't seem to quite do what we want though.
When I use -fpatchable-function-entry=16,16 we effectively get a 32 byte
prefix on every function:
0000000000000000 <__cfi___traceiter_sched_kthread_stop>:
0: cc int3
1: cc int3
2: cc int3
3: cc int3
4: cc int3
5: cc int3
6: cc int3
7: cc int3
8: cc int3
9: cc int3
a: cc int3
b: b8 26 b1 df 98 mov $0x98dfb126,%eax
10: 90 nop
11: 90 nop
12: 90 nop
13: 90 nop
14: 90 nop
15: 90 nop
16: 90 nop
17: 90 nop
18: 90 nop
19: 90 nop
1a: 90 nop
1b: 90 nop
1c: 90 nop
1d: 90 nop
1e: 90 nop
1f: 90 nop
And given the parameters, that's indeed the only option. However, since
I can scribble over the type bytes just fine when moving to FineIBT and the
whole Skylake depth tracking only needs 10 bytes, I figured I'd try
-fpatchable-function-entry=11,11 instead. But that resulted in a
misaligned entry:
0000000000000000 <__cfi___traceiter_sched_kthread_stop>:
0: cc int3
1: cc int3
2: cc int3
3: cc int3
4: cc int3
5: cc int3
6: cc int3
7: cc int3
8: cc int3
9: cc int3
a: cc int3
b: b8 26 b1 df 98 mov $0x98dfb126,%eax
10: 90 nop
11: 90 nop
12: 90 nop
13: 90 nop
14: 90 nop
15: 90 nop
16: 90 nop
17: 90 nop
18: 90 nop
19: 90 nop
1a: 90 nop
000000000000001b <__traceiter_sched_kthread_stop>:
However, if I change clang like so:
llvm/lib/Target/X86/X86AsmPrinter.cpp | 8 +++++++-
1 file changed, 7 insertions(+), 1 deletion(-)
diff --git a/llvm/lib/Target/X86/X86AsmPrinter.cpp b/llvm/lib/Target/X86/X86AsmPrinter.cpp
index 789597f8ef1a..6c94313a197d 100644
--- a/llvm/lib/Target/X86/X86AsmPrinter.cpp
+++ b/llvm/lib/Target/X86/X86AsmPrinter.cpp
@@ -124,9 +124,15 @@ void X86AsmPrinter::emitKCFITypeId(const MachineFunction &MF,
OutStreamer->emitSymbolAttribute(FnSym, MCSA_ELF_TypeFunction);
OutStreamer->emitLabel(FnSym);
+ int64_t PrefixNops = 0;
+ (void)MF.getFunction()
+ .getFnAttribute("patchable-function-prefix")
+ .getValueAsString()
+ .getAsInteger(10, PrefixNops);
+
// Emit int3 padding before the type information to maintain alignment.
// The X86::MOV32ri instruction we emit is 5 bytes long.
- uint64_t Padding = offsetToAlignment(5, MF.getAlignment());
+ uint64_t Padding = offsetToAlignment(5+PrefixNops, MF.getAlignment());
while (Padding--)
EmitAndCountInstruction(MCInstBuilder(X86::INT3));
Then it becomes:
0000000000000000 <__cfi___traceiter_sched_kthread_stop>:
0: b8 26 b1 df 98 mov $0x98dfb126,%eax
5: 90 nop
6: 90 nop
7: 90 nop
8: 90 nop
9: 90 nop
a: 90 nop
b: 90 nop
c: 90 nop
d: 90 nop
e: 90 nop
f: 90 nop
0000000000000010 <__traceiter_sched_kthread_stop>:
and things are 'good' again, except for functions that don't get a kCFI
preamble; those end up unaligned... I couldn't find where the
patchable-function-prefix nops are generated to fix this up :/
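As a sketch of the preamble-size arithmetic behind the three layouts above
(offset_to_alignment below mimics what offsetToAlignment computes; 16-byte
function alignment assumed):
#include <stdio.h>

/* Bytes needed to pad 'value' up to the next multiple of 'align'
 * (a stand-in for llvm's offsetToAlignment). */
static unsigned pad_to(unsigned value, unsigned align)
{
	return (align - value % align) % align;
}

int main(void)
{
	/* Unpatched: padding only accounts for the 5-byte movl. */
	unsigned pad = pad_to(5, 16);
	printf("entry=16,16: %u int3 + 5 mov + 16 nops = %u bytes\n",
	       pad, pad + 5 + 16);	/* 32, entry stays aligned    */
	printf("entry=11,11: %u int3 + 5 mov + 11 nops = %u bytes\n",
	       pad, pad + 5 + 11);	/* 27, entry ends up at 0x1b  */

	/* Patched: the prefix nops are included in the calculation. */
	pad = pad_to(5 + 11, 16);
	printf("patched 11,11: %u int3 + 5 mov + 11 nops = %u bytes\n",
	       pad, pad + 5 + 11);	/* 16, entry back at 0x10     */
	return 0;
}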
Also, could you perhaps add a switch to suppress ENDBR for functions with
a kCFI preamble?
On Thu, Jul 21, 2022 at 10:01:12PM +0000, David Laight wrote:
> Since: "If the callee is a variadic function, then the number of floating
> point arguments passed to the function in vector registers must be provided
> by the caller in the AL register."
>
> And since that never happens in the kernel, you can use %eax instead
> of %r10d.
Except there's the AMD BTC thing and we should (compiler patch seems
MIA) have an unconditional 'xor %eax,%eax' in front of every function
call.
(The official mitigation strategy was CALL; LFENCE IIRC, but that's so
horrible nobody is actually considering that)
Yes, the suggested sequence ends with rax being zero, but since we start
the speculation before that result is computed that's not good enough I
suspect.
From: Peter Zijlstra
> Sent: 22 July 2022 12:03
>
> On Thu, Jul 21, 2022 at 10:01:12PM +0000, David Laight wrote:
>
> > Since: "If the callee is a variadic function, then the number of floating
> > point arguments passed to the function in vector registers must be provided
> > by the caller in the AL register."
> >
> > And since that never happens in the kernel, you can use %eax instead
> > of %r10d.
>
> Except there's the AMD BTC thing and we should (compiler patch seems
> MIA) have an unconditional: 'xor %eax,%eax' in front of every function
> call.
I've just read https://www.amd.com/system/files/documents/technical-guidance-for-mitigating-branch-type-confusion_v7_20220712.pdf
It doesn't seem to suggest clearing registers except as a vague
'might help' before a function return (to limit what the speculated
code can do).
The only advantage I can think of for 'xor ax,ax' is that it is done as
a register rename - and isn't dependent on older instructions.
So it might reduce some pipeline stalls.
I'm guessing that someone might find a 'gadget' that depends on %eax
and it may be possible to find somewhere that leaves an arbitrary
value in it.
It is also about the only register that isn't live!
> (The official mitigation strategy was CALL; LFENCE IIRC, but that's so
> horrible nobody is actually considering that)
>
> Yes, the suggested sequence ends with rax being zero, but since we start
> the speculation before that result is computed that's not good enough I
> suspect.
The speculated code can't use the 'wrong' %eax value.
The only problem is that reading from -4(%r11) is likely to be a
D$ miss, giving plenty of time for the CPU to execute 'crap'.
But I'm not sure a later 'xor ax,ax' helps.
(OTOH this is all horrid and makes my brain hurt.)
AFAICT with BTC you 'just lose'.
I thought it was bad enough that some CPUs used the BTB for predicted
conditional jumps - but using it to decide 'this must be a branch
instruction' seems especially broken.
Seems the best thing to do with those CPUs is to run an embedded
system with a busybox+buildroot userspace where almost everything
runs as root :-)
David
On Fri, Jul 22, 2022 at 12:23:30PM +0200, Peter Zijlstra wrote:
> and things are 'good' again, except for functions that don't get a kcfi
> preamble, those are unaligned...
One way to fix this would be to just emit an empty KCFI preamble for
non-address-taken functions when patchable-function-prefix > 0, so all
the functions end up with the same alignment.
Note that Clang doesn't keep the function entry aligned with
-fpatchable-function-entry=N,M, where M>0. It generates .p2align 4,
0x90 before the nops, but if you want to maintain alignment for the
entry, you just have to tell it to generate the correct number of
prefix nops.
> I couldn't find where the patchable-function-prefix nops are generated
> to fix this up :/
It's all in AsmPrinter::emitFunctionHeader, look for emitNops.
> Also, could you perhaps add a switch to suppress ENDBR for functions with
> a kCFI preamble?
I'm planning to do that in a follow-up patch. I would rather not add
features that are not critical to the initial patch to avoid further
delays in getting the compiler changes accepted.
Sami
On Mon, 2022-07-18 at 22:44 +0200, Thomas Gleixner wrote:
> On Mon, Jul 18 2022 at 12:51, Linus Torvalds wrote:
> > On Mon, Jul 18, 2022 at 12:30 PM Thomas Gleixner <[email protected]> wrote:
> > > Let the compiler add a 16 byte padding in front of each function entry
> > > point and put the call depth accounting there. That avoids calling out
> > > into the module area and reduces ITLB pressure.
> >
> > Ooh.
> >
> > I actually like this a lot better.
> >
> > Could we just say "use this instead if you have SKL and care about the issue?"
> >
> > I don't hate your module thunk trick, but this does seem *so* much
> > simpler, and if it performs better anyway, it really does seem like
> > the better approach.
>
> Yes, Peter and I started out from avoiding a new compiler and the overhead
> for everyone of putting the padding into the code. We realized only when
> staring at the perf data that this padding in front of the function
> might be an acceptable solution. I did some more tests today on different
> machines with mitigations=off with kernels compiled with and without
> that padding. I couldn't find a single test case where the result was
> outside of the usual noise. But then my tests are definitely incomplete.
>
Here are some performance numbers for FIO running on an SKX server with an
Intel Cold Stream SSD. Padding improves performance significantly.
Tested latest depth tracking code from Thomas:
https://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git/log/?h=depthtracking
(SHA1 714d29e3e7e3faac27142424ae2533163ddd3a46)
The latest gcc patch from Thomas is included at the end.
                                        vs. baseline      vs. baseline
read (kIOPs)             Mean    stdev  mitigations=off   retbleed=off   CPU util
================================================================================
mitigations=off          356.33   6.35       0.00%            7.11%       98.93%
retbleed=off             332.67   5.51      -6.64%            0.00%       99.16%
retbleed=ibrs            242.00   5.57     -32.09%          -27.25%       99.41%
retbleed=stuff (nopad)   281.67   4.62     -20.95%          -15.33%       99.35%
retbleed=stuff (pad)     310.67   0.58     -12.82%           -6.61%       99.29%

read/write                              vs. baseline      vs. baseline
70/30 (kIOPs)            Mean    stdev  mitigations=off   retbleed=off   CPU util
================================================================================
mitigations=off          340.60   8.12       0.00%            4.01%       96.80%
retbleed=off             327.47   8.03      -3.86%            0.00%       97.06%
retbleed=ibrs            239.47   0.75     -29.69%          -26.87%       98.23%
retbleed=stuff (nopad)   275.20   0.69     -19.20%          -15.96%       97.86%
retbleed=stuff (pad)     296.60   2.03     -12.92%           -9.43%       97.14%

                                        vs. baseline      vs. baseline
write (kIOPs)            Mean    stdev  mitigations=off   retbleed=off   CPU util
================================================================================
mitigations=off          299.33   4.04       0.00%            7.16%       93.51%
retbleed=off             279.33   7.51      -6.68%            0.00%       94.30%
retbleed=ibrs            231.33   0.58     -22.72%          -17.18%       95.84%
retbleed=stuff (nopad)   257.67   0.58     -13.92%           -7.76%       94.96%
retbleed=stuff (pad)     274.67   1.53      -8.24%           -1.67%       94.31%
Tim
gcc patch from Thomas:
---
gcc/config/i386/i386.cc | 13 +++++++++++++
gcc/config/i386/i386.h | 7 +++++++
gcc/config/i386/i386.opt | 4 ++++
gcc/doc/invoke.texi | 6 ++++++
4 files changed, 30 insertions(+)
--- a/gcc/config/i386/i386.cc
+++ b/gcc/config/i386/i386.cc
@@ -6182,6 +6182,19 @@ ix86_code_end (void)
file_end_indicate_split_stack ();
}
+void
+x86_asm_output_function_prefix (FILE *asm_out_file,
+ const char *fnname ATTRIBUTE_UNUSED)
+{
+ if (force_function_padding)
+ {
+ fprintf (asm_out_file, "\t.align %d\n",
+ 1 << force_function_padding);
+ fprintf (asm_out_file, "\t.skip %d,0xcc\n",
+ 1 << force_function_padding);
+ }
+}
+
/* Emit code for the SET_GOT patterns. */
const char *
--- a/gcc/config/i386/i386.h
+++ b/gcc/config/i386/i386.h
@@ -2860,6 +2860,13 @@ extern enum attr_cpu ix86_schedule;
#define LIBGCC2_UNWIND_ATTRIBUTE __attribute__((target ("no-mmx,no-sse")))
#endif
+#include <stdio.h>
+extern void
+x86_asm_output_function_prefix (FILE *asm_out_file,
+ const char *fnname ATTRIBUTE_UNUSED);
+#undef ASM_OUTPUT_FUNCTION_PREFIX
+#define ASM_OUTPUT_FUNCTION_PREFIX x86_asm_output_function_prefix
+
/*
Local variables:
version-control: t
--- a/gcc/config/i386/i386.opt
+++ b/gcc/config/i386/i386.opt
@@ -1064,6 +1064,10 @@ mindirect-branch=
Target RejectNegative Joined Enum(indirect_branch) Var(ix86_indirect_branch) Init(indirect_branch_keep)
Convert indirect call and jump to call and return thunks.
+mforce-function-padding=
+Target Joined UInteger Var(force_function_padding) Init(0) IntegerRange(0, 6)
+Put a 2^$N byte padding area before each function
+
mfunction-return=
Target RejectNegative Joined Enum(indirect_branch) Var(ix86_function_return) Init(indirect_branch_keep)
Convert function return to call and return thunk.
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -1451,6 +1451,7 @@ See RS/6000 and PowerPC Options.
-mindirect-branch=@var{choice} -mfunction-return=@var{choice} @gol
-mindirect-branch-register -mharden-sls=@var{choice} @gol
-mindirect-branch-cs-prefix -mneeded -mno-direct-extern-access}
+-mforce-function-padding @gol
@emph{x86 Windows Options}
@gccoptlist{-mconsole -mcygwin -mno-cygwin -mdll @gol
@@ -32849,6 +32850,11 @@ Force all calls to functions to be indir
when using Intel Processor Trace where it generates more precise timing
information for function calls.
+@item -mforce-function-padding
+@opindex mforce-function-padding
+Force a 16 byte padding area before each function which allows run-time
+code patching to put a special prologue before the function entry.
+
@item -mmanual-endbr
@opindex mmanual-endbr
Insert ENDBR instruction at function entry only via the @code{cf_check}
On Fri, Jul 22, 2022 at 1:11 PM Tim Chen <[email protected]> wrote:
>
> Here are some performance numbers for FIO running on a SKX server with
> Intel Cold Stream SSD. Padding improves performance significantly.
That certainly looks oh-so-much better than those disgusting ibrs numbers.
One thing that I wonder about - gcc already knows about leaf functions
for other reasons (stack setup is often different anyway), and I
wonder if it might be worth looking at making leaf functions avoid the
whole depth counting, and just rely on a regular call/ret.
The whole call chain thing is already not entirely exact and is
counting to a smaller value than the real RSB size.
And leaf functions are generally the smallest and most often called,
so it might be noticeable on profiles and performance numbers to just
say "we already know this is a leaf, there's no reason to increment
the depth for this only to decrement it when it returns".
The main issue is obviously that the return instruction has to be a
non-decrementing one too for the leaf function case, so it's not just
"don't do the counting on entry", it's also a "don't do the usual
rethunk on exit".
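For illustration only, the distinction being talked about is simply:
/* A leaf: it calls nothing, so its call/ret pair is always balanced and
 * it could, in principle, skip the accounting and use a plain ret. */
static int leaf_abs(int x)
{
	return x < 0 ? -x : x;
}

/* Not a leaf: the calls inside mean the usual accounting still applies. */
int sum_abs(int a, int b)
{
	return leaf_abs(a) + leaf_abs(b);
}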
So I just wanted to raise this, not because it's hugely important, but
just to make people think about it - I have these very clear memories
of the whole "don't make leaf functions create a stack frame" having
been a surprisingly big deal for some loads.
Of course, sometimes when I have clear memories, they turn out to be
just some drug-induced confusion in the end. But I know people
experimented with "-fno-omit-frame-pointer -momit-leaf-frame-pointer"
and that it made a big difference (but it caused some issues with pvops
hidden in asm so that gcc incorrectly thought they were leaf functions
when they weren't).
Linus
On Thu, Jul 21 2022 at 17:54, Peter Zijlstra wrote:
> On Wed, Jul 20, 2022 at 11:13:16PM +0200, Peter Zijlstra wrote:
> The idea is the above callthunk, which makes any offset work by not having
> the full hash as an immediate and allows kCFI along with
> -fpatchable-function-entry=N,M by including M in the offset.
>
> Specifically, I meant to use -fpatchable-function-entry=16,16, but alas,
> I never got that far.
That's embarrassing. I missed that option in the maze of gcc
options. That would have spared me staring at gcc code :)