While running the stress_code_patching test from selftests/powerpc/mm against
5.17-rc1 booted on a POWER10 LPAR, the following ftrace warning is seen:
WARNING: CPU: 1 PID: 2017392 at kernel/trace/ftrace.c:2068 ftrace_bug+0x274/0x2d8
Modules linked in: dm_mod bonding rfkill sunrpc pseries_rng xts vmx_crypto uio_pdrv_genirq uio sch_fq_codel ip_tables ext4 mbcache jbd2 sd_mod t10_pi sg ibmvscsi ibmveth scsi_transport_srp fuse
CPU: 1 PID: 2017392 Comm: stress_code_pat Not tainted 5.17.0-rc1-gdd81e1c7d5fb #1
NIP: c0000000002d561c LR: c0000000002d5618 CTR: 00000000005b4448
REGS: c0000000332fb760 TRAP: 0700 Not tainted (5.17.0-rc1-gdd81e1c7d5fb)
MSR: 800000000282b033 <SF,VEC,VSX,EE,FP,ME,IR,DR,RI,LE> CR: 48228224 XER: 00000009
CFAR: c0000000001f6b00 IRQMASK: 0
GPR00: c0000000002d5618 c0000000332fba00 c000000002a20000 0000000000000022
GPR04: 00000000ffff7fff c0000000332fb720 c0000000332fb718 0000000000000027
GPR08: c00000167cca7e10 0000000000000001 0000000000000027 c0000000028d6d08
GPR12: 0000000000008000 c00000167fa30780 0000000040000000 00007fff9a089798
GPR16: 00007fff9a089724 00007fff9a026be8 00007fff99fbf4f0 00007fff9a08d568
GPR20: 00007fffce533ed0 0000000000000001 00007fff9a0399d8 00007fffd9eccf94
GPR24: 0000000000000001 0000000000000000 c0000000332fbc70 c000000000fb0d18
GPR28: c000000000ff5080 c000000000fadd38 c0000000020032ec c0000000070800a8
NIP [c0000000002d561c] ftrace_bug+0x274/0x2d8
LR [c0000000002d5618] ftrace_bug+0x270/0x2d8
Call Trace:
[c0000000332fba00] [c0000000002d5618] ftrace_bug+0x270/0x2d8 (unreliable)
[c0000000332fba90] [c0000000002ceaa8] ftrace_modify_all_code+0x108/0x1c0
[c0000000332fbac0] [c000000000081e58] arch_ftrace_update_code+0x18/0x110
[c0000000332fbae0] [c0000000002cec98] ftrace_run_update_code+0x58/0xe0
[c0000000332fbb10] [c0000000002d3f88] ftrace_startup+0xf8/0x1a0
[c0000000332fbb50] [c0000000002d407c] register_ftrace_function+0x4c/0xc0
[c0000000332fbb80] [c0000000002f7f88] function_trace_init+0x88/0x100
[c0000000332fbbb0] [c0000000002ee058] tracing_set_tracer+0x368/0x550
[c0000000332fbc50] [c0000000002ee358] tracing_set_trace_write+0x118/0x180
[c0000000332fbd10] [c00000000048e4c0] vfs_write+0xf0/0x340
[c0000000332fbd60] [c00000000048e8ec] ksys_write+0x7c/0x140
[c0000000332fbdb0] [c000000000033adc] system_call_exception+0x18c/0x390
[c0000000332fbe10] [c00000000000c64c] system_call_common+0xec/0x270
--- interrupt: c00 at 0x7fff99ccbd74
NIP: 00007fff99ccbd74 LR: 00007fff99c434c4 CTR: 0000000000000000
REGS: c0000000332fbe80 TRAP: 0c00 Not tainted (5.17.0-rc1-gdd81e1c7d5fb)
MSR: 800000000280f033 <SF,VEC,VSX,EE,PR,FP,ME,IR,DR,RI,LE> CR: 28222222 XER: 00000000
IRQMASK: 0
GPR00: 0000000000000004 00007fffd9eccd70 00007fff99dc7100 0000000000000001
GPR04: 00007fffce531750 0000000000000009 0000000000000010 000000006e6f6974
GPR08: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
GPR12: 0000000000000000 00007fff99efaeb0 0000000040000000 00007fff9a089798
GPR16: 00007fff9a089724 00007fff9a026be8 00007fff99fbf4f0 00007fff9a08d568
GPR20: 00007fffce533ed0 0000000000000001 00007fff9a0399d8 00007fffd9eccf94
GPR24: 00007fffd9eccf90 00007fff9a08af94 0000000000000009 00007fffce531750
GPR28: 0000000000000009 00007fff99dc1848 00007fffce531750 0000000000000009
NIP [00007fff99ccbd74] 0x7fff99ccbd74
LR [00007fff99c434c4] 0x7fff99c434c4
--- interrupt: c00
Instruction dump:
48000014 3c62fe5d 38635230 4bf214c1 60000000 7fe3fb78 4bff8875 7c641b78
3c62fe5d 38635248 4bf214a5 60000000 <0fe00000> 38210090 39000001 3d22fd66
---[ end trace 0000000000000000 ]---
The 5.16 kernel was good. Git bisect points to the following patch:
commit 72b3942a173c387b27860ba1069636726e208777
scripts: ftrace - move the sort-processing in ftrace_init
Have attached .config
Thanks
-Sachin
[adding LKML so this is easier for others to find]
If anyone wants to follow the thread from the start, it's at:
https://lore.kernel.org/linuxppc-dev/[email protected]/
Ard, I was under the impression that the 32-bit arm kernel was (virtually)
relocatable, but I couldn't spot where, and suspect I'm mistaken. Do you know
whether it currently does any boot-time dynamic relocation?
Kees, there's an x86_64 relocation question for you at the end.
On Wed, Jan 26, 2022 at 02:37:16PM +0000, Mark Rutland wrote:
> Hi,
>
> Steve pointed me at this thread over IRC -- I'm not subscribed to this list so
> grabbed a copy of the thread thus far via b4.
>
> On Tue, Jan 25, 2022 at 11:20:27AM +0800, Yinan Liu wrote:
> > > Yeah, I think it's time to opt in, instead of opting out.
>
> I agree this must be opt-in rather than opt-out.
>
> However, I think most architectures were broken (in at least some
> configurations) by commit:
>
> 72b3942a173c387b ("scripts: ftrace - move the sort-processing in ftrace_init")
>
> ... and so I don't think this fix is correct as-is, and we might want to revert
> that or at least mark it as BROKEN for now.
Steve asked for a bit more detail on IRC, so the below is an attempt to explain
what's actually going on here.
The short answer is that relocatable kernels (e.g. those with KASLR support)
need to handle the kernel being loaded at (somewhat) arbitrary virtual
addresses. Even where code can be position-independent, any pointers in the
kernel image (e.g. the contents of the mcount_loc table) need to be updated to
account for the specific VA the kernel was loaded at -- arch code does this
early at boot time by applying dynamic (ELF) relocations.
Walking through how we get there, considering arm64 specifically:
1) When an object is created with traceable functions:
The compiler records the addresses of the callsites into a section. Those
are absolute virtual addresses, but the final virtual addresses are not yet
known, so the compiler generates ELF relocations to tell the linker how to
fill these in later.
On arm64, since the compiler doesn't know the final value yet, it fills the
actual values with zero for now. Other architectures might do differently.
For example, per `objdump -r init/main.o`:
| RELOCATION RECORDS FOR [__patchable_function_entries]:
| OFFSET TYPE VALUE
| 0000000000000000 R_AARCH64_ABS64 .text+0x0000000000000028
| 0000000000000008 R_AARCH64_ABS64 .text+0x0000000000000088
| 0000000000000010 R_AARCH64_ABS64 .text+0x00000000000000e8
2) When vmlinux is linked:
The linker script accumulates the callsite pointers from all the object
files into the mcount_loc table. Since the kernel is relocatable, the
runtime absolute addresses are still not yet known, but the offset relative
to the kernel base is known, and so the linker consumes the absolute
relocations created by the compiler and generates new relocations relative
to the kernel's default load address so that these can be adjusted at boot
time.
On arm64, those are RELA and/or RELR relocations; our vmlinux.lds.S
accumulates those into a location in the initdata section that the kernel
can find at boot time:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/arm64/kernel/vmlinux.lds.S?h=v5.17-rc1#n252
For example, per `objdump -s vmlinux -j .rela.dyn`:
| vmlinux: file format elf64-littleaarch64
|
| Contents of section .rela.dyn:
| ffff800009beb0d0 b01a0b09 0080ffff 03040000 00000000 ................
| ffff800009beb0e0 00600b09 0080ffff b81a0b09 0080ffff .`..............
| ffff800009beb0f0 03040000 00000000 80710b09 0080ffff .........q......
| ffff800009beb100 e8670b09 0080ffff 03040000 00000000 .g..............
| ffff800009beb110 48e60809 0080ffff f0670b09 0080ffff H........g......
| ffff800009beb120 03040000 00000000 ec190b09 0080ffff ................
Each of the relocations in .rela.dyn consists of 3 64-bit little-endian
values:
* The first (e.g. 0xffff8000090b1ab0) is the VA of the pointer to write to,
assuming the kernel's default load address (e.g. 0xffff800008000000). An
offset must be applied to this depending on where the kernel was actually
loaded relative to that default load address.
* The second (e.g. 0x0000000000000403) is the ELF relocation type (1027, AKA
R_AARCH64_RELATIVE).
* The third, (e.g. 0xffff8000090b6000) is the VA to write to the pointer,
assuming the kernel's default load address (e.g. 0xffff800008000000). An
offset must be applied to this depending on where the kernel was actually
loaded relative to that default load address.
The AArch64 ELF spec defines our relocations, e.g.
https://github.com/ARM-software/abi-aa/blob/main/aaelf64/aaelf64.rst#5712dynamic-relocations
In general, relocations might depend on the default value that was
initialized (e.g. OR-ing in high bits assuming the low bits are already
correct). I'm not sure if any of the architectures we support uses such
relocations for relocatable kernels, but they could in theory.
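For reference, each entry in the .rela.dyn dump above is just a standard
Elf64_Rela record; a sketch of the layout (per elf.h) with the first example
entry decoded:

/* Standard ELF64 RELA record, as defined in elf.h. */
typedef struct {
        Elf64_Addr   r_offset;   /* where to write, at the default load VA */
        Elf64_Xword  r_info;     /* symbol index (high 32 bits, 0 here) and
                                  * type (low 32 bits, 0x403 == R_AARCH64_RELATIVE) */
        Elf64_Sxword r_addend;   /* value to write, at the default load VA */
} Elf64_Rela;

/* First entry from the dump above, decoded from little-endian:
 *   r_offset = 0xffff8000090b1ab0
 *   r_info   = 0x0000000000000403 (R_AARCH64_RELATIVE)
 *   r_addend = 0xffff8000090b6000
 */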
3) When the vmlinux mcount_loc table is sorted:
The sorttable code sorts the data values within the mcount_loc table, but it
doesn't read the relocations to resolve the VA of each entry, nor does it
update the relocations when it swaps entries.
For arm64, where entries were all initialized to zero at compile time and
have not been updated since, the sort does nothing. When the relocations are
applied later, the result will be an unsorted table.
In general, where sorting *does* swap entries, it doesn't update the
corresponding relocations. Where entries A and B get swapped by the sort,
any relocation(s) for entry A will apply to the location of entry B, and
vice-versa. Depending on the specific relocation used, that may or may not
be a problem (e.g. the example of OR-ing in high bits would be broken).
4) When relocations are applied at boot time:
On arm64, to account for KASLR or any other virtual offset we might have to
account for, we apply the relocations early in boot in the
__relocate_kernel() assembly function:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/arm64/kernel/head.S?h=v5.17-rc1#n753
Since we didn't update the relocations during the build-time sort, and the
mcount_loc table was not sorted to begin with, applying all the relocations
for the mcount_loc table results in an unsorted table. Hence things go
wrong later for any code relying on this having been sorted already.
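Expressed in C rather than the actual assembly, the effect on each such entry
is roughly the following (a sketch of the logic only, not the real
__relocate_kernel() implementation):

/* Rough C equivalent of applying one R_AARCH64_RELATIVE entry; 'offset'
 * is the delta between the VA the kernel was actually loaded at and its
 * default load address.  For the mcount_loc table this writes the final
 * (unsorted) VAs over whatever the build-time sort left there. */
static void apply_relative_rela(const Elf64_Rela *rela, unsigned long offset)
{
        unsigned long *place = (unsigned long *)(rela->r_offset + offset);

        *place = rela->r_addend + offset;
}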
IIUC the s390 and powerpc cases are structurally similar, though the fine
detail might differ a bit.
I'm not sure how x86 works here; AFAICT the relocations are performed during
decompression, but it looks like there's some special build-time processing
associated with that, and the vmlinux doesn't contain standard ELF relocations.
Kees, IIUC you added the x86_64 support there, can you shed any light on if/how
this works on x86?
In practice, building v5.17-rc1 with CONFIG_FTRACE_SORT_STARTUP_TEST=y I see
the following splat:
| ------------[ cut here ]------------
| [14] unknown_bootoption+0x4/0x1c8 at ffffa1a861200738 is not sorted with trace_initcall_finish_cb+0x4/0x6c at ffffa1a85f8130b8
| WARNING: CPU: 0 PID: 0 at kernel/trace/ftrace.c:6403 ftrace_process_locs.isra.0+0x370/0x440
| Modules linked in:
| CPU: 0 PID: 0 Comm: swapper Not tainted 5.17.0-rc1 #2
| Hardware name: linux,dummy-virt (DT)
| pstate: 600000c5 (nZCv daIF -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
| pc : ftrace_process_locs.isra.0+0x370/0x440
| lr : ftrace_process_locs.isra.0+0x370/0x440
| sp : ffffa1a861b23d90
| x29: ffffa1a861b23d90 x28: 0000000000000000 x27: 0000000000000000
| x26: 0000000000000000 x25: 0000000000000000 x24: ffffa1a8612c0008
| x23: ffffa1a861302ea8 x22: ffffa1a861374320 x21: 0000000000000001
| x20: 000000000000e28f x19: 000000000000000e x18: ffffffffffffffff
| x17: 726f7320746f6e20 x16: 7369203833373030 x15: 3231363861316166
| x14: 6666662074612038 x13: 3862303331386635 x12: 3861316166666666
| x11: 2074612063367830 x10: ffffa1a861b4b1d0 x9 : ffffa1a861b4b1d0
| x8 : 00000000ffffefff x7 : ffffa1a861ba31d0 x6 : ffffa1a861ba31d0
| x5 : 000000000000bff4 x4 : 0000000000000000 x3 : 0000000000000000
| x2 : 0000000000000000 x1 : 0000000000000000 x0 : ffffa1a861b33e80
| Call trace:
| ftrace_process_locs.isra.0+0x370/0x440
| ftrace_init+0xb4/0x15c
| start_kernel+0x40c/0x6d4
| __primary_switched+0xc0/0xc8
| ---[ end trace 0000000000000000 ]---
Where those symbol addresses are:
| [mark@lakrids:~/src/linux]% usekorg 11.1.0 aarch64-linux-objdump -t vmlinux | grep -w unknown_bootoption
| ffff800009a00734 l F .init.text 00000000000001c8 unknown_bootoption
| [mark@lakrids:~/src/linux]% usekorg 11.1.0 aarch64-linux-objdump -t vmlinux | grep -w trace_initcall_finish_cb
| ffff8000080130b4 l F .text 0000000000000064 trace_initcall_finish_cb
... and are obviously not sorted.
Further, the ftrace tests fail quite horribly, e.g.
| # ./ftracetest
| === Ftrace unit tests ===
| [1] Basic trace file check [PASS]
| [2] Basic test for tracers[ 38.979280] ------------[ ftrace bug ]------------
| [ 38.980225] ftrace faulted on modifying
| [ 38.980227] [<ffffa8ebbe6003fc>] set_reset_devices+0x8/0x24
| [ 38.982078] Setting ftrace call site to call ftrace function
| [ 38.983160] ftrace record flags: 80000001
| [ 38.983921] (1)
| [ 38.983921] expected tramp: ffffa8ebbcc2ba20
| [ 38.985132] ------------[ cut here ]------------
| [ 38.986013] WARNING: CPU: 3 PID: 265 at kernel/trace/ftrace.c:2068 ftrace_bug+0x284/0x2b4
| [ 38.987649] Modules linked in:
| [ 38.988275] CPU: 3 PID: 265 Comm: ftracetest Tainted: G W 5.17.0-rc1-dirty #4
| [ 38.989979] Hardware name: linux,dummy-virt (DT)
| [ 38.990916] pstate: 60400005 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
| [ 38.992318] pc : ftrace_bug+0x284/0x2b4
| [ 38.993111] lr : ftrace_bug+0x284/0x2b4
| [ 38.993893] sp : ffff80000888bb00
| [ 38.994566] x29: ffff80000888bb00 x28: 0000000000000020 x27: ffffa8ebbefcaa70
| [ 38.996002] x26: ffffa8ebbefcb7e0 x25: ffffa8ebbf36723c x24: ffffa8ebbf364000
| [ 38.997447] x23: 00000000fffffff2 x22: ffffa8ebbe38f6e8 x21: ffffa8ebbef29c78
| [ 38.998885] x20: ffff5ec380080030 x19: ffffa8ebbef29000 x18: ffffffffffffffff
| [ 39.000316] x17: 0000000000000000 x16: 0000000000000000 x15: ffff80008888b827
| [ 39.001758] x14: 0000000000000000 x13: 3032616232636362 x12: 6265386166666666
| [ 39.003202] x11: 203a706d61727420 x10: ffffa8ebbef4b1d0 x9 : ffffa8ebbcd0dff0
| [ 39.004645] x8 : 00000000ffffefff x7 : ffffa8ebbefa31d0 x6 : ffffa8ebbefa31d0
| [ 39.006092] x5 : ffff5ec4befa29d8 x4 : 0000000000000000 x3 : 0000000000000000
| [ 39.007538] x2 : 0000000000000000 x1 : ffff5ec380cf5580 x0 : 0000000000000022
| [ 39.008982] Call trace:
| [ 39.009495] ftrace_bug+0x284/0x2b4
| [ 39.010212] ftrace_replace_code+0x9c/0xa4
| [ 39.011054] ftrace_modify_all_code+0xe4/0x14c
| [ 39.011964] arch_ftrace_update_code+0x18/0x24
| [ 39.012874] ftrace_run_update_code+0x24/0x7c
| [ 39.013770] ftrace_startup+0xf8/0x1b0
| [ 39.014541] register_ftrace_graph+0x2dc/0x324
| [ 39.015449] graph_trace_init+0x6c/0x74
| [ 39.016232] tracing_set_tracer+0xec/0x17c
| [ 39.017071] tracing_set_trace_write+0xe8/0x150
| [ 39.017989] vfs_write+0xfc/0x2a0
| [ 39.018671] ksys_write+0x74/0x100
| [ 39.019370] __arm64_sys_write+0x28/0x3c
| [ 39.020172] invoke_syscall+0x50/0x120
| [ 39.020947] el0_svc_common.constprop.0+0xdc/0x100
| [ 39.021935] do_el0_svc+0x34/0xa0
| [ 39.022620] el0_svc+0x28/0x80
| [ 39.023253] el0t_64_sync_handler+0xa8/0x130
| [ 39.024121] el0t_64_sync+0x1a0/0x1a4
| [ 39.024874] ---[ end trace 0000000000000000 ]---
| [PASS]
| [3] Basic trace clock test [FAIL]
| [4] Basic event tracing check [FAIL]
| [5] Change the ringbuffer size [FAIL]
Thanks,
Mark.
>
> More on that below.
>
> > >
> > > Something like this:
> > >
> > > -- Steve
> > >
> > > diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
> > > index c2724d986fa0..5256ebe57451 100644
> > > --- a/arch/arm/Kconfig
> > > +++ b/arch/arm/Kconfig
> > > @@ -82,6 +82,7 @@ config ARM
> > > select HAVE_EBPF_JIT if !CPU_ENDIAN_BE32
> > > select HAVE_CONTEXT_TRACKING
> > > select HAVE_C_RECORDMCOUNT
> > > + select HAVE_BUILDTIME_MCOUNT_SORT
> > > select HAVE_DEBUG_KMEMLEAK if !XIP_KERNEL
> > > select HAVE_DMA_CONTIGUOUS if MMU
> > > select HAVE_DYNAMIC_FTRACE if !XIP_KERNEL && !CPU_ENDIAN_BE32 && MMU
>
> IIUC the 32-bit arm kernel can be relocated at boot time, so I don't believe
> this is correct, and I believe any relocatable arm kernel has been broken since
> that was introduced.
>
> > > diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> > > index c4207cf9bb17..7996548b2b27 100644
> > > --- a/arch/arm64/Kconfig
> > > +++ b/arch/arm64/Kconfig
> > > @@ -166,6 +166,7 @@ config ARM64
> > > select HAVE_ASM_MODVERSIONS
> > > select HAVE_EBPF_JIT
> > > select HAVE_C_RECORDMCOUNT
> > > + select HAVE_BUILDTIME_MCOUNT_SORT
> > > select HAVE_CMPXCHG_DOUBLE
> > > select HAVE_CMPXCHG_LOCAL
> > > select HAVE_CONTEXT_TRACKING
>
> The arm64 kernel is relocatable by default, and has been broken since the
> build-time sort was introduced -- I see ftrace test failures, and the
> CONFIG_FTRACE_SORT_STARTUP_TEST screams at boot time.
>
> > > diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> > > index 7399327d1eff..46080dea5dba 100644
> > > --- a/arch/x86/Kconfig
> > > +++ b/arch/x86/Kconfig
> > > @@ -186,6 +186,7 @@ config X86
> > > select HAVE_CONTEXT_TRACKING_OFFSTACK if HAVE_CONTEXT_TRACKING
> > > select HAVE_C_RECORDMCOUNT
> > > select HAVE_OBJTOOL_MCOUNT if STACK_VALIDATION
> > > + select HAVE_BUILDTIME_MCOUNT_SORT
>
> Isn't x86 relocatable in some configurations (e.g. for KASLR)?
>
> I can't see how the sort works for those cases, because the mcount_loc entries
> are absolute, and either:
>
> * The sorted entries will get overwritten by the unsorted relocation entries,
> and won't be sorted.
>
> * The sorted entries won't get overwritten, but then the absolute address will
> be wrong since they hadn't been relocated.
>
> How does that work?
>
> Thanks,
> Mark.
>
> > > select HAVE_DEBUG_KMEMLEAK
> > > select HAVE_DMA_CONTIGUOUS
> > > select HAVE_DYNAMIC_FTRACE
> > > diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
> > > index 752ed89a293b..7e5b92090faa 100644
> > > --- a/kernel/trace/Kconfig
> > > +++ b/kernel/trace/Kconfig
> > > @@ -70,10 +70,16 @@ config HAVE_C_RECORDMCOUNT
> > > help
> > > C version of recordmcount available?
> > > +config HAVE_BUILDTIME_MCOUNT_SORT
> > > + bool
> > > + help
> > > + An architecture selects this if it sorts the mcount_loc section
> > > + at build time.
> > > +
> > > config BUILDTIME_MCOUNT_SORT
> > > bool
> > > default y
> > > - depends on BUILDTIME_TABLE_SORT && !S390
> > > + depends on HAVE_BUILDTIME_MCOUNT_SORT
> > > help
> > > Sort the mcount_loc section at build time.
> >
> > LGTM. This will no longer destroy ftrace on other architectures.
> > Those arches that we are not sure about can test and enable this function by
> > themselves.
> >
> >
> > Best regards
> > --yinan
> >
On Thu, 27 Jan 2022 at 12:47, Mark Rutland <[email protected]> wrote:
>
> [adding LKML so this is easier for others to find]
>
> If anyone wants to follow the thread from the start, it's at:
>
> https://lore.kernel.org/linuxppc-dev/[email protected]/
>
> Ard, I was under the impression that the 32-bit arm kernel was (virtually)
> relocatable, but I couldn't spot where, and suspect I'm mistaken. Do you know
> whether it currently does any boot-time dynamic relocation?
>
No, it does not.
..
> > Steve pointed me at this thread over IRC -- I'm not subscribed to this list so
> > grabbed a copy of the thread thus far via b4.
> >
> > On Tue, Jan 25, 2022 at 11:20:27AM +0800, Yinan Liu wrote:
> > > > Yeah, I think it's time to opt in, instead of opting out.
> >
> > I agree this must be opt-in rather than opt-out.
> >
> > However, I think most architectures were broken (in at least some
> > configurations) by commit:
> >
> > 72b3942a173c387b ("scripts: ftrace - move the sort-processing in ftrace_init")
> >
> > ... and so I don't think this fix is correct as-is, and we might want to revert
> > that or at least mark it as BROKEN for now.
>
> Steve asked for a bit more detail on IRC, so the below is an attempt to explain
> what's actually going on here.
>
> The short answer is that relocatable kernels (e.g. those with KASLR support)
> need to handle the kernel being loaded at (somewhat) arbitrary virtual
> addresses. Even where code can be position-independent, any pointers in the
> kernel image (e.g. the contents of the mcount_loc table) need to be updated to
> account for the specific VA the kernel was loaded at -- arch code does this
> early at boot time by applying dynamic (ELF) relocations.
>
These architectures use place-relative extables for the same reason:
place relative references are resolved at build time rather than at
runtime during relocation, making a build time sort feasible.
arch/alpha/include/asm/extable.h:#define ARCH_HAS_RELATIVE_EXTABLE
arch/arm64/include/asm/extable.h:#define ARCH_HAS_RELATIVE_EXTABLE
arch/ia64/include/asm/extable.h:#define ARCH_HAS_RELATIVE_EXTABLE
arch/parisc/include/asm/uaccess.h:#define ARCH_HAS_RELATIVE_EXTABLE
arch/powerpc/include/asm/extable.h:#define ARCH_HAS_RELATIVE_EXTABLE
arch/riscv/include/asm/extable.h:#define ARCH_HAS_RELATIVE_EXTABLE
arch/s390/include/asm/extable.h:#define ARCH_HAS_RELATIVE_EXTABLE
arch/x86/include/asm/extable.h:#define ARCH_HAS_RELATIVE_EXTABLE
Note that the swap routine becomes something like the below, given
that the relative references need to be fixed up after the entry
changes place in the sorted list.
static void swap_ex(void *a, void *b, int size)
{
        struct exception_table_entry *x = a, *y = b, tmp;
        int delta = b - a;

        tmp = *x;
        x->insn = y->insn + delta;
        y->insn = tmp.insn - delta;
        ...
}
As a bonus, the resulting footprint of the table in the image is
reduced by 8x, given that every 8 byte pointer has an accompanying 24
byte RELA record, so we go from 32 bytes to 4 bytes for every call to
__gnu_mcount_mc.
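For illustration, a place-relative entry stores a signed 32-bit offset from
its own location, so no dynamic relocation is needed at all and a build-time
sort (with the swap fixup above) stays valid; a minimal sketch, not the exact
kernel definitions:

/* Minimal sketch of a place-relative table entry. */
struct rel_entry {
        int offset;     /* signed offset from &entry->offset to the target */
};

static inline unsigned long rel_entry_addr(const struct rel_entry *e)
{
        /* The reference is relative to its own location, so it holds its
         * final value at link time and needs no boot-time fixup. */
        return (unsigned long)&e->offset + e->offset;
}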
> Walking through how we get there, considering arm64 specifically:
>
> 1) When an object is created with traceable functions:
>
> The compiler records the addresses of the callsites into a section. Those
> are absolute virtual addresses, but the final virtual addresses are not yet
> known, so the compiler generates ELF relocations to tell the linker how to
> fill these in later.
>
> On arm64, since the compiler doesn't know the final value yet, it fills the
> actual values with zero for now. Other architectures might do differently.
>
> For example, per `objdump -r init/main.o`:
>
> | RELOCATION RECORDS FOR [__patchable_function_entries]:
> | OFFSET TYPE VALUE
> | 0000000000000000 R_AARCH64_ABS64 .text+0x0000000000000028
> | 0000000000000008 R_AARCH64_ABS64 .text+0x0000000000000088
> | 0000000000000010 R_AARCH64_ABS64 .text+0x00000000000000e8
>
> 2) When vmlinux is linked:
>
> The linker script accumulates the callsite pointers from all the object
> files into the mcount_loc table. Since the kernel is relocatable, the
> runtime absolute addresses are still not yet known, but the offset relative
> to the kernel base is known, and so the linker consumes the absolute
> relocations created by the compiler and generates new relocations relative
> to the kernel's default load address so that these can be adjusted at boot
> time.
>
> On arm64, those are RELA and/or RELR relocations; our vmlinux.lds.S
> accumulates those into a location in the initdata section that the kernel
> can find at boot time:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/arm64/kernel/vmlinux.lds.S?h=v5.17-rc1#n252
>
> For example, per `objdump -s vmlinux -j .rela.dyn`:
>
> | vmlinux: file format elf64-littleaarch64
> |
> | Contents of section .rela.dyn:
> | ffff800009beb0d0 b01a0b09 0080ffff 03040000 00000000 ................
> | ffff800009beb0e0 00600b09 0080ffff b81a0b09 0080ffff .`..............
> | ffff800009beb0f0 03040000 00000000 80710b09 0080ffff .........q......
> | ffff800009beb100 e8670b09 0080ffff 03040000 00000000 .g..............
> | ffff800009beb110 48e60809 0080ffff f0670b09 0080ffff H........g......
> | ffff800009beb120 03040000 00000000 ec190b09 0080ffff ................
>
> Each of the relocations in .rela.dyn consists of 3 64-bit little-endian
> values:
>
> * The first (e.g. 0xffff8000090b1ab0) is the VA of the pointer to write to,
> assuming the kernel's default load address (e.g. 0xffff800008000000). An
> offset must be applied to this depending on where the kernel was actually
> loaded relative to that default load address.
>
> * The second (e.g. 0x0000000000000403) is the ELF relocation type (1027, AKA
> R_AARCH64_RELATIVE).
>
> * The third, (e.g. 0xffff8000090b6000) is the VA to write to the pointer,
> assuming the kernel's default load address (e.g. 0xffff800008000000). An
> offset must be applied to this depending on where the kernel was actually
> loaded relative to that default load address.
>
> The AArch64 ELF spec defines our relocations, e.g.
>
> https://github.com/ARM-software/abi-aa/blob/main/aaelf64/aaelf64.rst#5712dynamic-relocations
>
> In general, relocations might depend on the default value that was
> initialized (e.g. OR-ing in high bits assuming the low bits are already
> correct). I'm not sure if any of the architectures we support uses such
> relocations for relocatable kernels, but they could in theory.
>
> 3) When the vmlinux mcount_loc table is sorted:
>
> The sorttable code sorts the data values within the mcount_loc table, but it
> doesn't read the relocations to resolve the VA of each entry, nor does it
> update the relocations when it swaps entries.
>
> For arm64, where entries were all initialized to zero at compile time and
> have not been updated since, the sort does nothing. When the relocations are
> applied later, the result will be an unsorted table.
>
> In general, where sorting *does* swap entries, it doesn't update the
> corresponding relocations. Where entries A and B get swapped by the sort,
> any relocation(s) for entry A will apply to the location of entry B, and
> vice-versa. Depending on the specific relocation used, that may or may not
> be a problem (e.g. the example of OR-ing in high bits would be broken).
>
> 4) When relocations are applied at boot time:
>
> On arm64, to account for KASLR or any other virtual offset we might have to
> account for, we apply the relocations early in boot in the
> __relocate_kernel() assembly function:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/arm64/kernel/head.S?h=v5.17-rc1#n753
>
> Since we didn't update the relocations during the build-time sort, and the
> mcount_loc table was not sorted to begin with, applying all the relocations
> for the mcount_loc table results in an unsorted table. Hence things go
> wrong later for any code relying on this having been sorted already.
>
> IIUC the s390 and powerpc cases are structurally similar, though the fine
> detail might differ a bit.
>
> I'm not sure how x86 works here; AFAICT the relocations are performed during
> decompression, but it looks like there's some special build-time processing
> associated with that, and the vmlinux doesn't contain standard ELF relocations.
>
> Kees, IIUC you added the x86_64 support there, can you shed any light on if/how
> this works on x86?
>
>
>
> In practice, building v5.17-rc1 with CONFIG_FTRACE_SORT_STARTUP_TEST=y I see
> the following splat:
>
> | ------------[ cut here ]------------
> | [14] unknown_bootoption+0x4/0x1c8 at ffffa1a861200738 is not sorted with trace_initcall_finish_cb+0x4/0x6c at ffffa1a85f8130b8
> | WARNING: CPU: 0 PID: 0 at kernel/trace/ftrace.c:6403 ftrace_process_locs.isra.0+0x370/0x440
> | Modules linked in:
> | CPU: 0 PID: 0 Comm: swapper Not tainted 5.17.0-rc1 #2
> | Hardware name: linux,dummy-virt (DT)
> | pstate: 600000c5 (nZCv daIF -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> | pc : ftrace_process_locs.isra.0+0x370/0x440
> | lr : ftrace_process_locs.isra.0+0x370/0x440
> | sp : ffffa1a861b23d90
> | x29: ffffa1a861b23d90 x28: 0000000000000000 x27: 0000000000000000
> | x26: 0000000000000000 x25: 0000000000000000 x24: ffffa1a8612c0008
> | x23: ffffa1a861302ea8 x22: ffffa1a861374320 x21: 0000000000000001
> | x20: 000000000000e28f x19: 000000000000000e x18: ffffffffffffffff
> | x17: 726f7320746f6e20 x16: 7369203833373030 x15: 3231363861316166
> | x14: 6666662074612038 x13: 3862303331386635 x12: 3861316166666666
> | x11: 2074612063367830 x10: ffffa1a861b4b1d0 x9 : ffffa1a861b4b1d0
> | x8 : 00000000ffffefff x7 : ffffa1a861ba31d0 x6 : ffffa1a861ba31d0
> | x5 : 000000000000bff4 x4 : 0000000000000000 x3 : 0000000000000000
> | x2 : 0000000000000000 x1 : 0000000000000000 x0 : ffffa1a861b33e80
> | Call trace:
> | ftrace_process_locs.isra.0+0x370/0x440
> | ftrace_init+0xb4/0x15c
> | start_kernel+0x40c/0x6d4
> | __primary_switched+0xc0/0xc8
> | ---[ end trace 0000000000000000 ]---
>
> Where those symbol addresses are:
>
> | [mark@lakrids:~/src/linux]% usekorg 11.1.0 aarch64-linux-objdump -t vmlinux | grep -w unknown_bootoption
> | ffff800009a00734 l F .init.text 00000000000001c8 unknown_bootoption
> | [mark@lakrids:~/src/linux]% usekorg 11.1.0 aarch64-linux-objdump -t vmlinux | grep -w trace_initcall_finish_cb
> | ffff8000080130b4 l F .text 0000000000000064 trace_initcall_finish_cb
>
> ... and are obviously not sorted.
>
> Further, the ftrace tests fail quite horribly, e.g.
>
> | # ./ftracetest
> | === Ftrace unit tests ===
> | [1] Basic trace file check [PASS]
> | [2] Basic test for tracers[ 38.979280] ------------[ ftrace bug ]------------
> | [ 38.980225] ftrace faulted on modifying
> | [ 38.980227] [<ffffa8ebbe6003fc>] set_reset_devices+0x8/0x24
> | [ 38.982078] Setting ftrace call site to call ftrace function
> | [ 38.983160] ftrace record flags: 80000001
> | [ 38.983921] (1)
> | [ 38.983921] expected tramp: ffffa8ebbcc2ba20
> | [ 38.985132] ------------[ cut here ]------------
> | [ 38.986013] WARNING: CPU: 3 PID: 265 at kernel/trace/ftrace.c:2068 ftrace_bug+0x284/0x2b4
> | [ 38.987649] Modules linked in:
> | [ 38.988275] CPU: 3 PID: 265 Comm: ftracetest Tainted: G W 5.17.0-rc1-dirty #4
> | [ 38.989979] Hardware name: linux,dummy-virt (DT)
> | [ 38.990916] pstate: 60400005 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> | [ 38.992318] pc : ftrace_bug+0x284/0x2b4
> | [ 38.993111] lr : ftrace_bug+0x284/0x2b4
> | [ 38.993893] sp : ffff80000888bb00
> | [ 38.994566] x29: ffff80000888bb00 x28: 0000000000000020 x27: ffffa8ebbefcaa70
> | [ 38.996002] x26: ffffa8ebbefcb7e0 x25: ffffa8ebbf36723c x24: ffffa8ebbf364000
> | [ 38.997447] x23: 00000000fffffff2 x22: ffffa8ebbe38f6e8 x21: ffffa8ebbef29c78
> | [ 38.998885] x20: ffff5ec380080030 x19: ffffa8ebbef29000 x18: ffffffffffffffff
> | [ 39.000316] x17: 0000000000000000 x16: 0000000000000000 x15: ffff80008888b827
> | [ 39.001758] x14: 0000000000000000 x13: 3032616232636362 x12: 6265386166666666
> | [ 39.003202] x11: 203a706d61727420 x10: ffffa8ebbef4b1d0 x9 : ffffa8ebbcd0dff0
> | [ 39.004645] x8 : 00000000ffffefff x7 : ffffa8ebbefa31d0 x6 : ffffa8ebbefa31d0
> | [ 39.006092] x5 : ffff5ec4befa29d8 x4 : 0000000000000000 x3 : 0000000000000000
> | [ 39.007538] x2 : 0000000000000000 x1 : ffff5ec380cf5580 x0 : 0000000000000022
> | [ 39.008982] Call trace:
> | [ 39.009495] ftrace_bug+0x284/0x2b4
> | [ 39.010212] ftrace_replace_code+0x9c/0xa4
> | [ 39.011054] ftrace_modify_all_code+0xe4/0x14c
> | [ 39.011964] arch_ftrace_update_code+0x18/0x24
> | [ 39.012874] ftrace_run_update_code+0x24/0x7c
> | [ 39.013770] ftrace_startup+0xf8/0x1b0
> | [ 39.014541] register_ftrace_graph+0x2dc/0x324
> | [ 39.015449] graph_trace_init+0x6c/0x74
> | [ 39.016232] tracing_set_tracer+0xec/0x17c
> | [ 39.017071] tracing_set_trace_write+0xe8/0x150
> | [ 39.017989] vfs_write+0xfc/0x2a0
> | [ 39.018671] ksys_write+0x74/0x100
> | [ 39.019370] __arm64_sys_write+0x28/0x3c
> | [ 39.020172] invoke_syscall+0x50/0x120
> | [ 39.020947] el0_svc_common.constprop.0+0xdc/0x100
> | [ 39.021935] do_el0_svc+0x34/0xa0
> | [ 39.022620] el0_svc+0x28/0x80
> | [ 39.023253] el0t_64_sync_handler+0xa8/0x130
> | [ 39.024121] el0t_64_sync+0x1a0/0x1a4
> | [ 39.024874] ---[ end trace 0000000000000000 ]---
> | [PASS]
> | [3] Basic trace clock test [FAIL]
> | [4] Basic event tracing check [FAIL]
> | [5] Change the ringbuffer size [FAIL]
>
> Thanks,
> Mark.
>
> >
> > More on that below.
> >
> > > >
> > > > Something like this:
> > > >
> > > > -- Steve
> > > >
> > > > diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
> > > > index c2724d986fa0..5256ebe57451 100644
> > > > --- a/arch/arm/Kconfig
> > > > +++ b/arch/arm/Kconfig
> > > > @@ -82,6 +82,7 @@ config ARM
> > > > select HAVE_EBPF_JIT if !CPU_ENDIAN_BE32
> > > > select HAVE_CONTEXT_TRACKING
> > > > select HAVE_C_RECORDMCOUNT
> > > > + select HAVE_BUILDTIME_MCOUNT_SORT
> > > > select HAVE_DEBUG_KMEMLEAK if !XIP_KERNEL
> > > > select HAVE_DMA_CONTIGUOUS if MMU
> > > > select HAVE_DYNAMIC_FTRACE if !XIP_KERNEL && !CPU_ENDIAN_BE32 && MMU
> >
> > IIUC the 32-bit arm kernel can be relocated at boot time, so I don't believe
> > this is correct, and I believe any relocatable arm kernel has been broken since
> > that was introduced.
> >
> > > > diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> > > > index c4207cf9bb17..7996548b2b27 100644
> > > > --- a/arch/arm64/Kconfig
> > > > +++ b/arch/arm64/Kconfig
> > > > @@ -166,6 +166,7 @@ config ARM64
> > > > select HAVE_ASM_MODVERSIONS
> > > > select HAVE_EBPF_JIT
> > > > select HAVE_C_RECORDMCOUNT
> > > > + select HAVE_BUILDTIME_MCOUNT_SORT
> > > > select HAVE_CMPXCHG_DOUBLE
> > > > select HAVE_CMPXCHG_LOCAL
> > > > select HAVE_CONTEXT_TRACKING
> >
> > The arm64 kernel is relocatable by default, and has been broken since the
> > build-time sort was introduced -- I see ftrace test failures, and the
> > CONFIG_FTRACE_SORT_STARTUP_TEST screams at boot time.
> >
> > > > diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> > > > index 7399327d1eff..46080dea5dba 100644
> > > > --- a/arch/x86/Kconfig
> > > > +++ b/arch/x86/Kconfig
> > > > @@ -186,6 +186,7 @@ config X86
> > > > select HAVE_CONTEXT_TRACKING_OFFSTACK if HAVE_CONTEXT_TRACKING
> > > > select HAVE_C_RECORDMCOUNT
> > > > select HAVE_OBJTOOL_MCOUNT if STACK_VALIDATION
> > > > + select HAVE_BUILDTIME_MCOUNT_SORT
> >
> > Isn't x86 relocatable in some configurations (e.g. for KASLR)?
> >
> > I can't see how the sort works for those cases, because the mcount_loc entries
> > are absolute, and either:
> >
> > * The sorted entries will get overwritten by the unsorted relocation entries,
> > and won't be sorted.
> >
> > * The sorted entries won't get overwritten, but then the absolute address will
> > be wrong since they hadn't been relocated.
> >
> > How does that work?
> >
> > Thanks,
> > Mark.
> >
> > > > select HAVE_DEBUG_KMEMLEAK
> > > > select HAVE_DMA_CONTIGUOUS
> > > > select HAVE_DYNAMIC_FTRACE
> > > > diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
> > > > index 752ed89a293b..7e5b92090faa 100644
> > > > --- a/kernel/trace/Kconfig
> > > > +++ b/kernel/trace/Kconfig
> > > > @@ -70,10 +70,16 @@ config HAVE_C_RECORDMCOUNT
> > > > help
> > > > C version of recordmcount available?
> > > > +config HAVE_BUILDTIME_MCOUNT_SORT
> > > > + bool
> > > > + help
> > > > + An architecture selects this if it sorts the mcount_loc section
> > > > + at build time.
> > > > +
> > > > config BUILDTIME_MCOUNT_SORT
> > > > bool
> > > > default y
> > > > - depends on BUILDTIME_TABLE_SORT && !S390
> > > > + depends on HAVE_BUILDTIME_MCOUNT_SORT
> > > > help
> > > > Sort the mcount_loc section at build time.
> > >
> > > LGTM. This will no longer destroy ftrace on other architectures.
> > > Those arches that we are not sure about can test and enable this function by
> > > themselves.
> > >
> > >
> > > Best regards
> > > --yinan
> > >
Mark Rutland <[email protected]> writes:
>> Isn't x86 relocatable in some configurations (e.g. for KASLR)?
>>
>> I can't see how the sort works for those cases, because the mcount_loc entries
>> are absolute, and either:
>>
>> * The sorted entries will get overwritten by the unsorted relocation entries,
>> and won't be sorted.
>>
>> * The sorted entries won't get overwritten, but then the absolute address will
>> be wrong since they hadn't been relocated.
>>
>> How does that work?
From what I've seen when looking into this ftrace sort problem, x86 has a
relocation tool which is run before final linking: arch/x86/tools/relocs.c
This tool converts all the required relocations to three types:
- 32 bit relocations
- 64 bit relocations
- inverse 32 bit relocations
These are added to the end of the image.
The decompressor then iterates over that array, and just adds/subtracts
the KASLR offset - see arch/x86/boot/compressed/misc.c, handle_relocations()
So IMHO x86 never uses 'real' relocations during boot, and just
adds/subtracts. That's why the order stays the same, and the compile
time sort works.
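In other words, the per-entry fixup is conceptually just an in-place add or
subtract of one delta, something like the sketch below (simplified;
handle_relocations() in arch/x86/boot/compressed/misc.c is the real code):

/* Conceptual sketch only: each recorded location already holds its
 * link-time value, and the decompressor applies the KASLR delta in
 * place.  No symbol lookup or addend is involved, so the relative
 * order of the mcount_loc entries is preserved. */
static void fixup_64(unsigned long *loc, unsigned long delta)
{
        *loc += delta;
}

static void fixup_inverse_32(unsigned int *loc, unsigned long delta)
{
        *loc -= delta;
}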
/Sven
On Thu, Jan 27, 2022 at 01:03:34PM +0100, Ard Biesheuvel wrote:
> On Thu, 27 Jan 2022 at 12:47, Mark Rutland <[email protected]> wrote:
> >
> > [adding LKML so this is easier for others to find]
> >
> > If anyone wants to follow the thread from the start, it's at:
> >
> > https://lore.kernel.org/linuxppc-dev/[email protected]/
> >
> > Ard, I was under the impression that the 32-bit arm kernel was (virtually)
> > relocatable, but I couldn't spot where, and suspect I'm mistaken. Do you know
> > whether it currently does any boot-time dynamic relocation?
>
> No, it does not.
Thanks for confirming!
So 32-bit arm should be able to opt into the build-time sort as-is.
> > Steve asked for a bit more detail on IRC, so the below is an attempt to explain
> > what's actually going on here.
> >
> > The short answer is that relocatable kernels (e.g. those with KASLR support)
> > need to handle the kernel being loaded at (somewhat) arbitrary virtual
> > addresses. Even where code can be position-independent, any pointers in the
> > kernel image (e.g. the contents of the mcount_loc table) need to be updated to
> > account for the specific VA the kernel was loaded at -- arch code does this
> > early at boot time by applying dynamic (ELF) relocations.
>
> These architectures use place-relative extables for the same reason:
> place relative references are resolved at build time rather than at
> runtime during relocation, making a build time sort feasible.
>
> arch/alpha/include/asm/extable.h:#define ARCH_HAS_RELATIVE_EXTABLE
> arch/arm64/include/asm/extable.h:#define ARCH_HAS_RELATIVE_EXTABLE
> arch/ia64/include/asm/extable.h:#define ARCH_HAS_RELATIVE_EXTABLE
> arch/parisc/include/asm/uaccess.h:#define ARCH_HAS_RELATIVE_EXTABLE
> arch/powerpc/include/asm/extable.h:#define ARCH_HAS_RELATIVE_EXTABLE
> arch/riscv/include/asm/extable.h:#define ARCH_HAS_RELATIVE_EXTABLE
> arch/s390/include/asm/extable.h:#define ARCH_HAS_RELATIVE_EXTABLE
> arch/x86/include/asm/extable.h:#define ARCH_HAS_RELATIVE_EXTABLE
>
> Note that the swap routine becomes something like the below, given
> that the relative references need to be fixed up after the entry
> changes place in the sorted list.
>
> static void swap_ex(void *a, void *b, int size)
> {
> struct exception_table_entry *x = a, *y = b, tmp;
> int delta = b - a;
>
> tmp = *x;
> x->insn = y->insn + delta;
> y->insn = tmp.insn - delta;
> ...
> }
>
> As a bonus, the resulting footprint of the table in the image is
> reduced by 8x, given that every 8 byte pointer has an accompanying 24
> byte RELA record, so we go from 32 bytes to 4 bytes for every call to
> __gnu_mcount_mc.
Absolutely -- it'd be great if we could do that for the callsite locations; the
difficulty is that the entries are generated by the compiler itself, so we'd
either need some build/link time processing to convert each absolute 64-bit
value to a relative 32-bit offset, or new compiler options to generate those as
relative offsets from the outset.
Thanks,
Mark.
On Thu, 27 Jan 2022 at 13:20, Mark Rutland <[email protected]> wrote:
>
> On Thu, Jan 27, 2022 at 01:03:34PM +0100, Ard Biesheuvel wrote:
> > On Thu, 27 Jan 2022 at 12:47, Mark Rutland <[email protected]> wrote:
> > >
> > > [adding LKML so this is easier for others to find]
> > >
> > > If anyone wants to follow the thread from the start, it's at:
> > >
> > > https://lore.kernel.org/linuxppc-dev/[email protected]/
> > >
> > > Ard, I was under the impression that the 32-bit arm kernel was (virtually)
> > > relocatable, but I couldn't spot where, and suspect I'm mistaken. Do you know
> > > whether it currently does any boot-time dynamic relocation?
> >
> > No, it does not.
>
> Thanks for confirming!
>
> So 32-bit arm should be able to opt into the build-time sort as-is.
>
> > > Steve asked for a bit more detail on IRC, so the below is an attempt to explain
> > > what's actually going on here.
> > >
> > > The short answer is that relocatable kernels (e.g. those with KASLR support)
> > > need to handle the kernel being loaded at (somewhat) arbitrary virtual
> > > addresses. Even where code can be position-independent, any pointers in the
> > > kernel image (e.g. the contents of the mcount_loc table) need to be updated to
> > > account for the specific VA the kernel was loaded at -- arch code does this
> > > early at boot time by applying dynamic (ELF) relocations.
> >
> > These architectures use place-relative extables for the same reason:
> > place relative references are resolved at build time rather than at
> > runtime during relocation, making a build time sort feasible.
> >
> > arch/alpha/include/asm/extable.h:#define ARCH_HAS_RELATIVE_EXTABLE
> > arch/arm64/include/asm/extable.h:#define ARCH_HAS_RELATIVE_EXTABLE
> > arch/ia64/include/asm/extable.h:#define ARCH_HAS_RELATIVE_EXTABLE
> > arch/parisc/include/asm/uaccess.h:#define ARCH_HAS_RELATIVE_EXTABLE
> > arch/powerpc/include/asm/extable.h:#define ARCH_HAS_RELATIVE_EXTABLE
> > arch/riscv/include/asm/extable.h:#define ARCH_HAS_RELATIVE_EXTABLE
> > arch/s390/include/asm/extable.h:#define ARCH_HAS_RELATIVE_EXTABLE
> > arch/x86/include/asm/extable.h:#define ARCH_HAS_RELATIVE_EXTABLE
> >
> > Note that the swap routine becomes something like the below, given
> > that the relative references need to be fixed up after the entry
> > changes place in the sorted list.
> >
> > static void swap_ex(void *a, void *b, int size)
> > {
> > struct exception_table_entry *x = a, *y = b, tmp;
> > int delta = b - a;
> >
> > tmp = *x;
> > x->insn = y->insn + delta;
> > y->insn = tmp.insn - delta;
> > ...
> > }
> >
> > As a bonus, the resulting footprint of the table in the image is
> > reduced by 8x, given that every 8 byte pointer has an accompanying 24
> > byte RELA record, so we go from 32 bytes to 4 bytes for every call to
> > __gnu_mcount_mc.
>
> Absolutely -- it'd be great if we could do that for the callsite locations; the
> difficulty is that the entries are generated by the compiler itself, so we'd
> either need some build/link time processing to convert each absolute 64-bit
> value to a relative 32-bit offset, or new compiler options to generate those as
> relative offsets from the outset.
>
Don't we use scripts/recordmcount.pl for that?
On Thu, Jan 27, 2022 at 01:04:41PM +0100, Sven Schnelle wrote:
> Mark Rutland <[email protected]> writes:
>
> >> Isn't x86 relocatable in some configurations (e.g. for KASLR)?
> >>
> >> I can't see how the sort works for those cases, because the mcount_loc entries
> >> are absolute, and either:
> >>
> >> * The sorted entries will get overwritten by the unsorted relocation entries,
> >> and won't be sorted.
> >>
> >> * The sorted entries won't get overwritten, but then the absolute address will
> >> be wrong since they hadn't been relocated.
> >>
> >> How does that work?
>
> From what I've seen when looking into this ftrace sort problem, x86 has a
> relocation tool which is run before final linking: arch/x86/tools/relocs.c
> This tool converts all the required relocations to three types:
>
> - 32 bit relocations
> - 64 bit relocations
> - inverse 32 bit relocations
>
> These are added to the end of the image.
>
> The decompressor then iterates over that array, and just adds/subtracts
> the KASLR offset - see arch/x86/boot/compressed/misc.c, handle_relocations()
>
> So IMHO x86 never uses 'real' relocations during boot, and just
> adds/subtracts. That's why the order stays the same, and the compile
> time sort works.
Ah, so those non-ELF relocations for the mcount_loc table just mean "apply the
KASLR offset here", which is equivalent for all entries.
That makes sense, thanks!
Mark.
On Thu, 27 Jan 2022 12:27:04 +0000
Mark Rutland <[email protected]> wrote:
> Ah, so those non-ELF relocations for the mcount_loc table just mean "apply the
> KASLR offset here", which is equivalent for all entries.
>
> That makes sense, thanks!
And this is why we were having such a hard time understanding each other ;-)
I started a new project called "shelf", which is a shell interface to
read ELF files (Shelf on an ELF!).
It uses my ccli library:
https://github.com/rostedt/libccli
and can be found here:
https://github.com/rostedt/shelf
Build and install the latest libccli and then build this with just
"make".
$ shelf vmlinux
and then you can see what is stored in the mcount location:
shelf> dump symbol __start_mcount_loc - __stop_mcount_loc
I plan on adding more to include the REL and RELA sections and show how
they affect symbols and such.
Feel free to contribute too ;-)
-- Steve
On Thu, Jan 27, 2022 at 01:22:17PM +0100, Ard Biesheuvel wrote:
> On Thu, 27 Jan 2022 at 13:20, Mark Rutland <[email protected]> wrote:
> > On Thu, Jan 27, 2022 at 01:03:34PM +0100, Ard Biesheuvel wrote:
> >
> > > These architectures use place-relative extables for the same reason:
> > > place relative references are resolved at build time rather than at
> > > runtime during relocation, making a build time sort feasible.
> > >
> > > arch/alpha/include/asm/extable.h:#define ARCH_HAS_RELATIVE_EXTABLE
> > > arch/arm64/include/asm/extable.h:#define ARCH_HAS_RELATIVE_EXTABLE
> > > arch/ia64/include/asm/extable.h:#define ARCH_HAS_RELATIVE_EXTABLE
> > > arch/parisc/include/asm/uaccess.h:#define ARCH_HAS_RELATIVE_EXTABLE
> > > arch/powerpc/include/asm/extable.h:#define ARCH_HAS_RELATIVE_EXTABLE
> > > arch/riscv/include/asm/extable.h:#define ARCH_HAS_RELATIVE_EXTABLE
> > > arch/s390/include/asm/extable.h:#define ARCH_HAS_RELATIVE_EXTABLE
> > > arch/x86/include/asm/extable.h:#define ARCH_HAS_RELATIVE_EXTABLE
> > >
> > > Note that the swap routine becomes something like the below, given
> > > that the relative references need to be fixed up after the entry
> > > changes place in the sorted list.
> > >
> > > static void swap_ex(void *a, void *b, int size)
> > > {
> > > struct exception_table_entry *x = a, *y = b, tmp;
> > > int delta = b - a;
> > >
> > > tmp = *x;
> > > x->insn = y->insn + delta;
> > > y->insn = tmp.insn - delta;
> > > ...
> > > }
> > >
> > > As a bonus, the resulting footprint of the table in the image is
> > > reduced by 8x, given that every 8 byte pointer has an accompanying 24
> > > byte RELA record, so we go from 32 bytes to 4 bytes for every call to
> > > __gnu_mcount_mc.
> >
> > Absolutely -- it'd be great if we could do that for the callsite locations; the
> > difficulty is that the entries are generated by the compiler itself, so we'd
> > either need some build/link time processing to convert each absolute 64-bit
> > value to a relative 32-bit offset, or new compiler options to generate those as
> > relative offsets from the outset.
>
> Don't we use scripts/recordmcount.pl for that?
Not quite -- we could adjust it to do so, but today it doesn't consider
existing mcount_loc entries, and just generates new ones where the compiler has
generated calls to mcount, which it finds by scanning the instructions in the
binary. Today it is not used:
* On arm64 when we default to using `-fpatchable-function-entry=N`. That makes
the compiler insert 2 NOPs in the function prologue, and log the location of
that NOP sled to a section called `__patchable_function_entries` (see the
sketch after this list).
We need the compiler to do that since we can't reliably identify 2 NOPs in a
function prologue as being intended to be a patch site, as e.g. there could
be notrace functions where the compiler had to insert NOPs for alignment of a
subsequent branch or similar.
* On architectures with `-mnop-mcount`. On these, it's necessary to use
`-mrecord-mcount` to have the compiler log the patch-site, for the same
reason as with `-fpatchable-function-entry`.
* On architectures with `-mrecord-mcount` generally, since this avoids the
overhead of scanning each object.
* On x86 when objtool is used.
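For the first point, the per-function equivalent of that command-line option
is GCC's patchable_function_entry attribute; a minimal sketch (hypothetical
function name, for illustration only):

/* Roughly what -fpatchable-function-entry=2 arranges for each traceable
 * function: two NOPs at the function entry, plus a pointer to them
 * recorded in the __patchable_function_entries section. */
__attribute__((patchable_function_entry(2, 0)))
void example_traceable_function(void)
{
        /* function body */
}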
Thanks,
Mark.
On Thu, Jan 27, 2022 at 07:46:01AM -0500, Steven Rostedt wrote:
> On Thu, 27 Jan 2022 12:27:04 +0000
> Mark Rutland <[email protected]> wrote:
>
> > Ah, so those non-ELF relocations for the mcount_loc table just mean "apply the
> > KASLR offset here", which is equivalent for all entries.
> >
> > That makes sense, thanks!
>
> And this is why we were having such a hard time understanding each other ;-)
;)
With that in mind, I think that we understand that the build-time sort works
for:
* arch/x86, because the non-ELF relocations for mcount_loc happen to be
equivalent.
* arch/arm, because there's no dynamic relocation and the mcount_loc entries
have been finalized prior to sorting.
... but doesn't work for anyone else (including arm64) because the ELF
relocations are not equivalent, and need special care that is not yet
implemented.
So we should have arm and x86 opt-in, but for now everyone else (including
arm64, powerpc, s390) should be left with the prior behaviour, using the runtime
sort only (in case the build-time sort breaks anything, as I mentioned in my
other mail).
Does that sound about right?
In future we might be able to do something much smarter (e.g. adding some
preprocessing to use relative entries).
I'll take a look at shelf. :)
Thanks,
Mark.
> I started a new project called "shelf", which is a shell interface to
> read ELF files (Shelf on a ELF!).
>
> It uses my ccli library:
>
> https://github.com/rostedt/libccli
>
> and can be found here:
>
> https://github.com/rostedt/shelf
>
> Build and install the latest libccli and then build this with just
> "make".
>
> $ shelf vmlinux
>
> and then you can see what is stored in the mcount location:
>
> shelf> dump symbol __start_mcount_loc - __stop_mcount_loc
>
> I plan on adding more to include the REL and RELA sections and show how
> they affect symbols and such.
>
> Feel free to contribute too ;-)
>
> -- Steve
Mark Rutland <[email protected]> writes:
> On Thu, Jan 27, 2022 at 07:46:01AM -0500, Steven Rostedt wrote:
>> On Thu, 27 Jan 2022 12:27:04 +0000
>> Mark Rutland <[email protected]> wrote:
>>
>> > Ah, so those non-ELF relocations for the mcount_loc table just mean "apply the
>> > KASLR offset here", which is equivalent for all entries.
>> >
>> > That makes sense, thanks!
>>
>> And this is why we were having such a hard time understanding each other ;-)
>
> ;)
>
> With that in mind, I think that we understand that the build-time sort works
> for:
>
> * arch/x86, because the non-ELF relocations for mcount_loc happen to be
> equivalent.
>
> * arch/arm, because there's no dynamic relocation and the mcount_loc entries
> have been finalized prior to sorting.
>
> ... but doesn't work for anyone else (including arm64) because the ELF
> relocations are not equivalent, and need special care that is not yet
> implemented.
For s390 my idea is to just skip the addresses between __start_mcount_loc
and __stop_mcount_loc, because for these addresses we know that they are
64 bits wide, so we just need to add the KASLR offset.
I'm thinking about something like this:
diff --git a/arch/s390/boot/compressed/decompressor.h b/arch/s390/boot/compressed/decompressor.h
index f75cc31a77dd..015d7e2e94ef 100644
--- a/arch/s390/boot/compressed/decompressor.h
+++ b/arch/s390/boot/compressed/decompressor.h
@@ -25,6 +25,8 @@ struct vmlinux_info {
unsigned long rela_dyn_start;
unsigned long rela_dyn_end;
unsigned long amode31_size;
+ unsigned long start_mcount_loc;
+ unsigned long stop_mcount_loc;
};
/* Symbols defined by linker scripts */
diff --git a/arch/s390/boot/startup.c b/arch/s390/boot/startup.c
index 1aa11a8f57dd..7bb0d88db5c6 100644
--- a/arch/s390/boot/startup.c
+++ b/arch/s390/boot/startup.c
@@ -88,6 +88,11 @@ static void handle_relocs(unsigned long offset)
dynsym = (Elf64_Sym *) vmlinux.dynsym_start;
for (rela = rela_start; rela < rela_end; rela++) {
loc = rela->r_offset + offset;
+ if ((loc >= vmlinux.start_mcount_loc) &&
+ (loc < vmlinux.stop_mcount_loc)) {
+ (*(unsigned long *)loc) += offset;
+ continue;
+ }
val = rela->r_addend;
r_sym = ELF64_R_SYM(rela->r_info);
if (r_sym) {
@@ -232,6 +237,8 @@ static void offset_vmlinux_info(unsigned long offset)
vmlinux.rela_dyn_start += offset;
vmlinux.rela_dyn_end += offset;
vmlinux.dynsym_start += offset;
+ vmlinux.start_mcount_loc += offset;
+ vmlinux.stop_mcount_loc += offset;
}
static unsigned long reserve_amode31(unsigned long safe_addr)
diff --git a/arch/s390/kernel/vmlinux.lds.S b/arch/s390/kernel/vmlinux.lds.S
index 42c43521878f..51c773405608 100644
--- a/arch/s390/kernel/vmlinux.lds.S
+++ b/arch/s390/kernel/vmlinux.lds.S
@@ -213,6 +213,8 @@ SECTIONS
QUAD(__rela_dyn_start) /* rela_dyn_start */
QUAD(__rela_dyn_end) /* rela_dyn_end */
QUAD(_eamode31 - _samode31) /* amode31_size */
+ QUAD(__start_mcount_loc)
+ QUAD(__stop_mcount_loc)
} :NONE
/* Debugging sections. */
Not sure whether that would also work on power, and also not sure
whether I missed something thinking about that. Maybe it doesn't even
work. ;-)
On Thu, 27 Jan 2022 at 13:59, Mark Rutland <[email protected]> wrote:
>
> On Thu, Jan 27, 2022 at 01:22:17PM +0100, Ard Biesheuvel wrote:
> > On Thu, 27 Jan 2022 at 13:20, Mark Rutland <[email protected]> wrote:
> > > On Thu, Jan 27, 2022 at 01:03:34PM +0100, Ard Biesheuvel wrote:
> > >
> > > > These architectures use place-relative extables for the same reason:
> > > > place relative references are resolved at build time rather than at
> > > > runtime during relocation, making a build time sort feasible.
> > > >
> > > > arch/alpha/include/asm/extable.h:#define ARCH_HAS_RELATIVE_EXTABLE
> > > > arch/arm64/include/asm/extable.h:#define ARCH_HAS_RELATIVE_EXTABLE
> > > > arch/ia64/include/asm/extable.h:#define ARCH_HAS_RELATIVE_EXTABLE
> > > > arch/parisc/include/asm/uaccess.h:#define ARCH_HAS_RELATIVE_EXTABLE
> > > > arch/powerpc/include/asm/extable.h:#define ARCH_HAS_RELATIVE_EXTABLE
> > > > arch/riscv/include/asm/extable.h:#define ARCH_HAS_RELATIVE_EXTABLE
> > > > arch/s390/include/asm/extable.h:#define ARCH_HAS_RELATIVE_EXTABLE
> > > > arch/x86/include/asm/extable.h:#define ARCH_HAS_RELATIVE_EXTABLE
> > > >
> > > > Note that the swap routine becomes something like the below, given
> > > > that the relative references need to be fixed up after the entry
> > > > changes place in the sorted list.
> > > >
> > > > static void swap_ex(void *a, void *b, int size)
> > > > {
> > > > struct exception_table_entry *x = a, *y = b, tmp;
> > > > int delta = b - a;
> > > >
> > > > tmp = *x;
> > > > x->insn = y->insn + delta;
> > > > y->insn = tmp.insn - delta;
> > > > ...
> > > > }
> > > >
> > > > As a bonus, the resulting footprint of the table in the image is
> > > > reduced by 8x, given that every 8 byte pointer has an accompanying 24
> > > > byte RELA record, so we go from 32 bytes to 4 bytes for every call to
> > > > __gnu_mcount_mc.
> > >
> > > Absolutely -- it'd be great if we could do that for the callsite locations; the
> > > difficulty is that the entries are generated by the compiler itself, so we'd
> > > either need some build/link time processing to convert each absolute 64-bit
> > > value to a relative 32-bit offset, or new compiler options to generate those as
> > > relative offsets from the outset.
> >
> > Don't we use scripts/recordmcount.pl for that?
>
> Not quite -- we could adjust it to do so, but today it doesn't consider
> existing mcount_loc entries, and just generates new ones where the compiler has
> generated calls to mcount, which it finds by scanning the instructions in the
> binary. Today it is not used:
>
> * On arm64 when we default to using `-fpatchable-function-entry=N`. That makes
> the compiler insert 2 NOPs in the function prologue, and log the location of
> that NOP sled to a section called `__patchable_function_entries`.
>
> We need the compiler to do that since we can't reliably identify 2 NOPs in a
> function prologue as being intended to be a patch site, as e.g. there could
> be notrace functions where the compiler had to insert NOPs for alignment of a
> subsequent branch or similar.
>
> * On architectures with `-mnop-mcount`. On these, it's necessary to use
> `-mrecord-mcount` to have the compiler log the patch-site, for the same
> reason as with `-fpatchable-function-entry`.
>
> * On architectures with `-mrecord-mcount` generally, since this avoids the
> overhead of scanning each object.
>
> * On x86 when objtool is used.
>
Right.
I suppose that on arm64, we can work around this by passing
--apply-dynamic-relocs to the linker, so that all R_AARCH64_RELATIVE
targets are prepopulated with the link time value of the respective
addresses. It does cause some bloat, which is why we disable that
today, but we could make that dependent on ftrace being enabled.
I do wonder how much overhead we accumulate, though, by having all
these relocations, but I suppose that is the situation today in any
case.
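
As an aside, the elided body of the swap_ex() sketch quoted above can be filled in as follows for a hypothetical two-field place-relative entry. The struct layout and names are purely illustrative, not any architecture's actual extable format; the point is only that each relative field must be adjusted by the distance the entry moves:

    struct rel_entry {
            int insn;       /* offset from &entry->insn to the instruction */
            int fixup;      /* offset from &entry->fixup to the fixup code */
    };

    static void swap_rel(void *a, void *b, int size)
    {
            struct rel_entry *x = a, *y = b, tmp = *x;
            int delta = (char *)b - (char *)a;

            /* the entry moving from b to a keeps pointing at the same
             * absolute targets, so its offsets grow by (b - a) ... */
            x->insn  = y->insn  + delta;
            x->fixup = y->fixup + delta;
            /* ... and the entry moving from a to b shrinks by the same */
            y->insn  = tmp.insn  - delta;
            y->fixup = tmp.fixup - delta;
    }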
On Thu, Jan 27, 2022 at 02:07:03PM +0100, Ard Biesheuvel wrote:
> On Thu, 27 Jan 2022 at 13:59, Mark Rutland <[email protected]> wrote:
> >
> > On Thu, Jan 27, 2022 at 01:22:17PM +0100, Ard Biesheuvel wrote:
> > > On Thu, 27 Jan 2022 at 13:20, Mark Rutland <[email protected]> wrote:
> > > > On Thu, Jan 27, 2022 at 01:03:34PM +0100, Ard Biesheuvel wrote:
> > > >
> > > > > These architectures use place-relative extables for the same reason:
> > > > > place relative references are resolved at build time rather than at
> > > > > runtime during relocation, making a build time sort feasible.
> > > > >
> > > > > arch/alpha/include/asm/extable.h:#define ARCH_HAS_RELATIVE_EXTABLE
> > > > > arch/arm64/include/asm/extable.h:#define ARCH_HAS_RELATIVE_EXTABLE
> > > > > arch/ia64/include/asm/extable.h:#define ARCH_HAS_RELATIVE_EXTABLE
> > > > > arch/parisc/include/asm/uaccess.h:#define ARCH_HAS_RELATIVE_EXTABLE
> > > > > arch/powerpc/include/asm/extable.h:#define ARCH_HAS_RELATIVE_EXTABLE
> > > > > arch/riscv/include/asm/extable.h:#define ARCH_HAS_RELATIVE_EXTABLE
> > > > > arch/s390/include/asm/extable.h:#define ARCH_HAS_RELATIVE_EXTABLE
> > > > > arch/x86/include/asm/extable.h:#define ARCH_HAS_RELATIVE_EXTABLE
> > > > >
> > > > > Note that the swap routine becomes something like the below, given
> > > > > that the relative references need to be fixed up after the entry
> > > > > changes place in the sorted list.
> > > > >
> > > > > static void swap_ex(void *a, void *b, int size)
> > > > > {
> > > > > struct exception_table_entry *x = a, *y = b, tmp;
> > > > > int delta = b - a;
> > > > >
> > > > > tmp = *x;
> > > > > x->insn = y->insn + delta;
> > > > > y->insn = tmp.insn - delta;
> > > > > ...
> > > > > }
> > > > >
> > > > > As a bonus, the resulting footprint of the table in the image is
> > > > > reduced by 8x, given that every 8 byte pointer has an accompanying 24
> > > > > byte RELA record, so we go from 32 bytes to 4 bytes for every call to
> > > > > __gnu_mcount_mc.
> > > >
> > > > Absolutely -- it'd be great if we could do that for the callsite locations; the
> > > > difficulty is that the entries are generated by the compiler itself, so we'd
> > > > either need some build/link time processing to convert each absolute 64-bit
> > > > value to a relative 32-bit offset, or new compiler options to generate those as
> > > > relative offsets from the outset.
> > >
> > > Don't we use scripts/recordmcount.pl for that?
> >
> > Not quite -- we could adjust it to do so, but today it doesn't consider
> > existing mcount_loc entries, and just generates new ones where the compiler has
> > generated calls to mcount, which it finds by scanning the instructions in the
> > binary. Today it is not used:
> >
> > * On arm64 when we default to using `-fpatchable-function-entry=N`. That makes
> > the compiler insert 2 NOPs in the function prologue, and log the location of
> >   that NOP sled to a section called `__patchable_function_entries`.
> >
> > We need the compiler to do that since we can't reliably identify 2 NOPs in a
> > function prologue as being intended to be a patch site, as e.g. there could
> > be notrace functions where the compiler had to insert NOPs for alignment of a
> >   subsequent branch or similar.
> >
> > * On architectures with `-mnop-mcount`. On these, it's necessary to use
> > `-mrecord-mcount` to have the compiler log the patch-site, for the same
> > reason as with `-fpatchable-function-entry`.
> >
> > * On architectures with `-mrecord-mcount` generally, since this avoids the
> > overhead of scanning each object.
> >
> > * On x86 when objtool is used.
> >
>
> Right.
>
> I suppose that on arm64, we can work around this by passing
> --apply-dynamic-relocs to the linker, so that all R_AARCH64_RELATIVE
> targets are prepopulated with the link time value of the respective
> addresses. It does cause some bloat, which is why we disable that
> today, but we could make that dependent on ftrace being enabled.
We'd also need to teach the build-time sort to update the relocations, unless
you mean to also change the boot-time reloc code to RMW with the offset?
I think for right now the best thing is to disable the build-time sort for
arm64, but maybe something like that is the right thing to do longer term.
> I do wonder how much overhead we accumulate, though, by having all
> these relocations, but I suppose that is the situation today in any
> case.
Yeah; I suspect if we want to do something about that we want to do it more
generally, and would probably need to do something like the x86 approach and
rewrite the relocs at build-time to something more compressed. If we applied
the dynamic relocs with the link-time address we'd only need 4 bytes to
identify each pointer to apply an offset to.
I'm not exactly sure how we could do that, nor what the trade-offs look like in
practice.
Thanks,
Mark.
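
As background for the `-fpatchable-function-entry` mechanism discussed in the quoted list above: the compiler reserves NOPs at each function entry and records their location in the `__patchable_function_entries` section, which the kernel then gathers into its callsite table. A minimal stand-alone illustration using the equivalent per-function attribute (the kernel passes the command-line flag globally; the attribute form here is just for demonstration) might be:

    /* Illustrative only: compile with gcc or clang and inspect the object
     * with `objdump -d` and `readelf -S` to see the two reserved NOPs and
     * the __patchable_function_entries section. */
    __attribute__((patchable_function_entry(2, 0)))
    void traceable(void)
    {
            /* two NOPs are emitted at the entry of this function, and the
             * address of that NOP sled is recorded by the compiler */
    }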
On Thu, Jan 27, 2022 at 02:16:31PM +0100, Sven Schnelle wrote:
> Mark Rutland <[email protected]> writes:
>
> > On Thu, Jan 27, 2022 at 07:46:01AM -0500, Steven Rostedt wrote:
> >> On Thu, 27 Jan 2022 12:27:04 +0000
> >> Mark Rutland <[email protected]> wrote:
> >>
> >> > Ah, so those non-ELF relocations for the mcount_loc table just mean "apply the
> >> > KASLR offset here", which is equivalent for all entries.
> >> >
> >> > That makes sense, thanks!
> >>
> >> And this is why we were having such a hard time understanding each other ;-)
> >
> > ;)
> >
> > With that in mind, I think that we understand that the build-time sort works
> > for:
> >
> > * arch/x86, because the non-ELF relocations for mcount_loc happen to be
> > equivalent.
> >
> > * arch/arm, because there's no dynamic relocation and the mcount_loc entries
> > have been finalized prior to sorting.
> >
> > ... but doesn't work for anyone else (including arm64) because the ELF
> > relocations are not equivalent, and need special care that is not yet
> > implemented.
>
> For s390 my idea is to just skip the addresses between __start_mcount_loc
> and __stop_mcount_loc, because for these addresses we know that they are
> 64 bits wide, so we just need to add the KASLR offset.
>
> I'm thinking about something like this:
>
> diff --git a/arch/s390/boot/compressed/decompressor.h b/arch/s390/boot/compressed/decompressor.h
> index f75cc31a77dd..015d7e2e94ef 100644
> --- a/arch/s390/boot/compressed/decompressor.h
> +++ b/arch/s390/boot/compressed/decompressor.h
> @@ -25,6 +25,8 @@ struct vmlinux_info {
> unsigned long rela_dyn_start;
> unsigned long rela_dyn_end;
> unsigned long amode31_size;
> + unsigned long start_mcount_loc;
> + unsigned long stop_mcount_loc;
> };
>
> /* Symbols defined by linker scripts */
> diff --git a/arch/s390/boot/startup.c b/arch/s390/boot/startup.c
> index 1aa11a8f57dd..7bb0d88db5c6 100644
> --- a/arch/s390/boot/startup.c
> +++ b/arch/s390/boot/startup.c
> @@ -88,6 +88,11 @@ static void handle_relocs(unsigned long offset)
> dynsym = (Elf64_Sym *) vmlinux.dynsym_start;
> for (rela = rela_start; rela < rela_end; rela++) {
> loc = rela->r_offset + offset;
> + if ((loc >= vmlinux.start_mcount_loc) &&
> + (loc < vmlinux.stop_mcount_loc)) {
> + (*(unsigned long *)loc) += offset;
> + continue;
> + }
> val = rela->r_addend;
> r_sym = ELF64_R_SYM(rela->r_info);
> if (r_sym) {
> @@ -232,6 +237,8 @@ static void offset_vmlinux_info(unsigned long offset)
> vmlinux.rela_dyn_start += offset;
> vmlinux.rela_dyn_end += offset;
> vmlinux.dynsym_start += offset;
> + vmlinux.start_mcount_loc += offset;
> + vmlinux.stop_mcount_loc += offset;
> }
>
> static unsigned long reserve_amode31(unsigned long safe_addr)
> diff --git a/arch/s390/kernel/vmlinux.lds.S b/arch/s390/kernel/vmlinux.lds.S
> index 42c43521878f..51c773405608 100644
> --- a/arch/s390/kernel/vmlinux.lds.S
> +++ b/arch/s390/kernel/vmlinux.lds.S
> @@ -213,6 +213,8 @@ SECTIONS
> QUAD(__rela_dyn_start) /* rela_dyn_start */
> QUAD(__rela_dyn_end) /* rela_dyn_end */
> QUAD(_eamode31 - _samode31) /* amode31_size */
> + QUAD(__start_mcount_loc)
> + QUAD(__stop_mcount_loc)
> } :NONE
>
> /* Debugging sections. */
>
> Not sure whether that would also work on power, and also not sure
> whether I missed something thinking about that. Maybe it doesn't even
> work. ;-)
I don't know enough about s390 or powerpc relocs to say whether that works, but
I can say that approach isn't going to work for arm64 without other significant
changes.
I want to get the regression fixed ASAP, so can we take a simple patch for -rc2
which disables the build-time sort where it's currently broken (by limiting the
opt-in to arm and x86), then follow-up per-architecture to re-enable it if
desired/safe?
Thanks,
Mark.
On Thu, 27 Jan 2022 13:33:02 +0000
Mark Rutland <[email protected]> wrote:
> I want to get the regression fixed ASAP, so can we take a simple patch for -rc2
> which disables the build-time sort where it's currently broken (by limiting the
> opt-in to arm and x86), then follow-up per-architecture to re-enable it if
> desired/safe?
I'm going to retest my patch that makes it an opt-in for just x86 and arm
(32bit). I'll be pushing that hopefully later today. I have some other
patches to test as well.
-- Steve
On Thu, 27 Jan 2022 at 14:24, Mark Rutland <[email protected]> wrote:
>
> On Thu, Jan 27, 2022 at 02:07:03PM +0100, Ard Biesheuvel wrote:
> > On Thu, 27 Jan 2022 at 13:59, Mark Rutland <[email protected]> wrote:
> > >
> > > On Thu, Jan 27, 2022 at 01:22:17PM +0100, Ard Biesheuvel wrote:
> > > > On Thu, 27 Jan 2022 at 13:20, Mark Rutland <[email protected]> wrote:
> > > > > On Thu, Jan 27, 2022 at 01:03:34PM +0100, Ard Biesheuvel wrote:
> > > > >
> > > > > > These architectures use place-relative extables for the same reason:
> > > > > > place relative references are resolved at build time rather than at
> > > > > > runtime during relocation, making a build time sort feasible.
> > > > > >
> > > > > > arch/alpha/include/asm/extable.h:#define ARCH_HAS_RELATIVE_EXTABLE
> > > > > > arch/arm64/include/asm/extable.h:#define ARCH_HAS_RELATIVE_EXTABLE
> > > > > > arch/ia64/include/asm/extable.h:#define ARCH_HAS_RELATIVE_EXTABLE
> > > > > > arch/parisc/include/asm/uaccess.h:#define ARCH_HAS_RELATIVE_EXTABLE
> > > > > > arch/powerpc/include/asm/extable.h:#define ARCH_HAS_RELATIVE_EXTABLE
> > > > > > arch/riscv/include/asm/extable.h:#define ARCH_HAS_RELATIVE_EXTABLE
> > > > > > arch/s390/include/asm/extable.h:#define ARCH_HAS_RELATIVE_EXTABLE
> > > > > > arch/x86/include/asm/extable.h:#define ARCH_HAS_RELATIVE_EXTABLE
> > > > > >
> > > > > > Note that the swap routine becomes something like the below, given
> > > > > > that the relative references need to be fixed up after the entry
> > > > > > changes place in the sorted list.
> > > > > >
> > > > > > static void swap_ex(void *a, void *b, int size)
> > > > > > {
> > > > > > struct exception_table_entry *x = a, *y = b, tmp;
> > > > > > int delta = b - a;
> > > > > >
> > > > > > tmp = *x;
> > > > > > x->insn = y->insn + delta;
> > > > > > y->insn = tmp.insn - delta;
> > > > > > ...
> > > > > > }
> > > > > >
> > > > > > As a bonus, the resulting footprint of the table in the image is
> > > > > > reduced by 8x, given that every 8 byte pointer has an accompanying 24
> > > > > > byte RELA record, so we go from 32 bytes to 4 bytes for every call to
> > > > > > __gnu_mcount_mc.
> > > > >
> > > > > Absolutely -- it'd be great if we could do that for the callsite locations; the
> > > > > difficulty is that the entries are generated by the compiler itself, so we'd
> > > > > either need some build/link time processing to convert each absolute 64-bit
> > > > > value to a relative 32-bit offset, or new compiler options to generate those as
> > > > > relative offsets from the outset.
> > > >
> > > > Don't we use scripts/recordmcount.pl for that?
> > >
> > > Not quite -- we could adjust it to do so, but today it doesn't consider
> > > existing mcount_loc entries, and just generates new ones where the compiler has
> > > generated calls to mcount, which it finds by scanning the instructions in the
> > > binary. Today it is not used:
> > >
> > > * On arm64 when we default to using `-fpatchable-function-entry=N`. That makes
> > > the compiler insert 2 NOPs in the function prologue, and log the location of
> > >   that NOP sled to a section called `__patchable_function_entries`.
> > >
> > > We need the compiler to do that since we can't reliably identify 2 NOPs in a
> > > function prologue as being intended to be a patch site, as e.g. there could
> > > be notrace functions where the compiler had to insert NOPs for alignment of a
> > >   subsequent branch or similar.
> > >
> > > * On architectures with `-mnop-mcount`. On these, it's necessary to use
> > > `-mrecord-mcount` to have the compiler log the patch-site, for the same
> > > reason as with `-fpatchable-function-entry`.
> > >
> > > * On architectures with `-mrecord-mcount` generally, since this avoids the
> > > overhead of scanning each object.
> > >
> > > * On x86 when objtool is used.
> > >
> >
> > Right.
> >
> > I suppose that on arm64, we can work around this by passing
> > --apply-dynamic-relocs to the linker, so that all R_AARCH64_RELATIVE
> > targets are prepopulated with the link time value of the respective
> > addresses. It does cause some bloat, which is why we disable that
> > today, but we could make that dependent on ftrace being enabled.
>
> We'd also need to teach the build-time sort to update the relocations, unless
> you mean to also change the boot-time reloc code to RMW with the offset?
>
Why would that be necessary? Every RELA entry has the same effect on
its target address, as it just adds a fixed offset.
> I think for right now the best thing is to disable the build-time sort for
> arm64, but maybe something like that is the right thing to do longer term.
>
Fair enough.
> > I do wonder how much overhead we accumulate, though, by having all
> > these relocations, but I suppose that is the situation today in any
> > case.
>
> Yeah; I suspect if we want to do something about that we want to do it more
> generally, and would probably need to do something like the x86 approach and
> rewrite the relocs at build-time to something more compressed. If we applied
> the dynamic relocs with the link-time address we'd only need 4 bytes to
> identify each pointer to apply an offset to.
>
> I'm not exactly sure how we could do that, nor what the trade-offs look like in
> practice.
>
It would make sense for -fpic codegen to use relative offsets in
__mcount_loc, but since we don't actually use -fpic on arm64, that
doesn't really help :-)
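
To make the contrast concrete, a place-relative record of the kind used with ARCH_HAS_RELATIVE_EXTABLE is emitted as a 32-bit difference that the toolchain resolves by link time, so no RELA entry survives into the image. A purely hypothetical relative callsite record (the macro and section names below are invented for this sketch; nothing in the kernel emits such a table today) could look like:

    /* Sketch only: emit a 32-bit place-relative record, extable-style.
     * "__mcount_loc_rel" is a made-up section name. */
    #define RECORD_CALLSITE_REL()                                   \
            asm volatile("1:\n"                                     \
                         "  .pushsection __mcount_loc_rel, \"a\"\n" \
                         "  .long 1b - .\n"                         \
                         "  .popsection\n")

A consumer would then recover the absolute address as the entry's own address plus its stored value, independent of where the image is loaded.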
On Thu, Jan 27, 2022 at 02:59:31PM +0100, Ard Biesheuvel wrote:
> On Thu, 27 Jan 2022 at 14:24, Mark Rutland <[email protected]> wrote:
> >
> > On Thu, Jan 27, 2022 at 02:07:03PM +0100, Ard Biesheuvel wrote:
> > > I suppose that on arm64, we can work around this by passing
> > > --apply-dynamic-relocs to the linker, so that all R_AARCH64_RELATIVE
> > > targets are prepopulated with the link time value of the respective
> > > addresses. It does cause some bloat, which is why we disable that
> > > today, but we could make that dependent on ftrace being enabled.
> >
> > We'd also need to teach the build-time sort to update the relocations, unless
> > you mean to also change the boot-time reloc code to RMW with the offset?
>
> Why would that be necessary? Every RELA entry has the same effect on
> its target address, as it just adds a fixed offset.
Currently in relocate_kernel() we generate the absolute address from the
relocation alone, with the core of the relocation logic being as follows, with
x9 being the pointer to a RELA entry, and x23 being the offset relative to the
default load address:
ldp x12, x13, [x9], #24
ldr x14, [x9, #-8]
add x14, x14, x23 // relocate
str x14, [x12, x23]
... and (as per another reply), a sample RELA entry currently contains:
0xffff8000090b1ab0 // default load VA of pointer to update
0x0000000000000403 // R_AARCH64_RELATIVE
0xffff8000090b6000 // default load VA of addr to write
So either:
* That code stays as-is, and we must update the relocs to correspond to their
new sorted locations, or we'll blat the sorted values with the original
relocs as we do today.
* The code needs to change to RMW: read the existing value, add the offset
(ignoring the content of the RELA entry's addend field), and write it back.
This is what I meant when I said "change the boot-time reloc code to RMW with
the offset".
Does that make sense, or have I misunderstood?
Thanks,
Mark.
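
For clarity, the second option above ("RMW with the offset") could be sketched in C roughly as below. It assumes --apply-dynamic-relocs, so each relocation target already holds its link-time value, possibly edited by a build-time sort; the loop then just slides whatever is stored there by the load offset and ignores the addend. This is only an illustration, not the actual arm64 code, which is assembly in relocate_kernel():

    #include <elf.h>

    static void relocate_rmw(Elf64_Rela *rela, Elf64_Rela *end,
                             unsigned long offset)
    {
            for (; rela < end; rela++) {
                    unsigned long *place;

                    if (ELF64_R_TYPE(rela->r_info) != R_AARCH64_RELATIVE)
                            continue;
                    /* the target already holds a link-time address; add the
                     * offset to it rather than rewriting it from r_addend,
                     * so any build-time edits (e.g. sorting) are kept */
                    place = (unsigned long *)(rela->r_offset + offset);
                    *place += offset;
            }
    }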
On Thu, Jan 27, 2022 at 08:55:43AM -0500, Steven Rostedt wrote:
> On Thu, 27 Jan 2022 13:33:02 +0000
> Mark Rutland <[email protected]> wrote:
>
> > I want to get the regression fixed ASAP, so can we take a simple patch for -rc2
> > which disables the build-time sort where it's currently broken (by limiting the
> > opt-in to arm and x86), then follow-up per-architecture to re-enable it if
> > desired/safe?
>
> I'm going to retest my patch that makes it an opt-in for just x86 and arm
> (32bit). I'll be pushing that hopefully later today. I have some other
> patches to test as well.
Great; thanks!
Let me know if you'd like me to give that a spin on arm or arm64.
Thanks,
Mark.
On Thu, 27 Jan 2022 at 15:55, Mark Rutland <[email protected]> wrote:
>
> On Thu, Jan 27, 2022 at 02:59:31PM +0100, Ard Biesheuvel wrote:
> > On Thu, 27 Jan 2022 at 14:24, Mark Rutland <[email protected]> wrote:
> > >
> > > On Thu, Jan 27, 2022 at 02:07:03PM +0100, Ard Biesheuvel wrote:
> > > > I suppose that on arm64, we can work around this by passing
> > > > --apply-dynamic-relocs to the linker, so that all R_AARCH64_RELATIVE
> > > > targets are prepopulated with the link time value of the respective
> > > > addresses. It does cause some bloat, which is why we disable that
> > > > today, but we could make that dependent on ftrace being enabled.
> > >
> > > We'd also need to teach the build-time sort to update the relocations, unless
> > > you mean to also change the boot-time reloc code to RMW with the offset?
> >
> > Why would that be necessary? Every RELA entry has the same effect on
> > its target address, as it just adds a fixed offset.
>
> Currently in relocate_kernel() we generate the absolute address from the
> relocation alone, with the core of the relocation logic being as follows, with
> x9 being the pointer to a RELA entry, and x23 being the offset relative to the
> default load address:
>
> ldp x12, x13, [x9], #24
> ldr x14, [x9, #-8]
>
> add x14, x14, x23 // relocate
> str x14, [x12, x23]
>
> ... and (as per another reply), a sample RELA entry currently contains:
>
> 0xffff8000090b1ab0 // default load VA of pointer to update
> 0x0000000000000403 // R_AARCH64_RELATIVE
> 0xffff8000090b6000 // default load VA of addr to write
>
> So either:
>
> * That code stays as-is, and we must update the relocs to correspond to their
> new sorted locations, or we'll blat the sorted values with the original
> relocs as we do today.
>
> * The code needs to change to RMW: read the existing value, add the offset
> (ignoring the content of the RELA entry's addend field), and write it back.
> This is what I meant when I said "change the boot-time reloc code to RMW with
> the offset".
>
> Does that make sense, or have I misunderstood?
>
No, you're right. We'd have to use different sequences here depending
on whether the relocation target is populated or not, which currently
we don't care about.
On Thu, Jan 27, 2022 at 11:46:49AM +0000, Mark Rutland wrote:
> I'm not sure how x86 works here; AFAICT the relocations are performed during
> decompression, but it looks like there's some special build-time processing
> associated with that, and the vmlinux doesn't contain standard ELF relocations.
>
> Kees, IIUC you added the x86_64 support there, can you shed any light on if/how
> this works on x86?
I think Sven beat me to it, and this was answered in
https://lore.kernel.org/lkml/[email protected]
but let me know if anything needs further info.
An additional note is that x86 is built with "-2G addressing"
(-mcmodel=kernel). There was some work done to make it actually
PIE, which would allow the KASLR base to move further:
https://github.com/KSPP/linux/issues/38
-Kees
--
Kees Cook