2023-04-28 09:53:02

by Hou Wenlong

Subject: [PATCH RFC 00/43] x86/pie: Make kernel image's virtual address flexible

Purpose:

These patches make the changes necessary to build the kernel as Position
Independent Executable (PIE) on x86_64. A PIE kernel can be relocated
below the top 2G of the virtual address space, and this patchset
provides an example that allows the kernel image to be relocated within
the top 512G of the address space.

The ultimate purpose of a PIE kernel is to increase the security of the
kernel and the flexibility of the kernel image's virtual address, which
could even be placed in the low half of the address space. The more
locations the kernel image can fit into, the harder an attacker has to
guess.

The patchset is based on Thomas Garnier's X86 PIE patchset v6[1] and
v11[2]. However, some design changes have been made and some bugs fixed
through testing with different configurations and compilers.

Important changes:
- For the fixmap area, move the vsyscall page out of the fixmap area
and unify __FIXADDR_TOP for x86. The fixmap area can then be relocated
together with the kernel image.

- For the compile-time base address of the kernel image, keep it in the
top 2G of the address space. Introduce a new variable to store the
run-time base address and adapt the VA/PA conversions during runtime.

- For the percpu section, keep it zero-mapped for SMP. Because the
compile-time base address of the kernel image still resides in the top
2G of the address space, RIP-relative references can still be used
while the percpu section is zero-mapped. However, when relocating
percpu variable references, a percpu variable must be treated like a
normal variable, and absolute references to it must be relocated
accordingly. In addition, the relocation offset must be subtracted
from the GS base to ensure correct operation; see the sketch after
this list.

- For arch/x86/kernel/head64.c, don't build it with -mcmodel=large.
Instead, use data relocations to acquire global symbols' values and
make fixup_pointer() a nop when running in the identity mapping. This
is because not all global symbol references in the code go through
fixup_pointer(), e.g. variables in macros related to 5-level paging,
which GCC can optimize into relative references. Building with
-mcmodel=large would require even more fixup_pointer() calls,
resulting in uglier code. Actually, if the file were built as PIE even
when CONFIG_X86_PIE is disabled, all fixup_pointer() calls could be
dropped; however, the stack protector would then be broken if the
per-cpu stack protector is not supported.
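
As a minimal sketch of the percpu adjustment above (the names here are
illustrative, not taken from the patches): once the relocation pass has
added the load offset to absolute references to percpu symbols, the GS
base has to shed the same offset so that a %gs-relative access still
reaches the same slot:

  /*
   * Illustrative only: 'reloc_delta' is the distance between the
   * link-time and run-time kernel base. A relocated absolute
   * reference to a zero-based percpu symbol becomes
   * (sym + reloc_delta), so
   *
   *   %gs:sym == gs_base + sym + reloc_delta
   *
   * reaches the original slot only if the GS base compensates.
   */
  unsigned long gs_base = per_cpu_offset(cpu) - reloc_delta;

  wrmsrl(MSR_GS_BASE, gs_base);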

Limitations:
- Since I am not familiar with Xen, it has been disabled for now as it
has not been adapted for PIE. The problem is that wrong pointers (with
low address values) get assigned to x86_init_ops when running in the
identity mapping. This could be resolved by building the page tables
early and jumping to the high kernel address as soon as possible.

- It is not allowed to reference global variables in an alternative
section, since RIP-relative addressing is not fixed up in
apply_alternatives(); see the sketch after this list. Fortunately, all
disallowed relocations in alternative sections can be caught by
objtool, and I believe this issue could also be fixed with objtool's
help.

- For module loading, only modules without a GOT are allowed to load,
for simplicity. Only weak global variable references use the GOT.
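
To make the alternative-section limitation concrete, here is a
hypothetical example ('some_global' is an invented name):
apply_alternatives() copies the replacement bytes to the original site,
so a RIP-relative displacement encoded in the replacement would point at
the wrong address once the instruction runs from its new location:

  /* Hypothetical: disallowed under PIE, and caught by objtool. */
  asm volatile(ALTERNATIVE("movq $0, %%rax",
                           "movq some_global(%%rip), %%rax", /* broken */
                           X86_FEATURE_ALWAYS) ::: "rax", "memory");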

Tests:
I have only tested booting with GCC 5.1.0 (the minimum supported
version), GCC 12.2.0 and Clang 15.0.7. I have also run the following
tests for both the default configuration and the Ubuntu configuration.

Performance/Size impact (GCC 12.2.0):

Size of vmlinux (Default configuration):
File size:
- PIE disabled: +0.012%
- PIE enabled: -2.219%
instructions:
- PIE disabled: same
- PIE enabled: +1.383%
.text section:
- PIE disabled: same
- PIE enabled: +0.589%

Size of vmlinux (Ubuntu configuration):
File size:
- PIE disabled: same
- PIE enabled: +2.391%
instructions:
- PIE disabled: +0.013%
- PIE enabled: +1.566%
.text section:
- PIE disabled: same
- PIE enabled: +0.055%

The .text section size increase is due to the extra instructions PIE
code requires. There are two reasons, both mentioned in previous mailing
list discussions. Firstly, switch folding is disabled under PIE [3].
Secondly, PIE needs two instructions where -mcmodel=kernel can use a
single instruction with a sign-extended immediate, such as when
accessing an array element: PIE must first use lea to materialize the
array's base address, as illustrated below.
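
For example (a sketch; 'array' stands for any kernel-image symbol):

  /*
   * Under -mcmodel=kernel the base fits in a sign-extended 32-bit
   * immediate, so indexing takes one instruction:
   *
   *   movq array(, %rdi, 8), %rax
   *
   * Under PIE the base must be materialized first:
   *
   *   leaq array(%rip), %rax
   *   movq (%rax, %rdi, 8), %rax
   */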

Hackbench (50% and 1600% on thread/process for pipe/sockets):
- PIE disabled: no significant change (avg -/+ 0.5% on default config).
- PIE enabled: -2% to +2% on average (default config).

Kernbench (average of 10 Half and Optimal runs):
Elapsed Time:
- PIE disabled: no significant change (avg -0.2% on ubuntu config)
- PIE enabled: average -0.2% to +0.2%
System Time:
- PIE disabled: no significant change (avg -0.5% on ubuntu config)
- PIE enabled: average -0.5% to +0.5%

[1] https://lore.kernel.org/all/[email protected]
[2] https://lore.kernel.org/all/[email protected]
[3] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82303

Brian Gerst (1):
x86-64: Use per-cpu stack canary if supported by compiler

Hou Wenlong (29):
x86/irq: Adapt assembly for PIE support
x86,rethook: Adapt assembly for PIE support
x86/paravirt: Use relative reference for original instruction
x86/Kconfig: Introduce new Kconfig for PIE kernel building
x86/PVH: Use fixed_percpu_data to set up GS base
x86/pie: Enable stack protector only if per-cpu stack canary is
supported
x86/percpu: Use PC-relative addressing for percpu variable references
x86/tools: Explicitly include autoconf.h for hostprogs
x86/percpu: Adapt percpu references relocation for PIE support
x86/ftrace: Adapt assembly for PIE support
x86/pie: Force hidden visibility for all symbol references
x86/boot/compressed: Adapt sed command to generate voffset.h when PIE
is enabled
x86/pie: Add .data.rel.* sections into link script
KVM: x86: Adapt assembly for PIE support
x86/PVH: Adapt PVH booting for PIE support
x86/bpf: Adapt BPF_CALL JIT codegen for PIE support
x86/modules: Adapt module loading for PIE support
x86/boot/64: Use data relocation to get absolute address when PIE is
enabled
objtool: Add validation for x86 PIE support
objtool: Adapt indirect call of __fentry__() for PIE support
x86/pie: Build the kernel as PIE
x86/vsyscall: Don't use set_fixmap() to map vsyscall page
x86/xen: Pin up to VSYSCALL_ADDR when vsyscall page is out of fixmap
area
x86/fixmap: Move vsyscall page out of fixmap area
x86/fixmap: Unify FIXADDR_TOP
x86/boot: Fill kernel image puds dynamically
x86/mm: Sort address_markers array when X86 PIE is enabled
x86/pie: Allow kernel image to be relocated in top 512G
x86/boot: Extend relocate range for PIE kernel image

Thomas Garnier (13):
x86/crypto: Adapt assembly for PIE support
x86: Add macro to get symbol address for PIE support
x86: relocate_kernel - Adapt assembly for PIE support
x86/entry/64: Adapt assembly for PIE support
x86: pm-trace: Adapt assembly for PIE support
x86/CPU: Adapt assembly for PIE support
x86/acpi: Adapt assembly for PIE support
x86/boot/64: Adapt assembly for PIE support
x86/power/64: Adapt assembly for PIE support
x86/alternatives: Adapt assembly for PIE support
x86/ftrace: Adapt ftrace nop patching for PIE support
x86/mm: Make the x86 GOT read-only
x86/relocs: Handle PIE relocations

Documentation/x86/x86_64/mm.rst | 4 +
arch/x86/Kconfig | 36 +++++-
arch/x86/Makefile | 33 +++--
arch/x86/boot/compressed/Makefile | 2 +-
arch/x86/boot/compressed/kaslr.c | 55 +++++++++
arch/x86/boot/compressed/misc.c | 4 +-
arch/x86/boot/compressed/misc.h | 9 ++
arch/x86/crypto/aegis128-aesni-asm.S | 6 +-
arch/x86/crypto/aesni-intel_asm.S | 2 +-
arch/x86/crypto/aesni-intel_avx-x86_64.S | 3 +-
arch/x86/crypto/aria-aesni-avx-asm_64.S | 30 ++---
arch/x86/crypto/camellia-aesni-avx-asm_64.S | 30 ++---
arch/x86/crypto/camellia-aesni-avx2-asm_64.S | 30 ++---
arch/x86/crypto/camellia-x86_64-asm_64.S | 8 +-
arch/x86/crypto/cast5-avx-x86_64-asm_64.S | 50 ++++----
arch/x86/crypto/cast6-avx-x86_64-asm_64.S | 44 ++++---
arch/x86/crypto/crc32c-pcl-intel-asm_64.S | 3 +-
arch/x86/crypto/des3_ede-asm_64.S | 96 ++++++++++-----
arch/x86/crypto/ghash-clmulni-intel_asm.S | 4 +-
arch/x86/crypto/sha256-avx2-asm.S | 18 ++-
arch/x86/entry/calling.h | 17 ++-
arch/x86/entry/entry_64.S | 22 +++-
arch/x86/entry/vdso/Makefile | 2 +-
arch/x86/entry/vsyscall/vsyscall_64.c | 7 +-
arch/x86/include/asm/alternative.h | 6 +-
arch/x86/include/asm/asm.h | 1 +
arch/x86/include/asm/fixmap.h | 28 +----
arch/x86/include/asm/irq_stack.h | 2 +-
arch/x86/include/asm/kmsan.h | 6 +-
arch/x86/include/asm/nospec-branch.h | 10 +-
arch/x86/include/asm/page_64.h | 8 +-
arch/x86/include/asm/page_64_types.h | 8 ++
arch/x86/include/asm/paravirt.h | 17 ++-
arch/x86/include/asm/paravirt_types.h | 12 +-
arch/x86/include/asm/percpu.h | 29 ++++-
arch/x86/include/asm/pgtable_64_types.h | 10 +-
arch/x86/include/asm/pm-trace.h | 2 +-
arch/x86/include/asm/processor.h | 17 ++-
arch/x86/include/asm/sections.h | 5 +
arch/x86/include/asm/stackprotector.h | 16 ++-
arch/x86/include/asm/sync_core.h | 6 +-
arch/x86/include/asm/vsyscall.h | 13 ++
arch/x86/kernel/acpi/wakeup_64.S | 31 ++---
arch/x86/kernel/alternative.c | 8 +-
arch/x86/kernel/asm-offsets_64.c | 2 +-
arch/x86/kernel/callthunks.c | 2 +-
arch/x86/kernel/cpu/common.c | 15 ++-
arch/x86/kernel/ftrace.c | 46 ++++++-
arch/x86/kernel/ftrace_64.S | 9 +-
arch/x86/kernel/head64.c | 77 +++++++++---
arch/x86/kernel/head_64.S | 68 ++++++++---
arch/x86/kernel/kvm.c | 21 +++-
arch/x86/kernel/module.c | 27 +++++
arch/x86/kernel/paravirt.c | 4 +
arch/x86/kernel/relocate_kernel_64.S | 2 +-
arch/x86/kernel/rethook.c | 8 ++
arch/x86/kernel/setup.c | 6 +
arch/x86/kernel/vmlinux.lds.S | 10 +-
arch/x86/kvm/svm/vmenter.S | 10 +-
arch/x86/kvm/vmx/vmenter.S | 2 +-
arch/x86/lib/cmpxchg16b_emu.S | 8 +-
arch/x86/mm/dump_pagetables.c | 36 +++++-
arch/x86/mm/fault.c | 1 -
arch/x86/mm/init_64.c | 10 +-
arch/x86/mm/ioremap.c | 5 +-
arch/x86/mm/kasan_init_64.c | 4 +-
arch/x86/mm/pat/set_memory.c | 2 +-
arch/x86/mm/pgtable.c | 13 ++
arch/x86/mm/pgtable_32.c | 3 -
arch/x86/mm/physaddr.c | 14 +--
arch/x86/net/bpf_jit_comp.c | 17 ++-
arch/x86/platform/efi/efi_thunk_64.S | 4 +
arch/x86/platform/pvh/head.S | 29 ++++-
arch/x86/power/hibernate_asm_64.S | 4 +-
arch/x86/tools/Makefile | 4 +-
arch/x86/tools/relocs.c | 113 ++++++++++++++++-
arch/x86/xen/mmu_pv.c | 32 +++--
arch/x86/xen/xen-asm.S | 10 +-
arch/x86/xen/xen-head.S | 14 ++-
include/asm-generic/vmlinux.lds.h | 12 ++
scripts/Makefile.lib | 1 +
scripts/recordmcount.c | 81 ++++++++-----
tools/objtool/arch/x86/decode.c | 10 +-
tools/objtool/builtin-check.c | 4 +-
tools/objtool/check.c | 121 +++++++++++++++++++
tools/objtool/include/objtool/builtin.h | 1 +
86 files changed, 1202 insertions(+), 410 deletions(-)


Patchset is based on tip/master.
base-commit: 01cbd032298654fe4c85e153dd9a224e5bc10194
--
2.31.1


2023-04-28 09:53:13

by Hou Wenlong

Subject: [PATCH RFC 02/43] x86: Add macro to get symbol address for PIE support

From: Thomas Garnier <[email protected]>

Add a new _ASM_MOVABS macro to fetch a symbol's address. Replace the
"_ASM_MOV $<symbol>, %dst" constructs, which are not compatible with
PIE.
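
A usage sketch ('saved_context' is just an example symbol): on x86_64
the macro expands to movabsq, whose 64-bit immediate carries an
R_X86_64_64 relocation that the relocation pass can fix up, whereas
_ASM_MOV's 32-bit sign-extended immediate cannot hold an address
outside the top 2G:

  void *ptr;

  /* Emits "movabsq $saved_context, %0" on x86_64. */
  asm (_ASM_MOVABS " $saved_context, %0" : "=r" (ptr));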

Signed-off-by: Thomas Garnier <[email protected]>
Signed-off-by: Hou Wenlong <[email protected]>
Cc: Lai Jiangshan <[email protected]>
Cc: Kees Cook <[email protected]>
---
arch/x86/include/asm/asm.h | 1 +
1 file changed, 1 insertion(+)

diff --git a/arch/x86/include/asm/asm.h b/arch/x86/include/asm/asm.h
index fbcfec4dc4cc..05974cc060c6 100644
--- a/arch/x86/include/asm/asm.h
+++ b/arch/x86/include/asm/asm.h
@@ -35,6 +35,7 @@
#define _ASM_ALIGN __ASM_SEL(.balign 4, .balign 8)

#define _ASM_MOV __ASM_SIZE(mov)
+#define _ASM_MOVABS __ASM_SEL(movl, movabsq)
#define _ASM_INC __ASM_SIZE(inc)
#define _ASM_DEC __ASM_SIZE(dec)
#define _ASM_ADD __ASM_SIZE(add)
--
2.31.1

2023-04-28 09:53:15

by Hou Wenlong

Subject: [PATCH RFC 03/43] x86: relocate_kernel - Adapt assembly for PIE support

From: Thomas Garnier <[email protected]>

Change the assembly code to use only 64-bit absolute references to
symbols, so that the kernel can be PIE compatible.

Signed-off-by: Thomas Garnier <[email protected]>
Signed-off-by: Hou Wenlong <[email protected]>
Cc: Lai Jiangshan <[email protected]>
Cc: Kees Cook <[email protected]>
---
arch/x86/kernel/relocate_kernel_64.S | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kernel/relocate_kernel_64.S b/arch/x86/kernel/relocate_kernel_64.S
index 56cab1bb25f5..05d916e9df47 100644
--- a/arch/x86/kernel/relocate_kernel_64.S
+++ b/arch/x86/kernel/relocate_kernel_64.S
@@ -223,7 +223,7 @@ SYM_CODE_START_LOCAL_NOALIGN(identity_mapped)
movq %rax, %cr3
lea PAGE_SIZE(%r8), %rsp
call swap_pages
- movq $virtual_mapped, %rax
+ movabsq $virtual_mapped, %rax
pushq %rax
ANNOTATE_UNRET_SAFE
ret
--
2.31.1

2023-04-28 09:53:18

by Hou Wenlong

Subject: [PATCH RFC 04/43] x86/entry/64: Adapt assembly for PIE support

From: Thomas Garnier <[email protected]>

Change the assembly code to use only relative references to symbols, so
that the kernel can be PIE compatible.

Signed-off-by: Thomas Garnier <[email protected]>
Signed-off-by: Hou Wenlong <[email protected]>
Cc: Lai Jiangshan <[email protected]>
Cc: Kees Cook <[email protected]>
---
arch/x86/entry/entry_64.S | 16 +++++++++++-----
1 file changed, 11 insertions(+), 5 deletions(-)

diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 21dca946955e..6f2297ebb15f 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -1089,7 +1089,8 @@ SYM_CODE_START(error_entry)
movl %ecx, %eax /* zero extend */
cmpq %rax, RIP+8(%rsp)
je .Lbstep_iret
- cmpq $.Lgs_change, RIP+8(%rsp)
+ leaq .Lgs_change(%rip), %rcx
+ cmpq %rcx, RIP+8(%rsp)
jne .Lerror_entry_done_lfence

/*
@@ -1302,10 +1303,10 @@ SYM_CODE_START(asm_exc_nmi)
* resume the outer NMI.
*/

- movq $repeat_nmi, %rdx
+ leaq repeat_nmi(%rip), %rdx
cmpq 8(%rsp), %rdx
ja 1f
- movq $end_repeat_nmi, %rdx
+ leaq end_repeat_nmi(%rip), %rdx
cmpq 8(%rsp), %rdx
ja nested_nmi_out
1:
@@ -1359,7 +1360,8 @@ nested_nmi:
pushq %rdx
pushfq
pushq $__KERNEL_CS
- pushq $repeat_nmi
+ leaq repeat_nmi(%rip), %rdx
+ pushq %rdx

/* Put stack back */
addq $(6*8), %rsp
@@ -1398,7 +1400,11 @@ first_nmi:
addq $8, (%rsp) /* Fix up RSP */
pushfq /* RFLAGS */
pushq $__KERNEL_CS /* CS */
- pushq $1f /* RIP */
+ pushq $0 /* Space for RIP */
+ pushq %rdx /* Save RDX */
+ leaq 1f(%rip), %rdx /* Put the address of 1f label into RDX */
+ movq %rdx, 8(%rsp) /* Store it in RIP field */
+ popq %rdx /* Restore RDX */
iretq /* continues at repeat_nmi below */
UNWIND_HINT_IRET_REGS
1:
--
2.31.1

2023-04-28 09:53:28

by Hou Wenlong

Subject: [PATCH RFC 06/43] x86/CPU: Adapt assembly for PIE support

From: Thomas Garnier <[email protected]>

Change the assembly code to use only relative references to symbols, so
that the kernel can be PIE compatible.

Signed-off-by: Thomas Garnier <[email protected]>
Signed-off-by: Hou Wenlong <[email protected]>
Cc: Lai Jiangshan <[email protected]>
Cc: Kees Cook <[email protected]>
---
arch/x86/include/asm/sync_core.h | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/sync_core.h b/arch/x86/include/asm/sync_core.h
index ab7382f92aff..fa5b1fe1a692 100644
--- a/arch/x86/include/asm/sync_core.h
+++ b/arch/x86/include/asm/sync_core.h
@@ -31,10 +31,12 @@ static inline void iret_to_self(void)
"pushfq\n\t"
"mov %%cs, %0\n\t"
"pushq %q0\n\t"
- "pushq $1f\n\t"
+ "leaq 1f(%%rip), %q0\n\t"
+ "pushq %q0\n\t"
"iretq\n\t"
"1:"
- : "=&r" (tmp), ASM_CALL_CONSTRAINT : : "cc", "memory");
+ : "=&r" (tmp), ASM_CALL_CONSTRAINT
+ : : "cc", "memory");
}
#endif /* CONFIG_X86_32 */

--
2.31.1

2023-04-28 09:53:42

by Hou Wenlong

Subject: [PATCH RFC 09/43] x86/power/64: Adapt assembly for PIE support

From: Thomas Garnier <[email protected]>

Change the assembly code to use only relative references to symbols, so
that the kernel can be PIE compatible.

Signed-off-by: Thomas Garnier <[email protected]>
Signed-off-by: Hou Wenlong <[email protected]>
Cc: Lai Jiangshan <[email protected]>
Cc: Kees Cook <[email protected]>
---
arch/x86/power/hibernate_asm_64.S | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/power/hibernate_asm_64.S b/arch/x86/power/hibernate_asm_64.S
index 0a0539e1cc81..1d96a119d29d 100644
--- a/arch/x86/power/hibernate_asm_64.S
+++ b/arch/x86/power/hibernate_asm_64.S
@@ -39,7 +39,7 @@ SYM_FUNC_START(restore_registers)
movq %rax, %cr4; # turn PGE back on

/* We don't restore %rax, it must be 0 anyway */
- movq $saved_context, %rax
+ leaq saved_context(%rip), %rax
movq pt_regs_sp(%rax), %rsp
movq pt_regs_bp(%rax), %rbp
movq pt_regs_si(%rax), %rsi
@@ -70,7 +70,7 @@ SYM_FUNC_START(restore_registers)
SYM_FUNC_END(restore_registers)

SYM_FUNC_START(swsusp_arch_suspend)
- movq $saved_context, %rax
+ leaq saved_context(%rip), %rax
movq %rsp, pt_regs_sp(%rax)
movq %rbp, pt_regs_bp(%rax)
movq %rsi, pt_regs_si(%rax)
--
2.31.1

2023-04-28 09:53:50

by Hou Wenlong

Subject: [PATCH RFC 07/43] x86/acpi: Adapt assembly for PIE support

From: Thomas Garnier <[email protected]>

Change the assembly code to use only relative references to symbols, so
that the kernel can be PIE compatible.

Signed-off-by: Thomas Garnier <[email protected]>
Signed-off-by: Hou Wenlong <[email protected]>
Cc: Lai Jiangshan <[email protected]>
Cc: Kees Cook <[email protected]>
---
arch/x86/kernel/acpi/wakeup_64.S | 31 ++++++++++++++++---------------
1 file changed, 16 insertions(+), 15 deletions(-)

diff --git a/arch/x86/kernel/acpi/wakeup_64.S b/arch/x86/kernel/acpi/wakeup_64.S
index d5d8a352eafa..fe688bd87d72 100644
--- a/arch/x86/kernel/acpi/wakeup_64.S
+++ b/arch/x86/kernel/acpi/wakeup_64.S
@@ -17,7 +17,7 @@
* Hooray, we are in Long 64-bit mode (but still running in low memory)
*/
SYM_FUNC_START(wakeup_long64)
- movq saved_magic, %rax
+ movq saved_magic(%rip), %rax
movq $0x123456789abcdef0, %rdx
cmpq %rdx, %rax
je 2f
@@ -33,14 +33,14 @@ SYM_FUNC_START(wakeup_long64)
movw %ax, %es
movw %ax, %fs
movw %ax, %gs
- movq saved_rsp, %rsp
+ movq saved_rsp(%rip), %rsp

- movq saved_rbx, %rbx
- movq saved_rdi, %rdi
- movq saved_rsi, %rsi
- movq saved_rbp, %rbp
+ movq saved_rbx(%rip), %rbx
+ movq saved_rdi(%rip), %rdi
+ movq saved_rsi(%rip), %rsi
+ movq saved_rbp(%rip), %rbp

- movq saved_rip, %rax
+ movq saved_rip(%rip), %rax
ANNOTATE_RETPOLINE_SAFE
jmp *%rax
SYM_FUNC_END(wakeup_long64)
@@ -51,7 +51,7 @@ SYM_FUNC_START(do_suspend_lowlevel)
xorl %eax, %eax
call save_processor_state

- movq $saved_context, %rax
+ leaq saved_context(%rip), %rax
movq %rsp, pt_regs_sp(%rax)
movq %rbp, pt_regs_bp(%rax)
movq %rsi, pt_regs_si(%rax)
@@ -70,13 +70,14 @@ SYM_FUNC_START(do_suspend_lowlevel)
pushfq
popq pt_regs_flags(%rax)

- movq $.Lresume_point, saved_rip(%rip)
+ leaq .Lresume_point(%rip), %rax
+ movq %rax, saved_rip(%rip)

- movq %rsp, saved_rsp
- movq %rbp, saved_rbp
- movq %rbx, saved_rbx
- movq %rdi, saved_rdi
- movq %rsi, saved_rsi
+ movq %rsp, saved_rsp(%rip)
+ movq %rbp, saved_rbp(%rip)
+ movq %rbx, saved_rbx(%rip)
+ movq %rdi, saved_rdi(%rip)
+ movq %rsi, saved_rsi(%rip)

addq $8, %rsp
movl $3, %edi
@@ -88,7 +89,7 @@ SYM_FUNC_START(do_suspend_lowlevel)
.align 4
.Lresume_point:
/* We don't restore %rax, it must be 0 anyway */
- movq $saved_context, %rax
+ leaq saved_context(%rip), %rax
movq saved_context_cr4(%rax), %rbx
movq %rbx, %cr4
movq saved_context_cr3(%rax), %rbx
--
2.31.1

2023-04-28 09:53:59

by Hou Wenlong

Subject: [PATCH RFC 10/43] x86/alternatives: Adapt assembly for PIE support

From: Thomas Garnier <[email protected]>

Change the assembly constraints to work with pointers instead of
integers. The generated code is the same; this merely ensures the input
is a pointer, as PIE requires.

Signed-off-by: Thomas Garnier <[email protected]>
Signed-off-by: Hou Wenlong <[email protected]>
Cc: Lai Jiangshan <[email protected]>
Cc: Kees Cook <[email protected]>
---
arch/x86/include/asm/alternative.h | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/alternative.h b/arch/x86/include/asm/alternative.h
index d7da28fada87..cbf7c93087c8 100644
--- a/arch/x86/include/asm/alternative.h
+++ b/arch/x86/include/asm/alternative.h
@@ -307,7 +307,7 @@ static inline int alternatives_text_reserved(void *start, void *end)
/* Like alternative_io, but for replacing a direct call with another one. */
#define alternative_call(oldfunc, newfunc, ft_flags, output, input...) \
asm_inline volatile (ALTERNATIVE("call %P[old]", "call %P[new]", ft_flags) \
- : output : [old] "i" (oldfunc), [new] "i" (newfunc), ## input)
+ : output : [old] "X" (oldfunc), [new] "X" (newfunc), ## input)

/*
* Like alternative_call, but there are two features and respective functions.
@@ -320,8 +320,8 @@ static inline int alternatives_text_reserved(void *start, void *end)
asm_inline volatile (ALTERNATIVE_2("call %P[old]", "call %P[new1]", ft_flags1,\
"call %P[new2]", ft_flags2) \
: output, ASM_CALL_CONSTRAINT \
- : [old] "i" (oldfunc), [new1] "i" (newfunc1), \
- [new2] "i" (newfunc2), ## input)
+ : [old] "X" (oldfunc), [new1] "X" (newfunc1), \
+ [new2] "X" (newfunc2), ## input)

/*
* use this macro(s) if you need more than one output parameter
--
2.31.1

2023-04-28 09:54:06

by Hou Wenlong

Subject: [PATCH RFC 11/43] x86/irq: Adapt assembly for PIE support

Change the assembly constraints to work with pointers instead of
integers. The generated code is the same; this merely ensures the input
is a pointer, as PIE requires.

Signed-off-by: Hou Wenlong <[email protected]>
Cc: Thomas Garnier <[email protected]>
Cc: Lai Jiangshan <[email protected]>
Cc: Kees Cook <[email protected]>
---
arch/x86/include/asm/irq_stack.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/irq_stack.h b/arch/x86/include/asm/irq_stack.h
index 798183867d78..caba5d1d0800 100644
--- a/arch/x86/include/asm/irq_stack.h
+++ b/arch/x86/include/asm/irq_stack.h
@@ -93,7 +93,7 @@
"popq %%rsp \n" \
\
: "+r" (tos), ASM_CALL_CONSTRAINT \
- : [__func] "i" (func), [tos] "r" (tos) argconstr \
+ : [__func] "X" (func), [tos] "r" (tos) argconstr \
: "cc", "rax", "rcx", "rdx", "rsi", "rdi", "r8", "r9", "r10", \
"memory" \
); \
--
2.31.1

2023-04-28 09:54:11

by Hou Wenlong

Subject: [PATCH RFC 05/43] x86: pm-trace: Adapt assembly for PIE support

From: Thomas Garnier <[email protected]>

Change the assembly to use the new _ASM_MOVABS macro instead of
_ASM_MOV, so that it is PIE compatible.

Signed-off-by: Thomas Garnier <[email protected]>
Signed-off-by: Hou Wenlong <[email protected]>
Cc: Lai Jiangshan <[email protected]>
Cc: Kees Cook <[email protected]>
---
arch/x86/include/asm/pm-trace.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/pm-trace.h b/arch/x86/include/asm/pm-trace.h
index bfa32aa428e5..972070806ce9 100644
--- a/arch/x86/include/asm/pm-trace.h
+++ b/arch/x86/include/asm/pm-trace.h
@@ -8,7 +8,7 @@
do { \
if (pm_trace_enabled) { \
const void *tracedata; \
- asm volatile(_ASM_MOV " $1f,%0\n" \
+ asm volatile(_ASM_MOVABS " $1f,%0\n" \
".section .tracedata,\"a\"\n" \
"1:\t.word %c1\n\t" \
_ASM_PTR " %c2\n" \
--
2.31.1

2023-04-28 09:54:19

by Hou Wenlong

Subject: [PATCH RFC 12/43] x86,rethook: Adapt assembly for PIE support

Change the assembly code to use only relative references to symbols, so
that the kernel can be PIE compatible.

Signed-off-by: Hou Wenlong <[email protected]>
Cc: Thomas Garnier <[email protected]>
Cc: Lai Jiangshan <[email protected]>
Cc: Kees Cook <[email protected]>
---
arch/x86/kernel/rethook.c | 8 ++++++++
1 file changed, 8 insertions(+)

diff --git a/arch/x86/kernel/rethook.c b/arch/x86/kernel/rethook.c
index 8a1c0111ae79..ff3733b765e0 100644
--- a/arch/x86/kernel/rethook.c
+++ b/arch/x86/kernel/rethook.c
@@ -27,7 +27,15 @@ asm(
#ifdef CONFIG_X86_64
ANNOTATE_NOENDBR /* This is only jumped from ret instruction */
/* Push a fake return address to tell the unwinder it's a rethook. */
+#ifdef CONFIG_X86_PIE
+ " pushq $0\n"
+ " pushq %rdi\n"
+ " leaq arch_rethook_trampoline(%rip), %rdi\n"
+ " movq %rdi, 8(%rsp)\n"
+ " popq %rdi\n"
+#else
" pushq $arch_rethook_trampoline\n"
+#endif
UNWIND_HINT_FUNC
" pushq $" __stringify(__KERNEL_DS) "\n"
/* Save the 'sp - 16', this will be fixed later. */
--
2.31.1

2023-04-28 09:54:20

by Hou Wenlong

Subject: [PATCH RFC 13/43] x86/paravirt: Use relative reference for original instruction

Similar to alternative patching, use a relative reference to the
original instruction rather than an absolute one, which saves 8 bytes
per entry on x86_64. It also generates an R_X86_64_PC32 relocation
instead of an R_X86_64_64 relocation, which reduces the relocation
metadata in relocatable builds. The alignment can now be hard-coded
to 4.
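
For reference, such a relative entry round-trips as follows: the
assembler stores target minus entry address in the s32 field, and the
consumer recovers the pointer by adding the field's own address back,
which is exactly what the diff below does:

  /* '.long 771b-.' stored (771b - &p->instr_offset), so: */
  u8 *instr = (u8 *)&p->instr_offset + p->instr_offset;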

Signed-off-by: Hou Wenlong <[email protected]>
Cc: Thomas Garnier <[email protected]>
Cc: Lai Jiangshan <[email protected]>
Cc: Kees Cook <[email protected]>
---
arch/x86/include/asm/paravirt.h | 10 +++++-----
arch/x86/include/asm/paravirt_types.h | 8 ++++----
arch/x86/kernel/alternative.c | 8 +++++---
arch/x86/kernel/callthunks.c | 2 +-
4 files changed, 15 insertions(+), 13 deletions(-)

diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index b49778664d2b..2350ceb43db0 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -742,16 +742,16 @@ extern void default_banner(void);

#else /* __ASSEMBLY__ */

-#define _PVSITE(ptype, ops, word, algn) \
+#define _PVSITE(ptype, ops) \
771:; \
ops; \
772:; \
.pushsection .parainstructions,"a"; \
- .align algn; \
- word 771b; \
+ .align 4; \
+ .long 771b-.; \
.byte ptype; \
.byte 772b-771b; \
- _ASM_ALIGN; \
+ .align 4; \
.popsection


@@ -759,7 +759,7 @@ extern void default_banner(void);
#ifdef CONFIG_PARAVIRT_XXL

#define PARA_PATCH(off) ((off) / 8)
-#define PARA_SITE(ptype, ops) _PVSITE(ptype, ops, .quad, 8)
+#define PARA_SITE(ptype, ops) _PVSITE(ptype, ops)
#define PARA_INDIRECT(addr) *addr(%rip)

#ifdef CONFIG_DEBUG_ENTRY
diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h
index 4acbcddddc29..982a234f5a06 100644
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -5,7 +5,7 @@
#ifndef __ASSEMBLY__
/* These all sit in the .parainstructions section to tell us what to patch. */
struct paravirt_patch_site {
- u8 *instr; /* original instructions */
+ s32 instr_offset; /* original instructions */
u8 type; /* type of this instruction */
u8 len; /* length of original instruction */
};
@@ -270,11 +270,11 @@ extern struct paravirt_patch_template pv_ops;
#define _paravirt_alt(insn_string, type) \
"771:\n\t" insn_string "\n" "772:\n" \
".pushsection .parainstructions,\"a\"\n" \
- _ASM_ALIGN "\n" \
- _ASM_PTR " 771b\n" \
+ " .align 4\n" \
+ " .long 771b-.\n" \
" .byte " type "\n" \
" .byte 772b-771b\n" \
- _ASM_ALIGN "\n" \
+ " .align 4\n" \
".popsection\n"

/* Generate patchable code, with the default asm parameters. */
diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index f615e0cb6d93..25c59da6c53b 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -1230,20 +1230,22 @@ void __init_or_module apply_paravirt(struct paravirt_patch_site *start,
{
struct paravirt_patch_site *p;
char insn_buff[MAX_PATCH_LEN];
+ u8 *instr;

for (p = start; p < end; p++) {
unsigned int used;

+ instr = (u8 *)&p->instr_offset + p->instr_offset;
BUG_ON(p->len > MAX_PATCH_LEN);
/* prep the buffer with the original instructions */
- memcpy(insn_buff, p->instr, p->len);
- used = paravirt_patch(p->type, insn_buff, (unsigned long)p->instr, p->len);
+ memcpy(insn_buff, instr, p->len);
+ used = paravirt_patch(p->type, insn_buff, (unsigned long)instr, p->len);

BUG_ON(used > p->len);

/* Pad the rest with nops */
add_nops(insn_buff + used, p->len - used);
- text_poke_early(p->instr, insn_buff, p->len);
+ text_poke_early(instr, insn_buff, p->len);
}
}
extern struct paravirt_patch_site __start_parainstructions[],
diff --git a/arch/x86/kernel/callthunks.c b/arch/x86/kernel/callthunks.c
index ffea98f9064b..f15405acfd42 100644
--- a/arch/x86/kernel/callthunks.c
+++ b/arch/x86/kernel/callthunks.c
@@ -245,7 +245,7 @@ patch_paravirt_call_sites(struct paravirt_patch_site *start,
struct paravirt_patch_site *p;

for (p = start; p < end; p++)
- patch_call(p->instr, ct);
+ patch_call((void *)&p->instr_offset + p->instr_offset, ct);
}

static __init_or_module void
--
2.31.1

2023-04-28 09:54:39

by Hou Wenlong

Subject: [PATCH RFC 14/43] x86/Kconfig: Introduce new Kconfig for PIE kernel building

Add a new Kconfig option to control PIE kernel building, and keep it
disabled for now.

Suggested-by: Lai Jiangshan <[email protected]>
Signed-off-by: Hou Wenlong <[email protected]>
Cc: Thomas Garnier <[email protected]>
Cc: Kees Cook <[email protected]>
---
arch/x86/Kconfig | 4 ++++
1 file changed, 4 insertions(+)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index c94297369448..68e5da464b96 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -2208,6 +2208,10 @@ config RELOCATABLE
it has been loaded at and the compile time physical address
(CONFIG_PHYSICAL_START) is used as the minimum location.

+config X86_PIE
+ def_bool n
+ depends on X86_64
+
config RANDOMIZE_BASE
bool "Randomize the address of the kernel image (KASLR)"
depends on RELOCATABLE
--
2.31.1

2023-04-28 09:54:44

by Hou Wenlong

Subject: [PATCH RFC 15/43] x86/PVH: Use fixed_percpu_data to set up GS base

startup_64() and startup_xen() both use fixed_percpu_data to set up the
GS base. So for consistency, use it in the PVH entry as well.

Signed-off-by: Hou Wenlong <[email protected]>
Cc: Thomas Garnier <[email protected]>
Cc: Lai Jiangshan <[email protected]>
Cc: Kees Cook <[email protected]>
---
arch/x86/platform/pvh/head.S | 4 +---
1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/arch/x86/platform/pvh/head.S b/arch/x86/platform/pvh/head.S
index c4365a05ab83..b093996b7e19 100644
--- a/arch/x86/platform/pvh/head.S
+++ b/arch/x86/platform/pvh/head.S
@@ -96,7 +96,7 @@ SYM_CODE_START_LOCAL(pvh_start_xen)
1:
/* Set base address in stack canary descriptor. */
mov $MSR_GS_BASE,%ecx
- mov $_pa(canary), %eax
+ mov $_pa(INIT_PER_CPU_VAR(fixed_percpu_data)), %eax
xor %edx, %edx
wrmsr

@@ -156,8 +156,6 @@ SYM_DATA_START_LOCAL(gdt_start)
SYM_DATA_END_LABEL(gdt_start, SYM_L_LOCAL, gdt_end)

.balign 16
-SYM_DATA_LOCAL(canary, .fill 48, 1, 0)
-
SYM_DATA_START_LOCAL(early_stack)
.fill BOOT_STACK_SIZE, 1, 0
SYM_DATA_END_LABEL(early_stack, SYM_L_LOCAL, early_stack_end)
--
2.31.1

2023-04-28 09:55:14

by Hou Wenlong

Subject: [PATCH RFC 17/43] x86/pie: Enable stack protector only if per-cpu stack canary is supported

Since the -fPIE option is incompatible with the -mcmodel=kernel option,
a PIE kernel has to drop -mcmodel=kernel. However, GCC then uses %fs as
the segment register for the stack protector. So only enable the stack
protector for a PIE kernel if the per-cpu stack canary is supported; a
sketch of that compiler support follows.
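
For context, the per-cpu canary support referred to here is the
compiler's ability to address the guard through a chosen segment
register and symbol. The probe looks roughly like this (a sketch, not
the exact script):

  /*
   * Sketch of the probed compiler support: place the canary at a
   * %gs-relative symbol instead of the TLS location that would be
   * assumed once -mcmodel=kernel is dropped.
   *
   *   gcc -mstack-protector-guard-reg=gs \
   *       -mstack-protector-guard-symbol=__stack_chk_guard ...
   */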

Signed-off-by: Hou Wenlong <[email protected]>
Cc: Thomas Garnier <[email protected]>
Cc: Lai Jiangshan <[email protected]>
Cc: Kees Cook <[email protected]>
---
arch/x86/Kconfig | 1 +
1 file changed, 1 insertion(+)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 55cce8cdf9bd..b26941ef50ee 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -403,6 +403,7 @@ config PGTABLE_LEVELS

config CC_HAS_SANE_STACKPROTECTOR
bool
+ default CC_HAS_CUSTOMIZED_STACKPROTECTOR if X86_PIE
default $(success,$(srctree)/scripts/gcc-x86_64-has-stack-protector.sh $(CC) $(CLANG_FLAGS)) if 64BIT
default $(success,$(srctree)/scripts/gcc-x86_32-has-stack-protector.sh $(CC) $(CLANG_FLAGS))
help
--
2.31.1

2023-04-28 09:55:29

by Hou Wenlong

[permalink] [raw]
Subject: [PATCH RFC 19/43] x86/tools: Explicitly include autoconf.h for hostprogs

The relocs tool needs access to the CONFIG_* symbols found in
include/generated/autoconf.h; however, that header file is not included,
so the #if CONFIG_FW_LOADER code in arch/x86/tools/relocs.c is never
compiled.

Signed-off-by: Hou Wenlong <[email protected]>
Cc: Thomas Garnier <[email protected]>
Cc: Lai Jiangshan <[email protected]>
Cc: Kees Cook <[email protected]>
---
arch/x86/tools/Makefile | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/arch/x86/tools/Makefile b/arch/x86/tools/Makefile
index 90e820ac9771..8af4aeeb72af 100644
--- a/arch/x86/tools/Makefile
+++ b/arch/x86/tools/Makefile
@@ -38,7 +38,9 @@ $(obj)/insn_decoder_test.o: $(srctree)/tools/arch/x86/lib/insn.c $(srctree)/tool

$(obj)/insn_sanity.o: $(srctree)/tools/arch/x86/lib/insn.c $(srctree)/tools/arch/x86/lib/inat.c $(srctree)/tools/arch/x86/include/asm/inat_types.h $(srctree)/tools/arch/x86/include/asm/inat.h $(srctree)/tools/arch/x86/include/asm/insn.h $(objtree)/arch/x86/lib/inat-tables.c

-HOST_EXTRACFLAGS += -I$(srctree)/tools/include
+HOST_EXTRACFLAGS += -I$(srctree)/tools/include \
+ -include include/generated/autoconf.h
+
hostprogs += relocs
relocs-objs := relocs_32.o relocs_64.o relocs_common.o
PHONY += relocs
--
2.31.1

2023-04-28 09:55:31

by Hou Wenlong

Subject: [PATCH RFC 21/43] x86/ftrace: Adapt assembly for PIE support

Change the assembly code to use only relative references to symbols, so
that the kernel can be PIE compatible.

Signed-off-by: Hou Wenlong <[email protected]>
Cc: Thomas Garnier <[email protected]>
Cc: Lai Jiangshan <[email protected]>
Cc: Kees Cook <[email protected]>
---
arch/x86/kernel/ftrace_64.S | 9 ++++++++-
1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/ftrace_64.S b/arch/x86/kernel/ftrace_64.S
index eddb4fabc16f..411fa4148e18 100644
--- a/arch/x86/kernel/ftrace_64.S
+++ b/arch/x86/kernel/ftrace_64.S
@@ -315,7 +315,14 @@ STACK_FRAME_NON_STANDARD_FP(ftrace_regs_caller)
SYM_FUNC_START(__fentry__)
CALL_DEPTH_ACCOUNT

+#ifdef CONFIG_X86_PIE
+ pushq %r8
+ leaq ftrace_stub(%rip), %r8
+ cmpq %r8, ftrace_trace_function(%rip)
+ popq %r8
+#else
cmpq $ftrace_stub, ftrace_trace_function
+#endif
jnz trace
RET

@@ -329,7 +336,7 @@ trace:
* ip and parent ip are used and the list function is called when
* function tracing is enabled.
*/
- movq ftrace_trace_function, %r8
+ movq ftrace_trace_function(%rip), %r8
CALL_NOSPEC r8
restore_mcount_regs

--
2.31.1

2023-04-28 09:55:35

by Hou Wenlong

Subject: [PATCH RFC 31/43] x86/modules: Adapt module loading for PIE support

Adapt module loading to support PIE relocations. No GOT is generated
for modules; every GOT reference in a module must resolve to an entry
that already exists in the kernel's GOT. Currently, the only usable GOT
reference is the one for __fentry__().

Signed-off-by: Hou Wenlong <[email protected]>
Cc: Thomas Garnier <[email protected]>
Cc: Lai Jiangshan <[email protected]>
Cc: Kees Cook <[email protected]>
---
arch/x86/include/asm/sections.h | 5 +++++
arch/x86/kernel/module.c | 27 +++++++++++++++++++++++++++
2 files changed, 32 insertions(+)

diff --git a/arch/x86/include/asm/sections.h b/arch/x86/include/asm/sections.h
index a6e8373a5170..dc1c2b08ec48 100644
--- a/arch/x86/include/asm/sections.h
+++ b/arch/x86/include/asm/sections.h
@@ -12,6 +12,11 @@ extern char __end_rodata_aligned[];

#if defined(CONFIG_X86_64)
extern char __end_rodata_hpage_align[];
+
+#ifdef CONFIG_X86_PIE
+extern char __start_got[], __end_got[];
+#endif
+
#endif

extern char __end_of_kernel_reserve[];
diff --git a/arch/x86/kernel/module.c b/arch/x86/kernel/module.c
index 84ad0e61ba6e..051f88e6884e 100644
--- a/arch/x86/kernel/module.c
+++ b/arch/x86/kernel/module.c
@@ -129,6 +129,18 @@ int apply_relocate(Elf32_Shdr *sechdrs,
return 0;
}
#else /*X86_64*/
+#ifdef CONFIG_X86_PIE
+static u64 find_got_kernel_entry(Elf64_Sym *sym, const Elf64_Rela *rela)
+{
+ u64 *pos;
+
+ for (pos = (u64 *)__start_got; pos < (u64 *)__end_got; pos++)
+ if (*pos == sym->st_value)
+ return (u64)pos + rela->r_addend;
+ return 0;
+}
+#endif
+
static int __write_relocate_add(Elf64_Shdr *sechdrs,
const char *strtab,
unsigned int symindex,
@@ -171,6 +183,7 @@ static int __write_relocate_add(Elf64_Shdr *sechdrs,
case R_X86_64_64:
size = 8;
break;
+#ifndef CONFIG_X86_PIE
case R_X86_64_32:
if (val != *(u32 *)&val)
goto overflow;
@@ -181,6 +194,13 @@ static int __write_relocate_add(Elf64_Shdr *sechdrs,
goto overflow;
size = 4;
break;
+#else
+ case R_X86_64_GOTPCREL:
+ val = find_got_kernel_entry(sym, rel);
+ if (!val)
+ goto unexpected_got_reference;
+ fallthrough;
+#endif
case R_X86_64_PC32:
case R_X86_64_PLT32:
val -= (u64)loc;
@@ -214,11 +234,18 @@ static int __write_relocate_add(Elf64_Shdr *sechdrs,
}
return 0;

+#ifdef CONFIG_X86_PIE
+unexpected_got_reference:
+ pr_err("Target got entry doesn't exist in kernel got, loc %p\n", loc);
+ return -ENOEXEC;
+#else
overflow:
pr_err("overflow in relocation type %d val %Lx\n",
(int)ELF64_R_TYPE(rel[i].r_info), val);
pr_err("`%s' likely not compiled with -mcmodel=kernel\n",
me->name);
+#endif
+
return -ENOEXEC;
}

--
2.31.1

2023-04-28 09:55:59

by Hou Wenlong

Subject: [PATCH RFC 28/43] KVM: x86: Adapt assembly for PIE support

Change the assembly code to use only relative references to symbols, so
that the kernel can be PIE compatible.

Signed-off-by: Hou Wenlong <[email protected]>
Cc: Thomas Garnier <[email protected]>
Cc: Lai Jiangshan <[email protected]>
Cc: Kees Cook <[email protected]>
---
arch/x86/kvm/svm/vmenter.S | 10 +++++-----
arch/x86/kvm/vmx/vmenter.S | 2 +-
2 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kvm/svm/vmenter.S b/arch/x86/kvm/svm/vmenter.S
index 8e8295e774f0..25be1a66c59d 100644
--- a/arch/x86/kvm/svm/vmenter.S
+++ b/arch/x86/kvm/svm/vmenter.S
@@ -270,16 +270,16 @@ SYM_FUNC_START(__svm_vcpu_run)
RESTORE_GUEST_SPEC_CTRL_BODY
RESTORE_HOST_SPEC_CTRL_BODY

-10: cmpb $0, kvm_rebooting
+10: cmpb $0, _ASM_RIP(kvm_rebooting)
jne 2b
ud2
-30: cmpb $0, kvm_rebooting
+30: cmpb $0, _ASM_RIP(kvm_rebooting)
jne 4b
ud2
-50: cmpb $0, kvm_rebooting
+50: cmpb $0, _ASM_RIP(kvm_rebooting)
jne 6b
ud2
-70: cmpb $0, kvm_rebooting
+70: cmpb $0, _ASM_RIP(kvm_rebooting)
jne 8b
ud2

@@ -381,7 +381,7 @@ SYM_FUNC_START(__svm_sev_es_vcpu_run)
RESTORE_GUEST_SPEC_CTRL_BODY
RESTORE_HOST_SPEC_CTRL_BODY

-3: cmpb $0, kvm_rebooting
+3: cmpb $0, _ASM_RIP(kvm_rebooting)
jne 2b
ud2

diff --git a/arch/x86/kvm/vmx/vmenter.S b/arch/x86/kvm/vmx/vmenter.S
index 631fd7da2bc3..b7cc3c17736a 100644
--- a/arch/x86/kvm/vmx/vmenter.S
+++ b/arch/x86/kvm/vmx/vmenter.S
@@ -289,7 +289,7 @@ SYM_INNER_LABEL(vmx_vmexit, SYM_L_GLOBAL)
RET

.Lfixup:
- cmpb $0, kvm_rebooting
+ cmpb $0, _ASM_RIP(kvm_rebooting)
jne .Lvmfail
ud2
.Lvmfail:
--
2.31.1

2023-04-28 09:56:27

by Hou Wenlong

Subject: [PATCH RFC 24/43] x86/boot/compressed: Adapt sed command to generate voffset.h when PIE is enabled

When PIE is enabled, all symbols are given hidden visibility to reduce
GOT references. According to the generic ABI, a hidden symbol contained
in a relocatable object must be either removed or converted to
STB_LOCAL binding by the link-editor when the relocatable object is
included in an executable file or shared object. Both gold and ld.lld
change the binding of an STV_HIDDEN symbol to STB_LOCAL, but GNU ld
keeps it global. However, the sed command that generates voffset.h only
captures global symbols, so an empty voffset.h is generated when PIE is
enabled with lld. Capture local symbols in the sed command too, as
shown below.
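
For illustration, 'nm vmlinux' prints upper-case type letters for
global symbols and lower-case for local ones, so the same hidden symbol
can show up either way depending on the linker (addresses invented):

  /*
   *   ffffffff81000000 T _text   <- GNU ld: stays global (hidden)
   *   ffffffff81000000 t _text   <- gold/ld.lld: demoted to STB_LOCAL
   */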

Signed-off-by: Hou Wenlong <[email protected]>
Cc: Thomas Garnier <[email protected]>
Cc: Lai Jiangshan <[email protected]>
Cc: Kees Cook <[email protected]>
---
arch/x86/boot/compressed/Makefile | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/boot/compressed/Makefile b/arch/x86/boot/compressed/Makefile
index 6b6cfe607bdb..678881496c44 100644
--- a/arch/x86/boot/compressed/Makefile
+++ b/arch/x86/boot/compressed/Makefile
@@ -79,7 +79,7 @@ LDFLAGS_vmlinux += -T
hostprogs := mkpiggy
HOST_EXTRACFLAGS += -I$(srctree)/tools/include

-sed-voffset := -e 's/^\([0-9a-fA-F]*\) [ABCDGRSTVW] \(_text\|__bss_start\|_end\)$$/\#define VO_\2 _AC(0x\1,UL)/p'
+sed-voffset := -e 's/^\([0-9a-fA-F]*\) [ABCDGRSTVWabcdgrstvw] \(_text\|__bss_start\|_end\)$$/\#define VO_\2 _AC(0x\1,UL)/p'

quiet_cmd_voffset = VOFFSET $@
cmd_voffset = $(NM) $< | sed -n $(sed-voffset) > $@
--
2.31.1

2023-04-28 09:56:32

by Hou Wenlong

Subject: [PATCH RFC 23/43] x86/pie: Force hidden visibility for all symbol references

Eliminate all GOT entries in the kernel by forcing hidden visibility
for all symbol references, which informs the compiler that such
references will be resolved at link time without the need to allocate
GOT entries. However, a few GOT entries remain after this: one for the
__fentry__() indirect call, and others due to global weak symbol
references. The pragma behind this is sketched below.
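
For reference, the heart of include/linux/hidden.h is a single
visibility pragma, roughly (abridged from memory):

  /* include/linux/hidden.h (abridged) */
  #ifndef __ASSEMBLY__
  #pragma GCC visibility push(hidden)
  #endif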

Signed-off-by: Hou Wenlong <[email protected]>
Cc: Thomas Garnier <[email protected]>
Cc: Lai Jiangshan <[email protected]>
Cc: Kees Cook <[email protected]>
---
arch/x86/Makefile | 7 +++++++
arch/x86/entry/vdso/Makefile | 2 +-
2 files changed, 8 insertions(+), 1 deletion(-)

diff --git a/arch/x86/Makefile b/arch/x86/Makefile
index 57e4dbbf501d..81500011396d 100644
--- a/arch/x86/Makefile
+++ b/arch/x86/Makefile
@@ -158,6 +158,11 @@ else
KBUILD_RUSTFLAGS += $(rustflags-y)

KBUILD_CFLAGS += -mno-red-zone
+
+ifdef CONFIG_X86_PIE
+ PIE_CFLAGS := -include $(srctree)/include/linux/hidden.h
+ KBUILD_CFLAGS += $(PIE_CFLAGS)
+endif
KBUILD_CFLAGS += -mcmodel=kernel
KBUILD_RUSTFLAGS += -Cno-redzone=y
KBUILD_RUSTFLAGS += -Ccode-model=kernel
@@ -176,6 +181,8 @@ ifeq ($(CONFIG_STACKPROTECTOR),y)
endif
endif

+export PIE_CFLAGS
+
#
# If the function graph tracer is used with mcount instead of fentry,
# '-maccumulate-outgoing-args' is needed to prevent a GCC bug
diff --git a/arch/x86/entry/vdso/Makefile b/arch/x86/entry/vdso/Makefile
index 6a1821bd7d5e..9437653a9de2 100644
--- a/arch/x86/entry/vdso/Makefile
+++ b/arch/x86/entry/vdso/Makefile
@@ -92,7 +92,7 @@ ifneq ($(RETPOLINE_VDSO_CFLAGS),)
endif
endif

-$(vobjs): KBUILD_CFLAGS := $(filter-out $(PADDING_CFLAGS) $(CC_FLAGS_LTO) $(CC_FLAGS_CFI) $(RANDSTRUCT_CFLAGS) $(GCC_PLUGINS_CFLAGS) $(RETPOLINE_CFLAGS),$(KBUILD_CFLAGS)) $(CFL)
+$(vobjs): KBUILD_CFLAGS := $(filter-out $(PIE_CFLAGS) $(PADDING_CFLAGS) $(CC_FLAGS_LTO) $(CC_FLAGS_CFI) $(RANDSTRUCT_CFLAGS) $(GCC_PLUGINS_CFLAGS) $(RETPOLINE_CFLAGS),$(KBUILD_CFLAGS)) $(CFL)
$(vobjs): KBUILD_AFLAGS += -DBUILD_VDSO

#
--
2.31.1

2023-04-28 09:56:46

by Hou Wenlong

Subject: [PATCH RFC 34/43] objtool: Adapt indirect call of __fentry__() for PIE support

When using PIE with function tracing, the compiler generates a call
through the GOT (call *__fentry__@GOTPCREL). This instruction is an
indirect call (INSN_CALL_DYNAMIC) and wouldn't be collected by
add_call_destinations(). So collect those indirect calls of
__fentry__() individually for PIE support, and replace the 6th byte of
the GOT call with a 1-byte NOP so that ftrace can handle the preceding
5 bytes as before; a byte-level sketch follows.

When RETPOLINE is enabled, __fentry__() is still an indirect call,
which generates warnings in objtool. For simplicity, select
DYNAMIC_FTRACE to patch it as NOPs, and regard it as INSN_CALL to avoid
the jump table and retpoline warnings in objtool.
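
A byte-level view of that rewrite (displacement bytes arbitrary):

  /*
   * call *__fentry__@GOTPCREL(%rip):  ff 15 xx xx xx xx  (6 bytes)
   * patched NOP form:                 0f 1f 44 00 00 90  (NOP5 + NOP1)
   *
   * The first five bytes still look like a classic 5-byte mcount
   * site, so the existing ftrace patching keeps working unchanged.
   */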

Signed-off-by: Hou Wenlong <[email protected]>
Cc: Thomas Garnier <[email protected]>
Cc: Lai Jiangshan <[email protected]>
Cc: Kees Cook <[email protected]>
---
arch/x86/Kconfig | 1 +
tools/objtool/arch/x86/decode.c | 10 +++++++--
tools/objtool/check.c | 39 +++++++++++++++++++++++++++++++++
3 files changed, 48 insertions(+), 2 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index b753a54e5ea7..5ac5f335855e 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -2225,6 +2225,7 @@ config X86_PIE
def_bool n
depends on X86_64
select OBJTOOL if HAVE_OBJTOOL
+ select DYNAMIC_FTRACE if FUNCTION_TRACER && RETPOLINE

config RANDOMIZE_BASE
bool "Randomize the address of the kernel image (KASLR)"
diff --git a/tools/objtool/arch/x86/decode.c b/tools/objtool/arch/x86/decode.c
index 9ef024fd648c..cd9a81002efe 100644
--- a/tools/objtool/arch/x86/decode.c
+++ b/tools/objtool/arch/x86/decode.c
@@ -747,15 +747,21 @@ void arch_initial_func_cfi_state(struct cfi_init_state *state)

const char *arch_nop_insn(int len)
{
- static const char nops[5][5] = {
+ static const char nops[6][6] = {
{ BYTES_NOP1 },
{ BYTES_NOP2 },
{ BYTES_NOP3 },
{ BYTES_NOP4 },
{ BYTES_NOP5 },
+ /*
+ * For a PIE kernel, use a 5-byte nop
+ * plus a 1-byte nop to keep the ftrace
+ * hooking algorithm working correctly.
+ */
+ { BYTES_NOP5, BYTES_NOP1 },
};

- if (len < 1 || len > 5) {
+ if (len < 1 || len > 6) {
WARN("invalid NOP size: %d\n", len);
return NULL;
}
diff --git a/tools/objtool/check.c b/tools/objtool/check.c
index d67b80251eec..2456ab931fe5 100644
--- a/tools/objtool/check.c
+++ b/tools/objtool/check.c
@@ -1785,6 +1785,38 @@ static int add_call_destinations(struct objtool_file *file)
return 0;
}

+static int add_indirect_mcount_calls(struct objtool_file *file)
+{
+ struct instruction *insn;
+ struct reloc *reloc;
+
+ for_each_insn(file, insn) {
+ if (insn->type != INSN_CALL_DYNAMIC)
+ continue;
+
+ reloc = insn_reloc(file, insn);
+ if (!reloc)
+ continue;
+ if (!reloc->sym->fentry)
+ continue;
+
+ /*
+ * __fentry__() is an indirect call even in a RETPOLINE build
+ * when X86_PIE is enabled, so DYNAMIC_FTRACE is selected. Then
+ * all indirect calls of __fentry__() would be patched as NOP
+ * later, so regard it as retpoline safe as a hack here. Also
+ * regard it as a direct call; otherwise it would be treated as
+ * a jump to a jump table in insn_jump_table(), because
+ * _jump_table and _call_dest share the same memory.
+ */
+ insn->type = INSN_CALL;
+ insn->retpoline_safe = true;
+ add_call_dest(file, insn, reloc->sym, false);
+ }
+
+ return 0;
+}
+
/*
* The .alternatives section requires some extra special care over and above
* other special sections because alternatives are patched in place.
@@ -2668,6 +2700,13 @@ static int decode_sections(struct objtool_file *file)
if (ret)
return ret;

+ /*
+ * For an X86 PIE kernel, the __fentry__ call is an indirect call
+ * instead of a direct call.
+ */
+ if (opts.pie)
+ add_indirect_mcount_calls(file);
+
/*
* Must be after add_call_destinations() such that it can override
* dead_end_function() marks.
--
2.31.1

2023-04-28 09:56:52

by Hou Wenlong

Subject: [PATCH RFC 39/43] x86/fixmap: Unify FIXADDR_TOP

Now that FIXADDR_TOP has nothing to do with the vsyscall page, it can
be declared as a variable for x86_64 as well, so unify it across x86.

Suggested-by: Lai Jiangshan <[email protected]>
Signed-off-by: Hou Wenlong <[email protected]>
Cc: Thomas Garnier <[email protected]>
Cc: Kees Cook <[email protected]>
---
arch/x86/include/asm/fixmap.h | 13 -------------
arch/x86/kernel/head64.c | 1 -
arch/x86/mm/dump_pagetables.c | 3 ++-
arch/x86/mm/ioremap.c | 5 ++---
arch/x86/mm/pgtable.c | 13 +++++++++++++
arch/x86/mm/pgtable_32.c | 3 ---
6 files changed, 17 insertions(+), 21 deletions(-)

diff --git a/arch/x86/include/asm/fixmap.h b/arch/x86/include/asm/fixmap.h
index eeb152ad9682..9433109e4853 100644
--- a/arch/x86/include/asm/fixmap.h
+++ b/arch/x86/include/asm/fixmap.h
@@ -40,21 +40,8 @@
#include <linux/threads.h>
#endif

-/*
- * We can't declare FIXADDR_TOP as variable for x86_64 because vsyscall
- * uses fixmaps that relies on FIXADDR_TOP for proper address calculation.
- * Because of this, FIXADDR_TOP x86 integration was left as later work.
- */
-#ifdef CONFIG_X86_32
-/*
- * Leave one empty page between vmalloc'ed areas and
- * the start of the fixmap.
- */
extern unsigned long __FIXADDR_TOP;
#define FIXADDR_TOP ((unsigned long)__FIXADDR_TOP)
-#else
-#define FIXADDR_TOP (0xffffffffff600000UL - PAGE_SIZE)
-#endif

/*
* Here we define all the compile-time 'special' virtual
diff --git a/arch/x86/kernel/head64.c b/arch/x86/kernel/head64.c
index ef7ad96f2154..8295b547b64f 100644
--- a/arch/x86/kernel/head64.c
+++ b/arch/x86/kernel/head64.c
@@ -499,7 +499,6 @@ asmlinkage __visible void __init __noreturn x86_64_start_kernel(char * real_mode
BUILD_BUG_ON(!(MODULES_VADDR > __START_KERNEL));
MAYBE_BUILD_BUG_ON(!(((MODULES_END - 1) & PGDIR_MASK) ==
(__START_KERNEL & PGDIR_MASK)));
- BUILD_BUG_ON(__fix_to_virt(__end_of_fixed_addresses) <= MODULES_END);

cr4_init_shadow();

diff --git a/arch/x86/mm/dump_pagetables.c b/arch/x86/mm/dump_pagetables.c
index e1b599ecbbc2..df1a708a038a 100644
--- a/arch/x86/mm/dump_pagetables.c
+++ b/arch/x86/mm/dump_pagetables.c
@@ -104,7 +104,7 @@ static struct addr_marker address_markers[] = {
[HIGH_KERNEL_NR] = { __START_KERNEL_map, "High Kernel Mapping" },
[MODULES_VADDR_NR] = { MODULES_VADDR, "Modules" },
[MODULES_END_NR] = { MODULES_END, "End Modules" },
- [FIXADDR_START_NR] = { FIXADDR_START, "Fixmap Area" },
+ [FIXADDR_START_NR] = { 0UL, "Fixmap Area" },
[END_OF_SPACE_NR] = { -1, NULL }
};

@@ -453,6 +453,7 @@ static int __init pt_dump_init(void)
address_markers[KASAN_SHADOW_START_NR].start_address = KASAN_SHADOW_START;
address_markers[KASAN_SHADOW_END_NR].start_address = KASAN_SHADOW_END;
#endif
+ address_markers[FIXADDR_START_NR].start_address = FIXADDR_START;
#endif
#ifdef CONFIG_X86_32
address_markers[VMALLOC_START_NR].start_address = VMALLOC_START;
diff --git a/arch/x86/mm/ioremap.c b/arch/x86/mm/ioremap.c
index aa7d279321ea..44f9c6781c15 100644
--- a/arch/x86/mm/ioremap.c
+++ b/arch/x86/mm/ioremap.c
@@ -879,10 +879,9 @@ void __init early_ioremap_init(void)
pmd_t *pmd;

#ifdef CONFIG_X86_64
- BUILD_BUG_ON((fix_to_virt(0) + PAGE_SIZE) & ((1 << PMD_SHIFT) - 1));
-#else
- WARN_ON((fix_to_virt(0) + PAGE_SIZE) & ((1 << PMD_SHIFT) - 1));
+ BUG_ON(__fix_to_virt(__end_of_fixed_addresses) <= MODULES_END);
#endif
+ WARN_ON((fix_to_virt(0) + PAGE_SIZE) & ((1 << PMD_SHIFT) - 1));

early_ioremap_setup();

diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index afab0bc7862b..726c0c369676 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -627,6 +627,19 @@ pmd_t pmdp_invalidate_ad(struct vm_area_struct *vma, unsigned long address,
}
#endif

+#ifdef CONFIG_X86_32
+/*
+ * Leave one empty page between vmalloc'ed areas and
+ * the start of the fixmap.
+ */
+#define __FIXADDR_TOP_BASE 0xfffff000
+#else
+#define __FIXADDR_TOP_BASE (0xffffffffff600000UL - PAGE_SIZE)
+#endif
+
+unsigned long __FIXADDR_TOP = __FIXADDR_TOP_BASE;
+EXPORT_SYMBOL(__FIXADDR_TOP);
+
/**
* reserve_top_address - reserves a hole in the top of kernel address space
* @reserve - size of hole to reserve
diff --git a/arch/x86/mm/pgtable_32.c b/arch/x86/mm/pgtable_32.c
index c234634e26ba..2b9a00976fee 100644
--- a/arch/x86/mm/pgtable_32.c
+++ b/arch/x86/mm/pgtable_32.c
@@ -65,9 +65,6 @@ void set_pte_vaddr(unsigned long vaddr, pte_t pteval)
flush_tlb_one_kernel(vaddr);
}

-unsigned long __FIXADDR_TOP = 0xfffff000;
-EXPORT_SYMBOL(__FIXADDR_TOP);
-
/*
* vmalloc=size forces the vmalloc area to be exactly 'size'
* bytes. This can be used to increase (or decrease) the
--
2.31.1

2023-04-28 09:56:53

by Hou Wenlong

Subject: [PATCH RFC 36/43] x86/vsyscall: Don't use set_fixmap() to map vsyscall page

In order to unify FIXADDR_TOP for x86 and allow the fixmap area to be
movable, the vsyscall page should be mapped individually. However, for
a XENPV guest, the vsyscall page needs to be mapped into the user page
table too. So introduce a new pv_mmu_ops operation to help map the
vsyscall page.

Suggested-by: Lai Jiangshan <[email protected]>
Signed-off-by: Hou Wenlong <[email protected]>
Cc: Thomas Garnier <[email protected]>
Cc: Kees Cook <[email protected]>
---
arch/x86/entry/vsyscall/vsyscall_64.c | 3 +--
arch/x86/include/asm/paravirt.h | 7 +++++++
arch/x86/include/asm/paravirt_types.h | 4 ++++
arch/x86/include/asm/vsyscall.h | 13 +++++++++++++
arch/x86/kernel/paravirt.c | 4 ++++
arch/x86/xen/mmu_pv.c | 20 ++++++++++++++------
6 files changed, 43 insertions(+), 8 deletions(-)

diff --git a/arch/x86/entry/vsyscall/vsyscall_64.c b/arch/x86/entry/vsyscall/vsyscall_64.c
index e0ca8120aea8..4373460ebbde 100644
--- a/arch/x86/entry/vsyscall/vsyscall_64.c
+++ b/arch/x86/entry/vsyscall/vsyscall_64.c
@@ -385,8 +385,7 @@ void __init map_vsyscall(void)
* page.
*/
if (vsyscall_mode == EMULATE) {
- __set_fixmap(VSYSCALL_PAGE, physaddr_vsyscall,
- PAGE_KERNEL_VVAR);
+ __set_vsyscall_page(physaddr_vsyscall, PAGE_KERNEL_VVAR);
set_vsyscall_pgtable_user_bits(swapper_pg_dir);
}

diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index 2350ceb43db0..dcc0706287ee 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -576,6 +576,13 @@ static inline void __set_fixmap(unsigned /* enum fixed_addresses */ idx,
{
pv_ops.mmu.set_fixmap(idx, phys, flags);
}
+
+#ifdef CONFIG_X86_VSYSCALL_EMULATION
+static inline void __set_vsyscall_page(phys_addr_t phys, pgprot_t flags)
+{
+ pv_ops.mmu.set_vsyscall_page(phys, flags);
+}
+#endif
#endif

#if defined(CONFIG_SMP) && defined(CONFIG_PARAVIRT_SPINLOCKS)
diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h
index 982a234f5a06..e79f38232849 100644
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -224,6 +224,10 @@ struct pv_mmu_ops {
an mfn. We can tell which is which from the index. */
void (*set_fixmap)(unsigned /* enum fixed_addresses */ idx,
phys_addr_t phys, pgprot_t flags);
+
+#ifdef CONFIG_X86_VSYSCALL_EMULATION
+ void (*set_vsyscall_page)(phys_addr_t phys, pgprot_t flags);
+#endif
#endif
} __no_randomize_layout;

diff --git a/arch/x86/include/asm/vsyscall.h b/arch/x86/include/asm/vsyscall.h
index ab60a71a8dcb..73691fc60924 100644
--- a/arch/x86/include/asm/vsyscall.h
+++ b/arch/x86/include/asm/vsyscall.h
@@ -2,6 +2,7 @@
#ifndef _ASM_X86_VSYSCALL_H
#define _ASM_X86_VSYSCALL_H

+#include <asm/pgtable.h>
#include <linux/seqlock.h>
#include <uapi/asm/vsyscall.h>

@@ -15,6 +16,18 @@ extern void set_vsyscall_pgtable_user_bits(pgd_t *root);
*/
extern bool emulate_vsyscall(unsigned long error_code,
struct pt_regs *regs, unsigned long address);
+static inline void native_set_vsyscall_page(phys_addr_t phys, pgprot_t flags)
+{
+ pgprot_val(flags) &= __default_kernel_pte_mask;
+ set_pte_vaddr(VSYSCALL_ADDR, pfn_pte(phys >> PAGE_SHIFT, flags));
+}
+
+#ifndef CONFIG_PARAVIRT_XXL
+#define __set_vsyscall_page native_set_vsyscall_page
+#else
+#include <asm/paravirt.h>
+#endif
+
#else
static inline void map_vsyscall(void) {}
static inline bool emulate_vsyscall(unsigned long error_code,
diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
index ac10b46c5832..13c81402f377 100644
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -33,6 +33,7 @@
#include <asm/tlb.h>
#include <asm/io_bitmap.h>
#include <asm/gsseg.h>
+#include <asm/vsyscall.h>

/*
* nop stub, which must not clobber anything *including the stack* to
@@ -357,6 +358,9 @@ struct paravirt_patch_template pv_ops = {
},

.mmu.set_fixmap = native_set_fixmap,
+#ifdef CONFIG_X86_VSYSCALL_EMULATION
+ .mmu.set_vsyscall_page = native_set_vsyscall_page,
+#endif
#endif /* CONFIG_PARAVIRT_XXL */

#if defined(CONFIG_PARAVIRT_SPINLOCKS)
diff --git a/arch/x86/xen/mmu_pv.c b/arch/x86/xen/mmu_pv.c
index fdc91deece7e..a59bc013ee5b 100644
--- a/arch/x86/xen/mmu_pv.c
+++ b/arch/x86/xen/mmu_pv.c
@@ -59,6 +59,7 @@

#include <asm/tlbflush.h>
#include <asm/fixmap.h>
+#include <asm/vsyscall.h>
#include <asm/mmu_context.h>
#include <asm/setup.h>
#include <asm/paravirt.h>
@@ -2020,9 +2021,6 @@ static void xen_set_fixmap(unsigned idx, phys_addr_t phys, pgprot_t prot)

switch (idx) {
case FIX_BTMAP_END ... FIX_BTMAP_BEGIN:
-#ifdef CONFIG_X86_VSYSCALL_EMULATION
- case VSYSCALL_PAGE:
-#endif
/* All local page mappings */
pte = pfn_pte(phys, prot);
break;
@@ -2058,14 +2056,21 @@ static void xen_set_fixmap(unsigned idx, phys_addr_t phys, pgprot_t prot)
vaddr = __fix_to_virt(idx);
if (HYPERVISOR_update_va_mapping(vaddr, pte, UVMF_INVLPG))
BUG();
+}

#ifdef CONFIG_X86_VSYSCALL_EMULATION
+static void xen_set_vsyscall_page(phys_addr_t phys, pgprot_t prot)
+{
+ pte_t pte = pfn_pte(phys >> PAGE_SHIFT, prot);
+
+ if (HYPERVISOR_update_va_mapping(VSYSCALL_ADDR, pte, UVMF_INVLPG))
+ BUG();
+
/* Replicate changes to map the vsyscall page into the user
pagetable vsyscall mapping. */
- if (idx == VSYSCALL_PAGE)
- set_pte_vaddr_pud(level3_user_vsyscall, vaddr, pte);
-#endif
+ set_pte_vaddr_pud(level3_user_vsyscall, VSYSCALL_ADDR, pte);
}
+#endif

static void __init xen_post_allocator_init(void)
{
@@ -2156,6 +2161,9 @@ static const typeof(pv_ops) xen_mmu_ops __initconst = {
},

.set_fixmap = xen_set_fixmap,
+#ifdef CONFIG_X86_VSYSCALL_EMULATION
+ .set_vsyscall_page = xen_set_vsyscall_page,
+#endif
},
};

--
2.31.1

2023-04-28 09:56:55

by Hou Wenlong

[permalink] [raw]
Subject: [PATCH RFC 32/43] x86/boot/64: Use data relocation to get absolute address when PIE is enabled

When PIE is enabled, all symbol references are RIP-relative, so there
is no need to fix up global symbol references when running at a low
address. However, to acquire the absolute virtual address of a symbol,
introduce a macro that uses a data relocation to obtain it.
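
For illustration, a minimal sketch of the difference (the variable and
function names here are only examples, not part of the patch):

/*
 * Under -fPIE, "(unsigned long)_text" compiles to a RIP-relative LEA
 * and yields the address the code is currently running at. The static
 * __initdata slot instead carries an R_X86_64_64 data relocation, so
 * once relocations have been applied it holds the absolute (high)
 * kernel virtual address.
 */
extern char _text[];

static unsigned long __initdata text_abs = (unsigned long)_text;

static unsigned long example_delta(void)
{
        unsigned long text_rip = (unsigned long)_text; /* RIP-relative */

        /* Non-zero while running in the identity mapping. */
        return text_abs - text_rip;
}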

Suggested-by: Lai Jiangshan <[email protected]>
Signed-off-by: Hou Wenlong <[email protected]>
Cc: Thomas Garnier <[email protected]>
Cc: Kees Cook <[email protected]>
---
arch/x86/kernel/head64.c | 30 ++++++++++++++++++++++--------
1 file changed, 22 insertions(+), 8 deletions(-)

diff --git a/arch/x86/kernel/head64.c b/arch/x86/kernel/head64.c
index 49f7629b17f7..ef7ad96f2154 100644
--- a/arch/x86/kernel/head64.c
+++ b/arch/x86/kernel/head64.c
@@ -86,10 +86,22 @@ static struct desc_ptr startup_gdt_descr = {

#define __head __section(".head.text")

+#ifdef CONFIG_X86_PIE
+#define SYM_ABS_VAL(sym) \
+ ({ static unsigned long __initdata __##sym = (unsigned long)sym; __##sym; })
+
+static void __head *fixup_pointer(void *ptr, unsigned long physaddr)
+{
+ return ptr;
+}
+#else
+#define SYM_ABS_VAL(sym) ((unsigned long)sym)
+
static void __head *fixup_pointer(void *ptr, unsigned long physaddr)
{
return ptr - (void *)_text + (void *)physaddr;
}
+#endif /* CONFIG_X86_PIE */

static unsigned long __head *fixup_long(void *ptr, unsigned long physaddr)
{
@@ -142,8 +154,8 @@ static unsigned long __head sme_postprocess_startup(struct boot_params *bp, pmdv
* attribute.
*/
if (sme_get_me_mask()) {
- vaddr = (unsigned long)__start_bss_decrypted;
- vaddr_end = (unsigned long)__end_bss_decrypted;
+ vaddr = SYM_ABS_VAL(__start_bss_decrypted);
+ vaddr_end = SYM_ABS_VAL(__end_bss_decrypted);

for (; vaddr < vaddr_end; vaddr += PMD_SIZE) {
/*
@@ -189,6 +201,8 @@ unsigned long __head __startup_64(unsigned long physaddr,
bool la57;
int i;
unsigned int *next_pgt_ptr;
+ unsigned long text_base = SYM_ABS_VAL(_text);
+ unsigned long end_base = SYM_ABS_VAL(_end);

la57 = check_la57_support(physaddr);

@@ -200,7 +214,7 @@ unsigned long __head __startup_64(unsigned long physaddr,
* Compute the delta between the address I am compiled to run at
* and the address I am actually running at.
*/
- load_delta = physaddr - (unsigned long)(_text - __START_KERNEL_map);
+ load_delta = physaddr - (text_base - __START_KERNEL_map);

/* Is the address not 2M aligned? */
if (load_delta & ~PMD_MASK)
@@ -214,9 +228,9 @@ unsigned long __head __startup_64(unsigned long physaddr,
pgd = fixup_pointer(&early_top_pgt, physaddr);
p = pgd + pgd_index(__START_KERNEL_map);
if (la57)
- *p = (unsigned long)level4_kernel_pgt;
+ *p = SYM_ABS_VAL(level4_kernel_pgt);
else
- *p = (unsigned long)level3_kernel_pgt;
+ *p = SYM_ABS_VAL(level3_kernel_pgt);
*p += _PAGE_TABLE_NOENC - __START_KERNEL_map + load_delta;

if (la57) {
@@ -273,7 +287,7 @@ unsigned long __head __startup_64(unsigned long physaddr,
pmd_entry += sme_get_me_mask();
pmd_entry += physaddr;

- for (i = 0; i < DIV_ROUND_UP(_end - _text, PMD_SIZE); i++) {
+ for (i = 0; i < DIV_ROUND_UP(end_base - text_base, PMD_SIZE); i++) {
int idx = i + (physaddr >> PMD_SHIFT);

pmd[idx % PTRS_PER_PMD] = pmd_entry + i * PMD_SIZE;
@@ -298,11 +312,11 @@ unsigned long __head __startup_64(unsigned long physaddr,
pmd = fixup_pointer(level2_kernel_pgt, physaddr);

/* invalidate pages before the kernel image */
- for (i = 0; i < pmd_index((unsigned long)_text); i++)
+ for (i = 0; i < pmd_index(text_base); i++)
pmd[i] &= ~_PAGE_PRESENT;

/* fixup pages that are part of the kernel image */
- for (; i <= pmd_index((unsigned long)_end); i++)
+ for (; i <= pmd_index(end_base); i++)
if (pmd[i] & _PAGE_PRESENT)
pmd[i] += load_delta;

--
2.31.1

2023-04-28 09:57:01

by Hou Wenlong

[permalink] [raw]
Subject: [PATCH RFC 29/43] x86/PVH: Adapt PVH booting for PIE support

If PIE is enabled, all symbol references are RIP-relative. However,
PVH booting runs in the low address space, which could cause wrong
x86_init callback assignments. Since init_top_pgt has already built the
high kernel address mapping, let PVH booting run in the high address
space to make everything work correctly.

PVH booting assumes that no relocation has happened. Since the kernel's
compile-time address is still in the top 2G, it is allowed to use
R_X86_64_32S for symbol references in pvh_start_xen().
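
For reference, a short sketch of why R_X86_64_32S still works here:

/*
 * R_X86_64_32S stores a sign-extended 32-bit value. Any address in
 * the top 2G, e.g. 0xffffffff81000000, truncates to 0x81000000 and
 * sign-extends back to the original value, so the link succeeds.
 * An address below 0xffffffff80000000 would not fit, which is why
 * this only holds while the compile-time base stays in the top 2G.
 */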

Signed-off-by: Hou Wenlong <[email protected]>
Cc: Thomas Garnier <[email protected]>
Cc: Lai Jiangshan <[email protected]>
Cc: Kees Cook <[email protected]>
---
arch/x86/platform/pvh/head.S | 17 +++++++++++++++--
1 file changed, 15 insertions(+), 2 deletions(-)

diff --git a/arch/x86/platform/pvh/head.S b/arch/x86/platform/pvh/head.S
index 5842fe0e4f96..09518d4de042 100644
--- a/arch/x86/platform/pvh/head.S
+++ b/arch/x86/platform/pvh/head.S
@@ -94,6 +94,13 @@ SYM_CODE_START_LOCAL(pvh_start_xen)
/* 64-bit entry point. */
.code64
1:
+#ifdef CONFIG_X86_PIE
+ movabs $2f, %rax
+ ANNOTATE_RETPOLINE_SAFE
+ jmp *%rax
+2:
+ ANNOTATE_NOENDBR // above
+#endif
/* Set base address in stack canary descriptor. */
mov $MSR_GS_BASE,%ecx
#if defined(CONFIG_STACKPROTECTOR_FIXED)
@@ -149,9 +156,15 @@ SYM_CODE_END(pvh_start_xen)
.section ".init.data","aw"
.balign 8
SYM_DATA_START_LOCAL(gdt)
+ /*
+ * Use _ASM_PTR (a quad on x86-64) for _pa(gdt_start) because PIE
+ * requires a pointer-sized storage slot for the relocation to be
+ * applied to. On 32-bit, _ASM_PTR is a long, which already matches
+ * the space needed for the relocation.
+ */
.word gdt_end - gdt_start
- .long _pa(gdt_start)
- .word 0
+ _ASM_PTR _pa(gdt_start)
+ .balign 8
SYM_DATA_END(gdt)
SYM_DATA_START_LOCAL(gdt_start)
.quad 0x0000000000000000 /* NULL descriptor */
--
2.31.1

2023-04-28 09:58:15

by Hou Wenlong

[permalink] [raw]
Subject: [PATCH RFC 33/43] objtool: Add validation for x86 PIE support

For an x86 PIE binary, only RIP-relative addressing is allowed. However,
there are still a few absolute references of the R_X86_64_64 relocation
type in the data sections, and a few absolute references of the
R_X86_64_32S relocation type in the pvh_start_xen() function.

Suggested-by: Lai Jiangshan <[email protected]>
Signed-off-by: Hou Wenlong <[email protected]>
Cc: Thomas Garnier <[email protected]>
Cc: Kees Cook <[email protected]>
---
arch/x86/Kconfig | 1 +
scripts/Makefile.lib | 1 +
tools/objtool/builtin-check.c | 4 +-
tools/objtool/check.c | 82 +++++++++++++++++++++++++
tools/objtool/include/objtool/builtin.h | 1 +
5 files changed, 88 insertions(+), 1 deletion(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 715f0734d065..b753a54e5ea7 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -2224,6 +2224,7 @@ config RELOCATABLE
config X86_PIE
def_bool n
depends on X86_64
+ select OBJTOOL if HAVE_OBJTOOL

config RANDOMIZE_BASE
bool "Randomize the address of the kernel image (KASLR)"
diff --git a/scripts/Makefile.lib b/scripts/Makefile.lib
index 100a386fcd71..e3c804fbc421 100644
--- a/scripts/Makefile.lib
+++ b/scripts/Makefile.lib
@@ -270,6 +270,7 @@ objtool-args-$(CONFIG_HAVE_STATIC_CALL_INLINE) += --static-call
objtool-args-$(CONFIG_HAVE_UACCESS_VALIDATION) += --uaccess
objtool-args-$(CONFIG_GCOV_KERNEL) += --no-unreachable
objtool-args-$(CONFIG_PREFIX_SYMBOLS) += --prefix=$(CONFIG_FUNCTION_PADDING_BYTES)
+objtool-args-$(CONFIG_X86_PIE) += --pie

objtool-args = $(objtool-args-y) \
$(if $(delay-objtool), --link) \
diff --git a/tools/objtool/builtin-check.c b/tools/objtool/builtin-check.c
index 7c175198d09f..1cf1d00464e0 100644
--- a/tools/objtool/builtin-check.c
+++ b/tools/objtool/builtin-check.c
@@ -81,6 +81,7 @@ static const struct option check_options[] = {
OPT_BOOLEAN('t', "static-call", &opts.static_call, "annotate static calls"),
OPT_BOOLEAN('u', "uaccess", &opts.uaccess, "validate uaccess rules for SMAP"),
OPT_BOOLEAN(0 , "cfi", &opts.cfi, "annotate kernel control flow integrity (kCFI) function preambles"),
+ OPT_BOOLEAN(0, "pie", &opts.pie, "validate addressing rules for PIE"),
OPT_CALLBACK_OPTARG(0, "dump", NULL, NULL, "orc", "dump metadata", parse_dump),

OPT_GROUP("Options:"),
@@ -137,7 +138,8 @@ static bool opts_valid(void)
opts.sls ||
opts.stackval ||
opts.static_call ||
- opts.uaccess) {
+ opts.uaccess ||
+ opts.pie) {
if (opts.dump_orc) {
ERROR("--dump can't be combined with other options");
return false;
diff --git a/tools/objtool/check.c b/tools/objtool/check.c
index 5b600bbf2389..d67b80251eec 100644
--- a/tools/objtool/check.c
+++ b/tools/objtool/check.c
@@ -131,6 +131,27 @@ static struct instruction *prev_insn_same_sym(struct objtool_file *file,
for (insn = next_insn_same_sec(file, insn); insn; \
insn = next_insn_same_sec(file, insn))

+static struct instruction *find_insn_containing(struct objtool_file *file,
+ struct section *sec,
+ unsigned long offset)
+{
+ struct instruction *insn;
+
+ insn = find_insn(file, sec, 0);
+ if (!insn)
+ return NULL;
+
+ sec_for_each_insn_from(file, insn) {
+ if (insn->offset > offset)
+ return NULL;
+ if (insn->offset <= offset && (insn->offset + insn->len) > offset)
+ return insn;
+ }
+
+ return NULL;
+}
+
+
static inline struct symbol *insn_call_dest(struct instruction *insn)
{
if (insn->type == INSN_JUMP_DYNAMIC ||
@@ -4529,6 +4550,61 @@ static int validate_reachable_instructions(struct objtool_file *file)
return 0;
}

+static int is_in_pvh_code(struct instruction *insn)
+{
+ struct symbol *sym = insn->sym;
+
+ return sym && !strcmp(sym->name, "pvh_start_xen");
+}
+
+static int validate_pie(struct objtool_file *file)
+{
+ struct section *sec;
+ struct reloc *reloc;
+ struct instruction *insn;
+ int warnings = 0;
+
+ for_each_sec(file, sec) {
+ if (!sec->reloc)
+ continue;
+ if (!(sec->sh.sh_flags & SHF_ALLOC))
+ continue;
+
+ list_for_each_entry(reloc, &sec->reloc->reloc_list, list) {
+ switch (reloc->type) {
+ case R_X86_64_NONE:
+ case R_X86_64_PC32:
+ case R_X86_64_PLT32:
+ case R_X86_64_64:
+ case R_X86_64_PC64:
+ case R_X86_64_GOTPCREL:
+ break;
+ case R_X86_64_32:
+ case R_X86_64_32S:
+ insn = find_insn_containing(file, sec, reloc->offset);
+ if (!insn) {
+ WARN("can't find relocate insn near %s+0x%lx",
+ sec->name, reloc->offset);
+ } else {
+ if (is_in_pvh_code(insn))
+ break;
+ WARN("insn at %s+0x%lx is not compatible with PIE",
+ sec->name, insn->offset);
+ }
+ warnings++;
+ break;
+ default:
+ WARN("unexpected relocation type %d at %s+0x%lx",
+ reloc->type, sec->name, reloc->offset);
+ warnings++;
+ break;
+ }
+ }
+ }
+
+ return warnings;
+}
+
int check(struct objtool_file *file)
{
int ret, warnings = 0;
@@ -4673,6 +4749,12 @@ int check(struct objtool_file *file)
warnings += ret;
}

+ if (opts.pie) {
+ ret = validate_pie(file);
+ if (ret < 0)
+ return ret;
+ warnings += ret;
+ }

if (opts.stats) {
printf("nr_insns_visited: %ld\n", nr_insns_visited);
diff --git a/tools/objtool/include/objtool/builtin.h b/tools/objtool/include/objtool/builtin.h
index 2a108e648b7a..1151211a5cea 100644
--- a/tools/objtool/include/objtool/builtin.h
+++ b/tools/objtool/include/objtool/builtin.h
@@ -26,6 +26,7 @@ struct opts {
bool uaccess;
int prefix;
bool cfi;
+ bool pie;

/* options: */
bool backtrace;
--
2.31.1

2023-04-28 10:06:20

by Hou Wenlong

[permalink] [raw]
Subject: [PATCH RFC 27/43] x86/relocs: Handle PIE relocations

From: Thomas Garnier <[email protected]>

Change the relocation tool to correctly handle relocations generated by
the -fPIE option:

- Add a relocation for each entry of the .got section, given that the
linker does not generate R_X86_64_GLOB_DAT on a simple link.
- Ignore R_X86_64_GOTPCREL.
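
To sketch the effect (an illustration, not the tool's actual code):
each GOT slot holds a symbol's absolute address, so the simulated
R_X86_64_GLOB_DAT entries let the boot-time relocation code adjust the
slots like any other 64-bit absolute reference:

/* Hypothetical sketch of the run-time fixup the emitted relocs enable. */
static void relocate_got_slots(unsigned long *got, unsigned int nr,
                               long load_delta)
{
        unsigned int i;

        for (i = 0; i < nr; i++)
                got[i] += load_delta; /* same handling as an R_X86_64_64 slot */
}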

Signed-off-by: Thomas Garnier <[email protected]>
Signed-off-by: Hou Wenlong <[email protected]>
Cc: Lai Jiangshan <[email protected]>
Cc: Kees Cook <[email protected]>
---
arch/x86/tools/relocs.c | 96 ++++++++++++++++++++++++++++++++++++++++-
1 file changed, 95 insertions(+), 1 deletion(-)

diff --git a/arch/x86/tools/relocs.c b/arch/x86/tools/relocs.c
index 038e9c12fad3..97ac96195232 100644
--- a/arch/x86/tools/relocs.c
+++ b/arch/x86/tools/relocs.c
@@ -42,6 +42,7 @@ struct section {
Elf32_Word *xsymtab;
Elf_Rel *reltab;
char *strtab;
+ Elf_Addr *got;
};
static struct section *secs;

@@ -308,6 +309,36 @@ static Elf_Sym *sym_lookup(const char *symname)
return 0;
}

+static Elf_Sym *sym_lookup_addr(Elf_Addr addr, const char **name)
+{
+ int i;
+
+ for (i = 0; i < ehdr.e_shnum; i++) {
+ struct section *sec = &secs[i];
+ long nsyms;
+ Elf_Sym *symtab;
+ Elf_Sym *sym;
+
+ if (sec->shdr.sh_type != SHT_SYMTAB)
+ continue;
+
+ nsyms = sec->shdr.sh_size/sizeof(Elf_Sym);
+ symtab = sec->symtab;
+
+ for (sym = symtab; --nsyms >= 0; sym++) {
+ if (sym->st_value == addr) {
+ if (name) {
+ *name = sym_name(sec->link->strtab,
+ sym);
+ }
+ return sym;
+ }
+ }
+ }
+ return 0;
+}
+
+
#if BYTE_ORDER == LITTLE_ENDIAN
#define le16_to_cpu(val) (val)
#define le32_to_cpu(val) (val)
@@ -588,6 +619,35 @@ static void read_relocs(FILE *fp)
}
}

+static void read_got(FILE *fp)
+{
+ int i;
+
+ for (i = 0; i < ehdr.e_shnum; i++) {
+ struct section *sec = &secs[i];
+
+ sec->got = NULL;
+ if (sec->shdr.sh_type != SHT_PROGBITS ||
+ strcmp(sec_name(i), ".got")) {
+ continue;
+ }
+ sec->got = malloc(sec->shdr.sh_size);
+ if (!sec->got) {
+ die("malloc of %" FMT " bytes for got failed\n",
+ sec->shdr.sh_size);
+ }
+ if (fseek(fp, sec->shdr.sh_offset, SEEK_SET) < 0) {
+ die("Seek to %" FMT " failed: %s\n",
+ sec->shdr.sh_offset, strerror(errno));
+ }
+ if (fread(sec->got, 1, sec->shdr.sh_size, fp)
+ != sec->shdr.sh_size) {
+ die("Cannot read got: %s\n",
+ strerror(errno));
+ }
+ }
+}
+

static void print_absolute_symbols(void)
{
@@ -718,6 +778,32 @@ static void add_reloc(struct relocs *r, uint32_t offset)
r->offset[r->count++] = offset;
}

+/*
+ * The linker does not generate relocations for the GOT for the kernel.
+ * If a GOT is found, simulate the relocations that should have been included.
+ */
+static void walk_got_table(int (*process)(struct section *sec, Elf_Rel *rel,
+ Elf_Sym *sym, const char *symname),
+ struct section *sec)
+{
+ int i;
+ Elf_Addr entry;
+ Elf_Sym *sym;
+ const char *symname;
+ Elf_Rel rel;
+
+ for (i = 0; i < sec->shdr.sh_size/sizeof(Elf_Addr); i++) {
+ entry = sec->got[i];
+ sym = sym_lookup_addr(entry, &symname);
+ if (!sym)
+ die("Could not found got symbol for entry %d\n", i);
+ rel.r_offset = sec->shdr.sh_addr + i * sizeof(Elf_Addr);
+ rel.r_info = ELF_BITS == 64 ? R_X86_64_GLOB_DAT
+ : R_386_GLOB_DAT;
+ process(sec, &rel, sym, symname);
+ }
+}
+
static void walk_relocs(int (*process)(struct section *sec, Elf_Rel *rel,
Elf_Sym *sym, const char *symname))
{
@@ -731,6 +817,8 @@ static void walk_relocs(int (*process)(struct section *sec, Elf_Rel *rel,
struct section *sec = &secs[i];

if (sec->shdr.sh_type != SHT_REL_TYPE) {
+ if (sec->got)
+ walk_got_table(process, sec);
continue;
}
sec_symtab = sec->link;
@@ -842,6 +930,7 @@ static int do_reloc64(struct section *sec, Elf_Rel *rel, ElfW(Sym) *sym,
offset += per_cpu_load_addr;

switch (r_type) {
+ case R_X86_64_GOTPCREL:
case R_X86_64_NONE:
/* NONE can be ignored. */
break;
@@ -905,7 +994,7 @@ static int do_reloc64(struct section *sec, Elf_Rel *rel, ElfW(Sym) *sym,
* the relocations are processed.
* Make sure that the offset will fit.
*/
- if ((int32_t)offset != (int64_t)offset)
+ if (r_type != R_X86_64_64 && (int32_t)offset != (int64_t)offset)
die("Relocation offset doesn't fit in 32 bits\n");

if (r_type == R_X86_64_64)
@@ -914,6 +1003,10 @@ static int do_reloc64(struct section *sec, Elf_Rel *rel, ElfW(Sym) *sym,
add_reloc(&relocs32, offset);
break;

+ case R_X86_64_GLOB_DAT:
+ add_reloc(&relocs64, offset);
+ break;
+
default:
die("Unsupported relocation type: %s (%d)\n",
rel_type(r_type), r_type);
@@ -1188,6 +1281,7 @@ void process(FILE *fp, int use_real_mode, int as_text,
read_strtabs(fp);
read_symtabs(fp);
read_relocs(fp);
+ read_got(fp);
if (ELF_BITS == 64)
percpu_init();
if (show_absolute_syms) {
--
2.31.1

2023-04-28 10:06:35

by Hou Wenlong

[permalink] [raw]
Subject: [PATCH RFC 37/43] x86/xen: Pin up to VSYSCALL_ADDR when vsyscall page is out of fixmap area

If the vsyscall page is moved out of the fixmap area, FIXADDR_TOP would
be below the vsyscall page. So pin up to VSYSCALL_ADDR when vsyscall is
enabled.

Suggested-by: Lai Jiangshan <[email protected]>
Signed-off-by: Hou Wenlong <[email protected]>
Cc: Thomas Garnier <[email protected]>
Cc: Kees Cook <[email protected]>
---
arch/x86/xen/mmu_pv.c | 12 +++++++++---
1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/arch/x86/xen/mmu_pv.c b/arch/x86/xen/mmu_pv.c
index a59bc013ee5b..28392f3478a0 100644
--- a/arch/x86/xen/mmu_pv.c
+++ b/arch/x86/xen/mmu_pv.c
@@ -587,6 +587,12 @@ static void xen_p4d_walk(struct mm_struct *mm, p4d_t *p4d,
xen_pud_walk(mm, pud, func, last, limit);
}

+#ifdef CONFIG_X86_VSYSCALL_EMULATION
+#define __KERNEL_MAP_TOP (VSYSCALL_ADDR + PAGE_SIZE)
+#else
+#define __KERNEL_MAP_TOP FIXADDR_TOP
+#endif
+
/*
* (Yet another) pagetable walker. This one is intended for pinning a
* pagetable. This means that it walks a pagetable and calls the
@@ -594,7 +600,7 @@ static void xen_p4d_walk(struct mm_struct *mm, p4d_t *p4d,
* at every level. It walks the entire pagetable, but it only bothers
* pinning pte pages which are below limit. In the normal case this
* will be STACK_TOP_MAX, but at boot we need to pin up to
- * FIXADDR_TOP.
+ * __KERNEL_MAP_TOP.
*
* We must skip the Xen hole in the middle of the address space, just after
* the big x86-64 virtual hole.
@@ -609,7 +615,7 @@ static void __xen_pgd_walk(struct mm_struct *mm, pgd_t *pgd,

/* The limit is the last byte to be touched */
limit--;
- BUG_ON(limit >= FIXADDR_TOP);
+ BUG_ON(limit >= __KERNEL_MAP_TOP);

/*
* 64-bit has a great big hole in the middle of the address
@@ -797,7 +803,7 @@ static void __init xen_after_bootmem(void)
#ifdef CONFIG_X86_VSYSCALL_EMULATION
SetPagePinned(virt_to_page(level3_user_vsyscall));
#endif
- xen_pgd_walk(&init_mm, xen_mark_pinned, FIXADDR_TOP);
+ xen_pgd_walk(&init_mm, xen_mark_pinned, __KERNEL_MAP_TOP);
}

static void xen_unpin_page(struct mm_struct *mm, struct page *page,
--
2.31.1

2023-04-28 10:08:52

by Hou Wenlong

[permalink] [raw]
Subject: [PATCH RFC 41/43] x86/mm: Sort address_markers array when X86 PIE is enabled

When X86 PIE is enabled, the kernel image is allowed to be relocated
anywhere in the top 512G, so the kernel image address could be below the
EFI range. Sort the address_markers array to keep the markers in
ascending order.

Signed-off-by: Hou Wenlong <[email protected]>
Cc: Thomas Garnier <[email protected]>
Cc: Lai Jiangshan <[email protected]>
Cc: Kees Cook <[email protected]>
---
arch/x86/mm/dump_pagetables.c | 24 ++++++++++++++++++++++++
1 file changed, 24 insertions(+)

diff --git a/arch/x86/mm/dump_pagetables.c b/arch/x86/mm/dump_pagetables.c
index df1a708a038a..81aa1c0b39cc 100644
--- a/arch/x86/mm/dump_pagetables.c
+++ b/arch/x86/mm/dump_pagetables.c
@@ -17,6 +17,7 @@
#include <linux/highmem.h>
#include <linux/pci.h>
#include <linux/ptdump.h>
+#include <linux/sort.h>

#include <asm/e820/types.h>

@@ -436,6 +437,27 @@ void ptdump_walk_pgd_level_checkwx(void)
ptdump_walk_pgd_level_core(NULL, &init_mm, INIT_PGD, true, false);
}

+#ifdef CONFIG_X86_PIE
+static int __init address_markers_sort_cmp(const void *pa, const void *pb)
+{
+ struct addr_marker *a = (struct addr_marker *)pa;
+ struct addr_marker *b = (struct addr_marker *)pb;
+
+ return (a->start_address > b->start_address) -
+ (a->start_address < b->start_address);
+}
+
+static void __init address_markers_sort(void)
+{
+ sort(&address_markers[0], ARRAY_SIZE(address_markers), sizeof(address_markers[0]),
+ address_markers_sort_cmp, NULL);
+}
+#else
+static void __init address_markers_sort(void)
+{
+}
+#endif
+
static int __init pt_dump_init(void)
{
/*
@@ -467,6 +489,8 @@ static int __init pt_dump_init(void)
address_markers[LDT_NR].start_address = LDT_BASE_ADDR;
# endif
#endif
+ address_markers_sort();
+
return 0;
}
__initcall(pt_dump_init);
--
2.31.1

2023-04-28 10:17:06

by Hou Wenlong

[permalink] [raw]
Subject: [PATCH RFC 40/43] x86/boot: Fill kernel image puds dynamically

A PIE kernel could be randomized to any address. Later, the kernel
image will be allowed to move below the top 2G, so fill the kernel
image PUDs dynamically.

Signed-off-by: Hou Wenlong <[email protected]>
Cc: Thomas Garnier <[email protected]>
Cc: Lai Jiangshan <[email protected]>
Cc: Kees Cook <[email protected]>
---
arch/x86/kernel/head64.c | 14 ++++++++++++--
1 file changed, 12 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/head64.c b/arch/x86/kernel/head64.c
index 8295b547b64f..c5cd61aab8ae 100644
--- a/arch/x86/kernel/head64.c
+++ b/arch/x86/kernel/head64.c
@@ -239,8 +239,18 @@ unsigned long __head __startup_64(unsigned long physaddr,
}

pud = fixup_pointer(&level3_kernel_pgt, physaddr);
- pud[510] += load_delta;
- pud[511] += load_delta;
+ if (IS_ENABLED(CONFIG_X86_PIE)) {
+ pud[510] = 0;
+ pud[511] = 0;
+
+ i = pud_index(text_base);
+ pgtable_flags = _KERNPG_TABLE_NOENC - __START_KERNEL_map + load_delta;
+ pud[i] = pgtable_flags + SYM_ABS_VAL(level2_kernel_pgt);
+ pud[i + 1] = pgtable_flags + SYM_ABS_VAL(level2_fixmap_pgt);
+ } else {
+ pud[510] += load_delta;
+ pud[511] += load_delta;
+ }

pmd = fixup_pointer(level2_fixmap_pgt, physaddr);
for (i = FIXMAP_PMD_TOP; i > FIXMAP_PMD_TOP - FIXMAP_PMD_NUM; i--)
--
2.31.1

2023-04-28 10:17:38

by Hou Wenlong

[permalink] [raw]
Subject: [PATCH RFC 22/43] x86/ftrace: Adapt ftrace nop patching for PIE support

From: Thomas Garnier <[email protected]>

When using PIE with function tracing, the compiler generates a
call through the GOT (call *__fentry__@GOTPCREL). This instruction
takes 6 bytes instead of the 5 bytes of a relative call, and the
-mnop-mcount option is not implemented for -fPIE yet.

If PIE is enabled, replace the 6th byte of the GOT call with a 1-byte nop
so ftrace can handle the leading 5 bytes as before.
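
A byte-level sketch of the sequences involved (relative offsets elided):

/*
 *   compiler output:  call *__fentry__@GOTPCREL(%rip)
 *                       ff 15 xx xx xx xx            (6 bytes)
 *
 *   after nop patch:    0f 1f 44 00 00               (5-byte nop)
 *                       90                           (1-byte nop)
 *
 *   tracing enabled:    e8 xx xx xx xx               (5-byte call)
 *                       90                           (1-byte nop)
 *
 * The usual 5-byte ftrace patch site is thus preserved at the same
 * offset; only the trailing byte differs from a non-PIE build.
 */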

[Hou Wenlong: Adapt code change and fix wrong offset calculation in
make_nop_x86()]

Signed-off-by: Thomas Garnier <[email protected]>
Co-developed-by: Hou Wenlong <[email protected]>
Signed-off-by: Hou Wenlong <[email protected]>
Cc: Lai Jiangshan <[email protected]>
Cc: Kees Cook <[email protected]>
---
arch/x86/kernel/ftrace.c | 46 ++++++++++++++++++++++-
scripts/recordmcount.c | 81 ++++++++++++++++++++++++++--------------
2 files changed, 98 insertions(+), 29 deletions(-)

diff --git a/arch/x86/kernel/ftrace.c b/arch/x86/kernel/ftrace.c
index 5e7ead52cfdb..b795f9dde561 100644
--- a/arch/x86/kernel/ftrace.c
+++ b/arch/x86/kernel/ftrace.c
@@ -124,6 +124,50 @@ ftrace_modify_code_direct(unsigned long ip, const char *old_code,
return 0;
}

+/* Bytes before call GOT offset */
+static const unsigned char got_call_preinsn[] = { 0xff, 0x15 };
+
+static int __ref
+ftrace_modify_initial_code(unsigned long ip, unsigned const char *old_code,
+ unsigned const char *new_code)
+{
+ unsigned char replaced[MCOUNT_INSN_SIZE + 1];
+
+ /*
+ * If PIE is not enabled default to the original approach to code
+ * modification.
+ */
+ if (!IS_ENABLED(CONFIG_X86_PIE))
+ return ftrace_modify_code_direct(ip, old_code, new_code);
+
+ ftrace_expected = old_code;
+
+ /* Ensure the instructions point to a call to the GOT */
+ if (copy_from_kernel_nofault(replaced, (void *)ip, sizeof(replaced))) {
+ WARN_ONCE(1, "invalid function");
+ return -EFAULT;
+ }
+
+ if (memcmp(replaced, got_call_preinsn, sizeof(got_call_preinsn))) {
+ WARN_ONCE(1, "invalid function call");
+ return -EINVAL;
+ }
+
+ /*
+ * Build a nop slide with a 5-byte nop and 1-byte nop to keep the ftrace
+ * hooking algorithm working with the expected 5 bytes instruction.
+ */
+ memset(replaced, x86_nops[1][0], sizeof(replaced));
+ memcpy(replaced, new_code, MCOUNT_INSN_SIZE);
+
+ /* replace the text with the new text */
+ if (ftrace_poke_late)
+ text_poke_queue((void *)ip, replaced, MCOUNT_INSN_SIZE + 1, NULL);
+ else
+ text_poke_early((void *)ip, replaced, MCOUNT_INSN_SIZE + 1);
+ return 0;
+}
+
int ftrace_make_nop(struct module *mod, struct dyn_ftrace *rec, unsigned long addr)
{
unsigned long ip = rec->ip;
@@ -141,7 +185,7 @@ int ftrace_make_nop(struct module *mod, struct dyn_ftrace *rec, unsigned long ad
* just modify the code directly.
*/
if (addr == MCOUNT_ADDR)
- return ftrace_modify_code_direct(ip, old, new);
+ return ftrace_modify_initial_code(ip, old, new);

/*
* x86 overrides ftrace_replace_code -- this function will never be used
diff --git a/scripts/recordmcount.c b/scripts/recordmcount.c
index e30216525325..02783a29d428 100644
--- a/scripts/recordmcount.c
+++ b/scripts/recordmcount.c
@@ -218,36 +218,10 @@ static void *mmap_file(char const *fname)
return file_map;
}

-
-static unsigned char ideal_nop5_x86_64[5] = { 0x0f, 0x1f, 0x44, 0x00, 0x00 };
-static unsigned char ideal_nop5_x86_32[5] = { 0x3e, 0x8d, 0x74, 0x26, 0x00 };
-static unsigned char *ideal_nop;
-
static char rel_type_nop;
-
static int (*make_nop)(void *map, size_t const offset);

-static int make_nop_x86(void *map, size_t const offset)
-{
- uint32_t *ptr;
- unsigned char *op;
-
- /* Confirm we have 0xe8 0x0 0x0 0x0 0x0 */
- ptr = map + offset;
- if (*ptr != 0)
- return -1;
-
- op = map + offset - 1;
- if (*op != 0xe8)
- return -1;
-
- /* convert to nop */
- if (ulseek(offset - 1, SEEK_SET) < 0)
- return -1;
- if (uwrite(ideal_nop, 5) < 0)
- return -1;
- return 0;
-}
+static unsigned char *ideal_nop;

static unsigned char ideal_nop4_arm_le[4] = { 0x00, 0x00, 0xa0, 0xe1 }; /* mov r0, r0 */
static unsigned char ideal_nop4_arm_be[4] = { 0xe1, 0xa0, 0x00, 0x00 }; /* mov r0, r0 */
@@ -504,6 +478,50 @@ static void MIPS64_r_info(Elf64_Rel *const rp, unsigned sym, unsigned type)
}).r_info;
}

+static unsigned char ideal_nop5_x86_64[5] = { 0x0f, 0x1f, 0x44, 0x00, 0x00 };
+static unsigned char ideal_nop6_x86_64[6] = { 0x66, 0x0f, 0x1f, 0x44, 0x00, 0x00 };
+static unsigned char ideal_nop5_x86_32[5] = { 0x3e, 0x8d, 0x74, 0x26, 0x00 };
+static size_t ideal_nop_x86_size;
+
+static unsigned char stub_default_x86[2] = { 0xe8, 0x00 }; /* call relative */
+static unsigned char stub_got_x86[3] = { 0xff, 0x15, 0x00 }; /* call .got */
+static unsigned char *stub_x86;
+static size_t stub_x86_size;
+
+static int make_nop_x86(void *map, size_t const offset)
+{
+ uint32_t *ptr;
+ size_t stub_offset = offset + 1 - stub_x86_size;
+
+ /* Confirm we have the expected stub */
+ ptr = map + stub_offset;
+ if (memcmp(ptr, stub_x86, stub_x86_size))
+ return -1;
+
+ /* convert to nop */
+ if (ulseek(stub_offset, SEEK_SET) < 0)
+ return -1;
+ if (uwrite(ideal_nop, ideal_nop_x86_size) < 0)
+ return -1;
+ return 0;
+}
+
+/* Swap the stub and nop for a got call if the binary is built with PIE */
+static int is_fake_mcount_x86_x64(Elf64_Rel const *rp)
+{
+ if (ELF64_R_TYPE(rp->r_info) == R_X86_64_GOTPCREL) {
+ ideal_nop = ideal_nop6_x86_64;
+ ideal_nop_x86_size = sizeof(ideal_nop6_x86_64);
+ stub_x86 = stub_got_x86;
+ stub_x86_size = sizeof(stub_got_x86);
+ mcount_adjust_64 = 1 - stub_x86_size;
+ }
+
+ /* Once the relocation was checked, rollback to default */
+ is_fake_mcount64 = fn_is_fake_mcount64;
+ return is_fake_mcount64(rp);
+}
+
static int do_file(char const *const fname)
{
unsigned int reltype = 0;
@@ -568,6 +586,9 @@ static int do_file(char const *const fname)
rel_type_nop = R_386_NONE;
make_nop = make_nop_x86;
ideal_nop = ideal_nop5_x86_32;
+ ideal_nop_x86_size = sizeof(ideal_nop5_x86_32);
+ stub_x86 = stub_default_x86;
+ stub_x86_size = sizeof(stub_default_x86);
mcount_adjust_32 = -1;
gpfx = 0;
break;
@@ -597,9 +618,13 @@ static int do_file(char const *const fname)
case EM_X86_64:
make_nop = make_nop_x86;
ideal_nop = ideal_nop5_x86_64;
+ ideal_nop_x86_size = sizeof(ideal_nop5_x86_64);
+ stub_x86 = stub_default_x86;
+ stub_x86_size = sizeof(stub_default_x86);
reltype = R_X86_64_64;
rel_type_nop = R_X86_64_NONE;
- mcount_adjust_64 = -1;
+ is_fake_mcount64 = is_fake_mcount_x86_x64;
+ mcount_adjust_64 = 1 - stub_x86_size;
gpfx = 0;
break;
} /* end switch */
--
2.31.1

2023-04-28 10:18:33

by Hou Wenlong

[permalink] [raw]
Subject: [PATCH RFC 35/43] x86/pie: Build the kernel as PIE

The kernel is currently built with the -mcmodel=kernel option, which
forces it to stay in the top 2G of the virtual address space. For PIE,
use the -fPIE option to build the kernel as a Position Independent
Executable (PIE), which uses RIP-relative addressing and can be moved
below the top 2G.

The --emit-relocs linker option was kept instead of using -pie to limit
the impact on mapped sections. Any incompatible relocation will be caught
by objtool at compile time.
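
For background, a sketch of the linker relaxation these flags are
assumed to disable:

/*
 * With the newer R_X86_64_GOTPCRELX relocations, the linker may relax
 * a GOT load into a direct RIP-relative reference:
 *
 *     movq foo@GOTPCREL(%rip), %rax   ->   leaq foo(%rip), %rax
 *
 * The kernel's relocs tool only simulates plain GOT entries, so
 * -Wa,-mrelax-relocations=no keeps the assembler from emitting the
 * relaxable relocation types, and --no-relax keeps the linker from
 * rewriting any that remain.
 */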

Suggested-by: Lai Jiangshan <[email protected]>
Signed-off-by: Hou Wenlong <[email protected]>
Cc: Thomas Garnier <[email protected]>
Cc: Kees Cook <[email protected]>
---
arch/x86/Kconfig | 8 ++++++--
arch/x86/Makefile | 9 +++++++--
2 files changed, 13 insertions(+), 4 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 5ac5f335855e..9f8020991184 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -2222,10 +2222,14 @@ config RELOCATABLE
(CONFIG_PHYSICAL_START) is used as the minimum location.

config X86_PIE
- def_bool n
- depends on X86_64
+ bool "Build a PIE kernel"
+ default n
+ depends on X86_64 && !XEN
select OBJTOOL if HAVE_OBJTOOL
select DYNAMIC_FTRACE if FUNCTION_TRACER && RETPOLINE
+ help
+ This builds a PIE kernel image that can be placed at any
+ virtual address.

config RANDOMIZE_BASE
bool "Randomize the address of the kernel image (KASLR)"
diff --git a/arch/x86/Makefile b/arch/x86/Makefile
index 81500011396d..6631974e2003 100644
--- a/arch/x86/Makefile
+++ b/arch/x86/Makefile
@@ -160,10 +160,15 @@ else
KBUILD_CFLAGS += -mno-red-zone

ifdef CONFIG_X86_PIE
- PIE_CFLAGS := -include $(srctree)/include/linux/hidden.h
+ PIE_CFLAGS := -fPIE -include $(srctree)/include/linux/hidden.h
KBUILD_CFLAGS += $(PIE_CFLAGS)
-endif
+ # Disable relocation relaxation in CFLAGS and LDFLAGS to support older toolchains
+ KBUILD_CFLAGS += $(call cc-option,-Wa$(comma)-mrelax-relocations=no)
+ LDFLAGS_vmlinux += $(call ld-option,--no-relax)
+ KBUILD_LDFLAGS_MODULE += $(call ld-option,--no-relax)
+else
KBUILD_CFLAGS += -mcmodel=kernel
+endif
KBUILD_RUSTFLAGS += -Cno-redzone=y
KBUILD_RUSTFLAGS += -Ccode-model=kernel

--
2.31.1

2023-04-28 10:38:23

by Christophe Leroy

[permalink] [raw]
Subject: Re: [PATCH RFC 33/43] objtool: Add validation for x86 PIE support



On 28/04/2023 at 11:51, Hou Wenlong wrote:
> [You don't often get email from [email protected]. Learn why this is important at https://aka.ms/LearnAboutSenderIdentification ]
>
> For x86 PIE binary, only RIP-relative addressing is allowed, however,
> there are still a little absolute references of R_X86_64_64 relocation
> type for data section and a little absolute references of R_X86_64_32S
> relocation type in pvh_start_xen() function.
>
> Suggested-by: Lai Jiangshan <[email protected]>
> Signed-off-by: Hou Wenlong <[email protected]>
> Cc: Thomas Garnier <[email protected]>
> Cc: Kees Cook <[email protected]>
> ---
> arch/x86/Kconfig | 1 +
> scripts/Makefile.lib | 1 +
> tools/objtool/builtin-check.c | 4 +-
> tools/objtool/check.c | 82 +++++++++++++++++++++++++
> tools/objtool/include/objtool/builtin.h | 1 +
> 5 files changed, 88 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index 715f0734d065..b753a54e5ea7 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -2224,6 +2224,7 @@ config RELOCATABLE
> config X86_PIE
> def_bool n
> depends on X86_64
> + select OBJTOOL if HAVE_OBJTOOL
>
> config RANDOMIZE_BASE
> bool "Randomize the address of the kernel image (KASLR)"
> diff --git a/scripts/Makefile.lib b/scripts/Makefile.lib
> index 100a386fcd71..e3c804fbc421 100644
> --- a/scripts/Makefile.lib
> +++ b/scripts/Makefile.lib
> @@ -270,6 +270,7 @@ objtool-args-$(CONFIG_HAVE_STATIC_CALL_INLINE) += --static-call
> objtool-args-$(CONFIG_HAVE_UACCESS_VALIDATION) += --uaccess
> objtool-args-$(CONFIG_GCOV_KERNEL) += --no-unreachable
> objtool-args-$(CONFIG_PREFIX_SYMBOLS) += --prefix=$(CONFIG_FUNCTION_PADDING_BYTES)
> +objtool-args-$(CONFIG_X86_PIE) += --pie
>
> objtool-args = $(objtool-args-y) \
> $(if $(delay-objtool), --link) \
> diff --git a/tools/objtool/builtin-check.c b/tools/objtool/builtin-check.c
> index 7c175198d09f..1cf1d00464e0 100644
> --- a/tools/objtool/builtin-check.c
> +++ b/tools/objtool/builtin-check.c
> @@ -81,6 +81,7 @@ static const struct option check_options[] = {
> OPT_BOOLEAN('t', "static-call", &opts.static_call, "annotate static calls"),
> OPT_BOOLEAN('u', "uaccess", &opts.uaccess, "validate uaccess rules for SMAP"),
> OPT_BOOLEAN(0 , "cfi", &opts.cfi, "annotate kernel control flow integrity (kCFI) function preambles"),
> + OPT_BOOLEAN(0, "pie", &opts.pie, "validate addressing rules for PIE"),
> OPT_CALLBACK_OPTARG(0, "dump", NULL, NULL, "orc", "dump metadata", parse_dump),
>
> OPT_GROUP("Options:"),
> @@ -137,7 +138,8 @@ static bool opts_valid(void)
> opts.sls ||
> opts.stackval ||
> opts.static_call ||
> - opts.uaccess) {
> + opts.uaccess ||
> + opts.pie) {
> if (opts.dump_orc) {
> ERROR("--dump can't be combined with other options");
> return false;
> diff --git a/tools/objtool/check.c b/tools/objtool/check.c
> index 5b600bbf2389..d67b80251eec 100644
> --- a/tools/objtool/check.c
> +++ b/tools/objtool/check.c
> @@ -131,6 +131,27 @@ static struct instruction *prev_insn_same_sym(struct objtool_file *file,
> for (insn = next_insn_same_sec(file, insn); insn; \
> insn = next_insn_same_sec(file, insn))
>
> +static struct instruction *find_insn_containing(struct objtool_file *file,
> + struct section *sec,
> + unsigned long offset)
> +{
> + struct instruction *insn;
> +
> + insn = find_insn(file, sec, 0);
> + if (!insn)
> + return NULL;
> +
> + sec_for_each_insn_from(file, insn) {
> + if (insn->offset > offset)
> + return NULL;
> + if (insn->offset <= offset && (insn->offset + insn->len) > offset)
> + return insn;
> + }
> +
> + return NULL;
> +}
> +
> +
> static inline struct symbol *insn_call_dest(struct instruction *insn)
> {
> if (insn->type == INSN_JUMP_DYNAMIC ||
> @@ -4529,6 +4550,61 @@ static int validate_reachable_instructions(struct objtool_file *file)
> return 0;
> }
>
> +static int is_in_pvh_code(struct instruction *insn)
> +{
> + struct symbol *sym = insn->sym;
> +
> + return sym && !strcmp(sym->name, "pvh_start_xen");
> +}
> +
> +static int validate_pie(struct objtool_file *file)
> +{
> + struct section *sec;
> + struct reloc *reloc;
> + struct instruction *insn;
> + int warnings = 0;
> +
> + for_each_sec(file, sec) {
> + if (!sec->reloc)
> + continue;
> + if (!(sec->sh.sh_flags & SHF_ALLOC))
> + continue;
> +
> + list_for_each_entry(reloc, &sec->reloc->reloc_list, list) {
> + switch (reloc->type) {
> + case R_X86_64_NONE:
> + case R_X86_64_PC32:
> + case R_X86_64_PLT32:
> + case R_X86_64_64:
> + case R_X86_64_PC64:
> + case R_X86_64_GOTPCREL:
> + break;
> + case R_X86_64_32:
> + case R_X86_64_32S:

That looks very specific to x86; should it go somewhere else?

If it can work for any architecture, can you add generic macros, just
like commit c1449735211d ("objtool: Use macros to define arch specific
reloc types") then commit c984aef8c832 ("objtool/powerpc: Add --mcount
specific implementation") ?

> + insn = find_insn_containing(file, sec, reloc->offset);
> + if (!insn) {
> + WARN("can't find relocate insn near %s+0x%lx",
> + sec->name, reloc->offset);
> + } else {
> + if (is_in_pvh_code(insn))
> + break;
> + WARN("insn at %s+0x%lx is not compatible with PIE",
> + sec->name, insn->offset);
> + }
> + warnings++;
> + break;
> + default:
> + WARN("unexpected relocation type %d at %s+0x%lx",
> + reloc->type, sec->name, reloc->offset);
> + warnings++;
> + break;
> + }
> + }
> + }
> +
> + return warnings;
> +}
> +
> int check(struct objtool_file *file)
> {
> int ret, warnings = 0;
> @@ -4673,6 +4749,12 @@ int check(struct objtool_file *file)
> warnings += ret;
> }
>
> + if (opts.pie) {
> + ret = validate_pie(file);
> + if (ret < 0)
> + return ret;
> + warnings += ret;
> + }
>
> if (opts.stats) {
> printf("nr_insns_visited: %ld\n", nr_insns_visited);
> diff --git a/tools/objtool/include/objtool/builtin.h b/tools/objtool/include/objtool/builtin.h
> index 2a108e648b7a..1151211a5cea 100644
> --- a/tools/objtool/include/objtool/builtin.h
> +++ b/tools/objtool/include/objtool/builtin.h
> @@ -26,6 +26,7 @@ struct opts {
> bool uaccess;
> int prefix;
> bool cfi;
> + bool pie;
>
> /* options: */
> bool backtrace;
> --
> 2.31.1
>

2023-04-28 11:38:09

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [PATCH RFC 07/43] x86/acpi: Adapt assembly for PIE support

On Fri, Apr 28, 2023 at 11:52 AM Hou Wenlong
<[email protected]> wrote:
>
> From: Thomas Garnier <[email protected]>
>
> Change the assembly code to use only relative references of symbols for the
> kernel to be PIE compatible.
>
> Signed-off-by: Thomas Garnier <[email protected]>
> Signed-off-by: Hou Wenlong <[email protected]>
> Cc: Lai Jiangshan <[email protected]>
> Cc: Kees Cook <[email protected]>

Acked-by: Rafael J. Wysocki <[email protected]>

> ---
> arch/x86/kernel/acpi/wakeup_64.S | 31 ++++++++++++++++---------------
> 1 file changed, 16 insertions(+), 15 deletions(-)
>
> diff --git a/arch/x86/kernel/acpi/wakeup_64.S b/arch/x86/kernel/acpi/wakeup_64.S
> index d5d8a352eafa..fe688bd87d72 100644
> --- a/arch/x86/kernel/acpi/wakeup_64.S
> +++ b/arch/x86/kernel/acpi/wakeup_64.S
> @@ -17,7 +17,7 @@
> * Hooray, we are in Long 64-bit mode (but still running in low memory)
> */
> SYM_FUNC_START(wakeup_long64)
> - movq saved_magic, %rax
> + movq saved_magic(%rip), %rax
> movq $0x123456789abcdef0, %rdx
> cmpq %rdx, %rax
> je 2f
> @@ -33,14 +33,14 @@ SYM_FUNC_START(wakeup_long64)
> movw %ax, %es
> movw %ax, %fs
> movw %ax, %gs
> - movq saved_rsp, %rsp
> + movq saved_rsp(%rip), %rsp
>
> - movq saved_rbx, %rbx
> - movq saved_rdi, %rdi
> - movq saved_rsi, %rsi
> - movq saved_rbp, %rbp
> + movq saved_rbx(%rip), %rbx
> + movq saved_rdi(%rip), %rdi
> + movq saved_rsi(%rip), %rsi
> + movq saved_rbp(%rip), %rbp
>
> - movq saved_rip, %rax
> + movq saved_rip(%rip), %rax
> ANNOTATE_RETPOLINE_SAFE
> jmp *%rax
> SYM_FUNC_END(wakeup_long64)
> @@ -51,7 +51,7 @@ SYM_FUNC_START(do_suspend_lowlevel)
> xorl %eax, %eax
> call save_processor_state
>
> - movq $saved_context, %rax
> + leaq saved_context(%rip), %rax
> movq %rsp, pt_regs_sp(%rax)
> movq %rbp, pt_regs_bp(%rax)
> movq %rsi, pt_regs_si(%rax)
> @@ -70,13 +70,14 @@ SYM_FUNC_START(do_suspend_lowlevel)
> pushfq
> popq pt_regs_flags(%rax)
>
> - movq $.Lresume_point, saved_rip(%rip)
> + leaq .Lresume_point(%rip), %rax
> + movq %rax, saved_rip(%rip)
>
> - movq %rsp, saved_rsp
> - movq %rbp, saved_rbp
> - movq %rbx, saved_rbx
> - movq %rdi, saved_rdi
> - movq %rsi, saved_rsi
> + movq %rsp, saved_rsp(%rip)
> + movq %rbp, saved_rbp(%rip)
> + movq %rbx, saved_rbx(%rip)
> + movq %rdi, saved_rdi(%rip)
> + movq %rsi, saved_rsi(%rip)
>
> addq $8, %rsp
> movl $3, %edi
> @@ -88,7 +89,7 @@ SYM_FUNC_START(do_suspend_lowlevel)
> .align 4
> .Lresume_point:
> /* We don't restore %rax, it must be 0 anyway */
> - movq $saved_context, %rax
> + leaq saved_context(%rip), %rax
> movq saved_context_cr4(%rax), %rbx
> movq %rbx, %cr4
> movq saved_context_cr3(%rax), %rbx
> --
> 2.31.1
>

2023-04-28 11:50:05

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH RFC 33/43] objtool: Add validation for x86 PIE support

On Fri, Apr 28, 2023 at 10:28:19AM +0000, Christophe Leroy wrote:


> > diff --git a/tools/objtool/check.c b/tools/objtool/check.c
> > index 5b600bbf2389..d67b80251eec 100644
> > --- a/tools/objtool/check.c
> > +++ b/tools/objtool/check.c
> > @@ -131,6 +131,27 @@ static struct instruction *prev_insn_same_sym(struct objtool_file *file,
> > for (insn = next_insn_same_sec(file, insn); insn; \
> > insn = next_insn_same_sec(file, insn))
> >
> > +static struct instruction *find_insn_containing(struct objtool_file *file,
> > + struct section *sec,
> > + unsigned long offset)
> > +{
> > + struct instruction *insn;
> > +
> > + insn = find_insn(file, sec, 0);
> > + if (!insn)
> > + return NULL;
> > +
> > + sec_for_each_insn_from(file, insn) {
> > + if (insn->offset > offset)
> > + return NULL;
> > + if (insn->offset <= offset && (insn->offset + insn->len) > offset)
> > + return insn;
> > + }
> > +
> > + return NULL;
> > +}

Urgh, this is horrendous crap. Yes you're only using it in case of a
warning, but adding a function like this makes it appear like it's
actually sane to use.

A far better implementation -- but still not stellar -- would be
something like:

sym = find_symbol_containing(sec, offset);
if (!sym)
// fail
sym_for_each_insn(file, sym, insn) {
...
}

But given insn_hash uses sec_offset_hash() you can do something similar
to find_reloc_by_dest_range()

start = offset - (INSN_MAX_SIZE - 1);
for_offset_range(o, start, start + INSN_MAX_SIZE) {
hash_for_each_possible(file->insn_hash, insn, hash, sec_offset_hash(sec, o)) {
if (insn->sec != sec)
continue;

if (insn->offset <= offset &&
insn->offset + insn->len > offset)
return insn;
}
}
return NULL;

> > +
> > +
> > static inline struct symbol *insn_call_dest(struct instruction *insn)
> > {
> > if (insn->type == INSN_JUMP_DYNAMIC ||
> > @@ -4529,6 +4550,61 @@ static int validate_reachable_instructions(struct objtool_file *file)
> > return 0;
> > }
> >
> > +static int is_in_pvh_code(struct instruction *insn)
> > +{
> > + struct symbol *sym = insn->sym;
> > +
> > + return sym && !strcmp(sym->name, "pvh_start_xen");
> > +}
> > +
> > +static int validate_pie(struct objtool_file *file)
> > +{
> > + struct section *sec;
> > + struct reloc *reloc;
> > + struct instruction *insn;
> > + int warnings = 0;
> > +
> > + for_each_sec(file, sec) {
> > + if (!sec->reloc)
> > + continue;
> > + if (!(sec->sh.sh_flags & SHF_ALLOC))
> > + continue;
> > +
> > + list_for_each_entry(reloc, &sec->reloc->reloc_list, list) {
> > + switch (reloc->type) {
> > + case R_X86_64_NONE:
> > + case R_X86_64_PC32:
> > + case R_X86_64_PLT32:
> > + case R_X86_64_64:
> > + case R_X86_64_PC64:
> > + case R_X86_64_GOTPCREL:
> > + break;
> > + case R_X86_64_32:
> > + case R_X86_64_32S:
>
> That looks very specific to X86, should it go at another place ?
>
> If it can work for any architecture, can you add generic macros, just
> like commit c1449735211d ("objtool: Use macros to define arch specific
> reloc types") then commit c984aef8c832 ("objtool/powerpc: Add --mcount
> specific implementation") ?

Yes, this should be something like arch_PIE_reloc() or so. Similar to
arch_pc_relative_reloc().
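
A minimal sketch of such a hook (names assumed, following the
arch-macro pattern of those commits rather than any existing objtool
API):

/* tools/objtool/include/objtool/arch.h (sketch) */
bool arch_pie_safe_reloc(struct reloc *reloc);

/* tools/objtool/arch/x86/decode.c (sketch) */
bool arch_pie_safe_reloc(struct reloc *reloc)
{
        switch (reloc->type) {
        case R_X86_64_NONE:
        case R_X86_64_PC32:
        case R_X86_64_PLT32:
        case R_X86_64_64:
        case R_X86_64_PC64:
        case R_X86_64_GOTPCREL:
                return true;
        default:
                return false;
        }
}

The generic validate_pie() would then only need to warn when the hook
returns false.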

2023-04-28 13:49:28

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH RFC 22/43] x86/ftrace: Adapt ftrace nop patching for PIE support

On Fri, 28 Apr 2023 17:51:02 +0800
"Hou Wenlong" <[email protected]> wrote:

> From: Thomas Garnier <[email protected]>
>
> When using PIE with function tracing, the compiler generates a
> call through the GOT (call *__fentry__@GOTPCREL). This instruction
> takes 6-bytes instead of 5-bytes with a relative call. And -mnop-mcount
> option is not implemented for -fPIE now.
>
> If PIE is enabled, replace the 6th byte of the GOT call by a 1-byte nop
> so ftrace can handle the previous 5-bytes as before.

Wait! This won't work!

You can't just append another nop to fill in the blanks here. We must
either have a single 6 byte nop, or we need to refactor the entire logic to
something that other archs have.

The two nops mean that the CPU can treat them as two separate instructions.
There's nothing stopping the machine from preempting a task between the
two. If that happens, and you replace the 1-byte nop and 5-byte nop with a
single 6-byte instruction, then when the task gets rescheduled, it will
execute the last 5 bytes of that 6-byte instruction, take a general
protection fault, and likely crash the machine.
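
To sketch the hazard (nop encodings as used elsewhere in this thread):

/*
 * A task is preempted at the boundary between the two nops:
 *
 *     0f 1f 44 00 00    5-byte nop   <- already executed
 *     90                1-byte nop   <- saved IP points here
 *
 * The site is then rewritten as a single 6-byte instruction:
 *
 *     66 0f 1f 44 00 00
 *
 * On reschedule the task resumes in the middle of that instruction
 * and executes garbage bytes.
 */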

NACK on this. It needs a better solution.

-- Steve


>
> [Hou Wenlong: Adapt code change and fix wrong offset calculation in
> make_nop_x86()]
>

2023-04-28 13:52:01

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH RFC 21/43] x86/ftrace: Adapt assembly for PIE support

On Fri, 28 Apr 2023 17:51:01 +0800
"Hou Wenlong" <[email protected]> wrote:

> Change the assembly code to use only relative references of symbols for
> the kernel to be PIE compatible.
>
> Signed-off-by: Hou Wenlong <[email protected]>
> Cc: Thomas Garnier <[email protected]>
> Cc: Lai Jiangshan <[email protected]>
> Cc: Kees Cook <[email protected]>
> ---
> arch/x86/kernel/ftrace_64.S | 9 ++++++++-
> 1 file changed, 8 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/kernel/ftrace_64.S b/arch/x86/kernel/ftrace_64.S
> index eddb4fabc16f..411fa4148e18 100644
> --- a/arch/x86/kernel/ftrace_64.S
> +++ b/arch/x86/kernel/ftrace_64.S
> @@ -315,7 +315,14 @@ STACK_FRAME_NON_STANDARD_FP(ftrace_regs_caller)
> SYM_FUNC_START(__fentry__)
> CALL_DEPTH_ACCOUNT
>
> +#ifdef CONFIG_X86_PIE
> + pushq %r8
> + leaq ftrace_stub(%rip), %r8
> + cmpq %r8, ftrace_trace_function(%rip)
> + popq %r8
> +#else
> cmpq $ftrace_stub, ftrace_trace_function
> +#endif
> jnz trace
> RET
>
> @@ -329,7 +336,7 @@ trace:
> * ip and parent ip are used and the list function is called when
> * function tracing is enabled.
> */
> - movq ftrace_trace_function, %r8
> + movq ftrace_trace_function(%rip), %r8
> CALL_NOSPEC r8
> restore_mcount_regs
>

I really don't want to add more updates to !DYNAMIC_FTRACE. This code only
exists to make sure I don't break it for other architectures.

How about

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 442eccc00960..ee4d0713139d 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -37,7 +37,7 @@ config X86_64

config FORCE_DYNAMIC_FTRACE
def_bool y
- depends on X86_32
+ depends on X86_32 || X86_PIE
depends on FUNCTION_TRACER
select DYNAMIC_FTRACE
help


?

-- Steve

2023-04-28 15:22:36

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH RFC 34/43] objtool: Adapt indirect call of __fentry__() for PIE support

On Fri, Apr 28, 2023 at 05:51:14PM +0800, Hou Wenlong wrote:

> --- a/tools/objtool/arch/x86/decode.c
> +++ b/tools/objtool/arch/x86/decode.c
> @@ -747,15 +747,21 @@ void arch_initial_func_cfi_state(struct cfi_init_state *state)
>
> const char *arch_nop_insn(int len)
> {
> - static const char nops[5][5] = {
> + static const char nops[6][6] = {
> { BYTES_NOP1 },
> { BYTES_NOP2 },
> { BYTES_NOP3 },
> { BYTES_NOP4 },
> { BYTES_NOP5 },
> + /*
> + * For PIE kernel, use a 5-byte nop
> > + * and 1-byte nop to keep the ftrace
> > + * hooking algorithm working correctly.
> + */
> + { BYTES_NOP5, BYTES_NOP1 },
> };
> - if (len < 1 || len > 5) {
> + if (len < 1 || len > 6) {
> WARN("invalid NOP size: %d\n", len);
> return NULL;
> }

Like Steve already said, this is broken, we hard rely on these things
being single instructions, this must absolutely be BYTES_NOP6.

And yes, then you get to fix a whole lot more.

2023-04-28 15:24:51

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH RFC 00/43] x86/pie: Make kernel image's virtual address flexible


For some reason I didn't get 0/n but did get all of the others. Please
keep your Cc list consistent.

On Fri, Apr 28, 2023 at 05:50:40PM +0800, Hou Wenlong wrote:

> - It is not allowed to reference global variables in an alternative
> section since RIP-relative addressing is not fixed in
> apply_alternatives(). Fortunately, all disallowed relocations in the
> alternative section can be captured by objtool. I believe that this
> issue can also be fixed by using objtool.

https://lkml.kernel.org/r/[email protected]

2023-04-28 19:47:56

by Ard Biesheuvel

[permalink] [raw]
Subject: Re: [PATCH RFC 31/43] x86/modules: Adapt module loading for PIE support

On Fri, 28 Apr 2023 at 10:53, Hou Wenlong <[email protected]> wrote:
>
> Adapt module loading to support PIE relocations. No GOT is generated
> for modules; all GOT references in a module should already have
> entries in the kernel GOT. Currently, there is only one usable GOT
> reference, for __fentry__().
>

I don't think this is the right approach. We should permit GOTPCREL
relocations properly, which means making them point to a location in
memory that carries the absolute address of the symbol. There are
several ways to go about that, but perhaps the simplest way is to make
the symbol address in ksymtab a 64-bit absolute value (but retain the
PC32 references for the symbol name and the symbol namespace name).
That way, you can always resolve such GOTPCREL relocations by pointing
it to the ksymtab entry. Another option would be to take inspiration
from the PLT code we have on ARM and arm64 (and other architectures,
surely) and to count the GOT based relocations, allocate some extra
r/o module space for each, and allocate slots and populate them with
the right value as you fix up the relocations.

Then, many such relocations can be relaxed at module load time if the
symbol is in range. IIUC, the module and kernel will still be inside
the same 2G window even after widening the KASLR range to 512G, so
most GOT loads can be converted into RIP relative LEA instructions.
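
A rough sketch of the count-and-allocate variant (loosely modeled on
the arm64 module PLT code; every name here is hypothetical):

struct module_got {
        u64 *slots;             /* lives in extra r/o module space */
        unsigned int nr_slots;  /* sized by a counting pass over relocs */
        unsigned int nr_used;
};

/* Return the address of a GOT slot holding sym_addr. */
static u64 module_emit_got_entry(struct module_got *got, u64 sym_addr)
{
        unsigned int i;

        /* Reuse an existing slot for the same symbol if possible. */
        for (i = 0; i < got->nr_used; i++)
                if (got->slots[i] == sym_addr)
                        return (u64)&got->slots[i];

        BUG_ON(got->nr_used >= got->nr_slots);
        got->slots[got->nr_used] = sym_addr;
        return (u64)&got->slots[got->nr_used++];
}

The GOTPCREL relocation at the call site is then resolved against the
returned slot address, or relaxed to a direct LEA when the symbol turns
out to be in range.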

Note that this will also permit you to do things like

#define PV_VCPU_PREEMPTED_ASM \
"leaq __per_cpu_offset(%rip), %rax \n\t" \
"movq (%rax,%rdi,8), %rax \n\t" \
"addq steal_time@GOTPCREL(%rip), %rax \n\t" \
"cmpb $0, " __stringify(KVM_STEAL_TIME_preempted) "(%rax) \n\t" \
"setne %al\n\t"

or

+#ifdef CONFIG_X86_PIE
+ " pushq arch_rethook_trampoline@GOTPCREL(%rip)\n"
+#else
" pushq $arch_rethook_trampoline\n"
+#endif

instead of having these kludgy push/pop sequences to free up temp registers.

(FYI I have looked into this PIE linking just a few weeks ago [0] so
this is all rather fresh in my memory)




[0] https://git.kernel.org/pub/scm/linux/kernel/git/ardb/linux.git/log/?h=x86-pie


> Signed-off-by: Hou Wenlong <[email protected]>
> Cc: Thomas Garnier <[email protected]>
> Cc: Lai Jiangshan <[email protected]>
> Cc: Kees Cook <[email protected]>
> ---
> arch/x86/include/asm/sections.h | 5 +++++
> arch/x86/kernel/module.c | 27 +++++++++++++++++++++++++++
> 2 files changed, 32 insertions(+)
>
> diff --git a/arch/x86/include/asm/sections.h b/arch/x86/include/asm/sections.h
> index a6e8373a5170..dc1c2b08ec48 100644
> --- a/arch/x86/include/asm/sections.h
> +++ b/arch/x86/include/asm/sections.h
> @@ -12,6 +12,11 @@ extern char __end_rodata_aligned[];
>
> #if defined(CONFIG_X86_64)
> extern char __end_rodata_hpage_align[];
> +
> +#ifdef CONFIG_X86_PIE
> +extern char __start_got[], __end_got[];
> +#endif
> +
> #endif
>
> extern char __end_of_kernel_reserve[];
> diff --git a/arch/x86/kernel/module.c b/arch/x86/kernel/module.c
> index 84ad0e61ba6e..051f88e6884e 100644
> --- a/arch/x86/kernel/module.c
> +++ b/arch/x86/kernel/module.c
> @@ -129,6 +129,18 @@ int apply_relocate(Elf32_Shdr *sechdrs,
> return 0;
> }
> #else /*X86_64*/
> +#ifdef CONFIG_X86_PIE
> +static u64 find_got_kernel_entry(Elf64_Sym *sym, const Elf64_Rela *rela)
> +{
> + u64 *pos;
> +
> + for (pos = (u64 *)__start_got; pos < (u64 *)__end_got; pos++)
> + if (*pos == sym->st_value)
> + return (u64)pos + rela->r_addend;
> + return 0;
> +}
> +#endif
> +
> static int __write_relocate_add(Elf64_Shdr *sechdrs,
> const char *strtab,
> unsigned int symindex,
> @@ -171,6 +183,7 @@ static int __write_relocate_add(Elf64_Shdr *sechdrs,
> case R_X86_64_64:
> size = 8;
> break;
> +#ifndef CONFIG_X86_PIE
> case R_X86_64_32:
> if (val != *(u32 *)&val)
> goto overflow;
> @@ -181,6 +194,13 @@ static int __write_relocate_add(Elf64_Shdr *sechdrs,
> goto overflow;
> size = 4;
> break;
> +#else
> + case R_X86_64_GOTPCREL:
> + val = find_got_kernel_entry(sym, rel);
> + if (!val)
> + goto unexpected_got_reference;
> + fallthrough;
> +#endif
> case R_X86_64_PC32:
> case R_X86_64_PLT32:
> val -= (u64)loc;
> @@ -214,11 +234,18 @@ static int __write_relocate_add(Elf64_Shdr *sechdrs,
> }
> return 0;
>
> +#ifdef CONFIG_X86_PIE
> +unexpected_got_reference:
> + pr_err("Target got entry doesn't exist in kernel got, loc %p\n", loc);
> + return -ENOEXEC;
> +#else
> overflow:
> pr_err("overflow in relocation type %d val %Lx\n",
> (int)ELF64_R_TYPE(rel[i].r_info), val);
> pr_err("`%s' likely not compiled with -mcmodel=kernel\n",
> me->name);
> +#endif
> +
> return -ENOEXEC;
> }
>
> --
> 2.31.1
>

2023-04-29 03:53:27

by Hou Wenlong

[permalink] [raw]
Subject: Re: [PATCH RFC 22/43] x86/ftrace: Adapt ftrace nop patching for PIE support

On Fri, Apr 28, 2023 at 09:44:54PM +0800, Steven Rostedt wrote:
> On Fri, 28 Apr 2023 17:51:02 +0800
> "Hou Wenlong" <[email protected]> wrote:
>
> > From: Thomas Garnier <[email protected]>
> >
> > When using PIE with function tracing, the compiler generates a
> > call through the GOT (call *__fentry__@GOTPCREL). This instruction
> > takes 6-bytes instead of 5-bytes with a relative call. And -mnop-mcount
> > option is not implemented for -fPIE now.
> >
> > If PIE is enabled, replace the 6th byte of the GOT call by a 1-byte nop
> > so ftrace can handle the previous 5-bytes as before.
>
> Wait! This won't work!
>
> You can't just append another nop to fill in the blanks here. We must
> either have a single 6 byte nop, or we need to refactor the entire logic to
> something that other archs have.
>
> The two nops means that the CPU can take it as two separate commands.
> There's nothing stopping the computer from preempting a task between the
> two. If that happens, and you modify the 1byte nop and 5byte nop with a
> single 6 byte command, when the task get's rescheduled, it will execute the
> last 5 bytes of that 6 byte command and take a general protection fault, and
> likely crash the machine.
>
> NACK on this. It needs a better solution.
>
> -- Steve
>
>
Hi Steve,

Sorry for not providing the original patch link:
https://lore.kernel.org/all/[email protected]/

I dropped the Reviewed-by tag due to the change described in the commit
message.

This nop patching is only used for the initial call site conversion
(addr == MCOUNT_ADDR), before SMP is up or any module code runs. And
ftrace_make_call() is not modified, so the first 5-byte nop is still
replaced with a 5-byte direct call when a tracepoint is enabled, as
before; that is a single instruction. So the logic is the same as
before: only the first 5 bytes are patched when a tracepoint is enabled
or disabled at runtime.

> >
> > [Hou Wenlong: Adapt code change and fix wrong offset calculation in
> > make_nop_x86()]
> >

2023-04-29 04:12:08

by Hou Wenlong

[permalink] [raw]
Subject: Re: [PATCH RFC 33/43] objtool: Add validation for x86 PIE support

On Fri, Apr 28, 2023 at 06:28:19PM +0800, Christophe Leroy wrote:
>
>
> On 28/04/2023 at 11:51, Hou Wenlong wrote:
> > [You don't often get email from [email protected]. Learn why this is important at https://aka.ms/LearnAboutSenderIdentification ]
> >
> > For x86 PIE binary, only RIP-relative addressing is allowed, however,
> > there are still a little absolute references of R_X86_64_64 relocation
> > type for data section and a little absolute references of R_X86_64_32S
> > relocation type in pvh_start_xen() function.
> >
> > Suggested-by: Lai Jiangshan <[email protected]>
> > Signed-off-by: Hou Wenlong <[email protected]>
> > Cc: Thomas Garnier <[email protected]>
> > Cc: Kees Cook <[email protected]>
> > ---
> > arch/x86/Kconfig | 1 +
> > scripts/Makefile.lib | 1 +
> > tools/objtool/builtin-check.c | 4 +-
> > tools/objtool/check.c | 82 +++++++++++++++++++++++++
> > tools/objtool/include/objtool/builtin.h | 1 +
> > 5 files changed, 88 insertions(+), 1 deletion(-)
> >
> > diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> > index 715f0734d065..b753a54e5ea7 100644
> > --- a/arch/x86/Kconfig
> > +++ b/arch/x86/Kconfig
> > @@ -2224,6 +2224,7 @@ config RELOCATABLE
> > config X86_PIE
> > def_bool n
> > depends on X86_64
> > + select OBJTOOL if HAVE_OBJTOOL
> >
> > config RANDOMIZE_BASE
> > bool "Randomize the address of the kernel image (KASLR)"
> > diff --git a/scripts/Makefile.lib b/scripts/Makefile.lib
> > index 100a386fcd71..e3c804fbc421 100644
> > --- a/scripts/Makefile.lib
> > +++ b/scripts/Makefile.lib
> > @@ -270,6 +270,7 @@ objtool-args-$(CONFIG_HAVE_STATIC_CALL_INLINE) += --static-call
> > objtool-args-$(CONFIG_HAVE_UACCESS_VALIDATION) += --uaccess
> > objtool-args-$(CONFIG_GCOV_KERNEL) += --no-unreachable
> > objtool-args-$(CONFIG_PREFIX_SYMBOLS) += --prefix=$(CONFIG_FUNCTION_PADDING_BYTES)
> > +objtool-args-$(CONFIG_X86_PIE) += --pie
> >
> > objtool-args = $(objtool-args-y) \
> > $(if $(delay-objtool), --link) \
> > diff --git a/tools/objtool/builtin-check.c b/tools/objtool/builtin-check.c
> > index 7c175198d09f..1cf1d00464e0 100644
> > --- a/tools/objtool/builtin-check.c
> > +++ b/tools/objtool/builtin-check.c
> > @@ -81,6 +81,7 @@ static const struct option check_options[] = {
> > OPT_BOOLEAN('t', "static-call", &opts.static_call, "annotate static calls"),
> > OPT_BOOLEAN('u', "uaccess", &opts.uaccess, "validate uaccess rules for SMAP"),
> > OPT_BOOLEAN(0 , "cfi", &opts.cfi, "annotate kernel control flow integrity (kCFI) function preambles"),
> > + OPT_BOOLEAN(0, "pie", &opts.pie, "validate addressing rules for PIE"),
> > OPT_CALLBACK_OPTARG(0, "dump", NULL, NULL, "orc", "dump metadata", parse_dump),
> >
> > OPT_GROUP("Options:"),
> > @@ -137,7 +138,8 @@ static bool opts_valid(void)
> > opts.sls ||
> > opts.stackval ||
> > opts.static_call ||
> > - opts.uaccess) {
> > + opts.uaccess ||
> > + opts.pie) {
> > if (opts.dump_orc) {
> > ERROR("--dump can't be combined with other options");
> > return false;
> > diff --git a/tools/objtool/check.c b/tools/objtool/check.c
> > index 5b600bbf2389..d67b80251eec 100644
> > --- a/tools/objtool/check.c
> > +++ b/tools/objtool/check.c
> > @@ -131,6 +131,27 @@ static struct instruction *prev_insn_same_sym(struct objtool_file *file,
> > for (insn = next_insn_same_sec(file, insn); insn; \
> > insn = next_insn_same_sec(file, insn))
> >
> > +static struct instruction *find_insn_containing(struct objtool_file *file,
> > + struct section *sec,
> > + unsigned long offset)
> > +{
> > + struct instruction *insn;
> > +
> > + insn = find_insn(file, sec, 0);
> > + if (!insn)
> > + return NULL;
> > +
> > + sec_for_each_insn_from(file, insn) {
> > + if (insn->offset > offset)
> > + return NULL;
> > + if (insn->offset <= offset && (insn->offset + insn->len) > offset)
> > + return insn;
> > + }
> > +
> > + return NULL;
> > +}
> > +
> > +
> > static inline struct symbol *insn_call_dest(struct instruction *insn)
> > {
> > if (insn->type == INSN_JUMP_DYNAMIC ||
> > @@ -4529,6 +4550,61 @@ static int validate_reachable_instructions(struct objtool_file *file)
> > return 0;
> > }
> >
> > +static int is_in_pvh_code(struct instruction *insn)
> > +{
> > + struct symbol *sym = insn->sym;
> > +
> > + return sym && !strcmp(sym->name, "pvh_start_xen");
> > +}
> > +
> > +static int validate_pie(struct objtool_file *file)
> > +{
> > + struct section *sec;
> > + struct reloc *reloc;
> > + struct instruction *insn;
> > + int warnings = 0;
> > +
> > + for_each_sec(file, sec) {
> > + if (!sec->reloc)
> > + continue;
> > + if (!(sec->sh.sh_flags & SHF_ALLOC))
> > + continue;
> > +
> > + list_for_each_entry(reloc, &sec->reloc->reloc_list, list) {
> > + switch (reloc->type) {
> > + case R_X86_64_NONE:
> > + case R_X86_64_PC32:
> > + case R_X86_64_PLT32:
> > + case R_X86_64_64:
> > + case R_X86_64_PC64:
> > + case R_X86_64_GOTPCREL:
> > + break;
> > + case R_X86_64_32:
> > + case R_X86_64_32S:
>
> That looks very specific to X86; should it go somewhere else?
>
> If it can work for any architecture, can you add generic macros, just
> like commit c1449735211d ("objtool: Use macros to define arch specific
> reloc types") then commit c984aef8c832 ("objtool/powerpc: Add --mcount
> specific implementation") ?
>
Got it, I'll refactor it and move the code into the x86 directory.

Thanks.
> > + insn = find_insn_containing(file, sec, reloc->offset);
> > + if (!insn) {
> > + WARN("can't find relocate insn near %s+0x%lx",
> > + sec->name, reloc->offset);
> > + } else {
> > + if (is_in_pvh_code(insn))
> > + break;
> > + WARN("insn at %s+0x%lx is not compatible with PIE",
> > + sec->name, insn->offset);
> > + }
> > + warnings++;
> > + break;
> > + default:
> > + WARN("unexpected relocation type %d at %s+0x%lx",
> > + reloc->type, sec->name, reloc->offset);
> > + warnings++;
> > + break;
> > + }
> > + }
> > + }
> > +
> > + return warnings;
> > +}
> > +
> > int check(struct objtool_file *file)
> > {
> > int ret, warnings = 0;
> > @@ -4673,6 +4749,12 @@ int check(struct objtool_file *file)
> > warnings += ret;
> > }
> >
> > + if (opts.pie) {
> > + ret = validate_pie(file);
> > + if (ret < 0)
> > + return ret;
> > + warnings += ret;
> > + }
> >
> > if (opts.stats) {
> > printf("nr_insns_visited: %ld\n", nr_insns_visited);
> > diff --git a/tools/objtool/include/objtool/builtin.h b/tools/objtool/include/objtool/builtin.h
> > index 2a108e648b7a..1151211a5cea 100644
> > --- a/tools/objtool/include/objtool/builtin.h
> > +++ b/tools/objtool/include/objtool/builtin.h
> > @@ -26,6 +26,7 @@ struct opts {
> > bool uaccess;
> > int prefix;
> > bool cfi;
> > + bool pie;
> >
> > /* options: */
> > bool backtrace;
> > --
> > 2.31.1
> >

2023-04-29 04:25:11

by Hou Wenlong

[permalink] [raw]
Subject: Re: [PATCH RFC 33/43] objtool: Add validation for x86 PIE support

On Fri, Apr 28, 2023 at 07:43:38PM +0800, Peter Zijlstra wrote:
> On Fri, Apr 28, 2023 at 10:28:19AM +0000, Christophe Leroy wrote:
>
>
> > > diff --git a/tools/objtool/check.c b/tools/objtool/check.c
> > > index 5b600bbf2389..d67b80251eec 100644
> > > --- a/tools/objtool/check.c
> > > +++ b/tools/objtool/check.c
> > > @@ -131,6 +131,27 @@ static struct instruction *prev_insn_same_sym(struct objtool_file *file,
> > > for (insn = next_insn_same_sec(file, insn); insn; \
> > > insn = next_insn_same_sec(file, insn))
> > >
> > > +static struct instruction *find_insn_containing(struct objtool_file *file,
> > > + struct section *sec,
> > > + unsigned long offset)
> > > +{
> > > + struct instruction *insn;
> > > +
> > > + insn = find_insn(file, sec, 0);
> > > + if (!insn)
> > > + return NULL;
> > > +
> > > + sec_for_each_insn_from(file, insn) {
> > > + if (insn->offset > offset)
> > > + return NULL;
> > > + if (insn->offset <= offset && (insn->offset + insn->len) > offset)
> > > + return insn;
> > > + }
> > > +
> > > + return NULL;
> > > +}
>
> Urgh, this is horrendous crap. Yes you're only using it in case of a
> warning, but adding a function like this makes it appear like it's
> actually sane to use.
>
> A far better implementation -- but still not stellar -- would be
> something like:
>
> sym = find_symbol_containing(sec, offset);
> if (!sym)
> // fail
> sym_for_each_insn(file, sym, insn) {
> ...
> }
>
> But given insn_hash uses sec_offset_hash() you can do something similar
> to find_reloc_by_dest_range()
>
> start = offset - (INSN_MAX_SIZE - 1);
> for_offset_range(o, start, start + INSN_MAX_SIZE) {
> hash_for_each_possible(file->insn_hash, insn, hash, sec_offset_hash(sec, o)) {
> if (insn->sec != sec)
> continue;
>
> if (insn->offset <= offset &&
> insn->offset + insn->len > offset)
> return insn;
> }
> }
> return NULL;
>
Thanks for your suggestion, I'll pick it up in the next version.
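
Filled in, that second suggestion might look like the sketch below (a
sketch only, following Peter's pseudocode; INSN_MAX_SIZE is assumed to be
the maximum x86 instruction length):

static struct instruction *find_insn_containing(struct objtool_file *file,
						struct section *sec,
						unsigned long offset)
{
	struct instruction *insn;
	unsigned long o, start = offset - (INSN_MAX_SIZE - 1);

	/* Probe every offset that could begin an insn covering 'offset'. */
	for_offset_range(o, start, start + INSN_MAX_SIZE) {
		hash_for_each_possible(file->insn_hash, insn, hash,
				       sec_offset_hash(sec, o)) {
			if (insn->sec != sec)
				continue;
			if (insn->offset <= offset &&
			    insn->offset + insn->len > offset)
				return insn;
		}
	}

	return NULL;
}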

> > > +
> > > +
> > > static inline struct symbol *insn_call_dest(struct instruction *insn)
> > > {
> > > if (insn->type == INSN_JUMP_DYNAMIC ||
> > > @@ -4529,6 +4550,61 @@ static int validate_reachable_instructions(struct objtool_file *file)
> > > return 0;
> > > }
> > >
> > > +static int is_in_pvh_code(struct instruction *insn)
> > > +{
> > > + struct symbol *sym = insn->sym;
> > > +
> > > + return sym && !strcmp(sym->name, "pvh_start_xen");
> > > +}
> > > +
> > > +static int validate_pie(struct objtool_file *file)
> > > +{
> > > + struct section *sec;
> > > + struct reloc *reloc;
> > > + struct instruction *insn;
> > > + int warnings = 0;
> > > +
> > > + for_each_sec(file, sec) {
> > > + if (!sec->reloc)
> > > + continue;
> > > + if (!(sec->sh.sh_flags & SHF_ALLOC))
> > > + continue;
> > > +
> > > + list_for_each_entry(reloc, &sec->reloc->reloc_list, list) {
> > > + switch (reloc->type) {
> > > + case R_X86_64_NONE:
> > > + case R_X86_64_PC32:
> > > + case R_X86_64_PLT32:
> > > + case R_X86_64_64:
> > > + case R_X86_64_PC64:
> > > + case R_X86_64_GOTPCREL:
> > > + break;
> > > + case R_X86_64_32:
> > > + case R_X86_64_32S:
> >
> > That looks very specific to X86; should it go somewhere else?
> >
> > If it can work for any architecture, can you add generic macros, just
> > like commit c1449735211d ("objtool: Use macros to define arch specific
> > reloc types") then commit c984aef8c832 ("objtool/powerpc: Add --mcount
> > specific implementation") ?
>
> Yes, this should be something like arch_PIE_reloc() or so. Similar to
> arch_pc_relative_reloc().
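
A minimal sketch of such a hook (the name and placement are assumptions
following the commits cited above; the x86 implementation would live under
tools/objtool/arch/x86/, mirroring arch_pc_relative_reloc()):

/* Return true for relocation types a PIE kernel cannot tolerate. */
bool arch_pie_incompatible_reloc(unsigned int type)
{
	switch (type) {
	case R_X86_64_32:	/* absolute, zero-extended */
	case R_X86_64_32S:	/* absolute, sign-extended */
		return true;
	default:
		return false;
	}
}

validate_pie() would then call this hook instead of hard-coding x86 ELF
relocation types, leaving only generic logic in check.c.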

2023-04-29 04:47:39

by Hou Wenlong

[permalink] [raw]
Subject: Re: [PATCH RFC 21/43] x86/ftrace: Adapt assembly for PIE support

On Fri, Apr 28, 2023 at 09:37:19PM +0800, Steven Rostedt wrote:
> On Fri, 28 Apr 2023 17:51:01 +0800
> "Hou Wenlong" <[email protected]> wrote:
>
> > Change the assembly code to use only relative references of symbols for
> > the kernel to be PIE compatible.
> >
> > Signed-off-by: Hou Wenlong <[email protected]>
> > Cc: Thomas Garnier <[email protected]>
> > Cc: Lai Jiangshan <[email protected]>
> > Cc: Kees Cook <[email protected]>
> > ---
> > arch/x86/kernel/ftrace_64.S | 9 ++++++++-
> > 1 file changed, 8 insertions(+), 1 deletion(-)
> >
> > diff --git a/arch/x86/kernel/ftrace_64.S b/arch/x86/kernel/ftrace_64.S
> > index eddb4fabc16f..411fa4148e18 100644
> > --- a/arch/x86/kernel/ftrace_64.S
> > +++ b/arch/x86/kernel/ftrace_64.S
> > @@ -315,7 +315,14 @@ STACK_FRAME_NON_STANDARD_FP(ftrace_regs_caller)
> > SYM_FUNC_START(__fentry__)
> > CALL_DEPTH_ACCOUNT
> >
> > +#ifdef CONFIG_X86_PIE
> > + pushq %r8
> > + leaq ftrace_stub(%rip), %r8
> > + cmpq %r8, ftrace_trace_function(%rip)
> > + popq %r8
> > +#else
> > cmpq $ftrace_stub, ftrace_trace_function
> > +#endif
> > jnz trace
> > RET
> >
> > @@ -329,7 +336,7 @@ trace:
> > * ip and parent ip are used and the list function is called when
> > * function tracing is enabled.
> > */
> > - movq ftrace_trace_function, %r8
> > + movq ftrace_trace_function(%rip), %r8
> > CALL_NOSPEC r8
> > restore_mcount_regs
> >
>
> I really don't want to add more updates to !DYNAMIC_FTRACE. This code only
> exists to make sure I don't break it for other architectures.
>
> How about
>
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index 442eccc00960..ee4d0713139d 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -37,7 +37,7 @@ config X86_64
>
> config FORCE_DYNAMIC_FTRACE
> def_bool y
> - depends on X86_32
> + depends on X86_32 || X86_PIE
> depends on FUNCTION_TRACER
> select DYNAMIC_FTRACE
> help
>
>
> ?
>
OK, I'll drop it. Actually, I already select DYNAMIC_FTRACE when
CONFIG_RETPOLINE is enabled for PIE, due to the indirect call to
__fentry__() in patch 34.

Thanks.
> -- Steve

2023-05-06 07:32:22

by Hou Wenlong

[permalink] [raw]
Subject: Re: [PATCH RFC 00/43] x86/pie: Make kernel image's virtual address flexible

On Fri, Apr 28, 2023 at 11:22:06PM +0800, Peter Zijlstra wrote:
>
> For some reason I didn't get 0/n but did get all of the others. Please
> keep your Cc list consistent.
>
Sorry, I'll pay attention next time.

> On Fri, Apr 28, 2023 at 05:50:40PM +0800, Hou Wenlong wrote:
>
> > - It is not allowed to reference global variables in an alternative
> > section since RIP-relative addressing is not fixed in
> > apply_alternatives(). Fortunately, all disallowed relocations in the
> > alternative section can be captured by objtool. I believe that this
> > issue can also be fixed by using objtool.
>
> https://lkml.kernel.org/r/[email protected]

Thank you for your patch. However, it's more complicated for the call depth
tracking case. Although the per-cpu variable in the alternative section
is relocated, the content of "skl_call_thunk_template" is copied
into another place, so the offset is still incorrect. I'm not sure whether
this case is common. Since the destination is known, I could do the
relocation there as well, but it would make the code more complicated.
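
For reference, the offset problem described above can be stated in a couple
of lines of arithmetic (an illustrative comment, not kernel code):

/* A RIP-relative disp32 is fixed up so that, at the template's address:
 *	template_addr + insn_end + disp32 == target
 * Once the thunk bytes are copied elsewhere, the same disp32 gives:
 *	thunk_addr + insn_end + disp32 != target
 * unless disp32 is re-adjusted by (template_addr - thunk_addr). */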

Thanks!

2023-05-08 08:41:53

by Hou Wenlong

[permalink] [raw]
Subject: Re: [PATCH RFC 31/43] x86/modules: Adapt module loading for PIE support

On Sat, Apr 29, 2023 at 03:29:32AM +0800, Ard Biesheuvel wrote:
> On Fri, 28 Apr 2023 at 10:53, Hou Wenlong <[email protected]> wrote:
> >
> > Adapt module loading to support PIE relocations. No GOT is generated for
> > a module; all GOT entries for GOT references in a module should exist in
> > the kernel GOT. Currently, there is only one usable GOT reference, for
> > __fentry__().
> >
>
> I don't think this is the right approach. We should permit GOTPCREL
> relocations properly, which means making them point to a location in
> memory that carries the absolute address of the symbol. There are
> several ways to go about that, but perhaps the simplest way is to make
> the symbol address in ksymtab a 64-bit absolute value (but retain the
> PC32 references for the symbol name and the symbol namespace name).
> That way, you can always resolve such GOTPCREL relocations by pointing
> it to the ksymtab entry. Another option would be to take inspiration
> from the PLT code we have on ARM and arm64 (and other architectures,
> surely) and to count the GOT based relocations, allocate some extra
> r/o module space for each, and allocate slots and populate them with
> the right value as you fix up the relocations.
>
> Then, many such relocations can be relaxed at module load time if the
> symbol is in range. IIUC, the module and kernel will still be inside
> the same 2G window even after widening the KASLR range to 512G, so
> most GOT loads can be converted into RIP relative LEA instructions.
>
> Note that this will also permit you to do things like
>
> #define PV_VCPU_PREEMPTED_ASM \
> "leaq __per_cpu_offset(%rip), %rax \n\t" \
> "movq (%rax,%rdi,8), %rax \n\t" \
> "addq steal_time@GOTPCREL(%rip), %rax \n\t" \
> "cmpb $0, " __stringify(KVM_STEAL_TIME_preempted) "(%rax) \n\t" \
> "setne %al\n\t"
>
> or
>
> +#ifdef CONFIG_X86_PIE
> + " pushq arch_rethook_trampoline@GOTPCREL(%rip)\n"
> +#else
> " pushq $arch_rethook_trampoline\n"
> +#endif
>
> instead of having these kludgy push/pop sequences to free up temp registers.
>
> (FYI I have looked into this PIE linking just a few weeks ago [0] so
> this is all rather fresh in my memory)
>
>
>
>
> [0] https://git.kernel.org/pub/scm/linux/kernel/git/ardb/linux.git/log/?h=x86-pie
>
>
Hi Ard,
Thanks for providing the link, it has been very helpful for me as I am
new to the topic of compilers. One key difference I noticed is that you
linked the kernel with "-pie" instead of "--emit-reloc". I also noticed
that Thomas' initial patchset[0] used "-pie", but in RFC v3 [1], it
switched to "--emit-reloc" in order to reduce dynamic relocation space
on mapped memory.

Another issue is that it requires the addition of the
"-mrelax-relocations=no" option to support older compilers and linkers.
R_X86_64_GOTPCRELX and R_X86_64_REX_GOTPCRELX relocations are supported
in binutils 2.26 and later, but the minimum version required for the kernel
is 2.25. This option disables relocation relaxation, which leaves the GOT
non-empty. I also noticed this option in arch/x86/boot/compressed/Makefile,
with the reason given in [2]. Without relocation relaxation, GOT
references would increase the size of the GOT. Therefore, I do not want to
use GOT references in assembly directly. However, I realized that the
compiler could still generate GOT references in some cases, such as
"fentry" calls and stack canary references.

Regarding module loading, I agree that we should support GOT reference
for the module itself. I will refactor it according to your suggestion.

Thanks.

[0] https://yhbt.net/lore/all/[email protected]
[1] https://yhbt.net/lore/all/[email protected]
[2] https://lore.kernel.org/all/[email protected]/
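
As a rough sketch of the PLT-style option Ard describes (per-module GOT
slots counted and reserved at load time; every name below is illustrative,
not existing kernel API):

#include <linux/types.h>

struct mod_got {
	u64	*entries;	/* r/o module space reserved beforehand */
	int	count;
};

/* Hand out a slot carrying the symbol's absolute address; a GOTPCREL
 * relocation is then resolved against the slot's own address. */
static u64 module_emit_got_entry(struct mod_got *got, u64 sym_addr)
{
	int i;

	for (i = 0; i < got->count; i++)
		if (got->entries[i] == sym_addr)	/* deduplicate */
			return (u64)&got->entries[i];

	got->entries[got->count] = sym_addr;
	return (u64)&got->entries[got->count++];
}

If the symbol turns out to be within the +/-2G window at load time, the GOT
load can instead be relaxed to a RIP-relative LEA, as Ard notes.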

> > Signed-off-by: Hou Wenlong <[email protected]>
> > Cc: Thomas Garnier <[email protected]>
> > Cc: Lai Jiangshan <[email protected]>
> > Cc: Kees Cook <[email protected]>
> > ---
> > arch/x86/include/asm/sections.h | 5 +++++
> > arch/x86/kernel/module.c | 27 +++++++++++++++++++++++++++
> > 2 files changed, 32 insertions(+)
> >
> > diff --git a/arch/x86/include/asm/sections.h b/arch/x86/include/asm/sections.h
> > index a6e8373a5170..dc1c2b08ec48 100644
> > --- a/arch/x86/include/asm/sections.h
> > +++ b/arch/x86/include/asm/sections.h
> > @@ -12,6 +12,11 @@ extern char __end_rodata_aligned[];
> >
> > #if defined(CONFIG_X86_64)
> > extern char __end_rodata_hpage_align[];
> > +
> > +#ifdef CONFIG_X86_PIE
> > +extern char __start_got[], __end_got[];
> > +#endif
> > +
> > #endif
> >
> > extern char __end_of_kernel_reserve[];
> > diff --git a/arch/x86/kernel/module.c b/arch/x86/kernel/module.c
> > index 84ad0e61ba6e..051f88e6884e 100644
> > --- a/arch/x86/kernel/module.c
> > +++ b/arch/x86/kernel/module.c
> > @@ -129,6 +129,18 @@ int apply_relocate(Elf32_Shdr *sechdrs,
> > return 0;
> > }
> > #else /*X86_64*/
> > +#ifdef CONFIG_X86_PIE
> > +static u64 find_got_kernel_entry(Elf64_Sym *sym, const Elf64_Rela *rela)
> > +{
> > + u64 *pos;
> > +
> > + for (pos = (u64 *)__start_got; pos < (u64 *)__end_got; pos++)
> > + if (*pos == sym->st_value)
> > + return (u64)pos + rela->r_addend;
> > + return 0;
> > +}
> > +#endif
> > +
> > static int __write_relocate_add(Elf64_Shdr *sechdrs,
> > const char *strtab,
> > unsigned int symindex,
> > @@ -171,6 +183,7 @@ static int __write_relocate_add(Elf64_Shdr *sechdrs,
> > case R_X86_64_64:
> > size = 8;
> > break;
> > +#ifndef CONFIG_X86_PIE
> > case R_X86_64_32:
> > if (val != *(u32 *)&val)
> > goto overflow;
> > @@ -181,6 +194,13 @@ static int __write_relocate_add(Elf64_Shdr *sechdrs,
> > goto overflow;
> > size = 4;
> > break;
> > +#else
> > + case R_X86_64_GOTPCREL:
> > + val = find_got_kernel_entry(sym, rel);
> > + if (!val)
> > + goto unexpected_got_reference;
> > + fallthrough;
> > +#endif
> > case R_X86_64_PC32:
> > case R_X86_64_PLT32:
> > val -= (u64)loc;
> > @@ -214,11 +234,18 @@ static int __write_relocate_add(Elf64_Shdr *sechdrs,
> > }
> > return 0;
> >
> > +#ifdef CONFIG_X86_PIE
> > +unexpected_got_reference:
> > + pr_err("Target got entry doesn't exist in kernel got, loc %p\n", loc);
> > + return -ENOEXEC;
> > +#else
> > overflow:
> > pr_err("overflow in relocation type %d val %Lx\n",
> > (int)ELF64_R_TYPE(rel[i].r_info), val);
> > pr_err("`%s' likely not compiled with -mcmodel=kernel\n",
> > me->name);
> > +#endif
> > +
> > return -ENOEXEC;
> > }
> >
> > --
> > 2.31.1
> >

2023-05-08 09:31:38

by Ard Biesheuvel

[permalink] [raw]
Subject: Re: [PATCH RFC 31/43] x86/modules: Adapt module loading for PIE support

On Mon, 8 May 2023 at 10:38, Hou Wenlong <[email protected]> wrote:
>
> On Sat, Apr 29, 2023 at 03:29:32AM +0800, Ard Biesheuvel wrote:
> > On Fri, 28 Apr 2023 at 10:53, Hou Wenlong <[email protected]> wrote:
> > >
> > > Adapt module loading to support PIE relocations. No GOT is generated for
> > > a module; all GOT entries for GOT references in a module should exist in
> > > the kernel GOT. Currently, there is only one usable GOT reference, for
> > > __fentry__().
> > >
> >
> > I don't think this is the right approach. We should permit GOTPCREL
> > relocations properly, which means making them point to a location in
> > memory that carries the absolute address of the symbol. There are
> > several ways to go about that, but perhaps the simplest way is to make
> > the symbol address in ksymtab a 64-bit absolute value (but retain the
> > PC32 references for the symbol name and the symbol namespace name).
> > That way, you can always resolve such GOTPCREL relocations by pointing
> > it to the ksymtab entry. Another option would be to take inspiration
> > from the PLT code we have on ARM and arm64 (and other architectures,
> > surely) and to count the GOT based relocations, allocate some extra
> > r/o module space for each, and allocate slots and populate them with
> > the right value as you fix up the relocations.
> >
> > Then, many such relocations can be relaxed at module load time if the
> > symbol is in range. IIUC, the module and kernel will still be inside
> > the same 2G window even after widening the KASLR range to 512G, so
> > most GOT loads can be converted into RIP relative LEA instructions.
> >
> > Note that this will also permit you to do things like
> >
> > #define PV_VCPU_PREEMPTED_ASM \
> > "leaq __per_cpu_offset(%rip), %rax \n\t" \
> > "movq (%rax,%rdi,8), %rax \n\t" \
> > "addq steal_time@GOTPCREL(%rip), %rax \n\t" \
> > "cmpb $0, " __stringify(KVM_STEAL_TIME_preempted) "(%rax) \n\t" \
> > "setne %al\n\t"
> >
> > or
> >
> > +#ifdef CONFIG_X86_PIE
> > + " pushq arch_rethook_trampoline@GOTPCREL(%rip)\n"
> > +#else
> > " pushq $arch_rethook_trampoline\n"
> > +#endif
> >
> > instead of having these kludgy push/pop sequences to free up temp registers.
> >
> > (FYI I have looked into this PIE linking just a few weeks ago [0] so
> > this is all rather fresh in my memory)
> >
> >
> >
> >
> > [0] https://git.kernel.org/pub/scm/linux/kernel/git/ardb/linux.git/log/?h=x86-pie
> >
> >
> Hi Ard,
> Thanks for providing the link, it has been very helpful for me as I am
> new to the topic of compilers.

Happy to hear that.

> One key difference I noticed is that you
> linked the kernel with "-pie" instead of "--emit-relocs". I also noticed
> that Thomas' initial patchset[0] used "-pie", but in RFC v3 [1], it
> switched to "--emit-relocs" in order to reduce dynamic relocation space
> on mapped memory.
>

The problem with --emit-relocs is that the relocations emitted into
the binary may get out of sync with the actual code after the linker
has applied relocations.

$ cat /tmp/a.s
foo:movq foo@GOTPCREL(%rip), %rax

$ x86_64-linux-gnu-gcc -c -o /tmp/a.o /tmp/a.s
ard@gambale:~/linux$ x86_64-linux-gnu-objdump -dr /tmp/a.o

/tmp/a.o: file format elf64-x86-64


Disassembly of section .text:

0000000000000000 <foo>:
0: 48 8b 05 00 00 00 00 mov 0x0(%rip),%rax # 7 <foo+0x7>
3: R_X86_64_REX_GOTPCRELX foo-0x4

$ x86_64-linux-gnu-gcc -c -o /tmp/a.o /tmp/a.s
$ x86_64-linux-gnu-objdump -dr /tmp/a.o
0000000000000000 <foo>:
0: 48 8b 05 00 00 00 00 mov 0x0(%rip),%rax # 7 <foo+0x7>
3: R_X86_64_REX_GOTPCRELX foo-0x4

$ x86_64-linux-gnu-gcc -o /tmp/a.elf -nostartfiles
-Wl,-no-pie,-q,--defsym,_start=0x0 /tmp/a.s
$ x86_64-linux-gnu-objdump -dr /tmp/a.elf
0000000000401000 <foo>:
401000: 48 c7 c0 00 10 40 00 mov $0x401000,%rax
401003: R_X86_64_32S foo

$ x86_64-linux-gnu-gcc -o /tmp/a.elf -nostartfiles
-Wl,-q,--defsym,_start=0x0 /tmp/a.s
$ x86_64-linux-gnu-objdump -dr /tmp/a.elf
0000000000001000 <foo>:
1000: 48 8d 05 f9 ff ff ff lea -0x7(%rip),%rax # 1000 <foo>
1003: R_X86_64_PC32 foo-0x4

This all looks as expected. However, when using Clang, we end up with

$ clang -target x86_64-linux-gnu -o /tmp/a.elf -nostartfiles
-fuse-ld=lld -Wl,--relax,-q,--defsym,_start=0x0 /tmp/a.s
$ x86_64-linux-gnu-objdump -dr /tmp/a.elf
00000000000012c0 <foo>:
12c0: 48 8d 05 f9 ff ff ff lea -0x7(%rip),%rax # 12c0 <foo>
12c3: R_X86_64_REX_GOTPCRELX foo-0x4

So in this case, what --emit-relocs gives us is not what is actually
in the binary. We cannot just ignore these either, given that they are
treated differently depending on whether the symbol is a per-CPU
symbol or not - in the former case, we need to perform a fixup if the
relaxed reference is RIP relative, and in the latter case, if the
relaxed reference is absolute.

On top of that, --emit-relocs does not cover the GOT, so we'd still
need to process that from the code explicitly.

In general, relying on --emit-relocs is kind of dodgy, and I think
combining PIE linking with --emit-relocs is a bad idea.

> Another issue is that it requires the addition of the
> "-mrelax-relocations=no" option to support older compilers and linkers.

Why? The decompressor is now linked in PIE mode so we should be able
to drop that. Or do you need to add it somewhere else?

> R_X86_64_GOTPCRELX and R_X86_64_REX_GOTPCRELX relocations are supported
> in binutils 2.26 and later, but the minimum version required for the kernel
> is 2.25. This option disables relocation relaxation, which leaves the GOT
> non-empty. I also noticed this option in arch/x86/boot/compressed/Makefile,
> with the reason given in [2]. Without relocation relaxation, GOT
> references would increase the size of the GOT. Therefore, I do not want to
> use GOT references in assembly directly. However, I realized that the
> compiler could still generate GOT references in some cases, such as
> "fentry" calls and stack canary references.
>

The stack canary references are under discussion here [3]. I have also
sent a patch for kallsyms symbol references [4]. Beyond that, there
should be very few cases where GOT entries are emitted, so I don't
think this is fundamentally a problem.

I haven't run into the __fentry__ issue myself: do you think we should
fix this in the compiler?

> Regarding module loading, I agree that we should support GOT reference
> for the module itself. I will refactor it according to your suggestion.
>

Excellent, good luck with that.

However, you will still need to make a convincing case for why this is
all worth the trouble. Especially given that you disable the depth
tracking code, which I don't think should be mutually exclusive.

I am aware that this is rather tricky, and involves rewriting
RIP-relative per-CPU variable accesses, but it would be good to get a
discussion started on that topic, and figure out whether there is a
way forward there. Ignoring it is not going to help.


>
> [0] https://yhbt.net/lore/all/[email protected]
> [1] https://yhbt.net/lore/all/[email protected]
> [2] https://lore.kernel.org/all/[email protected]/
>

[3] https://github.com/llvm/llvm-project/issues/60116
[4] [email protected]

> > > Signed-off-by: Hou Wenlong <[email protected]>
> > > Cc: Thomas Garnier <[email protected]>
> > > Cc: Lai Jiangshan <[email protected]>
> > > Cc: Kees Cook <[email protected]>
> > > ---
> > > arch/x86/include/asm/sections.h | 5 +++++
> > > arch/x86/kernel/module.c | 27 +++++++++++++++++++++++++++
> > > 2 files changed, 32 insertions(+)
> > >
> > > diff --git a/arch/x86/include/asm/sections.h b/arch/x86/include/asm/sections.h
> > > index a6e8373a5170..dc1c2b08ec48 100644
> > > --- a/arch/x86/include/asm/sections.h
> > > +++ b/arch/x86/include/asm/sections.h
> > > @@ -12,6 +12,11 @@ extern char __end_rodata_aligned[];
> > >
> > > #if defined(CONFIG_X86_64)
> > > extern char __end_rodata_hpage_align[];
> > > +
> > > +#ifdef CONFIG_X86_PIE
> > > +extern char __start_got[], __end_got[];
> > > +#endif
> > > +
> > > #endif
> > >
> > > extern char __end_of_kernel_reserve[];
> > > diff --git a/arch/x86/kernel/module.c b/arch/x86/kernel/module.c
> > > index 84ad0e61ba6e..051f88e6884e 100644
> > > --- a/arch/x86/kernel/module.c
> > > +++ b/arch/x86/kernel/module.c
> > > @@ -129,6 +129,18 @@ int apply_relocate(Elf32_Shdr *sechdrs,
> > > return 0;
> > > }
> > > #else /*X86_64*/
> > > +#ifdef CONFIG_X86_PIE
> > > +static u64 find_got_kernel_entry(Elf64_Sym *sym, const Elf64_Rela *rela)
> > > +{
> > > + u64 *pos;
> > > +
> > > + for (pos = (u64 *)__start_got; pos < (u64 *)__end_got; pos++)
> > > + if (*pos == sym->st_value)
> > > + return (u64)pos + rela->r_addend;
> > > + return 0;
> > > +}
> > > +#endif
> > > +
> > > static int __write_relocate_add(Elf64_Shdr *sechdrs,
> > > const char *strtab,
> > > unsigned int symindex,
> > > @@ -171,6 +183,7 @@ static int __write_relocate_add(Elf64_Shdr *sechdrs,
> > > case R_X86_64_64:
> > > size = 8;
> > > break;
> > > +#ifndef CONFIG_X86_PIE
> > > case R_X86_64_32:
> > > if (val != *(u32 *)&val)
> > > goto overflow;
> > > @@ -181,6 +194,13 @@ static int __write_relocate_add(Elf64_Shdr *sechdrs,
> > > goto overflow;
> > > size = 4;
> > > break;
> > > +#else
> > > + case R_X86_64_GOTPCREL:
> > > + val = find_got_kernel_entry(sym, rel);
> > > + if (!val)
> > > + goto unexpected_got_reference;
> > > + fallthrough;
> > > +#endif
> > > case R_X86_64_PC32:
> > > case R_X86_64_PLT32:
> > > val -= (u64)loc;
> > > @@ -214,11 +234,18 @@ static int __write_relocate_add(Elf64_Shdr *sechdrs,
> > > }
> > > return 0;
> > >
> > > +#ifdef CONFIG_X86_PIE
> > > +unexpected_got_reference:
> > > + pr_err("Target got entry doesn't exist in kernel got, loc %p\n", loc);
> > > + return -ENOEXEC;
> > > +#else
> > > overflow:
> > > pr_err("overflow in relocation type %d val %Lx\n",
> > > (int)ELF64_R_TYPE(rel[i].r_info), val);
> > > pr_err("`%s' likely not compiled with -mcmodel=kernel\n",
> > > me->name);
> > > +#endif
> > > +
> > > return -ENOEXEC;
> > > }
> > >
> > > --
> > > 2.31.1
> > >

2023-05-08 11:56:20

by Hou Wenlong

[permalink] [raw]
Subject: Re: [PATCH RFC 31/43] x86/modules: Adapt module loading for PIE support

On Mon, May 08, 2023 at 05:16:34PM +0800, Ard Biesheuvel wrote:
> On Mon, 8 May 2023 at 10:38, Hou Wenlong <[email protected]> wrote:
> >
> > On Sat, Apr 29, 2023 at 03:29:32AM +0800, Ard Biesheuvel wrote:
> > > On Fri, 28 Apr 2023 at 10:53, Hou Wenlong <[email protected]> wrote:
> > > >
> > > > Adapt module loading to support PIE relocations. No GOT is generated for
> > > > a module; all GOT entries for GOT references in a module should exist in
> > > > the kernel GOT. Currently, there is only one usable GOT reference, for
> > > > __fentry__().
> > > >
> > >
> > > I don't think this is the right approach. We should permit GOTPCREL
> > > relocations properly, which means making them point to a location in
> > > memory that carries the absolute address of the symbol. There are
> > > several ways to go about that, but perhaps the simplest way is to make
> > > the symbol address in ksymtab a 64-bit absolute value (but retain the
> > > PC32 references for the symbol name and the symbol namespace name).
> > > That way, you can always resolve such GOTPCREL relocations by pointing
> > > it to the ksymtab entry. Another option would be to take inspiration
> > > from the PLT code we have on ARM and arm64 (and other architectures,
> > > surely) and to count the GOT based relocations, allocate some extra
> > > r/o module space for each, and allocate slots and populate them with
> > > the right value as you fix up the relocations.
> > >
> > > Then, many such relocations can be relaxed at module load time if the
> > > symbol is in range. IIUC, the module and kernel will still be inside
> > > the same 2G window even after widening the KASLR range to 512G, so
> > > most GOT loads can be converted into RIP relative LEA instructions.
> > >
> > > Note that this will also permit you to do things like
> > >
> > > #define PV_VCPU_PREEMPTED_ASM \
> > > "leaq __per_cpu_offset(%rip), %rax \n\t" \
> > > "movq (%rax,%rdi,8), %rax \n\t" \
> > > "addq steal_time@GOTPCREL(%rip), %rax \n\t" \
> > > "cmpb $0, " __stringify(KVM_STEAL_TIME_preempted) "(%rax) \n\t" \
> > > "setne %al\n\t"
> > >
> > > or
> > >
> > > +#ifdef CONFIG_X86_PIE
> > > + " pushq arch_rethook_trampoline@GOTPCREL(%rip)\n"
> > > +#else
> > > " pushq $arch_rethook_trampoline\n"
> > > +#endif
> > >
> > > instead of having these kludgy push/pop sequences to free up temp registers.
> > >
> > > (FYI I have looked into this PIE linking just a few weeks ago [0] so
> > > this is all rather fresh in my memory)
> > >
> > >
> > >
> > >
> > > [0] https://git.kernel.org/pub/scm/linux/kernel/git/ardb/linux.git/log/?h=x86-pie
> > >
> > >
> > Hi Ard,
> > Thanks for providing the link, it has been very helpful for me as I am
> > new to the topic of compilers.
>
> Happy to hear that.
>

Thanks for your prompt reply.

> > One key difference I noticed is that you
> > linked the kernel with "-pie" instead of "--emit-relocs". I also noticed
> > that Thomas' initial patchset[0] used "-pie", but in RFC v3 [1], it
> > switched to "--emit-relocs" in order to reduce dynamic relocation space
> > on mapped memory.
> >
>
> The problem with --emit-relocs is that the relocations emitted into
> the binary may get out of sync with the actual code after the linker
> has applied relocations.
>
> $ cat /tmp/a.s
> foo:movq foo@GOTPCREL(%rip), %rax
>
> $ x86_64-linux-gnu-gcc -c -o /tmp/a.o /tmp/a.s
> ard@gambale:~/linux$ x86_64-linux-gnu-objdump -dr /tmp/a.o
>
> /tmp/a.o: file format elf64-x86-64
>
>
> Disassembly of section .text:
>
> 0000000000000000 <foo>:
> 0: 48 8b 05 00 00 00 00 mov 0x0(%rip),%rax # 7 <foo+0x7>
> 3: R_X86_64_REX_GOTPCRELX foo-0x4
>
> $ x86_64-linux-gnu-gcc -c -o /tmp/a.o /tmp/a.s
> $ x86_64-linux-gnu-objdump -dr /tmp/a.o
> 0000000000000000 <foo>:
> 0: 48 8b 05 00 00 00 00 mov 0x0(%rip),%rax # 7 <foo+0x7>
> 3: R_X86_64_REX_GOTPCRELX foo-0x4
>
> $ x86_64-linux-gnu-gcc -o /tmp/a.elf -nostartfiles
> -Wl,-no-pie,-q,--defsym,_start=0x0 /tmp/a.s
> $ x86_64-linux-gnu-objdump -dr /tmp/a.elf
> 0000000000401000 <foo>:
> 401000: 48 c7 c0 00 10 40 00 mov $0x401000,%rax
> 401003: R_X86_64_32S foo
>
> $ x86_64-linux-gnu-gcc -o /tmp/a.elf -nostartfiles
> -Wl,-q,--defsym,_start=0x0 /tmp/a.s
> $ x86_64-linux-gnu-objdump -dr /tmp/a.elf
> 0000000000001000 <foo>:
> 1000: 48 8d 05 f9 ff ff ff lea -0x7(%rip),%rax # 1000 <foo>
> 1003: R_X86_64_PC32 foo-0x4
>
> This all looks as expected. However, when using Clang, we end up with
>
> $ clang -target x86_64-linux-gnu -o /tmp/a.elf -nostartfiles
> -fuse-ld=lld -Wl,--relax,-q,--defsym,_start=0x0 /tmp/a.s
> $ x86_64-linux-gnu-objdump -dr /tmp/a.elf
> 00000000000012c0 <foo>:
> 12c0: 48 8d 05 f9 ff ff ff lea -0x7(%rip),%rax # 12c0 <foo>
> 12c3: R_X86_64_REX_GOTPCRELX foo-0x4
>
> So in this case, what --emit-relocs gives us is not what is actually
> in the binary. We cannot just ignore these either, given that they are
> treated differently depending on whether the symbol is a per-CPU
> symbol or not - in the former case, we need to perform a fixup if the
> relaxed reference is RIP relative, and in the latter case, if the
> relaxed reference is absolute.
>
With symbols hidden and the compile-time address of the kernel image
kept in the top 2G, is it possible for the relaxed reference to be
absolute, even if I keep the percpu section zero-mapped for SMP? I
didn't see an absolute relaxed reference after dropping the
"-mrelax-relocations=no" option.

> On top of that, --emit-relocs does not cover the GOT, so we'd still
> need to process that from the code explicitly.
>
Yes, the relocs tool would process the GOT and generate
R_X86_64_GLOB_DAT relocations for GOT entries in patch 27:
https://lore.kernel.org/lkml/d25c7644249355785365914398bdba1ed2c52468.1682673543.git.houwenlong.hwl@antgroup.com

> In general, relying on --emit-relocs is kind of dodgy, and I think
> combining PIE linking with --emit-relocs is a bad idea.
>
> > Another issue is that it requires the addition of the
> > "-mrelax-relocations=no" option to support older compilers and linkers.
>
> Why? The decompressor is now linked in PIE mode so we should be able
> to drop that. Or do you need to add is somewhere else?
>
I tried binutils 2.25 (the minimum version); it couldn't recognize
R_X86_64_GOTPCRELX and R_X86_64_REX_GOTPCRELX.

> > R_X86_64_GOTPCRELX and R_X86_64_REX_GOTPCRELX relocations are supported
> > in binutils 2.26 and later, but the minimum version required for the kernel
> > is 2.25. This option disables relocation relaxation, which leaves the GOT
> > non-empty. I also noticed this option in arch/x86/boot/compressed/Makefile,
> > with the reason given in [2]. Without relocation relaxation, GOT
> > references would increase the size of the GOT. Therefore, I do not want to
> > use GOT references in assembly directly. However, I realized that the
> > compiler could still generate GOT references in some cases, such as
> > "fentry" calls and stack canary references.
> >
>
> The stack canary references are under discussion here [3]. I have also
> sent a patch for kallsyms symbol references [4]. Beyond that, there
> should be very few cases where GOT entries are emitted, so I don't
> think this is fundamentally a problem.
>
> I haven't run into the __fentry__ issue myself: do you think we should
> fix this in the compiler?
>
The issue with __fentry__ is that the compiler would generate a 6-byte
indirect call through the GOT with the "-fPIE" option. However, the original
ftrace nop patching assumes it is a 5-byte direct call. And the
"-mnop-mcount" option is not compatible with "-fPIE", so the
compiler wouldn't patch it as a nop.

So we should patch it with one 5-byte nop followed by one 1-byte nop;
this way, ftrace can handle the first 5 bytes as before. Also, I have
built a PIE kernel with relocation relaxation on GCC, and the linker
relaxes it as follows:
ffffffff810018f0 <do_one_initcall>:
ffffffff810018f0: f3 0f 1e fa endbr64
ffffffff810018f4: 67 e8 a6 d6 05 00 addr32 call ffffffff8105efa0 <__fentry__>
ffffffff810018f6: R_X86_64_PC32 __fentry__-0x4

It still requires a different nop patching for ftrace. I noticed the
"Optimize GOTPCRELX Relocations" chapter in the x86-64 psABI, which suggests
that the GOT indirect call can be relaxed to "call fentry; nop" or "nop;
call fentry"; it appears that the latter is chosen. If the linker could
generate the former, then no fixup would be necessary for ftrace with
PIE.
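
Concretely, the three 6-byte forms under discussion look like this
(illustrative byte sequences only; displacements are placeholders):

/* Original -fPIE codegen: indirect call through the GOT. */
unsigned char got_call[6]    = { 0xff, 0x15, 0, 0, 0, 0 }; /* call *__fentry__@GOTPCREL(%rip) */

/* What the linker relaxes it to today ("nop; call fentry"): an
 * address-size prefix pads the direct call, as in the objdump above. */
unsigned char addr32_call[6] = { 0x67, 0xe8, 0, 0, 0, 0 }; /* addr32 call __fentry__ */

/* The form that would need no ftrace fixup ("call fentry; nop"). */
unsigned char call_nop[6]    = { 0xe8, 0, 0, 0, 0, 0x90 }; /* call __fentry__; nop */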

> > Regarding module loading, I agree that we should support GOT reference
> > for the module itself. I will refactor it according to your suggestion.
> >
>
> Excellent, good luck with that.
>
> However, you will still need to make a convincing case for why this is
> all worth the trouble. Especially given that you disable the depth
> tracking code, which I don't think should be mutually exclusive.
>
Actually, I could do the relocation when applying the patching for the
depth tracking code. I'm not sure whether such a case is common.

> > I am aware that this is rather tricky, and involves rewriting
> RIP-relative per-CPU variable accesses, but it would be good to get a
> discussion started on that topic, and figure out whether there is a
> way forward there. Ignoring it is not going to help.
>
>
I see that your PIE linking chose to put the per-cpu section at a high
kernel image address, while I still keep it zero-mapped. However, both are
in the RIP-relative addressing range.

> >
> > [0] https://yhbt.net/lore/all/[email protected]
> > [1] https://yhbt.net/lore/all/[email protected]
> > [2] https://lore.kernel.org/all/[email protected]/
> >
>
> [3] https://github.com/llvm/llvm-project/issues/60116
> [4] [email protected]
>
> > > > Signed-off-by: Hou Wenlong <[email protected]>
> > > > Cc: Thomas Garnier <[email protected]>
> > > > Cc: Lai Jiangshan <[email protected]>
> > > > Cc: Kees Cook <[email protected]>
> > > > ---
> > > > arch/x86/include/asm/sections.h | 5 +++++
> > > > arch/x86/kernel/module.c | 27 +++++++++++++++++++++++++++
> > > > 2 files changed, 32 insertions(+)
> > > >
> > > > diff --git a/arch/x86/include/asm/sections.h b/arch/x86/include/asm/sections.h
> > > > index a6e8373a5170..dc1c2b08ec48 100644
> > > > --- a/arch/x86/include/asm/sections.h
> > > > +++ b/arch/x86/include/asm/sections.h
> > > > @@ -12,6 +12,11 @@ extern char __end_rodata_aligned[];
> > > >
> > > > #if defined(CONFIG_X86_64)
> > > > extern char __end_rodata_hpage_align[];
> > > > +
> > > > +#ifdef CONFIG_X86_PIE
> > > > +extern char __start_got[], __end_got[];
> > > > +#endif
> > > > +
> > > > #endif
> > > >
> > > > extern char __end_of_kernel_reserve[];
> > > > diff --git a/arch/x86/kernel/module.c b/arch/x86/kernel/module.c
> > > > index 84ad0e61ba6e..051f88e6884e 100644
> > > > --- a/arch/x86/kernel/module.c
> > > > +++ b/arch/x86/kernel/module.c
> > > > @@ -129,6 +129,18 @@ int apply_relocate(Elf32_Shdr *sechdrs,
> > > > return 0;
> > > > }
> > > > #else /*X86_64*/
> > > > +#ifdef CONFIG_X86_PIE
> > > > +static u64 find_got_kernel_entry(Elf64_Sym *sym, const Elf64_Rela *rela)
> > > > +{
> > > > + u64 *pos;
> > > > +
> > > > + for (pos = (u64 *)__start_got; pos < (u64 *)__end_got; pos++)
> > > > + if (*pos == sym->st_value)
> > > > + return (u64)pos + rela->r_addend;
> > > > + return 0;
> > > > +}
> > > > +#endif
> > > > +
> > > > static int __write_relocate_add(Elf64_Shdr *sechdrs,
> > > > const char *strtab,
> > > > unsigned int symindex,
> > > > @@ -171,6 +183,7 @@ static int __write_relocate_add(Elf64_Shdr *sechdrs,
> > > > case R_X86_64_64:
> > > > size = 8;
> > > > break;
> > > > +#ifndef CONFIG_X86_PIE
> > > > case R_X86_64_32:
> > > > if (val != *(u32 *)&val)
> > > > goto overflow;
> > > > @@ -181,6 +194,13 @@ static int __write_relocate_add(Elf64_Shdr *sechdrs,
> > > > goto overflow;
> > > > size = 4;
> > > > break;
> > > > +#else
> > > > + case R_X86_64_GOTPCREL:
> > > > + val = find_got_kernel_entry(sym, rel);
> > > > + if (!val)
> > > > + goto unexpected_got_reference;
> > > > + fallthrough;
> > > > +#endif
> > > > case R_X86_64_PC32:
> > > > case R_X86_64_PLT32:
> > > > val -= (u64)loc;
> > > > @@ -214,11 +234,18 @@ static int __write_relocate_add(Elf64_Shdr *sechdrs,
> > > > }
> > > > return 0;
> > > >
> > > > +#ifdef CONFIG_X86_PIE
> > > > +unexpected_got_reference:
> > > > + pr_err("Target got entry doesn't exist in kernel got, loc %p\n", loc);
> > > > + return -ENOEXEC;
> > > > +#else
> > > > overflow:
> > > > pr_err("overflow in relocation type %d val %Lx\n",
> > > > (int)ELF64_R_TYPE(rel[i].r_info), val);
> > > > pr_err("`%s' likely not compiled with -mcmodel=kernel\n",
> > > > me->name);
> > > > +#endif
> > > > +
> > > > return -ENOEXEC;
> > > > }
> > > >
> > > > --
> > > > 2.31.1
> > > >

2023-05-08 18:19:42

by Ard Biesheuvel

[permalink] [raw]
Subject: Re: [PATCH RFC 31/43] x86/modules: Adapt module loading for PIE support

On Mon, 8 May 2023 at 13:45, Hou Wenlong <[email protected]> wrote:
>
> On Mon, May 08, 2023 at 05:16:34PM +0800, Ard Biesheuvel wrote:
> > On Mon, 8 May 2023 at 10:38, Hou Wenlong <[email protected]> wrote:
> > >
> > > On Sat, Apr 29, 2023 at 03:29:32AM +0800, Ard Biesheuvel wrote:
> > > > On Fri, 28 Apr 2023 at 10:53, Hou Wenlong <[email protected]> wrote:
> > > > >
> > > > > Adapt module loading to support PIE relocations. No GOT is generated for
> > > > > a module; all GOT entries for GOT references in a module should exist in
> > > > > the kernel GOT. Currently, there is only one usable GOT reference, for
> > > > > __fentry__().
> > > > >
> > > >
> > > > I don't think this is the right approach. We should permit GOTPCREL
> > > > relocations properly, which means making them point to a location in
> > > > memory that carries the absolute address of the symbol. There are
> > > > several ways to go about that, but perhaps the simplest way is to make
> > > > the symbol address in ksymtab a 64-bit absolute value (but retain the
> > > > PC32 references for the symbol name and the symbol namespace name).
> > > > That way, you can always resolve such GOTPCREL relocations by pointing
> > > > it to the ksymtab entry. Another option would be to take inspiration
> > > > from the PLT code we have on ARM and arm64 (and other architectures,
> > > > surely) and to count the GOT based relocations, allocate some extra
> > > > r/o module space for each, and allocate slots and populate them with
> > > > the right value as you fix up the relocations.
> > > >
> > > > Then, many such relocations can be relaxed at module load time if the
> > > > symbol is in range. IIUC, the module and kernel will still be inside
> > > > the same 2G window even after widening the KASLR range to 512G, so
> > > > most GOT loads can be converted into RIP relative LEA instructions.
> > > >
> > > > Note that this will also permit you to do things like
> > > >
> > > > #define PV_VCPU_PREEMPTED_ASM \
> > > > "leaq __per_cpu_offset(%rip), %rax \n\t" \
> > > > "movq (%rax,%rdi,8), %rax \n\t" \
> > > > "addq steal_time@GOTPCREL(%rip), %rax \n\t" \
> > > > "cmpb $0, " __stringify(KVM_STEAL_TIME_preempted) "(%rax) \n\t" \
> > > > "setne %al\n\t"
> > > >
> > > > or
> > > >
> > > > +#ifdef CONFIG_X86_PIE
> > > > + " pushq arch_rethook_trampoline@GOTPCREL(%rip)\n"
> > > > +#else
> > > > " pushq $arch_rethook_trampoline\n"
> > > > +#endif
> > > >
> > > > instead of having these kludgy push/pop sequences to free up temp registers.
> > > >
> > > > (FYI I have looked into this PIE linking just a few weeks ago [0] so
> > > > this is all rather fresh in my memory)
> > > >
> > > >
> > > >
> > > >
> > > > [0] https://git.kernel.org/pub/scm/linux/kernel/git/ardb/linux.git/log/?h=x86-pie
> > > >
> > > >
> > > Hi Ard,
> > > Thanks for providing the link, it has been very helpful for me as I am
> > > new to the topic of compilers.
> >
> > Happy to hear that.
> >
>
> Thanks for your prompt reply.
>
> > > One key difference I noticed is that you
> > > linked the kernel with "-pie" instead of "--emit-relocs". I also noticed
> > > that Thomas' initial patchset[0] used "-pie", but in RFC v3 [1], it
> > > switched to "--emit-relocs" in order to reduce dynamic relocation space
> > > on mapped memory.
> > >
> >
> > The problem with --emit-relocs is that the relocations emitted into
> > the binary may get out of sync with the actual code after the linker
> > has applied relocations.
> >
> > $ cat /tmp/a.s
> > foo:movq foo@GOTPCREL(%rip), %rax
> >
> > $ x86_64-linux-gnu-gcc -c -o /tmp/a.o /tmp/a.s
> > ard@gambale:~/linux$ x86_64-linux-gnu-objdump -dr /tmp/a.o
> >
> > /tmp/a.o: file format elf64-x86-64
> >
> >
> > Disassembly of section .text:
> >
> > 0000000000000000 <foo>:
> > 0: 48 8b 05 00 00 00 00 mov 0x0(%rip),%rax # 7 <foo+0x7>
> > 3: R_X86_64_REX_GOTPCRELX foo-0x4
> >
> > $ x86_64-linux-gnu-gcc -c -o /tmp/a.o /tmp/a.s
> > $ x86_64-linux-gnu-objdump -dr /tmp/a.o
> > 0000000000000000 <foo>:
> > 0: 48 8b 05 00 00 00 00 mov 0x0(%rip),%rax # 7 <foo+0x7>
> > 3: R_X86_64_REX_GOTPCRELX foo-0x4
> >
> > $ x86_64-linux-gnu-gcc -o /tmp/a.elf -nostartfiles
> > -Wl,-no-pie,-q,--defsym,_start=0x0 /tmp/a.s
> > $ x86_64-linux-gnu-objdump -dr /tmp/a.elf
> > 0000000000401000 <foo>:
> > 401000: 48 c7 c0 00 10 40 00 mov $0x401000,%rax
> > 401003: R_X86_64_32S foo
> >
> > $ x86_64-linux-gnu-gcc -o /tmp/a.elf -nostartfiles
> > -Wl,-q,--defsym,_start=0x0 /tmp/a.s
> > $ x86_64-linux-gnu-objdump -dr /tmp/a.elf
> > 0000000000001000 <foo>:
> > 1000: 48 8d 05 f9 ff ff ff lea -0x7(%rip),%rax # 1000 <foo>
> > 1003: R_X86_64_PC32 foo-0x4
> >
> > This all looks as expected. However, when using Clang, we end up with
> >
> > $ clang -target x86_64-linux-gnu -o /tmp/a.elf -nostartfiles
> > -fuse-ld=lld -Wl,--relax,-q,--defsym,_start=0x0 /tmp/a.s
> > $ x86_64-linux-gnu-objdump -dr /tmp/a.elf
> > 00000000000012c0 <foo>:
> > 12c0: 48 8d 05 f9 ff ff ff lea -0x7(%rip),%rax # 12c0 <foo>
> > 12c3: R_X86_64_REX_GOTPCRELX foo-0x4
> >
> > So in this case, what --emit-relocs gives us is not what is actually
> > in the binary. We cannot just ignore these either, given that they are
> > treated differently depending on whether the symbol is a per-CPU
> > symbol or not - in the former case, we need to perform a fixup if the
> > relaxed reference is RIP relative, and in the latter case, if the
> > relaxed reference is absolute.
> >
> With symbols hidden and the compile-time address of the kernel image
> kept in the top 2G, is it possible for the relaxed reference to be
> absolute, even if I keep the percpu section zero-mapped for SMP? I
> didn't see an absolute relaxed reference after dropping the
> "-mrelax-relocations=no" option.
>

If you link in PIE mode, you should never see absolute references
after relaxation.

> > On top of that, --emit-relocs does not cover the GOT, so we'd still
> > need to process that from the code explicitly.
> >
> Yes, the relocs tool would process the GOT and generate
> R_X86_64_GLOB_DAT relocations for GOT entries in patch 27:
> https://lore.kernel.org/lkml/d25c7644249355785365914398bdba1ed2c52468.1682673543.git.houwenlong.hwl@antgroup.com
>

Yes, something like that is needed. I'd lean towards generating the
reloc data directly instead of creating an artificial RELA section
with GLOB_DAT relocations, but that is a minor detail.

> > In general, relying on --emit-relocs is kind of dodgy, and I think
> > combining PIE linking with --emit-relocs is a bad idea.
> >
> > > Another issue is that it requires the addition of the
> > > "-mrelax-relocations=no" option to support older compilers and linkers.
> >
> > Why? The decompressor is now linked in PIE mode so we should be able
> > to drop that. Or do you need to add it somewhere else?
> >
> I tried binutils 2.25 (the minimum version); it couldn't recognize
> R_X86_64_GOTPCRELX and R_X86_64_REX_GOTPCRELX.
>

I'm not sure that matters. If the assembler accepts @GOTPCREL
notation, it should generate the relocations that the linker can
understand. If the toolchain is not internally consistent in this
regard, I don't think it is our problem.

This might mean that we end up with more residual GOT entries than
with a more recent toolchain, but I don't think that is a big deal.

> > > R_X86_64_GOTPCRELX and R_X86_64_REX_GOTPCRELX relocations are supported
> > > in binutils 2.26 and later, but the minimum version required for the kernel
> > > is 2.25. This option disables relocation relaxation, which leaves the GOT
> > > non-empty. I also noticed this option in arch/x86/boot/compressed/Makefile,
> > > with the reason given in [2]. Without relocation relaxation, GOT
> > > references would increase the size of the GOT. Therefore, I do not want to
> > > use GOT references in assembly directly. However, I realized that the
> > > compiler could still generate GOT references in some cases, such as
> > > "fentry" calls and stack canary references.
> > >
> >
> > The stack canary references are under discussion here [3]. I have also
> > sent a patch for kallsyms symbol references [4]. Beyond that, there
> > should be very few cases where GOT entries are emitted, so I don't
> > think this is fundamentally a problem.
> >
> > I haven't run into the __fentry__ issue myself: do you think we should
> > fix this in the compiler?
> >
> The issue with __fentry__ is that the compiler would generate a 6-byte
> indirect call through the GOT with the "-fPIE" option. However, the original
> ftrace nop patching assumes it is a 5-byte direct call. And the
> "-mnop-mcount" option is not compatible with "-fPIE", so the
> compiler wouldn't patch it as a nop.
>
> So we should patch it with one 5-byte nop followed by one 1-byte nop;
> this way, ftrace can handle the first 5 bytes as before. Also, I have
> built a PIE kernel with relocation relaxation on GCC, and the linker
> relaxes it as follows:
> ffffffff810018f0 <do_one_initcall>:
> ffffffff810018f0: f3 0f 1e fa endbr64
> ffffffff810018f4: 67 e8 a6 d6 05 00 addr32 call ffffffff8105efa0 <__fentry__>
> ffffffff810018f6: R_X86_64_PC32 __fentry__-0x4
>

But if fentry is a function symbol, I would not expect the codegen to
be different at all. Are you using -fno-plt?

> It still requires a different nop patching for ftrace. I noticed the
> "Optimize GOTPCRELX Relocations" chapter in the x86-64 psABI, which suggests
> that the GOT indirect call can be relaxed to "call fentry; nop" or "nop;
> call fentry"; it appears that the latter is chosen. If the linker could
> generate the former, then no fixup would be necessary for ftrace with
> PIE.
>

Right. I think this may be a result of __fentry__ not being subject to
the same rules wrt visibility etc, similar to __stack_chk_guard. These
are arguably compiler issues that could qualify as bugs, given that
these symbol references don't behave like ordinary symbol references.

> > > Regarding module loading, I agree that we should support GOT reference
> > > for the module itself. I will refactor it according to your suggestion.
> > >
> >
> > Excellent, good luck with that.
> >
> > However, you will still need to make a convincing case for why this is
> > all worth the trouble. Especially given that you disable the depth
> > tracking code, which I don't think should be mutually exclusive.
> >
> Actually, I could do the relocation when applying the patching for the
> depth tracking code. I'm not sure whether such a case is common.
>

I think that alternatives patching in general would need to support
RIP relative references in the alternatives. The depth tracking
template is a bit different in this regard, and could be fixed more
easily, I think.

> > I am aware that this is rather tricky, and involves rewriting
> > RIP-relative per-CPU variable accesses, but it would be good to get a
> > discussion started on that topic, and figure out whether there is a
> > way forward there. Ignoring it is not going to help.
> >
> >
> I see that your PIE linking chose to put the per-cpu section at a high
> kernel image address, while I still keep it zero-mapped. However, both are
> in the RIP-relative addressing range.
>

Pure PIE linking cannot support the zero mapping - it can only work
with --emit-relocs, which I was trying to avoid.

2023-05-09 09:56:53

by Ard Biesheuvel

[permalink] [raw]
Subject: Re: [PATCH RFC 31/43] x86/modules: Adapt module loading for PIE support

On Tue, 9 May 2023 at 11:42, Hou Wenlong <[email protected]> wrote:
>
> On Tue, May 09, 2023 at 01:47:53AM +0800, Ard Biesheuvel wrote:
> > On Mon, 8 May 2023 at 13:45, Hou Wenlong <[email protected]> wrote:
> > >
> > > On Mon, May 08, 2023 at 05:16:34PM +0800, Ard Biesheuvel wrote:
> > > > On Mon, 8 May 2023 at 10:38, Hou Wenlong <[email protected]> wrote:
> > > > >
> > > > > On Sat, Apr 29, 2023 at 03:29:32AM +0800, Ard Biesheuvel wrote:
> > > > > > On Fri, 28 Apr 2023 at 10:53, Hou Wenlong <[email protected]> wrote:
...
> > > > > R_X86_64_GOTPCRELX and R_X86_64_REX_GOTPCRELX relocations are supported
> > > > > in binutils 2.26 and later, but the minimum version required for the kernel
> > > > > is 2.25. This option disables relocation relaxation, which leaves the GOT
> > > > > non-empty. I also noticed this option in arch/x86/boot/compressed/Makefile,
> > > > > with the reason given in [2]. Without relocation relaxation, GOT
> > > > > references would increase the size of the GOT. Therefore, I do not want to
> > > > > use GOT references in assembly directly. However, I realized that the
> > > > > compiler could still generate GOT references in some cases, such as
> > > > > "fentry" calls and stack canary references.
> > > > >
> > > >
> > > > The stack canary references are under discussion here [3]. I have also
> > > > sent a patch for kallsyms symbol references [4]. Beyond that, there
> > > > should be very few cases where GOT entries are emitted, so I don't
> > > > think this is fundamentally a problem.
> > > >
> > > > I haven't run into the __fentry__ issue myself: do you think we should
> > > > fix this in the compiler?
> > > >
> > > The issue with __fentry__ is that the compiler would generate a 6-byte
> > > indirect call through the GOT with the "-fPIE" option. However, the original
> > > ftrace nop patching assumes it is a 5-byte direct call. And the
> > > "-mnop-mcount" option is not compatible with "-fPIE", so the
> > > compiler wouldn't patch it as a nop.
> > >
> > > So we should patch it with one 5-byte nop followed by one 1-byte nop;
> > > this way, ftrace can handle the first 5 bytes as before. Also, I have
> > > built a PIE kernel with relocation relaxation on GCC, and the linker
> > > relaxes it as follows:
> > > ffffffff810018f0 <do_one_initcall>:
> > > ffffffff810018f0: f3 0f 1e fa endbr64
> > > ffffffff810018f4: 67 e8 a6 d6 05 00 addr32 call ffffffff8105efa0 <__fentry__>
> > > ffffffff810018f6: R_X86_64_PC32 __fentry__-0x4
> > >
> >
> > But if fentry is a function symbol, I would not expect the codegen to
> > be different at all. Are you using -fno-plt?
> >
> No, even with -fno-plt added, the compiler still generates a GOT
> reference for fentry. Therefore, the problem may be visibility, as you
> said.
>

Yeah, I spotted this issue in GCC - I just sent them a patch this morning.

> > > It still requires a different nop patching for ftrace. I noticed the
> > > "Optimize GOTPCRELX Relocations" chapter in the x86-64 psABI, which suggests
> > > that the GOT indirect call can be relaxed to "call fentry; nop" or "nop;
> > > call fentry"; it appears that the latter is chosen. If the linker could
> > > generate the former, then no fixup would be necessary for ftrace with
> > > PIE.
> > >
> >
> > Right. I think this may be a result of __fentry__ not being subject to
> > the same rules wrt visibility etc, similar to __stack_chk_guard. These
> > are arguably compiler issues that could qualify as bugs, given that
> > these symbol references don't behave like ordinary symbol references.
> >
> > > > > Regarding module loading, I agree that we should support GOT reference
> > > > > for the module itself. I will refactor it according to your suggestion.
> > > > >
> > > >
> > > > Excellent, good luck with that.
> > > >
> > > > However, you will still need to make a convincing case for why this is
> > > > all worth the trouble. Especially given that you disable the depth
> > > > tracking code, which I don't think should be mutually exclusive.
> > > >
> > > Actually, I could do the relocation when applying the patching for the
> > > depth tracking code. I'm not sure whether such a case is common.
> > >
> >
> > I think that alternatives patching in general would need to support
> > RIP relative references in the alternatives. The depth tracking
> > template is a bit different in this regard, and could be fixed more
> > easily, I think.
> >
> > > > I am aware that this is rather tricky, and involves rewriting
> > > > RIP-relative per-CPU variable accesses, but it would be good to get a
> > > > discussion started on that topic, and figure out whether there is a
> > > > way forward there. Ignoring it is not going to help.
> > > >
> > > >
> > > I see that your PIE linking chose to put the per-cpu section at a high
> > > kernel image address, while I still keep it as zero mapping. However,
> > > both are within the RIP-relative addressing range.
> > >
> >
> > Pure PIE linking cannot support the zero mapping - it can only work
> > with --emit-relocs, which I was trying to avoid.
> Sorry, why doesn't PIE linking support zero mapping? I noticed that the
> commit message for your PIE linking stated, "if we randomize the
> kernel's VA by increasing it by X bytes, every RIP-relative per-CPU
> reference needs to be decreased by the same amount in order for the
> produced offset to remain correct." As a result, I decided to decrease
> the GS base and not relocate the RIP-relative per-CPU references in the
> relocs tool. Consequently, all RIP-relative references, regardless of
> whether they are per-CPU variables or not, require no relocation.
>

Interesting. Does that work as expected with dynamically allocated
per-CPU variables?

> Furthermore, all symbols are hidden, which implies that per-CPU
> references will not generate a GOT reference and will be relaxed to
> absolute references due to the zero mapping. However, __stack_chk_guard
> on Clang always generates a GOT reference, and I didn't see it being
> relaxed to an absolute reference by LLVM.
>

Yeah, we should fix that.

2023-05-09 10:03:39

by Hou Wenlong

[permalink] [raw]
Subject: Re: [PATCH RFC 31/43] x86/modules: Adapt module loading for PIE support

On Tue, May 09, 2023 at 01:47:53AM +0800, Ard Biesheuvel wrote:
> On Mon, 8 May 2023 at 13:45, Hou Wenlong <[email protected]> wrote:
> >
> > On Mon, May 08, 2023 at 05:16:34PM +0800, Ard Biesheuvel wrote:
> > > On Mon, 8 May 2023 at 10:38, Hou Wenlong <[email protected]> wrote:
> > > >
> > > > On Sat, Apr 29, 2023 at 03:29:32AM +0800, Ard Biesheuvel wrote:
> > > > > On Fri, 28 Apr 2023 at 10:53, Hou Wenlong <[email protected]> wrote:
> > > > > >
> > > > > > Adapt module loading to support PIE relocations. No GOT is generated
> > > > > > for the module; all GOT entries for GOT references in a module should
> > > > > > already exist in the kernel GOT. Currently, there is only one usable
> > > > > > GOT reference, for __fentry__().
> > > > > >
> > > > >
> > > > > I don't think this is the right approach. We should permit GOTPCREL
> > > > > relocations properly, which means making them point to a location in
> > > > > memory that carries the absolute address of the symbol. There are
> > > > > several ways to go about that, but perhaps the simplest way is to make
> > > > > the symbol address in ksymtab a 64-bit absolute value (but retain the
> > > > > PC32 references for the symbol name and the symbol namespace name).
> > > > > That way, you can always resolve such GOTPCREL relocations by pointing
> > > > > it to the ksymtab entry. Another option would be to take inspiration
> > > > > from the PLT code we have on ARM and arm64 (and other architectures,
> > > > > surely) and to count the GOT based relocations, allocate some extra
> > > > > r/o module space for each, and allocate slots and populate them with
> > > > > the right value as you fix up the relocations.
> > > > >
> > > > > Then, many such relocations can be relaxed at module load time if the
> > > > > symbol is in range. IIUC, the module and kernel will still be inside
> > > > > the same 2G window even after widening the KASLR range to 512G, so
> > > > > most GOT loads can be converted into RIP relative LEA instructions.
> > > > >
> > > > > Note that this will also permit you to do things like
> > > > >
> > > > > #define PV_VCPU_PREEMPTED_ASM \
> > > > > "leaq __per_cpu_offset(%rip), %rax \n\t" \
> > > > > "movq (%rax,%rdi,8), %rax \n\t" \
> > > > > "addq steal_time@GOTPCREL(%rip), %rax \n\t" \
> > > > > "cmpb $0, " __stringify(KVM_STEAL_TIME_preempted) "(%rax) \n\t" \
> > > > > "setne %al\n\t"
> > > > >
> > > > > or
> > > > >
> > > > > +#ifdef CONFIG_X86_PIE
> > > > > + " pushq arch_rethook_trampoline@GOTPCREL(%rip)\n"
> > > > > +#else
> > > > > " pushq $arch_rethook_trampoline\n"
> > > > > +#endif
> > > > >
> > > > > instead of having these kludgy push/pop sequences to free up temp registers.
> > > > >
> > > > > (FYI I have looked into this PIE linking just a few weeks ago [0] so
> > > > > this is all rather fresh in my memory)
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > [0] https://git.kernel.org/pub/scm/linux/kernel/git/ardb/linux.git/log/?h=x86-pie
> > > > >
> > > > >
> > > > Hi Ard,
> > > > Thanks for providing the link, it has been very helpful for me as I am
> > > > new to the topic of compilers.
> > >
> > > Happy to hear that.
> > >
> >
> > Thanks for your prompt reply.
> >
> > > > One key difference I noticed is that you
> > > > linked the kernel with "-pie" instead of "--emit-reloc". I also noticed
> > > > that Thomas' initial patchset[0] used "-pie", but in RFC v3 [1], it
> > > > switched to "--emit-reloc" in order to reduce dynamic relocation space
> > > > on mapped memory.
> > > >
> > >
> > > The problem with --emit-relocs is that the relocations emitted into
> > > the binary may get out of sync with the actual code after the linker
> > > has applied relocations.
> > >
> > > $ cat /tmp/a.s
> > > foo:movq foo@GOTPCREL(%rip), %rax
> > >
> > > $ x86_64-linux-gnu-gcc -c -o /tmp/a.o /tmp/a.s
> > > ard@gambale:~/linux$ x86_64-linux-gnu-objdump -dr /tmp/a.o
> > >
> > > /tmp/a.o: file format elf64-x86-64
> > >
> > >
> > > Disassembly of section .text:
> > >
> > > 0000000000000000 <foo>:
> > > 0: 48 8b 05 00 00 00 00 mov 0x0(%rip),%rax # 7 <foo+0x7>
> > > 3: R_X86_64_REX_GOTPCRELX foo-0x4
> > >
> > > $ x86_64-linux-gnu-gcc -c -o /tmp/a.o /tmp/a.s
> > > $ x86_64-linux-gnu-objdump -dr /tmp/a.o
> > > 0000000000000000 <foo>:
> > > 0: 48 8b 05 00 00 00 00 mov 0x0(%rip),%rax # 7 <foo+0x7>
> > > 3: R_X86_64_REX_GOTPCRELX foo-0x4
> > >
> > > $ x86_64-linux-gnu-gcc -o /tmp/a.elf -nostartfiles
> > > -Wl,-no-pie,-q,--defsym,_start=0x0 /tmp/a.s
> > > $ x86_64-linux-gnu-objdump -dr /tmp/a.elf
> > > 0000000000401000 <foo>:
> > > 401000: 48 c7 c0 00 10 40 00 mov $0x401000,%rax
> > > 401003: R_X86_64_32S foo
> > >
> > > $ x86_64-linux-gnu-gcc -o /tmp/a.elf -nostartfiles
> > > -Wl,-q,--defsym,_start=0x0 /tmp/a.s
> > > $ x86_64-linux-gnu-objdump -dr /tmp/a.elf
> > > 0000000000001000 <foo>:
> > > 1000: 48 8d 05 f9 ff ff ff lea -0x7(%rip),%rax # 1000 <foo>
> > > 1003: R_X86_64_PC32 foo-0x4
> > >
> > > This all looks as expected. However, when using Clang, we end up with
> > >
> > > $ clang -target x86_64-linux-gnu -o /tmp/a.elf -nostartfiles
> > > -fuse-ld=lld -Wl,--relax,-q,--defsym,_start=0x0 /tmp/a.s
> > > $ x86_64-linux-gnu-objdump -dr /tmp/a.elf
> > > 00000000000012c0 <foo>:
> > > 12c0: 48 8d 05 f9 ff ff ff lea -0x7(%rip),%rax # 12c0 <foo>
> > > 12c3: R_X86_64_REX_GOTPCRELX foo-0x4
> > >
> > > So in this case, what --emit-relocs gives us is not what is actually
> > > in the binary. We cannot just ignore these either, given that they are
> > > treated differently depending on whether the symbol is a per-CPU
> > > symbol or not - in the former case, we need to perform a fixup if the
> > > relaxed reference is RIP relative, and in the latter case, if the
> > > relaxed reference is absolute.
> > >
> > With symbols hidden and the compile-time address of the kernel image
> > kept in the top 2G, is it possible for the relaxed reference to be
> > absolute, even if I keep the percpu section zero-mapped for SMP? I
> > didn't see an absolute relaxed reference after dropping the
> > "-mrelax-relocations=no" option.
> >
>
> If you link in PIE mode, you should never see absolute references
> after relaxation.
>
> > > On top of that, --emit-relocs does not cover the GOT, so we'd still
> > > need to process that from the code explicitly.
> > >
> > Yes, so the relocs tool would process the GOT, and generate
> > R_X86_64_GLOB_DAT relocations for GOT entries in patch 27:
> > https://lore.kernel.org/lkml/d25c7644249355785365914398bdba1ed2c52468.1682673543.git.houwenlong.hwl@antgroup.com
> >
>
> Yes, something like that is needed. I'd lean towards generating the
> reloc data directly instead of creating an artificial RELA section
> with GLOB_DAT relocations, but that is a minor detail.
>
> > > In general, relying on --emit-relocs is kind of dodgy, and I think
> > > combining PIE linking with --emit-relocs is a bad idea.
> > >
> > > > Another issue is that it requires the addition of the
> > > > "-mrelax-relocations=no" option to support older compilers and linkers.
> > >
> > > Why? The decompressor is now linked in PIE mode so we should be able
> > > to drop that. Or do you need to add it somewhere else?
> > >
> > I tried to use binutils 2.25 (the minimum supported version), and it
> > couldn't recognize R_X86_64_GOTPCRELX and R_X86_64_REX_GOTPCRELX.
> >
>
> I'm not sure that matters. If the assembler accepts @GOTPCREL
> notation, it should generate the relocations that the linker can
> understand. If the toolchain is not internally consistent in this
> regard, I don't think it is our problem.
>
I get it. Thanks.

> This might mean that we end up with more residual GOT entries than
> with a more recent toolchain, but I don't think that is a big deal.
>
> > > > R_X86_64_GOTPCRELX and R_X86_64_REX_GOTPCRELX relocations are supported
> > > > in binutils 2.26 and later, but the minimum version required for the
> > > > kernel is 2.25. This option disables relocation relaxation, which
> > > > leaves the GOT non-empty. I also noticed this option in
> > > > arch/x86/boot/compressed/Makefile, with the reason given in [2].
> > > > Without relocation relaxation, GOT references would increase the size
> > > > of the GOT. Therefore, I do not want to use GOT references in assembly
> > > > directly. However, I realized that the compiler can still generate GOT
> > > > references in some cases, such as "fentry" calls and stack canary
> > > > references.
> > > >
> > >
> > > The stack canary references are under discussion here [3]. I have also
> > > sent a patch for kallsyms symbol references [4]. Beyond that, there
> > > should be very few cases where GOT entries are emitted, so I don't
> > > think this is fundamentally a problem.
> > >
> > > I haven't run into the __fentry__ issue myself: do you think we should
> > > fix this in the compiler?
> > >
> > The issue with __fentry__ is that the compiler generates a 6-byte
> > indirect call through the GOT with the "-fPIE" option. However, the
> > original ftrace nop patching assumes it is a 5-byte direct call. And
> > the "-mnop-mcount" option is not compatible with "-fPIE", so the
> > compiler wouldn't patch it as a nop.
> >
> > So we should patch it with one 5-byte nop followed by one 1-byte nop.
> > This way, ftrace can handle the previous 5 bytes as before. Also, I
> > have built a PIE kernel with relocation relaxation on GCC, and the
> > linker relaxes it as follows:
> > ffffffff810018f0 <do_one_initcall>:
> > ffffffff810018f0: f3 0f 1e fa endbr64
> > ffffffff810018f4: 67 e8 a6 d6 05 00 addr32 call ffffffff8105efa0 <__fentry__>
> > ffffffff810018f6: R_X86_64_PC32 __fentry__-0x4
> >
>
> But if fentry is a function symbol, I would not expect the codegen to
> be different at all. Are you using -fno-plt?
>
No, even with -fno-plt added, the compiler still generates a GOT
reference for fentry. Therefore, the problem may be visibility, as you
said.

> > It still requires a different nop patching for ftrace. I noticed the
> > "Optimize GOTPCRELX Relocations" chapter in the x86-64 psABI, which
> > suggests that the GOT indirect call can be relaxed to either "call
> > fentry; nop" or "nop; call fentry"; it appears that the latter is
> > chosen. If the linker could generate the former, then no fixup would
> > be necessary for ftrace with PIE.
> >
>
> Right. I think this may be a result of __fentry__ not being subject to
> the same rules wrt visibility etc, similar to __stack_chk_guard. These
> are arguably compiler issues that could qualify as bugs, given that
> these symbol references don't behave like ordinary symbol references.
>
> > > > Regarding module loading, I agree that we should support GOT reference
> > > > for the module itself. I will refactor it according to your suggestion.
> > > >
> > >
> > > Excellent, good luck with that.
> > >
> > > However, you will still need to make a convincing case for why this is
> > > all worth the trouble. Especially given that you disable the depth
> > > tracking code, which I don't think should be mutually exclusive.
> > >
> > Actually, I could do the relocation for it when applying the patching
> > for the depth tracking code. I'm not sure whether such cases are common.
> >
>
> I think that alternatives patching in general would need to support
> RIP relative references in the alternatives. The depth tracking
> template is a bit different in this regard, and could be fixed more
> easily, I think.
>
> > > I am aware that this is rather tricky, and involves rewriting
> > > RIP-relative per-CPU variable accesses, but it would be good to get a
> > > discussion started on that topic, and figure out whether there is a
> > > way forward there. Ignoring it is not going to help.
> > >
> > >
> > I see that your PIE linking chose to put the per-cpu section at a high
> > kernel image address, while I still keep it as zero mapping. However,
> > both are within the RIP-relative addressing range.
> >
>
> Pure PIE linking cannot support the zero mapping - it can only work
> with --emit-relocs, which I was trying to avoid.
Sorry, why doesn't PIE linking support zero mapping? I noticed that the
commit message for your PIE linking stated, "if we randomize the
kernel's VA by increasing it by X bytes, every RIP-relative per-CPU
reference needs to be decreased by the same amount in order for the
produced offset to remain correct." As a result, I decided to decrease
the GS base and not relocate the RIP-relative per-CPU references in the
relocs tool. Consequently, all RIP-relative references, regardless of
whether they are per-CPU variables or not, require no relocation.
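
As a rough sketch of that arithmetic (the addresses below are made-up
constants, not the real kernel layout): a RIP-relative per-CPU access
computes gs_base + (rip + disp); sliding the kernel by X raises
(rip + disp) by X, and lowering the GS base by X cancels it out:

#include <assert.h>
#include <stdint.h>

int main(void)
{
	const uint64_t pcpu_area = 0xffff888000100000ULL; /* per-CPU storage */
	const uint64_t var_link  = 0xffffffff81234000ULL; /* link-time VA */
	const int64_t  disp      = 0x1000;  /* displacement baked into the insn */
	const uint64_t rip_link  = var_link - disp;
	/* GS base chosen so the link-time sum lands in the per-CPU area
	 * (unsigned wraparound is fine, as with real GS bases). */
	const uint64_t gs_link   = pcpu_area - var_link;
	uint64_t x;

	for (x = 0; x < 4; x++) {
		uint64_t slide   = x << 30;              /* KASLR offset X */
		uint64_t rip_run = rip_link + slide;     /* code moved up by X */
		uint64_t gs_run  = gs_link - slide;      /* GS base decreased by X */

		/* Address computed by "mov %gs:var(%rip)". */
		assert(gs_run + rip_run + disp == pcpu_area);
	}
	return 0;
}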

Furthermore, all symbols are hidden, which implies that per-CPU
references will not generate a GOT reference and will be relaxed to
absolute references due to the zero mapping. However, __stack_chk_guard
on Clang always generates a GOT reference, and I didn't see it being
relaxed to an absolute reference by LLVM.

Thanks!

2023-05-09 12:43:56

by Hou Wenlong

[permalink] [raw]
Subject: Re: [PATCH RFC 31/43] x86/modules: Adapt module loading for PIE support

On Tue, May 09, 2023 at 05:52:31PM +0800, Ard Biesheuvel wrote:
> On Tue, 9 May 2023 at 11:42, Hou Wenlong <[email protected]> wrote:
> >
> > On Tue, May 09, 2023 at 01:47:53AM +0800, Ard Biesheuvel wrote:
> > > On Mon, 8 May 2023 at 13:45, Hou Wenlong <[email protected]> wrote:
> > > >
> > > > On Mon, May 08, 2023 at 05:16:34PM +0800, Ard Biesheuvel wrote:
> > > > > On Mon, 8 May 2023 at 10:38, Hou Wenlong <[email protected]> wrote:
> > > > > >
> > > > > > On Sat, Apr 29, 2023 at 03:29:32AM +0800, Ard Biesheuvel wrote:
> > > > > > > On Fri, 28 Apr 2023 at 10:53, Hou Wenlong <[email protected]> wrote:
> ...
> > > > > > R_X86_64_GOTPCRELX and R_X86_64_REX_GOTPCRELX relocations are
> > > > > > supported in binutils 2.26 and later, but the minimum version
> > > > > > required for the kernel is 2.25. This option disables relocation
> > > > > > relaxation, which leaves the GOT non-empty. I also noticed this
> > > > > > option in arch/x86/boot/compressed/Makefile, with the reason given
> > > > > > in [2]. Without relocation relaxation, GOT references would
> > > > > > increase the size of the GOT. Therefore, I do not want to use GOT
> > > > > > references in assembly directly. However, I realized that the
> > > > > > compiler can still generate GOT references in some cases, such as
> > > > > > "fentry" calls and stack canary references.
> > > > > >
> > > > >
> > > > > The stack canary references are under discussion here [3]. I have also
> > > > > sent a patch for kallsyms symbol references [4]. Beyond that, there
> > > > > should be very few cases where GOT entries are emitted, so I don't
> > > > > think this is fundamentally a problem.
> > > > >
> > > > > I haven't run into the __fentry__ issue myself: do you think we should
> > > > > fix this in the compiler?
> > > > >
> > > > The issue with __fentry__ is that the compiler generates a 6-byte
> > > > indirect call through the GOT with the "-fPIE" option. However, the
> > > > original ftrace nop patching assumes it is a 5-byte direct call. And
> > > > the "-mnop-mcount" option is not compatible with "-fPIE", so the
> > > > compiler wouldn't patch it as a nop.
> > > >
> > > > So we should patch it with one 5-byte nop followed by one 1-byte nop.
> > > > This way, ftrace can handle the previous 5 bytes as before. Also, I
> > > > have built a PIE kernel with relocation relaxation on GCC, and the
> > > > linker relaxes it as follows:
> > > > ffffffff810018f0 <do_one_initcall>:
> > > > ffffffff810018f0: f3 0f 1e fa endbr64
> > > > ffffffff810018f4: 67 e8 a6 d6 05 00 addr32 call ffffffff8105efa0 <__fentry__>
> > > > ffffffff810018f6: R_X86_64_PC32 __fentry__-0x4
> > > >
> > >
> > > But if fentry is a function symbol, I would not expect the codegen to
> > > be different at all. Are you using -fno-plt?
> > >
> > No, even with -fno-plt added, the compiler still generates a GOT
> > reference for fentry. Therefore, the problem may be visibility, as you
> > said.
> >
>
> Yeah, I spotted this issue in GCC - I just sent them a patch this morning.
>
> > > > It still requires a different nop patching for ftrace. I noticed the
> > > > "Optimize GOTPCRELX Relocations" chapter in the x86-64 psABI, which
> > > > suggests that the GOT indirect call can be relaxed to either "call
> > > > fentry; nop" or "nop; call fentry"; it appears that the latter is
> > > > chosen. If the linker could generate the former, then no fixup would
> > > > be necessary for ftrace with PIE.
> > > >
> > >
> > > Right. I think this may be a result of __fentry__ not being subject to
> > > the same rules wrt visibility etc, similar to __stack_chk_guard. These
> > > are arguably compiler issues that could qualify as bugs, given that
> > > these symbol references don't behave like ordinary symbol references.
> > >
> > > > > > Regarding module loading, I agree that we should support GOT reference
> > > > > > for the module itself. I will refactor it according to your suggestion.
> > > > > >
> > > > >
> > > > > Excellent, good luck with that.
> > > > >
> > > > > However, you will still need to make a convincing case for why this is
> > > > > all worth the trouble. Especially given that you disable the depth
> > > > > tracking code, which I don't think should be mutually exclusive.
> > > > >
> > > > Actually, I could do the relocation for it when applying the patching
> > > > for the depth tracking code. I'm not sure whether such cases are common.
> > > >
> > >
> > > I think that alternatives patching in general would need to support
> > > RIP relative references in the alternatives. The depth tracking
> > > template is a bit different in this regard, and could be fixed more
> > > easily, I think.
> > >
> > > > > I am aware that this is rather tricky, and involves rewriting
> > > > > RIP-relative per-CPU variable accesses, but it would be good to get a
> > > > > discussion started on that topic, and figure out whether there is a
> > > > > way forward there. Ignoring it is not going to help.
> > > > >
> > > > >
> > > > I see that your PIE linking chose to put the per-cpu section at a high
> > > > kernel image address, while I still keep it as zero mapping. However,
> > > > both are within the RIP-relative addressing range.
> > > >
> > >
> > > Pure PIE linking cannot support the zero mapping - it can only work
> > > with --emit-relocs, which I was trying to avoid.
> > Sorry, why doesn't PIE linking support zero mapping? I noticed that the
> > commit message for your PIE linking stated, "if we randomize the
> > kernel's VA by increasing it by X bytes, every RIP-relative per-CPU
> > reference needs to be decreased by the same amount in order for the
> > produced offset to remain correct." As a result, I decided to decrease
> > the GS base and not relocate the RIP-relative per-CPU references in the
> > relocs tool. Consequently, all RIP-relative references, regardless of
> > whether they are per-CPU variables or not, require no relocation.
> >
>
> Interesting. Does that work as expected with dynamically allocated
> per-CPU variables?
>
I didn't encounter any issues with the dynamically allocated per-CPU
variables. Since the per_cpu_ptr macro uses the __per_cpu_offset array
directly, it should work. In any case, I have tested loading the kvm
module, which uses dynamically allocated per-CPU variables, and
successfully booted a guest.
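
For reference, a simplified model of the arithmetic involved (the names
below are stand-ins; the real per_cpu_ptr() goes through
SHIFT_PERCPU_PTR/RELOC_HIDE): a dynamically allocated per-CPU pointer is
just an offset into each CPU's area, so only the per-CPU offset array
enters the computation, never the kernel image base:

#define EXAMPLE_NR_CPUS 64

/* Filled in at boot from the actual per-CPU area addresses. */
unsigned long example_per_cpu_offset[EXAMPLE_NR_CPUS];

#define example_per_cpu_ptr(ptr, cpu) \
	((typeof(ptr))((unsigned long)(ptr) + example_per_cpu_offset[cpu]))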

The related patch is:
https://lore.kernel.org/lkml/62d7e9e73467b711351a84ebce99372d3dccaa73.1682673543.git.houwenlong.hwl@antgroup.com

Thanks!
> > Furthermore, all symbols are hidden, which implies that per-CPU
> > references will not generate a GOT reference and will be relaxed to
> > absolute references due to the zero mapping. However, __stack_chk_guard
> > on Clang always generates a GOT reference, and I didn't see it being
> > relaxed to an absolute reference by LLVM.
> >
>
> Yeah, we should fix that.

2023-05-10 07:30:04

by Hou Wenlong

[permalink] [raw]
Subject: Re: [PATCH RFC 31/43] x86/modules: Adapt module loading for PIE support

On Mon, May 08, 2023 at 05:16:34PM +0800, Ard Biesheuvel wrote:
> On Mon, 8 May 2023 at 10:38, Hou Wenlong <[email protected]> wrote:
> >
> > On Sat, Apr 29, 2023 at 03:29:32AM +0800, Ard Biesheuvel wrote:
> > > On Fri, 28 Apr 2023 at 10:53, Hou Wenlong <[email protected]> wrote:
> > > >
> > > > Adapt module loading to support PIE relocations. No GOT is generated
> > > > for the module; all GOT entries for GOT references in a module should
> > > > already exist in the kernel GOT. Currently, there is only one usable
> > > > GOT reference, for __fentry__().
> > > >
> > >
> > > I don't think this is the right approach. We should permit GOTPCREL
> > > relocations properly, which means making them point to a location in
> > > memory that carries the absolute address of the symbol. There are
> > > several ways to go about that, but perhaps the simplest way is to make
> > > the symbol address in ksymtab a 64-bit absolute value (but retain the
> > > PC32 references for the symbol name and the symbol namespace name).
> > > That way, you can always resolve such GOTPCREL relocations by pointing
> > > it to the ksymtab entry. Another option would be to take inspiration
> > > from the PLT code we have on ARM and arm64 (and other architectures,
> > > surely) and to count the GOT based relocations, allocate some extra
> > > r/o module space for each, and allocate slots and populate them with
> > > the right value as you fix up the relocations.
> > >
> > > Then, many such relocations can be relaxed at module load time if the
> > > symbol is in range. IIUC, the module and kernel will still be inside
> > > the same 2G window even after widening the KASLR range to 512G, so
> > > most GOT loads can be converted into RIP relative LEA instructions.
> > >
> > > Note that this will also permit you to do things like
> > >
> > > #define PV_VCPU_PREEMPTED_ASM \
> > > "leaq __per_cpu_offset(%rip), %rax \n\t" \
> > > "movq (%rax,%rdi,8), %rax \n\t" \
> > > "addq steal_time@GOTPCREL(%rip), %rax \n\t" \
> > > "cmpb $0, " __stringify(KVM_STEAL_TIME_preempted) "(%rax) \n\t" \
> > > "setne %al\n\t"
> > >
> > > or
> > >
> > > +#ifdef CONFIG_X86_PIE
> > > + " pushq arch_rethook_trampoline@GOTPCREL(%rip)\n"
> > > +#else
> > > " pushq $arch_rethook_trampoline\n"
> > > +#endif
> > >
> > > instead of having these kludgy push/pop sequences to free up temp registers.
> > >
> > > (FYI I have looked into this PIE linking just a few weeks ago [0] so
> > > this is all rather fresh in my memory)
> > >
> > >
> > >
> > >
> > > [0] https://git.kernel.org/pub/scm/linux/kernel/git/ardb/linux.git/log/?h=x86-pie
> > >
> > >
> > Hi Ard,
> > Thanks for providing the link, it has been very helpful for me as I am
> > new to the topic of compilers.
>
> Happy to hear that.
>
> > One key difference I noticed is that you
> > linked the kernel with "-pie" instead of "--emit-reloc". I also noticed
> > that Thomas' initial patchset[0] used "-pie", but in RFC v3 [1], it
> > switched to "--emit-reloc" in order to reduce dynamic relocation space
> > on mapped memory.
> >
>
> The problem with --emit-relocs is that the relocations emitted into
> the binary may get out of sync with the actual code after the linker
> has applied relocations.
>
> $ cat /tmp/a.s
> foo:movq foo@GOTPCREL(%rip), %rax
>
> $ x86_64-linux-gnu-gcc -c -o /tmp/a.o /tmp/a.s
> ard@gambale:~/linux$ x86_64-linux-gnu-objdump -dr /tmp/a.o
>
> /tmp/a.o: file format elf64-x86-64
>
>
> Disassembly of section .text:
>
> 0000000000000000 <foo>:
> 0: 48 8b 05 00 00 00 00 mov 0x0(%rip),%rax # 7 <foo+0x7>
> 3: R_X86_64_REX_GOTPCRELX foo-0x4
>
> $ x86_64-linux-gnu-gcc -c -o /tmp/a.o /tmp/a.s
> $ x86_64-linux-gnu-objdump -dr /tmp/a.o
> 0000000000000000 <foo>:
> 0: 48 8b 05 00 00 00 00 mov 0x0(%rip),%rax # 7 <foo+0x7>
> 3: R_X86_64_REX_GOTPCRELX foo-0x4
>
> $ x86_64-linux-gnu-gcc -o /tmp/a.elf -nostartfiles
> -Wl,-no-pie,-q,--defsym,_start=0x0 /tmp/a.s
> $ x86_64-linux-gnu-objdump -dr /tmp/a.elf
> 0000000000401000 <foo>:
> 401000: 48 c7 c0 00 10 40 00 mov $0x401000,%rax
> 401003: R_X86_64_32S foo
>
> $ x86_64-linux-gnu-gcc -o /tmp/a.elf -nostartfiles
> -Wl,-q,--defsym,_start=0x0 /tmp/a.s
> $ x86_64-linux-gnu-objdump -dr /tmp/a.elf
> 0000000000001000 <foo>:
> 1000: 48 8d 05 f9 ff ff ff lea -0x7(%rip),%rax # 1000 <foo>
> 1003: R_X86_64_PC32 foo-0x4
>
> This all looks as expected. However, when using Clang, we end up with
>
> $ clang -target x86_64-linux-gnu -o /tmp/a.elf -nostartfiles
> -fuse-ld=lld -Wl,--relax,-q,--defsym,_start=0x0 /tmp/a.s
> $ x86_64-linux-gnu-objdump -dr /tmp/a.elf
> 00000000000012c0 <foo>:
> 12c0: 48 8d 05 f9 ff ff ff lea -0x7(%rip),%rax # 12c0 <foo>
> 12c3: R_X86_64_REX_GOTPCRELX foo-0x4
>
> So in this case, what --emit-relocs gives us is not what is actually
> in the binary. We cannot just ignore these either, given that they are
> treated differently depending on whether the symbol is a per-CPU
> symbol or not - in the former case, we need to perform a fixup if the
> relaxed reference is RIP relative, and in the latter case, if the
> relaxed reference is absolute.
>
> On top of that, --emit-relocs does not cover the GOT, so we'd still
> need to process that from the code explicitly.
>
> In general, relying on --emit-relocs is kind of dodgy, and I think
> combining PIE linking with --emit-relocs is a bad idea.
>
> > Another issue is that it requires the addition of the
> > "-mrelax-relocations=no" option to support older compilers and linkers.
>
> Why? The decompressor is now linked in PIE mode so we should be able
> to drop that. Or do you need to add it somewhere else?
>
Hi Ard,

After removing the "-mrelax-relocations=no" option, I noticed that the
linker was relaxing GOT references into absolute references for mov
instructions, even if the symbol was at a high address, as long as I
kept the compile-time base address of the kernel image in the top 2G. I
consulted the "Optimize GOTPCRELX Relocations" chapter in the x86-64
psABI, which states that "When position-independent code is disabled and
foo is defined locally in the lower 32-bit address space, memory operand
in mov can be converted into immediate operand". However, it seemed that
if the symbol was in the higher 32-bit address space, the memory operand
in mov would also be converted into an immediate operand; presumably
that is because top-2G addresses still fit in a sign-extended 32-bit
immediate (R_X86_64_32S). If I decreased the compile-time base address
of the kernel image, it would be relaxed into lea. Therefore, I believe
that using "-mrelax-relocations=no" without the "-pie" option is
necessary. Is there a way to force the linker to relax it into lea
without using the "-pie" option when linking?
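
That representability point is easy to check in isolation (a standalone
sketch, not taken from the thread):

#include <assert.h>
#include <stdint.h>

int main(void)
{
	/* R_X86_64_32S can represent any value that sign-extends from 32
	 * bits, which covers the kernel's top-2G link addresses. */
	uint64_t kernel_va = 0xffffffff81000000ULL;
	int32_t imm = (int32_t)kernel_va;	/* truncated immediate */

	assert((uint64_t)(int64_t)imm == kernel_va);	/* sign-extends back */
	return 0;
}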

Since all GOT references cannot be omitted, perhaps I should try linking
the kernel with the "-pie" option.

Thanks!

> > R_X86_64_GOTPCRELX and R_X86_64_REX_GOTPCRELX relocations are supported
> > in binutils 2.26 and later, but the minimum version required for the
> > kernel is 2.25. This option disables relocation relaxation, which
> > leaves the GOT non-empty. I also noticed this option in
> > arch/x86/boot/compressed/Makefile, with the reason given in [2].
> > Without relocation relaxation, GOT references would increase the size
> > of the GOT. Therefore, I do not want to use GOT references in assembly
> > directly. However, I realized that the compiler can still generate GOT
> > references in some cases, such as "fentry" calls and stack canary
> > references.
> >
>
> The stack canary references are under discussion here [3]. I have also
> sent a patch for kallsyms symbol references [4]. Beyond that, there
> should be very few cases where GOT entries are emitted, so I don't
> think this is fundamentally a problem.
>
> I haven't run into the __fentry__ issue myself: do you think we should
> fix this in the compiler?
>
> > Regarding module loading, I agree that we should support GOT reference
> > for the module itself. I will refactor it according to your suggestion.
> >
>
> Excellent, good luck with that.
>
> However, you will still need to make a convincing case for why this is
> all worth the trouble. Especially given that you disable the depth
> tracking code, which I don't think should be mutually exclusive.
>
> I am aware that this is rather tricky, and involves rewriting
> RIP-relative per-CPU variable accesses, but it would be good to get a
> discussion started on that topic, and figure out whether there is a
> way forward there. Ignoring it is not going to help.
>
>
> >
> > [0] https://yhbt.net/lore/all/[email protected]
> > [1] https://yhbt.net/lore/all/[email protected]
> > [2] https://lore.kernel.org/all/[email protected]/
> >
>
> [3] https://github.com/llvm/llvm-project/issues/60116
> [4] [email protected]
>
> > > > Signed-off-by: Hou Wenlong <[email protected]>
> > > > Cc: Thomas Garnier <[email protected]>
> > > > Cc: Lai Jiangshan <[email protected]>
> > > > Cc: Kees Cook <[email protected]>
> > > > ---
> > > > arch/x86/include/asm/sections.h | 5 +++++
> > > > arch/x86/kernel/module.c | 27 +++++++++++++++++++++++++++
> > > > 2 files changed, 32 insertions(+)
> > > >
> > > > diff --git a/arch/x86/include/asm/sections.h b/arch/x86/include/asm/sections.h
> > > > index a6e8373a5170..dc1c2b08ec48 100644
> > > > --- a/arch/x86/include/asm/sections.h
> > > > +++ b/arch/x86/include/asm/sections.h
> > > > @@ -12,6 +12,11 @@ extern char __end_rodata_aligned[];
> > > >
> > > > #if defined(CONFIG_X86_64)
> > > > extern char __end_rodata_hpage_align[];
> > > > +
> > > > +#ifdef CONFIG_X86_PIE
> > > > +extern char __start_got[], __end_got[];
> > > > +#endif
> > > > +
> > > > #endif
> > > >
> > > > extern char __end_of_kernel_reserve[];
> > > > diff --git a/arch/x86/kernel/module.c b/arch/x86/kernel/module.c
> > > > index 84ad0e61ba6e..051f88e6884e 100644
> > > > --- a/arch/x86/kernel/module.c
> > > > +++ b/arch/x86/kernel/module.c
> > > > @@ -129,6 +129,18 @@ int apply_relocate(Elf32_Shdr *sechdrs,
> > > > return 0;
> > > > }
> > > > #else /*X86_64*/
> > > > +#ifdef CONFIG_X86_PIE
> > > > +static u64 find_got_kernel_entry(Elf64_Sym *sym, const Elf64_Rela *rela)
> > > > +{
> > > > + u64 *pos;
> > > > +
> > > > + for (pos = (u64 *)__start_got; pos < (u64 *)__end_got; pos++)
> > > > + if (*pos == sym->st_value)
> > > > + return (u64)pos + rela->r_addend;
> > > > + return 0;
> > > > +}
> > > > +#endif
> > > > +
> > > > static int __write_relocate_add(Elf64_Shdr *sechdrs,
> > > > const char *strtab,
> > > > unsigned int symindex,
> > > > @@ -171,6 +183,7 @@ static int __write_relocate_add(Elf64_Shdr *sechdrs,
> > > > case R_X86_64_64:
> > > > size = 8;
> > > > break;
> > > > +#ifndef CONFIG_X86_PIE
> > > > case R_X86_64_32:
> > > > if (val != *(u32 *)&val)
> > > > goto overflow;
> > > > @@ -181,6 +194,13 @@ static int __write_relocate_add(Elf64_Shdr *sechdrs,
> > > > goto overflow;
> > > > size = 4;
> > > > break;
> > > > +#else
> > > > + case R_X86_64_GOTPCREL:
> > > > + val = find_got_kernel_entry(sym, rel);
> > > > + if (!val)
> > > > + goto unexpected_got_reference;
> > > > + fallthrough;
> > > > +#endif
> > > > case R_X86_64_PC32:
> > > > case R_X86_64_PLT32:
> > > > val -= (u64)loc;
> > > > @@ -214,11 +234,18 @@ static int __write_relocate_add(Elf64_Shdr *sechdrs,
> > > > }
> > > > return 0;
> > > >
> > > > +#ifdef CONFIG_X86_PIE
> > > > +unexpected_got_reference:
> > > > + pr_err("Target got entry doesn't exist in kernel got, loc %p\n", loc);
> > > > + return -ENOEXEC;
> > > > +#else
> > > > overflow:
> > > > pr_err("overflow in relocation type %d val %Lx\n",
> > > > (int)ELF64_R_TYPE(rel[i].r_info), val);
> > > > pr_err("`%s' likely not compiled with -mcmodel=kernel\n",
> > > > me->name);
> > > > +#endif
> > > > +
> > > > return -ENOEXEC;
> > > > }
> > > >
> > > > --
> > > > 2.31.1
> > > >

2023-05-10 08:25:38

by Ard Biesheuvel

[permalink] [raw]
Subject: Re: [PATCH RFC 31/43] x86/modules: Adapt module loading for PIE support

On Wed, 10 May 2023 at 09:15, Hou Wenlong <[email protected]> wrote:
>
> On Mon, May 08, 2023 at 05:16:34PM +0800, Ard Biesheuvel wrote:
> > On Mon, 8 May 2023 at 10:38, Hou Wenlong <[email protected]> wrote:
> > >
> > > On Sat, Apr 29, 2023 at 03:29:32AM +0800, Ard Biesheuvel wrote:
> > > > On Fri, 28 Apr 2023 at 10:53, Hou Wenlong <[email protected]> wrote:
> > > > >
> > > > > Adapt module loading to support PIE relocations. No GOT is generated
> > > > > for the module; all GOT entries for GOT references in a module should
> > > > > already exist in the kernel GOT. Currently, there is only one usable
> > > > > GOT reference, for __fentry__().
> > > > >
> > > >
> > > > I don't think this is the right approach. We should permit GOTPCREL
> > > > relocations properly, which means making them point to a location in
> > > > memory that carries the absolute address of the symbol. There are
> > > > several ways to go about that, but perhaps the simplest way is to make
> > > > the symbol address in ksymtab a 64-bit absolute value (but retain the
> > > > PC32 references for the symbol name and the symbol namespace name).
> > > > That way, you can always resolve such GOTPCREL relocations by pointing
> > > > it to the ksymtab entry. Another option would be to take inspiration
> > > > from the PLT code we have on ARM and arm64 (and other architectures,
> > > > surely) and to count the GOT based relocations, allocate some extra
> > > > r/o module space for each, and allocate slots and populate them with
> > > > the right value as you fix up the relocations.
> > > >
> > > > Then, many such relocations can be relaxed at module load time if the
> > > > symbol is in range. IIUC, the module and kernel will still be inside
> > > > the same 2G window even after widening the KASLR range to 512G, so
> > > > most GOT loads can be converted into RIP relative LEA instructions.
> > > >
> > > > Note that this will also permit you to do things like
> > > >
> > > > #define PV_VCPU_PREEMPTED_ASM \
> > > > "leaq __per_cpu_offset(%rip), %rax \n\t" \
> > > > "movq (%rax,%rdi,8), %rax \n\t" \
> > > > "addq steal_time@GOTPCREL(%rip), %rax \n\t" \
> > > > "cmpb $0, " __stringify(KVM_STEAL_TIME_preempted) "(%rax) \n\t" \
> > > > "setne %al\n\t"
> > > >
> > > > or
> > > >
> > > > +#ifdef CONFIG_X86_PIE
> > > > + " pushq arch_rethook_trampoline@GOTPCREL(%rip)\n"
> > > > +#else
> > > > " pushq $arch_rethook_trampoline\n"
> > > > +#endif
> > > >
> > > > instead of having these kludgy push/pop sequences to free up temp registers.
> > > >
> > > > (FYI I have looked into this PIE linking just a few weeks ago [0] so
> > > > this is all rather fresh in my memory)
> > > >
> > > >
> > > >
> > > >
> > > > [0] https://git.kernel.org/pub/scm/linux/kernel/git/ardb/linux.git/log/?h=x86-pie
> > > >
> > > >
> > > Hi Ard,
> > > Thanks for providing the link, it has been very helpful for me as I am
> > > new to the topic of compilers.
> >
> > Happy to hear that.
> >
> > > One key difference I noticed is that you
> > > linked the kernel with "-pie" instead of "--emit-reloc". I also noticed
> > > that Thomas' initial patchset[0] used "-pie", but in RFC v3 [1], it
> > > switched to "--emit-reloc" in order to reduce dynamic relocation space
> > > on mapped memory.
> > >
> >
> > The problem with --emit-relocs is that the relocations emitted into
> > the binary may get out of sync with the actual code after the linker
> > has applied relocations.
> >
> > $ cat /tmp/a.s
> > foo:movq foo@GOTPCREL(%rip), %rax
> >
> > $ x86_64-linux-gnu-gcc -c -o /tmp/a.o /tmp/a.s
> > ard@gambale:~/linux$ x86_64-linux-gnu-objdump -dr /tmp/a.o
> >
> > /tmp/a.o: file format elf64-x86-64
> >
> >
> > Disassembly of section .text:
> >
> > 0000000000000000 <foo>:
> > 0: 48 8b 05 00 00 00 00 mov 0x0(%rip),%rax # 7 <foo+0x7>
> > 3: R_X86_64_REX_GOTPCRELX foo-0x4
> >
> > $ x86_64-linux-gnu-gcc -c -o /tmp/a.o /tmp/a.s
> > $ x86_64-linux-gnu-objdump -dr /tmp/a.o
> > 0000000000000000 <foo>:
> > 0: 48 8b 05 00 00 00 00 mov 0x0(%rip),%rax # 7 <foo+0x7>
> > 3: R_X86_64_REX_GOTPCRELX foo-0x4
> >
> > $ x86_64-linux-gnu-gcc -o /tmp/a.elf -nostartfiles
> > -Wl,-no-pie,-q,--defsym,_start=0x0 /tmp/a.s
> > $ x86_64-linux-gnu-objdump -dr /tmp/a.elf
> > 0000000000401000 <foo>:
> > 401000: 48 c7 c0 00 10 40 00 mov $0x401000,%rax
> > 401003: R_X86_64_32S foo
> >
> > $ x86_64-linux-gnu-gcc -o /tmp/a.elf -nostartfiles
> > -Wl,-q,--defsym,_start=0x0 /tmp/a.s
> > $ x86_64-linux-gnu-objdump -dr /tmp/a.elf
> > 0000000000001000 <foo>:
> > 1000: 48 8d 05 f9 ff ff ff lea -0x7(%rip),%rax # 1000 <foo>
> > 1003: R_X86_64_PC32 foo-0x4
> >
> > This all looks as expected. However, when using Clang, we end up with
> >
> > $ clang -target x86_64-linux-gnu -o /tmp/a.elf -nostartfiles
> > -fuse-ld=lld -Wl,--relax,-q,--defsym,_start=0x0 /tmp/a.s
> > $ x86_64-linux-gnu-objdump -dr /tmp/a.elf
> > 00000000000012c0 <foo>:
> > 12c0: 48 8d 05 f9 ff ff ff lea -0x7(%rip),%rax # 12c0 <foo>
> > 12c3: R_X86_64_REX_GOTPCRELX foo-0x4
> >
> > So in this case, what --emit-relocs gives us is not what is actually
> > in the binary. We cannot just ignore these either, given that they are
> > treated differently depending on whether the symbol is a per-CPU
> > symbol or not - in the former case, we need to perform a fixup if the
> > relaxed reference is RIP relative, and in the latter case, if the
> > relaxed reference is absolute.
> >
> > On top of that, --emit-relocs does not cover the GOT, so we'd still
> > need to process that from the code explicitly.
> >
> > In general, relying on --emit-relocs is kind of dodgy, and I think
> > combining PIE linking with --emit-relocs is a bad idea.
> >
> > > Another issue is that it requires the addition of the
> > > "-mrelax-relocations=no" option to support older compilers and linkers.
> >
> > Why? The decompressor is now linked in PIE mode so we should be able
> > to drop that. Or do you need to add it somewhere else?
> >
> Hi Ard,
>
> After removing the "-mrelax-relocations=no" option, I noticed that the
> linker was relaxing GOT references into absolute references for mov
> instructions, even if the symbol was at a high address, as long as I
> kept the compile-time base address of the kernel image in the top 2G. I
> consulted the "Optimize GOTPCRELX Relocations" chapter in the x86-64
> psABI, which states that "When position-independent code is disabled and
> foo is defined locally in the lower 32-bit address space, memory operand
> in mov can be converted into immediate operand". However, it seemed that
> if the symbol was in the higher 32-bit address space, the memory operand
> in mov would also be converted into an immediate operand; presumably
> that is because top-2G addresses still fit in a sign-extended 32-bit
> immediate (R_X86_64_32S). If I decreased the compile-time base address
> of the kernel image, it would be relaxed into lea. Therefore, I believe
> that using "-mrelax-relocations=no" without the "-pie" option is
> necessary.

Indeed. As you noted, the linker assumes that non-PIE linked binaries
will always appear at their link-time address, and relaxations will
try to take advantage of that.

Currently, we use -pie linking only for the decompressor, and we
should be able to drop -mrelax-relocations=no from its LDFLAGS. But
position dependent linking should not use relaxations at all.

> Is there a way to force the linker to relax
> it into lea without using the "-pie" option when linking?
>

Not that I am aware of.

> Since all GOT references cannot be omitted, perhaps I should try linking
> the kernel with the "-pie" option.
>

That way, we will end up with two sets of relocations, the static ones
from --emit-relocs and the dynamic ones from -pie. This should be
manageable, given that the difference between those sets should
exactly cover the GOT.

However, relying on --emit-relocs and -pie at the same time seems
clumsy to me. I'd prefer to only depend on -pie at /some/ point.

--
Ard.

2023-06-01 09:41:08

by Juergen Gross

[permalink] [raw]
Subject: Re: [PATCH RFC 13/43] x86/paravirt: Use relative reference for original instruction

On 28.04.23 11:50, Hou Wenlong wrote:
> Similar to the alternative patching, use a relative reference for the
> original instruction rather than an absolute one, which saves 8 bytes
> per entry on x86_64. It also generates an R_X86_64_PC32 relocation
> instead of an R_X86_64_64 relocation, which reduces relocation metadata
> on relocatable builds. And the alignment can now be hard-coded to 4.
>
> Signed-off-by: Hou Wenlong <[email protected]>
> Cc: Thomas Garnier <[email protected]>
> Cc: Lai Jiangshan <[email protected]>
> Cc: Kees Cook <[email protected]>

Reviewed-by: Juergen Gross <[email protected]>

I think this patch should be taken even without the series.


Juergen



2023-06-05 06:50:28

by Nadav Amit

[permalink] [raw]
Subject: Re: [PATCH RFC 13/43] x86/paravirt: Use relative reference for original instruction



> On Jun 1, 2023, at 2:29 AM, Juergen Gross <[email protected]> wrote:
>
> On 28.04.23 11:50, Hou Wenlong wrote:
>> Similar to the alternative patching, use a relative reference for the
>> original instruction rather than an absolute one, which saves 8 bytes
>> per entry on x86_64. It also generates an R_X86_64_PC32 relocation
>> instead of an R_X86_64_64 relocation, which reduces relocation metadata
>> on relocatable builds. And the alignment can now be hard-coded to 4.
>> Signed-off-by: Hou Wenlong <[email protected]>
>> Cc: Thomas Garnier <[email protected]>
>> Cc: Lai Jiangshan <[email protected]>
>> Cc: Kees Cook <[email protected]>
>
> Reviewed-by: Juergen Gross <[email protected]>
>
> I think this patch should be taken even without the series.

It looks good to me; I am just not sure why the alignment is needed
at all.

Why not make the struct __packed (like struct alt_instr) and get rid
of all the .align directives? Am I missing something?


2023-06-06 12:05:03

by Hou Wenlong

[permalink] [raw]
Subject: Re: [PATCH RFC 13/43] x86/paravirt: Use relative reference for original instruction

On Mon, Jun 05, 2023 at 02:40:54PM +0800, Nadav Amit wrote:
>
>
> > On Jun 1, 2023, at 2:29 AM, Juergen Gross <[email protected]> wrote:
> >
> > On 28.04.23 11:50, Hou Wenlong wrote:
> >> Similar to the alternative patching, use a relative reference for the
> >> original instruction rather than an absolute one, which saves 8 bytes
> >> per entry on x86_64. It also generates an R_X86_64_PC32 relocation
> >> instead of an R_X86_64_64 relocation, which reduces relocation metadata
> >> on relocatable builds. And the alignment can now be hard-coded to 4.
> >> Signed-off-by: Hou Wenlong <[email protected]>
> >> Cc: Thomas Garnier <[email protected]>
> >> Cc: Lai Jiangshan <[email protected]>
> >> Cc: Kees Cook <[email protected]>
> >
> > Reviewed-by: Juergen Gross <[email protected]>
> >
> > I think this patch should be taken even without the series.
>
> It looks good to me; I am just not sure why the alignment is needed
> at all.
>
> Why not make the struct __packed (like struct alt_instr) and get rid
> of all the .align directives? Am I missing something?

Yes, making the struct __packed can save more space. If I understand
correctly, it could be done even without this patch, but it may lead to
misaligned memory accesses. However, that seems not to matter, as I
didn't find any related discussion about packing struct alt_instr. I can
do that if needed.

Thanks.