These patches make the changes necessary to build the kernel as a Position
Independent Executable (PIE) on x86_64. A PIE kernel can be relocated below
the top 2G of the virtual address space, which allows the KASLR randomization
range to be optionally extended from 1G to 3G.
Thanks a lot to Ard Biesheuvel & Kees Cook for their feedback on compiler
changes, PIE support and KASLR in general.
The patches:
- 1-3, 5-15: Change in assembly code to be PIE compliant.
- 4: Add a new _ASM_GET_PTR macro to fetch a symbol address generically.
- 16: Adapt percpu design to work correctly when PIE is enabled.
- 17: Provide an option to default visibility to hidden except for key symbols.
It avoids relocation errors between compilation units when PIE is enabled.
- 18: Adapt relocation tool to handle PIE binary correctly.
- 19: Add the CONFIG_X86_PIE option (off by default).
- 20: Adapt relocation tool to generate a 64-bit relocation table.
- 21: Add options to build modules as mcmodel=large and dynamically create a
PLT for relative references out of range (adapted from arm64).
- 22: Add the CONFIG_RANDOMIZE_BASE_LARGE option to increase relocation range
from 1G to 3G (off by default).
Performance/Size impact:
Hackbench (50% and 1600% loads):
- PIE disabled: no significant change (-0.50% / +0.50%)
- PIE enabled: 7% to 8% overhead on half load, 10% on heavy load.
These results are in line with existing research on user-mode PIE impact on
CPU-intensive benchmarks (around 10% on x86_64).
slab_test (average of 10 runs):
- PIE disabled: no significant change (-1% / +1%)
- PIE enabled: 3% to 4%
Kernbench (average of 10 Half and Optimal runs):
Elapsed Time:
- PIE disabled: no significant change (-0.22% / +0.06%)
- PIE enabled: around 0.50%
System Time:
- PIE disabled: no significant change (-0.99% / -1.28%)
- PIE enabled: 5% to 6%
Size of vmlinux (Ubuntu configuration):
File size:
- PIE disabled: 472928672 bytes (-0.000169% from baseline)
- PIE enabled: 216878461 bytes (-54.14% from baseline)
.text sections:
- PIE disabled: 9373572 bytes (+0.04% from baseline)
- PIE enabled: 9499138 bytes (+1.38% from baseline)
The big decrease in vmlinux file size is due to the lower number of
relocations appended to the file.
diffstat:
arch/x86/Kconfig | 37 +++++
arch/x86/Makefile | 17 ++
arch/x86/boot/boot.h | 2
arch/x86/boot/compressed/Makefile | 5
arch/x86/boot/compressed/misc.c | 10 +
arch/x86/crypto/aes-x86_64-asm_64.S | 45 +++---
arch/x86/crypto/aesni-intel_asm.S | 14 +
arch/x86/crypto/aesni-intel_avx-x86_64.S | 6
arch/x86/crypto/camellia-aesni-avx-asm_64.S | 42 ++---
arch/x86/crypto/camellia-aesni-avx2-asm_64.S | 44 +++---
arch/x86/crypto/camellia-x86_64-asm_64.S | 8 -
arch/x86/crypto/cast5-avx-x86_64-asm_64.S | 50 +++---
arch/x86/crypto/cast6-avx-x86_64-asm_64.S | 44 +++---
arch/x86/crypto/des3_ede-asm_64.S | 96 ++++++++-----
arch/x86/crypto/ghash-clmulni-intel_asm.S | 4
arch/x86/crypto/glue_helper-asm-avx.S | 4
arch/x86/crypto/glue_helper-asm-avx2.S | 6
arch/x86/entry/entry_64.S | 26 ++-
arch/x86/include/asm/asm.h | 13 +
arch/x86/include/asm/bug.h | 2
arch/x86/include/asm/jump_label.h | 8 -
arch/x86/include/asm/kvm_host.h | 6
arch/x86/include/asm/module.h | 16 ++
arch/x86/include/asm/page_64_types.h | 9 +
arch/x86/include/asm/paravirt_types.h | 12 +
arch/x86/include/asm/percpu.h | 25 ++-
arch/x86/include/asm/pm-trace.h | 2
arch/x86/include/asm/processor.h | 8 -
arch/x86/include/asm/setup.h | 2
arch/x86/kernel/Makefile | 2
arch/x86/kernel/acpi/wakeup_64.S | 31 ++--
arch/x86/kernel/cpu/common.c | 4
arch/x86/kernel/head64.c | 28 +++
arch/x86/kernel/head_64.S | 47 +++++-
arch/x86/kernel/kvm.c | 6
arch/x86/kernel/module-plts.c | 198 +++++++++++++++++++++++++++
arch/x86/kernel/module.c | 18 +-
arch/x86/kernel/module.lds | 4
arch/x86/kernel/relocate_kernel_64.S | 2
arch/x86/kernel/setup_percpu.c | 2
arch/x86/kernel/vmlinux.lds.S | 13 +
arch/x86/kvm/svm.c | 4
arch/x86/lib/cmpxchg16b_emu.S | 8 -
arch/x86/power/hibernate_asm_64.S | 4
arch/x86/tools/relocs.c | 134 +++++++++++++++---
arch/x86/tools/relocs.h | 4
arch/x86/tools/relocs_common.c | 15 +-
arch/x86/xen/xen-asm.S | 12 -
arch/x86/xen/xen-asm.h | 3
arch/x86/xen/xen-head.S | 9 -
include/asm-generic/sections.h | 6
include/linux/compiler.h | 8 +
init/Kconfig | 9 +
kernel/kallsyms.c | 16 +-
54 files changed, 868 insertions(+), 282 deletions(-)
Change the assembly code to use only relative references to symbols so that
the kernel can be built as PIE.
Position Independent Executable (PIE) support will allow extending the KASLR
randomization range below the -2G memory limit.
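As an illustration of the transformation applied throughout these files (a
sketch only; the scratch register actually used varies per file), an absolute
table reference such as:

        movl TAB+1024(, %rcx, 4), %eax  /* absolute base address */

becomes a RIP-relative sequence using a spare register, since the scaled-index
addressing form cannot take a RIP-relative base directly:

        leaq TAB+1024(%rip), %rbx       /* PIE-compatible symbol reference */
        movl (%rbx, %rcx, 4), %eax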
Signed-off-by: Thomas Garnier <[email protected]>
---
arch/x86/crypto/aes-x86_64-asm_64.S | 45 ++++++++-----
arch/x86/crypto/aesni-intel_asm.S | 14 ++--
arch/x86/crypto/aesni-intel_avx-x86_64.S | 6 +-
arch/x86/crypto/camellia-aesni-avx-asm_64.S | 42 ++++++------
arch/x86/crypto/camellia-aesni-avx2-asm_64.S | 44 ++++++-------
arch/x86/crypto/camellia-x86_64-asm_64.S | 8 ++-
arch/x86/crypto/cast5-avx-x86_64-asm_64.S | 50 ++++++++-------
arch/x86/crypto/cast6-avx-x86_64-asm_64.S | 44 +++++++------
arch/x86/crypto/des3_ede-asm_64.S | 96 ++++++++++++++++++----------
arch/x86/crypto/ghash-clmulni-intel_asm.S | 4 +-
arch/x86/crypto/glue_helper-asm-avx.S | 4 +-
arch/x86/crypto/glue_helper-asm-avx2.S | 6 +-
12 files changed, 211 insertions(+), 152 deletions(-)
diff --git a/arch/x86/crypto/aes-x86_64-asm_64.S b/arch/x86/crypto/aes-x86_64-asm_64.S
index 8739cf7795de..86fa068e5e81 100644
--- a/arch/x86/crypto/aes-x86_64-asm_64.S
+++ b/arch/x86/crypto/aes-x86_64-asm_64.S
@@ -48,8 +48,12 @@
#define R10 %r10
#define R11 %r11
+/* Hold global for PIE support */
+#define RBASE %r12
+
#define prologue(FUNC,KEY,B128,B192,r1,r2,r5,r6,r7,r8,r9,r10,r11) \
ENTRY(FUNC); \
+ pushq RBASE; \
movq r1,r2; \
leaq KEY+48(r8),r9; \
movq r10,r11; \
@@ -74,54 +78,63 @@
movl r6 ## E,4(r9); \
movl r7 ## E,8(r9); \
movl r8 ## E,12(r9); \
+ popq RBASE; \
ret; \
ENDPROC(FUNC);
+#define round_mov(tab_off, reg_i, reg_o) \
+ leaq tab_off(%rip), RBASE; \
+ movl (RBASE,reg_i,4), reg_o;
+
+#define round_xor(tab_off, reg_i, reg_o) \
+ leaq tab_off(%rip), RBASE; \
+ xorl (RBASE,reg_i,4), reg_o;
+
#define round(TAB,OFFSET,r1,r2,r3,r4,r5,r6,r7,r8,ra,rb,rc,rd) \
movzbl r2 ## H,r5 ## E; \
movzbl r2 ## L,r6 ## E; \
- movl TAB+1024(,r5,4),r5 ## E;\
+ round_mov(TAB+1024, r5, r5 ## E)\
movw r4 ## X,r2 ## X; \
- movl TAB(,r6,4),r6 ## E; \
+ round_mov(TAB, r6, r6 ## E) \
roll $16,r2 ## E; \
shrl $16,r4 ## E; \
movzbl r4 ## L,r7 ## E; \
movzbl r4 ## H,r4 ## E; \
xorl OFFSET(r8),ra ## E; \
xorl OFFSET+4(r8),rb ## E; \
- xorl TAB+3072(,r4,4),r5 ## E;\
- xorl TAB+2048(,r7,4),r6 ## E;\
+ round_xor(TAB+3072, r4, r5 ## E)\
+ round_xor(TAB+2048, r7, r6 ## E)\
movzbl r1 ## L,r7 ## E; \
movzbl r1 ## H,r4 ## E; \
- movl TAB+1024(,r4,4),r4 ## E;\
+ round_mov(TAB+1024, r4, r4 ## E)\
movw r3 ## X,r1 ## X; \
roll $16,r1 ## E; \
shrl $16,r3 ## E; \
- xorl TAB(,r7,4),r5 ## E; \
+ round_xor(TAB, r7, r5 ## E) \
movzbl r3 ## L,r7 ## E; \
movzbl r3 ## H,r3 ## E; \
- xorl TAB+3072(,r3,4),r4 ## E;\
- xorl TAB+2048(,r7,4),r5 ## E;\
+ round_xor(TAB+3072, r3, r4 ## E)\
+ round_xor(TAB+2048, r7, r5 ## E)\
movzbl r1 ## L,r7 ## E; \
movzbl r1 ## H,r3 ## E; \
shrl $16,r1 ## E; \
- xorl TAB+3072(,r3,4),r6 ## E;\
- movl TAB+2048(,r7,4),r3 ## E;\
+ round_xor(TAB+3072, r3, r6 ## E)\
+ round_mov(TAB+2048, r7, r3 ## E)\
movzbl r1 ## L,r7 ## E; \
movzbl r1 ## H,r1 ## E; \
- xorl TAB+1024(,r1,4),r6 ## E;\
- xorl TAB(,r7,4),r3 ## E; \
+ round_xor(TAB+1024, r1, r6 ## E)\
+ round_xor(TAB, r7, r3 ## E) \
movzbl r2 ## H,r1 ## E; \
movzbl r2 ## L,r7 ## E; \
shrl $16,r2 ## E; \
- xorl TAB+3072(,r1,4),r3 ## E;\
- xorl TAB+2048(,r7,4),r4 ## E;\
+ round_xor(TAB+3072, r1, r3 ## E)\
+ round_xor(TAB+2048, r7, r4 ## E)\
movzbl r2 ## H,r1 ## E; \
movzbl r2 ## L,r2 ## E; \
xorl OFFSET+8(r8),rc ## E; \
xorl OFFSET+12(r8),rd ## E; \
- xorl TAB+1024(,r1,4),r3 ## E;\
- xorl TAB(,r2,4),r4 ## E;
+ round_xor(TAB+1024, r1, r3 ## E)\
+ round_xor(TAB, r2, r4 ## E)
#define move_regs(r1,r2,r3,r4) \
movl r3 ## E,r1 ## E; \
diff --git a/arch/x86/crypto/aesni-intel_asm.S b/arch/x86/crypto/aesni-intel_asm.S
index 16627fec80b2..5f73201dff32 100644
--- a/arch/x86/crypto/aesni-intel_asm.S
+++ b/arch/x86/crypto/aesni-intel_asm.S
@@ -325,7 +325,8 @@ _get_AAD_rest0\num_initial_blocks\operation:
vpshufb and an array of shuffle masks */
movq %r12, %r11
salq $4, %r11
- movdqu aad_shift_arr(%r11), \TMP1
+ leaq aad_shift_arr(%rip), %rax
+ movdqu (%rax,%r11,), \TMP1
PSHUFB_XMM \TMP1, %xmm\i
_get_AAD_rest_final\num_initial_blocks\operation:
PSHUFB_XMM %xmm14, %xmm\i # byte-reflect the AAD data
@@ -584,7 +585,8 @@ _get_AAD_rest0\num_initial_blocks\operation:
vpshufb and an array of shuffle masks */
movq %r12, %r11
salq $4, %r11
- movdqu aad_shift_arr(%r11), \TMP1
+ leaq aad_shift_arr(%rip), %rax
+ movdqu (%rax,%r11,), \TMP1
PSHUFB_XMM \TMP1, %xmm\i
_get_AAD_rest_final\num_initial_blocks\operation:
PSHUFB_XMM %xmm14, %xmm\i # byte-reflect the AAD data
@@ -2722,7 +2724,7 @@ ENDPROC(aesni_cbc_dec)
*/
.align 4
_aesni_inc_init:
- movaps .Lbswap_mask, BSWAP_MASK
+ movaps .Lbswap_mask(%rip), BSWAP_MASK
movaps IV, CTR
PSHUFB_XMM BSWAP_MASK CTR
mov $1, TCTR_LOW
@@ -2850,12 +2852,12 @@ ENTRY(aesni_xts_crypt8)
cmpb $0, %cl
movl $0, %ecx
movl $240, %r10d
- leaq _aesni_enc4, %r11
- leaq _aesni_dec4, %rax
+ leaq _aesni_enc4(%rip), %r11
+ leaq _aesni_dec4(%rip), %rax
cmovel %r10d, %ecx
cmoveq %rax, %r11
- movdqa .Lgf128mul_x_ble_mask, GF128MUL_MASK
+ movdqa .Lgf128mul_x_ble_mask(%rip), GF128MUL_MASK
movups (IVP), IV
mov 480(KEYP), KLEN
diff --git a/arch/x86/crypto/aesni-intel_avx-x86_64.S b/arch/x86/crypto/aesni-intel_avx-x86_64.S
index faecb1518bf8..488605b19fe8 100644
--- a/arch/x86/crypto/aesni-intel_avx-x86_64.S
+++ b/arch/x86/crypto/aesni-intel_avx-x86_64.S
@@ -454,7 +454,8 @@ _get_AAD_rest0\@:
vpshufb and an array of shuffle masks */
movq %r12, %r11
salq $4, %r11
- movdqu aad_shift_arr(%r11), \T1
+ leaq aad_shift_arr(%rip), %rax
+ movdqu (%rax,%r11,), \T1
vpshufb \T1, reg_i, reg_i
_get_AAD_rest_final\@:
vpshufb SHUF_MASK(%rip), reg_i, reg_i
@@ -1761,7 +1762,8 @@ _get_AAD_rest0\@:
vpshufb and an array of shuffle masks */
movq %r12, %r11
salq $4, %r11
- movdqu aad_shift_arr(%r11), \T1
+ leaq aad_shift_arr(%rip), %rax
+ movdqu (%rax,%r11,), \T1
vpshufb \T1, reg_i, reg_i
_get_AAD_rest_final\@:
vpshufb SHUF_MASK(%rip), reg_i, reg_i
diff --git a/arch/x86/crypto/camellia-aesni-avx-asm_64.S b/arch/x86/crypto/camellia-aesni-avx-asm_64.S
index f7c495e2863c..46feaea52632 100644
--- a/arch/x86/crypto/camellia-aesni-avx-asm_64.S
+++ b/arch/x86/crypto/camellia-aesni-avx-asm_64.S
@@ -52,10 +52,10 @@
/* \
* S-function with AES subbytes \
*/ \
- vmovdqa .Linv_shift_row, t4; \
- vbroadcastss .L0f0f0f0f, t7; \
- vmovdqa .Lpre_tf_lo_s1, t0; \
- vmovdqa .Lpre_tf_hi_s1, t1; \
+ vmovdqa .Linv_shift_row(%rip), t4; \
+ vbroadcastss .L0f0f0f0f(%rip), t7; \
+ vmovdqa .Lpre_tf_lo_s1(%rip), t0; \
+ vmovdqa .Lpre_tf_hi_s1(%rip), t1; \
\
/* AES inverse shift rows */ \
vpshufb t4, x0, x0; \
@@ -68,8 +68,8 @@
vpshufb t4, x6, x6; \
\
/* prefilter sboxes 1, 2 and 3 */ \
- vmovdqa .Lpre_tf_lo_s4, t2; \
- vmovdqa .Lpre_tf_hi_s4, t3; \
+ vmovdqa .Lpre_tf_lo_s4(%rip), t2; \
+ vmovdqa .Lpre_tf_hi_s4(%rip), t3; \
filter_8bit(x0, t0, t1, t7, t6); \
filter_8bit(x7, t0, t1, t7, t6); \
filter_8bit(x1, t0, t1, t7, t6); \
@@ -83,8 +83,8 @@
filter_8bit(x6, t2, t3, t7, t6); \
\
/* AES subbytes + AES shift rows */ \
- vmovdqa .Lpost_tf_lo_s1, t0; \
- vmovdqa .Lpost_tf_hi_s1, t1; \
+ vmovdqa .Lpost_tf_lo_s1(%rip), t0; \
+ vmovdqa .Lpost_tf_hi_s1(%rip), t1; \
vaesenclast t4, x0, x0; \
vaesenclast t4, x7, x7; \
vaesenclast t4, x1, x1; \
@@ -95,16 +95,16 @@
vaesenclast t4, x6, x6; \
\
/* postfilter sboxes 1 and 4 */ \
- vmovdqa .Lpost_tf_lo_s3, t2; \
- vmovdqa .Lpost_tf_hi_s3, t3; \
+ vmovdqa .Lpost_tf_lo_s3(%rip), t2; \
+ vmovdqa .Lpost_tf_hi_s3(%rip), t3; \
filter_8bit(x0, t0, t1, t7, t6); \
filter_8bit(x7, t0, t1, t7, t6); \
filter_8bit(x3, t0, t1, t7, t6); \
filter_8bit(x6, t0, t1, t7, t6); \
\
/* postfilter sbox 3 */ \
- vmovdqa .Lpost_tf_lo_s2, t4; \
- vmovdqa .Lpost_tf_hi_s2, t5; \
+ vmovdqa .Lpost_tf_lo_s2(%rip), t4; \
+ vmovdqa .Lpost_tf_hi_s2(%rip), t5; \
filter_8bit(x2, t2, t3, t7, t6); \
filter_8bit(x5, t2, t3, t7, t6); \
\
@@ -443,7 +443,7 @@ ENDPROC(roundsm16_x4_x5_x6_x7_x0_x1_x2_x3_y4_y5_y6_y7_y0_y1_y2_y3_ab)
transpose_4x4(c0, c1, c2, c3, a0, a1); \
transpose_4x4(d0, d1, d2, d3, a0, a1); \
\
- vmovdqu .Lshufb_16x16b, a0; \
+ vmovdqu .Lshufb_16x16b(%rip), a0; \
vmovdqu st1, a1; \
vpshufb a0, a2, a2; \
vpshufb a0, a3, a3; \
@@ -482,7 +482,7 @@ ENDPROC(roundsm16_x4_x5_x6_x7_x0_x1_x2_x3_y4_y5_y6_y7_y0_y1_y2_y3_ab)
#define inpack16_pre(x0, x1, x2, x3, x4, x5, x6, x7, y0, y1, y2, y3, y4, y5, \
y6, y7, rio, key) \
vmovq key, x0; \
- vpshufb .Lpack_bswap, x0, x0; \
+ vpshufb .Lpack_bswap(%rip), x0, x0; \
\
vpxor 0 * 16(rio), x0, y7; \
vpxor 1 * 16(rio), x0, y6; \
@@ -533,7 +533,7 @@ ENDPROC(roundsm16_x4_x5_x6_x7_x0_x1_x2_x3_y4_y5_y6_y7_y0_y1_y2_y3_ab)
vmovdqu x0, stack_tmp0; \
\
vmovq key, x0; \
- vpshufb .Lpack_bswap, x0, x0; \
+ vpshufb .Lpack_bswap(%rip), x0, x0; \
\
vpxor x0, y7, y7; \
vpxor x0, y6, y6; \
@@ -1016,7 +1016,7 @@ ENTRY(camellia_ctr_16way)
subq $(16 * 16), %rsp;
movq %rsp, %rax;
- vmovdqa .Lbswap128_mask, %xmm14;
+ vmovdqa .Lbswap128_mask(%rip), %xmm14;
/* load IV and byteswap */
vmovdqu (%rcx), %xmm0;
@@ -1065,7 +1065,7 @@ ENTRY(camellia_ctr_16way)
/* inpack16_pre: */
vmovq (key_table)(CTX), %xmm15;
- vpshufb .Lpack_bswap, %xmm15, %xmm15;
+ vpshufb .Lpack_bswap(%rip), %xmm15, %xmm15;
vpxor %xmm0, %xmm15, %xmm0;
vpxor %xmm1, %xmm15, %xmm1;
vpxor %xmm2, %xmm15, %xmm2;
@@ -1133,7 +1133,7 @@ camellia_xts_crypt_16way:
subq $(16 * 16), %rsp;
movq %rsp, %rax;
- vmovdqa .Lxts_gf128mul_and_shl1_mask, %xmm14;
+ vmovdqa .Lxts_gf128mul_and_shl1_mask(%rip), %xmm14;
/* load IV */
vmovdqu (%rcx), %xmm0;
@@ -1209,7 +1209,7 @@ camellia_xts_crypt_16way:
/* inpack16_pre: */
vmovq (key_table)(CTX, %r8, 8), %xmm15;
- vpshufb .Lpack_bswap, %xmm15, %xmm15;
+ vpshufb .Lpack_bswap(%rip), %xmm15, %xmm15;
vpxor 0 * 16(%rax), %xmm15, %xmm0;
vpxor %xmm1, %xmm15, %xmm1;
vpxor %xmm2, %xmm15, %xmm2;
@@ -1264,7 +1264,7 @@ ENTRY(camellia_xts_enc_16way)
*/
xorl %r8d, %r8d; /* input whitening key, 0 for enc */
- leaq __camellia_enc_blk16, %r9;
+ leaq __camellia_enc_blk16(%rip), %r9;
jmp camellia_xts_crypt_16way;
ENDPROC(camellia_xts_enc_16way)
@@ -1282,7 +1282,7 @@ ENTRY(camellia_xts_dec_16way)
movl $24, %eax;
cmovel %eax, %r8d; /* input whitening key, last for dec */
- leaq __camellia_dec_blk16, %r9;
+ leaq __camellia_dec_blk16(%rip), %r9;
jmp camellia_xts_crypt_16way;
ENDPROC(camellia_xts_dec_16way)
diff --git a/arch/x86/crypto/camellia-aesni-avx2-asm_64.S b/arch/x86/crypto/camellia-aesni-avx2-asm_64.S
index eee5b3982cfd..93da327fec83 100644
--- a/arch/x86/crypto/camellia-aesni-avx2-asm_64.S
+++ b/arch/x86/crypto/camellia-aesni-avx2-asm_64.S
@@ -69,12 +69,12 @@
/* \
* S-function with AES subbytes \
*/ \
- vbroadcasti128 .Linv_shift_row, t4; \
- vpbroadcastd .L0f0f0f0f, t7; \
- vbroadcasti128 .Lpre_tf_lo_s1, t5; \
- vbroadcasti128 .Lpre_tf_hi_s1, t6; \
- vbroadcasti128 .Lpre_tf_lo_s4, t2; \
- vbroadcasti128 .Lpre_tf_hi_s4, t3; \
+ vbroadcasti128 .Linv_shift_row(%rip), t4; \
+ vpbroadcastd .L0f0f0f0f(%rip), t7; \
+ vbroadcasti128 .Lpre_tf_lo_s1(%rip), t5; \
+ vbroadcasti128 .Lpre_tf_hi_s1(%rip), t6; \
+ vbroadcasti128 .Lpre_tf_lo_s4(%rip), t2; \
+ vbroadcasti128 .Lpre_tf_hi_s4(%rip), t3; \
\
/* AES inverse shift rows */ \
vpshufb t4, x0, x0; \
@@ -120,8 +120,8 @@
vinserti128 $1, t2##_x, x6, x6; \
vextracti128 $1, x1, t3##_x; \
vextracti128 $1, x4, t2##_x; \
- vbroadcasti128 .Lpost_tf_lo_s1, t0; \
- vbroadcasti128 .Lpost_tf_hi_s1, t1; \
+ vbroadcasti128 .Lpost_tf_lo_s1(%rip), t0; \
+ vbroadcasti128 .Lpost_tf_hi_s1(%rip), t1; \
vaesenclast t4##_x, x2##_x, x2##_x; \
vaesenclast t4##_x, t6##_x, t6##_x; \
vinserti128 $1, t6##_x, x2, x2; \
@@ -136,16 +136,16 @@
vinserti128 $1, t2##_x, x4, x4; \
\
/* postfilter sboxes 1 and 4 */ \
- vbroadcasti128 .Lpost_tf_lo_s3, t2; \
- vbroadcasti128 .Lpost_tf_hi_s3, t3; \
+ vbroadcasti128 .Lpost_tf_lo_s3(%rip), t2; \
+ vbroadcasti128 .Lpost_tf_hi_s3(%rip), t3; \
filter_8bit(x0, t0, t1, t7, t6); \
filter_8bit(x7, t0, t1, t7, t6); \
filter_8bit(x3, t0, t1, t7, t6); \
filter_8bit(x6, t0, t1, t7, t6); \
\
/* postfilter sbox 3 */ \
- vbroadcasti128 .Lpost_tf_lo_s2, t4; \
- vbroadcasti128 .Lpost_tf_hi_s2, t5; \
+ vbroadcasti128 .Lpost_tf_lo_s2(%rip), t4; \
+ vbroadcasti128 .Lpost_tf_hi_s2(%rip), t5; \
filter_8bit(x2, t2, t3, t7, t6); \
filter_8bit(x5, t2, t3, t7, t6); \
\
@@ -482,7 +482,7 @@ ENDPROC(roundsm32_x4_x5_x6_x7_x0_x1_x2_x3_y4_y5_y6_y7_y0_y1_y2_y3_ab)
transpose_4x4(c0, c1, c2, c3, a0, a1); \
transpose_4x4(d0, d1, d2, d3, a0, a1); \
\
- vbroadcasti128 .Lshufb_16x16b, a0; \
+ vbroadcasti128 .Lshufb_16x16b(%rip), a0; \
vmovdqu st1, a1; \
vpshufb a0, a2, a2; \
vpshufb a0, a3, a3; \
@@ -521,7 +521,7 @@ ENDPROC(roundsm32_x4_x5_x6_x7_x0_x1_x2_x3_y4_y5_y6_y7_y0_y1_y2_y3_ab)
#define inpack32_pre(x0, x1, x2, x3, x4, x5, x6, x7, y0, y1, y2, y3, y4, y5, \
y6, y7, rio, key) \
vpbroadcastq key, x0; \
- vpshufb .Lpack_bswap, x0, x0; \
+ vpshufb .Lpack_bswap(%rip), x0, x0; \
\
vpxor 0 * 32(rio), x0, y7; \
vpxor 1 * 32(rio), x0, y6; \
@@ -572,7 +572,7 @@ ENDPROC(roundsm32_x4_x5_x6_x7_x0_x1_x2_x3_y4_y5_y6_y7_y0_y1_y2_y3_ab)
vmovdqu x0, stack_tmp0; \
\
vpbroadcastq key, x0; \
- vpshufb .Lpack_bswap, x0, x0; \
+ vpshufb .Lpack_bswap(%rip), x0, x0; \
\
vpxor x0, y7, y7; \
vpxor x0, y6, y6; \
@@ -1112,7 +1112,7 @@ ENTRY(camellia_ctr_32way)
vmovdqu (%rcx), %xmm0;
vmovdqa %xmm0, %xmm1;
inc_le128(%xmm0, %xmm15, %xmm14);
- vbroadcasti128 .Lbswap128_mask, %ymm14;
+ vbroadcasti128 .Lbswap128_mask(%rip), %ymm14;
vinserti128 $1, %xmm0, %ymm1, %ymm0;
vpshufb %ymm14, %ymm0, %ymm13;
vmovdqu %ymm13, 15 * 32(%rax);
@@ -1158,7 +1158,7 @@ ENTRY(camellia_ctr_32way)
/* inpack32_pre: */
vpbroadcastq (key_table)(CTX), %ymm15;
- vpshufb .Lpack_bswap, %ymm15, %ymm15;
+ vpshufb .Lpack_bswap(%rip), %ymm15, %ymm15;
vpxor %ymm0, %ymm15, %ymm0;
vpxor %ymm1, %ymm15, %ymm1;
vpxor %ymm2, %ymm15, %ymm2;
@@ -1242,13 +1242,13 @@ camellia_xts_crypt_32way:
subq $(16 * 32), %rsp;
movq %rsp, %rax;
- vbroadcasti128 .Lxts_gf128mul_and_shl1_mask_0, %ymm12;
+ vbroadcasti128 .Lxts_gf128mul_and_shl1_mask_0(%rip), %ymm12;
/* load IV and construct second IV */
vmovdqu (%rcx), %xmm0;
vmovdqa %xmm0, %xmm15;
gf128mul_x_ble(%xmm0, %xmm12, %xmm13);
- vbroadcasti128 .Lxts_gf128mul_and_shl1_mask_1, %ymm13;
+ vbroadcasti128 .Lxts_gf128mul_and_shl1_mask_1(%rip), %ymm13;
vinserti128 $1, %xmm0, %ymm15, %ymm0;
vpxor 0 * 32(%rdx), %ymm0, %ymm15;
vmovdqu %ymm15, 15 * 32(%rax);
@@ -1325,7 +1325,7 @@ camellia_xts_crypt_32way:
/* inpack32_pre: */
vpbroadcastq (key_table)(CTX, %r8, 8), %ymm15;
- vpshufb .Lpack_bswap, %ymm15, %ymm15;
+ vpshufb .Lpack_bswap(%rip), %ymm15, %ymm15;
vpxor 0 * 32(%rax), %ymm15, %ymm0;
vpxor %ymm1, %ymm15, %ymm1;
vpxor %ymm2, %ymm15, %ymm2;
@@ -1383,7 +1383,7 @@ ENTRY(camellia_xts_enc_32way)
xorl %r8d, %r8d; /* input whitening key, 0 for enc */
- leaq __camellia_enc_blk32, %r9;
+ leaq __camellia_enc_blk32(%rip), %r9;
jmp camellia_xts_crypt_32way;
ENDPROC(camellia_xts_enc_32way)
@@ -1401,7 +1401,7 @@ ENTRY(camellia_xts_dec_32way)
movl $24, %eax;
cmovel %eax, %r8d; /* input whitening key, last for dec */
- leaq __camellia_dec_blk32, %r9;
+ leaq __camellia_dec_blk32(%rip), %r9;
jmp camellia_xts_crypt_32way;
ENDPROC(camellia_xts_dec_32way)
diff --git a/arch/x86/crypto/camellia-x86_64-asm_64.S b/arch/x86/crypto/camellia-x86_64-asm_64.S
index 310319c601ed..b8c81e2f9973 100644
--- a/arch/x86/crypto/camellia-x86_64-asm_64.S
+++ b/arch/x86/crypto/camellia-x86_64-asm_64.S
@@ -92,11 +92,13 @@
#define RXORbl %r9b
#define xor2ror16(T0, T1, tmp1, tmp2, ab, dst) \
+ leaq T0(%rip), tmp1; \
movzbl ab ## bl, tmp2 ## d; \
+ xorq (tmp1, tmp2, 8), dst; \
+ leaq T1(%rip), tmp2; \
movzbl ab ## bh, tmp1 ## d; \
- rorq $16, ab; \
- xorq T0(, tmp2, 8), dst; \
- xorq T1(, tmp1, 8), dst;
+ xorq (tmp2, tmp1, 8), dst; \
+ rorq $16, ab;
/**********************************************************************
1-way camellia
diff --git a/arch/x86/crypto/cast5-avx-x86_64-asm_64.S b/arch/x86/crypto/cast5-avx-x86_64-asm_64.S
index b4a8806234ea..ae2976b56b27 100644
--- a/arch/x86/crypto/cast5-avx-x86_64-asm_64.S
+++ b/arch/x86/crypto/cast5-avx-x86_64-asm_64.S
@@ -98,16 +98,20 @@
#define lookup_32bit(src, dst, op1, op2, op3, interleave_op, il_reg) \
- movzbl src ## bh, RID1d; \
- movzbl src ## bl, RID2d; \
- shrq $16, src; \
- movl s1(, RID1, 4), dst ## d; \
- op1 s2(, RID2, 4), dst ## d; \
- movzbl src ## bh, RID1d; \
- movzbl src ## bl, RID2d; \
- interleave_op(il_reg); \
- op2 s3(, RID1, 4), dst ## d; \
- op3 s4(, RID2, 4), dst ## d;
+ movzbl src ## bh, RID1d; \
+ leaq s1(%rip), RID2; \
+ movl (RID2, RID1, 4), dst ## d; \
+ movzbl src ## bl, RID2d; \
+ leaq s2(%rip), RID1; \
+ op1 (RID1, RID2, 4), dst ## d; \
+ shrq $16, src; \
+ movzbl src ## bh, RID1d; \
+ leaq s3(%rip), RID2; \
+ op2 (RID2, RID1, 4), dst ## d; \
+ movzbl src ## bl, RID2d; \
+ leaq s4(%rip), RID1; \
+ op3 (RID1, RID2, 4), dst ## d; \
+ interleave_op(il_reg);
#define dummy(d) /* do nothing */
@@ -166,15 +170,15 @@
subround(l ## 3, r ## 3, l ## 4, r ## 4, f);
#define enc_preload_rkr() \
- vbroadcastss .L16_mask, RKR; \
+ vbroadcastss .L16_mask(%rip), RKR; \
/* add 16-bit rotation to key rotations (mod 32) */ \
vpxor kr(CTX), RKR, RKR;
#define dec_preload_rkr() \
- vbroadcastss .L16_mask, RKR; \
+ vbroadcastss .L16_mask(%rip), RKR; \
/* add 16-bit rotation to key rotations (mod 32) */ \
vpxor kr(CTX), RKR, RKR; \
- vpshufb .Lbswap128_mask, RKR, RKR;
+ vpshufb .Lbswap128_mask(%rip), RKR, RKR;
#define transpose_2x4(x0, x1, t0, t1) \
vpunpckldq x1, x0, t0; \
@@ -249,9 +253,9 @@ __cast5_enc_blk16:
pushq %rbp;
pushq %rbx;
- vmovdqa .Lbswap_mask, RKM;
- vmovd .Lfirst_mask, R1ST;
- vmovd .L32_mask, R32;
+ vmovdqa .Lbswap_mask(%rip), RKM;
+ vmovd .Lfirst_mask(%rip), R1ST;
+ vmovd .L32_mask(%rip), R32;
enc_preload_rkr();
inpack_blocks(RL1, RR1, RTMP, RX, RKM);
@@ -285,7 +289,7 @@ __cast5_enc_blk16:
popq %rbx;
popq %rbp;
- vmovdqa .Lbswap_mask, RKM;
+ vmovdqa .Lbswap_mask(%rip), RKM;
outunpack_blocks(RR1, RL1, RTMP, RX, RKM);
outunpack_blocks(RR2, RL2, RTMP, RX, RKM);
@@ -321,9 +325,9 @@ __cast5_dec_blk16:
pushq %rbp;
pushq %rbx;
- vmovdqa .Lbswap_mask, RKM;
- vmovd .Lfirst_mask, R1ST;
- vmovd .L32_mask, R32;
+ vmovdqa .Lbswap_mask(%rip), RKM;
+ vmovd .Lfirst_mask(%rip), R1ST;
+ vmovd .L32_mask(%rip), R32;
dec_preload_rkr();
inpack_blocks(RL1, RR1, RTMP, RX, RKM);
@@ -354,7 +358,7 @@ __cast5_dec_blk16:
round(RL, RR, 1, 2);
round(RR, RL, 0, 1);
- vmovdqa .Lbswap_mask, RKM;
+ vmovdqa .Lbswap_mask(%rip), RKM;
popq %rbx;
popq %rbp;
@@ -508,8 +512,8 @@ ENTRY(cast5_ctr_16way)
vpcmpeqd RKR, RKR, RKR;
vpaddq RKR, RKR, RKR; /* low: -2, high: -2 */
- vmovdqa .Lbswap_iv_mask, R1ST;
- vmovdqa .Lbswap128_mask, RKM;
+ vmovdqa .Lbswap_iv_mask(%rip), R1ST;
+ vmovdqa .Lbswap128_mask(%rip), RKM;
/* load IV and byteswap */
vmovq (%rcx), RX;
diff --git a/arch/x86/crypto/cast6-avx-x86_64-asm_64.S b/arch/x86/crypto/cast6-avx-x86_64-asm_64.S
index 952d3156a933..6bd52210a3c1 100644
--- a/arch/x86/crypto/cast6-avx-x86_64-asm_64.S
+++ b/arch/x86/crypto/cast6-avx-x86_64-asm_64.S
@@ -98,16 +98,20 @@
#define lookup_32bit(src, dst, op1, op2, op3, interleave_op, il_reg) \
- movzbl src ## bh, RID1d; \
- movzbl src ## bl, RID2d; \
- shrq $16, src; \
- movl s1(, RID1, 4), dst ## d; \
- op1 s2(, RID2, 4), dst ## d; \
- movzbl src ## bh, RID1d; \
- movzbl src ## bl, RID2d; \
- interleave_op(il_reg); \
- op2 s3(, RID1, 4), dst ## d; \
- op3 s4(, RID2, 4), dst ## d;
+ movzbl src ## bh, RID1d; \
+ leaq s1(%rip), RID2; \
+ movl (RID2, RID1, 4), dst ## d; \
+ movzbl src ## bl, RID2d; \
+ leaq s2(%rip), RID1; \
+ op1 (RID1, RID2, 4), dst ## d; \
+ shrq $16, src; \
+ movzbl src ## bh, RID1d; \
+ leaq s3(%rip), RID2; \
+ op2 (RID2, RID1, 4), dst ## d; \
+ movzbl src ## bl, RID2d; \
+ leaq s4(%rip), RID1; \
+ op3 (RID1, RID2, 4), dst ## d; \
+ interleave_op(il_reg);
#define dummy(d) /* do nothing */
@@ -190,10 +194,10 @@
qop(RD, RC, 1);
#define shuffle(mask) \
- vpshufb mask, RKR, RKR;
+ vpshufb mask(%rip), RKR, RKR;
#define preload_rkr(n, do_mask, mask) \
- vbroadcastss .L16_mask, RKR; \
+ vbroadcastss .L16_mask(%rip), RKR; \
/* add 16-bit rotation to key rotations (mod 32) */ \
vpxor (kr+n*16)(CTX), RKR, RKR; \
do_mask(mask);
@@ -273,9 +277,9 @@ __cast6_enc_blk8:
pushq %rbp;
pushq %rbx;
- vmovdqa .Lbswap_mask, RKM;
- vmovd .Lfirst_mask, R1ST;
- vmovd .L32_mask, R32;
+ vmovdqa .Lbswap_mask(%rip), RKM;
+ vmovd .Lfirst_mask(%rip), R1ST;
+ vmovd .L32_mask(%rip), R32;
inpack_blocks(RA1, RB1, RC1, RD1, RTMP, RX, RKRF, RKM);
inpack_blocks(RA2, RB2, RC2, RD2, RTMP, RX, RKRF, RKM);
@@ -299,7 +303,7 @@ __cast6_enc_blk8:
popq %rbx;
popq %rbp;
- vmovdqa .Lbswap_mask, RKM;
+ vmovdqa .Lbswap_mask(%rip), RKM;
outunpack_blocks(RA1, RB1, RC1, RD1, RTMP, RX, RKRF, RKM);
outunpack_blocks(RA2, RB2, RC2, RD2, RTMP, RX, RKRF, RKM);
@@ -319,9 +323,9 @@ __cast6_dec_blk8:
pushq %rbp;
pushq %rbx;
- vmovdqa .Lbswap_mask, RKM;
- vmovd .Lfirst_mask, R1ST;
- vmovd .L32_mask, R32;
+ vmovdqa .Lbswap_mask(%rip), RKM;
+ vmovd .Lfirst_mask(%rip), R1ST;
+ vmovd .L32_mask(%rip), R32;
inpack_blocks(RA1, RB1, RC1, RD1, RTMP, RX, RKRF, RKM);
inpack_blocks(RA2, RB2, RC2, RD2, RTMP, RX, RKRF, RKM);
@@ -345,7 +349,7 @@ __cast6_dec_blk8:
popq %rbx;
popq %rbp;
- vmovdqa .Lbswap_mask, RKM;
+ vmovdqa .Lbswap_mask(%rip), RKM;
outunpack_blocks(RA1, RB1, RC1, RD1, RTMP, RX, RKRF, RKM);
outunpack_blocks(RA2, RB2, RC2, RD2, RTMP, RX, RKRF, RKM);
diff --git a/arch/x86/crypto/des3_ede-asm_64.S b/arch/x86/crypto/des3_ede-asm_64.S
index f3e91647ca27..d532ff94b70a 100644
--- a/arch/x86/crypto/des3_ede-asm_64.S
+++ b/arch/x86/crypto/des3_ede-asm_64.S
@@ -138,21 +138,29 @@
movzbl RW0bl, RT2d; \
movzbl RW0bh, RT3d; \
shrq $16, RW0; \
- movq s8(, RT0, 8), RT0; \
- xorq s6(, RT1, 8), to; \
+ leaq s8(%rip), RW1; \
+ movq (RW1, RT0, 8), RT0; \
+ leaq s6(%rip), RW1; \
+ xorq (RW1, RT1, 8), to; \
movzbl RW0bl, RL1d; \
movzbl RW0bh, RT1d; \
shrl $16, RW0d; \
- xorq s4(, RT2, 8), RT0; \
- xorq s2(, RT3, 8), to; \
+ leaq s4(%rip), RW1; \
+ xorq (RW1, RT2, 8), RT0; \
+ leaq s2(%rip), RW1; \
+ xorq (RW1, RT3, 8), to; \
movzbl RW0bl, RT2d; \
movzbl RW0bh, RT3d; \
- xorq s7(, RL1, 8), RT0; \
- xorq s5(, RT1, 8), to; \
- xorq s3(, RT2, 8), RT0; \
+ leaq s7(%rip), RW1; \
+ xorq (RW1, RL1, 8), RT0; \
+ leaq s5(%rip), RW1; \
+ xorq (RW1, RT1, 8), to; \
+ leaq s3(%rip), RW1; \
+ xorq (RW1, RT2, 8), RT0; \
load_next_key(n, RW0); \
xorq RT0, to; \
- xorq s1(, RT3, 8), to; \
+ leaq s1(%rip), RW1; \
+ xorq (RW1, RT3, 8), to; \
#define load_next_key(n, RWx) \
movq (((n) + 1) * 8)(CTX), RWx;
@@ -362,65 +370,89 @@ ENDPROC(des3_ede_x86_64_crypt_blk)
movzbl RW0bl, RT3d; \
movzbl RW0bh, RT1d; \
shrq $16, RW0; \
- xorq s8(, RT3, 8), to##0; \
- xorq s6(, RT1, 8), to##0; \
+ leaq s8(%rip), RT2; \
+ xorq (RT2, RT3, 8), to##0; \
+ leaq s6(%rip), RT2; \
+ xorq (RT2, RT1, 8), to##0; \
movzbl RW0bl, RT3d; \
movzbl RW0bh, RT1d; \
shrq $16, RW0; \
- xorq s4(, RT3, 8), to##0; \
- xorq s2(, RT1, 8), to##0; \
+ leaq s4(%rip), RT2; \
+ xorq (RT2, RT3, 8), to##0; \
+ leaq s2(%rip), RT2; \
+ xorq (RT2, RT1, 8), to##0; \
movzbl RW0bl, RT3d; \
movzbl RW0bh, RT1d; \
shrl $16, RW0d; \
- xorq s7(, RT3, 8), to##0; \
- xorq s5(, RT1, 8), to##0; \
+ leaq s7(%rip), RT2; \
+ xorq (RT2, RT3, 8), to##0; \
+ leaq s5(%rip), RT2; \
+ xorq (RT2, RT1, 8), to##0; \
movzbl RW0bl, RT3d; \
movzbl RW0bh, RT1d; \
load_next_key(n, RW0); \
- xorq s3(, RT3, 8), to##0; \
- xorq s1(, RT1, 8), to##0; \
+ leaq s3(%rip), RT2; \
+ xorq (RT2, RT3, 8), to##0; \
+ leaq s1(%rip), RT2; \
+ xorq (RT2, RT1, 8), to##0; \
xorq from##1, RW1; \
movzbl RW1bl, RT3d; \
movzbl RW1bh, RT1d; \
shrq $16, RW1; \
- xorq s8(, RT3, 8), to##1; \
- xorq s6(, RT1, 8), to##1; \
+ leaq s8(%rip), RT2; \
+ xorq (RT2, RT3, 8), to##1; \
+ leaq s6(%rip), RT2; \
+ xorq (RT2, RT1, 8), to##1; \
movzbl RW1bl, RT3d; \
movzbl RW1bh, RT1d; \
shrq $16, RW1; \
- xorq s4(, RT3, 8), to##1; \
- xorq s2(, RT1, 8), to##1; \
+ leaq s4(%rip), RT2; \
+ xorq (RT2, RT3, 8), to##1; \
+ leaq s2(%rip), RT2; \
+ xorq (RT2, RT1, 8), to##1; \
movzbl RW1bl, RT3d; \
movzbl RW1bh, RT1d; \
shrl $16, RW1d; \
- xorq s7(, RT3, 8), to##1; \
- xorq s5(, RT1, 8), to##1; \
+ leaq s7(%rip), RT2; \
+ xorq (RT2, RT3, 8), to##1; \
+ leaq s5(%rip), RT2; \
+ xorq (RT2, RT1, 8), to##1; \
movzbl RW1bl, RT3d; \
movzbl RW1bh, RT1d; \
do_movq(RW0, RW1); \
- xorq s3(, RT3, 8), to##1; \
- xorq s1(, RT1, 8), to##1; \
+ leaq s3(%rip), RT2; \
+ xorq (RT2, RT3, 8), to##1; \
+ leaq s1(%rip), RT2; \
+ xorq (RT2, RT1, 8), to##1; \
xorq from##2, RW2; \
movzbl RW2bl, RT3d; \
movzbl RW2bh, RT1d; \
shrq $16, RW2; \
- xorq s8(, RT3, 8), to##2; \
- xorq s6(, RT1, 8), to##2; \
+ leaq s8(%rip), RT2; \
+ xorq (RT2, RT3, 8), to##2; \
+ leaq s6(%rip), RT2; \
+ xorq (RT2, RT1, 8), to##2; \
movzbl RW2bl, RT3d; \
movzbl RW2bh, RT1d; \
shrq $16, RW2; \
- xorq s4(, RT3, 8), to##2; \
- xorq s2(, RT1, 8), to##2; \
+ leaq s4(%rip), RT2; \
+ xorq (RT2, RT3, 8), to##2; \
+ leaq s2(%rip), RT2; \
+ xorq (RT2, RT1, 8), to##2; \
movzbl RW2bl, RT3d; \
movzbl RW2bh, RT1d; \
shrl $16, RW2d; \
- xorq s7(, RT3, 8), to##2; \
- xorq s5(, RT1, 8), to##2; \
+ leaq s7(%rip), RT2; \
+ xorq (RT2, RT3, 8), to##2; \
+ leaq s5(%rip), RT2; \
+ xorq (RT2, RT1, 8), to##2; \
movzbl RW2bl, RT3d; \
movzbl RW2bh, RT1d; \
do_movq(RW0, RW2); \
- xorq s3(, RT3, 8), to##2; \
- xorq s1(, RT1, 8), to##2;
+ leaq s3(%rip), RT2; \
+ xorq (RT2, RT3, 8), to##2; \
+ leaq s1(%rip), RT2; \
+ xorq (RT2, RT1, 8), to##2;
#define __movq(src, dst) \
movq src, dst;
diff --git a/arch/x86/crypto/ghash-clmulni-intel_asm.S b/arch/x86/crypto/ghash-clmulni-intel_asm.S
index f94375a8dcd1..d56a281221fb 100644
--- a/arch/x86/crypto/ghash-clmulni-intel_asm.S
+++ b/arch/x86/crypto/ghash-clmulni-intel_asm.S
@@ -97,7 +97,7 @@ ENTRY(clmul_ghash_mul)
FRAME_BEGIN
movups (%rdi), DATA
movups (%rsi), SHASH
- movaps .Lbswap_mask, BSWAP
+ movaps .Lbswap_mask(%rip), BSWAP
PSHUFB_XMM BSWAP DATA
call __clmul_gf128mul_ble
PSHUFB_XMM BSWAP DATA
@@ -114,7 +114,7 @@ ENTRY(clmul_ghash_update)
FRAME_BEGIN
cmp $16, %rdx
jb .Lupdate_just_ret # check length
- movaps .Lbswap_mask, BSWAP
+ movaps .Lbswap_mask(%rip), BSWAP
movups (%rdi), DATA
movups (%rcx), SHASH
PSHUFB_XMM BSWAP DATA
diff --git a/arch/x86/crypto/glue_helper-asm-avx.S b/arch/x86/crypto/glue_helper-asm-avx.S
index 02ee2308fb38..8a49ab1699ef 100644
--- a/arch/x86/crypto/glue_helper-asm-avx.S
+++ b/arch/x86/crypto/glue_helper-asm-avx.S
@@ -54,7 +54,7 @@
#define load_ctr_8way(iv, bswap, x0, x1, x2, x3, x4, x5, x6, x7, t0, t1, t2) \
vpcmpeqd t0, t0, t0; \
vpsrldq $8, t0, t0; /* low: -1, high: 0 */ \
- vmovdqa bswap, t1; \
+ vmovdqa bswap(%rip), t1; \
\
/* load IV and byteswap */ \
vmovdqu (iv), x7; \
@@ -99,7 +99,7 @@
#define load_xts_8way(iv, src, dst, x0, x1, x2, x3, x4, x5, x6, x7, tiv, t0, \
t1, xts_gf128mul_and_shl1_mask) \
- vmovdqa xts_gf128mul_and_shl1_mask, t0; \
+ vmovdqa xts_gf128mul_and_shl1_mask(%rip), t0; \
\
/* load IV */ \
vmovdqu (iv), tiv; \
diff --git a/arch/x86/crypto/glue_helper-asm-avx2.S b/arch/x86/crypto/glue_helper-asm-avx2.S
index a53ac11dd385..e04c80467bd2 100644
--- a/arch/x86/crypto/glue_helper-asm-avx2.S
+++ b/arch/x86/crypto/glue_helper-asm-avx2.S
@@ -67,7 +67,7 @@
vmovdqu (iv), t2x; \
vmovdqa t2x, t3x; \
inc_le128(t2x, t0x, t1x); \
- vbroadcasti128 bswap, t1; \
+ vbroadcasti128 bswap(%rip), t1; \
vinserti128 $1, t2x, t3, t2; /* ab: le0 ; cd: le1 */ \
vpshufb t1, t2, x0; \
\
@@ -124,13 +124,13 @@
tivx, t0, t0x, t1, t1x, t2, t2x, t3, \
xts_gf128mul_and_shl1_mask_0, \
xts_gf128mul_and_shl1_mask_1) \
- vbroadcasti128 xts_gf128mul_and_shl1_mask_0, t1; \
+ vbroadcasti128 xts_gf128mul_and_shl1_mask_0(%rip), t1; \
\
/* load IV and construct second IV */ \
vmovdqu (iv), tivx; \
vmovdqa tivx, t0x; \
gf128mul_x_ble(tivx, t1x, t2x); \
- vbroadcasti128 xts_gf128mul_and_shl1_mask_1, t2; \
+ vbroadcasti128 xts_gf128mul_and_shl1_mask_1(%rip), t2; \
vinserti128 $1, tivx, t0, tiv; \
vpxor (0*32)(src), tiv, x0; \
vmovdqu tiv, (0*32)(dst); \
--
2.13.2.932.g7449e964c-goog
Add a new _ASM_GET_PTR macro to fetch a symbol address. It will be used to
replace the "_ASM_MOV $<symbol>, %dst" construct, which is not compatible
with PIE.
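For example (a sketch only; the destination register is arbitrary), a 64-bit
construct such as:

        movq $init_thread_union, %rax   /* absolute reference, breaks with PIE */

can be replaced by:

        _ASM_GET_PTR(init_thread_union, %rax)

which expands to the RIP-relative "leaq init_thread_union(%rip), %rax" on
x86_64, while 32-bit keeps the absolute mov form.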
Signed-off-by: Thomas Garnier <[email protected]>
---
arch/x86/include/asm/asm.h | 13 +++++++++++++
1 file changed, 13 insertions(+)
diff --git a/arch/x86/include/asm/asm.h b/arch/x86/include/asm/asm.h
index 7a9df3beb89b..bf2842cfb583 100644
--- a/arch/x86/include/asm/asm.h
+++ b/arch/x86/include/asm/asm.h
@@ -55,6 +55,19 @@
# define CC_OUT(c) [_cc_ ## c] "=qm"
#endif
+/* Macros to get a global variable address with PIE support on 64-bit */
+#ifdef CONFIG_X86_32
+#define __ASM_GET_PTR_PRE(_src) __ASM_FORM_COMMA(movl $##_src)
+#else
+#ifdef __ASSEMBLY__
+#define __ASM_GET_PTR_PRE(_src) __ASM_FORM_COMMA(leaq (_src)(%rip))
+#else
+#define __ASM_GET_PTR_PRE(_src) __ASM_FORM_COMMA(leaq (_src)(%%rip))
+#endif
+#endif
+#define _ASM_GET_PTR(_src, _dst) \
+ __ASM_GET_PTR_PRE(_src) __ASM_FORM(_dst)
+
/* Exception table entry */
#ifdef __ASSEMBLY__
# define _ASM_EXTABLE_HANDLE(from, to, handler) \
--
2.13.2.932.g7449e964c-goog
Replace the %c constraint with %P. The %c constraint is incompatible with PIE
because it implies an immediate value, whereas %P references a symbol.
Position Independent Executable (PIE) support will allow extending the KASLR
randomization range below the -2G memory limit.
Signed-off-by: Thomas Garnier <[email protected]>
---
arch/x86/include/asm/bug.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/x86/include/asm/bug.h b/arch/x86/include/asm/bug.h
index 39e702d90cdb..2307e2aceb00 100644
--- a/arch/x86/include/asm/bug.h
+++ b/arch/x86/include/asm/bug.h
@@ -37,7 +37,7 @@ do { \
asm volatile("1:\t" ins "\n" \
".pushsection __bug_table,\"a\"\n" \
"2:\t" __BUG_REL(1b) "\t# bug_entry::bug_addr\n" \
- "\t" __BUG_REL(%c0) "\t# bug_entry::file\n" \
+ "\t" __BUG_REL(%P0) "\t# bug_entry::file\n" \
"\t.word %c1" "\t# bug_entry::line\n" \
"\t.word %c2" "\t# bug_entry::flags\n" \
"\t.org 2b+%c3\n" \
--
2.13.2.932.g7449e964c-goog
Change the assembly code to use the new _ASM_GET_PTR macro, which gets a
symbol address in a PIE-compatible way. Modify the RELOC macro, which was
using an assignment that generates a non-relative reference.
Position Independent Executable (PIE) support will allow extending the KASLR
randomization range below the -2G memory limit.
Signed-off-by: Thomas Garnier <[email protected]>
---
arch/x86/xen/xen-asm.h | 3 ++-
arch/x86/xen/xen-head.S | 9 +++++----
2 files changed, 7 insertions(+), 5 deletions(-)
diff --git a/arch/x86/xen/xen-asm.h b/arch/x86/xen/xen-asm.h
index 465276467a47..3b1c8a2e77d8 100644
--- a/arch/x86/xen/xen-asm.h
+++ b/arch/x86/xen/xen-asm.h
@@ -2,8 +2,9 @@
#define _XEN_XEN_ASM_H
#include <linux/linkage.h>
+#include <asm/asm.h>
-#define RELOC(x, v) .globl x##_reloc; x##_reloc=v
+#define RELOC(x, v) .globl x##_reloc; x##_reloc: _ASM_PTR v
#define ENDPATCH(x) .globl x##_end; x##_end=.
/* Pseudo-flag used for virtual NMI, which we don't implement yet */
diff --git a/arch/x86/xen/xen-head.S b/arch/x86/xen/xen-head.S
index 72a8e6adebe6..ab2462396bd8 100644
--- a/arch/x86/xen/xen-head.S
+++ b/arch/x86/xen/xen-head.S
@@ -23,14 +23,15 @@ ENTRY(startup_xen)
/* Clear .bss */
xor %eax,%eax
- mov $__bss_start, %_ASM_DI
- mov $__bss_stop, %_ASM_CX
+ _ASM_GET_PTR(__bss_start, %_ASM_DI)
+ _ASM_GET_PTR(__bss_stop, %_ASM_CX)
sub %_ASM_DI, %_ASM_CX
shr $__ASM_SEL(2, 3), %_ASM_CX
rep __ASM_SIZE(stos)
- mov %_ASM_SI, xen_start_info
- mov $init_thread_union+THREAD_SIZE, %_ASM_SP
+ _ASM_GET_PTR(xen_start_info, %_ASM_AX)
+ mov %_ASM_SI, (%_ASM_AX)
+ _ASM_GET_PTR(init_thread_union+THREAD_SIZE, %_ASM_SP)
jmp xen_start_kernel
--
2.13.2.932.g7449e964c-goog
Replace the %c constraint with %P. The %c constraint is incompatible with PIE
because it implies an immediate value, whereas %P references a symbol.
Position Independent Executable (PIE) support will allow extending the KASLR
randomization range below the -2G memory limit.
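The two "i" operands are merged into a single expression that folds the
branch flag into the low bit of the key address (a C sketch of what the new
operand computes, not additional code):

        entry = &((char *)key)[branch]; /* key + 0 or key + 1 */

The "X" constraint places no restriction on the operand, so the symbol
reference no longer has to be emitted as a PIE-incompatible immediate.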
Signed-off-by: Thomas Garnier <[email protected]>
---
arch/x86/include/asm/jump_label.h | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)
diff --git a/arch/x86/include/asm/jump_label.h b/arch/x86/include/asm/jump_label.h
index adc54c12cbd1..6e558e4524dc 100644
--- a/arch/x86/include/asm/jump_label.h
+++ b/arch/x86/include/asm/jump_label.h
@@ -36,9 +36,9 @@ static __always_inline bool arch_static_branch(struct static_key *key, bool bran
".byte " __stringify(STATIC_KEY_INIT_NOP) "\n\t"
".pushsection __jump_table, \"aw\" \n\t"
_ASM_ALIGN "\n\t"
- _ASM_PTR "1b, %l[l_yes], %c0 + %c1 \n\t"
+ _ASM_PTR "1b, %l[l_yes], %P0 \n\t"
".popsection \n\t"
- : : "i" (key), "i" (branch) : : l_yes);
+ : : "X" (&((char *)key)[branch]) : : l_yes);
return false;
l_yes:
@@ -52,9 +52,9 @@ static __always_inline bool arch_static_branch_jump(struct static_key *key, bool
"2:\n\t"
".pushsection __jump_table, \"aw\" \n\t"
_ASM_ALIGN "\n\t"
- _ASM_PTR "1b, %l[l_yes], %c0 + %c1 \n\t"
+ _ASM_PTR "1b, %l[l_yes], %P0 \n\t"
".popsection \n\t"
- : : "i" (key), "i" (branch) : : l_yes);
+ : : "X" (&((char *)key)[branch]) : : l_yes);
return false;
l_yes:
--
2.13.2.932.g7449e964c-goog
Change the assembly code to use only relative references to symbols so that
the kernel can be built as PIE. The new __ASM_GET_PTR_PRE macro is used to
get the address of a symbol on both 32 and 64-bit with PIE support.
Position Independent Executable (PIE) support will allow extending the KASLR
randomization range below the -2G memory limit.
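The fixup code below needs to push the address of a label; pushing it as an
immediate would create an absolute reference, so the following pattern is
used instead (a sketch of the sequence in the diff, with %rax as the scratch
register):

        pushq   %rax                    /* reserve a stack slot, save %rax */
        leaq    666b(%rip), %rax        /* PIE-compatible address of the label */
        xchgq   %rax, (%rsp)            /* restore %rax, leave the address on the stack */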
Signed-off-by: Thomas Garnier <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 6 ++++--
arch/x86/kernel/kvm.c | 6 ++++--
arch/x86/kvm/svm.c | 4 ++--
3 files changed, 10 insertions(+), 6 deletions(-)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 87ac4fba6d8e..3041201a3aeb 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1352,9 +1352,11 @@ asmlinkage void kvm_spurious_fault(void);
".pushsection .fixup, \"ax\" \n" \
"667: \n\t" \
cleanup_insn "\n\t" \
- "cmpb $0, kvm_rebooting \n\t" \
+ "cmpb $0, kvm_rebooting" __ASM_SEL(,(%%rip)) " \n\t" \
"jne 668b \n\t" \
- __ASM_SIZE(push) " $666b \n\t" \
+ __ASM_SIZE(push) "%%" _ASM_AX " \n\t" \
+ __ASM_GET_PTR_PRE(666b) "%%" _ASM_AX "\n\t" \
+ "xchg %%" _ASM_AX ", (%%" _ASM_SP ") \n\t" \
"call kvm_spurious_fault \n\t" \
".popsection \n\t" \
_ASM_EXTABLE(666b, 667b)
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 71c17a5be983..53b8ad162589 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -618,8 +618,10 @@ asm(
".global __raw_callee_save___kvm_vcpu_is_preempted;"
".type __raw_callee_save___kvm_vcpu_is_preempted, @function;"
"__raw_callee_save___kvm_vcpu_is_preempted:"
-"movq __per_cpu_offset(,%rdi,8), %rax;"
-"cmpb $0, " __stringify(KVM_STEAL_TIME_preempted) "+steal_time(%rax);"
+"leaq __per_cpu_offset(%rip), %rax;"
+"movq (%rax,%rdi,8), %rax;"
+"addq " __stringify(KVM_STEAL_TIME_preempted) "+steal_time(%rip), %rax;"
+"cmpb $0, (%rax);"
"setne %al;"
"ret;"
".popsection");
diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index 4d8141e533c3..8b718c6d6729 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -554,12 +554,12 @@ static u32 svm_msrpm_offset(u32 msr)
static inline void clgi(void)
{
- asm volatile (__ex(SVM_CLGI));
+ asm volatile (__ex(SVM_CLGI) : :);
}
static inline void stgi(void)
{
- asm volatile (__ex(SVM_STGI));
+ asm volatile (__ex(SVM_STGI) : :);
}
static inline void invlpga(unsigned long addr, u32 asid)
--
2.13.2.932.g7449e964c-goog
Change the assembly code to use only relative references to symbols so that
the kernel can be built as PIE.
Position Independent Executable (PIE) support will allow extending the KASLR
randomization range below the -2G memory limit.
Signed-off-by: Thomas Garnier <[email protected]>
---
arch/x86/entry/entry_64.S | 22 +++++++++++++++-------
1 file changed, 15 insertions(+), 7 deletions(-)
diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index a9a8027a6c0e..691c4755269b 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -195,12 +195,15 @@ entry_SYSCALL_64_fastpath:
ja 1f /* return -ENOSYS (already in pt_regs->ax) */
movq %r10, %rcx
+ /* Ensures the call is position independent */
+ leaq sys_call_table(%rip), %r11
+
/*
* This call instruction is handled specially in stub_ptregs_64.
* It might end up jumping to the slow path. If it jumps, RAX
* and all argument registers are clobbered.
*/
- call *sys_call_table(, %rax, 8)
+ call *(%r11, %rax, 8)
.Lentry_SYSCALL_64_after_fastpath_call:
movq %rax, RAX(%rsp)
@@ -333,7 +336,8 @@ ENTRY(stub_ptregs_64)
* RAX stores a pointer to the C function implementing the syscall.
* IRQs are on.
*/
- cmpq $.Lentry_SYSCALL_64_after_fastpath_call, (%rsp)
+ leaq .Lentry_SYSCALL_64_after_fastpath_call(%rip), %r11
+ cmpq %r11, (%rsp)
jne 1f
/*
@@ -1109,7 +1113,8 @@ ENTRY(error_entry)
movl %ecx, %eax /* zero extend */
cmpq %rax, RIP+8(%rsp)
je .Lbstep_iret
- cmpq $.Lgs_change, RIP+8(%rsp)
+ leaq .Lgs_change(%rip), %rcx
+ cmpq %rcx, RIP+8(%rsp)
jne .Lerror_entry_done
/*
@@ -1324,10 +1329,10 @@ ENTRY(nmi)
* resume the outer NMI.
*/
- movq $repeat_nmi, %rdx
+ leaq repeat_nmi(%rip), %rdx
cmpq 8(%rsp), %rdx
ja 1f
- movq $end_repeat_nmi, %rdx
+ leaq end_repeat_nmi(%rip), %rdx
cmpq 8(%rsp), %rdx
ja nested_nmi_out
1:
@@ -1381,7 +1386,8 @@ nested_nmi:
pushq %rdx
pushfq
pushq $__KERNEL_CS
- pushq $repeat_nmi
+ leaq repeat_nmi(%rip), %rdx
+ pushq %rdx
/* Put stack back */
addq $(6*8), %rsp
@@ -1419,7 +1425,9 @@ first_nmi:
addq $8, (%rsp) /* Fix up RSP */
pushfq /* RFLAGS */
pushq $__KERNEL_CS /* CS */
- pushq $1f /* RIP */
+ pushq %rax /* Support Position Independent Code */
+ leaq 1f(%rip), %rax /* RIP */
+ xchgq %rax, (%rsp) /* Restore RAX, put 1f */
INTERRUPT_RETURN /* continues at repeat_nmi below */
1:
#endif
--
2.13.2.932.g7449e964c-goog
Change the assembly code to use only relative references to symbols so that
the kernel can be built as PIE.
Position Independent Executable (PIE) support will allow extending the KASLR
randomization range below the -2G memory limit.
Signed-off-by: Thomas Garnier <[email protected]>
---
arch/x86/kernel/relocate_kernel_64.S | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/x86/kernel/relocate_kernel_64.S b/arch/x86/kernel/relocate_kernel_64.S
index 98111b38ebfd..da817d1628ac 100644
--- a/arch/x86/kernel/relocate_kernel_64.S
+++ b/arch/x86/kernel/relocate_kernel_64.S
@@ -186,7 +186,7 @@ identity_mapped:
movq %rax, %cr3
lea PAGE_SIZE(%r8), %rsp
call swap_pages
- movq $virtual_mapped, %rax
+ leaq virtual_mapped(%rip), %rax
pushq %rax
ret
--
2.13.2.932.g7449e964c-goog
Change the assembly code to use only relative references to symbols so that
the kernel can be built as PIE.
Early at boot, the kernel is mapped at a temporary address while the page
table is prepared. To know the changes needed for the page table with KASLR,
the boot code calculates the difference between the expected address of the
kernel and the one chosen by KASLR. This does not work with PIE because all
symbol references in code are relative: instead of getting the future
relocated virtual address, you get the current temporary mapping. The
solution is to use global variables, which are relocated as expected.
Position Independent Executable (PIE) support will allow extending the KASLR
randomization range below the -2G memory limit.
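A minimal sketch of the difference (symbols as in the diff below): a relative
reference such as

        leaq early_top_pgt(%rip), %rax

resolves to wherever the code is currently running (the temporary mapping),
while a 64-bit value stored in a global variable,

        _early_top_pgt_offset:
                .quad early_top_pgt - __START_KERNEL_map
        ...
                movq _early_top_pgt_offset(%rip), %rax

is patched by the relocation pass and therefore reflects the final kernel
mapping.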
Signed-off-by: Thomas Garnier <[email protected]>
---
arch/x86/kernel/head_64.S | 32 ++++++++++++++++++++++++--------
1 file changed, 24 insertions(+), 8 deletions(-)
diff --git a/arch/x86/kernel/head_64.S b/arch/x86/kernel/head_64.S
index 6225550883df..7e4f7a83a15a 100644
--- a/arch/x86/kernel/head_64.S
+++ b/arch/x86/kernel/head_64.S
@@ -78,8 +78,23 @@ startup_64:
call __startup_64
popq %rsi
- movq $(early_top_pgt - __START_KERNEL_map), %rax
+ movq _early_top_pgt_offset(%rip), %rax
jmp 1f
+
+ /*
+ * Position Independent Code takes only relative references in code
+ * meaning a global variable address is relative to RIP and not its
+ * future virtual address. Global variables can be used instead as they
+ * are still relocated on the expected kernel mapping address.
+ */
+ .align 8
+_early_top_pgt_offset:
+ .quad early_top_pgt - __START_KERNEL_map
+_init_top_offset:
+ .quad init_top_pgt - __START_KERNEL_map
+_va_jump:
+ .quad 2f
+
ENTRY(secondary_startup_64)
/*
* At this point the CPU runs in 64bit mode CS.L = 1 CS.D = 0,
@@ -98,7 +113,8 @@ ENTRY(secondary_startup_64)
/* Sanitize CPU configuration */
call verify_cpu
- movq $(init_top_pgt - __START_KERNEL_map), %rax
+ movq _init_top_offset(%rip), %rax
+
1:
/* Enable PAE mode, PGE and LA57 */
@@ -113,9 +129,8 @@ ENTRY(secondary_startup_64)
movq %rax, %cr3
/* Ensure I am executing from virtual addresses */
- movq $1f, %rax
- jmp *%rax
-1:
+ jmp *_va_jump(%rip)
+2:
/* Check if nx is implemented */
movl $0x80000001, %eax
@@ -211,11 +226,12 @@ ENTRY(secondary_startup_64)
* REX.W + FF /5 JMP m16:64 Jump far, absolute indirect,
* address given in m16:64.
*/
- pushq $.Lafter_lret # put return address on stack for unwinder
+ leaq .Lafter_lret(%rip), %rax
+ pushq %rax # put return address on stack for unwinder
xorq %rbp, %rbp # clear frame pointer
- movq initial_code(%rip), %rax
+ leaq initial_code(%rip), %rax
pushq $__KERNEL_CS # set correct cs
- pushq %rax # target address in negative space
+ pushq (%rax) # target address in negative space
lretq
.Lafter_lret:
ENDPROC(secondary_startup_64)
--
2.13.2.932.g7449e964c-goog
Change the assembly code to use only relative references to symbols so that
the kernel can be built as PIE.
Position Independent Executable (PIE) support will allow extending the KASLR
randomization range below the -2G memory limit.
Signed-off-by: Thomas Garnier <[email protected]>
---
arch/x86/power/hibernate_asm_64.S | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/arch/x86/power/hibernate_asm_64.S b/arch/x86/power/hibernate_asm_64.S
index ce8da3a0412c..6fdd7bbc3c33 100644
--- a/arch/x86/power/hibernate_asm_64.S
+++ b/arch/x86/power/hibernate_asm_64.S
@@ -24,7 +24,7 @@
#include <asm/frame.h>
ENTRY(swsusp_arch_suspend)
- movq $saved_context, %rax
+ leaq saved_context(%rip), %rax
movq %rsp, pt_regs_sp(%rax)
movq %rbp, pt_regs_bp(%rax)
movq %rsi, pt_regs_si(%rax)
@@ -115,7 +115,7 @@ ENTRY(restore_registers)
movq %rax, %cr4; # turn PGE back on
/* We don't restore %rax, it must be 0 anyway */
- movq $saved_context, %rax
+ leaq saved_context(%rip), %rax
movq pt_regs_sp(%rax), %rsp
movq pt_regs_bp(%rax), %rbp
movq pt_regs_si(%rax), %rsi
--
2.13.2.932.g7449e964c-goog
Change the assembly to use the new _ASM_GET_PTR macro instead of _ASM_MOV so
that it is PIE compatible.
Position Independent Executable (PIE) support will allow extending the KASLR
randomization range below the -2G memory limit.
Signed-off-by: Thomas Garnier <[email protected]>
---
arch/x86/include/asm/pm-trace.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/x86/include/asm/pm-trace.h b/arch/x86/include/asm/pm-trace.h
index 7b7ac42c3661..a3801261f0dd 100644
--- a/arch/x86/include/asm/pm-trace.h
+++ b/arch/x86/include/asm/pm-trace.h
@@ -7,7 +7,7 @@
do { \
if (pm_trace_enabled) { \
const void *tracedata; \
- asm volatile(_ASM_MOV " $1f,%0\n" \
+ asm volatile(_ASM_GET_PTR(1f, %0) "\n" \
".section .tracedata,\"a\"\n" \
"1:\t.word %c1\n\t" \
_ASM_PTR " %c2\n" \
--
2.13.2.932.g7449e964c-goog
Change the assembly code to use only relative references to symbols so that
the kernel can be built as PIE.
Position Independent Executable (PIE) support will allow extending the KASLR
randomization range below the -2G memory limit.
Signed-off-by: Thomas Garnier <[email protected]>
---
arch/x86/kernel/acpi/wakeup_64.S | 31 ++++++++++++++++---------------
1 file changed, 16 insertions(+), 15 deletions(-)
diff --git a/arch/x86/kernel/acpi/wakeup_64.S b/arch/x86/kernel/acpi/wakeup_64.S
index 50b8ed0317a3..472659c0f811 100644
--- a/arch/x86/kernel/acpi/wakeup_64.S
+++ b/arch/x86/kernel/acpi/wakeup_64.S
@@ -14,7 +14,7 @@
* Hooray, we are in Long 64-bit mode (but still running in low memory)
*/
ENTRY(wakeup_long64)
- movq saved_magic, %rax
+ movq saved_magic(%rip), %rax
movq $0x123456789abcdef0, %rdx
cmpq %rdx, %rax
jne bogus_64_magic
@@ -25,14 +25,14 @@ ENTRY(wakeup_long64)
movw %ax, %es
movw %ax, %fs
movw %ax, %gs
- movq saved_rsp, %rsp
+ movq saved_rsp(%rip), %rsp
- movq saved_rbx, %rbx
- movq saved_rdi, %rdi
- movq saved_rsi, %rsi
- movq saved_rbp, %rbp
+ movq saved_rbx(%rip), %rbx
+ movq saved_rdi(%rip), %rdi
+ movq saved_rsi(%rip), %rsi
+ movq saved_rbp(%rip), %rbp
- movq saved_rip, %rax
+ movq saved_rip(%rip), %rax
jmp *%rax
ENDPROC(wakeup_long64)
@@ -45,7 +45,7 @@ ENTRY(do_suspend_lowlevel)
xorl %eax, %eax
call save_processor_state
- movq $saved_context, %rax
+ leaq saved_context(%rip), %rax
movq %rsp, pt_regs_sp(%rax)
movq %rbp, pt_regs_bp(%rax)
movq %rsi, pt_regs_si(%rax)
@@ -64,13 +64,14 @@ ENTRY(do_suspend_lowlevel)
pushfq
popq pt_regs_flags(%rax)
- movq $.Lresume_point, saved_rip(%rip)
+ leaq .Lresume_point(%rip), %rax
+ movq %rax, saved_rip(%rip)
- movq %rsp, saved_rsp
- movq %rbp, saved_rbp
- movq %rbx, saved_rbx
- movq %rdi, saved_rdi
- movq %rsi, saved_rsi
+ movq %rsp, saved_rsp(%rip)
+ movq %rbp, saved_rbp(%rip)
+ movq %rbx, saved_rbx(%rip)
+ movq %rdi, saved_rdi(%rip)
+ movq %rsi, saved_rsi(%rip)
addq $8, %rsp
movl $3, %edi
@@ -82,7 +83,7 @@ ENTRY(do_suspend_lowlevel)
.align 4
.Lresume_point:
/* We don't restore %rax, it must be 0 anyway */
- movq $saved_context, %rax
+ leaq saved_context(%rip), %rax
movq saved_context_cr4(%rax), %rbx
movq %rbx, %cr4
movq saved_context_cr3(%rax), %rbx
--
2.13.2.932.g7449e964c-goog
Change the assembly code to use only relative references to symbols so that
the kernel can be built as PIE. Use the new _ASM_GET_PTR macro instead of
the 'mov $symbol, %dst' construct to avoid an absolute reference.
Position Independent Executable (PIE) support will allow extending the KASLR
randomization range below the -2G memory limit.
Signed-off-by: Thomas Garnier <[email protected]>
---
arch/x86/include/asm/processor.h | 8 +++++---
1 file changed, 5 insertions(+), 3 deletions(-)
diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index 028245e1c42b..1adea8c4436e 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -47,7 +47,7 @@ static inline void *current_text_addr(void)
{
void *pc;
- asm volatile("mov $1f, %0; 1:":"=r" (pc));
+ asm volatile(_ASM_GET_PTR(1f, %0) "; 1:":"=r" (pc));
return pc;
}
@@ -682,6 +682,7 @@ static inline void sync_core(void)
: "+r" (__sp) : : "memory");
#else
unsigned int tmp;
+ unsigned long tmp2;
asm volatile (
"mov %%ss, %0\n\t"
@@ -691,10 +692,11 @@ static inline void sync_core(void)
"pushfq\n\t"
"mov %%cs, %0\n\t"
"pushq %q0\n\t"
- "pushq $1f\n\t"
+ "leaq 1f(%%rip), %1\n\t"
+ "pushq %1\n\t"
"iretq\n\t"
"1:"
- : "=&r" (tmp), "+r" (__sp) : : "cc", "memory");
+ : "=&r" (tmp), "=&r" (tmp2), "+r" (__sp) : : "cc", "memory");
#endif
}
--
2.13.2.932.g7449e964c-goog
By default, PIE-generated code creates only relative references, so _text
points to the temporary virtual address. Instead, use a global variable so
the relocation is done as expected.
Position Independent Executable (PIE) support will allow extending the KASLR
randomization range below the -2G memory limit.
Signed-off-by: Thomas Garnier <[email protected]>
---
arch/x86/kernel/head64.c | 10 ++++++++--
1 file changed, 8 insertions(+), 2 deletions(-)
diff --git a/arch/x86/kernel/head64.c b/arch/x86/kernel/head64.c
index 46c3c73e7f43..4103e90ff128 100644
--- a/arch/x86/kernel/head64.c
+++ b/arch/x86/kernel/head64.c
@@ -45,7 +45,13 @@ static void __head *fixup_pointer(void *ptr, unsigned long physaddr)
return ptr - (void *)_text + (void *)physaddr;
}
-void __head __startup_64(unsigned long physaddr)
+/*
+ * Use a global variable to properly calculate _text delta on PIE. By default
+ * a PIE binary does a RIP-relative difference instead of the relocated address.
+ */
+unsigned long _text_offset = (unsigned long)(_text - __START_KERNEL_map);
+
+void __head notrace __startup_64(unsigned long physaddr)
{
unsigned long load_delta, *p;
pgdval_t *pgd;
@@ -62,7 +68,7 @@ void __head __startup_64(unsigned long physaddr)
* Compute the delta between the address I am compiled to run at
* and the address I am actually running at.
*/
- load_delta = physaddr - (unsigned long)(_text - __START_KERNEL_map);
+ load_delta = physaddr - _text_offset;
/* Is the address not 2M aligned? */
if (load_delta & ~PMD_PAGE_MASK)
--
2.13.2.932.g7449e964c-goog
If PIE is enabled, switch the paravirt assembly constraints to be compatible.
The %c/i constraints generate smaller code, so they are kept by default.
Position Independent Executable (PIE) support will allow extending the KASLR
randomization range below the -2G memory limit.
Signed-off-by: Thomas Garnier <[email protected]>
---
arch/x86/include/asm/paravirt_types.h | 12 ++++++++++--
1 file changed, 10 insertions(+), 2 deletions(-)
diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h
index 9ffc36bfe4cd..6f67c10672ec 100644
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -347,9 +347,17 @@ extern struct pv_lock_ops pv_lock_ops;
#define PARAVIRT_PATCH(x) \
(offsetof(struct paravirt_patch_template, x) / sizeof(void *))
+#ifdef CONFIG_X86_PIE
+#define paravirt_opptr_call "a"
+#define paravirt_opptr_type "p"
+#else
+#define paravirt_opptr_call "c"
+#define paravirt_opptr_type "i"
+#endif
+
#define paravirt_type(op) \
[paravirt_typenum] "i" (PARAVIRT_PATCH(op)), \
- [paravirt_opptr] "i" (&(op))
+ [paravirt_opptr] paravirt_opptr_type (&(op))
#define paravirt_clobber(clobber) \
[paravirt_clobber] "i" (clobber)
@@ -403,7 +411,7 @@ int paravirt_disable_iospace(void);
* offset into the paravirt_patch_template structure, and can therefore be
* freely converted back into a structure offset.
*/
-#define PARAVIRT_CALL "call *%c[paravirt_opptr];"
+#define PARAVIRT_CALL "call *%" paravirt_opptr_call "[paravirt_opptr];"
/*
* These macros are intended to wrap calls through one of the paravirt
--
2.13.2.932.g7449e964c-goog
Add the CONFIG_X86_PIE option which builds the kernel as a Position
Independent Executable (PIE). The kernel is currently built with the
mcmodel=kernel option, which forces it to stay in the top 2G of the
virtual address space. With PIE, the kernel will be able to move below
the -2G limit, increasing the KASLR range from 1GB to 3GB.
The modules do not support PIE due to how they are linked. Disable PIE
for them and default to mcmodel=kernel for now.
The PIE configuration is not yet compatible with XEN_PVH. Xen PVH
generates 32-bit assembly and uses a long jump to transition to 64-bit.
A long jump requires an absolute reference that is not compatible with
PIE.
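To illustrate the code-generation difference driven by the new compiler
flags (a sketch only; exact output depends on the compiler version):

        /* foo.c */
        int g;
        int *addr_of_g(void) { return &g; }

        /* -mcmodel=kernel: absolute, sign-extended 32-bit reference */
        movq $g, %rax

        /* -fPIC (with hidden visibility, as selected by DEFAULT_HIDDEN):
         * RIP-relative reference */
        leaq g(%rip), %rax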
Performance/Size impact:
Hackbench (50% and 1600% loads):
- PIE disabled: no significant change (-0.50% / +0.50%)
- PIE enabled: 7% to 8% overhead on half load, 10% on heavy load.
These results are in line with existing research on user-mode PIE impact on
CPU-intensive benchmarks (around 10% on x86_64).
slab_test (average of 10 runs):
- PIE disabled: no significant change (-1% / +1%)
- PIE enabled: 3% to 4%
Kernbench (average of 10 Half and Optimal runs):
Elapsed Time:
- PIE disabled: no significant change (-0.22% / +0.06%)
- PIE enabled: around 0.50%
System Time:
- PIE disabled: no significant change (-0.99% / -1.28%)
- PIE enabled: 5% to 6%
Size of vmlinux (Ubuntu configuration):
File size:
- PIE disabled: 472928672 bytes (-0.000169% from baseline)
- PIE enabled: 216878461 bytes (-54.14% from baseline)
.text sections:
- PIE disabled: 9373572 bytes (+0.04% from baseline)
- PIE enabled: 9499138 bytes (+1.38% from baseline)
The big decrease in vmlinux file size is due to the lower number of
relocations appended to the file.
Signed-off-by: Thomas Garnier <[email protected]>
---
arch/x86/Kconfig | 6 ++++++
arch/x86/Makefile | 9 +++++++++
2 files changed, 15 insertions(+)
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 781521b7cf9e..b26ee6751021 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -2080,6 +2080,12 @@ config RANDOMIZE_MEMORY_PHYSICAL_PADDING
If unsure, leave at the default value.
+config X86_PIE
+ bool
+ depends on X86_64 && !XEN_PVH
+ select DEFAULT_HIDDEN
+ select MODULE_REL_CRCS if MODVERSIONS
+
config HOTPLUG_CPU
bool "Support for hot-pluggable CPUs"
depends on SMP
diff --git a/arch/x86/Makefile b/arch/x86/Makefile
index 1e902f926be3..452a9621af8f 100644
--- a/arch/x86/Makefile
+++ b/arch/x86/Makefile
@@ -45,8 +45,12 @@ export REALMODE_CFLAGS
export BITS
ifdef CONFIG_X86_NEED_RELOCS
+ifdef CONFIG_X86_PIE
+ LDFLAGS_vmlinux := -pie -shared -Bsymbolic
+else
LDFLAGS_vmlinux := --emit-relocs
endif
+endif
#
# Prevent GCC from generating any FP code by mistake.
@@ -132,7 +136,12 @@ else
KBUILD_CFLAGS += $(cflags-y)
KBUILD_CFLAGS += -mno-red-zone
+ifdef CONFIG_X86_PIE
+ KBUILD_CFLAGS += -fPIC
+ KBUILD_CFLAGS_MODULE += -fno-PIC -mcmodel=kernel
+else
KBUILD_CFLAGS += -mcmodel=kernel
+endif
# -funit-at-a-time shrinks the kernel .text considerably
# unfortunately it makes reading oopses harder.
--
2.13.2.932.g7449e964c-goog
Provide an option to default visibility to hidden except for key
symbols. This option is disabled by default and will be used by x86_64
PIE support to remove errors between compilation units.
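As a rough sketch of what the option means per translation unit
(illustrative only, the symbol names are hypothetical): once the default
visibility is hidden, the compiler can use direct PC-relative references
under -fPIC instead of going through a GOT, and the few symbols that must
stay visible across units are annotated back to default visibility, which
is what the __default_visibility marker added below provides.

#pragma GCC visibility push(hidden)

int demo_internal_state;	/* hidden by default: direct reference, no GOT */

/* opt a key symbol back out, as __default_visibility does in the patch */
extern char demo_end[] __attribute__((visibility("default")));

#pragma GCC visibility pop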
Signed-off-by: Thomas Garnier <[email protected]>
---
arch/x86/boot/boot.h | 2 +-
arch/x86/include/asm/setup.h | 2 +-
include/asm-generic/sections.h | 6 ++++++
include/linux/compiler.h | 8 ++++++++
init/Kconfig | 7 +++++++
kernel/kallsyms.c | 16 ++++++++--------
6 files changed, 31 insertions(+), 10 deletions(-)
diff --git a/arch/x86/boot/boot.h b/arch/x86/boot/boot.h
index ef5a9cc66fb8..d726c35bdd96 100644
--- a/arch/x86/boot/boot.h
+++ b/arch/x86/boot/boot.h
@@ -193,7 +193,7 @@ static inline bool memcmp_gs(const void *s1, addr_t s2, size_t len)
}
/* Heap -- available for dynamic lists. */
-extern char _end[];
+extern char _end[] __default_visibility;
extern char *HEAP;
extern char *heap_end;
#define RESET_HEAP() ((void *)( HEAP = _end ))
diff --git a/arch/x86/include/asm/setup.h b/arch/x86/include/asm/setup.h
index e4585a393965..f3ffad82bdc0 100644
--- a/arch/x86/include/asm/setup.h
+++ b/arch/x86/include/asm/setup.h
@@ -66,7 +66,7 @@ static inline void x86_ce4100_early_setup(void) { }
* This is set up by the setup-routine at boot-time
*/
extern struct boot_params boot_params;
-extern char _text[];
+extern char _text[] __default_visibility;
static inline bool kaslr_enabled(void)
{
diff --git a/include/asm-generic/sections.h b/include/asm-generic/sections.h
index 532372c6cf15..27c12f6dd6e2 100644
--- a/include/asm-generic/sections.h
+++ b/include/asm-generic/sections.h
@@ -28,6 +28,9 @@
* __entry_text_start, __entry_text_end
* __ctors_start, __ctors_end
*/
+#ifdef CONFIG_DEFAULT_HIDDEN
+#pragma GCC visibility push(default)
+#endif
extern char _text[], _stext[], _etext[];
extern char _data[], _sdata[], _edata[];
extern char __bss_start[], __bss_stop[];
@@ -42,6 +45,9 @@ extern char __start_rodata[], __end_rodata[];
/* Start and end of .ctors section - used for constructor calls. */
extern char __ctors_start[], __ctors_end[];
+#ifdef CONFIG_DEFAULT_HIDDEN
+#pragma GCC visibility pop
+#endif
extern __visible const void __nosave_begin, __nosave_end;
diff --git a/include/linux/compiler.h b/include/linux/compiler.h
index eca8ad75e28b..876b827fe4a7 100644
--- a/include/linux/compiler.h
+++ b/include/linux/compiler.h
@@ -78,6 +78,14 @@ extern void __chk_io_ptr(const volatile void __iomem *);
#include <linux/compiler-clang.h>
#endif
+/* Useful for Position Independent Code to reduce global references */
+#ifdef CONFIG_DEFAULT_HIDDEN
+#pragma GCC visibility push(hidden)
+#define __default_visibility __attribute__((visibility ("default")))
+#else
+#define __default_visibility
+#endif
+
/*
* Generic compiler-dependent macros required for kernel
* build go below this comment. Actual compiler/compiler version
diff --git a/init/Kconfig b/init/Kconfig
index 4fb5d6fc2c4f..a93626d40355 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1635,6 +1635,13 @@ config PROFILING
config TRACEPOINTS
bool
+#
+# Default to hidden visibility for all symbols.
+# Useful for Position Independent Code to reduce global references.
+#
+config DEFAULT_HIDDEN
+ bool
+
source "arch/Kconfig"
endmenu # General setup
diff --git a/kernel/kallsyms.c b/kernel/kallsyms.c
index 127e7cfafa55..252019c8c3a9 100644
--- a/kernel/kallsyms.c
+++ b/kernel/kallsyms.c
@@ -32,24 +32,24 @@
* These will be re-linked against their real values
* during the second link stage.
*/
-extern const unsigned long kallsyms_addresses[] __weak;
-extern const int kallsyms_offsets[] __weak;
-extern const u8 kallsyms_names[] __weak;
+extern const unsigned long kallsyms_addresses[] __weak __default_visibility;
+extern const int kallsyms_offsets[] __weak __default_visibility;
+extern const u8 kallsyms_names[] __weak __default_visibility;
/*
* Tell the compiler that the count isn't in the small data section if the arch
* has one (eg: FRV).
*/
extern const unsigned long kallsyms_num_syms
-__attribute__((weak, section(".rodata")));
+__attribute__((weak, section(".rodata"))) __default_visibility;
extern const unsigned long kallsyms_relative_base
-__attribute__((weak, section(".rodata")));
+__attribute__((weak, section(".rodata"))) __default_visibility;
-extern const u8 kallsyms_token_table[] __weak;
-extern const u16 kallsyms_token_index[] __weak;
+extern const u8 kallsyms_token_table[] __weak __default_visibility;
+extern const u16 kallsyms_token_index[] __weak __default_visibility;
-extern const unsigned long kallsyms_markers[] __weak;
+extern const unsigned long kallsyms_markers[] __weak __default_visibility;
static inline int is_kernel_inittext(unsigned long addr)
{
--
2.13.2.932.g7449e964c-goog
Add a new CONFIG_RANDOMIZE_BASE_LARGE option to benefit from PIE
support. It increases the KASLR range from 1GB to 3GB. The new range
starts at 0xffffffff00000000, just above the EFI memory region. This
option is off by default.
The boot code is adapted to create the appropriate page table spanning
three PUD pages.
The relocation table uses 64-bit integers generated with the updated
relocation tool with the large-reloc option.
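As a quick standalone check of the three-PUD-page claim above (a sketch
for illustration, not kernel code), the pud_count() helper added by this
patch rounds the image size up to a 1GB PUD boundary and counts entries:

#include <stdio.h>

#define PUD_SHIFT 30
#define PUD_SIZE  (1UL << PUD_SHIFT)
/* same expression as the pud_count() macro added to head64.c/head_64.S */
#define pud_count(x) (((x + (PUD_SIZE - 1)) & ~(PUD_SIZE - 1)) >> PUD_SHIFT)

int main(void)
{
	unsigned long image_size = 3UL * 1024 * 1024 * 1024; /* KERNEL_IMAGE_SIZE */

	printf("%lu\n", pud_count(image_size)); /* prints 3 */
	return 0;
}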
Signed-off-by: Thomas Garnier <[email protected]>
---
arch/x86/Kconfig | 21 +++++++++++++++++++++
arch/x86/boot/compressed/Makefile | 5 +++++
arch/x86/boot/compressed/misc.c | 10 +++++++++-
arch/x86/include/asm/page_64_types.h | 9 +++++++++
arch/x86/kernel/head64.c | 18 ++++++++++++++----
arch/x86/kernel/head_64.S | 11 ++++++++++-
6 files changed, 68 insertions(+), 6 deletions(-)
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 60d161391d5a..8054eef76dfc 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -2096,6 +2096,27 @@ config X86_MODULE_PLTS
select X86_MODULE_MODEL_LARGE
select HAVE_MOD_ARCH_SPECIFIC
+config RANDOMIZE_BASE_LARGE
+ bool "Increase the randomization range of the kernel image"
+ depends on X86_64 && RANDOMIZE_BASE
+ select X86_PIE
+ select X86_MODULE_PLTS if MODULES
+ default n
+ ---help---
+ Build the kernel as a Position Independent Executable (PIE) and
+ increase the available randomization range from 1GB to 3GB.
+
+ This option impacts performance on kernel CPU intensive workloads up
+ to 10% due to PIE generated code. Impact on user-mode processes and
+ typical usage would be significantly less (0.50% when you build the
+ kernel).
+
+ The kernel and modules will generate slightly more assembly (1 to 2%
+ increase on the .text sections). The vmlinux binary will be
+ significantly smaller due to fewer relocations.
+
+ If unsure, say N.
+
config HOTPLUG_CPU
bool "Support for hot-pluggable CPUs"
depends on SMP
diff --git a/arch/x86/boot/compressed/Makefile b/arch/x86/boot/compressed/Makefile
index 2c860ad4fe06..8f4317864e98 100644
--- a/arch/x86/boot/compressed/Makefile
+++ b/arch/x86/boot/compressed/Makefile
@@ -111,7 +111,12 @@ $(obj)/vmlinux.bin: vmlinux FORCE
targets += $(patsubst $(obj)/%,%,$(vmlinux-objs-y)) vmlinux.bin.all vmlinux.relocs
+# Large randomization requires a bigger relocation table
+ifeq ($(CONFIG_RANDOMIZE_BASE_LARGE),y)
+CMD_RELOCS = arch/x86/tools/relocs --large-reloc
+else
CMD_RELOCS = arch/x86/tools/relocs
+endif
quiet_cmd_relocs = RELOCS $@
cmd_relocs = $(CMD_RELOCS) $< > $@;$(CMD_RELOCS) --abs-relocs $<
$(obj)/vmlinux.relocs: vmlinux FORCE
diff --git a/arch/x86/boot/compressed/misc.c b/arch/x86/boot/compressed/misc.c
index a0838ab929f2..0a0c80ab1842 100644
--- a/arch/x86/boot/compressed/misc.c
+++ b/arch/x86/boot/compressed/misc.c
@@ -170,10 +170,18 @@ void __puthex(unsigned long value)
}
#if CONFIG_X86_NEED_RELOCS
+
+/* Large randomization goes lower than -2G and uses a large relocation table */
+#ifdef CONFIG_RANDOMIZE_BASE_LARGE
+typedef long rel_t;
+#else
+typedef int rel_t;
+#endif
+
static void handle_relocations(void *output, unsigned long output_len,
unsigned long virt_addr)
{
- int *reloc;
+ rel_t *reloc;
unsigned long delta, map, ptr;
unsigned long min_addr = (unsigned long)output;
unsigned long max_addr = min_addr + (VO___bss_start - VO__text);
diff --git a/arch/x86/include/asm/page_64_types.h b/arch/x86/include/asm/page_64_types.h
index 3f5f08b010d0..6b65f846dd64 100644
--- a/arch/x86/include/asm/page_64_types.h
+++ b/arch/x86/include/asm/page_64_types.h
@@ -48,7 +48,11 @@
#define __PAGE_OFFSET __PAGE_OFFSET_BASE
#endif /* CONFIG_RANDOMIZE_MEMORY */
+#ifdef CONFIG_RANDOMIZE_BASE_LARGE
+#define __START_KERNEL_map _AC(0xffffffff00000000, UL)
+#else
#define __START_KERNEL_map _AC(0xffffffff80000000, UL)
+#endif /* CONFIG_RANDOMIZE_BASE_LARGE */
/* See Documentation/x86/x86_64/mm.txt for a description of the memory map. */
#ifdef CONFIG_X86_5LEVEL
@@ -65,9 +69,14 @@
* 512MiB by default, leaving 1.5GiB for modules once the page tables
* are fully set up. If kernel ASLR is configured, it can extend the
* kernel page table mapping, reducing the size of the modules area.
+ * On PIE, we relocate the binary 2G lower so add this extra space.
*/
#if defined(CONFIG_RANDOMIZE_BASE)
+#ifdef CONFIG_RANDOMIZE_BASE_LARGE
+#define KERNEL_IMAGE_SIZE (_AC(3, UL) * 1024 * 1024 * 1024)
+#else
#define KERNEL_IMAGE_SIZE (1024 * 1024 * 1024)
+#endif
#else
#define KERNEL_IMAGE_SIZE (512 * 1024 * 1024)
#endif
diff --git a/arch/x86/kernel/head64.c b/arch/x86/kernel/head64.c
index 4103e90ff128..235c3f7b46c7 100644
--- a/arch/x86/kernel/head64.c
+++ b/arch/x86/kernel/head64.c
@@ -39,6 +39,7 @@ static unsigned int __initdata next_early_pgt;
pmdval_t early_pmd_flags = __PAGE_KERNEL_LARGE & ~(_PAGE_GLOBAL | _PAGE_NX);
#define __head __section(.head.text)
+#define pud_count(x) (((x + (PUD_SIZE - 1)) & ~(PUD_SIZE - 1)) >> PUD_SHIFT)
static void __head *fixup_pointer(void *ptr, unsigned long physaddr)
{
@@ -54,6 +55,8 @@ unsigned long _text_offset = (unsigned long)(_text - __START_KERNEL_map);
void __head notrace __startup_64(unsigned long physaddr)
{
unsigned long load_delta, *p;
+ unsigned long level3_kernel_start, level3_kernel_count;
+ unsigned long level3_fixmap_start;
pgdval_t *pgd;
p4dval_t *p4d;
pudval_t *pud;
@@ -74,6 +77,11 @@ void __head notrace __startup_64(unsigned long physaddr)
if (load_delta & ~PMD_PAGE_MASK)
for (;;);
+ /* Look at the randomization spread to adapt page table used */
+ level3_kernel_start = pud_index(__START_KERNEL_map);
+ level3_kernel_count = pud_count(KERNEL_IMAGE_SIZE);
+ level3_fixmap_start = level3_kernel_start + level3_kernel_count;
+
/* Fixup the physical addresses in the page table */
pgd = fixup_pointer(&early_top_pgt, physaddr);
@@ -85,8 +93,9 @@ void __head notrace __startup_64(unsigned long physaddr)
}
pud = fixup_pointer(&level3_kernel_pgt, physaddr);
- pud[510] += load_delta;
- pud[511] += load_delta;
+ for (i = 0; i < level3_kernel_count; i++)
+ pud[level3_kernel_start + i] += load_delta;
+ pud[level3_fixmap_start] += load_delta;
pmd = fixup_pointer(level2_fixmap_pgt, physaddr);
pmd[506] += load_delta;
@@ -137,7 +146,7 @@ void __head notrace __startup_64(unsigned long physaddr)
*/
pmd = fixup_pointer(level2_kernel_pgt, physaddr);
- for (i = 0; i < PTRS_PER_PMD; i++) {
+ for (i = 0; i < PTRS_PER_PMD * level3_kernel_count; i++) {
if (pmd[i] & _PAGE_PRESENT)
pmd[i] += load_delta;
}
@@ -268,7 +277,8 @@ asmlinkage __visible void __init x86_64_start_kernel(char * real_mode_data)
*/
BUILD_BUG_ON(MODULES_VADDR < __START_KERNEL_map);
BUILD_BUG_ON(MODULES_VADDR - __START_KERNEL_map < KERNEL_IMAGE_SIZE);
- BUILD_BUG_ON(MODULES_LEN + KERNEL_IMAGE_SIZE > 2*PUD_SIZE);
+ BUILD_BUG_ON(!IS_ENABLED(CONFIG_RANDOMIZE_BASE_LARGE) &&
+ MODULES_LEN + KERNEL_IMAGE_SIZE > 2*PUD_SIZE);
BUILD_BUG_ON((__START_KERNEL_map & ~PMD_MASK) != 0);
BUILD_BUG_ON((MODULES_VADDR & ~PMD_MASK) != 0);
BUILD_BUG_ON(!(MODULES_VADDR > __START_KERNEL));
diff --git a/arch/x86/kernel/head_64.S b/arch/x86/kernel/head_64.S
index 4d0a7e68bfe8..e8b2d6706eca 100644
--- a/arch/x86/kernel/head_64.S
+++ b/arch/x86/kernel/head_64.S
@@ -39,11 +39,15 @@
#define p4d_index(x) (((x) >> P4D_SHIFT) & (PTRS_PER_P4D-1))
#define pud_index(x) (((x) >> PUD_SHIFT) & (PTRS_PER_PUD-1))
+#define pud_count(x) (((x + (PUD_SIZE - 1)) & ~(PUD_SIZE - 1)) >> PUD_SHIFT)
PGD_PAGE_OFFSET = pgd_index(__PAGE_OFFSET_BASE)
PGD_START_KERNEL = pgd_index(__START_KERNEL_map)
L3_START_KERNEL = pud_index(__START_KERNEL_map)
+/* Adapt page table L3 space based on range of randomization */
+L3_KERNEL_ENTRY_COUNT = pud_count(KERNEL_IMAGE_SIZE)
+
.text
__HEAD
.code64
@@ -396,7 +400,12 @@ NEXT_PAGE(level4_kernel_pgt)
NEXT_PAGE(level3_kernel_pgt)
.fill L3_START_KERNEL,8,0
/* (2^48-(2*1024*1024*1024)-((2^39)*511))/(2^30) = 510 */
- .quad level2_kernel_pgt - __START_KERNEL_map + _KERNPG_TABLE
+ i = 0
+ .rept L3_KERNEL_ENTRY_COUNT
+ .quad level2_kernel_pgt - __START_KERNEL_map + _KERNPG_TABLE \
+ + PAGE_SIZE*i
+ i = i + 1
+ .endr
.quad level2_fixmap_pgt - __START_KERNEL_map + _PAGE_TABLE
NEXT_PAGE(level2_kernel_pgt)
--
2.13.2.932.g7449e964c-goog
Change the relocation tool to correctly handle a DYN/PIE kernel, where
the relocation table does not reference symbols and percpu support is
not needed. Also add support for R_X86_64_RELATIVE relocations, which
can be handled like 64-bit relocations due to the usage of -Bsymbolic.
Position Independent Executable (PIE) support will allow extending the
KASLR randomization range below the -2G memory limit.
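A minimal sketch (simplified, not the in-tree decompressor code) of the
treatment such entries receive at boot: each collected offset names a
64-bit slot in the image whose stored value only needs to be shifted by
the load delta, with no symbol lookup involved; translating the
link-time address into a writable pointer is elided here.

#include <assert.h>

static void apply_relocs64(unsigned long *slots[], unsigned long nr,
			   unsigned long delta)
{
	unsigned long i;

	/* delta = runtime kernel base - link-time kernel base */
	for (i = 0; i < nr; i++)
		*slots[i] += delta;
}

int main(void)
{
	unsigned long slot = 0xffffffff81000010UL;	/* link-time value */
	unsigned long *slots[] = { &slot };

	apply_relocs64(slots, 1, 0x200000);		/* pretend 2M load delta */
	assert(slot == 0xffffffff81200010UL);
	return 0;
}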
Signed-off-by: Thomas Garnier <[email protected]>
---
arch/x86/tools/relocs.c | 74 +++++++++++++++++++++++++++++++++++++++++++------
1 file changed, 65 insertions(+), 9 deletions(-)
diff --git a/arch/x86/tools/relocs.c b/arch/x86/tools/relocs.c
index 73eb7fd4aec4..70f523dd68ff 100644
--- a/arch/x86/tools/relocs.c
+++ b/arch/x86/tools/relocs.c
@@ -642,6 +642,13 @@ static void add_reloc(struct relocs *r, uint32_t offset)
r->offset[r->count++] = offset;
}
+/* Relocation found in a DYN binary, support only for 64-bit PIE */
+static int is_dyn_reloc(struct section *sec)
+{
+ return ELF_BITS == 64 && ehdr.e_type == ET_DYN &&
+ sec->shdr.sh_info == SHT_NULL;
+}
+
static void walk_relocs(int (*process)(struct section *sec, Elf_Rel *rel,
Elf_Sym *sym, const char *symname))
{
@@ -652,6 +659,7 @@ static void walk_relocs(int (*process)(struct section *sec, Elf_Rel *rel,
Elf_Sym *sh_symtab;
struct section *sec_applies, *sec_symtab;
int j;
+ int dyn_reloc = 0;
struct section *sec = &secs[i];
if (sec->shdr.sh_type != SHT_REL_TYPE) {
@@ -660,14 +668,20 @@ static void walk_relocs(int (*process)(struct section *sec, Elf_Rel *rel,
sec_symtab = sec->link;
sec_applies = &secs[sec->shdr.sh_info];
if (!(sec_applies->shdr.sh_flags & SHF_ALLOC)) {
- continue;
+ if (!is_dyn_reloc(sec_applies))
+ continue;
+ dyn_reloc = 1;
}
sh_symtab = sec_symtab->symtab;
sym_strtab = sec_symtab->link->strtab;
for (j = 0; j < sec->shdr.sh_size/sizeof(Elf_Rel); j++) {
Elf_Rel *rel = &sec->reltab[j];
- Elf_Sym *sym = &sh_symtab[ELF_R_SYM(rel->r_info)];
- const char *symname = sym_name(sym_strtab, sym);
+ Elf_Sym *sym = NULL;
+ const char *symname = NULL;
+ if (!dyn_reloc) {
+ sym = &sh_symtab[ELF_R_SYM(rel->r_info)];
+ symname = sym_name(sym_strtab, sym);
+ }
process(sec, rel, sym, symname);
}
@@ -746,16 +760,21 @@ static int is_percpu_sym(ElfW(Sym) *sym, const char *symname)
strncmp(symname, "init_per_cpu_", 13);
}
-
static int do_reloc64(struct section *sec, Elf_Rel *rel, ElfW(Sym) *sym,
const char *symname)
{
unsigned r_type = ELF64_R_TYPE(rel->r_info);
ElfW(Addr) offset = rel->r_offset;
- int shn_abs = (sym->st_shndx == SHN_ABS) && !is_reloc(S_REL, symname);
+ int shn_abs = 0;
+ int dyn_reloc = is_dyn_reloc(sec);
- if (sym->st_shndx == SHN_UNDEF)
- return 0;
+ if (!dyn_reloc) {
+ shn_abs = (sym->st_shndx == SHN_ABS) &&
+ !is_reloc(S_REL, symname);
+
+ if (sym->st_shndx == SHN_UNDEF)
+ return 0;
+ }
/*
* Adjust the offset if this reloc applies to the percpu section.
@@ -769,6 +788,9 @@ static int do_reloc64(struct section *sec, Elf_Rel *rel, ElfW(Sym) *sym,
break;
case R_X86_64_PC32:
+ if (dyn_reloc)
+ die("PC32 reloc in PIE DYN binary");
+
/*
* PC relative relocations don't need to be adjusted unless
* referencing a percpu symbol.
@@ -783,7 +805,7 @@ static int do_reloc64(struct section *sec, Elf_Rel *rel, ElfW(Sym) *sym,
/*
* References to the percpu area don't need to be adjusted.
*/
- if (is_percpu_sym(sym, symname))
+ if (!dyn_reloc && is_percpu_sym(sym, symname))
break;
if (shn_abs) {
@@ -814,6 +836,14 @@ static int do_reloc64(struct section *sec, Elf_Rel *rel, ElfW(Sym) *sym,
add_reloc(&relocs32, offset);
break;
+ case R_X86_64_RELATIVE:
+ /*
+ * -Bsymbolic means we don't need the addend and we can reuse
+ * the original relocs64.
+ */
+ add_reloc(&relocs64, offset);
+ break;
+
default:
die("Unsupported relocation type: %s (%d)\n",
rel_type(r_type), r_type);
@@ -1044,6 +1074,21 @@ static void emit_relocs(int as_text, int use_real_mode)
}
}
+/* Print a different header based on the type of relocation */
+static void print_reloc_header(struct section *sec) {
+ static int header_printed = 0;
+ int header_type = is_dyn_reloc(sec) ? 2 : 1;
+
+ if (header_printed == header_type)
+ return;
+ header_printed = header_type;
+
+ if (header_type == 2)
+ printf("reloc type\toffset\tvalue\n");
+ else
+ printf("reloc section\treloc type\tsymbol\tsymbol section\n");
+}
+
/*
* As an aid to debugging problems with different linkers
* print summary information about the relocs.
@@ -1053,6 +1098,18 @@ static void emit_relocs(int as_text, int use_real_mode)
static int do_reloc_info(struct section *sec, Elf_Rel *rel, ElfW(Sym) *sym,
const char *symname)
{
+
+ print_reloc_header(sec);
+
+#if ELF_BITS == 64
+ if (is_dyn_reloc(sec)) {
+ printf("%s\t0x%lx\t0x%lx\n",
+ rel_type(ELF_R_TYPE(rel->r_info)),
+ rel->r_offset,
+ rel->r_addend);
+ return 0;
+ }
+#endif
printf("%s\t%s\t%s\t%s\n",
sec_name(sec->shdr.sh_info),
rel_type(ELF_R_TYPE(rel->r_info)),
@@ -1063,7 +1120,6 @@ static int do_reloc_info(struct section *sec, Elf_Rel *rel, ElfW(Sym) *sym,
static void print_reloc_info(void)
{
- printf("reloc section\treloc type\tsymbol\tsymbol section\n");
walk_relocs(do_reloc_info);
}
--
2.13.2.932.g7449e964c-goog
Percpu uses a clever design where the .percpu ELF section has a virtual
address of zero and the relocation code avoids relocating specific
symbols. It makes the code simple and easily adaptable with or without
SMP support.
This design is incompatible with PIE because generated code always tries
to access the zero virtual address relative to the default mapping
address. It becomes impossible when KASLR is configured to go below -2G.
This patch solves this problem by removing the zero mapping and adapting
the GS base to be relative to the expected address. These changes are
done only when PIE is enabled. The original implementation is kept as-is
by default.
The assembly and PER_CPU macros are changed to use relative references
when PIE is enabled.
The KALLSYMS_ABSOLUTE_PERCPU configuration is disabled with PIE because
percpu symbols are not absolute in this case.
Position Independent Executable (PIE) support will allow extending the
KASLR randomization range below the -2G memory limit.
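A small standalone sketch of the GS-base arithmetic (made-up addresses,
not kernel code): under PIE a percpu access is emitted as %gs:sym(%rip),
i.e. the GS base plus the symbol's runtime address, so writing the
per-cpu area base minus __per_cpu_start into MSR_GS_BASE makes the
access land on this CPU's copy exactly as before.

#include <assert.h>

int main(void)
{
	/* hypothetical addresses, only the arithmetic matters */
	unsigned long per_cpu_start = 0xffffffff82000000UL; /* __per_cpu_start */
	unsigned long sym           = per_cpu_start + 0x40;  /* &var, offset 0x40 */
	unsigned long cpu_area      = 0xffff88807fc00000UL;  /* this CPU's copy */

	/* what the patched load_percpu_segment() writes to MSR_GS_BASE */
	unsigned long gs_base = cpu_area - per_cpu_start;

	/* effective address of %gs:var(%rip) is gs_base + &var */
	assert(gs_base + sym == cpu_area + 0x40);
	return 0;
}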
Signed-off-by: Thomas Garnier <[email protected]>
---
arch/x86/entry/entry_64.S | 4 ++--
arch/x86/include/asm/percpu.h | 25 +++++++++++++++++++------
arch/x86/kernel/cpu/common.c | 4 +++-
arch/x86/kernel/head_64.S | 4 ++++
arch/x86/kernel/setup_percpu.c | 2 +-
arch/x86/kernel/vmlinux.lds.S | 13 +++++++++++--
arch/x86/lib/cmpxchg16b_emu.S | 8 ++++----
arch/x86/xen/xen-asm.S | 12 ++++++------
init/Kconfig | 2 +-
9 files changed, 51 insertions(+), 23 deletions(-)
diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 691c4755269b..be198c0a2a8c 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -388,7 +388,7 @@ ENTRY(__switch_to_asm)
#ifdef CONFIG_CC_STACKPROTECTOR
movq TASK_stack_canary(%rsi), %rbx
- movq %rbx, PER_CPU_VAR(irq_stack_union)+stack_canary_offset
+ movq %rbx, PER_CPU_VAR(irq_stack_union + stack_canary_offset)
#endif
/* restore callee-saved registers */
@@ -739,7 +739,7 @@ apicinterrupt IRQ_WORK_VECTOR irq_work_interrupt smp_irq_work_interrupt
/*
* Exception entry points.
*/
-#define CPU_TSS_IST(x) PER_CPU_VAR(cpu_tss) + (TSS_ist + ((x) - 1) * 8)
+#define CPU_TSS_IST(x) PER_CPU_VAR(cpu_tss + (TSS_ist + ((x) - 1) * 8))
.macro idtentry sym do_sym has_error_code:req paranoid=0 shift_ist=-1
ENTRY(\sym)
diff --git a/arch/x86/include/asm/percpu.h b/arch/x86/include/asm/percpu.h
index 9fa03604b2b3..862eb771f0e5 100644
--- a/arch/x86/include/asm/percpu.h
+++ b/arch/x86/include/asm/percpu.h
@@ -4,9 +4,11 @@
#ifdef CONFIG_X86_64
#define __percpu_seg gs
#define __percpu_mov_op movq
+#define __percpu_rel (%rip)
#else
#define __percpu_seg fs
#define __percpu_mov_op movl
+#define __percpu_rel
#endif
#ifdef __ASSEMBLY__
@@ -27,10 +29,14 @@
#define PER_CPU(var, reg) \
__percpu_mov_op %__percpu_seg:this_cpu_off, reg; \
lea var(reg), reg
-#define PER_CPU_VAR(var) %__percpu_seg:var
+/* Compatible with Position Independent Code */
+#define PER_CPU_VAR(var) %__percpu_seg:(var)##__percpu_rel
+/* Rare absolute reference */
+#define PER_CPU_VAR_ABS(var) %__percpu_seg:var
#else /* ! SMP */
#define PER_CPU(var, reg) __percpu_mov_op $var, reg
-#define PER_CPU_VAR(var) var
+#define PER_CPU_VAR(var) (var)##__percpu_rel
+#define PER_CPU_VAR_ABS(var) var
#endif /* SMP */
#ifdef CONFIG_X86_64_SMP
@@ -208,27 +214,34 @@ do { \
pfo_ret__; \
})
+/* Position Independent code uses relative addresses only */
+#ifdef CONFIG_X86_PIE
+#define __percpu_stable_arg __percpu_arg(a1)
+#else
+#define __percpu_stable_arg __percpu_arg(P1)
+#endif
+
#define percpu_stable_op(op, var) \
({ \
typeof(var) pfo_ret__; \
switch (sizeof(var)) { \
case 1: \
- asm(op "b "__percpu_arg(P1)",%0" \
+ asm(op "b "__percpu_stable_arg ",%0" \
: "=q" (pfo_ret__) \
: "p" (&(var))); \
break; \
case 2: \
- asm(op "w "__percpu_arg(P1)",%0" \
+ asm(op "w "__percpu_stable_arg ",%0" \
: "=r" (pfo_ret__) \
: "p" (&(var))); \
break; \
case 4: \
- asm(op "l "__percpu_arg(P1)",%0" \
+ asm(op "l "__percpu_stable_arg ",%0" \
: "=r" (pfo_ret__) \
: "p" (&(var))); \
break; \
case 8: \
- asm(op "q "__percpu_arg(P1)",%0" \
+ asm(op "q "__percpu_stable_arg ",%0" \
: "=r" (pfo_ret__) \
: "p" (&(var))); \
break; \
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index b95cd94ca97b..31300767ec0f 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -480,7 +480,9 @@ void load_percpu_segment(int cpu)
loadsegment(fs, __KERNEL_PERCPU);
#else
__loadsegment_simple(gs, 0);
- wrmsrl(MSR_GS_BASE, (unsigned long)per_cpu(irq_stack_union.gs_base, cpu));
+ wrmsrl(MSR_GS_BASE,
+ (unsigned long)per_cpu(irq_stack_union.gs_base, cpu) -
+ (unsigned long)__per_cpu_start);
#endif
load_stack_canary_segment();
}
diff --git a/arch/x86/kernel/head_64.S b/arch/x86/kernel/head_64.S
index 7e4f7a83a15a..4d0a7e68bfe8 100644
--- a/arch/x86/kernel/head_64.S
+++ b/arch/x86/kernel/head_64.S
@@ -256,7 +256,11 @@ ENDPROC(start_cpu0)
GLOBAL(initial_code)
.quad x86_64_start_kernel
GLOBAL(initial_gs)
+#ifdef CONFIG_X86_PIE
+ .quad 0
+#else
.quad INIT_PER_CPU_VAR(irq_stack_union)
+#endif
GLOBAL(initial_stack)
/*
* The SIZEOF_PTREGS gap is a convention which helps the in-kernel
diff --git a/arch/x86/kernel/setup_percpu.c b/arch/x86/kernel/setup_percpu.c
index 10edd1e69a68..ce1c58a29def 100644
--- a/arch/x86/kernel/setup_percpu.c
+++ b/arch/x86/kernel/setup_percpu.c
@@ -25,7 +25,7 @@
DEFINE_PER_CPU_READ_MOSTLY(int, cpu_number);
EXPORT_PER_CPU_SYMBOL(cpu_number);
-#ifdef CONFIG_X86_64
+#if defined(CONFIG_X86_64) && !defined(CONFIG_X86_PIE)
#define BOOT_PERCPU_OFFSET ((unsigned long)__per_cpu_load)
#else
#define BOOT_PERCPU_OFFSET 0
diff --git a/arch/x86/kernel/vmlinux.lds.S b/arch/x86/kernel/vmlinux.lds.S
index c8a3b61be0aa..77f1b0622539 100644
--- a/arch/x86/kernel/vmlinux.lds.S
+++ b/arch/x86/kernel/vmlinux.lds.S
@@ -183,9 +183,14 @@ SECTIONS
/*
* percpu offsets are zero-based on SMP. PERCPU_VADDR() changes the
* output PHDR, so the next output section - .init.text - should
- * start another segment - init.
+ * start another segment - init. For Position Independent Code, the
+ * per-cpu section cannot be zero-based because everything is relative.
*/
+#ifdef CONFIG_X86_PIE
+ PERCPU_SECTION(INTERNODE_CACHE_BYTES)
+#else
PERCPU_VADDR(INTERNODE_CACHE_BYTES, 0, :percpu)
+#endif
ASSERT(SIZEOF(.data..percpu) < CONFIG_PHYSICAL_START,
"per-CPU data too large - increase CONFIG_PHYSICAL_START")
#endif
@@ -361,7 +366,11 @@ SECTIONS
* Per-cpu symbols which need to be offset from __per_cpu_load
* for the boot processor.
*/
+#ifdef CONFIG_X86_PIE
+#define INIT_PER_CPU(x) init_per_cpu__##x = x
+#else
#define INIT_PER_CPU(x) init_per_cpu__##x = x + __per_cpu_load
+#endif
INIT_PER_CPU(gdt_page);
INIT_PER_CPU(irq_stack_union);
@@ -371,7 +380,7 @@ INIT_PER_CPU(irq_stack_union);
. = ASSERT((_end - _text <= KERNEL_IMAGE_SIZE),
"kernel image bigger than KERNEL_IMAGE_SIZE");
-#ifdef CONFIG_SMP
+#if defined(CONFIG_SMP) && !defined(CONFIG_X86_PIE)
. = ASSERT((irq_stack_union == 0),
"irq_stack_union is not at start of per-cpu area");
#endif
diff --git a/arch/x86/lib/cmpxchg16b_emu.S b/arch/x86/lib/cmpxchg16b_emu.S
index 9b330242e740..254950604ae4 100644
--- a/arch/x86/lib/cmpxchg16b_emu.S
+++ b/arch/x86/lib/cmpxchg16b_emu.S
@@ -33,13 +33,13 @@ ENTRY(this_cpu_cmpxchg16b_emu)
pushfq
cli
- cmpq PER_CPU_VAR((%rsi)), %rax
+ cmpq PER_CPU_VAR_ABS((%rsi)), %rax
jne .Lnot_same
- cmpq PER_CPU_VAR(8(%rsi)), %rdx
+ cmpq PER_CPU_VAR_ABS(8(%rsi)), %rdx
jne .Lnot_same
- movq %rbx, PER_CPU_VAR((%rsi))
- movq %rcx, PER_CPU_VAR(8(%rsi))
+ movq %rbx, PER_CPU_VAR_ABS((%rsi))
+ movq %rcx, PER_CPU_VAR_ABS(8(%rsi))
popfq
mov $1, %al
diff --git a/arch/x86/xen/xen-asm.S b/arch/x86/xen/xen-asm.S
index eff224df813f..40410969fd3c 100644
--- a/arch/x86/xen/xen-asm.S
+++ b/arch/x86/xen/xen-asm.S
@@ -26,7 +26,7 @@
ENTRY(xen_irq_enable_direct)
FRAME_BEGIN
/* Unmask events */
- movb $0, PER_CPU_VAR(xen_vcpu_info) + XEN_vcpu_info_mask
+ movb $0, PER_CPU_VAR(xen_vcpu_info + XEN_vcpu_info_mask)
/*
* Preempt here doesn't matter because that will deal with any
@@ -35,7 +35,7 @@ ENTRY(xen_irq_enable_direct)
*/
/* Test for pending */
- testb $0xff, PER_CPU_VAR(xen_vcpu_info) + XEN_vcpu_info_pending
+ testb $0xff, PER_CPU_VAR(xen_vcpu_info + XEN_vcpu_info_pending)
jz 1f
2: call check_events
@@ -52,7 +52,7 @@ ENDPATCH(xen_irq_enable_direct)
* non-zero.
*/
ENTRY(xen_irq_disable_direct)
- movb $1, PER_CPU_VAR(xen_vcpu_info) + XEN_vcpu_info_mask
+ movb $1, PER_CPU_VAR(xen_vcpu_info + XEN_vcpu_info_mask)
ENDPATCH(xen_irq_disable_direct)
ret
ENDPROC(xen_irq_disable_direct)
@@ -68,7 +68,7 @@ ENDPATCH(xen_irq_disable_direct)
* x86 use opposite senses (mask vs enable).
*/
ENTRY(xen_save_fl_direct)
- testb $0xff, PER_CPU_VAR(xen_vcpu_info) + XEN_vcpu_info_mask
+ testb $0xff, PER_CPU_VAR(xen_vcpu_info + XEN_vcpu_info_mask)
setz %ah
addb %ah, %ah
ENDPATCH(xen_save_fl_direct)
@@ -91,7 +91,7 @@ ENTRY(xen_restore_fl_direct)
#else
testb $X86_EFLAGS_IF>>8, %ah
#endif
- setz PER_CPU_VAR(xen_vcpu_info) + XEN_vcpu_info_mask
+ setz PER_CPU_VAR(xen_vcpu_info + XEN_vcpu_info_mask)
/*
* Preempt here doesn't matter because that will deal with any
* pending interrupts. The pending check may end up being run
@@ -99,7 +99,7 @@ ENTRY(xen_restore_fl_direct)
*/
/* check for unmasked and pending */
- cmpw $0x0001, PER_CPU_VAR(xen_vcpu_info) + XEN_vcpu_info_pending
+ cmpw $0x0001, PER_CPU_VAR(xen_vcpu_info + XEN_vcpu_info_pending)
jnz 1f
2: call check_events
1:
diff --git a/init/Kconfig b/init/Kconfig
index 8514b25db21c..4fb5d6fc2c4f 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1201,7 +1201,7 @@ config KALLSYMS_ALL
config KALLSYMS_ABSOLUTE_PERCPU
bool
depends on KALLSYMS
- default X86_64 && SMP
+ default X86_64 && SMP && !X86_PIE
config KALLSYMS_BASE_RELATIVE
bool
--
2.13.2.932.g7449e964c-goog
With PIE support and the extended KASLR range, the modules may be
further away from the kernel than before, breaking mcmodel=kernel
expectations.
Add an option to build modules with mcmodel=large. The generated module
code will make no assumptions about placement in memory.
Despite this option, modules still expect kernel functions to be within
2G and generate relative calls. To solve this issue, the arm64 PLT code
was adapted for x86_64. When a relative relocation goes outside its
range, a dynamic PLT entry is used to correctly jump to the destination.
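To make the PLT entry layout concrete before the code (a standalone
sketch, not part of the patch): each entry stores the 64-bit target
first, followed by a 6-byte "jmp QWORD PTR [rip+disp]"; RIP points just
past the entry when the jmp executes, so the displacement has to step
back 14 bytes (8 for the target, 6 for the jmp), which is the
0xfffffffffffffff2 encoded in the ff 25 f2 ff ff ff opcode bytes.

#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct demo_plt_entry {
	uint64_t target;	/* absolute destination, patched at load time */
	uint8_t  jmp[6];	/* ff 25 f2 ff ff ff */
};

int main(void)
{
	/* next RIP sits after the 6-byte jmp; the slot to read is at offset 0 */
	int32_t disp = -(int32_t)(offsetof(struct demo_plt_entry, jmp) + 6);

	assert((uint32_t)disp == 0xfffffff2u); /* matches the encoded disp32 */
	return 0;
}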
Signed-off-by: Thomas Garnier <[email protected]>
---
arch/x86/Kconfig | 10 +++
arch/x86/Makefile | 10 ++-
arch/x86/include/asm/module.h | 16 ++++
arch/x86/kernel/Makefile | 2 +
arch/x86/kernel/module-plts.c | 198 ++++++++++++++++++++++++++++++++++++++++++
arch/x86/kernel/module.c | 18 ++--
arch/x86/kernel/module.lds | 4 +
7 files changed, 251 insertions(+), 7 deletions(-)
create mode 100644 arch/x86/kernel/module-plts.c
create mode 100644 arch/x86/kernel/module.lds
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index b26ee6751021..60d161391d5a 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -2086,6 +2086,16 @@ config X86_PIE
select DEFAULT_HIDDEN
select MODULE_REL_CRCS if MODVERSIONS
+config X86_MODULE_MODEL_LARGE
+ bool
+ depends on X86_64 && X86_PIE
+
+config X86_MODULE_PLTS
+ bool
+ depends on X86_64
+ select X86_MODULE_MODEL_LARGE
+ select HAVE_MOD_ARCH_SPECIFIC
+
config HOTPLUG_CPU
bool "Support for hot-pluggable CPUs"
depends on SMP
diff --git a/arch/x86/Makefile b/arch/x86/Makefile
index 452a9621af8f..72a90da0149a 100644
--- a/arch/x86/Makefile
+++ b/arch/x86/Makefile
@@ -138,10 +138,18 @@ else
KBUILD_CFLAGS += -mno-red-zone
ifdef CONFIG_X86_PIE
KBUILD_CFLAGS += -fPIC
- KBUILD_CFLAGS_MODULE += -fno-PIC -mcmodel=kernel
+ KBUILD_CFLAGS_MODULE += -fno-PIC
else
KBUILD_CFLAGS += -mcmodel=kernel
endif
+ifdef CONFIG_X86_MODULE_MODEL_LARGE
+ KBUILD_CFLAGS_MODULE += -mcmodel=large
+else
+ KBUILD_CFLAGS_MODULE += -mcmodel=kernel
+endif
+ifdef CONFIG_X86_MODULE_PLTS
+ KBUILD_LDFLAGS_MODULE += -T $(srctree)/arch/x86/kernel/module.lds
+endif
# -funit-at-a-time shrinks the kernel .text considerably
# unfortunately it makes reading oopses harder.
diff --git a/arch/x86/include/asm/module.h b/arch/x86/include/asm/module.h
index e3b7819caeef..d054c37656ea 100644
--- a/arch/x86/include/asm/module.h
+++ b/arch/x86/include/asm/module.h
@@ -61,4 +61,20 @@
# define MODULE_ARCH_VERMAGIC MODULE_PROC_FAMILY
#endif
+#ifdef CONFIG_X86_MODULE_PLTS
+struct mod_plt_sec {
+ struct elf64_shdr *plt;
+ int plt_num_entries;
+ int plt_max_entries;
+};
+
+struct mod_arch_specific {
+ struct mod_plt_sec core;
+ struct mod_plt_sec init;
+};
+#endif
+
+u64 module_emit_plt_entry(struct module *mod, void *loc, const Elf64_Rela *rela,
+ Elf64_Sym *sym);
+
#endif /* _ASM_X86_MODULE_H */
diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index a01892bdd61a..e294aefb747c 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -142,4 +142,6 @@ ifeq ($(CONFIG_X86_64),y)
obj-$(CONFIG_PCI_MMCONFIG) += mmconf-fam10h_64.o
obj-y += vsmp_64.o
+
+ obj-$(CONFIG_X86_MODULE_PLTS) += module-plts.o
endif
diff --git a/arch/x86/kernel/module-plts.c b/arch/x86/kernel/module-plts.c
new file mode 100644
index 000000000000..bbf11771f424
--- /dev/null
+++ b/arch/x86/kernel/module-plts.c
@@ -0,0 +1,198 @@
+/*
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ * Generate PLT entries for out-of-bound PC-relative relocations. It is required
+ * when a module can be mapped more than 2G away from the kernel.
+ *
+ * Based on arm64 module-plts implementation.
+ */
+
+#include <linux/elf.h>
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/sort.h>
+
+/* jmp QWORD PTR [rip+0xfffffffffffffff2] */
+const u8 jmp_target[] = { 0xFF, 0x25, 0xF2, 0xFF, 0xFF, 0xFF };
+
+struct plt_entry {
+ u64 target; /* Hold the target address */
+ u8 jmp[sizeof(jmp_target)]; /* jmp opcode to target */
+};
+
+static bool in_init(const struct module *mod, void *loc)
+{
+ return (u64)loc - (u64)mod->init_layout.base < mod->init_layout.size;
+}
+
+u64 module_emit_plt_entry(struct module *mod, void *loc, const Elf64_Rela *rela,
+ Elf64_Sym *sym)
+{
+ struct mod_plt_sec *pltsec = !in_init(mod, loc) ? &mod->arch.core :
+ &mod->arch.init;
+ struct plt_entry *plt = (struct plt_entry *)pltsec->plt->sh_addr;
+ int i = pltsec->plt_num_entries;
+ u64 ret;
+
+ /*
+ * <target address>
+ * jmp QWORD PTR [rip+0xfffffffffffffff2] # Target address
+ */
+ plt[i].target = sym->st_value;
+ memcpy(plt[i].jmp, jmp_target, sizeof(jmp_target));
+
+ /*
+ * Check if the entry we just created is a duplicate. Given that the
+ * relocations are sorted, this will be the last entry we allocated.
+ * (if one exists).
+ */
+	if (i > 0 && plt[i].target == plt[i - 1].target) {
+ ret = (u64)&plt[i - 1].jmp;
+ } else {
+ pltsec->plt_num_entries++;
+ BUG_ON(pltsec->plt_num_entries > pltsec->plt_max_entries);
+ ret = (u64)&plt[i].jmp;
+ }
+
+ return ret + rela->r_addend;
+}
+
+#define cmp_3way(a,b) ((a) < (b) ? -1 : (a) > (b))
+
+static int cmp_rela(const void *a, const void *b)
+{
+ const Elf64_Rela *x = a, *y = b;
+ int i;
+
+ /* sort by type, symbol index and addend */
+ i = cmp_3way(ELF64_R_TYPE(x->r_info), ELF64_R_TYPE(y->r_info));
+ if (i == 0)
+ i = cmp_3way(ELF64_R_SYM(x->r_info), ELF64_R_SYM(y->r_info));
+ if (i == 0)
+ i = cmp_3way(x->r_addend, y->r_addend);
+ return i;
+}
+
+static bool duplicate_rel(const Elf64_Rela *rela, int num)
+{
+ /*
+ * Entries are sorted by type, symbol index and addend. That means
+ * that, if a duplicate entry exists, it must be in the preceding
+ * slot.
+ */
+ return num > 0 && cmp_rela(rela + num, rela + num - 1) == 0;
+}
+
+static unsigned int count_plts(Elf64_Sym *syms, Elf64_Rela *rela, int num,
+ Elf64_Word dstidx)
+{
+ unsigned int ret = 0;
+ Elf64_Sym *s;
+ int i;
+
+ for (i = 0; i < num; i++) {
+ switch (ELF64_R_TYPE(rela[i].r_info)) {
+ case R_X86_64_PC32:
+ /*
+ * We only have to consider branch targets that resolve
+ * to symbols that are defined in a different section.
+ * This is not simply a heuristic, it is a fundamental
+ * limitation, since there is no guaranteed way to emit
+ * PLT entries sufficiently close to the branch if the
+ * section size exceeds the range of a branch
+ * instruction. So ignore relocations against defined
+ * symbols if they live in the same section as the
+ * relocation target.
+ */
+ s = syms + ELF64_R_SYM(rela[i].r_info);
+ if (s->st_shndx == dstidx)
+ break;
+
+ /*
+ * Jump relocations with non-zero addends against
+ * undefined symbols are supported by the ELF spec, but
+ * do not occur in practice (e.g., 'jump n bytes past
+ * the entry point of undefined function symbol f').
+ * So we need to support them, but there is no need to
+ * take them into consideration when trying to optimize
+ * this code. So let's only check for duplicates when
+ * the addend is zero: this allows us to record the PLT
+ * entry address in the symbol table itself, rather than
+ * having to search the list for duplicates each time we
+ * emit one.
+ */
+ if (rela[i].r_addend != 0 || !duplicate_rel(rela, i))
+ ret++;
+ break;
+ }
+ }
+ return ret;
+}
+
+int module_frob_arch_sections(Elf_Ehdr *ehdr, Elf_Shdr *sechdrs,
+ char *secstrings, struct module *mod)
+{
+ unsigned long core_plts = 0;
+ unsigned long init_plts = 0;
+ Elf64_Sym *syms = NULL;
+ int i;
+
+ /*
+ * Find the empty .plt section so we can expand it to store the PLT
+ * entries. Record the symtab address as well.
+ */
+ for (i = 0; i < ehdr->e_shnum; i++) {
+ if (!strcmp(secstrings + sechdrs[i].sh_name, ".plt"))
+ mod->arch.core.plt = sechdrs + i;
+ else if (!strcmp(secstrings + sechdrs[i].sh_name, ".init.plt"))
+ mod->arch.init.plt = sechdrs + i;
+ else if (sechdrs[i].sh_type == SHT_SYMTAB)
+ syms = (Elf64_Sym *)sechdrs[i].sh_addr;
+ }
+
+ if (!mod->arch.core.plt || !mod->arch.init.plt) {
+ pr_err("%s: module PLT section(s) missing\n", mod->name);
+ return -ENOEXEC;
+ }
+ if (!syms) {
+ pr_err("%s: module symtab section missing\n", mod->name);
+ return -ENOEXEC;
+ }
+
+ for (i = 0; i < ehdr->e_shnum; i++) {
+ Elf64_Rela *rels = (void *)ehdr + sechdrs[i].sh_offset;
+ int numrels = sechdrs[i].sh_size / sizeof(Elf64_Rela);
+ Elf64_Shdr *dstsec = sechdrs + sechdrs[i].sh_info;
+
+ if (sechdrs[i].sh_type != SHT_RELA)
+ continue;
+
+ /* sort by type, symbol index and addend */
+ sort(rels, numrels, sizeof(Elf64_Rela), cmp_rela, NULL);
+
+ if (strncmp(secstrings + dstsec->sh_name, ".init", 5) != 0)
+ core_plts += count_plts(syms, rels, numrels,
+ sechdrs[i].sh_info);
+ else
+ init_plts += count_plts(syms, rels, numrels,
+ sechdrs[i].sh_info);
+ }
+
+ mod->arch.core.plt->sh_type = SHT_NOBITS;
+ mod->arch.core.plt->sh_flags = SHF_EXECINSTR | SHF_ALLOC;
+ mod->arch.core.plt->sh_addralign = L1_CACHE_BYTES;
+ mod->arch.core.plt->sh_size = (core_plts + 1) * sizeof(struct plt_entry);
+ mod->arch.core.plt_num_entries = 0;
+ mod->arch.core.plt_max_entries = core_plts;
+
+ mod->arch.init.plt->sh_type = SHT_NOBITS;
+ mod->arch.init.plt->sh_flags = SHF_EXECINSTR | SHF_ALLOC;
+ mod->arch.init.plt->sh_addralign = L1_CACHE_BYTES;
+ mod->arch.init.plt->sh_size = (init_plts + 1) * sizeof(struct plt_entry);
+ mod->arch.init.plt_num_entries = 0;
+ mod->arch.init.plt_max_entries = init_plts;
+
+ return 0;
+}
diff --git a/arch/x86/kernel/module.c b/arch/x86/kernel/module.c
index f67bd3205df7..a2b31973572b 100644
--- a/arch/x86/kernel/module.c
+++ b/arch/x86/kernel/module.c
@@ -186,10 +186,15 @@ int apply_relocate_add(Elf64_Shdr *sechdrs,
case R_X86_64_PC32:
val -= (u64)loc;
*(u32 *)loc = val;
-#if 0
- if ((s64)val != *(s32 *)loc)
- goto overflow;
-#endif
+ if (IS_ENABLED(CONFIG_X86_MODULE_MODEL_LARGE) &&
+ (s64)val != *(s32 *)loc) {
+ val = module_emit_plt_entry(me, loc, &rel[i],
+ sym);
+ val -= (u64)loc;
+ *(u32 *)loc = val;
+ if ((s64)val != *(s32 *)loc)
+ goto overflow;
+ }
break;
default:
pr_err("%s: Unknown rela relocation: %llu\n",
@@ -202,8 +207,9 @@ int apply_relocate_add(Elf64_Shdr *sechdrs,
overflow:
pr_err("overflow in relocation type %d val %Lx\n",
(int)ELF64_R_TYPE(rel[i].r_info), val);
- pr_err("`%s' likely not compiled with -mcmodel=kernel\n",
- me->name);
+ pr_err("`%s' likely not compiled with -mcmodel=%s\n",
+ me->name,
+ IS_ENABLED(CONFIG_X86_MODULE_MODEL_LARGE) ? "large" : "kernel");
return -ENOEXEC;
}
#endif
diff --git a/arch/x86/kernel/module.lds b/arch/x86/kernel/module.lds
new file mode 100644
index 000000000000..f7c9781a9d48
--- /dev/null
+++ b/arch/x86/kernel/module.lds
@@ -0,0 +1,4 @@
+SECTIONS {
+ .plt (NOLOAD) : { BYTE(0) }
+ .init.plt (NOLOAD) : { BYTE(0) }
+}
--
2.13.2.932.g7449e964c-goog
The x86 relocation tool generates a list of 32-bit signed integers. There
was no need to use 64-bit integers because all addresses were above the
top 2G of the memory.
This change adds a large-reloc option to generate 64-bit unsigned
integers. It can be used when the kernel plans to go below the top 2G
and 32-bit integers are not enough.
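A standalone sketch of why 32 bits stop being enough (an illustration of
the sign-extension trick, not the tool's code): the existing table keeps
only the low 32 bits of each link-time address and the decompressor
sign-extends them, which round-trips for addresses in the top 2G but not
for a kernel image placed down near -4G.

#include <assert.h>
#include <stdint.h>

static uint64_t recover_from_32bit(uint64_t addr)
{
	/* keep the low 32 bits, then sign-extend, as the old format implies */
	return (uint64_t)(int64_t)(int32_t)(uint32_t)addr;
}

int main(void)
{
	/* top 2G (mcmodel=kernel): survives the round trip */
	assert(recover_from_32bit(0xffffffff81000000UL) == 0xffffffff81000000UL);
	/* below -2G (PIE, large range): the upper bits are lost */
	assert(recover_from_32bit(0xffffffff01000000UL) != 0xffffffff01000000UL);
	return 0;
}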
Signed-off-by: Thomas Garnier <[email protected]>
---
arch/x86/tools/relocs.c | 60 +++++++++++++++++++++++++++++++++---------
arch/x86/tools/relocs.h | 4 +--
arch/x86/tools/relocs_common.c | 15 +++++++----
3 files changed, 60 insertions(+), 19 deletions(-)
diff --git a/arch/x86/tools/relocs.c b/arch/x86/tools/relocs.c
index 70f523dd68ff..19b3e6c594b1 100644
--- a/arch/x86/tools/relocs.c
+++ b/arch/x86/tools/relocs.c
@@ -12,8 +12,14 @@
static Elf_Ehdr ehdr;
+#if ELF_BITS == 64
+typedef uint64_t rel_off_t;
+#else
+typedef uint32_t rel_off_t;
+#endif
+
struct relocs {
- uint32_t *offset;
+ rel_off_t *offset;
unsigned long count;
unsigned long size;
};
@@ -627,7 +633,7 @@ static void print_absolute_relocs(void)
printf("\n");
}
-static void add_reloc(struct relocs *r, uint32_t offset)
+static void add_reloc(struct relocs *r, rel_off_t offset)
{
if (r->count == r->size) {
unsigned long newsize = r->size + 50000;
@@ -983,26 +989,48 @@ static void sort_relocs(struct relocs *r)
qsort(r->offset, r->count, sizeof(r->offset[0]), cmp_relocs);
}
-static int write32(uint32_t v, FILE *f)
+static int write32(rel_off_t rel, FILE *f)
{
- unsigned char buf[4];
+ unsigned char buf[sizeof(uint32_t)];
+ uint32_t v = (uint32_t)rel;
put_unaligned_le32(v, buf);
- return fwrite(buf, 1, 4, f) == 4 ? 0 : -1;
+ return fwrite(buf, 1, sizeof(buf), f) == sizeof(buf) ? 0 : -1;
}
-static int write32_as_text(uint32_t v, FILE *f)
+static int write32_as_text(rel_off_t rel, FILE *f)
{
+ uint32_t v = (uint32_t)rel;
return fprintf(f, "\t.long 0x%08"PRIx32"\n", v) > 0 ? 0 : -1;
}
-static void emit_relocs(int as_text, int use_real_mode)
+static int write64(rel_off_t rel, FILE *f)
+{
+ unsigned char buf[sizeof(uint64_t)];
+ uint64_t v = (uint64_t)rel;
+
+ put_unaligned_le64(v, buf);
+ return fwrite(buf, 1, sizeof(buf), f) == sizeof(buf) ? 0 : -1;
+}
+
+static int write64_as_text(rel_off_t rel, FILE *f)
+{
+ uint64_t v = (uint64_t)rel;
+ return fprintf(f, "\t.quad 0x%016"PRIx64"\n", v) > 0 ? 0 : -1;
+}
+
+static void emit_relocs(int as_text, int use_real_mode, int use_large_reloc)
{
int i;
- int (*write_reloc)(uint32_t, FILE *) = write32;
+ int (*write_reloc)(rel_off_t, FILE *);
int (*do_reloc)(struct section *sec, Elf_Rel *rel, Elf_Sym *sym,
const char *symname);
+ if (use_large_reloc)
+ write_reloc = write64;
+ else
+ write_reloc = write32;
+
#if ELF_BITS == 64
if (!use_real_mode)
do_reloc = do_reloc64;
@@ -1013,6 +1041,9 @@ static void emit_relocs(int as_text, int use_real_mode)
do_reloc = do_reloc32;
else
do_reloc = do_reloc_real;
+
+ /* Large relocations only for 64-bit */
+ use_large_reloc = 0;
#endif
/* Collect up the relocations */
@@ -1036,8 +1067,13 @@ static void emit_relocs(int as_text, int use_real_mode)
* gas will like.
*/
printf(".section \".data.reloc\",\"a\"\n");
- printf(".balign 4\n");
- write_reloc = write32_as_text;
+ if (use_large_reloc) {
+ printf(".balign 8\n");
+ write_reloc = write64_as_text;
+ } else {
+ printf(".balign 4\n");
+ write_reloc = write32_as_text;
+ }
}
if (use_real_mode) {
@@ -1131,7 +1167,7 @@ static void print_reloc_info(void)
void process(FILE *fp, int use_real_mode, int as_text,
int show_absolute_syms, int show_absolute_relocs,
- int show_reloc_info)
+ int show_reloc_info, int use_large_reloc)
{
regex_init(use_real_mode);
read_ehdr(fp);
@@ -1153,5 +1189,5 @@ void process(FILE *fp, int use_real_mode, int as_text,
print_reloc_info();
return;
}
- emit_relocs(as_text, use_real_mode);
+ emit_relocs(as_text, use_real_mode, use_large_reloc);
}
diff --git a/arch/x86/tools/relocs.h b/arch/x86/tools/relocs.h
index 1d23bf953a4a..cb771cc4412d 100644
--- a/arch/x86/tools/relocs.h
+++ b/arch/x86/tools/relocs.h
@@ -30,8 +30,8 @@ enum symtype {
void process_32(FILE *fp, int use_real_mode, int as_text,
int show_absolute_syms, int show_absolute_relocs,
- int show_reloc_info);
+ int show_reloc_info, int use_large_reloc);
void process_64(FILE *fp, int use_real_mode, int as_text,
int show_absolute_syms, int show_absolute_relocs,
- int show_reloc_info);
+ int show_reloc_info, int use_large_reloc);
#endif /* RELOCS_H */
diff --git a/arch/x86/tools/relocs_common.c b/arch/x86/tools/relocs_common.c
index acab636bcb34..9cf1391af50a 100644
--- a/arch/x86/tools/relocs_common.c
+++ b/arch/x86/tools/relocs_common.c
@@ -11,14 +11,14 @@ void die(char *fmt, ...)
static void usage(void)
{
- die("relocs [--abs-syms|--abs-relocs|--reloc-info|--text|--realmode]" \
- " vmlinux\n");
+ die("relocs [--abs-syms|--abs-relocs|--reloc-info|--text|--realmode|" \
+ "--large-reloc] vmlinux\n");
}
int main(int argc, char **argv)
{
int show_absolute_syms, show_absolute_relocs, show_reloc_info;
- int as_text, use_real_mode;
+ int as_text, use_real_mode, use_large_reloc;
const char *fname;
FILE *fp;
int i;
@@ -29,6 +29,7 @@ int main(int argc, char **argv)
show_reloc_info = 0;
as_text = 0;
use_real_mode = 0;
+ use_large_reloc = 0;
fname = NULL;
for (i = 1; i < argc; i++) {
char *arg = argv[i];
@@ -53,6 +54,10 @@ int main(int argc, char **argv)
use_real_mode = 1;
continue;
}
+ if (strcmp(arg, "--large-reloc") == 0) {
+ use_large_reloc = 1;
+ continue;
+ }
}
else if (!fname) {
fname = arg;
@@ -74,11 +79,11 @@ int main(int argc, char **argv)
if (e_ident[EI_CLASS] == ELFCLASS64)
process_64(fp, use_real_mode, as_text,
show_absolute_syms, show_absolute_relocs,
- show_reloc_info);
+ show_reloc_info, use_large_reloc);
else
process_32(fp, use_real_mode, as_text,
show_absolute_syms, show_absolute_relocs,
- show_reloc_info);
+ show_reloc_info, use_large_reloc);
fclose(fp);
return 0;
}
--
2.13.2.932.g7449e964c-goog
On 07/18/17 15:33, Thomas Garnier wrote:
> With PIE support and KASLR extended range, the modules may be further
> away from the kernel than before breaking mcmodel=kernel expectations.
>
> Add an option to build modules with mcmodel=large. The modules generated
> code will make no assumptions on placement in memory.
>
> Despite this option, modules still expect kernel functions to be within
> 2G and generate relative calls. To solve this issue, the PLT arm64 code
> was adapted for x86_64. When a relative relocation go outside its range,
> a dynamic PLT entry is used to correctly jump to the destination.
Why large as opposed to medium or medium-PIC?
-hpa
On Tue, Jul 18, 2017 at 6:33 PM, Thomas Garnier <[email protected]> wrote:
> Change the assembly code to use only relative references of symbols for the
> kernel to be PIE compatible. The new __ASM_GET_PTR_PRE macro is used to
> get the address of a symbol on both 32 and 64-bit with PIE support.
>
> Position Independent Executable (PIE) support will allow to extended the
> KASLR randomization range below the -2G memory limit.
>
> Signed-off-by: Thomas Garnier <[email protected]>
> ---
> arch/x86/include/asm/kvm_host.h | 6 ++++--
> arch/x86/kernel/kvm.c | 6 ++++--
> arch/x86/kvm/svm.c | 4 ++--
> 3 files changed, 10 insertions(+), 6 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 87ac4fba6d8e..3041201a3aeb 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1352,9 +1352,11 @@ asmlinkage void kvm_spurious_fault(void);
> ".pushsection .fixup, \"ax\" \n" \
> "667: \n\t" \
> cleanup_insn "\n\t" \
> - "cmpb $0, kvm_rebooting \n\t" \
> + "cmpb $0, kvm_rebooting" __ASM_SEL(,(%%rip)) " \n\t" \
> "jne 668b \n\t" \
> - __ASM_SIZE(push) " $666b \n\t" \
> + __ASM_SIZE(push) "%%" _ASM_AX " \n\t" \
> + __ASM_GET_PTR_PRE(666b) "%%" _ASM_AX "\n\t" \
> + "xchg %%" _ASM_AX ", (%%" _ASM_SP ") \n\t" \
> "call kvm_spurious_fault \n\t" \
> ".popsection \n\t" \
> _ASM_EXTABLE(666b, 667b)
> diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
> index 71c17a5be983..53b8ad162589 100644
> --- a/arch/x86/kernel/kvm.c
> +++ b/arch/x86/kernel/kvm.c
> @@ -618,8 +618,10 @@ asm(
> ".global __raw_callee_save___kvm_vcpu_is_preempted;"
> ".type __raw_callee_save___kvm_vcpu_is_preempted, @function;"
> "__raw_callee_save___kvm_vcpu_is_preempted:"
> -"movq __per_cpu_offset(,%rdi,8), %rax;"
> -"cmpb $0, " __stringify(KVM_STEAL_TIME_preempted) "+steal_time(%rax);"
> +"leaq __per_cpu_offset(%rip), %rax;"
> +"movq (%rax,%rdi,8), %rax;"
> +"addq " __stringify(KVM_STEAL_TIME_preempted) "+steal_time(%rip), %rax;"
This doesn't look right. It's accessing a per-cpu variable. The
per-cpu section is an absolute, zero-based section and not subject to
relocation.
> +"cmpb $0, (%rax);
> "setne %al;"
> "ret;"
> ".popsection");
--
Brian Gerst
On Tue, Jul 18, 2017 at 6:33 PM, Thomas Garnier <[email protected]> wrote:
> Perpcu uses a clever design where the .percu ELF section has a virtual
> address of zero and the relocation code avoid relocating specific
> symbols. It makes the code simple and easily adaptable with or without
> SMP support.
>
> This design is incompatible with PIE because generated code always try to
> access the zero virtual address relative to the default mapping address.
> It becomes impossible when KASLR is configured to go below -2G. This
> patch solves this problem by removing the zero mapping and adapting the GS
> base to be relative to the expected address. These changes are done only
> when PIE is enabled. The original implementation is kept as-is
> by default.
The reason the per-cpu section is zero-based on x86-64 is to work
around GCC hardcoding the stack protector canary at %gs:40. So
this patch is incompatible with CONFIG_STACK_PROTECTOR.
--
Brian Gerst
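For reference, a hedged sketch of the constraint Brian Gerst describes
(a simplified userspace layout, not the kernel structure verbatim):
GCC's x86-64 stack protector reads the canary at a fixed %gs:40, which
the kernel satisfies by placing irq_stack_union at the very start of the
zero-based percpu area with a 40-byte pad before the canary.

#include <assert.h>
#include <stddef.h>

union demo_irq_stack_union {
	char irq_stack[16384];		/* stack size is illustrative */
	struct {
		char gs_base[40];
		unsigned long stack_canary;	/* must end up at %gs:40 */
	};
};

int main(void)
{
	assert(offsetof(union demo_irq_stack_union, stack_canary) == 40);
	return 0;
}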
On Tue, Jul 18, 2017 at 9:35 PM, H. Peter Anvin <[email protected]> wrote:
> On 07/18/17 15:33, Thomas Garnier wrote:
>> With PIE support and KASLR extended range, the modules may be further
>> away from the kernel than before breaking mcmodel=kernel expectations.
>>
>> Add an option to build modules with mcmodel=large. The modules generated
>> code will make no assumptions on placement in memory.
>>
>> Despite this option, modules still expect kernel functions to be within
>> 2G and generate relative calls. To solve this issue, the PLT arm64 code
>> was adapted for x86_64. When a relative relocation go outside its range,
>> a dynamic PLT entry is used to correctly jump to the destination.
>
> Why large as opposed to medium or medium-PIC?
Or for that matter, why not small-PIC? We aren't changing the size of
the kernel to be larger than 2G text or data. Small-PIC would still
allow it to be placed anywhere in the address space, and would
generate far better code.
--
Brian Gerst
On 07/18/17 at 03:33pm, Thomas Garnier wrote:
> quiet_cmd_relocs = RELOCS $@
> cmd_relocs = $(CMD_RELOCS) $< > $@;$(CMD_RELOCS) --abs-relocs $<
> $(obj)/vmlinux.relocs: vmlinux FORCE
> diff --git a/arch/x86/boot/compressed/misc.c b/arch/x86/boot/compressed/misc.c
> index a0838ab929f2..0a0c80ab1842 100644
> --- a/arch/x86/boot/compressed/misc.c
> +++ b/arch/x86/boot/compressed/misc.c
> @@ -170,10 +170,18 @@ void __puthex(unsigned long value)
> }
>
> #if CONFIG_X86_NEED_RELOCS
> +
> +/* Large randomization go lower than -2G and use large relocation table */
> +#ifdef CONFIG_RANDOMIZE_BASE_LARGE
> +typedef long rel_t;
> +#else
> +typedef int rel_t;
> +#endif
> +
> static void handle_relocations(void *output, unsigned long output_len,
> unsigned long virt_addr)
> {
> - int *reloc;
> + rel_t *reloc;
> unsigned long delta, map, ptr;
> unsigned long min_addr = (unsigned long)output;
> unsigned long max_addr = min_addr + (VO___bss_start - VO__text);
> diff --git a/arch/x86/include/asm/page_64_types.h b/arch/x86/include/asm/page_64_types.h
> index 3f5f08b010d0..6b65f846dd64 100644
> --- a/arch/x86/include/asm/page_64_types.h
> +++ b/arch/x86/include/asm/page_64_types.h
> @@ -48,7 +48,11 @@
> #define __PAGE_OFFSET __PAGE_OFFSET_BASE
> #endif /* CONFIG_RANDOMIZE_MEMORY */
>
> +#ifdef CONFIG_RANDOMIZE_BASE_LARGE
> +#define __START_KERNEL_map _AC(0xffffffff00000000, UL)
> +#else
> #define __START_KERNEL_map _AC(0xffffffff80000000, UL)
> +#endif /* CONFIG_RANDOMIZE_BASE_LARGE */
>
> /* See Documentation/x86/x86_64/mm.txt for a description of the memory map. */
> #ifdef CONFIG_X86_5LEVEL
> @@ -65,9 +69,14 @@
> * 512MiB by default, leaving 1.5GiB for modules once the page tables
> * are fully set up. If kernel ASLR is configured, it can extend the
> * kernel page table mapping, reducing the size of the modules area.
> + * On PIE, we relocate the binary 2G lower so add this extra space.
> */
> #if defined(CONFIG_RANDOMIZE_BASE)
> +#ifdef CONFIG_RANDOMIZE_BASE_LARGE
> +#define KERNEL_IMAGE_SIZE (_AC(3, UL) * 1024 * 1024 * 1024)
> +#else
> #define KERNEL_IMAGE_SIZE (1024 * 1024 * 1024)
> +#endif
> #else
> #define KERNEL_IMAGE_SIZE (512 * 1024 * 1024)
> #endif
> diff --git a/arch/x86/kernel/head64.c b/arch/x86/kernel/head64.c
> index 4103e90ff128..235c3f7b46c7 100644
> --- a/arch/x86/kernel/head64.c
> +++ b/arch/x86/kernel/head64.c
> @@ -39,6 +39,7 @@ static unsigned int __initdata next_early_pgt;
> pmdval_t early_pmd_flags = __PAGE_KERNEL_LARGE & ~(_PAGE_GLOBAL | _PAGE_NX);
>
> #define __head __section(.head.text)
> +#define pud_count(x) (((x + (PUD_SIZE - 1)) & ~(PUD_SIZE - 1)) >> PUD_SHIFT)
>
> static void __head *fixup_pointer(void *ptr, unsigned long physaddr)
> {
> @@ -54,6 +55,8 @@ unsigned long _text_offset = (unsigned long)(_text - __START_KERNEL_map);
> void __head notrace __startup_64(unsigned long physaddr)
> {
> unsigned long load_delta, *p;
> + unsigned long level3_kernel_start, level3_kernel_count;
> + unsigned long level3_fixmap_start;
> pgdval_t *pgd;
> p4dval_t *p4d;
> pudval_t *pud;
> @@ -74,6 +77,11 @@ void __head notrace __startup_64(unsigned long physaddr)
> if (load_delta & ~PMD_PAGE_MASK)
> for (;;);
>
> + /* Look at the randomization spread to adapt page table used */
> + level3_kernel_start = pud_index(__START_KERNEL_map);
> + level3_kernel_count = pud_count(KERNEL_IMAGE_SIZE);
> + level3_fixmap_start = level3_kernel_start + level3_kernel_count;
> +
> /* Fixup the physical addresses in the page table */
>
> pgd = fixup_pointer(&early_top_pgt, physaddr);
> @@ -85,8 +93,9 @@ void __head notrace __startup_64(unsigned long physaddr)
> }
>
> pud = fixup_pointer(&level3_kernel_pgt, physaddr);
> - pud[510] += load_delta;
> - pud[511] += load_delta;
> + for (i = 0; i < level3_kernel_count; i++)
> + pud[level3_kernel_start + i] += load_delta;
> + pud[level3_fixmap_start] += load_delta;
>
> pmd = fixup_pointer(level2_fixmap_pgt, physaddr);
> pmd[506] += load_delta;
> @@ -137,7 +146,7 @@ void __head notrace __startup_64(unsigned long physaddr)
> */
>
> pmd = fixup_pointer(level2_kernel_pgt, physaddr);
> - for (i = 0; i < PTRS_PER_PMD; i++) {
> + for (i = 0; i < PTRS_PER_PMD * level3_kernel_count; i++) {
> if (pmd[i] & _PAGE_PRESENT)
> pmd[i] += load_delta;
Wow, this is dangerous. Three pud entries of level3_kernel_pgt all point
to level2_kernel_pgt, so this walks out of bounds of level2_kernel_pgt
and overwrites the data that follows.
And if only one page is used for level2_kernel_pgt, and the kernel is
randomized so that it crosses a pud boundary in the -4G to -1G range,
it won't work well.
> }
> @@ -268,7 +277,8 @@ asmlinkage __visible void __init x86_64_start_kernel(char * real_mode_data)
> */
> BUILD_BUG_ON(MODULES_VADDR < __START_KERNEL_map);
> BUILD_BUG_ON(MODULES_VADDR - __START_KERNEL_map < KERNEL_IMAGE_SIZE);
> - BUILD_BUG_ON(MODULES_LEN + KERNEL_IMAGE_SIZE > 2*PUD_SIZE);
> + BUILD_BUG_ON(!IS_ENABLED(CONFIG_RANDOMIZE_BASE_LARGE) &&
> + MODULES_LEN + KERNEL_IMAGE_SIZE > 2*PUD_SIZE);
> BUILD_BUG_ON((__START_KERNEL_map & ~PMD_MASK) != 0);
> BUILD_BUG_ON((MODULES_VADDR & ~PMD_MASK) != 0);
> BUILD_BUG_ON(!(MODULES_VADDR > __START_KERNEL));
> diff --git a/arch/x86/kernel/head_64.S b/arch/x86/kernel/head_64.S
> index 4d0a7e68bfe8..e8b2d6706eca 100644
> --- a/arch/x86/kernel/head_64.S
> +++ b/arch/x86/kernel/head_64.S
> @@ -39,11 +39,15 @@
>
> #define p4d_index(x) (((x) >> P4D_SHIFT) & (PTRS_PER_P4D-1))
> #define pud_index(x) (((x) >> PUD_SHIFT) & (PTRS_PER_PUD-1))
> +#define pud_count(x) (((x + (PUD_SIZE - 1)) & ~(PUD_SIZE - 1)) >> PUD_SHIFT)
>
> PGD_PAGE_OFFSET = pgd_index(__PAGE_OFFSET_BASE)
> PGD_START_KERNEL = pgd_index(__START_KERNEL_map)
> L3_START_KERNEL = pud_index(__START_KERNEL_map)
>
> +/* Adapt page table L3 space based on range of randomization */
> +L3_KERNEL_ENTRY_COUNT = pud_count(KERNEL_IMAGE_SIZE)
> +
> .text
> __HEAD
> .code64
> @@ -396,7 +400,12 @@ NEXT_PAGE(level4_kernel_pgt)
> NEXT_PAGE(level3_kernel_pgt)
> .fill L3_START_KERNEL,8,0
> /* (2^48-(2*1024*1024*1024)-((2^39)*511))/(2^30) = 510 */
> - .quad level2_kernel_pgt - __START_KERNEL_map + _KERNPG_TABLE
> + i = 0
> + .rept L3_KERNEL_ENTRY_COUNT
> + .quad level2_kernel_pgt - __START_KERNEL_map + _KERNPG_TABLE \
> + + PAGE_SIZE*i
> + i = i + 1
> + .endr
> .quad level2_fixmap_pgt - __START_KERNEL_map + _PAGE_TABLE
>
> NEXT_PAGE(level2_kernel_pgt)
> --
> 2.13.2.932.g7449e964c-goog
>
On 07/19/17 at 08:10pm, Baoquan He wrote:
> On 07/18/17 at 03:33pm, Thomas Garnier wrote:
>
> > quiet_cmd_relocs = RELOCS $@
> > cmd_relocs = $(CMD_RELOCS) $< > $@;$(CMD_RELOCS) --abs-relocs $<
> > $(obj)/vmlinux.relocs: vmlinux FORCE
> > diff --git a/arch/x86/boot/compressed/misc.c b/arch/x86/boot/compressed/misc.c
> > index a0838ab929f2..0a0c80ab1842 100644
> > --- a/arch/x86/boot/compressed/misc.c
> > +++ b/arch/x86/boot/compressed/misc.c
> > @@ -170,10 +170,18 @@ void __puthex(unsigned long value)
> > }
> >
> > #if CONFIG_X86_NEED_RELOCS
> > +
> > +/* Large randomization go lower than -2G and use large relocation table */
> > +#ifdef CONFIG_RANDOMIZE_BASE_LARGE
> > +typedef long rel_t;
> > +#else
> > +typedef int rel_t;
> > +#endif
> > +
> > static void handle_relocations(void *output, unsigned long output_len,
> > unsigned long virt_addr)
> > {
> > - int *reloc;
> > + rel_t *reloc;
> > unsigned long delta, map, ptr;
> > unsigned long min_addr = (unsigned long)output;
> > unsigned long max_addr = min_addr + (VO___bss_start - VO__text);
> > diff --git a/arch/x86/include/asm/page_64_types.h b/arch/x86/include/asm/page_64_types.h
> > index 3f5f08b010d0..6b65f846dd64 100644
> > --- a/arch/x86/include/asm/page_64_types.h
> > +++ b/arch/x86/include/asm/page_64_types.h
> > @@ -48,7 +48,11 @@
> > #define __PAGE_OFFSET __PAGE_OFFSET_BASE
> > #endif /* CONFIG_RANDOMIZE_MEMORY */
> >
> > +#ifdef CONFIG_RANDOMIZE_BASE_LARGE
> > +#define __START_KERNEL_map _AC(0xffffffff00000000, UL)
> > +#else
> > #define __START_KERNEL_map _AC(0xffffffff80000000, UL)
> > +#endif /* CONFIG_RANDOMIZE_BASE_LARGE */
> >
> > /* See Documentation/x86/x86_64/mm.txt for a description of the memory map. */
> > #ifdef CONFIG_X86_5LEVEL
> > @@ -65,9 +69,14 @@
> > * 512MiB by default, leaving 1.5GiB for modules once the page tables
> > * are fully set up. If kernel ASLR is configured, it can extend the
> > * kernel page table mapping, reducing the size of the modules area.
> > + * On PIE, we relocate the binary 2G lower so add this extra space.
> > */
> > #if defined(CONFIG_RANDOMIZE_BASE)
> > +#ifdef CONFIG_RANDOMIZE_BASE_LARGE
> > +#define KERNEL_IMAGE_SIZE (_AC(3, UL) * 1024 * 1024 * 1024)
> > +#else
> > #define KERNEL_IMAGE_SIZE (1024 * 1024 * 1024)
> > +#endif
> > #else
> > #define KERNEL_IMAGE_SIZE (512 * 1024 * 1024)
> > #endif
> > diff --git a/arch/x86/kernel/head64.c b/arch/x86/kernel/head64.c
> > index 4103e90ff128..235c3f7b46c7 100644
> > --- a/arch/x86/kernel/head64.c
> > +++ b/arch/x86/kernel/head64.c
> > @@ -39,6 +39,7 @@ static unsigned int __initdata next_early_pgt;
> > pmdval_t early_pmd_flags = __PAGE_KERNEL_LARGE & ~(_PAGE_GLOBAL | _PAGE_NX);
> >
> > #define __head __section(.head.text)
> > +#define pud_count(x) (((x + (PUD_SIZE - 1)) & ~(PUD_SIZE - 1)) >> PUD_SHIFT)
> >
> > static void __head *fixup_pointer(void *ptr, unsigned long physaddr)
> > {
> > @@ -54,6 +55,8 @@ unsigned long _text_offset = (unsigned long)(_text - __START_KERNEL_map);
> > void __head notrace __startup_64(unsigned long physaddr)
> > {
> > unsigned long load_delta, *p;
> > + unsigned long level3_kernel_start, level3_kernel_count;
> > + unsigned long level3_fixmap_start;
> > pgdval_t *pgd;
> > p4dval_t *p4d;
> > pudval_t *pud;
> > @@ -74,6 +77,11 @@ void __head notrace __startup_64(unsigned long physaddr)
> > if (load_delta & ~PMD_PAGE_MASK)
> > for (;;);
> >
> > + /* Look at the randomization spread to adapt page table used */
> > + level3_kernel_start = pud_index(__START_KERNEL_map);
> > + level3_kernel_count = pud_count(KERNEL_IMAGE_SIZE);
> > + level3_fixmap_start = level3_kernel_start + level3_kernel_count;
> > +
> > /* Fixup the physical addresses in the page table */
> >
> > pgd = fixup_pointer(&early_top_pgt, physaddr);
> > @@ -85,8 +93,9 @@ void __head notrace __startup_64(unsigned long physaddr)
> > }
> >
> > pud = fixup_pointer(&level3_kernel_pgt, physaddr);
> > - pud[510] += load_delta;
> > - pud[511] += load_delta;
> > + for (i = 0; i < level3_kernel_count; i++)
> > + pud[level3_kernel_start + i] += load_delta;
> > + pud[level3_fixmap_start] += load_delta;
> >
> > pmd = fixup_pointer(level2_fixmap_pgt, physaddr);
> > pmd[506] += load_delta;
> > @@ -137,7 +146,7 @@ void __head notrace __startup_64(unsigned long physaddr)
> > */
> >
> > pmd = fixup_pointer(level2_kernel_pgt, physaddr);
> > - for (i = 0; i < PTRS_PER_PMD; i++) {
> > + for (i = 0; i < PTRS_PER_PMD * level3_kernel_count; i++) {
> > if (pmd[i] & _PAGE_PRESENT)
> > pmd[i] += load_delta;
>
> Wow, this is dangerous. All three pud entries of level3_kernel_pgt point
> into level2_kernel_pgt, so the loop runs out of bounds of
> level2_kernel_pgt and overwrites the data that follows.
>
> And if only one page is used for level2_kernel_pgt and the kernel is
> randomized to cross a pud boundary between -4G and -1G, it won't work well.
Sorry, I was wrong: the size of level2_kernel_pgt is determined by
KERNEL_IMAGE_SIZE. So it's not a problem; please ignore this comment.
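(For what it's worth, a quick sanity check of that arithmetic; a hedged,
stand-alone sketch rather than anything taken from the patch:)

#include <stdio.h>

int main(void)
{
        unsigned long kernel_image_size = 3UL << 30;  /* 3G with RANDOMIZE_BASE_LARGE */
        unsigned long pmd_size = 2UL << 20;           /* each PMD entry maps 2M */
        unsigned long ptrs_per_pmd = 512;             /* PMD entries per page table page */

        unsigned long pmd_entries = kernel_image_size / pmd_size;  /* 1536 */
        unsigned long pud_slots   = pmd_entries / ptrs_per_pmd;    /* 3 */

        /* level2_kernel_pgt is sized from KERNEL_IMAGE_SIZE, so walking
         * PTRS_PER_PMD * level3_kernel_count (512 * 3) entries stays in
         * bounds. */
        printf("%lu PMD entries -> %lu pud slots\n", pmd_entries, pud_slots);
        return 0;
}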
>
> > }
> > @@ -268,7 +277,8 @@ asmlinkage __visible void __init x86_64_start_kernel(char * real_mode_data)
> > */
> > BUILD_BUG_ON(MODULES_VADDR < __START_KERNEL_map);
> > BUILD_BUG_ON(MODULES_VADDR - __START_KERNEL_map < KERNEL_IMAGE_SIZE);
> > - BUILD_BUG_ON(MODULES_LEN + KERNEL_IMAGE_SIZE > 2*PUD_SIZE);
> > + BUILD_BUG_ON(!IS_ENABLED(CONFIG_RANDOMIZE_BASE_LARGE) &&
> > + MODULES_LEN + KERNEL_IMAGE_SIZE > 2*PUD_SIZE);
> > BUILD_BUG_ON((__START_KERNEL_map & ~PMD_MASK) != 0);
> > BUILD_BUG_ON((MODULES_VADDR & ~PMD_MASK) != 0);
> > BUILD_BUG_ON(!(MODULES_VADDR > __START_KERNEL));
> > diff --git a/arch/x86/kernel/head_64.S b/arch/x86/kernel/head_64.S
> > index 4d0a7e68bfe8..e8b2d6706eca 100644
> > --- a/arch/x86/kernel/head_64.S
> > +++ b/arch/x86/kernel/head_64.S
> > @@ -39,11 +39,15 @@
> >
> > #define p4d_index(x) (((x) >> P4D_SHIFT) & (PTRS_PER_P4D-1))
> > #define pud_index(x) (((x) >> PUD_SHIFT) & (PTRS_PER_PUD-1))
> > +#define pud_count(x) (((x + (PUD_SIZE - 1)) & ~(PUD_SIZE - 1)) >> PUD_SHIFT)
> >
> > PGD_PAGE_OFFSET = pgd_index(__PAGE_OFFSET_BASE)
> > PGD_START_KERNEL = pgd_index(__START_KERNEL_map)
> > L3_START_KERNEL = pud_index(__START_KERNEL_map)
> >
> > +/* Adapt page table L3 space based on range of randomization */
> > +L3_KERNEL_ENTRY_COUNT = pud_count(KERNEL_IMAGE_SIZE)
> > +
> > .text
> > __HEAD
> > .code64
> > @@ -396,7 +400,12 @@ NEXT_PAGE(level4_kernel_pgt)
> > NEXT_PAGE(level3_kernel_pgt)
> > .fill L3_START_KERNEL,8,0
> > /* (2^48-(2*1024*1024*1024)-((2^39)*511))/(2^30) = 510 */
> > - .quad level2_kernel_pgt - __START_KERNEL_map + _KERNPG_TABLE
> > + i = 0
> > + .rept L3_KERNEL_ENTRY_COUNT
> > + .quad level2_kernel_pgt - __START_KERNEL_map + _KERNPG_TABLE \
> > + + PAGE_SIZE*i
> > + i = i + 1
> > + .endr
> > .quad level2_fixmap_pgt - __START_KERNEL_map + _PAGE_TABLE
> >
> > NEXT_PAGE(level2_kernel_pgt)
> > --
> > 2.13.2.932.g7449e964c-goog
> >
On Tue, 18 Jul 2017, Thomas Garnier wrote:
> Performance/Size impact:
> Hackbench (50% and 1600% loads):
> - PIE enabled: 7% to 8% on half load, 10% on heavy load.
> slab_test (average of 10 runs):
> - PIE enabled: 3% to 4%
> Kernbench (average of 10 Half and Optimal runs):
> - PIE enabled: 5% to 6%
>
> Size of vmlinux (Ubuntu configuration):
> File size:
> - PIE disabled: 472928672 bytes (-0.000169% from baseline)
> - PIE enabled: 216878461 bytes (-54.14% from baseline)
Maybe we need something like CONFIG_PARANOIA so that we can determine at
build time how much performance we want to sacrifice for security?
It's going to be difficult to understand what all these hardening config
options do.
On Tue, Jul 18, 2017 at 7:49 PM, Brian Gerst <[email protected]> wrote:
> On Tue, Jul 18, 2017 at 6:33 PM, Thomas Garnier <[email protected]> wrote:
>> Change the assembly code to use only relative references of symbols for the
>> kernel to be PIE compatible. The new __ASM_GET_PTR_PRE macro is used to
>> get the address of a symbol on both 32 and 64-bit with PIE support.
>>
>> Position Independent Executable (PIE) support will allow to extended the
>> KASLR randomization range below the -2G memory limit.
>>
>> Signed-off-by: Thomas Garnier <[email protected]>
>> ---
>> arch/x86/include/asm/kvm_host.h | 6 ++++--
>> arch/x86/kernel/kvm.c | 6 ++++--
>> arch/x86/kvm/svm.c | 4 ++--
>> 3 files changed, 10 insertions(+), 6 deletions(-)
>>
>> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
>> index 87ac4fba6d8e..3041201a3aeb 100644
>> --- a/arch/x86/include/asm/kvm_host.h
>> +++ b/arch/x86/include/asm/kvm_host.h
>> @@ -1352,9 +1352,11 @@ asmlinkage void kvm_spurious_fault(void);
>> ".pushsection .fixup, \"ax\" \n" \
>> "667: \n\t" \
>> cleanup_insn "\n\t" \
>> - "cmpb $0, kvm_rebooting \n\t" \
>> + "cmpb $0, kvm_rebooting" __ASM_SEL(,(%%rip)) " \n\t" \
>> "jne 668b \n\t" \
>> - __ASM_SIZE(push) " $666b \n\t" \
>> + __ASM_SIZE(push) "%%" _ASM_AX " \n\t" \
>> + __ASM_GET_PTR_PRE(666b) "%%" _ASM_AX "\n\t" \
>> + "xchg %%" _ASM_AX ", (%%" _ASM_SP ") \n\t" \
>> "call kvm_spurious_fault \n\t" \
>> ".popsection \n\t" \
>> _ASM_EXTABLE(666b, 667b)
>> diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
>> index 71c17a5be983..53b8ad162589 100644
>> --- a/arch/x86/kernel/kvm.c
>> +++ b/arch/x86/kernel/kvm.c
>> @@ -618,8 +618,10 @@ asm(
>> ".global __raw_callee_save___kvm_vcpu_is_preempted;"
>> ".type __raw_callee_save___kvm_vcpu_is_preempted, @function;"
>> "__raw_callee_save___kvm_vcpu_is_preempted:"
>> -"movq __per_cpu_offset(,%rdi,8), %rax;"
>> -"cmpb $0, " __stringify(KVM_STEAL_TIME_preempted) "+steal_time(%rax);"
>> +"leaq __per_cpu_offset(%rip), %rax;"
>> +"movq (%rax,%rdi,8), %rax;"
>> +"addq " __stringify(KVM_STEAL_TIME_preempted) "+steal_time(%rip), %rax;"
>
> This doesn't look right. It's accessing a per-cpu variable. The
> per-cpu section is an absolute, zero-based section and not subject to
> relocation.
>
PIE does not respect the zero-based section; it tries to make
everything relative. Patch 16/22 also adapts per-cpu to work with PIE
(while keeping the zero-based absolute design by default).
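(As a hedged, self-contained illustration of why the zero-based layout
and PIE don't mix once the image can sit around the lowered base; the
addresses are just the ones discussed in this thread:)

#include <stdint.h>
#include <stdio.h>

int main(void)
{
        uint64_t rip    = 0xffffffff00000000ULL;  /* text linked at the lowered base */
        uint64_t target = 0;                      /* zero-based percpu symbol */

        /* A RIP-relative reference encodes a signed 32-bit displacement. */
        int64_t disp = (int64_t)(target - rip);

        printf("needed displacement: %lld, fits in int32: %d\n",
               (long long)disp,
               disp >= INT32_MIN && disp <= INT32_MAX);  /* about +4G -> 0 */
        return 0;
}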
>> +"cmpb $0, (%rax);
>> "setne %al;"
>> "ret;"
>> ".popsection");
>
> --
> Brian Gerst
--
Thomas
On Tue, Jul 18, 2017 at 8:59 PM, Brian Gerst <[email protected]> wrote:
> On Tue, Jul 18, 2017 at 9:35 PM, H. Peter Anvin <[email protected]> wrote:
>> On 07/18/17 15:33, Thomas Garnier wrote:
>>> With PIE support and KASLR extended range, the modules may be further
>>> away from the kernel than before breaking mcmodel=kernel expectations.
>>>
>>> Add an option to build modules with mcmodel=large. The modules generated
>>> code will make no assumptions on placement in memory.
>>>
>>> Despite this option, modules still expect kernel functions to be within
>>> 2G and generate relative calls. To solve this issue, the PLT arm64 code
>>> was adapted for x86_64. When a relative relocation go outside its range,
>>> a dynamic PLT entry is used to correctly jump to the destination.
>>
>> Why large as opposed to medium or medium-PIC?
>
> Or for that matter, why not small-PIC? We aren't changing the size of
> the kernel to be larger than 2G text or data. Small-PIC would still
> allow it to be placed anywhere in the address space, and would
> generate far better code.
My understanding was that small-PIC and medium-PIC assume that the
module code is in the lower 2G of memory. I will do additional testing
on the modules to confirm that.
>
> --
> Brian Gerst
--
Thomas
On Wed, Jul 19, 2017 at 11:58 AM, Thomas Garnier <[email protected]> wrote:
> On Tue, Jul 18, 2017 at 8:59 PM, Brian Gerst <[email protected]> wrote:
>> On Tue, Jul 18, 2017 at 9:35 PM, H. Peter Anvin <[email protected]> wrote:
>>> On 07/18/17 15:33, Thomas Garnier wrote:
>>>> With PIE support and KASLR extended range, the modules may be further
>>>> away from the kernel than before breaking mcmodel=kernel expectations.
>>>>
>>>> Add an option to build modules with mcmodel=large. The modules generated
>>>> code will make no assumptions on placement in memory.
>>>>
>>>> Despite this option, modules still expect kernel functions to be within
>>>> 2G and generate relative calls. To solve this issue, the PLT arm64 code
>>>> was adapted for x86_64. When a relative relocation go outside its range,
>>>> a dynamic PLT entry is used to correctly jump to the destination.
>>>
>>> Why large as opposed to medium or medium-PIC?
>>
>> Or for that matter, why not small-PIC? We aren't changing the size of
>> the kernel to be larger than 2G text or data. Small-PIC would still
>> allow it to be placed anywhere in the address space, and would
>> generate far better code.
>
> My understanding was that small=PIC and medium=PIC assume that the
> module code is in the lower 2G of memory. I will do additional testing
> on the modules to confirm that.
That is only for small/medium absolute (non-PIC) code. Think about
userspace shared libraries. They are not limited to being mapped in
the lower 2G of the address space.
--
Brian Gerst
On Tue, Jul 18, 2017 at 8:08 PM, Brian Gerst <[email protected]> wrote:
> On Tue, Jul 18, 2017 at 6:33 PM, Thomas Garnier <[email protected]> wrote:
>> Perpcu uses a clever design where the .percu ELF section has a virtual
>> address of zero and the relocation code avoid relocating specific
>> symbols. It makes the code simple and easily adaptable with or without
>> SMP support.
>>
>> This design is incompatible with PIE because generated code always try to
>> access the zero virtual address relative to the default mapping address.
>> It becomes impossible when KASLR is configured to go below -2G. This
>> patch solves this problem by removing the zero mapping and adapting the GS
>> base to be relative to the expected address. These changes are done only
>> when PIE is enabled. The original implementation is kept as-is
>> by default.
>
> The reason the per-cpu section is zero-based on x86-64 is to
> workaround GCC hardcoding the stack protector canary at %gs:40. So
> this patch is incompatible with CONFIG_STACK_PROTECTOR.
Ok, that makes sense. I don't want this feature to not work with
CONFIG_CC_STACKPROTECTOR*. One way to fix that would be adding a GDT
entry for gs so gs:40 points to the correct memory address and
gs:[rip+XX] works correctly through the MSR. Given the separate
discussion on mcmodel, I am first going to check if we can move from
PIE to PIC with mcmodel=small or medium, which would remove the percpu
change requirement. I tried before without success, but I now understand
percpu and the other components better, so maybe I can make it work.
Thanks a lot for the feedback.
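(For context, the hard-coded access Brian mentions looks roughly like
this; a hedged sketch, not code from the patch set:)

/* gcc's x86-64 stack protector reads the canary from a fixed offset off
 * the segment base (%gs:40 in kernel builds), regardless of where the
 * percpu area actually lives. */
static inline unsigned long read_canary_slot(void)
{
        unsigned long canary;

        asm volatile("movq %%gs:40, %0" : "=r" (canary));
        return canary;
}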
>
> --
> Brian Gerst
--
Thomas
On Tue 2017-07-18 15:33:24, Thomas Garnier wrote:
> Change the assembly code to use only relative references of symbols for the
> kernel to be PIE compatible.
>
> Position Independent Executable (PIE) support will allow to extended the
> KASLR randomization range below the -2G memory limit.
>
> Signed-off-by: Thomas Garnier <[email protected]>
Acked-by: Pavel Machek <[email protected]>
(But not tested; testing it would be nice).
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
On Wed, Jul 19, 2017 at 7:08 AM, Christopher Lameter <[email protected]> wrote:
> On Tue, 18 Jul 2017, Thomas Garnier wrote:
>
>> Performance/Size impact:
>> Hackbench (50% and 1600% loads):
>> - PIE enabled: 7% to 8% on half load, 10% on heavy load.
>> slab_test (average of 10 runs):
>> - PIE enabled: 3% to 4%
>> Kernbench (average of 10 Half and Optimal runs):
>> - PIE enabled: 5% to 6%
>>
>> Size of vmlinux (Ubuntu configuration):
>> File size:
>> - PIE disabled: 472928672 bytes (-0.000169% from baseline)
>> - PIE enabled: 216878461 bytes (-54.14% from baseline)
>
> Maybe we need something like CONFIG_PARANOIA so that we can determine at
> build time how much performance we want to sacrifice for security?
>
> It's going to be difficult to understand what all these hardening config
> options do.
This kind of thing got discussed recently, and like
CONFIG_EXPERIMENTAL, a global config doesn't really work. The best
thing to do is to document each config as well as possible so that
system builders can decide.
-Kees
--
Kees Cook
Pixel Security
On Wed, Jul 19, 2017 at 3:27 PM, H. Peter Anvin <[email protected]> wrote:
> On 07/19/17 08:40, Thomas Garnier wrote:
>>>
>>> This doesn't look right. It's accessing a per-cpu variable. The
>>> per-cpu section is an absolute, zero-based section and not subject to
>>> relocation.
>>
>> PIE does not respect the zero-based section, it tries to have
>> everything relative. Patch 16/22 also adapt per-cpu to work with PIE
>> (while keeping the zero absolute design by default).
>>
>
> This is silly. The right thing is for PIE is to be explicitly absolute,
> without (%rip). The use of (%rip) memory references for percpu is just
> an optimization.
I agree that it is odd but that's how the compiler generates code. I
will re-explore PIC options with mcmodel=small or medium, as mentioned
on other threads.
>
> -hpa
>
--
Thomas
On Wed, Jul 19, 2017 at 3:33 PM, H. Peter Anvin <[email protected]> wrote:
> On 07/18/17 15:33, Thomas Garnier wrote:
>> The x86 relocation tool generates a list of 32-bit signed integers. There
>> was no need to use 64-bit integers because all addresses where above the 2G
>> top of the memory.
>>
>> This change add a large-reloc option to generate 64-bit unsigned integers.
>> It can be used when the kernel plan to go below the top 2G and 32-bit
>> integers are not enough.
>
> Why on Earth? This would only be necessary if the *kernel itself* was
> more than 2G, which isn't going to happen for the forseeable future.
Because the relocation integer is an absolute address, not an offset
in the binary. Next iteration, I can try using a 32-bit offset for
everyone.
>
> -hpa
>
--
Thomas
On 07/19/17 08:40, Thomas Garnier wrote:
>>
>> This doesn't look right. It's accessing a per-cpu variable. The
>> per-cpu section is an absolute, zero-based section and not subject to
>> relocation.
>
> PIE does not respect the zero-based section, it tries to have
> everything relative. Patch 16/22 also adapt per-cpu to work with PIE
> (while keeping the zero absolute design by default).
>
This is silly. The right thing for PIE is to be explicitly absolute,
without (%rip). The use of (%rip) memory references for percpu is just
an optimization.
-hpa
On 07/18/17 15:33, Thomas Garnier wrote:
> The x86 relocation tool generates a list of 32-bit signed integers. There
> was no need to use 64-bit integers because all addresses where above the 2G
> top of the memory.
>
> This change add a large-reloc option to generate 64-bit unsigned integers.
> It can be used when the kernel plan to go below the top 2G and 32-bit
> integers are not enough.
Why on Earth? This would only be necessary if the *kernel itself* was
more than 2G, which isn't going to happen for the forseeable future.
-hpa
On 19 July 2017 at 23:27, H. Peter Anvin <[email protected]> wrote:
> On 07/19/17 08:40, Thomas Garnier wrote:
>>>
>>> This doesn't look right. It's accessing a per-cpu variable. The
>>> per-cpu section is an absolute, zero-based section and not subject to
>>> relocation.
>>
>> PIE does not respect the zero-based section, it tries to have
>> everything relative. Patch 16/22 also adapt per-cpu to work with PIE
>> (while keeping the zero absolute design by default).
>>
>
> This is silly. The right thing is for PIE is to be explicitly absolute,
> without (%rip). The use of (%rip) memory references for percpu is just
> an optimization.
>
Sadly, there is an issue in binutils that may prevent us from doing
this as cleanly as we would want.
For historical reasons, bfd.ld emits special symbols like
__GLOBAL_OFFSET_TABLE__ as absolute symbols with a section index of
SHN_ABS, even though it is quite obvious that they are relative like
any other symbol that points into the image. Unfortunately, this means
that binutils needs to emit R_X86_64_RELATIVE relocations even for
SHN_ABS symbols, which means we lose the ability to use both absolute
and relocatable symbols in the same PIE image (unless the reloc tool
can filter them out)
More info here:
https://sourceware.org/bugzilla/show_bug.cgi?id=19818
On 07/18/17 15:33, Thomas Garnier wrote:
> Change the assembly code to use only relative references of symbols for the
> kernel to be PIE compatible.
>
> Position Independent Executable (PIE) support will allow to extended the
> KASLR randomization range below the -2G memory limit.
>
> Signed-off-by: Thomas Garnier <[email protected]>
> ---
> arch/x86/kernel/relocate_kernel_64.S | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/arch/x86/kernel/relocate_kernel_64.S b/arch/x86/kernel/relocate_kernel_64.S
> index 98111b38ebfd..da817d1628ac 100644
> --- a/arch/x86/kernel/relocate_kernel_64.S
> +++ b/arch/x86/kernel/relocate_kernel_64.S
> @@ -186,7 +186,7 @@ identity_mapped:
> movq %rax, %cr3
> lea PAGE_SIZE(%r8), %rsp
> call swap_pages
> - movq $virtual_mapped, %rax
> + leaq virtual_mapped(%rip), %rax
> pushq %rax
> ret
>
This is completely wrong. The whole point is that %rip here is on an
identity-mapped page, which means that its offset to the actual symbol
is ill-defined.
The use of pushq/ret to do an indirect jump is bizarre, though, instead of:
pushq %r8
ret
one ought to simply do
jmpq *%r8
I think the author of this code was confused by the fact that we have to
use this construct to do a *far* jump.
There are some other very bizarre constructs in this file that I can
only assume come from a clumsy port from 32 bits, for example:
call 1f
1:
popq %r8
subq $(1b - relocate_kernel), %r8
... instead of the much simpler ...
leaq relocate_kernel(%rip), %r8
With this value in %r8 anyway, you can simply do:
leaq (virtual_mapped - relocate_kernel)(%r8), %rax
jmpq *%rax
This patchset scares me. There seem to be a lot of places where you
have not been very aware of what is actually happening in the code, nor
done the research on how the ABIs actually work and affect things.
Sorry.
-hpa
On Wed, Jul 19, 2017 at 3:58 PM, H. Peter Anvin <[email protected]> wrote:
> On 07/18/17 15:33, Thomas Garnier wrote:
>> Change the assembly code to use only relative references of symbols for the
>> kernel to be PIE compatible.
>>
>> Position Independent Executable (PIE) support will allow to extended the
>> KASLR randomization range below the -2G memory limit.
>>
>> Signed-off-by: Thomas Garnier <[email protected]>
>> ---
>> arch/x86/kernel/relocate_kernel_64.S | 2 +-
>> 1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/arch/x86/kernel/relocate_kernel_64.S b/arch/x86/kernel/relocate_kernel_64.S
>> index 98111b38ebfd..da817d1628ac 100644
>> --- a/arch/x86/kernel/relocate_kernel_64.S
>> +++ b/arch/x86/kernel/relocate_kernel_64.S
>> @@ -186,7 +186,7 @@ identity_mapped:
>> movq %rax, %cr3
>> lea PAGE_SIZE(%r8), %rsp
>> call swap_pages
>> - movq $virtual_mapped, %rax
>> + leaq virtual_mapped(%rip), %rax
>> pushq %rax
>> ret
>>
>
> This is completely wrong. The whole point is that %rip here is on an
> identity-mapped page, which means that its offset to the actual symbol
> is ill-defined.
>
> The use of pushq/ret to do an indirect jump is bizarre, though, instead of:
>
> pushq %r8
> ret
>
> one ought to simply do
>
> jmpq *%r8
>
> I think the author of this code was confused by the fact that we have to
> use this construct to do a *far* jump.
>
> There are some other very bizarre constructs in this file, that I can
> only assume comes from clumsy porting from 32 bits, for example:
>
> call 1f
> 1:
> popq %r8
> subq $(1b - relocate_kernel), %r8
>
> ... instead of the much simpler ...
>
> leaq relocate_kernel(%rip), %r8
>
> With this value in %r8 anyway, you can simply do:
>
> leaq (virtual_mapped - relocate_kernel)(%r8), %rax
> jmpq *%rax
>
Thanks I will look into that.
> This patchset scares me. There seems to be a lot of places where you
> have not been very aware of what is actually happening in the code, nor
> have done research about how the ABIs actually work and affect things.
There is a lot of assembly that needed to be changed. It was easier to
understand the parts that are directly exercised, like boot or percpu.
That's why I value people's feedback and will improve the patchset.
Thanks!
>
> Sorry.
>
> -hpa
--
Thomas
On Wed, Jul 19, 2017 at 4:08 PM, H. Peter Anvin <[email protected]> wrote:
> On 07/19/17 15:47, Thomas Garnier wrote:
>> On Wed, Jul 19, 2017 at 3:33 PM, H. Peter Anvin <[email protected]> wrote:
>>> On 07/18/17 15:33, Thomas Garnier wrote:
>>>> The x86 relocation tool generates a list of 32-bit signed integers. There
>>>> was no need to use 64-bit integers because all addresses where above the 2G
>>>> top of the memory.
>>>>
>>>> This change add a large-reloc option to generate 64-bit unsigned integers.
>>>> It can be used when the kernel plan to go below the top 2G and 32-bit
>>>> integers are not enough.
>>>
>>> Why on Earth? This would only be necessary if the *kernel itself* was
>>> more than 2G, which isn't going to happen for the forseeable future.
>>
>> Because the relocation integer is an absolute address, not an offset
>> in the binary. Next iteration, I can try using a 32-bit offset for
>> everyone.
>
> It is an absolute address *as the kernel was originally linked*, for
> obvious reasons.
Sure, but while that worked when the kernel was linked just above
0xffffffff80000000, it doesn't work when the base goes down to
0xffffffff00000000. That's why using an offset might make more sense in
general.
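(Concretely, a hedged, stand-alone illustration rather than the relocs
tool itself: sign-extending a 32-bit entry only round-trips addresses in
the top 2G:)

#include <stdint.h>
#include <stdio.h>

int main(void)
{
        uint64_t top2g = 0xffffffff80000000ULL;  /* classic __START_KERNEL_map */
        uint64_t low   = 0xffffffff00000000ULL;  /* base with RANDOMIZE_BASE_LARGE */

        /* Truncate to 32 bits, then sign-extend, which is what a 32-bit
         * signed relocation entry effectively does. */
        uint64_t top2g_rt = (uint64_t)(int64_t)(int32_t)(uint32_t)top2g;
        uint64_t low_rt   = (uint64_t)(int64_t)(int32_t)(uint32_t)low;

        printf("top 2G survives: %d, lowered base survives: %d\n",
               top2g_rt == top2g, low_rt == low);  /* prints 1, 0 */
        return 0;
}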
>
> -hpa
>
--
Thomas.
On 07/19/17 15:47, Thomas Garnier wrote:
> On Wed, Jul 19, 2017 at 3:33 PM, H. Peter Anvin <[email protected]> wrote:
>> On 07/18/17 15:33, Thomas Garnier wrote:
>>> The x86 relocation tool generates a list of 32-bit signed integers. There
>>> was no need to use 64-bit integers because all addresses where above the 2G
>>> top of the memory.
>>>
>>> This change add a large-reloc option to generate 64-bit unsigned integers.
>>> It can be used when the kernel plan to go below the top 2G and 32-bit
>>> integers are not enough.
>>
>> Why on Earth? This would only be necessary if the *kernel itself* was
>> more than 2G, which isn't going to happen for the forseeable future.
>
> Because the relocation integer is an absolute address, not an offset
> in the binary. Next iteration, I can try using a 32-bit offset for
> everyone.
It is an absolute address *as the kernel was originally linked*, for
obvious reasons.
-hpa
On 07/19/17 11:26, Thomas Garnier wrote:
> On Tue, Jul 18, 2017 at 8:08 PM, Brian Gerst <[email protected]> wrote:
>> On Tue, Jul 18, 2017 at 6:33 PM, Thomas Garnier <[email protected]> wrote:
>>> Perpcu uses a clever design where the .percu ELF section has a virtual
>>> address of zero and the relocation code avoid relocating specific
>>> symbols. It makes the code simple and easily adaptable with or without
>>> SMP support.
>>>
>>> This design is incompatible with PIE because generated code always try to
>>> access the zero virtual address relative to the default mapping address.
>>> It becomes impossible when KASLR is configured to go below -2G. This
>>> patch solves this problem by removing the zero mapping and adapting the GS
>>> base to be relative to the expected address. These changes are done only
>>> when PIE is enabled. The original implementation is kept as-is
>>> by default.
>>
>> The reason the per-cpu section is zero-based on x86-64 is to
>> workaround GCC hardcoding the stack protector canary at %gs:40. So
>> this patch is incompatible with CONFIG_STACK_PROTECTOR.
>
> Ok, that make sense. I don't want this feature to not work with
> CONFIG_CC_STACKPROTECTOR*. One way to fix that would be adding a GDT
> entry for gs so gs:40 points to the correct memory address and
> gs:[rip+XX] works correctly through the MSR.
What are you talking about? A GDT entry and the MSR do the same thing,
except that a GDT entry is limited to an offset of 0-0xffffffff (which
doesn't work for us, obviously.)
> Given the separate
> discussion on mcmodel, I am going first to check if we can move from
> PIE to PIC with a mcmodel=small or medium that would remove the percpu
> change requirement. I tried before without success but I understand
> better percpu and other components so maybe I can make it work.
>> This is silly. The right thing is for PIE is to be explicitly absolute,
>> without (%rip). The use of (%rip) memory references for percpu is just
>> an optimization.
>
> I agree that it is odd but that's how the compiler generates code. I
> will re-explore PIC options with mcmodel=small or medium, as mentioned
> on other threads.
Why should the way the compiler generates code affect the way we do
things in assembly?
That being said, the compiler now has support for generating this kind
of code explicitly via the __seg_gs pointer modifier. That should let
us drop the __percpu_prefix and just use variables directly. I suspect
we want to declare percpu variables as "volatile __seg_gs" to account
for the possibility of CPU switches.
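(A minimal sketch of that, assuming a compiler that defines __SEG_GS;
the helper name is made up:)

#ifdef __SEG_GS
/* Dereferencing a __seg_gs-qualified pointer emits a %gs-prefixed access
 * with the pointer value as the displacement, with no hand-written asm
 * or prefix macro needed. */
static inline unsigned long gs_read_ulong(unsigned long offset)
{
        return *(volatile unsigned long __seg_gs *)offset;
}
#endif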
Older compilers won't be able to work with this, of course, but I think
that it is acceptable for those older compilers to not be able to
support PIE.
-hpa
<[email protected]>,"Paul E . McKenney" <[email protected]>,Andrew Morton <[email protected]>,Christopher Li <[email protected]>,Dou Liyang <[email protected]>,Masahiro Yamada <[email protected]>,Daniel Borkmann <[email protected]>,Markus Trippelsdorf <[email protected]>,Peter Foley <[email protected]>,Steven Rostedt <[email protected]>,Tim Chen <[email protected]>,Ard Biesheuvel <[email protected]>,Catalin Marinas <[email protected]>,Matthew Wilcox <[email protected]>,Michal Hocko <[email protected]>,Rob Landley <[email protected]>,Jiri Kosina <[email protected]>,"H . J . Lu" <[email protected]>,Paul Bolle <[email protected]>,Baoquan He <[email protected]>,Daniel Micay <[email protected]>,the arch/x86 maintai
ners <[email protected]>,[email protected],LKML <[email protected]>,[email protected],kvm list <[email protected]>,Linux PM list
<[email protected]>,linux-arch <[email protected]>,[email protected],Kernel Hardening <[email protected]>
From: [email protected]
Message-ID: <[email protected]>
On July 19, 2017 4:25:56 PM PDT, Thomas Garnier <[email protected]> wrote:
>On Wed, Jul 19, 2017 at 4:08 PM, H. Peter Anvin <[email protected]> wrote:
>> On 07/19/17 15:47, Thomas Garnier wrote:
>>> On Wed, Jul 19, 2017 at 3:33 PM, H. Peter Anvin <[email protected]>
>wrote:
>>>> On 07/18/17 15:33, Thomas Garnier wrote:
>>>>> The x86 relocation tool generates a list of 32-bit signed
>integers. There
>>>>> was no need to use 64-bit integers because all addresses where
>above the 2G
>>>>> top of the memory.
>>>>>
>>>>> This change add a large-reloc option to generate 64-bit unsigned
>integers.
>>>>> It can be used when the kernel plan to go below the top 2G and
>32-bit
>>>>> integers are not enough.
>>>>
>>>> Why on Earth? This would only be necessary if the *kernel itself*
>was
>>>> more than 2G, which isn't going to happen for the forseeable
>future.
>>>
>>> Because the relocation integer is an absolute address, not an offset
>>> in the binary. Next iteration, I can try using a 32-bit offset for
>>> everyone.
>>
>> It is an absolute address *as the kernel was originally linked*, for
>> obvious reasons.
>
>Sure when the kernel was just above 0xffffffff80000000, it doesn't
>work when it goes down to 0xffffffff00000000. That's why using an
>offset might make more sense in general.
>
>>
>> -hpa
>>
What is the motivation for changing the pre-linked address at all?
From: [email protected]
On July 19, 2017 3:58:07 PM PDT, Ard Biesheuvel <[email protected]> wrote:
>On 19 July 2017 at 23:27, H. Peter Anvin <[email protected]> wrote:
>> On 07/19/17 08:40, Thomas Garnier wrote:
>>>>
>>>> This doesn't look right. It's accessing a per-cpu variable. The
>>>> per-cpu section is an absolute, zero-based section and not subject
>to
>>>> relocation.
>>>
>>> PIE does not respect the zero-based section, it tries to have
>>> everything relative. Patch 16/22 also adapt per-cpu to work with PIE
>>> (while keeping the zero absolute design by default).
>>>
>>
>> This is silly. The right thing is for PIE is to be explicitly
>absolute,
>> without (%rip). The use of (%rip) memory references for percpu is
>just
>> an optimization.
>>
>
>Sadly, there is an issue in binutils that may prevent us from doing
>this as cleanly as we would want.
>
>For historical reasons, bfd.ld emits special symbols like
>__GLOBAL_OFFSET_TABLE__ as absolute symbols with a section index of
>SHN_ABS, even though it is quite obvious that they are relative like
>any other symbol that points into the image. Unfortunately, this means
>that binutils needs to emit R_X86_64_RELATIVE relocations even for
>SHN_ABS symbols, which means we lose the ability to use both absolute
>and relocatable symbols in the same PIE image (unless the reloc tool
>can filter them out)
>
>More info here:
>https://sourceware.org/bugzilla/show_bug.cgi?id=19818
The reloc tool already has the ability to filter symbols.
On 07/19/17 16:33, H. Peter Anvin wrote:
>>
>> I agree that it is odd but that's how the compiler generates code. I
>> will re-explore PIC options with mcmodel=small or medium, as mentioned
>> on other threads.
>
> Why should the way compiler generates code affect the way we do things
> in assembly?
>
> That being said, the compiler now has support for generating this kind
> of code explicitly via the __seg_gs pointer modifier. That should let
> us drop the __percpu_prefix and just use variables directly. I suspect
> we want to declare percpu variables as "volatile __seg_gs" to account
> for the possibility of CPU switches.
>
> Older compilers won't be able to work with this, of course, but I think
> that it is acceptable for those older compilers to not be able to
> support PIE.
>
Grump. It turns out that the compiler doesn't do the right thing for
symbols marked with the __seg_[fg]s markers. __thread does the right
thing, but __thread a) has %fs: hard-coded, still, and b) I believe can
still cache %seg:0 arbitrarily long.
-hpa
On 07/19/17 19:21, H. Peter Anvin wrote:
> On 07/19/17 16:33, H. Peter Anvin wrote:
>>>
>>> I agree that it is odd but that's how the compiler generates code. I
>>> will re-explore PIC options with mcmodel=small or medium, as mentioned
>>> on other threads.
>>
>> Why should the way compiler generates code affect the way we do things
>> in assembly?
>>
>> That being said, the compiler now has support for generating this kind
>> of code explicitly via the __seg_gs pointer modifier. That should let
>> us drop the __percpu_prefix and just use variables directly. I suspect
>> we want to declare percpu variables as "volatile __seg_gs" to account
>> for the possibility of CPU switches.
>>
>> Older compilers won't be able to work with this, of course, but I think
>> that it is acceptable for those older compilers to not be able to
>> support PIE.
>>
>
> Grump. It turns out that the compiler doesn't do the right thing for
> symbols marked with the __seg_[fg]s markers. __thread does the right
> thing, but __thread a) has %fs: hard-coded, still, and b) I believe can
> still cache %seg:0 arbitrarily long.
I filed this bug report for gcc:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81490
It might still be possible to work around this by playing really ugly
games with __thread, but I haven't yet figured out how best to do that.
-hpa
On Wed, Jul 19, 2017 at 4:33 PM, H. Peter Anvin <[email protected]> wrote:
> On 07/19/17 11:26, Thomas Garnier wrote:
>> On Tue, Jul 18, 2017 at 8:08 PM, Brian Gerst <[email protected]> wrote:
>>> On Tue, Jul 18, 2017 at 6:33 PM, Thomas Garnier <[email protected]> wrote:
>>>> Perpcu uses a clever design where the .percu ELF section has a virtual
>>>> address of zero and the relocation code avoid relocating specific
>>>> symbols. It makes the code simple and easily adaptable with or without
>>>> SMP support.
>>>>
>>>> This design is incompatible with PIE because generated code always try to
>>>> access the zero virtual address relative to the default mapping address.
>>>> It becomes impossible when KASLR is configured to go below -2G. This
>>>> patch solves this problem by removing the zero mapping and adapting the GS
>>>> base to be relative to the expected address. These changes are done only
>>>> when PIE is enabled. The original implementation is kept as-is
>>>> by default.
>>>
>>> The reason the per-cpu section is zero-based on x86-64 is to
>>> workaround GCC hardcoding the stack protector canary at %gs:40. So
>>> this patch is incompatible with CONFIG_STACK_PROTECTOR.
>>
>> Ok, that make sense. I don't want this feature to not work with
>> CONFIG_CC_STACKPROTECTOR*. One way to fix that would be adding a GDT
>> entry for gs so gs:40 points to the correct memory address and
>> gs:[rip+XX] works correctly through the MSR.
>
> What are you talking about? A GDT entry and the MSR do the same thing,
> except that a GDT entry is limited to an offset of 0-0xffffffff (which
> doesn't work for us, obviously.)
>
A GDT entry would allow gs:0x40 to be valid while all gs:[rip+XX]
addresses use the MSR.
I haven't tested it, but that approach was used in the RFG mitigation [1].
The fs segment register was used there for both thread storage and the
shadow stack.
[1] http://xlab.tencent.com/en/2016/11/02/return-flow-guard/
>> Given the separate
>> discussion on mcmodel, I am going first to check if we can move from
>> PIE to PIC with a mcmodel=small or medium that would remove the percpu
>> change requirement. I tried before without success but I understand
>> better percpu and other components so maybe I can make it work.
>
>>> This is silly. The right thing is for PIE is to be explicitly absolute,
>>> without (%rip). The use of (%rip) memory references for percpu is just
>>> an optimization.
>>
>> I agree that it is odd but that's how the compiler generates code. I
>> will re-explore PIC options with mcmodel=small or medium, as mentioned
>> on other threads.
>
> Why should the way compiler generates code affect the way we do things
> in assembly?
>
> That being said, the compiler now has support for generating this kind
> of code explicitly via the __seg_gs pointer modifier. That should let
> us drop the __percpu_prefix and just use variables directly. I suspect
> we want to declare percpu variables as "volatile __seg_gs" to account
> for the possibility of CPU switches.
>
> Older compilers won't be able to work with this, of course, but I think
> that it is acceptable for those older compilers to not be able to
> support PIE.
>
> -hpa
>
--
Thomas
On Wed, Jul 19, 2017 at 10:34 AM, Brian Gerst <[email protected]> wrote:
> On Wed, Jul 19, 2017 at 11:58 AM, Thomas Garnier <[email protected]> wrote:
>> On Tue, Jul 18, 2017 at 8:59 PM, Brian Gerst <[email protected]> wrote:
>>> On Tue, Jul 18, 2017 at 9:35 PM, H. Peter Anvin <[email protected]> wrote:
>>>> On 07/18/17 15:33, Thomas Garnier wrote:
>>>>> With PIE support and KASLR extended range, the modules may be further
>>>>> away from the kernel than before breaking mcmodel=kernel expectations.
>>>>>
>>>>> Add an option to build modules with mcmodel=large. The modules generated
>>>>> code will make no assumptions on placement in memory.
>>>>>
>>>>> Despite this option, modules still expect kernel functions to be within
>>>>> 2G and generate relative calls. To solve this issue, the PLT arm64 code
>>>>> was adapted for x86_64. When a relative relocation go outside its range,
>>>>> a dynamic PLT entry is used to correctly jump to the destination.
>>>>
>>>> Why large as opposed to medium or medium-PIC?
>>>
>>> Or for that matter, why not small-PIC? We aren't changing the size of
>>> the kernel to be larger than 2G text or data. Small-PIC would still
>>> allow it to be placed anywhere in the address space, and would
>>> generate far better code.
>>
>> My understanding was that small=PIC and medium=PIC assume that the
>> module code is in the lower 2G of memory. I will do additional testing
>> on the modules to confirm that.
>
> That is only for small/medium absolute (non-PIC) code. Think about
> userspace shared libraries. They are not limited to being mapped in
> the lower 2G of the address space.
I built lkdtm with mcmodel=kernel, small, medium and large, and compared
the same instruction and its relocation in lkdtm
(lkdtm_register_cpoint).
On mcmodel=kernel:
1b8: 48 c7 c7 00 00 00 00 mov $0x0,%rdi
1bb: R_X86_64_32S .rodata.str1.8+0x50
On mcmodel=small and mcmodel=medium:
1b8: bf 00 00 00 00 mov $0x0,%edi
1b9: R_X86_64_32 .rodata.str1.8+0x50
On mcmodel=large:
235: 48 bf 00 00 00 00 00 movabs $0x0,%rdi
23c: 00 00 00
237: R_X86_64_64 .rodata.str1.8+0x50
The kernel mcmodel sign-extends the address. It assumes you are in the
top 2G of the address space, so the relocated pointer 0x8XXXXXXX becomes
0xFFFFFFFF8XXXXXXX.
The small and medium mcmodels assume the pointer is within the lower
part of the address space. The generated pointer has its high 32 bits
set to zero, so you can only map the module between 0 and 0xFFFFFFFF.
The large mcmodel can handle a full 64-bit pointer.
That's why I use the large mcmodel on modules. I cannot use PIE due to
how the modules are linked.
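(A hedged numeric illustration of the three cases above; the 32-bit
value is arbitrary:)

#include <stdint.h>
#include <stdio.h>

int main(void)
{
        uint32_t reloc = 0x8abcdef0;  /* 32-bit value with the top bit set */

        uint64_t kern  = (uint64_t)(int64_t)(int32_t)reloc;  /* R_X86_64_32S: sign-extend */
        uint64_t small = (uint64_t)reloc;                    /* R_X86_64_32: zero-extend */
        uint64_t large = 0xffffffff8abcdef0ULL;              /* R_X86_64_64: full 64 bits */

        printf("kernel: %#llx\nsmall/medium: %#llx\nlarge: %#llx\n",
               (unsigned long long)kern,
               (unsigned long long)small,
               (unsigned long long)large);
        return 0;
}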
>
> --
> Brian Gerst
--
Thomas
On Thu, Jul 20, 2017 at 7:26 AM, Thomas Garnier <[email protected]> wrote:
> On Wed, Jul 19, 2017 at 4:33 PM, H. Peter Anvin <[email protected]> wrote:
>> On 07/19/17 11:26, Thomas Garnier wrote:
>>> On Tue, Jul 18, 2017 at 8:08 PM, Brian Gerst <[email protected]> wrote:
>>>> On Tue, Jul 18, 2017 at 6:33 PM, Thomas Garnier <[email protected]> wrote:
>>>>> Perpcu uses a clever design where the .percu ELF section has a virtual
>>>>> address of zero and the relocation code avoid relocating specific
>>>>> symbols. It makes the code simple and easily adaptable with or without
>>>>> SMP support.
>>>>>
>>>>> This design is incompatible with PIE because generated code always try to
>>>>> access the zero virtual address relative to the default mapping address.
>>>>> It becomes impossible when KASLR is configured to go below -2G. This
>>>>> patch solves this problem by removing the zero mapping and adapting the GS
>>>>> base to be relative to the expected address. These changes are done only
>>>>> when PIE is enabled. The original implementation is kept as-is
>>>>> by default.
>>>>
>>>> The reason the per-cpu section is zero-based on x86-64 is to
>>>> workaround GCC hardcoding the stack protector canary at %gs:40. So
>>>> this patch is incompatible with CONFIG_STACK_PROTECTOR.
>>>
>>> Ok, that make sense. I don't want this feature to not work with
>>> CONFIG_CC_STACKPROTECTOR*. One way to fix that would be adding a GDT
>>> entry for gs so gs:40 points to the correct memory address and
>>> gs:[rip+XX] works correctly through the MSR.
>>
>> What are you talking about? A GDT entry and the MSR do the same thing,
>> except that a GDT entry is limited to an offset of 0-0xffffffff (which
>> doesn't work for us, obviously.)
>>
>
> A GDT entry would allow gs:0x40 to be valid while all gs:[rip+XX]
> addresses uses the MSR.
>
> I didn't tested it but that was used on the RFG mitigation [1]. The fs
> segment register was used for both thread storage and shadow stack.
>
> [1] http://xlab.tencent.com/en/2016/11/02/return-flow-guard/
>
Small update on that.
I noticed that we not only have the problem of gs:0x40 not being
accessible; the compiler will also default to the fs register if
mcmodel=kernel is not set.
On the next patch set, I am going to add support for
-mstack-protector-guard=global so a global variable can be used
instead of the segment register, a similar approach to ARM/ARM64.
Following this patch, I will work with gcc and llvm to add
-mstack-protector-reg=<segment register> support, similar to PowerPC.
This way we can have gs used even without mcmodel=kernel. Once that's
an option, I can set up the GDT as described in the previous email
(similar to RFG).
Let me know what you think about this approach.
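(Roughly, what a global guard amounts to; a hedged sketch where
__stack_chk_guard is the conventional symbol name and the rest is
illustrative:)

extern unsigned long __stack_chk_guard;  /* single shared canary value */

void demo(void)
{
        unsigned long canary = __stack_chk_guard;  /* prologue: copy the guard */
        char buf[64];

        /* ... function body using buf ... */
        (void)buf;

        if (canary != __stack_chk_guard)           /* epilogue: recheck */
                __builtin_trap();                  /* stand-in for __stack_chk_fail() */
}

The trade-off, as pointed out later in the thread, is that every task
then shares a single canary value instead of the per-task %gs:40 canary.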
>>> Given the separate
>>> discussion on mcmodel, I am going first to check if we can move from
>>> PIE to PIC with a mcmodel=small or medium that would remove the percpu
>>> change requirement. I tried before without success but I understand
>>> better percpu and other components so maybe I can make it work.
>>
>>>> This is silly. The right thing is for PIE is to be explicitly absolute,
>>>> without (%rip). The use of (%rip) memory references for percpu is just
>>>> an optimization.
>>>
>>> I agree that it is odd but that's how the compiler generates code. I
>>> will re-explore PIC options with mcmodel=small or medium, as mentioned
>>> on other threads.
>>
>> Why should the way compiler generates code affect the way we do things
>> in assembly?
>>
>> That being said, the compiler now has support for generating this kind
>> of code explicitly via the __seg_gs pointer modifier. That should let
>> us drop the __percpu_prefix and just use variables directly. I suspect
>> we want to declare percpu variables as "volatile __seg_gs" to account
>> for the possibility of CPU switches.
>>
>> Older compilers won't be able to work with this, of course, but I think
>> that it is acceptable for those older compilers to not be able to
>> support PIE.
>>
>> -hpa
>>
>
>
>
> --
> Thomas
--
Thomas
On Wed, Aug 2, 2017 at 9:42 AM, Thomas Garnier <[email protected]> wrote:
> I noticed that not only we have the problem of gs:0x40 not being
> accessible. The compiler will default to the fs register if
> mcmodel=kernel is not set.
>
> On the next patch set, I am going to add support for
> -mstack-protector-guard=global so a global variable can be used
> instead of the segment register. Similar approach than ARM/ARM64.
While this is probably understood, I have to point out that this would
be a major regression for the stack protection on x86.
> Following this patch, I will work with gcc and llvm to add
> -mstack-protector-reg=<segment register> support similar to PowerPC.
> This way we can have gs used even without mcmodel=kernel. Once that's
> an option, I can setup the GDT as described in the previous email
> (similar to RFG).
It would be much nicer if we could teach gcc about the percpu area
instead. This would let us solve the global stack protector problem on
the other architectures:
http://www.openwall.com/lists/kernel-hardening/2017/06/27/6
-Kees
--
Kees Cook
Pixel Security
On Wed, Aug 2, 2017 at 9:56 AM, Kees Cook <[email protected]> wrote:
> On Wed, Aug 2, 2017 at 9:42 AM, Thomas Garnier <[email protected]> wrote:
>> I noticed that not only we have the problem of gs:0x40 not being
>> accessible. The compiler will default to the fs register if
>> mcmodel=kernel is not set.
>>
>> On the next patch set, I am going to add support for
>> -mstack-protector-guard=global so a global variable can be used
>> instead of the segment register. Similar approach than ARM/ARM64.
>
> While this is probably understood, I have to point out that this would
> be a major regression for the stack protection on x86.
I agree, the optimal solution will be to use an updated gcc/clang.
>
>> Following this patch, I will work with gcc and llvm to add
>> -mstack-protector-reg=<segment register> support similar to PowerPC.
>> This way we can have gs used even without mcmodel=kernel. Once that's
>> an option, I can setup the GDT as described in the previous email
>> (similar to RFG).
>
> It would be much nicer if we could teach gcc about the percpu area
> instead. This would let us solve the global stack protector problem on
> the other architectures:
> http://www.openwall.com/lists/kernel-hardening/2017/06/27/6
Yes, while I am looking at gcc I will take a look at the other
architectures to see if I can help there too.
>
> -Kees
>
> --
> Kees Cook
> Pixel Security
--
Thomas