2022-07-08 17:58:50

by Christophe Leroy

Subject: [PATCH v2 0/7] Implement inline static calls on PPC32 - v2

This series applies on top of v3 of the series "objtool: Enable and
implement --mcount option on powerpc" [1], rebased on the powerpc-next
branch.

A few modifications are made to core parts to enable the powerpc
implementation:
- R_X86_64_PC32 is abstracted to R_REL32 so that it can be redefined
as R_PPC_REL32.
- A call to static_call_init() is added to start_kernel() so that not
every architecture has to call it.
- The trampoline address is provided to arch_static_call_transform()
even when patching a call site, so that the site can fall back to a
call to the trampoline when the target is too far away.

[1] https://lore.kernel.org/lkml/[email protected]/T/#rb3a073c54aba563a135fba891e0c34c46e47beef
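
As background, the consumer-side API that becomes patchable inline on
PPC32 with this series looks roughly as follows. This is a minimal
sketch: the macros are the generic ones from
include/linux/static_call.h, but the function and key names are made
up for illustration.

#include <linux/static_call.h>

static int fast_op(int x) { return x * 2; }
static int safe_op(int x) { return x; }

/* Declare a static call whose initial target is fast_op(). */
DEFINE_STATIC_CALL(my_op, fast_op);

int do_op(int x)
{
        /*
         * With HAVE_STATIC_CALL_INLINE this is a direct 'bl fast_op'
         * at the call site, patched in place at runtime.
         */
        return static_call(my_op)(x);
}

void switch_op(void)
{
        /* Re-patches every call site (or the trampoline) to safe_op(). */
        static_call_update(my_op, safe_op);
}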

Christophe Leroy (7):
powerpc: Add missing asm/asm.h for objtool
objtool/powerpc: Activate objtool on PPC32
objtool: Add architecture specific R_REL32 macro
objtool/powerpc: Add necessary support for inline static calls
init: Call static_call_init() from start_kernel()
static_call_inline: Provide trampoline address when updating sites
powerpc/static_call: Implement inline static calls

arch/powerpc/Kconfig | 3 +-
arch/powerpc/include/asm/asm.h | 7 +++
arch/powerpc/include/asm/static_call.h | 2 +
arch/powerpc/kernel/cpu_setup_6xx.S | 26 ++++++---
arch/powerpc/kernel/cpu_setup_fsl_booke.S | 8 ++-
arch/powerpc/kernel/entry_32.S | 8 ++-
arch/powerpc/kernel/head_40x.S | 5 +-
arch/powerpc/kernel/head_8xx.S | 5 +-
arch/powerpc/kernel/head_book3s_32.S | 29 +++++++---
arch/powerpc/kernel/head_fsl_booke.S | 5 +-
arch/powerpc/kernel/static_call.c | 56 ++++++++++++++-----
arch/powerpc/kernel/swsusp_32.S | 5 +-
arch/powerpc/kvm/fpu.S | 17 ++++--
arch/powerpc/platforms/52xx/lite5200_sleep.S | 15 +++--
arch/x86/kernel/static_call.c | 2 +-
init/main.c | 1 +
kernel/static_call_inline.c | 2 +-
tools/objtool/arch/powerpc/decode.c | 16 ++++--
tools/objtool/arch/powerpc/include/arch/elf.h | 1 +
tools/objtool/arch/x86/include/arch/elf.h | 1 +
tools/objtool/check.c | 10 ++--
tools/objtool/orc_gen.c | 2 +-
22 files changed, 162 insertions(+), 64 deletions(-)
create mode 100644 arch/powerpc/include/asm/asm.h

--
2.36.1


2022-07-08 18:00:11

by Christophe Leroy

Subject: [PATCH v2 1/7] powerpc: Add missing asm/asm.h for objtool

Since commit e2ef115813c3 ("objtool: Fix STACK_FRAME_NON_STANDARD
reloc type"), powerpc needs asm/asm.h to enable objtool.

Signed-off-by: Christophe Leroy <[email protected]>
---
arch/powerpc/include/asm/asm.h | 7 +++++++
1 file changed, 7 insertions(+)
create mode 100644 arch/powerpc/include/asm/asm.h

diff --git a/arch/powerpc/include/asm/asm.h b/arch/powerpc/include/asm/asm.h
new file mode 100644
index 000000000000..86f46b604e9a
--- /dev/null
+++ b/arch/powerpc/include/asm/asm.h
@@ -0,0 +1,7 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_POWERPC_ASM_H
+#define _ASM_POWERPC_ASM_H
+
+#define _ASM_PTR " .long "
+
+#endif /* _ASM_POWERPC_ASM_H */
--
2.36.1

2022-07-08 18:03:04

by Christophe Leroy

Subject: [PATCH v2 2/7] objtool/powerpc: Activate objtool on PPC32

Fix several annotations in assembly files, mainly by converting bare
local labels to SYM_FUNC_START_LOCAL/SYM_FUNC_END pairs so that objtool
can identify function boundaries, and enable objtool on PPC32.

Signed-off-by: Christophe Leroy <[email protected]>
---
arch/powerpc/Kconfig | 2 +-
arch/powerpc/kernel/cpu_setup_6xx.S | 26 ++++++++++++------
arch/powerpc/kernel/cpu_setup_fsl_booke.S | 8 ++++--
arch/powerpc/kernel/entry_32.S | 8 ++++--
arch/powerpc/kernel/head_40x.S | 5 +++-
arch/powerpc/kernel/head_8xx.S | 5 +++-
arch/powerpc/kernel/head_book3s_32.S | 29 ++++++++++++++------
arch/powerpc/kernel/head_fsl_booke.S | 5 +++-
arch/powerpc/kernel/swsusp_32.S | 5 +++-
arch/powerpc/kvm/fpu.S | 17 ++++++++----
arch/powerpc/platforms/52xx/lite5200_sleep.S | 15 +++++++---
11 files changed, 90 insertions(+), 35 deletions(-)

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 96263d78aec9..00a43eb26418 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -237,7 +237,7 @@ config PPC
select HAVE_MOD_ARCH_SPECIFIC
select HAVE_NMI if PERF_EVENTS || (PPC64 && PPC_BOOK3S)
select HAVE_OPTPROBES
- select HAVE_OBJTOOL if PPC64
+ select HAVE_OBJTOOL
select HAVE_OBJTOOL_MCOUNT if HAVE_OBJTOOL
select HAVE_PERF_EVENTS
select HAVE_PERF_EVENTS_NMI if PPC64
diff --git a/arch/powerpc/kernel/cpu_setup_6xx.S b/arch/powerpc/kernel/cpu_setup_6xx.S
index f8b5ff64b604..f29ce3dd6140 100644
--- a/arch/powerpc/kernel/cpu_setup_6xx.S
+++ b/arch/powerpc/kernel/cpu_setup_6xx.S
@@ -4,6 +4,8 @@
* Copyright (C) 2003 Benjamin Herrenschmidt ([email protected])
*/

+#include <linux/linkage.h>
+
#include <asm/processor.h>
#include <asm/page.h>
#include <asm/cputable.h>
@@ -81,7 +83,7 @@ _GLOBAL(__setup_cpu_745x)
blr

/* Enable caches for 603's, 604, 750 & 7400 */
-setup_common_caches:
+SYM_FUNC_START_LOCAL(setup_common_caches)
mfspr r11,SPRN_HID0
andi. r0,r11,HID0_DCE
ori r11,r11,HID0_ICE|HID0_DCE
@@ -95,11 +97,12 @@ setup_common_caches:
sync
isync
blr
+SYM_FUNC_END(setup_common_caches)

/* 604, 604e, 604ev, ...
* Enable superscalar execution & branch history table
*/
-setup_604_hid0:
+SYM_FUNC_START_LOCAL(setup_604_hid0)
mfspr r11,SPRN_HID0
ori r11,r11,HID0_SIED|HID0_BHTE
ori r8,r11,HID0_BTCD
@@ -110,6 +113,7 @@ setup_604_hid0:
sync
isync
blr
+SYM_FUNC_END(setup_604_hid0)

/* 7400 <= rev 2.7 and 7410 rev = 1.0 suffer from some
* erratas we work around here.
@@ -125,13 +129,14 @@ setup_604_hid0:
* needed once we have applied workaround #5 (though it's
* not set by Apple's firmware at least).
*/
-setup_7400_workarounds:
+SYM_FUNC_START_LOCAL(setup_7400_workarounds)
mfpvr r3
rlwinm r3,r3,0,20,31
cmpwi 0,r3,0x0207
ble 1f
blr
-setup_7410_workarounds:
+SYM_FUNC_END(setup_7400_workarounds)
+SYM_FUNC_START_LOCAL(setup_7410_workarounds)
mfpvr r3
rlwinm r3,r3,0,20,31
cmpwi 0,r3,0x0100
@@ -151,6 +156,7 @@ setup_7410_workarounds:
sync
isync
blr
+SYM_FUNC_END(setup_7410_workarounds)

/* 740/750/7400/7410
* Enable Store Gathering (SGE), Address Broadcast (ABE),
@@ -158,7 +164,7 @@ setup_7410_workarounds:
* Dynamic Power Management (DPM), Speculative (SPD)
* Clear Instruction cache throttling (ICTC)
*/
-setup_750_7400_hid0:
+SYM_FUNC_START_LOCAL(setup_750_7400_hid0)
mfspr r11,SPRN_HID0
ori r11,r11,HID0_SGE | HID0_ABE | HID0_BHTE | HID0_BTIC
oris r11,r11,HID0_DPM@h
@@ -177,12 +183,13 @@ END_FTR_SECTION_IFSET(CPU_FTR_NO_DPM)
sync
isync
blr
+SYM_FUNC_END(setup_750_7400_hid0)

/* 750cx specific
* Looks like we have to disable NAP feature for some PLL settings...
* (waiting for confirmation)
*/
-setup_750cx:
+SYM_FUNC_START_LOCAL(setup_750cx)
mfspr r10, SPRN_HID1
rlwinm r10,r10,4,28,31
cmpwi cr0,r10,7
@@ -196,11 +203,13 @@ setup_750cx:
andc r6,r6,r7
stw r6,CPU_SPEC_FEATURES(r4)
blr
+SYM_FUNC_END(setup_750cx)

/* 750fx specific
*/
-setup_750fx:
+SYM_FUNC_START_LOCAL(setup_750fx)
blr
+SYM_FUNC_END(setup_750fx)

/* MPC 745x
* Enable Store Gathering (SGE), Branch Folding (FOLD)
@@ -212,7 +221,7 @@ setup_750fx:
* Clear Instruction cache throttling (ICTC)
* Enable L2 HW prefetch
*/
-setup_745x_specifics:
+SYM_FUNC_START_LOCAL(setup_745x_specifics)
/* We check for the presence of an L3 cache setup by
* the firmware. If any, we disable NAP capability as
* it's known to be bogus on rev 2.1 and earlier
@@ -270,6 +279,7 @@ END_FTR_SECTION_IFSET(CPU_FTR_NO_DPM)
sync
isync
blr
+SYM_FUNC_END(setup_745x_specifics)

/*
* Initialize the FPU registers. This is needed to work around an errata
diff --git a/arch/powerpc/kernel/cpu_setup_fsl_booke.S b/arch/powerpc/kernel/cpu_setup_fsl_booke.S
index 4bf33f1b4193..f573a4f3bbe6 100644
--- a/arch/powerpc/kernel/cpu_setup_fsl_booke.S
+++ b/arch/powerpc/kernel/cpu_setup_fsl_booke.S
@@ -8,6 +8,8 @@
* Benjamin Herrenschmidt <[email protected]>
*/

+#include <linux/linkage.h>
+
#include <asm/page.h>
#include <asm/processor.h>
#include <asm/cputable.h>
@@ -274,7 +276,7 @@ _GLOBAL(flush_dcache_L1)

blr

-has_L2_cache:
+SYM_FUNC_START_LOCAL(has_L2_cache)
/* skip L2 cache on P2040/P2040E as they have no L2 cache */
mfspr r3, SPRN_SVR
/* shift right by 8 bits and clear E bit of SVR */
@@ -290,9 +292,10 @@ has_L2_cache:
1:
li r3, 0
blr
+SYM_FUNC_END(has_L2_cache)

/* flush backside L2 cache */
-flush_backside_L2_cache:
+SYM_FUNC_START_LOCAL(flush_backside_L2_cache)
mflr r10
bl has_L2_cache
mtlr r10
@@ -313,6 +316,7 @@ flush_backside_L2_cache:
bne 1b
2:
blr
+SYM_FUNC_END(flush_backside_L2_cache)

_GLOBAL(cpu_down_flush_e500v2)
mflr r0
diff --git a/arch/powerpc/kernel/entry_32.S b/arch/powerpc/kernel/entry_32.S
index 1d599df6f169..f47b682d4667 100644
--- a/arch/powerpc/kernel/entry_32.S
+++ b/arch/powerpc/kernel/entry_32.S
@@ -18,6 +18,8 @@
#include <linux/err.h>
#include <linux/sys.h>
#include <linux/threads.h>
+#include <linux/linkage.h>
+
#include <asm/reg.h>
#include <asm/page.h>
#include <asm/mmu.h>
@@ -74,17 +76,19 @@ _ASM_NOKPROBE_SYMBOL(prepare_transfer_to_handler)
#endif /* CONFIG_PPC_BOOK3S_32 || CONFIG_E500 */

#if defined(CONFIG_PPC_KUEP) && defined(CONFIG_PPC_BOOK3S_32)
- .globl __kuep_lock
-__kuep_lock:
+SYM_FUNC_START(__kuep_lock)
lwz r9, THREAD+THSR0(r2)
update_user_segments_by_4 r9, r10, r11, r12
blr
+SYM_FUNC_END(__kuep_lock)

-__kuep_unlock:
+SYM_FUNC_START_LOCAL(__kuep_unlock)
lwz r9, THREAD+THSR0(r2)
rlwinm r9,r9,0,~SR_NX
update_user_segments_by_4 r9, r10, r11, r12
blr
+SYM_FUNC_END(__kuep_unlock)

.macro kuep_lock
bl __kuep_lock
diff --git a/arch/powerpc/kernel/head_40x.S b/arch/powerpc/kernel/head_40x.S
index 088f500896c7..9110fe9d6747 100644
--- a/arch/powerpc/kernel/head_40x.S
+++ b/arch/powerpc/kernel/head_40x.S
@@ -28,6 +28,8 @@
#include <linux/init.h>
#include <linux/pgtable.h>
#include <linux/sizes.h>
+#include <linux/linkage.h>
+
#include <asm/processor.h>
#include <asm/page.h>
#include <asm/mmu.h>
@@ -662,7 +664,7 @@ start_here:
* kernel initialization. This maps the first 32 MBytes of memory 1:1
* virtual to physical and more importantly sets the cache mode.
*/
-initial_mmu:
+SYM_FUNC_START_LOCAL(initial_mmu)
tlbia /* Invalidate all TLB entries */
isync

@@ -711,6 +713,7 @@ initial_mmu:
mtspr SPRN_EVPR,r0

blr
+SYM_FUNC_END(initial_mmu)

_GLOBAL(abort)
mfspr r13,SPRN_DBCR0
diff --git a/arch/powerpc/kernel/head_8xx.S b/arch/powerpc/kernel/head_8xx.S
index 0b05f2be66b9..c94ed5a08c93 100644
--- a/arch/powerpc/kernel/head_8xx.S
+++ b/arch/powerpc/kernel/head_8xx.S
@@ -18,6 +18,8 @@
#include <linux/magic.h>
#include <linux/pgtable.h>
#include <linux/sizes.h>
+#include <linux/linkage.h>
+
#include <asm/processor.h>
#include <asm/page.h>
#include <asm/mmu.h>
@@ -625,7 +627,7 @@ start_here:
* 24 Mbytes of data, and the 512k IMMR space. Anything not covered by
* these mappings is mapped by page tables.
*/
-initial_mmu:
+SYM_FUNC_START_LOCAL(initial_mmu)
li r8, 0
mtspr SPRN_MI_CTR, r8 /* remove PINNED ITLB entries */
lis r10, MD_TWAM@h
@@ -686,6 +688,7 @@ initial_mmu:
#endif
mtspr SPRN_DER, r8
blr
+SYM_FUNC_END(initial_mmu)

_GLOBAL(mmu_pin_tlb)
lis r9, (1f - PAGE_OFFSET)@h
diff --git a/arch/powerpc/kernel/head_book3s_32.S b/arch/powerpc/kernel/head_book3s_32.S
index 6c739beb938c..c0e0868ba01a 100644
--- a/arch/powerpc/kernel/head_book3s_32.S
+++ b/arch/powerpc/kernel/head_book3s_32.S
@@ -18,6 +18,8 @@

#include <linux/init.h>
#include <linux/pgtable.h>
+#include <linux/linkage.h>
+
#include <asm/reg.h>
#include <asm/page.h>
#include <asm/mmu.h>
@@ -877,7 +879,7 @@ END_MMU_FTR_SECTION_IFCLR(MMU_FTR_HPTE_TABLE)
* Load stuff into the MMU. Intended to be called with
* IR=0 and DR=0.
*/
-early_hash_table:
+SYM_FUNC_START_LOCAL(early_hash_table)
sync /* Force all PTE updates to finish */
isync
tlbia /* Clear all TLB entries */
@@ -888,8 +890,9 @@ early_hash_table:
ori r6, r6, 3 /* 256kB table */
mtspr SPRN_SDR1, r6
blr
+SYM_FUNC_END(early_hash_table)

-load_up_mmu:
+SYM_FUNC_START_LOCAL(load_up_mmu)
sync /* Force all PTE updates to finish */
isync
tlbia /* Clear all TLB entries */
@@ -918,6 +921,7 @@ BEGIN_MMU_FTR_SECTION
LOAD_BAT(7,r3,r4,r5)
END_MMU_FTR_SECTION_IFSET(MMU_FTR_USE_HIGH_BATS)
blr
+SYM_FUNC_END(load_up_mmu)

_GLOBAL(load_segment_registers)
li r0, NUM_USER_SEGMENTS /* load up user segment register values */
@@ -1028,7 +1032,7 @@ END_MMU_FTR_SECTION_IFCLR(MMU_FTR_HPTE_TABLE)
* this makes sure it's done.
* -- Cort
*/
-clear_bats:
+SYM_FUNC_START_LOCAL(clear_bats)
li r10,0

mtspr SPRN_DBAT0U,r10
@@ -1072,6 +1076,7 @@ BEGIN_MMU_FTR_SECTION
mtspr SPRN_IBAT7L,r10
END_MMU_FTR_SECTION_IFSET(MMU_FTR_USE_HIGH_BATS)
blr
+SYM_FUNC_END(clear_bats)

_GLOBAL(update_bats)
lis r4, 1f@h
@@ -1108,15 +1113,16 @@ END_MMU_FTR_SECTION_IFSET(MMU_FTR_USE_HIGH_BATS)
mtspr SPRN_SRR1, r6
rfi

-flush_tlbs:
+SYM_FUNC_START_LOCAL(flush_tlbs)
lis r10, 0x40
1: addic. r10, r10, -0x1000
tlbie r10
bgt 1b
sync
blr
+SYM_FUNC_END(flush_tlbs)

-mmu_off:
+SYM_FUNC_START_LOCAL(mmu_off)
addi r4, r3, __after_mmu_off - _start
mfmsr r3
andi. r0,r3,MSR_DR|MSR_IR /* MMU enabled? */
@@ -1128,9 +1134,10 @@ mmu_off:
mtspr SPRN_SRR1,r3
sync
rfi
+SYM_FUNC_END(mmu_off)

/* We use one BAT to map up to 256M of RAM at _PAGE_OFFSET */
-initial_bats:
+SYM_FUNC_START_LOCAL(initial_bats)
lis r11,PAGE_OFFSET@h
tophys(r8,r11)
#ifdef CONFIG_SMP
@@ -1146,9 +1153,10 @@ initial_bats:
mtspr SPRN_IBAT0U,r11
isync
blr
+SYM_FUNC_END(initial_bats)

#ifdef CONFIG_BOOTX_TEXT
-setup_disp_bat:
+SYM_FUNC_START_LOCAL(setup_disp_bat)
/*
* setup the display bat prepared for us in prom.c
*/
@@ -1164,10 +1172,11 @@ setup_disp_bat:
mtspr SPRN_DBAT3L,r8
mtspr SPRN_DBAT3U,r11
blr
+SYM_FUNC_END(setup_disp_bat)
#endif /* CONFIG_BOOTX_TEXT */

#ifdef CONFIG_PPC_EARLY_DEBUG_CPM
-setup_cpm_bat:
+SYM_FUNC_START_LOCAL(setup_cpm_bat)
lis r8, 0xf000
ori r8, r8, 0x002a
mtspr SPRN_DBAT1L, r8
@@ -1177,10 +1186,11 @@ setup_cpm_bat:
mtspr SPRN_DBAT1U, r11

blr
+SYM_FUNC_END(setup_cpm_bat)
#endif

#ifdef CONFIG_PPC_EARLY_DEBUG_USBGECKO
-setup_usbgecko_bat:
+SYM_FUNC_START_LOCAL(setup_usbgecko_bat)
/* prepare a BAT for early io */
#if defined(CONFIG_GAMECUBE)
lis r8, 0x0c00
@@ -1199,6 +1209,7 @@ setup_usbgecko_bat:
mtspr SPRN_DBAT1L, r8
mtspr SPRN_DBAT1U, r11
blr
+SYM_FUNC_END(setup_usbgecko_bat)
#endif

.data
diff --git a/arch/powerpc/kernel/head_fsl_booke.S b/arch/powerpc/kernel/head_fsl_booke.S
index f0db4f52bc00..744b096857a1 100644
--- a/arch/powerpc/kernel/head_fsl_booke.S
+++ b/arch/powerpc/kernel/head_fsl_booke.S
@@ -29,6 +29,8 @@
#include <linux/init.h>
#include <linux/threads.h>
#include <linux/pgtable.h>
+#include <linux/linkage.h>
+
#include <asm/processor.h>
#include <asm/page.h>
#include <asm/mmu.h>
@@ -885,7 +887,7 @@ KernelSPE:
* Translate the effec addr in r3 to phys addr. The phys addr will be put
* into r3(higher 32bit) and r4(lower 32bit)
*/
-get_phys_addr:
+SYM_FUNC_START_LOCAL(get_phys_addr)
mfmsr r8
mfspr r9,SPRN_PID
rlwinm r9,r9,16,0x3fff0000 /* turn PID into MAS6[SPID] */
@@ -907,6 +909,7 @@ get_phys_addr:
mfspr r3,SPRN_MAS7
#endif
blr
+SYM_FUNC_END(get_phys_addr)

/*
* Global functions
diff --git a/arch/powerpc/kernel/swsusp_32.S b/arch/powerpc/kernel/swsusp_32.S
index e0cbd63007f2..ffb79326483c 100644
--- a/arch/powerpc/kernel/swsusp_32.S
+++ b/arch/powerpc/kernel/swsusp_32.S
@@ -1,5 +1,7 @@
/* SPDX-License-Identifier: GPL-2.0 */
#include <linux/threads.h>
+#include <linux/linkage.h>
+
#include <asm/processor.h>
#include <asm/page.h>
#include <asm/cputable.h>
@@ -400,7 +402,7 @@ _ASM_NOKPROBE_SYMBOL(swsusp_arch_resume)
/* FIXME:This construct is actually not useful since we don't shut
* down the instruction MMU, we could just flip back MSR-DR on.
*/
-turn_on_mmu:
+SYM_FUNC_START_LOCAL(turn_on_mmu)
mflr r4
mtsrr0 r4
mtsrr1 r3
@@ -408,4 +410,5 @@ turn_on_mmu:
isync
rfi
_ASM_NOKPROBE_SYMBOL(turn_on_mmu)
+SYM_FUNC_END(turn_on_mmu)

diff --git a/arch/powerpc/kvm/fpu.S b/arch/powerpc/kvm/fpu.S
index 315c94946bad..b68e7f26a81f 100644
--- a/arch/powerpc/kvm/fpu.S
+++ b/arch/powerpc/kvm/fpu.S
@@ -6,6 +6,8 @@
*/

#include <linux/pgtable.h>
+#include <linux/linkage.h>
+
#include <asm/reg.h>
#include <asm/page.h>
#include <asm/mmu.h>
@@ -110,18 +112,22 @@ FPS_THREE_IN(fsel)
* R8 = (double*)&param3 [load_three]
* LR = instruction call function
*/
-fpd_load_three:
+SYM_FUNC_START_LOCAL(fpd_load_three)
lfd 2,0(r8) /* load param3 */
-fpd_load_two:
+SYM_FUNC_START_LOCAL(fpd_load_two)
lfd 1,0(r7) /* load param2 */
-fpd_load_one:
+SYM_FUNC_START_LOCAL(fpd_load_one)
lfd 0,0(r6) /* load param1 */
-fpd_load_none:
+SYM_FUNC_START_LOCAL(fpd_load_none)
lfd 3,0(r3) /* load up fpscr value */
MTFSF_L(3)
lwz r6, 0(r4) /* load cr */
mtcr r6
blr
+SYM_FUNC_END(fpd_load_none)
+SYM_FUNC_END(fpd_load_one)
+SYM_FUNC_END(fpd_load_two)
+SYM_FUNC_END(fpd_load_three)

/*
* End of double instruction processing
@@ -131,13 +137,14 @@ fpd_load_none:
* R5 = (double*)&result
* LR = caller of instruction call function
*/
-fpd_return:
+SYM_FUNC_START_LOCAL(fpd_return)
mfcr r6
stfd 0,0(r5) /* save result */
mffs 0
stfd 0,0(r3) /* save new fpscr value */
stw r6,0(r4) /* save new cr value */
blr
+SYM_FUNC_END(fpd_return)

/*
* Double operation with no input operand
diff --git a/arch/powerpc/platforms/52xx/lite5200_sleep.S b/arch/powerpc/platforms/52xx/lite5200_sleep.S
index afee8b1515a8..0b12647e7b42 100644
--- a/arch/powerpc/platforms/52xx/lite5200_sleep.S
+++ b/arch/powerpc/platforms/52xx/lite5200_sleep.S
@@ -1,4 +1,6 @@
/* SPDX-License-Identifier: GPL-2.0 */
+#include <linux/linkage.h>
+
#include <asm/reg.h>
#include <asm/ppc_asm.h>
#include <asm/processor.h>
@@ -178,7 +180,8 @@ sram_code:


/* local udelay in sram is needed */
- udelay: /* r11 - tb_ticks_per_usec, r12 - usecs, overwrites r13 */
+SYM_FUNC_START_LOCAL(udelay)
+ /* r11 - tb_ticks_per_usec, r12 - usecs, overwrites r13 */
mullw r12, r12, r11
mftb r13 /* start */
add r12, r13, r12 /* end */
@@ -187,6 +190,7 @@ sram_code:
cmp cr0, r13, r12
blt 1b
blr
+SYM_FUNC_END(udelay)

sram_code_end:

@@ -271,7 +275,7 @@ _ASM_NOKPROBE_SYMBOL(lite5200_wakeup)
SAVE_SR(n+2, addr+2); \
SAVE_SR(n+3, addr+3);

-save_regs:
+SYM_FUNC_START_LOCAL(save_regs)
stw r0, 0(r4)
stw r1, 0x4(r4)
stw r2, 0x8(r4)
@@ -317,6 +321,7 @@ save_regs:
SAVE_SPRN(TBRU, 0x5b)

blr
+SYM_FUNC_END(save_regs)


/* restore registers */
@@ -336,7 +341,7 @@ save_regs:
LOAD_SR(n+2, addr+2); \
LOAD_SR(n+3, addr+3);

-restore_regs:
+SYM_FUNC_START_LOCAL(restore_regs)
lis r4, registers@h
ori r4, r4, registers@l

@@ -393,6 +398,7 @@ restore_regs:

blr
_ASM_NOKPROBE_SYMBOL(restore_regs)
+SYM_FUNC_END(restore_regs)



@@ -403,7 +409,7 @@ _ASM_NOKPROBE_SYMBOL(restore_regs)
* Flush data cache
* Do this by just reading lots of stuff into the cache.
*/
-flush_data_cache:
+SYM_FUNC_START_LOCAL(flush_data_cache)
lis r3,CONFIG_KERNEL_START@h
ori r3,r3,CONFIG_KERNEL_START@l
li r4,NUM_CACHE_LINES
@@ -413,3 +419,4 @@ flush_data_cache:
addi r3,r3,L1_CACHE_BYTES /* Next line, please */
bdnz 1b
blr
+SYM_FUNC_END(flush_data_cache)
--
2.36.1

2022-07-08 18:03:15

by Christophe Leroy

Subject: [PATCH v2 7/7] powerpc/static_call: Implement inline static calls

Implement inline static calls:
- Put a 'bl' to the destination function ('b' if tail call)
- Put a 'nop' when the destination function is NULL ('blr' if tail call)
- Put a 'li r3,0' when the destination is the RET0 function and not
a tail call.

If the destination is too far away (beyond the +/- 32MB direct-branch
limit), go via the trampoline.
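
For reference, the +/- 32MB limit comes from the branch encoding: a
'b'/'bl' carries a signed 26-bit, word-aligned byte offset (a 24-bit LI
field shifted left by 2). A minimal sketch of the range test, mirroring
what is_offset_in_branch_range() checks (illustrative, not the kernel
helper itself):

#include <stdbool.h>

static bool in_direct_branch_range(long offset)
{
        /* Representable offsets: [-0x2000000, 0x1fffffc], word aligned. */
        return offset >= -0x2000000 && offset <= 0x1fffffc &&
               !(offset & 0x3);
}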

Signed-off-by: Christophe Leroy <[email protected]>
---
arch/powerpc/Kconfig | 1 +
arch/powerpc/include/asm/static_call.h | 2 +
arch/powerpc/kernel/static_call.c | 56 +++++++++++++++++++-------
3 files changed, 44 insertions(+), 15 deletions(-)

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 00a43eb26418..cb92887acc3f 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -251,6 +251,7 @@ config PPC
select HAVE_STACKPROTECTOR if PPC32 && $(cc-option,-mstack-protector-guard=tls -mstack-protector-guard-reg=r2)
select HAVE_STACKPROTECTOR if PPC64 && $(cc-option,-mstack-protector-guard=tls -mstack-protector-guard-reg=r13)
select HAVE_STATIC_CALL if PPC32
+ select HAVE_STATIC_CALL_INLINE if PPC32
select HAVE_SYSCALL_TRACEPOINTS
select HAVE_VIRT_CPU_ACCOUNTING
select HUGETLB_PAGE_SIZE_VARIABLE if PPC_BOOK3S_64 && HUGETLB_PAGE
diff --git a/arch/powerpc/include/asm/static_call.h b/arch/powerpc/include/asm/static_call.h
index de1018cc522b..e3d5d3823dac 100644
--- a/arch/powerpc/include/asm/static_call.h
+++ b/arch/powerpc/include/asm/static_call.h
@@ -26,4 +26,6 @@
#define ARCH_DEFINE_STATIC_CALL_NULL_TRAMP(name) __PPC_SCT(name, "blr")
#define ARCH_DEFINE_STATIC_CALL_RET0_TRAMP(name) __PPC_SCT(name, "b .+20")

+#define CALL_INSN_SIZE 4
+
#endif /* _ASM_POWERPC_STATIC_CALL_H */
diff --git a/arch/powerpc/kernel/static_call.c b/arch/powerpc/kernel/static_call.c
index 863a7aa24650..0093b471186d 100644
--- a/arch/powerpc/kernel/static_call.c
+++ b/arch/powerpc/kernel/static_call.c
@@ -8,26 +8,52 @@ void arch_static_call_transform(void *site, void *tramp, void *func, bool tail)
{
int err;
bool is_ret0 = (func == __static_call_return0);
- unsigned long target = (unsigned long)(is_ret0 ? tramp + PPC_SCT_RET0 : func);
- bool is_short = is_offset_in_branch_range((long)target - (long)tramp);
-
- if (!tramp)
- return;
+ unsigned long _tramp = (unsigned long)tramp;
+ unsigned long _func = (unsigned long)func;
+ unsigned long _ret0 = _tramp + PPC_SCT_RET0;
+ bool is_short = is_offset_in_branch_range((long)func - (long)(site ? : tramp));

mutex_lock(&text_mutex);

- if (func && !is_short) {
- err = patch_instruction(tramp + PPC_SCT_DATA, ppc_inst(target));
- if (err)
- goto out;
+ if (site && !tail) {
+ if (!func)
+ err = patch_instruction(site, ppc_inst(PPC_RAW_NOP()));
+ else if (is_ret0)
+ err = patch_instruction(site, ppc_inst(PPC_RAW_LI(_R3, 0)));
+ else if (is_short)
+ err = patch_branch(site, _func, BRANCH_SET_LINK);
+ else if (tramp)
+ err = patch_branch(site, _tramp, BRANCH_SET_LINK);
+ else
+ err = 0;
+ } else if (site) {
+ if (!func)
+ err = patch_instruction(site, ppc_inst(PPC_RAW_BLR()));
+ else if (is_ret0)
+ err = patch_branch(site, _ret0, 0);
+ else if (is_short)
+ err = patch_branch(site, _func, 0);
+ else if (tramp)
+ err = patch_branch(site, _tramp, 0);
+ else
+ err = 0;
+ } else if (tramp) {
+ if (func && !is_short) {
+ err = patch_instruction(tramp + PPC_SCT_DATA, ppc_inst(_func));
+ if (err)
+ goto out;
+ }
+
+ if (!func)
+ err = patch_instruction(tramp, ppc_inst(PPC_RAW_BLR()));
+ else if (is_ret0)
+ err = patch_branch(tramp, _ret0, 0);
+ else if (is_short)
+ err = patch_branch(tramp, _func, 0);
+ else
+ err = patch_instruction(tramp, ppc_inst(PPC_RAW_NOP()));
}

- if (!func)
- err = patch_instruction(tramp, ppc_inst(PPC_RAW_BLR()));
- else if (is_short)
- err = patch_branch(tramp, target, 0);
- else
- err = patch_instruction(tramp, ppc_inst(PPC_RAW_NOP()));
out:
mutex_unlock(&text_mutex);

--
2.36.1

2022-07-08 18:04:54

by Christophe Leroy

Subject: [PATCH v2 3/7] objtool: Add architecture specific R_REL32 macro

In order to allow architectures other than x86 to use 32-bit
PC-relative relocations (S+A-P), define an R_REL32 macro that each
architecture provides, in the same way as is already done for
R_NONE, R_ABS32 and R_ABS64.

For x86 that corresponds to R_X86_64_PC32.
For powerpc it will be R_PPC_REL32/R_PPC64_REL32.
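
For clarity, S+A-P means the patched-in value is the symbol address (S)
plus the relocation addend (A) minus the address of the place being
relocated (P). A minimal sketch with illustrative names:

#include <stdint.h>

/* Resolve an R_REL32-style relocation: S + A - P, truncated to 32 bits. */
static int32_t resolve_rel32(uint64_t sym_addr,   /* S */
                             int64_t addend,      /* A */
                             uint64_t place)      /* P */
{
        return (int32_t)(sym_addr + addend - place);
}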

Signed-off-by: Christophe Leroy <[email protected]>
---
v2: Improved commit message based on feedback from Segher
---
tools/objtool/arch/x86/include/arch/elf.h | 1 +
tools/objtool/check.c | 10 +++++-----
tools/objtool/orc_gen.c | 2 +-
3 files changed, 7 insertions(+), 6 deletions(-)

diff --git a/tools/objtool/arch/x86/include/arch/elf.h b/tools/objtool/arch/x86/include/arch/elf.h
index ac14987cf687..e7d228c686db 100644
--- a/tools/objtool/arch/x86/include/arch/elf.h
+++ b/tools/objtool/arch/x86/include/arch/elf.h
@@ -4,5 +4,6 @@
#define R_NONE R_X86_64_NONE
#define R_ABS64 R_X86_64_64
#define R_ABS32 R_X86_64_32
+#define R_REL32 R_X86_64_PC32

#endif /* _OBJTOOL_ARCH_ELF */
diff --git a/tools/objtool/check.c b/tools/objtool/check.c
index dec42a226048..ba8fd313372c 100644
--- a/tools/objtool/check.c
+++ b/tools/objtool/check.c
@@ -652,7 +652,7 @@ static int create_static_call_sections(struct objtool_file *file)
/* populate reloc for 'addr' */
if (elf_add_reloc_to_insn(file->elf, sec,
idx * sizeof(struct static_call_site),
- R_X86_64_PC32,
+ R_REL32,
insn->sec, insn->offset))
return -1;

@@ -693,7 +693,7 @@ static int create_static_call_sections(struct objtool_file *file)
/* populate reloc for 'key' */
if (elf_add_reloc(file->elf, sec,
idx * sizeof(struct static_call_site) + 4,
- R_X86_64_PC32, key_sym,
+ R_REL32, key_sym,
is_sibling_call(insn) * STATIC_CALL_SITE_TAIL))
return -1;

@@ -737,7 +737,7 @@ static int create_retpoline_sites_sections(struct objtool_file *file)

if (elf_add_reloc_to_insn(file->elf, sec,
idx * sizeof(int),
- R_X86_64_PC32,
+ R_REL32,
insn->sec, insn->offset)) {
WARN("elf_add_reloc_to_insn: .retpoline_sites");
return -1;
@@ -789,7 +789,7 @@ static int create_ibt_endbr_seal_sections(struct objtool_file *file)

if (elf_add_reloc_to_insn(file->elf, sec,
idx * sizeof(int),
- R_X86_64_PC32,
+ R_REL32,
insn->sec, insn->offset)) {
WARN("elf_add_reloc_to_insn: .ibt_endbr_seal");
return -1;
@@ -3718,7 +3718,7 @@ static int validate_ibt_insn(struct objtool_file *file, struct instruction *insn
continue;

off = reloc->sym->offset;
- if (reloc->type == R_X86_64_PC32 || reloc->type == R_X86_64_PLT32)
+ if (reloc->type == R_REL32 || reloc->type == R_X86_64_PLT32)
off += arch_dest_reloc_offset(reloc->addend);
else
off += reloc->addend;
diff --git a/tools/objtool/orc_gen.c b/tools/objtool/orc_gen.c
index 1f22b7ebae58..49a877b9c879 100644
--- a/tools/objtool/orc_gen.c
+++ b/tools/objtool/orc_gen.c
@@ -101,7 +101,7 @@ static int write_orc_entry(struct elf *elf, struct section *orc_sec,
orc->bp_offset = bswap_if_needed(elf, orc->bp_offset);

/* populate reloc for ip */
- if (elf_add_reloc_to_insn(elf, ip_sec, idx * sizeof(int), R_X86_64_PC32,
+ if (elf_add_reloc_to_insn(elf, ip_sec, idx * sizeof(int), R_REL32,
insn_sec, insn_off))
return -1;

--
2.36.1

2022-07-08 18:05:42

by Christophe Leroy

Subject: [PATCH v2 4/7] objtool/powerpc: Add necessary support for inline static calls

In order to support inline static calls for powerpc, objtool needs
the following additions:
- R_REL32 macro
- Support for JUMP instruction used for tail calls

Add support for decoding the branch instruction 'b', which is the
jump instruction used for tail calls, since a static call can be a
tail call.
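
As a reference for the decode logic, here is a self-contained sketch of
the I-form branch fields being parsed (illustrative only, not the
objtool code itself):

#include <stdbool.h>
#include <stdint.h>

/*
 * A PPC I-form branch has primary opcode 18; the two low bits are AA
 * (absolute address) and LK (link): 'bl' has LK=1, 'b' has LK=0. LI is
 * a signed 26-bit, word-aligned byte offset.
 */
static bool decode_i_form_branch(uint32_t insn, int64_t *offset, bool *is_call)
{
        if ((insn >> 26) != 18 || (insn & 0x2))
                return false;           /* not a relative I-form branch */

        *is_call = insn & 0x1;          /* LK bit: bl vs b */
        *offset = insn & 0x3fffffc;
        if (*offset & 0x2000000)        /* sign-extend the LI field */
                *offset -= 0x4000000;
        return true;
}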

Signed-off-by: Christophe Leroy <[email protected]>
---
tools/objtool/arch/powerpc/decode.c | 16 ++++++++++------
tools/objtool/arch/powerpc/include/arch/elf.h | 1 +
2 files changed, 11 insertions(+), 6 deletions(-)

diff --git a/tools/objtool/arch/powerpc/decode.c b/tools/objtool/arch/powerpc/decode.c
index 06fc0206bf8e..ba84869cd134 100644
--- a/tools/objtool/arch/powerpc/decode.c
+++ b/tools/objtool/arch/powerpc/decode.c
@@ -59,13 +59,17 @@ int arch_decode_instruction(struct objtool_file *file, const struct section *sec
opcode = insn >> 26;

switch (opcode) {
- case 18: /* bl */
- if ((insn & 3) == 1) {
+ case 18: /* bl/b */
+ if ((insn & 3) == 1)
*type = INSN_CALL;
- *immediate = insn & 0x3fffffc;
- if (*immediate & 0x2000000)
- *immediate -= 0x4000000;
- }
+ else if ((insn & 3) == 0)
+ *type = INSN_JUMP_UNCONDITIONAL;
+ else
+ break;
+
+ *immediate = insn & 0x3fffffc;
+ if (*immediate & 0x2000000)
+ *immediate -= 0x4000000;
break;
}

diff --git a/tools/objtool/arch/powerpc/include/arch/elf.h b/tools/objtool/arch/powerpc/include/arch/elf.h
index 73f9ae172fe5..befc2e30d38b 100644
--- a/tools/objtool/arch/powerpc/include/arch/elf.h
+++ b/tools/objtool/arch/powerpc/include/arch/elf.h
@@ -6,5 +6,6 @@
#define R_NONE R_PPC_NONE
#define R_ABS64 R_PPC64_ADDR64
#define R_ABS32 R_PPC_ADDR32
+#define R_REL32 R_PPC_REL32 /* R_PPC64_REL32 is identical */

#endif /* _OBJTOOL_ARCH_ELF */
--
2.36.1

2022-07-09 07:02:37

by Ard Biesheuvel

Subject: Re: [PATCH v2 0/7] Implement inline static calls on PPC32 - v2

Hello Christophe,

On Fri, 8 Jul 2022 at 19:32, Christophe Leroy
<[email protected]> wrote:
>
> This series applies on top of v3 of the series "objtool: Enable and
> implement --mcount option on powerpc" [1], rebased on the powerpc-next
> branch.
>
> A few modifications are made to core parts to enable the powerpc
> implementation:
> - R_X86_64_PC32 is abstracted to R_REL32 so that it can be redefined
> as R_PPC_REL32.
> - A call to static_call_init() is added to start_kernel() so that not
> every architecture has to call it.
> - The trampoline address is provided to arch_static_call_transform()
> even when patching a call site, so that the site can fall back to a
> call to the trampoline when the target is too far away.
>
> [1] https://lore.kernel.org/lkml/[email protected]/T/#rb3a073c54aba563a135fba891e0c34c46e47beef
>
> Christophe Leroy (7):
> powerpc: Add missing asm/asm.h for objtool
> objtool/powerpc: Activate objtool on PPC32
> objtool: Add architecture specific R_REL32 macro
> objtool/powerpc: Add necessary support for inline static calls
> init: Call static_call_init() from start_kernel()
> static_call_inline: Provide trampoline address when updating sites
> powerpc/static_call: Implement inline static calls
>

Could you quantify the performance gains of moving from out-of-line,
patched tail-call branch instructions to full-fledged inline static
calls? On x86, the retpoline problem makes this glaringly obvious, but
on other architectures, the complexity of supporting this model may
outweigh the performance advantages.

2022-09-01 17:31:18

by Christophe Leroy

Subject: Re: [PATCH v2 0/7] Implement inline static calls on PPC32 - v2



On 09/07/2022 at 08:52, Ard Biesheuvel wrote:
> Hello Christophe,
>
> On Fri, 8 Jul 2022 at 19:32, Christophe Leroy
> <[email protected]> wrote:
>>
>> This series applies on top of v3 of the series "objtool: Enable and
>> implement --mcount option on powerpc" [1], rebased on the powerpc-next
>> branch.
>>
>> A few modifications are made to core parts to enable the powerpc
>> implementation:
>> - R_X86_64_PC32 is abstracted to R_REL32 so that it can be redefined
>> as R_PPC_REL32.
>> - A call to static_call_init() is added to start_kernel() so that not
>> every architecture has to call it.
>> - The trampoline address is provided to arch_static_call_transform()
>> even when patching a call site, so that the site can fall back to a
>> call to the trampoline when the target is too far away.
>>
>> [1] https://lore.kernel.org/lkml/[email protected]/T/#rb3a073c54aba563a135fba891e0c34c46e47beef
>>
>> Christophe Leroy (7):
>> powerpc: Add missing asm/asm.h for objtool
>> objtool/powerpc: Activate objtool on PPC32
>> objtool: Add architecture specific R_REL32 macro
>> objtool/powerpc: Add necessary support for inline static calls
>> init: Call static_call_init() from start_kernel()
>> static_call_inline: Provide trampoline address when updating sites
>> powerpc/static_call: Implement inline static calls
>>
>
> Could you quantify the performance gains of moving from out-of-line,
> patched tail-call branch instructions to full-fledged inline static
> calls? On x86, the retpoline problem makes this glaringly obvious, but
> on other architectures, the complexity of supporting this model may
> outweigh the performance advantages.

Surprisingly, I get worse performance with inline static calls than
with out-of-line static calls:

No static call:

root@vgoip:~# perf stat -r 10 ./hackbench 1
Running with 1*40 (== 40) tasks.
Time: 17.186
Running with 1*40 (== 40) tasks.
Time: 16.738
Running with 1*40 (== 40) tasks.
Time: 16.579
Running with 1*40 (== 40) tasks.
Time: 16.838
Running with 1*40 (== 40) tasks.
Time: 16.652
Running with 1*40 (== 40) tasks.
Time: 17.380
Running with 1*40 (== 40) tasks.
Time: 16.630
Running with 1*40 (== 40) tasks.
Time: 16.850
Running with 1*40 (== 40) tasks.
Time: 17.161
Running with 1*40 (== 40) tasks.
Time: 16.722

Performance counter stats for './hackbench 1' (10 runs):

          17019.55 msec task-clock        #    0.980 CPUs utilized   ( +- 0.51% )
              4847      context-switches  #  282.280 /sec            ( +- 6.32% )
                 0      cpu-migrations    #    0.000 /sec
              1249      page-faults       #   72.739 /sec            ( +- 0.49% )
        2245344976      cycles            #    0.131 GHz             ( +- 0.51% )
         727437072      instructions      #    0.32  insn per cycle  ( +- 0.40% )
   <not supported>      branches
   <not supported>      branch-misses

           17.3585 +- 0.0909 seconds time elapsed  ( +- 0.52% )


Outline static call:

root@vgoip:~# perf stat -r 10 ./hackbench 1
Running with 1*40 (== 40) tasks.
Time: 15.892
Running with 1*40 (== 40) tasks.
Time: 15.731
Running with 1*40 (== 40) tasks.
Time: 15.507
Running with 1*40 (== 40) tasks.
Time: 16.269
Running with 1*40 (== 40) tasks.
Time: 15.934
Running with 1*40 (== 40) tasks.
Time: 16.048
Running with 1*40 (== 40) tasks.
Time: 15.700
Running with 1*40 (== 40) tasks.
Time: 16.063
Running with 1*40 (== 40) tasks.
Time: 15.852
Running with 1*40 (== 40) tasks.
Time: 15.941

Performance counter stats for './hackbench 1' (10 runs):

          16227.32 msec task-clock        #    0.992 CPUs utilized   ( +- 0.42% )
              3732      context-switches  #  230.525 /sec            ( +- 6.42% )
                 0      cpu-migrations    #    0.000 /sec
              1244      page-faults       #   76.842 /sec            ( +- 0.11% )
        2141094288      cycles            #    0.132 GHz             ( +- 0.42% )
         712598441      instructions      #    0.33  insn per cycle  ( +- 0.29% )
   <not supported>      branches
   <not supported>      branch-misses

           16.3539 +- 0.0675 seconds time elapsed  ( +- 0.41% )


Inline static call:

root@vgoip:~# perf stat -r 10 ./hackbench 1
Running with 1*40 (== 40) tasks.
Time: 17.512
Running with 1*40 (== 40) tasks.
Time: 17.240
Running with 1*40 (== 40) tasks.
Time: 16.901
Running with 1*40 (== 40) tasks.
Time: 17.125
Running with 1*40 (== 40) tasks.
Time: 17.262
Running with 1*40 (== 40) tasks.
Time: 17.298
Running with 1*40 (== 40) tasks.
Time: 17.182
Running with 1*40 (== 40) tasks.
Time: 16.988
Running with 1*40 (== 40) tasks.
Time: 17.102
Running with 1*40 (== 40) tasks.
Time: 16.669

Performance counter stats for './hackbench 1' (10 runs):

          16976.76 msec task-clock        #    0.964 CPUs utilized   ( +- 0.44% )
              4760      context-switches  #  273.007 /sec            ( +- 4.93% )
                 0      cpu-migrations    #    0.000 /sec
              1252      page-faults       #   71.808 /sec            ( +- 0.35% )
        2239986112      cycles            #    0.128 GHz             ( +- 0.44% )
         721540184      instructions      #    0.31  insn per cycle  ( +- 0.31% )
   <not supported>      branches
   <not supported>      branch-misses

           17.6126 +- 0.0762 seconds time elapsed  ( +- 0.43% )


Summary:

No static calls:
17.3585 +- 0.0909 seconds time elapsed ( +- 0.52% )
Out-of-line static calls:
16.3539 +- 0.0675 seconds time elapsed ( +- 0.41% )
Inline static calls:
17.6126 +- 0.0762 seconds time elapsed ( +- 0.43% )

Is there anything wrong with inline static calls?

Christophe

2022-09-08 00:42:56

by Benjamin Gray

Subject: Re: [PATCH v2 0/7] Implement inline static calls on PPC32 - v2

On Thu, 2022-09-01 at 16:46 +0000, Christophe Leroy wrote:
> Surprisingly, I get worse performance with inline static calls than
> with out-of-line static calls:

I'm not sure what hackbench is doing, but when microbenchmarking 64-bit
out-of-line calls in a loop I saw a similar thing where adding more
indirection improved the performance despite doing more work. The cause
seemed to be a combination of using older hardware and the target being
too short (just an integer increment). Moving to a newer machine and
adding a lot of NOPs to the target made the performance make sense.
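
For reference, a sketch of that kind of microbenchmark as a kernel
module init routine. The static call API is the generic one; the loop
count, target body and module boilerplate are illustrative:

#include <linux/init.h>
#include <linux/ktime.h>
#include <linux/module.h>
#include <linux/static_call.h>

static int target(int x)
{
        return x + 1;   /* deliberately tiny, as in the test above */
}

DEFINE_STATIC_CALL(bench_op, target);

static int __init bench_init(void)
{
        ktime_t t0, t1;
        int i, acc = 0;

        t0 = ktime_get();
        for (i = 0; i < 1000000; i++)
                acc = static_call(bench_op)(acc);
        t1 = ktime_get();

        pr_info("1M static calls: %lld ns (acc=%d)\n",
                ktime_to_ns(ktime_sub(t1, t0)), acc);
        return 0;
}
module_init(bench_init);
MODULE_LICENSE("GPL");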



2022-09-08 07:16:40

by Christophe Leroy

Subject: Re: [PATCH v2 0/7] Implement inline static calls on PPC32 - v2



On 08/09/2022 at 02:13, Benjamin Gray wrote:
> On Thu, 2022-09-01 at 16:46 +0000, Christophe Leroy wrote:
>> Surprisingly, I get worse performance with inline static calls than
>> with out-of-line static calls:
>
> I'm not sure what hackbench is doing, but when microbenchmarking 64 bit
> out-of-line calls in a loop I saw a similar thing where adding more
> indirection improved the performance despite doing more work. The cause
> seemed to be a combination of using older hardware and the target being
> too short (just an integer increment). Moving to a newer machine and
> adding a lot of NOPs to the target made the performance make sense.

Yes, that might be it.

I think I'll first run new tests with CONFIG_DEBUG_FORCE_FUNCTION_ALIGN_64B.

Christophe