LinuxLists.cc - [PATCH v2 0/2] LoongArch: Modify handle

2022-06-21 10:11:18

Subject: [PATCH v2 0/2] LoongArch: Modify handle_syscall

v2: update the commit message of patch #2 to fix a typo,
sorry for that.

Tiezhu Yang (2):
LoongArch: Add TI_SYSCALL in output_thread_info_defines()
LoongArch: No need to call RESTORE_ALL_AND_RET for all syscalls

arch/loongarch/include/asm/stackframe.h | 5 +++++
arch/loongarch/kernel/asm-offsets.c | 1 +
arch/loongarch/kernel/entry.S | 15 +++++++++++++++
3 files changed, 21 insertions(+)

--
2.1.0

2022-06-21 10:11:26

by Tiezhu Yang

Subject: [PATCH v2 2/2] LoongArch: No need to call RESTORE_ALL_AND_RET for all syscalls

In handle_syscall, it is unnecessary to call RESTORE_ALL_AND_RET
for every syscall:

(1) rt_sigreturn calls RESTORE_ALL_AND_RET.
(2) All other syscalls call RESTORE_STATIC_SOME_SP_AND_RET.

This patch makes only the minimal changes needed, to keep the
code simple; at the same time, it eliminates many load
instructions.

Here are the test environments:

Hardware: Loongson-LS3A5000-7A1000-1w-A2101
Firmware: UDK2018-LoongArch-A2101-pre-beta8 [1]
System: loongarch64-clfs-system-5.0 [2]

The system passed functional testing with the following test
case, both without and with this patch:

git clone https://github.com/hevz/sigaction-test.git
cd sigaction-test
make check

Additionally, use UnixBench syscall to test the performance:

git clone https://github.com/kdlucas/byte-unixbench.git
cd byte-unixbench/UnixBench/
make
pgms/syscall 600

To keep other running processes from affecting the measurement,
add init=/bin/bash to the boot cmdline.

Here is the test result (higher is better); it shows a gain of
about 1.2%, tested with close, getpid and exec [3]:

duration    without_this_patch    with_this_patch
600 s       626558267 lps         634244079 lps
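
As a quick check of the 1.2% figure, the relative gain can be recomputed from the two loops-per-second values in the table. This is only a sketch; the helper name `gain_percent` is made up for illustration and is not part of UnixBench.

```python
def gain_percent(before, after):
    """Relative improvement of `after` over `before`, in percent."""
    return (after - before) / before * 100.0


# lps values copied from the result table above (600 s run).
without_patch = 626558267
with_patch = 634244079

if __name__ == "__main__":
    print(f"gain: {gain_percent(without_patch, with_patch):.2f}%")
```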

[1] https://github.com/loongson/Firmware/tree/main/5000Series/PC/A2101
[2] https://github.com/sunhaiyong1978/CLFS-for-LoongArch/releases/tag/5.0
[3] https://github.com/kdlucas/byte-unixbench/blob/master/UnixBench/src/syscall.c

Signed-off-by: Tiezhu Yang <[email protected]>
---
arch/loongarch/include/asm/stackframe.h | 5 +++++
arch/loongarch/kernel/entry.S | 15 +++++++++++++++
2 files changed, 20 insertions(+)

diff --git a/arch/loongarch/include/asm/stackframe.h b/arch/loongarch/include/asm/stackframe.h
index 4ca9530..551ab8f 100644
--- a/arch/loongarch/include/asm/stackframe.h
+++ b/arch/loongarch/include/asm/stackframe.h
@@ -216,4 +216,9 @@
RESTORE_SP_AND_RET \docfi
.endm

+ .macro RESTORE_STATIC_SOME_SP_AND_RET docfi=0
+ RESTORE_STATIC \docfi
+ RESTORE_SOME \docfi
+ RESTORE_SP_AND_RET \docfi
+ .endm
#endif /* _ASM_STACKFRAME_H */
diff --git a/arch/loongarch/kernel/entry.S b/arch/loongarch/kernel/entry.S
index d5b3dbc..c764c99 100644
--- a/arch/loongarch/kernel/entry.S
+++ b/arch/loongarch/kernel/entry.S
@@ -14,6 +14,7 @@
#include <asm/regdef.h>
#include <asm/stackframe.h>
#include <asm/thread_info.h>
+#include <asm/unistd.h>

.text
.cfi_sections .debug_frame
@@ -62,9 +63,23 @@ SYM_FUNC_START(handle_syscall)
li.d tp, ~_THREAD_MASK
and tp, tp, sp

+ /* Syscall number held in a7, we can store it in TI_SYSCALL. */
+ LONG_S a7, tp, TI_SYSCALL
+
move a0, sp
bl do_syscall

+ /*
+ * Syscall number held in a7 which is stored in TI_SYSCALL.
+ * rt_sigreturn call RESTORE_ALL_AND_RET.
+ * The other syscalls call RESTORE_STATIC_SOME_SP_AND_RET.
+ */
+ LONG_L t3, tp, TI_SYSCALL
+ li.w t4, __NR_rt_sigreturn
+ beq t3, t4, 1f
+
+ RESTORE_STATIC_SOME_SP_AND_RET
+1:
RESTORE_ALL_AND_RET
SYM_FUNC_END(handle_syscall)

--
2.1.0

2022-06-22 10:15:22

by Huacai Chen

Subject: Re: [PATCH v2 2/2] LoongArch: No need to call RESTORE_ALL_AND_RET for all syscalls

Hi, Tiezhu,

On Tue, Jun 21, 2022 at 6:08 PM Tiezhu Yang <[email protected]> wrote:
>
> In handle_syscall, it is unnecessary to call RESTORE_ALL_AND_RET
> for all syscalls.
>
> (1) rt_sigreturn call RESTORE_ALL_AND_RET.
> (2) The other syscalls call RESTORE_STATIC_SOME_SP_AND_RET.
>
> This patch only adds the minimal changes as simple as possible
> to reduce the code complexity, at the same time, it can reduce
> many load instructions.
>
> Here are the test environments:
>
> Hardware: Loongson-LS3A5000-7A1000-1w-A2101
> Firmware: UDK2018-LoongArch-A2101-pre-beta8 [1]
> System: loongarch64-clfs-system-5.0 [2]
>
> The system passed functional testing used with the following
> test case without and with this patch:
>
> git clone https://github.com/hevz/sigaction-test.git
> cd sigaction-test
> make check
>
> Additionally, use UnixBench syscall to test the performance:
>
> git clone https://github.com/kdlucas/byte-unixbench.git
> cd byte-unixbench/UnixBench/
> make
> pgms/syscall 600
>
> In order to avoid the performance impact, add init=/bin/bash
> to the boot cmdline.
>
> Here is the test result, the bigger the better, it shows about
> 1.2% gain tested with close, getpid and exec [3]:
>
> duration without_this_patch with_this_patch
> 600 s 626558267 lps 634244079 lps
>
> [1] https://github.com/loongson/Firmware/tree/main/5000Series/PC/A2101
> [2] https://github.com/sunhaiyong1978/CLFS-for-LoongArch/releases/tag/5.0
> [3] https://github.com/kdlucas/byte-unixbench/blob/master/UnixBench/src/syscall.c
I tested your patch, and the full UnixBench results are as follows:

Before patch, single thread:

System Benchmarks Index Values          BASELINE       RESULT    INDEX
Dhrystone 2 using register variables    116700.0    9235787.7    791.4
Double-Precision Whetstone                  55.0       2758.7    501.6
Execl Throughput                            43.0       2386.8    555.1
File Copy 1024 bufsize 2000 maxblocks     3960.0     191752.0    484.2
File Copy 256 bufsize 500 maxblocks       1655.0      78737.9    475.8
File Copy 4096 bufsize 8000 maxblocks     5800.0     297402.5    512.8
Pipe Throughput                          12440.0     353658.1    284.3
Pipe-based Context Switching              4000.0     120140.8    300.4
Process Creation                           126.0       5735.0    455.2
Shell Scripts (1 concurrent)                42.4       2701.5    637.1
Shell Scripts (8 concurrent)                 6.0        894.9   1491.5
System Call Overhead                     15000.0     557467.4    371.6
                                                              ========
System Benchmarks Index Score                                    516.1

After patch, single thread:

System Benchmarks Index Values          BASELINE       RESULT    INDEX
Dhrystone 2 using register variables    116700.0    9235688.9    791.4
Double-Precision Whetstone                  55.0       2758.7    501.6
Execl Throughput                            43.0       2377.8    553.0
File Copy 1024 bufsize 2000 maxblocks     3960.0     192545.5    486.2
File Copy 256 bufsize 500 maxblocks       1655.0      79735.0    481.8
File Copy 4096 bufsize 8000 maxblocks     5800.0     299621.9    516.6
Pipe Throughput                          12440.0     354969.1    285.3
Pipe-based Context Switching              4000.0     118307.5    295.8
Process Creation                           126.0       5757.0    456.9
Shell Scripts (1 concurrent)                42.4       2695.4    635.7
Shell Scripts (8 concurrent)                 6.0        894.4   1490.6
System Call Overhead                     15000.0     563582.7    375.7
                                                              ========
System Benchmarks Index Score                                    517.0

Before patch, multi-thread:

System Benchmarks Index Values          BASELINE       RESULT    INDEX
Dhrystone 2 using register variables    116700.0   36943633.4   3165.7
Double-Precision Whetstone                  55.0      11035.8   2006.5
Execl Throughput                            43.0       8800.1   2046.5
File Copy 1024 bufsize 2000 maxblocks     3960.0     277638.3    701.1
File Copy 256 bufsize 500 maxblocks       1655.0      92530.5    559.1
File Copy 4096 bufsize 8000 maxblocks     5800.0     524344.3    904.0
Pipe Throughput                          12440.0    1359237.2   1092.6
Pipe-based Context Switching              4000.0     571511.4   1428.8
Process Creation                           126.0      20823.3   1652.6
Shell Scripts (1 concurrent)                42.4       6883.9   1623.6
Shell Scripts (8 concurrent)                 6.0        981.7   1636.1
System Call Overhead                     15000.0    2029539.8   1353.0
                                                              ========
System Benchmarks Index Score                                   1367.4

After patch, multi-thread:

System Benchmarks Index Values          BASELINE       RESULT    INDEX
Dhrystone 2 using register variables    116700.0   36943793.6   3165.7
Double-Precision Whetstone                  55.0      11035.5   2006.4
Execl Throughput                            43.0       8768.3   2039.1
File Copy 1024 bufsize 2000 maxblocks     3960.0     277962.9    701.9
File Copy 256 bufsize 500 maxblocks       1655.0      92059.7    556.3
File Copy 4096 bufsize 8000 maxblocks     5800.0     525937.5    906.8
Pipe Throughput                          12440.0    1361566.6   1094.5
Pipe-based Context Switching              4000.0     575835.4   1439.6
Process Creation                           126.0      20426.4   1621.1
Shell Scripts (1 concurrent)                42.4       6877.5   1622.0
Shell Scripts (8 concurrent)                 6.0        980.3   1633.8
System Call Overhead                     15000.0    2049771.6   1366.5
                                                              ========
System Benchmarks Index Score                                   1366.6
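
For readers comparing the totals: UnixBench derives the System Benchmarks Index Score as the geometric mean of the per-test INDEX values. The sketch below recomputes it from the INDEX columns of the two single-thread tables above. The function and variable names are illustrative only, and the inputs are the rounded values as printed, so the last digit may differ slightly from the reported scores.

```python
import math


def index_score(indexes):
    """Geometric mean of the per-test INDEX values."""
    return math.exp(sum(math.log(v) for v in indexes) / len(indexes))


# INDEX columns copied from the single-thread tables above, top to bottom.
single_before = [791.4, 501.6, 555.1, 484.2, 475.8, 512.8,
                 284.3, 300.4, 455.2, 637.1, 1491.5, 371.6]
single_after = [791.4, 501.6, 553.0, 486.2, 481.8, 516.6,
                285.3, 295.8, 456.9, 635.7, 1490.6, 375.7]

if __name__ == "__main__":
    print(f"before: {index_score(single_before):.1f}")
    print(f"after:  {index_score(single_after):.1f}")
```

Because every test contributes equally to the mean, a change in a single row such as System Call Overhead moves the total only slightly.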

From my point of view, the benefit is negligible.

Huacai


2022-06-23 01:16:10

by Tiezhu Yang

Subject: Re: [PATCH v2 2/2] LoongArch: No need to call RESTORE_ALL_AND_RET for all syscalls

Cc Jiaxun Yang <[email protected]>

On 06/22/2022 06:01 PM, Huacai Chen wrote:
> From my point of view, the benefit is negligible.

There is another way to look at these results.
Since this patch only touches the syscall path, I prefer to
focus on "System Call Overhead" in the test results.

Here is the INDEX of "System Call Overhead" from your test results:

thread    before_patch    after_patch    gain
single    371.6           375.7          1.103%
multi     1353.0          1366.5         0.998%
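
The two views can be reconciled with a little arithmetic. UnixBench's overall Index Score is the geometric mean of the per-test INDEX values, so a gain confined to one of the twelve tests is diluted roughly twelvefold in the total. A minimal sketch, assuming every other test stays unchanged (the function name is illustrative):

```python
def overall_change_percent(single_test_gain_percent, num_tests=12):
    """Change in a geometric-mean score when only one test improves."""
    factor = 1.0 + single_test_gain_percent / 100.0
    return (factor ** (1.0 / num_tests) - 1.0) * 100.0


if __name__ == "__main__":
    # Gains taken from the "System Call Overhead" rows above.
    for label, gain in (("single", 1.103), ("multi", 0.998)):
        print(f"{label}: {overall_change_percent(gain):.3f}% on the total")
```

With the gains from the table this comes to under 0.1% of the total score, which is smaller than the movement of some unrelated rows, for example Pipe-based Context Switching in the single-thread run (300.4 to 295.8).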

For now, I would like to wait for other people's review.
If the conclusion is that the optimization is meaningless,
I am fine with ignoring this patch.

Thanks,
Tiezhu


2022-06-25 02:12:52

by Tiezhu Yang

Subject: Re: [PATCH v2 2/2] LoongArch: No need to call RESTORE_ALL_AND_RET for all syscalls

Cc [email protected]
Arnd Bergmann <[email protected]>
Guo Ren <[email protected]>

On 06/23/2022 08:43 AM, Tiezhu Yang wrote:
> Cc Jiaxun Yang <[email protected]>
>
> On 06/22/2022 06:01 PM, Huacai Chen wrote:
>> From my point of view, the benefit is negligible.
>
> There is another way to look at what is going on.
> This patch is related with syscall, I prefer to
> observe "System Call Overhead" in the test results.
>
> Here are the INDEX of "System Call Overhead" in your test results:
>
> thread before_patch after_patch gain
> single 371.6 375.7 1.103%
> multi 1353.0 1366.5 0.998%
>
> For now, I would like to wait for other people's review.
> If the conclusion is the optimization is meaningless,
> I am fine with ignoring this patch.

Any comments will be much appreciated.

Here is the link:

https://lore.kernel.org/lkml/[email protected]/

Thanks,
Tiezhu
