2021-02-26 19:58:51

by Josh Don

Subject: [PATCH] sched: Optimize __calc_delta.

From: Clement Courbet <[email protected]>

A significant portion of __calc_delta time is spent in the loop
shifting a u64 by 32 bits. Use a __builtin_clz instead of iterating.

This is ~7x faster on benchmarks.

Signed-off-by: Clement Courbet <[email protected]>
Signed-off-by: Josh Don <[email protected]>
---
kernel/sched/fair.c | 30 ++++++++++++++++++++++--------
1 file changed, 22 insertions(+), 8 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8a8bd7b13634..dbd1ae203f7c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -214,6 +214,16 @@ static void __update_inv_weight(struct load_weight *lw)
lw->inv_weight = WMULT_CONST / w;
}

+/*
+ * A __builtin_clz that handles a u32 value on architectures
+ * where `sizeof(unsigned int) < 32`.
+ */
+#if (__SIZEOF_INT__ < 32)
+# define BUILTIN_CLZ32(v) __builtin_clzl(v)
+#else
+# define BUILTIN_CLZ32(v) __builtin_clz(v)
+#endif
+
/*
* delta_exec * weight / lw.weight
* OR
@@ -229,27 +239,31 @@ static void __update_inv_weight(struct load_weight *lw)
static u64 __calc_delta(u64 delta_exec, unsigned long weight, struct load_weight *lw)
{
u64 fact = scale_load_down(weight);
+ u32 fact_hi = (u32)(fact >> 32);
int shift = WMULT_SHIFT;
+ int fs;

__update_inv_weight(lw);

- if (unlikely(fact >> 32)) {
- while (fact >> 32) {
- fact >>= 1;
- shift--;
- }
+ if (unlikely(fact_hi)) {
+ fs = 32 - BUILTIN_CLZ32(fact_hi);
+ shift -= fs;
+ fact >>= fs;
}

fact = mul_u32_u32(fact, lw->inv_weight);

- while (fact >> 32) {
- fact >>= 1;
- shift--;
+ fact_hi = (u32)(fact >> 32);
+ if (fact_hi) {
+ fs = 32 - BUILTIN_CLZ32(fact_hi);
+ shift -= fs;
+ fact >>= fs;
}

return mul_u64_u32_shr(delta_exec, fact, shift);
}

+#undef BUILTIN_CLZ32

const struct sched_class fair_sched_class;

--
2.30.1.766.gb4fecdf3b7-goog
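
(For context: a minimal user-space sketch, not kernel code, of why the single
clz-derived shift in the patch above matches the original bit-at-a-time loop;
the function names below are made up for illustration.)

```
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Original approach: drop one bit at a time until the upper 32 bits
 * of fact are clear, decrementing the remaining shift as we go. */
static int normalize_loop(uint64_t *fact, int shift)
{
	while (*fact >> 32) {
		*fact >>= 1;
		shift--;
	}
	return shift;
}

/* Patched approach: measure the width of the upper 32 bits once and
 * perform a single shift of that width. */
static int normalize_clz(uint64_t *fact, int shift)
{
	uint32_t hi = (uint32_t)(*fact >> 32);

	if (hi) {
		int fs = 32 - __builtin_clz(hi);

		*fact >>= fs;
		shift -= fs;
	}
	return shift;
}

int main(void)
{
	uint64_t f;

	for (f = 1; f < (1ULL << 48); f = f * 3 + 1) {
		uint64_t a = f, b = f;
		int sa = normalize_loop(&a, 32);
		int sb = normalize_clz(&b, 32);

		assert(a == b && sa == sb);
	}
	printf("loop and clz normalization agree\n");
	return 0;
}
```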


2021-02-26 21:06:56

by Peter Zijlstra

Subject: Re: [PATCH] sched: Optimize __calc_delta.

On Fri, Feb 26, 2021 at 11:52:39AM -0800, Josh Don wrote:
> From: Clement Courbet <[email protected]>
>
> A significant portion of __calc_delta time is spent in the loop
> shifting a u64 by 32 bits. Use a __builtin_clz instead of iterating.
>
> This is ~7x faster on benchmarks.

Have you tried on hardware without such fancy instructions?

2021-02-28 15:07:08

by kernel test robot

Subject: [sched] 4112549ee5: WARNING:at_kernel/rcu/rcutorture.c:#rcu_torture_fwd_prog_nr[rcutorture]


Greeting,

FYI, we noticed the following commit (built with gcc-9):

commit: 4112549ee56c230415ebc5bbfa15533185ceb2e6 ("[PATCH] sched: Optimize __calc_delta.")
url: https://github.com/0day-ci/linux/commits/Josh-Don/sched-Optimize-__calc_delta/20210227-035552
base: https://git.kernel.org/cgit/linux/kernel/git/tip/tip.git c5e6fc08feb2b88dc5dac2f3c817e1c2a4cafda4

in testcase: rcutorture
version:
with following parameters:

runtime: 300s
test: cpuhotplug
torture_type: rcu

test-description: rcutorture is rcutorture kernel module load/unload test.
test-url: https://www.kernel.org/doc/Documentation/RCU/torture.txt


on test machine: qemu-system-x86_64 -enable-kvm -cpu SandyBridge -smp 2 -m 8G

caused below changes (please refer to attached dmesg/kmsg for entire log/backtrace):


+-------------------------------------------------------------------------+------------+------------+
| | c5e6fc08fe | 4112549ee5 |
+-------------------------------------------------------------------------+------------+------------+
| INFO:rcu_sched_self-detected_stall_on_CPU | 0 | 18 |
| RIP:write_comp_data | 0 | 4 |
| RIP:rcu_scale_wait_shutdown[rcuscale] | 0 | 6 |
| RIP:___might_sleep | 0 | 8 |
| RIP:rcu_scale_reader[rcuscale] | 0 | 18 |
| BUG:kernel_hang_in_test_stage | 0 | 20 |
| RIP:to_kthread | 0 | 7 |
| RIP:__sanitizer_cov_trace_pc | 0 | 8 |
| RIP:rcu_all_qs | 0 | 3 |
| RIP:check_kcov_mode | 0 | 11 |
| RIP:test_bit | 0 | 4 |
| WARNING:at_kernel/rcu/rcutorture.c:#rcu_torture_fwd_prog_nr[rcutorture] | 0 | 8 |
| RIP:rcu_torture_fwd_prog_nr[rcutorture] | 0 | 8 |
| WARNING:at_kernel/rcu/rcutorture.c:#rcu_torture_fwd_prog_cr[rcutorture] | 0 | 9 |
| RIP:rcu_torture_fwd_prog_cr[rcutorture] | 0 | 9 |
| RIP:__kasan_check_read | 0 | 6 |
| INFO:rcu_sched_detected_stalls_on_CPUs/tasks | 0 | 5 |
| RIP:torture_must_stop[torture] | 0 | 2 |
| RIP:check_memory_region | 0 | 9 |
| RIP:__sanitizer_cov_trace_const_cmp4 | 0 | 1 |
| RIP:kthread_should_stop | 0 | 1 |
+-------------------------------------------------------------------------+------------+------------+


If you fix the issue, kindly add following tag
Reported-by: kernel test robot <[email protected]>


[ 487.587615] ------------[ cut here ]------------
[ 487.588208] WARNING: CPU: 0 PID: 3527 at kernel/rcu/rcutorture.c:1969 rcu_torture_fwd_prog_nr+0x467/0x533 [rcutorture]
[ 487.589407] Modules linked in: rcutorture torture
[ 487.589953] CPU: 0 PID: 3527 Comm: rcu_torture_fwd Tainted: G W 5.11.0-00053-g4112549ee56c #6
[ 487.591092] RIP: 0010:rcu_torture_fwd_prog_nr+0x467/0x533 [rcutorture]
[ 487.591823] Code: 80 3c 02 00 74 05 e8 13 6d 4c e1 ff 53 30 48 8b 74 24 20 48 89 c7 e8 97 d7 ff ff 4d 85 e4 49 89 c0 75 0a 48 83 f8 01 77 04 90 <0f> 0b 90 48 8b 54 24 40 4c 89 e1 48 c7 c6 00 20 02 a0 48 c7 c7 a0
[ 487.593721] RSP: 0018:ffff88812db27d80 EFLAGS: 00010297
[ 487.594328] RAX: 0000000000000000 RBX: ffffffffa0025820 RCX: ffff88814393a8c0
[ 487.595140] RDX: ffff88814393a8c0 RSI: 000000000001480d RDI: 000000000001480d
[ 487.595942] RBP: ffff88812db27e60 R08: 0000000000000000 R09: 0000000000000000
[ 487.596741] R10: ffffed10297198b1 R11: ffff88814b8cc587 R12: 0000000000000000
[ 487.597488] R13: dffffc0000000000 R14: ffff88814393a8c0 R15: 0000000000000000
[ 487.598286] FS: 0000000000000000(0000) GS:ffff8881e8800000(0000) knlGS:0000000000000000
[ 487.599203] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 487.599860] CR2: 000055b0091b8f68 CR3: 000000014ce44000 CR4: 00000000000406f0
[ 487.600665] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 487.601458] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 487.602258] Call Trace:
[ 487.602560] ? __sanitizer_cov_trace_pc+0x1e/0x43
[ 487.603118] ? tick_nohz_full_enabled+0x8e/0x8e [rcutorture]
[ 487.603777] ? __next_timer_interrupt+0x1e2/0x1e2
[ 487.604314] ? rcu_torture_boost_cb+0x3f/0x3f [rcutorture]
[ 487.604941] ? __do_set_cpus_allowed+0x1e0/0x1e0
[ 487.605476] ? __kasan_check_write+0x14/0x16
[ 487.605965] rcu_torture_fwd_prog+0x106/0x198 [rcutorture]
[ 487.606619] ? rcu_torture_fwd_prog_cr+0x5c4/0x5c4 [rcutorture]
[ 487.607325] ? __kasan_check_read+0x11/0x13
[ 487.607765] ? write_comp_data+0x24/0x6f
[ 487.608204] ? __sanitizer_cov_trace_pc+0x1e/0x43
[ 487.608738] ? __kthread_parkme+0x101/0x16d
[ 487.609226] kthread+0x327/0x33b
[ 487.609576] ? _raw_spin_unlock_irq+0x9/0x13
[ 487.610021] ? rcu_torture_fwd_prog_cr+0x5c4/0x5c4 [rcutorture]
[ 487.610708] ? kthread_queue_delayed_work+0xcd/0xcd
[ 487.611282] ret_from_fork+0x1f/0x30
[ 487.611713] ---[ end trace 2946c7758b8349f5 ]---



To reproduce:

# build kernel
cd linux
cp config-5.11.0-00053-g4112549ee56c .config
make HOSTCC=gcc-9 CC=gcc-9 ARCH=x86_64 olddefconfig prepare modules_prepare bzImage

git clone https://github.com/intel/lkp-tests.git
cd lkp-tests
bin/lkp qemu -k <bzImage> job-script # job-script is attached in this email



Thanks,
Oliver Sang


Attachments:
(No filename) (6.25 kB)
config-5.11.0-00053-g4112549ee56c (118.97 kB)
job-script (4.47 kB)
dmesg.xz (26.92 kB)
rcutorture (5.49 kB)

2021-03-04 07:18:08

by Josh Don

Subject: Re: [PATCH] sched: Optimize __calc_delta.

On Fri, Feb 26, 2021 at 1:03 PM Peter Zijlstra <[email protected]> wrote:
>
> On Fri, Feb 26, 2021 at 11:52:39AM -0800, Josh Don wrote:
> > From: Clement Courbet <[email protected]>
> >
> > A significant portion of __calc_delta time is spent in the loop
> > shifting a u64 by 32 bits. Use a __builtin_clz instead of iterating.
> >
> > This is ~7x faster on benchmarks.
>
> Have you tried on hardware without such fancy instructions?

I wasn't able to find any on hand, unfortunately. Clement did rework the
patch to use fls() instead, and has benchmarks for the generic and asm
variations, all of which are faster than the loop. I'll include the
updated patch inline in my next reply.

2021-03-04 07:19:27

by Josh Don

Subject: Re: [PATCH] sched: Optimize __calc_delta.

From: Clement Courbet <[email protected]>

A significant portion of __calc_delta time is spent in the loop
shifting a u64 by 32 bits. Use `fls` instead of iterating.

This is ~7x faster on benchmarks.

The generic `fls` implementation (`generic_fls`) is still ~4x faster
than the loop.
Architectures that have a better implementation will make use of it. For
example, on X86 we get an additional factor 2 in speed without dedicated
implementation.

On gcc, the asm versions of `fls` are about the same speed as the
builtin. On clang, the versions that use fls (fls,fls64) are more than
twice as slow as the builtin. This is because the way the `fls` function
is written, clang puts the value in memory:
https://godbolt.org/z/EfMbYe. This can be fixed in a separate patch.

```
name cpu/op
BM_Calc<__calc_delta_loop> 9.57ms ±12%
BM_Calc<__calc_delta_generic_fls> 2.36ms ±13%
BM_Calc<__calc_delta_asm_fls> 2.45ms ±13%
BM_Calc<__calc_delta_asm_fls_nomem> 1.66ms ±12%
BM_Calc<__calc_delta_asm_fls64> 2.46ms ±13%
BM_Calc<__calc_delta_asm_fls64_nomem> 1.34ms ±15%
BM_Calc<__calc_delta_builtin> 1.32ms ±11%
```

Signed-off-by: Clement Courbet <[email protected]>
Signed-off-by: Josh Don <[email protected]>
---
kernel/sched/fair.c | 30 ++++++++++++++++++++++--------
kernel/sched/sched.h | 1 +
2 files changed, 23 insertions(+), 8 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8a8bd7b13634..67e5a1d536ad 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -214,6 +214,16 @@ static void __update_inv_weight(struct load_weight *lw)
lw->inv_weight = WMULT_CONST / w;
}

+/*
+ * An fls that handles an u32 value on architectures
+ * where `sizeof(unsigned int) < 32`.
+ */
+#if (__SIZEOF_INT__ >= 32)
+# define FLS_AT_LEAST_32(v) fls(v)
+#else
+# define FLS_AT_LEAST_32(v) fls64(v)
+#endif
+
/*
* delta_exec * weight / lw.weight
* OR
@@ -229,27 +239,31 @@ static void __update_inv_weight(struct load_weight *lw)
static u64 __calc_delta(u64 delta_exec, unsigned long weight, struct
load_weight *lw)
{
u64 fact = scale_load_down(weight);
+ u32 fact_hi = (u32)(fact >> 32);
int shift = WMULT_SHIFT;
+ int fs;

__update_inv_weight(lw);

- if (unlikely(fact >> 32)) {
- while (fact >> 32) {
- fact >>= 1;
- shift--;
- }
+ if (unlikely(fact_hi)) {
+ fs = FLS_AT_LEAST_32(fact_hi);
+ shift -= fs;
+ fact >>= fs;
}

fact = mul_u32_u32(fact, lw->inv_weight);

- while (fact >> 32) {
- fact >>= 1;
- shift--;
+ fact_hi = (u32)(fact >> 32);
+ if (fact_hi) {
+ fs = FLS_AT_LEAST_32(fact_hi);
+ shift -= fs;
+ fact >>= fs;
}

return mul_u64_u32_shr(delta_exec, fact, shift);
}

+#undef FLS_AT_LEAST_32

const struct sched_class fair_sched_class;

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 10a1522b1e30..714af71cf983 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -36,6 +36,7 @@
#include <uapi/linux/sched/types.h>

#include <linux/binfmts.h>
+#include <linux/bitops.h>
#include <linux/blkdev.h>
#include <linux/compat.h>
#include <linux/context_tracking.h>
--
2.30.1.766.gb4fecdf3b7-goog
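
(For context on the generic_fls numbers above: the generic fallback lives in
include/asm-generic/bitops/fls.h and is, roughly, the constant-depth binary
search below -- five test-and-shift steps regardless of the input -- which lines
up with it still being ~4x faster than the data-dependent loop. Sketch only;
the in-tree version carries kernel-doc and __always_inline annotations.)

```
static inline int generic_fls(unsigned int x)
{
	int r = 32;

	if (!x)
		return 0;
	if (!(x & 0xffff0000u)) {
		x <<= 16;
		r -= 16;
	}
	if (!(x & 0xff000000u)) {
		x <<= 8;
		r -= 8;
	}
	if (!(x & 0xf0000000u)) {
		x <<= 4;
		r -= 4;
	}
	if (!(x & 0xc0000000u)) {
		x <<= 2;
		r -= 2;
	}
	if (!(x & 0x80000000u)) {
		x <<= 1;
		r -= 1;
	}
	return r;
}
```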

2021-03-04 09:16:48

by Peter Zijlstra

Subject: Re: [PATCH] sched: Optimize __calc_delta.

On Tue, Mar 02, 2021 at 12:57:37PM -0800, Josh Don wrote:
> On gcc, the asm versions of `fls` are about the same speed as the
> builtin. On clang, the versions that use fls (fls,fls64) are more than
> twice as slow as the builtin. This is because the way the `fls` function
> is written, clang puts the value in memory:
> https://godbolt.org/z/EfMbYe. This can be fixed in a separate patch.

Is this because clang gets the asm constraints wrong? ISTR that
happening before, surely the right thing is to fix clang?
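
(For context: the x86-64 fls() under discussion, in arch/x86/include/asm/bitops.h,
looks roughly like the sketch below; details may differ by tree. The "rm" input
constraint lets the compiler choose between a register and a memory operand for
BSR, and clang currently picks the memory form, which is the store/reload visible
in the godbolt link above.)

```
static __always_inline int fls(unsigned int x)
{
	int r;

	/*
	 * BSR leaves the destination unchanged for a zero input
	 * (documented on AMD64, relied upon elsewhere), so preloading
	 * r with -1 via the "0" constraint makes fls(0) return 0
	 * after the +1 below.
	 */
	asm("bsrl %1,%0"
	    : "=r" (r)
	    : "rm" (x), "0" (-1));
	return r + 1;
}
```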

2021-03-04 09:16:56

by Peter Zijlstra

Subject: Re: [PATCH] sched: Optimize __calc_delta.

On Tue, Mar 02, 2021 at 12:55:07PM -0800, Josh Don wrote:
> On Fri, Feb 26, 2021 at 1:03 PM Peter Zijlstra <[email protected]> wrote:
> >
> > On Fri, Feb 26, 2021 at 11:52:39AM -0800, Josh Don wrote:
> > > From: Clement Courbet <[email protected]>
> > >
> > > A significant portion of __calc_delta time is spent in the loop
> > > shifting a u64 by 32 bits. Use a __builtin_clz instead of iterating.
> > >
> > > This is ~7x faster on benchmarks.
> >
> > Have you tried on hardware without such fancy instructions?
>
> I wasn't able to find any on hand, unfortunately. Clement did rework the
> patch to use fls() instead, and has benchmarks for the generic and asm
> variations, all of which are faster than the loop. I'll include the
> updated patch inline in my next reply.

Excellent; I have some vague memories where using fls ended up slower
for some ARMs, but I can't seem to remember enough to even Google it :/

2021-03-04 13:17:50

by Josh Don

Subject: Re: [PATCH] sched: Optimize __calc_delta.

On Wed, Mar 3, 2021 at 2:02 AM Peter Zijlstra <[email protected]> wrote:
>
> On Tue, Mar 02, 2021 at 12:57:37PM -0800, Josh Don wrote:
> > On gcc, the asm versions of `fls` are about the same speed as the
> > builtin. On clang, the versions that use fls (fls,fls64) are more than
> > twice as slow as the builtin. This is because the way the `fls` function
> > is written, clang puts the value in memory:
> > https://godbolt.org/z/EfMbYe. This can be fixed in a separate patch.
>
> Is this because clang gets the asm constraints wrong? ISTR that
> happening before, surely the right thing is to fix clang?

https://bugs.llvm.org/show_bug.cgi?id=49406 filed

2021-03-04 13:21:54

by Josh Don

Subject: Re: [PATCH] sched: Optimize __calc_delta.

> you made fact_hi u32, why can't we unconditionally use fls() ?

Thanks for clarifying with ILP32; will remove this macro and simplify
to just fls().

2021-03-04 21:48:18

by Peter Zijlstra

Subject: Re: [PATCH] sched: Optimize __calc_delta.

On Tue, Mar 02, 2021 at 12:57:37PM -0800, Josh Don wrote:
> From: Clement Courbet <[email protected]>
>
> A significant portion of __calc_delta time is spent in the loop
> shifting a u64 by 32 bits. Use `fls` instead of iterating.
>
> This is ~7x faster on benchmarks.
>
> The generic `fls` implementation (`generic_fls`) is still ~4x faster
> than the loop.
> Architectures that have a better implementation will make use of it. For
> example, on X86 we get an additional factor 2 in speed without dedicated
> implementation.
>
> On gcc, the asm versions of `fls` are about the same speed as the
> builtin. On clang, the versions that use fls (fls,fls64) are more than
> twice as slow as the builtin. This is because the way the `fls` function
> is written, clang puts the value in memory:
> https://godbolt.org/z/EfMbYe. This can be fixed in a separate patch.
>
> ```
> name cpu/op
> BM_Calc<__calc_delta_loop> 9.57ms ±12%
> BM_Calc<__calc_delta_generic_fls> 2.36ms ±13%
> BM_Calc<__calc_delta_asm_fls> 2.45ms ±13%
> BM_Calc<__calc_delta_asm_fls_nomem> 1.66ms ±12%
> BM_Calc<__calc_delta_asm_fls64> 2.46ms ±13%
> BM_Calc<__calc_delta_asm_fls64_nomem> 1.34ms ±15%
> BM_Calc<__calc_delta_builtin> 1.32ms ±11%
> ```
>
> Signed-off-by: Clement Courbet <[email protected]>
> Signed-off-by: Josh Don <[email protected]>
> ---
> kernel/sched/fair.c | 30 ++++++++++++++++++++++--------
> kernel/sched/sched.h | 1 +
> 2 files changed, 23 insertions(+), 8 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 8a8bd7b13634..67e5a1d536ad 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -214,6 +214,16 @@ static void __update_inv_weight(struct load_weight *lw)
> lw->inv_weight = WMULT_CONST / w;
> }
>
> +/*
> + * An fls that handles an u32 value on architectures
> + * where `sizeof(unsigned int) < 32`.
> + */
> +#if (__SIZEOF_INT__ >= 32)

This should never happen, we use ILP32 or LP64 for Linux.

> +# define FLS_AT_LEAST_32(v) fls(v)
> +#else
> +# define FLS_AT_LEAST_32(v) fls64(v)
> +#endif
> +
> /*
> * delta_exec * weight / lw.weight
> * OR
> @@ -229,27 +239,31 @@ static void __update_inv_weight(struct load_weight *lw)
> static u64 __calc_delta(u64 delta_exec, unsigned long weight, struct
> load_weight *lw)
> {
> u64 fact = scale_load_down(weight);
> + u32 fact_hi = (u32)(fact >> 32);
> int shift = WMULT_SHIFT;
> + int fs;
>
> __update_inv_weight(lw);
>
> - if (unlikely(fact >> 32)) {
> - while (fact >> 32) {
> - fact >>= 1;
> - shift--;
> - }
> + if (unlikely(fact_hi)) {
> + fs = FLS_AT_LEAST_32(fact_hi);

you made fact_hi u32, why can't we unconditionally use fls() ?

> + shift -= fs;
> + fact >>= fs;
> }
>
> fact = mul_u32_u32(fact, lw->inv_weight);
>
> - while (fact >> 32) {
> - fact >>= 1;
> - shift--;
> + fact_hi = (u32)(fact >> 32);
> + if (fact_hi) {
> + fs = FLS_AT_LEAST_32(fact_hi);
> + shift -= fs;
> + fact >>= fs;
> }
>
> return mul_u64_u32_shr(delta_exec, fact, shift);
> }

Horrific whitespace damage..

2021-03-04 23:31:03

by Josh Don

Subject: [PATCH v2] sched: Optimize __calc_delta.

From: Clement Courbet <[email protected]>

A significant portion of __calc_delta time is spent in the loop
shifting a u64 by 32 bits. Use `fls` instead of iterating.

This is ~7x faster on benchmarks.

The generic `fls` implementation (`generic_fls`) is still ~4x faster
than the loop.
Architectures that have a better implementation will make use of it. For
example, on X86 we get an additional factor 2 in speed without dedicated
implementation.

On gcc, the asm versions of `fls` are about the same speed as the
builtin. On clang, the versions that use fls are more than twice as
slow as the builtin. This is because the way the `fls` function is
written, clang puts the value in memory:
https://godbolt.org/z/EfMbYe. This bug is filed at
https://bugs.llvm.org/show_bug.cgi?id=49406.

```
name cpu/op
BM_Calc<__calc_delta_loop> 9.57ms ±12%
BM_Calc<__calc_delta_generic_fls> 2.36ms ±13%
BM_Calc<__calc_delta_asm_fls> 2.45ms ±13%
BM_Calc<__calc_delta_asm_fls_nomem> 1.66ms ±12%
BM_Calc<__calc_delta_asm_fls64> 2.46ms ±13%
BM_Calc<__calc_delta_asm_fls64_nomem> 1.34ms ±15%
BM_Calc<__calc_delta_builtin> 1.32ms ±11%
```

Signed-off-by: Clement Courbet <[email protected]>
Signed-off-by: Josh Don <[email protected]>
---
kernel/sched/fair.c | 19 +++++++++++--------
kernel/sched/sched.h | 1 +
2 files changed, 12 insertions(+), 8 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8a8bd7b13634..a691371960ae 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -229,22 +229,25 @@ static void __update_inv_weight(struct load_weight *lw)
static u64 __calc_delta(u64 delta_exec, unsigned long weight, struct load_weight *lw)
{
u64 fact = scale_load_down(weight);
+ u32 fact_hi = (u32)(fact >> 32);
int shift = WMULT_SHIFT;
+ int fs;

__update_inv_weight(lw);

- if (unlikely(fact >> 32)) {
- while (fact >> 32) {
- fact >>= 1;
- shift--;
- }
+ if (unlikely(fact_hi)) {
+ fs = fls(fact_hi);
+ shift -= fs;
+ fact >>= fs;
}

fact = mul_u32_u32(fact, lw->inv_weight);

- while (fact >> 32) {
- fact >>= 1;
- shift--;
+ fact_hi = (u32)(fact >> 32);
+ if (fact_hi) {
+ fs = fls(fact_hi);
+ shift -= fs;
+ fact >>= fs;
}

return mul_u64_u32_shr(delta_exec, fact, shift);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 10a1522b1e30..714af71cf983 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -36,6 +36,7 @@
#include <uapi/linux/sched/types.h>

#include <linux/binfmts.h>
+#include <linux/bitops.h>
#include <linux/blkdev.h>
#include <linux/compat.h>
#include <linux/context_tracking.h>
--
2.30.1.766.gb4fecdf3b7-goog
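
(A user-space sanity check -- my own sketch rather than part of the patch:
mirror the structure of the patched __calc_delta() with the
WMULT_CONST/WMULT_SHIFT fixed-point scheme from kernel/sched/fair.c, skipping
scale_load_down() and the inv_weight caching/clamping, and compare against a
straight division.)

```
#include <stdint.h>
#include <stdio.h>

#define WMULT_CONST	(~0U)
#define WMULT_SHIFT	32

/* User-space stand-in for fls() on a u32. */
static int fls32(uint32_t x)
{
	return x ? 32 - __builtin_clz(x) : 0;
}

/* Approximates delta_exec * weight / lw_weight, following the patch above,
 * but without scale_load_down() or the cached/clamped inv_weight. */
static uint64_t calc_delta(uint64_t delta_exec, uint64_t weight, uint64_t lw_weight)
{
	uint64_t fact = weight;
	uint32_t inv_weight = WMULT_CONST / lw_weight;
	uint32_t fact_hi = (uint32_t)(fact >> 32);
	int shift = WMULT_SHIFT;
	int fs;

	if (fact_hi) {
		fs = fls32(fact_hi);
		shift -= fs;
		fact >>= fs;
	}

	fact = (uint64_t)(uint32_t)fact * inv_weight;	/* mul_u32_u32() */

	fact_hi = (uint32_t)(fact >> 32);
	if (fact_hi) {
		fs = fls32(fact_hi);
		shift -= fs;
		fact >>= fs;
	}

	/* mul_u64_u32_shr() via 128-bit arithmetic. */
	return (uint64_t)(((unsigned __int128)delta_exec * (uint32_t)fact) >> shift);
}

int main(void)
{
	/* e.g. 4ms of runtime, se weight 1024, queue weight 3072 */
	uint64_t approx = calc_delta(4000000ULL, 1024, 3072);
	uint64_t exact = 4000000ULL * 1024 / 3072;

	printf("approx=%llu exact=%llu\n",
	       (unsigned long long)approx, (unsigned long long)exact);
	return 0;
}
```

With these inputs both values print as 1333333; the fixed-point path stays
within rounding error of the exact division.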

Subject: [tip: sched/core] sched: Optimize __calc_delta()

The following commit has been merged into the sched/core branch of tip:

Commit-ID: 1e17fb8edc5ad6587e9303ccdebce853bc8cf30c
Gitweb: https://git.kernel.org/tip/1e17fb8edc5ad6587e9303ccdebce853bc8cf30c
Author: Clement Courbet <[email protected]>
AuthorDate: Wed, 03 Mar 2021 14:46:53 -08:00
Committer: Peter Zijlstra <[email protected]>
CommitterDate: Wed, 10 Mar 2021 09:51:49 +01:00

sched: Optimize __calc_delta()

A significant portion of __calc_delta() time is spent in the loop
shifting a u64 by 32 bits. Use `fls` instead of iterating.

This is ~7x faster on benchmarks.

The generic `fls` implementation (`generic_fls`) is still ~4x faster
than the loop.
Architectures that have a better implementation will make use of it. For
example, on x86 we get an additional factor 2 in speed without dedicated
implementation.

On GCC, the asm versions of `fls` are about the same speed as the
builtin. On Clang, the versions that use fls are more than twice as
slow as the builtin. This is because the way the `fls` function is
written, clang puts the value in memory:
https://godbolt.org/z/EfMbYe. This bug is filed at
https://bugs.llvm.org/show_bug.cgi?id=49406.

```
name cpu/op
BM_Calc<__calc_delta_loop> 9.57ms ±12%
BM_Calc<__calc_delta_generic_fls> 2.36ms ±13%
BM_Calc<__calc_delta_asm_fls> 2.45ms ±13%
BM_Calc<__calc_delta_asm_fls_nomem> 1.66ms ±12%
BM_Calc<__calc_delta_asm_fls64> 2.46ms ±13%
BM_Calc<__calc_delta_asm_fls64_nomem> 1.34ms ±15%
BM_Calc<__calc_delta_builtin> 1.32ms ±11%
```

Signed-off-by: Clement Courbet <[email protected]>
Signed-off-by: Josh Don <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
---
kernel/sched/fair.c | 19 +++++++++++--------
kernel/sched/sched.h | 1 +
2 files changed, 12 insertions(+), 8 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f5d6541..2e2ab1e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -229,22 +229,25 @@ static void __update_inv_weight(struct load_weight *lw)
static u64 __calc_delta(u64 delta_exec, unsigned long weight, struct load_weight *lw)
{
u64 fact = scale_load_down(weight);
+ u32 fact_hi = (u32)(fact >> 32);
int shift = WMULT_SHIFT;
+ int fs;

__update_inv_weight(lw);

- if (unlikely(fact >> 32)) {
- while (fact >> 32) {
- fact >>= 1;
- shift--;
- }
+ if (unlikely(fact_hi)) {
+ fs = fls(fact_hi);
+ shift -= fs;
+ fact >>= fs;
}

fact = mul_u32_u32(fact, lw->inv_weight);

- while (fact >> 32) {
- fact >>= 1;
- shift--;
+ fact_hi = (u32)(fact >> 32);
+ if (fact_hi) {
+ fs = fls(fact_hi);
+ shift -= fs;
+ fact >>= fs;
}

return mul_u64_u32_shr(delta_exec, fact, shift);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index bb8bb06..d2e09a6 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -36,6 +36,7 @@
#include <uapi/linux/sched/types.h>

#include <linux/binfmts.h>
+#include <linux/bitops.h>
#include <linux/blkdev.h>
#include <linux/compat.h>
#include <linux/context_tracking.h>