LinuxLists.cc - [PATCH] lib/int_sqrt.c: Optimize square root function

[permalink] [raw]

Subject: Re: [PATCH] lib/int_sqrt.c: Optimize square root function

Hmm. I don't disagree, but would like some more feedback.

Davidlohr - you were the person to touch this function last (commit
30493cc9dddb: "lib/int_sqrt.c: optimize square root algorithm"), and
you did so for performance reasons. And in fact, when you did that,
you removed that initial loop:

- one = 1UL << (BITS_PER_LONG - 2);
- while (one > op)
- one >>= 2;

but I'm not sure that was actually all that conscious, I think the
real optimization was the changes inside the loop to make the final
real loop faster and simpler.

Also, you had performance numbers, so presumably a test harness for it
all. It probably depends a lot on the actual distribution of argument
values, of course, but it would be good to accompany the patch with
actual real numbers like lasty time.

(I'm also not entirely sure what uses int_sqrt() that ends up being so
performance-critical, so it would be good to document that too, since
that probably also matters for the "what's the normal argument range"
question..)

Linus

On Mon, Feb 2, 2015 at 9:12 AM, Anshul Garg <[email protected]> wrote:
> From: Anshul Garg <[email protected]>
>
> Unnecessary instructions are executing even though m is
> greater than x so added logic to make m less than equal to
> x before performing these operations.
>
> Signed-off-by: Anshul Garg <[email protected]>
> ---
> lib/int_sqrt.c | 3 +++
> 1 file changed, 3 insertions(+)
>
> diff --git a/lib/int_sqrt.c b/lib/int_sqrt.c
> index 1ef4cc3..64ae722 100644
> --- a/lib/int_sqrt.c
> +++ b/lib/int_sqrt.c
> @@ -22,6 +22,9 @@ unsigned long int_sqrt(unsigned long x)
> return x;
>
> m = 1UL << (BITS_PER_LONG - 2);
> +
> + while (m > x)
> + m >>= 2;
> while (m != 0) {
> b = y + m;
> y >>= 1;
> --
> 1.7.9.5
>
>
> ---
> This email has been checked for viruses by Avast antivirus software.
> http://www.avast.com
>

2015-02-02 19:11:01

by Joe Perches

[permalink] [raw]

Subject: Re: [PATCH] lib/int_sqrt.c: Optimize square root function

On Mon, 2015-02-02 at 09:12 -0800, Anshul Garg wrote:
> From: Anshul Garg <[email protected]>
>
> Unnecessary instructions are executing even though m is
> greater than x so added logic to make m less than equal to
> x before performing these operations.
>
> Signed-off-by: Anshul Garg <[email protected]>
> ---
> lib/int_sqrt.c | 3 +++
> 1 file changed, 3 insertions(+)
>
> diff --git a/lib/int_sqrt.c b/lib/int_sqrt.c
[]
> @@ -22,6 +22,9 @@ unsigned long int_sqrt(unsigned long x)
> return x;
>
> m = 1UL << (BITS_PER_LONG - 2);
> +
> + while (m > x)
> + m >>= 2;

Perhaps removing the while and using fls(x) would be better.

Perhaps something like:

m = 1UL << min(sizeof(unsigned long) == sizeof(u64) ?
fls64(x) : fls(x),
BITS_PER_LONG - 2);

> while (m != 0) {
> b = y + m;
> y >>= 1;

2015-02-02 19:13:48

[permalink] [raw]

Subject: Re: [PATCH] lib/int_sqrt.c: Optimize square root function

On Mon, Feb 2, 2015 at 11:00 AM, Linus Torvalds
<[email protected]> wrote:
>
> (I'm also not entirely sure what uses int_sqrt() that ends up being so
> performance-critical, so it would be good to document that too, since
> that probably also matters for the "what's the normal argument range"
> question..)

... it's also not entirely clear that we need a whole new loop. We
might just instead start off with a better guess for 'm' using some
calculation that might be doable with a single conditional move
instruction instead of a loop. Because I suspect that the inevitable
branch misprediction of a new loop is likely as expensive as a few
iterations through the core one.

IOW, instead of

m = 1UL << (BITS_PER_LONG - 2);

perhaps something like

m = 1UL << (BITS_PER_LONG/2- 2);
if (m < x)
m <<= BITS_PER_LONG/2;

(assuming gcc can change that code into a "cmov") might cut down the
"lots of empty loops" case in half for small values of 'x'.

There's probably some other better cheap initial guess value estimator.

Linus

2015-02-02 19:39:21

[permalink] [raw]

Subject: Re: [PATCH] lib/int_sqrt.c: Optimize square root function

On Mon, Feb 2, 2015 at 11:10 AM, Joe Perches <[email protected]> wrote:
>
> Perhaps removing the while and using fls(x) would be better.

No. fls is often slow, and you then do have to also round it down to
an even bit number etc, so it gets somewhat complex too.

We *probably* have some argument range that we care more about, which
is why I'd like to know what the profile is that triggered this
optimization, and what the common argument range is.

For example, one user is the cpu frequency governor, which seems to
try to predict the length of the next sleep. The values can probably
be anything (it has protections for overflowing into "long long"), but
presumably the onyl time we care about performance is when it's called
very very often, in which case I'm going out on a limb and saying that
the intervals will be short, and the standard deviation of those
intervals will also be small. So *maybe* that function really only
cares about performance when we have small arguments. I dunno.

Linus

2015-02-03 04:41:54

by Davidlohr Bueso

[permalink] [raw]

Subject: Re: [PATCH] lib/int_sqrt.c: Optimize square root function

On Mon, 2015-02-02 at 11:13 -0800, Linus Torvalds wrote:
> IOW, instead of
>
> m = 1UL << (BITS_PER_LONG - 2);
>
> perhaps something like
>
> m = 1UL << (BITS_PER_LONG/2- 2);
> if (m < x)
> m <<= BITS_PER_LONG/2;
>
> (assuming gcc can change that code into a "cmov") might cut down the
> "lots of empty loops" case in half for small values of 'x'.

Makes a lot of sense.

> There's probably some other better cheap initial guess value estimator.

Just to get a feeling for the normal arg range in the non-driver parts
that use this thing:

fs/ceph/super.h: congestion_kb = (16*int_sqrt(totalram_pages)) << (PAGE_SHIFT-10);
fs/nfs/write.c: nfs_congestion_kb = (16*int_sqrt(totalram_pages)) << (PAGE_SHIFT-10);
fs/nfsd/nfscache.c: limit = (16 * int_sqrt(low_pages)) << (PAGE_SHIFT-10);
kernel/rcu/tree_plugin.h: ls = int_sqrt(nr_cpu_ids);
mm/memcontrol.c: inactive_ratio = int_sqrt(10 * gb);
mm/page_alloc.c: ratio = int_sqrt(10 * gb);
mm/page_alloc.c: new_min_free_kbytes = int_sqrt(lowmem_kbytes * 16);

So mostly values that scale according to mem or cpu.

2015-02-03 04:47:33

by Davidlohr Bueso

[permalink] [raw]

Subject: Re: [PATCH] lib/int_sqrt.c: Optimize square root function

On Mon, 2015-02-02 at 11:00 -0800, Linus Torvalds wrote:
> Hmm. I don't disagree, but would like some more feedback.
>
> Davidlohr - you were the person to touch this function last (commit
> 30493cc9dddb: "lib/int_sqrt.c: optimize square root algorithm"), and
> you did so for performance reasons. And in fact, when you did that,
> you removed that initial loop:
>
> - one = 1UL << (BITS_PER_LONG - 2);
> - while (one > op)
> - one >>= 2;
>
> but I'm not sure that was actually all that conscious, I think the
> real optimization was the changes inside the loop to make the final
> real loop faster and simpler.

I missed that. And yes, the real optimization should be in the loop.

>
> Also, you had performance numbers, so presumably a test harness for it
> all. It probably depends a lot on the actual distribution of argument
> values, of course, but it would be good to accompany the patch with
> actual real numbers like lasty time.

Aha. In my case I recall I ran a usersapce program using each function
from 1 to a million, and throwing perf at it for 10 times.

> (I'm also not entirely sure what uses int_sqrt() that ends up being so
> performance-critical, so it would be good to document that too, since
> that probably also matters for the "what's the normal argument range"
> question..)

It's not a big deal afaik.

Thanks,
Davidlohr

2015-02-03 04:57:22

by Davidlohr Bueso

[permalink] [raw]

Subject: Re: [PATCH] lib/int_sqrt.c: Optimize square root function

On Mon, 2015-02-02 at 20:41 -0800, Davidlohr Bueso wrote:
> On Mon, 2015-02-02 at 11:13 -0800, Linus Torvalds wrote:
> > IOW, instead of
> >
> > m = 1UL << (BITS_PER_LONG - 2);
> >
> > perhaps something like
> >
> > m = 1UL << (BITS_PER_LONG/2- 2);
> > if (m < x)
> > m <<= BITS_PER_LONG/2;

Assuming small values mostly, we could also try more aggressive
estimators for BITS_PER_LONG == 64. But then again, it probably doesn't
matter.

2015-02-03 15:42:10

[permalink] [raw]

Subject: Re: [PATCH] lib/int_sqrt.c: Optimize square root function

On Tue, Feb 3, 2015 at 10:17 AM, Davidlohr Bueso <[email protected]> wrote:
> On Mon, 2015-02-02 at 11:00 -0800, Linus Torvalds wrote:
>> Hmm. I don't disagree, but would like some more feedback.
>>
>> Davidlohr - you were the person to touch this function last (commit
>> 30493cc9dddb: "lib/int_sqrt.c: optimize square root algorithm"), and
>> you did so for performance reasons. And in fact, when you did that,
>> you removed that initial loop:
>>
>> - one = 1UL << (BITS_PER_LONG - 2);
>> - while (one > op)
>> - one >>= 2;
>>
>> but I'm not sure that was actually all that conscious, I think the
>> real optimization was the changes inside the loop to make the final
>> real loop faster and simpler.
>
> I missed that. And yes, the real optimization should be in the loop.
>
>>
>> Also, you had performance numbers, so presumably a test harness for it
>> all. It probably depends a lot on the actual distribution of argument
>> values, of course, but it would be good to accompany the patch with
>> actual real numbers like lasty time.
>
> Aha. In my case I recall I ran a usersapce program using each function
> from 1 to a million, and throwing perf at it for 10 times.
>
I have done profiling of int_sqrt function using perf tool for 10 times.
For this purpose i have created a userspace program which uses sqrt function
from 1 to a million.

int_sqrt_old -> current algorithm version
int_sqrt_new -> with proposed change

these results are for BITS_PER_LONG=64.

Performance counter stats for './int_sqrt_old' (10 runs):

460.944061 task-clock (msec) # 0.969 CPUs utilized
( +- 1.72% )
64 context-switches # 0.139 K/sec ( +- 2.27% )
0 cpu-migrations # 0.000 K/sec
132 page-faults # 0.286 K/sec
<not supported> cycles
<not supported> stalled-cycles-frontend
<not supported> stalled-cycles-backend
<not supported> instructions
<not supported> branches
<not supported> branch-misses

0.475795341 seconds time elapsed( +- 3.20% )

Performance counter stats for './int_sqrt_new' (10 runs):

401.782119 task-clock (msec) # 0.974 CPUs utilized(
+- 1.55% )
57 context-switches # 0.141 K/sec( +- 1.92% )
0 cpu-migrations # 0.000 K/sec
132 page-faults # 0.329 K/sec
<not supported> cycles
<not supported> stalled-cycles-frontend
<not supported> stalled-cycles-backend
<not supported> instructions
<not supported> branches
<not supported> branch-misses

0.412593296 seconds time elapsed( +- 2.03% )

As per profiling definitely there is improvement in algorithm timing.

>> (I'm also not entirely sure what uses int_sqrt() that ends up being so
>> performance-critical, so it would be good to document that too, since
>> that probably also matters for the "what's the normal argument range"
>> question..)
>
> It's not a big deal afaik.
>
> Thanks,
> Davidlohr
>

2015-02-03 15:54:58

[permalink] [raw]

Subject: Re: [PATCH] lib/int_sqrt.c: Optimize square root function

On Tue, Feb 3, 2015 at 10:11 AM, Davidlohr Bueso <[email protected]> wrote:
> On Mon, 2015-02-02 at 11:13 -0800, Linus Torvalds wrote:
>> IOW, instead of
>>
>> m = 1UL << (BITS_PER_LONG - 2);
>>
>> perhaps something like
>>
>> m = 1UL << (BITS_PER_LONG/2- 2);
>> if (m < x)
>> m <<= BITS_PER_LONG/2;

On doing this change also extra instruction(shown below) will execute
if x is small(say 10000).

b = y + m;
y >>= 1;
if (x >= b) {

Although i understand your point that we can set value of m according
to range of x.
But in future also range of x can change if apart from memory some
other module starts
using this function more extensively.
So in worst case it will be almost same as current implementation.

So i think its better to scale m according to x by doing
while(m > x)
m>>=2;
rather than according to range of x.

Thanks
Anshul Garg

>>
>> (assuming gcc can change that code into a "cmov") might cut down the
>> "lots of empty loops" case in half for small values of 'x'.
>
> Makes a lot of sense.
>

>> There's probably some other better cheap initial guess value estimator.
>
> Just to get a feeling for the normal arg range in the non-driver parts
> that use this thing:
>
> fs/ceph/super.h: congestion_kb = (16*int_sqrt(totalram_pages)) << (PAGE_SHIFT-10);
> fs/nfs/write.c: nfs_congestion_kb = (16*int_sqrt(totalram_pages)) << (PAGE_SHIFT-10);
> fs/nfsd/nfscache.c: limit = (16 * int_sqrt(low_pages)) << (PAGE_SHIFT-10);
> kernel/rcu/tree_plugin.h: ls = int_sqrt(nr_cpu_ids);
> mm/memcontrol.c: inactive_ratio = int_sqrt(10 * gb);
> mm/page_alloc.c: ratio = int_sqrt(10 * gb);
> mm/page_alloc.c: new_min_free_kbytes = int_sqrt(lowmem_kbytes * 16);
>
> So mostly values that scale according to mem or cpu.
>
>
>

2015-02-05 18:20:32

[permalink] [raw]

Subject: Re: [PATCH] lib/int_sqrt.c: Optimize square root function

On Tue, Feb 3, 2015 at 7:42 AM, Anshul Garg <[email protected]> wrote:
>
> I have done profiling of int_sqrt function using perf tool for 10 times.
> For this purpose i have created a userspace program which uses sqrt function
> from 1 to a million.

Hmm. I did that too, and it doesn't improve things for me. In fact, it
makes it slower.

[torvalds@i7 ~]$ gcc -Wall -O2 -DREDUCE int_sqrt.c ; time ./a.out

real 0m2.098s
user 0m2.095s
sys 0m0.000s

[torvalds@i7 ~]$ gcc -Wall -O2 int_sqrt.c ; time ./a.out

real 0m1.886s
user 0m1.883s
sys 0m0.000s

and the profile shows that 35% of the time is spent on that branch
back of the initial reduction loop.

In contrast, my suggested "reduce just once" does seem to improve things:

[torvalds@i7 ~]$ gcc -Wall -O2 -DONCE int_sqrt.c ; time ./a.out

real 0m1.436s
user 0m1.434s
sys 0m0.000s

but it's kind of hacky.

NOTE! This probably depends a lot on microarchitecture details,
including very much branch predictor etc. And I didn't actually check
that it gives the right result, but I do think that this optimization
needs to be looked at more if we want to do it.

I was running this on an i7-4770S, fwiw.

Attached is the stupid test-program I used to do the above. Maybe I
did something wrong.

Linus

Attachments:

int_sqrt.c (951.00 B)

2015-02-05 18:33:31

[permalink] [raw]

Subject: Re: [PATCH] lib/int_sqrt.c: Optimize square root function

On Thu, Feb 5, 2015 at 10:20 AM, Linus Torvalds
<[email protected]> wrote:
>
> Hmm. I did that too [..]

Side note: one difference in our results (apart from possibly just CPU
uarch details) is that my loop goes to 100M to make it easier to just
time it. Which means that my load essentially had about three more
iterations over nonzero data.

Linus

2015-02-05 18:43:38

[permalink] [raw]

Subject: Re: [PATCH] lib/int_sqrt.c: Optimize square root function

On Thu, Feb 5, 2015 at 10:33 AM, Linus Torvalds
<[email protected]> wrote:
> On Thu, Feb 5, 2015 at 10:20 AM, Linus Torvalds
> <[email protected]> wrote:
>>
>> Hmm. I did that too [..]
>
> Side note: one difference in our results (apart from possibly just CPU
> uarch details) is that my loop goes to 100M to make it easier to just
> time it. Which means that my load essentially had about three more
> iterations over nonzero data.
>
> Linus

I have also done the same testing on 100 million numbers.
Attaching source codes.

Below is the result ::

int_sqrt_old - current
int_sqrt_new - With proposed change

anshul@ubuntu:~/kernel_latest/sqrt$ time ./int_sqrt_old

real 0m41.895s
user 0m36.490s
sys 0m0.365s

anshul@ubuntu:~/kernel_latest/sqrt$ time ./int_sqrt_new

real 0m39.491s
user 0m36.495s
sys 0m0.338s

I have run this test on Intel(R) Core(TM) i3-4000M CPU @ 2.40GHz VMWare Machine.
Please check if i am doing anything wrong.

NOTE ::
I have not used gcc optimizations while compilation.
With O2 level optimization proposed solution is taking more time.

Attachments:

int_sqrt_old.c (721.00 B)
int_sqrt_new.c (750.00 B)
Download all attachments

2015-02-05 19:37:04

[permalink] [raw]

Subject: Re: [PATCH] lib/int_sqrt.c: Optimize square root function

On Thu, Feb 5, 2015 at 10:43 AM, Anshul Garg <[email protected]> wrote:
>
> NOTE ::
> I have not used gcc optimizations while compilation.
> With O2 level optimization proposed solution is taking more time.

The thing is, the kernel is compiled with -O2, so that's what matters.

Also, for very tight loops like this, the major costs tend to be very
subtle microarchitectural details, particularly branch prediction.
Which in turn end up sometimes depending on just exactly where the
branches were placed, and even whether two conditional branches were
in the same 8-byte aligned region etc things (because the branch
prediction might be done ignoring the low bits of the EIP etc). So not
only does the exact microarchitecture matter, things that don't *seem*
like they should matter can change behavior a lot.

My point is really that the performance numbers are very ambiguous.
The patch may well help in some situations, but hurt in others.

Linus

2015-02-08 15:39:30

[permalink] [raw]

Subject: Re: [PATCH] lib/int_sqrt.c: Optimize square root function

Dear Mr. linus,

Thanks for quick replies.

Yes performance numbers are not conclusive enough.
So its better to discard this patch as of now.

I will try to explore more in this area.

Thanks & regards
Anshul Garg

On Fri, Feb 6, 2015 at 1:07 AM, Linus Torvalds
<[email protected]> wrote:
> On Thu, Feb 5, 2015 at 10:43 AM, Anshul Garg <[email protected]> wrote:
>>
>> NOTE ::
>> I have not used gcc optimizations while compilation.
>> With O2 level optimization proposed solution is taking more time.
>
> The thing is, the kernel is compiled with -O2, so that's what matters.
>
> Also, for very tight loops like this, the major costs tend to be very
> subtle microarchitectural details, particularly branch prediction.
> Which in turn end up sometimes depending on just exactly where the
> branches were placed, and even whether two conditional branches were
> in the same 8-byte aligned region etc things (because the branch
> prediction might be done ignoring the low bits of the EIP etc). So not
> only does the exact microarchitecture matter, things that don't *seem*
> like they should matter can change behavior a lot.
>
> My point is really that the performance numbers are very ambiguous.
> The patch may well help in some situations, but hurt in others.
>
> Linus

2017-07-20 11:24:59

[permalink] [raw]

Subject: Re: [PATCH] lib/int_sqrt.c: Optimize square root function

On Mon, Feb 02, 2015 at 11:13:44AM -0800, Linus Torvalds wrote:
> On Mon, Feb 2, 2015 at 11:00 AM, Linus Torvalds
> <[email protected]> wrote:
> >
> > (I'm also not entirely sure what uses int_sqrt() that ends up being so
> > performance-critical, so it would be good to document that too, since
> > that probably also matters for the "what's the normal argument range"
> > question..)
>
> ... it's also not entirely clear that we need a whole new loop. We
> might just instead start off with a better guess for 'm' using some
> calculation that might be doable with a single conditional move
> instruction instead of a loop. Because I suspect that the inevitable
> branch misprediction of a new loop is likely as expensive as a few
> iterations through the core one.
>
> IOW, instead of
>
> m = 1UL << (BITS_PER_LONG - 2);
>
> perhaps something like
>
> m = 1UL << (BITS_PER_LONG/2- 2);
> if (m < x)
> m <<= BITS_PER_LONG/2;
>
> (assuming gcc can change that code into a "cmov") might cut down the
> "lots of empty loops" case in half for small values of 'x'.
>
> There's probably some other better cheap initial guess value estimator.

So I had a wee look at int_sqrt(), and yes Davidlohr's patch makes it
worse for my machine (Xeon E3v5 aka. fancy Skylake).

As in _MUCH_ worse..

~/tmp$ gcc -o sqrt sqrt.c -lm -O2 -DLOOPS=10000000 -DNEW=1 ; perf stat --repeat 10 -e cycles:u -e instructions:u ./sqrt

Performance counter stats for './sqrt' (10 runs):

828,233,295 cycles:u ( +- 0.04% )
2,480,095,771 instructions:u # 2.99 insn per cycle ( +- 0.00% )

0.220202758 seconds time elapsed ( +- 0.18% )

~/tmp$ gcc -o sqrt sqrt.c -lm -O2 -DLOOPS=10000000 ; perf stat --repeat 10 -e cycles:u -e instructions:u ./sqrt

Performance counter stats for './sqrt' (10 runs):

603,809,551 cycles:u ( +- 1.30% )
1,677,726,591 instructions:u # 2.78 insn per cycle ( +- 0.00% )

0.160801044 seconds time elapsed

That is, we went from ~60 cycles to ~82 cycles per invocation.

Now, the patches proposed in this thread do indeed improve matters but
seem to not have merged up until this day:

~/tmp$ gcc -o sqrt sqrt.c -lm -O2 -DLOOPS=10000000 -DNEW=1 -DANSHUL=1 ; perf stat --repeat 10 -e cycles:u -e instructions:u ./sqrt

Performance counter stats for './sqrt' (10 runs):

598,827,196 cycles:u ( +- 0.18% )
1,697,726,381 instructions:u # 2.84 insn per cycle ( +- 0.00% )

0.159880738 seconds time elapsed ( +- 0.36% )

~/tmp$ gcc -o sqrt sqrt.c -lm -O2 -DLOOPS=10000000 -DNEW=1 -DLINUS=1 ; perf stat --repeat 10 -e cycles:u -e instructions:u ./sqrt

Performance counter stats for './sqrt' (10 runs):

601,463,184 cycles:u ( +- 0.07% )
1,380,095,976 instructions:u # 2.29 insn per cycle ( +- 0.00% )

0.160788095 seconds time elapsed ( +- 0.55% )

Which basically bring us back to the old performance. So I would
strongly suggest we merge one.

Now, the thing is, Intel CPUs are actually fairly good at fls(), so I
gave Joe's suggestions a spin too..

~/tmp$ gcc -o sqrt sqrt.c -lm -O2 -DLOOPS=10000000 -DNEW=1 -DFLS=1 ; perf stat --repeat 10 -e cycles:u -e instructions:u ./sqrt

Performance counter stats for './sqrt' (10 runs):

390,420,004 cycles:u ( +- 0.07% )
1,141,802,493 instructions:u # 2.92 insn per cycle ( +- 0.00% )

0.105480000 seconds time elapsed ( +- 0.64% )

~/tmp$ gcc -o sqrt sqrt.c -lm -O2 -DLOOPS=10000000 -DNEW=1 -DFLS=1 -DANSHUL=1 ; perf stat --repeat 10 -e cycles:u -e instructions:u ./sqrt

Performance counter stats for './sqrt' (10 runs):

328,415,775 cycles:u ( +- 0.15% )
1,138,579,704 instructions:u # 3.47 insn per cycle ( +- 0.00% )

0.088703205 seconds time elapsed

Which gets us a real improvement...

Now even with the software fls() fallback from
include/asm-generic/bitops/fls.h there is a significant improvement:

~/tmp$ gcc -o sqrt sqrt.c -lm -O2 -DLOOPS=10000000 -DNEW=1 -DFLS=1 -DANSHUL=1 -DSOFTFLS=1 ; perf stat --repeat 10 -e cycles:u -e instructions:u ./sqrt

Performance counter stats for './sqrt' (10 runs):

375,633,931 cycles:u ( +- 0.13% )
1,289,700,013 instructions:u # 3.43 insn per cycle ( +- 0.00% )

0.100596211 seconds time elapsed

Also, I ran with -DVALIDATE=1 and while the comment says its a 'rough'
estimate of the sqrt() I did not in fact find any value for which we
deviate < INT_MAX.

----------------------[ sqrt.c ]---------------------------

#include <math.h>
#include <stdio.h>

#define BITS_PER_LONG (sizeof(long) * 8)

#ifndef LOOPS
#define LOOPS 1000000
#endif

#ifdef SOFTFLS

static __always_inline unsigned long fls(unsigned long x)
{
int r = 32;

if (!x)
return 0;

if (!(x & 0xffff0000u)) {
x <<= 16;
r -= 16;
}
if (!(x & 0xff000000u)) {
x <<= 8;
r -= 8;
}
if (!(x & 0xf0000000u)) {
x <<= 4;
r -= 4;
}
if (!(x & 0xc0000000u)) {
x <<= 2;
r -= 2;
}
if (!(x & 0x80000000u)) {
x <<= 1;
r -= 1;
}
return r;
}

#else

static __always_inline unsigned long fls(unsigned long word)
{
asm("rep; bsr %1,%0"
: "=r" (word)
: "rm" (word));
return BITS_PER_LONG - 1 - word;
}

#endif

#ifdef NEW

unsigned long __attribute__((noinline)) int_sqrt(unsigned long x)
{
unsigned long b, m, y = 0;

if (x <= 1)
return x;

#ifdef LINUS
m = 1UL << (BITS_PER_LONG/2 - 2);
if (m < x)
m <<= BITS_PER_LONG/2;

#else

#ifdef FLS
m = 1UL << ((fls(x) + 1) & ~1UL);
#else
m = 1UL << (BITS_PER_LONG - 2);
#endif

#ifdef ANSHUL
while (m > x)
m >>= 2;
#endif
#endif

while (m != 0) {
b = y + m;
y >>= 1;

if (x >= b) {
x -= b;
y += m;
}
m >>= 2;
}

return y;
}

#else

unsigned long __attribute__((noinline)) int_sqrt(unsigned long x)
{
unsigned long op, res, one;

op = x;
res = 0;

one = 1UL << (BITS_PER_LONG - 2);
while (one > op)
one >>= 2;

while (one != 0) {
if (op >= res + one) {
op = op - (res + one);
res = res + 2 * one;
}
res /= 2;
one /= 4;
}
return res;
}

#endif

void main(void)
{
unsigned long i;

for (i=0; i<LOOPS; i++) {
unsigned long a = int_sqrt(i);
asm volatile("" : : "r" (a));

#ifdef VALIDATE
{
unsigned long b = floor(sqrt(i));
if (a != b)
printf("%ld %ld %ld\n", i, a, b);
}
#endif

}
}

--------------

The only caveat seems to be that our fls() is only defined for 'int' not
long :/

---
lib/int_sqrt.c | 11 +++++++----
1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/lib/int_sqrt.c b/lib/int_sqrt.c
index 1ef4cc344977..87c3aa360441 100644
--- a/lib/int_sqrt.c
+++ b/lib/int_sqrt.c
@@ -7,12 +7,11 @@

#include <linux/kernel.h>
#include <linux/export.h>
+#include <linux/bitops.h>

/**
- * int_sqrt - rough approximation to sqrt
+ * int_sqrt - approximation to sqrt
* @x: integer of which to calculate the sqrt
- *
- * A very rough approximation to the sqrt() function.
*/
unsigned long int_sqrt(unsigned long x)
{
@@ -21,7 +20,11 @@ unsigned long int_sqrt(unsigned long x)
if (x <= 1)
return x;

- m = 1UL << (BITS_PER_LONG - 2);
+ m = 1UL << ((fls(x) + 1) & ~1UL);
+
+ while (m > x)
+ m >>= 2;
+
while (m != 0) {
b = y + m;
y >>= 1;

2017-07-20 11:52:56

by Joe Perches

[permalink] [raw]

Subject: Re: [PATCH] lib/int_sqrt.c: Optimize square root function

On Thu, 2017-07-20 at 13:24 +0200, Peter Zijlstra wrote:
[]
> + m = 1UL << ((fls(x) + 1) & ~1UL);

maybe

#if BITS_PER_LONG == 64
m = 1UL << ((fls64(x) + 1) & ~1UL);
#else
m = 1UL << ((fls(x) + 1) & ~1UL);
#endif

2017-07-20 14:13:45

[permalink] [raw]

Subject: Re: [PATCH] lib/int_sqrt.c: Optimize square root function

On Thu, Jul 20, 2017 at 04:52:49AM -0700, Joe Perches wrote:
> On Thu, 2017-07-20 at 13:24 +0200, Peter Zijlstra wrote:
> []
> > + m = 1UL << ((fls(x) + 1) & ~1UL);
>
> maybe
>
> #if BITS_PER_LONG == 64
> m = 1UL << ((fls64(x) + 1) & ~1UL);
> #else
> m = 1UL << ((fls(x) + 1) & ~1UL);
> #endif

We do seem to have __fls() which is defined on long.

2017-07-20 15:27:58

[permalink] [raw]

Subject: Re: [PATCH] lib/int_sqrt.c: Optimize square root function

On Thu, Jul 20, 2017 at 01:24:49PM +0200, Peter Zijlstra wrote:
> ~/tmp$ gcc -o sqrt sqrt.c -lm -O2 -DLOOPS=10000000 -DNEW=1 -DFLS=1 -DANSHUL=1 ; perf stat --repeat 10 -e cycles:u -e instructions:u ./sqrt
>
> Performance counter stats for './sqrt' (10 runs):
>
> 328,415,775 cycles:u ( +- 0.15% )
> 1,138,579,704 instructions:u # 3.47 insn per cycle ( +- 0.00% )
>
> 0.088703205 seconds time elapsed

> static __always_inline unsigned long fls(unsigned long word)
> {
> asm("rep; bsr %1,%0"
> : "=r" (word)
> : "rm" (word));
> return BITS_PER_LONG - 1 - word;
> }

That is actually "lzcnt", if I used the regular fls implementation:

static __always_inline unsigned long __fls(unsigned long word)
{
asm("bsr %1,%0"
: "=r" (word)
: "rm" (word));
return word;
}

It ends up slightly more expensive:

~/tmp$ gcc -o sqrt sqrt.c -lm -O2 -DLOOPS=10000000 -DNEW=1 -DFLS=1 -DANSHUL=1 ; perf stat --repeat 10 -e cycles:u -e instructions:u ./sqrt

Performance counter stats for './sqrt' (10 runs):

384,842,215 cycles:u ( +- 0.08% )
1,118,579,712 instructions:u # 2.91 insn per cycle ( +- 0.00% )

0.103018001 seconds time elapsed

Still loads cheaper than pretty much any other combination.

2017-07-20 18:31:39

[permalink] [raw]

Subject: Re: [PATCH] lib/int_sqrt.c: Optimize square root function

How did this two-year old thread get resurrected?

Anyway, it got resurrected without even answering one core question:

On Thu, Jul 20, 2017 at 4:24 AM, Peter Zijlstra <[email protected]> wrote:
> On Mon, Feb 02, 2015 at 11:13:44AM -0800, Linus Torvalds wrote:
>>>> On Mon, Feb 2, 2015 at 11:00 AM, Linus Torvalds <[email protected]> wrote:
>> >
>> > (I'm also not entirely sure what uses int_sqrt() that ends up being so
>> > performance-critical, so it would be good to document that too, since
>> > that probably also matters for the "what's the normal argument range"
>> > question..)

This is still the case. Which of the (very few) users really _care_?
And what are the normal values for that?

For example, the 802.11 minstrel code does a "MINSTREL_TRUNC()" on a
"unsigned int" value that is in millions. It's not even "unsigned
long", so we know it's not many thousands of millions, and
MINSTREL_TRUNC shifts it down by 12 bits.

So we know we have at most a 20-bit argument.

The one case that uses actual unsigned long seems to be
"slow_is_prime_number()", and honestly, the sqrt() is the *least* of
our problems there.

There's a few drivers and filesystems that use it. I do not believe
performance matters in those cases. Even if you do a "int_sqrt()" per
nertwork packet on some high-performance network (and none of them
look anything like that).

And there's a couple of VM users. They don't look particularly critical either.

So why do you care? Because honestly, calling int_sqrt() once in a
blue moon with caches cold and no branch prediction information tends
to have very different performance characteristics from calling it in
a loop with very predictable input.

So I think your "benchmark" is just garbage, in that it's testing
something entirely different than the actual load.

Also, since this is a generic library routine, no way can we depend on
fls being fast.

But we could certainly improve on the initial value a lot. It's just
that we should probably strive to improve on it without adding extra
branch misprediction or I$ misses - both things that your benchmark
isn't actually testing at all, since it does the exact opposite of
that by basically preloading both.

And the *most* important question is that first one:

"Why does this matter, and what is the range it matters for?"

Linus

2017-07-20 22:34:32

[permalink] [raw]

Subject: Re: [PATCH] lib/int_sqrt.c: Optimize square root function

On Thu, Jul 20, 2017 at 11:31:36AM -0700, Linus Torvalds wrote:
> How did this two-year old thread get resurrected?

I was looking for the original thread doing that 'optimization'
Davidlohr did but found this first.

> And the *most* important question is that first one:
>
> "Why does this matter, and what is the range it matters for?"

I was looking to do some work on the idle estimator. Parts of that keep
online avg and variance for normal distributions. I wanted to bias the
avg downwards, the way to do that is to subtract a scaled stdev from it.
Computing the stdev from a variance requires the sqrt.

Thomas rightly asked how expensive our sqrt is, I found Davidlohr's
commit message and got confused by the numbers, so I reran them and
found the optimization did the reverse, it made things worse.

By then I was prodding at it... 'fun' problem :-)

In any case, I suppose the range of values would be in the order of
TICK_NSEC, so the variance would be a number of orders below that. So
we're looking at fairly small numbers <1e5.

> Also, since this is a generic library routine, no way can we depend on
> fls being fast.

Which is why I also tested the software fls, but you're right in that
the microbench primes the branch predictor. Still, the software fls is 6
branches, whereas the 'missing' loop:

while (m > x)
m >>= 2;

would need up to 30 or so cycles worst case. So even in that respect it
makes sense its a 'win', esp. so for small numbers.

2017-07-20 23:24:35

[permalink] [raw]

Subject: Re: [PATCH] lib/int_sqrt.c: Optimize square root function

On Thu, Jul 20, 2017 at 3:34 PM, Peter Zijlstra <[email protected]> wrote:
>>
>> "Why does this matter, and what is the range it matters for?"
>
> I was looking to do some work on the idle estimator. Parts of that keep
> online avg and variance for normal distributions. I wanted to bias the
> avg downwards, the way to do that is to subtract a scaled stdev from it.
> Computing the stdev from a variance requires the sqrt.
>
> In any case, I suppose the range of values would be in the order of
> TICK_NSEC, so the variance would be a number of orders below that. So
> we're looking at fairly small numbers <1e5.

Ok. So that already cuts down the problem space a lot.

And yes, the current code is very suboptimal for exactly the small number case.

It's also typiocally the case that right-shifting is more expensive
than left-shifting, so the fact that we left-shift twice in that loop
for all the big values is extra expensive.

So I would actually suggest a different approach:

- start with a small 'm'

- and grow it up.

The "small m" means that there is a small constant, which is good.

And growing it up is a single trivial left shift by two, which can
also be done just two adds or as a single lea, so it works fine even
on architectures with slow shifters etc.

So mayube something like the attached?

NOTE NOTE NOTE! This may be pure garbage. It's untested. But the code
generation looks ok even if gcc is being way too clever and turns that
first loop into a counted loop instead. Whatever, maybe it's the right
thing to do.

But note that I might have broken the sqrt for some case, so you need
to double-check my logic there. The *intention* is that it's the exact
same thing as it used to do, just initializing 'm' differently.

Linus

Attachments:

patch.diff (745.00 B)

2017-07-21 11:40:55

[permalink] [raw]

Subject: Re: [PATCH] lib/int_sqrt.c: Optimize square root function

On Thu, Jul 20, 2017 at 04:24:32PM -0700, Linus Torvalds wrote:

> So mayube something like the attached?
>
> NOTE NOTE NOTE! This may be pure garbage. It's untested. But the code
> generation looks ok even if gcc is being way too clever and turns that
> first loop into a counted loop instead. Whatever, maybe it's the right
> thing to do.

It passes a -DVALIDATE=1 run (up to a billion or so), so it seems to work.

I've made the test thing a bit more complicated in order to address your
concerns for primed branch predictors (and once the branch predictors
get wiped, the predictable input values don't matter anymore either I
think, although I could easily LFSR generate them as well of course).

Find it attached in a tarball.

The results, from running ./test.sh (left SKL, right SNB):

(values <100k)

Cycles:

EVENT=0 -DLINUS=1
event: 29.533080 +- 0.046261 36.800100 +- 0.046067
EVENT=0 -DLINUS=1 -DWIPE_BTB=1
event: 143.513440 +- 0.152651 153.054640 +- 0.099403
EVENT=0 -DNEW=1 -DANSHUL=1
event: 41.457300 +- 0.048409 55.046100 +- 0.044968
EVENT=0 -DNEW=1 -DANSHUL=1 -DWIPE_BTB=1
event: 128.351400 +- 0.366873 161.183690 +- 0.132301
EVENT=0 -DNEW=1 -DANSHUL=1 -DFLS=1
event: 21.433170 +- 0.043859 27.672700 +- 0.063982
EVENT=0 -DNEW=1 -DANSHUL=1 -DFLS=1 -DWIPE_BTB=1
event: 79.850210 +- 0.133646 112.422470 +- 0.101465
EVENT=0 -DNEW=1 -DANSHUL=1 -DFLS=1 -DSOFTFLS=1
event: 31.771680 +- 0.037655 37.515160 +- 0.080305
EVENT=0 -DNEW=1 -DANSHUL=1 -DFLS=1 -DSOFTFLS=1 -DWIPE_BTB=1
event: 129.673440 +- 0.430308 125.514390 +- 0.117590

Branches:

EVENT=4 -DLINUS=1
event: 31.507740 +- 0.010470 31.507710 +- 0.010470
EVENT=4 -DLINUS=1 -DWIPE_BTB=1
event: 31.507720 +- 0.010470 31.507760 +- 0.010470
EVENT=4 -DNEW=1 -DANSHUL=1
event: 45.125510 +- 0.002696 45.125510 +- 0.002696
EVENT=4 -DNEW=1 -DANSHUL=1 -DWIPE_BTB=1
event: 45.125490 +- 0.002696 45.125540 +- 0.002697
EVENT=4 -DNEW=1 -DANSHUL=1 -DFLS=1
event: 21.252320 +- 0.005266 21.252330 +- 0.005266
EVENT=4 -DNEW=1 -DANSHUL=1 -DFLS=1 -DWIPE_BTB=1
event: 21.252320 +- 0.005266 21.252340 +- 0.005266
EVENT=4 -DNEW=1 -DANSHUL=1 -DFLS=1 -DSOFTFLS=1
event: 25.727940 +- 0.005975 25.727930 +- 0.005975
EVENT=4 -DNEW=1 -DANSHUL=1 -DFLS=1 -DSOFTFLS=1 -DWIPE_BTB=1
event: 25.727940 +- 0.005975 25.727930 +- 0.005975

Branch-Misses:

EVENT=5 -DLINUS=1
event: 0.022670 +- 0.000789 0.020920 +- 0.000812
EVENT=5 -DLINUS=1 -DWIPE_BTB=1
event: 5.481750 +- 0.006930 5.955130 +- 0.005228
EVENT=5 -DNEW=1 -DANSHUL=1
event: 0.025990 +- 0.000798 0.017170 +- 0.000690
EVENT=5 -DNEW=1 -DANSHUL=1 -DWIPE_BTB=1
event: 3.343600 +- 0.004249 5.723570 +- 0.004876
EVENT=5 -DNEW=1 -DANSHUL=1 -DFLS=1
event: 0.019040 +- 0.000731 0.022570 +- 0.000829
EVENT=5 -DNEW=1 -DANSHUL=1 -DFLS=1 -DWIPE_BTB=1
event: 2.156350 +- 0.004643 4.297650 +- 0.004910
EVENT=5 -DNEW=1 -DANSHUL=1 -DFLS=1 -DSOFTFLS=1
event: 0.019800 +- 0.000747 0.016780 +- 0.000688
EVENT=5 -DNEW=1 -DANSHUL=1 -DFLS=1 -DSOFTFLS=1 -DWIPE_BTB=1
event: 4.481360 +- 0.005505 4.518300 +- 0.004877

Which still doesn't explain your hatred of fls(), as even with cold
branches and a software fls, its faster than your latest proposal (as
can be explained by the lower total number of branches for the software
fls variant compared to your proposal).

And when we start to look at larger values, the fls() one pulls even
further ahead.

So I'll stick with my proposal... I can write up a proper Changelog and
maybe add my test to tools/testing/sqrt/ if you think its worth it.

---
lib/int_sqrt.c | 11 ++++++++---
1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/lib/int_sqrt.c b/lib/int_sqrt.c
index 1ef4cc344977..611933760225 100644
--- a/lib/int_sqrt.c
+++ b/lib/int_sqrt.c
@@ -7,12 +7,13 @@

#include <linux/kernel.h>
#include <linux/export.h>
+#include <linux/bitops.h>

/**
- * int_sqrt - rough approximation to sqrt
+ * int_sqrt - Computes the integer square root
* @x: integer of which to calculate the sqrt
*
- * A very rough approximation to the sqrt() function.
+ * Computes: floor(sqrt(@x))
*/
unsigned long int_sqrt(unsigned long x)
{
@@ -21,7 +22,11 @@ unsigned long int_sqrt(unsigned long x)
if (x <= 1)
return x;

- m = 1UL << (BITS_PER_LONG - 2);
+ m = 1UL << (__fls(x) & ~1U);
+
+ while (m > x)
+ m >>= 2;
+
while (m != 0) {
b = y + m;
y >>= 1;

Attachments:

(No filename) (4.63 kB)
sqrt.tar (20.00 kB)
Download all attachments

2017-07-21 12:15:15

by Joe Perches

[permalink] [raw]

Subject: Re: [PATCH] lib/int_sqrt.c: Optimize square root function

On Fri, 2017-07-21 at 13:40 +0200, Peter Zijlstra wrote:
> @@ -21,7 +22,11 @@ unsigned long int_sqrt(unsigned long x)
> if (x <= 1)
> return x;
>
> - m = 1UL << (BITS_PER_LONG - 2);
> + m = 1UL << (__fls(x) & ~1U);
> +
> + while (m > x)
> + m >>= 2;

while (m > x) ?

Belt and suspenders if __fls is broken?

2017-07-21 13:26:31

[permalink] [raw]

Subject: Re: [PATCH] lib/int_sqrt.c: Optimize square root function

On Fri, Jul 21, 2017 at 05:15:10AM -0700, Joe Perches wrote:
> On Fri, 2017-07-21 at 13:40 +0200, Peter Zijlstra wrote:
> > @@ -21,7 +22,11 @@ unsigned long int_sqrt(unsigned long x)
> > if (x <= 1)
> > return x;
> >
> > - m = 1UL << (BITS_PER_LONG - 2);
> > + m = 1UL << (__fls(x) & ~1U);
> > +
> > + while (m > x)
> > + m >>= 2;
>
> while (m > x) ?
>
> Belt and suspenders if __fls is broken?

Hmm... you're right, that should not happen. It is a remnant from when I
rounded up, like:

m = 1UL << ((__fls(x) + 1) & ~1UL);

Because I worried about the case where m == x, which is not included in
the loop above (but works when you look at the actual computation loop
and passes VALIDATE=1).

But check this... I cannot explain :/

When I remove that loop, we, as fully expected, loose 1 branch, but the
cycle count for the branch-cold case shoots up. Must be something GCC
does.

EVENT=0 -DNEW=1 -DFLS=1
event: 19.626050 +- 0.038995
EVENT=0 -DNEW=1 -DFLS=1 -DWIPE_BTB=1
event: 109.610670 +- 0.425667

EVENT=0 -DNEW=1 -DFLS=1 -DANSHUL=1
event: 21.445680 +- 0.043782
EVENT=0 -DNEW=1 -DFLS=1 -DANSHUL=1 -DWIPE_BTB=1
event: 83.590420 +- 0.142126

EVENT=4 -DNEW=1 -DFLS=1
event: 20.252330 +- 0.005265
EVENT=4 -DNEW=1 -DFLS=1 -DWIPE_BTB=1
event: 20.252340 +- 0.005265

EVENT=4 -DNEW=1 -DFLS=1 -DANSHUL=1
event: 21.252300 +- 0.005266
EVENT=4 -DNEW=1 -DFLS=1 -DANSHUL=1 -DWIPE_BTB=1
event: 21.252300 +- 0.005266

EVENT=5 -DNEW=1 -DFLS=1
event: 0.019370 +- 0.000732
EVENT=5 -DNEW=1 -DFLS=1 -DWIPE_BTB=1
event: 3.665240 +- 0.005309

EVENT=5 -DNEW=1 -DFLS=1 -DANSHUL=1
event: 0.020150 +- 0.000755
EVENT=5 -DNEW=1 -DFLS=1 -DANSHUL=1 -DWIPE_BTB=1
event: 2.225330 +- 0.004875

Let me dig out another GCC version current:

gcc (Debian 6.3.0-18) 6.3.0 20170516

2017-07-21 13:33:09