2014-01-13 21:42:54

by Hannes Frederic Sowa

[permalink] [raw]
Subject: [PATCH RFC] reciprocal_divide: correction/update of the algorithm

This patch is a RFC and part of a series Daniel Borkmann and me want to
do when introducing prandom_u32_range{,_ro} and prandom_u32_max{,_ro}
helpers later this week.

At first Jakub Zawadzki noticed that some divisions by reciprocal_divide
were not correct:
http://www.wireshark.org/~darkjames/reciprocal-buggy.c

He could also show this with BPF:
http://www.wireshark.org/~darkjames/set-and-dump-filter-k-bug.c

Also: reciprocal_value and reciprocal_divide always return 0 for
divisions by 1. This is a bit worrisome as those functions also get
used in mm/slab.c and lib/flex_array.c. Bonding already seems to check
for the 1-divisor case and handles that correctly. We don't know about
other problems, yet.

I propose an correction/update of the algorithm
based on the paper "T. Granlund and P. L. Montgomery:
Division by Invariant Integers Using Multiplication"
<http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.1.2556>

The assembler implementation from Agner Fog, found here
<http://www.agner.org/optimize/asmlib.zip>, helped a lot while
implementing.

I would like to have feedback if people see problems with this patch or
have concerns about performance. I did some testing on x86-64 and found
no problems so far but did no performance evaluation, yet.

The current code does break the call-sides of reciprocal_divide. The necessary
changes will be part of the full series, then.

Thanks!
---
include/linux/reciprocal_div.h | 12 +++++++++---
lib/reciprocal_div.c | 22 ++++++++++++++++++----
2 files changed, 27 insertions(+), 7 deletions(-)

diff --git a/include/linux/reciprocal_div.h b/include/linux/reciprocal_div.h
index f9c90b3..6f17a87 100644
--- a/include/linux/reciprocal_div.h
+++ b/include/linux/reciprocal_div.h
@@ -22,11 +22,17 @@
* Should not be called before each reciprocal_divide(),
* or else the performance is slower than a normal divide.
*/
-extern u32 reciprocal_value(u32 B);

+struct reciprocal_value {
+ u32 m;
+ u8 sh1, sh2;
+};

-static inline u32 reciprocal_divide(u32 A, u32 R)
+struct reciprocal_value reciprocal_value(u32 d);
+
+static inline u32 reciprocal_divide(u32 a, struct reciprocal_value R)
{
- return (u32)(((u64)A * R) >> 32);
+ u32 t = (u32)(((u64)a * R.m) >> 32);
+ return (t + ((a - t) >> R.sh1)) >> R.sh2;
}
#endif
diff --git a/lib/reciprocal_div.c b/lib/reciprocal_div.c
index 75510e9..b741b30 100644
--- a/lib/reciprocal_div.c
+++ b/lib/reciprocal_div.c
@@ -1,11 +1,25 @@
+#include <linux/kernel.h>
#include <asm/div64.h>
#include <linux/reciprocal_div.h>
#include <linux/export.h>

-u32 reciprocal_value(u32 k)
+/* For a description of the algorithmus please look at
+ * linux/reciprocal_div.h
+ */
+
+struct reciprocal_value reciprocal_value(u32 d)
{
- u64 val = (1LL << 32) + (k - 1);
- do_div(val, k);
- return (u32)val;
+ struct reciprocal_value R;
+ u64 m;
+ int l;
+
+ l = fls(d - 1);
+ m = ((1ULL << 32) * ((1ULL << l) - d));
+ do_div(m, d);
+ ++m;
+ R.m = (u32)m;
+ R.sh1 = min(l, 1);
+ R.sh2 = max(l-1, 0);
+ return R;
}
EXPORT_SYMBOL(reciprocal_value);
--
1.8.4.2


2014-01-14 18:02:27

by Randy Dunlap

[permalink] [raw]
Subject: Re: [PATCH RFC] reciprocal_divide: correction/update of the algorithm

On 01/13/2014 01:42 PM, Hannes Frederic Sowa wrote:
>
> I propose an correction/update of the algorithm
> based on the paper "T. Granlund and P. L. Montgomery:
> Division by Invariant Integers Using Multiplication"
> <http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.1.2556>
>
> The assembler implementation from Agner Fog, found here
> <http://www.agner.org/optimize/asmlib.zip>, helped a lot while
> implementing.
>
> I would like to have feedback if people see problems with this patch or
> have concerns about performance. I did some testing on x86-64 and found
> no problems so far but did no performance evaluation, yet.
>
> The current code does break the call-sides of reciprocal_divide. The necessary
> changes will be part of the full series, then.
>
> Thanks!
> ---
> include/linux/reciprocal_div.h | 12 +++++++++---
> lib/reciprocal_div.c | 22 ++++++++++++++++++----
> 2 files changed, 27 insertions(+), 7 deletions(-)

Just trivia (coding style and spelling):

> diff --git a/lib/reciprocal_div.c b/lib/reciprocal_div.c
> index 75510e9..b741b30 100644
> --- a/lib/reciprocal_div.c
> +++ b/lib/reciprocal_div.c
> @@ -1,11 +1,25 @@
> +#include <linux/kernel.h>
> #include <asm/div64.h>
> #include <linux/reciprocal_div.h>
> #include <linux/export.h>
>
> -u32 reciprocal_value(u32 k)
> +/* For a description of the algorithmus please look at

algorithms

> + * linux/reciprocal_div.h
> + */

and kernel coding style for multi-line comments is like so:

/*
* For a description of the algorithms, please look at
* linux/reciprocal_div.h
*/

> +
> +struct reciprocal_value reciprocal_value(u32 d)
> {



--
~Randy

2014-01-14 18:14:46

by Eric Dumazet

[permalink] [raw]
Subject: Re: [PATCH RFC] reciprocal_divide: correction/update of the algorithm

On Mon, 2014-01-13 at 22:42 +0100, Hannes Frederic Sowa wrote:
> This patch is a RFC and part of a series Daniel Borkmann and me want to
> do when introducing prandom_u32_range{,_ro} and prandom_u32_max{,_ro}
> helpers later this week.

> -static inline u32 reciprocal_divide(u32 A, u32 R)
> +struct reciprocal_value reciprocal_value(u32 d);
> +
> +static inline u32 reciprocal_divide(u32 a, struct reciprocal_value R)
> {
> - return (u32)(((u64)A * R) >> 32);
> + u32 t = (u32)(((u64)a * R.m) >> 32);
> + return (t + ((a - t) >> R.sh1)) >> R.sh2;
> }

I would rather introduce new helpers and convert users that really need
them.

For instance, just use a divide in BPF, because doing this on JIT might
be too complex for the gains. Strangely, libpcap doesn't seem to
optimize any divide, like divides by a power of two...

Reciprocal were added 7 years ago, for very specific uses, but current
cpus have reasonably fast dividers.


2014-01-14 19:22:27

by Austin S Hemmelgarn

[permalink] [raw]
Subject: Re: [PATCH RFC] reciprocal_divide: correction/update of the algorithm

On 2014-01-14 13:07, Eric Dumazet wrote:
> On Mon, 2014-01-13 at 22:42 +0100, Hannes Frederic Sowa wrote:
>> This patch is a RFC and part of a series Daniel Borkmann and me want to
>> do when introducing prandom_u32_range{,_ro} and prandom_u32_max{,_ro}
>> helpers later this week.
>
>> -static inline u32 reciprocal_divide(u32 A, u32 R)
>> +struct reciprocal_value reciprocal_value(u32 d);
>> +
>> +static inline u32 reciprocal_divide(u32 a, struct reciprocal_value R)
>> {
>> - return (u32)(((u64)A * R) >> 32);
>> + u32 t = (u32)(((u64)a * R.m) >> 32);
>> + return (t + ((a - t) >> R.sh1)) >> R.sh2;
>> }
>
> I would rather introduce new helpers and convert users that really need
> them.
>
> For instance, just use a divide in BPF, because doing this on JIT might
> be too complex for the gains. Strangely, libpcap doesn't seem to
> optimize any divide, like divides by a power of two...
>
> Reciprocal were added 7 years ago, for very specific uses, but current
> cpus have reasonably fast dividers.

I disagree with the statement that current CPU's have reasonably fast
dividers. A lot of embedded processors and many low-end x86 CPU's do
not in-fact have any hardware divider, and usually provide it using
microcode based emulation if they provide it at all. The AMD Jaguar
micro-architecture in particular comes to mind, it uses an iterative
division algorithm provided by the microcode that only produces 2 bits
of quotient per cycle, even in the best case (2 8-bit integers and an
integral 8-bit quotient) this still takes 4 cycles, which is twice as
slow as any other math operation on the same processor.

2014-01-14 19:50:36

by Eric Dumazet

[permalink] [raw]
Subject: Re: [PATCH RFC] reciprocal_divide: correction/update of the algorithm

On Tue, 2014-01-14 at 14:22 -0500, Austin S Hemmelgarn wrote:

> I disagree with the statement that current CPU's have reasonably fast
> dividers. A lot of embedded processors and many low-end x86 CPU's do
> not in-fact have any hardware divider, and usually provide it using
> microcode based emulation if they provide it at all. The AMD Jaguar
> micro-architecture in particular comes to mind, it uses an iterative
> division algorithm provided by the microcode that only produces 2 bits
> of quotient per cycle, even in the best case (2 8-bit integers and an
> integral 8-bit quotient) this still takes 4 cycles, which is twice as
> slow as any other math operation on the same processor.

I doubt you run any BPF filter with a divide instruction in it on these
platform.

Get real, do not over optimize things where it does not matter.


2014-01-14 20:10:46

by Hannes Frederic Sowa

[permalink] [raw]
Subject: Re: [PATCH RFC] reciprocal_divide: correction/update of the algorithm

On Tue, Jan 14, 2014 at 11:50:32AM -0800, Eric Dumazet wrote:
> On Tue, 2014-01-14 at 14:22 -0500, Austin S Hemmelgarn wrote:
>
> > I disagree with the statement that current CPU's have reasonably fast
> > dividers. A lot of embedded processors and many low-end x86 CPU's do
> > not in-fact have any hardware divider, and usually provide it using
> > microcode based emulation if they provide it at all. The AMD Jaguar
> > micro-architecture in particular comes to mind, it uses an iterative
> > division algorithm provided by the microcode that only produces 2 bits
> > of quotient per cycle, even in the best case (2 8-bit integers and an
> > integral 8-bit quotient) this still takes 4 cycles, which is twice as
> > slow as any other math operation on the same processor.
>
> I doubt you run any BPF filter with a divide instruction in it on these
> platform.
>
> Get real, do not over optimize things where it does not matter.

If I read the instruction tables correctly, we could half the latency with
reciprocal divide even on haswell.

What a pitty that the Intel Architecture Code Analyzer does not support imul
nor div instruction. :(

2014-01-14 20:53:53

by Austin S Hemmelgarn

[permalink] [raw]
Subject: Re: [PATCH RFC] reciprocal_divide: correction/update of the algorithm

On 2014-01-14 14:50, Eric Dumazet wrote:
> On Tue, 2014-01-14 at 14:22 -0500, Austin S Hemmelgarn wrote:
>
>> I disagree with the statement that current CPU's have reasonably fast
>> dividers. A lot of embedded processors and many low-end x86 CPU's do
>> not in-fact have any hardware divider, and usually provide it using
>> microcode based emulation if they provide it at all. The AMD Jaguar
>> micro-architecture in particular comes to mind, it uses an iterative
>> division algorithm provided by the microcode that only produces 2 bits
>> of quotient per cycle, even in the best case (2 8-bit integers and an
>> integral 8-bit quotient) this still takes 4 cycles, which is twice as
>> slow as any other math operation on the same processor.
>
> I doubt you run any BPF filter with a divide instruction in it on these
> platform.
>
> Get real, do not over optimize things where it does not matter.
>
Actually, I have three Jaguar based routers, and use BPF regularly as
part of their iptables rules to log certain packet types.

2014-01-14 22:39:37

by Hannes Frederic Sowa

[permalink] [raw]
Subject: Re: [PATCH RFC] reciprocal_divide: correction/update of the algorithm

On Tue, Jan 14, 2014 at 10:07:05AM -0800, Eric Dumazet wrote:
> On Mon, 2014-01-13 at 22:42 +0100, Hannes Frederic Sowa wrote:
> > This patch is a RFC and part of a series Daniel Borkmann and me want to
> > do when introducing prandom_u32_range{,_ro} and prandom_u32_max{,_ro}
> > helpers later this week.
>
> > -static inline u32 reciprocal_divide(u32 A, u32 R)
> > +struct reciprocal_value reciprocal_value(u32 d);
> > +
> > +static inline u32 reciprocal_divide(u32 a, struct reciprocal_value R)
> > {
> > - return (u32)(((u64)A * R) >> 32);
> > + u32 t = (u32)(((u64)a * R.m) >> 32);
> > + return (t + ((a - t) >> R.sh1)) >> R.sh2;
> > }
>
> I would rather introduce new helpers and convert users that really need
> them.
>
> For instance, just use a divide in BPF, because doing this on JIT might
> be too complex for the gains. Strangely, libpcap doesn't seem to
> optimize any divide, like divides by a power of two...
>
> Reciprocal were added 7 years ago, for very specific uses, but current
> cpus have reasonably fast dividers.

Agreed. The new algorithm would need to change the size of struct
sock_filter, which is exported to user space. We will leave BPF as-is
for the time being and check that later.

Greetings,

Hannes

2014-01-14 22:50:46

by Eric Dumazet

[permalink] [raw]
Subject: Re: [PATCH RFC] reciprocal_divide: correction/update of the algorithm

On Tue, 2014-01-14 at 15:53 -0500, Austin S Hemmelgarn wrote:
> On 2014-01-14 14:50, Eric Dumazet wrote:
> > On Tue, 2014-01-14 at 14:22 -0500, Austin S Hemmelgarn wrote:
> >
> >> I disagree with the statement that current CPU's have reasonably fast
> >> dividers. A lot of embedded processors and many low-end x86 CPU's do
> >> not in-fact have any hardware divider, and usually provide it using
> >> microcode based emulation if they provide it at all. The AMD Jaguar
> >> micro-architecture in particular comes to mind, it uses an iterative
> >> division algorithm provided by the microcode that only produces 2 bits
> >> of quotient per cycle, even in the best case (2 8-bit integers and an
> >> integral 8-bit quotient) this still takes 4 cycles, which is twice as
> >> slow as any other math operation on the same processor.
> >
> > I doubt you run any BPF filter with a divide instruction in it on these
> > platform.
> >
> > Get real, do not over optimize things where it does not matter.
> >
> Actually, I have three Jaguar based routers, and use BPF regularly as
> part of their iptables rules to log certain packet types.



1) Have you enabled BPF JIT

2) Do you have divide instructions in a BPF filter,
if yes, I would like to have an example.
(current code works well for small divisors anyway)

3) How much time is spent in BPF compared to the rest of the stack,
especially if you run iptables...

Spending 2 or 3 days of work to save ~7 cycles for a divide that
probably can be replaced by a shift anyway, while spending 5000 cycles
per packet is what I call not a wise optimization, especially
if dealing with old hardware.

Even on a Jaguar, the proposed alternative

+ u32 t = (u32)(((u64)a * R.m) >> 32);
+ return (t + ((a - t) >> R.sh1)) >> R.sh2;

will have a similar cost.

2014-01-14 23:25:46

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH RFC] reciprocal_divide: correction/update of the algorithm

On Tue, Jan 14, 2014 at 02:45:37PM -0800, Eric Dumazet wrote:
> Even on a Jaguar, the proposed alternative

I don't know what Jaguar you guys are talking about but the Jaguar
I know - Fam16h - has an int hardware divider:

http://developer.amd.com/wordpress/media/2012/10/SOG_16h_52128_PUB_Rev1_1.pdf

So all that talk about microcode is plain wrong. The hardware divider
comes from Llano (F12h) so it must be some other Jaguar, maybe Bobcat.

:-)

If it is Bobcat, then it has a 1-bit per cycle ucode int divider.

--
Regards/Gruss,
Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

2014-01-15 02:51:49

by Austin S Hemmelgarn

[permalink] [raw]
Subject: Re: [PATCH RFC] reciprocal_divide: correction/update of the algorithm


On 01/14/2014 06:25 PM, Borislav Petkov wrote:
> On Tue, Jan 14, 2014 at 02:45:37PM -0800, Eric Dumazet wrote:
>> Even on a Jaguar, the proposed alternative
> I don't know what Jaguar you guys are talking about but the Jaguar
> I know - Fam16h - has an int hardware divider:
>
> http://developer.amd.com/wordpress/media/2012/10/SOG_16h_52128_PUB_Rev1_1.pdf
>
> So all that talk about microcode is plain wrong. The hardware divider
> comes from Llano (F12h) so it must be some other Jaguar, maybe Bobcat.
>
> :-)
>
> If it is Bobcat, then it has a 1-bit per cycle ucode int divider.
>
Apologies, it does have a hardware divider, however it still only gets 2
bits per cycle.

2014-01-15 15:02:55

by Hannes Frederic Sowa

[permalink] [raw]
Subject: Re: [PATCH RFC] reciprocal_divide: correction/update of the algorithm

On Tue, Jan 14, 2014 at 10:02:19AM -0800, Randy Dunlap wrote:
> Just trivia (coding style and spelling):
>
> > diff --git a/lib/reciprocal_div.c b/lib/reciprocal_div.c
> > index 75510e9..b741b30 100644
> > --- a/lib/reciprocal_div.c
> > +++ b/lib/reciprocal_div.c
> > @@ -1,11 +1,25 @@
> > +#include <linux/kernel.h>
> > #include <asm/div64.h>
> > #include <linux/reciprocal_div.h>
> > #include <linux/export.h>
> >
> > -u32 reciprocal_value(u32 k)
> > +/* For a description of the algorithmus please look at
>
> algorithms
>
> > + * linux/reciprocal_div.h
> > + */
>
> and kernel coding style for multi-line comments is like so:
>
> /*
> * For a description of the algorithms, please look at
> * linux/reciprocal_div.h
> */
>
> > +
> > +struct reciprocal_value reciprocal_value(u32 d)
> > {

Thanks, fixed those stuff locally along with a better description. I am
used to netdev comments. ;)