2003-09-29 09:07:25

by Muli Ben-Yehuda

Subject: [PATCH] document optimizing macro for translating PROT_ to VM_ bits

In continuation to the thread at
http://marc.theaimsgroup.com/?l=linux-kernel&m=106270456516176&w=2 wrt
translating PROT_ bits to VM_ bits, here's a small
comment only patch to document the optimizing macro Jamie's patch
used. Against 2.6.0-t6, but should apply to anything
recent. Compiles.

Cheers,
Muli

--- linux-2.5/include/linux/mman.h Sun Sep 7 10:05:18 2003
+++ optimizing-macro-2.6.0-t6/include/linux/mman.h Mon Sep 29 11:53:12 2003
@@ -28,7 +28,10 @@
vm_acct_memory(-pages);
}

-/* Optimisation macro. */
+/* Optimisation macro, used to be defined as: */
+/* ((bit1 == bit2) ? (x & bit1) : (x & bit1) ? bit2 : 0) */
+/* but this version is faster */
+/* "check if bit1 is on in 'x'. If it is, return bit2" */
#define _calc_vm_trans(x,bit1,bit2) \
((bit1) <= (bit2) ? ((x) & (bit1)) * ((bit2) / (bit1)) \
: ((x) & (bit1)) / ((bit1) / (bit2)))

--
Muli Ben-Yehuda
http://www.mulix.org



2003-09-29 15:34:51

by Jamie Lokier

Subject: Re: [PATCH] document optimizing macro for translating PROT_ to VM_ bits

Muli Ben-Yehuda wrote:
> -/* Optimisation macro. */
> +/* Optimisation macro, used to be defined as: */
> +/* ((bit1 == bit2) ? (x & bit1) : (x & bit1) ? bit2 : 0) */
> +/* but this version is faster */
> +/* "check if bit1 is on in 'x'. If it is, return bit2" */
> #define _calc_vm_trans(x,bit1,bit2) \
> ((bit1) <= (bit2) ? ((x) & (bit1)) * ((bit2) / (bit1)) \
> : ((x) & (bit1)) / ((bit1) / (bit2)))

I agree with the intent of that comment, but the code in it is
unnecessarily complex. See if you like this, and if you do feel free
to submit it as a patch:

/* Optimisation macro. It is equivalent to:
(x & bit1) ? bit2 : 0
but this version is faster. ("bit1" and "bit2" must be single bits). */

cheers,
-- Jamie

2003-09-29 15:52:54

by Valdis Klētnieks

Subject: Re: [PATCH] document optimizing macro for translating PROT_ to VM_ bits

On Mon, 29 Sep 2003 16:34:37 BST, Jamie Lokier said:
> Muli Ben-Yehuda wrote:
> > -/* Optimisation macro. */
> > +/* Optimisation macro, used to be defined as: */
> > +/* ((bit1 == bit2) ? (x & bit1) : (x & bit1) ? bit2 : 0) */
> > +/* but this version is faster */
> > +/* "check if bit1 is on in 'x'. If it is, return bit2" */
> > #define _calc_vm_trans(x,bit1,bit2) \
> > ((bit1) <= (bit2) ? ((x) & (bit1)) * ((bit2) / (bit1)) \
> > : ((x) & (bit1)) / ((bit1) / (bit2)))
>
> I agree with the intent of that comment, but the code in it is
> unnecessarily complex. See if you like this, and if you do feel free
> to submit it as a patch:
>
> /* Optimisation macro. It is equivalent to:
> (x & bit1) ? bit2 : 0
> but this version is faster. ("bit1" and "bit2" must be single bits). */

Is this supposed to return the bitmask bit2, or (x & bit2)? If the former,
then your code is right. If the latter, (x & bit1) ? (x & bit2) : 0

I'm totally failing to see why the original did the bit1 == bit2 compare,
so maybe myself and Jamie are both missing some subtlety?


2003-09-29 17:14:27

by Jamie Lokier

Subject: Re: [PATCH] document optimizing macro for translating PROT_ to VM_ bits

[email protected] wrote:
> Is this supposed to return the bitmask bit2, or (x & bit2)? If the former,
> then your code is right. If the latter, (x & bit1) ? (x & bit2) : 0

The former.

> I'm totally failing to see why the original did the bit1 == bit2 compare,
> so maybe myself and Jamie are both missing some subtlety?

"bit1 == bit2" was an optimisation. It made the machine code smaller,
without changing the result.

-- Jamie

2003-09-29 17:29:16

by Maciej Żenczykowski

Subject: Re: [PATCH] document optimizing macro for translating PROT_ to VM_ bits

> > > +/* Optimisation macro, used to be defined as: */
> > > +/* ((bit1 == bit2) ? (x & bit1) : (x & bit1) ? bit2 : 0) */
> > > +/* but this version is faster */
> > > +/* "check if bit1 is on in 'x'. If it is, return bit2" */
> > > #define _calc_vm_trans(x,bit1,bit2) \
> > > ((bit1) <= (bit2) ? ((x) & (bit1)) * ((bit2) / (bit1)) \
> > > : ((x) & (bit1)) / ((bit1) / (bit2)))
> >
> > I agree with the intent of that comment, but the code in it is
> > unnecessarily complex. See if you like this, and if you do feel free
> > to submit it as a patch:
> >
> > /* Optimisation macro. It is equivalent to:
> > (x & bit1) ? bit2 : 0
> > but this version is faster. ("bit1" and "bit2" must be single bits). */
>
> Is this supposed to return the bitmask bit2, or (x & bit2)? If the former,
> then your code is right. If the latter, (x & bit1) ? (x & bit2) : 0
>
> I'm totally failing to see why the original did the bit1 == bit2 compare,
> so maybe myself and Jamie are both missing some subtlety?

I'd guess it's because the majority of the time bit1 and bit2 are
compile-time constants, so the comparison can be resolved during
compilation into more efficient code in the bit1 == bit2 case, thus
effectively decreasing, not increasing, the amount of generated object
code.

That probably also explains why the convoluted long code with division
above is faster: the division and multiplication most likely get
resolved at compile time (since we're dealing with compile-time
constants) into a single shift, so execution becomes a bitwise 'and'
followed by a 'shift left/right'. Lacking the conditional branches
necessary for ?:, that sequence is usually significantly faster, as
there are no branch mispredictions.

MaZe.


2003-09-30 02:24:06

by Miles Bader

Subject: Re: [PATCH] document optimizing macro for translating PROT_ to VM_ bits

Maciej Zenczykowski <[email protected]> writes:
> That probably also explains why the convoluted long code with division
> above is faster - the division and multiplication most likely gets
> resolved during compile (since we're dealing with compile time constants)
> into a single shift and results in an execution of bitop 'and' followed by
> 'shift left/right', which due to lacking and conditional branches
> (necessary for ?:) is usually significantly faster due to a lack of
> branch mispredictions.

Hmmm, on my arch (v850) gcc-2.9x produces different, but equally
efficient (no branches) code for both the old `obvious' expression and
the new `convoluted' expression. gcc-3.3.x produces the _same_ two
instructions for both expressions, except that the two instructions are
in different orders. :-)

-Miles
--
Freedom's just another word, for nothing left to lose --Janis Joplin

2003-09-30 07:10:24

by Muli Ben-Yehuda

Subject: Re: [PATCH] document optimizing macro for translating PROT_ to VM_ bits

On Mon, Sep 29, 2003 at 04:34:37PM +0100, Jamie Lokier wrote:

> I agree with the intent of that comment, but the code in it is
> unnecessarily complex. See if you like this, and if you do feel free
> to submit it as a patch:

Ah, much nicer, thank you. I'll submit it, but first, what do you
think about this version? It keeps the optimization and enforces the
"bit1 and bit2 must be single bits only" rule. It could probably be
improved to do the check at build time, if bit1 and bit2 are compile-time
constants. Against 2.6.0-t6-cvs, compiles and boots for me.

--- linux-2.5/include/linux/mman.h Sun Sep 7 10:05:18 2003
+++ calc-vm-bit-optimizing-macro-2.6.0-t6//include/linux/mman.h Tue Sep 30 10:02:01 2003
@@ -28,10 +28,28 @@
vm_acct_memory(-pages);
}

-/* Optimisation macro. */
-#define _calc_vm_trans(x,bit1,bit2) \
- ((bit1) <= (bit2) ? ((x) & (bit1)) * ((bit2) / (bit1)) \
- : ((x) & (bit1)) / ((bit1) / (bit2)))
+/*
+ * assert that only a single bit is on in 'bit'
+ */
+static inline void assert_single_bit(unsigned long bit)
+{
+ BUG_ON(bit & (bit - 1));
+}
+/*
+ * Optimisation function. It is equivalent to:
+ * (x & bit1) ? bit2 : 0
+ * but this version is faster.
+ * ("bit1" and "bit2" must be single bits).
+ */
+static inline unsigned long
+_calc_vm_trans(unsigned long x, unsigned long bit1, unsigned long bit2)
+{
+ assert_single_bit(bit1);
+ assert_single_bit(bit2);
+
+ return ((bit1) <= (bit2) ? ((x) & (bit1)) * ((bit2) / (bit1))
+ : ((x) & (bit1)) / ((bit1) / (bit2)));
+}

/*
* Combine the mmap "prot" argument into "vm_flags" used internally.

--
Muli Ben-Yehuda
http://www.mulix.org



2003-09-30 07:35:27

by Miles Bader

Subject: Re: [PATCH] document optimizing macro for translating PROT_ to VM_ bits

Muli Ben-Yehuda <[email protected]> writes:
> Ah, much nicer, thank you. I'll submit it, but first, what do you
> think about this version? it keeps the optimization and enforces the
> "bit1 and bit2 must be single bits only" rule.

On my arch it results in huge bloated code, including mul and div insns!

Here's what gcc 3.3.x does for both the prefix macro versions:

#define _calc_vm_trans_2(x,bit1,bit2) \
((bit1) <= (bit2) ? ((x) & (bit1)) * ((bit2) / (bit1)) \
: ((x) & (bit1)) / ((bit1) / (bit2)))
int test2_0 (unsigned val)
{
return _calc_vm_trans_2 (val,0x20,0x80);
}

==>

_test2_0:
andi 32,r6,r10 #, val, tmp43
shl 2,r10 #, <result>
jmp [r31]


Here's what gcc 3.3.x does for your inlined function version:

static inline void assert_single_bit(unsigned long bit)
{
extern __bug() __attribute__ ((noreturn));
if (bit & (bit - 1))
__bug ();
}
static inline unsigned long
_calc_vm_trans_3(unsigned long x, unsigned long bit1, unsigned long bit2)
{
assert_single_bit(bit1);
assert_single_bit(bit2);

return ((bit1) <= (bit2) ? ((x) & (bit1)) * ((bit2) / (bit1))
: ((x) & (bit1)) / ((bit1) / (bit2)));
}
int test3_0 (unsigned val)
{
return _calc_vm_trans_3 (val,0x20,0x80);
}

==>

_test3_0:
movea lo(32),r0,r7 #, bit1
movea lo(128),r0,r5 #, bit2
divu r7,r5,r10 # bit1, bit2, tmp55
mov r6,r10 # val, tmp52
and r7,r10 # bit1, tmp52
mul r5,r10,r0 # bit2, tmp52
jmp [r31]

[I've no idea what's up with that, I don't even think the generated code
makes sense, but it seems like the presence of inline function is
confusing something.]

-Miles
--
In New York, most people don't have cars, so if you want to kill a person, you
have to take the subway to their house. And sometimes on the way, the train
is delayed and you get impatient, so you have to kill someone on the subway.
[George Carlin]

2003-09-30 07:41:54

by Muli Ben-Yehuda

Subject: Re: [PATCH] document optimizing macro for translating PROT_ to VM_ bits

On Tue, Sep 30, 2003 at 04:34:31PM +0900, Miles Bader wrote:
> Muli Ben-Yehuda <[email protected]> writes:
> > Ah, much nicer, thank you. I'll submit it, but first, what do you
> > think about this version? it keeps the optimization and enforces the
> > "bit1 and bit2 must be single bits only" rule.
>
> On my arch it results in huge bloated code, including mul and div
> insns!

Eeek!

> [I've no idea what's up with that, I don't even think the generated code
> makes sense, but it seems like the presence of inline function is
> confusing something.]

Ok, that's a pretty convincing argument for scrapping that
version. I'll rewrite it to evaluate the arguments at compile time if
they're constants, which they are, in our case. Unless someone else
beats me to it, of course ;-)
--
Muli Ben-Yehuda
http://www.mulix.org



2003-09-30 08:00:20

by Miles Bader

Subject: Re: [PATCH] document optimizing macro for translating PROT_ to VM_ bits

Muli Ben-Yehuda <[email protected]> writes:
> Ok, that's a pretty convincing argument for scraping that
> version. I'll rewrite it to evaluate the arguments at compile time if
> they're constants, which they are, in our case. Unless someone else
> beats me to it, of course ;-)

What's wrong with the macro version? The presence of a __ prefix
suggests that it's only used in controlled circumstances anyway, so is
validity-checking on the bit arguments really worthwhile?

-miles
--
The secret to creativity is knowing how to hide your sources.
--Albert Einstein

2003-09-30 09:41:44

by Muli Ben-Yehuda

Subject: Re: [PATCH] document optimizing macro for translating PROT_ to VM_ bits

On Tue, Sep 30, 2003 at 04:59:56PM +0900, Miles Bader wrote:
> Muli Ben-Yehuda <[email protected]> writes:
> > Ok, that's a pretty convincing argument for scraping that
> > version. I'll rewrite it to evaluate the arguments at compile time if
> > they're constants, which they are, in our case. Unless someone else
> > beats me to it, of course ;-)
>
> What's wrong with the macro version? The presence of a __ prefix
> suggests that it's only used in controlled circumstances anyway, so is
> validity-checking on the bit arguments really worthwhile?

I like code that is "future proof", especially when it doesn't cost
anything. These examples generate identical code for me with gcc-3.3
and almost identical code with gcc-2.96 (one instruction difference,
and I can't tell which is faster), and the inline function barfs when
its arguments are incorrect during compile time, whereas the macro
will silently give you the wrong results. How does it fare on your
arch?

#include <stdio.h>
#include <assert.h>

#define BUG_ON(x) assert(x)

/* compiler trap for assert_single_bit. */
extern void __assert_single_bit_failed_dont_exist(void);

/*
* assert that only a single bit is on in 'bit'
*/
#define assert_single_bit(bit) do { \
if (__builtin_constant_p(bit)) { \
if ((bit & (bit -1))) \
__assert_single_bit_failed_dont_exist(); \
} else \
BUG_ON(!(bit & (bit - 1))); \
} while(0)
/*
* Optimisation function. It is equivalent to:
* (x & bit1) ? bit2 : 0
* but this version is faster.
* ("bit1" and "bit2" must be single bits).
*/
static inline unsigned long
inline_calc_vm_trans(unsigned long x, unsigned long bit1, unsigned long bit2)
{
assert_single_bit(bit1);
assert_single_bit(bit2);

return ((bit1) <= (bit2) ? ((x) & (bit1)) * ((bit2) / (bit1))
: ((x) & (bit1)) / ((bit1) / (bit2)));
}

/* Optimisation macro. */
#define macro_calc_vm_trans(x,bit1,bit2) \
((bit1) <= (bit2) ? ((x) & (bit1)) * ((bit2) / (bit1)) \
: ((x) & (bit1)) / ((bit1) / (bit2)))


int test3(unsigned long arg)
{
return macro_calc_vm_trans(arg, 0x20, 0x80);
}

int test4(unsigned long arg)
{
return inline_calc_vm_trans(arg, 0x20, 0x80);
}

int main(void)
{
int l1 = test4(~0UL);
int l2 = test3(~0UL);
return l1 & l2; /* don't optimize them out */
}

gcc-2.96:

test3:
pushl %ebp
movl %esp, %ebp
movl 8(%ebp), %eax
andl $32, %eax
sall $2, %eax
popl %ebp
ret

test4:
pushl %ebp
movl %esp, %ebp
subl $8, %esp
movl 8(%ebp), %eax
andl $32, %eax
sall $2, %eax
leave
ret
--
Muli Ben-Yehuda
http://www.mulix.org



2003-09-30 10:01:30

by Miles Bader

Subject: Re: [PATCH] document optimizing macro for translating PROT_ to VM_ bits

Muli Ben-Yehuda <[email protected]> writes:
> I like code that is "future proof", especially when it doesn't cost
> anything.

Sure, it's not always worth the trouble/pain to do so unless you've
actually got some reason to think it will be mis-used.

This is seeming more and more like such a case: it's a very special
purpose macro which is only used by two other small functions, both of
which are located immediately adjacent to the macro's definition.

If it _really_ makes you nervous, give it a big warning comment with
capital letters or something; that's almost certainly enough.

> How does it fare on your arch?

Quite grotesquely:

int test3_0 (unsigned val)
{
return _calc_vm_trans_3 (val,0x20,0x80);
}

int test3_1 (unsigned val)
{
return _calc_vm_trans_3 (val,0x80,0x20);
}

==>

__calc_vm_trans_3:
addi -20,sp,sp #,,
addi -1,r7,r5 #, bit1, tmp46
addi -1,r8,r11 #, bit2, tmp51
mov r6,r12 # x, tmp53
st.w r31,16[sp] #,
and r7,r12 # bit1, tmp53
and r7,r5 # bit1, tmp46
and r8,r11 # bit2, tmp51
cmp r0,r5 # tmp46
be .L24 #,
cmp r0,r11 # tmp51
be .L24 #,
cmp r8,r7 # bit2, bit1
bnh .L26 #,
mov r6,r12 # x, tmp53
and r7,r12 # bit1, tmp53
divu r8,r7,r10 # bit2, bit1, tmp61
ld.w 16[sp],r31 #,
divu r7,r12,r11 # bit1, tmp53, tmp63
mov r12,r10 # tmp53, <result>
addi 20,sp,sp #,,
jmp [r31]
.L26:
divu r7,r8,r10 # bit1, bit2, tmp57
mul r8,r12,r0 # bit2, tmp53
ld.w 16[sp],r31 #,
mov r12,r10 # tmp53, <result>
addi 20,sp,sp #,,
jmp [r31]
.L24:
jarl ___bug,r31 #
_test3_1:
addi -20,sp,sp #,,
st.w r31,16[sp] #,
movea lo(128),r0,r7 #,
movea lo(32),r0,r8 #,
jarl __calc_vm_trans_3,r31 #
ld.w 16[sp],r31 #,
addi 20,sp,sp #,,
jmp [r31]
_test3_0:
addi -20,sp,sp #,,
st.w r31,16[sp] #,
movea lo(32),r0,r7 #,
movea lo(128),r0,r8 #,
jarl __calc_vm_trans_3,r31 #
ld.w 16[sp],r31 #,
addi 20,sp,sp #,,
jmp [r31]


If I only use one test function, gcc does a bit better, but still
generates various unnecessary function-call-related artifacts:

int test3_0 (unsigned val)
{
return _calc_vm_trans_3 (val,0x20,0x80);
}

==>

_test3_0:
add -16,sp #,
andi 32,r6,r10 #, val, tmp56
shl 2,r10 #, <result>
addi 16,sp,sp #,,
jmp [r31]

-Miles
--
Freedom's just another word, for nothing left to lose --Janis Joplin

2003-09-30 10:06:39

by Miles Bader

Subject: Re: [PATCH] document optimizing macro for translating PROT_ to VM_ bits

Argh!

> Sure, it's not always worth the trouble/pain to do so unless you've
^ insert `but'

-Miles
--
"Though they may have different meanings, the cries of 'Yeeeee-haw!' and
'Allahu akbar!' are, in spirit, not actually all that different."

2003-09-30 10:35:56

by Thomas Schlichter

Subject: Re: [PATCH] document optimizing macro for translating PROT_ to VM_ bits

Hi,

here just my 2 cents...

On Tuesday 30 September 2003 11:24, Muli Ben-Yehuda wrote:

~~ snip ~~

> /*
> * assert that only a single bit is on in 'bit'
> */
> #define assert_single_bit(bit) do { \
> if (__builtin_constant_p(bit)) { \
> if ((bit & (bit -1))) \
> __assert_single_bit_failed_dont_exist(); \
> } else \
> BUG_ON(!(bit & (bit - 1))); \
> } while(0)

In the BUG_ON statement the "!" looks wrong to me...

> /*
> * Optimisation function. It is equivalent to:
> * (x & bit1) ? bit2 : 0
> * but this version is faster.
> * ("bit1" and "bit2" must be single bits).
> */
> static inline unsigned long
> inline_calc_vm_trans(unsigned long x, unsigned long bit1, unsigned long bit2)
> {
>	assert_single_bit(bit1);
>	assert_single_bit(bit2);
>
>	return ((bit1) <= (bit2) ? ((x) & (bit1)) * ((bit2) / (bit1))
>				 : ((x) & (bit1)) / ((bit1) / (bit2)));
> }

Why don't we do:

static inline unsigned long calc_vm_trans(const unsigned long x,
		const unsigned long bit1, const unsigned long bit2)
{
	assert_single_bit(bit1);
	assert_single_bit(bit2);

	/* Optimisation function */
	if (__builtin_constant_p(bit1) && __builtin_constant_p(bit2)) {
		return ((bit1) <= (bit2) ? ((x) & (bit1)) * ((bit2) / (bit1))
					 : ((x) & (bit1)) / ((bit1) / (bit2)));
	}

	return (x & bit1) ? bit2 : 0;
}

Best regards
Thomas Schlichter

