Will there be any objections to using a quasi-documented mutation of the
x86's AAD instruction in the 387 emulator? Every CPU around has to do this
mutation correctly or a LOT of existing code will break...
Storing BCD numbers to user space in the 387 emulator could be made
significantly faster by using the mutant AAD instruction trick (i.e.
altering its implicit base from 10 to 16). See FPU_store_bcd() in
reg_ld_str.c.
As it stands now, the BCD digits are decoded one at a time in a
10-iteration divide-by-10 loop that makes two calls to an extended
precision division routine.
This loop could be morphed into a 5-iteration divide-by-100 loop.
The remainder of each divide by 100 would be processed thus (pseudo asm code):
AL = remainder
AH = 0
AAM /* this creates BCD nibbles in the low 4 bits of AH and AL */
AAD (mutated with 16 as base rather than 10)
Now AL contains two packed BCD digits. Here's a worked-out example of the
transformation, starting with an initial remainder of 35 (decimal) in
AX:
1) Start with AX = 0x0023
2) Execute AAM instruction
3) Now AX = 0x0203 (unpacked BCD)
4) Execute base 16 AAD instruction
5) Now AX = 0x0023 (packed BCD)
AAM and AAD aren't cheap instructions, but compared to the cost of 2X trips
through the extended precision divide routine, they are quite a bargain.
Brain fade...example should be:
1) Start with AX = 0x0023
2) Execute AAM instruction
3) Now AX = 0x0305 (unpacked BCD)
4) Execute base 16 AAD instruction
5) Now AX = 0x0035 (packed BCD)
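The corrected steps above can be checked with a small C model of the two
instructions. This is only a sketch: model_aam(), model_aad() and
rem_to_packed_bcd() are hypothetical names made up for illustration, not
anything in the emulator, and the model covers just the AX result and
ignores the flags.

```c
#include <assert.h>
#include <stdint.h>

/* Model of AAM with an explicit base: AH = AL / base, AL = AL % base.
 * Takes AL and returns the resulting AX. Plain AAM uses base 10. */
static uint16_t model_aam(uint8_t al, uint8_t base)
{
    assert(base != 0);  /* the real instruction raises a divide error here */
    return (uint16_t)(((al / base) << 8) | (al % base));
}

/* Model of AAD with an explicit base: AL = (AH * base + AL) & 0xFF, AH = 0.
 * Takes AX and returns the resulting AX. Plain AAD uses base 10. */
static uint16_t model_aad(uint16_t ax, uint8_t base)
{
    uint8_t ah = (uint8_t)(ax >> 8);
    uint8_t al = (uint8_t)(ax & 0xFF);
    return (uint8_t)(ah * base + al);
}

/* Remainder of a divide by 100 (0..99) -> two packed BCD digits in AL. */
static uint8_t rem_to_packed_bcd(uint8_t rem)
{
    assert(rem < 100);
    return (uint8_t)model_aad(model_aam(rem, 10), 16);
}
```

With these definitions, model_aam(0x23, 10) gives 0x0305 and
model_aad(0x0305, 16) gives 0x0035, matching steps 3 and 5.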
----- Original Message -----
> 1) Start with AX = 0x0023
> 2) Execute AAM instruction
> 3) Now AX = 0x0203 (unpacked BCD)
> 4) Execute base 16 AAD instruction
> 5) Now AX = 0x0023 (packed BCD)
[email protected] writes:
> Will there be any objections to using a quasi-documented mutation of the
> x86's AAD instruction in the 387 emulator? Every CPU around has to do this
> mutation correctly or a LOT of existing code will break...
>
> The performance of storing to user space of BCD numbers in the 387 emulator
> code could be improved significantly by using the mutant AAD instruction
> trick (i.e. alter its implicit base from 10 to 16). See reg_ld_str.c, in
> function FPU_store_bcd()
What do you mean by "quasi-documented" and "mutant"?
Intel certainly documents the "D5 ib" form as being a
valid way to change the base from the default 10.
The only issue AFAIK is that assemblers may only
recognise the plain base-10 AAD syntax. No biggie.
/Mikael
On Friday 27 May 2005 10:44, [email protected] wrote:
> Brain fade...example should be:
>
> 1) Start with AX = 0x0023
> 2) Execute AAM instruction
> 3) Now AX = 0x0305 (unpacked BCD)
> 4) Execute base 16 AAD instruction
> 5) Now AX = 0x0035 (packed BCD)
Intel syntax:
shl ah,4
or al,ah
mov ah,0 (if needed)
No need to use AAD16; it
a) doesn't work on some obscure ancient NEC x86 clones, IIRC
b) is microcoded (slow)
--
vda
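For readers who prefer C, the three-instruction sequence above can be
written out roughly as follows. This is a sketch only; pack_bcd_shift_or()
is a hypothetical name, and it assumes AX already holds unpacked BCD (one
decimal digit each in AH and AL), as left behind by AAM.

```c
#include <assert.h>
#include <stdint.h>

/* C equivalent of:  shl ah,4 / or al,ah / mov ah,0
 * Input:  unpacked BCD in AX (tens digit in AH, units digit in AL).
 * Output: AX with both digits packed into AL and AH cleared. */
static uint16_t pack_bcd_shift_or(uint16_t ax)
{
    uint8_t ah = (uint8_t)(ax >> 8);
    uint8_t al = (uint8_t)(ax & 0xFF);

    assert(ah <= 9 && al <= 9);  /* assumption: valid unpacked BCD input */

    ah = (uint8_t)(ah << 4);     /* shl ah,4 */
    al = (uint8_t)(al | ah);     /* or  al,ah */
    return al;                   /* mov ah,0 (AH is zero in the result) */
}
```

For the example in this thread, pack_bcd_shift_or(0x0305) returns 0x0035,
the same result as the base 16 AAD.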
You're right about AAD16, Denis :)
Sometimes my mind forgets we're not dealing with something having only 4K of
ROM/RAM to play with. An embedded Linux won't fit on THAT ;-> A few bytes
more here is definitely worth it. No matter anyway, I've managed to rip
several hundred bytes out of the emulator in the .S assembler files and made
it faster in the process. I've just now started looking at the .c files and
this opportunity kind of jumped out at me and seemed significant, since this
BCD stuff is common in C runtimes' printf/sprintf for generating displayable
floating point numbers.
AAM/SHL/OR/MOV looks like a big win though compared to multiple trips
through the extended precision divide routines for a BCD pair.
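To make the shape of the proposed loop concrete, here is a rough C sketch
of a divide-by-100 store loop. It is not the emulator's code: the name
store_packed_bcd() and its parameters are hypothetical, a native uint64_t
stands in for the extended precision value and divide routine, and the
digit split is done with plain C arithmetic in place of AAM/SHL/OR.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Store `value` as packed BCD, two decimal digits per byte, least
 * significant byte first. One divide by 100 per output byte replaces
 * two divides by 10. */
static void store_packed_bcd(uint64_t value, uint8_t *out, size_t nbytes)
{
    assert(out != NULL);

    for (size_t i = 0; i < nbytes; i++) {
        unsigned rem = (unsigned)(value % 100);  /* remainder 0..99 */
        value /= 100;
        /* split like AAM, then pack like shl ah,4 / or al,ah */
        out[i] = (uint8_t)(((rem / 10) << 4) | (rem % 10));
    }
}
```

Called with the value 1234567 and a 5-byte buffer, this fills the buffer
with 0x67, 0x45, 0x23, 0x01, 0x00.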
Mikael: The reason I say "quasi/mutant" is because Intel didn't officially
confess to this particular AAD behavior until many years later. All the
earlier 8086-486 programmer refs describe only the base 10 form (i.e. their
instruction pseudocode in those manuals is in fact wrong). As Denis
mentions, it was NEC (on the V20/V30) who got one of these wrong by trusting
the printed manual rather than the silicon - never a good thing to do with
Intel ;-> It took Intel until the Pentium to confess to SETALC's existence
which had been around since 8088/86.<g>
Tony
----- Original Message -----
From: "Denis Vlasenko" <[email protected]>
To: <[email protected]>; <[email protected]>
Sent: Friday, May 27, 2005 05:35
Subject: Re: 387 emulator hack - mutant AAD trick - any objections?
> On Friday 27 May 2005 10:44, [email protected] wrote:
> > Brain fade...example should be:
> >
> > 1) Start with AX = 0x0023
> > 2) Execute AAM instruction
> > 3) Now AX = 0x0305 (unpacked BCD)
> > 4) Execute base 16 AAD instruction
> > 5) Now AX = 0x0035 (packed BCD)
>
> Intel syntax:
>
> shl ah,4
> or al,ah
> mov ah,0 (if needed)
>
> No need to use AAD16; it
> a) doesn't work on some obscure ancient NEC x86 clones, IIRC
> b) is microcoded (slow)
> --
> vda
On Fri, 27 May 2005, Mikael Pettersson wrote:
> The only issue AFAIK is that assemblers may only
> recognise the plain base-10 AAD syntax. No biggie.
And actually gas has supported the explicit operand variations for "aad"
and "aam" for a long time now.
Maciej
On Fri, 27 May 2005, Denis Vlasenko wrote:
> No need to use AAD16; it
> a) doesn't work on some obscure ancient NEC x86 clones, IIRC
Who cares about 16-bit silicon?
> b) is microcoded (slow)
But certainly not worse than the alternatives for these processors which
actually lack the x87 subset.
Maciej
From: "Maciej W. Rozycki" <[email protected]>
>
> But certainly not worse than the alternatives for these processors which
> actually lack the x87 subset.
>
What I'm looking for is a reasonable compromise on the space issue. I'm
willing to spend a few bytes to help out the weak CPUs this will run on.
The embedded market (and whatever low end desktop machines are still out
there) is fairly size sensitive. Cutting a few pages out might be the
difference between thrashing and not thrashing ;-> I've also got some GCC
hacks in mind so it can treat the SX differently than the 386DX. Right now it
doesn't distinguish and makes some very poor (for the SX) choices even when
told to space optimize. Because the SX is so memory constrained, even
something like the ubiquitous MOV EAX,1 (5 bytes) is slower on the SX than
XOR EAX,EAX (2 bytes) / INC EAX (1 byte) - in fact the MOV is about 30%
slower than the XOR/INC :O The situation reverses on the DX and 486. The
386SX really needs *completely* different code gen rules than the DX or 486.
I just benchmarked the AAD performance versus alternatives on a 386SLC (a
modestly cached 386SX variation IBM produced) and the AAD is a visible
loser. Using the AAM is a win over the existing code, though.
I pondered the extended precision divide routine that's being called
in this loop and with a little underhanded treachery managed to eliminate a
push/pop of ESI from it by recycling EBP to address its parms once the frame
was no longer needed. This is an awesome trick when the code is simple enough
that you can get away with doing it and don't need to reference a bunch of
parameters. In this case it only needed two, so you can peel the first off
using the normal EBP, then peel the second directly into EBP itself (which
destroys its usability to address the frame of course, but you've already
got what you wanted and don't care at that point<g>). Then use EBP as you
might use ESI from then on and eliminate the save/restore of ESI. It's
sneaky, but it works.
For a modest number of EBP references (less than 6 or so) the occasional
instruction plumping necessary with implied 0 displacements is well offset
by the elimination of sloshing a couple of dwords on and off the stack to
save a register. Push/pop of dwords really hurts the SX.