Hello all,
The attached patch quite noticeably minimizes rounding errors and improves
audio quality.
I decided to drop non-SIMD variant because it would require quite a bit of
work to update for better precision. Most of the CPU cores which are
relevant nowadays have support for some kind of SIMD extension anyway.
I will also do ARMv6 SIMD version of the analysis filter after all the high
level SBC optimizations are in place.
Audio quality estimation done with tiny_psnr (lower stddev value is better):
=== before patch (4 subbands) ===
./sbc_encode_test.rb BigBuckBunny-stereo.flac
[2, 48000]
["-j -s4 -B16 -b128", "-p -j -l16 -n4 -r1584000"]
--- comparing original / sbcenc + sbcdec ---
stddev: 3.58 PSNR: 85.23 bytes:114519660/114520000
--- comparing original / sbcenc + sbc_decoder.exe ---
stddev: 1.70 PSNR: 91.71 bytes:114519660/114520000
--- comparing original / sbc_encoder.exe + sbc_decoder.exe ---
stddev: 1.44 PSNR: 93.09 bytes:114519660/114520000
--- comparing sbcenc + sbc_decoder.exe / sbc_encoder.exe + sbc_decoder.exe
stddev: 0.99 PSNR: 96.36 bytes:114519808/114519808
=== after patch (4 subbands) ===
./sbc_encode_test.rb BigBuckBunny-stereo.flac
[2, 48000]
["-j -s4 -B16 -b128", "-p -j -l16 -n4 -r1584000"]
--- comparing original / sbcenc + sbcdec ---
stddev: 3.55 PSNR: 85.31 bytes:114519660/114520000
--- comparing original / sbcenc + sbc_decoder.exe ---
stddev: 1.62 PSNR: 92.09 bytes:114519660/114520000
--- comparing original / sbc_encoder.exe + sbc_decoder.exe ---
stddev: 1.44 PSNR: 93.09 bytes:114519660/114520000
--- comparing sbcenc + sbc_decoder.exe / sbc_encoder.exe + sbc_decoder.exe
stddev: 0.77 PSNR: 98.57 bytes:114519808/114519808
=== before patch (8 subbands) ===
./sbc_encode_test.rb BigBuckBunny-stereo.flac
[2, 48000]
["-j -s8 -B16 -b255", "-p -j -l16 -n8 -r1569000"]
--- comparing original / sbcenc + sbcdec ---
stddev: 4.85 PSNR: 82.60 bytes:114519260/114520000
--- comparing original / sbcenc + sbc_decoder.exe ---
stddev: 2.07 PSNR: 89.98 bytes:114519260/114520000
--- comparing original / sbc_encoder.exe + sbc_decoder.exe ---
stddev: 1.09 PSNR: 95.56 bytes:114519260/114520000
--- comparing sbcenc + sbc_decoder.exe / sbc_encoder.exe + sbc_decoder.exe
stddev: 1.77 PSNR: 91.34 bytes:114519552/114519552
=== after patch (8 subbands) ===
./sbc_encode_test.rb BigBuckBunny-stereo.flac
[2, 48000]
["-j -s8 -B16 -b255", "-p -j -l16 -n8 -r1569000"]
--- comparing original / sbcenc + sbcdec ---
stddev: 4.55 PSNR: 83.16 bytes:114519260/114520000
--- comparing original / sbcenc + sbc_decoder.exe ---
stddev: 1.28 PSNR: 94.11 bytes:114519260/114520000
--- comparing original / sbc_encoder.exe + sbc_decoder.exe ---
stddev: 1.09 PSNR: 95.56 bytes:114519260/114520000
--- comparing sbcenc + sbc_decoder.exe / sbc_encoder.exe + sbc_decoder.exe
stddev: 0.73 PSNR: 98.96 bytes:114519552/114519552
===
So for 4 subbands encode, stddev is down from 1.70 to 1.62 (1.44 for the
reference encoder). For 8 subbands encode stddev is down from 2.07 to 1.28
(1.09 for the reference encoder).
It is very interesting to see what a more advanced PEAQ test will show.
Best regards,
Siarhei Siamashka
Hi,
On Jan 22, 2009, at 1:11, Siarhei Siamashka wrote:
> The attached patch quite noticeably minimizes rounding errors and
> improves
> audio quality.
>
> I decided to drop non-SIMD variant because it would require quite a
> bit of
> work to update for better precision. Most of the CPU cores which are
> relevant nowadays have support for some kind of SIMD extension anyway.
This patch is now also pushed upstream. Thanks.
Johan
On Thursday 22 January 2009 13:58:57 ext Christian Hoene wrote:
> Hello Siarhei,
>
> > Hello all,
> >
> > The attached patch quite noticeably minimizes rounding errors and
> > improves audio quality.
> >
> >
> > It is very interesting to see what a more advanced PEAQ test will show.
>
> The PEAQ results for latest version and the latest plus your latest patch
> can be found
> http://net.cs.uni-tuebingen.de/html/nexgenvoip/ in latest and latest+patch.
>
> Congratulations, the encoder is perfect now. Sometimes even better than the
> reference!
Thanks. The results have really exceeded my expectations. Looks like the
precision loss on rounding is now really insignificant so that the rounding
errors are now smaller than the sensitivity of PEAQ method. My guess is that
very minor differences in results in both directions are just some kind of
random deviation and can't be clearly interpreted as an advantage of either
implementation.
So appears that the perceived quality should be really good now (PSNR rating
is a bit worse than reference, but it is not an objective way to measure
audio quality). Looks like there is even no need to introduce a high precision
configuration option for enabling 32-bit fixed point implementation in
practice. It makes everything a bit easier :)
--
Best regards,
Siarhei Siamashka
On Thursday 22 January 2009 15:36:36 ext Luiz Augusto von Dentz wrote:
> Hi Siarhei,
>
> > I decided to drop non-SIMD variant because it would require quite a bit
> > of work to update for better precision. Most of the CPU cores which are
> > relevant nowadays have support for some kind of SIMD extension anyway. I
> > will also do ARMv6 SIMD version of the analysis filter after all the high
> > level SBC optimizations are in place.
>
> Perhaps we can just disable it, since it is probably useful to
> maintain a version in C as a reference code just in case someone want
> to do its own optimizations in the future.
Right now there are two reference C versions:
1. "simple" one which uses smaller constant tables and may be modified not to
require any input data reordering (actually it reverses the order of audio
samples, but this can be avoided).
2. "simd-friendly" one with larger data tables and it also has to reorder
input data in all cases.
Extra size for constant tables is not an issue because a good optimizing
compiler should be able to optimize the constants pool. Let's consider the
following simplified example:
/*************************/
const short table1[4] = { 0x1234, 0x4321, 0x0000, 0x1234 };
const short table2[4] = { 0x4321, 0x1234, 0x1234, 0x0000 };
static inline int dotproduct(const short *x, const short *y)
{
return x[0] * y[0] + x[1] * y[1] + x[2] * y[2] + x[3] * y[3];
}
int f(const short *in, int *out)
{
out[0] = dotproduct(in + 0, table1);
out[1] = dotproduct(in + 4, table2);
}
/*************************/
It compiles into the following code for x86 (gcc 4.3.2):
00000000 <f>:
0: 53 push %ebx
1: 8b 4c 24 08 mov 0x8(%esp),%ecx
5: 8b 5c 24 0c mov 0xc(%esp),%ebx
9: 0f bf 51 02 movswl 0x2(%ecx),%edx
d: 0f bf 41 06 movswl 0x6(%ecx),%eax
11: 69 d2 21 43 00 00 imul $0x4321,%edx,%edx
17: 69 c0 34 12 00 00 imul $0x1234,%eax,%eax
1d: 01 c2 add %eax,%edx
1f: 0f bf 01 movswl (%ecx),%eax
22: 69 c0 34 12 00 00 imul $0x1234,%eax,%eax
28: 01 c2 add %eax,%edx
2a: 8d 41 08 lea 0x8(%ecx),%eax
2d: 89 13 mov %edx,(%ebx)
2f: 0f bf 50 02 movswl 0x2(%eax),%edx
33: 0f bf 40 04 movswl 0x4(%eax),%eax
37: 01 d0 add %edx,%eax
39: 0f bf 51 08 movswl 0x8(%ecx),%edx
3d: 69 c0 34 12 00 00 imul $0x1234,%eax,%eax
43: 69 d2 21 43 00 00 imul $0x4321,%edx,%edx
49: 01 d0 add %edx,%eax
4b: 89 43 04 mov %eax,0x4(%ebx)
4e: 5b pop %ebx
4f: c3 ret
The compiler did not use any tables at all, but emitted all the constants as
immediate operands for instructions. Also it eliminated all the
multiplications with zero constants (so we have only 6 IMUL instructions in
the code). So gcc seems to be clever enough to optimize this code well.
On ARM the generated code is the following (gcc 4.2.1, -mcpu=arm926ej-s):
00000000 <f>:
0: e92d41f0 push {r4, r5, r6, r7, r8, lr}
4: e59fc040 ldr ip, [pc, #64] ; 4c <table2+0x44>
8: e2808008 add r8, r0, #8 ; 0x8
c: e59f703c ldr r7, [pc, #60] ; 50 <table2+0x48>
10: e1d030b2 ldrh r3, [r0, #2]
14: e1d820b2 ldrh r2, [r8, #2]
18: e1d0e0f0 ldrsh lr, [r0]
1c: e1d050f8 ldrsh r5, [r0, #8]
20: e1630783 smulbb r3, r3, r7
24: e1620c82 smulbb r2, r2, ip
28: e0263c9e mla r6, lr, ip, r3
2c: e0242795 mla r4, r5, r7, r2
30: e1d830f4 ldrsh r3, [r8, #4]
34: e1d020f6 ldrsh r2, [r0, #6]
38: e0204c93 mla r0, r3, ip, r4
3c: e02e6c92 mla lr, r2, ip, r6
40: e5810004 str r0, [r1, #4]
44: e581e000 str lr, [r1]
48: e8bd81f0 pop {r4, r5, r6, r7, r8, pc}
4c: 00001234 .word 0x00001234
50: 00004321 .word 0x00004321
Here the compiler reduced the tables to only 2 constants. It was also able to
eliminate multiplications by zero. Regarding 16-bit constants, it could use
only 2 fast 16-bit SMULBB instructions, performing the rest of multiplications
with a slower 32-bit MLA. So the compiler is not very clever about generating
optimal code, but it at least could perform some basic optimizations.
Of course, when handling a more complex code, the compiler may screw up
something and miss some optimization opportunities. But if it happens,
bugreport should be submitted to gcc. In any case, handwritten assembly is
still much better for such type of code at the moment, at least on ARM.
So the only reason to have "simple" C reference version are the potential
savings on input samples reordering. But it is probably not worth the efforts.
In addition, when having non-native byte order for input data, "simple"
version will gain nothing because processing and copying data will be still
unavoidable.
The more I think about it, the more I'm getting inclined to the idea that only
SIMD-style version of C reference code should be kept in order to have better
maintainability.
--
Best regards,
Siarhei Siamashka
Hi Siarhei,
> I decided to drop non-SIMD variant because it would require quite a bit of
> work to update for better precision. Most of the CPU cores which are
> relevant nowadays have support for some kind of SIMD extension anyway.
> I will also do ARMv6 SIMD version of the analysis filter after all the high
> level SBC optimizations are in place.
Perhaps we can just disable it, since it is probably useful to
maintain a version in C as a reference code just in case someone want
to do its own optimizations in the future.
--
Luiz Augusto von Dentz
Engenheiro de Computa??o
Hello Siarhei,
> Hello all,
>
> The attached patch quite noticeably minimizes rounding errors and improves
> audio quality.
>
> It is very interesting to see what a more advanced PEAQ test will show.
The PEAQ results for latest version and the latest plus your latest patch
can be found
http://net.cs.uni-tuebingen.de/html/nexgenvoip/ in latest and latest+patch.
Congratulations, the encoder is perfect now. Sometimes even better than the
reference!
Greetings
Christian
Hello all,
> Hello all,
>
> The attached patch quite noticeably minimizes rounding errors and improves
> audio quality.
>
> It is very interesting to see what a more advanced PEAQ test will show.
The PEAQ results for latest version and the latest plus your latest patch
can be found
http://net.cs.uni-tuebingen.de/html/nexgenvoip/ in latest and latest+patch.
I need to write a diff webpage...
Greetings
Christian