2008-12-31 16:03:45

by Siarhei Siamashka

[permalink] [raw]
Subject: [PATCH/RFC] SIMD optimizations for SBC encoder analysis filter

Hello all,

This is a preliminary preview of SIMD optimizations for SBC encoder analysis filter.

It already contains MMX optimization for 4 subbands case (yes, all this insane
amount of extra lines of code finally starts to pay off) ;)

Important notice: in order to test MMX optimizations, you need to have
extra '-mmmx' command line option passed to gcc. Runtime MMX autodetection
can be easily added later. Also don't forget to pass -s4 option to sbcenc
because 8 subbands case is still not accelerated. By the way, SSE2 is twice
wider than MMX and should be a lot faster. Though MMX is supported on
virtually every x86 cpu that is in use nowadays and can be considered "lowest
common denominator".

My quick benchmark showed that the performance gets improved about ~10%
overall (and about twice better for the analysis filter function alone) when
compared with bluez-4.23 release which had the old buggy code. Improvement is
much more noticeable over the release 4.25 which contains a new fixed and
mostly nonoptimized filter.

So now the performance is better than ever. And I guess, all the platforms
should use SIMD optimizations nowadays, so they should gain performance
improvements too. Those 'anamatrix' style optimizations in older code feel
so much like the previous century ;)

I'm going to primarily focus on NEON and maybe ARMv6 SIMD optimizations,
these will be submitted a bit later. Also, as I have already written before,
the other parts of code are quite inefficient too and can be optimized. There
are still lots of things to improve.


But right now I would like to hear some opinions about the following things
regarding the attached patch:

The first question is about the use of extra source file for SIMD
optimizations and introduction of 'sbc_encoder_init_simd_optimized_analyze'
function to the global namespace. The rationale for that is the intention to
stop adding changes to 'sbc.c' (otherwise it will become bloated pretty soon
with the addition of multiple optimizations for various platforms). If anyone
has a better idea, I'm very much interested to hear it.

And if the addition of a new source file gets approved, I wonder about what
text should go to the copyright header?

Now we have two "reference" C implementations of analysis filter. Is it OK to
keep both? Or only SIMD-friendly one should remain in the end?

PS. Happy New Year

Best regards,
Siarhei Siamashka


Attachments:
(No filename) (2.35 kB)
preview-0002-SIMD-optimizations-for-SBC-encoder-analysis-filter.patch (24.94 kB)
Download all attachments

2008-12-31 20:55:24

by Luiz Augusto von Dentz

[permalink] [raw]
Subject: Re: [PATCH/RFC] SIMD optimizations for SBC encoder analysis filter

I wonder why don't we use liboil
(http://liboil.freedesktop.org/wiki/). Since we can't keep
implementing, or don't want to, optimization code for each instruction
extension around. Liboil detects which implementation is faster at
runtime and there are many other codec implementations that depend on
it, it actually makes a lot of sense to gstream and PulseAudio which
already uses liboil. I know that means adding another dependency to
BlueZ, or perhaps it is time to make libsbc a real library?


--
Luiz Augusto von Dentz
Engenheiro de Computa??o

2009-01-15 23:29:12

by Marcel Holtmann

[permalink] [raw]
Subject: Re: [PATCH/RFC] SIMD optimizations for SBC encoder analysis filter

Hi Siarhei,

> > > > The attached patch contains what I would consider to be a final
> > > > variant. MMX support is now complete. It works for both x86 and amd64,
> > > > has runtime autodetection of MMX availability, supports 4 and 8
> > > > subbands cases. I also ensured that only original MMX instructions are
> > > > used (and no SSE or other later additions), so the code should work
> > > > fine even on the old Pentium1 MMX. New MMX optimized functions produce
> > > > bit identical results when compared with bluez-4.25 release.
> > > >
> > > > With this patch applied, new filtering functions are noticeably faster
> > > > than than the old ones on x86 (so now they are both faster and have
> > > > better quality). Assembly optimizations for the other platforms can be
> > > > easily added too.
> > >
> > > can you re-base your patch against the latest tree and re-send the
> > > patch.
> >
> > Yes, I will submit an updated SIMD optimizations patchset in a few days.
>
> Updated patches are attached.
>
> Performance improvement when testing with big buck bunny soundtrack varies
> somewhere between 1.4x (4 subbands, MMX analysis filter, Intel Core2 CPU) and
> 2x factor (8 subbands, NEON analysis filter, ARM Cortex-A8 CPU). But these
> numbers are for default bitpool settings (32) and no joint stereo, this
> configuration is quite sensitive to analysis filter performance.
>
> SIMD optimized code provides exactly the same output as C version.
>
> But even with this optimization done, there are still a lot more things
> to improve. I'm going to improve input data permutation/endian
> conversion/channels deinterleaving next. Also scalefactors processing
> can be vectorized. Audio quality can be still improved by tweaking
> constant tables.

patch has been applied. Thanks.

Regards

Marcel



2009-01-15 19:34:02

by Siarhei Siamashka

[permalink] [raw]
Subject: Re: [PATCH/RFC] SIMD optimizations for SBC encoder analysis filter

On Friday 09 January 2009 18:50:54 ext Siarhei Siamashka wrote:
> On Tuesday 06 January 2009 04:49:06 ext Marcel Holtmann wrote:
> > > The attached patch contains what I would consider to be a final
> > > variant. MMX support is now complete. It works for both x86 and amd64,
> > > has runtime autodetection of MMX availability, supports 4 and 8
> > > subbands cases. I also ensured that only original MMX instructions are
> > > used (and no SSE or other later additions), so the code should work
> > > fine even on the old Pentium1 MMX. New MMX optimized functions produce
> > > bit identical results when compared with bluez-4.25 release.
> > >
> > > With this patch applied, new filtering functions are noticeably faster
> > > than than the old ones on x86 (so now they are both faster and have
> > > better quality). Assembly optimizations for the other platforms can be
> > > easily added too.
> >
> > can you re-base your patch against the latest tree and re-send the
> > patch.
>
> Yes, I will submit an updated SIMD optimizations patchset in a few days.

Updated patches are attached.

Performance improvement when testing with big buck bunny soundtrack varies
somewhere between 1.4x (4 subbands, MMX analysis filter, Intel Core2 CPU) and
2x factor (8 subbands, NEON analysis filter, ARM Cortex-A8 CPU). But these
numbers are for default bitpool settings (32) and no joint stereo, this
configuration is quite sensitive to analysis filter performance.

SIMD optimized code provides exactly the same output as C version.

But even with this optimization done, there are still a lot more things
to improve. I'm going to improve input data permutation/endian
conversion/channels deinterleaving next. Also scalefactors processing
can be vectorized. Audio quality can be still improved by tweaking
constant tables.


Best regards,
Siarhei Siamashka


Attachments:
(No filename) (1.81 kB)
0001-SIMD-friendly-variant-of-SBC-encoder-analysis-filter.patch (34.64 kB)
0002-SBC-arrays-and-constant-tables-aligned-at-16-byte-bo.patch (5.13 kB)
0003-MMX-and-ARM-NEON-optimized-versions-of-analysis-filt.patch (24.99 kB)
Download all attachments

2009-01-09 16:50:54

by Siarhei Siamashka

[permalink] [raw]
Subject: Re: [PATCH/RFC] SIMD optimizations for SBC encoder analysis filter

On Tuesday 06 January 2009 04:49:06 ext Marcel Holtmann wrote:
> > The attached patch contains what I would consider to be a final variant.
> > MMX support is now complete. It works for both x86 and amd64, has runtime
> > autodetection of MMX availability, supports 4 and 8 subbands cases. I
> > also ensured that only original MMX instructions are used (and no SSE or
> > other later additions), so the code should work fine even on the old
> > Pentium1 MMX. New MMX optimized functions produce bit identical results
> > when compared with bluez-4.25 release.
> >
> > With this patch applied, new filtering functions are noticeably faster
> > than than the old ones on x86 (so now they are both faster and have
> > better quality). Assembly optimizations for the other platforms can be
> > easily added too.
>
> can you re-base your patch against the latest tree and re-send the
> patch.

Yes, I will submit an updated SIMD optimizations patchset in a few days.


Best regards,
Siarhei Siamashka

2009-01-07 09:31:25

by Siarhei Siamashka

[permalink] [raw]
Subject: Re: [PATCH/RFC] SIMD optimizations for SBC encoder analysis filter

On Tuesday 06 January 2009 07:45:01 ext Marcel Holtmann wrote:
> Hi Christian,
>
> > > Do we still need the high precession stuff. I wanna cut down the number
> > > of ifdefs in the code as much as possible.
> >
> > Yes, because provides better audio quality.
>
> okay, but we have to make a choice in what we want. We can't just have a
> lots of ifdefs around. They will be killing us eventually. It is a
> nightmare from a release engineering perspective.

That's a single ifdef, which was added for testing purposes. The analysis
filter code itself is flexible enough to work in both configurations as the
shift constants depend on the use of 'sizeof' operator. The original floating
point constants are also wrapped into macros, which expand to the needed
fixed point data type automagically.

And as it was discussed before, It is possible to have both fast and high
precision implementations compiled in at the same time. Something like having:

sbc_analysis_filter_template.h - with the tables and implementation of
analysis function as a static inline function, with a custom preprocessor
managed suffix for its name

And 'sbc_analysis_filter.c' having code like this:

#define SBC_HIGH_PRECISION
#define SB_ANALYSIS_FUNCTION_SUFFIX _hq
#include "sbc_analysis_filter_template.h"

#undef SBC_HIGH_PRECISION
#undef SB_ANALYSIS_FUNCTION_SUFFIX
#define SB_ANALYSIS_FUNCTION_SUFFIX _fast
#include "sbc_analysis_filter_template.h"

This double include will instantiate both implementations from the same
template. Or something like this. It does not increase source code size
much.

> What is the downside for doing high precession only?

Performance is a lot better for 16-bit fixed point version because it can
benefit from DSP/multimedia instruction set extensions of modern processors.
A performance difference can be seen when benchmarking MMX enabled vs.
high precision build. The relative difference will get even bigger after
optimizing other parts of SBC encoder.

--
Best regards,
Siarhei Siamashka

2009-01-06 05:45:01

by Marcel Holtmann

[permalink] [raw]
Subject: RE: [PATCH/RFC] SIMD optimizations for SBC encoder analysis filter

Hi Christian,

> > Do we still need the high precession stuff. I wanna cut down the number
> > of ifdefs in the code as much as possible.
>
> Yes, because provides better audio quality.

okay, but we have to make a choice in what we want. We can't just have a
lots of ifdefs around. They will be killing us eventually. It is a
nightmare from a release engineering perspective.

What is the downside for doing high precession only?

Regards

Marcel



2009-01-06 05:27:37

by Christian Hoene

[permalink] [raw]
Subject: RE: [PATCH/RFC] SIMD optimizations for SBC encoder analysis filter

> Do we still need the high precession stuff. I wanna cut down the number
> of ifdefs in the code as much as possible.

Yes, because provides better audio quality.

Greetings

Christian

2009-01-06 02:50:26

by Marcel Holtmann

[permalink] [raw]
Subject: Re: [PATCH/RFC] SIMD optimizations for SBC encoder analysis filter

Hi Luiz,

> I wonder why don't we use liboil
> (http://liboil.freedesktop.org/wiki/). Since we can't keep
> implementing, or don't want to, optimization code for each instruction
> extension around. Liboil detects which implementation is faster at
> runtime and there are many other codec implementations that depend on
> it, it actually makes a lot of sense to gstream and PulseAudio which
> already uses liboil. I know that means adding another dependency to
> BlueZ, or perhaps it is time to make libsbc a real library?

let me stop the discussion about liboil right now. I don't see any big
advantage for BlueZ right now. This might change at some point in the
future, but right now, we will not base around liboil.

Regards

Marcel



2009-01-06 02:49:06

by Marcel Holtmann

[permalink] [raw]
Subject: Re: [PATCH/RFC] SIMD optimizations for SBC encoder analysis filter

Hi Siarhei,

> > > This is a preliminary preview of SIMD optimizations for SBC encoder
> > > analysis filter.
> > >
> > > It already contains MMX optimization for 4 subbands case (yes, all this
> > > insane amount of extra lines of code finally starts to pay off) ;)
> > >
> > > Important notice: in order to test MMX optimizations, you need to have
> > > extra '-mmmx' command line option passed to gcc. Runtime MMX
> > > autodetection can be easily added later. Also don't forget to pass -s4
> > > option to sbcenc because 8 subbands case is still not accelerated. By the
> > > way, SSE2 is twice wider than MMX and should be a lot faster. Though MMX
> > > is supported on virtually every x86 cpu that is in use nowadays and can
> > > be considered "lowest common denominator".
> > >
> > > My quick benchmark showed that the performance gets improved about ~10%
> > > overall (and about twice better for the analysis filter function alone)
> > > when compared with bluez-4.23 release which had the old buggy code.
> > > Improvement is much more noticeable over the release 4.25 which contains
> > > a new fixed and mostly nonoptimized filter.
> > >
> > > So now the performance is better than ever. And I guess, all the
> > > platforms should use SIMD optimizations nowadays, so they should gain
> > > performance improvements too. Those 'anamatrix' style optimizations in
> > > older code feel so much like the previous century ;)
> > >
> > > I'm going to primarily focus on NEON and maybe ARMv6 SIMD optimizations,
> > > these will be submitted a bit later. Also, as I have already written
> > > before, the other parts of code are quite inefficient too and can be
> > > optimized. There are still lots of things to improve.
> > >
> > >
> > > But right now I would like to hear some opinions about the following
> > > things regarding the attached patch:
> > >
> > > The first question is about the use of extra source file for SIMD
> > > optimizations and introduction of
> > > 'sbc_encoder_init_simd_optimized_analyze' function to the global
> > > namespace. The rationale for that is the intention to stop adding changes
> > > to 'sbc.c' (otherwise it will become bloated pretty soon with the
> > > addition of multiple optimizations for various platforms). If anyone has
> > > a better idea, I'm very much interested to hear it.
> > >
> > > And if the addition of a new source file gets approved, I wonder about
> > > what text should go to the copyright header?
> > >
> > > Now we have two "reference" C implementations of analysis filter. Is it
> > > OK to keep both? Or only SIMD-friendly one should remain in the end?
> >
> > I am fine with keeping both, but if one is just not useful, we are going
> > to remove it.
>
> The only problem with SIMD-friendly code is that it uses two tables instead of
> one (that's a sacrifice for the nice and symmetric code layout which fits SIMD
> instructions of modern processors quite well). It may be somewhat less
> optimal for the legacy processors without SIMD capabilities.
>
> I wonder what CPU architectures are the most important for bluez?
>
> > Also two separate files are fine for me. Personally I prefer a runtime
> > selection since compile time options are always painful
> > to test before making the release.
>
> The attached patch contains what I would consider to be a final variant. MMX
> support is now complete. It works for both x86 and amd64, has runtime
> autodetection of MMX availability, supports 4 and 8 subbands cases. I also
> ensured that only original MMX instructions are used (and no SSE or other
> later additions), so the code should work fine even on the old Pentium1 MMX.
> New MMX optimized functions produce bit identical results when compared
> with bluez-4.25 release.
>
> With this patch applied, new filtering functions are noticeably faster than
> than the old ones on x86 (so now they are both faster and have better
> quality). Assembly optimizations for the other platforms can be easily
> added too.

can you re-base your patch against the latest tree and re-send the
patch.

Do we still need the high precession stuff. I wanna cut down the number
of ifdefs in the code as much as possible.

Regards

Marcel



2009-01-05 11:08:39

by Simon Pickering

[permalink] [raw]
Subject: RE: [PATCH/RFC] SIMD optimizations for SBC encoder analysis filter


> > The ideal is to have portable devices mitigate this with dsp
hardware,
> > but we can't count on the hardware or the driver to be there in all
> > cases. (see https://garage.maemo.org/projects/dsp-sbc/ for some work
> > using the TI dsp)
>
> Yes, I know about this project. And its maintainer should be
> subscribed here
> too :)

He is now :)

> Making code a bit more C55x friendly is not difficult at all
> and can be surely
> done.

I'm catching up on the flood of patches and will have a look at this
soon,

Cheers,


Simon


2009-01-05 08:57:03

by Siarhei Siamashka

[permalink] [raw]
Subject: Re: [PATCH/RFC] SIMD optimizations for SBC encoder analysis filter

On Friday 02 January 2009 18:07:17 ext Siarhei Siamashka wrote:
> On Thursday 01 January 2009 10:58:03 ext Marcel Holtmann wrote:
[...]
> > > But right now I would like to hear some opinions about the following
> > > things regarding the attached patch:
> > >
> > > The first question is about the use of extra source file for SIMD
> > > optimizations and introduction of
> > > 'sbc_encoder_init_simd_optimized_analyze' function to the global
> > > namespace. The rationale for that is the intention to stop adding
> > > changes to 'sbc.c' (otherwise it will become bloated pretty soon with
> > > the addition of multiple optimizations for various platforms). If
> > > anyone has a better idea, I'm very much interested to hear it.
> > >
> > > And if the addition of a new source file gets approved, I wonder about
> > > what text should go to the copyright header?
> > >
> > > Now we have two "reference" C implementations of analysis filter. Is it
> > > OK to keep both? Or only SIMD-friendly one should remain in the end?
> >
> > I am fine with keeping both, but if one is just not useful, we are going
> > to remove it.
>
> The only problem with SIMD-friendly code is that it uses two tables instead
> of one (that's a sacrifice for the nice and symmetric code layout which
> fits SIMD instructions of modern processors quite well). It may be somewhat
> less optimal for the legacy processors without SIMD capabilities.
>
> I wonder what CPU architectures are the most important for bluez?
>
> > Also two separate files are fine for me. Personally I prefer a runtime
> > selection since compile time options are always painful
> > to test before making the release.
>
> The attached patch contains what I would consider to be a final variant.
> MMX support is now complete. It works for both x86 and amd64, has runtime
> autodetection of MMX availability, supports 4 and 8 subbands cases. I also
> ensured that only original MMX instructions are used (and no SSE or other
> later additions), so the code should work fine even on the old Pentium1
> MMX. New MMX optimized functions produce bit identical results when
> compared with bluez-4.25 release.
>
> With this patch applied, new filtering functions are noticeably faster than
> than the old ones on x86 (so now they are both faster and have better
> quality). Assembly optimizations for the other platforms can be easily
> added too.
>
> > For the copyright header it is pretty simple. We copy the current header
> > and then later on I will add the appropriate Nokia copyright to it. So
> > don't worry about that part, I take care of that.
>
> OK, thanks

I understand that it is too early to ping you regarding the status of the
patch :)

But it would be nice if all the SBC encoder optimizations that are relatively
easy to implement got done and integrated fast (keeping the encoder output
bit identical to that of version 4.25 for now)

After the second thought, I propose the following source files layout:
sbc_dsplib.c, sbc_dsplib.h - contains reference C code for the supplementary
helper functions which can be used in SBC encoder/decoder and can be
efficiently SIMD/assembly optimized
sbc_dsplib_mmx.c - x86 MMX optimizations
sbc_dsplib_sse2.c - x86 SSE2 optimizations
sbc_dsplib_neon.c - ARM NEON optimizations
sbc_dsplib_armv6.c - ARMv6 optimizations
...

sbc_dsplib.c would also contain an initialization function, which sets up the
function pointers in 'sbc_encoder state' structure to the best available
implementations for the current platform.

The content of sbc_dsplib* files can be then considered for the future
submission into liboil if this is desired.

Would you prefer an updated patchset, which implements all of this stuff one
step at a time?

--
Best regards,
Siarhei Siamashka

2009-01-04 17:56:11

by Siarhei Siamashka

[permalink] [raw]
Subject: Re: [PATCH/RFC] SIMD optimizations for SBC encoder analysis filter

On Friday 02 January 2009 21:40:48 ext Luiz Augusto von Dentz wrote:
> Hi Siarhei,
>
> On Fri, Jan 2, 2009 at 1:33 PM, Siarhei Siamashka
>
> <[email protected]> wrote:
> > On Wednesday 31 December 2008 22:55:24 ext Luiz Augusto von Dentz wrote:
> >> I wonder why don't we use liboil
> >> (http://liboil.freedesktop.org/wiki/).
> >
> > Can you clarify your proposal a bit? Which functions/implementations from
> > liboil do you suggest for use in bluez sbc?
>
> Liboil stands to optimized inner loops, that exactly what we need,
> transforming the whole code will, already is, depend on each simd
> extention to be implemented.
>
> What we basically do is multiply and
> accumulate arrays, what could be done with:
> http://liboil.freedesktop.org/documentation/liboil-liboilfuncs-math.html#oi
>l-multsum-f32

Right now from what I see, we need SIMD optimized versions of:
- analysis filter
- channels deinterleaving with optional endian conversion
- scalefactors processing
- joining channels
- maybe quantization

Liboil does not seem to directly provide any of these (I really looked through
all of it, but could of course miss something). Your example is not very good
and does not clarify anything, because it is even a floating point function.

Let's take the SBC analysis filter as an example. It's a function, which reads
data from the samples buffer, constants buffer and writes some results in the
output buffer. We want all the operations inside of it to be done with
registers only, avoiding any intermediate stores to memory. The arrays t1[8]
and t2[8] are supposed to be mapped directly on the registers. If you try to
implement analysis function using liboil 'inner loop' functions, the resulting
performance would be simply horrible. If you don't trust me, just have a look
at some more stuff from liboil such as DCT functions. The analysis filter from
SBC falls exactly into the same category.

The other functions that need to be done and that I have mentioned above are
also the same.

Moreover, the arrays which SBC operates on are rather small. That's why
special care needs to be taken about proper loops unrolling, alignment and
the other stuff in order not to have any unneeded overhead.

> > Or do you suggest to submit the sbc analysis filter function to liboil,
> > add it as sbc dependency and hope that somebody would translate the code
> > to the instruction sets of other architectures? Will it turn out to be
> > beneficial? IMHO It may easily become just an unnecessary burden and
> > wasted effort too.
>
> What about if there is any other codec that might benefit from the
> code we are producing, Im not talking about the whole sbc analysis
> filter but the inner loops.

Than it is good for these other codecs :) They will be able to take some
code from SBC (either directly, or via liboil library if it gets to suck in
the stuff from bluez like it did with some other samples of optimized code).

> Also read careful what liboil does, there is a whole instruction set
> detection/benchmark system very similar to what you have proposed for
> choosing implementation in runtime.

The detection of MMX needs only a dozen of lines of trivial code (checking
EFLAGS and CPUID). Adding a big library as a dependency just for a few lines
of code is kind of overkill.

In addition, by spending 15 minutes on writing and testing this trivial code
using just an Architecture Software Developer's Manual from Intel, I avoid
all the hassle of making sure that I don't violate the licenses or copyrights
of somebody else :)

By the way, I had a look and didn't quite like the way liboil does this CPU
capability check. Instead of checking EFLAGS first, it tries to execute CPUID
directly and has the code to catch SIGILL. I'm not sure if it is a good idea
to mess with the signals from a *library*.

--
Best regards,
Siarhei Siamashka

2009-01-02 19:40:48

by Luiz Augusto von Dentz

[permalink] [raw]
Subject: Re: [PATCH/RFC] SIMD optimizations for SBC encoder analysis filter

Hi Siarhei,

On Fri, Jan 2, 2009 at 1:33 PM, Siarhei Siamashka
<[email protected]> wrote:
> On Wednesday 31 December 2008 22:55:24 ext Luiz Augusto von Dentz wrote:
>> I wonder why don't we use liboil
>> (http://liboil.freedesktop.org/wiki/).
>
> Can you clarify your proposal a bit? Which functions/implementations from
> liboil do you suggest for use in bluez sbc?

Liboil stands to optimized inner loops, that exactly what we need,
transforming the whole code will, already is, depend on each simd
extention to be implemented. What we basically do is multiply and
accumulate arrays, what could be done with:
http://liboil.freedesktop.org/documentation/liboil-liboilfuncs-math.html#oil-multsum-f32


> Or do you suggest to submit the sbc analysis filter function to liboil, add it
> as sbc dependency and hope that somebody would translate the code to the
> instruction sets of other architectures? Will it turn out to be beneficial?
> IMHO It may easily become just an unnecessary burden and wasted effort too.

What about if there is any other codec that might benefit from the
code we are producing, Im not talking about the whole sbc analysis
filter but the inner loops.

Also read careful what liboil does, there is a whole instruction set
detection/benchmark system very similar to what you have proposed for
choosing implementation in runtime.


--
Luiz Augusto von Dentz
Engenheiro de Computa??o

2009-01-02 18:03:53

by Brad Midgley

[permalink] [raw]
Subject: Re: [PATCH/RFC] SIMD optimizations for SBC encoder analysis filter

Siarhei

> other hand, if we sacrifice performance let's say for MIPS when adding
> some of the changes beneficial for other platforms, will it be considered an
> important regression?

I think it would be best to consider these case-by-case.

btw, I have a mips-based access point I will try things out on to
assess it. (It's a first-rev asus wl-500gp)

Brad Midgley

2009-01-02 17:11:20

by Siarhei Siamashka

[permalink] [raw]
Subject: Re: [PATCH/RFC] SIMD optimizations for SBC encoder analysis filter

On Friday 02 January 2009 18:27:33 ext Brad Midgley wrote:
> Siarhei
>
> > I wonder what CPU architectures are the most important for bluez?
>
> This is not an easy question, but one perspective is to consider the
> impact on battery life. Running sbc encoding on a phone will have a
> greater impact on battery life than it does to run it on a laptop.

I see. I'm mostly interested in ARM, so this one should be quite fine. On the
other hand, if we sacrifice performance let's say for MIPS when adding
some of the changes beneficial for other platforms, will it be considered an
important regression?

I also submitted MMX implementation first as this is the code which can be
hopefully tested by more people. Anyway, the most hard part was to transform
the code to be efficiently vectorizable (done by writing several additional
scripts which were used to find an optimal input data permutation and generate
the tables). After that, just converting C code to the appropriate MMX
instructions in gcc inline assembly took probably only ~1 day of working time,
including testing. Support for the other architectures should be quite easy
from this point (ARM will follow next).

> The ideal is to have portable devices mitigate this with dsp hardware,
> but we can't count on the hardware or the driver to be there in all
> cases. (see https://garage.maemo.org/projects/dsp-sbc/ for some work
> using the TI dsp)

Yes, I know about this project. And its maintainer should be subscribed here
too :)

Making code a bit more C55x friendly is not difficult at all and can be surely
done.

--
Best regards,
Siarhei Siamashka

2009-01-02 16:33:13

by Siarhei Siamashka

[permalink] [raw]
Subject: Re: [PATCH/RFC] SIMD optimizations for SBC encoder analysis filter

On Wednesday 31 December 2008 22:55:24 ext Luiz Augusto von Dentz wrote:
> I wonder why don't we use liboil
> (http://liboil.freedesktop.org/wiki/).

Can you clarify your proposal a bit? Which functions/implementations from
liboil do you suggest for use in bluez sbc?

> Since we can't keep implementing, or don't want to, optimization code for
> each instruction extension around.

Or do you suggest to submit the sbc analysis filter function to liboil, add it
as sbc dependency and hope that somebody would translate the code to the
instruction sets of other architectures? Will it turn out to be beneficial?
IMHO It may easily become just an unnecessary burden and wasted effort too.

> Liboil detects which implementation is
> faster at runtime and there are many other codec implementations that depend
> on it, it actually makes a lot of sense to gstream and PulseAudio which
> already uses liboil. I know that means adding another dependency to
> BlueZ, or perhaps it is time to make libsbc a real library?

I had a quick look at liboil and it did not impress me that much yet.


Best regards,
Siarhei Siamashka

2009-01-02 16:27:33

by Brad Midgley

[permalink] [raw]
Subject: Re: [PATCH/RFC] SIMD optimizations for SBC encoder analysis filter

Siarhei

> I wonder what CPU architectures are the most important for bluez?

This is not an easy question, but one perspective is to consider the
impact on battery life. Running sbc encoding on a phone will have a
greater impact on battery life than it does to run it on a laptop.

The ideal is to have portable devices mitigate this with dsp hardware,
but we can't count on the hardware or the driver to be there in all
cases. (see https://garage.maemo.org/projects/dsp-sbc/ for some work
using the TI dsp)

--
Brad Midgley

2009-01-02 16:07:17

by Siarhei Siamashka

[permalink] [raw]
Subject: Re: [PATCH/RFC] SIMD optimizations for SBC encoder analysis filter

On Thursday 01 January 2009 10:58:03 ext Marcel Holtmann wrote:
> Hi Siarhei,
>
> > This is a preliminary preview of SIMD optimizations for SBC encoder
> > analysis filter.
> >
> > It already contains MMX optimization for 4 subbands case (yes, all this
> > insane amount of extra lines of code finally starts to pay off) ;)
> >
> > Important notice: in order to test MMX optimizations, you need to have
> > extra '-mmmx' command line option passed to gcc. Runtime MMX
> > autodetection can be easily added later. Also don't forget to pass -s4
> > option to sbcenc because 8 subbands case is still not accelerated. By the
> > way, SSE2 is twice wider than MMX and should be a lot faster. Though MMX
> > is supported on virtually every x86 cpu that is in use nowadays and can
> > be considered "lowest common denominator".
> >
> > My quick benchmark showed that the performance gets improved about ~10%
> > overall (and about twice better for the analysis filter function alone)
> > when compared with bluez-4.23 release which had the old buggy code.
> > Improvement is much more noticeable over the release 4.25 which contains
> > a new fixed and mostly nonoptimized filter.
> >
> > So now the performance is better than ever. And I guess, all the
> > platforms should use SIMD optimizations nowadays, so they should gain
> > performance improvements too. Those 'anamatrix' style optimizations in
> > older code feel so much like the previous century ;)
> >
> > I'm going to primarily focus on NEON and maybe ARMv6 SIMD optimizations,
> > these will be submitted a bit later. Also, as I have already written
> > before, the other parts of code are quite inefficient too and can be
> > optimized. There are still lots of things to improve.
> >
> >
> > But right now I would like to hear some opinions about the following
> > things regarding the attached patch:
> >
> > The first question is about the use of extra source file for SIMD
> > optimizations and introduction of
> > 'sbc_encoder_init_simd_optimized_analyze' function to the global
> > namespace. The rationale for that is the intention to stop adding changes
> > to 'sbc.c' (otherwise it will become bloated pretty soon with the
> > addition of multiple optimizations for various platforms). If anyone has
> > a better idea, I'm very much interested to hear it.
> >
> > And if the addition of a new source file gets approved, I wonder about
> > what text should go to the copyright header?
> >
> > Now we have two "reference" C implementations of analysis filter. Is it
> > OK to keep both? Or only SIMD-friendly one should remain in the end?
>
> I am fine with keeping both, but if one is just not useful, we are going
> to remove it.

The only problem with SIMD-friendly code is that it uses two tables instead of
one (that's a sacrifice for the nice and symmetric code layout which fits SIMD
instructions of modern processors quite well). It may be somewhat less
optimal for the legacy processors without SIMD capabilities.

I wonder what CPU architectures are the most important for bluez?

> Also two separate files are fine for me. Personally I prefer a runtime
> selection since compile time options are always painful
> to test before making the release.

The attached patch contains what I would consider to be a final variant. MMX
support is now complete. It works for both x86 and amd64, has runtime
autodetection of MMX availability, supports 4 and 8 subbands cases. I also
ensured that only original MMX instructions are used (and no SSE or other
later additions), so the code should work fine even on the old Pentium1 MMX.
New MMX optimized functions produce bit identical results when compared
with bluez-4.25 release.

With this patch applied, new filtering functions are noticeably faster than
than the old ones on x86 (so now they are both faster and have better
quality). Assembly optimizations for the other platforms can be easily
added too.

> For the copyright header it is pretty simple. We copy the current header
> and then later on I will add the appropriate Nokia copyright to it. So
> don't worry about that part, I take care of that.

OK, thanks


Best regards,
Siarhei Siamashka


Attachments:
(No filename) (4.07 kB)
0001-SIMD-optimizations-for-SBC-encoder-analysis-filter.patch (32.38 kB)
Download all attachments

2009-01-01 08:58:03

by Marcel Holtmann

[permalink] [raw]
Subject: Re: [PATCH/RFC] SIMD optimizations for SBC encoder analysis filter

Hi Siarhei,

> This is a preliminary preview of SIMD optimizations for SBC encoder analysis filter.
>
> It already contains MMX optimization for 4 subbands case (yes, all this insane
> amount of extra lines of code finally starts to pay off) ;)
>
> Important notice: in order to test MMX optimizations, you need to have
> extra '-mmmx' command line option passed to gcc. Runtime MMX autodetection
> can be easily added later. Also don't forget to pass -s4 option to sbcenc
> because 8 subbands case is still not accelerated. By the way, SSE2 is twice
> wider than MMX and should be a lot faster. Though MMX is supported on
> virtually every x86 cpu that is in use nowadays and can be considered "lowest
> common denominator".
>
> My quick benchmark showed that the performance gets improved about ~10%
> overall (and about twice better for the analysis filter function alone) when
> compared with bluez-4.23 release which had the old buggy code. Improvement is
> much more noticeable over the release 4.25 which contains a new fixed and
> mostly nonoptimized filter.
>
> So now the performance is better than ever. And I guess, all the platforms
> should use SIMD optimizations nowadays, so they should gain performance
> improvements too. Those 'anamatrix' style optimizations in older code feel
> so much like the previous century ;)
>
> I'm going to primarily focus on NEON and maybe ARMv6 SIMD optimizations,
> these will be submitted a bit later. Also, as I have already written before,
> the other parts of code are quite inefficient too and can be optimized. There
> are still lots of things to improve.
>
>
> But right now I would like to hear some opinions about the following things
> regarding the attached patch:
>
> The first question is about the use of extra source file for SIMD
> optimizations and introduction of 'sbc_encoder_init_simd_optimized_analyze'
> function to the global namespace. The rationale for that is the intention to
> stop adding changes to 'sbc.c' (otherwise it will become bloated pretty soon
> with the addition of multiple optimizations for various platforms). If anyone
> has a better idea, I'm very much interested to hear it.
>
> And if the addition of a new source file gets approved, I wonder about what
> text should go to the copyright header?
>
> Now we have two "reference" C implementations of analysis filter. Is it OK to
> keep both? Or only SIMD-friendly one should remain in the end?

I am fine with keeping both, but if one is just not useful, we are going
to remove it. Also two separate files are fine for me. Personally I
prefer a runtime selection since compile time options are always painful
to test before making the release.

For the copyright header it is pretty simple. We copy the current header
and then later on I will add the appropriate Nokia copyright to it. So
don't worry about that part, I take care of that.

Regards

Marcel