2008-12-11 20:27:06

by Siarhei Siamashka

[permalink] [raw]
Subject: [PATCH] sbc: bitstream packing optimization for encoder.

Hi All,

Appears that bitstream packing is also one of the most CPU hungry
operations in SBC encoder, which did not get proper attention yet.
Could somebody review this patch and provide feedback/comments?

Benchmarks were done by running the following script:

./sbcenc -s 8 big_buck_bunny_480p_mono.au > /dev/null
./sbcenc -s 8 big_buck_bunny_480p_stereo.au > /dev/null
./sbcenc -s 8 -j big_buck_bunny_480p_stereo.au > /dev/null
./sbcenc -s 4 big_buck_bunny_480p_mono.au > /dev/null
./sbcenc -s 4 big_buck_bunny_480p_stereo.au > /dev/null
./sbcenc -s 4 -j big_buck_bunny_480p_stereo.au > /dev/null
./sbcenc -b 128 -s 8 big_buck_bunny_480p_mono.au > /dev/null
./sbcenc -b 128 -s 8 big_buck_bunny_480p_stereo.au > /dev/null
./sbcenc -b 128 -s 8 -j big_buck_bunny_480p_stereo.au > /dev/null
./sbcenc -b 64 -s 4 big_buck_bunny_480p_mono.au > /dev/null
./sbcenc -b 64 -s 4 big_buck_bunny_480p_stereo.au > /dev/null
./sbcenc -b 64 -s 4 -j big_buck_bunny_480p_stereo.au > /dev/null

Times from ARM device (ARM11):

before patch:
real ? ?9m 42.98s
user ? ?8m 51.06s
sys ? ? 0m 39.50s

after patch:
real ? ?6m 35.55s
user ? ?5m 44.09s
sys ? ? 0m 40.43s

Times from x86 PC (Intel Core2):

before patch:
real ? ?0m49.437s
user ? ?0m45.267s
sys ? ? 0m3.708s

after patch:
real ? ?0m27.606s
user ? ?0m23.869s
sys ? ? 0m3.656s


Best regards,
Siarhei Siamashka


Attachments:
(No filename) (1.32 kB)
0001-sbc-bitstream-writing-optimization-for-encoder.patch (4.62 kB)
Download all attachments

2008-12-17 14:16:52

by Siarhei Siamashka

[permalink] [raw]
Subject: Re: [PATCH] sbc: bitstream packing optimization for encoder.

On Friday 12 December 2008 17:23:33 ext Christian Hoene wrote:
> Hi all,
>
> next week I will discuss the bug fixing of SBC with Jaska Uimonen, who is
> ill this week.
>
> The sound quality of the fix point implementation still remains below of
> the quality of the floating point version.

Has the sound quality improved with the latest patches?

By the way, which floating point version do you have in mind? Is it btsco,
which is referenced from this wiki page: http://wiki.bluez.org/wiki/SBC ?

I tried to have a look at btsco, but it seems to be much worse quality wise
than the fixed point version with an updated filtering function. I tried to
encode some audio sample, decode it back and compare result with the
original file in both cases.

Also I compared 32-bit fixed point vs. 16-bit fixed point, and looks like
16-bit fixed point version has a sufficient precision. Anyway, the values,
received from the filter, are quantized later and much more information
about lower order bits is usually lost at this stage.

--
Best regards,
Siarhei Siamashka

2008-12-16 21:46:38

by Siarhei Siamashka

[permalink] [raw]
Subject: Re: [PATCH] sbc: bitstream packing optimization for encoder.

On Friday 12 December 2008 03:39:34 ext Marcel Holtmann wrote:
> Hi Siarhei,
>
> > Appears that bitstream packing is also one of the most CPU hungry
> > operations in SBC encoder, which did not get proper attention yet.
> > Could somebody review this patch and provide feedback/comments?
>
> again thanks for working on this. I am actually willing to include your
> patches quickly in one of the next releases to get more wider testing
> audience.

Well, after doing a bit more testing, appears that a bug slipped in:
+ PUT_BITS(audio_sample, bits[ch][sb]);

There should be the following line instead:
+ PUT_BITS(audio_sample & levels[ch][sb], bits[ch][sb]);

Updated patch is attached. Any additional testing/feedback is very much
welcome.

Best regards
Siarhei Siamashka


Attachments:
(No filename) (782.00 B)
0001-sbc-bitstream-writing-optimization-for-encoder-try2.patch (4.64 kB)
Download all attachments

2008-12-12 16:07:28

by Siarhei Siamashka

[permalink] [raw]
Subject: Re: [PATCH] sbc: bitstream packing optimization for encoder.

On Friday 12 December 2008 13:04:09 ext Marcel Holtmann wrote:
[...]
> > But for little endian-systems and also to ensure endian neutrality, data
> > writes are done one byte at a time.
>
> Lets keep doing it byte by byte for now. However keep in mind that the
> input stream can also be in big endian even it is runs on a little
> machine. And vice-versa.

Yes I know, thanks for reminding :) This patch does not touch endian
conversion for the input data, so everything should be OK. Anyway, it's
better to do some real tests on big-endian systems just to be sure.

The endian conversion and channels deinterleaving part for the input data
is actually another performance bottleneck in SBC and can be also optimized.

--
Best regards,
Siarhei Siamashka

2008-12-12 16:01:01

by Siarhei Siamashka

[permalink] [raw]
Subject: Re: [PATCH] sbc: bitstream packing optimization for encoder.

On Friday 12 December 2008 17:30:47 ext Marcel Holtmann wrote:
> Hi Christian,
>
> can we please keep this mailing a top-posting free zone. I know it is
> hard for Outlock to get this right, but it is possible.
>
> > next week I will discuss the bug fixing of SBC with Jaska Uimonen, who is
> > ill this week.
> >
> > The sound quality of the fix point implementation still remains below of
> > the quality of the floating point version.
> >
> > Maybe, we shall support both depending on the performance requirements?

Well, it is quite natural that sound quality for 16 bit fixed point version
can't be better than the quality of the floating point one. But when
dealing with lossy compression, it is not always strictly necessary to
precisely match the output of the reference implementation, carefully
reproducing all the compression artefacts :)

More important is whether this implementation passes a standard SBC
conformance test. If it does, it would be good to have 16 bit fixed point
implementation.

Sound quality of Jaska's patch can be probably tweaked a bit by trying
different bias for coefficient values (not just clip values to 16 bit, but
also try different directions for rounding).

> I think we should focus a fixed point version. There should be no need
> for floating point at all. If fixed point isn't good enough, then we
> screwed it up.
>
> And in case of embedded devices we are seriously limited with floating
> point and doing that in software just doesn't work out. And this is
> mostly not directly a performance problem. It is more a power
> consumption problem. We don't wanna have A2DP drain the battery.

At least for ARM, performance of the possible polyphase filter implementations
can be approximately ranged in the following way (from fastest to slowest):
1. 16-bit*16-bit->32-bit integer multiplications (the best for any
ARM cores >=armv5te)
2. single precision floating point multiplications (if floating point math is
supported by hardware)
3. 32-bit*32-bit->32-bit integer multiplications
4. double precision floating point multiplications (if floating point math is
supported by hardware and FPU is pipelined for this kind of operation - not
the case for Cortex-A8 unfortunately)
5. 32-bit*32-bit->64-bit integer multiplications

Currently SBC contains ARM inline assembly macros to utilize
32-bit*32-bit->32-bit multiplication (multiply-accumulate variant) which
is not the fastest option. It is only good for armv4 cores which do not
support anything better. Even quite an old Nokia 770 internet tablet
has support for armv5te instructions.

For x86 platform, 16-bit integer multiplications can be done quite efficiently
using MMX or SSE2.

To sum up, the 16-bit fixed point version will be the fastest and is the best
variant if it can provide the required precision. Otherwise single precision
floating point version is the next natural choice.

Surely there are also other factors to consider and raw multiplications
throughput performance is not the only thing. The need of scaling values for
fixed point versions and int->float/float->int conversions for the floating
point ones also plays some role.

--
Best regards,
Siarhei Siamashka

2008-12-12 15:56:56

by Marcel Holtmann

[permalink] [raw]
Subject: RE: [PATCH] sbc: bitstream packing optimization for encoder.

Hi Christian,

> > > The sound quality of the fix point implementation still remains below
> > of the
> > > quality of the floating point version.
> > >
> > > Maybe, we shall support both depending on the performance
> > requirements?
> >
> > I think we should focus a fixed point version. There should be no need
> > for floating point at all. If fixed point isn't good enough, then we
> > screwed it up.
> >
> > And in case of embedded devices we are seriously limited with floating
> > point and doing that in software just doesn't work out. And this is
> > mostly not directly a performance problem. It is more a power
> > consumption problem. We don't wanna have A2DP drain the battery.
> >
>
> It would help if we use 64 bit integers if a good sound is required but this
> is again a problem for embedded devices.

depends on. Looking at the libmad library, they do a pretty good job in
just using integers.

Regards

Marcel



2008-12-12 15:49:08

by Christian Hoene

[permalink] [raw]
Subject: RE: [PATCH] sbc: bitstream packing optimization for encoder.

> > The sound quality of the fix point implementation still remains below
> of the
> > quality of the floating point version.
> >
> > Maybe, we shall support both depending on the performance
> requirements?
>
> I think we should focus a fixed point version. There should be no need
> for floating point at all. If fixed point isn't good enough, then we
> screwed it up.
>
> And in case of embedded devices we are seriously limited with floating
> point and doing that in software just doesn't work out. And this is
> mostly not directly a performance problem. It is more a power
> consumption problem. We don't wanna have A2DP drain the battery.
>

It would help if we use 64 bit integers if a good sound is required but this
is again a problem for embedded devices.

Regards
Christian


>

2008-12-12 15:30:47

by Marcel Holtmann

[permalink] [raw]
Subject: RE: [PATCH] sbc: bitstream packing optimization for encoder.

Hi Christian,

can we please keep this mailing a top-posting free zone. I know it is
hard for Outlock to get this right, but it is possible.

> next week I will discuss the bug fixing of SBC with Jaska Uimonen, who is
> ill this week.
>
> The sound quality of the fix point implementation still remains below of the
> quality of the floating point version.
>
> Maybe, we shall support both depending on the performance requirements?

I think we should focus a fixed point version. There should be no need
for floating point at all. If fixed point isn't good enough, then we
screwed it up.

And in case of embedded devices we are seriously limited with floating
point and doing that in software just doesn't work out. And this is
mostly not directly a performance problem. It is more a power
consumption problem. We don't wanna have A2DP drain the battery.

Regards

Marcel



2008-12-12 15:23:33

by Christian Hoene

[permalink] [raw]
Subject: RE: [PATCH] sbc: bitstream packing optimization for encoder.

Hi all,

next week I will discuss the bug fixing of SBC with Jaska Uimonen, who is
ill this week.

The sound quality of the fix point implementation still remains below of the
quality of the floating point version.

Maybe, we shall support both depending on the performance requirements?

Regards,

Christian


> -----Original Message-----
> From: [email protected] [mailto:linux-bluetooth-
> [email protected]] On Behalf Of Marcel Holtmann
> Sent: Friday, December 12, 2008 12:04 PM
> To: Siarhei Siamashka
> Cc: [email protected]
> Subject: Re: [PATCH] sbc: bitstream packing optimization for encoder.
>
> Hi Siarhei,
>
> > > > Appears that bitstream packing is also one of the most CPU hungry
> > > > operations in SBC encoder, which did not get proper attention
> yet.
> > > > Could somebody review this patch and provide feedback/comments?
> > >
> > > again thanks for working on this. I am actually willing to include
> your
> > > patches quickly in one of the next releases to get more wider
> testing
> > > audience.
> >
> > Thanks. Are the contributors of the last SBC-related modifications
> reachable
> > nowadays? They could probably help with the review of our patches.
>
> they should be at least all subscribed to this mailing list now. Since
> I
> will actually close the other ones in two weeks.
>
> > SBC code still can be improved in many ways from the performance
> point of
> > view. The current implementation follows SBC specification pretty
> much
> > literally. But rearranging the code a bit and getting rid of
> multidimentional
> > arrays can provide identical output, but work noticeably faster. Also
> SBC
> > can benefit from ARM assembly optimizations (for ARM11 and Cortex-
> A8),
> > so these could be applied a bit later once all the high level stuff
> is in
> > place.
> >
> > Current implementation only needs to be checked to ensure that it is
> correct
> > and fully adheres to the specification before applying large scale
> > optimizations.
>
> I agree. So lets get a fully compliant Linux version first and then we
> go from there.
>
> > > One comment from my side is that this should work with little-
> endian and
> > > big-endian as input streams. Please keep that in mind.
> >
> > Yes, it should work fine on both big and little endian systems,
> though I did
> > not test it on big-endian hardware myself. Actually SBC bitstream
> packer
> > favors big-endian systems, because the following part...
> >
> > + if (bits_count >= 16) {\
> > + bits_count -= 8;\
> > + *data_ptr++ = (uint8_t)(bits_cache >>
> bits_count);\
> > + bits_count -= 8;\
> > + *data_ptr++ = (uint8_t)(bits_cache >>
> bits_count);\
> > + }\
> >
> > ...could be implemented as something like this for big-endian systems
> > (provided that we additionally take care of alignment and convert
> 'data_ptr'
> > into a pointer to uint16_t):
> >
> > + if (bits_count >= 16) {\
> > + bits_count -= 16;\
> > + *data_ptr++ = (uint16_t)(bits_cache >>
> bits_count);\
> > + }\
> >
> > But for little endian-systems and also to ensure endian neutrality,
> data
> > writes are done one byte at a time.
>
> Lets keep doing it byte by byte for now. However keep in mind that the
> input stream can also be in big endian even it is runs on a little
> machine. And vice-versa.
>
> > And Just an additional comment about portability, there are also
> systems where
> > uint8_t data type is not supported (and CHAR_BIT is different from
> 8). For
> > example, there is a port SBC encoder from bluez to C55x DSP [1]. I'm
> in no
> > way involved in this project, but they could probably want to submit
> some
> > changes to mainline SBC code to make it usable on such systems with a
> bit
> > less tweaks.
>
> We are doing Linux first and then worry about other systems.
>
> Regards
>
> Marcel
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-
> bluetooth" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html

2008-12-12 11:04:09

by Marcel Holtmann

[permalink] [raw]
Subject: Re: [PATCH] sbc: bitstream packing optimization for encoder.

Hi Siarhei,

> > > Appears that bitstream packing is also one of the most CPU hungry
> > > operations in SBC encoder, which did not get proper attention yet.
> > > Could somebody review this patch and provide feedback/comments?
> >
> > again thanks for working on this. I am actually willing to include your
> > patches quickly in one of the next releases to get more wider testing
> > audience.
>
> Thanks. Are the contributors of the last SBC-related modifications reachable
> nowadays? They could probably help with the review of our patches.

they should be at least all subscribed to this mailing list now. Since I
will actually close the other ones in two weeks.

> SBC code still can be improved in many ways from the performance point of
> view. The current implementation follows SBC specification pretty much
> literally. But rearranging the code a bit and getting rid of multidimentional
> arrays can provide identical output, but work noticeably faster. Also SBC
> can benefit from ARM assembly optimizations (for ARM11 and Cortex-A8),
> so these could be applied a bit later once all the high level stuff is in
> place.
>
> Current implementation only needs to be checked to ensure that it is correct
> and fully adheres to the specification before applying large scale
> optimizations.

I agree. So lets get a fully compliant Linux version first and then we
go from there.

> > One comment from my side is that this should work with little-endian and
> > big-endian as input streams. Please keep that in mind.
>
> Yes, it should work fine on both big and little endian systems, though I did
> not test it on big-endian hardware myself. Actually SBC bitstream packer
> favors big-endian systems, because the following part...
>
> + if (bits_count >= 16) {\
> + bits_count -= 8;\
> + *data_ptr++ = (uint8_t)(bits_cache >> bits_count);\
> + bits_count -= 8;\
> + *data_ptr++ = (uint8_t)(bits_cache >> bits_count);\
> + }\
>
> ...could be implemented as something like this for big-endian systems
> (provided that we additionally take care of alignment and convert 'data_ptr'
> into a pointer to uint16_t):
>
> + if (bits_count >= 16) {\
> + bits_count -= 16;\
> + *data_ptr++ = (uint16_t)(bits_cache >> bits_count);\
> + }\
>
> But for little endian-systems and also to ensure endian neutrality, data
> writes are done one byte at a time.

Lets keep doing it byte by byte for now. However keep in mind that the
input stream can also be in big endian even it is runs on a little
machine. And vice-versa.

> And Just an additional comment about portability, there are also systems where
> uint8_t data type is not supported (and CHAR_BIT is different from 8). For
> example, there is a port SBC encoder from bluez to C55x DSP [1]. I'm in no
> way involved in this project, but they could probably want to submit some
> changes to mainline SBC code to make it usable on such systems with a bit
> less tweaks.

We are doing Linux first and then worry about other systems.

Regards

Marcel



2008-12-12 09:37:17

by Siarhei Siamashka

[permalink] [raw]
Subject: Re: [PATCH] sbc: bitstream packing optimization for encoder.

On Friday 12 December 2008 03:39:34 ext Marcel Holtmann wrote:
> Hi Siarhei,
>
> > Appears that bitstream packing is also one of the most CPU hungry
> > operations in SBC encoder, which did not get proper attention yet.
> > Could somebody review this patch and provide feedback/comments?
>
> again thanks for working on this. I am actually willing to include your
> patches quickly in one of the next releases to get more wider testing
> audience.

Thanks. Are the contributors of the last SBC-related modifications reachable
nowadays? They could probably help with the review of our patches.

SBC code still can be improved in many ways from the performance point of
view. The current implementation follows SBC specification pretty much
literally. But rearranging the code a bit and getting rid of multidimentional
arrays can provide identical output, but work noticeably faster. Also SBC
can benefit from ARM assembly optimizations (for ARM11 and Cortex-A8),
so these could be applied a bit later once all the high level stuff is in
place.

Current implementation only needs to be checked to ensure that it is correct
and fully adheres to the specification before applying large scale
optimizations.

> One comment from my side is that this should work with little-endian and
> big-endian as input streams. Please keep that in mind.

Yes, it should work fine on both big and little endian systems, though I did
not test it on big-endian hardware myself. Actually SBC bitstream packer
favors big-endian systems, because the following part...

+ if (bits_count >= 16) {\
+ bits_count -= 8;\
+ *data_ptr++ = (uint8_t)(bits_cache >> bits_count);\
+ bits_count -= 8;\
+ *data_ptr++ = (uint8_t)(bits_cache >> bits_count);\
+ }\

...could be implemented as something like this for big-endian systems
(provided that we additionally take care of alignment and convert 'data_ptr'
into a pointer to uint16_t):

+ if (bits_count >= 16) {\
+ bits_count -= 16;\
+ *data_ptr++ = (uint16_t)(bits_cache >> bits_count);\
+ }\

But for little endian-systems and also to ensure endian neutrality, data
writes are done one byte at a time.


And Just an additional comment about portability, there are also systems where
uint8_t data type is not supported (and CHAR_BIT is different from 8). For
example, there is a port SBC encoder from bluez to C55x DSP [1]. I'm in no
way involved in this project, but they could probably want to submit some
changes to mainline SBC code to make it usable on such systems with a bit
less tweaks.

1. https://garage.maemo.org/projects/dsp-sbc


Best regards,
Siarhei Siamashka

2008-12-12 01:39:34

by Marcel Holtmann

[permalink] [raw]
Subject: Re: [PATCH] sbc: bitstream packing optimization for encoder.

Hi Siarhei,

> Appears that bitstream packing is also one of the most CPU hungry
> operations in SBC encoder, which did not get proper attention yet.
> Could somebody review this patch and provide feedback/comments?

again thanks for working on this. I am actually willing to include your
patches quickly in one of the next releases to get more wider testing
audience.

One comment from my side is that this should work with little-endian and
big-endian as input streams. Please keep that in mind.

Regards

Marcel