Hello all,
SBC encoder contains the following fragment:
> #if __BYTE_ORDER == __LITTLE_ENDIAN
> if (sbc->endian == SBC_BE)
> #elif __BYTE_ORDER == __BIG_ENDIAN
> if (sbc->endian == SBC_LE)
> #else
> #error "Unknown byte order"
> #endif
> s = (ptr[0] & 0xff) << 8 | (ptr[1] & 0xff);
> else
> s = (ptr[0] & 0xff) | (ptr[1] & 0xff) << 8;
This does not look right. Because as far as I can see, it will work
differently on big and little endian systems (though I did not test it
on real hardware). Is anybody using SBC on big endian systems?
Does it actually work as expected?
Best regards,
Siarhei Siamashka
Holtmann, Johan
>
>Holtmann, Johan
>
> The attach patch is for the SAP DBUS API doc and the initial framework code
>for SAP plugin.
>
> Could you help to have a review. Thanks!
Resend the patch according to your comments.
Best Regards,
Raymond Liu
On Wednesday 07 January 2009 14:40:01 ext Siarhei Siamashka wrote:
> On Monday 05 January 2009 21:26:17 ext Brad Midgley wrote:
> > The copy is inefficient but it would be even better if we didn't have
> > to do it at all. I was investigating zero copy here and came up with a
> > patch but it was too complicated to be accepted.
>
> Yes, it is possible to reduce the number of data copies. Having
> both 'sbc_encoder_state->X' and 'sbc_frame->pcm_sample' arrays
> is obviously redundant and only one of them should remain.
>
> But eliminating any kind of data copies completely is hardly possible. The
> encoder needs to always have data for the 72 or 36 previous samples, so
> they need to be stored somewhere between the calls to frame encode
> function. And frame encode function will probably get data in small chunks
> if having low latency is desired. So at least this part of data will need
> to be moved around. Additionally, SIMD optimizations require input data
> permutation, so they can't work directly with the input buffer.
>
> Preserving the input samples 'history' is currently achieved by having
> 'sbc_encoder_state->X' array which works as some kind of ring buffer.
> If this buffer had infinite size, we would not have to duplicate
> information in the lower and higher halves of this buffer. But this is of
> course not possible due to the need to keep memory use reasonable and also
> in order to efficiently use data cache. Anyway, increasing
> 'sbc_encoder_state->X' size to some reasonable value may help to reduce
> extra overhead. One of the variants is to have space in X buffer for all
> the input data of a frame plus these 72/36 samples from the previous frame,
> copy the previous 72/36 samples to it, and then perform "endian conversion
> + channels deinterleaving + do SIMD permutation" in one pass directly into
> the X buffer from the buffer provided by the user at the entry of the frame
> packing function. Increasing X buffer more may allow to copy the previous
> 72/36 samples only once per 2 frames, once per 3 frames or whatever.
>
> This is not complicated at all, but is indeed a bit intrusive in the sense
> that it will touch quite a large part of code. But we are ready to do it
> now, am I right?
The abovementioned optimization is implemented in the attached experimental
patch. It makes sbcenc ~20% faster, which is a really nice improvement.
> > The messy part here is we let the caller specify the byte order of the
> > array. It would simplify a lot to standardize on host endian. I don't
> > remember what the reasoning was against this.
Yeah, this is one of the things that can simplify code and make maintenance
easier.
OK, before I submit a final/committable version of this patch, some things may
need to be discussed/decided:
1. Alignment of the input data. Currently data is passed in as a void *
pointer, which does not imply any alignment requirements. It would be nice if
the input data would be guaranteed to be accessible as int16_t elements
(16-bit sound samples). On the platforms with strict alignment (older ARM
cores), unaligned memory accesses are not supported. Current sbcenc utility
uses 'unsigned char' buffer to read the data from file and pass it to the
encoder. With this setup, the compiler is theoretically allowed to allocate
this buffer on uneven address, causing the troubles if we interpret it as
(int16_t *) buffer in the encoder. I suggest changing sbc_encode/sbc_decode
function prototypes to int16_t * data type for the buffers with audio samples
and fixing sbcenc utility and other clients (gstreamer/alsa) which may
potentially have alignment problems.
2. Is support for non-simd variant of the analysis filter useful enough to
keep it (sbc_analyze_4b_4s/sbc_analyze_4b_8s functions)? They have
a slight advantage over the simd variant in the sense that they only require a
single table with coefficients (as opposed to two tables for simd). Also it is
possible to implement non-simd input data handling as a simple memory
copy (no permutation needed) or even directly work on some part of the input
data (if the input data is large enough, some copying is still needed to carry
on data between sbc_encode calls). But in the case of having simd/multimedia
extension support, simd variant becomes clearly preferable. Most cpus
have some sort of simd supported nowadays and this is interesting for some
really low end or very old cpu. Is keeping this code worth extra maintenance
burden?
3. Status of endian conversion in the SBC codec. Dropping endian conversion
functionality in SBC and relaying it to the client application will make SBC
code simpler. On the other hand, built-in endian conversion is a bit faster
if we get data in inherently non-native byte order (loaded from big endian
data format on little endian cpu for example). Is it worth extra complexity?
Comments are very much welcome.
Best regards,
Siarhei Siamashka
On Monday 19 January 2009 17:02:03 ext Brad Midgley wrote:
> > Also an interesting observation is that gstreamer is very slow. Using
> > gst-launch to encode audio file to SBC is something like 1.5 times slower
> > than sbcenc utility when benchmarked on x86 system. I wonder if
> > something could be done about this.
>
> I noticed something similar back in 3.x. On an xscale, it went from
> about 15% cpu usage under libalsa to 50% using gst-launch.
By the way, if that XScale chip is IWMMXT capable, it can benefit from SIMD
optimizations too. This should probably reduce CPU usage way below 15%.
Moreover, IWMMXT is essentially an MMX clone for ARM, so x86 MMX code can
be directly converted to it.
> Is there a best practices for gstreamer that the plugin is not following?
I'm not quite familiar with gstreamer yet, but these performance issues might
be really worth investigating...
Best regards,
Siarhei Siamashka
Holtmann, Johan
The attach patch is for the SAP DBUS API doc and the initial framework code for SAP plugin.
Could you help to have a review. Thanks!
Best Regards,
Raymond Liu
Siarhei
> Also an interesting observation is that gstreamer is very slow. Using
> gst-launch to encode audio file to SBC is something like 1.5 times slower
> than sbcenc utility when benchmarked on x86 system. I wonder if
> something could be done about this.
I noticed something similar back in 3.x. On an xscale, it went from
about 15% cpu usage under libalsa to 50% using gst-launch.
Is there a best practices for gstreamer that the plugin is not following?
--
Brad Midgley
Hi Siarhei,
> > I did not verify that patch thoroughly (I only checked that it compiles)
> > and also seems like I somehow forgot to pass it through 'checkpatch.pl'.
> > Thanks for a good catch.
> >
> > Regarding G_BYTE_ORDER, you are probably right. I'm not very familiar with
> > gstreamer yet, so somebody else could probably fix this stuff much faster
> > than me.
> >
> > In any case, it probably makes sense to relay all the endian conversion
> > related things to gstreamer and ALSA, and use only native byte order in
> > SBC. This will simplify the code a bit and will make optimizing it easier.
> >
> > I'm quite worried about big endian systems, but I have plans to buy a PS3
> > home in about a month timeframe unless something extraordinary happens :)
>
> OK, here is an updated patch.
>
> Now I also tested GStreamer plugin on qemu mips big endian image. It works
> fine with this change. Having native byte order and not strictly little endian
> is a good idea because all the other gstreamer elements (vorbisdec,
> flacdec, ...) produce output in native byte order. Unless sbcenc uses native
> byte order on this big endian system, audioconvert element is needed to be
> added to gstreamer pipeline and it most likely introduces some extra overhead.
>
> I did not check ALSA part and probably can't even do this with qemu, but
> removing _LE suffix should also do the job there (according to ALSA
> documentation).
patch has been applied. Thanks.
> Also an interesting observation is that gstreamer is very slow. Using
> gst-launch to encode audio file to SBC is something like 1.5 times slower
> than sbcenc utility when benchmarked on x86 system. I wonder if
> something could be done about this.
I suspected GStreamer to be slower because of its GObject overhead, but
1.5 times is a lot. Not sure if GStreamer or GObject is to blame here.
Regards
Marcel
On Saturday 17 January 2009 20:10:04 ext Siarhei Siamashka wrote:
> On Saturday 17 January 2009 00:02:09 ext Luiz Augusto von Dentz wrote:
> > Hi Siarhei,
> >
> > It seems gstreamer does have G_BYTE_ORDER, so you probably don't need
> > those #ifdef. I also notice some unnecessary white spaces in the of a
> > line.
>
> Hi,
>
> I did not verify that patch thoroughly (I only checked that it compiles)
> and also seems like I somehow forgot to pass it through 'checkpatch.pl'.
> Thanks for a good catch.
>
> Regarding G_BYTE_ORDER, you are probably right. I'm not very familiar with
> gstreamer yet, so somebody else could probably fix this stuff much faster
> than me.
>
> In any case, it probably makes sense to relay all the endian conversion
> related things to gstreamer and ALSA, and use only native byte order in
> SBC. This will simplify the code a bit and will make optimizing it easier.
>
> I'm quite worried about big endian systems, but I have plans to buy a PS3
> home in about a month timeframe unless something extraordinary happens :)
OK, here is an updated patch.
Now I also tested GStreamer plugin on qemu mips big endian image. It works
fine with this change. Having native byte order and not strictly little endian
is a good idea because all the other gstreamer elements (vorbisdec,
flacdec, ...) produce output in native byte order. Unless sbcenc uses native
byte order on this big endian system, audioconvert element is needed to be
added to gstreamer pipeline and it most likely introduces some extra overhead.
I did not check ALSA part and probably can't even do this with qemu, but
removing _LE suffix should also do the job there (according to ALSA
documentation).
Also an interesting observation is that gstreamer is very slow. Using
gst-launch to encode audio file to SBC is something like 1.5 times slower
than sbcenc utility when benchmarked on x86 system. I wonder if
something could be done about this.
Best regards,
Siarhei Siamashka
On Saturday 17 January 2009 00:02:09 ext Luiz Augusto von Dentz wrote:
> Hi Siarhei,
>
> It seems gstreamer does have G_BYTE_ORDER, so you probably don't need
> those #ifdef. I also notice some unnecessary white spaces in the of a
> line.
Hi,
I did not verify that patch thoroughly (I only checked that it compiles) and
also seems like I somehow forgot to pass it through 'checkpatch.pl'. Thanks
for a good catch.
Regarding G_BYTE_ORDER, you are probably right. I'm not very familiar with
gstreamer yet, so somebody else could probably fix this stuff much faster
than me.
In any case, it probably makes sense to relay all the endian conversion
related things to gstreamer and ALSA, and use only native byte order in SBC.
This will simplify the code a bit and will make optimizing it easier.
I'm quite worried about big endian systems, but I have plans to buy a PS3 home
in about a month timeframe unless something extraordinary happens :)
Best regards,
Siarhei Siamashka
Hi Siarhei,
It seems gstreamer does have G_BYTE_ORDER, so you probably don't need
those #ifdef. I also notice some unnecessary white spaces in the of a
line.
--
Luiz Augusto von Dentz
Engenheiro de Computa??o
On Wednesday 07 January 2009 15:43:57 ext Marcel Holtmann wrote:
> Hi Siarhei,
>
> > > The messy part here is we let the caller specify the byte order of the
> > > array. It would simplify a lot to standardize on host endian. I don't
> > > remember what the reasoning was against this.
> > >
> > > > Let's suppose that we have the following two bytes in memory:
> > > >
> > > > 0x12 0x34
> > > >
> > > > This equals to 0x1234 for big endian systems or 0x3412 for little
> > > > endian systems if you read data via int16_t * pointer.
> > >
> > > if sbc->endian is set to the same as the host endian, then this could
> > > be done with a memcopy or skipped (zero copy).
> > >
> > > I believe sbc->endian is always set to little endian the way it works
> > > now, meaning the array is storing data little endian regardless of the
> > > architecture. The patch I wrote made a copy of the memory, swapping
> > > bytes, if sbc->endian didn't match host endian. The way we use the
> > > code, this swapping only happens on big endian machines.
> >
> > Well, having no other option to verify this big endian bug, I used this
> > MIPS big endian QEMU image for testing:
> > http://people.debian.org/~aurel32/qemu/mips/
> >
> > The current code is indeed broken on big endian systems. The attached
> > patch makes it work correct. Of course this part needs heavy performance
> > optimizations (as discussed above), but even just fixing it may be a good
> > idea at the moment.
>
> patch has been applied. Thanks.
Thanks, though seems like there are some problems.
By default, SBC codec uses native byte order now (configured
by 'sbc_set_defaults' function, which is called from 'sbc_init').
But SBC gstreamer elements ('gstsbcdec.c', 'gstsbcenc.c') specify
endianness as little endian: "endianness = (int) LITTLE_ENDIAN, ".
So now after the patch got applied, "sbcenc/sbcdec" utilites got fixed,
but gstreamer elements may have problems on big endian systems
instead. Another concern is about 'pcm_bluetooth.c'. It does not specifically
configure endianness in any way, so SBC codec is used with the native byte
order as well. And there the supported format is also specified as
SND_PCM_FORMAT_S16_LE.
Looks like this was the case of double errors which were cancelling each other
or misleading constant names. In any case, SBC has been always treating data
by default as little endian before regardless of the host native byte order.
So in order to get gstreamer and pcm_bluetooth working right again, the
default configuration of SBC codec can be changed to SBC_LE (this way all the
old clients which did not change the default configuration will continue to
work just like before). A variant of patch with this kind of solution is
attached.
Alternatively, gstreamer elements and ALSA plugin could try to use native byte
order (and this can be probably good for the performance as with the strictly
little endian support for the data, big endian systems may actually have to do
the conversion twice - first in the gstreamer in order to provide the data to
the element in little endian format, and then in SBC codec itself to convert
it back to native format to use it in number crunching). This variant of patch
is also attached (but untested because I don't have a big endian system).
Some kind of fix needs to be used (and the attached patches are mutually
exclusive), as apparently 4.26 release does not work right on big endian
systems :(
Best regards,
Siarhei Siamashka
Hi Siarhei,
> > The messy part here is we let the caller specify the byte order of the
> > array. It would simplify a lot to standardize on host endian. I don't
> > remember what the reasoning was against this.
> >
> > > Let's suppose that we have the following two bytes in memory:
> > >
> > > 0x12 0x34
> > >
> > > This equals to 0x1234 for big endian systems or 0x3412 for little endian
> > > systems if you read data via int16_t * pointer.
> >
> > if sbc->endian is set to the same as the host endian, then this could
> > be done with a memcopy or skipped (zero copy).
> >
> > I believe sbc->endian is always set to little endian the way it works
> > now, meaning the array is storing data little endian regardless of the
> > architecture. The patch I wrote made a copy of the memory, swapping
> > bytes, if sbc->endian didn't match host endian. The way we use the
> > code, this swapping only happens on big endian machines.
>
> Well, having no other option to verify this big endian bug, I used this MIPS
> big endian QEMU image for testing:
> http://people.debian.org/~aurel32/qemu/mips/
>
> The current code is indeed broken on big endian systems. The attached
> patch makes it work correct. Of course this part needs heavy performance
> optimizations (as discussed above), but even just fixing it may be a good
> idea at the moment.
patch has been applied. Thanks.
Regards
Marcel
On Monday 05 January 2009 21:26:17 ext Brad Midgley wrote:
> Siarhei
>
> The copy is inefficient but it would be even better if we didn't have
> to do it at all. I was investigating zero copy here and came up with a
> patch but it was too complicated to be accepted.
Yes, it is possible to reduce the number of data copies. Having
both 'sbc_encoder_state->X' and 'sbc_frame->pcm_sample' arrays
is obviously redundant and only one of them should remain.
But eliminating any kind of data copies completely is hardly possible. The
encoder needs to always have data for the 72 or 36 previous samples, so
they need to be stored somewhere between the calls to frame encode
function. And frame encode function will probably get data in small chunks
if having low latency is desired. So at least this part of data will need to
be moved around. Additionally, SIMD optimizations require input data
permutation, so they can't work directly with the input buffer.
Preserving the input samples 'history' is currently achieved by having
'sbc_encoder_state->X' array which works as some kind of ring buffer.
If this buffer had infinite size, we would not have to duplicate information
in the lower and higher halves of this buffer. But this is of course not
possible due to the need to keep memory use reasonable and also in order to
efficiently use data cache. Anyway, increasing 'sbc_encoder_state->X' size
to some reasonable value may help to reduce extra overhead. One of the
variants is to have space in X buffer for all the input data of a frame plus
these 72/36 samples from the previous frame, copy the previous 72/36 samples
to it, and then perform "endian conversion + channels deinterleaving + do SIMD
permutation" in one pass directly into the X buffer from the buffer provided
by the user at the entry of the frame packing function. Increasing X buffer
more may allow to copy the previous 72/36 samples only once per 2 frames, once
per 3 frames or whatever.
This is not complicated at all, but is indeed a bit intrusive in the sense
that it will touch quite a large part of code. But we are ready to do it now,
am I right?
> The messy part here is we let the caller specify the byte order of the
> array. It would simplify a lot to standardize on host endian. I don't
> remember what the reasoning was against this.
>
> > Let's suppose that we have the following two bytes in memory:
> >
> > 0x12 0x34
> >
> > This equals to 0x1234 for big endian systems or 0x3412 for little endian
> > systems if you read data via int16_t * pointer.
>
> if sbc->endian is set to the same as the host endian, then this could
> be done with a memcopy or skipped (zero copy).
>
> I believe sbc->endian is always set to little endian the way it works
> now, meaning the array is storing data little endian regardless of the
> architecture. The patch I wrote made a copy of the memory, swapping
> bytes, if sbc->endian didn't match host endian. The way we use the
> code, this swapping only happens on big endian machines.
Well, having no other option to verify this big endian bug, I used this MIPS
big endian QEMU image for testing:
http://people.debian.org/~aurel32/qemu/mips/
The current code is indeed broken on big endian systems. The attached
patch makes it work correct. Of course this part needs heavy performance
optimizations (as discussed above), but even just fixing it may be a good
idea at the moment.
There is one more suspicious part of code which may cause problems:
> unsigned char input[2048], output[2048];
> ...
> au_hdr = (struct au_header *) input;
> if (au_hdr->magic != AU_MAGIC ||
> BE_INT(au_hdr->hdr_size) > 128 ||
> BE_INT(au_hdr->hdr_size) < 24 ||
> BE_INT(au_hdr->encoding) != AU_FMT_LIN16) {
> fprintf(stderr, "Not in Sun/NeXT audio S16_BE format\n");
> goto done;
> }
As 'input' array is only guaranteed to have byte alignment and the code later
accesses 32-bit au_hdr data, it may cause problems and the compiler
rightfully issues a warning about it.
Best regards,
Siarhei Siamashka
Siarhei
The copy is inefficient but it would be even better if we didn't have
to do it at all. I was investigating zero copy here and came up with a
patch but it was too complicated to be accepted.
The messy part here is we let the caller specify the byte order of the
array. It would simplify a lot to standardize on host endian. I don't
remember what the reasoning was against this.
> Let's suppose that we have the following two bytes in memory:
>
> 0x12 0x34
>
> This equals to 0x1234 for big endian systems or 0x3412 for little endian
> systems if you read data via int16_t * pointer.
if sbc->endian is set to the same as the host endian, then this could
be done with a memcopy or skipped (zero copy).
I believe sbc->endian is always set to little endian the way it works
now, meaning the array is storing data little endian regardless of the
architecture. The patch I wrote made a copy of the memory, swapping
bytes, if sbc->endian didn't match host endian. The way we use the
code, this swapping only happens on big endian machines.
--
Brad Midgley
On Monday 05 January 2009 18:25:24 ext Brad Midgley wrote:
> Siarhei
>
> > The logic is that the first line contains a portable endian neutral read
> > of
> >
> > big endian data into native format:
> >> s = (ptr[0] & 0xff) << 8 | (ptr[1] &
> >> 0xff);
>
> the intent is to swap bytes using this first statement if either
>
> - host order is little endian and the ptr array is stored big endian
> - host order is big endian and the ptr array is stored little endian
>
> the second case could be done with a cast to a 16-bit int rather than
> bit arithmetic.
>
> We don't exercise all the various cases.
No, it's a bit different. There is no 'swap byte' operation here, but just
portable reading of data in big endian format.
Let's suppose that we have the following two bytes in memory:
0x12 0x34
This equals to 0x1234 for big endian systems or 0x3412 for little endian
systems if you read data via int16_t * pointer.
But in the case of SBC, data is read one byte at a time and results in the
same value regardless of the endianness of the system. It will be always
0x1234 and this is generally a portable, but slow way of reading big endian
data (host system endian order does not matter).
So unless __BYTE_ORDER is somehow defined to be __LITTLE_ENDIAN on big
endian systems (two bugs which cancel each other ;-) ), the code from SBC is
supposed to work incorrectly and sbcenc should produce broken files
with distorted sound (though somewhat resembling the original file). I just
want somebody to verify this on real big endian hardware.
PS. I'm interested in this code because it is very slow and needs to be
optimized. In the case of single channel and no endian conversion, just
memcpy can be used. Other cases can be improved too.
--
Best regards,
Siarhei Siamashka
Siarhei
> The logic is that the first line contains a portable endian neutral read of
> big endian data into native format:
>> s = (ptr[0] & 0xff) << 8 | (ptr[1] & 0xff);
the intent is to swap bytes using this first statement if either
- host order is little endian and the ptr array is stored big endian
- host order is big endian and the ptr array is stored little endian
the second case could be done with a cast to a 16-bit int rather than
bit arithmetic.
We don't exercise all the various cases.
--
Brad Midgley
On Monday 05 January 2009 17:36:57 ext Brad Midgley wrote:
> Siarhei
>
> > SBC encoder contains the following fragment:
> >> #if __BYTE_ORDER == __LITTLE_ENDIAN
> >> if (sbc->endian == SBC_BE)
> >> #elif __BYTE_ORDER == __BIG_ENDIAN
> >> if (sbc->endian == SBC_LE)
> >> #else
> >> #error "Unknown byte order"
> >> #endif
> >> s = (ptr[0] & 0xff) << 8 | (ptr[1] &
> >> 0xff); else
> >> s = (ptr[0] & 0xff) | (ptr[1] & 0xff) <<
> >> 8;
> >
> > This does not look right. Because as far as I can see, it will work
> > differently on big and little endian systems
>
> I tested it some time ago on powerpc and I know Marcel also has a
> powerpc for testing. I'll need to update my machine for bluez 4.x.
>
> The logic here seems to be to swap bytes if the endian set in the sbc
> struct does not match host endian.
The logic is that the first line contains a portable endian neutral read of
big endian data into native format:
> s = (ptr[0] & 0xff) << 8 | (ptr[1] & 0xff);
The second line contains a portable endian neutral read of little endian
data into native format:
> s = (ptr[0] & 0xff) | (ptr[1] & 0xff) <<8;
If __BYTE_ORDER is defined as __LITTLE_ENDIAN, the first line is used when
sbc->endian == SBC_BE. But if __BYTE_ORDER is set to __BIG_ENDIAN, the
logic is reversed.
The behaviour and the end result seems to be different for the big endian
and little endian systems. This can be probably easily tested by encoding the
same file by sbcenc on big and little endian systems and comparing results.
--
Best regards,
Siarhei Siamashka
Siarhei
> SBC encoder contains the following fragment:
>
>> #if __BYTE_ORDER == __LITTLE_ENDIAN
>> if (sbc->endian == SBC_BE)
>> #elif __BYTE_ORDER == __BIG_ENDIAN
>> if (sbc->endian == SBC_LE)
>> #else
>> #error "Unknown byte order"
>> #endif
>> s = (ptr[0] & 0xff) << 8 | (ptr[1] & 0xff);
>> else
>> s = (ptr[0] & 0xff) | (ptr[1] & 0xff) << 8;
>
> This does not look right. Because as far as I can see, it will work
> differently on big and little endian systems
I tested it some time ago on powerpc and I know Marcel also has a
powerpc for testing. I'll need to update my machine for bluez 4.x.
The logic here seems to be to swap bytes if the endian set in the sbc
struct does not match host endian.
--
Brad Midgley