On Wednesday 07 January 2009 14:40:01 ext Siarhei Siamashka wrote: > On Monday 05 January 2009 21:26:17 ext Brad Midgley wrote: > > The copy is inefficient but it would be even better if we didn't have > > to do it at all. I was investigating zero copy here and came up with a > > patch but it was too complicated to be accepted. > > Yes, it is possible to reduce the number of data copies. Having > both 'sbc_encoder_state->X' and 'sbc_frame->pcm_sample' arrays > is obviously redundant and only one of them should remain. > > But eliminating any kind of data copies completely is hardly possible. The > encoder needs to always have data for the 72 or 36 previous samples, so > they need to be stored somewhere between the calls to frame encode > function. And frame encode function will probably get data in small chunks > if having low latency is desired. So at least this part of data will need > to be moved around. Additionally, SIMD optimizations require input data > permutation, so they can't work directly with the input buffer. > > Preserving the input samples 'history' is currently achieved by having > 'sbc_encoder_state->X' array which works as some kind of ring buffer. > If this buffer had infinite size, we would not have to duplicate > information in the lower and higher halves of this buffer. But this is of > course not possible due to the need to keep memory use reasonable and also > in order to efficiently use data cache. Anyway, increasing > 'sbc_encoder_state->X' size to some reasonable value may help to reduce > extra overhead. One of the variants is to have space in X buffer for all > the input data of a frame plus these 72/36 samples from the previous frame, > copy the previous 72/36 samples to it, and then perform "endian conversion > + channels deinterleaving + do SIMD permutation" in one pass directly into > the X buffer from the buffer provided by the user at the entry of the frame > packing function. Increasing X buffer more may allow to copy the previous > 72/36 samples only once per 2 frames, once per 3 frames or whatever. > > This is not complicated at all, but is indeed a bit intrusive in the sense > that it will touch quite a large part of code. But we are ready to do it > now, am I right? The abovementioned optimization is implemented in the attached experimental patch. It makes sbcenc ~20% faster, which is a really nice improvement. > > The messy part here is we let the caller specify the byte order of the > > array. It would simplify a lot to standardize on host endian. I don't > > remember what the reasoning was against this. Yeah, this is one of the things that can simplify code and make maintenance easier. OK, before I submit a final/committable version of this patch, some things may need to be discussed/decided: 1. Alignment of the input data. Currently data is passed in as a void * pointer, which does not imply any alignment requirements. It would be nice if the input data would be guaranteed to be accessible as int16_t elements (16-bit sound samples). On the platforms with strict alignment (older ARM cores), unaligned memory accesses are not supported. Current sbcenc utility uses 'unsigned char' buffer to read the data from file and pass it to the encoder. With this setup, the compiler is theoretically allowed to allocate this buffer on uneven address, causing the troubles if we interpret it as (int16_t *) buffer in the encoder. I suggest changing sbc_encode/sbc_decode function prototypes to int16_t * data type for the buffers with audio samples and fixing sbcenc utility and other clients (gstreamer/alsa) which may potentially have alignment problems. 2. Is support for non-simd variant of the analysis filter useful enough to keep it (sbc_analyze_4b_4s/sbc_analyze_4b_8s functions)? They have a slight advantage over the simd variant in the sense that they only require a single table with coefficients (as opposed to two tables for simd). Also it is possible to implement non-simd input data handling as a simple memory copy (no permutation needed) or even directly work on some part of the input data (if the input data is large enough, some copying is still needed to carry on data between sbc_encode calls). But in the case of having simd/multimedia extension support, simd variant becomes clearly preferable. Most cpus have some sort of simd supported nowadays and this is interesting for some really low end or very old cpu. Is keeping this code worth extra maintenance burden? 3. Status of endian conversion in the SBC codec. Dropping endian conversion functionality in SBC and relaying it to the client application will make SBC code simpler. On the other hand, built-in endian conversion is a bit faster if we get data in inherently non-native byte order (loaded from big endian data format on little endian cpu for example). Is it worth extra complexity? Comments are very much welcome. Best regards, Siarhei Siamashka