This looks like an open-coded be_to_cpu.
GCC generates rather poor code for the open-coded version.
be_to_cpu expands to asm()s, and the resulting code is ~4 times shorter.
Compile-tested only.
I am not sure whether input can be misaligned for 32-bit access.
If it can be, replace:
((u32*)(input))[I] -> get_unaligned( ((u32*)(input))+I )
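
For illustration only, here is a userspace sketch of the two forms being
compared. It is not the patched code: the function names are made up,
memcpy() stands in for get_unaligned(), and __builtin_bswap32() (a GCC/Clang
builtin) stands in for the kernel's byte-swap asm. It assumes the open-coded
version builds each word from four bytes with shifts and ORs.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical open-coded form: assemble a big-endian word byte by byte. */
static uint32_t load_be32_open_coded(const uint8_t *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/*
 * Hypothetical stand-in for be_to_cpu(get_unaligned(...)):
 * one 32-bit load that tolerates any alignment, then a byte swap
 * on little-endian hosts only.
 */
static uint32_t load_be32_swab(const void *p)
{
	uint32_t v;

	memcpy(&v, p, sizeof(v));	/* alignment-safe load */
#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
	v = __builtin_bswap32(v);	/* big-endian -> host order */
#endif
	return v;
}
```

Both functions should return the same value for the same four input bytes,
whether or not the pointer is 32-bit aligned.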
--
vda