From: Andrew Morton Subject: Re: [PATCH] zerocopy NFS for 2.5.36 Date: Wed, 18 Sep 2002 17:16:43 -0700 Sender: linux-kernel-owner@vger.kernel.org Message-ID: <3D89176B.40FFD09B@digeo.com> References: <20020918.160057.17194839.davem@redhat.com> <1032393277.24895.8.camel@irongate.swansea.linux.org.uk> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Cc: "David S. Miller" , taka@valinux.co.jp, neilb@cse.unsw.edu.au, linux-kernel@vger.kernel.org, nfs@lists.sourceforge.net Return-path: To: Alan Cox List-ID: Alan Cox wrote: > > On Thu, 2002-09-19 at 00:00, David S. Miller wrote: > > It was discussed long ago that csum_and_copy_from_user() performs > > better than plain copy_from_user() on x86. I do not remember all > > The better was a freak of PPro/PII scheduling I think > > > details, but I do know that using copy_from_user() is not a real > > improvement at least on x86 architecture. > > The same as bit is easy to explain. Its totally memory bandwidth limited > on current x86-32 processors. (Although I'd welcome demonstrations to > the contrary on newer toys) Nope. There are distinct alignment problems with movsl-based memcpy on PII and (at least) "Pentium III (Coppermine)", which is tested here: copy_32 uses movsl. copy_duff just uses a stream of "movl"s Time uncached-to-uncached memcpy, source and dest are 8-byte-aligned: akpm:/usr/src/cptimer> ./cptimer -d -s nbytes=10240 from_align=0, to_align=0 copy_32: copied 19.1 Mbytes in 0.078 seconds at 243.9 Mbytes/sec __copy_duff: copied 19.1 Mbytes in 0.090 seconds at 211.1 Mbytes/sec OK, movsl wins. But now give the source address 8+1 alignment: akpm:/usr/src/cptimer> ./cptimer -d -s -f 1 nbytes=10240 from_align=1, to_align=0 copy_32: copied 19.1 Mbytes in 0.158 seconds at 120.8 Mbytes/sec __copy_duff: copied 19.1 Mbytes in 0.091 seconds at 210.3 Mbytes/sec The "movl"-based copy wins. By miles. Make the source 8+4 aligned: akpm:/usr/src/cptimer> ./cptimer -d -s -f 4 nbytes=10240 from_align=4, to_align=0 copy_32: copied 19.1 Mbytes in 0.134 seconds at 142.1 Mbytes/sec __copy_duff: copied 19.1 Mbytes in 0.089 seconds at 214.0 Mbytes/sec So movl still beats movsl, by lots. I have various scriptlets which generate the entire matrix. I think I ended up deciding that we should use movsl _only_ when both src and dsc are 8-byte-aligned. And that when you multiply the gain from that by the frequency*size with which funny alignments are used by TCP the net gain was 2% or something. It needs redoing. These differences are really big, and this is the kernel's most expensive function. A little project for someone. The tools are at http://www.zip.com.au/~akpm/linux/cptimer.tar.gz