2002-09-18 08:17:51

by Hirokazu Takahashi

[permalink] [raw]
Subject: [PATCH] zerocopy NFS for 2.5.36

Hello,

I ported the zerocopy NFS patches against linux-2.5.36.

I made va05-zerocopy-nfsdwrite-2.5.36.patch more generic,
so that it will be easy to merge with NFSv4. Each procedure can
choose whether or not it accepts split buffers.
I also fixed a problem where nfsd couldn't handle very large
NFS symlink requests.


1)
ftp://ftp.valinux.co.jp/pub/people/taka/2.5.36/va10-hwchecksum-2.5.36.patch
This patch enables hardware checksumming of outgoing packets, including UDP frames.

2)
ftp://ftp.valinux.co.jp/pub/people/taka/2.5.36/va11-udpsendfile-2.5.36.patch
This patch makes the sendfile system call work over UDP. It also supports
the UDP_CORK interface, which is very similar to TCP_CORK, and you can call
sendmsg/sendfile with the MSG_MORE flag on UDP sockets.

3)
ftp://ftp.valinux.co.jp/pub/people/taka/2.5.36/va-csumpartial-fix-2.5.36.patch
This patch fixes a problem in the x86 csum_partial() routines, which
couldn't handle odd-addressed buffers.

4)
ftp://ftp.valinux.co.jp/pub/people/taka/2.5.36/va01-zerocopy-rpc-2.5.36.patch
This patch lets RPC send pieces of data and pages without copying.

5)
ftp://ftp.valinux.co.jp/pub/people/taka/2.5.36/va02-zerocopy-nfsdread-2.5.36.patch
This patch makes NFSD send pagecache pages directly when NFS clients
request a file read.

6)
ftp://ftp.valinux.co.jp/pub/people/taka/2.5.36/va03-zerocopy-nfsdreaddir-2.5.36.patch
nfsd_readdir can also send pages without copying.

7)
ftp://ftp.valinux.co.jp/pub/people/taka/2.5.36/va04-zerocopy-shadowsock-2.5.36.patch
This patch creates per-CPU UDP sockets so that NFSD can send UDP frames on
each processor simultaneously.
Without it we can send only one UDP frame at a time, as the UDP socket
has to be locked while pages are being sent, to serialize them.

8)
ftp://ftp.valinux.co.jp/pub/people/taka/2.5.36/va05-zerocopy-nfsdwrite-2.5.36.patch
This patch makes NFS write use the writev interface. nfsd can then handle
NFS requests without reassembling IP fragments into one UDP frame.

9)
ftp://ftp.valinux.co.jp/pub/people/taka/2.5.36/taka-writev-2.5.36.patch
This patch makes writev on regular files faster.
It also can be found at
http://www.zip.com.au/~akpm/linux/patches/2.5/2.5.35/2.5.35-mm1/broken-out/

Caution:
XFS doesn't support the writev interface yet, so NFS write on XFS might
slow down with patch No. 8. I hope the SGI folks will implement it.

10)
ftp://ftp.valinux.co.jp/pub/people/taka/2.5.36/va07-nfsbigbuf-2.5.36.patch
This makes the NFS buffer much bigger (60KB).
To the kernel a 60KB buffer costs the same as a 32KB one, since both
require a 64KB chunk.


11)
ftp://ftp.valinux.co.jp/pub/people/taka/2.5.36/va09-zerocopy-tempsendto-2.5.36.patch
If you don't want to use sendfile over UDP yet, you can apply this instead of patches No. 1 and No. 2.



Regards,
Hirokazu Takahashi


2002-09-18 23:05:08

by David Miller

[permalink] [raw]
Subject: Re: [PATCH] zerocopy NFS for 2.5.36

From: Hirokazu Takahashi <[email protected]>
Date: Wed, 18 Sep 2002 17:14:31 +0900 (JST)


1)
ftp://ftp.valinux.co.jp/pub/people/taka/2.5.36/va10-hwchecksum-2.5.36.patch
This patch enables hardware checksumming of outgoing packets, including UDP frames.

Can you explain the TCP parts? They look very wrong.

It was discussed long ago that csum_and_copy_from_user() performs
better than plain copy_from_user() on x86. I do not remember all
details, but I do know that using copy_from_user() is not a real
improvement at least on x86 architecture.

The rest of the changes (ie. the getfrag() logic to set
skb->ip_summed) looks fine.

3)
ftp://ftp.valinux.co.jp/pub/people/taka/2.5.36/va-csumpartial-fix-2.5.36.patch
This patch fixes a problem in the x86 csum_partial() routines, which
couldn't handle odd-addressed buffers.

I've sent Linus this fix already.

2002-09-18 23:49:18

by Alan

[permalink] [raw]
Subject: Re: [PATCH] zerocopy NFS for 2.5.36

On Thu, 2002-09-19 at 00:00, David S. Miller wrote:
> It was discussed long ago that csum_and_copy_from_user() performs
> better than plain copy_from_user() on x86. I do not remember all

The better was a freak of PPro/PII scheduling I think

> details, but I do know that using copy_from_user() is not a real
> improvement at least on x86 architecture.

The "same as" bit is easy to explain. It's totally memory-bandwidth limited
on current x86-32 processors. (Although I'd welcome demonstrations to
the contrary on newer toys.)

2002-09-19 00:11:50

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH] zerocopy NFS for 2.5.36

Alan Cox wrote:
>
> On Thu, 2002-09-19 at 00:00, David S. Miller wrote:
> > It was discussed long ago that csum_and_copy_from_user() performs
> > better than plain copy_from_user() on x86. I do not remember all
>
> The better was a freak of PPro/PII scheduling I think
>
> > details, but I do know that using copy_from_user() is not a real
> > improvement at least on x86 architecture.
>
> The same as bit is easy to explain. Its totally memory bandwidth limited
> on current x86-32 processors. (Although I'd welcome demonstrations to
> the contrary on newer toys)

Nope. There are distinct alignment problems with movsl-based
memcpy on PII and (at least) "Pentium III (Coppermine)", which is
tested here:

copy_32 uses movsl. copy_duff just uses a stream of "movl"s

Time uncached-to-uncached memcpy, source and dest are 8-byte-aligned:

akpm:/usr/src/cptimer> ./cptimer -d -s
nbytes=10240 from_align=0, to_align=0
copy_32: copied 19.1 Mbytes in 0.078 seconds at 243.9 Mbytes/sec
__copy_duff: copied 19.1 Mbytes in 0.090 seconds at 211.1 Mbytes/sec

OK, movsl wins. But now give the source address 8+1 alignment:

akpm:/usr/src/cptimer> ./cptimer -d -s -f 1
nbytes=10240 from_align=1, to_align=0
copy_32: copied 19.1 Mbytes in 0.158 seconds at 120.8 Mbytes/sec
__copy_duff: copied 19.1 Mbytes in 0.091 seconds at 210.3 Mbytes/sec

The "movl"-based copy wins. By miles.

Make the source 8+4 aligned:

akpm:/usr/src/cptimer> ./cptimer -d -s -f 4
nbytes=10240 from_align=4, to_align=0
copy_32: copied 19.1 Mbytes in 0.134 seconds at 142.1 Mbytes/sec
__copy_duff: copied 19.1 Mbytes in 0.089 seconds at 214.0 Mbytes/sec

So movl still beats movsl, by lots.

I have various scriptlets which generate the entire matrix.

I think I ended up deciding that we should use movsl _only_
when both src and dst are 8-byte-aligned. And that when you
multiply the gain from that by the frequency*size with which
funny alignments are used by TCP, the net gain was 2% or something.

It needs redoing. These differences are really big, and this
is the kernel's most expensive function.

A little project for someone.

The tools are at http://www.zip.com.au/~akpm/linux/cptimer.tar.gz

2002-09-19 02:08:41

by Aaron Lehmann

[permalink] [raw]
Subject: Re: [PATCH] zerocopy NFS for 2.5.36

> akpm:/usr/src/cptimer> ./cptimer -d -s
> nbytes=10240 from_align=0, to_align=0
> copy_32: copied 19.1 Mbytes in 0.078 seconds at 243.9 Mbytes/sec
> __copy_duff: copied 19.1 Mbytes in 0.090 seconds at 211.1 Mbytes/sec

It's disappointing that this program doesn't seem to support
benchmarking of MMX copy loops (like the ones in arch/i386/lib/mmx.c).
Those seem to be the more interesting memcpy functions on modern
systems.

2002-09-19 03:25:57

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH] zerocopy NFS for 2.5.36

Aaron Lehmann wrote:
>
> > akpm:/usr/src/cptimer> ./cptimer -d -s
> > nbytes=10240 from_align=0, to_align=0
> > copy_32: copied 19.1 Mbytes in 0.078 seconds at 243.9 Mbytes/sec
> > __copy_duff: copied 19.1 Mbytes in 0.090 seconds at 211.1 Mbytes/sec
>
> It's disappointing that this program doesn't seem to support
> benchmarking of MMX copy loops (like the ones in arch/i386/lib/mmx.c).
> Those seem to be the more interesting memcpy functions on modern
> systems.

Well the source is there, and the licensing terms are most reasonable.

But then, the source was there eighteen months ago and nothing happened.
Sigh.

I think in-kernel MMX has fatal drawbacks anyway. Not sure what
they are - I prefer to pretend that x86 CPUs execute raw C.

2002-09-19 10:33:37

by Alan

[permalink] [raw]
Subject: Re: [PATCH] zerocopy NFS for 2.5.36

On Thu, 2002-09-19 at 04:30, Andrew Morton wrote:
> > It's disappointing that this program doesn't seem to support
> > benchmarking of MMX copy loops (like the ones in arch/i386/lib/mmx.c).
> > Those seem to be the more interesting memcpy functions on modern
> > systems.
>
> Well the source is there, and the licensing terms are most reasonable.
>
> But then, the source was there eighteen months ago and nothing happened.
> Sigh.
>
> I think in-kernel MMX has fatal drawbacks anyway. Not sure what
> they are - I prefer to pretend that x86 CPUs execute raw C.

MMX isn't useful for anything smaller than about 512 bytes-1K. It's not
useful in interrupt handlers. The list goes on.

2002-09-19 13:18:39

by Hirokazu Takahashi

[permalink] [raw]
Subject: Re: [NFS] Re: [PATCH] zerocopy NFS for 2.5.36

Hello,

> > > details, but I do know that using copy_from_user() is not a real
> > > improvement at least on x86 architecture.
> >
> > The same as bit is easy to explain. Its totally memory bandwidth limited
> > on current x86-32 processors. (Although I'd welcome demonstrations to
> > the contrary on newer toys)
>
> Nope. There are distinct alignment problems with movsl-based
> memcpy on PII and (at least) "Pentium III (Coppermine)", which is
> tested here:
...
> I have various scriptlets which generate the entire matrix.
>
> I think I ended up deciding that we should use movsl _only_
> when both src and dst are 8-byte-aligned. And that when you
> multiply the gain from that by the frequency*size with which
> funny alignments are used by TCP the net gain was 2% or something.

Amazing! I believed 4-byte alignment was enough.
read/write system calls may also see reduced penalties.

> It needs redoing. These differences are really big, and this
> is the kernel's most expensive function.
>
> A little project for someone.

OK, if there is nobody who wants to do it I'll do it by myself.

> The tools are at http://www.zip.com.au/~akpm/linux/cptimer.tar.gz

2002-09-21 11:51:50

by Pavel Machek

[permalink] [raw]
Subject: Re: [PATCH] zerocopy NFS for 2.5.36

Hi!
>
> 1)
> ftp://ftp.valinux.co.jp/pub/people/taka/2.5.36/va10-hwchecksum-2.5.36.patch
> This patch enables hardware checksumming of outgoing packets, including UDP frames.
>
> Can you explain the TCP parts? They look very wrong.
>
> It was discussed long ago that csum_and_copy_from_user() performs
> better than plain copy_from_user() on x86. I do not remember all
> details, but I do know that using copy_from_user() is not a real
> improvement at least on x86 architecture.

Well, if this is the case, we need to #define copy_from_user csum_and_copy_from_user :-).

Pavel
--
I'm [email protected]. "In my country we have almost anarchy and I don't care."
Panos Katsaloulis describing me w.r.t. patents at [email protected]

2002-10-14 05:44:45

by NeilBrown

[permalink] [raw]
Subject: Re: [PATCH] zerocopy NFS for 2.5.36

On Wednesday September 18, [email protected] wrote:
> Hello,
>
> I ported the zerocopy NFS patches against linux-2.5.36.
>

hi,
I finally got around to looking at this.
It looks good.

However it really needs the MSG_MORE support for udp_sendmsg to be
accepted before there is any point merging the rpc/nfsd bits.

Would you like to see if davem is happy with that bit first and get
it in? Then I will be happy to forward the nfsd specific bit.

The bit I'm not very sure about is the 'shadowsock' patch for having
several xmit sockets, one per CPU. What sort of speedup do you get
from this? How important is it really?

NeilBrown

2002-10-14 06:16:46

by David Miller

[permalink] [raw]
Subject: Re: [PATCH] zerocopy NFS for 2.5.36

From: Neil Brown <[email protected]>
Date: Mon, 14 Oct 2002 15:50:02 +1000

Would you like to see if davem is happy with that bit first and get
it in? Then I will be happy to forward the nfsd specific bit.

Alexey is working on this, or at least he was. :-)
(Alexey this is about the UDP cork changes)

The bit I'm not very sure about is the 'shadowsock' patch for having
several xmit sockets, one per CPU. What sort of speedup do you get
from this? How important is it really?

Personally, it seems rather essential for scalability on SMP.

2002-10-14 10:43:44

by Alexey Kuznetsov

[permalink] [raw]
Subject: Re: [PATCH] zerocopy NFS for 2.5.36

Hello!

> Alexey is working on this, or at least he was. :-)
> (Alexey this is about the UDP cork changes)

I took two patches of the batch:

va10-hwchecksum-2.5.36.patch
va11-udpsendfile-2.5.36.patch

I did not worry about the rest i.e. sunrpc/* part.

Alexey

2002-10-14 10:50:34

by David Miller

[permalink] [raw]
Subject: Re: [PATCH] zerocopy NFS for 2.5.36

From: [email protected]
Date: Mon, 14 Oct 2002 14:45:33 +0400 (MSD)

I took two patches of the batch:

va10-hwchecksum-2.5.36.patch
va11-udpsendfile-2.5.36.patch

I did not worry about the rest i.e. sunrpc/* part.

Neil and the NFS folks can take care of those parts
once the generic UDP parts are in.

So, no worries.

2002-10-14 12:07:11

by Hirokazu Takahashi

[permalink] [raw]
Subject: Re: [PATCH] zerocopy NFS for 2.5.36

Hello, Neil

> > I ported the zerocopy NFS patches against linux-2.5.36.
>
> hi,
> I finally got around to looking at this.
> It looks good.

Thanks!

> However it really needs the MSG_MORE support for udp_sendmsg to be
> accepted before there is any point merging the rpc/nfsd bits.
>
> Would you like to see if davem is happy with that bit first and get
> it in? Then I will be happy to forward the nfsd specific bit.

Yes.

> The bit I'm not very sure about is the 'shadowsock' patch for having
> several xmit sockets, one per CPU. What sort of speedup do you get
> from this? How important is it really?

It's not so important.

davem> Personally, it seems rather essential for scalability on SMP.

Yes.
It will be effective on large-scale SMP machines, as all the kNFSd threads
share one NFS port. A UDP socket can't send data on each CPU at the same
time while the MSG_MORE/UDP_CORK options are set:
the UDP socket has to block all other requests while a UDP frame is being built.


Thank you,
Hirokazu Takahashi.

2002-10-14 14:01:15

by Andrew Theurer

[permalink] [raw]
Subject: Re: [PATCH] zerocopy NFS for 2.5.36

> Hello, Neil
>
> > > I ported the zerocopy NFS patches against linux-2.5.36.
> >
> > hi,
> > I finally got around to looking at this.
> > It looks good.
>
> Thanks!
>
> > However it really needs the MSG_MORE support for udp_sendmsg to be
> > accepted before there is any point merging the rpc/nfsd bits.
> >
> > Would you like to see if davem is happy with that bit first and get
> > it in? Then I will be happy to forward the nfsd specific bit.
>
> Yes.
>
> > The bit I'm not very sure about is the 'shadowsock' patch for having
> > several xmit sockets, one per CPU. What sort of speedup do you get
> > from this? How important is it really?
>
> It's not so important.
>
> davem> Personally, it seems rather essential for scalability on SMP.
>
> Yes.
> It will be effective on large scale SMP machines as all kNFSd shares
> one NFS port. A udp socket can't send data on each CPU at the same
> time while MSG_MORE/UDP_CORK options are set.
> The UDP socket have to block any other requests during making a UDP frame.

I experienced this exact problem a few months ago. I had a test where
several clients read a file or files cached on a Linux server. TCP was just
fine; I could get 100% CPU on all CPUs on the server. TCP zerocopy was even
better, by about 50% throughput. UDP could not get better than 33% CPU: one
CPU working on those UDP requests and, I assume, a portion of another CPU
handling some interrupt stuff. Essentially, 2P and 4P throughput was only as
good as UP throughput. It is essential to get scaling on UDP. With that
combined with UDP zerocopy, we will have one extremely fast NFS server.

Andrew Theurer
IBM LTC

2002-10-16 03:38:32

by NeilBrown

[permalink] [raw]
Subject: Re: [PATCH] zerocopy NFS for 2.5.36

On Monday October 14, [email protected] wrote:
> > The bit I'm not very sure about is the 'shadowsock' patch for having
> > several xmit sockets, one per CPU. What sort of speedup do you get
> > from this? How important is it really?
>
> It's not so important.
>
> davem> Personally, it seems rather essential for scalability on SMP.
>
> Yes.
> It will be effective on large scale SMP machines as all kNFSd shares
> one NFS port. A udp socket can't send data on each CPU at the same
> time while MSG_MORE/UDP_CORK options are set.
> The UDP socket have to block any other requests during making a UDP frame.
>

After thinking about this some more, I suspect it would have to be
quite large-scale SMP to get much contention.
The only contention on the udp socket is, as you say, assembling a udp
frame, and I would be surprised if that takes a substantial fraction
of the time to handle a request.

Presumably, on an SMP machine large enough for this to become an
issue, there would be multiple NICs. Maybe it would make sense to
have one udp socket for each NIC. Would that make sense? or work?
It feels to me to be cleaner than one for each CPU.

NeilBrown

2002-10-16 04:32:25

by David Miller

[permalink] [raw]
Subject: Re: [PATCH] zerocopy NFS for 2.5.36

From: Neil Brown <[email protected]>
Date: Wed, 16 Oct 2002 13:44:04 +1000

Presumably on a sufficiently large SMP machine that this became an
issue, there would be multiple NICs. Maybe it would make sense to
have one udp socket for each NIC. Would that make sense? or work?
It feels to me to be cleaner than one for each CPU.

Doesn't make much sense.

Usually we are talking via one IP address, and thus over
one device. It could be using multiple NICs via BONDING,
but that would be transparent to anything at the socket
level.

Really, I think there is real value to making the socket
per-cpu even on a 2 or 4 way system.

2002-10-16 11:14:46

by Hirokazu Takahashi

[permalink] [raw]
Subject: Re: [PATCH] zerocopy NFS for 2.5.36

Hello,

> > It will be effective on large scale SMP machines as all kNFSd shares
> > one NFS port. A udp socket can't send data on each CPU at the same
> > time while MSG_MORE/UDP_CORK options are set.
> > The UDP socket have to block any other requests during making a UDP frame.
> >

> After thinking about this some more, I suspect it would have to be
> quite large scale SMP to get much contention.

I have no idea how much contention will happen. I haven't checked its
performance on large-scale SMP yet, as I don't have such machines.

Can anyone help us?

> The only contention on the udp socket is, as you say, assembling a udp
> frame, and I would be surprised if that takes a substantial fraction
> of the time to handle a request.

After assembling a udp frame, kNFSd may drive a NIC to transmit the frame.

> Presumably on a sufficiently large SMP machine that this became an
> issue, there would be multiple NICs. Maybe it would make sense to
> have one udp socket for each NIC. Would that make sense? or work?

Several CPUs often share one GbE NIC today, as a NIC can handle more data
than one CPU can; the CPU seems likely to become the bottleneck.
Personally I guess several CPUs will share one 10GbE NIC in the near
future, even on high-end machines. (It's just my guess.)

But I don't know how effective this patch will be...

davem> Doesn't make much sense.
davem>
davem> Usually we are talking via one IP address, and thus over
davem> one device. It could be using multiple NICs via BONDING,
davem> but that would be transparent to anything at the socket
davem> level.
davem>
davem> Really, I think there is real value to making the socket
davem> per-cpu even on a 2 or 4 way system.

I wish so.


2002-10-16 15:11:04

by Andrew Theurer

[permalink] [raw]
Subject: Re: [PATCH] zerocopy NFS for 2.5.36

On Tuesday 15 October 2002 11:31 pm, David S. Miller wrote:
> From: Neil Brown <[email protected]>
> Date: Wed, 16 Oct 2002 13:44:04 +1000
>
> Presumably on a sufficiently large SMP machine that this became an
> issue, there would be multiple NICs. Maybe it would make sense to
> have one udp socket for each NIC. Would that make sense? or work?
> It feels to me to be cleaner than one for each CPU.
>
> Doesn't make much sense.
>
> Usually we are talking via one IP address, and thus over
> one device. It could be using multiple NICs via BONDING,
> but that would be transparent to anything at the socket
> level.
>
> Really, I think there is real value to making the socket
> per-cpu even on a 2 or 4 way system.

I am trying my best today to get a 4-way system up and running for this test.
IMO, per-CPU is best; with just one socket, I seriously could not get over
33% CPU utilization on a 4-way (back in April). With TCP, I could max it
out. I'll update later today, hopefully with some promising results.

-Andrew

2002-10-16 16:58:47

by kaza

[permalink] [raw]
Subject: Re: [PATCH] zerocopy NFS for 2.5.36

Hello,

On Wed, Oct 16, 2002 at 08:09:00PM +0900,
Hirokazu Takahashi-san wrote:
> > After thinking about this some more, I suspect it would have to be
> > quite large scale SMP to get much contention.
>
> I have no idea how much contention will happen. I haven't checked the
> performance of it on large scale SMP yet as I don't have such a great
> machines.
>
> Does anyone help us?

Why don't you propose the performance test to OSDL? (OSDL-J would be
better, I think.) OSDL provides hardware resources and operations staff.
If you want, I can help you to propose it. :-)

--
Ko Kazaana / editor-in-chief of "TechStyle" ( http://techstyle.jp/ )
GnuPG Fingerprint = 1A50 B204 46BD EE22 2E8C 903F F2EB CEA7 4BCF 808F

2002-10-17 04:30:57

by Randy.Dunlap

[permalink] [raw]
Subject: Re: [PATCH] zerocopy NFS for 2.5.36

On Thu, 17 Oct 2002 [email protected] wrote:

| Hello,
|
| On Wed, Oct 16, 2002 at 08:09:00PM +0900,
| Hirokazu Takahashi-san wrote:
| > > After thinking about this some more, I suspect it would have to be
| > > quite large scale SMP to get much contention.
| >
| > I have no idea how much contention will happen. I haven't checked the
| > performance of it on large scale SMP yet as I don't have such a great
| > machines.
| >
| > Does anyone help us?
|
| Why don't you propose the performance test to OSDL? (OSDL-J would be
| better, I think.) OSDL provides hardware resources and operations staff.

and why do you say that? 8;)

| If you want, I can help you to propose it. :-)

That's the right thing to do.

--
~Randy