2002-09-18 08:14:31

by Hirokazu Takahashi

[permalink] [raw]
Subject: [PATCH] zerocopy NFS for 2.5.36

Hello,

I ported the zerocopy NFS patches against linux-2.5.36.

I made va05-zerocopy-nfsdwrite-2.5.36.patch more generic,
so that it would be easy to merge with NFSv4. Each procedure can
now choose whether or not it accepts split buffers.
I also fixed a problem where nfsd couldn't handle very large
NFS symlink requests.


1)
ftp://ftp.valinux.co.jp/pub/people/taka/2.5.36/va10-hwchecksum-2.5.36.patch
This patch enables HW-checksum against outgoing packets including UDP frames.

2)
ftp://ftp.valinux.co.jp/pub/people/taka/2.5.36/va11-udpsendfile-2.5.36.patch
This patch makes the sendfile system call work over UDP. It also supports
the UDP_CORK interface, which is very similar to TCP_CORK, and you can call
sendmsg/sendfile with the MSG_MORE flag on UDP sockets.
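
For illustration, a minimal user-space sketch of what this patch is meant to
allow (the destination, header and file are made up, the file is assumed to
fit in one datagram, and on an unpatched kernel sendfile() on a UDP socket
simply fails):

#include <sys/socket.h>
#include <sys/sendfile.h>
#include <sys/stat.h>
#include <netinet/in.h>
#include <netinet/udp.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

#ifndef UDP_CORK
#define UDP_CORK 1                      /* missing from old libc headers */
#endif

int send_file_over_udp(const char *path, struct sockaddr_in *dst)
{
        int one = 1, zero = 0;
        int sock = socket(AF_INET, SOCK_DGRAM, 0);
        int fd = open(path, O_RDONLY);
        struct stat st;
        char hdr[] = "reply header";
        off_t off = 0;

        if (sock < 0 || fd < 0 || fstat(fd, &st) < 0)
                return -1;
        connect(sock, (struct sockaddr *)dst, sizeof(*dst));

        /* cork the socket: everything below is queued as one datagram */
        setsockopt(sock, IPPROTO_UDP, UDP_CORK, &one, sizeof(one));

        /* small header copied from user space, more data to follow */
        send(sock, hdr, strlen(hdr), MSG_MORE);

        /* file payload goes out of the page cache without an extra copy */
        sendfile(sock, fd, &off, st.st_size);

        /* uncork: the single UDP frame is transmitted now */
        setsockopt(sock, IPPROTO_UDP, UDP_CORK, &zero, sizeof(zero));

        close(fd);
        close(sock);
        return 0;
}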

3)
ftp://ftp.valinux.co.jp/pub/people/taka/2.5.36/va-csumpartial-fix-2.5.36.patch
This patch fixes the problem that the x86 csum_partial() routines
can't handle odd-addressed buffers.

4)
ftp://ftp.valinux.co.jp/pub/people/taka/2.5.36/va01-zerocopy-rpc-2.5.36.patch
This patch lets RPC send pieces of data and pages without copying.

5)
ftp://ftp.valinux.co.jp/pub/people/taka/2.5.36/va02-zerocopy-nfsdread-2.5.36.patch
This patch makes NFSD send pages from the page cache directly when NFS clients
request file reads.

6)
ftp://ftp.valinux.co.jp/pub/people/taka/2.5.36/va03-zerocopy-nfsdreaddir-2.5.36.patch
nfsd_readdir can also send pages without copying.

7)
ftp://ftp.valinux.co.jp/pub/people/taka/2.5.36/va04-zerocopy-shadowsock-2.5.36.patch
This patch creates per-CPU UDP sockets so that NFSD can send UDP frames on
each processor simultaneously.
Without it we can send only one UDP frame at a time, as a UDP socket
has to be locked while pages are being sent through it, to serialize them.
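
The idea is roughly the following (only a conceptual sketch, not the actual
va04 patch; the array and helper names are invented):

#include <linux/smp.h>
#include <linux/net.h>

/* one transmit socket per CPU, so nfsd threads running on different
 * processors no longer serialize on a single socket lock */
static struct socket *nfsd_tx_sock[NR_CPUS];

static struct socket *nfsd_pick_tx_sock(void)
{
        return nfsd_tx_sock[smp_processor_id()];
}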

8)
ftp://ftp.valinux.co.jp/pub/people/taka/2.5.36/va05-zerocopy-nfsdwrite-2.5.36.patch
This patch makes NFS write use the writev interface. NFSd can handle NFS
requests without reassembling IP fragments into one UDP frame.
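
As a reminder of the interface involved, a plain user-space writev() call
looks like this (buffers and file name are arbitrary); the patch lets nfsd
hand the still-fragmented request data to the filesystem in the same
scattered form:

#include <sys/uio.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

int scattered_write(const char *path)
{
        /* three separate buffers, written with one system call */
        char a[] = "first fragment ", b[] = "second fragment ", c[] = "third\n";
        struct iovec vec[3] = {
                { .iov_base = a, .iov_len = strlen(a) },
                { .iov_base = b, .iov_len = strlen(b) },
                { .iov_base = c, .iov_len = strlen(c) },
        };
        int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);

        if (fd < 0)
                return -1;
        writev(fd, vec, 3);
        close(fd);
        return 0;
}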

9)
ftp://ftp.valinux.co.jp/pub/people/taka/2.5.36/taka-writev-2.5.36.patch
This patch makes writev on regular files work faster.
It also can be found at
http://www.zip.com.au/~akpm/linux/patches/2.5/2.5.35/2.5.35-mm1/broken-out/

Caution:
XFS doesn't support the writev interface yet. NFS write on XFS might
slow down with the No.8 patch. I hope the SGI folks will implement it.

10)
ftp://ftp.valinux.co.jp/pub/people/taka/2.5.36/va07-nfsbigbuf-2.5.36.patch
This makes the NFS buffer much bigger (60KB).
A 60KB buffer costs the Linux kernel the same as a 32KB one, as both of
them end up requiring a 64KB chunk from the allocator.


11)
ftp://ftp.valinux.co.jp/pub/people/taka/2.5.36/va09-zerocopy-tempsendto-2.5.36.patch
If you don't want to use sendfile over UDP yet, you can apply this instead of the No.1 and No.2 patches.



Regards,
Hirokazu Takahashi


2002-09-21 11:56:42

by Pavel Machek

[permalink] [raw]
Subject: Re: [PATCH] zerocopy NFS for 2.5.36

Hi!
>
> 1)
> ftp://ftp.valinux.co.jp/pub/people/taka/2.5.36/va10-hwchecksum-2.5.36.patch
> This patch enables HW-checksum against outgoing packets including UDP frames.
>
> Can you explain the TCP parts? They look very wrong.
>
> It was discussed long ago that csum_and_copy_from_user() performs
> better than plain copy_from_user() on x86. I do not remember all
> details, but I do know that using copy_from_user() is not a real
> improvement at least on x86 architecture.

Well, if this is the case, we need to #define copy_from_user csum_and_copy_from_user :-).

Pavel
--
I'm [email protected]. "In my country we have almost anarchy and I don't care."
Panos Katsaloulis describing me w.r.t. patents at [email protected]



2002-09-18 23:00:57

by David Miller

[permalink] [raw]
Subject: Re: [PATCH] zerocopy NFS for 2.5.36

From: Hirokazu Takahashi <[email protected]>
Date: Wed, 18 Sep 2002 17:14:31 +0900 (JST)


1)
ftp://ftp.valinux.co.jp/pub/people/taka/2.5.36/va10-hwchecksum-2.5.36.patch
This patch enables HW-checksum against outgoing packets including UDP frames.

Can you explain the TCP parts? They look very wrong.

It was discussed long ago that csum_and_copy_from_user() performs
better than plain copy_from_user() on x86. I do not remember all
details, but I do know that using copy_from_user() is not a real
improvement at least on x86 architecture.

The rest of the changes (ie. the getfrag() logic to set
skb->ip_summed) looks fine.

3)
ftp://ftp.valinux.co.jp/pub/people/taka/2.5.36/va-csumpartial-fix-2.5.36.patch
This patch fixes the problem that the x86 csum_partial() routines
can't handle odd-addressed buffers.

I've sent Linus this fix already.

2002-09-18 23:53:29

by Alan

[permalink] [raw]
Subject: Re: [PATCH] zerocopy NFS for 2.5.36

On Thu, 2002-09-19 at 00:00, David S. Miller wrote:
> It was discussed long ago that csum_and_copy_from_user() performs
> better than plain copy_from_user() on x86. I do not remember all

The 'better' was a freak of PPro/PII scheduling, I think

> details, but I do know that using copy_from_user() is not a real
> improvement at least on x86 architecture.

The 'same as' bit is easy to explain. It's totally memory-bandwidth limited
on current x86-32 processors. (Although I'd welcome demonstrations to
the contrary on newer toys)




2002-09-19 00:16:43

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH] zerocopy NFS for 2.5.36

Alan Cox wrote:
>
> On Thu, 2002-09-19 at 00:00, David S. Miller wrote:
> > It was discussed long ago that csum_and_copy_from_user() performs
> > better than plain copy_from_user() on x86. I do not remember all
>
> The better was a freak of PPro/PII scheduling I think
>
> > details, but I do know that using copy_from_user() is not a real
> > improvement at least on x86 architecture.
>
> The same as bit is easy to explain. Its totally memory bandwidth limited
> on current x86-32 processors. (Although I'd welcome demonstrations to
> the contrary on newer toys)

Nope. There are distinct alignment problems with movsl-based
memcpy on PII and (at least) "Pentium III (Coppermine)", which is
tested here:

copy_32 uses movsl. copy_duff just uses a stream of "movl"s

Time uncached-to-uncached memcpy, source and dest are 8-byte-aligned:

akpm:/usr/src/cptimer> ./cptimer -d -s
nbytes=10240 from_align=0, to_align=0
copy_32: copied 19.1 Mbytes in 0.078 seconds at 243.9 Mbytes/sec
__copy_duff: copied 19.1 Mbytes in 0.090 seconds at 211.1 Mbytes/sec

OK, movsl wins. But now give the source address 8+1 alignment:

akpm:/usr/src/cptimer> ./cptimer -d -s -f 1
nbytes=10240 from_align=1, to_align=0
copy_32: copied 19.1 Mbytes in 0.158 seconds at 120.8 Mbytes/sec
__copy_duff: copied 19.1 Mbytes in 0.091 seconds at 210.3 Mbytes/sec

The "movl"-based copy wins. By miles.

Make the source 8+4 aligned:

akpm:/usr/src/cptimer> ./cptimer -d -s -f 4
nbytes=10240 from_align=4, to_align=0
copy_32: copied 19.1 Mbytes in 0.134 seconds at 142.1 Mbytes/sec
__copy_duff: copied 19.1 Mbytes in 0.089 seconds at 214.0 Mbytes/sec

So movl still beats movsl, by lots.

I have various scriptlets which generate the entire matrix.

I think I ended up deciding that we should use movsl _only_
when both src and dst are 8-byte-aligned. And that when you
multiply the gain from that by the frequency*size with which
funny alignments are used by TCP the net gain was 2% or something.

It needs redoing. These differences are really big, and this
is the kernel's most expensive function.

A little project for someone.

The tools are at http://www.zip.com.au/~akpm/linux/cptimer.tar.gz
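
A minimal user-space sketch of the same kind of measurement (this is not
cptimer: it times whatever memcpy libc provides, from warm caches, with
arbitrary sizes, so it only shows the shape of the experiment):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>

#define BUFSZ (10 * 1024)
#define ITERS 20000

static double copy_rate(const char *src, char *dst)
{
        struct timeval t0, t1;
        double secs;
        int i;

        gettimeofday(&t0, NULL);
        for (i = 0; i < ITERS; i++)
                memcpy(dst, src, BUFSZ);
        gettimeofday(&t1, NULL);
        secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;
        return (double)BUFSZ * ITERS / secs / 1e6;      /* Mbytes/sec */
}

int main(void)
{
        /* extra slack so the source can be shifted by a few bytes */
        char *src = calloc(1, BUFSZ + 64);
        char *dst = calloc(1, BUFSZ + 64);
        int offsets[] = { 0, 1, 4 };
        int i;

        for (i = 0; i < 3; i++)
                printf("from_align=%d: %.1f Mbytes/sec\n",
                       offsets[i], copy_rate(src + offsets[i], dst));
        free(src);
        free(dst);
        return 0;
}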

2002-09-19 02:13:33

by Aaron Lehmann

[permalink] [raw]
Subject: Re: [PATCH] zerocopy NFS for 2.5.36

> akpm:/usr/src/cptimer> ./cptimer -d -s
> nbytes=10240 from_align=0, to_align=0
> copy_32: copied 19.1 Mbytes in 0.078 seconds at 243.9 Mbytes/sec
> __copy_duff: copied 19.1 Mbytes in 0.090 seconds at 211.1 Mbytes/sec

It's disappointing that this program doesn't seem to support
benchmarking of MMX copy loops (like the ones in arch/i386/lib/mmx.c).
Those seem to be the more interesting memcpy functions on modern
systems.

2002-09-19 03:31:00

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH] zerocopy NFS for 2.5.36

Aaron Lehmann wrote:
>
> > akpm:/usr/src/cptimer> ./cptimer -d -s
> > nbytes=10240 from_align=0, to_align=0
> > copy_32: copied 19.1 Mbytes in 0.078 seconds at 243.9 Mbytes/sec
> > __copy_duff: copied 19.1 Mbytes in 0.090 seconds at 211.1 Mbytes/sec
>
> It's disappointing that this program doesn't seem to support
> benchmarking of MMX copy loops (like the ones in arch/i386/lib/mmx.c).
> Those seem to be the more interesting memcpy functions on modern
> systems.

Well the source is there, and the licensing terms are most reasonable.

But then, the source was there eighteen months ago and nothing happened.
Sigh.

I think in-kernel MMX has fatal drawbacks anyway. Not sure what
they are - I prefer to pretend that x86 CPUs execute raw C.



2002-09-19 10:38:29

by Alan

[permalink] [raw]
Subject: Re: [PATCH] zerocopy NFS for 2.5.36

On Thu, 2002-09-19 at 04:30, Andrew Morton wrote:
> > It's disappointing that this program doesn't seem to support
> > benchmarking of MMX copy loops (like the ones in arch/i386/lib/mmx.c).
> > Those seem to be the more interesting memcpy functions on modern
> > systems.
>
> Well the source is there, and the licensing terms are most reasonable.
>
> But then, the source was there eighteen months ago and nothing happened.
> Sigh.
>
> I think in-kernel MMX has fatal drawbacks anyway. Not sure what
> they are - I prefer to pretend that x86 CPUs execute raw C.

MMX isn't useful for anything smaller than about 512 bytes to 1K. It's not
useful in interrupt handlers. The list goes on.




2002-09-19 13:15:13

by Hirokazu Takahashi

[permalink] [raw]
Subject: Re: [NFS] Re: [PATCH] zerocopy NFS for 2.5.36

Hello,

> > > details, but I do know that using copy_from_user() is not a real
> > > improvement at least on x86 architecture.
> >
> > The same as bit is easy to explain. Its totally memory bandwidth limited
> > on current x86-32 processors. (Although I'd welcome demonstrations to
> > the contrary on newer toys)
>
> Nope. There are distinct alignment problems with movsl-based
> memcpy on PII and (at least) "Pentium III (Coppermine)", which is
> tested here:
...
> I have various scriptlets which generate the entire matrix.
>
> I think I ended up deciding that we should use movsl _only_
> when both src and dsc are 8-byte-aligned. And that when you
> multiply the gain from that by the frequency*size with which
> funny alignments are used by TCP the net gain was 2% or something.

Amazing! I believed 4-byte alignment was enough.
read/write system calls might also see their copy penalties reduced.

> It needs redoing. These differences are really big, and this
> is the kernel's most expensive function.
>
> A little project for someone.

OK, if nobody else wants to do it, I'll do it myself.

> The tools are at http://www.zip.com.au/~akpm/linux/cptimer.tar.gz

2002-10-18 13:18:22

by Hirokazu Takahashi

[permalink] [raw]
Subject: [PATCH] zerocopy NFS for 2.5.43

--- linux.ORG/include/linux/nfsd/const.h Sat Oct 12 13:22:12 2002
+++ linux/include/linux/nfsd/const.h Sun Oct 13 22:07:37 2030
@@ -20,9 +20,9 @@
#define NFSSVC_MAXVERS 3

/*
- * Maximum blocksize supported by daemon currently at 32K
+ * Maximum blocksize supported by daemon currently at 60K
*/
-#define NFSSVC_MAXBLKSIZE (32*1024)
+#define NFSSVC_MAXBLKSIZE ((60*1024)&~(PAGE_SIZE-1))

#ifdef __KERNEL__


Attachments:
rpcfix2.5.43-2.patch (1.07 kB)
va01-zerocopy-rpc-2.5.43.patch (9.88 kB)
va02-zerocopy-nfsdread-2.5.43.patch (6.54 kB)
va03-zerocopy-nfsdreaddir-2.5.43.patch (1.39 kB)
va04-zerocopy-shadowsock-2.5.43.patch (6.49 kB)
va05-zerocopy-nfsdwrite-2.5.43.patch (18.43 kB)
va07-nfsbigbuf-2.5.43.patch (416.00 B)

2002-10-18 15:24:11

by Andrew Theurer

[permalink] [raw]
Subject: Re: Re: [PATCH] zerocopy NFS for 2.5.36

> > > Congestion avoidance mechanism of NFS clients might cause this
> > > situation. I think the congestion window size is not enough
> > > for high end machines. You can make the window be larger as a
> > > test.
> >
> > The congestion avoidance window is supposed to adapt to the bandwidth
> > that is available. Turn congestion avoidance off if you like, but my
> > experience is that doing so tends to seriously degrade performance as
> > the number of timeouts + resends skyrockets.
>
> Yes, you must be right.
>
> But I guess Andrew may use a great machine so that the transfer rate
> has exceeded the maximum size of the congestion avoidance window.
> Can we determine a preferable maximum window size dynamically?

Is this a concern on the client only? I can run a test with just one client
and see if I can saturate the 100Mbit adapter. If I can, would we need to
make any adjustments then? FYI, at 115 MB/sec total throughput, that's only
2.875 MB/sec for each of the 40 clients. For the TCP result of 181 MB/sec,
that's 4.525 MB/sec, IMO, both of which are comfortable throughputs for a
100Mbit client.

Andrew Theurer




2002-10-19 20:41:50

by Hirokazu Takahashi

[permalink] [raw]
Subject: Re: Re: [PATCH] zerocopy NFS for 2.5.36

Hello,

> > Congestion avoidance mechanism of NFS clients might cause this
> > situation. I think the congestion window size is not enough
> > for high end machines. You can make the window be larger as a
> > test.

> Is this a concern on the client only? I can run a test with just one client
> and see if I can saturate the 100Mbit adapter. If I can, would we need to
> make any adjustments then? FYI, at 115 MB/sec total throughput, that's only
> 2.875 MB/sec for each of the 40 clients. For the TCP result of 181 MB/sec,
> that's 4.525 MB/sec, IMO, both of which are comfortable throughputs for a
> 100Mbit client.

I think it's a client issue. NFS servers don't care about congestion of UDP
traffic and they will try to respond to all NFS requests as fast as they can.

You can try to increase the number of clients or the number of mount points
for a test. It's easy to mount the same directory of the server on some
directories of the client so that each of them can work simultaneously.
# mount -t nfs server:/foo /baa1
# mount -t nfs server:/foo /baa2
# mount -t nfs server:/foo /baa3

Thank you,
Hirokazu Takahashi.



2002-10-22 21:16:23

by Andrew Theurer

[permalink] [raw]
Subject: Re: Re: [PATCH] zerocopy NFS for 2.5.36

On Saturday 19 October 2002 15:34, Hirokazu Takahashi wrote:
> Hello,
>
> > > Congestion avoidance mechanism of NFS clients might cause this
> > > situation. I think the congestion window size is not enough
> > > for high end machines. You can make the window be larger as a
> > > test.
> >
> > Is this a concern on the client only? I can run a test with just one
> > client and see if I can saturate the 100Mbit adapter. If I can, would we
> > need to make any adjustments then? FYI, at 115 MB/sec total throughput,
> > that's only 2.875 MB/sec for each of the 40 clients. For the TCP result
> > of 181 MB/sec, that's 4.525 MB/sec, IMO, both of which are comfortable
> > throughputs for a 100Mbit client.
>
> I think it's a client issue. NFS servers don't care about congestion of UDP
> traffic and they will try to respond to all NFS requests as fast as they can.
>
> You can try to increase the number of clients or the number of mount points
> for a test. It's easy to mount the same directory of the server on some
> directories of the client so that each of them can work simultaneously.
> # mount -t nfs server:/foo /baa1
> # mount -t nfs server:/foo /baa2
> # mount -t nfs server:/foo /baa3

I don't think it is a client congestion issue at this point. I can run the
test with just one client on UDP and achieve 11.2 MB/sec with just one mount
point. The client has 100 Mbit Ethernet, so should be the upper limit (or
really close). In the 40 client read test, I have only achieved 2.875 MB/sec
per client. That and the fact that there are never more than 2 nfsd threads
in a run state at one time (for UDP only) leads me to believe there is still
a scaling problem on the server for UDP. I will continue to run the test and
poke a prod around. Hopefully something will jump out at me. Thanks for all
the input!

Andrew Theurer



2002-10-23 01:18:57

by NeilBrown

[permalink] [raw]
Subject: Re: [PATCH] zerocopy NFS for 2.5.43

On Friday October 18, [email protected] wrote:
> Hello,
>
> I've ported the zerocopy patches against linux-2.5.43 with
> davem's udp-sendfile patches and your patches which you posted
> on Wed,16 Oct.

Thanks for these...

I have been thinking some more about this, trying to understand the
big picture, and I'm afraid that I think I want some more changes.

In particular, I think it would be good to use 'struct xdr_buf' from
sunrpc/xdr.h instead of svc_buf. This is what the nfs client uses and
we could share some of the infrastructure.

I think this would work quite well for sending read responses as there
is a 'head' iovec for the interesting bits of the packet, an array of
pages for the data, and a 'tail' iovec for the padding.
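
For reference, the layout in question is roughly this (see sunrpc/xdr.h for
the real thing):

struct xdr_buf {
        struct iovec            head[1],        /* RPC header + non-page data */
                                tail[1];        /* Appended after page data */

        struct page **          pages;          /* Array of contiguous pages */
        unsigned int            page_base,      /* Start of page data */
                                page_len;       /* Length of page data */

        unsigned int            len;            /* Total length of data */
};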

I'm not certain about receiving write requests.
I imagine that it might work to:
1/ call xdr_partial_copy_from_skb to just copy the first 1K from the
skb into the head iovec, and hold onto the skbuf (like we
currently do).
2/ enter the nfs server to parse that header.
3/ When the server finds it needs more data for a write, it
collects the pages and calls xdr_partial_copy_from_skb
to copy the rest of the skb directly into the page cache.

Does that make any sense?

Also, I am wondering about the way that you put zero-copy support into
nfsd_readdir.

Presumably the gain is that sock_sendmsg does a copy into a
skbuf and then a DMA out of that, while ->sendpage does just the DMA.
In that case, maybe it would be better to get "struct page *" pointers
for the pages in the default buffer, and pass them to
->sendpage.
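
i.e. something along these lines (a sketch only: the helper and the way it
walks the nfsd buffer are hypothetical, and error handling is omitted):

#include <linux/kernel.h>
#include <linux/mm.h>
#include <linux/net.h>

static int svc_send_buffer_pages(struct socket *sock, char *buf, size_t len)
{
        size_t done = 0;

        while (done < len) {
                struct page *page = virt_to_page(buf + done);
                int offset = (unsigned long)(buf + done) & ~PAGE_MASK;
                size_t chunk = min_t(size_t, len - done, PAGE_SIZE - offset);

                /* DMA straight out of the existing buffer, no skb copy */
                sock->ops->sendpage(sock, page, offset, chunk,
                                    done + chunk < len ? MSG_MORE : 0);
                done += chunk;
        }
        return 0;
}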


I would like to get to a situation where we don't need to do a 64K
kmalloc for each server, but can work entirely with individual pages.

I might try converting svcsock etc to use xdr_buf later today or
tomorrow unless I hear a good reason why it won't work, or someone
else beats me to it...

NeilBrown



2002-10-23 04:00:26

by Hirokazu Takahashi

[permalink] [raw]
Subject: Re: Re: [PATCH] zerocopy NFS for 2.5.43

Hello,

> > I've ported the zerocopy patches against linux-2.5.43 with
> > davem's udp-sendfile patches and your patches which you posted
> > on Wed,16 Oct.
>
> Thanks for these...
>
> I have been thinking some more about this, trying to understand the
> big picture, and I'm afraid that I think I want some more changes.
>
> In particular, I think it would be good to use 'struct xdr_buf' from
> sunrpc/xdr.h instead of svc_buf. This is what the nfs client uses and
> we could share some of the infrastructure.

It sounds good that they share the same infrastructure.
I agree with your approach.

> I think this would work quite well for sending read responses as there
> is a 'head' iovec for the interesting bits of the packet, an array of
> pages for the data, and a 'tail' iovec for the padding.

One thing I'm wondering about is that the xdr_buf can't handle NFSv4 compound
operations correctly yet. I don't know what will happen if we send some
page data and some non-page data together, as NFSv4 will try to pack several
operations into one xdr_buf.

If we care about NFSv4 it could be like this:

struct svc_buf {
        u32 *           area;           /* allocated memory */
        u32 *           base;           /* base of RPC datagram */
        int             buflen;         /* total length of buffer */
        u32 *           buf;            /* read/write pointer */
        int             len;            /* current end of buffer */

        struct xdr_buf  iov[I_HAVE_NO_IDEA_HOW_MANY_IOVs_NFSV4_REQUIRES];
        int             nriov;
}

I guess it would be better to fix NFSv4 problems after Halloween.

> I'm not certain about receiving write requests.
> I imagine that it might work to:
> 1/ call xdr_partial_copy_from_skb to just copy the first 1K from the
> skb into the head iovec, and hold onto the skbuf (like we
> currently do).
> 2/ enter the nfs server to parse that header.
> 3/ When the server finds it needs more data for a write, it
> collects the pages and calls xdr_partial_copy_from_skb
> to copy the rest of the skb directly into the page cache.

I think it will be hard work; it amounts to writing another
generic_file_write function, which feels like overkill.
e.g. We must read a page in if it isn't in the cache.
We must allocate disk blocks if the file doesn't have them yet X-(
Some filesystems like XFS have their own way of updating the page cache.

We should keep kNFSd away from the implementation of VM/FS
as much as we can.

> Does that make any sense?
>
> Also, I am wondering about the way that you put zero-copy support into
> nfsd_readdir.
>
> Presumably the gain is that sock_sendmsg does a copy into a
> skbuf and then a DMA out of that, while ->sendpage does just the DMA.
> In that case, maybe it would be better to get "struct page *" pointers
> for the pages in the default buffer, and pass them to
> ->sendpage.

It seems a good idea.

The problem is that it's hard to know when the page will be released.
The page will be held by the TCP/IP stack. TCP may hold it for a while
for retransmission, and UDP packets may also be held in the driver queue
after ->sendpage has finished.

We should check the reference count of the default buffer and
decide whether to use that buffer or allocate a new one.
I think almost every request can use the default buffer.

> I would like to get the a situation where we don't need to do a 64K
> kmalloc for each server, but can work entirely with individual pages.
>
> I might try converting svcsock etc to use xdr_buf later today or
> tomorrow unless I heard a good reason why it wont work, or someone
> else beats me to it...

If you don't mind, I'll take care of the readdir stuff
while you're fighting with the xdr_buf stuff.

Thank you,
Hirokazu Takahashi



2002-10-23 05:48:17

by Hirokazu Takahashi

[permalink] [raw]
Subject: Re: Re: [PATCH] zerocopy NFS for 2.5.43

Hello,

> > Also, I am wondering about the way that you put zero-copy support into
> > nfsd_readdir.
> >
> > Presumably the gain is that sock_sendmsg does a copy into a
> > skbuf and then a DMA out of that, while ->sendpage does just the DMA.
> > In that case, maybe it would be better to get "struct page *" pointers
> > for the pages in the default buffer, and pass them to
> > ->sendpage.
>
> It seems good idea.
>
> The problem is that it's hard to know when the page will be released.
> The page will be held by TCP/IP stack. TCP may hold it for a while
> by way of retransmition. UDP pakcets may also held in driver-queue
> after ->sendpage has done.
>
> We should check reference count of the default buffer and
> decide to use the buffer or allocate new one.
> We think Almost request can use the default buffer.

I mean we can't use a page from the default buffer.
We should use the page next to the default buffer, or we should
prepare another page for nfsd_readdir.

I don't know whether allocating an extra page for each server
is good or not.
What do you think about it?

> > I would like to get the a situation where we don't need to do a 64K
> > kmalloc for each server, but can work entirely with individual pages.



2002-10-23 06:03:52

by NeilBrown

[permalink] [raw]
Subject: Re: Re: [PATCH] zerocopy NFS for 2.5.43

On Wednesday October 23, [email protected] wrote:
> Hello,
>
> > > Also, I am wondering about the way that you put zero-copy support into
> > > nfsd_readdir.
> > >
> > > Presumably the gain is that sock_sendmsg does a copy into a
> > > skbuf and then a DMA out of that, while ->sendpage does just the DMA.
> > > In that case, maybe it would be better to get "struct page *" pointers
> > > for the pages in the default buffer, and pass them to
> > > ->sendpage.
> >
> > It seems good idea.
> >
> > The problem is that it's hard to know when the page will be released.
> > The page will be held by TCP/IP stack. TCP may hold it for a while
> > by way of retransmition. UDP pakcets may also held in driver-queue
> > after ->sendpage has done.
> >
> > We should check reference count of the default buffer and
> > decide to use the buffer or allocate new one.
> > We think Almost request can use the default buffer.
>
> I mean we can't use a page in the default buffer.
> We should use the page next to the default buffer or we should
> prepare another page for nfsd_readdir.
>
> I don't know whether allocating an extra page for each server
> is good or not.
> How do you think about it?

I think I would change the approach to buffering.
Instead of having a fixed set of pages, we just allocate new pages as
needed, having handed old ones over to the networking layer.

So we have a pool of pages that we draw from when generating replies,
and refill before accepting a new request.
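
Something like this, perhaps (purely a sketch of the refill idea; the pool
structure and names are invented):

#include <linux/mm.h>

#define SVC_POOL_PAGES 16

struct svc_page_pool {                  /* hypothetical, per nfsd thread */
        struct page *pages[SVC_POOL_PAGES];
        int nr;
};

/* pages handed to the network layer are simply forgotten by the pool;
 * this tops it up again before the next request is accepted */
static int svc_refill_pages(struct svc_page_pool *pool)
{
        while (pool->nr < SVC_POOL_PAGES) {
                struct page *p = alloc_page(GFP_KERNEL);

                if (!p)
                        return -ENOMEM;
                pool->pages[pool->nr++] = p;
        }
        return 0;
}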

Of course that is a fairly big change from where we are now, so it might
take a while. We should probably get zero-copy reads in first...

NeilBrown



2002-10-23 06:10:54

by NeilBrown

[permalink] [raw]
Subject: Re: Re: [PATCH] zerocopy NFS for 2.5.43

On Wednesday October 23, [email protected] wrote:
>
> I'm wondering one point that the xdr_buf can't hanldle NFSv4 compound
> operation correctly yet. I don't know what will happen if we send some
> page data and some non-page data together as it will try to pack some
> operations in one xdr_buf.
>
> If we care about NFSv4 it could be like this:
>
> struct svc_buf {
> u32 * area; /* allocated memory */
> u32 * base; /* base of RPC datagram */
> int buflen; /* total length of buffer */
> u32 * buf; /* read/write pointer */
> int len; /* current end of buffer */
>
> struct xdr_buf iov[I_HAVE_NO_IDEA_HOW_MANY_IOVs_NFSV4_REQUIRES];
> int nriov;
> }
>
> I guess it would be better to fix NFSv4 problems after Halloween.
>

Hmm. I wonder what plans there are for this w.r.t. the NFSv4 client.
Andy? Trond?

I suspect that COMPOUNDS with multiple READ or WRITE requests would be
fairly rare, and it would probably be reasonable to respond with
ERESOURCE (or however it is spelt).

i.e. Reject any operation that would need to use a second set of pages
in a response.


> > I'm not certain about receiving write requests.
> > I imagine that it might work to:
> > 1/ call xdr_partial_copy_from_skb to just copy the first 1K from the
> > skb into the head iovec, and hold onto the skbuf (like we
> > currently do).
> > 2/ enter the nfs server to parse that header.
> > 3/ When the server finds it needs more data for a write, it
> > collects the pages and calls xdr_partial_copy_from_skb
> > to copy the rest of the skb directly into the page cache.
>
> I think it will be hard work that it's the same that we make another
> generic_file_write function. I feel it may be overkill.
> e.g. We must read a page if it isn't on the cache.
> We must allocate disk blocks if the file don't have yet X-(
> Some filesytems like XFS have its own way of updating pagecache.
>
> We should make kNFSd keep away from the implementation of VM/FS
> as possible as we can.

Could we not use 'mmap'? Maybe not, and probably best to avoid it as
you say.

I was thinking it would be nice to be able to do the udp-checksum at
the same time as the copy-into-page-cache, but maybe we just say that
you need a NIC that does checksums if you want to do single-copy NFS
writes.

NeilBrown



2002-10-23 07:15:33

by Hirokazu Takahashi

[permalink] [raw]
Subject: Re: Re: [PATCH] zerocopy NFS for 2.5.43

Hello,

> > If we care about NFSv4 it could be like this:
> >
> > struct svc_buf {
> > u32 * area; /* allocated memory */
> > u32 * base; /* base of RPC datagram */
> > int buflen; /* total length of buffer */
> > u32 * buf; /* read/write pointer */
> > int len; /* current end of buffer */
> >
> > struct xdr_buf iov[I_HAVE_NO_IDEA_HOW_MANY_IOVs_NFSV4_REQUIRES];
> > int nriov;
> > }
> >
> > I guess it would be better to fix NFSv4 problems after Halloween.
> >
>
> Hmm. I wonder what plans there are for this w.r.t. to NFSv4 client.
> Andy? Trond?
>
> I suspect that COMPOUNDS with multiple READ or WRITE requests would be
> fairly rare, and it would probably be reasonable to respond with
> ERESOURCE (or however it is spelt).

Yeah, It might be.

> i.e. Reject any operation that would need to use a second set of pages
> in a response.

> > > I'm not certain about receiving write requests.
> > > I imagine that it might work to:
> > > 1/ call xdr_partial_copy_from_skb to just copy the first 1K from the
> > > skb into the head iovec, and hold onto the skbuf (like we
> > > currently do).
> > > 2/ enter the nfs server to parse that header.
> > > 3/ When the server finds it needs more data for a write, it
> > > collects the pages and calls xdr_partial_copy_from_skb
> > > to copy the rest of the skb directly into the page cache.
> >
> > I think it will be hard work that it's the same that we make another
> > generic_file_write function. I feel it may be overkill.
> > e.g. We must read a page if it isn't on the cache.
> > We must allocate disk blocks if the file don't have yet X-(
> > Some filesytems like XFS have its own way of updating pagecache.
> >
> > We should make kNFSd keep away from the implementation of VM/FS
> > as possible as we can.
>
> Could we not use 'mmap'? Maybe not, and probably best to avoid it as
> you say.

Using mmap sounds interesting to me and I was thinking about it.

Regular mmap will cause a lot of blocks to be read from disk on each page fault,
as the fault handler can't know what size of write will happen after the fault.
That read is wasted if the write covers the whole 4KB page, which will often
happen with NFS.

Standard write/writev can handle this without reading blocks in.

> I was thinking it would be nice to be able to do the udp-checksum at
> the same time as the copy-into-page-cache, but maybe we just say that
> you need a NIC that does checksums if you want to do single-copy NFS
> writes.

Or we can enhance the standard generic_file_write() to assign a
copy-routine like this:

generic_file_write(file, buf, count, ppos, nfsd_write_actor);
generic_file_writev(file, iovec, nr_segs, ppos, nfsd_write_actor);

nfsd_write_actor(struct page *page, int offset, ......)
{
        xdr_partial_copy_from_skb(.....)
}

But I realized there is one big problem with both approaches.
What can we do when the checksum turns out to be wrong?
The pages will already have been filled with broken data.

Thank you,
Hirokazu Takahashi.



2002-10-23 09:29:25

by Hirokazu Takahashi

[permalink] [raw]
Subject: Re: [NFS] Re: [PATCH] zerocopy NFS for 2.5.36

Hi,

> > > > Congestion avoidance mechanism of NFS clients might cause this
> > > > situation. I think the congestion window size is not enough
> > > > for high end machines. You can make the window be larger as a
> > > > test.

> I don't think it is a client congestion issue at this point. I can run the
> test with just one client on UDP and achieve 11.2 MB/sec with just one mount
> point. The client has 100 Mbit Ethernet, so should be the upper limit (or
> really close). In the 40 client read test, I have only achieved 2.875 MB/sec
> per client. That and the fact that there are never more than 2 nfsd threads
> in a run state at one time (for UDP only) leads me to believe there is still
> a scaling problem on the server for UDP. I will continue to run the test and
> poke a prod around. Hopefully something will jump out at me. Thanks for all
> the input!

Can you check /proc/net/rpc/nfsd, which shows how many NFS requests have
been retransmitted?

# cat /proc/net/rpc/nfsd
rc 0 27680 162118
^^^
This field means the clients have retransmitted packets.
The transfer rate will slow down once that has happened.
It may occur if the response from the server is slower than the
clients expect.

And you can use an older version - e.g. the linux-2.4 series - for the clients
and see what happens, as older versions don't have any intelligent
features.

Thank you,
Hirokazu Takahashi.

2002-10-23 15:23:30

by Trond Myklebust

[permalink] [raw]
Subject: Re: Re: [PATCH] zerocopy NFS for 2.5.43

>>>>> " " == Neil Brown <[email protected]> writes:

> Hmm. I wonder what plans there are for this w.r.t. to NFSv4
> client. Andy? Trond?

There's really no need for anything beyond what we have. There's no
call in the client for stringing more than one set of pages together:
you are always dealing with a single set of contiguous pages to
read/write.

> I suspect that COMPOUNDS with multiple READ or WRITE requests
> would be fairly rare, and it would probably be reasonable to
> respond with ERESOURCE (or however it is spelt).

Alternatively, you could add a list_head to the xdr_buf struct so that
you can string several of them together. Frankly, though, it would be
a rather strange NFSv4 client that wants to do this sort of
operation. There's just no advantage to it...

> I was thinking it would be nice to be able to do the
> udp-checksum at the same time as the copy-into-page-cache, but
> maybe we just say that you need a NIC that does checksums if
> you want to do single-copy NFS writes.

Right. The very last thing you want to do is to copy into the page
cache, then find out that the checksum didn't match up.

Cheers,
Trond



2002-10-23 21:57:42

by Hirokazu Takahashi

[permalink] [raw]
Subject: Re: Re: [PATCH] zerocopy NFS for 2.5.43

Hello,

> In particular, I think it would be good to use 'struct xdr_buf' from
> sunrpc/xdr.h instead of svc_buf. This is what the nfs client uses and
> we could share some of the infrastructure.

I was thinking about the NFS clients.
Why don't we make xprt_sendmsg() use the sendpage interface instead
of calling sock_sendmsg(), so that we can avoid the deadlock that
multiple kmap()s in xprt_sendmsg() might cause on heavily loaded machines?


Thank you,
Hirokazu Takahashi.




2002-10-23 22:42:34

by Hirokazu Takahashi

[permalink] [raw]
Subject: Re: Re: [PATCH] zerocopy NFS for 2.5.43

Hello,

> > > > Also, I am wondering about the way that you put zero-copy support into
> > > > nfsd_readdir.

> I think I would change the approach to buffering.
> Instead of having a fixed set of pages, we just allocate new pages as
> needed, having handed old ones over to the networking layer.
>
> So we have a pool of pages that we draw from when generating replies,
> and refill before accepting a new request.

We can also put RPC/NFS headers on pages and send them without copying.
This seems good for NFSv4 COMPOUNDs.

> Ofcourse that is a fairly big change from where we are now so it might
> take a while. We should probably get zero copy reads in first...

Yes.



2002-10-23 23:55:42

by Trond Myklebust

[permalink] [raw]
Subject: Re: Re: [PATCH] zerocopy NFS for 2.5.43

>>>>> " " == Hirokazu Takahashi <[email protected]> writes:

> I was thinking about the nfs clients. Why don't we make
> xprt_sendmsg() use the sendpage interface instead of calling
> sock_sendmsg() so that we can avoid dead-lock which multiple
> kmap()s in xprt_sendmsg() might cause on heavily loaded
> machines.

I'm definitely in favour of such a change. Particularly so if the UDP
interface is ready.

Cheers,
Trond



2002-10-24 01:41:15

by Hirokazu Takahashi

[permalink] [raw]
Subject: Re: Re: [PATCH] zerocopy NFS for 2.5.43

Hello,

> > I was thinking about the nfs clients. Why don't we make
> > xprt_sendmsg() use the sendpage interface instead of calling
> > sock_sendmsg() so that we can avoid dead-lock which multiple
> > kmap()s in xprt_sendmsg() might cause on heavily loaded
> > machines.
>
> I'm definitely in favour of such a change. Particularly so if the UDP
> interface is ready.

I've implemented it and we can find it in linux-2.5.44.
The interface is the same as the TCP's one.

Thank you,
Hirokazu Takahashi.



2002-10-25 10:00:12

by Hirokazu Takahashi

[permalink] [raw]
Subject: Re: Re: [PATCH] zerocopy NFS for 2.5.43

Hello,

> I have been thinking some more about this, trying to understand the
> big picture, and I'm afraid that I think I want some more changes.
>
> In particular, I think it would be good to use 'struct xdr_buf' from
> sunrpc/xdr.h instead of svc_buf. This is what the nfs client uses and
> we could share some of the infrastructure.

I just realized it would be hard to use the xdr_buf as it couldn't
handle data in a socket buffer. Each socket buffer consists of
some non-page data and some pages, and each of them might have its
own offset and length.

> I'm not certain about receiving write requests.
> I imagine that it might work to:
> 1/ call xdr_partial_copy_from_skb to just copy the first 1K from the
> skb into the head iovec, and hold onto the skbuf (like we
> currently do).

And I came up with another idea: kNFSd could handle TCP data
in a socket buffer directly, without copying, if we can enhance
tcp_read_sock() not to release the data while kNFSd is using it.
kNFSd would handle TCP data as if it were a UDP datagram.
The differences are that kNFSd may grab several TCP socket buffers at once,
and that the buffers may be shared with other kNFSd's.
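
Roughly along these lines (only a sketch of the actor idea; where the pinned
skbs get recorded per request is left out):

#include <linux/skbuff.h>
#include <net/tcp.h>

/* a tcp_read_sock() actor that pins each skb instead of copying it,
 * so nfsd could parse the data in place and release it afterwards */
static int svc_tcp_grab_actor(read_descriptor_t *desc, struct sk_buff *skb,
                              unsigned int offset, size_t len)
{
        skb_get(skb);           /* hold the skb for the nfsd thread */
        /* ... remember (skb, offset, len) for the request being built ... */
        desc->count -= len;
        return len;             /* tell TCP how much we consumed */
}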

Thank you,
Hirokazu Takahashi.



2002-10-25 17:23:28

by Trond Myklebust

[permalink] [raw]
Subject: Re: Re: [PATCH] zerocopy NFS for 2.5.43

>>>>> " " == Hirokazu Takahashi <[email protected]> writes:


>> In particular, I think it would be good to use 'struct xdr_buf'
>> from sunrpc/xdr.h instead of svc_buf. This is what the nfs
>> client uses and we could share some of the infrastructure.

> I just realized it would be hard to use the xdr_buf as it
> couldn't handle data in a socket buffer. Each socket buffer
> consists of some non-page data and some pages and each of them
> might have its own offset and length.

Then the following trivial modification would be quite sufficient

struct xdr_buf {
        struct list_head        list;           /* Further xdr_buf */
        struct iovec            head[1],        /* RPC header + non-page data */
                                tail[1];        /* Appended after page data */

        struct page **          pages;          /* Array of contiguous pages */
        unsigned int            page_base,      /* Start of page data */
                                page_len;       /* Length of page data */

        unsigned int            len;            /* Total length of data */

};

With equally trivial fixes to xdr_kmap() and friends. None of this
needs to affect existing client usage, and may in fact be useful for
optimizing use of v4 COMPOUNDS later.
(I was wrong about this BTW: being able to flush out all the dirty
pages in a file to disk using a single COMPOUND would indeed be worth
the trouble once we've managed to drop UDP as the primary NFS
transport mechanism. For one thing, you would only tie up a single
nfsd thread when writing to the file)

Cheers,
Trond



2002-10-25 20:09:14

by NeilBrown

[permalink] [raw]
Subject: Re: Re: [PATCH] zerocopy NFS for 2.5.43

On Friday October 25, [email protected] wrote:
> Hello,
>
> > I have been thinking some more about this, trying to understand the
> > big picture, and I'm afraid that I think I want some more changes.
> >
> > In particular, I think it would be good to use 'struct xdr_buf' from
> > sunrpc/xdr.h instead of svc_buf. This is what the nfs client uses and
> > we could share some of the infrastructure.
>
> I just realized it would be hard to use the xdr_buf as it couldn't
> handle data in a socket buffer. Each socket buffer consists of
> some non-page data and some pages, and each of them might have its
> own offset and length.

You would only want this for single-copy write requests - right?

I think we have to treat them as a special case and pass the skbuf all
the way up to nfsd in that case.
You would only want to try this if:
   The NIC had verified the checksum
   The packet was some minimum size (1K? 1 PAGE ??)
   We were using AUTH_UNIX, nothing more interesting like crypto
     security
   The first fragment was some minimum size (size of a write without
     the data).

I would make a special 'fast-path' for that case which didn't copy any
data but passed a skbuf up, and code in nfs*xdr.c would convert that
into an iovec[].

I am working on a patch which changes rpcsvc to use xdr_buf. Some of
it works. Some doesn't. I include it below for your reference.
I repeat: it doesn't work yet.
Once it is done, adding the rest of zero-copy should be fairly easy.

>
> > I'm not certain about receiving write requests.
> > I imagine that it might work to:
> > 1/ call xdr_partial_copy_from_skb to just copy the first 1K from the
> > skb into the head iovec, and hold onto the skbuf (like we
> > currently do).
>
> And I came up with another idea: kNFSd could handle TCP data
> in a socket buffer directly, without copying, if we can enhance
> tcp_read_sock() not to release the data while kNFSd is using it.
> kNFSd would handle TCP data as if it were a UDP datagram.
> The differences are that kNFSd may grab several TCP socket buffers at once,
> and that the buffers may be shared with other kNFSd's.

That might work... though TCP doesn't have the same concept of a
'packet' that UDP does. You might end up with a socket buffer that had
all of one request and part of the next... still, I'm sure it is
possible.

NeilBrown


-----incomplete, buggy, don't-use-it patch starts here----
--- ./fs/nfsd/nfssvc.c 2002/10/21 03:23:44 1.2
+++ ./fs/nfsd/nfssvc.c 2002/10/25 05:08:01
@@ -277,7 +277,8 @@ nfsd_dispatch(struct svc_rqst *rqstp, u3

/* Decode arguments */
xdr = proc->pc_decode;
- if (xdr && !xdr(rqstp, rqstp->rq_argbuf.buf, rqstp->rq_argp)) {
+ if (xdr && !xdr(rqstp, (u32*)rqstp->rq_arg.head[0].iov_base,
+ rqstp->rq_argp)) {
dprintk("nfsd: failed to decode arguments!\n");
nfsd_cache_update(rqstp, RC_NOCACHE, NULL);
*statp = rpc_garbage_args;
@@ -293,14 +294,15 @@ nfsd_dispatch(struct svc_rqst *rqstp, u3
}

if (rqstp->rq_proc != 0)
- svc_putu32(&rqstp->rq_resbuf, nfserr);
+ svc_putu32(&rqstp->rq_res.head[0], nfserr);

/* Encode result.
* For NFSv2, additional info is never returned in case of an error.
*/
if (!(nfserr && rqstp->rq_vers == 2)) {
xdr = proc->pc_encode;
- if (xdr && !xdr(rqstp, rqstp->rq_resbuf.buf, rqstp->rq_resp)) {
+ if (xdr && !xdr(rqstp, (u32*)rqstp->rq_res.head[0].iov_base,
+ rqstp->rq_resp)) {
/* Failed to encode result. Release cache entry */
dprintk("nfsd: failed to encode result!\n");
nfsd_cache_update(rqstp, RC_NOCACHE, NULL);
--- ./fs/nfsd/vfs.c 2002/10/24 01:35:37 1.1
+++ ./fs/nfsd/vfs.c 2002/10/24 04:13:31
@@ -571,13 +571,35 @@ found:
}

/*
+ * reduce iovec:
+ * Reduce the effective size of the passed iovec to
+ * match the count
+ */
+static void reduce_iovec(struct iovec *vec, int *vlenp, int count)
+{
+ int vlen = *vlenp;
+ int i;
+
+ i = 0;
+ while (i < vlen && count > vec->iov_len) {
+ count -= vec->iov_len;
+ i++;
+ }
+ if (i >= vlen)
+ return; /* ERROR??? */
+ vec->iov_len -= count;
+ if (count) i++;
+ *vlenp = i;
+}
+
+/*
* Read data from a file. count must contain the requested read count
* on entry. On return, *count contains the number of bytes actually read.
* N.B. After this call fhp needs an fh_put
*/
int
nfsd_read(struct svc_rqst *rqstp, struct svc_fh *fhp, loff_t offset,
- char *buf, unsigned long *count)
+ struct iovec *vec, int vlen, unsigned long *count)
{
struct raparms *ra;
mm_segment_t oldfs;
@@ -601,9 +623,10 @@ nfsd_read(struct svc_rqst *rqstp, struct
if (ra)
file.f_ra = ra->p_ra;

+ reduce_iovec(vec, &vlen, *count);
oldfs = get_fs();
set_fs(KERNEL_DS);
- err = vfs_read(&file, buf, *count, &offset);
+ err = vfs_readv(&file, vec, vlen, *count, &offset);
set_fs(oldfs);

/* Write back readahead params */
@@ -629,7 +652,8 @@ out:
*/
int
nfsd_write(struct svc_rqst *rqstp, struct svc_fh *fhp, loff_t offset,
- char *buf, unsigned long cnt, int *stablep)
+ struct iovec *vec, int vlen,
+ unsigned long cnt, int *stablep)
{
struct svc_export *exp;
struct file file;
@@ -675,9 +699,10 @@ nfsd_write(struct svc_rqst *rqstp, struc
if (stable && !EX_WGATHER(exp))
file.f_flags |= O_SYNC;

+ reduce_iovec(vec, &vlen, cnt);
/* Write the data. */
oldfs = get_fs(); set_fs(KERNEL_DS);
- err = vfs_write(&file, buf, cnt, &offset);
+ err = vfs_writev(&file, vec, vlen, cnt, &offset);
if (err >= 0)
nfsdstats.io_write += cnt;
set_fs(oldfs);
--- ./fs/nfsd/nfsctl.c 2002/10/21 06:35:17 1.2
+++ ./fs/nfsd/nfsctl.c 2002/10/24 11:22:53
@@ -130,13 +130,12 @@ static int exports_open(struct inode *in
char *namebuf = kmalloc(PAGE_SIZE, GFP_KERNEL);
if (namebuf == NULL)
return -ENOMEM;
- else
- ((struct seq_file *)file->private_data)->private = namebuf;

res = seq_open(file, &nfs_exports_op);
- if (!res)
+ if (res)
kfree(namebuf);
-
+ else
+ ((struct seq_file *)file->private_data)->private = namebuf;
return res;
}
static int exports_release(struct inode *inode, struct file *file)
--- ./fs/nfsd/nfsxdr.c 2002/10/24 01:06:36 1.1
+++ ./fs/nfsd/nfsxdr.c 2002/10/25 05:31:51
@@ -14,6 +14,7 @@
#include <linux/sunrpc/svc.h>
#include <linux/nfsd/nfsd.h>
#include <linux/nfsd/xdr.h>
+#include <linux/mm.h>

#define NFSDDBG_FACILITY NFSDDBG_XDR

@@ -176,27 +177,6 @@ encode_fattr(struct svc_rqst *rqstp, u32
return p;
}

-/*
- * Check buffer bounds after decoding arguments
- */
-static inline int
-xdr_argsize_check(struct svc_rqst *rqstp, u32 *p)
-{
- struct svc_buf *buf = &rqstp->rq_argbuf;
-
- return p - buf->base <= buf->buflen;
-}
-
-static inline int
-xdr_ressize_check(struct svc_rqst *rqstp, u32 *p)
-{
- struct svc_buf *buf = &rqstp->rq_resbuf;
-
- buf->len = p - buf->base;
- dprintk("nfsd: ressize_check p %p base %p len %d\n",
- p, buf->base, buf->buflen);
- return (buf->len <= buf->buflen);
-}

/*
* XDR decode functions
@@ -241,13 +221,29 @@ int
nfssvc_decode_readargs(struct svc_rqst *rqstp, u32 *p,
struct nfsd_readargs *args)
{
+ int len;
+ int v,pn;
if (!(p = decode_fh(p, &args->fh)))
return 0;

args->offset = ntohl(*p++);
- args->count = ntohl(*p++);
- args->totalsize = ntohl(*p++);
+ len = args->count = ntohl(*p++);
+ p++; /* totalcount - unused */

+ /* FIXME range check ->count */
+ /* set up somewhere to store response.
+ * We take pages, put them on reslist and include in iovec
+ */
+ v=0;
+ while (len > 0) {
+ pn=rqstp->rq_resused;
+ take_page(rqstp);
+ args->vec[v].iov_base = page_address(rqstp->rq_respages[pn]);
+ args->vec[v].iov_len = PAGE_SIZE;
+ v++;
+ len -= PAGE_SIZE;
+ }
+ args->vlen = v;
return xdr_argsize_check(rqstp, p);
}

@@ -255,17 +251,27 @@ int
nfssvc_decode_writeargs(struct svc_rqst *rqstp, u32 *p,
struct nfsd_writeargs *args)
{
+ int len;
+ int v;
if (!(p = decode_fh(p, &args->fh)))
return 0;

p++; /* beginoffset */
args->offset = ntohl(*p++); /* offset */
p++; /* totalcount */
- args->len = ntohl(*p++);
- args->data = (char *) p;
- p += XDR_QUADLEN(args->len);
-
- return xdr_argsize_check(rqstp, p);
+ len = args->len = ntohl(*p++);
+ args->vec[0].iov_base = (void*)p;
+ args->vec[0].iov_len = rqstp->rq_arg.head[0].iov_len -
+ (((void*)p) - rqstp->rq_arg.head[0].iov_base);
+ v = 0;
+ while (len > args->vec[v].iov_len) {
+ len -= args->vec[v].iov_len;
+ v++;
+ args->vec[v].iov_base = page_address(rqstp->rq_argpages[v]);
+ args->vec[v].iov_len = PAGE_SIZE;
+ }
+ args->vlen = v+1;
+ return 1; /* FIXME */
}

int
@@ -371,9 +377,22 @@ nfssvc_encode_readres(struct svc_rqst *r
{
p = encode_fattr(rqstp, p, &resp->fh);
*p++ = htonl(resp->count);
- p += XDR_QUADLEN(resp->count);
+ xdr_ressize_check(rqstp, p);

- return xdr_ressize_check(rqstp, p);
+ /* now update rqstp->rq_res to reflect data aswell */
+ rqstp->rq_res.page_base = 0;
+ rqstp->rq_res.page_len = resp->count;
+ if (resp->count & 3) {
+ /* need to pad with tail */
+ rqstp->rq_res.tail[0].iov_base = p;
+ *p = 0;
+ rqstp->rq_res.tail[0].iov_len = 4 - (resp->count&3);
+ }
+ rqstp->rq_res.len =
+ rqstp->rq_res.head[0].iov_len+
+ rqstp->rq_res.page_len+
+ rqstp->rq_res.tail[0].iov_len;
+ return 1;
}

int
--- ./fs/nfsd/nfs3xdr.c 2002/10/24 01:07:00 1.1
+++ ./fs/nfsd/nfs3xdr.c 2002/10/25 05:14:26
@@ -269,27 +269,6 @@ encode_wcc_data(struct svc_rqst *rqstp,
return encode_post_op_attr(rqstp, p, fhp);
}

-/*
- * Check buffer bounds after decoding arguments
- */
-static inline int
-xdr_argsize_check(struct svc_rqst *rqstp, u32 *p)
-{
- struct svc_buf *buf = &rqstp->rq_argbuf;
-
- return p - buf->base <= buf->buflen;
-}
-
-static inline int
-xdr_ressize_check(struct svc_rqst *rqstp, u32 *p)
-{
- struct svc_buf *buf = &rqstp->rq_resbuf;
-
- buf->len = p - buf->base;
- dprintk("nfsd: ressize_check p %p base %p len %d\n",
- p, buf->base, buf->buflen);
- return (buf->len <= buf->buflen);
-}

/*
* XDR decode functions
--- ./fs/nfsd/nfscache.c 2002/10/24 03:37:10 1.1
+++ ./fs/nfsd/nfscache.c 2002/10/24 04:30:23
@@ -41,7 +41,7 @@ static struct svc_cacherep * lru_tail;
static struct svc_cacherep * nfscache;
static int cache_disabled = 1;

-static int nfsd_cache_append(struct svc_rqst *rqstp, struct svc_buf *data);
+static int nfsd_cache_append(struct svc_rqst *rqstp, struct iovec *vec);

/*
* locking for the reply cache:
@@ -107,7 +107,7 @@ nfsd_cache_shutdown(void)

for (rp = lru_head; rp; rp = rp->c_lru_next) {
if (rp->c_state == RC_DONE && rp->c_type == RC_REPLBUFF)
- kfree(rp->c_replbuf.buf);
+ kfree(rp->c_replvec.iov_base);
}

cache_disabled = 1;
@@ -242,8 +242,8 @@ nfsd_cache_lookup(struct svc_rqst *rqstp

/* release any buffer */
if (rp->c_type == RC_REPLBUFF) {
- kfree(rp->c_replbuf.buf);
- rp->c_replbuf.buf = NULL;
+ kfree(rp->c_replvec.iov_base);
+ rp->c_replvec.iov_base = NULL;
}
rp->c_type = RC_NOCACHE;
out:
@@ -272,11 +272,11 @@ found_entry:
case RC_NOCACHE:
break;
case RC_REPLSTAT:
- svc_putu32(&rqstp->rq_resbuf, rp->c_replstat);
+ svc_putu32(&rqstp->rq_res.head[0], rp->c_replstat);
rtn = RC_REPLY;
break;
case RC_REPLBUFF:
- if (!nfsd_cache_append(rqstp, &rp->c_replbuf))
+ if (!nfsd_cache_append(rqstp, &rp->c_replvec))
goto out; /* should not happen */
rtn = RC_REPLY;
break;
@@ -308,13 +308,14 @@ void
nfsd_cache_update(struct svc_rqst *rqstp, int cachetype, u32 *statp)
{
struct svc_cacherep *rp;
- struct svc_buf *resp = &rqstp->rq_resbuf, *cachp;
+ struct iovec *resv = &rqstp->rq_res.head[0], *cachv;
int len;

if (!(rp = rqstp->rq_cacherep) || cache_disabled)
return;

- len = resp->len - (statp - resp->base);
+ len = resv->iov_len - ((char*)statp - (char*)resv->iov_base);
+ len >>= 2;

/* Don't cache excessive amounts of data and XDR failures */
if (!statp || len > (256 >> 2)) {
@@ -329,16 +330,16 @@ nfsd_cache_update(struct svc_rqst *rqstp
rp->c_replstat = *statp;
break;
case RC_REPLBUFF:
- cachp = &rp->c_replbuf;
- cachp->buf = (u32 *) kmalloc(len << 2, GFP_KERNEL);
- if (!cachp->buf) {
+ cachv = &rp->c_replvec;
+ cachv->iov_base = kmalloc(len << 2, GFP_KERNEL);
+ if (!cachv->iov_base) {
spin_lock(&cache_lock);
rp->c_state = RC_UNUSED;
spin_unlock(&cache_lock);
return;
}
- cachp->len = len;
- memcpy(cachp->buf, statp, len << 2);
+ cachv->iov_len = len << 2;
+ memcpy(cachv->iov_base, statp, len << 2);
break;
}
spin_lock(&cache_lock);
@@ -353,19 +354,20 @@ nfsd_cache_update(struct svc_rqst *rqstp

/*
* Copy cached reply to current reply buffer. Should always fit.
+ * FIXME as reply is in a page, we should just attach the page, and
+ * keep a refcount....
*/
static int
-nfsd_cache_append(struct svc_rqst *rqstp, struct svc_buf *data)
+nfsd_cache_append(struct svc_rqst *rqstp, struct iovec *data)
{
- struct svc_buf *resp = &rqstp->rq_resbuf;
+ struct iovec *vec = &rqstp->rq_res.head[0];

- if (resp->len + data->len > resp->buflen) {
+ if (vec->iov_len + data->iov_len > PAGE_SIZE) {
printk(KERN_WARNING "nfsd: cached reply too large (%d).\n",
- data->len);
+ data->iov_len);
return 0;
}
- memcpy(resp->buf, data->buf, data->len << 2);
- resp->buf += data->len;
- resp->len += data->len;
+ memcpy((char*)vec->iov_base + vec->iov_len, data->iov_base, data->iov_len);
+ vec->iov_len += data->iov_len;
return 1;
}
--- ./fs/nfsd/nfsproc.c 2002/10/24 02:23:57 1.1
+++ ./fs/nfsd/nfsproc.c 2002/10/25 05:32:04
@@ -30,11 +30,11 @@ typedef struct svc_buf svc_buf;
#define NFSDDBG_FACILITY NFSDDBG_PROC


-static void
-svcbuf_reserve(struct svc_buf *buf, u32 **ptr, int *len, int nr)
+static inline void
+svcbuf_reserve(struct xdr_buf *buf, u32 **ptr, int *len, int nr)
{
- *ptr = buf->buf + nr;
- *len = buf->buflen - buf->len - nr;
+ *ptr = (u32*)(buf->head[0].iov_base+buf->head[0].iov_len) + nr;
+ *len = ((PAGE_SIZE-buf->head[0].iov_len)>>2) - nr;
}

static int
@@ -109,7 +109,7 @@ nfsd_proc_readlink(struct svc_rqst *rqst
dprintk("nfsd: READLINK %s\n", SVCFH_fmt(&argp->fh));

/* Reserve room for status and path length */
- svcbuf_reserve(&rqstp->rq_resbuf, &path, &dummy, 2);
+ svcbuf_reserve(&rqstp->rq_res, &path, &dummy, 2);

/* Read the symlink. */
resp->len = NFS_MAXPATHLEN;
@@ -127,8 +127,7 @@ static int
nfsd_proc_read(struct svc_rqst *rqstp, struct nfsd_readargs *argp,
struct nfsd_readres *resp)
{
- u32 * buffer;
- int nfserr, avail;
+ int nfserr;

dprintk("nfsd: READ %s %d bytes at %d\n",
SVCFH_fmt(&argp->fh),
@@ -137,22 +136,21 @@ nfsd_proc_read(struct svc_rqst *rqstp, s
/* Obtain buffer pointer for payload. 19 is 1 word for
* status, 17 words for fattr, and 1 word for the byte count.
*/
- svcbuf_reserve(&rqstp->rq_resbuf, &buffer, &avail, 19);

- if ((avail << 2) < argp->count) {
+ if ((32768/*FIXME*/) < argp->count) {
printk(KERN_NOTICE
"oversized read request from %08x:%d (%d bytes)\n",
ntohl(rqstp->rq_addr.sin_addr.s_addr),
ntohs(rqstp->rq_addr.sin_port),
argp->count);
- argp->count = avail << 2;
+ argp->count = 32768;
}
svc_reserve(rqstp, (19<<2) + argp->count + 4);

resp->count = argp->count;
nfserr = nfsd_read(rqstp, fh_copy(&resp->fh, &argp->fh),
argp->offset,
- (char *) buffer,
+ argp->vec, argp->vlen,
&resp->count);

return nfserr;
@@ -175,7 +173,7 @@ nfsd_proc_write(struct svc_rqst *rqstp,

nfserr = nfsd_write(rqstp, fh_copy(&resp->fh, &argp->fh),
argp->offset,
- argp->data,
+ argp->vec, argp->vlen,
argp->len,
&stable);
return nfserr;
@@ -477,7 +475,7 @@ nfsd_proc_readdir(struct svc_rqst *rqstp
argp->count, argp->cookie);

/* Reserve buffer space for status */
- svcbuf_reserve(&rqstp->rq_resbuf, &buffer, &count, 1);
+ svcbuf_reserve(&rqstp->rq_res, &buffer, &count, 1);

/* Shrink to the client read size */
if (count > (argp->count >> 2))
--- ./fs/nfsd/nfs3proc.c 2002/10/24 04:37:41 1.1
+++ ./fs/nfsd/nfs3proc.c 2002/10/25 05:34:44
@@ -43,11 +43,11 @@ static int nfs3_ftypes[] = {
/*
* Reserve room in the send buffer
*/
-static void
-svcbuf_reserve(struct svc_buf *buf, u32 **ptr, int *len, int nr)
+static inline void
+svcbuf_reserve(struct xdr_buf *buf, u32 **ptr, int *len, int nr)
{
- *ptr = buf->buf + nr;
- *len = buf->buflen - buf->len - nr;
+ *ptr = (u32*)(buf->head[0].iov_base+buf->head[0].iov_len) + nr;
+ *len = ((PAGE_SIZE-buf->head[0].iov_len)>>2) - nr;
}

/*
@@ -150,7 +150,7 @@ nfsd3_proc_readlink(struct svc_rqst *rqs
dprintk("nfsd: READLINK(3) %s\n", SVCFH_fmt(&argp->fh));

/* Reserve room for status, post_op_attr, and path length */
- svcbuf_reserve(&rqstp->rq_resbuf, &path, &dummy,
+ svcbuf_reserve(&rqstp->rq_res, &path, &dummy,
1 + NFS3_POST_OP_ATTR_WORDS + 1);

/* Read the symlink. */
@@ -179,7 +179,7 @@ nfsd3_proc_read(struct svc_rqst *rqstp,
* 1 (status) + 22 (post_op_attr) + 1 (count) + 1 (eof)
* + 1 (xdr opaque byte count) = 26
*/
- svcbuf_reserve(&rqstp->rq_resbuf, &buffer, &avail,
+ svcbuf_reserve(&rqstp->rq_res, &buffer, &avail,
1 + NFS3_POST_OP_ATTR_WORDS + 3);
resp->count = argp->count;
if ((avail << 2) < resp->count)
@@ -447,7 +447,7 @@ nfsd3_proc_readdir(struct svc_rqst *rqst
argp->count, (u32) argp->cookie);

/* Reserve buffer space for status, attributes and verifier */
- svcbuf_reserve(&rqstp->rq_resbuf, &buffer, &count,
+ svcbuf_reserve(&rqstp->rq_res, &buffer, &count,
1 + NFS3_POST_OP_ATTR_WORDS + 2);

/* Make sure we've room for the NULL ptr & eof flag, and shrink to
@@ -482,7 +482,7 @@ nfsd3_proc_readdirplus(struct svc_rqst *
argp->count, (u32) argp->cookie);

/* Reserve buffer space for status, attributes and verifier */
- svcbuf_reserve(&rqstp->rq_resbuf, &buffer, &count,
+ svcbuf_reserve(&rqstp->rq_res, &buffer, &count,
1 + NFS3_POST_OP_ATTR_WORDS + 2);

/* Make sure we've room for the NULL ptr & eof flag, and shrink to
--- ./fs/lockd/xdr.c 2002/10/24 01:01:26 1.1
+++ ./fs/lockd/xdr.c 2002/10/25 05:14:36
@@ -216,25 +216,6 @@ nlm_encode_testres(u32 *p, struct nlm_re
return p;
}

-/*
- * Check buffer bounds after decoding arguments
- */
-static inline int
-xdr_argsize_check(struct svc_rqst *rqstp, u32 *p)
-{
- struct svc_buf *buf = &rqstp->rq_argbuf;
-
- return p - buf->base <= buf->buflen;
-}
-
-static inline int
-xdr_ressize_check(struct svc_rqst *rqstp, u32 *p)
-{
- struct svc_buf *buf = &rqstp->rq_resbuf;
-
- buf->len = p - buf->base;
- return (buf->len <= buf->buflen);
-}

/*
* First, the server side XDR functions
--- ./fs/lockd/xdr4.c 2002/10/24 01:05:40 1.1
+++ ./fs/lockd/xdr4.c 2002/10/25 05:14:44
@@ -223,26 +223,6 @@ nlm4_encode_testres(u32 *p, struct nlm_r


/*
- * Check buffer bounds after decoding arguments
- */
-static int
-xdr_argsize_check(struct svc_rqst *rqstp, u32 *p)
-{
- struct svc_buf *buf = &rqstp->rq_argbuf;
-
- return p - buf->base <= buf->buflen;
-}
-
-static int
-xdr_ressize_check(struct svc_rqst *rqstp, u32 *p)
-{
- struct svc_buf *buf = &rqstp->rq_resbuf;
-
- buf->len = p - buf->base;
- return (buf->len <= buf->buflen);
-}
-
-/*
* First, the server side XDR functions
*/
int
--- ./fs/read_write.c 2002/10/24 01:22:09 1.1
+++ ./fs/read_write.c 2002/10/24 02:54:13
@@ -207,6 +207,53 @@ ssize_t vfs_read(struct file *file, char
return ret;
}

+ssize_t vfs_readv(struct file *file, struct iovec *vec, int vlen, size_t count, loff_t *pos)
+{
+ struct inode *inode = file->f_dentry->d_inode;
+ ssize_t ret;
+
+ if (!(file->f_mode & FMODE_READ))
+ return -EBADF;
+ if (!file->f_op || (!file->f_op->read && !file->f_op->aio_read))
+ return -EINVAL;
+
+ ret = locks_verify_area(FLOCK_VERIFY_READ, inode, file, *pos, count);
+ if (!ret) {
+ ret = security_ops->file_permission (file, MAY_READ);
+ if (!ret) {
+ if (file->f_op->readv)
+ ret = file->f_op->readv(file, vec, vlen, pos);
+ else {
+ /* do it by hand */
+ struct iovec *vector = vec;
+ ret = 0;
+ while (vlen > 0) {
+ void * base = vector->iov_base;
+ size_t len = vector->iov_len;
+ ssize_t nr;
+ vector++;
+ vlen--;
+ if (file->f_op->read)
+ nr = file->f_op->read(file, base, len, pos);
+ else
+ nr = do_sync_read(file, base, len, pos);
+ if (nr < 0) {
+ if (!ret) ret = nr;
+ break;
+ }
+ ret += nr;
+ if (nr != len)
+ break;
+ }
+ }
+ if (ret > 0)
+ dnotify_parent(file->f_dentry, DN_ACCESS);
+ }
+ }
+
+ return ret;
+}
+
ssize_t do_sync_write(struct file *filp, const char *buf, size_t len, loff_t *ppos)
{
struct kiocb kiocb;
@@ -247,6 +294,53 @@ ssize_t vfs_write(struct file *file, con
return ret;
}

+ssize_t vfs_writev(struct file *file, const struct iovec *vec, int vlen, size_t count, loff_t *pos)
+{
+ struct inode *inode = file->f_dentry->d_inode;
+ ssize_t ret;
+
+ if (!(file->f_mode & FMODE_WRITE))
+ return -EBADF;
+ if (!file->f_op || (!file->f_op->write && !file->f_op->aio_write))
+ return -EINVAL;
+
+ ret = locks_verify_area(FLOCK_VERIFY_WRITE, inode, file, *pos, count);
+ if (!ret) {
+ ret = security_ops->file_permission (file, MAY_WRITE);
+ if (!ret) {
+ if (file->f_op->writev)
+ ret = file->f_op->writev(file, vec, vlen, pos);
+ else {
+ /* do it by hand */
+ struct iovec *vector = vec;
+ ret = 0;
+ while (vlen > 0) {
+ void * base = vector->iov_base;
+ size_t len = vector->iov_len;
+ ssize_t nr;
+ vector++;
+ vlen--;
+ if (file->f_op->write)
+ nr = file->f_op->write(file, base, len, pos);
+ else
+ nr = do_sync_write(file, base, len, pos);
+ if (nr < 0) {
+ if (!ret) ret = nr;
+ break;
+ }
+ ret += nr;
+ if (nr != len)
+ break;
+ }
+ }
+ if (ret > 0)
+ dnotify_parent(file->f_dentry, DN_MODIFY);
+ }
+ }
+
+ return ret;
+}
+
asmlinkage ssize_t sys_read(unsigned int fd, char * buf, size_t count)
{
struct file *file;
--- ./include/linux/sunrpc/svc.h 2002/10/23 00:38:26 1.1
+++ ./include/linux/sunrpc/svc.h 2002/10/25 05:14:06
@@ -48,43 +48,49 @@ struct svc_serv {
* This is use to determine the max number of pages nfsd is
* willing to return in a single READ operation.
*/
-#define RPCSVC_MAXPAYLOAD 16384u
+#define RPCSVC_MAXPAYLOAD (64*1024u)

/*
- * Buffer to store RPC requests or replies in.
- * Each server thread has one of these beasts.
+ * RPC requests and replies are stored in one or more pages.
+ * We maintain an array of pages for each server thread.
+ * Requests are copied into these pages as they arrive. Remaining
+ * pages are available to write the reply into.
*
- * Area points to the allocated memory chunk currently owned by the
- * buffer. Base points to the buffer containing the request, which is
- * different from area when directly reading from an sk_buff. buf is
- * the current read/write position while processing an RPC request.
+ * Currently pages are all re-used by the same server. Later we
+ * will use ->sendpage to transmit pages with reduced copying. In
+ * that case we will need to give away the page and allocate new ones.
+ * In preparation for this, we explicitly move pages off the recv
+ * list onto the transmit list, and back.
*
- * The array of iovecs can hold additional data that the server process
- * may not want to copy into the RPC reply buffer, but pass to the
- * network sendmsg routines directly. The prime candidate for this
- * will of course be NFS READ operations, but one might also want to
- * do something about READLINK and READDIR. It might be worthwhile
- * to implement some generic readdir cache in the VFS layer...
+ * We use xdr_buf for holding responses as it fits well with NFS
+ * read responses (that have a header, and some data pages, and possibly
+ * a tail) and means we can share some client side routines.
*
- * On the receiving end of the RPC server, the iovec may be used to hold
- * the list of IP fragments once we get to process fragmented UDP
- * datagrams directly.
- */
-#define RPCSVC_MAXIOV ((RPCSVC_MAXPAYLOAD+PAGE_SIZE-1)/PAGE_SIZE + 1)
-struct svc_buf {
- u32 * area; /* allocated memory */
- u32 * base; /* base of RPC datagram */
- int buflen; /* total length of buffer */
- u32 * buf; /* read/write pointer */
- int len; /* current end of buffer */
-
- /* iovec for zero-copy NFS READs */
- struct iovec iov[RPCSVC_MAXIOV];
- int nriov;
-};
-#define svc_getu32(argp, val) { (val) = *(argp)->buf++; (argp)->len--; }
-#define svc_putu32(resp, val) { *(resp)->buf++ = (val); (resp)->len++; }
+ * The xdr_buf.head iovec always points to the first page in the rq_*pages
+ * list. The xdr_buf.pages pointer points to the second page on that
+ * list. xdr_buf.tail points to the end of the first page.
+ * This assumes that the non-page part of an rpc reply will fit
+ * in a page - NFSd ensures this. lockd also has no trouble.
+ */
+#define RPCSVC_MAXPAGES ((RPCSVC_MAXPAYLOAD+PAGE_SIZE-1)/PAGE_SIZE + 1)
+
+static inline u32 svc_getu32(struct iovec *iov)
+{
+ u32 val, *vp;
+ vp = iov->iov_base;
+ val = *vp++;
+ iov->iov_base = (void*)vp;
+ iov->iov_len -= sizeof(u32);
+ return val;
+}
+static inline void svc_putu32(struct iovec *iov, u32 val)
+{
+ u32 *vp = iov->iov_base + iov->iov_len;
+ *vp = val;
+ iov->iov_len += sizeof(u32);
+}

+
/*
* The context of a single thread, including the request currently being
* processed.
@@ -102,9 +108,15 @@ struct svc_rqst {
struct svc_cred rq_cred; /* auth info */
struct sk_buff * rq_skbuff; /* fast recv inet buffer */
struct svc_deferred_req*rq_deferred; /* deferred request we are replaying */
- struct svc_buf rq_defbuf; /* default buffer */
- struct svc_buf rq_argbuf; /* argument buffer */
- struct svc_buf rq_resbuf; /* result buffer */
+
+ struct xdr_buf rq_arg;
+ struct xdr_buf rq_res;
+ struct page * rq_argpages[RPCSVC_MAXPAGES];
+ struct page * rq_respages[RPCSVC_MAXPAGES];
+ short rq_argused; /* pages used for argument */
+ short rq_arghi; /* pages available in argument page list */
+ short rq_resused; /* pages used for result */
+
u32 rq_xid; /* transmission id */
u32 rq_prog; /* program number */
u32 rq_vers; /* program version */
@@ -136,6 +148,38 @@ struct svc_rqst {
wait_queue_head_t rq_wait; /* synchronization */
};

+/*
+ * Check buffer bounds after decoding arguments
+ */
+static inline int
+xdr_argsize_check(struct svc_rqst *rqstp, u32 *p)
+{
+ char *cp = (char *)p;
+ struct iovec *vec = &rqstp->rq_arg.head[0];
+ return cp - (char*)vec->iov_base <= vec->iov_len;
+}
+
+static inline int
+xdr_ressize_check(struct svc_rqst *rqstp, u32 *p)
+{
+ struct iovec *vec = &rqstp->rq_res.head[0];
+ char *cp = (char*)p;
+
+ vec->iov_len = cp - (char*)vec->iov_base;
+ rqstp->rq_res.len = vec->iov_len;
+
+ return vec->iov_len <= PAGE_SIZE;
+}
+
+static int inline take_page(struct svc_rqst *rqstp)
+{
+ if (rqstp->rq_arghi <= rqstp->rq_argused)
+ return -ENOMEM;
+ rqstp->rq_respages[rqstp->rq_resused++] =
+ rqstp->rq_argpages[--rqstp->rq_arghi];
+ return 0;
+}
+
struct svc_deferred_req {
struct svc_serv *serv;
u32 prot; /* protocol (UDP or TCP) */
--- ./include/linux/nfsd/xdr.h 2002/10/24 01:49:48 1.1
+++ ./include/linux/nfsd/xdr.h 2002/10/25 02:21:03
@@ -29,16 +29,16 @@ struct nfsd_readargs {
struct svc_fh fh;
__u32 offset;
__u32 count;
- __u32 totalsize;
+ struct iovec vec[RPCSVC_MAXPAGES];
+ int vlen;
};

struct nfsd_writeargs {
svc_fh fh;
- __u32 beginoffset;
__u32 offset;
- __u32 totalcount;
- __u8 * data;
int len;
+ struct iovec vec[RPCSVC_MAXPAGES];
+ int vlen;
};

struct nfsd_createargs {
--- ./include/linux/nfsd/nfsd.h 2002/10/24 04:04:03 1.1
+++ ./include/linux/nfsd/nfsd.h 2002/10/24 04:13:19
@@ -97,9 +97,9 @@ int nfsd_open(struct svc_rqst *, struct
int, struct file *);
void nfsd_close(struct file *);
int nfsd_read(struct svc_rqst *, struct svc_fh *,
- loff_t, char *, unsigned long *);
+ loff_t, struct iovec *,int, unsigned long *);
int nfsd_write(struct svc_rqst *, struct svc_fh *,
- loff_t, char *, unsigned long, int *);
+ loff_t, struct iovec *,int, unsigned long, int *);
int nfsd_readlink(struct svc_rqst *, struct svc_fh *,
char *, int *);
int nfsd_symlink(struct svc_rqst *, struct svc_fh *,
--- ./include/linux/nfsd/cache.h 2002/10/24 03:41:12 1.1
+++ ./include/linux/nfsd/cache.h 2002/10/24 03:41:35
@@ -32,12 +32,12 @@ struct svc_cacherep {
u32 c_vers;
unsigned long c_timestamp;
union {
- struct svc_buf u_buffer;
+ struct iovec u_vec;
u32 u_status;
} c_u;
};

-#define c_replbuf c_u.u_buffer
+#define c_replvec c_u.u_vec
#define c_replstat c_u.u_status

/* cache entry states */
--- ./include/linux/fs.h 2002/10/24 01:34:48 1.1
+++ ./include/linux/fs.h 2002/10/24 02:53:14
@@ -793,6 +793,8 @@ struct seq_file;

extern ssize_t vfs_read(struct file *, char *, size_t, loff_t *);
extern ssize_t vfs_write(struct file *, const char *, size_t, loff_t *);
+extern ssize_t vfs_readv(struct file *, struct iovec *, int, size_t, loff_t *);
+extern ssize_t vfs_writev(struct file *, const struct iovec *, int, size_t, loff_t *);

/*
* NOTE: write_inode, delete_inode, clear_inode, put_inode can be called
--- ./net/sunrpc/svc.c 2002/10/23 12:35:50 1.1
+++ ./net/sunrpc/svc.c 2002/10/25 05:41:14
@@ -13,6 +13,7 @@
#include <linux/net.h>
#include <linux/in.h>
#include <linux/unistd.h>
+#include <linux/mm.h>

#include <linux/sunrpc/types.h>
#include <linux/sunrpc/xdr.h>
@@ -35,7 +36,6 @@ svc_create(struct svc_program *prog, uns

if (!(serv = (struct svc_serv *) kmalloc(sizeof(*serv), GFP_KERNEL)))
return NULL;
-
memset(serv, 0, sizeof(*serv));
serv->sv_program = prog;
serv->sv_nrthreads = 1;
@@ -105,35 +105,41 @@ svc_destroy(struct svc_serv *serv)
}

/*
- * Allocate an RPC server buffer
- * Later versions may do nifty things by allocating multiple pages
- * of memory directly and putting them into the bufp->iov.
+ * Allocate an RPC server's buffer space.
+ * We allocate pages and place them in rq_argpages.
*/
-int
-svc_init_buffer(struct svc_buf *bufp, unsigned int size)
+static int
+svc_init_buffer(struct svc_rqst *rqstp, unsigned int size)
{
- if (!(bufp->area = (u32 *) kmalloc(size, GFP_KERNEL)))
- return 0;
- bufp->base = bufp->area;
- bufp->buf = bufp->area;
- bufp->len = 0;
- bufp->buflen = size >> 2;
-
- bufp->iov[0].iov_base = bufp->area;
- bufp->iov[0].iov_len = size;
- bufp->nriov = 1;
-
- return 1;
+ int pages = 2 + (size+ PAGE_SIZE -1) / PAGE_SIZE;
+ int arghi;
+
+ rqstp->rq_argused = 0;
+ rqstp->rq_resused = 0;
+ arghi = 0;
+ while (pages) {
+ struct page *p = alloc_page(GFP_KERNEL);
+ if (!p)
+ break;
+ printk("allocated page %d (%d to go)\n", arghi, pages-1);
+ rqstp->rq_argpages[arghi++] = p;
+ pages--;
+ }
+ rqstp->rq_arghi = arghi;
+ return ! pages;
}

/*
* Release an RPC server buffer
*/
-void
-svc_release_buffer(struct svc_buf *bufp)
+static void
+svc_release_buffer(struct svc_rqst *rqstp)
{
- kfree(bufp->area);
- bufp->area = 0;
+ while (rqstp->rq_arghi)
+ put_page(rqstp->rq_argpages[--rqstp->rq_arghi]);
+ while (rqstp->rq_resused)
+ put_page(rqstp->rq_respages[--rqstp->rq_resused]);
+ rqstp->rq_argused = 0;
}

/*
@@ -154,7 +160,7 @@ svc_create_thread(svc_thread_fn func, st

if (!(rqstp->rq_argp = (u32 *) kmalloc(serv->sv_xdrsize, GFP_KERNEL))
|| !(rqstp->rq_resp = (u32 *) kmalloc(serv->sv_xdrsize, GFP_KERNEL))
- || !svc_init_buffer(&rqstp->rq_defbuf, serv->sv_bufsz))
+ || !svc_init_buffer(rqstp, serv->sv_bufsz))
goto out_thread;

serv->sv_nrthreads++;
@@ -180,7 +186,7 @@ svc_exit_thread(struct svc_rqst *rqstp)
{
struct svc_serv *serv = rqstp->rq_server;

- svc_release_buffer(&rqstp->rq_defbuf);
+ svc_release_buffer(rqstp);
if (rqstp->rq_resp)
kfree(rqstp->rq_resp);
if (rqstp->rq_argp)
@@ -242,37 +248,49 @@ svc_process(struct svc_serv *serv, struc
struct svc_program *progp;
struct svc_version *versp = NULL; /* compiler food */
struct svc_procedure *procp = NULL;
- struct svc_buf * argp = &rqstp->rq_argbuf;
- struct svc_buf * resp = &rqstp->rq_resbuf;
+ struct iovec * argv = &rqstp->rq_arg.head[0];
+ struct iovec * resv = &rqstp->rq_res.head[0];
kxdrproc_t xdr;
- u32 *bufp, *statp;
+ u32 *statp;
u32 dir, prog, vers, proc,
auth_stat, rpc_stat;

rpc_stat = rpc_success;
- bufp = argp->buf;

- if (argp->len < 5)
+ if (argv->iov_len < 6*4)
goto err_short_len;

- dir = ntohl(*bufp++);
- vers = ntohl(*bufp++);
+ /* setup response xdr_buf.
+ * Initially it has just one page
+ */
+ take_page(rqstp); /* must succeed */
+ resv->iov_base = page_address(rqstp->rq_respages[0]);
+ resv->iov_len = 0;
+ rqstp->rq_res.pages = rqstp->rq_respages+1;
+ rqstp->rq_res.len = 0;
+ /* tcp needs a space for the record length... */
+ if (rqstp->rq_prot == IPPROTO_TCP)
+ svc_putu32(resv, 0);
+
+ rqstp->rq_xid = svc_getu32(argv);
+ svc_putu32(resv, rqstp->rq_xid);
+
+ dir = ntohl(svc_getu32(argv));
+ vers = ntohl(svc_getu32(argv));

/* First words of reply: */
- svc_putu32(resp, xdr_one); /* REPLY */
- svc_putu32(resp, xdr_zero); /* ACCEPT */
+ svc_putu32(resv, xdr_one); /* REPLY */

if (dir != 0) /* direction != CALL */
goto err_bad_dir;
if (vers != 2) /* RPC version number */
goto err_bad_rpc;

- rqstp->rq_prog = prog = ntohl(*bufp++); /* program number */
- rqstp->rq_vers = vers = ntohl(*bufp++); /* version number */
- rqstp->rq_proc = proc = ntohl(*bufp++); /* procedure number */
+ svc_putu32(resv, xdr_zero); /* ACCEPT */

- argp->buf += 5;
- argp->len -= 5;
+ rqstp->rq_prog = prog = ntohl(svc_getu32(argv)); /* program number */
+ rqstp->rq_vers = vers = ntohl(svc_getu32(argv)); /* version number */
+ rqstp->rq_proc = proc = ntohl(svc_getu32(argv)); /* procedure number */

/*
* Decode auth data, and add verifier to reply buffer.
@@ -307,8 +325,8 @@ svc_process(struct svc_serv *serv, struc
serv->sv_stats->rpccnt++;

/* Build the reply header. */
- statp = resp->buf;
- svc_putu32(resp, rpc_success); /* RPC_SUCCESS */
+ statp = resv->iov_base +resv->iov_len;
+ svc_putu32(resv, rpc_success); /* RPC_SUCCESS */

/* Bump per-procedure stats counter */
procp->pc_count++;
@@ -327,14 +345,14 @@ svc_process(struct svc_serv *serv, struc
if (!versp->vs_dispatch) {
/* Decode arguments */
xdr = procp->pc_decode;
- if (xdr && !xdr(rqstp, rqstp->rq_argbuf.buf, rqstp->rq_argp))
+ if (xdr && !xdr(rqstp, argv->iov_base, rqstp->rq_argp))
goto err_garbage;

*statp = procp->pc_func(rqstp, rqstp->rq_argp, rqstp->rq_resp);

/* Encode reply */
if (*statp == rpc_success && (xdr = procp->pc_encode)
- && !xdr(rqstp, rqstp->rq_resbuf.buf, rqstp->rq_resp)) {
+ && !xdr(rqstp, resv->iov_base+resv->iov_len, rqstp->rq_resp)) {
dprintk("svc: failed to encode reply\n");
/* serv->sv_stats->rpcsystemerr++; */
*statp = rpc_system_err;
@@ -347,7 +365,7 @@ svc_process(struct svc_serv *serv, struc

/* Check RPC status result */
if (*statp != rpc_success)
- resp->len = statp + 1 - resp->base;
+ resv->iov_len = ((void*)statp) - resv->iov_base + 4;

/* Release reply info */
if (procp->pc_release)
@@ -369,7 +387,7 @@ svc_process(struct svc_serv *serv, struc

err_short_len:
#ifdef RPC_PARANOIA
- printk("svc: short len %d, dropping request\n", argp->len);
+ printk("svc: short len %d, dropping request\n", argv->iov_len);
#endif
goto dropit; /* drop request */

@@ -382,18 +400,19 @@ err_bad_dir:

err_bad_rpc:
serv->sv_stats->rpcbadfmt++;
- resp->buf[-1] = xdr_one; /* REJECT */
- svc_putu32(resp, xdr_zero); /* RPC_MISMATCH */
- svc_putu32(resp, xdr_two); /* Only RPCv2 supported */
- svc_putu32(resp, xdr_two);
+ svc_putu32(resv, xdr_one); /* REJECT */
+ svc_putu32(resv, xdr_zero); /* RPC_MISMATCH */
+ svc_putu32(resv, xdr_two); /* Only RPCv2 supported */
+ svc_putu32(resv, xdr_two);
goto sendit;

err_bad_auth:
dprintk("svc: authentication failed (%d)\n", ntohl(auth_stat));
serv->sv_stats->rpcbadauth++;
- resp->buf[-1] = xdr_one; /* REJECT */
- svc_putu32(resp, xdr_one); /* AUTH_ERROR */
- svc_putu32(resp, auth_stat); /* status */
+ resv->iov_len -= 4;
+ svc_putu32(resv, xdr_one); /* REJECT */
+ svc_putu32(resv, xdr_one); /* AUTH_ERROR */
+ svc_putu32(resv, auth_stat); /* status */
goto sendit;

err_bad_prog:
@@ -403,7 +422,7 @@ err_bad_prog:
/* else it is just a Solaris client seeing if ACLs are supported */
#endif
serv->sv_stats->rpcbadfmt++;
- svc_putu32(resp, rpc_prog_unavail);
+ svc_putu32(resv, rpc_prog_unavail);
goto sendit;

err_bad_vers:
@@ -411,9 +430,9 @@ err_bad_vers:
printk("svc: unknown version (%d)\n", vers);
#endif
serv->sv_stats->rpcbadfmt++;
- svc_putu32(resp, rpc_prog_mismatch);
- svc_putu32(resp, htonl(progp->pg_lovers));
- svc_putu32(resp, htonl(progp->pg_hivers));
+ svc_putu32(resv, rpc_prog_mismatch);
+ svc_putu32(resv, htonl(progp->pg_lovers));
+ svc_putu32(resv, htonl(progp->pg_hivers));
goto sendit;

err_bad_proc:
@@ -421,7 +440,7 @@ err_bad_proc:
printk("svc: unknown procedure (%d)\n", proc);
#endif
serv->sv_stats->rpcbadfmt++;
- svc_putu32(resp, rpc_proc_unavail);
+ svc_putu32(resv, rpc_proc_unavail);
goto sendit;

err_garbage:
@@ -429,6 +448,6 @@ err_garbage:
printk("svc: failed to decode args\n");
#endif
serv->sv_stats->rpcbadfmt++;
- svc_putu32(resp, rpc_garbage_args);
+ svc_putu32(resv, rpc_garbage_args);
goto sendit;
}
--- ./net/sunrpc/svcsock.c 2002/10/21 23:40:50 1.2
+++ ./net/sunrpc/svcsock.c 2002/10/25 07:22:30
@@ -234,7 +234,7 @@ svc_sock_received(struct svc_sock *svsk)
*/
void svc_reserve(struct svc_rqst *rqstp, int space)
{
- space += rqstp->rq_resbuf.len<<2;
+ space += rqstp->rq_res.head[0].iov_len;

if (space < rqstp->rq_reserved) {
struct svc_sock *svsk = rqstp->rq_sock;
@@ -278,13 +278,12 @@ svc_sock_release(struct svc_rqst *rqstp)
* But first, check that enough space was reserved
* for the reply, otherwise we have a bug!
*/
- if ((rqstp->rq_resbuf.len<<2) > rqstp->rq_reserved)
+ if ((rqstp->rq_res.len) > rqstp->rq_reserved)
printk(KERN_ERR "RPC request reserved %d but used %d\n",
rqstp->rq_reserved,
- rqstp->rq_resbuf.len<<2);
+ rqstp->rq_res.len);

- rqstp->rq_resbuf.buf = rqstp->rq_resbuf.base;
- rqstp->rq_resbuf.len = 0;
+ rqstp->rq_res.head[0].iov_len = 0;
svc_reserve(rqstp, 0);
rqstp->rq_sock = NULL;

@@ -480,13 +479,15 @@ svc_write_space(struct sock *sk)
/*
* Receive a datagram from a UDP socket.
*/
+extern int
+csum_partial_copy_to_xdr(struct xdr_buf *xdr, struct sk_buff *skb);
+
static int
svc_udp_recvfrom(struct svc_rqst *rqstp)
{
struct svc_sock *svsk = rqstp->rq_sock;
struct svc_serv *serv = svsk->sk_server;
struct sk_buff *skb;
- u32 *data;
int err, len;

if (test_and_clear_bit(SK_CHNGBUF, &svsk->sk_flags))
@@ -512,33 +513,19 @@ svc_udp_recvfrom(struct svc_rqst *rqstp)
}
set_bit(SK_DATA, &svsk->sk_flags); /* there may be more data... */

- /* Sorry. */
- if (skb_is_nonlinear(skb)) {
- if (skb_linearize(skb, GFP_KERNEL) != 0) {
- kfree_skb(skb);
- svc_sock_received(svsk);
- return 0;
- }
- }
+ len = skb->len - sizeof(struct udphdr);

- if (skb->ip_summed != CHECKSUM_UNNECESSARY) {
- if ((unsigned short)csum_fold(skb_checksum(skb, 0, skb->len, skb->csum))) {
- skb_free_datagram(svsk->sk_sk, skb);
- svc_sock_received(svsk);
- return 0;
- }
+ if (csum_partial_copy_to_xdr(&rqstp->rq_arg, skb)) {
+ /* checksum error */
+ skb_free_datagram(svsk->sk_sk, skb);
+ svc_sock_received(svsk);
+ return 0;
}


- len = skb->len - sizeof(struct udphdr);
- data = (u32 *) (skb->data + sizeof(struct udphdr));
-
- rqstp->rq_skbuff = skb;
- rqstp->rq_argbuf.base = data;
- rqstp->rq_argbuf.buf = data;
- rqstp->rq_argbuf.len = (len >> 2);
- rqstp->rq_argbuf.buflen = (len >> 2);
- /* rqstp->rq_resbuf = rqstp->rq_defbuf; */
+ rqstp->rq_arg.len = len;
+ rqstp->rq_arg.page_len = len - rqstp->rq_arg.head[0].iov_len;
+ rqstp->rq_argused += (rqstp->rq_arg.page_len + PAGE_SIZE - 1)/ PAGE_SIZE;
rqstp->rq_prot = IPPROTO_UDP;

/* Get sender address */
@@ -546,6 +533,8 @@ svc_udp_recvfrom(struct svc_rqst *rqstp)
rqstp->rq_addr.sin_port = skb->h.uh->source;
rqstp->rq_addr.sin_addr.s_addr = skb->nh.iph->saddr;

+ skb_free_datagram(svsk->sk_sk, skb);
+
if (serv->sv_stats)
serv->sv_stats->netudpcnt++;

@@ -559,21 +548,37 @@ svc_udp_recvfrom(struct svc_rqst *rqstp)
static int
svc_udp_sendto(struct svc_rqst *rqstp)
{
- struct svc_buf *bufp = &rqstp->rq_resbuf;
int error;
+ struct iovec vec[RPCSVC_MAXPAGES];
+ int v;
+ int base, len;

/* Set up the first element of the reply iovec.
* Any other iovecs that may be in use have been taken
* care of by the server implementation itself.
*/
- /* bufp->base = bufp->area; */
- bufp->iov[0].iov_base = bufp->base;
- bufp->iov[0].iov_len = bufp->len << 2;
+ vec[0] = rqstp->rq_res.head[0];
+ v=1;
+ base=rqstp->rq_res.page_base;
+ len = rqstp->rq_res.page_len;
+ while (len) {
+ vec[v].iov_base = page_address(rqstp->rq_res.pages[v-1]) + base;
+ vec[v].iov_len = PAGE_SIZE-base;
+ if (len <= vec[v].iov_len)
+ vec[v].iov_len = len;
+ len -= vec[v].iov_len;
+ base = 0;
+ v++;
+ }
+ if (rqstp->rq_res.tail[0].iov_len) {
+ vec[v] = rqstp->rq_res.tail[0];
+ v++;
+ }

- error = svc_sendto(rqstp, bufp->iov, bufp->nriov);
+ error = svc_sendto(rqstp, vec, v);
if (error == -ECONNREFUSED)
/* ICMP error on earlier request. */
- error = svc_sendto(rqstp, bufp->iov, bufp->nriov);
+ error = svc_sendto(rqstp, vec, v);

return error;
}
@@ -785,8 +790,9 @@ svc_tcp_recvfrom(struct svc_rqst *rqstp)
{
struct svc_sock *svsk = rqstp->rq_sock;
struct svc_serv *serv = svsk->sk_server;
- struct svc_buf *bufp = &rqstp->rq_argbuf;
int len;
+ struct iovec vec[RPCSVC_MAXPAGES];
+ int pnum, vlen;

dprintk("svc: tcp_recv %p data %d conn %d close %d\n",
svsk, test_bit(SK_DATA, &svsk->sk_flags),
@@ -851,7 +857,7 @@ svc_tcp_recvfrom(struct svc_rqst *rqstp)
}
svsk->sk_reclen &= 0x7fffffff;
dprintk("svc: TCP record, %d bytes\n", svsk->sk_reclen);
- if (svsk->sk_reclen > (bufp->buflen<<2)) {
+ if (svsk->sk_reclen > (32768 /*FIXME*/)) {
printk(KERN_NOTICE "RPC: bad TCP reclen 0x%08lx (large)\n",
(unsigned long) svsk->sk_reclen);
goto err_delete;
@@ -869,30 +875,35 @@ svc_tcp_recvfrom(struct svc_rqst *rqstp)
svc_sock_received(svsk);
return -EAGAIN; /* record not complete */
}
+ len = svsk->sk_reclen;
set_bit(SK_DATA, &svsk->sk_flags);

- /* Frob argbuf */
- bufp->iov[0].iov_base += 4;
- bufp->iov[0].iov_len -= 4;
+ vec[0] = rqstp->rq_arg.head[0];
+ vlen = PAGE_SIZE;
+ pnum = 1;
+ while (vlen < len) {
+ vec[pnum].iov_base = page_address(rqstp->rq_argpages[rqstp->rq_argused++]);
+ vec[pnum].iov_len = PAGE_SIZE;
+ pnum++;
+ vlen += PAGE_SIZE;
+ }

/* Now receive data */
- len = svc_recvfrom(rqstp, bufp->iov, bufp->nriov, svsk->sk_reclen);
+ len = svc_recvfrom(rqstp, vec, pnum, len);
if (len < 0)
goto error;

dprintk("svc: TCP complete record (%d bytes)\n", len);
-
- /* Position reply write pointer immediately after args,
- * allowing for record length */
- rqstp->rq_resbuf.base = rqstp->rq_argbuf.base + 1 + (len>>2);
- rqstp->rq_resbuf.buf = rqstp->rq_resbuf.base + 1;
- rqstp->rq_resbuf.len = 1;
- rqstp->rq_resbuf.buflen= rqstp->rq_argbuf.buflen - (len>>2) - 1;
+ rqstp->rq_arg.len = len;
+ rqstp->rq_arg.page_base = 0;
+ if (len <= rqstp->rq_arg.head[0].iov_len) {
+ rqstp->rq_arg.head[0].iov_len = len;
+ rqstp->rq_arg.page_len = 0;
+ } else {
+ rqstp->rq_arg.page_len = len - rqstp->rq_arg.head[0].iov_len;
+ }

rqstp->rq_skbuff = 0;
- rqstp->rq_argbuf.buf += 1;
- rqstp->rq_argbuf.len = (len >> 2);
- rqstp->rq_argbuf.buflen = (len >> 2) +1;
rqstp->rq_prot = IPPROTO_TCP;

/* Reset TCP read info */
@@ -928,23 +939,44 @@ svc_tcp_recvfrom(struct svc_rqst *rqstp)
static int
svc_tcp_sendto(struct svc_rqst *rqstp)
{
- struct svc_buf *bufp = &rqstp->rq_resbuf;
+ struct xdr_buf *xbufp = &rqstp->rq_res;
+ struct iovec vec[RPCSVC_MAXPAGES];
+ int v;
+ int base, len;
int sent;
+ u32 reclen;

/* Set up the first element of the reply iovec.
* Any other iovecs that may be in use have been taken
* care of by the server implementation itself.
*/
- bufp->iov[0].iov_base = bufp->base;
- bufp->iov[0].iov_len = bufp->len << 2;
- bufp->base[0] = htonl(0x80000000|((bufp->len << 2) - 4));
+ reclen = htonl(0x80000000|((xbufp->len ) - 4));
+ memcpy(xbufp->head[0].iov_base, &reclen, 4);
+
+ vec[0] = rqstp->rq_res.head[0];
+ v=1;
+ base= xbufp->page_base;
+ len = xbufp->page_len;
+ while (len) {
+ vec[v].iov_base = page_address(xbufp->pages[v-1]) + base;
+ vec[v].iov_len = PAGE_SIZE-base;
+ if (len <= vec[v].iov_len)
+ vec[v].iov_len = len;
+ len -= vec[v].iov_len;
+ base = 0;
+ v++;
+ }
+ if (xbufp->tail[0].iov_len) {
+ vec[v] = xbufp->tail[0];
+ v++;
+ }

- sent = svc_sendto(rqstp, bufp->iov, bufp->nriov);
- if (sent != bufp->len<<2) {
+ sent = svc_sendto(rqstp, vec, v);
+ if (sent != xbufp->len) {
printk(KERN_NOTICE "rpc-srv/tcp: %s: %s %d when sending %d bytes - shutting down socket\n",
rqstp->rq_sock->sk_server->sv_name,
(sent<0)?"got error":"sent only",
- sent, bufp->len << 2);
+ sent, xbufp->len);
svc_delete_socket(rqstp->rq_sock);
sent = -EAGAIN;
}
@@ -1016,6 +1048,8 @@ svc_recv(struct svc_serv *serv, struct s
{
struct svc_sock *svsk =NULL;
int len;
+ int pages;
+ struct xdr_buf *arg;
DECLARE_WAITQUEUE(wait, current);

dprintk("svc: server %p waiting for data (to = %ld)\n",
@@ -1031,9 +1065,35 @@ svc_recv(struct svc_serv *serv, struct s
rqstp);

/* Initialize the buffers */
- rqstp->rq_argbuf = rqstp->rq_defbuf;
- rqstp->rq_resbuf = rqstp->rq_defbuf;
+ /* first reclaim pages that were moved to response list */
+ while (rqstp->rq_resused)
+ rqstp->rq_argpages[rqstp->rq_arghi++] =
+ rqstp->rq_respages[--rqstp->rq_resused];
+ /* now allocate needed pages. If we get a failure, sleep briefly */
+ pages = 2 + (serv->sv_bufsz + PAGE_SIZE -1) / PAGE_SIZE;
+ while (rqstp->rq_arghi < pages) {
+ struct page *p = alloc_page(GFP_KERNEL);
+ if (!p) {
+ set_current_state(TASK_UNINTERRUPTIBLE);
+ schedule_timeout(HZ/2);
+ current->state = TASK_RUNNING;
+ continue;
+ }
+ rqstp->rq_argpages[rqstp->rq_arghi++] = p;
+ }

+ /* Make arg->head point to first page and arg->pages point to rest */
+ arg = &rqstp->rq_arg;
+ arg->head[0].iov_base = page_address(rqstp->rq_argpages[0]);
+ arg->head[0].iov_len = PAGE_SIZE;
+ rqstp->rq_argused = 1;
+ arg->pages = rqstp->rq_argpages + 1;
+ arg->page_base = 0;
+ /* save at least one page for response */
+ arg->page_len = (pages-2)*PAGE_SIZE;
+ arg->len = (pages-1)*PAGE_SIZE;
+ arg->tail[0].iov_len = 0;
+
if (signalled())
return -EINTR;

@@ -1109,12 +1169,6 @@ svc_recv(struct svc_serv *serv, struct s
rqstp->rq_userset = 0;
rqstp->rq_chandle.defer = svc_defer;

- svc_getu32(&rqstp->rq_argbuf, rqstp->rq_xid);
- svc_putu32(&rqstp->rq_resbuf, rqstp->rq_xid);
-
- /* Assume that the reply consists of a single buffer. */
- rqstp->rq_resbuf.nriov = 1;
-
if (serv->sv_stats)
serv->sv_stats->netcnt++;
return len;
@@ -1354,23 +1408,25 @@ static struct cache_deferred_req *
svc_defer(struct cache_req *req)
{
struct svc_rqst *rqstp = container_of(req, struct svc_rqst, rq_chandle);
- int size = sizeof(struct svc_deferred_req) + (rqstp->rq_argbuf.buflen << 2);
+ int size = sizeof(struct svc_deferred_req) + (rqstp->rq_arg.head[0].iov_len);
struct svc_deferred_req *dr;

+ if (rqstp->rq_arg.page_len)
+ return NULL; /* if more than a page, give up FIXME */
if (rqstp->rq_deferred) {
dr = rqstp->rq_deferred;
rqstp->rq_deferred = NULL;
} else {
/* FIXME maybe discard if size too large */
- dr = kmalloc(size<<2, GFP_KERNEL);
+ dr = kmalloc(size, GFP_KERNEL);
if (dr == NULL)
return NULL;

dr->serv = rqstp->rq_server;
dr->prot = rqstp->rq_prot;
dr->addr = rqstp->rq_addr;
- dr->argslen = rqstp->rq_argbuf.buflen;
- memcpy(dr->args, rqstp->rq_argbuf.base, dr->argslen<<2);
+ dr->argslen = rqstp->rq_arg.head[0].iov_len >> 2;
+ memcpy(dr->args, rqstp->rq_arg.head[0].iov_base, dr->argslen<<2);
}
spin_lock(&rqstp->rq_server->sv_lock);
rqstp->rq_sock->sk_inuse++;
@@ -1388,10 +1444,10 @@ static int svc_deferred_recv(struct svc_
{
struct svc_deferred_req *dr = rqstp->rq_deferred;

- rqstp->rq_argbuf.base = dr->args;
- rqstp->rq_argbuf.buf = dr->args;
- rqstp->rq_argbuf.len = dr->argslen;
- rqstp->rq_argbuf.buflen = dr->argslen;
+ rqstp->rq_arg.head[0].iov_base = dr->args;
+ rqstp->rq_arg.head[0].iov_len = dr->argslen<<2;
+ rqstp->rq_arg.page_len = 0;
+ rqstp->rq_arg.len = dr->argslen<<2;
rqstp->rq_prot = dr->prot;
rqstp->rq_addr = dr->addr;
return dr->argslen<<2;
--- ./net/sunrpc/svcauth.c 2002/10/24 06:01:17 1.1
+++ ./net/sunrpc/svcauth.c 2002/10/24 06:01:52
@@ -40,8 +40,7 @@ svc_authenticate(struct svc_rqst *rqstp,
*statp = rpc_success;
*authp = rpc_auth_ok;

- svc_getu32(&rqstp->rq_argbuf, flavor);
- flavor = ntohl(flavor);
+ flavor = ntohl(svc_getu32(&rqstp->rq_arg.head[0]));

dprintk("svc: svc_authenticate (%d)\n", flavor);
if (flavor >= RPC_AUTH_MAXFLAVOR || !(aops = authtab[flavor])) {
--- ./net/sunrpc/xprt.c 2002/10/24 00:34:53 1.1
+++ ./net/sunrpc/xprt.c 2002/10/24 01:00:36
@@ -655,7 +655,7 @@ skb_read_and_csum_bits(skb_reader_t *des
* We have set things up such that we perform the checksum of the UDP
* packet in parallel with the copies into the RPC client iovec. -DaveM
*/
-static int
+int
csum_partial_copy_to_xdr(struct xdr_buf *xdr, struct sk_buff *skb)
{
skb_reader_t desc;
--- ./net/sunrpc/svcauth_unix.c 2002/10/24 06:09:05 1.1
+++ ./net/sunrpc/svcauth_unix.c 2002/10/25 07:14:44
@@ -287,20 +287,20 @@ void svcauth_unix_purge(void)
static int
svcauth_null_accept(struct svc_rqst *rqstp, u32 *authp, int proc)
{
- struct svc_buf *argp = &rqstp->rq_argbuf;
- struct svc_buf *resp = &rqstp->rq_resbuf;
+ struct iovec *argv = &rqstp->rq_arg.head[0];
+ struct iovec *resv = &rqstp->rq_res.head[0];
int rv=0;
struct ip_map key, *ipm;

- if ((argp->len -= 3) < 0) {
+ if (argv->iov_len < 3*4)
return SVC_GARBAGE;
- }
- if (*(argp->buf)++ != 0) { /* we already skipped the flavor */
+
+ if (svc_getu32(argv) != 0) {
dprintk("svc: bad null cred\n");
*authp = rpc_autherr_badcred;
return SVC_DENIED;
}
- if (*(argp->buf)++ != RPC_AUTH_NULL || *(argp->buf)++ != 0) {
+ if (svc_getu32(argv) != RPC_AUTH_NULL || svc_getu32(argv) != 0) {
dprintk("svc: bad null verf\n");
*authp = rpc_autherr_badverf;
return SVC_DENIED;
@@ -312,8 +312,8 @@ svcauth_null_accept(struct svc_rqst *rqs
rqstp->rq_cred.cr_groups[0] = NOGROUP;

/* Put NULL verifier */
- svc_putu32(resp, RPC_AUTH_NULL);
- svc_putu32(resp, 0);
+ svc_putu32(resv, RPC_AUTH_NULL);
+ svc_putu32(resv, 0);

key.m_class = rqstp->rq_server->sv_program->pg_class;
key.m_addr = rqstp->rq_addr.sin_addr;
@@ -368,64 +368,70 @@ struct auth_ops svcauth_null = {
int
svcauth_unix_accept(struct svc_rqst *rqstp, u32 *authp, int proc)
{
- struct svc_buf *argp = &rqstp->rq_argbuf;
- struct svc_buf *resp = &rqstp->rq_resbuf;
+ struct iovec *argv = &rqstp->rq_arg.head[0];
+ struct iovec *resv = &rqstp->rq_res.head[0];
struct svc_cred *cred = &rqstp->rq_cred;
- u32 *bufp = argp->buf, slen, i;
- int len = argp->len;
+ u32 slen, i;
+ int len = argv->iov_len;
int rv=0;
struct ip_map key, *ipm;

- if ((len -= 3) < 0)
+ if ((len -= 3*4) < 0)
return SVC_GARBAGE;

- bufp++; /* length */
- bufp++; /* time stamp */
- slen = XDR_QUADLEN(ntohl(*bufp++)); /* machname length */
- if (slen > 64 || (len -= slen + 3) < 0)
+ svc_getu32(argv); /* length */
+ svc_getu32(argv); /* time stamp */
+ slen = XDR_QUADLEN(ntohl(svc_getu32(argv))); /* machname length */
+ if (slen > 64 || (len -= (slen + 3)*4) < 0)
goto badcred;
- bufp += slen; /* skip machname */
-
- cred->cr_uid = ntohl(*bufp++); /* uid */
- cred->cr_gid = ntohl(*bufp++); /* gid */
+printk("namelen %d name %.*s\n", slen, slen*4, (char*)argv->iov_base);
+ argv->iov_base = (void*)((u32*)argv->iov_base + slen); /* skip machname */

- slen = ntohl(*bufp++); /* gids length */
- if (slen > 16 || (len -= slen + 2) < 0)
+ cred->cr_uid = ntohl(svc_getu32(argv)); /* uid */
+ cred->cr_gid = ntohl(svc_getu32(argv)); /* gid */
+printk("uid=%d gid=%d\n", cred->cr_uid, cred->cr_gid);
+ slen = ntohl(svc_getu32(argv)); /* gids length */
+ printk("%d gids (%d)\n", slen, len);
+ if (slen > 16 || (len -= (slen + 2)*4) < 0)
goto badcred;
- for (i = 0; i < NGROUPS && i < slen; i++)
- cred->cr_groups[i] = ntohl(*bufp++);
+ for (i = 0; i < slen; i++)
+ if (i < NGROUPS)
+ cred->cr_groups[i] = ntohl(svc_getu32(argv));
+ else
+ svc_getu32(argv);
if (i < NGROUPS)
cred->cr_groups[i] = NOGROUP;
- bufp += (slen - i);
+ printk("..got %d\n", i);

- if (*bufp++ != RPC_AUTH_NULL || *bufp++ != 0) {
+ if (svc_getu32(argv) != RPC_AUTH_NULL || svc_getu32(argv) != 0) {
+ printk("nogo\n");
*authp = rpc_autherr_badverf;
return SVC_DENIED;
}

- argp->buf = bufp;
- argp->len = len;
-
/* Put NULL verifier */
- svc_putu32(resp, RPC_AUTH_NULL);
- svc_putu32(resp, 0);
+ svc_putu32(resv, RPC_AUTH_NULL);
+ svc_putu32(resv, 0);
+ printk("put NULL\n");

key.m_class = rqstp->rq_server->sv_program->pg_class;
key.m_addr = rqstp->rq_addr.sin_addr;

+ printk("key is <%s>, %x\n", key.m_class, key.m_addr.s_addr);
+
ipm = ip_map_lookup(&key, 0);

rqstp->rq_client = NULL;
-
+ printk(ipm?"Yes\n": "No\n");
if (ipm)
switch (cache_check(&ip_map_cache, &ipm->h, &rqstp->rq_chandle)) {
- case -EAGAIN:
+ case -EAGAIN:printk("EAGAIN\n");
rv = SVC_DROP;
break;
- case -ENOENT:
+ case -ENOENT:printk("NOENT\n");
rv = SVC_OK; /* rq_client is NULL */
break;
- case 0:
+ case 0: printk("Zero\n");
rqstp->rq_client = &ipm->m_client->h;
cache_get(&rqstp->rq_client->h);
ip_map_put(&ipm->h, &ip_map_cache);
@@ -434,7 +440,7 @@ svcauth_unix_accept(struct svc_rqst *rqs
default: BUG();
}
else rv = SVC_DROP;
-
+ if (rqstp->rq_client==NULL) printk("clinet NULL and proc %d\n", proc);
if (rqstp->rq_client == NULL && proc != 0)
goto badcred;
return rv;
--- ./kernel/ksyms.c 2002/10/24 01:33:59 1.1
+++ ./kernel/ksyms.c 2002/10/24 01:34:08
@@ -254,7 +254,9 @@ EXPORT_SYMBOL(find_inode_number);
EXPORT_SYMBOL(is_subdir);
EXPORT_SYMBOL(get_unused_fd);
EXPORT_SYMBOL(vfs_read);
+EXPORT_SYMBOL(vfs_readv);
EXPORT_SYMBOL(vfs_write);
+EXPORT_SYMBOL(vfs_writev);
EXPORT_SYMBOL(vfs_create);
EXPORT_SYMBOL(vfs_mkdir);
EXPORT_SYMBOL(vfs_mknod);



2002-10-26 03:19:32

by Hirokazu Takahashi

[permalink] [raw]
Subject: Re: Re: [PATCH] zerocopy NFS for 2.5.43

Hello,

> > > I have been thinking some more about this, trying to understand the
> > > big picture, and I'm afraid that I think I want some more changes.
> > >
> > > In particular, I think it would be good to use 'struct xdr_buf' from
> > > sunrpc/xdr.h instead of svc_buf. This is what the nfs client uses and
> > > we could share some of the infrastructure.
> >
> > I just realized it would be hard to use the xdr_buf as it couldn't
> > handle data in a socket buffer. Each socket buffer consists of
> > some non-page data and some pages and each of them might have its
> > own offset and length.
>
> You would only want this for single-copy write request - right?

Yes.

> I think we have to treat them as a special case and pass the skbuf all
> the way up to nfsd in that case.
> You would only want to try this if:
> The NIC had verified the checksum
> The packet was some minimum size (1K? 1 PAGE ??)
> We were using AUTH_UNIX, nothing more interesting like crypto
> security
> The first fragment was some minimum size (size of a write without
> the data).
>
> I would make a special 'fast-path' for that case which didn't copy any
> data but passed a skbuf up, and code in nfs*xdr.c would convert that
> into an iovec[];

I implemented it so that only the sunrpc layer handles the skbuff and
the nfsd layer is kept away from that implementation detail. I felt this
approach was not bad.

Yes, your approach is also good and will work fine.
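
For concreteness, here is a minimal userspace sketch (not taken from any
posted patch) of the kind of eligibility test such a write fast path
could make before handing a raw skb up to nfsd; every name and threshold
below is a hypothetical stand-in:

#include <stddef.h>

struct fake_skb {                    /* stand-in for struct sk_buff */
	unsigned int len;            /* total datagram length */
	int csum_verified;           /* e.g. ip_summed == CHECKSUM_UNNECESSARY */
	unsigned int first_frag_len; /* bytes in the first fragment */
};

#define HYP_RPC_AUTH_UNIX   1
#define HYP_MIN_FASTPATH    1024u    /* "some minimum size (1K? 1 PAGE?)" */
#define HYP_WRITE_HDR_WORDS 32u      /* size of a write without the data */

static int write_fastpath_ok(const struct fake_skb *skb, int auth_flavor)
{
	/* all four conditions from the suggestion above must hold */
	return skb->csum_verified &&
	       skb->len >= HYP_MIN_FASTPATH &&
	       auth_flavor == HYP_RPC_AUTH_UNIX &&
	       skb->first_frag_len >= HYP_WRITE_HDR_WORDS * 4;
}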

> I am working on a patch which changes rpcsvc to use xdr_buf. Some of
> it works. Some doesn't. I include it below for your reference. I
> repeat: it doesn't work yet.
> Once it is done, adding the rest of zero-copy should be fairly easy.

OK.

It's good that you're implementing vfs_readv and vfs_writev; I've also
realized they don't support aio yet.



2002-10-26 03:34:09

by Hirokazu Takahashi

[permalink] [raw]
Subject: Re: Re: [PATCH] zerocopy NFS for 2.5.43

Hello,

> Then the following trivial modification would be quite sufficient

Yes, it looks good as it's rare to use two or more xdr_bufs.
We can allocate extra xdr_bufs dynamically.

> struct xdr_buf {
> struct list_head list; /* Further xdr_buf */
> struct iovec head[1], /* RPC header + non-page data */
> tail[1]; /* Appended after page data */
>
> struct page ** pages; /* Array of contiguous pages */
> unsigned int page_base, /* Start of page data */
> page_len; /* Length of page data */
>
> unsigned int len; /* Total length of data */
>
> };
>
> With equally trivial fixes to xdr_kmap() and friends. None of this
> needs to affect existing client usage, and may in fact be useful for
> optimizing use of v4 COMPOUNDS later.
> (I was wrong about this BTW: being able to flush out all the dirty
> pages in a file to disk using a single COMPOUND would indeed be worth
> the trouble once we've managed to drop UDP as the primary NFS
> transport mechanism. For one thing, you would only tie up a single
> nfsd thread when writing to the file)
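
As a rough illustration of how a chain of such xdr_bufs could be walked
once extra ones are allocated dynamically, here is a small userspace
sketch; it uses a plain next pointer instead of the kernel's struct
list_head, and every name in it is hypothetical:

#include <stdio.h>

struct sketch_xdr_buf {
	unsigned int len;              /* total length of this buffer */
	struct sketch_xdr_buf *next;   /* further xdr_bufs, rarely used */
};

static unsigned int total_reply_len(const struct sketch_xdr_buf *buf)
{
	unsigned int total = 0;

	/* sum the lengths of every buffer on the chain */
	for (; buf != NULL; buf = buf->next)
		total += buf->len;
	return total;
}

int main(void)
{
	struct sketch_xdr_buf extra = { .len = 512,  .next = NULL   };
	struct sketch_xdr_buf first = { .len = 4096, .next = &extra };

	printf("%u\n", total_reply_len(&first));   /* prints 4608 */
	return 0;
}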



2002-10-26 03:46:38

by Benjamin LaHaise

[permalink] [raw]
Subject: Re: Re: [PATCH] zerocopy NFS for 2.5.43

On Sat, Oct 26, 2002 at 12:11:50PM +0900, Hirokazu Takahashi wrote:
> OK.
>
> It's good that you're implementing vfs_readv and vfs_writev; I've also
> realized they don't support aio yet.

The aio methods are soon switching over to vectored operations for a
few reasons. It's likely that non-vectored methods will be gone soon.

-ben
--
"Do you seek knowledge in time travel?"



2002-10-27 10:47:07

by Hirokazu Takahashi

[permalink] [raw]
Subject: Re: Re: [PATCH] zerocopy NFS for 2.5.43

Hello,

> > > I was thinking about the nfs clients. Why don't we make
> > > xprt_sendmsg() use the sendpage interface instead of calling
> > > sock_sendmsg() so that we can avoid dead-lock which multiple
> > > kmap()s in xprt_sendmsg() might cause on heavily loaded
> > > machines.
> >
> > I'm definitely in favour of such a change. Particularly so if the UDP
> > interface is ready.

I just modified the xprt_sendmsg() to use the sendpage interface.
I've checked it works fine on both of TCP and UDP.

I think this code needs to be cleaned up, but I don't have any good ideas
about it.


Thank you,
Hirokazu Takahashi.



--- linux/net/sunrpc/xdr.c.ORG Sat Oct 26 21:21:16 2030
+++ linux/net/sunrpc/xdr.c Sun Oct 27 19:07:05 2030
@@ -110,12 +110,15 @@ xdr_encode_pages(struct xdr_buf *xdr, st
xdr->page_len = len;

if (len & 3) {
- struct iovec *iov = xdr->tail;
unsigned int pad = 4 - (len & 3);
-
- iov->iov_base = (void *) "\0\0\0";
- iov->iov_len = pad;
len += pad;
+ if (((base + len) & ~PAGE_CACHE_MASK) + pad <= PAGE_CACHE_SIZE) {
+ xdr->page_len += pad;
+ } else {
+ struct iovec *iov = xdr->tail;
+ iov->iov_base = (void *) "\0\0\0";
+ iov->iov_len = pad;
+ }
}
xdr->len += len;
}
--- linux/net/sunrpc/xprt.c.ORG Sun Oct 27 17:07:17 2030
+++ linux/net/sunrpc/xprt.c Sun Oct 27 19:07:38 2030
@@ -60,6 +60,7 @@
#include <linux/unistd.h>
#include <linux/sunrpc/clnt.h>
#include <linux/file.h>
+#include <linux/pagemap.h>

#include <net/sock.h>
#include <net/checksum.h>
@@ -207,48 +208,107 @@ xprt_release_write(struct rpc_xprt *xprt
spin_unlock_bh(&xprt->sock_lock);
}

+static inline int
+__xprt_sendmsg(struct socket *sock, struct xdr_buf *xdr, struct msghdr *msg, size_t skip)
+{
+ unsigned int slen = xdr->len - skip;
+ mm_segment_t oldfs;
+ int result = 0;
+ struct page **ppage = xdr->pages;
+ unsigned int len, pglen = xdr->page_len;
+ size_t base = 0;
+ int flags;
+ int ret;
+ struct iovec niv;
+
+ msg->msg_iov = &niv;
+ msg->msg_iovlen = 1;
+
+ if (xdr->head[0].iov_len > skip) {
+ len = xdr->head[0].iov_len - skip;
+ niv.iov_base = xdr->head[0].iov_base + skip;
+ niv.iov_len = len;
+ if (slen > len)
+ msg->msg_flags |= MSG_MORE;
+ oldfs = get_fs(); set_fs(get_ds());
+ clear_bit(SOCK_ASYNC_NOSPACE, &sock->flags);
+ result = sock_sendmsg(sock, msg, len);
+ set_fs(oldfs);
+ if (result != len)
+ return result;
+ slen -= len;
+ skip = 0;
+ } else {
+ skip -= xdr->head[0].iov_len;
+ }
+ if (pglen == 0)
+ goto send_tail;
+ if (skip >= pglen) {
+ skip -= pglen;
+ goto send_tail;
+ }
+ if (skip || xdr->page_base) {
+ pglen -= skip;
+ base = xdr->page_base + skip;
+ ppage += base >> PAGE_CACHE_SHIFT;
+ base &= ~PAGE_CACHE_MASK;
+ }
+ len = PAGE_CACHE_SIZE - base;
+ if (len > pglen) len = pglen;
+ flags = MSG_MORE;
+ while (pglen > 0) {
+ if (slen == len)
+ flags = 0;
+ ret = sock->ops->sendpage(sock, *ppage, base, len, flags);
+ if (ret > 0)
+ result += ret;
+ if (ret != len) {
+ if (result == 0)
+ result = ret;
+ return result;
+ }
+ slen -= len;
+ pglen -= len;
+ len = PAGE_CACHE_SIZE < pglen ? PAGE_CACHE_SIZE : pglen;
+ base = 0;
+ ppage++;
+ }
+ skip = 0;
+send_tail:
+ if (xdr->tail[0].iov_len) {
+ niv.iov_base = xdr->tail[0].iov_base + skip;
+ niv.iov_len = xdr->tail[0].iov_len - skip;
+ msg->msg_flags &= ~MSG_MORE;
+ oldfs = get_fs(); set_fs(get_ds());
+ clear_bit(SOCK_ASYNC_NOSPACE, &sock->flags);
+ ret = sock_sendmsg(sock, msg, niv.iov_len);
+ set_fs(oldfs);
+ if (ret > 0)
+ result += ret;
+ if (result == 0)
+ result = ret;
+ }
+ return result;
+}
+
/*
* Write data to socket.
*/
static inline int
xprt_sendmsg(struct rpc_xprt *xprt, struct rpc_rqst *req)
{
- struct socket *sock = xprt->sock;
struct msghdr msg;
- struct xdr_buf *xdr = &req->rq_snd_buf;
- struct iovec niv[MAX_IOVEC];
- unsigned int niov, slen, skip;
- mm_segment_t oldfs;
int result;

- if (!sock)
- return -ENOTCONN;
-
- xprt_pktdump("packet data:",
- req->rq_svec->iov_base,
- req->rq_svec->iov_len);
-
- /* Dont repeat bytes */
- skip = req->rq_bytes_sent;
- slen = xdr->len - skip;
- niov = xdr_kmap(niv, xdr, skip);
-
msg.msg_flags = MSG_DONTWAIT|MSG_NOSIGNAL;
- msg.msg_iov = niv;
- msg.msg_iovlen = niov;
msg.msg_name = (struct sockaddr *) &xprt->addr;
msg.msg_namelen = sizeof(xprt->addr);
msg.msg_control = NULL;
msg.msg_controllen = 0;

- oldfs = get_fs(); set_fs(get_ds());
- clear_bit(SOCK_ASYNC_NOSPACE, &sock->flags);
- result = sock_sendmsg(sock, &msg, slen);
- set_fs(oldfs);
-
- xdr_kunmap(xdr, skip);
+ result = __xprt_sendmsg(xprt->sock, &req->rq_snd_buf, &msg, req->rq_bytes_sent);

- dprintk("RPC: xprt_sendmsg(%d) = %d\n", slen, result);
+ dprintk("RPC: xprt_sendmsg(%d) = %d\n", req->rq_snd_buf.len - req->rq_bytes_sent, result);

if (result >= 0)
return result;



2002-10-27 22:47:07

by NeilBrown

[permalink] [raw]
Subject: Re: Re: [PATCH] zerocopy NFS for 2.5.43

On Friday October 25, [email protected] wrote:
> On Sat, Oct 26, 2002 at 12:11:50PM +0900, Hirokazu Takahashi wrote:
> > OK.
> >
> > It's good that you're implementing vfs_readv and vfs_writev; I've also
> > realized they don't support aio yet.
>
> The aio methods are soon switching over to vectored operations for a
> few reasons. It's likely that non-vectored methods will be gone soon.

If you are introducing new 'vectored' operations, it would be nice if
they work well for kernel-space as well as user-space.

In the 'old days' before CONFIG_HIGHMEM, you could just

oldfs = get_fs(); set_fs(KERNEL_DS);
...whatever....
set_fs(oldfs);

to use kernel addresses. But with CONFIG_HIGHMEM the kernel often
wants to work with "struct page *" instead of just a "void *",
so this doesn't always work.
It would be nice if you could pass in an 'actor' which, for user-space
access, would call copy_to/from_user, and which for kernel-space would do
kmap/copy/kunmap.

Just a thought.....

NeilBrown
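
For concreteness, a rough sketch of what such an 'actor' interface might
look like; nothing here is an existing kernel API, and all names and
signatures are hypothetical:

#include <stddef.h>
#include <string.h>

/* the caller supplies the copy routine, so the same vectored path can
 * serve user-space buffers (copy_to_user) or kernel pages
 * (kmap/memcpy/kunmap) */
typedef int (*copy_actor_t)(void *dst_ctx, size_t dst_off,
			    const void *src, size_t len);

/* kernel-space flavour: dst_ctx is simply a kernel buffer */
static int kernel_copy_actor(void *dst_ctx, size_t dst_off,
			     const void *src, size_t len)
{
	memcpy((char *)dst_ctx + dst_off, src, len);
	return 0;
}

/* a read path that no longer cares where the destination lives */
static int read_into(void *dst_ctx, copy_actor_t actor,
		     const void *file_data, size_t count)
{
	return actor(dst_ctx, 0, file_data, count);
}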



2002-10-28 16:31:58

by Trond Myklebust

[permalink] [raw]
Subject: Re: Re: [PATCH] zerocopy NFS for 2.5.43

>>>>> " " == Hirokazu Takahashi <[email protected]> writes:

> --- linux/net/sunrpc/xdr.c.ORG Sat Oct 26 21:21:16 2030
> +++ linux/net/sunrpc/xdr.c Sun Oct 27 19:07:05 2030
> @@ -110,12 +110,15 @@ xdr_encode_pages(struct xdr_buf *xdr, st
> 	xdr->page_len = len;

> if (len & 3) {
> - struct iovec *iov = xdr->tail;
> unsigned int pad = 4 - (len & 3);
> -
> - iov->iov_base = (void *) "\0\0\0";
> - iov->iov_len = pad;
> len += pad;
> +		if (((base + len) & ~PAGE_CACHE_MASK) + pad <= PAGE_CACHE_SIZE) {
> + xdr->page_len += pad;

No!!! I believe I told you quite explicitly earlier:

- RFC1832 states that *all* variable length data must be padded with
zeros, and that is certainly not the case if the pages you are
pointing to are in the page cache.

- Worse: That data is not even guaranteed to have been initialized.
In effect this means that your 'optimization' is leaking random
data from the kernel and onto the internet. In security-conscious
circles this is not considered a good thing...

Please leave that padding so that it *always* returns zeros...

Cheers,
Trond
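
For reference, the padding rule above works out as follows; this is a
small userspace illustration of the RFC 1832 requirement, not code from
any of the patches:

#include <stdio.h>

static const unsigned char xdr_zero_pad[4];   /* always zero */

static unsigned int xdr_pad_len(unsigned int len)
{
	return (4 - (len & 3)) & 3;   /* 0..3 pad bytes */
}

int main(void)
{
	unsigned int len = 1234;

	/* the pad bytes must come from a known-zero buffer, never from
	 * whatever happens to follow the data in the page cache */
	printf("payload %u -> %u pad bytes (value %u)\n",
	       len, xdr_pad_len(len), xdr_zero_pad[0]);
	return 0;
}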



2002-10-28 23:47:58

by Hirokazu Takahashi

[permalink] [raw]
Subject: Re: Re: [PATCH] zerocopy NFS for 2.5.43

Hello,

> - RFC1832 states that *all* variable length data must be padded with
> zeros, and that is certainly not the case if the pages you are
> pointing to are in the page cache.

Yes, you're right.



2002-10-29 06:44:34

by Hirokazu Takahashi

[permalink] [raw]
Subject: Re: Re: [PATCH] zerocopy NFS for 2.5.43

Hello,

> - RFC1832 states that *all* variable length data must be padded with
> zeros, and that is certainly not the case if the pages you are
> pointing to are in the page cache.

I've changed my approach.

Shall we use ZERO_PAGE to pad RPC requests, for performance? Using
non-page data is a little inefficient, as the skbuff implementation
doesn't allow appending non-page data to an skbuff which already has
pages; only pages can be appended to it. If we didn't do this, the
TCP/IP stack would allocate a new page just to store the small
zero-padded data.

The last page will never be coalesced with data from the next RPC request
in the UDP case, while it might be on TCP.


What do you think of this approach?

Thank you,
Hirokazu Takahashi.
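
The helper called in the patch below, sunrpc_get_zeropage(), is not
shown in the posted hunks; under the assumption that it simply hands
back the kernel's shared zero page, it might look roughly like this (a
sketch, not part of the patch):

static inline struct page *sunrpc_get_zeropage(void)
{
	/* assumption: return the global page of zeros so that sendpage()
	 * can transmit zero pad bytes without allocating anything */
	return ZERO_PAGE(0);
}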


--- linux/include/linux/sunrpc/xdr.h.ORG Sun Oct 27 17:56:07 2030
+++ linux/include/linux/sunrpc/xdr.h Tue Oct 29 14:30:48 2030
@@ -48,12 +48,15 @@ typedef int (*kxdrproc_t)(void *rqstp, u
* operations and/or has a need for scatter/gather involving pages.
*/
struct xdr_buf {
- struct iovec head[1], /* RPC header + non-page data */
- tail[1]; /* Appended after page data */
+ struct iovec head[1]; /* RPC header + non-page data */
+ struct page * head_page; /* Page for head if needed */

struct page ** pages; /* Array of contiguous pages */
unsigned int page_base, /* Start of page data */
page_len; /* Length of page data */
+
+ struct iovec tail[1]; /* Appended after page data */
+ struct page * tail_page; /* Page for tail if needed */

unsigned int len; /* Total length of data */

--- linux/net/sunrpc/xdr.c.ORG Sat Oct 26 21:21:16 2030
+++ linux/net/sunrpc/xdr.c Tue Oct 29 14:20:52 2030
@@ -113,8 +113,9 @@ xdr_encode_pages(struct xdr_buf *xdr, st
struct iovec *iov = xdr->tail;
unsigned int pad = 4 - (len & 3);

- iov->iov_base = (void *) "\0\0\0";
+ iov->iov_base = (void *)0;
iov->iov_len = pad;
+ xdr->tail_page = sunrpc_get_zeropage();
len += pad;
}
xdr->len += len;
--- linux/net/sunrpc/xprt.c.ORG Sun Oct 27 17:07:17 2030
+++ linux/net/sunrpc/xprt.c Tue Oct 29 14:22:14 2030
@@ -60,6 +60,7 @@
#include <linux/unistd.h>
#include <linux/sunrpc/clnt.h>
#include <linux/file.h>
+#include <linux/pagemap.h>

#include <net/sock.h>
#include <net/checksum.h>
@@ -207,48 +208,101 @@ xprt_release_write(struct rpc_xprt *xprt
spin_unlock_bh(&xprt->sock_lock);
}

+static inline int
+__xprt_sendmsg(struct socket *sock, struct xdr_buf *xdr, struct msghdr *msg, size_t skip)
+{
+ unsigned int slen = xdr->len - skip;
+ mm_segment_t oldfs;
+ int result = 0;
+ struct page **ppage = xdr->pages;
+ unsigned int len, pglen = xdr->page_len;
+ size_t base = 0;
+ int flags;
+ int ret;
+ struct iovec niv;
+
+ msg->msg_iov = &niv;
+ msg->msg_iovlen = 1;
+
+ if (xdr->head[0].iov_len > skip) {
+ len = xdr->head[0].iov_len - skip;
+ niv.iov_base = xdr->head[0].iov_base + skip;
+ niv.iov_len = len;
+ if (slen > len)
+ msg->msg_flags |= MSG_MORE;
+ oldfs = get_fs(); set_fs(get_ds());
+ clear_bit(SOCK_ASYNC_NOSPACE, &sock->flags);
+ result = sock_sendmsg(sock, msg, len);
+ set_fs(oldfs);
+ if (result != len)
+ return result;
+ slen -= len;
+ skip = 0;
+ } else {
+ skip -= xdr->head[0].iov_len;
+ }
+ if (pglen == 0)
+ goto send_tail;
+ if (skip >= pglen) {
+ skip -= pglen;
+ goto send_tail;
+ }
+ if (skip || xdr->page_base) {
+ pglen -= skip;
+ base = xdr->page_base + skip;
+ ppage += base >> PAGE_CACHE_SHIFT;
+ base &= ~PAGE_CACHE_MASK;
+ }
+ len = PAGE_CACHE_SIZE - base;
+ if (len > pglen) len = pglen;
+ flags = MSG_MORE;
+ while (pglen > 0) {
+ if (slen == len)
+ flags = 0;
+ ret = sock->ops->sendpage(sock, *ppage, base, len, flags);
+ if (ret > 0)
+ result += ret;
+ if (ret != len) {
+ if (result == 0)
+ result = ret;
+ return result;
+ }
+ slen -= len;
+ pglen -= len;
+ len = PAGE_CACHE_SIZE < pglen ? PAGE_CACHE_SIZE : pglen;
+ base = 0;
+ ppage++;
+ }
+ skip = 0;
+send_tail:
+ if (xdr->tail[0].iov_len) {
+ ret = sock->ops->sendpage(sock, xdr->tail_page, (int)xdr->tail[0].iov_base + skip, xdr->tail[0].iov_len - skip, 0);
+ if (ret > 0)
+ result += ret;
+ if (result == 0)
+ result = ret;
+ }
+ return result;
+}
+
/*
* Write data to socket.
*/
static inline int
xprt_sendmsg(struct rpc_xprt *xprt, struct rpc_rqst *req)
{
- struct socket *sock = xprt->sock;
struct msghdr msg;
- struct xdr_buf *xdr = &req->rq_snd_buf;
- struct iovec niv[MAX_IOVEC];
- unsigned int niov, slen, skip;
- mm_segment_t oldfs;
int result;

- if (!sock)
- return -ENOTCONN;
-
- xprt_pktdump("packet data:",
- req->rq_svec->iov_base,
- req->rq_svec->iov_len);
-
- /* Dont repeat bytes */
- skip = req->rq_bytes_sent;
- slen = xdr->len - skip;
- niov = xdr_kmap(niv, xdr, skip);
-
msg.msg_flags = MSG_DONTWAIT|MSG_NOSIGNAL;
- msg.msg_iov = niv;
- msg.msg_iovlen = niov;
msg.msg_name = (struct sockaddr *) &xprt->addr;
msg.msg_namelen = sizeof(xprt->addr);
msg.msg_control = NULL;
msg.msg_controllen = 0;

- oldfs = get_fs(); set_fs(get_ds());
- clear_bit(SOCK_ASYNC_NOSPACE, &sock->flags);
- result = sock_sendmsg(sock, &msg, slen);
- set_fs(oldfs);
-
- xdr_kunmap(xdr, skip);
+ result = __xprt_sendmsg(xprt->sock, &req->rq_snd_buf, &msg, req->rq_bytes_sent);

- dprintk("RPC: xprt_sendmsg(%d) = %d\n", slen, result);
+ dprintk("RPC: xprt_sendmsg(%d) = %d\n", req->rq_snd_buf.len - req->rq_bytes_sent, result);

if (result >= 0)
return result;
--- linux/net/sunrpc/sunrpc_syms.c.ORG Tue Oct 29 14:18:45 2030
+++ linux/net/sunrpc/sunrpc_syms.c Tue Oct 29 14:15:27 2030
@@ -101,6 +101,7 @@ EXPORT_SYMBOL(auth_unix_lookup);
EXPORT_SYMBOL(cache_check);
EXPORT_SYMBOL(cache_clean);
EXPORT_SYMBOL(cache_flush);
+EXPORT_SYMBOL(cache_purge);
EXPORT_SYMBOL(cache_fresh);
EXPORT_SYMBOL(cache_init);
EXPORT_SYMBOL(cache_register);
@@ -130,6 +131,36 @@ EXPORT_SYMBOL(nfsd_debug);
EXPORT_SYMBOL(nlm_debug);
#endif

+/* RPC general use */
+EXPORT_SYMBOL(sunrpc_get_zeropage);
+
+static struct page *sunrpc_zero_page;
+
+struct page *
+sunrpc_get_zeropage(void)
+{
+ return sunrpc_zero_page;
+}
+
+static int __init
+sunrpc_init_zeropage(void)
+{
+ sunrpc_zero_page = alloc_page(GFP_ATOMIC);
+ if (sunrpc_zero_page == NULL) {
+ printk(KERN_ERR "RPC: couldn't allocate zero_page.\n");
+ return 1;
+ }
+ clear_page(page_address(sunrpc_zero_page));
+ return 0;
+}
+
+static void __exit
+sunrpc_cleanup_zeropage(void)
+{
+ put_page(sunrpc_zero_page);
+ sunrpc_zero_page = NULL;
+}
+
static int __init
init_sunrpc(void)
{
@@ -141,12 +172,14 @@ init_sunrpc(void)
#endif
cache_register(&auth_domain_cache);
cache_register(&ip_map_cache);
+ sunrpc_init_zeropage();
return 0;
}

static void __exit
cleanup_sunrpc(void)
{
+ sunrpc_cleanup_zeropage();
cache_unregister(&auth_domain_cache);
cache_unregister(&ip_map_cache);
#ifdef RPC_DEBUG
--- linux/include/linux/sunrpc/types.h.ORG Tue Oct 29 11:31:13 2030
+++ linux/include/linux/sunrpc/types.h Tue Oct 29 11:37:49 2030
@@ -13,10 +13,14 @@
#include <linux/workqueue.h>
#include <linux/sunrpc/debug.h>
#include <linux/list.h>
+#include <linux/mm.h>

/*
* Shorthands
*/
#define signalled() (signal_pending(current))
+
+extern struct page * sunrpc_get_zeropage(void);
+

#endif /* _LINUX_SUNRPC_TYPES_H_ */



2002-10-29 15:09:44

by Trond Myklebust

[permalink] [raw]
Subject: Re: Re: [PATCH] zerocopy NFS for 2.5.43

>>>>> " " == Hirokazu Takahashi <[email protected]> writes:

> Shall we use ZERO_PAGE to pad RPC requests for the purpose of
> its performance? Using non-page data is a little inefficient
> as the implementation of skbuff doesn't allow to append
> non-page data to a skbuff which already have pages. Only pages
> can be appneded to it. If we didn't, TCP/IP stack would
> allocate a new page to store the small zero-padded data.

Hmmm... What if we just drop actually storing a pointer to the
ZERO_PAGE? Instead, define the convention that

if (xdr_buf->tail[0].iov_base == NULL)
padding = xdr_buf->tail[0].iov_len;

and just have xprt_sendmsg() magically append 'padding' bytes from
your ZERO_PAGE.

Unless, of course, you've got another use for the head_page/tail_page?

Cheers,
Trond



2002-10-29 16:35:37

by Hirokazu Takahashi

[permalink] [raw]
Subject: Re: Re: [PATCH] zerocopy NFS for 2.5.43

Hello,

Thank you for your reply.

> > Shall we use ZERO_PAGE to pad RPC requests for the purpose of
> > its performance? Using non-page data is a little inefficient
> > as the implementation of skbuff doesn't allow to append
> > non-page data to a skbuff which already have pages. Only pages
> > can be appneded to it. If we didn't, TCP/IP stack would
> > allocate a new page to store the small zero-padded data.
>
> Hmmm... What if we just drop actually storing a pointer to the
> ZERO_PAGE? Instead, define the convention that
>
> if (xdr_buf->tail[0].iov_base == NULL)
> padding = xdr_buf->tail[0].iov_len;
>
> and just have xprt_sendmsg() magically append 'padding' bytes from
> your ZERO_PAGE.

Yes, it's possible.
OK, I'll modify it.

> Unless, of course, you've got another use for the head_page/tail_page?

I just wanted to make it general.
I guessed head_page (or head_pages) might be useful for big NFSv4
COMPOUND messages, as we could send a head without any copies.
But it's just my guess.

Thank you,
Hirokazu Takahashi.



2002-10-29 16:49:50

by Trond Myklebust

[permalink] [raw]
Subject: Re: Re: [PATCH] zerocopy NFS for 2.5.43

>>>>> " " == Hirokazu Takahashi <[email protected]> writes:

>> Unless, of course, you've got another use for the
>> head_page/tail_page?

> I just wanted to make it general. I guessed head_page (or
> head_pages) might be usefull for big NFSv4 COMPOUND messages as
> we could send a head without any copies. But it's just my
> guess.

It's good to know that this is possible, but let's not overdesign: we
don't want to implement this unless we know that we have a need.

Cheers,
Trond



2002-10-30 03:26:22

by Hirokazu Takahashi

[permalink] [raw]
Subject: Re: Re: [PATCH] zerocopy NFS for 2.5.43

Hello,

I've simplified the patch as you said.

> Hmmm... What if we just drop actually storing a pointer to the
> ZERO_PAGE? Instead, define the convention that
>
> if (xdr_buf->tail[0].iov_base == NULL)
> padding = xdr_buf->tail[0].iov_len;
>
> and just have xprt_sendmsg() magically append 'padding' bytes from
> your ZERO_PAGE.


--- linux/net/sunrpc/xdr.c.ORG Sat Oct 26 21:21:16 2030
+++ linux/net/sunrpc/xdr.c Wed Oct 30 11:11:03 2030
@@ -113,7 +113,8 @@ xdr_encode_pages(struct xdr_buf *xdr, st
struct iovec *iov = xdr->tail;
unsigned int pad = 4 - (len & 3);

- iov->iov_base = (void *) "\0\0\0";
+ /* NULL means a request to pad it with zero. */
+ iov->iov_base = NULL;
iov->iov_len = pad;
len += pad;
}
--- linux/net/sunrpc/xprt.c.ORG Sun Oct 27 17:07:17 2030
+++ linux/net/sunrpc/xprt.c Wed Oct 30 12:16:05 2030
@@ -60,6 +60,7 @@
#include <linux/unistd.h>
#include <linux/sunrpc/clnt.h>
#include <linux/file.h>
+#include <linux/pagemap.h>

#include <net/sock.h>
#include <net/checksum.h>
@@ -207,48 +208,113 @@ xprt_release_write(struct rpc_xprt *xprt
spin_unlock_bh(&xprt->sock_lock);
}

-/*
- * Write data to socket.
- */
static inline int
-xprt_sendmsg(struct rpc_xprt *xprt, struct rpc_rqst *req)
+__xprt_sendmsg(struct rpc_xprt *xprt, struct xdr_buf *xdr, size_t skip)
{
struct socket *sock = xprt->sock;
+ unsigned int slen = xdr->len - skip;
+ struct page **ppage = xdr->pages;
+ unsigned int len, pglen = xdr->page_len;
+ size_t base = 0;
struct msghdr msg;
- struct xdr_buf *xdr = &req->rq_snd_buf;
- struct iovec niv[MAX_IOVEC];
- unsigned int niov, slen, skip;
+ struct iovec niv;
+ int flags;
mm_segment_t oldfs;
- int result;
-
- if (!sock)
- return -ENOTCONN;
-
- xprt_pktdump("packet data:",
- req->rq_svec->iov_base,
- req->rq_svec->iov_len);
-
- /* Dont repeat bytes */
- skip = req->rq_bytes_sent;
- slen = xdr->len - skip;
- niov = xdr_kmap(niv, xdr, skip);
+ int result = 0;
+ int ret;

msg.msg_flags = MSG_DONTWAIT|MSG_NOSIGNAL;
- msg.msg_iov = niv;
- msg.msg_iovlen = niov;
msg.msg_name = (struct sockaddr *) &xprt->addr;
msg.msg_namelen = sizeof(xprt->addr);
msg.msg_control = NULL;
msg.msg_controllen = 0;
+ msg.msg_iov = &niv;
+ msg.msg_iovlen = 1;

- oldfs = get_fs(); set_fs(get_ds());
- clear_bit(SOCK_ASYNC_NOSPACE, &sock->flags);
- result = sock_sendmsg(sock, &msg, slen);
- set_fs(oldfs);
+ if (xdr->head[0].iov_len > skip) {
+ len = xdr->head[0].iov_len - skip;
+ niv.iov_base = xdr->head[0].iov_base + skip;
+ niv.iov_len = len;
+ if (slen > len)
+ msg.msg_flags |= MSG_MORE;
+ oldfs = get_fs(); set_fs(get_ds());
+ clear_bit(SOCK_ASYNC_NOSPACE, &sock->flags);
+ result = sock_sendmsg(sock, &msg, len);
+ set_fs(oldfs);
+ if (result != len)
+ return result;
+ slen -= len;
+ skip = 0;
+ } else {
+ skip -= xdr->head[0].iov_len;
+ }
+ if (pglen == 0)
+ goto send_tail;
+ if (skip >= pglen) {
+ skip -= pglen;
+ goto send_tail;
+ }
+ if (skip || xdr->page_base) {
+ pglen -= skip;
+ base = xdr->page_base + skip;
+ ppage += base >> PAGE_CACHE_SHIFT;
+ base &= ~PAGE_CACHE_MASK;
+ }
+ len = PAGE_CACHE_SIZE - base;
+ if (len > pglen) len = pglen;
+ flags = MSG_MORE;
+ while (pglen > 0) {
+ if (slen == len)
+ flags = 0;
+ ret = sock->ops->sendpage(sock, *ppage, base, len, flags);
+ if (ret > 0)
+ result += ret;
+ if (ret != len) {
+ if (result == 0)
+ result = ret;
+ return result;
+ }
+ slen -= len;
+ pglen -= len;
+ len = PAGE_CACHE_SIZE < pglen ? PAGE_CACHE_SIZE : pglen;
+ base = 0;
+ ppage++;
+ }
+ skip = 0;
+send_tail:
+ if (xdr->tail[0].iov_len) {
+ if (xdr->tail[0].iov_base == NULL) {
+ /* tail[0].iov_base == NULL requires zero padding */
+ ret = sock->ops->sendpage(sock, sunrpc_get_zeropage(),
+ 0, xdr->tail[0].iov_len - skip, 0);
+ } else {
+ niv.iov_base = xdr->tail[0].iov_base + skip;
+ niv.iov_len = xdr->tail[0].iov_len - skip;
+ msg.msg_flags &= ~MSG_MORE;
+ oldfs = get_fs(); set_fs(get_ds());
+ clear_bit(SOCK_ASYNC_NOSPACE, &sock->flags);
+ ret = sock_sendmsg(sock, &msg, niv.iov_len);
+ set_fs(oldfs);
+ }
+ if (ret > 0)
+ result += ret;
+ if (result == 0)
+ result = ret;
+ }
+ return result;
+}
+
+/*
+ * Write data to socket.
+ */
+static inline int
+xprt_sendmsg(struct rpc_xprt *xprt, struct rpc_rqst *req)
+{
+ int result;

- xdr_kunmap(xdr, skip);
+ result = __xprt_sendmsg(xprt, &req->rq_snd_buf, req->rq_bytes_sent);

- dprintk("RPC: xprt_sendmsg(%d) = %d\n", slen, result);
+ dprintk("RPC: xprt_sendmsg(%d) = %d\n", req->rq_snd_buf.len - req->rq_bytes_sent, result);

if (result >= 0)
return result;
--- linux/net/sunrpc/sunrpc_syms.c.ORG Tue Oct 29 14:18:45 2030
+++ linux/net/sunrpc/sunrpc_syms.c Tue Oct 29 14:15:27 2030
@@ -101,6 +101,7 @@ EXPORT_SYMBOL(auth_unix_lookup);
EXPORT_SYMBOL(cache_check);
EXPORT_SYMBOL(cache_clean);
EXPORT_SYMBOL(cache_flush);
+EXPORT_SYMBOL(cache_purge);
EXPORT_SYMBOL(cache_fresh);
EXPORT_SYMBOL(cache_init);
EXPORT_SYMBOL(cache_register);
@@ -130,6 +131,36 @@ EXPORT_SYMBOL(nfsd_debug);
EXPORT_SYMBOL(nlm_debug);
#endif

+/* RPC general use */
+EXPORT_SYMBOL(sunrpc_get_zeropage);
+
+static struct page *sunrpc_zero_page;
+
+struct page *
+sunrpc_get_zeropage(void)
+{
+ return sunrpc_zero_page;
+}
+
+static int __init
+sunrpc_init_zeropage(void)
+{
+ sunrpc_zero_page = alloc_page(GFP_ATOMIC);
+ if (sunrpc_zero_page == NULL) {
+ printk(KERN_ERR "RPC: couldn't allocate zero_page.\n");
+ return 1;
+ }
+ clear_page(page_address(sunrpc_zero_page));
+ return 0;
+}
+
+static void __exit
+sunrpc_cleanup_zeropage(void)
+{
+ put_page(sunrpc_zero_page);
+ sunrpc_zero_page = NULL;
+}
+
static int __init
init_sunrpc(void)
{
@@ -141,12 +172,14 @@ init_sunrpc(void)
#endif
cache_register(&auth_domain_cache);
cache_register(&ip_map_cache);
+ sunrpc_init_zeropage();
return 0;
}

static void __exit
cleanup_sunrpc(void)
{
+ sunrpc_cleanup_zeropage();
cache_unregister(&auth_domain_cache);
cache_unregister(&ip_map_cache);
#ifdef RPC_DEBUG
--- linux/include/linux/sunrpc/types.h.ORG Tue Oct 29 11:31:13 2030
+++ linux/include/linux/sunrpc/types.h Tue Oct 29 11:37:49 2030
@@ -13,10 +13,14 @@
#include <linux/workqueue.h>
#include <linux/sunrpc/debug.h>
#include <linux/list.h>
+#include <linux/mm.h>

/*
* Shorthands
*/
#define signalled() (signal_pending(current))
+
+extern struct page * sunrpc_get_zeropage(void);
+

#endif /* _LINUX_SUNRPC_TYPES_H_ */



2002-10-30 23:37:41

by Hirokazu Takahashi

[permalink] [raw]
Subject: Re: Re: [PATCH] zerocopy NFS for 2.5.43

Hello,

How is it going?

neilb> I would make a special 'fast-path' for that case which didn't copy any
neilb> data but passed a skbuf up, and code in nfs*xdr.c would convert that
neilb> into an iovec[];
neilb>
neilb> I am working on a patch which changes rpcsvc to use xdr_buf. Some of
neilb> it works. Some doesn't. I include it below for your reference I
neilb> repeat: it doesn't work yet.
neilb> Once it is done, adding the rest of zero-copy should be fairly easy.




2002-10-14 05:50:08

by NeilBrown

[permalink] [raw]
Subject: Re: [PATCH] zerocopy NFS for 2.5.36

On Wednesday September 18, [email protected] wrote:
> Hello,
>
> I ported the zerocopy NFS patches against linux-2.5.36.
>

hi,
I finally got around to looking at this.
It looks good.

However it really needs the MSG_MORE support for udp_sendmsg to be
accepted before there is any point merging the rpc/nfsd bits.

Would you like to see if davem is happy with that bit first and get
it in? Then I will be happy to forward the nfsd specific bit.

The bit I'm not very sure about is the 'shadowsock' patch for having
several xmit sockets, one per CPU. What sort of speedup do you get
from this? How important is it really?

NeilBrown

2002-10-14 06:15:34

by David Miller

[permalink] [raw]
Subject: Re: [PATCH] zerocopy NFS for 2.5.36

From: Neil Brown <[email protected]>
Date: Mon, 14 Oct 2002 15:50:02 +1000

Would you like to see if davem is happy with that bit first and get
it in? Then I will be happy to forward the nfsd specific bit.

Alexey is working on this, or at least he was. :-)
(Alexey this is about the UDP cork changes)

The bit I'm not very sure about is the 'shadowsock' patch for having
several xmit sockets, one per CPU. What sort of speedup do you get
from this? How important is it really?

Personally, it seems rather essential for scalability on SMP.

2002-10-14 10:45:33

by Alexey Kuznetsov

[permalink] [raw]
Subject: Re: [PATCH] zerocopy NFS for 2.5.36

Hello!

> Alexey is working on this, or at least he was. :-)
> (Alexey this is about the UDP cork changes)

I took two patches of the batch:

va10-hwchecksum-2.5.36.patch
va11-udpsendfile-2.5.36.patch

I did not worry about the rest i.e. sunrpc/* part.

Alexey

2002-10-14 10:48:53

by David Miller

[permalink] [raw]
Subject: Re: [PATCH] zerocopy NFS for 2.5.36

From: [email protected]
Date: Mon, 14 Oct 2002 14:45:33 +0400 (MSD)

I took two patches of the batch:

va10-hwchecksum-2.5.36.patch
va11-udpsendfile-2.5.36.patch

I did not worry about the rest i.e. sunrpc/* part.

Neil and the NFS folks can take care of those parts
once the generic UDP parts are in.

So, no worries.

2002-10-14 12:01:44

by Hirokazu Takahashi

[permalink] [raw]
Subject: Re: [PATCH] zerocopy NFS for 2.5.36

Hello, Neil

> > I ported the zerocopy NFS patches against linux-2.5.36.
>
> hi,
> I finally got around to looking at this.
> It looks good.

Thanks!

> However it really needs the MSG_MORE support for udp_sendmsg to be
> accepted before there is any point merging the rpc/nfsd bits.
>
> Would you like to see if davem is happy with that bit first and get
> it in? Then I will be happy to forward the nfsd specific bit.

Yes.

> I'm bit I'm not very sure about is the 'shadowsock' patch for having
> several xmit sockets, one per CPU. What sort of speedup do you get
> from this? How important is it really?

It's not so important.

davem> Personally, it seems rather essential for scalability on SMP.

Yes.
It will be effective on large scale SMP machines, as all kNFSd threads
share one NFS port. A udp socket can't send data on each CPU at the same
time while the MSG_MORE/UDP_CORK options are set:
the UDP socket has to block any other requests while building a UDP frame.
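
Roughly, the idea behind the shadow sockets is the following (a sketch
only; the actual va04-zerocopy-shadowsock patch is not shown in this
thread, and the array and helper names below are made up):

#include <linux/threads.h>
#include <linux/smp.h>
#include <linux/net.h>

/* One transmit socket per CPU, so that corked (MSG_MORE/UDP_CORK) sends
 * running on different CPUs don't serialize on a single socket lock. */
static struct socket *nfsd_udp_xmit_sock[NR_CPUS];

static inline struct socket *nfsd_pick_xmit_sock(void)
{
        /* the caller is expected to stay on this CPU for the whole send */
        return nfsd_udp_xmit_sock[smp_processor_id()];
}

How those sockets are bound (for example, whether they can share the
well-known NFS port) is a detail of the real patch and isn't shown here.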


Thank you,
Hirokazu Takahashi.

2002-10-14 14:12:56

by Andrew Theurer

[permalink] [raw]
Subject: Re: [PATCH] zerocopy NFS for 2.5.36

> Hello, Neil
>
> > > I ported the zerocopy NFS patches against linux-2.5.36.
> >
> > hi,
> > I finally got around to looking at this.
> > It looks good.
>
> Thanks!
>
> > However it really needs the MSG_MORE support for udp_sendmsg to be
> > accepted before there is any point merging the rpc/nfsd bits.
> >
> > Would you like to see if davem is happy with that bit first and get
> > it in? Then I will be happy to forward the nfsd specific bit.
>
> Yes.
>
> > I'm bit I'm not very sure about is the 'shadowsock' patch for having
> > several xmit sockets, one per CPU. What sort of speedup do you get
> > from this? How important is it really?
>
> It's not so important.
>
> davem> Personally, it seems rather essential for scalability on SMP.
>
> Yes.
> It will be effective on large scale SMP machines as all kNFSd shares
> one NFS port. A udp socket can't send data on each CPU at the same
> time while MSG_MORE/UDP_CORK options are set.
> The UDP socket have to block any other requests during making a UDP frame.

I experienced this exact problem a few months ago. I had a test where
several clients read a file or files cached on a linux server. TCP was just
fine, I could get 100% CPU on all CPUs on the server. TCP zerocopy was even
better, by about 50% throughput. UDP could not get better than 33% CPU, one
CPU working on those UDP requests and, I assume, a portion of another CPU
handling some interrupt stuff. Essentially 2P and 4P throughput was only as
good as UP throughput. It is essential to get scaling on UDP. That,
combined with the UDP zerocopy, will give us one extremely fast NFS server.

Andrew Theurer
IBM LTC

2002-10-16 03:44:09

by NeilBrown

[permalink] [raw]
Subject: Re: [PATCH] zerocopy NFS for 2.5.36

On Monday October 14, [email protected] wrote:
> > I'm bit I'm not very sure about is the 'shadowsock' patch for having
> > several xmit sockets, one per CPU. What sort of speedup do you get
> > from this? How important is it really?
>
> It's not so important.
>
> davem> Personally, it seems rather essential for scalability on SMP.
>
> Yes.
> It will be effective on large scale SMP machines as all kNFSd shares
> one NFS port. A udp socket can't send data on each CPU at the same
> time while MSG_MORE/UDP_CORK options are set.
> The UDP socket have to block any other requests during making a UDP frame.
>

After thinking about this some more, I suspect it would have to be
quite large scale SMP to get much contention.
The only contention on the udp socket is, as you say, assembling a udp
frame, and I would be surprised if that takes a substantial fraction
of the time to handle a request.

Presumably on an SMP machine sufficiently large that this became an
issue, there would be multiple NICs. Maybe it would make sense to
have one udp socket for each NIC. Would that make sense? Would it work?
It feels cleaner to me than one for each CPU.

NeilBrown

2002-10-16 04:31:02

by David Miller

[permalink] [raw]
Subject: Re: [PATCH] zerocopy NFS for 2.5.36

From: Neil Brown <[email protected]>
Date: Wed, 16 Oct 2002 13:44:04 +1000

Presumably on a sufficiently large SMP machine that this became an
issue, there would be multiple NICs. Maybe it would make sense to
have one udp socket for each NIC. Would that make sense? or work?
It feels to me to be cleaner than one for each CPU.

Doesn't make much sense.

Usually we are talking via one IP address, and thus over
one device. It could be using multiple NICs via BONDING,
but that would be transparent to anything at the socket
level.

Really, I think there is real value to making the socket
per-cpu even on a 2 or 4 way system.

2002-10-16 11:09:00

by Hirokazu Takahashi

[permalink] [raw]
Subject: Re: [PATCH] zerocopy NFS for 2.5.36

Hello,

> > It will be effective on large scale SMP machines as all kNFSd shares
> > one NFS port. A udp socket can't send data on each CPU at the same
> > time while MSG_MORE/UDP_CORK options are set.
> > The UDP socket have to block any other requests during making a UDP frame.
> >

> After thinking about this some more, I suspect it would have to be
> quite large scale SMP to get much contention.

I have no idea how much contention will happen. I haven't checked the
performance of it on large scale SMP yet, as I don't have such great
machines.

Can anyone help us?

> The only contention on the udp socket is, as you say, assembling a udp
> frame, and it would be surprised if that takes a substantial faction
> of the time to handle a request.

After assembling a udp frame, kNFSd may drive a NIC to transmit the frame.

> Presumably on a sufficiently large SMP machine that this became an
> issue, there would be multiple NICs. Maybe it would make sense to
> have one udp socket for each NIC. Would that make sense? or work?

Some CPUs often share one GbE NIC today, as a NIC can handle more data
than one CPU can. I think the CPU seems likely to become the bottleneck.
Personally I guess several CPUs will share one 10GbE NIC in the near
future, even on a high end machine. (It's just my guess.)

But I don't know how effective this patch will be...

davem> Doesn't make much sense.
davem>
davem> Usually we are talking via one IP address, and thus over
davem> one device. It could be using multiple NICs via BONDING,
davem> but that would be transparent to anything at the socket
davem> level.
davem>
davem> Really, I think there is real value to making the socket
davem> per-cpu even on a 2 or 4 way system.

I wish so.

2002-10-16 15:04:27

by Andrew Theurer

[permalink] [raw]
Subject: Re: [PATCH] zerocopy NFS for 2.5.36

On Tuesday 15 October 2002 11:31 pm, David S. Miller wrote:
> From: Neil Brown <[email protected]>
> Date: Wed, 16 Oct 2002 13:44:04 +1000
>
> Presumably on a sufficiently large SMP machine that this became an
> issue, there would be multiple NICs. Maybe it would make sense to
> have one udp socket for each NIC. Would that make sense? or work?
> It feels to me to be cleaner than one for each CPU.
>
> Doesn't make much sense.
>
> Usually we are talking via one IP address, and thus over
> one device. It could be using multiple NICs via BONDING,
> but that would be transparent to anything at the socket
> level.
>
> Really, I think there is real value to making the socket
> per-cpu even on a 2 or 4 way system.

I am trying my best today to get a 4 way system up and running for this test.
IMO, per-cpu is best. With just one socket, I seriously could not get over
33% cpu utilization on a 4 way (back in April). With TCP, I could max it
out. I'll update later today, hopefully with some promising results.

-Andrew

2002-10-16 17:02:15

by kaza

[permalink] [raw]
Subject: Re: [PATCH] zerocopy NFS for 2.5.36

Hello,

On Wed, Oct 16, 2002 at 08:09:00PM +0900,
Hirokazu Takahashi-san wrote:
> > After thinking about this some more, I suspect it would have to be
> > quite large scale SMP to get much contention.
>
> I have no idea how much contention will happen. I haven't checked the
> performance of it on large scale SMP yet as I don't have such a great
> machines.
>
> Does anyone help us?

Why don't you propose the performance test to OSDL? (OSDL-J would be
better, I think.) OSDL provides hardware resources and operations staff.

If you want, I can help you to propose it. :-)

--
Ko Kazaana / editor-in-chief of "TechStyle" ( http://techstyle.jp/ )
GnuPG Fingerprint = 1A50 B204 46BD EE22 2E8C 903F F2EB CEA7 4BCF 808F

2002-10-17 04:36:32

by Randy.Dunlap

[permalink] [raw]
Subject: Re: [PATCH] zerocopy NFS for 2.5.36

On Thu, 17 Oct 2002 [email protected] wrote:

| Hello,
|
| On Wed, Oct 16, 2002 at 08:09:00PM +0900,
| Hirokazu Takahashi-san wrote:
| > > After thinking about this some more, I suspect it would have to be
| > > quite large scale SMP to get much contention.
| >
| > I have no idea how much contention will happen. I haven't checked the
| > performance of it on large scale SMP yet as I don't have such a great
| > machines.
| >
| > Does anyone help us?
|
| Why don't you propose the performance test to OSDL? (OSDL-J is more
| better, I think) OSDL provide hardware resources and operation staffs.

and why do you say that? 8;)

| If you want, I can help you to propose it. :-)

That's the right thing to do.

--
~Randy

2002-10-17 13:22:13

by Andrew Theurer

[permalink] [raw]
Subject: Re: Re: [PATCH] zerocopy NFS for 2.5.36

Subject: Re: [NFS] Re: [PATCH] zerocopy NFS for 2.5.36


> Hello,
>
> Thanks for testing my patches.
>
> > I am still seeing some sort of problem on an 8 way (hyperthreaded 8
> > logical/4 physical) on UDP with these patches. I cannot get more than 2
> > NFSd threads in a run state at one time. TCP usually has 8 or more. The
> > test involves 40 100Mbit clients reading a 200 MB file on one server (4
> > acenic adapters) in cache. I am fighting some other issues at the moment
> > (acpi wierdness), but so far before the patches, 82 MB/sec for NFSv2,UDP and
> > 138 MB/sec for NFSv2,TCP. With the patches, 115 MB/sec for NFSv2,UDP and
> > 181 MB/sec for NFSv2,TCP. One CPU is maxed due to acpi int storm, so I
> > think the results will get better. I'm not sure what other lock or
> > contention point this is hitting on UDP. If there is anything I can do to
> > help, please let me know, thanks.
>
> I guess some UDP packets might be lost. It may happen easily as UDP
> protocol doesn't support flow control.
> Can you check how many errors has happened?
> You can see them in /proc/net/snmp of the server and the clients.

server: Udp: InDatagrams NoPorts InErrors OutDatagrams
Udp: 1000665 41 0 1000666

clients: Udp: InDatagrams NoPorts InErrors OutDatagrams
Udp: 200403 0 0 200406
(all clients the same)

> And how many threads did you start on your machine?
> Buffer size of a UDP socket depends on number of kNFS threads.
> Large number of threads might help you.

128 threads. client rsize=8196. Server and client MTU is 1500.

Andrew Theurer




2002-10-17 13:33:10

by Hirokazu Takahashi

[permalink] [raw]
Subject: Re: Re: [PATCH] zerocopy NFS for 2.5.36

Hi,

> server: Udp: InDatagrams NoPorts InErrors OutDatagrams
> Udp: 1000665 41 0 1000666
> clients: Udp: InDatagrams NoPorts InErrors OutDatagrams
> Udp: 200403 0 0 200406
> (all clients the same)

How about IP datagrams? You can see the IP fields in /proc/net/snmp;
the IP layer may also discard them.

> > And how many threads did you start on your machine?
> > Buffer size of a UDP socket depends on number of kNFS threads.
> > Large number of threads might help you.
>
> 128 threads. client rsize=8196. Server and client MTU is 1500.

It seems enough...




2002-11-01 00:54:56

by NeilBrown

[permalink] [raw]
Subject: Re: Re: [PATCH] zerocopy NFS for 2.5.43

On Friday November 1, [email protected] wrote:
> Hello,
>
> > The rest of the zero copy stuff should fit in quite easily, with the
> > possible exception of single-copy writes: I haven't looked very hard
> > at that yet.
>
> I just ported part of the zero copy stuff against linux-2.5.45.
> single-copy writes and per-cpu sokcets are not included yet.
> And I fixed a problem that NFS over TCP wouldn't work.
>
>
> va-nfsd-sendpage.patch ....use sendpage instead of sock_sendmsg.
> va-sunrpc-zeropage.patch ....zero filled page for padding.
> va-nfsd-vfsread.patch ....zero-copy nfsd_read/nfsd_readdir.

A lot of this looks fine.

I would like to leave the tail pointing into the end of the first page
(just after the head) rather than using the sunrpc_zero_page thing, as
the latter doesn't seem necessary.
Also, I would like to send the head and tail with sendpage rather
than using sock_sendmsg.
To give the destination address, you can call sock_sendmsg with
a length of 0, and then call ->sendpage for each page or page
fragment.
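
A rough sketch of that send path (illustrative only: it ignores partial
sends and set_fs() handling, the head and tail pieces would be sent the
same way via ->sendpage from the first reply page, and the function name
is made up):

#include <linux/net.h>
#include <linux/socket.h>
#include <linux/pagemap.h>
#include <linux/sunrpc/xdr.h>

static int send_xdr_by_sendpage(struct socket *sock, struct msghdr *msg,
                                struct xdr_buf *xdr)
{
        struct page **ppage = xdr->pages;
        unsigned int pglen = xdr->page_len;
        size_t base = xdr->page_base;
        int sent = 0, ret;

        /* zero-length send just attaches the UDP destination address */
        ret = sock_sendmsg(sock, msg, 0);
        if (ret < 0)
                return ret;

        ppage += base >> PAGE_CACHE_SHIFT;
        base &= ~PAGE_CACHE_MASK;
        while (pglen > 0) {
                unsigned int len = PAGE_CACHE_SIZE - base;

                if (len > pglen)
                        len = pglen;
                ret = sock->ops->sendpage(sock, *ppage, base, len,
                                          pglen > len ? MSG_MORE : 0);
                if (ret <= 0)
                        return sent ? sent : ret;
                sent += ret;
                pglen -= len;
                base = 0;
                ppage++;
        }
        return sent;
}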


You should be able to remove the calls to svcbuf_reserve in
nfsd_proc_readdir and nfsd3_proc_readdir, and then discard the
'buffer' variable as well.


If you could make those changes (or convince me otherwise), I will
forward the patches to Linus,
Thanks.

NeilBrown



2002-11-01 01:10:59

by NeilBrown

[permalink] [raw]
Subject: Re: Re: [PATCH] zerocopy NFS for 2.5.43

On Friday November 1, [email protected] wrote:
> Hello,
>
> > > The rest of the zero copy stuff should fit in quite easily, with the
> > > possible exception of single-copy writes: I haven't looked very hard
> > > at that yet.
> >
> > I just ported part of the zero copy stuff against linux-2.5.45.
> > single-copy writes and per-cpu sokcets are not included yet.
> > And I fixed a problem that NFS over TCP wouldn't work.
>
> I also ported the per-cpu socket patch against linux2.5.45.
>

I still don't really like this patch.
I appreciate that some sort of SMP awareness may be appropriate for
nfsd, but this just doesn't feel right.

One possibility that I have considered goes like this:

- Allow a (udp) socket to have 'cpu affinity' registered.
- Get udp_v4_lookup to add to the score for sockets that
like the current cpu, and reject sockets that don't like this
cpu.
- Have some cpu affinity with the nfsd threads, probably having
a separate idle-server-queue for each cpu. Possibly half the
threads would be tied to a cpu, the other half would float, and
only be used if no cpu-local threads were available.

Then instead of having special 'shadow' sockets, we just create NCPUS
normal udp sockets, instead of one, and give each a cpu affinity.
This would mean that receiving would benefit from multiple sockets
as well as sending.

I have very little experience with these sort of SMP issues, so I may
be missing something obvious, but to me, this approach seems cleaner
and more general.

Dave: what would you think of having a "unsigned long cpus_allowed"
in struct inet_opt and putting the appropriate checks in
udp_v4_lookup?? Is it worth experimenting with?
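
Roughly, the scoring check being proposed (a sketch only: the
cpus_allowed field doesn't exist yet and the helper name is made up;
udp_v4_lookup()'s scoring loop would add the returned bonus to a
candidate socket's score and skip the socket on a negative return):

#include <linux/smp.h>

static inline int udp_sock_cpu_score(unsigned long cpus_allowed)
{
        if (!cpus_allowed)
                return 0;       /* socket has no CPU preference */
        if (cpus_allowed & (1UL << smp_processor_id()))
                return 4;       /* bonus: socket likes the current CPU */
        return -1;              /* reject: socket is tied to other CPUs */
}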

NeilBrown



2002-11-01 01:47:35

by Hirokazu Takahashi

[permalink] [raw]
Subject: Re: Re: [PATCH] zerocopy NFS for 2.5.43

Hello,

> > va-nfsd-sendpage.patch ....use sendpage instead of sock_sendmsg.
> > va-sunrpc-zeropage.patch ....zero filled page for padding.
> > va-nfsd-vfsread.patch ....zero-copy nfsd_read/nfsd_readdir.
>
> A lot of this looks fine.
>
> I would like to leave the tail pointing into the end of the first page
> (just after the head) rather than using the sunrpc_zero_page thing as
> the later doesn't seem necessary.
> Also, I would like to send the head and tail with sendpage rather
> than using sock_sendmsg.

Yes, we can.
I'll do it, though it seems a little bit tricky.

> To give the destination address, you can call sock_sendmsg with
> a length of 0, and then call ->sendpage for each page or page
> fragment.

Ok.

> You should be able to remove the calls to svcbuf_reserve in
> nfsd_proc_readdir and nfsd3_proc_readdir, and then discard the
> 'buffer' variable as well.

Yes, you're right.

> If you could make those changes (or convince me otherwise), I will
> forward the patches to Linus,
> Thanks.

Thanks!



2002-11-01 03:48:47

by Hirokazu Takahashi

[permalink] [raw]
Subject: Re: Re: [PATCH] zerocopy NFS for 2.5.43

Hello,

I updated the patches.
I'll send you 2 patches:

va-nfsd-sendpage.patch
va-nfsd-vfsread.patch

> I would like to leave the tail pointing into the end of the first page
> (just after the head) rather than using the sunrpc_zero_page thing as
> the later doesn't seem necessary.
> Also, I would like to send the head and tail with sendpage rather
> than using sock_sendmsg.
> To give the destination address, you can call sock_sendmsg with
> a length of 0, and then call ->sendpage for each page or page
> fragment.
>
>
> You should be able to remove the calls to svcbuf_reserve in
> nfsd_proc_readdir and nfsd3_proc_readdir, and then discard the
> 'buffer' variable as well.
>
>
> If you could make those changes (or convince me otherwise), I will
> forward the patches to Linus,
> Thanks.

Thank you,
Hirokazu Takahashi.



Attachments:
zerocopy-2.5.45-new.taz (5.77 kB)

2002-11-01 04:21:02

by NeilBrown

[permalink] [raw]
Subject: Re: Re: [PATCH] zerocopy NFS for 2.5.43

On Friday November 1, [email protected] wrote:
> Hello,
>
> I updated the patches.
> I'll send you 2 patches
>
> va-nfsd-sendpage.patch
> va-nfsd-vfsread.patch
>

Thanks.
I made a couple of little changes and sent them to Linus and the list.
1/ I simplified the sending of the tail a bit more. We assume the
tail is *always* in the same page as the head, and just sendpage it.

2/ I removed svcbuf_reserve and the buffer variable from
nfsd3_proc_readdirplus as well :-)

Thanks again,
NeilBrown



2002-11-01 05:15:41

by Hirokazu Takahashi

[permalink] [raw]
Subject: Re: Re: [PATCH] zerocopy NFS for 2.5.43

Hi,

> Thanks.
> I made a couple of little changes and sent them to Linus and the list.
> 1/ I simplified the sending of the tail a bit more. We assume the
> tail is *always* in the same page as the head, and just sendpage it.

It looks fine!

> 2/ I removed svcbuf_reserve and the buffer variable from
> nfsd3_proc_readdirplus as well :-)

Thanks.




2002-11-04 21:45:42

by Andrew Theurer

[permalink] [raw]
Subject: Re: Re: [PATCH] zerocopy NFS for 2.5.43

> > I also ported the per-cpu socket patch against linux2.5.45.
>
> I still don't really like this patch.
> I appreciate that some sort of SMP awareness may be appropriate for
> nfsd, but this just doesn't feel right.
>
> Once possibility that I have considered goes like this:
>
> - Allow a (udp) socket to have 'cpu affinity' registered.
> - Get udp_v4_lookup add to the score for sockets that
> like the current cpu, and reject sockets that don't like this
> cpu.
> - Have some cpu affinity with the nfsd threads, probably having
> a separate idle-server-queue for each cpu. Possibly half the
> threads would be tied to a cpu, the other half would float, and
> only be used if no cpu-local threads were available.

This all sounds great, I wish I knew how to do this :)

> Then instead of have special 'shadow' sockets, we just create NCPUS
> normal udp sockets, instead of one, and give each a cpu affinity.
> This would mean that receiving would benefit from multiple sockets
> as well as sending.

So, the target socket getting populated on inbound traffic would likely depend
on which CPU took the net card interrupt? And the resulting CPU would handle
the NFS request? If so, and you had a good interrupt balance across CPUs,
that sounds fine. If you have an interrupt imbalance, it could be really
bad. This doesn't sound like a problem on a system like PIII, where
interrupts can float (don't know how irqbalance works with that), but I'm not
so sure about P4, even with irqbalance. Over time they do balance out, but
in my experience a particular interrupt is being handled by one CPU or
another with a significant (in this context) amount of time between
destination changes.

> I have very little experience with these sort of SMP issues, so I may
> be missing something obvious, but to me, this approach seems cleaner
> and more general.
>
> Dave: what would you think of having a "unsigned long cpus_allowed"
> in struct inet_opt and putting the appropriate checks in
> udp_v4_lookup?? Is it worth experimenting with?
>
> NeilBrown


