2009-09-04 20:19:43

by Ben Greear

Subject: Reading NFS file without copying to user-space?

I'm trying to optimize a tool that should do NFS reads as fast as possible
from a server in order to stress test the server.

Currently, I open the file as normal and read into a pre-allocated buffer.

This causes a copy of the data to user-space.

Is there any way to cause the nfs client logic to still request the file-read,
but not actually copy anything to user-space?

Maybe some trick with mmap would do this?

Thanks,
Ben

--
Ben Greear <[email protected]>
Candela Technologies Inc http://www.candelatech.com



2009-09-04 22:49:48

by Trond Myklebust

Subject: Re: Reading NFS file without copying to user-space?

On Fri, 2009-09-04 at 15:30 -0700, Ben Greear wrote:
> On 09/04/2009 03:15 PM, Trond Myklebust wrote:
> > On Fri, 2009-09-04 at 14:57 -0700, Ben Greear wrote:
> >> On 09/04/2009 01:58 PM, Trond Myklebust wrote:
> >>
> >>> You're missing the point. O_DIRECT does not copy data from the kernel
> >>> into userspace. The data is placed directly into the user buffer from
> >>> the socket.
> >>>
> >>> The only faster alternative would be to directly discard the data in the
> >>> socket, and we offer no option to do that.
> >>
> >> I was thinking I might be clever and use sendfile to send an nfs
> >> file to /dev/zero, but unfortunately it seems sendfile can only send
> >> to a destination that is a socket....
> >
> > Why do you think that would be any faster than standard O_DIRECT? It
> > should be slower, since it involves an extra copy.
>
> I was thinking that the kernel might take the data received in the skb's from
> the file-server and send it to /dev/null, i.e., basically just immediately
> discard the received data. If it could do that, it would be a zero-copy
> read: The only copying would be the NIC DMA'ing the packet into the skb.

No... The RPC layer will always copy the data from the socket into a
buffer. If you are using O_DIRECT reads, then that buffer will be the
same one that you supplied in userland (the kernel just uses page table
trickery to map those pages into the kernel address space). If you are
using any other type of read (even if it is being piped using sendfile()
or splice()) then it will copy that data into the NFS filesystem's page
cache.

> It would also seem to me that if one allowed sendfile to copy between
> files, it could do the same trick saving to a real file and save user-space
> having to read the file in and then write it out again to disk.

As I said above, sendfile and splice don't work that way. They both use
the page cache as the source, so the filesystem needs to fill the page
cache first.
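
A minimal sketch of what "piping" a file to /dev/null with splice() looks like from user space, assuming Linux and glibc. The function name splice_to_devnull and the 64KB chunk size are made up for illustration. Note that splice() needs a pipe on one side, so the data goes file -> pipe -> /dev/null, and the filesystem still has to populate its page cache to feed the first step, which is the point made above.

```c
#define _GNU_SOURCE           /* exposes splice() in <fcntl.h> on Linux */
#include <fcntl.h>
#include <unistd.h>

#define SPLICE_CHUNK 65536    /* hypothetical chunk size (default pipe capacity) */

/*
 * Move a file's contents to /dev/null through a pipe using splice().
 * Returns the number of bytes drained, or -1 on error.
 */
long long splice_to_devnull(const char *path)
{
    int pipefd[2];
    long long total = 0;
    ssize_t in, out;
    int src, sink;

    src = open(path, O_RDONLY);
    if (src < 0)
        return -1;
    sink = open("/dev/null", O_WRONLY);
    if (sink < 0) {
        close(src);
        return -1;
    }
    if (pipe(pipefd) < 0) {
        close(src);
        close(sink);
        return -1;
    }

    for (;;) {
        /* file -> pipe: the filesystem fills its page cache to satisfy this */
        in = splice(src, NULL, pipefd[1], NULL, SPLICE_CHUNK, SPLICE_F_MOVE);
        if (in <= 0)
            break;                      /* 0 means EOF, negative means error */

        /* pipe -> /dev/null: drain exactly what was just queued */
        while (in > 0) {
            out = splice(pipefd[0], NULL, sink, NULL, (size_t)in, SPLICE_F_MOVE);
            if (out <= 0) {
                in = -1;
                break;
            }
            in -= out;
            total += out;
        }
        if (in < 0)
            break;
    }

    close(pipefd[0]);
    close(pipefd[1]);
    close(src);
    close(sink);
    return in < 0 ? -1 : total;
}
```

This avoids the copy into a user buffer, but not the copy from the socket into the page cache, so it should not be expected to beat O_DIRECT for this workload.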

> Out of curiosity, does anyone have any benchmarks for NFS on 10G hardware?

I'm not aware of any public figures. I'd be interested to hear how you
max out.

> Based on testing against another vendor's nfs server, it seems that the client
> is losing packets (the server shows tcp retransmits).

Is the data being lost at the client, the switch or the server? Assuming
that you are using a managed switch, then a look at its statistics
should be able to answer that question.



2009-09-04 23:03:10

by Ben Greear

Subject: Re: Reading NFS file without copying to user-space?

On 09/04/2009 03:49 PM, Trond Myklebust wrote:
> On Fri, 2009-09-04 at 15:30 -0700, Ben Greear wrote:

>> I was thinking that the kernel might take the data received in the skb's from
>> the file-server and send it to /dev/null, i.e., basically just immediately
>> discard the received data. If it could do that, it would be a zero-copy
>> read: The only copying would be the NIC DMA'ing the packet into the skb.
>
> No... The RPC layer will always copy the data from the socket into a
> buffer. If you are using O_DIRECT reads, then that buffer will be the
> same one that you supplied in userland (the kernel just uses page table
> trickery to map those pages into the kernel address space). If you are
> using any other type of read (even if it is being piped using sendfile()
> or splice()) then it will copy that data into the NFS filesystem's page
> cache.

Ok, I think I understand that better now. Seems like one could have
RPC use a list of skbs as data store instead of copying the data,
but perhaps that would be optimizing for something no one would
ever really want in the real world.


>> Out of curiosity, does anyone have any benchmarks for NFS on 10G hardware?
>
> I'm not aware of any public figures. I'd be interested to hear how you
> max out.
>
>> Based on testing against another vendor's nfs server, it seems that the client
>> is losing packets (the server shows tcp retransmits).
>
> Is the data being lost at the client, the switch or the server? Assuming
> that you are using a managed switch, then a look at its statistics
> should be able to answer that question.

At least for my local Linux-to-Linux tests, I'm using just a fibre optic
cable to connect them, so it's definitely not a switch problem here. No obvious
errors are reported by either NIC, and pktgen tests show that they can easily
sustain 9Gbps. I need to look more closely at the netstat
counters and such. I suspect my network buffers may be too small. I last
set up their defaults when a 1GB RAM system was 'high end', and now
I'm using 12GB systems :P

Thanks,
Ben

--
Ben Greear <[email protected]>
Candela Technologies Inc http://www.candelatech.com


2009-09-04 20:35:45

by Trond Myklebust

Subject: Re: Reading NFS file without copying to user-space?

On Fri, 2009-09-04 at 12:48 -0700, Ben Greear wrote:
> I'm trying to optimize a tool that should do NFS reads as fast as possible
> from a server in order to stress test the server.
>
> Currently, I open the file as normal and read into a pre-allocated buffer.
>
> This causes a copy of the data to user-space.
>
> Is there any way to cause the nfs client logic to still request the file-read,
> but not actually copy anything to user-space?
>
> Maybe some trick with mmap would do this?

How about using O_DIRECT? That just copies the data directly into user
pages and avoids all the overhead of using the page cache?

Note that you can combine O_DIRECT with aio in order to further increase
the speeds.

Cheers
Trond
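
A minimal sketch of the O_DIRECT approach suggested above, assuming Linux and glibc. The function name direct_read_discard, the 4096-byte alignment and the 4MB read length are illustrative choices, not anything from the thread. The buffered fallback is only there so the sketch also runs on filesystems that reject O_DIRECT; a real stress tool would probably treat that as an error.

```c
#define _GNU_SOURCE           /* exposes O_DIRECT in <fcntl.h> on Linux */
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical sizes: tune them to the mount's rsize and your hardware. */
#define READ_ALIGN 4096
#define READ_LEN   (4 * 1024 * 1024)

/*
 * Read a whole file with O_DIRECT and throw the data away.
 * Returns the number of bytes read, or -1 on error.
 */
long long direct_read_discard(const char *path)
{
    void *buf = NULL;
    long long total = 0;
    ssize_t n;
    int fd;

    fd = open(path, O_RDONLY | O_DIRECT);
    if (fd < 0 && errno == EINVAL)
        fd = open(path, O_RDONLY);  /* filesystem rejects O_DIRECT: buffered fallback */
    if (fd < 0)
        return -1;

    /* O_DIRECT generally wants an aligned buffer and an aligned length. */
    if (posix_memalign(&buf, READ_ALIGN, READ_LEN) != 0) {
        close(fd);
        return -1;
    }

    /* The same buffer is reused for every read; its contents are never examined. */
    while ((n = read(fd, buf, READ_LEN)) > 0)
        total += n;

    free(buf);
    close(fd);
    return n < 0 ? -1 : total;
}
```

This issues one synchronous read at a time. Combining O_DIRECT with aio, as mentioned above, would allow several reads to be in flight at once.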


2009-09-04 20:49:44

by Ben Greear

Subject: Re: Reading NFS file without copying to user-space?

On 09/04/2009 01:35 PM, Trond Myklebust wrote:
> On Fri, 2009-09-04 at 12:48 -0700, Ben Greear wrote:
>> I'm trying to optimize a tool that should do NFS reads as fast as possible
>> from a server in order to stress test the server.
>>
>> Currently, I open the file as normal and read into a pre-allocated buffer.
>>
>> This causes a copy of the data to user-space.
>>
>> Is there any way to cause the nfs client logic to still request the file-read,
>> but not actually copy anything to user-space?
>>
>> Maybe some trick with mmap would do this?
>
> How about using O_DIRECT? That just copies the data directly into user
> pages and avoids all the overhead of using the page cache?
>
> Note that you can combine O_DIRECT with aio in order to further increase
> the speeds.

I'm using O_DIRECT (so that the server is continually stressed even if
the file would have otherwise been cached locally on the client).

This still causes a copy of the contents to user-space when I do a
read() call though, as far as I can tell. Since I'm normally not looking
at this data at all, the memory copy from kernel to user is wasted
effort in my case.

I haven't looked into aio yet... will go do some googling...

Thanks,
Ben


>
> Cheers
> Trond


--
Ben Greear <[email protected]>
Candela Technologies Inc http://www.candelatech.com


2009-09-04 20:58:34

by Trond Myklebust

Subject: Re: Reading NFS file without copying to user-space?

On Sep 4, 2009, at 16:49, Ben Greear <[email protected]> wrote:

> I'm using O_DIRECT (so that the server is continually stressed even if
> the file would have otherwise been cached locally on the client).
>
> This still causes a copy of the contents to user-space when I do a
> read() call though, as far as I can tell. Since I'm normally not looking
> at this data at all, the memory copy from kernel to user is wasted
> effort in my case.

You're missing the point. O_DIRECT does not copy data from the kernel
into userspace. The data is placed directly into the user buffer from
the socket.

The only faster alternative would be to directly discard the data in
the socket, and we offer no option to do that.

Trond

2009-09-04 21:12:06

by Ben Greear

Subject: Re: Reading NFS file without copying to user-space?

On 09/04/2009 01:58 PM, Trond Myklebust wrote:
> On Sep 4, 2009, at 16:49, Ben Greear <[email protected]> wrote:
>
>> I'm using O_DIRECT (so that the server is continually stressed even if
>> the file would have otherwise been cached locally on the client).
>>
>> This still causes a copy of the contents to user-space when I do a
>> read() call though, as far as I can tell. Since I'm normally not looking
>> at this data at all, the memory copy from kernel to user is wasted
>> effort in my case.
>
> You're missing the point. O_DIRECT does not copy data from the kernel
> into userspace. The data is placed directly into the user buffer from
> the socket.

I may be going about things all wrong...

>
> The only faster alternative would be to directly discard the data in the
> socket, and we offer no option to do that.

I'm opening an fd like this:


uint32 flgs = O_RDONLY | O_DIRECT | O_LARGEFILE;
fd = open(fname, flgs);

Then I read from the fd:
int retval = read(fd, rcv_buffer_ptr, my_read_len);

rcv_buffer_ptr is just a 1MB (or so) array of bytes.


Maybe I need to use aio_read with O_DIRECT to get the benefits you speak of?

Thanks,
Ben

--
Ben Greear <[email protected]>
Candela Technologies Inc http://www.candelatech.com


2009-09-04 21:57:12

by Ben Greear

Subject: Re: Reading NFS file without copying to user-space?

On 09/04/2009 01:58 PM, Trond Myklebust wrote:

> You're missing the point. O_DIRECT does not copy data from the kernel
> into userspace. The data is placed directly into the user buffer from
> the socket.
>
> The only faster alternative would be to directly discard the data in the
> socket, and we offer no option to do that.

I was thinking I might be clever and use sendfile to send an nfs
file to /dev/zero, but unfortunately it seems sendfile can only send
to a destination that is a socket....

Ben

--
Ben Greear <[email protected]>
Candela Technologies Inc http://www.candelatech.com


2009-09-04 22:00:32

by Trond Myklebust

Subject: Re: Reading NFS file without copying to user-space?

On Sep 4, 2009, at 17:12, Ben Greear <[email protected]> wrote:

> On 09/04/2009 01:58 PM, Trond Myklebust wrote:
>> On Sep 4, 2009, at 16:49, Ben Greear <[email protected]> wrote:
>>
>>> I'm using O_DIRECT (so that the server is continually stressed even if
>>> the file would have otherwise been cached locally on the client).
>>>
>>> This still causes a copy of the contents to user-space when I do a
>>> read() call though, as far as I can tell. Since I'm normally not looking
>>> at this data at all, the memory copy from kernel to user is wasted
>>> effort in my case.
>>
>> You're missing the point. O_DIRECT does not copy data from the kernel
>> into userspace. The data is placed directly into the user buffer from
>> the socket.
>
> I may be going about things all wrong...
>
>>
>> The only faster alternative would be to directly discard the data in the
>> socket, and we offer no option to do that.
>
> I'm opening an fd like this:
>
>
> uint32 flgs = O_RDONLY | O_DIRECT | O_LARGEFILE;
> fd = open(fname, flgs);
>
> Then I read from the fd:
> int retval = read(fd, rcv_buffer_ptr, my_read_len);
>
> rcv_buffer_ptr is just a 1MB (or so) array of bytes.
>

Use a (much) larger buffer. Linux clients are capable of reading 2MB
in a single RPC, so you won't be doing much in the way of parallel
reads with 1MB.
I'd also suggest bumping up the number of tcp slots (see in
/proc/sys/fs/nfs/). This should be done before you mount the NFS partition.
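
A small sketch for checking such a tunable before mounting, assuming the setting is exposed as a file containing a single integer. The helper name read_int_setting is made up for illustration. On the kernels I am aware of, the sunrpc slot count is exposed as /proc/sys/sunrpc/tcp_slot_table_entries rather than under /proc/sys/fs/nfs/, but treat that path as an assumption and check your own system.

```c
#include <stdio.h>

/*
 * Read a single integer from a sysctl-style file.
 * Returns 0 and stores the value in *out on success, -1 on failure.
 */
int read_int_setting(const char *path, long *out)
{
    FILE *f = fopen(path, "r");
    int ok;

    if (!f)
        return -1;
    ok = (fscanf(f, "%ld", out) == 1);  /* expect exactly one integer */
    fclose(f);
    return ok ? 0 : -1;
}
```

For example, read_int_setting("/proc/sys/sunrpc/tcp_slot_table_entries", &slots) would report the current value if that file exists. Changing it is a matter of writing a new number to the same file as root before the mount is made.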

2009-09-04 22:15:07

by Trond Myklebust

Subject: Re: Reading NFS file without copying to user-space?

On Fri, 2009-09-04 at 14:57 -0700, Ben Greear wrote:
> On 09/04/2009 01:58 PM, Trond Myklebust wrote:
>
> > You're missing the point. O_DIRECT does not copy data from the kernel
> > into userspace. The data is placed directly into the user buffer from
> > the socket.
> >
> > The only faster alternative would be to directly discard the data in the
> > socket, and we offer no option to do that.
>
> I was thinking I might be clever and use sendfile to send an nfs
> file to /dev/zero, but unfortunately it seems sendfile can only send
> to a destination that is a socket....

Why do you think that would be any faster than standard O_DIRECT? It
should be slower, since it involves an extra copy.

Trond



2009-09-04 22:31:00

by Ben Greear

Subject: Re: Reading NFS file without copying to user-space?

On 09/04/2009 03:15 PM, Trond Myklebust wrote:
> On Fri, 2009-09-04 at 14:57 -0700, Ben Greear wrote:
>> On 09/04/2009 01:58 PM, Trond Myklebust wrote:
>>
>>> You're missing the point. O_DIRECT does not copy data from the kernel
>>> into userspace. The data is placed directly into the user buffer from
>>> the socket.
>>>
>>> The only faster alternative would be to directly discard the data in the
>>> socket, and we offer no option to do that.
>>
>> I was thinking I might be clever and use sendfile to send an nfs
>> file to /dev/zero, but unfortunately it seems sendfile can only send
>> to a destination that is a socket....
>
> Why do you think that would be any faster than standard O_DIRECT? It
> should be slower, since it involves an extra copy.

I was thinking that the kernel might take the data received in the skb's from
the file-server and send it to /dev/null, i.e., basically just immediately
discard the received data. If it could do that, it would be a zero-copy
read: The only copying would be the NIC DMA'ing the packet into the skb.

It would also seem to me that if one allowed sendfile to copy between
files, it could do the same trick saving to a real file and save user-space
having to read the file in and then write it out again to disk.

Truth is, I don't know much about the low level of file-io, so I may
be completely confused about things :)

I'll try using much larger buffers for the read() call, and will also make
sure the networking buffer pools are big enough.

Out of curiosity, does anyone have any benchmarks for NFS on 10G hardware?

I have two 2.6.31-rc8 Linux systems that for a short time will serve and sink
about 9Gbps of file-io (serving from a 2GB tmpfs, discarding as soon as we read).
Something goes weird after a minute or two: bandwidth drops and bounces
between 4Gbps and 8Gbps.
Based on testing against another vendor's nfs server, it seems that the client
is losing packets (the server shows tcp retransmits).

Thanks,
Ben

--
Ben Greear <[email protected]>
Candela Technologies Inc http://www.candelatech.com