2008-02-05 15:50:23

by Frank van Maarseveen

Subject: Re: [PATCH 4/5] NFSD: Remove NFSD_TCP kernel build option

On Tue, Feb 05, 2008 at 04:49:39PM +1100, Greg Banks wrote:
> Chuck Lever wrote:
> > On Feb 4, 2008, at 7:29 PM, Greg Banks wrote:
> >> Trond Myklebust wrote:
> >>> On Tue, 2008-02-05 at 11:19 +1100, Greg Banks wrote:
> >>>
> >>>> Chuck Lever wrote:
> >>>>
> >>>>> TCP support in the Linux NFS server is stable enough that we can
> >>>>> leave it
> >>>>> on always. CONFIG_NFSD_TCP adds about 10 lines of code, and
> >>>>> defaults to
> >>>>> "Y" anyway.
> >>>>>
> >>>>> A run-time switch might be more appropriate if people feel they
> >>>>> would like
> >>>>> to disable NFSD's TCP support.
> >>>>>
> >>>>>
> >>>>>
> >>>> Looks good.
> >>>>
> >>>> Actually, I'd be inclined to go one step further and set UDP support
> >>>> off by default.
> >>>>
> >>>
> >>> That will break older clients.
> >>>
> >>>
> >> Hence the default, rather than removing the code entirely.
> >
> >
> > What might make sense is to remove NFSD_TCP, but add NFSD_UDP,
> > defaulting to Y.
> >
> > Then in a year or two we can change the default to N.
> >
> Fine by me.

Last time I checked (around 2.6.22) writing large files on NFSv3 over
UDP was 20% faster compared to TCP (Gb LAN with one switch connecting
all machines).

TCP and its timeout/retransmission behavior isn't always the best choice.

--
Frank


2008-02-05 17:50:52

by Trond Myklebust

Subject: Re: [PATCH 4/5] NFSD: Remove NFSD_TCP kernel build option


On Tue, 2008-02-05 at 16:50 +0100, Frank van Maarseveen wrote:
> Last time I checked (around 2.6.22) writing large files on NFSv3 over
> UDP was 20% faster compared to TCP (Gb LAN with one switch connecting
> all machines).
>
> TCP and its timeout/retransmission behavior isn't always the best choice.

If your environment has only 1 client working against a dedicated NFS
server on a clean network, then that may indeed be the case, but as soon
as you have more than 1 client, TCP almost always ends up outperforming
UDP.

Cheers
Trond


2008-02-05 22:59:08

by Greg Banks

Subject: Re: [PATCH 4/5] NFSD: Remove NFSD_TCP kernel build option

Frank van Maarseveen wrote:
> On Tue, Feb 05, 2008 at 04:49:39PM +1100, Greg Banks wrote:
>
>> Chuck Lever wrote:
>>
>>> On Feb 4, 2008, at 7:29 PM, Greg Banks wrote:
>>>
>>>> Trond Myklebust wrote:
>>>>
>>>>> On Tue, 2008-02-05 at 11:19 +1100, Greg Banks wrote:
>>>>>
>>>>>
>>>>>> Chuck Lever wrote:
>>>>>>
>>>>>>
>>>>>>> TCP support in the Linux NFS server is stable enough that we can
>>>>>>> leave it
>>>>>>> on always. CONFIG_NFSD_TCP adds about 10 lines of code, and
>>>>>>> defaults to
>>>>>>> "Y" anyway.
>>>>>>>
>>>>>>> A run-time switch might be more appropriate if people feel they
>>>>>>> would like
>>>>>>> to disable NFSD's TCP support.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>> Looks good.
>>>>>>
>>>>>> Actually, I'd be inclined to go one step further and set UDP support
>>>>>> off by default.
>>>>>>
>>>>>>
>>>>> That will break older clients.
>>>>>
>>>>>
>>>>>
>>>> Hence the default, rather than removing the code entirely.
>>>>
>>> What might make sense is to remove NFSD_TCP, but add NFSD_UDP,
>>> defaulting to Y.
>>>
>>> Then in a year or two we can change the default to N.
>>>
>>>
>> Fine by me.
>>
>
> Last time I checked (around 2.6.22) writing large files on NFSv3 over
> UDP was 20% faster compared to TCP (Gb LAN with one switch connecting
> all machines).
>
Did all of your file arrive at the server, and in the same order it left
the client? NFS on UDP relies on IP fragmentation, which is known to
introduce silent data corruption at high data rates (google for "IPID aliasing").
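
To put rough numbers on the IPID aliasing concern, here is a back-of-the-envelope sketch. It relies on one certain fact (the IPv4 Identification field is 16 bits wide) and several assumptions that are not from this thread: a 32 KiB write size, a 1500-byte MTU, a single ID counter shared by all datagrams between one client and one server, and a 30-second fragment reassembly timeout (the usual Linux ipfrag_time default). RPC and UDP header overhead is ignored, and the function names are made up for illustration.

```python
IP_ID_SPACE = 2 ** 16  # the IPv4 Identification field is 16 bits wide


def fragments_per_datagram(payload_bytes, mtu=1500, ip_header=20):
    """Number of IP fragments needed for one UDP datagram (headers ignored)."""
    # Every fragment except the last must carry a multiple of 8 payload bytes.
    per_fragment = (mtu - ip_header) // 8 * 8
    return -(-payload_bytes // per_fragment)  # ceiling division


def ipid_wrap_seconds(throughput_bytes_per_s, payload_bytes):
    """Seconds until the 16-bit IP ID counter wraps at a given data rate."""
    datagrams_per_s = throughput_bytes_per_s / payload_bytes
    return IP_ID_SPACE / datagrams_per_s


if __name__ == "__main__":
    wsize = 32 * 1024        # assumed NFS write size, bytes
    rate = 100e6             # assumed sustained throughput, bytes/s
    reassembly_timeout = 30  # assumed fragment reassembly timeout, seconds

    wrap = ipid_wrap_seconds(rate, wsize)
    print("fragments per write:", fragments_per_datagram(wsize))
    print("IP ID wraps every %.1f s" % wrap)
    print("aliasing window open:", wrap < reassembly_timeout)
```

Under these assumptions each write is split into 23 fragments and the ID counter wraps roughly every 21 seconds, which is shorter than the assumed reassembly timeout. A fragment lost from one datagram can therefore still be waiting in the reassembly queue when a later datagram reuses the same ID, and the UDP checksum is the only remaining guard against a mismatched reassembly.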

Also, last time I checked, UDP support in the server uses a single socket
for all traffic, and processes need to serialise on the svc_sock lock to send,
so aggregate UDP throughput is strictly limited compared to TCP. As in, 145 MB/s
for UDP compared to filling twelve 1 GigE pipes for TCP. I have a patch to fix this,
but given the inherent data corruption issues of UDP I haven't bothered posting
the most recent version.



> TCP and its timeout/retransmission behavior isn't always the best choice.
>
>
The timeout & retrans that sunrpc implements on top of UDP is arguably worse,
especially if you use the "soft" mount option.
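
As a rough illustration of why "soft" matters here, the sketch below estimates how long a soft mount waits before failing a request. It is a simplified model, not the sunrpc code: it assumes the timeout doubles after each retransmission up to a 60-second cap (the behaviour nfs(5) describes for UDP) and ignores the adaptive round-trip estimator. The function name and the timeo=7, retrans=3 values are illustrative assumptions.

```python
def soft_mount_give_up_seconds(timeo_deciseconds, retrans, max_timeout_s=60.0):
    """Estimate total wait before a soft mount returns an error.

    Simplified model: the initial send plus `retrans` retransmissions, with
    the per-attempt timeout doubling each time and capped at max_timeout_s.
    """
    timeout = timeo_deciseconds / 10.0  # timeo is given in tenths of a second
    total = 0.0
    for _ in range(retrans + 1):
        total += min(timeout, max_timeout_s)
        timeout *= 2
    return total


if __name__ == "__main__":
    # Illustrative values only: timeo=7 (0.7 s) and retrans=3.
    print("gives up after %.1f s" % soft_mount_give_up_seconds(7, 3))
```

With these values the model gives up after about 10.5 seconds (0.7 + 1.4 + 2.8 + 5.6), so a server stall of that length is enough for a soft mount to return an I/O error to the application.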

--
Greg Banks, R&D Software Engineer, SGI Australian Software Group.
The cake is *not* a lie.
I don't speak for SGI.