LinuxLists.cc - NFS Still broken in 2.6.x?

2006-02-23 20:35:29

Subject: NFS Still broken in 2.6.x?

Hi All. I'm running into a bit of trouble with NFS on 2.6. I see that
at least Trond thought, mid-January, that "The readahead algorithm has
been broken in 2.6.x for at least the past 6 months." (
http://www.ussg.iu.edu/hypermail/linux/kernel/0601.2/0559.html) Anyone
know if that has been fixed?

Basically, the problem I'm having is that downloads from an NFS server
using kernel 2.6 are no more than half as fast as the same from a
server using kernel 2.4. Write speed (uploading) seems to be about the
same, but reading is slow.

I'm using tcp as my protocol, at the suggestion of many posts, but
flipping over to udp doesn't seem to make any difference. I'm using
version 3, although I did switch to 2 just to check (it's no better,
usually slower). My read size is 32768 and my write size is 8192.
Decreasing the read size only slows down the transfers. Increasing
write size has no effect.

As for hardware, both machines are dual AMD Opterons, 100Mbps ethernet,
and the NFS is serving space on a RAID array. The 2.4 (2.4.21 to be
exact) kernel is running under SuSE 9.0, and the 2.6 (2.6.15 to be
exact) kernel is running under SuSE 10.0. I saw the same speed drop
when attempting to upgrade to SuSE 9.3. I stayed with 9.0 in hopes
that the problem would be fixed in the future.

Anyone have any ideas?

-Bryan

2006-02-23 22:47:13

by Trond Myklebust

Subject: Re: NFS Still broken in 2.6.x?

On Thu, 2006-02-23 at 15:35 -0500, Bryan Fink wrote:
> Hi All. I'm running into a bit of trouble with NFS on 2.6. I see that
> at least Trond thought, mid-January, that "The readahead algorithm has
> been broken in 2.6.x for at least the past 6 months." (
> http://www.ussg.iu.edu/hypermail/linux/kernel/0601.2/0559.html) Anyone
> know if that has been fixed?

No it hasn't been fixed. ...and no, this is not a problem that only
affects NFS: it just happens to give a more noticeable performance
impact due to the larger latency of NFS over a 100Mbps link.

I will get round to this, but the general opacity of the current
readahead code has been a bit of a put-off in the face of other NFS
problems.

Cheers,
Trond

2006-02-24 12:15:22

by Andrew Morton

Subject: Re: NFS Still broken in 2.6.x?

Trond Myklebust <[email protected]> wrote:
>
> On Thu, 2006-02-23 at 15:35 -0500, Bryan Fink wrote:
> > Hi All. I'm running into a bit of trouble with NFS on 2.6. I see that
> > at least Trond thought, mid-January, that "The readahead algorithm has
> > been broken in 2.6.x for at least the past 6 months." (
> > http://www.ussg.iu.edu/hypermail/linux/kernel/0601.2/0559.html) Anyone
> > know if that has been fixed?
>
> No it hasn't been fixed. ...and no, this is not a problem that only
> affects NFS: it just happens to give a more noticeable performance
> impact due to the larger latency of NFS over a 100Mbps link.

iirc, last time we went round this loop Ram and I were unable to reproduce it.

Does anyone have a testcase?

2006-02-24 13:36:56

by Trond Myklebust

Subject: Re: NFS Still broken in 2.6.x?

On Fri, 2006-02-24 at 04:14 -0800, Andrew Morton wrote:
> Trond Myklebust <[email protected]> wrote:
> >
> > On Thu, 2006-02-23 at 15:35 -0500, Bryan Fink wrote:
> > > Hi All. I'm running into a bit of trouble with NFS on 2.6. I see that
> > > at least Trond thought, mid-January, that "The readahead algorithm has
> > > been broken in 2.6.x for at least the past 6 months." (
> > > http://www.ussg.iu.edu/hypermail/linux/kernel/0601.2/0559.html) Anyone
> > > know if that has been fixed?
> >
> > No it hasn't been fixed. ...and no, this is not a problem that only
> > affects NFS: it just happens to give a more noticeable performance
> > impact due to the larger latency of NFS over a 100Mbps link.
>
> iirc, last time we went round this loop Ram and I were unable to reproduce it.
>
> Does anyone have a testcase?

Yes. A dead simple one

run iozone in sequential read mode on a tcp link w/ rsize == 32k

Monitor the traffic using tcpdump. Pretty soon you will see the size of
the NFS read requests drop from 32k to 4k, which indicates that there is
no readahead at all.

Cheers,
Trond
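
The symptom described above (read requests shrinking from 32k to 4k) can be checked mechanically once the request sizes have been pulled out of a capture. The following is only a sketch: extracting the sizes from tcpdump output is not shown, the helper name first_collapse is hypothetical, and the 4096-byte page size is an assumption (it matches x86).

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: 4k pages. A read request of one page or less means
 * nothing is being read ahead of the application's own request. */
#define PAGE_BYTES 4096u

/*
 * Hypothetical helper: given the byte counts of successive NFS READ
 * requests, return the index of the first request that starts a streak
 * of `run` consecutive single-page (or smaller) requests, or -1 if no
 * such streak exists.
 */
static long first_collapse(const unsigned *sizes, size_t n, size_t run)
{
    size_t streak = 0;

    assert(run > 0);
    for (size_t i = 0; i < n; i++) {
        if (sizes[i] <= PAGE_BYTES) {
            if (++streak == run)
                return (long)(i + 1 - run);
        } else {
            streak = 0;   /* a large request resets the streak */
        }
    }
    return -1;
}
```

Requiring a streak rather than a single small request avoids flagging the short read that naturally occurs at the end of a file.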

2006-02-24 14:22:21

by Bryan Fink

Subject: Re: NFS Still broken in 2.6.x?

Trond Myklebust wrote:

>On Fri, 2006-02-24 at 04:14 -0800, Andrew Morton wrote:
>
>
>>Trond Myklebust <[email protected]> wrote:
>>
>>
>>>On Thu, 2006-02-23 at 15:35 -0500, Bryan Fink wrote:
>>> > Hi All. I'm running into a bit of trouble with NFS on 2.6. I see that
>>> > at least Trond thought, mid-January, that "The readahead algorithm has
>>> > been broken in 2.6.x for at least the past 6 months." (
>>> > http://www.ussg.iu.edu/hypermail/linux/kernel/0601.2/0559.html) Anyone
>>> > know if that has been fixed?
>>>
>>> No it hasn't been fixed. ...and no, this is not a problem that only
>>> affects NFS: it just happens to give a more noticeable performance
>>> impact due to the larger latency of NFS over a 100Mbps link.
>>>
>>>
>>iirc, last time we went round this loop Ram and I were unable to reproduce it.
>>
>>Does anyone have a testcase?
>>
>>
>
>Yes. A dead simple one
>
>run iozone in sequential read mode on a tcp link w/ rsize == 32k
>
>
I'm sure Trond's testcase is much more useful, but for reference, I
thought I'd add that I've been doing my testing with a simple "dd
if=/nfsmount/file of=/dev/null bs=32k". /nfsmount/file is usually 2.5-3
GB, which makes the difference between NFS servers long enough that I
feel safe throwing a "time" in front of the whole command. That is, the
difference is nowhere near millisecond resolution (it's nearer a
minute), so I like to start the test and then walk away to do other things.

Interesting that it's not an NFS-only bug. I assumed it was when I
logged into each server so I could run "dd if=file of=/dev/null bs=32k"
locally. When I did that, both servers gave roughly the same speed.
Sorry I left this bit out of my first email. I assume this example only
illustrates how opaque the code around this problem truly is.

Thanks very much for the help.

-Bryan
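
For anyone who wants the same measurement without dd, here is a rough C equivalent of "time dd if=file of=/dev/null bs=32k". It is a sketch, not a benchmark tool: the function names are hypothetical, the 1 MiB cap on the block size is an arbitrary choice, and like dd it will be served from the client's page cache if the file was read recently.

```c
#define _POSIX_C_SOURCE 200809L  /* clock_gettime under strict -std modes */

#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

/*
 * Hypothetical helper: read `path` front to back in `block_size`-byte
 * chunks and throw the data away. Returns the number of bytes read, or
 * -1 on error; wall-clock seconds are stored in *elapsed on success.
 */
static long long timed_sequential_read(const char *path, size_t block_size,
                                       double *elapsed)
{
    static char buf[1 << 20];        /* arbitrary upper bound on block_size */
    struct timespec t0, t1;
    long long total = 0;
    ssize_t n;
    int fd;

    assert(block_size > 0 && block_size <= sizeof(buf));

    fd = open(path, O_RDONLY);
    if (fd < 0) {
        perror(path);
        return -1;
    }

    clock_gettime(CLOCK_MONOTONIC, &t0);
    while ((n = read(fd, buf, block_size)) > 0)
        total += n;
    clock_gettime(CLOCK_MONOTONIC, &t1);
    close(fd);

    if (n < 0)
        return -1;
    *elapsed = (double)(t1.tv_sec - t0.tv_sec)
             + (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
    return total;
}

/* Throughput in MiB/s; returns 0 if no time elapsed. */
static double mb_per_sec(long long bytes, double secs)
{
    return secs > 0.0 ? (double)bytes / (1024.0 * 1024.0) / secs : 0.0;
}
```

Calling timed_sequential_read("/nfsmount/file", 32768, &secs) against each server and comparing mb_per_sec() gives the same comparison as the dd runs above.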

2006-02-24 15:25:42

by Oleg Nesterov

Subject: Re: NFS Still broken in 2.6.x?

Andrew Morton wrote:
>
> Trond Myklebust <[email protected]> wrote:
> >
> > On Thu, 2006-02-23 at 15:35 -0500, Bryan Fink wrote:
> > > Hi All. I'm running into a bit of trouble with NFS on 2.6. I see that
> > > at least Trond thought, mid-January, that "The readahead algorithm has
> > > been broken in 2.6.x for at least the past 6 months." (
> > > http://www.ussg.iu.edu/hypermail/linux/kernel/0601.2/0559.html) Anyone
> > > know if that has been fixed?
> >
> > No it hasn't been fixed. ...and no, this is not a problem that only
> > affects NFS: it just happens to give a more noticeable performance
> > impact due to the larger latency of NFS over a 100Mbps link.
>
> iirc, last time we went round this loop Ram and I were unable to reproduce it.
>
> Does anyone have a testcase?

Afaics, this problem was resolved a long time ago.

The patch below should fix this problem. Does it?

Andrew, I'll resend it with a proper changelog and comments on Sunday;
currently I can't even do a compile test. I verified this patch still
applies cleanly.

------------------------------------------------------------------------------
>From - Thu Aug 4 20:33:03 2005
Message-ID: <[email protected]>
Date: Thu, 04 Aug 2005 20:33:03 +0400
From: Oleg Nesterov <[email protected]>
X-Mailer: Mozilla 4.76 [en] (X11; U; Linux 2.2.20 i686)
X-Accept-Language: en
MIME-Version: 1.0
To: Ram Pai <[email protected]>,
Trond Myklebust <[email protected]>,
Linus Torvalds <[email protected]>,
Steven Pratt <[email protected]>, Andrew Morton <[email protected]>
Subject: Re: Readahead algorithm problems again...
References: <[email protected]> ...

Oleg Nesterov wrote:
>
> What do you think about this patch?

Ohh... Sorry, I attached the wrong one.

--- 2.6.13-rc4/mm/readahead.c~ Thu Apr 7 12:59:41 2005
+++ 2.6.13-rc4/mm/readahead.c Thu Aug 4 20:25:14 2005
@@ -57,8 +57,8 @@ static inline void ra_off(struct file_ra
         ra->start = 0;
         ra->flags = 0;
         ra->size = 0;
+        ra->ahead_size += ra->ahead_start;
         ra->ahead_start = 0;
-        ra->ahead_size = 0;
         return;
 }

@@ -423,8 +423,8 @@ static int make_ahead_window(struct addr
                  * congestion. The ahead window will any way be closed
                  * in case we failed due to excessive page cache hits.
                  */
+                ra->ahead_size += ra->ahead_start;
                 ra->ahead_start = 0;
-                ra->ahead_size = 0;
         }

         return ret;
@@ -507,7 +507,7 @@ page_cache_readahead(struct address_spac

         if (ra->ahead_start == 0) { /* no ahead window yet */
                 if (!make_ahead_window(mapping, filp, ra, 0))
-                        goto out;
+                        goto recheck;
         }
         /*
          * Already have an ahead window, check if we crossed into it.
@@ -520,6 +520,9 @@ page_cache_readahead(struct address_spac
                 ra->start = ra->ahead_start;
                 ra->size = ra->ahead_size;
                 make_ahead_window(mapping, filp, ra, 0);
+recheck:
+                ra->prev_page = min(ra->prev_page,
+                        ra->ahead_start + ra->ahead_size - 1);
         }

 out:

------------------------------------------------------------------------------

There is another one, from Steven Pratt:

------------------------------------------------------------------------------
>From - Sat Aug 13 11:49:43 2005
Message-ID: <[email protected]>
Date: Fri, 12 Aug 2005 14:12:01 -0500
From: Steven Pratt <[email protected]>
User-Agent: Mozilla Thunderbird 1.0.2 (X11/20050317)
X-Accept-Language: en-us, en
MIME-Version: 1.0
To: Andrew Morton <[email protected]>
Cc: Ram Pai <[email protected]>, [email protected],
[email protected], [email protected]
Subject: Re: Readahead algorithm problems again...
References: <[email protected]> ...
In-Reply-To: <[email protected]>

The current get_init_ra_size is not optimal across different IO
sizes and max_readahead values. Here is a quick summary of sizes
computed under current design and under the attached patch. All of
these assume 1st IO at offset 0, or 1st detected sequential IO.

32k max, 4k request
old     new
-----------------
8k      8k
16k     16k
32k     32k

128k max, 4k request
old     new
-----------------
32k     16k
64k     32k
128k    64k
128k    128k

128k max, 32k request
old     new
-----------------
32k     64k   <-----
64k     128k
128k    128k

512k max, 4k request
old     new
-----------------
4k      32k   <----
16k     64k
64k     128k
128k    256k
512k    512k

Steve

--- linux-2.6.12/mm/readahead.org.c 2005-08-01 08:52:12.000000000 -0500
+++ linux-2.6.12/mm/readahead.c 2005-08-10 10:16:52.000000000 -0500
@@ -72,10 +72,10 @@ static unsigned long get_init_ra_size(un
 {
         unsigned long newsize = roundup_pow_of_two(size);

-        if (newsize <= max / 64)
-                newsize = newsize * newsize;
+        if (newsize <= max / 32)
+                newsize = newsize * 4;
         else if (newsize <= max / 4)
-                newsize = max / 4;
+                newsize = newsize * 2;
         else
                 newsize = max;
         return newsize;

------------------------------------------------------------------------------

Oleg.
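
To make the effect of the get_init_ra_size change easier to see, the two branches of the diff above can be lifted into a small user-space program. This is a sketch under stated assumptions: sizes are taken to be in pages, pages are taken to be 4k, roundup_pow2 is a stand-in for the kernel's roundup_pow_of_two, and the names init_ra_old, init_ra_new and kb_to_pages are made up for the comparison.

```c
#include <assert.h>

/* Stand-in for the kernel's roundup_pow_of_two(): smallest power of
 * two that is >= n. */
static unsigned long roundup_pow2(unsigned long n)
{
    unsigned long p = 1;

    assert(n > 0);
    while (p < n)
        p <<= 1;
    return p;
}

/* Assumption: 4k pages, so 4 KB per page. */
static unsigned long kb_to_pages(unsigned long kb)
{
    return kb / 4;
}

/* The "-" side of the diff: initial window before the patch. */
static unsigned long init_ra_old(unsigned long size, unsigned long max)
{
    unsigned long newsize = roundup_pow2(size);

    if (newsize <= max / 64)
        newsize = newsize * newsize;
    else if (newsize <= max / 4)
        newsize = max / 4;
    else
        newsize = max;
    return newsize;
}

/* The "+" side of the diff: initial window with the patch applied. */
static unsigned long init_ra_new(unsigned long size, unsigned long max)
{
    unsigned long newsize = roundup_pow2(size);

    if (newsize <= max / 32)
        newsize = newsize * 4;
    else if (newsize <= max / 4)
        newsize = newsize * 2;
    else
        newsize = max;
    return newsize;
}
```

For a 128k max and a 32k request, init_ra_old(kb_to_pages(32), kb_to_pages(128)) gives 8 pages (32k) while init_ra_new gives 16 pages (64k), which is the first row of the "128k max, 32k request" table in Steve's summary.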

2006-02-24 16:15:06

by Oleg Nesterov

Subject: Re: NFS Still broken in 2.6.x?

Oleg Nesterov wrote:
>
> Afaics, this problem was resolved a long time ago.
>
> The patch below should fix this problem. Does it?

Forgot to mention, this patch was tested:

Steven Pratt wrote:
>
> This is the patch I think we should apply. Running tiobench with 4k
> request size, 4GB working set, 256 threads and a 2MB max_readahead (to
> help induce thrashing) on a 1GB 8way machine, throughput of sequential
> IO increased from 50MB/sec to 92MB/sec on a 5disk raid0 array. Tests
> with smaller max_readaheads and smaller thread counts were all within
> the noise range of the benchmark, which is to be expected.

Oleg.

2006-02-24 16:18:30

by Bryan Fink

Subject: Re: NFS Still broken in 2.6.x?

Andrew Morton wrote:

>Trond Myklebust <[email protected]> wrote:
>
>
>>On Thu, 2006-02-23 at 15:35 -0500, Bryan Fink wrote:
>> > Hi All. I'm running into a bit of trouble with NFS on 2.6. I see that
>> > at least Trond thought, mid-January, that "The readahead algorithm has
>> > been broken in 2.6.x for at least the past 6 months." (
>> > http://www.ussg.iu.edu/hypermail/linux/kernel/0601.2/0559.html) Anyone
>> > know if that has been fixed?
>>
>> No it hasn't been fixed. ...and no, this is not a problem that only
>> affects NFS: it just happens to give a more noticeable performance
>> impact due to the larger latency of NFS over a 100Mbps link.
>>
>>
>
>iirc, last time we went round this loop Ram and I were unable to reproduce it.
>
>Does anyone have a testcase?
>
>

Hi again. I just found some new, very interesting information. Until
just a few minutes ago, I hadn't realized that one could change the I/O
scheduler at runtime. Looking into it, my system was using "cfq", and I
have three other options, "noop", "anticipatory", and "deadline". I've
now run tests using all three of the other schedulers, and they all
bring performance back up to the level I had with kernel 2.4. So, either
NFS is incompatible with cfq, or cfq has some issues that show very
vividly when used with NFS (or, I suppose, I just have my system tuned
wrong for use with cfq).

Hope this helps the bug hunt. Special thanks to Asfand Yar Qazi for
writing to the list this morning asking how to change schedulers at
runtime
(http://www.ussg.iu.edu/hypermail/linux/kernel/0602.3/0135.html). Off to
find out exactly what the best scheduler is for my needs.

-Bryan
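
For reference, the runtime scheduler setting lives in sysfs: /sys/block/<dev>/queue/scheduler lists the available schedulers on one line with the active one in square brackets, and writing a scheduler name to that file switches to it. The sketch below only reads the setting; the function names are hypothetical and the device name passed in (for example "sda") is whatever disk backs the export.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Pull the active scheduler (the bracketed name) out of a line such as
 * "noop anticipatory deadline [cfq]". Returns 0 on success, -1 if no
 * bracketed name is found or it does not fit in `out`.
 */
static int active_scheduler(const char *line, char *out, size_t outlen)
{
    const char *lb, *rb;
    size_t len;

    assert(line != NULL && out != NULL && outlen > 0);

    lb = strchr(line, '[');
    rb = lb ? strchr(lb, ']') : NULL;
    if (lb == NULL || rb == NULL)
        return -1;

    len = (size_t)(rb - lb - 1);
    if (len == 0 || len >= outlen)
        return -1;

    memcpy(out, lb + 1, len);
    out[len] = '\0';
    return 0;
}

/* Read the active scheduler for block device `dev` from sysfs.
 * Returns 0 on success, -1 on any failure. */
static int read_active_scheduler(const char *dev, char *out, size_t outlen)
{
    char path[256], line[256];
    FILE *f;

    snprintf(path, sizeof(path), "/sys/block/%s/queue/scheduler", dev);
    f = fopen(path, "r");
    if (f == NULL)
        return -1;
    if (fgets(line, sizeof(line), f) == NULL) {
        fclose(f);
        return -1;
    }
    fclose(f);
    return active_scheduler(line, out, outlen);
}
```

Recording the active scheduler next to each timing run makes it easy to tell afterwards which results were taken under cfq and which under the others.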

2006-02-24 22:33:00

by Grant Coady

Subject: Re: NFS Still broken in 2.6.x?

On Fri, 24 Feb 2006 11:18:44 -0500, Bryan Fink <[email protected]> wrote:

>Hi again. I just found some new, very interesting information. Until
>just a few minutes ago, I hadn't realized that one could change the I/O
>scheduler at runtime. Looking into it, my system was using "cfq", and I
>have three other options, "noop", "anticipatory", and "deadline". I've
>now run tests using all three of the other schedulers, and they all
>bring performance back up to the level I had with kernel 2.4. So, either
>NFS is incompatible with cfq, or cfq has some issues that show very
>vividly when used with NFS (or, I suppose, I just have my system tuned
>wrong for use with cfq).

I've run NFS for ages -- all the Linux boxen here mount a shared export from
the localnet controller box to get source + patches.

I only have 'deadline' installed on 2.6 kernels -- haven't seen any problems
with NFS here (apart from back when I had data corruption due to a faulty
memory stick).

Grant.