2004-04-16 01:18:05

by Charles Shannon Hendrix

Subject: NFS and kernel 2.6.x




I'm having a hard time right now with NFS on kernel 2.6.

I tried to search archives but can't find much on my exact problem. If
I missed something good, a pointer would be great.

Anyway, the problem: NFS writes are broken in 2.6 on my machine.

I normally mount several volumes from a Sun SS5 running NetBSD.

It's worked great for years, and usually is not too bad on speed.

When I moved to Linux kernel 2.6.1, writes to the NetBSD server became
incredibly slow: throughput dropped from around 600K/sec to somewhere
between a few K/sec and maybe 25K/sec.

By contrast, rsync runs at around 900K/sec or faster, close to wire
speed (yes, raw speed, not compressed speed).

With kernels 2.6.3 and 2.6.5, it doesn't work at all. If I do something
like this:

% cp bigfile /public

It just hangs. After that, umounts and even reads of that volume hang.
The hung processes can sometimes be killed, but not always. Gnome's
Nautilus, for example, gets permanently hung, though that might be its
own issue.

Offhand, I cannot remember what NFS write performance was with Linux
kernel 2.4, but it was several hundred K/sec unless the server was
loaded.

Reading from the NFS server seems to still be fine. For example, just
now I copied a file from there at around 660K/sec using kernel 2.6.5
on the client.

Anyway, I would like to explore this further and solve the problem.

Details on my setup:

NFS server:

Sun SS5
10baseT ethernet (100baseT card available, not used)
NetBSD 1.6.1
pretty much a plain vanilla server setup

Network:

simple LAN with three machines, connected via a full duplex
multi-speed switch

NFS client:

vanilla PC
Intel Pro/100 ethernet
Slackware 9.1
Linux kernel 2.6.5, plain with no mods or patches, only enough
drivers and features enabled to run my workstation
configuration as close as I could get to my Linux 2.4
kernel

--
shannon "AT" widomaker.com -- ["All of us get lost in the darkness,
dreamers turn to look at the stars" -- Rush ]


2004-04-16 01:31:11

by Trond Myklebust

Subject: Re: NFS and kernel 2.6.x

On Thu, 15/04/2004 at 18:14, Charles Shannon Hendrix wrote:
>

> NFS server:
>
> Sun SS5
> 10baseT ethernet (100baseT card available, not used)
> NetBSD 1.6.1
> pretty much a plain vanilla server setup
>
> Network:
>
> simple LAN with three machines, connected via a full duplex
> multi-speed switch
>
> NFS client:
>
> vanilla PC
> Intel Pro/100 ethernet
> Slackware 9.1
> Linux kernel 2.6.5, plain with no mods or patches, only enough
> drivers and features enabled to run my workstation
> configuration as close as I could get to my Linux 2.4
> kernel

This is pretty much covered in the NFS FAQ entry B10.

You are experiencing the classical effects of using unreliable transport
(i.e. UDP) on a mixed speed network. Writes to the server are getting
lost, because it is on a slow segment that cannot keep up with the
faster 100Mbit clients.

Use the 'proto=tcp' mount option, and all will be well again.
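For reference, the client-side invocation might look like the following sketch (the hostname, export path, and mount point are placeholders, not taken from this thread; see nfs(5) for the option syntax):

```shell
# Mount the NFS export over TCP instead of UDP
# (hostname and paths below are examples only)
mount -t nfs -o proto=tcp sunserver:/export/public /public

# or as a persistent /etc/fstab entry:
# sunserver:/export/public  /public  nfs  proto=tcp  0  0
```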

Cheers,
Trond

2004-04-16 01:54:20

by Andrew Morton

Subject: Re: NFS and kernel 2.6.x

Trond Myklebust <[email protected]> wrote:
>
> On Thu, 15/04/2004 at 18:14, Charles Shannon Hendrix wrote:
> >
>
> > NFS server:
> >
> > Sun SS5
> > 10baseT ethernet (100baseT card available, not used)
> > NetBSD 1.6.1
> > pretty much a plain vanilla server setup
> >
> > Network:
> >
> > simple LAN with three machines, connected via a full duplex
> > multi-speed switch
> >
> > NFS client:
> >
> > vanilla PC
> > Intel Pro/100 ethernet
> > Slackware 9.1
> > Linux kernel 2.6.5, plain with no mods or patches, only enough
> > drivers and features enabled to run my workstation
> > configuration as close as I could get to my Linux 2.4
> > kernel
>
> This is pretty much covered in the NFS FAQ entry B10.
>
> You are experiencing the classical effects of using unreliable transport
> (i.e. UDP) on a mixed speed network. Writes to the server are getting
> lost, because it is on a slow segment that cannot keep up with the
> faster 100Mbit clients.

But Charles was seeing good performance with 2.4-based clients. When he
went to 2.6 everything fell apart.

Do we know why this regression occurred?

2004-04-16 02:54:22

by Trond Myklebust

Subject: Re: NFS and kernel 2.6.x

On Thu, 15/04/2004 at 18:53, Andrew Morton wrote:
> But Charles was seeing good performance with 2.4-based clients. When he
> went to 2.6 everything fell apart.
>
> Do we know why this regression occurred?

What regression??? You have a statistic of 1 person whose 3 clients
changed from what was an apparently working setup to what has *always*
been the usual scenario for most people that tried to use the same
broken hardware/software combination whether it be in 2.2.x, 2.4.x or
2.6.x.

The whole problem is that UDP provides unreliable transport... It offers
NO guarantees that the packet will arrive at the destination.
If only 1 fragment out of the 22 that it takes to send a single
wsize=32k write request to the Sun server gets lost on the way, the
Sun's networking layer will ignore that entire packet, and so the whole
write has to time out and get resent.
Switches can usually cache a few fragments if the clients on the 100Mbit
network are sending requests at a rate that almost matches the 10Mbit
bandwidth that the Sun server supports, but if the network is swamped so
that the switch runs out of cache, then it will start to drop packets.
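The loss amplification at work here can be sketched numerically (the 0.5% per-fragment loss rate is an invented figure for illustration; the 22-fragment count for a 32k write is from this message):

```python
# Probability that a whole multi-fragment UDP datagram arrives intact,
# given an independent per-fragment loss rate: losing any one fragment
# discards the entire datagram on the receiver.
def datagram_success_rate(fragments, frag_loss):
    return (1.0 - frag_loss) ** fragments

# With 0.5% per-fragment loss, a single-fragment request almost always
# gets through, but a 22-fragment 32k write is lost about 10% of the
# time -- and each loss costs a full RPC timeout and resend.
print(datagram_success_rate(1, 0.005))   # ~0.995
print(datagram_success_rate(22, 0.005))  # ~0.896
```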

This is the whole reason why Sun made TCP their default mount option
when they changed their servers to use 32k reads/writes.

My biggest suspect for why this particular setup changed in 2.6.x would
therefore be the changes to the way in which writes are scheduled on the
wire. We cache them for longer, and so overall the bandwidth usage goes
down, but at the expense of more "burstiness" when the user closes the
file or does some other fsync()-like operation.



So in fact you have 2 possible workarounds:

- Use the TCP mount option (by far the better option, since TCP *does*
provide reliable transport).
- Keep UDP, but use the wsize mount option to explicitly override the
server's choice of write sizes. That works by reducing the number of
fragments per write, and so improves performance by reducing the amount
of data that needs to be resent per lost fragment.
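As a concrete sketch of the second workaround (hostname and paths are invented; a 4k wsize needs roughly 3 IP fragments per write on a 1500-byte MTU instead of the ~22 needed for 32k):

```shell
# Stay on UDP, but shrink the write size so each RPC fits in few fragments
mount -t nfs -o proto=udp,wsize=4096,rsize=4096 sunserver:/export/public /public
```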


Cheers,
Trond

2004-04-16 04:59:29

by Phil Oester

Subject: Re: NFS and kernel 2.6.x

Actually I can concur -- I recently migrated 100+ servers from 2.4.x
to 2.6.3, and simply could not use UDP mounts and achieve acceptable
performance. Further, I wasn't using 32K r/w as you posit, but was
using 8K (against a NetApp FWIW).

If simply upgrading from 2.4.x to 2.6.x is going to make UDP mounts unusable,
perhaps this should be documented -- or the option should be deprecated.

Phil Oester


On Thu, Apr 15, 2004 at 07:54:08PM -0700, Trond Myklebust wrote:
> On Thu, 15/04/2004 at 18:53, Andrew Morton wrote:
> > But Charles was seeing good performance with 2.4-based clients. When he
> > went to 2.6 everything fell apart.
> >
> > Do we know why this regression occurred?
>
> What regression??? You have a statistic of 1 person whose 3 clients

2004-04-16 05:29:33

by Trond Myklebust

Subject: Re: NFS and kernel 2.6.x

On Thu, 15/04/2004 at 21:59, Phil Oester wrote:

> If simply upgrading from 2.4.x to 2.6.x is going to make UDP mounts unusable,
> perhaps this should be documented -- or the option should be deprecated.

Put simply: I am not interested in wasting _my_ time investigating cases
where UDP is performing badly if TCP is working fine. The variable
reliability issues with UDP are precisely why we worked to get the TCP
stuff working efficiently.

As for blanket statements like the above: I have seen no evidence yet
that they are any more warranted in 2.6.x than they were in 2.4.x. At
least not as long as I continue to see wire speed performance on reads
and writes on UDP on all my own test setups.

Cheers,
Trond

2004-04-16 07:13:59

by Paul Wagland

Subject: Re: NFS and kernel 2.6.x


On Apr 16, 2004, at 7:29, Trond Myklebust wrote:

> On Thu, 15/04/2004 at 21:59, Phil Oester wrote:
>
>> If simply upgrading from 2.4.x to 2.6.x is going to make UDP mounts
>> unusable, perhaps this should be documented -- or the option should
>> be deprecated.
>
> As for blanket statements like the above: I have seen no evidence yet
> that they are any more warranted in 2.6.x than they were in 2.4.x. At
> least not as long as I continue to see wire speed performance on reads
> and writes on UDP on all my own test setups.

Just as an aside, I can confirm this as well... we use UDP mounts, and
get a pretty constant 10MB/s (assuming people aren't running bloody
xscreensavers!)

Cheers,
Paul



2004-04-16 09:03:50

by Jamie Lokier

Subject: Re: NFS and kernel 2.6.x

Andrew Morton wrote:
> > On Thu, 15/04/2004 at 18:14, Charles Shannon Hendrix wrote:
> But Charles was seeing good performance with 2.4-based clients. When he
> went to 2.6 everything fell apart.

Perhaps because 2.6 changes the UDP retransmit model for NFS, to
estimate the round-trip time and thus retransmit faster than 2.4
would. Sometimes _much_ faster: I observed retransmits within a few
hundred microseconds.

On networks with a lot of latency variance, i.e. anything with big
queues, that would increase congestion. That'd increase losses, and
because NFS over UDP uses large fragmented IP frames (TCP doesn't),
fragment loss will greatly increase IP frame loss, as Trond explained.

That's my hypothesis.

There was also a problem with late 2.5 clients and "soft" NFS mounts.
Requests would timeout after a fixed number of retransmits, which on a
LAN could be after a few milliseconds due to round-trip estimation and
fast server response. Then when an I/O on the server took longer,
e.g. due to a disk seek or contention, the client would timeout and
abort requests. 2.4 doesn't have this problem with "soft" due to the
longer, fixed retransmit timeout. I don't know if it is fixed in
current 2.6 kernels - but you can avoid it by not using "soft" anyway.

-- Jamie

2004-04-16 15:25:21

by Marcelo Tosatti

Subject: Re: NFS and kernel 2.6.x

On Fri, Apr 16, 2004 at 11:44:33AM -0300, Marcelo Tosatti wrote:
> On Thu, Apr 15, 2004 at 10:29:06PM -0700, Trond Myklebust wrote:
> > On Thu, 15/04/2004 at 21:59, Phil Oester wrote:
> >
> > > If simply upgrading from 2.4.x to 2.6.x is going to make UDP mounts unusable,
> > > perhaps this should be documented -- or the option should be deprecated.
> >
> > Put simply: I am not interested in wasting _my_ time investigating cases
> > where UDP is performing badly if TCP is working fine. The variable
> > reliability issues with UDP are precisely why we worked to get the TCP
> > stuff working efficiently.
> >
> > As for blanket statements like the above: I have seen no evidence yet
> > that they are any more warranted in 2.6.x than they were in 2.4.x. At
> > least not as long as I continue to see wire speed performance on reads
> > and writes on UDP on all my own test setups.
>
> Maybe TCP should be the default then?

Or just make a big warning in the Kconfig. Distros will
set it to the default...

> In case no one finds the reason
> why NFS over UDP is slower on 2.6.x than 2.4.x. It seems there are
> quite a few reports confirming the slowdown. Maybe Jamie Lokier is right in
> theory?

2004-04-16 15:25:28

by Marcelo Tosatti

Subject: Re: NFS and kernel 2.6.x

On Thu, Apr 15, 2004 at 10:29:06PM -0700, Trond Myklebust wrote:
> On Thu, 15/04/2004 at 21:59, Phil Oester wrote:
>
> > If simply upgrading from 2.4.x to 2.6.x is going to make UDP mounts unusable,
> > perhaps this should be documented -- or the option should be deprecated.
>
> Put simply: I am not interested in wasting _my_ time investigating cases
> where UDP is performing badly if TCP is working fine. The variable
> reliability issues with UDP are precisely why we worked to get the TCP
> stuff working efficiently.
>
> As for blanket statements like the above: I have seen no evidence yet
> that they are any more warranted in 2.6.x than they were in 2.4.x. At
> least not as long as I continue to see wire speed performance on reads
> and writes on UDP on all my own test setups.

Maybe TCP should be the default then? In case no one finds the reason
why NFS over UDP is slower on 2.6.x than 2.4.x. It seems there are
quite a few reports confirming the slowdown. Maybe Jamie Lokier is right in
theory?

2004-04-16 15:50:08

by Trond Myklebust

Subject: Re: NFS and kernel 2.6.x

On Fri, 2004-04-16 at 07:44, Marcelo Tosatti wrote:
> Maybe TCP should be the default then? In case no one finds the reason
> why NFS over UDP is slower on 2.6.x than 2.4.x. It seems there are
> quite a few reports confirming the slowdown. Maybe Jamie Lokier is right in
> theory?

Are you talking about the TCP server configuration option here, or the
TCP mount option? IMO both should be default.

I've got a patch for the "mount" program, which I've been intending to
send on to Andries (I've just been too busy for the past few weeks to
give it a last review).

Cheers,
Trond

2004-04-16 15:55:19

by Trond Myklebust

Subject: Re: NFS and kernel 2.6.x

On Fri, 2004-04-16 at 02:03, Jamie Lokier wrote:

> Perhaps because 2.6 changes the UDP retransmit model for NFS, to
> estimate the round-trip time and thus retransmit faster than 2.4
> would. Sometimes _much_ faster: I observed retransmits within a few
> hundred microseconds.

Retransmits within a few hundred microseconds should no longer be occurring.
Have you redone those measurements with a more recent kernel?
2.6.x and 2.4.x should have pretty much the same code for RTO
estimation.

In fact pretty much all the 2.4.x and 2.6.x RPC code is shared. The one
difference is that 2.6.x uses zero copy writes.


> There was also a problem with late 2.5 clients and "soft" NFS mounts.
> Requests would timeout after a fixed number of retransmits, which on a
> LAN could be after a few milliseconds due to round-trip estimation and
> fast server response. Then when an I/O on the server took longer,
> e.g. due to a disk seek or contention, the client would timeout and
> abort requests. 2.4 doesn't have this problem with "soft" due to the
> longer, fixed retransmit timeout. I don't know if it is fixed in
> current 2.6 kernels - but you can avoid it by not using "soft" anyway.

Or changing the default value of "retrans" to something more sane. As
usual, Linux has a default that is lower than on any other platform.

Cheers,
Trond

2004-04-16 15:57:13

by Dave Gilbert (Home)

Subject: Re: NFS and kernel 2.6.x

Marcelo Tosatti wrote:

> Maybe TCP should be the default then? In case no one finds the reason
> why NFS over UDP is slower on 2.6.x than 2.4.x. It seems there are
> quite a few reports confirming the slowdown. Maybe Jamie Lokier is right in
> theory?

While it is reasonable to make TCP the default, it is important that any
real problem with UDP NFS gets sorted out. Some of us have to work with
older machines and kernels on clients that don't support TCP NFS.

Dave

2004-04-16 16:13:46

by Trond Myklebust

Subject: Re: NFS and kernel 2.6.x

On Fri, 2004-04-16 at 08:55, Dave Gilbert (Home) wrote:
> While it is reasonable to make TCP the default, it is important that any
> real problem with UDP NFS gets sorted out. Some of us have to work with
> older machines and kernels on clients that don't support TCP NFS.

Then "some of you" can send in a proper bugreport in the usual format if
and when that problem actually occurs.

So far I have NOTHING to tell me there is a problem here. Just a load of
people going ballistic over hot air....


2004-04-16 18:48:32

by Jamie Lokier

Subject: Re: NFS and kernel 2.6.x

Trond Myklebust wrote:
> > Perhaps because 2.6 changes the UDP retransmit model for NFS, to
> > estimate the round-trip time and thus retransmit faster than 2.4
> > would. Sometimes _much_ faster: I observed retransmits within a few
> > hundred microseconds.
>
> Retransmits within a few hundred microseconds should no longer be occurring.
> Have you redone those measurements with a more recent kernel?

No, not since I sent you the packet trace from a 2.5 kernel that
wasn't working with "soft". I took your advice and stopped using
"soft". It causes the obvious problem when I (rarely) turn off the
server, otherwise it's been fine and I'm using 2.6.5 now, still fine
(with "soft" not being used).

> 2.6.x and 2.4.x should have pretty much the same code for RTO
> estimation.
>
> In fact pretty much all the 2.4.x and 2.6.x RPC code is shared. The one
> difference is that 2.6.x uses zero copy writes.
>
> > There was also a problem with late 2.5 clients and "soft" NFS mounts.
> > Requests would timeout after a fixed number of retransmits, which on a
> > LAN could be after a few milliseconds due to round-trip estimation and
> > fast server response. Then when an I/O on the server took longer,
> > e.g. due to a disk seek or contention, the client would timeout and
> > abort requests. 2.4 doesn't have this problem with "soft" due to the
> > longer, fixed retransmit timeout. I don't know if it is fixed in
> > current 2.6 kernels - but you can avoid it by not using "soft" anyway.
>
> Or changing the default value of "retrans" to something more sane. As
> usual, Linux has a default that is lower than on any other platform.

If few-hundred-microsecond retransmits no longer occur, perhaps it's no
longer relevant.

The problem I saw with "soft" was that the retransmit time was quite a
good estimate of the server response time. That part was fine, nice
even. But then the server response latency would increase by a factor
of 10000 (ten thousand) due to normal disk I/O activity (compare cache
response with disk response on a busy disk), and of course 3
retransmits doubling each time is not adequate to cover that. 2.4 was
fine because the default rtt and retrans together could never get
shorter than a few seconds.

That's why I felt that if the rtt was adapting to the server response
time, then a fixed number of retransmits was no longer appropriate: a
lower bound on the time before timing out is appropriate, e.g. 3
seconds or 10 seconds or whatever.

In other words, with adaptive rtt the concept of "retrans" being a
fixed number is fundamentally flawed -- unless it's also accompanied
by a minimum timeout time. You'd need a retrans value of 20 or so for
the above perfectly normal LAN situation, but then that's far too
large on other occasions with other networks or servers.
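The mismatch can be made concrete with a toy calculation (all numbers below are invented for illustration; this is not the kernel's actual RTO code):

```python
# Total time a "soft" mount keeps retrying before aborting the request:
# the initial transmission plus `retrans` retransmits, with the timeout
# doubling after each one.
def time_until_soft_abort(initial_rto, retrans):
    return sum(initial_rto * (2 ** i) for i in range(retrans + 1))

# An adaptive RTO tuned to ~1 ms cache-hit responses aborts after only
# 15 ms with retrans=3 -- hopeless against a busy-disk response that is
# 10000x slower (~10 s).
fast_lan_abort = time_until_soft_abort(0.001, 3)   # about 0.015 s
# A 2.4-style fixed timeout on the order of a second never gets that low:
fixed_abort = time_until_soft_abort(0.7, 3)        # about 10.5 s
```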

-- Jamie

2004-04-16 19:06:53

by Trond Myklebust

Subject: Re: NFS and kernel 2.6.x

On Fri, 2004-04-16 at 11:48, Jamie Lokier wrote:

> In other words, with adaptive rtt the concept of "retrans" being a
> fixed number is fundamentally flawed -- unless it's also accompanied
> by a minimum timeout time. You'd need a retrans value of 20 or so for
> the above perfectly normal LAN situation, but then that's far too
> large on other occasions with other networks or servers.

At that point, it makes sense to drop the entire "retrans+timeo"
paradigm, and just state that soft timeouts take a single parameter
("timeo") that determines the timeout value.

That's something that is dead easy to do...

Cheers,
Trond

2004-04-16 19:18:11

by Charles Shannon Hendrix

Subject: Re: NFS and kernel 2.6.x

Fri, 16 Apr 2004 @ 09:13 -0700, Trond Myklebust said:

> Then "some of you" can send in a proper bugreport in the usual format if
> and when that problem actually occurs.
>
> So far I have NOTHING to tell me there is a problem here. Just a load of
> people going ballistic over hot air....

Several people are reporting a problem and discussing it, but I don't
see any of them going ballistic.



--
shannon "AT" widomaker.com -- ["Secrecy is the beginning of tyranny." --
Unknown]

2004-04-16 19:39:23

by Jamie Lokier

Subject: Re: NFS and kernel 2.6.x

Trond Myklebust wrote:
> > In other words, with adaptive rtt the concept of "retrans" being a
> > fixed number is fundamentally flawed -- unless it's also accompanied
> > by a minimum timeout time. You'd need a retrans value of 20 or so for
> > the above perfectly normal LAN situation, but then that's far too
> > large on other occasions with other networks or servers.
>
> At that point, it makes sense to drop the entire "retrans+timeo"
> paradigm, and just state that soft timeouts take a single parameter
> ("timeo") that determines the timeout value.

I agree. 30 seconds seems like a good default.

> That's something that is dead easy to do...

I'll test a patch for 2.6.5 if you provide one.

-- Jamie

2004-04-16 20:32:17

by Andi Kleen

Subject: Re: NFS and kernel 2.6.x

Marcelo Tosatti <[email protected]> writes:
>
> Maybe TCP should be the default then? In case no one finds the reason
> why NFS over UDP is slower on 2.6.x than 2.4.x. It seems there are
> quite a few reports confirming the slowdown. Maybe Jamie Lokier is right in
> theory?

The problem is that older Linux knfsd versions (early 2.4) tend to crash
or hang after some time when they have to talk TCP. But I guess it would
still be a better default ...

-Andi

2004-04-16 22:55:26

by Daniel Egger

Subject: Re: NFS and kernel 2.6.x

On 16.04.2004, at 18:13, Trond Myklebust wrote:

> Then "some of you" can send in a proper bugreport in the usual format
> if and when that problem actually occurs.
>
> So far I have NOTHING to tell me there is a problem here. Just a load
> of people going ballistic over hot air....

Great you want to help here. So I've a system which is NFS root using a
3c940 gigabit onboard NIC on kernel 2.6.5 and which is dead fish in the
water somewhere in between 10 seconds and 5 minutes after boot using
NFS over UDP. The last thing I see are 3 or 4 messages of the type:

server 192.168.11.2 not responding, still trying

NFS seems to work better with 2.6.4, which unfortunately has other nasty
bugs for me; currently I'm running 2.4.26, which works fine over both
UDP and TCP.

Preempt is off as are the NFS features which I do not trust yet (v4 and
direct IO). Attached is the config for your viewing pleasure.

Please tell me how I can help here and I'll certainly do it.

Servus,
Daniel


Attachments:
config.gz (7.95 kB)

2004-04-17 04:57:28

by Chris Friesen

Subject: Re: NFS and kernel 2.6.x

Daniel Egger wrote:

> Great you want to help here. So I've a system which is NFS root using a
> 3c940 gigabit onboard NIC on kernel 2.6.5 and which is dead fish in the
> water somewhere in between 10 seconds and 5 minutes after boot using
> NFS over UDP. The last thing I see are 3 or 4 messages of the type:

If this is an issue, it might make sense to have root be a tmpfs
filesystem, and then have specific network mounts. Note--don't make
"/var/log" network mounted, various apps default to trying to check for
files there--if the server goes away, you can't log in/out.

Chris

2004-04-17 05:24:24

by Trond Myklebust

Subject: Re: NFS and kernel 2.6.x

On Fri, 2004-04-16 at 12:07, Daniel Egger wrote:

> Great you want to help here. So I've a system which is NFS root using a
> 3c940 gigabit onboard NIC on kernel 2.6.5 and which is dead fish in the
> water somewhere in between 10 seconds and 5 minutes after boot using
> NFS over UDP. The last thing I see are 3 or 4 messages of the type:

...and if you use TCP?

> server 192.168.11.2 not responding, still trying

The other thing I'd need is a tcpdump. Something like "tcpdump -s 9000
-w dump.out"...

Cheers,
Trond

2004-04-17 05:28:47

by Trond Myklebust

Subject: Re: NFS and kernel 2.6.x

On Fri, 2004-04-16 at 17:03, Charles Shannon Hendrix wrote:
> >
> > 2.6.x can cache a lot more data, and will tend to write it out in a more
> > lazy fashion (i.e. only when the user requests it). That means the
> > writes tend to occur in a more bursty fashion.
>
> That makes sense.
>
> Was there a specific reason for making NFS traffic bursty, or did it
> just work out that way?

It's an inevitable side-effect of the increased caching. If you are
constantly writing out data, then you spread out the load a lot more
than if you wait until the user actually requests a flush.
On the other hand, it means that if your application reads/writes several
times over the same page, then you only write it out once.

Cheers,
Trond

2004-04-17 10:32:36

by Daniel Egger

Subject: Re: NFS and kernel 2.6.x


On 17.04.2004, at 06:56, Chris Friesen wrote:

>> Great you want to help here. So I've a system which is NFS root using a
>> 3c940 gigabit onboard NIC on kernel 2.6.5 and which is dead fish in the
>> water somewhere in between 10 seconds and 5 minutes after boot using
>> NFS over UDP. The last thing I see are 3 or 4 messages of the type:
>
> If this is an issue, it might make sense to have root be a tmpfs
> filesystem, and then have specific network mounts.

I'm trying to keep this a standard Debian system as much as possible.
Also I've several machines having a large number of shared partitions,
some of them fulfill different purposes, so I would need to customize
several instances which sounds like much work to me; part of it
certainly unnecessary because it works just fine with older kernels...
:)

Also there is the issue that the only thing that is sort of guaranteed
to be transported over the network is the kernel itself. Sometimes it
hangs already when or just after loading init. I'm not convinced it will
always be able to transfer the whole ramdisk....

Forgot to mention: I've also seen segfaults and wrong file contents
in random places while init executes the scripts in /etc/rc*.d but
those seem to have gone away after I used a more conservative set
of kernel config options. Now it'll only hang.

> Note--don't make "/var/log" network mounted, various apps default to
> trying to check for files there--if the server goes away, you can't
> log in/out.

There's unfortunately more to this. I also cannot log in if
any of the files (bash, bashrc, profiles, libraries, etc.)
needed for login are on NFS. The question here is which is more
reliable in terms of data transfer after an Oops: NFS or
syslogd (UDP). So far I'm satisfied with NFS here, so I don't
see a good reason to change.

Servus,
Daniel



2004-04-17 14:17:41

by Daniel Egger

Subject: Re: NFS and kernel 2.6.x

On Sat, 2004-04-17 at 07:24, Trond Myklebust wrote:

> > Great you want to help here. So I've a system which is NFS root using a
> > 3c940 gigabit onboard NIC on kernel 2.6.5 and which is dead fish in the
> > water somewhere in between 10 seconds and 5 minutes after boot using
> > NFS over UDP. The last thing I see are 3 or 4 messages of the type:

> ...and if you use TCP?

My bad, I got confused; with TCP I get the hangs, with UDP the data
corruption. Unfortunately it doesn't want to hang for me right now.
:( ...

> > server 192.168.11.2 not responding, still trying

> The other thing I'd need is a tcpdump. Something like "tcpdump -s 9000
> -w dump.out"...

but I have two different tasty cases of data corruption using NFS over
UDP traced for you, which I'll send you in private. The first one
corrupts init so that it segfaults; the second one probably crashes the
rc starter so that I'm left with an unusable getty login on the console.

I'll try to get the TCP problems traced as well but right now I don't
have the time to wait....

--
Servus,
Daniel



2004-04-17 16:48:02

by Matthias Urlichs

Subject: Re: NFS and kernel 2.6.x

Hi, Trond Myklebust wrote:

> As for blanket statements like the above: I have seen no evidence yet
> that they are any more warranted in 2.6.x than they were in 2.4.x.

Oh, I saw the problem too: a slow client couldn't do full-size reads from
a fast server because the buffer on the client's network card was just 8k.

Granted that the client is a slow m68k Mac, but 2.4 was fast enough to get
the first packet entirely off the card before the last one overruns the
buffer -- while 2.6 has a bit more latency, so it can't.

Apparently that bit of increased latency is offset by the fact that the
machine still limps along if I packet-bomb it. Under 2.4 it locked solid,
so overall I think that the 2.6 situation is an improvement.

--
Matthias Urlichs

2004-04-17 18:09:19

by Charles Shannon Hendrix

Subject: Re: NFS and kernel 2.6.x

Fri, 16 Apr 2004 @ 22:28 -0700, Trond Myklebust said:

> It's an inevitable side-effect of the increased caching.

OK. That answers my question of: was making NFS bursty done on purpose.
Answer: no.

> If you are constantly writing out data, then you spread out the load
> a lot more than if you wait until the user actually requests a flush.
> On the other hand, it means that if your application reads/writes
> several times over the same page, then you only write it out once.

Usually, eliminating redundant writes in your application is a better
optimization than relying on the OS to do it for you.

I find bursty I/O is less desirable in most cases.


--
shannon "AT" widomaker.com -- ["The trade of governing has always been
monopolized by the most ignorant and the most rascally individuals of
mankind. -- Thomas Paine"]

2004-04-17 18:15:49

by Trond Myklebust

Subject: Re: NFS and kernel 2.6.x

On Sat, 2004-04-17 at 09:44, Matthias Urlichs wrote:
> Hi, Trond Myklebust wrote:
>
> > As for blanket statements like the above: I have seen no evidence yet
> > that they are any more warranted in 2.6.x than they were in 2.4.x.
>
> Oh, I saw the problem too: a slow client couldn't do full-size reads from
> a fast server because the buffer on the client's network card was just 8k.

Right, and this has always been a problem. I had the same issues when
doing 8k reads on one of my 75MHz Pentiums some 10 years ago. The thing
would more or less lock up and just pump out a constant stream of "time
exceeded" ICMP messages.

The NFS/RPC layer knows nothing about the existence of network cards or
their buffer sizes. Only about sockets and how to read from/write to
them.
This sort of issue is precisely why I'd prefer to see people use TCP by
default. UDP with its dependency on fragmentation works fine on fast
setups with homogeneous lossless networks. It sucks as soon as you break
one of those conditions.

Cheers,
Trond

2004-04-17 18:32:23

by Marc Singer

Subject: Re: NFS and kernel 2.6.x

On Sat, Apr 17, 2004 at 11:15:47AM -0700, Trond Myklebust wrote:
> This sort of issue is precisely why I'd prefer to see people use TCP by
> default. UDP with its dependency on fragmentation works fine on fast
> setups with homogeneous lossless networks. It sucks as soon as you break
> one of those conditions.

I'd be glad to compare TCP to UDP on my system. It's using an nfsroot
mount. It looks like the support is there. What activates it?

2004-04-17 18:58:35

by Trond Myklebust

Subject: Re: NFS and kernel 2.6.x

On Sat, 2004-04-17 at 11:32, Marc Singer wrote:
> On Sat, Apr 17, 2004 at 11:15:47AM -0700, Trond Myklebust wrote:
> > This sort of issue is precisely why I'd prefer to see people use TCP by
> > default. UDP with its dependency on fragmentation works fine on fast
> > setups with homogeneous lossless networks. It sucks as soon as you break
> > one of those conditions.
>
> I'd be glad to compare TCP to UDP on my system. It's using an nfsroot
> mount. It looks like the support is there. What activates it?

It's all there. Just use the "tcp" mount option.
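
For an ordinary (non-root) NFS mount, that amounts to something like the
following fstab entry. This is only a sketch: the server name, export path,
and mount point are placeholders, and exact option spellings depend on the
mount/nfs-utils version in use.

```text
# /etc/fstab -- example entry forcing NFSv3 over TCP (placeholder paths)
server:/export   /mnt/nfs   nfs   tcp,nfsvers=3   0   0
```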

Cheers,
Trond

2004-04-17 18:56:47

by Trond Myklebust

[permalink] [raw]
Subject: Re: NFS and kernel 2.6.x

On Sat, 2004-04-17 at 10:55, Charles Shannon Hendrix wrote:
> Usually, eliminating redundant writes in your application is a better
> optimization than relying on the OS to do it for you.

Fine. As long as you can convince all the other people sharing the same
page cache to do so too. We're not talking about single applications
here...

Cheers,
Trond

2004-04-17 19:01:12

by Marc Singer

[permalink] [raw]
Subject: Re: NFS and kernel 2.6.x

On Sat, Apr 17, 2004 at 11:58:33AM -0700, Trond Myklebust wrote:
> > I'd be glad to compare TCP to UDP on my system. It's using an nfsroot
> > mount. It looks like the support is there. What activates it?
>
> It's all there. Just use the "tcp" mount option.

I think you are talking about the fstab mount option. Is there a
kernel command line option for this? That's what I've been looking
for. I'm not using an initrd.

Cheers.

2004-04-17 19:09:23

by Trond Myklebust

[permalink] [raw]
Subject: Re: NFS and kernel 2.6.x

On Sat, 2004-04-17 at 12:01, Marc Singer wrote:

> I think you are talking about the fstab mount option. Is there a
> kernel command line option for this? That's what I've been looking
> for. I'm not using an initrd.

No. I'm talking about the built-in parser to enable NFSROOT to pass
mount options. As in:

nfsroot=[<server-ip>:]<root-dir>[,<nfs-options>]

See Documentation/nfsroot.txt. Put "tcp" as one of the "<nfs-options>",
and your root partition will use TCP instead of UDP.
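
As a concrete (hypothetical) example, a full command line for a TCP NFSv3
root might look like this; the server address and export path are
placeholders:

```text
root=/dev/nfs ip=dhcp nfsroot=192.168.1.1:/export/rootfs,tcp,v3
```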

Cheers,
Trond

2004-04-17 19:19:21

by Russell King

[permalink] [raw]
Subject: Re: NFS and kernel 2.6.x

On Sat, Apr 17, 2004 at 12:09:24PM -0700, Trond Myklebust wrote:
> On Sat, 2004-04-17 at 12:01, Marc Singer wrote:
>
> > I think you are talking about the fstab mount option. Is there a
> > kernel command line option for this? That's what I've been looking
> > for. I'm not using an initrd.
>
> No. I'm talking about the built-in parser to enable NFSROOT to pass
> mount options. As in:
>
> nfsroot=[<server-ip>:]<root-dir>[,<nfs-options>]
>
> See Documentation/nfsroot.txt. Put "tcp" as one of the "<nfs-options>",
> and your root partition will use TCP instead of UDP.

Trond,

Can you explain how this works?

static int __init root_nfs_parse(char *name, char *buf)
{
...
while ((p = strsep (&name, ",")) != NULL) {
int token;
if (!*p)
continue;
token = match_token(p, tokens, args);

/* %u tokens only */
if (match_int(&args[0], &option))
return 0;

Firstly, as far as I can see, args[] is uninitialised. If match_token
doesn't touch args[] then we pass match_int some uninitialised kernel
memory.

Secondly, we seem to exit if match_int doesn't parse a number. Not
all options in "tokens" have a number associated with them, including
ones like "tcp".

So, given that "tcp" is the only option, I think we'll end up passing
match_int() some uninitialised memory which may cause a kernel oops.
If not, it probably won't be a valid number, so we'll ignore the option.

However, it will appear to work as long as the first option has a
number associated with it (i.e., is one of the first 9 options).

--
Russell King
Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/
maintainer of: 2.6 PCMCIA - http://pcmcia.arm.linux.org.uk/
2.6 Serial core

2004-04-17 19:30:49

by Daniel Egger

[permalink] [raw]
Subject: Re: NFS and kernel 2.6.x

On 17.04.2004, at 20:32, Marc Singer wrote:

> I'd be glad to compare TCP to UDP on my system. It's using an nfsroot
> mount. It looks like the support is there. What activates it?

You need to add at least tcp as parameter to the nfsroot boot option,
like nfsroot=1.1.1.1:/tftpboot/foo,tcp,v3 .

And, of course, if you mount/remount NFS partitions you also need to
specify the tcp parameter in your fstab.

Servus,
Daniel



2004-04-17 20:22:27

by Marc Singer

[permalink] [raw]
Subject: Re: NFS and kernel 2.6.x

On Sat, Apr 17, 2004 at 09:01:38PM +0200, Daniel Egger wrote:
> On 17.04.2004, at 20:32, Marc Singer wrote:
>
> >I'd be glad to compare TCP to UDP on my system. It's using an nfsroot
> >mount. It looks like the support is there. What activates it?
>
> You need to add at least tcp as parameter to the nfsroot boot option,
> like nfsroot=1.1.1.1:/tftpboot/foo,tcp,v3 .

What I'd like to do is use a command line like this

root=/dev/nfs ip=rarp nfsroot=,tcp,v3

But, it doesn't work. I'd like to let the kernel autoconfiguration
handle the addressing.

> And, of course, if you mount/remount NFS partitions you also need to
> specify the tcp parameter in your fstab.
>
> Servus,
> Daniel


2004-04-17 22:23:01

by Marc Singer

[permalink] [raw]
Subject: Re: NFS and kernel 2.6.x

On Sat, Apr 17, 2004 at 11:58:33AM -0700, Trond Myklebust wrote:
> > I'd be glad to compare TCP to UDP on my system. It's using an nfsroot
> > mount. It looks like the support is there. What activates it?
>
> It's all there. Just use the "tcp" mount option.
>

I have a data point for comparison.

I'm copying a 40MiB file over NFS. In five trials, the mean transfer
times are

UDP (v2): 48.5s
TCP (v3): 52.7s

2004-04-17 22:33:05

by Trond Myklebust

[permalink] [raw]
Subject: Re: NFS and kernel 2.6.x

On Fri, 2004-04-16 at 12:39, Jamie Lokier wrote:

> > That's something that is dead easy to do...
>
> I'll test a patch for 2.6.5 if you provide one.

Here you go...

With this patch
- the major timeout is of fixed length "timeo<<retrans", and the
clock starts at the first attempt to send the packet.
- If a major timeout occurs, we now reset the RTT estimator so
as to "slow start" when the server becomes available again.

For the moment it does use the timeo + retrans values, because the
former is in fact wanted in order to initialize the RTT estimator.
However, it no longer uses the count of the number of actual
retransmissions in order to determine whether or not a major timeout
occurred.

Cheers,
Trond



Attachments:
linux-2.6.6-01-soft.dif (9.09 kB)

2004-04-18 00:57:49

by Trond Myklebust

[permalink] [raw]
Subject: Re: NFS and kernel 2.6.x

On Sat, 2004-04-17 at 15:22, Marc Singer wrote:
> I have a data point for comparison.
>
> I'm copying a 40MiB file over NFS. In five trials, the mean transfer
> times are
>
> UDP (v2): 48.5s
> TCP (v3): 52.7s

Against what kind of server on what kind of network, with what kind of
mount options?
The above would be quite reasonable performance on a 10Mbit network
against a filer or a Linux server with the (insecure) "async" option
set.

Cheers,
Trond

2004-04-18 02:52:11

by Trond Myklebust

[permalink] [raw]
Subject: Re: NFS and kernel 2.6.x

On Sat, 2004-04-17 at 12:19, Russell King wrote:

> Firstly, as far as I can see, args[] is uninitialised. If match_token
> doesn't touch args[] then we pass match_int some uninitialised kernel
> memory.
>
> Secondly, we seem to exit if match_int doesn't parse a number. Not
> all options in "tokens" have a number associated with them, including
> ones like "tcp".

Agreed. The correct fix should be something like the appended patch. It
depends on all tokens that do take an integer argument being listed
first in the enum.

Comments?

Cheers,
Trond
nfsroot.c | 17 +++++++++++++----
1 files changed, 13 insertions(+), 4 deletions(-)

--- linux-2.6.6-up/fs/nfs/nfsroot.c.orig 2004-04-17 11:05:10.000000000 -0700
+++ linux-2.6.6-up/fs/nfs/nfsroot.c 2004-04-17 18:47:05.000000000 -0700
@@ -117,11 +117,16 @@ static int mount_port __initdata = 0; /
***************************************************************************/

enum {
+ /* Options that take integer arguments */
Opt_port, Opt_rsize, Opt_wsize, Opt_timeo, Opt_retrans, Opt_acregmin,
- Opt_acregmax, Opt_acdirmin, Opt_acdirmax, Opt_soft, Opt_hard, Opt_intr,
+ Opt_acregmax, Opt_acdirmin, Opt_acdirmax,
+ /* Options that take no arguments */
+ Opt_soft, Opt_hard, Opt_intr,
Opt_nointr, Opt_posix, Opt_noposix, Opt_cto, Opt_nocto, Opt_ac,
Opt_noac, Opt_lock, Opt_nolock, Opt_v2, Opt_v3, Opt_udp, Opt_tcp,
- Opt_broken_suid, Opt_err,
+ Opt_broken_suid,
+ /* Error token */
+ Opt_err
};

static match_table_t tokens = {
@@ -146,9 +151,13 @@ static match_table_t tokens = {
{Opt_noac, "noac"},
{Opt_lock, "lock"},
{Opt_nolock, "nolock"},
+ {Opt_v2, "nfsvers=2"},
{Opt_v2, "v2"},
+ {Opt_v3, "nfsvers=3"},
{Opt_v3, "v3"},
+ {Opt_udp, "proto=udp"},
{Opt_udp, "udp"},
+ {Opt_tcp, "proto=tcp"},
{Opt_tcp, "tcp"},
{Opt_broken_suid, "broken_suid"},
{Opt_err, NULL}
@@ -179,8 +188,8 @@ static int __init root_nfs_parse(char *n
continue;
token = match_token(p, tokens, args);

- /* %u tokens only */
- if (match_int(&args[0], &option))
+ /* %u tokens only. Beware if you add new tokens! */
+ if (token < Opt_soft && match_int(&args[0], &option))
return 0;
switch (token) {
case Opt_port:

2004-04-18 03:26:46

by Jamie Lokier

[permalink] [raw]
Subject: Re: NFS and kernel 2.6.x

Trond Myklebust wrote:
> With this patch
> - the major timeout is of fixed length "timeo<<retrans", and the
> clock starts at the first attempt to send the packet.
> - If a major timeout occurs, we now reset the RTT estimator so
> as to "slow start" when the server becomes available again.
>
> For the moment it does use the timeo + retrans values, because the
> former is in fact wanted in order to initialize the RTT estimator.
> However, it no longer uses the count of the number of actual
> retransmissions in order to determine whether or not a major timeout
> occurred.

Ok, observations:

- The RTT converges to 0.1s on my LAN, just as it did before the patch.
Very sensible, and as you said the 100 microsecond problem is not
with us these days.

- The RTT is reset after a timeout (from 0.1-0.15s to 0.7s in my tests).
As expected.

- With the defaults (retrans=3, timeo=0.7s), I see:

After disconnecting the server, the client first times out after
about 5.5-6 seconds. First minor timeout is 0.1. This makes sense
as 0.7 << 3 == 5.6.

Subsequent timeouts take about 10.5 seconds. This also makes sense,
as you have set the timeout threshold at 0.7*8 == 5.6 seconds,
and three timeouts is 0.7*(1+2+4) == 4.9 seconds, too short.
Four timeouts is 0.7*(1+2+4+8) == 10.5 seconds.

The old behaviour before RTT estimation would have timed out
after 10.5 seconds, I think.

- With retrans=5, and timeo still has the default value of 0.7s:

After disconnecting the server, the minor timeout intervals are
approximately:

0.1, 0.2, 0.4, 0.8, 1.6, 3.2, 3.2, 3.2, 3.2, 3.2, 3.2 seconds.

Are they intended to stop doubling at 3.2? The major timeout
thus happens after 22.3 seconds.

Unsurprisingly, subsequent major timeouts take 44.1 seconds.

So this patch is a big improvement, and I'm going to keep using it for my
home directory with retrans=5,soft so it gets some more background testing.
(retrans=3 is too short even with the patch.)

However, there are potential improvements. One is that the 3.2 above
should continue doubling. The other is that behaviour would be nicer
if the major timeout time was more predictable: 22.3 to 44.1 seconds
is a big range. This is easy with the algorithm described below.

It isn't possible to remove the variation completely. However, it can
easily be reduced by changing the doubling strategy: keep doubling the
retransmit time until it exceeds timeo. When that happens, set the
retransmit time to the next greater-or-equal value of
timeo << N for some integer N.

For example, with RTT at 0.1s, retrans=5, timeo=0.7, these would be
the minor timeout intervals:

0.1, 0.2, 0.4, 0.7, 1.4, 2.8, 5.6, 11.2, 22.4

leading to a total major timeout time of 44.8 seconds.

Subsequent major timeouts, with the RTT reset to 0.7s, would take 44.1
seconds: 0.7, 1.4, 2.8, 5.6, 11.2, 22.4.

If the RTT estimator is larger than timeo to start with, the first
retransmit will timeout after RTT, but subsequent ones will be a value
of timeo << N. E.g. if RTT was 2s, this would be the minor timeout
sequence: 2.0, 2.8, 5.6, 11.2, 22.4.

The algorithm for deciding when a major timeout occurs is different
too. Instead of keeping track of the total time since the very first
transmission, you simply deem the major timeout to occur after the
minor timeout of timeo << retrans occurs. I.e. in these examples, the
22.4s minor timeout is always the final one.

This reduces the possible variation, with these parameters, to the
range 44.1 to 45.325 seconds: much more consistent than 22.05 to 44.1
seconds.

As well as giving more consistent results, this might even be simpler
than the algorithm in your patch, because there is no need to remember
the total time since the first transmission.

-- Jamie

2004-04-18 05:01:44

by Marc Singer

[permalink] [raw]
Subject: Re: NFS and kernel 2.6.x

On Sat, Apr 17, 2004 at 05:57:46PM -0700, Trond Myklebust wrote:
> On Sat, 2004-04-17 at 15:22, Marc Singer wrote:
> > I have a data point for comparison.
> >
> > I'm copying a 40MiB file over NFS. In five trials, the mean transfer
> > times are
> >
> > UDP (v2): 48.5s
> > TCP (v3): 52.7s
>
> Against what kind of server on what kind of network, with what kind of
> mount options?
> The above would be quite reasonable performance on a 10Mbit network
> against a filer or a Linux server with the (insecure) "async" option
> set.

Client is a 200MHz ARM; server is a Linux host running 2.6.3 with the
kernel nfs daemon; network is 100Mbit. There is nothing else on the
network except intermittent broadband traffic. Async is set on the
server side.

While I have seen much worse performance in the last couple of weeks,
I cannot blame NFS when I look at the numbers.

2004-04-18 06:36:46

by Chris Friesen

[permalink] [raw]
Subject: Re: NFS and kernel 2.6.x

Marc Singer wrote:

> Client is a 200MHz ARM; server is a Linux host running 2.6.3 with the
> kernel nfs daemon; network is 100Mbit. There is nothing else on the
> network except intermittent broadband traffic. Async is set on the
> server side.

Is the ARM that slow? Under 2MB/s seems odd to me... but then maybe I'm
used to faster machines.

Chris

2004-04-18 07:03:45

by Trond Myklebust

[permalink] [raw]
Subject: Re: NFS and kernel 2.6.x

On Sat, 2004-04-17 at 20:26, Jamie Lokier wrote:

> Are they intended to stop doubling at 3.2? The major timeout
> thus happens after 22.3 seconds.
>
> Unsurprisingly, subsequent major timeouts take 44.1 seconds.

Right... ...but since the timeout value is already capped at 60 seconds,
this is not a major problem. It is pretty pointless to be talking about
"predictable" or "consistent" behaviour when talking about a situation
where we believe that the server has crashed.

AFAICS, all we care about is to establish a predictable *lower limit*.

Cheers,
Trond

2004-04-18 07:56:24

by Russell King

[permalink] [raw]
Subject: Re: NFS and kernel 2.6.x

On Sun, Apr 18, 2004 at 02:36:14AM -0400, Chris Friesen wrote:
> Marc Singer wrote:
>
> > Client is a 200MHz ARM; server is a Linux host running 2.6.3 with the
> > kernel nfs daemon; network is 100Mbit. There is nothing else on the
> > network except intermittent broadband traffic. Async is set on the
> > server side.
>
> Is the ARM that slow? Under 2MB/s seems odd to me... but then maybe I'm
> used to faster machines.

It's probably the SMC91c111 ether chip causing all the problems - it's
only able to store about 4 packets before it starts dropping, which
isn't that much on a 100Mbit network.

Running with rsize=4096 works wonders with this chip.

--
Russell King
Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/
maintainer of: 2.6 PCMCIA - http://pcmcia.arm.linux.org.uk/
2.6 Serial core

2004-04-18 11:16:42

by Daniel Egger

[permalink] [raw]
Subject: Re: NFS and kernel 2.6.x

On 17.04.2004, at 22:22, Marc Singer wrote:

> What I'd like to do is use a command line like this
>
> root=/dev/nfs ip=rarp nfsroot=,tcp,v3
>
> But, it doesn't work. I'd like to let the kernel autoconfiguration
> handle the addressing.

According to Documentation/nfsroot.txt you should be able
to do:

root=/dev/nfs ip=rarp nfsroot=/kernel,tcp,v3

i.e. the ip is optional. Just out of curiosity: how would you
supply the kernel name using rarp/bootp/dhcp? For the past few days
I've been using pxelinux, but before that I needed to hardcode the
path into the tagged image. Actually, I prefer this to restarting
the dhcp server, but...

Servus,
Daniel



2004-04-18 17:31:43

by Marc Singer

[permalink] [raw]
Subject: Re: NFS and kernel 2.6.x

On Sun, Apr 18, 2004 at 08:56:19AM +0100, Russell King wrote:
> On Sun, Apr 18, 2004 at 02:36:14AM -0400, Chris Friesen wrote:
> > Marc Singer wrote:
> >
> > > Client is a 200MHz ARM; server is a Linux host running 2.6.3 with the
> > > kernel nfs daemon; network is 100Mbit. There is nothing else on the
> > > network except intermittent broadband traffic. Async is set on the
> > > server side.
> >
> > Is the ARM that slow? Under 2MB/s seems odd to me... but then maybe I'm
> > used to faster machines.
>
> It's probably the SMC91c111 ether chip causing all the problems - it's
> only able to store about 4 packets before it starts dropping, which
> isn't that much on a 100Mbit network.

I suspect that it might be a CPU issue. On transmit only, it never
gets above 18Mbit/s.

> Running with rsize=4096 works wonders with this chip.

Already there.

2004-04-18 23:22:37

by Jamie Lokier

[permalink] [raw]
Subject: Re: NFS and kernel 2.6.x

Trond Myklebust wrote:
> On Sat, 2004-04-17 at 20:26, Jamie Lokier wrote:
> > Are they intended to stop doubling at 3.2? The major timeout
> > thus happens after 22.3 seconds.
> >
> > Unsurprisingly, subsequent major timeouts take 44.1 seconds.
>
> Right... ...but since the timeout value is already capped at 60 seconds,
> this is not a major problem. It is pretty pointless to be talking about
> "predictable" or "consistent" behaviour when talking about a situation
> where we believe that the server has crashed.

I agree, but would still prefer more consistent behaviour if it is
easy -- and I explained how to do it, it's an easy algorithm.

You didn't respond to the other question: the doubling stopping at
3.2s. Is it intended? It goes against a basic principle of congestion
control.

> AFAICS, all we care about is to establish a predictable *lower limit*.

I agree that is the most important thing, and the old behaviour was
probably the cause of problems for at least one poster on this thread.

-- Jamie

2004-04-19 09:05:14

by Helge Hafting

[permalink] [raw]
Subject: Re: NFS and kernel 2.6.x

Matthias Urlichs wrote:
> Hi, Trond Myklebust wrote:
>
>
>>As for blanket statements like the above: I have seen no evidence yet
>>that they are any more warranted in 2.6.x than they were in 2.4.x.
>
>
> Oh, I saw the problem too: a slow client couldn't do full-size reads from
> a fast server because the buffer on the client's network card was just 8k.
>
You can force NFS to use smaller packets, which is useful for those who
have to use UDP because the server doesn't support NFS over TCP.
Try 8k, or even 4k.
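
On the client, that amounts to something like the following (server name
and paths are placeholders; "udp" was the default at the time and is shown
only for clarity):

```sh
# Example UDP mount with reduced transfer sizes (placeholder paths)
mount -t nfs -o udp,rsize=4096,wsize=4096 server:/export /mnt/nfs
```

The same rsize/wsize options can go in the fstab entry or, for nfsroot,
into the nfsroot= option list.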

Helge Hafting

2004-04-19 15:38:16

by Trond Myklebust

[permalink] [raw]
Subject: Re: NFS and kernel 2.6.x

On Sun, 2004-04-18 at 19:22, Jamie Lokier wrote:

> I agree, but would still prefer more consistent behaviour if it is
> easy -- and I explained how to do it, it's an easy algorithm.

The reason I don't like it is that it continues to tie the major timeout
to the resend timeouts. You've convinced me that they should not be the
same thing.

The other reason is that it only improves matters for the first request.
Once we reset the RTO, all those other outstanding requests are anyway
going to see an immediate discontinuity as their basic timeout jumps
from 1ms to 700ms. So why go to all that trouble just for 1 request?

> You don't respond to the other question: the doubling stopping at
> 3.2s. Is it intended? It goes againt a basic principle of congestion
> control.

I can put it back in.

It was partly another "consistency" issue that initially worried me,
partly in order to avoid problems with overflow:
If you have more than one outstanding request, then those that get
scheduled after the first major timeout (when we reset the RTO
estimator) will see a "jump". If the "retries" variable is too large,
they will either jump straight over 60 seconds and thus trigger the cap,
or they will end up at zero due to 32-bit overflow.

I agree, though, that this is less of an issue.

Cheers,
Trond

2004-04-19 16:20:00

by Trond Myklebust

[permalink] [raw]
Subject: Re: NFS and kernel 2.6.x

include/linux/sunrpc/xprt.h | 10 ++---
net/sunrpc/auth_gss/auth_gss.c | 2 -
net/sunrpc/clnt.c | 4 --
net/sunrpc/timer.c | 1
net/sunrpc/xprt.c | 81 +++++++++++++++++++++++++----------------
5 files changed, 57 insertions(+), 41 deletions(-)

diff -u --recursive --new-file --show-c-function linux-2.6.6-rc1/include/linux/sunrpc/xprt.h linux-2.6.6-01-soft/include/linux/sunrpc/xprt.h
--- linux-2.6.6-rc1/include/linux/sunrpc/xprt.h 2004-04-17 23:01:09.000000000 -0400
+++ linux-2.6.6-01-soft/include/linux/sunrpc/xprt.h 2004-04-19 11:57:32.000000000 -0400
@@ -69,8 +69,7 @@ extern unsigned int xprt_tcp_slot_table_
* This describes a timeout strategy
*/
struct rpc_timeout {
- unsigned long to_current, /* current timeout */
- to_initval, /* initial timeout */
+ unsigned long to_initval, /* initial timeout */
to_maxval, /* max timeout */
to_increment; /* if !exponential */
unsigned int to_retries; /* max # of retries */
@@ -85,7 +84,6 @@ struct rpc_rqst {
* This is the user-visible part
*/
struct rpc_xprt * rq_xprt; /* RPC client */
- struct rpc_timeout rq_timeout; /* timeout parms */
struct xdr_buf rq_snd_buf; /* send buffer */
struct xdr_buf rq_rcv_buf; /* recv buffer */

@@ -103,6 +101,9 @@ struct rpc_rqst {
struct xdr_buf rq_private_buf; /* The receive buffer
* used in the softirq.
*/
+ unsigned long rq_majortimeo; /* major timeout alarm */
+ unsigned long rq_timeout; /* Current timeout value */
+ unsigned int rq_retries; /* # of retries */
/*
* For authentication (e.g. auth_des)
*/
@@ -115,7 +116,6 @@ struct rpc_rqst {
u32 rq_bytes_sent; /* Bytes we have sent */

unsigned long rq_xtime; /* when transmitted */
- int rq_ntimeo;
int rq_ntrans;
};
#define rq_svec rq_snd_buf.head
@@ -210,7 +210,7 @@ void xprt_reserve(struct rpc_task *);
int xprt_prepare_transmit(struct rpc_task *);
void xprt_transmit(struct rpc_task *);
void xprt_receive(struct rpc_task *);
-int xprt_adjust_timeout(struct rpc_timeout *);
+int xprt_adjust_timeout(struct rpc_rqst *req);
void xprt_release(struct rpc_task *);
void xprt_connect(struct rpc_task *);
int xprt_clear_backlog(struct rpc_xprt *);
diff -u --recursive --new-file --show-c-function linux-2.6.6-rc1/net/sunrpc/auth_gss/auth_gss.c linux-2.6.6-01-soft/net/sunrpc/auth_gss/auth_gss.c
--- linux-2.6.6-rc1/net/sunrpc/auth_gss/auth_gss.c 2004-04-17 23:00:57.000000000 -0400
+++ linux-2.6.6-01-soft/net/sunrpc/auth_gss/auth_gss.c 2004-04-19 11:57:32.000000000 -0400
@@ -736,10 +736,8 @@ static int
gss_refresh(struct rpc_task *task)
{
struct rpc_clnt *clnt = task->tk_client;
- struct rpc_xprt *xprt = task->tk_xprt;
struct rpc_cred *cred = task->tk_msg.rpc_cred;

- task->tk_timeout = xprt->timeout.to_current;
if (!gss_cred_is_uptodate_ctx(cred))
return gss_upcall(clnt, task, cred);
return 0;
diff -u --recursive --new-file --show-c-function linux-2.6.6-rc1/net/sunrpc/clnt.c linux-2.6.6-01-soft/net/sunrpc/clnt.c
--- linux-2.6.6-rc1/net/sunrpc/clnt.c 2004-04-17 23:00:47.000000000 -0400
+++ linux-2.6.6-01-soft/net/sunrpc/clnt.c 2004-04-19 11:57:32.000000000 -0400
@@ -788,13 +788,11 @@ static void
call_timeout(struct rpc_task *task)
{
struct rpc_clnt *clnt = task->tk_client;
- struct rpc_timeout *to = &task->tk_rqstp->rq_timeout;

- if (xprt_adjust_timeout(to)) {
+ if (xprt_adjust_timeout(task->tk_rqstp) == 0) {
dprintk("RPC: %4d call_timeout (minor)\n", task->tk_pid);
goto retry;
}
- to->to_retries = clnt->cl_timeout.to_retries;

dprintk("RPC: %4d call_timeout (major)\n", task->tk_pid);
if (RPC_IS_SOFT(task)) {
diff -u --recursive --new-file --show-c-function linux-2.6.6-rc1/net/sunrpc/timer.c linux-2.6.6-01-soft/net/sunrpc/timer.c
--- linux-2.6.6-rc1/net/sunrpc/timer.c 2004-04-17 23:01:20.000000000 -0400
+++ linux-2.6.6-01-soft/net/sunrpc/timer.c 2004-04-19 11:57:32.000000000 -0400
@@ -39,6 +39,7 @@ rpc_init_rtt(struct rpc_rtt *rt, unsigne
for (i = 0; i < 5; i++) {
rt->srtt[i] = init;
rt->sdrtt[i] = RPC_RTO_INIT;
+ rt->ntimeouts[i] = 0;
}
}

diff -u --recursive --new-file --show-c-function linux-2.6.6-rc1/net/sunrpc/xprt.c linux-2.6.6-01-soft/net/sunrpc/xprt.c
--- linux-2.6.6-rc1/net/sunrpc/xprt.c 2004-04-17 23:01:07.000000000 -0400
+++ linux-2.6.6-01-soft/net/sunrpc/xprt.c 2004-04-19 11:58:03.000000000 -0400
@@ -352,35 +352,57 @@ xprt_adjust_cwnd(struct rpc_xprt *xprt,
}

/*
+ * Reset the major timeout value
+ */
+static void xprt_reset_majortimeo(struct rpc_rqst *req)
+{
+ struct rpc_timeout *to = &req->rq_xprt->timeout;
+
+ req->rq_majortimeo = req->rq_timeout;
+ if (to->to_exponential)
+ req->rq_majortimeo <<= to->to_retries;
+ else
+ req->rq_majortimeo += to->to_increment * to->to_retries;
+ if (req->rq_majortimeo > to->to_maxval || req->rq_majortimeo == 0)
+ req->rq_majortimeo = to->to_maxval;
+ req->rq_majortimeo += jiffies;
+}
+
+/*
* Adjust timeout values etc for next retransmit
*/
-int
-xprt_adjust_timeout(struct rpc_timeout *to)
+int xprt_adjust_timeout(struct rpc_rqst *req)
{
- if (to->to_retries > 0) {
+ struct rpc_xprt *xprt = req->rq_xprt;
+ struct rpc_timeout *to = &xprt->timeout;
+ int status = 0;
+
+ if (time_before(jiffies, req->rq_majortimeo)) {
if (to->to_exponential)
- to->to_current <<= 1;
+ req->rq_timeout <<= 1;
else
- to->to_current += to->to_increment;
- if (to->to_maxval && to->to_current >= to->to_maxval)
- to->to_current = to->to_maxval;
+ req->rq_timeout += to->to_increment;
+ if (to->to_maxval && req->rq_timeout >= to->to_maxval)
+ req->rq_timeout = to->to_maxval;
+ req->rq_retries++;
+ pprintk("RPC: %lu retrans\n", jiffies);
} else {
- if (to->to_exponential)
- to->to_initval <<= 1;
- else
- to->to_initval += to->to_increment;
- if (to->to_maxval && to->to_initval >= to->to_maxval)
- to->to_initval = to->to_maxval;
- to->to_current = to->to_initval;
+ req->rq_timeout = to->to_initval;
+ req->rq_retries = 0;
+ xprt_reset_majortimeo(req);
+ /* Reset the RTT counters == "slow start" */
+ spin_lock_bh(&xprt->sock_lock);
+ rpc_init_rtt(req->rq_task->tk_client->cl_rtt, to->to_initval);
+ spin_unlock_bh(&xprt->sock_lock);
+ pprintk("RPC: %lu timeout\n", jiffies);
+ status = -ETIMEDOUT;
}

- if (!to->to_current) {
- printk(KERN_WARNING "xprt_adjust_timeout: to_current = 0!\n");
- to->to_current = 5 * HZ;
- }
- pprintk("RPC: %lu %s\n", jiffies,
- to->to_retries? "retrans" : "timeout");
- return to->to_retries-- > 0;
+ if (req->rq_timeout == 0) {
+ printk(KERN_WARNING "xprt_adjust_timeout: rq_timeout = 0!\n");
+ req->rq_timeout = 5 * HZ;
+ }
+ return status;
}

/*
@@ -1166,6 +1188,7 @@ xprt_transmit(struct rpc_task *task)
/* Add request to the receive list */
list_add_tail(&req->rq_list, &xprt->recv);
spin_unlock_bh(&xprt->sock_lock);
+ xprt_reset_majortimeo(req);
}
} else if (!req->rq_bytes_sent)
return;
@@ -1221,7 +1244,7 @@ xprt_transmit(struct rpc_task *task)
if (!xprt_connected(xprt))
task->tk_status = -ENOTCONN;
else if (test_bit(SOCK_NOSPACE, &xprt->sock->flags)) {
- task->tk_timeout = req->rq_timeout.to_current;
+ task->tk_timeout = req->rq_timeout;
rpc_sleep_on(&xprt->pending, task, NULL, NULL);
}
spin_unlock_bh(&xprt->sock_lock);
@@ -1248,13 +1271,11 @@ xprt_transmit(struct rpc_task *task)
if (!xprt->nocong) {
int timer = task->tk_msg.rpc_proc->p_timer;
task->tk_timeout = rpc_calc_rto(clnt->cl_rtt, timer);
- task->tk_timeout <<= rpc_ntimeo(clnt->cl_rtt, timer);
- task->tk_timeout <<= clnt->cl_timeout.to_retries
- - req->rq_timeout.to_retries;
- if (task->tk_timeout > req->rq_timeout.to_maxval)
- task->tk_timeout = req->rq_timeout.to_maxval;
+ task->tk_timeout <<= rpc_ntimeo(clnt->cl_rtt, timer) + req->rq_retries;
+ if (task->tk_timeout > xprt->timeout.to_maxval || task->tk_timeout == 0)
+ task->tk_timeout = xprt->timeout.to_maxval;
} else
- task->tk_timeout = req->rq_timeout.to_current;
+ task->tk_timeout = req->rq_timeout;
/* Don't race with disconnect */
if (!xprt_connected(xprt))
task->tk_status = -ENOTCONN;
@@ -1324,7 +1345,7 @@ xprt_request_init(struct rpc_task *task,
{
struct rpc_rqst *req = task->tk_rqstp;

- req->rq_timeout = xprt->timeout;
+ req->rq_timeout = xprt->timeout.to_initval;
req->rq_task = task;
req->rq_xprt = xprt;
req->rq_xid = xprt_alloc_xid(xprt);
@@ -1381,7 +1402,6 @@ xprt_default_timeout(struct rpc_timeout
void
xprt_set_timeout(struct rpc_timeout *to, unsigned int retr, unsigned long incr)
{
- to->to_current =
to->to_initval =
to->to_increment = incr;
to->to_maxval = incr * retr;
@@ -1446,7 +1466,6 @@ xprt_setup(int proto, struct sockaddr_in
/* Set timeout parameters */
if (to) {
xprt->timeout = *to;
- xprt->timeout.to_current = to->to_initval;
} else
xprt_default_timeout(&xprt->timeout, xprt->prot);


Attachments:
linux-2.6.6-01-soft.dif (8.98 kB)

2004-04-19 16:41:55

by Trond Myklebust

[permalink] [raw]
Subject: Re: NFS and kernel 2.6.x

nfsroot.c | 33 +++++++++++++++++++++------------
1 files changed, 21 insertions(+), 12 deletions(-)

diff -u --recursive --new-file --show-c-function linux-2.6.6-01-soft/fs/nfs/nfsroot.c linux-2.6.6-02-fix_nfsroot/fs/nfs/nfsroot.c
--- linux-2.6.6-01-soft/fs/nfs/nfsroot.c 2004-04-17 23:01:09.000000000 -0400
+++ linux-2.6.6-02-fix_nfsroot/fs/nfs/nfsroot.c 2004-04-19 12:08:31.000000000 -0400
@@ -117,11 +117,16 @@ static int mount_port __initdata = 0; /
***************************************************************************/

enum {
+ /* Options that take integer arguments */
Opt_port, Opt_rsize, Opt_wsize, Opt_timeo, Opt_retrans, Opt_acregmin,
- Opt_acregmax, Opt_acdirmin, Opt_acdirmax, Opt_soft, Opt_hard, Opt_intr,
+ Opt_acregmax, Opt_acdirmin, Opt_acdirmax,
+ /* Options that take no arguments */
+ Opt_soft, Opt_hard, Opt_intr,
Opt_nointr, Opt_posix, Opt_noposix, Opt_cto, Opt_nocto, Opt_ac,
Opt_noac, Opt_lock, Opt_nolock, Opt_v2, Opt_v3, Opt_udp, Opt_tcp,
- Opt_broken_suid, Opt_err,
+ Opt_broken_suid,
+ /* Error token */
+ Opt_err
};

static match_table_t tokens = {
@@ -146,9 +151,13 @@ static match_table_t tokens = {
{Opt_noac, "noac"},
{Opt_lock, "lock"},
{Opt_nolock, "nolock"},
+ {Opt_v2, "nfsvers=2"},
{Opt_v2, "v2"},
+ {Opt_v3, "nfsvers=3"},
{Opt_v3, "v3"},
+ {Opt_udp, "proto=udp"},
{Opt_udp, "udp"},
+ {Opt_tcp, "proto=tcp"},
{Opt_tcp, "tcp"},
{Opt_broken_suid, "broken_suid"},
{Opt_err, NULL}
@@ -162,25 +171,21 @@ static match_table_t tokens = {
static int __init root_nfs_parse(char *name, char *buf)
{

- char *p;
+ char *p, *path = name;
substring_t args[MAX_OPT_ARGS];
int option;

if (!name)
return 1;

- if (name[0] && strcmp(name, "default")){
- strlcpy(buf, name, NFS_MAXPATHLEN);
- return 1;
- }
while ((p = strsep (&name, ",")) != NULL) {
int token;
if (!*p)
continue;
token = match_token(p, tokens, args);

- /* %u tokens only */
- if (match_int(&args[0], &option))
+ /* %u tokens only. Beware if you add new tokens! */
+ if (token < Opt_soft && match_int(&args[0], &option))
return 0;
switch (token) {
case Opt_port:
@@ -265,6 +270,13 @@ static int __init root_nfs_parse(char *n
return 0;
}
}
+
+ /*
+ * Copy the NFS remote path to the output buffer.
+ * Relies on strsep() having converted the delimiting ',' to '\0'.
+ */
+ if (path[0] != '\0' && strcmp(path, "default") != 0)
+ strlcpy(buf, path, NFS_MAXPATHLEN);
return 1;
}

@@ -283,9 +295,6 @@ static int __init root_nfs_name(char *na
nfs_data.flags = NFS_MOUNT_NONLM; /* No lockd in nfs root yet */
nfs_data.rsize = NFS_DEF_FILE_IO_BUFFER_SIZE;
nfs_data.wsize = NFS_DEF_FILE_IO_BUFFER_SIZE;
- nfs_data.bsize = 0;
- nfs_data.timeo = 7;
- nfs_data.retrans = 3;
nfs_data.acregmin = 3;
nfs_data.acregmax = 60;
nfs_data.acdirmin = 30;


Attachments:
linux-2.6.6-02-fix_nfsroot.dif (2.88 kB)

2004-04-19 21:10:50

by Trond Myklebust

Subject: Re: NFS and kernel 2.6.x

nfsroot.c | 30 +++++++++++++++++++-----------
1 files changed, 19 insertions(+), 11 deletions(-)

diff -u --recursive --new-file --show-c-function linux-2.6.6-01-soft/fs/nfs/nfsroot.c linux-2.6.6-02-fix_nfsroot/fs/nfs/nfsroot.c
--- linux-2.6.6-01-soft/fs/nfs/nfsroot.c 2004-04-19 12:27:51.000000000 -0400
+++ linux-2.6.6-02-fix_nfsroot/fs/nfs/nfsroot.c 2004-04-19 16:26:12.000000000 -0400
@@ -117,11 +117,16 @@ static int mount_port __initdata = 0; /
***************************************************************************/

enum {
+ /* Options that take integer arguments */
Opt_port, Opt_rsize, Opt_wsize, Opt_timeo, Opt_retrans, Opt_acregmin,
- Opt_acregmax, Opt_acdirmin, Opt_acdirmax, Opt_soft, Opt_hard, Opt_intr,
+ Opt_acregmax, Opt_acdirmin, Opt_acdirmax,
+ /* Options that take no arguments */
+ Opt_soft, Opt_hard, Opt_intr,
Opt_nointr, Opt_posix, Opt_noposix, Opt_cto, Opt_nocto, Opt_ac,
Opt_noac, Opt_lock, Opt_nolock, Opt_v2, Opt_v3, Opt_udp, Opt_tcp,
- Opt_broken_suid, Opt_err,
+ Opt_broken_suid,
+ /* Error token */
+ Opt_err
};

static match_table_t tokens = {
@@ -146,9 +151,13 @@ static match_table_t tokens = {
{Opt_noac, "noac"},
{Opt_lock, "lock"},
{Opt_nolock, "nolock"},
+ {Opt_v2, "nfsvers=2"},
{Opt_v2, "v2"},
+ {Opt_v3, "nfsvers=3"},
{Opt_v3, "v3"},
+ {Opt_udp, "proto=udp"},
{Opt_udp, "udp"},
+ {Opt_tcp, "proto=tcp"},
{Opt_tcp, "tcp"},
{Opt_broken_suid, "broken_suid"},
{Opt_err, NULL}
@@ -169,18 +178,19 @@ static int __init root_nfs_parse(char *n
if (!name)
return 1;

- if (name[0] && strcmp(name, "default")){
- strlcpy(buf, name, NFS_MAXPATHLEN);
- return 1;
- }
+ /* Set the NFS remote path */
+ p = strsep(&name, ",");
+ if (p[0] != '\0' && strcmp(p, "default") != 0)
+ strlcpy(buf, p, NFS_MAXPATHLEN);
+
while ((p = strsep (&name, ",")) != NULL) {
int token;
if (!*p)
continue;
token = match_token(p, tokens, args);

- /* %u tokens only */
- if (match_int(&args[0], &option))
+ /* %u tokens only. Beware if you add new tokens! */
+ if (token < Opt_soft && match_int(&args[0], &option))
return 0;
switch (token) {
case Opt_port:
@@ -265,6 +275,7 @@ static int __init root_nfs_parse(char *n
return 0;
}
}
+
return 1;
}

@@ -283,9 +294,6 @@ static int __init root_nfs_name(char *na
nfs_data.flags = NFS_MOUNT_NONLM; /* No lockd in nfs root yet */
nfs_data.rsize = NFS_DEF_FILE_IO_BUFFER_SIZE;
nfs_data.wsize = NFS_DEF_FILE_IO_BUFFER_SIZE;
- nfs_data.bsize = 0;
- nfs_data.timeo = 7;
- nfs_data.retrans = 3;
nfs_data.acregmin = 3;
nfs_data.acregmax = 60;
nfs_data.acdirmin = 30;


Attachments:
linux-2.6.6-02-fix_nfsroot.dif (2.65 kB)

2004-04-20 00:09:46

by Jamie Lokier

Subject: Re: NFS and kernel 2.6.x

Trond Myklebust wrote:
> > I agree, but would still prefer more consistent behaviour if it is
> > easy -- and I explained how to do it, it's an easy algorithm.
>
> The reason I don't like it is that it continues to tie the major timeout
> to the resend timeouts. You've convinced me that they should not be the
> same thing.

Sorry, I don't understand that paragraph.

The algorithm I suggested _decouples_ the major timeout from the rtt
estimate. Your algorithm strongly couples them. I'm not sure what
you mean by saying the major timeout is "tied to the resend timeouts".

Your current (patched) algorithm sets the major timeout to be in the
range:

[timeo << retrans, (timeo << retrans) * 2]

The suggested algorithm sets the major timeout to be in the range:

[timeo << (retrans+1), (timeo << (retrans+1)) + 2 * timeo]

I.e. with retrans set to a new default of 5 (I think that's useful),
the major timeout is approx [44.8, 46] instead of [22.4, 44.8].

I agree it's not the most important thing in the world, but it is nice
to be able to fix the parameters and say that with the defaults, major
timeout happens after about 45 seconds.

You say you don't like it because major timeout is still tied to
something. Could you explain what the ideal behaviour you have in
mind is? Right now, with the patch, I think your intention is to have
a fixed major timeout time, but it doesn't work like that.

> The other reason is that it only improves matters for the first request.
> Once we reset the RTO, all those other outstanding requests are anyway
> going to see an immediate discontinuity as their basic timeout jumps
> from 1ms to 700ms.

Yes, that's the point: after the retransmit count passes a threshold, we
should no longer depend on the RTO estimate because it doesn't seem to
be reliable.

> So why go to all that trouble just for 1 request?

Because it's visible behaviour with "soft" mounts. Someone unplugs
the cable or the network is down, and you see the I/O errors after
about 40 seconds. This is nicer than seeing them after an unknown
period between 40 and 80 (or 20 and 40 depending on your settings).

> It was partly another "consistency" issue that initially worried me,
> partly in order to avoid problems with overflow:
> If you have more than one outstanding request, then those that get
> scheduled after the first major timeout (when we reset the RTO
> estimator) will see a "jump". If the "retries" variable is too large,
> they will either jump straight over 60 seconds, and thus trigger the cap
> or they will end up at zero due to 32-bit overflow.

Ah. So you keep track of the number of retries per request, and each
time you send a request you set its timeout to (RTO << retries)?

If you do, maybe that's why my algorithm seems overcomplicated, and
you're concerned about overflows etc.

Instead of counting retries, don't. You don't need a per-request
retries counter. Instead: keep track of the request_timeout when the
request was last issued. When retransmitting, compare that value
against the global value (timeo << retrans). When a request times out
and request_timeout >= (timeo << retrans), that's a major timeout.
Otherwise you just check if request_timeout < timeo. If yes, double
it. If no, set request_timeout = timeo << N for the smallest integer
N such that it's an increase. And try again.

Notice how that logic is based on constants: it's independent of RTO,
and so outstanding requests aren't affected by changes in RTO.
There's no jump, no overflow, and you can compute the key constant
(timeo << retrans) when initialising: retrans isn't used by itself.

-- Jamie