2018-05-11 10:22:01

by Goran

Subject: unstable nfs connection

I have an issue with NFS, which I'm using for the first time.

I have a server running Arch Linux. The server also runs several
containers, each of which runs Arch Linux too. One container is
running nfsd.

The NFS container is connected to a virtual bridge hosted by the
server. One physical NIC of the server is attached to that bridge, and
a physical switch is connected to that NIC. The client is connected to
the same physical switch.

To put it more simply:

client -- client's NIC -- physical switch -- server NIC -- server bridge -- container NIC -- container

At first sight everything works well: I can boot the client diskless
and read-only from NFS (I use dracut for that), log into it, and start
working.

If the client sits idle, after some time (e.g. 5 minutes) I get

nfs: server 172.17.0.5 not responding, still trying

or

nfs: server 172.17.0.5 not responding, timed out

If 10 clients start at the same time, these errors appear immediately.

I took a look at the container's NIC:

23: eth0@if24: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000
    link/ether 42:3b:82:03:c0:7d brd ff:ff:ff:ff:ff:ff link-netnsid 0
    RX: bytes  packets  errors  dropped overrun mcast
    44184595   388170   0       205     0       0
    TX: bytes  packets  errors  dropped carrier collsns
    902023963  112922   0       0       0       0

but that doesn't look like many dropped packets to me. I'm more of a
C++ guy than a Linux admin, so I'm asking: how can I debug this NFS
connection and make it stable?
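To put a number on "not too much": the RX counters above show 205 drops against 388,170 received packets, roughly 0.05%. The Python sketch below parses output in the layout shown (assumption: it came from something like `ip -s link`) and computes that ratio; the function names are hypothetical. Note these are lifetime totals, so they cannot show whether the drops coincide with the hangs.

```python
# Counters copied from the container's interface statistics quoted above.
SAMPLE = """\
23: eth0@if24: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000
    link/ether 42:3b:82:03:c0:7d brd ff:ff:ff:ff:ff:ff link-netnsid 0
    RX: bytes  packets  errors  dropped overrun mcast
    44184595   388170   0       205     0       0
    TX: bytes  packets  errors  dropped carrier collsns
    902023963  112922   0       0       0       0
"""


def parse_stats(text, direction="RX:"):
    """Map counter names after an 'RX:' or 'TX:' header to the values on the next line."""
    rows = [line.split() for line in text.splitlines()]
    for i, fields in enumerate(rows):
        if fields and fields[0] == direction:
            return dict(zip(fields[1:], (int(v) for v in rows[i + 1])))
    raise ValueError(f"no {direction} header found")


def drop_ratio(stats):
    """Dropped packets relative to the packet count in the same direction."""
    return stats["dropped"] / stats["packets"]


rx = parse_stats(SAMPLE)
print(f"RX drops: {rx['dropped']} of {rx['packets']} packets ({drop_ratio(rx):.3%})")
```

Whether 0.05% is "normal" depends on what is dropping the packets, which is exactly what needs finding out.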

Regards
Goran


2018-05-14 21:18:22

by J. Bruce Fields

Subject: Re: unstable nfs connection

On Fri, May 11, 2018 at 12:21:59PM +0200, Goran wrote:
> but that doesn't look like many dropped packets to me. I'm more of a
> C++ guy than a Linux admin, so I'm asking: how can I debug this NFS
> connection and make it stable?

I don't know. My first impulse would be to blame some problem at the
networking level rather than NFS itself. (Are those dropped packets
normal? Can you reproduce problems with any protocol other than NFS?)
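One way to check whether those drops matter is to sample the drop counter while reproducing the hang and see whether it grows when the "not responding" messages appear. A minimal Python sketch, assuming the standard Linux sysfs counters under /sys/class/net/<iface>/statistics/ and that it runs inside the NFS container (interface eth0, as in the output above); read_counter() and watch_drops() are hypothetical helpers.

```python
import time
from pathlib import Path

# Standard Linux sysfs location of per-interface counters; overridable for testing.
SYSFS_NET = Path("/sys/class/net")


def read_counter(iface, name, root=SYSFS_NET):
    """Read one statistics counter (e.g. rx_dropped) for an interface."""
    return int((root / iface / "statistics" / name).read_text().strip())


def watch_drops(iface, interval=5.0, samples=12, root=SYSFS_NET):
    """Sample rx_dropped and return how much it grew during each interval."""
    deltas = []
    prev = read_counter(iface, "rx_dropped", root)
    for _ in range(samples):
        time.sleep(interval)
        cur = read_counter(iface, "rx_dropped", root)
        deltas.append(cur - prev)
        prev = cur
    return deltas
```

For example, `watch_drops("eth0", interval=5, samples=60)` covers about five minutes, roughly the idle time after which the errors were reported. Nonzero deltas that line up with the kernel messages would point at the network path; flat zeros would not rule it out, since drops can also happen on the bridge or the physical NIC.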

What are you using to set up the containers? What does the server
bridge look like? I wonder if there's some misconfiguration of the
bridge or of the server's virtual NIC, or an odd firewalling issue.
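To exercise the same path (switch, bridge, virtual NIC, any firewall) without NFS itself, one could time plain TCP connections from an idle client to the server. A sketch assuming NFS over TCP on the standard port 2049 and the server address 172.17.0.5 from the log messages; probe() and probe_loop() are hypothetical helpers.

```python
import socket
import time


def probe(host, port=2049, timeout=3.0):
    """Time one TCP connect to host:port; return seconds, or None on failure."""
    start = time.monotonic()
    try:
        # Timeouts, refusals and unreachable errors all raise OSError subclasses.
        with socket.create_connection((host, port), timeout=timeout):
            return time.monotonic() - start
    except OSError:
        return None


def probe_loop(host, port=2049, interval=10.0, count=60):
    """Probe repeatedly; return (wall-clock time, latency or None) pairs."""
    results = []
    for _ in range(count):
        results.append((time.time(), probe(host, port)))
        time.sleep(interval)
    return results
```

Running `probe_loop("172.17.0.5")` on a client and comparing any None entries with the times of the "not responding" messages would show whether the path fails outright. A successful connect only proves the handshake gets through, so a clean result does not exonerate the network for long-lived connections.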

It *might* help to watch the traffic on the various interfaces with
Wireshark.

--b.