2014-09-02 16:51:27

by Chris Perl

Subject: nfs-utils - TCP ephemeral port exhaustion results in mount failures

I've noticed that mount.nfs calls bind (in `nfs_bind' in
support/nfs/rpc_socket.c) before ultimately calling connect when
trying to get a TCP connection to talk to the remote portmapper
service (called from `nfs_get_tcpclient', which is called from
`nfs_gp_get_rpcbclient').

Unfortunately, this means you need to find a local ephemeral port
that is not part of *any* existing TCP connection (i.e. you're
looking for a unique 2-tuple of (socket_type, local_port), where
socket_type is either SOCK_STREAM or SOCK_DGRAM, but in this case
specifically SOCK_STREAM).

If you were to just call connect without calling bind first, then
you'd only need to find a unique 5-tuple of (socket_type, local_ip,
local_port, remote_ip, remote_port).
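
To make that concrete, here is a minimal C sketch (my illustration,
not the actual nfs-utils code) of the two approaches. Binding to
INADDR_ANY with port 0 before connecting is essentially what
`nfs_bind' does; passing bind_first=0 lets the kernel pick the source
port at connect time instead:

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <sys/socket.h>

/* Connect to addr, optionally calling bind(2) first the way
 * mount.nfs does (INADDR_ANY, port 0).  Returns the fd or -1. */
static int tcp_connect(const struct sockaddr_in *addr, int bind_first)
{
        int fd = socket(AF_INET, SOCK_STREAM, 0);
        if (fd < 0)
                return -1;

        if (bind_first) {
                struct sockaddr_in local;
                memset(&local, 0, sizeof(local));
                local.sin_family = AF_INET;
                local.sin_addr.s_addr = htonl(INADDR_ANY);
                local.sin_port = 0;     /* "any ephemeral port" */

                /* Must find a port not used by *any* local TCP
                 * socket; fails with EADDRINUSE once all are taken. */
                if (bind(fd, (struct sockaddr *)&local, sizeof(local)) < 0) {
                        perror("bind");
                        close(fd);
                        return -1;
                }
        }

        /* Without the bind above, the source port is chosen here, and
         * only the (local_ip, local_port, remote_ip, remote_port)
         * tuple as a whole has to be unique. */
        if (connect(fd, (const struct sockaddr *)addr, sizeof(*addr)) < 0) {
                perror("connect");
                close(fd);
                return -1;
        }
        return fd;
}

int main(void)
{
        struct sockaddr_in addr;
        memset(&addr, 0, sizeof(addr));
        addr.sin_family = AF_INET;
        addr.sin_port = htons(9990);
        inet_pton(AF_INET, "192.168.1.12", &addr.sin_addr);

        int fd = tcp_connect(&addr, 1); /* flip to 0 to skip bind(2) */
        if (fd >= 0)
                close(fd);
        return 0;
}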

The end result is that a misbehaving application that creates many
connections to some service, using up all the ephemeral ports, can
cause attempts to mount remote NFS filesystems to fail with
EADDRINUSE.

Don't get me wrong, I think we should fix our application (and we
are), but I don't see any reason why mount.nfs couldn't just call
connect without calling bind first (thereby letting the bind happen
implicitly), allowing mount.nfs to continue to work in this
situation.

I think an example may help explain what I'm talking about.

Let's take a Linux machine running CentOS 6.5
(2.6.32-431.1.2.0.1.el6.x86_64) and restrict the number of available
ephemeral ports to just 10:

[cperl@localhost ~]$ cat /proc/sys/net/ipv4/ip_local_port_range
60000 60009

Then create ten TCP connections to a remote service that will just
hold them open:

[cperl@localhost ~]$ for i in {0..9}; do socat -u
tcp:192.168.1.12:9990 file:/dev/null & done
[1] 21578
[2] 21579
[3] 21580
[4] 21581
[5] 21582
[6] 21583
[7] 21584
[8] 21585
[9] 21586
[10] 21587

[cperl@localhost ~]$ netstat -n --tcp | awk '$6 ~ /ESTABLISHED/ && $5
~/:999[0-9]$/ {print $1, $4, $5}' | sort | column -t
tcp 192.168.1.11:60000 192.168.1.12:9990
tcp 192.168.1.11:60001 192.168.1.12:9990
tcp 192.168.1.11:60002 192.168.1.12:9990
tcp 192.168.1.11:60003 192.168.1.12:9990
tcp 192.168.1.11:60004 192.168.1.12:9990
tcp 192.168.1.11:60005 192.168.1.12:9990
tcp 192.168.1.11:60006 192.168.1.12:9990
tcp 192.168.1.11:60007 192.168.1.12:9990
tcp 192.168.1.11:60008 192.168.1.12:9990
tcp 192.168.1.11:60009 192.168.1.12:9990

And now try to mount an NFS export:

[cperl@localhost ~]$ sudo mount 192.168.1.100:/export/a /tmp/a
mount.nfs: Address already in use

As mentioned before, this is because bind is trying to find a unique
2-tuple of (socket_type, local_port) (really I believe it's the
3-tuple (socket_type, local_ip, local_port), but calling bind with
INADDR_ANY as `nfs_bind' does reduces it to the 2-tuple), which it
cannot do.

However, just calling connect allows local ephemeral ports to be
"reused" (i.e. it looks for the unique 5-tuple of (socket_type,
local_ip, local_port, remote_ip, remote_port)).

For example, notice how the local ephemeral ports 60003 and 60004 are
"reused" below (because socat is just calling connect, not bind,
although we can make socat call bind with an option if we want and see
it fail like mount.nfs did above):

[cperl@localhost ~]$ socat -u tcp:192.168.1.12:9991 file:/dev/null &
[11] 22433
[cperl@localhost ~]$ socat -u tcp:192.168.1.13:9990 file:/dev/null &
[12] 22499
[cperl@localhost ~]$ netstat -n --tcp | awk '$6 ~ /ESTABLISHED/ && $5
~/:999[0-9]$/ {print $1, $4, $5}' | sort | column -t
tcp 192.168.0.11:60000 192.168.1.12:9990
tcp 192.168.0.11:60001 192.168.1.12:9990
tcp 192.168.0.11:60002 192.168.1.12:9990
tcp 192.168.0.11:60003 192.168.1.12:9990
tcp 192.168.0.11:60003 192.168.1.12:9991
tcp 192.168.0.11:60004 192.168.1.12:9990
tcp 192.168.0.11:60004 192.168.1.13:9990
tcp 192.168.0.11:60005 192.168.1.12:9990
tcp 192.168.0.11:60006 192.168.1.12:9990
tcp 192.168.0.11:60007 192.168.1.12:9990
tcp 192.168.0.11:60008 192.168.1.12:9990
tcp 192.168.0.11:60009 192.168.1.12:9990
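
(If I'm reading socat(1) right, its `bind' address option should
force the bind(2) path and reproduce the mount.nfs failure; the exact
option syntax here is my assumption from the man page:

        socat -u tcp:192.168.1.12:9992,bind=192.168.1.11 file:/dev/null

which in the exhausted state above should fail with "Address already
in use" even though that 5-tuple is brand new.)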

Is there any reason we couldn't modify `nfs_get_tcpclient' to not
bind in the case where it's not using a reserved port?
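
Roughly, I'm imagining something like the following sketch (the idea
only, not the actual rpc_socket.c code; `nfs_bindresvport' and the
`resvport' flag are stand-ins for whatever the real code uses):

        if (resvport) {
                /* A reserved port can only be obtained via bind(2). */
                if (nfs_bindresvport(sock, sap) < 0)
                        goto out_err;
        }
        /* else: skip bind(2) entirely; the connect(2) below picks
         * the ephemeral source port implicitly. */

        if (connect(sock, sap, salen) < 0)
                goto out_err;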

For some color, this is particularly annoying for me because I have
extensive automount maps, and this failure causes attempts to access
a given automounted path to return ENOENT. Furthermore, automount
caches this failure and continues to return ENOENT for the duration
of whatever its negative cache timeout is.

For UDP, I don't think "bind before connect" matters as much. I
believe the difference is just in the error you'll get from either
bind or connect when all ephemeral ports are used. If you attempt to
bind when all local ports are in use, you seem to get EADDRINUSE,
whereas if you connect when all local ports are in use, you get
EAGAIN.
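
A quick way to check that yourself; this is my own test scaffold,
nothing from nfs-utils, and it just reports whatever errno the kernel
hands back once you've exhausted the ephemeral range:

#include <stdio.h>
#include <string.h>
#include <errno.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <sys/socket.h>

int main(void)
{
        struct sockaddr_in local, remote;

        memset(&local, 0, sizeof(local));
        local.sin_family = AF_INET;
        local.sin_addr.s_addr = htonl(INADDR_ANY);
        local.sin_port = 0;             /* ask for an ephemeral port */

        memset(&remote, 0, sizeof(remote));
        remote.sin_family = AF_INET;
        remote.sin_port = htons(9990);
        inet_pton(AF_INET, "192.168.1.12", &remote.sin_addr);

        /* bind(2) path: appears to report EADDRINUSE on exhaustion. */
        int fd1 = socket(AF_INET, SOCK_DGRAM, 0);
        if (fd1 >= 0 && bind(fd1, (struct sockaddr *)&local,
                             sizeof(local)) < 0)
                printf("bind: %s\n", strerror(errno));

        /* connect(2) path: appears to report EAGAIN instead. */
        int fd2 = socket(AF_INET, SOCK_DGRAM, 0);
        if (fd2 >= 0 && connect(fd2, (struct sockaddr *)&remote,
                                sizeof(remote)) < 0)
                printf("connect: %s\n", strerror(errno));

        if (fd1 >= 0)
                close(fd1);
        if (fd2 >= 0)
                close(fd2);
        return 0;
}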

It could be I'm missing something totally obvious for why this is. If
so, please let me know!


2014-09-03 11:00:51

by Jeff Layton

Subject: Re: nfs-utils - TCP ephemeral port exhaustion results in mount failures

On Tue, 2 Sep 2014 12:51:06 -0400
Chris Perl <[email protected]> wrote:

> [...]
>
> It could be I'm missing something totally obvious for why this is. If
> so, please let me know!

(cc'ing Chuck since he wrote a lot of that code)

I'm not sure either. If there was a reason for that, it's likely lost
to antiquity. In some cases, we really are expected to use reserved
ports and I think you do have to bind() in order to get one. In the
non-reserved case though it's likely we could skip binding altogether.

What would probably be best is to roll up a patch that changes it, and
propose it on the list.

--
Jeff Layton <[email protected]>

2014-09-05 22:35:06

by Chris Perl

Subject: Re: nfs-utils - TCP ephemeral port exhaustion results in mount failures

> Do these patches require one another?
>
> What happens if I have a patched nfs-utils, but not a patched kernel or the other way around?

They don't require each other per se.

If you have a patched kernel without a patched nfs-utils, then
attempts to mount when every ephemeral port is already part of at
least one TCP connection will fail with EADDRINUSE from mount.nfs(8)
attempting to call bind(2). It will never get around to calling
mount(2) and never enter the kernel RPC code.

If you have a patched nfs-utils without a patched kernel, it's a
little unfortunate in that mount.nfs(8) will call mount(2), but that
will fail with EIO when `xs_bind' calls `kernel_bind', which returns
EADDRINUSE.

I say unfortunate because in this scenario you get the less helpful
EIO back from mount(2) rather than the EADDRINUSE from bind(2),
meaning the end user sees "input/output error" rather than "address
already in use," where the latter is the real issue.

2014-09-05 21:23:52

by Weston Andros Adamson

Subject: Re: nfs-utils - TCP ephemeral port exhaustion results in mount failures

Do these patches require one another?

What happens if I have a patched nfs-utils, but not a patched kernel or the other way around?

-dros


On Sep 5, 2014, at 4:20 PM, Chris Perl <[email protected]> wrote:

> It looks like they may have come through after all, unfortunately I
> already sent them again. Apologies for the spam.
>
> On Fri, Sep 5, 2014 at 4:04 PM, Chris Perl <[email protected]> wrote:
>> I tried to send them to the list, but I guess they didn't come through
>> because my sender was set to [email protected], but I ran `git
>> send-email' from work, so the smtp sender IP wouldn't be authorized.
>>
>> I'll figure out a way to send them now.
>>
>> On Fri, Sep 5, 2014 at 4:03 PM, Trond Myklebust
>> <[email protected]> wrote:
>>> On Fri, Sep 5, 2014 at 3:40 PM, Chris Perl <[email protected]> wrote:
>>>> I just submitted two patches, one for nfs-utils and one for linux-nfs.
>>>>
>>>> As I said in my previous email, the patch to nfs-utils was enough to
>>>> get us farther along, but we failed inside mount(2) with EIO (with a
>>>> decidedly more confusing error message).
>>>>
>>>> So, I've also submitted a patch for the rpc code in the kernel that
>>>> also avoids bind when asking for a random ephemeral port. I've tested
>>>> the combination of these two patches with my system while in the
>>>> situation I originally outlined. I can continue to successfully mount
>>>> NFS filesystems using both of these patches.
>>>>
>>>> I don't particularly love the kernel patch, as it makes `xs_bind' not
>>>> actually bind in all circumstances, which seems confusing. However, I
>>>> thought trying to rework things in a larger way would cause more
>>>> issues given that I'm not very familiar with this code. If everyone
>>>> hates it, I can try something else.
>>>
>>> To whom did you submit these patches? I don't see anything in the
>>> linux-nfs mailing list.
>>>
>>> --
>>> Trond Myklebust
>>>
>>> Linux NFS client maintainer, PrimaryData
>>>
>>> [email protected]


2014-09-05 20:20:49

by Chris Perl

Subject: Re: nfs-utils - TCP ephemeral port exhaustion results in mount failures

It looks like they may have come through after all, unfortunately I
already sent them again. Apologies for the spam.

On Fri, Sep 5, 2014 at 4:04 PM, Chris Perl <[email protected]> wrote:
> I tried to send them to the list, but I guess they didn't come through
> because my sender was set to [email protected], but I ran `git
> send-email' from work, so the smtp sender IP wouldn't be authorized.
>
> I'll figure out a way to send them now.
>
> On Fri, Sep 5, 2014 at 4:03 PM, Trond Myklebust
> <[email protected]> wrote:
>> On Fri, Sep 5, 2014 at 3:40 PM, Chris Perl <[email protected]> wrote:
>>> I just submitted two patches, one for nfs-utils and one for linux-nfs.
>>>
>>> As I said in my previous email, the patch to nfs-utils was enough to
>>> get us farther along, but we failed inside mount(2) with EIO (with a
>>> decidedly more confusing error message).
>>>
>>> So, I've also submitted a patch for the rpc code in the kernel that
>>> also avoids bind when asking for a random ephemeral port. I've tested
>>> the combination of these two patches with my system while in the
>>> situation I originally outlined. I can continue to successfully mount
>>> NFS filesystems using both of these patches.
>>>
>>> I don't particularly love the kernel patch, as it makes `xs_bind' not
>>> actually bind in all circumstances, which seems confusing. However, I
>>> thought trying to rework things in a larger way would cause more
>>> issues given that I'm not very familiar with this code. If everyone
>>> hates it, I can try something else.
>>
>> To whom did you submit these patches? I don't see anything in the
>> linux-nfs mailing list.
>>
>> --
>> Trond Myklebust
>>
>> Linux NFS client maintainer, PrimaryData
>>
>> [email protected]

2014-09-05 20:03:12

by Trond Myklebust

Subject: Re: nfs-utils - TCP ephemeral port exhaustion results in mount failures

On Fri, Sep 5, 2014 at 3:40 PM, Chris Perl <[email protected]> wrote:
> I just submitted two patches, one for nfs-utils and one for linux-nfs.
>
> As I said in my previous email, the patch to nfs-utils was enough to
> get us farther along, but we failed inside mount(2) with EIO (with a
> decidedly more confusing error message).
>
> So, I've also submitted a patch for the rpc code in the kernel that
> also avoids bind when asking for a random ephemeral port. I've tested
> the combination of these two patches with my system while in the
> situation I originally outlined. I can continue to successfully mount
> NFS filesystems using both of these patches.
>
> I don't particularly love the kernel patch, as it makes `xs_bind' not
> actually bind in all circumstances, which seems confusing. However, I
> thought trying to rework things in a larger way would cause more
> issues given that I'm not very familiar with this code. If everyone
> hates it, I can try something else.

To whom did you submit these patches? I don't see anything in the
linux-nfs mailing list.

--
Trond Myklebust

Linux NFS client maintainer, PrimaryData

[email protected]

2014-09-09 21:01:39

by Chris Perl

Subject: Re: nfs-utils - TCP ephemeral port exhaustion results in mount failures

> I have not seen anything....

I didn't send the patches in reply to this thread, so perhaps that
confused matters. Trond already commented on the kernel one.

The two patches are referenced below via gmane.org (not sure if there
is a better way to direct you to them):

[1] http://thread.gmane.org/gmane.linux.nfs/66173 (sunrpc module patch)
[2] http://thread.gmane.org/gmane.linux.nfs/66175 (mount.nfs patch)

2014-09-09 18:19:00

by Steve Dickson

Subject: Re: nfs-utils - TCP ephemeral port exhaustion results in mount failures

On 09/05/2014 04:20 PM, Chris Perl wrote:
> It looks like they may have come through after all, unfortunately I
> already sent them again. Apologies for the spam.
I have not seen anything....

steved.


2014-09-03 13:55:41

by Chuck Lever III

Subject: Re: nfs-utils - TCP ephemeral port exhaustion results in mount failures


On Sep 3, 2014, at 7:00 AM, Jeff Layton <[email protected]> wrote:

> On Tue, 2 Sep 2014 12:51:06 -0400
> Chris Perl <[email protected]> wrote:
>
>> [...]
>>
>> For UDP, I don't think "bind before connect" matters as much. I
>> believe the difference is just in the error you'll get from either
>> bind or connect when all ephemeral ports are used. If you attempt to
>> bind when all local ports are in use, you seem to get EADDRINUSE,
>> whereas if you connect when all local ports are in use, you get
>> EAGAIN.

There is only one place where mount.nfs uses connected UDP, which
is nfs_ca_sockname(). But UDP connected sockets are less of a
hazard because they lack a 120 second TIME_WAIT after they are
closed.

>> It could be I'm missing something totally obvious for why this is. If
>> so, please let me know!

The reason is I didn't realize you could call connect(2) without
calling bind(2) first on STREAM sockets.

> (cc'ing Chuck since he wrote a lot of that code)
>
> I'm not sure either. If there was a reason for that, it's likely lost
> to antiquity. In some cases, we really are expected to use reserved
> ports and I think you do have to bind() in order to get one. In the
> non-reserved case though it's likely we could skip binding altogether.
>
> What would probably be best is to roll up a patch that changes it, and
> propose it on the list.

I'd like to see a prototype, too.

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com


2014-09-05 19:41:04

by Chris Perl

Subject: Re: nfs-utils - TCP ephemeral port exhaustion results in mount failures

I just submitted two patches, one for nfs-utils and one for linux-nfs.

As I said in my previous email, the patch to nfs-utils was enough to
get us farther along, but we failed inside mount(2) with EIO (with a
decidedly more confusing error message).

So, I've also submitted a patch for the rpc code in the kernel that
also avoids bind when asking for a random ephemeral port. I've tested
the combination of these two patches with my system while in the
situation I originally outlined. I can continue to successfully mount
NFS filesystems using both of these patches.

I don't particularly love the kernel patch, as it makes `xs_bind' not
actually bind in all circumstances, which seems confusing. However, I
thought trying to rework things in a larger way would cause more
issues given that I'm not very familiar with this code. If everyone
hates it, I can try something else.

The nfs-utils patch was on top of commit
82ab4b4e80199d606e5c40f373aaf384d3dfc081 (if it makes any
difference), as I couldn't build from newer commits on my CentOS 6.5
based system: its keyutils-libs doesn't have `keyctl_invalidate' and
there was no obvious upgrade available.

Let me know if there is anything else I should do, or if I've done
anything obviously wrong.

On Wed, Sep 3, 2014 at 4:01 PM, Chris Perl <[email protected]> wrote:
> Thanks, I started putting something together, but have to do a little
> more digging.
>
> While making mount.nfs(8) only call connect(2) and not bind(2) gets us
> farther, we then fail in mount(2) with EIO, due to the in-kernel
> RPC client invoking `xs_bind', which calls `kernel_bind', which calls
> `sock->ops->bind' (the same operation bind(2) invokes), so it
> fails with EADDRINUSE.
>
> On Wed, Sep 3, 2014 at 9:55 AM, Chuck Lever <[email protected]> wrote:
>>
>> On Sep 3, 2014, at 7:00 AM, Jeff Layton <[email protected]> wrote:
>>
>>> On Tue, 2 Sep 2014 12:51:06 -0400
>>> Chris Perl <[email protected]> wrote:
>>>
>>>> [...]
>>>>
>>>> For UDP, I don't think "bind before connect" matters as much. I
>>>> believe the difference is just in the error you'll get from either
>>>> bind or connect when all ephemeral ports are used. If you attempt to
>>>> bind when all local ports are in use, you seem to get EADDRINUSE,
>>>> whereas if you connect when all local ports are in use, you get
>>>> EAGAIN.
>>
>> There is only one place where mount.nfs uses connected UDP, which
>> is nfs_ca_sockname(). But UDP connected sockets are less of a
>> hazard because they lack a 120 second TIME_WAIT after they are
>> closed.
>>
>>>> It could be I'm missing something totally obvious for why this is. If
>>>> so, please let me know!
>>
>> The reason is I didn’t realize you could call connect(2) without
>> calling bind(2) first on STREAM sockets.
>>
>>> (cc'ing Chuck since he wrote a lot of that code)
>>>
>>> I'm not sure either. If there was a reason for that, it's likely lost
>>> to antiquity. In some cases, we really are expected to use reserved
>>> ports and I think you do have to bind() in order to get one. In the
>>> non-reserved case though it's likely we could skip binding altogether.
>>>
>>> What would probably be best is to roll up a patch that changes it, and
>>> propose it on the list.
>>
>> I’d like to see a prototype, too.
>>
>> --
>> Chuck Lever
>> chuck[dot]lever[at]oracle[dot]com
>>
>>
>>

2014-09-03 20:01:56

by Chris Perl

Subject: Re: nfs-utils - TCP ephemeral port exhaustion results in mount failures

Thanks, I started putting something together, but have to do a little
more digging.

While making mount.nfs(8) only call connect(2) and not bind(2) gets us
farther, we then fail in mount(2) with EIO, due to the in-kernel
RPC client invoking `xs_bind', which calls `kernel_bind', which calls
`sock->ops->bind' (the same operation bind(2) invokes), so it
fails with EADDRINUSE.
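
For reference, the shape of the kernel-side change I'm experimenting
with is roughly this (paraphrased, not the exact net/sunrpc/xprtsock.c
code; the helper names and the "port == 0 means no fixed source port"
convention are my assumptions about that code):

        static int xs_bind(struct sock_xprt *transport, struct socket *sock)
        {
                unsigned short port = xs_get_srcport(transport);

                /* No specific source port requested: skip the explicit
                 * kernel_bind() and let the later kernel_connect()
                 * autobind, so ephemeral ports can be shared the same
                 * way they are for userspace connect(2). */
                if (port == 0)
                        return 0;

                /* ... otherwise, fall through to the existing loop
                 * that tries kernel_bind() with each candidate
                 * port ... */
        }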

On Wed, Sep 3, 2014 at 9:55 AM, Chuck Lever <[email protected]> wrote:
>
> On Sep 3, 2014, at 7:00 AM, Jeff Layton <[email protected]> wrote:
>
>> On Tue, 2 Sep 2014 12:51:06 -0400
>> Chris Perl <[email protected]> wrote:
>>
>>> [...]
>>>
>>> For UDP, I don't think "bind before connect" matters as much. I
>>> believe the difference is just in the error you'll get from either
>>> bind or connect when all ephemeral ports are used. If you attempt to
>>> bind when all local ports are in use, you seem to get EADDRINUSE,
>>> whereas if you connect when all local ports are in use, you get
>>> EAGAIN.
>
> There is only one place where mount.nfs uses connected UDP, which
> is nfs_ca_sockname(). But UDP connected sockets are less of a
> hazard because they lack a 120 second TIME_WAIT after they are
> closed.
>
>>> It could be I'm missing something totally obvious for why this is. If
>>> so, please let me know!
>
> The reason is I didn’t realize you could call connect(2) without
> calling bind(2) first on STREAM sockets.
>
>> (cc'ing Chuck since he wrote a lot of that code)
>>
>> I'm not sure either. If there was a reason for that, it's likely lost
>> to antiquity. In some cases, we really are expected to use reserved
>> ports and I think you do have to bind() in order to get one. In the
>> non-reserved case though it's likely we could skip binding altogether.
>>
>> What would probably be best is to roll up a patch that changes it, and
>> propose it on the list.
>
> I’d like to see a prototype, too.
>
> --
> Chuck Lever
> chuck[dot]lever[at]oracle[dot]com
>
>
>

2014-09-05 20:04:55

by Chris Perl

Subject: Re: nfs-utils - TCP ephemeral port exhaustion results in mount failures

I tried to send them to the list, but I guess they didn't come through
because my sender was set to [email protected], but I ran `git
send-email' from work, so the smtp sender IP wouldn't be authorized.

I'll figure out a way to send them now.

On Fri, Sep 5, 2014 at 4:03 PM, Trond Myklebust
<[email protected]> wrote:
> On Fri, Sep 5, 2014 at 3:40 PM, Chris Perl <[email protected]> wrote:
>> I just submitted two patches, one for nfs-utils and one for linux-nfs.
>>
>> As I said in my previous email, the patch to nfs-utils was enough to
>> get us farther along, but we failed inside mount(2) with EIO (with a
>> decidedly more confusing error message).
>>
>> So, I've also submitted a patch for the rpc code in the kernel that
>> also avoids bind when asking for a random ephemeral port. I've tested
>> the combination of these two patches with my system while in the
>> situation I originally outlined. I can continue to successfully mount
>> NFS filesystems using both of these patches.
>>
>> I don't particularly love the kernel patch, as it makes `xs_bind' not
>> actually bind in all circumstances, which seems confusing. However, I
>> thought trying to rework things in a larger way would cause more
>> issues given that I'm not very familiar with this code. If everyone
>> hates it, I can try something else.
>
> To whom did you submit these patches? I don't see anything in the
> linux-nfs mailing list.
>
> --
> Trond Myklebust
>
> Linux NFS client maintainer, PrimaryData
>
> [email protected]