2007-12-04 18:54:34

by Simon Arlott

[permalink] [raw]
Subject: sockets affected by IPsec always block (2.6.23)

If I have a IPsec rule like:
spdadd 192.168.7.8 1.2.3.4 any -P out ipsec esp/transport//require;
(i.e. a remote host 1.2.3.4 which will not respond)

Then any attempt to communicate with 1.2.3.4 will block, even when using non-blocking sockets:

socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 3
setsockopt(3, SOL_SOCKET, SO_LINGER, {onoff=1, linger=0}, 8) = 0
setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
fcntl64(3, F_GETFL) = 0x2 (flags O_RDWR)
fcntl64(3, F_SETFL, O_RDWR|O_NONBLOCK) = 0 <-- non-blocking socket
connect(3, {sa_family=AF_INET, sin_port=htons(22), sin_addr=inet_addr("1.2.3.4")}, 16 <-- blocked connect()

[277657.564773] netcat S b06bcf20 0 9450 9449
[277657.564785] c8d51d28 00200046 00200046 b06bcf20 00200286 c8d51d14 00200286 c8d51d84
[277657.564814] b06bcf20 c8d51d28 b013680b c8d51d84 eeae8800 c8d51d78 c8d51dd0 b04d3fc5
[277657.564843] c8d51da4 00000002 00000001 ede87284 00000002 00000040 e9318ac0 db3f20a0
[277657.564874] Call Trace:
[277657.564881] [<b04d3fc5>] __xfrm_lookup+0x2f5/0x510
[277657.564905] [<b0494f9e>] ip_route_output_flow+0x4e/0x80
[277657.564919] [<b04af303>] tcp_v4_connect+0x183/0x6d0
[277657.564934] [<b04beaf2>] inet_stream_connect+0x122/0x1c0
[277657.564949] [<b0471c8e>] sys_connect+0x9e/0xd0
[277657.564963] [<b0472785>] sys_socketcall+0xa5/0x230
[277657.564973] [<b01042ba>] syscall_call+0x7/0xb
[277657.564984] =======================

I had a process using non-blocking sockets stuck in connect() for over 8 hours because of this...

00002630 <__xfrm_lookup>:
...
290b: b8 00 00 00 00 mov $0x0,%eax
290c: R_386_32 km_waitq
2910: e8 fc ff ff ff call 2911 <__xfrm_lookup+0x2e1>
2911: R_386_PC32 add_wait_queue
2915: a1 00 00 00 00 mov 0x0,%eax
2916: R_386_32 per_cpu__current_task
291a: c7 00 01 00 00 00 movl $0x1,(%eax)
2920: e8 fc ff ff ff call 2921 <__xfrm_lookup+0x2f1>
2921: R_386_PC32 schedule
2925: a1 00 00 00 00 mov 0x0,%eax
2926: R_386_32 per_cpu__current_task
292a: c7 00 00 00 00 00 movl $0x0,(%eax)
2930: b8 00 00 00 00 mov $0x0,%eax
2931: R_386_32 km_waitq
2935: 89 da mov %ebx,%edx
2937: e8 fc ff ff ff call 2938 <__xfrm_lookup+0x308>
2938: R_386_PC32 remove_wait_queue

--
Simon Arlott


2007-12-05 00:12:55

by Herbert Xu

[permalink] [raw]
Subject: Re: sockets affected by IPsec always block (2.6.23)

On Tue, Dec 04, 2007 at 06:53:19PM +0000, Simon Arlott wrote:
> If I have a IPsec rule like:
> spdadd 192.168.7.8 1.2.3.4 any -P out ipsec esp/transport//require;
> (i.e. a remote host 1.2.3.4 which will not respond)
>
> Then any attempt to communicate with 1.2.3.4 will block, even when using non-blocking sockets:

This patch should help.

[INET]: Export non-blocking flags to proto connect call

Previously we made connect(2) block on IPsec SA resolution. This is
good in general but not desirable for non-blocking sockets.

To fix this properly we'd need to implement the larval IPsec dst stuff
that we talked about. For now let's just revert to the old behaviour
on non-blocking sockets.

Signed-off-by: Herbert Xu <[email protected]>

Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--
diff --git a/include/net/ip.h b/include/net/ip.h
index 83fb9f1..9b4ed7e 100644
--- a/include/net/ip.h
+++ b/include/net/ip.h
@@ -121,7 +121,8 @@ extern void ip_flush_pending_frames(struct sock *sk);

/* datagram.c */
extern int ip4_datagram_connect(struct sock *sk,
- struct sockaddr *uaddr, int addr_len);
+ struct sockaddr *uaddr,
+ int addr_len, int flags);

/*
* Map a multicast IP onto multicast MAC for type Token Ring.
diff --git a/include/net/ipv6.h b/include/net/ipv6.h
index e90f962..2686850 100644
--- a/include/net/ipv6.h
+++ b/include/net/ipv6.h
@@ -567,7 +567,8 @@ extern void ipv6_packet_init(void);
extern void ipv6_packet_cleanup(void);

extern int ip6_datagram_connect(struct sock *sk,
- struct sockaddr *addr, int addr_len);
+ struct sockaddr *addr,
+ int addr_len, int flags);

extern int ipv6_recv_error(struct sock *sk, struct msghdr *msg, int len);
extern void ipv6_icmp_error(struct sock *sk, struct sk_buff *skb, int err, __be16 port,
diff --git a/include/net/sock.h b/include/net/sock.h
index 43e3cd9..d70b110 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -522,8 +522,8 @@ struct proto {
void (*close)(struct sock *sk,
long timeout);
int (*connect)(struct sock *sk,
- struct sockaddr *uaddr,
- int addr_len);
+ struct sockaddr *uaddr,
+ int addr_len, int flags);
int (*disconnect)(struct sock *sk, int flags);

struct sock * (*accept) (struct sock *sk, int flags, int *err);
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 9dbed0b..d93eef6 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -421,7 +421,7 @@ extern int tcp_v4_do_rcv(struct sock *sk,

extern int tcp_v4_connect(struct sock *sk,
struct sockaddr *uaddr,
- int addr_len);
+ int addr_len, int flags);

extern int tcp_connect(struct sock *sk);

diff --git a/net/dccp/dccp.h b/net/dccp/dccp.h
index ee97950..4062068 100644
--- a/net/dccp/dccp.h
+++ b/net/dccp/dccp.h
@@ -292,7 +292,7 @@ extern int inet_dccp_listen(struct socket *sock, int backlog);
extern unsigned int dccp_poll(struct file *file, struct socket *sock,
poll_table *wait);
extern int dccp_v4_connect(struct sock *sk, struct sockaddr *uaddr,
- int addr_len);
+ int addr_len, int flags);

extern struct sk_buff *dccp_ctl_make_reset(struct socket *ctl,
struct sk_buff *skb);
diff --git a/net/dccp/ipv4.c b/net/dccp/ipv4.c
index db17b83..e2aba0f 100644
--- a/net/dccp/ipv4.c
+++ b/net/dccp/ipv4.c
@@ -44,7 +44,8 @@ static int dccp_v4_get_port(struct sock *sk, const unsigned short snum)
inet_csk_bind_conflict);
}

-int dccp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
+int dccp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len,
+ int flags)
{
struct inet_sock *inet = inet_sk(sk);
struct dccp_sock *dp = dccp_sk(sk);
@@ -72,7 +73,8 @@ int dccp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
tmp = ip_route_connect(&rt, nexthop, inet->saddr,
RT_CONN_FLAGS(sk), sk->sk_bound_dev_if,
IPPROTO_DCCP,
- inet->sport, usin->sin_port, sk, 1);
+ inet->sport, usin->sin_port, sk,
+ !(flags & O_NONBLOCK));
if (tmp < 0)
return tmp;

diff --git a/net/dccp/ipv6.c b/net/dccp/ipv6.c
index 87c98fb..7da5269 100644
--- a/net/dccp/ipv6.c
+++ b/net/dccp/ipv6.c
@@ -866,7 +866,7 @@ discard_and_relse:
}

static int dccp_v6_connect(struct sock *sk, struct sockaddr *uaddr,
- int addr_len)
+ int addr_len, int flags)
{
struct sockaddr_in6 *usin = (struct sockaddr_in6 *)uaddr;
struct inet_connection_sock *icsk = inet_csk(sk);
@@ -952,7 +952,8 @@ static int dccp_v6_connect(struct sock *sk, struct sockaddr *uaddr,
icsk->icsk_af_ops = &dccp_ipv6_mapped;
sk->sk_backlog_rcv = dccp_v4_do_rcv;

- err = dccp_v4_connect(sk, (struct sockaddr *)&sin, sizeof(sin));
+ err = dccp_v4_connect(sk, (struct sockaddr *)&sin, sizeof(sin),
+ flags);
if (err) {
icsk->icsk_ext_hdr_len = exthdrlen;
icsk->icsk_af_ops = &dccp_ipv6_af_ops;
@@ -994,7 +995,7 @@ static int dccp_v6_connect(struct sock *sk, struct sockaddr *uaddr,
if (final_p)
ipv6_addr_copy(&fl.fl6_dst, final_p);

- err = __xfrm_lookup(&dst, &fl, sk, 1);
+ err = __xfrm_lookup(&dst, &fl, sk, !(flags & O_NONBLOCK));
if (err < 0) {
if (err == -EREMOTE)
err = ip6_dst_blackhole(sk, &dst, &fl);
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index c75f20b..5fc5de0 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -513,7 +513,8 @@ int inet_dgram_connect(struct socket *sock, struct sockaddr * uaddr,

if (!inet_sk(sk)->num && inet_autobind(sk))
return -EAGAIN;
- return sk->sk_prot->connect(sk, (struct sockaddr *)uaddr, addr_len);
+ return sk->sk_prot->connect(sk, (struct sockaddr *)uaddr, addr_len,
+ flags);
}

static long inet_wait_for_connect(struct sock *sk, long timeo)
@@ -574,7 +575,7 @@ int inet_stream_connect(struct socket *sock, struct sockaddr *uaddr,
if (sk->sk_state != TCP_CLOSE)
goto out;

- err = sk->sk_prot->connect(sk, uaddr, addr_len);
+ err = sk->sk_prot->connect(sk, uaddr, addr_len, flags);
if (err < 0)
goto out;

diff --git a/net/ipv4/datagram.c b/net/ipv4/datagram.c
index 0301dd4..19ab5b1 100644
--- a/net/ipv4/datagram.c
+++ b/net/ipv4/datagram.c
@@ -20,7 +20,8 @@
#include <net/route.h>
#include <net/tcp_states.h>

-int ip4_datagram_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
+int ip4_datagram_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len,
+ int flags)
{
struct inet_sock *inet = inet_sk(sk);
struct sockaddr_in *usin = (struct sockaddr_in *) uaddr;
@@ -49,7 +50,8 @@ int ip4_datagram_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
err = ip_route_connect(&rt, usin->sin_addr.s_addr, saddr,
RT_CONN_FLAGS(sk), oif,
sk->sk_protocol,
- inet->sport, usin->sin_port, sk, 1);
+ inet->sport, usin->sin_port, sk,
+ !(flags & O_NONBLOCK));
if (err) {
if (err == -ENETUNREACH)
IP_INC_STATS_BH(IPSTATS_MIB_OUTNOROUTES);
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 5a27e42..0a2ee0b 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -166,7 +166,8 @@ int tcp_twsk_unique(struct sock *sk, struct sock *sktw, void *twp)
EXPORT_SYMBOL_GPL(tcp_twsk_unique);

/* This will initiate an outgoing connection. */
-int tcp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
+int tcp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len,
+ int flags)
{
struct inet_sock *inet = inet_sk(sk);
struct tcp_sock *tp = tcp_sk(sk);
@@ -192,7 +193,8 @@ int tcp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
tmp = ip_route_connect(&rt, nexthop, inet->saddr,
RT_CONN_FLAGS(sk), sk->sk_bound_dev_if,
IPPROTO_TCP,
- inet->sport, usin->sin_port, sk, 1);
+ inet->sport, usin->sin_port, sk,
+ !(flags & O_NONBLOCK));
if (tmp < 0) {
if (tmp == -ENETUNREACH)
IP_INC_STATS_BH(IPSTATS_MIB_OUTNOROUTES);
diff --git a/net/ipv6/datagram.c b/net/ipv6/datagram.c
index 2ed689a..591634a 100644
--- a/net/ipv6/datagram.c
+++ b/net/ipv6/datagram.c
@@ -34,7 +34,8 @@
#include <linux/errqueue.h>
#include <asm/uaccess.h>

-int ip6_datagram_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
+int ip6_datagram_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len,
+ int flags)
{
struct sockaddr_in6 *usin = (struct sockaddr_in6 *) uaddr;
struct inet_sock *inet = inet_sk(sk);
@@ -49,7 +50,7 @@ int ip6_datagram_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
if (usin->sin6_family == AF_INET) {
if (__ipv6_only_sock(sk))
return -EAFNOSUPPORT;
- err = ip4_datagram_connect(sk, uaddr, addr_len);
+ err = ip4_datagram_connect(sk, uaddr, addr_len, flags);
goto ipv4_connected;
}

@@ -94,7 +95,7 @@ int ip6_datagram_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)

err = ip4_datagram_connect(sk,
(struct sockaddr*) &sin,
- sizeof(sin));
+ sizeof(sin), flags);

ipv4_connected:
if (err)
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index f1294dc..9cb9068 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -122,7 +122,7 @@ static __u32 tcp_v6_init_sequence(struct sk_buff *skb)
}

static int tcp_v6_connect(struct sock *sk, struct sockaddr *uaddr,
- int addr_len)
+ int addr_len, int flags)
{
struct sockaddr_in6 *usin = (struct sockaddr_in6 *) uaddr;
struct inet_sock *inet = inet_sk(sk);
@@ -219,7 +219,8 @@ static int tcp_v6_connect(struct sock *sk, struct sockaddr *uaddr,
tp->af_specific = &tcp_sock_ipv6_mapped_specific;
#endif

- err = tcp_v4_connect(sk, (struct sockaddr *)&sin, sizeof(sin));
+ err = tcp_v4_connect(sk, (struct sockaddr *)&sin, sizeof(sin),
+ flags);

if (err) {
icsk->icsk_ext_hdr_len = exthdrlen;
@@ -265,7 +266,7 @@ static int tcp_v6_connect(struct sock *sk, struct sockaddr *uaddr,
if (final_p)
ipv6_addr_copy(&fl.fl6_dst, final_p);

- if ((err = __xfrm_lookup(&dst, &fl, sk, 1)) < 0) {
+ if ((err = __xfrm_lookup(&dst, &fl, sk, !(flags & O_NONBLOCK))) < 0) {
if (err == -EREMOTE)
err = ip6_dst_blackhole(sk, &dst, &fl);
if (err < 0)
diff --git a/net/sctp/socket.c b/net/sctp/socket.c
index 248b9a5..1ca8913 100644
--- a/net/sctp/socket.c
+++ b/net/sctp/socket.c
@@ -960,7 +960,8 @@ out:
*/
static int __sctp_connect(struct sock* sk,
struct sockaddr *kaddrs,
- int addrs_size)
+ int addrs_size,
+ int f_flags)
{
struct sctp_sock *sp;
struct sctp_endpoint *ep;
@@ -977,7 +978,6 @@ static int __sctp_connect(struct sock* sk,
union sctp_addr *sa_addr = NULL;
void *addr_buf;
unsigned short port;
- unsigned int f_flags = 0;

sp = sctp_sk(sk);
ep = sp->ep;
@@ -1106,12 +1106,6 @@ static int __sctp_connect(struct sock* sk,
af->to_sk_daddr(sa_addr, sk);
sk->sk_err = 0;

- /* in-kernel sockets don't generally have a file allocated to them
- * if all they do is call sock_create_kern().
- */
- if (sk->sk_socket->file)
- f_flags = sk->sk_socket->file->f_flags;
-
timeo = sock_sndtimeo(sk, f_flags & O_NONBLOCK);

err = sctp_wait_for_connect(asoc, &timeo);
@@ -1194,6 +1188,7 @@ SCTP_STATIC int sctp_setsockopt_connectx(struct sock* sk,
{
int err = 0;
struct sockaddr *kaddrs;
+ int f_flags;

SCTP_DEBUG_PRINTK("%s - sk %p addrs %p addrs_size %d\n",
__FUNCTION__, sk, addrs, addrs_size);
@@ -1210,10 +1205,15 @@ SCTP_STATIC int sctp_setsockopt_connectx(struct sock* sk,
if (unlikely(!kaddrs))
return -ENOMEM;

+ /* in-kernel sockets don't generally have a file allocated to them
+ * if all they do is call sock_create_kern().
+ */
+ f_flags = sk->sk_socket->file ? sk->sk_socket->file->f_flags : 0;
+
if (__copy_from_user(kaddrs, addrs, addrs_size)) {
err = -EFAULT;
} else {
- err = __sctp_connect(sk, kaddrs, addrs_size);
+ err = __sctp_connect(sk, kaddrs, addrs_size, f_flags);
}

kfree(kaddrs);
@@ -3270,7 +3270,7 @@ out_nounlock:
* len: the size of the address.
*/
SCTP_STATIC int sctp_connect(struct sock *sk, struct sockaddr *addr,
- int addr_len)
+ int addr_len, int flags)
{
int err = 0;
struct sctp_af *af;
@@ -3288,7 +3288,7 @@ SCTP_STATIC int sctp_connect(struct sock *sk, struct sockaddr *addr,
/* Pass correct addr len to common routine (so it knows there
* is only one address being passed.
*/
- err = __sctp_connect(sk, addr, af->sockaddr_len);
+ err = __sctp_connect(sk, addr, af->sockaddr_len, flags);
}

sctp_release_sock(sk);

2007-12-05 06:06:46

by David Miller

[permalink] [raw]
Subject: Re: sockets affected by IPsec always block (2.6.23)

From: Simon Arlott <[email protected]>
Date: Tue, 04 Dec 2007 18:53:19 +0000

> If I have a IPsec rule like:
> spdadd 192.168.7.8 1.2.3.4 any -P out ipsec esp/transport//require;
> (i.e. a remote host 1.2.3.4 which will not respond)
>
> Then any attempt to communicate with 1.2.3.4 will block, even when using non-blocking sockets:

If you don't like this behavior:

echo "1" >/proc/sys/net/core/xfrm_larval_drop

but those initial connection setup packets will be dropped while
waiting for the IPSEC route to be resolved, and in your 8 hour case
the TCP connect will fail.

Anyways, the choice for different behavior is there, select it
to suit your tastes.

2007-12-05 06:30:35

by David Miller

[permalink] [raw]
Subject: Re: sockets affected by IPsec always block (2.6.23)

From: Herbert Xu <[email protected]>
Date: Wed, 5 Dec 2007 11:12:32 +1100

> [INET]: Export non-blocking flags to proto connect call
>
> Previously we made connect(2) block on IPsec SA resolution. This is
> good in general but not desirable for non-blocking sockets.
>
> To fix this properly we'd need to implement the larval IPsec dst stuff
> that we talked about. For now let's just revert to the old behaviour
> on non-blocking sockets.
>
> Signed-off-by: Herbert Xu <[email protected]>

We made an explicit decision not to do things this way.

Non-blocking has a meaning dependant upon the xfrm_larval_drop sysctl
setting, and this is across the board. If xfrm_larval_drop is zero,
non-blocking semantics do not extend to IPSEC route resolution,
otherwise it does.

If he sets this sysctl to "1" as I detailed in my reply, he'll
get the behavior he wants.

2007-12-05 06:51:49

by Herbert Xu

[permalink] [raw]
Subject: Re: sockets affected by IPsec always block (2.6.23)

On Tue, Dec 04, 2007 at 10:30:23PM -0800, David Miller wrote:
>
> We made an explicit decision not to do things this way.

Thanks for pointing this out.

> Non-blocking has a meaning dependant upon the xfrm_larval_drop sysctl
> setting, and this is across the board. If xfrm_larval_drop is zero,
> non-blocking semantics do not extend to IPSEC route resolution,
> otherwise it does.
>
> If he sets this sysctl to "1" as I detailed in my reply, he'll
> get the behavior he wants.

Does anybody actually need the 0 setting? What would we break if
the default became 1?

Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

2007-12-05 07:12:31

by David Miller

[permalink] [raw]
Subject: Re: sockets affected by IPsec always block (2.6.23)

From: Herbert Xu <[email protected]>
Date: Wed, 5 Dec 2007 17:51:32 +1100

> Does anybody actually need the 0 setting? What would we break if
> the default became 1?

I bet there are UDP apps out there that would break if we
didn't do this.

Actually, consider even a case like DNS. Let's say the timeout
is set to 2 seconds or something and you have 3 DNS servers
listed, on different IPSEC destinations, in your resolv.conf

Each IPSEC route that isn't currently resolved will cause packet loss
of the DNS lookup request with xfrm_larval_drop set to '1'.

If all 3 need to be resolved, the DNS lookup will fully fail
which defeats the purpose of listing 3 servers for redundancy
don't you think? :-)

As much as I even personally prefer the xfrm_larval_drop=1
behavior, it cases like above that keep me from jumping at
making it the default.

Arguably, potentially blocking forever (which is what can easily
happen with xfrm_larval_drop=0 if your IPSEC daemon cannot resolve the
IPSEC path for whatever reason) is worse than the above, but the
other cases are still something to consider as well.

2007-12-05 07:16:24

by Herbert Xu

[permalink] [raw]
Subject: Re: sockets affected by IPsec always block (2.6.23)

On Tue, Dec 04, 2007 at 11:12:00PM -0800, David Miller wrote:
> From: Herbert Xu <[email protected]>
> Date: Wed, 5 Dec 2007 17:51:32 +1100
>
> > Does anybody actually need the 0 setting? What would we break if
> > the default became 1?
>
> I bet there are UDP apps out there that would break if we
> didn't do this.

Right. This is definitely bad for protocols without a retransmission
mechanism.

However, is the 0 setting ever useful for TCP and in particular, TCP's
connect(2) call? Perhaps we can just make that one always drop.

Well, until someone implements queueing to fix all of this properly
that is :)

Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

2007-12-05 07:34:44

by David Miller

[permalink] [raw]
Subject: Re: sockets affected by IPsec always block (2.6.23)

From: Herbert Xu <[email protected]>
Date: Wed, 5 Dec 2007 18:16:07 +1100

> Right. This is definitely bad for protocols without a retransmission
> mechanism.
>
> However, is the 0 setting ever useful for TCP and in particular, TCP's
> connect(2) call? Perhaps we can just make that one always drop.

TCP has some built-in assumptions about characteristics of
interent links and what constitutes a timeout which is "too long"
and should thus result in a full connection failure.

IPSEC changes this because of IPSEC route resolution via
ISAKMP.

With this in mind I can definitely see people preferring
the "block until IPSEC resolves" behavior, especially for
something like, say, periodic remote backups and stuff like
that where you really want the thing to just sit and wait
for the connect() to succeed instead of failing.

2007-12-05 07:39:44

by Herbert Xu

[permalink] [raw]
Subject: Re: sockets affected by IPsec always block (2.6.23)

On Tue, Dec 04, 2007 at 11:34:32PM -0800, David Miller wrote:
>
> TCP has some built-in assumptions about characteristics of
> interent links and what constitutes a timeout which is "too long"
> and should thus result in a full connection failure.
>
> IPSEC changes this because of IPSEC route resolution via
> ISAKMP.
>
> With this in mind I can definitely see people preferring
> the "block until IPSEC resolves" behavior, especially for
> something like, say, periodic remote backups and stuff like
> that where you really want the thing to just sit and wait
> for the connect() to succeed instead of failing.

Hmm, but connect(2) should succeed in that case thanks to the
blackhole route, no? The subsequent SYNs will then be dropped
until the IPsec SAs are in place.

Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

2007-12-05 09:56:16

by David Miller

[permalink] [raw]
Subject: Re: sockets affected by IPsec always block (2.6.23)

From: Herbert Xu <[email protected]>
Date: Wed, 5 Dec 2007 18:39:27 +1100

> On Tue, Dec 04, 2007 at 11:34:32PM -0800, David Miller wrote:
> >
> > TCP has some built-in assumptions about characteristics of
> > interent links and what constitutes a timeout which is "too long"
> > and should thus result in a full connection failure.
> >
> > IPSEC changes this because of IPSEC route resolution via
> > ISAKMP.
> >
> > With this in mind I can definitely see people preferring
> > the "block until IPSEC resolves" behavior, especially for
> > something like, say, periodic remote backups and stuff like
> > that where you really want the thing to just sit and wait
> > for the connect() to succeed instead of failing.
>
> Hmm, but connect(2) should succeed in that case thanks to the
> blackhole route, no? The subsequent SYNs will then be dropped
> until the IPsec SAs are in place.

If it hits sysctl_tcp_syn_retries SYN attempts, the connect will hard
fail.

2007-12-05 09:58:19

by Herbert Xu

[permalink] [raw]
Subject: Re: sockets affected by IPsec always block (2.6.23)

On Wed, Dec 05, 2007 at 01:55:58AM -0800, David Miller wrote:
>
> If it hits sysctl_tcp_syn_retries SYN attempts, the connect will hard
> fail.

Right. Let's just forget about this until we have a queueing system :)

Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

2007-12-05 18:41:20

by Stefan Rompf

[permalink] [raw]
Subject: Re: sockets affected by IPsec always block (2.6.23)

Am Mittwoch, 5. Dezember 2007 08:12 schrieb David Miller:

> Actually, consider even a case like DNS. Let's say the timeout
> is set to 2 seconds or something and you have 3 DNS servers
> listed, on different IPSEC destinations, in your resolv.conf
>
> Each IPSEC route that isn't currently resolved will cause packet loss
> of the DNS lookup request with xfrm_larval_drop set to '1'.
>
> If all 3 need to be resolved, the DNS lookup will fully fail
> which defeats the purpose of listing 3 servers for redundancy
> don't you think? :-)

In your example, the DNS server might actually stop responding to other
clients while waiting for the (expected to be non-blocking) connect() to
return. This is much much worse.

Stefan

2007-12-05 18:42:42

by Stefan Rompf

[permalink] [raw]
Subject: Re: sockets affected by IPsec always block (2.6.23)

Am Mittwoch, 5. Dezember 2007 07:51 schrieb Herbert Xu:

> > If he sets this sysctl to "1" as I detailed in my reply, he'll
> > get the behavior he wants.
>
> Does anybody actually need the 0 setting? What would we break if
> the default became 1?

I'd strongly suggest doing so. AFAIK, behaviour of connect() on nonblocking
sockets is quite well defined in POSIX. If this is changed for some IP
sockets, event-driven applications will randomly and subtly break.

Stefan

2007-12-06 02:25:21

by David Miller

[permalink] [raw]
Subject: Re: sockets affected by IPsec always block (2.6.23)

From: Stefan Rompf <[email protected]>
Date: Wed, 5 Dec 2007 19:39:07 +0100

> I'd strongly suggest doing so. AFAIK, behaviour of connect() on nonblocking
> sockets is quite well defined in POSIX.

You are entitled to your opinion.

POSIX says nothing about the semantics of route resolution.
Non-blocking doesn't mean "cannot sleep no matter what".

> If this is changed for some IP sockets, event-driven applications
> will randomly and subtly break.

If this was such a clear cut case we'd have changed things
a long time ago, but it isn't so don't pretend this is the
case.

2007-12-06 08:49:13

by Stefan Rompf

[permalink] [raw]
Subject: Re: sockets affected by IPsec always block (2.6.23)

Am Donnerstag, 6. Dezember 2007 03:25 schrieb David Miller:

> POSIX says nothing about the semantics of route resolution.

Of course not. Applications must not care about what happens at the transport
layer.

> Non-blocking doesn't mean "cannot sleep no matter what".

... and as O_CREAT on open() isn't specifically documented to apply to
filenames starting with 'a', it is perfectly normal that "echo x >ash" always
fails since 2.6.22. To revert to the old behaviour, please do "echo 1
>/proc/sys/fs/allow_a_file_creation".

Ok, irony aside. Just have a look at
http://www.opengroup.org/onlinepubs/009695399/functions/connect.html (I hope
009695399 is not a personalition cookie ;-)

"If the connection cannot be established immediately and O_NONBLOCK is set for
the file descriptor for the socket, connect() shall fail and set errno to
[EINPROGRESS], but the connection request shall not be aborted, and the
connection shall be established asynchronously."

I think the words "shall fail" and "immediately" are quite clear.

> > If this is changed for some IP sockets, event-driven applications
> > will randomly and subtly break.
>
> If this was such a clear cut case we'd have changed things
> a long time ago, but it isn't so don't pretend this is the
> case.

Well, the only reason this doesn't break on a daily basis is because the code
isn't in the kernel that long and not many people run applications on an
IPSEC gateway. This will change if kernel based IPSEC is used for roadwarrior
connections or dnssec based anonymous IPSEC someday. Trust me, you will
revert this misbehaviour in -stable then.

For some real life applications that break when nonblocking connect() blocks,
please look f.e. at squid or mozilla firefox.

Stefan

2007-12-06 08:53:56

by David Miller

[permalink] [raw]
Subject: Re: sockets affected by IPsec always block (2.6.23)

From: Stefan Rompf <[email protected]>
Date: Thu, 6 Dec 2007 09:49:01 +0100

> "If the connection cannot be established immediately and O_NONBLOCK is set for
> the file descriptor for the socket, connect() shall fail and set errno to
> [EINPROGRESS], but the connection request shall not be aborted, and the
> connection shall be established asynchronously."
>
> I think the words "shall fail" and "immediately" are quite clear.

They are, but the context in which they apply is vague.

I can equally generate examples where the non-blocking behavior you
are a proponent of would break non-blocking UDP apps during a
sendmsg() call when we hit IPSEC resolution. Yet similar language on
blocking semantics exists for sendmsg() in the standards.

The world is shades of gray, implying anything else is foolhardy and
that's how I'm handling this.

> Well, the only reason this doesn't break on a daily basis is because the code
> isn't in the kernel that long and not many people run applications on an
> IPSEC gateway. This will change if kernel based IPSEC is used for roadwarrior
> connections or dnssec based anonymous IPSEC someday. Trust me, you will
> revert this misbehaviour in -stable then.

I use IPSEC every single day in this fashion, and I haven't.

2007-12-06 10:58:58

by Stefan Rompf

[permalink] [raw]
Subject: Re: sockets affected by IPsec always block (2.6.23)

Am Donnerstag, 6. Dezember 2007 09:53 schrieb David Miller:

> > I think the words "shall fail" and "immediately" are quite clear.
>
> They are, but the context in which they apply is vague.

"socket is connection-mode" => SOCK_STREAM

> I can equally generate examples where the non-blocking behavior you
> are a proponent of would break non-blocking UDP apps during a
> sendmsg() call when we hit IPSEC resolution. Yet similar language on
> blocking semantics exists for sendmsg() in the standards.

I am not a good enough kernel hacker to exactly understand the code flow in
udp_sendmsg(). However, it seems that it first checks destination validity
via ip_route_output_flow() and queues the message then. The sendmsg()
documentation only talks about buffer space. I can see your dilemma.

The reason why I'm pushing this issue another time is that I know quite a
bit about system level application development. A very typical design pattern
for non-naive single or multi threaded programs is that they set all
communication sockets to be nonblocking and use a select()/epoll() based loop
to dispatch IO. This often includes initiating a TCP connect() and
asynchronously waiting for it to finish or fail from the main loop.

The dangerous situation here is that in 99% of all cases things will just work
because the phase 2 SA exists. In 0.8%, the SA will be established in <1 sec.
However, in the rest of time the server application that you have considered
to be stable will end up sleeping with all threads in a connect() call that
is supposed to return immediatly.

> The world is shades of gray, implying anything else is foolhardy and
> that's how I'm handling this.

Even though I consider programmers that ignore the result code on a
nonblocking UDP sendmsg() fools, I agree. May be the best compromise is what
Herbert Xu suggested in <[email protected]> in this
thread: At least, for connect() O_NONBLOCK ist ALWAYS respected. Because this
is where the chance for breakage is highest.

Stefan

2007-12-06 11:13:21

by David Miller

[permalink] [raw]
Subject: Re: sockets affected by IPsec always block (2.6.23)

From: Stefan Rompf <[email protected]>
Date: Thu, 6 Dec 2007 11:56:48 +0100

> Am Donnerstag, 6. Dezember 2007 09:53 schrieb David Miller:
>
> > > I think the words "shall fail" and "immediately" are quite clear.
> >
> > They are, but the context in which they apply is vague.
>
> "socket is connection-mode" => SOCK_STREAM

I meant whether "immediately" mean in reference to socket
state or includes auxiliary things like route lookups.

When you do a non-blocking write on a socket, things like
memory allocations can block, potentially for a long time.
It is an example where there are definite boundaries to where
the non-blocking'ness applies.

And therefore it is not so cut and dry and you present this
issue.

> The reason why I'm pushing this issue another time is that I know quite a
> bit about system level application development. A very typical design pattern
> for non-naive single or multi threaded programs is that they set all
> communication sockets to be nonblocking and use a select()/epoll() based loop
> to dispatch IO. This often includes initiating a TCP connect() and
> asynchronously waiting for it to finish or fail from the main loop.
>
> The dangerous situation here is that in 99% of all cases things will just work
> because the phase 2 SA exists. In 0.8%, the SA will be established in <1 sec.
> However, in the rest of time the server application that you have considered
> to be stable will end up sleeping with all threads in a connect() call that
> is supposed to return immediatly.

And that connect() call can hang for a long time due to any memory
allocation done in the connect() path.

You are not avoiding blocking by setting O_NONBLOCK on the socket, it
is quite foolhardy to think that it does so unilaterally.

And that's why this is a grey area. Why is waiting for memory
allocation on a O_NONBLOCK socket OK but waiting for IPSEC route
resolution is not?

2007-12-06 11:34:16

by Stefan Rompf

[permalink] [raw]
Subject: Re: sockets affected by IPsec always block (2.6.23)

Am Donnerstag, 6. Dezember 2007 12:13 schrieb David Miller:

> And that's why this is a grey area. Why is waiting for memory
> allocation on a O_NONBLOCK socket OK but waiting for IPSEC route
> resolution is not?

Because you just will put enough RAM modules into you server when setting up a
scalable system. Local resource, managable by the admin. What you cannot
control in many cases is the network connection to the remote node. Simon
Arlott has been talking about an 8 hour network outage.

Stefan

2007-12-06 11:39:23

by David Miller

[permalink] [raw]
Subject: Re: sockets affected by IPsec always block (2.6.23)

From: Stefan Rompf <[email protected]>
Date: Thu, 6 Dec 2007 12:35:05 +0100

> Because you just will put enough RAM modules into you server when
> setting up a scalable system.

This suggestion is avoiding the important semantic issue, and
won't lead to a real discussion of the core problem.

2007-12-06 12:30:13

by Stefan Rompf

[permalink] [raw]
Subject: Re: sockets affected by IPsec always block (2.6.23)

Am Donnerstag, 6. Dezember 2007 12:39 schrieb David Miller:

> > Because you just will put enough RAM modules into you server when
> > setting up a scalable system.
>
> This suggestion is avoiding the important semantic issue, and
> won't lead to a real discussion of the core problem.

When writing applications for unix operating systems, it is known since ages
that stuff can be swapped out and that even things like memory accesses can
block. So it does not really surprise when a system call has to wait for
memory - just imagine the kernel code for connect() could be and has been
swapped out.

Even with moderate swap activity, this memory should be available in much less
than one second. If on the other hand the system is already threshing, it is
no difference if it does so within connect() or while reaching the connect()
system call in the application flow.

Btw, this is where admin responsibility to size their systems kicks in.

So where I would draw the line: connect() is clearly a network related
function. Therefore, if a nonblocking connect() has to sleep for a local,
controllable resource like memory to become available, this is ok. Maybe it
shouldn't wait for a 128MB buffer if someone configured such an abonimation,
haven't thought deeply about that. But when being told not to wait the
connection to complete, it should never ever wait for another network related
activity like IPSEC SA setup to complete, especially not for hours.

IMHO this is what developers expect, and is also consistent with the fact that
POSIX does not define O_NONBLOCK behaviour for local files.

Stefan

2007-12-06 13:55:28

by David Miller

[permalink] [raw]
Subject: Re: sockets affected by IPsec always block (2.6.23)

From: Stefan Rompf <[email protected]>
Date: Thu, 6 Dec 2007 13:30:20 +0100

> IMHO this is what developers expect, and is also consistent with the
> fact that POSIX does not define O_NONBLOCK behaviour for local
> files.

You keep ignoring the fact that, as Herbert and I discussed, not
blocking for IPSEC resolution will make some connect() cases fail that
would otherwise not fail.

There are two sides to this issue, and we need to consider them
both.

Long term a resolution-packet-queue provides a solution that handles
both angles correctly, but we don't have that code yet.

2007-12-06 14:30:31

by Stefan Rompf

[permalink] [raw]
Subject: Re: sockets affected by IPsec always block (2.6.23)

Am Donnerstag, 6. Dezember 2007 14:55 schrieb David Miller:

> You keep ignoring the fact that, as Herbert and I discussed, not
> blocking for IPSEC resolution will make some connect() cases fail that
> would otherwise not fail.
>
> There are two sides to this issue, and we need to consider them
> both.

as far as I've understood Herbert's patch, at least TCP connect can be fixed
so that non blocking connect() will neither fail nor block, but just use the
first or second retransmission of the SYN packet to complete the handshake
after IPSEC is up. As this will fix the common breakage case, just do so and
keep UDP sendmsg() etc for later.

You are looking at this issue too much from the kernel side. Admitted, this is
a corner case, but therefore nobody cares if connection completion takes two
SYNs and three seconds instead of one SYN and may be two seconds. But
application developers and users will validly complain if their applications
block unexpectedly for hours just because some random provider has a network
outage and IPSEC cannot come up.

Stefan

2007-12-07 03:20:42

by David Miller

[permalink] [raw]
Subject: Re: sockets affected by IPsec always block (2.6.23)

From: Stefan Rompf <[email protected]>
Date: Thu, 6 Dec 2007 15:31:53 +0100

> as far as I've understood Herbert's patch, at least TCP connect can be fixed
> so that non blocking connect() will neither fail nor block, but just use the
> first or second retransmission of the SYN packet to complete the handshake
> after IPSEC is up.

If IPSEC takes a long time to resolve, and we don't block, the
connect() can hard fail (we will just keep dropping the outgoing SYN
packet send attempts, eventually hitting the retry limit) in cases
where if we did block it would not fail (because we wouldn't send
the first SYN until IPSEC resolved).

2007-12-07 09:29:30

by Stefan Rompf

[permalink] [raw]
Subject: Re: sockets affected by IPsec always block (2.6.23)

Am Freitag, 7. Dezember 2007 04:20 schrieb David Miller:

> If IPSEC takes a long time to resolve, and we don't block, the
> connect() can hard fail (we will just keep dropping the outgoing SYN
> packet send attempts, eventually hitting the retry limit) in cases
> where if we did block it would not fail (because we wouldn't send
> the first SYN until IPSEC resolved).

David - I'm aware of this, the discussion is which behaviour is ok. Let's go
back to a real life example. I've already researched that the squid web proxy
has a poll() based main loop doing nonblocking connects, may be with multiple
threads.

Situation: One user wants to access a web page that needs IPSEC. The SA takes
30 seconds to come up.

a) Non-blocking connect is respected: SYN packets during the first 30 seconds
will be dropped as you said. Connection can be completed on the next SYN
retry (timeout in linux: 3 minutes). During this time, the 500 other users
can continue to browse using the proxy.

b) Non-blocking connect is ignored during IPSEC resolving as you advocate it:
Connection for the one user can be completed immediatly after IPSEC comes up.
That's the pro. However, until then, the other 500 proxy user CANNOT ACCESS
THE WEB because squid's threads are stuck in connect()s on sockets they
configured not to block. If the IPSEC SA never resolves due to some network
outage, squid will sleep forever or until an admin configures it that it
doesn't try to connect the adress in question and restarts it.

Don't you realize how broken this behaviour is? Can you give me ONE example of
an application that works better with b) and why this outweights the problems
it creates for everybody else?

Even the DNS example you posted in
<[email protected]> is wrong because the second
server will never queried if the kernel puts the process into coma while the
IPSEC SA to the first server cannot be resolved.

Stefan

2007-12-16 22:30:32

by Bill Davidsen

[permalink] [raw]
Subject: Re: sockets affected by IPsec always block (2.6.23)

David Miller wrote:
> From: Herbert Xu <[email protected]>
> Date: Wed, 5 Dec 2007 11:12:32 +1100
>
>> [INET]: Export non-blocking flags to proto connect call
>>
>> Previously we made connect(2) block on IPsec SA resolution. This is
>> good in general but not desirable for non-blocking sockets.
>>
>> To fix this properly we'd need to implement the larval IPsec dst stuff
>> that we talked about. For now let's just revert to the old behaviour
>> on non-blocking sockets.
>>
>> Signed-off-by: Herbert Xu <[email protected]>
>
> We made an explicit decision not to do things this way.
>
> Non-blocking has a meaning dependant upon the xfrm_larval_drop sysctl
> setting, and this is across the board. If xfrm_larval_drop is zero,
> non-blocking semantics do not extend to IPSEC route resolution,
> otherwise it does.
>
> If he sets this sysctl to "1" as I detailed in my reply, he'll
> get the behavior he wants.
>
I think you for the hint, but I would hardly call this sentence
"detailed" in terms of being a cookbook solution to the problem.

--
Bill Davidsen <[email protected]>
"We have more to fear from the bungling of the incompetent than from
the machinations of the wicked." - from Slashdot

2007-12-16 23:22:59

by David Miller

[permalink] [raw]
Subject: Re: sockets affected by IPsec always block (2.6.23)

From: Bill Davidsen <[email protected]>
Date: Sun, 16 Dec 2007 17:47:24 -0500

> David Miller wrote:
> > From: Herbert Xu <[email protected]>
> > Date: Wed, 5 Dec 2007 11:12:32 +1100
> >
> >> [INET]: Export non-blocking flags to proto connect call
> >>
> >> Previously we made connect(2) block on IPsec SA resolution. This is
> >> good in general but not desirable for non-blocking sockets.
> >>
> >> To fix this properly we'd need to implement the larval IPsec dst stuff
> >> that we talked about. For now let's just revert to the old behaviour
> >> on non-blocking sockets.
> >>
> >> Signed-off-by: Herbert Xu <[email protected]>
> >
> > We made an explicit decision not to do things this way.
> >
> > Non-blocking has a meaning dependant upon the xfrm_larval_drop sysctl
> > setting, and this is across the board. If xfrm_larval_drop is zero,
> > non-blocking semantics do not extend to IPSEC route resolution,
> > otherwise it does.
> >
> > If he sets this sysctl to "1" as I detailed in my reply, he'll
> > get the behavior he wants.
> >
> I think you for the hint, but I would hardly call this sentence
> "detailed" in terms of being a cookbook solution to the problem.

I guess "echo '1' >/proc/sys/net/core/xfrm_larval_drop" is not
explicit enough? What more would you like me to say?