THIS PATCH IS VERY POSSIBLY WRONG! But if it is I want some feedback.
Basically what I found was that if I added an iptables rule like so:
iptables -A OUTPUT -p tcp --dport 80 -j DROP
And then ran a web browser like links it would just hang on 'establishing
connection.' I expected that the application would immediately, or at least
very quickly, get notified that the connect failed. This waiting for timeout
would be expected if something else dropped the SYN or if we were dropping the
SYN/ACK packet coming back, but I figured if we knew we threw away the SYN we knew
right away that the connection was denied and we should be able to indicate
that to the application. Yes, I realize this is little different than if the
SYN was dropped in the first network device, but it is different because we
know what happened! We know that connect() call failed and that there isn't
anything coming back.
What I discovered was that we actually had 2 problems in making it possible.
For userspace to quickly realize the connect failed. The first was a problem
in the netfilter code which wasn't passing errors back up the stack correctly,
due to what I believe to be a mistake in precedence rules.
http://marc.info/?l=netfilter-devel&m=128950262021804&w=2
And the second was that tcp_connect() was just ignoring the return value from
tcp_transmit_skb(). Maybe this was intentional but I really wish we could
find out that connect failed long before the minutes long timeout. Once I
fixed both of those issues I find that links gets denied (with EPERM)
immediately when it calls connect(). Is this wrong? Is this bad to tell
userspace more quickly what happened? Does passing this error code back up
the stack here break something else? Why do some functions seem to pay
attention to tcp_transmit_skb() return codes and some functions just ignore
it? What do others think?
NOT-AT-ALL-SIGNED-OFF-BY
---
net/ipv4/tcp_output.c | 5 +++--
1 files changed, 3 insertions(+), 2 deletions(-)
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index e961522..67b8535 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -2592,6 +2592,7 @@ int tcp_connect(struct sock *sk)
{
struct tcp_sock *tp = tcp_sk(sk);
struct sk_buff *buff;
+ int ret;
tcp_connect_init(sk);
@@ -2614,7 +2615,7 @@ int tcp_connect(struct sock *sk)
sk->sk_wmem_queued += buff->truesize;
sk_mem_charge(sk, buff->truesize);
tp->packets_out += tcp_skb_pcount(buff);
- tcp_transmit_skb(sk, buff, 1, sk->sk_allocation);
+ ret = tcp_transmit_skb(sk, buff, 1, sk->sk_allocation);
/* We change tp->snd_nxt after the tcp_transmit_skb() call
* in order to make this packet get counted in tcpOutSegs.
@@ -2626,7 +2627,7 @@ int tcp_connect(struct sock *sk)
/* Timer for repeating the SYN until an answer. */
inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS,
inet_csk(sk)->icsk_rto, TCP_RTO_MAX);
- return 0;
+ return ret;
}
EXPORT_SYMBOL(tcp_connect);
Le jeudi 11 novembre 2010 à 16:03 -0500, Eric Paris a écrit :
> THIS PATCH IS VERY POSSIBLY WRONG! But if it is I want some feedback.
>
> Basically what I found was that if I added an iptables rule like so:
>
> iptables -A OUTPUT -p tcp --dport 80 -j DROP
>
> And then ran a web browser like links it would just hang on 'establishing
> connection.' I expected that the application would immediately, or at least
> very quickly, get notified that the connect failed. This waiting for timeout
> would be expected if something else dropped the SYN or if we were dropping the
> SYN/ACK packet coming back, but I figured if we knew we threw away the SYN we knew
> right away that the connection was denied and we should be able to indicate
> that to the application. Yes, I realize this is little different than if the
> SYN was dropped in the first network device, but it is different because we
> know what happened! We know that connect() call failed and that there isn't
> anything coming back.
>
> What I discovered was that we actually had 2 problems in making it possible.
> For userspace to quickly realize the connect failed. The first was a problem
> in the netfilter code which wasn't passing errors back up the stack correctly,
> due to what I believe to be a mistake in precedence rules.
>
> http://marc.info/?l=netfilter-devel&m=128950262021804&w=2
>
> And the second was that tcp_connect() was just ignoring the return value from
> tcp_transmit_skb(). Maybe this was intentional but I really wish we could
> find out that connect failed long before the minutes long timeout. Once I
> fixed both of those issues I find that links gets denied (with EPERM)
> immediately when it calls connect(). Is this wrong? Is this bad to tell
> userspace more quickly what happened? Does passing this error code back up
> the stack here break something else? Why do some functions seem to pay
> attention to tcp_transmit_skb() return codes and some functions just ignore
> it? What do others think?
>
I think its an interesting idea, but a temporary memory shortage would
abort the connect().
We could imagine some special handling of the first packet of a flow
being DROPED for whatever reason (flow control...)
So it needs some refinement I think.
SYN packets should be allowed to be re-transmitted before saying a TCP
connect() cannot succeed.
> Yes, I realize this is little different than if the
> SYN was dropped in the first network device, but it is different
> because we know what happened! We know that connect() call failed
> and that there isn't anything coming back.
I would argue that -j DROP should behave exactly as the packet is dropped in the network, while -j REJECT should signal the failure to the application as soon as possible (which it doesn't seem to do).
It does not only make sense, but also is a highly useful testing technique that we use -j DROP in OUTPUT to emulate network losses and see how the application behaves.
On 11.11.2010 22:58, Hua Zhong wrote:
>> Yes, I realize this is little different than if the
>> SYN was dropped in the first network device, but it is different
>> because we know what happened! We know that connect() call failed
>> and that there isn't anything coming back.
>
> I would argue that -j DROP should behave exactly as the packet is dropped in the network, while -j REJECT should signal the failure to the application as soon as possible (which it doesn't seem to do).
It sends an ICMP error or TCP reset. Interpretation is up to TCP.
On Thu, 2010-11-11 at 13:58 -0800, Hua Zhong wrote:
> > Yes, I realize this is little different than if the
> > SYN was dropped in the first network device, but it is different
> > because we know what happened! We know that connect() call failed
> > and that there isn't anything coming back.
>
> I would argue that -j DROP should behave exactly as the packet is dropped in the network, while -j REJECT should signal the failure to the application as soon as possible (which it doesn't seem to do).
>
> It does not only make sense, but also is a highly useful testing technique that we use -j DROP in OUTPUT to emulate network losses and see how the application behaves.
I guess I can be a bit more descriptive of my specific situation,
although I'm not sure it matters. I don't actually plan to drop packets
with -j REJECT or -j DROP, that's just a simple example everyone can see
on their own machine. I plan to have the packets drop in the selinux
netfilter hook. The SELinux hook uses NF_DROP/NF_ACCEPT just like any
other netfilter hook. Maybe the answer is that I need to duplicate the
-j REJECT type operations in the SELinux hook. -j REJECT doesn't do
what I want today, but if that's the right way forward tell me and I'll
look down that path.
But the path I first started looking down rules in 2 distinct questions:
1) What should netfilter pass back up the stack. From my looking at
this I see that nf_hook_slow() will convert NF_DROP into -EPERM and pass
that back up the stack. Is this wrong? Should it more intelligently
pass errors back up the stack? Maybe it needs an NF_REJECT as well as
NF_DROP? NF_DROP returns 0 maybe and NF_REJECT return EPERM?
2) What should the generic TCP code (tcp_connect()) do if the skb failed
to send. Should it return error codes back up the stack somehow or
should they continue to be ignored? Obviously continuing to just ignore
information we have doesn't make me happy (otherwise I wouldn't have
started scratching this itch). But the point about ENOBUFS is well
taken. Maybe I should make tcp_connect(), or the caller to
tcp_connect() more intelligent about specific error codes?
I'm looking for a path forward. If SELinux is rejecting the SYN packets
on connect() I want to pass that info to userspace rather than just
hanging. What's the best way to accomplish that?
-Eric
Le vendredi 12 novembre 2010 à 11:08 -0500, Eric Paris a écrit :
> 2) What should the generic TCP code (tcp_connect()) do if the skb failed
> to send. Should it return error codes back up the stack somehow or
> should they continue to be ignored? Obviously continuing to just ignore
> information we have doesn't make me happy (otherwise I wouldn't have
> started scratching this itch). But the point about ENOBUFS is well
> taken. Maybe I should make tcp_connect(), or the caller to
> tcp_connect() more intelligent about specific error codes?
>
> I'm looking for a path forward. If SELinux is rejecting the SYN packets
> on connect() I want to pass that info to userspace rather than just
> hanging. What's the best way to accomplish that?
>
Eric, if you can differentiate a permanent reject, instead of a
temporary one (congestion, or rate limiting, or ENOBUF, or ...), then
yes, you could make tcp_connect() report to user the permanent error,
and ignore the temporary one.
On Fri, Nov 12, 2010 at 05:15:32PM +0100, Eric Dumazet wrote:
> Le vendredi 12 novembre 2010 à 11:08 -0500, Eric Paris a écrit :
>
> > 2) What should the generic TCP code (tcp_connect()) do if the skb failed
> > to send. Should it return error codes back up the stack somehow or
> > should they continue to be ignored? Obviously continuing to just ignore
> > information we have doesn't make me happy (otherwise I wouldn't have
> > started scratching this itch). But the point about ENOBUFS is well
> > taken. Maybe I should make tcp_connect(), or the caller to
> > tcp_connect() more intelligent about specific error codes?
> >
> > I'm looking for a path forward. If SELinux is rejecting the SYN packets
> > on connect() I want to pass that info to userspace rather than just
> > hanging. What's the best way to accomplish that?
> >
>
> Eric, if you can differentiate a permanent reject, instead of a
> temporary one (congestion, or rate limiting, or ENOBUF, or ...), then
> yes, you could make tcp_connect() report to user the permanent error,
> and ignore the temporary one.
If the netfilter targets DROP/REJECT match the NF_DROP/NF_REJECT
counterparts, which i guess they do but i didn't read the source ;),
then SELinux should use NF_REJECT in my opinion.
NF_DROP does exactly what the name says, it drops the packet aka
basically puts it in /dev/null. As with writing to /dev/null, you don't
get an error for that. Even more, if in the meantime the DROP rule does
not match anymore, the 2nd or 3rd SYN from the connect() can come
through and establish a connection (think of "-m statistic" & co.)
This is very different from REJECT.
If REJECT doesn't immediately get reported to the application, that *is*
a bug, but last time i checked i got EPERM immediately. I would fix
SELinux to use the same mechanism.
-David
Am 12.11.2010 17:35, schrieb David Lamparter:
> On Fri, Nov 12, 2010 at 05:15:32PM +0100, Eric Dumazet wrote:
>> Le vendredi 12 novembre 2010 à 11:08 -0500, Eric Paris a écrit :
>>
>>> 2) What should the generic TCP code (tcp_connect()) do if the skb failed
>>> to send. Should it return error codes back up the stack somehow or
>>> should they continue to be ignored? Obviously continuing to just ignore
>>> information we have doesn't make me happy (otherwise I wouldn't have
>>> started scratching this itch). But the point about ENOBUFS is well
>>> taken. Maybe I should make tcp_connect(), or the caller to
>>> tcp_connect() more intelligent about specific error codes?
>>>
>>> I'm looking for a path forward. If SELinux is rejecting the SYN packets
>>> on connect() I want to pass that info to userspace rather than just
>>> hanging. What's the best way to accomplish that?
>>>
>>
>> Eric, if you can differentiate a permanent reject, instead of a
>> temporary one (congestion, or rate limiting, or ENOBUF, or ...), then
>> yes, you could make tcp_connect() report to user the permanent error,
>> and ignore the temporary one.
Indeed. We could even make the NF_DROP return value configurable
by encoding it in the verdict.
> If the netfilter targets DROP/REJECT match the NF_DROP/NF_REJECT
> counterparts, which i guess they do but i didn't read the source ;),
> then SELinux should use NF_REJECT in my opinion.
There is no NF_REJECT.
> NF_DROP does exactly what the name says, it drops the packet aka
> basically puts it in /dev/null. As with writing to /dev/null, you don't
> get an error for that. Even more, if in the meantime the DROP rule does
> not match anymore, the 2nd or 3rd SYN from the connect() can come
> through and establish a connection (think of "-m statistic" & co.)
>
> This is very different from REJECT.
Returning NF_DROP results in -EPERM getting reported back. As Eric
noticed, this is ignored for SYN packets.
> If REJECT doesn't immediately get reported to the application, that *is*
> a bug, but last time i checked i got EPERM immediately. I would fix
> SELinux to use the same mechanism.
NF_DROP returns -EPERM, the REJECT targets send packets to reject
a connection. Whether this is reported immediately depends on the
error and the protocol in question. Using a TCP reset immediately
resets the connection.
On Fri, 2010-11-12 at 17:35 +0100, David Lamparter wrote:
> On Fri, Nov 12, 2010 at 05:15:32PM +0100, Eric Dumazet wrote:
> > Le vendredi 12 novembre 2010 à 11:08 -0500, Eric Paris a écrit :
> >
> > > 2) What should the generic TCP code (tcp_connect()) do if the skb failed
> > > to send. Should it return error codes back up the stack somehow or
> > > should they continue to be ignored? Obviously continuing to just ignore
> > > information we have doesn't make me happy (otherwise I wouldn't have
> > > started scratching this itch). But the point about ENOBUFS is well
> > > taken. Maybe I should make tcp_connect(), or the caller to
> > > tcp_connect() more intelligent about specific error codes?
> > >
> > > I'm looking for a path forward. If SELinux is rejecting the SYN packets
> > > on connect() I want to pass that info to userspace rather than just
> > > hanging. What's the best way to accomplish that?
> > >
> >
> > Eric, if you can differentiate a permanent reject, instead of a
> > temporary one (congestion, or rate limiting, or ENOBUF, or ...), then
> > yes, you could make tcp_connect() report to user the permanent error,
> > and ignore the temporary one.
>
> If the netfilter targets DROP/REJECT match the NF_DROP/NF_REJECT
> counterparts, which i guess they do but i didn't read the source ;),
> then SELinux should use NF_REJECT in my opinion.
As it stands today there is no NF_REJECT. NF_DROP is the only (related)
permitted return value from a netfilter hook. Maybe I need to change
that fact though.
> NF_DROP does exactly what the name says, it drops the packet aka
> basically puts it in /dev/null. As with writing to /dev/null, you don't
> get an error for that. Even more, if in the meantime the DROP rule does
> not match anymore, the 2nd or 3rd SYN from the connect() can come
> through and establish a connection (think of "-m statistic" & co.)
>
> This is very different from REJECT.
>
> If REJECT doesn't immediately get reported to the application, that *is*
> a bug, but last time i checked i got EPERM immediately. I would fix
> SELinux to use the same mechanism.
I haven't looked at what -j REJECT does (or was intended to do) but it
most certainly does not return an error to sys_connect(). Try it out.
iptables -A OUTPUT -p tcp --dport 80 -j REJECT
links http://www.google.com
it just hangs on 'making connection' (exact same for -j DROP)
If everyone agrees that's the wrong behavior (for -j REJECT) I'll work
on fixing that (however is appropriate) and will change the SELinux code
if needed after we've fixed the -j REJECT code. Obviously there's
problems with my original way to fix the lack of error returns (namely
that I would immediately EACCES for DROP as well as REJECT).
I'm glad to hear that others seem to believe the current code is buggy
and I'm not completely off my rocker to think that applications should
be able to learn somehow that things fell down...
-Eric
Hello!
I looked at tcp_v4_err() and found something strange. Quite non-trivial operations
are performed on unlocked sockets. It looks like at least this BUG_ON():
skb = tcp_write_queue_head(sk);
BUG_ON(!skb);
can be easily triggered.
Do I miss something?
Alexey
Le vendredi 12 novembre 2010 à 20:57 +0300, Alexey Kuznetsov a écrit :
> Hello!
>
> I looked at tcp_v4_err() and found something strange. Quite non-trivial operations
> are performed on unlocked sockets. It looks like at least this BUG_ON():
>
> skb = tcp_write_queue_head(sk);
> BUG_ON(!skb);
>
> can be easily triggered.
>
> Do I miss something?
>
Hi Alexey !
I see socket is locked around line 368,
bh_lock_sock(sk);
/* If too many ICMPs get dropped on busy
* servers this needs to be solved differently.
*/
if (sock_owned_by_user(sk))
NET_INC_STATS_BH(net, LINUX_MIB_LOCKDROPPEDICMPS);
Hmm, maybe some goto is missing ;)
Hello!
On Thu, Nov 11, 2010 at 04:03:41PM -0500, Eric Paris wrote:
> immediately when it calls connect(). Is this wrong? Is this bad to tell
> userspace more quickly what happened? Does passing this error code back up
> the stack here break something else? Why do some functions seem to pay
> attention to tcp_transmit_skb() return codes and some functions just ignore
> it?
Essentially, return value of tcp_transmit_skb() is always ignored.
It is used only for accounting and for some optimization of retransmission behaviour.
Generally, tcp does not react on errors coming outside of tcp protocol.
The only loophole is ICMP error in the same case as yours. In _violation_ of specs
linux immediately aborts unestablished connect on an icmp error. IMHO that thing
which you suggest is correct (of course, provided you filter out transient errors and react only
to EPERM or something like this). It was not done because it was expected
firewall rule prescribing immediate abort is configured with "--reject-with icmp-port-unreachable",
otherwise the rule orders real blackhole.
Alexey
Le vendredi 12 novembre 2010 à 19:12 +0100, Eric Dumazet a écrit :
> Le vendredi 12 novembre 2010 à 20:57 +0300, Alexey Kuznetsov a écrit :
> > Hello!
> >
> > I looked at tcp_v4_err() and found something strange. Quite non-trivial operations
> > are performed on unlocked sockets. It looks like at least this BUG_ON():
> >
> > skb = tcp_write_queue_head(sk);
> > BUG_ON(!skb);
> >
> > can be easily triggered.
> >
> > Do I miss something?
> >
>
> Hi Alexey !
>
> I see socket is locked around line 368,
>
> bh_lock_sock(sk);
> /* If too many ICMPs get dropped on busy
> * servers this needs to be solved differently.
> */
> if (sock_owned_by_user(sk))
> NET_INC_STATS_BH(net, LINUX_MIB_LOCKDROPPEDICMPS);
>
>
> Hmm, maybe some goto is missing ;)
>
Well, goto is not missing.
Why do you think BUG_ON(!skb) can be triggered ?
We test before :
if (seq != tp->snd_una || !icsk->icsk_retransmits ||
!icsk->icsk_backoff)
break;
So a concurrent user only can add new skb(s) in the (non empty) queue ?
Le vendredi 12 novembre 2010 à 19:21 +0100, Eric Dumazet a écrit :
> Le vendredi 12 novembre 2010 à 19:12 +0100, Eric Dumazet a écrit :
> > Le vendredi 12 novembre 2010 à 20:57 +0300, Alexey Kuznetsov a écrit :
> > > Hello!
> > >
> > > I looked at tcp_v4_err() and found something strange. Quite non-trivial operations
> > > are performed on unlocked sockets. It looks like at least this BUG_ON():
> > >
> > > skb = tcp_write_queue_head(sk);
> > > BUG_ON(!skb);
> > >
> > > can be easily triggered.
> > >
> > > Do I miss something?
> > >
> >
> > Hi Alexey !
> >
> > I see socket is locked around line 368,
> >
> > bh_lock_sock(sk);
> > /* If too many ICMPs get dropped on busy
> > * servers this needs to be solved differently.
> > */
> > if (sock_owned_by_user(sk))
> > NET_INC_STATS_BH(net, LINUX_MIB_LOCKDROPPEDICMPS);
> >
> >
> > Hmm, maybe some goto is missing ;)
> >
>
> Well, goto is not missing.
>
> Why do you think BUG_ON(!skb) can be triggered ?
>
> We test before :
>
> if (seq != tp->snd_una || !icsk->icsk_retransmits ||
> !icsk->icsk_backoff)
> break;
>
> So a concurrent user only can add new skb(s) in the (non empty) queue ?
>
>
Oh well, it seems you are right (backlog processing)
Bug was introduced in commit f1ecd5d9e736660 (Revert Backoff [v3]:
Revert RTO on ICMP destination unreachable) from Damian Lukowski
Hello!
On Fri, Nov 12, 2010 at 07:12:58PM +0100, Eric Dumazet wrote:
> I see socket is locked around line 368,
>
> bh_lock_sock(sk);
> /* If too many ICMPs get dropped on busy
> * servers this needs to be solved differently.
> */
> if (sock_owned_by_user(sk))
> NET_INC_STATS_BH(net, LINUX_MIB_LOCKDROPPEDICMPS);
>
>
> Hmm, maybe some goto is missing ;)
It is not missing, sock_owned_by_user() is checked later when some operation which
cannot be done without lock is required. It was done to save error in sk_err_soft even
when socket is locked.
This code also _understands_ this: look at
} else if (sock_owned_by_user(sk)) {
/* RTO revert clocked out retransmission,
* but socket is locked. Will defer. */
inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS,
HZ/20, TCP_RTO_MAX);
but somehow it considers the manipulations with rto/backoff/write_queue as safe.
Seems, they are not.
Alexey
Hello!
> Oh well, it seems you are right (backlog processing)
Exactly.
Alexey
Le vendredi 12 novembre 2010 à 21:29 +0300, Alexey Kuznetsov a écrit :
> Hello!
>
> On Fri, Nov 12, 2010 at 07:12:58PM +0100, Eric Dumazet wrote:
> > I see socket is locked around line 368,
> >
> > bh_lock_sock(sk);
> > /* If too many ICMPs get dropped on busy
> > * servers this needs to be solved differently.
> > */
> > if (sock_owned_by_user(sk))
> > NET_INC_STATS_BH(net, LINUX_MIB_LOCKDROPPEDICMPS);
> >
> >
> > Hmm, maybe some goto is missing ;)
>
> It is not missing, sock_owned_by_user() is checked later when some operation which
> cannot be done without lock is required. It was done to save error in sk_err_soft even
> when socket is locked.
>
> This code also _understands_ this: look at
>
> } else if (sock_owned_by_user(sk)) {
> /* RTO revert clocked out retransmission,
> * but socket is locked. Will defer. */
> inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS,
> HZ/20, TCP_RTO_MAX);
>
> but somehow it considers the manipulations with rto/backoff/write_queue as safe.
> Seems, they are not.
Indeed, right you are, I came to same conclusion.
I CC Damian Lukowski in my previous answer (and this one too)
Thanks !
From: Eric Dumazet <[email protected]>
Date: Fri, 12 Nov 2010 19:33:23 +0100
> I CC Damian Lukowski in my previous answer (and this one too)
Probably the safest fix is this:
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 8f8527d..69ccbc1 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -415,6 +415,9 @@ void tcp_v4_err(struct sk_buff *icmp_skb, u32 info)
!icsk->icsk_backoff)
break;
+ if (sock_owned_by_user(sk))
+ break;
+
icsk->icsk_backoff--;
inet_csk(sk)->icsk_rto = __tcp_set_rto(tp) <<
icsk->icsk_backoff;
@@ -429,11 +432,6 @@ void tcp_v4_err(struct sk_buff *icmp_skb, u32 info)
if (remaining) {
inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS,
remaining, TCP_RTO_MAX);
- } else if (sock_owned_by_user(sk)) {
- /* RTO revert clocked out retransmission,
- * but socket is locked. Will defer. */
- inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS,
- HZ/20, TCP_RTO_MAX);
} else {
/* RTO revert clocked out retransmission.
* Will retransmit now */
From: Alexey Kuznetsov <[email protected]>
Date: Fri, 12 Nov 2010 20:46:20 +0300
> The only loophole is ICMP error in the same case as yours. In
> _violation_ of specs linux immediately aborts unestablished connect
> on an icmp error. IMHO that thing which you suggest is correct (of
> course, provided you filter out transient errors and react only to
> EPERM or something like this). It was not done because it was
> expected firewall rule prescribing immediate abort is configured
> with "--reject-with icmp-port-unreachable", otherwise the rule
> orders real blackhole.
The idea to signal on -EPERM might be OK, but if that's also
what things like "-m statistical" and friends end up reporting
then we still cannot do it.
On Fri, Nov 12, 2010 at 05:54:29PM +0100, Patrick McHardy wrote:
> Am 12.11.2010 17:35, schrieb David Lamparter:
> > On Fri, Nov 12, 2010 at 05:15:32PM +0100, Eric Dumazet wrote:
> >> Le vendredi 12 novembre 2010 à 11:08 -0500, Eric Paris a écrit :
> >>
> >>> 2) What should the generic TCP code (tcp_connect()) do if the skb failed
> >>> to send. Should it return error codes back up the stack somehow or
> >>> should they continue to be ignored? Obviously continuing to just ignore
> >>> information we have doesn't make me happy (otherwise I wouldn't have
> >>> started scratching this itch). But the point about ENOBUFS is well
> >>> taken. Maybe I should make tcp_connect(), or the caller to
> >>> tcp_connect() more intelligent about specific error codes?
> >>>
> >>> I'm looking for a path forward. If SELinux is rejecting the SYN packets
> >>> on connect() I want to pass that info to userspace rather than just
> >>> hanging. What's the best way to accomplish that?
> >>>
> >>
> >> Eric, if you can differentiate a permanent reject, instead of a
> >> temporary one (congestion, or rate limiting, or ENOBUF, or ...), then
> >> yes, you could make tcp_connect() report to user the permanent error,
> >> and ignore the temporary one.
>
> Indeed. We could even make the NF_DROP return value configurable
> by encoding it in the verdict.
> There is no NF_REJECT.
Ah, sorry, not at home in netfilter, coming from an user perspective here.
> Returning NF_DROP results in -EPERM getting reported back. As Eric
> noticed, this is ignored for SYN packets.
Hrm. But how do you silently drop packets? This seems counterintuitive
or even buggy to me; or at least the netfilter DROP target shouldn't use
this kind of error-reporting drop.
As food for thought I'd like to pose the following rule:
iptables -A OUTPUT -m statistic --mode nth --every 5 -j DROP
which should, to my understanding, still allow the connect to complete,
even if the first SYN got (silently!...) dropped.
Also, i'm *very* sure i was able to trigger a "permission denied" from
either firewall or route rules; weirdly enough i can't get that on my
2.6.35.7 router... (poking older boxes to reproduce it right now)
Just for reference some test results: (heavily cropped)
TL;DR: only tcp-reset and route prohibit work immediately.
+ telnet 74.125.43.105 80
Connected to 74.125.43.105.
+ iptables -I OUTPUT -p tcp -d 74.125.43.105 --dport 80 -j REJECT
# default w/o reject-with is icmp-port-unreachable
+ telnet 74.125.43.105 80
telnet: connect to address 74.125.43.105: Connection refused
real 0m3.014s
+ iptables -I OUTPUT -p tcp -d 74.125.43.105 --dport 80 -j REJECT --reject-with tcp-reset
+ telnet 74.125.43.105 80
telnet: connect to address 74.125.43.105: Connection refused
real 0m0.007s
+ iptables -I OUTPUT -p tcp -d 74.125.43.105 --dport 80 -j REJECT --reject-with host-prohib
+ telnet 74.125.43.105 80
telnet: connect to address 74.125.43.105: No route to host
real 0m3.010s
+ iptables -I OUTPUT -p tcp -d 74.125.43.105 --dport 80 -j REJECT --reject-with admin-prohib
+ telnet 74.125.43.105 80
telnet: connect to address 74.125.43.105: No route to host
real 0m3.009s
+ iptables -I OUTPUT -p tcp -d 74.125.43.105 --dport 80 -j REJECT --reject-with net-prohib
+ telnet 74.125.43.105 80
telnet: connect to address 74.125.43.105: Network is unreachable
real 0m3.011s
+ iptables -F OUTPUT
+ ip route add prohibit 74.125.43.105
+ ip route flush cache
+ telnet 74.125.43.105 80
telnet: connect to address 74.125.43.105: Network is unreachable
real 0m0.007s
-David
From: David Lamparter <[email protected]>
Date: Fri, 12 Nov 2010 22:16:27 +0100
> As food for thought I'd like to pose the following rule:
> iptables -A OUTPUT -m statistic --mode nth --every 5 -j DROP
> which should, to my understanding, still allow the connect to complete,
> even if the first SYN got (silently!...) dropped.
Yes, I agree and this is pretty much the point I tried to make
earlier.
Le vendredi 12 novembre 2010 à 11:22 -0800, David Miller a écrit :
> From: Eric Dumazet <[email protected]>
> Date: Fri, 12 Nov 2010 19:33:23 +0100
>
> > I CC Damian Lukowski in my previous answer (and this one too)
>
> Probably the safest fix is this:
>
> diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
> index 8f8527d..69ccbc1 100644
> --- a/net/ipv4/tcp_ipv4.c
> +++ b/net/ipv4/tcp_ipv4.c
> @@ -415,6 +415,9 @@ void tcp_v4_err(struct sk_buff *icmp_skb, u32 info)
> !icsk->icsk_backoff)
> break;
>
> + if (sock_owned_by_user(sk))
> + break;
> +
> icsk->icsk_backoff--;
> inet_csk(sk)->icsk_rto = __tcp_set_rto(tp) <<
> icsk->icsk_backoff;
> @@ -429,11 +432,6 @@ void tcp_v4_err(struct sk_buff *icmp_skb, u32 info)
> if (remaining) {
> inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS,
> remaining, TCP_RTO_MAX);
> - } else if (sock_owned_by_user(sk)) {
> - /* RTO revert clocked out retransmission,
> - * but socket is locked. Will defer. */
> - inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS,
> - HZ/20, TCP_RTO_MAX);
> } else {
> /* RTO revert clocked out retransmission.
> * Will retransmit now */
Acked-by: Eric Dumazet <[email protected]>
From: Eric Dumazet <[email protected]>
Date: Fri, 12 Nov 2010 22:18:38 +0100
> Acked-by: Eric Dumazet <[email protected]>
Thanks for reviewing, here is the final bit I committed and
I'll queue this up for whatever -stable trees need this.
--------------------
tcp: Don't change unlocked socket state in tcp_v4_err().
Alexey Kuznetsov noticed a regression introduced by
commit f1ecd5d9e7366609d640ff4040304ea197fbc618
("Revert Backoff [v3]: Revert RTO on ICMP destination unreachable")
The RTO and timer modification code added to tcp_v4_err()
doesn't check sock_owned_by_user(), which if true means we
don't have exclusive access to the socket and therefore cannot
modify it's critical state.
Just skip this new code block if sock_owned_by_user() is true
and eliminate the now superfluous sock_owned_by_user() code
block contained within.
Reported-by: Alexey Kuznetsov <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
CC: Damian Lukowski <[email protected]>
Acked-by: Eric Dumazet <[email protected]>
---
net/ipv4/tcp_ipv4.c | 8 +++-----
1 files changed, 3 insertions(+), 5 deletions(-)
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 8f8527d..69ccbc1 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -415,6 +415,9 @@ void tcp_v4_err(struct sk_buff *icmp_skb, u32 info)
!icsk->icsk_backoff)
break;
+ if (sock_owned_by_user(sk))
+ break;
+
icsk->icsk_backoff--;
inet_csk(sk)->icsk_rto = __tcp_set_rto(tp) <<
icsk->icsk_backoff;
@@ -429,11 +432,6 @@ void tcp_v4_err(struct sk_buff *icmp_skb, u32 info)
if (remaining) {
inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS,
remaining, TCP_RTO_MAX);
- } else if (sock_owned_by_user(sk)) {
- /* RTO revert clocked out retransmission,
- * but socket is locked. Will defer. */
- inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS,
- HZ/20, TCP_RTO_MAX);
} else {
/* RTO revert clocked out retransmission.
* Will retransmit now */
--
1.7.3.2
> On 11.11.2010 22:58, Hua Zhong wrote:
> >> Yes, I realize this is little different than if the
> >> SYN was dropped in the first network device, but it is different
> >> because we know what happened! We know that connect() call failed
> >> and that there isn't anything coming back.
> >
> > I would argue that -j DROP should behave exactly as the packet is
> dropped in the network, while -j REJECT should signal the failure to
> the application as soon as possible (which it doesn't seem to do).
>
> It sends an ICMP error or TCP reset. Interpretation is up to TCP.
Huh? It's the OUTPUT chain we are talking about. There is no ICMP error or
TCP reset.
On 13.11.2010 00:14, Hua Zhong wrote:
>> On 11.11.2010 22:58, Hua Zhong wrote:
>>>> Yes, I realize this is little different than if the
>>>> SYN was dropped in the first network device, but it is different
>>>> because we know what happened! We know that connect() call failed
>>>> and that there isn't anything coming back.
>>>
>>> I would argue that -j DROP should behave exactly as the packet is
>> dropped in the network, while -j REJECT should signal the failure to
>> the application as soon as possible (which it doesn't seem to do).
>>
>> It sends an ICMP error or TCP reset. Interpretation is up to TCP.
>
> Huh? It's the OUTPUT chain we are talking about. There is no ICMP error or
> TCP reset.
Of course there is.
ICMP (default):
iptables -A OUTPUT -p tcp -j REJECT
TCP reset:
iptables -A OUTPUT -p tcp -j REJECT --reject-with tcp-reset
The second one will cause a hard error for the connection.
On Mon, 2010-11-15 at 11:32 +0100, Patrick McHardy wrote:
> On 13.11.2010 00:14, Hua Zhong wrote:
> >> On 11.11.2010 22:58, Hua Zhong wrote:
> >>>> Yes, I realize this is little different than if the
> >>>> SYN was dropped in the first network device, but it is different
> >>>> because we know what happened! We know that connect() call failed
> >>>> and that there isn't anything coming back.
> >>>
> >>> I would argue that -j DROP should behave exactly as the packet is
> >> dropped in the network, while -j REJECT should signal the failure to
> >> the application as soon as possible (which it doesn't seem to do).
> >>
> >> It sends an ICMP error or TCP reset. Interpretation is up to TCP.
> >
> > Huh? It's the OUTPUT chain we are talking about. There is no ICMP error or
> > TCP reset.
>
> Of course there is.
>
> ICMP (default):
>
> iptables -A OUTPUT -p tcp -j REJECT
>
> TCP reset:
>
> iptables -A OUTPUT -p tcp -j REJECT --reject-with tcp-reset
>
> The second one will cause a hard error for the connection.
Well I'm (I guess?) surprised that the --reject-with icmp doesn't do
anything with a local outgoing connection but --reject-with tcp-reset
does something like what I'm looking for.
I notice the heavy lifting for this is done in
net/ipv4/netfilter/ipt_REJECT.c::send_rest()
(and something very similar for IPv6)
I really don't want to duplicate that code into SELinux (for obvious
reasons) and I'm wondering if anyone has objections to me making it
available outside of netlink and/or suggestions on how to make that code
available outside of netfilter (aka what header to expose it, and does
it still make logical sense in ipt_REJECT.c or somewhere else?)
-Eric
On 15.11.2010 16:47, Eric Paris wrote:
> On Mon, 2010-11-15 at 11:32 +0100, Patrick McHardy wrote:
>> On 13.11.2010 00:14, Hua Zhong wrote:
>>>> On 11.11.2010 22:58, Hua Zhong wrote:
>>>>>> Yes, I realize this is little different than if the
>>>>>> SYN was dropped in the first network device, but it is different
>>>>>> because we know what happened! We know that connect() call failed
>>>>>> and that there isn't anything coming back.
>>>>>
>>>>> I would argue that -j DROP should behave exactly as the packet is
>>>> dropped in the network, while -j REJECT should signal the failure to
>>>> the application as soon as possible (which it doesn't seem to do).
>>>>
>>>> It sends an ICMP error or TCP reset. Interpretation is up to TCP.
>>>
>>> Huh? It's the OUTPUT chain we are talking about. There is no ICMP error or
>>> TCP reset.
>>
>> Of course there is.
>>
>> ICMP (default):
>>
>> iptables -A OUTPUT -p tcp -j REJECT
>>
>> TCP reset:
>>
>> iptables -A OUTPUT -p tcp -j REJECT --reject-with tcp-reset
>>
>> The second one will cause a hard error for the connection.
>
> Well I'm (I guess?) surprised that the --reject-with icmp doesn't do
> anything with a local outgoing connection but --reject-with tcp-reset
> does something like what I'm looking for.
>
> I notice the heavy lifting for this is done in
> net/ipv4/netfilter/ipt_REJECT.c::send_rest()
> (and something very similar for IPv6)
>
> I really don't want to duplicate that code into SELinux (for obvious
> reasons) and I'm wondering if anyone has objections to me making it
> available outside of netlink and/or suggestions on how to make that code
> available outside of netfilter (aka what header to expose it, and does
> it still make logical sense in ipt_REJECT.c or somewhere else?)
I don't think having SELinux sending packets to handle local
connections is a very elegant design, its not a firewall after
all. What's wrong with reacting only to specific errno codes
in tcp_connect()? You could f.i. return -ECONNREFUSED from
SELinux, that one is pretty much guaranteed not to occur in
the network stack itself and can be returned directly.
That would need minor changes to nf_hook_slow so we can
encode errno values in the upper 16 bits of the verdict,
as we already do with the queue number. The added benefit
is that we don't have to return EPERM anymore when f.i.
rerouting fails.
On 15.11.2010 16:57, Patrick McHardy wrote:
> On 15.11.2010 16:47, Eric Paris wrote:
>>> iptables -A OUTPUT -p tcp -j REJECT --reject-with tcp-reset
>>>
>>> The second one will cause a hard error for the connection.
>>
>> Well I'm (I guess?) surprised that the --reject-with icmp doesn't do
>> anything with a local outgoing connection but --reject-with tcp-reset
>> does something like what I'm looking for.
>>
>> I notice the heavy lifting for this is done in
>> net/ipv4/netfilter/ipt_REJECT.c::send_rest()
>> (and something very similar for IPv6)
>>
>> I really don't want to duplicate that code into SELinux (for obvious
>> reasons) and I'm wondering if anyone has objections to me making it
>> available outside of netlink and/or suggestions on how to make that code
>> available outside of netfilter (aka what header to expose it, and does
>> it still make logical sense in ipt_REJECT.c or somewhere else?)
>
> I don't think having SELinux sending packets to handle local
> connections is a very elegant design, its not a firewall after
> all. What's wrong with reacting only to specific errno codes
> in tcp_connect()? You could f.i. return -ECONNREFUSED from
> SELinux, that one is pretty much guaranteed not to occur in
> the network stack itself and can be returned directly.
One more note: there is also the problem that the RST might never
reach the socket, f.i. because netfilter drops it, or TC actions
reroute it etc. With netfilter users are expected to make sure the
entire combination of network features does what the expect, but
that's probably not what you want for SELinux.
> That would need minor changes to nf_hook_slow so we can
> encode errno values in the upper 16 bits of the verdict,
> as we already do with the queue number. The added benefit
> is that we don't have to return EPERM anymore when f.i.
> rerouting fails.
On 15.11.2010 16:57, Patrick McHardy wrote:
> On 15.11.2010 16:47, Eric Paris wrote:
>> I notice the heavy lifting for this is done in
>> net/ipv4/netfilter/ipt_REJECT.c::send_rest()
>> (and something very similar for IPv6)
>>
>> I really don't want to duplicate that code into SELinux (for obvious
>> reasons) and I'm wondering if anyone has objections to me making it
>> available outside of netlink and/or suggestions on how to make that code
>> available outside of netfilter (aka what header to expose it, and does
>> it still make logical sense in ipt_REJECT.c or somewhere else?)
>
> I don't think having SELinux sending packets to handle local
> connections is a very elegant design, its not a firewall after
> all. What's wrong with reacting only to specific errno codes
> in tcp_connect()? You could f.i. return -ECONNREFUSED from
> SELinux, that one is pretty much guaranteed not to occur in
> the network stack itself and can be returned directly.
>
> That would need minor changes to nf_hook_slow so we can
> encode errno values in the upper 16 bits of the verdict,
> as we already do with the queue number. The added benefit
> is that we don't have to return EPERM anymore when f.i.
> rerouting fails.
Patch for demonstration purposes attached. I've modified the
MARK target so it returns NF_DROP with an errno code of
-ECONNREFUSED:
# iptables -A OUTPUT -d 1.2.3.4 -j MARK --set-mark 1
# ping 1.2.3.4
PING 1.2.3.4 (1.2.3.4) 56(84) bytes of data.
ping: sendmsg: Connection refused
# telnet 1.2.3.4
Trying 1.2.3.4...
telnet: Unable to connect to remote host: Connection refused
From: Patrick McHardy <[email protected]>
Date: Mon, 15 Nov 2010 17:36:40 +0100
> Patch for demonstration purposes attached. I've modified the
> MARK target so it returns NF_DROP with an errno code of
> -ECONNREFUSED:
I'm fine with the tcp_output.c changes.
Hello!
On Mon, Nov 15, 2010 at 10:47:46AM -0500, Eric Paris wrote:
> Well I'm (I guess?) surprised that the --reject-with icmp doesn't do
> anything with a local outgoing connection
Yeah. This was "an obvious surpise": for local socket the first icmp error always
arrives on locked socket and gets dropped. I even forgot about this (tcp-reset
still works ok due to backlog) This makes the idea to respect error more attractive.
Alexey