Subject: Re: [PATCH net-next] tcp: Add mark for TIMEWAIT sockets
To: Jon Maxwell, davem@davemloft.net
Cc: kuznet@ms2.inr.ac.ru, yoshfuji@linux-ipv6.org, netdev@vger.kernel.org, linux-kernel@vger.kernel.org, jmaxwell@redhat.com
References: <20180510020739.8599-1-jmaxwell37@gmail.com>
From: Eric Dumazet
Date: Wed, 9 May 2018 20:32:15 -0700
In-Reply-To: <20180510020739.8599-1-jmaxwell37@gmail.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On 05/09/2018 07:07 PM, Jon Maxwell wrote:
> Aidan McGurn from Openwave Mobility systems reported the following bug:
>
> "Marked routing is broken on customer deployment. Its effects are large
> increase in Uplink retransmissions caused by the client never receiving
> the final ACK to their FINACK - this ACK misses the mark and routes out
> of the incorrect route."
>
> Currently marks are added to sk_buffs for replies when the "fwmark_reflect"
> sysctl is enabled. But not for TIME_WAIT sockets where the original socket had
> sk->sk_mark set via setsockopt(SO_MARK..).
>
> Fix this in IPv4/v6 by adding tw->tw_mark for TIME_WAIT sockets. Copy the
> original sk->sk_mark in __inet_twsk_hashdance() to the new tw->tw_mark location.
> Then copy this into ctl_sk->sk_mark so that the skb gets sent with the correct
> mark. Do the same for resets. Give the "fwmark_reflect" sysctl precedence over
> sk->sk_mark so that netfilter rules are still honored.
>
> Signed-off-by: Jon Maxwell
> ---
>  include/net/inet_timewait_sock.h |  1 +
>  net/ipv4/ip_output.c             |  3 ++-
>  net/ipv4/tcp_ipv4.c              | 18 ++++++++++++++++--
>  net/ipv4/tcp_minisocks.c         |  1 +
>  net/ipv6/tcp_ipv6.c              |  8 +++++++-
>  5 files changed, 27 insertions(+), 4 deletions(-)
>
> diff --git a/include/net/inet_timewait_sock.h b/include/net/inet_timewait_sock.h
> index c7be1ca8e562..659d8ed5a3bc 100644
> --- a/include/net/inet_timewait_sock.h
> +++ b/include/net/inet_timewait_sock.h
> @@ -62,6 +62,7 @@ struct inet_timewait_sock {
>  #define tw_dr			__tw_common.skc_tw_dr
>
>  	int			tw_timeout;
> +	__u32			tw_mark;
>  	volatile unsigned char	tw_substate;
>  	unsigned char		tw_rcv_wscale;
>
> diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
> index 95adb171f852..cca4412dc4cb 100644
> --- a/net/ipv4/ip_output.c
> +++ b/net/ipv4/ip_output.c
> @@ -1539,6 +1539,7 @@ void ip_send_unicast_reply(struct sock *sk, struct sk_buff *skb,
>  	struct sk_buff *nskb;
>  	int err;
>  	int oif;
> +	__u32 mark = IP4_REPLY_MARK(net, skb->mark);
>
>  	if (__ip_options_echo(net, &replyopts.opt.opt, skb, sopt))
>  		return;
> @@ -1561,7 +1562,7 @@ void ip_send_unicast_reply(struct sock *sk, struct sk_buff *skb,
>  		oif = skb->skb_iif;
>
>  	flowi4_init_output(&fl4, oif,
> -			   IP4_REPLY_MARK(net, skb->mark),
> +			   mark ? (mark) : sk->sk_mark,

You can avoid declaring the mark variable and simply use this here:

	IP4_REPLY_MARK(net, skb->mark) ?: sk->sk_mark,

>  			   RT_TOS(arg->tos),
>  			   RT_SCOPE_UNIVERSE, ip_hdr(skb)->protocol,
>  			   ip_reply_arg_flowi_flags(arg),
> diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
> index f70586b50838..fbee36579c83 100644
> --- a/net/ipv4/tcp_ipv4.c
> +++ b/net/ipv4/tcp_ipv4.c
> @@ -621,6 +621,7 @@ static void tcp_v4_send_reset(const struct sock *sk, struct sk_buff *skb)
>  	struct sock *sk1 = NULL;
>  #endif
>  	struct net *net;
> +	struct sock *ctl_sk;
>
>  	/* Never send a reset in response to a reset. */
>  	if (th->rst)
> @@ -723,11 +724,17 @@ static void tcp_v4_send_reset(const struct sock *sk, struct sk_buff *skb)
>  	arg.tos = ip_hdr(skb)->tos;
>  	arg.uid = sock_net_uid(net, sk && sk_fullsock(sk) ? sk : NULL);
>  	local_bh_disable();
> -	ip_send_unicast_reply(*this_cpu_ptr(net->ipv4.tcp_sk),
> +	ctl_sk = *this_cpu_ptr(net->ipv4.tcp_sk);
> +	if (sk && sk->sk_state == TCP_TIME_WAIT)
> +		ctl_sk->sk_mark = inet_twsk(sk)->tw_mark;
> +	else if (sk && sk_fullsock(sk))
> +		ctl_sk->sk_mark = sk->sk_mark;
> +	ip_send_unicast_reply(ctl_sk,
>  			      skb, &TCP_SKB_CB(skb)->header.h4.opt,
>  			      ip_hdr(skb)->saddr, ip_hdr(skb)->daddr,
>  			      &arg, arg.iov[0].iov_len);
>
> +	ctl_sk->sk_mark = 0;
>  	__TCP_INC_STATS(net, TCP_MIB_OUTSEGS);
>  	__TCP_INC_STATS(net, TCP_MIB_OUTRSTS);
>  	local_bh_enable();
> @@ -759,6 +766,7 @@ static void tcp_v4_send_ack(const struct sock *sk,
>  	} rep;
>  	struct net *net = sock_net(sk);
>  	struct ip_reply_arg arg;
> +	struct sock *ctl_sk;
>
>  	memset(&rep.th, 0, sizeof(struct tcphdr));
>  	memset(&arg, 0, sizeof(arg));
> @@ -809,11 +817,17 @@ static void tcp_v4_send_ack(const struct sock *sk,
>  	arg.tos = tos;
>  	arg.uid = sock_net_uid(net, sk_fullsock(sk) ? sk : NULL);
>  	local_bh_disable();
> -	ip_send_unicast_reply(*this_cpu_ptr(net->ipv4.tcp_sk),
> +	ctl_sk = *this_cpu_ptr(net->ipv4.tcp_sk);
> +	if (sk && sk->sk_state == TCP_TIME_WAIT)
> +		ctl_sk->sk_mark = inet_twsk(sk)->tw_mark;
> +	else if (sk && sk_fullsock(sk))
> +		ctl_sk->sk_mark = sk->sk_mark;
> +	ip_send_unicast_reply(ctl_sk,
>  			      skb, &TCP_SKB_CB(skb)->header.h4.opt,
>  			      ip_hdr(skb)->saddr, ip_hdr(skb)->daddr,
>  			      &arg, arg.iov[0].iov_len);
>
> +	ctl_sk->sk_mark = 0;
>  	__TCP_INC_STATS(net, TCP_MIB_OUTSEGS);
>  	local_bh_enable();
>  }
> diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
> index 57b5468b5139..f867658b4b30 100644
> --- a/net/ipv4/tcp_minisocks.c
> +++ b/net/ipv4/tcp_minisocks.c
> @@ -263,6 +263,7 @@ void tcp_time_wait(struct sock *sk, int state, int timeo)
>  		struct inet_sock *inet = inet_sk(sk);
>
>  		tw->tw_transparent	= inet->transparent;
> +		tw->tw_mark		= sk->sk_mark;
>  		tw->tw_rcv_wscale	= tp->rx_opt.rcv_wscale;
>  		tcptw->tw_rcv_nxt	= tp->rcv_nxt;
>  		tcptw->tw_snd_nxt	= tp->snd_nxt;
> diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
> index 6d664d83cd16..a6f876125091 100644
> --- a/net/ipv6/tcp_ipv6.c
> +++ b/net/ipv6/tcp_ipv6.c
> @@ -803,6 +803,7 @@ static void tcp_v6_send_response(const struct sock *sk, struct sk_buff *skb, u32
>  	unsigned int tot_len = sizeof(struct tcphdr);
>  	struct dst_entry *dst;
>  	__be32 *topt;
> +	__u32 mark = IP6_REPLY_MARK(net, skb->mark);
>
>  	if (tsecr)
>  		tot_len += TCPOLEN_TSTAMP_ALIGNED;
> @@ -871,11 +872,16 @@ static void tcp_v6_send_response(const struct sock *sk, struct sk_buff *skb, u32
>  		fl6.flowi6_oif = oif;
>  	}
>
> -	fl6.flowi6_mark = IP6_REPLY_MARK(net, skb->mark);
> +	if (sk && sk->sk_state == TCP_TIME_WAIT)
> +		ctl_sk->sk_mark = inet_twsk(sk)->tw_mark;
> +	else if (sk && sk_fullsock(sk))
> +		ctl_sk->sk_mark = sk->sk_mark;

Unfortunately IPv6 has a single net->ipv6.tcp_sk, shared by all CPUs.
So writing ctl_sk->sk_mark is racy on SMP hosts.

I would suggest using a local variable and not touching ctl_sk->sk_mark.

For consistency, you could do the same for IPv4, even if IPv4 currently uses per-cpu sockets.

> +	fl6.flowi6_mark = mark ? (mark) : ctl_sk->sk_mark;
>  	fl6.fl6_dport = t1->dest;
>  	fl6.fl6_sport = t1->source;
>  	fl6.flowi6_uid = sock_net_uid(net, sk && sk_fullsock(sk) ? sk : NULL);
>  	security_skb_classify_flow(skb, flowi6_to_flowi(&fl6));
> +	ctl_sk->sk_mark = 0;
>
>  	/* Pass a socket to ip6_dst_lookup either it is for RST
>  	 * Underlying function will use this to retrieve the network
>
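
To illustrate the local-variable suggestion, here is a rough userspace sketch of the mark selection; it is not kernel code. Everything prefixed with toy_ is a hypothetical stand-in: toy_reply_mark() models what IP4_REPLY_MARK()/IP6_REPLY_MARK() do (reflect the incoming mark only when fwmark_reflect is enabled), and struct toy_sock collapses sk->sk_mark and tw->tw_mark into one field. The chosen mark is computed in a local and returned, so no shared control socket is written or reset.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the kernel's socket kinds; not real definitions. */
enum toy_state { TOY_REQUEST, TOY_FULL, TOY_TIME_WAIT };

struct toy_sock {
	enum toy_state state;
	uint32_t mark;		/* stands in for sk->sk_mark or tw->tw_mark */
};

/* Models IP4_REPLY_MARK()/IP6_REPLY_MARK(): the incoming mark is
 * reflected only when the fwmark_reflect sysctl is enabled.
 */
uint32_t toy_reply_mark(int fwmark_reflect, uint32_t skb_mark)
{
	return fwmark_reflect ? skb_mark : 0;
}

/* Pick the mark for a reply without writing to any shared socket. */
uint32_t toy_select_mark(const struct toy_sock *sk,
			 int fwmark_reflect, uint32_t skb_mark)
{
	uint32_t reflected = toy_reply_mark(fwmark_reflect, skb_mark);
	uint32_t mark = 0;	/* local, so there is nothing to reset later */

	/* Only TIME_WAIT and full sockets carry a usable mark here. */
	if (sk && (sk->state == TOY_TIME_WAIT || sk->state == TOY_FULL))
		mark = sk->mark;

	/* GNU C shorthand for the same expression: reflected ?: mark */
	return reflected ? reflected : mark;
}

/* Small self-check of the precedence rules described in the patch. */
void toy_self_check(void)
{
	struct toy_sock tw = { TOY_TIME_WAIT, 42 };
	struct toy_sock req = { TOY_REQUEST, 42 };

	assert(toy_select_mark(&tw, 0, 7) == 42);	/* socket mark used */
	assert(toy_select_mark(&tw, 1, 7) == 7);	/* reflected mark wins */
	assert(toy_select_mark(&tw, 1, 0) == 42);	/* nothing to reflect */
	assert(toy_select_mark(&req, 0, 7) == 0);	/* request sock: no mark */
	assert(toy_select_mark(NULL, 0, 7) == 0);	/* no socket at all */
}
```

In the real functions the result would presumably feed flowi4_init_output() or fl6.flowi6_mark directly; how the IPv4 path would hand the value to ip_send_unicast_reply() is left open here.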