Subject: Re: [PATCH] netfilter: nf_conntrack: fix RCU race in nf_conntrack_find_get (v2)
From: Eric Dumazet
To: Andrey Vagin
Cc: netfilter-devel@vger.kernel.org, netfilter@vger.kernel.org,
    coreteam@netfilter.org, netdev@vger.kernel.org,
    linux-kernel@vger.kernel.org, vvs@openvz.org,
    Florian Westphal, Pablo Neira Ayuso, Patrick McHardy,
    Jozsef Kadlecsik, "David S. Miller", Cyrill Gorcunov
Date: Wed, 08 Jan 2014 05:47:21 -0800
In-Reply-To: <1389187051-7794-1-git-send-email-avagin@openvz.org>

On Wed, 2014-01-08 at 17:17 +0400, Andrey Vagin wrote:
> Let's look at destroy_conntrack:
>
> hlist_nulls_del_rcu(&ct->tuplehash[IP_CT_DIR_ORIGINAL].hnnode);
> ...
> nf_conntrack_free(ct)
> 	kmem_cache_free(net->ct.nf_conntrack_cachep, ct);
>
> net->ct.nf_conntrack_cachep is created with SLAB_DESTROY_BY_RCU.
>
> The hash is protected by RCU, so readers look up conntracks without
> locks.
> A conntrack is removed from the hash, but at this moment a few readers
> can still use the conntrack. Then this conntrack is released, and another
> thread creates a conntrack with the same address and an equal tuple.
> After this, a reader starts to validate the conntrack:
> * It's not dying, because a new conntrack was created.
> * nf_ct_tuple_equal() returns true.
>
> But this conntrack is not initialized yet, so it cannot be used by two
> threads concurrently. In this case a BUG_ON may be triggered from
> nf_nat_setup_info().
>
> Florian Westphal suggested checking the confirmed bit too. I think
> he's right.
>
> task 1				task 2			task 3
> nf_conntrack_find_get
>  ____nf_conntrack_find
> 				destroy_conntrack
> 				 hlist_nulls_del_rcu
> 				 nf_conntrack_free
> 				 kmem_cache_free
> 							__nf_conntrack_alloc
> 							 kmem_cache_alloc
> 							 memset(&ct->tuplehash[IP_CT_DIR_MAX],
> if (nf_ct_is_dying(ct))
> if (!nf_ct_tuple_equal()
>
> I'm not sure that I have ever seen this race condition in real life.
> Currently we are investigating a bug which is reproduced on a few
> nodes. In our case one conntrack is initialized by a few tasks
> concurrently; we don't have any other explanation for this.
>
> <2>[46267.083061] kernel BUG at net/ipv4/netfilter/nf_nat_core.c:322!
> ...
> <4>[46267.083951] RIP: 0010:[] [] nf_nat_setup_info+0x564/0x590 [nf_nat]
> ...
> <4>[46267.085549] Call Trace:
> <4>[46267.085622] [] alloc_null_binding+0x5b/0xa0 [iptable_nat]
> <4>[46267.085697] [] nf_nat_rule_find+0x5c/0x80 [iptable_nat]
> <4>[46267.085770] [] nf_nat_fn+0x111/0x260 [iptable_nat]
> <4>[46267.085843] [] nf_nat_out+0x48/0xd0 [iptable_nat]
> <4>[46267.085919] [] nf_iterate+0x69/0xb0
> <4>[46267.085991] [] ? ip_finish_output+0x0/0x2f0
> <4>[46267.086063] [] nf_hook_slow+0x74/0x110
> <4>[46267.086133] [] ? ip_finish_output+0x0/0x2f0
> <4>[46267.086207] [] ? dst_output+0x0/0x20
> <4>[46267.086277] [] ip_output+0xa4/0xc0
> <4>[46267.086346] [] raw_sendmsg+0x8b4/0x910
> <4>[46267.086419] [] inet_sendmsg+0x4a/0xb0
> <4>[46267.086491] [] ? sock_update_classid+0x3a/0x50
> <4>[46267.086562] [] sock_sendmsg+0x117/0x140
> <4>[46267.086638] [] ? _spin_unlock_bh+0x1b/0x20
> <4>[46267.086712] [] ? autoremove_wake_function+0x0/0x40
> <4>[46267.086785] [] ? do_ip_setsockopt+0x90/0xd80
> <4>[46267.086858] [] ? call_function_interrupt+0xe/0x20
> <4>[46267.086936] [] ? ub_slab_ptr+0x20/0x90
> <4>[46267.087006] [] ? ub_slab_ptr+0x20/0x90
> <4>[46267.087081] [] ? kmem_cache_alloc+0xd8/0x1e0
> <4>[46267.087151] [] sys_sendto+0x139/0x190
> <4>[46267.087229] [] ? sock_setsockopt+0x16d/0x6f0
> <4>[46267.087303] [] ? audit_syscall_entry+0x1d7/0x200
> <4>[46267.087378] [] ? __audit_syscall_exit+0x265/0x290
> <4>[46267.087454] [] ? compat_sys_setsockopt+0x75/0x210
> <4>[46267.087531] [] compat_sys_socketcall+0x13f/0x210
> <4>[46267.087607] [] ia32_sysret+0x0/0x5
> <4>[46267.087676] Code: 91 20 e2 01 75 29 48 89 de 4c 89 f7 e8 56 fa ff ff 85 c0 0f 84 68 fc ff ff 0f b6 4d c6 41 8b 45 00 e9 4d fb ff ff e8 7c 19 e9 e0 <0f> 0b eb fe f6 05 17 91 20 e2 80 74 ce 80 3d 5f 2e 00 00 00 74
> <1>[46267.088023] RIP [] nf_nat_setup_info+0x564/0x590
>
> v2: move nf_ct_is_confirmed into the unlikely() annotation
>
> Cc: Florian Westphal
> Cc: Pablo Neira Ayuso
> Cc: Patrick McHardy
> Cc: Jozsef Kadlecsik
> Cc: "David S. Miller"
> Cc: Cyrill Gorcunov
> Signed-off-by: Andrey Vagin
> ---
>  net/netfilter/nf_conntrack_core.c | 6 +++++-
>  1 file changed, 5 insertions(+), 1 deletion(-)
>
> diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
> index 43549eb..403f634 100644
> --- a/net/netfilter/nf_conntrack_core.c
> +++ b/net/netfilter/nf_conntrack_core.c
> @@ -387,8 +387,12 @@ begin:
>  			     !atomic_inc_not_zero(&ct->ct_general.use)))
>  			h = NULL;
>  		else {
> +			/* A conntrack can be recreated with the equal tuple,
> +			 * so we need to check that the conntrack is initialized
> +			 */
>  			if (unlikely(!nf_ct_tuple_equal(tuple, &h->tuple) ||
> -				     nf_ct_zone(ct) != zone)) {
> +				     nf_ct_zone(ct) != zone ||
> +				     !nf_ct_is_confirmed(ct))) {
>  				nf_ct_put(ct);
>  				goto begin;
>  			}

I am still not convinced of this being the right fix.
The key we test after taking the refcount should be the same key that we
test before taking the refcount; otherwise we might add a loop here.

Your patch did not change ____nf_conntrack_find(), so I find it confusing.