Received: by 2002:a05:7412:37c9:b0:e2:908c:2ebd with SMTP id jz9csp2981172rdb; Fri, 22 Sep 2023 14:10:05 -0700 (PDT) X-Google-Smtp-Source: AGHT+IHBMmLLXG2ng4PVa7PblfDJ35H6TFQ7EfX8pIX6q6ggm76jO5PHtXmRXR35VGxGpufbJp4K X-Received: by 2002:a05:6a00:1818:b0:690:2ecd:a59c with SMTP id y24-20020a056a00181800b006902ecda59cmr586128pfa.23.1695417005122; Fri, 22 Sep 2023 14:10:05 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1695417005; cv=none; d=google.com; s=arc-20160816; b=vItb+JlJODA2pbGqTxsdSstaXBKOS/xmiMIQR6NC4E3D5mhvQpFGLwe+D8fO4YoQOZ 9pZ1Gd3KF2zo0NfjJ5V5mIDWRs7a7ElRWQqQOFx1F/o1p+rFcb8jg8buT6IKajs+54H7 ANhyZ3t3Y6YOpp+zGl4OXvKZv6g/KT/ZTznVUjikQdTP7lczMjz5wXxE/8GtpbC+Umcw Q7NUcmjVpZ3/TbcQHIhOE4QK7i05eyPPZC7rwoWxRX1VQpMzBSsZgj6d/GbpLXhFioym bBtFHTxWNT+vKIdQW1az2zfAmBXkbuBm0NGYpWRq/rC2YSKeXW47srBErItSNkQ0lI6u Q8qg== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:content-transfer-encoding:in-reply-to:from :references:to:content-language:subject:cc:user-agent:mime-version :date:message-id; bh=IRW/lk4swIdTh0OCLE5TjQmqpNzlA7yy4SWBqFRdNpk=; fh=bRBECP3NSWC2EB6kvFimGamO3U5FvSwaOIbr3BvpUP4=; b=CHJgGX2Cx5NxbutchMEU2e+g0KSlVwLir71CpzCpfexKBDZbU2B/M5AhHRbFscYS06 iAKBG1qYIY6GmlsfAFHp7DjgKkfNCtG4UWTgMohT2L6WVFxV0drxI0orV0D6yxDvKr/9 S3NLwF+h3l8vDSFthweOToGDsPHI1z243GS1ujx86c4+7XLnHvLNkYeakdloosEACrmy K0/kELFStIAvrIdKKWnTOicj+Jsxnnh/qdzj55g22MqZNc5VrfDp8AnDGq0g9gNw3dHl yR9v1G/swqnPH3NtygenkHiABIlJScpr11maX60xRtyjk54p79poJlvtf0ZAK5zKXiSf kxzg== ARC-Authentication-Results: i=1; mx.google.com; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.38 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org Return-Path: Received: from fry.vger.email (fry.vger.email. [23.128.96.38]) by mx.google.com with ESMTPS id g25-20020a62e319000000b00690c19cb105si4326001pfh.250.2023.09.22.14.10.04 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Fri, 22 Sep 2023 14:10:05 -0700 (PDT) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.38 as permitted sender) client-ip=23.128.96.38; Authentication-Results: mx.google.com; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.38 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org Received: from out1.vger.email (depot.vger.email [IPv6:2620:137:e000::3:0]) by fry.vger.email (Postfix) with ESMTP id 6CF5280725FD; Fri, 22 Sep 2023 14:08:27 -0700 (PDT) X-Virus-Status: Clean X-Virus-Scanned: clamav-milter 0.103.10 at fry.vger.email Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229845AbjIVVIW (ORCPT + 99 others); Fri, 22 Sep 2023 17:08:22 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:42254 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229798AbjIVVIV (ORCPT ); Fri, 22 Sep 2023 17:08:21 -0400 Received: from relay3-d.mail.gandi.net (relay3-d.mail.gandi.net [IPv6:2001:4b98:dc4:8::223]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 756B7C1; Fri, 22 Sep 2023 14:08:14 -0700 (PDT) Received: by mail.gandi.net (Postfix) with ESMTPSA id 582AD60004; Fri, 22 Sep 2023 21:08:10 +0000 (UTC) Message-ID: <68103640-acb0-b2f8-fcde-91ab3ed5beab@ovn.org> Date: Fri, 22 Sep 2023 23:09:02 +0200 MIME-Version: 1.0 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:102.0) Gecko/20100101 Thunderbird/102.13.0 Cc: i.maximets@ovn.org, "David S. Miller" , dev@openvswitch.org, linux-kernel@vger.kernel.org, Eric Dumazet , Jakub Kicinski , Paolo Abeni , Florian Westphal , Frode Nordahl , Madhu Koriginja Subject: Re: [PATCH net] net: ensure all external references are released in deferred skbuffs Content-Language: en-US To: netdev@vger.kernel.org References: <20220619003919.394622-1-i.maximets@ovn.org> From: Ilya Maximets In-Reply-To: Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 7bit X-GND-Sasl: i.maximets@ovn.org X-Spam-Status: No, score=1.3 required=5.0 tests=HEADER_FROM_DIFFERENT_DOMAINS, MAILING_LIST_MULTI,NICE_REPLY_A,RCVD_IN_SBL_CSS,SPF_HELO_NONE,SPF_PASS autolearn=no autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on fry.vger.email Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org X-Greylist: Sender passed SPF test, not delayed by milter-greylist-4.6.4 (fry.vger.email [0.0.0.0]); Fri, 22 Sep 2023 14:08:27 -0700 (PDT) X-Spam-Level: * On 9/22/23 21:07, Ilya Maximets wrote: > On 9/22/23 15:26, Ilya Maximets wrote: >> On 6/19/22 02:39, Ilya Maximets wrote: >>> Open vSwitch system test suite is broken due to inability to >>> load/unload netfilter modules. kworker thread is getting trapped >>> in the infinite loop while running a net cleanup inside the >>> nf_conntrack_cleanup_net_list, because deferred skbuffs are still >>> holding nfct references and not being freed by their CPU cores. >>> >>> In general, the idea that we will have an rx interrupt on every >>> CPU core at some point in a near future doesn't seem correct. >>> Devices are getting created and destroyed, interrupts are getting >>> re-scheduled, CPUs are going online and offline dynamically. >>> Any of these events may leave packets stuck in defer list for a >>> long time. It might be OK, if they are just a piece of memory, >>> but we can't afford them holding references to any other resources. >>> >>> In case of OVS, nfct reference keeps the kernel thread in busy loop >>> while holding a 'pernet_ops_rwsem' semaphore. That blocks the >>> later modprobe request from user space: >>> >>> # ps >>> 299 root R 99.3 200:25.89 kworker/u96:4+ >>> >>> # journalctl >>> INFO: task modprobe:11787 blocked for more than 1228 seconds. >>> Not tainted 5.19.0-rc2 #8 >>> task:modprobe state:D >>> Call Trace: >>> >>> __schedule+0x8aa/0x21d0 >>> schedule+0xcc/0x200 >>> rwsem_down_write_slowpath+0x8e4/0x1580 >>> down_write+0xfc/0x140 >>> register_pernet_subsys+0x15/0x40 >>> nf_nat_init+0xb6/0x1000 [nf_nat] >>> do_one_initcall+0xbb/0x410 >>> do_init_module+0x1b4/0x640 >>> load_module+0x4c1b/0x58d0 >>> __do_sys_init_module+0x1d7/0x220 >>> do_syscall_64+0x3a/0x80 >>> entry_SYSCALL_64_after_hwframe+0x46/0xb0 >>> >>> At this point OVS testsuite is unresponsive and never recover, >>> because these skbuffs are never freed. >>> >>> Solution is to make sure no external references attached to skb >>> before pushing it to the defer list. Using skb_release_head_state() >>> for that purpose. The function modified to be re-enterable, as it >>> will be called again during the defer list flush. >>> >>> Another approach that can fix the OVS use-case, is to kick all >>> cores while waiting for references to be released during the net >>> cleanup. But that sounds more like a workaround for a current >>> issue rather than a proper solution and will not cover possible >>> issues in other parts of the code. >>> >>> Additionally checking for skb_zcopy() while deferring. This might >>> not be necessary, as I'm not sure if we can actually have zero copy >>> packets on this path, but seems worth having for completeness as we >>> should never defer such packets regardless. >>> >>> CC: Eric Dumazet >>> Fixes: 68822bdf76f1 ("net: generalize skb freeing deferral to per-cpu lists") >>> Signed-off-by: Ilya Maximets >>> --- >> >> Happy Friday! :) >> >> Resurrecting this thread because I managed to reproduce the issue again >> on a latest 6.6.0-rc1. >> >> (It doesn't mean we need to accept this particular patch, I just think >> that it's an appropriate discussion thread.) >> >> It's a bit different testsuite this time. Last year I had an issue with >> OVS testsuite, today I have an issue with OVN testsuite. Their structure >> is very similar, but OVN tests are a fair bit more complex. >> >> The story is the same: Each test loads a pack of kernel modules including >> OVS and nf_conntrack, sends some traffic, verifies OVN functionality, >> stops OVN/OVS and unloads all the previously loaded modules to clean up >> the space for the next tests. >> >> Kernel hangs in the same way as before waiting for nf_conntrack module >> to be unloaded: >> >> 13 root R 100.0 933:17.98 kworker/u80:1+netns >> >> CPU: 12 PID: 13 Comm: kworker/u80:1 Not tainted 6.6.0-rc1+ #7 >> Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.1-2.fc36 04/01/2014 >> Workqueue: netns cleanup_net >> RIP: 0010:nf_ct_iterate_cleanup+0x35/0x1e0 [nf_conntrack] >> Code: 56 41 55 41 54 49 89 fc 55 31 ed 53 e8 a4 db a8 >> ed 48 c7 c7 20 e9 e2 c0 e8 48 f3 a8 ed 39 2d 7e >> 2d 01 00 77 14 e9 05 01 00 00 <83> c5 01 3b 2d 6e >> 2d 01 00 0f 83 f6 00 00 00 48 8b 15 75 2d 01 00 >> RSP: 0018:ffffb23040073d98 EFLAGS: 00000202 >> RAX: 000000000001ce57 RBX: ffff98dbfac73958 RCX: ffff98e31f732b38 >> RDX: ffff98dbfac00000 RSI: ffffb23040073dd0 RDI: ffffffffc0e2e920 >> RBP: 000000000000e72b R08: ffff98e31f732b38 R09: ffff98e31f732b38 >> R10: 000000000001d5ce R11: 0000000000000000 R12: ffffffffc0e1b080 >> R13: ffffb23040073e28 R14: ffff98dbc47b0600 R15: ffffb23040073dd0 >> FS: 0000000000000000(0000) GS:ffff98e31f700000(0000) knlGS:0000000000000000 >> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 >> CR2: 00007f98bd7bca80 CR3: 000000076f420004 CR4: 0000000000370ee0 >> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 >> DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 >> Call Trace: >> >> ? nmi_cpu_backtrace+0x82/0xf0 >> ? nmi_cpu_backtrace_handler+0xd/0x20 >> ? nmi_handle+0x5e/0x150 >> ? default_do_nmi+0x40/0x100 >> ? exc_nmi+0x112/0x190 >> ? end_repeat_nmi+0x16/0x67 >> ? __pfx_kill_all+0x10/0x10 [nf_conntrack] >> ? nf_ct_iterate_cleanup+0x35/0x1e0 [nf_conntrack] >> ? nf_ct_iterate_cleanup+0x35/0x1e0 [nf_conntrack] >> ? nf_ct_iterate_cleanup+0x35/0x1e0 [nf_conntrack] >> >> >> nf_conntrack_cleanup_net_list+0xac/0x140 [nf_conntrack] >> cleanup_net+0x213/0x3b0 >> process_one_work+0x177/0x340 >> worker_thread+0x27e/0x390 >> ? __pfx_worker_thread+0x10/0x10 >> kthread+0xe2/0x110 >> ? __pfx_kthread+0x10/0x10 >> ret_from_fork+0x30/0x50 >> ? __pfx_kthread+0x10/0x10 >> ret_from_fork_asm+0x1b/0x30 >> >> >> Since this testsuite is different from what I had issues with before, >> I don't really know when the issue first appeared. It may have been >> there even last year. >> >> I can reproduce the issue with ubuntu 6.2.0-32-generic kernel, but >> I don't see it with the latest 5.14.0-284.25.1.el9_2.x86_64 on RHEL9. >> It doesn't necessarily mean that RHEL9 doesn't have the issue though, >> the testsuite is not 100% reliable. >> >> I'll try to dig deeper and bisect the problem on the upstream kernel. >> >> For now it seems like the issue is likely in IPv6 code, because the >> tests that trigger it are mostly IPv6-related. > > So, I bisected this down to an unsurprising commit: > > commit b0e214d212030fe497d4d150bb3474e50ad5d093 > Author: Madhu Koriginja > Date: Tue Mar 21 21:28:44 2023 +0530 > > netfilter: keep conntrack reference until IPsecv6 policy checks are done > > Keep the conntrack reference until policy checks have been performed for > IPsec V6 NAT support, just like ipv4. > > The reference needs to be dropped before a packet is > queued to avoid having the conntrack module unloadable. > > Fixes: 58a317f1061c ("netfilter: ipv6: add IPv6 NAT support") > Signed-off-by: Madhu Koriginja > Signed-off-by: Florian Westphal > > This commit repeated the pattern from the old ipv4 commit b59c270104f0 > ("[NETFILTER]: Keep conntrack reference until IPsec policy checks are done") > and hence introduced the exact same issue, but for IPv6. They both failed > to deliver on the "avoid having the conntrack module unloadable" part. > > IIUC, the fix should be exactly the same as Eric did for ipv4 in commit > 6f0012e35160 ("tcp: add a missing nf_reset_ct() in 3WHS handling"). > > I can try that and send a fix. Posted a fix here: https://lore.kernel.org/netdev/20230922210530.2045146-1-i.maximets@ovn.org/ It fixes the problem I have with OVN testsuite. > > Would still really like to have some preventive mechanism for this kind of > issues though. Any ideas on that? > > Best regards, Ilya Maximets.