Received: by 2002:a05:6a10:2726:0:0:0:0 with SMTP id ib38csp2294416pxb; Fri, 25 Mar 2022 14:52:10 -0700 (PDT) X-Google-Smtp-Source: ABdhPJxWDyCA+EAtVDDlzK+L58iPZ+4G6P4+D+kxmot1hbn7x8TFLPziRKbLOi8+suwGNTdbQ5RT X-Received: by 2002:a17:902:d48e:b0:154:b6a:9ff with SMTP id c14-20020a170902d48e00b001540b6a09ffmr13954489plg.2.1648245129738; Fri, 25 Mar 2022 14:52:09 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1648245129; cv=none; d=google.com; s=arc-20160816; b=NBA6YXNeRPn2aFKpuTcJ4Ucn8bhQhYtNVwRhA2RqVGI6Q5F9pX/3zvEybobPJAfELv 1JnDaD+iUCa26c1gy0+zJKxIYOwGYVuwBBCPlftXuH8+Md0ySTw5iOZ8bz7jyK2M06kk PB9/iQqgFxwheBNfdrrk56PrOBBgL/raAUDR19HIf5/dntQpwfTwR02Jkyri7cZ5urKV bhfYaCBanPXzmGjArsWa23ycb/Uzhg4OD/jTtRaulhl4MOehjsiOIV0VValu3U8jXIV/ Klyavi084bQbIyVR4258Tv+HcOJxjvp5MiLGIkJXLLbVvO7MU9Q/R+f6dvKJAO5i81Og xwMQ== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:content-transfer-encoding:mime-version :user-agent:references:in-reply-to:date:cc:to:from:subject :message-id:dkim-signature; bh=QEDhIgJeIbzCTSaqjVqTZlpKCAowR+UcFnXwn8gtAfA=; b=tcJB/kjz2ksxTAWE5V0rYZEuDPBGwD3JvcFHbJS9iolJqttamEUGvZXNhwbM3ojlcB kkfzlS8lBYAlEPfRpwWz2Ue23bg30OE2ZrsZfc1ph68kaG2b7ei2LIl5gn1UnjejGzUE QkSZVrv8ErV9nBFBbJxvM/YuV9HF98fzLPTscgS1sYaezesan6k7ZUoAvr2H/gRwGm/g wF5eQl4RLM8vrxpRNPxEQfRszlN8hIpbovcL1W7CmrUz4YIvpvkvLfVtcFUJ9bXXOv2R 4/8U2iTV2TfR89Y52++UmkwtD8uFeKoYwBeFq+RfolOb2dDZliydIgQJY9autJtMvq73 OlJw== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@sipsolutions.net header.s=mail header.b=pf8OVm9S; spf=softfail (google.com: domain of transitioning linux-wireless-owner@vger.kernel.org does not designate 23.128.96.19 as permitted sender) smtp.mailfrom=linux-wireless-owner@vger.kernel.org; dmarc=pass (p=NONE sp=REJECT dis=NONE) header.from=sipsolutions.net Return-Path: Received: from lindbergh.monkeyblade.net (lindbergh.monkeyblade.net. [23.128.96.19]) by mx.google.com with ESMTPS id a3-20020a170902ecc300b00153ee8f2c7fsi3904774plh.7.2022.03.25.14.52.09 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Fri, 25 Mar 2022 14:52:09 -0700 (PDT) Received-SPF: softfail (google.com: domain of transitioning linux-wireless-owner@vger.kernel.org does not designate 23.128.96.19 as permitted sender) client-ip=23.128.96.19; Authentication-Results: mx.google.com; dkim=pass header.i=@sipsolutions.net header.s=mail header.b=pf8OVm9S; spf=softfail (google.com: domain of transitioning linux-wireless-owner@vger.kernel.org does not designate 23.128.96.19 as permitted sender) smtp.mailfrom=linux-wireless-owner@vger.kernel.org; dmarc=pass (p=NONE sp=REJECT dis=NONE) header.from=sipsolutions.net Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by lindbergh.monkeyblade.net (Postfix) with ESMTP id F1C0023C0D2; Fri, 25 Mar 2022 14:25:23 -0700 (PDT) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S233386AbiCYV0z (ORCPT + 70 others); Fri, 25 Mar 2022 17:26:55 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:54132 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S233363AbiCYV0w (ORCPT ); Fri, 25 Mar 2022 17:26:52 -0400 Received: from sipsolutions.net (s3.sipsolutions.net [IPv6:2a01:4f8:191:4433::2]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 3759123932C; Fri, 25 Mar 2022 14:25:16 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=sipsolutions.net; s=mail; h=Content-Transfer-Encoding:MIME-Version: Content-Type:References:In-Reply-To:Date:Cc:To:From:Subject:Message-ID:Sender :Reply-To:Content-ID:Content-Description:Resent-Date:Resent-From:Resent-To: Resent-Cc:Resent-Message-ID; bh=QEDhIgJeIbzCTSaqjVqTZlpKCAowR+UcFnXwn8gtAfA=; t=1648243516; x=1649453116; b=pf8OVm9SPCIjev//bYF0G+SVdxQ9cy1rcFiKdhlOzQgt58D MAGmC/4Byaal5ODDkraahfVJWGOn7MCS6vG0uTChP0UYTa8kZxdC0eZw2tdVX9kByv1Z74AjDu3cW 1LWKYuijksu9NuxrSVHZKt6ZkwtCZ3/N+1lT8uTGrd7oltKxuoDZat1zty3tvhffh9o9dyXmeUVco aCCY/QprEZlfAS6A7lco2UfUgrzwS6Kt8RzptXbt5kK3e/dWg6o31pEE/Oc6PS2dw8UqPmKTMh/az BALcmIMsMtT439hfI4CQiOI2gLQY4vtSOsG/5AlkVEImqQ02mrmZujHxSnOucHUw==; Received: by sipsolutions.net with esmtpsa (TLS1.3:ECDHE_SECP256R1__RSA_PSS_RSAE_SHA256__AES_256_GCM:256) (Exim 4.95) (envelope-from ) id 1nXrQg-000VlP-Hr; Fri, 25 Mar 2022 22:25:06 +0100 Message-ID: <46b8555d4cded50bc5573fd9b7dd444021317a6b.camel@sipsolutions.net> Subject: Re: [BUG] deadlock in nl80211_vendor_cmd From: Johannes Berg To: Jakub Kicinski Cc: William McVicker , linux-wireless@vger.kernel.org, Marek Szyprowski , Kalle Valo , "David S. Miller" , netdev@vger.kernel.org, Amitkumar Karwar , Xinming Hu , kernel-team@android.com, Paolo Abeni , Eric Dumazet , Cong Wang , Cong Wang , "Eric W. Biederman" Date: Fri, 25 Mar 2022 22:25:05 +0100 In-Reply-To: <20220325134040.0d98835b@kicinski-fedora-pc1c0hjn.dhcp.thefacebook.com> References: <0000000000009e9b7105da6d1779@google.com> <99eda6d1dad3ff49435b74e539488091642b10a8.camel@sipsolutions.net> <5d5cf050-7de0-7bad-2407-276970222635@quicinc.com> <19e12e6b5f04ba9e5b192001fbe31a3fc47d380a.camel@sipsolutions.net> <20220325094952.10c46350@kicinski-fedora-pc1c0hjn.dhcp.thefacebook.com> <20220325134040.0d98835b@kicinski-fedora-pc1c0hjn.dhcp.thefacebook.com> Content-Type: text/plain; charset="UTF-8" User-Agent: Evolution 3.42.4 (3.42.4-1.fc35) MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-malware-bazaar: not-scanned X-Spam-Status: No, score=-2.0 required=5.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,HEADER_FROM_DIFFERENT_DOMAINS, MAILING_LIST_MULTI,RDNS_NONE,SPF_HELO_NONE,T_SCC_BODY_TEXT_LINE autolearn=no autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: X-Mailing-List: linux-wireless@vger.kernel.org On Fri, 2022-03-25 at 13:40 -0700, Jakub Kicinski wrote: > On Fri, 25 Mar 2022 18:01:30 +0100 Johannes Berg wrote: > > That's not a bad idea, but I think I wouldn't want to backport that, so > > separately :) I don't think that fundamentally changes the locking > > properties though. > > > > > > Couple of more questions I guess: First, are we assuming that the > > cfg80211 code *is* actually broken, even if it looks like nothing can > > cause the situation, due to the empty todo list? > > Right. I guess that the below is basically saying "it's not really broken" :) > > Given that we have rtnl_lock_unregistering() (and also > > rtnl_lock_unregistering_all()), it looks like we *do* in fact at least > > not want to make an assumption that no user of __rtnl_unlock() can have > > added a todo item. > > > > I mean, there's technically yet *another* thing we could do - something > > like this: > > > > [this doesn't compile, need to suitably make net_todo_list non-static] > > --- a/net/core/rtnetlink.c > > +++ b/net/core/rtnetlink.c > > @@ -95,6 +95,7 @@ void __rtnl_unlock(void) > > > > defer_kfree_skb_list = NULL; > > > > + WARN_ON(!list_empty(&net_todo_list)); > > mutex_unlock(&rtnl_mutex); > > > > while (head) { > > Yeah, I think we could do that. It seems that would be simpler. Even if I might eventually still want to do the cfg80211 change, but it would give us some confidence that this really cannot be happening anywhere. > TBH I don't know what you mean by rtnl_lock_unregistering(), I don't > have that in my tree. rtnl_lock_unregistering_all() can't hurt the case > we're talking about AFAICT. > > Eric removed some of the netns / loopback dependencies in net-next, > make sure you pull! Sorry, yeah, I was looking at an older tree where I was testing on in the simulation environment - this disappeared in commit 8a4fc54b07d7 ("net: get rid of rtnl_lock_unregistering()"). > > With some suitable commentary, that might also be a reasonable thing? > > __rtnl_unlock() is actually rather pretty rare, and not exported. > > The main use for it seems to be re-locking before loading a module, > which TBH I have no idea why, is it just a cargo cult or a historical > thing :S I don't see how letting netdevs leave before _loading_ > a module makes any difference whatsoever. Indeed. > The WARN_ON() you suggested up front make perfect sense to me. > You can also take the definition of net_unlink_todo() out of > netdevice.h while at it because o_0 Heh indeed, what? But (and now I'll CC even more people...) if we can actually have an invariant that while RTNL is unlocked the todo list is empty, then we also don't need rtnl_lock_unregistering_all(), and can remove the netdev_unregistering_wq, etc., no? IOW, I'm not sure why we needed commit 50624c934db1 ("net: Delay default_device_exit_batch until no devices are unregistering v2"), but I also have little doubt that we did. Ah, no. This isn't about locking in this case, it's literally about ensuring that free_netdev() has been called in netdev_run_todo()? Which we don't care about in cfg80211 - we just care about the list being empty so there's no chance we'll reacquire the RTNL. johannes