2022-03-17 17:43:38

by William McVicker

[permalink] [raw]
Subject: [BUG] deadlock in nl80211_vendor_cmd

Hi,

I wanted to report a deadlock that I'm hitting as a result of the upstream
commit a05829a7222e ("cfg80211: avoid holding the RTNL when calling the
driver"). I'm using the Pixel 6 with downstream version of the 5.15 kernel,
but I'm pretty sure this will happen on the upstream tip-of-tree kernel as
well.

Basically, my wlan driver uses the wiphy_vendor_command ops to handle
a number of vendor specific operations. One of them in particular deletes
a cfg80211 interface. The deadlock happens when thread 1 tries to take the
RTNL lock before calling cfg80211_unregister_device() while thread 2 is
inside nl80211_pre_doit(), holding the RTNL lock, and waiting on
wiphy_lock().

Here is the call flow:

Thread 1: Thread 2:

nl80211_pre_doit():
-> rtnl_lock()
nl80211_pre_doit():
-> rtnl_lock()
-> <blocked by Thread 1>
-> wiphy_lock()
-> rtnl_unlock()
-> <unblock Thread 1>
exit nl80211_pre_doit()
<Thread 2 got the RTNL lock>
-> wiphy_lock()
-> <blocked by Thread 1>
nl80211_doit()
-> nl80211_vendor_cmd()
-> rtnl_lock() <DEADLOCK>
-> cfg80211_unregister_device()
-> rtnl_unlock()


To be complete, here are the kernel call traces when the deadlock occurs:

Thread 1 Call trace:
<Take rtnl before calling cfg80211_unregister_device()>
nl80211_vendor_cmd+0x210/0x218
genl_rcv_msg+0x3ac/0x45c
netlink_rcv_skb+0x130/0x168
genl_rcv+0x38/0x54
netlink_unicast_kernel+0xe4/0x1f4
netlink_unicast+0x128/0x21c
netlink_sendmsg+0x2d8/0x3d8

Thread 2 Call trace:
<Take wiphy_lock>
nl80211_pre_doit+0x1b0/0x250
genl_rcv_msg+0x37c/0x45c
netlink_rcv_skb+0x130/0x168
genl_rcv+0x38/0x54
netlink_unicast_kernel+0xe4/0x1f4
netlink_unicast+0x128/0x21c
netlink_sendmsg+0x2d8/0x3d8

I'm not an networking expert. So my main question is if I'm allowed to take
the RTNL lock inside the nl80211_vendor_cmd callbacks? If so, then
regardless of why I take it, we shouldn't be allowing this deadlock
situation, right?

I hope that helps explain the issue. Let me know if you need any more
details.

Thanks,
Will


2022-03-21 22:12:45

by Johannes Berg

[permalink] [raw]
Subject: Re: [BUG] deadlock in nl80211_vendor_cmd

Hi,


> Basically, my wlan driver uses the wiphy_vendor_command ops to handle
> a number of vendor specific operations.
>

I guess it's an out-of-tree driver, since I (hope I) fixed all the
issues in the code here ... :)

> One of them in particular deletes
> a cfg80211 interface.

There's quite normal API for that, why would you do that?!

> The deadlock happens when thread 1 tries to take the
> RTNL lock before calling cfg80211_unregister_device() while thread 2 is
> inside nl80211_pre_doit(), holding the RTNL lock, and waiting on
> wiphy_lock().
>
> Here is the call flow:
>
> Thread 1: Thread 2:
>
> nl80211_pre_doit():
> -> rtnl_lock()
> nl80211_pre_doit():
> -> rtnl_lock()
> -> <blocked by Thread 1>
> -> wiphy_lock()
> -> rtnl_unlock()
> -> <unblock Thread 1>
> exit nl80211_pre_doit()
> <Thread 2 got the RTNL lock>
> -> wiphy_lock()
> -> <blocked by Thread 1>
> nl80211_doit()
> -> nl80211_vendor_cmd()
> -> rtnl_lock() <DEADLOCK>

Yeah, I guess the way we invoke vendor commands now w/o RTNL held means
you cannot safely acquire RTNL in them.

I mean, the whole above thing basically collapses down to

Thread 1 Thread 2
wiphy_lock(); // nl80211
rtnl_lock();
wiphy_lock();
rtnl_lock(); // your driver

The correct order to _acquire_ these is rtnl -> wiphy, and we do it that
way around everywhere (else).

> I'm not an networking expert. So my main question is if I'm allowed to take
> the RTNL lock inside the nl80211_vendor_cmd callbacks?

Evidently, you're not. It's interesting though, it used to be that we
called these with the RTNL held, now we don't, and the driver you're
using somehow "got fixed" to take it, but whoever fixed it didn't take
into account that this is not possible?

> I hope that helps explain the issue. Let me know if you need any more
> details.

It does, but I don't think there's any way to fix it. You just
fundamentally cannot acquire the RTNL in a vendor command operation
since that introduced the ABBA deadlock you observed.

Since it's an out-of-tree driver that's about as much as I can help.

johannes

2022-03-21 23:22:07

by William McVicker

[permalink] [raw]
Subject: Re: [BUG] deadlock in nl80211_vendor_cmd

On 03/21/2022, Will McVicker wrote:
> On Thu, Mar 17, 2022 at 10:09 AM <[email protected]> wrote:
>
> > Hi,
> >
> > I wanted to report a deadlock that I'm hitting as a result of the upstream
> > commit a05829a7222e ("cfg80211: avoid holding the RTNL when calling the
> > driver"). I'm using the Pixel 6 with downstream version of the 5.15 kernel,
> > but I'm pretty sure this will happen on the upstream tip-of-tree kernel as
> > well.
> >
> > Basically, my wlan driver uses the wiphy_vendor_command ops to handle
> > a number of vendor specific operations. One of them in particular deletes
> > a cfg80211 interface. The deadlock happens when thread 1 tries to take the
> > RTNL lock before calling cfg80211_unregister_device() while thread 2 is
> > inside nl80211_pre_doit(), holding the RTNL lock, and waiting on
> > wiphy_lock().
> >
> > Here is the call flow:
> >
> > Thread 1: Thread 2:
> >
> > nl80211_pre_doit():
> > -> rtnl_lock()
> > nl80211_pre_doit():
> > -> rtnl_lock()
> > -> <blocked by Thread 1>
> > -> wiphy_lock()
> > -> rtnl_unlock()
> > -> <unblock Thread 1>
> > exit nl80211_pre_doit()
> > <Thread 2 got the RTNL lock>
> > -> wiphy_lock()
> > -> <blocked by Thread 1>
> > nl80211_doit()
> > -> nl80211_vendor_cmd()
> > -> rtnl_lock() <DEADLOCK>
> > -> cfg80211_unregister_device()
> > -> rtnl_unlock()
> >
> >
> > To be complete, here are the kernel call traces when the deadlock occurs:
> >
> > Thread 1 Call trace:
> > <Take rtnl before calling cfg80211_unregister_device()>
> > nl80211_vendor_cmd+0x210/0x218
> > genl_rcv_msg+0x3ac/0x45c
> > netlink_rcv_skb+0x130/0x168
> > genl_rcv+0x38/0x54
> > netlink_unicast_kernel+0xe4/0x1f4
> > netlink_unicast+0x128/0x21c
> > netlink_sendmsg+0x2d8/0x3d8
> >
> > Thread 2 Call trace:
> > <Take wiphy_lock>
> > nl80211_pre_doit+0x1b0/0x250
> > genl_rcv_msg+0x37c/0x45c
> > netlink_rcv_skb+0x130/0x168
> > genl_rcv+0x38/0x54
> > netlink_unicast_kernel+0xe4/0x1f4
> > netlink_unicast+0x128/0x21c
> > netlink_sendmsg+0x2d8/0x3d8
> >
> > I'm not an networking expert. So my main question is if I'm allowed to take
> > the RTNL lock inside the nl80211_vendor_cmd callbacks? If so, then
> > regardless of why I take it, we shouldn't be allowing this deadlock
> > situation, right?
> >
> > I hope that helps explain the issue. Let me know if you need any more
> > details.
> >
> > Thanks,
> > Will
> >
>
> Sorry my CC list got dropped. Adding the following:
>
> Kalle Valo <[email protected]>
> "David S. Miller" <[email protected]>
> Jakub Kicinski <[email protected]>
> [email protected]
> Amitkumar Karwar <[email protected]>
> Ganapathi Bhat <[email protected]>
> Xinming Hu <[email protected]>
> [email protected]

Sorry for the noise. The lists bounced due to html. Resending with mutt to make
sure everyone gets this message.

As an update, I was able to fix the deadlock by updating nl80211_pre_doit() to
not hold the RTNL lock while waiting to get the wiphy_lock. This allows us to
take the RTNL lock within nl80211_doit() and have parallel calls to
nl80211_doit(). Below is the logic I tested. Please let me know if I'm heading
in the right direction.

Thanks,
Will

diff --git a/net/wireless/nl80211.c b/net/wireless/nl80211.c
index 686a69381731..bb4ad746509b 100644
--- a/net/wireless/nl80211.c
+++ b/net/wireless/nl80211.c
@@ -15227,7 +15227,24 @@ static int nl80211_pre_doit(const struct genl_ops *ops, struct sk_buff *skb,
}

if (rdev && !(ops->internal_flags & NL80211_FLAG_NO_WIPHY_MTX)) {
- wiphy_lock(&rdev->wiphy);
+ while (!mutex_trylock(&rdev->wiphy.mtx)) {
+ /* Holding the RTNL lock while waiting for the wiphy lock can lead to
+ * a deadlock within doit() ops that don't hold the RTNL in pre_doit. So
+ * we need to release the RTNL lock first while we wait for the wiphy
+ * lock.
+ */
+ rtnl_unlock();
+ wiphy_lock(&rdev->wiphy);
+
+ /* Once we get the wiphy_lock, we need to grab the RTNL lock. If we can't
+ * get it, then we need to unlock the wiphy to avoid a deadlock in
+ * pre_doit and then retry taking the locks again. */
+ if (!rtnl_trylock()) {
+ wiphy_unlock(&rdev->wiphy);
+ rtnl_lock();
+ } else
+ break;
+ }
/* we keep the mutex locked until post_doit */
__release(&rdev->wiphy.mtx);
}

2022-03-22 00:50:01

by William McVicker

[permalink] [raw]
Subject: Re: [BUG] deadlock in nl80211_vendor_cmd

On 03/21/2022, Johannes Berg wrote:
> Hi,
>
>
> > Basically, my wlan driver uses the wiphy_vendor_command ops to handle
> > a number of vendor specific operations.
> >
>
> I guess it's an out-of-tree driver, since I (hope I) fixed all the
> issues in the code here ... :)
>
> > One of them in particular deletes
> > a cfg80211 interface.
>
> There's quite normal API for that, why would you do that?!

Yeah, unfortunately this is the code I was given.

>
> > The deadlock happens when thread 1 tries to take the
> > RTNL lock before calling cfg80211_unregister_device() while thread 2 is
> > inside nl80211_pre_doit(), holding the RTNL lock, and waiting on
> > wiphy_lock().
> >
> > Here is the call flow:
> >
> > Thread 1: Thread 2:
> >
> > nl80211_pre_doit():
> > -> rtnl_lock()
> > nl80211_pre_doit():
> > -> rtnl_lock()
> > -> <blocked by Thread 1>
> > -> wiphy_lock()
> > -> rtnl_unlock()
> > -> <unblock Thread 1>
> > exit nl80211_pre_doit()
> > <Thread 2 got the RTNL lock>
> > -> wiphy_lock()
> > -> <blocked by Thread 1>
> > nl80211_doit()
> > -> nl80211_vendor_cmd()
> > -> rtnl_lock() <DEADLOCK>
>
> Yeah, I guess the way we invoke vendor commands now w/o RTNL held means
> you cannot safely acquire RTNL in them.
>
> I mean, the whole above thing basically collapses down to
>
> Thread 1 Thread 2
> wiphy_lock(); // nl80211
> rtnl_lock();
> wiphy_lock();
> rtnl_lock(); // your driver
>
> The correct order to _acquire_ these is rtnl -> wiphy, and we do it that
> way around everywhere (else).
>
> > I'm not an networking expert. So my main question is if I'm allowed to take
> > the RTNL lock inside the nl80211_vendor_cmd callbacks?
>
> Evidently, you're not. It's interesting though, it used to be that we
> called these with the RTNL held, now we don't, and the driver you're
> using somehow "got fixed" to take it, but whoever fixed it didn't take
> into account that this is not possible?

So in my quest to upgrade the Pixel 6 kernel from 5.10 to 5.15, I noticed that
I was hitting several ASSERT_RTNL() warnings during wlan testing. When I dug
into those asserts, I found commit a05829a7222e ("cfg80211: avoid holding the
RTNL when calling the driver") was causing these issues. So I went about adding
the necessary locks in the driver which led me to find this ABBA deadlock
scenario.

>
> > I hope that helps explain the issue. Let me know if you need any more
> > details.
>
> It does, but I don't think there's any way to fix it. You just
> fundamentally cannot acquire the RTNL in a vendor command operation
> since that introduced the ABBA deadlock you observed.
>
> Since it's an out-of-tree driver that's about as much as I can help.
>
> johannes

Yeah, I understand. Thanks for the response!

--Will

2022-03-22 23:28:14

by Jeff Johnson

[permalink] [raw]
Subject: Re: [BUG] deadlock in nl80211_vendor_cmd

On 3/21/2022 1:07 PM, Johannes Berg wrote:
[..snip..]

>> I'm not an networking expert. So my main question is if I'm allowed to take
>> the RTNL lock inside the nl80211_vendor_cmd callbacks?
>
> Evidently, you're not. It's interesting though, it used to be that we
> called these with the RTNL held, now we don't, and the driver you're
> using somehow "got fixed" to take it, but whoever fixed it didn't take
> into account that this is not possible?

On this point I just want to remind that prior to the locking change
that a driver would specify on a per-vendor command basis whether or not
it wanted the rtnl_lock to be held via NL80211_FLAG_NEED_RTNL. I'm
guessing for the command in question the driver did not set this flag
since the driver wanted to explicitly take the lock itself, otherwise it
would have deadlocked on itself with the 5.10 kernel.

/jeff

2022-03-22 23:28:15

by William McVicker

[permalink] [raw]
Subject: Re: [BUG] deadlock in nl80211_vendor_cmd

On 03/22/2022, Jeff Johnson wrote:
> On 3/21/2022 1:07 PM, Johannes Berg wrote:
> [..snip..]
>
> > > I'm not an networking expert. So my main question is if I'm allowed to take
> > > the RTNL lock inside the nl80211_vendor_cmd callbacks?
> >
> > Evidently, you're not. It's interesting though, it used to be that we
> > called these with the RTNL held, now we don't, and the driver you're
> > using somehow "got fixed" to take it, but whoever fixed it didn't take
> > into account that this is not possible?
>
> On this point I just want to remind that prior to the locking change that a
> driver would specify on a per-vendor command basis whether or not it wanted
> the rtnl_lock to be held via NL80211_FLAG_NEED_RTNL. I'm guessing for the
> command in question the driver did not set this flag since the driver wanted
> to explicitly take the lock itself, otherwise it would have deadlocked on
> itself with the 5.10 kernel.
>
> /jeff

On the 5.10 kernel, the core kernel sets NL80211_FLAG_NEED_RTNL as part of
the internal_flags for NL80211_CMD_VENDOR:

net/wireless/nl80211.c:
{
.cmd = NL80211_CMD_VENDOR,
.validate = GENL_DONT_VALIDATE_STRICT | GENL_DONT_VALIDATE_DUMP,
.doit = nl80211_vendor_cmd,
.dumpit = nl80211_vendor_cmd_dump,
.flags = GENL_UNS_ADMIN_PERM,
.internal_flags = NL80211_FLAG_NEED_WIPHY |
NL80211_FLAG_NEED_RTNL |
NL80211_FLAG_CLEAR_SKB,
},

So the 5.10 version of this driver doesn't need to directly call rtnl_lock()
within the vendor command doit() functions since pre_doit() handles the RTNL
locking.

It would be nice if nl80211_vendor_cmd() could support taking the RTNL lock if
requested via the vendor flags. That would require moving the wiphy lock to
nl80211_vendor_cmds() so that it could take the RTNL and wiphy lock in the
correct order. Is that something you'd be open to Johannes?

--Will

2022-03-23 20:46:49

by Jeff Johnson

[permalink] [raw]
Subject: Re: [BUG] deadlock in nl80211_vendor_cmd

On 3/22/2022 2:58 PM, William McVicker wrote:
> On 03/22/2022, Jeff Johnson wrote:
>> On 3/21/2022 1:07 PM, Johannes Berg wrote:
>> [..snip..]
>>
>>>> I'm not an networking expert. So my main question is if I'm allowed to take
>>>> the RTNL lock inside the nl80211_vendor_cmd callbacks?
>>>
>>> Evidently, you're not. It's interesting though, it used to be that we
>>> called these with the RTNL held, now we don't, and the driver you're
>>> using somehow "got fixed" to take it, but whoever fixed it didn't take
>>> into account that this is not possible?
>>
>> On this point I just want to remind that prior to the locking change that a
>> driver would specify on a per-vendor command basis whether or not it wanted
>> the rtnl_lock to be held via NL80211_FLAG_NEED_RTNL. I'm guessing for the
>> command in question the driver did not set this flag since the driver wanted
>> to explicitly take the lock itself, otherwise it would have deadlocked on
>> itself with the 5.10 kernel.
>>
>> /jeff
>
> On the 5.10 kernel, the core kernel sets NL80211_FLAG_NEED_RTNL as part of
> the internal_flags for NL80211_CMD_VENDOR:
>
> net/wireless/nl80211.c:
> {
> .cmd = NL80211_CMD_VENDOR,
> .validate = GENL_DONT_VALIDATE_STRICT | GENL_DONT_VALIDATE_DUMP,
> .doit = nl80211_vendor_cmd,
> .dumpit = nl80211_vendor_cmd_dump,
> .flags = GENL_UNS_ADMIN_PERM,
> .internal_flags = NL80211_FLAG_NEED_WIPHY |
> NL80211_FLAG_NEED_RTNL |
> NL80211_FLAG_CLEAR_SKB,
> },
>
> So the 5.10 version of this driver doesn't need to directly call rtnl_lock()
> within the vendor command doit() functions since pre_doit() handles the RTNL
> locking.
>
> It would be nice if nl80211_vendor_cmd() could support taking the RTNL lock if
> requested via the vendor flags. That would require moving the wiphy lock to
> nl80211_vendor_cmds() so that it could take the RTNL and wiphy lock in the
> correct order. Is that something you'd be open to Johannes?
>
> --Will

Thanks for correcting my understanding. I concur that it would be useful
for vendor commands to be able to specify that a given command needs the
RTNL lock to be held.


2022-03-25 15:25:39

by Johannes Berg

[permalink] [raw]
Subject: Re: [BUG] deadlock in nl80211_vendor_cmd

Hi,

(Jakub, can you please see below, I wonder which you prefer)

>
> I found that we can hit this same ABBA deadlock within the nl80211 code
> before ever even calling into the vendor doit() function. 

Hmm.

> The issue I found
> is caused by the way we unlock the RTNL mutex. Here is the call flow that
> leads to the deadlock:
>
> Thread 1 Thread 2
> nl80211_pre_doit():
> rtnl_lock()
> wiphy_lock() nl80211_pre_doit():
> rtnl_lock() // blocked by Thread 1
> rtnl_unlock():
> netdev_run_todo():
> __rtnl_unlock()
> <got RTNL lock>
> wiphy_lock() // blocked by Thread 1
> rtnl_lock(); // DEADLOCK
> doit()
> nl80211_post_doit():
> wiphy_unlock();
>
> Basically, unlocking the RTNL within netdev_run_todo() gives another thread
> that is waiting for the RTNL in nl80211_pre_doit() a chance to grab the
> RTNL lock leading to the deadlock. I found that there are multiple
> instances where rtnl_lock() is called within netdev_run_todo(): a couple of
> times inside netdev_wait_allrefs() and directly by netdev_run_todo().

Have you actually observed this in practice?

It's true, but dynamically this only happens if you get into the

while (!list_empty(&list)) {
...
}

code in netdev_run_todo().

Which in itself can only be true if a netdev was unregistered and
netdev_run_todo() didn't run yet.

Now, since normally we go through rtnl_unlock(), it's highly likely that
we get into rtnl_lock() with the todo list being empty (but not
impossible, read on), so then rtnl_unlock()/netdev_run_todo() won't be
getting into this branch, and then the deadlock cannot happen.

Now, it might be possible somewhere in the tree to unregister a netdev
and then unlock using __rtnl_unlock(), so the todo list isn't processed
until the next time, but __rtnl_unlock() isn't exported and all the
users I found didn't seem to cause this problem.


On the other hand, clearly people thought it was worth worrying about,
there are already *two* almost identical implementations of avoiding
this problem:
- rtnl_lock_unregistering
- rtnl_lock_unregistering_all

(identical except for the netns list they use, partial vs. all).

So we can avoid the potential deadlock in cfg80211 in a few ways:

1) export rtnl_lock_unregistering_all() or maybe a variant after
refactoring the two versions, to allow cfg80211 to use it, that way
netdev_run_todo() can never have a non-empty todo list

2) export __rtnl_unlock() so cfg80211 can avoid running
netdev_run_todo() in the unlock, personally I like this less because
it might encourage random drivers to use it

3) completely rework cfg80211's locking, adding a separate mutex for
the wiphy list so we don't need to acquire the RTNL at all here
(unless the ops need it, but there's no issue if we don't drop it),
something like https://p.sipsolutions.net/27d08e1f5881a793.txt


I think I'm happy with 3) now (even if it took a couple of hours), so I
think we can go with it, just need to go through all the possibilities.

johannes

2022-03-25 17:22:44

by Johannes Berg

[permalink] [raw]
Subject: Re: [BUG] deadlock in nl80211_vendor_cmd

On Fri, 2022-03-25 at 13:04 +0100, Johannes Berg wrote:
>
> So we can avoid the potential deadlock in cfg80211 in a few ways:
>
> 1) export rtnl_lock_unregistering_all() or maybe a variant after
> refactoring the two versions, to allow cfg80211 to use it, that way
> netdev_run_todo() can never have a non-empty todo list
>
> 2) export __rtnl_unlock() so cfg80211 can avoid running
> netdev_run_todo() in the unlock, personally I like this less because
> it might encourage random drivers to use it
>
> 3) completely rework cfg80211's locking, adding a separate mutex for
> the wiphy list so we don't need to acquire the RTNL at all here
> (unless the ops need it, but there's no issue if we don't drop it),
> something like https://p.sipsolutions.net/27d08e1f5881a793.txt
>

Note that none of these actually let you do what you wanted - that is
acquiring the RTNL in the vendor op itself.

johannes

2022-03-25 17:55:16

by Jeff Johnson

[permalink] [raw]
Subject: Re: [BUG] deadlock in nl80211_vendor_cmd

On 3/25/2022 9:04 AM, Johannes Berg wrote:
> On Fri, 2022-03-25 at 08:59 -0700, Jeff Johnson wrote:
>> On 3/25/2022 5:09 AM, Johannes Berg wrote:
>>> On Tue, 2022-03-22 at 21:58 +0000, William McVicker wrote:
>>>>
>>>> It would be nice if nl80211_vendor_cmd() could support taking the RTNL lock if
>>>> requested via the vendor flags. That would require moving the wiphy lock to
>>>> nl80211_vendor_cmds() so that it could take the RTNL and wiphy lock in the
>>>> correct order. Is that something you'd be open to Johannes?
>>>
>>> Not very, to be honest. If we had a good use case for it, maybe, but
>>> you're not even saying quite what the use case is :)
>>
>> I believe Will and I are up against the same issues. Out-of-tree Android
>> drivers use a large number of vendor commands to support the
>> requirements of the Android framework. Prior to 5.12 these vendor
>> commands would always be called with the rtnl lock held, and in some
>> cases the implementation of the vendor commands care about that,
>> especially when calling back into the kernel. In particular we are faced
>> with the issue that there are some kernel APIs that require that the
>> rtnl lock be held, and as a result of the lock infrastructure change we
>> can no longer call those APIs from vendor commands. That is a pretty
>> severe limitation to work around.
>
> I'm not really sure how sympathetic I am to an out-of-tree use case I
> cannot even see, and I'd probably consider problematic in the first
> place :)

You can see them, they're open source, just not in-tree.
That doesn't mean you wouldn't find things there problematic.
There are many reasons that most (all?) Android wifi drivers are
out-of-tree, but that is a different discussion.

>
> Your easiest option might be to just patch NL80211_FLAG_NEED_RTNL into
> your kernel for vendor commands and call it a day?

Well it isn't my kernel, it's Android's kernel since now with the ACK
they control the kernel:
https://source.android.com/devices/architecture/kernel/android-common

So Will and I would have to advocate for that, and this may be a
reasonable short-term solution. But I'm sure that Greg KH would like to
minimize the deltas between the ACK and upstream.

/jeff

2022-03-25 18:42:35

by Johannes Berg

[permalink] [raw]
Subject: Re: [BUG] deadlock in nl80211_vendor_cmd

On Fri, 2022-03-25 at 08:59 -0700, Jeff Johnson wrote:
> On 3/25/2022 5:09 AM, Johannes Berg wrote:
> > On Tue, 2022-03-22 at 21:58 +0000, William McVicker wrote:
> > >
> > > It would be nice if nl80211_vendor_cmd() could support taking the RTNL lock if
> > > requested via the vendor flags. That would require moving the wiphy lock to
> > > nl80211_vendor_cmds() so that it could take the RTNL and wiphy lock in the
> > > correct order. Is that something you'd be open to Johannes?
> >
> > Not very, to be honest. If we had a good use case for it, maybe, but
> > you're not even saying quite what the use case is :)
>
> I believe Will and I are up against the same issues. Out-of-tree Android
> drivers use a large number of vendor commands to support the
> requirements of the Android framework. Prior to 5.12 these vendor
> commands would always be called with the rtnl lock held, and in some
> cases the implementation of the vendor commands care about that,
> especially when calling back into the kernel. In particular we are faced
> with the issue that there are some kernel APIs that require that the
> rtnl lock be held, and as a result of the lock infrastructure change we
> can no longer call those APIs from vendor commands. That is a pretty
> severe limitation to work around.

I'm not really sure how sympathetic I am to an out-of-tree use case I
cannot even see, and I'd probably consider problematic in the first
place :)

Your easiest option might be to just patch NL80211_FLAG_NEED_RTNL into
your kernel for vendor commands and call it a day?

johannes

2022-03-25 18:43:36

by Johannes Berg

[permalink] [raw]
Subject: Re: [BUG] deadlock in nl80211_vendor_cmd

On Fri, 2022-03-25 at 09:49 -0700, Jakub Kicinski wrote:
> On Fri, 25 Mar 2022 13:04:23 +0100 Johannes Berg wrote:
> > So we can avoid the potential deadlock in cfg80211 in a few ways:
> >
> > 1) export rtnl_lock_unregistering_all() or maybe a variant after
> > refactoring the two versions, to allow cfg80211 to use it, that way
> > netdev_run_todo() can never have a non-empty todo list
> >
> > 2) export __rtnl_unlock() so cfg80211 can avoid running
> > netdev_run_todo() in the unlock, personally I like this less because
> > it might encourage random drivers to use it
> >
> > 3) completely rework cfg80211's locking, adding a separate mutex for
> > the wiphy list so we don't need to acquire the RTNL at all here
> > (unless the ops need it, but there's no issue if we don't drop it),
> > something like https://p.sipsolutions.net/27d08e1f5881a793.txt
> >
> >
> > I think I'm happy with 3) now (even if it took a couple of hours), so I
> > think we can go with it, just need to go through all the possibilities.
>
> I like 3) as well. FWIW a few places (e.g. mlx5, devlink, I think I've
> seen more) had been converting to xarray for managing the "registered"
> objects. It may be worth looking into if you're re-doing things, anyway.
>

That's not a bad idea, but I think I wouldn't want to backport that, so
separately :) I don't think that fundamentally changes the locking
properties though.


Couple of more questions I guess: First, are we assuming that the
cfg80211 code *is* actually broken, even if it looks like nothing can
cause the situation, due to the empty todo list?

Given that we have rtnl_lock_unregistering() (and also
rtnl_lock_unregistering_all()), it looks like we *do* in fact at least
not want to make an assumption that no user of __rtnl_unlock() can have
added a todo item.

I mean, there's technically yet *another* thing we could do - something
like this:

[this doesn't compile, need to suitably make net_todo_list non-static]
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -95,6 +95,7 @@ void __rtnl_unlock(void)

defer_kfree_skb_list = NULL;

+ WARN_ON(!list_empty(&net_todo_list));
mutex_unlock(&rtnl_mutex);

while (head) {

and actually that would allow us to get rid of rtnl_lock_unregistering()
and rtnl_lock_unregistering_all() simply because we'd actually guarantee
the invariant that when the RTNL is freshly locked, the list is empty
(by guaranteeing that it's always empty when it's unlocked, since it can
only be added to under RTNL).

With some suitable commentary, that might also be a reasonable thing?
__rtnl_unlock() is actually rather pretty rare, and not exported.


However, if you don't like that ...

I've been testing with this patch, to make lockdep complain:

--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -9933,6 +9933,11 @@ void netdev_run_todo(void)
if (!list_empty(&list))
rcu_barrier();

+#ifdef CONFIG_LOCKDEP
+ rtnl_lock();
+ __rtnl_unlock();
+#endif
+
while (!list_empty(&list)) {
struct net_device *dev
= list_first_entry(&list, struct net_device, todo_list);


That causes lockdep to complain for cfg80211 even if the list *is* in
fact empty.

Would you be open to adding something like that? Perhaps if I don't just
do the easy rtnl_lock/unlock, but try to find the corresponding lockdep-
only things to do there, to cause lockdep to do things without really
locking? OTOH, the locking overhead of the RTNL we just unlocked is
probably minimal, vs. the actual work *lockdep* is doing to track all
this ...

Thanks,
johannes

2022-03-25 18:54:17

by Jeff Johnson

[permalink] [raw]
Subject: Re: [BUG] deadlock in nl80211_vendor_cmd

On 3/25/2022 5:09 AM, Johannes Berg wrote:
> On Tue, 2022-03-22 at 21:58 +0000, William McVicker wrote:
>>
>> It would be nice if nl80211_vendor_cmd() could support taking the RTNL lock if
>> requested via the vendor flags. That would require moving the wiphy lock to
>> nl80211_vendor_cmds() so that it could take the RTNL and wiphy lock in the
>> correct order. Is that something you'd be open to Johannes?
>
> Not very, to be honest. If we had a good use case for it, maybe, but
> you're not even saying quite what the use case is :)

I believe Will and I are up against the same issues. Out-of-tree Android
drivers use a large number of vendor commands to support the
requirements of the Android framework. Prior to 5.12 these vendor
commands would always be called with the rtnl lock held, and in some
cases the implementation of the vendor commands care about that,
especially when calling back into the kernel. In particular we are faced
with the issue that there are some kernel APIs that require that the
rtnl lock be held, and as a result of the lock infrastructure change we
can no longer call those APIs from vendor commands. That is a pretty
severe limitation to work around.

/jeff

2022-03-25 19:05:19

by William McVicker

[permalink] [raw]
Subject: Re: [BUG] deadlock in nl80211_vendor_cmd

On 03/23/2022, Jeff Johnson wrote:
> On 3/22/2022 2:58 PM, William McVicker wrote:
> > On 03/22/2022, Jeff Johnson wrote:
> > > On 3/21/2022 1:07 PM, Johannes Berg wrote:
> > > [..snip..]
> > >
> > > > > I'm not an networking expert. So my main question is if I'm allowed to take
> > > > > the RTNL lock inside the nl80211_vendor_cmd callbacks?
> > > >
> > > > Evidently, you're not. It's interesting though, it used to be that we
> > > > called these with the RTNL held, now we don't, and the driver you're
> > > > using somehow "got fixed" to take it, but whoever fixed it didn't take
> > > > into account that this is not possible?
> > >
> > > On this point I just want to remind that prior to the locking change that a
> > > driver would specify on a per-vendor command basis whether or not it wanted
> > > the rtnl_lock to be held via NL80211_FLAG_NEED_RTNL. I'm guessing for the
> > > command in question the driver did not set this flag since the driver wanted
> > > to explicitly take the lock itself, otherwise it would have deadlocked on
> > > itself with the 5.10 kernel.
> > >
> > > /jeff
> >
> > On the 5.10 kernel, the core kernel sets NL80211_FLAG_NEED_RTNL as part of
> > the internal_flags for NL80211_CMD_VENDOR:
> >
> > net/wireless/nl80211.c:
> > {
> > .cmd = NL80211_CMD_VENDOR,
> > .validate = GENL_DONT_VALIDATE_STRICT | GENL_DONT_VALIDATE_DUMP,
> > .doit = nl80211_vendor_cmd,
> > .dumpit = nl80211_vendor_cmd_dump,
> > .flags = GENL_UNS_ADMIN_PERM,
> > .internal_flags = NL80211_FLAG_NEED_WIPHY |
> > NL80211_FLAG_NEED_RTNL |
> > NL80211_FLAG_CLEAR_SKB,
> > },
> >
> > So the 5.10 version of this driver doesn't need to directly call rtnl_lock()
> > within the vendor command doit() functions since pre_doit() handles the RTNL
> > locking.
> >
> > It would be nice if nl80211_vendor_cmd() could support taking the RTNL lock if
> > requested via the vendor flags. That would require moving the wiphy lock to
> > nl80211_vendor_cmds() so that it could take the RTNL and wiphy lock in the
> > correct order. Is that something you'd be open to Johannes?
> >
> > --Will
>
> Thanks for correcting my understanding. I concur that it would be useful for
> vendor commands to be able to specify that a given command needs the RTNL
> lock to be held.
>
>

Hi Johannes,

I found that we can hit this same ABBA deadlock within the nl80211 code
before ever even calling into the vendor doit() function. The issue I found
is caused by the way we unlock the RTNL mutex. Here is the call flow that
leads to the deadlock:

Thread 1 Thread 2
nl80211_pre_doit():
rtnl_lock()
wiphy_lock() nl80211_pre_doit():
rtnl_lock() // blocked by Thread 1
rtnl_unlock():
netdev_run_todo():
__rtnl_unlock()
<got RTNL lock>
wiphy_lock() // blocked by Thread 1
rtnl_lock(); // DEADLOCK
doit()
nl80211_post_doit():
wiphy_unlock();

Basically, unlocking the RTNL within netdev_run_todo() gives another thread
that is waiting for the RTNL in nl80211_pre_doit() a chance to grab the
RTNL lock leading to the deadlock. I found that there are multiple
instances where rtnl_lock() is called within netdev_run_todo(): a couple of
times inside netdev_wait_allrefs() and directly by netdev_run_todo().

Since I'm not really familiar with all the RNTL locking requirements, I was
hoping you could take a look at netdev_run_todo() to see if it's possible
to refactor it to avoid this deadlock. If not, then I don't think we can
call rtnl_unlock() while still holding the wiphy mutex.

Thanks,
Will

2022-03-25 19:14:28

by Jakub Kicinski

[permalink] [raw]
Subject: Re: [BUG] deadlock in nl80211_vendor_cmd

On Fri, 25 Mar 2022 13:04:23 +0100 Johannes Berg wrote:
> So we can avoid the potential deadlock in cfg80211 in a few ways:
>
> 1) export rtnl_lock_unregistering_all() or maybe a variant after
> refactoring the two versions, to allow cfg80211 to use it, that way
> netdev_run_todo() can never have a non-empty todo list
>
> 2) export __rtnl_unlock() so cfg80211 can avoid running
> netdev_run_todo() in the unlock, personally I like this less because
> it might encourage random drivers to use it
>
> 3) completely rework cfg80211's locking, adding a separate mutex for
> the wiphy list so we don't need to acquire the RTNL at all here
> (unless the ops need it, but there's no issue if we don't drop it),
> something like https://p.sipsolutions.net/27d08e1f5881a793.txt
>
>
> I think I'm happy with 3) now (even if it took a couple of hours), so I
> think we can go with it, just need to go through all the possibilities.

I like 3) as well. FWIW a few places (e.g. mlx5, devlink, I think I've
seen more) had been converting to xarray for managing the "registered"
objects. It may be worth looking into if you're re-doing things, anyway.

2022-03-25 19:32:59

by Johannes Berg

[permalink] [raw]
Subject: Re: [BUG] deadlock in nl80211_vendor_cmd

On Tue, 2022-03-22 at 21:58 +0000, William McVicker wrote:
>
> It would be nice if nl80211_vendor_cmd() could support taking the RTNL lock if
> requested via the vendor flags. That would require moving the wiphy lock to
> nl80211_vendor_cmds() so that it could take the RTNL and wiphy lock in the
> correct order. Is that something you'd be open to Johannes?

Not very, to be honest. If we had a good use case for it, maybe, but
you're not even saying quite what the use case is :)

It's also rather complex, because it'd require effectively open-coding
the pre_doit() stuff in the vendor command handler itself, in order to
look at the sub-command flags while doing it ...

johannes

2022-03-25 20:10:57

by Johannes Berg

[permalink] [raw]
Subject: Re: [BUG] deadlock in nl80211_vendor_cmd

On Tue, 2022-03-22 at 14:31 -0700, Jeff Johnson wrote:
> On 3/21/2022 1:07 PM, Johannes Berg wrote:
> [..snip..]
>
> > > I'm not an networking expert. So my main question is if I'm allowed to take
> > > the RTNL lock inside the nl80211_vendor_cmd callbacks?
> >
> > Evidently, you're not. It's interesting though, it used to be that we
> > called these with the RTNL held, now we don't, and the driver you're
> > using somehow "got fixed" to take it, but whoever fixed it didn't take
> > into account that this is not possible?
>
> On this point I just want to remind that prior to the locking change
> that a driver would specify on a per-vendor command basis whether or not
> it wanted the rtnl_lock to be held via NL80211_FLAG_NEED_RTNL.

No, that flag isn't on a per-vendor-command basis. It was set for all
the vendor commands though, in nl80211, so the driver wouldn't have had
the rtnl_lock()/unlock() at all.

johannes

2022-03-25 20:24:36

by William McVicker

[permalink] [raw]
Subject: Re: [BUG] deadlock in nl80211_vendor_cmd

On 03/25/2022, Johannes Berg wrote:
> On Fri, 2022-03-25 at 09:49 -0700, Jakub Kicinski wrote:
> > On Fri, 25 Mar 2022 13:04:23 +0100 Johannes Berg wrote:
> > > So we can avoid the potential deadlock in cfg80211 in a few ways:
> > >
> > > 1) export rtnl_lock_unregistering_all() or maybe a variant after
> > > refactoring the two versions, to allow cfg80211 to use it, that way
> > > netdev_run_todo() can never have a non-empty todo list
> > >
> > > 2) export __rtnl_unlock() so cfg80211 can avoid running
> > > netdev_run_todo() in the unlock, personally I like this less because
> > > it might encourage random drivers to use it
> > >
> > > 3) completely rework cfg80211's locking, adding a separate mutex for
> > > the wiphy list so we don't need to acquire the RTNL at all here
> > > (unless the ops need it, but there's no issue if we don't drop it),
> > > something like https://p.sipsolutions.net/27d08e1f5881a793.txt
> > >
> > >
> > > I think I'm happy with 3) now (even if it took a couple of hours), so I
> > > think we can go with it, just need to go through all the possibilities.
> >
> > I like 3) as well. FWIW a few places (e.g. mlx5, devlink, I think I've
> > seen more) had been converting to xarray for managing the "registered"
> > objects. It may be worth looking into if you're re-doing things, anyway.
> >
>
> That's not a bad idea, but I think I wouldn't want to backport that, so
> separately :) I don't think that fundamentally changes the locking
> properties though.
>
>
> Couple of more questions I guess: First, are we assuming that the
> cfg80211 code *is* actually broken, even if it looks like nothing can
> cause the situation, due to the empty todo list?

I'm able to reproduce this issue pretty easily with a Pixel 6 when I add
support to allow vendor commands to request for the RTNL. For this case, I just
delay unlocking the RTNL until nl80211_vendor_cmds() at which point I check the
flags to see if I should unlock before calling doit(). That allows me to run my
tests again and hit this issue. I imagine that I could hit this issue without
any changes if I re-work my vendor ops to not need the RTNL.

>
> Given that we have rtnl_lock_unregistering() (and also
> rtnl_lock_unregistering_all()), it looks like we *do* in fact at least
> not want to make an assumption that no user of __rtnl_unlock() can have
> added a todo item.
>
> I mean, there's technically yet *another* thing we could do - something
> like this:
>
> [this doesn't compile, need to suitably make net_todo_list non-static]
> --- a/net/core/rtnetlink.c
> +++ b/net/core/rtnetlink.c
> @@ -95,6 +95,7 @@ void __rtnl_unlock(void)
>
> defer_kfree_skb_list = NULL;
>
> + WARN_ON(!list_empty(&net_todo_list));
> mutex_unlock(&rtnl_mutex);
>
> while (head) {
>
> and actually that would allow us to get rid of rtnl_lock_unregistering()
> and rtnl_lock_unregistering_all() simply because we'd actually guarantee
> the invariant that when the RTNL is freshly locked, the list is empty
> (by guaranteeing that it's always empty when it's unlocked, since it can
> only be added to under RTNL).
>
> With some suitable commentary, that might also be a reasonable thing?
> __rtnl_unlock() is actually rather pretty rare, and not exported.
>
>
> However, if you don't like that ...
>
> I've been testing with this patch, to make lockdep complain:
>
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -9933,6 +9933,11 @@ void netdev_run_todo(void)
> if (!list_empty(&list))
> rcu_barrier();
>
> +#ifdef CONFIG_LOCKDEP
> + rtnl_lock();
> + __rtnl_unlock();
> +#endif
> +
> while (!list_empty(&list)) {
> struct net_device *dev
> = list_first_entry(&list, struct net_device, todo_list);
>
>
> That causes lockdep to complain for cfg80211 even if the list *is* in
> fact empty.
>
> Would you be open to adding something like that? Perhaps if I don't just
> do the easy rtnl_lock/unlock, but try to find the corresponding lockdep-
> only things to do there, to cause lockdep to do things without really
> locking? OTOH, the locking overhead of the RTNL we just unlocked is
> probably minimal, vs. the actual work *lockdep* is doing to track all
> this ...
>
> Thanks,
> johannes

Let me know if you'd like me to test any patches out.

Thanks,
Will

2022-03-25 20:52:18

by Johannes Berg

[permalink] [raw]
Subject: Re: [BUG] deadlock in nl80211_vendor_cmd

On Fri, 2022-03-25 at 18:08 +0000, William McVicker wrote:
>
> I'm able to reproduce this issue pretty easily with a Pixel 6 when I add
> support to allow vendor commands to request for the RTNL. 
>

Hm, wait, which of the two issues?

> For this case, I just
> delay unlocking the RTNL until nl80211_vendor_cmds() at which point I check the
> flags to see if I should unlock before calling doit(). That allows me to run my
> tests again and hit this issue. I imagine that I could hit this issue without
> any changes if I re-work my vendor ops to not need the RTNL.

What are the vendor ops doing though?

If they're actually unregistering a netdev - which I believe you
mentioned earlier - then that's quite clearly going to cause an issue,
if you unlock RTNL while the wiphy mutex is still held.

If not, then I don't see right now how you'd be able to trigger any
issue here at all.

The original issue - that you rtnl_lock() yourself while the wiphy mutex
is held - can't happen anymore with your rework I guess.


johannes

2022-03-25 21:04:04

by William McVicker

[permalink] [raw]
Subject: Re: [BUG] deadlock in nl80211_vendor_cmd

On 03/25/2022, William McVicker wrote:
> On 03/25/2022, Johannes Berg wrote:
> > On Fri, 2022-03-25 at 09:49 -0700, Jakub Kicinski wrote:
> > > On Fri, 25 Mar 2022 13:04:23 +0100 Johannes Berg wrote:
> > > > So we can avoid the potential deadlock in cfg80211 in a few ways:
> > > >
> > > > 1) export rtnl_lock_unregistering_all() or maybe a variant after
> > > > refactoring the two versions, to allow cfg80211 to use it, that way
> > > > netdev_run_todo() can never have a non-empty todo list
> > > >
> > > > 2) export __rtnl_unlock() so cfg80211 can avoid running
> > > > netdev_run_todo() in the unlock, personally I like this less because
> > > > it might encourage random drivers to use it
> > > >
> > > > 3) completely rework cfg80211's locking, adding a separate mutex for
> > > > the wiphy list so we don't need to acquire the RTNL at all here
> > > > (unless the ops need it, but there's no issue if we don't drop it),
> > > > something like https://p.sipsolutions.net/27d08e1f5881a793.txt
> > > >
> > > >
> > > > I think I'm happy with 3) now (even if it took a couple of hours), so I
> > > > think we can go with it, just need to go through all the possibilities.
> > >
> > > I like 3) as well. FWIW a few places (e.g. mlx5, devlink, I think I've
> > > seen more) had been converting to xarray for managing the "registered"
> > > objects. It may be worth looking into if you're re-doing things, anyway.
> > >
> >
> > That's not a bad idea, but I think I wouldn't want to backport that, so
> > separately :) I don't think that fundamentally changes the locking
> > properties though.
> >
> >
> > Couple of more questions I guess: First, are we assuming that the
> > cfg80211 code *is* actually broken, even if it looks like nothing can
> > cause the situation, due to the empty todo list?
>
> I'm able to reproduce this issue pretty easily with a Pixel 6 when I add
> support to allow vendor commands to request for the RTNL. For this case, I just
> delay unlocking the RTNL until nl80211_vendor_cmds() at which point I check the
> flags to see if I should unlock before calling doit(). That allows me to run my
> tests again and hit this issue. I imagine that I could hit this issue without
> any changes if I re-work my vendor ops to not need the RTNL.
>
> >
> > Given that we have rtnl_lock_unregistering() (and also
> > rtnl_lock_unregistering_all()), it looks like we *do* in fact at least
> > not want to make an assumption that no user of __rtnl_unlock() can have
> > added a todo item.
> >
> > I mean, there's technically yet *another* thing we could do - something
> > like this:
> >
> > [this doesn't compile, need to suitably make net_todo_list non-static]
> > --- a/net/core/rtnetlink.c
> > +++ b/net/core/rtnetlink.c
> > @@ -95,6 +95,7 @@ void __rtnl_unlock(void)
> >
> > defer_kfree_skb_list = NULL;
> >
> > + WARN_ON(!list_empty(&net_todo_list));
> > mutex_unlock(&rtnl_mutex);
> >
> > while (head) {
> >
> > and actually that would allow us to get rid of rtnl_lock_unregistering()
> > and rtnl_lock_unregistering_all() simply because we'd actually guarantee
> > the invariant that when the RTNL is freshly locked, the list is empty
> > (by guaranteeing that it's always empty when it's unlocked, since it can
> > only be added to under RTNL).
> >
> > With some suitable commentary, that might also be a reasonable thing?
> > __rtnl_unlock() is actually rather pretty rare, and not exported.
> >
> >
> > However, if you don't like that ...
> >
> > I've been testing with this patch, to make lockdep complain:
> >
> > --- a/net/core/dev.c
> > +++ b/net/core/dev.c
> > @@ -9933,6 +9933,11 @@ void netdev_run_todo(void)
> > if (!list_empty(&list))
> > rcu_barrier();
> >
> > +#ifdef CONFIG_LOCKDEP
> > + rtnl_lock();
> > + __rtnl_unlock();
> > +#endif
> > +
> > while (!list_empty(&list)) {
> > struct net_device *dev
> > = list_first_entry(&list, struct net_device, todo_list);
> >
> >
> > That causes lockdep to complain for cfg80211 even if the list *is* in
> > fact empty.
> >
> > Would you be open to adding something like that? Perhaps if I don't just
> > do the easy rtnl_lock/unlock, but try to find the corresponding lockdep-
> > only things to do there, to cause lockdep to do things without really
> > locking? OTOH, the locking overhead of the RTNL we just unlocked is
> > probably minimal, vs. the actual work *lockdep* is doing to track all
> > this ...
> >
> > Thanks,
> > johannes
>
> Let me know if you'd like me to test any patches out.
>
> Thanks,
> Will

Hi Johannes,

I found that my wlan driver is using the vendor commands to create/delete NAN
interfaces for this Android feature called Wi-Fi aware [1]. Basically, this
features allows users to discover other nearby devices and allows them to
connect directly with one another over a local network. To get my driver
working again, I first had to allow the kernel to let my driver request for the
RTNL lock for these NAN create/delete interface vendor commands. With that
I got the following code path:


Thread 1 Thread 2
nl80211_pre_doit():
rtnl_lock()
wiphy_lock() nl80211_pre_doit():
rtnl_lock() // blocked by Thread 1
nl80211_vendor_cmd():
doit()
cfg80211_unregister_netdevice()
rtnl_unlock():
netdev_run_todo():
__rtnl_unlock()
<got RTNL lock>
wiphy_lock() // blocked by Thread 1
rtnl_lock(); // DEADLOCK
nl80211_post_doit():
wiphy_unlock();


Since I'm unlocking the RTNL inside nl80211_vendor_cmd() after calling doit()
instead of waiting till post_doit(), I get into the situation you mentioned
where the net_todo_list is not empty when calling rtnl_unlock. So I decided to
drop the rtnl_unlock() in nl80211_vendor_cmd() and defer that until
nl80211_post_doit() after calling wiphy_unlock(). With this change, I haven't
been able to reproduce the deadlock. So it's possible that we aren't actually
able to hit this deadlock in nl80211_pre_doit() with the existing code since,
as you mentioned, one wouldn't be able to call unregister_netdevice() without
having the RTNL lock.

Sorry if I sent you down a rabbit hole with the first code path scenario.

Thanks,
Will

[1] https://developer.android.com/guide/topics/connectivity/wifi-aware

2022-03-25 21:08:21

by Jakub Kicinski

[permalink] [raw]
Subject: Re: [BUG] deadlock in nl80211_vendor_cmd

On Fri, 25 Mar 2022 18:01:30 +0100 Johannes Berg wrote:
> That's not a bad idea, but I think I wouldn't want to backport that, so
> separately :) I don't think that fundamentally changes the locking
> properties though.
>
>
> Couple of more questions I guess: First, are we assuming that the
> cfg80211 code *is* actually broken, even if it looks like nothing can
> cause the situation, due to the empty todo list?

Right.

> Given that we have rtnl_lock_unregistering() (and also
> rtnl_lock_unregistering_all()), it looks like we *do* in fact at least
> not want to make an assumption that no user of __rtnl_unlock() can have
> added a todo item.
>
> I mean, there's technically yet *another* thing we could do - something
> like this:
>
> [this doesn't compile, need to suitably make net_todo_list non-static]
> --- a/net/core/rtnetlink.c
> +++ b/net/core/rtnetlink.c
> @@ -95,6 +95,7 @@ void __rtnl_unlock(void)
>
> defer_kfree_skb_list = NULL;
>
> + WARN_ON(!list_empty(&net_todo_list));
> mutex_unlock(&rtnl_mutex);
>
> while (head) {

Yeah, I think we could do that.

> and actually that would allow us to get rid of rtnl_lock_unregistering()
> and rtnl_lock_unregistering_all() simply because we'd actually guarantee
> the invariant that when the RTNL is freshly locked, the list is empty
> (by guaranteeing that it's always empty when it's unlocked, since it can
> only be added to under RTNL).

TBH I don't know what you mean by rtnl_lock_unregistering(), I don't
have that in my tree. rtnl_lock_unregistering_all() can't hurt the case
we're talking about AFAICT.

Eric removed some of the netns / loopback dependencies in net-next,
make sure you pull!

> With some suitable commentary, that might also be a reasonable thing?
> __rtnl_unlock() is actually rather pretty rare, and not exported.

The main use for it seems to be re-locking before loading a module,
which TBH I have no idea why, is it just a cargo cult or a historical
thing :S I don't see how letting netdevs leave before _loading_
a module makes any difference whatsoever.

> However, if you don't like that ...
>
> I've been testing with this patch, to make lockdep complain:
>
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -9933,6 +9933,11 @@ void netdev_run_todo(void)
> if (!list_empty(&list))
> rcu_barrier();
>
> +#ifdef CONFIG_LOCKDEP
> + rtnl_lock();
> + __rtnl_unlock();
> +#endif
> +
> while (!list_empty(&list)) {
> struct net_device *dev
> = list_first_entry(&list, struct net_device, todo_list);
>
>
> That causes lockdep to complain for cfg80211 even if the list *is* in
> fact empty.
>
> Would you be open to adding something like that? Perhaps if I don't just
> do the easy rtnl_lock/unlock, but try to find the corresponding lockdep-
> only things to do there, to cause lockdep to do things without really
> locking? OTOH, the locking overhead of the RTNL we just unlocked is
> probably minimal, vs. the actual work *lockdep* is doing to track all
> this ...

The WARN_ON() you suggested up front make perfect sense to me.
You can also take the definition of net_unlink_todo() out of
netdevice.h while at it because o_0

2022-03-25 21:44:52

by Johannes Berg

[permalink] [raw]
Subject: Re: [BUG] deadlock in nl80211_vendor_cmd

On Fri, 2022-03-25 at 20:36 +0000, William McVicker wrote:
>
> I found that my wlan driver is using the vendor commands to create/delete NAN
> interfaces for this Android feature called Wi-Fi aware [1]. Basically, this
> features allows users to discover other nearby devices and allows them to
> connect directly with one another over a local network. 
>

Wait, why is it doing that? We actually support a NAN interface type
upstream :) It's not really quite fully fleshed out, but it could be?
Probably should be?


> Thread 1 Thread 2
> nl80211_pre_doit():
> rtnl_lock()
> wiphy_lock() nl80211_pre_doit():
> rtnl_lock() // blocked by Thread 1
> nl80211_vendor_cmd():
> doit()
> cfg80211_unregister_netdevice()
> rtnl_unlock():
> netdev_run_todo():
> __rtnl_unlock()
> <got RTNL lock>
> wiphy_lock() // blocked by Thread 1
> rtnl_lock(); // DEADLOCK
> nl80211_post_doit():
> wiphy_unlock();


Right, this is what I had discussed in my other mails.

Basically, you're actually doing (some form of) unregister_netdevice()
before rtnl_unlock().

Clearly this isn't possible in cfg80211 itself.

However, I couldn't entirely discount the possibility that this is
possible:

Thread 1 Thread 2
rtnl_lock()
unregister_netdevice()
__rtnl_unlock()
rtnl_lock()
wiphy_lock()
netdev_run_todo()
__rtnl_unlock()
// list not empty now
// because of thread 2 rtnl_lock()
rtnl_lock()
wiphy_lock()

** DEADLOCK **


Given my other discussion with Jakub though, it seems that we can indeed
make sure that this cannot happen, and then this scenario is impossible
without the unregistration you're doing.

> Since I'm unlocking the RTNL inside nl80211_vendor_cmd() after calling doit()
> instead of waiting till post_doit(), I get into the situation you mentioned
> where the net_todo_list is not empty when calling rtnl_unlock. So I decided to
> drop the rtnl_unlock() in nl80211_vendor_cmd() and defer that until
> nl80211_post_doit() after calling wiphy_unlock(). With this change, I haven't
> been able to reproduce the deadlock. So it's possible that we aren't actually
> able to hit this deadlock in nl80211_pre_doit() with the existing code since,
> as you mentioned, one wouldn't be able to call unregister_netdevice() without
> having the RTNL lock.
>

Right, this is why I said earlier that actually adding a flag for vendor
commands to get the RTNL would be more complex - you'd have to basically
open-code pre_doit() and post_doit() in there and check the sub-command
flag at the very beginning and very end.

johannes

2022-03-25 21:52:10

by Johannes Berg

[permalink] [raw]
Subject: Re: [BUG] deadlock in nl80211_vendor_cmd

On Fri, 2022-03-25 at 13:40 -0700, Jakub Kicinski wrote:
> On Fri, 25 Mar 2022 18:01:30 +0100 Johannes Berg wrote:
> > That's not a bad idea, but I think I wouldn't want to backport that, so
> > separately :) I don't think that fundamentally changes the locking
> > properties though.
> >
> >
> > Couple of more questions I guess: First, are we assuming that the
> > cfg80211 code *is* actually broken, even if it looks like nothing can
> > cause the situation, due to the empty todo list?
>
> Right.

I guess that the below is basically saying "it's not really broken" :)

> > Given that we have rtnl_lock_unregistering() (and also
> > rtnl_lock_unregistering_all()), it looks like we *do* in fact at least
> > not want to make an assumption that no user of __rtnl_unlock() can have
> > added a todo item.
> >
> > I mean, there's technically yet *another* thing we could do - something
> > like this:
> >
> > [this doesn't compile, need to suitably make net_todo_list non-static]
> > --- a/net/core/rtnetlink.c
> > +++ b/net/core/rtnetlink.c
> > @@ -95,6 +95,7 @@ void __rtnl_unlock(void)
> >
> > defer_kfree_skb_list = NULL;
> >
> > + WARN_ON(!list_empty(&net_todo_list));
> > mutex_unlock(&rtnl_mutex);
> >
> > while (head) {
>
> Yeah, I think we could do that.

It seems that would be simpler. Even if I might eventually still want to
do the cfg80211 change, but it would give us some confidence that this
really cannot be happening anywhere.


> TBH I don't know what you mean by rtnl_lock_unregistering(), I don't
> have that in my tree. rtnl_lock_unregistering_all() can't hurt the case
> we're talking about AFAICT.
>
> Eric removed some of the netns / loopback dependencies in net-next,
> make sure you pull!

Sorry, yeah, I was looking at an older tree where I was testing on in
the simulation environment - this disappeared in commit 8a4fc54b07d7
("net: get rid of rtnl_lock_unregistering()").

> > With some suitable commentary, that might also be a reasonable thing?
> > __rtnl_unlock() is actually rather pretty rare, and not exported.
>
> The main use for it seems to be re-locking before loading a module,
> which TBH I have no idea why, is it just a cargo cult or a historical
> thing :S I don't see how letting netdevs leave before _loading_
> a module makes any difference whatsoever.

Indeed.


> The WARN_ON() you suggested up front make perfect sense to me.
> You can also take the definition of net_unlink_todo() out of
> netdevice.h while at it because o_0

Heh indeed, what?

But (and now I'll CC even more people...) if we can actually have an
invariant that while RTNL is unlocked the todo list is empty, then we
also don't need rtnl_lock_unregistering_all(), and can remove the
netdev_unregistering_wq, etc., no?

IOW, I'm not sure why we needed commit 50624c934db1 ("net: Delay
default_device_exit_batch until no devices are unregistering v2"), but I
also have little doubt that we did.

Ah, no. This isn't about locking in this case, it's literally about
ensuring that free_netdev() has been called in netdev_run_todo()?

Which we don't care about in cfg80211 - we just care about the list
being empty so there's no chance we'll reacquire the RTNL.

johannes

2022-03-25 22:13:30

by Jakub Kicinski

[permalink] [raw]
Subject: Re: [BUG] deadlock in nl80211_vendor_cmd

On Fri, 25 Mar 2022 22:25:05 +0100 Johannes Berg wrote:
> > > With some suitable commentary, that might also be a reasonable thing?
> > > __rtnl_unlock() is actually rather pretty rare, and not exported.
> >
> > The main use for it seems to be re-locking before loading a module,
> > which TBH I have no idea why, is it just a cargo cult or a historical
> > thing :S I don't see how letting netdevs leave before _loading_
> > a module makes any difference whatsoever.
>
> Indeed.
>
> > The WARN_ON() you suggested up front make perfect sense to me.
> > You can also take the definition of net_unlink_todo() out of
> > netdevice.h while at it because o_0
>
> Heh indeed, what?

To be clear - I just meant that it's declaring a static variable in
a header, so I doubt that it'll do the right thing unless it's only
called from one compilation unit.

> But (and now I'll CC even more people...) if we can actually have an
> invariant that while RTNL is unlocked the todo list is empty, then we
> also don't need rtnl_lock_unregistering_all(), and can remove the
> netdev_unregistering_wq, etc., no?
>
> IOW, I'm not sure why we needed commit 50624c934db1 ("net: Delay
> default_device_exit_batch until no devices are unregistering v2"), but I
> also have little doubt that we did.
>
> Ah, no. This isn't about locking in this case, it's literally about
> ensuring that free_netdev() has been called in netdev_run_todo()?

Yup, multiple contexts sitting independently in netdev_run_todo() and
chewing on netdevs is slightly different. destructors of those netdevs
could be pointing at memory of a module being unloaded.

> Which we don't care about in cfg80211 - we just care about the list
> being empty so there's no chance we'll reacquire the RTNL.

2022-03-25 22:15:14

by Johannes Berg

[permalink] [raw]
Subject: Re: [BUG] deadlock in nl80211_vendor_cmd

On Fri, 2022-03-25 at 14:48 -0700, Jakub Kicinski wrote:
> >
> > > The WARN_ON() you suggested up front make perfect sense to me.
> > > You can also take the definition of net_unlink_todo() out of
> > > netdevice.h while at it because o_0
> >
> > Heh indeed, what?
>
> To be clear - I just meant that it's declaring a static variable in
> a header, so I doubt that it'll do the right thing unless it's only
> called from one compilation unit.

Right, it's odd. I just made a patch (will send in a minute) moving the
entire block to dev.c, which is the only user of any of it.

> > Ah, no. This isn't about locking in this case, it's literally about
> > ensuring that free_netdev() has been called in netdev_run_todo()?
>
> Yup, multiple contexts sitting independently in netdev_run_todo() and
> chewing on netdevs is slightly different. destructors of those netdevs
> could be pointing at memory of a module being unloaded.

Right, thanks.

johannes

2022-03-25 22:20:15

by Johannes Berg

[permalink] [raw]
Subject: Re: [BUG] deadlock in nl80211_vendor_cmd

On Fri, 2022-03-25 at 22:16 +0100, Johannes Berg wrote:
>
> > Thread 1 Thread 2
> > nl80211_pre_doit():
> > rtnl_lock()
> > wiphy_lock() nl80211_pre_doit():
> > rtnl_lock() // blocked by Thread 1
> > nl80211_vendor_cmd():
> > doit()
> > cfg80211_unregister_netdevice()
> > rtnl_unlock():
> > netdev_run_todo():
> > __rtnl_unlock()
> > <got RTNL lock>
> > wiphy_lock() // blocked by Thread 1
> > rtnl_lock(); // DEADLOCK
> > nl80211_post_doit():
> > wiphy_unlock();
>
>
> Right, this is what I had discussed in my other mails.
>
> Basically, you're actually doing (some form of) unregister_netdevice()
> before rtnl_unlock().
>
> Clearly this isn't possible in cfg80211 itself.
>
> However, I couldn't entirely discount the possibility that this is
> possible:
>
> Thread 1 Thread 2
> rtnl_lock()
> unregister_netdevice()
> __rtnl_unlock()
> rtnl_lock()
> wiphy_lock()
> netdev_run_todo()
> __rtnl_unlock()
> // list not empty now
> // because of thread 2 rtnl_lock()
> rtnl_lock()
> wiphy_lock()
>
> ** DEADLOCK **
>
>
> Given my other discussion with Jakub though, it seems that we can indeed
> make sure that this cannot happen, and then this scenario is impossible
> without the unregistration you're doing.
>

I just sent a patch for this then, forgot to CC everyone:

https://lore.kernel.org/r/20220325225055.37e89a72f814.Ic73d206e217db20fd22dcec14fe5442ca732804b@changeid

But basically it changes nothing, just adds a WARN_ON with documentation
ensuring that the invariant never breaks, i.e. that Thread 2 can't
happen.

And maybe I should've written that with 3 Threads, but the setup of
unregister_netdevice()/__rtnl_unlock() could happen anywhere in the
system anyway.


johannes

2022-03-25 22:45:39

by Jeff Johnson

[permalink] [raw]
Subject: Re: [BUG] deadlock in nl80211_vendor_cmd

On 3/25/2022 2:16 PM, Johannes Berg wrote:
> On Fri, 2022-03-25 at 20:36 +0000, William McVicker wrote:
>>
>> I found that my wlan driver is using the vendor commands to create/delete NAN
>> interfaces for this Android feature called Wi-Fi aware [1]. Basically, this
>> features allows users to discover other nearby devices and allows them to
>> connect directly with one another over a local network.
>>
>
> Wait, why is it doing that? We actually support a NAN interface type
> upstream :) It's not really quite fully fleshed out, but it could be?
> Probably should be?

And this is the issue with Android drivers. Android team proposes
changes to the Wifi HAL and driver vendors have to implement those
quickly to meet product deadlines. Some infrastructure changes we're
able to get into the core kernel without having an in-tree driver that
uses them (such as introducing NL80211_IFTYPE_NAN), but there have been
instances of core kernel changes being rejected because there was not an
in-tree user.

Yes, in your ideal world all of the Android wifi drivers would be
in-tree. And in that ideal world every release cycle the Android team
would advocate for core kernel changes needed to support the new
features of the HAL. But past history has shown attempts to upstream new
features has been delayed, perhaps in part due to the absence of an
in-tree driver that utilizes those features, and the only way to meet
product deadlines is to take the vendor command route.

And yes my out-of-tree driver is facing the exact same issue with NAN
interface creation and deletion via vendor commands.

Previously you had suggested:
> Your easiest option might be to just patch NL80211_FLAG_NEED_RTNL into
> your kernel for vendor commands and call it a day?

Would you consider taking that upstream given that there are very few
in-tree users of vendor commands, and I fear Will and I aren't the only
ones who'll face this issue?

Will, suggest you at least advocate for getting this into the 5.15 ACK.

/jeff

2022-03-25 23:58:40

by William McVicker

[permalink] [raw]
Subject: Re: [BUG] deadlock in nl80211_vendor_cmd

On 03/25/2022, Johannes Berg wrote:
> On Fri, 2022-03-25 at 20:36 +0000, William McVicker wrote:
> >
> > I found that my wlan driver is using the vendor commands to create/delete NAN
> > interfaces for this Android feature called Wi-Fi aware [1]. Basically, this
> > features allows users to discover other nearby devices and allows them to
> > connect directly with one another over a local network.?
> >
>
> Wait, why is it doing that? We actually support a NAN interface type
> upstream :) It's not really quite fully fleshed out, but it could be?
> Probably should be?
>
>
> > Thread 1 Thread 2
> > nl80211_pre_doit():
> > rtnl_lock()
> > wiphy_lock() nl80211_pre_doit():
> > rtnl_lock() // blocked by Thread 1
> > nl80211_vendor_cmd():
> > doit()
> > cfg80211_unregister_netdevice()
> > rtnl_unlock():
> > netdev_run_todo():
> > __rtnl_unlock()
> > <got RTNL lock>
> > wiphy_lock() // blocked by Thread 1
> > rtnl_lock(); // DEADLOCK
> > nl80211_post_doit():
> > wiphy_unlock();
>
>
> Right, this is what I had discussed in my other mails.
>
> Basically, you're actually doing (some form of) unregister_netdevice()
> before rtnl_unlock().
>
> Clearly this isn't possible in cfg80211 itself.
>
> However, I couldn't entirely discount the possibility that this is
> possible:
>
> Thread 1 Thread 2
> rtnl_lock()
> unregister_netdevice()
> __rtnl_unlock()
> rtnl_lock()
> wiphy_lock()
> netdev_run_todo()
> __rtnl_unlock()
> // list not empty now
> // because of thread 2 rtnl_lock()
> rtnl_lock()
> wiphy_lock()
>
> ** DEADLOCK **
>
>
> Given my other discussion with Jakub though, it seems that we can indeed
> make sure that this cannot happen, and then this scenario is impossible
> without the unregistration you're doing.

Sounds good.

>
> > Since I'm unlocking the RTNL inside nl80211_vendor_cmd() after calling doit()
> > instead of waiting till post_doit(), I get into the situation you mentioned
> > where the net_todo_list is not empty when calling rtnl_unlock. So I decided to
> > drop the rtnl_unlock() in nl80211_vendor_cmd() and defer that until
> > nl80211_post_doit() after calling wiphy_unlock(). With this change, I haven't
> > been able to reproduce the deadlock. So it's possible that we aren't actually
> > able to hit this deadlock in nl80211_pre_doit() with the existing code since,
> > as you mentioned, one wouldn't be able to call unregister_netdevice() without
> > having the RTNL lock.
> >
>
> Right, this is why I said earlier that actually adding a flag for vendor
> commands to get the RTNL would be more complex - you'd have to basically
> open-code pre_doit() and post_doit() in there and check the sub-command
> flag at the very beginning and very end.
>
> johannes

Instead of open coding it, we could just pass the internal_flags via struct
genl_info to nl80211_vendor_cmds() and then handle the rtnl_unlock() there if
the vendor command doesn't request it. I included the patch below in case
there's any chance you would consider this for upstream. This would at least
add backwards compatibility to the vendor ops API so that existing drivers that
depend on the RTNL being held don't need to be fully refactored.

Thanks,
Will

[1] https://lore.kernel.org/all/[email protected]/

---
include/net/cfg80211.h | 1 +
include/net/genetlink.h | 1 +
net/netlink/genetlink.c | 1 +
net/wireless/nl80211.c | 54 +++++++++++++++++++++++++++++------------
4 files changed, 41 insertions(+), 16 deletions(-)

diff --git a/include/net/cfg80211.h b/include/net/cfg80211.h
index 30d86032e8cb..01f8a2cc6d11 100644
--- a/include/net/cfg80211.h
+++ b/include/net/cfg80211.h
@@ -4706,6 +4706,7 @@ enum wiphy_vendor_command_flags {
WIPHY_VENDOR_CMD_NEED_WDEV = BIT(0),
WIPHY_VENDOR_CMD_NEED_NETDEV = BIT(1),
WIPHY_VENDOR_CMD_NEED_RUNNING = BIT(2),
+ WIPHY_VENDOR_CMD_NEED_RTNL = BIT(3),
};

/**
diff --git a/include/net/genetlink.h b/include/net/genetlink.h
index 7cb3fa8310ed..e92796366492 100644
--- a/include/net/genetlink.h
+++ b/include/net/genetlink.h
@@ -92,6 +92,7 @@ struct genl_info {
possible_net_t _net;
void * user_ptr[2];
struct netlink_ext_ack *extack;
+ u8 internal_flags;
};

static inline struct net *genl_info_net(struct genl_info *info)
diff --git a/net/netlink/genetlink.c b/net/netlink/genetlink.c
index 1afca2a6c2ac..2db1c07c9f5a 100644
--- a/net/netlink/genetlink.c
+++ b/net/netlink/genetlink.c
@@ -719,6 +719,7 @@ static int genl_family_rcv_msg_doit(const struct genl_family *family,
info.userhdr = nlmsg_data(nlh) + GENL_HDRLEN;
info.attrs = attrbuf;
info.extack = extack;
+ info.internal_flags = ops->internal_flags;
genl_info_net_set(&info, net);
memset(&info.user_ptr, 0, sizeof(info.user_ptr));

diff --git a/net/wireless/nl80211.c b/net/wireless/nl80211.c
index 686a69381731..561c3cd3a9a0 100644
--- a/net/wireless/nl80211.c
+++ b/net/wireless/nl80211.c
@@ -13991,6 +13991,19 @@ static int nl80211_vendor_check_policy(const struct wiphy_vendor_command *vcmd,
return nla_validate_nested(attr, vcmd->maxattr, vcmd->policy, extack);
}

+#define NL80211_FLAG_NEED_WIPHY 0x01
+#define NL80211_FLAG_NEED_NETDEV 0x02
+#define NL80211_FLAG_NEED_RTNL 0x04
+#define NL80211_FLAG_CHECK_NETDEV_UP 0x08
+#define NL80211_FLAG_NEED_NETDEV_UP (NL80211_FLAG_NEED_NETDEV |\
+ NL80211_FLAG_CHECK_NETDEV_UP)
+#define NL80211_FLAG_NEED_WDEV 0x10
+/* If a netdev is associated, it must be UP, P2P must be started */
+#define NL80211_FLAG_NEED_WDEV_UP (NL80211_FLAG_NEED_WDEV |\
+ NL80211_FLAG_CHECK_NETDEV_UP)
+#define NL80211_FLAG_CLEAR_SKB 0x20
+#define NL80211_FLAG_NO_WIPHY_MTX 0x40
+
static int nl80211_vendor_cmd(struct sk_buff *skb, struct genl_info *info)
{
struct cfg80211_registered_device *rdev = info->user_ptr[0];
@@ -13999,6 +14012,12 @@ static int nl80211_vendor_cmd(struct sk_buff *skb, struct genl_info *info)
info->attrs);
int i, err;
u32 vid, subcmd;
+ bool internal_rtnl_flag = info->internal_flags & NL80211_FLAG_NEED_RTNL;
+
+ /* In case of an error, we need to set the RTNL flag so that we unlock the
+ * RTNL in post_doit().
+ */
+ info->internal_flags = NL80211_FLAG_NEED_RTNL;

if (!rdev->wiphy.vendor_commands)
return -EOPNOTSUPP;
@@ -14058,6 +14077,12 @@ static int nl80211_vendor_cmd(struct sk_buff *skb, struct genl_info *info)
return err;
}

+ if (!internal_rtnl_flag && !(vcmd->flags & WIPHY_VENDOR_CMD_NEED_RTNL)) {
+ rtnl_unlock();
+ /* clear the rtnl flag so that we don't unlock in post_doit(). */
+ info->internal_flags &= ~NL80211_FLAG_NEED_RTNL;
+ }
+
rdev->cur_cmd_info = info;
err = vcmd->doit(&rdev->wiphy, wdev, data, len);
rdev->cur_cmd_info = NULL;
@@ -15165,19 +15190,6 @@ static int nl80211_set_fils_aad(struct sk_buff *skb,
return rdev_set_fils_aad(rdev, dev, &fils_aad);
}

-#define NL80211_FLAG_NEED_WIPHY 0x01
-#define NL80211_FLAG_NEED_NETDEV 0x02
-#define NL80211_FLAG_NEED_RTNL 0x04
-#define NL80211_FLAG_CHECK_NETDEV_UP 0x08
-#define NL80211_FLAG_NEED_NETDEV_UP (NL80211_FLAG_NEED_NETDEV |\
- NL80211_FLAG_CHECK_NETDEV_UP)
-#define NL80211_FLAG_NEED_WDEV 0x10
-/* If a netdev is associated, it must be UP, P2P must be started */
-#define NL80211_FLAG_NEED_WDEV_UP (NL80211_FLAG_NEED_WDEV |\
- NL80211_FLAG_CHECK_NETDEV_UP)
-#define NL80211_FLAG_CLEAR_SKB 0x20
-#define NL80211_FLAG_NO_WIPHY_MTX 0x40
-
static int nl80211_pre_doit(const struct genl_ops *ops, struct sk_buff *skb,
struct genl_info *info)
{
@@ -15231,8 +15243,14 @@ static int nl80211_pre_doit(const struct genl_ops *ops, struct sk_buff *skb,
/* we keep the mutex locked until post_doit */
__release(&rdev->wiphy.mtx);
}
- if (!(ops->internal_flags & NL80211_FLAG_NEED_RTNL))
- rtnl_unlock();
+
+ /* NL80211 vendor command doit() will handle the RTNL unlocking based on the
+ * vendor command flags.
+ */
+ if (ops->cmd != NL80211_CMD_VENDOR) {
+ if (!(ops->internal_flags & NL80211_FLAG_NEED_RTNL))
+ rtnl_unlock();
+ }

return 0;
}
@@ -15259,7 +15277,11 @@ static void nl80211_post_doit(const struct genl_ops *ops, struct sk_buff *skb,
wiphy_unlock(&rdev->wiphy);
}

- if (ops->internal_flags & NL80211_FLAG_NEED_RTNL)
+ /* If a vendor command requested for the RTNL, then it will set the
+ * info->internal_flags to indicate that the RTNL needs to be released.
+ */
+ if (ops->internal_flags & NL80211_FLAG_NEED_RTNL ||
+ info->internal_flags & NL80211_FLAG_NEED_RTNL)
rtnl_unlock();

/* If needed, clear the netlink message payload from the SKB
--
2.35.1.1021.g381101b075-goog

2022-03-26 00:27:42

by Jakub Kicinski

[permalink] [raw]
Subject: Re: [BUG] deadlock in nl80211_vendor_cmd

On Fri, 25 Mar 2022 23:57:33 +0000 William McVicker wrote:
> Instead of open coding it, we could just pass the internal_flags via struct
> genl_info to nl80211_vendor_cmds() and then handle the rtnl_unlock() there if
> the vendor command doesn't request it. I included the patch below in case
> there's any chance you would consider this for upstream. This would at least
> add backwards compatibility to the vendor ops API so that existing drivers that
> depend on the RTNL being held don't need to be fully refactored.

Sorry to step in, Johannes may be AFK already. There's no asterisk next
to the "we don't cater to out of tree code" rule, AFAIK. We change
locking often, making a precedent like this would be ill-advised.

2022-03-26 00:33:04

by William McVicker

[permalink] [raw]
Subject: Re: [BUG] deadlock in nl80211_vendor_cmd

On 03/25/2022, Jakub Kicinski wrote:
> On Fri, 25 Mar 2022 23:57:33 +0000 William McVicker wrote:
> > Instead of open coding it, we could just pass the internal_flags via struct
> > genl_info to nl80211_vendor_cmds() and then handle the rtnl_unlock() there if
> > the vendor command doesn't request it. I included the patch below in case
> > there's any chance you would consider this for upstream. This would at least
> > add backwards compatibility to the vendor ops API so that existing drivers that
> > depend on the RTNL being held don't need to be fully refactored.
>
> Sorry to step in, Johannes may be AFK already. There's no asterisk next
> to the "we don't cater to out of tree code" rule, AFAIK. We change
> locking often, making a precedent like this would be ill-advised.

Yeah I understand. I'll talk to Broadcom about this to see why they didn't use
the existing upstream NAN interface. This sounds like it's going to be
a problem for all the Android out-of-tree drivers.

Thanks for the help!

--Will