The kernel module may sleep with holding a spinlock.
The function call paths (from bottom to top) are:
[FUNC] zalloc_cpumask_var(GFP_KERNEL)
drivers/net/ethernet/cisco/enic/enic_main.c, 125: zalloc_cpumask_var in enic_init_affinity_hint
drivers/net/ethernet/cisco/enic/enic_main.c, 1918: enic_init_affinity_hint in enic_open
drivers/net/ethernet/cisco/enic/enic_main.c, 2348: enic_open in enic_reset
drivers/net/ethernet/cisco/enic/enic_main.c, 2341: spin_lock in enic_reset
To fix this bug, GFP_KERNEL is replaced with GFP_ATOMIC.
Signed-off-by: Kaige Li <[email protected]>
---
+cc [email protected]
drivers/net/ethernet/cisco/enic/enic_main.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/net/ethernet/cisco/enic/enic_main.c b/drivers/net/ethernet/cisco/enic/enic_main.c
index cd5fe4f..ee62065 100644
--- a/drivers/net/ethernet/cisco/enic/enic_main.c
+++ b/drivers/net/ethernet/cisco/enic/enic_main.c
@@ -122,7 +122,7 @@ static void enic_init_affinity_hint(struct enic *enic)
!cpumask_empty(enic->msix[i].affinity_mask)))
continue;
if (zalloc_cpumask_var(&enic->msix[i].affinity_mask,
- GFP_KERNEL))
+ GFP_ATOMIC))
cpumask_set_cpu(cpumask_local_spread(i, numa_node),
enic->msix[i].affinity_mask);
}
--
2.1.0
On Tue, 23 Jun 2020 16:13:09 +0800 Kaige Li wrote:
> The kernel module may sleep with holding a spinlock.
>
> The function call paths (from bottom to top) are:
>
> [FUNC] zalloc_cpumask_var(GFP_KERNEL)
> drivers/net/ethernet/cisco/enic/enic_main.c, 125: zalloc_cpumask_var in enic_init_affinity_hint
> drivers/net/ethernet/cisco/enic/enic_main.c, 1918: enic_init_affinity_hint in enic_open
> drivers/net/ethernet/cisco/enic/enic_main.c, 2348: enic_open in enic_reset
> drivers/net/ethernet/cisco/enic/enic_main.c, 2341: spin_lock in enic_reset
>
> To fix this bug, GFP_KERNEL is replaced with GFP_ATOMIC.
>
> Signed-off-by: Kaige Li <[email protected]>
I don't think this is sufficient. Calling open with a spin lock held
seems like a very bad idea. At a quick look the driver also calls
request_irq() from open - request_irq() can sleep.
From: Kaige Li <[email protected]>
Date: Tue, 23 Jun 2020 16:13:09 +0800
> The kernel module may sleep with holding a spinlock.
>
> The function call paths (from bottom to top) are:
>
> [FUNC] zalloc_cpumask_var(GFP_KERNEL)
> drivers/net/ethernet/cisco/enic/enic_main.c, 125: zalloc_cpumask_var in enic_init_affinity_hint
> drivers/net/ethernet/cisco/enic/enic_main.c, 1918: enic_init_affinity_hint in enic_open
> drivers/net/ethernet/cisco/enic/enic_main.c, 2348: enic_open in enic_reset
> drivers/net/ethernet/cisco/enic/enic_main.c, 2341: spin_lock in enic_reset
>
> To fix this bug, GFP_KERNEL is replaced with GFP_ATOMIC.
>
> Signed-off-by: Kaige Li <[email protected]>
Just grepping around for GFP_KERNEL usage in atomic contexts I guess
is fine.
But you really have to look at the bigger picture.
Calling a NIC driver open function from a context holding a spinlock
is very much the real problem, so many operations have to sleep and
in face that ->ndo_open() method is defined as being allowed to sleep
and that's why the core networking never invokes it with spinlocks
held.
From: David Miller <[email protected]>
Date: Tue, 23 Jun 2020 14:33:11 -0700 (PDT)
> Calling a NIC driver open function from a context holding a spinlock
> is very much the real problem, so many operations have to sleep and
> in face that ->ndo_open() method is defined as being allowed to sleep
> and that's why the core networking never invokes it with spinlocks
^^^^
I mean "without" of course. :-)
> held.
On 06/24/2020 06:26 AM, David Miller wrote:
> From: David Miller <[email protected]>
> Date: Tue, 23 Jun 2020 14:33:11 -0700 (PDT)
>
>> Calling a NIC driver open function from a context holding a spinlock
>> is very much the real problem, so many operations have to sleep and
>> in face that ->ndo_open() method is defined as being allowed to sleep
>> and that's why the core networking never invokes it with spinlocks
> ^^^^
>
> I mean "without" of course. :-)
>
>> held.
Did you mean that open function should be out of spinlock? If so, I will
send V2 patch.
Thank you.
On 06/24/2020 04:50 AM, Jakub Kicinski wrote:
> On Tue, 23 Jun 2020 16:13:09 +0800 Kaige Li wrote:
>> The kernel module may sleep with holding a spinlock.
>>
>> The function call paths (from bottom to top) are:
>>
>> [FUNC] zalloc_cpumask_var(GFP_KERNEL)
>> drivers/net/ethernet/cisco/enic/enic_main.c, 125: zalloc_cpumask_var in enic_init_affinity_hint
>> drivers/net/ethernet/cisco/enic/enic_main.c, 1918: enic_init_affinity_hint in enic_open
>> drivers/net/ethernet/cisco/enic/enic_main.c, 2348: enic_open in enic_reset
>> drivers/net/ethernet/cisco/enic/enic_main.c, 2341: spin_lock in enic_reset
>>
>> To fix this bug, GFP_KERNEL is replaced with GFP_ATOMIC.
>>
>> Signed-off-by: Kaige Li <[email protected]>
> I don't think this is sufficient. Calling open with a spin lock held
> seems like a very bad idea. At a quick look the driver also calls
> request_irq() from open - request_irq() can sleep.
You are right. Should I do spin_unlock before the enic_open, or remove
spin_lock in enic_reset?
Thank you.
From: Kaige Li <[email protected]>
Date: Wed, 24 Jun 2020 10:07:16 +0800
> You are right. Should I do spin_unlock before the enic_open, or remove
> spin_lock in enic_reset?
You need to learn how this driver's locking works and design a correct
adjustment.
From: Kaige Li <[email protected]>
Date: Wed, 24 Jun 2020 09:56:47 +0800
>
> On 06/24/2020 06:26 AM, David Miller wrote:
>> From: David Miller <[email protected]>
>> Date: Tue, 23 Jun 2020 14:33:11 -0700 (PDT)
>>
>>> Calling a NIC driver open function from a context holding a spinlock
>>> is very much the real problem, so many operations have to sleep and
>>> in face that ->ndo_open() method is defined as being allowed to sleep
>>> and that's why the core networking never invokes it with spinlocks
>> ^^^^
>>
>> I mean "without" of course. :-)
>>
>>> held.
>
> Did you mean that open function should be out of spinlock? If so, I
> will
> send V2 patch.
Yes, but only if that is safe.
You have to analyze the locking done by this driver and fix it properly.
I anticipate it is not just a matter of changing where the spinlock
is held, you will have to rearchitect things a bit.
On 06/24/2020 11:23 AM, David Miller wrote:
> From: Kaige Li <[email protected]>
> Date: Wed, 24 Jun 2020 09:56:47 +0800
>
>> On 06/24/2020 06:26 AM, David Miller wrote:
>>> From: David Miller <[email protected]>
>>> Date: Tue, 23 Jun 2020 14:33:11 -0700 (PDT)
>>>
>>>> Calling a NIC driver open function from a context holding a spinlock
>>>> is very much the real problem, so many operations have to sleep and
>>>> in face that ->ndo_open() method is defined as being allowed to sleep
>>>> and that's why the core networking never invokes it with spinlocks
>>> ^^^^
>>>
>>> I mean "without" of course. :-)
>>>
>>>> held.
>> Did you mean that open function should be out of spinlock? If so, I
>> will
>> send V2 patch.
> Yes, but only if that is safe.
>
> You have to analyze the locking done by this driver and fix it properly.
> I anticipate it is not just a matter of changing where the spinlock
> is held, you will have to rearchitect things a bit.
Okay, I will careful analyze this question, and make a suitable patch in V2.
Thank you.
> -----Original Message-----
> From: [email protected] <[email protected]>
> On Behalf Of Kaige Li
> Sent: Tuesday, June 23, 2020 8:41 PM
> To: David Miller <[email protected]>
> Cc: Christian Benvenuti (benve) <[email protected]>; [email protected];
> [email protected]; [email protected];
> [email protected]; [email protected]
> Subject: Re: [PATCH RESEND] net/cisco: Fix a sleep-in-atomic-context bug in
> enic_init_affinity_hint()
>
>
> On 06/24/2020 11:23 AM, David Miller wrote:
> > From: Kaige Li <[email protected]>
> > Date: Wed, 24 Jun 2020 09:56:47 +0800
> >
> >> On 06/24/2020 06:26 AM, David Miller wrote:
> >>> From: David Miller <[email protected]>
> >>> Date: Tue, 23 Jun 2020 14:33:11 -0700 (PDT)
> >>>
> >>>> Calling a NIC driver open function from a context holding a
> >>>> spinlock is very much the real problem, so many operations have to
> >>>> sleep and in face that ->ndo_open() method is defined as being
> >>>> allowed to sleep and that's why the core networking never invokes
> >>>> it with spinlocks
> >>> ^^^^
> >>>
> >>> I mean "without" of course. :-)
> >>>
> >>>> held.
> >> Did you mean that open function should be out of spinlock? If so, I
> >> will send V2 patch.
> > Yes, but only if that is safe.
> >
> > You have to analyze the locking done by this driver and fix it properly.
> > I anticipate it is not just a matter of changing where the spinlock
> > is held, you will have to rearchitect things a bit.
>
> Okay, I will careful analyze this question, and make a suitable patch in V2.
>
> Thank you.
Hi David, Kaige,
I assume you are referring to the enic_api_lock spin_lock used in
enic_reset()
which is used to hard-reset the interface when the driver receives an error interrupt
and
enic_tx_hang_reset()
which is used to soft-reset the interface when the stack detects the TX timeout.
Both reset functions above are called in the context of a workqueue.
However, the same spin_lock (enic_api_lock) is taken by the enic_api_devcmd_proxy_by_index() api that is exported by the enic driver and that the usnic_verbs driver uses to send commands to the firmware.
This spin_lock was likely added to guarantee that no firmware command is sent by usnic_verbs to an enic interface that is undergoing a reset.
Unfortunately changing that spin_lock to a mutex will likely not work on the usnic_verbs side, and removing the spin_lock will require a rearchitect of the code as mentioned by David.
Kaige, V2 is of course more than welcome and we can test it too.
We/Cisco will also look into it, hopefully a small code reorg will be sufficient.
David, from your previous emails on this 3D I assume
- we can leave request_irq() in ndo_open (ie, no need to move it to pci/probe), which is done by a number of other drivers too.
- no need to change GFP_KERNEL to GFP_ATOMIC as it was suggested in the original patch.
Thanks!
/Chris
On Wed, 24 Jun 2020 06:32:36 +0000 Christian Benvenuti (benve) wrote:
> We/Cisco will also look into it, hopefully a small code reorg will be sufficient.
Make sure you enable CONFIG_DEBUG_ATOMIC_SLEEP when you test.
On 06/25/2020 12:59 AM, Jakub Kicinski wrote:
> On Wed, 24 Jun 2020 06:32:36 +0000 Christian Benvenuti (benve) wrote:
>> We/Cisco will also look into it, hopefully a small code reorg will be sufficient.
Hi, Christian:
I have seen some submissions and codes, and feel that spin_lock is unnecessary in enci_reset.<https://lore.kernel.org/patchwork/project/lkml/list/?submitter=28441>
<https://lore.kernel.org/patchwork/project/lkml/list/?submitter=28441>
Some key submissions are as follows.
Tue Sep 16 00:17:11 2008
git show 01f2e4ead. we can see that spin_lock is just in here:
+ spin_lock(&enic->devcmd_lock);
+ vnic_dev_hang_notify(enic->vdev);
+ spin_unlock(&enic->devcmd_lock);
Sat Aug 17 06:47:40 2013
git show 0b038566c: Add an interface for USNIC to interact with firmware.
Before commit-id: 0b038566c, spin_lock is not used in enic_reset. rtnl_lock() is enough. And 0b038566c add a interface: enic_api_devcmd_proxy_by_index.
Enic_api_devcmd_proxy_by_index is just used in ./drivers/infiniband/hw/usnic/usnic_fwd.c:50, which is added in 2183b990.
+ spin_lock(&enic->enic_api_lock);
enic_dev_hang_notify(enic);
enic_dev_set_ig_vlan_rewrite_mode(enic);
enic_open(enic->netdev);
+ spin_unlock(&enic->enic_api_lock);
By analyzing enic_api_lock, it's mainly used for locking vnic_dev_cmd(vdev, cmd, a0, a1, wait). And enic_reset didn't call to vnic_dev_cmd.
So, I think spin_lock may be deleted in enci_reset. What do you think? Or you have better advice.
Thank you.
> Make sure you enable CONFIG_DEBUG_ATOMIC_SLEEP when you test.