2020-06-23 08:17:34

by Kaige Li

[permalink] [raw]
Subject: [PATCH RESEND] net/cisco: Fix a sleep-in-atomic-context bug in enic_init_affinity_hint()

The kernel module may sleep with holding a spinlock.

The function call paths (from bottom to top) are:

[FUNC] zalloc_cpumask_var(GFP_KERNEL)
drivers/net/ethernet/cisco/enic/enic_main.c, 125: zalloc_cpumask_var in enic_init_affinity_hint
drivers/net/ethernet/cisco/enic/enic_main.c, 1918: enic_init_affinity_hint in enic_open
drivers/net/ethernet/cisco/enic/enic_main.c, 2348: enic_open in enic_reset
drivers/net/ethernet/cisco/enic/enic_main.c, 2341: spin_lock in enic_reset

To fix this bug, GFP_KERNEL is replaced with GFP_ATOMIC.

Signed-off-by: Kaige Li <[email protected]>
---

+cc [email protected]

drivers/net/ethernet/cisco/enic/enic_main.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/cisco/enic/enic_main.c b/drivers/net/ethernet/cisco/enic/enic_main.c
index cd5fe4f..ee62065 100644
--- a/drivers/net/ethernet/cisco/enic/enic_main.c
+++ b/drivers/net/ethernet/cisco/enic/enic_main.c
@@ -122,7 +122,7 @@ static void enic_init_affinity_hint(struct enic *enic)
!cpumask_empty(enic->msix[i].affinity_mask)))
continue;
if (zalloc_cpumask_var(&enic->msix[i].affinity_mask,
- GFP_KERNEL))
+ GFP_ATOMIC))
cpumask_set_cpu(cpumask_local_spread(i, numa_node),
enic->msix[i].affinity_mask);
}
--
2.1.0


2020-06-23 20:52:06

by Jakub Kicinski

[permalink] [raw]
Subject: Re: [PATCH RESEND] net/cisco: Fix a sleep-in-atomic-context bug in enic_init_affinity_hint()

On Tue, 23 Jun 2020 16:13:09 +0800 Kaige Li wrote:
> The kernel module may sleep with holding a spinlock.
>
> The function call paths (from bottom to top) are:
>
> [FUNC] zalloc_cpumask_var(GFP_KERNEL)
> drivers/net/ethernet/cisco/enic/enic_main.c, 125: zalloc_cpumask_var in enic_init_affinity_hint
> drivers/net/ethernet/cisco/enic/enic_main.c, 1918: enic_init_affinity_hint in enic_open
> drivers/net/ethernet/cisco/enic/enic_main.c, 2348: enic_open in enic_reset
> drivers/net/ethernet/cisco/enic/enic_main.c, 2341: spin_lock in enic_reset
>
> To fix this bug, GFP_KERNEL is replaced with GFP_ATOMIC.
>
> Signed-off-by: Kaige Li <[email protected]>

I don't think this is sufficient. Calling open with a spin lock held
seems like a very bad idea. At a quick look the driver also calls
request_irq() from open - request_irq() can sleep.

2020-06-23 21:37:07

by David Miller

[permalink] [raw]
Subject: Re: [PATCH RESEND] net/cisco: Fix a sleep-in-atomic-context bug in enic_init_affinity_hint()

From: Kaige Li <[email protected]>
Date: Tue, 23 Jun 2020 16:13:09 +0800

> The kernel module may sleep with holding a spinlock.
>
> The function call paths (from bottom to top) are:
>
> [FUNC] zalloc_cpumask_var(GFP_KERNEL)
> drivers/net/ethernet/cisco/enic/enic_main.c, 125: zalloc_cpumask_var in enic_init_affinity_hint
> drivers/net/ethernet/cisco/enic/enic_main.c, 1918: enic_init_affinity_hint in enic_open
> drivers/net/ethernet/cisco/enic/enic_main.c, 2348: enic_open in enic_reset
> drivers/net/ethernet/cisco/enic/enic_main.c, 2341: spin_lock in enic_reset
>
> To fix this bug, GFP_KERNEL is replaced with GFP_ATOMIC.
>
> Signed-off-by: Kaige Li <[email protected]>

Just grepping around for GFP_KERNEL usage in atomic contexts I guess
is fine.

But you really have to look at the bigger picture.

Calling a NIC driver open function from a context holding a spinlock
is very much the real problem, so many operations have to sleep and
in face that ->ndo_open() method is defined as being allowed to sleep
and that's why the core networking never invokes it with spinlocks
held.

2020-06-23 22:52:23

by David Miller

[permalink] [raw]
Subject: Re: [PATCH RESEND] net/cisco: Fix a sleep-in-atomic-context bug in enic_init_affinity_hint()

From: David Miller <[email protected]>
Date: Tue, 23 Jun 2020 14:33:11 -0700 (PDT)

> Calling a NIC driver open function from a context holding a spinlock
> is very much the real problem, so many operations have to sleep and
> in face that ->ndo_open() method is defined as being allowed to sleep
> and that's why the core networking never invokes it with spinlocks
^^^^

I mean "without" of course. :-)

> held.

2020-06-24 01:57:37

by Kaige Li

[permalink] [raw]
Subject: Re: [PATCH RESEND] net/cisco: Fix a sleep-in-atomic-context bug in enic_init_affinity_hint()


On 06/24/2020 06:26 AM, David Miller wrote:
> From: David Miller <[email protected]>
> Date: Tue, 23 Jun 2020 14:33:11 -0700 (PDT)
>
>> Calling a NIC driver open function from a context holding a spinlock
>> is very much the real problem, so many operations have to sleep and
>> in face that ->ndo_open() method is defined as being allowed to sleep
>> and that's why the core networking never invokes it with spinlocks
> ^^^^
>
> I mean "without" of course. :-)
>
>> held.

Did you mean that open function should be out of spinlock? If so, I will
send V2 patch.

Thank you.

2020-06-24 02:09:00

by Kaige Li

[permalink] [raw]
Subject: Re: [PATCH RESEND] net/cisco: Fix a sleep-in-atomic-context bug in enic_init_affinity_hint()


On 06/24/2020 04:50 AM, Jakub Kicinski wrote:
> On Tue, 23 Jun 2020 16:13:09 +0800 Kaige Li wrote:
>> The kernel module may sleep with holding a spinlock.
>>
>> The function call paths (from bottom to top) are:
>>
>> [FUNC] zalloc_cpumask_var(GFP_KERNEL)
>> drivers/net/ethernet/cisco/enic/enic_main.c, 125: zalloc_cpumask_var in enic_init_affinity_hint
>> drivers/net/ethernet/cisco/enic/enic_main.c, 1918: enic_init_affinity_hint in enic_open
>> drivers/net/ethernet/cisco/enic/enic_main.c, 2348: enic_open in enic_reset
>> drivers/net/ethernet/cisco/enic/enic_main.c, 2341: spin_lock in enic_reset
>>
>> To fix this bug, GFP_KERNEL is replaced with GFP_ATOMIC.
>>
>> Signed-off-by: Kaige Li <[email protected]>
> I don't think this is sufficient. Calling open with a spin lock held
> seems like a very bad idea. At a quick look the driver also calls
> request_irq() from open - request_irq() can sleep.

You are right. Should I do spin_unlock before the enic_open, or remove
spin_lock in enic_reset?

Thank you.

2020-06-24 03:25:08

by David Miller

[permalink] [raw]
Subject: Re: [PATCH RESEND] net/cisco: Fix a sleep-in-atomic-context bug in enic_init_affinity_hint()

From: Kaige Li <[email protected]>
Date: Wed, 24 Jun 2020 10:07:16 +0800

> You are right. Should I do spin_unlock before the enic_open, or remove
> spin_lock in enic_reset?

You need to learn how this driver's locking works and design a correct
adjustment.

2020-06-24 03:25:56

by David Miller

[permalink] [raw]
Subject: Re: [PATCH RESEND] net/cisco: Fix a sleep-in-atomic-context bug in enic_init_affinity_hint()

From: Kaige Li <[email protected]>
Date: Wed, 24 Jun 2020 09:56:47 +0800

>
> On 06/24/2020 06:26 AM, David Miller wrote:
>> From: David Miller <[email protected]>
>> Date: Tue, 23 Jun 2020 14:33:11 -0700 (PDT)
>>
>>> Calling a NIC driver open function from a context holding a spinlock
>>> is very much the real problem, so many operations have to sleep and
>>> in face that ->ndo_open() method is defined as being allowed to sleep
>>> and that's why the core networking never invokes it with spinlocks
>> ^^^^
>>
>> I mean "without" of course. :-)
>>
>>> held.
>
> Did you mean that open function should be out of spinlock? If so, I
> will
> send V2 patch.

Yes, but only if that is safe.

You have to analyze the locking done by this driver and fix it properly.
I anticipate it is not just a matter of changing where the spinlock
is held, you will have to rearchitect things a bit.

2020-06-24 03:42:06

by Kaige Li

[permalink] [raw]
Subject: Re: [PATCH RESEND] net/cisco: Fix a sleep-in-atomic-context bug in enic_init_affinity_hint()


On 06/24/2020 11:23 AM, David Miller wrote:
> From: Kaige Li <[email protected]>
> Date: Wed, 24 Jun 2020 09:56:47 +0800
>
>> On 06/24/2020 06:26 AM, David Miller wrote:
>>> From: David Miller <[email protected]>
>>> Date: Tue, 23 Jun 2020 14:33:11 -0700 (PDT)
>>>
>>>> Calling a NIC driver open function from a context holding a spinlock
>>>> is very much the real problem, so many operations have to sleep and
>>>> in face that ->ndo_open() method is defined as being allowed to sleep
>>>> and that's why the core networking never invokes it with spinlocks
>>> ^^^^
>>>
>>> I mean "without" of course. :-)
>>>
>>>> held.
>> Did you mean that open function should be out of spinlock? If so, I
>> will
>> send V2 patch.
> Yes, but only if that is safe.
>
> You have to analyze the locking done by this driver and fix it properly.
> I anticipate it is not just a matter of changing where the spinlock
> is held, you will have to rearchitect things a bit.

Okay, I will careful analyze this question, and make a suitable patch in V2.

Thank you.

Subject: RE: [PATCH RESEND] net/cisco: Fix a sleep-in-atomic-context bug in enic_init_affinity_hint()

> -----Original Message-----
> From: [email protected] <[email protected]>
> On Behalf Of Kaige Li
> Sent: Tuesday, June 23, 2020 8:41 PM
> To: David Miller <[email protected]>
> Cc: Christian Benvenuti (benve) <[email protected]>; [email protected];
> [email protected]; [email protected];
> [email protected]; [email protected]
> Subject: Re: [PATCH RESEND] net/cisco: Fix a sleep-in-atomic-context bug in
> enic_init_affinity_hint()
>
>
> On 06/24/2020 11:23 AM, David Miller wrote:
> > From: Kaige Li <[email protected]>
> > Date: Wed, 24 Jun 2020 09:56:47 +0800
> >
> >> On 06/24/2020 06:26 AM, David Miller wrote:
> >>> From: David Miller <[email protected]>
> >>> Date: Tue, 23 Jun 2020 14:33:11 -0700 (PDT)
> >>>
> >>>> Calling a NIC driver open function from a context holding a
> >>>> spinlock is very much the real problem, so many operations have to
> >>>> sleep and in face that ->ndo_open() method is defined as being
> >>>> allowed to sleep and that's why the core networking never invokes
> >>>> it with spinlocks
> >>> ^^^^
> >>>
> >>> I mean "without" of course. :-)
> >>>
> >>>> held.
> >> Did you mean that open function should be out of spinlock? If so, I
> >> will send V2 patch.
> > Yes, but only if that is safe.
> >
> > You have to analyze the locking done by this driver and fix it properly.
> > I anticipate it is not just a matter of changing where the spinlock
> > is held, you will have to rearchitect things a bit.
>
> Okay, I will careful analyze this question, and make a suitable patch in V2.
>
> Thank you.

Hi David, Kaige,
I assume you are referring to the enic_api_lock spin_lock used in

enic_reset()
which is used to hard-reset the interface when the driver receives an error interrupt

and

enic_tx_hang_reset()
which is used to soft-reset the interface when the stack detects the TX timeout.

Both reset functions above are called in the context of a workqueue.
However, the same spin_lock (enic_api_lock) is taken by the enic_api_devcmd_proxy_by_index() api that is exported by the enic driver and that the usnic_verbs driver uses to send commands to the firmware.
This spin_lock was likely added to guarantee that no firmware command is sent by usnic_verbs to an enic interface that is undergoing a reset.
Unfortunately changing that spin_lock to a mutex will likely not work on the usnic_verbs side, and removing the spin_lock will require a rearchitect of the code as mentioned by David.

Kaige, V2 is of course more than welcome and we can test it too.
We/Cisco will also look into it, hopefully a small code reorg will be sufficient.

David, from your previous emails on this 3D I assume
- we can leave request_irq() in ndo_open (ie, no need to move it to pci/probe), which is done by a number of other drivers too.
- no need to change GFP_KERNEL to GFP_ATOMIC as it was suggested in the original patch.

Thanks!
/Chris

2020-06-24 17:01:37

by Jakub Kicinski

[permalink] [raw]
Subject: Re: [PATCH RESEND] net/cisco: Fix a sleep-in-atomic-context bug in enic_init_affinity_hint()

On Wed, 24 Jun 2020 06:32:36 +0000 Christian Benvenuti (benve) wrote:
> We/Cisco will also look into it, hopefully a small code reorg will be sufficient.

Make sure you enable CONFIG_DEBUG_ATOMIC_SLEEP when you test.

2020-07-03 06:57:46

by Kaige Li

[permalink] [raw]
Subject: Re: [PATCH RESEND] net/cisco: Fix a sleep-in-atomic-context bug in enic_init_affinity_hint()


On 06/25/2020 12:59 AM, Jakub Kicinski wrote:
> On Wed, 24 Jun 2020 06:32:36 +0000 Christian Benvenuti (benve) wrote:
>> We/Cisco will also look into it, hopefully a small code reorg will be sufficient.

Hi, Christian:

I have seen some submissions and codes, and feel that spin_lock is unnecessary in enci_reset.<https://lore.kernel.org/patchwork/project/lkml/list/?submitter=28441>

<https://lore.kernel.org/patchwork/project/lkml/list/?submitter=28441>

Some key submissions are as follows.

Tue Sep 16 00:17:11 2008

git show 01f2e4ead. we can see that spin_lock is just in here:

+ spin_lock(&enic->devcmd_lock);

+ vnic_dev_hang_notify(enic->vdev);

+ spin_unlock(&enic->devcmd_lock);



Sat Aug 17 06:47:40 2013

git show 0b038566c: Add an interface for USNIC to interact with firmware.

Before commit-id: 0b038566c, spin_lock is not used in enic_reset. rtnl_lock() is enough. And 0b038566c add a interface: enic_api_devcmd_proxy_by_index.

Enic_api_devcmd_proxy_by_index is just used in ./drivers/infiniband/hw/usnic/usnic_fwd.c:50, which is added in 2183b990.

+ spin_lock(&enic->enic_api_lock);

enic_dev_hang_notify(enic);


enic_dev_set_ig_vlan_rewrite_mode(enic);

enic_open(enic->netdev);

+ spin_unlock(&enic->enic_api_lock);

By analyzing enic_api_lock, it's mainly used for locking vnic_dev_cmd(vdev, cmd, a0, a1, wait). And enic_reset didn't call to vnic_dev_cmd.

So, I think spin_lock may be deleted in enci_reset. What do you think? Or you have better advice.

Thank you.

> Make sure you enable CONFIG_DEBUG_ATOMIC_SLEEP when you test.