2019-02-07 01:41:52

by Brian Norris

Subject: [PATCH] ath10k: pci: use mutex for diagnostic window CE polling

The DIAG copy engine is only used via polling, but it holds a spinlock
with softirqs disabled. Each iteration of our read/write loops can
theoretically take 20ms (two 10ms timeout loops), and this loop can be
run an unbounded number of times while holding the spinlock -- dependent
on the request size given by the caller.

As of commit 39501ea64116 ("ath10k: download firmware via diag Copy
Engine for QCA6174 and QCA9377."), we transfer large chunks of firmware
memory using this mechanism. With large enough firmware segments, this
becomes an exceedingly long period for disabling soft IRQs. For example,
with a 500KiB firmware segment, in testing QCA6174A, I see 200 loop
iterations of about 50-100us each, which can total about 10-20ms.

In reality, we don't really need to block softirqs for this duration.
The DIAG CE is only used in polling mode, and we only need to hold
ce_lock to make sure any CE bookkeeping is done without screwing up
another CE. Otherwise, we only need to ensure exclusion between
ath10k_pci_diag_{read,write}_mem() contexts.

This patch moves to use fine-grained locking for the shared ce_lock,
while adding a new mutex just to ensure mutual exclusion of diag
read/write operations.
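The locking split can be sketched in userspace with pthreads standing in for the kernel primitives; everything except the `ce_lock`/`ce_diag_mutex` naming is illustrative:

```c
#include <pthread.h>

/* Userspace sketch of the split described above: a new mutex serializes
 * whole diag read/write operations, while the shared ce_lock is held
 * only for the short CE ring-bookkeeping sections inside each step. */
static pthread_mutex_t ce_diag_mutex = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t ce_lock       = PTHREAD_MUTEX_INITIALIZER;
static int ring_index;

/* Short critical section: only shared ring state needs ce_lock. */
static void ce_bookkeeping_step(void)
{
	pthread_mutex_lock(&ce_lock);
	ring_index++;		/* stands in for CE descriptor updates */
	pthread_mutex_unlock(&ce_lock);
}

/* A whole diag transfer: exclusive vs. other diag callers, but ce_lock
 * is never held across the long polling waits between steps. */
static int diag_transfer(int nchunks)
{
	pthread_mutex_lock(&ce_diag_mutex);
	for (int i = 0; i < nchunks; i++)
		ce_bookkeeping_step();
	pthread_mutex_unlock(&ce_diag_mutex);
	return ring_index;
}
```

The key property is that the lock shared with other CEs is only held for bookkeeping, so softirqs are no longer blocked for the duration of the transfer.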

Tested on QCA6174A, firmware version WLAN.RM.4.4.1-00132-QCARMSWPZ-1.

Fixes: 39501ea64116 ("ath10k: download firmware via diag Copy Engine for QCA6174 and QCA9377.")
Signed-off-by: Brian Norris <[email protected]>
---
I'm not quite sure if this is -stable material. It's technically an
existing bug, but it looks like previously, the DIAG interface was only
exposed via debugfs -- you could block softirqs by playing with
/sys/kernel/debug/ieee80211/phyX/ath10k/mem_value. The new usage (for
loading firmware, on QCA6174 and QCA9377) might constitute a significant
regression.

Also, I'd appreciate some review from Qualcomm folks as always.
---
drivers/net/wireless/ath/ath10k/pci.c | 41 ++++++++++-----------------
drivers/net/wireless/ath/ath10k/pci.h | 3 ++
2 files changed, 18 insertions(+), 26 deletions(-)

diff --git a/drivers/net/wireless/ath/ath10k/pci.c b/drivers/net/wireless/ath/ath10k/pci.c
index 39e0b1cc2a12..f8356b3bf150 100644
--- a/drivers/net/wireless/ath/ath10k/pci.c
+++ b/drivers/net/wireless/ath/ath10k/pci.c
@@ -913,7 +913,6 @@ static int ath10k_pci_diag_read_mem(struct ath10k *ar, u32 address, void *data,
int nbytes)
{
struct ath10k_pci *ar_pci = ath10k_pci_priv(ar);
- struct ath10k_ce *ce = ath10k_ce_priv(ar);
int ret = 0;
u32 *buf;
unsigned int completed_nbytes, alloc_nbytes, remaining_bytes;
@@ -924,8 +923,7 @@ static int ath10k_pci_diag_read_mem(struct ath10k *ar, u32 address, void *data,
void *data_buf = NULL;
int i;

- spin_lock_bh(&ce->ce_lock);
-
+ mutex_lock(&ar_pci->ce_diag_mutex);
ce_diag = ar_pci->ce_diag;

/*
@@ -960,19 +958,17 @@ static int ath10k_pci_diag_read_mem(struct ath10k *ar, u32 address, void *data,
nbytes = min_t(unsigned int, remaining_bytes,
DIAG_TRANSFER_LIMIT);

- ret = ce_diag->ops->ce_rx_post_buf(ce_diag, &ce_data, ce_data);
+ ret = ath10k_ce_rx_post_buf(ce_diag, &ce_data, ce_data);
if (ret != 0)
goto done;

/* Request CE to send from Target(!) address to Host buffer */
- ret = ath10k_ce_send_nolock(ce_diag, NULL, (u32)address, nbytes, 0,
- 0);
+ ret = ath10k_ce_send(ce_diag, NULL, (u32)address, nbytes, 0, 0);
if (ret)
goto done;

i = 0;
- while (ath10k_ce_completed_send_next_nolock(ce_diag,
- NULL) != 0) {
+ while (ath10k_ce_completed_send_next(ce_diag, NULL) != 0) {
udelay(DIAG_ACCESS_CE_WAIT_US);
i += DIAG_ACCESS_CE_WAIT_US;

@@ -983,10 +979,8 @@ static int ath10k_pci_diag_read_mem(struct ath10k *ar, u32 address, void *data,
}

i = 0;
- while (ath10k_ce_completed_recv_next_nolock(ce_diag,
- (void **)&buf,
- &completed_nbytes)
- != 0) {
+ while (ath10k_ce_completed_recv_next(ce_diag, (void **)&buf,
+ &completed_nbytes) != 0) {
udelay(DIAG_ACCESS_CE_WAIT_US);
i += DIAG_ACCESS_CE_WAIT_US;

@@ -1019,7 +1013,7 @@ static int ath10k_pci_diag_read_mem(struct ath10k *ar, u32 address, void *data,
dma_free_coherent(ar->dev, alloc_nbytes, data_buf,
ce_data_base);

- spin_unlock_bh(&ce->ce_lock);
+ mutex_unlock(&ar_pci->ce_diag_mutex);

return ret;
}
@@ -1067,7 +1061,6 @@ int ath10k_pci_diag_write_mem(struct ath10k *ar, u32 address,
const void *data, int nbytes)
{
struct ath10k_pci *ar_pci = ath10k_pci_priv(ar);
- struct ath10k_ce *ce = ath10k_ce_priv(ar);
int ret = 0;
u32 *buf;
unsigned int completed_nbytes, alloc_nbytes, remaining_bytes;
@@ -1076,8 +1069,7 @@ int ath10k_pci_diag_write_mem(struct ath10k *ar, u32 address,
dma_addr_t ce_data_base = 0;
int i;

- spin_lock_bh(&ce->ce_lock);
-
+ mutex_lock(&ar_pci->ce_diag_mutex);
ce_diag = ar_pci->ce_diag;

/*
@@ -1118,7 +1110,7 @@ int ath10k_pci_diag_write_mem(struct ath10k *ar, u32 address,
memcpy(data_buf, data, nbytes);

/* Set up to receive directly into Target(!) address */
- ret = ce_diag->ops->ce_rx_post_buf(ce_diag, &address, address);
+ ret = ath10k_ce_rx_post_buf(ce_diag, &address, address);
if (ret != 0)
goto done;

@@ -1126,14 +1118,12 @@ int ath10k_pci_diag_write_mem(struct ath10k *ar, u32 address,
* Request CE to send caller-supplied data that
* was copied to bounce buffer to Target(!) address.
*/
- ret = ath10k_ce_send_nolock(ce_diag, NULL, ce_data_base,
- nbytes, 0, 0);
+ ret = ath10k_ce_send(ce_diag, NULL, ce_data_base, nbytes, 0, 0);
if (ret != 0)
goto done;

i = 0;
- while (ath10k_ce_completed_send_next_nolock(ce_diag,
- NULL) != 0) {
+ while (ath10k_ce_completed_send_next(ce_diag, NULL) != 0) {
udelay(DIAG_ACCESS_CE_WAIT_US);
i += DIAG_ACCESS_CE_WAIT_US;

@@ -1144,10 +1134,8 @@ int ath10k_pci_diag_write_mem(struct ath10k *ar, u32 address,
}

i = 0;
- while (ath10k_ce_completed_recv_next_nolock(ce_diag,
- (void **)&buf,
- &completed_nbytes)
- != 0) {
+ while (ath10k_ce_completed_recv_next(ce_diag, (void **)&buf,
+ &completed_nbytes) != 0) {
udelay(DIAG_ACCESS_CE_WAIT_US);
i += DIAG_ACCESS_CE_WAIT_US;

@@ -1182,7 +1170,7 @@ int ath10k_pci_diag_write_mem(struct ath10k *ar, u32 address,
ath10k_warn(ar, "failed to write diag value at 0x%x: %d\n",
address, ret);

- spin_unlock_bh(&ce->ce_lock);
+ mutex_unlock(&ar_pci->ce_diag_mutex);

return ret;
}
@@ -3462,6 +3450,7 @@ int ath10k_pci_setup_resource(struct ath10k *ar)

spin_lock_init(&ce->ce_lock);
spin_lock_init(&ar_pci->ps_lock);
+ mutex_init(&ar_pci->ce_diag_mutex);

timer_setup(&ar_pci->rx_post_retry, ath10k_pci_rx_replenish_retry, 0);

diff --git a/drivers/net/wireless/ath/ath10k/pci.h b/drivers/net/wireless/ath/ath10k/pci.h
index e8d86331c539..a9270fa6463c 100644
--- a/drivers/net/wireless/ath/ath10k/pci.h
+++ b/drivers/net/wireless/ath/ath10k/pci.h
@@ -19,6 +19,7 @@
#define _PCI_H_

#include <linux/interrupt.h>
+#include <linux/mutex.h>

#include "hw.h"
#include "ce.h"
@@ -128,6 +129,8 @@ struct ath10k_pci {

/* Copy Engine used for Diagnostic Accesses */
struct ath10k_ce_pipe *ce_diag;
+ /* For protecting ce_diag */
+ struct mutex ce_diag_mutex;

struct ath10k_ce ce;
struct timer_list rx_post_retry;
--
2.20.1.611.gfbb209baf1-goog



2019-02-11 16:32:32

by Kalle Valo

Subject: Re: [PATCH] ath10k: pci: use mutex for diagnostic window CE polling

Brian Norris <[email protected]> wrote:

> The DIAG copy engine is only used via polling, but it holds a spinlock
> with softirqs disabled. Each iteration of our read/write loops can
> theoretically take 20ms (two 10ms timeout loops), and this loop can be
> run an unbounded number of times while holding the spinlock -- dependent
> on the request size given by the caller.
>
> As of commit 39501ea64116 ("ath10k: download firmware via diag Copy
> Engine for QCA6174 and QCA9377."), we transfer large chunks of firmware
> memory using this mechanism. With large enough firmware segments, this
> becomes an exceedingly long period for disabling soft IRQs. For example,
> with a 500KiB firmware segment, in testing QCA6174A, I see 200 loop
> iterations of about 50-100us each, which can total about 10-20ms.
>
> In reality, we don't really need to block softirqs for this duration.
> The DIAG CE is only used in polling mode, and we only need to hold
> ce_lock to make sure any CE bookkeeping is done without screwing up
> another CE. Otherwise, we only need to ensure exclusion between
> ath10k_pci_diag_{read,write}_mem() contexts.
>
> This patch moves to use fine-grained locking for the shared ce_lock,
> while adding a new mutex just to ensure mutual exclusion of diag
> read/write operations.
>
> Tested on QCA6174A, firmware version WLAN.RM.4.4.1-00132-QCARMSWPZ-1.
>
> Fixes: 39501ea64116 ("ath10k: download firmware via diag Copy Engine for QCA6174 and QCA9377.")
> Signed-off-by: Brian Norris <[email protected]>
> Signed-off-by: Kalle Valo <[email protected]>

Patch applied to ath-next branch of ath.git, thanks.

25733c4e67df ath10k: pci: use mutex for diagnostic window CE polling

--
https://patchwork.kernel.org/patch/10800343/

https://wireless.wiki.kernel.org/en/developers/documentation/submittingpatches


2019-03-25 20:27:12

by Brian Norris

Subject: Re: [PATCH] ath10k: pci: use mutex for diagnostic window CE polling

Hi Kalle,

On Wed, Feb 06, 2019 at 05:41:43PM -0800, Brian Norris wrote:
> The DIAG copy engine is only used via polling, but it holds a spinlock
> with softirqs disabled. Each iteration of our read/write loops can
> theoretically take 20ms (two 10ms timeout loops), and this loop can be
> run an unbounded number of times while holding the spinlock -- dependent
> on the request size given by the caller.
>
> As of commit 39501ea64116 ("ath10k: download firmware via diag Copy
> Engine for QCA6174 and QCA9377."), we transfer large chunks of firmware
> memory using this mechanism. With large enough firmware segments, this
> becomes an exceedingly long period for disabling soft IRQs. For example,
> with a 500KiB firmware segment, in testing QCA6174A, I see 200 loop
> iterations of about 50-100us each, which can total about 10-20ms.
>
> In reality, we don't really need to block softirqs for this duration.
> The DIAG CE is only used in polling mode, and we only need to hold
> ce_lock to make sure any CE bookkeeping is done without screwing up
> another CE. Otherwise, we only need to ensure exclusion between
> ath10k_pci_diag_{read,write}_mem() contexts.
>
> This patch moves to use fine-grained locking for the shared ce_lock,
> while adding a new mutex just to ensure mutual exclusion of diag
> read/write operations.
>
> Tested on QCA6174A, firmware version WLAN.RM.4.4.1-00132-QCARMSWPZ-1.
>
> Fixes: 39501ea64116 ("ath10k: download firmware via diag Copy Engine for QCA6174 and QCA9377.")
> Signed-off-by: Brian Norris <[email protected]>

It would appear that this triggers new warnings

BUG: sleeping function called from invalid context

when handling firmware crashes. The call stack is

ath10k_pci_fw_crashed_dump
-> ath10k_pci_dump_memory
...
-> ath10k_pci_diag_read_mem

and the problem is that we're holding the 'data_lock' spinlock with
softirqs disabled, while later trying to grab this new mutex.

Unfortunately, data_lock is used in a lot of places, and it's unclear if
it can be migrated to a mutex as well. It seems like it probably can be,
but I'd have to audit a little more closely.

Any thoughts on what the short- and long-term solutions should be? I can
send a revert, to get v5.1 fixed. But it still seems like we should
avoid disabling softirqs for so long.

Brian

2019-03-25 21:20:29

by Michal Kazior

Subject: Re: [PATCH] ath10k: pci: use mutex for diagnostic window CE polling

Hi Brian,

On Mon, 25 Mar 2019 at 21:27, Brian Norris <[email protected]> wrote:
> Hi Kalle,
>
> On Wed, Feb 06, 2019 at 05:41:43PM -0800, Brian Norris wrote:
> > The DIAG copy engine is only used via polling, but it holds a spinlock
> > with softirqs disabled. Each iteration of our read/write loops can
> > theoretically take 20ms (two 10ms timeout loops), and this loop can be
> > run an unbounded number of times while holding the spinlock -- dependent
> > on the request size given by the caller.
> >
> > As of commit 39501ea64116 ("ath10k: download firmware via diag Copy
> > Engine for QCA6174 and QCA9377."), we transfer large chunks of firmware
> > memory using this mechanism. With large enough firmware segments, this
> > becomes an exceedingly long period for disabling soft IRQs. For example,
> > with a 500KiB firmware segment, in testing QCA6174A, I see 200 loop
> > iterations of about 50-100us each, which can total about 10-20ms.
> >
> > In reality, we don't really need to block softirqs for this duration.
> > The DIAG CE is only used in polling mode, and we only need to hold
> > ce_lock to make sure any CE bookkeeping is done without screwing up
> > another CE. Otherwise, we only need to ensure exclusion between
> > ath10k_pci_diag_{read,write}_mem() contexts.
> >
> > This patch moves to use fine-grained locking for the shared ce_lock,
> > while adding a new mutex just to ensure mutual exclusion of diag
> > read/write operations.
> >
> > Tested on QCA6174A, firmware version WLAN.RM.4.4.1-00132-QCARMSWPZ-1.
> >
> > Fixes: 39501ea64116 ("ath10k: download firmware via diag Copy Engine for QCA6174 and QCA9377.")
> > Signed-off-by: Brian Norris <[email protected]>
>
> It would appear that this triggers new warnings
>
> BUG: sleeping function called from invalid context
>
> when handling firmware crashes. The call stack is
>
> ath10k_pci_fw_crashed_dump
> -> ath10k_pci_dump_memory
> ...
> -> ath10k_pci_diag_read_mem
>
> and the problem is that we're holding the 'data_lock' spinlock with
> softirqs disabled, while later trying to grab this new mutex.

No, the spinlock is not the real problem. The real problem is you're
trying to hold a mutex on a path which is potentially atomic /
non-sleepable: ath10k_pci_napi_poll().


> Unfortunately, data_lock is used in a lot of places, and it's unclear if
> it can be migrated to a mutex as well. It seems like it probably can be,
> but I'd have to audit a little more closely.

It can't be migrated to a mutex. It's intended to synchronize top half
with bottom half. It has to be an atomic non-sleeping lock mechanism.

What you need to do is make sure ath10k_pci_diag_read_mem() and
ath10k_pci_diag_write_mem() are never called from an atomic context.

For one, you'll need to defer ath10k_pci_fw_crashed_dump to a worker.
Maybe into ar->restart_work which the dump function calls now.

To get rid of data_lock from ath10k_pci_fw_crashed_dump() you'll need
to at least make fw_crash_counter into an atomic_t. This is just from
a quick glance.
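The atomic_t suggestion can be sketched with C11 atomics as a userspace analogue; the kernel equivalent would be `atomic_t` with `atomic_inc_return()`/`atomic_read()`, and the function names here are hypothetical:

```c
#include <stdatomic.h>

/* Sketch: once the crash counter is atomic, it can be bumped without
 * holding data_lock at all. */
static atomic_int fw_crash_counter;

/* Record one firmware crash; returns the new count. */
static int record_fw_crash(void)
{
	return atomic_fetch_add(&fw_crash_counter, 1) + 1;
}

static int fw_crash_count(void)
{
	return atomic_load(&fw_crash_counter);
}
```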


Michał

2019-03-25 22:14:22

by Brian Norris

Subject: Re: [PATCH] ath10k: pci: use mutex for diagnostic window CE polling

Hi Michal,

Thanks for the quick and useful response.

On Mon, Mar 25, 2019 at 2:20 PM Michał Kazior <[email protected]> wrote:
> On Mon, 25 Mar 2019 at 21:27, Brian Norris <[email protected]> wrote:
> > It would appear that this triggers new warnings
> >
> > BUG: sleeping function called from invalid context
> >
> > when handling firmware crashes. The call stack is
> >
> > ath10k_pci_fw_crashed_dump
> > -> ath10k_pci_dump_memory
> > ...
> > -> ath10k_pci_diag_read_mem
> >
> > and the problem is that we're holding the 'data_lock' spinlock with
> > softirqs disabled, while later trying to grab this new mutex.
>
> No, the spinlock is not the real problem. The real problem is you're
> trying to hold a mutex on a path which is potentially atomic /
> non-sleepable: ath10k_pci_napi_poll().

I'll admit here that I've been testing a variety of kernels here
(including upstream), and some of them are prior to this commit:

3c97f5de1f28 ath10k: implement NAPI support

So this was running in a tasklet, not NAPI polling. But then my
understanding was still incorrect: tasklets are also an atomic
(softirq) context. Doh.

I guess I'd say the problem is "both".

>
> > Unfortunately, data_lock is used in a lot of places, and it's unclear if
> > it can be migrated to a mutex as well. It seems like it probably can be,
> > but I'd have to audit a little more closely.
>
> It can't be migrated to a mutex. It's intended to synchronize top half
> with bottom half. It has to be an atomic non-sleeping lock mechanism.

Ack, thanks for the correction.

> What you need to do is make sure ath10k_pci_diag_read_mem() and
> ath10k_pci_diag_write_mem() are never called from an atomic context.

I knew that part already :)

> For one, you'll need to defer ath10k_pci_fw_crashed_dump to a worker.
> Maybe into ar->restart_work which the dump function calls now.

Hmm, that's an idea -- although I'm not sure if I'd steal
'restart_work', or create a different work item on the same queue. But
either way, we'd still also have to avoid holding 'data_lock', and at
that point, I'm not sure if we're losing desirable properties of these
firmware dumps -- it allows more "stuff" to keep going on while we're
preparing to dump the device memory state.

> To get rid of data_lock from ath10k_pci_fw_crashed_dump() you'll need
> to at least make fw_crash_counter into an atomic_t. This is just from
> a quick glance.

Yes, we'd need at least that much. We'd also need some other form of
locking to ensure exclusion between all users of
ar->coredump.fw_crash_data and similar. At the moment, that's
'data_lock', but I suppose we get a similar exclusion if all the
dump/restart work is on the same workqueue.

I'm still not quite sure if this is 5.1-rc material, or I should just
revert for 5.1.

Thanks,
Brian

2019-03-26 20:35:31

by Brian Norris

Subject: Re: [PATCH] ath10k: pci: use mutex for diagnostic window CE polling

On Mon, Mar 25, 2019 at 3:14 PM Brian Norris <[email protected]> wrote:
> On Mon, Mar 25, 2019 at 2:20 PM Michał Kazior <[email protected]> wrote:
> > For one, you'll need to defer ath10k_pci_fw_crashed_dump to a worker.
> > Maybe into ar->restart_work which the dump function calls now.
>
> Hmm, that's an idea -- although I'm not sure if I'd steal
> 'restart_work', or create a different work item on the same queue. But
> either way, we'd still also have to avoid holding 'data_lock', and at
> that point, I'm not sure if we're losing desirable properties of these
> firmware dumps -- it allows more "stuff" to keep going on while we're
> preparing to dump the device memory state.

So IIUC, we don't today have, for instance, a complete guarantee that
Copy Engines are stopped at this point, so these memory dumps are
useful mostly by the implied fact that a crashed firmware is no longer
doing anything active.

So as long as we ensure the driver is retaining exclusion on its own
resources (e.g., coredump buffers) while performing the dump work,
then we should be OK.

> > To get rid of data_lock from ath10k_pci_fw_crashed_dump() you'll need
> > to at least make fw_crash_counter into an atomic_t. This is just from
> > a quick glance.
>
> Yes, we'd need at least that much. We'd also need some other form of
> locking to ensure exclusion between all users of
> ar->coredump.fw_crash_data and similar. At the moment, that's
> 'data_lock', but I suppose we get a similar exclusion if all the
> dump/restart work is on the same workqueue.

I've cooked up a solution with a new dump_{work,mutex} to protect the
coredump buffers and do the dumping work. I still keep the 'data_lock'
only around the fw_crash_counter.
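A dump_work/dump_mutex arrangement like this can be sketched in userspace; in the kernel the crash-path side would be a non-sleeping `schedule_work()`, so the pthread mutex on the notify side here is purely an analogue, and all names are illustrative:

```c
#include <pthread.h>
#include <stdbool.h>

/* Sketch: the crash path (atomic in the kernel) only flags the event;
 * a worker takes dump_mutex and does the slow coredump work, where
 * sleeping is allowed. */
static pthread_mutex_t dump_mutex = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  dump_cond  = PTHREAD_COND_INITIALIZER;
static bool dump_pending;
static int  dumps_done;

/* Crash-path side: record the request and wake the worker. */
static void fw_crashed_notify(void)
{
	pthread_mutex_lock(&dump_mutex);
	dump_pending = true;
	pthread_cond_signal(&dump_cond);
	pthread_mutex_unlock(&dump_mutex);
}

/* Worker side: safe to sleep, so the diag reads belong here. */
static void *dump_worker(void *arg)
{
	(void)arg;
	pthread_mutex_lock(&dump_mutex);
	while (!dump_pending)
		pthread_cond_wait(&dump_cond, &dump_mutex);
	dump_pending = false;
	dumps_done++;	/* stands in for dumping device memory */
	pthread_mutex_unlock(&dump_mutex);
	return NULL;
}
```

Because dump_mutex guards the coredump buffers for the whole dump, exclusion between dump and restart work is preserved even without data_lock.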

> I'm still not quite sure if this is 5.1-rc material, or I should just
> revert for 5.1.

I'll post the above soon with a goal of 5.1. If that's not considered
good enough, I'll post a revert too.

Brian