LinuxLists.cc - [PATCH net 0/2] net: ipa: fix two replenish bugs

2022-01-11 19:21:57

Subject: [PATCH net 0/2] net: ipa: fix two replenish bugs

This series contains two fixes for bugs in the IPA receive buffer
replenishing code.

-Alex

Alex Elder (2):
net: ipa: fix atomic update in ipa_endpoint_replenish()
net: ipa: prevent concurrent replenish

drivers/net/ipa/ipa_endpoint.c | 20 ++++++++++++++++----
drivers/net/ipa/ipa_endpoint.h | 2 ++
2 files changed, 18 insertions(+), 4 deletions(-)

--
2.32.0

2022-01-11 19:21:59

by Alex Elder

[permalink] [raw]

Subject: [PATCH net 1/2] net: ipa: fix atomic update in ipa_endpoint_replenish()

In ipa_endpoint_replenish(), if an error occurs when attempting to
replenish a receive buffer, we just quit and try again later. In
that case we increment the backlog count to reflect that the attempt
was unsuccessful. Then, if the add_one flag was true we increment
the backlog again.

This second increment is not included in the backlog local variable
though, and its value determines whether delayed work should be
scheduled. This is a bug.

Fix this by determining whether 1 or 2 should be added to the
backlog before adding it in a atomic_add_return() call.

Fixes: 84f9bd12d46db ("soc: qcom: ipa: IPA endpoints")
Signed-off-by: Alex Elder <[email protected]>
---
drivers/net/ipa/ipa_endpoint.c | 7 +++----
1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ipa/ipa_endpoint.c b/drivers/net/ipa/ipa_endpoint.c
index 49d9a077d0375..8b055885cf3cf 100644
--- a/drivers/net/ipa/ipa_endpoint.c
+++ b/drivers/net/ipa/ipa_endpoint.c
@@ -1080,6 +1080,7 @@ static void ipa_endpoint_replenish(struct ipa_endpoint *endpoint, bool add_one)
{
struct gsi *gsi;
u32 backlog;
+ int delta;

if (!endpoint->replenish_enabled) {
if (add_one)
@@ -1097,10 +1098,8 @@ static void ipa_endpoint_replenish(struct ipa_endpoint *endpoint, bool add_one)

try_again_later:
/* The last one didn't succeed, so fix the backlog */
- backlog = atomic_inc_return(&endpoint->replenish_backlog);
-
- if (add_one)
- atomic_inc(&endpoint->replenish_backlog);
+ delta = add_one ? 2 : 1;
+ backlog = atomic_add_return(delta, &endpoint->replenish_backlog);

/* Whenever a receive buffer transaction completes we'll try to
* replenish again. It's unlikely, but if we fail to supply even
--
2.32.0

2022-01-11 19:22:03

by Alex Elder

[permalink] [raw]

Subject: [PATCH net 2/2] net: ipa: prevent concurrent replenish

We have seen cases where an endpoint RX completion interrupt arrives
while replenishing for the endpoint is underway. This causes another
instance of replenishing to begin as part of completing the receive
transaction. If this occurs it can lead to transaction corruption.

Use a new atomic variable to ensure only replenish instance for an
endpoint executes at a time.

Fixes: 84f9bd12d46db ("soc: qcom: ipa: IPA endpoints")
Signed-off-by: Alex Elder <[email protected]>
---
drivers/net/ipa/ipa_endpoint.c | 13 +++++++++++++
drivers/net/ipa/ipa_endpoint.h | 2 ++
2 files changed, 15 insertions(+)

diff --git a/drivers/net/ipa/ipa_endpoint.c b/drivers/net/ipa/ipa_endpoint.c
index 8b055885cf3cf..a1019f5fe1748 100644
--- a/drivers/net/ipa/ipa_endpoint.c
+++ b/drivers/net/ipa/ipa_endpoint.c
@@ -1088,15 +1088,27 @@ static void ipa_endpoint_replenish(struct ipa_endpoint *endpoint, bool add_one)
return;
}

+ /* If already active, just update the backlog */
+ if (atomic_xchg(&endpoint->replenish_active, 1)) {
+ if (add_one)
+ atomic_inc(&endpoint->replenish_backlog);
+ return;
+ }
+
while (atomic_dec_not_zero(&endpoint->replenish_backlog))
if (ipa_endpoint_replenish_one(endpoint))
goto try_again_later;
+
+ atomic_set(&endpoint->replenish_active, 0);
+
if (add_one)
atomic_inc(&endpoint->replenish_backlog);

return;

try_again_later:
+ atomic_set(&endpoint->replenish_active, 0);
+
/* The last one didn't succeed, so fix the backlog */
delta = add_one ? 2 : 1;
backlog = atomic_add_return(delta, &endpoint->replenish_backlog);
@@ -1691,6 +1703,7 @@ static void ipa_endpoint_setup_one(struct ipa_endpoint *endpoint)
* backlog is the same as the maximum outstanding TREs.
*/
endpoint->replenish_enabled = false;
+ atomic_set(&endpoint->replenish_active, 0);
atomic_set(&endpoint->replenish_saved,
gsi_channel_tre_max(gsi, endpoint->channel_id));
atomic_set(&endpoint->replenish_backlog, 0);
diff --git a/drivers/net/ipa/ipa_endpoint.h b/drivers/net/ipa/ipa_endpoint.h
index 0a859d10312dc..200f093214997 100644
--- a/drivers/net/ipa/ipa_endpoint.h
+++ b/drivers/net/ipa/ipa_endpoint.h
@@ -53,6 +53,7 @@ enum ipa_endpoint_name {
* @netdev: Network device pointer, if endpoint uses one
* @replenish_enabled: Whether receive buffer replenishing is enabled
* @replenish_ready: Number of replenish transactions without doorbell
+ * @replenish_active: 1 when replenishing is active, 0 otherwise
* @replenish_saved: Replenish requests held while disabled
* @replenish_backlog: Number of buffers needed to fill hardware queue
* @replenish_work: Work item used for repeated replenish failures
@@ -74,6 +75,7 @@ struct ipa_endpoint {
/* Receive buffer replenishing for RX endpoints */
bool replenish_enabled;
u32 replenish_ready;
+ atomic_t replenish_active;
atomic_t replenish_saved;
atomic_t replenish_backlog;
struct delayed_work replenish_work; /* global wq */
--
2.32.0

2022-01-11 20:05:59

by Matthias Kaehlcke

[permalink] [raw]

Subject: Re: [PATCH net 1/2] net: ipa: fix atomic update in ipa_endpoint_replenish()

On Tue, Jan 11, 2022 at 01:21:49PM -0600, Alex Elder wrote:
> In ipa_endpoint_replenish(), if an error occurs when attempting to
> replenish a receive buffer, we just quit and try again later. In
> that case we increment the backlog count to reflect that the attempt
> was unsuccessful. Then, if the add_one flag was true we increment
> the backlog again.
>
> This second increment is not included in the backlog local variable
> though, and its value determines whether delayed work should be
> scheduled. This is a bug.
>
> Fix this by determining whether 1 or 2 should be added to the
> backlog before adding it in a atomic_add_return() call.
>
> Fixes: 84f9bd12d46db ("soc: qcom: ipa: IPA endpoints")
> Signed-off-by: Alex Elder <[email protected]>

Reviewed-by: Matthias Kaehlcke <[email protected]>

2022-01-11 20:20:35

by Matthias Kaehlcke

[permalink] [raw]

Subject: Re: [PATCH net 2/2] net: ipa: prevent concurrent replenish

On Tue, Jan 11, 2022 at 01:21:50PM -0600, Alex Elder wrote:
> We have seen cases where an endpoint RX completion interrupt arrives
> while replenishing for the endpoint is underway. This causes another
> instance of replenishing to begin as part of completing the receive
> transaction. If this occurs it can lead to transaction corruption.
>
> Use a new atomic variable to ensure only replenish instance for an
> endpoint executes at a time.
>
> Fixes: 84f9bd12d46db ("soc: qcom: ipa: IPA endpoints")
> Signed-off-by: Alex Elder <[email protected]>
> ---
> drivers/net/ipa/ipa_endpoint.c | 13 +++++++++++++
> drivers/net/ipa/ipa_endpoint.h | 2 ++
> 2 files changed, 15 insertions(+)
>
> diff --git a/drivers/net/ipa/ipa_endpoint.c b/drivers/net/ipa/ipa_endpoint.c
> index 8b055885cf3cf..a1019f5fe1748 100644
> --- a/drivers/net/ipa/ipa_endpoint.c
> +++ b/drivers/net/ipa/ipa_endpoint.c
> @@ -1088,15 +1088,27 @@ static void ipa_endpoint_replenish(struct ipa_endpoint *endpoint, bool add_one)
> return;
> }
>
> + /* If already active, just update the backlog */
> + if (atomic_xchg(&endpoint->replenish_active, 1)) {
> + if (add_one)
> + atomic_inc(&endpoint->replenish_backlog);
> + return;
> + }
> +
> while (atomic_dec_not_zero(&endpoint->replenish_backlog))
> if (ipa_endpoint_replenish_one(endpoint))
> goto try_again_later;

I think there is a race here, not sure whether it's a problem: If the first
interrupt is here just when a 2nd interrupt evaluates 'replenish_active' the
latter will return, since it looks like replenishing is still active, when it
actually just finished. Would replenishing be kicked off anyway shortly after
or could the transaction be stalled until another endpoint RX completion
interrupt arrives?

> +
> + atomic_set(&endpoint->replenish_active, 0);
> +
> if (add_one)
> atomic_inc(&endpoint->replenish_backlog);
>
> return;
>
> try_again_later:
> + atomic_set(&endpoint->replenish_active, 0);
> +
> /* The last one didn't succeed, so fix the backlog */
> delta = add_one ? 2 : 1;
> backlog = atomic_add_return(delta, &endpoint->replenish_backlog);
> @@ -1691,6 +1703,7 @@ static void ipa_endpoint_setup_one(struct ipa_endpoint *endpoint)
> * backlog is the same as the maximum outstanding TREs.
> */
> endpoint->replenish_enabled = false;
> + atomic_set(&endpoint->replenish_active, 0);
> atomic_set(&endpoint->replenish_saved,
> gsi_channel_tre_max(gsi, endpoint->channel_id));
> atomic_set(&endpoint->replenish_backlog, 0);
> diff --git a/drivers/net/ipa/ipa_endpoint.h b/drivers/net/ipa/ipa_endpoint.h
> index 0a859d10312dc..200f093214997 100644
> --- a/drivers/net/ipa/ipa_endpoint.h
> +++ b/drivers/net/ipa/ipa_endpoint.h
> @@ -53,6 +53,7 @@ enum ipa_endpoint_name {
> * @netdev: Network device pointer, if endpoint uses one
> * @replenish_enabled: Whether receive buffer replenishing is enabled
> * @replenish_ready: Number of replenish transactions without doorbell
> + * @replenish_active: 1 when replenishing is active, 0 otherwise
> * @replenish_saved: Replenish requests held while disabled
> * @replenish_backlog: Number of buffers needed to fill hardware queue
> * @replenish_work: Work item used for repeated replenish failures
> @@ -74,6 +75,7 @@ struct ipa_endpoint {
> /* Receive buffer replenishing for RX endpoints */
> bool replenish_enabled;
> u32 replenish_ready;
> + atomic_t replenish_active;
> atomic_t replenish_saved;
> atomic_t replenish_backlog;
> struct delayed_work replenish_work; /* global wq */
> --
> 2.32.0
>

2022-01-11 20:58:22

by Alex Elder

[permalink] [raw]

Subject: Re: [PATCH net 2/2] net: ipa: prevent concurrent replenish

On 1/11/22 2:20 PM, Matthias Kaehlcke wrote:
> On Tue, Jan 11, 2022 at 01:21:50PM -0600, Alex Elder wrote:
>> We have seen cases where an endpoint RX completion interrupt arrives
>> while replenishing for the endpoint is underway. This causes another
>> instance of replenishing to begin as part of completing the receive
>> transaction. If this occurs it can lead to transaction corruption.
>>
>> Use a new atomic variable to ensure only replenish instance for an
>> endpoint executes at a time.
>>
>> Fixes: 84f9bd12d46db ("soc: qcom: ipa: IPA endpoints")
>> Signed-off-by: Alex Elder <[email protected]>
>> ---
>> drivers/net/ipa/ipa_endpoint.c | 13 +++++++++++++
>> drivers/net/ipa/ipa_endpoint.h | 2 ++
>> 2 files changed, 15 insertions(+)
>>
>> diff --git a/drivers/net/ipa/ipa_endpoint.c b/drivers/net/ipa/ipa_endpoint.c
>> index 8b055885cf3cf..a1019f5fe1748 100644
>> --- a/drivers/net/ipa/ipa_endpoint.c
>> +++ b/drivers/net/ipa/ipa_endpoint.c
>> @@ -1088,15 +1088,27 @@ static void ipa_endpoint_replenish(struct ipa_endpoint *endpoint, bool add_one)
>> return;
>> }
>>
>> + /* If already active, just update the backlog */
>> + if (atomic_xchg(&endpoint->replenish_active, 1)) {
>> + if (add_one)
>> + atomic_inc(&endpoint->replenish_backlog);
>> + return;
>> + }
>> +
>> while (atomic_dec_not_zero(&endpoint->replenish_backlog))
>> if (ipa_endpoint_replenish_one(endpoint))
>> goto try_again_later;
>
> I think there is a race here, not sure whether it's a problem: If the first
> interrupt is here just when a 2nd interrupt evaluates 'replenish_active' the
> latter will return, since it looks like replenishing is still active, when it
> actually just finished. Would replenishing be kicked off anyway shortly after
> or could the transaction be stalled until another endpoint RX completion
> interrupt arrives?

I acknowledge the race you point out. You're saying another
thread could test the flag after the while loop exits, but
before the flag gets reset to 0. And that means that other
thread would skip the replenishing it would otherwise do.

To be honest, that is a different scenario than the one I
was trying to prevent, because it involves two threads, rather
than one thread entering this function a second time (via an
interrupt). But regardless, I think it's OK.

The replenishing loop is intentionally tolerant of errors.
It will send as many receive buffers to the hardware as there
is room for, but will stop and "try again later" if an
error occurs. Even if the specific case you mention
occurred it wouldn't be a problem because we'd get another
shot at it. I'll explain.

The replenish_backlog is the number of "open slots" the hardware
has to hold a receive buffer. When the backlog reaches 0, the
hardware is "full."

When a receive operation completes (in ipa_endpoint_rx_complete())
it calls ipa_endpoint_replenish(), requesting that we add one to
the backlog (to account for the buffer just consumed). What this
means is that if the hardware has *any* receive buffers, a
replenish will eventually be requested (when one completes).

The logic in ipa_endpoint_replenish() tolerates an error
attempting to send a new receive buffer to the hardware.
If that happens, we just try again later--knowing that
the next RX completion will trigger that retry.

The only case where we aren't guaranteed a subsequent call
to ipa_endpoint_replenish() is when the hardware has *zero*
receive buffers, or the backlog is the maximum. In that
case we explicitly schedule a new replenish via delayed
work.

Now, two more points.
- If the while loop exits without an error replenishing,
the hardware is "full", so there are buffers that will
complete, and trigger a replenish call. So if the race
occurred, it would be harmless.
- If the while loop exits because of an error replenishing
one buffer, it jumps to try_again_later. In that case,
either the hardware has at least one receive buffer (so
we'll get another replenish), or it has none and we will
schedule delayed work to to schedule another replenish.

So I think that even if the race you point out occurs, the
replenish logic will eventually get another try.

I don't claim my logic is flawless; if it's wrong I can
find another solution.

Thanks.

-Alex

PS I will be implementing some improvements to this logic
soon, but this is a simple bug fix for back-port.

>> +
>> + atomic_set(&endpoint->replenish_active, 0);
>> +
>> if (add_one)
>> atomic_inc(&endpoint->replenish_backlog);
>>
>> return;
>>
>> try_again_later:
>> + atomic_set(&endpoint->replenish_active, 0);
>> +
>> /* The last one didn't succeed, so fix the backlog */
>> delta = add_one ? 2 : 1;
>> backlog = atomic_add_return(delta, &endpoint->replenish_backlog);
>> @@ -1691,6 +1703,7 @@ static void ipa_endpoint_setup_one(struct ipa_endpoint *endpoint)
>> * backlog is the same as the maximum outstanding TREs.
>> */
>> endpoint->replenish_enabled = false;
>> + atomic_set(&endpoint->replenish_active, 0);
>> atomic_set(&endpoint->replenish_saved,
>> gsi_channel_tre_max(gsi, endpoint->channel_id));
>> atomic_set(&endpoint->replenish_backlog, 0);
>> diff --git a/drivers/net/ipa/ipa_endpoint.h b/drivers/net/ipa/ipa_endpoint.h
>> index 0a859d10312dc..200f093214997 100644
>> --- a/drivers/net/ipa/ipa_endpoint.h
>> +++ b/drivers/net/ipa/ipa_endpoint.h
>> @@ -53,6 +53,7 @@ enum ipa_endpoint_name {
>> * @netdev: Network device pointer, if endpoint uses one
>> * @replenish_enabled: Whether receive buffer replenishing is enabled
>> * @replenish_ready: Number of replenish transactions without doorbell
>> + * @replenish_active: 1 when replenishing is active, 0 otherwise
>> * @replenish_saved: Replenish requests held while disabled
>> * @replenish_backlog: Number of buffers needed to fill hardware queue
>> * @replenish_work: Work item used for repeated replenish failures
>> @@ -74,6 +75,7 @@ struct ipa_endpoint {
>> /* Receive buffer replenishing for RX endpoints */
>> bool replenish_enabled;
>> u32 replenish_ready;
>> + atomic_t replenish_active;
>> atomic_t replenish_saved;
>> atomic_t replenish_backlog;
>> struct delayed_work replenish_work; /* global wq */
>> --
>> 2.32.0
>>

2022-01-11 21:53:56

by Matthias Kaehlcke

[permalink] [raw]

Subject: Re: [PATCH net 2/2] net: ipa: prevent concurrent replenish

On Tue, Jan 11, 2022 at 02:58:16PM -0600, Alex Elder wrote:
> On 1/11/22 2:20 PM, Matthias Kaehlcke wrote:
> > On Tue, Jan 11, 2022 at 01:21:50PM -0600, Alex Elder wrote:
> > > We have seen cases where an endpoint RX completion interrupt arrives
> > > while replenishing for the endpoint is underway. This causes another
> > > instance of replenishing to begin as part of completing the receive
> > > transaction. If this occurs it can lead to transaction corruption.
> > >
> > > Use a new atomic variable to ensure only replenish instance for an
> > > endpoint executes at a time.
> > >
> > > Fixes: 84f9bd12d46db ("soc: qcom: ipa: IPA endpoints")
> > > Signed-off-by: Alex Elder <[email protected]>
> > > ---
> > > drivers/net/ipa/ipa_endpoint.c | 13 +++++++++++++
> > > drivers/net/ipa/ipa_endpoint.h | 2 ++
> > > 2 files changed, 15 insertions(+)
> > >
> > > diff --git a/drivers/net/ipa/ipa_endpoint.c b/drivers/net/ipa/ipa_endpoint.c
> > > index 8b055885cf3cf..a1019f5fe1748 100644
> > > --- a/drivers/net/ipa/ipa_endpoint.c
> > > +++ b/drivers/net/ipa/ipa_endpoint.c
> > > @@ -1088,15 +1088,27 @@ static void ipa_endpoint_replenish(struct ipa_endpoint *endpoint, bool add_one)
> > > return;
> > > }
> > > + /* If already active, just update the backlog */
> > > + if (atomic_xchg(&endpoint->replenish_active, 1)) {
> > > + if (add_one)
> > > + atomic_inc(&endpoint->replenish_backlog);
> > > + return;
> > > + }
> > > +
> > > while (atomic_dec_not_zero(&endpoint->replenish_backlog))
> > > if (ipa_endpoint_replenish_one(endpoint))
> > > goto try_again_later;
> >
> > I think there is a race here, not sure whether it's a problem: If the first
> > interrupt is here just when a 2nd interrupt evaluates 'replenish_active' the
> > latter will return, since it looks like replenishing is still active, when it
> > actually just finished. Would replenishing be kicked off anyway shortly after
> > or could the transaction be stalled until another endpoint RX completion
> > interrupt arrives?
>
> I acknowledge the race you point out. You're saying another
> thread could test the flag after the while loop exits, but
> before the flag gets reset to 0. And that means that other
> thread would skip the replenishing it would otherwise do.
>
> To be honest, that is a different scenario than the one I
> was trying to prevent, because it involves two threads, rather
> than one thread entering this function a second time (via an
> interrupt). But regardless, I think it's OK.
>
> The replenishing loop is intentionally tolerant of errors.
> It will send as many receive buffers to the hardware as there
> is room for, but will stop and "try again later" if an
> error occurs. Even if the specific case you mention
> occurred it wouldn't be a problem because we'd get another
> shot at it. I'll explain.
>
>
> The replenish_backlog is the number of "open slots" the hardware
> has to hold a receive buffer. When the backlog reaches 0, the
> hardware is "full."
>
> When a receive operation completes (in ipa_endpoint_rx_complete())
> it calls ipa_endpoint_replenish(), requesting that we add one to
> the backlog (to account for the buffer just consumed). What this
> means is that if the hardware has *any* receive buffers, a
> replenish will eventually be requested (when one completes).
>
> The logic in ipa_endpoint_replenish() tolerates an error
> attempting to send a new receive buffer to the hardware.
> If that happens, we just try again later--knowing that
> the next RX completion will trigger that retry.
>
> The only case where we aren't guaranteed a subsequent call
> to ipa_endpoint_replenish() is when the hardware has *zero*
> receive buffers, or the backlog is the maximum. In that
> case we explicitly schedule a new replenish via delayed
> work.
>
> Now, two more points.
> - If the while loop exits without an error replenishing,
> the hardware is "full", so there are buffers that will
> complete, and trigger a replenish call. So if the race
> occurred, it would be harmless.
> - If the while loop exits because of an error replenishing
> one buffer, it jumps to try_again_later. In that case,
> either the hardware has at least one receive buffer (so
> we'll get another replenish), or it has none and we will
> schedule delayed work to to schedule another replenish.
>
> So I think that even if the race you point out occurs, the
> replenish logic will eventually get another try.
>
> I don't claim my logic is flawless; if it's wrong I can
> find another solution.

Thanks for the detailed description!

I agree the race isn't an actual issue as long as it is
guaranteed that the function is called again 'soon' after
the race occurred. It seems that should be always the case.

2022-01-11 21:54:46

by Matthias Kaehlcke

[permalink] [raw]

Subject: Re: [PATCH net 2/2] net: ipa: prevent concurrent replenish

2022-01-12 04:04:40

by Jakub Kicinski

[permalink] [raw]

Subject: Re: [PATCH net 2/2] net: ipa: prevent concurrent replenish

On Tue, 11 Jan 2022 13:21:50 -0600 Alex Elder wrote:
> Use a new atomic variable to ensure only replenish instance for an
> endpoint executes at a time.

Why atomic_t? test_and_set_bit() + clear_bit() should do nicely here?

2022-01-12 13:16:31

by Alex Elder

[permalink] [raw]

Subject: Re: [PATCH net 2/2] net: ipa: prevent concurrent replenish

On 1/11/22 10:04 PM, Jakub Kicinski wrote:
> On Tue, 11 Jan 2022 13:21:50 -0600 Alex Elder wrote:
>> Use a new atomic variable to ensure only replenish instance for an
>> endpoint executes at a time.
>
> Why atomic_t? test_and_set_bit() + clear_bit() should do nicely here?

I think it foreshadows the replenish logic improvements
I'm experimenting with. The bit operations are probably
best to represent Booleans, so I'll send version 2 that
adds and uses a bitmask instead.

Thanks.

-Alex