2018-03-22 17:18:20

by Sinan Kaya

[permalink] [raw]
Subject: [PATCH v5 3/5] bnx2x: Eliminate duplicate barriers on weakly-ordered archs

Code includes wmb() followed by writel(). writel() already has a
barrier on some architectures like arm64.

This ends up CPU observing two barriers back to back before executing
the register write.

Since code already has an explicit barrier call, changing writel() to
writel_relaxed().

Signed-off-by: Sinan Kaya <[email protected]>
---
drivers/net/ethernet/broadcom/bnx2x/bnx2x.h | 12 ++++++++----
drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c | 2 +-
drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.h | 4 ++--
drivers/net/ethernet/broadcom/bnx2x/bnx2x_ethtool.c | 2 +-
drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c | 4 ++--
drivers/net/ethernet/broadcom/bnx2x/bnx2x_sriov.c | 2 +-
drivers/net/ethernet/broadcom/bnx2x/bnx2x_vfpf.c | 2 +-
7 files changed, 16 insertions(+), 12 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x.h b/drivers/net/ethernet/broadcom/bnx2x/bnx2x.h
index 352beff..d847e1b 100644
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x.h
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x.h
@@ -166,6 +166,12 @@ do { \
#define REG_RD8(bp, offset) readb(REG_ADDR(bp, offset))
#define REG_RD16(bp, offset) readw(REG_ADDR(bp, offset))

+#define REG_WR_RELAXED(bp, offset, val) \
+ writel_relaxed((u32)val, REG_ADDR(bp, offset))
+
+#define REG_WR16_RELAXED(bp, offset, val) \
+ writew_relaxed((u16)val, REG_ADDR(bp, offset))
+
#define REG_WR(bp, offset, val) writel((u32)val, REG_ADDR(bp, offset))
#define REG_WR8(bp, offset, val) writeb((u8)val, REG_ADDR(bp, offset))
#define REG_WR16(bp, offset, val) writew((u16)val, REG_ADDR(bp, offset))
@@ -758,10 +764,8 @@ struct bnx2x_fastpath {
#if (BNX2X_DB_SHIFT < BNX2X_DB_MIN_SHIFT)
#error "Min DB doorbell stride is 8"
#endif
-#define DOORBELL(bp, cid, val) \
- do { \
- writel((u32)(val), bp->doorbells + (bp->db_size * (cid))); \
- } while (0)
+#define DOORBELL_RELAXED(bp, cid, val) \
+ writel_relaxed((u32)(val), (bp)->doorbells + ((bp)->db_size * (cid)))

/* TX CSUM helpers */
#define SKB_CS_OFF(skb) (offsetof(struct tcphdr, check) - \
diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c
index d7c98e8..172fd79 100644
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c
@@ -4155,7 +4155,7 @@ netdev_tx_t bnx2x_start_xmit(struct sk_buff *skb, struct net_device *dev)
txdata->tx_db.data.prod += nbd;
barrier();

- DOORBELL(bp, txdata->cid, txdata->tx_db.raw);
+ DOORBELL_RELAXED(bp, txdata->cid, txdata->tx_db.raw);

mmiowb();

diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.h b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.h
index a5265e1..a8ce5c5 100644
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.h
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.h
@@ -522,8 +522,8 @@ static inline void bnx2x_update_rx_prod(struct bnx2x *bp,
wmb();

for (i = 0; i < sizeof(rx_prods)/4; i++)
- REG_WR(bp, fp->ustorm_rx_prods_offset + i*4,
- ((u32 *)&rx_prods)[i]);
+ REG_WR_RELAXED(bp, fp->ustorm_rx_prods_offset + i * 4,
+ ((u32 *)&rx_prods)[i]);

mmiowb(); /* keep prod updates ordered */

diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_ethtool.c b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_ethtool.c
index 1e33abd..bda5e4f 100644
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_ethtool.c
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_ethtool.c
@@ -2592,7 +2592,7 @@ static int bnx2x_run_loopback(struct bnx2x *bp, int loopback_mode)

txdata->tx_db.data.prod += 2;
barrier();
- DOORBELL(bp, txdata->cid, txdata->tx_db.raw);
+ DOORBELL_RELAXED(bp, txdata->cid, txdata->tx_db.raw);

mmiowb();
barrier();
diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
index 74fc9af..146c40d 100644
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
@@ -3817,8 +3817,8 @@ static void bnx2x_sp_prod_update(struct bnx2x *bp)
*/
mb();

- REG_WR16(bp, BAR_XSTRORM_INTMEM + XSTORM_SPQ_PROD_OFFSET(func),
- bp->spq_prod_idx);
+ REG_WR16_RELAXED(bp, BAR_XSTRORM_INTMEM + XSTORM_SPQ_PROD_OFFSET(func),
+ bp->spq_prod_idx);
mmiowb();
}

diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_sriov.c b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_sriov.c
index ffa7959..40e55d8 100644
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_sriov.c
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_sriov.c
@@ -105,7 +105,7 @@ static void bnx2x_vf_igu_ack_sb(struct bnx2x *bp, struct bnx2x_virtf *vf,

DP(NETIF_MSG_HW, "write 0x%08x to IGU(via GRC) addr 0x%x\n",
ctl, igu_addr_ctl);
- REG_WR(bp, igu_addr_ctl, ctl);
+ REG_WR_RELAXED(bp, igu_addr_ctl, ctl);
mmiowb();
barrier();
}
diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_vfpf.c b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_vfpf.c
index 76a4668..3b2f1bd 100644
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_vfpf.c
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_vfpf.c
@@ -170,7 +170,7 @@ static int bnx2x_send_msg2pf(struct bnx2x *bp, u8 *done, dma_addr_t msg_mapping)
wmb();

/* Trigger the PF FW */
- writeb(1, &zone_data->trigger.vf_pf_channel.addr_valid);
+ writeb_relaxed(1, &zone_data->trigger.vf_pf_channel.addr_valid);

/* Wait for PF to complete */
while ((tout >= 0) && (!*done)) {
--
2.7.4



2018-03-23 16:22:08

by David Miller

[permalink] [raw]
Subject: Re: [PATCH v5 3/5] bnx2x: Eliminate duplicate barriers on weakly-ordered archs

From: Sinan Kaya <[email protected]>
Date: Thu, 22 Mar 2018 13:10:00 -0400

> Code includes wmb() followed by writel(). writel() already has a
> barrier on some architectures like arm64.
...
> @@ -4155,7 +4155,7 @@ netdev_tx_t bnx2x_start_xmit(struct sk_buff *skb, struct net_device *dev)
> txdata->tx_db.data.prod += nbd;
> barrier();
>
> - DOORBELL(bp, txdata->cid, txdata->tx_db.raw);
> + DOORBELL_RELAXED(bp, txdata->cid, txdata->tx_db.raw);
>
> mmiowb();
...
> @@ -2592,7 +2592,7 @@ static int bnx2x_run_loopback(struct bnx2x *bp, int loopback_mode)
>
> txdata->tx_db.data.prod += 2;
> barrier();
> - DOORBELL(bp, txdata->cid, txdata->tx_db.raw);
> + DOORBELL_RELAXED(bp, txdata->cid, txdata->tx_db.raw);

These are compiler barriers being used here, not wmb().

Look, if I can't see a clear:

wmb()
writel()

sequence in the patch hunks, I am going to keep pushing back on
these changes.

Thank you.

2018-03-23 16:32:34

by Sinan Kaya

[permalink] [raw]
Subject: Re: [PATCH v5 3/5] bnx2x: Eliminate duplicate barriers on weakly-ordered archs

On 3/23/2018 12:20 PM, David Miller wrote:
> From: Sinan Kaya <[email protected]>
> Date: Thu, 22 Mar 2018 13:10:00 -0400
>
>> Code includes wmb() followed by writel(). writel() already has a
>> barrier on some architectures like arm64.
> ...
>> @@ -4155,7 +4155,7 @@ netdev_tx_t bnx2x_start_xmit(struct sk_buff *skb, struct net_device *dev)
>> txdata->tx_db.data.prod += nbd;
>> barrier();
>>
>> - DOORBELL(bp, txdata->cid, txdata->tx_db.raw);
>> + DOORBELL_RELAXED(bp, txdata->cid, txdata->tx_db.raw);
>>
>> mmiowb();
> ...
>> @@ -2592,7 +2592,7 @@ static int bnx2x_run_loopback(struct bnx2x *bp, int loopback_mode)
>>
>> txdata->tx_db.data.prod += 2;
>> barrier();
>> - DOORBELL(bp, txdata->cid, txdata->tx_db.raw);
>> + DOORBELL_RELAXED(bp, txdata->cid, txdata->tx_db.raw);
>
> These are compiler barriers being used here, not wmb().
>
> Look, if I can't see a clear:
>
> wmb()
> writel()
>
> sequence in the patch hunks, I am going to keep pushing back on
> these changes.

Sorry, you got me confused now.

If you look at the code closer, you'll see this.

wmb();

txdata->tx_db.data.prod += nbd;
barrier();

DOORBELL(bp, txdata->cid, txdata->tx_db.raw);

and you also asked me to rename DOORBELL to DOORBELL_RELAXED() to make
it obvious that we have a relaxed operator inside the macro.

Did I miss something?

of course, treating barrier() universally as a write barrier is wrong.

>
> Thank you.
>


--
Sinan Kaya
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project.

2018-03-23 16:46:46

by David Miller

[permalink] [raw]
Subject: Re: [PATCH v5 3/5] bnx2x: Eliminate duplicate barriers on weakly-ordered archs

From: Sinan Kaya <[email protected]>
Date: Fri, 23 Mar 2018 12:31:12 -0400

> Sorry, you got me confused now.
>
> If you look at the code closer, you'll see this.
>
> wmb();
>
> txdata->tx_db.data.prod += nbd;
> barrier();
>
> DOORBELL(bp, txdata->cid, txdata->tx_db.raw);
>
> and you also asked me to rename DOORBELL to DOORBELL_RELAXED() to make
> it obvious that we have a relaxed operator inside the macro.

This still doesn't match the stated pattern.

wmb();
/* no other memory or I/O or IOMEM operation */
writel();

There is a write to a producer index there and then no non-compiler
barrier or any kind before the writel().

So, in fact, it might really need that implicit writel() barrier here!

2018-03-23 16:53:31

by Sinan Kaya

[permalink] [raw]
Subject: Re: [PATCH v5 3/5] bnx2x: Eliminate duplicate barriers on weakly-ordered archs

On 3/23/2018 12:43 PM, David Miller wrote:
> From: Sinan Kaya <[email protected]>
> Date: Fri, 23 Mar 2018 12:31:12 -0400
>
>> Sorry, you got me confused now.
>>
>> If you look at the code closer, you'll see this.
>>
>> wmb();
>>
>> txdata->tx_db.data.prod += nbd;
>> barrier();
>>
>> DOORBELL(bp, txdata->cid, txdata->tx_db.raw);
>>
>> and you also asked me to rename DOORBELL to DOORBELL_RELAXED() to make
>> it obvious that we have a relaxed operator inside the macro.
>
> This still doesn't match the stated pattern.

I can certainly update the commit text for this or spin into its own
patch to make it obvious.

>
> wmb();
> /* no other memory or I/O or IOMEM operation */
> writel();
>
> There is a write to a producer index there and then no non-compiler
> barrier or any kind before the writel().
>
> So, in fact, it might really need that implicit writel() barrier here!
>

It could if txdata->tx_db was not a union. There is a data dependency
between txdata->tx_db.data.prod and txdata->tx_db.raw.

So, no reordering.

I can argue that barrier() here is useless in fact.

Anyhow, I'll spin this piece out of this patch so that we pay special
attention with a better description.

--
Sinan Kaya
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project.

2018-03-23 17:06:20

by David Miller

[permalink] [raw]
Subject: Re: [PATCH v5 3/5] bnx2x: Eliminate duplicate barriers on weakly-ordered archs

From: Sinan Kaya <[email protected]>
Date: Fri, 23 Mar 2018 12:51:47 -0400

> It could if txdata->tx_db was not a union. There is a data dependency
> between txdata->tx_db.data.prod and txdata->tx_db.raw.
>
> So, no reordering.

I don't see it that way, the code requires that:

txdata->tx_db.data.prod += nbd;

is visible before the doorbell update.

barrier() doesn't provide that.

Neither does writel_relaxed(). However plain writel() does.

Therefore the code is only correct as-is, and your change potentially
adds a reordering problem.

2018-03-23 17:15:01

by Sinan Kaya

[permalink] [raw]
Subject: Re: [PATCH v5 3/5] bnx2x: Eliminate duplicate barriers on weakly-ordered archs

On 3/23/2018 1:04 PM, David Miller wrote:
> From: Sinan Kaya <[email protected]>
> Date: Fri, 23 Mar 2018 12:51:47 -0400
>
>> It could if txdata->tx_db was not a union. There is a data dependency
>> between txdata->tx_db.data.prod and txdata->tx_db.raw.
>>
>> So, no reordering.
>
> I don't see it that way, the code requires that:
>
> txdata->tx_db.data.prod += nbd;
>
> is visible before the doorbell update.>
> barrier() doesn't provide that.
>
> Neither does writel_relaxed(). However plain writel() does.

Correct for some architectures including ARM but not correct universally.

writel() just guarantees register read/writes before and after to be
ordered when HW observes it.

writel() doesn't guarantee that the memory update is visible to the HW
on all architectures.

If you need memory update visibility, that barrier() should have been a
wmb()

A correct multi-arch pattern is

wmb()
writel_relaxed()
mmiowb()

We have decided to drop a similar patch on Infiniband due to incorrect
usage of barrier(). If you feel strongly about it, I can remove the
DOORBELL change.

>
> Therefore the code is only correct as-is, and your change potentially
> adds a reordering problem.
>


--
Sinan Kaya
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project.

2018-03-23 17:18:12

by David Miller

[permalink] [raw]
Subject: Re: [PATCH v5 3/5] bnx2x: Eliminate duplicate barriers on weakly-ordered archs

From: Sinan Kaya <[email protected]>
Date: Fri, 23 Mar 2018 13:13:46 -0400

> We have decided to drop a similar patch on Infiniband due to incorrect
> usage of barrier(). If you feel strongly about it, I can remove the
> DOORBELL change.

Either remove the DOORBELL change or properly adjust the barrier() to
be a wmb().

Be sure to mention this in the commit message, explicitly.

Thank you.

2018-03-24 14:31:38

by Chopra, Manish

[permalink] [raw]
Subject: RE: [PATCH v5 3/5] bnx2x: Eliminate duplicate barriers on weakly-ordered archs

> -----Original Message-----
> From: Sinan Kaya [mailto:[email protected]]
> Sent: Friday, March 23, 2018 10:44 PM
> To: David Miller <[email protected]>
> Cc: [email protected]; [email protected]; [email protected];
> [email protected]; [email protected]; Elior,
> Ariel <[email protected]>; Dept-Eng Everest Linux L2 <Dept-
> [email protected]>; [email protected]
> Subject: Re: [PATCH v5 3/5] bnx2x: Eliminate duplicate barriers on weakly-
> ordered archs
>
> On 3/23/2018 1:04 PM, David Miller wrote:
> > From: Sinan Kaya <[email protected]>
> > Date: Fri, 23 Mar 2018 12:51:47 -0400
> >
> >> It could if txdata->tx_db was not a union. There is a data dependency
> >> between txdata->tx_db.data.prod and txdata->tx_db.raw.
> >>
> >> So, no reordering.
> >
> > I don't see it that way, the code requires that:
> >
> > txdata->tx_db.data.prod += nbd;
> >
> > is visible before the doorbell update.>
> > barrier() doesn't provide that.
> >
> > Neither does writel_relaxed(). However plain writel() does.
>
> Correct for some architectures including ARM but not correct universally.
>
> writel() just guarantees register read/writes before and after to be ordered
> when HW observes it.
>
> writel() doesn't guarantee that the memory update is visible to the HW on all
> architectures.
>
> If you need memory update visibility, that barrier() should have been a
> wmb()
>
> A correct multi-arch pattern is
>
> wmb()
> writel_relaxed()
> mmiowb()
>

Sinan, Since you have mentioned the use of mmiowb() here after writel_relaxed().
I believe this is not always correct for all types of IO mapped memory [Specially if IO memory is mapped using write combined (for ex. Ioremap_wc())].
We have a current issue on our NIC (qede) driver on x86 for which the patch is already been sent more than a week ago [Still awaiting to hear from David on that].
where mmiowb() seems to be useless since we use write combined mapped doorbell and mmiowb() just seems to be a compiler barrier() there.
So in order to flush write combined buffer we really need writel_relaxed() followed by a wmb() to synchronize writes among CPU cores.
I think the correct pattern in such cases (for write combined IO) would have been like below -

wmb();
writel_relaxed();
wmb(); -> To flush the writes actually.

Thanks.





2018-03-24 14:58:56

by Sinan Kaya

[permalink] [raw]
Subject: Re: [PATCH v5 3/5] bnx2x: Eliminate duplicate barriers on weakly-ordered archs

On 3/24/2018 10:30 AM, Chopra, Manish wrote:
>> -----Original Message-----
>> From: Sinan Kaya [mailto:[email protected]]
>> Sent: Friday, March 23, 2018 10:44 PM
>> To: David Miller <[email protected]>
>> Cc: [email protected]; [email protected]; [email protected];
>> [email protected]; [email protected]; Elior,
>> Ariel <[email protected]>; Dept-Eng Everest Linux L2 <Dept-
>> [email protected]>; [email protected]
>> Subject: Re: [PATCH v5 3/5] bnx2x: Eliminate duplicate barriers on weakly-
>> ordered archs
>>
>> On 3/23/2018 1:04 PM, David Miller wrote:
>>> From: Sinan Kaya <[email protected]>
>>> Date: Fri, 23 Mar 2018 12:51:47 -0400
>>>
>>>> It could if txdata->tx_db was not a union. There is a data dependency
>>>> between txdata->tx_db.data.prod and txdata->tx_db.raw.
>>>>
>>>> So, no reordering.
>>>
>>> I don't see it that way, the code requires that:
>>>
>>> txdata->tx_db.data.prod += nbd;
>>>
>>> is visible before the doorbell update.>
>>> barrier() doesn't provide that.
>>>
>>> Neither does writel_relaxed(). However plain writel() does.
>>
>> Correct for some architectures including ARM but not correct universally.
>>
>> writel() just guarantees register read/writes before and after to be ordered
>> when HW observes it.
>>
>> writel() doesn't guarantee that the memory update is visible to the HW on all
>> architectures.
>>
>> If you need memory update visibility, that barrier() should have been a
>> wmb()
>>
>> A correct multi-arch pattern is
>>
>> wmb()
>> writel_relaxed()
>> mmiowb()
>>
>
> Sinan, Since you have mentioned the use of mmiowb() here after writel_relaxed().
> I believe this is not always correct for all types of IO mapped memory [Specially if IO memory is mapped using write combined (for ex. Ioremap_wc())].
> We have a current issue on our NIC (qede) driver on x86 for which the patch is already been sent more than a week ago [Still awaiting to hear from David on that].
> where mmiowb() seems to be useless since we use write combined mapped doorbell and mmiowb() just seems to be a compiler barrier() there.
> So in order to flush write combined buffer we really need writel_relaxed() followed by a wmb() to synchronize writes among CPU cores.
> I think the correct pattern in such cases (for write combined IO) would have been like below -
>
> wmb();
> writel_relaxed();
> wmb(); -> To flush the writes actually.

You actually have good points. It is the same problem with barrier() description above.

The answer really depends on what you are doing/expecting after mmiowb(). If you expect
that some memory content to be observed by HW, you definitely need a wmb() like
you mentioned.

If you just want writes to be flushed but you don't expect any memory content to be
updated, you need a mmiowb().

https://lwn.net/Articles/198988/
"There is mmiowb(), but its real purpose is to enforce ordering between MMIO operations only."

>
> Thanks.
>
>
>
>
>


--
Sinan Kaya
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project.