2018-03-16 16:18:47

by Sinan Kaya

[permalink] [raw]
Subject: [PATCH v3 18/18] infiniband: cxgb4: Eliminate duplicate barriers on weakly-ordered archs

Code includes wmb() followed by writel(). writel() already has a barrier on
some architectures like arm64.

This ends up CPU observing two barriers back to back before executing the
register write.

Since code already has an explicit barrier call, changing writel() to
writel_relaxed().

Signed-off-by: Sinan Kaya <[email protected]>
---
drivers/infiniband/hw/cxgb4/t4.h | 12 ++++++------
1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/drivers/infiniband/hw/cxgb4/t4.h b/drivers/infiniband/hw/cxgb4/t4.h
index 8369c7c..7a48c9e 100644
--- a/drivers/infiniband/hw/cxgb4/t4.h
+++ b/drivers/infiniband/hw/cxgb4/t4.h
@@ -477,15 +477,15 @@ static inline void t4_ring_sq_db(struct t4_wq *wq, u16 inc, union t4_wr *wqe)
(u64 *)wqe);
} else {
pr_debug("DB wq->sq.pidx = %d\n", wq->sq.pidx);
- writel(PIDX_T5_V(inc) | QID_V(wq->sq.bar2_qid),
- wq->sq.bar2_va + SGE_UDB_KDOORBELL);
+ writel_relaxed(PIDX_T5_V(inc) | QID_V(wq->sq.bar2_qid),
+ wq->sq.bar2_va + SGE_UDB_KDOORBELL);
}

/* Flush user doorbell area writes. */
wmb();
return;
}
- writel(QID_V(wq->sq.qid) | PIDX_V(inc), wq->db);
+ writel_relaxed(QID_V(wq->sq.qid) | PIDX_V(inc), wq->db);
}

static inline void t4_ring_rq_db(struct t4_wq *wq, u16 inc,
@@ -502,15 +502,15 @@ static inline void t4_ring_rq_db(struct t4_wq *wq, u16 inc,
(void *)wqe);
} else {
pr_debug("DB wq->rq.pidx = %d\n", wq->rq.pidx);
- writel(PIDX_T5_V(inc) | QID_V(wq->rq.bar2_qid),
- wq->rq.bar2_va + SGE_UDB_KDOORBELL);
+ writel_relaxed(PIDX_T5_V(inc) | QID_V(wq->rq.bar2_qid),
+ wq->rq.bar2_va + SGE_UDB_KDOORBELL);
}

/* Flush user doorbell area writes. */
wmb();
return;
}
- writel(QID_V(wq->rq.qid) | PIDX_V(inc), wq->db);
+ writel_relaxed(QID_V(wq->rq.qid) | PIDX_V(inc), wq->db);
}

static inline int t4_wq_in_error(struct t4_wq *wq)
--
2.7.4



2018-03-16 21:34:27

by Steve Wise

[permalink] [raw]
Subject: RE: [PATCH v3 18/18] infiniband: cxgb4: Eliminate duplicate barriers on weakly-ordered archs

> Code includes wmb() followed by writel(). writel() already has a barrier
on
> some architectures like arm64.
>
> This ends up CPU observing two barriers back to back before executing the
> register write.
>
> Since code already has an explicit barrier call, changing writel() to
> writel_relaxed().
>
> Signed-off-by: Sinan Kaya <[email protected]>

NAK - This isn't correct for PowerPC. For PowerPC, writeX_relaxed() is just
writeX().

I was just looking at this with Chelsio developers, and they said the
writeX() should be replaced with __raw_writeX(), not writeX_relaxed(), to
get rid of the extra barrier for all architectures.

Also, t4.h:pio_copy() needs to use __raw_writeq() to enable the write
combining fastpath for ARM and PowerPC. The code as it stands doesn't
achieve any write combining on PowerPC at least.

And the writel()s at the end of the ring functions (the non bar2 udb path)
needs a mmiowb() afterwards if you're going to use __raw_writeX() there.
However that path is only used for very old hardware (T4), so I wouldn't
worry about them.

Steve.


> ---
> drivers/infiniband/hw/cxgb4/t4.h | 12 ++++++------
> 1 file changed, 6 insertions(+), 6 deletions(-)
>
> diff --git a/drivers/infiniband/hw/cxgb4/t4.h
> b/drivers/infiniband/hw/cxgb4/t4.h
> index 8369c7c..7a48c9e 100644
> --- a/drivers/infiniband/hw/cxgb4/t4.h
> +++ b/drivers/infiniband/hw/cxgb4/t4.h
> @@ -477,15 +477,15 @@ static inline void t4_ring_sq_db(struct t4_wq *wq,
> u16 inc, union t4_wr *wqe)
> (u64 *)wqe);
> } else {
> pr_debug("DB wq->sq.pidx = %d\n", wq->sq.pidx);
> - writel(PIDX_T5_V(inc) | QID_V(wq->sq.bar2_qid),
> - wq->sq.bar2_va + SGE_UDB_KDOORBELL);
> + writel_relaxed(PIDX_T5_V(inc) | QID_V(wq-
> >sq.bar2_qid),
> + wq->sq.bar2_va +
> SGE_UDB_KDOORBELL);
> }
>
> /* Flush user doorbell area writes. */
> wmb();
> return;
> }
> - writel(QID_V(wq->sq.qid) | PIDX_V(inc), wq->db);
> + writel_relaxed(QID_V(wq->sq.qid) | PIDX_V(inc), wq->db);
> }
>
> static inline void t4_ring_rq_db(struct t4_wq *wq, u16 inc,
> @@ -502,15 +502,15 @@ static inline void t4_ring_rq_db(struct t4_wq *wq,
> u16 inc,
> (void *)wqe);
> } else {
> pr_debug("DB wq->rq.pidx = %d\n", wq->rq.pidx);
> - writel(PIDX_T5_V(inc) | QID_V(wq->rq.bar2_qid),
> - wq->rq.bar2_va + SGE_UDB_KDOORBELL);
> + writel_relaxed(PIDX_T5_V(inc) | QID_V(wq-
> >rq.bar2_qid),
> + wq->rq.bar2_va +
> SGE_UDB_KDOORBELL);
> }
>
> /* Flush user doorbell area writes. */
> wmb();
> return;
> }
> - writel(QID_V(wq->rq.qid) | PIDX_V(inc), wq->db);
> + writel_relaxed(QID_V(wq->rq.qid) | PIDX_V(inc), wq->db);
> }
>
> static inline int t4_wq_in_error(struct t4_wq *wq)
> --
> 2.7.4


2018-03-16 21:47:31

by Sinan Kaya

[permalink] [raw]
Subject: Re: [PATCH v3 18/18] infiniband: cxgb4: Eliminate duplicate barriers on weakly-ordered archs

On 3/16/2018 5:05 PM, Steve Wise wrote:
>> Code includes wmb() followed by writel(). writel() already has a barrier
> on
>> some architectures like arm64.
>>
>> This ends up CPU observing two barriers back to back before executing the
>> register write.
>>
>> Since code already has an explicit barrier call, changing writel() to
>> writel_relaxed().
>>
>> Signed-off-by: Sinan Kaya <[email protected]>
>
> NAK - This isn't correct for PowerPC. For PowerPC, writeX_relaxed() is just
> writeX().
>
> I was just looking at this with Chelsio developers, and they said the
> writeX() should be replaced with __raw_writeX(), not writeX_relaxed(), to
> get rid of the extra barrier for all architectures.

OK. I can do that but isn't the problem at PowerPC adaptation?

/*
* We don't do relaxed operations yet, at least not with this semantic
*/
#define readb_relaxed(addr) readb(addr)
#define readw_relaxed(addr) readw(addr)
#define readl_relaxed(addr) readl(addr)
#define readq_relaxed(addr) readq(addr)
#define writeb_relaxed(v, addr) writeb(v, addr)
#define writew_relaxed(v, addr) writew(v, addr)
#define writel_relaxed(v, addr) writel(v, addr)
#define writeq_relaxed(v, addr) writeq(v, addr)

Why don't we fix the PowerPC's relaxed operators? Is that a bigger task?

From API perspective both __raw_writeX() and writeX_relaxed() are correct.
It is just PowerPC doesn't seem the follow the definition yet.

2018-03-16 22:16:04

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH v3 18/18] infiniband: cxgb4: Eliminate duplicate barriers on weakly-ordered archs

On Fri, Mar 16, 2018 at 04:05:10PM -0500, Steve Wise wrote:
> > Code includes wmb() followed by writel(). writel() already has a barrier
> on
> > some architectures like arm64.
> >
> > This ends up CPU observing two barriers back to back before executing the
> > register write.
> >
> > Since code already has an explicit barrier call, changing writel() to
> > writel_relaxed().
> >
> > Signed-off-by: Sinan Kaya <[email protected]>
>
> NAK - This isn't correct for PowerPC. For PowerPC, writeX_relaxed() is just
> writeX().

?? Why is changing writex() to writeX() a NAK then?

> I was just looking at this with Chelsio developers, and they said the
> writeX() should be replaced with __raw_writeX(), not writeX_relaxed(), to
> get rid of the extra barrier for all architectures.

That doesn't seem semanticaly sane.

__raw_writeX() should not appear in driver code, IMHO. Only the arch
code can know what the exact semantics of that accessor are..

If ppc can't use writel_relaxed to optimize then we probably need yet
another io accessor semantic defined :(

Jason

2018-03-16 23:06:07

by Steve Wise

[permalink] [raw]
Subject: RE: [PATCH v3 18/18] infiniband: cxgb4: Eliminate duplicate barriers on weakly-ordered archs

>
> On Fri, Mar 16, 2018 at 04:05:10PM -0500, Steve Wise wrote:
> > > Code includes wmb() followed by writel(). writel() already has a
barrier
> > on
> > > some architectures like arm64.
> > >
> > > This ends up CPU observing two barriers back to back before executing
> the
> > > register write.
> > >
> > > Since code already has an explicit barrier call, changing writel() to
> > > writel_relaxed().
> > >
> > > Signed-off-by: Sinan Kaya <[email protected]>
> >
> > NAK - This isn't correct for PowerPC. For PowerPC, writeX_relaxed() is
just
> > writeX().
>
> ?? Why is changing writex() to writeX() a NAK then?

Because I want it correct for PPC as well.

>
> > I was just looking at this with Chelsio developers, and they said the
> > writeX() should be replaced with __raw_writeX(), not writeX_relaxed(),
to
> > get rid of the extra barrier for all architectures.
>
> That doesn't seem semanticaly sane.
>
> __raw_writeX() should not appear in driver code, IMHO. Only the arch
> code can know what the exact semantics of that accessor are..
>
> If ppc can't use writel_relaxed to optimize then we probably need yet
> another io accessor semantic defined :(


Anybody understand why the PPC implementation of writeX_relaxed() isn't
relaxed?


Steve.


2018-03-16 23:07:15

by Steve Wise

[permalink] [raw]
Subject: RE: [PATCH v3 18/18] infiniband: cxgb4: Eliminate duplicate barriers on weakly-ordered archs

>
> On 3/16/2018 5:05 PM, Steve Wise wrote:
> >> Code includes wmb() followed by writel(). writel() already has a barrier
> > on
> >> some architectures like arm64.
> >>
> >> This ends up CPU observing two barriers back to back before executing
> the
> >> register write.
> >>
> >> Since code already has an explicit barrier call, changing writel() to
> >> writel_relaxed().
> >>
> >> Signed-off-by: Sinan Kaya <[email protected]>
> >
> > NAK - This isn't correct for PowerPC. For PowerPC, writeX_relaxed() is just
> > writeX().
> >
> > I was just looking at this with Chelsio developers, and they said the
> > writeX() should be replaced with __raw_writeX(), not writeX_relaxed(), to
> > get rid of the extra barrier for all architectures.
>
> OK. I can do that but isn't the problem at PowerPC adaptation?
>
> /*
> * We don't do relaxed operations yet, at least not with this semantic
> */
> #define readb_relaxed(addr) readb(addr)
> #define readw_relaxed(addr) readw(addr)
> #define readl_relaxed(addr) readl(addr)
> #define readq_relaxed(addr) readq(addr)
> #define writeb_relaxed(v, addr) writeb(v, addr)
> #define writew_relaxed(v, addr) writew(v, addr)
> #define writel_relaxed(v, addr) writel(v, addr)
> #define writeq_relaxed(v, addr) writeq(v, addr)
>
> Why don't we fix the PowerPC's relaxed operators? Is that a bigger task?

I don't know the answer, but perhaps the proper fix is to correctly implement these for PPC?


>
> >From API perspective both __raw_writeX() and writeX_relaxed() are
> correct.
> It is just PowerPC doesn't seem the follow the definition yet.




2018-03-17 03:42:12

by Sinan Kaya

[permalink] [raw]
Subject: Re: [PATCH v3 18/18] infiniband: cxgb4: Eliminate duplicate barriers on weakly-ordered archs

On 3/16/2018 7:05 PM, Steve Wise wrote:
>>
>> On 3/16/2018 5:05 PM, Steve Wise wrote:
>>>> Code includes wmb() followed by writel(). writel() already has a barrier
>>> on
>>>> some architectures like arm64.
>>>>
>>>> This ends up CPU observing two barriers back to back before executing
>> the
>>>> register write.
>>>>
>>>> Since code already has an explicit barrier call, changing writel() to
>>>> writel_relaxed().
>>>>
>>>> Signed-off-by: Sinan Kaya <[email protected]>
>>>
>>> NAK - This isn't correct for PowerPC. For PowerPC, writeX_relaxed() is just
>>> writeX().
>>>
>>> I was just looking at this with Chelsio developers, and they said the
>>> writeX() should be replaced with __raw_writeX(), not writeX_relaxed(), to
>>> get rid of the extra barrier for all architectures.
>>
>> OK. I can do that but isn't the problem at PowerPC adaptation?
>>
>> /*
>> * We don't do relaxed operations yet, at least not with this semantic
>> */
>> #define readb_relaxed(addr) readb(addr)
>> #define readw_relaxed(addr) readw(addr)
>> #define readl_relaxed(addr) readl(addr)
>> #define readq_relaxed(addr) readq(addr)
>> #define writeb_relaxed(v, addr) writeb(v, addr)
>> #define writew_relaxed(v, addr) writew(v, addr)
>> #define writel_relaxed(v, addr) writel(v, addr)
>> #define writeq_relaxed(v, addr) writeq(v, addr)
>>
>> Why don't we fix the PowerPC's relaxed operators? Is that a bigger task?
>
> I don't know the answer, but perhaps the proper fix is to correctly implement these for PPC?
>

I found this article: https://lwn.net/Articles/698014/#PowerPC%20Implementation

Apparently, this issue was discussed at a conference in 2015.

Based on how I read this article, writel() and writel_relaxed() behave exactly
the same on PowerPC because the barrier is enforced by the time code is leaving
a critical section not during MMIO execution.

I also see that __raw_writel() is being heavily used in drivers directory.

https://elixir.bootlin.com/linux/latest/ident/__raw_writel

I'll change writel_relaxed() with __raw_writel() in the series like you suggested
and also look at your other comments.

This seems to be the current norm.

>
>>
>> >From API perspective both __raw_writeX() and writeX_relaxed() are
>> correct.
>> It is just PowerPC doesn't seem the follow the definition yet.
>
>
>
>


--
Sinan Kaya
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project.

2018-03-17 04:04:20

by Sinan Kaya

[permalink] [raw]
Subject: Re: [PATCH v3 18/18] infiniband: cxgb4: Eliminate duplicate barriers on weakly-ordered archs

On 3/16/2018 11:40 PM, Sinan Kaya wrote:
> I'll change writel_relaxed() with __raw_writel() in the series like you suggested
> and also look at your other comments.

I spoke too soon.

Now that I realized, code needs to follow one of the following patterns for correctness

1)
wmb()
writel()/writel_relaxed()

or

2)
wmb()
__raw_wrltel()
mmiowb()

but definitely not

wmb()
__raw_wrltel()

Since #1 == #2, I'll stick to my current implementation of writel_relaxed()

Changing writel() to writel_relaxed() or __raw_writel() isn't enough. PowerPC needs mmiowb()
for correctness. ARM's mmiowb() implementation is empty.

So, there is no one size fits all solution with the current state of affairs.


--
Sinan Kaya
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project.

2018-03-17 04:09:30

by Timur Tabi

[permalink] [raw]
Subject: Re: [PATCH v3 18/18] infiniband: cxgb4: Eliminate duplicate barriers on weakly-ordered archs

On 3/16/18 6:04 PM, Steve Wise wrote:
> Anybody understand why the PPC implementation of writeX_relaxed() isn't
> relaxed?

You probably should ask that on the [email protected]
mailing list.

I've always wondered why PowerPC has non-standard I/O accessors.

--
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm
Technologies, Inc. Qualcomm Technologies, Inc. is a member of the
Code Aurora Forum, a Linux Foundation Collaborative Project.

2018-03-17 04:27:14

by Sinan Kaya

[permalink] [raw]
Subject: Re: [PATCH v3 18/18] infiniband: cxgb4: Eliminate duplicate barriers on weakly-ordered archs

On 3/17/2018 12:03 AM, Sinan Kaya wrote:
> On 3/16/2018 11:40 PM, Sinan Kaya wrote:
>> I'll change writel_relaxed() with __raw_writel() in the series like you suggested
>> and also look at your other comments.
>
> I spoke too soon.
>
> Now that I realized, code needs to follow one of the following patterns for correctness
>
> 1)
> wmb()
> writel()/writel_relaxed()
>
> or
>
> 2)
> wmb()
> __raw_wrltel()
> mmiowb()
>
> but definitely not
>
> wmb()
> __raw_wrltel()
>
> Since #1 == #2, I'll stick to my current implementation of writel_relaxed()
>
> Changing writel() to writel_relaxed() or __raw_writel() isn't enough. PowerPC needs mmiowb()
> for correctness. ARM's mmiowb() implementation is empty.
>
> So, there is no one size fits all solution with the current state of affairs.
>
>

I think I finally got what you mean.

Code seems to have

wmb()
writel()/writeq()
wmb()

this can be safely replaced with

wmb()
__raw_writel()/__raw_writeq()
wmb()

This will work on all arches. Below is the new version. Let me know if this is OK.

+++ b/drivers/infiniband/hw/cxgb4/t4.h
@@ -457,7 +457,7 @@ static inline void pio_copy(u64 __iomem *dst, u64 *src)
int count = 8;

while (count) {
- writeq(*src, dst);
+ __raw_writeq(*src, dst);
src++;
dst++;
count--;
@@ -477,15 +477,16 @@ static inline void t4_ring_sq_db(struct t4_wq *wq, u16 inc, union t4_wr *wqe)
(u64 *)wqe);
} else {
pr_debug("DB wq->sq.pidx = %d\n", wq->sq.pidx);
- writel(PIDX_T5_V(inc) | QID_V(wq->sq.bar2_qid),
- wq->sq.bar2_va + SGE_UDB_KDOORBELL);
+ __raw_writel(PIDX_T5_V(inc) | QID_V(wq->sq.bar2_qid),
+ wq->sq.bar2_va + SGE_UDB_KDOORBELL);
}

/* Flush user doorbell area writes. */
wmb();
return;
}
- writel(QID_V(wq->sq.qid) | PIDX_V(inc), wq->db);
+ __raw_writel(QID_V(wq->sq.qid) | PIDX_V(inc), wq->db);
+ mmiowmb()
}

static inline void t4_ring_rq_db(struct t4_wq *wq, u16 inc,
@@ -502,15 +503,16 @@ static inline void t4_ring_rq_db(struct t4_wq *wq, u16 inc,
(void *)wqe);
} else {
pr_debug("DB wq->rq.pidx = %d\n", wq->rq.pidx);
- writel(PIDX_T5_V(inc) | QID_V(wq->rq.bar2_qid),
- wq->rq.bar2_va + SGE_UDB_KDOORBELL);
+ __raw_writel(PIDX_T5_V(inc) | QID_V(wq->rq.bar2_qid),
+ wq->rq.bar2_va + SGE_UDB_KDOORBELL);
}

/* Flush user doorbell area writes. */
wmb();
return;
}
- writel(QID_V(wq->rq.qid) | PIDX_V(inc), wq->db);
+ __raw_writel(QID_V(wq->rq.qid) | PIDX_V(inc), wq->db);
+ mmiowmb();
}


--
Sinan Kaya
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project.

2018-03-17 04:32:10

by Timur Tabi

[permalink] [raw]
Subject: Re: [PATCH v3 18/18] infiniband: cxgb4: Eliminate duplicate barriers on weakly-ordered archs

On Fri, Mar 16, 2018 at 11:25 PM, Sinan Kaya <[email protected]> wrote:
> @@ -477,15 +477,16 @@ static inline void t4_ring_sq_db(struct t4_wq *wq, u16 inc, union t4_wr *wqe)
> (u64 *)wqe);
> } else {
> pr_debug("DB wq->sq.pidx = %d\n", wq->sq.pidx);
> - writel(PIDX_T5_V(inc) | QID_V(wq->sq.bar2_qid),
> - wq->sq.bar2_va + SGE_UDB_KDOORBELL);
> + __raw_writel(PIDX_T5_V(inc) | QID_V(wq->sq.bar2_qid),

I think Jason might be right that drivers shouldn't be calling the
__raw_write functions. You definitely need to run this by the PPC
developers, specifically Ben Herrenschmidt.


--
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm
Technologies, Inc. Qualcomm Technologies, Inc. is a member of the
Code Aurora Forum, a Linux Foundation Collaborative Project.

2018-03-17 13:26:35

by Steve Wise

[permalink] [raw]
Subject: RE: [PATCH v3 18/18] infiniband: cxgb4: Eliminate duplicate barriers on weakly-ordered archs

>
> On 3/17/2018 12:03 AM, Sinan Kaya wrote:
> > On 3/16/2018 11:40 PM, Sinan Kaya wrote:
> >> I'll change writel_relaxed() with __raw_writel() in the series like you
> suggested
> >> and also look at your other comments.
> >
> > I spoke too soon.
> >
> > Now that I realized, code needs to follow one of the following patterns
> for correctness
> >
> > 1)
> > wmb()
> > writel()/writel_relaxed()
> >
> > or
> >
> > 2)
> > wmb()
> > __raw_wrltel()
> > mmiowb()
> >
> > but definitely not
> >
> > wmb()
> > __raw_wrltel()
> >
> > Since #1 == #2, I'll stick to my current implementation of writel_relaxed()
> >
> > Changing writel() to writel_relaxed() or __raw_writel() isn't enough.
> PowerPC needs mmiowb()
> > for correctness. ARM's mmiowb() implementation is empty.
> >
> > So, there is no one size fits all solution with the current state of affairs.
> >
> >
>
> I think I finally got what you mean.
>
> Code seems to have
>
> wmb()
> writel()/writeq()
> wmb()
>
> this can be safely replaced with
>
> wmb()
> __raw_writel()/__raw_writeq()
> wmb()
>
> This will work on all arches. Below is the new version. Let me know if this is
> OK.
>
> +++ b/drivers/infiniband/hw/cxgb4/t4.h
> @@ -457,7 +457,7 @@ static inline void pio_copy(u64 __iomem *dst, u64
> *src)
> int count = 8;
>
> while (count) {
> - writeq(*src, dst);
> + __raw_writeq(*src, dst);
> src++;
> dst++;
> count--;
> @@ -477,15 +477,16 @@ static inline void t4_ring_sq_db(struct t4_wq *wq,
> u16 inc, union t4_wr *wqe)
> (u64 *)wqe);
> } else {
> pr_debug("DB wq->sq.pidx = %d\n", wq->sq.pidx);
> - writel(PIDX_T5_V(inc) | QID_V(wq->sq.bar2_qid),
> - wq->sq.bar2_va + SGE_UDB_KDOORBELL);
> + __raw_writel(PIDX_T5_V(inc) | QID_V(wq->sq.bar2_qid),
> + wq->sq.bar2_va + SGE_UDB_KDOORBELL);
> }
>
> /* Flush user doorbell area writes. */
> wmb();
> return;
> }
> - writel(QID_V(wq->sq.qid) | PIDX_V(inc), wq->db);
> + __raw_writel(QID_V(wq->sq.qid) | PIDX_V(inc), wq->db);
> + mmiowmb()
> }
>
> static inline void t4_ring_rq_db(struct t4_wq *wq, u16 inc,
> @@ -502,15 +503,16 @@ static inline void t4_ring_rq_db(struct t4_wq *wq,
> u16 inc,
> (void *)wqe);
> } else {
> pr_debug("DB wq->rq.pidx = %d\n", wq->rq.pidx);
> - writel(PIDX_T5_V(inc) | QID_V(wq->rq.bar2_qid),
> - wq->rq.bar2_va + SGE_UDB_KDOORBELL);
> + __raw_writel(PIDX_T5_V(inc) | QID_V(wq->rq.bar2_qid),
> + wq->rq.bar2_va + SGE_UDB_KDOORBELL);
> }
>
> /* Flush user doorbell area writes. */
> wmb();
> return;
> }
> - writel(QID_V(wq->rq.qid) | PIDX_V(inc), wq->db);
> + __raw_writel(QID_V(wq->rq.qid) | PIDX_V(inc), wq->db);
> + mmiowmb();
> }
>
>

Yes, this is what chelsio recommended to me.

Reviewed-by: Steve Wise <[email protected]>


2018-03-17 13:29:12

by David Miller

[permalink] [raw]
Subject: Re: [PATCH v3 18/18] infiniband: cxgb4: Eliminate duplicate barriers on weakly-ordered archs

From: Sinan Kaya <[email protected]>
Date: Sat, 17 Mar 2018 00:25:14 -0400

> I think I finally got what you mean.
>
> Code seems to have
>
> wmb()
> writel()/writeq()
> wmb()
>
> this can be safely replaced with
>
> wmb()
> __raw_writel()/__raw_writeq()
> wmb()
>
> This will work on all arches. Below is the new version. Let me know if this is OK.

Unfortunately, I think this won't work.

At least on sparc, the __raw_*() variants also change the endianness to
native endianness.

PowerPC does this as well, even documented in a comment :-)

/*
* Non ordered and non-swapping "raw" accessors
*/

Only the non-__raw_ variants do the endianness swap to/from
little-endian.

2018-03-17 15:07:17

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH v3 18/18] infiniband: cxgb4: Eliminate duplicate barriers on weakly-ordered archs

On Sat, Mar 17, 2018 at 12:25:14AM -0400, Sinan Kaya wrote:
> On 3/17/2018 12:03 AM, Sinan Kaya wrote:
> > On 3/16/2018 11:40 PM, Sinan Kaya wrote:
> >> I'll change writel_relaxed() with __raw_writel() in the series like you suggested
> >> and also look at your other comments.
> >
> > I spoke too soon.
> >
> > Now that I realized, code needs to follow one of the following patterns for correctness
> >
> > 1)
> > wmb()
> > writel()/writel_relaxed()
> >
> > or
> >
> > 2)
> > wmb()
> > __raw_wrltel()
> > mmiowb()
> >
> > but definitely not
> >
> > wmb()
> > __raw_wrltel()
> >
> > Since #1 == #2, I'll stick to my current implementation of writel_relaxed()
> >
> > Changing writel() to writel_relaxed() or __raw_writel() isn't enough. PowerPC needs mmiowb()
> > for correctness. ARM's mmiowb() implementation is empty.
> >
> > So, there is no one size fits all solution with the current state of affairs.
> >
> >
>
> I think I finally got what you mean.
>
> Code seems to have
>
> wmb()
> writel()/writeq()
> wmb()
>
> this can be safely replaced with
>
> wmb()
> __raw_writel()/__raw_writeq()
> wmb()
>
> This will work on all arches. Below is the new version. Let me know if this is OK.
>
> +++ b/drivers/infiniband/hw/cxgb4/t4.h
> @@ -457,7 +457,7 @@ static inline void pio_copy(u64 __iomem *dst, u64 *src)
> int count = 8;
>
> while (count) {
> - writeq(*src, dst);
> + __raw_writeq(*src, dst);
> src++;
> dst++;
> count--;
> @@ -477,15 +477,16 @@ static inline void t4_ring_sq_db(struct t4_wq *wq, u16 inc, union t4_wr *wqe)
> (u64 *)wqe);
> } else {
> pr_debug("DB wq->sq.pidx = %d\n", wq->sq.pidx);
> - writel(PIDX_T5_V(inc) | QID_V(wq->sq.bar2_qid),
> - wq->sq.bar2_va + SGE_UDB_KDOORBELL);
> + __raw_writel(PIDX_T5_V(inc) | QID_V(wq->sq.bar2_qid),
> + wq->sq.bar2_va + SGE_UDB_KDOORBELL);
> }
>
> /* Flush user doorbell area writes. */
> wmb();
> return;
> }
> - writel(QID_V(wq->sq.qid) | PIDX_V(inc), wq->db);
> + __raw_writel(QID_V(wq->sq.qid) | PIDX_V(inc), wq->db);
> + mmiowmb()
> }
>
> static inline void t4_ring_rq_db(struct t4_wq *wq, u16 inc,
> @@ -502,15 +503,16 @@ static inline void t4_ring_rq_db(struct t4_wq *wq, u16 inc,
> (void *)wqe);
> } else {
> pr_debug("DB wq->rq.pidx = %d\n", wq->rq.pidx);
> - writel(PIDX_T5_V(inc) | QID_V(wq->rq.bar2_qid),
> - wq->rq.bar2_va + SGE_UDB_KDOORBELL);
> + __raw_writel(PIDX_T5_V(inc) | QID_V(wq->rq.bar2_qid),
> + wq->rq.bar2_va + SGE_UDB_KDOORBELL);
> }
>
> /* Flush user doorbell area writes. */
> wmb();
> return;
> }
> - writel(QID_V(wq->rq.qid) | PIDX_V(inc), wq->db);
> + __raw_writel(QID_V(wq->rq.qid) | PIDX_V(inc), wq->db);
> + mmiowmb();

No! NAK! Adding *new* barriers to use the *wrong* accessor is crazy!

Your first patch was right, replacing
wmb()
writel()

With wmb(); writel_relaxed() is fine, and helps ARM.

The problem with PPC has nothing to do with the writel, it is with the
spinlock, and we can't improve it without adding some new
infrastructure. Certainly don't hack around the problem by using
__raw_writel in multi-arch drivers!

In userspace we settled on something like this as a pattern:

mmio_wc_spin_lock()
writel_wc_mmio()
mmio_wc_spin_unlock()

Where mmio_wc_spin_unlock incorporates the mmiowmb and is defined to
fully contain all MMIO access to WC and UC memory.

Due to our userspace the pattern is specifically used with MMIO writes
to WC BAR memory, but it could be generalized..

This allows all known arches to use the best code in all cases. x86
can even optimize a lot and combine the mmiowmb() into the
spin_unlock, apparently.

Jason

2018-03-17 18:31:37

by Sinan Kaya

[permalink] [raw]
Subject: Re: [PATCH v3 18/18] infiniband: cxgb4: Eliminate duplicate barriers on weakly-ordered archs

[email protected]

On 3/17/2018 11:05 AM, Jason Gunthorpe wrote:
> On Sat, Mar 17, 2018 at 12:25:14AM -0400, Sinan Kaya wrote:
>> On 3/17/2018 12:03 AM, Sinan Kaya wrote:
>>> On 3/16/2018 11:40 PM, Sinan Kaya wrote:
>>>> I'll change writel_relaxed() with __raw_writel() in the series like you suggested
>>>> and also look at your other comments.
>>>
>>> I spoke too soon.
>>>
>>> Now that I realized, code needs to follow one of the following patterns for correctness
>>>
>>> 1)
>>> wmb()
>>> writel()/writel_relaxed()
>>>
>>> or
>>>
>>> 2)
>>> wmb()
>>> __raw_wrltel()
>>> mmiowb()
>>>
>>> but definitely not
>>>
>>> wmb()
>>> __raw_wrltel()
>>>
>>> Since #1 == #2, I'll stick to my current implementation of writel_relaxed()
>>>
>>> Changing writel() to writel_relaxed() or __raw_writel() isn't enough. PowerPC needs mmiowb()
>>> for correctness. ARM's mmiowb() implementation is empty.
>>>
>>> So, there is no one size fits all solution with the current state of affairs.
>>>
>>>
>>
>> I think I finally got what you mean.
>>
>> Code seems to have
>>
>> wmb()
>> writel()/writeq()
>> wmb()
>>
>> this can be safely replaced with
>>
>> wmb()
>> __raw_writel()/__raw_writeq()
>> wmb()
>>
>> This will work on all arches. Below is the new version. Let me know if this is OK.
>>
>> +++ b/drivers/infiniband/hw/cxgb4/t4.h
>> @@ -457,7 +457,7 @@ static inline void pio_copy(u64 __iomem *dst, u64 *src)
>> int count = 8;
>>
>> while (count) {
>> - writeq(*src, dst);
>> + __raw_writeq(*src, dst);
>> src++;
>> dst++;
>> count--;
>> @@ -477,15 +477,16 @@ static inline void t4_ring_sq_db(struct t4_wq *wq, u16 inc, union t4_wr *wqe)
>> (u64 *)wqe);
>> } else {
>> pr_debug("DB wq->sq.pidx = %d\n", wq->sq.pidx);
>> - writel(PIDX_T5_V(inc) | QID_V(wq->sq.bar2_qid),
>> - wq->sq.bar2_va + SGE_UDB_KDOORBELL);
>> + __raw_writel(PIDX_T5_V(inc) | QID_V(wq->sq.bar2_qid),
>> + wq->sq.bar2_va + SGE_UDB_KDOORBELL);
>> }
>>
>> /* Flush user doorbell area writes. */
>> wmb();
>> return;
>> }
>> - writel(QID_V(wq->sq.qid) | PIDX_V(inc), wq->db);
>> + __raw_writel(QID_V(wq->sq.qid) | PIDX_V(inc), wq->db);
>> + mmiowmb()
>> }
>>
>> static inline void t4_ring_rq_db(struct t4_wq *wq, u16 inc,
>> @@ -502,15 +503,16 @@ static inline void t4_ring_rq_db(struct t4_wq *wq, u16 inc,
>> (void *)wqe);
>> } else {
>> pr_debug("DB wq->rq.pidx = %d\n", wq->rq.pidx);
>> - writel(PIDX_T5_V(inc) | QID_V(wq->rq.bar2_qid),
>> - wq->rq.bar2_va + SGE_UDB_KDOORBELL);
>> + __raw_writel(PIDX_T5_V(inc) | QID_V(wq->rq.bar2_qid),
>> + wq->rq.bar2_va + SGE_UDB_KDOORBELL);
>> }
>>
>> /* Flush user doorbell area writes. */
>> wmb();
>> return;
>> }
>> - writel(QID_V(wq->rq.qid) | PIDX_V(inc), wq->db);
>> + __raw_writel(QID_V(wq->rq.qid) | PIDX_V(inc), wq->db);
>> + mmiowmb();
>
> No! NAK! Adding *new* barriers to use the *wrong* accessor is crazy!
>
> Your first patch was right, replacing
> wmb()
> writel()
>
> With wmb(); writel_relaxed() is fine, and helps ARM.
>
> The problem with PPC has nothing to do with the writel, it is with the
> spinlock, and we can't improve it without adding some new
> infrastructure. Certainly don't hack around the problem by using
> __raw_writel in multi-arch drivers!
>
> In userspace we settled on something like this as a pattern:
>
> mmio_wc_spin_lock()
> writel_wc_mmio()
> mmio_wc_spin_unlock()

>
> Where mmio_wc_spin_unlock incorporates the mmiowmb and is defined to
> fully contain all MMIO access to WC and UC memory.
>
> Due to our userspace the pattern is specifically used with MMIO writes
> to WC BAR memory, but it could be generalized..
>
> This allows all known arches to use the best code in all cases. x86
> can even optimize a lot and combine the mmiowmb() into the
> spin_unlock, apparently.

Given both Dave [1] and Jason's feedback, we have to ask PowerPC developers
to fix writel_relaxed().

Otherwise, PowerPC won't be able to take advantage of the optimizations
introduced in this series.

I don't think we need writel_really_relaxed() API.

Somebody also has to take a task and work very hard to get rid of __raw_writeX()
APIs in drivers/net directory. It looked like a very common practice though
it clearly violates multiarch portability concerns Jason and Deve highlighted.

This will be the next version:

iff --git a/drivers/infiniband/hw/cxgb4/t4.h b/drivers/infiniband/hw/cxgb4/t4.h
index 8369c7c..6e5658a 100644
--- a/drivers/infiniband/hw/cxgb4/t4.h
+++ b/drivers/infiniband/hw/cxgb4/t4.h
@@ -457,7 +457,7 @@ static inline void pio_copy(u64 __iomem *dst, u64 *src)
int count = 8;

while (count) {
- writeq(*src, dst);
+ writeq_relaxed(*src, dst);
src++;
dst++;
count--;
@@ -477,15 +477,15 @@ static inline void t4_ring_sq_db(struct t4_wq *wq, u16 inc, union t4_wr *wqe)
(u64 *)wqe);
} else {
pr_debug("DB wq->sq.pidx = %d\n", wq->sq.pidx);
- writel(PIDX_T5_V(inc) | QID_V(wq->sq.bar2_qid),
- wq->sq.bar2_va + SGE_UDB_KDOORBELL);
+ writel_relaxed(PIDX_T5_V(inc) | QID_V(wq->sq.bar2_qid),
+ wq->sq.bar2_va + SGE_UDB_KDOORBELL);
}

/* Flush user doorbell area writes. */
wmb();
return;
}
- writel(QID_V(wq->sq.qid) | PIDX_V(inc), wq->db);
+ writel_relaxed(QID_V(wq->sq.qid) | PIDX_V(inc), wq->db);
}

static inline void t4_ring_rq_db(struct t4_wq *wq, u16 inc,
@@ -502,15 +502,15 @@ static inline void t4_ring_rq_db(struct t4_wq *wq, u16 inc,
(void *)wqe);
} else {
pr_debug("DB wq->rq.pidx = %d\n", wq->rq.pidx);
- writel(PIDX_T5_V(inc) | QID_V(wq->rq.bar2_qid),
- wq->rq.bar2_va + SGE_UDB_KDOORBELL);
+ writel_relaxed(PIDX_T5_V(inc) | QID_V(wq->rq.bar2_qid),
+ wq->rq.bar2_va + SGE_UDB_KDOORBELL);
}

/* Flush user doorbell area writes. */
wmb();
return;
}
- writel(QID_V(wq->rq.qid) | PIDX_V(inc), wq->db);
+ writel_relaxed(QID_V(wq->rq.qid) | PIDX_V(inc), wq->db);
}


[1] https://lkml.org/lkml/2018/3/17/100

>
> Jason
>


--
Sinan Kaya
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project.

2018-03-19 01:49:27

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH v3 18/18] infiniband: cxgb4: Eliminate duplicate barriers on weakly-ordered archs

On Sat, Mar 17, 2018 at 02:30:10PM -0400, Sinan Kaya wrote:

> Somebody also has to take a task and work very hard to get rid of __raw_writeX()
> APIs in drivers/net directory. It looked like a very common practice though
> it clearly violates multiarch portability concerns Jason and Deve highlighted.

When you posted your list I thought most of the hits were in what I'd
think of 'one-arch drivers', eg an IRQ controller or clock driver or
something.. Some might have a reason for it (eg avoiding the swap, for
instance), maybe it is a hold over from before writel_relaxed, or
maybe it is just a cargo-cult behavior..

It is the obviously multi-arch drivers that probably need some
attention..

Jason