Memory barriers help ensure instruction ordering, NOT time and order
of actual write arrival at other observers (e.g. memory-mapped IP).
On architectures employing weak memory ordering, the latter can be a
giant pain point, and it has been as part of this driver.
Moreover, the gpu_/gmu_ accessors already use non-relaxed versions of
readl/writel, which include r/w (respectively) barriers.
Replace the barriers with a readback that ensures the previous writes
have exited the write buffer (as the CPU must flush the write to the
register it's trying to read back) and subsequently remove the hack
introduced in commit b77532803d11 ("drm/msm/a6xx: Poll for GBIF unhalt
status in hw_init").
Fixes: b77532803d11 ("drm/msm/a6xx: Poll for GBIF unhalt status in hw_init")
Signed-off-by: Konrad Dybcio <[email protected]>
---
drivers/gpu/drm/msm/adreno/a6xx_gmu.c | 5 ++---
drivers/gpu/drm/msm/adreno/a6xx_gpu.c | 14 ++++----------
2 files changed, 6 insertions(+), 13 deletions(-)
diff --git a/drivers/gpu/drm/msm/adreno/a6xx_gmu.c b/drivers/gpu/drm/msm/adreno/a6xx_gmu.c
index 0e3dfd4c2bc8..4135a53b55a7 100644
--- a/drivers/gpu/drm/msm/adreno/a6xx_gmu.c
+++ b/drivers/gpu/drm/msm/adreno/a6xx_gmu.c
@@ -466,9 +466,8 @@ static int a6xx_rpmh_start(struct a6xx_gmu *gmu)
int ret;
u32 val;
- gmu_write(gmu, REG_A6XX_GMU_RSCC_CONTROL_REQ, 1 << 1);
- /* Wait for the register to finish posting */
- wmb();
+ gmu_write(gmu, REG_A6XX_GMU_RSCC_CONTROL_REQ, BIT(1));
+ gmu_read(gmu, REG_A6XX_GMU_RSCC_CONTROL_REQ);
ret = gmu_poll_timeout(gmu, REG_A6XX_GMU_RSCC_CONTROL_ACK, val,
val & (1 << 1), 100, 10000);
diff --git a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
index 973872ad0474..0acbc38b8e70 100644
--- a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
+++ b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
@@ -1713,22 +1713,16 @@ static int hw_init(struct msm_gpu *gpu)
}
/* Clear GBIF halt in case GX domain was not collapsed */
+ gpu_write(gpu, REG_A6XX_GBIF_HALT, 0);
+ gpu_read(gpu, REG_A6XX_GBIF_HALT);
if (adreno_is_a619_holi(adreno_gpu)) {
- gpu_write(gpu, REG_A6XX_GBIF_HALT, 0);
gpu_write(gpu, REG_A6XX_RBBM_GPR0_CNTL, 0);
- /* Let's make extra sure that the GPU can access the memory.. */
- mb();
+ gpu_read(gpu, REG_A6XX_RBBM_GPR0_CNTL);
} else if (a6xx_has_gbif(adreno_gpu)) {
- gpu_write(gpu, REG_A6XX_GBIF_HALT, 0);
gpu_write(gpu, REG_A6XX_RBBM_GBIF_HALT, 0);
- /* Let's make extra sure that the GPU can access the memory.. */
- mb();
+ gpu_read(gpu, REG_A6XX_RBBM_GBIF_HALT);
}
- /* Some GPUs are stubborn and take their sweet time to unhalt GBIF! */
- if (adreno_is_a7xx(adreno_gpu) && a6xx_has_gbif(adreno_gpu))
- spin_until(!gpu_read(gpu, REG_A6XX_GBIF_HALT_ACK));
-
gpu_write(gpu, REG_A6XX_RBBM_SECVID_TSB_CNTL, 0);
if (adreno_is_a619_holi(adreno_gpu))
---
base-commit: 93a39e4766083050ca0ecd6a3548093a3b9eb60c
change-id: 20240508-topic-adreno-a2d199cd4152
Best regards,
--
Konrad Dybcio <[email protected]>
On Wed, May 08, 2024 at 07:46:31PM +0200, Konrad Dybcio wrote:
> Memory barriers help ensure instruction ordering, NOT time and order
> of actual write arrival at other observers (e.g. memory-mapped IP).
> On architectures employing weak memory ordering, the latter can be a
> giant pain point, and it has been as part of this driver.
>
> Moreover, the gpu_/gmu_ accessors already use non-relaxed versions of
> readl/writel, which include r/w (respectively) barriers.
>
> Replace the barriers with a readback that ensures the previous writes
> have exited the write buffer (as the CPU must flush the write to the
> register it's trying to read back) and subsequently remove the hack
> introduced in commit b77532803d11 ("drm/msm/a6xx: Poll for GBIF unhalt
> status in hw_init").
>
> Fixes: b77532803d11 ("drm/msm/a6xx: Poll for GBIF unhalt status in hw_init")
> Signed-off-by: Konrad Dybcio <[email protected]>
> ---
> drivers/gpu/drm/msm/adreno/a6xx_gmu.c | 5 ++---
> drivers/gpu/drm/msm/adreno/a6xx_gpu.c | 14 ++++----------
> 2 files changed, 6 insertions(+), 13 deletions(-)
I prefer this version compared to the v2. A helper routine is
unnecessary here because:
1. there are very few scenarios where we have to read back the same
register.
2. we may accidently readback a write only register.
>
> diff --git a/drivers/gpu/drm/msm/adreno/a6xx_gmu.c b/drivers/gpu/drm/msm/adreno/a6xx_gmu.c
> index 0e3dfd4c2bc8..4135a53b55a7 100644
> --- a/drivers/gpu/drm/msm/adreno/a6xx_gmu.c
> +++ b/drivers/gpu/drm/msm/adreno/a6xx_gmu.c
> @@ -466,9 +466,8 @@ static int a6xx_rpmh_start(struct a6xx_gmu *gmu)
> int ret;
> u32 val;
>
> - gmu_write(gmu, REG_A6XX_GMU_RSCC_CONTROL_REQ, 1 << 1);
> - /* Wait for the register to finish posting */
> - wmb();
> + gmu_write(gmu, REG_A6XX_GMU_RSCC_CONTROL_REQ, BIT(1));
> + gmu_read(gmu, REG_A6XX_GMU_RSCC_CONTROL_REQ);
This is unnecessary because we are polling on a register on the same port below. But I think we
can replace "wmb()" above with "mb()" to avoid reordering between read
and write IO instructions.
>
> ret = gmu_poll_timeout(gmu, REG_A6XX_GMU_RSCC_CONTROL_ACK, val,
> val & (1 << 1), 100, 10000);
> diff --git a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
> index 973872ad0474..0acbc38b8e70 100644
> --- a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
> +++ b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
> @@ -1713,22 +1713,16 @@ static int hw_init(struct msm_gpu *gpu)
> }
>
> /* Clear GBIF halt in case GX domain was not collapsed */
> + gpu_write(gpu, REG_A6XX_GBIF_HALT, 0);
We need a full barrier here to avoid reordering. Also, lets add a
comment about why we are doing this odd looking sequence.
> + gpu_read(gpu, REG_A6XX_GBIF_HALT);
> if (adreno_is_a619_holi(adreno_gpu)) {
> - gpu_write(gpu, REG_A6XX_GBIF_HALT, 0);
> gpu_write(gpu, REG_A6XX_RBBM_GPR0_CNTL, 0);
> - /* Let's make extra sure that the GPU can access the memory.. */
> - mb();
We need a full barrier here.
> + gpu_read(gpu, REG_A6XX_RBBM_GPR0_CNTL);
> } else if (a6xx_has_gbif(adreno_gpu)) {
> - gpu_write(gpu, REG_A6XX_GBIF_HALT, 0);
> gpu_write(gpu, REG_A6XX_RBBM_GBIF_HALT, 0);
> - /* Let's make extra sure that the GPU can access the memory.. */
> - mb();
We need a full barrier here.
> + gpu_read(gpu, REG_A6XX_RBBM_GBIF_HALT);
> }
>
> - /* Some GPUs are stubborn and take their sweet time to unhalt GBIF! */
> - if (adreno_is_a7xx(adreno_gpu) && a6xx_has_gbif(adreno_gpu))
> - spin_until(!gpu_read(gpu, REG_A6XX_GBIF_HALT_ACK));
> -
Why is this removed?
-Akhil
> gpu_write(gpu, REG_A6XX_RBBM_SECVID_TSB_CNTL, 0);
>
> if (adreno_is_a619_holi(adreno_gpu))
>
> ---
> base-commit: 93a39e4766083050ca0ecd6a3548093a3b9eb60c
> change-id: 20240508-topic-adreno-a2d199cd4152
>
> Best regards,
> --
> Konrad Dybcio <[email protected]>
>
On Wed, May 15, 2024 at 12:08:49AM GMT, Akhil P Oommen wrote:
> On Wed, May 08, 2024 at 07:46:31PM +0200, Konrad Dybcio wrote:
> > Memory barriers help ensure instruction ordering, NOT time and order
> > of actual write arrival at other observers (e.g. memory-mapped IP).
> > On architectures employing weak memory ordering, the latter can be a
> > giant pain point, and it has been as part of this driver.
> >
> > Moreover, the gpu_/gmu_ accessors already use non-relaxed versions of
> > readl/writel, which include r/w (respectively) barriers.
> >
> > Replace the barriers with a readback that ensures the previous writes
> > have exited the write buffer (as the CPU must flush the write to the
> > register it's trying to read back) and subsequently remove the hack
> > introduced in commit b77532803d11 ("drm/msm/a6xx: Poll for GBIF unhalt
> > status in hw_init").
For what its worth, I've been eyeing (but haven't tested) sending some
patches to clean up dsi_phy_write_udelay/ndelay(). There's no ordering
guarantee between a writel() and a delay(), so the expected "write then
delay" sequence might not be happening.. you need to write, read, delay.
memory-barriers.txt:
5. A readX() by a CPU thread from the peripheral will complete before
any subsequent delay() loop can begin execution on the same thread.
This ensures that two MMIO register writes by the CPU to a peripheral
will arrive at least 1us apart if the first write is immediately read
back with readX() and udelay(1) is called prior to the second
writeX():
writel(42, DEVICE_REGISTER_0); // Arrives at the device...
readl(DEVICE_REGISTER_0);
udelay(1);
writel(42, DEVICE_REGISTER_1); // ...at least 1us before this.
> >
> > Fixes: b77532803d11 ("drm/msm/a6xx: Poll for GBIF unhalt status in hw_init")
> > Signed-off-by: Konrad Dybcio <[email protected]>
> > ---
> > drivers/gpu/drm/msm/adreno/a6xx_gmu.c | 5 ++---
> > drivers/gpu/drm/msm/adreno/a6xx_gpu.c | 14 ++++----------
> > 2 files changed, 6 insertions(+), 13 deletions(-)
>
> I prefer this version compared to the v2. A helper routine is
> unnecessary here because:
> 1. there are very few scenarios where we have to read back the same
> register.
> 2. we may accidently readback a write only register.
>
> >
> > diff --git a/drivers/gpu/drm/msm/adreno/a6xx_gmu.c b/drivers/gpu/drm/msm/adreno/a6xx_gmu.c
> > index 0e3dfd4c2bc8..4135a53b55a7 100644
> > --- a/drivers/gpu/drm/msm/adreno/a6xx_gmu.c
> > +++ b/drivers/gpu/drm/msm/adreno/a6xx_gmu.c
> > @@ -466,9 +466,8 @@ static int a6xx_rpmh_start(struct a6xx_gmu *gmu)
> > int ret;
> > u32 val;
> >
> > - gmu_write(gmu, REG_A6XX_GMU_RSCC_CONTROL_REQ, 1 << 1);
> > - /* Wait for the register to finish posting */
> > - wmb();
> > + gmu_write(gmu, REG_A6XX_GMU_RSCC_CONTROL_REQ, BIT(1));
> > + gmu_read(gmu, REG_A6XX_GMU_RSCC_CONTROL_REQ);
>
> This is unnecessary because we are polling on a register on the same port below. But I think we
> can replace "wmb()" above with "mb()" to avoid reordering between read
> and write IO instructions.
If I understand correctly, you don't need any memory barrier.
writel()/readl()'s are ordered to the same endpoint. That goes for all
the reordering/barrier comments mentioned below too.
device-io.rst:
The read and write functions are defined to be ordered. That is the
compiler is not permitted to reorder the I/O sequence. When the ordering
can be compiler optimised, you can use __readb() and friends to
indicate the relaxed ordering. Use this with care.
memory-barriers.txt:
(*) readX(), writeX():
The readX() and writeX() MMIO accessors take a pointer to the
peripheral being accessed as an __iomem * parameter. For pointers
mapped with the default I/O attributes (e.g. those returned by
ioremap()), the ordering guarantees are as follows:
1. All readX() and writeX() accesses to the same peripheral are ordered
with respect to each other. This ensures that MMIO register accesses
by the same CPU thread to a particular device will arrive in program
order.
>
> >
> > ret = gmu_poll_timeout(gmu, REG_A6XX_GMU_RSCC_CONTROL_ACK, val,
> > val & (1 << 1), 100, 10000);
> > diff --git a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
> > index 973872ad0474..0acbc38b8e70 100644
> > --- a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
> > +++ b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
> > @@ -1713,22 +1713,16 @@ static int hw_init(struct msm_gpu *gpu)
> > }
> >
> > /* Clear GBIF halt in case GX domain was not collapsed */
> > + gpu_write(gpu, REG_A6XX_GBIF_HALT, 0);
>
> We need a full barrier here to avoid reordering. Also, lets add a
> comment about why we are doing this odd looking sequence.
>
> > + gpu_read(gpu, REG_A6XX_GBIF_HALT);
> > if (adreno_is_a619_holi(adreno_gpu)) {
> > - gpu_write(gpu, REG_A6XX_GBIF_HALT, 0);
> > gpu_write(gpu, REG_A6XX_RBBM_GPR0_CNTL, 0);
> > - /* Let's make extra sure that the GPU can access the memory.. */
> > - mb();
>
> We need a full barrier here.
>
> > + gpu_read(gpu, REG_A6XX_RBBM_GPR0_CNTL);
> > } else if (a6xx_has_gbif(adreno_gpu)) {
> > - gpu_write(gpu, REG_A6XX_GBIF_HALT, 0);
> > gpu_write(gpu, REG_A6XX_RBBM_GBIF_HALT, 0);
> > - /* Let's make extra sure that the GPU can access the memory.. */
> > - mb();
>
> We need a full barrier here.
>
> > + gpu_read(gpu, REG_A6XX_RBBM_GBIF_HALT);
> > }
> >
> > - /* Some GPUs are stubborn and take their sweet time to unhalt GBIF! */
> > - if (adreno_is_a7xx(adreno_gpu) && a6xx_has_gbif(adreno_gpu))
> > - spin_until(!gpu_read(gpu, REG_A6XX_GBIF_HALT_ACK));
> > -
>
> Why is this removed?
>
> -Akhil
>
> > gpu_write(gpu, REG_A6XX_RBBM_SECVID_TSB_CNTL, 0);
> >
> > if (adreno_is_a619_holi(adreno_gpu))
> >
> > ---
> > base-commit: 93a39e4766083050ca0ecd6a3548093a3b9eb60c
> > change-id: 20240508-topic-adreno-a2d199cd4152
> >
> > Best regards,
> > --
> > Konrad Dybcio <[email protected]>
> >
>
On Thu, May 16, 2024 at 08:15:34AM -0500, Andrew Halaney wrote:
> On Wed, May 15, 2024 at 12:08:49AM GMT, Akhil P Oommen wrote:
> > On Wed, May 08, 2024 at 07:46:31PM +0200, Konrad Dybcio wrote:
> > > Memory barriers help ensure instruction ordering, NOT time and order
> > > of actual write arrival at other observers (e.g. memory-mapped IP).
> > > On architectures employing weak memory ordering, the latter can be a
> > > giant pain point, and it has been as part of this driver.
> > >
> > > Moreover, the gpu_/gmu_ accessors already use non-relaxed versions of
> > > readl/writel, which include r/w (respectively) barriers.
> > >
> > > Replace the barriers with a readback that ensures the previous writes
> > > have exited the write buffer (as the CPU must flush the write to the
> > > register it's trying to read back) and subsequently remove the hack
> > > introduced in commit b77532803d11 ("drm/msm/a6xx: Poll for GBIF unhalt
> > > status in hw_init").
>
> For what its worth, I've been eyeing (but haven't tested) sending some
> patches to clean up dsi_phy_write_udelay/ndelay(). There's no ordering
> guarantee between a writel() and a delay(), so the expected "write then
> delay" sequence might not be happening.. you need to write, read, delay.
>
> memory-barriers.txt:
>
> 5. A readX() by a CPU thread from the peripheral will complete before
> any subsequent delay() loop can begin execution on the same thread.
> This ensures that two MMIO register writes by the CPU to a peripheral
> will arrive at least 1us apart if the first write is immediately read
> back with readX() and udelay(1) is called prior to the second
> writeX():
>
> writel(42, DEVICE_REGISTER_0); // Arrives at the device...
> readl(DEVICE_REGISTER_0);
> udelay(1);
> writel(42, DEVICE_REGISTER_1); // ...at least 1us before this.
Yes, udelay orders only with readl(). I saw a patch from Will Deacon
which fixes this for arm64 few years back:
https://lore.kernel.org/all/[email protected]/T/
But this is needed only when you write io and do cpuside wait , not when
you poll io to check status.
>
> > >
> > > Fixes: b77532803d11 ("drm/msm/a6xx: Poll for GBIF unhalt status in hw_init")
> > > Signed-off-by: Konrad Dybcio <[email protected]>
> > > ---
> > > drivers/gpu/drm/msm/adreno/a6xx_gmu.c | 5 ++---
> > > drivers/gpu/drm/msm/adreno/a6xx_gpu.c | 14 ++++----------
> > > 2 files changed, 6 insertions(+), 13 deletions(-)
> >
> > I prefer this version compared to the v2. A helper routine is
> > unnecessary here because:
> > 1. there are very few scenarios where we have to read back the same
> > register.
> > 2. we may accidently readback a write only register.
> >
> > >
> > > diff --git a/drivers/gpu/drm/msm/adreno/a6xx_gmu.c b/drivers/gpu/drm/msm/adreno/a6xx_gmu.c
> > > index 0e3dfd4c2bc8..4135a53b55a7 100644
> > > --- a/drivers/gpu/drm/msm/adreno/a6xx_gmu.c
> > > +++ b/drivers/gpu/drm/msm/adreno/a6xx_gmu.c
> > > @@ -466,9 +466,8 @@ static int a6xx_rpmh_start(struct a6xx_gmu *gmu)
> > > int ret;
> > > u32 val;
> > >
> > > - gmu_write(gmu, REG_A6XX_GMU_RSCC_CONTROL_REQ, 1 << 1);
> > > - /* Wait for the register to finish posting */
> > > - wmb();
> > > + gmu_write(gmu, REG_A6XX_GMU_RSCC_CONTROL_REQ, BIT(1));
> > > + gmu_read(gmu, REG_A6XX_GMU_RSCC_CONTROL_REQ);
> >
> > This is unnecessary because we are polling on a register on the same port below. But I think we
> > can replace "wmb()" above with "mb()" to avoid reordering between read
> > and write IO instructions.
>
> If I understand correctly, you don't need any memory barrier.
> writel()/readl()'s are ordered to the same endpoint. That goes for all
> the reordering/barrier comments mentioned below too.
>
> device-io.rst:
>
> The read and write functions are defined to be ordered. That is the
> compiler is not permitted to reorder the I/O sequence. When the ordering
> can be compiler optimised, you can use __readb() and friends to
> indicate the relaxed ordering. Use this with care.
>
> memory-barriers.txt:
>
> (*) readX(), writeX():
>
> The readX() and writeX() MMIO accessors take a pointer to the
> peripheral being accessed as an __iomem * parameter. For pointers
> mapped with the default I/O attributes (e.g. those returned by
> ioremap()), the ordering guarantees are as follows:
>
> 1. All readX() and writeX() accesses to the same peripheral are ordered
> with respect to each other. This ensures that MMIO register accesses
> by the same CPU thread to a particular device will arrive in program
> order.
>
In arm64, a writel followed by readl translates to roughly the following
sequence: dmb_wmb(), __raw_writel(), __raw_readl(), dmb_rmb(). I am not
sure what is stopping compiler from reordering __raw_writel() and __raw_readl()
above? I am assuming iomem cookie is ignored during compilation.
Added Will to this thread if he can throw some light on this.
-Akhil
>
> >
> > >
> > > ret = gmu_poll_timeout(gmu, REG_A6XX_GMU_RSCC_CONTROL_ACK, val,
> > > val & (1 << 1), 100, 10000);
> > > diff --git a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
> > > index 973872ad0474..0acbc38b8e70 100644
> > > --- a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
> > > +++ b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
> > > @@ -1713,22 +1713,16 @@ static int hw_init(struct msm_gpu *gpu)
> > > }
> > >
> > > /* Clear GBIF halt in case GX domain was not collapsed */
> > > + gpu_write(gpu, REG_A6XX_GBIF_HALT, 0);
> >
> > We need a full barrier here to avoid reordering. Also, lets add a
> > comment about why we are doing this odd looking sequence.
> >
> > > + gpu_read(gpu, REG_A6XX_GBIF_HALT);
> > > if (adreno_is_a619_holi(adreno_gpu)) {
> > > - gpu_write(gpu, REG_A6XX_GBIF_HALT, 0);
> > > gpu_write(gpu, REG_A6XX_RBBM_GPR0_CNTL, 0);
> > > - /* Let's make extra sure that the GPU can access the memory.. */
> > > - mb();
> >
> > We need a full barrier here.
> >
> > > + gpu_read(gpu, REG_A6XX_RBBM_GPR0_CNTL);
> > > } else if (a6xx_has_gbif(adreno_gpu)) {
> > > - gpu_write(gpu, REG_A6XX_GBIF_HALT, 0);
> > > gpu_write(gpu, REG_A6XX_RBBM_GBIF_HALT, 0);
> > > - /* Let's make extra sure that the GPU can access the memory.. */
> > > - mb();
> >
> > We need a full barrier here.
> >
> > > + gpu_read(gpu, REG_A6XX_RBBM_GBIF_HALT);
> > > }
> > >
> > > - /* Some GPUs are stubborn and take their sweet time to unhalt GBIF! */
> > > - if (adreno_is_a7xx(adreno_gpu) && a6xx_has_gbif(adreno_gpu))
> > > - spin_until(!gpu_read(gpu, REG_A6XX_GBIF_HALT_ACK));
> > > -
> >
> > Why is this removed?
> >
> > -Akhil
> >
> > > gpu_write(gpu, REG_A6XX_RBBM_SECVID_TSB_CNTL, 0);
> > >
> > > if (adreno_is_a619_holi(adreno_gpu))
> > >
> > > ---
> > > base-commit: 93a39e4766083050ca0ecd6a3548093a3b9eb60c
> > > change-id: 20240508-topic-adreno-a2d199cd4152
> > >
> > > Best regards,
> > > --
> > > Konrad Dybcio <[email protected]>
> > >
> >
>
On Thu, May 16, 2024 at 08:20:05PM GMT, Akhil P Oommen wrote:
> On Thu, May 16, 2024 at 08:15:34AM -0500, Andrew Halaney wrote:
> > On Wed, May 15, 2024 at 12:08:49AM GMT, Akhil P Oommen wrote:
> > > On Wed, May 08, 2024 at 07:46:31PM +0200, Konrad Dybcio wrote:
> > > > Memory barriers help ensure instruction ordering, NOT time and order
> > > > of actual write arrival at other observers (e.g. memory-mapped IP).
> > > > On architectures employing weak memory ordering, the latter can be a
> > > > giant pain point, and it has been as part of this driver.
> > > >
> > > > Moreover, the gpu_/gmu_ accessors already use non-relaxed versions of
> > > > readl/writel, which include r/w (respectively) barriers.
> > > >
> > > > Replace the barriers with a readback that ensures the previous writes
> > > > have exited the write buffer (as the CPU must flush the write to the
> > > > register it's trying to read back) and subsequently remove the hack
> > > > introduced in commit b77532803d11 ("drm/msm/a6xx: Poll for GBIF unhalt
> > > > status in hw_init").
> >
> > For what its worth, I've been eyeing (but haven't tested) sending some
> > patches to clean up dsi_phy_write_udelay/ndelay(). There's no ordering
> > guarantee between a writel() and a delay(), so the expected "write then
> > delay" sequence might not be happening.. you need to write, read, delay.
> >
> > memory-barriers.txt:
> >
> > 5. A readX() by a CPU thread from the peripheral will complete before
> > any subsequent delay() loop can begin execution on the same thread.
> > This ensures that two MMIO register writes by the CPU to a peripheral
> > will arrive at least 1us apart if the first write is immediately read
> > back with readX() and udelay(1) is called prior to the second
> > writeX():
> >
> > writel(42, DEVICE_REGISTER_0); // Arrives at the device...
> > readl(DEVICE_REGISTER_0);
> > udelay(1);
> > writel(42, DEVICE_REGISTER_1); // ...at least 1us before this.
>
> Yes, udelay orders only with readl(). I saw a patch from Will Deacon
> which fixes this for arm64 few years back:
> https://lore.kernel.org/all/[email protected]/T/
>
> But this is needed only when you write io and do cpuside wait , not when
> you poll io to check status.
Sure, I'm just highlighting the bug in dsi_phy_write_*delay() functions
here, which match the udelay case you mention.
>
> >
> > > >
> > > > Fixes: b77532803d11 ("drm/msm/a6xx: Poll for GBIF unhalt status in hw_init")
> > > > Signed-off-by: Konrad Dybcio <[email protected]>
> > > > ---
> > > > drivers/gpu/drm/msm/adreno/a6xx_gmu.c | 5 ++---
> > > > drivers/gpu/drm/msm/adreno/a6xx_gpu.c | 14 ++++----------
> > > > 2 files changed, 6 insertions(+), 13 deletions(-)
> > >
> > > I prefer this version compared to the v2. A helper routine is
> > > unnecessary here because:
> > > 1. there are very few scenarios where we have to read back the same
> > > register.
> > > 2. we may accidently readback a write only register.
> > >
> > > >
> > > > diff --git a/drivers/gpu/drm/msm/adreno/a6xx_gmu.c b/drivers/gpu/drm/msm/adreno/a6xx_gmu.c
> > > > index 0e3dfd4c2bc8..4135a53b55a7 100644
> > > > --- a/drivers/gpu/drm/msm/adreno/a6xx_gmu.c
> > > > +++ b/drivers/gpu/drm/msm/adreno/a6xx_gmu.c
> > > > @@ -466,9 +466,8 @@ static int a6xx_rpmh_start(struct a6xx_gmu *gmu)
> > > > int ret;
> > > > u32 val;
> > > >
> > > > - gmu_write(gmu, REG_A6XX_GMU_RSCC_CONTROL_REQ, 1 << 1);
> > > > - /* Wait for the register to finish posting */
> > > > - wmb();
> > > > + gmu_write(gmu, REG_A6XX_GMU_RSCC_CONTROL_REQ, BIT(1));
> > > > + gmu_read(gmu, REG_A6XX_GMU_RSCC_CONTROL_REQ);
> > >
> > > This is unnecessary because we are polling on a register on the same port below. But I think we
> > > can replace "wmb()" above with "mb()" to avoid reordering between read
> > > and write IO instructions.
> >
> > If I understand correctly, you don't need any memory barrier.
> > writel()/readl()'s are ordered to the same endpoint. That goes for all
> > the reordering/barrier comments mentioned below too.
> >
> > device-io.rst:
> >
> > The read and write functions are defined to be ordered. That is the
> > compiler is not permitted to reorder the I/O sequence. When the ordering
> > can be compiler optimised, you can use __readb() and friends to
> > indicate the relaxed ordering. Use this with care.
> >
> > memory-barriers.txt:
> >
> > (*) readX(), writeX():
> >
> > The readX() and writeX() MMIO accessors take a pointer to the
> > peripheral being accessed as an __iomem * parameter. For pointers
> > mapped with the default I/O attributes (e.g. those returned by
> > ioremap()), the ordering guarantees are as follows:
> >
> > 1. All readX() and writeX() accesses to the same peripheral are ordered
> > with respect to each other. This ensures that MMIO register accesses
> > by the same CPU thread to a particular device will arrive in program
> > order.
> >
>
> In arm64, a writel followed by readl translates to roughly the following
> sequence: dmb_wmb(), __raw_writel(), __raw_readl(), dmb_rmb(). I am not
> sure what is stopping compiler from reordering __raw_writel() and __raw_readl()
> above? I am assuming iomem cookie is ignored during compilation.
It seems to me that is due to some usage of volatile there in
__raw_writel() etc, but to be honest after reading about volatile and
some threads from gcc mailing lists, I don't have a confident answer :)
>
> Added Will to this thread if he can throw some light on this.
Hopefully Will can school us.
>
> -Akhil
>
> >
> > >
> > > >
> > > > ret = gmu_poll_timeout(gmu, REG_A6XX_GMU_RSCC_CONTROL_ACK, val,
> > > > val & (1 << 1), 100, 10000);
> > > > diff --git a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
> > > > index 973872ad0474..0acbc38b8e70 100644
> > > > --- a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
> > > > +++ b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
> > > > @@ -1713,22 +1713,16 @@ static int hw_init(struct msm_gpu *gpu)
> > > > }
> > > >
> > > > /* Clear GBIF halt in case GX domain was not collapsed */
> > > > + gpu_write(gpu, REG_A6XX_GBIF_HALT, 0);
> > >
> > > We need a full barrier here to avoid reordering. Also, lets add a
> > > comment about why we are doing this odd looking sequence.
> > >
> > > > + gpu_read(gpu, REG_A6XX_GBIF_HALT);
> > > > if (adreno_is_a619_holi(adreno_gpu)) {
> > > > - gpu_write(gpu, REG_A6XX_GBIF_HALT, 0);
> > > > gpu_write(gpu, REG_A6XX_RBBM_GPR0_CNTL, 0);
> > > > - /* Let's make extra sure that the GPU can access the memory.. */
> > > > - mb();
> > >
> > > We need a full barrier here.
> > >
> > > > + gpu_read(gpu, REG_A6XX_RBBM_GPR0_CNTL);
> > > > } else if (a6xx_has_gbif(adreno_gpu)) {
> > > > - gpu_write(gpu, REG_A6XX_GBIF_HALT, 0);
> > > > gpu_write(gpu, REG_A6XX_RBBM_GBIF_HALT, 0);
> > > > - /* Let's make extra sure that the GPU can access the memory.. */
> > > > - mb();
> > >
> > > We need a full barrier here.
> > >
> > > > + gpu_read(gpu, REG_A6XX_RBBM_GBIF_HALT);
> > > > }
> > > >
> > > > - /* Some GPUs are stubborn and take their sweet time to unhalt GBIF! */
> > > > - if (adreno_is_a7xx(adreno_gpu) && a6xx_has_gbif(adreno_gpu))
> > > > - spin_until(!gpu_read(gpu, REG_A6XX_GBIF_HALT_ACK));
> > > > -
> > >
> > > Why is this removed?
> > >
> > > -Akhil
> > >
> > > > gpu_write(gpu, REG_A6XX_RBBM_SECVID_TSB_CNTL, 0);
> > > >
> > > > if (adreno_is_a619_holi(adreno_gpu))
> > > >
> > > > ---
> > > > base-commit: 93a39e4766083050ca0ecd6a3548093a3b9eb60c
> > > > change-id: 20240508-topic-adreno-a2d199cd4152
> > > >
> > > > Best regards,
> > > > --
> > > > Konrad Dybcio <[email protected]>
> > > >
> > >
> >
>
On Thu, May 16, 2024 at 01:55:26PM -0500, Andrew Halaney wrote:
> On Thu, May 16, 2024 at 08:20:05PM GMT, Akhil P Oommen wrote:
> > On Thu, May 16, 2024 at 08:15:34AM -0500, Andrew Halaney wrote:
> > > If I understand correctly, you don't need any memory barrier.
> > > writel()/readl()'s are ordered to the same endpoint. That goes for all
> > > the reordering/barrier comments mentioned below too.
> > >
> > > device-io.rst:
> > >
> > > The read and write functions are defined to be ordered. That is the
> > > compiler is not permitted to reorder the I/O sequence. When the ordering
> > > can be compiler optimised, you can use __readb() and friends to
> > > indicate the relaxed ordering. Use this with care.
> > >
> > > memory-barriers.txt:
> > >
> > > (*) readX(), writeX():
> > >
> > > The readX() and writeX() MMIO accessors take a pointer to the
> > > peripheral being accessed as an __iomem * parameter. For pointers
> > > mapped with the default I/O attributes (e.g. those returned by
> > > ioremap()), the ordering guarantees are as follows:
> > >
> > > 1. All readX() and writeX() accesses to the same peripheral are ordered
> > > with respect to each other. This ensures that MMIO register accesses
> > > by the same CPU thread to a particular device will arrive in program
> > > order.
> > >
> >
> > In arm64, a writel followed by readl translates to roughly the following
> > sequence: dmb_wmb(), __raw_writel(), __raw_readl(), dmb_rmb(). I am not
> > sure what is stopping compiler from reordering __raw_writel() and __raw_readl()
> > above? I am assuming iomem cookie is ignored during compilation.
>
> It seems to me that is due to some usage of volatile there in
> __raw_writel() etc, but to be honest after reading about volatile and
> some threads from gcc mailing lists, I don't have a confident answer :)
>
> >
> > Added Will to this thread if he can throw some light on this.
>
> Hopefully Will can school us.
The ordering in this case is ensured by the memory attributes used for
ioremap(). When an MMIO region is mapped using Device-nGnRE attributes
(as it the case for ioremap()), the "nR" part means "no reordering", so
readX() and writeX() to that region are ordered wrt each other.
Note that guarantee _doesn't_ apply to other flavours of ioremap(), so
e.g. ioremap_wc() won't give you the ordering.
Hope that helps,
Will
On 5/14/24 20:38, Akhil P Oommen wrote:
> On Wed, May 08, 2024 at 07:46:31PM +0200, Konrad Dybcio wrote:
>> Memory barriers help ensure instruction ordering, NOT time and order
>> of actual write arrival at other observers (e.g. memory-mapped IP).
>> On architectures employing weak memory ordering, the latter can be a
>> giant pain point, and it has been as part of this driver.
>>
>> Moreover, the gpu_/gmu_ accessors already use non-relaxed versions of
>> readl/writel, which include r/w (respectively) barriers.
>>
>> Replace the barriers with a readback that ensures the previous writes
>> have exited the write buffer (as the CPU must flush the write to the
>> register it's trying to read back) and subsequently remove the hack
>> introduced in commit b77532803d11 ("drm/msm/a6xx: Poll for GBIF unhalt
>> status in hw_init").
>>
>> Fixes: b77532803d11 ("drm/msm/a6xx: Poll for GBIF unhalt status in hw_init")
>> Signed-off-by: Konrad Dybcio <[email protected]>
>> ---
>> drivers/gpu/drm/msm/adreno/a6xx_gmu.c | 5 ++---
>> drivers/gpu/drm/msm/adreno/a6xx_gpu.c | 14 ++++----------
>> 2 files changed, 6 insertions(+), 13 deletions(-)
>
> I prefer this version compared to the v2. A helper routine is
> unnecessary here because:
> 1. there are very few scenarios where we have to read back the same
> register.
> 2. we may accidently readback a write only register.
Which would still trigger an address dependency on the CPU, no?
>
>>
>> diff --git a/drivers/gpu/drm/msm/adreno/a6xx_gmu.c b/drivers/gpu/drm/msm/adreno/a6xx_gmu.c
>> index 0e3dfd4c2bc8..4135a53b55a7 100644
>> --- a/drivers/gpu/drm/msm/adreno/a6xx_gmu.c
>> +++ b/drivers/gpu/drm/msm/adreno/a6xx_gmu.c
>> @@ -466,9 +466,8 @@ static int a6xx_rpmh_start(struct a6xx_gmu *gmu)
>> int ret;
>> u32 val;
>>
>> - gmu_write(gmu, REG_A6XX_GMU_RSCC_CONTROL_REQ, 1 << 1);
>> - /* Wait for the register to finish posting */
>> - wmb();
>> + gmu_write(gmu, REG_A6XX_GMU_RSCC_CONTROL_REQ, BIT(1));
>> + gmu_read(gmu, REG_A6XX_GMU_RSCC_CONTROL_REQ);
>
> This is unnecessary because we are polling on a register on the same port below. But I think we
> can replace "wmb()" above with "mb()" to avoid reordering between read
> and write IO instructions.
Ok on the dropping readback part
+ AFAIU from Will's response, we can drop the barrier as well
>
>>
>> ret = gmu_poll_timeout(gmu, REG_A6XX_GMU_RSCC_CONTROL_ACK, val,
>> val & (1 << 1), 100, 10000);
>> diff --git a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
>> index 973872ad0474..0acbc38b8e70 100644
>> --- a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
>> +++ b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
>> @@ -1713,22 +1713,16 @@ static int hw_init(struct msm_gpu *gpu)
>> }
>>
>> /* Clear GBIF halt in case GX domain was not collapsed */
>> + gpu_write(gpu, REG_A6XX_GBIF_HALT, 0);
>
> We need a full barrier here to avoid reordering. Also, lets add a
> comment about why we are doing this odd looking sequence.
>
>> + gpu_read(gpu, REG_A6XX_GBIF_HALT);
>> if (adreno_is_a619_holi(adreno_gpu)) {
>> - gpu_write(gpu, REG_A6XX_GBIF_HALT, 0);
>> gpu_write(gpu, REG_A6XX_RBBM_GPR0_CNTL, 0);
>> - /* Let's make extra sure that the GPU can access the memory.. */
>> - mb();
>
> We need a full barrier here.
>
>> + gpu_read(gpu, REG_A6XX_RBBM_GPR0_CNTL);
>> } else if (a6xx_has_gbif(adreno_gpu)) {
>> - gpu_write(gpu, REG_A6XX_GBIF_HALT, 0);
>> gpu_write(gpu, REG_A6XX_RBBM_GBIF_HALT, 0);
>> - /* Let's make extra sure that the GPU can access the memory.. */
>> - mb();
>
> We need a full barrier here.
Not sure we do between REG_A6XX_GBIF_HALT & REG_A6XX_RBBM_(GBIF_HALT/GPR0_CNTL),
but I suppose keeping the one after REG_A6XX_RBBM_(GBIF_HALT/GPR0_CNTL) makes
sense to avoid the possibility of configuring the GPU before it can access DRAM..
>
>> + gpu_read(gpu, REG_A6XX_RBBM_GBIF_HALT);
>> }
>>
>> - /* Some GPUs are stubborn and take their sweet time to unhalt GBIF! */
>> - if (adreno_is_a7xx(adreno_gpu) && a6xx_has_gbif(adreno_gpu))
>> - spin_until(!gpu_read(gpu, REG_A6XX_GBIF_HALT_ACK));
>> -
>
> Why is this removed?
Because it was a hack in the first place and the enforcement of GBIF
unhalt requests coming through before proceeding further removes the
necessity to check this (unless there's some hw-mandated delay we should
keep in mind, but kgsl doesn't have that and there doesn't seem to be
any from testing on 8[456]50).
Konrad
On 4.06.2024 4:40 PM, Will Deacon wrote:
> On Thu, May 16, 2024 at 01:55:26PM -0500, Andrew Halaney wrote:
>> On Thu, May 16, 2024 at 08:20:05PM GMT, Akhil P Oommen wrote:
>>> On Thu, May 16, 2024 at 08:15:34AM -0500, Andrew Halaney wrote:
>>>> If I understand correctly, you don't need any memory barrier.
>>>> writel()/readl()'s are ordered to the same endpoint. That goes for all
>>>> the reordering/barrier comments mentioned below too.
>>>>
>>>> device-io.rst:
>>>>
>>>> The read and write functions are defined to be ordered. That is the
>>>> compiler is not permitted to reorder the I/O sequence. When the ordering
>>>> can be compiler optimised, you can use __readb() and friends to
>>>> indicate the relaxed ordering. Use this with care.
>>>>
>>>> memory-barriers.txt:
>>>>
>>>> (*) readX(), writeX():
>>>>
>>>> The readX() and writeX() MMIO accessors take a pointer to the
>>>> peripheral being accessed as an __iomem * parameter. For pointers
>>>> mapped with the default I/O attributes (e.g. those returned by
>>>> ioremap()), the ordering guarantees are as follows:
>>>>
>>>> 1. All readX() and writeX() accesses to the same peripheral are ordered
>>>> with respect to each other. This ensures that MMIO register accesses
>>>> by the same CPU thread to a particular device will arrive in program
>>>> order.
>>>>
>>>
>>> In arm64, a writel followed by readl translates to roughly the following
>>> sequence: dmb_wmb(), __raw_writel(), __raw_readl(), dmb_rmb(). I am not
>>> sure what is stopping compiler from reordering __raw_writel() and __raw_readl()
>>> above? I am assuming iomem cookie is ignored during compilation.
>>
>> It seems to me that is due to some usage of volatile there in
>> __raw_writel() etc, but to be honest after reading about volatile and
>> some threads from gcc mailing lists, I don't have a confident answer :)
>>
>>>
>>> Added Will to this thread if he can throw some light on this.
>>
>> Hopefully Will can school us.
>
> The ordering in this case is ensured by the memory attributes used for
> ioremap(). When an MMIO region is mapped using Device-nGnRE attributes
> (as it the case for ioremap()), the "nR" part means "no reordering", so
> readX() and writeX() to that region are ordered wrt each other.
>
> Note that guarantee _doesn't_ apply to other flavours of ioremap(), so
> e.g. ioremap_wc() won't give you the ordering.
>
> Hope that helps,
Just to make sure I'm following, would mapping things as nGnRnE effectively
get rid of write buffering, perhaps being a way of debugging whether that
in particular is causing issues (at the cost of speed)?
Konrad