2023-10-09 14:33:10

by Sui Jingfeng

[permalink] [raw]
Subject: Re: [PATCH v2] loongarch/mm: disable WUC for pgprot_writecombine as same as ioremap_wc

Hi,


On 2023/10/9 12:28, Icenowy Zheng wrote:
> Currently the code disables WUC only disables it for ioremap_wc(), which
> is only used when mapping writecombine pages like ioremap() (mapped to
> the kernel space). For VRAM mapped in TTM/GEM, it's mapped with a
> crafted pgprot with pgprot_writecombine() function, which isn't
> corrently disabled now.
>
> Disable WUC for pgprot_writecombine() (fallback to SUC) too.
>
> This improves AMDGPU driver stability on Loongson 3A5000 machines.
>
> Signed-off-by: Icenowy Zheng <[email protected]>
> ---
> Changes since v1:
> - Removed _WC macros
> - Mention ioremap_wc in commit message
>
> arch/loongarch/include/asm/io.h | 5 ++---
> arch/loongarch/include/asm/pgtable-bits.h | 4 +++-
> arch/loongarch/kernel/setup.c | 10 +++++-----
> 3 files changed, 10 insertions(+), 9 deletions(-)
>
> diff --git a/arch/loongarch/include/asm/io.h b/arch/loongarch/include/asm/io.h
> index 0dcb36b32cb25..290aad87a8847 100644
> --- a/arch/loongarch/include/asm/io.h
> +++ b/arch/loongarch/include/asm/io.h
> @@ -52,10 +52,9 @@ static inline void __iomem *ioremap_prot(phys_addr_t offset, unsigned long size,
> * @offset: bus address of the memory
> * @size: size of the resource to map
> */
> -extern pgprot_t pgprot_wc;
> -
> #define ioremap_wc(offset, size) \
> - ioremap_prot((offset), (size), pgprot_val(pgprot_wc))
> + ioremap_prot((offset), (size), pgprot_val( \
> + wc_enabled ? PAGE_KERNEL_WUC : PAGE_KERNEL_SUC))
>
> #define ioremap_cache(offset, size) \
> ioremap_prot((offset), (size), pgprot_val(PAGE_KERNEL))
> diff --git a/arch/loongarch/include/asm/pgtable-bits.h b/arch/loongarch/include/asm/pgtable-bits.h
> index 35348d4c4209a..a3d827701736d 100644
> --- a/arch/loongarch/include/asm/pgtable-bits.h
> +++ b/arch/loongarch/include/asm/pgtable-bits.h
> @@ -92,6 +92,8 @@
>
> #ifndef __ASSEMBLY__
>
> +extern bool wc_enabled;
> +
> #define _PAGE_IOREMAP pgprot_val(PAGE_KERNEL_SUC)
>
> #define pgprot_noncached pgprot_noncached
> @@ -111,7 +113,7 @@ static inline pgprot_t pgprot_writecombine(pgprot_t _prot)
> {
> unsigned long prot = pgprot_val(_prot);
>
> - prot = (prot & ~_CACHE_MASK) | _CACHE_WUC;
> + prot = (prot & ~_CACHE_MASK) | (wc_enabled ? _CACHE_WUC : _CACHE_SUC);
>
> return __pgprot(prot);
> }
> diff --git a/arch/loongarch/kernel/setup.c b/arch/loongarch/kernel/setup.c
> index 7783f0a3d742c..465c1dbb6f4b4 100644
> --- a/arch/loongarch/kernel/setup.c
> +++ b/arch/loongarch/kernel/setup.c
> @@ -161,19 +161,19 @@ static void __init smbios_parse(void)
> }
>
> #ifdef CONFIG_ARCH_WRITECOMBINE
> -pgprot_t pgprot_wc = PAGE_KERNEL_WUC;
> +bool wc_enabled = true;
> #else
> -pgprot_t pgprot_wc = PAGE_KERNEL_SUC;
> +bool wc_enabled;
> #endif
>
> -EXPORT_SYMBOL(pgprot_wc);
> +EXPORT_SYMBOL(wc_enabled);
>
> static int __init setup_writecombine(char *p)
> {
> if (!strcmp(p, "on"))
> - pgprot_wc = PAGE_KERNEL_WUC;
> + wc_enabled = true;
> else if (!strcmp(p, "off"))
> - pgprot_wc = PAGE_KERNEL_SUC;
> + wc_enabled = false;
> else
> pr_warn("Unknown writecombine setting \"%s\".\n", p);
>


Good catch!

But this will make the write combine(WC) mappings completely unusable on LoongArch.
This is nearly equivalent to say that LoongArch don't support write combine at all.
But the write combine(WC) mappings works fine for software based drm drivers,
such as drm/loongson and drm/ast etc. Even include drm/radeon and drm/amdgpu with
pure software rendering setup (by putting Option "Accel" "off" into 10-amdgpu.conf
or 10-radeon.conf) After merge this patch, the performance drop dramatically for
2D software rendering based display controller drivers.

Well, this patch itself is a good catch, as it is a fix for the commit <16c52e503043>
("LoongArch: Make WriteCombine configurable for ioremap()"). But I'm afraid that
both of this commit and the <16c52e503043> commit are not a *real* fix write combine
related issue on LoongArch. It just negative sidestep of the real problem.


2023-10-10 00:16:40

by WANG Xuerui

[permalink] [raw]
Subject: Re: [PATCH v2] loongarch/mm: disable WUC for pgprot_writecombine as same as ioremap_wc


On 10/9/23 22:32, Sui Jingfeng wrote:
> Hi,
>
>
> On 2023/10/9 12:28, Icenowy Zheng wrote:
>> Currently the code disables WUC only disables it for ioremap_wc(), which
>> is only used when mapping writecombine pages like ioremap() (mapped to
>> the kernel space). For VRAM mapped in TTM/GEM, it's mapped with a
>> crafted pgprot with pgprot_writecombine() function, which isn't
>> corrently disabled now.
>>
>> Disable WUC for pgprot_writecombine() (fallback to SUC) too.
>>
>> This improves AMDGPU driver stability on Loongson 3A5000 machines.
>>
>> Signed-off-by: Icenowy Zheng <[email protected]>
>> ---
>> Changes since v1:
>> - Removed _WC macros
>> - Mention ioremap_wc in commit message
>>
>>   arch/loongarch/include/asm/io.h           |  5 ++---
>>   arch/loongarch/include/asm/pgtable-bits.h |  4 +++-
>>   arch/loongarch/kernel/setup.c             | 10 +++++-----
>>   3 files changed, 10 insertions(+), 9 deletions(-)
>>
>> diff --git a/arch/loongarch/include/asm/io.h
>> b/arch/loongarch/include/asm/io.h
>> index 0dcb36b32cb25..290aad87a8847 100644
>> --- a/arch/loongarch/include/asm/io.h
>> +++ b/arch/loongarch/include/asm/io.h
>> @@ -52,10 +52,9 @@ static inline void __iomem
>> *ioremap_prot(phys_addr_t offset, unsigned long size,
>>    * @offset:    bus address of the memory
>>    * @size:      size of the resource to map
>>    */
>> -extern pgprot_t pgprot_wc;
>> -
>>   #define ioremap_wc(offset, size)    \
>> -    ioremap_prot((offset), (size), pgprot_val(pgprot_wc))
>> +    ioremap_prot((offset), (size), pgprot_val( \
>> +        wc_enabled ? PAGE_KERNEL_WUC : PAGE_KERNEL_SUC))
>>     #define ioremap_cache(offset, size)    \
>>       ioremap_prot((offset), (size), pgprot_val(PAGE_KERNEL))
>> diff --git a/arch/loongarch/include/asm/pgtable-bits.h
>> b/arch/loongarch/include/asm/pgtable-bits.h
>> index 35348d4c4209a..a3d827701736d 100644
>> --- a/arch/loongarch/include/asm/pgtable-bits.h
>> +++ b/arch/loongarch/include/asm/pgtable-bits.h
>> @@ -92,6 +92,8 @@
>>     #ifndef __ASSEMBLY__
>>   +extern bool wc_enabled;
>> +
>>   #define _PAGE_IOREMAP        pgprot_val(PAGE_KERNEL_SUC)
>>     #define pgprot_noncached pgprot_noncached
>> @@ -111,7 +113,7 @@ static inline pgprot_t
>> pgprot_writecombine(pgprot_t _prot)
>>   {
>>       unsigned long prot = pgprot_val(_prot);
>>   -    prot = (prot & ~_CACHE_MASK) | _CACHE_WUC;
>> +    prot = (prot & ~_CACHE_MASK) | (wc_enabled ? _CACHE_WUC :
>> _CACHE_SUC);
>>         return __pgprot(prot);
>>   }
>> diff --git a/arch/loongarch/kernel/setup.c
>> b/arch/loongarch/kernel/setup.c
>> index 7783f0a3d742c..465c1dbb6f4b4 100644
>> --- a/arch/loongarch/kernel/setup.c
>> +++ b/arch/loongarch/kernel/setup.c
>> @@ -161,19 +161,19 @@ static void __init smbios_parse(void)
>>   }
>>     #ifdef CONFIG_ARCH_WRITECOMBINE
>> -pgprot_t pgprot_wc = PAGE_KERNEL_WUC;
>> +bool wc_enabled = true;
>>   #else
>> -pgprot_t pgprot_wc = PAGE_KERNEL_SUC;
>> +bool wc_enabled;
>>   #endif
>>   -EXPORT_SYMBOL(pgprot_wc);
>> +EXPORT_SYMBOL(wc_enabled);
>>     static int __init setup_writecombine(char *p)
>>   {
>>       if (!strcmp(p, "on"))
>> -        pgprot_wc = PAGE_KERNEL_WUC;
>> +        wc_enabled = true;
>>       else if (!strcmp(p, "off"))
>> -        pgprot_wc = PAGE_KERNEL_SUC;
>> +        wc_enabled = false;
>>       else
>>           pr_warn("Unknown writecombine setting \"%s\".\n", p);
>
>
> Good catch!
>
> But this will make the write combine(WC) mappings completely unusable
> on LoongArch.
> This is nearly equivalent to say that LoongArch don't support write
> combine at all.
> But the write combine(WC) mappings works fine for software based drm
> drivers,
> such as drm/loongson and drm/ast etc. Even include drm/radeon and
> drm/amdgpu with
> pure software rendering setup (by putting Option "Accel" "off" into
> 10-amdgpu.conf
> or 10-radeon.conf) After merge this patch, the performance drop
> dramatically for
> 2D software rendering based display controller drivers.
>
> Well, this patch itself is a good catch, as it is a fix for the commit
> <16c52e503043>
> ("LoongArch: Make WriteCombine configurable for ioremap()"). But I'm
> afraid that
> both of this commit and the <16c52e503043> commit are not a *real* fix
> write combine
> related issue on LoongArch. It just negative sidestep of the real
> problem.
Sure, but given the public information I have access to, I don't think
it's possible to "really" fix the root cause with software only. Do you
have any insight on this, given from your perspective and language, such
a solution seems to exist?

--
WANG "xen0n" Xuerui

Linux/LoongArch mailing list: https://lore.kernel.org/loongarch/

2023-10-10 00:52:14

by Icenowy Zheng

[permalink] [raw]
Subject: Re: [PATCH v2] loongarch/mm: disable WUC for pgprot_writecombine as same as ioremap_wc

在 2023-10-09星期一的 22:32 +0800,Sui Jingfeng写道:
> Hi,
>
>
> On 2023/10/9 12:28, Icenowy Zheng wrote:
> > Currently the code disables WUC only disables it for ioremap_wc(),
> > which
> > is only used when mapping writecombine pages like ioremap() (mapped
> > to
> > the kernel space). For VRAM mapped in TTM/GEM, it's mapped with a
> > crafted pgprot with pgprot_writecombine() function, which isn't
> > corrently disabled now.
> >
> > Disable WUC for pgprot_writecombine() (fallback to SUC) too.
> >
> > This improves AMDGPU driver stability on Loongson 3A5000 machines.
> >
> > Signed-off-by: Icenowy Zheng <[email protected]>
> > ---
> > Changes since v1:
> > - Removed _WC macros
> > - Mention ioremap_wc in commit message
> >
> >   arch/loongarch/include/asm/io.h           |  5 ++---
> >   arch/loongarch/include/asm/pgtable-bits.h |  4 +++-
> >   arch/loongarch/kernel/setup.c             | 10 +++++-----
> >   3 files changed, 10 insertions(+), 9 deletions(-)
> >
> > diff --git a/arch/loongarch/include/asm/io.h
> > b/arch/loongarch/include/asm/io.h
> > index 0dcb36b32cb25..290aad87a8847 100644
> > --- a/arch/loongarch/include/asm/io.h
> > +++ b/arch/loongarch/include/asm/io.h
> > @@ -52,10 +52,9 @@ static inline void __iomem
> > *ioremap_prot(phys_addr_t offset, unsigned long size,
> >    * @offset:    bus address of the memory
> >    * @size:      size of the resource to map
> >    */
> > -extern pgprot_t pgprot_wc;
> > -
> >   #define ioremap_wc(offset, size)      \
> > -       ioremap_prot((offset), (size), pgprot_val(pgprot_wc))
> > +       ioremap_prot((offset), (size), pgprot_val( \
> > +               wc_enabled ? PAGE_KERNEL_WUC : PAGE_KERNEL_SUC))
> >  
> >   #define ioremap_cache(offset, size)   \
> >         ioremap_prot((offset), (size), pgprot_val(PAGE_KERNEL))
> > diff --git a/arch/loongarch/include/asm/pgtable-bits.h
> > b/arch/loongarch/include/asm/pgtable-bits.h
> > index 35348d4c4209a..a3d827701736d 100644
> > --- a/arch/loongarch/include/asm/pgtable-bits.h
> > +++ b/arch/loongarch/include/asm/pgtable-bits.h
> > @@ -92,6 +92,8 @@
> >  
> >   #ifndef __ASSEMBLY__
> >  
> > +extern bool wc_enabled;
> > +
> >   #define _PAGE_IOREMAP         pgprot_val(PAGE_KERNEL_SUC)
> >  
> >   #define pgprot_noncached pgprot_noncached
> > @@ -111,7 +113,7 @@ static inline pgprot_t
> > pgprot_writecombine(pgprot_t _prot)
> >   {
> >         unsigned long prot = pgprot_val(_prot);
> >  
> > -       prot = (prot & ~_CACHE_MASK) | _CACHE_WUC;
> > +       prot = (prot & ~_CACHE_MASK) | (wc_enabled ? _CACHE_WUC :
> > _CACHE_SUC);
> >  
> >         return __pgprot(prot);
> >   }
> > diff --git a/arch/loongarch/kernel/setup.c
> > b/arch/loongarch/kernel/setup.c
> > index 7783f0a3d742c..465c1dbb6f4b4 100644
> > --- a/arch/loongarch/kernel/setup.c
> > +++ b/arch/loongarch/kernel/setup.c
> > @@ -161,19 +161,19 @@ static void __init smbios_parse(void)
> >   }
> >  
> >   #ifdef CONFIG_ARCH_WRITECOMBINE
> > -pgprot_t pgprot_wc = PAGE_KERNEL_WUC;
> > +bool wc_enabled = true;
> >   #else
> > -pgprot_t pgprot_wc = PAGE_KERNEL_SUC;
> > +bool wc_enabled;
> >   #endif
> >  
> > -EXPORT_SYMBOL(pgprot_wc);
> > +EXPORT_SYMBOL(wc_enabled);
> >  
> >   static int __init setup_writecombine(char *p)
> >   {
> >         if (!strcmp(p, "on"))
> > -               pgprot_wc = PAGE_KERNEL_WUC;
> > +               wc_enabled = true;
> >         else if (!strcmp(p, "off"))
> > -               pgprot_wc = PAGE_KERNEL_SUC;
> > +               wc_enabled = false;
> >         else
> >                 pr_warn("Unknown writecombine setting \"%s\".\n",
> > p);
> >  
>
>
> Good catch!
>
> But this will make the write combine(WC) mappings completely unusable
> on LoongArch.

Not completely unusable -- we always have the writecombine=on kernel
parameter that overrides it (or CONFIG_ARCH_WRITECOMBINE build-time
option).

People knowing what they're doing can utilize them to gain back the
performace of WUC, however for the default setup I prefer functionality
correctness to (unstable) performance.

> This is nearly equivalent to say that LoongArch don't support write
> combine at all.
> But the write combine(WC) mappings works fine for software based drm
> drivers,
> such as drm/loongson and drm/ast etc. Even include drm/radeon and
> drm/amdgpu with
> pure software rendering setup (by putting Option "Accel" "off" into
> 10-amdgpu.conf
> or 10-radeon.conf) After merge this patch, the performance drop
> dramatically for
> 2D software rendering based display controller drivers.
>
> Well, this patch itself is a good catch, as it is a fix for the
> commit <16c52e503043>
> ("LoongArch: Make WriteCombine configurable for ioremap()"). But I'm
> afraid that
> both of this commit and the <16c52e503043> commit are not a *real*
> fix write combine

Yes, this is a workaround rather than a fix, however I'm a software
developer and this is the best I can do.

If you can, please report this to your hardware developers and try to
get it fixed for 3A7000/7A3000 (or maybe grabbing some chicken bit that
will fix this for existing chips). I will be quite appreciated.

Thanks,
Icenowy

> related issue on LoongArch. It just negative sidestep of the real
> problem.
>

2023-10-10 03:03:26

by Sui Jingfeng

[permalink] [raw]
Subject: Re: [PATCH v2] loongarch/mm: disable WUC for pgprot_writecombine as same as ioremap_wc


On 2023/10/10 08:15, WANG Xuerui wrote:
>
> On 10/9/23 22:32, Sui Jingfeng wrote:
>> Hi,
>>
>>
>> On 2023/10/9 12:28, Icenowy Zheng wrote:
>>> Currently the code disables WUC only disables it for ioremap_wc(),
>>> which
>>> is only used when mapping writecombine pages like ioremap() (mapped to
>>> the kernel space). For VRAM mapped in TTM/GEM, it's mapped with a
>>> crafted pgprot with pgprot_writecombine() function, which isn't
>>> corrently disabled now.
>>>
>>> Disable WUC for pgprot_writecombine() (fallback to SUC) too.
>>>
>>> This improves AMDGPU driver stability on Loongson 3A5000 machines.
>>>
>>> Signed-off-by: Icenowy Zheng <[email protected]>
>>> ---
>>> Changes since v1:
>>> - Removed _WC macros
>>> - Mention ioremap_wc in commit message
>>>
>>>   arch/loongarch/include/asm/io.h           |  5 ++---
>>>   arch/loongarch/include/asm/pgtable-bits.h |  4 +++-
>>>   arch/loongarch/kernel/setup.c             | 10 +++++-----
>>>   3 files changed, 10 insertions(+), 9 deletions(-)
>>>
>>> diff --git a/arch/loongarch/include/asm/io.h
>>> b/arch/loongarch/include/asm/io.h
>>> index 0dcb36b32cb25..290aad87a8847 100644
>>> --- a/arch/loongarch/include/asm/io.h
>>> +++ b/arch/loongarch/include/asm/io.h
>>> @@ -52,10 +52,9 @@ static inline void __iomem
>>> *ioremap_prot(phys_addr_t offset, unsigned long size,
>>>    * @offset:    bus address of the memory
>>>    * @size:      size of the resource to map
>>>    */
>>> -extern pgprot_t pgprot_wc;
>>> -
>>>   #define ioremap_wc(offset, size)    \
>>> -    ioremap_prot((offset), (size), pgprot_val(pgprot_wc))
>>> +    ioremap_prot((offset), (size), pgprot_val( \
>>> +        wc_enabled ? PAGE_KERNEL_WUC : PAGE_KERNEL_SUC))
>>>     #define ioremap_cache(offset, size)    \
>>>       ioremap_prot((offset), (size), pgprot_val(PAGE_KERNEL))
>>> diff --git a/arch/loongarch/include/asm/pgtable-bits.h
>>> b/arch/loongarch/include/asm/pgtable-bits.h
>>> index 35348d4c4209a..a3d827701736d 100644
>>> --- a/arch/loongarch/include/asm/pgtable-bits.h
>>> +++ b/arch/loongarch/include/asm/pgtable-bits.h
>>> @@ -92,6 +92,8 @@
>>>     #ifndef __ASSEMBLY__
>>>   +extern bool wc_enabled;
>>> +
>>>   #define _PAGE_IOREMAP        pgprot_val(PAGE_KERNEL_SUC)
>>>     #define pgprot_noncached pgprot_noncached
>>> @@ -111,7 +113,7 @@ static inline pgprot_t
>>> pgprot_writecombine(pgprot_t _prot)
>>>   {
>>>       unsigned long prot = pgprot_val(_prot);
>>>   -    prot = (prot & ~_CACHE_MASK) | _CACHE_WUC;
>>> +    prot = (prot & ~_CACHE_MASK) | (wc_enabled ? _CACHE_WUC :
>>> _CACHE_SUC);
>>>         return __pgprot(prot);
>>>   }
>>> diff --git a/arch/loongarch/kernel/setup.c
>>> b/arch/loongarch/kernel/setup.c
>>> index 7783f0a3d742c..465c1dbb6f4b4 100644
>>> --- a/arch/loongarch/kernel/setup.c
>>> +++ b/arch/loongarch/kernel/setup.c
>>> @@ -161,19 +161,19 @@ static void __init smbios_parse(void)
>>>   }
>>>     #ifdef CONFIG_ARCH_WRITECOMBINE
>>> -pgprot_t pgprot_wc = PAGE_KERNEL_WUC;
>>> +bool wc_enabled = true;
>>>   #else
>>> -pgprot_t pgprot_wc = PAGE_KERNEL_SUC;
>>> +bool wc_enabled;
>>>   #endif
>>>   -EXPORT_SYMBOL(pgprot_wc);
>>> +EXPORT_SYMBOL(wc_enabled);
>>>     static int __init setup_writecombine(char *p)
>>>   {
>>>       if (!strcmp(p, "on"))
>>> -        pgprot_wc = PAGE_KERNEL_WUC;
>>> +        wc_enabled = true;
>>>       else if (!strcmp(p, "off"))
>>> -        pgprot_wc = PAGE_KERNEL_SUC;
>>> +        wc_enabled = false;
>>>       else
>>>           pr_warn("Unknown writecombine setting \"%s\".\n", p);
>>
>>
>> Good catch!
>>
>> But this will make the write combine(WC) mappings completely unusable
>> on LoongArch.
>> This is nearly equivalent to say that LoongArch don't support write
>> combine at all.
>> But the write combine(WC) mappings works fine for software based drm
>> drivers,
>> such as drm/loongson and drm/ast etc. Even include drm/radeon and
>> drm/amdgpu with
>> pure software rendering setup (by putting Option "Accel" "off" into
>> 10-amdgpu.conf
>> or 10-radeon.conf) After merge this patch, the performance drop
>> dramatically for
>> 2D software rendering based display controller drivers.
>>
>> Well, this patch itself is a good catch, as it is a fix for the
>> commit <16c52e503043>
>> ("LoongArch: Make WriteCombine configurable for ioremap()"). But I'm
>> afraid that
>> both of this commit and the <16c52e503043> commit are not a *real*
>> fix write combine
>> related issue on LoongArch. It just negative sidestep of the real
>> problem.
> Sure, but given the public information I have access to, I don't think
> it's possible to "really" fix the root cause with software only. Do
> you have any insight on this, given from your perspective and
> language, such a solution seems to exist?
>
On LoongArch, cached mapping and uncached mappings are DMA-coherent and guaranteed by
the hardware. While WC mappings is *NOT* DMA-coherent when 3D GPU is involved. Therefore,
On downstream kernel, We disable write combine(WC) mappings at the drm drivers side.

- For buffers at VRAM(device memory), we replace the WC mappings with uncached mappings.
- For buffers reside in RAM, we replace the WC mappings with cached mappings.

By this way, we were able to minimum the side effects, and meet the usable requirements
for all of the GPU drivers.

For DMA non-coherent buffers, we should try to implement arch-specific dma_map_ops,
invalidate the CPU cache and flush the CPU write buffer before the device do DMA. Instead
of pretend to be DMA coherent for all buffers, a kernel cmdline is not a system level
solution for all of GPU drivers and OS release.

2023-10-10 12:27:35

by Xi Ruoyao

[permalink] [raw]
Subject: Re: [PATCH v2] loongarch/mm: disable WUC for pgprot_writecombine as same as ioremap_wc

On Tue, 2023-10-10 at 11:02 +0800, Sui Jingfeng wrote:

>
> On LoongArch, cached mapping and uncached mappings are DMA-coherent and guaranteed by
> the hardware. While WC mappings is *NOT* DMA-coherent when 3D GPU is involved. Therefore,
> On downstream kernel, We disable write combine(WC) mappings at the drm drivers side.

Why it's only an issue when 3D GPU is involved? What's the difference
between 3D GPUs and other devices? Is it possible that the other
devices (say neural accelerators) start to perform DMA accesses in a
similar way and then suddenly broken?

> - For buffers at VRAM(device memory), we replace the WC mappings with uncached mappings.
> - For buffers reside in RAM, we replace the WC mappings with cached mappings.
>
> By this way, we were able to minimum the side effects, and meet the usable requirements
> for all of the GPU drivers.

AFAIK there has been some clear NAK from DRM maintainers towards this
"approach". So it's not possible to be applied upstream.

> For DMA non-coherent buffers, we should try to implement arch-specific dma_map_ops,
> invalidate the CPU cache and flush the CPU write buffer before the device do DMA. Instead
> of pretend to be DMA coherent for all buffers, a kernel cmdline is not a system level
> solution for all of GPU drivers and OS release.

IIUC this is a hardware bug of 7A1000 and 7A2000, so the proper location
of the workaround is in the bridge chip driver. Or am I
misunderstanding something?

--
Xi Ruoyao <[email protected]>
School of Aerospace Science and Technology, Xidian University

2023-10-13 11:12:27

by Sui Jingfeng

[permalink] [raw]
Subject: Re: [PATCH v2] loongarch/mm: disable WUC for pgprot_writecombine as same as ioremap_wc

Hi,


On 2023/10/10 20:26, Xi Ruoyao wrote:
>> For DMA non-coherent buffers, we should try to implement arch-specific dma_map_ops,
>> invalidate the CPU cache and flush the CPU write buffer before the device do DMA. Instead
>> of pretend to be DMA coherent for all buffers, a kernel cmdline is not a system level
>> solution for all of GPU drivers and OS release.
> IIUC this is a hardware bug of 7A1000 and 7A2000, so the proper location
> of the workaround is in the bridge chip driver. Or am I
> misunderstanding something?


The ls2k1000 and ls2k2000 SoC (which don't use with 7A1000 and 7A2000) are also suffer form this
problem. If this is a hardware bug of 7A1000 and 7A2000, why forbidden WC mapping of pages in
system RAM?

The problem of this patch and the <16c52e503043> commit is that it forbidden all WC mappings by
default. Even pages in system RAM.

2023-10-13 12:52:25

by Sui Jingfeng

[permalink] [raw]
Subject: Re: [PATCH v2] loongarch/mm: disable WUC for pgprot_writecombine as same as ioremap_wc

Hi,


On 2023/10/10 20:26, Xi Ruoyao wrote:
>> - For buffers at VRAM(device memory), we replace the WC mappings with uncached mappings.
>> - For buffers reside in RAM, we replace the WC mappings with cached mappings.
>>
>> By this way, we were able to minimum the side effects, and meet the usable requirements
>> for all of the GPU drivers.
> AFAIK there has been some clear NAK from DRM maintainers towards this
> "approach". So it's not possible to be applied upstream.


Yeah, domain specific reviewers are really hard to persuade to accept our solution.
Probably because they are not know our hardware very well. But the side effects of
this patch is really too hurt to accept. In this case, if you really want to make
a progress by workaround. I think you could try scan the PCI device in the whole
system on boot time. Turn off the WC mappings when there is a AMD or ATI GPUs found.
Leave the the WC mappings open at the rest of cases. But this is yet another ugly
workaround. Perhaps it is a bit of better than this because there is no way left.




2023-10-13 13:16:01

by Xi Ruoyao

[permalink] [raw]
Subject: Re: [PATCH v2] loongarch/mm: disable WUC for pgprot_writecombine as same as ioremap_wc

On Fri, 2023-10-13 at 20:51 +0800, Sui Jingfeng wrote:
> Hi,
>
>
> On 2023/10/10 20:26, Xi Ruoyao wrote:
> > > - For buffers at VRAM(device memory), we replace the WC mappings with uncached mappings.
> > > - For buffers reside in RAM, we replace the WC mappings with cached mappings.
> > >
> > > By this way, we were able to minimum the side effects, and meet the usable requirements
> > > for all of the GPU drivers.
> > AFAIK there has been some clear NAK from DRM maintainers towards this
> > "approach".  So it's not possible to be applied upstream.
>
>
> Yeah, domain specific reviewers are really hard to persuade to accept our solution.
> Probably because they are not know our hardware very well.

The problem: why this is only an issue with DRM devices? Work around it
in DRM subsystem just does not make sense to me, at all.

You cannot exclude the possibility that another subsystem will start to
hit the same issue in the future. Then such a "solution" will cause #if
CONFIG_LOONGARCH scattering everywhere and doing so is completely
unmaintainable.

> But the side effects of this patch is really too hurt to accept. In this case, if you really want to make
> a progress by workaround.

This is not a valid reason. For example, the spectre-like bug
workarounds hurts the performance as well, but we enable them on
affected CPUs by default.

> I think you could try scan the PCI device in the whole
> system on boot time. Turn off the WC mappings when there is a AMD or ATI GPUs found.

Again, why this is only an issue with AMD or ATI GPUs?

Can you provide some detailed documentations about this hardware issue
so the community can help to figure out a solution?

> Leave the the WC mappings open at the rest of cases. But this is yet another ugly
> workaround. Perhaps it is a bit of better than this because there is no way left.

And this also won't work with PCI hotplug, I'm not sure if it's
supported now but I'm 99% sure you'll add it in the future.

--
Xi Ruoyao <[email protected]>
School of Aerospace Science and Technology, Xidian University

2023-10-13 13:53:44

by Xi Ruoyao

[permalink] [raw]
Subject: Re: [PATCH v2] loongarch/mm: disable WUC for pgprot_writecombine as same as ioremap_wc

On Fri, 2023-10-13 at 21:15 +0800, Xi Ruoyao wrote:
> Again, why this is only an issue with AMD or ATI GPUs?
>
> Can you provide some detailed documentations about this hardware issue
> so the community can help to figure out a solution?

3 years ago we had https://lkml.org/lkml/2020/8/10/255:

In this case the patch is a clear NAK since you haven't root caused the
issue and are just working around it in a very questionable manner.

and

But when the hardware doesn't correctly implement WC for PCIe BARs, then
this is a violation of the PCIe spec and a bit more serious issue for
the whole platform.

So do we know the root cause now? <rant>Or in all the 3 years we just
keep carrying a problematic workaround downstream, burying the head into
the sand like an ostrich, and self comforting with "oh they don't
understand our hardware"?!</rant>

Even if the problem is really "they don't understand our hardware" you
need to provide some materials to help people understanding the hardware
better.

If we cannot figure out the root cause or a proper fix is too difficult,
we should *at least* have a cmdline option and/or a configuration option
to allow the user to decide, like how we treat these spectre-like bugs.
"Should the option be enabled or disabled by default" can be debated
later.

And please try to fix the hardware, to me it will be a compelling reason
to pay some money for an upgrade :).

--
Xi Ruoyao <[email protected]>
School of Aerospace Science and Technology, Xidian University