2020-04-06 19:52:30

by Alex Xu (Hello71)

Subject: Bad rss-counter state from drm/ttm, drm/vmwgfx: Support huge TTM pagefaults

Using 314b658 with amdgpu, starting sway and firefox causes "BUG: Bad
rss-counter state" and "BUG: non-zero pgtables_bytes on freeing mm" to
start filling dmesg, and then closing programs causes more BUGs and
hangs, and then everything grinds to a halt (can't start more programs,
can't even reboot through systemd).

Using master and reverting that branch up to that point fixes the
problem.

I'm using a Ryzen 1600 and AMD Radeon RX 480 on an ASRock B450 Pro4
board with IOMMU enabled.


Subject: Re: Bad rss-counter state from drm/ttm, drm/vmwgfx: Support huge TTM pagefaults

On 4/6/20 9:51 PM, Alex Xu (Hello71) wrote:
> Using 314b658 with amdgpu, starting sway and firefox causes "BUG: Bad
> rss-counter state" and "BUG: non-zero pgtables_bytes on freeing mm" to
> start filling dmesg, and then closing programs causes more BUGs and
> hangs, and then everything grinds to a halt (can't start more programs,
> can't even reboot through systemd).
>
> Using master and reverting that branch up to that point fixes the
> problem.
>
> I'm using a Ryzen 1600 and AMD Radeon RX 480 on an ASRock B450 Pro4
> board with IOMMU enabled.

Hmm. That sounds bad. Could you send a copy of your config?

Meanwhile, I'll prepare a small patch that disables the non-vmwgfx
huge_fault() until we've figured out what's happening.

/Thomas


Subject: Re: Bad rss-counter state from drm/ttm, drm/vmwgfx: Support huge TTM pagefaults

Hi,

On 4/6/20 9:51 PM, Alex Xu (Hello71) wrote:
> Using 314b658 with amdgpu, starting sway and firefox causes "BUG: Bad
> rss-counter state" and "BUG: non-zero pgtables_bytes on freeing mm" to
> start filling dmesg, and then closing programs causes more BUGs and
> hangs, and then everything grinds to a halt (can't start more programs,
> can't even reboot through systemd).
>
> Using master and reverting that branch up to that point fixes the
> problem.
>
> I'm using a Ryzen 1600 and AMD Radeon RX 480 on an ASRock B450 Pro4
> board with IOMMU enabled.

If you could try the attached patch, that'd be great!

Thanks,

Thomas



Attachments:
0001-drm-ttm-Temporarily-disable-the-huge_fault-callback.patch (2.80 kB)

2020-04-07 00:41:28

by Alex Xu (Hello71)

Subject: Re: Bad rss-counter state from drm/ttm, drm/vmwgfx: Support huge TTM pagefaults

Excerpts from Thomas Hellström (VMware)'s message of April 6, 2020 5:04 pm:
> Hi,
>
> On 4/6/20 9:51 PM, Alex Xu (Hello71) wrote:
>> Using 314b658 with amdgpu, starting sway and firefox causes "BUG: Bad
>> rss-counter state" and "BUG: non-zero pgtables_bytes on freeing mm" to
>> start filling dmesg, and then closing programs causes more BUGs and
>> hangs, and then everything grinds to a halt (can't start more programs,
>> can't even reboot through systemd).
>>
>> Using master and reverting that branch up to that point fixes the
>> problem.
>>
>> I'm using a Ryzen 1600 and AMD Radeon RX 480 on an ASRock B450 Pro4
>> board with IOMMU enabled.
>
> If you could try the attached patch, that'd be great!
>
> Thanks,
>
> Thomas
>

Yeah, that works too. Kernel config sent off-list.

Regards,
Alex.

Subject: Re: Bad rss-counter state from drm/ttm, drm/vmwgfx: Support huge TTM pagefaults

On 4/7/20 2:38 AM, Alex Xu (Hello71) wrote:
> Excerpts from Thomas Hellström (VMware)'s message of April 6, 2020 5:04 pm:
>> Hi,
>>
>> On 4/6/20 9:51 PM, Alex Xu (Hello71) wrote:
>>> Using 314b658 with amdgpu, starting sway and firefox causes "BUG: Bad
>>> rss-counter state" and "BUG: non-zero pgtables_bytes on freeing mm" to
>>> start filling dmesg, and then closing programs causes more BUGs and
>>> hangs, and then everything grinds to a halt (can't start more programs,
>>> can't even reboot through systemd).
>>>
>>> Using master and reverting that branch up to that point fixes the
>>> problem.
>>>
>>> I'm using a Ryzen 1600 and AMD Radeon RX 480 on an ASRock B450 Pro4
>>> board with IOMMU enabled.
>> If you could try the attached patch, that'd be great!
>>
>> Thanks,
>>
>> Thomas
>>
> Yeah, that works too. Kernel config sent off-list.
>
> Regards,
> Alex.

Thanks. Do you want me to add your

Reported-by: and Tested-by: to this patch?

/Thomas

2020-04-07 15:37:39

by Alex Xu (Hello71)

Subject: Re: Bad rss-counter state from drm/ttm, drm/vmwgfx: Support huge TTM pagefaults

Excerpts from Thomas Hellström (VMware)'s message of April 7, 2020 7:26 am:
> On 4/7/20 2:38 AM, Alex Xu (Hello71) wrote:
>> Excerpts from Thomas Hellström (VMware)'s message of April 6, 2020 5:04 pm:
>>> Hi,
>>>
>>> On 4/6/20 9:51 PM, Alex Xu (Hello71) wrote:
>>>> Using 314b658 with amdgpu, starting sway and firefox causes "BUG: Bad
>>>> rss-counter state" and "BUG: non-zero pgtables_bytes on freeing mm" to
>>>> start filling dmesg, and then closing programs causes more BUGs and
>>>> hangs, and then everything grinds to a halt (can't start more programs,
>>>> can't even reboot through systemd).
>>>>
>>>> Using master and reverting that branch up to that point fixes the
>>>> problem.
>>>>
>>>> I'm using a Ryzen 1600 and AMD Radeon RX 480 on an ASRock B450 Pro4
>>>> board with IOMMU enabled.
>>> If you could try the attached patch, that'd be great!
>>>
>>> Thanks,
>>>
>>> Thomas
>>>
>> Yeah, that works too. Kernel config sent off-list.
>>
>> Regards,
>> Alex.
>
> Thanks. Do you want me to add your
>
> Reported-by: and Tested-by: To this patch?
>
> /Thomas
>
>

Sure. Shouldn't we fix it properly though?

Subject: Re: Bad rss-counter state from drm/ttm, drm/vmwgfx: Support huge TTM pagefaults

On 4/7/20 5:36 PM, Alex Xu (Hello71) wrote:
> Excerpts from Thomas Hellström (VMware)'s message of April 7, 2020 7:26 am:
>> On 4/7/20 2:38 AM, Alex Xu (Hello71) wrote:
>>> Excerpts from Thomas Hellström (VMware)'s message of April 6, 2020 5:04 pm:
>>>> Hi,
>>>>
>>>> On 4/6/20 9:51 PM, Alex Xu (Hello71) wrote:
>>>>> Using 314b658 with amdgpu, starting sway and firefox causes "BUG: Bad
>>>>> rss-counter state" and "BUG: non-zero pgtables_bytes on freeing mm" to
>>>>> start filling dmesg, and then closing programs causes more BUGs and
>>>>> hangs, and then everything grinds to a halt (can't start more programs,
>>>>> can't even reboot through systemd).
>>>>>
>>>>> Using master and reverting that branch up to that point fixes the
>>>>> problem.
>>>>>
>>>>> I'm using a Ryzen 1600 and AMD Radeon RX 480 on an ASRock B450 Pro4
>>>>> board with IOMMU enabled.
>>>> If you could try the attached patch, that'd be great!
>>>>
>>>> Thanks,
>>>>
>>>> Thomas
>>>>
>>> Yeah, that works too. Kernel config sent off-list.
>>>
>>> Regards,
>>> Alex.
>> Thanks. Do you want me to add your
>>
>> Reported-by: and Tested-by: To this patch?
>>
>> /Thomas
>>
>>
> Sure. Shouldn't we fix it properly though?

It's still enabled for vmwgfx for which it is reasonably well tested and
where I can't see any such errors.

The code we remove with this patch enables huge page-table entries in
some circumstances for other drivers, but given the problems you're
seeing for amdgpu, it's better to enable this on a per-driver basis
after thorough testing. Since I don't have amdgpu hardware I'm not sure
what it's doing differently, and can't debug the issue properly.

/Thomas
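
Editor's illustration (not part of the original message): the per-driver
approach described above can be sketched as a small userspace model. This is
only a sketch under stated assumptions. The types and names below
(vm_ops_model, handle_fault, default_ops, opted_in_ops) are hypothetical
stand-ins and are not the kernel's or TTM's real API. The idea shown is that
the shared default operations table leaves the huge-fault callback unset, and
only a driver that has been tested sets it explicitly.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical result codes; not the kernel's vm_fault_t values. */
enum fault_result { FAULT_FALLBACK, FAULT_HANDLED_SMALL, FAULT_HANDLED_HUGE };

/* Hypothetical model of a per-mapping operations table. */
struct vm_ops_model {
    enum fault_result (*fault)(void);      /* regular page-sized fault path */
    enum fault_result (*huge_fault)(void); /* NULL means: driver has not opted in */
};

static enum fault_result small_fault(void) { return FAULT_HANDLED_SMALL; }
static enum fault_result big_fault(void)   { return FAULT_HANDLED_HUGE; }

/* Shared default table: huge faults are disabled for every driver using it. */
const struct vm_ops_model default_ops = {
    .fault = small_fault,
    .huge_fault = NULL,
};

/* Table for a driver that opted in after testing. */
const struct vm_ops_model opted_in_ops = {
    .fault = small_fault,
    .huge_fault = big_fault,
};

/*
 * Try the huge path only when the mapping allows it AND the driver supplied
 * a callback; otherwise, or if the callback asks to fall back, use the
 * regular page-sized path.
 */
enum fault_result handle_fault(const struct vm_ops_model *ops, bool huge_possible)
{
    if (huge_possible && ops->huge_fault != NULL) {
        enum fault_result r = ops->huge_fault();
        if (r != FAULT_FALLBACK)
            return r;
    }
    return ops->fault();
}

/* Small self-check of the model's behaviour. */
void model_self_check(void)
{
    assert(handle_fault(&default_ops, true) == FAULT_HANDLED_SMALL);
    assert(handle_fault(&opted_in_ops, true) == FAULT_HANDLED_HUGE);
    assert(handle_fault(&opted_in_ops, false) == FAULT_HANDLED_SMALL);
}
```

In this model, a driver using default_ops always takes the page-sized path even
when a huge mapping would be possible, which mirrors the intent of disabling
the shared callback until each driver has been tested individually.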