2023-10-24 21:38:07

by Salvatore Bonaccorso

Subject: Re: Bug#1054514: linux-image-6.1.0-13-amd64: Debian VM with qxl graphics freezes frequently

Hi Timo,

On Tue, Oct 24, 2023 at 11:14:32PM +0300, Timo Lindfors wrote:
> Package: src:linux
> Version: 6.1.55-1
> Severity: normal
>
> Steps to reproduce:
> 1) Install Debian 12 as a virtual machine using virt-manager, choose qxl
> graphics card. You only need basic installation without wayland or X.
> 2) Login from the console and save the following to reproduce.bash:
>
> #!/bin/bash
>
> chvt 3
> for j in $(seq 80); do
> echo "$(date) starting round $j"
> if [ "$(journalctl --boot | grep "failed to allocate VRAM BO")" != "" ];
> then
> echo "bug was reproduced after $j tries"
> exit 1
> fi
> for i in $(seq 100); do
> dmesg > /dev/tty3
> done
> done
>
> echo "bug could not be reproduced"
> exit 0
>
>
> 3) Run chmod a+x reproduce.bash
> 4) Run ./reproduce.bash and wait for up to 20 minutes.
>
> Expected results:
> 4) The system prints a steady flow of text without kernel error messages
>
> Actual messages:
> 4) At some point the text stops flowing and the script prints "bug was
> reproduced". If you run "journalctl --boot" you see
>
> kernel: [TTM] Buffer eviction failed
> kernel: qxl 0000:00:02.0: object_init failed for (3149824, 0x00000001)
> kernel: [drm:qxl_alloc_bo_reserved [qxl]] *ERROR* failed to allocate VRAM BO
>
>
>
> More info:
> 1) The bug does not occur if I downgrade the kernel to
> linux-image-5.10.0-26-amd64_5.10.197-1_amd64.deb from Debian 11.
> 2) I used the following test_linux.bash to bisect this issue against
> upstream source:
>
> #!/bin/bash
> set -x
>
> gitversion="$(git describe HEAD|sed 's@^v@@')"
>
> git checkout drivers/gpu/drm/ttm/ttm_bo.c include/drm/ttm/ttm_bo_api.h
> git show bec771b5e0901f4b0bc861bcb58056de5151ae3a | patch -p1
> # Build
> cp ~/kernel.config .config
> # cp /boot/config-$(uname -r) .config
> # scripts/config --enable LOCALVERSION_AUTO
> # scripts/config --disable DEBUG_INFO
> # scripts/config --disable SYSTEM_TRUSTED_KEYRING
> # scripts/config --set-str SYSTEM_TRUSTED_KEYS ''
> # scripts/config --disable STACKPROTECTOR_STRONG
> make olddefconfig
> # make localmodconfig
> make -j$(nproc --all) bindeb-pkg
> rc="$?"
> if [ "$rc" != "0" ]; then
> exit 125
> fi
> git checkout drivers/gpu/drm/ttm/ttm_bo.c include/drm/ttm/ttm_bo_api.h
>
> package="$(ls --sort=time ../linux-image-*_amd64.deb|head -n1)"
> version=$(echo $package | cut -d_ -f1|cut -d- -f3-)
>
> if [ "$gitversion" != "$version" ]; then
> echo "Build produced version $gitversion but got $version, ignoring"
> #exit 255
> fi
>
> # Deploy
> scp $package target:a.deb
> ssh target sudo apt install ./a.deb
> ssh target rm -f a.deb
> ssh target ./grub_set_default_version.bash $version
> ssh target sudo shutdown -r now
> sleep 40
>
> detected_version=$(ssh target uname -r)
> if [ "$detected_version" != "$version" ]; then
> echo "Booted to $detected_version but expected $version"
> exit 255
> fi
>
> # Test
> exec ssh target sudo ./reproduce.bash
>
>
> Bisect printed the following log:
>
> git bisect start
> # bad: [ed29c2691188cf7ea2a46d40b891836c2bd1a4f5] drm/i915: Fix userptr so we do not have to worry about obj->mm.lock, v7.
> git bisect bad ed29c2691188cf7ea2a46d40b891836c2bd1a4f5
> # bad: [762949bb1da78941b25e63f7e952af037eee15a9] drm: fix drm_mode_create_blob comment
> git bisect bad 762949bb1da78941b25e63f7e952af037eee15a9
> # bad: [e40f97ef12772f8eb04b6a155baa1e0e2e8f3ecc] drm/gma500: Drop DRM_GMA600 config option
> git bisect bad e40f97ef12772f8eb04b6a155baa1e0e2e8f3ecc
> # bad: [5a838e5d5825c85556011478abde708251cc0776] drm/qxl: simplify qxl_fence_wait
> git bisect bad 5a838e5d5825c85556011478abde708251cc0776
> # bad: [d2b6f8a179194de0ffc4886ffc2c4358d86047b8] Merge tag 'xfs-5.13-merge-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
> git bisect bad d2b6f8a179194de0ffc4886ffc2c4358d86047b8
> # bad: [68a32ba14177d4a21c4a9a941cf1d7aea86d436f] Merge tag 'drm-next-2021-04-28' of git://anongit.freedesktop.org/drm/drm
> git bisect bad 68a32ba14177d4a21c4a9a941cf1d7aea86d436f
> # bad: [0698b13403788a646073fcd9b2294f2dce0ce429] drm/amdgpu: skip PP_MP1_STATE_UNLOAD on aldebaran
> git bisect bad 0698b13403788a646073fcd9b2294f2dce0ce429
> # bad: [e1a5e6a8c48bf99ea374fb3e535661cfe226bca4] drm/doc: Add RFC section
> git bisect bad e1a5e6a8c48bf99ea374fb3e535661cfe226bca4
> # bad: [ed29c2691188cf7ea2a46d40b891836c2bd1a4f5] drm/i915: Fix userptr so we do not have to worry about obj->mm.lock, v7.
> git bisect bad ed29c2691188cf7ea2a46d40b891836c2bd1a4f5
> # bad: [2c8ab3339e398bbbcb0980933e266b93bedaae52] drm/i915: Pin timeline map after first timeline pin, v4.
> git bisect bad 2c8ab3339e398bbbcb0980933e266b93bedaae52
> # bad: [2eb8e1a69d9f8cc9c0a75e327f854957224ba421] drm/i915/gem: Drop relocation support on all new hardware (v6)
> git bisect bad 2eb8e1a69d9f8cc9c0a75e327f854957224ba421
> # bad: [b5b6f6a610127b17f20c0ca03dd27beee4ddc2b2] drm/i915/gem: Drop legacy execbuffer support (v2)
> git bisect bad b5b6f6a610127b17f20c0ca03dd27beee4ddc2b2
> # bad: [06debd6e1b28029e6e77c41e59a162868f377897] Merge tag 'drm-intel-next-2021-03-16' of git://anongit.freedesktop.org/drm/drm-intel into drm-next
> git bisect bad 06debd6e1b28029e6e77c41e59a162868f377897
> # good: [e19eede54240d64b4baf9b0df4dfb8191f7ae48b] Merge branch 'dmi-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jdelvare/staging
> git bisect good e19eede54240d64b4baf9b0df4dfb8191f7ae48b
> # good: [1e28eed17697bcf343c6743f0028cc3b5dd88bf0] Linux 5.12-rc3
> git bisect good 1e28eed17697bcf343c6743f0028cc3b5dd88bf0
> # bad: [6af70eb3b40edfc8bdf2373cdc2bcf9d5a20c8c7] drm/atmel-hlcdc: Rename custom plane state variable
> git bisect bad 6af70eb3b40edfc8bdf2373cdc2bcf9d5a20c8c7
> # good: [4ca77c513537700d3fae69030879f781dde1904c] drm/qxl: release shadow on shutdown
> git bisect good 4ca77c513537700d3fae69030879f781dde1904c
> # bad: [4a11bd1e88af130f50a72e0f54391c1c7d268e03] drm/ast: Add constants for VGACRCB register bits
> git bisect bad 4a11bd1e88af130f50a72e0f54391c1c7d268e03
> # bad: [5c209d8056b9763ce544ecd7dadb3782cdaf96ed] drm/gma500: psb_spank() doesn't need it's own file
> git bisect bad 5c209d8056b9763ce544ecd7dadb3782cdaf96ed
> # bad: [db0c6bd2c0c0dada8927cd46a7c34c316a3a6c04] drm/gem: Export drm_gem_vmap() and drm_gem_vunmap()
> git bisect bad db0c6bd2c0c0dada8927cd46a7c34c316a3a6c04
> # bad: [f4a84e165e6d58606097dd07b5b78767a94b870c] drm/qxl: allocate dumb buffers in ram
> git bisect bad f4a84e165e6d58606097dd07b5b78767a94b870c
> # good: [a7709b9b89a67f3ead2d188b1d0c261059b1f291] drm/qxl: handle shadow in primary destroy
> git bisect good a7709b9b89a67f3ead2d188b1d0c261059b1f291
> # bad: [5a838e5d5825c85556011478abde708251cc0776] drm/qxl: simplify qxl_fence_wait
> git bisect bad 5a838e5d5825c85556011478abde708251cc0776
> # good: [5f6c871fe919999774e8535ea611a6f84ee43ee4] drm/qxl: properly free qxl releases
> git bisect good 5f6c871fe919999774e8535ea611a6f84ee43ee4
> # first bad commit: [5a838e5d5825c85556011478abde708251cc0776] drm/qxl: simplify qxl_fence_wait
>
> I took a look at
>
> commit 5a838e5d5825c85556011478abde708251cc0776 (refs/bisect/bad)
> Author: Gerd Hoffmann <[email protected]>
> Date: Thu Feb 4 15:57:10 2021 +0100
>
> drm/qxl: simplify qxl_fence_wait
>
> Now that we have the new release_event wait queue we can just
> use that in qxl_fence_wait() and simplify the code a lot.
>
> Signed-off-by: Gerd Hoffmann <[email protected]>
> Acked-by: Thomas Zimmermann <[email protected]>
> Link: http://patchwork.freedesktop.org/patch/msgid/[email protected]
>
>
> and noticed that the bug does not occur if I boot 6.1 kernel with this patch
> reverted (see attached file).
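
A note on the exit codes used in test_linux.bash above: they match the convention that `git bisect run` documents (0 is good, 125 means "skip this revision", other codes from 1 to 127 are bad, anything above 127 aborts the bisection). The following is only a sketch of that mapping; `classify_bisect_exit` is a hypothetical helper and not part of the scripts in the report.

```shell
#!/bin/bash
# Hypothetical helper: map a test script's exit code to the verdict
# that `git bisect run` would draw from it.
classify_bisect_exit() {
    local rc="$1"
    if [ "$rc" -eq 0 ]; then
        echo good
    elif [ "$rc" -eq 125 ]; then
        echo skip       # e.g. the build failed, so the revision is untestable
    elif [ "$rc" -ge 1 ] && [ "$rc" -le 127 ]; then
        echo bad        # e.g. reproduce.bash exited 1
    else
        echo abort      # 128 and above stop the bisection (e.g. exit 255)
    fi
}

for rc in 0 1 125 255; do
    echo "exit $rc -> $(classify_bisect_exit "$rc")"
done
```

Under that convention, a wrapper like test_linux.bash could presumably be handed to `git bisect run ./test_linux.bash` so that the bisection proceeds unattended.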

Thanks for the excellently constructed report! I think it's best to
forward this directly to upstream including the people for the
bisected commit to get some idea.

Can you reproduce the issue with 6.5.8-1 in unstable as well?

If not, are you able to isolate an upstream fix which should be
backported to the 6.1.y series as well?

Regards,
Salvatore


2023-10-24 21:40:08

by Timo Lindfors

Subject: Re: Bug#1054514: linux-image-6.1.0-13-amd64: Debian VM with qxl graphics freezes frequently

Hi,

On Tue, 24 Oct 2023, Salvatore Bonaccorso wrote:
> Thanks for the excellently constructed report! I think it's best to
> forward this directly to upstream including the people for the
> bisected commit to get some idea.

Thanks for the quick reply!

> Can you reproduce the issue with 6.5.8-1 in unstable as well?

Unfortunately yes:

ansible@target:~$ uname -r
6.5.0-3-amd64
ansible@target:~$ time sudo ./reproduce.bash
Wed 25 Oct 2023 12:27:00 AM EEST starting round 1
Wed 25 Oct 2023 12:27:24 AM EEST starting round 2
Wed 25 Oct 2023 12:27:48 AM EEST starting round 3
bug was reproduced after 3 tries

real 0m48.838s
user 0m1.115s
sys 0m45.530s

I also tested upstream tag v6.6-rc6:

...
+ detected_version=6.6.0-rc6
+ '[' 6.6.0-rc6 '!=' 6.6.0-rc6 ']'
+ exec ssh target sudo ./reproduce.bash
Wed 25 Oct 2023 12:37:16 AM EEST starting round 1
Wed 25 Oct 2023 12:37:42 AM EEST starting round 2
Wed 25 Oct 2023 12:38:10 AM EEST starting round 3
Wed 25 Oct 2023 12:38:36 AM EEST starting round 4
Wed 25 Oct 2023 12:39:01 AM EEST starting round 5
Wed 25 Oct 2023 12:39:27 AM EEST starting round 6
bug was reproduced after 6 tries


For completeness, here is also the grub_set_default_version.bash script
that I had to write to automate this (maybe these could be in debian
wiki?):

#!/bin/bash
set -x

version="$1"

idx=$(expr $(grep "menuentry " /boot/grub/grub.cfg | sed 1d |grep -n "'Debian GNU/Linux, with Linux $version'"|cut -d: -f1) - 1)
exec sudo grub-set-default "1>$idx"
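
To check the index arithmetic above without touching a real system, the same pipeline can be run on text fed through stdin. This is only a sketch: the function name `grub_entry_index` and the sample menu entries are made up for illustration, and a real grub.cfg will look different.

```shell
#!/bin/bash
# Hypothetical helper: read grub.cfg text on stdin and print the zero-based
# position of the entry for kernel VERSION among the menuentry lines that
# follow the first one (the top-level default entry).
grub_entry_index() {
    local version="$1" n
    n=$(grep "menuentry " \
        | sed 1d \
        | grep -n "'Debian GNU/Linux, with Linux $version'" \
        | cut -d: -f1 \
        | head -n1)
    [ -n "$n" ] || return 1     # no matching entry
    echo $((n - 1))
}

# Made-up excerpt; only the menuentry lines matter here.
sample_cfg="menuentry 'Debian GNU/Linux' --class debian {
submenu 'Advanced options for Debian GNU/Linux' {
menuentry 'Debian GNU/Linux, with Linux 6.6.0-rc6' --class debian {
menuentry 'Debian GNU/Linux, with Linux 6.6.0-rc6 (recovery mode)' --class debian {
menuentry 'Debian GNU/Linux, with Linux 6.1.0-13-amd64' --class debian {
menuentry 'Debian GNU/Linux, with Linux 6.1.0-13-amd64 (recovery mode)' --class debian {"

idx=$(printf '%s\n' "$sample_cfg" | grub_entry_index 6.1.0-13-amd64)
echo "1>$idx"
```

The closing quote in the pattern is what keeps the "(recovery mode)" entries from matching. The `1>` prefix assumes the "Advanced options" submenu is the second top-level item, which is an assumption about the default Debian menu layout.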



-Timo

2023-10-24 23:55:45

by Bagas Sanjaya

Subject: Re: Bug#1054514: linux-image-6.1.0-13-amd64: Debian VM with qxl graphics freezes frequently

On Tue, Oct 24, 2023 at 11:09:10PM +0200, Salvatore Bonaccorso wrote:
> Hi Timo,
>
> On Tue, Oct 24, 2023 at 11:14:32PM +0300, Timo Lindfors wrote:
> > Package: src:linux
> > Version: 6.1.55-1
> > Severity: normal
> >
> > Steps to reproduce:
> > 1) Install Debian 12 as a virtual machine using virt-manager, choose qxl
> > graphics card. You only need basic installation without wayland or X.
> > 2) Login from the console and save the following to reproduce.bash:
> >
> > #!/bin/bash
> >
> > chvt 3
> > for j in $(seq 80); do
> > echo "$(date) starting round $j"
> > if [ "$(journalctl --boot | grep "failed to allocate VRAM BO")" != "" ];
> > then
> > echo "bug was reproduced after $j tries"
> > exit 1
> > fi
> > for i in $(seq 100); do
> > dmesg > /dev/tty3
> > done
> > done
> >
> > echo "bug could not be reproduced"
> > exit 0
> >
> >
> > 3) Run chmod a+x reproduce.bash
> > 4) Run ./reproduce.bash and wait for up to 20 minutes.
> >
> > Expected results:
> > 4) The system prints a steady flow of text without kernel error messages
> >
> > Actual messages:
> > 4) At some point the text stops flowing and the script prints "bug was
> > reproduced". If you run "journalctl --boot" you see
> >
> > kernel: [TTM] Buffer eviction failed
> > kernel: qxl 0000:00:02.0: object_init failed for (3149824, 0x00000001)
> > kernel: [drm:qxl_alloc_bo_reserved [qxl]] *ERROR* failed to allocate VRAM BO
> >
> >
> >
> > More info:
> > 1) The bug does not occur if I downgrade the kernel to
> > linux-image-5.10.0-26-amd64_5.10.197-1_amd64.deb from Debian 11.
> > 2) I used the following test_linux.bash to bisect this issue against
> > upstream source:
> >
> > #!/bin/bash
> > set -x
> >
> > gitversion="$(git describe HEAD|sed 's@^v@@')"
> >
> > git checkout drivers/gpu/drm/ttm/ttm_bo.c include/drm/ttm/ttm_bo_api.h
> > git show bec771b5e0901f4b0bc861bcb58056de5151ae3a | patch -p1
> > # Build
> > cp ~/kernel.config .config
> > # cp /boot/config-$(uname -r) .config
> > # scripts/config --enable LOCALVERSION_AUTO
> > # scripts/config --disable DEBUG_INFO
> > # scripts/config --disable SYSTEM_TRUSTED_KEYRING
> > # scripts/config --set-str SYSTEM_TRUSTED_KEYS ''
> > # scripts/config --disable STACKPROTECTOR_STRONG
> > make olddefconfig
> > # make localmodconfig
> > make -j$(nproc --all) bindeb-pkg
> > rc="$?"
> > if [ "$rc" != "0" ]; then
> > exit 125
> > fi
> > git checkout drivers/gpu/drm/ttm/ttm_bo.c include/drm/ttm/ttm_bo_api.h
> >
> > package="$(ls --sort=time ../linux-image-*_amd64.deb|head -n1)"
> > version=$(echo $package | cut -d_ -f1|cut -d- -f3-)
> >
> > if [ "$gitversion" != "$version" ]; then
> > echo "Build produced version $gitversion but got $version, ignoring"
> > #exit 255
> > fi
> >
> > # Deploy
> > scp $package target:a.deb
> > ssh target sudo apt install ./a.deb
> > ssh target rm -f a.deb
> > ssh target ./grub_set_default_version.bash $version
> > ssh target sudo shutdown -r now
> > sleep 40
> >
> > detected_version=$(ssh target uname -r)
> > if [ "$detected_version" != "$version" ]; then
> > echo "Booted to $detected_version but expected $version"
> > exit 255
> > fi
> >
> > # Test
> > exec ssh target sudo ./reproduce.bash
> >
> >
> > Bisect printed the following log:
> >
> > git bisect start
> > # bad: [ed29c2691188cf7ea2a46d40b891836c2bd1a4f5] drm/i915: Fix userptr so we do not have to worry about obj->mm.lock, v7.
> > git bisect bad ed29c2691188cf7ea2a46d40b891836c2bd1a4f5
> > # bad: [762949bb1da78941b25e63f7e952af037eee15a9] drm: fix drm_mode_create_blob comment
> > git bisect bad 762949bb1da78941b25e63f7e952af037eee15a9
> > # bad: [e40f97ef12772f8eb04b6a155baa1e0e2e8f3ecc] drm/gma500: Drop DRM_GMA600 config option
> > git bisect bad e40f97ef12772f8eb04b6a155baa1e0e2e8f3ecc
> > # bad: [5a838e5d5825c85556011478abde708251cc0776] drm/qxl: simplify qxl_fence_wait
> > git bisect bad 5a838e5d5825c85556011478abde708251cc0776
> > # bad: [d2b6f8a179194de0ffc4886ffc2c4358d86047b8] Merge tag 'xfs-5.13-merge-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
> > git bisect bad d2b6f8a179194de0ffc4886ffc2c4358d86047b8
> > # bad: [68a32ba14177d4a21c4a9a941cf1d7aea86d436f] Merge tag 'drm-next-2021-04-28' of git://anongit.freedesktop.org/drm/drm
> > git bisect bad 68a32ba14177d4a21c4a9a941cf1d7aea86d436f
> > # bad: [0698b13403788a646073fcd9b2294f2dce0ce429] drm/amdgpu: skip PP_MP1_STATE_UNLOAD on aldebaran
> > git bisect bad 0698b13403788a646073fcd9b2294f2dce0ce429
> > # bad: [e1a5e6a8c48bf99ea374fb3e535661cfe226bca4] drm/doc: Add RFC section
> > git bisect bad e1a5e6a8c48bf99ea374fb3e535661cfe226bca4
> > # bad: [ed29c2691188cf7ea2a46d40b891836c2bd1a4f5] drm/i915: Fix userptr so we do not have to worry about obj->mm.lock, v7.
> > git bisect bad ed29c2691188cf7ea2a46d40b891836c2bd1a4f5
> > # bad: [2c8ab3339e398bbbcb0980933e266b93bedaae52] drm/i915: Pin timeline map after first timeline pin, v4.
> > git bisect bad 2c8ab3339e398bbbcb0980933e266b93bedaae52
> > # bad: [2eb8e1a69d9f8cc9c0a75e327f854957224ba421] drm/i915/gem: Drop relocation support on all new hardware (v6)
> > git bisect bad 2eb8e1a69d9f8cc9c0a75e327f854957224ba421
> > # bad: [b5b6f6a610127b17f20c0ca03dd27beee4ddc2b2] drm/i915/gem: Drop legacy execbuffer support (v2)
> > git bisect bad b5b6f6a610127b17f20c0ca03dd27beee4ddc2b2
> > # bad: [06debd6e1b28029e6e77c41e59a162868f377897] Merge tag 'drm-intel-next-2021-03-16' of git://anongit.freedesktop.org/drm/drm-intel into drm-next
> > git bisect bad 06debd6e1b28029e6e77c41e59a162868f377897
> > # good: [e19eede54240d64b4baf9b0df4dfb8191f7ae48b] Merge branch 'dmi-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jdelvare/staging
> > git bisect good e19eede54240d64b4baf9b0df4dfb8191f7ae48b
> > # good: [1e28eed17697bcf343c6743f0028cc3b5dd88bf0] Linux 5.12-rc3
> > git bisect good 1e28eed17697bcf343c6743f0028cc3b5dd88bf0
> > # bad: [6af70eb3b40edfc8bdf2373cdc2bcf9d5a20c8c7] drm/atmel-hlcdc: Rename custom plane state variable
> > git bisect bad 6af70eb3b40edfc8bdf2373cdc2bcf9d5a20c8c7
> > # good: [4ca77c513537700d3fae69030879f781dde1904c] drm/qxl: release shadow on shutdown
> > git bisect good 4ca77c513537700d3fae69030879f781dde1904c
> > # bad: [4a11bd1e88af130f50a72e0f54391c1c7d268e03] drm/ast: Add constants for VGACRCB register bits
> > git bisect bad 4a11bd1e88af130f50a72e0f54391c1c7d268e03
> > # bad: [5c209d8056b9763ce544ecd7dadb3782cdaf96ed] drm/gma500: psb_spank() doesn't need it's own file
> > git bisect bad 5c209d8056b9763ce544ecd7dadb3782cdaf96ed
> > # bad: [db0c6bd2c0c0dada8927cd46a7c34c316a3a6c04] drm/gem: Export drm_gem_vmap() and drm_gem_vunmap()
> > git bisect bad db0c6bd2c0c0dada8927cd46a7c34c316a3a6c04
> > # bad: [f4a84e165e6d58606097dd07b5b78767a94b870c] drm/qxl: allocate dumb buffers in ram
> > git bisect bad f4a84e165e6d58606097dd07b5b78767a94b870c
> > # good: [a7709b9b89a67f3ead2d188b1d0c261059b1f291] drm/qxl: handle shadow in primary destroy
> > git bisect good a7709b9b89a67f3ead2d188b1d0c261059b1f291
> > # bad: [5a838e5d5825c85556011478abde708251cc0776] drm/qxl: simplify qxl_fence_wait
> > git bisect bad 5a838e5d5825c85556011478abde708251cc0776
> > # good: [5f6c871fe919999774e8535ea611a6f84ee43ee4] drm/qxl: properly free qxl releases
> > git bisect good 5f6c871fe919999774e8535ea611a6f84ee43ee4
> > # first bad commit: [5a838e5d5825c85556011478abde708251cc0776] drm/qxl: simplify qxl_fence_wait
> >
> > I took a look at
> >
> > commit 5a838e5d5825c85556011478abde708251cc0776 (refs/bisect/bad)
> > Author: Gerd Hoffmann <[email protected]>
> > Date: Thu Feb 4 15:57:10 2021 +0100
> >
> > drm/qxl: simplify qxl_fence_wait
> >
> > Now that we have the new release_event wait queue we can just
> > use that in qxl_fence_wait() and simplify the code a lot.
> >
> > Signed-off-by: Gerd Hoffmann <[email protected]>
> > Acked-by: Thomas Zimmermann <[email protected]>
> > Link: http://patchwork.freedesktop.org/patch/msgid/[email protected]
> >
> >
> > and noticed that the bug does not occur if I boot 6.1 kernel with this patch
> > reverted (see attached file).
>
> Thanks for the excellently constructed report! I think it's best to
> forward this directly to upstream including the people for the
> bisected commit to get some idea.
>
> Can you reproduce the issue with 6.5.8-1 in unstable as well?
>
> If not, are you able to isolate an upstream fix which should be
> backported to the 6.1.y series as well?
>

Thanks for the regression report. I'm adding it to regzbot:

#regzbot ^introduced: 5a838e5d5825c8
#regzbot title: simplifying qxl_fence_wait() makes VRAM BO allocation fail
#regzbot from: Timo Lindfors <[email protected]>

--
An old man doll... just what I always wanted! - Clara


Subject: Re: Bug#1054514: linux-image-6.1.0-13-amd64: Debian VM with qxl graphics freezes frequently

Hi, Thorsten here, the Linux kernel's regression tracker. Top-posting
for once, to make this easily accessible to everyone.

Gerd, it seems this regression[1] fell through the cracks. Could you
please take a look? Or is there a good reason why this can't be
addressed? Or was it dealt with and I just missed it?

[1] apparently caused by 5a838e5d5825c8 ("drm/qxl: simplify
qxl_fence_wait") [v5.13-rc1] from Gerd; for details see
https://lore.kernel.org/regressions/[email protected]/

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
--
Everything you wanna know about Linux kernel regression tracking:
https://linux-regtracking.leemhuis.info/about/#tldr
If I did something stupid, please tell me, as explained on that page.

#regzbot poke

On 24.10.23 23:39, Timo Lindfors wrote:
> Hi,
>
> On Tue, 24 Oct 2023, Salvatore Bonaccorso wrote:
>> Thanks for the excellently constructed report! I think it's best to
>> forward this directly to upstream including the people for the
>> bisected commit to get some idea.
>
> Thanks for the quick reply!
>
>> Can you reproduce the issue with 6.5.8-1 in unstable as well?
>
> Unfortunately yes:
>
> ansible@target:~$ uname -r
> 6.5.0-3-amd64
> ansible@target:~$ time sudo ./reproduce.bash
> Wed 25 Oct 2023 12:27:00 AM EEST starting round 1
> Wed 25 Oct 2023 12:27:24 AM EEST starting round 2
> Wed 25 Oct 2023 12:27:48 AM EEST starting round 3
> bug was reproduced after 3 tries
>
> real    0m48.838s
> user    0m1.115s
> sys     0m45.530s
>
> I also tested upstream tag v6.6-rc6:
>
> ...
> + detected_version=6.6.0-rc6
> + '[' 6.6.0-rc6 '!=' 6.6.0-rc6 ']'
> + exec ssh target sudo ./reproduce.bash
> Wed 25 Oct 2023 12:37:16 AM EEST starting round 1
> Wed 25 Oct 2023 12:37:42 AM EEST starting round 2
> Wed 25 Oct 2023 12:38:10 AM EEST starting round 3
> Wed 25 Oct 2023 12:38:36 AM EEST starting round 4
> Wed 25 Oct 2023 12:39:01 AM EEST starting round 5
> Wed 25 Oct 2023 12:39:27 AM EEST starting round 6
> bug was reproduced after 6 tries
>
>
> For completeness, here is also the grub_set_default_version.bash script
> that I had to write to automate this (maybe these could be in debian
> wiki?):
>
> #!/bin/bash
> set -x
>
> version="$1"
>
> idx=$(expr $(grep "menuentry " /boot/grub/grub.cfg | sed 1d |grep -n
> "'Debian GNU/Linux, with Linux $version'"|cut -d: -f1) - 1)
> exec sudo grub-set-default "1>$idx"
>
>
>
> -Timo
>
>
>

2024-03-08 01:12:27

by Alex Constantino

Subject: [PATCH 0/1] drm/qxl: fixes qxl_fence_wait

Hi,
As initially reported by Timo, the QXL driver will crash given enough
workload:
https://lore.kernel.org/regressions/[email protected]/
I initially came across this problem when migrating Debian VMs from Bullseye
to Bookworm. This bug will somewhat randomly but consistently happen, even
just by using neovim with plugins or playing a video. This exception would
then cascade and make Xorg crash too.

The error log from dmesg would have `[TTM] Buffer eviction failed` followed
by either a `failed to allocate VRAM BO` or `failed to allocate GEM object`.
And the error log from Xorg would have `qxl(0): error doing QXL_ALLOC`
followed by a backtrace and segmentation fault.
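
As a rough way to check a log for the kernel-side signatures described above, something like the following could be used. It is only a sketch: `has_qxl_alloc_failure` is a hypothetical helper name, and the sample lines are the ones from Timo's original report.

```shell
#!/bin/bash
# Hypothetical helper: succeed if the kernel log text on stdin contains
# one of the two allocation failures described above.
has_qxl_alloc_failure() {
    grep -q -E 'failed to allocate (VRAM BO|GEM object)'
}

sample_log='kernel: [TTM] Buffer eviction failed
kernel: qxl 0000:00:02.0: object_init failed for (3149824, 0x00000001)
kernel: [drm:qxl_alloc_bo_reserved [qxl]] *ERROR* failed to allocate VRAM BO'

if printf '%s\n' "$sample_log" | has_qxl_alloc_failure; then
    echo "allocation failure found"
else
    echo "log looks clean"
fi
```

On a live system the input would come from `journalctl --boot` or `dmesg` instead of the sample text.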

I can confirm the problem still exists in latest kernel versions:
https://gitlab.freedesktop.org/drm/kernel @ c6d6a82d8a9f
https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git @ 1870cdc0e8de

When I was investigating this issue I ended up creating a script which
triggers the issue in just a couple of minutes when executed under uxterm.
YMMV depending on your system; for example, when using urxvt crashes were
not as consistent, likely due to it being more efficient and making fewer
video memory allocations.
For me this is the fastest way to trigger the bug. Here follows:
```
#!/bin/bash
print_gradient_with_awk() {
    local arg="$1"
    if [[ -n $arg ]]; then
        arg=" ($arg)"
    fi
    awk -v arg="$arg" 'BEGIN{
        s="/\\/\\/\\/\\/\\"; s=s s s s s s s s;
        for (colnum = 0; colnum<77; colnum++) {
            r = 255-(colnum*255/76);
            g = (colnum*510/76);
            b = (colnum*255/76);
            if (g>255) g = 510-g;
            printf "\033[48;2;%d;%d;%dm", r,g,b;
            printf "\033[38;2;%d;%d;%dm", 255-r,255-g,255-b;
            printf "%s\033[0m", substr(s,colnum+1,1);
        }
        printf "%s\n", arg;
    }'
}
for i in {1..10000}; do
    print_gradient_with_awk $i
done
```

Timo initially reported:
commit 5f6c871fe919 ("drm/qxl: properly free qxl releases") as working fine
commit 5a838e5d5825 ("drm/qxl: simplify qxl_fence_wait") introducing the bug

The bug occurs whenever a timeout is reached in wait_event_timeout.
To fix this issue I updated the code to include busy-wait logic, which
is how the last working version operated. That fixes this bug while still
keeping the code simple (which I suspect was the motivation for commit
5a838e5d5825 in the first place), as opposed to just reverting to
the last working version at 5f6c871fe919.
HZ was chosen as the scaling factor for the loop because it is also used
by ttm_bo_wait_ctx, which is one of the indirect callers of
qxl_fence_wait, the other being ttm_bo_delayed_delete.
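
To illustrate the arithmetic only, here is a small userspace sketch of how a timeout gets split into HZ-sized waits. `split_timeout` is a hypothetical helper, not kernel code, and HZ=250 is just an example value. Per the commit message of the patch, the point of the repeated waits is that the OOM handler gets notified multiple times.

```shell
#!/bin/bash
# Hypothetical helper mirroring the arithmetic in the proposed patch:
# given a timeout in jiffies and the tick rate HZ, print how many waits
# are issued and how long each one lasts.
split_timeout() {
    local timeout="$1" hz="$2"
    local iterations=1 fraction="$timeout"
    if [ "$fraction" -gt "$hz" ]; then
        iterations=$((fraction / hz))   # integer division, remainder dropped
        fraction="$hz"
    fi
    echo "$iterations $fraction"
}

# HZ=250 is only an example value; the real one depends on the kernel config.
split_timeout 3750 250    # 15 waits of one second (250 jiffies) each
split_timeout 100 250     # short timeouts are left as a single wait
```

Because of the integer division, a timeout that is not a multiple of HZ is rounded down to whole HZ-sized waits.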

To confirm the problem no longer manifests I have:
- executed my own test case pasted above
- executed Timo's test case pasted below
- played a video stream in mplayer for 3h (no audio stream because
apparently pulseaudio and/or alsa have memory leaks that make the
system run out of memory)

For quick reference here is Timo's script:
```
#!/bin/bash
chvt 3
for j in $(seq 80); do
    echo "$(date) starting round $j"
    if [ "$(journalctl --boot | grep "failed to allocate VRAM BO")" != "" ]; then
        echo "bug was reproduced after $j tries"
        exit 1
    fi
    for i in $(seq 100); do
        dmesg > /dev/tty3
    done
done
echo "bug could not be reproduced"
exit 0
```

From what I could find online, it seems that users affected by this
problem tend to just move from QXL to VirtIO, which is why this bug
has been hiding for over 3 years now.
This issue was initially reported by Timo 4 months ago but the discussion
seems to have stalled.
It would be great if this could be addressed so that it does not fall through
the cracks.

Thank you for your time.


---

Alex Constantino (1):
drm/qxl: fixes qxl_fence_wait

drivers/gpu/drm/qxl/qxl_release.c | 20 ++++++++++++++------
1 file changed, 14 insertions(+), 6 deletions(-)


base-commit: 1870cdc0e8dee32e3c221704a2977898ba4c10e8
--
2.39.2


2024-03-08 01:13:12

by Alex Constantino

Subject: [PATCH 1/1] drm/qxl: fixes qxl_fence_wait

Fix OOM scenario by doing multiple notifications to the OOM handler through
a busy wait logic.
Changes from commit 5a838e5d5825 ("drm/qxl: simplify qxl_fence_wait") would
result in a '[TTM] Buffer eviction failed' exception whenever it reached a
timeout.

Fixes: 5a838e5d5825 ("drm/qxl: simplify qxl_fence_wait")
Link: https://lore.kernel.org/regressions/[email protected]
Reported-by: Timo Lindfors <[email protected]>
Closes: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1054514
Signed-off-by: Alex Constantino <[email protected]>
---
drivers/gpu/drm/qxl/qxl_release.c | 20 ++++++++++++++------
1 file changed, 14 insertions(+), 6 deletions(-)

diff --git a/drivers/gpu/drm/qxl/qxl_release.c b/drivers/gpu/drm/qxl/qxl_release.c
index 368d26da0d6a..51c22e7f9647 100644
--- a/drivers/gpu/drm/qxl/qxl_release.c
+++ b/drivers/gpu/drm/qxl/qxl_release.c
@@ -20,8 +20,6 @@
  * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
  */

-#include <linux/delay.h>
-
 #include <trace/events/dma_fence.h>

 #include "qxl_drv.h"
@@ -59,14 +57,24 @@ static long qxl_fence_wait(struct dma_fence *fence, bool intr,
 {
 	struct qxl_device *qdev;
 	unsigned long cur, end = jiffies + timeout;
+	signed long iterations = 1;
+	signed long timeout_fraction = timeout;

 	qdev = container_of(fence->lock, struct qxl_device, release_lock);

-	if (!wait_event_timeout(qdev->release_event,
+	// using HZ as a factor since it is used in ttm_bo_wait_ctx too
+	if (timeout_fraction > HZ) {
+		iterations = timeout_fraction / HZ;
+		timeout_fraction = HZ;
+	}
+	for (int i = 0; i < iterations; i++) {
+		if (wait_event_timeout(
+				qdev->release_event,
 				(dma_fence_is_signaled(fence) ||
-				 (qxl_io_notify_oom(qdev), 0)),
-				timeout))
-		return 0;
+					(qxl_io_notify_oom(qdev), 0)),
+				timeout_fraction))
+			break;
+	}

 	cur = jiffies;
 	if (time_after(cur, end))
--
2.39.2


Subject: Re: [PATCH 1/1] drm/qxl: fixes qxl_fence_wait

On 08.03.24 02:08, Alex Constantino wrote:
> Fix OOM scenario by doing multiple notifications to the OOM handler through
> a busy wait logic.
> Changes from commit 5a838e5d5825 ("drm/qxl: simplify qxl_fence_wait") would
> result in a '[TTM] Buffer eviction failed' exception whenever it reached a
> timeout.

Thx for working on this.

> Fixes: 5a838e5d5825 ("drm/qxl: simplify qxl_fence_wait")
> Link: https://lore.kernel.org/regressions/[email protected]

Nitpicking: that ideally should be pointing to
https://lore.kernel.org/regressions/[email protected]/ , as
that is the report and not just a reply to prod things.

Ciao, Thorsten

Subject: Re: [PATCH 1/1] drm/qxl: fixes qxl_fence_wait

On 08.03.24 02:08, Alex Constantino wrote:
> Fix OOM scenario by doing multiple notifications to the OOM handler through
> a busy wait logic.
> Changes from commit 5a838e5d5825 ("drm/qxl: simplify qxl_fence_wait") would
> result in a '[TTM] Buffer eviction failed' exception whenever it reached a
> timeout.
>
> Fixes: 5a838e5d5825 ("drm/qxl: simplify qxl_fence_wait")
> Link: https://lore.kernel.org/regressions/[email protected]
> Reported-by: Timo Lindfors <[email protected]>
> Closes: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1054514
> Signed-off-by: Alex Constantino <[email protected]>
> ---
> drivers/gpu/drm/qxl/qxl_release.c | 20 ++++++++++++++------
> 1 file changed, 14 insertions(+), 6 deletions(-)

Hey Dave and Gerd as well as Thomas, Maarten and Maxime (the latter two
I just added to the CC), it seems to me this regression fix did not
make any progress since it was posted. Did I miss something, is it just
"we are busy with the merge window", or is there some other reason?
Just wondering, I just saw someone on a Fedora IRC channel complaining
about the regression, that's why I'm asking. Would be really good to
finally get this resolved...

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
--
Everything you wanna know about Linux kernel regression tracking:
https://linux-regtracking.leemhuis.info/about/#tldr
If I did something stupid, please tell me, as explained on that page.

#regzbot poke

> diff --git a/drivers/gpu/drm/qxl/qxl_release.c b/drivers/gpu/drm/qxl/qxl_release.c
> index 368d26da0d6a..51c22e7f9647 100644
> --- a/drivers/gpu/drm/qxl/qxl_release.c
> +++ b/drivers/gpu/drm/qxl/qxl_release.c
> @@ -20,8 +20,6 @@
> * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
> */
>
> -#include <linux/delay.h>
> -
> #include <trace/events/dma_fence.h>
>
> #include "qxl_drv.h"
> @@ -59,14 +57,24 @@ static long qxl_fence_wait(struct dma_fence *fence, bool intr,
> {
> struct qxl_device *qdev;
> unsigned long cur, end = jiffies + timeout;
> + signed long iterations = 1;
> + signed long timeout_fraction = timeout;
>
> qdev = container_of(fence->lock, struct qxl_device, release_lock);
>
> - if (!wait_event_timeout(qdev->release_event,
> + // using HZ as a factor since it is used in ttm_bo_wait_ctx too
> + if (timeout_fraction > HZ) {
> + iterations = timeout_fraction / HZ;
> + timeout_fraction = HZ;
> + }
> + for (int i = 0; i < iterations; i++) {
> + if (wait_event_timeout(
> + qdev->release_event,
> (dma_fence_is_signaled(fence) ||
> - (qxl_io_notify_oom(qdev), 0)),
> - timeout))
> - return 0;
> + (qxl_io_notify_oom(qdev), 0)),
> + timeout_fraction))
> + break;
> + }
>
> cur = jiffies;
> if (time_after(cur, end))

2024-03-27 14:43:19

by Maxime Ripard

Subject: Re: [PATCH 1/1] drm/qxl: fixes qxl_fence_wait

Hi,

On Wed, Mar 20, 2024 at 04:25:48PM +0100, Linux regression tracking (Thorsten Leemhuis) wrote:
> On 08.03.24 02:08, Alex Constantino wrote:
> > Fix OOM scenario by doing multiple notifications to the OOM handler through
> > a busy wait logic.
> > Changes from commit 5a838e5d5825 ("drm/qxl: simplify qxl_fence_wait") would
> > result in a '[TTM] Buffer eviction failed' exception whenever it reached a
> > timeout.
> >
> > Fixes: 5a838e5d5825 ("drm/qxl: simplify qxl_fence_wait")
> > Link: https://lore.kernel.org/regressions/[email protected]
> > Reported-by: Timo Lindfors <[email protected]>
> > Closes: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1054514
> > Signed-off-by: Alex Constantino <[email protected]>
> > ---
> > drivers/gpu/drm/qxl/qxl_release.c | 20 ++++++++++++++------
> > 1 file changed, 14 insertions(+), 6 deletions(-)
>
> Hey Dave and Gerd as well as Thomas, Maarten and Maxime (the latter two
> I just added to the CC), it seems to me this regression fix did not
> make any progress since it was posted. Did I miss something, is it just
> "we are busy with the merge window", or is there some other reason?
> Just wondering, I just saw someone on a Fedora IRC channel complaining
> about the regression, that's why I'm asking. Would be really good to
> finally get this resolved...

I pinged Gerd last week about it, but he couldn't remember the
details of why that patch was warranted in the first place.

If it works, I'd prefer to revert the original patch and go back to the
code that we know used to work, instead of coming up with some less
proven logic that seems quite different from what it used to be.

Alex, could you try reverting 5a838e5d5825c85556011478abde708251cc0776
and letting us know the result?

Thanks!
Maxime



2024-04-04 18:15:39

by Alex Constantino

Subject: [PATCH v2 0/1] Revert "drm/qxl: simplify qxl_fence_wait"

Changes since v1:
- replace new code logic in v1 with past code version by reverting
commit 5a838e5d5825 ("drm/qxl: simplify qxl_fence_wait")
- add missing code dependency from
commit d72277b6c37d ("dma-buf: nuke DMA_FENCE_TRACE macros v2")

---

Hi,

To clarify, the reason for my original patch, as explained in more detail
in my previous email, was that it fixed the issue while keeping the code
simple (simplification being the original motivation for the commit being
reverted here). But I perfectly understand opting for previously
battle-tested code. Makes sense.

As requested, I've reverted commit 5a838e5d5825 ("drm/qxl: simplify qxl_fence_wait")
and then ran both Timo's and my test cases, plus 1h of video playback.
I was unable to reproduce the bug in any of those cases, so the revert
seems to fix the bug.
Please note that, as stated in the commit message, due to a dependency on
DMA_FENCE_WARN this patch also restores the relevant code deleted
by commit d72277b6c37d ("dma-buf: nuke DMA_FENCE_TRACE macros v2").

A couple of things I've observed from dmesg:
- (1) it always triggers a single warning at boot, this is issued by
`WARN_ON(list_empty(&release->bos));` @ qxl_release_free @ qxl_release.c
Maybe better for this to be addressed separately from this patch?
- (2) there are quite a few `failed to wait on release xx after spincount
301` messages as printed by the patch v2 code when the test case shell
scripts are being executed.
- (3) there can be a single error message `[drm:qxl_release_from_id_locked
[qxl]] *ERROR* failed to find id in release_idr`
- (4) occasional error messages about `[drm:drm_atomic_helper_commit_planes
[drm_kms_helper]] *ERROR* head 9 wrong:`.

Issue (1) relates to this patch v2 and also happened with kernel from
base-commit 1870cdc0e8de (March 1st).
Issue (2) also relates to this patch v2 but only happens with kernel from
base-commit a6bd6c933339 (March 30th).
Both (3) and (4) are unrelated to this patch as they can occur
independently of it and I'm guessing these may be related to the recent
changes discussed in
https://lore.kernel.org/dri-devel/[email protected]/T/#u


For reference here is the output of (1):
```
[ 20.779514] ------------[ cut here ]------------
[ 20.779525] workqueue: WQ_MEM_RECLAIM ttm:ttm_bo_delayed_delete [ttm] is flushing !WQ_MEM_RECLAIM events:qxl_gc_work [qxl]
[ 20.779666] WARNING: CPU: 1 PID: 601 at kernel/workqueue.c:3692 check_flush_dependency+0xfa/0x110
[ 20.779683] Modules linked in: nfsv3 nfs_acl nfs lockd grace intel_rapl_msr intel_rapl_common intel_pmc_core intel_vsec pmt_telemetry pmt_class kvm_intel rfkill kvm snd_hda_codec_generic crct10dif_pclmul crct10dif_common crc32_pclmul ghash_clmulni_intel snd_hda_intel snd_intel_dspcfg sha512_ssse3 sha512_generic snd_hda_codec sha256_ssse3 snd_hwdep sha1_ssse3 snd_hda_core sunrpc binfmt_misc snd_pcm aesni_intel qxl drm_ttm_helper ttm crypto_simd snd_timer cryptd rapl snd virtio_balloon virtio_console drm_kms_helper pcspkr soundcore button evdev joydev serio_raw drm loop fuse efi_pstore dm_mod configfs qemu_fw_cfg virtio_rng autofs4 ext4 crc32c_generic crc16 mbcache jbd2 virtio_net ata_generic net_failover virtio_blk failover uhci_hcd ata_piix ehci_hcd libata scsi_mod usbcore crc32c_intel i2c_piix4 virtio_pci virtio psmouse virtio_pci_legacy_dev virtio_pci_modern_dev virtio_ring floppy scsi_common usb_common
[ 20.779825] CPU: 1 PID: 601 Comm: kworker/u13:1 Not tainted 6.9.0-rc1-next-20240328-amd64-00001-g756220c4615c #81
[ 20.779833] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014
[ 20.779837] Workqueue: ttm ttm_bo_delayed_delete [ttm]
[ 20.779862] RIP: 0010:check_flush_dependency+0xfa/0x110
[ 20.779869] Code: ff ff 49 8b 55 18 48 8d 8b c0 00 00 00 49 89 e8 48 81 c6 c0 00 00 00 48 c7 c7 c0 16 44 8d c6 05 e7 75 b3 01 01 e8 86 97 fd ff <0f> 0b e9 21 ff ff ff 80 3d d5 75 b3 01 00 75 96 e9 4d ff ff ff 90
[ 20.779875] RSP: 0000:ffffb59600dd7cc8 EFLAGS: 00010082
[ 20.779880] RAX: 0000000000000000 RBX: ffff9af88104ee00 RCX: 0000000000000027
[ 20.779902] RDX: ffff9af8fdd21708 RSI: 0000000000000001 RDI: ffff9af8fdd21700
[ 20.779906] RBP: ffffffffc0882570 R08: 0000000000000000 R09: 0000000000000003
[ 20.779910] R10: ffffb59600dd7b58 R11: ffffffff8dcc83e8 R12: ffff9af894498000
[ 20.779914] R13: ffff9af89558d780 R14: ffffb59600dd7cf8 R15: 0000000000000001
[ 20.779918] FS: 0000000000000000(0000) GS:ffff9af8fdd00000(0000) knlGS:0000000000000000
[ 20.779924] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 20.779928] CR2: 00005574b0bd4148 CR3: 000000001fb40002 CR4: 0000000000370ef0
[ 20.779994] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 20.779999] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 20.780003] Call Trace:
[ 20.780135] <TASK>
[ 20.780144] ? __warn+0x7c/0x120
[ 20.780153] ? check_flush_dependency+0xfa/0x110
[ 20.780161] ? report_bug+0x160/0x190
[ 20.780169] ? prb_read_valid+0x17/0x20
[ 20.780179] ? handle_bug+0x41/0x70
[ 20.780186] ? exc_invalid_op+0x13/0x60
[ 20.780193] ? asm_exc_invalid_op+0x16/0x20
[ 20.780201] ? __pfx_qxl_gc_work+0x10/0x10 [qxl]
[ 20.780221] ? check_flush_dependency+0xfa/0x110
[ 20.780228] ? check_flush_dependency+0xfa/0x110
[ 20.780234] __flush_work+0xce/0x2c0
[ 20.780244] qxl_queue_garbage_collect+0x7f/0x90 [qxl]
[ 20.780268] qxl_fence_wait+0xa0/0x190 [qxl]
[ 20.780287] dma_fence_wait_timeout+0x5e/0x130
[ 20.780313] dma_resv_wait_timeout+0x7b/0xe0
[ 20.780327] ttm_bo_delayed_delete+0x26/0x80 [ttm]
[ 20.780359] process_one_work+0x184/0x3a0
[ 20.780370] worker_thread+0x273/0x390
[ 20.780379] ? __pfx_worker_thread+0x10/0x10
[ 20.780388] kthread+0xcb/0x100
[ 20.780396] ? __pfx_kthread+0x10/0x10
[ 20.780404] ret_from_fork+0x2d/0x50
[ 20.780416] ? __pfx_kthread+0x10/0x10
[ 20.780421] ret_from_fork_asm+0x1a/0x30
[ 20.780435] </TASK>
[ 20.780437] ---[ end trace 0000000000000000 ]---
```
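
The first line of the trace states the rule being violated: a work item running on a workqueue marked WQ_MEM_RECLAIM (here `ttm`) is flushing work queued on one that is not (here `events`). A minimal userspace sketch of that rule is below. It is an illustration only: `sketch_wq`, `flush_would_warn` and the `SKETCH_WQ_MEM_RECLAIM` bit value are hypothetical names chosen for this example, and any further exemptions the kernel applies are omitted.

```c
#include <stdbool.h>
#include <stddef.h>

/* Illustrative flag bit; the value is an assumption of this sketch. */
#define SKETCH_WQ_MEM_RECLAIM (1u << 3)

/* Hypothetical, stripped-down workqueue descriptor. */
struct sketch_wq {
    const char *name;
    unsigned int flags;
};

/*
 * Returns true when a flush of work on `target` from a worker that
 * belongs to `flusher` matches the pattern in the warning:
 * the flusher is WQ_MEM_RECLAIM, the target is not.
 * `flusher` is NULL when the caller is not a workqueue worker.
 */
static bool flush_would_warn(const struct sketch_wq *flusher,
                             const struct sketch_wq *target)
{
    if (flusher == NULL)
        return false; /* ordinary task context: nothing to check */
    if (target->flags & SKETCH_WQ_MEM_RECLAIM)
        return false; /* target can make progress under memory pressure */
    return (flusher->flags & SKETCH_WQ_MEM_RECLAIM) != 0;
}
```

Applied to the trace, the flusher would be the `ttm` queue running `ttm_bo_delayed_delete` and the target the `events` queue holding `qxl_gc_work`, which is reached through `qxl_fence_wait` and `qxl_queue_garbage_collect`.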


TLDR: this patch fixes the instability issues. But there may be warnings
in dmesg. Errors in dmesg were observed too but they are unrelated to this
patch.


Thank you for your time.

---

Alex Constantino (1):
Revert "drm/qxl: simplify qxl_fence_wait"

drivers/gpu/drm/qxl/qxl_release.c | 50 +++++++++++++++++++++++++++----
include/linux/dma-fence.h | 7 +++++
2 files changed, 52 insertions(+), 5 deletions(-)


base-commit: a6bd6c9333397f5a0e2667d4d82fef8c970108f2
--
2.39.2