2021-09-02 10:01:10

by Sudip Mukherjee

[permalink] [raw]
Subject: Regression with mainline kernel on rpi4

Hi All,

Our last night's test on rpi4 had a nasty trace. The test was with
7c636d4d20f8 ("Merge tag 'dt-5.15' of
git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc"). Previous test
was on 9e9fb7655ed5 ("Merge tag 'net-next-5.15' of
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next") and it
did not have this trace.

[ 40.975161] Unable to handle kernel access to user memory outside
uaccess routines at virtual address 0000000000000348
[ 40.986187] Mem abort info:
[ 40.989062] ESR = 0x96000004
[ 40.992233] EC = 0x25: DABT (current EL), IL = 32 bits
[ 40.997699] SET = 0, FnV = 0
[ 41.001205] EA = 0, S1PTW = 0
[ 41.004428] FSC = 0x04: level 0 translation fault
[ 41.009468] Data abort info:
[ 41.012410] ISV = 0, ISS = 0x00000004
[ 41.016325] CM = 0, WnR = 0
[ 41.019358] user pgtable: 4k pages, 48-bit VAs, pgdp=0000000042ae1000
[ 41.025926] [0000000000000348] pgd=0000000000000000, p4d=0000000000000000
[ 41.032845] Internal error: Oops: 96000004 [#1] PREEMPT SMP
[ 41.038495] Modules linked in: overlay algif_hash algif_skcipher
af_alg bnep sch_fq_codel ppdev lp parport ip_tables x_tables autofs4
btrfs blake2b_generic zstd_compress raid10 raid456 async_raid6_recov
async_memcpy async_pq async_xor async_tx xor xor_neon raid6_pq
libcrc32c raid1 raid0 multipath linear uas usb_storage
snd_soc_hdmi_codec btsdio brcmfmac brcmutil hci_uart btqca btrtl
bcm2835_v4l2(C) btbcm crct10dif_ce bcm2835_mmal_vchiq(C) btintel
raspberrypi_hwmon videobuf2_vmalloc videobuf2_memops bluetooth
videobuf2_v4l2 videobuf2_common cfg80211 ecdh_generic ecc vc4
drm_kms_helper videodev dwc2 cec snd_bcm2835(C) i2c_brcmstb udc_core
roles drm xhci_pci mc pwm_bcm2835 xhci_pci_renesas snd_soc_core
ac97_bus snd_pcm_dmaengine snd_pcm phy_generic snd_timer
uio_pdrv_genirq snd fb_sys_fops syscopyarea sysfillrect sysimgblt uio
aes_neon_bs aes_neon_blk crypto_simd cryptd
[ 41.116584] CPU: 0 PID: 1569 Comm: pulseaudio Tainted: G C
5.14.0-7c636d4d20f8 #1
[ 41.125494] Hardware name: Raspberry Pi 4 Model B (DT)
[ 41.130699] pstate: 60400005 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[ 41.137756] pc : vc4_hdmi_audio_prepare+0x3c0/0xba4 [vc4]
[ 41.143256] lr : vc4_hdmi_audio_prepare+0x308/0xba4 [vc4]
[ 41.148747] sp : ffff800012f73a50
[ 41.152099] x29: ffff800012f73a50 x28: ffff0000562ecc00 x27: 0000000000000000
[ 41.159338] x26: 0000000000000000 x25: 000000000000ac44 x24: 0000000021002003
[ 41.166574] x23: ffff800012f73b40 x22: 0000000000000003 x21: ffff000059400080
[ 41.173811] x20: ffff0000594004c8 x19: 0005833333380600 x18: 0000000000000000
[ 41.181047] x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000000
[ 41.188283] x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000991
[ 41.195520] x11: 0000000000000001 x10: 000000000001d4c0 x9 : ffff800009047838
[ 41.202757] x8 : 0000000000000031 x7 : 000000000001d4c0 x6 : 0000000000000030
[ 41.209993] x5 : ffff800012f73a98 x4 : ffff80000905bb60 x3 : 0000000010624dd3
[ 41.217230] x2 : 00000000000003e8 x1 : 0000000000000000 x0 : 0000000000562200
[ 41.224466] Call trace:
[ 41.226939] vc4_hdmi_audio_prepare+0x3c0/0xba4 [vc4]
[ 41.232080] hdmi_codec_prepare+0xe8/0x1b0 [snd_soc_hdmi_codec]
[ 41.238083] snd_soc_pcm_dai_prepare+0x5c/0x10c [snd_soc_core]
[ 41.244038] soc_pcm_prepare+0x5c/0x130 [snd_soc_core]
[ 41.249276] snd_pcm_prepare+0x150/0x1f0 [snd_pcm]
[ 41.254149] snd_pcm_common_ioctl+0x1644/0x1d14 [snd_pcm]
[ 41.259635] snd_pcm_ioctl+0x3c/0x5c [snd_pcm]
[ 41.264152] __arm64_sys_ioctl+0xb4/0x100
[ 41.268216] invoke_syscall+0x50/0x120
[ 41.272014] el0_svc_common+0x18c/0x1a4
[ 41.275899] do_el0_svc+0x34/0x9c
[ 41.279254] el0_svc+0x2c/0xc0
[ 41.282348] el0t_64_sync_handler+0xa4/0x12c
[ 41.286673] el0t_64_sync+0x1a4/0x1a8
[ 41.290385] Code: 52807d02 72a20c43 f9400421 9ba37c13 (f941a423)
[ 41.296563] ---[ end trace dcfe08f10aaf6873 ]---

You can see the complete dmesg at
https://openqa.qa.codethink.co.uk/tests/76#step/dmesg/8

--
Regards
Sudip


2021-09-03 16:04:45

by Maxime Ripard

[permalink] [raw]
Subject: Re: Regression with mainline kernel on rpi4

Hi Sudip,

On Thu, Sep 02, 2021 at 10:08:19AM +0100, Sudip Mukherjee wrote:
> Hi All,
>
> Our last night's test on rpi4 had a nasty trace. The test was with
> 7c636d4d20f8 ("Merge tag 'dt-5.15' of
> git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc"). Previous test
> was on 9e9fb7655ed5 ("Merge tag 'net-next-5.15' of
> git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next") and it
> did not have this trace.
>
> [ 40.975161] Unable to handle kernel access to user memory outside
> uaccess routines at virtual address 0000000000000348
> [ 40.986187] Mem abort info:
> [ 40.989062] ESR = 0x96000004
> [ 40.992233] EC = 0x25: DABT (current EL), IL = 32 bits
> [ 40.997699] SET = 0, FnV = 0
> [ 41.001205] EA = 0, S1PTW = 0
> [ 41.004428] FSC = 0x04: level 0 translation fault
> [ 41.009468] Data abort info:
> [ 41.012410] ISV = 0, ISS = 0x00000004
> [ 41.016325] CM = 0, WnR = 0
> [ 41.019358] user pgtable: 4k pages, 48-bit VAs, pgdp=0000000042ae1000
> [ 41.025926] [0000000000000348] pgd=0000000000000000, p4d=0000000000000000
> [ 41.032845] Internal error: Oops: 96000004 [#1] PREEMPT SMP
> [ 41.038495] Modules linked in: overlay algif_hash algif_skcipher
> af_alg bnep sch_fq_codel ppdev lp parport ip_tables x_tables autofs4
> btrfs blake2b_generic zstd_compress raid10 raid456 async_raid6_recov
> async_memcpy async_pq async_xor async_tx xor xor_neon raid6_pq
> libcrc32c raid1 raid0 multipath linear uas usb_storage
> snd_soc_hdmi_codec btsdio brcmfmac brcmutil hci_uart btqca btrtl
> bcm2835_v4l2(C) btbcm crct10dif_ce bcm2835_mmal_vchiq(C) btintel
> raspberrypi_hwmon videobuf2_vmalloc videobuf2_memops bluetooth
> videobuf2_v4l2 videobuf2_common cfg80211 ecdh_generic ecc vc4
> drm_kms_helper videodev dwc2 cec snd_bcm2835(C) i2c_brcmstb udc_core
> roles drm xhci_pci mc pwm_bcm2835 xhci_pci_renesas snd_soc_core
> ac97_bus snd_pcm_dmaengine snd_pcm phy_generic snd_timer
> uio_pdrv_genirq snd fb_sys_fops syscopyarea sysfillrect sysimgblt uio
> aes_neon_bs aes_neon_blk crypto_simd cryptd
> [ 41.116584] CPU: 0 PID: 1569 Comm: pulseaudio Tainted: G C
> 5.14.0-7c636d4d20f8 #1
> [ 41.125494] Hardware name: Raspberry Pi 4 Model B (DT)
> [ 41.130699] pstate: 60400005 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> [ 41.137756] pc : vc4_hdmi_audio_prepare+0x3c0/0xba4 [vc4]
> [ 41.143256] lr : vc4_hdmi_audio_prepare+0x308/0xba4 [vc4]
> [ 41.148747] sp : ffff800012f73a50
> [ 41.152099] x29: ffff800012f73a50 x28: ffff0000562ecc00 x27: 0000000000000000
> [ 41.159338] x26: 0000000000000000 x25: 000000000000ac44 x24: 0000000021002003
> [ 41.166574] x23: ffff800012f73b40 x22: 0000000000000003 x21: ffff000059400080
> [ 41.173811] x20: ffff0000594004c8 x19: 0005833333380600 x18: 0000000000000000
> [ 41.181047] x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000000
> [ 41.188283] x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000991
> [ 41.195520] x11: 0000000000000001 x10: 000000000001d4c0 x9 : ffff800009047838
> [ 41.202757] x8 : 0000000000000031 x7 : 000000000001d4c0 x6 : 0000000000000030
> [ 41.209993] x5 : ffff800012f73a98 x4 : ffff80000905bb60 x3 : 0000000010624dd3
> [ 41.217230] x2 : 00000000000003e8 x1 : 0000000000000000 x0 : 0000000000562200
> [ 41.224466] Call trace:
> [ 41.226939] vc4_hdmi_audio_prepare+0x3c0/0xba4 [vc4]
> [ 41.232080] hdmi_codec_prepare+0xe8/0x1b0 [snd_soc_hdmi_codec]
> [ 41.238083] snd_soc_pcm_dai_prepare+0x5c/0x10c [snd_soc_core]
> [ 41.244038] soc_pcm_prepare+0x5c/0x130 [snd_soc_core]
> [ 41.249276] snd_pcm_prepare+0x150/0x1f0 [snd_pcm]
> [ 41.254149] snd_pcm_common_ioctl+0x1644/0x1d14 [snd_pcm]
> [ 41.259635] snd_pcm_ioctl+0x3c/0x5c [snd_pcm]
> [ 41.264152] __arm64_sys_ioctl+0xb4/0x100
> [ 41.268216] invoke_syscall+0x50/0x120
> [ 41.272014] el0_svc_common+0x18c/0x1a4
> [ 41.275899] do_el0_svc+0x34/0x9c
> [ 41.279254] el0_svc+0x2c/0xc0
> [ 41.282348] el0t_64_sync_handler+0xa4/0x12c
> [ 41.286673] el0t_64_sync+0x1a4/0x1a8
> [ 41.290385] Code: 52807d02 72a20c43 f9400421 9ba37c13 (f941a423)
> [ 41.296563] ---[ end trace dcfe08f10aaf6873 ]---
>
> You can see the complete dmesg at
> https://openqa.qa.codethink.co.uk/tests/76#step/dmesg/8

What test were you running?

Maxime


Attachments:
(No filename) (4.29 kB)
signature.asc (235.00 B)
Download all attachments

2021-09-03 20:11:18

by Sudip Mukherjee

[permalink] [raw]
Subject: Re: Regression with mainline kernel on rpi4

Hi Maxime,

On Fri, Sep 3, 2021 at 5:03 PM Maxime Ripard <[email protected]> wrote:
>
> Hi Sudip,
>
> On Thu, Sep 02, 2021 at 10:08:19AM +0100, Sudip Mukherjee wrote:
> > Hi All,
> >
>

<snip>

>
> >
> > You can see the complete dmesg at
> > https://openqa.qa.codethink.co.uk/tests/76#step/dmesg/8
>
> What test were you running?

Nothing particular, its just a boot test that we do every night after
pulling from torvalds/linux.git


--
Regards
Sudip

2021-09-04 09:13:38

by Maxime Ripard

[permalink] [raw]
Subject: Re: Regression with mainline kernel on rpi4

On Fri, Sep 03, 2021 at 09:09:50PM +0100, Sudip Mukherjee wrote:
> Hi Maxime,
>
> On Fri, Sep 3, 2021 at 5:03 PM Maxime Ripard <[email protected]> wrote:
> >
> > Hi Sudip,
> >
> > On Thu, Sep 02, 2021 at 10:08:19AM +0100, Sudip Mukherjee wrote:
> > > Hi All,
> > >
> >
>
> <snip>
>
> >
> > >
> > > You can see the complete dmesg at
> > > https://openqa.qa.codethink.co.uk/tests/76#step/dmesg/8
> >
> > What test were you running?
>
> Nothing particular, its just a boot test that we do every night after
> pulling from torvalds/linux.git

What are you booting to then?

Maxime


Attachments:
(No filename) (608.00 B)
signature.asc (235.00 B)
Download all attachments

2021-09-04 10:00:32

by Sudip Mukherjee

[permalink] [raw]
Subject: Re: Regression with mainline kernel on rpi4

Hi Maxime,

On Sat, Sep 4, 2021 at 10:10 AM Maxime Ripard <[email protected]> wrote:
>
> On Fri, Sep 03, 2021 at 09:09:50PM +0100, Sudip Mukherjee wrote:
> > Hi Maxime,
> >
> > On Fri, Sep 3, 2021 at 5:03 PM Maxime Ripard <[email protected]> wrote:
> > >
> > > Hi Sudip,
> > >
> > > On Thu, Sep 02, 2021 at 10:08:19AM +0100, Sudip Mukherjee wrote:
> > > > Hi All,
> > > >
> > >
> >
> > <snip>
> >
> > >
> > > >
> > > > You can see the complete dmesg at
> > > > https://openqa.qa.codethink.co.uk/tests/76#step/dmesg/8
> > >
> > > What test were you running?
> >
> > Nothing particular, its just a boot test that we do every night after
> > pulling from torvalds/linux.git
>
> What are you booting to then?

I am not sure I understood the question.
Its an Ubuntu image. The desktop environment is gnome. And as
mentioned earlier, we use the HEAD of linus tree every night to boot
the rpi4 and test that we can login via desktop environment and that
there is no regression in dmesg.


--
Regards
Sudip

2021-09-21 01:28:35

by Maxime Ripard

[permalink] [raw]
Subject: Re: Regression with mainline kernel on rpi4

On Sat, Sep 04, 2021 at 10:40:29AM +0100, Sudip Mukherjee wrote:
> Hi Maxime,
>
> On Sat, Sep 4, 2021 at 10:10 AM Maxime Ripard <[email protected]> wrote:
> >
> > On Fri, Sep 03, 2021 at 09:09:50PM +0100, Sudip Mukherjee wrote:
> > > Hi Maxime,
> > >
> > > On Fri, Sep 3, 2021 at 5:03 PM Maxime Ripard <[email protected]> wrote:
> > > >
> > > > Hi Sudip,
> > > >
> > > > On Thu, Sep 02, 2021 at 10:08:19AM +0100, Sudip Mukherjee wrote:
> > > > > Hi All,
> > > > >
> > > >
> > >
> > > <snip>
> > >
> > > >
> > > > >
> > > > > You can see the complete dmesg at
> > > > > https://openqa.qa.codethink.co.uk/tests/76#step/dmesg/8
> > > >
> > > > What test were you running?
> > >
> > > Nothing particular, its just a boot test that we do every night after
> > > pulling from torvalds/linux.git
> >
> > What are you booting to then?
>
> I am not sure I understood the question.
> Its an Ubuntu image. The desktop environment is gnome. And as
> mentioned earlier, we use the HEAD of linus tree every night to boot
> the rpi4 and test that we can login via desktop environment and that
> there is no regression in dmesg.

Looking at the CI, this isn't from a RPi but from qemu?

What defconfig are you using? How did you generate the Ubuntu image?
Through debootstrap? Any additional package?

Maxime


Attachments:
(No filename) (1.31 kB)
signature.asc (235.00 B)
Download all attachments

2021-09-21 01:37:05

by Maxime Ripard

[permalink] [raw]
Subject: Re: Regression with mainline kernel on rpi4

On Mon, Sep 20, 2021 at 04:47:31PM +0200, Maxime Ripard wrote:
> On Sat, Sep 04, 2021 at 10:40:29AM +0100, Sudip Mukherjee wrote:
> > Hi Maxime,
> >
> > On Sat, Sep 4, 2021 at 10:10 AM Maxime Ripard <[email protected]> wrote:
> > >
> > > On Fri, Sep 03, 2021 at 09:09:50PM +0100, Sudip Mukherjee wrote:
> > > > Hi Maxime,
> > > >
> > > > On Fri, Sep 3, 2021 at 5:03 PM Maxime Ripard <[email protected]> wrote:
> > > > >
> > > > > Hi Sudip,
> > > > >
> > > > > On Thu, Sep 02, 2021 at 10:08:19AM +0100, Sudip Mukherjee wrote:
> > > > > > Hi All,
> > > > > >
> > > > >
> > > >
> > > > <snip>
> > > >
> > > > >
> > > > > >
> > > > > > You can see the complete dmesg at
> > > > > > https://openqa.qa.codethink.co.uk/tests/76#step/dmesg/8
> > > > >
> > > > > What test were you running?
> > > >
> > > > Nothing particular, its just a boot test that we do every night after
> > > > pulling from torvalds/linux.git
> > >
> > > What are you booting to then?
> >
> > I am not sure I understood the question.
> > Its an Ubuntu image. The desktop environment is gnome. And as
> > mentioned earlier, we use the HEAD of linus tree every night to boot
> > the rpi4 and test that we can login via desktop environment and that
> > there is no regression in dmesg.
>
> Looking at the CI, this isn't from a RPi but from qemu?
>
> What defconfig are you using? How did you generate the Ubuntu image?
> Through debootstrap? Any additional package?

So qemu (at least on Fedora 34) doesn't seem to have an RPI4 target, nor
upstream:
https://github.com/qemu/qemu/blob/079b1252e9de384385c9da910262312ec2e574c8/hw/arm/raspi.c#L367

I've tested an Ubuntu 21.04 arm64 build (since it seems like it's what
you've been using), built using debootstrap + ubuntu-desktop, both with
and without a monitor attached, and up to the desktop once logged in.

I don't see any crash.

Maxime


Attachments:
(No filename) (1.87 kB)
signature.asc (235.00 B)
Download all attachments

2021-09-21 01:37:13

by Maxime Ripard

[permalink] [raw]
Subject: Re: Regression with mainline kernel on rpi4

On Mon, Sep 20, 2021 at 05:43:33PM +0200, Maxime Ripard wrote:
> On Mon, Sep 20, 2021 at 04:47:31PM +0200, Maxime Ripard wrote:
> > On Sat, Sep 04, 2021 at 10:40:29AM +0100, Sudip Mukherjee wrote:
> > > Hi Maxime,
> > >
> > > On Sat, Sep 4, 2021 at 10:10 AM Maxime Ripard <[email protected]> wrote:
> > > >
> > > > On Fri, Sep 03, 2021 at 09:09:50PM +0100, Sudip Mukherjee wrote:
> > > > > Hi Maxime,
> > > > >
> > > > > On Fri, Sep 3, 2021 at 5:03 PM Maxime Ripard <[email protected]> wrote:
> > > > > >
> > > > > > Hi Sudip,
> > > > > >
> > > > > > On Thu, Sep 02, 2021 at 10:08:19AM +0100, Sudip Mukherjee wrote:
> > > > > > > Hi All,
> > > > > > >
> > > > > >
> > > > >
> > > > > <snip>
> > > > >
> > > > > >
> > > > > > >
> > > > > > > You can see the complete dmesg at
> > > > > > > https://openqa.qa.codethink.co.uk/tests/76#step/dmesg/8
> > > > > >
> > > > > > What test were you running?
> > > > >
> > > > > Nothing particular, its just a boot test that we do every night after
> > > > > pulling from torvalds/linux.git
> > > >
> > > > What are you booting to then?
> > >
> > > I am not sure I understood the question.
> > > Its an Ubuntu image. The desktop environment is gnome. And as
> > > mentioned earlier, we use the HEAD of linus tree every night to boot
> > > the rpi4 and test that we can login via desktop environment and that
> > > there is no regression in dmesg.
> >
> > Looking at the CI, this isn't from a RPi but from qemu?
> >
> > What defconfig are you using? How did you generate the Ubuntu image?
> > Through debootstrap? Any additional package?
>
> So qemu (at least on Fedora 34) doesn't seem to have an RPI4 target, nor
> upstream:
> https://github.com/qemu/qemu/blob/079b1252e9de384385c9da910262312ec2e574c8/hw/arm/raspi.c#L367
>
> I've tested an Ubuntu 21.04 arm64 build (since it seems like it's what
> you've been using), built using debootstrap + ubuntu-desktop, both with
> and without a monitor attached, and up to the desktop once logged in.
>
> I don't see any crash.

That was with arm64's defconfig.

Maxime


Attachments:
(No filename) (2.07 kB)
signature.asc (235.00 B)
Download all attachments

2021-09-21 01:40:33

by Sudip Mukherjee

[permalink] [raw]
Subject: Re: Regression with mainline kernel on rpi4

On Mon, Sep 20, 2021 at 4:53 PM Maxime Ripard <[email protected]> wrote:
>
> On Mon, Sep 20, 2021 at 05:43:33PM +0200, Maxime Ripard wrote:
> > On Mon, Sep 20, 2021 at 04:47:31PM +0200, Maxime Ripard wrote:
> > > On Sat, Sep 04, 2021 at 10:40:29AM +0100, Sudip Mukherjee wrote:
> > > > Hi Maxime,
> > > >
> > > > On Sat, Sep 4, 2021 at 10:10 AM Maxime Ripard <[email protected]> wrote:
> > > > >
> > > > > On Fri, Sep 03, 2021 at 09:09:50PM +0100, Sudip Mukherjee wrote:
> > > > > > Hi Maxime,
> > > > > >
> > > > > > On Fri, Sep 3, 2021 at 5:03 PM Maxime Ripard <[email protected]> wrote:
> > > > > > >
> > > > > > > Hi Sudip,
> > > > > > >
> > > > > > > On Thu, Sep 02, 2021 at 10:08:19AM +0100, Sudip Mukherjee wrote:
> > > > > > > > Hi All,
> > > > > > > >
> > > > > > >
> > > > > >
> > > > > > <snip>
> > > > > >
> > > > > > >
> > > > > > > >
> > > > > > > > You can see the complete dmesg at
> > > > > > > > https://openqa.qa.codethink.co.uk/tests/76#step/dmesg/8
> > > > > > >
> > > > > > > What test were you running?
> > > > > >
> > > > > > Nothing particular, its just a boot test that we do every night after
> > > > > > pulling from torvalds/linux.git
> > > > >
> > > > > What are you booting to then?
> > > >
> > > > I am not sure I understood the question.
> > > > Its an Ubuntu image. The desktop environment is gnome. And as
> > > > mentioned earlier, we use the HEAD of linus tree every night to boot
> > > > the rpi4 and test that we can login via desktop environment and that
> > > > there is no regression in dmesg.
> > >
> > > Looking at the CI, this isn't from a RPi but from qemu?

No, this is from rpi4 board (4GB), not qemu. The CI is a little
complicated here, lava boots the board with the new kernel and will
then trigger the openQA job. openQA will then connect to the board
using vnc to test the desktop. This is the last link that I sent to
Linus when he asked for it.
https://lava.qa.codethink.co.uk/scheduler/job/164#L1356

And this is the lava job for today -
https://lava.qa.codethink.co.uk/scheduler/job/173

> > >
> > > What defconfig are you using? How did you generate the Ubuntu image?
> > > Through debootstrap? Any additional package?

Its the default ubuntu config. I can send you the config if you want.

> >
> > So qemu (at least on Fedora 34) doesn't seem to have an RPI4 target, nor
> > upstream:
> > https://github.com/qemu/qemu/blob/079b1252e9de384385c9da910262312ec2e574c8/hw/arm/raspi.c#L367
> >

Like I said, its not on qemu.

> > I've tested an Ubuntu 21.04 arm64 build (since it seems like it's what
> > you've been using), built using debootstrap + ubuntu-desktop, both with
> > and without a monitor attached, and up to the desktop once logged in.

I am using the official ubuntu image downloaded from
https://cdimage.ubuntu.com/releases/21.04/release/ubuntu-21.04-preinstalled-desktop-arm64+raspi.img.xz


--
Regards
Sudip

2021-09-21 01:46:36

by Maxime Ripard

[permalink] [raw]
Subject: Re: Regression with mainline kernel on rpi4

On Mon, Sep 20, 2021 at 05:35:00PM +0100, Sudip Mukherjee wrote:
> On Mon, Sep 20, 2021 at 4:53 PM Maxime Ripard <[email protected]> wrote:
> >
> > On Mon, Sep 20, 2021 at 05:43:33PM +0200, Maxime Ripard wrote:
> > > On Mon, Sep 20, 2021 at 04:47:31PM +0200, Maxime Ripard wrote:
> > > > On Sat, Sep 04, 2021 at 10:40:29AM +0100, Sudip Mukherjee wrote:
> > > > > Hi Maxime,
> > > > >
> > > > > On Sat, Sep 4, 2021 at 10:10 AM Maxime Ripard <[email protected]> wrote:
> > > > > >
> > > > > > On Fri, Sep 03, 2021 at 09:09:50PM +0100, Sudip Mukherjee wrote:
> > > > > > > Hi Maxime,
> > > > > > >
> > > > > > > On Fri, Sep 3, 2021 at 5:03 PM Maxime Ripard <[email protected]> wrote:
> > > > > > > >
> > > > > > > > Hi Sudip,
> > > > > > > >
> > > > > > > > On Thu, Sep 02, 2021 at 10:08:19AM +0100, Sudip Mukherjee wrote:
> > > > > > > > > Hi All,
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > > > <snip>
> > > > > > >
> > > > > > > >
> > > > > > > > >
> > > > > > > > > You can see the complete dmesg at
> > > > > > > > > https://openqa.qa.codethink.co.uk/tests/76#step/dmesg/8
> > > > > > > >
> > > > > > > > What test were you running?
> > > > > > >
> > > > > > > Nothing particular, its just a boot test that we do every night after
> > > > > > > pulling from torvalds/linux.git
> > > > > >
> > > > > > What are you booting to then?
> > > > >
> > > > > I am not sure I understood the question.
> > > > > Its an Ubuntu image. The desktop environment is gnome. And as
> > > > > mentioned earlier, we use the HEAD of linus tree every night to boot
> > > > > the rpi4 and test that we can login via desktop environment and that
> > > > > there is no regression in dmesg.
> > > >
> > > > Looking at the CI, this isn't from a RPi but from qemu?
>
> No, this is from rpi4 board (4GB), not qemu. The CI is a little
> complicated here, lava boots the board with the new kernel and will
> then trigger the openQA job. openQA will then connect to the board
> using vnc to test the desktop. This is the last link that I sent to
> Linus when he asked for it.
> https://lava.qa.codethink.co.uk/scheduler/job/164#L1356
>
> And this is the lava job for today -
> https://lava.qa.codethink.co.uk/scheduler/job/173

Is it connected to a monitor then?

> > > >
> > > > What defconfig are you using? How did you generate the Ubuntu image?
> > > > Through debootstrap? Any additional package?
>
> Its the default ubuntu config. I can send you the config if you want.

Yes, please.

Thanks
Maxime


Attachments:
(No filename) (2.50 kB)
signature.asc (235.00 B)
Download all attachments

2021-09-21 03:35:05

by Sudip Mukherjee

[permalink] [raw]
Subject: Re: Regression with mainline kernel on rpi4

On Mon, Sep 20, 2021 at 6:10 PM Maxime Ripard <[email protected]> wrote:
>
> On Mon, Sep 20, 2021 at 05:35:00PM +0100, Sudip Mukherjee wrote:
> > On Mon, Sep 20, 2021 at 4:53 PM Maxime Ripard <[email protected]> wrote:
> > >
> > > On Mon, Sep 20, 2021 at 05:43:33PM +0200, Maxime Ripard wrote:
> > > > On Mon, Sep 20, 2021 at 04:47:31PM +0200, Maxime Ripard wrote:
> > > > > On Sat, Sep 04, 2021 at 10:40:29AM +0100, Sudip Mukherjee wrote:
> > > > > > Hi Maxime,
> > > > > >
> > > > > > On Sat, Sep 4, 2021 at 10:10 AM Maxime Ripard <[email protected]> wrote:
> > > > > > >
> > > > > > > On Fri, Sep 03, 2021 at 09:09:50PM +0100, Sudip Mukherjee wrote:
> > > > > > > > Hi Maxime,
> > > > > > > >
> > > > > > > > On Fri, Sep 3, 2021 at 5:03 PM Maxime Ripard <[email protected]> wrote:
> > > > > > > > >
> > > > > > > > > Hi Sudip,
> > > > > > > > >
> > > > > > > > > On Thu, Sep 02, 2021 at 10:08:19AM +0100, Sudip Mukherjee wrote:
> > > > > > > > > > Hi All,
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > > > <snip>
> > > > > > > >
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > You can see the complete dmesg at
> > > > > > > > > > https://openqa.qa.codethink.co.uk/tests/76#step/dmesg/8
> > > > > > > > >
> > > > > > > > > What test were you running?
> > > > > > > >
> > > > > > > > Nothing particular, its just a boot test that we do every night after
> > > > > > > > pulling from torvalds/linux.git
> > > > > > >
> > > > > > > What are you booting to then?
> > > > > >
> > > > > > I am not sure I understood the question.
> > > > > > Its an Ubuntu image. The desktop environment is gnome. And as
> > > > > > mentioned earlier, we use the HEAD of linus tree every night to boot
> > > > > > the rpi4 and test that we can login via desktop environment and that
> > > > > > there is no regression in dmesg.
> > > > >
> > > > > Looking at the CI, this isn't from a RPi but from qemu?
> >
> > No, this is from rpi4 board (4GB), not qemu. The CI is a little
> > complicated here, lava boots the board with the new kernel and will
> > then trigger the openQA job. openQA will then connect to the board
> > using vnc to test the desktop. This is the last link that I sent to
> > Linus when he asked for it.
> > https://lava.qa.codethink.co.uk/scheduler/job/164#L1356
> >
> > And this is the lava job for today -
> > https://lava.qa.codethink.co.uk/scheduler/job/173
>
> Is it connected to a monitor then?

Missed replying to this one earlier. I have a hdmi lilliput monitor
connected to it.

>
> > > > >
> > > > > What defconfig are you using? How did you generate the Ubuntu image?
> > > > > Through debootstrap? Any additional package?
> >
> > Its the default ubuntu config. I can send you the config if you want.
>
> Yes, please.

Attached.
My build script will copy this config as .config, do olddefconfig
and then build.


--
Regards
Sudip


Attachments:
config-5.11.0-1009-raspi (222.30 kB)

2021-09-22 09:59:01

by Maxime Ripard

[permalink] [raw]
Subject: Re: Regression with mainline kernel on rpi4

On Mon, Sep 20, 2021 at 06:21:42PM +0100, Sudip Mukherjee wrote:
> On Mon, Sep 20, 2021 at 6:10 PM Maxime Ripard <[email protected]> wrote:
> >
> > On Mon, Sep 20, 2021 at 05:35:00PM +0100, Sudip Mukherjee wrote:
> > > On Mon, Sep 20, 2021 at 4:53 PM Maxime Ripard <[email protected]> wrote:
> > > >
> > > > On Mon, Sep 20, 2021 at 05:43:33PM +0200, Maxime Ripard wrote:
> > > > > On Mon, Sep 20, 2021 at 04:47:31PM +0200, Maxime Ripard wrote:
> > > > > > On Sat, Sep 04, 2021 at 10:40:29AM +0100, Sudip Mukherjee wrote:
> > > > > > > Hi Maxime,
> > > > > > >
> > > > > > > On Sat, Sep 4, 2021 at 10:10 AM Maxime Ripard <[email protected]> wrote:
> > > > > > > >
> > > > > > > > On Fri, Sep 03, 2021 at 09:09:50PM +0100, Sudip Mukherjee wrote:
> > > > > > > > > Hi Maxime,
> > > > > > > > >
> > > > > > > > > On Fri, Sep 3, 2021 at 5:03 PM Maxime Ripard <[email protected]> wrote:
> > > > > > > > > >
> > > > > > > > > > Hi Sudip,
> > > > > > > > > >
> > > > > > > > > > On Thu, Sep 02, 2021 at 10:08:19AM +0100, Sudip Mukherjee wrote:
> > > > > > > > > > > Hi All,
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > > <snip>
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > You can see the complete dmesg at
> > > > > > > > > > > https://openqa.qa.codethink.co.uk/tests/76#step/dmesg/8
> > > > > > > > > >
> > > > > > > > > > What test were you running?
> > > > > > > > >
> > > > > > > > > Nothing particular, its just a boot test that we do every night after
> > > > > > > > > pulling from torvalds/linux.git
> > > > > > > >
> > > > > > > > What are you booting to then?
> > > > > > >
> > > > > > > I am not sure I understood the question.
> > > > > > > Its an Ubuntu image. The desktop environment is gnome. And as
> > > > > > > mentioned earlier, we use the HEAD of linus tree every night to boot
> > > > > > > the rpi4 and test that we can login via desktop environment and that
> > > > > > > there is no regression in dmesg.
> > > > > >
> > > > > > Looking at the CI, this isn't from a RPi but from qemu?
> > >
> > > No, this is from rpi4 board (4GB), not qemu. The CI is a little
> > > complicated here, lava boots the board with the new kernel and will
> > > then trigger the openQA job. openQA will then connect to the board
> > > using vnc to test the desktop. This is the last link that I sent to
> > > Linus when he asked for it.
> > > https://lava.qa.codethink.co.uk/scheduler/job/164#L1356
> > >
> > > And this is the lava job for today -
> > > https://lava.qa.codethink.co.uk/scheduler/job/173
> >
> > Is it connected to a monitor then?
>
> Missed replying to this one earlier. I have a hdmi lilliput monitor
> connected to it.
>
> >
> > > > > >
> > > > > > What defconfig are you using? How did you generate the Ubuntu image?
> > > > > > Through debootstrap? Any additional package?
> > >
> > > Its the default ubuntu config. I can send you the config if you want.
> >
> > Yes, please.
>
> Attached.
> My build script will copy this config as .config, do olddefconfig
> and then build.

I still can't reproduce it.

What other customisations did you do to that image? It looks like
there's some customs scripts in there (test-mainline.sh), what are they
doing?

Maxime


Attachments:
(No filename) (3.25 kB)
signature.asc (235.00 B)
Download all attachments

2021-09-22 10:13:41

by Sudip Mukherjee

[permalink] [raw]
Subject: Re: Regression with mainline kernel on rpi4

On Wed, Sep 22, 2021 at 10:57 AM Maxime Ripard <[email protected]> wrote:
>
> On Mon, Sep 20, 2021 at 06:21:42PM +0100, Sudip Mukherjee wrote:
> > On Mon, Sep 20, 2021 at 6:10 PM Maxime Ripard <[email protected]> wrote:
> > >
> > > On Mon, Sep 20, 2021 at 05:35:00PM +0100, Sudip Mukherjee wrote:
> > > > On Mon, Sep 20, 2021 at 4:53 PM Maxime Ripard <[email protected]> wrote:
> > > > >
> > > > > On Mon, Sep 20, 2021 at 05:43:33PM +0200, Maxime Ripard wrote:
> > > > > > On Mon, Sep 20, 2021 at 04:47:31PM +0200, Maxime Ripard wrote:
> > > > > > > On Sat, Sep 04, 2021 at 10:40:29AM +0100, Sudip Mukherjee wrote:
> > > > > > > > Hi Maxime,
> > > > > > > >
> > > > > > > > On Sat, Sep 4, 2021 at 10:10 AM Maxime Ripard <[email protected]> wrote:
> > > > > > > > >
> > > > > > > > > On Fri, Sep 03, 2021 at 09:09:50PM +0100, Sudip Mukherjee wrote:
> > > > > > > > > > Hi Maxime,
> > > > > > > > > >
> > > > > > > > > > On Fri, Sep 3, 2021 at 5:03 PM Maxime Ripard <[email protected]> wrote:
> > > > > > > > > > >
> > > > > > > > > > > Hi Sudip,
> > > > > > > > > > >
> > > > > > > > > > > On Thu, Sep 02, 2021 at 10:08:19AM +0100, Sudip Mukherjee wrote:
> > > > > > > > > > > > Hi All,
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > <snip>
> > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > You can see the complete dmesg at
> > > > > > > > > > > > https://openqa.qa.codethink.co.uk/tests/76#step/dmesg/8
> > > > > > > > > > >
> > > > > > > > > > > What test were you running?
> > > > > > > > > >
> > > > > > > > > > Nothing particular, its just a boot test that we do every night after
> > > > > > > > > > pulling from torvalds/linux.git
> > > > > > > > >
> > > > > > > > > What are you booting to then?
> > > > > > > >
> > > > > > > > I am not sure I understood the question.
> > > > > > > > Its an Ubuntu image. The desktop environment is gnome. And as
> > > > > > > > mentioned earlier, we use the HEAD of linus tree every night to boot
> > > > > > > > the rpi4 and test that we can login via desktop environment and that
> > > > > > > > there is no regression in dmesg.
> > > > > > >
> > > > > > > Looking at the CI, this isn't from a RPi but from qemu?
> > > >
> > > > No, this is from rpi4 board (4GB), not qemu. The CI is a little
> > > > complicated here, lava boots the board with the new kernel and will
> > > > then trigger the openQA job. openQA will then connect to the board
> > > > using vnc to test the desktop. This is the last link that I sent to
> > > > Linus when he asked for it.
> > > > https://lava.qa.codethink.co.uk/scheduler/job/164#L1356
> > > >
> > > > And this is the lava job for today -
> > > > https://lava.qa.codethink.co.uk/scheduler/job/173
> > >
> > > Is it connected to a monitor then?
> >
> > Missed replying to this one earlier. I have a hdmi lilliput monitor
> > connected to it.
> >
> > >
> > > > > > >
> > > > > > > What defconfig are you using? How did you generate the Ubuntu image?
> > > > > > > Through debootstrap? Any additional package?
> > > >
> > > > Its the default ubuntu config. I can send you the config if you want.
> > >
> > > Yes, please.
> >
> > Attached.
> > My build script will copy this config as .config, do olddefconfig
> > and then build.
>
> I still can't reproduce it.
>
> What other customisations did you do to that image? It looks like
> there's some customs scripts in there (test-mainline.sh), what are they
> doing?

That test script is triggering the openqa job, but its running only
after lava is able to login. The trace is appearing before the login
prompt even, so test_mainline.sh should not matter here.
The only customization done to the default ubuntu image is the initrd.
Can you please try with the initrd from
https://elisa-builder-00.iol.unh.edu/kernel/mainline/rpi4/initrd.gz..
I will check with another board also.


--
Regards
Sudip

2021-09-22 11:30:11

by Maxime Ripard

[permalink] [raw]
Subject: Re: Regression with mainline kernel on rpi4

On Wed, Sep 22, 2021 at 11:10:34AM +0100, Sudip Mukherjee wrote:
> On Wed, Sep 22, 2021 at 10:57 AM Maxime Ripard <[email protected]> wrote:
> >
> > On Mon, Sep 20, 2021 at 06:21:42PM +0100, Sudip Mukherjee wrote:
> > > On Mon, Sep 20, 2021 at 6:10 PM Maxime Ripard <[email protected]> wrote:
> > > >
> > > > On Mon, Sep 20, 2021 at 05:35:00PM +0100, Sudip Mukherjee wrote:
> > > > > On Mon, Sep 20, 2021 at 4:53 PM Maxime Ripard <[email protected]> wrote:
> > > > > >
> > > > > > On Mon, Sep 20, 2021 at 05:43:33PM +0200, Maxime Ripard wrote:
> > > > > > > On Mon, Sep 20, 2021 at 04:47:31PM +0200, Maxime Ripard wrote:
> > > > > > > > On Sat, Sep 04, 2021 at 10:40:29AM +0100, Sudip Mukherjee wrote:
> > > > > > > > > Hi Maxime,
> > > > > > > > >
> > > > > > > > > On Sat, Sep 4, 2021 at 10:10 AM Maxime Ripard <[email protected]> wrote:
> > > > > > > > > >
> > > > > > > > > > On Fri, Sep 03, 2021 at 09:09:50PM +0100, Sudip Mukherjee wrote:
> > > > > > > > > > > Hi Maxime,
> > > > > > > > > > >
> > > > > > > > > > > On Fri, Sep 3, 2021 at 5:03 PM Maxime Ripard <[email protected]> wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > Hi Sudip,
> > > > > > > > > > > >
> > > > > > > > > > > > On Thu, Sep 02, 2021 at 10:08:19AM +0100, Sudip Mukherjee wrote:
> > > > > > > > > > > > > Hi All,
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > <snip>
> > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > You can see the complete dmesg at
> > > > > > > > > > > > > https://openqa.qa.codethink.co.uk/tests/76#step/dmesg/8
> > > > > > > > > > > >
> > > > > > > > > > > > What test were you running?
> > > > > > > > > > >
> > > > > > > > > > > Nothing particular, its just a boot test that we do every night after
> > > > > > > > > > > pulling from torvalds/linux.git
> > > > > > > > > >
> > > > > > > > > > What are you booting to then?
> > > > > > > > >
> > > > > > > > > I am not sure I understood the question.
> > > > > > > > > Its an Ubuntu image. The desktop environment is gnome. And as
> > > > > > > > > mentioned earlier, we use the HEAD of linus tree every night to boot
> > > > > > > > > the rpi4 and test that we can login via desktop environment and that
> > > > > > > > > there is no regression in dmesg.
> > > > > > > >
> > > > > > > > Looking at the CI, this isn't from a RPi but from qemu?
> > > > >
> > > > > No, this is from rpi4 board (4GB), not qemu. The CI is a little
> > > > > complicated here, lava boots the board with the new kernel and will
> > > > > then trigger the openQA job. openQA will then connect to the board
> > > > > using vnc to test the desktop. This is the last link that I sent to
> > > > > Linus when he asked for it.
> > > > > https://lava.qa.codethink.co.uk/scheduler/job/164#L1356
> > > > >
> > > > > And this is the lava job for today -
> > > > > https://lava.qa.codethink.co.uk/scheduler/job/173
> > > >
> > > > Is it connected to a monitor then?
> > >
> > > Missed replying to this one earlier. I have a hdmi lilliput monitor
> > > connected to it.
> > >
> > > >
> > > > > > > >
> > > > > > > > What defconfig are you using? How did you generate the Ubuntu image?
> > > > > > > > Through debootstrap? Any additional package?
> > > > >
> > > > > Its the default ubuntu config. I can send you the config if you want.
> > > >
> > > > Yes, please.
> > >
> > > Attached.
> > > My build script will copy this config as .config, do olddefconfig
> > > and then build.
> >
> > I still can't reproduce it.
> >
> > What other customisations did you do to that image? It looks like
> > there's some customs scripts in there (test-mainline.sh), what are they
> > doing?
>
> That test script is triggering the openqa job, but its running only
> after lava is able to login. The trace is appearing before the login
> prompt even, so test_mainline.sh should not matter here.
> The only customization done to the default ubuntu image is the initrd.
> Can you please try with the initrd from
> https://elisa-builder-00.iol.unh.edu/kernel/mainline/rpi4/initrd.gz..
> I will check with another board also.

Still works fine (and it required some mangling of the kernel command line).

If we summarize:

- You initially just dumped a panic and a link to your QA, without any
more context:

- Then stating that you're not doing any test, really;

- Well, except for booting Ubuntu, but no other modification

- But you're not booting the standard image

- And with a custom initrd

- And that QA link states that you're booting from QEMU, but you're
not.

Please provide a full documentation on what you're doing to generate
that image, from scratch, in order to get that panic you reported
previously.

I've spent an entire day trying to make sense of what you're doing
exactly to get into that situation. I have other things to work on and I
don't plan on figuring out any random CI system.

Maxime


Attachments:
(No filename) (4.94 kB)
signature.asc (235.00 B)
Download all attachments

2021-09-22 15:27:58

by Linus Torvalds

[permalink] [raw]
Subject: Re: Regression with mainline kernel on rpi4

On Wed, Sep 22, 2021 at 3:11 AM Sudip Mukherjee
<[email protected]> wrote:
>
> That test script is triggering the openqa job, but its running only
> after lava is able to login. The trace is appearing before the login
> prompt even, so test_mainline.sh should not matter here.

Side note: the traces might be more legible if you have debug info in
the kernel, and run the dmesg through the script in

scripts/decode_stacktrace.sh

which should give line numbers and inlining information.

That often makes it much easier to see which access it is that hits a
NULL pointer dereference.

On x86-64, generally just decode the instruction stream, and look at
the instruction patterns and try to figure out where an oops is coming
from, but that's much less useful on arm64 (partly because I'm not as
used to it, but because the arm64 oopses don't print out much of the
instructions so there's often little to go by).

Linus

2021-09-22 17:07:16

by Sudip Mukherjee

[permalink] [raw]
Subject: Re: Regression with mainline kernel on rpi4

On Wed, Sep 22, 2021 at 4:25 PM Linus Torvalds
<[email protected]> wrote:
>
> On Wed, Sep 22, 2021 at 3:11 AM Sudip Mukherjee
> <[email protected]> wrote:
> >
> > That test script is triggering the openqa job, but its running only
> > after lava is able to login. The trace is appearing before the login
> > prompt even, so test_mainline.sh should not matter here.
>
> Side note: the traces might be more legible if you have debug info in
> the kernel, and run the dmesg through the script in
>
> scripts/decode_stacktrace.sh
>
> which should give line numbers and inlining information.

Attached is a complete dmesg and also the decoded trace.
This is done on 4357f03d6611 ("Merge tag 'pm-5.15-rc2' of
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm")


--
Regards
Sudip


Attachments:
dmesg.log (40.39 kB)
converted_trace (5.02 kB)
Download all attachments

2021-09-22 17:20:37

by Sudip Mukherjee

[permalink] [raw]
Subject: Re: Regression with mainline kernel on rpi4

On Wed, Sep 22, 2021 at 12:28 PM Maxime Ripard <[email protected]> wrote:
>
> On Wed, Sep 22, 2021 at 11:10:34AM +0100, Sudip Mukherjee wrote:
> > On Wed, Sep 22, 2021 at 10:57 AM Maxime Ripard <[email protected]> wrote:
> > >

<snip>

>
> Still works fine (and it required some mangling of the kernel command line).
>
> If we summarize:
>
> - You initially just dumped a panic and a link to your QA, without any
> more context:

The SHA was also given, and I didn't know what else you would need.
The openQA link was given to show the dmesg.

>
> - Then stating that you're not doing any test, really;

Yes, and I still say that, its just a boot test.

>
> - Well, except for booting Ubuntu, but no other modification
>
> - But you're not booting the standard image
>
> - And with a custom initrd

yes, something which has always worked in boot-testing LTS kernel or
mainline kernel.

>
> - And that QA link states that you're booting from QEMU, but you're
> not.

I only found that the "WORKER_CLASS" has the name "qemu_rpi4", that is
a name which I choose to give as that worker laptop is connected to
rpi4 and also running qemu tests. If you want I can change the name of
the WORKER_CLASS. :)
iiuc, dmesg shows if its booting in a qemu or on a real hardware.

>
> Please provide a full documentation on what you're doing to generate
> that image, from scratch, in order to get that panic you reported
> previously.

I have now ordered another rpi4 board and will create the image from
scratch and give you the steps.

>
> I've spent an entire day trying to make sense of what you're doing
> exactly to get into that situation. I have other things to work on and I
> don't plan on figuring out any random CI system.

I am not really sure why you are trying to figure out a random CI
system. I can reproduce the problem in our setup everytime I test with
that reverted commit and I have already said I am happy to test with a
debug patch or anything else.


--
Regards
Sudip

2021-09-22 18:25:58

by Linus Torvalds

[permalink] [raw]
Subject: Re: Regression with mainline kernel on rpi4

On Wed, Sep 22, 2021 at 10:02 AM Sudip Mukherjee
<[email protected]> wrote:
>
>
> Attached is a complete dmesg and also the decoded trace.
> This is done on 4357f03d6611 ("Merge tag 'pm-5.15-rc2' of
> git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm")

drivers/gpu/drm/vc4/vc4_hdmi.c:1214 is

tmp = (u64)(mode->clock * 1000) * n;

in vc4_hdmi_set_n_cts(), which has apparently been inlined from
vc4_hdmi_audio_prepare() in vc4_hdmi.c:1398.

So it looks like 'mode' is some offset off a NULL pointer.

Which looks not impossible:

1207 struct drm_connector *connector = &vc4_hdmi->connector;
1208 struct drm_crtc *crtc = connector->state->crtc;
1209 const struct drm_display_mode *mode =
&crtc->state->adjusted_mode;

looks like crtc->state perhaps might be NULL.

Although it's entirely possible that it's 'crtc' itself that is NULL
or one of the earlier indirection accesses.

The exact line information from the debug info is very useful and
mostly correct, but at the same time should always be taken with a
small pinch of salt.

Compiler optimizations means that code gets munged and moved around,
and since this is the first access to 'mode', I would not be surprised
if some of the calculations and accesses to get 'mode' might be moved
around to it.

Linus

2021-09-22 20:25:11

by Sudip Mukherjee

[permalink] [raw]
Subject: Re: Regression with mainline kernel on rpi4

On Wed, Sep 22, 2021 at 7:23 PM Linus Torvalds
<[email protected]> wrote:
>
> On Wed, Sep 22, 2021 at 10:02 AM Sudip Mukherjee
> <[email protected]> wrote:
> >
> >
> > Attached is a complete dmesg and also the decoded trace.
> > This is done on 4357f03d6611 ("Merge tag 'pm-5.15-rc2' of
> > git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm")
>
> drivers/gpu/drm/vc4/vc4_hdmi.c:1214 is
>
> tmp = (u64)(mode->clock * 1000) * n;
>
> in vc4_hdmi_set_n_cts(), which has apparently been inlined from
> vc4_hdmi_audio_prepare() in vc4_hdmi.c:1398.
>
> So it looks like 'mode' is some offset off a NULL pointer.
>
> Which looks not impossible:
>
> 1207 struct drm_connector *connector = &vc4_hdmi->connector;
> 1208 struct drm_crtc *crtc = connector->state->crtc;
> 1209 const struct drm_display_mode *mode =
> &crtc->state->adjusted_mode;
>
> looks like crtc->state perhaps might be NULL.

I added some debugs to print the addresses, and I am getting:
[ 38.813809] sudip crtc 0000000000000000

This is from struct drm_crtc *crtc = connector->state->crtc;

connector and connector->state had valid addresses.
[ 38.805302] sudip connector ffff000040bac578
[ 38.809779] sudip state ffff000057eb5400

This is the diff of the debug I added:
diff --git a/drivers/gpu/drm/vc4/vc4_hdmi.c b/drivers/gpu/drm/vc4/vc4_hdmi.c
index 4a1115043114..2a8f06948094 100644
--- a/drivers/gpu/drm/vc4/vc4_hdmi.c
+++ b/drivers/gpu/drm/vc4/vc4_hdmi.c
@@ -1205,11 +1205,20 @@ static void
vc4_hdmi_audio_set_mai_clock(struct vc4_hdmi *vc4_hdmi,
static void vc4_hdmi_set_n_cts(struct vc4_hdmi *vc4_hdmi, unsigned
int samplerate)
{
struct drm_connector *connector = &vc4_hdmi->connector;
- struct drm_crtc *crtc = connector->state->crtc;
- const struct drm_display_mode *mode = &crtc->state->adjusted_mode;
+ struct drm_crtc *crtc;
+ struct drm_display_mode *mode;
u32 n, cts;
u64 tmp;

+
+ pr_err("sudip connector %px\n", connector);
+ pr_err("sudip state %px\n", connector->state);
+ crtc = connector->state->crtc;
+
+ pr_err("sudip crtc %px\n", crtc);
+ pr_err("sudip state %px\n", crtc->state);
+ pr_err("state mode %px\n", &crtc->state->adjusted_mode);
+ mode = &crtc->state->adjusted_mode;
n = 128 * samplerate / 1000;
tmp = (u64)(mode->clock * 1000) * n;
do_div(tmp, 128 * samplerate);


--
Regards
Sudip

2021-09-22 20:27:10

by Linus Torvalds

[permalink] [raw]
Subject: Re: Regression with mainline kernel on rpi4

On Wed, Sep 22, 2021 at 1:19 PM Sudip Mukherjee
<[email protected]> wrote:
>
> I added some debugs to print the addresses, and I am getting:
> [ 38.813809] sudip crtc 0000000000000000
>
> This is from struct drm_crtc *crtc = connector->state->crtc;

Yeah, that was my personal suspicion, because while the line number
implied "crtc->state" being NULL, the drm data structure documentation
and other drivers both imply that "crtc" was the more likely one.

I suspect a simple

if (!crtc)
return;

in vc4_hdmi_set_n_cts() is at least part of the fix for this all, but
I didn't check if there is possibly something else that needs to be
done too.

Linus

2021-09-25 00:06:41

by Maxime Ripard

[permalink] [raw]
Subject: Re: Regression with mainline kernel on rpi4

On Wed, Sep 22, 2021 at 01:25:21PM -0700, Linus Torvalds wrote:
> On Wed, Sep 22, 2021 at 1:19 PM Sudip Mukherjee
> <[email protected]> wrote:
> >
> > I added some debugs to print the addresses, and I am getting:
> > [ 38.813809] sudip crtc 0000000000000000
> >
> > This is from struct drm_crtc *crtc = connector->state->crtc;
>
> Yeah, that was my personal suspicion, because while the line number
> implied "crtc->state" being NULL, the drm data structure documentation
> and other drivers both imply that "crtc" was the more likely one.
>
> I suspect a simple
>
> if (!crtc)
> return;
>
> in vc4_hdmi_set_n_cts() is at least part of the fix for this all, but
> I didn't check if there is possibly something else that needs to be
> done too.

Thanks for the decode_stacktrace.sh and the follow-up

Yeah, it looks like we have several things wrong here:

* we only check that connector->state is set, and not
connector->state->crtc indeed.

* We also check only in startup(), so at open() and not later on when
the sound streaming actually start. This has been there for a while,
so I guess it's never really been causing a practical issue before.

I'm still not entirely sure how we can end up in that situation though.
The only case I could think of is that:

* The firmware enables the HDMI controller, then boots Linux

* The driver starts, registers its audio card. connector->state is
NULL then, and if the HDMI monitor is actually an HDMI monitor (vs a
DVI monitor), the VC4_HDMI_RAM_PACKET_ENABLE bit that we test in
startup will be set.

* The driver will create the connector->state (through a call to
drm_mode_config_reset in vc4_kms_load), connector->state isn't NULL
anymore, VC4_HDMI_RAM_PACKET_ENABLE is still set.

* The driver then disables the HDMI controller (in
vc4_crtc_disable_at_boot) but never clears the
VC4_HDMI_RAM_PACKET_ENABLE bit.

* Pulseaudio opens the audio device, startup succeeds because both
conditions we test succeed.

* However, since we either never enabled the HDMI connector (or if it
was disabled at some point), connector->state->crtc is NULL and we
get our NULL pointer dereference.

The Ubuntu configuration has the framebuffer emulation and the
framebuffer console enabled, so it's likely to be enabled and
something (X.org?) comes along and disables the connector right when
pulseaudio calls prepare().

Maxime


Attachments:
(No filename) (2.49 kB)
signature.asc (235.00 B)
Download all attachments

2021-09-25 06:37:21

by Daniel Vetter

[permalink] [raw]
Subject: Re: Regression with mainline kernel on rpi4

On Fri, Sep 24, 2021 at 3:30 PM Maxime Ripard <[email protected]> wrote:
>
> On Wed, Sep 22, 2021 at 01:25:21PM -0700, Linus Torvalds wrote:
> > On Wed, Sep 22, 2021 at 1:19 PM Sudip Mukherjee
> > <[email protected]> wrote:
> > >
> > > I added some debugs to print the addresses, and I am getting:
> > > [ 38.813809] sudip crtc 0000000000000000
> > >
> > > This is from struct drm_crtc *crtc = connector->state->crtc;
> >
> > Yeah, that was my personal suspicion, because while the line number
> > implied "crtc->state" being NULL, the drm data structure documentation
> > and other drivers both imply that "crtc" was the more likely one.
> >
> > I suspect a simple
> >
> > if (!crtc)
> > return;
> >
> > in vc4_hdmi_set_n_cts() is at least part of the fix for this all, but
> > I didn't check if there is possibly something else that needs to be
> > done too.
>
> Thanks for the decode_stacktrace.sh and the follow-up
>
> Yeah, it looks like we have several things wrong here:
>
> * we only check that connector->state is set, and not
> connector->state->crtc indeed.
>
> * We also check only in startup(), so at open() and not later on when
> the sound streaming actually start. This has been there for a while,
> so I guess it's never really been causing a practical issue before.

You also have no locking, plus looking at ->state objects outside of
atomic commit machinery makes no sense because you're not actually in
sync with the hw state. Relevant bits need to be copied over at commit
time, protected by some spinlock (and that spinlock also needs to be
held over whatever other stuff you're setting to make sure we don't
get a funny out-of-sync state anywhere).

Liberally sprinkling a few NULL checks here doesn't fix much at all,
it only papers over design bugs in the code.
-Daniel


> I'm still not entirely sure how we can end up in that situation though.
> The only case I could think of is that:
>
> * The firmware enables the HDMI controller, then boots Linux
>
> * The driver starts, registers its audio card. connector->state is
> NULL then, and if the HDMI monitor is actually an HDMI monitor (vs a
> DVI monitor), the VC4_HDMI_RAM_PACKET_ENABLE bit that we test in
> startup will be set.
>
> * The driver will create the connector->state (through a call to
> drm_mode_config_reset in vc4_kms_load), connector->state isn't NULL
> anymore, VC4_HDMI_RAM_PACKET_ENABLE is still set.
>
> * The driver then disables the HDMI controller (in
> vc4_crtc_disable_at_boot) but never clears the
> VC4_HDMI_RAM_PACKET_ENABLE bit.
>
> * Pulseaudio opens the audio device, startup succeeds because both
> conditions we test succeed.
>
> * However, since we either never enabled the HDMI connector (or if it
> was disabled at some point), connector->state->crtc is NULL and we
> get our NULL pointer dereference.
>
> The Ubuntu configuration has the framebuffer emulation and the
> framebuffer console enabled, so it's likely to be enabled and
> something (X.org?) comes along and disables the connector right when
> pulseaudio calls prepare().
>
> Maxime



--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

2021-09-28 08:36:55

by Maxime Ripard

[permalink] [raw]
Subject: Re: Regression with mainline kernel on rpi4

Hi Daniel,

On Sat, Sep 25, 2021 at 12:50:17AM +0200, Daniel Vetter wrote:
> On Fri, Sep 24, 2021 at 3:30 PM Maxime Ripard <[email protected]> wrote:
> >
> > On Wed, Sep 22, 2021 at 01:25:21PM -0700, Linus Torvalds wrote:
> > > On Wed, Sep 22, 2021 at 1:19 PM Sudip Mukherjee
> > > <[email protected]> wrote:
> > > >
> > > > I added some debugs to print the addresses, and I am getting:
> > > > [ 38.813809] sudip crtc 0000000000000000
> > > >
> > > > This is from struct drm_crtc *crtc = connector->state->crtc;
> > >
> > > Yeah, that was my personal suspicion, because while the line number
> > > implied "crtc->state" being NULL, the drm data structure documentation
> > > and other drivers both imply that "crtc" was the more likely one.
> > >
> > > I suspect a simple
> > >
> > > if (!crtc)
> > > return;
> > >
> > > in vc4_hdmi_set_n_cts() is at least part of the fix for this all, but
> > > I didn't check if there is possibly something else that needs to be
> > > done too.
> >
> > Thanks for the decode_stacktrace.sh and the follow-up
> >
> > Yeah, it looks like we have several things wrong here:
> >
> > * we only check that connector->state is set, and not
> > connector->state->crtc indeed.
> >
> > * We also check only in startup(), so at open() and not later on when
> > the sound streaming actually start. This has been there for a while,
> > so I guess it's never really been causing a practical issue before.
>
> You also have no locking

Indeed. Do we just need locking to prevent a concurrent audio setup and
modeset, or do you have another corner case in mind?

Also, generally, what locks should we make sure we have locked when
accessing the connector and CRTC state? drm_mode_config.connection_mutex
and drm_mode_config.mutex, respectively?

> plus looking at ->state objects outside of atomic commit machinery
> makes no sense because you're not actually in sync with the hw state.
> Relevant bits need to be copied over at commit time, protected by some
> spinlock (and that spinlock also needs to be held over whatever other
> stuff you're setting to make sure we don't get a funny out-of-sync
> state anywhere).

If we already have a lock protecting against having both an ASoC and KMS
function running, it's not clear to me what the spinlock would prevent
here?

Maxime

2021-09-30 13:01:17

by Daniel Vetter

[permalink] [raw]
Subject: Re: Regression with mainline kernel on rpi4

On Tue, Sep 28, 2021 at 10:34:46AM +0200, Maxime Ripard wrote:
> Hi Daniel,
>
> On Sat, Sep 25, 2021 at 12:50:17AM +0200, Daniel Vetter wrote:
> > On Fri, Sep 24, 2021 at 3:30 PM Maxime Ripard <[email protected]> wrote:
> > >
> > > On Wed, Sep 22, 2021 at 01:25:21PM -0700, Linus Torvalds wrote:
> > > > On Wed, Sep 22, 2021 at 1:19 PM Sudip Mukherjee
> > > > <[email protected]> wrote:
> > > > >
> > > > > I added some debugs to print the addresses, and I am getting:
> > > > > [ 38.813809] sudip crtc 0000000000000000
> > > > >
> > > > > This is from struct drm_crtc *crtc = connector->state->crtc;
> > > >
> > > > Yeah, that was my personal suspicion, because while the line number
> > > > implied "crtc->state" being NULL, the drm data structure documentation
> > > > and other drivers both imply that "crtc" was the more likely one.
> > > >
> > > > I suspect a simple
> > > >
> > > > if (!crtc)
> > > > return;
> > > >
> > > > in vc4_hdmi_set_n_cts() is at least part of the fix for this all, but
> > > > I didn't check if there is possibly something else that needs to be
> > > > done too.
> > >
> > > Thanks for the decode_stacktrace.sh and the follow-up
> > >
> > > Yeah, it looks like we have several things wrong here:
> > >
> > > * we only check that connector->state is set, and not
> > > connector->state->crtc indeed.
> > >
> > > * We also check only in startup(), so at open() and not later on when
> > > the sound streaming actually start. This has been there for a while,
> > > so I guess it's never really been causing a practical issue before.
> >
> > You also have no locking
>
> Indeed. Do we just need locking to prevent a concurrent audio setup and
> modeset, or do you have another corner case in mind?
>
> Also, generally, what locks should we make sure we have locked when
> accessing the connector and CRTC state? drm_mode_config.connection_mutex
> and drm_mode_config.mutex, respectively?
>
> > plus looking at ->state objects outside of atomic commit machinery
> > makes no sense because you're not actually in sync with the hw state.
> > Relevant bits need to be copied over at commit time, protected by some
> > spinlock (and that spinlock also needs to be held over whatever other
> > stuff you're setting to make sure we don't get a funny out-of-sync
> > state anywhere).
>
> If we already have a lock protecting against having both an ASoC and KMS
> function running, it's not clear to me what the spinlock would prevent
> here?

Replicating the irc chat here. With

commit 6c5ed5ae353cdf156f9ac4db17e15db56b4de880
Author: Maarten Lankhorst <[email protected]>
Date: Thu Apr 6 20:55:20 2017 +0200

drm/atomic: Acquire connection_mutex lock in drm_helper_probe_single_connector_modes, v4.

this is already taken care of for drivers and should be all good from a
locking pov.
-Daniel
--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

2021-10-13 15:05:10

by Maxime Ripard

[permalink] [raw]
Subject: Re: Regression with mainline kernel on rpi4

On Thu, Sep 30, 2021 at 11:19:59AM +0200, Daniel Vetter wrote:
> On Tue, Sep 28, 2021 at 10:34:46AM +0200, Maxime Ripard wrote:
> > Hi Daniel,
> >
> > On Sat, Sep 25, 2021 at 12:50:17AM +0200, Daniel Vetter wrote:
> > > On Fri, Sep 24, 2021 at 3:30 PM Maxime Ripard <[email protected]> wrote:
> > > >
> > > > On Wed, Sep 22, 2021 at 01:25:21PM -0700, Linus Torvalds wrote:
> > > > > On Wed, Sep 22, 2021 at 1:19 PM Sudip Mukherjee
> > > > > <[email protected]> wrote:
> > > > > >
> > > > > > I added some debugs to print the addresses, and I am getting:
> > > > > > [ 38.813809] sudip crtc 0000000000000000
> > > > > >
> > > > > > This is from struct drm_crtc *crtc = connector->state->crtc;
> > > > >
> > > > > Yeah, that was my personal suspicion, because while the line number
> > > > > implied "crtc->state" being NULL, the drm data structure documentation
> > > > > and other drivers both imply that "crtc" was the more likely one.
> > > > >
> > > > > I suspect a simple
> > > > >
> > > > > if (!crtc)
> > > > > return;
> > > > >
> > > > > in vc4_hdmi_set_n_cts() is at least part of the fix for this all, but
> > > > > I didn't check if there is possibly something else that needs to be
> > > > > done too.
> > > >
> > > > Thanks for the decode_stacktrace.sh and the follow-up
> > > >
> > > > Yeah, it looks like we have several things wrong here:
> > > >
> > > > * we only check that connector->state is set, and not
> > > > connector->state->crtc indeed.
> > > >
> > > > * We also check only in startup(), so at open() and not later on when
> > > > the sound streaming actually start. This has been there for a while,
> > > > so I guess it's never really been causing a practical issue before.
> > >
> > > You also have no locking
> >
> > Indeed. Do we just need locking to prevent a concurrent audio setup and
> > modeset, or do you have another corner case in mind?
> >
> > Also, generally, what locks should we make sure we have locked when
> > accessing the connector and CRTC state? drm_mode_config.connection_mutex
> > and drm_mode_config.mutex, respectively?
> >
> > > plus looking at ->state objects outside of atomic commit machinery
> > > makes no sense because you're not actually in sync with the hw state.
> > > Relevant bits need to be copied over at commit time, protected by some
> > > spinlock (and that spinlock also needs to be held over whatever other
> > > stuff you're setting to make sure we don't get a funny out-of-sync
> > > state anywhere).
> >
> > If we already have a lock protecting against having both an ASoC and KMS
> > function running, it's not clear to me what the spinlock would prevent
> > here?
>
> Replicating the irc chat here. With
>
> commit 6c5ed5ae353cdf156f9ac4db17e15db56b4de880
> Author: Maarten Lankhorst <[email protected]>
> Date: Thu Apr 6 20:55:20 2017 +0200
>
> drm/atomic: Acquire connection_mutex lock in drm_helper_probe_single_connector_modes, v4.
>
> this is already taken care of for drivers and should be all good from a
> locking pov.

So, if I understand this properly, this superseeds your comment on the
spinlock for the hw state, but not the comment that we need some locking
to synchronize between the audio and KMS path (and CEC?). Right?

Maxime


Attachments:
(No filename) (3.30 kB)
signature.asc (235.00 B)
Download all attachments

2021-10-14 13:27:34

by Daniel Vetter

[permalink] [raw]
Subject: Re: Regression with mainline kernel on rpi4

On Wed, Oct 13, 2021 at 05:01:03PM +0200, Maxime Ripard wrote:
> On Thu, Sep 30, 2021 at 11:19:59AM +0200, Daniel Vetter wrote:
> > On Tue, Sep 28, 2021 at 10:34:46AM +0200, Maxime Ripard wrote:
> > > Hi Daniel,
> > >
> > > On Sat, Sep 25, 2021 at 12:50:17AM +0200, Daniel Vetter wrote:
> > > > On Fri, Sep 24, 2021 at 3:30 PM Maxime Ripard <[email protected]> wrote:
> > > > >
> > > > > On Wed, Sep 22, 2021 at 01:25:21PM -0700, Linus Torvalds wrote:
> > > > > > On Wed, Sep 22, 2021 at 1:19 PM Sudip Mukherjee
> > > > > > <[email protected]> wrote:
> > > > > > >
> > > > > > > I added some debugs to print the addresses, and I am getting:
> > > > > > > [ 38.813809] sudip crtc 0000000000000000
> > > > > > >
> > > > > > > This is from struct drm_crtc *crtc = connector->state->crtc;
> > > > > >
> > > > > > Yeah, that was my personal suspicion, because while the line number
> > > > > > implied "crtc->state" being NULL, the drm data structure documentation
> > > > > > and other drivers both imply that "crtc" was the more likely one.
> > > > > >
> > > > > > I suspect a simple
> > > > > >
> > > > > > if (!crtc)
> > > > > > return;
> > > > > >
> > > > > > in vc4_hdmi_set_n_cts() is at least part of the fix for this all, but
> > > > > > I didn't check if there is possibly something else that needs to be
> > > > > > done too.
> > > > >
> > > > > Thanks for the decode_stacktrace.sh and the follow-up
> > > > >
> > > > > Yeah, it looks like we have several things wrong here:
> > > > >
> > > > > * we only check that connector->state is set, and not
> > > > > connector->state->crtc indeed.
> > > > >
> > > > > * We also check only in startup(), so at open() and not later on when
> > > > > the sound streaming actually start. This has been there for a while,
> > > > > so I guess it's never really been causing a practical issue before.
> > > >
> > > > You also have no locking
> > >
> > > Indeed. Do we just need locking to prevent a concurrent audio setup and
> > > modeset, or do you have another corner case in mind?
> > >
> > > Also, generally, what locks should we make sure we have locked when
> > > accessing the connector and CRTC state? drm_mode_config.connection_mutex
> > > and drm_mode_config.mutex, respectively?
> > >
> > > > plus looking at ->state objects outside of atomic commit machinery
> > > > makes no sense because you're not actually in sync with the hw state.
> > > > Relevant bits need to be copied over at commit time, protected by some
> > > > spinlock (and that spinlock also needs to be held over whatever other
> > > > stuff you're setting to make sure we don't get a funny out-of-sync
> > > > state anywhere).
> > >
> > > If we already have a lock protecting against having both an ASoC and KMS
> > > function running, it's not clear to me what the spinlock would prevent
> > > here?
> >
> > Replicating the irc chat here. With
> >
> > commit 6c5ed5ae353cdf156f9ac4db17e15db56b4de880
> > Author: Maarten Lankhorst <[email protected]>
> > Date: Thu Apr 6 20:55:20 2017 +0200
> >
> > drm/atomic: Acquire connection_mutex lock in drm_helper_probe_single_connector_modes, v4.
> >
> > this is already taken care of for drivers and should be all good from a
> > locking pov.
>
> So, if I understand this properly, this superseeds your comment on the
> spinlock for the hw state, but not the comment that we need some locking
> to synchronize between the audio and KMS path (and CEC?). Right?

Other way round. There's 3 things involved here:
1. kms output probe code
2. kms atomic commit code
3. calls from asoc side

The above referenced commit makes sure 1&2 are synchronized. The problem
is that 2&3 are not synchonronized, and from 3, no matter how much locking
you have, you cannot look at kms state. I.e. not allowed to look at
crtc->state for example, irrespective of whether you're holding
drm_modeset_lock or not. This is because the atomic nonblocking commit is
done without holding any locks, protection is purely down to ownership
rules of state structures and ordering (through drm_crtc_commit) of
in-flight nonblocking atomic commits.

That's why you need a sperate lock _and_ copy state, so taht 2&3 stay in
sync.

In practice you only care about modeset changes from 2 vs anything from 3,
and most userspace does modeset atomic commits as blocking commits, which
means you won't notice that your locking has gaps.

btw same problem exists between atomic and (vblank) irq handler. There you
need a irqsafe spinlock and you also have to copy (because the irq handler
just cannot access ->state in any safe way, because it doesn't own that
structure).

This is maybe a bit the confusing thing with atomic commit: ->state isn't
protected by locks, but through ownership rules. Only for atomic check is
->state protected by locks, but once we're committed we switch over to
ownership rules for protection. swap_states() is that point of no return.

Cheers, Daniel
--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

2021-10-25 14:34:13

by Maxime Ripard

[permalink] [raw]
Subject: Re: Regression with mainline kernel on rpi4

Hi,

On Thu, Oct 14, 2021 at 03:15:36PM +0200, Daniel Vetter wrote:
> On Wed, Oct 13, 2021 at 05:01:03PM +0200, Maxime Ripard wrote:
> > On Thu, Sep 30, 2021 at 11:19:59AM +0200, Daniel Vetter wrote:
> > > On Tue, Sep 28, 2021 at 10:34:46AM +0200, Maxime Ripard wrote:
> > > > Hi Daniel,
> > > >
> > > > On Sat, Sep 25, 2021 at 12:50:17AM +0200, Daniel Vetter wrote:
> > > > > On Fri, Sep 24, 2021 at 3:30 PM Maxime Ripard <[email protected]> wrote:
> > > > > >
> > > > > > On Wed, Sep 22, 2021 at 01:25:21PM -0700, Linus Torvalds wrote:
> > > > > > > On Wed, Sep 22, 2021 at 1:19 PM Sudip Mukherjee
> > > > > > > <[email protected]> wrote:
> > > > > > > >
> > > > > > > > I added some debugs to print the addresses, and I am getting:
> > > > > > > > [ 38.813809] sudip crtc 0000000000000000
> > > > > > > >
> > > > > > > > This is from struct drm_crtc *crtc = connector->state->crtc;
> > > > > > >
> > > > > > > Yeah, that was my personal suspicion, because while the line number
> > > > > > > implied "crtc->state" being NULL, the drm data structure documentation
> > > > > > > and other drivers both imply that "crtc" was the more likely one.
> > > > > > >
> > > > > > > I suspect a simple
> > > > > > >
> > > > > > > if (!crtc)
> > > > > > > return;
> > > > > > >
> > > > > > > in vc4_hdmi_set_n_cts() is at least part of the fix for this all, but
> > > > > > > I didn't check if there is possibly something else that needs to be
> > > > > > > done too.
> > > > > >
> > > > > > Thanks for the decode_stacktrace.sh and the follow-up
> > > > > >
> > > > > > Yeah, it looks like we have several things wrong here:
> > > > > >
> > > > > > * we only check that connector->state is set, and not
> > > > > > connector->state->crtc indeed.
> > > > > >
> > > > > > * We also check only in startup(), so at open() and not later on when
> > > > > > the sound streaming actually start. This has been there for a while,
> > > > > > so I guess it's never really been causing a practical issue before.
> > > > >
> > > > > You also have no locking
> > > >
> > > > Indeed. Do we just need locking to prevent a concurrent audio setup and
> > > > modeset, or do you have another corner case in mind?
> > > >
> > > > Also, generally, what locks should we make sure we have locked when
> > > > accessing the connector and CRTC state? drm_mode_config.connection_mutex
> > > > and drm_mode_config.mutex, respectively?
> > > >
> > > > > plus looking at ->state objects outside of atomic commit machinery
> > > > > makes no sense because you're not actually in sync with the hw state.
> > > > > Relevant bits need to be copied over at commit time, protected by some
> > > > > spinlock (and that spinlock also needs to be held over whatever other
> > > > > stuff you're setting to make sure we don't get a funny out-of-sync
> > > > > state anywhere).
> > > >
> > > > If we already have a lock protecting against having both an ASoC and KMS
> > > > function running, it's not clear to me what the spinlock would prevent
> > > > here?
> > >
> > > Replicating the irc chat here. With
> > >
> > > commit 6c5ed5ae353cdf156f9ac4db17e15db56b4de880
> > > Author: Maarten Lankhorst <[email protected]>
> > > Date: Thu Apr 6 20:55:20 2017 +0200
> > >
> > > drm/atomic: Acquire connection_mutex lock in drm_helper_probe_single_connector_modes, v4.
> > >
> > > this is already taken care of for drivers and should be all good from a
> > > locking pov.
> >
> > So, if I understand this properly, this superseeds your comment on the
> > spinlock for the hw state, but not the comment that we need some locking
> > to synchronize between the audio and KMS path (and CEC?). Right?
>
> Other way round. There's 3 things involved here:
> 1. kms output probe code
> 2. kms atomic commit code
> 3. calls from asoc side
>
> The above referenced commit makes sure 1&2 are synchronized. The problem
> is that 2&3 are not synchonronized, and from 3, no matter how much locking
> you have, you cannot look at kms state. I.e. not allowed to look at
> crtc->state for example, irrespective of whether you're holding
> drm_modeset_lock or not. This is because the atomic nonblocking commit is
> done without holding any locks, protection is purely down to ownership
> rules of state structures and ordering (through drm_crtc_commit) of
> in-flight nonblocking atomic commits.
>
> That's why you need a sperate lock _and_ copy state, so taht 2&3 stay in
> sync.
>
> In practice you only care about modeset changes from 2 vs anything from 3,
> and most userspace does modeset atomic commits as blocking commits, which
> means you won't notice that your locking has gaps.
>
> btw same problem exists between atomic and (vblank) irq handler. There you
> need a irqsafe spinlock and you also have to copy (because the irq handler
> just cannot access ->state in any safe way, because it doesn't own that
> structure).
>
> This is maybe a bit the confusing thing with atomic commit: ->state isn't
> protected by locks, but through ownership rules. Only for atomic check is
> ->state protected by locks, but once we're committed we switch over to
> ownership rules for protection. swap_states() is that point of no return.

Thanks for the clarifications, I just posted a series that should be
implementing this here:
https://lore.kernel.org/dri-devel/[email protected]/

Maxime


Attachments:
(No filename) (5.42 kB)
signature.asc (235.00 B)
Download all attachments