Subject: vanilla kernels hang randomly under Fedora 10 on system with Radeon card


Hi,

After the Fedora 9 -> Fedora 10 upgrade, vanilla kernels which previously
worked fine (next-20081128 and next-20081121) started to hang randomly
on my Pentium M / 855PM / RV350 laptop. Since (surprisingly) the stock
Fedora kernel (2.6.27.5-117.fc10.i686) was not affected, I got the idea
that either userspace changes uncovered some kernel regression or some
Fedora-specific patch must be fixing the issue. Unfortunately vanilla
2.6.27 also froze, so after the usual pain caused by hitting a bunch of
unrelated problems [1] it turned out that drm-modesetting-radeon.patch
is the magic patch and CONFIG_DRM_RADEON_KMS is the magic change. With
the patch applied and the option enabled, next-20081128 is stable again...

Since the following errors get logged by the kernel:

[drm:drm_buffer_object_validate] *ERROR* Failed moving buffer. cef578c0 1444 4000027 10000a0
[drm:drm_buffer_object_validate] *ERROR* Out of aperture space or DRM memory quota.

and it also seems that the system is more responsive now (it was kind of
sluggish previously), my draft theory is that the F9 -> F10 upgrade triggered
some AGP memory management bug and CONFIG_DRM_RADEON_KMS happens to fix it,
but I'll leave figuring this out to the more knowledgeable people... ;)

Thanks,
Bart

PS1 full dmesg, config and lspci outputs are here:
http://www.kernel.org/pub/linux/kernel/people/bart/f9_to_f10_grind/

PS2 drm set mode reporting looks quite weird sometimes, e.g.:

...
[drm] LVDS-8: set mode �Տ�� d
[drm] bios LVDS_GEN_CNTL: 0x30ff24
...

[1] some Fedora patches trip up quilt (quilt push applies the whole patch
but quilt pop only removes part of it) since they aggregate multiple
changes to a single file, the execshield patch causes an oops with my kernel
config, and the depmod run during a build with Fedora's kernel config triggered
the OOM killer (512M on this machine and yes, I was happily running without swap)


2008-12-02 05:19:14

by Dave Airlie

Subject: Re: vanilla kernels hang randomly under Fedora 10 on system with Radeon card

On Tue, Dec 2, 2008 at 8:42 AM, Bartlomiej Zolnierkiewicz
<[email protected]> wrote:
> [...]
>
> and it also seems that system is more responsive now (it was kind of
> sluggish previously) my draft theory is that F9 -> F10 triggered some
> AGP memory management bug and CONFIG_DRM_RADEON_KMS happens to fix it
> but I'll leave figuring this up to the more knowledgeable people... ;)

Well, KMS is a purely Fedora thing, and enabling it completely avoids
the old driver codepaths, so while it might fix it, it's more by accident
than design.

I'm trying to track down the rv3xx hangs with hpa at the moment, as he
sees them too; it's something in the 2.6.26->2.6.27 timeframe. I'm hoping
that running the 2.6.26 drm on 2.6.27 will help narrow it down.

Bisecting 2.6.26->2.6.27 might also help.

Dave.

2008-12-02 10:22:54

by Benny Amorsen

Subject: Re: vanilla kernels hang randomly under Fedora 10 on system with Radeon card

Bartlomiej Zolnierkiewicz <[email protected]> writes:

> Since the following error gets logged by kernel:
>
> [drm:drm_buffer_object_validate] *ERROR* Failed moving buffer. cef578c0 1444 4000027 10000a0
> [drm:drm_buffer_object_validate] *ERROR* Out of aperture space or DRM memory quota.
>
> and it also seems that system is more responsive now (it was kind of
> sluggish previously) my draft theory is that F9 -> F10 triggered some
> AGP memory management bug and CONFIG_DRM_RADEON_KMS happens to fix it
> but I'll leave figuring this up to the more knowledgeable people... ;)

I saw those error messages with kernel-2.6.27.5-117.fc10.i686 and
xorg-x11-drv-ati-6.9.0-54.fc10.i386 on an HP nx8220. They made
gnome-terminal basically useless, especially with tabs, since it took
several seconds to switch tabs.

With the proprietary ATI driver everything works perfectly, including
suspend/resume (yay, successful suspend-to-RAM for the first time
since ACPI!)


/Benny

Subject: Re: vanilla kernels hang randomly under Fedora 10 on system with Radeon card

On Tuesday 02 December 2008, Dave Airlie wrote:
> On Tue, Dec 2, 2008 at 8:42 AM, Bartlomiej Zolnierkiewicz
> <[email protected]> wrote:
> > [...]
>
> Well KMS is a purely Fedora thing, and enabling it completely avoids
> the old driver codepaths so
> while it might fix it, its more by accident than design.
>
> I'm trying to track down the rv3xx hangs with hpa at the moment as he
> sees them also, something in
> the 2.6.26->2.6.27 timeframe. I'm hoping running the 2.6.26 drm on
> the 2.6.27 will help narrow it down.
>
> Bisecting 2.6.26->2.6.27 might also help.

It could be a different issue. I tried 2.6.26, 2.6.25 and 2.6.24
and they all hang (they all worked fine with Fedora 9)...

I will try some older kernels, but I'm starting to think that the xorg ati
driver update is the main cause (xorg-x11-drv-ati-6.8.0-19.fc9.i386.rpm
-> xorg-x11-drv-ati-6.9.0-54.fc10.i386.rpm).

Thanks,
Bart

Subject: Re: vanilla kernels hang randomly under Fedora 10 on system with Radeon card

On Wednesday 03 December 2008, Bartlomiej Zolnierkiewicz wrote:
> On Tuesday 02 December 2008, Dave Airlie wrote:
> > On Tue, Dec 2, 2008 at 8:42 AM, Bartlomiej Zolnierkiewicz
> > <[email protected]> wrote:
> > > [...]
> >
> > Well KMS is a purely Fedora thing, and enabling it completely avoids
> > the old driver codepaths so
> > while it might fix it, its more by accident than design.
> >
> > I'm trying to track down the rv3xx hangs with hpa at the moment as he
> > sees them also, something in
> > the 2.6.26->2.6.27 timeframe. I'm hoping running the 2.6.26 drm on
> > the 2.6.27 will help narrow it down.
> >
> > Bisecting 2.6.26->2.6.27 might also help.
>
> It could be a different issue. I tried 2.6.26, 2.6.25 and 2.6.24
> and they all hang (they all worked fine with Fedora 9)...
>
> I will try some older kernels but I start thinking that the xorg's ati
> driver update is the main cause (xorg-x11-drv-ati-6.8.0-19.fc9.i386.rpm
> -> xorg-x11-drv-ati-6.9.0-54.fc10.i386.rpm).

I went straight to downgrading the driver, and the older driver
indeed works fine. Then I tried to narrow down the problem, and the lucky
winner this time is the cute (== undocumented and not signed-off) patch
called radeon-6.9.0-remove-limit-heuristics.patch. The newer driver with
only this patch reverted fixes the hangs with vanilla kernels and the drm
errors with the Fedora kernel. The performance problems I noticed in the
meantime (slower playback of 720p videos, sluggish window scrolling in
kmail) are also completely gone. That being said, I'm not entirely sure
whether the patch introduced the bug or was only the trigger for it...

FWIW I've noticed that the patch seems to change the allocation+layout of
memory used by the driver:

@@ -524,13 +521,13 @@
(II) RADEON(0): Depth moves disabled by default
(==) RADEON(0): Not using accelerated EXA DownloadFromScreen hook
(II) RADEON(0): Allocating from a screen of 65536 kb
-(II) RADEON(0): Will use 32 kb for hardware cursor 0 at offset 0x00bb8000
-(II) RADEON(0): Will use 32 kb for hardware cursor 1 at offset 0x00bbc000
-(II) RADEON(0): Will use 12000 kb for front buffer at offset 0x00000000
-(II) RADEON(0): Will use 12000 kb for back buffer at offset 0x00bc0000
-(II) RADEON(0): Will use 12000 kb for depth buffer at offset 0x01778000
-(II) RADEON(0): Will use 14720 kb for textures at offset 0x02330000
-(II) RADEON(0): Will use 14784 kb for X Server offscreen at offset 0x03190000
+(II) RADEON(0): Will use 32 kb for hardware cursor 0 at offset 0x011e0000
+(II) RADEON(0): Will use 32 kb for hardware cursor 1 at offset 0x011e4000
+(II) RADEON(0): Will use 18304 kb for front buffer at offset 0x00000000
+(II) RADEON(0): Will use 18304 kb for back buffer at offset 0x011e8000
+(II) RADEON(0): Will use 18304 kb for depth buffer at offset 0x023c8000
+(II) RADEON(0): Will use 5248 kb for textures at offset 0x035a8000
+(II) RADEON(0): Will use 5344 kb for X Server offscreen at offset 0x03ac8000
drmOpenDevice: node name is /dev/dri/card0
drmOpenDevice: open result is 11, (OK)
drmOpenDevice: node name is /dev/dri/card0

full Xorg logs at kernel.org/pub/linux/kernel/people/bart/f9_to_f10_grind/

Thanks,
Bart

PS Since I was busy debugging the problem, I hadn't noticed that you
have released the new driver package (xorg-x11-drv-ati-6.9.0-59.fc10);
however, since it still contains the patch in question, I don't think
it will help (anyway, I'll test it tomorrow; I'm too tired now)...

2008-12-04 19:40:30

by Jack Tanner

Subject: Re: vanilla kernels hang randomly under Fedora 10 on system with Radeon card

Dave Airlie <airlied <at> gmail.com> writes:

> On Tue, Dec 2, 2008 at 8:42 AM, Bartlomiej Zolnierkiewicz
> <bzolnier <at> gmail.com> wrote:
> >
> > Since the following error gets logged by kernel:
> >
> > [drm:drm_buffer_object_validate] *ERROR* Failed moving buffer. cef578c0 1444 4000027 10000a0
> > [drm:drm_buffer_object_validate] *ERROR* Out of aperture space or DRM memory quota.

Per Bartlomiej's reasoning, I recompiled
xorg-x11-drv-ati-6.9.0-61.fc10.x86_64.rpm without the remove-limit-heuristics
patch. However, I still see errors like the one above. In addition, since
upgrading to F10 I also see visual artifacts on my (composited) desktop. See
the Xorg.0.log (generated with the remove-limit-heuristics patch) attached to
https://bugzilla.redhat.com/show_bug.cgi?id=469214 .

Subject: Re: vanilla kernels hang randomly under Fedora 10 on system with Radeon card

On Thursday 04 December 2008, Bartlomiej Zolnierkiewicz wrote:
> On Wednesday 03 December 2008, Bartlomiej Zolnierkiewicz wrote:
> > [...]
>
> I just went straight to trying downgrading the driver and the older driver
> indeed works fine. Then I tried to narrow down the problem and the lucky
> winner this time is the cute (== undocumented and unsigned-off) patch
> called radeon-6.9.0-remove-limit-heuristics.patch. The newer driver with
> only this patch reverted fixes hangs for vanilla kernels and drm errors
> for Fedora kernel. Also performance problems that I've noticed in the
> meantime (slower playback of 720p videos, sluggish window scrolling in
> kmail) are completely gone. That being said I'm not entirely sure whether

I was too quick here -- the performance problems are still present with
the _Fedora_ kernel.

To sum up: what I currently need to do to get my gfx working properly
with F10 is revert radeon-6.9.0-remove-limit-heuristics.patch in
xorg-x11-drv-ati and use a vanilla kernel instead of Fedora's.

> PS Since I was busy with debugging the problem I haven't noticed that you
> have released the new driver package (xorg-x11-drv-ati-6.9.0-59.fc10),
> however since it still contains the patch in question I don't think that
> it would help (anyway I'll test it tomorrow, I'm too tired now)...

It didn't help (as expected).

To add to the fun, I'm getting the following DRM oops with next-2008120[3,4]:

BUG: unable to handle kernel NULL pointer dereference at 00000144
IP: [<e0900741>] drm_addmap_core+0x548/0x561 [drm]
*pde = 00000000
Oops: 0000 [#1] PREEMPT SMP
last sysfs file: /sys/devices/pci0000:00/0000:00:01.0/0000:01:00.0/enable
Modules linked in: radeon(+) drm lib80211_crypt_tkip xt_state ipt_MASQUERADE iptable_nat nf_nat nf_conntrack_ipv4 nf_conntrack nf_defrag_ipv4 acerhk cpufreq_ondemand binfmt_misc snd_intel8x0m snd_intel8x0 snd_ac97_codec snd_seq_dummy ac97_bus snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd ipw2200 libipw soundcore snd_page_alloc uhci_hcd lib80211 parport_pc parport ehci_hcd

Pid: 1741, comm: modprobe Not tainted (2.6.28-rc7-next-20081204 #267) Extensa 2900
EIP: 0060:[<e0900741>] EFLAGS: 00213202 CPU: 0
EIP is at drm_addmap_core+0x548/0x561 [drm]
EAX: 00000000 EBX: 00000000 ECX: 00000000 EDX: d9c11400
ESI: d9f56900 EDI: d9cd57c0 EBP: e0010000 ESP: d9ee8ea4
DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
Process modprobe (pid: 1741, ti=d9ee8000 task=d9cbd470 task.ti=d9ee8000)
Stack:
d9cd57c0 00010000 d9c11400 00000002 d9c11420 d9cd57c8 d9c114d4 d9c114d4
d9c114e0 00010000 d9ee8eec d9c11000 d9c11344 d9c11400 e09007c2 00000001
00000082 d9ee8eec d9ee8ef4 00010000 e0a1774c 00000001 00000082 d9c11340
Call Trace:
[<e09007c2>] drm_addmap+0x14/0x2e [drm]
[<e0a1774c>] radeon_driver_load+0xef/0x15a [radeon]
[<e0904f43>] drm_get_dev+0x240/0x4ab [drm]
[<c01e48af>] kobject_get+0xf/0x13
[<e09015ed>] drm_init+0x5a/0x89 [drm]
[<e08a8000>] radeon_init+0x0/0x14 [radeon]
[<c010112c>] _stext+0x44/0x108
[<c01430a8>] sys_init_module+0x87/0x174
[<c0102eb1>] sysenter_do_call+0x12/0x25
[<c0310000>] rwsem_down_failed_common+0x7f/0x16a
Code: ea a0 df eb 35 8b 3c 24 8b 47 10 c7 47 1c 00 00 00 00 c1 e0 0c 89 47 18 8b 44 24 10 e8 30 ea a0 df 8b 54 24 08 8b 82 b0 02 00 00 <8b> 80 44 01 00 00 89 47 20 8b 4c 24 44 89 39 83 c4 28 89 d8 5b
EIP: [<e0900741>] drm_addmap_core+0x548/0x561 [drm] SS:ESP 0068:d9ee8ea4
---[ end trace 06bac8f3f2edd26f ]---

which I think may be caused by:

commit c2f29f764c0daa0084674d4a463e7158ac5c4dc4
Author: Dave Airlie <[email protected]>
Date: Fri Nov 28 14:22:24 2008 +1000

drm: move to kref per-master structures.

however I haven't verified it yet.

Thanks,
Bart

Subject: Re: vanilla kernels hang randomly under Fedora 10 on system with Radeon card

On Thursday 04 December 2008, Bartlomiej Zolnierkiewicz wrote:
> On Thursday 04 December 2008, Bartlomiej Zolnierkiewicz wrote:
> > [...]
>
> I was too quick here -- performance problems are still present with
> _Fedora_ kernel.
>
> Reassuming: what I currently need to do to get my gfx working properly
> with F10 is reverting radeon-6.9.0-remove-limit-heuristics.patch from
> xorg-x11-drv-ati and using vanilla kernel instead of Fedora's one.

Heh, and it just hung on me after I sent the above mail (it took about
an hour for the hang to occur) => the patch is just a very good trigger for
the "real" bug. I'll now be running vanilla 6.9.0 to see how it goes...

Thanks,
Bart

Subject: next-2008120[3,4] drm oops (was Re: vanilla kernels hang randomly under Fedora 10 on system with Radeon card)

On Thursday 04 December 2008, Bartlomiej Zolnierkiewicz wrote:

[...]

> To add more fun I'm getting following DRM oops with next-2008120[3,4]:

Here is a refreshed oops (I needed to tweak/rebuild the kernel):

BUG: unable to handle kernel NULL pointer dereference at 00000144
IP: [<c0247371>] drm_addmap_core+0x548/0x561
*pde = 00000000
Oops: 0000 [#1] PREEMPT SMP
last sysfs file: /sys/devices/pci0000:00/0000:00:01.0/0000:01:00.0/enable
Modules linked in: radeon(+) lib80211_crypt_tkip xt_state ipt_MASQUERADE iptable_nat nf_nat nf_conntrack_ipv4 nf_conntrack nf_defrag_ipv4 acerhk cpufreq_ondemand binfmt_misc snd_intel8x0 snd_intel8x0m snd_ac97_codec snd_seq_dummy ac97_bus snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device snd_pcm_oss snd_mixer_oss snd_pcm ipw2200 snd_timer libipw snd soundcore snd_page_alloc lib80211 ehci_hcd uhci_hcd parport_pc parport

Pid: 1740, comm: modprobe Not tainted (2.6.28-rc7-next-20081204 #268) Extensa 2900
EIP: 0060:[<c0247371>] EFLAGS: 00213202 CPU: 0
EIP is at drm_addmap_core+0x548/0x561
EAX: 00000000 EBX: 00000000 ECX: 00000000 EDX: da1dec00
ESI: da2baac0 EDI: da177a80 EBP: e0010000 ESP: da2c1ea4
DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
Process modprobe (pid: 1740, ti=da2c1000 task=df8741b0 task.ti=da2c1000)
Stack:
da177a80 00010000 da1dec00 00000002 da1dec20 da177a88 da1decd4 da1decd4
da1dece0 00010000 da2c1eec da285800 da285b44 da1dec00 c02473f2 00000001
00000082 da2c1eec da2c1ef4 00010000 e085674c 00000001 00000082 da285b40
Call Trace:
[<c02473f2>] drm_addmap+0x14/0x2e
[<e085674c>] radeon_driver_load+0xef/0x15a [radeon]
[<c024bb73>] drm_get_dev+0x240/0x4ab
[<c01e48af>] kobject_get+0xf/0x13
[<c024821d>] drm_init+0x5a/0x89
[<e0832000>] radeon_init+0x0/0x14 [radeon]
[<c010112c>] _stext+0x44/0x108
[<c01430a8>] sys_init_module+0x87/0x174
[<c0102eb1>] sysenter_do_call+0x12/0x25
[<c0310000>] rtl8139_init_one+0x685/0x85e
Code: 12 0d 00 eb 35 8b 3c 24 8b 47 10 c7 47 1c 00 00 00 00 c1 e0 0c 89 47 18 8b 44 24 10 e8 a8 12 0d 00 8b 54 24 08 8b 82 b0 02 00 00 <8b> 80 44 01 00 00 89 47 20 8b 4c 24 44 89 39 83 c4 28 89 d8 5b
EIP: [<c0247371>] drm_addmap_core+0x548/0x561 SS:ESP 0068:da2c1ea4
---[ end trace b2c7f2a062698806 ]---

[...]

> which I think may be caused by:
>
> commit c2f29f764c0daa0084674d4a463e7158ac5c4dc4
> Author: Dave Airlie <[email protected]>
> Date: Fri Nov 28 14:22:24 2008 +1000
>
> drm: move to kref per-master structures.
>
> however I haven't verified it yet.

It is confirmed now: reverting commit

c2f29f764c0daa0084674d4a463e7158ac5c4dc4
("drm: move to kref per-master structures.")

and the commit that depends on it,

21680220acd264620d7172ed99868bf580ecf0d4
("drm: fix leak of uninitialized data to userspace")

fixes the oops and re-enables DRI.


The oops itself happens when loading the 'radeon' module and is caused by
the dev->primary->master dereference (with dev->primary being NULL) in the
following chunk:

@@ -319,6 +319,7 @@ static int drm_addmap_core(struct drm_device * dev, unsigned
list->user_token = list->hash.key << PAGE_SHIFT;
mutex_unlock(&dev->struct_mutex);

+ list->master = dev->primary->master;
*maplist = list;
return 0;
}

Debug data:

$ gdb vmlinux

(gdb) l *0xc0247371
0xc0247371 is in drm_addmap_core (drivers/gpu/drm/drm_bufs.c:322).
317 }
318
319 list->user_token = list->hash.key << PAGE_SHIFT;
320 mutex_unlock(&dev->struct_mutex);
321
322 list->master = dev->primary->master;
323 *maplist = list;
324 return 0;
325 }
326

$ objdump -d drivers/gpu/drm/drm_bufs.o
...
00001fb9 <drm_addmap_core>:
...
24f2: e8 fc ff ff ff call 24f3 <drm_addmap_core+0x53a>
24f7: 8b 54 24 08 mov 0x8(%esp),%edx
24fb: 8b 82 b0 02 00 00 mov 0x2b0(%edx),%eax
-> 2501: 8b 80 44 01 00 00 mov 0x144(%eax),%eax
2507: 89 47 20 mov %eax,0x20(%edi)
...

Hope this helps.

Thanks,
Bart

Subject: Re: vanilla kernels hang randomly under Fedora 10 on system with Radeon card

On Thursday 04 December 2008, Bartlomiej Zolnierkiewicz wrote:
> On Thursday 04 December 2008, Bartlomiej Zolnierkiewicz wrote:
> > [...]
>
> Heh, and it just hang on me after sending the above mail (it took like
> 1h or so for hang to occur) => the patch is just a very good trigger for
> the "real" bug. I'll now be running vanilla 6.9.0 to see how it goes...

It went well. "Vanilla" in this case was the xorg-x11-drv-ati-6.9.0-54.fc10
content _without_ radeon-modeset.patch and _with_ a patch containing commit
da021c36bbdf3bca31ee50ebe01cdb9495c09b36 ("radeon_drm.h: remove kernel
defines") from the xf86-video-ati git tree (needed to make things compile).

I tried to bisect it further using the radeon-gem-cs branch (using the edge
commit deduced from radeon-modeset.patch) and managed to narrow it down to
somewhere between

commit 44fb767aa95e5f0725386106b89d0782fd53b768
("radeon: fixup modesetting code after rebasing to master")

and

commit 12e71eaf7999520d23d50cfbcfc0299b2bdf7a9d
("port to using drm header files")

which left 66 commits that are completely unbisectable because of build
problems and bugfixes. I tried exporting the commits from git to patches,
importing the patches into quilt and shuffling them around to make things
bisectable again... Unfortunately this turned out to be more time-consuming
than expected and I ran out of time for this exercise...

Dave, do you have any ideas how this can be debugged further?
(e.g. rebuilding the radeon-gem-cs tree so that it is bisectable would
greatly help)

Or is it not worth pursuing until some updates/fixes have been tried first?

Thanks,
Bart

2008-12-08 00:17:53

by Dave Airlie

Subject: Re: next-2008120[3,4] drm oops (was Re: vanilla kernels hang randomly under Fedora 10 on system with Radeon card)

On Sat, Dec 6, 2008 at 4:50 AM, Bartlomiej Zolnierkiewicz
<[email protected]> wrote:
> On Thursday 04 December 2008, Bartlomiej Zolnierkiewicz wrote:
>
> [...]
>
>> To add more fun I'm getting the following DRM oops with next-2008120[3,4]:
>
> Here is a refreshed oops (I needed to tweak/rebuild the kernel):
>
> BUG: unable to handle kernel NULL pointer dereference at 00000144
> IP: [<c0247371>] drm_addmap_core+0x548/0x561
> *pde = 00000000
> Oops: 0000 [#1] PREEMPT SMP
> last sysfs file: /sys/devices/pci0000:00/0000:00:01.0/0000:01:00.0/enable
> Modules linked in: radeon(+) lib80211_crypt_tkip xt_state ipt_MASQUERADE iptable_nat nf_nat nf_conntrack_ipv4 nf_conntrack nf_defrag_ipv4 acerhk cpufreq_ondemand binfmt_misc snd_intel8x0 snd_intel8x0m snd_ac97_codec snd_seq_dummy ac97_bus snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device snd_pcm_oss snd_mixer_oss snd_pcm ipw2200 snd_timer libipw snd soundcore snd_page_alloc lib80211 ehci_hcd uhci_hcd parport_pc parport
>
> Pid: 1740, comm: modprobe Not tainted (2.6.28-rc7-next-20081204 #268) Extensa 2900
> EIP: 0060:[<c0247371>] EFLAGS: 00213202 CPU: 0
> EIP is at drm_addmap_core+0x548/0x561
> EAX: 00000000 EBX: 00000000 ECX: 00000000 EDX: da1dec00
> ESI: da2baac0 EDI: da177a80 EBP: e0010000 ESP: da2c1ea4
> DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
> Process modprobe (pid: 1740, ti=da2c1000 task=df8741b0 task.ti=da2c1000)
> Stack:
> da177a80 00010000 da1dec00 00000002 da1dec20 da177a88 da1decd4 da1decd4
> da1dece0 00010000 da2c1eec da285800 da285b44 da1dec00 c02473f2 00000001
> 00000082 da2c1eec da2c1ef4 00010000 e085674c 00000001 00000082 da285b40
> Call Trace:
> [<c02473f2>] drm_addmap+0x14/0x2e
> [<e085674c>] radeon_driver_load+0xef/0x15a [radeon]
> [<c024bb73>] drm_get_dev+0x240/0x4ab
> [<c01e48af>] kobject_get+0xf/0x13
> [<c024821d>] drm_init+0x5a/0x89
> [<e0832000>] radeon_init+0x0/0x14 [radeon]
> [<c010112c>] _stext+0x44/0x108
> [<c01430a8>] sys_init_module+0x87/0x174
> [<c0102eb1>] sysenter_do_call+0x12/0x25
> [<c0310000>] rtl8139_init_one+0x685/0x85e
> Code: 12 0d 00 eb 35 8b 3c 24 8b 47 10 c7 47 1c 00 00 00 00 c1 e0 0c 89 47 18 8b 44 24 10 e8 a8 12 0d 00 8b 54 24 08 8b 82 b0 02 00 00 <8b> 80 44 01 00 00 89 47 20 8b 4c 24 44 89 39 83 c4 28 89 d8 5b
> EIP: [<c0247371>] drm_addmap_core+0x548/0x561 SS:ESP 0068:da2c1ea4
> ---[ end trace b2c7f2a062698806 ]---
>
> [...]
>
>> which I think may be caused by:
>>
>> commit c2f29f764c0daa0084674d4a463e7158ac5c4dc4
>> Author: Dave Airlie <[email protected]>
>> Date: Fri Nov 28 14:22:24 2008 +1000
>>
>> drm: move to kref per-master structures.
>>
>> however I haven't verified it yet.


Thanks for that, I've pushed the fix into drm-next
(d5de2d1a3a88628396c895410ae9e06f732d6591)
which was to reorganise the startup sequence so things happened in a
more correct order.

Let me know if it still breaks.

Dave.
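The faulting instruction can be read directly from the "Code:" line of the oops quoted above, where <8b> marks EIP. The bytes 8b 80 44 01 00 00 encode mov eax,[eax+0x144], and with EAX = 0 that load touches address 0x144, the address in the BUG line. The instruction just before it, 8b 82 b0 02 00 00, is mov eax,[edx+0x2b0], so a pointer stored at offset 0x2b0 of the object in EDX (da1dec00) was NULL. The minimal C sketch below decodes only this one instruction form (opcode 8b with a 32-bit displacement and no SIB byte). `decode_mov_load()` is a hypothetical helper, and which structure members the offsets 0x2b0 and 0x144 correspond to is not determined here.

```c
#include <assert.h>
#include <stdint.h>

/* ModRM register numbers: 0 = eax, 1 = ecx, 2 = edx, 3 = ebx,
 * 4 = esp, 5 = ebp, 6 = esi, 7 = edi. */

/* Decodes only "mov r32, [base + disp32]": opcode 0x8b, a ModRM byte with
 * mod == 10b and no SIB byte, then a little-endian 32-bit displacement.
 * Returns 0 on success, -1 for any other encoding. */
static int decode_mov_load(const uint8_t code[6], int *dst, int *base,
                           uint32_t *disp)
{
    uint8_t modrm;

    assert(code && dst && base && disp);
    modrm = code[1];
    /* mod must be 10b (disp32 follows); rm == 100b would mean a SIB byte */
    if (code[0] != 0x8b || (modrm >> 6) != 2 || (modrm & 7) == 4)
        return -1;
    *dst = (modrm >> 3) & 7;    /* reg field: destination register */
    *base = modrm & 7;          /* rm field: base register */
    *disp = (uint32_t)code[2] | (uint32_t)code[3] << 8 |
            (uint32_t)code[4] << 16 | (uint32_t)code[5] << 24;
    return 0;
}
```

Applied to the oops, decoding 8b 80 44 01 00 00 yields dst = 0 (eax), base = 0 (eax) and disp = 0x144; adding the EAX value of 0 gives the faulting address 0x144.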

Subject: Re: next-2008120[3,4] drm oops (was Re: vanilla kernels hang randomly under Fedora 10 on system with Radeon card)

On Monday 08 December 2008, Dave Airlie wrote:
> On Sat, Dec 6, 2008 at 4:50 AM, Bartlomiej Zolnierkiewicz
> <[email protected]> wrote:
> > On Thursday 04 December 2008, Bartlomiej Zolnierkiewicz wrote:
> >
> > [...]
> >
> >> To add more fun I'm getting the following DRM oops with next-2008120[3,4]:
> >
> > [...]
> >
> >> which I think may be caused by:
> >>
> >> commit c2f29f764c0daa0084674d4a463e7158ac5c4dc4
> >> Author: Dave Airlie <[email protected]>
> >> Date: Fri Nov 28 14:22:24 2008 +1000
> >>
> >> drm: move to kref per-master structures.
> >>
> >> however I haven't verified it yet.
>
>
> Thanks for that, I've pushed the fix into drm-next
> (d5de2d1a3a88628396c895410ae9e06f732d6591)
> which was to reorganise the startup sequence so things happened in a
> more correct order.
>
> Let me know if it still breaks.

I've just tested next-20081209 and it is all good now. Thanks!

Subject: Re: next-2008120[3,4] drm oops (was Re: vanilla kernels hang randomly under Fedora 10 on system with Radeon card)

On Wednesday 10 December 2008, Bartlomiej Zolnierkiewicz wrote:
> On Monday 08 December 2008, Dave Airlie wrote:
> > On Sat, Dec 6, 2008 at 4:50 AM, Bartlomiej Zolnierkiewicz
> > <[email protected]> wrote:
> > > On Thursday 04 December 2008, Bartlomiej Zolnierkiewicz wrote:
> > >
> > > [...]
> > >
> > >> To add more fun I'm getting the following DRM oops with next-2008120[3,4]:
> > >
> > > [...]
> > >
> > >> which I think may be caused by:
> > >>
> > >> commit c2f29f764c0daa0084674d4a463e7158ac5c4dc4
> > >> Author: Dave Airlie <[email protected]>
> > >> Date: Fri Nov 28 14:22:24 2008 +1000
> > >>
> > >> drm: move to kref per-master structures.
> > >>
> > >> however I haven't verified it yet.
> >
> >
> > Thanks for that, I've pushed the fix into drm-next
> > (d5de2d1a3a88628396c895410ae9e06f732d6591)
> > which was to reorganise the startup sequence so things happened in a
> > more correct order.
> >
> > Let me know if it still breaks.
>
> I've just tested next-20081209 and it is all good now. Thanks!

It turns out I was just "lucky" enough to try a -next release with the drm
tree dropped... ;-)

next-20081212 breaks X (== black screen on start, you can't do anything),
and reverting the proposed patch is impossible since other commits already
depend on it. I went back to next-20081208: the same problem with X happens
there, and reverting the patch fixes it.

Thanks,
Bart

Subject: Re: vanilla kernels hang randomly under Fedora 10 on system with Radeon card

On Sunday 07 December 2008, Bartlomiej Zolnierkiewicz wrote:
> On Thursday 04 December 2008, Bartlomiej Zolnierkiewicz wrote:
> > On Thursday 04 December 2008, Bartlomiej Zolnierkiewicz wrote:
> > > On Thursday 04 December 2008, Bartlomiej Zolnierkiewicz wrote:
> > > > On Wednesday 03 December 2008, Bartlomiej Zolnierkiewicz wrote:
> > > > > On Tuesday 02 December 2008, Dave Airlie wrote:
> > > > > > On Tue, Dec 2, 2008 at 8:42 AM, Bartlomiej Zolnierkiewicz
> > > > > > <[email protected]> wrote:
> > > > > > >
> > > > > > > Hi,
> > > > > > >
> > > > > > > After Fedora 9 -> Fedora 10 upgrade vanilla kernels which previously
> > > > > > > worked fine (next-20081128 and next-20081121) started to hang randomly
> > > > > > > on my Pentium M / 855PM / RV350 laptop. Since (surprisingly) stock
> > > > > > > Fedora kernel (2.6.27.5-117.fc10.i686) was not affected I got the idea
> > > > > > > that either userspace changes uncovered some kernel regression or some
> > > > > > > Fedora specific patch must be fixing the issue. Unfortunately vanilla
> > > > > > > 2.6.27 also froze, so after the usual pain caused by hitting a bunch of
> > > > > > > unrelated problems [1] it turned out that drm-modesetting-radeon.patch
> > > > > > > is the magic patch and CONFIG_DRM_RADEON_KMS is the magic change. With
> > > > > > > the patch and enabling the option next-20081128 works stable again...
> > > > > > >
> > > > > > > Since the following error gets logged by kernel:
> > > > > > >
> > > > > > > [drm:drm_buffer_object_validate] *ERROR* Failed moving buffer. cef578c0 1444 4000027 10000a0
> > > > > > > [drm:drm_buffer_object_validate] *ERROR* Out of aperture space or DRM memory quota.
> > > > > > >
> > > > > > > and it also seems that system is more responsive now (it was kind of
> > > > > > > sluggish previously) my draft theory is that F9 -> F10 triggered some
> > > > > > > AGP memory management bug and CONFIG_DRM_RADEON_KMS happens to fix it
> > > > > > > but I'll leave figuring this out to the more knowledgeable people... ;)
> > > > > >
> > > > > > Well, KMS is a purely Fedora thing, and enabling it completely avoids
> > > > > > the old driver codepaths, so
> > > > > > while it might fix it, it's more by accident than design.
> > > > > >
> > > > > > I'm trying to track down the rv3xx hangs with hpa at the moment as he
> > > > > > sees them also, something in
> > > > > > the 2.6.26->2.6.27 timeframe. I'm hoping that running the 2.6.26 drm on
> > > > > > 2.6.27 will help narrow it down.
> > > > > >
> > > > > > Bisecting 2.6.26->2.6.27 might also help.
> > > > >
> > > > > It could be a different issue. I tried 2.6.26, 2.6.25 and 2.6.24
> > > > > and they all hang (they all worked fine with Fedora 9)...
> > > > >
> > > > > I will try some older kernels, but I'm starting to think that the xorg ati
> > > > > driver update is the main cause (xorg-x11-drv-ati-6.8.0-19.fc9.i386.rpm
> > > > > -> xorg-x11-drv-ati-6.9.0-54.fc10.i386.rpm).
> > > >
> > > > I went straight to downgrading the driver, and the older driver
> > > > indeed works fine. Then I tried to narrow down the problem and the lucky
> > > > winner this time is the cute (== undocumented and unsigned-off) patch
> > > > called radeon-6.9.0-remove-limit-heuristics.patch. The newer driver with
> > > > only this patch reverted fixes hangs for vanilla kernels and drm errors
> > > > for Fedora kernel. Also performance problems that I've noticed in the
> > > > meantime (slower playback of 720p videos, sluggish window scrolling in
> > > > kmail) are completely gone. That being said I'm not entirely sure whether
> > >
> > > I was too quick here -- performance problems are still present with
> > > _Fedora_ kernel.
> > >
> > > To summarize: what I currently need to do to get my gfx working properly
> > > with F10 is to revert radeon-6.9.0-remove-limit-heuristics.patch from
> > > xorg-x11-drv-ati and to use a vanilla kernel instead of Fedora's.
> >
> > Heh, and it just hung on me right after I sent the above mail (it took
> > about 1h for the hang to occur) => the patch is just a very good trigger
> > for the "real" bug. I'll now be running vanilla 6.9.0 to see how it goes...
>
> It went well. "Vanilla" in this case was xorg-x11-drv-ati-6.9.0-54.fc10
> content _without_ radeon-modeset.patch and _with_ a patch containing commit
> da021c36bbdf3bca31ee50ebe01cdb9495c09b36 ("radeon_drm.h: remove kernel
> defines") from the xf86-video-ati git tree (needed to make things compile).
>
> I tried to bisect it further using the radeon-gem-cs branch (starting from
> the edge commit deduced from radeon-modeset.patch) and managed to narrow it
> down to somewhere between
>
> commit 44fb767aa95e5f0725386106b89d0782fd53b768
> ("radeon: fixup modesetting code after rebasing to master")
>
> and
>
> commit 12e71eaf7999520d23d50cfbcfc0299b2bdf7a9d
> ("port to using drm header files")
>
> which left 66 commits that are completely unbisectable because of build
> problems and bugfixes. I tried to continue by exporting the commits from
> git to patches, importing the patches into quilt and shuffling them around
> to make things bisectable again... Unfortunately this turned out to be more
> time consuming than expected and I ran out of time for this exercise...
>
> Dave, do you have any ideas how this can be debugged further?
> (e.g. rebuilding the radeon-gem-cs tree so that it is bisectable would
> greatly help)
>
> Or is it not worth pursuing until some updates/fixes have been tried first?

FWIW all issues are still there with kernel-2.6.27.7-134.fc10.i686
and xorg-x11-drv-ati-6.9.0-61.fc10.i386.

Anyway, I got a bit impatient with the lack of follow-up on this and
resumed the "rebuild radeon-gem-cs bisectability" operation (luckily the
problem had already been narrowed down to the second part of the changes,
so I didn't have to do 100% of the work)...

Hangs seem to be caused by commit 5c5736604e6a1bc280821bd92f3714e0c9e7d7d3
("radeon: no need for this anymore"):

--- a/src/radeon_driver.c
+++ b/src/radeon_driver.c
@@ -1621,9 +1621,7 @@ static Bool RADEONPreInitVRAM(ScrnInfoPtr pScrn)
 
     pScrn->videoRam &= ~1023;
 
-    /* half video RAM for TTM */
     info->FbMapSize = pScrn->videoRam * 1024;
-    info->FbMapSize /= 2;
 
     /* if the card is PCI Express reserve the last 32k for the gart table */
 #ifdef XF86DRI

I'm now running Fedora's xorg-x11-drv-ati with only the above patch reverted,
and so far it is rock stable.
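To make the effect of commit 5c5736604e6a concrete, here is a minimal C sketch of the FbMapSize arithmetic from the RADEONPreInitVRAM() hunk, with and without the "half video RAM for TTM" step that the commit removed and the revert restores. `fb_map_size()` and its parameters are hypothetical, and the 64 MiB value used below is only an example, not the amount of VRAM on this laptop. The sketch shows how much of the framebuffer gets mapped; it does not explain why the larger mapping leads to hangs, which the thread does not establish.

```c
#include <assert.h>

/* Hypothetical helper mirroring the FbMapSize arithmetic in
 * RADEONPreInitVRAM(). video_ram_kb plays the role of pScrn->videoRam
 * (in KiB); halve_for_ttm selects the behavior before commit
 * 5c5736604e6a ("radeon: no need for this anymore"). */
static unsigned long fb_map_size(int video_ram_kb, int halve_for_ttm)
{
    unsigned long size;

    assert(video_ram_kb >= 0);
    video_ram_kb &= ~1023;                        /* round down to whole MiB */
    size = (unsigned long)video_ram_kb * 1024;    /* KiB -> bytes */
    if (halve_for_ttm)
        size /= 2;                                /* "half video RAM for TTM" */
    return size;
}
```

For a card reporting 65536 KiB, fb_map_size(65536, 1) gives 33554432 bytes (the behavior restored by the revert), while fb_map_size(65536, 0) gives 67108864 bytes (the behavior after the commit).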

Thanks,
Bart