2010-11-08 17:02:28

by Markus Trippelsdorf

Subject: Radeon RS780 - BUG: unable to handle kernel NULL pointer dereference

I can trigger a kernel crash on my system by simply loading this png
image with firefox:
http://mediaarchive.cern.ch/MediaArchive/Photo/Public/2010/1011251/1011251_01/1011251_01-A4-at-144-dpi.jpg

The system has an embedded RS780 and is running the latest git kernel.
(Xorg.0.log is attached)

The crash looks as follows:

Nov 8 17:37:21 arch kernel: BUG: unable to handle kernel NULL pointer dereference at 0000000000000020
Nov 8 17:37:21 arch kernel: IP: [<ffffffff81449f1f>] _raw_write_lock+0xf/0x20
Nov 8 17:37:21 arch kernel: PGD 11bf20067 PUD 11bfa7067 PMD 0
Nov 8 17:37:21 arch kernel: Oops: 0002 [#1] PREEMPT SMP
Nov 8 17:37:21 arch kernel: last sysfs file: /sys/devices/pci0000:00/0000:00:18.3/temp1_input
Nov 8 17:37:21 arch kernel: CPU 0
Nov 8 17:37:21 arch kernel: Pid: 1502, comm: X Not tainted 2.6.37-rc1-00116-g151f52f-dirty #31 M4A78T-E/System Product Name
Nov 8 17:37:21 arch kernel: RIP: 0010:[<ffffffff81449f1f>] [<ffffffff81449f1f>] _raw_write_lock+0xf/0x20
Nov 8 17:37:21 arch kernel: RSP: 0018:ffff88011b523cc0 EFLAGS: 00010202
Nov 8 17:37:21 arch kernel: RAX: ffff88011b523fd8 RBX: 0000000000000020 RCX: 00000000ffffffff
Nov 8 17:37:22 arch kernel: RDX: 00000000ffffffff RSI: ffffffff8120a6f0 RDI: 0000000000000020
Nov 8 17:37:22 arch kernel: RBP: ffff880113f39c48 R08: 0000000000000006 R09: 0000000000000006
Nov 8 17:37:22 arch kernel: R10: 0000000000000006 R11: 0000000000000006 R12: 0000000000000071
Nov 8 17:37:22 arch kernel: R13: ffff8800c07ffb40 R14: 0000000040086409 R15: 00000000fffffff2
Nov 8 17:37:22 arch kernel: FS: 00007f3786cdc700(0000) GS:ffff8800dfc00000(0000) knlGS:0000000000000000
Nov 8 17:37:22 arch kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Nov 8 17:37:22 arch kernel: CR2: 0000000000000020 CR3: 000000011f60a000 CR4: 00000000000006f0
Nov 8 17:37:22 arch kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Nov 8 17:37:22 arch kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Nov 8 17:37:22 arch kernel: Process X (pid: 1502, threadinfo ffff88011b522000, task ffff88011cc3d460)
Nov 8 17:37:22 arch kernel: Stack:
Nov 8 17:37:22 arch kernel: ffffffff8121cbb8 0000000000000292 ffff88011ffabbc0 ffff88011b523d20
Nov 8 17:37:22 arch kernel: ffffffff81252a92 0000000000000296 0000000000000000 ffff88011d9410a8
Nov 8 17:37:22 arch kernel: ffff8800c07ffb40 ffffffff8120a6f0 ffffffff8126711e ffff88011f632a90
Nov 8 17:37:22 arch kernel: Call Trace:
Nov 8 17:37:22 arch kernel: [<ffffffff8121cbb8>] ? ttm_bo_unref+0x28/0x50
Nov 8 17:37:22 arch kernel: [<ffffffff81252a92>] ? radeon_bo_unref+0x42/0x80
Nov 8 17:37:22 arch kernel: [<ffffffff8120a6f0>] ? drm_gem_object_free+0x0/0x30
Nov 8 17:37:22 arch kernel: [<ffffffff8126711e>] ? radeon_gem_object_free+0x2e/0x50
Nov 8 17:37:22 arch kernel: [<ffffffff81183493>] ? kref_put+0x33/0x70
Nov 8 17:37:22 arch kernel: [<ffffffff8120aeb0>] ? drm_gem_close_ioctl+0xc0/0xf0
Nov 8 17:37:22 arch kernel: [<ffffffff8120963c>] ? drm_ioctl+0x39c/0x450
Nov 8 17:37:22 arch kernel: [<ffffffff8120adf0>] ? drm_gem_close_ioctl+0x0/0xf0
Nov 8 17:37:22 arch kernel: [<ffffffff810cd80f>] ? do_sync_read+0xbf/0x100
Nov 8 17:37:22 arch kernel: [<ffffffff810dd2c9>] ? do_vfs_ioctl+0xa9/0x610
Nov 8 17:37:22 arch kernel: [<ffffffff810dd879>] ? sys_ioctl+0x49/0x80
Nov 8 17:37:22 arch kernel: [<ffffffff810ce24e>] ? sys_read+0x4e/0x90
Nov 8 17:37:22 arch kernel: [<ffffffff8102dc2b>] ? system_call_fastpath+0x16/0x1b
Nov 8 17:37:22 arch kernel: Code: 83 c4 08 c3 e8 f3 dd ff ff 31 c0 eb f2 66 66 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 65 48 8b 04 25 c8 b6 00 00 ff 80 44 e0 ff ff <f0> 81 2f 00 00 00 01 74 05 e8 83 ff d3 ff c3 66 90 9c 58 fa 65
Nov 8 17:37:22 arch kernel: RIP [<ffffffff81449f1f>] _raw_write_lock+0xf/0x20
Nov 8 17:37:22 arch kernel: RSP <ffff88011b523cc0>
Nov 8 17:37:22 arch kernel: CR2: 0000000000000020
Nov 8 17:37:22 arch kernel: BUG: unable to handle kernel NULL pointer dereference at 0000000000000088
Nov 8 17:37:22 arch kernel: IP: [<ffffffff81449b84>] _raw_spin_lock+0x14/0x30
Nov 8 17:37:22 arch kernel: PGD 11bf20067 PUD 11bfa7067 PMD 0
Nov 8 17:37:22 arch kernel: Oops: 0002 [#2] PREEMPT SMP
Nov 8 17:37:22 arch kernel: last sysfs file: /sys/devices/pci0000:00/0000:00:18.3/temp1_input
Nov 8 17:37:22 arch kernel: CPU 0
Nov 8 17:37:22 arch kernel: Pid: 1502, comm: X Not tainted 2.6.37-rc1-00116-g151f52f-dirty #31 M4A78T-E/System Product Name
Nov 8 17:37:22 arch kernel: RIP: 0010:[<ffffffff81449b84>] [<ffffffff81449b84>] _raw_spin_lock+0x14/0x30
Nov 8 17:37:22 arch kernel: RSP: 0018:ffff88011b523660 EFLAGS: 00010002
Nov 8 17:37:22 arch kernel: RAX: 0000000000000100 RBX: ffff88011ff2c048 RCX: 0000000000000000
Nov 8 17:37:22 arch kernel: RDX: 0000000000000000 RSI: 0000000000000001 RDI: 0000000000000088
Nov 8 17:37:22 arch kernel: RBP: 0000000000000088 R08: 0000000000000000 R09: ffffffff816a0a00
Nov 8 17:37:22 arch kernel: R10: 0000000000000000 R11: 0000000000000002 R12: 0000000000000001
Nov 8 17:37:22 arch kernel: R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
Nov 8 17:37:22 arch kernel: FS: 00007f3786cdc700(0000) GS:ffff8800dfc00000(0000) knlGS:0000000000000000
Nov 8 17:37:22 arch kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Nov 8 17:37:22 arch kernel: CR2: 0000000000000088 CR3: 000000011f60a000 CR4: 00000000000006f0
Nov 8 17:37:22 arch kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Nov 8 17:37:22 arch kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Nov 8 17:37:22 arch kernel: Process X (pid: 1502, threadinfo ffff88011b522000, task ffff88011cc3d460)
Nov 8 17:37:22 arch kernel: Stack:
Nov 8 17:37:22 arch kernel: ffffffff8121c97f 0000000000000000 ffff880100000000 ffff88011ffaa000
Nov 8 17:37:22 arch kernel: ffff88011ff99000 ffff88011f67beb8 ffff88011ff2c000 ffff88011fcf6cc0
Nov 8 17:37:22 arch kernel: ffffffff8124540c ffffffff00000028 ffff88011b523708 ffff88011ff2c048
Nov 8 17:37:22 arch kernel: Call Trace:
Nov 8 17:37:22 arch kernel: [<ffffffff8121c97f>] ? ttm_bo_reserve+0x2f/0x120
Nov 8 17:37:22 arch kernel: [<ffffffff8124540c>] ? avivo_crtc_do_set_base+0x6c/0x8e0
Nov 8 17:37:22 arch kernel: [<ffffffff812044da>] ? drm_crtc_helper_set_config+0x72a/0x8c0
Nov 8 17:37:22 arch kernel: [<ffffffff812027f4>] ? drm_fb_helper_pan_display+0x84/0xc0
Nov 8 17:37:22 arch kernel: [<ffffffff8119efad>] ? fb_pan_display+0xad/0x140
Nov 8 17:37:22 arch kernel: [<ffffffff811b1d85>] ? ccw_update_start+0x45/0x70
Nov 8 17:37:22 arch kernel: [<ffffffff811abdbd>] ? fbcon_switch+0x44d/0x5f0
Nov 8 17:37:22 arch kernel: [<ffffffff811f6961>] ? redraw_screen+0x181/0x270
Nov 8 17:37:22 arch kernel: [<ffffffff811aa652>] ? fbcon_blank+0x232/0x2e0
Nov 8 17:37:22 arch kernel: [<ffffffff8105d6b7>] ? release_console_sem+0x1a7/0x1f0
Nov 8 17:37:22 arch kernel: [<ffffffff81447163>] ? printk+0x40/0x45
Nov 8 17:37:22 arch kernel: [<ffffffff81067f93>] ? lock_timer_base.clone.25+0x33/0x70
Nov 8 17:37:22 arch kernel: [<ffffffff810683d0>] ? mod_timer+0x130/0x210
Nov 8 17:37:22 arch kernel: [<ffffffff811f8136>] ? do_unblank_screen+0xa6/0x1a0
Nov 8 17:37:22 arch kernel: [<ffffffff8118ad0d>] ? bust_spinlocks+0x1d/0x40
Nov 8 17:37:22 arch kernel: [<ffffffff81031f79>] ? oops_end+0x39/0xe0
Nov 8 17:37:22 arch kernel: [<ffffffff8104aae5>] ? no_context+0xf5/0x260
Nov 8 17:37:22 arch kernel: [<ffffffff810ddf50>] ? __pollwait+0x0/0x110
Nov 8 17:37:22 arch kernel: [<ffffffff8104b41e>] ? do_page_fault+0x36e/0x410
Nov 8 17:37:22 arch kernel: [<ffffffff810de060>] ? pollwake+0x0/0x60
Nov 8 17:37:22 arch kernel: [<ffffffff810de060>] ? pollwake+0x0/0x60
Nov 8 17:37:22 arch kernel: [<ffffffff813ae4aa>] ? sock_wfree+0x4a/0x60
Nov 8 17:37:22 arch kernel: [<ffffffff81430323>] ? unix_destruct_scm+0x93/0xb0
Nov 8 17:37:22 arch kernel: [<ffffffff8144a40f>] ? page_fault+0x1f/0x30
Nov 8 17:37:22 arch kernel: [<ffffffff8120a6f0>] ? drm_gem_object_free+0x0/0x30
Nov 8 17:37:22 arch kernel: [<ffffffff81449f1f>] ? _raw_write_lock+0xf/0x20
Nov 8 17:37:22 arch kernel: [<ffffffff8121cbb8>] ? ttm_bo_unref+0x28/0x50
Nov 8 17:37:22 arch kernel: [<ffffffff81252a92>] ? radeon_bo_unref+0x42/0x80
Nov 8 17:37:22 arch kernel: [<ffffffff8120a6f0>] ? drm_gem_object_free+0x0/0x30
Nov 8 17:37:22 arch kernel: [<ffffffff8126711e>] ? radeon_gem_object_free+0x2e/0x50
Nov 8 17:37:22 arch kernel: [<ffffffff81183493>] ? kref_put+0x33/0x70
Nov 8 17:37:22 arch kernel: [<ffffffff8120aeb0>] ? drm_gem_close_ioctl+0xc0/0xf0
Nov 8 17:37:22 arch kernel: [<ffffffff8120963c>] ? drm_ioctl+0x39c/0x450
Nov 8 17:37:22 arch kernel: [<ffffffff8120adf0>] ? drm_gem_close_ioctl+0x0/0xf0
Nov 8 17:37:22 arch kernel: [<ffffffff810cd80f>] ? do_sync_read+0xbf/0x100
Nov 8 17:37:22 arch kernel: [<ffffffff810dd2c9>] ? do_vfs_ioctl+0xa9/0x610
Nov 8 17:37:22 arch kernel: [<ffffffff810dd879>] ? sys_ioctl+0x49/0x80
Nov 8 17:37:22 arch kernel: [<ffffffff810ce24e>] ? sys_read+0x4e/0x90
Nov 8 17:37:22 arch kernel: [<ffffffff8102dc2b>] ? system_call_fastpath+0x16/0x1b
Nov 8 17:37:22 arch kernel: Code: 4a 1c 48 8b 7c 24 08 e8 2b 85 c1 ff 31 c0 5b c3 0f 1f 80 00 00 00 00 65 48 8b 04 25 c8 b6 00 00 ff 80 44 e0 ff ff b8 00 01 00 00 <f0> 66 0f c1 07 38 e0 74 06 f3 90 8a 07 eb f6 c3 66 66 66 2e 0f
Nov 8 17:37:22 arch kernel: RIP [<ffffffff81449b84>] _raw_spin_lock+0x14/0x30
Nov 8 17:37:22 arch kernel: RSP <ffff88011b523660>
Nov 8 17:37:22 arch kernel: CR2: 0000000000000088
Nov 8 17:37:22 arch kernel: ---[ end trace f7be0a67c5c584c7 ]---
Nov 8 17:37:22 arch kernel: note: X[1502] exited with preempt_count 2
Nov 8 17:37:22 arch kernel: BUG: scheduling while atomic: X/1502/0x10000003
Nov 8 17:37:22 arch kernel: Pid: 1502, comm: X Tainted: G D 2.6.37-rc1-00116-g151f52f-dirty #31
Nov 8 17:37:22 arch kernel: Call Trace:
Nov 8 17:37:22 arch kernel: [<ffffffff81447ad9>] ? schedule+0x639/0x850
Nov 8 17:37:22 arch kernel: [<ffffffff8105826d>] ? __cond_resched+0x1d/0x30
Nov 8 17:37:22 arch kernel: [<ffffffff81447f2f>] ? _cond_resched+0x2f/0x40
Nov 8 17:37:22 arch kernel: [<ffffffff810b57fc>] ? unmap_vmas+0x82c/0x9c0
Nov 8 17:37:22 arch kernel: [<ffffffff810bcb62>] ? exit_mmap+0xe2/0x1a0
Nov 8 17:37:22 arch kernel: [<ffffffff8105a705>] ? mmput+0x25/0xc0
Nov 8 17:37:22 arch kernel: [<ffffffff8105e734>] ? exit_mm+0x104/0x130
Nov 8 17:37:22 arch kernel: [<ffffffff81449ca0>] ? _raw_spin_unlock_irq+0x10/0x30
Nov 8 17:37:22 arch kernel: [<ffffffff8106045a>] ? do_exit+0x5aa/0x760
Nov 8 17:37:22 arch kernel: [<ffffffff81447163>] ? printk+0x40/0x45
Nov 8 17:37:22 arch kernel: [<ffffffff8105e33c>] ? kmsg_dump+0x7c/0x150
Nov 8 17:37:22 arch kernel: [<ffffffff81031fda>] ? oops_end+0x9a/0xe0
Nov 8 17:37:22 arch kernel: [<ffffffff8104aae5>] ? no_context+0xf5/0x260
Nov 8 17:37:22 arch kernel: [<ffffffff8104b41e>] ? do_page_fault+0x36e/0x410
Nov 8 17:37:22 arch kernel: [<ffffffff8102c722>] ? __switch_to+0x1e2/0x2b0
Nov 8 17:37:22 arch kernel: [<ffffffff8118885e>] ? vsnprintf+0x46e/0x620
Nov 8 17:37:22 arch kernel: [<ffffffff81187957>] ? number.clone.2+0x2b7/0x2f0
Nov 8 17:37:22 arch kernel: [<ffffffff8144a40f>] ? page_fault+0x1f/0x30
Nov 8 17:37:22 arch kernel: [<ffffffff81449b84>] ? _raw_spin_lock+0x14/0x30
Nov 8 17:37:22 arch kernel: [<ffffffff8121c97f>] ? ttm_bo_reserve+0x2f/0x120
Nov 8 17:37:22 arch kernel: [<ffffffff8124540c>] ? avivo_crtc_do_set_base+0x6c/0x8e0
Nov 8 17:37:22 arch kernel: [<ffffffff812044da>] ? drm_crtc_helper_set_config+0x72a/0x8c0
Nov 8 17:37:22 arch kernel: [<ffffffff812027f4>] ? drm_fb_helper_pan_display+0x84/0xc0
Nov 8 17:37:22 arch kernel: [<ffffffff8119efad>] ? fb_pan_display+0xad/0x140
Nov 8 17:37:22 arch kernel: [<ffffffff811b1d85>] ? ccw_update_start+0x45/0x70
Nov 8 17:37:22 arch kernel: [<ffffffff811abdbd>] ? fbcon_switch+0x44d/0x5f0
Nov 8 17:37:22 arch kernel: [<ffffffff811f6961>] ? redraw_screen+0x181/0x270
Nov 8 17:37:22 arch kernel: [<ffffffff811aa652>] ? fbcon_blank+0x232/0x2e0
Nov 8 17:37:22 arch kernel: [<ffffffff8105d6b7>] ? release_console_sem+0x1a7/0x1f0
Nov 8 17:37:22 arch kernel: [<ffffffff81447163>] ? printk+0x40/0x45
Nov 8 17:37:22 arch kernel: [<ffffffff81067f93>] ? lock_timer_base.clone.25+0x33/0x70
Nov 8 17:37:22 arch kernel: [<ffffffff810683d0>] ? mod_timer+0x130/0x210
Nov 8 17:37:22 arch kernel: [<ffffffff811f8136>] ? do_unblank_screen+0xa6/0x1a0
Nov 8 17:37:22 arch kernel: [<ffffffff8118ad0d>] ? bust_spinlocks+0x1d/0x40
Nov 8 17:37:22 arch kernel: [<ffffffff81031f79>] ? oops_end+0x39/0xe0
Nov 8 17:37:22 arch kernel: [<ffffffff8104aae5>] ? no_context+0xf5/0x260
Nov 8 17:37:22 arch kernel: [<ffffffff810ddf50>] ? __pollwait+0x0/0x110
Nov 8 17:37:22 arch kernel: [<ffffffff8104b41e>] ? do_page_fault+0x36e/0x410
Nov 8 17:37:22 arch kernel: [<ffffffff810de060>] ? pollwake+0x0/0x60
Nov 8 17:37:22 arch kernel: [<ffffffff810de060>] ? pollwake+0x0/0x60
Nov 8 17:37:22 arch kernel: [<ffffffff813ae4aa>] ? sock_wfree+0x4a/0x60
Nov 8 17:37:22 arch kernel: [<ffffffff81430323>] ? unix_destruct_scm+0x93/0xb0
Nov 8 17:37:22 arch kernel: [<ffffffff8144a40f>] ? page_fault+0x1f/0x30
Nov 8 17:37:22 arch kernel: [<ffffffff8120a6f0>] ? drm_gem_object_free+0x0/0x30
Nov 8 17:37:22 arch kernel: [<ffffffff81449f1f>] ? _raw_write_lock+0xf/0x20
Nov 8 17:37:22 arch kernel: [<ffffffff8121cbb8>] ? ttm_bo_unref+0x28/0x50
Nov 8 17:37:22 arch kernel: [<ffffffff81252a92>] ? radeon_bo_unref+0x42/0x80
Nov 8 17:37:22 arch kernel: [<ffffffff8120a6f0>] ? drm_gem_object_free+0x0/0x30
Nov 8 17:37:22 arch kernel: [<ffffffff8126711e>] ? radeon_gem_object_free+0x2e/0x50
Nov 8 17:37:22 arch kernel: [<ffffffff81183493>] ? kref_put+0x33/0x70
Nov 8 17:37:22 arch kernel: [<ffffffff8120aeb0>] ? drm_gem_close_ioctl+0xc0/0xf0
Nov 8 17:37:22 arch kernel: [<ffffffff8120963c>] ? drm_ioctl+0x39c/0x450
Nov 8 17:37:22 arch kernel: [<ffffffff8120adf0>] ? drm_gem_close_ioctl+0x0/0xf0
Nov 8 17:37:22 arch kernel: [<ffffffff810cd80f>] ? do_sync_read+0xbf/0x100
Nov 8 17:37:22 arch kernel: [<ffffffff810dd2c9>] ? do_vfs_ioctl+0xa9/0x610
Nov 8 17:37:22 arch kernel: [<ffffffff810dd879>] ? sys_ioctl+0x49/0x80
Nov 8 17:37:22 arch kernel: [<ffffffff810ce24e>] ? sys_read+0x4e/0x90
Nov 8 17:37:22 arch kernel: [<ffffffff8102dc2b>] ? system_call_fastpath+0x16/0x1b
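A note for reading the first two oopses above: on x86 the number printed after "Oops:" is the page-fault error code, and fault addresses as small as 0x20 or 0x88 usually point to a member access through a NULL structure pointer. The sketch below decodes both under those assumptions. The function names and the one-page threshold are illustrative choices for this note, not kernel APIs.

```python
# Bit meanings of the x86 page-fault error code printed after "Oops:".
PF_PROT = 1 << 0    # 0: page not present, 1: protection violation
PF_WRITE = 1 << 1   # 0: read access,      1: write access
PF_USER = 1 << 2    # 0: kernel mode,      1: user mode
PF_RSVD = 1 << 3    # reserved bit set in a page-table entry
PF_INSTR = 1 << 4   # fault happened on an instruction fetch


def decode_pf_error(code):
    """Turn the numeric error code into a readable summary."""
    return {
        "cause": "protection violation" if code & PF_PROT else "page not present",
        "access": "write" if code & PF_WRITE else "read",
        "mode": "user" if code & PF_USER else "kernel",
        "reserved_bit": bool(code & PF_RSVD),
        "instruction_fetch": bool(code & PF_INSTR),
    }


def looks_like_null_deref(fault_addr, page_size=4096):
    """Heuristic (an assumption of this sketch): an address inside the
    first page is most likely NULL plus a small member offset."""
    return 0 <= fault_addr < page_size


# Values taken from the log above: "Oops: 0002", CR2 = 0x20 and 0x88.
first_oops = decode_pf_error(int("0002", 16))
small_addresses = [looks_like_null_deref(a) for a in (0x20, 0x88)]
```

Read this way, both oopses are kernel-mode writes to an unmapped page, which fits a lock operation (_raw_write_lock, _raw_spin_lock) on a lock embedded in a structure whose pointer was NULL.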


--
Markus


Attachments:
(No filename) (13.97 kB)
Xorg.0.log (37.12 kB)

2010-11-08 17:07:44

by Markus Trippelsdorf

Subject: Re: Radeon RS780 - BUG: unable to handle kernel NULL pointer dereference

On Mon, Nov 08, 2010 at 06:02:21PM +0100, Markus Trippelsdorf wrote:
> I can trigger a kernel crash on my system by simply loading this png
> image with firefox:
> http://mediaarchive.cern.ch/MediaArchive/Photo/Public/2010/1011251/1011251_01/1011251_01-A4-at-144-dpi.jpg

Sorry the above link is wrong, this is the right one (that triggers the
crash):
http://cdsweb.cern.ch/record/1305179/files/HI-150431-630470-huge.png
--
Markus

2010-11-08 18:43:15

by Markus Trippelsdorf

Subject: Re: Radeon RS780 - BUG: unable to handle kernel NULL pointer dereference

On Mon, Nov 08, 2010 at 06:07:37PM +0100, Markus Trippelsdorf wrote:
> On Mon, Nov 08, 2010 at 06:02:21PM +0100, Markus Trippelsdorf wrote:
> > I can trigger a kernel crash on my system by simply loading this png
> > image with firefox:
> > http://mediaarchive.cern.ch/MediaArchive/Photo/Public/2010/1011251/1011251_01/1011251_01-A4-at-144-dpi.jpg
>
> Sorry the above link is wrong, this is the right one (that triggers the
> crash):
> http://cdsweb.cern.ch/record/1305179/files/HI-150431-630470-huge.png

I triggered it a few more times and took the attached picture.
It points to the BUG() call at drivers/gpu/drm/ttm/ttm_bo.c:1628 .
(Sorry for the bad picture quality)
--
Markus


Attachments:
(No filename) (684.00 B)
ttm_BUG.jpg (430.20 kB)

2010-11-08 19:03:06

by Markus Trippelsdorf

Subject: Re: Radeon RS780 - BUG: unable to handle kernel NULL pointer dereference

On Mon, Nov 08, 2010 at 07:43:02PM +0100, Markus Trippelsdorf wrote:
> On Mon, Nov 08, 2010 at 06:07:37PM +0100, Markus Trippelsdorf wrote:
> > On Mon, Nov 08, 2010 at 06:02:21PM +0100, Markus Trippelsdorf wrote:
> > > I can trigger a kernel crash on my system by simply loading this png
> > > image with firefox:
> > > http://mediaarchive.cern.ch/MediaArchive/Photo/Public/2010/1011251/1011251_01/1011251_01-A4-at-144-dpi.jpg
> >
> > Sorry the above link is wrong, this is the right one (that triggers the
> > crash):
> > http://cdsweb.cern.ch/record/1305179/files/HI-150431-630470-huge.png
>
> I triggered it a few more times and took the attached picture.
> It points to the BUG() call at drivers/gpu/drm/ttm/ttm_bo.c:1628 .
> (Sorry for the bad picture quality)

And here the same BUG in plaintext (should be a bit easier to read):

Nov 8 19:28:23 arch kernel: ------------[ cut here ]------------
Nov 8 19:28:23 arch kernel: kernel BUG at drivers/gpu/drm/ttm/ttm_bo.c:1628!
Nov 8 19:28:23 arch kernel: invalid opcode: 0000 [#1] PREEMPT SMP
Nov 8 19:28:23 arch kernel: last sysfs file: /sys/devices/pci0000:00/0000:00:18.3/temp1_input
Nov 8 19:28:23 arch kernel: CPU 1
Nov 8 19:28:23 arch kernel: Pid: 1541, comm: X Not tainted 2.6.37-rc1-00116-g151f52f-dirty #31 M4A78T-E/System Product Name
Nov 8 19:28:23 arch kernel: RIP: 0010:[<ffffffff8121f0ff>] [<ffffffff8121f0ff>] ttm_bo_init+0x30f/0x340
Nov 8 19:28:23 arch kernel: RSP: 0018:ffff88011b0fbbe8 EFLAGS: 00010246
Nov 8 19:28:23 arch kernel: RAX: ffff8800da881778 RBX: ffff8800da881620 RCX: ffff88011b15ed78
Nov 8 19:28:23 arch kernel: RDX: ffff8800c1556040 RSI: ffff88011ff22770 RDI: 000000000017adfb
Nov 8 19:28:23 arch kernel: RBP: ffff8800da881648 R08: 0000000000000000 R09: ffff8800c1556040
Nov 8 19:28:23 arch kernel: R10: 000000000ff85205 R11: ffff8800dae19200 R12: 0000000000000001
Nov 8 19:28:23 arch kernel: R13: ffff88011ff22528 R14: ffff88011ff22778 R15: 0000000000000000
Nov 8 19:28:23 arch kernel: FS: 00007f2043043700(0000) GS:ffff8800dfc80000(0000) knlGS:0000000000000000
Nov 8 19:28:23 arch kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Nov 8 19:28:23 arch kernel: CR2: 00007f203d057000 CR3: 000000011b12b000 CR4: 00000000000006e0
Nov 8 19:28:23 arch kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Nov 8 19:28:23 arch kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Nov 8 19:28:23 arch kernel: Process X (pid: 1541, threadinfo ffff88011b0fa000, task ffff88011c959c20)
Nov 8 19:28:23 arch kernel: Stack:
Nov 8 19:28:23 arch kernel: 0000000000000000 ffff8800da881648 ffff88011b0fbd00 ffff8800da881600
Nov 8 19:28:23 arch kernel: ffff88011ff22000 0000000000000000 0000000000000001 00000000fffffff4
Nov 8 19:28:23 arch kernel: ffff88011b0fbd00 ffffffff8125294d 0000000000000000 ffffffff00000001
Nov 8 19:28:23 arch kernel: Call Trace:
Nov 8 19:28:23 arch kernel: [<ffffffff8125294d>] ? radeon_bo_create+0x14d/0x250
Nov 8 19:28:23 arch kernel: [<ffffffff812526c0>] ? radeon_ttm_bo_destroy+0x0/0xb0
Nov 8 19:28:23 arch kernel: [<ffffffff812671cc>] ? radeon_gem_object_create+0x8c/0x130
Nov 8 19:28:23 arch kernel: [<ffffffff81267634>] ? radeon_gem_create_ioctl+0x54/0xd0
Nov 8 19:28:23 arch kernel: [<ffffffff813ab26d>] ? sock_aio_read+0x10d/0x120
Nov 8 19:28:23 arch kernel: [<ffffffff8120963c>] ? drm_ioctl+0x39c/0x450
Nov 8 19:28:23 arch kernel: [<ffffffff812675e0>] ? radeon_gem_create_ioctl+0x0/0xd0
Nov 8 19:28:23 arch kernel: [<ffffffff810dd2c9>] ? do_vfs_ioctl+0xa9/0x610
Nov 8 19:28:23 arch kernel: [<ffffffff810dd879>] ? sys_ioctl+0x49/0x80
Nov 8 19:28:23 arch kernel: [<ffffffff810ce24e>] ? sys_read+0x4e/0x90
Nov 8 19:28:23 arch kernel: [<ffffffff8102dc2b>] ? system_call_fastpath+0x16/0x1b
Nov 8 19:28:23 arch kernel: Code: e8 fb ff ff 85 c0 0f 85 68 ff ff ff 48 8b 7c 24 08 89 04 24 e8 83 d9 ff ff 8b 04 24 48 83 c4 18 5b 5d 41 5c 41 5d 41 5e 41 5f c3 <0f> 0b 48 c7 c7 60 a4 55 81 31 c0 e8 14 80 22 00 b8 ea ff ff ff
Nov 8 19:28:23 arch kernel: RIP [<ffffffff8121f0ff>] ttm_bo_init+0x30f/0x340
Nov 8 19:28:23 arch kernel: RSP <ffff88011b0fbbe8>
Nov 8 19:28:23 arch kernel: ---[ end trace 328a9acba7691d6e ]---
Nov 8 19:28:23 arch kernel: note: X[1541] exited with preempt_count 1
Nov 8 19:28:23 arch kernel: BUG: scheduling while atomic: X/1541/0x10000002
Nov 8 19:28:23 arch kernel: Pid: 1541, comm: X Tainted: G D 2.6.37-rc1-00116-g151f52f-dirty #31
Nov 8 19:28:23 arch kernel: Call Trace:
Nov 8 19:28:23 arch kernel: [<ffffffff81447ad9>] ? schedule+0x639/0x850
Nov 8 19:28:23 arch kernel: [<ffffffff8105826d>] ? __cond_resched+0x1d/0x30
Nov 8 19:28:23 arch kernel: [<ffffffff81447f2f>] ? _cond_resched+0x2f/0x40
Nov 8 19:28:23 arch kernel: [<ffffffff810b57fc>] ? unmap_vmas+0x82c/0x9c0
Nov 8 19:28:23 arch kernel: [<ffffffff810bcb62>] ? exit_mmap+0xe2/0x1a0
Nov 8 19:28:23 arch kernel: [<ffffffff8105a705>] ? mmput+0x25/0xc0
Nov 8 19:28:23 arch kernel: [<ffffffff8105e734>] ? exit_mm+0x104/0x130
Nov 8 19:28:23 arch kernel: [<ffffffff81079ebf>] ? hrtimer_try_to_cancel+0x3f/0x80
Nov 8 19:28:23 arch kernel: [<ffffffff81089d0a>] ? acct_collect+0x9a/0x1a0
Nov 8 19:28:23 arch kernel: [<ffffffff8106045a>] ? do_exit+0x5aa/0x760
Nov 8 19:28:23 arch kernel: [<ffffffff81447163>] ? printk+0x40/0x45
Nov 8 19:28:23 arch kernel: [<ffffffff8105e33c>] ? kmsg_dump+0x7c/0x150
Nov 8 19:28:23 arch kernel: [<ffffffff81031fda>] ? oops_end+0x9a/0xe0
Nov 8 19:28:23 arch kernel: [<ffffffff8102ee74>] ? do_invalid_op+0x84/0xa0
Nov 8 19:28:23 arch kernel: [<ffffffff8121f0ff>] ? ttm_bo_init+0x30f/0x340
Nov 8 19:28:23 arch kernel: [<ffffffff810ddf50>] ? __pollwait+0x0/0x110
Nov 8 19:28:23 arch kernel: [<ffffffff8102e7d5>] ? invalid_op+0x15/0x20
Nov 8 19:28:23 arch kernel: [<ffffffff8121f0ff>] ? ttm_bo_init+0x30f/0x340
Nov 8 19:28:23 arch kernel: [<ffffffff8121efe3>] ? ttm_bo_init+0x1f3/0x340
Nov 8 19:28:23 arch kernel: [<ffffffff8125294d>] ? radeon_bo_create+0x14d/0x250
Nov 8 19:28:23 arch kernel: [<ffffffff812526c0>] ? radeon_ttm_bo_destroy+0x0/0xb0
Nov 8 19:28:23 arch kernel: [<ffffffff812671cc>] ? radeon_gem_object_create+0x8c/0x130
Nov 8 19:28:23 arch kernel: [<ffffffff81267634>] ? radeon_gem_create_ioctl+0x54/0xd0
Nov 8 19:28:23 arch kernel: [<ffffffff813ab26d>] ? sock_aio_read+0x10d/0x120
Nov 8 19:28:23 arch kernel: [<ffffffff8120963c>] ? drm_ioctl+0x39c/0x450
Nov 8 19:28:23 arch kernel: [<ffffffff812675e0>] ? radeon_gem_create_ioctl+0x0/0xd0
Nov 8 19:28:23 arch kernel: [<ffffffff810dd2c9>] ? do_vfs_ioctl+0xa9/0x610
Nov 8 19:28:23 arch kernel: [<ffffffff810dd879>] ? sys_ioctl+0x49/0x80
Nov 8 19:28:23 arch kernel: [<ffffffff810ce24e>] ? sys_read+0x4e/0x90
Nov 8 19:28:23 arch kernel: [<ffffffff8102dc2b>] ? system_call_fastpath+0x16/0x1b
Nov 8 19:28:23 arch kernel: BUG: scheduling while atomic: X/1541/0x10000002
Nov 8 19:28:23 arch kernel: Pid: 1541, comm: X Tainted: G D 2.6.37-rc1-00116-g151f52f-dirty #31
Nov 8 19:28:23 arch kernel: Call Trace:
Nov 8 19:28:23 arch kernel: [<ffffffff81447ad9>] ? schedule+0x639/0x850
Nov 8 19:28:23 arch kernel: [<ffffffff8105826d>] ? __cond_resched+0x1d/0x30
Nov 8 19:28:23 arch kernel: [<ffffffff81447f2f>] ? _cond_resched+0x2f/0x40
Nov 8 19:28:23 arch kernel: [<ffffffff810b57fc>] ? unmap_vmas+0x82c/0x9c0
Nov 8 19:28:23 arch kernel: [<ffffffff810bcb62>] ? exit_mmap+0xe2/0x1a0
Nov 8 19:28:23 arch kernel: [<ffffffff8105a705>] ? mmput+0x25/0xc0
Nov 8 19:28:23 arch kernel: [<ffffffff8105e734>] ? exit_mm+0x104/0x130
Nov 8 19:28:23 arch kernel: [<ffffffff81079ebf>] ? hrtimer_try_to_cancel+0x3f/0x80
Nov 8 19:28:23 arch kernel: [<ffffffff81089d0a>] ? acct_collect+0x9a/0x1a0
Nov 8 19:28:23 arch kernel: [<ffffffff8106045a>] ? do_exit+0x5aa/0x760
Nov 8 19:28:23 arch kernel: [<ffffffff81447163>] ? printk+0x40/0x45
Nov 8 19:28:23 arch kernel: [<ffffffff8105e33c>] ? kmsg_dump+0x7c/0x150
Nov 8 19:28:23 arch kernel: [<ffffffff81031fda>] ? oops_end+0x9a/0xe0
Nov 8 19:28:23 arch kernel: [<ffffffff8102ee74>] ? do_invalid_op+0x84/0xa0
Nov 8 19:28:23 arch kernel: [<ffffffff8121f0ff>] ? ttm_bo_init+0x30f/0x340
Nov 8 19:28:23 arch kernel: [<ffffffff810ddf50>] ? __pollwait+0x0/0x110
Nov 8 19:28:23 arch kernel: [<ffffffff8102e7d5>] ? invalid_op+0x15/0x20
Nov 8 19:28:23 arch kernel: [<ffffffff8121f0ff>] ? ttm_bo_init+0x30f/0x340
Nov 8 19:28:23 arch kernel: [<ffffffff8121efe3>] ? ttm_bo_init+0x1f3/0x340
Nov 8 19:28:23 arch kernel: [<ffffffff8125294d>] ? radeon_bo_create+0x14d/0x250
Nov 8 19:28:23 arch kernel: [<ffffffff812526c0>] ? radeon_ttm_bo_destroy+0x0/0xb0
Nov 8 19:28:23 arch kernel: [<ffffffff812671cc>] ? radeon_gem_object_create+0x8c/0x130
Nov 8 19:28:23 arch kernel: [<ffffffff81267634>] ? radeon_gem_create_ioctl+0x54/0xd0
Nov 8 19:28:23 arch kernel: [<ffffffff813ab26d>] ? sock_aio_read+0x10d/0x120
Nov 8 19:28:23 arch kernel: [<ffffffff8120963c>] ? drm_ioctl+0x39c/0x450
Nov 8 19:28:23 arch kernel: [<ffffffff812675e0>] ? radeon_gem_create_ioctl+0x0/0xd0
Nov 8 19:28:23 arch kernel: [<ffffffff810dd2c9>] ? do_vfs_ioctl+0xa9/0x610
Nov 8 19:28:23 arch kernel: [<ffffffff810dd879>] ? sys_ioctl+0x49/0x80
Nov 8 19:28:23 arch kernel: [<ffffffff810ce24e>] ? sys_read+0x4e/0x90
Nov 8 19:28:23 arch kernel: [<ffffffff8102dc2b>] ? system_call_fastpath+0x16/0x1b
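A side note on why this trace says "invalid opcode" rather than a page fault: on x86, BUG() is implemented with the ud2 instruction (bytes 0f 0b), and the "Code:" line marks the byte at RIP with angle brackets. The helper below is a small sketch for pulling those bytes out of such a line. The function names are made up for this note, and the sample strings are shortened copies of the "Code:" lines in this thread.

```python
def bytes_at_rip(code_line, count=2):
    """Return `count` hex bytes starting at the <xx> marker of an oops
    "Code:" line, or an empty list if no marker is found."""
    _, _, rest = code_line.partition("Code:")
    tokens = rest.split()
    for i, tok in enumerate(tokens):
        if tok.startswith("<") and tok.endswith(">"):
            return [tok.strip("<>")] + tokens[i + 1:i + count]
    return []


def is_ud2(code_line):
    """True if the instruction at RIP starts with 0f 0b (ud2)."""
    return bytes_at_rip(code_line, 2) == ["0f", "0b"]


# Shortened from the ttm_bo_init BUG above.
bug_line = "Code: 41 5e 41 5f c3 <0f> 0b 48 c7 c7 60 a4 55 81"
# Shortened from the first _raw_write_lock oops in this thread.
fault_line = "Code: ff 80 44 e0 ff ff <f0> 81 2f 00 00 00 01 74 05"

bug_is_ud2 = is_ud2(bug_line)
fault_bytes = bytes_at_rip(fault_line, 3)
```

For the ttm_bo_init trace the marked bytes are 0f 0b, consistent with a BUG() being hit. In the earlier _raw_write_lock oops the marked byte is f0, a lock prefix on a memory write.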


--
Markus

2010-11-08 19:36:42

by Jerome Glisse

Subject: Re: Radeon RS780 - BUG: unable to handle kernel NULL pointer dereference

On Mon, Nov 8, 2010 at 2:02 PM, Markus Trippelsdorf wrote:
> On Mon, Nov 08, 2010 at 07:43:02PM +0100, Markus Trippelsdorf wrote:
>> On Mon, Nov 08, 2010 at 06:07:37PM +0100, Markus Trippelsdorf wrote:
>> > On Mon, Nov 08, 2010 at 06:02:21PM +0100, Markus Trippelsdorf wrote:
>> > > I can trigger a kernel crash on my system by simply loading this png
>> > > image with firefox:
>> > > http://mediaarchive.cern.ch/MediaArchive/Photo/Public/2010/1011251/1011251_01/1011251_01-A4-at-144-dpi.jpg
>> >
>> > Sorry the above link is wrong, this is the right one (that triggers the
>> > crash):
>> > http://cdsweb.cern.ch/record/1305179/files/HI-150431-630470-huge.png
>>
>> I triggered it a few more times and took the attached picture.
>> It points to the BUG() call at drivers/gpu/drm/ttm/ttm_bo.c:1628 .
>> (Sorry for the bad picture quality)
>
> And here the same BUG in plaintext (should be a bit easier to read):
>
> Nov 8 19:28:23 arch kernel: ------------[ cut here ]------------
> Nov 8 19:28:23 arch kernel: kernel BUG at drivers/gpu/drm/ttm/ttm_bo.c:1628!

Quite puzzling: it is as if there were already a bo at the same offset in
the rb tree but not in the vm mm. Maybe some other race in destruction...
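To illustrate the kind of inconsistency described here, the following is a toy model only, not the TTM code: an offset allocator (standing in for the vm mm) and a separate sorted index (standing in for the rb tree) that must stay in sync. All class and method names are hypothetical. If a destruction path releases the offset in the allocator but leaves a stale index entry, the next object gets the same offset and the index insert trips over the duplicate.

```python
import bisect


class OffsetSpace:
    """Toy model with two structures that must agree (names are hypothetical)."""

    def __init__(self):
        self.in_use = set()   # stand-in for the "vm mm" offset allocator
        self.index = []       # sorted offsets, stand-in for the rb tree

    def alloc(self):
        # Hand out the lowest offset the allocator considers free.
        offset = 0
        while offset in self.in_use:
            offset += 1
        self.in_use.add(offset)
        return offset

    def index_insert(self, offset):
        pos = bisect.bisect_left(self.index, offset)
        if pos < len(self.index) and self.index[pos] == offset:
            # Stand-in for the BUG(): two objects claim the same offset.
            raise RuntimeError("BUG: offset %d already indexed" % offset)
        self.index.insert(pos, offset)

    def free(self, offset, drop_index=True):
        self.in_use.discard(offset)
        if drop_index:
            self.index.remove(offset)


space = OffsetSpace()
first = space.alloc()
space.index_insert(first)
# Simulated destruction race: the allocator forgets the offset,
# but the index entry is never removed.
space.free(first, drop_index=False)
second = space.alloc()          # the allocator reuses the same offset
try:
    space.index_insert(second)
    outcome = "ok"
except RuntimeError as err:
    outcome = str(err)
```

In this model the duplicate only appears when the two structures are updated non-atomically with respect to each other, which is why a race in the destruction path is a plausible suspect.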

Cheers,
Jerome Glisse

2010-11-08 20:53:40

by Jerome Glisse

Subject: Re: Radeon RS780 - BUG: unable to handle kernel NULL pointer dereference

On Mon, Nov 8, 2010 at 2:02 PM, Markus Trippelsdorf
<[email protected]> wrote:
> On Mon, Nov 08, 2010 at 07:43:02PM +0100, Markus Trippelsdorf wrote:
>> On Mon, Nov 08, 2010 at 06:07:37PM +0100, Markus Trippelsdorf wrote:
>> > On Mon, Nov 08, 2010 at 06:02:21PM +0100, Markus Trippelsdorf wrote:
>> > > I can trigger a kernel crash on my system by simply loading this png
>> > > image with firefox:
>> > > http://mediaarchive.cern.ch/MediaArchive/Photo/Public/2010/1011251/1011251_01/1011251_01-A4-at-144-dpi.jpg
>> >
>> > Sorry the above link is wrong, this is the right one (that triggers the
>> > crash):
>> > http://cdsweb.cern.ch/record/1305179/files/HI-150431-630470-huge.png
>>
>> I triggered it a few more times and took the attached picture.
>> It points to the BUG() call at drivers/gpu/drm/ttm/ttm_bo.c:1628 .
>> (Sorry for the bad picture quality)
>
> And here the same BUG in plaintext (should be a bit easier to read):
>
> Nov ?8 19:28:23 arch kernel: ------------[ cut here ]------------
> Nov ?8 19:28:23 arch kernel: kernel BUG at drivers/gpu/drm/ttm/ttm_bo.c:1628!
> Nov ?8 19:28:23 arch kernel: invalid opcode: 0000 [#1] PREEMPT SMP
> Nov ?8 19:28:23 arch kernel: last sysfs file: /sys/devices/pci0000:00/0000:00:18.3/temp1_input
> Nov ?8 19:28:23 arch kernel: CPU 1
> Nov ?8 19:28:23 arch kernel: Pid: 1541, comm: X Not tainted 2.6.37-rc1-00116-g151f52f-dirty #31 M4A78T-E/System Product Name
> Nov ?8 19:28:23 arch kernel: RIP: 0010:[<ffffffff8121f0ff>] ?[<ffffffff8121f0ff>] ttm_bo_init+0x30f/0x340
> Nov ?8 19:28:23 arch kernel: RSP: 0018:ffff88011b0fbbe8 ?EFLAGS: 00010246
> Nov ?8 19:28:23 arch kernel: RAX: ffff8800da881778 RBX: ffff8800da881620 RCX: ffff88011b15ed78
> Nov ?8 19:28:23 arch kernel: RDX: ffff8800c1556040 RSI: ffff88011ff22770 RDI: 000000000017adfb
> Nov ?8 19:28:23 arch kernel: RBP: ffff8800da881648 R08: 0000000000000000 R09: ffff8800c1556040
> Nov ?8 19:28:23 arch kernel: R10: 000000000ff85205 R11: ffff8800dae19200 R12: 0000000000000001
> Nov ?8 19:28:23 arch kernel: R13: ffff88011ff22528 R14: ffff88011ff22778 R15: 0000000000000000
> Nov ?8 19:28:23 arch kernel: FS: ?00007f2043043700(0000) GS:ffff8800dfc80000(0000) knlGS:0000000000000000
> Nov ?8 19:28:23 arch kernel: CS: ?0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> Nov ?8 19:28:23 arch kernel: CR2: 00007f203d057000 CR3: 000000011b12b000 CR4: 00000000000006e0
> Nov ?8 19:28:23 arch kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> Nov ?8 19:28:23 arch kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> Nov ?8 19:28:23 arch kernel: Process X (pid: 1541, threadinfo ffff88011b0fa000, task ffff88011c959c20)
> Nov ?8 19:28:23 arch kernel: Stack:
> Nov ?8 19:28:23 arch kernel: 0000000000000000 ffff8800da881648 ffff88011b0fbd00 ffff8800da881600
> Nov ?8 19:28:23 arch kernel: ffff88011ff22000 0000000000000000 0000000000000001 00000000fffffff4
> Nov ?8 19:28:23 arch kernel: ffff88011b0fbd00 ffffffff8125294d 0000000000000000 ffffffff00000001
> Nov ?8 19:28:23 arch kernel: Call Trace:
> Nov ?8 19:28:23 arch kernel: [<ffffffff8125294d>] ? radeon_bo_create+0x14d/0x250
> Nov ?8 19:28:23 arch kernel: [<ffffffff812526c0>] ? radeon_ttm_bo_destroy+0x0/0xb0
> Nov ?8 19:28:23 arch kernel: [<ffffffff812671cc>] ? radeon_gem_object_create+0x8c/0x130
> Nov ?8 19:28:23 arch kernel: [<ffffffff81267634>] ? radeon_gem_create_ioctl+0x54/0xd0
> Nov ?8 19:28:23 arch kernel: [<ffffffff813ab26d>] ? sock_aio_read+0x10d/0x120
> Nov ?8 19:28:23 arch kernel: [<ffffffff8120963c>] ? drm_ioctl+0x39c/0x450
> Nov ?8 19:28:23 arch kernel: [<ffffffff812675e0>] ? radeon_gem_create_ioctl+0x0/0xd0
> Nov ?8 19:28:23 arch kernel: [<ffffffff810dd2c9>] ? do_vfs_ioctl+0xa9/0x610
> Nov ?8 19:28:23 arch kernel: [<ffffffff810dd879>] ? sys_ioctl+0x49/0x80
> Nov ?8 19:28:23 arch kernel: [<ffffffff810ce24e>] ? sys_read+0x4e/0x90
> Nov ?8 19:28:23 arch kernel: [<ffffffff8102dc2b>] ? system_call_fastpath+0x16/0x1b
> Nov ?8 19:28:23 arch kernel: Code: e8 fb ff ff 85 c0 0f 85 68 ff ff ff 48 8b 7c 24 08 89 04 24 e8 83 d9 ff ff 8b 04 24 48 83 c4 18 5b 5d 41 5c 41 5d 41 5e 41 5f c3 <0f> 0b 48 c7 c7 60 a4 55 81 31 c0 e8 14 80 22 00 b8 ea ff ff ff
> Nov ?8 19:28:23 arch kernel: RIP ?[<ffffffff8121f0ff>] ttm_bo_init+0x30f/0x340
> Nov ?8 19:28:23 arch kernel: RSP <ffff88011b0fbbe8>
> Nov ?8 19:28:23 arch kernel: ---[ end trace 328a9acba7691d6e ]---
> Nov ?8 19:28:23 arch kernel: note: X[1541] exited with preempt_count 1
> Nov 8 19:28:23 arch kernel: BUG: scheduling while atomic: X/1541/0x10000002
> Nov 8 19:28:23 arch kernel: Pid: 1541, comm: X Tainted: G D 2.6.37-rc1-00116-g151f52f-dirty #31
> Nov 8 19:28:23 arch kernel: Call Trace:
> Nov 8 19:28:23 arch kernel: [<ffffffff81447ad9>] ? schedule+0x639/0x850
> Nov 8 19:28:23 arch kernel: [<ffffffff8105826d>] ? __cond_resched+0x1d/0x30
> Nov 8 19:28:23 arch kernel: [<ffffffff81447f2f>] ? _cond_resched+0x2f/0x40
> Nov 8 19:28:23 arch kernel: [<ffffffff810b57fc>] ? unmap_vmas+0x82c/0x9c0
> Nov 8 19:28:23 arch kernel: [<ffffffff810bcb62>] ? exit_mmap+0xe2/0x1a0
> Nov 8 19:28:23 arch kernel: [<ffffffff8105a705>] ? mmput+0x25/0xc0
> Nov 8 19:28:23 arch kernel: [<ffffffff8105e734>] ? exit_mm+0x104/0x130
> Nov 8 19:28:23 arch kernel: [<ffffffff81079ebf>] ? hrtimer_try_to_cancel+0x3f/0x80
> Nov 8 19:28:23 arch kernel: [<ffffffff81089d0a>] ? acct_collect+0x9a/0x1a0
> Nov 8 19:28:23 arch kernel: [<ffffffff8106045a>] ? do_exit+0x5aa/0x760
> Nov 8 19:28:23 arch kernel: [<ffffffff81447163>] ? printk+0x40/0x45
> Nov 8 19:28:23 arch kernel: [<ffffffff8105e33c>] ? kmsg_dump+0x7c/0x150
> Nov 8 19:28:23 arch kernel: [<ffffffff81031fda>] ? oops_end+0x9a/0xe0
> Nov 8 19:28:23 arch kernel: [<ffffffff8102ee74>] ? do_invalid_op+0x84/0xa0
> Nov 8 19:28:23 arch kernel: [<ffffffff8121f0ff>] ? ttm_bo_init+0x30f/0x340
> Nov 8 19:28:23 arch kernel: [<ffffffff810ddf50>] ? __pollwait+0x0/0x110
> Nov 8 19:28:23 arch kernel: [<ffffffff8102e7d5>] ? invalid_op+0x15/0x20
> Nov 8 19:28:23 arch kernel: [<ffffffff8121f0ff>] ? ttm_bo_init+0x30f/0x340
> Nov 8 19:28:23 arch kernel: [<ffffffff8121efe3>] ? ttm_bo_init+0x1f3/0x340
> Nov 8 19:28:23 arch kernel: [<ffffffff8125294d>] ? radeon_bo_create+0x14d/0x250
> Nov 8 19:28:23 arch kernel: [<ffffffff812526c0>] ? radeon_ttm_bo_destroy+0x0/0xb0
> Nov 8 19:28:23 arch kernel: [<ffffffff812671cc>] ? radeon_gem_object_create+0x8c/0x130
> Nov 8 19:28:23 arch kernel: [<ffffffff81267634>] ? radeon_gem_create_ioctl+0x54/0xd0
> Nov 8 19:28:23 arch kernel: [<ffffffff813ab26d>] ? sock_aio_read+0x10d/0x120
> Nov 8 19:28:23 arch kernel: [<ffffffff8120963c>] ? drm_ioctl+0x39c/0x450
> Nov 8 19:28:23 arch kernel: [<ffffffff812675e0>] ? radeon_gem_create_ioctl+0x0/0xd0
> Nov 8 19:28:23 arch kernel: [<ffffffff810dd2c9>] ? do_vfs_ioctl+0xa9/0x610
> Nov 8 19:28:23 arch kernel: [<ffffffff810dd879>] ? sys_ioctl+0x49/0x80
> Nov 8 19:28:23 arch kernel: [<ffffffff810ce24e>] ? sys_read+0x4e/0x90
> Nov 8 19:28:23 arch kernel: [<ffffffff8102dc2b>] ? system_call_fastpath+0x16/0x1b
> Nov 8 19:28:23 arch kernel: BUG: scheduling while atomic: X/1541/0x10000002
> Nov 8 19:28:23 arch kernel: Pid: 1541, comm: X Tainted: G D 2.6.37-rc1-00116-g151f52f-dirty #31
> Nov 8 19:28:23 arch kernel: Call Trace:
> Nov 8 19:28:23 arch kernel: [<ffffffff81447ad9>] ? schedule+0x639/0x850
> Nov 8 19:28:23 arch kernel: [<ffffffff8105826d>] ? __cond_resched+0x1d/0x30
> Nov 8 19:28:23 arch kernel: [<ffffffff81447f2f>] ? _cond_resched+0x2f/0x40
> Nov 8 19:28:23 arch kernel: [<ffffffff810b57fc>] ? unmap_vmas+0x82c/0x9c0
> Nov 8 19:28:23 arch kernel: [<ffffffff810bcb62>] ? exit_mmap+0xe2/0x1a0
> Nov 8 19:28:23 arch kernel: [<ffffffff8105a705>] ? mmput+0x25/0xc0
> Nov 8 19:28:23 arch kernel: [<ffffffff8105e734>] ? exit_mm+0x104/0x130
> Nov 8 19:28:23 arch kernel: [<ffffffff81079ebf>] ? hrtimer_try_to_cancel+0x3f/0x80
> Nov 8 19:28:23 arch kernel: [<ffffffff81089d0a>] ? acct_collect+0x9a/0x1a0
> Nov 8 19:28:23 arch kernel: [<ffffffff8106045a>] ? do_exit+0x5aa/0x760
> Nov 8 19:28:23 arch kernel: [<ffffffff81447163>] ? printk+0x40/0x45
> Nov 8 19:28:23 arch kernel: [<ffffffff8105e33c>] ? kmsg_dump+0x7c/0x150
> Nov 8 19:28:23 arch kernel: [<ffffffff81031fda>] ? oops_end+0x9a/0xe0
> Nov 8 19:28:23 arch kernel: [<ffffffff8102ee74>] ? do_invalid_op+0x84/0xa0
> Nov 8 19:28:23 arch kernel: [<ffffffff8121f0ff>] ? ttm_bo_init+0x30f/0x340
> Nov 8 19:28:23 arch kernel: [<ffffffff810ddf50>] ? __pollwait+0x0/0x110
> Nov 8 19:28:23 arch kernel: [<ffffffff8102e7d5>] ? invalid_op+0x15/0x20
> Nov 8 19:28:23 arch kernel: [<ffffffff8121f0ff>] ? ttm_bo_init+0x30f/0x340
> Nov 8 19:28:23 arch kernel: [<ffffffff8121efe3>] ? ttm_bo_init+0x1f3/0x340
> Nov 8 19:28:23 arch kernel: [<ffffffff8125294d>] ? radeon_bo_create+0x14d/0x250
> Nov 8 19:28:23 arch kernel: [<ffffffff812526c0>] ? radeon_ttm_bo_destroy+0x0/0xb0
> Nov 8 19:28:23 arch kernel: [<ffffffff812671cc>] ? radeon_gem_object_create+0x8c/0x130
> Nov 8 19:28:23 arch kernel: [<ffffffff81267634>] ? radeon_gem_create_ioctl+0x54/0xd0
> Nov 8 19:28:23 arch kernel: [<ffffffff813ab26d>] ? sock_aio_read+0x10d/0x120
> Nov 8 19:28:23 arch kernel: [<ffffffff8120963c>] ? drm_ioctl+0x39c/0x450
> Nov 8 19:28:23 arch kernel: [<ffffffff812675e0>] ? radeon_gem_create_ioctl+0x0/0xd0
> Nov 8 19:28:23 arch kernel: [<ffffffff810dd2c9>] ? do_vfs_ioctl+0xa9/0x610
> Nov 8 19:28:23 arch kernel: [<ffffffff810dd879>] ? sys_ioctl+0x49/0x80
> Nov 8 19:28:23 arch kernel: [<ffffffff810ce24e>] ? sys_read+0x4e/0x90
> Nov 8 19:28:23 arch kernel: [<ffffffff8102dc2b>] ? system_call_fastpath+0x16/0x1b
>

Thomas, this bug seems to point to a case where we end up trying to add
an entry at the same offset in the rb tree for addr_space_mm. After
carefully reviewing the locking around the rb tree modification and
addr_space_mm, I am fairly confident that no race can occur. Would you
have any idea what might go wrong here? I guess I would ultimately need
to dump the mm and rb tree state when the BUG gets triggered to try to
understand the state of things.

Cheers,
Jerome
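
The duplicate-offset condition described above can be illustrated with a small user-space sketch. This is not the TTM code: `demo_node`, `demo_insert` and `demo_dump` are hypothetical names, and a plain unbalanced binary search tree stands in for the kernel rb tree. It only shows how an insert keyed by offset can detect a collision and report it, and how the tree contents could be dumped in order for inspection.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical stand-in for an object keyed by its address-space offset. */
struct demo_node {
	unsigned long offset;
	struct demo_node *left, *right;
};

/*
 * Insert into an unbalanced binary search tree ordered by offset.
 * Returns 0 on success, -EEXIST if the offset is already present,
 * -ENOMEM if the node cannot be allocated.
 */
static int demo_insert(struct demo_node **root, unsigned long offset)
{
	struct demo_node **cur = root;
	struct demo_node *node;

	assert(root != NULL);
	while (*cur != NULL) {
		if (offset < (*cur)->offset)
			cur = &(*cur)->left;
		else if (offset > (*cur)->offset)
			cur = &(*cur)->right;
		else
			return -EEXIST;	/* same offset already in the tree */
	}
	node = calloc(1, sizeof(*node));
	if (node == NULL)
		return -ENOMEM;
	node->offset = offset;
	*cur = node;
	return 0;
}

/* Print every offset in ascending order; returns the number of nodes. */
static int demo_dump(const struct demo_node *n, FILE *out)
{
	int count;

	if (n == NULL)
		return 0;
	count = demo_dump(n->left, out);
	fprintf(out, "offset 0x%lx\n", n->offset);
	return count + 1 + demo_dump(n->right, out);
}
```

Calling `demo_insert()` twice with the same offset makes the second call return `-EEXIST`, which is the point at which a dump of the whole tree would be most useful.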

2010-11-08 21:00:04

by Rafael J. Wysocki

Subject: Re: Radeon RS780 - BUG: unable to handle kernel NULL pointer dereference

On Monday, November 08, 2010, Jerome Glisse wrote:
> On Mon, Nov 8, 2010 at 2:02 PM, Markus Trippelsdorf
> <[email protected]> wrote:
> > On Mon, Nov 08, 2010 at 07:43:02PM +0100, Markus Trippelsdorf wrote:
> >> On Mon, Nov 08, 2010 at 06:07:37PM +0100, Markus Trippelsdorf wrote:
> >> > On Mon, Nov 08, 2010 at 06:02:21PM +0100, Markus Trippelsdorf wrote:
> >> > > I can trigger a kernel crash on my system by simply loading this png
> >> > > image with firefox:
> >> > > http://mediaarchive.cern.ch/MediaArchive/Photo/Public/2010/1011251/1011251_01/1011251_01-A4-at-144-dpi.jpg
> >> >
> >> > Sorry the above link is wrong, this is the right one (that triggers the
> >> > crash):
> >> > http://cdsweb.cern.ch/record/1305179/files/HI-150431-630470-huge.png
> >>
> >> I triggered it a few more times and took the attached picture.
> >> It points to the BUG() call at drivers/gpu/drm/ttm/ttm_bo.c:1628 .
> >> (Sorry for the bad picture quality)
> >
> > And here the same BUG in plaintext (should be a bit easier to read):
> >
> > Nov 8 19:28:23 arch kernel: ------------[ cut here ]------------
> > Nov 8 19:28:23 arch kernel: kernel BUG at drivers/gpu/drm/ttm/ttm_bo.c:1628!
> > Nov 8 19:28:23 arch kernel: invalid opcode: 0000 [#1] PREEMPT SMP
> > Nov 8 19:28:23 arch kernel: last sysfs file: /sys/devices/pci0000:00/0000:00:18.3/temp1_input
> > Nov 8 19:28:23 arch kernel: CPU 1
> > Nov 8 19:28:23 arch kernel: Pid: 1541, comm: X Not tainted 2.6.37-rc1-00116-g151f52f-dirty #31 M4A78T-E/System Product Name
> > Nov 8 19:28:23 arch kernel: RIP: 0010:[<ffffffff8121f0ff>] [<ffffffff8121f0ff>] ttm_bo_init+0x30f/0x340
> > Nov 8 19:28:23 arch kernel: RSP: 0018:ffff88011b0fbbe8 EFLAGS: 00010246
> > Nov 8 19:28:23 arch kernel: RAX: ffff8800da881778 RBX: ffff8800da881620 RCX: ffff88011b15ed78
> > Nov 8 19:28:23 arch kernel: RDX: ffff8800c1556040 RSI: ffff88011ff22770 RDI: 000000000017adfb
> > Nov 8 19:28:23 arch kernel: RBP: ffff8800da881648 R08: 0000000000000000 R09: ffff8800c1556040
> > Nov 8 19:28:23 arch kernel: R10: 000000000ff85205 R11: ffff8800dae19200 R12: 0000000000000001
> > Nov 8 19:28:23 arch kernel: R13: ffff88011ff22528 R14: ffff88011ff22778 R15: 0000000000000000
> > Nov 8 19:28:23 arch kernel: FS: 00007f2043043700(0000) GS:ffff8800dfc80000(0000) knlGS:0000000000000000
> > Nov 8 19:28:23 arch kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > Nov 8 19:28:23 arch kernel: CR2: 00007f203d057000 CR3: 000000011b12b000 CR4: 00000000000006e0
> > Nov 8 19:28:23 arch kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > Nov 8 19:28:23 arch kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> > Nov 8 19:28:23 arch kernel: Process X (pid: 1541, threadinfo ffff88011b0fa000, task ffff88011c959c20)
> > Nov 8 19:28:23 arch kernel: Stack:
> > Nov 8 19:28:23 arch kernel: 0000000000000000 ffff8800da881648 ffff88011b0fbd00 ffff8800da881600
> > Nov 8 19:28:23 arch kernel: ffff88011ff22000 0000000000000000 0000000000000001 00000000fffffff4
> > Nov 8 19:28:23 arch kernel: ffff88011b0fbd00 ffffffff8125294d 0000000000000000 ffffffff00000001
> > Nov 8 19:28:23 arch kernel: Call Trace:
> > Nov 8 19:28:23 arch kernel: [<ffffffff8125294d>] ? radeon_bo_create+0x14d/0x250
> > Nov 8 19:28:23 arch kernel: [<ffffffff812526c0>] ? radeon_ttm_bo_destroy+0x0/0xb0
> > Nov 8 19:28:23 arch kernel: [<ffffffff812671cc>] ? radeon_gem_object_create+0x8c/0x130
> > Nov 8 19:28:23 arch kernel: [<ffffffff81267634>] ? radeon_gem_create_ioctl+0x54/0xd0
> > Nov 8 19:28:23 arch kernel: [<ffffffff813ab26d>] ? sock_aio_read+0x10d/0x120
> > Nov 8 19:28:23 arch kernel: [<ffffffff8120963c>] ? drm_ioctl+0x39c/0x450
> > Nov 8 19:28:23 arch kernel: [<ffffffff812675e0>] ? radeon_gem_create_ioctl+0x0/0xd0
> > Nov 8 19:28:23 arch kernel: [<ffffffff810dd2c9>] ? do_vfs_ioctl+0xa9/0x610
> > Nov 8 19:28:23 arch kernel: [<ffffffff810dd879>] ? sys_ioctl+0x49/0x80
> > Nov 8 19:28:23 arch kernel: [<ffffffff810ce24e>] ? sys_read+0x4e/0x90
> > Nov 8 19:28:23 arch kernel: [<ffffffff8102dc2b>] ? system_call_fastpath+0x16/0x1b
> > Nov 8 19:28:23 arch kernel: Code: e8 fb ff ff 85 c0 0f 85 68 ff ff ff 48 8b 7c 24 08 89 04 24 e8 83 d9 ff ff 8b 04 24 48 83 c4 18 5b 5d 41 5c 41 5d 41 5e 41 5f c3 <0f> 0b 48 c7 c7 60 a4 55 81 31 c0 e8 14 80 22 00 b8 ea ff ff ff
> > Nov 8 19:28:23 arch kernel: RIP [<ffffffff8121f0ff>] ttm_bo_init+0x30f/0x340
> > Nov 8 19:28:23 arch kernel: RSP <ffff88011b0fbbe8>
> > Nov 8 19:28:23 arch kernel: ---[ end trace 328a9acba7691d6e ]---
> > Nov 8 19:28:23 arch kernel: note: X[1541] exited with preempt_count 1
> > Nov 8 19:28:23 arch kernel: BUG: scheduling while atomic: X/1541/0x10000002
> > Nov 8 19:28:23 arch kernel: Pid: 1541, comm: X Tainted: G D 2.6.37-rc1-00116-g151f52f-dirty #31
> > Nov 8 19:28:23 arch kernel: Call Trace:
> > Nov 8 19:28:23 arch kernel: [<ffffffff81447ad9>] ? schedule+0x639/0x850
> > Nov 8 19:28:23 arch kernel: [<ffffffff8105826d>] ? __cond_resched+0x1d/0x30
> > Nov 8 19:28:23 arch kernel: [<ffffffff81447f2f>] ? _cond_resched+0x2f/0x40
> > Nov 8 19:28:23 arch kernel: [<ffffffff810b57fc>] ? unmap_vmas+0x82c/0x9c0
> > Nov 8 19:28:23 arch kernel: [<ffffffff810bcb62>] ? exit_mmap+0xe2/0x1a0
> > Nov 8 19:28:23 arch kernel: [<ffffffff8105a705>] ? mmput+0x25/0xc0
> > Nov 8 19:28:23 arch kernel: [<ffffffff8105e734>] ? exit_mm+0x104/0x130
> > Nov 8 19:28:23 arch kernel: [<ffffffff81079ebf>] ? hrtimer_try_to_cancel+0x3f/0x80
> > Nov 8 19:28:23 arch kernel: [<ffffffff81089d0a>] ? acct_collect+0x9a/0x1a0
> > Nov 8 19:28:23 arch kernel: [<ffffffff8106045a>] ? do_exit+0x5aa/0x760
> > Nov 8 19:28:23 arch kernel: [<ffffffff81447163>] ? printk+0x40/0x45
> > Nov 8 19:28:23 arch kernel: [<ffffffff8105e33c>] ? kmsg_dump+0x7c/0x150
> > Nov 8 19:28:23 arch kernel: [<ffffffff81031fda>] ? oops_end+0x9a/0xe0
> > Nov 8 19:28:23 arch kernel: [<ffffffff8102ee74>] ? do_invalid_op+0x84/0xa0
> > Nov 8 19:28:23 arch kernel: [<ffffffff8121f0ff>] ? ttm_bo_init+0x30f/0x340
> > Nov 8 19:28:23 arch kernel: [<ffffffff810ddf50>] ? __pollwait+0x0/0x110
> > Nov 8 19:28:23 arch kernel: [<ffffffff8102e7d5>] ? invalid_op+0x15/0x20
> > Nov 8 19:28:23 arch kernel: [<ffffffff8121f0ff>] ? ttm_bo_init+0x30f/0x340
> > Nov 8 19:28:23 arch kernel: [<ffffffff8121efe3>] ? ttm_bo_init+0x1f3/0x340
> > Nov 8 19:28:23 arch kernel: [<ffffffff8125294d>] ? radeon_bo_create+0x14d/0x250
> > Nov 8 19:28:23 arch kernel: [<ffffffff812526c0>] ? radeon_ttm_bo_destroy+0x0/0xb0
> > Nov 8 19:28:23 arch kernel: [<ffffffff812671cc>] ? radeon_gem_object_create+0x8c/0x130
> > Nov 8 19:28:23 arch kernel: [<ffffffff81267634>] ? radeon_gem_create_ioctl+0x54/0xd0
> > Nov 8 19:28:23 arch kernel: [<ffffffff813ab26d>] ? sock_aio_read+0x10d/0x120
> > Nov 8 19:28:23 arch kernel: [<ffffffff8120963c>] ? drm_ioctl+0x39c/0x450
> > Nov 8 19:28:23 arch kernel: [<ffffffff812675e0>] ? radeon_gem_create_ioctl+0x0/0xd0
> > Nov 8 19:28:23 arch kernel: [<ffffffff810dd2c9>] ? do_vfs_ioctl+0xa9/0x610
> > Nov 8 19:28:23 arch kernel: [<ffffffff810dd879>] ? sys_ioctl+0x49/0x80
> > Nov 8 19:28:23 arch kernel: [<ffffffff810ce24e>] ? sys_read+0x4e/0x90
> > Nov 8 19:28:23 arch kernel: [<ffffffff8102dc2b>] ? system_call_fastpath+0x16/0x1b
> > Nov 8 19:28:23 arch kernel: BUG: scheduling while atomic: X/1541/0x10000002
> > Nov 8 19:28:23 arch kernel: Pid: 1541, comm: X Tainted: G D 2.6.37-rc1-00116-g151f52f-dirty #31
> > Nov 8 19:28:23 arch kernel: Call Trace:
> > Nov 8 19:28:23 arch kernel: [<ffffffff81447ad9>] ? schedule+0x639/0x850
> > Nov 8 19:28:23 arch kernel: [<ffffffff8105826d>] ? __cond_resched+0x1d/0x30
> > Nov 8 19:28:23 arch kernel: [<ffffffff81447f2f>] ? _cond_resched+0x2f/0x40
> > Nov 8 19:28:23 arch kernel: [<ffffffff810b57fc>] ? unmap_vmas+0x82c/0x9c0
> > Nov 8 19:28:23 arch kernel: [<ffffffff810bcb62>] ? exit_mmap+0xe2/0x1a0
> > Nov 8 19:28:23 arch kernel: [<ffffffff8105a705>] ? mmput+0x25/0xc0
> > Nov 8 19:28:23 arch kernel: [<ffffffff8105e734>] ? exit_mm+0x104/0x130
> > Nov 8 19:28:23 arch kernel: [<ffffffff81079ebf>] ? hrtimer_try_to_cancel+0x3f/0x80
> > Nov 8 19:28:23 arch kernel: [<ffffffff81089d0a>] ? acct_collect+0x9a/0x1a0
> > Nov 8 19:28:23 arch kernel: [<ffffffff8106045a>] ? do_exit+0x5aa/0x760
> > Nov 8 19:28:23 arch kernel: [<ffffffff81447163>] ? printk+0x40/0x45
> > Nov 8 19:28:23 arch kernel: [<ffffffff8105e33c>] ? kmsg_dump+0x7c/0x150
> > Nov 8 19:28:23 arch kernel: [<ffffffff81031fda>] ? oops_end+0x9a/0xe0
> > Nov 8 19:28:23 arch kernel: [<ffffffff8102ee74>] ? do_invalid_op+0x84/0xa0
> > Nov 8 19:28:23 arch kernel: [<ffffffff8121f0ff>] ? ttm_bo_init+0x30f/0x340
> > Nov 8 19:28:23 arch kernel: [<ffffffff810ddf50>] ? __pollwait+0x0/0x110
> > Nov 8 19:28:23 arch kernel: [<ffffffff8102e7d5>] ? invalid_op+0x15/0x20
> > Nov 8 19:28:23 arch kernel: [<ffffffff8121f0ff>] ? ttm_bo_init+0x30f/0x340
> > Nov 8 19:28:23 arch kernel: [<ffffffff8121efe3>] ? ttm_bo_init+0x1f3/0x340
> > Nov 8 19:28:23 arch kernel: [<ffffffff8125294d>] ? radeon_bo_create+0x14d/0x250
> > Nov 8 19:28:23 arch kernel: [<ffffffff812526c0>] ? radeon_ttm_bo_destroy+0x0/0xb0
> > Nov 8 19:28:23 arch kernel: [<ffffffff812671cc>] ? radeon_gem_object_create+0x8c/0x130
> > Nov 8 19:28:23 arch kernel: [<ffffffff81267634>] ? radeon_gem_create_ioctl+0x54/0xd0
> > Nov 8 19:28:23 arch kernel: [<ffffffff813ab26d>] ? sock_aio_read+0x10d/0x120
> > Nov 8 19:28:23 arch kernel: [<ffffffff8120963c>] ? drm_ioctl+0x39c/0x450
> > Nov 8 19:28:23 arch kernel: [<ffffffff812675e0>] ? radeon_gem_create_ioctl+0x0/0xd0
> > Nov 8 19:28:23 arch kernel: [<ffffffff810dd2c9>] ? do_vfs_ioctl+0xa9/0x610
> > Nov 8 19:28:23 arch kernel: [<ffffffff810dd879>] ? sys_ioctl+0x49/0x80
> > Nov 8 19:28:23 arch kernel: [<ffffffff810ce24e>] ? sys_read+0x4e/0x90
> > Nov 8 19:28:23 arch kernel: [<ffffffff8102dc2b>] ? system_call_fastpath+0x16/0x1b
> >
>
> Thomas, this bug seems to point to a case where we end up trying to add
> an entry at the same offset in the rb tree for addr_space_mm. After
> carefully reviewing the locking around the rb tree modification and
> addr_space_mm, I am fairly confident that no race can occur. Would you
> have any idea what might go wrong here? I guess I would ultimately need
> to dump the mm and rb tree state when the BUG gets triggered to try to
> understand the state of things.

Hmm, why are you using BUG in there in the first place? Would it be _so_
dangerous to continue that we just have to crash here?

Rafael

2010-11-08 22:01:52

by Jerome Glisse

Subject: Re: Radeon RS780 - BUG: unable to handle kernel NULL pointer dereference

On Mon, Nov 8, 2010 at 3:58 PM, Rafael J. Wysocki <[email protected]> wrote:
> On Monday, November 08, 2010, Jerome Glisse wrote:
>> On Mon, Nov 8, 2010 at 2:02 PM, Markus Trippelsdorf
>> <[email protected]> wrote:
>> > On Mon, Nov 08, 2010 at 07:43:02PM +0100, Markus Trippelsdorf wrote:
>> >> On Mon, Nov 08, 2010 at 06:07:37PM +0100, Markus Trippelsdorf wrote:
>> >> > On Mon, Nov 08, 2010 at 06:02:21PM +0100, Markus Trippelsdorf wrote:
>> >> > > I can trigger a kernel crash on my system by simply loading this png
>> >> > > image with firefox:
>> >> > > http://mediaarchive.cern.ch/MediaArchive/Photo/Public/2010/1011251/1011251_01/1011251_01-A4-at-144-dpi.jpg
>> >> >
>> >> > Sorry the above link is wrong, this is the right one (that triggers the
>> >> > crash):
>> >> > http://cdsweb.cern.ch/record/1305179/files/HI-150431-630470-huge.png
>> >>
>> >> I triggered it a few more times and took the attached picture.
>> >> It points to the BUG() call at drivers/gpu/drm/ttm/ttm_bo.c:1628 .
>> >> (Sorry for the bad picture quality)
>> >
>> > And here the same BUG in plaintext (should be a bit easier to read):
>> >
>> > Nov 8 19:28:23 arch kernel: ------------[ cut here ]------------
>> > Nov 8 19:28:23 arch kernel: kernel BUG at drivers/gpu/drm/ttm/ttm_bo.c:1628!
>> > Nov 8 19:28:23 arch kernel: invalid opcode: 0000 [#1] PREEMPT SMP
>> > Nov 8 19:28:23 arch kernel: last sysfs file: /sys/devices/pci0000:00/0000:00:18.3/temp1_input
>> > Nov 8 19:28:23 arch kernel: CPU 1
>> > Nov 8 19:28:23 arch kernel: Pid: 1541, comm: X Not tainted 2.6.37-rc1-00116-g151f52f-dirty #31 M4A78T-E/System Product Name
>> > Nov 8 19:28:23 arch kernel: RIP: 0010:[<ffffffff8121f0ff>] [<ffffffff8121f0ff>] ttm_bo_init+0x30f/0x340
>> > Nov 8 19:28:23 arch kernel: RSP: 0018:ffff88011b0fbbe8 EFLAGS: 00010246
>> > Nov 8 19:28:23 arch kernel: RAX: ffff8800da881778 RBX: ffff8800da881620 RCX: ffff88011b15ed78
>> > Nov 8 19:28:23 arch kernel: RDX: ffff8800c1556040 RSI: ffff88011ff22770 RDI: 000000000017adfb
>> > Nov 8 19:28:23 arch kernel: RBP: ffff8800da881648 R08: 0000000000000000 R09: ffff8800c1556040
>> > Nov 8 19:28:23 arch kernel: R10: 000000000ff85205 R11: ffff8800dae19200 R12: 0000000000000001
>> > Nov 8 19:28:23 arch kernel: R13: ffff88011ff22528 R14: ffff88011ff22778 R15: 0000000000000000
>> > Nov 8 19:28:23 arch kernel: FS: 00007f2043043700(0000) GS:ffff8800dfc80000(0000) knlGS:0000000000000000
>> > Nov 8 19:28:23 arch kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> > Nov 8 19:28:23 arch kernel: CR2: 00007f203d057000 CR3: 000000011b12b000 CR4: 00000000000006e0
>> > Nov 8 19:28:23 arch kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>> > Nov 8 19:28:23 arch kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
>> > Nov 8 19:28:23 arch kernel: Process X (pid: 1541, threadinfo ffff88011b0fa000, task ffff88011c959c20)
>> > Nov 8 19:28:23 arch kernel: Stack:
>> > Nov 8 19:28:23 arch kernel: 0000000000000000 ffff8800da881648 ffff88011b0fbd00 ffff8800da881600
>> > Nov 8 19:28:23 arch kernel: ffff88011ff22000 0000000000000000 0000000000000001 00000000fffffff4
>> > Nov 8 19:28:23 arch kernel: ffff88011b0fbd00 ffffffff8125294d 0000000000000000 ffffffff00000001
>> > Nov 8 19:28:23 arch kernel: Call Trace:
>> > Nov 8 19:28:23 arch kernel: [<ffffffff8125294d>] ? radeon_bo_create+0x14d/0x250
>> > Nov 8 19:28:23 arch kernel: [<ffffffff812526c0>] ? radeon_ttm_bo_destroy+0x0/0xb0
>> > Nov 8 19:28:23 arch kernel: [<ffffffff812671cc>] ? radeon_gem_object_create+0x8c/0x130
>> > Nov 8 19:28:23 arch kernel: [<ffffffff81267634>] ? radeon_gem_create_ioctl+0x54/0xd0
>> > Nov 8 19:28:23 arch kernel: [<ffffffff813ab26d>] ? sock_aio_read+0x10d/0x120
>> > Nov 8 19:28:23 arch kernel: [<ffffffff8120963c>] ? drm_ioctl+0x39c/0x450
>> > Nov 8 19:28:23 arch kernel: [<ffffffff812675e0>] ? radeon_gem_create_ioctl+0x0/0xd0
>> > Nov 8 19:28:23 arch kernel: [<ffffffff810dd2c9>] ? do_vfs_ioctl+0xa9/0x610
>> > Nov 8 19:28:23 arch kernel: [<ffffffff810dd879>] ? sys_ioctl+0x49/0x80
>> > Nov 8 19:28:23 arch kernel: [<ffffffff810ce24e>] ? sys_read+0x4e/0x90
>> > Nov 8 19:28:23 arch kernel: [<ffffffff8102dc2b>] ? system_call_fastpath+0x16/0x1b
>> > Nov 8 19:28:23 arch kernel: Code: e8 fb ff ff 85 c0 0f 85 68 ff ff ff 48 8b 7c 24 08 89 04 24 e8 83 d9 ff ff 8b 04 24 48 83 c4 18 5b 5d 41 5c 41 5d 41 5e 41 5f c3 <0f> 0b 48 c7 c7 60 a4 55 81 31 c0 e8 14 80 22 00 b8 ea ff ff ff
>> > Nov 8 19:28:23 arch kernel: RIP [<ffffffff8121f0ff>] ttm_bo_init+0x30f/0x340
>> > Nov 8 19:28:23 arch kernel: RSP <ffff88011b0fbbe8>
>> > Nov 8 19:28:23 arch kernel: ---[ end trace 328a9acba7691d6e ]---
>> > Nov 8 19:28:23 arch kernel: note: X[1541] exited with preempt_count 1
>> > Nov 8 19:28:23 arch kernel: BUG: scheduling while atomic: X/1541/0x10000002
>> > Nov 8 19:28:23 arch kernel: Pid: 1541, comm: X Tainted: G D 2.6.37-rc1-00116-g151f52f-dirty #31
>> > Nov 8 19:28:23 arch kernel: Call Trace:
>> > Nov 8 19:28:23 arch kernel: [<ffffffff81447ad9>] ? schedule+0x639/0x850
>> > Nov 8 19:28:23 arch kernel: [<ffffffff8105826d>] ? __cond_resched+0x1d/0x30
>> > Nov 8 19:28:23 arch kernel: [<ffffffff81447f2f>] ? _cond_resched+0x2f/0x40
>> > Nov 8 19:28:23 arch kernel: [<ffffffff810b57fc>] ? unmap_vmas+0x82c/0x9c0
>> > Nov 8 19:28:23 arch kernel: [<ffffffff810bcb62>] ? exit_mmap+0xe2/0x1a0
>> > Nov 8 19:28:23 arch kernel: [<ffffffff8105a705>] ? mmput+0x25/0xc0
>> > Nov 8 19:28:23 arch kernel: [<ffffffff8105e734>] ? exit_mm+0x104/0x130
>> > Nov 8 19:28:23 arch kernel: [<ffffffff81079ebf>] ? hrtimer_try_to_cancel+0x3f/0x80
>> > Nov 8 19:28:23 arch kernel: [<ffffffff81089d0a>] ? acct_collect+0x9a/0x1a0
>> > Nov 8 19:28:23 arch kernel: [<ffffffff8106045a>] ? do_exit+0x5aa/0x760
>> > Nov 8 19:28:23 arch kernel: [<ffffffff81447163>] ? printk+0x40/0x45
>> > Nov 8 19:28:23 arch kernel: [<ffffffff8105e33c>] ? kmsg_dump+0x7c/0x150
>> > Nov 8 19:28:23 arch kernel: [<ffffffff81031fda>] ? oops_end+0x9a/0xe0
>> > Nov 8 19:28:23 arch kernel: [<ffffffff8102ee74>] ? do_invalid_op+0x84/0xa0
>> > Nov 8 19:28:23 arch kernel: [<ffffffff8121f0ff>] ? ttm_bo_init+0x30f/0x340
>> > Nov 8 19:28:23 arch kernel: [<ffffffff810ddf50>] ? __pollwait+0x0/0x110
>> > Nov 8 19:28:23 arch kernel: [<ffffffff8102e7d5>] ? invalid_op+0x15/0x20
>> > Nov 8 19:28:23 arch kernel: [<ffffffff8121f0ff>] ? ttm_bo_init+0x30f/0x340
>> > Nov 8 19:28:23 arch kernel: [<ffffffff8121efe3>] ? ttm_bo_init+0x1f3/0x340
>> > Nov 8 19:28:23 arch kernel: [<ffffffff8125294d>] ? radeon_bo_create+0x14d/0x250
>> > Nov 8 19:28:23 arch kernel: [<ffffffff812526c0>] ? radeon_ttm_bo_destroy+0x0/0xb0
>> > Nov 8 19:28:23 arch kernel: [<ffffffff812671cc>] ? radeon_gem_object_create+0x8c/0x130
>> > Nov 8 19:28:23 arch kernel: [<ffffffff81267634>] ? radeon_gem_create_ioctl+0x54/0xd0
>> > Nov 8 19:28:23 arch kernel: [<ffffffff813ab26d>] ? sock_aio_read+0x10d/0x120
>> > Nov 8 19:28:23 arch kernel: [<ffffffff8120963c>] ? drm_ioctl+0x39c/0x450
>> > Nov 8 19:28:23 arch kernel: [<ffffffff812675e0>] ? radeon_gem_create_ioctl+0x0/0xd0
>> > Nov 8 19:28:23 arch kernel: [<ffffffff810dd2c9>] ? do_vfs_ioctl+0xa9/0x610
>> > Nov 8 19:28:23 arch kernel: [<ffffffff810dd879>] ? sys_ioctl+0x49/0x80
>> > Nov 8 19:28:23 arch kernel: [<ffffffff810ce24e>] ? sys_read+0x4e/0x90
>> > Nov 8 19:28:23 arch kernel: [<ffffffff8102dc2b>] ? system_call_fastpath+0x16/0x1b
>> > Nov 8 19:28:23 arch kernel: BUG: scheduling while atomic: X/1541/0x10000002
>> > Nov 8 19:28:23 arch kernel: Pid: 1541, comm: X Tainted: G D 2.6.37-rc1-00116-g151f52f-dirty #31
>> > Nov 8 19:28:23 arch kernel: Call Trace:
>> > Nov 8 19:28:23 arch kernel: [<ffffffff81447ad9>] ? schedule+0x639/0x850
>> > Nov 8 19:28:23 arch kernel: [<ffffffff8105826d>] ? __cond_resched+0x1d/0x30
>> > Nov 8 19:28:23 arch kernel: [<ffffffff81447f2f>] ? _cond_resched+0x2f/0x40
>> > Nov 8 19:28:23 arch kernel: [<ffffffff810b57fc>] ? unmap_vmas+0x82c/0x9c0
>> > Nov 8 19:28:23 arch kernel: [<ffffffff810bcb62>] ? exit_mmap+0xe2/0x1a0
>> > Nov 8 19:28:23 arch kernel: [<ffffffff8105a705>] ? mmput+0x25/0xc0
>> > Nov 8 19:28:23 arch kernel: [<ffffffff8105e734>] ? exit_mm+0x104/0x130
>> > Nov 8 19:28:23 arch kernel: [<ffffffff81079ebf>] ? hrtimer_try_to_cancel+0x3f/0x80
>> > Nov 8 19:28:23 arch kernel: [<ffffffff81089d0a>] ? acct_collect+0x9a/0x1a0
>> > Nov 8 19:28:23 arch kernel: [<ffffffff8106045a>] ? do_exit+0x5aa/0x760
>> > Nov 8 19:28:23 arch kernel: [<ffffffff81447163>] ? printk+0x40/0x45
>> > Nov 8 19:28:23 arch kernel: [<ffffffff8105e33c>] ? kmsg_dump+0x7c/0x150
>> > Nov 8 19:28:23 arch kernel: [<ffffffff81031fda>] ? oops_end+0x9a/0xe0
>> > Nov 8 19:28:23 arch kernel: [<ffffffff8102ee74>] ? do_invalid_op+0x84/0xa0
>> > Nov 8 19:28:23 arch kernel: [<ffffffff8121f0ff>] ? ttm_bo_init+0x30f/0x340
>> > Nov 8 19:28:23 arch kernel: [<ffffffff810ddf50>] ? __pollwait+0x0/0x110
>> > Nov 8 19:28:23 arch kernel: [<ffffffff8102e7d5>] ? invalid_op+0x15/0x20
>> > Nov 8 19:28:23 arch kernel: [<ffffffff8121f0ff>] ? ttm_bo_init+0x30f/0x340
>> > Nov 8 19:28:23 arch kernel: [<ffffffff8121efe3>] ? ttm_bo_init+0x1f3/0x340
>> > Nov 8 19:28:23 arch kernel: [<ffffffff8125294d>] ? radeon_bo_create+0x14d/0x250
>> > Nov 8 19:28:23 arch kernel: [<ffffffff812526c0>] ? radeon_ttm_bo_destroy+0x0/0xb0
>> > Nov 8 19:28:23 arch kernel: [<ffffffff812671cc>] ? radeon_gem_object_create+0x8c/0x130
>> > Nov 8 19:28:23 arch kernel: [<ffffffff81267634>] ? radeon_gem_create_ioctl+0x54/0xd0
>> > Nov 8 19:28:23 arch kernel: [<ffffffff813ab26d>] ? sock_aio_read+0x10d/0x120
>> > Nov 8 19:28:23 arch kernel: [<ffffffff8120963c>] ? drm_ioctl+0x39c/0x450
>> > Nov 8 19:28:23 arch kernel: [<ffffffff812675e0>] ? radeon_gem_create_ioctl+0x0/0xd0
>> > Nov 8 19:28:23 arch kernel: [<ffffffff810dd2c9>] ? do_vfs_ioctl+0xa9/0x610
>> > Nov 8 19:28:23 arch kernel: [<ffffffff810dd879>] ? sys_ioctl+0x49/0x80
>> > Nov 8 19:28:23 arch kernel: [<ffffffff810ce24e>] ? sys_read+0x4e/0x90
>> > Nov 8 19:28:23 arch kernel: [<ffffffff8102dc2b>] ? system_call_fastpath+0x16/0x1b
>> >
>>
>> Thomas, this bug seems to point to a case where we end up trying to add
>> an entry at the same offset in the rb tree for addr_space_mm. After
>> carefully reviewing the locking around the rb tree modification and
>> addr_space_mm, I am fairly confident that no race can occur. Would you
>> have any idea what might go wrong here? I guess I would ultimately need
>> to dump the mm and rb tree state when the BUG gets triggered to try to
>> understand the state of things.
>
> Hmm, why are you using BUG in there in the first place? Would it be _so_
> dangerous to continue that we just have to crash here?
>
> Rafael
>

This case should never happen. I guess we could return an error and
refuse to create the bo, but to me it seems that this case is the
result of a corrupted rb or mm structure, so everything might fall
apart in more subtle ways if we bail out on this error.

Jerome
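
To make the corruption concern above concrete, here is a minimal user-space sketch of a consistency check over allocated ranges. It is an illustration only: `demo_range` and `demo_first_conflict` are hypothetical names, and a sorted array stands in for whatever structure the real allocator uses. It assumes the ranges are already sorted by start and that no range is large enough to overflow `unsigned long`.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical description of one allocated range. */
struct demo_range {
	unsigned long start;
	unsigned long size;
};

/*
 * Expects the ranges sorted by start. Returns the index of the first
 * entry that begins before the previous entry ends, or -1 if no two
 * neighbouring ranges overlap.
 */
static long demo_first_conflict(const struct demo_range *r, size_t n)
{
	size_t i;

	assert(r != NULL || n == 0);
	for (i = 1; i < n; i++) {
		if (r[i].start < r[i - 1].start + r[i - 1].size)
			return (long)i;
	}
	return -1;
}
```

A check like this reports where the bookkeeping stops being consistent, which is the kind of information an error return alone would not provide.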

2010-11-08 22:25:55

by Thomas Hellstrom

Subject: Re: Radeon RS780 - BUG: unable to handle kernel NULL pointer dereference

On 11/08/2010 09:58 PM, Rafael J. Wysocki wrote:
> On Monday, November 08, 2010, Jerome Glisse wrote:
>
>> On Mon, Nov 8, 2010 at 2:02 PM, Markus Trippelsdorf
>> <[email protected]> wrote:
>>
>>> On Mon, Nov 08, 2010 at 07:43:02PM +0100, Markus Trippelsdorf wrote:
>>>
>>>> On Mon, Nov 08, 2010 at 06:07:37PM +0100, Markus Trippelsdorf wrote:
>>>>
>>>>> On Mon, Nov 08, 2010 at 06:02:21PM +0100, Markus Trippelsdorf wrote:
>>>>>
>>>>>> I can trigger a kernel crash on my system by simply loading this png
>>>>>> image with firefox:
>>>>>> http://mediaarchive.cern.ch/MediaArchive/Photo/Public/2010/1011251/1011251_01/1011251_01-A4-at-144-dpi.jpg
>>>>>>
>>>>> Sorry the above link is wrong, this is the right one (that triggers the
>>>>> crash):
>>>>> http://cdsweb.cern.ch/record/1305179/files/HI-150431-630470-huge.png
>>>>>
>>>> I triggered it a few more times and took the attached picture.
>>>> It points to the BUG() call at drivers/gpu/drm/ttm/ttm_bo.c:1628 .
>>>> (Sorry for the bad picture quality)
>>>>
>>> And here the same BUG in plaintext (should be a bit easier to read):
>>>
>>> Nov 8 19:28:23 arch kernel: ------------[ cut here ]------------
>>> Nov 8 19:28:23 arch kernel: kernel BUG at drivers/gpu/drm/ttm/ttm_bo.c:1628!
>>> Nov 8 19:28:23 arch kernel: invalid opcode: 0000 [#1] PREEMPT SMP
>>> Nov 8 19:28:23 arch kernel: last sysfs file: /sys/devices/pci0000:00/0000:00:18.3/temp1_input
>>> Nov 8 19:28:23 arch kernel: CPU 1
>>> Nov 8 19:28:23 arch kernel: Pid: 1541, comm: X Not tainted 2.6.37-rc1-00116-g151f52f-dirty #31 M4A78T-E/System Product Name
>>> Nov 8 19:28:23 arch kernel: RIP: 0010:[<ffffffff8121f0ff>] [<ffffffff8121f0ff>] ttm_bo_init+0x30f/0x340
>>> Nov 8 19:28:23 arch kernel: RSP: 0018:ffff88011b0fbbe8 EFLAGS: 00010246
>>> Nov 8 19:28:23 arch kernel: RAX: ffff8800da881778 RBX: ffff8800da881620 RCX: ffff88011b15ed78
>>> Nov 8 19:28:23 arch kernel: RDX: ffff8800c1556040 RSI: ffff88011ff22770 RDI: 000000000017adfb
>>> Nov 8 19:28:23 arch kernel: RBP: ffff8800da881648 R08: 0000000000000000 R09: ffff8800c1556040
>>> Nov 8 19:28:23 arch kernel: R10: 000000000ff85205 R11: ffff8800dae19200 R12: 0000000000000001
>>> Nov 8 19:28:23 arch kernel: R13: ffff88011ff22528 R14: ffff88011ff22778 R15: 0000000000000000
>>> Nov 8 19:28:23 arch kernel: FS: 00007f2043043700(0000) GS:ffff8800dfc80000(0000) knlGS:0000000000000000
>>> Nov 8 19:28:23 arch kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>> Nov 8 19:28:23 arch kernel: CR2: 00007f203d057000 CR3: 000000011b12b000 CR4: 00000000000006e0
>>> Nov 8 19:28:23 arch kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>>> Nov 8 19:28:23 arch kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
>>> Nov 8 19:28:23 arch kernel: Process X (pid: 1541, threadinfo ffff88011b0fa000, task ffff88011c959c20)
>>> Nov 8 19:28:23 arch kernel: Stack:
>>> Nov 8 19:28:23 arch kernel: 0000000000000000 ffff8800da881648 ffff88011b0fbd00 ffff8800da881600
>>> Nov 8 19:28:23 arch kernel: ffff88011ff22000 0000000000000000 0000000000000001 00000000fffffff4
>>> Nov 8 19:28:23 arch kernel: ffff88011b0fbd00 ffffffff8125294d 0000000000000000 ffffffff00000001
>>> Nov 8 19:28:23 arch kernel: Call Trace:
>>> Nov 8 19:28:23 arch kernel: [<ffffffff8125294d>] ? radeon_bo_create+0x14d/0x250
>>> Nov 8 19:28:23 arch kernel: [<ffffffff812526c0>] ? radeon_ttm_bo_destroy+0x0/0xb0
>>> Nov 8 19:28:23 arch kernel: [<ffffffff812671cc>] ? radeon_gem_object_create+0x8c/0x130
>>> Nov 8 19:28:23 arch kernel: [<ffffffff81267634>] ? radeon_gem_create_ioctl+0x54/0xd0
>>> Nov 8 19:28:23 arch kernel: [<ffffffff813ab26d>] ? sock_aio_read+0x10d/0x120
>>> Nov 8 19:28:23 arch kernel: [<ffffffff8120963c>] ? drm_ioctl+0x39c/0x450
>>> Nov 8 19:28:23 arch kernel: [<ffffffff812675e0>] ? radeon_gem_create_ioctl+0x0/0xd0
>>> Nov 8 19:28:23 arch kernel: [<ffffffff810dd2c9>] ? do_vfs_ioctl+0xa9/0x610
>>> Nov 8 19:28:23 arch kernel: [<ffffffff810dd879>] ? sys_ioctl+0x49/0x80
>>> Nov 8 19:28:23 arch kernel: [<ffffffff810ce24e>] ? sys_read+0x4e/0x90
>>> Nov 8 19:28:23 arch kernel: [<ffffffff8102dc2b>] ? system_call_fastpath+0x16/0x1b
>>> Nov 8 19:28:23 arch kernel: Code: e8 fb ff ff 85 c0 0f 85 68 ff ff ff 48 8b 7c 24 08 89 04 24 e8 83 d9 ff ff 8b 04 24 48 83 c4 18 5b 5d 41 5c 41 5d 41 5e 41 5f c3 <0f> 0b 48 c7 c7 60 a4 55 81 31 c0 e8 14 80 22 00 b8 ea ff ff ff
>>> Nov 8 19:28:23 arch kernel: RIP [<ffffffff8121f0ff>] ttm_bo_init+0x30f/0x340
>>> Nov 8 19:28:23 arch kernel: RSP <ffff88011b0fbbe8>
>>> Nov 8 19:28:23 arch kernel: ---[ end trace 328a9acba7691d6e ]---
>>> Nov 8 19:28:23 arch kernel: note: X[1541] exited with preempt_count 1
>>> Nov 8 19:28:23 arch kernel: BUG: scheduling while atomic: X/1541/0x10000002
>>> Nov 8 19:28:23 arch kernel: Pid: 1541, comm: X Tainted: G D 2.6.37-rc1-00116-g151f52f-dirty #31
>>> Nov 8 19:28:23 arch kernel: Call Trace:
>>> Nov 8 19:28:23 arch kernel: [<ffffffff81447ad9>] ? schedule+0x639/0x850
>>> Nov 8 19:28:23 arch kernel: [<ffffffff8105826d>] ? __cond_resched+0x1d/0x30
>>> Nov 8 19:28:23 arch kernel: [<ffffffff81447f2f>] ? _cond_resched+0x2f/0x40
>>> Nov 8 19:28:23 arch kernel: [<ffffffff810b57fc>] ? unmap_vmas+0x82c/0x9c0
>>> Nov 8 19:28:23 arch kernel: [<ffffffff810bcb62>] ? exit_mmap+0xe2/0x1a0
>>> Nov 8 19:28:23 arch kernel: [<ffffffff8105a705>] ? mmput+0x25/0xc0
>>> Nov 8 19:28:23 arch kernel: [<ffffffff8105e734>] ? exit_mm+0x104/0x130
>>> Nov 8 19:28:23 arch kernel: [<ffffffff81079ebf>] ? hrtimer_try_to_cancel+0x3f/0x80
>>> Nov 8 19:28:23 arch kernel: [<ffffffff81089d0a>] ? acct_collect+0x9a/0x1a0
>>> Nov 8 19:28:23 arch kernel: [<ffffffff8106045a>] ? do_exit+0x5aa/0x760
>>> Nov 8 19:28:23 arch kernel: [<ffffffff81447163>] ? printk+0x40/0x45
>>> Nov 8 19:28:23 arch kernel: [<ffffffff8105e33c>] ? kmsg_dump+0x7c/0x150
>>> Nov 8 19:28:23 arch kernel: [<ffffffff81031fda>] ? oops_end+0x9a/0xe0
>>> Nov 8 19:28:23 arch kernel: [<ffffffff8102ee74>] ? do_invalid_op+0x84/0xa0
>>> Nov 8 19:28:23 arch kernel: [<ffffffff8121f0ff>] ? ttm_bo_init+0x30f/0x340
>>> Nov 8 19:28:23 arch kernel: [<ffffffff810ddf50>] ? __pollwait+0x0/0x110
>>> Nov 8 19:28:23 arch kernel: [<ffffffff8102e7d5>] ? invalid_op+0x15/0x20
>>> Nov 8 19:28:23 arch kernel: [<ffffffff8121f0ff>] ? ttm_bo_init+0x30f/0x340
>>> Nov 8 19:28:23 arch kernel: [<ffffffff8121efe3>] ? ttm_bo_init+0x1f3/0x340
>>> Nov 8 19:28:23 arch kernel: [<ffffffff8125294d>] ? radeon_bo_create+0x14d/0x250
>>> Nov 8 19:28:23 arch kernel: [<ffffffff812526c0>] ? radeon_ttm_bo_destroy+0x0/0xb0
>>> Nov 8 19:28:23 arch kernel: [<ffffffff812671cc>] ? radeon_gem_object_create+0x8c/0x130
>>> Nov 8 19:28:23 arch kernel: [<ffffffff81267634>] ? radeon_gem_create_ioctl+0x54/0xd0
>>> Nov 8 19:28:23 arch kernel: [<ffffffff813ab26d>] ? sock_aio_read+0x10d/0x120
>>> Nov 8 19:28:23 arch kernel: [<ffffffff8120963c>] ? drm_ioctl+0x39c/0x450
>>> Nov 8 19:28:23 arch kernel: [<ffffffff812675e0>] ? radeon_gem_create_ioctl+0x0/0xd0
>>> Nov 8 19:28:23 arch kernel: [<ffffffff810dd2c9>] ? do_vfs_ioctl+0xa9/0x610
>>> Nov 8 19:28:23 arch kernel: [<ffffffff810dd879>] ? sys_ioctl+0x49/0x80
>>> Nov 8 19:28:23 arch kernel: [<ffffffff810ce24e>] ? sys_read+0x4e/0x90
>>> Nov 8 19:28:23 arch kernel: [<ffffffff8102dc2b>] ? system_call_fastpath+0x16/0x1b
>>> Nov 8 19:28:23 arch kernel: BUG: scheduling while atomic: X/1541/0x10000002
>>> Nov 8 19:28:23 arch kernel: Pid: 1541, comm: X Tainted: G D 2.6.37-rc1-00116-g151f52f-dirty #31
>>> Nov 8 19:28:23 arch kernel: Call Trace:
>>> Nov 8 19:28:23 arch kernel: [<ffffffff81447ad9>] ? schedule+0x639/0x850
>>> Nov 8 19:28:23 arch kernel: [<ffffffff8105826d>] ? __cond_resched+0x1d/0x30
>>> Nov 8 19:28:23 arch kernel: [<ffffffff81447f2f>] ? _cond_resched+0x2f/0x40
>>> Nov 8 19:28:23 arch kernel: [<ffffffff810b57fc>] ? unmap_vmas+0x82c/0x9c0
>>> Nov 8 19:28:23 arch kernel: [<ffffffff810bcb62>] ? exit_mmap+0xe2/0x1a0
>>> Nov 8 19:28:23 arch kernel: [<ffffffff8105a705>] ? mmput+0x25/0xc0
>>> Nov 8 19:28:23 arch kernel: [<ffffffff8105e734>] ? exit_mm+0x104/0x130
>>> Nov 8 19:28:23 arch kernel: [<ffffffff81079ebf>] ? hrtimer_try_to_cancel+0x3f/0x80
>>> Nov 8 19:28:23 arch kernel: [<ffffffff81089d0a>] ? acct_collect+0x9a/0x1a0
>>> Nov 8 19:28:23 arch kernel: [<ffffffff8106045a>] ? do_exit+0x5aa/0x760
>>> Nov 8 19:28:23 arch kernel: [<ffffffff81447163>] ? printk+0x40/0x45
>>> Nov 8 19:28:23 arch kernel: [<ffffffff8105e33c>] ? kmsg_dump+0x7c/0x150
>>> Nov 8 19:28:23 arch kernel: [<ffffffff81031fda>] ? oops_end+0x9a/0xe0
>>> Nov 8 19:28:23 arch kernel: [<ffffffff8102ee74>] ? do_invalid_op+0x84/0xa0
>>> Nov 8 19:28:23 arch kernel: [<ffffffff8121f0ff>] ? ttm_bo_init+0x30f/0x340
>>> Nov 8 19:28:23 arch kernel: [<ffffffff810ddf50>] ? __pollwait+0x0/0x110
>>> Nov 8 19:28:23 arch kernel: [<ffffffff8102e7d5>] ? invalid_op+0x15/0x20
>>> Nov 8 19:28:23 arch kernel: [<ffffffff8121f0ff>] ? ttm_bo_init+0x30f/0x340
>>> Nov 8 19:28:23 arch kernel: [<ffffffff8121efe3>] ? ttm_bo_init+0x1f3/0x340
>>> Nov 8 19:28:23 arch kernel: [<ffffffff8125294d>] ? radeon_bo_create+0x14d/0x250
>>> Nov 8 19:28:23 arch kernel: [<ffffffff812526c0>] ? radeon_ttm_bo_destroy+0x0/0xb0
>>> Nov 8 19:28:23 arch kernel: [<ffffffff812671cc>] ? radeon_gem_object_create+0x8c/0x130
>>> Nov 8 19:28:23 arch kernel: [<ffffffff81267634>] ? radeon_gem_create_ioctl+0x54/0xd0
>>> Nov 8 19:28:23 arch kernel: [<ffffffff813ab26d>] ? sock_aio_read+0x10d/0x120
>>> Nov 8 19:28:23 arch kernel: [<ffffffff8120963c>] ? drm_ioctl+0x39c/0x450
>>> Nov 8 19:28:23 arch kernel: [<ffffffff812675e0>] ? radeon_gem_create_ioctl+0x0/0xd0
>>> Nov 8 19:28:23 arch kernel: [<ffffffff810dd2c9>] ? do_vfs_ioctl+0xa9/0x610
>>> Nov 8 19:28:23 arch kernel: [<ffffffff810dd879>] ? sys_ioctl+0x49/0x80
>>> Nov 8 19:28:23 arch kernel: [<ffffffff810ce24e>] ? sys_read+0x4e/0x90
>>> Nov 8 19:28:23 arch kernel: [<ffffffff8102dc2b>] ? system_call_fastpath+0x16/0x1b
>>>
>>>
>> Thomas, this bug seems to point to a case where we end up trying to add
>> an entry at the same offset in the rb tree for addr_space_mm. After
>> carefully reviewing the locking around the rb tree modification and
>> addr_space_mm, I am fairly confident that no race can occur. Would you
>> have any idea what might go wrong here? I guess I would ultimately need
>> to dump the mm and rb tree state when the BUG triggers to understand
>> the state of things.
>>
> Hmm, why are you using BUG in there in the first place? Would it be _so_
> dangerous to continue that we just have to crash here?
>
> Rafael
>
BUGs in the TTM module are there to catch incorrect usage of the TTM
API, and the intention is that they should only happen during
development or stabilizing phases. In this case, we're probably seeing
the symptoms of memory corruption or a buggy range manager change.

/Thomas

2010-11-08 22:29:29

by Thomas Hellstrom

Subject: Re: Radeon RS780 - BUG: unable to handle kernel NULL pointer dereference

On 11/08/2010 09:53 PM, Jerome Glisse wrote:
> On Mon, Nov 8, 2010 at 2:02 PM, Markus Trippelsdorf
> <[email protected]> wrote:
>
>> On Mon, Nov 08, 2010 at 07:43:02PM +0100, Markus Trippelsdorf wrote:
>>
>>> On Mon, Nov 08, 2010 at 06:07:37PM +0100, Markus Trippelsdorf wrote:
>>>
>>>> On Mon, Nov 08, 2010 at 06:02:21PM +0100, Markus Trippelsdorf wrote:
>>>>
>>>>> I can trigger a kernel crash on my system by simply loading this png
>>>>> image with firefox:
>>>>> http://mediaarchive.cern.ch/MediaArchive/Photo/Public/2010/1011251/1011251_01/1011251_01-A4-at-144-dpi.jpg
>>>>>
>>>> Sorry the above link is wrong, this is the right one (that triggers the
>>>> crash):
>>>> http://cdsweb.cern.ch/record/1305179/files/HI-150431-630470-huge.png
>>>>
>>> I triggered it a few more times and took the attached picture.
>>> It points to the BUG() call at drivers/gpu/drm/ttm/ttm_bo.c:1628 .
>>> (Sorry for the bad picture quality)
>>>
>> And here the same BUG in plaintext (should be a bit easier to read):
>>
>> Nov 8 19:28:23 arch kernel: ------------[ cut here ]------------
>> Nov 8 19:28:23 arch kernel: kernel BUG at drivers/gpu/drm/ttm/ttm_bo.c:1628!
>> Nov 8 19:28:23 arch kernel: invalid opcode: 0000 [#1] PREEMPT SMP
>> Nov 8 19:28:23 arch kernel: last sysfs file: /sys/devices/pci0000:00/0000:00:18.3/temp1_input
>> Nov 8 19:28:23 arch kernel: CPU 1
>> Nov 8 19:28:23 arch kernel: Pid: 1541, comm: X Not tainted 2.6.37-rc1-00116-g151f52f-dirty #31 M4A78T-E/System Product Name
>> Nov 8 19:28:23 arch kernel: RIP: 0010:[<ffffffff8121f0ff>] [<ffffffff8121f0ff>] ttm_bo_init+0x30f/0x340
>> Nov 8 19:28:23 arch kernel: RSP: 0018:ffff88011b0fbbe8 EFLAGS: 00010246
>> Nov 8 19:28:23 arch kernel: RAX: ffff8800da881778 RBX: ffff8800da881620 RCX: ffff88011b15ed78
>> Nov 8 19:28:23 arch kernel: RDX: ffff8800c1556040 RSI: ffff88011ff22770 RDI: 000000000017adfb
>> Nov 8 19:28:23 arch kernel: RBP: ffff8800da881648 R08: 0000000000000000 R09: ffff8800c1556040
>> Nov 8 19:28:23 arch kernel: R10: 000000000ff85205 R11: ffff8800dae19200 R12: 0000000000000001
>> Nov 8 19:28:23 arch kernel: R13: ffff88011ff22528 R14: ffff88011ff22778 R15: 0000000000000000
>> Nov 8 19:28:23 arch kernel: FS: 00007f2043043700(0000) GS:ffff8800dfc80000(0000) knlGS:0000000000000000
>> Nov 8 19:28:23 arch kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> Nov 8 19:28:23 arch kernel: CR2: 00007f203d057000 CR3: 000000011b12b000 CR4: 00000000000006e0
>> Nov 8 19:28:23 arch kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>> Nov 8 19:28:23 arch kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
>> Nov 8 19:28:23 arch kernel: Process X (pid: 1541, threadinfo ffff88011b0fa000, task ffff88011c959c20)
>> Nov 8 19:28:23 arch kernel: Stack:
>> Nov 8 19:28:23 arch kernel: 0000000000000000 ffff8800da881648 ffff88011b0fbd00 ffff8800da881600
>> Nov 8 19:28:23 arch kernel: ffff88011ff22000 0000000000000000 0000000000000001 00000000fffffff4
>> Nov 8 19:28:23 arch kernel: ffff88011b0fbd00 ffffffff8125294d 0000000000000000 ffffffff00000001
>> Nov 8 19:28:23 arch kernel: Call Trace:
>> Nov 8 19:28:23 arch kernel: [<ffffffff8125294d>] ? radeon_bo_create+0x14d/0x250
>> Nov 8 19:28:23 arch kernel: [<ffffffff812526c0>] ? radeon_ttm_bo_destroy+0x0/0xb0
>> Nov 8 19:28:23 arch kernel: [<ffffffff812671cc>] ? radeon_gem_object_create+0x8c/0x130
>> Nov 8 19:28:23 arch kernel: [<ffffffff81267634>] ? radeon_gem_create_ioctl+0x54/0xd0
>> Nov 8 19:28:23 arch kernel: [<ffffffff813ab26d>] ? sock_aio_read+0x10d/0x120
>> Nov 8 19:28:23 arch kernel: [<ffffffff8120963c>] ? drm_ioctl+0x39c/0x450
>> Nov 8 19:28:23 arch kernel: [<ffffffff812675e0>] ? radeon_gem_create_ioctl+0x0/0xd0
>> Nov 8 19:28:23 arch kernel: [<ffffffff810dd2c9>] ? do_vfs_ioctl+0xa9/0x610
>> Nov 8 19:28:23 arch kernel: [<ffffffff810dd879>] ? sys_ioctl+0x49/0x80
>> Nov 8 19:28:23 arch kernel: [<ffffffff810ce24e>] ? sys_read+0x4e/0x90
>> Nov 8 19:28:23 arch kernel: [<ffffffff8102dc2b>] ? system_call_fastpath+0x16/0x1b
>> Nov 8 19:28:23 arch kernel: Code: e8 fb ff ff 85 c0 0f 85 68 ff ff ff 48 8b 7c 24 08 89 04 24 e8 83 d9 ff ff 8b 04 24 48 83 c4 18 5b 5d 41 5c 41 5d 41 5e 41 5f c3 <0f> 0b 48 c7 c7 60 a4 55 81 31 c0 e8 14 80 22 00 b8 ea ff ff ff
>> Nov 8 19:28:23 arch kernel: RIP [<ffffffff8121f0ff>] ttm_bo_init+0x30f/0x340
>> Nov 8 19:28:23 arch kernel: RSP <ffff88011b0fbbe8>
>> Nov 8 19:28:23 arch kernel: ---[ end trace 328a9acba7691d6e ]---
>> Nov 8 19:28:23 arch kernel: note: X[1541] exited with preempt_count 1
>> Nov 8 19:28:23 arch kernel: BUG: scheduling while atomic: X/1541/0x10000002
>> Nov 8 19:28:23 arch kernel: Pid: 1541, comm: X Tainted: G D 2.6.37-rc1-00116-g151f52f-dirty #31
>> Nov 8 19:28:23 arch kernel: Call Trace:
>> Nov 8 19:28:23 arch kernel: [<ffffffff81447ad9>] ? schedule+0x639/0x850
>> Nov 8 19:28:23 arch kernel: [<ffffffff8105826d>] ? __cond_resched+0x1d/0x30
>> Nov 8 19:28:23 arch kernel: [<ffffffff81447f2f>] ? _cond_resched+0x2f/0x40
>> Nov 8 19:28:23 arch kernel: [<ffffffff810b57fc>] ? unmap_vmas+0x82c/0x9c0
>> Nov 8 19:28:23 arch kernel: [<ffffffff810bcb62>] ? exit_mmap+0xe2/0x1a0
>> Nov 8 19:28:23 arch kernel: [<ffffffff8105a705>] ? mmput+0x25/0xc0
>> Nov 8 19:28:23 arch kernel: [<ffffffff8105e734>] ? exit_mm+0x104/0x130
>> Nov 8 19:28:23 arch kernel: [<ffffffff81079ebf>] ? hrtimer_try_to_cancel+0x3f/0x80
>> Nov 8 19:28:23 arch kernel: [<ffffffff81089d0a>] ? acct_collect+0x9a/0x1a0
>> Nov 8 19:28:23 arch kernel: [<ffffffff8106045a>] ? do_exit+0x5aa/0x760
>> Nov 8 19:28:23 arch kernel: [<ffffffff81447163>] ? printk+0x40/0x45
>> Nov 8 19:28:23 arch kernel: [<ffffffff8105e33c>] ? kmsg_dump+0x7c/0x150
>> Nov 8 19:28:23 arch kernel: [<ffffffff81031fda>] ? oops_end+0x9a/0xe0
>> Nov 8 19:28:23 arch kernel: [<ffffffff8102ee74>] ? do_invalid_op+0x84/0xa0
>> Nov 8 19:28:23 arch kernel: [<ffffffff8121f0ff>] ? ttm_bo_init+0x30f/0x340
>> Nov 8 19:28:23 arch kernel: [<ffffffff810ddf50>] ? __pollwait+0x0/0x110
>> Nov 8 19:28:23 arch kernel: [<ffffffff8102e7d5>] ? invalid_op+0x15/0x20
>> Nov 8 19:28:23 arch kernel: [<ffffffff8121f0ff>] ? ttm_bo_init+0x30f/0x340
>> Nov 8 19:28:23 arch kernel: [<ffffffff8121efe3>] ? ttm_bo_init+0x1f3/0x340
>> Nov 8 19:28:23 arch kernel: [<ffffffff8125294d>] ? radeon_bo_create+0x14d/0x250
>> Nov 8 19:28:23 arch kernel: [<ffffffff812526c0>] ? radeon_ttm_bo_destroy+0x0/0xb0
>> Nov 8 19:28:23 arch kernel: [<ffffffff812671cc>] ? radeon_gem_object_create+0x8c/0x130
>> Nov 8 19:28:23 arch kernel: [<ffffffff81267634>] ? radeon_gem_create_ioctl+0x54/0xd0
>> Nov 8 19:28:23 arch kernel: [<ffffffff813ab26d>] ? sock_aio_read+0x10d/0x120
>> Nov 8 19:28:23 arch kernel: [<ffffffff8120963c>] ? drm_ioctl+0x39c/0x450
>> Nov 8 19:28:23 arch kernel: [<ffffffff812675e0>] ? radeon_gem_create_ioctl+0x0/0xd0
>> Nov 8 19:28:23 arch kernel: [<ffffffff810dd2c9>] ? do_vfs_ioctl+0xa9/0x610
>> Nov 8 19:28:23 arch kernel: [<ffffffff810dd879>] ? sys_ioctl+0x49/0x80
>> Nov 8 19:28:23 arch kernel: [<ffffffff810ce24e>] ? sys_read+0x4e/0x90
>> Nov 8 19:28:23 arch kernel: [<ffffffff8102dc2b>] ? system_call_fastpath+0x16/0x1b
>> Nov 8 19:28:23 arch kernel: BUG: scheduling while atomic: X/1541/0x10000002
>> Nov 8 19:28:23 arch kernel: Pid: 1541, comm: X Tainted: G D 2.6.37-rc1-00116-g151f52f-dirty #31
>> Nov 8 19:28:23 arch kernel: Call Trace:
>> Nov 8 19:28:23 arch kernel: [<ffffffff81447ad9>] ? schedule+0x639/0x850
>> Nov 8 19:28:23 arch kernel: [<ffffffff8105826d>] ? __cond_resched+0x1d/0x30
>> Nov 8 19:28:23 arch kernel: [<ffffffff81447f2f>] ? _cond_resched+0x2f/0x40
>> Nov 8 19:28:23 arch kernel: [<ffffffff810b57fc>] ? unmap_vmas+0x82c/0x9c0
>> Nov 8 19:28:23 arch kernel: [<ffffffff810bcb62>] ? exit_mmap+0xe2/0x1a0
>> Nov 8 19:28:23 arch kernel: [<ffffffff8105a705>] ? mmput+0x25/0xc0
>> Nov 8 19:28:23 arch kernel: [<ffffffff8105e734>] ? exit_mm+0x104/0x130
>> Nov 8 19:28:23 arch kernel: [<ffffffff81079ebf>] ? hrtimer_try_to_cancel+0x3f/0x80
>> Nov 8 19:28:23 arch kernel: [<ffffffff81089d0a>] ? acct_collect+0x9a/0x1a0
>> Nov 8 19:28:23 arch kernel: [<ffffffff8106045a>] ? do_exit+0x5aa/0x760
>> Nov 8 19:28:23 arch kernel: [<ffffffff81447163>] ? printk+0x40/0x45
>> Nov 8 19:28:23 arch kernel: [<ffffffff8105e33c>] ? kmsg_dump+0x7c/0x150
>> Nov 8 19:28:23 arch kernel: [<ffffffff81031fda>] ? oops_end+0x9a/0xe0
>> Nov 8 19:28:23 arch kernel: [<ffffffff8102ee74>] ? do_invalid_op+0x84/0xa0
>> Nov 8 19:28:23 arch kernel: [<ffffffff8121f0ff>] ? ttm_bo_init+0x30f/0x340
>> Nov 8 19:28:23 arch kernel: [<ffffffff810ddf50>] ? __pollwait+0x0/0x110
>> Nov 8 19:28:23 arch kernel: [<ffffffff8102e7d5>] ? invalid_op+0x15/0x20
>> Nov 8 19:28:23 arch kernel: [<ffffffff8121f0ff>] ? ttm_bo_init+0x30f/0x340
>> Nov 8 19:28:23 arch kernel: [<ffffffff8121efe3>] ? ttm_bo_init+0x1f3/0x340
>> Nov 8 19:28:23 arch kernel: [<ffffffff8125294d>] ? radeon_bo_create+0x14d/0x250
>> Nov 8 19:28:23 arch kernel: [<ffffffff812526c0>] ? radeon_ttm_bo_destroy+0x0/0xb0
>> Nov 8 19:28:23 arch kernel: [<ffffffff812671cc>] ? radeon_gem_object_create+0x8c/0x130
>> Nov 8 19:28:23 arch kernel: [<ffffffff81267634>] ? radeon_gem_create_ioctl+0x54/0xd0
>> Nov 8 19:28:23 arch kernel: [<ffffffff813ab26d>] ? sock_aio_read+0x10d/0x120
>> Nov 8 19:28:23 arch kernel: [<ffffffff8120963c>] ? drm_ioctl+0x39c/0x450
>> Nov 8 19:28:23 arch kernel: [<ffffffff812675e0>] ? radeon_gem_create_ioctl+0x0/0xd0
>> Nov 8 19:28:23 arch kernel: [<ffffffff810dd2c9>] ? do_vfs_ioctl+0xa9/0x610
>> Nov 8 19:28:23 arch kernel: [<ffffffff810dd879>] ? sys_ioctl+0x49/0x80
>> Nov 8 19:28:23 arch kernel: [<ffffffff810ce24e>] ? sys_read+0x4e/0x90
>> Nov 8 19:28:23 arch kernel: [<ffffffff8102dc2b>] ? system_call_fastpath+0x16/0x1b
>>
>>
> Thomas, this bug seems to point to a case where we end up trying to add
> an entry at the same offset in the rb tree for addr_space_mm. After
> carefully reviewing the locking around the rb tree modification and
> addr_space_mm, I am fairly confident that no race can occur. Would you
> have any idea what might go wrong here? I guess I would ultimately need
> to dump the mm and rb tree state when the BUG triggers to understand
> the state of things.
>
> Cheers,
> Jerome
>

I agree there shouldn't be a race in this case.
The locking around these operations is simple and straightforward.

So this IMHO should either be a memory corruption or a bug in the range
manager. I've never seen this BUG trigger before. Dumping mm / rb tree
contents or bisecting should probably find the culprit.

/Thomas



2010-11-09 09:29:27

by Markus Trippelsdorf

Subject: Re: Radeon RS780 - BUG: unable to handle kernel NULL pointer dereference

On Mon, Nov 08, 2010 at 11:29:16PM +0100, Thomas Hellstrom wrote:
> On 11/08/2010 09:53 PM, Jerome Glisse wrote:
> >On Mon, Nov 8, 2010 at 2:02 PM, Markus Trippelsdorf
> ><[email protected]> wrote:
> >>On Mon, Nov 08, 2010 at 07:43:02PM +0100, Markus Trippelsdorf wrote:
> >>>On Mon, Nov 08, 2010 at 06:07:37PM +0100, Markus Trippelsdorf wrote:
> >>>>On Mon, Nov 08, 2010 at 06:02:21PM +0100, Markus Trippelsdorf wrote:
> >>>>>I can trigger a kernel crash on my system by simply loading this png
> >>>>>image with firefox:
> >>>>>http://mediaarchive.cern.ch/MediaArchive/Photo/Public/2010/1011251/1011251_01/1011251_01-A4-at-144-dpi.jpg
> >>>>Sorry the above link is wrong, this is the right one (that triggers the
> >>>>crash):
> >>>>http://cdsweb.cern.ch/record/1305179/files/HI-150431-630470-huge.png
> >>>I triggered it a few more times and took the attached picture.
> >>>It points to the BUG() call at drivers/gpu/drm/ttm/ttm_bo.c:1628 .
> >>>(Sorry for the bad picture quality)
> >>And here the same BUG in plaintext (should be a bit easier to read):
> >>
> >>Nov 8 19:28:23 arch kernel: ------------[ cut here ]------------
> >>Nov 8 19:28:23 arch kernel: kernel BUG at drivers/gpu/drm/ttm/ttm_bo.c:1628!
> >>
> >Thomas, this bug seems to point to a case where we end up trying to add
> >an entry at the same offset in the rb tree for addr_space_mm. After
> >carefully reviewing the locking around the rb tree modification and
> >addr_space_mm, I am fairly confident that no race can occur. Would you
> >have any idea what might go wrong here? I guess I would ultimately need
> >to dump the mm and rb tree state when the BUG triggers to understand
> >the state of things.
>
> I agree there shouldn't be a race in this case.
> The locking around these operations is simple and straightforward.
>
> So this IMHO should either be a memory corruption or a bug in the
> range manager. I've never seen this BUG trigger before. Dumping mm /
> rb tree contents or bisecting should probably find the culprit.

OK I've found the buggy commit by bisection:

e376573f7267390f4e1bdc552564b6fb913bce76 is the first bad commit
commit e376573f7267390f4e1bdc552564b6fb913bce76
Author: Michel Dänzer <[email protected]>
Date: Thu Jul 8 12:43:28 2010 +1000

drm/radeon: fall back to GTT if bo creation/validation in VRAM fails.

This fixes a problem where on low VRAM cards we'd run out of space for validation.

[airlied: Tested on my M7, Thinkpad T42, compiz works with no problems.]

Signed-off-by: Michel Dänzer <[email protected]>
Cc: [email protected]
Signed-off-by: Dave Airlie <[email protected]>

Please note that this is an old commit from 2.6.36-rc. When I revert it the
kernel no longer crashes. Instead I see the following in my dmesg:

[TTM] Failed to find memory space for buffer 0xffff880113e10e48 eviction.
[TTM] No space for ffff880113e10e48 (25650 pages, 102600K, 100M)
[TTM] placement[0]=0x00070002 (1)
[TTM] has_type: 1
[TTM] use_type: 1
[TTM] flags: 0x0000000A
[TTM] gpu_offset: 0xA0000000
[TTM] size: 131072
[TTM] available_caching: 0x00070000
[TTM] default_caching: 0x00010000
[TTM] 0x00000000-0x00000001: 1: used
[TTM] 0x00000001-0x00000011: 16: used
[TTM] 0x00000011-0x00000111: 256: used
[TTM] 0x00000111-0x00000211: 256: used
[TTM] 0x00000211-0x00000248: 55: free
[TTM] 0x00000248-0x0000024c: 4: used
[TTM] 0x0000024c-0x00001976: 5930: free
[TTM] 0x00001976-0x000021aa: 2100: used
[TTM] 0x000021aa-0x0000285f: 1717: free
[TTM] 0x0000285f-0x00002860: 1: used
[TTM] 0x00002860-0x00002873: 19: free
[TTM] 0x00002873-0x000029b3: 320: used
[TTM] 0x000029b3-0x00020000: 120397: free
[TTM] total: 131072, used 2954 free 128118
[drm:radeon_cs_ioctl] *ERROR* Failed to parse relocation -12!
radeon 0000:01:05.0: object_init failed for (117555200, 0x00000004)
[drm:radeon_gem_object_create] *ERROR* Failed to allocate GEM object (117555200, 4, 4096, -12)
radeon 0000:01:05.0: object_init failed for (117555200, 0x00000004)
[drm:radeon_gem_object_create] *ERROR* Failed to allocate GEM object (117555200, 4, 4096, -12)
radeon 0000:01:05.0: object_init failed for (117555200, 0x00000004)
[drm:radeon_gem_object_create] *ERROR* Failed to allocate GEM object (117555200, 4, 4096, -12)
radeon 0000:01:05.0: object_init failed for (117555200, 0x00000004)
[drm:radeon_gem_object_create] *ERROR* Failed to allocate GEM object (117555200, 4, 4096, -12)
radeon 0000:01:05.0: object_init failed for (117555200, 0x00000004)
...

And the following in the xorg log buffer:

Failed to alloc memory
Failed to allocat:
size: : 117555200 bytes
alignment : 0 bytes
domains : 4
...

--
Markus
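
The figures in the [TTM] dump above can be cross-checked with a little
arithmetic. The following is only a sketch: it assumes 4 KiB pages (which
matches "25650 pages, 102600K"), and the struct and function names are
made up for illustration, not taken from TTM.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_PAGE_SIZE 4096UL  /* assumed page size, not read from the log */

/* Hypothetical record for one line of the range dump. */
struct toy_range {
    unsigned long start;  /* first page of the range */
    unsigned long end;    /* one past the last page */
    int used;             /* 1 = used, 0 = free */
};

/* Ranges copied from the [TTM] dump quoted above. */
static const struct toy_range toy_dump[] = {
    { 0x00000000, 0x00000001, 1 },
    { 0x00000001, 0x00000011, 1 },
    { 0x00000011, 0x00000111, 1 },
    { 0x00000111, 0x00000211, 1 },
    { 0x00000211, 0x00000248, 0 },
    { 0x00000248, 0x0000024c, 1 },
    { 0x0000024c, 0x00001976, 0 },
    { 0x00001976, 0x000021aa, 1 },
    { 0x000021aa, 0x0000285f, 0 },
    { 0x0000285f, 0x00002860, 1 },
    { 0x00002860, 0x00002873, 0 },
    { 0x00002873, 0x000029b3, 1 },
    { 0x000029b3, 0x00020000, 0 },
};
#define TOY_DUMP_LEN (sizeof(toy_dump) / sizeof(toy_dump[0]))

/* Sum the sizes (in pages) of all ranges whose state matches 'used'. */
static unsigned long toy_sum_pages(const struct toy_range *r, size_t n, int used)
{
    unsigned long total = 0;
    for (size_t i = 0; i < n; i++) {
        assert(r[i].end >= r[i].start);
        if (r[i].used == used)
            total += r[i].end - r[i].start;
    }
    return total;
}

/* Size (in pages) of the largest single free range. */
static unsigned long toy_largest_free(const struct toy_range *r, size_t n)
{
    unsigned long best = 0;
    for (size_t i = 0; i < n; i++) {
        unsigned long len = r[i].end - r[i].start;
        if (!r[i].used && len > best)
            best = len;
    }
    return best;
}

/* Round a byte count up to whole pages. */
static unsigned long toy_bytes_to_pages(unsigned long bytes)
{
    return (bytes + TOY_PAGE_SIZE - 1) / TOY_PAGE_SIZE;
}
```

With the ranges as listed, the sums agree with the "total: 131072, used
2954 free 128118" line, the largest free range is 120397 pages, and the
117555200-byte GEM request corresponds to 28700 pages.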

2010-11-09 09:53:25

by Thomas Hellstrom

Subject: Re: Radeon RS780 - BUG: unable to handle kernel NULL pointer dereference

On 11/09/2010 10:29 AM, Markus Trippelsdorf wrote:
> On Mon, Nov 08, 2010 at 11:29:16PM +0100, Thomas Hellstrom wrote:
>
>> On 11/08/2010 09:53 PM, Jerome Glisse wrote:
>>
>>> On Mon, Nov 8, 2010 at 2:02 PM, Markus Trippelsdorf
>>> <[email protected]> wrote:
>>>
>>>> On Mon, Nov 08, 2010 at 07:43:02PM +0100, Markus Trippelsdorf wrote:
>>>>
>>>>> On Mon, Nov 08, 2010 at 06:07:37PM +0100, Markus Trippelsdorf wrote:
>>>>>
>>>>>> On Mon, Nov 08, 2010 at 06:02:21PM +0100, Markus Trippelsdorf wrote:
>>>>>>
>>>>>>> I can trigger a kernel crash on my system by simply loading this png
>>>>>>> image with firefox:
>>>>>>> http://mediaarchive.cern.ch/MediaArchive/Photo/Public/2010/1011251/1011251_01/1011251_01-A4-at-144-dpi.jpg
>>>>>>>
>>>>>> Sorry the above link is wrong, this is the right one (that triggers the
>>>>>> crash):
>>>>>> http://cdsweb.cern.ch/record/1305179/files/HI-150431-630470-huge.png
>>>>>>
>>>>> I triggered it a few more times and took the attached picture.
>>>>> It points to the BUG() call at drivers/gpu/drm/ttm/ttm_bo.c:1628 .
>>>>> (Sorry for the bad picture quality)
>>>>>
>>>> And here the same BUG in plaintext (should be a bit easier to read):
>>>>
>>>> Nov 8 19:28:23 arch kernel: ------------[ cut here ]------------
>>>> Nov 8 19:28:23 arch kernel: kernel BUG at drivers/gpu/drm/ttm/ttm_bo.c:1628!
>>>>
>>>>
>>> Thomas, this bug seems to point to a case where we end up trying to add
>>> an entry at the same offset in the rb tree for addr_space_mm. After
>>> carefully reviewing the locking around the rb tree modification and
>>> addr_space_mm, I am fairly confident that no race can occur. Would you
>>> have any idea what might go wrong here? I guess I would ultimately need
>>> to dump the mm and rb tree state when the BUG triggers to understand
>>> the state of things.
>>>
>> I agree there shouldn't be a race in this case.
>> The locking around these operations is simple and straightforward.
>>
>> So this IMHO should either be a memory corruption or a bug in the
>> range manager. I've never seen this BUG trigger before. Dumping mm /
>> rb tree contents or bisecting should probably find the culprit.
>>
> OK I've found the buggy commit by bisection:
>
> e376573f7267390f4e1bdc552564b6fb913bce76 is the first bad commit
> commit e376573f7267390f4e1bdc552564b6fb913bce76
> Author: Michel Dänzer <[email protected]>
> Date: Thu Jul 8 12:43:28 2010 +1000
>
> drm/radeon: fall back to GTT if bo creation/validation in VRAM fails.
>
> This fixes a problem where on low VRAM cards we'd run out of space for validation.
>
> [airlied: Tested on my M7, Thinkpad T42, compiz works with no problems.]
>
> Signed-off-by: Michel Dänzer <[email protected]>
> Cc: [email protected]
> Signed-off-by: Dave Airlie <[email protected]>
>
> Please note that this is an old commit from 2.6.36-rc. When I revert it the
> kernel no longer crashes. Instead I see the following in my dmesg:
>
>

Hmm, so this sounds like something in the Radeon eviction error path is
causing corruption.
I had a similar problem with vmwgfx, when I tried to unref a BO _after_
ttm_bo_init() failed.
ttm_bo_init() is really supposed to call unref itself for various
reasons, so calling unref() or kfree() after a failed ttm_bo_init()
will cause corruption.

In any case, the error below also suggests something is a bit fragile in
the Radeon driver:

First, an accelerated eviction may fail, like in the message below, but
then there must always be a backup plan, like unaccelerated eviction to
system. On BO creation, there are a number of placement strategies, but
if all else fails, it should be possible to initially place the BO in
system memory.

Second, if BO validation fails during a command submission, due to
insufficient VRAM / TT, then the driver should retry the complete
validation cycle after first blocking all other validators and then
evicting everything not pinned, to avoid failures due to fragmentation.

/Thomas
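
The "backup plan" in the first point can be pictured as an ordered list of
placements that is walked until one fits. This is only a toy model under
stated assumptions: the domain names, the pool struct and toy_place() are
hypothetical and are not the TTM or Radeon API, and real system memory is
not a fixed-size pool.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical memory domains, in no particular order. */
enum toy_domain { TOY_VRAM, TOY_GTT, TOY_SYSTEM, TOY_NUM_DOMAINS };

/* Hypothetical pool: just a count of free pages. */
struct toy_pool {
    unsigned long free_pages;
};

/*
 * Try each domain in the caller's preference order and return the first
 * one with enough room, or -1 if none fits. Putting TOY_SYSTEM last in
 * 'order' models "if all else fails, place the BO in system memory".
 */
static int toy_place(struct toy_pool pools[TOY_NUM_DOMAINS],
                     const enum toy_domain *order, size_t n,
                     unsigned long pages)
{
    assert(order != NULL);
    for (size_t i = 0; i < n; i++) {
        struct toy_pool *p = &pools[order[i]];
        if (p->free_pages >= pages) {
            p->free_pages -= pages;   /* reserve the space */
            return (int)order[i];
        }
    }
    return -1;
}
```

A caller would pass an order such as { TOY_VRAM, TOY_GTT, TOY_SYSTEM } and
only report an allocation failure when toy_place() returns -1.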


> [TTM] Failed to find memory space for buffer 0xffff880113e10e48 eviction.
> [TTM] No space for ffff880113e10e48 (25650 pages, 102600K, 100M)
> [TTM] placement[0]=0x00070002 (1)
> [TTM] has_type: 1
> [TTM] use_type: 1
> [TTM] flags: 0x0000000A
> [TTM] gpu_offset: 0xA0000000
> [TTM] size: 131072
> [TTM] available_caching: 0x00070000
> [TTM] default_caching: 0x00010000
> [TTM] 0x00000000-0x00000001: 1: used
> [TTM] 0x00000001-0x00000011: 16: used
> [TTM] 0x00000011-0x00000111: 256: used
> [TTM] 0x00000111-0x00000211: 256: used
> [TTM] 0x00000211-0x00000248: 55: free
> [TTM] 0x00000248-0x0000024c: 4: used
> [TTM] 0x0000024c-0x00001976: 5930: free
> [TTM] 0x00001976-0x000021aa: 2100: used
> [TTM] 0x000021aa-0x0000285f: 1717: free
> [TTM] 0x0000285f-0x00002860: 1: used
> [TTM] 0x00002860-0x00002873: 19: free
> [TTM] 0x00002873-0x000029b3: 320: used
> [TTM] 0x000029b3-0x00020000: 120397: free
> [TTM] total: 131072, used 2954 free 128118
> [drm:radeon_cs_ioctl] *ERROR* Failed to parse relocation -12!
> radeon 0000:01:05.0: object_init failed for (117555200, 0x00000004)
> [drm:radeon_gem_object_create] *ERROR* Failed to allocate GEM object (117555200, 4, 4096, -12)
> radeon 0000:01:05.0: object_init failed for (117555200, 0x00000004)
> [drm:radeon_gem_object_create] *ERROR* Failed to allocate GEM object (117555200, 4, 4096, -12)
> radeon 0000:01:05.0: object_init failed for (117555200, 0x00000004)
> [drm:radeon_gem_object_create] *ERROR* Failed to allocate GEM object (117555200, 4, 4096, -12)
> radeon 0000:01:05.0: object_init failed for (117555200, 0x00000004)
> [drm:radeon_gem_object_create] *ERROR* Failed to allocate GEM object (117555200, 4, 4096, -12)
> radeon 0000:01:05.0: object_init failed for (117555200, 0x00000004)
> ...
>
> And the following in the xorg log buffer:
>
> Failed to alloc memory
> Failed to allocat:
> size: : 117555200 bytes
> alignment : 0 bytes
> domains : 4
> ...
>
>

2010-11-09 10:07:34

by Thomas Hellstrom

Subject: Re: Radeon RS780 - BUG: unable to handle kernel NULL pointer dereference

On 11/09/2010 10:53 AM, Thomas Hellstrom wrote:
> On 11/09/2010 10:29 AM, Markus Trippelsdorf wrote:
>> On Mon, Nov 08, 2010 at 11:29:16PM +0100, Thomas Hellstrom wrote:
>>> On 11/08/2010 09:53 PM, Jerome Glisse wrote:
>>>> On Mon, Nov 8, 2010 at 2:02 PM, Markus Trippelsdorf
>>>> <[email protected]> wrote:
>>>>> On Mon, Nov 08, 2010 at 07:43:02PM +0100, Markus Trippelsdorf wrote:
>>>>>> On Mon, Nov 08, 2010 at 06:07:37PM +0100, Markus Trippelsdorf wrote:
>>>>>>> On Mon, Nov 08, 2010 at 06:02:21PM +0100, Markus Trippelsdorf
>>>>>>> wrote:
>>>>>>>> I can trigger a kernel crash on my system by simply loading
>>>>>>>> this png
>>>>>>>> image with firefox:
>>>>>>>> http://mediaarchive.cern.ch/MediaArchive/Photo/Public/2010/1011251/1011251_01/1011251_01-A4-at-144-dpi.jpg
>>>>>>>>
>>>>>>> Sorry the above link is wrong, this is the right one (that
>>>>>>> triggers the
>>>>>>> crash):
>>>>>>> http://cdsweb.cern.ch/record/1305179/files/HI-150431-630470-huge.png
>>>>>>>
>>>>>> I triggered it a few more times and took the attached picture.
>>>>>> It points to the BUG() call at drivers/gpu/drm/ttm/ttm_bo.c:1628 .
>>>>>> (Sorry for the bad picture quality)
>>>>> And here the same BUG in plaintext (should be a bit easier to read):
>>>>>
>>>>> Nov 8 19:28:23 arch kernel: ------------[ cut here ]------------
>>>>> Nov 8 19:28:23 arch kernel: kernel BUG at
>>>>> drivers/gpu/drm/ttm/ttm_bo.c:1628!
>>>>>
>>>> Thomas, this bug seems to point to a case where we end up trying to add
>>>> an entry at the same offset in the rb tree for addr_space_mm. After
>>>> carefully reviewing the locking around the rb tree modification and
>>>> addr_space_mm, I am fairly confident that no race can occur. Would you
>>>> have any idea what might go wrong here? I guess I would ultimately need
>>>> to dump the mm and rb tree state when the BUG triggers to understand
>>>> the state of things.
>>> I agree there shouldn't be a race in this case.
>>> The locking around these operations is simple and straightforward.
>>>
>>> So this IMHO should either be a memory corruption or a bug in the
>>> range manager. I've never seen this BUG trigger before. Dumping mm /
>>> rb tree contents or bisecting should probably find the culprit.
>> OK I've found the buggy commit by bisection:
>>
>> e376573f7267390f4e1bdc552564b6fb913bce76 is the first bad commit
>> commit e376573f7267390f4e1bdc552564b6fb913bce76
>> Author: Michel Dänzer <[email protected]>
>> Date: Thu Jul 8 12:43:28 2010 +1000
>>
>> drm/radeon: fall back to GTT if bo creation/validation in VRAM
>> fails.
>>
>> This fixes a problem where on low VRAM cards we'd run out of
>> space for validation.
>>
>> [airlied: Tested on my M7, Thinkpad T42, compiz works with no
>> problems.]
>>
>> Signed-off-by: Michel Dänzer <[email protected]>
>> Cc: [email protected]
>> Signed-off-by: Dave Airlie <[email protected]>
>>
>> Please note that this is an old commit from 2.6.36-rc. When I revert
>> it the
>> kernel no longer crashes. Instead I see the following in my dmesg:
>>
>
> Hmm, so this sounds like something in the Radeon eviction error path
> is causing corruption.
> I had a similar problem with vmwgfx, when I tried to unref a BO
> _after_ ttm_bo_init() failed.
> ttm_bo_init() is really supposed to call unref itself for various
> reasons, so calling unref() or kfree() after a failed ttm_bo_init()
> will cause corruption.
>
> In any case, the error below also suggests something is a bit fragile
> in the Radeon driver:
>
> First, an accelerated eviction may fail, like in the message below,
> but then there must always be a backup plan, like unaccelerated
> eviction to system. On BO creation, there are a number of placement
> strategies, but if all else fails, it should be possible to initially
> place the BO in system memory.
>
> Second, if BO validation fails during a command submission, due to
> insufficient VRAM / TT, then the driver should retry the complete
> validation cycle after first blocking all other validators and then
> evicting everything not pinned, to avoid failures due to fragmentation.
>
> /Thomas
>

Indeed, it seems like the commit you mention just retries ttm_bo_init()
after it previously failed. At that point the bo has been destroyed, so
that is probably what's causing the BUG you are seeing.

Admittedly, the fact that ttm_bo_init() calls unref on failure is not
properly documented in the function description. The reason for doing so
is to have a single path for freeing all BO resources already allocated
at the point of failure.

/Thomas
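
The ownership rule described above (the init function releases the object
itself when it fails) can be sketched in plain C. This is a toy model, not
the TTM code: struct toy_bo, toy_bo_init() and friends are hypothetical
names. It only illustrates why a retry has to start from a freshly
allocated object rather than reuse the pointer passed to the failed call.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for a buffer object; not struct ttm_buffer_object. */
struct toy_bo {
    int refcount;
    void (*destroy)(struct toy_bo *bo);
};

static int toy_destroy_calls;  /* counts destroy callbacks, for illustration */

static void toy_bo_destroy(struct toy_bo *bo)
{
    toy_destroy_calls++;
    free(bo);
}

static void toy_bo_unref(struct toy_bo *bo)
{
    assert(bo->refcount > 0);
    if (--bo->refcount == 0)
        bo->destroy(bo);
}

/*
 * Mirrors the convention from the mail: on failure the init function
 * drops the reference itself, so the object is gone when it returns.
 */
static int toy_bo_init(struct toy_bo *bo, bool placement_ok)
{
    bo->refcount = 1;
    bo->destroy = toy_bo_destroy;
    if (!placement_ok) {
        toy_bo_unref(bo);  /* single cleanup path; bo is freed here */
        return -12;        /* -ENOMEM, as seen in the logs */
    }
    return 0;
}

/* Safe fallback: allocate a fresh object for every attempt. */
static struct toy_bo *toy_bo_create(bool vram_ok, bool gtt_ok)
{
    bool attempts[2] = { vram_ok, gtt_ok };

    for (int i = 0; i < 2; i++) {
        struct toy_bo *bo = malloc(sizeof(*bo));
        if (!bo)
            return NULL;
        if (toy_bo_init(bo, attempts[i]) == 0)
            return bo;
        /* bo must not be touched or freed here: init already released it */
    }
    return NULL;
}
```

Calling toy_bo_init() a second time on the same pointer after a failure,
or calling free() on it, would operate on memory that the first call has
already released, which is the pattern the mail warns about.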

>
>> [TTM] Failed to find memory space for buffer 0xffff880113e10e48
>> eviction.
>> [TTM] No space for ffff880113e10e48 (25650 pages, 102600K, 100M)
>> [TTM] placement[0]=0x00070002 (1)
>> [TTM] has_type: 1
>> [TTM] use_type: 1
>> [TTM] flags: 0x0000000A
>> [TTM] gpu_offset: 0xA0000000
>> [TTM] size: 131072
>> [TTM] available_caching: 0x00070000
>> [TTM] default_caching: 0x00010000
>> [TTM] 0x00000000-0x00000001: 1: used
>> [TTM] 0x00000001-0x00000011: 16: used
>> [TTM] 0x00000011-0x00000111: 256: used
>> [TTM] 0x00000111-0x00000211: 256: used
>> [TTM] 0x00000211-0x00000248: 55: free
>> [TTM] 0x00000248-0x0000024c: 4: used
>> [TTM] 0x0000024c-0x00001976: 5930: free
>> [TTM] 0x00001976-0x000021aa: 2100: used
>> [TTM] 0x000021aa-0x0000285f: 1717: free
>> [TTM] 0x0000285f-0x00002860: 1: used
>> [TTM] 0x00002860-0x00002873: 19: free
>> [TTM] 0x00002873-0x000029b3: 320: used
>> [TTM] 0x000029b3-0x00020000: 120397: free
>> [TTM] total: 131072, used 2954 free 128118
>> [drm:radeon_cs_ioctl] *ERROR* Failed to parse relocation -12!
>> radeon 0000:01:05.0: object_init failed for (117555200, 0x00000004)
>> [drm:radeon_gem_object_create] *ERROR* Failed to allocate GEM object
>> (117555200, 4, 4096, -12)
>> radeon 0000:01:05.0: object_init failed for (117555200, 0x00000004)
>> [drm:radeon_gem_object_create] *ERROR* Failed to allocate GEM object
>> (117555200, 4, 4096, -12)
>> radeon 0000:01:05.0: object_init failed for (117555200, 0x00000004)
>> [drm:radeon_gem_object_create] *ERROR* Failed to allocate GEM object
>> (117555200, 4, 4096, -12)
>> radeon 0000:01:05.0: object_init failed for (117555200, 0x00000004)
>> [drm:radeon_gem_object_create] *ERROR* Failed to allocate GEM object
>> (117555200, 4, 4096, -12)
>> radeon 0000:01:05.0: object_init failed for (117555200, 0x00000004)
>> ...
>>
>> And the following in the xorg log buffer:
>>
>> Failed to alloc memory
>> Failed to allocat:
>> size: : 117555200 bytes
>> alignment : 0 bytes
>> domains : 4
>> ...
>>
>

2010-11-09 10:37:43

by Markus Trippelsdorf

Subject: Re: Radeon RS780 - BUG: unable to handle kernel NULL pointer dereference

On Tue, Nov 09, 2010 at 11:32:57AM +0100, Michel Dänzer wrote:
> On Tue, 2010-11-09 at 11:07 +0100, Thomas Hellstrom wrote:
> > On 11/09/2010 10:53 AM, Thomas Hellstrom wrote:
> > > On 11/09/2010 10:29 AM, Markus Trippelsdorf wrote:
> > >> OK I've found the buggy commit by bisection:
> > >>
> > >> e376573f7267390f4e1bdc552564b6fb913bce76 is the first bad commit
> > >> commit e376573f7267390f4e1bdc552564b6fb913bce76
> > >> Author: Michel Dänzer<[email protected]>
> > >> Date: Thu Jul 8 12:43:28 2010 +1000
> > >>
> > >> drm/radeon: fall back to GTT if bo creation/validation in VRAM
> > >> fails.
> > >>
> > >> This fixes a problem where on low VRAM cards we'd run out of
> > >> space for validation.
> > >>
> > >> [airlied: Tested on my M7, Thinkpad T42, compiz works with no
> > >> problems.]
> > >>
> > >> Signed-off-by: Michel Dänzer<[email protected]>
> > >> Cc: [email protected]
> > >> Signed-off-by: Dave Airlie<[email protected]>
> > >>
> > >> Please note that this is an old commit from 2.6.36-rc. When I revert
> > >> it the
> > >> kernel no longer crashes. Instead I see the following in my dmesg:
> > >>
> > >
> > > Hmm, so this sounds like something in the Radeon eviction error path
> > > is causing corruption.
> > > I had a similar problem with vmwgfx, when I tried to unref a BO
> > > _after_ ttm_bo_init() failed.
> > > ttm_bo_init() is really supposed to call unref itself for various
> > > reasons, so calling unref() or kfree() after a failed ttm_bo_init()
> > > will cause corruption.
> > >
> > > In any case, the error below also suggests something is a bit fragile
> > > in the Radeon driver:
> > >
> > > First, an accelerated eviction may fail, like in the message below,
> > > but then there must always be a backup plan, like unaccelerated
> > > eviction to system. On BO creation, there are a number of placement
> > > strategies, but if all else fails, it should be possible to initially
> > > place the BO in system memory.
> > >
> > > Second, if bo validation fails during a command submission, due to
> > > insufficient VRAM / TT, then the driver should retry the complete
> > > validation cycle after first blocking all other validators and then
> > > evicting everything not pinned, to avoid failures due to fragmentation.
> > >
> > > /Thomas
> > >
> >
> > Indeed, it seems like the commit you mention just retries ttm_bo_init()
> > after it previously failed. At that point the bo has been destroyed, so
> > that is probably what's causing the BUG you are seeing.
> >
> > Admittedly, ttm_bo_init() calling unref on failure is not properly
> > documented in the function description. The reason for doing so is to
> > have a single path for freeing all BO resources already allocated at the
> > point of failure.
>
> Does the patch below fix the problem?

Yes, indeed. I was just about to send the same patch to the list.

Thanks.
--
Markus

2010-11-09 10:41:13

by Michel Dänzer

Subject: Re: Radeon RS780 - BUG: unable to handle kernel NULL pointer dereference

On Tue, 2010-11-09 at 11:07 +0100, Thomas Hellstrom wrote:
> On 11/09/2010 10:53 AM, Thomas Hellstrom wrote:
> > On 11/09/2010 10:29 AM, Markus Trippelsdorf wrote:
> >> OK I've found the buggy commit by bisection:
> >>
> >> e376573f7267390f4e1bdc552564b6fb913bce76 is the first bad commit
> >> commit e376573f7267390f4e1bdc552564b6fb913bce76
> >> Author: Michel Dänzer<[email protected]>
> >> Date: Thu Jul 8 12:43:28 2010 +1000
> >>
> >> drm/radeon: fall back to GTT if bo creation/validation in VRAM
> >> fails.
> >>
> >> This fixes a problem where on low VRAM cards we'd run out of
> >> space for validation.
> >>
> >> [airlied: Tested on my M7, Thinkpad T42, compiz works with no
> >> problems.]
> >>
> >> Signed-off-by: Michel Dänzer<[email protected]>
> >> Cc: [email protected]
> >> Signed-off-by: Dave Airlie<[email protected]>
> >>
> >> Please note that this is an old commit from 2.6.36-rc. When I revert
> >> it the
> >> kernel no longer crashes. Instead I see the following in my dmesg:
> >>
> >
> > Hmm, so this sounds like something in the Radeon eviction error path
> > is causing corruption.
> > I had a similar problem with vmwgfx, when I tried to unref a BO
> > _after_ ttm_bo_init() failed.
> > ttm_bo_init() is really supposed to call unref itself for various
> > reasons, so calling unref() or kfree() after a failed ttm_bo_init()
> > will cause corruption.
> >
> > In any case, the error below also suggests something is a bit fragile
> > in the Radeon driver:
> >
> > First, an accelerated eviction may fail, like in the message below,
> > but then there must always be a backup plan, like unaccelerated
> > eviction to system. On BO creation, there are a number of placement
> > strategies, but if all else fails, it should be possible to initially
> > place the BO in system memory.
> >
> > Second, if bo validation fails during a command submission, due to
> > insufficient VRAM / TT, then the driver should retry the complete
> > validation cycle after first blocking all other validators and then
> > evicting everything not pinned, to avoid failures due to fragmentation.
> >
> > /Thomas
> >
>
> Indeed, it seems like the commit you mention just retries ttm_bo_init()
> after it previously failed. At that point the bo has been destroyed, so
> that is probably what's causing the BUG you are seeing.
>
> Admittedly, ttm_bo_init() calling unref on failure is not properly
> documented in the function description. The reason for doing so is to
> have a single path for freeing all BO resources already allocated at the
> point of failure.

Does the patch below fix the problem?


commit e224472eedbda391ddb6d8b88f26e82e1c3b036b
Author: Michel Dänzer <[email protected]>
Date: Tue Nov 9 11:30:41 2010 +0100

drm/radeon/kms: Fix retrying ttm_bo_init() after it failed once.

If ttm_bo_init() returns failure, it already destroyed the BO, so we need to
retry from scratch.

Signed-off-by: Michel Dänzer <[email protected]>
Cc: [email protected]

diff --git a/drivers/gpu/drm/radeon/radeon_object.c b/drivers/gpu/drm/radeon/radeon_object.c
index 1b9004e..bbe92d5 100644
--- a/drivers/gpu/drm/radeon/radeon_object.c
+++ b/drivers/gpu/drm/radeon/radeon_object.c
@@ -102,6 +102,8 @@ int radeon_bo_create(struct radeon_device *rdev, struct drm_gem_object *gobj,
 		type = ttm_bo_type_device;
 	}
 	*bo_ptr = NULL;
+
+retry:
 	bo = kzalloc(sizeof(struct radeon_bo), GFP_KERNEL);
 	if (bo == NULL)
 		return -ENOMEM;
@@ -109,8 +111,6 @@ int radeon_bo_create(struct radeon_device *rdev, struct drm_gem_object *gobj,
 	bo->gobj = gobj;
 	bo->surface_reg = -1;
 	INIT_LIST_HEAD(&bo->list);
-
-retry:
 	radeon_ttm_placement_from_domain(bo, domain);
 	/* Kernel allocation are uninterruptible */
 	mutex_lock(&rdev->vram_mutex);


--
Earthling Michel Dänzer | http://www.vmware.com
Libre software enthusiast | Debian, X and DRI developer

2010-11-09 10:52:52

by Michel Dänzer

Subject: Re: Radeon RS780 - BUG: unable to handle kernel NULL pointer dereference

On Tue, 2010-11-09 at 11:37 +0100, Markus Trippelsdorf wrote:
> On Tue, Nov 09, 2010 at 11:32:57AM +0100, Michel Dänzer wrote:
> > On Tue, 2010-11-09 at 11:07 +0100, Thomas Hellstrom wrote:
> > > On 11/09/2010 10:53 AM, Thomas Hellstrom wrote:
> > > > On 11/09/2010 10:29 AM, Markus Trippelsdorf wrote:
> > > >> OK I've found the buggy commit by bisection:
> > > >>
> > > >> e376573f7267390f4e1bdc552564b6fb913bce76 is the first bad commit
> > > >> commit e376573f7267390f4e1bdc552564b6fb913bce76
> > > >> Author: Michel Dänzer<[email protected]>
> > > >> Date: Thu Jul 8 12:43:28 2010 +1000
> > > >>
> > > >> drm/radeon: fall back to GTT if bo creation/validation in VRAM
> > > >> fails.
> > > >>
> > > >> This fixes a problem where on low VRAM cards we'd run out of
> > > >> space for validation.
> > > >>
> > > >> [airlied: Tested on my M7, Thinkpad T42, compiz works with no
> > > >> problems.]
> > > >>
> > > >> Signed-off-by: Michel Dänzer<[email protected]>
> > > >> Cc: [email protected]
> > > >> Signed-off-by: Dave Airlie<[email protected]>
> > > >>
> > > >> Please note that this is an old commit from 2.6.36-rc. When I revert
> > > >> it the
> > > >> kernel no longer crashes. Instead I see the following in my dmesg:
> > > >>
> > > >
> > > > Hmm, so this sounds like something in the Radeon eviction error path
> > > > is causing corruption.
> > > > I had a similar problem with vmwgfx, when I tried to unref a BO
> > > > _after_ ttm_bo_init() failed.
> > > > ttm_bo_init() is really supposed to call unref itself for various
> > > > reasons, so calling unref() or kfree() after a failed ttm_bo_init()
> > > > will cause corruption.
> > > >
> > > > In any case, the error below also suggests something is a bit fragile
> > > > in the Radeon driver:
> > > >
> > > > First, an accelerated eviction may fail, like in the message below,
> > > > but then there must always be a backup plan, like unaccelerated
> > > > eviction to system. On BO creation, there are a number of placement
> > > > strategies, but if all else fails, it should be possible to initially
> > > > place the BO in system memory.
> > > >
> > > > Second, if bo validation fails during a command submission, due to
> > > > insufficient VRAM / TT, then the driver should retry the complete
> > > > validation cycle after first blocking all other validators and then
> > > > evicting everything not pinned, to avoid failures due to fragmentation.
> > > >
> > > > /Thomas
> > > >
> > >
> > > Indeed, it seems like the commit you mention just retries ttm_bo_init()
> > > after it previously failed. At that point the bo has been destroyed, so
> > > that is probably what's causing the BUG you are seeing.
> > >
> > > Admittedly, ttm_bo_init() calling unref on failure is not properly
> > > documented in the function description. The reason for doing so is to
> > > have a single path for freeing all BO resources already allocated at the
> > > point of failure.
> >
> > Does the patch below fix the problem?
>
> Yes, indeed. I was just about to send the same patch to the list.
>
> Thanks.

Thank you for testing / confirming the fix, and to Thomas for the
analysis of the problem.

I've submitted the fix to Dave with your Tested-by: added.


--
Earthling Michel Dänzer | http://www.vmware.com
Libre software enthusiast | Debian, X and DRI developer