2005-11-14 23:51:32

by Badari Pulavarty

Subject: 2.6.14 X spinning in the kernel

Hi,

My 2-cpu EM64T machine started showing this problem again on 2.6.14.
On some reboots, X seems to spin in the kernel forever.

sysrq-t output shows nothing.

X R running task 0 3607 3589 3903
(L-TLB)

top shows:
3607 root 25 0 0 0 0 R 99.1 0.0 262:04.69 X


So, I wrote a module to do smp_call_function() on all CPUs
to show stacks on them. CPU0 seems to be spinning in exit_mmap().
I did this several times to collect a few stack samples.

Is this a known issue ?

Thanks,
Badari

1st time:
---------
CPU1:

Call Trace:<ffffffff880ed02b>{:mod:showacpu+43}
<ffffffff880ed04b>{:mod:init_mod+11}
<ffffffff80154162>{sys_init_module+306}
<ffffffff8010dc26>{system_call+126}

CPU0:

Call Trace: <IRQ> <ffffffff880ed02b>{:mod:showacpu+43}
<ffffffff80119399>{smp_call_function_interrupt+73}
<ffffffff8010e8f0>{call_function_interrupt+132} <EOI>
<ffffffff8016eb1a>{unmap_vmas+1114}
<ffffffff8016ec1c>{unmap_vmas+1372} <ffffffff801749f6>{exit_mmap+166}
<ffffffff80134014>{mmput+52} <ffffffff8018f2ca>{flush_old_exec+2474}
<ffffffff801833d5>{vfs_read+341}
<ffffffff801b4933>{load_elf_binary+1507}
<ffffffff80162131>{buffered_rmqueue+529}
<ffffffff8017c99b>{alloc_page_interleave+59}
<ffffffff8018e284>{copy_strings+516}
<ffffffff801b4350>{load_elf_binary+0}
<ffffffff8018f8f9>{search_binary_handler+201}
<ffffffff8018fc5f>{do_execve+415}
<ffffffff8010dc26>{system_call+126} <ffffffff8010c6e4>{sys_execve+68}
<ffffffff8010e046>{stub_execve+106}


2nd time:
----------

CPU1:

Call Trace:<ffffffff880ed02b>{:mod:showacpu+43}
<ffffffff880ed04b>{:mod:init_mod+11}
<ffffffff80154162>{sys_init_module+306}
<ffffffff8010dc26>{system_call+126}

CPU0:

Call Trace: <IRQ> <ffffffff880ed02b>{:mod:showacpu+43}
<ffffffff80119399>{smp_call_function_interrupt+73}
<ffffffff8010e8f0>{call_function_interrupt+132} <EOI>
<ffffffff8017245f>{remove_vm_struct+63}
<ffffffff80172453>{remove_vm_struct+51}
<ffffffff80174ac7>{exit_mmap+375}
<ffffffff80134014>{mmput+52} <ffffffff8018f2ca>{flush_old_exec+2474}
<ffffffff801833d5>{vfs_read+341}
<ffffffff801b4933>{load_elf_binary+1507}
<ffffffff80162131>{buffered_rmqueue+529}
<ffffffff8017c99b>{alloc_page_interleave+59}
<ffffffff8018e284>{copy_strings+516}
<ffffffff801b4350>{load_elf_binary+0}
<ffffffff8018f8f9>{search_binary_handler+201}
<ffffffff8018fc5f>{do_execve+415}
<ffffffff8010dc26>{system_call+126} <ffffffff8010c6e4>{sys_execve+68}
<ffffffff8010e046>{stub_execve+106}


3rd time:
---------
CPU1:

Call Trace:<ffffffff880ed02b>{:mod:showacpu+43}
<ffffffff880ed04b>{:mod:init_mod+11}
<ffffffff80154162>{sys_init_module+306}
<ffffffff8010dc26>{system_call+126}

CPU0:

Call Trace: <IRQ> <ffffffff880ed02b>{:mod:showacpu+43}
<ffffffff80119399>{smp_call_function_interrupt+73}
<ffffffff8010e8f0>{call_function_interrupt+132} <EOI>
<ffffffff801618b4>{__mod_page_state+36}
<ffffffff80161e09>{free_hot_cold_page+41}
<ffffffff80161ef5>{__pagevec_free+37}
<ffffffff801699df>{release_pages+367}
<ffffffff80178c0b>{free_pages_and_swap_cache+123}
<ffffffff80174a62>{exit_mmap+274} <ffffffff80134014>{mmput+52}
<ffffffff8018f2ca>{flush_old_exec+2474}
<ffffffff801833d5>{vfs_read+341}
<ffffffff801b4933>{load_elf_binary+1507}
<ffffffff80162131>{buffered_rmqueue+529}
<ffffffff8017c99b>{alloc_page_interleave+59}
<ffffffff8018e284>{copy_strings+516}
<ffffffff801b4350>{load_elf_binary+0}
<ffffffff8018f8f9>{search_binary_handler+201}
<ffffffff8018fc5f>{do_execve+415} <ffffffff8010dc26>{system_call+126}
<ffffffff8010c6e4>{sys_execve+68} <ffffffff8010e046>{stub_execve+106}





2005-11-15 00:16:48

by Andrew Morton

Subject: Re: 2.6.14 X spinning in the kernel

Badari Pulavarty <[email protected]> wrote:
>
> My 2-cpu EM64T machine started showing this problem again on 2.6.14.
> On some reboots, X seems to spin in the kernel forever.
>
> sysrq-t output shows nothing.
>
> X R running task 0 3607 3589 3903
> (L-TLB)
>
> top shows:
> 3607 root 25 0 0 0 0 R 99.1 0.0 262:04.69 X
>
>
> So, I wrote a module to do smp_call_function() on all CPUs
> to show stacks on them. CPU0 seems to be spinning in exit_mmap().
> I did this several times to collect a few stack samples.
>
> Is this a known issue ?

Nope. Maybe your vma list has a loop in it, in remove_vma()? slab
debugging would detect that, due to the repeated
kmem_cache_free(vm_area_cachep, vma);

2005-11-15 00:52:48

by Badari Pulavarty

Subject: Re: 2.6.14 X spinning in the kernel

On Mon, 2005-11-14 at 16:17 -0800, Andrew Morton wrote:
> Badari Pulavarty <[email protected]> wrote:
> >
> > My 2-cpu EM64T machine started showing this problem again on 2.6.14.
> > On some reboots, X seems to spin in the kernel forever.
> >
> > sysrq-t output shows nothing.
> >
> > X R running task 0 3607 3589 3903
> > (L-TLB)
> >
> > top shows:
> > 3607 root 25 0 0 0 0 R 99.1 0.0 262:04.69 X
> >
> >
> > So, I wrote a module to do smp_call_function() on all CPUs
> > to show stacks on them. CPU0 seems to be spinning in exit_mmap().
> > I did this several times to collect a few stack samples.
> >
> > Is this a known issue ?
>
> Nope. Maybe your vma list has a loop in it, in remove_vma()? slab
> debugging would detect that, due to the repeated
> kmem_cache_free(vm_area_cachep, vma);

I compiled the kernel with slab debug and rebooted the machine.
X seems to be spinning again. But this time, it shows completely
different routines (and seems to be switching CPUs) :(
Something weird is happening on my machine..

top:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
3600 root 25 0 0 0 0 R 99.9 0.0 8:29.18 X


CPU0:
ffffffff8053c750 0000000000000000 0000000000000000 0000000000000000
ffff81011c451000 ffffffff8053c788 ffffffff8026de8f ffffffff8053c7a8
ffffffff80119591 ffffffff8053c7a8
Call Trace: <IRQ> <ffffffff8026de8f>{showacpu+47}
<ffffffff80119591>{smp_call_function_interrupt+81}
<ffffffff8010e968>{call_function_interrupt+132} <EOI>
<ffffffff880fc60a>{:radeon:radeon_freelist_get+122}
<ffffffff880fc69d>{:radeon:radeon_freelist_get+269}
<ffffffff880fc81a>{:radeon:radeon_cp_buffers+314}
<ffffffff880fc6e0>{:radeon:radeon_cp_buffers+0}
<ffffffff80278b32>{drm_ioctl+386} <ffffffff80199f9d>{do_ioctl+125}
<ffffffff8019a272>{vfs_ioctl+690} <ffffffff8019a2fa>{sys_ioctl+106}
<ffffffff8010dc9e>{system_call+126}



again,

CPU1:
ffff8100d7f2bf50 0000000000000000 0000000000000000 0000000000000000
ffff81011c451000 ffff8100d7f2bf88 ffffffff8026de8f ffff8100d7f2bfa8
ffffffff80119591 ffff8100d7f2bfa8
Call Trace: <IRQ> <ffffffff8026de8f>{showacpu+47}
<ffffffff80119591>{smp_call_function_interrupt+81}
<ffffffff8010e968>{call_function_interrupt+132} <EOI>
<ffffffff8021363a>{__delay+10}
<ffffffff8021367a>{__const_udelay+42}
<ffffffff880fc69d>{:radeon:radeon_freelist_get+269}
<ffffffff880fc81a>{:radeon:radeon_cp_buffers+314}
<ffffffff880fc6e0>{:radeon:radeon_cp_buffers+0}
<ffffffff80278b32>{drm_ioctl+386} <ffffffff80199f9d>{do_ioctl+125}
<ffffffff8019a272>{vfs_ioctl+690} <ffffffff8019a2fa>{sys_ioctl+106}
<ffffffff8010dc9e>{system_call+126}


Then I tried killing it and ran into..

CPU0:
ffffffff8053c750 0000000000000000 00000000000018ff ffff81011c9a4230
ffff81011c9a4000 ffffffff8053c788 ffffffff8026de8f ffffffff8053c7a8
ffffffff80119591 ffffffff8053c7a8
Call Trace: <IRQ> <ffffffff8026de8f>{showacpu+47}
<ffffffff80119591>{smp_call_function_interrupt+81}
<ffffffff8010e968>{call_function_interrupt+132} <EOI>
<ffffffff880fa225>{:radeon:radeon_do_wait_for_idle+117}
<ffffffff880fa236>{:radeon:radeon_do_wait_for_idle+134}
<ffffffff880fa590>{:radeon:radeon_do_cp_idle+336}
<ffffffff880fc215>{:radeon:radeon_do_release+85}
<ffffffff88104369>{:radeon:radeon_driver_pretakedown+9}
<ffffffff802783aa>{drm_takedown+74}
<ffffffff80279733>{drm_release+1267}
<ffffffff801a0d01>{destroy_inode+81}
<ffffffff801a26a1>{generic_delete_inode+337}
<ffffffff8019e9f6>{d_free+54} <ffffffff8018655a>{__fput+202}
<ffffffff80186654>{fput+20} <ffffffff80184b8e>{filp_close+110}
<ffffffff80138f62>{put_files_struct+130}
<ffffffff80139945>{do_exit+549}
<ffffffff8012f85d>{try_to_wake_up+1085}
<ffffffff80141682>{recalc_sigpending+18}
<ffffffff80141dc5>{__dequeue_signal+501}
<ffffffff8013a42d>{do_group_exit+237}
<ffffffff80143f6b>{get_signal_to_deliver+1419}
<ffffffff8010d09d>{do_signal+125}
<ffffffff803b84a9>{__up+25}
<ffffffff803bad7b>{.text.lock.kernel_lock+32}
<ffffffff80199fa5>{do_ioctl+133} <ffffffff8019a272>{vfs_ioctl+690}
<ffffffff8010dd27>{sysret_signal+28}
<ffffffff8010d800>{do_notify_resume+48}
<ffffffff8010e00f>{ptregscall_common+103}


Thanks,
Badari

2005-11-15 01:30:33

by Andrew Morton

Subject: Re: 2.6.14 X spinning in the kernel

Badari Pulavarty <[email protected]> wrote:
>
> On Mon, 2005-11-14 at 16:17 -0800, Andrew Morton wrote:
> > Badari Pulavarty <[email protected]> wrote:
> > >
> > > My 2-cpu EM64T machine started showing this problem again on 2.6.14.
> > > On some reboots, X seems to spin in the kernel forever.
> > >
> > > sysrq-t output shows nothing.
> > >
> > > X R running task 0 3607 3589 3903
> > > (L-TLB)
> > >
> > > top shows:
> > > 3607 root 25 0 0 0 0 R 99.1 0.0 262:04.69 X
> > >
> > >
> > > So, I wrote a module to do smp_call_function() on all CPUs
> > > to show stacks on them. CPU0 seems to be spinning in exit_mmap().
> > > I did this several times to collect a few stack samples.
> > >
> > > Is this a known issue ?
> >
> > Nope. Maybe your vma list has a loop in it, in remove_vma()? slab
> > debugging would detect that, due to the repeated
> > kmem_cache_free(vm_area_cachep, vma);
>
> I compiled the kernel with slab debug and rebooted the machine.
> X seems to be spinning again. But this time, it shows completely
> different routines (and seems to be switching CPUs) :(
> Something weird is happening on my machine..
>
> top:
> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
> 3600 root 25 0 0 0 0 R 99.9 0.0 8:29.18 X
>
>
> ...
>
> Then I tried killing it and ran into..
>
> CPU0:
> ffffffff8053c750 0000000000000000 00000000000018ff ffff81011c9a4230
> ffff81011c9a4000 ffffffff8053c788 ffffffff8026de8f
> ffffffff8053c7a8
> ffffffff80119591 ffffffff8053c7a8
> Call Trace: <IRQ> <ffffffff8026de8f>{showacpu+47}
> <ffffffff80119591>{smp_call_function_interrupt+81}
> <ffffffff8010e968>{call_function_interrupt+132} <EOI>
> <ffffffff880fa225>{:radeon:radeon_do_wait_for_idle+117}
> <ffffffff880fa236>{:radeon:radeon_do_wait_for_idle+134}
> <ffffffff880fa590>{:radeon:radeon_do_cp_idle+336}
> <ffffffff880fc215>{:radeon:radeon_do_release+85}
> <ffffffff88104369>{:radeon:radeon_driver_pretakedown+9}
> <ffffffff802783aa>{drm_takedown+74}

ah-hah. We've had machines stuck in radeon_do_wait_for_idle() before. In
fact, my workstation was doing it a year or two back.

Are you able to identify the most recent kernel which didn't do this?

David, is there a common cause for this? ISTR that it's a semi-FAQ.

2005-11-15 02:49:10

by Dave Airlie

Subject: Re: 2.6.14 X spinning in the kernel


>
> ah-hah. We've had machines stuck in radeon_do_wait_for_idle() before. In
> fact, my workstation was doing it a year or two back.
>
> Are you able to identify the most recent kernel which didn't do this?
>
> David, is there a common cause for this? ISTR that it's a semi-FAQ.

Yes invariably the GPU has crashed and isn't responding to anything.
Unfortunately radeons have a lot of reasons for crashing, most of them very
unrelated to anything like reality... we normally try to approach them
on a case-by-case basis, as some can be solved easily and some not so...

Also, what X was doing etc. at the time is invaluable info..

Dave.


--
David Airlie, Software Engineer
http://www.skynet.ie/~airlied / airlied at skynet.ie
Linux kernel - DRI, VAX / pam_smb / ILUG

2005-11-15 02:58:20

by Andrew Morton

Subject: Re: 2.6.14 X spinning in the kernel

Dave Airlie <[email protected]> wrote:
>
>
> >
> > ah-hah. We've had machines stuck in radeon_do_wait_for_idle() before. In
> > fact, my workstation was doing it a year or two back.
> >
> > Are you able to identify the most recent kernel which didn't do this?
> >
> > David, is there a common cause for this? ISTR that it's a semi-FAQ.
>
> Yes invariably the GPU has crashed and isn't responding to anything.

But radeon_do_wait_for_idle() and radeon_do_wait_for_fifo() have timeouts.
Should Badari have waited longer?

>
> Also, what X was doing etc. at the time is invaluable info..
>

And whether a particular kernel version introduced this behaviour.

2005-11-15 03:02:10

by Dave Airlie

Subject: Re: 2.6.14 X spinning in the kernel


> > Yes invariably the GPU has crashed and isn't responding to anything.
>
> But radeon_do_wait_for_idle() and radeon_do_wait_for_fifo() have timeouts.
> Should Badari have waited longer?

They time out, and X usually goes straight back in there again. I can't
remember the codepath exactly at this stage, but it just goes round and
round until you kick the machine..

In theory X should probably deal with the situation better... I think it
might be able to at least die gracefully or reset the chip...

> > Also, what X was doing etc. at the time is invaluable info..
> >
>
> And whether a particular kernel version introduced this behaviour.

Yes, usually if a kernel introduced it it is because I've done something
really dumb (shouldn't happen too often and with radeons we usually catch
that before stable releases), or because the X server wasn't using DRI
before due to a too old DRM and suddenly the new DRM appears in the kernel
and it uses it ...

There is one known issue with some later version of X on radeons crashing
on PCI GART setups, benh was cooking a patch for X, it isn't something we
can fix in the kernel..

Dave.

--
David Airlie, Software Engineer
http://www.skynet.ie/~airlied / airlied at skynet.ie
Linux kernel - DRI, VAX / pam_smb / ILUG
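
The timeout-then-retry behaviour described above can be modelled in userspace. The following is only a sketch under stated assumptions, not the radeon driver's code: `fake_gpu`, `gpu_is_idle()`, `wait_for_idle()` and `wait_with_retry_cap()` are hypothetical names invented for illustration. It shows that a bounded poll does return an error on a hung device, and that whether the overall wait ever ends depends on the caller capping its retries.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for the GPU: when hung, it never reports idle. */
struct fake_gpu {
    bool hung;
    unsigned long polls;   /* number of status reads performed so far */
};

static bool gpu_is_idle(struct fake_gpu *gpu)
{
    gpu->polls++;
    return !gpu->hung;
}

/* Bounded poll: gives up with -EBUSY after timeout_iters status reads. */
static int wait_for_idle(struct fake_gpu *gpu, unsigned int timeout_iters)
{
    for (unsigned int i = 0; i < timeout_iters; i++) {
        if (gpu_is_idle(gpu))
            return 0;
        /* a real driver would delay for about a microsecond here */
    }
    return -EBUSY;
}

/*
 * Caller that retries after a timeout but caps the number of attempts.
 * Without the cap, a caller that simply re-enters the wait on every
 * failure spins for as long as the device stays hung.
 */
static int wait_with_retry_cap(struct fake_gpu *gpu,
                               unsigned int timeout_iters,
                               unsigned int max_attempts)
{
    for (unsigned int attempt = 0; attempt < max_attempts; attempt++) {
        if (wait_for_idle(gpu, timeout_iters) == 0)
            return 0;
    }
    return -EBUSY;
}
```

With a hung `fake_gpu`, a timeout of 10000 iterations and a cap of 3 attempts, the model performs 30000 status reads and then reports -EBUSY, at which point the caller can give up or attempt a reset.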

2005-11-15 03:11:32

by Dave Jones

Subject: Re: 2.6.14 X spinning in the kernel

On Mon, Nov 14, 2005 at 05:30:37PM -0800, Andrew Morton wrote:

> > CPU0:
> > ffffffff8053c750 0000000000000000 00000000000018ff ffff81011c9a4230
> > ffff81011c9a4000 ffffffff8053c788 ffffffff8026de8f
> > ffffffff8053c7a8
> > ffffffff80119591 ffffffff8053c7a8
> > Call Trace: <IRQ> <ffffffff8026de8f>{showacpu+47}
> > <ffffffff80119591>{smp_call_function_interrupt+81}
> > <ffffffff8010e968>{call_function_interrupt+132} <EOI>
> > <ffffffff880fa225>{:radeon:radeon_do_wait_for_idle+117}
> > <ffffffff880fa236>{:radeon:radeon_do_wait_for_idle+134}
> > <ffffffff880fa590>{:radeon:radeon_do_cp_idle+336}
> > <ffffffff880fc215>{:radeon:radeon_do_release+85}
> > <ffffffff88104369>{:radeon:radeon_driver_pretakedown+9}
> > <ffffffff802783aa>{drm_takedown+74}
>
> ah-hah. We've had machines stuck in radeon_do_wait_for_idle() before. In
> fact, my workstation was doing it a year or two back.
>
> Are you able to identify the most recent kernel which didn't do this?
>
> David, is there a common cause for this? ISTR that it's a semi-FAQ.

We've seen a few reports of this in the Fedora bugzilla over the
last year or so too. It seems to come and go. The best explanation
I've heard so far is "The GPU got really confused".

Dave

2005-11-15 22:50:39

by Badari Pulavarty

Subject: Re: 2.6.14 X spinning in the kernel

Dave Airlie wrote:

>>ah-hah. We've had machines stuck in radeon_do_wait_for_idle() before. In
>>fact, my workstation was doing it a year or two back.
>>
>>Are you able to identify the most recent kernel which didn't do this?
>>
>>David, is there a common cause for this? ISTR that it's a semi-FAQ.
>
>
> Yes invariably the GPU has crashed and isn't responding to anything.
> Unfortunately radeons have a lot of reasons for crashing, most of them very
> unrelated to anything like reality... we normally try to approach them
> on a case-by-case basis, as some can be solved easily and some not so...
>
> Also, what X was doing etc. at the time is invaluable info..

What information can I collect? My machine seems to reproduce this
pretty regularly.

Thanks,
Badari

2005-11-15 22:49:48

by Badari Pulavarty

Subject: Re: 2.6.14 X spinning in the kernel

Andrew Morton wrote:

> Badari Pulavarty <[email protected]> wrote:
>
>>On Mon, 2005-11-14 at 16:17 -0800, Andrew Morton wrote:
>>
>>>Badari Pulavarty <[email protected]> wrote:
>>>
>>>>My 2-cpu EM64T machine started showing this problem again on 2.6.14.
>>>>On some reboots, X seems to spin in the kernel forever.
>>>>
>>>>sysrq-t output shows nothing.
>>>>
>>>>X R running task 0 3607 3589 3903
>>>>(L-TLB)
>>>>
>>>>top shows:
>>>> 3607 root 25 0 0 0 0 R 99.1 0.0 262:04.69 X
>>>>
>>>>
>>>>So, I wrote a module to do smp_call_function() on all CPUs
>>>>to show stacks on them. CPU0 seems to be spinning in exit_mmap().
>>>>I did this several times to collect a few stack samples.
>>>>
>>>>Is this a known issue ?
>>>
>>>Nope. Maybe your vma list has a loop in it, in remove_vma()? slab
>>>debugging would detect that, due to the repeated
>>>kmem_cache_free(vm_area_cachep, vma);
>>
>>I compiled the kernel with slab debug and rebooted the machine.
>>X seems to be spinning again. But this time, it shows completely
>>different routines (and seems to be switching CPUs) :(
>>Something weird is happening on my machine..
>>
>>top:
>> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
>> 3600 root 25 0 0 0 0 R 99.9 0.0 8:29.18 X
>>
>>
>>...
>>
>>Then I tried killing it and ran into..
>>
>>CPU0:
>>ffffffff8053c750 0000000000000000 00000000000018ff ffff81011c9a4230
>> ffff81011c9a4000 ffffffff8053c788 ffffffff8026de8f
>>ffffffff8053c7a8
>> ffffffff80119591 ffffffff8053c7a8
>>Call Trace: <IRQ> <ffffffff8026de8f>{showacpu+47}
>><ffffffff80119591>{smp_call_function_interrupt+81}
>> <ffffffff8010e968>{call_function_interrupt+132} <EOI>
>><ffffffff880fa225>{:radeon:radeon_do_wait_for_idle+117}
>> <ffffffff880fa236>{:radeon:radeon_do_wait_for_idle+134}
>> <ffffffff880fa590>{:radeon:radeon_do_cp_idle+336}
>><ffffffff880fc215>{:radeon:radeon_do_release+85}
>> <ffffffff88104369>{:radeon:radeon_driver_pretakedown+9}
>> <ffffffff802783aa>{drm_takedown+74}
>
>
> ah-hah. We've had machines stuck in radeon_do_wait_for_idle() before. In
> fact, my workstation was doing it a year or two back.
>
> Are you able to identify the most recent kernel which didn't do this?

I got this machine recently. 2.6.14-rc kernels are the first ones I
tried on this box and they have been failing :(

The only known-good kernel I have had so far is RHEL4 (2.6.9 base).

Thanks,
Badari


2005-11-16 15:32:16

by Pasi Savolainen

Subject: Re: 2.6.14 X spinning in the kernel

* Badari Pulavarty <[email protected]>:
> Hi,
>
> My 2-cpu EM64T machine started showing this problem again on 2.6.14.
> On some reboots, X seems to spin in the kernel forever.
>
> sysrq-t output shows nothing.
>
> X R running task 0 3607 3589 3903
> (L-TLB)
>
> top shows:
> 3607 root 25 0 0 0 0 R 99.1 0.0 262:04.69 X


I get something like that on a 2xAthlon, but with kernel 2.6.12 (some Debian
version, AFAIK slightly patched). In my case XOrg (6.8.2 -> 6.9-rc)
doesn't hang but continues to work; I notice another hung process from
the rising load.
The video card is a Radeon 9200. When I restart X (logout to gdm), the hung
process disappears.

--
Psi -- <http://www.iki.fi/pasi.savolainen>

2005-11-16 21:08:25

by Max Krasnyansky

Subject: Re: 2.6.14 X spinning in the kernel

Badari Pulavarty wrote:
>> Badari Pulavarty <[email protected]> wrote:
>>
>>> On Mon, 2005-11-14 at 16:17 -0800, Andrew Morton wrote:
>>>
>>>> Badari Pulavarty <[email protected]> wrote:
>>>>
>>>>> My 2-cpu EM64T machine started showing this problem again on 2.6.14.
>>>>> On some reboots, X seems to spin in the kernel forever.
>>>>>
>>>>> sysrq-t output shows nothing.
>>>>>
>>>>> X R running task 0 3607 3589 3903
>>>>> (L-TLB)
>>>>>
>>>>> top shows:
>>>>> 3607 root 25 0 0 0 0 R 99.1 0.0 262:04.69 X
>>>>>
>>>>>
>>>>> So, I wrote a module to do smp_call_function() on all CPUs
>>>>> to show stacks on them. CPU0 seems to be spinning in exit_mmap().
>>>>> I did this several times to collect a few stack samples.
>>>>>
>>>>> Is this a known issue ?

I've seen similar problems on a dual Opteron HP xw9300/Radeon 7000 PCI box with 2.6.11.12
and the latest X from the Fedora x86-64 YUM repos.
I haven't done any traces but it sounds like the same problem (i.e. the X server is spinning).
Disabling DRI in xorg.conf fixed it for me.

Max
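
One common way to disable DRI in xorg.conf, as mentioned above, is to stop the X server from loading the dri module. The fragment below is a sketch that assumes a conventional xorg.conf with a Module section; the other Load lines are placeholders, and the actual contents of any poster's file are not shown in this thread.

```
Section "Module"
    Load  "dbe"
    Load  "extmod"
    Load  "glx"
#   Load  "dri"     # commented out so the server does not enable DRI
EndSection
```

After restarting X, /var/log/Xorg.0.log should indicate whether direct rendering was enabled.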

2005-11-16 21:52:43

by Badari Pulavarty

Subject: Re: 2.6.14 X spinning in the kernel

On Wed, 2005-11-16 at 13:07 -0800, Max Krasnyansky wrote:
> Badari Pulavarty wrote:
> >> Badari Pulavarty <[email protected]> wrote:
> >>
> >>> On Mon, 2005-11-14 at 16:17 -0800, Andrew Morton wrote:
> >>>
> >>>> Badari Pulavarty <[email protected]> wrote:
> >>>>
> >>>>> My 2-cpu EM64T machine started showing this problem again on 2.6.14.
> >>>>> On some reboots, X seems to spin in the kernel forever.
> >>>>>
> >>>>> sysrq-t output shows nothing.
> >>>>>
> >>>>> X R running task 0 3607 3589 3903
> >>>>> (L-TLB)
> >>>>>
> >>>>> top shows:
> >>>>> 3607 root 25 0 0 0 0 R 99.1 0.0 262:04.69 X
> >>>>>
> >>>>>
> >>>>> So, I wrote a module to do smp_call_function() on all CPUs
> >>>>> to show stacks on them. CPU0 seems to be spinning in exit_mmap().
> >>>>> I did this several times to collect a few stack samples.
> >>>>>
> >>>>> Is this a known issue ?
>
> I've seen similar problems on a dual Opteron HP xw9300/Radeon 7000 PCI box with 2.6.11.12
> and the latest X from the Fedora x86-64 YUM repos.
> I haven't done any traces but it sounds like the same problem (i.e. the X server is spinning).
> Disabling DRI in xorg.conf fixed it for me.

Okay. Thank you.

I traced it a little further.

It looks like radeon_freelist_get() is always returning NULL.
It seems to have 2 nested loops:
- the top loop runs 10000 times (usec_timeout).
- the second one runs over the length of the list?

for (t = 0; t < dev_priv->usec_timeout; t++) {
    ..
    for (i = start; i < dma->buf_count; i++) {

        ..
    }
}

Which is making it even worse.

And also, radeon_cp_get_buffers() is getting called repeatedly.

Thanks,
Badari
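
To make the cost of the two nested loops concrete, here is a simplified userspace model. It is a sketch under stated assumptions rather than the driver source: `model_buf`, `model_stats` and `model_freelist_get()` are hypothetical names, and `done_age` stands in for whatever completion counter the hardware reports. If that counter never advances, every pass over the list fails, so a single call examines `timeout * buf_count` buffers before returning NULL.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified buffer record; not the driver's own type. */
struct model_buf {
    bool pending;       /* handed to the GPU, completion not yet seen */
    unsigned int age;   /* stamp the GPU must reach before reuse */
};

struct model_stats {
    unsigned long outer_passes;   /* timeout iterations used */
    unsigned long inner_checks;   /* buffers examined in total */
};

/*
 * Scan the list up to `timeout` times for a buffer that is either not
 * pending or whose age the GPU has already reached. Returns NULL when
 * no buffer becomes reusable within the timeout.
 */
static struct model_buf *model_freelist_get(struct model_buf *bufs,
                                            size_t buf_count,
                                            unsigned int done_age,
                                            unsigned int timeout,
                                            struct model_stats *stats)
{
    for (unsigned int t = 0; t < timeout; t++) {
        stats->outer_passes++;
        for (size_t i = 0; i < buf_count; i++) {
            stats->inner_checks++;
            if (!bufs[i].pending || bufs[i].age <= done_age) {
                bufs[i].pending = false;
                return &bufs[i];
            }
        }
        /* a real driver would delay briefly between passes */
    }
    return NULL;   /* timed out: nothing became reusable */
}
```

For example, with an illustrative list of 32 pending buffers and a timeout of 10000, one failed call performs 320000 buffer checks; a caller that immediately retries repeats that work on every attempt.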

2005-11-16 22:27:33

by Lee Revell

Subject: Re: 2.6.14 X spinning in the kernel

On Wed, 2005-11-16 at 13:52 -0800, Badari Pulavarty wrote:
> - the top loop runs 10000 times (usec_timeout).

Where does usec_timeout get set anyway? With a DRM ioctl()? I looked
at the radeon source and it looks like it defaults to 100000 (not
10000). And I can't see where it ever gets set to anything but the
default.

Lee

2005-11-16 22:37:41

by Badari Pulavarty

Subject: Re: 2.6.14 X spinning in the kernel

On Wed, 2005-11-16 at 17:11 -0500, Lee Revell wrote:
> On Wed, 2005-11-16 at 13:52 -0800, Badari Pulavarty wrote:
> > - the top loop runs 10000 times (usec_timeout).
>
> Where does usec_timeout get set anyway? With a DRM ioctl()? I looked
> at the radeon source and it looks like it defaults to 100000 (not
> 10000). And I can't see where it ever gets set to anything but the
> default.
>
> Lee

Don't know. I added a printk() and it shows

Nov 16 11:43:51 elm3b23 kernel: usec timeout 10000

Thanks,
Badari

2005-11-16 22:42:20

by Dave Airlie

Subject: Re: 2.6.14 X spinning in the kernel


>
> I traced it a little further.
>
> It looks like radeon_freelist_get() is always returning NULL.
> It seems to have 2 nested loops:
> - the top loop runs 10000 times (usec_timeout).
> - the second one runs over the length of the list?
>
> for (t = 0; t < dev_priv->usec_timeout; t++) {
>     ..
>     for (i = start; i < dma->buf_count; i++) {
>
>         ..
>     }
> }
>
> Which is making it even worse.
>
> And also, radeon_cp_get_buffers() is getting called repeatedly.

Again I say this is a chip hang, the chip isn't consuming any more data,
so we run out of buffers...

Can you send me lspci -v, /var/log/Xorg.0.log, xorg.conf

If you are running a PCI Radeon you are screwed with the latest Fedora X
packages, roll back a few to find the ones that work, the FC people took a
really hacky patch from ATI and thought it was a good idea, and now it is
in X.org, or turn off DRI...

Dave.

--
David Airlie, Software Engineer
http://www.skynet.ie/~airlied / airlied at skynet.ie
Linux kernel - DRI, VAX / pam_smb / ILUG

2005-11-16 23:27:46

by Badari Pulavarty

Subject: Re: 2.6.14 X spinning in the kernel


> Again I say this is a chip hang, the chip isn't consuming any more data,
> so we run out of buffers...
>
> Can you send me lspci -v, /var/log/Xorg.0.log, xorg.conf
>
> If you are running a PCI Radeon you are screwed with the latest Fedora X
> packages, roll back a few to find the ones that work, the FC people took a
> really hacky patch from ATI and thought it was a good idea, and now it is
> in X.org, or turn off DRI...
>

Okay, here is the data.

I am running RHEL4, and the RHEL4 kernel runs fine with X.

Thanks,
Badari



Attachments:
lspci.out (8.38 kB)
Xorg.0.log (41.82 kB)
xorg.conf (2.73 kB)

2005-11-17 00:10:33

by Badari Pulavarty

Subject: Re: 2.6.14 X spinning in the kernel


> Can you send me lspci -v, /var/log/Xorg.0.log, xorg.conf
>
> If you are running a PCI Radeon you are screwed with the latest Fedora X
> packages, roll back a few to find the ones that work, the FC people took a
> really hacky patch from ATI and thought it was a good idea, and now it is
> in X.org, or turn off DRI...

I turned off DRI for now and X is happy.

Thanks,
Badari