On my main devel box, vanilla 2.6.23 on x86-64/Fedora-7, I'm seeing a
certain behavior at least once a day. I'll start a kernel build (make
-sj5 on this box), and it will "hang" in the following way:
> 31003 ? S 0:04 sshd: jgarzik@pts/0
> 31004 pts/0 Ss 0:02 \_ -bash
> 8280 pts/0 S+ 0:00 \_ make ARCH=i386 -sj4
> 8690 pts/0 Z+ 0:00 \_ [rm] <defunct>
> 8691 pts/0 S+ 0:00 \_ /bin/sh -c cat include/config/kernel.release 2> /dev/null
> 8692 pts/0 R+ 6:12 \_ cat include/config/kernel.release
Specifically, the symptom is a process, often a simple one like cat(1)
or rm(1) or somewhere in check-headers, will stay in the running state,
accumulating CPU time.
If I Ctrl-C the build, and start over, the build will normally -not- get
stuck at the same point, but proceed to chew through one of a bazillion
allmodconfig builds.
I also see this occasionally on my main workstation (also
2.6.23/x86-64/Fedora-7), though not as frequently.
This is a new behavior since the new scheduler was merged... I think.
Nothing more concrete to report at this time. I cannot easily reproduce
the behavior, as it happens [apparently] randomly sometime during the
day. Generally, the files these programs are dealing with are -always-
in the pagecache, if that makes any difference.
Jeff
On 10/19/2007 05:39 PM, Jeff Garzik wrote:
> On my main devel box, vanilla 2.6.23 on x86-64/Fedora-7, I'm seeing a
> certain behavior at least once a day. I'll start a kernel build (make
> -sj5 on this box), and it will "hang" in the following way:
>
Can you try to strace the hanging task?
Chuck Ebbert wrote:
> On 10/19/2007 05:39 PM, Jeff Garzik wrote:
>> On my main devel box, vanilla 2.6.23 on x86-64/Fedora-7, I'm seeing a
>> certain behavior at least once a day. I'll start a kernel build (make
>> -sj5 on this box), and it will "hang" in the following way:
>>
>
> Can you try to strace the hanging task?
Well, to the system it's running, so that doesn't do much of anything...
>
> 8482 pts/0 S+ 0:00 \_ /bin/sh /garz/repo/misc-2.6/scripts/hdrcheck.sh /garz/repo/misc-2.6/usr/include /garz/repo/misc-2.6/usr/include/linux/kernelcapi.h /garz/repo/misc-2.6/usr/include/linux/.check.kernelcapi.h
> 8484 pts/0 R+ 3:10 \_ grep ^[ \t]*#[ \t]*include[ \t]*< /garz/repo/misc-2.6/usr/include/linux/kernelcapi.h
> 8486 pts/0 S+ 0:00 \_ cut -f2 -d<
> 8487 pts/0 S+ 0:00 \_ cut -f1 -d>
> 8488 pts/0 S+ 0:00 \_ egrep ^linux|^asm
> [jgarzik@pretzel misc-2.6]$ strace -p8484
> Process 8484 attached - interrupt to quit
[sits there, chewing up CPU grepping a 47-line header file]
On 10/19/2007 06:03 PM, Jeff Garzik wrote:
>> [jgarzik@pretzel misc-2.6]$ strace -p8484
>> Process 8484 attached - interrupt to quit
> [sits there, chewing up CPU grepping a 47-line header file]
>
And sysrq-p is pretty useless unless you can force the keyboard
interrupt and the spinning process onto the same CPU.
From: Chuck Ebbert <[email protected]>
Date: Fri, 19 Oct 2007 18:18:08 -0400
> On 10/19/2007 06:03 PM, Jeff Garzik wrote:
> >> [jgarzik@pretzel misc-2.6]$ strace -p8484
> >> Process 8484 attached - interrupt to quit
> > [sits there, chewing up CPU grepping a 47-line header file]
> >
>
> And sysrq-p is pretty useless unless you can force the keyboard
> interrupt and the spinning process onto the same CPU.
Yes, I find this a painful limitation too.
Sparc64 used to dump the registers on all active cpus for show_regs()
via a cross-call, and this was incredibly useful. But I disabled that
as soon as I started playing with Niagara because at 32 cpus and
larger the output is just too voluminous to be useful.
What might be appropriate is just to get a one-line program counter
dump on every cpu via some new sysrq keystroke.
On 10/19/07, Jeff Garzik <[email protected]> wrote:
> On my main devel box, vanilla 2.6.23 on x86-64/Fedora-7, I'm seeing a
> certain behavior at least once a day. I'll start a kernel build (make
> -sj5 on this box), and it will "hang" in the following way:
>
> > 31003 ? S 0:04 sshd: jgarzik@pts/0
> > 31004 pts/0 Ss 0:02 \_ -bash
> > 8280 pts/0 S+ 0:00 \_ make ARCH=i386 -sj4
> > 8690 pts/0 Z+ 0:00 \_ [rm] <defunct>
> > 8691 pts/0 S+ 0:00 \_ /bin/sh -c cat include/config/kernel.release 2> /dev/null
> > 8692 pts/0 R+ 6:12 \_ cat include/config/kernel.release
>
> Specifically, the symptom is a process, often a simple one like cat(1)
> or rm(1) or somewhere in check-headers, will stay in the running state,
> accumulating CPU time.
>
> If I Ctrl-C the build, and start over, the build will normally -not- get
> stuck at the same point, but proceed to chew through one of a bazillion
> allmodconfig builds.
I *think* I'm seeing this with firefox under 2.6.23-rc6. I tried a
`killall -SIGSTOP firefox; killall -SIGCONT firefox` and when I looked
back it was back to life again, but that may have been a fluke.
Regardless, try that the next time it happens?
Ray
On 10/19/2007 08:01 PM, David Miller wrote:
>
> What might be appropriate is just to get a one-line program counter
> dump on every cpu via some new sysrq keystroke.
>
IIRC -mm had something like this but it was buggy because we were
sending IPIs to each processor asking them to print their state.
Maybe it would work if we had a way of making them dump their
state to a memory location and then collected and printed it from
the CPU that's handling the sysrq.
On 20/10/2007, Jeff Garzik <[email protected]> wrote:
> Chuck Ebbert wrote:
> > On 10/19/2007 05:39 PM, Jeff Garzik wrote:
> >> On my main devel box, vanilla 2.6.23 on x86-64/Fedora-7, I'm seeing a
> >> certain behavior at least once a day. I'll start a kernel build (make
> >> -sj5 on this box), and it will "hang" in the following way:
> >>
> >
> > Can you try to strace the hanging task?
>
> Well, to the system it's running, so that doesn't do much of anything...
>
> >
> > 8482 pts/0 S+ 0:00 \_ /bin/sh /garz/repo/misc-2.6/scripts/hdrcheck.sh /garz/repo/misc-2.6/usr/include /garz/repo/misc-2.6/usr/include/linux/kernelcapi.h /garz/repo/misc-2.6/usr/include/linux/.check.kernelcapi.h
> > 8484 pts/0 R+ 3:10 \_ grep ^[ \t]*#[ \t]*include[ \t]*< /garz/repo/misc-2.6/usr/include/linux/kernelcapi.h
> > 8486 pts/0 S+ 0:00 \_ cut -f2 -d<
> > 8487 pts/0 S+ 0:00 \_ cut -f1 -d>
> > 8488 pts/0 S+ 0:00 \_ egrep ^linux|^asm
> > [jgarzik@pretzel misc-2.6]$ strace -p8484
> > Process 8484 attached - interrupt to quit
> [sits there, chewing up CPU grepping a 47-line header file]
I think, the following information would be helpful :
(1) # cat /proc/PID/stat ; sleep 5; cat /proc/PID/stat
as we could know :
- whether utime or/and stime keeps increasing ;
- one of the fields is actually 'eip' ... so we could, hopefully, more
or less precisely localize the area of the code where it's looping.
(do_task_stat() in fs/proc/array.c)
(2) # cat /proc/sched_debug;
and
# cat /proc/PID/sched
(if your kernel has CONFIG_SCHED_DEBUG=y)
(3) is anything else able to run on this CPU (as I understand it's
SMP) or the task does completely locks it up.
again, one of the fields of /proc/PID/stat is the CPU# on which the
task is currently active.
Let's start a buzy-loop task on this cpu and see wheather it's able to
make any progress (TIME counter in 'ps')
# taskset -c THIS_CPU some_busy_looping_prog
TIA,
--
Best regards,
Dmitry Adamushko
On 21/10/2007, Dmitry Adamushko <[email protected]> wrote:
> On 20/10/2007, Jeff Garzik <[email protected]> wrote:
> > Chuck Ebbert wrote:
> > > On 10/19/2007 05:39 PM, Jeff Garzik wrote:
> > >> On my main devel box, vanilla 2.6.23 on x86-64/Fedora-7, I'm seeing a
> > >> certain behavior at least once a day. I'll start a kernel build (make
> > >> -sj5 on this box), and it will "hang" in the following way:
> > >>
> > >
> > > Can you try to strace the hanging task?
> >
> > Well, to the system it's running, so that doesn't do much of anything...
> >
> > >
> > > 8482 pts/0 S+ 0:00 \_ /bin/sh /garz/repo/misc-2.6/scripts/hdrcheck.sh /garz/repo/misc-2.6/usr/include /garz/repo/misc-2.6/usr/include/linux/kernelcapi.h /garz/repo/misc-2.6/usr/include/linux/.check.kernelcapi.h
> > > 8484 pts/0 R+ 3:10 \_ grep ^[ \t]*#[ \t]*include[ \t]*< /garz/repo/misc-2.6/usr/include/linux/kernelcapi.h
> > > 8486 pts/0 S+ 0:00 \_ cut -f2 -d<
> > > 8487 pts/0 S+ 0:00 \_ cut -f1 -d>
> > > 8488 pts/0 S+ 0:00 \_ egrep ^linux|^asm
> > > [jgarzik@pretzel misc-2.6]$ strace -p8484
> > > Process 8484 attached - interrupt to quit
> > [sits there, chewing up CPU grepping a 47-line header file]
>
> I think, the following information would be helpful :
>
> (1) # cat /proc/PID/stat ; sleep 5; cat /proc/PID/stat
>
> as we could know :
>
> - whether utime or/and stime keeps increasing ;
>
> - one of the fields is actually 'eip' ... so we could, hopefully, more
> or less precisely localize the area of the code where it's looping.
sure, this 'eip' will be useless in this case.
[ ... ]
--
Best regards,
Dmitry Adamushko
From: Chuck Ebbert <[email protected]>
Date: Sun, 21 Oct 2007 11:59:14 -0400
> IIRC -mm had something like this but it was buggy because we were
> sending IPIs to each processor asking them to print their state.
What's buggy about this? :-) That's exactly how it work(ed)
on sparc64.
On Fri, 19 Oct 2007, Ray Lee wrote:
> On 10/19/07, Jeff Garzik <[email protected]> wrote:
> > On my main devel box, vanilla 2.6.23 on x86-64/Fedora-7, I'm seeing a
> > certain behavior at least once a day. I'll start a kernel build (make
> > -sj5 on this box), and it will "hang" in the following way:
> >
> > > 31003 ? S 0:04 sshd: jgarzik@pts/0
> > > 31004 pts/0 Ss 0:02 \_ -bash
> > > 8280 pts/0 S+ 0:00 \_ make ARCH=i386 -sj4
> > > 8690 pts/0 Z+ 0:00 \_ [rm] <defunct>
> > > 8691 pts/0 S+ 0:00 \_ /bin/sh -c cat include/config/kernel.release 2> /dev/null
> > > 8692 pts/0 R+ 6:12 \_ cat include/config/kernel.release
> >
> > Specifically, the symptom is a process, often a simple one like cat(1)
> > or rm(1) or somewhere in check-headers, will stay in the running state,
> > accumulating CPU time.
> >
> > If I Ctrl-C the build, and start over, the build will normally -not- get
> > stuck at the same point, but proceed to chew through one of a bazillion
> > allmodconfig builds.
>
> I *think* I'm seeing this with firefox under 2.6.23-rc6. I tried a
> `killall -SIGSTOP firefox; killall -SIGCONT firefox` and when I looked
> back it was back to life again, but that may have been a fluke.
> Regardless, try that the next time it happens?
Don't know if that's the same problem as above, but a few minutes ago my
mail-server locked down completely. First pine froze, then more processes
started freezing, then the system became unusable, ssh logins got stuck,
USB- and ps/2 keyboards. I managed to get a trace with the "w" sysrq:
SysRq : Show Blocked State
task PC stack pid father
syslogd D c01275a3 0 2818 1
e8a1fe4c 00000086 e8a1fe4c c01275a3 f7e9a200 00000282 e8a1fe5c 00b5ce60
c1b05cc0 e8a1fe7c c030c8e7 e8a1ff30 c8d553c4 c0407348 c1b58b18 00b5ce60
c01277d0 c1b1ba90 c0407040 000000da d5f806c0 e8a1fe84 c030c974 e8a1feb4
Call Trace:
[<c030c8e7>] schedule_timeout+0x47/0xc0
[<c030c974>] schedule_timeout_uninterruptible+0x14/0x20
[<c01b94cb>] journal_stop+0xcb/0x270
[<c01ba1fd>] journal_force_commit+0x1d/0x30
[<c01b2265>] ext3_force_commit+0x25/0x30
[<c01ac7dc>] ext3_write_inode+0x2c/0x40
[<c0183f8b>] __writeback_single_inode+0x30b/0x3e0
[<c01849b4>] sync_inode+0x24/0x60
[<c01a8d42>] ext3_sync_file+0xc2/0xd0
[<c0187340>] do_fsync+0x60/0xa0
[<c01873a8>] __do_fsync+0x28/0x40
[<c01873ed>] sys_fsync+0xd/0x10
[<c010424e>] sysenter_past_esp+0x5f/0x85
=======================
pine D c01275a3 0 6910 6243
d1f55e4c 00200082 c03c0e40 c01275a3 f7e42c80 00200282 d1f55e5c 00b5ce60
c1b05cc0 d1f55e7c c030c8e7 d1f55f30 c8d553d8 c1b58b18 c90a9e5c 00b5ce60
c01277d0 d8ae8a90 c0407040 000000c3 d5f806c0 d1f55e84 c030c974 d1f55eb4
Call Trace:
[<c030c8e7>] schedule_timeout+0x47/0xc0
[<c030c974>] schedule_timeout_uninterruptible+0x14/0x20
[<c01b94cb>] journal_stop+0xcb/0x270
[<c01ba1fd>] journal_force_commit+0x1d/0x30
[<c01b2265>] ext3_force_commit+0x25/0x30
[<c01ac7dc>] ext3_write_inode+0x2c/0x40
[<c0183f8b>] __writeback_single_inode+0x30b/0x3e0
[<c01849b4>] sync_inode+0x24/0x60
[<c01a8d42>] ext3_sync_file+0xc2/0xd0
[<c0187340>] do_fsync+0x60/0xa0
[<c01873a8>] __do_fsync+0x28/0x40
[<c01873ed>] sys_fsync+0xd/0x10
[<c010424e>] sysenter_past_esp+0x5f/0x85
=======================
sendmail D c01275a3 0 7448 7446
c90a9e4c 00000082 c03c0e40 c01275a3 f7e00580 00000282 c90a9e5c 00b5ce60
c1b05cc0 c90a9e7c c030c8e7 c90a9f30 c8d553ec d1f55e5c c0407348 00b5ce60
c01277d0 d8ae8030 c0407040 0000004b d5f806c0 c90a9e84 c030c974 c90a9eb4
Call Trace:
[<c030c8e7>] schedule_timeout+0x47/0xc0
[<c030c974>] schedule_timeout_uninterruptible+0x14/0x20
[<c01b94cb>] journal_stop+0xcb/0x270
[<c01ba1fd>] journal_force_commit+0x1d/0x30
[<c01b2265>] ext3_force_commit+0x25/0x30
[<c01ac7dc>] ext3_write_inode+0x2c/0x40
[<c0183f8b>] __writeback_single_inode+0x30b/0x3e0
[<c01849b4>] sync_inode+0x24/0x60
[<c01a8d42>] ext3_sync_file+0xc2/0xd0
[<c0187340>] do_fsync+0x60/0xa0
[<c01873a8>] __do_fsync+0x28/0x40
[<c01873ed>] sys_fsync+0xd/0x10
[<c010424e>] sysenter_past_esp+0x5f/0x85
=======================
now you see why I wrote "ext3 deadlock." It's y VIA C7 system, running
2.6.23-rc9-g804b3f9a, no problems since 5 October, when the kernel has
been built. No Oops / warnings in dmesg. Or has this been fixed since
23-rc9?
Thanks
Guennadi
---
Guennadi Liakhovetski