I have written a pair of applications the server side of which consistently causes my Linux Fedora Core 2 system to become
completely unresponsive; all consoles hang, and it no longer services network connections.
The applications engage in the rapid opening and closing of TCP connections. The server side is multithreaded (# threads approx 5).
It services the connections by dumping data into them from a file. The client side reads no data. The server then receives EAGAIN
from send(...,MSG_DONTWAIT) calls, and issues a 5ms sleep before resending on any particular TCP connection. It loops up to 20 times
waiting for the connection to become unblocked. The applications are running within GDB, and threads *are* created/destroyed
during the process.
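For readers trying to reproduce the pattern, the retry loop described above might look roughly like the following. This is only a sketch under assumptions: the function name `send_with_retry` and its parameters are hypothetical, the non-blocking flag is assumed to be `MSG_DONTWAIT`, and the real server's code was not posted.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <time.h>

/*
 * Hypothetical helper (not from the original application): attempt a
 * non-blocking send; on EAGAIN/EWOULDBLOCK sleep 5 ms and try again,
 * giving up after max_tries attempts.
 * Returns the number of bytes sent, or -1 with errno set.
 */
ssize_t send_with_retry(int fd, const void *buf, size_t len, int max_tries)
{
    const struct timespec pause = { 0, 5 * 1000 * 1000 }; /* 5 ms */

    for (int attempt = 0; attempt < max_tries; attempt++) {
        ssize_t n = send(fd, buf, len, MSG_DONTWAIT);
        if (n >= 0)
            return n;                 /* some or all bytes were queued */
        if (errno != EAGAIN && errno != EWOULDBLOCK)
            return -1;                /* a real error, not "buffer full" */
        nanosleep(&pause, NULL);      /* peer is not reading; back off */
    }
    errno = EAGAIN;                   /* still blocked after all retries */
    return -1;
}
```

With the 20 attempts mentioned in the message, a connection whose peer never reads ties up its thread for about 100 ms before the server gives up on it.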
I will change the application to use select() rather than sleeping on a blocked pipe. However, I don't think it's a "good thing"
that the machine hangs so completely.
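A minimal sketch of the select()-based alternative, assuming one socket per call and a millisecond timeout. The names `wait_writable` and `send_when_writable` are illustrative only and do not come from the application discussed here.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/select.h>
#include <sys/socket.h>
#include <sys/types.h>

/*
 * Hypothetical helper: wait up to timeout_ms for fd to become writable.
 * Returns 1 if writable, 0 on timeout, -1 on error (errno set).
 * If interrupted by a signal, the wait restarts with the full timeout.
 */
int wait_writable(int fd, int timeout_ms)
{
    if (fd < 0 || fd >= FD_SETSIZE) {  /* select() cannot handle this fd */
        errno = EINVAL;
        return -1;
    }
    for (;;) {
        fd_set wfds;
        struct timeval tv;

        FD_ZERO(&wfds);
        FD_SET(fd, &wfds);
        tv.tv_sec = timeout_ms / 1000;
        tv.tv_usec = (timeout_ms % 1000) * 1000;

        int rc = select(fd + 1, NULL, &wfds, NULL, &tv);
        if (rc > 0)
            return 1;
        if (rc == 0)
            return 0;
        if (errno != EINTR)
            return -1;
    }
}

/* Send only once the kernel reports room in the socket's send buffer. */
ssize_t send_when_writable(int fd, const void *buf, size_t len, int timeout_ms)
{
    int ready = wait_writable(fd, timeout_ms);
    if (ready <= 0) {
        if (ready == 0)
            errno = EAGAIN;            /* timed out while still blocked */
        return -1;
    }
    return send(fd, buf, len, MSG_DONTWAIT);
}
```

Compared with a fixed sleep, the thread wakes as soon as buffer space appears and does not poll while the peer is stalled.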
I looked for tools to help catch the kernel before it goes la-la (assuming it's the kernel going la-la), but got
frustrated/ran-out-of-time. E.g., lkcd seems defunct.
If pointed in the right direction, I would be happy to perform further forensics after re-creating the hang. I am also in the
process of upgrading the kernel to see if that resolves the problem.
Andrew Athan
uname -a:
Linux bbox.memeplex.com 2.6.6-1.435 #1 Mon Jun 14 09:09:07 EDT 2004 i686 i686 i386 GNU/Linux
lsmod:
Module Size Used by
snd_mixer_oss 13824 2
snd_via82xx 20644 3
snd_ac97_codec 54788 1 snd_via82xx
snd_pcm 69256 1 snd_via82xx
snd_timer 17284 1 snd_pcm
snd_page_alloc 8072 2 snd_via82xx,snd_pcm
gameport 3328 1 snd_via82xx
snd_mpu401_uart 4864 1 snd_via82xx
snd_rawmidi 17444 1 snd_mpu401_uart
snd_seq_device 6152 1 snd_rawmidi
snd 39396 10 snd_mixer_oss,snd_via82xx,snd_ac97_codec,snd_pcm,snd_timer,snd_mpu401_uart,snd_rawmidi,snd_seq_device
soundcore 6112 3 snd
ipt_mark 1408 2
ipt_MARK 1664 14
cls_u32 5508 2
cls_fw 3200 2
sch_sfq 4352 9
sch_htb 18048 1
iptable_mangle 2176 1
ip_tables 13568 3 ipt_mark,ipt_MARK,iptable_mangle
nfsd 159488 9
exportfs 4224 1 nfsd
lockd 47816 2 nfsd
parport_pc 19392 1
lp 8236 0
parport 29640 2 parport_pc,lp
autofs4 12932 0
sunrpc 109924 19 nfsd,lockd
via_rhine 15752 0
mii 3584 1 via_rhine
floppy 47440 0
sg 27680 0
scsi_mod 91984 1 sg
microcode 4768 0
dm_mod 32800 0
ehci_hcd 22916 0
uhci_hcd 24472 0
button 4632 0
battery 6924 0
asus_acpi 8984 0
ac 3340 0
r128 85796 2
ipv6 184672 18
ext3 103656 2
jbd 40728 1 ext3
I would not normally quote an entire message, but it contains data relevant to this problem.
The hang below occurs even outside of GDB, and also occurs after upgrading the kernel:
Linux bbox.memeplex.com 2.6.8-1.521 #1 Mon Aug 16 09:01:18 EDT 2004 i686 i686 i386 GNU/Linux
Can anyone please give me a clue/pointer to tools/techniques that might help identify where in the kernel the hang occurs? The
system is so completely unresponsive when this occurs that I cannot provide any forensic data.
Does anyone's experience show that these types of hangs might occur purely as the result of use (or mis-use) of the pthreads
library? I'm looking for hints about what parts of my code to review.
There could easily be erroneous calls to pthread_detach(), pthread_join(), close(), and other system calls involved.
Thanks,
Andrew Athan
-----Original Message-----
From: [email protected]
[mailto:[email protected]]On Behalf Of Andrew A.
Sent: Wednesday, September 22, 2004 3:17 PM
To: [email protected]
Subject: Consisten kernel hang during heavy TCP connection handling load
>
> I would not normally quote an entire message, but it contains data
> relevant to this problem.
>
> The hang below occurs even outside of GDB, and also occurs after
> upgrading the kernel:
>
> Linux bbox.memeplex.com 2.6.8-1.521 #1 Mon Aug 16 09:01:18 EDT 2004
> i686 i686 i386 GNU/Linux
>
>
>
> Can anyone please give me a clue/pointer to tools/techniques that
> might help identify where in the kernel the hang occurs? The system
> is so completely unresponsive when this occurs that I cannot provide
> any forensic data.
How unresponsive is it exactly? Can you switch consoles and write? I
suppose ps(1) hangs... Is the disk working?
You can compile the kernel with the magic SysRq key (the option is in the
kernel debugging section), run it, and then press alt-sysrq-t; the
state of all processes will be printed. That might help...
> Does anyone's experience show that these types of hangs might occur
> purely as the result of use (or mis-use) of the pthreads library? I'm
> looking for hints about what parts of my code to review.
>
> There could easily be erroneous calls to pthread_detach(),
> pthread_join(), close(), and other system calls involved.
>
> Thanks,
> Andrew Athan
Honza
--
Jan Kara <[email protected]>
SuSE CR Labs
Jan,
Thanks for responding. When I got no responses, I searched for ways to get more data out of the kernel--I must say that it has been
quite a journey to identify what is working, where to get it, and how to install it when it comes to kernel
debugging/crash-data-gathering tools. LKCD for example, is not available at the location you'll eventually arrive at if you search
for it in google ... it's not obvious what its state is (current/defunct/superseded), there's KDB, KGDB, netdump, netconsole,
netlog, diskdump (confusingly known as lkdump) etc. etc. And then, even if you do figure out what tools are current, you then have
to match the tool to the particular kernel version you are running -- which can be a task and a half unto itself.
Is diskdump available for 2.4? Can anyone comment on the choice of tools below?
Anyway, I have also done all of the following:
(1) Enabled netdump/netconsole on 2.6.8.1-521 Fedora Core kernel, after first fixing the startup scripts. Fixes can be found at
http://www.memeplex.com/Linux.html Note that after I also fixed crash.c to be a 2.6-compliant kernel module and loaded it to test
netdump, I always end up with a vmcore-incomplete image approx 45k in size, on the netdump-server. Can anyone tell me if this is
absurdly small, and if so, what might be the solution? The client box always reboots so I suspect too-small timeouts are the issue.
(2) Downloaded the latest 2.4 kernel, installed KDB patches and modified configs on the system to accept the 2.4 kernel --
specifically, /etc/modules.conf and xorg.conf changes (added Mouse1/SendCoreEvents on /dev/psaux). I don't think I found any
netdump patches for the 2.4 line of kernels. Can someone point me in the right direction?
(3) Enabled sysrq on both kernels, including echo "1" > /proc/sys/kernel/sysrq
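Before waiting for the next hang, it can be worth confirming that sysrq actually produces output. The sketch below does from a program what the echo in step (3) does, then requests the same task dump as Alt-SysRq-T. It assumes root privileges and that `/proc/sysrq-trigger` exists on the kernel in use; the function names are hypothetical.

```c
#include <stdio.h>

/*
 * Hypothetical helper: write a short string to a control file such as
 * one under /proc. Returns 0 on success, -1 on failure.
 */
int write_control_file(const char *path, const char *value)
{
    FILE *f = fopen(path, "w");
    if (f == NULL)
        return -1;

    int ok = (fputs(value, f) >= 0);
    if (fclose(f) != 0)               /* write errors may surface on close */
        ok = 0;
    return ok ? 0 : -1;
}

/*
 * Enable sysrq, then ask the kernel to print the state of all tasks
 * to the console/log. Assumes /proc/sysrq-trigger is available.
 */
int enable_sysrq_and_dump_tasks(void)
{
    if (write_control_file("/proc/sys/kernel/sysrq", "1") != 0)
        return -1;
    return write_control_file("/proc/sysrq-trigger", "t");
}
```

If the task list shows up in the kernel log while the machine is healthy, a silent Alt-SysRq-T during the hang suggests that even keyboard interrupts are no longer being serviced.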
I'll wait for the next hang now, trying it on both kernels. By the way, the system is hung VERY badly--doesn't respond to anything,
no switching consoles, no keyboard events, no disk activity. Dunno about network, since I haven't put a sniffer on it yet.
A.
-----Original Message-----
From: Jan Kara [mailto:[email protected]]
Sent: Sunday, September 26, 2004 1:42 PM
To: Andrew A.
Cc: [email protected]
Subject: Re: Consistent kernel hang during heavy TCP connection handling
load
Hello,
> Thanks for responding. When I got no responses, I searched for ways
> to get more data out of the kernel--I must say that it has been quite
> a journey to identify what is working, where to get it, and how to
> install it when it comes to kernel debugging/crash-data-gathering
> tools. LKCD for example, is not available at the location you'll
> eventually arrive at if you search for it in google ... it's not
> obvious what its state is (current/defunct/superseded), there's KDB,
> KGDB, netdump, netconsole, netlog, diskdump (confusingly known as
> lkdump) etc. etc. And then, even if you do figure out what tools are
> current, you then have to match the tool to the particular kernel
> version you are running -- which can be a task and a half unto itself.
>
> Is diskdump available for 2.4? Can anyone comment on the choice of
> tools below?
>
> Anyway, I have also done all of the following:
>
> (1) Enabled netdump/netconsole on 2.6.8.1-521 Fedora Core kernel,
> after first fixing the startup scripts. Fixes can be found at
> http://www.memeplex.com/Linux.html Note that after I also fixed crash.c to
> be a 2.6 compliant kernel module, and loading it to test netdump, I
> always end up with a vmcore-incomplete image approx 45k in size, on
> the netdump-server. Can anyone tell me if this is absurdly small, and
> if so, what might be the solution? The client box always reboots so I
> suspect too-small timeouts are the issue.
>
> (2) Downloaded the latest 2.4 kernel, installed KDB patches and
> modified configs on the system to accept the 2.4 kernel --
> specifically, /etc/modules.conf and xorg.conf changes (added
> Mouse1/SendCoreEvents on /dev/psaux). I don't think I found any
> netdump patches for the 2.4 line of kernels. Can someone point me in
> the right direction?
I don't personally have much experience debugging with the above tools,
so I won't be of much help. As you describe the problem below, I
personally think that you won't get much from them if the system is as
unresponsive as you write.
> (3) Enabled sysrq on both kernels, including echo "1" > /proc/sys/kernel/sysrq
>
> I'll wait for the next hang now, trying it on both kernels. By the
> way, the system is hung VERY badly--doesn't respond to anything, no
> switching consoles, no keyboard events, no disk activity. Dunno about
> network, since I haven't put a sniffer on it yet.
Hmm.. that looks bad. Do you debug things on the console and not
in X? If that is the case, either there is some hardware problem (you
likely generate quite a high load on the machine) or some driver is stuck
with interrupts disabled. In case debugging tools don't help, you can try
to compile the kernel with a minimal config (just disable everything not
needed to run the test). Also reproducing on a different machine would
be useful to rule out hardware...
Honza
--
Jan Kara <[email protected]>
SuSE CR Labs
Jan/all:
Yes, I have reproduced the problem on another machine running a similar kernel but with different network card, CPU, etc.
A.
Hello,
> Yes, I have reproduced the problem on another machine running a
> similar kernel but with different network card, CPU, etc.
OK, so it probably won't be hardware. Any debugging output? If I got
it right, you are using an RH kernel - can you try with the vanilla one from
ftp.kernel.org to rule out some RH-specific patches? Can you send your
kernel configuration?
Honza
--
Jan Kara <[email protected]>
SuSE CR Labs
Andrew A. wrote:
> (1) Enabled netdump/netconsole on 2.6.8.1-521 Fedora Core kernel, after first fixing the startup scripts. Fixes can be found at
> http://www.memeplex.com/Linux.html Note that after I also fixed crash.c to be a 2.6 compliant kernel module, and loading it to test
> netdump, I always end up with a vmcore-incomplete image approx 45k in size, on the netdump-server. Can anyone tell me if this is
> absurdly small, and if so, what might be the solution? The client box always reboots so I suspect too-small timeouts are the issue.
My experience is 100% with RH kernels, but the dump should be about
memory size, in my case 2.5G or 4G and it is. But I did see hangs which
resulted in the size you mention, a few k and hang.
There was a patch floating around to write a core image to a disk
partition like Solaris, AIX, and other commercial systems, but Linus was
opposed for some reason I remember as "I don't need this and it could be
dangerous" or similar. If that can be retrofitted to a current kernel it
would be more useful than netdump, I suspect.
In any case, the short answer is that what you see is way too short, it
sounds like the header info on config, registers, or somesuch that
netdump sends first before the core.
--
-bill davidsen ([email protected])
"The secret to procrastination is to put things off until the
last possible moment - but no longer" -me
On Thu, 2004-09-30 at 09:34, Bill Davidsen wrote:
> My experience is 100% with RH kernels, but the dump should be about
> memory size, in my case 2.5G or 4G and it is. But I did see hangs which
> resulted in the size you mention, a few k and hang.
Yep, the vmcore should be around the size of the memory on the dumping
system. The size is too small and vmcore-incomplete wasn't made into
vmcore, so it's incomplete.
Do you have access to making a new kernel? Here's a fix that I think
will help. This will speed things up and help with the timeout issues.
drivers/net/netdump.c :
1) Initialize jiffy_cycles to 1000 * (1000000/HZ) -
-static unsigned long long t0, jiffy_cycles;
+static unsigned long long t0, jiffy_cycles = 1000 * (1000000/HZ);
2) Change prev_jiffies from an "int" to an "unsigned long" in
print_status() function -
- static int prev_jiffies = 0;
+ static unsigned long prev_jiffies = 0;
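As general background on the second change, and not a description of what print_status() actually does (its body is not shown in this thread): tick counters such as jiffies are free-running unsigned values, and elapsed time is normally computed with unsigned subtraction so the result stays correct across a wraparound. A hypothetical illustration:

```c
#include <limits.h>

/*
 * Hypothetical illustration, not code from netdump.c: ticks elapsed
 * between two samples of a free-running unsigned counter. Unsigned
 * subtraction is modulo 2^N, so the result is correct even if the
 * counter wrapped once between prev and now.
 */
unsigned long ticks_elapsed(unsigned long now, unsigned long prev)
{
    return now - prev;
}
```

Keeping the previous sample in a signed `int` mixes signed and unsigned arithmetic and, where `unsigned long` is wider than `int`, truncates the stored value, so comparisons against the live counter can misbehave.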
Let me know if this helps,
Thanks,
Dan