2001-04-25 07:00:06

by Tobias Ringstrom

Subject: Weird problem with 2.4.4-pre6

Yesterday, I was running tcpdump, paging the output with less. All of a
sudden, less started to dump core (SIGSEGV). I could not even start less
by itself:

> less

without it getting a SIGSEGV, and in fact no user could run less without
getting a SIGSEGV, but it did work perfectly a few minutes earlier. This
morning, I tried to run less again, and now it was working! No core
dumps!

How can this happen? Something overwriting the page/buffer cache?
Unfortunately, I don't know how to reproduce it. I'm writing this because
it was so strange that I felt I had to share it. There are no messages in
the (dmesg) log.

/Tobias, a little bit worried


Semi-random info:

00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma] (rev 02)
00:07.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II] (rev 01)
00:07.1 IDE interface: Intel Corporation 82371SB PIIX3 IDE [Natoma/Triton II]
00:07.2 USB Controller: Intel Corporation 82371SB PIIX3 USB [Natoma/Triton II] (rev 01)
00:0b.0 VGA compatible controller: ATI Technologies Inc 210888GX [Mach64 GX] (rev 01)
00:0f.0 Ethernet controller: Davicom Semiconductor, Inc. Ethernet 100/10 MBit (rev 31)
00:11.0 Ethernet controller: 3Com Corporation 3c905C-TX [Fast Etherlink] (rev 74)

hda is running with DMA enabled in mdma2 mode.


2001-04-26 01:16:55

by Andrew Morton

Subject: Re: Weird problem with 2.4.4-pre6

Tobias Ringstrom wrote:
>
> Yesterday, I was running tcpdump, paging the output with less. All of a
> sudden, less started to dump core (SIGSEGV). I could not even start less
> by itself:
>
> > less
>
> without it getting a SIGSEGV, and in fact no user could run less without
> getting a SIGSEGV, but it did work perfectly a few minutes earlier. This
> morning, I tried to run less again, and now it was working! No core
> dumps!
>
> How can this happen? Something overwriting the page/buffer cache?

Yes. Something scribbled on the pagecache, most likely.

If this happens, take a copy of the offending binary and all its shared
libraries - simply copy them into a temp directory. The copy is read from
the pagecache, so the corrupted version is what gets written to disk.
Make sure you also keep a copy of the offending vmlinux for looking
things up in.
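
As a rough illustration of the copy step, here is a minimal Python sketch.
It assumes the usual `ldd` output format to find the shared libraries; the
names parse_ldd and snapshot are made up for this example. Note that ldd
works by invoking the dynamic loader, so it may itself fail on a badly
corrupted binary, in which case the libraries have to be listed by hand.

```python
import shutil
import subprocess
from pathlib import Path


def parse_ldd(output):
    """Extract absolute library paths from `ldd` output (hypothetical helper)."""
    paths = []
    for line in output.splitlines():
        line = line.strip()
        if "=>" in line:
            # "libc.so.6 => /lib/libc.so.6 (0x...)": take the resolved path
            target = line.split("=>", 1)[1].split("(")[0].strip()
        else:
            # "/lib/ld-linux.so.2 (0x...)": the path is the first field
            target = line.split("(")[0].strip()
        if target.startswith("/"):  # skips "not found" and virtual objects
            paths.append(target)
    return paths


def snapshot(binary, dest):
    """Copy `binary` and the libraries ldd reports into directory `dest`."""
    dest = Path(dest)
    dest.mkdir(parents=True, exist_ok=True)
    out = subprocess.run(["ldd", binary], capture_output=True, text=True).stdout
    for src in [binary] + parse_ldd(out):
        # Reads go through the pagecache, so the corrupted pages get saved.
        shutil.copy2(src, dest / Path(src).name)
```

For example, snapshot("/usr/bin/less", "/tmp/less-bad") would save the
binary and its libraries under /tmp/less-bad (both paths are illustrative).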

Then reboot and start diffing the saved copies against the originals,
which will now be read fresh from disk; the differences can provide
clues. If the diffs show single-bit errors then it's most likely a RAM
problem. If the diffs look like pointers into kernel space
then look 'em up in vmlinux and shout loudly. etc.
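
The diffing step described above could be sketched roughly as follows in
Python. This is only an illustration: the function names are invented
here, and the kernel-pointer check assumes a 32-bit little-endian x86
machine with the default 3G/1G split, where kernel addresses start at
0xC0000000.

```python
import struct

KERNEL_BASE = 0xC0000000  # assumption: default i386 user/kernel split


def diff_offsets(good, bad):
    """Return the offsets at which the two byte strings differ."""
    return [i for i, (g, b) in enumerate(zip(good, bad)) if g != b]


def single_bit_flips(good, bad):
    """Return offsets where exactly one bit differs (hints at bad RAM)."""
    return [i for i in diff_offsets(good, bad)
            if bin(good[i] ^ bad[i]).count("1") == 1]


def kernel_pointer_candidates(good, bad):
    """Return (offset, value) for aligned 32-bit words in `bad` that
    overlap a difference and look like kernel-space addresses."""
    words = sorted({i & ~3 for i in diff_offsets(good, bad)})
    found = []
    for w in words:
        if w + 4 <= len(bad):
            (value,) = struct.unpack_from("<I", bad, w)
            if value >= KERNEL_BASE:
                found.append((w, value))
    return found
```

To use it, read the known-good file and the saved corrupted copy with
open(path, "rb").read() and pass both to the functions. Any candidate
values still have to be looked up by hand in the saved vmlinux (or its
System.map).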
