2001-02-20 10:16:38

by Kevin Turner

[permalink] [raw]
Subject: [2.4.1] system goes glacial, Reiser on /usr doesn't sync

Version:
Linux version 2.4.1-pre12 (gcc version 2.95.3 20010125 (prerelease))

Possible suspect players:
dpkg seems to trigger the bug
ReiserFS is the partition that doesn't sync
binfmt_misc shows up in the call traces.

Symptoms:

The system assumes glacial speeds. If you're *lucky*, you'll see one
widget re-paint in X before the next ice age. Ctrl-alt-delete is
unresponsive, as are attempts to start proccesses via the network or
joystick port. Keypresses to programs such as getty are not echoed.
All program output to console and network is stopped dead. If you leave
for a several-hour-long coffee break and come back to it, there's still
no evidence that you banged on the keyboard.

What does work: If not in X, I may switch between VTs. Keypresses are
echoed on VTs without running processes. And the machine still pings.
Occasionally (a few times a minute at most), you'll see the HDD LED
blink.

When it happens: Sometimes, not always, when accessing the package
database or updating installed packages with the Debian's "dpkg" tool.

When it *didn't* happen: before I switched to kernel 2.4.x.

Debugging information:

After I had about enough of this, I recompiled the kernel with
CONFIG_REISERFS_CHECK and Magic SysRq. ReiserFS gives no debugging
messages of any sort during these episodes. Magic SysRq, however, does
work.

Magic-tErminate clears everything up and the system resumes normal
operation.

Magic-Sync appears not to work, as "Ok" and "Done" don't appear. But
after Magic-E, they do show up. Logs show that one partition actually
sync'd before termination, the others didn't. The partition where it
stalled just so happens to be a ReiserFS partition (3.5.x format).

kernel: SysRq: Emergency Sync
kernel: Syncing device 03:01 ... OK
kernel: Syncing device 16:02 ... <6>SysRq: Terminate All Tasks
kernel: OK
kernel: Syncing device 16:01 ... OK
kernel: Syncing device 03:03 ... OK


03:01 - ext2 /
16:02 - ReiFS /usr (note: /var is a symlink to /usr/root-var)
16:01 - ext2 /home
03:03 - ext2 /other

In ReiserFS's defense, /usr and /var are the patitions that dpkg was
most likely to have open files on, so perhaps that was it.

When I asked magic about running processes, it said that, among other
things, there was one apt-get, one dpkg, and three instances of
dpkg-deb. Of those, dpkg was the only one in 'R' state, the others were
'S'. Here's the call trace:


dpkg R 00000000 0 9278 9152 9285 (NOTLB)
Call Trace: [__alloc_pages+372/720] [grow_buffers+62/364]
[refill_freelist+28/48] [refill_freelist+41/48] [getblk+242/264]
[binfmt_misc:__insmod_binfmt_misc_O/lib/modules/2.4.1-pre12/kernel/fs/bi+-621596/96]
[binfmt_misc:__insmod_binfmt_misc_O/lib/modules/2.4.1-pre12/kernel/fs/bi+-970133/96]
[binfmt_misc:__insmod_binfmt_misc_O/lib/modules/2.4.1-pre12/kernel/fs/bi+-750728/96]
[binfmt_misc:__insmod_binfmt_misc_O/lib/modules/2.4.1-pre12/kernel/fs/bi+-976628/96]
[binfmt_misc:__insmod_binfmt_misc_O/lib/modules/2.4.1-pre12/kernel/fs/bi+-918150/96]
[binfmt_misc:__insmod_binfmt_misc_O/lib/modules/2.4.1-pre12/kernel/fs/bi+-976370/96]
[binfmt_misc:__insmod_binfmt_misc_O/lib/modules/2.4.1-pre12/kernel/fs/bi+-1080871/96]
[permission+42/48] [vfs_link+141/192] [sys_link+202/300]


Why binfmt_misc? I'll be burned if I know. It is true that one of the
packages I was installing this run was related to binfmt_misc
("binfmt-support"), but that wasn't the case the previous times this bug
has happened. On the other hand, I don't have call traces for those
times.


Linux version 2.4.1-pre12 (gcc version 2.95.3 20010125 (prerelease))
dpkg version 1.3.8.1
arch is i586 (GenuineIntel Pentium MMX)
disks are IDE
48 MB RAM
swap > 200 MB

Motherboard: ABIT-TX5
Host bridge: Intel Corporation 430TX - 82439TX MTXC (rev 01)
IDE interface: Intel Corporation 82371AB PIIX4 IDE (rev 01)

More logs are posted at:
http://olug.cs.oberlin.edu/~kturner/bugs/glacial-2.4.1.html
The kernel's configuration is at:
http://olug.cs.oberlin.edu/~kturner/bugs/config-2.4.1-pre12


And yes, I will try upgrading to a newer kernel. But since I can't
quite reproduce this on demand yet, I thought I'd report it now while
the logbits are fresh.

--
Kevin Turner <[email protected]> | OpenPGP encryption welcome here
I usually subscribe to lists I post to, but for LKML I'm making an exception =)
Please feel free to CC me on all follow-ups.


2001-02-20 10:31:01

by Keith Owens

[permalink] [raw]
Subject: Re: [2.4.1] system goes glacial, Reiser on /usr doesn't sync

On Tue, 20 Feb 2001 02:16:09 -0800,
Kevin Turner <[email protected]> wrote:
>[binfmt_misc:__insmod_binfmt_misc_O/lib/modules/2.4.1-pre12/kernel/fs/bi+-621596/96]
>
>Why binfmt_misc? I'll be burned if I know.

Because klogd conversion of addresses to symbols is a pile of crud.
Turn off klogd symbol conversion (klogd -x) and run the raw addresses
through ksymoops.

2001-02-20 11:33:56

by David Ford

[permalink] [raw]
Subject: Re: [2.4.1] system goes glacial, Reiser on /usr doesn't sync

Kevin Turner wrote:

> Version:
> Linux version 2.4.1-pre12 (gcc version 2.95.3 20010125 (prerelease))
>
> Possible suspect players:
> dpkg seems to trigger the bug
> ReiserFS is the partition that doesn't sync
> binfmt_misc shows up in the call traces.
>
> Symptoms:
>
> The system assumes glacial speeds. If you're *lucky*, you'll see one
> widget re-paint in X before the next ice age. Ctrl-alt-delete is
> unresponsive, as are attempts to start proccesses via the network or
> joystick port. Keypresses to programs such as getty are not echoed.
> All program output to console and network is stopped dead. If you leave
> for a several-hour-long coffee break and come back to it, there's still
> no evidence that you banged on the keyboard.


Wild shot in the dark....I'd lay odds that you had about 6-7 Megs free
in your buffers/cache line, yes?

-d

2001-02-20 12:25:10

by Kevin Turner

[permalink] [raw]
Subject: Re: [2.4.1] system goes glacial, Reiser on /usr doesn't sync

On Tue, Feb 20, 2001 at 03:33:33AM -0800, David wrote:
> Wild shot in the dark....I'd lay odds that you had about 6-7 Megs free
> in your buffers/cache line, yes?

David! You're psychic!


SysRq: Show Memory
Mem-info:
Free pages: 712kB ( 0kB HighMem)
( Active: 1779, inactive_dirty: 1507, inactive_clean: 0, free: 178 (192 384 576) )
0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 1*512kB 0*1024kB 0*2048kB = 512kB)
0*4kB 1*8kB 0*16kB 0*32kB 1*64kB 1*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB = 200kB)
= 0kB)
Swap cache: add 73994, delete 73947, find 19040/117035
Free swap: 206664kB
12288 pages of RAM
0 pages of HIGHMEM
653 reserved pages
4740 pages shared
47 pages swap cached
0 pages in page table cache
Buffer memory: 6028kB

2001-02-20 12:34:50

by Kevin Turner

[permalink] [raw]
Subject: Re: [2.4.1] system goes glacial, Reiser on /usr doesn't sync

Thanks to Keith who pointed out that "klogd conversion of addresses to
symbols is a pile of crud." Here's what I'm getting out of ksymoops
now. It's not as much, since it's just the parts I copied down by
hand... (I'll get it next time. Whenever that happens to be.
Installing the ksymoops package didn't trigger it.)


from magic-showPc:


EIP: 0010:[<c0126533>] CPU: 0 EFLAGS: 00000207
Using defaults from ksymoops -t elf32-i386 -a i386
EAX: c0206f98 EBX: c10af1e0 ECX: c10af1fc EDX: c1091200
ESI: c10af1fc EDI: 00000000 EBP: 000005ac DS: 0018 ES: 0018
CR0: 8005003b CR2: 08052beb CR3: 00839000 CR4: 00000010
Call Trace: [<c0126dec>] [<c0126f66>] [<c0127c58>] [<c0127d0c>] [<c01283b1>] [<c011e0ea>] [<c011e140>]
[<c011e485>] [<c010f83c>] [<c010f704>] [<c0110336>] [<c0127d5e>] [<c0127d84>] [<c0139cea>] [<c0138fe8>]
[<c0108ee4>] [<c013a3e4>] [<c0108da3>]
Warning (Oops_read): Code line not seen, dumping what data is available

>>EIP; c0126533 <page_launder+2d3/8a0> <=====
Trace; c0126dec <do_try_to_free_pages+34/7c>
Trace; c0126f66 <try_to_free_pages+22/2c>
Trace; c0127c58 <__alloc_pages+230/2d0>
Trace; c0127d0c <__get_free_pages+14/20>
Trace; c01283b1 <read_swap_cache_async+31/a0>
Trace; c011e0ea <swapin_readahead+8e/c4>
Trace; c011e140 <do_swap_page+20/114>
Trace; c011e485 <handle_mm_fault+fd/154>
Trace; c010f83c <do_page_fault+138/3fc>
Trace; c010f704 <do_page_fault+0/3fc>
Trace; c0110336 <schedule+26a/394>
Trace; c0127d5e <__free_pages+1a/1c>
Trace; c0127d84 <free_pages+24/28>
Trace; c0139cea <poll_freewait+3a/44>
Trace; c0138fe8 <do_fcntl+14c/204>
Trace; c0108ee4 <error_code+34/40>
Trace; c013a3e4 <sys_select+3bc/494>
Trace; c0108da3 <system_call+33/40>

The running dpkg process:

Warning (Oops_read): Code line not seen, dumping what data is available

Trace; c0127b9c <__alloc_pages+174/2d0>
Trace; c012f932 <grow_buffers+3e/16c>
Trace; c012dad4 <refill_freelist+1c/30>
Trace; c012dae1 <refill_freelist+29/30>
Trace; c012dec2 <getblk+f2/108>
Trace; c38b7378 <[reiserfs].bss.end+45979/af661>
Trace; c386226b <[reiserfs]do_journal_end+633/ab4>
Trace; c3897b78 <[reiserfs].bss.end+26179/af661>
Trace; c386090c <[reiserfs]do_journal_begin_r+188/258>
Trace; c012c0bc <filp_close+5c/64>
Trace; c0108da3 <system_call+33/40>

The running dpkg process, several minutes later is the same, but
the "bss.end" line above the 'do_journal_end' call reads:

Trace; c38b73e4 <[reiserfs].bss.end+459e5/af661>




2001-02-20 16:26:51

by Chris Mason

[permalink] [raw]
Subject: Re: [2.4.1] system goes glacial, Reiser on /usr doesn't sync



On Tuesday, February 20, 2001 03:33:33 AM -0800 David <[email protected]>
wrote:

> Kevin Turner wrote:
>
>> Version:
>> Linux version 2.4.1-pre12 (gcc version 2.95.3 20010125 (prerelease))
>>
>> Possible suspect players:
>> dpkg seems to trigger the bug
>> ReiserFS is the partition that doesn't sync
>> binfmt_misc shows up in the call traces.
>>
>> Symptoms:
>>
>> The system assumes glacial speeds. If you're *lucky*, you'll see one
>> widget re-paint in X before the next ice age. Ctrl-alt-delete is
>> unresponsive, as are attempts to start proccesses via the network or
>> joystick port. Keypresses to programs such as getty are not echoed.
>> All program output to console and network is stopped dead. If you leave
>> for a several-hour-long coffee break and come back to it, there's still
>> no evidence that you banged on the keyboard.
>
>
> Wild shot in the dark....I'd lay odds that you had about 6-7 Megs free in
> your buffers/cache line, yes?
>

David, have any of Rik's patches helped here?

-chris

2001-02-23 23:45:47

by Dwayne Litzenberger

[permalink] [raw]
Subject: Re: [2.4.1] system goes glacial, Reiser on /usr doesn't sync

I have the same problem with 2.4.1 (and 2.4.2). Two processes that are
actively using the disk (multiple files) seem to deadlock the system. Killing
the right process (SysRq-K) seems to fix things.

I'm kind of new to kernel debugging. Anyone want to guide me through it?


--
Dwayne C. Litzenberger - [email protected]

- Please always Cc to me when replying to me on the lists.
- See the mail headers for GPG/advertising/homepage information.


Attachments:
(No filename) (448.00 B)
(No filename) (240.00 B)
Download all attachments

2001-02-24 02:21:57

by Dwayne Litzenberger

[permalink] [raw]
Subject: Re: [2.4.1] system goes glacial, Reiser on /usr doesn't sync

I think Alan fixed it.

I've been running 2.4.2-ac3 under heavy load for about a half-hour now under
heavy disk usage (apt-get + 2 kernel builds + Netscape/X11), and it hasn't
locked up yet.

--
Dwayne C. Litzenberger - [email protected]

- Please always Cc to me when replying to me on the lists.
- See the mail headers for GPG/advertising/homepage information.


Attachments:
(No filename) (362.00 B)
(No filename) (240.00 B)
Download all attachments