2003-01-27 07:01:09

by Andrew Morton

Subject: 2.5.59-mm6


http://www.zip.com.au/~akpm/linux/patches/2.5/2.5.59/2.5.59-mm6/

. Some rework and restructuring of the anticipatory scheduling code.

The reported slowdown in RAID1 rebuild _may_ have been fixed. At least,
it doesn't happen for me with this patchset.

. The request aliasing problem hasn't been fixed yet, so this kernel (and
2.5.59) will still fail under heavy direct-IO load.

. The mysterious "machine hangs late in boot" problem has been narrowed
down thanks to some great work by Andres Salomon. The machine is stuck
waiting on I/O completion when performing the initial lookup for
/sbin/devfs_helper:

Thread 11 (Thread 10):
#0 io_schedule () at include/asm/atomic.h:122
#1 0xc014cd0a in __wait_on_buffer (bh=0xd3fe45b0) at fs/buffer.c:132
#2 0xc014dfa6 in __bread_slow (bh=0xd3fe45b0)
at include/linux/buffer_head.h:260
#3 0xc014e1c8 in __bread (bdev=0x0, block=0, size=0) at fs/buffer.c:1385
#4 0xc0181774 in ext3_get_inode_loc (inode=0xd3d697bc, iloc=0xd3d13ce0)
at include/linux/buffer_head.h:235
#5 0xc0181841 in ext3_read_inode (inode=0xd3d697bc) at fs/ext3/inode.c:2205
#6 0xc0183db4 in ext3_lookup (dir=0x0, dentry=0xd3d4cae0)
at include/linux/fs.h:1199
#7 0xc01585fb in real_lookup (parent=0xd3d4cce0, name=0xd3d13d94, flags=0)
at fs/namei.c:372
#8 0xc0158849 in do_lookup (nd=0xd3d4cae0, name=0xd3d13d94, path=0xd3d13d84,
cached_path=0xd3d13d8c, flags=-1071144428) at fs/namei.c:537
#9 0xc01589ef in link_path_walk (name=0x0, nd=0xd3d13dc8) at fs/namei.c:651
#10 0xc01558c1 in open_exec (name=0x0) at fs/exec.c:454
#11 0xc0156200 in do_execve (filename=0xd3d6d000 "/sbin/devfs_helper",
argv=0xc133bd08, envp=0xd3d13dc8, regs=0x0) at fs/exec.c:1032
#12 0xc0107e0d in sys_execve (regs=
{ebx = -1071125472, ecx = -1053573880, edx = -1071125308, esi =
-740448672, edi = 0, ebp = -741261356, eax = 11, xds = -1072562053,

Which _looks_ like a request queueing problem, but Andres says it goes
away when devfs is disabled in config. So I've dropped the smalldevfs
patch for now - would be appreciated if devfs users could retest this
patch, with CONFIG_DEVFS=y.

. There appears to be a CPU utilisation problem with
reiserfs_file_write.patch - but it doesn't oops or corrupt data so I've
left that in for now while Oleg scratches his head over that one.


Changes since 2.5.59-mm5:

-devfs-fix.patch

This might have caused interactions with Adam's patch (which isn't here
anyway), so leave it out.

+sync-fix.patch

Fix rare data loss problem with ext2 and heavy use of sync()

+direct-io-ENOSPC-fix.patch

Fix inode accounting error which occurs when an O_DIRECT write hits ENOSPC.

+frlock-xtime.patch
+frlock-xtime-i386.patch
+frlock-xtime-ia64.patch
+frlock-xtime-other.patch

An alternative version of the lockless gettimeofday() patch. Needs testing
on other architectures.

+inode-accounting-race-fix.patch

Fix SMP race in i_blocks/i_bytes accounting.

-lockless-current_kernel_time.patch

Replaced by the frlock version.

+agp-warning-fix.patch

Fix a warning

+slab-poisoning-fix.patch

Slab debug fix

+modversions.patch

Resurrect module versioning support

+pcmcia_timer_init.patch

Timer initialisation fixes

+no_space_in_slabnames.patch

/proc/slabinfo sanity

+epoll-update.patch

Latest from Davide (I think. May be latest-but-one)

+hash-warnings.patch

Compile warnings.

+discarded-section-fix.patch

Build fix

-smalldevfs.patch

Might be causing the boot hangs

+atyfb-compile-fix.patch

Build fix

+floppy-locking-fix.patch

Floppy forgot to take queue_lock

+lost-tick.patch

Keep time going forward when someone disables interrupts for ages

-exit_mmap-fix-ppc64.patch
-exit_mmap-ia64-fix.patch
+exit_mmap-fix-47.patch

Yet another take on the TASK_SIZE fix for exit_mmap()

anticipatory_io_scheduling-2_5_59-mm3.patch
+ant-cleanup.patch
+antsched-update-1.patch

Anticipatory scheduler changes



All 82 patches:

kgdb.patch

sync-fix.patch
Fix data loss problem due to sys_sync

direct-io-ENOSPC-fix.patch
direct-IO: fix i_size handling on ENOSPC

frlock-xtime.patch
fast reader locks for gettimeofday() and friends

frlock-xtime-i386.patch

frlock-xtime-ia64.patch

frlock-xtime-other.patch

inode-accounting-race-fix.patch
Fix inode size accounting race

vmlinux-fix.patch
vmlinux fix

maestro-fix.patch
Compile fix in sound/oss/maestro.c

deadline-np-42.patch
(undescribed patch)

deadline-np-43.patch
(undescribed patch)

setuid-exec-no-lock_kernel.patch
remove lock_kernel() from exec of setuid apps

buffer-debug.patch
buffer.c debugging

warn-null-wakeup.patch

reiserfs-readpages.patch
reiserfs v3 readpages support

fadvise.patch
implement posix_fadvise64()

ext3-scheduling-storm.patch
ext3: fix scheduling storm and lockups

auto-unplug.patch
self-unplugging request queues

less-unplugging.patch
Remove most of the blk_run_queues() calls

scheduler-tunables.patch
scheduler tunables

htlb-2.patch
hugetlb: fix MAP_FIXED handling

kirq.patch

kirq-up-fix.patch
Subject: Re: 2.5.59-mm1

agp-warning-fix.patch
fix agp compile warning

ext3-truncate-ordered-pages.patch
ext3: explicitly free truncated pages

prune-icache-stats.patch
add stats for page reclaim via inode freeing

vma-file-merge.patch

mmap-whitespace.patch

read_cache_pages-cleanup.patch
cleanup in read_cache_pages()

remove-GFP_HIGHIO.patch
remove __GFP_HIGHIO

quota-lockfix.patch
quota locking fix

quota-offsem.patch
quota semaphore fix

slab-poisoning-fix.patch
slab poison checking fix

oprofile-p4.patch

oprofile_cpu-as-string.patch
oprofile cpu-as-string

preempt-locking.patch
Subject: spinlock efficiency problem [was 2.5.57 IO slowdown with CONFIG_PREEMPT enabled]

wli-11_pgd_ctor.patch
(undescribed patch)

wli-11_pgd_ctor-update.patch
pgd_ctor update

stack-overflow-fix.patch
stack overflow checking fix

ext2-allocation-failure-fix.patch
Subject: [PATCH] ext2 allocation failures

ext2_new_block-fixes.patch
ext2_new_block cleanups and fixes

hangcheck-timer.patch
hangcheck-timer

slab-irq-fix.patch
slab IRQ fix

Richard_Henderson_for_President.patch
Subject: [PATCH] Richard Henderson for President!

parenthesise-pgd_index.patch
Subject: i386 pgd_index() doesn't parenthesize its arg

sendfile-security-hooks.patch
Subject: [RFC][PATCH] Restore LSM hook calls to sendfile

macro-double-eval-fix.patch
Subject: Re: i386 pgd_index() doesn't parenthesize its arg

mmzone-parens.patch
asm-i386/mmzone.h macro paren/eval fixes

blkdev-fixes.patch
blkdev.h fixes

modversions.patch
Subject: [PATCH] new modversions

pcmcia_timer_init.patch
pcmcia timer initialisation fixes

no_space_in_slabnames.patch
remove spaces from slab names

remove-will_become_orphaned_pgrp.patch
remove will_become_orphaned_pgrp()

buffer-io-accounting.patch
correct wait accounting in wait_on_buffer()

aic79xx-linux-2.5.59-20030122.patch
aic7xxx update

MAX_IO_APICS-ifdef.patch
MAX_IO_APICS #ifdef'd wrongly

dac960-error-retry.patch
Subject: [PATCH] linux2.5.56 patch to DAC960 driver for error retry

epoll-update.patch
epoll timeout and syscall return types ...

topology-remove-underbars.patch
Remove __ from topology macros

mandlock-oops-fix.patch
ftruncate/truncate oopses with mandatory locking

put_user-warning-fix.patch
Subject: Re: Linux 2.5.59

hash-warnings.patch
fix #warning's

discarded-section-fix.patch
Subject: [PATCH] discarded section errors (2.5.59)

reiserfs_file_write.patch
Subject: reiserfs file_write patch

atyfb-compile-fix.patch
atyfb compilation fix

floppy-locking-fix.patch
floppy locking fix

lost-tick.patch
Lost tick compensation

sound-firmware-load-fix.patch
soundcore.c referenced non-existent errno variable

generic_file_readonly_mmap-fix.patch
Fix generic_file_readonly_mmap()

seq_file-page-defn.patch
Include <asm/page.h> in fs/seq_file.c, as it uses PAGE_SIZE

exit_mmap-fix-47.patch

show_task-fix.patch
Subject: [PATCH] 2.5.59: show_task() oops

scsi-iothread.patch
scsi_eh_* needs to run even during suspend

numaq-ioapic-fix2.patch
NUMAQ io_apic programming fix

misc.patch
misc fixes

writeback-sync-cleanup.patch
Remove unneeded code in fs/fs-writeback.c

dont-wait-on-inode.patch
Fix latencies during writeback

unlink-latency-fix.patch
fix i_sem contention in sys_unlink()

anticipatory_io_scheduling-2_5_59-mm3.patch
Subject: [PATCH] 2.5.59-mm3 antic io sched

ant-cleanup.patch

antsched-update-1.patch
Subject: [PATCH] 2.5.59-snap2 updates



2003-01-27 08:09:29

by Andres Salomon

Subject: Re: 2.5.59-mm6

This one boots for me (with devfs enabled). I got some rather interesting
stack dumps, however, during boot.


Linux version 2.5.59 (dilinger@pea) (gcc version 3.2.2 20030124 (Debian prerelease)) #4 Mon Jan 27 03:02:50 EST 2003
Video mode to be used for restore is f00
BIOS-provided physical RAM map:
BIOS-e820: 0000000000000000 - 000000000009fc00 (usable)
BIOS-e820: 000000000009fc00 - 00000000000a0000 (reserved)
BIOS-e820: 00000000000c0000 - 00000000000cc000 (reserved)
BIOS-e820: 0000000000100000 - 0000000013fec000 (usable)
BIOS-e820: 0000000013fec000 - 0000000013ff0000 (reserved)
BIOS-e820: 00000000ffe00000 - 0000000100000000 (reserved)
319MB LOWMEM available.
On node 0 totalpages: 81900
DMA zone: 4096 pages, LIFO batch:1
Normal zone: 77804 pages, LIFO batch:16
HighMem zone: 0 pages, LIFO batch:1
Dell Inspiron with broken BIOS detected. Refusing to enable the local APIC.
Building zonelist for node : 0
Kernel command line: auto BOOT_IMAGE=Linux-2.5 ro root=302 devfs=mount gdb gdbttyS=1 gdbbaud=115200
Initializing CPU#0
PID hash table entries: 2048 (order 11: 16384 bytes)
Detected 498.395 MHz processor.
Console: colour VGA+ 80x25

Warning! Detected 2173 micro-second gap between interrupts.
Compensating for 1 lost ticks.
Call Trace:
[<c010b8a8>] handle_IRQ_event+0x38/0x60
[<c010bade>] do_IRQ+0xae/0x160
[<c0105000>] _stext+0x0/0x30
[<c010a150>] common_interrupt+0x18/0x20
[<c0105000>] _stext+0x0/0x30

Calibrating delay loop... 985.08 BogoMIPS
Memory: 321540k/327600k available (1328k kernel code, 5320k reserved, 396k data, 120k init, 0k highmem)
Dentry cache hash table entries: 65536 (order: 7, 524288 bytes)
Inode-cache hash table entries: 32768 (order: 6, 262144 bytes)
Mount-cache hash table entries: 512 (order: 0, 4096 bytes)
-> /dev
-> /dev/console
-> /root
CPU: L1 I cache: 16K, L1 D cache: 16K
CPU: L2 cache: 128K
CPU: After generic, caps: 0383f9ff 00000000 00000000 00000000
Intel machine check architecture supported.
Intel machine check reporting enabled on CPU#0.
CPU: Intel Celeron (Coppermine) stepping 03
Enabling fast FPU save and restore... done.
Enabling unmasked SIMD FPU exception support... done.
Checking 'hlt' instruction... OK.
POSIX conformance testing by UNIFIX
Serial: 8250/16550 driver $Revision: 1.90 $ IRQ sharing disabled
tts/1 at I/O 0x2f8 (irq = 3) is a 16550A
Waiting for connection from remote gdb... <4>
Warning! Detected 6839271 micro-second gap between interrupts.
Compensating for 6838 lost ticks.
Call Trace:
[<c010b8a8>] handle_IRQ_event+0x38/0x60
[<c010bade>] do_IRQ+0xae/0x160
[<c010a850>] do_int3+0x0/0x80
[<c010a150>] common_interrupt+0x18/0x20
[<c010a850>] do_int3+0x0/0x80
[<c0115df8>] handle_exception+0x7a8/0x7f0
[<c01c905f>] vt_console_print+0x21f/0x310
[<c0105000>] _stext+0x0/0x30
[<c0115e7d>] breakpoint+0xd/0x10
[<c010a850>] do_int3+0x0/0x80
[<c010a8c9>] do_int3+0x79/0x80
[<c011e2d8>] release_console_sem+0xd8/0xe0
[<c010a1ed>] error_code+0x2d/0x38
[<c0105000>] _stext+0x0/0x30
[<c0115e7d>] breakpoint+0xd/0x10
[<c01cafb2>] gdb_hook+0xa2/0xf0
[<c01cae80>] gdb_interrupt+0x0/0x80

Connected.
Linux NET4.0 for Linux 2.4
Based upon Swansea University Computer Society NET3.039
Initializing RT netlink socket
mtrr: v2.0 (20020519)
PCI: PCI BIOS revision 2.10 entry at 0xfc0be, last bus=1
PCI: Using configuration type 1
BIO: pool of 256 setup, 14Kb (56 bytes/bio)
biovec pool[0]: 1 bvecs: 256 entries (12 bytes)
biovec pool[1]: 4 bvecs: 256 entries (48 bytes)
biovec pool[2]: 16 bvecs: 256 entries (192 bytes)

...and so on


On Sun, 26 Jan 2003 23:10:15 -0800, Andrew Morton wrote:

>
> http://www.zip.com.au/~akpm/linux/patches/2.5/2.5.59/2.5.59-mm6/
>
[...]
>
> antsched-update-1.patch
> Subject: [PATCH] 2.5.59-snap2 updates


2003-01-27 08:15:53

by Joshua Kwan

Subject: Re: 2.5.59-mm6

On Mon, Jan 27, 2003 at 03:17:54AM -0500, Andres Salomon wrote:
> This one boots for me (with devfs enabled). I got some rather interesting
> stack dumps, however, during boot.

I'm experiencing similar problems without devfs...

> Warning! Detected 2173 micro-second gap between interrupts.
> Compensating for 1 lost ticks.
> Call Trace:
> [<c010b8a8>] handle_IRQ_event+0x38/0x60
> [<c010bade>] do_IRQ+0xae/0x160
> [<c0105000>] _stext+0x0/0x30
> [<c010a150>] common_interrupt+0x18/0x20
> [<c0105000>] _stext+0x0/0x30

Each of these warnings reproduces for each input device on my system
(there are 3 now, so if i disconnect, say, my USB mouse, there will be
only 2.)

In other news (this happened in -mm5, not sure if this happened to
others or not:)

Hangcheck: starting hangcheck timer 0.5.0 (tick is 180 seconds, margin
is 60 seconds).
Uninitialised timer!
This is just a warning. Your computer is OK
function=0xc0216100, data=0x0
Call Trace:
[<c0121ce1>] check_timer_failed+0x61/0x70
[<c0216100>] hangcheck_fire+0x0/0xc0
[<c0121e5f>] mod_timer+0x2f/0x180
[<c0105075>] init+0x35/0x160
[<c0105040>] init+0x0/0x160
[<c010713d>] kernel_thread_helper+0x5/0x18

No visible problems though, at all.

-Josh



2003-01-27 08:31:50

by Andrew Morton

Subject: Re: 2.5.59-mm6

Joshua Kwan wrote:
>
> On Mon, Jan 27, 2003 at 03:17:54AM -0500, Andres Salomon wrote:
> > This one boots for me (with devfs enabled). I got some rather interesting
> > stack dumps, however, during boot.
>
> I'm experiencing similar problems without devfs...
>
> > Warning! Detected 2173 micro-second gap between interrupts.
> > Compensating for 1 lost ticks.
> > Call Trace:
> > [<c010b8a8>] handle_IRQ_event+0x38/0x60
> > [<c010bade>] do_IRQ+0xae/0x160
> > [<c0105000>] _stext+0x0/0x30
> > [<c010a150>] common_interrupt+0x18/0x20
> > [<c0105000>] _stext+0x0/0x30
>
> Each of these warnings reproduces for each input device on my system
> (there are 3 now, so if i disconnect, say, my USB mouse, there will be
> only 2.)

This is debug stuff - it tells us which drivers are disabling interrupts for
more than one or two clock ticks. Please send the full trace so we can bug
the maintainers into fixing the drivers up.
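
The arithmetic behind these warnings can be sketched as follows. This is a minimal user-space model, not the code in lost-tick.patch: it assumes a 1000 microsecond tick (HZ=1000), which is consistent with every gap/lost-tick pair quoted in this thread, and the names TICK_USEC and lost_ticks are made up for illustration.

```python
# Illustrative model only; not the actual lost-tick.patch code.
TICK_USEC = 1000  # assumed tick length in microseconds (HZ=1000)


def lost_ticks(gap_usec: int, tick_usec: int = TICK_USEC) -> int:
    """Return how many timer ticks went missing in a gap between interrupts.

    One tick is accounted for by the interrupt being handled right now,
    so only the whole ticks beyond the first count as lost.
    """
    return max(gap_usec // tick_usec - 1, 0)


if __name__ == "__main__":
    # Gaps (in microseconds) reported in this thread.
    for gap in (2173, 30879, 113343, 6839271):
        print(f"{gap} us gap -> {lost_ticks(gap)} lost ticks")
```

A gap shorter than two ticks yields zero lost ticks in this model, which is why only longer stretches with interrupts disabled show up as warnings.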

> In other news (this happened in -mm5, not sure if this happened to
> others or not:)
>
> Hangcheck: starting hangcheck timer 0.5.0 (tick is 180 seconds, margin
> is 60 seconds).
> Uninitialised timer!
> This is just a warning. Your computer is OK
> function=0xc0216100, data=0x0
> Call Trace:
> [<c0121ce1>] check_timer_failed+0x61/0x70
> [<c0216100>] hangcheck_fire+0x0/0xc0
> [<c0121e5f>] mod_timer+0x2f/0x180
> [<c0105075>] init+0x35/0x160
> [<c0105040>] init+0x0/0x160
> [<c010713d>] kernel_thread_helper+0x5/0x18

Ah, bug. Thanks, I shall repair that.

> No visible problems though, at all.
>

No, the uninitialised timer detector fixes the timer up.
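
The behaviour described here, warn and then repair the timer so that the caller still works, might be modelled like this. It is a user-space sketch under assumptions: the Timer class, the TIMER_MAGIC value and the return convention are invented for illustration and are not the kernel's struct timer_list or its mod_timer().

```python
# Hypothetical model of an "uninitialised timer" check; all names are invented.
TIMER_MAGIC = 0x4B87AD6E  # arbitrary marker meaning "init_timer() was called"


class Timer:
    def __init__(self, function, data=0):
        self.function = function
        self.data = data
        self.magic = 0        # stays 0 until init_timer() runs
        self.expires = None


def init_timer(timer: Timer) -> None:
    """Mark the timer as properly initialised."""
    timer.magic = TIMER_MAGIC


def mod_timer(timer: Timer, expires: int, log=print) -> bool:
    """Arm the timer; return True if it had to be repaired first."""
    repaired = False
    if timer.magic != TIMER_MAGIC:
        log("Uninitialised timer!")
        log("This is just a warning. Your computer is OK")
        init_timer(timer)     # fix the timer up so the caller still works
        repaired = True
    timer.expires = expires
    return repaired
```

In this model the warning appears only on the first mod_timer() call for a given timer, since the repair sets the marker.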


2003-01-27 08:37:06

by Joshua Kwan

Subject: Re: 2.5.59-mm6

On Mon, Jan 27, 2003 at 12:40:59AM -0800, Andrew Morton wrote:
[snip]

> This is debug stuff - it tells us which drivers are disabling interrupts for
> more than one or two clock ticks. Please send the full trace so we can bug
> the maintainers into fixing the drivers up.
>

Sure:

------
Warning! Detected 30879 micro-second gap between interrupts.
Compensating for 29 lost ticks.
Call Trace:
[<c010a948>] handle_IRQ_event+0x38/0x60
[<c010ab77>] do_IRQ+0x97/0x120
[<c010957c>] common_interrupt+0x18/0x20
[<c02601f4>] i8042_command+0x94/0xc0
[<c02602b6>] i8042_aux_write+0x36/0x70
[<c025e1cd>] atkbd_sendbyte+0x7d/0x80
[<c025e2b1>] atkbd_command+0xe1/0xf0
[<c025e64b>] atkbd_probe+0x12b/0x180
[<c025e96a>] atkbd_connect+0x25a/0x2b0
[<c025fb93>] serio_find_dev+0x53/0x60
[<c0105075>] init+0x35/0x160
[<c0105040>] init+0x0/0x160
[<c010713d>] kernel_thread_helper+0x5/0x18

Warning! Detected 113343 micro-second gap between interrupts.
Compensating for 112 lost ticks.
Call Trace:
[<c010a948>] handle_IRQ_event+0x38/0x60
[<c010ab77>] do_IRQ+0x97/0x120
[<c010957c>] common_interrupt+0x18/0x20
[<c02601f4>] i8042_command+0x94/0xc0
[<c0260436>] i8042_close+0x46/0x90
[<c025ff81>] serio_close+0x11/0x20
[<c025e989>] atkbd_connect+0x279/0x2b0
[<c025fb93>] serio_find_dev+0x53/0x60
[<c0105075>] init+0x35/0x160
[<c0105040>] init+0x0/0x160
[<c010713d>] kernel_thread_helper+0x5/0x18

Warning! Detected 30145 micro-second gap between interrupts.
Compensating for 29 lost ticks.
Call Trace:
[<c010a948>] handle_IRQ_event+0x38/0x60
[<c010ab77>] do_IRQ+0x97/0x120
[<c010957c>] common_interrupt+0x18/0x20
[<c02601f4>] i8042_command+0x94/0xc0
[<c0260436>] i8042_close+0x46/0x90
[<c025ff81>] serio_close+0x11/0x20
[<c025fa7e>] psmouse_connect+0x19e/0x1c0
[<c025fb93>] serio_find_dev+0x53/0x60
[<c0105075>] init+0x35/0x160
[<c0105040>] init+0x0/0x160
[<c010713d>] kernel_thread_helper+0x5/0x18
---
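
One way to read traces like these is to look at the first frame listed after common_interrupt, which is usually the code that was running when the delayed interrupt finally arrived. The helper below is a rough sketch of that heuristic; interrupted_symbol is a made-up name, and stale entries on the stack can make the guess wrong.

```python
import re
from typing import Optional

# Matches lines such as "[<c02601f4>] i8042_command+0x94/0xc0".
FRAME_RE = re.compile(r"\[<([0-9a-f]+)>\]\s+(\w+)\+0x[0-9a-f]+/0x[0-9a-f]+")


def interrupted_symbol(trace: str) -> Optional[str]:
    """Return the symbol of the first frame after common_interrupt, if any."""
    symbols = [m.group(2) for m in FRAME_RE.finditer(trace)]
    if "common_interrupt" not in symbols:
        return None
    idx = symbols.index("common_interrupt")
    return symbols[idx + 1] if idx + 1 < len(symbols) else None


if __name__ == "__main__":
    sample = """
    [<c010a948>] handle_IRQ_event+0x38/0x60
    [<c010ab77>] do_IRQ+0x97/0x120
    [<c010957c>] common_interrupt+0x18/0x20
    [<c02601f4>] i8042_command+0x94/0xc0
    [<c02602b6>] i8042_aux_write+0x36/0x70
    """
    print(interrupted_symbol(sample))
```

Applied to the three traces above, this heuristic points at i8042_command each time.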
>> Each of these warnings reproduces for each input device on my system
>> (there are 3 now, so if i disconnect, say, my USB mouse, there will be
>> only 2.)

A closer look tells me that this isn't quite true. Sorry..

Regards
Josh



2003-01-27 09:41:58

by Helge Hafting

Subject: Re: 2.5.59-mm6

Andrew Morton wrote:
>
> Which _looks_ like a request queueing problem, but Andres says it goes
> away when devfs is disabled in config. So I've dropped the smalldevfs
> patch for now - would be appreciated if devfs users could retest this
> patch, with CONFIG_DEVFS=y.

mm6 works where mm5 failed. You are probably right to suspect devfs:
I have devfs enabled although I don't actually use it. No problems
with RAID1 either.

I enabled the hangcheck timer, and get this now and then:

Warning! Detected 2106 micro-second gap between interrupts.
Compensating for 1 lost ticks.
Call Trace:
[<c010a6ad>] handle_IRQ_event+0x29/0x4c
[<c010a881>] do_IRQ+0xbd/0x138
[<c0106cc0>] default_idle+0x0/0x28
[<c0106cc0>] default_idle+0x0/0x28
[<c01093e0>] common_interrupt+0x18/0x20
[<c0106cc0>] default_idle+0x0/0x28
[<c0106cc0>] default_idle+0x0/0x28
[<c0106ce3>] default_idle+0x23/0x28
[<c0106d63>] cpu_idle+0x37/0x48
[<c0105000>] rest_init+0x0/0x50
[<c010504d>] rest_init+0x4d/0x50

Warning! Detected 2043 micro-second gap between interrupts.
Compensating for 1 lost ticks.
Call Trace:
[<c010a6ad>] handle_IRQ_event+0x29/0x4c
[<c010a881>] do_IRQ+0xbd/0x138
[<c0106cc0>] default_idle+0x0/0x28
[<c0106cc0>] default_idle+0x0/0x28
[<c01093e0>] common_interrupt+0x18/0x20
[<c0106cc0>] default_idle+0x0/0x28
[<c0106cc0>] default_idle+0x0/0x28
[<c0106ce3>] default_idle+0x23/0x28
[<c0106d63>] cpu_idle+0x37/0x48
[<c0105000>] rest_init+0x0/0x50
[<c010504d>] rest_init+0x4d/0x50


Helge Hafting

2003-01-27 11:16:43

by Luuk van der Duim

Subject: Re: 2.5.59-mm6

Hello mm-users,


> . The mysterious "machine hangs late in boot" problem has been narrowed
> down thanks to some great work by Andres Salomon. The machine is stuck
> waiting on I/O completion when performing the initial lookup for
> /sbin/devfs_helper:


I don't believe it to be an exclusively small-devfs helper problem.

It is an interaction at best. Sure I had problems using devfs-small, but
mm2 worked and mm3 was the first that halted during boot. Both have
devfs-small, and both need its helper. Or am I missing a subtlety here?

Secondly, Andrew sent me a rollup of patches against 2.5.59 he thought
were suspicious, without smalldevfs and it also halted, but at another
place in boot, at adding swap.

Can someone besides me confirm this behavior or am I the loon who just
won't understand?

Luuk





2003-01-27 18:44:26

by Mike Galbraith

Subject: Re: 2.5.59-mm6

At 01:27 PM 1/27/2003 +0100, Luuk van der Duim wrote:
>Hello mm-users,
>
>
> . The mysterious "machine hangs late in boot" problem has been narrowed
> down thanks to some great work by Andres Salomon. The machine is stuck
> waiting on I/O completion when performing the initial lookup for
> /sbin/devfs_helper:
>
>
>I don't believe it to be an exclusively small-devfs helper problem.

Well, my test box agrees (I have never ever used devfs, but could lock hard
in minutes). mm6 works fine here, so I _think_ it's probably resolved...

>It is an interaction at best. Sure I had problems using devfs-small, but
>mm2 worked and mm3 was the first that halted during boot. Both have
>devfs-small, and both need its helper. Or am I missing a subtlety here?

I don't think you're missing anything, but I also don't know wtf the
interaction is. I put a couple of man-days into looking for it, and came
up with exactly nada of interest.

>Secondly, Andrew sent me a rollup of patches against 2.5.59 he thought
>were suspicious, without smalldevfs and it also halted, but at another
>place in boot, at adding swap.

Mine locked hard hard hard. Booted fine, but died reliably under heavy load.

(something seems funky with nmi_watchdog... hard lock = no_more_nmi_ticks.
Anybody out there know enough about local APIC to explain why idle=poll
gives nice 1 second nmi, but everything else depends upon cpu load?... and
why when hardlock happens, it _stops_)

>Can someone besides me confirm this behavior or am I the loon who just
>won't understand?

My box agrees that you're not a loon fwTw :)

-Mike

2003-01-27 19:08:40

by Zwane Mwaikambo

Subject: Re: 2.5.59-mm6

On Mon, 27 Jan 2003, Mike Galbraith wrote:

> (something seems funky with nmi_watchdog... hard lock = no_more_nmi_ticks
> . Anybody out there know enough about local APIC to explain why idle=poll
> gives nice 1 second nmi, but everything else depends upon cpu load?... and
> why when hardlock happens, it _stops_)

Because we base the performance counter on unhalted cycles, whilst the
normal idle function does an hlt. I think the K7 can do halted too.
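
A rough model of this effect: if the watchdog NMI fires every N unhalted cycles, the wall-clock gap between NMIs stretches as the CPU spends more time halted in the idle loop. The function below is a sketch under simplifying assumptions (a counter programmed for one second's worth of cycles, a steady busy fraction); the names are illustrative and nothing here is taken from the kernel source.

```python
def nmi_interval_seconds(cpu_hz: float, busy_fraction: float,
                         idle_poll: bool = False) -> float:
    """Wall-clock seconds between watchdog NMIs in this simplified model.

    The counter is assumed to overflow after cpu_hz unhalted cycles, i.e.
    once per second on a CPU that never halts.
    """
    if not 0.0 <= busy_fraction <= 1.0:
        raise ValueError("busy_fraction must be between 0 and 1")
    cycles_per_nmi = cpu_hz
    # With idle=poll the idle loop spins instead of executing hlt, so every
    # cycle counts as unhalted regardless of load.
    unhalted_fraction = 1.0 if idle_poll else busy_fraction
    if unhalted_fraction == 0.0:
        return float("inf")  # a fully halted CPU never reaches the overflow
    return cycles_per_nmi / (cpu_hz * unhalted_fraction)
```

With these assumptions, a CPU that is 25% busy sees an NMI roughly every four seconds, while idle=poll brings it back to one per second whatever the load.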

Zwane
--
function.linuxpower.ca

2003-01-27 20:05:37

by Mike Galbraith

Subject: Re: 2.5.59-mm6

At 02:17 PM 1/27/2003 -0500, Zwane Mwaikambo wrote:
>On Mon, 27 Jan 2003, Mike Galbraith wrote:
>
> > (something seems funky with nmi_watchdog... hard lock = no_more_nmi_ticks
> > . Anybody out there know enough about local APIC to explain why idle=poll
> > gives nice 1 second nmi, but everything else depends upon cpu load?... and
> > why when hardlock happens, it _stops_)
>
>Because we base the performance counter on unhalted cycles, whilst the
>normal idle function does an hlt. I think the K7 can do halted too.

(well bugger, I _know_ I'm gonna regret this;)

When can the darn thing actually trigger an oops?

-Mike

2003-01-27 21:05:02

by Zwane Mwaikambo

Subject: Re: 2.5.59-mm6

On Mon, 27 Jan 2003, Mike Galbraith wrote:

> (well bugger, I _know_ I'm gonna regret this;)
>
> When can the darn thing actually trigger an oops?

It depends. I have seen hard locks where you don't get an oops. The NMI
watchdog will work if the kernel is still running but, say, stuck in a busy
loop without the timer interrupt firing. Sometimes upping the interval
by using idle=poll does help me out. Otherwise your CPU or kernel is
really in a bad state.
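
The condition described above, NMIs still arriving while the timer interrupt has stopped firing, might be sketched as below. This is a simplified user-space model: the class name, the threshold of five stalled NMIs and the string return value are assumptions for illustration, not the kernel's implementation.

```python
class NmiWatchdog:
    """Toy model: flag a lockup when the timer-interrupt count stops moving."""

    def __init__(self, threshold: int = 5):
        self.threshold = threshold      # assumed number of stalled NMIs
        self.last_timer_count = None
        self.stalled = 0

    def on_nmi(self, timer_irq_count: int) -> str:
        """Call on every watchdog NMI with the current timer IRQ count."""
        if timer_irq_count == self.last_timer_count:
            self.stalled += 1           # no timer interrupt since last NMI
        else:
            self.last_timer_count = timer_irq_count
            self.stalled = 0
        return "lockup" if self.stalled >= self.threshold else "ok"
```

In this model, if the NMIs stop arriving as well, on_nmi() is never called and nothing is reported, which would correspond to the hard locks without an oops.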

Zwane
--
function.linuxpower.ca