2008-03-09 18:49:48

by Christian Kujau

Subject: Re: 2.6.25-rc hangs

On Sun, 9 Mar 2008, Eric Sandeen wrote:
> I meant the test you were using to determine "good" from "bad" - I guess
> it was "boot up and do IO for a while to see if it hangs"

Maybe I was unclear, let me try again:

* 2.6.24.1 ran fine for weeks (and is running fine now)
* 2.6.25-rc3 and -rc4 both hang the system. Trying to bisect this failed
  at a really early stage:

[ check out current -git, 2.6.25-rc4 ]
$ git bisect start
$ git bisect bad # because I know that current -git is bad
$ git bisect good v2.6.24 # because I know 2.6.24 is good
[ compiling, and first reboot]
-> failed, because hard lockup.

After rebooting to a working kernel I can now do either:

1) mark the current one as "bad", solely because it does not boot and
is therefore "bad" per se, ignoring that if the box *had* booted, the
system hang might not have occurred. IOW, I'm marking it "bad"
because of a totally different issue.

2) although booting failed, I still mark it "good", which means I'm
literally *guessing* that this current kernel (bd45ac0c5daa...)
does NOT have the system hang, and that guess affects all subsequent
bisect steps. Worst case: the guess turns out to be wrong and I bisect
my way through the ~2800 revisions without ever reaching the correct
"bad" one (because they were actually all "bad").

I tried 1), but did not get any further, as the next kernel did not boot
either.
I tried 2), but did not get any further, as the next kernel did boot but
locked up when I tried to use the device mapper.

Hope that's a bit better explained than before...

Thanks,
C.
--
BOFH excuse #58:

high pressure system failure


2008-03-12 18:06:18

by Samuel Tardieu

Subject: Re: 2.6.25-rc hangs

>>>>> "Christian" == Christian Kujau <[email protected]> writes:

Christian> After rebooting to a working kernel I can now do either:

Christian> 1) mark the current one as "bad" [...]

Christian> 2) although booting failed, I still mark it "good" [...]

Can't you skip it with "git bisect skip" in this case? (I've never
tried "skip" but from "git bisect" documentation it does what you
want, and "git bisect" will select a nearby revision)
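
For illustration, here is a minimal sketch of how skipping an untestable
revision plays out. Everything in it is made up for the demo: a throwaway
repository with eight commits, a "version" file standing in for the kernel,
commit 4 playing the kernel that does not boot, and commit 6 introducing the
hang. It uses "git bisect run", where an exit code of 125 from the test
command means "cannot test this revision, skip it", the same effect as typing
"git bisect skip" by hand after a failed boot.

```shell
#!/bin/sh
# Sketch only: toy repository, not the kernel tree.
set -e

repo=$(mktemp -d)
cd "$repo"
git init -q

# Identity for the demo commits (hypothetical values).
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.invalid
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.invalid

# Eight commits; the content of "version" is the commit number.
for i in 1 2 3 4 5 6 7 8; do
    echo "$i" > version
    git add version
    git commit -q -m "commit $i"
done

# HEAD (commit 8) is known bad, commit 1 is known good.
git bisect start HEAD HEAD~7 >/dev/null

# Test command: 125 = skip, 0 = good, 1 = bad.
git bisect run sh -c '
    v=$(cat version)
    [ "$v" -eq 4 ] && exit 125   # stands in for "does not boot"
    [ "$v" -lt 6 ]               # good before commit 6, bad from 6 on
' >/dev/null

# While the bisect session is open, refs/bisect/bad names the first bad commit.
first_bad=$(git log -1 --format=%s refs/bisect/bad)
echo "first bad: $first_bad"

git bisect reset >/dev/null 2>&1
cd /
rm -rf "$repo"
```

With a kernel that has to be booted by hand, the equivalent would be to run
"git bisect skip" instead of "good" or "bad" whenever the build does not boot.
Note that if the skipped revisions sit right next to the first bad one, git
can only report a range of candidates rather than a single commit.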

Sam
--
Samuel Tardieu -- [email protected] -- http://www.rfc1149.net/

2008-03-12 21:02:52

by Christian Kujau

Subject: Re: 2.6.25-rc hangs

On Wed, 12 Mar 2008, Samuel Tardieu wrote:
> Can't you skip it with "git bisect skip" in this case? (I've never
> tried "skip" but from "git bisect" documentation it does what you
> want, and "git bisect" will select a nearby revision)

Yes, I'll try that. In the meantime I tried -rc5, but it gives me even
more headaches:

[ 1219.355352] ------------[ cut here ]------------
[ 1219.355359] WARNING: at drivers/usb/host/ehci-hcd.c:287 ehci_iaa_watchdog+0x7a/0x80()
[ 1219.355362] Modules linked in: act_police sch_ingress cls_u32 sch_sfq sch_cbq xt_tcpudp ipt_MASQUERADE iptable_nat nf_conntrack_ipv4 nf_nat_ftp nf_nat nf_conntrack_ftp xt_conntrack nf_conntrack iptable_filter ip_tables ipt_ULOG x_tables nfsd lockd nfs_acl auth_rpcgss exportfs sunrpc tun fuse sg sr_mod twofish_i586 twofish_common eeprom w83l785ts asb100 hwmon_vid hwmon usb_storage ecb zd1211rw firmware_class snd_intel8x0 snd_ac97_codec mac80211 ac97_bus snd_pcm snd_timer snd soundcore pl2303 usbserial snd_page_alloc cfg80211 i2c_nforce2 i2c_core
[ 1219.355411] Pid: 450, comm: md0_raid1 Not tainted 2.6.25-rc5 #1
[ 1219.355415] [<c0119a0f>] warn_on_slowpath+0x5f/0x90
[ 1219.355429] [<c0138d57>] __lock_acquire+0x537/0x10b0
[ 1219.355441] [<c0117062>] task_tick_fair+0x32/0x60
[ 1219.355446] [<c01159cf>] scheduler_tick+0xcf/0x200
[ 1219.355453] [<c011b47d>] profile_tick+0x3d/0x80
[ 1219.355459] [<c0102ed7>] restore_nocheck+0x12/0x15
[ 1219.355467] [<c043c9ff>] _spin_lock_irqsave+0x3f/0x50
[ 1219.355475] [<c03620ba>] ehci_iaa_watchdog+0x7a/0x80
[ 1219.355480] [<c0122748>] run_timer_softirq+0x128/0x190
[ 1219.355486] [<c0105688>] do_softirq+0x98/0xd0
[ 1219.355492] [<c0362040>] ehci_iaa_watchdog+0x0/0x80
[ 1219.355498] [<c011ecb2>] __do_softirq+0x52/0xa0
[ 1219.355503] [<c0105688>] do_softirq+0x98/0xd0
[ 1219.355507] [<c0144960>] handle_level_irq+0x0/0xe0
[ 1219.355515] [<c011ec3d>] irq_exit+0x4d/0x70
[ 1219.355518] [<c0105750>] do_IRQ+0x90/0xf0
[ 1219.355522] [<c013837c>] trace_hardirqs_on+0x9c/0x110
[ 1219.355529] [<c0103886>] common_interrupt+0x2e/0x34
[ 1219.355536] [<c043c982>] _spin_unlock_irq+0x22/0x30
[ 1219.355540] [<c02ba101>] __make_request+0xc1/0x320
[ 1219.355550] [<c02b91e7>] generic_make_request+0x177/0x230
[ 1219.355556] [<c043caf5>] _spin_unlock_irqrestore+0x45/0x60
[ 1219.355561] [<c013837c>] trace_hardirqs_on+0x9c/0x110
[ 1219.355567] [<c03705d9>] raid1d+0x419/0xcc0
[ 1219.355574] [<c0138d57>] __lock_acquire+0x537/0x10b0
[ 1219.355581] [<c01159cf>] scheduler_tick+0xcf/0x200
[ 1219.355590] [<c0102ed7>] restore_nocheck+0x12/0x15
[ 1219.355593] [<c037b6e0>] md_thread+0x0/0xe0
[ 1219.355599] [<c013837c>] trace_hardirqs_on+0x9c/0x110
[ 1219.355608] [<c043caf5>] _spin_unlock_irqrestore+0x45/0x60
[ 1219.355612] [<c037b6e0>] md_thread+0x0/0xe0
[ 1219.355616] [<c013837c>] trace_hardirqs_on+0x9c/0x110
[ 1219.355622] [<c037b6e0>] md_thread+0x0/0xe0
[ 1219.355626] [<c037b702>] md_thread+0x22/0xe0
[ 1219.355631] [<c012c6d0>] autoremove_wake_function+0x0/0x40
[ 1219.355639] [<c037b6e0>] md_thread+0x0/0xe0
[ 1219.355643] [<c012c402>] kthread+0x42/0x70
[ 1219.355647] [<c012c3c0>] kthread+0x0/0x70
[ 1219.355651] [<c0103a1f>] kernel_thread_helper+0x7/0x18
[ 1219.355657] =======================
[ 1219.355660] ---[ end trace d0567d7a35270324 ]---


@dm-devel, can you get something useful out of these traces?

Thank you,
Christian.
--
BOFH excuse #152:

My pony-tail hit the on/off switch on the power strip.