Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S964909AbWH1Xxb (ORCPT ); Mon, 28 Aug 2006 19:53:31 -0400 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S964911AbWH1Xxb (ORCPT ); Mon, 28 Aug 2006 19:53:31 -0400 Received: from relay.2ka.mipt.ru ([194.85.82.65]:36826 "EHLO 2ka.mipt.ru") by vger.kernel.org with ESMTP id S964909AbWH1Xx3 (ORCPT ); Mon, 28 Aug 2006 19:53:29 -0400 Date: Tue, 29 Aug 2006 03:46:52 +0400 From: Jim Klimov X-Mailer: The Bat! (v2.10.01) CD5BF9353B3B7091 Reply-To: Jim Klimov Organization: MIPT Campus-Net X-Priority: 3 (Normal) Message-ID: <1926236045.20060829034652@2ka.mipt.ru> To: linux-kernel@vger.kernel.org CC: linux-raid@vger.kernel.org Subject: 3ware glitches cause softraid rebuilds MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit X-Greylist: Sender IP whitelisted, not delayed by milter-greylist-1.7.5 (2ka.mipt.ru [194.85.82.65]); Tue, 29 Aug 2006 03:52:32 +0400 (MSD) Sender: linux-kernel-owner@vger.kernel.org X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 20197 Lines: 398 Hello linux-kernel, In July I wrote about problems with our Xeon/3ware fileservers which occasionally dump a stacktrace, usually of smbd somewhere in iptables routines, and reboot. Possibly this can be because one of PCI-X buses is shared by the 3Ware controller and onboard Intel Gigabit chips, and some interrupt events have a conflict? Regardless, it's not good anyway :) Then I went on vacation and didn't reply in a timely fashion to a suggestion to build a kernel with more debugging info. Now I did it with a newer kernel and wait for those bugs to repeat. Meanwhile I found a new strange behavior, not seen before. Sometimes a 3Ware card times out, and usually begins to verify itself. That's "normal", or rather usual and not considered critical. But now in exactly 10 seconds a soft lockup bug happens and the raid1 arrays (made by the linux md driver from 2 system ide drives) are rebuilt. So far this has hit us twice: Kernel: 2.6.17.11 vanilla [ 423.569222] 3w-9xxx: scsi0: AEN: INFO (0x04:0x0029): Verify started:unit=1. [ 429.688317] 3w-9xxx: scsi1: AEN: INFO (0x04:0x0029): Verify started:unit=0. [ 7042.478001] 3w-9xxx: scsi0: AEN: INFO (0x04:0x002B): Verify completed:unit=1. [ 8387.242115] 3w-9xxx: scsi1: AEN: INFO (0x04:0x003D): Verify paused:unit=0. [ 8402.400749] 3w-9xxx: scsi1: AEN: INFO (0x04:0x0029): Verify started:unit=0. [17184.832241] sd 1:0:0:0: WARNING: (0x06:0x002C): Command (0x88) timed out, resetting card. [17194.818537] BUG: soft lockup detected on CPU#0! [17194.823249] show_trace+0xd/0xf dump_stack+0x17/0x1b [17194.825809] softlockup_tick+0xa9/0xc2 run_local_timers+0x12/0x14 [17194.825834] update_process_times+0x34/0x78 smp_apic_timer_interrupt+0x58/0x60 [17194.825858] apic_timer_interrupt+0x1c/0x24 twa_reset_sequence+0x37/0x1e1 [17194.825884] twa_reset_device_extension+0xe9/0x173 twa_scsi_eh_reset+0x6e/0xe4 [17194.825908] scsi_try_host_reset+0x43/0xb0 scsi_eh_host_reset+0x2f/0xf4 [17194.825932] scsi_eh_ready_devs+0x56/0x65 scsi_unjam_host+0xa3/0x1d0 [17194.825954] scsi_error_handler+0x83/0x112 kthread+0x9f/0xa4 [17194.825977] kernel_thread_helper+0x5/0xb [17240.675713] 3w-9xxx: scsi1: AEN: INFO (0x04:0x0029): Verify started:unit=0. [19019.998212] md: syncing RAID array md0 [19019.998231] md: minimum _guaranteed_ reconstruction speed: 1000 KB/sec/disc. [19019.998243] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for reconstruction. [19019.998268] md: using 128k window, over a total of 500352 blocks. [19020.022029] md: delaying resync of md1 until md0 has finished resync (they share one or more physical units) [19020.079784] md: delaying resync of md10 until md1 has finished resync (they share one or more physical units) [19020.079813] md: delaying resync of md1 until md0 has finished resync (they share one or more physical units) [19020.265076] md: delaying resync of md11 until md10 has finished resync (they share one or more physical units) [19020.265106] md: delaying resync of md1 until md0 has finished resync (they share one or more physical units) [19020.265130] md: delaying resync of md10 until md1 has finished resync (they share one or more physical units) [19020.392908] md: delaying resync of md12 until md1 has finished resync (they share one or more physical units) [19020.392930] md: delaying resync of md10 until md12 has finished resync (they share one or more physical units) [19020.392953] md: delaying resync of md1 until md0 has finished resync (they share one or more physical units) [19020.392972] md: delaying resync of md11 until md12 has finished resync (they share one or more physical units) [19020.460745] md: delaying resync of md3 until md0 has finished resync (they share one or more physical units) [19020.468102] md: delaying resync of md5 until md3 has finished resync (they share one or more physical units) [19020.468122] md: delaying resync of md11 until md12 has finished resync (they share one or more physical units) [19020.468145] md: delaying resync of md1 until md5 has finished resync (they share one or more physical units) [19020.468162] md: delaying resync of md10 until md12 has finished resync (they share one or more physical units) [19020.468190] md: delaying resync of md3 until md0 has finished resync (they share one or more physical units) [19020.468210] md: delaying resync of md12 until md5 has finished resync (they share one or more physical units) [19020.476311] md: delaying resync of md6 until md12 has finished resync (they share one or more physical units) [19020.476334] md: delaying resync of md10 until md12 has finished resync (they share one or more physical units) [19020.476362] md: delaying resync of md1 until md5 has finished resync (they share one or more physical units) [19020.476382] md: delaying resync of md11 until md12 has finished resync (they share one or more physical units) [19020.476404] md: delaying resync of md12 until md5 has finished resync (they share one or more physical units) [19020.476434] md: delaying resync of md3 until md0 has finished resync (they share one or more physical units) [19020.476455] md: delaying resync of md5 until md3 has finished resync (they share one or more physical units) [19020.486455] md: delaying resync of md7 until md12 has finished resync (they share one or more physical units) [19020.486478] md: delaying resync of md11 until md12 has finished resync (they share one or more physical units) [19020.486503] md: delaying resync of md1 until md5 has finished resync (they share one or more physical units) [19020.486519] md: delaying resync of md10 until md12 has finished resync (they share one or more physical units) [19020.486542] md: delaying resync of md5 until md3 has finished resync (they share one or more physical units) [19020.486571] md: delaying resync of md3 until md0 has finished resync (they share one or more physical units) [19020.486593] md: delaying resync of md12 until md5 has finished resync (they share one or more physical units) [19020.486612] md: delaying resync of md6 until md12 has finished resync (they share one or more physical units) [19020.498362] md: delaying resync of md8 until md12 has finished resync (they share one or more physical units) [19020.498394] md: delaying resync of md12 until md5 has finished resync (they share one or more physical units) [19020.498414] md: delaying resync of md10 until md12 has finished resync (they share one or more physical units) [19020.498438] md: delaying resync of md1 until md5 has finished resync (they share one or more physical units) [19020.498457] md: delaying resync of md11 until md12 has finished resync (they share one or more physical units) [19020.498475] md: delaying resync of md6 until md12 has finished resync (they share one or more physical units) [19020.498509] md: delaying resync of md3 until md0 has finished resync (they share one or more physical units) [19020.498534] md: delaying resync of md5 until md3 has finished resync (they share one or more physical units) [19020.498553] md: delaying resync of md7 until md12 has finished resync (they share one or more physical units) [19020.612240] md: delaying resync of md9 until md12 has finished resync (they share one or more physical units) [19020.612309] md: delaying resync of md7 until md12 has finished resync (they share one or more physical units) [19020.612338] md: delaying resync of md5 until md3 has finished resync (they share one or more physical units) [19020.612371] md: delaying resync of md3 until md0 has finished resync (they share one or more physical units) [19020.612392] md: delaying resync of md6 until md12 has finished resync (they share one or more physical units) [19020.612412] md: delaying resync of md11 until md12 has finished resync (they share one or more physical units) [19020.612435] md: delaying resync of md1 until md5 has finished resync (they share one or more physical units) [19020.612456] md: delaying resync of md10 until md12 has finished resync (they share one or more physical units) [19020.612481] md: delaying resync of md12 until md5 has finished resync (they share one or more physical units) [19020.612501] md: delaying resync of md8 until md12 has finished resync (they share one or more physical units) [19047.955571] md: md0: sync done. [19047.996082] md: delaying resync of md8 until md12 has finished resync (they share one or more physical units) [19047.996105] md: delaying resync of md11 until md12 has finished resync (they share one or more physical units) [19047.996129] md: delaying resync of md6 until md12 has finished resync (they share one or more physical units) [19047.996143] md: delaying resync of md12 until md5 has finished resync (they share one or more physical units) [19047.996154] RAID1 conf printout: [19047.996171] md: syncing RAID array md3 [19047.996179] --- wd:2 rd:2 [19047.996191] disk 0, wo:0, o:1, dev:hdc1 [19047.996199] md: minimum _guaranteed_ reconstruction speed: 1000 KB/sec/disc. [19047.996211] disk 1, wo:0, o:1, dev:hda1 [19047.996220] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for reconstruction. [19047.996233] md: using 128k window, over a total of 5859904 blocks. [19047.998803] md: delaying resync of md5 until md3 has finished resync (they share one or more physical units) [19047.998820] md: delaying resync of md7 until md12 has finished resync (they share one or more physical units) [19047.998833] md: delaying resync of md9 until md12 has finished resync (they share one or more physical units) [19048.005192] md: delaying resync of md1 until md5 has finished resync (they share one or more physical units) [19048.005209] md: delaying resync of md10 until md12 has finished resync (they share one or more physical units) [19363.392599] md: md3: sync done. [19363.453882] md: delaying resync of md10 until md12 has finished resync (they share one or more physical units) [19363.453894] md: delaying resync of md9 until md12 has finished resync (they share one or more physical units) [19363.453926] md: delaying resync of md1 until md5 has finished resync (they share one or more physical units) [19363.453934] md: syncing RAID array md5 [19363.453942] md: minimum _guaranteed_ reconstruction speed: 1000 KB/sec/disc. [19363.453950] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for reconstruction. [19363.453963] md: using 128k window, over a total of 3906368 blocks. [19363.453970] md: delaying resync of md7 until md12 has finished resync (they share one or more physical units) [19363.453985] md: delaying resync of md6 until md12 has finished resync (they share one or more physical units) [19363.454003] md: delaying resync of md11 until md12 has finished resync (they share one or more physical units) [19363.454122] RAID1 conf printout: [19363.455970] md: delaying resync of md12 until md5 has finished resync (they share one or more physical units) [19363.455985] md: delaying resync of md8 until md12 has finished resync (they share one or more physical units) [19363.457941] --- wd:2 rd:2 [19363.461140] disk 0, wo:0, o:1, dev:hdc3 [19363.465654] disk 1, wo:0, o:1, dev:hda3 [19514.855587] md: md5: sync done. [19514.894392] md: delaying resync of md8 until md12 has finished resync (they share one or more physical units) [19514.894431] md: delaying resync of md12 until md1 has finished resync (they share one or more physical units) [19514.894467] md: syncing RAID array md1 [19514.894476] md: delaying resync of md11 until md12 has finished resync (they share one or more physical units) [19514.894489] md: minimum _guaranteed_ reconstruction speed: 1000 KB/sec/disc. [19514.894497] md: delaying resync of md6 until md12 has finished resync (they share one or more physical units) [19514.894508] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for reconstruction. [19514.894518] md: delaying resync of md7 until md12 has finished resync (they share one or more physical units) [19514.894531] md: delaying resync of md10 until md12 has finished resync (they share one or more physical units) [19514.894540] md: using 128k window, over a total of 2000256 blocks. [19514.894628] RAID1 conf printout: [19514.896747] md: delaying resync of md9 until md12 has finished resync (they share one or more physical units) [19514.898392] --- wd:2 rd:2 [19514.901370] disk 0, wo:0, o:1, dev:hdc5 [19514.905901] disk 1, wo:0, o:1, dev:hda5 [19590.739467] md: md1: sync done. [19590.797061] md: delaying resync of md10 until md12 has finished resync (they share one or more physical units) [19590.797080] md: delaying resync of md9 until md12 has finished resync (they share one or more physical units) [19590.797120] md: delaying resync of md7 until md12 has finished resync (they share one or more physical units) [19590.797131] md: syncing RAID array md12 [19590.797142] md: delaying resync of md6 until md12 has finished resync (they share one or more physical units) [19590.797162] md: minimum _guaranteed_ reconstruction speed: 1000 KB/sec/disc. [19590.797174] md: delaying resync of md11 until md12 has finished resync (they share one or more physical units) [19590.797192] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for reconstruction. [19590.797209] md: using 128k window, over a total of 5244992 blocks. [19590.797232] RAID1 conf printout: [19590.799289] md: delaying resync of md8 until md12 has finished resync (they share one or more physical units) [19590.800846] --- wd:2 rd:2 [19590.803920] disk 0, wo:0, o:1, dev:hdc2 [19590.808568] disk 1, wo:0, o:1, dev:hda2 [19942.989053] md: md12: sync done. [19943.085278] md: delaying resync of md8 until md7 has finished resync (they share one or more physical units) [19943.085291] md: delaying resync of md6 until md11 has finished resync (they share one or more physical units) [19943.085315] md: delaying resync of md9 until md8 has finished resync (they share one or more physical units) [19943.085340] md: delaying resync of md11 until md10 has finished resync (they share one or more physical units) [19943.085377] RAID1 conf printout: [19943.085386] md: syncing RAID array md7 [19943.085396] md: minimum _guaranteed_ reconstruction speed: 1000 KB/sec/disc. [19943.085404] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for reconstruction. [19943.085417] md: using 128k window, over a total of 1953408 blocks. [19943.086535] md: delaying resync of md10 until md9 has finished resync (they share one or more physical units) [19943.088928] --- wd:2 rd:2 [19943.091916] disk 0, wo:0, o:1, dev:hdc12 [19943.096249] disk 1, wo:0, o:1, dev:hda12 [20033.019114] md: md7: sync done. [20033.090440] md: delaying resync of md6 until md11 has finished resync (they share one or more physical units) [20033.090460] md: syncing RAID array md8 [20033.090470] md: minimum _guaranteed_ reconstruction speed: 1000 KB/sec/disc. [20033.090477] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for reconstruction. [20033.090489] md: using 128k window, over a total of 5859904 blocks. [20033.090571] md: delaying resync of md10 until md9 has finished resync (they share one or more physical units) [20033.090589] md: delaying resync of md9 until md8 has finished resync (they share one or more physical units) [20033.090602] md: delaying resync of md11 until md10 has finished resync (they share one or more physical units) [20033.090646] RAID1 conf printout: [20033.094083] --- wd:2 rd:2 [20033.097113] disk 0, wo:0, o:1, dev:hdc7 [20033.101318] disk 1, wo:0, o:1, dev:hda7 [20283.332772] md: md8: sync done. [20283.392384] md: delaying resync of md11 until md10 has finished resync (they share one or more physical units) [20283.392411] md: delaying resync of md6 until md11 has finished resync (they share one or more physical units) [20283.392432] RAID1 conf printout: [20283.392479] md: syncing RAID array md9 [20283.392495] md: minimum _guaranteed_ reconstruction speed: 1000 KB/sec/disc. [20283.392521] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for reconstruction. [20283.392557] md: using 128k window, over a total of 3906368 blocks. [20283.395898] md: delaying resync of md10 until md9 has finished resync (they share one or more physical units) [20283.396045] --- wd:2 rd:2 [20283.399015] disk 0, wo:0, o:1, dev:hdc8 [20283.403257] disk 1, wo:0, o:1, dev:hda8 [20452.540323] md: md9: sync done. [20452.610616] md: delaying resync of md6 until md11 has finished resync (they share one or more physical units) [20452.610633] md: syncing RAID array md10 [20452.610661] md: minimum _guaranteed_ reconstruction speed: 1000 KB/sec/disc. [20452.610669] RAID1 conf printout: [20452.610679] --- wd:2 rd:2 [20452.610692] disk 0, wo:0, o:1, dev:hdc9 [20452.610706] disk 1, wo:0, o:1, dev:hda9 [20452.611783] md: delaying resync of md11 until md10 has finished resync (they share one or more physical units) [20452.625535] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for reconstruction. [20452.625576] md: using 128k window, over a total of 3906368 blocks. [20608.982515] md: md10: sync done. [20609.027444] md: delaying resync of md6 until md11 has finished resync (they share one or more physical units) [20609.027460] md: syncing RAID array md11 [20609.027473] md: minimum _guaranteed_ reconstruction speed: 1000 KB/sec/disc. [20609.027485] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for reconstruction. [20609.027503] md: using 128k window, over a total of 65430144 blocks. [20609.027558] RAID1 conf printout: [20609.031138] --- wd:2 rd:2 [20609.034093] disk 0, wo:0, o:1, dev:hdc10 [20609.038435] disk 1, wo:0, o:1, dev:hda10 [22231.702114] sd 1:0:0:0: WARNING: (0x06:0x002C): Command (0x88) timed out, resetting card. [22241.688436] BUG: soft lockup detected on CPU#1! [22241.693331] show_trace+0xd/0xf dump_stack+0x17/0x1b [22241.695899] softlockup_tick+0xa9/0xc2 run_local_timers+0x12/0x14 [22241.695929] update_process_times+0x34/0x78 smp_apic_timer_interrupt+0x58/0x60 [22241.695961] apic_timer_interrupt+0x1c/0x24 twa_reset_sequence+0x37/0x1e1 [22241.695994] twa_reset_device_extension+0xe9/0x173 twa_scsi_eh_reset+0x6e/0xe4 [22241.696027] scsi_try_host_reset+0x43/0xb0 scsi_eh_host_reset+0x2f/0xf4 [22241.696059] scsi_eh_ready_devs+0x56/0x65 scsi_unjam_host+0xa3/0x1d0 [22241.696086] scsi_error_handler+0x83/0x112 kthread+0x9f/0xa4 [22241.696119] kernel_thread_helper+0x5/0xb [22335.486623] 3w-9xxx: scsi1: AEN: INFO (0x04:0x0029): Verify started:unit=0. [22887.014006] 3w-9xxx: scsi1: AEN: INFO (0x04:0x002B): Verify completed:unit=0. [23994.650143] md: md11: sync done. [23994.804031] md: syncing RAID array md6 [23994.804051] RAID1 conf printout: [23994.804065] md: minimum _guaranteed_ reconstruction speed: 1000 KB/sec/disc. [23994.804075] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for reconstruction. [23994.804094] md: using 128k window, over a total of 1953408 blocks. [23994.808154] --- wd:2 rd:2 [23994.811253] disk 0, wo:0, o:1, dev:hdc11 [23994.815967] disk 1, wo:0, o:1, dev:hda11 [24225.817451] md: md6: sync done. [24225.939633] RAID1 conf printout: [24225.943358] --- wd:2 rd:2 [24225.946614] disk 0, wo:0, o:1, dev:hdc6 [24225.950917] disk 1, wo:0, o:1, dev:hda6 -- Best regards, Jim Klimov mailto:klimov@2ka.mipt.ru - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/